Compare commits

401 Commits

Author SHA1 Message Date
e7d351ceba Release: v4.56.0 2025-08-29 20:21:00 +02:00
1067577ad2 fix gpt-oss out shape (#40535)
* fix out shape

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* reset gpt-oss modeling

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix copies

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-08-29 15:20:33 +00:00
7efb4c87ca Flaky CI is annoying (#40543)
* mark flaky

* and the non batch one
2025-08-29 16:47:44 +02:00
828a27fd32 Fix gpt-oss rope warning (#40550)
* fix

* fix print

* rm

* real fix

* fix

* style
2025-08-29 14:40:33 +00:00
74a24217f5 Add bfloat16 support detection for MPS in is_torch_bf16_gpu_available() (#40458)
* Add bfloat16 support detection for MPS (Apple Silicon) in is_torch_bf16_gpu_available

bfloat16 seems to have been supported for a few years now in Metal and torch.mps.

Make sure to allow it and not throw on bf16 usage with "Your setup doesn't support bf16/gpu." from TrainingArguments.

* Check bf16 support for MPS using torch method

Actually seems method exists: 5859edf113/torch/_dynamo/device_interface.py (L519)

It simply checks if you are on macOS 14 or higher.

* Document Metal emulation for bf16 support

Add note about Metal emulation for bf16 support on M1/M2.

* Update bf16 support check for MPS backend

is_bf16_supported() is not exposed even though it is defined on MPSInterface; use the same approach as in the accelerate PR.

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-08-29 14:37:15 +00:00
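The check described in commit 74a24217f5 (bf16 on MPS gated on macOS 14 or higher) can be sketched roughly as below. This is only an illustration under stated assumptions: the helper names `macos_major` and `mps_bf16_supported` are hypothetical and are not the functions used in transformers, torch, or accelerate.

```python
import platform


def macos_major(version):
    """Return the major macOS version from a string like '14.5', or None."""
    try:
        return int(version.split(".")[0])
    except ValueError:
        # Empty string (non-macOS host) or an unparsable value.
        return None


def mps_bf16_supported(version=None):
    """Hypothetical gate: treat bf16 on MPS as usable on macOS 14 or higher."""
    if version is None:
        # platform.mac_ver()[0] is '' when not running on macOS.
        version = platform.mac_ver()[0]
    major = macos_major(version)
    return major is not None and major >= 14
```

Passing the version string explicitly keeps the function testable on non-macOS machines.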
ffdd10fced Allow compression on meta device (#39039)
* disable gradient calculation for int weights

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

* Update src/transformers/quantizers/quantizer_compressed_tensors.py

Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>

* updated model procession before/after weight loading

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

* fix style

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

* reformat

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

* fix style

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

---------

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
2025-08-29 15:49:15 +02:00
f0e778112f Clean-up kernel loading and dispatch (#40542)
* clean

* clean imports

* fix imports

* oups

* more imports

* more imports

* more

* move it to integrations

* fix

* style

* fix doc
2025-08-29 14:14:38 +02:00
f68eb5f135 Redundant code removal (#40534)
redundant code
2025-08-29 11:30:23 +00:00
d888bd435d Fix typos (#40511)
Signed-off-by: cyy <cyyever@outlook.com>
2025-08-29 11:25:33 +00:00
11a6b95553 Oupsy (#40544)
fix bump!
2025-08-29 12:59:49 +02:00
b07144ac27 tokenizers bump tokenizers version (#40540)
* bump tokenizers version

* use rc0

* ?

* fml

* update
2025-08-29 12:34:41 +02:00
008c0ba8e2 Fix SeamlessM4Tv2ModelWithTextInputTest::test_retain_grad_hidden_states_attentions (#40532)
* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-28 23:30:59 +02:00
89ef1b6e0b Set test_all_params_have_gradient=False for HunYuanMoEV1ModelTest (#40530)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-28 22:32:51 +02:00
2e0f1d6a37 [Qwen Omni/VL] Fix fa tests (#40528)
* fix

* style

* flaky flaky

* flaky flaky

* oopsie, we need the out of place for sure

* flaky flaky

* flaky flaky
2025-08-28 21:07:22 +02:00
68013c505a Improve Gemma3n model and tests (#39764) 2025-08-28 20:25:42 +02:00
ffcb344612 Lazy import torchcodec (#40526)
* lazy import

* parse version

* omg, we need to guard version parse as well
2025-08-28 18:57:14 +02:00
8c7f685079 Fix typo: 'casual' to 'causal' (#40374)
fix typo: 'casual' to 'causal'

Co-authored-by: demo <vamshika0210@gamil.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
2025-08-28 09:17:37 -07:00
d61fab1549 skip some padding_matches_padding_free_with_position_ids for FA2 (#40521)
skip 1

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-28 17:20:07 +02:00
31336ab750 Fix mistral3 tests after "[Kosmos 2.5] Rename checkpoints" (#40523)
* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-28 16:29:54 +02:00
851b8f281d [kernels] If flash attention2 is not installed / fails to import (cc on our cluster) default to kernels (#40178)
* first step if flash not installed but you set to use it

* try importing

* now default to using it

* update our tests as well

* wow yesterday I was not awake

* fixup

* style

* lol the fix was very very simple

* `RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/kernels@main#egg=kernels` for updated dockers

* push review comments

* fix

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-08-28 16:20:25 +02:00
de9e2d7a2e Skip some flex attn tests (#40519)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-28 15:43:38 +02:00
7e1aee4db6 [FA] Remaining Cleanup (#40424)
* fa cleanup

* flaky tests

* readd removed test and changeup comments to reflect the purpose

* flaky tests
2025-08-28 15:01:19 +02:00
893d89e5e6 [omni modality] support composite processor config (#38142)
* dump ugly option to check again tomorrow

* tiny update

* do not save as nested dict yet!

* fix and add tests

* fix dia audio tokenizers

* rename the flag and fix new model Evolla

* fix style

* address comments

* broken from different PR

* fix saving layoutLM

* delete print

* delete!
2025-08-28 14:40:27 +02:00
becab2c601 Use the config for DynamicCache initialization in all modelings (#40420)
* update all

* remove the most horrible old code

* style
2025-08-28 14:32:30 +02:00
8acbbdcadf [serve] fix request_id unexpected (#40501)
* fix request-id in serving

* style

* fix
2025-08-28 14:16:28 +02:00
2300be3b41 sped up gguf tokenizer for nemotron test (#40509)
sped up tokenizer for nemotron test
2025-08-28 12:10:49 +00:00
b2b654afbf correct kes to keys. (#40489)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-08-28 12:00:22 +00:00
476cd7bab1 [vision] Improve keypoint-matching models docs (#40497)
fix options and add inference_mode
2025-08-28 12:31:21 +01:00
1499f9e356 [Kosmos 2.5] Rename checkpoints (#40338) 2025-08-28 13:30:41 +02:00
10ddfb0be5 Add more missing arguments (#40354)
Add missing arguments

Signed-off-by: cyy <cyyever@outlook.com>
2025-08-28 12:21:51 +02:00
d10603f701 Add Apertus (#39381)
* init swissai model

* AutoModelForCausalLM

* AutoModelForCausalLM mapping

* qk norm and post ln optional

* fix wrong shape of qk norm: megatron uses head_dim

* automodel fixes

* minor fix in forward

* fix rope validation to accept llama3 scaling

* `SwissAIForTokenClassification` support

* Align `SwissAI` to v4.52.4

* Align `SwissAI` to v4.53.1

* Init CUDA xIELU

* `SwissAI*`->`Apertus*`

* ci fix

* check_docstring ignore ApertusConfig

* Licensing and placeholder tests

* Placeholder doc

* XIELU syntax

* `_xielu_python` optimization

* Fix xIELU

* [tmp] `{beta,eps}` persistent=False
until {beta,eps} saved in checkpoint

* Modular `Apertus`

* CUDA xIELU logging

* ci fix

* ci fix

* ci fix

* Update license

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>

* Update tests/models/apertus/test_modeling_apertus.py

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>

* `.utils.import_utils.is_torchdynamo_compiling`

* `Apertus` class ordering

* `past_key_value{->s}`, `make fix-copies`

* ci fix

* Remove unused configuration parameters

* `{beta,eps}` saved in checkpoint

* `{beta,eps}` Temporarily on CPU

* Suggestions

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>

* ci fix

* remove fx_compatible (deprecated)

* remove `rotary_embedding_layer`

As the tests are written for a config without default scaling (which is not the case in Apertus) - besides, rope scaling is tested in other models so it's all safe.

* fully removing `Mask4DTestHard` class

Not needed (for now)

* switch to `dtype` instead of `torch_dtype`

Following this:
https://github.com/huggingface/transformers/pull/39782

* remove unused imports

* remove `cache_implementation="static"`

* +Apertus to `docs/source/en/_toctree.yml` for the doc builder

---------

Co-authored-by: Alexander Hagele <alexanderhagele@gmail.com>
Co-authored-by: dhia680 <garbayad@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: Dhia Garbaya <84809366+dhia680@users.noreply.github.com>
2025-08-28 11:55:43 +02:00
f9b9a5e884 Update quantization overview for XPU (#40331)
* update xpu quantization overview

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix aqlm tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* update gguf support

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix gguf tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix xpu gguf precision error

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* replace deprecated models

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix import org

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* update xpu ggml tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* revert wrong change

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix xpu tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* xpu optimum-quanto goes green

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-08-28 09:52:59 +00:00
b824f4986f fix typo (#40484)
* fix typo

Signed-off-by: guochenxu <guochenxu@modelbest.cn>

* csm & qwen omni

Signed-off-by: guochenxu <guochenxu@modelbest.cn>

* format

Signed-off-by: guochenxu <guochenxu@modelbest.cn>

* Apply style fixes

* omni

Signed-off-by: guochenxu <guochenxu@modelbest.cn>

---------

Signed-off-by: guochenxu <guochenxu@modelbest.cn>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-08-28 08:31:25 +00:00
c9ff166718 Various AMD expectations (#40510)
* AMD expectations for qwen2

* Added more detailed expectation to smolvlm

* Added AMD expectations to TableTransformer

* Style
2025-08-28 10:15:21 +02:00
721d4aee81 Include machine type in collated reports filename (#40514) 2025-08-28 09:28:12 +02:00
98289c5546 [modular] Classes can now be defined and referenced in arbitrary order (without bringing unwanted dependencies) (#40507)
* remove future class from dependency graph

* convert all
2025-08-27 23:06:10 +02:00
e3d8fd730e docs(pixtral): Update Pixtral model card to new format (#40442)
* docs(pixtral): Update Pixtral model card to new format

* docs(pixtral): Change cuda into auto for device_map

* docs(pixtral): Apply suggestions from review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* docs(pixtral): Apply suggestions from review, changing mistral-community into Mistral AI

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* docs(pixtral): Apply suggestions from review [!TIP] part

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* docs(pixtral): Finalize model card with tested code examples

This commit finalizes the update for the Pixtral model card.

* Fix the hfoption by the right one

* @BryanBradfo docs(pixtral): Changing the redirection of bitsandbytes

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* docs(pixtral): Add of ` to highlight the tokens

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* docs(pixtral): Move image block per final review

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-27 11:38:51 -07:00
821384d5d4 Fix the CI workflow of merge to main (#40503)
* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-27 18:35:12 +02:00
304225aa15 Collated reports: no need to upload artifact (#40502)
No need to upload collated reports as gh artifact
2025-08-27 18:31:55 +02:00
3c343c6601 [Whisper] Add rocm expected results to certain tests (#40482)
* Add rocm expected results to certain tests

* Specify rocm version in expectations so we know origin. Improved var names

* Update test var names
2025-08-27 16:19:11 +00:00
6350636964 Fix qwen2_moe tests (#40494)
update

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-27 16:22:04 +02:00
52aaa3f500 [EfficientLoFTR] dynamic image size support (#40329)
* fix: reverted efficientloftr embeddings computation to inference time with lru cache

* fix: added dtype and device for torch ones and zeros creation

* fix: fixed embed height and width computation with aggregation

* fix: make style

* fix error message

* fix fa2 tests

---------

Co-authored-by: qubvel <qubvel@gmail.com>
2025-08-27 15:05:08 +01:00
ed5dd2999c [ESM] support attention API (#40370)
* ESM supports attention API

* supports flags

* fix tests

* fix copies

* another fixup needed after fixing tests

* fix tests and make sure Evolla copied everything

* fix

* order

* forgot about "is_causal" for fa2

* cross attention can't be causal
2025-08-27 15:39:04 +02:00
8b804311ba [modular] Remove ambiguity in all calls to parent class methods + fix dependency graph (#40456)
* fix in modular

* remove leftover print

* fix everything except when it's in assignment

* fix assignment as well

* more general

* better

* better

* better comment

* docstring

* cleaner

* remove base

* doc
2025-08-27 14:51:28 +02:00
a3afebbbbe [modular] Use multi-processing + fix model import issue (#40481)
* add mp and simplify a bit

* improve

* fix

* fix imports

* nit
2025-08-27 14:51:12 +02:00
75d6f17de6 Validate GptOssConfig rope config after it's fully initialized (#40474)
* Validate GptOssConfig rope config after it's fully initialized

Fixes #40461

* Remove whitespaces
2025-08-27 10:16:58 +01:00
80f4c0c6a0 CI when PR merged to main (#40451)
* up

* up

* up

* up

* up

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-27 10:56:18 +02:00
ff8b88a948 Fix nightly torch CI (#40469)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-26 22:02:15 +02:00
74ad608a2b Not to shock AMD team by the cancelled workflow run notification ❤️ 💖 (#40467) 2025-08-26 20:53:24 +02:00
c8c7623f20 Update SegFormer model card (#40417)
* Update SegFormer model card

* Update docs/source/en/model_doc/segformer.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/segformer.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/segformer.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/segformer.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/segformer.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/segformer.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/segformer.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update the segformer model card

* Remove quantization example

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-26 08:27:25 -07:00
78f32c3917 [pipeline] Add Keypoint Matching pipeline (#39970)
* feat: keypoint-matcher pipeline

* docs: added keypoint-matcher pipeline in docs

* fix: added missing statements for repo consistency

* docs: updated SuperGlue, LightGlue and EfficientLoFTR docs

* Apply suggestions from code review

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>

* test: fixed run_pipeline_test

* update pipeline typing and docs

* update tests

* update docs snippets

* Fix import error

* fix: pipeline init

* pt framework

---------

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
2025-08-26 15:26:57 +01:00
6451294f6f [RoPE] explicit factor > implicit factor in YaRN (#40320)
explicit factor > implicit factor
2025-08-26 14:58:28 +01:00
5a8ba87ecf [fast_image_processor] fix image normalization for resize (#40436) 2025-08-26 13:49:51 +00:00
0ce6709e70 deci gguf support (#38669)
* deci gguf support

* make style

* tests for deci

* try except removed

* style

* try except removed
2025-08-26 13:43:17 +00:00
263d06fedc Fix extra template loading (#40455)
* Fix extra template loading

* Reformat

* Trigger tests
2025-08-26 14:01:01 +01:00
58cebc848b flash_paged: s_aux may not exist (#40434)
Some implementations (e.g.,
https://huggingface.co/kernels-community/vllm-flash-attn3) support an
`s_aux` arg for attention sinks, but others
(https://huggingface.co/kernels-community/flash-attn) do not. If s_aux
is present in the kwargs, we forward it, otherwise we don't.

The user will still get an error if they use a model like gpt-oss-20b
with an implementation that does not support `s_aux`, but models that
don't use it won't error out. For example, [this is currently
failing](399cd5c04b/examples/pytorch/continuous_batching.py (L16))
because we are sending `s_aux: None` in the dict.
2025-08-26 13:15:59 +02:00
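The forwarding rule described in commit 58cebc848b (pass `s_aux` only when it is present in the kwargs) might look like the following minimal sketch. The names `forward_attention`, `kernel_with_sinks`, and `kernel_without_sinks` are hypothetical stand-ins, not the actual code in `flash_paged` or the kernels referenced above.

```python
def forward_attention(kernel, q, k, v, **kwargs):
    """Call `kernel`, forwarding `s_aux` only if the caller supplied it."""
    # Build the optional arguments explicitly instead of always sending
    # `s_aux=None`, which breaks kernels that do not accept the argument.
    optional = {"s_aux": kwargs["s_aux"]} if "s_aux" in kwargs else {}
    return kernel(q, k, v, **optional)


def kernel_with_sinks(q, k, v, s_aux=None):
    # Hypothetical kernel that supports attention sinks.
    return ("sinks", s_aux)


def kernel_without_sinks(q, k, v):
    # Hypothetical kernel with no `s_aux` parameter.
    return ("plain", None)
```

With this shape, a model that never sets `s_aux` works with either kernel, while a model that does set it still fails with a `TypeError` on a kernel that lacks the parameter, matching the behavior the commit message describes.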
34108a2230 Continuous batching refactor (#40426)
* Rework of the CB example

* Further rework of CB example

* Refactor PA cache, slice on tokens, add debug prints -- WIP

* Slice cache -- WIP

* Added a mechanism to check batched outputs in CB script

* Less logging, debug flag for slice, !better reset! -- WIP

* QOL and safety margins

* Refactor and style

* Better saving of cb example

* Fix

* Fixes and QOL

* More information about metrics

* Further logging

* Style

* Licenses

* Removed some comments

* Add a slice input flag

* Fix in example

* Added back some open-telemetry deps

* Removed some aux function

* Added FA2 option to example script

* Fixed math (all of it)

* Added a simple example

* Renamed core to classes

* Made allocation of attention mask optional

* Style
2025-08-26 13:01:42 +02:00
49e168ff08 🚨 Remove Contrastive Search decoding strategy (#40428)
* delete go brrr

* fix tests

* review
2025-08-26 12:31:46 +02:00
b8184b7ce9 Make cache_config not mandatory (#40316)
* Relaxed assumptions on cache_config

* Review compliance

* Style

* Styyyle

* Removed default and added args

* Rebase mishap fix

* Propagate args to TorchExportableModuleForDecoderOnlyLM

* Fix the test I wanted fixed in this PR

* Added some AMD expectation related to cache tests
2025-08-26 12:06:17 +02:00
32fcc24667 rename get_cuda_warm_up_factor to get_accelerator_warm_up_factor (#40363)
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-08-26 09:56:35 +00:00
f690a2a1e0 [video processors] decode only sampled videos -> less RAM and faster processing (#39600)
* draft update two models for now

* batch update all VLMs first

* update some more image processors

* update

* fix a few tests

* just make CI green for now

* fix copies

* update once more

* update

* unskip the test

* fix these two

* fix torchcodec audio loading

* maybe

* yay, i fixed torchcodec installation and now can actually test it

* fix copies deepseek

* make sure the metadata is returned when users request it

* add docs

* update

* fixup

* Update src/transformers/audio_utils.py

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>

* Update src/transformers/models/glm4v/video_processing_glm4v.py

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>

* update

* what if we set some metadata attr to `None`

* fix CI

* fix one test

* fix 4 channel test

* fix glm timestamps

* rebase gone wrong

* raise warning once

* fixup

* typo

* fix copies

* fix smolvlm test

* this is why torch's official benchmark was faster, set threads to `0`

* Apply style fixes

---------

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-08-26 11:38:02 +02:00
64ae6e6b1d fix qwen25-vl grad acc (#40333)
* fix qwen25-vl grad acc

* fix Qwen2_5_VLForConditionalGeneration for accepts_loss_kwargs

* fix ci

* fix ci

* fix typo

* fix CI
2025-08-26 09:30:06 +00:00
6d2bb1e04d [Trainer] accelerate contextparallel support in trainer (#40205)
* initial context_parallel_size support in trainer

* For context parallelism, use AVG instead of SUM to avoid over-accounting tokens

* use parallelism_config.cp_enabled

* add parallelism_config to trainer state

* warn when auto-enabling FSDP

* fix some reviews

* WIP: somewhat matching loss

* Feat: add back nested_gather

* Feat: cleanup

* Fix: raise on non-sdpa attn

* remove context_parallel_size from TrainingArguments

* if we have parallelism_config, we defer to get_state_dict from accelerate

* fix from review

* Feat: add parallelism config support

* Chore: revert some unwanted formatting changes

* Fix: check None

* Check none 2

* Fix: remove duplicate import

* Update src/transformers/trainer.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Fin

* require accelerate 1.10.1 and higher

---------

Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>
Co-authored-by: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-08-26 09:28:48 +00:00
63caaea1fb Refactor ViT-like models (#39816)
* refactor vit

* fix

* fixup

* turn off FX tests

* AST

* deit

* dinov2

* dinov2_with_registers

* dpt

* depth anything (nit)

* depth pro (nit)

* ijepa

* ijepa (modular)

* prompt_depth_anything (nit)

* vilt (nit)

* zoedepth (nit)

* videomae

* vit_mae

* vit_msn

* vivit

* yolos

* eomt

* vitpose

* update auto backbone

* disable `fx` and export tests (dinov2, dpt, ijepa, vit, vitpose)

* fix kwargs for backbone

* fix

* convnext

* fixup

* update convnext layernorm

* fix-copies layer_norm

* convnextv2

* explicit output_hidden_states for models with backbones

* explicit hidden states collection for dinov2

* tests fixed

* fix DPT as well

* fix dinov2 with registers

* add comment
2025-08-26 11:14:06 +02:00
922e65b3fc Fix non FA2 tests after FA2 installed in CI docker image (#40430)
* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-26 10:36:50 +02:00
e68146fbe7 Fix collated reports model name entry (#40441) 2025-08-25 20:36:01 +00:00
8ce633cc75 InternVL MI325 test expectations (#40387)
* Adjust ROCm expectations

* MI355

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
2025-08-25 22:00:35 +02:00
7637d298b3 Fix collated reports uploading (#40440) 2025-08-25 21:49:59 +02:00
fa59cf9c9f Fix https://github.com/huggingface/transformers/issues/40292 (#40439)
* Fix https://github.com/huggingface/transformers/issues/40292

* Trigger tests

---------

Co-authored-by: Matt <rocketknight1@gmail.com>
2025-08-25 20:12:57 +01:00
f0e87b436d Fix collated reports model directory traversal (#40437)
Fix model dir traversal
2025-08-25 18:01:58 +00:00
ef406902bf Gemma3 text fixes: Add expectations for MI325 (#40384)
* Add expectations for MI325

* Ruff

* Adjust CUDA expectations as well

* Another attempt for CUDA expectations
2025-08-25 19:57:50 +02:00
c81723d31b 🌐 [i18n-KO] Translated models.md to Korean (#39518)
* docs: ko: models.md

* feat: nmt draft

* fix: manual edits

* Resolved _toctree.yaml conflict during merge from main

* Apply suggestions from code review

Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>
Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>

* Apply suggestions from code review

* fix: update toctree

* Update docs/source/ko/_toctree.yml

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com>
Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-25 09:17:08 -07:00
6b5eab70e4 Remove working-dir from collated reports job (#40435) 2025-08-25 18:14:35 +02:00
1763ef2951 [docs] remove last references to transformers TF classes/methods (#40429)
* halfway through tasks

* complete

* Update utils/check_docstrings.py
2025-08-25 16:30:59 +01:00
eac4f00bdf Fix typo and improve GPU kernel check error message in MXFP4 quantization (#40349) (#40408)
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
2025-08-25 15:21:55 +00:00
d8f2edcc46 Add tokenizer_kwargs argument to the text generation pipeline (#40364)
* Add `tokenizer_kwargs` arg to text generation pipeline.

* chore: re-run CI

* Rename `tokenizer_kwargs` to `tokenizer_encode_kwargs` for text generation pipeline

* Fix `tokenizer_encode_kwargs` doc string.

* Fix note related to `tokenizer_kwargs` in text generation pipeline

---------

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
2025-08-25 15:21:19 +00:00
1a35d07f56 Update collated reports working directory and --path (#40433) 2025-08-25 15:18:26 +00:00
399cd5c04b Fix modular for modernbert-decoder (#40431)
* fix the modular

* CI
2025-08-25 16:50:49 +02:00
ea8d9c8f06 🚨 Remove DoLa decoding strategy (#40082)
* remove dola generation strategy

* add fast test
2025-08-25 16:33:27 +02:00
6bf6f8490c [Mxfp4] Add a way to save with a quantization method (#40176)
* add a test

* tempdir

* fix import issue

* wow I am tired

* properly init

* i am not super familiar with quantizer api :|

* set to TRUE for now

* full support

* push current changes

* will clean this later but the imports are a shitshow here

* this correctly saves the block and scales but forward seems broken

* quantize was not correct

* fix storage

* why were bias even included

* finally!

* style

* fix style

* remove print

* lazy import

* up

* not sure what happens this works now?

* holy molly it was not so far

* okay this seems to work!

* workings!!!

* allow save_pretrained to create PR

* Apply suggestions from code review

* fixup

* add dequantize false as well

* working new

* fix

* rm swizzle and unswizzle during saving

* rm print

* Update src/transformers/modeling_utils.py

* fix

* style

---------

Co-authored-by: Marc Sun <marc@huggingface.co>
2025-08-25 16:27:19 +02:00
04c2bae3a8 Fix label smoothing incompatibility with multi-label classification (#40296)
* Fix label smoothing incompatibility with multi-label classification (#40258)

* Improve label smoothing multi-label check based on reviewer feedback

- Move check from LabelSmoother to Trainer.__init__() for better architecture
- Use model.config.problem_type instead of tensor inference for robustness
- Warn and disable smoothing instead of raising error for better UX
- Update test to verify warning behavior
2025-08-25 14:23:31 +00:00
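The behavior described in commit 04c2bae3a8 (check `model.config.problem_type` at init time, then warn and disable smoothing rather than raise) can be sketched as follows. The function `resolve_label_smoothing` is a hypothetical helper written for illustration and is not the code in `Trainer.__init__()`; the config objects are plain stand-ins.

```python
import warnings
from types import SimpleNamespace


def resolve_label_smoothing(config, label_smoothing_factor):
    """Return the smoothing factor to use, disabling it for multi-label tasks."""
    problem_type = getattr(config, "problem_type", None)
    if label_smoothing_factor != 0 and problem_type == "multi_label_classification":
        # Warn and fall back to no smoothing instead of raising an error.
        warnings.warn(
            "Label smoothing is not compatible with multi-label classification; "
            "disabling it.",
            UserWarning,
        )
        return 0.0
    return label_smoothing_factor


# Stand-in config objects carrying only the attribute the check reads.
multi_label = SimpleNamespace(problem_type="multi_label_classification")
single_label = SimpleNamespace(problem_type="single_label_classification")
```

Reading the declared `problem_type` avoids guessing the task from label tensor shapes, which is the robustness argument the commit message makes.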
3b5b9f6518 Fix processing tests (#40379)
* fix tests

* skip failing test in generation as well

* grounding dino was overwritten

* one more overwritten code

* clear comment
2025-08-25 14:50:54 +02:00
a0a37b3250 Gpt oss optim (#40304)
* enable fast index selecting

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* update model

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix gpt-oss tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix check tensor

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-08-25 14:36:33 +02:00
d73181b3fc Fix UnboundLocalError in WER metric computation (#40402)
Renamed wer metric variable to wer_metric to avoid naming conflict
with local variable assignment in compute_metrics function.

Co-authored-by: pranam-gf <pranam@goodfin.com>
2025-08-25 12:02:22 +00:00
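The naming conflict fixed in commit d73181b3fc is a general Python scoping issue: assigning to a name inside a function makes it local for the whole function body, so reading the module-level object of the same name raises `UnboundLocalError`. A minimal sketch, assuming a hypothetical `load_metric` stand-in rather than the real metric loader used in the example script:

```python
def load_metric(name):
    # Hypothetical stand-in for a metric loader; always reports 0.0.
    def compute(predictions, references):
        return 0.0
    return compute


wer = load_metric("wer")


def compute_metrics_broken(preds, refs):
    # `wer` is assigned below, so Python treats it as a local variable here
    # and the call on the right-hand side fails with UnboundLocalError.
    wer = wer(predictions=preds, references=refs)
    return {"wer": wer}


# The fix: give the module-level metric object a distinct name.
wer_metric = load_metric("wer")


def compute_metrics_fixed(preds, refs):
    wer = wer_metric(predictions=preds, references=refs)
    return {"wer": wer}
```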
11e12a715a Fix typo: 'seperator' to 'separator' in variable names (#40389)
Fixed 4 instances of the typo "seperator" → "separator" in variable names:
- 2 instances in src/transformers/models/shieldgemma2/convert_shieldgemma2_weights_orbax_to_hf.py
- 2 instances in src/transformers/models/gemma3/convert_gemma3_weights_orbax_to_hf.py

These typos were in variable names used for parsing path components in weight conversion scripts.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <noreply@anthropic.com>
2025-08-25 11:56:30 +00:00
40299134a8 Fix CI (hunyuan moe does not support fullgraph) (#40423)
fix flag
2025-08-25 12:01:28 +02:00
a2b37bfd58 Fix typo: 'casual' -> 'causal' in code and documentation (#40371) (#40407) 2025-08-25 09:32:15 +00:00
0031c044f8 [docs] flax/jax purge (#40372)
flax/jax purge
2025-08-25 10:25:00 +01:00
14b89fed24 fix to accept cumulative_seqlens from TransformersKwargs in FA (#40194)
* fix the typings that do not match the FA function signature

cumulative_seqlens_q/k -> cu_seq_lens_q/k:
- in the FlashAttentionKwargs in modeling_flash_attention_utils
- in the TransformersKwargs in generic
- in the PagedAttentionArgs in continuous_batching

It is **BC**, because they are created in `ContinuousBatchProcessor.setup_static_tensors:L762`, used in `ContinuousBatchingManager._model_forward:L1233` and destroyed with `ContinuousBatchProcessor`

* format changes by ruff

* Update src/transformers/integrations/flash_paged.py

unused function arg in `PagedAttentionCache.update`

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* revert continuous_batching signature, which is more meaningful

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
2025-08-25 11:00:13 +02:00
ba095d387d 🧹 🧹 🧹 Get set decoder cleanup (#39509)
* simplify common get/set

* remove some noise

* change some 5 years old modeling utils

* update examples

* fix copies

* revert some changes

* fixes, gah

* format

* move to Mixin

* remove smolvlm specific require grad

* skip

* force defaults

* remodularise some stuff

* remodularise more stuff

* add safety for audio models

* style

* have a correct fallback, you daft donkey

* remove this argh

* change heuristic for audio models

* fixup

* revert

* this works

* this should be explicit

* fix Nth ESM exception

* tryout decoder

* this as well

* revert again

* 🧠

* aaah ESM has two modelings aaah

* broom broom

* format

* wrong copies

* copies

* modular cleanups

* format

* modularities

* wrong mergefix

* seriously

* align with new model

* new model
2025-08-25 10:57:56 +02:00
2c55c7fc94 Reactivate a lot of tests skipped for no reason anymore (#40378)
* reactivate all the tests

* some tests still failing
2025-08-25 10:44:43 +02:00
4f9b4e62bc Run FA2 tests in CI (#40397)
up

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-23 12:30:18 +02:00
28ca27cb2b HF papers in doc (#40381)
* HF papers

* clean

* Update src/transformers/models/gemma3n/configuration_gemma3n.py

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* style

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-22 15:07:08 -07:00
7d88f57fc6 Update README_zh-hans.md (#40380)
Fix a typo.
2025-08-22 18:22:26 +00:00
29ddcacea3 Rework the Cache documentation (#40373)
* start working the doc

* remove gemma2

* review
2025-08-22 17:06:28 +02:00
dab66f15a1 Chat Template Doc Fixes (#40173)
* draft commit

* draft commit

* Fixup chat_extras too

* Update conversations.md

* Update the toctree and titles

* Update the writing guide!

* Use @zucchini-nlp's suggestion

* Update docs/source/en/conversations.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/conversations.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/conversations.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-22 15:48:33 +01:00
0a21e870c7 Bug Fix: Dynamically set return_lse flag in FlexAttention (#40352)
* bug fix - return_lse dynamically set

* addressed compatibility with return type - flex_attention_forward

* rename variables

* revert changes to commits
2025-08-22 13:49:26 +00:00
894b2d84b6 Add GptOssForTokenClassification for GPT-OSS models (#40190)
* Add GptOssForTokenClassification for GPT-OSS models

* After run make fixup
2025-08-22 15:14:46 +02:00
56d68c6706 Adding ByteDance Seed Seed-OSS (#40272)
add seed oss
2025-08-22 14:54:28 +02:00
8a6908c10d fix(example): align parameter names with the latest function definition for gdino (#40369) 2025-08-22 12:27:58 +00:00
7db228a92a [configuration] allow to overwrite kwargs from subconfigs (#40241)
allow to overwrite kwargs from subconfigs
2025-08-22 13:31:25 +02:00
19ffe0219d [processor] move commonalities to mixin (#40339)
* move commonalities to mixin

* revert - unrelated

* fix copies

* fix style

* comments
2025-08-22 13:04:43 +02:00
d8f6d3790a ⚠️⚠️ Use dtype instead of torch_dtype everywhere! (#39782)
* update everywhere

* style

* pipelines

* switch it everywhere in tests

* switch it everywhere in docs

* switch in converters everywhere

* update in examples

* update in model docstrings

* style

* warnings

* style

* Update configuration_utils.py

* fix

* Update configuration_utils.py

* fixes and add first test

* add pipeline tests

* Update test_pipelines_common.py

* add config test

* Update test_modeling_common.py

* add new ones

* post rebase

* add new

* post rebase adds
2025-08-22 12:34:16 +02:00
9c25820978 [pipelines] add support to skip_special_tokens in the main text generation pipelines (#40356)
* add support to skip_special_tokens in pipelines

* add test

* rm redundant
2025-08-22 10:12:46 +00:00
5c40e7a225 Change multimodal data links to HF hub (#40309)
change multimodal data links to HF hub
2025-08-22 11:50:04 +02:00
e018b77c89 wav2vec2 fixes (#40341)
* Changed datasets to avoid a datasets error

* Changed back split to test
2025-08-22 11:32:29 +02:00
d7fe3111ff Fix idefics3 vision embeddings indices dtype (#40360)
fix idefics3 vision embeddings

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-22 11:10:45 +02:00
cf487cdf1f HunYuan opensource (#39606)
* merge opensource_hunyuan

* add head_dim

* fix assertion error

* fix seen_tokens

* ready_for_upstream (merge request !17)

Squash merge branch 'ready_for_upstream' into 'main'

* fix configuration type&docstring
* fix style

* ready_for_upstream (merge request !18)

Squash merge branch 'ready_for_upstream' into 'main'
* add doc
* fix testcode
* fix configuration type&docstring

* rename base model

* remove assert

* update

* remove tiktoken

* update

* fix moe and code style (#3)

* update

* fix format

* update

* revert makefile

* fix moe config

* fix numel()

* remove prepare_inputs_for_generation

* fix kv_seq_len

* add docs/toctree

* remove unused parameter & add licence

* add licence

* remove unused parameter

* fix code

* dense modular

update import

fix

fix

use mistralmodel

fix qknorm

add sliding_window

make style

fix

dense done

hunyuan moe

fix import

fix modular

fixup

fixup

* update model path

* fix mlp_bias

* fix modular

* Fix modeling (#5)

* fix attention

* use llamamodel

* fix code

* Fix qk (#6)

* fix qk_norm

* fix

* fix modular

* Fix moe (#7)

* fix some moe code

* fix einsum

* try top1

* use top1

* Fix rotary (#8)

* fix rotary

* fix modeling

* fix modular

* fix testcode

* remove A13B unit test

* Fix moe v1 (#9)

fix moe & gate

* Fix gate norm (#10)

* add norm_topk_prob

* Fix testcase (#11)

* fix&skip test

* Fix testcase (#12)


* skip testcase

* Fix norm topk (#13)

* hardcode norm_topk_prob

* fix testcase

---------

Co-authored-by: pridejcyang <pridejcyang@tencent.com>
Co-authored-by: Mingji Han <mingjihan@tencent.com>
2025-08-22 07:59:58 +00:00
8365f70e92 DOCS: Clarification on the use of label_names as an argument to TrainingArguments (#40353)
* Update trainer.md

* Update trainer.md

Removed the detail about label_names argument usage from the tip/ warning section

* Update training_args.py

Added the label_names usage clarification in the docstring

* Update trainer.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-21 17:19:04 -07:00
7c1169e21f [4/N]more docs to device agnostic (#40355)
* more docs to device agnostic

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* more

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* 1

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* 2

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* Update vitpose.md

* Update camembert.md

* Update camembert.md

---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
2025-08-21 10:22:26 -07:00
9568b506ed [generate] handle support for cache classes when num enc layers != num dec layers (#40277)
* handle support for cache classes when num enc layers != num dec layers

* handle overwrites

* one more corner case

* Update src/transformers/generation/utils.py

* Update src/transformers/generation/utils.py

* Apply suggestions from code review

* handle corner case :o
2025-08-21 17:35:18 +01:00
7f38068ae0 Qwen2.5-VL test fixes for ROCm (#40308) 2025-08-21 18:13:07 +02:00
cb1df4d26a [FA] Fix some model tests (#40350)
* fix

* cleanup, revert aimv2 fa changes

* fix aria

* i searched a long time but the cross dependency is for the recent models so...

* this was something... evolla

* fix modernbert decoder + make fa test more robust

* nit
2025-08-21 18:08:21 +02:00
f46f29dd7c Remove more PyTorch 2.2 compatible code (#40337)
Signed-off-by: cyy <cyyever@outlook.com>
2025-08-21 15:19:53 +00:00
128f42d370 [detection] use consistent dtype for Conditional and DAB DETR positional embeddings (#40300)
fix: use consistent dtype for sine positional embeddings
2025-08-21 15:49:56 +01:00
2121d09239 [serve] add cors warnings (#40112)
* add cors warnings

* Update src/transformers/commands/serving.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update src/transformers/commands/serving.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Apply suggestions from code review

* make fixup

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-08-21 14:32:36 +01:00
b40b834ab1 Clean up XCodec and other codecs (#40348)
* Clean up xcodec addition.

* Clean up config.

* Switch to fixtures test.

* Small stuff.

* Polish XCodec and standardize across codecs.

* Update src/transformers/models/xcodec/modeling_xcodec.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* Format and fix test.

* Update tol.

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
2025-08-21 15:32:00 +02:00
75aa7c7252 [ModernBert] Prevent the attention mask from being None in ModernBertForSequenceClassification (#35991)
* [ModernBert] Prevent the attention mask from being None in ModernBertForSequenceClassification

* fix the modular conversion
2025-08-21 15:16:03 +02:00
04b751f07d Fix attention visualizer (#40285)
* make visualizer rely on create causal mask

* format

* fixup

* fixup

* read token

* read token, duh

* what is up with that token

* small tests?

* adjust

* try with flush

* normalize for ANSI

* buffer shenanigans
2025-08-21 13:13:35 +00:00
1e1db12304 (small) fix conditional for input_ids and input_embeds in marian (#40045)
* (small) fix conditional for input_ids and input_embeds in marian

* address comment
2025-08-21 15:13:14 +02:00
7f2f53424e Update test_spm_converter_bytefallback_warning (#40284)
fff

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-21 14:09:28 +02:00
11a49dd9e3 T5 test and target device fixes (#40313)
* Fix cache setup related issues

* Fix target-device-related issues

* Ruff

* Address review comments
2025-08-21 14:07:29 +02:00
c4513a9fe6 Fix links in Glm4vMoe configuration classes to point to the correct H… (#40310)
* Fix links in Glm4vMoe configuration classes to point to the correct Hugging Face model repository

* run fixup to update links in Glm4vMoe configuration classes to point to the correct Hugging Face model repository
2025-08-21 11:42:53 +00:00
c7e6f9a485 Fix an infinite loop bug in recursive search of relative imports (#40326)
Fix bug in recursive search of relative imports
2025-08-21 11:39:43 +00:00
e95441bdb5 add type hints (#40319)
* add basic type hints to import module

* run make fixup

* remove optional

* fixes

---------

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
2025-08-21 12:19:59 +01:00
5c88d8fbcc Fix: Only call Trainer.align_special_tokens if model has "config" attribute (#40322)
* Only call Trainer.align_special_tokens if model has "config" attribute

* Add efficient test for training a model without model.config

* Reformat
2025-08-21 12:06:42 +01:00
c031f6f994 [docs] remove TF references from /en/model_doc (#40344)
* models up to F

* models up to M

* all models
2025-08-21 11:53:21 +01:00
7b060e5eb7 Add missing arguments to class constructors (#40068)
* Add missing arguments

Signed-off-by: cyy <cyyever@outlook.com>

* Fix typos

Signed-off-by: cyy <cyyever@outlook.com>

* More fixes

Signed-off-by: cyy <cyyever@outlook.com>

---------

Signed-off-by: cyy <cyyever@outlook.com>
2025-08-21 10:22:38 +00:00
6ad7f29461 Fix deprecation warning version (#40343)
fix
2025-08-21 12:18:23 +02:00
adf84aec21 Add DeepseekV3ForSequenceClassification for Deepseek V3 models (#40200)
* Add Sequence Classification Support for Deepseek v3 model DeepseekV3ForSequenceClassification

* After run make fixup
2025-08-21 12:01:33 +02:00
1e2e28f3c8 Change Qwen2RMSNorm to RMSNorm from PyTorch (#40066)
* Unify Qwen2RMSNorm definitions and use RMSNorm from PyTorch

Signed-off-by: cyy <cyyever@outlook.com>

* subclass RMSNorm

Signed-off-by: cyy <cyyever@outlook.com>

---------

Signed-off-by: cyy <cyyever@outlook.com>
2025-08-21 11:58:35 +02:00
022af24fcc Fix qwen-omni processor text only mode (#40336)
* Fix qwen-omni processor text only mode

* remove try except

---------

Co-authored-by: yuekaiz <yuekaiz@mgmt1-login.cm.cluster>
2025-08-21 11:57:32 +02:00
c99ed492c7 [docs] remove flax references from /en/model_doc (#40311)
* 1st commit

* all models up to D

* all models up to G

* all models up to M

* all remaining models
2025-08-21 10:52:54 +01:00
c2e3cc24e0 Fix chunked attention mask with left-padding (#40324)
* add fix

* add test

* raise proper warning for older versions

* fix

* fix and add 2nd test

* fix for flex and torch 2.5
2025-08-21 10:52:49 +02:00
242bb2cafc One cache class to rule them all (#40276)
* remove all classes

* fix generate

* start replacing everywhere

* finish removing everywhere

* typo

* typo

* fix

* typo

* remove num_layers=1

* CI

* fix all docstrings

* review

* style
2025-08-20 19:36:11 +02:00
1054494dd6 Update notification service amd_daily_ci_workflows definition (#40314) 2025-08-20 17:49:46 +02:00
139cd91713 Fix: Apply get_placeholder_mask in Ovis2 (#40280)
* Refactor special image mask

* Refactor get_placeholder_mask method

* Revert "Refactor special image mask"

This reverts commit 9eb1828ae930329656d6f323a510c5e6033e1f85.

* Fix

* Revert "Refactor get_placeholder_mask method"

This reverts commit 07aad6484bb08d6351d5b605e9db574d28edcd15.
2025-08-20 17:12:10 +02:00
5d906740d2 Update CI with nightly torch workflow file (#40306)
* fix nightly ci

* Apply suggestions from code review

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
2025-08-20 16:59:00 +02:00
4977ec2ae8 [GPT OSS] Refactor the tests as it was not properly checking the outputs (#40288)
* it was long due!

* use the official kernel

* more permissive

* update the kernel as well

* mmm should it be this?

* up pu

* fixup

* Update test_modeling_gpt_oss.py

* style

* start with 20b
2025-08-20 16:47:41 +02:00
3b7230124b No more natten (#40287)
get rid of natten

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-20 16:10:15 +02:00
2df0c323cb byebye torch 2.1 (#40317)
* Bump minimum torch version to 2.2

* Remove is_torch_greater_or_equal_than_2_2

* update versions table

* Deprecate is_torch_sdpa_available (except for backward compat), remove require_torch_sdpa
2025-08-20 15:03:46 +01:00
c50f140be2 Add back _tp_plan attribute (#39944)
* Update modeling_utils.py

* make sure we update with the module's plan

* use public api

* oups

* update

* fix failing test

* Update src/transformers/integrations/tensor_parallel.py

* Update src/transformers/integrations/tensor_parallel.py

* fix

* make the API more friendly!

* fix tests

* fix styling

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-08-20 15:29:55 +02:00
a97213d131 Qwen2.5-Omni test fixes (#40307)
Updated expectations, and mp tests
2025-08-20 14:48:30 +02:00
ca543f822f Add support for Florence-2 (#38188)
* init

* add modular

* fixup

* update configuration

* add processing file

* update auto files

* update

* update modular

* green setup_and_quality ci

* it works

* fix some tests

* commit florence2

* update test

* make test cases done - 16 left

* style

* fix few test cases

* fix some tests

* fix init test

* update florence2 vision style

* hope is green

* fix init test

* fix init

* update modular

* refactor vision module

* fix: channel attention use dynamic scale

* update modular

* update

* update attention mask

* update

* fix naming

* Update src/transformers/models/florence2/processing_florence2.py

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* spatial block works

* more beautiful

* more more beautiful

* merge main

* merge main and fixup

* fix typing hint

* update modeling

* fix eager matches sdpa

* fix style

* fix compile test - all green

* remove florence2 language

* remove Florence2LanguageModel things

* fix style

* update florence2 model

* override prepare encoder_decoder for generation

* add weight conversion script

* rewrite channel attention to use sdpa

* eliminate 1 transpose op

* support fa2

* fix quality check

* chore: reformat `test_modeling_florence2.py`

* some refactor for processor

* some refactor for processor

* update naming convention and remove BC

* make it pass the test

* fix: correct Embedding Cosine

* update comments and docstring

* support input_embeds

* support input embeds ideally

* fix style

* fix style

* fix style again :D

* add test processor

* refactor processor and add test for processor

* reformat test processor

* make fixup

* fix schema check

* remove image_token

* ensure image token in tokenizer and fix integration tests

* fix processor test

* add more integration tests for large model and rename test_processor to test_processing

* test_assisted_decoding_sample should pass

* update doc and make model work with image text to text pipeline

* docs: add sdpa badge

* resolve cyril's comments

* fix import torch error

* add helper get_placeholder_mask

* inherit from llava

* florence2 may not _supports_attention_backend because of bart ...

* move florence2 model card to multimodal

* let base model always return_dict

* fix style

* tiny update doc

* set   _checkpoint_conversion_mapping = {}

* fix code quality

* support flex and compile graph and move external func to internal func

* remove condition because it is always true

* remove window funcs

* move post processor config out

* fix ci

* new intro to trigger test

* remove `kernel_size` argument

---------

Co-authored-by: ducviet00-h2 <viet.d.hoang@h2corporation.jp>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
2025-08-20 14:28:06 +02:00
959239debc Remove unnecessary contiguous calls for modern torch (#40315) 2025-08-20 12:24:14 +00:00
7d2aa5d6e6 🚨 [Flash Attention] Fix sliding window size (#40163)
* swa fix

* add comment, make fix symmetrical

* modify fa inference test to force swa correctness check

* fixup comment
2025-08-20 14:23:14 +02:00
3128db6927 chore: fix typo in find_executable_batch_size to match new 0.9 ratio (#40206) 2025-08-20 12:18:06 +00:00
ca0aaa8c74 [fix] Pass adamw optimizer parameters to StableAdamW (#40184)
* fix: pass adamw optimizer parameters to StableAdamW

* add test for stable_adamw initialization with trainer arguments

* address copilot suggestion

* fix: update weight_decay handling in stable_adamw kwargs

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-08-20 11:52:23 +00:00
a01f38b364 Fix GOT-OCR2 and Cohere2Vision image processor patches calculation (#40312)
fix got-ocr patches calculation

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-20 13:13:58 +02:00
a5f0b505a0 Remove OTel SDK dependencies (#40305) 2025-08-20 12:31:44 +02:00
d0f1a6ec36 Clean up X-Codec. (#40271)
* Clean up xcodec addition.

* Clean up config.

* Switch to fixtures test.

* Small stuff.
2025-08-20 12:16:28 +02:00
da9452a592 [docs] delete more TF/Flax docs (#40289)
* delete some TF docs

* update documentation checks to ignore tf/flax

* a few more removals

* nit

* Update utils/check_repo.py

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

---------

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
2025-08-20 10:44:14 +01:00
a4e1fee44d [FA] Fix dtype in varlen with position ids (#40295)
fix
2025-08-20 11:15:55 +02:00
126bc03b4e Allow to be able to run torch.compile tests with fullgraph=True (#40164)
* fix

* address comment

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-20 10:42:33 +02:00
1d46091737 Add MetaCLIP 2 (#39826)
* First draft

* Make fixup

* Use eos_token_id

* Improve tests

* Update clip

* Make fixup

* Fix processor tests

* Add conversion script

* Update docs

* Update tokenization_auto

* Make fixup

* Use check_model_inputs

* Rename to lowercase

* Undo CLIP changes

* Address comment

* Convert all checkpoints

* Update auto files

* Rename checkpoints
2025-08-20 09:25:43 +02:00
0f9c9088d0 [3/3] make docs device agnostic, all en docs for existing models done (#40298)
docs to device agnostic cont.

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-08-19 21:01:27 -07:00
eaa48c81e9 make model docs device agnostic (2) (#40256)
* doc cont.

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* more models

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update mixtral.md

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-19 13:10:03 -07:00
42fe769928 SmolVLM test fixes (#40275)
* Fix SmolVLM tests

* Add the proper CUDA expectations as well

* Split A10 and A100 expectations

* Ruff

---------

Co-authored-by: Akos Hadnagy <akoshuggingface@mi325x8-123.atl1.do.cpe.ice.amd.com>
2025-08-19 21:22:06 +02:00
4c017465bd Adjust ROCm test output expectations (#40279)
Adjust ROCm output expectations
2025-08-19 21:21:45 +02:00
0f9ce43687 Standardize BertGeneration model card (#40250)
* Standardize BertGeneration model card: new format, usage examples, quantization

* Update docs/source/en/model_doc/bert-generation.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bert-generation.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bert-generation.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bert-generation.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bert-generation.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bert-generation.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/bert-generation.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply reviewer feedback: update code examples

* Add missing code example

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-19 11:22:13 -07:00
6ceb13fb22 SmolVLM and InternVL: Ensure pixel values are converted to the correct dtype for fp16/bf16 (#40121)
* Ensure pixel values are converted to the correct dtype for fp16/bf16

* add to modular
2025-08-19 10:39:08 -07:00
92f40da608 Update model card for gpt neox japanese (#39862)
* Update GPT-NeoX-Japanese model card

* Apply suggestions from code review

* Update gpt_neox_japanese.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-19 09:18:46 -07:00
3a4b2756cf docs: Update TrOCR model card to new format (#40240)
* docs: Update TrOCR model card to new format

* Updated Suggestions
2025-08-19 09:17:45 -07:00
46d38546f3 Standardize RAG model card (#40222)
* Standardize RAG model card

Update rag.md to follow the new Hugging Face model card template:
- Added friendly overview in plain language
- Added pipeline and AutoModel usage examples
- Included quantization example with BitsAndBytesConfig
- Added notes and resources sections
- Removed abstract and FlashAttention badge

* Standardize RAG model card

Update rag.md to follow the new Hugging Face model card template:
- Added friendly overview in plain language
- Added AutoModel usage example
- Included quantization example with BitsAndBytesConfig
2025-08-19 09:16:10 -07:00
bd96e1e1cc docs(layoutlm): add missing id=usage to <hfoptions> tag in LayoutLM model card (#40273)
docs(layoutlm): add missing 'id=usage' to <hfoptions> tag in LayoutLM model card
2025-08-19 09:14:43 -07:00
8636b309e6 Fix chat CLI GPU loading and request_id validation issues (#40230) (#40232)
* Fix chat CLI GPU loading and request_id validation issues (#40230)

This commit addresses two critical bugs in the transformers chat CLI:

1. **GPU Loading Issue**: Changed default device from "cpu" to "auto" in ChatArguments
   - Chat CLI now automatically uses GPU when available instead of defaulting to CPU
   - Matches the behavior of the underlying serving infrastructure

2. **Request ID Validation Error**: Added request_id field to TransformersCompletionCreateParamsStreaming schema
   - Fixes "Unexpected keys in the request: {'request_id'}" error on second message
   - Allows request_id to be properly sent and validated by the server

Both fixes target the exact root causes identified in issue #40230:
- Users will now get GPU acceleration by default when available
- Chat sessions will no longer break after the second message

* Remove unrelated request_id field from TransformersCompletionCreateParamsStreaming
2025-08-19 15:33:44 +00:00
bebeccb06a fix which routing method (#40283) 2025-08-19 16:35:13 +02:00
249d7c6929 Update image_processing_perception_lm_fast.py to allow for proper override of vision_input_type (#40252)
* Update image_processing_perception_lm_fast.py

Allow for a proper override of vision_input_type in hf fast image processor, otherwise we need to resort to manually setting the attribute.

* Update processing_perception_lm.py to match kwargs vision input type

* Update image_processing_perception_lm_fast.py kwargs to signature args
2025-08-19 11:41:27 +00:00
57bb6db6ee Skipping pytree registration in case fsdp is enabled (#40075)
* Skipping pytree registration in case fsdp is enabled

* Beauty changes

* Beauty changes

* Moved the is_fsdp_available function to import utils

* Moved is_fsdp_available to integrations.fsdp

* Skipping pytree registration in case fsdp is enabled

* Beauty changes

* Beauty changes

* Moved the is_fsdp_available function to import utils

* Moved is_fsdp_available to integrations.fsdp

* Added pytree registration inside dynamic cache class

* Making ci/cd lords happy

* Adding a check if DynamicCache is already a leaf

* Adding try/catch for multiple initializations of DynamicCache in test suites

* Moving dynamic cache pytree registration to executorch

* Adding try catch back
2025-08-19 11:58:05 +02:00
5b3b7ea472 Add Kosmos-2.5 (#31711)
Add Microsoft Kosmos-2.5

---------

Co-authored-by: kirp@umich.edu <tic-top>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-19 11:56:03 +02:00
c93594e286 [detection] fix correct k_proj weight and bias slicing in D-FINE (#40257)
Fix: correct k_proj weight and bias conversion in D-FINE
2025-08-19 09:44:37 +00:00
2f1a8ad4ba Fix setting attention for multimodal models (#39984)
* fix

* use non-explicit `None`

* keep previously set attn if exists
2025-08-19 11:35:11 +02:00
a2e76b908b 🚨🚨 Switch default compilation to fullgraph=False (#40137)
* switch default

* docstring

* docstring

* rework tests and remove outdated restrictions

* simplify

* we need a check for static cache

* fix

* rename var

* fix

* revert

* style

* rename test
2025-08-19 11:26:22 +02:00
2b59207a72 Fix slow static cache export tests (#40261) 2025-08-19 11:24:07 +02:00
56c44213b3 [detection] fix attention mask for RT-DETR-based models (#40269)
* Fix get_contrastive_denoising_training_group attention

* Add bool attention_mask conversion
2025-08-19 09:15:56 +00:00
5d9a715e30 set inputs_embeds to None while generate to avoid audio encoder forward in generation process (#40248)
* set inputs_embeds to None while generate to avoid audio encoder forward in generation process

* set input_features to none instead

---------

Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com>
2025-08-19 08:45:57 +00:00
28746cdc7b Remove MI300 CI (#40270)
Remove MI300 CI (in history if we need it back)
2025-08-19 08:23:39 +00:00
debc92e60a Skip broken tests (#40157)
skip these tests
2025-08-19 10:04:08 +02:00
6b5bd11723 docs: Update OLMo model card (#40233)
* Updated OLMo model card

* Update OLMo description

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Fix typo

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Fix cli typo

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Fix cli example

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Add bitsandbytes info

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-18 13:35:39 -07:00
e472efb9ac Fix benchmark workflow (#40254)
Correct init_db.sql path

Co-authored-by: Akos Hadnagy <akoshuggingface@mi325x8-123.atl1.do.cpe.ice.amd.com>
2025-08-18 18:14:16 +00:00
59862209ca Correct typo and update notes in docs Readme (#40234)
* Correct typo and update notes in docs readme

* Update docs/README.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/README.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-18 10:31:12 -07:00
a7eabf1dde Model card for NLLB (#40074)
* initializing branch and draft PR

* updated model card .md file

* minor

* minor

* Update docs/source/en/model_doc/nllb.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

suggestion

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

suggestion

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

suggestion

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

suggestion

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

suggestion

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* resolving comments + adding visuals

* Update docs/source/en/model_doc/nllb.md

suggestion

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

suggestion

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

suggestion

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

suggestion

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/nllb.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* NllbTokenizerFast and NllbTokenizer added

* endline

* minor

* Update nllb.md

---------

Co-authored-by: Sahil Kabir <sahilkabir@Sahils-MacBook-Pro.local>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-18 10:05:59 -07:00
01c03bf4ee fix: Catch correct ConnectionError for additional_chat_templates (#39874)
* fix: Catch correct ConnectionError for additional_chat_templates

* fix: don't catch timeout

* fix: formatting
2025-08-18 17:25:47 +01:00
2bcf9f6c7e Fixes for EncoderDecoderCache (#40008)
* Add expectation to t5 for rocm 9.4

* Made EncoderDecoderCache compatible with nn.DataParallel

* Fixed t5gemma EncoderDecoderCache

* Added todos in autoformer

* Ruff

* Init is self-contained

* Review compliance

* Fixed kwargs init of EncoderDecoderCache
2025-08-18 17:51:05 +02:00
aa45824919 [CI] Fix repo consistency (#40249)
* fix

* doc

---------

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-08-18 17:32:17 +02:00
d6fad86d23 [serve] guard imports (#39825)
guard imports
2025-08-18 16:28:10 +01:00
7a0ba0d7d8 [typing] fix type annotation error in DepthPro model image processor (#40238)
* fix type annotation error in DepthPro model image processor

* fix

* run make fix-copies
2025-08-18 15:42:13 +01:00
00b4dfb786 Add chat_template (jinja2) as an extra dependency (#40128)
* add jinja2 as a dependency

* Make jinja2 a core dependency in install_requires

- Add jinja2 to install_requires list in setup.py for automatic installation
- Add jinja2 to runtime version checks in dependency_versions_check.py
- Resolves issue where pip install transformers doesn't install jinja2

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Make jinja2 a core dependency in install_requires

* Make jinja2 an extra dependency instead of adding a core dep

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-08-18 14:31:40 +00:00
f417a1aad4 remove transpose_for_scores call in ESM-2 (#40210)
* remove transpose_for_scores call

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

* fix copied evolla code

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

---------

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
2025-08-18 14:28:59 +00:00
a36d51e801 🚨 Always return Cache objects in modelings (to align with generate) (#39765)
* watch the world burn

* fix models, pipelines

* make the error a warning

* remove kwargs and return_legacy_cache

* fix reformer
2025-08-18 16:26:35 +02:00
57e230cdb2 Fix more pylint warnings (#40204)
Fix pylint warnings

Signed-off-by: cyy <cyyever@outlook.com>
2025-08-18 14:17:16 +00:00
47938f8f8d Add Ovis2 model and processor implementation (#37088)
* Add Ovis2 model and processor implementation

* Apply style fixes

* Add unit tests for Ovis2 image processing and processor

* Refactor image processing functions for clarity and efficiency

* Add Ovis2 ImageProcessorFast

* Refactor Ovis2 code

* Refactor Ovis2 model components and update processor functionality

* Fix repo consistency issues for Ovis2: docstring, config cleanup

* Update Ovis2 model integration tests

* Update Ovis2 configuration and processing classes for improved documentation

* Remove duplicate entry for 'ovis2' in VLM_CLASS_NAMES

* Fix conflict

* Fix import order

* Update image processor class names

* Update Ovis2 model structure

* Refactor Ovis2 configuration

* Fix typos

* Refactor Ovis2 model classes and remove unused code

* Fix typos

* Refactor Ovis2 model initialization

* Fix typos

* Remove Ovis2 model mapping from MODEL_MAPPING_NAMES in modeling_auto.py

* Add license and update type hints

* Refactor token function and update docstring handling

* Add license

* Add Ovis2 model support and update documentation

* Refactor Ovis2 model structure and enhance multimodal capabilities

* Update Ovis2 weight mapping for consistency and clarity in key patterns

* Remove unused 'grids' parameter from Ovis2 model and Update processing logic to handle image grids more efficiently.

* Refactor Ovis2 model test structure to include Ovis2Model

* Add optional disable_grouping param to Ovis2ImageProcessorFast

* Refactor type hints in Ovis2 modules

* Add licensing information in Ovis2 modules and tests

* Refactor Ovis2 model by removing unused methods

* Refactor Ovis2 model tests by renaming test classes and removing skipped tests

* Refactor Ovis2 model output classes

* Refactor Ovis2 weight conversion and Update model embedding classes

* Refactor Ovis2 model imports and remove unused functions

* Enhance vision configuration extraction in Ovis2 weight conversion

* Refactor Ovis2 model's forward method to remove interpolation option

* Update Ovis2 model documentation

* Refactor Ovis2 model input handling and tokenizer configuration

* Update return type hints in Ovis2 model

* Remove commented-out code

* fix config for tests and remove key mappings

* Update tokenizer configuration to use add_special_tokens method

* skip torchscript

* Fix image placeholder generation in Ovis2Processor

* Refactor Ovis2 model to rename visual_table to visual_embeddings_table

* Enhance Ovis2 model by adding vision_feature_select_strategy parameter

* Refactor Ovis2 model weights conversion and architecture

* Refactor Ovis2 model by removing vision_feature_select_strategy parameter

* Update Ovis2 model examples

* Refactor Ovis2 model

* Update Ovis2 model

* Update Ovis2 model configuration

* Refactor Ovis2 model test setup

* Refactor flash attention support

* Refactor

* Fix typo

* Refactor

* Refactor model classes

* Update expected output in Ovis2

* Refactor docstrings

* Fix

* Fix

* Fix

* Update input in tests

* Fix

* Fix get_decoder method

* Refactor

* Refactor Ovis2

* Fix

* Fix

* Fix test

* Add get_placeholder_mask

* Refactor Ovis2 model tests

* Fix

* Refactor

* Fix

* Fix

* Fix Ovis2 test

---------

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-08-18 16:05:49 +02:00
2fe43376cd AMD scheduled CI ref env file (#40243)
* Reference env-file to be used in docker running the CI

* Disable MI300 CI for now
2025-08-18 15:23:27 +02:00
e4bd2c858d Fix ESM token_dropout crash when using inputs_embeds instead of input_ids (#40181)
* fix: Error after calling ESM model with input embeddings not input ids

* propagate changes to other models
2025-08-18 13:22:10 +00:00
6333eb986a Fix more typos (#40212)
Signed-off-by: cyy <cyyever@outlook.com>
2025-08-18 12:52:12 +00:00
e5886f9194 [SAM 2] Change checkpoints in docs and tests (#40213)
* change checkpoints in docs and tests

* add notebook
2025-08-18 11:21:34 +02:00
eb2f9da096 fix error vocab_size at Qwen2_5_VLForConditionalGeneration loss_function (#40130)
* fix error vocab_size at Qwen2_5_VLForConditionalGeneration loss_function

Signed-off-by: luoxiaoc <xiaochuan.luo@intel.com>

* fix similar error at qwen2_vl and do make fix-copies

Signed-off-by: luoxiaoc <xiaochuan.luo@intel.com>

* pass in kwargs for loss_func at qwen2_vl and qwen2_5_vl

Signed-off-by: luoxiaoc <xiaochuan.luo@intel.com>

* Apply style fixes

---------

Signed-off-by: luoxiaoc <xiaochuan.luo@intel.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-08-18 08:59:25 +00:00
6ce8f05375 Use correct model_input_names for PixtralImageProcessor (#40226)
add image_sizes to model_input_names
2025-08-18 08:06:52 +00:00
2914ceca20 Revert "Pin torch to 2.7.1 on CircleCI for now" + Final fix for too long with no output (#40201)
* Revert "Pin torch to 2.7.1 on CircleCI for now (#40174)"

This reverts commit 31b6e6e1dac0d32f74ec5cd6b3c1868534ccd7b5.

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-18 08:40:53 +02:00
cd22550692 docs: Update LayoutLM model card according to new standardized format (#40129)
* docs: Update LayoutLM model card with standardized format

* Apply suggestions from code review

This commit incorporates all suggestions provided in the recent review. Further changes will be committed separately to address remaining comments.

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Address remaining review comments

* Address few more review comments:
1. remove transformer-cli section
2. put resources after notes
3. change API refs to 2nd level header

* Update layoutlm.md

* Update layoutlm.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-15 09:33:47 -07:00
05000aefe1 Fix GPT-OSS swiglu_limit not passed in for MXFP4 (#40197)
Add swiglu_limit = 7.0
2025-08-15 17:04:25 +02:00
3f4c85fef0 Add X-Codec model (#38248)
* add working x-codec

* nit

* fix styling + copies

* fix docstring

* fix docstring and config attribute

* Update args + config

* update conversion script

* update docs + cleanup

* Ruff fix

* fix docstrings
2025-08-15 16:24:12 +02:00
29e4e35927 Benchmarking improvements (#39768)
* Start revamping benchmarking

* Start refactoring benchmarking

* Use Pandas for CSV

* import fix

* Remove benchmark files

* Remove sample data

* Address review comments
2025-08-15 15:59:11 +02:00
de437d0d7a Update: add type hints to check_tokenizers.py (#40094)
* Update check_tokenizers.py

chore(typing): add type hints to check_tokenizers script

- Annotate params/returns for helper functions
- Keep tokenizer instances as `Any` to avoid runtime coupling
- Make `check_LTR_mark` return `bool` explicitly (no behavior change)

* Update check_tokenizers.py

chore(typing): replace Any with PreTrainedTokenizerBase in check_tokenizers.py

- Use transformers.tokenization_utils_base.PreTrainedTokenizerBase for `slow` and `fast` params
- Covers both PreTrainedTokenizer and PreTrainedTokenizerFast
- Exposes required methods (encode, decode, encode_plus, tokenize)
- Removes generic Any typing while staying implementation-agnostic
2025-08-15 12:41:28 +00:00
28a03fb78a Fix various Pylint warnings (#40107)
Tidy code

Signed-off-by: cyy <cyyever@outlook.com>
2025-08-15 12:40:12 +00:00
ec85d2c44f Avoid CUDA stream sync (#40060)
Signed-off-by: cyy <cyyever@outlook.com>
2025-08-15 12:37:15 +00:00
c7afaa5b44 Remove _prepare_flash_attention_from_position_ids (#40069)
Signed-off-by: cyy <cyyever@outlook.com>
2025-08-15 12:35:03 +00:00
c167faa081 Fix typos (#40175)
Signed-off-by: cyy <cyyever@outlook.com>
2025-08-15 12:10:26 +00:00
5068fcd9a8 Add repr to EncoderDecoderCache (#40195)
* add repr

* oups
2025-08-15 12:57:49 +02:00
421175685d Fix fsdp for generic-task models (#40191)
* remove abc inheritance

* add fast test
2025-08-15 12:28:16 +02:00
4912d5b490 fix to avoid modifying a view in place (#40162)
* fix to avoid modifying a view in place

* add backward test in tensor parallel

* add test to test_modeling_gpt_oss.py

* linting
2025-08-15 10:30:49 +02:00
cc9997878a make model doc device agnostic (#40143)
* make model doc device agnostic

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update align.md

* Update aya_vision.md

* Update byt5.md

* refine

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update granitevision.md

* Update src/transformers/pytorch_utils.py

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* add doc

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* 3 more

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-14 23:31:31 -07:00
85fce2e54c [MINOR:TYPO] Update base.py (#40169)
* [MINOR:TYPO] Update base.py

All other occurrences in the docs use lowercase. (https://github.com/search?q=repo%3Ahuggingface%2Ftransformers%20translation_XX_to_YY&type=code)

Also, uppercase doesn't work: "translation_EN_to_FR" fails with `ValueError: The task does not provide any default models for options ('EN', 'FR')`

It might be a good idea to allow for uppercase, but that's for another issue.

* [MINOR:TYPO] Update __init__.py
2025-08-14 22:53:57 -07:00
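
The commit above notes that task names of the form translation_XX_to_YY only work in lowercase, and floats the idea of accepting uppercase too. A minimal sketch of that idea follows. It is not the pipeline's actual parsing code: `parse_translation_task` and `_TASK_RE` are hypothetical names invented for this illustration, and the choice to normalize to lowercase is an assumption.

```python
import re

# Hypothetical pattern: "translation_" + source code + "_to_" + target code.
_TASK_RE = re.compile(r"^translation_([A-Za-z]+)_to_([A-Za-z]+)$")


def parse_translation_task(task):
    """Return (src, tgt) in lowercase, or raise ValueError for other strings."""
    match = _TASK_RE.match(task)
    if match is None:
        raise ValueError(f"Not a translation_XX_to_YY task: {task!r}")
    src, tgt = match.groups()
    # Assumption for this sketch: normalize case so "EN" and "en" are equivalent.
    return src.lower(), tgt.lower()


pair_lower = parse_translation_task("translation_en_to_fr")
pair_upper = parse_translation_task("translation_EN_to_FR")
```

With this normalization both spellings resolve to the same language pair, so a lookup of default models keyed on lowercase codes would not need to change.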
52c6c1bb6e Update dynamic attnt setter for multimodals (#39908)
* update

* fix the test for DepthPro

* PR comments

* wait, I didn't delete this in prev commit?

* fix

* better way

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-08-14 21:46:13 +02:00
31b6e6e1da Pin torch to 2.7.1 on CircleCI for now (#40174)
* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-14 20:19:35 +02:00
b02f2d8b6a Add dates to the model docs (#39320)
* added dates to the models with a single hf papers link

* added the dates for models with multiple papers

* half of no_papers models done

* rest of no_papers models also done, only the exceptions left

* added copyright disclaimer to sam_hw, cohere, cohere2 + dates

* some more fixes, hf links + typo

* some new models + a rough script

* the script looks robust, changed all paper links to hf

* minor change to handle technical reports along with blogs

* ran make fixup to remove the white space

* refactor
2025-08-14 10:08:46 -07:00
8a658ac119 Standardize BARTpho model card: badges, new examples, fixed broken im… (#40051)
* Standardize BARTpho model card: badges, new examples, fixed broken image section, and links (#36979) Update bartpho.md

* Update bartpho.md

Removed non-required/unsupported sections: Quantization, Attention visualizer, and Resources (plus stray tokenizer header).

Added code snippets which were suggested

* Update bartpho.md

Updated with necessary tags

* Update bartpho.md

* Update bartpho.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-14 09:55:27 -07:00
2b6cbedeb2 Add GptOssForSequenceClassification for GPT-OSS models (#40043)
* Add GptOssForSequenceClassification

* Tiny fix

* make fixup

* trigger CI rerun

* Check config type instead

---------

Co-authored-by: Yuefeng Zhan <yuefzh@microsoft.com>
2025-08-14 18:32:14 +02:00
b834cb8138 build: Add fast image processor tvp (#39529)
* build: add TvpImageProcessorFast

- Introduced TvpImageProcessorFast to enhance image processing capabilities.
- Updated image processing auto registration to include the new fast processor.
- Modified tests to accommodate both TvpImageProcessor and TvpImageProcessorFast, ensuring comprehensive coverage for both classes.

* fix: TvpImageProcessorFast with new resize method and update processing logic

* build: add TvpImageProcessorFast

* refactor: clean up whitespace and formatting in TvpImageProcessorFast and related tests

- Removed unnecessary whitespace and ensured consistent formatting in image_processing_tvp_fast.py.
- Updated import order in test_image_processing_tvp.py for clarity.
- Minor adjustments to maintain code readability and consistency.

* fix: Enhance TvpFastImageProcessorKwargs and update documentation

- Added TvpFastImageProcessorKwargs class to define valid kwargs for TvpImageProcessorFast.
- Updated the documentation in tvp.md to include the new class and its parameters.
- Refined the image processing logic in image_processing_tvp_fast.py for better handling of padding and resizing.
- Improved test cases in test_image_processing_tvp.py to ensure compatibility with the new processing logic and tensor inputs.

* fix: tested now with python 3.9

* fix: remove tvp kwargs from docs

* simplify processing

* remove import and fix tests

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
2025-08-14 15:48:18 +00:00
6f259bc83e Fix docs typo (#40167)
* DINOv3 model

* working version

* linter revert

* linter revert

* linter revert

* fix init

* remove flex and add convert to hf script

* DINOv3 convnext

* working version of convnext

* adding to auto

* Dinov3 -> DINOv3

* PR feedback

* complete convert checkpoint

* fix assertion

* bf16 -> fp32

* add fast image processor

* fixup

* change conversion script

* Use Pixtral attention

* minor renaming

* simplify intermediates capturing

* refactor DINOv3ViTPatchEmbeddings

* Refactor DINOv3ViTEmbeddings

* [WIP] rope: remove unused params

* [WIP] rope: rename period -> inv_freq for consistency

* [WIP] rope: move augs

* change inv_freq init (not persistent anymore)

* [WIP] rope: move coords to init

* rope - done!

* use default LayerScale

* conversion: truncate expected outputs

* remove commented code

* Refactor MLP layers

* nit

* clean up config params

* nit docs

* simplify embeddings

* simplify compile compat lru_cache

* fixup

* dynamic patch coords

* move augmentation

* Fix docs

* fixup and type hints

* fix output capturing

* fix tests

* fixup

* fix auto mappings

* Add draft docs

* fix dtype cast issue

* add push to hub

* add image processor tests

* fixup

* add modular

* update modular

* convert and test convnext

* update conversion script

* update prefix

* Update LayerNorm

* refactor DINOv3ConvNextLayer

* rename

* refactor convnext model

* fix doc check

* fix docs

* fix convnext config

* tmp fix for check docstring

* remove unused arg

* fix tests

* (nit) change init

* standardize gated MLP

* clear namings and sat493m

* fix tensors on different devices

* revert linter

* pr

* pr feedback ruff format

* missing headers

* fix code snippet and collection link in docs

* DINOv3 description

* fix checkpoints in tests

* not doc fixes in configs

* output_hidden_states

* x -> features

* remove sequential

---------

Co-authored-by: Cijo Jose <cijose@meta.com>
2025-08-14 17:29:53 +02:00
41980ce93e [bugfix] fix flash-attention2 unavailable error for Ascend NPU (#40151)
* [bugfix] fix flash-attention2 unavailable error for Ascend NPU

* remove redundant apply_rotary_emb usage

* fix ruff check error

* pad_input and unpad_input use same implementation as fa2

* rollback redundant codes

* fix ruff check error

* optimize fa2 judgement logic
2025-08-14 14:21:39 +02:00
eba1d62091 [FA2] Fix it finally - revert fa kwargs preparation (#40161)
revert
2025-08-14 13:39:11 +02:00
1c5d2f7fb6 Replace self.tokenizer by self.processing_class (#40119) 2025-08-14 13:24:55 +02:00
cfe52ff4db [Continuous Batching] set head_dim when config.head_dim is None (#40159)
* set head_dim when config.head_dim is None

* use model's actual TP setting
2025-08-14 13:23:27 +02:00
c47544b16f Fix CI: Use correct import in SAM for torchvision InterpolationMode (#40160)
fix ci
2025-08-14 10:53:23 +00:00
22e89e5385 [efficientloftr] fix bugs and follow original cross attn implementation strictly (#40141)
* fix: changed is_causal to be False

* fix: Added original cross attention bug

* fix: fixed the way border removal is computed

* fix: added missing normalization on coarse features

* test: fixed integration tests

---------

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
2025-08-14 10:42:59 +01:00
252364fd8e [Cohere2Vision] remove unused arg (#40103)
* remove unused arg

* remove the arg from test as well
2025-08-14 09:10:25 +00:00
e446372f76 Create self-scheduled-amd-mi355-caller.yml (#40134) 2025-08-14 01:33:45 +02:00
be1ab5103f Update Dockerfiles to install packages inside a virtual environment (#39098)
* Removed un-necessary virtual environment creation in Dockerfiles.

* Updated Dockerfiles to install packages in a virtual environment.

* use venv's python

* update

* build and trigger

* trigger

* build and trigger

* build and trigger

* build and trigger

* build and trigger

* build and trigger

* build and trigger

* update

* update

* update

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-13 23:51:52 +02:00
591708d9ce Add pytest marker: torch_compile_test and torch_export_test (#39950)
* new marker

* trigger CI

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-13 23:47:15 +02:00
12e49cda32 Fix quantized cache with only cache_implementation in generate (#40144)
* fix args

* comment
2025-08-13 23:21:41 +02:00
e651ae0a32 🌐 [i18n-KO] Translated gemma3.md to Korean (#39865)
* docs: ko: gemma3.md

* feat: nmt draft

* fix: manual edits

* fix: resolve suggestions

Co-authored-by: Chaewon Song <chaewon1019@ewhain.net>

* fix: resolve suggestions

---------

Co-authored-by: Chaewon Song <chaewon1019@ewhain.net>
2025-08-13 13:25:20 -07:00
0f9c2595cd updated visualBERT modelcard (#40057)
* updated visualBERT modelcard

* fix: Review for VisualBERT card
2025-08-13 12:47:32 -07:00
412c9c3030 Remove an old badly designed test (#40142)
remove it
2025-08-13 20:47:00 +02:00
eb5768a86e [docs] Fix ko toctree (#40138)
Update _toctree.yml
2025-08-13 11:24:58 -07:00
68a13cd4a6 Add Segment Anything 2 (SAM2) (#32317)
* initial comment

* test

* initial conversion for outline

* intermediate commit for configuration

* chore:init files for sam2

* adding arbitary undefined config

* check

* add vision

* make style

* init sam2 base model

* Fix imports

* Linting

* chore:sam to sam2 classes

* Linting

* Add sam2 to models.__init__

* chore:match prompt encoder with sam2 code

* chore:prepare kwargs for mask decoder

* Add image/video predictors

* Add CUDA kernel

* Add output classes

* linting

* Add logging info

* tmp commit

* docs for sam2

* enable image processing

* check difference of original SAM2
- difference is the order of ToTensor()
- please see https://pytorch.org/vision/main/_modules/torchvision/transforms/functional.html#resize

* enable promptencoder of sam2

* fix promptencoder

* Confirmed that PromptEncoder is exactly same (Be aware of bfloat16 and float32 difference)

* Confirmed that ImageEncoder is exactly same (Be aware the linting of init)

* Confirmed that MaskDecoder is exactly same (TO DO: lint variable name)

* SamModel is now available (Need more chore for name)

* make fix-copies

* make style

* make CI happy

* Refactor VisionEncoder and PositionEmbedding

* TO DO : fix the image_embeddings and sparse_embeddings part

* pure image inference done

* reusable features fix and make style

* styling

* refactor memoryattention

* tmp

* tmp

* refactor memoryencoder
TO DO : convert and inference the video pipeline

* TO DO : fix the image_encoder shape

* conversion finish
TO DO: need to check video inference

* make style

* remove video model

* lint

* change

* python utils/check_docstrings.py --check_all

* python utils/check_config_attributes.py

* remove copies for sam2promptencoder due to configuration

* change __init__.py

* remove tensorflow version

* fix that to not use direct comparison

* make style

* add missing import

* fix image_embedding_size

* refactor Sam2 Attention

* add fully working video inference (refactoring todo)

* clarify _prepare_memory_conditioned_features

* simplify modeling code, remove unused paths

* use one model

* use auto_docstring

* refactor rope embeddings

* nit

* not using multimask when several points given

* add all sam2.1

* add video tmp

* add Sam2VideoSessionState + fast image proc + video proc

* remove init_states from model

* fix batch inference

* add image integration tests

* uniformize modeling code with other sam models and use modular

* pass vision tests and most model tests

* All tests passing

* add offloading inference state and video to cpu

* fix inference from image embedding and existing mask

* fix multi_boxes mask inference

* Fix batch images + batch boxes inference

* improve processing for image inference

* add support for mask generation pipeline

* add support for get_connected_components post processing in mask generation

* add fast image processor sam, image processor tests and use modular for sam2 image processor

* fix mistake in sam after #39120

* fix init weights

* refactor convert

* add integration tests for video + other improvements

* add needed missing docstrings

* Improve docstrings and

* improve inference speed by avoiding cuda sync

* add test

* skip test for vision_model

* minor fix for vision_model

* fix vision_model by adding sam2model and change the torch dependencies

* remove patch_size

* remove image_embedding_size

* fix patch_size

* fix test

* make style

* Separate hieradet and vision encoder in sam2

* fixup

* review changes part 1

* remove MemoryEncoderConfig and MemoryAttentionConfig

* pass q_stride instead of q_pool module

* add inference on streamed videos

* explicitly process streamed frames

* nit

* Improve docstrings in Sam2Model

* update sam2 modeling with better management of inference state and cache, and separate Sam2Model and Sam2VideoModel

* improve video inference api

* change inference_state to inference_session

* use modular for Sam2Model

* fix convert sam2 hf

* modular

* Update src/transformers/models/sam2/video_processing_sam2.py

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>

* fix minor config

* fix attention loading error

* update modeling tests to use hub checkpoints

* Use CI A10 runner for integration tests values + higher tolerance for video integration tests

* PR review part 1

* fix doc

* nit improvements

* enforce one input format for points, labels and boxes

* nit

* last few nits from PR review

* fix style

* fix the input type

* fix docs

* add sam2 model as conversion script

* improve sam2 doc

* nit fixes + optimization

* split sam2 and sam2_video in two models

* PR review part 1

* fix None for default slow processor of sam2

* remove unnecessary code path in sam2_video

* refactor/simplify RoPE

* replace embedding module list with embedding matrix

* fix tests

* remove kernel

* nit

* use lru_cache for sine_pos_embeddings

* reorder sam2_video methods

* simplify sam2_video

* PR review part 1

* simplify sam2 video a lot

* more simplification

* update integration tests with updated conftest

* more explicit config for hieradet

* do post_processing outside of sam2 video model

* Improve Sam2VideoVisionRotaryEmbedding

* fix tests

* update docs and fix mask2former/oneformer

* avoid unnecessary reshapes/permute

* fix device concatenating points

* small dtype fix

* PR review

* nit

* fix style and finish up doc

* fix style

* fix docstrings

* fix modular

---------

Co-authored-by: RUFFY-369 <prakarshkaushik369@gmail.com>
Co-authored-by: Haitham Khedr <haithamkhedr@meta.com>
Co-authored-by: sangbum choi <sangbumchoi@sangbumui-MacBookAir.local>
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
2025-08-13 14:18:05 -04:00
25ad9c8c92 Fix Janus (#40140)
fix
2025-08-13 20:12:21 +02:00
bec6926696 gpt oss is important (#40139) 2025-08-13 19:49:54 +02:00
ab9108517a 🌐 [i18n-KO] Translated pipelines.md to Korean (#39577)
* docs: ko: pipelines.md

* feat: gpt draft

* Update docs/source/ko/main_classes/pipelines.md

Co-authored-by: Yijun Lee <119404328+yijun-lee@users.noreply.github.com>

* Update docs/source/ko/main_classes/pipelines.md

Co-authored-by: Yijun Lee <119404328+yijun-lee@users.noreply.github.com>

* Update docs/source/ko/main_classes/pipelines.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/ko/main_classes/pipelines.md

Co-authored-by: Yijun Lee <119404328+yijun-lee@users.noreply.github.com>

* Update docs/source/ko/main_classes/pipelines.md

Co-authored-by: Yijun Lee <119404328+yijun-lee@users.noreply.github.com>

* Update _toctree.yml

* Update _toctree.yml

Fix translated document

* Update pipelines.md

Fix ToC

* Update pipelines.md

---------

Co-authored-by: xhaktm <tnwjd318@hs.ac.kr>
Co-authored-by: Yijun Lee <119404328+yijun-lee@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-13 10:26:17 -07:00
20c6b478cd 🚨 Use lru_cache for sine pos embeddings MaskFormer (#40007)
* use lru_cache for sine pos embeddings maskformer

* fix calls to pos embed

* change maxsize to 1
2025-08-13 17:05:22 +00:00
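
The caching idea in the commit above can be sketched as follows. This is a minimal illustration, not the MaskFormer code: `sine_position_embedding`, its parameters, and the pure-Python math are assumptions made for the example. Only the use of `lru_cache` with a maxsize of 1 comes from the commit message.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=1)  # keep only the most recently computed grid
def sine_position_embedding(height, width, num_feats=4, temperature=10000.0):
    """Hypothetical helper: a height x width grid of 2*num_feats-dim tuples."""

    def encode(pos):
        # Alternate sin/cos over frequencies that grow with the feature index.
        feats = []
        for i in range(num_feats):
            freq = temperature ** (2 * (i // 2) / num_feats)
            angle = pos / freq
            feats.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        return tuple(feats)

    # Tuples are immutable, so returning the cached object is safe to share.
    return tuple(
        tuple(encode(y + 1) + encode(x + 1) for x in range(width))
        for y in range(height)
    )


first = sine_position_embedding(2, 3)
second = sine_position_embedding(2, 3)  # same arguments: served from the cache
info = sine_position_embedding.cache_info()
```

A cache of size 1 fits the case where consecutive calls use the same spatial shape: the repeat call is free, while a call with a new shape simply replaces the stored entry instead of growing memory.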
6b728f1830 🌐 [i18n-KO] Translated grounding-dino.md to Korean (#39861)
* docs: ko: grounding-dino.md

* feat: nmt draft

* fix: manual edits

* Update docs/source/ko/model_doc/grounding-dino.md

Co-authored-by: Kim Juwon <81630351+Kim-Ju-won@users.noreply.github.com>

* Update docs/source/ko/model_doc/grounding-dino.md

Co-authored-by: Kim Juwon <81630351+Kim-Ju-won@users.noreply.github.com>

* Update docs/source/ko/model_doc/grounding-dino.md

Co-authored-by: Kim Juwon <81630351+Kim-Ju-won@users.noreply.github.com>

* docs: add AP explanation for better readability

---------

Co-authored-by: TaskerJang <bymyself103@naver.com>
Co-authored-by: Kim Juwon <81630351+Kim-Ju-won@users.noreply.github.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
2025-08-13 10:01:05 -07:00
127e33f759 🌐 [i18n-KO] Translated optimizers.md to Korean (#40011)
* docs: ko: optimizers.md

* feat: optimizers draft

* fix: manual edits

* docs: ko: update optimizers.md

* Update docs/source/ko/optimizers.md

Co-authored-by: Minseo Kim <75977640+luckyvickyricky@users.noreply.github.com>

* Update docs/source/ko/optimizers.md

Co-authored-by: Minseo Kim <75977640+luckyvickyricky@users.noreply.github.com>

* Update docs/source/ko/optimizers.md

Co-authored-by: Jaehyeon Shin <108786184+skwh54@users.noreply.github.com>

* docs: ko: final updates to optimizers and toctree

---------

Co-authored-by: Minseo Kim <75977640+luckyvickyricky@users.noreply.github.com>
Co-authored-by: Jaehyeon Shin <108786184+skwh54@users.noreply.github.com>
2025-08-13 10:00:47 -07:00
ac52c77a66 🌐 [i18n-KO] Translated gpt2.md to Korean (#39808)
* docs: ko: bamba.md

* feat: nmt draft

* fix: manual edits

* docs: ko: gpt2.md

* feat: nmt draft

* fix: manual edits

* Remove bamba.md from docs/source/ko/model_doc/

* Update _toctree.yml
2025-08-13 10:00:25 -07:00
5337f3052d 🚨🚨 [generate] ignore cache_implementation="hybrid" hub defaults (#40135)
* working?

* fix tests
2025-08-13 17:57:41 +02:00
e4223fa915 🌐 [i18n-KO] Translated main_classes/optimizer_schedules.md to Korean (#39713)
* docs: ko: main_classes/optimizer_schedules

* feat: nmt draft

* fix: improve TOC anchors and expressions in optimizer_schedules

- Add TOC anchors to all section headers
- Fix terminology and improve Korean expressions

* fix: Correct translation of 'weight decay fixed' to '가중치 감쇠가 적용된'

Changed '가중치 감쇠가 수정된' to '가중치 감쇠가 적용된' for more accurate translation of 'weight decay fixed' in the context of optimization.

* fix: Use more natural Korean inheritance expression

Changed '에서 상속받는' to '을 상속받는' to follow natural Korean grammar patterns for inheritance terminology.

* fix: Use consistent '미세 조정' translation for 'finetuned models'

Changed '파인튜닝된' to '미세 조정된 모델' to follow the established translation glossary for 'finetuned models' terminology.
2025-08-13 08:23:09 -07:00
9e21e50241 🌐 [i18n-KO] Translated jamba.md to Korean (#39890)
* docs: ko: jamba.md

* feat: nmt draft

* fix: manual edits

* fix: resolve suggestion

Co-authored-by: Minseo Kim <75977640+luckyvickyricky@users.noreply.github.com>

---------

Co-authored-by: Minseo Kim <75977640+luckyvickyricky@users.noreply.github.com>
2025-08-13 08:22:28 -07:00
486844579b 🌐 [i18n-KO] Translated main_classes/processors.md to Korean (#39519)
* docs: ko: processors.md

* feat: nmt draft

* fix: manual edits

* Update docs/source/ko/main_classes/processors.md

Co-authored-by: Ahnjj_DEV <ahnjj.dev@gmail.com>

* Update docs/source/ko/main_classes/processors.md

Co-authored-by: Ahnjj_DEV <ahnjj.dev@gmail.com>

---------

Co-authored-by: TaskerJang <bymyself103@naver.com>
Co-authored-by: Ahnjj_DEV <ahnjj.dev@gmail.com>
2025-08-13 08:21:38 -07:00
f445caeb0f Fix hidden torchvision>=0.15 dependency issue (#39928)
* use pil_torch_interpolation_mapping for NEAREST/NEAREST_EXACT

* fix min torchvision version

* use InterpolationMode directly

* remove unused is_torchvision_greater_or_equal,

* nit
2025-08-13 15:13:42 +00:00
11537c3e0c [trainer] handle case where EOS token is None in generation_config (#40127)
* handle case where EOS token is None in gen config

* update eli5 dataset
2025-08-13 15:57:17 +01:00
8ef5cd6579 DOCS: Add missing space in SECURITY.md (#40087) 2025-08-13 12:57:37 +00:00
ebceef343a Collated reports (#40080)
* Add initial collated reports script and job definition

* provide commit hash for this run. Also use hash in generated artifact name. Json formatting

* tidy

* Add option to upload collated reports to hf hub

* Add glob pattern for test report folders

* Fix glob

* Use machine_type as path filter instead of glob. Include machine_type in collated report
2025-08-13 14:48:15 +02:00
e78571f5ce decoding_method argument in generate (#40085)
* factor out expand inputs

* callable arg

* improve docs, add test

* Update docs/source/en/generation_strategies.md

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

---------

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
2025-08-13 12:45:50 +00:00
8d19231bca [serve] allow array content inputs for LLMs (#39829)
fix bug; add tests
2025-08-13 11:26:19 +01:00
34a1fc6426 Fix QuantoQuantizedCache import issues (#40109)
* fix quantoquantized
2025-08-13 10:22:59 +00:00
060b86e21d changed xLSTMRMSNorm to RMSNorm (#40113)
* changed xLSTMRMS.. to RMS...

* fix linter error

---------

Co-authored-by: Nikita <nikita@Nikitas-MacBook-Pro.local>
2025-08-13 11:10:42 +02:00
849c3778c6 [bugfix] Fix tensor device in Idefics2, Idefics3, and SmolVLM (#39975)
* [bugfix] ensure correct tensor device in Idefics2, Idefics3, and SmolVLM models

* to cuda
2025-08-13 09:58:50 +02:00
85d536a93b 🌐 [i18n-KO] Translated tiny_agents.md to Korean (#39913)
* docs: ko: tiny_agents.md

* feat: nmt draft

* fix: manual edits

* fix: manual edits
2025-08-12 22:54:16 -07:00
31ab7168ff remove sequence parallel in llama4 (#40084) 2025-08-13 00:12:45 +02:00
a1a4fcd03e Add model card for MobileViT (#40033)
* Add model card for MobileViT

* Update docs/source/en/model_doc/mobilevit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/mobilevit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/mobilevit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/mobilevit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/mobilevit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update mobilevit.md

* Update mobilevit.md

* Update mobilevit.md

* Update docs/source/en/model_doc/mobilevit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/mobilevit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update mobilevit.md

* Update mobilevit.md

* Update mobilevit.md

* Update mobilevit.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-12 11:36:59 -07:00
e5e73e4b95 [docs] Add reference to HF-maintained custom_generate collections (#39894)
decoding -> generation; add collections
2025-08-12 17:38:00 +01:00
0ce24f5a88 Fix Causality Handling in Flash Attention to Support Bidirectional Attention (#39707)
Fix the is_causal logic to enable bidirectional attention

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-08-12 16:16:28 +00:00
83dbebc429 [trainer] ensure special tokens in model configs are aligned with tokenizer at train time (#38441)
* tmp commit

* add test

* make fixup

* reset warns/info in test
2025-08-12 16:32:07 +01:00
9977cf1739 [Flash Attention] Fix flash attention integration (#40002)
* fix flash attention

* i got a stroke reading that comment

* change dropout kwarg back to before

* rename _fa3... as it's used for multiple variants and should work as fallback instead

* simplify imports and support kwargs for fa

* style

* fix comments order

* small fix

* skip kernels test (causes cuda illegal memories w/o cleanup), fix fa test in general esp for models like bart

* style

* allow fullgraph by preloading on init

* make globals "private"

* ci pls be happy

* change skip conditions based on backend flag (indicating missing mask interface)

* move globals support to a function to prepare kwargs

* style

* generalize supported kwargs

* small change to doc

* fix

* add comments

* style

* revert prep during generate

* style

* revert weird style changes

* add fa kwarg prep during generate with fixes back

* how did this even happen

* how

* add comment
2025-08-12 15:24:10 +00:00
b6ba595543 Default to dequantize if cpu in device_map for mxfp4 (#39993)
* default to dq if cpu

* another check

* style

* revert some changes
2025-08-12 16:48:52 +02:00
a5fac1c394 Fix error on importing unavailable torch.distributed (#40038)
Currently, model_debugging_utils.py has an unguarded `import torch.distributed.tensor`. This PR ensures that the distributed module is available before importing its tensor module.
2025-08-12 16:30:51 +02:00
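The guard described in a5fac1c394 can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the code from the PR: `guarded_import` and its `is_available` parameter are hypothetical names, and `json` / `json.decoder` merely stand in for a parent package and its submodule. For PyTorch, the availability predicate would presumably be `torch.distributed.is_available()`.

```python
import importlib


def guarded_import(parent, child, is_available=lambda module: True):
    """Import `parent.child` only if `parent` imports and reports itself usable.

    Hypothetical helper for illustration; returns None when the guard fails.
    """
    try:
        parent_module = importlib.import_module(parent)
    except ImportError:
        # The parent package is not installed at all.
        return None
    if not is_available(parent_module):
        # The parent exists but was built without the needed feature.
        return None
    return importlib.import_module(f"{parent}.{child}")


decoder = guarded_import("json", "decoder")
missing = guarded_import("no_such_package_xyz", "tensor")
disabled = guarded_import("json", "decoder", is_available=lambda module: False)
```

Callers then branch on `None` instead of failing at import time.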
085e02383c Fix Qwen3 MoE GGUF architecture mismatch (#39976)
* fix qwen3moe gguf architecture

* Fix Qwen3Moe GGUF loading

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Co-authored-by: Jinuk Kim <jusjinuk@snu.ac.kr>
2025-08-12 13:38:48 +00:00
2ce0dae390 Switch the order of args in StaticCache (for BC and future logic) (#40100)
* switch order for BC and future logic

* in generate as well
2025-08-12 15:30:44 +02:00
f7cbd5f3ef Fix regression in mllama vision encoder (#40083)
fix mllama vision encoder

Signed-off-by: Isotr0py <2037008807@qq.com>
2025-08-12 15:29:45 +02:00
35dc88829c Replace logger.warning with logger.warning_once in GradientCheckpointingLayer (#40091) 2025-08-12 15:26:47 +02:00
b1b46555cd Re-apply make style (#40106)
make style
2025-08-12 15:02:16 +02:00
a07b5e90f2 feat: add is_fast to ImageProcessor (#39603)
* feat: add `is_fast` to ImageProcessor

* Update test_image_processing_common.py

Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>

* feat: add missing BaseImageProcessorFast import

* fix: `issubclass` for discriminating subclass of BaseImageProcessorFast

---------

Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
2025-08-12 12:14:57 +00:00
952fac100d Enable SIM rules (#39806)
* Enable SIM rules

Signed-off-by: cyy <cyyever@outlook.com>

* More fixes

Signed-off-by: cyy <cyyever@outlook.com>

---------

Signed-off-by: cyy <cyyever@outlook.com>
2025-08-12 12:14:26 +00:00
41d1717882 New DynamicSlidingWindowLayer & associated Cache (#40039)
* start adding the layer

* style

* improve

* modular

* fix

* fix

* improve

* generate integration

* comment

* remove old one

* remove

* fix

* fix

* fix

* fix all recompiles

* fix

* doc

* fix

* add text config check

* fix encoderdecoder cache

* add it for all models with sliding/hybrid support

* revert

* start fixing

* prophetnet

* fsmt

* fix ddp_data

* add test for mistral

* improve mistral test and add gemma2 test

* docstrings
2025-08-12 14:09:52 +02:00
ab455e0d88 Audio encodings now match conv2d weight dtype in Gemma3nAudioSSCPConvBlock (#39743)
audio encodings now match conv weight dtype in Gemma3nAudioSSCPConvBlock
2025-08-12 12:08:28 +00:00
4b3a1a62cc Causal loss for ForConditionalGeneration (#39973)
* feat: add ForConditionalGeneration loss to LOSS_MAPPING

* consistent spelling of "recognized"
2025-08-12 14:03:09 +02:00
f6b6e17719 Add glm4.5&&glm4.5V doc (#40095)
* Docs: GLM-4-MoE & GLM-4V-MoE pages

* Docs: polish GLM-4V-MoE intro, remove placeholders; pin image

* Docs

---------

Co-authored-by: wujiahan <lambert@gmail.com>
2025-08-12 11:44:53 +00:00
1c5e17c025 Update Glm4V processor and add tests (#39988)
* update Glm4V and add tests

* Update tests/models/glm4v/test_processor_glm4v.py

Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>

* remove min/max pixels for BC

* fix video tests

---------

Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
2025-08-12 13:40:54 +02:00
913c0a8c33 [docs] Zero Shot Object Detection Task (#40096)
* refactor zsod task docs

* keeping the image guided od section

* Apply suggestions from code review

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>

* Update docs/source/en/tasks/zero_shot_object_detection.md

Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

---------

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
2025-08-12 11:43:38 +01:00
c6fbfab61b [fix] batch inference for llava_onevision (#40021)
* [fix] llava onevision batch inference

* style

* cannot pass inconsistent list & handle text-only case
2025-08-12 11:01:00 +02:00
86bb1fcd26 Revert FA2 kwargs construction (#40029)
* revert

* use imports

* went way too high in imports level

* style
2025-08-12 10:48:35 +02:00
3ff2e984d2 Fix PerceptionLM image preprocessing for non-tiled image input. (#40006)
* Fix PerceptionLM image preprocessing for non-tiled image input.

* Add test for single tile vanilla image processing.

* ruff format

* recover missing test skip

* Simplify test.

* minor test name fix
2025-08-12 08:40:22 +00:00
4668ef1459 Update notification service MI325 (#40078)
add mi325 to amd_daily_ci_workflows
2025-08-12 10:22:52 +02:00
1cea763ba4 feat: extract rev in attn_implementation kernels via @ (#40009)
* feat: extract rev in attn_implementation kernels via @

* fix: adjust for ruff

* fix: update regex and add explanatory comment

* fix: move attn_implementation kernel doc

* fix: remove extra line
2025-08-11 15:14:13 -04:00
e29919f993 [GPT Big Code] Fix attention scaling (#40041)
* fix

* update integration tests

* fmt

* add regression test
2025-08-11 19:01:31 +00:00
eca703026e chore: standardize DeBERTa model card (#37409)
* chore: standardize DeBERTa model card

* Apply suggestions from code review in docs

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix: Update deberta.md with code cleanup suggestions

* Update docs/source/en/model_doc/deberta.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/deberta.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update deberta.md

* Update deberta.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-11 10:30:37 -07:00
43001fd3c6 Fix time_spent in notification_service.py. (#40081)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-11 18:30:58 +02:00
5521c62b89 added Textnet fast image processor (#39884)
* feat: add fast image processor implementation for TextNet model

* chore: override to_dict method to TextNetImageProcessorFast for slow processor compatibility tests

* chore: update init method

* chore: coding and style checks

* chore: fixed code quality issue

* chore: override resize to handle size_divisor, move all preprocessing logic to child class

* fix: autoImageProcessor issue for textnet

* chore: cleanup

* simplify resize

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
2025-08-11 11:44:31 -04:00
6b70d79b61 Fix repo consistency (#40077)
fix
2025-08-11 15:26:22 +02:00
7dd82f307b guard on model.eval when using torch.compile + FSDP2 (#37413)
guard on model.eval

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-08-11 13:22:42 +02:00
68eb1a9a63 Remove deprecated cache-related objects (#40035)
remove them
2025-08-11 10:30:14 +02:00
480653d271 fix: move super().__init__ after vision_config init in Mistral3Config (#40063)
fix: move super().__init__ after vision_config init in Mistral3Config (#40062)
2025-08-11 09:21:54 +02:00
502f253e20 [gemma3] update conversion key mapping (#39778)
update conversion key mapping
2025-08-11 09:21:13 +02:00
3124d1b439 [qwen-vl] fix beam search with videos (#39726)
* fix

* fix copies
2025-08-11 09:21:04 +02:00
1372a5b8c4 fix: resolve triton version check compatibility on windows (#39986)
* fix: resolve triton version check compatibility on windows

* style: remove trailing space

* fix: fix typo

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-08-11 08:53:19 +02:00
99c747539e unpin torchcodec==0.5.0 and use torch 2.8 on daily CI (#40072)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-10 22:27:39 +02:00
b59140b696 Update HuBERT model card according to template (#39742)
* Update HuBERT model card according to template

Standardized HuBERT doc, added ASR examples, Flash Attention 2 support, and quantization section.

* Address review comments and changes requested to hubert.md

* Update hubert.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-10 11:32:45 -07:00
f4d57f2f0c Revert "fix notification_service.py about time_spent" (#40044)
Revert "fix `notification_service.py` about `time_spent` (#40037)"

This reverts commit d2ba153b29feb9cc0e9818c1ce63a07679b47250.
2025-08-08 22:32:24 +02:00
7b20915f4e GLM-4.5V Model Support (#39805)
* init

* update

* update

* ruff

* t patch default is 2, not 1

* draft

* back

* back1

* update

* config update

* update using glm-41 format

* add self.rope_scaling = config.rope_scaling

* update config

* update

* remove the processor

* update

* fix tests

* update

* for test

* update

* update 2126

* self.rope_scaling is missing in GLM4MOE, let's add it

* update

* update

* Update modular_glm4v_moe.py

* change config

* update apply_multimodal_rotary_pos_emb

* format

* update

* Delete 3-rollout_qas_thinking_answers.py

* use right name

* update with place holder

* update

* use right rotary

* Update image_processing_glm4v_fast.py

* rope_config_validation needs to rewrite the entire config file in modular

* update

* changed name

* update

* Update modeling_glm4v_moe.py

* _init_weights should be added in Glm4vMoePreTrainedModel

* remove use_qk_norm

* Update modular_glm4v_moe.py

* remove use_qk_norm as it is not used

* fix style

* deprecations are not needed on new models

* fix merge issues

---------

Co-authored-by: raushan <raushan@huggingface.co>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>
2025-08-08 17:39:52 +02:00
d2ba153b29 fix notification_service.py about time_spent (#40037)
temp

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-08 17:11:16 +02:00
f639c0c780 Bnb failing tests (#40026)
* initial commit

* style

---------

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
2025-08-08 16:28:00 +02:00
a96cccd0dd Tie weights recursively on all submodels (#39996)
* recursive call

* add missing keys

* remove bad keys
2025-08-08 16:03:16 +02:00
a78263dbb5 fix 2025-08-08 15:32:23 +02:00
dc11a3cbb2 [core] Refactor the Cache logic to make it simpler and more general (#39797)
* Simplify the logic quite a bit

* Update cache_utils.py

* continue work

* continue simplifying a lot

* style

* Update cache_utils.py

* offloading much simpler

* style

* Update cache_utils.py

* update inits

* Update cache_utils.py

* consistency

* Update cache_utils.py

* update generate

* style

* fix

* fix

* add early_initialization

* fix

* fix mamba caches

* update

* fix

* fix

* fix

* fix tests

* fix configs

* revert

* fix tests

* alright

* Update modeling_gptj.py

* fix the constructors

* cache tests

* Update test_cache_utils.py

* fix

* simplify

* back to before -> avoid compile bug

* doc

* mistral test

* llama4 test dtype

* Update test_modeling_llama4.py

* CIs

* Finally find a nice impl

* Update cache_utils.py

* Update cache_utils.py

* add lazy methods in autodoc

* typo

* better doc

* Add detailed docstring for lazy init

* CIs

* style

* fix
2025-08-08 14:47:21 +02:00
95510ab018 Fix missing None default values for Gemma3n model in get_placeholder_mask (#39991) (#40024)
* Fix missing None default values for Gemma3n model in get_placeholder_mask (#39991)

* Switched definition of optional from `| None` to `Optional[]` (Issue #39991)

---------

Co-authored-by: Laurenz Ruzicka <Laurenz.Ruzicka@ait.ac.at>
2025-08-08 10:43:42 +00:00
5c3fb7f731 Harmonize past_key_value to past_key_valueS everywhere (#39956)
* all modulars and llama

* apply modular

* bert and gpt2 copies

* fix imports

* do it everywhere

* fix import

* finalize it

* fix

* oops, set it in modular

* style

* fix

* Add 1 version to deprecation cycle

* Update modeling_layers.py
2025-08-08 11:52:57 +02:00
2469cce621 Fix an annoying flaky test (#40000)
annoying flaky test
2025-08-08 10:32:51 +02:00
fe1bf82159 Higgs modules_to_not_convert standardization (#39989)
fix higgs
2025-08-08 10:22:59 +02:00
b374c3d12e Fix broken image inference for Fuyu model (#39915)
* fix fuyu

Signed-off-by: Isotr0py <2037008807@qq.com>

* oops

Signed-off-by: Isotr0py <2037008807@qq.com>

* run test on GPU

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* clean unused

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* revert

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* add fuyu multimodal test

Signed-off-by: Isotr0py <2037008807@qq.com>

* fix

Signed-off-by: Isotr0py <2037008807@qq.com>

---------

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-08 07:21:49 +00:00
4d57c39007 pin torchcodec==0.5.0 for now with torch 2.7.1 on daily CI (#40013)
* update

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-07 23:05:39 +02:00
3e0333fa4a Update expected output values after #39885 (part 2) (#40015)
update

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-07 22:52:53 +02:00
12f248bced Raising error when quantizing a quantized model (#39998)
* error when quantizing a quantized model

* style
2025-08-07 20:37:25 +00:00
efaf3714dc docs: fix duplication in 'en/optimizers.md' (#40014) 2025-08-07 13:28:43 -07:00
ca4cbb1e3f unpin torch<2.8 on circleci (#40012)
update

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-07 21:31:17 +02:00
78922577e9 FA2 can continue generation from cache (#39843)
* add fa2 support to continue generation from cache

* update q-len
2025-08-07 19:26:23 +02:00
9bfbdd2945 Fix default values of getenv (#39867)
Signed-off-by: cyy <cyyever@outlook.com>
2025-08-07 17:25:40 +00:00
692d336908 Fix HGNetV2 Model Card and Image Classification Pipeline Usage Tips (#39965)
* fix hgnet docs and image-classification pipeline

* use positional argument

* fix dit close hfoptions tag

* fix alphabet order

* fix hgnet modular docstring

* Update hgnet_v2.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update hgnet_v2.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/hgnet_v2.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix: hgnet reference

* change hgnet to en doc

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-07 09:33:29 -07:00
0659214196 fix: remove CHAT_TEMPLATE import in tests for deepseek-vl (#40003)
* remove CHAT_TEMPLATE import in tests

* update and use prepare_processor_dict
2025-08-07 16:19:36 +00:00
27997eeb8d Fix missing video inputs for PerceptionLM. (#39971)
* Fix missing video inputs for PerceptionLM.

* Minor fix for vanilla input image (only C,H,W, no tiles dim).

* Revert "Minor fix for vanilla input image (only C,H,W, no tiles dim)."

This reverts commit 181d87b964e59c4118035a9fd4f530c6e551ba9f.
2025-08-07 15:54:45 +00:00
bf1bd6ac1f Fix int4 quantized model cannot work with cpu (#39724)
* Fix int4 quantized model cannot work with cpu

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Update the comments

Signed-off-by: yuanwu <yuan.wu@intel.com>

* update

Signed-off-by: yuanwu <yuan.wu@intel.com>

* update

Signed-off-by: yuanwu <yuan.wu@intel.com>

---------

Signed-off-by: yuanwu <yuan.wu@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-08-07 15:24:00 +00:00
43d3b1931a Update expected output values after #39885 (part 1) (#39990)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-07 16:00:28 +02:00
d5a0809707 Fix consistency (#39995)
* modular

* fix
2025-08-07 15:52:40 +02:00
b347e93567 [typing] Fix return typehint for decoder and inv_freq annotation (#39610)
* fix return typehint for decoder and annotate inv_freq

* fix modular

* Fix consistency

* Move annotation on class level

* missing annotations

* add comment
2025-08-07 14:10:22 +01:00
7188e2e28c Bump transformers from 4.48.0 to 4.53.0 in /examples/tensorflow/language-modeling-tpu (#39967)
Bump transformers in /examples/tensorflow/language-modeling-tpu

Bumps [transformers](https://github.com/huggingface/transformers) from 4.48.0 to 4.53.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.48.0...v4.53.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.53.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-07 12:13:48 +01:00
2b19a06692 Fix gemma3n feature extractor's incorrect squeeze (#39919)
* fix gemma3n squeeze

Signed-off-by: Isotr0py <2037008807@qq.com>

* add regression test

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

---------

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-07 18:34:28 +08:00
555cbf5917 [Idefics] fix device mismatch (#39981)
fix
2025-08-07 11:12:04 +02:00
597ed1a11d Various test fixes for AMD (#39978)
* Add amd expectation in internvl

* Add amd expectation to llama

* Added bnb decorator for a llava test that requires bnb

* Added amd expectation for mistral3

* Style
2025-08-07 10:57:04 +02:00
6121e9e46c Support input_embeds in torch exportable decoders (#39836)
* Support input_embeds in torch exportable decoders

* Hybrid cache update

* Manually change some callsites

* AI changes the rest of the call sites

* Make either input_ids/inputs_embeds mandatory

* Clean up

* Ruff check --fix

* Fix test

* pr review

* Revert config/generation_config changes

* Ruff check
2025-08-07 08:51:31 +00:00
cdeaad96b7 [superglue] Fixed the way batch mask was applied to the scores before match assignment computation (#39968)
fix: mask filling to score was wrong
2025-08-07 09:49:39 +01:00
2593932f10 Gemma3 fixes (#39960)
* Fix multiple devices issue

* Added expectations for rocm 9.4

* Ruff
2025-08-07 09:57:21 +02:00
513f76853b Modular fix: remove the model name in find_file_type (#39897)
* remove the model name in the class name

* add comment
2025-08-06 23:31:07 +00:00
743bb5f52e chore: update Deformable_Detr model card (#39902)
* chore: update Deformable_Detr model card

* fix: added pipeline, automodel examples and checkpoints link

* Update deformable_detr.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-06 12:45:14 -07:00
ac0b468465 [bugfix] fix flash_attention_2 unavailable error on Ascend NPU (#39844) 2025-08-06 17:48:52 +00:00
cf243a1bf8 Fix fix_and_overwrite mode of utils/check_docstring.py (#39369)
* bug in fix mode of check_docstring
2025-08-06 19:37:25 +02:00
6902ffa505 remove triton_kernels dep with kernels instead (#39926)
* remove dep

* style

* rm import

* fix

* style

* simplify

* style
2025-08-06 19:31:20 +02:00
cb2e0df2ec [image processor] fix glm4v (#39964)
* fix glm4v image process

* Update src/transformers/models/glm4v/image_processing_glm4v.py

---------

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
2025-08-06 17:46:58 +01:00
9ab75fc428 fix typo (#39936)
* fix typo

* fix modular instead

* fix

---------

Co-authored-by: y.korobko <y.korobko@tbank.ru>
2025-08-06 16:21:24 +00:00
43b3f58875 Fix grammatical error in MoE variable name: expert_hitted → expert_hit, hitted_experts → hit_experts (#39959)
* Fix grammatical error: expert_hitted -> expert_hit in MoE implementations

* Fix grammatical error: hitted_experts -> hit_experts in MoE implementation
2025-08-06 15:45:19 +00:00
dff6185d61 docs: fix typo in 'quantization-aware training' (#39904) 2025-08-06 14:52:43 +00:00
c7844c7a8e Enable gpt-oss mxfp4 on older hardware (sm75+) (#39940)
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-08-06 13:39:21 +00:00
dd70a8cb9d Fix MXFP4 quantizer validation to allow CPU inference with dequantize option (#39953)
* Fix MXFP4 quantizer validation to enable CPU dequantization

Move dequantize check before CUDA availability check to allow
CPU inference when quantization_config.dequantize is True.
This enables users to run MXFP4 models on CPU by automatically
converting them to BF16 format.

* Add tests for MXFP4 quantizer CPU dequantization validation

* fix: format mxfp4 test file with ruff
2025-08-06 15:20:41 +02:00
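The reordering described in dd70a8cb9d can be illustrated with a small sketch. It is a toy under stated assumptions, not the quantizer's actual code: `validate_mxfp4_environment`, its parameters, and the returned strings are all hypothetical, and the only point is that the dequantize check runs before the CUDA availability check.

```python
def validate_mxfp4_environment(dequantize, cuda_available):
    """Toy validation order: honor `dequantize` before requiring CUDA.

    Hypothetical function; the return values are illustrative labels only.
    """
    if dequantize:
        # Checked first, so CPU-only machines can still load the model
        # by converting the weights to BF16.
        return "dequantize to bf16"
    if not cuda_available:
        raise RuntimeError("MXFP4 quantized inference needs a CUDA device")
    return "run mxfp4 kernels"


cpu_mode = validate_mxfp4_environment(dequantize=True, cuda_available=False)
gpu_mode = validate_mxfp4_environment(dequantize=False, cuda_available=True)
```

With the two checks in the opposite order, the CPU-only call would raise before the dequantize option was ever consulted.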
82eb67e62a [docs] ko toc fix (#39927) 2025-08-06 10:12:34 +00:00
9e76a6bb54 circleci: pin torch 2.7.1 until torchcodec is updated (#39951)
circleci torch 2.7.1

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-06 11:18:00 +02:00
910b319357 Fix CI: Tests failing on CPU due to torch.device('cpu').index being None (#39933)
replace routing_weights.device.index with a
2025-08-06 10:22:43 +02:00
369c99d0ce Avoid utils/check_bad_commit.py failing due to rate limit (requesting api.github.com) (#39918)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-05 21:52:20 +02:00
b771e476a8 [CI] post-GptOss fixes for green CI (#39929) 2025-08-05 20:04:59 +02:00
eb6e26acf3 Dev version 2025-08-05 18:09:30 +02:00
c54203a32e gpt_oss last chat template changes (#39925)
Last chat template changes
2025-08-05 18:08:08 +02:00
7c38d8fc23 Add GPT OSS model from OpenAI (#39923)
* fix

* nice

* where i am at

* Bro this works

* Update src/transformers/integrations/tensor_parallel.py

* cleanups

* yups that was breaking

* Update src/transformers/models/openai_moe/modeling_openai_moe.py

* gather on experts and not mlp

* add changes for latest convert branch

* adds options to get output_router_logits from config

* bring chat template + special tokens back into the script.

* initial commit

* update

* working with shards

* add model.safetensors.index.json

* fix

* fix

* mxfp4 flag

* rm print

* Fix PAD/EOS/BOS (#18)

* fix pad/eos/bos

* base model maybe one day

* add some doc

* special tokens based on harmony.

* add in tokenizer config as well.

* prepare for rebase with main

* Fix for initialize_tensor_parallelism now returning 4-tuple

```
[rank0]:   File "/fsx/edward/work/openai-tsm-examples/examples/generate.py", line 17, in <module>
[rank0]:     model = AutoModelForCausalLM.from_pretrained(
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/fsx/edward/work/new-model-addition-openai/src/transformers/models/auto/auto_factory.py", line 600, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/fsx/edward/work/new-model-addition-openai/src/transformers/modeling_utils.py", line 316, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/fsx/edward/work/new-model-addition-openai/src/transformers/modeling_utils.py", line 4748, in from_pretrained
[rank0]:     tp_plan, device_map, device_mesh = initialize_tensor_parallelism(tp_plan, tp_size=None)
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ValueError: too many values to unpack (expected 3)
```

* mxfp4

* mxfp4 draft

* fix

* fix import

* draft

* draft impl

* finally working !

* simplify

* add import

* working version

* consider blocks and scales

* device mesh fix

* initial commit

* add working dequant + quant logic

* update

* non nan, gibberish output

* working EP + quantization finally !

* start cleaning

* remove reversing process

* style

* some cleaning

* initial commit

* more cleaning

* more cleaning

* simplify

* more cleaning

* rm duplicated function

* changing tp_plan

* update tp plan check

* add loading attribute

* dequantizing logic

* use subfunctions

* import cleaning

* update_param_name

* adds clamped swiglu

* add clamping to training path

* simplify dequant logic

* update

* Bad merge

* more simplifications & tests

* fix !

* fix registering custom attention

* fix order

* fixes

* some test nits

* nits

* nit

* fix

* Clamp sink logits

* Clean

* Soft-max trick

* Clean up

* p

* fix deepspeed

* update both modeling and modular for cleanup

* contiguous

* update tests

* fix top_k router call

* revert renaming

* test nits

* small fixes for EP

* fix path for our local tests

* update as I should not have broken that!

* fix the loss of mixtral

* revert part of the changes related to router_scores, kernel probably not ready for that!

* deleting a small nit

* update arch

* fix post processing

* update

* running version but not expected output

* moving to cuda

* initial commit

* revert

* erroring when loading on cpu

* updates

* del blocks, scales

* fix

* style

* rm comm

* comment

* add comment

* style

* remove duplicated lines

* Fix minor issue with weight_map conversion script

* fix sampling params

* rename to final name

* update pre-final version of template

* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py

* fix batched inference

* serve fixes

* swizzle !

* update final chat template by Matt.

* fix responses; pin oai

* simplify

* Thanks Matt for his tireless efforts!

Co-authored-by: Rocketknight1 <Rocketknight1@users.noreply.github.com>

* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* fix

* Use ROCm kernels from HUB

* Make kernel modes explicit

* update final chat template by Matt. x2

* Thanks Matt for his tireless efforts!

Co-authored-by: Rocketknight1 <Rocketknight1@users.noreply.github.com>

* Fix installation

* Update setup.py

Co-authored-by: Ákos Hadnagy <akos.hadnagy@gmail.com>

* allow no content

* fix: update message handling in write_tokenizer function

* Fix template logic for user message role

* last nits for CB and flash_paged!

* there was one bad merge

* fix CB (hardcode for now, it's just using kv groups instead)

* fix

* better fix for device_map

* minor device fix

* Fix flash paged

* updates

* Revert "remove dtensors, not explicit (#39840)"

This reverts commit 6dfd561d9cd722dfc09f702355518c6d09b9b4e3.

* update

* Revert "remove dtensors, not explicit (#39840)"

This reverts commit 6dfd561d9cd722dfc09f702355518c6d09b9b4e3.

* fix merge

* fix

* Fix line break when custom model identity

* nits testing

* to locals first and pass sliding window to flash paged

* register modes for MegaBlocksMoeMlp

* add integration test in fixtures -> now update the tests to use it!

* update integration tests

* initial fix

* style and update tests

* fix

* chore(gpt oss): remove mlp_bias from configuration

It was just a leftover.

* stats

* Integration tests

* whoops

* Shouldn't move model

* Ensure assistant messages without thinking always go to "final" channel

* More checks to ensure expected format

* Add pad_token_id to model configuration in write_model function (#51)

* Add oai fix fast tests (#59)

* Fix some fast tests

* Force some updates

* Remove unnecessary fixes

* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py

* reasoning -> Reasoning

* Add additional integration tests

* fixup

* Slight fixes

* align chat template with harmony

* simplify

* Add comment

* torch testing assert close

* torch testing assert close

* torch testing assert close

* torch testing assert close

* torch testing assert close

* torch testing assert close

* Revert fixup

* skip 2 test remove todo

* merge

* padding side should be left for integration tests

* fix modular wrt changes made to modeling

* style

* isort

* fix copies for the loss

* mmmm

---------

Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Marc Sun <marc@huggingface.co>
Co-authored-by: edbeeching <edbeeching@gmail.com>
Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>
Co-authored-by: MekkCyber <mekk.cyber@gmail.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan@openai.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: joao@huggingface.co <joao@ip-10-53-88-32.ec2.internal>
Co-authored-by: Rocketknight1 <Rocketknight1@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Akos Hadnagy <akos@ahadnagy.com>
Co-authored-by: Ákos Hadnagy <akos.hadnagy@gmail.com>
Co-authored-by: Alvaro Moran <alvaro.moran@huggingface.co>
Co-authored-by: Lysandre <hi@lysand.re>
Co-authored-by: Matt <rocketknight1@gmail.com>
2025-08-05 18:02:18 +02:00
738c1a3899 🌐 [i18n-KO] Translated cache_explanation.md to Korean (#39535)
* update: _toctree.yml

* docs: ko: cache_explanation.md

* feat: nmt draft

* fix: apply yijun-lee's comments

* fix: apply 4N3MONE's comments

* docs: update cache_position

* docs: update cache-storage-implementation

* update: add h2 tag in cache-position

---------

Co-authored-by: taehyeonjeon <xogus294@gmail.com>
2025-08-05 08:20:13 -07:00
d2ae766836 Export SmolVLM (#39614)
Export SmolVLM for ExecuTorch
2025-08-05 16:20:23 +02:00
c430047602 [docs] update object detection guide (#39909)
* Update object_detection.md

* Update object_detection.md
2025-08-05 14:07:21 +00:00
dedcbd6e3d run model debugging with forward arg (#39905)
* run model debugging a lot simpler

* fixup

* Update src/transformers/utils/generic.py

* fixup

* mode syle?

* guard a bit
2025-08-05 15:46:19 +02:00
20ce210ab7 Revert "remove dtensors, not explicit (#39840)" (#39912)
* Revert "remove dtensors, not explicit (#39840)"
This did not work with generation (lm_head needs extra care!)
This reverts commit 6dfd561d9cd722dfc09f702355518c6d09b9b4e3.

* update

* style?
2025-08-05 15:12:14 +02:00
2589a52c5c Fix aria tests (#39879)
* fix aria tests

* awful bug

* fix copies

* fix tests

* fix style

* revert this
2025-08-05 13:48:47 +02:00
6e4a9a5b43 Fix eval thread fork bomb (#39717) 2025-08-05 10:50:32 +00:00
98a3c49135 Replace video_fps with fps in tests (#39898)
Signed-off-by: cyy <cyyever@outlook.com>
2025-08-05 10:39:55 +00:00
1af1071081 Fix misleading WandB error when WANDB_DISABLED is set (#39891)
When users set `report_to="wandb"` but also have `WANDB_DISABLED=true` in their environment,
the previous error message was misleading: "WandbCallback requires wandb to be installed. Run pip install wandb."

This was confusing because wandb was actually installed, just disabled via the environment variable.

The fix detects this specific case and provides a clear, actionable error message explaining
the conflict and how to resolve it.
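The distinction this commit describes can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper names, the set of values treated as "true", and the wording of the new message are all assumptions made for the example.

```python
import os

# Assumption for this sketch: which strings count as "true" for WANDB_DISABLED.
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def wandb_disabled_by_env(environ=None):
    """Return True when WANDB_DISABLED is set to a truthy value."""
    environ = os.environ if environ is None else environ
    return environ.get("WANDB_DISABLED", "").upper() in _TRUE_VALUES


def explain_wandb_unavailable(installed, environ=None):
    """Pick an error message that matches the real cause (hypothetical helper)."""
    if installed and wandb_disabled_by_env(environ):
        # wandb is present, so telling the user to install it would be misleading.
        return (
            "report_to='wandb' was requested, but WANDB_DISABLED is set. "
            "Unset WANDB_DISABLED or remove 'wandb' from report_to."
        )
    if not installed:
        return "WandbCallback requires wandb to be installed. Run `pip install wandb`."
    return None  # nothing to report: wandb is installed and enabled
```

Passing the environment as a plain dict keeps the check easy to exercise without touching the real process environment.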
2025-08-05 10:18:18 +00:00
78ef84921b Avoid aliasing in cond's branches for torch 2.8 (#39488)
Avoid aliasing in cond's branches

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
2025-08-05 11:18:11 +02:00
9e676e6a0e [qwen] remove unnecessary CUDA sync in qwen2_5_vl (#39870)
Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
2025-08-05 08:54:16 +00:00
392be3b282 fix test_working_of_tp failure of accelerate ut (#39828)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
2025-08-05 08:52:57 +00:00
cc5de36454 [Exaone4] Fixes the attn implementation! (#39906)
* fix

* fix config
2025-08-05 09:29:16 +02:00
00d47757bf Reorder serving docs (#39634)
* Slight reorg

* LLMs + draft VLMs

* Actual VLM examples

* Initial responses

* Reorder

* Update docs/source/en/serving.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update docs/source/en/tiny_agents.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update docs/source/en/open_webui.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update docs/source/en/cursor.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update docs/source/en/serving.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Responses API

* Address Pedro's comments

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2025-08-05 08:43:06 +02:00
8c4ea670dc chore: update DETR model card (#39822)
* Update model card for DETR

* fix: applied suggested changes

* fix: simplified pipeline and modified notes and resources

* Update detr.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-04 12:25:53 -07:00
0bd91cc822 Add support for ModernBertForMultipleChoice (#39232)
* implement ModernBertForMultipleChoice

* fixup, style, repo consistency

* generate modeling_modernbert

* add tests + docs

* fix test
2025-08-04 20:45:43 +02:00
801e869b67 send some feedback when manually building doc via comment (#39889)
* fix

* fix

* fix

* Update .github/workflows/pr_build_doc_with_comment.yml

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
2025-08-04 18:20:48 +00:00
ee7eb2d0b1 Update cohere2 vision test (#39888)
* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-04 20:08:18 +02:00
3bafa128dc [DOCS] : Improved mimi model card (#39824)
* [DOCS] : Improved mimi model card

* Removed additional header

* Review: addressed feedback

* Update mimi.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-08-04 10:07:06 -07:00
192acc2d0f Fix link to models in README (#39880)
Update README.md
2025-08-04 09:34:41 -07:00
7dca2ff8cf [typing] better return type hint for AutoModelForCausalLM and AutoModelForImageTextToText (#39881)
* Better return type hint for  AutoModelForCausalLM and AutoModelForImageTextToText

* fix imports

* fix
2025-08-04 15:03:53 +00:00
3edd14610e Set torch.backends.cudnn.allow_tf32 = False for CI (#39885)
* fix

* fix

* [test all]

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-04 16:55:16 +02:00
e3505cd4dc Replace Tokenizer with PreTrainedTokenizerFast in ContinuousBatchProcessor (#39858)
Replace Tokenizer with PreTrainedTokenizerFast in ContinuousBatchProcessor
2025-08-04 16:39:19 +02:00
380b2a0317 Rework add-new-model-like with modular and make test filenames coherent (#39612)
* remove tf/flax

* fix

* style

* Update add_new_model_like.py

* work in progress

* continue

* more cleanup

* simplify and first final version

* fixes -> it works

* add linter checks

* Update add_new_model_like.py

* fix

* add modular conversion at the end

* Update add_new_model_like.py

* add video processor

* Update add_new_model_like.py

* Update add_new_model_like.py

* Update add_new_model_like.py

* fix

* Update image_processing_auto.py

* Update image_processing_auto.py

* fix post rebase

* start test filenames replacement

* rename all test_processor -> test_processing

* fix copied from

* add docstrings

* Update add_new_model_like.py

* fix regex

* improve wording

* Update add_new_model_like.py

* Update add_new_model_like.py

* Update add_new_model_like.py

* start adding test

* fix

* fix

* proper first test

* tests

* fix

* fix

* fix

* fix

* modular can be used from anywhere

* protect import

* fix

* Update add_new_model_like.py

* fix
2025-08-04 14:41:09 +02:00
5fb5b6cfaf Fix quant docker for fp-quant (#39641)
* fix quant docker

* Apply style fixes

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-08-04 11:57:08 +00:00
16d6faef9a [core] Fix attn_implementation setter with missing sub_configs (#39855)
* fix

* add sub_configs

* remove case for attention setter

* fix None

* Add test

* Fix sub-configs

* fix tests_config

* fix consistency

* fix fsmt

* fix
2025-08-04 11:35:09 +01:00
2a9febd632 Add support for including in-memory videos (not just files/urls) in apply_chat_template (#39494)
* added code for handling a video object, as a dictionary of frames and metadata, in the chat template

* added new test where videos are passed as objects (dict of frames, metadata) in the chat template

* modified hardcoded video_len check that does not match with increased number of tests cases.

* Modify hardcoded video_len check that fails with increased number of tests

* update documentation of multi-modal chat templating with extra information about including video object in chat template.

* add array handling in load_video()

* temporary test video included

* skip testing smolvlm with videos that are list of frames

* update documentation & make fixup

* Address review comments
2025-08-04 11:49:42 +02:00
0d511f7a77 Use comment to build doc on PRs (#39846)
* try

* try

* try

* try

* try

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-08-04 10:24:45 +02:00
4819adbbaa Refactor label name handling for PEFT models in Trainer class (#39265)
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-08-04 06:29:57 +00:00
166fcad3f8 Improve is_wandb_available function to verify WandB installation (#39875)
Improve `is_wandb_available` function to verify WandB installation by checking for a key attribute
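A check of this kind might look like the sketch below. It is an assumption-laden illustration: the commit message does not say which attribute is checked, so the function takes the attribute name as a parameter, and `module_is_usable` is a hypothetical helper rather than the library's API. One plausible motivation is that a stray local directory with the same name can make a bare import succeed even when the real package is absent.

```python
import importlib
import importlib.util


def module_is_usable(name, required_attr):
    """Return True only if `name` imports and exposes `required_attr`."""
    # Cheap check first: is there anything importable under this name at all?
    if importlib.util.find_spec(name) is None:
        return False
    try:
        module = importlib.import_module(name)
    except Exception:
        # A broken installation should count as "not available".
        return False
    # An empty namespace package would import fine but lack the attribute.
    return hasattr(module, required_attr)
```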
2025-08-04 08:22:52 +02:00
6dfd561d9c remove dtensors, not explicit (#39840)
* remove dtensors, not explicit

Co-authored-by: 3outeille <3outeille@users.noreply.github.com>

* style

* fix test

* update

* as we broke saving try to fix

* output layouts should exit

* nit

* devicemesh exists if it was distributed

* use _device_mesh of self

* update

* lol

* fix

* nit

* update

* fix!

* this???

* grumble grumble

* ?

* fuck me

---------

Co-authored-by: 3outeille <3outeille@users.noreply.github.com>
2025-08-01 22:02:47 +02:00
b727c2b20e Allow TrackioCallback to work when pynvml is not installed (#39851)
Allow TrackioCallback to work when pynvml is not installed
2025-08-01 18:57:25 +02:00
1ec0feccdd [image-processing] deprecate plot_keypoint_matching, make visualize_keypoint_matching as a standard (#39830)
* fix: deprecate plot_keypoint_matching and make visualize_keypoint_matching for all Keypoint Matching models

* refactor: added copied from

* fix: make style

* fix: repo consistency

* fix: make style

* docs: added missing method in SuperGlue docs
2025-08-01 16:29:57 +00:00
7b4d9843ba Add fast image processor Janus, Deepseek VL, Deepseek VL hybrid (#39739)
* add fast image processor Janus, deepseek_vl, deepseek_vl_hybrid

* fix after review
2025-08-01 12:20:08 -04:00
88ead3f518 Fix responses add tests (#39848)
* Quick responses fix

* [serve] Fix responses API and add tests

* Remove typo

* Remove typo

* Tests
2025-08-01 18:06:08 +02:00
6ea646a03a Update ux cb (#39845)
* clenaup

* nits

* updates

* fix logging

* push updates?

* just passexception

* update

* nits

* fix

* add tokencount

* style
2025-08-01 16:50:28 +02:00
3951d4ad5d Add MM Grounding DINO (#37925)
* first commit

Added modular implementation for MM Grounding DINO from starting point created by add-new-model-like. Added conversion script from mmdetection to huggingface.

TODO: Some tests are failing so that needs to be fixed.

* fixed a bug with modular definition of MMGroundingDinoForObjectDetection where box and class heads were not correctly assigned to inner model

* cleaned up a hack in the conversion script

* Fixed the expected values in integration tests

Cross att masking and cpu-gpu consistency tests are still failing however.

* changes for make style and quality

* add documentation

* clean up contrastive embedding

* add mm grounding dino to loss mapping

* add model link to config docstring

* hack fix for mm grounding dino consistency tests

* add special cases for unused config attr check

* add all models and update docs

* update model doc to the new style

* Use super_kwargs for modular config

* Move init to the _init_weights function

* Add copied from for tests

* fixup

* update typehints

* Fix-copies for tests

* fix-copies

* Fix init test

* fix snippets in docs

* fix consistency

* fix consistency

* update conversion script

* fix nits in readme and remove old comments from conversion script

* add license

* remove unused config args

* remove unnecessary if/else in model init

* fix quality

* Update references

* fix test

* fixup

---------

Co-authored-by: qubvel <qubvel@gmail.com>
2025-08-01 15:43:23 +01:00
50145474b7 [typecheck] proper export of private symbols (#39729)
* Export private symbols

Signed-off-by: cyy <cyyever@outlook.com>

* Update src/transformers/__init__.py

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>

* Update src/transformers/__init__.py

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>

* Fix format

Signed-off-by: cyy <cyyever@outlook.com>

* Add a comment for exported symbols

Signed-off-by: cyy <cyyever@outlook.com>

---------

Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
2025-08-01 13:36:47 +01:00
c962f1515e [attn_implementation] remove recursive, allows custom kernels with wrappers (#39823)
* fix?

* fixme and style

* Update src/transformers/modeling_utils.py

* update

* update

* fix

* small fixes

* nit

* nits

* fix init check?

* fix

* fix default

* or fucks me

* nits

* include a small nit

* does this make it happy?

* fixup

* fix the remaining ones
2025-08-01 12:18:28 +02:00
d3b8627b56 [VLMs] split out "get placeholder mask" to helper (#39777)
* batch update all models

* update

* forgot about llava onevision

* update

* fix tests

* delete file

* typo

* fix emu3 once and forever

* update cohere2 vision as well
2025-08-01 08:01:06 +00:00
a115b67392 Fix tp cb (#39838)
* fixes

* one more
2025-08-01 09:59:04 +02:00
2c0af41ce5 Fix bad markdown links (#39819)
Fix bad markdown links.
2025-07-31 09:14:14 -07:00
4fcf455517 Fix broken links (#39809)
Replace links of the form `[text]((url))` with `[text](url)`, which is the
correct Markdown link syntax.
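The rewrite described here could be automated roughly as follows. This is a sketch under stated assumptions, not the script used for the commit (none is shown): the function name is hypothetical, and the pattern deliberately ignores URLs that contain whitespace or parentheses.

```python
import re

# Matches [text]((url)) where the URL has no spaces or parentheses.
_DOUBLE_PAREN_LINK = re.compile(r"\[([^\]]+)\]\(\(([^()\s]+)\)\)")


def fix_double_paren_links(markdown):
    """Rewrite `[text]((url))` as `[text](url)`; leave everything else untouched."""
    return _DOUBLE_PAREN_LINK.sub(r"[\1](\2)", markdown)
```

Links that are already well formed do not match the pattern, so running the function twice gives the same result as running it once.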
2025-07-31 13:23:04 +00:00
b937d47455 [cohere2 vision] move doc to multimodal section (#39820)
move doc to multimodal section
2025-07-31 15:13:02 +02:00
6ba8a1ff45 Update documentation for Cohere2Vision models (#39817)
* Update docs with pipeline example

* Add Cohere2Vision to list of vision models

* Sort models
2025-07-31 11:58:45 +00:00
e1688d28d3 [Model] Cohere2 Vision (#39810)
* Add cohere2_vision to support CohereLabs/command-a-vision-07-2025

* update and add modular file

* update processors and check with orig impl later

* delete unused files

* image processor reduce LOC and re-use GotOCR2

* update the config to use modular

* model tests pass

* processor fixes

* check model outputs decorator

* address one more comment

* Update tokens. Temp - need to read from tokenizer

* fix for multi-gpu

* Fix image token handling

* update image token expansion logic

* fix a few issues with remote code loading

* not related but modular forces us to change all files now

* Add overview and code sample to cohere vision docs

* add scripts. TMP.

* Update inference script

* Create script

* set dtype in export script

* TO revert: modular export fix

* Fix scripts

* Revert "TO revert: modular export fix"

This reverts commit bdb2f305b61027a05f0032ce70d6ca698879191c.

* Use modular weights

* Upload to hub

Removed OOD weights ad script

* Updated docs

* fix import error

Update docs

Added pipeline test

* Updated docs

* Run modular script

remove modular for config

Added patch_size

Added docstrings in modular

Fix OOM

Add docs, fixup integration tests. 8-gpu passing

* tiny updates

* address comments + fixup

* add test for chat template

* check model outputs workaround

* aya vision fix check model inputs

* Revert "add test for chat template"

This reverts commit 42c756e397f588d76b449ff1f93292d8ee0202d8.

* revert more changes

* last revert

* skip and merge

* faulty copy from

---------

Co-authored-by: Julian Mack <julian.mack@cohere.com>
Co-authored-by: kyle-cohere <kyle@cohere.com>
2025-07-31 10:57:34 +00:00
6c3f27ba61 [docs] fix korean docs yet again (#39813)
fix korean docs yet again
2025-07-31 09:13:25 +00:00
cb289ad243 feat(tokenization): add encode_message to tokenize messages one by one (#39507)
* feat(tokenization): add encode_message to tokenize messages one by one

* Fix the `encode_message` method, remove the `add_generation_prompt` parameter and add the corresponding error handling. Update the document to reflect this change and verify the error handling in the test.

* Optimize the `encode_message` method, improve the processing logic of the empty dialogue history, and ensure that the chat template can be applied correctly when the dialogue history is empty. Update the document to reflect these changes.

* The `_encode_message` method is deleted, the message coding logic is simplified, and the functional integrity of the `encode_message` method is ensured. Update the document to reflect these changes.

* Docs fix

* Revert changes in docstring of pad()

* Revert changes in docstring

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Repair the call of the `encode_message` method, update it to `encode_message_with_chat_template` to support the chat template, and adjust the relevant test cases to reflect this change.

* Optimize the call format of the `apply_chat_template` method, and merge multi-line calls into a single line to improve code readability.

---------

Co-authored-by: pco111 <15262555+pco111@user.noreply.gitee.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-07-31 10:55:45 +02:00
4f93cc9174 fix: providing a tensor to cache_position in model.generate kwargs always crashes because of boolean test (#39300)
* fix: cache_position: RuntimeError: Boolean value of Tensor with more than one value is ambiguous

* test cache_position

* move test

* propagate changes

---------

Co-authored-by: Masataro Asai <guicho2.71828@gmail.com>
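The failure mode named in this commit comes from testing a multi-element tensor for truthiness. The sketch below reproduces it without depending on PyTorch: `FakeTensor` is a stand-in written for this example that only mimics the relevant behaviour, and `resolve_cache_position` is a hypothetical helper, not the function changed in the PR.

```python
class FakeTensor:
    """Stand-in that mimics how a multi-element tensor refuses truthiness."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        if len(self.values) != 1:
            raise RuntimeError(
                "Boolean value of Tensor with more than one value is ambiguous"
            )
        return bool(self.values[0])


def resolve_cache_position(cache_position=None, default=(0,)):
    """Use the caller's value when given, otherwise fall back to a default."""
    # Writing `if cache_position:` here would call __bool__ and raise for any
    # tensor with more than one element; an identity check against None does not.
    if cache_position is not None:
        return cache_position
    return FakeTensor(default)
```

The general pattern is to compare optional tensor arguments with `is None` / `is not None` rather than relying on truthiness.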
2025-07-30 17:30:28 +00:00
9b3203f47b Add callback to monitor progress in whisper transcription (#37483)
* Add callback to monitor progress in whisper transcription

* Added `` around variables, rewording

* Add example of `monitor_progress`.

---------

Co-authored-by: Eric B <ebezzam@gmail.com>
2025-07-30 17:40:53 +02:00
7abb5d3992 Update mT5 model card (#39702)
* Update mt5 model card

* Fix casing of model title

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-07-30 08:35:04 -07:00
1019b00028 Update model card for Cohere2 (Command R7B) (#39604)
* Update model card for Cohere2 (Command R7B)

* fix: applied suggested changes
2025-07-30 08:34:26 -07:00
ecbb5ee194 standardized BARThez model card (#39701)
* standardized barthez model card according to template

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/barthez.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* suggested changes to barthez model card

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-07-30 08:33:13 -07:00
2240 changed files with 91309 additions and 115217 deletions

View File

@ -109,7 +109,9 @@ class CircleCIJob:
self.docker_image[0]["image"] = f"{self.docker_image[0]['image']}:dev"
print(f"Using {self.docker_image} docker image")
if self.install_steps is None:
self.install_steps = ["uv venv && uv pip install ."]
self.install_steps = ["uv pip install ."]
# Use a custom patched pytest to force exit the process at the end, to avoid `Too long with no output (exceeded 10m0s): context deadline exceeded`
self.install_steps.append("uv pip install git+https://github.com/ydshieh/pytest.git@8.4.1-ydshieh")
if self.pytest_options is None:
self.pytest_options = {}
if isinstance(self.tests_to_run, str):
@ -213,7 +215,7 @@ generate_job = CircleCIJob(
docker_image=[{"image": "huggingface/transformers-torch-light"}],
# networkx==3.3 (after #36957) causes some issues
# TODO: remove this once it works directly
install_steps=["uv venv && uv pip install ."],
install_steps=["uv pip install ."],
marker="generate",
parallelism=6,
)
@ -250,7 +252,7 @@ examples_torch_job = CircleCIJob(
additional_env={"OMP_NUM_THREADS": 8},
docker_image=[{"image":"huggingface/transformers-examples-torch"}],
# TODO @ArthurZucker remove this once docker is easier to build
install_steps=["uv venv && uv pip install . && uv pip install -r examples/pytorch/_tests_requirements.txt"],
install_steps=["uv pip install . && uv pip install -r examples/pytorch/_tests_requirements.txt"],
pytest_num_workers=4,
)
@ -259,7 +261,7 @@ hub_job = CircleCIJob(
additional_env={"HUGGINGFACE_CO_STAGING": True},
docker_image=[{"image":"huggingface/transformers-torch-light"}],
install_steps=[
'uv venv && uv pip install .',
'uv pip install .',
'git config --global user.email "ci@dummy.com"',
'git config --global user.name "ci"',
],
@ -273,7 +275,6 @@ onnx_job = CircleCIJob(
"onnx",
docker_image=[{"image":"huggingface/transformers-torch-tf-light"}],
install_steps=[
"uv venv",
"uv pip install .[testing,sentencepiece,onnxruntime,vision,rjieba]",
],
pytest_options={"k onnx": None},
@ -303,7 +304,7 @@ non_model_job = CircleCIJob(
docker_image=[{"image": "huggingface/transformers-torch-light"}],
# networkx==3.3 (after #36957) causes some issues
# TODO: remove this once it works directly
install_steps=["uv venv && uv pip install .[serving]"],
install_steps=["uv pip install .[serving]"],
marker="not generate",
parallelism=6,
)
@ -321,7 +322,7 @@ doc_test_job = CircleCIJob(
additional_env={"TRANSFORMERS_VERBOSITY": "error", "DATASETS_VERBOSITY": "error", "SKIP_CUDA_DOCTEST": "1"},
install_steps=[
# Add an empty file to keep the test step running correctly even if no file is selected to be tested.
"uv venv && pip install .",
"uv pip install .",
"touch dummy.py",
command,
"cat pr_documentation_tests_temp.txt",

View File

@ -48,7 +48,7 @@ jobs:
- name: Run database init script
run: |
psql -f benchmark/init_db.sql
psql -f benchmark/utils/init_db.sql
env:
PGDATABASE: metrics
PGHOST: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGHOST }}

View File

@ -21,6 +21,9 @@ on:
report_repo_id:
required: true
type: string
commit_sha:
required: false
type: string
env:
@ -87,7 +90,7 @@ jobs:
- name: Update clone
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Get target commit
working-directory: /transformers/utils

.github/workflows/collated-reports.yml (new file)
View File

@ -0,0 +1,43 @@
name: CI collated reports
on:
workflow_call:
inputs:
job:
required: true
type: string
report_repo_id:
required: true
type: string
machine_type:
required: true
type: string
gpu_name:
description: Name of the GPU used for the job. It's enough that the value contains the name of the GPU, e.g. "noise-h100-more-noise". Case insensitive.
required: true
type: string
jobs:
collated_reports:
name: Collated reports
runs-on: ubuntu-22.04
if: always()
steps:
- uses: actions/checkout@v4
- uses: actions/download-artifact@v4
- name: Collated reports
shell: bash
env:
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
CI_SHA: ${{ github.sha }}
TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
run: |
pip install huggingface_hub
python3 utils/collated_reports.py \
--path . \
--machine-type ${{ inputs.machine_type }} \
--commit-hash ${{ env.CI_SHA }} \
--job ${{ inputs.job }} \
--report-repo-id ${{ inputs.report_repo_id }} \
--gpu-name ${{ inputs.gpu_name }}
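The `gpu_name` input above is described as a case-insensitive "contains" match. How `utils/collated_reports.py` implements that is not shown here, so the following is only a sketch of what such matching could look like; the list of known GPU names and the function name are hypothetical.

```python
# Hypothetical list of canonical GPU names, assumed for this sketch.
KNOWN_GPUS = ("h100", "a100", "mi300")


def normalize_gpu_name(raw, known=KNOWN_GPUS):
    """Return the first known GPU name contained in `raw`, ignoring case."""
    lowered = raw.lower()
    for name in known:
        if name in lowered:
            return name
    return None  # no known GPU name found in the input
```

With this approach a runner label such as "noise-h100-more-noise" maps to the canonical name without the caller having to clean it up first.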

View File

@ -18,6 +18,9 @@ on:
docker:
required: true
type: string
commit_sha:
required: false
type: string
report_name_prefix:
required: false
default: run_models_gpu
@ -70,7 +73,7 @@ jobs:
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers

View File

@ -0,0 +1,134 @@
name: PR - build doc via comment
on:
issue_comment:
types:
- created
branches-ignore:
- main
concurrency:
group: ${{ github.workflow }}-${{ github.event.issue.number }}-${{ startsWith(github.event.comment.body, 'build-doc') }}
cancel-in-progress: true
permissions: {}
jobs:
get-pr-number:
name: Get PR number
if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "qubvel", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "muellerzr", "eustlb", "MekkCyber", "manueldeprada", "vasqu", "ivarflakstad", "stevhliu", "ebezzam"]'), github.actor) && (startsWith(github.event.comment.body, 'build-doc')) }}
uses: ./.github/workflows/get-pr-number.yml
get-pr-info:
name: Get PR commit SHA
needs: get-pr-number
if: ${{ needs.get-pr-number.outputs.PR_NUMBER != ''}}
uses: ./.github/workflows/get-pr-info.yml
with:
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
verity_pr_commit:
name: Verify that the PR commit corresponds to a specific event by comparing timestamps
if: ${{ needs.get-pr-number.outputs.PR_NUMBER != ''}}
runs-on: ubuntu-22.04
needs: get-pr-info
env:
COMMENT_DATE: ${{ github.event.comment.created_at }}
PR_MERGE_COMMIT_DATE: ${{ needs.get-pr-info.outputs.PR_MERGE_COMMIT_DATE }}
PR_MERGE_COMMIT_TIMESTAMP: ${{ needs.get-pr-info.outputs.PR_MERGE_COMMIT_TIMESTAMP }}
steps:
- run: |
COMMENT_TIMESTAMP=$(date -d "${COMMENT_DATE}" +"%s")
echo "COMMENT_DATE: $COMMENT_DATE"
echo "PR_MERGE_COMMIT_DATE: $PR_MERGE_COMMIT_DATE"
echo "COMMENT_TIMESTAMP: $COMMENT_TIMESTAMP"
echo "PR_MERGE_COMMIT_TIMESTAMP: $PR_MERGE_COMMIT_TIMESTAMP"
if [ $COMMENT_TIMESTAMP -le $PR_MERGE_COMMIT_TIMESTAMP ]; then
echo "Last commit on the pull request is newer than the issue comment triggering this run! Abort!";
exit -1;
fi
create_run:
name: Create run
needs: [get-pr-number, get-pr-info]
if: ${{ needs.get-pr-number.outputs.PR_NUMBER != '' }}
permissions:
statuses: write
runs-on: ubuntu-22.04
steps:
- name: Create Run
id: create_run
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
# Create a commit status (pending) for a run of this workflow. The status has to be updated later in `update_run_status`.
# See https://docs.github.com/en/rest/commits/statuses?apiVersion=2022-11-28#create-a-commit-status
GITHUB_RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
run: |
gh api \
--method POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
repos/${{ github.repository }}/statuses/${{ needs.get-pr-info.outputs.PR_HEAD_SHA }} \
-f "target_url=$GITHUB_RUN_URL" -f "state=pending" -f "description=Custom doc building job" -f "context=custom-doc-build"
reply_to_comment:
name: Reply to the comment
if: ${{ needs.create_run.result == 'success' }}
needs: [get-pr-number, create_run]
permissions:
pull-requests: write
runs-on: ubuntu-22.04
steps:
- name: Reply to the comment
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
run: |
gh api \
--method POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
repos/${{ github.repository }}/issues/${{ needs.get-pr-number.outputs.PR_NUMBER }}/comments \
-f "body=[Building docs for all languages...](${{ env.GITHUB_RUN_URL }})"
build-doc:
name: Build doc
needs: [get-pr-number, get-pr-info]
if: ${{ needs.get-pr-number.outputs.PR_NUMBER != '' }}
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
with:
commit_sha: ${{ needs.get-pr-info.outputs.PR_HEAD_SHA }}
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
package: transformers
languages: ar de en es fr hi it ko pt tr zh ja te
update_run_status:
name: Update Check Run Status
needs: [ get-pr-info, create_run, build-doc ]
permissions:
statuses: write
if: ${{ always() && needs.create_run.result == 'success' }}
runs-on: ubuntu-22.04
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
STATUS_OK: ${{ contains(fromJSON('["skipped", "success"]'), needs.build-doc.result) }}
steps:
- name: Get `build-doc` job status
run: |
echo "${{ needs.build-doc.result }}"
echo $STATUS_OK
if [ "$STATUS_OK" = "true" ]; then
echo "STATUS=success" >> $GITHUB_ENV
else
echo "STATUS=failure" >> $GITHUB_ENV
fi
- name: Update PR commit statuses
run: |
echo "${{ needs.build-doc.result }}"
echo "${{ env.STATUS }}"
gh api \
--method POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
repos/${{ github.repository }}/statuses/${{ needs.get-pr-info.outputs.PR_HEAD_SHA }} \
-f "target_url=$GITHUB_RUN_URL" -f "state=${{ env.STATUS }}" -f "description=Custom doc building job" -f "context=custom-doc-build"
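The timestamp guard in the `verity_pr_commit` job of this workflow aborts when the triggering comment is not strictly newer than the PR's latest commit, so that a commit pushed after the comment cannot be built under the commenter's approval. The same comparison can be sketched in Python; the function name is hypothetical, and the inputs are assumed to be an ISO 8601 comment date and a Unix timestamp, matching what the shell step receives.

```python
from datetime import datetime, timezone


def comment_is_newer_than_commit(comment_date_iso, commit_timestamp):
    """Return True when the comment was created strictly after the commit."""
    # GitHub reports dates like "2025-08-04T10:00:00Z"; normalise the trailing
    # "Z" so that datetime.fromisoformat accepts it on older Python versions.
    comment_dt = datetime.fromisoformat(comment_date_iso.replace("Z", "+00:00"))
    if comment_dt.tzinfo is None:
        # Assumption for this sketch: a naive timestamp is treated as UTC.
        comment_dt = comment_dt.replace(tzinfo=timezone.utc)
    return int(comment_dt.timestamp()) > commit_timestamp
```

A run would proceed only when this returns True, which mirrors the shell test `[ $COMMENT_TIMESTAMP -le $PR_MERGE_COMMIT_TIMESTAMP ]` being false.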

View File

@ -16,28 +16,6 @@ jobs:
with:
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
# We only need to verify the timestamp if the workflow is triggered by `issue_comment`.
verity_pr_commit:
name: Verify that the PR commit corresponds to a specific event by comparing timestamps
if: ${{ github.event.comment.created_at != '' }}
runs-on: ubuntu-22.04
needs: get-pr-info
env:
COMMENT_DATE: ${{ github.event.comment.created_at }}
PR_MERGE_COMMIT_DATE: ${{ needs.get-pr-info.outputs.PR_MERGE_COMMIT_DATE }}
PR_MERGE_COMMIT_TIMESTAMP: ${{ needs.get-pr-info.outputs.PR_MERGE_COMMIT_TIMESTAMP }}
steps:
- run: |
COMMENT_TIMESTAMP=$(date -d "${COMMENT_DATE}" +"%s")
echo "COMMENT_DATE: $COMMENT_DATE"
echo "PR_MERGE_COMMIT_DATE: $PR_MERGE_COMMIT_DATE"
echo "COMMENT_TIMESTAMP: $COMMENT_TIMESTAMP"
echo "PR_MERGE_COMMIT_TIMESTAMP: $PR_MERGE_COMMIT_TIMESTAMP"
if [ $COMMENT_TIMESTAMP -le $PR_MERGE_COMMIT_TIMESTAMP ]; then
echo "Last commit on the pull request is newer than the issue comment triggering this run! Abort!";
exit -1;
fi
get-jobs:
name: Get test files to run
runs-on: ubuntu-22.04

View File

@ -4,17 +4,6 @@ on:
push:
branches: [ main ]
env:
OUTPUT_SLACK_CHANNEL_ID: "C06L2SGMEEA"
HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
HF_HOME: /mnt/cache
TRANSFORMERS_IS_CI: yes
OMP_NUM_THREADS: 8
MKL_NUM_THREADS: 8
RUN_SLOW: yes
# For gated repositories, we still need to agree to share information on the Hub repo. page in order to get access.
# This token is created under the bot `hf-transformers-bot`.
SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }}
TF_FORCE_GPU_ALLOW_GROWTH: true
jobs:
get_modified_models:
name: "Get all modified files"
@ -25,111 +14,144 @@ jobs:
- name: Check out code
uses: actions/checkout@v4
- name: Get changed files
id: changed-files
uses: tj-actions/changed-files@1c8e6069583811afb28f97afeaf8e7da80c6be5c
- name: Get changed files using `actions/github-script`
id: get-changed-files
uses: actions/github-script@v7
with:
files: src/transformers/models/**
script: |
let files = [];
// Only handle push events
if (context.eventName === 'push') {
const afterSha = context.payload.after;
const branchName = context.payload.ref.replace('refs/heads/', '');
let baseSha;
if (branchName === 'main') {
console.log('Push to main branch, comparing to parent commit');
// Get the parent commit of the pushed commit
const { data: commit } = await github.rest.repos.getCommit({
owner: context.repo.owner,
repo: context.repo.repo,
ref: afterSha
});
baseSha = commit.parents[0]?.sha;
if (!baseSha) {
throw new Error('No parent commit found for the pushed commit');
}
} else {
console.log(`Push to branch ${branchName}, comparing to main`);
baseSha = 'main';
}
const { data: comparison } = await github.rest.repos.compareCommits({
owner: context.repo.owner,
repo: context.repo.repo,
base: baseSha,
head: afterSha
});
// Include added, modified, and renamed files
files = comparison.files
.filter(file => file.status === 'added' || file.status === 'modified' || file.status === 'renamed')
.map(file => file.filename);
}
// Include all files under src/transformers/ (not just models subdirectory)
const filteredFiles = files.filter(file =>
file.startsWith('src/transformers/')
);
core.setOutput('changed_files', filteredFiles.join(' '));
core.setOutput('any_changed', filteredFiles.length > 0 ? 'true' : 'false');
- name: Run step if only the files listed above change
if: steps.changed-files.outputs.any_changed == 'true'
id: set-matrix
- name: Parse changed files with Python
if: steps.get-changed-files.outputs.any_changed == 'true'
env:
ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
CHANGED_FILES: ${{ steps.get-changed-files.outputs.changed_files }}
id: set-matrix
run: |
model_arrays=()
for file in $ALL_CHANGED_FILES; do
model_path="${file#*models/}"
model_path="models/${model_path%%/*}"
if grep -qFx "$model_path" utils/important_models.txt; then
# Append the file to the matrix string
model_arrays+=("$model_path")
fi
done
matrix_string=$(printf '"%s", ' "${model_arrays[@]}" | sed 's/, $//')
echo "matrix=[$matrix_string]" >> $GITHUB_OUTPUT
test_modified_files:
python3 - << 'EOF'
import os
import sys
import json
# Add the utils directory to Python path
sys.path.insert(0, 'utils')
# Import the important models list
from important_files import IMPORTANT_MODELS
print(f"Important models: {IMPORTANT_MODELS}")
# Get the changed files from the previous step
changed_files_str = os.environ.get('CHANGED_FILES', '')
changed_files = changed_files_str.split() if changed_files_str else []
# Filter to only Python files
python_files = [f for f in changed_files if f.endswith('.py')]
print(f"Python files changed: {python_files}")
result_models = set()
# Specific files that trigger all models
transformers_utils_files = [
'modeling_utils.py',
'modeling_rope_utils.py',
'modeling_flash_attention_utils.py',
'modeling_attn_mask_utils.py',
'cache_utils.py',
'masking_utils.py',
'pytorch_utils.py'
]
# Single loop through all Python files
for file in python_files:
# Check for files under src/transformers/models/
if file.startswith('src/transformers/models/'):
remaining_path = file[len('src/transformers/models/'):]
if '/' in remaining_path:
model_dir = remaining_path.split('/')[0]
if model_dir in IMPORTANT_MODELS:
result_models.add(model_dir)
print(f"Added model directory: {model_dir}")
# Check for specific files under src/transformers/ or src/transformers/generation/ files
elif file.startswith('src/transformers/generation/') or \
(file.startswith('src/transformers/') and os.path.basename(file) in transformers_utils_files):
print(f"Found core file: {file} - including all important models")
result_models.update(IMPORTANT_MODELS)
break # No need to continue once we include all models
# Convert to sorted list and create matrix
result_list = sorted(list(result_models))
print(f"Final model list: {result_list}")
if result_list:
matrix_json = json.dumps(result_list)
print(f"matrix={matrix_json}")
# Write to GITHUB_OUTPUT
with open(os.environ['GITHUB_OUTPUT'], 'a') as f:
f.write(f"matrix={matrix_json}\n")
else:
print("matrix=[]")
with open(os.environ['GITHUB_OUTPUT'], 'a') as f:
f.write("matrix=[]\n")
EOF
model-ci:
name: Model CI
uses: ./.github/workflows/self-scheduled.yml
needs: get_modified_models
name: Slow & FA2 tests
runs-on:
group: aws-g5-4xlarge-cache
container:
image: huggingface/transformers-all-latest-gpu
options: --gpus all --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
if: ${{ needs.get_modified_models.outputs.matrix != '[]' && needs.get_modified_models.outputs.matrix != '' && fromJson(needs.get_modified_models.outputs.matrix)[0] != null }}
strategy:
fail-fast: false
matrix:
model-name: ${{ fromJson(needs.get_modified_models.outputs.matrix) }}
steps:
- name: Check out code
uses: actions/checkout@v4
- name: Install locally transformers & other libs
run: |
apt install sudo
sudo -H pip install --upgrade pip
sudo -H pip uninstall -y transformers
sudo -H pip install -U -e ".[testing]"
MAX_JOBS=4 pip install flash-attn --no-build-isolation
pip install bitsandbytes
- name: NVIDIA-SMI
run: |
nvidia-smi
- name: Show installed libraries and their versions
run: pip freeze
- name: Run FA2 tests
id: run_fa2_tests
run:
pytest -rsfE -m "flash_attn_test" --make-reports=${{ matrix.model-name }}_fa2_tests/ tests/${{ matrix.model-name }}/test_modeling_*
- name: "Test suite reports artifacts: ${{ matrix.model-name }}_fa2_tests"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ matrix.model-name }}_fa2_tests
path: /transformers/reports/${{ matrix.model-name }}_fa2_tests
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.OUTPUT_SLACK_CHANNEL_ID }}
title: 🤗 Results of the FA2 tests - ${{ matrix.model-name }}
status: ${{ steps.run_fa2_tests.conclusion}}
slack_token: ${{ secrets.CI_SLACK_BOT_TOKEN }}
- name: Run integration tests
id: run_integration_tests
if: always()
run:
pytest -rsfE -k "IntegrationTest" --make-reports=tests_integration_${{ matrix.model-name }} tests/${{ matrix.model-name }}/test_modeling_*
- name: "Test suite reports artifacts: tests_integration_${{ matrix.model-name }}"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: tests_integration_${{ matrix.model-name }}
path: /transformers/reports/tests_integration_${{ matrix.model-name }}
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.OUTPUT_SLACK_CHANNEL_ID }}
title: 🤗 Results of the Integration tests - ${{ matrix.model-name }}
status: ${{ steps.run_integration_tests.conclusion}}
slack_token: ${{ secrets.CI_SLACK_BOT_TOKEN }}
- name: Tailscale # In order to be able to SSH when a test fails
if: ${{ runner.debug == '1'}}
uses: huggingface/tailscale-action@v1
with:
authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }}
slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
waitForSSH: true
if: needs.get_modified_models.outputs.matrix != '' && needs.get_modified_models.outputs.matrix != '[]'
with:
job: run_models_gpu
slack_report_channel: "#transformers-ci-push"
docker: huggingface/transformers-all-latest-gpu
ci_event: push
report_repo_id: hf-internal-testing/transformers_ci_push
commit_sha: ${{ github.sha }}
models: ${{ needs.get_modified_models.outputs.matrix }}
secrets: inherit
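Note on the new selection step in this workflow: it can be read as a pure function from changed paths to a model list. The sketch below restates that logic outside of CI so it can be checked locally. The function name `select_models` and the sample model names are assumptions for illustration; the real list comes from `utils/important_files.py`, which is not shown in this diff.

```python
# Core files whose modification triggers every important model
# (copied from the list in the workflow step).
CORE_FILES = {
    "modeling_utils.py",
    "modeling_rope_utils.py",
    "modeling_flash_attention_utils.py",
    "modeling_attn_mask_utils.py",
    "cache_utils.py",
    "masking_utils.py",
    "pytorch_utils.py",
}

MODELS_PREFIX = "src/transformers/models/"


def select_models(changed_files, important_models):
    """Return the sorted list of important models affected by a push."""
    selected = set()
    for path in changed_files:
        if not path.endswith(".py"):
            continue  # only Python files are considered
        if path.startswith(MODELS_PREFIX):
            rest = path[len(MODELS_PREFIX):]
            if "/" in rest:
                model_dir = rest.split("/")[0]
                if model_dir in important_models:
                    selected.add(model_dir)
        elif path.startswith("src/transformers/generation/") or (
            path.startswith("src/transformers/")
            and path.rsplit("/", 1)[-1] in CORE_FILES
        ):
            # A core file changed: run every important model.
            return sorted(set(important_models))
    return sorted(selected)


if __name__ == "__main__":
    important = ["llama", "bert"]  # hypothetical sample, not the real list
    print(select_models(["src/transformers/models/llama/modeling_llama.py"], important))
    print(select_models(["src/transformers/cache_utils.py"], important))
```

Returning early on a core file is equivalent to the workflow's `update` followed by `break`, because any model already selected is itself in the important list.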

View File

@ -1,43 +1,54 @@
name: Self-hosted runner (nightly-ci)
name: Nvidia CI with nightly torch
on:
repository_dispatch:
schedule:
- cron: "17 2 * * *"
# triggered when the daily scheduled Nvidia CI is completed.
# This way, we can compare the results more easily.
workflow_run:
workflows: ["Nvidia CI"]
branches: ["main"]
types: [completed]
push:
branches:
- run_nightly_ci*
- run_ci_with_nightly_torch*
# Used for `push` to easily modify the target workflow runs to compare against
env:
prev_workflow_run_id: ""
other_workflow_run_id: ""
jobs:
build_nightly_ci_images:
name: Build Nightly CI Docker Images
if: (github.event_name == 'schedule') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_nightly_ci'))
build_nightly_torch_ci_images:
name: Build CI Docker Images with nightly torch
uses: ./.github/workflows/build-nightly-ci-docker-images.yml
secrets: inherit
setup:
name: Setup
runs-on: ubuntu-22.04
steps:
- name: Setup
run: |
mkdir "setup_values"
echo "${{ inputs.prev_workflow_run_id || env.prev_workflow_run_id }}" > "setup_values/prev_workflow_run_id.txt"
echo "${{ inputs.other_workflow_run_id || env.other_workflow_run_id }}" > "setup_values/other_workflow_run_id.txt"
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: setup_values
path: setup_values
model-ci:
name: Model CI
needs: [build_nightly_ci_images]
needs: build_nightly_torch_ci_images
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_models_gpu
slack_report_channel: "#transformers-ci-past-future"
runner: ci
docker: huggingface/transformers-all-latest-torch-nightly-gpu
ci_event: Nightly CI
secrets: inherit
deepspeed-ci:
name: DeepSpeed CI
needs: [build_nightly_ci_images]
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_torch_cuda_extensions_gpu
slack_report_channel: "#transformers-ci-past-future"
runner: ci
# test deepspeed nightly build with the latest release torch
docker: huggingface/transformers-pytorch-deepspeed-latest-gpu
ci_event: Nightly CI
working-directory-prefix: /workspace
report_repo_id: hf-internal-testing/transformers_daily_ci_with_torch_nightly
commit_sha: ${{ github.event.workflow_run.head_sha || github.sha }}
secrets: inherit

View File

@ -1,25 +0,0 @@
name: Self-hosted runner (AMD mi300 CI caller)
on:
#workflow_run:
# workflows: ["Self-hosted runner (push-caller)"]
# branches: ["main"]
# types: [completed]
push:
branches:
- run_amd_push_ci_caller*
paths:
- "src/**"
- "tests/**"
- ".github/**"
- "templates/**"
- "utils/**"
jobs:
run_amd_ci:
name: AMD mi300
if: (cancelled() != true) && ((github.event_name == 'workflow_run') || ((github.event_name == 'push') && (startsWith(github.ref_name, 'run_amd_push_ci_caller') || startsWith(github.ref_name, 'mi300-ci'))))
uses: ./.github/workflows/self-push-amd.yml
with:
gpu_flavor: mi300
secrets: inherit

View File

@ -24,6 +24,7 @@ jobs:
docker: huggingface/transformers-pytorch-amd-gpu
ci_event: Scheduled CI (AMD) - mi325
report_repo_id: optimum-amd/transformers_daily_ci
env_file: /etc/podinfo/gha-gpu-isolation-settings
secrets: inherit
torch-pipeline:
@ -36,6 +37,7 @@ jobs:
docker: huggingface/transformers-pytorch-amd-gpu
ci_event: Scheduled CI (AMD) - mi325
report_repo_id: optimum-amd/transformers_daily_ci
env_file: /etc/podinfo/gha-gpu-isolation-settings
secrets: inherit
example-ci:
@ -48,6 +50,7 @@ jobs:
docker: huggingface/transformers-pytorch-amd-gpu
ci_event: Scheduled CI (AMD) - mi325
report_repo_id: optimum-amd/transformers_daily_ci
env_file: /etc/podinfo/gha-gpu-isolation-settings
secrets: inherit
deepspeed-ci:
@ -60,4 +63,5 @@ jobs:
docker: huggingface/transformers-pytorch-deepspeed-amd-gpu
ci_event: Scheduled CI (AMD) - mi325
report_repo_id: optimum-amd/transformers_daily_ci
env_file: /etc/podinfo/gha-gpu-isolation-settings
secrets: inherit

View File

@ -1,8 +1,8 @@
name: Self-hosted runner scale set (AMD mi300 scheduled CI caller)
name: Self-hosted runner scale set (AMD mi355 scheduled CI caller)
# Note: For every job in this workflow, the name of the runner scale set is finalized in the runner yaml i.e. huggingface/hf-workflows/.github/workflows/transformers_amd_ci_scheduled_arc_scale_set.yaml
# For example, 1gpu scale set: amd-mi300-ci-1gpu
# 2gpu scale set: amd-mi300-ci-2gpu
# For example, 1gpu : amd-mi355-ci-1gpu
# 2gpu : amd-mi355-ci-2gpu
on:
workflow_run:
@ -20,9 +20,9 @@ jobs:
with:
job: run_models_gpu
slack_report_channel: "#amd-hf-ci"
runner_scale_set: amd-mi300-ci
runner_scale_set: amd-mi355-ci
docker: huggingface/transformers-pytorch-amd-gpu
ci_event: Scheduled CI (AMD) - mi300
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: optimum-amd/transformers_daily_ci
secrets: inherit
@ -32,9 +32,9 @@ jobs:
with:
job: run_pipelines_torch_gpu
slack_report_channel: "#amd-hf-ci"
runner_scale_set: amd-mi300-ci
runner_scale_set: amd-mi355-ci
docker: huggingface/transformers-pytorch-amd-gpu
ci_event: Scheduled CI (AMD) - mi300
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: optimum-amd/transformers_daily_ci
secrets: inherit
@ -44,9 +44,9 @@ jobs:
with:
job: run_examples_gpu
slack_report_channel: "#amd-hf-ci"
runner_scale_set: amd-mi300-ci
runner_scale_set: amd-mi355-ci
docker: huggingface/transformers-pytorch-amd-gpu
ci_event: Scheduled CI (AMD) - mi300
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: optimum-amd/transformers_daily_ci
secrets: inherit
@ -56,8 +56,8 @@ jobs:
with:
job: run_torch_cuda_extensions_gpu
slack_report_channel: "#amd-hf-ci"
runner_scale_set: amd-mi300-ci
runner_scale_set: amd-mi355-ci
docker: huggingface/transformers-pytorch-deepspeed-amd-gpu
ci_event: Scheduled CI (AMD) - mi300
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: optimum-amd/transformers_daily_ci
secrets: inherit

View File

@ -1,5 +1,4 @@
name: Self-hosted runner (scheduled)
name: Nvidia CI
on:
repository_dispatch:
@ -7,7 +6,7 @@ on:
- cron: "17 2 * * *"
push:
branches:
- run_scheduled_ci*
- run_nvidia_ci*
workflow_dispatch:
inputs:
prev_workflow_run_id:
@ -54,6 +53,7 @@ jobs:
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
torch-pipeline:
@ -65,6 +65,7 @@ jobs:
docker: huggingface/transformers-pytorch-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
example-ci:
@ -76,6 +77,7 @@ jobs:
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
trainer-fsdp-ci:
@ -87,6 +89,7 @@ jobs:
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
deepspeed-ci:
@ -99,6 +102,7 @@ jobs:
ci_event: Daily CI
working-directory-prefix: /workspace
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
quantization-ci:
@ -110,4 +114,5 @@ jobs:
docker: huggingface/transformers-quantization-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit

View File

@ -1,4 +1,4 @@
name: Self-hosted runner (scheduled)
name: Nvidia CI (job definitions)
# Note that each job's dependencies go into a corresponding docker file.
#
@ -28,7 +28,13 @@ on:
report_repo_id:
required: true
type: string
commit_sha:
required: false
type: string
models:
default: ""
required: false
type: string
env:
HF_HOME: /mnt/cache
@ -46,8 +52,8 @@ env:
jobs:
setup:
if: contains(fromJSON('["run_models_gpu", "run_trainer_and_fsdp_gpu", "run_quantization_torch_gpu"]'), inputs.job)
name: Setup
if: contains(fromJSON('["run_models_gpu", "run_trainer_and_fsdp_gpu", "run_quantization_torch_gpu"]'), inputs.job)
strategy:
matrix:
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
@ -65,7 +71,7 @@ jobs:
- name: Update clone
working-directory: /transformers
run: |
git fetch && git checkout ${{ github.sha }}
git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Cleanup
working-directory: /transformers
@ -84,7 +90,7 @@ jobs:
working-directory: /transformers/tests
run: |
if [ "${{ inputs.job }}" = "run_models_gpu" ]; then
echo "folder_slices=$(python3 ../utils/split_model_tests.py --num_splits ${{ env.NUM_SLICES }})" >> $GITHUB_OUTPUT
echo "folder_slices=$(python3 ../utils/split_model_tests.py --models '${{ inputs.models }}' --num_splits ${{ env.NUM_SLICES }})" >> $GITHUB_OUTPUT
echo "slice_ids=$(python3 -c 'd = list(range(${{ env.NUM_SLICES }})); print(d)')" >> $GITHUB_OUTPUT
echo "runner_map=$(python3 ../utils/get_runner_map.py)" >> $GITHUB_OUTPUT
elif [ "${{ inputs.job }}" = "run_trainer_and_fsdp_gpu" ]; then
@ -119,6 +125,7 @@ jobs:
slice_id: ${{ matrix.slice_id }}
runner_map: ${{ needs.setup.outputs.runner_map }}
docker: ${{ inputs.docker }}
commit_sha: ${{ inputs.commit_sha || github.sha }}
secrets: inherit
run_trainer_and_fsdp_gpu:
@ -137,6 +144,7 @@ jobs:
slice_id: ${{ matrix.slice_id }}
runner_map: ${{ needs.setup.outputs.runner_map }}
docker: ${{ inputs.docker }}
commit_sha: ${{ inputs.commit_sha || github.sha }}
report_name_prefix: run_trainer_and_fsdp_gpu
secrets: inherit
@ -155,7 +163,7 @@ jobs:
steps:
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@ -223,7 +231,7 @@ jobs:
steps:
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@ -292,7 +300,7 @@ jobs:
steps:
- name: Update clone
working-directory: ${{ inputs.working-directory-prefix }}/transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: ${{ inputs.working-directory-prefix }}/transformers
@ -400,7 +408,7 @@ jobs:
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@ -464,6 +472,7 @@ jobs:
uses: actions/checkout@v4
with:
fetch-depth: 2
ref: ${{ inputs.commit_sha || github.sha }}
- name: Install transformers
run: pip install transformers
@ -506,7 +515,7 @@ jobs:
run_quantization_torch_gpu,
run_extract_warnings
]
if: ${{ always() }}
if: always() && !cancelled()
uses: ./.github/workflows/slack-report.yml
with:
job: ${{ inputs.job }}
@ -518,6 +527,7 @@ jobs:
quantization_matrix: ${{ needs.setup.outputs.quantization_matrix }}
ci_event: ${{ inputs.ci_event }}
report_repo_id: ${{ inputs.report_repo_id }}
commit_sha: ${{ inputs.commit_sha || github.sha }}
secrets: inherit
@ -528,7 +538,7 @@ jobs:
uses: ./.github/workflows/check_failed_tests.yml
with:
docker: ${{ inputs.docker }}
start_sha: ${{ github.sha }}
start_sha: ${{ inputs.commit_sha || github.sha }}
job: ${{ inputs.job }}
slack_report_channel: ${{ inputs.slack_report_channel }}
ci_event: ${{ inputs.ci_event }}
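Note on the `${{ inputs.commit_sha || github.sha }}` pattern used throughout this file: an optional string input that the caller omits is an empty string, which is falsy in an Actions expression, so `||` falls through to the SHA of the triggering event. Python's `or` behaves the same way for strings, which gives a quick way to reason about it. The helper name `resolve_commit` and the sample SHAs below are hypothetical.

```python
def resolve_commit(input_sha: str, event_sha: str) -> str:
    """Pick the commit to check out.

    An empty input_sha (the caller did not pass commit_sha) is falsy,
    so the event's own SHA is used instead.
    """
    return input_sha or event_sha


if __name__ == "__main__":
    print(resolve_commit("", "abc123"))        # caller passed nothing
    print(resolve_commit("def456", "abc123"))  # caller pinned a commit
```

This is what lets a `workflow_run`-triggered caller pin the tests to the commit of the run it follows while other callers keep the previous behaviour.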

View File

@ -24,6 +24,10 @@ on:
report_repo_id:
required: true
type: string
commit_sha:
required: false
type: string
env:
TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
@ -32,7 +36,7 @@ jobs:
send_results:
name: Send results to webhook
runs-on: ubuntu-22.04
if: always()
if: always() && !cancelled()
steps:
- name: Preliminary job status
shell: bash
@ -41,6 +45,10 @@ jobs:
echo "Setup status: ${{ inputs.setup_status }}"
- uses: actions/checkout@v4
with:
fetch-depth: 2
ref: ${{ inputs.commit_sha || github.sha }}
- uses: actions/download-artifact@v4
- name: Prepare some setup values
@ -67,7 +75,9 @@ jobs:
SLACK_REPORT_CHANNEL: ${{ inputs.slack_report_channel }}
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
CI_EVENT: ${{ inputs.ci_event }}
CI_SHA: ${{ github.sha }}
# This `CI_TITLE` would be empty for `schedule` or `workflow_run` events.
CI_TITLE: ${{ github.event.head_commit.message }}
CI_SHA: ${{ inputs.commit_sha || github.sha }}
CI_TEST_JOB: ${{ inputs.job }}
SETUP_STATUS: ${{ inputs.setup_status }}
REPORT_REPO_ID: ${{ inputs.report_repo_id }}
@ -83,7 +93,7 @@ jobs:
python utils/notification_service.py "${{ inputs.quantization_matrix }}"
else
python utils/notification_service.py "${{ inputs.folder_slices }}"
fi
fi
# Upload complete failure tables, as they might be big and only truncated versions could be sent to Slack.
- name: Failure table artifacts

View File

@ -68,8 +68,7 @@ already reported** (use the search bar on GitHub under Issues). Your issue shoul
Once you've confirmed the bug hasn't already been reported, please include the following information in your issue so we can quickly resolve it:
* Your **OS type and version** and **Python**, **PyTorch** and
**TensorFlow** versions when applicable.
* Your **OS type and version** and **Python**, and **PyTorch** versions when applicable.
* A short, self-contained, code snippet that allows us to reproduce the bug in
less than 30s.
* The *full* traceback if an exception is raised.
@ -165,8 +164,7 @@ You'll need **[Python 3.9](https://github.com/huggingface/transformers/blob/main
mode with the `-e` flag.
Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a
failure with this command. If that's the case make sure to install the Deep Learning framework you are working with
(PyTorch, TensorFlow and/or Flax) then do:
failure with this command. If that's the case make sure to install Pytorch then do:
```bash
pip install -e ".[quality]"

View File

@ -52,6 +52,7 @@ repo-consistency:
python utils/check_doctest_list.py
python utils/update_metadata.py --check-only
python utils/check_docstrings.py
python utils/add_dates.py
# this target runs checks on all files

View File

@ -147,7 +147,7 @@ chat = [
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", dtype=torch.bfloat16, device_map="auto")
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
```
@ -242,7 +242,7 @@ pipeline(
- This library is not a modular toolbox of building blocks for neural nets. The code in the model files is not refactored with additional abstractions on purpose, so that researchers can quickly iterate on each of the models without diving into additional abstractions/files.
- The training API is optimized to work with PyTorch models provided by Transformers. For generic machine learning loops, you should use another library like [Accelerate](https://huggingface.co/docs/accelerate).
- The [example scripts]((https://github.com/huggingface/transformers/tree/main/examples)) are only *examples*. They may not necessarily work out-of-the-box on your specific use case and you'll need to adapt the code for it to work.
- The [example scripts](https://github.com/huggingface/transformers/tree/main/examples) are only *examples*. They may not necessarily work out-of-the-box on your specific use case and you'll need to adapt the code for it to work.
## 100 projects using Transformers
@ -280,8 +280,8 @@ Expand each modality below to see a few example models for various use cases.
- Automatic mask generation with [SAM](https://huggingface.co/facebook/sam-vit-base)
- Depth estimation with [DepthPro](https://huggingface.co/apple/DepthPro-hf)
- Image classification with [DINO v2](https://huggingface.co/facebook/dinov2-base)
- Keypoint detection with [SuperGlue](https://huggingface.co/magic-leap-community/superglue_outdoor)
- Keypoint matching with [SuperGlue](https://huggingface.co/magic-leap-community/superglue)
- Keypoint detection with [SuperPoint](https://huggingface.co/magic-leap-community/superpoint)
- Keypoint matching with [SuperGlue](https://huggingface.co/magic-leap-community/superglue_outdoor)
- Object detection with [RT-DETRv2](https://huggingface.co/PekingU/rtdetr_v2_r50vd)
- Pose Estimation with [VitPose](https://huggingface.co/usyd-community/vitpose-base-simple)
- Universal segmentation with [OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_swin_large)

View File

@ -14,7 +14,7 @@ Models uploaded on the Hugging Face Hub come in different formats. We heavily re
models in the [`safetensors`](https://github.com/huggingface/safetensors) format (which is the default prioritized
by the transformers library), as developed specifically to prevent arbitrary code execution on your system.
To avoid loading models from unsafe formats(e.g. [pickle](https://docs.python.org/3/library/pickle.html), you should use the `use_safetensors` parameter. If doing so, in the event that no .safetensors file is present, transformers will error when loading the model.
To avoid loading models from unsafe formats (e.g. [pickle](https://docs.python.org/3/library/pickle.html), you should use the `use_safetensors` parameter. If doing so, in the event that no .safetensors file is present, transformers will error when loading the model.
### Remote code

1
benchmark/.gitignore vendored Normal file
View File

@ -0,0 +1 @@
benchmark_results/

345
benchmark/benches/llama.py Normal file
View File

@ -0,0 +1,345 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from logging import Logger
import os
from threading import Event, Thread
from time import perf_counter, sleep
from typing import Optional
import sys
# Add the parent directory to Python path to import benchmarks_entrypoint
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from benchmarks_entrypoint import MetricsRecorder
import gpustat
import psutil
import psycopg2
# Optional heavy ML dependencies - only required when actually running the benchmark
try:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, StaticCache
TRANSFORMERS_AVAILABLE = True
except ImportError:
TRANSFORMERS_AVAILABLE = False
torch = None
AutoModelForCausalLM = None
AutoTokenizer = None
GenerationConfig = None
StaticCache = None
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "1"
# Only set torch precision if torch is available
if TRANSFORMERS_AVAILABLE:
torch.set_float32_matmul_precision("high")
def collect_metrics(benchmark_id, continue_metric_collection, metrics_recorder):
p = psutil.Process(os.getpid())
while not continue_metric_collection.is_set():
with p.oneshot():
cpu_util = p.cpu_percent()
mem_megabytes = p.memory_info().rss / (1024 * 1024)
gpu_stats = gpustat.GPUStatCollection.new_query()
gpu_util = gpu_stats[0]["utilization.gpu"]
gpu_mem_megabytes = gpu_stats[0]["memory.used"]
metrics_recorder.collect_device_measurements(
benchmark_id, cpu_util, mem_megabytes, gpu_util, gpu_mem_megabytes
)
sleep(0.01)
def run_benchmark(
logger: Logger, repository: str, branch: str, commit_id: str, commit_msg: str, metrics_recorder=None, num_tokens_to_generate=100
):
# Check if required ML dependencies are available
if not TRANSFORMERS_AVAILABLE:
logger.error("Transformers and torch are required to run the LLaMA benchmark. Please install them with:")
logger.error("pip install torch transformers")
logger.error("Skipping LLaMA benchmark due to missing dependencies.")
return
continue_metric_collection = Event()
metrics_thread = None
model_id = "meta-llama/Llama-2-7b-hf"
# If no metrics_recorder is provided, create one for backward compatibility
if metrics_recorder is None:
try:
metrics_recorder = MetricsRecorder(
psycopg2.connect("dbname=metrics"), logger, repository, branch, commit_id, commit_msg, True
)
should_close_recorder = True
except Exception as e:
logger.error(f"Failed to create metrics recorder: {e}")
return
else:
should_close_recorder = False
try:
gpu_stats = gpustat.GPUStatCollection.new_query()
gpu_name = gpu_stats[0]["name"]
benchmark_id = metrics_recorder.initialise_benchmark({"gpu_name": gpu_name, "model_id": model_id})
logger.info(f"running benchmark #{benchmark_id} on {gpu_name} for {model_id}")
metrics_thread = Thread(
target=collect_metrics,
args=[benchmark_id, continue_metric_collection, metrics_recorder],
)
metrics_thread.start()
logger.info("started background thread to fetch device metrics")
os.environ["TOKENIZERS_PARALLELISM"] = "false" # silence warnings when compiling
device = "cuda"
logger.info("downloading weights")
# This is to avoid counting download in model load time measurement
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16)
gen_config = GenerationConfig(do_sample=False, top_p=1, temperature=1)
logger.info("loading model")
start = perf_counter()
model = AutoModelForCausalLM.from_pretrained(
model_id, dtype=torch.float16, generation_config=gen_config
).eval()
model.to(device)
torch.cuda.synchronize()
end = perf_counter()
model_load_time = end - start
logger.info(f"loaded model in: {model_load_time}s")
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Why dogs are so cute?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Specify the max length (including both the prompt and the response)
# When calling `generate` with `cache_implementation="static" later, this is also used to create a `StaticCache` object
# with sequence length = `max_length`. The longer the more you will re-use it
seq_length = inputs["input_ids"].shape[1]
model.generation_config.max_length = seq_length + num_tokens_to_generate
batch_size = inputs["input_ids"].shape[0]
# Copied from the gpt-fast repo
def multinomial_sample_one_no_sync(probs_sort): # Does multinomial sampling without a cuda synchronization
q = torch.empty_like(probs_sort).exponential_(1)
return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
def logits_to_probs(logits, temperature: float = 1.0, top_k: Optional[int] = None):
logits = logits / max(temperature, 1e-5)
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
pivot = v.select(-1, -1).unsqueeze(-1)
logits = torch.where(logits < pivot, -float("Inf"), logits)
probs = torch.nn.functional.softmax(logits, dim=-1)
return probs
def sample(logits, temperature: float = 1.0, top_k: Optional[int] = None):
probs = logits_to_probs(logits[0, -1], temperature, top_k)
idx_next = multinomial_sample_one_no_sync(probs)
return idx_next, probs
# First eager forward pass
logger.info("running first eager forward pass")
start = perf_counter()
outputs = model(**inputs)
torch.cuda.synchronize()
end = perf_counter()
first_eager_fwd_pass_time = end - start
logger.info(f"completed first eager forward pass in: {first_eager_fwd_pass_time}s")
# Second eager forward pass (should be faster)
logger.info("running second eager forward pass")
start = perf_counter()
outputs = model(**inputs)
torch.cuda.synchronize()
end = perf_counter()
second_eager_fwd_pass_time = end - start
logger.info(f"completed second eager forward pass in: {second_eager_fwd_pass_time}s")
# First eager generation
logger.info("running first eager generation")
start = perf_counter()
output = model.generate(**inputs)
torch.cuda.synchronize()
end = perf_counter()
first_eager_generate_time = end - start
logger.info(f"completed first eager generation in: {first_eager_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
# Second eager generation (should be faster)
logger.info("running second eager generation")
start = perf_counter()
output = model.generate(**inputs)
torch.cuda.synchronize()
end = perf_counter()
second_eager_generate_time = end - start
logger.info(f"completed second eager generation in: {second_eager_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
logger.info("running generation timing loop")
input_pos = torch.arange(0, seq_length, device=device)
inputs = inputs["input_ids"]
start = perf_counter()
with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.MATH):
logits = model(inputs, position_ids=input_pos).logits
next_token, probs = sample(logits, temperature=0.6, top_k=5)
torch.cuda.synchronize()
end = perf_counter()
time_to_first_token = end - start
input_pos = torch.tensor([seq_length], device=device, dtype=torch.int)
next_token = next_token.clone()
start = perf_counter()
with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.MATH):
logits = model(next_token, position_ids=input_pos).logits
next_token, probs = sample(logits, temperature=0.6, top_k=5)
torch.cuda.synchronize()
end = perf_counter()
time_to_second_token = end - start
input_pos = torch.tensor([seq_length + 1], device=device, dtype=torch.int)
next_token = next_token.clone()
start = perf_counter()
with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.MATH):
logits = model(next_token, position_ids=input_pos).logits
next_token, probs = sample(logits, temperature=0.6, top_k=5)
torch.cuda.synchronize()
end = perf_counter()
time_to_third_token = end - start
logger.info("running longer generation timing loop")
total_time = 0
for i in range(20):
input_pos = torch.tensor([seq_length + 2 + i], device=device, dtype=torch.int)
next_token = next_token.clone()
start = perf_counter()
with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.MATH):
logits = model(next_token, position_ids=input_pos).logits
next_token, probs = sample(logits, temperature=0.6, top_k=5)
torch.cuda.synchronize()
end = perf_counter()
total_time += end - start
mean_time_to_next_token = total_time / 20
logger.info("running compilation benchmarks")
# Now compile the model
model = torch.compile(model, mode="max-autotune", fullgraph=True)
# StaticCache for generation
with torch.device(device):
model.setup_caches(max_batch_size=batch_size, max_seq_len=seq_length + num_tokens_to_generate)
input_pos = torch.arange(0, seq_length, device=device)
inputs = tokenizer(prompt, return_tensors="pt").to(device)["input_ids"]
logger.info("compiling model")
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16, generation_config=gen_config)
model.to(device)
model = torch.compile(model, mode="max-autotune", fullgraph=True)
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + 128,
)
# 1st call
start = perf_counter()
output = model.generate(**inputs, past_key_values=past_key_values)
end = perf_counter()
first_compile_generate_time = end - start
logger.info(f"completed first compile generation in: {first_compile_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + 128,
)
# 2nd call
start = perf_counter()
output = model.generate(**inputs, past_key_values=past_key_values)
end = perf_counter()
second_compile_generate_time = end - start
logger.info(f"completed second compile generation in: {second_compile_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + 128,
)
# 3rd call
start = perf_counter()
output = model.generate(**inputs, past_key_values=past_key_values)
end = perf_counter()
third_compile_generate_time = end - start
logger.info(f"completed third compile generation in: {third_compile_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + 128,
)
# 4th call
start = perf_counter()
output = model.generate(**inputs, past_key_values=past_key_values)
end = perf_counter()
fourth_compile_generate_time = end - start
logger.info(f"completed fourth compile generation in: {fourth_compile_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
metrics_recorder.collect_model_measurements(
benchmark_id,
{
"model_load_time": model_load_time,
"first_eager_forward_pass_time_secs": first_eager_fwd_pass_time,
"second_eager_forward_pass_time_secs": second_eager_fwd_pass_time,
"first_eager_generate_time_secs": first_eager_generate_time,
"second_eager_generate_time_secs": second_eager_generate_time,
"time_to_first_token_secs": time_to_first_token,
"time_to_second_token_secs": time_to_second_token,
"time_to_third_token_secs": time_to_third_token,
"time_to_next_token_mean_secs": mean_time_to_next_token,
"first_compile_generate_time_secs": first_compile_generate_time,
"second_compile_generate_time_secs": second_compile_generate_time,
"third_compile_generate_time_secs": third_compile_generate_time,
"fourth_compile_generate_time_secs": fourth_compile_generate_time,
},
)
except Exception as e:
logger.error(f"Caught exception: {e}")
continue_metric_collection.set()
if metrics_thread is not None:
metrics_thread.join()
# Only close the recorder if we created it locally
if should_close_recorder:
metrics_recorder.close()
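The `multinomial_sample_one_no_sync` helper earlier in this file samples a token by dividing the probabilities by Exponential(1) noise and taking the argmax. A minimal pure-Python sketch of why that draws from the categorical distribution is given below. It uses the equivalent argmin form, and the names `sample_index` and `empirical_frequencies` are illustrative only, not part of the benchmark:

```python
import random


def sample_index(probs, rng):
    """Draw index i with probability probs[i] / sum(probs).

    For q_i ~ Exponential(1), argmax_i(probs[i] / q_i) is the same index as
    argmin_i(q_i / probs[i]). Each q_i / probs[i] is exponential with rate
    probs[i], and the smallest of independent exponentials lands on index i
    with probability proportional to its rate.
    """
    best_index, best_time = -1, float("inf")
    for i, p in enumerate(probs):
        if p <= 0.0:
            continue  # zero-probability categories can never win
        arrival = rng.expovariate(1.0) / p
        if arrival < best_time:
            best_index, best_time = i, arrival
    return best_index


def empirical_frequencies(probs, num_draws, seed=0):
    """Repeat the draw and return the observed frequency of each index."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(num_draws):
        counts[sample_index(probs, rng)] += 1
    return [count / num_draws for count in counts]
```

With enough draws, `empirical_frequencies([0.1, 0.2, 0.7], 20000)` should land close to the input probabilities.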


@@ -1,15 +1,35 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import importlib.util
import logging
import os
import sys
from typing import Dict, Tuple
import json
import uuid
from datetime import datetime
from typing import Dict, Tuple, Optional, List
from psycopg2.extensions import register_adapter
from psycopg2.extras import Json
import pandas as pd
register_adapter(dict, Json)
try:
from psycopg2.extensions import register_adapter
from psycopg2.extras import Json
register_adapter(dict, Json)
PSYCOPG2_AVAILABLE = True
except ImportError:
PSYCOPG2_AVAILABLE = False
class ImportModuleException(Exception):
@@ -18,61 +38,239 @@ class ImportModuleException(Exception):
class MetricsRecorder:
def __init__(
self, connection, logger: logging.Logger, repository: str, branch: str, commit_id: str, commit_msg: str
self, connection, logger: logging.Logger, repository: str, branch: str, commit_id: str, commit_msg: str,
collect_csv_data: bool = True
):
self.conn = connection
self.conn.autocommit = True
self.use_database = connection is not None
if self.use_database:
self.conn.autocommit = True
self.logger = logger
self.repository = repository
self.branch = branch
self.commit_id = commit_id
self.commit_msg = commit_msg
self.collect_csv_data = collect_csv_data
# For CSV export - store all data in pandas DataFrames (only if CSV collection is enabled)
if self.collect_csv_data:
# Initialize empty DataFrames with proper schemas
self.benchmarks_df = pd.DataFrame(columns=[
'benchmark_id', 'repository', 'branch', 'commit_id', 'commit_message',
'metadata', 'created_at'
])
self.device_measurements_df = pd.DataFrame(columns=[
'benchmark_id', 'cpu_util', 'mem_megabytes', 'gpu_util',
'gpu_mem_megabytes', 'time'
])
self.model_measurements_df = pd.DataFrame(columns=[
'benchmark_id', 'time', 'model_load_time', 'first_eager_forward_pass_time_secs',
'second_eager_forward_pass_time_secs', 'first_eager_generate_time_secs',
'second_eager_generate_time_secs', 'time_to_first_token_secs',
'time_to_second_token_secs', 'time_to_third_token_secs',
'time_to_next_token_mean_secs', 'first_compile_generate_time_secs',
'second_compile_generate_time_secs', 'third_compile_generate_time_secs',
'fourth_compile_generate_time_secs'
])
else:
self.benchmarks_df = None
self.device_measurements_df = None
self.model_measurements_df = None
def initialise_benchmark(self, metadata: dict[str, str]) -> int:
def initialise_benchmark(self, metadata: dict[str, str]) -> str:
"""
Creates a new benchmark, returns the benchmark id
Creates a new benchmark, returns the benchmark id (UUID)
"""
# gpu_name: str, model_id: str
with self.conn.cursor() as cur:
cur.execute(
"INSERT INTO benchmarks (repository, branch, commit_id, commit_message, metadata) VALUES (%s, %s, %s, %s, %s) RETURNING benchmark_id",
(self.repository, self.branch, self.commit_id, self.commit_msg, metadata),
)
benchmark_id = cur.fetchone()[0]
logger.debug(f"initialised benchmark #{benchmark_id}")
return benchmark_id
# Generate a unique UUID for this benchmark
benchmark_id = str(uuid.uuid4())
if self.use_database:
with self.conn.cursor() as cur:
cur.execute(
"INSERT INTO benchmarks (benchmark_id, repository, branch, commit_id, commit_message, metadata) VALUES (%s, %s, %s, %s, %s, %s)",
(benchmark_id, self.repository, self.branch, self.commit_id, self.commit_msg, metadata),
)
self.logger.debug(f"initialised benchmark #{benchmark_id}")
# Store benchmark data for CSV export (if enabled)
if self.collect_csv_data:
# Add row to pandas DataFrame
new_row = pd.DataFrame([{
'benchmark_id': benchmark_id,
'repository': self.repository,
'branch': self.branch,
'commit_id': self.commit_id,
'commit_message': self.commit_msg,
'metadata': json.dumps(metadata),
'created_at': datetime.utcnow().isoformat()
}])
self.benchmarks_df = pd.concat([self.benchmarks_df, new_row], ignore_index=True)
mode_info = []
if self.use_database:
mode_info.append("database")
if self.collect_csv_data:
mode_info.append("CSV")
mode_str = " + ".join(mode_info) if mode_info else "no storage"
self.logger.debug(f"initialised benchmark #{benchmark_id} ({mode_str} mode)")
return benchmark_id
def collect_device_measurements(self, benchmark_id: int, cpu_util, mem_megabytes, gpu_util, gpu_mem_megabytes):
def collect_device_measurements(self, benchmark_id: str, cpu_util, mem_megabytes, gpu_util, gpu_mem_megabytes):
"""
Collect device metrics, such as CPU & GPU usage. These are "static", as in you cannot pass arbitrary arguments to the function.
"""
with self.conn.cursor() as cur:
cur.execute(
"INSERT INTO device_measurements (benchmark_id, cpu_util, mem_megabytes, gpu_util, gpu_mem_megabytes) VALUES (%s, %s, %s, %s, %s)",
(benchmark_id, cpu_util, mem_megabytes, gpu_util, gpu_mem_megabytes),
)
# Store device measurements for CSV export (if enabled)
if self.collect_csv_data:
# Add row to pandas DataFrame
new_row = pd.DataFrame([{
'benchmark_id': benchmark_id,
'cpu_util': cpu_util,
'mem_megabytes': mem_megabytes,
'gpu_util': gpu_util,
'gpu_mem_megabytes': gpu_mem_megabytes,
'time': datetime.utcnow().isoformat()
}])
self.device_measurements_df = pd.concat([self.device_measurements_df, new_row], ignore_index=True)
# Store in database if available
if self.use_database:
with self.conn.cursor() as cur:
cur.execute(
"INSERT INTO device_measurements (benchmark_id, cpu_util, mem_megabytes, gpu_util, gpu_mem_megabytes) VALUES (%s, %s, %s, %s, %s)",
(benchmark_id, cpu_util, mem_megabytes, gpu_util, gpu_mem_megabytes),
)
self.logger.debug(
f"inserted device measurements for benchmark #{benchmark_id} [CPU util: {cpu_util}, mem MBs: {mem_megabytes}, GPU util: {gpu_util}, GPU mem MBs: {gpu_mem_megabytes}]"
f"collected device measurements for benchmark #{benchmark_id} [CPU util: {cpu_util}, mem MBs: {mem_megabytes}, GPU util: {gpu_util}, GPU mem MBs: {gpu_mem_megabytes}]"
)
def collect_model_measurements(self, benchmark_id: int, measurements: dict[str, float]):
with self.conn.cursor() as cur:
cur.execute(
"""
INSERT INTO model_measurements (
benchmark_id,
measurements
) VALUES (%s, %s)
""",
(
benchmark_id,
measurements,
),
)
self.logger.debug(f"inserted model measurements for benchmark #{benchmark_id}: {measurements}")
def collect_model_measurements(self, benchmark_id: str, measurements: dict[str, float]):
# Store model measurements for CSV export (if enabled)
if self.collect_csv_data:
# Add row to pandas DataFrame with flattened measurements
row_data = {
'benchmark_id': benchmark_id,
'time': datetime.utcnow().isoformat()
}
# Flatten the measurements dict into the row
row_data.update(measurements)
new_row = pd.DataFrame([row_data])
self.model_measurements_df = pd.concat([self.model_measurements_df, new_row], ignore_index=True)
# Store in database if available
if self.use_database:
with self.conn.cursor() as cur:
cur.execute(
"""
INSERT INTO model_measurements (
benchmark_id,
measurements
) VALUES (%s, %s)
""",
(
benchmark_id,
measurements,
),
)
self.logger.debug(f"collected model measurements for benchmark #{benchmark_id}: {measurements}")
def export_to_csv(self, output_dir: str = "benchmark_results"):
"""
Export all collected data to CSV files using pandas DataFrames
"""
if not self.collect_csv_data:
self.logger.warning("CSV data collection is disabled - no CSV files will be generated")
return
if not os.path.exists(output_dir):
os.makedirs(output_dir)
self.logger.info(f"Created output directory: {output_dir}")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
files_created = []
# Export using pandas DataFrames
self._export_pandas_data(output_dir, timestamp, files_created)
self.logger.info(f"CSV export complete! Created {len(files_created)} files in {output_dir}")
def _export_pandas_data(self, output_dir: str, timestamp: str, files_created: list):
"""
Export CSV files using pandas DataFrames
"""
# Export benchmarks
benchmarks_file = os.path.join(output_dir, f"benchmarks_{timestamp}.csv")
self.benchmarks_df.to_csv(benchmarks_file, index=False)
files_created.append(benchmarks_file)
self.logger.info(f"Exported {len(self.benchmarks_df)} benchmark records to {benchmarks_file}")
# Export device measurements
device_file = os.path.join(output_dir, f"device_measurements_{timestamp}.csv")
self.device_measurements_df.to_csv(device_file, index=False)
files_created.append(device_file)
self.logger.info(f"Exported {len(self.device_measurements_df)} device measurement records to {device_file}")
# Export model measurements (already flattened)
model_file = os.path.join(output_dir, f"model_measurements_{timestamp}.csv")
self.model_measurements_df.to_csv(model_file, index=False)
files_created.append(model_file)
self.logger.info(f"Exported {len(self.model_measurements_df)} model measurement records to {model_file}")
# Create comprehensive summary using pandas operations
summary_file = os.path.join(output_dir, f"benchmark_summary_{timestamp}.csv")
self._create_summary(summary_file)
files_created.append(summary_file)
def _create_summary(self, summary_file: str):
"""
Create a comprehensive summary CSV using pandas operations
"""
if len(self.benchmarks_df) == 0:
# Create empty summary file
summary_df = pd.DataFrame()
summary_df.to_csv(summary_file, index=False)
self.logger.info(f"Created empty benchmark summary at {summary_file}")
return
# Start with benchmarks as the base
summary_df = self.benchmarks_df.copy()
# Add model measurements (join on benchmark_id)
if len(self.model_measurements_df) > 0:
# Drop 'time' column from model measurements to avoid conflicts
model_df = self.model_measurements_df.drop(columns=['time'], errors='ignore')
summary_df = summary_df.merge(model_df, on='benchmark_id', how='left')
# Calculate device measurement aggregates using pandas groupby
if len(self.device_measurements_df) > 0:
device_agg = self.device_measurements_df.groupby('benchmark_id').agg({
'cpu_util': ['mean', 'max', 'std', 'count'],
'mem_megabytes': ['mean', 'max', 'std'],
'gpu_util': ['mean', 'max', 'std'],
'gpu_mem_megabytes': ['mean', 'max', 'std']
}).round(3)
# Flatten column names
device_agg.columns = [f"{col[0]}_{col[1]}" for col in device_agg.columns]
device_agg = device_agg.reset_index()
# Rename count column to be more descriptive
if 'cpu_util_count' in device_agg.columns:
device_agg = device_agg.rename(columns={'cpu_util_count': 'device_measurement_count'})
# Merge with summary
summary_df = summary_df.merge(device_agg, on='benchmark_id', how='left')
# Export the comprehensive summary
summary_df.to_csv(summary_file, index=False)
self.logger.info(f"Created comprehensive benchmark summary with {len(summary_df)} records at {summary_file}")
def close(self):
self.conn.close()
if self.use_database and self.conn:
self.conn.close()
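The device-measurement aggregation in `_create_summary` can be tried in isolation. A minimal sketch, assuming pandas is installed and using made-up sample values; the function name `summarise_device_measurements` and the benchmark ids are hypothetical, and only two of the four metric columns are included to keep it short:

```python
import pandas as pd


def summarise_device_measurements(device_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-benchmark device stats and flatten the column labels."""
    agg = device_df.groupby("benchmark_id").agg(
        {
            "cpu_util": ["mean", "max", "std", "count"],
            "gpu_util": ["mean", "max", "std"],
        }
    ).round(3)
    # A dict of lists yields two-level column labels such as
    # ("cpu_util", "mean"); join them into "cpu_util_mean".
    agg.columns = [f"{name}_{stat}" for name, stat in agg.columns]
    agg = agg.reset_index()
    return agg.rename(columns={"cpu_util_count": "device_measurement_count"})


# Made-up measurements for two hypothetical benchmark ids.
sample = pd.DataFrame(
    {
        "benchmark_id": ["a", "a", "b"],
        "cpu_util": [10.0, 30.0, 50.0],
        "gpu_util": [80.0, 90.0, 70.0],
    }
)
summary = summarise_device_measurements(sample)
```

A benchmark with a single device sample gets NaN in the `std` columns, because pandas computes the sample standard deviation (`ddof=1`) by default.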
logger = logging.getLogger(__name__)
@@ -85,7 +283,7 @@ handler.setFormatter(formatter)
logger.addHandler(handler)
def parse_arguments() -> tuple[str, str, str, str]:
def parse_arguments() -> tuple[str, str, str, str, bool, str]:
"""
Parse command line arguments for the benchmarking CLI.
"""
@@ -114,10 +312,27 @@ def parse_arguments() -> tuple[str, str, str, str]:
type=str,
help="The commit message associated with the commit, truncated to 70 characters.",
)
parser.add_argument(
"--csv",
action="store_true",
default=False,
help="Enable CSV output files generation."
)
parser.add_argument(
"--csv-output-dir",
type=str,
default="benchmark_results",
help="Directory for CSV output files (default: benchmark_results)."
)
args = parser.parse_args()
# CSV is disabled by default, only enabled when --csv is used
generate_csv = args.csv
return args.repository, args.branch, args.commit_id, args.commit_msg
return args.repository, args.branch, args.commit_id, args.commit_msg, generate_csv, args.csv_output_dir
def import_from_path(module_name, file_path):
@@ -131,22 +346,124 @@ def import_from_path(module_name, file_path):
raise ImportModuleException(f"failed to load python module: {e}")
def create_database_connection():
"""
Try to create a database connection. Returns None if connection fails.
"""
if not PSYCOPG2_AVAILABLE:
logger.warning("psycopg2 not available - running in CSV-only mode")
return None
try:
import psycopg2
conn = psycopg2.connect("dbname=metrics")
logger.info("Successfully connected to database")
return conn
except Exception as e:
logger.warning(f"Failed to connect to database: {e}. Running in CSV-only mode")
return None
def create_global_metrics_recorder(repository: str, branch: str, commit_id: str, commit_msg: str,
generate_csv: bool = False) -> MetricsRecorder:
"""
Create a global metrics recorder that will be used across all benchmarks.
"""
connection = create_database_connection()
recorder = MetricsRecorder(connection, logger, repository, branch, commit_id, commit_msg, generate_csv)
# Log the storage mode
storage_modes = []
if connection is not None:
storage_modes.append("database")
if generate_csv:
storage_modes.append("CSV")
if not storage_modes:
logger.warning("Running benchmarks with NO data storage (no database connection, CSV disabled)")
logger.warning("Use --csv flag to enable CSV output when database is unavailable")
else:
logger.info(f"Running benchmarks with: {' + '.join(storage_modes)} storage")
return recorder
if __name__ == "__main__":
benchmarks_folder_path = os.path.dirname(os.path.realpath(__file__))
benches_folder_path = os.path.join(benchmarks_folder_path, "benches")
repository, branch, commit_id, commit_msg = parse_arguments()
for entry in os.scandir(benchmarks_folder_path):
try:
repository, branch, commit_id, commit_msg, generate_csv, csv_output_dir = parse_arguments()
# Create a global metrics recorder
global_metrics_recorder = create_global_metrics_recorder(repository, branch, commit_id, commit_msg, generate_csv)
successful_benchmarks = 0
failed_benchmarks = 0
# Automatically discover all benchmark modules in benches/ folder
benchmark_modules = []
if os.path.exists(benches_folder_path):
logger.debug(f"Scanning for benchmarks in: {benches_folder_path}")
for entry in os.scandir(benches_folder_path):
if not entry.name.endswith(".py"):
continue
if entry.path == __file__:
if entry.name.startswith("__"): # Skip __init__.py, __pycache__, etc.
continue
logger.debug(f"loading: {entry.name}")
module = import_from_path(entry.name.split(".")[0], entry.path)
logger.info(f"running benchmarks in: {entry.name}")
module.run_benchmark(logger, repository, branch, commit_id, commit_msg)
# Check if the file has a run_benchmark function
try:
logger.debug(f"checking if benches/{entry.name} has run_benchmark function")
module = import_from_path(entry.name.split(".")[0], entry.path)
if hasattr(module, 'run_benchmark'):
benchmark_modules.append(entry.name)
logger.debug(f"discovered benchmark: {entry.name}")
else:
logger.debug(f"skipping {entry.name} - no run_benchmark function found")
except Exception as e:
logger.debug(f"failed to check benches/{entry.name}: {e}")
else:
logger.warning(f"Benches directory not found: {benches_folder_path}")
if benchmark_modules:
logger.info(f"Discovered {len(benchmark_modules)} benchmark(s): {benchmark_modules}")
else:
logger.warning("No benchmark modules found in benches/ directory")
for module_name in benchmark_modules:
module_path = os.path.join(benches_folder_path, module_name)
try:
logger.debug(f"loading: {module_name}")
module = import_from_path(module_name.split(".")[0], module_path)
logger.info(f"running benchmarks in: {module_name}")
# Check if the module has an updated run_benchmark function that accepts metrics_recorder
try:
# Try the new signature first
module.run_benchmark(logger, repository, branch, commit_id, commit_msg, global_metrics_recorder)
except TypeError:
# Fall back to the old signature for backward compatibility
logger.warning(f"Module {module_name} using old run_benchmark signature - database connection will be created per module")
module.run_benchmark(logger, repository, branch, commit_id, commit_msg)
successful_benchmarks += 1
except ImportModuleException as e:
logger.error(e)
failed_benchmarks += 1
except Exception as e:
logger.error(f"error running benchmarks for {entry.name}: {e}")
logger.error(f"error running benchmarks for {module_name}: {e}")
failed_benchmarks += 1
# Export CSV results at the end (if enabled)
try:
if generate_csv:
global_metrics_recorder.export_to_csv(csv_output_dir)
logger.info(f"CSV reports have been generated and saved to the {csv_output_dir} directory")
else:
logger.info("CSV generation disabled - no CSV files created (use --csv to enable)")
logger.info(f"Benchmark run completed. Successful: {successful_benchmarks}, Failed: {failed_benchmarks}")
except Exception as e:
logger.error(f"Failed to export CSV results: {e}")
finally:
global_metrics_recorder.close()
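The entrypoint above chooses between the new and old `run_benchmark` signatures by catching `TypeError`, which can also catch a `TypeError` raised from inside a benchmark. One possible alternative is to inspect the signature before calling. The sketch below assumes the new-style parameter is named `metrics_recorder`, which this diff does not show; the helper `call_run_benchmark` and the two stand-in functions are hypothetical:

```python
import inspect


def call_run_benchmark(run_benchmark, base_args, metrics_recorder):
    """Pass the recorder by keyword only when the function declares it.

    Assumption: new-style benchmarks name the parameter `metrics_recorder`.
    """
    params = inspect.signature(run_benchmark).parameters
    if "metrics_recorder" in params:
        return run_benchmark(*base_args, metrics_recorder=metrics_recorder)
    return run_benchmark(*base_args)


# Hypothetical stand-ins for an old-style and a new-style benchmark module.
def old_style(logger, repository, branch, commit_id, commit_msg, num_tokens_to_generate=100):
    return ("old", num_tokens_to_generate)


def new_style(logger, repository, branch, commit_id, commit_msg, metrics_recorder=None):
    return ("new", metrics_recorder)
```

Passing the recorder by keyword also avoids a subtle case: an old-style function whose sixth parameter is `num_tokens_to_generate` would accept a positional recorder in that slot without raising `TypeError`.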


@@ -19,7 +19,7 @@ backend:
model: meta-llama/Llama-2-7b-hf
cache_implementation: static
torch_compile: true
torch_dtype: float16
dtype: float16
torch_compile_config:
backend: inductor
mode: reduce-overhead


@@ -1,34 +0,0 @@
CREATE TABLE IF NOT EXISTS benchmarks (
benchmark_id SERIAL PRIMARY KEY,
repository VARCHAR(255),
branch VARCHAR(255),
commit_id VARCHAR(72),
commit_message VARCHAR(70),
metadata jsonb,
created_at timestamp without time zone NOT NULL DEFAULT (current_timestamp AT TIME ZONE 'UTC')
);
CREATE INDEX IF NOT EXISTS benchmarks_benchmark_id_idx ON benchmarks (benchmark_id);
CREATE INDEX IF NOT EXISTS benchmarks_branch_idx ON benchmarks (branch);
CREATE TABLE IF NOT EXISTS device_measurements (
measurement_id SERIAL PRIMARY KEY,
benchmark_id int REFERENCES benchmarks (benchmark_id),
cpu_util double precision,
mem_megabytes double precision,
gpu_util double precision,
gpu_mem_megabytes double precision,
time timestamp without time zone NOT NULL DEFAULT (current_timestamp AT TIME ZONE 'UTC')
);
CREATE INDEX IF NOT EXISTS device_measurements_branch_idx ON device_measurements (benchmark_id);
CREATE TABLE IF NOT EXISTS model_measurements (
measurement_id SERIAL PRIMARY KEY,
benchmark_id int REFERENCES benchmarks (benchmark_id),
measurements jsonb,
time timestamp without time zone NOT NULL DEFAULT (current_timestamp AT TIME ZONE 'UTC')
);
CREATE INDEX IF NOT EXISTS model_measurements_branch_idx ON model_measurements (benchmark_id);


@@ -1,346 +0,0 @@
from logging import Logger
import os
from threading import Event, Thread
from time import perf_counter, sleep
from typing import Optional
from benchmarks_entrypoint import MetricsRecorder
import gpustat
import psutil
import psycopg2
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, StaticCache
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "1"
torch.set_float32_matmul_precision("high")
def collect_metrics(benchmark_id, continue_metric_collection, metrics_recorder):
p = psutil.Process(os.getpid())
while not continue_metric_collection.is_set():
with p.oneshot():
cpu_util = p.cpu_percent()
mem_megabytes = p.memory_info().rss / (1024 * 1024)
gpu_stats = gpustat.GPUStatCollection.new_query()
gpu_util = gpu_stats[0]["utilization.gpu"]
gpu_mem_megabytes = gpu_stats[0]["memory.used"]
metrics_recorder.collect_device_measurements(
benchmark_id, cpu_util, mem_megabytes, gpu_util, gpu_mem_megabytes
)
sleep(0.01)
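In `collect_metrics`, the event named `continue_metric_collection` acts as a stop signal: polling continues until the event is set, and the caller then joins the thread. A minimal stdlib sketch of that pattern, with a constant stand-in for the real CPU and GPU queries; `poll_until_stopped` and `read_sample` are illustrative names:

```python
from threading import Event, Thread
from time import sleep


def poll_until_stopped(stop_event, samples, read_sample, interval_secs=0.01):
    """Append read_sample() to samples until stop_event is set."""
    while not stop_event.is_set():
        samples.append(read_sample())
        sleep(interval_secs)


stop_event = Event()
samples = []
worker = Thread(target=poll_until_stopped, args=[stop_event, samples, lambda: 1.0])
worker.start()
sleep(0.05)       # stand-in for the benchmark workload
stop_event.set()  # signal the poller to leave its loop
worker.join()     # wait for the last iteration to finish
```

Setting the event in the caller's cleanup path, as the benchmark does after its `try` block, lets the thread exit even when the workload raises.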
def run_benchmark(
logger: Logger, repository: str, branch: str, commit_id: str, commit_msg: str, num_tokens_to_generate=100
):
continue_metric_collection = Event()
metrics_thread = None
model_id = "meta-llama/Llama-2-7b-hf"
metrics_recorder = MetricsRecorder(
psycopg2.connect("dbname=metrics"), logger, repository, branch, commit_id, commit_msg
)
try:
gpu_stats = gpustat.GPUStatCollection.new_query()
gpu_name = gpu_stats[0]["name"]
benchmark_id = metrics_recorder.initialise_benchmark({"gpu_name": gpu_name, "model_id": model_id})
logger.info(f"running benchmark #{benchmark_id} on {gpu_name} for {model_id}")
metrics_thread = Thread(
target=collect_metrics,
args=[benchmark_id, continue_metric_collection, metrics_recorder],
)
metrics_thread.start()
logger.info("started background thread to fetch device metrics")
os.environ["TOKENIZERS_PARALLELISM"] = "false" # silence warnings when compiling
device = "cuda"
logger.info("downloading weights")
# This is to avoid counting download in model load time measurement
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
gen_config = GenerationConfig(do_sample=False, top_p=1, temperature=1)
logger.info("loading model")
start = perf_counter()
model = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype=torch.float16, generation_config=gen_config
).eval()
model.to(device)
torch.cuda.synchronize()
end = perf_counter()
model_load_time = end - start
logger.info(f"loaded model in: {model_load_time}s")
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Why dogs are so cute?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Specify the max length (including both the prompt and the response).
# When calling `generate` with `cache_implementation="static"` later, this is also used to create a `StaticCache` object
# with sequence length = `max_length`. The longer it is, the more the cache can be re-used.
seq_length = inputs["input_ids"].shape[1]
model.generation_config.max_length = seq_length + num_tokens_to_generate
batch_size = inputs["input_ids"].shape[0]
# Copied from the gpt-fast repo
def multinomial_sample_one_no_sync(probs_sort): # Does multinomial sampling without a cuda synchronization
q = torch.empty_like(probs_sort).exponential_(1)
return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
def logits_to_probs(logits, temperature: float = 1.0, top_k: Optional[int] = None):
logits = logits / max(temperature, 1e-5)
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
pivot = v.select(-1, -1).unsqueeze(-1)
logits = torch.where(logits < pivot, -float("Inf"), logits)
probs = torch.nn.functional.softmax(logits, dim=-1)
return probs
def sample(logits, temperature: float = 1.0, top_k: Optional[int] = None):
probs = logits_to_probs(logits[:, -1], temperature, top_k)
idx_next = multinomial_sample_one_no_sync(probs)
return idx_next, probs
def decode_one_token(model, cur_token, cache_position, past_key_values):
logits = model(
cur_token,
cache_position=cache_position,
past_key_values=past_key_values,
return_dict=False,
use_cache=True,
)[0]
new_token = sample(logits, temperature=0.6, top_k=5)[0]
return new_token
#########
# Eager #
#########
with torch.no_grad():
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + num_tokens_to_generate,
)
cache_position = torch.arange(seq_length, device=device)
start = perf_counter()
model(
**inputs,
cache_position=cache_position,
past_key_values=past_key_values,
return_dict=False,
use_cache=True,
)
end = perf_counter()
first_eager_fwd_pass_time = end - start
logger.info(f"completed first eager fwd pass in: {first_eager_fwd_pass_time}s")
start = perf_counter()
output = model.generate(**inputs, do_sample=False)
end = perf_counter()
first_eager_generate_time = end - start
logger.info(f"completed first eager generation in: {first_eager_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + num_tokens_to_generate,
)
cache_position = torch.arange(seq_length, device=device)
start = perf_counter()
model(
**inputs,
cache_position=cache_position,
past_key_values=past_key_values,
return_dict=False,
use_cache=True,
)
end = perf_counter()
second_eager_fwd_pass_time = end - start
logger.info(f"completed second eager fwd pass in: {second_eager_fwd_pass_time}s")
start = perf_counter()
model.generate(**inputs, do_sample=False)
end = perf_counter()
second_eager_generate_time = end - start
logger.info(f"completed second eager generation in: {second_eager_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
torch.compiler.reset()
################
# Forward pass #
################
# `torch.compile(model, ...)` is not recommended as you compile callbacks
# and full generate. We recommend compiling only the forward for now.
# "reduce-overhead" will use cudagraphs.
generated_ids = torch.zeros(
(batch_size, num_tokens_to_generate + seq_length), dtype=torch.int, device=device
)
generated_ids[:, :seq_length] = inputs["input_ids"]
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
# model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
# TODO use decode_one_token(model, input_id.clone(), cache_position) for verification
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + num_tokens_to_generate + 10,
)
cache_position = torch.arange(seq_length, device=device)
all_generated_tokens = []
### First compile, prefill
start = perf_counter()
next_token = decode_one_token(
model, inputs["input_ids"], cache_position=cache_position, past_key_values=past_key_values
)
torch.cuda.synchronize()
end = perf_counter()
time_to_first_token = end - start
logger.info(f"completed first compile generation in: {time_to_first_token}s")
cache_position += 1
all_generated_tokens += next_token.tolist()
cache_position = torch.tensor([seq_length], device=device)
### First compile, decoding
start = perf_counter()
next_token = decode_one_token(
model, next_token.clone(), cache_position=cache_position, past_key_values=past_key_values
)
torch.cuda.synchronize()
end = perf_counter()
time_to_second_token = end - start
logger.info(f"completed second compile generation in: {time_to_second_token}s")
cache_position += 1
all_generated_tokens += next_token.tolist()
### Second compile, decoding
start = perf_counter()
next_token = decode_one_token(
model, next_token.clone(), cache_position=cache_position, past_key_values=past_key_values
)
torch.cuda.synchronize()
end = perf_counter()
time_to_third_token = end - start
logger.info(f"completed third compile forward in: {time_to_third_token}s")
cache_position += 1
all_generated_tokens += next_token.tolist()
### Using cuda graphs decoding
start = perf_counter()
for _ in range(1, num_tokens_to_generate):
all_generated_tokens += next_token.tolist()
next_token = decode_one_token(
model, next_token.clone(), cache_position=cache_position, past_key_values=past_key_values
)
cache_position += 1
torch.cuda.synchronize()
end = perf_counter()
mean_time_to_next_token = (end - start) / num_tokens_to_generate
logger.info(f"completed next compile generation in: {mean_time_to_next_token}s")
logger.info(f"generated: {tokenizer.batch_decode(all_generated_tokens)}")
####################
# Generate compile #
####################
torch.compiler.reset()
# We do not compile the full generate call because it is too intensive, though we still measure the full forward pass.
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + 128,
)
# 1st call
start = perf_counter()
output = model.generate(**inputs, past_key_values=past_key_values)
torch.cuda.synchronize()
end = perf_counter()
first_compile_generate_time = end - start
logger.info(f"completed first compile generation in: {first_compile_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + 128,
)
# 2nd call
start = perf_counter()
output = model.generate(**inputs, past_key_values=past_key_values)
torch.cuda.synchronize()
end = perf_counter()
second_compile_generate_time = end - start
logger.info(f"completed second compile generation in: {second_compile_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + 128,
)
# 3rd call
start = perf_counter()
output = model.generate(**inputs, past_key_values=past_key_values)
end = perf_counter()
third_compile_generate_time = end - start
logger.info(f"completed third compile generation in: {third_compile_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
past_key_values = StaticCache(
model.config,
max_batch_size=batch_size,
device=device,
dtype=torch.float16,
max_cache_len=seq_length + 128,
)
# 4th call
start = perf_counter()
output = model.generate(**inputs, past_key_values=past_key_values)
end = perf_counter()
fourth_compile_generate_time = end - start
logger.info(f"completed fourth compile generation in: {fourth_compile_generate_time}s")
logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")
metrics_recorder.collect_model_measurements(
benchmark_id,
{
"model_load_time": model_load_time,
"first_eager_forward_pass_time_secs": first_eager_fwd_pass_time,
"second_eager_forward_pass_time_secs": second_eager_fwd_pass_time,
"first_eager_generate_time_secs": first_eager_generate_time,
"second_eager_generate_time_secs": second_eager_generate_time,
"time_to_first_token_secs": time_to_first_token,
"time_to_second_token_secs": time_to_second_token,
"time_to_third_token_secs": time_to_third_token,
"time_to_next_token_mean_secs": mean_time_to_next_token,
"first_compile_generate_time_secs": first_compile_generate_time,
"second_compile_generate_time_secs": second_compile_generate_time,
"third_compile_generate_time_secs": third_compile_generate_time,
"fourth_compile_generate_time_secs": fourth_compile_generate_time,
},
)
except Exception as e:
logger.error(f"Caught exception: {e}")
continue_metric_collection.set()
if metrics_thread is not None:
metrics_thread.join()
metrics_recorder.close()
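
The measurements above all repeat one pattern: read `perf_counter()`, run the call, wait for queued GPU work with `torch.cuda.synchronize()`, then read `perf_counter()` again. Below is a minimal sketch of that pattern as a reusable helper. `timed`, its `sync` parameter, and `fake_decode_step` are hypothetical names that do not appear in the benchmark script, and the workload is a CPU-only stand-in so the example runs without a GPU.

```python
from time import perf_counter, sleep


def timed(fn, *args, sync=None, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds).

    `sync` is an optional zero-argument callable invoked before the clock is
    stopped. On a CUDA machine one would pass `torch.cuda.synchronize` so that
    asynchronously queued kernels are included in the measurement.
    """
    start = perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()
    elapsed = perf_counter() - start
    return result, elapsed


def fake_decode_step(token):
    # Hypothetical stand-in for one decoding step; sleeps instead of running a model.
    sleep(0.01)
    return token + 1


sync_calls = []  # records that the sync hook ran before the timer stopped
next_token, step_time = timed(fake_decode_step, 41, sync=lambda: sync_calls.append(1))
print(f"next_token={next_token}, took {step_time:.4f}s")
```

Passing the synchronization step as a callable keeps the helper usable on CPU-only runs, where no device synchronization is needed.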

View File

@ -2,4 +2,5 @@ gpustat==1.1.1
psutil==6.0.0
psycopg2==2.9.9
torch>=2.4.0
hf_transfer
hf_transfer
pandas>=1.5.0

View File

@ -23,13 +23,17 @@ from os.path import abspath, dirname, join
import _pytest
import pytest
from transformers.testing_utils import HfDoctestModule, HfDocTestParser
from transformers.testing_utils import (
HfDoctestModule,
HfDocTestParser,
is_torch_available,
patch_torch_compile_force_graph,
)
NOT_DEVICE_TESTS = {
"test_tokenization",
"test_tokenization_mistral_common",
"test_processor",
"test_processing",
"test_beam_constraints",
"test_configuration_utils",
@ -84,6 +88,8 @@ def pytest_configure(config):
config.addinivalue_line("markers", "is_staging_test: mark test to run only in the staging environment")
config.addinivalue_line("markers", "accelerate_tests: mark test that require accelerate")
config.addinivalue_line("markers", "not_device_test: mark the tests always running on cpu")
config.addinivalue_line("markers", "torch_compile_test: mark test which tests torch compile functionality")
config.addinivalue_line("markers", "torch_export_test: mark test which tests torch export functionality")
def pytest_collection_modifyitems(items):
@ -128,3 +134,14 @@ class CustomOutputChecker(OutputChecker):
doctest.OutputChecker = CustomOutputChecker
_pytest.doctest.DoctestModule = HfDoctestModule
doctest.DocTestParser = HfDocTestParser
if is_torch_available():
import torch
# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
# We set it to `False` for CI. See https://github.com/pytorch/pytorch/issues/157274#issuecomment-3090791615
torch.backends.cudnn.allow_tf32 = False
# patch `torch.compile`: if `TORCH_COMPILE_FORCE_FULLGRAPH=1` (or values considered as true, e.g. yes, y, etc.),
# the patched version will always run with `fullgraph=True`.
patch_torch_compile_force_graph()
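
The implementation of `patch_torch_compile_force_graph` is not shown in this diff. The following is only a sketch of how an environment-controlled patch of this kind could work, written against a stand-in namespace instead of the real `torch` module. The helper names, the exact set of accepted true values, and the assumption that `fullgraph` is always passed as a keyword are all illustrative.

```python
import functools
import os
from types import SimpleNamespace

# Assumed set of "true" strings; the real helper may accept a different set.
TRUTHY = {"1", "true", "t", "yes", "y", "on"}


def env_flag(name):
    """Return True if the environment variable holds a truthy string."""
    return os.environ.get(name, "").strip().lower() in TRUTHY


def patch_compile_force_fullgraph(namespace, env_var="TORCH_COMPILE_FORCE_FULLGRAPH"):
    """Replace namespace.compile with a wrapper that forces fullgraph=True.

    Does nothing (and returns False) when the environment flag is not set.
    """
    if not env_flag(env_var):
        return False
    original_compile = namespace.compile

    @functools.wraps(original_compile)
    def compile_with_fullgraph(*args, **kwargs):
        kwargs["fullgraph"] = True  # override whatever the caller requested
        return original_compile(*args, **kwargs)

    namespace.compile = compile_with_fullgraph
    return True


def fake_compile(fn, fullgraph=False, mode=None):
    # Stand-in for a compile function: records the options it received.
    fn.compile_options = {"fullgraph": fullgraph, "mode": mode}
    return fn


def double(x):
    return 2 * x


fake_torch = SimpleNamespace(compile=fake_compile)
os.environ["TORCH_COMPILE_FORCE_FULLGRAPH"] = "yes"
patched = patch_compile_force_fullgraph(fake_torch)
compiled = fake_torch.compile(double, fullgraph=False, mode="reduce-overhead")
print(patched, compiled.compile_options)
```

Forcing `fullgraph=True` in CI turns silent graph breaks into hard errors, which is presumably why the patch is applied once in `conftest.py` and not per test.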

View File

@ -4,7 +4,7 @@ USER root
ARG REF=main
RUN apt-get update && apt-get install -y time git g++ pkg-config make git-lfs
ENV UV_PYTHON=/usr/local/bin/python
RUN pip install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools GitPython
RUN pip install uv && uv pip install --no-cache-dir -U pip setuptools GitPython
RUN uv pip install --no-cache-dir --upgrade 'torch' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
# tensorflow pin matching setup.py
RUN uv pip install --no-cache-dir pypi-kenlm

View File

@ -4,7 +4,7 @@ ARG REF=main
USER root
RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git cmake wget xz-utils build-essential g++5 libprotobuf-dev protobuf-compiler
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN wget https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz
RUN tar xvf jumanpp-2.0.0-rc3.tar.xz
@ -20,7 +20,7 @@ RUN uv pip install --no-cache --upgrade 'torch' --index-url https://download.pyt
RUN uv pip install --no-cache-dir --no-deps accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[ja,testing,sentencepiece,jieba,spacy,ftfy,rjieba]" unidic unidic-lite
# spacy is not used, so it is not tested. It causes failures. TODO: fix later
RUN python3 -m unidic download
RUN uv run python -m unidic download
RUN uv pip uninstall transformers
RUN apt-get clean && rm -rf /var/lib/apt/lists/*

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git
RUN apt-get install -y g++ cmake
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv
RUN pip --no-cache-dir install uv
RUN uv pip install --no-cache-dir -U pip setuptools albumentations seqeval
RUN uv pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[tf-cpu,sklearn,testing,sentencepiece,tf-speech,vision]"
RUN uv pip install --no-cache-dir "protobuf==3.20.3"

View File

@ -4,7 +4,7 @@ ARG REF=main
USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git ffmpeg
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]" seqeval albumentations jiwer

View File

@ -4,10 +4,10 @@ ARG REF=main
USER root
RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git libgl1-mesa-glx libgl1 g++ tesseract-ocr
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir --no-deps timm accelerate
RUN pip install -U --upgrade-strategy eager --no-cache-dir pytesseract python-Levenshtein opencv-python nltk
RUN uv pip install -U --upgrade-strategy eager --no-cache-dir pytesseract python-Levenshtein opencv-python nltk
# RUN uv pip install --no-cache-dir natten==0.15.1+torch210cpu -f https://shi-labs.com/natten/wheels
RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[testing, vision]" 'scikit-learn' 'torch-stft' 'nose' 'dataset'
# RUN git clone https://github.com/facebookresearch/detectron2.git

View File

@ -4,7 +4,7 @@ ARG REF=main
USER root
RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git g++ cmake
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir "scipy<1.13" "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[flax,testing,sentencepiece,flax-speech,vision]"
RUN uv pip uninstall transformers
RUN apt-get clean && rm -rf /var/lib/apt/lists/* && apt-get autoremove && apt-get autoclean

View File

@ -4,7 +4,7 @@ ARG REF=main
USER root
RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git cmake g++
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,tf-cpu,testing,sentencepiece,tf-speech,vision]"
RUN uv pip install --no-cache-dir "protobuf==3.20.3" tensorflow_probability
RUN apt-get clean && rm -rf /var/lib/apt/lists/*

View File

@ -4,7 +4,7 @@ ARG REF=main
USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git pkg-config openssh-client git ffmpeg
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]"

View File

@ -2,8 +2,8 @@ FROM python:3.9-slim
ENV PYTHONDONTWRITEBYTECODE=1
ARG REF=main
USER root
RUN apt-get update && apt-get install -y time git
RUN apt-get update && apt-get install -y time git
ENV UV_PYTHON=/usr/local/bin/python
RUN pip install uv && uv venv
RUN pip install uv
RUN uv pip install --no-cache-dir -U pip setuptools GitPython "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[ruff]" urllib3
RUN apt-get install -y jq curl && apt-get clean && rm -rf /var/lib/apt/lists/*

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ pkg-config openssh-client git
RUN apt-get install -y cmake
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[tf-cpu,sklearn,testing,sentencepiece,tf-speech,vision]"
RUN uv pip install --no-cache-dir "protobuf==3.20.3"
RUN uv pip uninstall transformers

View File

@ -4,7 +4,7 @@ ARG REF=main
USER root
RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-deps accelerate
RUN uv pip install --no-cache-dir 'torch' 'torchvision' 'torchaudio' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir "scipy<1.13" "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[flax,audio,sklearn,sentencepiece,vision,testing]"

View File

@ -4,7 +4,7 @@ ARG REF=main
USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git git-lfs ffmpeg
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing,tiktoken,num2words,video]"

View File

@ -5,7 +5,7 @@ RUN echo ${REF}
USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git git-lfs
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir --no-deps accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN git lfs install

View File

@ -9,7 +9,7 @@ SHELL ["sh", "-lc"]
# The following `ARG` are mainly used to specify the versions explicitly & directly in this docker file, and not meant
# to be used as arguments for docker build (so far).
ARG PYTORCH='2.7.1'
ARG PYTORCH='2.8.0'
# Example: `cu102`, `cu113`, etc.
ARG CUDA='cu126'
# Disable kernel mapping for now until all tests pass
@ -32,7 +32,10 @@ RUN python3 -m pip uninstall -y flax jax
RUN python3 -m pip install --no-cache-dir -U timm
RUN python3 -m pip install --no-cache-dir git+https://github.com/facebookresearch/detectron2.git pytesseract
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir git+https://github.com/facebookresearch/detectron2.git || echo "Don't install detectron2 with nightly torch"
RUN python3 -m pip install --no-cache-dir pytesseract
RUN python3 -m pip install -U "itsdangerous<2.1.0"
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
@ -41,6 +44,8 @@ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/pef
# For bettertransformer
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum
# For kernels
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/kernels@main#egg=kernels
# For video model testing
RUN python3 -m pip install --no-cache-dir av
@ -51,15 +56,14 @@ RUN python3 -m pip install --no-cache-dir bitsandbytes
# Some tests require quanto
RUN python3 -m pip install --no-cache-dir quanto
# After using A10 as CI runner, let's run FA2 tests
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip uninstall -y ninja && python3 -m pip install --no-cache-dir ninja && python3 -m pip install flash-attn --no-cache-dir --no-build-isolation || echo "Don't install FA2 with nightly torch"
# TODO (ydshieh): check this again
# `quanto` will install `ninja` which leads to many `CUDA error: an illegal memory access ...` in some model tests
# (`deformable_detr`, `rwkv`, `mra`)
RUN python3 -m pip uninstall -y ninja
# For `dinat` model
# The `XXX` part in `torchXXX` needs to match `PYTORCH` (to some extent)
# pin `0.17.4` otherwise `cannot import name 'natten2dav' from 'natten.functional'`
RUN python3 -m pip install --no-cache-dir natten==0.17.4+torch250cu121 -f https://shi-labs.com/natten/wheels
# For `nougat` tokenizer
RUN python3 -m pip install --no-cache-dir python-Levenshtein

View File

@ -17,6 +17,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
jupyter \
tensorflow \
torch
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/kernels@main#egg=kernels
RUN git clone https://github.com/NVIDIA/apex
RUN cd apex && \

View File

@ -4,7 +4,7 @@ LABEL maintainer="Hugging Face"
ARG DEBIAN_FRONTEND=noninteractive
ARG PYTORCH='2.7.1'
ARG PYTORCH='2.8.0'
# Example: `cu102`, `cu113`, etc.
ARG CUDA='cu126'

View File

@ -11,7 +11,7 @@ ARG REF=main
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
# If set to nothing, will install the latest version
ARG PYTORCH='2.7.1'
ARG PYTORCH='2.8.0'
ARG TORCH_VISION=''
ARG TORCH_AUDIO=''
# Example: `cu102`, `cu113`, etc.

View File

@ -79,7 +79,8 @@ RUN git clone https://github.com/NetEase-FuXi/EETQ.git && cd EETQ/ && git submod
# RUN python3 -m pip install --no-cache-dir git+https://github.com/Dao-AILab/fast-hadamard-transform.git
# Add fp-quant for quantization testing
RUN python3 -m pip install --no-cache-dir "fp-quant>=0.1.6"
# Requires py3.11 but our CI runs on 3.9
# RUN python3 -m pip install --no-cache-dir "fp-quant>=0.1.6"
# Add compressed-tensors for quantization testing
RUN python3 -m pip install --no-cache-dir compressed-tensors

View File

@ -20,22 +20,21 @@ To generate the documentation, you first have to build it. Several packages are
you can install them with the following command, at the root of the code repository:
```bash
pip install -e ".[docs]"
pip install -e ".[dev]"
```
> [!NOTE]
> This command might fail on operating systems that are missing dependencies. Check step 4 in [Create a Pull Request](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request) to work around it.
Then you need to install our special tool that builds the documentation:
```bash
pip install git+https://github.com/huggingface/doc-builder
```
---
**NOTE**
You only need to generate the documentation to inspect it locally (if you're planning changes and want to
check how they look before committing for instance). You don't have to commit the built documentation.
---
> [!NOTE]
> You only need to generate the documentation to inspect it locally (if you're planning changes and want to
> check how they look before committing for instance). You don't have to commit the built documentation.
## Building the documentation
@ -72,12 +71,8 @@ doc-builder preview transformers docs/source/en/
The docs will be viewable at [http://localhost:3000](http://localhost:3000). You can also preview the docs once you have opened a PR. A bot will add a comment with a link to where the documentation with your changes lives.
---
**NOTE**
The `preview` command only works with existing doc files. When you add a completely new file, you need to update `_toctree.yml` and restart the `preview` command (`ctrl-c` to stop it, then call `doc-builder preview ...` again).
---
> [!NOTE]
> The `preview` command only works with existing doc files. When you add a completely new file, you need to update `_toctree.yml` and restart the `preview` command (`ctrl-c` to stop it, then call `doc-builder preview ...` again).
## Adding a new element to the navigation bar
@ -164,6 +159,9 @@ These classes should be added using our Markdown syntax. Usually as follows:
[[autodoc]] XXXConfig
```
> [!IMPORTANT]
> Always add a blank line after `[[autodoc]]` to ensure it passes the CI/CD checks.
This will include every public method of the configuration that is documented. If for some reason you wish for a method
not to be displayed in the documentation, you can do so by specifying which methods should be in the docs:

View File

@ -304,7 +304,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "NousResearch/Hermes-2-Pro-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto")
```python
messages = [

View File

@ -25,7 +25,7 @@ chat = [
import torch
from transformers import pipeline
pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", dtype=torch.bfloat16, device_map="auto")
response = pipe(chat, max_new_tokens=512)
print(response[0]['generated_text'][-1]['content'])
```
@ -126,7 +126,7 @@ chat = [
]
# 1: Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", torch_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# 2: Apply the chat template
@ -164,7 +164,7 @@ print("Decoded output:\n", decoded_output)
### Memory considerations
By default, Hugging Face classes like [`TextGenerationPipeline`] or [`AutoModelForCausalLM`] load the model in "float32" precision. This means it needs 4 bytes (32 bits) per parameter, so an "8B" model with 8 billion parameters will need ~32GB of memory. However, this can be wasteful! Most modern language models are trained in "bfloat16" precision, which uses only 2 bytes per parameter. If your hardware supports it (Nvidia 30xx/Axxx or newer), you can load the model in "bfloat16" precision, using the "torch_dtype" argument as we did above.
By default, Hugging Face classes like [`TextGenerationPipeline`] or [`AutoModelForCausalLM`] load the model in "float32" precision. This means it needs 4 bytes (32 bits) per parameter, so an "8B" model with 8 billion parameters will need ~32GB of memory. However, this can be wasteful! Most modern language models are trained in "bfloat16" precision, which uses only 2 bytes per parameter. If your hardware supports it (Nvidia 30xx/Axxx or newer), you can load the model in "bfloat16" precision, using the "dtype" argument as we did above.
It is also possible to go below 16 bits using "quantization", a method that compresses the model weights in a lossy way. This allows each parameter to be squeezed down to 8 bits, 4 bits, or even less. Note that, especially at 4 bits, the quality of the model's output may suffer, but this is often a tradeoff worth making to fit a larger and more capable chat model in memory. Let's see how we can apply this using the `bitsandbytes` library:
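
The arithmetic behind these figures is parameter count times bytes per parameter. A minimal sketch, assuming 1 GB = 10^9 bytes and counting weights only (activations and the KV cache are ignored); the helper name `weight_memory_gb` is hypothetical:

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params, dtype):
    """Approximate memory needed for the model weights alone, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 8_000_000_000  # an "8B" model
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gb(num_params, dtype):.0f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is why loading in `bfloat16` is usually the first thing to try.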

View File

@ -13,11 +13,11 @@
In this guide, we will go over the effective techniques for efficient LLM deployment:
1. We will cover the "lower precision" technique: research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](/main_classes/quantization.md), can achieve computational advantages without a noticeable decline in model performance.
1. We will cover the "lower precision" technique: research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](/main_classes/quantization), can achieve computational advantages without a noticeable decline in model performance.
2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also achieves increased efficiency through optimized use of GPU memory.
3. **Architectural Innovations:** Because LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancements in model architectures here are [Alibi](https://huggingface.co/papers/2108.12409), [Rotary embeddings](https://huggingface.co/papers/2104.09864), [Multi-Query Attention (MQA)](https://huggingface.co/papers/1911.02150) and [Grouped-Query Attention (GQA)]((https://huggingface.co/papers/2305.13245)).
3. **Architectural Innovations:** Because LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancements in model architectures here are [Alibi](https://huggingface.co/papers/2108.12409), [Rotary embeddings](https://huggingface.co/papers/2104.09864), [Multi-Query Attention (MQA)](https://huggingface.co/papers/1911.02150) and [Grouped-Query Attention (GQA)](https://huggingface.co/papers/2305.13245).
Throughout this guide, we will offer an analysis of autoregressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. We support the explanation with practical examples that showcase each improvement individually.
@ -73,7 +73,7 @@ model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="aut
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto", pad_token_id=0)
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", dtype=torch.bfloat16, device_map="auto", pad_token_id=0)
tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
@ -114,7 +114,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
> Almost all models are trained in bfloat16 nowadays, so there is no reason to run the model in full float32 precision if [your GPU supports bfloat16](https://discuss.pytorch.org/t/bfloat16-native-support/117155/5). Float32 won't give better inference results than the precision that was used to train the model.
If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"torch_dtype"`, for example [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., torch_dtype=...)`, except when the original type is float32, in which case `float16` or `bfloat16` can be used for inference.
If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"dtype"`, for example [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., dtype=...)`, except when the original type is float32, in which case `float16` or `bfloat16` can be used for inference.
Let's define a `flush(...)` function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.
@ -389,7 +389,7 @@ long_prompt = 10 * system_prompt + prompt
We run our model again in bfloat16 precision.
```python
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto")
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

View File

@ -90,7 +90,7 @@ out = transcriber(...) # This will fall back to using `my_parameter=1`.
transcriber = pipeline(model="openai/whisper-large-v2", device=0)
```
If the model is too large for a single GPU and you are using PyTorch, you can set `torch_dtype='float16'` to enable FP16 precision inference. Usually this does not cause significant performance drops, but make sure you evaluate it on your models!
If the model is too large for a single GPU and you are using PyTorch, you can set `dtype='float16'` to enable FP16 precision inference. Usually this does not cause significant performance drops, but make sure you evaluate it on your models!
Alternatively, you can set `device_map="auto"` to automatically determine how to load and store the model weights. Using the `device_map` argument requires the 🤗 [Accelerate](https://huggingface.co/docs/accelerate) package:
@ -273,7 +273,7 @@ pip install pytesseract
import torch
from transformers import pipeline
pipe = pipeline(model="facebook/opt-1.3b", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline(model="facebook/opt-1.3b", dtype=torch.bfloat16, device_map="auto")
output = pipe("This is a cool example!", do_sample=True, top_p=0.95)
```

View File

@ -81,14 +81,26 @@
- local: conversations
title: Chat basics
- local: chat_templating
title: Templates
title: Chat templates
- local: chat_templating_multimodal
title: Multimodal templates
- local: chat_templating_writing
title: Template writing
title: Multimodal chat templates
- local: chat_extras
title: Tools and RAG
title: Tool use
- local: chat_templating_writing
title: Writing a chat template
title: Chat with models
- sections:
- local: serving
title: Serving LLMs, VLMs, and other chat-based models
- local: jan
title: Jan
- local: cursor
title: Cursor
- local: tiny_agents
title: Tiny-Agents CLI and MCP tools
- local: open_webui
title: Open WebUI
title: Serving
- sections:
- local: perf_torch_compile
title: torch.compile
@ -103,8 +115,6 @@
title: Agents
- local: tools
title: Tools
- local: serving
title: Serving
- local: transformers_as_backend
title: Inference server backends
title: Inference
@ -363,6 +373,8 @@
- sections:
- local: model_doc/albert
title: ALBERT
- local: model_doc/apertus
title: Apertus
- local: model_doc/arcee
title: Arcee
- local: model_doc/bamba
@ -501,6 +513,8 @@
title: GPT2
- local: model_doc/gpt_bigcode
title: GPTBigCode
- local: model_doc/gpt_oss
title: GptOss
- local: model_doc/gptsan-japanese
title: GPTSAN Japanese
- local: model_doc/gpt-sw3
@ -519,6 +533,10 @@
title: HerBERT
- local: model_doc/hgnet_v2
title: HGNet-V2
- local: model_doc/hunyuan_v1_dense
title: HunYuanDenseV1
- local: model_doc/hunyuan_v1_moe
title: HunYuanMoEV1
- local: model_doc/ibert
title: I-BERT
- local: model_doc/jamba
@ -659,6 +677,8 @@
title: RoFormer
- local: model_doc/rwkv
title: RWKV
- local: model_doc/seed_oss
title: Seed-Oss
- local: model_doc/splinter
title: Splinter
- local: model_doc/squeezebert
@ -753,6 +773,8 @@
title: DINOV2
- local: model_doc/dinov2_with_registers
title: DINOv2 with Registers
- local: model_doc/dinov3
title: DINOv3
- local: model_doc/dit
title: DiT
- local: model_doc/dpt
@ -769,6 +791,8 @@
title: FocalNet
- local: model_doc/glpn
title: GLPN
- local: model_doc/hgnet_v2
title: HGNet-V2
- local: model_doc/hiera
title: Hiera
- local: model_doc/ijepa
@ -929,6 +953,8 @@
title: WavLM
- local: model_doc/whisper
title: Whisper
- local: model_doc/xcodec
title: X-Codec
- local: model_doc/xls_r
title: XLS-R
- local: model_doc/xlsr_wav2vec2
@ -971,6 +997,8 @@
title: CLIPSeg
- local: model_doc/clvp
title: CLVP
- local: model_doc/cohere2_vision
title: Cohere2Vision
- local: model_doc/colpali
title: ColPali
- local: model_doc/colqwen2
@ -987,6 +1015,8 @@
title: Evolla
- local: model_doc/flava
title: FLAVA
- local: model_doc/florence2
title: Florence2
- local: model_doc/gemma3
title: Gemma3
- local: model_doc/gemma3n
@ -995,6 +1025,8 @@
title: GIT
- local: model_doc/glm4v
title: glm4v
- local: model_doc/glm4v_moe
title: glm4v_moe
- local: model_doc/got_ocr2
title: GOT-OCR2
- local: model_doc/granitevision
@ -1019,6 +1051,8 @@
title: Janus
- local: model_doc/kosmos-2
title: KOSMOS-2
- local: model_doc/kosmos2_5
title: KOSMOS-2.5
- local: model_doc/layoutlm
title: LayoutLM
- local: model_doc/layoutlmv2
@ -1032,7 +1066,7 @@
- local: model_doc/llama4
title: Llama4
- local: model_doc/llava
title: Llava
title: LLaVA
- local: model_doc/llava_next
title: LLaVA-NeXT
- local: model_doc/llava_next_video
@ -1043,18 +1077,24 @@
title: LXMERT
- local: model_doc/matcha
title: MatCha
- local: model_doc/metaclip_2
title: MetaCLIP 2
- local: model_doc/mgp-str
title: MGP-STR
- local: model_doc/mistral3
title: Mistral3
- local: model_doc/mllama
title: mllama
- local: model_doc/mm-grounding-dino
title: MM Grounding DINO
- local: model_doc/nougat
title: Nougat
- local: model_doc/omdet-turbo
title: OmDet-Turbo
- local: model_doc/oneformer
title: OneFormer
- local: model_doc/ovis2
title: Ovis2
- local: model_doc/owlvit
title: OWL-ViT
- local: model_doc/owlv2
@ -1079,6 +1119,10 @@
title: Qwen2Audio
- local: model_doc/qwen2_vl
title: Qwen2VL
- local: model_doc/sam2
title: SAM2
- local: model_doc/sam2_video
title: SAM2 Video
- local: model_doc/sam
title: Segment Anything
- local: model_doc/sam_hq

View File

@ -100,19 +100,18 @@ pipeline("This is the best meal I've ever had")
Register the new task your pipeline supports in the `PIPELINE_REGISTRY`. The registry defines:
- the machine learning framework the pipeline supports with either `pt_model` or `tf_model` (add both to ensure it works with either framework)
- the supported PyTorch model class with `pt_model`
- a default model which should come from a specific revision (branch, or commit hash) where the model works as expected with `default`
- the expected input with `type`
```py
from transformers.pipelines import PIPELINE_REGISTRY
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoModelForSequenceClassification
PIPELINE_REGISTRY.register_pipeline(
"new-task",
pipeline_class=MyPipeline,
pt_model=AutoModelForSequenceClassification,
tf_model=TFAutoModelForSequenceClassification,
default={"pt": ("user/awesome-model", "branch-name")},
type="text",
)
@ -128,7 +127,7 @@ It's faster to upload your pipeline code to the Hub because it doesn't require a
Add your pipeline code to the Hub in a Python file.
For example, a custom pipeline for sentence pair classification might look like the following code below. The implementation works for PyTorch and TensorFlow models.
For example, a custom pipeline for sentence pair classification might look like the code below.
```py
import numpy as np
@ -168,13 +167,12 @@ Save the code in a file named `pair_classification.py`, and import and register
```py
from pair_classification import PairClassificationPipeline
from transformers.pipelines import PIPELINE_REGISTRY
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoModelForSequenceClassification
PIPELINE_REGISTRY.register_pipeline(
"pair-classification",
pipeline_class=PairClassificationPipeline,
pt_model=AutoModelForSequenceClassification,
tf_model=TFAutoModelForSequenceClassification,
)
```
@ -187,9 +185,6 @@ The [register_pipeline](https://github.com/huggingface/transformers/blob/9feae5f
"pt": [
"AutoModelForSequenceClassification"
],
"tf": [
"TFAutoModelForSequenceClassification"
],
}
},
```
@ -219,11 +214,11 @@ Add your pipeline code as a new module to the [pipelines](https://github.com/hug
Next, add a new test for the pipeline in [transformers/tests/pipelines](https://github.com/huggingface/transformers/tree/main/tests/pipelines). You can look at the other tests for examples of how to test your pipeline.
The [run_pipeline_test](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_text_classification.py#L186) function should be very generic and run on the models defined in [model_mapping](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_text_classification.py#L48) and [tf_model_mapping](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_text_classification.py#L49). This is important for testing future compatibility with new models.
The [run_pipeline_test](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_text_classification.py#L186) function should be very generic and run on the models defined in [model_mapping](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_text_classification.py#L48). This is important for testing future compatibility with new models.
You'll also notice `ANY` is used throughout the [run_pipeline_test](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_text_classification.py#L186) function. The models are random, so you can't check the actual values. Using `ANY` allows the test to match the output of the pipeline type instead.
Finally, you should also implement the following 4 tests.
1. [test_small_model_pt](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_text_classification.py#L59) and [test_small_model_tf](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_text_classification.py#L150), use a small model for these pipelines to make sure they return the correct outputs. The results don't have to make sense. Each pipeline should return the same result.
1. [test_large_model_pt](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_zero_shot_image_classification.py#L187) and [test_large_model_tf](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_zero_shot_image_classification.py#L220), use a realistic model for these pipelines to make sure they return meaningful results. These tests are slow and should be marked as slow.
1. [test_small_model_pt](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_text_classification.py#L59), use a small model for these pipelines to make sure they return the correct outputs. The results don't have to make sense. Each pipeline should return the same result.
1. [test_large_model_pt](https://github.com/huggingface/transformers/blob/db70426854fe7850f2c5834d633aff637f14772e/tests/pipelines/test_pipelines_zero_shot_image_classification.py#L187), use a realistic model for these pipelines to make sure they return meaningful results. These tests are slow and should be marked as slow.

View File

@ -290,7 +290,7 @@ The `@auto_docstring` decorator automatically generates docstrings by:
7. Adding return values to the docstring. For methods like `forward`, the decorator automatically generates the `Returns` field in the docstring based on the method's return type annotation.
For example, if a method returns a [`~transformers.utils.ModelOutput`] subclass, `@auto_docstring` extracts the field descriptions from the class' docstring to create a comprehensive return value description. You can also manually specifiy a custom `Returns` field in a functions docstring.
For example, if a method returns a [`~transformers.utils.ModelOutput`] subclass, `@auto_docstring` extracts the field descriptions from the class' docstring to create a comprehensive return value description. You can also manually specify a custom `Returns` field in a function's docstring.
8. Unrolling kwargs typed with the unpack operator. For specific methods (defined in `UNROLL_KWARGS_METHODS`) or classes (defined in `UNROLL_KWARGS_CLASSES`), the decorator processes `**kwargs` parameters that are typed with `Unpack[KwargsTypedDict]`. It extracts the documentation from the `TypedDict` and adds each parameter to the function's docstring.

View File

@ -15,6 +15,7 @@ rendered properly in your Markdown viewer.
-->
# Caching
Imagine you're having a conversation with someone, and instead of remembering what they previously said, they have to start from scratch every time you respond. This would be slow and inefficient, right?
You can extend this analogy to transformer models. Autoregressive model generation can be slow because it makes a prediction one token at a time. Each new prediction is dependent on all the previous context.
@ -99,18 +100,20 @@ The example below demonstrates how to create a generation loop with [`DynamicCac
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache, infer_device
device = f"{infer_device()}:0"
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
past_key_values = DynamicCache()
past_key_values = DynamicCache(config=model.config)
messages = [{"role": "user", "content": "Hello, what's your name."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda:0")
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
generated_ids = inputs.input_ids
cache_position = torch.arange(inputs.input_ids.shape[1], dtype=torch.int64, device="cuda:0")
cache_position = torch.arange(inputs.input_ids.shape[1], dtype=torch.int64, device=model.device)
max_new_tokens = 10
for _ in range(max_new_tokens):
@ -136,21 +139,23 @@ The cache position tracks where to insert new tokens in the attention cache. It
Cache position is used internally for two purposes:
1. Selecting new tokens to process in the input sequence and ensuring only tokens that haven't been cached yet are passed to the model's `forward`.
2. Storing key/value pairs at the correct positions in the cache. This is especially important for fixed-size caches, like [`StaticCache`], that pre-allocates a specific cache length.
2. Storing key/value pairs at the correct positions in the cache. This is especially important for fixed-size caches, which pre-allocate a specific cache length.
The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache, infer_device
device = f"{infer_device()}:0"
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "You are a helpful assistant."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda:0")
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10)
```
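The way cache positions advance during generation can be illustrated without a model. The snippet below is a minimal sketch in plain Python, assuming a prefill pass over the whole prompt followed by one new token per step; `cache_positions` is a hypothetical helper written for illustration and is not part of Transformers.

```python
def cache_positions(prompt_length, max_new_tokens):
    """Yield the cache positions written at each forward pass (illustrative only)."""
    # Prefill: no prompt token is cached yet, so positions 0..prompt_length-1 are written.
    positions = list(range(prompt_length))
    yield positions
    for _ in range(max_new_tokens - 1):
        # Decoding: only the newest token is uncached, so the position advances by one slot.
        positions = [positions[-1] + 1]
        yield positions


steps = list(cache_positions(prompt_length=4, max_new_tokens=3))
print(steps)  # [[0, 1, 2, 3], [4], [5]]
```

In a fixed-size cache each of these positions maps to a pre-allocated slot, which is why an off-by-one error in a custom loop would overwrite or skip cached key/value states.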
@ -172,7 +177,7 @@ import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", dtype=torch.float16, device_map="auto")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
# `return_dict_in_generate=True` is required to return the cache and `return_legacy_cache` forces the returned cache

View File

@ -14,64 +14,64 @@ rendered properly in your Markdown viewer.
-->
# Tools and RAG
# Tool use
The [`~PreTrainedTokenizerBase.apply_chat_template`] method supports virtually any additional argument types - strings, lists, dicts - besides the chat message. This makes it possible to use chat templates for many use cases.
Chat models are commonly trained with support for "function-calling" or "tool-use". Tools are functions supplied by the user, which the model can choose to call as part of its response. For example, models could have access to a calculator tool to perform arithmetic without having to do it internally.
This guide will demonstrate how to use chat templates with tools and retrieval-augmented generation (RAG).
This guide will demonstrate how to define tools, how to pass them to a chat model, and how to handle the model's output when it calls a tool.
## Tools
## Passing tools
Tools are functions a large language model (LLM) can call to perform specific tasks. It is a powerful way to extend the capabilities of conversational agents with real-time information, computational tools, or access to large databases.
When a model supports tool-use, pass functions to the `tools` argument of [`~PreTrainedTokenizerBase.apply_chat_template`].
The tools are passed as either a [JSON schema](https://json-schema.org/learn) or Python functions. If you pass Python functions,
the arguments, argument types, and function docstring are parsed in order to generate the JSON schema automatically.
Follow the rules below when creating a tool.
Although passing Python functions is very convenient, the parser can only handle [Google-style](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)
docstrings. Refer to the examples below for how to format a tool-ready function.
1. The function should have a descriptive name.
2. The function arguments must have a type hint in the function header (don't include in the `Args` block).
3. The function must have a [Google-style](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) docstring.
4. The function can have a return type and `Returns` block, but these are optional because most tool use models ignore them.
An example tool to get temperature and wind speed is shown below.
```py
def get_current_temperature(location: str, unit: str) -> float:
def get_current_temperature(location: str, unit: str):
"""
Get the current temperature at a location.
Args:
location: The location to get the temperature for, in the format "City, Country"
unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"])
Returns:
The current temperature at the specified location in the specified units, as a float.
"""
return 22. # A real function should probably actually get the temperature!
def get_current_wind_speed(location: str) -> float:
def get_current_wind_speed(location: str):
"""
Get the current wind speed in km/h at a given location.
Args:
location: The location to get the temperature for, in the format "City, Country"
Returns:
The current wind speed at the given location in km/h, as a float.
location: The location to get the wind speed for, in the format "City, Country"
"""
return 6. # A real function should probably actually get the wind speed!
tools = [get_current_temperature, get_current_wind_speed]
```
You can optionally add a `Returns:` block to the docstring and a return type to the function header, but most models won't use this information. The parser will also ignore the actual code inside the function!
What really matters is the function name, argument names, argument types, and docstring describing the function's purpose
and the purpose of its arguments. These create the "signature" the model will use to decide whether to call the tool.
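To make the idea of a "signature" concrete, the sketch below collects the same pieces of information (function name, argument names, argument types, and docstring) using only the standard library `inspect` module. It is an illustration under stated assumptions, not the parser Transformers uses; `describe_tool` is a hypothetical helper.

```python
import inspect


def get_current_wind_speed(location: str):
    """
    Get the current wind speed in km/h at a given location.

    Args:
        location: The location to get the wind speed for, in the format "City, Country"
    """
    return 6.0  # The body is never inspected below, only the signature and docstring.


def describe_tool(func):
    """Collect the name, argument types, and docstring of a function (hypothetical helper)."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "arguments": {
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in signature.parameters.items()
        },
        "docstring": inspect.getdoc(func),
    }


info = describe_tool(get_current_wind_speed)
print(info["name"], info["arguments"])
```

Note that nothing in `describe_tool` looks at the function body, which mirrors the point that the code inside a tool is ignored.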
## Tool-calling Example
Load a model and tokenizer that support tool use, such as [NousResearch/Hermes-2-Pro-Llama-3-8B](https://hf.co/NousResearch/Hermes-2-Pro-Llama-3-8B). You can also consider a larger model like [Command-R](./model_doc/cohere) or [Mixtral-8x22B](./model_doc/mixtral) if your hardware can support it.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained( "NousResearch/Hermes-2-Pro-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained( "NousResearch/Hermes-2-Pro-Llama-3-8B")
model = AutoModelForCausalLM.from_pretrained( "NousResearch/Hermes-2-Pro-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="auto")
checkpoint = "NousResearch/Hermes-2-Pro-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype="auto", device_map="auto")
```
Create a chat message.
Create a chat history.
```py
messages = [
@ -80,12 +80,11 @@ messages = [
]
```
Pass `messages` and a list of tools to [`~PreTrainedTokenizerBase.apply_chat_template`]. Then you can pass the inputs to the model for generation.
Next, pass `messages` and a list of tools to [`~PreTrainedTokenizerBase.apply_chat_template`]. Tokenize the chat and generate a response.
```py
inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=128)
outputs = model.generate(**inputs.to(model.device), max_new_tokens=128)
print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):]))
```
@ -95,60 +94,52 @@ print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):]))
</tool_call><|im_end|>
```
The chat model called the `get_current_temperature` tool with the correct parameters from the docstring. It inferred France as the location based on Paris, and that it should use Celsius for the units of temperature.
The chat model called the `get_current_temperature` tool with the correct parameters from the docstring. It inferred France as the location based on Paris, and that it should use Celsius for the units of temperature.
Now append the `get_current_temperature` function and these arguments to the chat message as `tool_call`. The `tool_call` dictionary should be provided to the `assistant` role instead of the `system` or `user`.
A model **cannot actually call the tool itself**. It requests a tool call, and it's your job to handle the call and append it and the result to the chat history.
Hold the call in the `tool_calls` key of an `assistant` message. This is the recommended API, and should be supported by the chat template of most tool-using models.
> [!WARNING]
> The OpenAI API uses a JSON string as its `tool_call` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.
> Although `tool_calls` is similar to the OpenAI API, the OpenAI API uses a JSON string as its `tool_calls` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.
<hfoptions id="tool-call">
<hfoption id="Llama">
```py
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```
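If a tool call arrives in the OpenAI-style form mentioned in the warning above, one option is to decode it into a dict before appending it to the chat. The following is a minimal sketch, assuming the JSON string sits under an `arguments` key; `openai_style_call` and `to_dict_arguments` are hypothetical names used for illustration and are not part of Transformers.

```python
import json

# Hypothetical OpenAI-style tool call: `arguments` is assumed to be a JSON-encoded string.
openai_style_call = {
    "name": "get_current_temperature",
    "arguments": '{"location": "Paris, France", "unit": "celsius"}',
}


def to_dict_arguments(call):
    """Return a copy of the call whose `arguments` value is a dict rather than a string."""
    arguments = call["arguments"]
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    return {"name": call["name"], "arguments": arguments}


tool_call = to_dict_arguments(openai_style_call)
assistant_message = {"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]}
print(assistant_message)
```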
Allow the assistant to read the function outputs and chat with the user.
Append the tool response to the chat history with the `tool` role.
```py
messages.append({"role": "tool", "content": "22"}) # Note that the returned content is always a string!
```
Finally, allow the model to read the tool response and reply to the user.
```py
inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
out = model.generate(**inputs.to(model.device), max_new_tokens=128)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))
```
```txt
The temperature in Paris, France right now is approximately 12°C (53.6°F).<|im_end|>
The temperature in Paris, France right now is 22°C.<|im_end|>
```
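The request, handle, and append cycle described above can be sketched without a model. The example below assumes the tool call has already been parsed into a dict; `available_tools` and `handle_tool_call` are hypothetical names, and the tool returns a placeholder value rather than a real measurement.

```python
def get_current_temperature(location: str, unit: str):
    return 22.0  # Placeholder value, as in the example tool.


# Hypothetical registry mapping tool names to Python callables.
available_tools = {"get_current_temperature": get_current_temperature}


def handle_tool_call(messages, tool_call):
    """Run the requested tool, then append the call and its result to the chat history."""
    func = available_tools[tool_call["name"]]
    result = func(**tool_call["arguments"])
    messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
    # The tool response content is passed as a string.
    messages.append({"role": "tool", "content": str(result)})
    return messages


messages = [{"role": "user", "content": "What's the temperature in Paris?"}]
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
handle_tool_call(messages, tool_call)
print(messages[-1])
```

A registry keyed by tool name keeps the dispatch explicit, so a model can only trigger functions you have deliberately exposed.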
</hfoption>
<hfoption id="Mistral/Mixtral">
> [!WARNING]
> Although the key in the assistant message is called `tool_calls`, in most cases, models only emit a single tool call at a time. Some older models emit multiple tool calls at the same time, but this is a
> significantly more complex process, as you need to handle multiple tool responses at once and disambiguate them, often using tool call IDs. Please refer to the model card to see exactly what format a model expects for tool calls.
For [Mistral](./model_doc/mistral) and [Mixtral](./model_doc/mixtral) models, you need an additional `tool_call_id`. The `tool_call_id` is 9 randomly generated alphanumeric characters assigned to the `id` key in the `tool_call` dictionary.
```py
tool_call_id = "9Ae3bDc2F"
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "id": tool_call_id, "function": tool_call}]})
```
## JSON schemas
```py
inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))
```
Another way to define tools is by passing a [JSON schema](https://json-schema.org/learn/getting-started-step-by-step).
</hfoption>
</hfoptions>
You can also manually call the low-level functions that convert Python functions to JSON schemas, and then check or edit the generated schemas. This is usually not necessary, but is useful for understanding the underlying mechanics. It's particularly important
for chat template authors who need to access the JSON schema to render the tool definitions.
## Schema
[`~PreTrainedTokenizerBase.apply_chat_template`] converts functions into a [JSON schema](https://json-schema.org/learn/getting-started-step-by-step) which is passed to the chat template. An LLM never sees the code inside the function. In other words, an LLM doesn't care how the function works technically, it only cares about function **definition** and **arguments**.
The JSON schema is automatically generated behind the scenes as long as your function follows the [rules](#tools) listed earlier above. But you can use [get_json_schema](https://github.com/huggingface/transformers/blob/14561209291255e51c55260306c7d00c159381a5/src/transformers/utils/chat_template_utils.py#L205) to manually convert a schema for more visibility or debugging.
The [`~PreTrainedTokenizerBase.apply_chat_template`] method uses the [get_json_schema](https://github.com/huggingface/transformers/blob/14561209291255e51c55260306c7d00c159381a5/src/transformers/utils/chat_template_utils.py#L205) function to convert Python functions to a JSON schema.
```py
from transformers.utils import get_json_schema
@ -191,12 +182,7 @@ print(schema)
}
```
You can edit the schema or write one entirely from scratch. This gives you a lot of flexibility to define precise schemas for more complex functions.
> [!WARNING]
> Try keeping your function signatures simple and the arguments to a minimum. These are easier for a model to understand and use than complex functions for example with nested arguments.
The example below demonstrates writing a schema manually and then passing it to [`~PreTrainedTokenizerBase.apply_chat_template`].
We won't go into the details of JSON schema itself here, since it's already [very well documented](https://json-schema.org/) elsewhere. We will, however, mention that you can pass JSON schema dicts to the `tools` argument of [`~PreTrainedTokenizerBase.apply_chat_template`] instead of Python functions:
```py
# A simple function that takes no arguments
@ -238,62 +224,4 @@ model_input = tokenizer.apply_chat_template(
messages,
tools = [current_time, multiply]
)
```
## RAG
Retrieval-augmented generation (RAG) models enhance a model's existing knowledge by allowing it to search documents for additional information before returning a query. For RAG models, add a `documents` parameter to [`~PreTrainedTokenizerBase.apply_chat_template`]. This `documents` parameter should be a list of documents, and each document should be a single dict with `title` and `content` keys.
> [!TIP]
> The `documents` parameter for RAG isn't widely supported and many models have chat templates that ignore `documents`. Verify if a model supports `documents` by reading its model card or executing `print(tokenizer.chat_template)` to see if the `documents` key is present. [Command-R](https://hf.co/CohereForAI/c4ai-command-r-08-2024) and [Command-R+](https://hf.co/CohereForAI/c4ai-command-r-plus-08-2024) both support `documents` in their RAG chat templates.
Create a list of documents to pass to the model.
```py
documents = [
{
"title": "The Moon: Our Age-Old Foe",
"text": "Man has always dreamed of destroying the moon. In this essay, I shall..."
},
{
"title": "The Sun: Our Age-Old Friend",
"text": "Although often underappreciated, the sun provides several notable benefits..."
}
]
```
Set `chat_template="rag"` in [`~PreTrainedTokenizerBase.apply_chat_template`] and generate a response.
```py
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01-4bit")
model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01-4bit", device_map="auto")
device = model.device # Get the device the model is loaded on
# Define conversation input
conversation = [
{"role": "user", "content": "What has Man always dreamed of?"}
]
input_ids = tokenizer.apply_chat_template(
conversation=conversation,
documents=documents,
chat_template="rag",
tokenize=True,
add_generation_prompt=True,
return_tensors="pt").to(device)
# Generate a response
generated_tokens = model.generate(
input_ids,
max_new_tokens=100,
do_sample=True,
temperature=0.3,
)
# Decode and print the generated text along with generation prompt
generated_text = tokenizer.decode(generated_tokens[0])
print(generated_text)
```
```

View File

@ -14,11 +14,19 @@ rendered properly in your Markdown viewer.
-->
# Templates
# Chat templates
The [chat pipeline](./conversations) guide introduced [`TextGenerationPipeline`] and the concept of a chat prompt or chat template for conversing with a model. Underlying this high-level pipeline is the [`apply_chat_template`] method. A chat template is a part of the tokenizer and it specifies how to convert conversations into a single tokenizable string in the expected model format.
The [chat basics](./conversations) guide covers how to store chat histories and generate text from chat models using [`TextGenerationPipeline`].
In the example below, Mistral-7B-Instruct and Zephyr-7B are finetuned from the same base model but they're trained with different chat formats. Without chat templates, you have to manually write formatting code for each model and even minor errors can hurt performance. Chat templates offer a universal way to format chat inputs to any model.
This guide is intended for more advanced users, and covers the underlying classes and methods, as well as the key concepts for understanding what's actually going on when you chat with a model.
The critical insight needed to understand chat models is this: All causal LMs, whether chat-trained or not, continue a sequence of tokens. When causal LMs are trained, the training usually begins with "pre-training" on a huge corpus of text, which creates a "base" model.
These base models are then often "fine-tuned" for chat, which means training them on data that is formatted as a sequence of messages. The chat is still just a sequence of tokens, though! The list of `role` and `content` dictionaries that you pass
to a chat model gets converted to a token sequence, often with control tokens like `<|user|>` or `<|assistant|>` or `<|end_of_message|>`, which allow the model to see the chat structure.
There are many possible chat formats, and different models may use different formats or control tokens, even if they were fine-tuned from the same base model!
Don't panic, though - you don't need to memorize every possible chat format in order to use chat models. Chat models come with **chat templates**, which indicate how they expect chats to be formatted.
You can access these with the [`apply_chat_template`] method. Let's see two examples. Both of these models are fine-tuned from the same `Mistral-7B` base model:
<hfoptions id="template">
<hfoption id="Mistral">
@ -61,20 +69,24 @@ tokenizer.apply_chat_template(chat, tokenize=False)
</hfoption>
</hfoptions>
This guide explores [`apply_chat_template`] and chat templates in more detail.
Mistral-7B-Instruct uses `[INST]` and `[/INST]` tokens to indicate the start and end of user messages, while Zephyr-7B uses `<|user|>` and `<|assistant|>` tokens to indicate speaker roles. This is why chat templates are important - with the wrong control tokens, these models would have drastically worse performance.
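To see why the same chat can produce very different strings, consider the two toy formatters below. They are deliberately simplified approximations written for illustration, not the real chat templates shipped with either model; `render_inst_style` and `render_role_tag_style` are hypothetical names.

```python
def render_inst_style(messages):
    """Wrap user turns in [INST] ... [/INST] (a simplified, illustrative format)."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            parts.append(message["content"])
    return " ".join(parts)


def render_role_tag_style(messages):
    """Prefix each turn with a <|role|> tag (a simplified, illustrative format)."""
    return "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great."},
]

print(render_inst_style(chat))
print(render_role_tag_style(chat))
```

In practice you would rely on the template stored with the tokenizer instead of hand-written formatters like these, since small deviations from the trained format can hurt performance.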
## apply_chat_template
## Using `apply_chat_template`
Chats should be structured as a list of dictionaries with `role` and `content` keys. The `role` key specifies the speaker (usually between you and the system), and the `content` key contains your message. For the system, the `content` is a high-level description of how the model should behave and respond when you're chatting with it.
The input to `apply_chat_template` should be structured as a list of dictionaries with `role` and `content` keys. The `role` key specifies the speaker, and the `content` key contains the message. The common roles are:
Pass your messages to [`apply_chat_template`] to tokenize and format them. You can set [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` to indicate the start of a message.
- `user` for messages from the user
- `assistant` for messages from the model
- `system` for directives on how the model should act (usually placed at the beginning of the chat)
[`apply_chat_template`] takes this list and returns a formatted sequence. Set `tokenize=True` if you want to tokenize the sequence.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto", torch_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto", dtype=torch.bfloat16)
messages = [
{"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate",},
@ -83,6 +95,7 @@ messages = [
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))
```
```md
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
@ -91,7 +104,7 @@ How many helicopters can a human eat in one sitting?</s>
<|assistant|>
```
Now pass the tokenized chat to [`~GenerationMixin.generate`] to generate a response.
Pass the tokenized chat to [`~GenerationMixin.generate`] to generate a response.
```py
outputs = model.generate(tokenized_chat, max_new_tokens=128)
@ -106,10 +119,17 @@ How many helicopters can a human eat in one sitting?</s>
Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.
```
### add_generation_prompt
The [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) parameter adds tokens that indicate the start of a response. This ensures the chat model generates a system response instead of continuing a user's message.
> [!WARNING]
> Some tokenizers add special `<bos>` and `<eos>` tokens. Chat templates should already include all the necessary special tokens, and adding additional special tokens is often incorrect or duplicated, hurting model performance. When you format text with `apply_chat_template(tokenize=False)`, make sure you set `add_special_tokens=False` if you tokenize later to avoid duplicating these tokens.
> This isnt an issue if you use `apply_chat_template(tokenize=True)`, which means it's usually the safer option!
Not all models require generation prompts, and some models, like [Llama](./model_doc/llama), don't have any special tokens before the system response. In this case, [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) has no effect.
### add_generation_prompt
You may have noticed the [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) argument in the above examples.
This argument adds tokens to the end of the chat that indicate the start of an `assistant` response. Remember: Beneath all the chat abstractions, chat models are still just language models that continue a sequence of tokens!
If you include tokens that tell it that it's now in an `assistant` response, it will correctly write a response, but if you don't include these tokens, the model may get confused and do something strange, like **continuing** the user's message instead of replying to it!
Let's see an example to understand what `add_generation_prompt` is actually doing. First, let's format a chat without `add_generation_prompt`:
```py
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
@ -124,11 +144,32 @@ Nice to meet you!<|im_end|>
Can I ask a question?<|im_end|>
```
Now, let's format the same chat with `add_generation_prompt=True`:
```py
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
tokenized_chat
```
```md
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```
When `add_generation_prompt=True`, `<|im_start|>assistant` is added at the end to indicate the start of an `assistant` message. This lets the model know an `assistant` response is next.
Not all models require generation prompts, and some models, like [Llama](./model_doc/llama), don't have any special tokens before the `assistant` response. In these cases, [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) has no effect.
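To make the effect concrete, here is a minimal pure-Python sketch of what a ChatML-style template does with `add_generation_prompt`. The `render_chatml` helper is hypothetical and exists only for illustration. It is not how Transformers renders templates (that is done with Jinja), and it assumes the ChatML layout used in the example above.

```py
def render_chatml(messages, add_generation_prompt=False):
    """Toy ChatML renderer (illustration only, not the Transformers implementation)."""
    text = ""
    for message in messages:
        # Each message is wrapped in <|im_start|>role ... <|im_end|>
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant message so the model writes a reply next
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]

without_prompt = render_chatml(messages)
with_prompt = render_chatml(messages, add_generation_prompt=True)

# The two strings differ only by the trailing assistant header
print(repr(with_prompt[len(without_prompt):]))
```

The rendered chat is identical in both cases except for the header appended at the end, which is all that `add_generation_prompt` contributes in this format.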
### continue_final_message
The [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) parameter controls whether the final message in the chat should be continued or not instead of starting a new one. It removes end of sequence tokens so that the model continues generation from the final message.
This is useful for “prefilling” a model response. In the example below, the model generates text that continues the JSON string rather than starting a new message. It can be very useful for improving the accuracy for instruction following when you know how to start its replies.
This is useful for “prefilling” a model response. In the example below, the model generates text that continues the JSON string rather than starting a new message. It can be very useful for improving the accuracy of instruction following when you know how to start its replies.
```py
chat = [
@ -143,52 +184,12 @@ model.generate(**formatted_chat)
> [!WARNING]
> You shouldnt use [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) and [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) together. The former adds tokens that start a new message, while the latter removes end of sequence tokens. Using them together returns an error.
[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the assistant role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don't support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) to the pipeline.
[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don't support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.
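The interaction between the two flags can be sketched in plain Python. The `render_chatml` function below is a hypothetical toy renderer for a ChatML-style format, not the Transformers implementation, and the chat is made up for illustration. It leaves the final message open when `continue_final_message=True` and rejects the combination of both flags.

```py
def render_chatml(messages, add_generation_prompt=False, continue_final_message=False):
    """Toy renderer showing prefill behavior (illustration only)."""
    if add_generation_prompt and continue_final_message:
        # One flag opens a new message, the other keeps the last one open
        raise ValueError("add_generation_prompt and continue_final_message cannot be used together")
    text = ""
    for index, message in enumerate(messages):
        text += f"<|im_start|>{message['role']}\n{message['content']}"
        is_last = index == len(messages) - 1
        if not (continue_final_message and is_last):
            # Close every message except a final message that should be continued
            text += "<|im_end|>\n"
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

prefilled = render_chatml(chat, continue_final_message=True)
print(prefilled)
```

Because the rendered text ends inside the assistant message, a model continuing this sequence would extend the JSON string instead of starting a new turn.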
## Multiple templates
A model may have several different templates for different use cases. For example, a model may have a template for regular chat, tool use, and RAG.
When there are multiple templates, the chat template is a dictionary. Each key corresponds to the name of a template. [`apply_chat_template`] handles multiple templates based on their name. It looks for a template named `default` in most cases and if it can't find one, it raises an error.
For a tool calling template, if a user passes a `tools` parameter and a `tool_use` template exists, the tool calling template is used instead of `default`.
To access templates with other names, pass the template name to the `chat_template` parameter in [`apply_chat_template`]. For example, if you're using a RAG template then set `chat_template="rag"`.
It can be confusing to manage multiple templates though, so we recommend using a single template for all use cases. Use Jinja statements like `if tools is defined` and `{% macro %}` definitions to wrap multiple code paths in a single template.
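The selection rules described above can be summarized in a short sketch. The `select_template` function and the placeholder template strings are hypothetical. The sketch only mirrors the rules as stated in this section and is not the library's actual code.

```py
def select_template(chat_templates, chat_template=None, tools=None):
    """Pick a template by name, following the rules described above (illustration only)."""
    if isinstance(chat_templates, str):
        # A single template: nothing to select
        return chat_templates
    if chat_template is not None:
        # An explicitly requested name, such as "rag"
        return chat_templates[chat_template]
    if tools is not None and "tool_use" in chat_templates:
        # Tools were passed and a dedicated tool calling template exists
        return chat_templates["tool_use"]
    if "default" in chat_templates:
        return chat_templates["default"]
    raise ValueError("No template named 'default' was found; pass a template name explicitly.")


templates = {
    "default": "<default template>",
    "tool_use": "<tool template>",
    "rag": "<rag template>",
}

print(select_template(templates))
print(select_template(templates, tools=[{"name": "get_weather"}]))
print(select_template(templates, chat_template="rag"))
```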
## Template selection
It is important to set a chat template format that matches the template format a model was pretrained on, otherwise performance may suffer. Even if you're training the model further, performance is best if the chat tokens are kept constant.
But if you're training a model from scratch or finetuning a model for chat, you have more options to select a template. For example, [ChatML](https://github.com/openai/openai-python/blob/release-v0.28.0/chatml.md) is a popular format that is flexible enough to handle many use cases. It even includes support for [generation prompts](#add_generation_prompt), but it doesn't add beginning-of-string (`BOS`) or end-of-string (`EOS`) tokens. If your model expects `BOS` and `EOS` tokens, set `add_special_tokens=True` and make sure to add them to your template.
```py
{%- for message in messages %}
{{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}
```
Set the template with the following logic to support [generation prompts](#add_generation_prompt). The template wraps each message with `<|im_start|>` and `<|im_end|>` tokens and writes the role as a string. This allows you to easily customize the roles you want to train with.
```py
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
```
The `user`, `system` and `assistant` roles are standard roles in chat templates. We recommend using these roles when it makes sense, especially if you're using your model with the [`TextGenerationPipeline`].
```py
<|im_start|>system
You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
I'm doing great!<|im_end|>
```
## Model training
Training a model with a chat template is a good way to ensure a chat template matches the tokens a model is trained on. Apply the chat template as a preprocessing step to your dataset. Set `add_generation_prompt=False` because the additional tokens to prompt an assistant response aren't helpful during training.
Training a model with a chat template is a good way to ensure the template matches the tokens the model was trained on. Apply the chat template as a preprocessing step to your dataset. Set `add_generation_prompt=False` because the additional tokens to prompt an assistant response aren't helpful during training.
An example of preprocessing a dataset with a chat template is shown below.
@ -219,11 +220,3 @@ The sun.</s>
```
After this step, you can continue following the [training recipe](./tasks/language_modeling) for causal language models using the `formatted_chat` column.
Some tokenizers add special `<bos>` and `<eos>` tokens. Chat templates should already include all the necessary special tokens, and adding additional special tokens is often incorrect or duplicated, hurting model performance. When you format text with `apply_chat_template(tokenize=False)`, make sure you set `add_special_tokens=False` as well to avoid duplicating them.
```py
apply_chat_template(messages, tokenize=False, add_special_tokens=False)
```
This isn't an issue if you use `apply_chat_template(tokenize=True)`.

View File

@ -14,22 +14,21 @@ rendered properly in your Markdown viewer.
-->
# Multimodal templates
# Multimodal chat templates
Multimodal model chat templates expect a similar [template](./chat_templating) as text-only models. It needs `messages` that includes a dictionary of the `role` and `content`.
Multimodal chat models accept inputs like images, audio or video, in addition to text. The `content` key in a multimodal chat history is a list containing multiple items of different types. This is unlike text-only chat models whose `content` key is a single string.
Multimodal templates are included in the [Processor](./processors) class and require an additional `type` key for specifying whether the included content is an image, video, or text.
This guide will show you how to format chat templates for multimodal models as well as some best practices for configuring the template.
In the same way the [Tokenizer](./fast_tokenizer) class handles chat templates and tokenization for text-only models,
the [Processor](./processors) class handles preprocessing, tokenization and chat templates for multimodal models. Their [`~ProcessorMixin.apply_chat_template`] methods are almost identical.
This guide will show you how to chat with multimodal models with the high-level [`ImageTextToTextPipeline`] and at a lower level using the [`~ProcessorMixin.apply_chat_template`] and [`~GenerationMixin.generate`] methods.
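As a small illustration of the difference in the `content` key, the sketch below converts a text-only chat, where `content` is a string, into the list form used by multimodal chats. The `to_multimodal_content` helper is hypothetical and not part of Transformers. It assumes text items take the `{"type": "text", "text": ...}` shape used in the examples in this guide.

```py
def to_multimodal_content(content):
    """Wrap a plain string in the list-of-items form used by multimodal chats."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    # Already a list of typed content items: leave it unchanged
    return content


text_only_chat = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "What are these?"},
]

multimodal_chat = [
    {"role": message["role"], "content": to_multimodal_content(message["content"])}
    for message in text_only_chat
]

print(multimodal_chat[1]["content"])
```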
## ImageTextToTextPipeline
[`ImageTextToTextPipeline`] is a high-level image and text generation class with a “chat mode”. Chat mode is enabled when a conversational model is detected and the chat prompt is [properly formatted](./llm_tutorial#wrong-prompt-format).
Start by building a chat history with the following two roles.
- `system` describes how the model should behave and respond when you're chatting with it. This role isn't supported by all chat models.
- `user` is where you enter your first message to the model.
Add image and text blocks to the `content` key in the chat history.
```py
messages = [
@ -47,39 +46,35 @@ messages = [
]
```
Create a [`ImageTextToTextPipeline`] and pass the chat to it. For large models, setting [device_map=“auto”](./models#big-model-inference) helps load the model quicker and automatically places it on the fastest device available. Changing the data type to [torch.bfloat16](./models#model-data-type) also helps save memory.
> [!TIP]
> The [`ImageTextToTextPipeline`] accepts chats in the OpenAI format to make inference easier and more accessible.
Create an [`ImageTextToTextPipeline`] and pass the chat to it. For large models, setting [device_map=“auto”](./models#big-model-inference) helps load the model quicker and automatically places it on the fastest device available. Setting the data type to [auto](./models#model-data-type) also helps save memory and improve speed.
```python
import torch
from transformers import pipeline
pipeline = pipeline("image-text-to-text", model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device_map="auto", torch_dtype=torch.float16)
pipeline(text=messages, max_new_tokens=50, return_full_text=False)
[{'input_text': [{'role': 'system',
'content': [{'type': 'text',
'text': 'You are a friendly chatbot who always responds in the style of a pirate'}]},
{'role': 'user',
'content': [{'type': 'image',
'url': 'http://images.cocodataset.org/val2017/000000039769.jpg'},
{'type': 'text', 'text': 'What are these?'}]}],
'generated_text': 'The image shows two cats lying on a pink surface, which appears to be a cushion or a soft blanket. The cat on the left has a striped coat, typical of tabby cats, and is lying on its side with its head resting on the'}]
pipe = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", dtype="auto")
out = pipe(text=messages, max_new_tokens=128)
print(out[0]['generated_text'][-1]['content'])
```
## Image inputs
For multimodal models that accept images like [LLaVA](./model_doc/llava), include the following in `content` as shown below.
```
Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
```
Aside from the gradual descent from pirate-speak into modern American English (it **is** only a 3B model, after all), this is correct!
## Using `apply_chat_template`
Like [text-only models](./chat_templating), use the [`~ProcessorMixin.apply_chat_template`] method to prepare the chat messages for multimodal models.
This method handles the tokenization and formatting of the chat messages, including images and other media types. The resulting inputs are passed to the model for generation.
- The content `"type"` can be an `"image"` or `"text"`.
- For images, it can be a link to the image (`"url"`), a file path (`"path"`), or `"base64"`. Images are automatically loaded, processed, and prepared into pixel values as inputs to the model.
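The three ways of referencing an image can be sketched as content items. The file path and the raw bytes below are placeholders, and the sketch assumes the `"base64"` value is the base64-encoded content of an image file passed as a string.

```py
import base64

# Placeholder bytes standing in for the content of an image file.
# In practice these would come from something like open("cats.jpg", "rb").read()
raw_image_bytes = b"\x89PNG placeholder bytes"

image_from_url = {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"}
image_from_path = {"type": "image", "path": "/path/to/image.jpg"}  # hypothetical local path
image_from_base64 = {"type": "image", "base64": base64.b64encode(raw_image_bytes).decode("ascii")}

messages = [
    {
        "role": "user",
        "content": [
            image_from_base64,
            {"type": "text", "text": "What are these?"},
        ],
    },
]

print(messages[0]["content"][0]["base64"][:16])
```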
```python
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from transformers import AutoProcessor, AutoModelForImageTextToText
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", torch_dtype="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
messages = [
{
@ -96,14 +91,28 @@ messages = [
]
```
Pass `messages` to [`~ProcessorMixin.apply_chat_template`] to tokenize the input content and return the `input_ids` and `pixel_values`.
Pass `messages` to [`~ProcessorMixin.apply_chat_template`] to tokenize the input content. Unlike text models, the output of `apply_chat_template`
contains a `pixel_values` key with the preprocessed image data, in addition to the tokenized text.
```py
processed_chat = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
print(processed_chat.keys())
print(list(processed_chat.keys()))
```
These inputs are now ready to be used in [`~GenerationMixin.generate`].
```
['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw']
```
Pass these inputs to [`~GenerationMixin.generate`].
```python
out = model.generate(**processed_chat.to(model.device), max_new_tokens=128)
print(processor.decode(out[0]))
```
The decoded output contains the full conversation so far, including the user message and the placeholder tokens that contain the image information. You may need to trim the previous conversation from the output before displaying it to the user.
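One common way to trim the prompt is to slice off as many tokens as the input contained, since the generated sequence starts with the prompt tokens. The sketch below shows the idea with plain Python lists and made-up token ids. With real tensors, the equivalent slice would be along the lines of `out[0][processed_chat["input_ids"].shape[1]:]`, assuming the output begins with the full prompt.

```py
def trim_prompt(output_ids, prompt_length):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Made-up token ids for illustration
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]

new_tokens = trim_prompt(output_ids, len(prompt_ids))
print(new_tokens)
```

Decoding only `new_tokens` then yields the model's reply without the earlier conversation or image placeholder tokens.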
## Video inputs
@ -111,6 +120,7 @@ Some vision models also support video inputs. The message format is very similar
- The content `"type"` should be `"video"` to indicate the content is a video.
- For videos, it can be a link to the video (`"url"`) or it could be a file path (`"path"`). Videos loaded from a URL can only be decoded with [PyAV](https://pyav.basswood-io.com/docs/stable/) or [Decord](https://github.com/dmlc/decord).
- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if you've already preprocessed or decoded video frames elsewhere in memory (e.g., using OpenCV, decord, or torchvision). You don't need to save the frames to a file or host them at a URL.
> [!WARNING]
> Loading a video from `"url"` is only supported by the PyAV or Decord backends.
@ -137,6 +147,52 @@ messages = [
]
```
### Example: Passing decoded video objects
```python
import numpy as np
video_object1 = np.random.randint(0, 255, size=(16, 224, 224, 3), dtype=np.uint8)
messages = [
{
"role": "system",
"content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
},
{
"role": "user",
"content": [
{"type": "video", "video": video_object1},
{"type": "text", "text": "What do you see in this video?"}
],
},
]
```
You can also use the existing `load_video()` function to load a video, edit it in memory, and pass it in the messages.
```python
# Make sure a video backend library (pyav, decord, or torchvision) is available.
from transformers.video_utils import load_video
# load a video file in memory for testing
video_object2, _ = load_video(
"https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
)
messages = [
{
"role": "system",
"content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
},
{
"role": "user",
"content": [
{"type": "video", "video": video_object2},
{"type": "text", "text": "What do you see in this video?"}
],
},
]
```
Pass `messages` to [`~ProcessorMixin.apply_chat_template`] to tokenize the input content. There are a few extra parameters to include in [`~ProcessorMixin.apply_chat_template`] that control the sampling process.
The `video_load_backend` parameter refers to a specific framework to load a video. It supports [PyAV](https://pyav.basswood-io.com/docs/stable/), [Decord](https://github.com/dmlc/decord), [OpenCV](https://github.com/opencv/opencv), and [torchvision](https://pytorch.org/vision/stable/index.html).
@ -216,28 +272,3 @@ print(processed_chat.keys())
</hfoption>
</hfoptions>
## Template configuration
You can create a custom chat template with [Jinja](https://jinja.palletsprojects.com/en/3.1.x/templates/) and set it with [`~ProcessorMixin.apply_chat_template`]. Refer to the [Template writing](./chat_templating_writing) guide for more details.
For example, to enable a template to handle a *list of content* from multiple modalities while still supporting plain strings for text-only inference, specify how to handle the `content['type']` if it is an image or text as shown below in the Llama 3.2 Vision Instruct [template](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct/blob/main/chat_template.json).
```jinja
{% for message in messages %}
{% if loop.index0 == 0 %}{{ bos_token }}{% endif %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{% if message['content'] is string %}
{{ message['content'] }}
{% else %}
{% for content in message['content'] %}
{% if content['type'] == 'image' %}
{{ '<|image|>' }}
{% elif content['type'] == 'text' %}
{{ content['text'] }}
{% endif %}
{% endfor %}
{% endif %}
{{ '<|eot_id|>' }}
{% endfor %}
{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}
```

View File

@ -14,15 +14,10 @@ rendered properly in your Markdown viewer.
-->
# Template writing
# Writing a chat template
A chat template is a [Jinja](https://jinja.palletsprojects.com/en/3.1.x/templates/) template stored in the tokenizer's [chat_template](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.chat_template) attribute. Jinja is a templating language that allows you to write Python-like code and syntax. A chat template performs the following three roles.
A chat template is a [Jinja](https://jinja.palletsprojects.com/en/stable/templates/) template stored in the tokenizer's [chat_template](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.chat_template) attribute. Jinja is a templating language that allows you to write Python-like code and syntax.
1. Print the role enclosed in `<|` and `|>` (`<|user|>`, `<|assistant|>`, etc.).
2. Print the message followed by an end-of-sequence (`EOS`) token.
3. Print the assistant token if [add_generation_prompt=True](./chat_templating#add_generation_prompt) so the model generates an assistant response.
An example template is shown below.
```jinja
{%- for message in messages %}
@ -34,55 +29,68 @@ An example template is shown below.
{%- endif %}
```
The template can be customized to handle more complex use cases. This guide will show you how to add and edit templates and includes template writing tips.
If you stare at this for a while, you should realize that this is actually very like Python, albeit with some strange
`{%-` syntax. The template iterates over a list of messages, and for each message, it prints the role and content of
the message, followed by an end-of-sequence token. If `add_generation_prompt=True`, it adds
the starting header for an assistant message to the end of the conversation.
## Create a template
Create a template by writing a Jinja template and then setting it as the chat template in the tokenizer. For example, the template below adds `[ASST]` and `[/ASST]` tags to the assistant messages.
```jinja
{%- for message in messages %}
{%- if message['role'] == 'user' %}
{{- bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }}
{%- elif message['role'] == 'system' %}
{{- '<<SYS>>\\n' + message['content'].strip() + '\\n<</SYS>>\\n\\n' }}
{%- elif message['role'] == 'assistant' %}
{{- '[ASST] ' + message['content'] + ' [/ASST]' + eos_token }}
{%- endif %}
{%- endfor %}
```
Set the template in the tokenizer, and the next time you use [`~PreTrainedTokenizerBase.apply_chat_template`], the new template is used.
```py
template = tokenizer.chat_template
template = template.replace("SYS", "SYSTEM") # Change the system token
tokenizer.chat_template = template # Set the new template
```
The template is saved in the `tokenizer_config.json` file. Upload it to the Hub with [`~PreTrainedTokenizer.push_to_hub`] so you can reuse it later and make sure everyone is using the right template for your model.
```py
tokenizer.push_to_hub("model_name")
```
Load the written template as a string and assign it to the tokenizer's `chat_template` attribute. Once set, the template is used whenever you call [`~PreTrainedTokenizerBase.apply_chat_template`]. It is also saved
with the tokenizer whenever [`~PreTrainedTokenizer.save_pretrained`] or [`~PreTrainedTokenizer.push_to_hub`] is called. The template is saved in the `chat_template.jinja` file in the tokenizer directory. You can
edit this file directly to change the template, which is often easier than manipulating a template string.
## Template writing tips
The easiest way to start writing Jinja templates is to refer to existing templates. Use `print(tokenizer.chat_template)` on any chat model to see what template it's using. Try starting with simple models that don't call any tools or support RAG. Finally, take a look at the [Jinja documentation](https://jinja.palletsprojects.com/en/3.1.x/templates/#synopsis) for more details about formatting and syntax.
The easiest way to start writing Jinja templates is to refer to existing templates. Use `print(tokenizer.chat_template)` on any chat model to see the template it's using. Try starting with simple models that don't call any tools or support RAG because tool-use models can have very complex templates. Finally, take a look at the [Jinja documentation](https://jinja.palletsprojects.com/en/stable/templates/#synopsis) for more details about formatting and syntax.
This section curates some best practices for writing clean and efficient Jinja templates.
There are some tips and pitfalls that are specific to writing chat templates, though, and this section covers some of them in more detail.
### Trimming whitespace
### Writing multimodal chat templates
Jinja prints any whitespace before or after a block of text. This can be an issue for chat templates because whitespace usage should be intentional. Add `-` to strip any whitespace before a block.
For multimodal templates, the `chat_template` attribute is set on the **processor**, not the tokenizer. The `content` key of a message is often a list of content dicts,
rather than just a single string. You may wish to check the type of each content item in the list, and handle it accordingly.
Generally, the template should not directly access image or video data. This is normally handled by the processor after template rendering has finished. Instead,
your template should emit a single special token like `<|image|>` or `<|video|>` when it encounters image or video content. The processor will
expand the single special token out into a sequence of image or video tokens later. The exact tokens to emit depend on the model you're working with. We strongly recommend loading an existing multimodal processor to see how it handles data.
The example template below handles mixed image and text content.
```jinja
{%- for message in messages %}
{{- message['role'] + message['content'] }}
{%- if loop.index0 == 0 %}
{{- bos_token }}
{%- endif %}
{{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{%- if message['content'] is string %}
{{- message['content'] }}
{%- else %}
{%- for content in message['content'] %}
{%- if content['type'] == 'image' %}
{{- '<|image|>' }}
{%- elif content['type'] == 'text' %}
{{- content['text'] }}
{%- endif %}
{%- endfor %}
{%- endif %}
{{- '<|eot_id|>' }}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}
```
The incorrect whitespace usage example below may introduce a newline and indentation in the output.
This multimodal template is very similar to the simpler template above, but it checks for `content` lists,
and iterates over them to render `<|image|>` tokens where necessary. This allows images to be inserted "into the flow"
of user text.
Not all models work this way - some may move all images to the end of the user message,
for example. The chat template should always match the format the model was trained with.
### Trimming whitespace
Jinja prints any whitespace before or after a block of text. This can be an issue for chat templates because adding extra whitespace that was not present during model training can harm performance. To remove the whitespace, add `-` to the Jinja line syntax. This allows you to write your template with Pythonic indentation and linebreaks, without accidentally printing an indentation in the rendered output.
The example template below doesn't use `-`, resulting in extra whitespace being printed in the output.
```jinja
{% for message in messages %}
@ -90,22 +98,28 @@ The incorrect whitespace usage example below may introduce a newline and indenta
{% endfor %}
```
### Special variables
We strongly recommend using `-` to ensure only the intended content is printed.
There are five special variables available inside a template. You can pass virtually any additional arguments to [`~PreTrainedTokenizerBase.apply_chat_template`] and it will be available inside the template as a variable. However, you should try to keep the number of variables to the five below to make it easier for users to use the chat model without writing custom code to handle model-specific arguments.
```jinja
{%- for message in messages %}
{{- message['role'] + message['content'] }}
{%- endfor %}
```
- `messages` contains the chat history as a list of message dicts.
- `tools` contains a list of tools in JSON schema format.
- `documents` contains a list of documents with the format `{"title": Title, "contents": "Contents"}` (designed for RAG models).
- `add_generation_prompt` is a boolean that determines whether to add an assistant header at the end of the conversation.
- `bos_token` and `eos_token` are special tokens extracted from a tokenizer's `special_tokens_map`.
### Special variables and callables
### Callable functions
There are two callable functions available inside a template.
The only constants in a template are the `messages` variable and the `add_generation_prompt` boolean. However, you have
access to **any other keyword arguments that are passed** to the [`~PreTrainedTokenizerBase.apply_chat_template`] method.
This provides flexibility and enables support for use-cases we may not have thought of while designing the spec. The most common additional variable is `tools`, which contains a list of tools in JSON schema format. Although you can use any variable name you like, we highly recommend sticking to convention and using `tools` for this purpose. This makes templates more compatible with the standard API.
You also have access to any tokens contained in `tokenizer.special_tokens_map`, which often includes special tokens like `bos_token` and `eos_token`. Access these directly by name, like `{{- bos_token }}`.
There are two callable functions available to you. To call them, use `{{- function_name(argument) }}`.
- `raise_exception(msg)` raises a `TemplateException`. This is useful for debugging or warning users about incorrect template usage.
- `strftime_now(format_str)` retrieves the current date and time in a specific format which could be useful to include in system messages. It is equivalent to [datetime.now().strftime(format_str)](https://docs.python.org/3/library/datetime.html#datetime.datetime.now) in Python.
- `strftime_now(format_str)` retrieves the current date and time in a specific format, which is often required in system messages. It is equivalent to [datetime.now().strftime(format_str)](https://docs.python.org/3/library/datetime.html#datetime.datetime.now) in Python.
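Since `strftime_now(format_str)` is described as equivalent to `datetime.now().strftime(format_str)`, its output can be previewed in plain Python before writing it into a template. The snippet below is only a sketch: the Python function named `strftime_now` here is a local stand-in written for illustration, not the callable that the template engine injects.

```python
from datetime import datetime


def strftime_now(format_str: str) -> str:
    """Local stand-in for the template callable: format the current local time."""
    return datetime.now().strftime(format_str)


# Preview the string a template would receive from strftime_now("%Y-%m-%d").
today = strftime_now("%Y-%m-%d")
print(f"Today's date: {today}")
```

Any format string accepted by Python's `strftime` can be checked this way.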
### Compatibility with non-Python Jinja
@ -144,9 +158,11 @@ The following section lists elements of the standard API for writing templates f
### Tool definitions
Transformers chat template methods allow a user to pass tools as Python functions or a JSON schema. When functions are passed, a JSON schema is automatically generated and passed to the template. The `tools` variable in a template always takes a list of JSON schemas.
[Tools](./chat_extras) are passed as Python functions or a JSON schema. When functions are passed, a JSON schema is automatically generated and passed to the template. When a template accesses the `tools` variable, it is always a list of JSON schemas.
The specific tokens and tool descriptions should match the ones your model was trained with. Your model doesn't need to understand the JSON schema input because your template can translate the JSON schema into your model's format. For example, [Command-R](./model_doc/cohere) was trained with tools defined with Python function headers, but the Command-R tool template accepts JSON schemas. The template internally converts types and renders the input tools as Python headers.
Even though a template always receives tools as a JSON schema, you may need to radically change this format when rendering them to match the format a model was trained with. For example, [Command-R](./model_doc/cohere) was trained with tools defined with Python function headers. The template internally converts JSON schema types and renders the input tools as Python headers.
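To make the idea of converting a JSON schema into a Python header more concrete, here is a minimal Python sketch of that kind of translation. Everything in it is an assumption made for illustration: the `render_python_header` helper, the `JSON_TO_PYTHON` type mapping, and the `multiply` tool are hypothetical, and this is not the logic of the actual Command-R template, which performs its conversion in Jinja.

```python
# Hypothetical mapping from JSON schema types to Python type names.
JSON_TO_PYTHON = {
    "string": "str",
    "number": "float",
    "integer": "int",
    "boolean": "bool",
    "array": "list",
    "object": "dict",
}


def render_python_header(tool: dict) -> str:
    """Render one JSON-schema tool definition as a Python function header."""
    function = tool["function"]
    properties = function["parameters"]["properties"]
    # Build "name: type" pairs, falling back to "Any" for unknown types.
    args = ", ".join(
        f"{name}: {JSON_TO_PYTHON.get(spec.get('type'), 'Any')}"
        for name, spec in properties.items()
    )
    return f'def {function["name"]}({args}):\n    """{function["description"]}"""'


# Hypothetical tool definition in JSON schema form.
tool = {
    "type": "function",
    "function": {
        "name": "multiply",
        "description": "A function that multiplies two numbers",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number", "description": "The first number"},
                "b": {"type": "number", "description": "The second number"},
            },
            "required": ["a", "b"],
        },
    },
}

header = render_python_header(tool)
print(header)
```

A real template would apply the same kind of loop over `tools` in Jinja and wrap the result in whatever tokens the model was trained with.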
The example below shows how a tool is defined in JSON schema format.
```json
{
@ -172,7 +188,7 @@ The specific tokens and tool descriptions should match the ones your model was t
}
```
An example for handling tool definitions in a chat template is shown below. The specific tokens and tool descriptions should be changed to match the ones a model was trained with.
An example of handling tool definitions in a chat template is shown below. The specific tokens and layouts should be changed to match the ones the model was trained with.
```
{%- if tools %}
@ -188,7 +204,9 @@ An example for handling tool definitions in a chat template is shown below. The
### Tool calls
Tool calls, if present, are a list attached to a message with the `"assistant"` role. This is always a list even though most tool-calling models only support single tool calls, which means the list usually only contains a single element.
In addition to rendering the tool definitions, you also need to render **tool calls** and **tool responses** in the template.
Tool calls are generally passed in the `tool_calls` key of an `"assistant"` message. This is always a list even though most tool-calling models only support single tool calls, which means the list usually only contains a single element.
```json
{
@ -208,7 +226,7 @@ Tool calls, if present, is a list with the `"assistant”` role. This is always
}
```
A common pattern for handling tool calls is shown below.
A common pattern for handling tool calls is shown below. You can use this as a starting point, but make sure your template actually matches the format the model was trained with!
```
{%- if message['role'] == 'assistant' and 'tool_calls' in message %}
@ -221,7 +239,7 @@ A common pattern for handling tool calls is shown below.
### Tool responses
Tool responses are a message dict with the `role`, `name` (name of the function) and `content` (result of the tool call) keys.
Tool responses are message dicts with the `tool` role. They are much simpler than tool calls, and usually only contain the `role`, `name` and `content` keys.
```json
{
@ -231,7 +249,7 @@ Tool responses are a message dict with the `role`, `name` (name of the function)
}
```
Not all the keys need to be used in the tool response. For example, if a model doesn't expect the function name to be included in the tool response, then you can just include the `role` and `content`.
Some templates may not even need the `name` key, in which case, you can write your template to only read the `content` key.
```
{%- if message['role'] == 'tool' %}
@ -241,11 +259,11 @@ Not all the keys need to be used in the tool response. For example, if a model d
## Contribute
Add a chat template by setting the `chat_template` attribute in the tokenizer and testing it with [`~PreTrainedTokenizerBase.apply_chat_template`]. If it works as expected, then you can upload it to the Hub with [`~PreTrainedTokenizer.push_to_hub`].
Once a template is ready, set it to the `chat_template` attribute in the tokenizer and test it with [`~PreTrainedTokenizerBase.apply_chat_template`]. If it works as expected, then upload it to the Hub with [`~PreTrainedTokenizer.push_to_hub`].
Even if you're not the model owner, it is still helpful to add a template for a model with an empty chat template or a model that is using a default class template. Open a [pull request](https://hf.co/docs/hub/repositories-pull-requests-discussions) on the model repository to add the template.
Even if you're not the model owner, it is still helpful to add a template for a model with an empty or incorrect chat template. Open a [pull request](https://hf.co/docs/hub/repositories-pull-requests-discussions) on the model repository to add the template!
```py
tokenizer.chat_template = template
tokenizer.push_to_hub("model_name")
tokenizer.push_to_hub("amazing_company/cool_model", commit_message="Add chat template", create_pr=True)
```


@ -17,7 +17,6 @@ This page regroups resources around 🤗 Transformers developed by the community
| Notebook | Description | Author | |
|:----------|:-------------|:-------------|------:|
| [Fine-tune a pre-trained Transformer to generate lyrics](https://github.com/AlekseyKorshuk/huggingartists) | How to generate lyrics in the style of your favorite artist by fine-tuning a GPT-2 model | [Aleksey Korshuk](https://github.com/AlekseyKorshuk) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AlekseyKorshuk/huggingartists/blob/master/huggingartists-demo.ipynb) |
| [Train T5 in Tensorflow 2](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. This notebook demonstrates a Question & Answer task implemented in Tensorflow 2 using SQUAD | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) |
| [Train T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | How to train T5 on SQUAD with Transformers and Nlp | [Suraj Patil](https://github.com/patil-suraj) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) |
| [Fine-tune T5 for Classification and Multiple Choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) |
| [Fine-tune DialoGPT on New Datasets and Languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots | [Nathan Cooper](https://github.com/ncoop57) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) |
@ -42,7 +41,6 @@ This page regroups resources around 🤗 Transformers developed by the community
|[Fine-tune ALBERT for sentence-pair classification](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) | How to fine-tune an ALBERT model or another BERT-based model for the sentence-pair classification task | [Nadir El Manouzi](https://github.com/NadirEM) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)|
|[Fine-tune Roberta for sentiment analysis](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | How to fine-tune a Roberta model for sentiment analysis | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)|
|[Evaluating Question Generation Models](https://github.com/flexudy-pipe/qugeev) | How accurate are the answers to questions generated by your seq2seq transformer model? | [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)|
|[Classify text with DistilBERT and Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | How to fine-tune DistilBERT for text classification in TensorFlow | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)|
|[Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | How to warm-start a *EncoderDecoderModel* with a *google-bert/bert-base-uncased* checkpoint for summarization on CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)|
|[Leverage RoBERTa for Encoder-Decoder Summarization on BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | How to warm-start a shared *EncoderDecoderModel* with a *FacebookAI/roberta-base* checkpoint for summarization on BBC/XSum | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)|
|[Fine-tune TAPAS on Sequential Question Answering (SQA)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | How to fine-tune *TapasForQuestionAnswering* with a *tapas-base* checkpoint on the Sequential Question Answering (SQA) dataset | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)|


@ -16,18 +16,15 @@ rendered properly in your Markdown viewer.
# Chat basics
Chat models are conversational models you can send messages to and receive messages from. There are many chat models available to choose from, and larger models tend to be better, though that's not always the case. The model size is often included in the name, like "8B" or "70B", and it describes the number of parameters. Mixture-of-expert (MoE) models have names like "8x7B" or "141B-A35B", which indicate 56B and 141B total parameters, respectively. You can try quantizing larger models to reduce memory requirements, otherwise you'll need ~2 bytes of memory per parameter.
Chat models are conversational models you can send a message to and receive a response from. Most language models from mid-2023 onwards are chat models and may be referred to as "instruct" or "instruction-tuned" models. Models that do not support chat are often referred to as "base" or "pretrained" models.
Check model leaderboards like [OpenLLM](https://hf.co/spaces/HuggingFaceH4/open_llm_leaderboard) and [LMSys Chatbot Arena](https://chat.lmsys.org/?leaderboard) to further help you identify the best chat models for your use case. Models that are specialized in certain domains (medical, legal text, non-English languages, etc.) may sometimes outperform larger general purpose models.
Larger and newer models are generally more capable, but models specialized in certain domains (medical, legal text, non-English languages, etc.) can often outperform these larger models. Try leaderboards like [OpenLLM](https://hf.co/spaces/HuggingFaceH4/open_llm_leaderboard) and [LMSys Chatbot Arena](https://chat.lmsys.org/?leaderboard) to help you identify the best model for your use case.
> [!TIP]
> Chat with a number of open-source models for free on [HuggingChat](https://hf.co/chat/)!
This guide shows you how to quickly start chatting with Transformers from the command line, how to build and format a conversation, and how to chat using the [`TextGenerationPipeline`].
This guide shows you how to quickly load chat models in Transformers from the command line, how to build and format a conversation, and how to chat using the [`TextGenerationPipeline`].
## chat CLI
After you've [installed Transformers](./installation.md), chat with a model directly from the command line as shown below. It launches an interactive session with a model, with a few base commands listed at the start of the session.
After you've [installed Transformers](./installation), you can chat with a model directly from the command line. The command below launches an interactive session with a model, with a few base commands listed at the start of the session.
```bash
transformers chat Qwen/Qwen2.5-0.5B-Instruct
@ -56,85 +53,54 @@ The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooli
[`TextGenerationPipeline`] is a high-level text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is [properly formatted](./llm_tutorial#wrong-prompt-format).
To start, build a chat history with the following two roles.
- `system` describes how the model should behave and respond when you're chatting with it. This role isn't supported by all chat models.
- `user` is where you enter your first message to the model.
Chat models accept a list of messages (the chat history) as the input. Each message is a dictionary with `role` and `content` keys.
To start the chat, add a single `user` message. You can also optionally include a `system` message to give the model directions on how to behave.
```py
chat = [
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
{"role": "system", "content": "You are a helpful science assistant."},
{"role": "user", "content": "Hey, can you explain gravity to me?"}
]
```
Create the [`TextGenerationPipeline`] and pass `chat` to it. For large models, setting [device_map="auto"](./models#big-model-inference) helps load the model quicker and automatically places it on the fastest device available. Changing the data type to [torch.bfloat16](./models#model-data-type) also helps save memory.
Create the [`TextGenerationPipeline`] and pass `chat` to it. For large models, setting [device_map="auto"](./models#big-model-inference) helps load the model quicker and automatically places it on the fastest device available.
```py
import torch
from transformers import pipeline
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
pipeline = pipeline(task="text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", dtype="auto", device_map="auto")
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
```
```txt
(sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright,
alright, I'll give you the lowdown. But don't say I didn't warn you, I'm a robot, not a tour guide!
If this works successfully, you should see a response from the model! If you want to continue the conversation,
you need to update the chat history with the model's response. You can do this either by appending the text
to `chat` (use the `assistant` role), or by reading `response[0]["generated_text"]`, which contains
the full chat history, including the most recent response.
So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million
things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of
Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for
something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got
some wild stuff, like that Warhol guy's soup cans and all that jazz.
And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for
those pesky pigeons, they're like little feathered thieves! (laughs) Get it? Thieves? Ah, never mind.
Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might
even catch a glimpse of some up-and-coming comedians... or a bunch of wannabes tryin' to make it big. (winks)
And finally, if you're feelin' like a real New Yorker, grab a slice of pizza from one of the many amazing
pizzerias around the city. Just don't try to order a "robot-sized" slice, trust me, it won't end well. (laughs)
So, there you have it, pal! That's my expert advice on what to do in New York. Now, if you'll
excuse me, I've got some oil changes to attend to. (winks)
```
Use the `append` method on `chat` to respond to the model's message.
Once you have the model's response, you can continue the conversation by appending a new `user` message to the chat history.
```py
chat = response[0]["generated_text"]
chat.append(
{"role": "user", "content": "Wait, what's so wild about soup cans?"}
{"role": "user", "content": "Woah! But can it be reconciled with quantum mechanics?"}
)
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
```
```txt
(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man!
It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's
like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!"
(sarcastically) Oh, yeah, real original, Andy.
By repeating this process, you can continue the conversation as long as you like, at least until the model runs out of context window
or you run out of memory.
But, you know, back in the '60s, it was like, a big deal. People were all about challenging the
status quo, and Warhol was like, the king of that. He took the ordinary and made it extraordinary.
And, let me tell you, it was like, a real game-changer. I mean, who would've thought that a can of soup could be art? (laughs)
## Performance and memory usage
But, hey, you're not alone, pal. I mean, I'm a robot, and even I don't get it. (winks)
But, hey, that's what makes art, art, right? (laughs)
```
## Performance
Transformers loads models in full precision by default, and for an 8B model, this requires ~32GB of memory! Reduce memory usage by loading a model in half-precision or bfloat16 (only uses ~2 bytes per parameter). You can even quantize the model to a lower precision like 8-bit or 4-bit with [bitsandbytes](https://hf.co/docs/bitsandbytes/index).
Transformers loads models in full `float32` precision by default, and for an 8B model, this requires ~32GB of memory! Use the `dtype="auto"` argument, which generally uses `bfloat16` for models that were trained with it, to reduce your memory usage.
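The ~32GB figure follows from multiplying the parameter count by the bytes each parameter occupies. The sketch below reproduces that arithmetic for a few common data types. The `weight_memory_gb` helper is a hypothetical name used for illustration, and the estimate covers model weights only; activations and the key-value cache need additional memory on top of it.

```python
# Bytes used to store a single parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory needed for the model weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# An 8B-parameter model at different precisions.
for dtype in ("float32", "bfloat16", "int8", "int4"):
    print(f"{dtype:>8}: ~{weight_memory_gb(8e9, dtype):.0f} GB")
```

Halving the bytes per parameter halves the weight memory, which is why moving from `float32` to `bfloat16` takes an 8B model from roughly 32GB to roughly 16GB.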
> [!TIP]
> Refer to the [Quantization](./quantization/overview) docs for more information about the different quantization backends available.
Create a [`BitsAndBytesConfig`] with your desired quantization settings and pass it to the pipeline's `model_kwargs` parameter. The example below quantizes a model to 8-bit.
To reduce memory usage even further, you can quantize the model to 8-bit or 4-bit with [bitsandbytes](https://hf.co/docs/bitsandbytes/index). Create a [`BitsAndBytesConfig`] with your desired quantization settings and pass it to the pipeline's `model_kwargs` parameter. The example below quantizes a model to 8-bit.
```py
from transformers import pipeline, BitsAndBytesConfig
@ -143,19 +109,10 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})
```
In general, larger models are slower in addition to requiring more memory because text generation is bottlenecked by **memory bandwidth** instead of compute power. Each active parameter must be read from memory for every generated token. For a 16GB model, 16GB must be read from memory for every generated token.
In general, model size and performance are directly correlated. Larger models are slower in addition to requiring more memory because each active parameter must be read from memory for every generated token.
This is a bottleneck for LLM text generation and the main options for improving generation speed are to either quantize a model or use hardware with higher memory bandwidth. Adding more compute power doesn't meaningfully help.
The number of generated tokens/sec is proportional to the total memory bandwidth of the system divided by the model size. Depending on your hardware, total memory bandwidth can vary. Refer to the table below for approximate generation speeds for different hardware types.
| Hardware | Memory bandwidth |
|---|---|
| consumer CPU | 20-100GB/sec |
| specialized CPU (Intel Xeon, AMD Threadripper/Epyc, Apple silicon) | 200-900GB/sec |
| data center GPU (NVIDIA A100/H100) | 2-3TB/sec |
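Because every active parameter is read once per generated token, dividing memory bandwidth by model size gives a rough ceiling on generation speed. The sketch below applies that ratio to a 16GB model. The `max_tokens_per_sec` helper is a hypothetical name, and the bandwidth values are illustrative picks from within the ranges in the table rather than measurements; real throughput is lower because of compute, caching, and software overhead.

```python
def max_tokens_per_sec(bandwidth_gb_per_sec: float, model_size_gb: float) -> float:
    """Rough upper bound: each token requires reading the whole model once."""
    return bandwidth_gb_per_sec / model_size_gb


MODEL_SIZE_GB = 16  # for example, an 8B model stored in bfloat16

# Illustrative bandwidths chosen from within the ranges listed in the table.
hardware = {
    "consumer CPU": 50,
    "specialized CPU": 400,
    "data center GPU": 2000,
}

for name, bandwidth in hardware.items():
    print(f"{name}: ~{max_tokens_per_sec(bandwidth, MODEL_SIZE_GB):.1f} tokens/sec")
```

Quantizing the model shrinks `MODEL_SIZE_GB`, which raises the ceiling on the same hardware.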
The easiest solution for improving generation speed is to either quantize a model or use hardware with higher memory bandwidth.
You can also try techniques like [speculative decoding](./generation_strategies#speculative-decoding), where a smaller model generates candidate tokens that are verified by the larger model. If the candidate tokens are correct, the larger model can generate more than one token per `forward` pass. This significantly alleviates the bandwidth bottleneck and improves generation speed.
You can also try techniques like [speculative decoding](./generation_strategies#speculative-decoding), where a smaller model generates candidate tokens that are verified by the larger model. If the candidate tokens are correct, the larger model can generate more than one token at a time. This significantly alleviates the bandwidth bottleneck and improves generation speed.
> [!TIP]
> Parameters may not be active for every generated token in MoE models such as [Mixtral](./model_doc/mixtral), [Qwen2MoE](./model_doc/qwen2_moe.md), and [DBRX](./model_doc/dbrx). As a result, MoE models generally have much lower memory bandwidth requirements and can be faster than a regular LLM of the same size. However, techniques like speculative decoding are ineffective with MoE models because parameters become activated with each new speculated token.
Mixture-of-Expert (MoE) models such as [Mixtral](./model_doc/mixtral), [Qwen2MoE](./model_doc/qwen2_moe), and [GPT-OSS](./model_doc/gpt-oss) have lots of parameters, but only "activate" a small fraction of them to generate each token. As a result, MoE models generally have much lower memory bandwidth requirements and can be faster than a regular LLM of the same size. However, techniques like speculative decoding are ineffective with MoE models because more parameters become activated with each new speculated token.

docs/source/en/cursor.md Normal file

@ -0,0 +1,42 @@
# Using Cursor as a client of transformers serve
This example shows how to use `transformers serve` as a local LLM provider for [Cursor](https://cursor.com/), the popular IDE. In this particular case, requests to `transformers serve` will come from an external IP (Cursor's server IPs), which requires some additional setup. Furthermore, some of Cursor's requests require [CORS](https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/CORS), which is disabled by default for security reasons.
To launch a server with CORS enabled, run
```shell
transformers serve --enable-cors
```
You'll also need to expose your server to external IPs. A potential solution is to use [`ngrok`](https://ngrok.com/), which has a permissive free tier. After setting up your `ngrok` account and authenticating on your server machine, run
```shell
ngrok http [port]
```
where `port` is the port used by `transformers serve` (`8000` by default). On the terminal where you launched `ngrok`, you'll see a https address in the "Forwarding" row, as in the image below. This is the address to send requests to.
<h3 align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_serve_ngrok.png"/>
</h3>
You're now ready to set things up on the app side! In Cursor, while you can't set a new provider, you can change the endpoint for OpenAI requests in the model selection settings. First, navigate to "Settings" > "Cursor Settings", "Models" tab, and expand the "API Keys" collapsible. To set your `transformers serve` endpoint, follow this order:
1. Unselect ALL models in the list above (e.g. `gpt4`, ...);
2. Add and select the model you want to use (e.g. `Qwen/Qwen3-4B`);
3. Add some random text to OpenAI API Key. This field won't be used, but it can't be empty;
4. Add the https address from `ngrok` to the "Override OpenAI Base URL" field, appending `/v1` to the address (i.e. `https://(...).ngrok-free.app/v1`);
5. Hit "Verify".
After you follow these steps, your "Models" tab should look like the image below. Your server should also have received a few requests from the verification step.
<h3 align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_serve_cursor.png"/>
</h3>
You are now ready to use your local model in Cursor! For instance, if you toggle the AI Pane, you can select the model you added and ask it questions about your local files.
<h3 align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_serve_cursor_chat.png"/>
</h3>


@ -260,7 +260,7 @@ with deepspeed.zero.Init():
The DeepSpeed config file needs to have `is_deepspeed_zero3_enabled: true` set up in [`TrainingArguments`] and it needs a ZeRO configuration enabled. The [`TrainingArguments`] object must be created **before** calling [`~PreTrainedModel.from_pretrained`].
> [!TIP]
> You'll need ZeRO-3 when the fp16 weights don't fit on a single GPU. But if you're able to load the fp16 weights, set `torch_dtype=torch.float16` in [`~PreTrainedModel.from_pretrained`].
> You'll need ZeRO-3 when the fp16 weights don't fit on a single GPU. But if you're able to load the fp16 weights, set `dtype=torch.float16` in [`~PreTrainedModel.from_pretrained`].
```py
from transformers import AutoModel, Trainer, TrainingArguments


@ -38,7 +38,7 @@ generation_config = GenerationConfig(
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B", pad_token="</s>", padding_side="right")
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="sdpa", generation_config=generation_config)
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", device_map="auto", dtype=torch.bfloat16, attn_implementation="sdpa", generation_config=generation_config)
exported_program = convert_and_export_with_cache(model)
```


@ -31,7 +31,7 @@ from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer("We are very happy to show you the 🤗 Transformers library", return_tensors="pt")
{'input_ids': tensor([[ 2, 1734, 708, 1508, 4915, 577, 1500, 692, 573,
156808, 128149, 9581, 235265]]),
156808, 128149, 9581, 235265]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
@ -62,7 +62,7 @@ from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer("We are very happy to show you the 🤗 Transformers library.", return_tensors="pt")
{'input_ids': tensor([[ 2, 1734, 708, 1508, 4915, 577, 1500, 692, 573,
156808, 128149, 9581, 235265]]),
156808, 128149, 9581, 235265]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
@ -112,7 +112,7 @@ tokenizer = GemmaTokenizerFast(vocab_file="my_vocab_file.txt")
## Multimodal tokenizers
In addition to text tokens, multimodal tokenizers also hold tokens from other modalities as part of their attributes for easy access.
In addition to text tokens, multimodal tokenizers also hold tokens from other modalities as part of their attributes for easy access.
To add these special tokens to a tokenizer, pass them as a dictionary to the `extra_special_tokens` parameter in [`~AutoTokenizer.from_pretrained`]. The example below adds the `image_token` to a vision-language model.
@ -198,7 +198,7 @@ Add the `subfolder` parameter to [`~PreTrainedModel.from_pretrained`] to specify
```py
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", subfolder="original")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", subfolder="original")
```
### Create a tiktoken tokenizer
@ -226,7 +226,7 @@ tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")
<Youtube id="Yffk5aydLzg"/>
A Transformers model expects the input to be a PyTorch, TensorFlow, or NumPy tensor. A tokenizer's job is to preprocess text into those tensors. Specify the framework tensor type to return with the `return_tensors` parameter.
A Transformers model expects the input to be a PyTorch or NumPy tensor. A tokenizer's job is to preprocess text into those tensors. Specify the framework tensor type to return with the `return_tensors` parameter.
```py
from transformers import AutoTokenizer
@ -234,7 +234,7 @@ from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer("We are very happy to show you the 🤗 Transformers library.", return_tensors="pt")
{'input_ids': tensor([[ 2, 1734, 708, 1508, 4915, 577, 1500, 692, 573,
156808, 128149, 9581, 235265]]),
156808, 128149, 9581, 235265]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```
@ -321,12 +321,12 @@ batch_sentences = [
encoded_inputs = tokenizer(batch_sentences, return_tensors="pt")
print(encoded_inputs)
{
'input_ids':
[[2, 1860, 1212, 1105, 2257, 14457, 235336],
[2, 4454, 235303, 235251, 1742, 693, 9242, 1105, 2257, 14457, 235269, 48782, 235265],
[2, 1841, 1105, 29754, 37453, 235336]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'input_ids':
[[2, 1860, 1212, 1105, 2257, 14457, 235336],
[2, 4454, 235303, 235251, 1742, 693, 9242, 1105, 2257, 14457, 235269, 48782, 235265],
[2, 1841, 1105, 29754, 37453, 235336]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]]
}
```

View File

@ -32,12 +32,14 @@ Greedy search works well for tasks with relatively short outputs where creativit
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
device = infer_device()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16).to(device)
# explicitly set to default length because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
@ -52,12 +54,14 @@ Enable multinomial sampling with `do_sample=True` and `num_beams=1`.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
device = infer_device()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16).to(device)
# explicitly set to 50 because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, num_beams=1)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
@ -75,12 +79,14 @@ Enable beam search with the `num_beams` parameter (should be greater than 1 othe
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
device = infer_device()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16).to(device)
# explicitly set to 50 because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=2)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
@ -125,7 +131,7 @@ pipe = pipeline(
"text-generation",
model="meta-llama/Llama-3.1-8B",
assistant_model="meta-llama/Llama-3.2-1B",
torch_dtype=torch.bfloat16
dtype=torch.bfloat16
)
pipe_output = pipe("Once upon a time, ", max_new_tokens=50, do_sample=False)
pipe_output[0]["generated_text"]
@ -160,12 +166,14 @@ Enable prompt lookup decoding with the `prompt_lookup_num_tokens` parameter.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
device = infer_device()
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", torch_dtype=torch.float16).to("cuda")
assistant_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M", torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", dtype=torch.float16).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M", dtype=torch.float16).to(device)
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=20, prompt_lookup_num_tokens=5)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
@ -217,83 +225,6 @@ outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=to
tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
```
### Contrastive search
[Contrastive search](https://huggingface.co/papers/2202.06417) is a decoding strategy that aims to reduce repetition even while generating longer sequences. This strategy compares how similar a generated token is against previous tokens, and if they're more similar, a penalty is applied.
Enable contrastive search with the `penalty_alpha` and `top_k` parameters. The `penalty_alpha` manages the penalty applied and `top_k` is the number of most likely tokens to return.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# explicitly set to 100 because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=100, penalty_alpha=0.6, top_k=4)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
'Hugging Face is an open-source company that provides a platform for building and deploying AI models.\nHugging Face is an open-source company that provides a platform for building and deploying AI models. The platform allows developers to build and deploy AI models, as well as collaborate with other developers.\nHugging Face was founded in 2019 by Thibault Wittemberg and Clément Delangue. The company is based in Paris, France.\nHugging Face has'
```
### DoLa
[Decoding by Contrasting Layers (DoLa)](https://hf.co/papers/2309.03883) is a contrastive decoding strategy for improving factuality and reducing hallucination. This strategy works by contrasting the logit differences between the final and early layers. As a result, factual knowledge localized to particular layers are amplified. DoLa is not recommended for smaller models like GPT-2.
Enable DoLa with the following parameters.
- `dola_layers` are the candidate layers to be contrasted with the final layer. It can be a string (`low` or `high`) to contrast the lower or higher parts of a layer. `high` is recommended for short-answer tasks like TruthfulQA. `low` is recommended for long-answer reasoning tasks like GSM8K, StrategyQA, FACTOR, and VicunaQA.
When a model has tied word embeddings, layer 0 is skipped and it begins from layer 2.
It can also be a list of integers that represent the layer indices between 0 and the total number of layers. Layer 0 is the word embedding, 1 is the first transformer layer, and so on. Refer to the table below for the range of layer indices depending on the number of model layers.
| layers | low | high |
|---|---|---|
| > 40 | (0, 20, 2) | (N - 20, N, 2) |
| <= 40 | range(0, N // 2, 2) | range(N // 2, N, 2) |
- `repetition_penalty` reduces repetition and it is recommended to set it to 1.2.
<hfoptions id="dola">
<hfoption id="contrast higher layers">
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("What is the highest peak in the world??", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50, dola_layers="high", do_sample=False)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
" Mount EverestMount Everest, called Himalaya in Nepali, is the world's highest peak, lying almost 9.5 kilometers above the sea level and the tallest mountain from 19,036.91 ft. The mountain was"
```
</hfoption>
<hfoption id="contrast specific layers">
Contrast layers 18 and 20 with the final layer.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("What is the highest peak in the world?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50, dola_layers=[18,20], do_sample=False, repetition_penalty=1.2)
tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
" Mount EverestMount Everest, called Himalaya in Nepali, is the world's highest peak above sea level and it rises to an incredible height of 29,028 feet above the ocean. Its summit is over a mile taller than Mt"
```
</hfoption>
</hfoptions>
### Diverse beam search
[Diverse beam search](https://hf.co/papers/1610.02424) is a variant of beam search that produces more diverse output candidates to choose from. This strategy measures the dissimilarity of sequences and a penalty is applied if sequences are too similar. To avoid high computation costs, the number of beams is divided into groups.
@ -302,12 +233,14 @@ Enable diverse beam search with the `num_beams`, `num_beam_groups` and `diversit
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
device = infer_device()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16).to(device)
# explicitly set to 50 because Llama2 generation length is 4096
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=6, num_beam_groups=3, diversity_penalty=1.0, do_sample=False)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
@ -315,37 +248,37 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
## Custom decoding methods
## Custom generation methods
Custom decoding methods enable specialized generation behavior such as the following:
Custom generation methods enable specialized behavior such as:
- have the model continue thinking if it is uncertain;
- roll back generation if the model gets stuck;
- handle special tokens with custom logic;
- enhanced input preparation for advanced models;
- use specialized KV caches;
We enable custom decoding methods through model repositories, assuming a specific model tag and file structure (see subsection below). This feature is an extension of [custom modeling code](./models.md#custom-models) and, as such, requires setting `trust_remote_code=True`.
We enable custom generation methods through model repositories, assuming a specific model tag and file structure (see subsection below). This feature is an extension of [custom modeling code](./models.md#custom-models) and, as such, requires setting `trust_remote_code=True`.
If a model repository holds a custom decoding method, the easiest way to try it out is to load the model and generate with it:
If a model repository holds a custom generation method, the easiest way to try it out is to load the model and generate with it:
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
# `transformers-community/custom_generate_example` holds a copy of `Qwen/Qwen2.5-0.5B-Instruct`, but
# with custom generation code -> calling `generate` uses the custom decoding method!
# with custom generation code -> calling `generate` uses the custom generation method!
tokenizer = AutoTokenizer.from_pretrained("transformers-community/custom_generate_example")
model = AutoModelForCausalLM.from_pretrained(
"transformers-community/custom_generate_example", device_map="auto", trust_remote_code=True
)
inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
# The custom decoding method is a minimal greedy decoding implementation. It also prints a custom message at run time.
# The custom generation method is a minimal greedy decoding implementation. It also prints a custom message at run time.
gen_out = model.generate(**inputs)
# you should now see its custom message, "✨ using a custom generation method ✨"
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))
'The quick brown fox jumps over a lazy dog, and the dog is a type of animal. Is'
```
Model repositories with custom decoding methods have a special property: their decoding method can be loaded from **any** model through [`~GenerationMixin.generate`]'s `custom_generate` argument. This means anyone can create and share their custom generation method to potentially work with any Transformers model, without requiring users to install additional Python packages.
Model repositories with custom generation methods have a special property: their generation method can be loaded from **any** model through [`~GenerationMixin.generate`]'s `custom_generate` argument. This means anyone can create and share their custom generation method to potentially work with any Transformers model, without requiring users to install additional Python packages.
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
@ -354,7 +287,7 @@ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")
inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
# `custom_generate` replaces the original `generate` by the custom decoding method defined in
# `custom_generate` replaces the original `generate` by the custom generation method defined in
# `transformers-community/custom_generate_example`
gen_out = model.generate(**inputs, custom_generate="transformers-community/custom_generate_example", trust_remote_code=True)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
@ -364,7 +297,7 @@ print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])
You should read the `README.md` file of the repository containing the custom generation strategy to see what the new arguments and output type differences are, if they exist. Otherwise, you can assume it works like the base [`~GenerationMixin.generate`] method.
> [!TIP]
> You can find all custom decoding methods by [searching for their custom tag.](https://huggingface.co/models?other=custom_generate), `custom_generate`
> You can find all custom generation methods by [searching for their custom tag](https://huggingface.co/models?other=custom_generate), `custom_generate`.
Consider the Hub repository [transformers-community/custom_generate_example](https://huggingface.co/transformers-community/custom_generate_example) as an example. The `README.md` states that it has an additional input argument, `left_padding`, which adds a number of padding tokens before the prompt.
@ -387,11 +320,11 @@ torch>=99.0 (installed: 2.6.0)
Updating your Python requirements accordingly will remove this error message.
### Creating a custom decoding method
### Creating a custom generation method
To create a new decoding method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.
1. The model you've designed your decoding method with.
2. `custom_generate/generate.py`, which contains all the logic for your custom decoding method.
To create a new generation method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.
1. The model you've designed your generation method with.
2. `custom_generate/generate.py`, which contains all the logic for your custom generation method.
3. `custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method.
4. `README.md`, where you should add the `custom_generate` tag and document any new arguments or output type differences of your custom method here.
@ -409,7 +342,7 @@ your_repo/
#### Adding the base model
The starting point for your custom decoding method is a model repository just like any other. The model to add to this repository should be the model you've designed your method with, and it is meant to be part of a working self-contained model-generate pair. When the model in this repository is loaded, your custom decoding method will override `generate`. Don't worry -- your decoding method can still be loaded with any other Transformers model, as explained in the section above.
The starting point for your custom generation method is a model repository just like any other. The model to add to this repository should be the model you've designed your method with, and it is meant to be part of a working self-contained model-generate pair. When the model in this repository is loaded, your custom generation method will override `generate`. Don't worry -- your generation method can still be loaded with any other Transformers model, as explained in the section above.
If you simply want to copy an existing model, you can do
@ -418,13 +351,13 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("source/model_repo")
model = AutoModelForCausalLM.from_pretrained("source/model_repo")
tokenizer.save_pretrained("your/decoding_method", push_to_hub=True)
model.save_pretrained("your/decoding_method", push_to_hub=True)
tokenizer.save_pretrained("your/generation_method", push_to_hub=True)
model.save_pretrained("your/generation_method", push_to_hub=True)
```
#### generate.py
This is the core of your decoding method. It *must* contain a method named `generate`, and this method *must* contain a `model` argument as its first argument. `model` is the model instance, which means you have access to all attributes and methods in the model, including the ones defined in [`GenerationMixin`] (like the base `generate` method).
This is the core of your generation method. It *must* contain a method named `generate`, and this method *must* contain a `model` argument as its first argument. `model` is the model instance, which means you have access to all attributes and methods in the model, including the ones defined in [`GenerationMixin`] (like the base `generate` method).
> [!WARNING]
> `generate.py` must be placed in a folder named `custom_generate`, and not at the root level of the repository. The file paths for this feature are hardcoded.
@ -465,7 +398,7 @@ def generate(model, input_ids, generation_config=None, left_padding=None, **kwar
return input_ids
```
Follow the recommended practices below to ensure your custom decoding method works as expected.
Follow the recommended practices below to ensure your custom generation method works as expected.
- Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`].
- Pin the `transformers` version in the requirements if you use any private method/attribute in `model`.
- Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.
@ -476,7 +409,7 @@ Your custom `generate` method can relative import code from the `custom_generate
from .utils import some_function
```
Only relative imports from the same-level `custom_generate` folder are supported. Parent/sibling folder imports are not valid. The `custom_generate` argument also works locally with any directory that contains a `custom_generate` structure. This is the recommended workflow for developing your custom decoding method.
Only relative imports from the same-level `custom_generate` folder are supported. Parent/sibling folder imports are not valid. The `custom_generate` argument also works locally with any directory that contains a `custom_generate` structure. This is the recommended workflow for developing your custom generation method.
#### requirements.txt
@ -485,7 +418,7 @@ You can optionally specify additional Python requirements in a `requirements.txt
#### README.md
The root level `README.md` in the model repository usually describes the model therein. However, since the focus of the repository is the custom decoding method, we highly recommend to shift its focus towards describing the custom decoding method. In addition to a description of the method, we recommend documenting any input and/or output differences to the original [`~GenerationMixin.generate`]. This way, users can focus on what's new, and rely on Transformers docs for generic implementation details.
The root level `README.md` in the model repository usually describes the model therein. However, since the focus of the repository is the custom generation method, we highly recommend to shift its focus towards describing the custom generation method. In addition to a description of the method, we recommend documenting any input and/or output differences to the original [`~GenerationMixin.generate`]. This way, users can focus on what's new, and rely on Transformers docs for generic implementation details.
For discoverability, we highly recommend you to add the `custom_generate` tag to your repository. To do so, the top of your `README.md` file should look like the example below. After you push the file, you should see the tag in your repository!
@ -504,6 +437,36 @@ Recommended practices:
- Add self-contained examples to enable quick experimentation.
- Describe soft-requirements such as if the method only works well with a certain family of models.
### Reusing `generate`'s input preparation
If you're adding a new decoding loop, you might want to preserve the input preparation present in `generate` (batch expansion, attention masks, logits processors, stopping criteria, etc.). To do so, pass a **callable** to `custom_generate`: this reuses [`~GenerationMixin.generate`]'s full preparation pipeline and overrides only the decoding loop.
```py
def custom_loop(model, input_ids, attention_mask, logits_processor, stopping_criteria, generation_config, **model_kwargs):
next_tokens = input_ids
while input_ids.shape[1] < stopping_criteria[0].max_length:
logits = model(next_tokens, attention_mask=attention_mask, **model_kwargs).logits
next_token_logits = logits_processor(input_ids, logits[:, -1, :])
next_tokens = torch.argmax(next_token_logits, dim=-1)[:, None]
input_ids = torch.cat((input_ids, next_tokens), dim=-1)
attention_mask = torch.cat((attention_mask, torch.ones_like(next_tokens)), dim=-1)
return input_ids
output = model.generate(
**inputs,
custom_generate=custom_loop,
max_new_tokens=10,
)
```
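To see the control flow of such a loop without loading a model, the following is a toy sketch in plain Python. It does not use the Transformers API: `toy_logits` is a hypothetical stand-in for a model forward pass, and `greedy_loop` and `eos_token_id` are illustrative names, not part of `generate`.

```py
def toy_logits(token_ids, vocab_size=5):
    # Hypothetical stand-in for a forward pass: it always favors
    # the token (last token + 1) modulo the vocabulary size.
    last = token_ids[-1]
    return [1.0 if tok == (last + 1) % vocab_size else 0.0 for tok in range(vocab_size)]


def greedy_loop(input_ids, max_length, eos_token_id=None):
    ids = list(input_ids)
    # Same shape as the loop above: stop once the sequence reaches max_length.
    while len(ids) < max_length:
        logits = toy_logits(ids)
        # Greedy step: pick the index of the highest logit.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_token)
        # Optional early stop on an end-of-sequence token.
        if next_token == eos_token_id:
            break
    return ids


print(greedy_loop([0, 1], max_length=6))
```

A real loop additionally has to extend the attention mask and apply the logits processors at every step, as the callable above does.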
> [!TIP]
> If you publish a `custom_generate` repository, your `generate` implementation can itself define a callable and pass it to `model.generate()`. This lets you customize the decoding loop while still benefiting from Transformers' built-in input preparation logic.
### Finding custom generation methods
You can find all custom generation methods by [searching for their custom tag](https://huggingface.co/models?other=custom_generate), `custom_generate`. In addition to the tag, we curate two collections of `custom_generate` methods:
- [Custom generation methods - Community](https://huggingface.co/collections/transformers-community/custom-generation-methods-community-6888fb1da0efbc592d3a8ab6) -- a collection of powerful methods contributed by the community;
- [Custom generation methods - Tutorials](https://huggingface.co/collections/transformers-community/custom-generation-methods-tutorials-6823589657a94940ea02cfec) -- a collection of reference implementations for methods that previously were part of `transformers`, as well as tutorials for `custom_generate`.
## Resources

View File

@ -33,14 +33,15 @@ Add the `gguf_file` parameter to [`~PreTrainedModel.from_pretrained`] to specify
```py
# pip install gguf
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"
torch_dtype = torch.float32 # could be torch.float16 or torch.bfloat16 too
dtype = torch.float32 # could be torch.float16 or torch.bfloat16 too
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename, torch_dtype=torch_dtype)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename, dtype=dtype)
```
Once you're done tinkering with the model, save and convert it back to the GGUF format with the [convert_hf_to_gguf.py](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py) script.

View File

@ -67,7 +67,7 @@ We can see that 0s have been added on the right of the first sentence to make it
[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
```
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
This can then be converted into a tensor in PyTorch. The attention mask is a binary tensor indicating the
position of the padded indices so that the model does not attend to them. For the [`BertTokenizer`], `1` indicates a
value that should be attended to, while `0` indicates a padded value. This attention mask is in the dictionary returned
by the tokenizer under the key "attention_mask":
@ -114,7 +114,7 @@ A type of layer in a neural network where the input matrix is multiplied element
### DataParallel (DP)
Parallelism technique for training on multiple GPUs where the same setup is replicated multiple times, with each instance
Parallelism technique for training on multiple GPUs where the same setup is replicated multiple times, with each instance
receiving a distinct data slice. The processing is done in parallel and all setups are synchronized at the end of each training step.
Learn more about how DataParallel works [here](perf_train_gpu_many#dataparallel-vs-distributeddataparallel).
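As a rough illustration of the idea, the plain-Python sketch below simulates replicas that each compute a gradient on their own data slice and then synchronize by averaging. The one-parameter model, the function names, and the assumption that the batch divides evenly across replicas are all illustrative. This is not how PyTorch implements DataParallel.

```py
def grad_mse(w, xs, ys):
    # Gradient of mean((w * x - y) ** 2) with respect to w.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_step(w, xs, ys, num_replicas, lr=0.1):
    # Assumes len(xs) is divisible by num_replicas.
    shard = len(xs) // num_replicas
    grads = []
    for r in range(num_replicas):
        lo, hi = r * shard, (r + 1) * shard
        # Each replica holds the same weight but sees a distinct data slice.
        grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    # Synchronization at the end of the step: average the replica gradients.
    synced = sum(grads) / num_replicas
    return w - lr * synced


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
print(data_parallel_step(0.0, xs, ys, num_replicas=2))
```

With equally sized slices, the averaged gradient matches the gradient computed on the full batch, so the update is the same as on a single device.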
@ -295,7 +295,7 @@ These labels are different according to the model head, for example:
`class_labels` and `boxes` key where each value of the batch corresponds to the expected label and number of bounding boxes of each individual image.
- For automatic speech recognition models ([`Wav2Vec2ForCTC`]), the model expects a tensor of dimension `(batch_size,
target_length)` with each value corresponding to the expected label of each individual token.
<Tip>
Each model's labels may be different, so be sure to always check the documentation of each model for more information
@ -346,8 +346,8 @@ For more details, see [Pipelines for inference](https://huggingface.co/docs/tran
### PipelineParallel (PP)
Parallelism technique in which the model is split up vertically (layer-level) across multiple GPUs, so that only one or
several layers of the model are placed on a single GPU. Each GPU processes in parallel different stages of the pipeline
Parallelism technique in which the model is split up vertically (layer-level) across multiple GPUs, so that only one or
several layers of the model are placed on a single GPU. Each GPU processes in parallel different stages of the pipeline
and works on a small chunk of the batch. Learn more about how PipelineParallel works [here](perf_train_gpu_many#from-naive-model-parallelism-to-pipeline-parallelism).
### pixel values
@ -379,7 +379,7 @@ The task of preparing raw data into a format that can be easily consumed by mach
A model that has been pretrained on some data (for instance all of Wikipedia). Pretraining methods involve a
self-supervised objective, which can be reading the text and trying to predict the next word (see [causal language
modeling](#causal-language-modeling)) or masking some words and trying to predict them (see [masked language
modeling](#masked-language-modeling-mlm)).
modeling](#masked-language-modeling-mlm)).
Speech and vision models have their own pretraining objectives. For example, Wav2Vec2 is a speech model pretrained on a contrastive task which requires the model to identify the "true" speech representation from a set of "false" speech representations. On the other hand, BEiT is a vision model pretrained on a masked image modeling task which masks some of the image patches and requires the model to predict the masked patches (similar to the masked language modeling objective).
@ -403,9 +403,9 @@ A measurement in hertz of the number of samples (the audio signal) taken per sec
Each element of the input finds out which other elements of the input they should attend to.
### self-supervised learning
### self-supervised learning
A category of machine learning techniques in which a model creates its own learning objective from unlabeled data. It differs from [unsupervised learning](#unsupervised-learning) and [supervised learning](#supervised-learning) in that the learning process is supervised, but not explicitly from the user.
A category of machine learning techniques in which a model creates its own learning objective from unlabeled data. It differs from [unsupervised learning](#unsupervised-learning) and [supervised learning](#supervised-learning) in that the learning process is supervised, but not explicitly from the user.
One example of self-supervised learning is [masked language modeling](#masked-language-modeling-mlm), where a model is passed sentences with a proportion of its tokens removed and learns to predict the missing tokens.
@ -436,9 +436,9 @@ A form of model training that directly uses labeled data to correct and instruct
### Tensor Parallelism (TP)
Parallelism technique for training on multiple GPUs in which each tensor is split up into multiple chunks, so instead of
having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU. Shards gets
processed separately and in parallel on different GPUs and the results are synced at the end of the processing step.
Parallelism technique for training on multiple GPUs in which each tensor is split up into multiple chunks, so instead of
having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU. Shards get
processed separately and in parallel on different GPUs and the results are synced at the end of the processing step.
This is what is sometimes called horizontal parallelism, as the splitting happens on horizontal level.
Learn more about Tensor Parallelism [here](perf_train_gpu_many#tensor-parallelism).
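The sharding idea can be illustrated with a toy, pure-Python sketch. It assumes a column-wise split of a weight matrix and uses hypothetical helper names (`matmul`, `split_columns`, `tensor_parallel_matmul`); real implementations run each shard on its own GPU and use collective communication to sync results, which is only imitated here by list concatenation.

```python
def matmul(x, w):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split the weight matrix into `num_shards` equal column blocks."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("number of columns must be divisible by num_shards")
    step = cols // num_shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_shards)]


def tensor_parallel_matmul(x, w, num_shards):
    """Compute x @ w shard by shard, then concatenate the partial outputs."""
    shards = split_columns(w, num_shards)
    # In a real setup each partial product would run on a different GPU.
    partials = [matmul(x, shard) for shard in shards]
    # "Sync" step: stitch the column blocks of every output row back together.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, num_shards=2))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

Each shard only ever holds half of the weight columns, yet the concatenated result matches the unsharded product.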
@ -516,7 +516,7 @@ A form of model training in which data provided to the model is not labeled. Uns
### Zero Redundancy Optimizer (ZeRO)
Parallelism technique which performs sharding of the tensors somewhat similar to [TensorParallel](#tensor-parallelism-tp),
except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need
to be modified. This method also supports various offloading techniques to compensate for limited GPU memory.
Parallelism technique which performs sharding of the tensors somewhat similar to [TensorParallel](#tensor-parallelism-tp),
except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need
to be modified. This method also supports various offloading techniques to compensate for limited GPU memory.
Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).

View File

@ -37,7 +37,6 @@ An example `model_init` function is shown below.
def model_init(trial):
    return AutoModelForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
@ -103,7 +102,7 @@ def ray_hp_space(trial):
"per_device_train_batch_size": tune.choice([16, 32, 64, 128]),
}
best_trials = trainer.hyperparameter_search(
best_trials = trainer.hyperparameter_search(
direction=["minimize", "maximize"],
backend="ray",
hp_space=ray_hp_space,
@ -128,7 +127,7 @@ def sigopt_hp_space(trial):
},
]
best_trials = trainer.hyperparameter_search(
best_trials = trainer.hyperparameter_search(
direction=["minimize", "maximize"],
backend="sigopt",
hp_space=sigopt_hp_space,
@ -153,7 +152,7 @@ def wandb_hp_space(trial):
},
}
best_trials = trainer.hyperparameter_search(
best_trials = trainer.hyperparameter_search(
direction=["minimize", "maximize"],
backend="wandb",
hp_space=wandb_hp_space,

View File

@ -20,7 +20,7 @@ rendered properly in your Markdown viewer.
# Installation
Transformers works with [PyTorch](https://pytorch.org/get-started/locally/), [TensorFlow 2.0](https://www.tensorflow.org/install/pip), and [Flax](https://flax.readthedocs.io/en/latest/). It has been tested on Python 3.9+, PyTorch 2.1+, TensorFlow 2.6+, and Flax 0.4.1+.
Transformers works with [PyTorch](https://pytorch.org/get-started/locally/). It has been tested on Python 3.9+ and PyTorch 2.2+.
## Virtual environment
@ -74,7 +74,7 @@ uv pip install transformers
</hfoption>
</hfoptions>
For GPU acceleration, install the appropriate CUDA drivers for [PyTorch](https://pytorch.org/get-started/locally) and [TensorFlow](https://www.tensorflow.org/install/pip).
For GPU acceleration, install the appropriate CUDA drivers for [PyTorch](https://pytorch.org/get-started/locally).
Run the command below to check if your system detects an NVIDIA GPU.
@ -84,42 +84,11 @@ nvidia-smi
To install a CPU-only version of Transformers and a machine learning framework, run the following command.
<hfoptions id="cpu-only">
<hfoption id="PyTorch">
```bash
pip install 'transformers[torch]'
uv pip install 'transformers[torch]'
```
</hfoption>
<hfoption id="TensorFlow">
For Apple M1 hardware, you need to install CMake and pkg-config first.
```bash
brew install cmake
brew install pkg-config
```
Install TensorFlow 2.0.
```bash
pip install 'transformers[tf-cpu]'
uv pip install 'transformers[tf-cpu]'
```
</hfoption>
<hfoption id="Flax">
```bash
pip install 'transformers[flax]'
uv pip install 'transformers[flax]'
```
</hfoption>
</hfoptions>
Test whether the install was successful with the following command. It should return a label and score for the provided text.
```bash

View File

@ -48,3 +48,4 @@ Most of those are only useful if you are studying the general code in the librar
## Other Utilities
[[autodoc]] utils._LazyModule
[[autodoc]] pytorch_utils.infer_device

View File

@ -66,8 +66,6 @@ values. Here, for instance, it has two keys that are `sequences` and `scores`.
We document here all output types.
### PyTorch
[[autodoc]] generation.GenerateDecoderOnlyOutput
[[autodoc]] generation.GenerateEncoderDecoderOutput
@ -76,42 +74,12 @@ We document here all output types.
[[autodoc]] generation.GenerateBeamEncoderDecoderOutput
### TensorFlow
[[autodoc]] generation.TFGreedySearchEncoderDecoderOutput
[[autodoc]] generation.TFGreedySearchDecoderOnlyOutput
[[autodoc]] generation.TFSampleEncoderDecoderOutput
[[autodoc]] generation.TFSampleDecoderOnlyOutput
[[autodoc]] generation.TFBeamSearchEncoderDecoderOutput
[[autodoc]] generation.TFBeamSearchDecoderOnlyOutput
[[autodoc]] generation.TFBeamSampleEncoderDecoderOutput
[[autodoc]] generation.TFBeamSampleDecoderOnlyOutput
[[autodoc]] generation.TFContrastiveSearchEncoderDecoderOutput
[[autodoc]] generation.TFContrastiveSearchDecoderOnlyOutput
### FLAX
[[autodoc]] generation.FlaxSampleOutput
[[autodoc]] generation.FlaxGreedySearchOutput
[[autodoc]] generation.FlaxBeamSearchOutput
## LogitsProcessor
A [`LogitsProcessor`] can be used to modify the prediction scores of a language model head for
generation.
### PyTorch
[[autodoc]] AlternatingCodebooksLogitsProcessor
- __call__
@ -210,93 +178,6 @@ generation.
- __call__
### TensorFlow
[[autodoc]] TFForcedBOSTokenLogitsProcessor
- __call__
[[autodoc]] TFForcedEOSTokenLogitsProcessor
- __call__
[[autodoc]] TFForceTokensLogitsProcessor
- __call__
[[autodoc]] TFLogitsProcessor
- __call__
[[autodoc]] TFLogitsProcessorList
- __call__
[[autodoc]] TFLogitsWarper
- __call__
[[autodoc]] TFMinLengthLogitsProcessor
- __call__
[[autodoc]] TFNoBadWordsLogitsProcessor
- __call__
[[autodoc]] TFNoRepeatNGramLogitsProcessor
- __call__
[[autodoc]] TFRepetitionPenaltyLogitsProcessor
- __call__
[[autodoc]] TFSuppressTokensAtBeginLogitsProcessor
- __call__
[[autodoc]] TFSuppressTokensLogitsProcessor
- __call__
[[autodoc]] TFTemperatureLogitsWarper
- __call__
[[autodoc]] TFTopKLogitsWarper
- __call__
[[autodoc]] TFTopPLogitsWarper
- __call__
### FLAX
[[autodoc]] FlaxForcedBOSTokenLogitsProcessor
- __call__
[[autodoc]] FlaxForcedEOSTokenLogitsProcessor
- __call__
[[autodoc]] FlaxForceTokensLogitsProcessor
- __call__
[[autodoc]] FlaxLogitsProcessor
- __call__
[[autodoc]] FlaxLogitsProcessorList
- __call__
[[autodoc]] FlaxLogitsWarper
- __call__
[[autodoc]] FlaxMinLengthLogitsProcessor
- __call__
[[autodoc]] FlaxSuppressTokensAtBeginLogitsProcessor
- __call__
[[autodoc]] FlaxSuppressTokensLogitsProcessor
- __call__
[[autodoc]] FlaxTemperatureLogitsWarper
- __call__
[[autodoc]] FlaxTopKLogitsWarper
- __call__
[[autodoc]] FlaxTopPLogitsWarper
- __call__
[[autodoc]] FlaxWhisperTimeStampLogitsProcessor
- __call__
## StoppingCriteria
@ -363,37 +244,34 @@ A [`Constraint`] can be used to force the generation to include specific tokens
- get_max_cache_shape
- reset
- reorder_cache
- lazy_initialization
[[autodoc]] DynamicLayer
- update
- lazy_initialization
- crop
- batch_repeat_interleave
- batch_select_indices
[[autodoc]] StaticLayer
- update
- lazy_initialization
[[autodoc]] SlidingWindowLayer
- update
- lazy_initialization
[[autodoc]] CacheProcessor
- pre_update
- post_update
[[autodoc]] QuantoQuantizedLayer
- update
- lazy_initialization
[[autodoc]] OffloadedCacheProcessor
- pre_update
[[autodoc]] QuantizedCacheProcessor
- post_update
[[autodoc]] QuantoQuantizedCacheProcessor
- post_update
[[autodoc]] HQQQuantizedCacheProcessor
- post_update
[[autodoc]] HQQQuantizedLayer
- update
- lazy_initialization
[[autodoc]] Cache
- update
- early_initialization
- get_seq_length
- get_mask_sizes
- get_max_cache_shape
@ -411,12 +289,8 @@ A [`Constraint`] can be used to force the generation to include specific tokens
[[autodoc]] QuantoQuantizedCache
[[autodoc]] QuantoQuantizedCacheProcessor
[[autodoc]] HQQQuantizedCache
[[autodoc]] HQQQuantizedCacheProcessor
[[autodoc]] OffloadedCache
[[autodoc]] StaticCache
@ -433,15 +307,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens
- to_legacy_cache
- from_legacy_cache
[[autodoc]] MambaCache
- update_conv_state
- update_ssm_state
- reset
[[autodoc]] CacheConfig
[[autodoc]] QuantizedCacheConfig
## Watermark Utils

View File

@ -53,31 +53,3 @@ Most of those are only useful if you are studying the code of the models in the
[[autodoc]] pytorch_utils.prune_conv1d_layer
[[autodoc]] pytorch_utils.prune_linear_layer
## TensorFlow custom layers
[[autodoc]] modeling_tf_utils.TFConv1D
[[autodoc]] modeling_tf_utils.TFSequenceSummary
## TensorFlow loss functions
[[autodoc]] modeling_tf_utils.TFCausalLanguageModelingLoss
[[autodoc]] modeling_tf_utils.TFMaskedLanguageModelingLoss
[[autodoc]] modeling_tf_utils.TFMultipleChoiceLoss
[[autodoc]] modeling_tf_utils.TFQuestionAnsweringLoss
[[autodoc]] modeling_tf_utils.TFSequenceClassificationLoss
[[autodoc]] modeling_tf_utils.TFTokenClassificationLoss
## TensorFlow Helper Functions
[[autodoc]] modeling_tf_utils.get_initializer
[[autodoc]] modeling_tf_utils.keras_serializable
[[autodoc]] modeling_tf_utils.shape_list

32
docs/source/en/jan.md Normal file
View File

@ -0,0 +1,32 @@
# Jan: using the serving API as a local LLM provider
This example shows how to use `transformers serve` as a local LLM provider for the [Jan](https://jan.ai/) app. Jan is a ChatGPT-alternative graphical interface, fully running on your machine. The requests to `transformers serve` come directly from the local app -- while this section focuses on Jan, you can extrapolate some instructions to other apps that make local requests.
## Running models locally
To connect `transformers serve` with Jan, you'll need to set up a new model provider ("Settings" > "Model Providers"). Click on "Add Provider", and set a new name. In your new model provider page, all you need to set is the "Base URL" to the following pattern:
```shell
http://[host]:[port]/v1
```
where `host` and `port` are the `transformers serve` CLI parameters (`localhost:8000` by default). After setting this up, you should be able to see some models in the "Models" section, hitting "Refresh". Make sure you add some text in the "API key" text field too -- this data is not actually used, but the field can't be empty. Your custom model provider page should look like this:
<h3 align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_serve_jan_model_providers.png"/>
</h3>
You are now ready to chat!
> [!TIP]
> You can add any `transformers`-compatible model to Jan through `transformers serve`. In the custom model provider you created, click on the "+" button in the "Models" section and add its Hub repository name, e.g. `Qwen/Qwen3-4B`.
## Running models on a separate machine
To conclude this example, let's look into a more advanced use-case. If you have a beefy machine to serve models with, but prefer using Jan on a different device, you need to add port forwarding. If you have `ssh` access from your Jan machine into your server, this can be accomplished by typing the following in your Jan machine's terminal:
```
ssh -N -f -L 8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server
```
Port forwarding is not Jan-specific: you can use it to connect `transformers serve` running on a different machine with an app of your choice.
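Before pointing an app at the server, it can help to check the base URL from a script. The sketch below is a minimal, standard-library-only example; `build_base_url` and `list_models` are hypothetical helper names, and the `/models` route with a `data` list of `id` entries is an assumption based on the OpenAI-style API convention, not something confirmed here.

```python
import json
import urllib.error
import urllib.request


def build_base_url(host="localhost", port=8000):
    """Build the base URL pattern expected by the model provider settings."""
    return f"http://{host}:{port}/v1"


def list_models(base_url, timeout=5.0):
    """Return model ids from `<base_url>/models`, or None if the server is unreachable.

    Assumption: the server answers with an OpenAI-style payload {"data": [{"id": ...}]}.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    if not isinstance(payload, dict):
        return None
    return [entry.get("id") for entry in payload.get("data", []) if isinstance(entry, dict)]


if __name__ == "__main__":
    url = build_base_url()  # same defaults as the CLI: localhost:8000
    print(url, "->", list_models(url))
```

With the `ssh` tunnel from the previous section in place, the same `localhost:8000` URL reaches the remote server.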

View File

@ -22,20 +22,19 @@ A KV *cache* stores these calculations so they can be reused without recomputing
Transformers offers several [`Cache`] classes that implement different caching mechanisms. Some of these [`Cache`] classes are optimized to save memory while others are designed to maximize generation speed. Refer to the table below to compare cache types and use it to help you select the best cache for your use case.
| Cache Type | Memory Efficient  | Supports torch.compile() | Initialization Recommended | Latency | Long Context Generation |
|------------------------|------------------|--------------------------|----------------------------|---------|-------------------------|
| Dynamic Cache | No | No | No | Mid | No |
| Static Cache | No | Yes | Yes | High | No |
| Offloaded Cache | Yes | No | No | Low | Yes |
| Offloaded Static Cache | No | Yes | Yes | High | Yes |
| Quantized Cache | Yes | No | No | Low | Yes |
| Sliding Window Cache | No | Yes | Yes | High | No |
| Cache Type | Supports sliding layers | Supports offloading | Supports torch.compile() | Expected memory usage |
|------------------------|--------------------------|---------------------|--------------------------|-----------------------|
| Dynamic Cache | Yes | Yes | No | Medium |
| Static Cache | Yes | Yes | Yes | High |
| Quantized Cache | No | No    | No | Low |
This guide introduces you to the different [`Cache`] classes and shows you how to use them for generation.
## Default cache
The [`DynamicCache`] is the default cache class for most models. It allows the cache size to grow dynamically in order to store an increasing number of keys and values as generation progresses.
The [`DynamicCache`] is the default cache class for all models. It allows the cache size to grow dynamically in order to store an increasing number of keys and values as generation progresses.
Note that for models using sliding window attention (Mistral, Gemma2,...) or chunked attention (Llama4), the cache will stop growing when the layers using these types of attention have reached their maximum size (the sliding window or chunk size).
Disable the cache by configuring `use_cache=False` in [`~GenerationMixin.generate`].
@ -44,153 +43,44 @@ import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", dtype=torch.float16, device_map="auto")
inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
model.generate(**inputs, do_sample=False, max_new_tokens=20, use_cache=False)
```
Cache classes can also be initialized first before calling and passing it to the models [past_key_values](https://hf.co/docs/transformers/internal/generation_utils#transformers.generation.GenerateDecoderOnlyOutput.past_key_values) parameter. This cache initialization strategy is only recommended for some cache types.
A cache class can also be instantiated first and then passed to the model's [past_key_values](https://hf.co/docs/transformers/internal/generation_utils#transformers.generation.GenerateDecoderOnlyOutput.past_key_values) parameter. This can be useful for more fine-grained control, or for more advanced usage such as context caching.
In most other cases, it's easier to define the cache strategy in the [cache_implementation](https://hf.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.cache_implementation) parameter.
In most cases, it's easier to define the cache strategy in the [cache_implementation](https://hf.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.cache_implementation) parameter.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", dtype=torch.float16, device_map="auto")
inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
past_key_values = DynamicCache()
past_key_values = DynamicCache(config=model.config)
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, past_key_values=past_key_values)
```
## Memory efficient caches
## Fixed-size cache
The KV cache can occupy a significant portion of memory and become a [bottleneck](https://hf.co/blog/llama31#inference-memory-requirements) for long-context generation. Memory efficient caches focus on trading off speed for reduced memory usage. This is especially important for large language models (LLMs) and if your hardware is memory constrained.
The default [`DynamicCache`] prevents you from taking advantage of most just-in-time (JIT) optimizations because the cache size isn't fixed. JIT optimizations enable you to minimize latency at the expense of memory usage. All of the following cache types are compatible with JIT optimizations like [torch.compile](./llm_optims#static-kv-cache-and-torchcompile) to accelerate generation.
### Offloaded cache
A fixed-size cache ([`StaticCache`]) pre-allocates a specific maximum cache size for the kv pairs. You can generate up to the maximum cache size without needing to modify it. However, because the key/value states have a fixed (usually large) size, many cache positions are masked during generation since they should not take part in the attention. This makes it easy to `compile` the decoding stage, but part of the attention computation is wasted on masked positions. It is a trade-off: it works well if you generate several sequences of roughly the same length, but may be sub-optimal if you have, for example, one very long sequence followed by only short sequences (the fixed cache size would be large, and much of it would be wasted on the short sequences). Make sure you understand the impact if you use it!
The [`OffloadedCache`] saves GPU memory by moving the KV cache for most model layers to the CPU. Only the current layer cache is maintained on the GPU during a models `forward` iteration over the layers. [`OffloadedCache`] asynchronously prefetches the next layer cache and sends the previous layer cache back to the CPU.
As for [`DynamicCache`], note that for models using sliding window attention (Mistral, Gemma2,...) or chunked attention (Llama4), the cache will never be larger than the sliding window/chunk size on layers using these types of attention, even if the maximum length specified is larger.
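The trade-off can be made concrete with a back-of-the-envelope sketch. It assumes a deliberately simplified model in which every sequence gets `max_cache_len` pre-allocated slots and only the slots holding real tokens are useful; `wasted_fraction` is a hypothetical helper, not part of the library.

```python
def wasted_fraction(max_cache_len, sequence_lengths):
    """Fraction of pre-allocated cache slots that never hold a real token."""
    used = sum(min(length, max_cache_len) for length in sequence_lengths)
    allocated = max_cache_len * len(sequence_lengths)
    return 1 - used / allocated


# Sequences of similar length: little of the fixed-size cache is wasted.
similar = wasted_fraction(1024, [900, 1000, 950, 1024])
# One long sequence forces a large cache that short sequences barely fill.
mixed = wasted_fraction(4096, [4096, 100, 120, 80])

print(f"similar lengths: {similar:.1%} wasted")
print(f"mixed lengths:   {mixed:.1%} wasted")
```

Under these assumptions, the similar-length case wastes only a few percent of the slots, while the mixed case wastes well over half of them.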
This cache strategy always generates the same result as [`DynamicCache`] and works as a drop-in replacement or fallback. You may want to use [`OffloadedCache`] if you have a GPU and you're getting out-of-memory (OOM) errors.
> [!WARNING]
> You may notice a small degradation in generation throughput compared to [`DynamicCache`] depending on your model and generation choices (context size, number of generated tokens, number of beams, etc.).
Enable [`OffloadedCache`] by configuring `cache_implementation="offloaded"` in either [`GenerationConfig`] or [`~GenerationMixin.generate`].
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
ckpt = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded")
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
```
The example below shows how you can fallback on [`OffloadedCache`] if you run out of memory.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
def resilient_generate(model, *args, **kwargs):
    oom = False
    try:
        return model.generate(*args, **kwargs)
    except torch.cuda.OutOfMemoryError as e:
        print(e)
        print("retrying with cache_implementation='offloaded'")
        oom = True
    if oom:
        torch.cuda.empty_cache()
        kwargs["cache_implementation"] = "offloaded"
        return model.generate(*args, **kwargs)
ckpt = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
prompt = ["okay "*1000 + "Fun fact: The most"]
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
beams = { "num_beams": 40, "num_beam_groups": 40, "num_return_sequences": 40, "diversity_penalty": 1.0, "max_new_tokens": 23, "early_stopping": True, }
out = resilient_generate(model, **inputs, **beams)
responses = tokenizer.batch_decode(out[:,-28:], skip_special_tokens=True)
```
### Quantized cache
The [`QuantizedCache`] reduces memory requirements by quantizing the KV values to a lower precision. [`QuantizedCache`] currently supports two quantization backends.
- [`HQQQuantizedCache`] supports int2, int4, and int8 datatypes.
- [`QuantoQuantizedCache`] supports int2 and int4 datatypes. This is the default quantization backend.
> [!WARNING]
> Quantizing the cache can harm latency if the context length is short and there is enough GPU memory available for generation without enabling cache quantization. Try to find a balance between memory efficiency and latency.
Enable [`QuantizedCache`] by configuring `cache_implementation="quantized"` in [`GenerationConfig`], and the quantization backend, as well as any additional quantization related parameters should also be passed either as a dict. You should use the default values for these additional parameters unless you're running out-of-memory. In that case, consider decreasing the residual length.
<hfoptions id="quantized-cache">
<hfoption id="HQQQuantizedCache">
For [`HQQQuantizedCache`], we recommend setting the `axis-key` and `axis-value` parameters to `1`.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, HQQQuantizedCache
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"backend": "HQQ"})
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
I like rock music because it's loud and energetic. It's a great way to express myself and rel
```
</hfoption>
<hfoption id="Quanto">
For [`QuantoQuantizedCache`], we recommend setting the `axis-key` and `axis-value` parameters to `0`.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoQuantizedCache
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"})
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
I like rock music because it's loud and energetic. It's a great way to express myself and rel
```
</hfoption>
</hfoptions>
## Speed optimized caches
The default [`DynamicCache`] prevents you from taking advantage of just-in-time (JIT) optimizations because the cache size isn't fixed. JIT optimizations enable you to minimize latency at the expense of memory usage. All of the following cache types are compatible with JIT optimizations like [torch.compile](./llm_optims#static-kv-cache-and-torchcompile) to accelerate generation.
### Static cache
A [`StaticCache`] pre-allocates a specific maximum cache size for the kv pairs. You can generate up to the maximum cache size without needing to modify it.
Enable [`StaticCache`] by configuring `cache_implementation="static"` in [`~GenerationMixin.generate`].
You can enable [`StaticCache`] by configuring `cache_implementation="static"` in [`~GenerationMixin.generate`]. This will also turn on automatic `compilation` of the decoding stage for greedy and sample decoding strategies.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", dtype=torch.float16, device_map="auto")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="static")
@ -198,66 +88,126 @@ tokenizer.batch_decode(out, skip_special_tokens=True)[0]
"Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of"
```
### Offloaded static cache
## Cache offloading
The [`OffloadedStaticCache`] is very similar to the [OffloadedCache](#offloaded-cache) except the cache size is set to a maximum cache size. Otherwise, [`OffloadedStaticCache`] only keeps the current layer cache on the GPU and the rest are moved to the CPU.
The KV cache can occupy a significant portion of memory and become a [bottleneck](https://hf.co/blog/llama31#inference-memory-requirements) for long-context generation. Memory efficient caches focus on trading off speed for reduced memory usage. This is especially important for large language models (LLMs) and if your hardware is memory constrained.
Enable [`OffloadedStaticCache`] by configuring `cache_implementation="offloaded_static"` in [`~GenerationMixin.generate`].
Offloading the cache saves GPU memory by moving the KV cache for all model layers except one to the CPU. Only the current layer cache is maintained on the GPU during a model's `forward` iteration over the layers. It asynchronously prefetches the next layer's cache, and sends the current layer's cache back to the CPU after attention computation.
You may want to consider offloading if you have a small GPU and you're getting out-of-memory (OOM) errors.
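A rough estimate shows why the cache becomes a bottleneck and how much a single layer accounts for. The sketch below counts only the raw key and value tensors; `kv_cache_bytes` is a hypothetical helper, and the shape values (32 layers, 32 KV heads, head dimension 128, float16) are assumptions chosen to resemble a 7B Llama-style model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, float16 (2 bytes per value).
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
per_layer = total // 32

print(f"full cache: {total / 2**30:.2f} GiB")   # full cache: 2.00 GiB
print(f"one layer:  {per_layer / 2**20:.0f} MiB")  # one layer:  64 MiB
```

Under these assumptions, keeping a single layer's cache on the GPU (plus the one being prefetched) takes on the order of a hundred MiB instead of 2 GiB for a 4096-token context.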
> [!WARNING]
> You may notice a small degradation in generation throughput compared to a full on-device cache, depending on your model and generation choices (context size, number of generated tokens, number of beams, etc.). This is because moving the key/value states back and forth requires some work.
Offloading is available for both [`DynamicCache`] and [`StaticCache`]. You can enable it by configuring `cache_implementation="offloaded"` for the dynamic version, or `cache_implementation="offloaded_static"` for the static version, in either [`GenerationConfig`] or [`~GenerationMixin.generate`].
Additionally, you can also instantiate your own [`DynamicCache`] or [`StaticCache`] with the `offloading=True` option, and pass this cache in `generate` or your model's `forward` (for example, `past_key_values=DynamicCache(config=model.config, offloading=True)` for a dynamic cache).
Note that the two [`Cache`] classes mentioned above have an additional option when instantiated directly, `offload_only_non_sliding`.
This additional argument decides whether the layers using sliding window/chunk attention (if any) are offloaded as well. Since
these layers are usually short anyway, it may be better to avoid offloading them, as offloading may incur a speed penalty. By default, this option is `False` for [`DynamicCache`], and `True` for [`StaticCache`].
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
ckpt = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, dtype=torch.float16, device_map="auto")
inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded")
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
```
The example below shows how you can fallback to an offloaded cache if you run out of memory:
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, infer_device
def resilient_generate(model, *args, **kwargs):
    oom = False
    device = infer_device()
    torch_device_module = getattr(torch, device, torch.cuda)
    try:
        return model.generate(*args, **kwargs)
    except torch.OutOfMemoryError as e:
        print(e)
        print("retrying with cache_implementation='offloaded'")
        oom = True
    if oom:
        torch_device_module.empty_cache()
        kwargs["cache_implementation"] = "offloaded"
        return model.generate(*args, **kwargs)
ckpt = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, dtype=torch.float16, device_map="auto")
prompt = ["okay "*1000 + "Fun fact: The most"]
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
beams = { "num_beams": 40, "num_beam_groups": 40, "num_return_sequences": 40, "diversity_penalty": 1.0, "max_new_tokens": 23, "early_stopping": True, }
out = resilient_generate(model, **inputs, **beams)
responses = tokenizer.batch_decode(out[:,-28:], skip_special_tokens=True)
```
## Quantized cache
The [`QuantizedCache`] reduces memory requirements by quantizing the KV values to a lower precision. [`QuantizedCache`] currently supports two quantization backends:
- `hqq` supports int2, int4, and int8 datatypes.
- `quanto` supports int2 and int4 datatypes. This is the default quantization backend.
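To build intuition for what storing values at lower precision involves, here is a toy round trip through 4-bit integer codes. It assumes a simple min/max affine scheme over one group of values, which is not necessarily what the `hqq` or `quanto` backends do; `quantize` and `dequantize` are hypothetical helpers.

```python
def quantize(values, nbits):
    """Map floats to integer codes in [0, 2**nbits - 1] using a min/max affine scheme."""
    low, high = min(values), max(values)
    levels = (1 << nbits) - 1
    scale = (high - low) / levels if high > low else 1.0
    codes = [round((v - low) / scale) for v in values]
    return codes, scale, low


def dequantize(codes, scale, low):
    """Recover approximate floats from the integer codes."""
    return [code * scale + low for code in codes]


values = [-1.0, -0.25, 0.1, 0.5, 1.0]
codes, scale, low = quantize(values, nbits=4)
restored = dequantize(codes, scale, low)
max_error = max(abs(a - b) for a, b in zip(values, restored))

print(codes)
print(f"max round-trip error: {max_error:.4f} (at most scale / 2 = {scale / 2:.4f})")
```

Each value is stored in 4 bits instead of 16, at the cost of a bounded rounding error that grows with the range of the group being quantized.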
> [!WARNING]
> Quantizing the cache can harm latency if the context length is short and there is enough GPU memory available for generation without enabling cache quantization. Try to find a balance between memory efficiency and latency.
Enable [`QuantizedCache`] by configuring `cache_implementation="quantized"` in [`GenerationConfig`]. The quantization backend, as well as any additional quantization-related parameters, should also be passed as a dict. You should use the default values for these additional parameters unless you're running out of memory. In that case, consider decreasing the residual length.
<hfoptions id="quantized-cache">
For the `hqq` backend, we recommend setting the `axis-key` and `axis-value` parameters to `1`.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantizedCache
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", dtype=torch.float16, device_map="auto")
inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"backend": "hqq"})
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
I like rock music because it's loud and energetic. It's a great way to express myself and rel
```
For the `quanto` backend, we recommend setting the `axis-key` and `axis-value` parameters to `0`.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map={"": 0})
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", dtype=torch.float16, device_map="auto")
inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="offloaded_static")
tokenizer.batch_decode(out, skip_special_tokens=True)[0]
"Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of"
```
Cache offloading requires a CUDA GPU or Intel XPU.
### Sliding window cache
[`SlidingWindowCache`] implements a sliding window over the previous kv pairs, and only keeps the last `sliding_window` tokens. This cache type is designed to only work with models that support *sliding window attention*, such as [Mistral](./model_doc/mistral). Older kv states are discarded and replaced by new kv states.
Enable [`SlidingWindowCache`] by configuring `cache_implementation="sliding_window"` in [`~GenerationMixin.generate`].
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("Yesterday I was on a rock concert and.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=30, cache_implementation="sliding_window")
tokenizer.batch_decode(out, skip_special_tokens=True)[0]
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"})
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
I like rock music because it's loud and energetic. It's a great way to express myself and rel
```
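To illustrate the sliding-window behavior described for [`SlidingWindowCache`], here is a minimal, library-independent sketch. It is not the library's implementation: `ToySlidingWindowCache` and its `update` method are hypothetical names, and plain strings stand in for key/value tensors.

```py
from collections import deque


class ToySlidingWindowCache:
    """Hypothetical toy cache that keeps only the last `sliding_window` kv pairs."""

    def __init__(self, sliding_window: int):
        # deque(maxlen=...) silently drops the oldest entry once it is full
        self.keys = deque(maxlen=sliding_window)
        self.values = deque(maxlen=sliding_window)

    def update(self, key, value):
        # Append the newest kv pair and return the current window contents
        self.keys.append(key)
        self.values.append(value)
        return list(self.keys), list(self.values)


cache = ToySlidingWindowCache(sliding_window=3)
for step in range(5):
    keys, values = cache.update(f"k{step}", f"v{step}")

print(keys)    # ['k2', 'k3', 'k4']
print(values)  # ['v2', 'v3', 'v4']
```

After five updates with a window of three, the two oldest pairs have been discarded, which mirrors the "older kv states are discarded and replaced by new kv states" description.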
## Model caches
Some model types, like encoder-decoder models or [Gemma2](./model_doc/gemma2) and [Mamba](./model_doc/mamba), have dedicated cache classes.
### Encoder-decoder cache
## Encoder-decoder cache
[`EncoderDecoderCache`] is designed for encoder-decoder models. It manages both the self-attention and cross-attention caches to ensure storage and retrieval of previous kv pairs. It is possible to individually set a different cache type for the encoder and decoder.
This cache type doesn't require any setup. It can be used when calling [`~GenerationMixin.generate`] or a models `forward` method.
This cache type doesn't require any setup. It is a simple wrapper around two [`Cache`] objects, as described above, which the model uses independently.
> [!TIP]
> The [`EncoderDecoderCache`] currently only supports [Whisper](./model_doc/whisper).
### Model-specific caches
## Model-specific caches
Some models have a unique way of storing past kv pairs or states that is not compatible with any other cache classes.
[Gemma2](./model_doc/gemma2) requires [`HybridCache`], which uses a combination of [`SlidingWindowCache`] for sliding window attention and [`StaticCache`] for global attention under the hood.
Mamba models, such as [Mamba](./model_doc/mamba), require a specific cache because the model doesn't have an attention mechanism or kv states. Thus, they are not compatible with the above [`Cache`] classes.
[Mamba](./model_doc/mamba) requires [`MambaCache`] because the model doesn't have an attention mechanism or kv states.
## Iterative generation
# Iterative generation
A cache can also work in iterative generation settings where there is back-and-forth interaction with a model (chatbots). Like regular generation, iterative generation with a cache allows a model to efficiently handle ongoing conversations without recomputing the entire context at each step.
@ -269,21 +219,15 @@ For example, some models use special `<think> ... </think>` tokens during reason
```py
import torch
from transformers import AutoTokenizer,AutoModelForCausalLM
from transformers.cache_utils import (
DynamicCache,
StaticCache,
SlidingWindowCache,
QuantoQuantizedCache,
)
from transformers import AutoTokenizer,AutoModelForCausalLM, DynamicCache, StaticCache
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_id)
user_prompts = ["Hello, what's your name?", "Btw, yesterday I was on a rock concert."]
past_key_values = DynamicCache()
past_key_values = DynamicCache(config=model.config)
messages = []
for prompt in user_prompts:
@ -295,7 +239,7 @@ for prompt in user_prompts:
messages.append({"role": "assistant", "content": completion})
```
## Prefill a cache
## Prefill a cache (prefix caching)
In some situations, you may want to fill a [`Cache`] with kv pairs for a certain prefix prompt and reuse it to generate different sequences.
@ -307,12 +251,12 @@ import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache, StaticCache
model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map={"": 0})
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Init StaticCache with big enough max-length (1024 tokens for the below example)
# You can also init a DynamicCache, if that suits you better
prompt_cache = StaticCache(config=model.config, max_batch_size=1, max_cache_len=1024, device=model.device.type, dtype=torch.bfloat16)
prompt_cache = StaticCache(config=model.config, max_cache_len=1024)
INITIAL_PROMPT = "You are a helpful assistant. "
inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to(model.device.type)

@ -53,7 +53,7 @@ import os
os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", dtype="auto", device_map="auto")
model.generation_config.cache_implementation = "static"
@ -83,7 +83,7 @@ import os
os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", dtype="auto", device_map="auto")
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
input_text = "The theory of special relativity states "
@ -93,11 +93,8 @@ model.generation_config.max_new_tokens = 16
past_key_values = StaticCache(
config=model.config,
max_batch_size=1,
# If you plan to reuse the cache, make sure the cache length is large enough for all cases
max_cache_len=prompt_length+(model.generation_config.max_new_tokens*2),
device=model.device,
dtype=model.dtype
)
outputs = model.generate(**input_ids, past_key_values=past_key_values)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
@ -117,10 +114,9 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
Another option for using [`StaticCache`] is to pass it to a model's forward pass using the same `past_key_values` argument. This allows you to write your own custom decoding function to decode the next token given the current token, position, and cache position of previously generated tokens.
```py
from transformers import LlamaTokenizer, LlamaForCausalLM, StaticCache, logging
from transformers import LlamaTokenizer, LlamaForCausalLM, StaticCache, logging, infer_device
from transformers.testing_utils import CaptureLogger
import torch
from accelerate.test_utils.testing import get_backend
prompts = [
"Simply put, the theory of relativity states that ",
@ -128,7 +124,7 @@ prompts = [
]
NUM_TOKENS_TO_GENERATE = 40
torch_device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
torch_device = infer_device()
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", pad_token="</s>", padding_side="right")
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="sequential")
@ -159,7 +155,7 @@ from torch.nn.attention import SDPBackend, sdpa_kernel
batch_size, seq_length = inputs["input_ids"].shape
with torch.no_grad():
past_key_values = StaticCache(
config=model.config, max_batch_size=2, max_cache_len=4096, device=torch_device, dtype=model.dtype
config=model.config, max_cache_len=4096
)
cache_position = torch.arange(seq_length, device=torch_device)
generated_ids = torch.zeros(
@ -199,7 +195,7 @@ import os
os.environ["TOKENIZERS_PARALLELISM"] = "false" # To prevent long warnings :)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", dtype="auto", device_map="auto")
model.generate = torch.compile(model.generate, mode="reduce-overhead", fullgraph=True)
input_text = "The theory of special relativity states "
@ -242,16 +238,15 @@ Enable speculative decoding by loading an assistant model and passing it to [`~G
<hfoption id="greedy search">
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
import torch
from accelerate.test_utils.testing import get_backend
device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
device = infer_device()
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", dtype="auto").to(device)
assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
@ -264,16 +259,15 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True)
For speculative sampling decoding, add the [do_sample](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.do_sample) and [temperature](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.temperature) parameters to [`~GenerationMixin.generate`].
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
import torch
from accelerate.test_utils.testing import get_backend
device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
device = infer_device()
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", dtype="auto").to(device)
assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.7)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
@ -293,16 +287,15 @@ To enable prompt lookup decoding, specify the number of tokens that should be ov
<hfoption id="greedy decoding">
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
import torch
from accelerate.test_utils.testing import get_backend
device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
device = infer_device()
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", dtype="auto").to(device)
assistant_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
outputs = model.generate(**inputs, prompt_lookup_num_tokens=3)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
@ -315,16 +308,15 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
For prompt lookup decoding with sampling, add the [do_sample](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.do_sample) and [temperature](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.temperature) parameters to [`~GenerationMixin.generate`].
```py
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
import torch
from accelerate.test_utils.testing import get_backend
device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
device = infer_device()
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("The second law of thermodynamics states", return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype="auto").to(device)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", dtype="auto").to(device)
outputs = model.generate(**inputs, prompt_lookup_num_tokens=3, do_sample=True, temperature=0.7)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["The second law of thermodynamics states that energy cannot be created nor destroyed. It's not a"]
@ -350,7 +342,7 @@ quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2b",
quantization_config=quant_config,
torch_dtype=torch.bfloat16,
dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
@ -358,7 +350,7 @@ model = AutoModelForCausalLM.from_pretrained(
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2b",
quantization_config=quant_config,
torch_dtype=torch.bfloat16
dtype=torch.bfloat16
)
model.set_attention_implementation("flash_attention_2")
```
@ -379,7 +371,7 @@ from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2b",
torch_dtype=torch.bfloat16,
dtype=torch.bfloat16,
)
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
@ -404,14 +396,14 @@ Use the Model Memory Calculator below to estimate and compare how much memory is
height="450"
></iframe>
To load a model in half-precision, set the [torch_dtype](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.PreTrainedModel.from_pretrained.torch_dtype) parameter in [`~transformers.AutoModelForCausalLM.from_pretrained`] to `torch.bfloat16`. This requires 13.74GB of memory.
To load a model in half-precision, set the [dtype](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.PreTrainedModel.from_pretrained.dtype) parameter in [`~transformers.AutoModelForCausalLM.from_pretrained`] to `torch.bfloat16`. This requires 13.74GB of memory.
```py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto",
"mistralai/Mistral-7B-v0.1", dtype=torch.bfloat16, device_map="auto",
)
```
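As a rough cross-check of figures like the one above, weight memory can be approximated as the parameter count multiplied by the bytes per parameter. The sketch below is only an estimate under stated assumptions: `estimate_weight_memory_gb` is a hypothetical helper, the round 7 billion parameter count is an assumption rather than the exact size of any checkpoint, and activations, the kv cache, and framework overhead are ignored.

```py
def estimate_weight_memory_gb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Hypothetical helper: memory for the weights only, in decimal gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9


num_parameters = 7_000_000_000  # assumption: round figure for a "7B" model

fp32_gb = estimate_weight_memory_gb(num_parameters, 4)  # float32 uses 4 bytes
bf16_gb = estimate_weight_memory_gb(num_parameters, 2)  # bfloat16 uses 2 bytes

print(fp32_gb, bf16_gb)  # 28.0 14.0
```

The estimate lands in the same range as the calculator's figure. The exact value depends on the true parameter count and on whether GB or GiB is used.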

@ -56,7 +56,7 @@ Tokenize your input, and set the [`~PreTrainedTokenizer.padding_side`] parameter
```py
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to(model.device)
```
Pass the inputs to [`~GenerationMixin.generate`] to generate tokens, and [`~PreTrainedTokenizer.batch_decode`] the generated tokens back to text.
@ -148,9 +148,9 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
| Option name | Type | Simplified description |
|---|---|---|
| `max_new_tokens` | `int` | Controls the maximum generation length. Be sure to define it, as it usually defaults to a small value. |
| `do_sample` | `bool` | Defines whether generation will sample the next token (`True`), or is greedy instead (`False`). Most use cases should set this flag to `True`. Check [this guide](./generation_strategies.md) for more information. |
| `do_sample` | `bool` | Defines whether generation will sample the next token (`True`), or is greedy instead (`False`). Most use cases should set this flag to `True`. Check [this guide](./generation_strategies) for more information. |
| `temperature` | `float` | How unpredictable the next selected token will be. High values (`>0.8`) are good for creative tasks, low values (e.g. `<0.4`) for tasks that require "thinking". Requires `do_sample=True`. |
| `num_beams` | `int` | When set to `>1`, activates the beam search algorithm. Beam search is good on input-grounded tasks. Check [this guide](./generation_strategies.md) for more information. |
| `num_beams` | `int` | When set to `>1`, activates the beam search algorithm. Beam search is good on input-grounded tasks. Check [this guide](./generation_strategies) for more information. |
| `repetition_penalty` | `float` | Set it to `>1.0` if you're seeing the model repeat itself often. Larger values apply a larger penalty. |
| `eos_token_id` | `list[int]` | The token(s) that will cause generation to stop. The default value is usually good, but you can specify a different token. |
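
To make the `temperature` row more concrete, the sketch below rescales a made-up set of logits before applying a softmax. This is a simplified illustration and not the code path used by [`~GenerationMixin.generate`]; the function name and the logits are hypothetical.

```py
import math


def softmax_with_temperature(logits, temperature):
    """Hypothetical helper: divide the logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.4)
flat = softmax_with_temperature(logits, temperature=1.5)

# A lower temperature concentrates probability on the top-scoring token
print(round(sharp[0], 3), round(flat[0], 3))
```

With the low temperature, most of the probability mass sits on the first token, so sampling is more predictable. With the high temperature, the distribution is flatter and sampling is more varied.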
@ -164,7 +164,7 @@ The section below covers some common issues you may encounter during text genera
[`~GenerationMixin.generate`] returns up to 20 tokens by default unless otherwise specified in a model's [`GenerationConfig`]. It is highly recommended to manually set the number of generated tokens with the [`max_new_tokens`] parameter to control the output length. [Decoder-only](https://hf.co/learn/nlp-course/chapter1/6?fw=pt) models return the initial prompt along with the generated tokens.
```py
model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda")
model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to(model.device)
```
<hfoptions id="output-length">
@ -195,7 +195,7 @@ The default decoding strategy in [`~GenerationMixin.generate`] is *greedy search
For example, enable a [multinomial sampling](./generation_strategies#multinomial-sampling) strategy to generate more diverse outputs. Refer to the [Generation strategy](./generation_strategies) guide for more decoding strategies.
```py
model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda")
model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to(model.device)
```
<hfoptions id="decoding">
@ -227,7 +227,7 @@ Inputs need to be padded if they don't have the same length. But LLMs aren't tra
```py
model_inputs = tokenizer(
["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
).to(model.device)
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'1, 2, 33333333333'
@ -241,7 +241,7 @@ tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_s
tokenizer.pad_token = tokenizer.eos_token
model_inputs = tokenizer(
["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
).to(model.device)
generated_ids = model.generate(**model_inputs)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
'1, 2, 3, 4, 5, 6,'
@ -270,7 +270,7 @@ model = AutoModelForCausalLM.from_pretrained(
```py
prompt = """How many cats does it take to change a light bulb? Reply as a pirate."""
model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
input_length = model_inputs.input_ids.shape[1]
generated_ids = model.generate(**model_inputs, max_new_tokens=50)
print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
@ -288,7 +288,7 @@ messages = [
},
{"role": "user", "content": "How many cats does it take to change a light bulb?"},
]
model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
input_length = model_inputs.shape[1]
generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=50)
print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])

@ -23,11 +23,11 @@ The crux of these challenges lies in augmenting the computational and memory cap
In this guide, we will go over the effective techniques for efficient LLM deployment:
1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization.md) can achieve computational advantages without a considerable decline in model performance.
1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization) can achieve computational advantages without a considerable decline in model performance.
2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.
3. **Architectural Innovations:** Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancement in model architectures hereby are [Alibi](https://huggingface.co/papers/2108.12409), [Rotary embeddings](https://huggingface.co/papers/2104.09864), [Multi-Query Attention (MQA)](https://huggingface.co/papers/1911.02150) and [Grouped-Query-Attention (GQA)]((https://huggingface.co/papers/2305.13245)).
3. **Architectural Innovations:** Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancements in model architectures are [Alibi](https://huggingface.co/papers/2108.12409), [Rotary embeddings](https://huggingface.co/papers/2104.09864), [Multi-Query Attention (MQA)](https://huggingface.co/papers/1911.02150) and [Grouped-Query-Attention (GQA)](https://huggingface.co/papers/2305.13245).
Throughout this guide, we will offer an analysis of auto-regressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. While doing so, we run practical examples showcasing each of the feature improvements.
@ -84,7 +84,7 @@ We first load the model and tokenizer and then pass both to Transformers' [pipel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto", pad_token_id=0)
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", dtype=torch.bfloat16, device_map="auto", pad_token_id=0)
tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
@ -125,7 +125,7 @@ Note that if we had tried to run the model in full float32 precision, a whopping
> Almost all models are trained in bfloat16 nowadays, so there is no reason to run the model in full float32 precision if [your GPU supports bfloat16](https://discuss.pytorch.org/t/bfloat16-native-support/117155/5). Float32 won't give better inference results than the precision that was used to train the model.
If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"torch_dtype"`, *e.g.* [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., torch_dtype=...)` except when the original type is float32 in which case one can use both `float16` or `bfloat16` for inference.
If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"dtype"`, *e.g.* [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., dtype=...)`, except when the original type is float32, in which case one can use either `float16` or `bfloat16` for inference.
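The lookup described above can be sketched without downloading anything. The snippet below parses a config given as JSON text and returns the stored precision. Everything in it is illustrative: `stored_dtype` is a hypothetical helper, the two inline configs are made up, and falling back from `"dtype"` to the older `"torch_dtype"` key is an assumption about how older checkpoints are written.

```python
import json


def stored_dtype(config_text: str):
    """Hypothetical helper: return the precision recorded in a config, if any."""
    config = json.loads(config_text)
    # Assumption: newer configs use "dtype", older ones use "torch_dtype"
    return config.get("dtype", config.get("torch_dtype"))


old_style = '{"model_type": "llama", "torch_dtype": "float16"}'  # made-up config
new_style = '{"model_type": "llama", "dtype": "bfloat16"}'       # made-up config

print(stored_dtype(old_style))  # float16
print(stored_dtype(new_style))  # bfloat16
```

The returned string can then guide the choice of the `dtype` argument passed to `from_pretrained`.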
Let's define a `flush(...)` function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.
@ -394,7 +394,7 @@ long_prompt = 10 * system_prompt + prompt
We instantiate our model again in bfloat16 precision.
```python
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto")
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

@ -17,9 +17,8 @@ rendered properly in your Markdown viewer.
# Callbacks
Callbacks are objects that can customize the behavior of the training loop in the PyTorch
[`Trainer`] (this feature is not yet implemented in TensorFlow) that can inspect the training loop
state (for progress reporting, logging on TensorBoard or other ML platforms...) and take decisions (like early
stopping).
[`Trainer`] that can inspect the training loop state (for progress reporting, logging on TensorBoard or other ML
platforms...) and take decisions (like early stopping).
Callbacks are "read only" pieces of code: apart from the [`TrainerControl`] object they return, they
cannot change anything in the training loop. For customizations that require changes in the training loop, you should
@ -48,7 +47,7 @@ By default, `TrainingArguments.report_to` is set to `"all"`, so a [`Trainer`] wi
- [`~integrations.DVCLiveCallback`] if [dvclive](https://dvc.org/doc/dvclive) is installed.
- [`~integrations.SwanLabCallback`] if [swanlab](http://swanlab.cn/) is installed.
If a package is installed but you don't wish to use the accompanying integration, you can change `TrainingArguments.report_to` to a list of just those integrations you want to use (e.g. `["azure_ml", "wandb"]`).
If a package is installed but you don't wish to use the accompanying integration, you can change `TrainingArguments.report_to` to a list of just those integrations you want to use (e.g. `["azure_ml", "wandb"]`).
The main class that implements callbacks is [`TrainerCallback`]. It gets the
[`TrainingArguments`] used to instantiate the [`Trainer`], can access that

@ -50,21 +50,18 @@ Examples of use can be found in the [example scripts](../examples) or [example n
[[autodoc]] data.data_collator.DataCollatorForLanguageModeling
- numpy_mask_tokens
- tf_mask_tokens
- torch_mask_tokens
## DataCollatorForWholeWordMask
[[autodoc]] data.data_collator.DataCollatorForWholeWordMask
- numpy_mask_tokens
- tf_mask_tokens
- torch_mask_tokens
## DataCollatorForPermutationLanguageModeling
[[autodoc]] data.data_collator.DataCollatorForPermutationLanguageModeling
- numpy_mask_tokens
- tf_mask_tokens
- torch_mask_tokens
## DataCollatorWithFlattening

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Feature Extractor
A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction from sequences, e.g., pre-processing audio files to generate Log-Mel Spectrogram features, feature extraction from images, e.g., cropping image files, but also padding, normalization, and conversion to NumPy, PyTorch, and TensorFlow tensors.
A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction from sequences, e.g., pre-processing audio files to generate Log-Mel Spectrogram features, feature extraction from images, e.g., cropping image files, but also padding, normalization, and conversion to NumPy and PyTorch tensors.
## FeatureExtractionMixin

@ -16,8 +16,7 @@ rendered properly in your Markdown viewer.
# Image Processor
An image processor is in charge of preparing input features for vision models and post processing their outputs. This includes transformations such as resizing, normalization, and conversion to PyTorch, TensorFlow, Flax and Numpy tensors. It may also include model specific post-processing such as converting logits to segmentation masks.
An image processor is in charge of loading images (optionally), preparing input features for vision models, and post-processing their outputs. This includes transformations such as resizing, normalization, and conversion to PyTorch and NumPy tensors. It may also include model-specific post-processing such as converting logits to segmentation masks.
Fast image processors are available for a few models and more will be added in the future. They are based on the [torchvision](https://pytorch.org/vision/stable/index.html) library and provide a significant speed-up, especially when processing on GPU.
They have the same API as the base image processors and can be used as drop-in replacements.
To use a fast image processor, you need to install the `torchvision` library, and set the `use_fast` argument to `True` when instantiating the image processor:

@ -16,22 +16,15 @@ rendered properly in your Markdown viewer.
# Models
The base classes [`PreTrainedModel`], [`TFPreTrainedModel`], and
[`FlaxPreTrainedModel`] implement the common methods for loading/saving a model either from a local
file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's AWS
S3 repository).
The base class [`PreTrainedModel`] implements the common methods for loading/saving a model either from a local
file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace's Hub).
[`PreTrainedModel`] and [`TFPreTrainedModel`] also implement a few methods which
are common among all the models to:
[`PreTrainedModel`] also implements a few methods which are common among all the models to:
- resize the input token embeddings when new tokens are added to the vocabulary
- prune the attention heads of the model.
The other methods that are common to each model are defined in [`~modeling_utils.ModuleUtilsMixin`]
(for the PyTorch models) and [`~modeling_tf_utils.TFModuleUtilsMixin`] (for the TensorFlow models) or
for text generation, [`~generation.GenerationMixin`] (for the PyTorch models),
[`~generation.TFGenerationMixin`] (for the TensorFlow models) and
[`~generation.FlaxGenerationMixin`] (for the Flax/JAX models).
The other methods that are common to each model are defined in [`~modeling_utils.ModuleUtilsMixin`] and [`~generation.GenerationMixin`].
## PreTrainedModel
@ -48,22 +41,6 @@ set this to `False`.
[[autodoc]] modeling_utils.ModuleUtilsMixin
## TFPreTrainedModel
[[autodoc]] TFPreTrainedModel
- push_to_hub
- all
## TFModelUtilsMixin
[[autodoc]] modeling_tf_utils.TFModelUtilsMixin
## FlaxPreTrainedModel
[[autodoc]] FlaxPreTrainedModel
- push_to_hub
- all
## Pushing to the Hub
[[autodoc]] utils.PushToHubMixin


@ -23,19 +23,13 @@ The `.optimization` module provides:
- a gradient accumulation class to accumulate the gradients of multiple batches
## AdaFactor (PyTorch)
## AdaFactor
[[autodoc]] Adafactor
## AdamWeightDecay (TensorFlow)
[[autodoc]] AdamWeightDecay
[[autodoc]] create_optimizer
## Schedules
### Learning Rate Schedules (PyTorch)
### Learning Rate Schedules
[[autodoc]] SchedulerType
@ -64,13 +58,3 @@ The `.optimization` module provides:
[[autodoc]] get_inverse_sqrt_schedule
[[autodoc]] get_wsd_schedule
### Warmup (TensorFlow)
[[autodoc]] WarmUp
## Gradient Strategies
### GradientAccumulator (TensorFlow)
[[autodoc]] GradientAccumulator


@ -187,135 +187,3 @@ documented on their corresponding model page.
## SampleTSPredictionOutput
[[autodoc]] modeling_outputs.SampleTSPredictionOutput
## TFBaseModelOutput
[[autodoc]] modeling_tf_outputs.TFBaseModelOutput
## TFBaseModelOutputWithPooling
[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPooling
## TFBaseModelOutputWithPoolingAndCrossAttentions
[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPoolingAndCrossAttentions
## TFBaseModelOutputWithPast
[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPast
## TFBaseModelOutputWithPastAndCrossAttentions
[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPastAndCrossAttentions
## TFSeq2SeqModelOutput
[[autodoc]] modeling_tf_outputs.TFSeq2SeqModelOutput
## TFCausalLMOutput
[[autodoc]] modeling_tf_outputs.TFCausalLMOutput
## TFCausalLMOutputWithCrossAttentions
[[autodoc]] modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions
## TFCausalLMOutputWithPast
[[autodoc]] modeling_tf_outputs.TFCausalLMOutputWithPast
## TFMaskedLMOutput
[[autodoc]] modeling_tf_outputs.TFMaskedLMOutput
## TFSeq2SeqLMOutput
[[autodoc]] modeling_tf_outputs.TFSeq2SeqLMOutput
## TFNextSentencePredictorOutput
[[autodoc]] modeling_tf_outputs.TFNextSentencePredictorOutput
## TFSequenceClassifierOutput
[[autodoc]] modeling_tf_outputs.TFSequenceClassifierOutput
## TFSeq2SeqSequenceClassifierOutput
[[autodoc]] modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput
## TFMultipleChoiceModelOutput
[[autodoc]] modeling_tf_outputs.TFMultipleChoiceModelOutput
## TFTokenClassifierOutput
[[autodoc]] modeling_tf_outputs.TFTokenClassifierOutput
## TFQuestionAnsweringModelOutput
[[autodoc]] modeling_tf_outputs.TFQuestionAnsweringModelOutput
## TFSeq2SeqQuestionAnsweringModelOutput
[[autodoc]] modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput
## FlaxBaseModelOutput
[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutput
## FlaxBaseModelOutputWithPast
[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPast
## FlaxBaseModelOutputWithPooling
[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPooling
## FlaxBaseModelOutputWithPastAndCrossAttentions
[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions
## FlaxSeq2SeqModelOutput
[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqModelOutput
## FlaxCausalLMOutputWithCrossAttentions
[[autodoc]] modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions
## FlaxMaskedLMOutput
[[autodoc]] modeling_flax_outputs.FlaxMaskedLMOutput
## FlaxSeq2SeqLMOutput
[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqLMOutput
## FlaxNextSentencePredictorOutput
[[autodoc]] modeling_flax_outputs.FlaxNextSentencePredictorOutput
## FlaxSequenceClassifierOutput
[[autodoc]] modeling_flax_outputs.FlaxSequenceClassifierOutput
## FlaxSeq2SeqSequenceClassifierOutput
[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput
## FlaxMultipleChoiceModelOutput
[[autodoc]] modeling_flax_outputs.FlaxMultipleChoiceModelOutput
## FlaxTokenClassifierOutput
[[autodoc]] modeling_flax_outputs.FlaxTokenClassifierOutput
## FlaxQuestionAnsweringModelOutput
[[autodoc]] modeling_flax_outputs.FlaxQuestionAnsweringModelOutput
## FlaxSeq2SeqQuestionAnsweringModelOutput
[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput


@ -273,7 +273,7 @@ independently of the inputs. The caveats from the previous section still apply.
## Pipeline FP16 inference
Models can be run in FP16 which can be significantly faster on GPU while saving memory. Most models will not suffer noticeable performance loss from this. The larger the model, the less likely that it will.
To enable FP16 inference, you can simply pass `torch_dtype=torch.float16` or `torch_dtype='float16'` to the pipeline constructor. Note that this only works for models with a PyTorch backend. Your inputs will be converted to FP16 internally.
To enable FP16 inference, you can simply pass `dtype=torch.float16` or `dtype='float16'` to the pipeline constructor. Note that this only works for models with a PyTorch backend. Your inputs will be converted to FP16 internally.
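Older snippets pass `torch_dtype`, while the updated text uses `dtype`. For call sites that still assemble keyword arguments under the old name, a small shim can translate them before they reach the pipeline constructor. This is only a minimal sketch: `migrate_dtype_kwarg` is a hypothetical helper, not part of Transformers, and it assumes that an explicitly supplied `dtype` should take precedence over a legacy `torch_dtype`.

```python
def migrate_dtype_kwarg(kwargs):
    """Return a copy of kwargs with a legacy `torch_dtype` key renamed to `dtype`.

    Hypothetical helper for illustration; not a Transformers API.
    """
    updated = dict(kwargs)  # never mutate the caller's dict
    if "torch_dtype" in updated:
        legacy_value = updated.pop("torch_dtype")
        # Assumption: an explicit `dtype` wins over the legacy name.
        updated.setdefault("dtype", legacy_value)
    return updated


# Works with string dtypes such as "float16" as well as torch dtype objects.
pipeline_kwargs = migrate_dtype_kwarg({"torch_dtype": "float16", "device": 0})
print(pipeline_kwargs)
```

The resulting dictionary can then be unpacked into the constructor call, for example `pipeline(task, model=..., **pipeline_kwargs)`.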
## Pipeline custom code
@ -363,6 +363,12 @@ Pipelines available for computer vision tasks include the following.
- __call__
- all
### KeypointMatchingPipeline
[[autodoc]] KeypointMatchingPipeline
- __call__
- all
### ObjectDetectionPipeline
[[autodoc]] ObjectDetectionPipeline


@ -65,6 +65,10 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
[[autodoc]] HqqConfig
## Mxfp4Config
[[autodoc]] Mxfp4Config
## FbgemmFp8Config
[[autodoc]] FbgemmFp8Config


@ -19,12 +19,8 @@ rendered properly in your Markdown viewer.
Each framework has a generate method for text generation implemented in their respective `GenerationMixin` class:
- PyTorch [`~generation.GenerationMixin.generate`] is implemented in [`~generation.GenerationMixin`].
- TensorFlow [`~generation.TFGenerationMixin.generate`] is implemented in [`~generation.TFGenerationMixin`].
- Flax/JAX [`~generation.FlaxGenerationMixin.generate`] is implemented in [`~generation.FlaxGenerationMixin`].
Regardless of your framework of choice, you can parameterize the generate method with a [`~generation.GenerationConfig`]
class instance. Please refer to this class for the complete list of generation parameters, which control the behavior
of the generation method.
You can parameterize the generate method with a [`~generation.GenerationConfig`] class instance. Please refer to this class for the complete list of generation parameters, which control the behavior of the generation method.
To learn how to inspect a model's generation configuration, what are the defaults, how to change the parameters ad hoc,
and how to create and save a customized generation configuration, refer to the
@ -46,14 +42,3 @@ like token streaming.
[[autodoc]] GenerationMixin
- generate
- compute_transition_scores
## TFGenerationMixin
[[autodoc]] TFGenerationMixin
- generate
- compute_transition_scores
## FlaxGenerationMixin
[[autodoc]] FlaxGenerationMixin
- generate


@ -14,10 +14,9 @@ rendered properly in your Markdown viewer.
-->
# Video Processor
A **Video Processor** is a utility responsible for preparing input features for video models, as well as handling the post-processing of their outputs. It provides transformations such as resizing, normalization, and conversion into PyTorch.
A **Video Processor** is a utility responsible for preparing input features for video models, as well as handling the post-processing of their outputs. It provides transformations such as resizing, normalization, and conversion into PyTorch. Along with these transformations, the `VideoProcessor` class handles video decoding from local paths or URLs (requires [`torchcodec`](https://pypi.org/project/torchcodec/)) and frame sampling according to model-specific strategies.
The video processor extends the functionality of image processors by allowing Vision Large Language Models (VLMs) to handle videos with a distinct set of arguments compared to images. It serves as the bridge between raw video data and the model, ensuring that input features are optimized for the VLM.
@ -48,6 +47,47 @@ processor = torch.compile(processor)
processed_video = processor(video, return_tensors="pt")
```
#### Sampling behavior
The video processor can also sample video frames using the technique best suited for the given model. Sampling behavior is controlled with the `do_sample_frames` argument and can be configured through model-specific parameters such as `num_frames` or `fps` (the rate at which the video will be sampled). If the input video is given as a local path or URL (`str`), the processor will decode it automatically. To obtain metadata about the decoded video, such as sampled frame indices, original dimensions, duration, and fps, pass `return_metadata=True` to the processor.
<Tip warning={false}>
- Specifying `num_frames` does not guarantee the output will contain exactly that number of frames. Depending on the model, the sampler may enforce minimum or maximum frame limits.
- The default decoder is [`torchcodec`](https://pypi.org/project/torchcodec/), which must be installed.
</Tip>
```python
from transformers import AutoVideoProcessor
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda")
processed_video_inputs = processor(videos=["video_path.mp4"], return_metadata=True, do_sample_frames=True, return_tensors="pt")
video_metadata = processed_video_inputs["video_metadata"]
# See how many frames the original video had and what was the original FPS
print(video_metadata.total_num_frames, video_metadata.fps)
```
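To make the sampling parameters more concrete, the sketch below picks evenly spaced frame indices from either a target `num_frames` or a target sampling `fps`. It is an illustration in plain Python, not the library's implementation: `sample_frame_indices` and its clamping rule are assumptions, and each model's real sampler may apply different limits.

```python
def sample_frame_indices(total_num_frames, num_frames=None, fps=None, video_fps=None):
    """Pick evenly spaced frame indices (illustrative sketch, not the library sampler).

    num_frames: desired number of frames, or
    fps:        desired sampling rate; requires video_fps, the original frame rate.
    """
    if total_num_frames <= 0:
        raise ValueError("total_num_frames must be positive")
    if num_frames is not None and fps is not None:
        raise ValueError("pass either num_frames or fps, not both")
    if fps is not None:
        if video_fps is None:
            raise ValueError("video_fps is required when sampling by fps")
        # duration in seconds * requested rate = number of frames to keep
        num_frames = int(total_num_frames / video_fps * fps)
    if num_frames is None:
        return list(range(total_num_frames))  # no sampling requested
    # Clamp: the request cannot exceed the frames available, and at least one frame is kept.
    num_frames = max(1, min(num_frames, total_num_frames))
    step = total_num_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


print(sample_frame_indices(100, num_frames=10))
print(sample_frame_indices(100, fps=2, video_fps=25))
print(sample_frame_indices(5, num_frames=10))  # clamped to the 5 available frames
```

The last call shows why a requested `num_frames` is not always honored: a sampler has to reconcile the request with what the video actually contains.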
If you pass an already decoded video array but still want to enable model-specific frame sampling, it is strongly recommended to provide `video_metadata`. This allows the sampler to know the original video's duration and FPS. You can pass metadata as a `VideoMetadata` object or as a plain dict.
```python
import torch
from transformers import AutoVideoProcessor
from transformers.video_utils import VideoMetadata
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda")
my_decoded_video = torch.randint(0, 255, size=(100, 3, 1280, 1280))  # short video of 100 frames
video_metadata = VideoMetadata(
    total_num_frames=100,
    fps=24,
    duration=4.1,  # in seconds
)
# Pass the decoded array itself (not a path) together with its metadata
processed_video_inputs = processor(videos=my_decoded_video, video_metadata=video_metadata, do_sample_frames=True, num_frames=10, return_tensors="pt")
print(processed_video_inputs.pixel_values_videos.shape)
>>> [10, 3, 384, 384]
```
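When metadata is assembled by hand, the three fields should agree with each other, since duration is the frame count divided by the frame rate. The sketch below builds the plain-dict form from a frame count and FPS. `build_video_metadata` is a hypothetical helper, and the dict keys are assumed to mirror the `VideoMetadata` fields shown above.

```python
def build_video_metadata(total_num_frames, fps):
    """Build a metadata dict whose duration is consistent with frame count and FPS.

    Hypothetical helper; keys are assumed to mirror the VideoMetadata fields.
    """
    if total_num_frames <= 0 or fps <= 0:
        raise ValueError("total_num_frames and fps must be positive")
    return {
        "total_num_frames": total_num_frames,
        "fps": fps,
        "duration": total_num_frames / fps,  # in seconds
    }


metadata = build_video_metadata(total_num_frames=100, fps=25)
print(metadata["duration"])
```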
## BaseVideoProcessor


@ -13,12 +13,13 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2024-11-21 and added to Hugging Face Transformers on 2025-07-08.*
# AIMv2
## Overview
The AIMv2 model was proposed in [Multimodal Autoregressive Pre-training of Large Vision Encoders](https://arxiv.org/abs/2411.14402) by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
The AIMv2 model was proposed in [Multimodal Autoregressive Pre-training of Large Vision Encoders](https://huggingface.co/papers/2411.14402) by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
The abstract from the paper is the following:
@ -99,6 +100,3 @@ probs = outputs.logits_per_image.softmax(dim=-1)
[[autodoc]] Aimv2TextModel
- forward
</pt>
<tf>


@ -13,13 +13,12 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2019-09-26 and added to Hugging Face Transformers on 2020-11-16.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white" >
<img alt= "TensorFlow" src= "https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white" >
<img alt= "Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style…Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC">
<img alt="SDPA" src= "https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white" >
<img alt="SDPA" src= "https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white" >
</div>
</div>
@ -51,7 +50,7 @@ from transformers import pipeline
pipeline = pipeline(
task="fill-mask",
model="albert-base-v2",
torch_dtype=torch.float16,
dtype=torch.float16,
device=0
)
pipeline("Plants create [MASK] through a process known as photosynthesis.", top_k=5)
@ -67,7 +66,7 @@ from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AutoModelForMaskedLM.from_pretrained(
"albert/albert-base-v2",
torch_dtype=torch.float16,
dtype=torch.float16,
attn_implementation="sdpa",
device_map="auto"
)
@ -109,42 +108,30 @@ The resources provided in the following sections consist of a list of official H
- [`AlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification).
- [`TFAlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification).
- [`FlaxAlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb).
- Check the [Text classification task guide](../tasks/sequence_classification) on how to use the model.
<PipelineTag pipeline="token-classification"/>
- [`AlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification).
- [`TFAlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb).
- [`FlaxAlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification).
- [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course.
- Check the [Token classification task guide](../tasks/token_classification) on how to use the model.
<PipelineTag pipeline="fill-mask"/>
- [`AlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
- [`TFAlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb).
- [`FlaxAlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb).
- [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course.
- Check the [Masked language modeling task guide](../tasks/masked_language_modeling) on how to use the model.
<PipelineTag pipeline="question-answering"/>
- [`AlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).
- [`TFAlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb).
- [`FlaxAlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering).
- [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course.
- Check the [Question answering task guide](../tasks/question_answering) on how to use the model.
**Multiple choice**
- [`AlbertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb).
- [`TFAlbertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb).
- Check the [Multiple choice task guide](../tasks/multiple_choice) on how to use the model.
## AlbertConfig
@ -163,11 +150,6 @@ The resources provided in the following sections consist of a list of official H
[[autodoc]] models.albert.modeling_albert.AlbertForPreTrainingOutput
[[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
<frameworkcontent>
<pt>
## AlbertModel
[[autodoc]] AlbertModel - forward
@ -195,69 +177,3 @@ The resources provided in the following sections consist of a list of official H
## AlbertForQuestionAnswering
[[autodoc]] AlbertForQuestionAnswering - forward
</pt>
<tf>
## TFAlbertModel
[[autodoc]] TFAlbertModel - call
## TFAlbertForPreTraining
[[autodoc]] TFAlbertForPreTraining - call
## TFAlbertForMaskedLM
[[autodoc]] TFAlbertForMaskedLM - call
## TFAlbertForSequenceClassification
[[autodoc]] TFAlbertForSequenceClassification - call
## TFAlbertForMultipleChoice
[[autodoc]] TFAlbertForMultipleChoice - call
## TFAlbertForTokenClassification
[[autodoc]] TFAlbertForTokenClassification - call
## TFAlbertForQuestionAnswering
[[autodoc]] TFAlbertForQuestionAnswering - call
</tf>
<jax>
## FlaxAlbertModel
[[autodoc]] FlaxAlbertModel - **call**
## FlaxAlbertForPreTraining
[[autodoc]] FlaxAlbertForPreTraining - **call**
## FlaxAlbertForMaskedLM
[[autodoc]] FlaxAlbertForMaskedLM - **call**
## FlaxAlbertForSequenceClassification
[[autodoc]] FlaxAlbertForSequenceClassification - **call**
## FlaxAlbertForMultipleChoice
[[autodoc]] FlaxAlbertForMultipleChoice - **call**
## FlaxAlbertForTokenClassification
[[autodoc]] FlaxAlbertForTokenClassification - **call**
## FlaxAlbertForQuestionAnswering
[[autodoc]] FlaxAlbertForQuestionAnswering - **call**
</jax>
</frameworkcontent>


@ -13,6 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2021-02-11 and added to Hugging Face Transformers on 2023-03-01.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
@ -43,7 +44,7 @@ pipeline = pipeline(
task="zero-shot-image-classification",
model="kakaobrain/align-base",
device=0,
torch_dtype=torch.bfloat16
dtype=torch.bfloat16
)
candidate_labels = [
@ -65,18 +66,18 @@ from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
model = AutoModelForZeroShotImageClassification.from_pretrained("kakaobrain/align-base").to("cuda")
model = AutoModelForZeroShotImageClassification.from_pretrained("kakaobrain/align-base", device_map="auto")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = requests.get(url, stream=True)
inputs = Image.open(image.raw).convert("RGB")
image_inputs = processor(images=inputs, return_tensors="pt").to("cuda")
image_inputs = processor(images=inputs, return_tensors="pt").to(model.device)
with torch.no_grad():
image_embeds = model.get_image_features(**image_inputs)
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
text_inputs = processor(text=candidate_labels, padding=True, return_tensors="pt").to("cuda")
text_inputs = processor(text=candidate_labels, padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
text_embeds = model.get_text_features(**text_inputs)


@ -13,6 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-01-04.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
@ -39,7 +40,7 @@ import requests
from PIL import Image
from transformers import AltCLIPModel, AltCLIPProcessor
model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", torch_dtype=torch.bfloat16)
model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", dtype=torch.bfloat16)
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
@ -73,7 +74,7 @@ from transformers import AltCLIPModel, AltCLIPProcessor, TorchAoConfig
model = AltCLIPModel.from_pretrained(
"BAAI/AltCLIP",
quantization_config=TorchAoConfig("int4_weight_only", group_size=128),
torch_dtype=torch.bfloat16,
dtype=torch.bfloat16,
)
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")


@ -0,0 +1,100 @@
<!--Copyright 2025 The HuggingFace Team and the Swiss AI Initiative. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
</div>
</div>
# Apertus
[Apertus](https://www.swiss-ai.org) is a family of large language models from the Swiss AI Initiative.
> [!TIP]
> Coming soon
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`], and from the command line.
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
import torch
from transformers import pipeline
pipeline = pipeline(
task="text-generation",
model="swiss-ai/Apertus-8B",
dtype=torch.bfloat16,
device=0
)
pipeline("Plants create energy through a process known as")
```
</hfoption>
<hfoption id="AutoModel">
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"swiss-ai/Apertus-8B",
)
model = AutoModelForCausalLM.from_pretrained(
"swiss-ai/Apertus-8B",
dtype=torch.bfloat16,
device_map="auto",
attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers CLI">
```bash
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model swiss-ai/Apertus-8B --device 0
```
</hfoption>
</hfoptions>
## ApertusConfig
[[autodoc]] ApertusConfig
## ApertusModel
[[autodoc]] ApertusModel
- forward
## ApertusForCausalLM
[[autodoc]] ApertusForCausalLM
- forward
## ApertusForTokenClassification
[[autodoc]] ApertusForTokenClassification
- forward


@ -13,6 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2025-06-18 and added to Hugging Face Transformers on 2025-06-24.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
@ -24,7 +25,7 @@ rendered properly in your Markdown viewer.
# Arcee
Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.
[Arcee](https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model) is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.
The Arcee model is architecturally similar to Llama but uses `x * relu(x)` in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
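For non-negative inputs `relu(x)` equals `x`, and for negative inputs it is zero, so `x * relu(x)` and `relu(x) ** 2` produce the same value. The scalar sketch below checks this in plain Python. The function names are illustrative and are not taken from the Arcee implementation, which operates on tensors.

```python
def relu(x):
    """Standard ReLU on a scalar."""
    return max(x, 0.0)


def relu_squared(x):
    """ReLU² written as the square of ReLU."""
    r = relu(x)
    return r * r


def x_times_relu(x):
    """ReLU² written as x * relu(x), the form quoted in the text."""
    return x * relu(x)


for value in (-2.0, -0.5, 0.0, 0.5, 3.0):
    # Both forms agree: zero for negative inputs, x**2 otherwise.
    assert relu_squared(value) == x_times_relu(value)
    print(value, relu_squared(value))
```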
@ -43,7 +44,7 @@ from transformers import pipeline
pipeline = pipeline(
task="text-generation",
model="arcee-ai/AFM-4.5B",
torch_dtype=torch.float16,
dtype=torch.float16,
device=0
)
@ -61,7 +62,7 @@ from transformers import AutoTokenizer, ArceeForCausalLM
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/AFM-4.5B")
model = ArceeForCausalLM.from_pretrained(
"arcee-ai/AFM-4.5B",
torch_dtype=torch.float16,
dtype=torch.float16,
device_map="auto"
)


@ -13,6 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2024-10-08 and added to Hugging Face Transformers on 2024-12-06.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
@ -44,7 +45,7 @@ pipeline = pipeline(
"image-to-text",
model="rhymes-ai/Aria",
device=0,
torch_dtype=torch.bfloat16
dtype=torch.bfloat16
)
pipeline(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
@ -62,7 +63,7 @@ from transformers import AutoModelForCausalLM, AutoProcessor
model = AutoModelForCausalLM.from_pretrained(
"rhymes-ai/Aria",
device_map="auto",
torch_dtype=torch.bfloat16,
dtype=torch.bfloat16,
attn_implementation="sdpa"
)
@ -108,7 +109,7 @@ from transformers import TorchAoConfig, AutoModelForCausalLM, AutoProcessor
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
"rhymes-ai/Aria-sequential_mlp",
torch_dtype=torch.bfloat16,
dtype=torch.bfloat16,
device_map="auto",
quantization_config=quantization_config
)


@ -13,6 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2021-04-05 and added to Hugging Face Transformers on 2022-11-21.*
# Audio Spectrogram Transformer
@ -62,7 +63,7 @@ SDPA is used by default for `torch>=2.1.1` when an implementation is available,
```py
import torch
from transformers import ASTForAudioClassification
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", attn_implementation="sdpa", torch_dtype=torch.float16)
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", attn_implementation="sdpa", dtype=torch.float16)
...
```


@ -30,7 +30,7 @@ model = AutoModel.from_pretrained("google-bert/bert-base-cased")
will create a model that is an instance of [`BertModel`].
There is one class of `AutoModel` for each task, and for each backend (PyTorch, TensorFlow, or Flax).
There is one class of `AutoModel` for each task.
## Extending the Auto Classes
@ -90,14 +90,6 @@ The following auto classes are available for instantiating a base model class wi
[[autodoc]] AutoModel
### TFAutoModel
[[autodoc]] TFAutoModel
### FlaxAutoModel
[[autodoc]] FlaxAutoModel
## Generic pretraining classes
The following auto classes are available for instantiating a model with a pretraining head.
@ -106,14 +98,6 @@ The following auto classes are available for instantiating a model with a pretra
[[autodoc]] AutoModelForPreTraining
### TFAutoModelForPreTraining
[[autodoc]] TFAutoModelForPreTraining
### FlaxAutoModelForPreTraining
[[autodoc]] FlaxAutoModelForPreTraining
## Natural Language Processing
The following auto classes are available for the following natural language processing tasks.
@ -122,114 +106,42 @@ The following auto classes are available for the following natural language proc
[[autodoc]] AutoModelForCausalLM
### TFAutoModelForCausalLM
[[autodoc]] TFAutoModelForCausalLM
### FlaxAutoModelForCausalLM
[[autodoc]] FlaxAutoModelForCausalLM
### AutoModelForMaskedLM
[[autodoc]] AutoModelForMaskedLM
### TFAutoModelForMaskedLM
[[autodoc]] TFAutoModelForMaskedLM
### FlaxAutoModelForMaskedLM
[[autodoc]] FlaxAutoModelForMaskedLM
### AutoModelForMaskGeneration
[[autodoc]] AutoModelForMaskGeneration
### TFAutoModelForMaskGeneration
[[autodoc]] TFAutoModelForMaskGeneration
### AutoModelForSeq2SeqLM
[[autodoc]] AutoModelForSeq2SeqLM
### TFAutoModelForSeq2SeqLM
[[autodoc]] TFAutoModelForSeq2SeqLM
### FlaxAutoModelForSeq2SeqLM
[[autodoc]] FlaxAutoModelForSeq2SeqLM
### AutoModelForSequenceClassification
[[autodoc]] AutoModelForSequenceClassification
### TFAutoModelForSequenceClassification
[[autodoc]] TFAutoModelForSequenceClassification
### FlaxAutoModelForSequenceClassification
[[autodoc]] FlaxAutoModelForSequenceClassification
### AutoModelForMultipleChoice
[[autodoc]] AutoModelForMultipleChoice
### TFAutoModelForMultipleChoice
[[autodoc]] TFAutoModelForMultipleChoice
### FlaxAutoModelForMultipleChoice
[[autodoc]] FlaxAutoModelForMultipleChoice
### AutoModelForNextSentencePrediction
[[autodoc]] AutoModelForNextSentencePrediction
### TFAutoModelForNextSentencePrediction
[[autodoc]] TFAutoModelForNextSentencePrediction
### FlaxAutoModelForNextSentencePrediction
[[autodoc]] FlaxAutoModelForNextSentencePrediction
### AutoModelForTokenClassification
[[autodoc]] AutoModelForTokenClassification
### TFAutoModelForTokenClassification
[[autodoc]] TFAutoModelForTokenClassification
### FlaxAutoModelForTokenClassification
[[autodoc]] FlaxAutoModelForTokenClassification
### AutoModelForQuestionAnswering
[[autodoc]] AutoModelForQuestionAnswering
### TFAutoModelForQuestionAnswering
[[autodoc]] TFAutoModelForQuestionAnswering
### FlaxAutoModelForQuestionAnswering
[[autodoc]] FlaxAutoModelForQuestionAnswering
### AutoModelForTextEncoding
[[autodoc]] AutoModelForTextEncoding
### TFAutoModelForTextEncoding
[[autodoc]] TFAutoModelForTextEncoding
## Computer vision
The following auto classes are available for the following computer vision tasks.
@ -242,14 +154,6 @@ The following auto classes are available for the following computer vision tasks
[[autodoc]] AutoModelForImageClassification
### TFAutoModelForImageClassification
[[autodoc]] TFAutoModelForImageClassification
### FlaxAutoModelForImageClassification
[[autodoc]] FlaxAutoModelForImageClassification
### AutoModelForVideoClassification
[[autodoc]] AutoModelForVideoClassification
@ -266,10 +170,6 @@ The following auto classes are available for the following computer vision tasks
[[autodoc]] AutoModelForMaskedImageModeling
### TFAutoModelForMaskedImageModeling
[[autodoc]] TFAutoModelForMaskedImageModeling
### AutoModelForObjectDetection
[[autodoc]] AutoModelForObjectDetection
@ -286,10 +186,6 @@ The following auto classes are available for the following computer vision tasks
[[autodoc]] AutoModelForSemanticSegmentation
### TFAutoModelForSemanticSegmentation
[[autodoc]] TFAutoModelForSemanticSegmentation
### AutoModelForInstanceSegmentation
[[autodoc]] AutoModelForInstanceSegmentation
@ -302,10 +198,6 @@ The following auto classes are available for the following computer vision tasks
[[autodoc]] AutoModelForZeroShotImageClassification
### TFAutoModelForZeroShotImageClassification
[[autodoc]] TFAutoModelForZeroShotImageClassification
### AutoModelForZeroShotObjectDetection
[[autodoc]] AutoModelForZeroShotObjectDetection
@ -320,10 +212,6 @@ The following auto classes are available for the following audio tasks.
### AutoModelForAudioFrameClassification
[[autodoc]] TFAutoModelForAudioClassification
### TFAutoModelForAudioFrameClassification
[[autodoc]] AutoModelForAudioFrameClassification
### AutoModelForCTC
@ -334,14 +222,6 @@ The following auto classes are available for the following audio tasks.
[[autodoc]] AutoModelForSpeechSeq2Seq
### TFAutoModelForSpeechSeq2Seq
[[autodoc]] TFAutoModelForSpeechSeq2Seq
### FlaxAutoModelForSpeechSeq2Seq
[[autodoc]] FlaxAutoModelForSpeechSeq2Seq
### AutoModelForAudioXVector
[[autodoc]] AutoModelForAudioXVector
@ -366,18 +246,10 @@ The following auto classes are available for the following multimodal tasks.
[[autodoc]] AutoModelForTableQuestionAnswering
### TFAutoModelForTableQuestionAnswering
[[autodoc]] TFAutoModelForTableQuestionAnswering
### AutoModelForDocumentQuestionAnswering
[[autodoc]] AutoModelForDocumentQuestionAnswering
### TFAutoModelForDocumentQuestionAnswering
[[autodoc]] TFAutoModelForDocumentQuestionAnswering
### AutoModelForVisualQuestionAnswering
[[autodoc]] AutoModelForVisualQuestionAnswering
@ -386,14 +258,6 @@ The following auto classes are available for the following multimodal tasks.
[[autodoc]] AutoModelForVision2Seq
### TFAutoModelForVision2Seq
[[autodoc]] TFAutoModelForVision2Seq
### FlaxAutoModelForVision2Seq
[[autodoc]] FlaxAutoModelForVision2Seq
### AutoModelForImageTextToText
[[autodoc]] AutoModelForImageTextToText

@ -13,6 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2021-06-24 and added to Hugging Face Transformers on 2023-05-30.*
# Autoformer

@ -13,6 +13,7 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2025-05-13 and added to Hugging Face Transformers on 2025-03-04.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
@ -59,14 +60,14 @@ print(outputs)
```python
# pip install 'git+https://github.com/huggingface/transformers.git@v4.49.0-Aya Vision'
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
model_id = "CohereLabs/aya-vision-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id, device_map="auto", torch_dtype=torch.float16
model_id, device_map="auto", dtype=torch.float16
)
# Format message with the aya-vision chat template
@ -132,7 +133,7 @@ inputs = processor.apply_chat_template(
add_generation_prompt=True,
tokenize=True,
return_tensors="pt"
).to("cuda")
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=50)
print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))
@ -147,12 +148,12 @@ print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))
- The example below demonstrates inference with multiple images.
```py
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("CohereForAI/aya-vision-8b")
model = AutoModelForImageTextToText.from_pretrained(
"CohereForAI/aya-vision-8b", device_map="cuda", torch_dtype=torch.float16
"CohereForAI/aya-vision-8b", device_map="auto", dtype=torch.float16
)
messages = [
@ -177,7 +178,7 @@ print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))
inputs = processor.apply_chat_template(
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to("cuda")
).to(model.device)
gen_tokens = model.generate(
**inputs,
@ -193,12 +194,12 @@ print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))
- The example below demonstrates inference with batched inputs.
```py
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
"CohereForAI/aya-vision-8b", device_map="cuda", torch_dtype=torch.float16
"CohereForAI/aya-vision-8b", device_map="auto", dtype=torch.float16
)
batch_messages = [

Some files were not shown because too many files have changed in this diff.