Compare commits

..

174 Commits

Author SHA1 Message Date
2be1dd7af2 no changes 2025-10-24 10:47:55 +02:00
a20a326408 Merge branch 'main' into test-bart-dummy 2025-10-24 10:31:20 +02:00
9c20660138 unpin torch/torchcodec for CircleCI (#41839)
CirCleCI with torch 2.9

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-24 08:19:38 +02:00
e4b920b3cf [Parakeet] add output_attention_mask (#41694)
* add output_attention_mask

* style
2025-10-23 23:09:20 +00:00
12b253e921 test 2025-10-23 20:23:42 +02:00
cecef75790 init 2025-10-23 20:17:07 +02:00
81b4f9882c transformers serve quantization docs + some api fixes for bitsandbytes (#41253)
* doc

* fix api

* fix

* fix

* fix

* fix args

* minor doc fix

* fix

* style

* rm check for now

* fix

* style

* Update docs/source/en/serving.md

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>

* add log and update value

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-10-23 16:00:54 +00:00
2a3f66d9d2 Deprecate warmup_ratio (#41326)
* dep

* style

* deprecate warmup_ratio

* better

* fix

* Revert "style"

This reverts commit cf4f9e7c4f7837a88eea6eeabf8b4dfe9455f6dc.

* Revert "dep"

This reverts commit 1800beb13f407ddb881d0af936860643e84ba085.

* update version
2025-10-23 17:17:21 +02:00
ca01fe4d13 transformers cli default flag fix (#41761) 2025-10-23 13:33:55 +00:00
f780932e05 Fixed some grammar mistakes (#41802)
Added spaces between words, fixed a typo and other errors
2025-10-23 12:39:58 +00:00
e7c5a60368 Fixed grammar mistakes (#41799)
* Fixed grammar mistakes

fixed a couple grammar mistakes

* Update README.md

* Change phrasing a bit more

---------

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
2025-10-23 12:34:02 +00:00
91b5a680c0 [Trainer] remove env vars (#41697)
* remove env var

* style

* fix value

* update

* fix

* style

* fix

* maybe this time

* rm tests

* fix
2025-10-23 14:17:20 +02:00
d4562bb8ae Fix Qwen3Next dtype API usage (#41735)
Replace torch.get_current_dtype() with torch.get_default_dtype() to fix FLA compatibility
2025-10-23 12:02:02 +00:00
e46c2ff32e Add a safeguard around a flaky test in gemma2 (#41811)
* Fix _compile flag in flex attn integration

* Revert fix and add precaution around test
2025-10-23 12:36:50 +02:00
3b6ddbcb88 make apollo test case pass (#41805)
make apollo test cases pass

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-10-23 12:07:31 +02:00
ff04520266 Bump AMD docker (#41792) 2025-10-23 10:44:20 +02:00
01f5ac70a3 flash attn pytest marker (#41781)
* flash attn marker

* 111

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-23 08:39:19 +00:00
2c5b888c95 [Onnx docs] Remove some traces (#41791)
fix
2025-10-23 10:34:25 +02:00
0eb372ba19 [quantization] fix torchao tests after 0.14.0 release (#41777)
* initial commit

* clean int4_weight_only

* make style

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-10-23 08:26:44 +00:00
87be559508 Fix attention mask in mamba layers (#41790)
* not all mamba models are like LFM

* compile friendly

* adjust slow tests expectation

* naming
2025-10-22 18:15:38 +02:00
2ca506ca1d Fix chat schema tests (#41793)
* Fix chat schema tests

* make fixup
2025-10-22 15:00:49 +00:00
5426947e3a fix type annotation typo in docstring (#41788) 2025-10-22 13:58:18 +00:00
93671b4444 Swap columns and rows of the grid layout in LFM2-VL (#41755)
* swap columns and rows of the grid layout

* update integration tests

* fix the test case

* revert batched test change
2025-10-22 12:52:06 +00:00
18a3349a9f [quantization] Skip Fp8 tests when hardware capability < 8.9 (#41785)
* skipping tests

* style

* nit
2025-10-22 13:33:28 +02:00
e9f241bf89 [quantization] fix compressed_tensors tests (#41780)
fixing tests
2025-10-22 12:37:07 +02:00
7cd1d2b66c [v5] Delete legacy chat template saving (#41648)
* delete lagcy chat template saving

* fix tests

* fix qwen audio
2025-10-22 09:40:55 +00:00
48a36c96da fix: Gemma 3 weights conversion vision and multimodal projector paths (#41767)
fix: Gemma 3 vision and multimodal projector paths
2025-10-22 09:38:56 +00:00
9a27302803 Fix CUDA index out of bounds for q_idx in VLM token type masking for Gemma3, PaliGemma, and example modular (#41757)
* Fix CUDA index out of bounds for q_idx in Gemma3 token type masking

* Fix CUDA index out of bounds for q_idx in modular modeling_new_task_model

* Revert "Fix CUDA index out of bounds for q_idx in Gemma3 token type masking"

This reverts commit f8e5c2a42c305aebd00c46161bf22f520009c8fc.

* Fix CUDA index out of bounds for q_idx in PaliGemma token type masking

* Fix CUDA index out of bounds for q_idx in Gemma3 token type masking
2025-10-22 11:29:47 +02:00
4f8781f84f Remove invalid @staticmethod from module-level get_device_and_memory_breakdown (#41747)
Remove staticmethod decorator from function
2025-10-22 10:52:29 +02:00
a8cece13e2 Fix bark after #41445 (#41645)
* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-22 09:21:45 +02:00
2e67a9b602 Add LightGlue fast image processor (#41670)
* add fast image processor skel

* add working lightglue fast image processor + tests

* remove plot_keypoint_matching
2025-10-22 00:33:16 +02:00
264cce9e0a Chat response parsing (#40894)
* Initial commit

* Adding more tests, bugfixes, starting tool tests

* Add support for JSON parsers and some tool tests

* stash commit

* stash commit

* stash commit

* stash commit

* stash commit

* Fix cohere schema, fix a lot of the recursive parser code

* GPT-OSS passing too!

* Update tests

* make fixup

* Offset tracking partially done

* stash commit

* stash commit

* Assistant masking Just Works

* make fixup

* stash commit

* stash commit

* JMESPath approach

* stash commit before i rip this PR apart

* Remove broken offset code

* Remove broken offset code

* Update chat parsing code and add tests for Ernie + fix Cohere tests for new format

* Implement tokenizer method

* jmespath dependency handling

* Completed TODOs

* Add support to TextGenerationPipeline

* Update GPT-OSS schema and test cases

* make fixup

* Fix typing (??)

* missing future import

* Use old typing in tokenization_utils_base.py

* put jmespath in various extras

* Remove accidental newline

* Guard tests correctly

* Remove require_jinja on the schema tests since we don't actually apply chat templates there

* make fixup

* fix some bad linter changes

* Fix docstring

* Push draft documentation

* Extend tests, more documentation

* make fixup

* docs docs docs

* Add Processor support

* Add to toctree

* Flag markdown correctly

* Remove double backslashes in docs for simplicity

* Simplify node-regex-to-dict

* Add support to ImageTextToTextPipeline

* Add support to ImageTextToTextPipeline and save/loading support in Processors

* Begin reworking docs to start fitting in response parsing

* Fix rebase

* Expand documentation further

* Expand documentation further

* Refactor x-regex-to-dict to x-regex-key-value, update the parser logic docs section

* Refactor x-regex-to-dict to x-regex-key-value, update the parser logic docs section

* More docs update

* Update TextGenerationPipeline to support tools properly

* Some rebase fixes

* Re-add is_jmespath_available

* Re-add is_jmespath_available

* Add Qwen3 parser and test, add maybe-json support

* Rollback processor changes - we'll wait for legacy saving to be deprecated

* Make fixup

* Revert ImageTextToText changes for now

* Add pipeline test

* make fixup

* Resolve a todo

* Resolve more TODOs and clean up the spec a little

* Add ref in the tools doc

* Update docs/source/en/chat_response_parsing.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update src/transformers/utils/chat_parsing_utils.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Add a docstring for parse_response

* Add function docstring and reference it in the docs

* Fix generate link

* Revert Processor changes for now

* Use updated GPT-OSS format

* Print the dict keys instead of the whole dict so the example doesn't become too big

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
2025-10-21 17:26:18 +01:00
3f2db2c205 Simplify pipeline padding logic (#41667)
* Remove a lot of unnecessary pad logic

* Remove unnecessary clone() calls since we're just doing a slice assignment

* Just make the full tensor instead of adding to a zeros tensor
2025-10-21 17:01:48 +01:00
1d651c749e Modernize CLIP modeling code (#41546)
* stranded

* update modular

* modularities

* update

* fx broken

* fx stillb roken

* update

* missed this

* fix metaclip
2025-10-21 16:04:43 +02:00
f39355ec23 [v5] Remove deprecated tranformers.onnx (#41700)
* Remove deprecated tranformers.onnx

* Remove transformers.onnx related doc

* style

* shouldn't have been removed

* fix mismatch between metaclip2 modular en config file

* remove onnx config from not_doctested.txt

---------

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
2025-10-21 15:22:41 +02:00
5995435d96 Update type hints in modeling_rope_utils.py to use | syntax (#41714)
* Update type hints to use | syntax for Union types

- Replace Union[str, os.PathLike] with str | os.PathLike
- Replace Optional[Union[str, dict]] with str | dict | None
- Keep Union for forward references like 'torch.dtype'
- Update imports to remove unused Union import where possible

This modernizes the type hints to use Python 3.10+ syntax while maintaining
compatibility with forward references.

* Update type hints in modeling_rope_utils.py to use | syntax

- Replace Union[float, dict[str, float]] with float | dict[str, float]
- Remove unused Union import
- Maintain backward compatibility

This modernizes the type hints to use Python 3.10+ syntax.
2025-10-21 12:05:11 +00:00
2383f3fcbb Fix graphormer model compilation with Cython 3.1.4 (#41671)
Hitting this kind of error when running:

```
cython src/transformers/models/deprecated/graphormer/algos_graphormer.pyx
```

```
Error compiling Cython file:
------------------------------------------------------------
...
    (nrows, ncols) = path.shape
    assert nrows == ncols
    cdef unsigned int n = nrows
    cdef unsigned int max_dist_copy = max_dist

    path_copy = path.astype(long, order='C', casting='safe', copy=True)
                            ^
------------------------------------------------------------
src/transformers/models/deprecated/graphormer/algos_graphormer.pyx:88:28: undeclared name not builtin: long
```

This appears to have changed between cython==3.0 and cython==3.1.  AFAICT the
correct type to use here would be `int`.  Switching to it makes the command
succeed and generate an algos_graphormer.c file.
2025-10-21 12:02:23 +00:00
c4e88f78ca Reduce warning noise caused by Tensor.new_tensor (#41748) 2025-10-21 11:54:13 +00:00
2fe4a30340 [kernels] Add version to function mapping (#41685)
add version
2025-10-21 13:27:18 +02:00
ede7976cd2 Fixed incorrect model_type for qwen2vl and qwen2.5vl when config is saved and loaded again (#41758)
* fixed incorrect model_type for qwen2vl and qwen2.5vl

* added tests
2025-10-21 10:54:58 +00:00
ee3a1002e2 [v5] Delete videos from image processing classes (#41607)
* delete

* why there were video tests in image file

* fix tests and copies

* docs and autto class
2025-10-21 12:03:31 +02:00
4e50b8459d upgrade xpu docker file to torch 2.8 (#41551)
* upgrade xpu docker file to torch 2.8

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update Dockerfile

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-10-21 10:03:04 +02:00
9aab965b1e Add vision contribution guide (#41456)
* vision contrib guide

* Update CONTRIBUTING.md

Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>

* Update CONTRIBUTING.md

Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>

* Update CONTRIBUTING.md

Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>

* Update CONTRIBUTING.md

Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>

* Update CONTRIBUTING.md

Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>

* Update CONTRIBUTING.md

Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>

* update tiny things

---------

Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
2025-10-20 18:56:47 +02:00
1a034ce1d2 Docs(zh-hans): Refine wording for professionalism in README (#40943)
* [docs] Polish Chinese README translation by replacing informal terms  with professional vocabulary

* [docs] Polish Simplified Chinese README for better professionalism and consistency

- Replace "抱抱脸" with "Hugging Face" to align with standard usage in Chinese developer community
- Replace "流水线" with "pipeline" to maintain consistency with code and technical terminology
- Add proper code formatting (`pipeline`) for API references to match Traditional Chinese version
- Update translation dictionary to reflect these standardized terms
- Improve overall readability and technical accuracy for Chinese developers

These changes enhance the professionalism of the documentation while maintaining consistency with established technical terminology used by the Chinese developer community.
2025-10-20 08:39:49 -07:00
6850ba853f Small Fix for imports (#41411)
small fix
2025-10-20 17:21:04 +02:00
bf0bce8d5f Apply RUFF PIE rules (#41727)
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-20 13:32:23 +00:00
2cf8f833b0 Fix documentation issues (#41726)
Fix more documentation issues

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-20 13:31:02 +00:00
517197f795 Update type hints in tokenization_utils.py to use | syntax (#41713)
* Update type hints to use | syntax for Union types

- Replace Union[str, os.PathLike] with str | os.PathLike
- Replace Optional[Union[str, dict]] with str | dict | None
- Keep Union for forward references like 'torch.dtype'
- Update imports to remove unused Union import where possible

This modernizes the type hints to use Python 3.10+ syntax while maintaining
compatibility with forward references.

* Update type hints in tokenization_utils.py to use | syntax

- Replace Union[AddedToken, str] with AddedToken | str
- Replace Union[list[str], list[AddedToken]] with list[str] | list[AddedToken]
- Replace Union[str, list[str]] with str | list[str]
- Replace Union[int, list[int]] with int | list[int]
- Update error messages to use | syntax
- Maintain backward compatibility

This modernizes the type hints to use Python 3.10+ syntax.

* Fix error message formatting in tokenization_utils.py

- Fix error message to use Union syntax instead of | syntax in string
- This prevents potential issues with error message formatting
- Maintains type hint modernization while fixing error messages
2025-10-20 13:24:16 +00:00
9d4ee18e25 [doc] remove broken notebooks on AMD Dev Cloud (#41743)
revert
2025-10-20 14:36:53 +02:00
818f7f10e4 Revert "Remove upper version bound of pandas" (#41744)
Revert "Remove upper version bound of pandas (#41677)"

This reverts commit a15d77cd0c3e4c1d9f0a196da5996b735eead37e.
2025-10-20 14:25:32 +02:00
ce4ffeeb6c Fix typo in LFM-VL (#41742)
oops, remove untrelated commits
2025-10-20 13:55:41 +02:00
cb6f03fce4 Fix Qwen3-Omni inference when mixing video and image inputs in one batch (#41741)
* Fix qwen3omni inference when mixing video and image inputs in one batch

* Fix `router_aux_loss_coef`

---------

Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com>
2025-10-20 11:35:02 +00:00
8fc5420913 Gemma3 conversion script maintenance (#41704)
* conversion: add include_vision_encoder flag (default true)

* conversion: update for inverted model.language_model weight path

* conversion: revert include_vision_encoder to True by default

* conversion: add chat template path flag
2025-10-20 12:52:22 +02:00
71db0d49e9 feat: add benchmark v2 ci with results pushed to dataset (#41672) 2025-10-20 08:56:58 +01:00
307c523854 further improve utils/check_bad_commit.py (#41658) (#41690)
* fix

* Update utils/check_bad_commit.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
2025-10-17 23:07:00 +02:00
448c553ccb Update run_name docs in TrainingArguments (#41705)
Update run_name docs in TrainingArguments
2025-10-17 20:40:03 +00:00
cb4d4f5b75 pin torchcodec on CI docker image (#41703)
pin

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-17 20:50:18 +02:00
ac81541778 🌐 [i18n-KO] Translated gemma3n.md to Korean (#40873)
* fix: manual edits

* Apply suggestions from code review

Apply suggestions from code review and make additional revisions

Co-authored-by: HyunSang Jang <tasker.dev103@gmail.com>

* Apply suggestions from code review

Apply suggestions from code review — updated inline links for related text

* Apply suggestions from code review

Apply suggestions from code review - final

* Update docs/source/ko/_toctree.yml

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: HyunSang Jang <tasker.dev103@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-10-17 09:57:05 -07:00
e7592f2508 [docs] Manual tp-plan (#41674)
* manual tp-plan

* feedback
2025-10-17 09:38:26 -07:00
347a0f9e83 Simplify GQA conditions in sdpa_attention.py (#41699)
Removed unnecessary checks for key being a torch.fx.Proxy in GQA conditions because fx tracing is no longer supported, and torch.export supports enable_gqa.
2025-10-17 16:36:38 +00:00
7e204ad121 [Attn] Allow dynamic causality in SDPA via Kwargs (#41692)
* is causal as kwarg

* Update src/transformers/integrations/sdpa_attention.py

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>

* fix comment

---------

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-10-17 15:51:51 +00:00
a15d77cd0c Remove upper version bound of pandas (#41677)
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-17 17:31:41 +02:00
12a50f294d Enable FURB rules in ruff (#41395)
* Apply ruff FURB rules

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Enable ruff FURB rules

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* More fixes

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* More fixes

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Revert changes

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* More fixes

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-17 15:00:40 +00:00
39b6d3bf7e Remove skipped tests without parents (#41691)
remove
2025-10-17 16:25:40 +02:00
75da795d8f 🚨 Remove torch.fx support (#41683)
* remove all

* fix comments

* better checks

* doc
2025-10-17 16:12:46 +02:00
080d704af1 Fix Pylint warnings (#41644)
* Fix pylint warnings

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* More fixes

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Raise with an exception

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-17 13:09:42 +00:00
c01ceffeb4 Enable faiss-cpu on Windows (#41678)
faiss-cpu is supported on Windows

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-17 13:00:57 +00:00
10de06dace 🚨 [v5] Refactor RoPE for layer types (#39847)
* update

* batch update model code

* typos

* too many diffs, dump

* dump again

* another dump

* fix copies

* make `rope_scaling_dict` self attr

* fix a few more tests

* another update

* fix a few more tests, hopefully last ones

* fox copies

* fix copies again

* fix newly added models, I hate rebasing on main

* update config files

* modular files

* fix rope utils test

* docstring has to be indented more, why?

* oops forgot to update some modualr files

* copy from doesn't copy decorators?

* fix overriden test as well

* add a new test

* fix failing tests again

* update docstrings

* fix phi3

* fix two models

* fix copies

* forgot to add

* stupid bug from modular conversion

* fix slow tests

* update to call rotary emb once per model forward

* 3K tests failing?!

* update

* update more models

* fix copies

* fix the rest of tests hopefully

* fix after rebase

* fix the rope tests

* fix docs omni

* change a bit

* models with layer types

* why it was deleted?

* fix a few tests

* fix last test!

* delete extra empty lines

* add a test case

* more changes

* fix models

* typing hint for nested rope params

* missed when resolving conflicts

* delete layer types and fix typo

* fix copies

* fix copies

* update docs text

* docs

* huuge update all models

* fix copies

* rename attr to align with new format

* delete redundant rope tests

* trigger ci

* update the case

* this is why i hate rebasing

* maybe fixed?

* oops

* now fix?

* fix last tests and copies

* fix copies?

* fix minimax and gemma3n

* update typo

* deprecation end version

* final fix copies :fingers-crossed:

* oh my, add the docs in toctree

* oke, this is really the last fix

* fix copies and hope that tests won't start failing again

* use rope scaling if saved

* fix slow tests

* fix cwm and unrelated deepseek

* fix last

* update

* hope it works now, it took so long

* lets keep None for now, I will try to remove after checking tests

* some more fixes, i find and replace does not always find all cases

* last fix of tests

* arthur's comment for extra foreward kwargs

* delete unused code

* fix slow qwen tests

* delete layer types from models

* faulty modular conversion

* fix qwen omni

* fix copies and style

* address my comment

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-17 14:57:27 +02:00
def9a7ef05 Use | for Optional and Union typing (#41675)
Use | for Optional and Union typing

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-17 12:52:47 +00:00
0beda2aa3a Fix MarkDown syntax (#41676)
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-17 12:44:27 +00:00
0b3aef1da9 🚨 Remove torchscript support (#41688)
* remove a lot

* remove the rest

* doc
2025-10-17 13:38:27 +02:00
7370a1babd path validation for security reason (#41256)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-17 12:36:04 +02:00
151d6adc86 Remove require_torch_bf16_gpu (#40979)
* More cleanup

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Remove more functions

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* More fixes

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
2025-10-17 12:35:19 +02:00
252d7cd952 Remove deprecated use_auth_token parameter (#41666)
* Remove deprecated use_auth_token

* code styl

* fix test

* Update examples/pytorch/speech-recognition/README.md
2025-10-17 09:57:46 +00:00
415cb37708 torch 2.9 still don't ❤️ torchcodec 0.8 💔 (#41686)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-17 11:57:28 +02:00
1eb45cd61d Fix ckpt in docs (#41659)
* fix ckpt in docs

* fix config ckpt
2025-10-17 11:00:34 +02:00
354567d955 Adding superglue fast image processing (#41394)
* Default implementation - no time improvement

* Improved implementation - apparently 2 times faster with only simple function refactor

* elementary torch first approach, still need further implementation of torch first method

* torch-first approach finished

* refactor processor

* refactor test

* partial doc update

* EfficientLoFTRImageProcessorFast based implementation

* EfficientLoFTRImageProcessorFast based implementation

* Logic checked - Test Passed - Validated execution speed

* use modular for efficientloftr

* fix import

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
2025-10-16 19:34:09 +00:00
4dd4133d32 🌐 [i18n-KO] Translated ko-LFM2.md to Korean (#41502)
* feat: nmt draft

* fix: manual edits

* Update docs/source/ko/model_doc/lfm2.md

Co-authored-by: Yijun Lee <119404328+yijun-lee@users.noreply.github.com>

* Update docs/source/ko/model_doc/lfm2.md

Co-authored-by: Ahnjj_DEV <ahnjj.dev@gmail.com>

* Update docs/source/ko/model_doc/lfm2.md

Co-authored-by: Ahnjj_DEV <ahnjj.dev@gmail.com>

* Update docs/source/ko/model_doc/lfm2.md

Co-authored-by: Ahnjj_DEV <ahnjj.dev@gmail.com>

---------

Co-authored-by: Yijun Lee <119404328+yijun-lee@users.noreply.github.com>
Co-authored-by: Ahnjj_DEV <ahnjj.dev@gmail.com>
2025-10-16 11:29:04 -07:00
eefbf4ac8b 🌐 [i18n-KO] Translated llama4.md to Korean (#40396)
* docs: ko: llama4.md

* feat: nmt draft

* fix: manual edits

* Update docs/source/ko/model_doc/llama4.md

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>

* Update docs/source/ko/model_doc/llama4.md

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>

* Update docs/source/ko/model_doc/llama4.md

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>

* Update docs/source/ko/model_doc/llama4.md

Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>

---------

Co-authored-by: TaskerJang <bymyself103@naver.com>
Co-authored-by: YONGSANG <71686691+4N3MONE@users.noreply.github.com>
2025-10-16 11:28:27 -07:00
50ca781d78 🌐 [i18n-KO] Translated code_llama.md to Korean (#40558)
* docs: ko: code_llama.md

* feat: nmt draft

* fix: manual edits

* Apply suggestions from code review

Co-authored-by: Harheem Kim <49297157+harheem@users.noreply.github.com>
Co-authored-by: HyunZ118 <156191095+HyunZ118@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Harheem Kim <49297157+harheem@users.noreply.github.com>

---------

Co-authored-by: Harheem Kim <49297157+harheem@users.noreply.github.com>
Co-authored-by: HyunZ118 <156191095+HyunZ118@users.noreply.github.com>
2025-10-16 11:27:46 -07:00
8739fc05c4 [i18n-KO] Translated big_bird.md to Korean (#40445)
* docs: ko: BigBird.md

* feat: nmt draft

* fix: manual edits
2025-10-16 11:23:56 -07:00
77b5ad65ee 🌐 [i18n-KO] Translated sam_hq.md to Korean (#41340)
* fix: manual edits

* Apply suggestions from code review

Apply suggestions from code review

Co-authored-by: HyunSang Jang <tasker.dev103@gmail.com>

* Apply suggestions from code review

Apply suggestions from code review

Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com>

---------

Co-authored-by: HyunSang Jang <tasker.dev103@gmail.com>
Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com>
2025-10-16 11:10:16 -07:00
a9731a725e 🌐 [i18n-KO] Translated chat_extras.md to Korean (#39863)
* docs: ko: chat_extras.md

* feat: nmt draft

* fix: manual edits

* Apply suggestions from code review

* Apply suggestions from code review

* Update docs/source/ko/chat_extras.md
2025-10-16 10:41:03 -07:00
bdbc2d037b [Trainer] [Breaking change] use_cache default to False (#41585)
* use_cache default to `False` when training

* style

* Fix comment

* add checks

* style

* set

* switch
2025-10-16 18:51:36 +02:00
fe11cbb808 Erroring when KernelConfig is passed without use_kernels = True (#41657)
* update

* update
2025-10-16 18:08:46 +02:00
6344371a91 improve utils/check_bad_commit.py (#41658)
* robust

* robust

* robust

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-16 15:51:19 +00:00
a408384a88 Improve package version check (#41661)
fix
2025-10-16 17:31:58 +02:00
f7c33abab3 Small changes to benchmarking script (#41662) 2025-10-16 17:25:49 +02:00
9839d57a02 Fix serving continuous batching (#41624)
* udpate-serving-cb

* style

* style

* check none

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-10-16 17:24:21 +02:00
e85d5ab2bb Fix dtype casting with quantization (#41665)
fix dtype casting
2025-10-16 17:19:32 +02:00
1c36d407d5 Add in-out modalities as class attribute per model (#41366)
* update all models

* fix copies

* explanation comment

* better notation in omni model

* style

* fix copies

* output_modalities under generation mixin

* fix copies

* oh, glm4v also needs conversion
2025-10-16 17:11:06 +02:00
0215846d98 Switch to CB if cache_implementation == paged (#41655)
* Add a switch to CB in case of paged cache

* Added paged as a valid cache implem

* Added a fallback on inputs_ids as a name

* Rookie mistake

* Removed paged from cache implems

* Added warning about some  beam search args

* Moved up CB warning
2025-10-16 17:00:18 +02:00
9e99198e5e Use | for Optional and Union typing (#41646)
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-16 14:29:54 +00:00
bf815e9b5e [Masks] Fix mask handling in eager for vision models (#41625)
add mask handling in case of models that do use it
2025-10-16 16:27:26 +02:00
vb
4a43e3d57c purge HF_HUB_ENABLE_HF_TRANSFER; promote Xet (#41656) 2025-10-16 16:17:09 +02:00
8725ce10ed [Fix] Deepseek V3 expert bias routing (#41647)
* [Fix] Deepseek V3 expert bias routing

* [Fix] fix-copies

* [Fix] Run make style
2025-10-16 14:04:48 +00:00
1fb3fc4db0 [kernels] refactor function kernel calling (#41577)
* refactor function kernel callling

* nit

* don't pass the mapping

* use _kernels_available

* rm import
2025-10-16 15:43:02 +02:00
9176af574a Double router compute? (#41653)
* weird double router compute?

* flip it
2025-10-16 15:17:21 +02:00
503c933f36 Fix confusing cls assignment (#41642)
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-16 13:01:07 +00:00
2aff20aff6 Fix typos in documentation (#41641)
Fix typos

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-16 12:58:46 +00:00
981370c038 Format MarkDown documentation and tiny fixes (#41638)
* Fix MarkDown syntax

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* More fixes

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-16 12:58:06 +00:00
eef9fb2af3 Fix EncoderDecoder cache (#41612)
* Fix EncoderDecoder cache

* Add the option for the ddp data tuples to have 2 elems

* Modifiy the order of the KV and sliding

* Adapted RAG and Whisper to new EncoderDecoderCache

* A single comma

* Remove kwargs in map

* Fixed order in manual injection cache test

* Slight changes to support legacy format

* Removed Nonnes
2025-10-16 14:55:41 +02:00
35dc8f0a2e Adjust device logging level and add minor fixes (#41636)
This commit addresses a noisy warning and improves the robustness of the base pipeline implementation.

- The device placement message in the pipeline base class has been changed from a `warning` to a `debug` log. This reduces log noise for users who are aware of their device setup, while still providing the information for debugging purposes.

- Additionally, potential `UnboundLocalError` exceptions in the `_pad` and `check_model_type` functions have been prevented by initializing variables before their conditional assignment.
2025-10-16 12:47:39 +00:00
2935a1be19 Fix fp32_ln for various models (#41605)
* Add is_causal to KosmosTextAttention

* Move get target_dtype to be imported elsewhere

* Fix fp32 flash attention bug in bark

* Fix is_causal in mllama

* Fix fp32 issue on StableLM

* Fix repo-consistency
2025-10-16 14:18:49 +02:00
b9bd8c45a1 [CI] Build translated docs (#41632)
fix
2025-10-16 14:01:33 +02:00
baecdb8a97 [Ernie 4.5 Moe] Fix Moe and offloading (#41385)
fix
2025-10-16 13:59:01 +02:00
44539827d5 [Executorch] Simplify for encoder models (#41627)
* Trigger Build

* revert extra treatment for executorch as we default to no vmapping now
2025-10-16 13:57:52 +02:00
143acfe2ce fix check inputs for text2text pipeline (#41556)
fix check inputs

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-10-16 11:42:41 +00:00
67fae90519 Fix FP-Quant quantization fallback CPU dispatch. (#41619)
* fp_quant fix

* Update quantizer_fp_quant.py
2025-10-16 11:41:01 +00:00
af2a66ced9 Migrate transformers cli to Typer (#41487)
* Add typer-slim as explicit dependency

* Migrate CLI to Typer

* code quality

* bump release candidate

* adapt test_cli.py

* Remove ./commands + adapt tests

* fix quality

* consistency

* doctested

* do not serve model in chat

* style

* will it fix them?

* fix test

* capitalize classes

* Rebase

* Rebase

* tests + fixup

tests + fixup

* csutom error message

* fix ?

* should be good

* fix caplog globally

* inner caplog

* last attempt

* Retry

* Let's try with capsys disabled

---------

Co-authored-by: Lysandre <hi@lysand.re>
2025-10-16 13:29:42 +02:00
a59124e27e Add missing dates to docs (#41576)
add dates
2025-10-16 09:32:28 +00:00
81f97b17d2 Remove randomly added script (#41650)
remove
2025-10-16 11:23:53 +02:00
c0a5cf19ad Fix tokenization test (#41649)
fix
2025-10-16 11:14:20 +02:00
3ef6f2c415 Allow passing tp_plan in from_pretrained directly (#41435)
* start

* allow passing it

* fix plans

* fix

* fix

* style

* style

* fix

* add_test

* oupsi indent

* fix

* fix

* fix for CI without accelerator

* fix import
2025-10-16 11:12:07 +02:00
59efd86da2 Add aux loss for GLM-4.5V (#41564)
* add aux

* update

* update config to text_config

* use qwen data class to avoid repeat again

* format

* update

* use 1e-4

* update

* update for remove init

* Apply style fixes

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
2025-10-16 09:04:21 +00:00
7b7d17f9bf 🚨 [v5] Toggle the serialization format in processors (#41474)
* toggle the serialization

* prob this fixes it

* fix tests

* typo

* delete legacy save entirely

* remove extra nesting in if

* revert test and serialzie a public attr instead of private
2025-10-16 10:19:22 +02:00
e20df45bf6 Add Backbone API fine-tuning tutorial (#41590)
---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-10-15 18:42:32 +02:00
19df66dcba Update executorch.md (#41582)
* Update executorch.md

* Update executorch.md

* Update executorch.md

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-10-15 09:01:46 -07:00
9f71e3a604 [docs] Duplicate entry (#41591)
fix
2025-10-15 17:02:36 +02:00
bc9900562d Fix quantization base class (#41613)
* fix

* fix

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-10-15 16:58:17 +02:00
72fd67929b Remove deprecated code (#41616)
remove

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-10-15 16:57:52 +02:00
da382917aa Remove the head masking block in some vision models (#41620)
* old

* new

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-15 15:51:01 +02:00
313afcc468 [chat template] update when "push_to_hub" (#39815)
* update templates push to hub

* rvert jinja suffix and move it to processor file
2025-10-15 13:49:59 +00:00
7bba4d1202 Fix video processing channel format (#41603)
fix
2025-10-15 15:48:01 +02:00
ab92534377 enable sdpa enable gqa logic for Ascend NPU (#41601)
* enable gqa logic for Ascend NPU

* remove redundant comments

* fix comments about Ascend NPU

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
2025-10-15 13:45:28 +00:00
56a727dde5 Add fast path for bidirectional mask creation to fix regression (#41586)
* fixed performance regression

* also fixed the older_torch function

* Update src/transformers/masking_utils.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* fix

* more general

* fix slicing

* fix data dependent

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
2025-10-15 15:30:39 +02:00
dc6fdeb705 Update a dataset reop link (#41618)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-15 14:41:38 +02:00
3953b65440 Reinstate early CUDA init fix (#41617)
* Reinstate early CUDA init fix

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* Delay import further

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

---------

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-15 14:41:10 +02:00
96d245a83d torch 2.9 don't ❤️ torchcodec 💔 (#41610)
pin

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-15 14:34:00 +02:00
bb0c3af995 More markdown file fixes (#41599)
* Format markdown files

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Format markdown files

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Format markdown files

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-15 12:29:27 +00:00
70e871959c Fix trainer simple tests (#41449)
* fix

* fix ray

* train to tune

* breaking changes wrt generation config

* Fix !

* fix

* fix

* fix deepspeed !

* fix

* fix

* fix

* improve logic

* revert and fix

* revert comment

* oups

* revert change

* fix

* style

* typo in comment

---------

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-10-15 14:09:00 +02:00
c4210796e0 Import expand_device_map instead of redefining it (#41608)
remove it
2025-10-15 14:00:09 +02:00
fcd1ccdb78 [Docs] Fix changed references (#41614)
* fix

* fix

* other ln
2025-10-15 13:59:13 +02:00
2b2c20f315 Update issue template (#41573)
* update

* fix
2025-10-15 13:54:37 +02:00
e2122c4bcb remove ray_scope and check_quantized_param (#41587)
remove
2025-10-15 13:10:35 +02:00
e89cef6625 fix some case failures lead by "torch.compile recompiled part of th… (#41558)
* fix some case failures lead by "`torch.compile` recompiled part of the forward pass" in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-10-15 10:45:29 +00:00
26b7f66850 Add logits_to_keep to many older CausalLM models (#41335)
* Add logits_to_keep to CausalLM models

* Skip failing test for git model

* Remove unused return_dict from kosmos2 signature

* Revert BlipForQuestionAnswering
2025-10-15 11:56:01 +02:00
5db730786d [device_map] Accelerate loading by computing device_map much faster (#41548)
* start

* add the important fix

* continue

* big cleanup

* type hints

* add method

* fix typehints

* typehints

* fix

* oupsi

* remove space

* improve function

* CI
2025-10-15 11:18:57 +02:00
13a35a5057 Enable non-streaming mode in transformers serve (#41446)
* Enable non-streaming in transformers serve

Remove typos

Remove typos

Remove typos

* Fix tests

* Arthur review
2025-10-15 09:37:26 +02:00
94df0e6560 Benchmark overhaul (#41408)
* Big refactor, still classes to move around and script to re-complexify

* Move to streamer, isolate benches, propagate num tokens

* Some refacto

* Added compile mode to name

* Re-order

* Move to dt_tokens

* Better format

* Fix and disable use_cache by default

* Fixed compile and SDPA backend default

* Refactor results format

* Added default compile mode

* Always use cache

* Fixed cache and added flex

* Plan for missing modules

* Experiments: no cg and shuffle

* Disable compile for FA

* Remove wall time, add sweep mode, get git commit

* Review compliance, start

* Apply suggestions from code review

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Update benchmark_v2/framework/benchmark_runner.py

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Disable workflow

* Pretty print

* Added some pretty names to have pretty logs

* Review n2 compliance (end?)

* Style and end of PR

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
2025-10-14 21:41:43 +02:00
9e4199ede3 Gemma3 fixes (#41572)
* Multiple device error fix

* FA2 equivalence fix

* Move the train fwd in cfg test

* Style

* Added comment

* Made the comment more clear
2025-10-14 18:33:27 +02:00
4c8d293599 Fix typsetting and content of llm_tutorial_optimization.md (#41172)
* Fix typsetting of llm_tutorial_optimization

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Fix errors

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-14 08:40:26 -07:00
a99b1be3c7 Revert some breaking changes bnb (#41581)
fix
2025-10-14 16:28:16 +02:00
82cae9eb52 Add __iter__ to DynamicCache (#41569)
* Add __iter__ to DynamicCache

* Fix tests that use ddp init
2025-10-14 16:16:32 +02:00
4fad35ee4a [VisionEncoderDecoderModel] Update loss function (#40863)
Update loss function
2025-10-14 16:03:00 +02:00
ae6f6cc3e0 Revert "add rmsnorm kernels support for Intel XPU" (#41579)
Revert "add rmsnorm kernels support for Intel XPU (#41563)"

This reverts commit fd787c5f6d667d3e00def70f588972af4437f631.
2025-10-14 15:49:33 +02:00
fd787c5f6d add rmsnorm kernels support for Intel XPU (#41563)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-10-14 13:26:09 +00:00
4e4f2af586 Add conditional checks to _check_and_adjust_attn_implementation() (#41542) 2025-10-14 13:00:07 +00:00
3648fde486 Add DINOv3Backbone for ConvNext variant (#40651)
---------

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
2025-10-14 14:57:04 +02:00
abf5b57a68 delete some tokenizer tests using pickle (#41514)
* hate pickle

* hate pickle

* hate pickle

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-14 14:50:51 +02:00
8fe4db5399 [kernels] rm mra kernels (#41507)
* fix modeling

* remove kernel

* fix style
2025-10-14 13:34:04 +02:00
c620c38bb0 [Qwen3VLMoe] Fixed: Expected self.dtype to be equal to src.dtype - routing_weights casting (#41420)
* Fixed Expected self.dtype to be equal to src.dtype on eval

* Fixed Expected self.dtype to be equal to src.dtype on eval

* Fixed Expected self.dtype to be equal to src.dtype on eval

* generated modeling_qwen3_vl_moe.py file

* Fixed Ernie_4_5_MoE router casting

* Fixed routing_weights dtype casting (ernie4_5_moe, hunyuan_v1_moe, qwen2_moe, qwen3_moe, qwen3_next,qwen3_omni_moe)

* rollback hunyuan_v1_moe changes

---------

Co-authored-by: Daniel Oliveira <daniel-oliveira-11@hotmail.com>
Co-authored-by: Daniel Oliveira <36623265+daniel3303@users.noreply.github.com>
2025-10-14 13:14:49 +02:00
0798797ec9 Fix an import error with PreTrainModel (#41571) 2025-10-14 13:13:37 +02:00
0566b6f5bd Patch MistralCommonTokenizer (#41439)
* Fix token_to_id and add add_generation_prompt

* Fix spm download

* Refactor spm

* Try another possibly non-gated spm

* Improve get_vocab

* lint

* Improve get_vocab

* Add warn to piece_to_id

* Improve from_pretrained raise and revert model spm

* Revert fast
2025-10-14 11:13:19 +00:00
b3e3c3dc93 [Qwen3VL] fix device mismatch error for FSDP2 training (#41536)
For FSDP2, parameters might be on a meta device, and the weight.device attribute may
not accurately reflect where the actual computation will happen during forward passes.

```log
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 776, in forward
    pos_embeds = self.fast_pos_embed_interpolate(grid_thw)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 745, in fast_pos_embed_interpolate
    pos_embeds = self.pos_embed(idx_tensor) * weight_tensor[:, :, None]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1879, in _call_impl
    return inner()
           ^^^^^^^
  File "torch/nn/modules/module.py", line 1827, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/sparse.py", line 192, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "torch/nn/functional.py", line 2546, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)
```
https://github.com/volcengine/verl/pull/3686#issuecomment-3380981817

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-10-14 10:28:25 +00:00
b84c0b31c6 Remove references to AutoModelForVision2Seq (#41513)
* Since Vision2Seq is deprecated, remove it from pipelines and docstrings

* Catch some more references
2025-10-13 17:00:07 +01:00
1ee3b288a6 [from_pretrained] Small refactor from_pretrained: move around unrelated stuff (#41445)
* drafts

* up

* simplify modeling utils

* more simplifications

* type kwargs

* up

* move more accelerate related stuff

* safeguarding?

* nits

* remove func when func is NOPE

* more

* nits

* styling

* yups

* up

* ups

* revert

* protect trainer utils iport

* fix doc

* Update src/transformers/integrations/peft.py

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>

* review

* update

* ?

* fixx

* update

* super small update

* ups

* style

* this is stupid

* 🤦 well this was the issue

* small nit

* fix

* nit

* damn the missing return

* one last stupid fix

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
2025-10-13 16:33:32 +02:00
cad74496ca [model] Add VideoLLaMA3 implementation (#40499)
* Add VideoLLaMA3 implementation

* Run style fix

* Switch to modular

* Fix config and smart_resize

* Fix

* Fix

* Fix style

* Fix

* Ruff fix

* Rename

* Rename

* Fix

* Clean

* Fix consistency

* Add doc

* Fix

* Fix

* Fix doc

* Update generated code

* remove test_initialization

* fix tests

* simplify

* tests

* Add VideoLlama3IntegrationTest

* replace asserts

* fix tests

---------

Co-authored-by: steven-ccq <55176896+steven-ccq@users.noreply.github.com>
Co-authored-by: steven-ccq <1456320989@qq.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-10-13 15:54:34 +02:00
3813a8e3a1 Add VideoMAE video processor (#41534)
* Add video processor for VideoMAE

* Document VideoMAE video processor

* Add regression tests for VideoMAE video processor

* refactor: Use direct batch key access for pixel_values_videos

* test: add parity test for VideoMAEVideoProcessor vs VideoMAEImageProcessor

* docs(videomae): update model docstring example to demonstrate VideoMAEVideoProcessor (TorchCodec-based decoding and sampling)
2025-10-13 15:42:27 +02:00
66d8d7a077 Fixed typos and formatting (#34215)
#hacktoberfest
2025-10-13 13:38:06 +00:00
d621be8286 🚨 [v5] generate delegates default cache initialization to the model (#41505) 2025-10-13 13:20:48 +01:00
d7c9fbdb64 Enable modular files from other libraries (#41372)
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-10-13 13:48:32 +02:00
41e763decd Add AMD developer cloud support (#41126)
* Add AMD developer cloud support

* Add AMD remote svg link.

* Update notebooks/README.md

Co-authored-by: pagezyhf <165770107+pagezyhf@users.noreply.github.com>

---------

Co-authored-by: Rémi Ouazan <83456801+remi-or@users.noreply.github.com>
Co-authored-by: pagezyhf <165770107+pagezyhf@users.noreply.github.com>
2025-10-13 12:17:24 +02:00
cf1e9834ec Restore cuda graphs to continuous batching (#41421)
* Type hints and small fixes

* Remove unusued params

* Made slice inputs the default

* ruffed

* Updated some var name and moved index slicing

* Logging arg in example

* Added some padding debug var and reformat out cg

* First working CG, fixe size

* Working flexible CG

* CG are compatible with all implementations

* Fixed CG API

* Update example

* Documentation

* Fix padding tokens in FA

* Review compliance

* Better doc around weird bug

* Style

* Fix for sliding with CG
2025-10-13 11:57:56 +02:00
6c901bdc0e [SAM] Fix typing hints (#41506)
fix
2025-10-13 11:52:00 +02:00
58f9e13313 Fixed Type-hints in function defintions (#41525)
* Explicitly annotate default None parameters as Optional

* make style.

* make style.

* Fixed check_copies.

* fix consistency.
2025-10-13 11:48:37 +02:00
eb28242251 Add MLlama fast image processor (#41391)
* Merge conflict

* add fast processor

* add fast processor

* make style

* add new convert rgb

* use nested group by shape in mllama fast, add support for multiple inputs in group by shape

* refactor after review

---------

Co-authored-by: Vincent <phamvinh257@gmail.com>
2025-10-13 09:16:05 +00:00
65cb8fac6d [Qwen3VL] fix: hidden_states in place modification error (#41535)
```
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 941, in forward
    hidden_states = self._deepstack_process(
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 960, in _deepstack_process
    hidden_states[visual_pos_masks, :] = local_this
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Output 0 of SliceBackward0 is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-10-13 10:50:14 +02:00
3927ffed31 [testing] reduce runtime of HunYuanMoEV1IntegrationTest:test_model_generation (#41373)
* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-10 22:27:01 +02:00
7164924a7e Fix Latex typesetting in documentation (#41177)
Fix Latex typsetting in documentation

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-10 08:54:27 -07:00
26a5368c44 Allow optuna's catch kwargs passthrough (#41496)
* allow optuna's catch kwargs passthrough

* apply ruff formatting

---------

Co-authored-by: nicha <nicha.api@nectec.or.th>
2025-10-10 13:58:07 +00:00
feca4f3de7 remove tpu_num_cores (#41383)
* remove-tpu-num-cores

* fix

* let's remove it

* style

* Update examples/legacy/seq2seq/finetune_tpu.sh

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-10-10 15:53:28 +02:00
c6042a4169 Remove outdated flags (#41512)
remove flags
2025-10-10 14:34:47 +02:00
dfd4121cd4 add Trainer import to .md in appropriate cell block for training.ipynb transformers_doc (#41484)
add Trainer import to .md in appropriate cell block for docs
2025-10-10 12:04:07 +00:00
1468 changed files with 33555 additions and 35003 deletions

View File

@ -48,18 +48,17 @@ body:
- continuous batching: @remi-or @ArthurZucker @McPatate
- pipelines: @Rocketknight1
- tokenizers: @ArthurZucker and @itazap
- trainer: @zach-huggingface @SunMarc
- trainer: @SunMarc
- attention: @vasqu @ArthurZucker @CyrilVallez
- model loading (from pretrained, etc): @CyrilVallez
- distributed: @3outeille @ArthurZucker @S1ro1
- distributed: @3outeille @ArthurZucker
- CIs: @ydshieh
Integrations:
- deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
- ray/raytune: @richardliaw, @amogkam
- Big Model Inference: @SunMarc
- quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber
- quantization: @SunMarc @MekkCyber
- kernels: @MekkCyber @drbh
- peft: @BenjaminBossan @githubnemo

View File

@ -51,18 +51,17 @@ Library:
- continuous batching: @remi-or @ArthurZucker @McPatate
- pipelines: @Rocketknight1
- tokenizers: @ArthurZucker and @itazap
- trainer: @zach-huggingface @SunMarc
- trainer: @SunMarc
- attention: @vasqu @ArthurZucker @CyrilVallez
- model loading (from pretrained, etc): @CyrilVallez
- distributed: @3outeille @ArthurZucker @S1ro1
- distributed: @3outeille @ArthurZucker
- CIs: @ydshieh
Integrations:
- deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
- ray/raytune: @richardliaw, @amogkam
- Big Model Inference: @SunMarc
- quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber
- quantization: @SunMarc @MekkCyber
- kernels: @MekkCyber @drbh
- peft: @BenjaminBossan @githubnemo

View File

@ -22,7 +22,6 @@ tests/generation/ @gante
/src/transformers/models/auto/ @ArthurZucker
/src/transformers/utils/ @ArthurZucker @Rocketknight1
/src/transformers/loss/ @ArthurZucker
/src/transformers/onnx/ @michaelbenayoun
# Specific files come after the sections/globs, so they take priority
/.circleci/config.yml @ArthurZucker @ydshieh

View File

@ -12,6 +12,8 @@ concurrency:
env:
HF_HOME: /mnt/cache
DATASET_ID: hf-benchmarks/transformers
MODEL_ID: meta-llama/Llama-3.1-8B-Instruct
jobs:
benchmark:
@ -34,26 +36,12 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.sha || github.sha }}
- name: Install libpq-dev & psql
run: |
apt update
apt install -y libpq-dev postgresql-client
- name: Install benchmark script dependencies
run: python3 -m pip install -r benchmark/requirements.txt
run: python3 -m pip install -r benchmark_v2/requirements.txt kernels
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e ".[torch]"
- name: Run database init script
run: |
psql -f benchmark/utils/init_db.sql
env:
PGDATABASE: metrics
PGHOST: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGHOST }}
PGUSER: transformers_benchmarks
PGPASSWORD: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGPASSWORD }}
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e ".[torch]" && python3 -m pip uninstall -y torchvision # temp fix
- name: Run benchmark
run: |
@ -64,13 +52,11 @@ jobs:
commit_id=$GITHUB_SHA
fi
commit_msg=$(git show -s --format=%s | cut -c1-70)
python3 benchmark/benchmarks_entrypoint.py "huggingface/transformers" "$BRANCH_NAME" "$commit_id" "$commit_msg"
python3 benchmark_v2/run_benchmarks.py -b 32 -s 128 -n 256 --branch-name "$BRANCH_NAME" --commit-id "$commit_id" --commit-message "$commit_msg" --model-id "$MODEL_ID" --log-level INFO --push-result-to-dataset "$DATASET_ID"
env:
HF_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
PUSH_TO_HUB_TOKEN: ${{ secrets.PUSH_TO_HUB_TOKEN }}
# Enable this to see debug logs
# HF_HUB_VERBOSITY: debug
# TRANSFORMERS_VERBOSITY: debug
PGHOST: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGHOST }}
PGUSER: transformers_benchmarks
PGPASSWORD: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGPASSWORD }}
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}

View File

@ -1,35 +1,7 @@
name: Benchmark v2 Framework
on:
workflow_call:
inputs:
runner:
description: 'GH Actions runner group to use'
required: true
type: string
container_image:
description: 'Docker image to use'
required: true
type: string
container_options:
description: 'Container options to use'
required: true
type: string
commit_sha:
description: 'Commit SHA to benchmark'
required: false
type: string
default: ''
run_id:
description: 'Custom run ID for organizing results (auto-generated if not provided)'
required: false
type: string
default: ''
benchmark_repo_id:
description: 'HuggingFace Dataset to upload results to (e.g., "org/benchmark-results")'
required: false
type: string
default: ''
workflow_dispatch:
env:
HF_HOME: /mnt/cache
@ -82,4 +54,4 @@ jobs:
--token '${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}' \
--log-level INFO
env:
HF_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
HF_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}

View File

@ -1,11 +1,7 @@
name: Benchmark v2 Scheduled Runner - A10 Single-GPU
on:
schedule:
# Run daily at 16:30 UTC
- cron: "30 16 * * *"
pull_request:
types: [ opened, labeled, reopened, synchronize ]
workflow_dispatch:
jobs:
benchmark-v2-default:
@ -18,4 +14,4 @@ jobs:
commit_sha: ${{ github.sha }}
run_id: ${{ github.run_id }}
benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks
secrets: inherit
secrets: inherit

View File

@ -1,11 +1,7 @@
name: Benchmark v2 Scheduled Runner - MI325 Single-GPU
on:
schedule:
# Run daily at 16:30 UTC
- cron: "30 16 * * *"
pull_request:
types: [ opened, labeled, reopened, synchronize ]
workflow_dispatch:
jobs:
benchmark-v2-default:
@ -18,4 +14,4 @@ jobs:
commit_sha: ${{ github.sha }}
run_id: ${{ github.run_id }}
benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks
secrets: inherit
secrets: inherit

View File

@ -41,9 +41,14 @@ env:
jobs:
check_new_failures:
name: " "
name: "Find commits for new failing tests"
strategy:
matrix:
run_idx: [1]
runs-on:
group: aws-g5-4xlarge-cache
outputs:
process: ${{ steps.check_file.outputs.process }}
container:
image: ${{ inputs.docker }}
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
@ -54,14 +59,17 @@ jobs:
path: /transformers/ci_results_${{ inputs.job }}
- name: Check file
id: check_file
working-directory: /transformers
run: |
if [ -f ci_results_${{ inputs.job }}/new_failures.json ]; then
echo "`ci_results_${{ inputs.job }}/new_failures.json` exists, continue ..."
echo "process=true" >> $GITHUB_ENV
echo "process=true" >> $GITHUB_OUTPUT
else
echo "`ci_results_${{ inputs.job }}/new_failures.json` doesn't exist, abort."
echo "process=false" >> $GITHUB_ENV
echo "process=false" >> $GITHUB_OUTPUT
fi
- uses: actions/download-artifact@v4
@ -118,6 +126,10 @@ jobs:
run: |
python3 utils/print_env.py
- name: Install pytest-flakefinder
if: ${{ env.process == 'true' }}
run: python3 -m pip install pytest-flakefinder
- name: Show installed libraries and their versions
working-directory: /transformers
if: ${{ env.process == 'true' }}
@ -126,25 +138,63 @@ jobs:
- name: Check failed tests
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: python3 utils/check_bad_commit.py --start_commit ${{ inputs.start_sha }} --end_commit ${{ env.END_SHA }} --file ci_results_${{ inputs.job }}/new_failures.json --output_file new_failures_with_bad_commit.json
run: python3 utils/check_bad_commit.py --start_commit ${{ inputs.start_sha }} --end_commit ${{ env.END_SHA }} --file ci_results_${{ inputs.job }}/new_failures.json --output_file new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
- name: Show results
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: |
ls -l new_failures_with_bad_commit.json
cat new_failures_with_bad_commit.json
ls -l new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
cat new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
- name: Checkout back
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}
path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
process_new_failures_with_commit_info:
name: "process bad commit reports"
needs: check_new_failures
if: needs.check_new_failures.outputs.process == 'true'
runs-on:
group: aws-g5-4xlarge-cache
container:
image: ${{ inputs.docker }}
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- uses: actions/download-artifact@v4
with:
name: ci_results_${{ inputs.job }}
path: /transformers/ci_results_${{ inputs.job }}
- uses: actions/download-artifact@v4
with:
pattern: new_failures_with_bad_commit_${{ inputs.job }}*
path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}
merge-multiple: true
- name: Check files
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: |
git checkout ${{ inputs.start_sha }}
ls -la /transformers
ls -la /transformers/new_failures_with_bad_commit_${{ inputs.job }}
# Currently, we only run with a single runner by using `run_idx: [1]`. We might try to run with multiple runners
# to further reduce the false positive caused by flaky tests, which requires further processing to merge reports.
- name: Merge files
shell: bash
working-directory: /transformers
run: |
cp /transformers/new_failures_with_bad_commit_${{ inputs.job }}/new_failures_with_bad_commit_${{ inputs.job }}_1.json new_failures_with_bad_commit.json
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Process report
shell: bash
working-directory: /transformers
if: ${{ env.process == 'true' }}
env:
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
@ -156,7 +206,6 @@ jobs:
- name: Process report
shell: bash
working-directory: /transformers
if: ${{ env.process == 'true' }}
env:
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
@ -171,13 +220,12 @@ jobs:
- name: Prepare Slack report title
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: |
pip install slack_sdk
echo "title=$(python3 -c 'import sys; sys.path.append("utils"); from utils.notification_service import job_to_test_map; ci_event = "${{ inputs.ci_event }}"; job = "${{ inputs.job }}"; test_name = job_to_test_map[job]; title = f"New failed tests of {ci_event}" + ":" + f" {test_name}"; print(title)')" >> $GITHUB_ENV
- name: Send processed report
if: ${{ env.process == 'true' && !endsWith(env.REPORT_TEXT, '{}') }}
if: ${{ !endsWith(env.REPORT_TEXT, '{}') }}
uses: slackapi/slack-github-action@6c661ce58804a1a20f6dc5fbee7f0381b469e001
with:
# Slack channel id, channel name, or user id to post message.

View File

@ -98,7 +98,7 @@ jobs:
commit_sha: ${{ needs.get-pr-info.outputs.PR_HEAD_SHA }}
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
package: transformers
languages: ar de en es fr hi it ko pt tr zh ja te
languages: ar de en es fr hi it ja ko pt zh
update_run_status:
name: Update Check Run Status

4
.gitignore vendored
View File

@ -98,6 +98,7 @@ celerybeat-schedule
# Environments
.env
.venv
.venv*
env/
venv/
ENV/
@ -171,3 +172,6 @@ tags
# modular conversion
*.modular_backup
# Cursor IDE files
.cursor/

View File

@ -112,7 +112,125 @@ New models are constantly released and if you want to implement a new model, ple
If you are willing to contribute the model yourself, let us know so we can help you add it to 🤗 Transformers!
We have a technical guide for [how to add a model to 🤗 Transformers](https://huggingface.co/docs/transformers/add_new_model).
We have a technical guide for [how to add a model to 🤗 Transformers](https://huggingface.co/docs/transformers/modular_transformers).
### Vision-Language Model Contribution Checklist
If you're contributing a **vision-language model** (or any multimodal model that processes images/videos), please follow this checklist. Maintainers will use this to review your PR, and completing these steps will significantly increase the likelihood of your PR being merged quickly.
**Required checklist for all vision-language model contributions:**
**1. Implement a modular file**
All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:
- Use the CLI, [`transformers add-new-model-like`](https://github.com/huggingface/transformers/blob/main/src/transformers/cli/add_new_model_like.py) to generate a modular skeleton and get started
- All code should be in the modular file if possible. Modeling must be in it, it's better if configuration is in it as well.
- Reuse existing patterns from similar models as much as possible
To verify your modular file is correct, run:
```bash
python utils/modular_model_converter.py <model_name>
```
This will generate the separate files (`modeling_*.py`, `configuration_*.py`, etc.) from your modular file. The CI will enforce that these generated files match your modular file.
**2. Add a fast image processor (for image models)**
If your model processes images, implement a fast image processor that uses `torch` and `torchvision` instead of PIL/numpy for better inference performance:
- See the detailed guide in [#36978](https://github.com/huggingface/transformers/issues/36978)
- Fast processors inherit from `BaseImageProcessorFast`
- Examples: `LlavaOnevisionImageProcessorFast`, `Idefics2ImageProcessorFast`
**3. Create a weight conversion script**
Add a `convert_<model_name>_to_hf.py` script that converts the original model weights to the HuggingFace format:
- Script should handle checkpoint loading, key mapping, and saving in HF format
- Include usage examples and documentation in the script
- Examples: [`convert_llava_onevision_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py), [`convert_idefics2_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics2/convert_idefics2_weights_to_hf.py)
**4. Add integration tests with exact output matching**
At minimum, add an `IntegrationTest` class that tests end-to-end generation (processing and modelling) with **exact** output matching:
- For generative models: test that generated text matches expected output exactly
- For non-generative models: test that output logits match expected values
- Tests should use real checkpoints (load in 4-bit or half precision if the checkpoint is too big to fit in our CI runners) and real inputs
- Example pattern:
```python
class MyModelIntegrationTest(unittest.TestCase):
@slow
def test_model_integration(self):
model = MyModelForConditionalGeneration.from_pretrained("org/model-name")
processor = AutoProcessor.from_pretrained("org/model-name")
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
EXPECTED_TEXT = "exact expected output"
self.assertEqual(processor.decode(output[0]), EXPECTED_TEXT)
```
See `tests/models/llava_onevision/test_modeling_llava_onevision.py` for complete examples.
**5. Update documentation**
Add or update model documentation:
- Create if the cli hasn't `docs/source/en/model_doc/<model_name>.md` with usage examples
- Include model description, paper link, and basic usage with `Pipeline` and `AutoModel`
- Add the model to the appropriate TOC files
**6. Look for reusable patterns**
The library has 400+ models with many established patterns:
- Search for similar models (e.g., other vision-language models)
- Reuse attention mechanisms, layer implementations, and processing patterns
- Check models like LLaVA, Idefics2, Fuyu for vision-language patterns
- Use provided decorators like (`auto_docstring`, `can_return_tuple`, `check_model_inputs` and `_can_record_outputs`) where relevant.
- Don't reinvent the wheel
**7. Run quality checks and read the output**
Before submitting your PR, install quality dependencies and run the full check suite:
```bash
pip install -e ".[quality]"
make fixup
```
**Important**: Take time to read the output of `make fixup`. It will:
- Lint and format your code automatically
- Run consistency checks (imports, docstrings, etc.)
- Show any remaining issues that need manual fixes
All checks must pass before your PR can be merged.
**If this checklist is complete, your PR has a very high likelihood of being merged!** Following these steps makes the maintainers' work much easier and will reduce the number of review iterations, getting your important work out there faster.
#### Copy-pastable checklist for maintainers
Here's a condensed version maintainers can copy into PRs:
```markdown
## Multimodal Model Addition Checklist
Please ensure your PR completes all following items. See the [full checklist](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#vision-language-model-contribution-checklist) for details.
- [ ] **Modular file**: `modular_<model_name>.py` implemented and verified with `python utils/modular_model_converter.py <model_name>`
- [ ] **Fast image processor**: Implemented using `BaseImageProcessorFast` (see [#36978](https://github.com/huggingface/transformers/issues/36978))
- [ ] **Conversion script**: `convert_<model_name>_to_hf.py` added with usage examples
- [ ] **Integration tests**: End-to-end tests with exact output matching (text or logits)
- [ ] **Documentation**: Model docs added/updated in `docs/source/en/model_doc/`
- [ ] **Pattern reuse**: Verified against similar models (LLaVA, Idefics2, etc.)
- [ ] **Quality checks**: `make fixup` passes with no errors
```
## Do you want to add documentation?

View File

@ -64,8 +64,8 @@ limitations under the License.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
</h3>
Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
vision, audio, video, and multimodal model, for both inference and training.
Transformers acts as the model-definition framework for state-of-the-art machine learning with text, computer
vision, audio, video, and multimodal models, for both inference and training.
It centralizes the model definition so that this definition is agreed upon across the ecosystem. `transformers` is the
pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training

View File

@ -16,7 +16,6 @@ import sys
from logging import Logger
from threading import Event, Thread
from time import perf_counter, sleep
from typing import Optional
# Add the parent directory to Python path to import benchmarks_entrypoint
@ -42,7 +41,7 @@ except ImportError:
GenerationConfig = None
StaticCache = None
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "1"
# Only set torch precision if torch is available
@ -145,7 +144,7 @@ def run_benchmark(
q = torch.empty_like(probs_sort).exponential_(1)
return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
def logits_to_probs(logits, temperature: float = 1.0, top_k: Optional[int] = None):
def logits_to_probs(logits, temperature: float = 1.0, top_k: int | None = None):
logits = logits / max(temperature, 1e-5)
if top_k is not None:
@ -155,7 +154,7 @@ def run_benchmark(
probs = torch.nn.functional.softmax(logits, dim=-1)
return probs
def sample(logits, temperature: float = 1.0, top_k: Optional[int] = None):
def sample(logits, temperature: float = 1.0, top_k: int | None = None):
probs = logits_to_probs(logits[0, -1], temperature, top_k)
idx_next = multinomial_sample_one_no_sync(probs)
return idx_next, probs

View File

@ -2,5 +2,5 @@ gpustat==1.1.1
psutil==6.0.0
psycopg2==2.9.9
torch>=2.4.0
hf_transfer
hf_xet
pandas>=1.5.0

View File

@ -1 +1,2 @@
benchmark_results/
benchmark_results/
benchmark_results_profiles/

View File

@ -1 +0,0 @@
# Benchmark implementations directory

View File

@ -1,165 +0,0 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
from typing import Any
import torch
from benchmark_framework import ModelBenchmark
os.environ["TOKENIZERS_PARALLELISM"] = "1"
torch.set_float32_matmul_precision("high")
class LLaMABenchmark(ModelBenchmark):
"""Simplified LLaMA model benchmark implementation using the ModelBenchmark base class."""
def __init__(self, logger: logging.Logger):
super().__init__(logger)
self._default_prompt = "Why dogs are so cute?" # Custom prompt for LLaMA
def get_scenario_configs(self) -> list[dict[str, Any]]:
"""
Get LLaMA-specific scenario configurations.
Returns:
List of scenario configuration dictionaries
"""
return [
# Eager variants
{"variant": "eager", "compile_mode": None, "use_cache": True, "description": "Eager execution with cache"},
# Compiled variants
{
"variant": "compiled",
"compile_mode": "max-autotune",
"use_cache": True,
"description": "Compiled with max autotune",
},
# Kernelized variant (if available)
{
"variant": "kernelized",
"compile_mode": "max-autotune",
"use_cache": True,
"description": "Kernelized execution",
},
]
def _is_kernelization_available(self) -> bool:
"""Check if kernelization is available for LLaMA."""
try:
from kernels import Mode, kernelize # noqa: F401
return True
except ImportError:
self.logger.debug("Kernelization not available: kernels module not found")
return False
def get_default_generation_config(self) -> dict[str, Any]:
"""Get LLaMA-specific generation configuration."""
return {
"do_sample": False,
"top_p": 1.0,
"temperature": 1.0,
"repetition_penalty": 1.0,
"max_new_tokens": None, # Will be set per scenario
}
def get_model_init_kwargs(self, config) -> dict[str, Any]:
"""Get LLaMA-specific model initialization kwargs."""
return {
"torch_dtype": getattr(torch, config.torch_dtype),
"attn_implementation": config.attn_implementation,
"use_cache": True,
}
def get_default_torch_dtype(self) -> str:
"""Get default torch dtype for LLaMA."""
return "float16" # LLaMA works well with float16
def get_default_device(self) -> str:
"""Get default device for LLaMA."""
return "cuda" # LLaMA prefers CUDA
def run_llama(logger, output_dir, **kwargs):
"""
Run LLaMA benchmark with the given configuration.
Args:
logger: Logger instance
output_dir: Output directory for results
**kwargs: Additional configuration options
Returns:
Path to output file if successful
"""
from benchmark_framework import BenchmarkRunner
# Extract parameters with defaults
model_id = kwargs.get("model_id", "meta-llama/Llama-2-7b-hf")
warmup_iterations = kwargs.get("warmup_iterations", 3)
measurement_iterations = kwargs.get("measurement_iterations", 5)
num_tokens_to_generate = kwargs.get("num_tokens_to_generate", 100)
include_sdpa_variants = kwargs.get("include_sdpa_variants", True)
device = kwargs.get("device", "cuda")
torch_dtype = kwargs.get("torch_dtype", "float16")
batch_size = kwargs.get("batch_size", 1)
commit_id = kwargs.get("commit_id")
logger.info(f"Starting LLaMA benchmark for model: {model_id}")
logger.info(
f"Configuration: warmup={warmup_iterations}, measurement={measurement_iterations}, tokens={num_tokens_to_generate}"
)
try:
# Create benchmark instance
benchmark = LLaMABenchmark(logger)
# Create scenarios
scenarios = benchmark.create_scenarios(
model_id=model_id,
warmup_iterations=warmup_iterations,
measurement_iterations=measurement_iterations,
num_tokens_to_generate=num_tokens_to_generate,
include_sdpa_variants=include_sdpa_variants,
device=device,
torch_dtype=torch_dtype,
batch_size=batch_size,
)
logger.info(f"Created {len(scenarios)} benchmark scenarios")
# Create runner and execute benchmarks
runner = BenchmarkRunner(logger, output_dir)
results = runner.run_benchmark(benchmark, scenarios, commit_id=commit_id)
if not results:
logger.warning("No successful benchmark results")
return None
# Save results
model_name = model_id.split("/")[-1] # Extract model name from ID
output_file = runner.save_results(model_name, results)
logger.info(f"LLaMA benchmark completed successfully. Results saved to: {output_file}")
return output_file
except Exception as e:
logger.error(f"LLaMA benchmark failed: {e}")
import traceback
logger.debug(traceback.format_exc())
raise

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,214 @@
import hashlib
import json
import logging
from typing import Any
KERNELIZATION_AVAILABLE = False
try:
from kernels import Mode, kernelize # noqa: F401
KERNELIZATION_AVAILABLE = True
except ImportError:
pass
logger = logging.getLogger(__name__)
class BenchmarkConfig:
"""Configuration for a single benchmark scenario."""
def __init__(
self,
warmup_iterations: int = 5,
measurement_iterations: int = 20,
gpu_monitoring: bool = True, # NOTE: you may want to disable this at times as we have obsvered it could heavily slow down benchmarks on AMD
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
attn_implementation: str = "eager",
sdpa_backend: str | None = None,
compile_mode: str | None = None,
compile_options: dict[str, Any] | None = None,
kernelize: bool = False,
name: str | None = None,
skip_validity_check: bool = False,
) -> None:
# Benchmark parameters
self.warmup_iterations = warmup_iterations
self.measurement_iterations = measurement_iterations
self.gpu_monitoring = gpu_monitoring
# Input parameters
self.batch_size = batch_size
self.sequence_length = sequence_length
self.num_tokens_to_generate = num_tokens_to_generate
# Generation parameters
self.attn_implementation = attn_implementation
self.sdpa_backend = sdpa_backend
# Optimization parameters
self.compile_mode = compile_mode
self.compile_options = compile_options if compile_options is not None else {}
self.kernelize = kernelize
# Constant parameters
self.dtype = "torch.bfloat16"
self.device = "cuda"
self.check_validity(skip_validity_check)
self.name = name if name is not None else self.infer_name()
def check_validity(self, skip_validity_check: bool = False) -> None:
if skip_validity_check:
return
# Flash attention does not support compile mode, so we turn it off # FIXME: it would be better to support it
is_fa = self.attn_implementation == "flash_attention_2"
is_fa |= self.attn_implementation == "sdpa" and self.sdpa_backend == "flash_attention"
if is_fa:
logger.warning("Flash attention does not support compile mode. Turning off compile mode.")
self.compile_mode = None
@property
def hash(self) -> str:
return hashlib.sha256(json.dumps(self.to_dict()).encode()).hexdigest()
def infer_name(self, compact: bool = True) -> str:
"""Infer a human-readable name for the benchmark config, either compact or verbose."""
if compact:
iter_str = f"w{self.warmup_iterations}_i{self.measurement_iterations}"
gpu_monitor_str = "monitored" if self.gpu_monitoring else "unmonitored"
dimensions_str = f"b{self.batch_size}_s{self.sequence_length}_n{self.num_tokens_to_generate}"
attn_code = self.attn_implementation
attn_code += f"_{self.sdpa_backend}" if self.attn_implementation == "sdpa" else ""
compile_str = f"compiled_{self.compile_mode}" if self.compile_mode is not None else "uncompiled"
kernelize_str = "kernelized" if self.kernelize else "unkernelized"
sep = "-"
else:
iter_str = f"{self.warmup_iterations} warmup, {self.measurement_iterations} iterations"
gpu_monitor_str = ("with" if self.gpu_monitoring else "no") + " GPU monitoring"
dimensions_str = f"batch size {self.batch_size}, sequence length {self.sequence_length}, {self.num_tokens_to_generate} generated tokens"
attn_code = f"{self.attn_implementation} attention"
attn_code += f" with {self.sdpa_backend} backend" if self.attn_implementation == "sdpa" else ""
compile_str = "compiled" if self.compile_mode is not None else "not compiled"
kernelize_str = "kernelized" if self.kernelize else "not kernelized"
sep = ", "
return sep.join([iter_str, gpu_monitor_str, dimensions_str, attn_code, compile_str, kernelize_str])
def to_dict(self) -> dict[str, Any]:
return {
"name": self.name,
"warmup_iterations": self.warmup_iterations,
"measurement_iterations": self.measurement_iterations,
"gpu_monitoring": self.gpu_monitoring,
"batch_size": self.batch_size,
"sequence_length": self.sequence_length,
"num_tokens_to_generate": self.num_tokens_to_generate,
"attn_implementation": self.attn_implementation,
"sdpa_backend": self.sdpa_backend,
"compile_mode": self.compile_mode,
"compile_options": self.compile_options | {}, # to avoid inplace modification of the original dict
"kernelize": self.kernelize,
}
@classmethod
def from_dict(cls, data: dict[str, Any], skip_validity_check: bool = False) -> "BenchmarkConfig":
return cls(
warmup_iterations=data.get("warmup_iterations", 5),
measurement_iterations=data.get("measurement_iterations", 20),
gpu_monitoring=data.get("gpu_monitoring", False),
batch_size=data.get("batch_size", 1),
sequence_length=data.get("sequence_length", 128),
num_tokens_to_generate=data.get("num_tokens_to_generate", 128),
attn_implementation=data.get("attn_implementation", "eager"),
sdpa_backend=data.get("sdpa_backend"),
compile_mode=data.get("compile_mode"),
compile_options=data.get("compile_options"),
kernelize=data.get("kernelize", False),
name=data.get("name"),
skip_validity_check=skip_validity_check,
)
def cross_generate_configs(
attn_impl_and_sdpa_backend: list[tuple[str, str | None]],
compiled_mode: list[str | None],
kernelized: list[bool],
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = True,
) -> list[BenchmarkConfig]:
# Create kwargs common to all configs
kwargs = {
"warmup_iterations": warmup_iterations,
"measurement_iterations": measurement_iterations,
"batch_size": batch_size,
"sequence_length": sequence_length,
"num_tokens_to_generate": num_tokens_to_generate,
"gpu_monitoring": gpu_monitoring,
}
# Cross-generate all combinations of attn_implementation, compiled_mode, and kernelized
configs = []
for attn_implementation, sdpa_backend in list(dict.fromkeys(attn_impl_and_sdpa_backend)):
for cm in list(dict.fromkeys(compiled_mode)):
for kernelize_on in list(dict.fromkeys(kernelized)):
config = BenchmarkConfig(
attn_implementation=attn_implementation,
sdpa_backend=sdpa_backend,
compile_mode=cm,
kernelize=kernelize_on,
**kwargs,
)
configs.append(config)
return configs
def generate_all_configs(
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = True,
) -> list[BenchmarkConfig]:
all_attn_implementations = [
("flash_attention_2", None),
("eager", None),
("sdpa", "math"),
("sdpa", "flash_attention"),
("flex_attention", None),
]
return cross_generate_configs(
attn_impl_and_sdpa_backend=all_attn_implementations,
compiled_mode=[None, "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs"],
kernelized=[False, KERNELIZATION_AVAILABLE],
warmup_iterations=warmup_iterations,
measurement_iterations=measurement_iterations,
batch_size=batch_size,
sequence_length=sequence_length,
num_tokens_to_generate=num_tokens_to_generate,
gpu_monitoring=gpu_monitoring,
)
def generate_main_configs(
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
) -> list[BenchmarkConfig]:
# Create kwargs common to all configs
kwargs = {
"warmup_iterations": warmup_iterations,
"measurement_iterations": measurement_iterations,
"batch_size": batch_size,
"sequence_length": sequence_length,
"num_tokens_to_generate": num_tokens_to_generate,
}
return [ # TODO: test max-autotune instead of default
BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default", gpu_monitoring=False, **kwargs),
BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default", gpu_monitoring=True, **kwargs),
BenchmarkConfig(attn_implementation="eager", compile_mode="default", gpu_monitoring=True, **kwargs),
BenchmarkConfig(attn_implementation="flash_attention_2", gpu_monitoring=True, **kwargs),
]

View File

@ -0,0 +1,456 @@
import gc
import json
import logging
import os
import pathlib
import re
import tempfile
import time
from contextlib import nullcontext
from datetime import datetime
from queue import Queue
from typing import Any
import torch
from datasets import Dataset
from huggingface_hub import HfApi
from tqdm import trange
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
CompileConfig,
GenerationConfig,
GenerationMixin,
)
from transformers.generation.streamers import BaseStreamer
from .benchmark_config import BenchmarkConfig
from .data_classes import BenchmarkMetadata, BenchmarkResult, GPURawMetrics, pretty_print_dict
from .hardware_metrics import GPUMonitor
try:
from kernels import Mode, kernelize # noqa: F401
except ImportError:
kernelize = None
Mode = None
DEFAULT_PROMPT = "\n".join([
"The French Revolution was a period of political and societal change in France that began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799.",
"Many of the revolution's ideas are considered fundamental principles of liberal democracy, and its values remain central to modern French political discourse.",
"It was caused by a combination of social, political, and economic factors which the existing regime proved unable to manage.",
"Financial crisis and widespread social distress led to the convocation of the Estates General in May 1789, its first meeting since 1614.",
"The representatives of the Third Estate broke away and re-constituted themselves as a National Assembly in June.",
"The Storming of the Bastille in Paris on 14 July led to a series of radical measures by the Assembly, including the abolition of feudalism, state control over the Catholic Church in France, and issuing the Declaration of the Rights of Man and of the Citizen.",
"The next three years were dominated by a struggle for political control.",
"King Louis XVI's attempted flight to Varennes in June 1791 further discredited the monarchy, and military defeats after the outbreak of the French Revolutionary Wars in April 1792 led to the insurrection of 10 August 1792.",
"As a result, the monarchy was replaced by the French First Republic in September, followed by the execution of Louis XVI himself in January 1793.",
"After another revolt in June 1793, the constitution was suspended, and political power passed from the National Convention to the Committee of Public Safety, dominated by radical Jacobins led by Maximilien Robespierre.",
"About 16,000 people were sentenced by the Revolutionary Tribunal and executed in the Reign of Terror, which ended in July 1794 with the Thermidorian Reaction.",
"Weakened by external threats and internal opposition, the Committee of Public Safety was replaced in November 1795 by the Directory.",
"Its instability ended in the coup of 18 Brumaire and the establishment of the Consulate, with Napoleon Bonaparte as First Consul.",
]) # fmt: skip
PUSH_TO_HUB_TOKEN = os.getenv("PUSH_TO_HUB_TOKEN", None)
def compact_json_numeric_arrays(data: dict):
# Match arrays that contain only numbers (ints/floats), whitespace, commas, and newlines
pattern = r"\[\s*\n\s*((?:\d+(?:\.\d+)?\s*,\s*)*\d+(?:\.\d+)?)\s*\n\s*\]"
def replace_numeric_array(match):
# Get the array content
content = match.group(1)
# Remove extra whitespace but keep commas
compact_content = re.sub(r"\s+", " ", content).strip()
return f"[{compact_content}]"
return re.sub(pattern, replace_numeric_array, json.dumps(data, indent=4, default=str), flags=re.DOTALL)
def get_git_revision() -> str:
base_path = pathlib.Path(__file__).parent.parent.parent
git_dir = base_path / ".git"
with (git_dir / "HEAD").open("r") as head:
ref = head.readline().split(" ")[-1].strip()
with (git_dir / ref).open("r") as git_hash:
return git_hash.readline().strip()
def get_sdpa_backend(backend_name: str | None) -> torch.nn.attention.SDPBackend | None:
"""Get the SDPA backend enum from string name."""
if backend_name is None:
return None
try:
backend_map = {
"math": torch.nn.attention.SDPBackend.MATH,
"flash_attention": torch.nn.attention.SDPBackend.FLASH_ATTENTION,
"efficient_attention": torch.nn.attention.SDPBackend.EFFICIENT_ATTENTION,
"cudnn_attention": torch.nn.attention.SDPBackend.CUDNN_ATTENTION,
}
return backend_map.get(backend_name.lower())
except AttributeError:
# torch.nn.attention.SDPBackend not available in older torch versions
return None
def flush_memory():
"""Flush GPU memory and run garbage collection."""
gc.collect()
# Dynamo resets
torch._dynamo.reset()
torch._dynamo.reset_code_caches()
if hasattr(torch._inductor, "codecache"):
# Clear FX graph cache
if hasattr(torch._inductor.codecache, "FxGraphCache"):
torch._inductor.codecache.FxGraphCache.clear()
# Clear PyCodeCache
if hasattr(torch._inductor.codecache, "PyCodeCache"):
torch._inductor.codecache.PyCodeCache.cache_clear()
# Clear TritonFuture cache (for async compilation)
if hasattr(torch._inductor.codecache, "TritonFuture"):
if hasattr(torch._inductor.codecache.TritonFuture, "_compile_cache"):
torch._inductor.codecache.TritonFuture._compile_cache.clear()
# Clear CUDA cache
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
gc.collect()
class BenchmarkStreamer(BaseStreamer):
def __init__(self, **kwargs) -> None:
self.timeout = kwargs.pop("timeout", 10)
self.timestamps = []
self.text_queue = Queue()
self.stop_signal = None
def put(self, value):
"""Receives tokens and logs the timestamp of the generation."""
self.timestamps.append(time.perf_counter())
self.text_queue.put(value)
def end(self):
self.timestamps.append(time.perf_counter())
self.text_queue.put(self.stop_signal)
def __iter__(self):
return self
def __next__(self):
value = self.text_queue.get(timeout=self.timeout)
if value == self.stop_signal:
raise StopIteration()
else:
return value
class BenchmarkRunner:
"""Main benchmark runner that coordinates benchmark execution."""
def __init__(
self,
logger: logging.Logger,
output_dir: str | None = None,
branch_name: str | None = None,
commit_id: str | None = None,
commit_message: str | None = None,
) -> None:
# Those stay constant for the whole run
self.logger = logger
if output_dir is None:
output_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), "benchmark_results")
self.output_dir = output_dir
self.branch_name = branch_name
self.commit_id = get_git_revision() if commit_id is None else commit_id
self.commit_message = commit_message
os.makedirs(self.output_dir, exist_ok=True)
self.profile_dir = None
# Attributes that are reset for each model
self._setup_for = ""
# Attributes that are reset for each run
self.model: GenerationMixin | None = None
def cleanup(self) -> None:
del self.model
self.model = None
flush_memory()
def setup_benchmark(self, model_id: str, config: BenchmarkConfig) -> None:
# Some attributes only need to be set once per model
if self._setup_for != model_id:
self.tokenizer = AutoTokenizer.from_pretrained(model_id)
# We set the EOS token to the padding token for open-ended generation
self.tokenizer.eos_token = self.tokenizer.pad_token
self._setup_for = model_id
# Prepare inputs
self.inputs = self.tokenizer(
[DEFAULT_PROMPT for _ in range(config.batch_size)],
return_tensors="pt",
max_length=config.sequence_length,
truncation=True,
return_attention_mask=True,
).to(config.device)
self.inputs["use_cache"] = True
# Prepare generation config
gen_config = GenerationConfig(
do_sample=False, top_p=1.0, temperature=1.0, max_new_tokens=config.num_tokens_to_generate
)
# Prepare compile config
if config.compile_mode is not None:
gen_config.compile_config = CompileConfig(mode=config.compile_mode, options=config.compile_options)
gen_config.cache_implementation = "static"
# Load model
self.logger.debug(f"Loading model {model_id} on device {config.device}...")
dtype = getattr(torch, config.dtype.removeprefix("torch."))
self.model = AutoModelForCausalLM.from_pretrained(
model_id, dtype=dtype, attn_implementation=config.attn_implementation, generation_config=gen_config
)
self.model = self.model.eval().to(config.device)
# Kernelize the model if needed
if config.kernelize and kernelize is not None and Mode is not None:
self.model = kernelize(self.model, mode=Mode.INFERENCE)
def run_benchmark(
self, model_id: str, config: BenchmarkConfig, num_tokens_to_profile: int = 0
) -> dict[str, Any] | None:
"""Run a single benchmark with the given model ID and config."""
sdpa_ctx = nullcontext()
if config.attn_implementation == "sdpa":
sdpa_backend = get_sdpa_backend(config.sdpa_backend)
sdpa_ctx = torch.nn.attention.sdpa_kernel(sdpa_backend)
with sdpa_ctx, torch.no_grad():
self.logger.info(f"Running benchmark scenario: {config.name}")
# Quick validation: try one measurement first to see if this scenario works
flush_memory()
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = self.time_generate(
max_new_tokens=1, gpu_monitor=None
)
if e2e_latency < 0:
self.logger.warning(f"Skipping config {config.name}: {e2e_latency = } (no GPU monitoring)")
return None
# Warmup runs
self.logger.info(f"Warming up with {config.warmup_iterations} iterations...")
for _ in trange(config.warmup_iterations):
_ = self.time_generate(max_new_tokens=config.num_tokens_to_generate)
self.logger.info("Warmup over.")
# Measurement runs
result = BenchmarkResult()
self.logger.info(f"Benchmarking with {config.measurement_iterations} iterations.")
for _ in trange(config.measurement_iterations):
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = self.time_generate(
max_new_tokens=config.num_tokens_to_generate,
gpu_monitor=(GPUMonitor(logger=self.logger) if config.gpu_monitoring else None),
)
result.accumulate(e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics)
self.logger.info("Benchmarking done. Cleaning up.")
# Profile if needed
if num_tokens_to_profile > 0:
self.profile_generate(num_tokens_to_profile, config.name)
return {
"metadata": BenchmarkMetadata(
model_id=model_id,
branch_name=self.branch_name,
commit_id=self.commit_id,
commit_message=self.commit_message,
),
"measurements": result,
"config": config,
}
def time_generate(
self,
max_new_tokens: int,
gpu_monitor: GPUMonitor | None = None,
) -> tuple[float, list[float], str, GPURawMetrics | None]:
"""Time the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
# Prepare gpu monitoring if needed
if gpu_monitor is not None:
gpu_monitor.start()
# Prepare streamer
streamer = BenchmarkStreamer()
# Generate and time
wall_time_0 = time.perf_counter()
outputs = self.model.generate(
**self.inputs,
max_new_tokens=max_new_tokens,
streamer=streamer,
)
wall_time_1 = time.perf_counter()
# Stop gpu monitoring if needed
gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor is not None else None
# Check if generation had the right number of tokens
input_tokens = self.inputs["input_ids"].size(-1)
batch_size, output_tokens = outputs.shape
new_tokens = output_tokens - input_tokens
if new_tokens != max_new_tokens:
raise RuntimeError(f"Generated {new_tokens} tokens, expected {max_new_tokens}")
# Decode outputs
decoded_output = self.tokenizer.decode(outputs[0, input_tokens:], skip_special_tokens=True)
shape_and_decoded_output = f"{tuple(outputs.shape)} | {decoded_output}"
# Compute intermediate quantities
e2e_latency = wall_time_1 - wall_time_0
token_generation_times = [t - wall_time_0 for t in streamer.timestamps[1:]]
return e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics
def profile_generate(self, num_tokens_to_profile: int, config_name: str) -> None:
"""Profile the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
profiler = torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
record_shapes=True,
)
with profiler as prof:
_ = self.model.generate(
**self.inputs,
max_new_tokens=num_tokens_to_profile,
)
if self.profile_dir is None:
self.profile_dir = self.output_dir + "_profiles"
os.makedirs(self.profile_dir, exist_ok=True)
prof.export_chrome_trace(f"{self.profile_dir}/{config_name}.json")
def run_benchmarks(
self,
model_id: str,
benchmark_configs: list[BenchmarkConfig],
num_tokens_to_profile: int = 0,
pretty_print_summary: bool = True,
) -> tuple[str, dict[str, Any]]:
"""Run multiple benchmarks for the given model ID and list of benchmark configs."""
all_results = {}
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
start_time = time.perf_counter()
n_configs = len(benchmark_configs)
for i, config in enumerate(benchmark_configs):
# Handle SDPA backend if not determined by the config (needs to be done before skipping duplicates)
if config.attn_implementation == "sdpa" and config.sdpa_backend is None:
default_backend = "flash_attention" # FIXME: torch has a _cur_sdpa_kernel_backends but it fails
self.logger.warning(f"No SDPA backend provided, using {default_backend} instead.")
config.sdpa_backend = default_backend
# Skip if already run
if config.hash in all_results:
self.logger.info(f"Skipping duplicate config {config.name} for model {model_id} ({i + 1}/{n_configs})")
continue
# Otherwise, run the benchmark
self.setup_benchmark(model_id, config)
self.logger.info(
f"Running benchmark of model {model_id} with scenario: {config.name} ({i + 1}/{n_configs})"
)
# Launch benchmark in a try/except block to avoid stopping the whole run if one benchmark fails
try:
results = self.run_benchmark(model_id, config, num_tokens_to_profile)
if results is not None:
all_results[config.hash] = results
except Exception as e:
self.logger.error(f"Error running with scenario: {config.name}:\n{repr(e)}")
# Cleanup model and save results
self.cleanup()
self.save_results(model_id, all_results, timestamp=timestamp)
if pretty_print_summary:
print()
print("=" * 100)
print(f"Finished benchmarks in {time.perf_counter() - start_time:.2f} seconds")
print(f"Total number of benchmarks: {len(all_results)}")
if len(all_results) > 0:
print("First run metadata:")
first_key = list(all_results.keys())[0]
first_metadata = all_results[first_key]["metadata"].to_dict()
hardware_info = first_metadata.pop("hardware_info")
pretty_print_dict(first_metadata | hardware_info, tabs=1)
for result in all_results.values():
print("=" * 100)
print(f"Config: {result['config'].infer_name(compact=False)}\n")
result["measurements"].pprint(batch_size=result["config"].batch_size, tabs=1)
print("=" * 100)
return (timestamp, all_results)
def save_results(self, model_name: str, results: dict, timestamp: str = "") -> str:
"""Save benchmark results to JSON file."""
# Create model-specific subdirectory
model_name = model_name.replace("/", "_")
model_dir = os.path.join(self.output_dir, model_name)
os.makedirs(model_dir, exist_ok=True)
# Create filename with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") if not timestamp else timestamp
filename = f"{model_name}_benchmark_{timestamp}.json"
filepath = os.path.join(model_dir, filename)
# Convert results to dict
converted_results = {}
for cfg_hash in results.keys():
converted_results[cfg_hash] = {
"metadata": results[cfg_hash]["metadata"].to_dict(),
"measurements": results[cfg_hash]["measurements"].to_dict(),
"config": results[cfg_hash]["config"].to_dict(),
}
# Save to JSON file
with open(filepath, "w") as f:
f.write(compact_json_numeric_arrays(converted_results))
self.logger.info(f"Results saved to {filepath}")
return filepath
def push_results_to_hub(self, dataset_id: str, results: dict[Any, Any], timestamp: str) -> None:
if PUSH_TO_HUB_TOKEN is None:
raise ValueError(
"PUSH_TO_HUB_TOKEN is not set, cannot push results to the Hub. When setting dataset_id, please also set the PUSH_TO_HUB_TOKEN environment variable."
)
n_results = len(results)
self.logger.info(f"Pushing {n_results} results to: {dataset_id}")
rows = []
for cfg_hash, entry in results.items():
row = {
"benchmark_config_hash": cfg_hash,
"config": entry["config"].to_dict(),
"measurements": entry["measurements"].to_dict(),
"metadata": entry["metadata"].to_dict(),
}
rows.append(row)
ds = Dataset.from_list(rows)
with tempfile.TemporaryDirectory() as tmp:
jsonl_path = os.path.join(tmp, "data.jsonl")
with open(jsonl_path, "w") as f:
json_lines = []
for ex in ds:
json_lines.append(json.dumps(ex, ensure_ascii=False))
f.write("\n".join(json_lines))
api = HfApi()
# NOTE: we expect the repository to already exist
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") if not timestamp else timestamp
file_name = f"benchmark_run_{timestamp}.jsonl"
api.upload_file(
path_or_fileobj=jsonl_path,
path_in_repo=file_name,
repo_id=dataset_id,
repo_type="dataset",
token=PUSH_TO_HUB_TOKEN,
)
self.logger.info(f"Succesfully uploaded results to: {dataset_id}")

View File

@ -0,0 +1,167 @@
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any
import numpy as np
from .hardware_metrics import GPURawMetrics, HardwareInfo
def compute_basic_statistics(measurements: list[float]) -> dict[str, float]:
return {
"avg": np.mean(measurements),
"std": np.std(measurements),
"min": np.min(measurements),
"med": np.median(measurements),
"max": np.max(measurements),
"p95": np.percentile(measurements, 95),
}
def add_unit_to_duration(stats: dict[str, float]) -> dict[str, str]:
for key in list(stats.keys()):
value = stats[key]
if value > 3600:
stats[key] = f"{(value / 3600):.2f}hr"
elif value > 60:
stats[key] = f"{(value / 60):.2f}min"
elif value > 1:
stats[key] = f"{value:.2f}s"
elif value > 1e-3:
stats[key] = f"{(value * 1e3):.2f}ms"
elif value > 1e-6:
stats[key] = f"{(value * 1e6):.2f}us"
else:
stats[key] = f"{(value * 1e9):.2f}ns"
return stats
def equalize_lengths_and_collate(stats: list[dict[str, str]]) -> list[str]:
keys = ["avg", "std", "min", "med", "max", "p95"]
for key in keys:
max_length = max(len(stat[key]) for stat in stats)
for stat in stats:
stat[key] = stat[key].ljust(max_length, " ")
return [" ".join([f"{key}={stat[key]}" for key in keys]) for stat in stats]
def pretty_print_dict(data: dict[str, Any], tabs: int = 0) -> None:
max_key_length = max([len(key) for key in data.keys()])
for key, value in data.items():
tabs_str = " " * tabs
padded_key = key.ljust(max_key_length + 1, ".")
print(f"{tabs_str}{padded_key}: {value}")
@dataclass
class BenchmarkMetadata:
"""Metadata collected for each benchmark run."""
model_id: str
timestamp: str
branch_name: str
commit_id: str
commit_message: str
hardware_info: HardwareInfo
def __init__(self, model_id: str, commit_id: str, branch_name: str = "main", commit_message: str = "") -> None:
self.model_id = model_id
self.timestamp = datetime.now(timezone.utc).isoformat()
self.branch_name = branch_name
self.commit_id = commit_id
self.commit_message = commit_message
self.hardware_info = HardwareInfo()
def to_dict(self) -> dict[str, Any]:
return {
"model_id": self.model_id,
"timestamp": self.timestamp,
"branch_name": self.branch_name,
"commit_id": self.commit_id,
"commit_message": self.commit_message,
"hardware_info": self.hardware_info.to_dict(),
}
class BenchmarkResult:
"""Result from a series of benchmark runs."""
def __init__(self) -> None:
self.e2e_latency = []
self.token_generation_times = [] # time at which each token was generated (relative to start of the generation)
self.shape_and_decoded_outputs = []
self.gpu_metrics = []
def accumulate(
self,
e2e_latency: float,
token_generation_times: list[float],
shape_and_decoded_output: str,
gpu_metrics: GPURawMetrics | None,
) -> None:
self.e2e_latency.append(e2e_latency)
self.token_generation_times.append(token_generation_times)
self.shape_and_decoded_outputs.append(shape_and_decoded_output)
self.gpu_metrics.append(gpu_metrics)
def to_dict(self) -> dict[str, None | int | float]:
# Save GPU metrics as None if it contains only None values
if all(gm is None for gm in self.gpu_metrics):
gpu_metrics = None
else:
gpu_metrics = [gm.to_dict() for gm in self.gpu_metrics]
return {
"e2e_latency": self.e2e_latency,
"token_generation_times": self.token_generation_times,
"shape_and_decoded_outputs": self.shape_and_decoded_outputs,
"gpu_metrics": gpu_metrics,
}
@classmethod
def from_dict(cls, data: dict[str, None | int | float]) -> "BenchmarkResult":
# Handle GPU metrics, which is saved as None if it contains only None values
if data["gpu_metrics"] is None:
gpu_metrics = [None for _ in range(len(data["e2e_latency"]))]
else:
gpu_metrics = [GPURawMetrics.from_dict(gm) for gm in data["gpu_metrics"]]
# Create a new instance and accumulate the data
new_instance = cls()
for i in range(len(data["e2e_latency"])):
new_instance.accumulate(
e2e_latency=data["e2e_latency"][i],
token_generation_times=data["token_generation_times"][i],
shape_and_decoded_output=data["shape_and_decoded_outputs"][i],
gpu_metrics=gpu_metrics[i],
)
return new_instance
def get_measured_ttft(self) -> list[float]:
return [dt[0] for dt in self.token_generation_times if len(dt) > 0]
def get_measured_itl(self) -> list[float]:
return [(dt[-1] - dt[0]) / (len(dt) - 1) for dt in self.token_generation_times if len(dt) > 1]
def get_throughput(self, batch_size: int) -> float:
return [
batch_size * len(dt) / e2e_latency
for e2e_latency, dt in zip(self.e2e_latency, self.token_generation_times)
]
def pprint(self, batch_size: int = 0, tabs: int = 0) -> None:
stats_to_collate = [
add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
add_unit_to_duration(compute_basic_statistics(self.get_measured_itl())),
]
if batch_size > 0:
throughput_stats = compute_basic_statistics(self.get_throughput(batch_size))
stats_to_collate.append({key: f"{value:.2f}tok/s" for key, value in throughput_stats.items()})
collated_stats = equalize_lengths_and_collate(stats_to_collate)
dict_to_pprint = {
"E2E Latency": collated_stats[0],
"Time to First Token": collated_stats[1],
"Inter-Token Latency": collated_stats[2],
}
if batch_size > 0:
dict_to_pprint["Throughput"] = collated_stats[3]
pretty_print_dict(dict_to_pprint, tabs=tabs)

View File

@ -0,0 +1,171 @@
import json
import logging
import subprocess
import sys
import threading
import time
from dataclasses import dataclass
from enum import Enum
from logging import Logger
import gpustat
import psutil
import torch
# Data class to hold the hardware information
def get_device_name_and_memory_total() -> tuple[str, float]:
"""Returns the name and memory total of GPU 0."""
device_name = torch.cuda.get_device_properties(0).name
device_memory_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
return device_name, device_memory_total
class HardwareInfo:
"""A class to hold information about the hardware."""
def __init__(self) -> None:
# Retrieve GPU stats
try:
self.gpu_name, self.gpu_memory_total_gb = get_device_name_and_memory_total()
except Exception:
self.gpu_name, self.gpu_memory_total_gb = None, None
# Retrieve python, torch and CUDA version
self.python_version = f"{sys.version.split()[0]}"
self.torch_version = torch.__version__
if hasattr(torch, "cuda") and torch.cuda.is_available():
self.cuda_version = torch.version.cuda
else:
self.cuda_version = None
# Retrieve general hardware information
self.cpu_count = psutil.cpu_count()
self.memory_total_mb = int(psutil.virtual_memory().total / (1024 * 1024))
def to_dict(self) -> dict[str, None | int | float | str]:
return {
"gpu_name": self.gpu_name,
"gpu_memory_total_gb": self.gpu_memory_total_gb,
"python_version": self.python_version,
"torch_version": self.torch_version,
}
# Functions to get information about the GPU
def get_amd_gpu_stats() -> tuple[int, float]:
"""Returns the utilization and memory used of an AMD GPU, both in percent"""
rocm_smi_output = subprocess.check_output(["rocm-smi", "--json", "--showuse", "--showmeminfo", "VRAM"])
gpu_stats = json.loads(rocm_smi_output.decode("utf-8"))
gpu_stats = [
(card_id, stats["GPU use (%)"], stats["VRAM Total Used Memory (B)"]) for card_id, stats in gpu_stats.items()
]
gpu_stats.sort(key=lambda x: x[1], reverse=True)
return int(gpu_stats[0][1]), float(gpu_stats[0][2]) / 1024**3
def get_nvidia_gpu_stats() -> tuple[int, float]:
"""Returns the utilization and memory used of an NVIDIA GPU, both in percent"""
gpu_stats = gpustat.GPUStatCollection.new_query()
gpu_stats = gpu_stats[0]
return int(gpu_stats["utilization.gpu"]), float(gpu_stats["memory.used"]) / 1024**3
class GPUStatsCollector:
"""A class to get statistics about the GPU. It serves as a wrapper that holds the GPU total memory and its name,
which is used to call the right function to get the utilization and memory used."""
def __init__(self) -> None:
self.device_name, self.device_memory_total = get_device_name_and_memory_total()
# Monkey patch the get_utilization_and_memory_used method based on the GPU type
if "amd" in self.device_name.lower():
self.get_utilization_and_memory_used = get_amd_gpu_stats
elif "nvidia" in self.device_name.lower():
self.get_utilization_and_memory_used = get_nvidia_gpu_stats
else:
raise RuntimeError(f"Unsupported GPU: {self.device_name}")
def get_measurements(self) -> tuple[int, float]:
"""Get the utilization and memory used of the GPU, both in percent"""
raise NotImplementedError("This method is meant to be monkey patched during __init__")
# Simple data classes to hold the raw GPU metrics
class GPUMonitoringStatus(Enum):
"""Status of GPU monitoring."""
SUCCESS = "success"
FAILED = "failed"
NO_GPUS_AVAILABLE = "no_gpus_available"
NO_SAMPLES_COLLECTED = "no_samples_collected"
@dataclass
class GPURawMetrics:
"""Raw values for GPU utilization and memory used."""
utilization: list[float] # in percent
memory_used: list[float] # in GB
timestamps: list[float] # in seconds
timestamp_0: float # in seconds
monitoring_status: GPUMonitoringStatus
def to_dict(self) -> dict[str, None | int | float | str]:
return {
"utilization": self.utilization,
"memory_used": self.memory_used,
"timestamps": self.timestamps,
"timestamp_0": self.timestamp_0,
"monitoring_status": self.monitoring_status.value,
}
# Main class, used to monitor the GPU utilization during benchmark execution
class GPUMonitor:
"""Monitor GPU utilization during benchmark execution."""
def __init__(self, sample_interval_sec: float = 0.1, logger: Logger | None = None):
self.sample_interval_sec = sample_interval_sec
self.logger = logger if logger is not None else logging.getLogger(__name__)
self.num_available_gpus = torch.cuda.device_count()
if self.num_available_gpus == 0:
raise RuntimeError("No GPUs detected by torch.cuda.device_count().")
self.gpu_stats_getter = GPUStatsCollector()
def start(self):
"""Start monitoring GPU metrics."""
# Clear the stop event to enable monitoring
self.stop_event = threading.Event()
self.gpu_utilization = []
self.gpu_memory_used = []
self.timestamps = []
self.thread = threading.Thread(target=self._monitor_loop)
self.thread.start()
self.logger.debug("GPU monitoring started")
def stop_and_collect(self) -> GPURawMetrics:
"""Stop monitoring and return collected metrics."""
self.stop_event.set()
self.thread.join()
if self.gpu_utilization:
timestamp_0 = self.timestamps[0]
metrics = GPURawMetrics(
utilization=self.gpu_utilization,
memory_used=self.gpu_memory_used,
timestamps=[t - timestamp_0 for t in self.timestamps],
timestamp_0=timestamp_0,
monitoring_status=GPUMonitoringStatus.SUCCESS,
)
self.logger.debug(f"GPU monitoring completed: {len(self.gpu_utilization)} samples collected")
else:
metrics = GPURawMetrics(monitoring_status=GPUMonitoringStatus.NO_SAMPLES_COLLECTED)
return metrics
def _monitor_loop(self):
"""Background monitoring loop using threading.Event for communication."""
while not self.stop_event.is_set():
utilization, memory_used = self.gpu_stats_getter.get_utilization_and_memory_used()
self.gpu_utilization.append(utilization)
self.gpu_memory_used.append(memory_used)
self.timestamps.append(time.time())
if self.stop_event.wait(timeout=self.sample_interval_sec):
break

View File

@ -4,4 +4,4 @@ gpustat>=1.0.0
torch>=2.0.0
transformers>=4.30.0
datasets>=2.10.0
huggingface_hub>=0.16.0
huggingface_hub>=0.16.0

View File

@ -19,477 +19,124 @@ in the ./benches directory, organizing outputs into model-specific subfolders.
"""
import argparse
import importlib.util
import json
import logging
import os
import sys
import uuid
from datetime import datetime
from pathlib import Path
from typing import Any, Optional
from framework.benchmark_config import BenchmarkConfig, generate_all_configs, generate_main_configs
from framework.benchmark_runner import BenchmarkRunner
def setup_logging(log_level: str = "INFO", enable_file_logging: bool = False) -> logging.Logger:
"""Setup logging configuration."""
numeric_level = getattr(logging, log_level.upper(), None)
if not isinstance(numeric_level, int):
raise ValueError(f"Invalid log level: {log_level}")
if __name__ == "__main__":
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("--output-dir", type=str, default=None, help="Output dir for benchmark results")
parser.add_argument("--log-level", type=str, choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="INFO")
parser.add_argument("--model-id", type=str, help="Specific model ID to benchmark (if supported by benchmarks)")
parser.add_argument("--warmup", "-w", type=int, default=3, help="Number of warmup iterations")
parser.add_argument("--iterations", "-i", type=int, default=10, help="Number of measurement iterations")
parser.add_argument("--batch-size", "-b", type=int, nargs="+", help="Batch size")
parser.add_argument("--sequence-length", "-s", type=int, nargs="+", help="Sequence length")
parser.add_argument("--num-tokens-to-generate", "-n", type=int, nargs="+", help="Number of tokens to generate")
parser.add_argument("--cross-generate", action="store_true", help="Cross-generate all combinations of configs")
parser.add_argument("--num-tokens-to-profile", "-p", type=int, default=0, help="Number of tokens to profile")
parser.add_argument("--branch-name", type=str, help="Git branch name")
parser.add_argument("--commit-id", type=str, help="Git commit ID (if not provided, will auto-detect from git)")
parser.add_argument("--commit-message", type=str, help="Git commit message")
parser.add_argument(
"--no-gpu-monitoring", action="store_true", help="Disables GPU monitoring during benchmark runs"
)
parser.add_argument(
"--push-result-to-dataset",
type=str,
default=None,
help="Name of the dataset to push results to. If not provided, results are not pushed to the Hub.",
)
args = parser.parse_args()
# Setup logging
benchmark_run_uuid = str(uuid.uuid4())[:8]
numeric_level = getattr(logging, args.log_level.upper())
handlers = [logging.StreamHandler(sys.stdout)]
if enable_file_logging:
handlers.append(logging.FileHandler(f"benchmark_run_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"))
logging.basicConfig(
level=numeric_level, format="[%(levelname)s - %(asctime)s] %(name)s: %(message)s", handlers=handlers
)
return logging.getLogger(__name__)
def discover_benchmarks(benches_dir: str) -> list[dict[str, Any]]:
"""
Discover all benchmark modules in the benches directory.
Returns:
List of dictionaries containing benchmark module info
"""
benchmarks = []
benches_path = Path(benches_dir)
if not benches_path.exists():
raise FileNotFoundError(f"Benches directory not found: {benches_dir}")
for py_file in benches_path.glob("*.py"):
if py_file.name.startswith("__"):
continue
module_name = py_file.stem
try:
# Import the module
spec = importlib.util.spec_from_file_location(module_name, py_file)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
# Check if it has a benchmark runner function
if hasattr(module, f"run_{module_name}"):
benchmarks.append(
{
"name": module_name,
"path": str(py_file),
"module": module,
"runner_function": getattr(module, f"run_{module_name}"),
}
)
elif hasattr(module, "run_benchmark"):
benchmarks.append(
{
"name": module_name,
"path": str(py_file),
"module": module,
"runner_function": getattr(module, "run_benchmark"),
}
)
else:
logging.warning(f"No runner function found in {py_file}")
except Exception as e:
logging.error(f"Failed to import {py_file}: {e}")
return benchmarks
def run_single_benchmark(
benchmark_info: dict[str, Any], output_dir: str, logger: logging.Logger, **kwargs
) -> Optional[str]:
"""
Run a single benchmark and return the output file path.
Args:
benchmark_info: Dictionary containing benchmark module info
output_dir: Base output directory
logger: Logger instance
**kwargs: Additional arguments to pass to the benchmark
Returns:
Path to the output file if successful, None otherwise
"""
benchmark_name = benchmark_info["name"]
runner_func = benchmark_info["runner_function"]
logger.info(f"Running benchmark: {benchmark_name}")
try:
# Check function signature to determine what arguments to pass
import inspect
sig = inspect.signature(runner_func)
# Prepare arguments based on function signature
func_kwargs = {"logger": logger, "output_dir": output_dir}
# Add other kwargs if the function accepts them
for param_name in sig.parameters:
if param_name in kwargs:
func_kwargs[param_name] = kwargs[param_name]
# Filter kwargs to only include parameters the function accepts
# If function has **kwargs, include all provided kwargs
has_var_kwargs = any(param.kind == param.VAR_KEYWORD for param in sig.parameters.values())
if has_var_kwargs:
valid_kwargs = {**func_kwargs, **kwargs}
else:
valid_kwargs = {k: v for k, v in func_kwargs.items() if k in sig.parameters}
# Run the benchmark
result = runner_func(**valid_kwargs)
if isinstance(result, str):
# Function returned a file path
return result
else:
logger.info(f"Benchmark {benchmark_name} completed successfully")
return "completed"
except Exception as e:
logger.error(f"Benchmark {benchmark_name} failed: {e}")
import traceback
logger.debug(traceback.format_exc())
return None
def generate_summary_report(
output_dir: str,
benchmark_results: dict[str, Any],
logger: logging.Logger,
benchmark_run_uuid: Optional[str] = None,
) -> str:
"""Generate a summary report of all benchmark runs."""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
summary_file = os.path.join(output_dir, f"benchmark_summary_{timestamp}.json")
summary_data = {
"run_metadata": {
"timestamp": datetime.utcnow().isoformat(),
"benchmark_run_uuid": benchmark_run_uuid,
"total_benchmarks": len(benchmark_results),
"successful_benchmarks": len([r for r in benchmark_results.values() if r is not None]),
"failed_benchmarks": len([r for r in benchmark_results.values() if r is None]),
},
"benchmark_results": benchmark_results,
"output_directory": output_dir,
}
with open(summary_file, "w") as f:
json.dump(summary_data, f, indent=2, default=str)
logger.info(f"Summary report saved to: {summary_file}")
return summary_file
def upload_results_to_hf_dataset(
output_dir: str,
summary_file: str,
dataset_name: str,
run_id: Optional[str] = None,
token: Optional[str] = None,
logger: Optional[logging.Logger] = None,
) -> Optional[str]:
"""
Upload benchmark results to a HuggingFace Dataset.
Based on upload_collated_report() from utils/collated_reports.py
Args:
output_dir: Local output directory containing results
summary_file: Path to the summary file
dataset_name: Name of the HuggingFace dataset to upload to
run_id: Unique run identifier (if None, will generate one)
token: HuggingFace token for authentication (if None, will use environment variables)
logger: Logger instance
Returns:
The run_id used for the upload, None if upload failed
"""
if logger is None:
logger = logging.getLogger(__name__)
import os
from huggingface_hub import HfApi
api = HfApi()
if run_id is None:
github_run_number = os.getenv("GITHUB_RUN_NUMBER")
github_run_id = os.getenv("GITHUB_RUN_ID")
if github_run_number and github_run_id:
run_id = f"{github_run_number}-{github_run_id}"
date_folder = datetime.now().strftime("%Y-%m-%d")
github_event_name = os.getenv("GITHUB_EVENT_NAME")
if github_event_name != "schedule":
# Non-scheduled runs go under a runs subfolder
repo_path = f"{date_folder}/runs/{run_id}/benchmark_results"
else:
# Scheduled runs go directly under the date
repo_path = f"{date_folder}/{run_id}/benchmark_results"
logger.info(f"Uploading benchmark results to dataset '{dataset_name}' at path '{repo_path}'")
try:
# Upload all files in the output directory
from pathlib import Path
output_path = Path(output_dir)
for file_path in output_path.rglob("*"):
if file_path.is_file():
# Calculate relative path from output_dir
relative_path = file_path.relative_to(output_path)
path_in_repo = f"{repo_path}/{relative_path}"
logger.debug(f"Uploading {file_path} to {path_in_repo}")
api.upload_file(
path_or_fileobj=str(file_path),
path_in_repo=path_in_repo,
repo_id=dataset_name,
repo_type="dataset",
token=token,
commit_message=f"Upload benchmark results for run {run_id}",
)
logger.info(
f"Successfully uploaded results to: https://huggingface.co/datasets/{dataset_name}/tree/main/{repo_path}"
)
return run_id
except Exception as upload_error:
logger.error(f"Failed to upload results: {upload_error}")
import traceback
logger.debug(traceback.format_exc())
return None
def main():
"""Main entry point for the benchmarking script."""
# Generate a unique UUID for this benchmark run
benchmark_run_uuid = str(uuid.uuid4())[:8]
parser = argparse.ArgumentParser(
description="Run all benchmarks in the ./benches directory",
epilog="""
Examples:
# Run all available benchmarks
python3 run_benchmarks.py
# Run with specific model and upload to HuggingFace Dataset
python3 run_benchmarks.py --model-id meta-llama/Llama-2-7b-hf --upload-to-hf username/benchmark-results
# Run with custom run ID and upload to HuggingFace Dataset
python3 run_benchmarks.py --run-id experiment_v1 --upload-to-hf org/benchmarks
# Run only specific benchmarks with file logging
python3 run_benchmarks.py --include llama --enable-file-logging
""", # noqa: W293
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument(
"--output-dir",
type=str,
default="benchmark_results",
help="Base output directory for benchmark results (default: benchmark_results)",
)
parser.add_argument(
"--benches-dir",
type=str,
default="./benches",
help="Directory containing benchmark implementations (default: ./benches)",
)
parser.add_argument(
"--log-level",
type=str,
choices=["DEBUG", "INFO", "WARNING", "ERROR"],
default="INFO",
help="Logging level (default: INFO)",
)
parser.add_argument("--model-id", type=str, help="Specific model ID to benchmark (if supported by benchmarks)")
parser.add_argument("--warmup-iterations", type=int, default=3, help="Number of warmup iterations (default: 3)")
parser.add_argument(
"--measurement-iterations", type=int, default=5, help="Number of measurement iterations (default: 5)"
)
parser.add_argument(
"--num-tokens-to-generate",
type=int,
default=100,
help="Number of tokens to generate in benchmarks (default: 100)",
)
parser.add_argument("--include", type=str, nargs="*", help="Only run benchmarks matching these names")
parser.add_argument("--exclude", type=str, nargs="*", help="Exclude benchmarks matching these names")
parser.add_argument("--enable-file-logging", action="store_true", help="Enable file logging (disabled by default)")
parser.add_argument(
"--commit-id", type=str, help="Git commit ID for metadata (if not provided, will auto-detect from git)"
)
parser.add_argument(
"--push-to-hub",
type=str,
help="Upload results to HuggingFace Dataset (provide dataset name, e.g., 'username/benchmark-results')",
)
parser.add_argument(
"--run-id", type=str, help="Custom run ID for organizing results (if not provided, will generate a unique ID)"
)
parser.add_argument(
"--token",
type=str,
help="HuggingFace token for dataset uploads (if not provided, will use HF_TOKEN environment variable)",
)
args = parser.parse_args()
# Setup logging
logger = setup_logging(args.log_level, args.enable_file_logging)
logger = logging.getLogger("benchmark_v2")
logger.info("Starting benchmark discovery and execution")
logger.info(f"Benchmark run UUID: {benchmark_run_uuid}")
logger.info(f"Output directory: {args.output_dir}")
logger.info(f"Benches directory: {args.benches_dir}")
# Create output directory
os.makedirs(args.output_dir, exist_ok=True)
# Error out if one of the arguments is not provided
if len(args.batch_size) * len(args.sequence_length) * len(args.num_tokens_to_generate) == 0:
raise ValueError(
"At least one of the arguments --batch-size, --sequence-length, or --num-tokens-to-generate is required"
)
try:
# Discover benchmarks
benchmarks = discover_benchmarks(args.benches_dir)
logger.info(f"Discovered {len(benchmarks)} benchmark(s): {[b['name'] for b in benchmarks]}")
if not benchmarks:
logger.warning("No benchmarks found!")
return 1
# Filter benchmarks based on include/exclude
filtered_benchmarks = benchmarks
if args.include:
filtered_benchmarks = [
b for b in filtered_benchmarks if any(pattern in b["name"] for pattern in args.include)
]
logger.info(f"Filtered to include: {[b['name'] for b in filtered_benchmarks]}")
if args.exclude:
filtered_benchmarks = [
b for b in filtered_benchmarks if not any(pattern in b["name"] for pattern in args.exclude)
]
logger.info(f"After exclusion: {[b['name'] for b in filtered_benchmarks]}")
if not filtered_benchmarks:
logger.warning("No benchmarks remaining after filtering!")
return 1
# Prepare common kwargs for benchmarks
benchmark_kwargs = {
"warmup_iterations": args.warmup_iterations,
"measurement_iterations": args.measurement_iterations,
"num_tokens_to_generate": args.num_tokens_to_generate,
}
if args.model_id:
benchmark_kwargs["model_id"] = args.model_id
# Add commit_id if provided
if args.commit_id:
benchmark_kwargs["commit_id"] = args.commit_id
# Run benchmarks
benchmark_results = {}
successful_count = 0
for benchmark_info in filtered_benchmarks:
result = run_single_benchmark(benchmark_info, args.output_dir, logger, **benchmark_kwargs)
benchmark_results[benchmark_info["name"]] = result
if result is not None:
successful_count += 1
# Generate summary report
summary_file = generate_summary_report(args.output_dir, benchmark_results, logger, benchmark_run_uuid)
# Upload results to HuggingFace Dataset if requested
upload_run_id = None
if args.push_to_hub:
logger.info("=" * 60)
logger.info("UPLOADING TO HUGGINGFACE DATASET")
logger.info("=" * 60)
# Use provided run_id or fallback to benchmark run UUID
effective_run_id = args.run_id or benchmark_run_uuid
upload_run_id = upload_results_to_hf_dataset(
output_dir=args.output_dir,
summary_file=summary_file,
dataset_name=args.push_to_hub,
run_id=effective_run_id,
token=args.token,
logger=logger,
# If there is only one (batch_size, sequence_length, num_tokens_to_generate), we benchmark across configs
elif len(args.batch_size) * len(args.sequence_length) * len(args.num_tokens_to_generate) == 1:
if args.cross_generate:
benchmark_configs = generate_all_configs(
warmup_iterations=args.warmup,
measurement_iterations=args.iterations,
batch_size=args.batch_size[0],
sequence_length=args.sequence_length[0],
num_tokens_to_generate=args.num_tokens_to_generate[0],
gpu_monitoring=not args.no_gpu_monitoring,
)
if upload_run_id:
logger.info(f"Upload completed with run ID: {upload_run_id}")
else:
logger.warning("Upload failed - continuing with local results")
# Final summary
total_benchmarks = len(filtered_benchmarks)
failed_count = total_benchmarks - successful_count
logger.info("=" * 60)
logger.info("BENCHMARK RUN SUMMARY")
logger.info("=" * 60)
logger.info(f"Total benchmarks: {total_benchmarks}")
logger.info(f"Successful: {successful_count}")
logger.info(f"Failed: {failed_count}")
logger.info(f"Output directory: {args.output_dir}")
logger.info(f"Summary report: {summary_file}")
if args.push_to_hub:
if upload_run_id:
logger.info(f"HuggingFace Dataset: {args.push_to_hub}")
logger.info(f"Run ID: {upload_run_id}")
logger.info(
f"View results: https://huggingface.co/datasets/{args.push_to_hub}/tree/main/{datetime.now().strftime('%Y-%m-%d')}/runs/{upload_run_id}"
)
else:
logger.warning("Upload to HuggingFace Dataset failed")
if failed_count > 0:
logger.warning(f"{failed_count} benchmark(s) failed. Check logs for details.")
return 1
else:
logger.info("All benchmarks completed successfully!")
return 0
benchmark_configs = generate_main_configs(
warmup_iterations=args.warmup,
measurement_iterations=args.iterations,
batch_size=args.batch_size[0],
sequence_length=args.sequence_length[0],
num_tokens_to_generate=args.num_tokens_to_generate[0],
)
except Exception as e:
logger.error(f"Benchmark run failed: {e}")
import traceback
# Otherwise, we benchmark across all combinations of dimensions
else:
main_config = generate_main_configs(
warmup_iterations=args.warmup,
measurement_iterations=args.iterations,
batch_size=args.batch_size[0],
sequence_length=args.sequence_length[0],
num_tokens_to_generate=args.num_tokens_to_generate[0],
)[0]
benchmark_configs = []
for num_tokens_to_generate in args.num_tokens_to_generate:
for sequence_length in args.sequence_length:
for batch_size in args.batch_size:
cfg_dict = main_config.to_dict()
cfg_dict["batch_size"] = batch_size
cfg_dict["sequence_length"] = sequence_length
cfg_dict["num_tokens_to_generate"] = num_tokens_to_generate
cfg_dict.pop("name")
benchmark_configs.append(BenchmarkConfig.from_dict(cfg_dict))
logger.debug(traceback.format_exc())
return 1
runner = BenchmarkRunner(
logger,
args.output_dir,
args.branch_name,
args.commit_id,
args.commit_message,
)
timestamp, results = runner.run_benchmarks(
args.model_id,
benchmark_configs,
args.num_tokens_to_profile,
pretty_print_summary=True,
)
if __name__ == "__main__":
sys.exit(main())
dataset_id = args.push_result_to_dataset
if dataset_id is not None and len(results) > 0:
runner.push_results_to_hub(
dataset_id,
results,
timestamp,
)

View File

@ -58,7 +58,6 @@ NOT_DEVICE_TESTS = {
"test_model_get_set_embeddings",
"test_model_main_input_name",
"test_correct_missing_keys",
"test_tie_model_weights",
"test_can_use_safetensors",
"test_load_save_without_tied_weights",
"test_tied_weights_keys",
@ -88,6 +87,8 @@ def pytest_configure(config):
config.addinivalue_line("markers", "not_device_test: mark the tests always running on cpu")
config.addinivalue_line("markers", "torch_compile_test: mark test which tests torch compile functionality")
config.addinivalue_line("markers", "torch_export_test: mark test which tests torch export functionality")
config.addinivalue_line("markers", "flash_attn_test: mark test which tests flash attention functionality")
config.addinivalue_line("markers", "flash_attn_3_test: mark test which tests flash attention 3 functionality")
os.environ["DISABLE_SAFETENSORS_CONVERSION"] = "true"

View File

@ -24,7 +24,8 @@ RUN git clone https://github.com/huggingface/transformers && cd transformers &&
# 1. Put several commands in a single `RUN` to avoid image/layer exporting issue. Could be revised in the future.
# 2. Regarding `torch` part, We might need to specify proper versions for `torchvision` and `torchaudio`.
# Currently, let's not bother to specify their versions explicitly (so installed with their latest release versions).
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] && [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile && echo torch=$VERSION && [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
# 3. For `torchcodec<0.8`: this is quickly added as torch 2.9.0 + torchcodec 0.8.0 fails on our CI env. Need to remove later once they work.
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] && [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile && echo torch=$VERSION && [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio "torchcodec<0.8" --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
RUN python3 -m pip install --no-cache-dir -U timm

View File

@ -1,4 +1,4 @@
FROM rocm/pytorch:rocm6.4.1_ubuntu24.04_py3.12_pytorch_release_2.7.1
FROM rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.7.1
LABEL maintainer="Hugging Face"
ARG DEBIAN_FRONTEND=noninteractive
@ -10,8 +10,8 @@ RUN apt update && \
RUN git lfs install
RUN python3 -m pip install --no-cache-dir --upgrade pip numpy
RUN python3 -m pip install --no-cache-dir --upgrade importlib-metadata setuptools ninja git+https://github.com/facebookresearch/detectron2.git pytesseract "itsdangerous<2.1.0"
RUN python3 -m pip install --no-cache-dir --upgrade pip numpy importlib-metadata setuptools wheel ninja pytesseract "itsdangerous<2.1.0"
RUN python3 -m pip install --no-cache-dir --no-build-isolation git+https://github.com/facebookresearch/detectron2.git
ARG REF=main
WORKDIR /
@ -39,6 +39,7 @@ RUN python3 -m pip install --no-cache-dir "torchcodec==0.5"
# Install flash attention from source. Tested with commit 6387433156558135a998d5568a9d74c1778666d8
RUN git clone https://github.com/ROCm/flash-attention/ -b tridao && \
cd flash-attention && \
GPU_ARCHS="gfx942" python setup.py install
GPU_ARCHS="gfx942;gfx950" python setup.py install
# GPU_ARCHS builds for MI300, MI325 and MI355
RUN python3 -m pip install --no-cache-dir einops

View File

@ -3,11 +3,10 @@ LABEL maintainer="Hugging Face"
SHELL ["/bin/bash", "-c"]
ARG PYTHON_VER=3.11
ARG PYTHON_VER=3.12
ENV TORCH_DEVICE_BACKEND_AUTOLOAD=0
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get remove -y python3.10 && apt-get autoremove -y
RUN apt-get update && \
apt-get install -y software-properties-common && \
add-apt-repository -y ppa:deadsnakes/ppa && \
@ -23,7 +22,6 @@ RUN apt-get update && \
apt-utils \
build-essential \
ca-certificates \
clinfo \
curl \
git \
git-lfs \
@ -35,7 +33,6 @@ RUN apt-get update && \
rsync \
sudo \
libnl-genl-3-200 \
xpu-smi \
unzip \
ffmpeg \
tesseract-ocr \
@ -45,34 +42,47 @@ RUN apt-get update && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN apt-get update && \
apt-get install -y \
linux-headers-$(uname -r) \
linux-modules-extra-$(uname -r) \
linux-headers-$(uname -r) linux-modules-extra-$(uname -r) \
flex bison \
intel-fw-gpu intel-i915-dkms xpu-smi \
intel-fw-gpu intel-i915-dkms xpu-smi intel-ocloc clinfo\
intel-opencl-icd libze-intel-gpu1 libze1 \
intel-media-va-driver-non-free libmfx-gen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libegl-mesa0 libegl1 libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo intel-ocloc \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo \
libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev libze-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN pip install triton==3.3.0
# Use virtual env because Ubuntu-24 does not allowed pip on original python
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"
ENV VIRTUAL_ENV="/opt/venv"
ENV UV_PYTHON_INSTALL_DIR=/opt/uv/python
RUN uv venv --python ${PYTHON_VER} --seed ${VIRTUAL_ENV}
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/xpu --no-cache-dir
RUN pip install --upgrade pip wheel
RUN pip install triton==3.4.0
RUN pip install evaluate torchdata pyctcdecode pytesseract decord galore-torch fire scipy scikit-learn sentencepiece sacremoses nltk rouge_score librosa soundfile g2p_en mpi4py requests_mock
RUN pip install pretty_midi essentia resampy Levenshtein av sacrebleu phonemizer invisible_watermark schedulefree
RUN pip install gguf hqq compressed_tensors gptqmodel mergekit autoawq deepspeed torchao onnx
RUN pip install hf_transfer huggingface-hub hf-doc-builder datasets optimum-quanto timm transformers accelerate optimum peft
RUN pip install torch==2.8.0+xpu torchvision==0.23.0+xpu torchaudio==2.8.0+xpu --index-url https://download.pytorch.org/whl/xpu --no-cache-dir
RUN pip install torchcodec torchdata --no-cache-dir
RUN pip install evaluate pyctcdecode pytesseract decord galore-torch fire scipy scikit-learn sentencepiece sacremoses nltk rouge_score librosa soundfile g2p_en mpi4py requests_mock
RUN pip install pretty_midi essentia resampy Levenshtein av sacrebleu phonemizer invisible_watermark schedulefree setuptools
RUN pip install gptqmodel --no-build-isolation
RUN pip install gguf hqq compressed_tensors autoawq deepspeed torchao onnx auto_round
RUN pip install hf_transfer huggingface-hub hf-doc-builder datasets optimum-quanto timm transformers accelerate optimum peft diffusers trl kernels
# install liger-kernel
RUN pip install git+https://github.com/linkedin/Liger-Kernel.git --extra-index-url https://download.pytorch.org/whl/test/xpu
# install mergekit
RUN pip install --break-system-packages git+https://github.com/arcee-ai/mergekit.git@v0.1.3
# install bitsandbytes
RUN pip install git+https://github.com/bitsandbytes-foundation/bitsandbytes.git

View File

@ -24,7 +24,7 @@ pip install -e ".[dev]"
```
> [!NOTE]
> This command might fail for some OS that are missing dependencies. Check step 4 in [Create a Pull Request](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request) to workaround it.
> This command might fail for some OS that are missing dependencies. Check step 4 in [Create a Pull Request](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request) to work around it.
Then you need to install our special tool that builds the documentation:
@ -38,7 +38,7 @@ pip install git+https://github.com/huggingface/doc-builder
## Building the documentation
Once you have setup the `doc-builder` and additional packages, you can generate the documentation by
Once you have set up the `doc-builder` and additional packages, you can generate the documentation by
typing the following command:
```bash
@ -295,12 +295,11 @@ Here's an example of a tuple return, comprising several objects:
Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos, and other non-text files. We prefer to leverage a hf.co hosted `dataset` like
the ones hosted on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing) in which to place these files and reference
them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images).
If an external contribution, feel free to add the images to your PR and ask a Hugging Face member to migrate your images
to this dataset.
If an external contribution, feel free to add the images to your PR and ask a Hugging Face member to migrate them to this dataset.
## Styling the docstring
We have an automatic script running with the `make style` comment that will make sure that:
We have an automatic script running with the `make style` command that will make sure that:
- the docstrings fully take advantage of the line width
- all code examples are formatted using black, like the code of the Transformers library

View File

@ -123,8 +123,6 @@
title: تشغيل التدريب على Amazon SageMaker
- local: serialization
title: التصدير إلى ONNX
- local: torchscript
title: التصدير إلى TorchScript
- local: notebooks
title: دفاتر الملاحظات مع الأمثلة
- local: community
@ -260,8 +258,6 @@
# title: النماذج
# - local: main_classes/text_generation
# title: توليد النصوص
# - local: main_classes/onnx
# title: ONNX
# - local: main_classes/optimizer_schedules
# title: التحسين
# - local: main_classes/output

View File

@ -32,7 +32,7 @@
لتصدير نموذج 🤗 Transformers إلى ONNX، قم أولاً بتثبيت اعتماد إضافي:
```bash
pip install optimum[exporters]
pip install optimum-onnx
```
للاطلاع على جميع المعامﻻت المتاحة، يرجى الرجوع إلى [وثائق 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli)، أو عرض المساعدة في سطر الأوامر:
@ -111,60 +111,3 @@ optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_s
### تصدير نموذج لهندسة غير مدعومة
إذا كنت ترغب في المساهمة من خلال إضافة دعم لنموذج لا يُمكن تصديره حاليًا، فيجب عليك أولاً التحقق مما إذا كان مدعومًا في [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview)، وإذا لم يكن مدعومًا، [فيمكنك المساهمة في 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) مُباشرةً.
### تصدير نموذج باستخدام `transformers.onnx`
<Tip warning={true}>
لم يعد يتم دعم `transformers.onnx` يُرجى تصدير النماذج باستخدام 🤗 Optimum كما هو موضح أعلاه. سيتم إزالة هذا القسم في الإصدارات القادمة.
</Tip>
لتصدير نموذج 🤗 Transformers إلى ONNX باستخدام `transformers.onnx`، ثبّت التبعيات الإضافية:
```bash
pip install transformers[onnx]
```
استخدم حزمة `transformers.onnx` كنموذج Python لتصدير نقطة حفظ باستخدام تكوين جاهز:
```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```
يُصدّر هذا رسمًا بيانيًا ONNX لنقطة الحفظ المُحددة بواسطة وسيطة `--model`. مرر أي نقطة حفظ على 🤗 Hub أو نقطة حفظ مُخزنة محليًا.
يُمكن بعد ذلك تشغيل ملف `model.onnx` الناتج على أحد المُسرعات العديدة التي تدعم معيار ONNX. على سبيل المثال، قم بتحميل وتشغيل النموذج باستخدام ONNX Runtime كما يلي:
```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # يتوقع ONNX Runtime مصفوفات NumPy كمدخلات
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```
يُمكن الحصول على أسماء المخرجات المطلوبة (مثل `["last_hidden_state"]`) من خلال إلقاء نظرة على تكوين ONNX لكل نموذج. على سبيل المثال، بالنسبة لـ DistilBERT، لدينا:
```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig
>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```
العمليات مُتطابقة لنقاط الحفظ TensorFlow على Hub. على سبيل المثال، صدّر نقطة حفظ TensorFlow خالصة كما يلي:
```bash
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
```
لتصدير نموذج مُخزن محليًا، احفظ أوزان النموذج ومجزىء اللغوى في نفس الدليل (على سبيل المثال `local-pt-checkpoint`)، ثم قم بتصديره إلى ONNX عن طريق توجيه وسيط `--model` لحزمة `transformers.onnx` إلى الدليل المطلوب:
```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```

View File

@ -1,154 +0,0 @@
# التصدير إلى TorchScript
<Tip>
هذه هي بداية تجاربنا مع TorchScript ولا زلنا نستكشف قدراته مع نماذج المدخلات المتغيرة الحجم. إنه مجال اهتمامنا وسنعمق تحليلنا في الإصدارات القادمة، مع المزيد من الأمثلة البرمجية، وتنفيذ أكثر مرونة، ومقاييس مقارنة بين الأكواد القائمة على Python مع أكواد TorchScript المُجمّعة.
</Tip>
وفقًا لـ [وثائق TorchScript](https://pytorch.org/docs/stable/jit.html):
> TorchScript هي طريقة لإنشاء نماذج قابلة للتسلسل والتحسين من تعليمات PyTorch البرمجية.
هناك وحدتان من PyTorch، [JIT and TRACE](https://pytorch.org/docs/stable/jit.html)، تتيحان للمطورين تصدير نماذجهم لإعادة استخدامها في برامج أخرى مثل برامج C++ المُحسّنة للأداء.
نقدم واجهة تتيح لك تصدير نماذج 🤗 Transformers إلى TorchScript بحيث يمكن إعادة استخدامها في بيئة مختلفة عن برامج Python القائمة إلى PyTorch. هنا نشرح كيفية تصدير نماذجنا واستخدامها باستخدام TorchScript.
يتطلب تصدير نموذج أمرين:
- تهيئة مثيل للنموذج باستخدام علامة `torchscript`
- تمرير مُدخلات وهمية (dummy inputs) خلال النموذج
تنطوي هذه الضرورات على عدة أمور يجب على المطورين توخي الحذر بشأنها كما هو مفصل أدناه.
## علامة TorchScript والأوزان المرتبطة
علامة `torchscript` ضرورية لأن معظم نماذج اللغة 🤗 Transformers لها أوزان مرتبطة بين طبقة `Embedding` وطبقة `Decoding`. لا يسمح لك TorchScript بتصدير النماذج ذات الأوزان المرتبطة، لذلك من الضروري فصل الأوزان ونسخها مسبقًا.
النماذج المُهيأة باستخدام علامة `torchscript` لها طبقة `Embedding` وطبقة`Decoding` منفصلتين، مما يعني أنه لا ينبغي تدريبها لاحقًا. سيؤدي التدريب إلى عدم تزامن الطبقتين، مما يؤدي إلى نتائج غير متوقعة.
هذا لا ينطبق على النماذج التي لا تحتوي على رأس نموذج اللغة، حيث لا تملك أوزانًا مرتبطة. يمكن تصدير هذه النماذج بأمان دون علامة `torchscript`.
## المدخلات الوهمية والأطوال القياسية
تُستخدم المُدخلات الوهمية لتمرير أمامي خلال النموذج. أثناء انتشار قيم المُدخلات عبر الطبقات، يتتبع PyTorch العمليات المختلفة التي يتم تنفيذها على كل مصفوفة(tensor). ثم يتم استخدام هذه العمليات المُسجلة بعد ذلك لإنشاء *أثر* النموذج.
يتم إنشاء التتبع بالنسبة لأبعاد المُدخلات. وبالتالي، فهو مُقيّد بأبعاد المُدخلات الوهمية، ولن يعمل لأي طول تسلسل أو حجم دفعة مختلف. عند المحاولة بحجم مختلف، يتم رفع الخطأ التالي:
```
`The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2`
```
نوصي بتتبع النموذج باستخدام حجم مُدخلات وهمية لا يقل عن أكبر مُدخل سيتم تقديمه للنموذج أثناء الاستدلال. يمكن أن تساعد الحشوة(padding) في ملء القيم المفقودة. ومع ذلك، نظرًا لتتبع النموذج بحجم مُدخل أكبر، ستكون أبعاد المصفوفة ستكون كبيرة أيضًا، مما يؤدي عنه المزيد من الحسابات.
انتبه إلى إجمالي عدد العمليات المُنفذة على كل مُدخل وتابع الأداء عن كثب عند تصدير نماذج متغيرة طول التسلسل.
## استخدام TorchScript في Python
يوضح هذا القسم كيفية حفظ النماذج وتحميلها، بالإضافة إلى كيفية استخدام التتبع للاستدلال.
### حفظ نموذج
لتصدير `BertModel` باستخدام TorchScript، قم بتهيئة ـ `BertModel` من فئة `BertConfig` ثم احفظه على القرص تحت اسم الملف `traced_bert.pt`:
```python
from transformers import BertModel, BertTokenizer, BertConfig
import torch
enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)
# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = "[MASK]"
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
dummy_input = [tokens_tensor, segments_tensors]
# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(
vocab_size_or_config_json_file=32000,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
torchscript=True,
)
# Instantiating the model
model = BertModel(config)
# The model needs to be in evaluation mode
model.eval()
# If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag
model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True)
# Creating the trace
traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
torch.jit.save(traced_model, "traced_bert.pt")
```
### تحميل نموذج
يمكنك الآن تحميل `BertModel` المُحفظ سابقًا، `traced_bert.pt`، من القرص واستخدامه على `dummy_input` المُهيأ سابقًا:
```python
loaded_model = torch.jit.load("traced_bert.pt")
loaded_model.eval()
all_encoder_layers, pooled_output = loaded_model(*dummy_input)
```
### استخدام نموذج مُتتبع للاستدلال
استخدم النموذج المُتتبع للاستدلال باستخدام أسلوب `__call__` الخاص به:
```python
traced_model(tokens_tensor, segments_tensors)
```
## نشر نماذج Hugging Face TorchScript على AWS باستخدام Neuron SDK
قدمت AWS عائلة [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) من اﻷجهزة لخفض التكلفة وأداء التعلم الآلي عالي الأداء في البيئة السحابية. تعمل أجهزة Inf1 بواسطة شريحة Inferentia من AWS، وهي مُسرّع أجهزة مُخصص، متخصص في أعباء عمل الاستدلال للتعلم العميق. [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#) هي SDK لـ Inferentia التي تدعم تتبع نماذج المحولات وتحسينها للنشر على Inf1. توفر Neuron SDK ما يلي:
1. واجهة برمجة تطبيقات سهلة الاستخدام مع تغيير سطر واحد من التعليمات البرمجية لتتبع نموذج TorchScript وتحسينه للاستدلال في البيئة السحابية.
2. تحسينات الأداء الجاهزة للاستخدام [تحسين التكلفة والأداء](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/>).
3. دعم نماذج Hugging Face المحولات المبنية باستخدام إما [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html) أو [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html).
### الآثار المترتبة
تعمل نماذج المحولات المستندة إلى بنية [BERT (تمثيلات الترميز ثنائية الاتجاه من المحولات)](https://huggingface.co/docs/transformers/main/model_doc/bert) أو متغيراتها مثل [distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert) و [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta) بشكل أفضل على Inf1 للمهام غير التوليدية مثل الإجابة على الأسئلة الاستخراجية، وتصنيف التسلسلات، وتصنيف الرموز (tokens). ومع ذلك، يمكن تكييف مهام توليد النصوص للعمل على Inf1 وفقًا لهذا [برنامج تعليمي AWS Neuron MarianMT](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html). يمكن العثور على مزيد من المعلومات حول النماذج التي يمكن تحويلها جاهزة على Inferentia في قسم [ملاءمة بنية النموذج](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia) من وثائق Neuron.
### التبعيات (Dependencies)
يتطلب استخدام AWS Neuron لتحويل النماذج [بيئة SDK Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide) والتي تأتي مسبقًا على [AMI للتعلم العميق من AWS](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html).
### تحويل نموذج لـ AWS Neuron
قم بتحويل نموذج لـ AWS NEURON باستخدام نفس التعليمات البرمجية من [استخدام TorchScript في Python](torchscript#using-torchscript-in-python) لتتبع `BertModel`. قم باستيراد امتداد إطار عمل `torch.neuron` للوصول إلى مكونات Neuron SDK من خلال واجهة برمجة تطبيقات Python:
```python
from transformers import BertModel, BertTokenizer, BertConfig
import torch
import torch.neuron
```
كل ما عليك فعله هو تعديل السطر التالي:
```diff
- torch.jit.trace(model, [tokens_tensor, segments_tensors])
+ torch.neuron.trace(model, [token_tensor, segments_tensors])
```
يتيح ذلك لـ Neuron SDK تتبع النموذج وتحسينه لمثيلات Inf1.
لمعرفة المزيد حول ميزات AWS Neuron SDK والأدوات ودروس البرامج التعليمية والتحديثات الأخيرة، يرجى الاطلاع على [وثائق AWS NeuronSDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html).

View File

@ -88,6 +88,8 @@
title: Tool use
- local: chat_templating_writing
title: Writing a chat template
- local: chat_response_parsing
title: Response parsing
title: Chat with models
- sections:
- local: serving
@ -227,8 +229,6 @@
title: ONNX
- local: executorch
title: ExecuTorch
- local: torchscript
title: TorchScript
title: Export to production
- isExpanded: false
sections:
@ -284,6 +284,8 @@
title: Knowledge Distillation for Computer Vision
- local: tasks/keypoint_matching
title: Keypoint matching
- local: tasks/training_vision_backbone
title: Training vision models using Backbone API
title: Computer vision
- sections:
- local: tasks/image_captioning
@ -544,8 +546,6 @@
title: Helium
- local: model_doc/herbert
title: HerBERT
- local: model_doc/hgnet_v2
title: HGNet-V2
- local: model_doc/hunyuan_v1_dense
title: HunYuanDenseV1
- local: model_doc/hunyuan_v1_moe
@ -1188,6 +1188,8 @@
title: TVP
- local: model_doc/udop
title: UDOP
- local: model_doc/video_llama_3
title: VideoLlama3
- local: model_doc/video_llava
title: VideoLlava
- local: model_doc/vilt
@ -1253,6 +1255,8 @@
title: Importing Utilities
- local: internal/time_series_utils
title: Utilities for Time Series
- local: internal/rope_utils
title: Rotary Embeddings Utilities
title: Internal helpers
- sections:
- local: reference/environment_variables

View File

@ -55,6 +55,7 @@ deepspeed --num_gpus 2 trainer-program.py ...
</hfoptions>
## Order of accelerators
To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:

View File

@ -41,13 +41,13 @@ $$
The query (`Q`), key (`K`), and value (`V`) matrices are projections from the input embeddings of shape `(b, h, T, d_head)`.
For causal attention, the mask prevents the model from attending to future tokens. Once a token is processed, its representation never changes with respect to future tokens, which means \\( K_{\text{past}} \\) and \\( V_{\text{past}} \\) can be cached and reused to compute the last token's representation.
For causal attention, the mask prevents the model from attending to future tokens. Once a token is processed, its representation never changes with respect to future tokens, which means $ K_{\text{past}} $ and $ V_{\text{past}} $ can be cached and reused to compute the last token's representation.
$$
\text{Attention}(q_t, [\underbrace{k_1, k_2, \dots, k_{t-1}}_{\text{cached}}, k_{t}], [\underbrace{v_1, v_2, \dots, v_{t-1}}_{\text{cached}}, v_{t}])
$$
At inference time, you only need the last token's query to compute the representation \\( x_t \\) that predicts the next token \\( t+1 \\). At each step, the new key and value vectors are **stored** in the cache and **appended** to the past keys and values.
At inference time, you only need the last token's query to compute the representation $ x_t $ that predicts the next token $ t+1 $. At each step, the new key and value vectors are **stored** in the cache and **appended** to the past keys and values.
$$
K_{\text{cache}} \leftarrow \text{concat}(K_{\text{past}}, k_t), \quad V_{\text{cache}} \leftarrow \text{concat}(V_{\text{past}}, v_t)
@ -59,7 +59,7 @@ Refer to the table below to compare how caching improves efficiency.
| without caching | with caching |
|---|---|
| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V`
| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` |
| attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) |
## Cache class

View File

@ -95,9 +95,12 @@ print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):]))
The chat model called the `get_current_temperature` tool with the correct parameters from the docstring. It inferred France as the location based on Paris, and that it should use Celsius for the units of temperature.
A model **cannot actually call the tool itself**. It requests a tool call, and it's your job to handle the call and append it and the result to the chat history.
A model **cannot actually call the tool itself**. It requests a tool call, and it's your job to handle the call and append it and the result to the chat history. For
models that support [response parsing](./chat_response_parsing), the response parsing will be handled automatically, and you can just use
[`~PreTrainedTokenizer.parse_response] to extract the tool call. For other models, you'll need to manually translate the output
string into a tool call dict.
Hold the call in the `tool_calls` key of an `assistant` message. This is the recommended API, and should be supported by the chat template of most tool-using models.
Regardless of the approach you use, the tool call should go in the `tool_calls` key of an `assistant` message. This is the recommended API, and should be supported by the chat template of most tool-using models.
> [!WARNING]
> Although `tool_calls` is similar to the OpenAI API, the OpenAI API uses a JSON string as its `tool_calls` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.

View File

@ -0,0 +1,233 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Response Parsing
It is increasingly common for chat models to generate structured outputs, rather than just a single reply string.
The most common uses for structured outputs are [tool calling](./chat_extras) and [reasoning models](https://huggingface.co/reasoning-course).
Tool calling models can output tool calls, containing the name of the tool to call and any arguments to be passed to it,
while reasoning models often output reasoning steps as a "chain of thought". Some recent models even use both of these,
and may output reasoning and/or one or more tool calls before their final answer.
Models with structured outputs pose a challenge for chat templating, because the output needs to be parsed before it
can be appended to the chat. For a concrete example, let's say we ask [GPT-OSS](https://huggingface.co/openai/gpt-oss-120b)
what the weather is like, and it thinks and decides to call a tool. Here's what the raw model output might look like:
```txt
<|start|>analysis<|message|>The user asks: "What is the weather like in SF?" We need to get the location of the user? The user explicitly asks about SF (San Francisco).
So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data.
So we should call get_current_weather with location "San Francisco, CA". Let's do that.
We will call function get_current_weather.<|end|><|start|>commentary to=functions.get_current_weather<|channel|>commentary <|constrain|>json<|message|>{"location":"San Francisco, CA"}<|call|>
}
```
But if you want to append this to a chat, you'll need to format it as a chat message dict, like this:
```json
{
"role": "assistant",
"thinking": "The user asks: \"What is the weather like in SF?\" We need to get the location of the user? The user explicitly asks about SF (San Francisco). So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data. So we should call get_current_weather with location \"San Francisco, CA\". Let's do that.",
"tool_calls": [
{
"name": "get_current_weather",
"arguments": {
"location": "San Francisco, CA"
}
}
]
}
```
Chat **templates** give us a way to turn messages into formatted input for a model, but we need something else to
parse model output back into a standard message dict. This is what chat **parsing** is for.
## The [parse_response](~PreTrainedTokenizerBase.parse_response) method
Parsing a chat response on a model that supports it is straightforward. Simply take the raw, decoded output from
[generate](`~generation.GenerationMixin.generate`), and pass it to the tokenizer's [parse_response](~PreTrainedTokenizerBase.parse_response) method:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype="auto", device_map="auto")
messages = [
{
"role": "user",
"content": "Hey! Can you summarize the end of the Cold War as briefly as possible? Like, comically briefly. It should really leave out almost most of the relevant information."
}
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=1024)[0, input_ids.shape[1]:]
out_text = tokenizer.decode(outputs)
parsed = tokenizer.parse_response(out_text)
print(parsed.keys())
```
And you should get:
```text
dict_keys(['thinking', 'content'])
```
And that's all you need to start using response parsing! `parse_response` should return a complete message dict that is ready to be appended to the chat history.
When the tokenizer does not support response parsing, `parse_response` will throw an error. We hope to add support
to more tokenizers over time.
## Developers: Understanding a simple response schema
Under the hood, `parse_response` uses a **JSON schema** to parse the model output. A JSON schema represents
the structure of the output message dict. The schema is augmented with additional fields that indicate how the
output message string should be parsed into the expected format. Let's take a look at the schema for a SmolLM response,
excluding tool calls for now:
```python
{
"x-regex": "(?:<think>\n?(?P<thinking>.+?)\n?</think>)?\s*(?P<content>.+?)?\s*(?:<\|im_end\|>|$)",
"type": "object",
"properties": {
"role": {"const": "assistant"},
"content": {"type": "string"},
"thinking": {"type": "string"}
}
}
```
We can see that the schema describes a JSON "object" (a `dict`, in other words) with three keys: `role`, `content`, and `thinking`.
Because all assistant responses have the role "assistant", the `role` key is a `const`(ant). The other two keys are strings, extracted
from the named groups in the regex in the `x-regex` field.
Like chat templates, response schemas are set as a property of the tokenizer. To enable response parsing, all you need
to do is set `tokenizer.response_schema` to a valid schema dict, and `tokenizer.parse_response()` will work! Again, like
chat templates, this schema will be saved with the processor, so once you set it, you can use `save_pretrained()` or `push_to_hub()` to
save and share the schema.
## Developers: Complex schemas
Now, let's look at a more complex schema, which includes tool calls, to gain more of an understanding of the parser
internals. For this, we'll use the `GPT-OSS` schema. GPT-OSS emits both tool calls and thinking blocks, and it uses
an unusual format where model responses are tagged with one of three "channels": `commentary` for things like
tool calls, `analysis` for chain of thought blocks, and `final` for messages intended to be sent to the user.
A full message where the model calls a tool named `get_current_weather` might look like this, with some extra linebreaks added for clarity:
```text
<|channel|>analysis<|message|>
The user asks: "What is the weather like in SF?" So we need to get the current weather in San Francisco, CA.
We need to call get_current_weather function. So we should call get_current_weather with location "San Francisco, CA".
<|end|>
<|start|>assistant<|channel|>commentary
to=functions.get_current_weather <|constrain|>json<|message|>
{
"location": "San Francisco, CA"
}
<|call|>
```
Parsing proceeds recursively; the output of a regex (or other parser) at one level becomes the input to the nodes below it.
In other words, don't feel like you have to parse the entire output in one enormous regex! Instead, start with the schema,
and then add regexes to extract the relevant chunks as you go. Here's a schema that will parse it, with some
explanatory comments:
```python
{
"type": "object",
"properties": {
"role": {"const": "assistant"},
# "content" and "thinking" are both similar to the previous example, and just extract a single string
# However, rather than using a single regex with named groups to extract both, we use a regex in each subkey.
# When an object node has no parser/regex, the entire input string is passed to all of its children, so
# parsing can either be done with named groups at the object level, or with separate regexes at the property level.
"content": {"type": "string", "x-regex": r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|$)"},
"thinking": {"type": "string", "x-regex": r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>"},
"tool_calls": {
# "x-regex-iterator" uses re.findall to find multiple possible manages, and returns them as an
# array/list. You don't need to worry about array handling, though - each item in the array will be
# parsed by the `items` schema, so just write the schema for a single item.
"x-regex-iterator": r"<\|channel\|>commentary (to=functions\..*?<\|message\|>.*?)(?:<\|call\|>|$)",
"type": "array",
"items": {
"type": "object",
"properties": {
# A const property is a fixed value, and the input has no effect on it.
"type": {"const": "function"},
# Here, we wrap the entire tool call dict in a `{"function": ...}` block. The input string is passed through to it unchanged.
"function": {
"type": "object",
"properties": {
"name": {"type": "string", "x-regex": r"^to=functions\.(\w+)"},
"arguments": {
"type": "object",
"x-regex": "<\|message\|>(.*)",
# The "x-parser" field indicates that the extracted string should be parsed as JSON.
# The output is then passed to the schema nodes below and recursive parsing continues.
"x-parser": "json",
"additionalProperties": {"type": "any"},
},
},
},
},
},
},
},
}
```
## Developers: Understanding the parser logic
The parser follows a few simple rules:
1. Each level of the schema receives input from the level above, applies any regex or parser it has, and then passes the output to its children.
2. The root level receives the entire decoded model output string as input.
3. If a node has structured content after parsing (for example, if the regex has named groups and returns a dict, or if the parser returns a dict or list),
then that structured content is mapped to the node's children, and each child node receives its corresponding value as input.
4. If an `object` (dict) node has unstructured (string) output, then the entire string is passed to all of its children. This allows child nodes
to handle parsing individually rather than requiring a single parent regex to extract all keys at once.
5. If an `array` (list) node has unstructured (string) output, then this throws an error.
There is a small set of allowable `x-` keys that indicate how parsing should be done at each node:
- `x-regex`: A regex string to apply to the input. If the regex has named groups, the output is a dict of group names to values. Named groups should only be used in `object` nodes.
Otherwise, the regex must have exactly one unnamed capturing group, and the output is the value of that group as a string.
- `x-regex-iterator`: A regex string to apply to the input using `re.findall()`. The output is a list of all matches.
This should only be used in `array` nodes, and the regex must have exactly one unnamed capturing group. The output is distributed to
the node's `items` schema.
- `x-parser`: Calls a built-in parser to apply to the input. Currently, the only supported parser is `json`, which parses the input string as JSON.
The output is passed to the child nodes for further parsing. Note that the `json` parser can return deeply nested output - in this case, the output
will be progressively unwrapped as it is passed through child nodes. The child nodes do not need additional `x-parser` or `x-regex` fields in this case,
but their structure must match the structure of the parsed JSON.
- `x-parser-args`: Only allowed in conjunction with `x-parser`. This is a dict of additional arguments that control parsing. Right now, the only supported
argument is `transform`, which specifies a `jmespath` transformation to apply to the output. This is useful when the JSON parser returns a structure
that needs to be modified to match the schema.
- `x-regex-key-value`: This is rarely necessary, but it can be useful when parsing key-value pairs in non-JSON format where the names of the keys are not known
in advance, such as when a model emits XML tool calls with arbitrary argument names. The regex must have exactly two named capturing groups,
`key` and `value`, and the output is a dict mapping keys to values. This should only be used in `object` nodes.
In general, multiple regexes/parsers cannot be combined at the same level. The exception is that `x-regex`, returning a single string, can be combined with the other parsers. In this case,
`x-regex` is applied first, and then the output is passed to the other parser, either `x-regex-iterator`, `x-parser`, or `x-regex-key-value`.
Putting these ideas together, you can see that the input flows through the schema, being parsed at each level and then distributed to child nodes. Each level
only needs to extract the input content that is relevant for that part of the schema, and can then let its child nodes handle the rest. Internally, this is handled
with a parser function that receives input, applies any regexes/parsers at the current level, then maps the result to its child nodes before recursively calling itself on each of them.
Recursion terminates when it reaches leaf nodes, usually primitive types like `string` or `number`, which simply return the input they receive.

View File

@ -6,13 +6,13 @@ rendered properly in your Markdown viewer.
This page regroups resources around 🤗 Transformers developed by the community.
## Community resources:
## Community resources
| Resource | Description | Author |
|:----------|:-------------|------:|
| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](glossary) that has been put into a form which can be easily learned/revised using [Anki](https://apps.ankiweb.net/) an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) |
## Community notebooks:
## Community notebooks
| Notebook | Description | Author | |
|:----------|:-------------|:-------------|------:|

View File

@ -16,44 +16,18 @@ rendered properly in your Markdown viewer.
# ExecuTorch
[ExecuTorch](https://pytorch.org/executorch/stable/index.html) is a platform that enables PyTorch training and inference programs to be run on mobile and edge devices. It is powered by [torch.compile](https://pytorch.org/docs/stable/torch.compiler.html) and [torch.export](https://pytorch.org/docs/main/export.html) for performance and deployment.
[ExecuTorch](https://pytorch.org/executorch/stable/index.html) runs PyTorch models on mobile and edge devices. Export your Transformers models to the ExecuTorch format with [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) with the command below.
You can use ExecuTorch with Transformers with [torch.export](https://pytorch.org/docs/main/export.html). The [`~transformers.convert_and_export_with_cache`] method converts a [`PreTrainedModel`] into an exportable module. Under the hood, it uses [torch.export](https://pytorch.org/docs/main/export.html) to export the model, ensuring compatibility with ExecuTorch.
```py
import torch
from transformers import LlamaForCausalLM, AutoTokenizer, GenerationConfig
from transformers.integrations.executorch import(
TorchExportableModuleWithStaticCache,
convert_and_export_with_cache
)
generation_config = GenerationConfig(
use_cache=True,
cache_implementation="static",
cache_config={
"batch_size": 1,
"max_cache_len": 20,
}
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B", pad_token="</s>", padding_side="right")
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", device_map="auto", dtype=torch.bfloat16, attn_implementation="sdpa", generation_config=generation_config)
exported_program = convert_and_export_with_cache(model)
```bash
optimum-cli export executorch \
--model "HuggingFaceTB/SmolLM2-135M-Instruct" \
--task "text-generation" \
--recipe "xnnpack" \
--use_custom_sdpa \
--use_custom_kv_cache \
--qlinear 8da4w \
--qembedding 8w \
--output_dir="hf_smollm2"
```
The exported PyTorch model is now ready to be used with ExecuTorch. Wrap the model with [`~transformers.TorchExportableModuleWithStaticCache`] to generate text.
```py
prompts = ["Simply put, the theory of relativity states that "]
prompt_tokens = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
prompt_token_ids = prompt_tokens["input_ids"]
generated_ids = TorchExportableModuleWithStaticCache.generate(
exported_program=exported_program, prompt_token_ids=prompt_token_ids, max_new_tokens=20,
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
['Simply put, the theory of relativity states that 1) the speed of light is the']
```
Run `optimum-cli export executorch --help` to see all export options. For detailed export instructions, check the [README](optimum/exporters/executorch/README.md).

View File

@ -37,7 +37,6 @@ def model_init(trial):
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
token=True if model_args.use_auth_token else None,
)
```

View File

@ -36,8 +36,6 @@ Explore the [Hub](https://huggingface.com/) today to find a model and use Transf
Explore the [Models Timeline](./models_timeline) to discover the latest text, vision, audio and multimodal model architectures in Transformers.
## Features
Transformers provides everything you need for inference or training with state-of-the-art pretrained models. Some of the main features include:

View File

@ -320,7 +320,7 @@ df.sort_values(by=['skipped_proportion'], ascending=False)
You can focus on a specific test method using `--test_method_name`:
```bash
$ python utils/scan_skipped_tests.py --test_method_name test_inputs_embeds --output_dir path/to/output
python utils/scan_skipped_tests.py --test_method_name test_inputs_embeds --output_dir path/to/output
```
- `--test_method_name`: Name of the test method to scan (e.g., `test_inputs_embeds`).
@ -364,6 +364,7 @@ This utility analyzes code similarities between model implementations to identif
When adding a new model to transformers, many components (attention layers, MLPs, outputs, etc.) may already exist in similar form in other models. Instead of implementing everything from scratch, model adders can identify which existing classes are similar and potentially reusable through modularization.
The tool computes two similarity scores:
- **Embedding score**: Uses semantic code embeddings (via `Qwen/Qwen3-Embedding-4B`) to detect functionally similar code even with different naming
- **Jaccard score**: Measures token set overlap to identify structurally similar code patterns

View File

@ -0,0 +1,83 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Utilities for Rotary Embedding
This page explains how the Rotary Embedding is computed and applied in Transformers and what types of RoPE are supported.
## Overview
Rotary Position Embeddings are a technique used to inject positional information into attention mechanisms without relying on explicit position encodings.
Instead of adding position vectors to token embeddings, RoPE rotates query and key vectors in the complex plane according to their positions enabling relative positional awareness and better extrapolation to unseen sequence lengths.
The Transformers library provides a flexible and extensible implementation of various RoPE types defined in `[`~modeling_rope_utils.ROPE_VALIDATION_FUNCTIONS`]`, including both the default and scaled variants:
| Rope Type | Description |
|------------|-------------|
| `"default"` | Standard rotary embedding as in LLaMA. |
| `"linear"` | Linear-scaled RoPE which allows longer context windows. |
| `"dynamic"` | NTK-aware scaling computed by rescaling frequency base (`θ`) for longer context. |
| `"yarn"` | YaRN scaling variant providing smoother extrapolation and stability. |
| `"longrope"` | [LongRoPE](https://github.com/microsoft/LongRoPE) scaling as in Phi-2 model series. |
| `"llama3"` | RoPE scaling as in Llama3.1. |
## Configuration in Model Configs
To enable and customize rotary embeddings, add a `rope_parameters` field to your models configuration file (`config.json`). This field controls the RoPE behavior across model layers. Note that each RoPE variant defines its own set of expected keys and missing keys will raise an error. See the example below which creates a llama config with default RoPE parameters:
```python
from transformers import LlamaConfig
config = LlamaConfig()
config.rope_parameters = {
"rope_type": "default", # type of RoPE to use
"rope_theta": 10000.0 # base frequency parameter
}
# If we want to apply a scaled RoPE type, we need to pass extra parameters
config.rope_parameters = {
"rope_type": "linear",
"rope_theta": 10000.0,
"factor": 8.0 # scale factor for context extension
}
```
## Per-Layer-Type RoPE Configuration
Some models such as Gemma-3 use different layer types with different attention mechanisms, i.e. "full attention" in some blocks and "sliding-window attention" in others. Transformers supports specifying distinct RoPE parameters per layer type for these models. In this case, `rope_parameters` should be a nested dictionary, where top-level keys correspond to `config.layer_types` and values are per-type RoPE parameters. During model initialization, each decoder layer will automatically look up the matching RoPE configuration based on its declared layer type.
```python
from transformers import Gemma3Config
config = Gemma3Config()
config.rope_parameters = {
"full_attention": {
"rope_type": "dynamic",
"rope_theta": 1000000.0,
"factor": 8.0,
"original_max_position_embeddings": 8096,
},
"sliding_attention": {
"rope_type": "default",
"rope_theta": 10000.0,
}
}
```
## Utilities
[[autodoc]] RopeParameters
- __call__

View File

@ -1,3 +1,3 @@
# Overview
Kernels in transformers are used to optimize the performance of models with custom layers from the hub and very low effort.
Kernels in transformers are used to optimize the performance of models with custom layers from the hub and very low effort.

View File

@ -208,7 +208,7 @@ Some models have a unique way of storing past kv pairs or states that is not com
Mamba models, such as [Mamba](./model_doc/mamba), require a specific cache because the model doesn't have an attention mechanism or kv states. Thus, they are not compatible with the above [`Cache`] classes.
# Iterative generation
## Iterative generation
A cache can also work in iterative generation settings where there is back-and-forth interaction with a model (chatbots). Like regular generation, iterative generation with a cache allows a model to efficiently handle ongoing conversations without recomputing the entire context at each step.

View File

@ -16,18 +16,18 @@ rendered properly in your Markdown viewer.
Large Language Models (LLMs) such as GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), and [Llama](https://huggingface.co/meta-llama/Llama-2-70b-hf) are rapidly advancing in their ability to tackle human-centric tasks, establishing themselves as essential tools in modern knowledge-based industries.
Deploying these models in real-world tasks remains challenging, however:
- To exhibit near-human text understanding and generation capabilities, LLMs currently require to be composed of billions of parameters (see [Kaplan et al](https://huggingface.co/papers/2001.08361), [Wei et. al](https://huggingface.co/papers/2206.07682)). This consequently amplifies the memory demands for inference.
- In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference.
- To exhibit near-human text understanding and generation capabilities, LLMs currently require to be composed of billions of parameters (see [Kaplan et al](https://huggingface.co/papers/2001.08361), [Wei et. al](https://huggingface.co/papers/2206.07682)). This consequently amplifies the memory demands for inference.
- In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference.
The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.
In this guide, we will go over the effective techniques for efficient LLM deployment:
1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization) can achieve computational advantages without a considerable decline in model performance.
1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization) can achieve computational advantages without a considerable decline in model performance.
2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.
2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.
3. **Architectural Innovations:** Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancement in model architectures hereby are [Alibi](https://huggingface.co/papers/2108.12409), [Rotary embeddings](https://huggingface.co/papers/2104.09864), [Multi-Query Attention (MQA)](https://huggingface.co/papers/1911.02150) and [Grouped-Query-Attention (GQA)](https://huggingface.co/papers/2305.13245).
3. **Architectural Innovations:** Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancement in model architectures hereby are [Alibi](https://huggingface.co/papers/2108.12409), [Rotary embeddings](https://huggingface.co/papers/2104.09864), [Multi-Query Attention (MQA)](https://huggingface.co/papers/1911.02150) and [Grouped-Query-Attention (GQA)](https://huggingface.co/papers/2305.13245).
Throughout this guide, we will offer an analysis of auto-regressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. While doing so, we run practical examples showcasing each of the feature improvements.
@ -37,22 +37,22 @@ Memory requirements of LLMs can be best understood by seeing the LLM as a set of
At the time of writing this guide, LLMs consist of at least a couple billion parameters. Each parameter thereby is made of a decimal number, e.g. `4.5689` which is usually stored in either [float32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), or [float16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) format. This allows us to easily compute the memory requirement to load the LLM into memory:
> *Loading the weights of a model having X billion parameters requires roughly 4 * X GB of VRAM in float32 precision*
> *Loading the weights of a model having X billion parameters requires roughly 4 \* X GB of VRAM in float32 precision*
Nowadays, models are however rarely trained in full float32 precision, but usually in bfloat16 precision or less frequently in float16 precision. Therefore the rule of thumb becomes:
> *Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision*
> *Loading the weights of a model having X billion parameters requires roughly 2 \* X GB of VRAM in bfloat16/float16 precision*
For shorter text inputs (less than 1024 tokens), the memory requirement for inference is very much dominated by the memory requirement to load the weights. Therefore, for now, let's assume that the memory requirement for inference is equal to the memory requirement to load the model into the GPU VRAM.
To give some examples of how much VRAM it roughly takes to load a model in bfloat16:
- **GPT3** requires 2 \* 175 GB = **350 GB** VRAM
- [**Bloom**](https://huggingface.co/bigscience/bloom) requires 2 \* 176 GB = **352 GB** VRAM
- [**Llama-2-70b**](https://huggingface.co/meta-llama/Llama-2-70b-hf) requires 2 \* 70 GB = **140 GB** VRAM
- [**Falcon-40b**](https://huggingface.co/tiiuae/falcon-40b) requires 2 \* 40 GB = **80 GB** VRAM
- [**MPT-30b**](https://huggingface.co/mosaicml/mpt-30b) requires 2 \* 30 GB = **60 GB** VRAM
- [**bigcode/starcoder**](https://huggingface.co/bigcode/starcoder) requires 2 \* 15.5 = **31 GB** VRAM
- **GPT3** requires 2 \* 175 GB = **350 GB** VRAM
- [**Bloom**](https://huggingface.co/bigscience/bloom) requires 2 \* 176 GB = **352 GB** VRAM
- [**Llama-2-70b**](https://huggingface.co/meta-llama/Llama-2-70b-hf) requires 2 \* 70 GB = **140 GB** VRAM
- [**Falcon-40b**](https://huggingface.co/tiiuae/falcon-40b) requires 2 \* 40 GB = **80 GB** VRAM
- [**MPT-30b**](https://huggingface.co/mosaicml/mpt-30b) requires 2 \* 30 GB = **60 GB** VRAM
- [**bigcode/starcoder**](https://huggingface.co/bigcode/starcoder) requires 2 \* 15.5 = **31 GB** VRAM
As of writing this document, the largest GPU chip on the market is the A100 & H100 offering 80GB of VRAM. Most of the models listed before require more than 80GB just to be loaded and therefore necessarily require [tensor parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#tensor-parallelism) and/or [pipeline parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism).
@ -169,11 +169,11 @@ All that matters is that the next token *logit* distribution stays roughly the s
There are various quantization techniques, which we won't discuss in detail here, but in general, all quantization techniques work as follows:
- 1. Quantize all weights to the target precision
- 2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
- 3. Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision
- 1. Quantize all weights to the target precision
- 2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
- 3. Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision
In a nutshell, this means that *inputs-weight matrix* multiplications, with \\( X \\) being the *inputs*, \\( W \\) being a weight matrix and \\( Y \\) being the output:
In a nutshell, this means that *inputs-weight matrix* multiplications, with $X$ being the *inputs*, $W$ being a weight matrix and $Y$ being the output:
$$ Y = X * W $$
@ -271,7 +271,7 @@ Just 9.5GB! That's really not a lot for a >15 billion parameter model.
While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full `bfloat16` inference. It is up to the user to try it out.
Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to \\( \text{quantize} \\) and \\( \text{dequantize} \\) taking longer during inference.
Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to $\text{quantize}$ and $\text{dequantize}$ taking longer during inference.
```python
del model
@ -300,41 +300,41 @@ Next, let's look into how we can improve computational and memory efficiency by
Today's top-performing LLMs share more or less the same fundamental architecture that consists of feed-forward layers, activation layers, layer normalization layers, and most crucially, self-attention layers.
Self-attention layers are central to Large Language Models (LLMs) in that they enable the model to understand the contextual relationships between input tokens.
However, the peak GPU memory consumption for self-attention layers grows *quadratically* both in compute and memory complexity with number of input tokens (also called *sequence length*) that we denote in the following by \\( N \\) .
However, the peak GPU memory consumption for self-attention layers grows *quadratically* both in compute and memory complexity with number of input tokens (also called *sequence length*) that we denote in the following by $N$ .
While this is not really noticeable for shorter input sequences (of up to 1000 input tokens), it becomes a serious problem for longer input sequences (at around 16000 input tokens).
Let's take a closer look. The formula to compute the output \\( \mathbf{O} \\) of a self-attention layer for an input \\( \mathbf{X} \\) of length \\( N \\) is:
Let's take a closer look. The formula to compute the output $\mathbf{O}$ of a self-attention layer for an input $\mathbf{X}$ of length $N$ is:
$$ \textbf{O} = \text{Attn}(\mathbf{X}) = \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T) \text{ with } \mathbf{Q} = \mathbf{W}_q \mathbf{X}, \mathbf{V} = \mathbf{W}_v \mathbf{X}, \mathbf{K} = \mathbf{W}_k \mathbf{X} $$
\\( \mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_{N}) \\) is thereby the input sequence to the attention layer. The projections \\( \mathbf{Q} \\) and \\( \mathbf{K} \\) will each consist of \\( N \\) vectors resulting in the \\( \mathbf{QK}^T \\) being of size \\( N^2 \\) .
$\mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_{N})$ is thereby the input sequence to the attention layer. The projections $\mathbf{Q}$ and $\mathbf{K}$ will each consist of $N$ vectors resulting in the $\mathbf{QK}^T$ being of size $N^2$ .
LLMs usually have multiple attention heads, thus doing multiple self-attention computations in parallel.
Assuming, the LLM has 40 attention heads and runs in bfloat16 precision, we can calculate the memory requirement to store the \\( \mathbf{QK^T} \\) matrices to be \\( 40 * 2 * N^2 \\) bytes. For \\( N=1000 \\) only around 50 MB of VRAM are needed, however, for \\( N=16000 \\) we would need 19 GB of VRAM, and for \\( N=100,000 \\) we would need almost 1TB just to store the \\( \mathbf{QK}^T \\) matrices.
Assuming, the LLM has 40 attention heads and runs in bfloat16 precision, we can calculate the memory requirement to store the $\mathbf{QK^T}$ matrices to be $40 * 2 * N^2$ bytes. For $N=1000$ only around 50 MB of VRAM are needed, however, for $N=16000$ we would need 19 GB of VRAM, and for $N=100,000$ we would need almost 1TB just to store the $\mathbf{QK}^T$ matrices.
Long story short, the default self-attention algorithm quickly becomes prohibitively memory-expensive for large input contexts.
As LLMs improve in text comprehension and generation, they are applied to increasingly complex tasks. While models once handled the translation or summarization of a few sentences, they now manage entire pages, demanding the capability to process extensive input lengths.
How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the \\( QK^T \\) matrix. [Tri Dao et al.](https://huggingface.co/papers/2205.14135) developed exactly such a new algorithm and called it **Flash Attention**.
How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the $\mathbf{QK}^T$ matrix. [Tri Dao et al.](https://huggingface.co/papers/2205.14135) developed exactly such a new algorithm and called it **Flash Attention**.
In a nutshell, Flash Attention breaks the \\(\mathbf{V} \times \text{Softmax}(\mathbf{QK}^T\\)) computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps:
In a nutshell, Flash Attention breaks the $\mathbf{V} \times \text{Softmax}(\mathbf{QK}^T)$ computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps:
$$ \textbf{O}_i \leftarrow s^a_{ij} * \textbf{O}_i + s^b_{ij} * \mathbf{V}_{j} \times \text{Softmax}(\mathbf{QK}^T_{i,j}) \text{ for multiple } i, j \text{ iterations} $$
with \\( s^a_{ij} \\) and \\( s^b_{ij} \\) being some softmax normalization statistics that need to be recomputed for every \\( i \\) and \\( j \\) .
with $s^a_{ij}$ and $s^b_{ij}$ being some softmax normalization statistics that need to be recomputed for every $i$ and $j$ .
Please note that the whole Flash Attention is a bit more complex and is greatly simplified here as going in too much depth is out of scope for this guide. The reader is invited to take a look at the well-written [Flash Attention paper](https://huggingface.co/papers/2205.14135) for more details.
The main takeaway here is:
> By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives **numerical identical** outputs compared to the default self-attention layer at a memory cost that only increases linearly with \\( N \\) .
> By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives **numerical identical** outputs compared to the default self-attention layer at a memory cost that only increases linearly with $N$ .
Looking at the formula, one would intuitively say that Flash Attention must be much slower compared to the default self-attention formula as more computation needs to be done. Indeed Flash Attention requires more FLOPs compared to normal attention as the softmax normalization statistics have to constantly be recomputed (see [paper](https://huggingface.co/papers/2205.14135) for more details if interested)
> However, Flash Attention is much faster in inference compared to default attention which comes from its ability to significantly reduce the demands on the slower, high-bandwidth memory of the GPU (VRAM), focusing instead on the faster on-chip memory (SRAM).
Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast *on-chip* SRAM memory instead of having to access the slower VRAM memory to compute the output vector \\( \mathbf{O} \\) .
Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast *on-chip* SRAM memory instead of having to access the slower VRAM memory to compute the output vector $\mathbf{O}$ .
In practice, there is currently absolutely no reason to **not** use Flash Attention if available. The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient.
@ -342,74 +342,75 @@ In practice, there is currently absolutely no reason to **not** use Flash Attent
So far we have looked into improving computational and memory efficiency by:
- Casting the weights to a lower precision format
- Replacing the self-attention algorithm with a more memory- and compute efficient version
- Casting the weights to a lower precision format
- Replacing the self-attention algorithm with a more memory- and compute efficient version
Let's now look into how we can change the architecture of an LLM so that it is most effective and efficient for task that require long text inputs, *e.g.*:
- Retrieval augmented Questions Answering,
- Summarization,
- Chat
Let's now look into how we can change the architecture of an LLM so that it is most effective and efficient for tasks that require long text inputs, *e.g.*:
- Retrieval augmented Questions Answering,
- Summarization,
- Chat
Note that *chat* not only requires the LLM to handle long text inputs, but it also necessitates that the LLM is able to efficiently handle the back-and-forth dialogue between user and assistant (such as ChatGPT).
Once trained, the fundamental LLM architecture is difficult to change, so it is important to make considerations about the LLM's tasks beforehand and accordingly optimize the model's architecture.
There are two important components of the model architecture that quickly become memory and/or performance bottlenecks for large input sequences.
- The positional embeddings
- The key-value cache
- The positional embeddings
- The key-value cache
Let's go over each component in more detail
### 3.1 Improving positional embeddings of LLMs
Self-attention puts each token in relation to each other's tokens.
As an example, the \\( \text{Softmax}(\mathbf{QK}^T) \\) matrix of the text input sequence *"Hello", "I", "love", "you"* could look as follows:
As an example, the $\text{Softmax}(\mathbf{QK}^T)$ matrix of the text input sequence *"Hello", "I", "love", "you"* could look as follows:
![](/blog/assets/163_optimize_llm/self_attn_tokens.png)
Each word token is given a probability mass at which it attends all other word tokens and, therefore is put into relation with all other word tokens. E.g. the word *"love"* attends to the word *"Hello"* with 5%, to *"I"* with 30%, and to itself with 65%.
A LLM based on self-attention, but without position embeddings would have great difficulties in understanding the positions of the text inputs to each other.
This is because the probability score computed by \\( \mathbf{QK}^T \\) relates each word token to each other word token in \\( O(1) \\) computations regardless of their relative positional distance to each other.
This is because the probability score computed by $\mathbf{QK}^T$ relates each word token to each other word token in $O(1)$ computations regardless of their relative positional distance to each other.
Therefore, for the LLM without position embeddings each token appears to have the same distance to all other tokens, *e.g.* differentiating between *"Hello I love you"* and *"You love I hello"* would be very challenging.
For the LLM to understand sentence order, an additional *cue* is needed and is usually applied in the form of *positional encodings* (or also called *positional embeddings*).
Positional encodings, encode the position of each token into a numerical presentation that the LLM can leverage to better understand sentence order.
The authors of the [*Attention Is All You Need*](https://huggingface.co/papers/1706.03762) paper introduced sinusoidal positional embeddings \\( \mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N \\) .
where each vector \\( \mathbf{p}_i \\) is computed as a sinusoidal function of its position \\( i \\) .
The positional encodings are then simply added to the input sequence vectors \\( \mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N \\) = \\( \mathbf{x}_1 + \mathbf{p}_1, \ldots, \mathbf{x}_N + \mathbf{p}_N \\) thereby cueing the model to better learn sentence order.
The authors of the [*Attention Is All You Need*](https://huggingface.co/papers/1706.03762) paper introduced sinusoidal positional embeddings $\mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N$ .
where each vector $\mathbf{p}_i$ is computed as a sinusoidal function of its position $i$ .
The positional encodings are then simply added to the input sequence vectors $\mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N$ = $\mathbf{x}_1 + \mathbf{p}_1, \ldots, \mathbf{x}_N + \mathbf{p}_N$ thereby cueing the model to better learn sentence order.
Instead of using fixed position embeddings, others (such as [Devlin et al.](https://huggingface.co/papers/1810.04805)) used learned positional encodings for which the positional embeddings
\\( \mathbf{P} \\) are learned during training.
$\mathbf{P}$ are learned during training.
Sinusoidal and learned position embeddings used to be the predominant methods to encode sentence order into LLMs, but a couple of problems related to these positional encodings were found:
1. Sinusoidal and learned position embeddings are both absolute positional embeddings, *i.e.* encoding a unique embedding for each position id: \\( 0, \ldots, N \\) . As shown by [Huang et al.](https://huggingface.co/papers/2009.13658) and [Su et al.](https://huggingface.co/papers/2104.09864), absolute positional embeddings lead to poor LLM performance for long text inputs. For long text inputs, it is advantageous if the model learns the relative positional distance input tokens have to each other instead of their absolute position.
2. When using learned position embeddings, the LLM has to be trained on a fixed input length \\( N \\), which makes it difficult to extrapolate to an input length longer than what it was trained on.
1. Sinusoidal and learned position embeddings are both absolute positional embeddings, *i.e.* encoding a unique embedding for each position id: $0, \ldots, N$ . As shown by [Huang et al.](https://huggingface.co/papers/2009.13658) and [Su et al.](https://huggingface.co/papers/2104.09864), absolute positional embeddings lead to poor LLM performance for long text inputs. For long text inputs, it is advantageous if the model learns the relative positional distance input tokens have to each other instead of their absolute position.
2. When using learned position embeddings, the LLM has to be trained on a fixed input length $N$, which makes it difficult to extrapolate to an input length longer than what it was trained on.
Recently, relative positional embeddings that can tackle the above mentioned problems have become more popular, most notably:
- [Rotary Position Embedding (RoPE)](https://huggingface.co/papers/2104.09864)
- [ALiBi](https://huggingface.co/papers/2108.12409)
- [Rotary Position Embedding (RoPE)](https://huggingface.co/papers/2104.09864)
- [ALiBi](https://huggingface.co/papers/2108.12409)
Both *RoPE* and *ALiBi* argue that it's best to cue the LLM about sentence order directly in the self-attention algorithm as it's there that word tokens are put into relation with each other. More specifically, sentence order should be cued by modifying the \\( \mathbf{QK}^T \\) computation.
Both *RoPE* and *ALiBi* argue that it's best to cue the LLM about sentence order directly in the self-attention algorithm as it's there that word tokens are put into relation with each other. More specifically, sentence order should be cued by modifying the $\mathbf{QK}^T$ computation.
Without going into too many details, *RoPE* notes that positional information can be encoded into query-key pairs, *e.g.* \\( \mathbf{q}_i \\) and \\( \mathbf{x}_j \\) by rotating each vector by an angle \\( \theta * i \\) and \\( \theta * j \\) respectively with \\( i, j \\) describing each vectors sentence position:
Without going into too many details, *RoPE* notes that positional information can be encoded into query-key pairs, *e.g.* $\mathbf{q}_i$ and $\mathbf{x}_j$ by rotating each vector by an angle $\theta * i$ and $\theta * j$ respectively with $i, j$ describing each vectors sentence position:
$$ \mathbf{\hat{q}}_i^T \mathbf{\hat{x}}_j = \mathbf{{q}}_i^T \mathbf{R}_{\theta, i -j} \mathbf{{x}}_j. $$
\\( \mathbf{R}_{\theta, i - j} \\) thereby represents a rotational matrix. \\( \theta \\) is *not* learned during training, but instead set to a pre-defined value that depends on the maximum input sequence length during training.
$\mathbf{R}_{\theta, i - j}$ thereby represents a rotational matrix. $\theta$ is *not* learned during training, but instead set to a pre-defined value that depends on the maximum input sequence length during training.
> By doing so, the probability score between \\( \mathbf{q}_i \\) and \\( \mathbf{q}_j \\) is only affected if \\( i \ne j \\) and solely depends on the relative distance \\( i - j \\) regardless of each vector's specific positions \\( i \\) and \\( j \\) .
> By doing so, the probability score between $\mathbf{q}_i$ and $\mathbf{q}_j$ is only affected if $i \ne j$ and solely depends on the relative distance $i - j$ regardless of each vector's specific positions $i$ and $j$ .
*RoPE* is used in multiple of today's most important LLMs, such as:
- [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
- [**Llama**](https://huggingface.co/papers/2302.13971)
- [**PaLM**](https://huggingface.co/papers/2204.02311)
- [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
- [**Llama**](https://huggingface.co/papers/2302.13971)
- [**PaLM**](https://huggingface.co/papers/2204.02311)
As an alternative, *ALiBi* proposes a much simpler relative position encoding scheme. The relative distance that input tokens have to each other is added as a negative integer scaled by a pre-defined value `m` to each query-key entry of the \\( \mathbf{QK}^T \\) matrix right before the softmax computation.
As an alternative, *ALiBi* proposes a much simpler relative position encoding scheme. The relative distance that input tokens have to each other is added as a negative integer scaled by a pre-defined value `m` to each query-key entry of the $\mathbf{QK}^T$ matrix right before the softmax computation.
![](/blog/assets/163_optimize_llm/alibi.png)
@ -417,19 +418,20 @@ As shown in the [ALiBi](https://huggingface.co/papers/2108.12409) paper, this si
*ALiBi* is used in multiple of today's most important LLMs, such as:
- [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
- [**BLOOM**](https://huggingface.co/bigscience/bloom)
- [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
- [**BLOOM**](https://huggingface.co/bigscience/bloom)
Both *RoPE* and *ALiBi* position encodings can extrapolate to input lengths not seen during training whereas it has been shown that extrapolation works much better out-of-the-box for *ALiBi* as compared to *RoPE*.
For ALiBi, one simply increases the values of the lower triangular position matrix to match the length of the input sequence.
For *RoPE*, keeping the same \\( \theta \\) that was used during training leads to poor results when passing text inputs much longer than those seen during training, *c.f* [Press et al.](https://huggingface.co/papers/2108.12409). However, the community has found a couple of effective tricks that adapt \\( \theta \\), thereby allowing *RoPE* position embeddings to work well for extrapolated text input sequences (see [here](https://github.com/huggingface/transformers/pull/24653)).
For *RoPE*, keeping the same $\theta$ that was used during training leads to poor results when passing text inputs much longer than those seen during training, *c.f* [Press et al.](https://huggingface.co/papers/2108.12409). However, the community has found a couple of effective tricks that adapt $\theta$, thereby allowing *RoPE* position embeddings to work well for extrapolated text input sequences (see [here](https://github.com/huggingface/transformers/pull/24653)).
> Both RoPE and ALiBi are relative positional embeddings that are *not* learned during training, but instead are based on the following intuitions:
- Positional cues about the text inputs should be given directly to the \\( QK^T \\) matrix of the self-attention layer
- The LLM should be incentivized to learn a constant *relative* distance positional encodings have to each other
- The further text input tokens are from each other, the lower the probability of their query-value probability. Both RoPE and ALiBi lower the query-key probability of tokens far away from each other. RoPE by decreasing their vector product by increasing the angle between the query-key vectors. ALiBi by adding large negative numbers to the vector product
In conclusion, LLMs that are intended to be deployed in tasks that require handling large text inputs are better trained with relative positional embeddings, such as RoPE and ALiBi. Also note that even if an LLM with RoPE and ALiBi has been trained only on a fixed length of say \\( N_1 = 2048 \\) it can still be used in practice with text inputs much larger than \\( N_1 \\), like \\( N_2 = 8192 > N_1 \\) by extrapolating the positional embeddings.
- Positional cues about the text inputs should be given directly to the $\mathbf{QK}^T$ matrix of the self-attention layer.
- The LLM should be incentivized to learn a constant *relative* distance positional encoding.
- The further text input tokens are from each other, the lower the probability of their query-value probability. Both RoPE and ALiBi lower the query-key probability of tokens far away from each other. RoPE lowers by decreasing their vector product by increasing the angle between the query-key vectors. ALiBi lowers by adding large negative numbers to the vector product.
In conclusion, LLMs that are intended to be deployed in tasks that require handling large text inputs are better trained with relative positional embeddings, such as RoPE and ALiBi. Also note that even if an LLM with RoPE and ALiBi has been trained only on a fixed length of say $N_1 = 2048$ it can still be used in practice with text inputs much larger than $N_1$, like $N_2 = 8192 > N_1$ by extrapolating the positional embeddings.
### 3.2 The key-value cache
@ -468,7 +470,7 @@ As we can see every time we increase the text input tokens by the just sampled t
With very few exceptions, LLMs are trained using the [causal language modeling objective](https://huggingface.co/docs/transformers/tasks/language_modeling#causal-language-modeling) and therefore mask the upper triangle matrix of the attention score - this is why in the two diagrams above the attention scores are left blank (*a.k.a* have 0 probability). For a quick recap on causal language modeling you can refer to the [*Illustrated Self Attention blog*](https://jalammar.github.io/illustrated-gpt2/#part-2-illustrated-self-attention).
As a consequence, tokens *never* depend on previous tokens, more specifically the \\( \mathbf{q}_i \\) vector is never put in relation with any key, values vectors \\( \mathbf{k}_j, \mathbf{v}_j \\) if \\( j > i \\) . Instead \\( \mathbf{q}_i \\) only attends to previous key-value vectors \\( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots i - 1\} \\). In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps.
As a consequence, tokens *never* depend on later tokens, more specifically the $\mathbf{q}_i$ vector is never put in relation with any key, values vectors $\mathbf{k}_j, \mathbf{v}_j$ if $j > i$ . Instead $\mathbf{q}_i$ only attends to previous key-value vectors $\mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots i - 1\}$. In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps.
In the following, we will tell the LLM to make use of the key-value cache by retrieving and forwarding it for each forward pass.
In Transformers, we can retrieve the key-value cache by passing the `use_cache` flag to the `forward` call and can then pass it with the current token.
@ -509,11 +511,12 @@ length of key-value cache 24
As one can see, when using the key-value cache the text input tokens are *not* increased in length, but remain a single input vector. The length of the key-value cache on the other hand is increased by one at every decoding step.
> Making use of the key-value cache means that the \\( \mathbf{QK}^T \\) is essentially reduced to \\( \mathbf{q}_c\mathbf{K}^T \\) with \\( \mathbf{q}_c \\) being the query projection of the currently passed input token which is *always* just a single vector.
> Making use of the key-value cache means that the $\mathbf{QK}^T$ is essentially reduced to $\mathbf{q}_c\mathbf{K}^T$ with $\mathbf{q}_c$ being the query projection of the currently passed input token which is *always* just a single vector.
Using the key-value cache has two advantages:
- Significant increase in computational efficiency as less computations are performed compared to computing the full \\( \mathbf{QK}^T \\) matrix. This leads to an increase in inference speed
- The maximum required memory is not increased quadratically with the number of generated tokens, but only increases linearly.
- Significant increase in computational efficiency as less computations are performed compared to computing the full $\mathbf{QK}^T$ matrix. This leads to an increase in inference speed
- The maximum required memory is not increased quadratically with the number of generated tokens, but only increases linearly.
> One should *always* make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. Transformers has the key-value cache enabled by default when making use of the text pipeline or the [`generate` method](https://huggingface.co/docs/transformers/main_classes/text_generation). We have an entire guide dedicated to caches [here](./kv_cache).
@ -535,10 +538,12 @@ Assistant: Germany has ca. 81 million inhabitants
```
In this chat, the LLM runs auto-regressive decoding twice:
1. The first time, the key-value cache is empty and the input prompt is `"User: How many people live in France?"` and the model auto-regressively generates the text `"Roughly 75 million people live in France"` while increasing the key-value cache at every decoding step.
2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
Two things should be noted here:
1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. for the example above the LLM needs to understand that the user refers to the population when asking `"And how many are in Germany"`.
2. The key-value cache is extremely useful for chat as it allows us to continuously grow the encoded chat history instead of having to re-encode the chat history again from scratch (as e.g. would be the case when using an encoder-decoder architecture).
@ -574,7 +579,7 @@ def bytes_to_megabytes(bytes):
Answer: The function takes a number of bytes as input and returns the number of
```
Great, no additional time is spent recomputing the same key and values for the attention layer! There is however one catch. While the required peak memory for the \\( \mathbf{QK}^T \\) matrix is significantly reduced, holding the key-value cache in memory can become very memory expensive for long input sequences or multi-turn chat. Remember that the key-value cache needs to store the key-value vectors for all previous input vectors \\( \mathbf{x}_i \text{, for } i \in \{1, \ldots, c - 1\} \\) for all self-attention layers and for all attention heads.
Great, no additional time is spent recomputing the same key and values for the attention layer! There is however one catch. While the required peak memory for the $\mathbf{QK}^T$ matrix is significantly reduced, holding the key-value cache in memory can become very memory expensive for long input sequences or multi-turn chat. Remember that the key-value cache needs to store the key-value vectors for all previous input vectors $\mathbf{x}_i \text{, for } i \in \{1, \ldots, c - 1\}$ for all self-attention layers and for all attention heads.
Let's compute the number of float values that need to be stored in the key-value cache for the LLM `bigcode/octocoder` that we used before.
The number of float values amounts to two times the sequence length times the number of attention heads times the attention head dimension and times the number of layers.
@ -598,21 +603,21 @@ Researchers have proposed two methods that allow to significantly reduce the mem
[Multi-Query-Attention](https://huggingface.co/papers/1911.02150) was proposed in Noam Shazeer's *Fast Transformer Decoding: One Write-Head is All You Need* paper. As the title says, Noam found out that instead of using `n_head` key-value projections weights, one can use a single head-value projection weight pair that is shared across all attention heads without that the model's performance significantly degrades.
> By using a single head-value projection weight pair, the key value vectors \\( \mathbf{k}_i, \mathbf{v}_i \\) have to be identical across all attention heads which in turn means that we only need to store 1 key-value projection pair in the cache instead of `n_head` ones.
> By using a single head-value projection weight pair, the key value vectors $\mathbf{k}_i, \mathbf{v}_i$ have to be identical across all attention heads which in turn means that we only need to store 1 key-value projection pair in the cache instead of `n_head` ones.
As most LLMs use between 20 and 100 attention heads, MQA significantly reduces the memory consumption of the key-value cache. For the LLM used in this notebook we could therefore reduce the required memory consumption from 15 GB to less than 400 MB at an input sequence length of 16000.
In addition to memory savings, MQA also leads to improved computational efficiency as explained in the following.
In auto-regressive decoding, large key-value vectors need to be reloaded, concatenated with the current key-value vector pair to be then fed into the \\( \mathbf{q}_c\mathbf{K}^T \\) computation at every step. For auto-regressive decoding, the required memory bandwidth for the constant reloading can become a serious time bottleneck. By reducing the size of the key-value vectors less memory needs to be accessed, thus reducing the memory bandwidth bottleneck. For more detail, please have a look at [Noam's paper](https://huggingface.co/papers/1911.02150).
In auto-regressive decoding, large key-value vectors need to be reloaded, concatenated with the current key-value vector pair to be then fed into the $\mathbf{q}_c\mathbf{K}^T$ computation at every step. For auto-regressive decoding, the required memory bandwidth for the constant reloading can become a serious time bottleneck. By reducing the size of the key-value vectors less memory needs to be accessed, thus reducing the memory bandwidth bottleneck. For more detail, please have a look at [Noam's paper](https://huggingface.co/papers/1911.02150).
The important part to understand here is that reducing the number of key-value attention heads to 1 only makes sense if a key-value cache is used. The peak memory consumption of the model for a single forward pass without key-value cache stays unchanged as every attention head still has a unique query vector so that each attention head still has a different \\( \mathbf{QK}^T \\) matrix.
The important part to understand here is that reducing the number of key-value attention heads to 1 only makes sense if a key-value cache is used. The peak memory consumption of the model for a single forward pass without key-value cache stays unchanged as every attention head still has a unique query vector so that each attention head still has a different $\mathbf{QK}^T$ matrix.
MQA has seen wide adoption by the community and is now used by many of the most popular LLMs:
- [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
- [**PaLM**](https://huggingface.co/papers/2204.02311)
- [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
- [**BLOOM**](https://huggingface.co/bigscience/bloom)
- [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
- [**PaLM**](https://huggingface.co/papers/2204.02311)
- [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
- [**BLOOM**](https://huggingface.co/bigscience/bloom)
Also, the checkpoint used in this notebook - `bigcode/octocoder` - makes use of MQA.

View File

@ -67,6 +67,6 @@ Examples of use can be found in the [example scripts](../examples) or [example n
[[autodoc]] data.data_collator.DataCollatorWithFlattening
# DataCollatorForMultipleChoice
## DataCollatorForMultipleChoice
[[autodoc]] data.data_collator.DataCollatorForMultipleChoice

View File

@ -42,7 +42,3 @@ set this to `False`.
## Pushing to the Hub
[[autodoc]] utils.PushToHubMixin
## Sharded checkpoints
[[autodoc]] modeling_utils.load_sharded_checkpoint

View File

@ -267,6 +267,7 @@ about how many forward passes you inputs are actually going to trigger, you can
independently of the inputs. The caveats from the previous section still apply.
## Pipeline FP16 inference
Models can be run in FP16 which can be significantly faster on GPU while saving memory. Most models will not suffer noticeable performance loss from this. The larger the model, the less likely that it will.
To enable FP16 inference, you can simply pass `dtype=torch.float16` or `dtype='float16'` to the pipeline constructor. Note that this only works for models with a PyTorch backend. Your inputs will be converted to FP16 internally.
@ -334,6 +335,7 @@ Pipelines available for audio tasks include the following.
Pipelines available for computer vision tasks include the following.
### DepthEstimationPipeline
[[autodoc]] DepthEstimationPipeline
- __call__
- all

View File

@ -43,6 +43,7 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
[[autodoc]] AwqConfig
## EetqConfig
[[autodoc]] EetqConfig
## GPTQConfig

View File

@ -50,14 +50,14 @@ several advanced alignment methods which can be used to map between the original
token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
# Multimodal Tokenizer
## Multimodal Tokenizer
Apart from that each tokenizer can be a "multimodal" tokenizer which means that the tokenizer will hold all relevant special tokens
as part of tokenizer attributes for easier access. For example, if the tokenizer is loaded from a vision-language model like LLaVA, you will
be able to access `tokenizer.image_token_id` to obtain the special image token used as a placeholder.
To enable extra special tokens for any type of tokenizer, you have to add the following lines and save the tokenizer. Extra special tokens do not
have to be modality related and can ne anything that the model often needs access to. In the below code, tokenizer at `output_dir` will have direct access
have to be modality related and can be anything that the model often needs access to. In the below code, tokenizer at `output_dir` will have direct access
to three more special tokens.
```python

View File

@ -23,6 +23,7 @@ The video processor extends the functionality of image processors by allowing Vi
When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't updated your VLM, the processor will try to load video related configurations from a file named `preprocessing_config.json`.
### Usage Example
Here's an example of how to load a video processor with [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model:
```python

View File

@ -100,22 +100,29 @@ for label, prob in zip(labels, probs[0]):
- [`AltCLIPProcessor`] combines [`CLIPImageProcessor`] and [`XLMRobertaTokenizer`] into a single instance to encode text and prepare images.
## AltCLIPConfig
[[autodoc]] AltCLIPConfig
## AltCLIPTextConfig
[[autodoc]] AltCLIPTextConfig
## AltCLIPVisionConfig
[[autodoc]] AltCLIPVisionConfig
## AltCLIPModel
[[autodoc]] AltCLIPModel
## AltCLIPTextModel
[[autodoc]] AltCLIPTextModel
## AltCLIPVisionModel
[[autodoc]] AltCLIPVisionModel
## AltCLIPProcessor
[[autodoc]] AltCLIPProcessor

View File

@ -23,6 +23,7 @@ rendered properly in your Markdown viewer.
</div>
# BART
[BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining objectives from BERT and GPT. It's pretrained by corrupting text in different ways like deleting words, shuffling sentences, or masking tokens and learning how to fix it. The encoder encodes the corrupted document and the corrupted text is fixed by the decoder. As it learns to recover the original text, BART gets really good at both understanding and generating language.
You can find all the original BART checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=bart) organization.

View File

@ -38,7 +38,7 @@ The abstract from the paper is the following:
efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating
more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.*
## Usage Tips:
## Usage Tips
- **Dual Model Architecture**: BLT consists of two separate trained models:
- **Patcher (Entropy Model)**: A smaller transformer model that predicts byte-level entropy to determine patch boundaries and segment input.

View File

@ -25,8 +25,7 @@ rendered properly in your Markdown viewer.
## Overview
The Chameleon model was proposed in [Chameleon: Mixed-Modal Early-Fusion Foundation Models
](https://huggingface.co/papers/2405.09818) by META AI Chameleon Team. Chameleon is a Vision-Language Model that use vector quantization to tokenize images which enables the model to generate multimodal output. The model takes images and texts as input, including an interleaved format, and generates textual response. Image generation module is not released yet.
The Chameleon model was proposed in [Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://huggingface.co/papers/2405.09818) by META AI Chameleon Team. Chameleon is a Vision-Language Model that use vector quantization to tokenize images which enables the model to generate multimodal output. The model takes images and texts as input, including an interleaved format, and generates textual response. Image generation module is not released yet.
The abstract from the paper is the following:

View File

@ -39,7 +39,7 @@ The original code can be found [here](https://github.com/neonbjb/tortoise-tts).
3. The use of the [`ClvpModelForConditionalGeneration.generate()`] method is strongly recommended for tortoise usage.
4. Note that the CLVP model expects the audio to be sampled at 22.05 kHz contrary to other audio models which expects 16 kHz.
## Brief Explanation:
## Brief Explanation
- The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio.
- [`ClvpConditioningEncoder`] takes those text tokens and audio representations and converts them into embeddings conditioned on the text and audio.

View File

@ -12,11 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-10-09.*
# Code World Model (CWM)
@ -53,7 +52,8 @@ CWM requires a dedicated system prompt to function optimally during inference. W
configuration, CWM's output quality may be significantly degraded. The following serves as the default
system prompt for reasoning tasks. For agentic workflows, append the relevant tool specifications
after this base prompt. Checkout the original code repository for more details.
```
```text
You are a helpful AI assistant. You always reason before responding, using the following format:
<think>
@ -110,6 +110,7 @@ generated_ids = model.generate(
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
print(tokenizer.decode(output_ids))
```
<details>
<summary>Produces the following output:</summary>

View File

@ -28,6 +28,7 @@ This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber
The original code can be found [here](https://huggingface.co/deepseek-ai/DeepSeek-V2).
### Usage tips
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
## DeepseekV2Config

View File

@ -34,6 +34,7 @@ We are super happy to make this code community-powered, and would love to see ho
- static cache is not supported (this should be just a generation config issue / config shape issues)
### Usage tips
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
You can run the model in `FP8` automatically, using 2 nodes of 8 H100 should be more than enough!

View File

@ -105,7 +105,7 @@ DETR can be naturally extended to perform panoptic segmentation (which unifies s
- The decoder of DETR updates the query embeddings in parallel. This is different from language models like GPT-2, which use autoregressive decoding instead of parallel. Hence, no causal attention mask is used.
- DETR adds position embeddings to the hidden states at each self-attention and cross-attention layer before projecting to queries and keys. For the position embeddings of the image, one can choose between fixed sinusoidal or learned absolute position embeddings. By default, the parameter `position_embedding_type` of [`~transformers.DetrConfig`] is set to `"sine"`.
- During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `auxiliary_loss` of [`~transformers.DetrConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
- If you want to train the model in a distributed environment across multiple nodes, then one should update the _num_boxes_ variable in the _DetrLoss_ class of _modeling_detr.py_. When training on multiple nodes, this should be set to the average number of target boxes across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/detr/blob/a54b77800eb8e64e3ad0d8237789fcbf2f8350c5/models/detr.py#L227-L232).
- If you want to train the model in a distributed environment across multiple nodes, then one should update the *num_boxes* variable in the *DetrLoss* class of *modeling_detr.py*. When training on multiple nodes, this should be set to the average number of target boxes across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/detr/blob/a54b77800eb8e64e3ad0d8237789fcbf2f8350c5/models/detr.py#L227-L232).
- [`~transformers.DetrForObjectDetection`] and [`~transformers.DetrForSegmentation`] can be initialized with any convolutional backbone available in the [timm library](https://github.com/rwightman/pytorch-image-models). Initializing with a MobileNet backbone for example can be done by setting the `backbone` attribute of [`~transformers.DetrConfig`] to `"tf_mobilenetv3_small_075"`, and then initializing the model with that config.
- DETR resizes the input images such that the shortest side is at least a certain amount of pixels while the longest is at most 1333 pixels. At training time, scale augmentation is used such that the shortest side is randomly set to at least 480 and at most 800 pixels. At inference time, the shortest side is set to 800. One can use [`~transformers.DetrImageProcessor`] to prepare images (and optional annotations in COCO format) for the model. Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding. Alternatively, one can also define a custom `collate_fn` in order to batch images together, using [`~transformers.DetrImageProcessor.pad_and_create_pixel_mask`].
- The size of the images will determine the amount of memory being used, and will thus determine the `batch_size`. It is advised to use a batch size of 2 per GPU. See [this Github thread](https://github.com/facebookresearch/detr/issues/150) for more info.
@ -142,7 +142,7 @@ As a summary, consider the following table:
|------|------------------|-----------------------|-----------------------|
| **Description** | Predicting bounding boxes and class labels around objects in an image | Predicting masks around objects (i.e. instances) in an image | Predicting masks around both objects (i.e. instances) as well as "stuff" (i.e. background things like trees and roads) in an image |
| **Model** | [`~transformers.DetrForObjectDetection`] | [`~transformers.DetrForSegmentation`] | [`~transformers.DetrForSegmentation`] |
| **Example dataset** | COCO detection | COCO detection, COCO panoptic | COCO panoptic | |
| **Example dataset** | COCO detection | COCO detection, COCO panoptic | COCO panoptic |
| **Format of annotations to provide to** [`~transformers.DetrImageProcessor`] | {'image_id': `int`, 'annotations': `list[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `list[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `list[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `list[Dict]`} and masks_path (path to directory containing PNG files of the masks) |
| **Postprocessing** (i.e. converting the output of the model to Pascal VOC format) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] |
| **evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_tupes="bbox"` or `"segm"`, `PanopticEvaluator` |

View File

@ -33,6 +33,7 @@ The abstract from the paper is the following:
*Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.*
### Usage tips
The hyperparameters of this model is the same as Llama model.
## DiffLlamaConfig

View File

@ -47,7 +47,7 @@ Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in CO
Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.2 PQ)
and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data).
It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU),
and ranks second on Cityscapes (84.5 mIoU) (no extra data). *
and ranks second on Cityscapes (84.5 mIoU) (no extra data).*
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dilated-neighborhood-attention-pattern.jpg"

View File

@ -178,3 +178,8 @@ print("Pooled output shape:", pooled_output.shape)
[[autodoc]] DINOv3ViTImageProcessorFast
- preprocess
## DINOv3ConvNextBackbone
[[autodoc]] DINOv3ConvNextBackbone
- forward

View File

@ -61,10 +61,10 @@ pipeline(image=image, question="What time is the coffee break?")
# pip install datasets
import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = AutoModelForVision2Seq.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = AutoModelForImageTextToText.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
dataset = load_dataset("hf-internal-testing/example-documents", split="test")
image = dataset[0]["image"]
@ -92,11 +92,11 @@ The example below uses [torchao](../quantization/torchao) to only quantize the w
# pip install datasets torchao
import torch
from datasets import load_dataset
from transformers import TorchAoConfig, AutoProcessor, AutoModelForVision2Seq
from transformers import TorchAoConfig, AutoProcessor, AutoModelForImageTextToText
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = AutoModelForVision2Seq.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa", quantization_config=quantization_config)
model = AutoModelForImageTextToText.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa", quantization_config=quantization_config)
dataset = load_dataset("hf-internal-testing/example-documents", split="test")
image = dataset[0]["image"]
@ -120,7 +120,7 @@ print(answer)
```py
>>> import re
>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
from accelerate import Accelerator
>>> from accelerate import Accelerator
>>> from datasets import load_dataset
>>> import torch
@ -162,9 +162,9 @@ from accelerate import Accelerator
```py
>>> import re
>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
from accelerate import Accelerator
>>> from accelerate import Accelerator
>>> from datasets import load_dataset
>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
>>> import torch
>>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

View File

@ -305,7 +305,6 @@ EdgeTAM can use masks from previous predictions as input to refine segmentation:
... )
```
## EdgeTamConfig
[[autodoc]] EdgeTamConfig

View File

@ -12,13 +12,11 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
-->
*This model was released on 2025-01-13 and added to Hugging Face Transformers on 2025-09-29.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">

View File

@ -61,7 +61,7 @@ message_list = [
]
]
input_dict = processor(
protein_informations, messages_list, return_tensors="pt", text_max_length=512, protein_max_length=1024
protein_inputs, messages_list, return_tensors="pt", text_max_length=512, protein_max_length=1024
)
with torch.no_grad():
generated_ids = hf_model.generate(**input_dict)

View File

@ -28,15 +28,19 @@ The abstract from the original FastSpeech2 paper is the following:
This model was contributed by [Connor Henderson](https://huggingface.co/connor-henderson). The original code can be found [here](https://github.com/espnet/espnet/blob/master/espnet2/tts/fastspeech2/fastspeech2.py).
## 🤗 Model Architecture
FastSpeech2's general structure with a Mel-spectrogram decoder was implemented, and the traditional transformer blocks were replaced with conformer blocks as done in the ESPnet library.
#### FastSpeech2 Model Architecture
### FastSpeech2 Model Architecture
![FastSpeech2 Model Architecture](https://www.microsoft.com/en-us/research/uploads/prod/2021/04/fastspeech2-1.png)
#### Conformer Blocks
![Conformer Blocks](https://www.researchgate.net/profile/Hirofumi-Inaguma-2/publication/344911155/figure/fig2/AS:951455406108673@1603856054097/An-overview-of-Conformer-block.png)
#### Convolution Module
![Convolution Module](https://d3i71xaburhd42.cloudfront.net/8809d0732f6147d4ad9218c8f9b20227c837a746/2-Figure1-1.png)
## 🤗 Transformers Usage

View File

@ -70,8 +70,8 @@ from transformers import AutoProcessor, Florence2ForConditionalGeneration
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
model = Florence2ForConditionalGeneration.from_pretrained("microsoft/Florence-2-base", dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base")
model = Florence2ForConditionalGeneration.from_pretrained("florence-community/Florence-2-base", dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("florence-community/Florence-2-base")
task_prompt = "<OD>"
inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(model.device)
@ -105,12 +105,12 @@ from transformers import AutoProcessor, Florence2ForConditionalGeneration, BitsA
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = Florence2ForConditionalGeneration.from_pretrained(
"microsoft/Florence-2-large",
"florence-community/Florence-2-base",
dtype=torch.bfloat16,
device_map="auto",
quantization_config=quantization_config
)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large")
processor = AutoProcessor.from_pretrained("florence-community/Florence-2-base")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

View File

@ -33,7 +33,7 @@ this model, including [Alternating Updates][altup] (AltUp), [Learned Augmented R
[MatFormer][matformer], Per-Layer Embeddings (PLE), [Activation Sparsity with Statistical Top-k][spark-transformer], and KV cache sharing. The language model uses
a similar attention pattern to [Gemma 3](./gemma3) with alternating 4 local sliding window self-attention layers for
every global self-attention layer with a maximum context length of 32k tokens. Gemma 3n introduces
[MobileNet v5][mobilenetv5] as the vision encoder, using a default resolution of 768x768 pixels, and adds a newly
MobileNet v5 as the vision encoder, using a default resolution of 768x768 pixels, and adds a newly
trained audio encoder based on the [Universal Speech Model][usm] (USM) architecture.
The instruction-tuned variant was post-trained with knowledge distillation and reinforcement learning.

View File

@ -37,7 +37,6 @@ We evaluated GLM-4.6 across eight public benchmarks covering agents, reasoning,
For more eval results, show cases, and technical details, please visit our [technical blog](https://z.ai/blog/glm-4.6).
### GLM-4.5
The [**GLM-4.5**](https://huggingface.co/papers/2508.06471) series models are foundation models designed for intelligent agents, MoE variants are documented here as Glm4Moe.

View File

@ -101,6 +101,7 @@ Below is an expected speedup diagram that compares pure inference time between t
</div>
## Using Scaled Dot Product Attention (SDPA)
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
@ -123,6 +124,7 @@ On a local benchmark (rtx3080ti-16GB, PyTorch 2.2.1, OS Ubuntu 22.04) using `flo
following speedups during training and inference.
### Training
| Batch size | Seq len | Time per batch (Eager - s) | Time per batch (SDPA - s) | Speedup (%) | Eager peak mem (MB) | SDPA peak mem (MB) | Mem saving (%) |
|-----------:|-----------:|---------------------------:|-----------------------------:|------------:|--------------------:|-------------------:|------------------:|
| 1 | 128 | 0.024 | 0.019 | 28.945 | 1789.95 | 1789.95 | 0 |
@ -142,6 +144,7 @@ following speedups during training and inference.
| 4 | 2048 | OOM | 0.731 | / | OOM | 12705.1 | SDPA does not OOM |
### Inference
| Batch size | Seq len | Per token latency Eager (ms) | Per token latency SDPA (ms) | Speedup (%) | Mem Eager (MB) | Mem SDPA (MB) | Mem saved (%) |
|--------------:|-------------:|--------------------------------:|-------------------------------:|---------------:|------------------:|----------------:|-----------------:|
| 1 | 128 | 6.569 | 5.858 | 12.14 | 974.831 | 974.826 | 0 |

View File

@ -41,7 +41,7 @@ The example below demonstrates how to generate text with [`Pipeline`] or the [`A
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
```python
import torch
from transformers import pipeline
pipeline = pipeline(task="text-generation",
@ -52,7 +52,7 @@ pipeline("人とAIが協調するためには、")
</hfoption>
<hfoption id="AutoModel">
```py
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
@ -112,6 +112,7 @@ visualizer("<img>What is shown in this image?")
</div>
## Resources
Refer to the [Training a better GPT model: Learnings from PaLM](https://medium.com/ml-abeja/training-a-better-gpt-2-93b157662ae4) blog post for more details about how ABEJA trained GPT-NeoX-Japanese.
## GPTNeoXJapaneseConfig

View File

@ -35,8 +35,8 @@ The abstract from the paper is the following:
*<INSERT PAPER ABSTRACT HERE>*
Tips:
- **Attention Sinks with Flex Attention**: When using flex attention, attention sinks require special handling. Unlike with standard attention implementations where sinks can be added directly to attention scores, flex attention `score_mod` function operates on individual score elements rather than the full attention matrix. Therefore, attention sinks renormalization have to be applied after the flex attention computations by renormalizing the outputs using the log-sum-exp (LSE) values returned by flex attention.
- **Attention Sinks with Flex Attention**: When using flex attention, attention sinks require special handling. Unlike with standard attention implementations where sinks can be added directly to attention scores, flex attention `score_mod` function operates on individual score elements rather than the full attention matrix. Therefore, attention sinks renormalization have to be applied after the flex attention computations by renormalizing the outputs using the log-sum-exp (LSE) values returned by flex attention.
<INSERT TIPS ABOUT MODEL HERE>

View File

@ -79,6 +79,8 @@ When token_type_ids=None or all zero, it is equivalent to regular causal mask
for example:
>>> x_token = tokenizer("アイウエ")
```text
input_ids: | SOT | SEG | ア | イ | ウ | エ |
token_type_ids: | 1 | 0 | 0 | 0 | 0 | 0 |
prefix_lm_mask:
@ -88,8 +90,11 @@ SEG | 1 1 0 0 0 0 |
イ | 1 1 1 1 0 0 |
ウ | 1 1 1 1 1 0 |
エ | 1 1 1 1 1 1 |
```
>>> x_token = tokenizer("", prefix_text="アイウエ")
```text
input_ids: | SOT | ア | イ | ウ | エ | SEG |
token_type_ids: | 1 | 1 | 1 | 1 | 1 | 0 |
prefix_lm_mask:
@ -99,8 +104,11 @@ SOT | 1 1 1 1 1 0 |
ウ | 1 1 1 1 1 0 |
エ | 1 1 1 1 1 0 |
SEG | 1 1 1 1 1 1 |
```
>>> x_token = tokenizer("ウエ", prefix_text="アイ")
```text
input_ids: | SOT | ア | イ | SEG | ウ | エ |
token_type_ids: | 1 | 1 | 1 | 0 | 0 | 0 |
prefix_lm_mask:
@ -110,6 +118,7 @@ SOT | 1 1 1 0 0 0 |
SEG | 1 1 1 1 0 0 |
ウ | 1 1 1 1 1 0 |
エ | 1 1 1 1 1 1 |
```
### Spout Vector

View File

@ -22,6 +22,7 @@ rendered properly in your Markdown viewer.
</div>
## Overview
The [Granite Speech](https://huggingface.co/papers/2505.08699) model ([blog post](https://www.ibm.com/new/announcements/ibm-granite-3-3-speech-recognition-refined-reasoning-rag-loras)) is a multimodal language model, consisting of a speech encoder, speech projector, large language model, and LoRA adapter(s). More details regarding each component for the current (Granite 3.2 Speech) model architecture may be found below.
1. Speech Encoder: A [Conformer](https://huggingface.co/papers/2005.08100) encoder trained with Connectionist Temporal Classification (CTC) on character-level targets on ASR corpora. The encoder uses block-attention and self-conditioned CTC from the middle layer.

View File

@ -39,14 +39,14 @@ It supports the following languages: English, French, German, Italian, Portugues
<!-- This section describes the evaluation protocols and provides the results. -->
#### Testing Data
### Testing Data
<!-- This should link to a Dataset Card if possible. -->
The model was evaluated on MMLU, TriviaQA, NaturalQuestions, ARC Easy & Challenge, Open Book QA, Common Sense QA,
Physical Interaction QA, Social Interaction QA, HellaSwag, WinoGrande, Multilingual Knowledge QA, FLORES 200.
#### Metrics
### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->

View File

@ -24,9 +24,7 @@ rendered properly in your Markdown viewer.
## Overview
The IDEFICS model was proposed in [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
](https://huggingface.co/papers/2306.16527
) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh
The IDEFICS model was proposed in [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh
The abstract from the paper is the following:

View File

@ -215,13 +215,16 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- forward
## Idefics2ImageProcessor
[[autodoc]] Idefics2ImageProcessor
- preprocess
## Idefics2ImageProcessorFast
[[autodoc]] Idefics2ImageProcessorFast
- preprocess
## Idefics2Processor
[[autodoc]] Idefics2Processor
- __call__

View File

@ -77,13 +77,16 @@ This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts)
- forward
## Idefics3ImageProcessor
[[autodoc]] Idefics3ImageProcessor
- preprocess
## Idefics3ImageProcessorFast
[[autodoc]] Idefics3ImageProcessorFast
- preprocess
## Idefics3Processor
[[autodoc]] Idefics3Processor
- __call__

View File

@ -63,11 +63,6 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
[[autodoc]] InstructBlipVideoVideoProcessor
- preprocess
## InstructBlipVideoImageProcessor
[[autodoc]] InstructBlipVideoImageProcessor
- preprocess
## InstructBlipVideoVisionModel
[[autodoc]] InstructBlipVideoVisionModel
@ -79,6 +74,7 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
- forward
## InstructBlipVideoModel
[[autodoc]] InstructBlipVideoModel
- forward

View File

@ -105,6 +105,7 @@ This example demonstrates how to perform inference on a single image with the In
```
### Text-only generation
This example shows how to generate text using the InternVL model without providing any image input.
```python
@ -134,6 +135,7 @@ This example shows how to generate text using the InternVL model without providi
```
### Batched image and text inputs
InternVL models also support batched image and text inputs.
```python
@ -177,6 +179,7 @@ InternVL models also support batched image and text inputs.
```
### Batched multi-image input
This implementation of the InternVL models supports batched text-images inputs with different number of images for each text.
```python
@ -220,6 +223,7 @@ This implementation of the InternVL models supports batched text-images inputs w
```
### Video input
InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.
```python
@ -259,6 +263,7 @@ InternVL models can also handle video inputs. Here is an example of how to perfo
```
### Interleaved image and video inputs
This example showcases how to handle a batch of chat conversations with interleaved image and video inputs using chat template.
```python

View File

@ -14,6 +14,7 @@ rendered properly in your Markdown viewer.
-->
*This model was released on 2020-04-30 and added to Hugging Face Transformers on 2023-06-20.*
# Jukebox
<div class="flex flex-wrap space-x-1">

View File

@ -16,6 +16,7 @@ rendered properly in your Markdown viewer.
*This model was released on 2025-06-17 and added to Hugging Face Transformers on 2025-06-25.*
# Kyutai Speech-To-Text
## Overview
[Kyutai STT](https://kyutai.org/next/stt) is a speech-to-text model architecture based on the [Mimi codec](https://huggingface.co/docs/transformers/en/model_doc/mimi), which encodes audio into discrete tokens in a streaming fashion, and a [Moshi-like](https://huggingface.co/docs/transformers/en/model_doc/moshi) autoregressive decoder. Kyutai's lab has released two model checkpoints:

View File

@ -36,7 +36,7 @@ in vision transformers. As a result, we propose LeVIT: a hybrid neural network f
We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of
application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable
to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect
to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU. *
to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/levit_architecture.png"
alt="drawing" width="600"/>

View File

@ -12,11 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-10-07.*
# Lfm2Moe
@ -24,7 +23,7 @@ limitations under the License.
LFM2-MoE is a Mixture-of-Experts (MoE) variant of [LFM2](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38). The LFM2 family is optimized for on-device inference by combining shortrange, inputaware gated convolutions with groupedquery attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.
LFM2MoE keeps this fast backbone and introduces sparse MoE feedforward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).
LFM2MoE keeps this fast backbone and introduces sparse MoE feedforward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).
## Example

View File

@ -88,16 +88,16 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
import torch
from PIL import Image
import requests
processor = AutoImageProcessor.from_pretrained("ETH-CVG/lightglue_superpoint")
model = AutoModel.from_pretrained("ETH-CVG/lightglue_superpoint")
# LightGlue requires pairs of images
images = [image1, image2]
inputs = processor(images, return_tensors="pt")
with torch.inference_mode():
outputs = model(**inputs)
# Extract matching information
keypoints0 = outputs.keypoints0 # Keypoints in first image
keypoints1 = outputs.keypoints1 # Keypoints in second image
@ -112,7 +112,7 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
# Process outputs for visualization
image_sizes = [[(image.height, image.width) for image in images]]
processed_outputs = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
for i, output in enumerate(processed_outputs):
print(f"For the image pair {i}")
for keypoint0, keypoint1, matching_score in zip(
@ -147,6 +147,13 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
- post_process_keypoint_matching
- visualize_keypoint_matching
## LightGlueImageProcessorFast
[[autodoc]] LightGlueImageProcessorFast
- preprocess
- post_process_keypoint_matching
- visualize_keypoint_matching
## LightGlueForKeypointMatching
[[autodoc]] LightGlueForKeypointMatching

View File

@ -436,11 +436,6 @@ model = Llama4ForConditionalGeneration.from_pretrained(
[[autodoc]] Llama4TextModel
- forward
## Llama4ForCausalLM
[[autodoc]] Llama4ForCausalLM
- forward
## Llama4VisionModel
[[autodoc]] Llama4VisionModel

View File

@ -25,8 +25,7 @@ rendered properly in your Markdown viewer.
## Overview
The LLaVa-NeXT-Video model was proposed in [LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. LLaVa-NeXT-Video improves upon [LLaVa-NeXT](llava_next) by fine-tuning on a mix if video and image dataset thus increasing the model's performance on videos.
The LLaVa-NeXT-Video model was proposed in [LLaVA-NeXT: A Strong Zero-shot Video Understanding Model](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. LLaVa-NeXT-Video improves upon [LLaVa-NeXT](llava_next) by fine-tuning on a mix if video and image dataset thus increasing the model's performance on videos.
[LLaVA-NeXT](llava_next) surprisingly has strong performance in understanding video content in zero-shot fashion with the AnyRes technique that it uses. The AnyRes technique naturally represents a high-resolution image into multiple images. This technique is naturally generalizable to represent videos because videos can be considered as a set of frames (similar to a set of images in LLaVa-NeXT). The current version of LLaVA-NeXT makes use of AnyRes and trains with supervised fine-tuning (SFT) on top of LLaVA-Next on video data to achieves better video understanding capabilities.The model is a current SOTA among open-source models on [VideoMME bench](https://huggingface.co/papers/2405.21075).
@ -248,10 +247,6 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained(
[[autodoc]] LlavaNextVideoProcessor
## LlavaNextVideoImageProcessor
[[autodoc]] LlavaNextVideoImageProcessor
## LlavaNextVideoVideoProcessor
[[autodoc]] LlavaNextVideoVideoProcessor

View File

@ -171,6 +171,7 @@ Below is an expected speedup diagram that compares pure inference time between t
</div>
## Using Scaled Dot Product Attention (SDPA)
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)

View File

@ -39,7 +39,7 @@ attractive option for long-document NLP tasks.
The abstract from the paper is the following:
*The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models. *
*The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.*
This model was contributed by [mnaylor](https://huggingface.co/mnaylor).
The original code can be found [here](https://github.com/facebookresearch/mega).

View File

@ -186,5 +186,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- forward
## MiniMaxForQuestionAnswering
[[autodoc]] MiniMaxForQuestionAnswering
- forward

View File

@ -223,5 +223,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- forward
## MixtralForQuestionAnswering
[[autodoc]] MixtralForQuestionAnswering
- forward

View File

@ -54,7 +54,7 @@ model.set_output_embeddings(resized_embeddings)
## Usage Example
#### Instruct model
### Instruct model
```python
import torch
@ -67,7 +67,7 @@ processor = AutoProcessor.from_pretrained(model_id)
messages = [
[
{
"role": "user",
"role": "user",
"content": [
{"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
{"type": "text", "text": "What does the image show?"}
@ -80,7 +80,7 @@ output = model.generate(**inputs, max_new_tokens=25)
print(processor.decode(output[0]))
```
#### Base model
### Base model
```python
import requests
@ -113,6 +113,10 @@ print(processor.decode(output[0], skip_special_tokens=True))
[[autodoc]] MllamaImageProcessor
## MllamaImageProcessorFast
[[autodoc]] MllamaImageProcessorFast
## MllamaForConditionalGeneration
[[autodoc]] MllamaForConditionalGeneration
@ -132,11 +136,6 @@ print(processor.decode(output[0], skip_special_tokens=True))
[[autodoc]] MllamaModel
## MllamaForCausalLM
[[autodoc]] MllamaForCausalLM
- forward
## MllamaVisionModel
[[autodoc]] MllamaVisionModel

View File

@ -316,6 +316,7 @@ with torch.no_grad():
Different LID models are available based on the number of languages they can recognize - [126](https://huggingface.co/facebook/mms-lid-126), [256](https://huggingface.co/facebook/mms-lid-256), [512](https://huggingface.co/facebook/mms-lid-512), [1024](https://huggingface.co/facebook/mms-lid-1024), [2048](https://huggingface.co/facebook/mms-lid-2048), [4017](https://huggingface.co/facebook/mms-lid-4017).
#### Inference
First, we install transformers and some other libraries
```bash

View File

@ -99,7 +99,6 @@ print(f"The predicted class label is:{predicted_class_label}")
[[autodoc]] MobileViTConfig
## MobileViTImageProcessor
[[autodoc]] MobileViTImageProcessor

View File

@ -64,11 +64,11 @@ Note that each timestamp - i.e each codebook - gets its own set of Linear Layers
It's the audio encoder from Kyutai, that has recently been integrated to transformers, which is used to "tokenize" audio. It has the same use that [`~EncodecModel`] has in [`~MusicgenModel`].
## Tips:
## Tips
The original checkpoints can be converted using the conversion script `src/transformers/models/moshi/convert_moshi_transformers.py`
### How to use the model:
### How to use the model
This implementation has two main aims:
@ -152,7 +152,7 @@ Once it's done, you can simply forward `text_labels` and `audio_labels` to [`Mos
A training guide will come soon, but user contributions are welcomed!
### How does the model forward the inputs / generate:
### How does the model forward the inputs / generate
1. The input streams are embedded and combined into `inputs_embeds`.

View File

@ -50,7 +50,7 @@ MusicGen Melody is compatible with two generation modes: greedy and sampling. In
Transformers supports both mono (1-channel) and stereo (2-channel) variants of MusicGen Melody. The mono channel versions generate a single set of codebooks. The stereo versions generate 2 sets of codebooks, 1 for each channel (left/right), and each set of codebooks is decoded independently through the audio compression model. The audio streams for each channel are combined to give the final stereo output.
#### Audio Conditional Generation
### Audio Conditional Generation
The model can generate an audio sample conditioned on a text and an audio prompt through use of the [`MusicgenMelodyProcessor`] to pre-process the inputs.

View File

@ -40,7 +40,3 @@ The original code can be found [here](https://github.com/tomlimi/MYTE).
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## MyT5Tokenizer
[[autodoc]] MyT5Tokenizer

Some files were not shown because too many files have changed in this diff Show More