Compare commits

..

78 Commits

Author SHA1 Message Date
7e236b275d Apply style fixes 2025-11-13 16:19:12 +00:00
afa5b2b9d5 fix 2025-11-13 17:15:26 +01:00
6f6095e0cf Refactor weight loading (#41580)
* ah actually we don't discard lm head if missing -> needs to be moved to correct device and etc

* fix some tests

* small fixes

* up

* up

* dik why we tie weights twice but,..,,.

* ups

* removeunused

* fix hunyuan

* small fix

* nits

* ish

* up

* rev

* fix more tie weights keys

* small fixes

* nit

* update

* fix and fix

* fix a test

* glubs

* current shitty changes

* ship validated ones

* more

* more update

* more

* more

* more

* mllama

* more up

* fix ernie

* fix xopies

* up more

* more fixes

* up

* up

* fix-copies

* fix more

* more updates

* AI UPDATE

* up

* hoey

* make it fast

* fix

* lol

* fix asjusting

* more fixes

* _dtype nit

* up

* nit

* update

* update

* remove semaphores

* fix import to avoid jit execution

* try to remove custom tiing logic when its stupid

* fix more individual models

* fix whisper as well

* fix?

* fox umt5

* improve tqdm bar

* cleanup a bit

* oupsi

* some updates

* improve

* remove all buffering -> much faster without it

* remove some tie_weights custome funcs when not needed

* more fixes related to strict matching regex

* remove ALL custom tie weights

* small update

* revert change to init scheme (no need for params)

* mixtral init

* try less strict source check

* tied weight first shot to the fiiiixxxxxx

* does this help?

* :)

* fix some ppolry defined tied_weights_keys for now

* subclass nn.Parameters

* up

* lol

* Ouiiii

* fix led

* fix long cat flash

* fix qwen and long cat flash

* properly fix qwen init

* just push this for now

* propnet is dumb

* update

* push

* remove explict sharing of some tied keys.

* update decoder.bias

* moe case

* more changes to untangle old hardcoded ting

* fixup

* fix big faileurs

* fix prophnet

* fix resize token embeddings

* nits

* fix xcodex

* asyncio?

* fix smart apply

* fix data-2-vec

* [build-ci-image]

* checkout

* uupdate

* fix hunyuan

* update error message

* fix deformable detr

* fixes

* fix init weights for non param gate up projs

* shared todo?

* update some models

* big revert, don't break this behaviour

* ty @SunMarc this fixes the buffers

Co-authored-by: SunMarc <SunMarc@users.noreply.github.com>

* mt5 fuck

* fix lxmbert

* nuke slow test fetcher

* fix zamba and deepcopy for now

* fix zamba tied weight keys! ~

* fix-copies

* update fetch terst

* fix gradient for test modeling common!

* break "shared" for now I will fix tomorrow changes are properly isoalted now :)

* does this fix marian? probably not

* fix some vlms

* D fine seems to handle this well

* glob is fine actually

* fix dab detr

* small steps

* opusy

* fix some more models?

* yups

* better erro

* fix?

* fix double escape

* escape wehere it makes sense

* ??

* fix ibert

* fix tvp as well

* more fxes

* try always download ref PR

* ONONONO

* big fixup

* more fixup

* small step

* small nits

* nits

* brut force some stuff

* fix vilt

* make sure special models that always need tie always tie

* cleaning up

* small nits

* fix zamba and bridge tower!

* just fixup

* potential culprits

* revert bark and fix bridgetower

* remove now non existant tie_weights

* ?

* lol reformer actually had nothing tied!

* wow these two fucking models were really not well made

* fix sam family!

* fix bark revision

* fix speech2test ?

* push this for now....

* upsy

* the fuck

* fix rtdetr

* update

* proper

* wow that one 's annoying

* update

* try to find the culprit

* get some help on common

* nit about general init and cls.padding_idx

* revert num workers update

* remove old loading func

* fix glob

* add annotations

* fix re

* small improvements

* clean some stuff

* improvements

* someone did not understannnnnnd what I tried to dooo or does BNB not support that either?

* gluos

* fix case when `.` is just not there

* remove unused arg

* recover orignal parameter/buffer using _original

* fix glob issu

* this?

* deepspeed best-effort

* remove unused stuff

* Update tie weight keys as they were just wroong

Co-authored-by: Benjamin Bossan <benjaminbossan@users.noreply.github.com>"

* up

* augustuc clauss, a gloubs gloups gloubs

* fixup

* fixup

* there was fucking typo

* mrain

* nits

* fix marian 3 remaining tests

* one more

* fix some of the copies, not all :)

* small cleanup

* one propertest

* fix core model loadig tes

* attempt a new test

* fix some of the annoying tests by supporting reading .bin sometimes

* push

* push more small fixes

* remove 1 useless test

* up

* fix audio flamingo post rebase

* fixup

* some small updatess

* fix sam models

* nits

* up

* updates

* onem ore

* skip this stupid test

* some other fixes

* fixup

* update

* skip more offloaded stuff

* oups

* ups

* update mixtral

* skip this one

* LET"SGO

* fixup

* rope delta order

* fix csm

* small nit

---------

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: SunMarc <SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <marc@huggingface.co>
2025-11-13 17:12:52 +01:00
c4cfc2e023 [TP] Fix parameter detection issue and some invalid TP-plans (#42129)
* fix

* add test

* fix test

* fix the obvious

* more fix

* fix

* continue to improve

* more fix

* more

* fix

* fix

* finally

* CI
2025-11-13 15:44:56 +01:00
5c6d6bed4d [PEFT] Fix the general test for prefix tuning (#42185)
fix
2025-11-13 14:40:01 +00:00
80134e6e66 Update transformers to support FqnToConfig (#41894)
* Update transformers to support `FqnToConfig`

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* add case for modulefqn

* remove comment

* update tests

* cleanup

* update

* wip

* wip

* update quantizer_torchao for module default

* fix underscore

* update tests

* update

* fix import error

* fix import

* import change not included in previous commit

* Apply suggestion from @MekkCyber

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>

* Update src/transformers/quantizers/quantizer_torchao.py

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>

* update tests and add comment

* fix test

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-11-13 13:34:45 +00:00
ce40ca0d4c [typo] fix mrope-interleave annotation to avoid ambiguity (#42177)
* fix typo

* fix qwen3omni
2025-11-13 13:07:14 +00:00
6408d3b01a [PEFT] Fix prefix tuning (#41696)
* fix

* simplify

* add my 2 cents
2025-11-13 11:58:26 +01:00
f40ef03214 Remove unnecessary slicing in sdpa_attention_forward (#41900)
Remove redundant slicing in sdpa_attention_forward

The slicing in sdpa_attention_forward was there only because some masks were not constructed correctly (I was told). When the dimension is dynamic, the slice op also prevents torch.export from correctly reasoning about its size.

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-11-13 10:29:38 +01:00
5150dac727 Fix helper fn for new processor config format (#42085)
* fix the helper fn for new processor config format

* change the priority order

* maybe we need to explicitly load and then decide

* Apply suggestions from code review

Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>

* use helper fn for json decoding

---------

Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
2025-11-13 10:06:24 +01:00
27c3807991 [Pop2Piano] Fix cache usage (#42170)
fix
2025-11-13 09:19:56 +01:00
ffb35fe142 Fix in-place modification of user-input in SAM2 embed boxes (#42173)
* Do not modify boxes tensor in-place
2025-11-12 15:52:33 -05:00
1fd63dd532 Docs/i18n updates (#42006)
* docs(i18n): Update translations and terminology in Traditional and Simplified Chinese documentation\n\n- Correct the translation explanation of "Trainer" in Traditional Chinese\n- Update the terminology for "Named Entity Recognition" in Simplified Chinese\n- Adjust the display format of confidence values\n- Unify the expression for installation pages

* docs(i18n): Update translations and terminology in Traditional and Simplified Chinese documentation

- Correct the translation explanation of "Trainer" in Traditional Chinese
- Update the terminology for "Named Entity Recognition" in Simplified Chinese
- Adjust the display format of confidence values
- Unify the expression for installation pages

* docs(i18n): update Simplified Chinese and Traditional Chinese README files

update content to reflect latest library features and usage examples
add new model badges and installation instructions
improve overall structure and clarity
2025-11-12 09:36:10 -08:00
240d19f4a3 pin pytest<9 for now (#42162)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-11-12 18:10:49 +01:00
ba938fa590 fix(ci): unexpected keyword argument streaming (#42102)
* debug(ci): run `pwd` to check what we're working with

* fix(ci): `ls -lR`

* fix(ci): remove working directory which should not be there?

* fix(cb): make sure memory is freed when calling `stop`

* fix(ci): effectively clear cache

* fix(ci): reduce memory safety margin

* refactor(cb): add fixme note on default safety margin value
2025-11-12 17:29:09 +01:00
6744ebe745 Add slow to some examples tests (#42164)
* slow

* remove
2025-11-12 15:08:57 +00:00
1709ed96e4 [models] Add AudioFlamingo3 integration (#40290)
* Audio Flamingo 3 initial integration

* Added local Qwen

* Moving to AF3

* Loading directly from HF

* Formatting

* add snapshot_download

* Loading from hub

* Import gating

* Pass audio arrays directly

* Remove requires_backend

* Move constants to config.json

* Remove redundancies

* Separate tokenizer, cleaner from_pretrained

* Remove LlavaMetaModel

* Remove sound tower wrapper

* Merged BasicSoundEncoder

* Some improvements

* Towards AudioFlamingo3

* Migrate LlavaConfig

* Merge LlavaMetaForCausalLM into AudioFlamingo3ForConditionalGeneration

* Remove redundant lines

* Add AudioFlamingo3PreTrainedModel

* Unified model.safetensors

* Inline MM projector

* Tokenizer in root dir

* Default processor from_pretrained

* Remove tokenizer from modeling

* Added types

* Cleanup

* Docs & license

* device handling

* Change year

* Remove redundant methods

* Use BatchFeature

* Streamline audio feature handling

* Batch inference

* Reorder alphabetically

* Make style check

* Make fixup

* Avoid calls to separate functions

* Remove forward_tower()

* Rename encode_sound to get_audio_features for clarity

* Add batch decoding method to AudioFlamingo3Processor

* Use tensors instead of lists

* Move end embed token eval

* Prepare audio_features_mask in the processor

* No hardcoded 750 and 3000

* Remove _load_sound_mask completely and use WhisperFeatureExtractor

* Compute embeddings separately

* MM Projector is audio adaptor

* Simplify AudioFlamingo3Config initialization with default encoder_config

* Add modular

* Clean up

* make fixup

* Cleanup processing, add params to encoder config

* Remove redundant methods

* update config references, improve method names, and enhance logging in processor

* processor: move FE args to audio_kwargs, use common_kwargs for return_tensors

* Qwen-like processor

* Simplified AudioFlamingo3Processor

* Extract common code from generate() and forward()

* Add conversion script for AudioFlamingo3 to Hugging Face format

* Use save_pretrained()

* Don't overwrite gen config

* Use AutoTokenizer and FE to convert the processor

* minor formatting

* Finalize processor, do token expansion inside

* AudioFlamingo3: refactor docs, types, and audio–text feature merge

* AudioFlamingo3 Docs

* Add AudioFlamingo3Processor to AutoProcessor

* Processor tests

* Use audio_config instead of encoder_config

* Add audio_token_id to config

* Cleanup & new keys

* Add links

* Improved processor

* Handle conversational input

* Make processing consistent.

* Add fallback for no sound token, default left padding.

* Cleanup

* Replace manual 4D mask with masking_utils; dtype/device from inputs

* Text only mode

* Finalize processor

* Export processor directly

* Add push_to_hub to converter

* Add model_input_names property to AudioFlamingo3Processor to pass tests

* Processor chat template support

* Added Jinja processor chat template with audio support

* Processor tests

* Model tests

* Added docs

* Don't use common_kwargs in __call__

* Pass 'test_left_padding_compatibility' by never treating padding as content

* Updated docs

* Cleanup docs

* Standardization

* Update conversion script weight mapping.

* Flatten _build_square_attn_mask

* Make style

* Small dim and attn mask fix

* Fix processor padding side bug

* Error handling in converter

* Use position_ids

* Cleanup generation config

* Use precomputed position embeddings in AudioFlamingo3 encoder

* Added usage examples

* Fix generation config

* Integration tests

* Simplify modeling and shift part of mask preparation to processor. And update tests.

* Updated docs

* ASR convenience method

* Fixed tests

* make fixup

* Shift encoder mask preparation to the encoder's forward.

* Change to HF profiles.

* Integration test standardization.

* Clean up before integration test setup.

* Remove strict float32, more similar to Qwen2Audio.

* Use HF dataset links

* Keep weights in BF16

* New audio in tests

* Processor conventions.

* Standardize audio token expansion in processor.

* Add 'strip_prefix' to batch_decode

* Batch decode nits.

* Remove dtype casting.

* Read token ids from tokenizer

* diverse changes according to review

* add training example

* Add missing docstring.

* Fix typos.

* Add audio token docstring.

* Fix fill type.

* Fix docs

* Save converted weights in bf16

* Fix tests

* Keep model in bf16 for tests.

* Update expected results for single.

* Fix integration tests from runner.

* Update reproducer, and dtype nits.

---------

Co-authored-by: Eric B <ebezzam@gmail.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>
2025-11-12 15:24:09 +01:00
fd36275be2 handle inputs from Siglip/Siglip2 non-automapped encoder layers (#41930)
* handle inputs from non-automapped encoder layers

* correct inheritance + protect executorch

* fixup

* fix tests

* missing docstring

* attn support

* fix initialization

* reorder/simplify

* flag test as broken

* minor changes

* modulaaar
2025-11-12 13:58:44 +00:00
922e85487b feat(kernels): add opt-out flag to disable kernels hub usage through the lib (#41990)
* feat(kernels): add opt-out flag to disable kernels hub usage through the lib

* misc(quality): style

* test(kernels): add some opt-out test logic

* chore(quality): style here we go again

* chore(quality): style here we go again ... again

* chore(quality): STYLE
2025-11-12 14:46:55 +01:00
f9e668abf3 update torchao doc (#42139)
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-11-12 13:17:38 +00:00
7951105d69 add rmsnorm kernels support for npu (#42106) 2025-11-12 13:58:35 +01:00
58a3f8caac fix test failure of speculative_generation on xpu (#42052)
* fix test failure of speculative_generation on xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* code refine

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* address review comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-11-12 12:33:41 +00:00
fcea1e1fe0 Fixes Flash Attention implementation for models (#42149)
* flash-att3 fix for smolvlm2

* flash-att3 fix for idefics2

* idefics2 changes

* reset idefics2
2025-11-12 13:33:15 +01:00
563f2ffb21 fix failure of tests/models/shieldgemma2/test_modeling_shieldgemma2.p… (#42022)
* fix failure of tests/models/shieldgemma2/test_modeling_shieldgemma2.py::ShieldGemma2IntegrationTest::test_model

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

* add common kwargs in processing_shieldgemma2.py

---------

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
2025-11-12 10:59:50 +01:00
6f479d5d75 extend test_beam_search_early_stop_heuristic case to other device (#42078)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-11-12 10:14:51 +01:00
d012f34e0d add xpu to valid hardware for torch.compile (#42079)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-11-12 10:12:05 +01:00
e76364d5c1 fix: improve visibility of ValueError root causes in model config loading (#41972)
* Make config util exception handling more informative when ValueError is raised

* transform ValueError into OSError and propagate upstream instead

---------

Co-authored-by: Scott Zhang <scottzh@fb.com>
2025-11-12 09:51:43 +01:00
2b8068c306 T5 migration to new masking interface (#41804)
* Refactor: migrate T5 attention masking to masking_utils interface

* Refactor: migrate T5 attention masking to masking_utils interface

* create_bidirectional_mask function with appropriate paramaters

* create_bidirectional_mask function with appropriate paramaters

* fixup executorch + import

* revert causal masks

* rm executorch stuff

* add causal mask with non vmap

* copies

* remove unnecessary import

---------

Co-authored-by: Vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
2025-11-11 18:10:12 +00:00
33c60a5254 [T5Gemma] Fix cross attention cache (#41890)
* fix

* add test

* style

* added comment
2025-11-11 18:08:34 +00:00
fa22b56903 🚨 Fix gradient checkpointing for several models and improve test robustness (#41818)
* Implement gradient checkpointing in GPTBigCode

Support for gradient checkpointing was lost in the major refactoring in PR #38635
and this is the attempt to re-add it.

I extended the tests to
- test `use_reentrant=True` and `False`
- make sure `model.train` is called so that gradient checkpointing works;
  this is a limiation of the tests currently used by GPTBigCode
- make sure that one (the first) gradient checkpointing layer is called
- make sure that the same non-zero grads are there for normal and checkpointing
  runs - this is something we tripped over before in PEFT due to the possibly
  incompletely stored runtime environment in the checkpointed forward step,
  see also peft#2826

Note that the invocation of `GPTBigCodeBlock.forward` has changed:

- `layer_past` is now passed as a keyword argument so that
  `GradientCheckpointingLayer.__call__` can see and filter this parameter
  (`use_reentrant=False` fails otherwise)
- `{encoder_}hidden_states` are still passed as positional arguments
  so that `torch.utils.checkpoint.checkpoint` receives them as pos. args
  and computes gradients for these (kwargs would be filtered by
  `GradientCheckpointingLayer`).

* Improve gradient checkpointing tests

- Compare that the non-zero gradients in a reference run are present in the checkpointing run
- Make sure that the forward of at least one gradient checkpointing layer is actually called
  more than once (as expected during gradient checkpointing backward)

Currently there are some problems with Bert-derived MultipleChoice models, when dropout is
enabled there are scenarios during gradient checkpointing where `classifier.bias.grad` is None.
I don't yet have a good explanation for this, disabling dropout resolves this. I would have
understood, if it is dropout on the classification layer but enabling attention dropout is
also leading to this behavior.

MoE models have selective sparsity depending on the selected experts, for this reason we
only compare gradients on parameters collected on the reference backward run.

* Remove duplicated gradient checkpointing code

* Address review comments

* Make test output consistent

* GradientCheckpointingLayer for xlstm, zamba, zamba2

* GradientCheckpointingLayer for swiftformer

also drop janus from ignore list - only the VQVAE case is without
gradient checkpointing and it is doubtful that it is usefule in that
case. Training with gradient checkpointing is not tested anyway.

* Make an exception for CLVP

The implementation of GradientCheckpointingLayers is not trivial and may break behavior
that was previously expected. Therefore we keep it as-is for now.

* Remove unneeded exceptions

---------

Co-authored-by: nemo <git@ningu.net>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
2025-11-11 18:13:38 +01:00
f30c22500b Fix logic error in prepare_inputs_for_generation cache slicing condition (#41764)
Fix logic error in cache slicing condition

Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
2025-11-11 16:52:38 +00:00
496c283615 Add dinov3 autobackbone (#41276)
* feat: Add DINOv3 support to AutoBackbone [DRAFT]

- Implement DINOv3ViTConfig, DINOv3ViTModel, and DINOv3ViTBackbone
- Add DINOv3 to MODEL_FOR_BACKBONE_MAPPING_NAMES
- Support get_intermediate_layers for Facebook compatibility
- Enable multi-scale feature extraction for detection/segmentation

Note: Tests and documentation coming in follow-up commits
Addresses #40323

* Updated import structure of get_aligned_output_features_output_indices

* Added test for DINOv3ViTBackbone

* Add DINOv3ViTBackbone to model documentation

* Refactored the code to adhere to the Transformers principles

* Generated modeling_dinov3_vit.py

* DINOv3ViT backbone: keep hidden_states with return_dict=False, add @check_model_inputs and polish docs

- Add @check_model_inputs to DINOv3ViTBackbone.forward to normalize flags and enable output recording.
- Preserve hidden_states when return_dict=False by appending them to the tuple output when requested.
- Clean up config docstring formatting (consistent indentation and use list[...] types).

* Restructure DINOv3 backbone and update its tests

* Resolved merge conflicts

* Resolved failing testcase

* Fix DINOv3 backbone to use self.norm for feature maps

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
2025-11-11 11:22:06 -05:00
df45a92cea Enforce check_auto_docstring (#41635)
fix issues and enforce check_auto_docstring
2025-11-11 16:05:55 +00:00
3ff0e69f84 Avoid mutating user-provided arguments in preprocessing utils (#42126)
* fix: deepcopy kwargs to avoid mutating user-provided arguments

* mod: shallow copy instead of deepcopy

---------

Co-authored-by: Leonardo Emili <lemili@apple.com>
2025-11-11 14:52:07 +00:00
31839d741a Bugfix/remove emojis from print (#42091)
* Removed initial batch of emojis from strings and print statements.

* Removed many of the emojis that are printed to the client.

* Removed the Hugging Face emoji from the arg-parse help descriptions.

* Removed even more emojis.

* Used cursor to do a pass on top of my manual pass to remove run-time emojis. I did a manual verification of the results.

* Reverting some changes that were not necessary.

* Used ruff to format files.
2025-11-11 14:38:38 +00:00
2072f3059e DataCollatorForLanguageModeling warning error fixed (#42144)
* warning error fixed

* Update src/transformers/data/data_collator.py

---------

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
2025-11-11 14:22:20 +00:00
3760afb21c Fix T5Gemma module structure (#42145)
* fix modular

* oupsi typo
2025-11-11 12:26:03 +01:00
3c0b2b101e fix: improve video processing fps assignment logic (#42009)
* fix: improve video processing fps and do_sample_frames assignment logic

* fix: set return_metadata=True to get metadata

* reformat the modular file

* fix typo

* revert flag change and fix fps assignment

* Taking 'num_frames' into considered.

Avoid error when 'num_frames' is passed rather than 'fps'.

* fix

* fix: avoid potential reference before assignment error

* fix

* add 'sample_fps' to 'VideoMetadata'

* fix missing comma

* fix trailing whitespace

* Handle different 'sample_indices_fn'

* Cleaning white space

* import callable from collections.abc

* calculate sampled_fps using indices

* correct the order

* fix

* properly check  value in kwargs

* handle sampled_fps as property

* remove duplicated definition

* fix

* fix

* add safety check

---------

Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
2025-11-11 10:54:33 +01:00
e869e9df54 update deps table (#42120)
* update deps table

* [build-ci-image]

* [build-ci-image]

* [push-ci-image]
2025-11-11 09:23:58 +01:00
37d48bbb48 Remove unused functions in image_transforms.py (#42044)
* up

* make style

* Update trimaps logic

* fix typo

* Revert changes

---------

Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
2025-11-10 16:55:57 +00:00
21913b2e10 Fix MaskFormer/Mask2Former fast image processors (#41393)
* Merge conflict

* add fast processor

* add fast processor

* make style

* add new convert rgb

* use nested group by shape in mllama fast, add support for multiple inputs in group by shape

* fix maskformer mask2 former fast im proc and add tests

* refactor after review

* add _iterate_items utility

* Fix failing tests

* fix copies and improve docs

---------

Co-authored-by: Vincent <phamvinh257@gmail.com>
2025-11-10 16:48:10 +00:00
f028e9340c Fix model name test for compressed tensors (#42128)
* fix models

* fix output
2025-11-10 16:22:46 +01:00
4dd4a8fafe Fix GPT-2 Flash Attention 2 generation with left-padding (#41966)
* Fix GPT-2 Flash Attention 2 generation with left-padding

* repo consistency

* define is_causal in init

* fix
2025-11-10 16:14:30 +01:00
03538a80be [Attn Masks] Non-vmap default for attention masks (#41852)
* atmpt 1

* fixup masking to work correctly with old torch

* few changes to make things a bit more cleaner

* oopsie

* fix integer overflow on bidirectional masks via indexing fn

* rm executorch workarounds --> still need to handle on sliding etc fns properly

* typo

* docs, fix older torch inplace issue, proper kwarg handling

* chunked works with non vmap and older torch, add warning on non guaranteed masks

* lift unnecessary restriction on older torch

* simplify a few things, restrict torch < 2.6 to non-vmap (for now)

* try fix

* remove unnecessary slicing logic

* remove legacy func

* harmonize slightly

---------

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-11-10 16:04:21 +01:00
700c48a29f fix qwen2vl/qwen3vl video processor temporal padding when num_frames%temporal_patch_size!=1 (#42083)
* qwen3vl video process padding video frames

* add two video processor test cases

* fix typo

* down test image size

* fix qwen2vl video processor t padding

* delete padding when num_frames < temporal_patch_size

* to default format

* fix smart_resize in qwen3vl
2025-11-10 14:36:21 +00:00
18a19dea61 📝 docs(smolvlm): fix variable name in batch inference example (#42123)
Fixed incorrect variable name on line 162. The code creates a list called
'conversations' (plural) but incorrectly references 'conversation' (singular)
in the apply_chat_template call, which would cause a NameError.

This fixes the batch mixed media inference example to use the correct
variable name, allowing the code to run without errors.
2025-11-10 13:48:25 +00:00
dba6aeb1e3 fix continuous batching issues, extend ut cases to xpu (#41830)
* extend conrinuous batching cases to xpu

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* fix style

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* fix style

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-11-10 13:03:52 +00:00
1c9077f66d Fix base model prefix in VLMs (#42059)
* fix base model prefix

* it is now defined
2025-11-10 11:59:41 +00:00
756742354b [Bugfix] fix qwen3vl expand generation with video (#42089)
fix qwen3vl expand generation with video and add
2025-11-10 10:06:06 +01:00
926c37aaf4 Correctly handle unbatched audio inputs in Gemma3nAudioFeatureExtractor (#42076)
* Correctly handle unbatched audio inputs in Gemma3nAudioFeatureExtractor

* Simplify the logic for batching the unbatched speech input in Gemma3nAudioFeatureExtractor
2025-11-10 08:44:56 +00:00
f5630f9b1a Fix return metadata checking logic (#42108)
fix return_metadata_checking_logic
2025-11-10 08:23:43 +00:00
e8a6eb3304 Revert "permissions worflows fix" (#42110)
Revert "permissions worflows fix (#42080)"

This reverts commit 08f52e2178a0bada437da02ed7c1395ae54b3309.
2025-11-08 16:21:02 +01:00
370fc65ee5 add xpu support in test_modeling_janus.py::JanusIntegrationTest::test… (#41986)
* add xpu support in test_modeling_janus.py::JanusIntegrationTest::test_model_generate_images

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix ci issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-11-08 14:17:21 +01:00
f065e402fc [v5] 🚨Refactor subprocessors handling in processors (#41633)
* remove attributes and add all missing sub processors to their auto classes

* remove all mentions of .attributes

* cleanup

* fix processor tests

* fix modular

* remove last attributes

* fixup

* fixes after merge

* fix wrong tokenizer in auto florence2

* fix missing audio_processor + nits

* Override __init__ in NewProcessor and change hf-internal-testing-repo (temporarily)

* fix auto tokenizer test

* add init to markup_lm

* update CustomProcessor in custom_processing

* remove print

* nit

* fix test modeling owlv2

* fix test_processing_layoutxlm

* Fix owlv2, wav2vec2, markuplm, voxtral issues

* add support for loading and saving multiple tokenizer natively

* remove exclude_attributes from save_pretrained

* modifs after review
2025-11-07 12:57:33 -05:00
91d250efb1 Reinstate self.scaling in Gemma3nTextAttention (#41751)
maintenance: make Gemma3nTextAttention more amenable to modular inheritance
2025-11-07 18:35:54 +01:00
7cb4280112 Fix Auto classes to support dynamically registered processors (#41865) 2025-11-07 18:31:36 +01:00
144c8ce280 Fix modular docstring for Mixtral (#42041)
* Fix modular docstring for Mixtral

* fixes all docstrings
2025-11-07 16:49:50 +00:00
069684ef87 feat(ci): add continuous batching to benchmarks (#41916)
* feat(ci): add continuous batching to benchmarks

* refactor(ci): PR comments

* refactor(cb): when stopping, block by default

* fix(benchmarks): `stream` -> `streaming`

* fix(benchmarks): invalid configuration when cb has attn_impl == sdpa

* tests(cb): fix attn impl

* fix(benchmarks): update `get_throughput` formula

* fix(benchmarks): prevent version conflicts and ensure proper cleanup in continuous batching (#42063)

* Initial plan

* fix(benchmarks): ensure proper cleanup and remove transformers from requirements

- Remove transformers from benchmark_v2/requirements.txt to prevent version conflicts
- Add try-finally block to ensure ContinuousBatchingManager.stop() is always called
- This fixes TypeError about unexpected 'streaming' argument and prevents OOM from improper cleanup

Co-authored-by: McPatate <9112841+McPatate@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: McPatate <9112841+McPatate@users.noreply.github.com>

* fix(benchmarks): raise the exception on failure instead of ignoring

we catch the exception later on and raising it here helps debugging
because it will be logged

* test(cb): comment out failing tests for now

added a `FIXME` mark

* fix(benchmarks): revert `finally` removal but keep raising exception

* test(cb): fix missing `require_read_token` import

* refactor(benchmarks): error if no benchmarks were run

* refactor(benchmarks): change default lvls of cb bench config

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: McPatate <9112841+McPatate@users.noreply.github.com>
2025-11-07 16:23:27 +00:00
a127710b3a 4.1V Model and GLM-4.5V Model Conversion Code Updates (#41784)
* update for new model convert

* Update convert_glm4v_moe_mgt_weights_to_hf.py

* restore

* Update convert_glm4v_mgt_weights_to_hf.py

* update

* 1

* Update convert_glm4v_moe_mgt_weights_to_hf.py

* Update convert_glm4v_mgt_weights_to_hf.py

* finish

* update

* 2

* 2

* 1

* Update convert_glm4v_moe_mgt_weights_to_hf.py

* update

* update with tie_word_embeddings place
2025-11-07 10:34:57 +00:00
08f52e2178 permissions worflows fix (#42080)
- add new workflow to scan permissions github_token really need and advise pernmissions
- add actions-permissions/monitor on almost all worklows
=> the goal is to define properly all permissions blocks by jobs

# Conflicts:
#	.github/workflows/check-workflow-permissions.yml

Co-authored-by: Pauline <pauline@Paulines-MacBook-Pro-2.local>
2025-11-07 10:37:26 +01:00
c790403039 QwenVL: add skipped keys in setattr as well (#41808)
add the keys in setattr as well
2025-11-07 09:47:32 +01:00
8012f80f72 Fix inconsistency of commit sha during the workflow run (#42074)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-11-06 22:32:27 +01:00
7b325cd573 Fix security issue 5 (#42072)
fix

Co-authored-by: Pauline <pauline@Paulines-MacBook-Pro-2.local>
2025-11-06 19:50:59 +01:00
a9e2b80c71 add workflow to check permissions and advise a set of permissions req… (#42071)
add workflow to check permissions and advise a set of permissions required

Co-authored-by: Pauline <pauline@Paulines-MacBook-Pro-2.local>
2025-11-06 18:55:01 +01:00
bc8b0b0541 fix tensor device placement issue of 2 UT cases (#41921)
fix tensor device placement issue

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-11-06 15:38:58 +00:00
cbd83bf161 Fix logic in setting self.fsdp when it is False (#41974)
Co-authored-by: Roy Chan <roy@Roys-MacBook-Air.local>
2025-11-06 21:05:22 +05:30
8f242a45b1 [deepspeed tests fixes] (#41925)
* [deepspeed tests fixes]

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* simplify

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-11-06 16:24:22 +01:00
1c2e50a72e Fix missing arg in check_docstring (#42054)
fix missing arg in check_docstring
2025-11-06 09:38:04 -05:00
5aa7dd07da Revert back to use GitHub context (#42066)
* check

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-11-06 14:41:58 +01:00
5c1df12fa5 Fix KeyError in _is_package_available for packages with dotted names (#42050)
Fixes #41981

When checking for packages with dotted names like 'optimum.quanto', the
function was crashing with KeyError because these packages aren't always
in PACKAGE_DISTRIBUTION_MAPPING (it might only have 'optimum' as a key,
not 'optimum.quanto').

The function already had fallback logic to handle cases where package
metadata can't be found - it tries to import the package directly. But
the exception handler only caught PackageNotFoundError, not KeyError.

This adds KeyError to the exception handler so the fallback works for
both cases.

Co-authored-by: Yashwant Bezawada <yashwantbezawada@gmail.com>
2025-11-06 13:32:43 +00:00
fe5ca9ddaa Fix KeyError in GPT-OSS weight conversion script (#42007)
* Fix KeyError in GPT-OSS weight conversion script

* Fix code quality: remove trailing whitespace

* Address review: use pop with defaults, remove commented code
2025-11-06 12:43:39 +00:00
76fea9b482 Fix another Argument list too long in pr_slow_ci_suggestion.yml (#42061)
* fix

* trigger

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-11-06 13:33:23 +01:00
8a96f5fbe8 Be careful at explicit checkout actions (#42060)
final

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-11-06 11:01:06 +01:00
21ecb5b0b8 Fix transformers serve following chat template output update 2025-11-06 09:57:25 +01:00
40929ec4fd Annoying typo in attention error message (#42037)
annoying typo
2025-11-06 08:57:17 +00:00
17fdaf9b7a Avoid explicit checkout in workflow (#42057)
* remove explicit checkout

* check 1

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-11-06 09:31:20 +01:00
85c50557b9 Fix Qwen3-Omni RoPE (#41778)
* fix qwen rope

* not sure it fits in runners, let's see

* fix some tests

* use input ids device everywhere

* fix rope in wav2code and the test

* I remember pushing these changes yesterday
2025-11-06 09:30:39 +01:00
32e49f2884 Fix AutoImageProcessor.register and documentation in auto processing modules (#41864) 2025-11-06 07:43:07 +00:00
927 changed files with 15675 additions and 10674 deletions

View File

@ -46,8 +46,8 @@ jobs:
- run: uv pip install -U -e .
- run: echo 'export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)"' >> "$BASH_ENV" && source "$BASH_ENV"
- run: mkdir -p test_preparation
- run: python utils/tests_fetcher.py | tee tests_fetched_summary.txt
- run: python utils/tests_fetcher.py --filter_tests
- run: python utils/tests_fetcher.py | tee tests_fetched_summary.txt || true
- run: python utils/tests_fetcher.py --filter_tests || true
- run: export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)" && echo $GIT_COMMIT_MESSAGE && python .circleci/create_circleci_config.py --fetcher_folder test_preparation
- run: |
if [ ! -s test_preparation/generated_config.yml ]; then
@ -98,8 +98,8 @@ jobs:
- run: uv pip install -U -e .
- run: echo 'export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)"' >> "$BASH_ENV" && source "$BASH_ENV"
- run: mkdir -p test_preparation
- run: python utils/tests_fetcher.py --fetch_all | tee tests_fetched_summary.txt
- run: python utils/tests_fetcher.py --filter_tests
- run: python utils/tests_fetcher.py --fetch_all | tee tests_fetched_summary.txt || true
- run: python utils/tests_fetcher.py --filter_tests || true
- run: export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)" && echo $GIT_COMMIT_MESSAGE && python .circleci/create_circleci_config.py --fetcher_folder test_preparation
- run: |
if [ ! -s test_preparation/generated_config.yml ]; then

View File

@ -32,16 +32,15 @@ jobs:
options: --gpus all --privileged --ipc host
steps:
- name: Get repo
uses: actions/checkout@v4
uses: actions/checkout@v5
with:
ref: ${{ github.event.pull_request.head.sha || github.sha }}
fetch-depth: 1
- name: Install benchmark script dependencies
run: python3 -m pip install -r benchmark_v2/requirements.txt kernels
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e ".[torch]" && python3 -m pip uninstall -y torchvision # temp fix
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e ".[torch]"
- name: Run benchmark
run: |

View File

@ -0,0 +1,23 @@
---
name: Check Permissions Advisor
on:
workflow_dispatch:
inputs:
workflow_name:
description: 'Workflow file name'
type: string
run_count:
description: 'Number of runs to analyze'
type: string
default: "10"
jobs:
advisor:
uses: huggingface/security-workflows/.github/workflows/permissions-advisor-reusable.yml@main
permissions:
actions: read
contents: read
with:
workflow_name: ${{ inputs.workflow_name }}
run_count: ${{ fromJSON(inputs.run_count) }}

View File

@ -125,7 +125,7 @@ jobs:
const { data: merge_commit } = await github.rest.repos.getCommit({
owner: pr.base.repo.owner.login,
repo: pr.base.repo.name,
ref: pr.merge_commit_sha,
ref: '${{ inputs.commit_sha }}',
});
core.setOutput('merge_commit_base_sha', merge_commit.parents[0].sha);

View File

@ -3,7 +3,7 @@ name: CodeQL Security Analysis
on:
push:
branches: ["main"]
branches: ["main", "fix_security_issue_*"]
# pull_request:
# branches: ["main"]
workflow_dispatch:

View File

@ -28,12 +28,14 @@ jobs:
with:
fetch-depth: "0"
# We need to use `${{ ... }}` here to avoid `Argument list too long` error when a PR changes a lot of files.
# (We could also try to use artifact approach, but it's more involved).
# `CodeQL` doesn't identify any security issue here. Also `PR_FILES` is from `get-pr-info.yml` by using an api
# `github.rest.pulls.listFiles`, which is fine.
- name: Write pr_files file
env:
PR_FILES: ${{ needs.get-pr-info.outputs.PR_FILES }}
run: |
cat > pr_files.txt << EOF
$PR_FILES
cat > pr_files.txt << 'EOF'
${{ needs.get-pr-info.outputs.PR_FILES }}
EOF
- name: Get repository content

View File

@ -6,7 +6,7 @@ on:
- cron: "17 2 * * *"
push:
branches:
- check_cleanup_workflow
- run_nvidia_ci*
workflow_dispatch:
inputs:
prev_workflow_run_id:
@ -23,7 +23,7 @@ on:
# Used for `push` to easily modify the target workflow runs to compare against
env:
prev_workflow_run_id: "19056134459"
prev_workflow_run_id: ""
other_workflow_run_id: ""
@ -52,10 +52,84 @@ jobs:
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_models_gpu
slack_report_channel: "#transformers-ci-dummy"
slack_report_channel: "#transformers-ci-daily-models"
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
runner_type: "a10"
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
torch-pipeline:
name: Torch pipeline CI
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_pipelines_torch_gpu
slack_report_channel: "#transformers-ci-daily-pipeline-torch"
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
example-ci:
name: Example CI
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_examples_gpu
slack_report_channel: "#transformers-ci-daily-examples"
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
trainer-fsdp-ci:
name: Trainer/FSDP CI
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_trainer_and_fsdp_gpu
slack_report_channel: "#transformers-ci-daily-training"
docker: huggingface/transformers-all-latest-gpu
runner_type: "a10"
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
deepspeed-ci:
name: DeepSpeed CI
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_torch_cuda_extensions_gpu
slack_report_channel: "#transformers-ci-daily-training"
docker: huggingface/transformers-pytorch-deepspeed-latest-gpu
ci_event: Daily CI
working-directory-prefix: /workspace
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
quantization-ci:
name: Quantization CI
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_quantization_torch_gpu
slack_report_channel: "#transformers-ci-daily-quantization"
docker: huggingface/transformers-quantization-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
kernels-ci:
name: Kernels CI
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_kernels_gpu
slack_report_channel: "#transformers-ci-daily-kernels"
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit

View File

@ -67,7 +67,7 @@ jobs:
if: contains(fromJSON('["run_models_gpu", "run_trainer_and_fsdp_gpu", "run_quantization_torch_gpu"]'), inputs.job)
strategy:
matrix:
machine_type: [aws-g5-4xlarge-cache]
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
@ -136,7 +136,7 @@ jobs:
strategy:
fail-fast: false
matrix:
machine_type: [aws-g5-4xlarge-cache]
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
slice_id: ${{ fromJSON(needs.setup.outputs.slice_ids) }}
uses: ./.github/workflows/model_jobs.yml
with:
@ -157,7 +157,7 @@ jobs:
strategy:
fail-fast: false
matrix:
machine_type: [aws-g5-4xlarge-cache]
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
slice_id: [0, 1]
uses: ./.github/workflows/model_jobs.yml
with:
@ -177,7 +177,7 @@ jobs:
strategy:
fail-fast: false
matrix:
machine_type: [aws-g5-4xlarge-cache]
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
@ -322,7 +322,7 @@ jobs:
strategy:
fail-fast: false
matrix:
machine_type: [aws-g5-4xlarge-cache]
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
@ -427,7 +427,7 @@ jobs:
fail-fast: false
matrix:
folders: ${{ fromJson(needs.setup.outputs.quantization_matrix) }}
machine_type: [aws-g5-4xlarge-cache]
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
@ -586,11 +586,10 @@ jobs:
runs-on: ubuntu-22.04
needs: [setup, run_models_gpu]
steps:
# Checkout in order to run `utils/extract_warnings.py`. Avoid **explicit** checkout (i.e. don't specify `ref`) for
# security reason.
- name: Checkout transformers
uses: actions/checkout@v4
with:
fetch-depth: 2
ref: ${{ inputs.commit_sha || github.sha }}
- name: Install transformers
run: pip install transformers

View File

@ -49,7 +49,8 @@ jobs:
- uses: actions/checkout@v4
with:
fetch-depth: 2
ref: ${{ inputs.commit_sha || github.sha }}
# Security: checkout to the `main` branch for untrusted triggers (issue_comment, pull_request_target), otherwise use the specified ref
ref: ${{ (github.event_name == 'issue_comment' || github.event_name == 'pull_request_target') && 'main' || (inputs.commit_sha || github.sha) }}
- uses: actions/download-artifact@v4

View File

@ -45,6 +45,7 @@ repo-consistency:
python utils/check_modular_conversion.py
python utils/check_dummies.py
python utils/check_repo.py
python utils/check_init_weights_data.py
python utils/check_inits.py
python utils/check_pipeline_typing.py
python utils/check_config_docstrings.py

View File

@ -1,6 +1,5 @@
gpustat==1.1.1
psutil==6.0.0
psycopg2==2.9.9
torch>=2.4.0
hf_xet
pandas>=1.5.0
pandas>=1.5.0

View File

@ -36,6 +36,7 @@ class BenchmarkConfig:
warmup_iterations: int = 5,
measurement_iterations: int = 20,
gpu_monitoring: bool = True, # NOTE: you may want to disable this at times as we have obsvered it could heavily slow down benchmarks on AMD
continuous_batching: bool = False,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
@ -51,6 +52,7 @@ class BenchmarkConfig:
self.warmup_iterations = warmup_iterations
self.measurement_iterations = measurement_iterations
self.gpu_monitoring = gpu_monitoring
self.continuous_batching = continuous_batching
# Input parameters
self.batch_size = batch_size
self.sequence_length = sequence_length
@ -85,6 +87,22 @@ class BenchmarkConfig:
if is_fa:
logger.warning("Flash attention does not support compile mode. Turning off compile mode.")
self.compile_mode = None
# Handle SDPA backend if not determined by the config (needs to be done before skipping duplicates)
if self.attn_implementation == "sdpa" and self.sdpa_backend is None:
default_backend = "flash_attention" # FIXME: torch has a _cur_sdpa_kernel_backends but it fails
logger.warning(f"No SDPA backend provided, using {default_backend} instead.")
self.sdpa_backend = default_backend
if self.continuous_batching:
if self.attn_implementation == "flex_attention":
logger.error(
"disabling continuous batching because of invalid configuration: flex attention is not supported"
)
self.continuous_batching = False
elif self.attn_implementation == "sdpa" and self.sdpa_backend is not None:
logger.warning(
"when continuous batching is enabled, sdpa_backend must be None because of the attention mask, setting it to None"
)
self.sdpa_backend = "math"
@property
def hash(self) -> str:
@ -100,6 +118,7 @@ class BenchmarkConfig:
attn_code += f"_{self.sdpa_backend}" if self.attn_implementation == "sdpa" else ""
compile_str = f"compiled_{self.compile_mode}" if self.compile_mode is not None else "uncompiled"
kernelize_str = "kernelized" if self.kernelize else "unkernelized"
continuous_batching_str = "cb" if self.continuous_batching else "generate"
sep = "-"
else:
iter_str = f"{self.warmup_iterations} warmup, {self.measurement_iterations} iterations"
@ -109,8 +128,11 @@ class BenchmarkConfig:
attn_code += f" with {self.sdpa_backend} backend" if self.attn_implementation == "sdpa" else ""
compile_str = "compiled" if self.compile_mode is not None else "not compiled"
kernelize_str = "kernelized" if self.kernelize else "not kernelized"
continuous_batching_str = "continuous batching" if self.continuous_batching else "regular generate"
sep = ", "
return sep.join([iter_str, gpu_monitor_str, dimensions_str, attn_code, compile_str, kernelize_str])
return sep.join(
[iter_str, gpu_monitor_str, dimensions_str, attn_code, compile_str, kernelize_str, continuous_batching_str]
)
def to_dict(self) -> dict[str, Any]:
return {
@ -118,6 +140,7 @@ class BenchmarkConfig:
"warmup_iterations": self.warmup_iterations,
"measurement_iterations": self.measurement_iterations,
"gpu_monitoring": self.gpu_monitoring,
"continuous_batching": self.continuous_batching,
"batch_size": self.batch_size,
"sequence_length": self.sequence_length,
"num_tokens_to_generate": self.num_tokens_to_generate,
@ -134,6 +157,7 @@ class BenchmarkConfig:
warmup_iterations=data.get("warmup_iterations", 5),
measurement_iterations=data.get("measurement_iterations", 20),
gpu_monitoring=data.get("gpu_monitoring", False),
continuous_batching=data.get("continuous_batching", False),
batch_size=data.get("batch_size", 1),
sequence_length=data.get("sequence_length", 128),
num_tokens_to_generate=data.get("num_tokens_to_generate", 128),
@ -191,15 +215,17 @@ def get_config_by_level(level: int) -> list[BenchmarkConfig]:
# Usually there is not much to gain by compiling with other modes, but we allow it for level 4
compile_modes = BenchmarkConfig.all_compiled_modes if level >= 4 else [None, "default"]
for cm in compile_modes:
for kernelize_on in [False, KERNELIZATION_AVAILABLE]:
configs.append(
BenchmarkConfig(
attn_implementation=attn_implementation,
sdpa_backend=sdpa_backend,
compile_mode=cm,
kernelize=kernelize_on,
for kernelize_on in {False, KERNELIZATION_AVAILABLE}:
for cb_on in [False, True]:
configs.append(
BenchmarkConfig(
attn_implementation=attn_implementation,
sdpa_backend=sdpa_backend,
compile_mode=cm,
kernelize=kernelize_on,
continuous_batching=cb_on,
)
)
)
return configs
# Otherwise, we add the configs for the given level
if level >= 0:
@ -207,8 +233,10 @@ def get_config_by_level(level: int) -> list[BenchmarkConfig]:
if level >= 1:
configs.append(BenchmarkConfig(attn_implementation="flash_attention_2"))
configs.append(BenchmarkConfig(attn_implementation="eager", compile_mode="default"))
configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", continuous_batching=True))
if level >= 2:
configs.append(BenchmarkConfig(attn_implementation="sdpa", compile_mode="default"))
configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default", kernelize=True))
configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", kernelize=True))
configs.append(BenchmarkConfig(attn_implementation="paged|sdpa", continuous_batching=True))
return configs

View File

@ -117,8 +117,6 @@ def flush_memory():
# Clear CUDA cache
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
gc.collect()
@ -234,8 +232,9 @@ class BenchmarkRunner:
self.logger.info(f"Running benchmark scenario: {config.name}")
# Quick validation: try one measurement first to see if this scenario works
generate_fn = self.time_generate_batch if config.continuous_batching else self.time_generate
flush_memory()
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = self.time_generate(
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = generate_fn(
max_new_tokens=1, gpu_monitor=None
)
if e2e_latency < 0:
@ -245,14 +244,14 @@ class BenchmarkRunner:
# Warmup runs
self.logger.info(f"Warming up with {config.warmup_iterations} iterations...")
for _ in trange(config.warmup_iterations):
_ = self.time_generate(max_new_tokens=config.num_tokens_to_generate)
_ = generate_fn(max_new_tokens=config.num_tokens_to_generate)
self.logger.info("Warmup over.")
# Measurement runs
result = BenchmarkResult()
self.logger.info(f"Benchmarking with {config.measurement_iterations} iterations.")
for _ in trange(config.measurement_iterations):
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = self.time_generate(
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = generate_fn(
max_new_tokens=config.num_tokens_to_generate,
gpu_monitor=(GPUMonitor(logger=self.logger) if config.gpu_monitoring else None),
)
@ -274,6 +273,58 @@ class BenchmarkRunner:
"config": config,
}
# TODO: refactor `generate_batch` to handle streaming so we can use it here
def time_generate_batch(
self,
max_new_tokens: int,
gpu_monitor: GPUMonitor | None = None,
) -> tuple[float, list[float], str, GPURawMetrics | None]:
if gpu_monitor is not None:
gpu_monitor.start()
config = GenerationConfig(
max_new_tokens=max_new_tokens,
eos_token_id=self.tokenizer.eos_token_id,
pad_token_id=self.tokenizer.pad_token_id,
do_sample=True,
)
manager = self.model.init_continuous_batching(config)
manager.start()
try:
first_req_results = []
timestamps = []
wall_time_0 = time.perf_counter()
inputs = self.inputs["input_ids"].tolist()
manager.add_requests(inputs, max_new_tokens=max_new_tokens, streaming=True)
first_req_id = None
num_requests = len(inputs)
finished_requests = 0
while finished_requests < num_requests:
# NOTE: I don't like having the extra if stmt here, but hopefully won't degrade perf too much
result = manager.get_result()
if result:
timestamps.append(time.perf_counter() - wall_time_0)
if result.is_finished():
finished_requests += 1
if first_req_id is None:
first_req_id = result.request_id
if result.request_id == first_req_id:
first_req_results.append(result)
else:
if not manager.is_running():
raise RuntimeError("Generation thread exited unexpectedly")
wall_time_1 = time.perf_counter()
gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor is not None else None
decoded_output = self.tokenizer.decode(
[res.generated_tokens[0] for res in first_req_results], skip_special_tokens=True
)
shape_and_decoded_output = f"{(1, len(first_req_results))} | {decoded_output}"
e2e_latency = wall_time_1 - wall_time_0
return e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics
except Exception as e:
raise e
finally:
manager.stop()
def time_generate(
self,
max_new_tokens: int,
@ -339,12 +390,6 @@ class BenchmarkRunner:
n_configs = len(benchmark_configs)
for i, config in enumerate(benchmark_configs):
# Handle SDPA backend if not determined by the config (needs to be done before skipping duplicates)
if config.attn_implementation == "sdpa" and config.sdpa_backend is None:
default_backend = "flash_attention" # FIXME: torch has a _cur_sdpa_kernel_backends but it fails
self.logger.warning(f"No SDPA backend provided, using {default_backend} instead.")
config.sdpa_backend = default_backend
# Skip if already run
if config.hash in all_results:
self.logger.info(f"Skipping duplicate config {config.name} for model {model_id} ({i + 1}/{n_configs})")
@ -368,21 +413,27 @@ class BenchmarkRunner:
self.cleanup()
self.save_results(model_id, all_results, timestamp=timestamp)
if len(all_results) < 1:
raise RuntimeError("No benchmark was run succesfully")
if pretty_print_summary:
print()
print("=" * 100)
print(f"Finished benchmarks in {time.perf_counter() - start_time:.2f} seconds")
print(f"Total number of benchmarks: {len(all_results)}")
if len(all_results) > 0:
print("First run metadata:")
first_key = list(all_results.keys())[0]
first_metadata = all_results[first_key]["metadata"].to_dict()
hardware_info = first_metadata.pop("hardware_info")
pretty_print_dict(first_metadata | hardware_info, tabs=1)
print("First run metadata:")
first_key = list(all_results.keys())[0]
first_metadata = all_results[first_key]["metadata"].to_dict()
hardware_info = first_metadata.pop("hardware_info")
pretty_print_dict(first_metadata | hardware_info, tabs=1)
for result in all_results.values():
print("=" * 100)
print(f"Config: {result['config'].infer_name(compact=False)}\n")
result["measurements"].pprint(batch_size=result["config"].batch_size, tabs=1)
result["measurements"].pprint(
batch_size=result["config"].batch_size,
num_generated_tokens=result["config"].num_tokens_to_generate,
tabs=1,
)
print("=" * 100)
return (timestamp, all_results)

View File

@ -36,16 +36,17 @@ def add_unit_to_duration(stats: dict[str, float]) -> dict[str, str]:
return stats
def equalize_lengths_and_collate(stats: list[dict[str, str]]) -> list[str]:
def equalize_lengths_and_collate(stats: dict[str, dict[str, str]]) -> dict[str, str]:
"""Note: This operation is destructive as it will update values in place before returning a new correctly formatted dict"""
keys = ["avg", "std", "min", "med", "max", "p95"]
for key in keys:
max_length = max(len(stat[key]) for stat in stats)
for stat in stats:
max_length = max(len(stat[key]) for stat in stats.values())
for stat in stats.values():
stat[key] = stat[key].ljust(max_length, " ")
return [" ".join([f"{key}={stat[key]}" for key in keys]) for stat in stats]
return {name: " ".join([f"{key}={stat[key]}" for key in keys]) for name, stat in stats.items()}
def pretty_print_dict(data: dict[str, Any], tabs: int = 0) -> None:
def pretty_print_dict(data: dict[str, str], tabs: int = 0) -> None:
max_key_length = max([len(key) for key in data.keys()])
for key, value in data.items():
tabs_str = " " * tabs
@ -141,27 +142,19 @@ class BenchmarkResult:
def get_measured_itl(self) -> list[float]:
return [(dt[-1] - dt[0]) / (len(dt) - 1) for dt in self.token_generation_times if len(dt) > 1]
def get_throughput(self, batch_size: int) -> float:
return [
batch_size * len(dt) / e2e_latency
for e2e_latency, dt in zip(self.e2e_latency, self.token_generation_times)
]
def get_throughput(self, total_generated_tokens: int) -> list[float]:
return [total_generated_tokens / e2e_latency for e2e_latency in self.e2e_latency]
def pprint(self, batch_size: int = 0, tabs: int = 0) -> None:
stats_to_collate = [
add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
add_unit_to_duration(compute_basic_statistics(self.get_measured_itl())),
]
if batch_size > 0:
throughput_stats = compute_basic_statistics(self.get_throughput(batch_size))
stats_to_collate.append({key: f"{value:.2f}tok/s" for key, value in throughput_stats.items()})
collated_stats = equalize_lengths_and_collate(stats_to_collate)
dict_to_pprint = {
"E2E Latency": collated_stats[0],
"Time to First Token": collated_stats[1],
"Inter-Token Latency": collated_stats[2],
def pprint(self, batch_size: int = 0, num_generated_tokens: int = 0, tabs: int = 0) -> None:
measurements = {
"E2E Latency": add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
"Time to First Token": add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
}
itl_values = self.get_measured_itl()
if len(itl_values) > 0:
measurements["Inter-Token Latency"] = add_unit_to_duration(compute_basic_statistics(itl_values))
if batch_size > 0:
dict_to_pprint["Throughput"] = collated_stats[3]
throughput_stats = compute_basic_statistics(self.get_throughput(batch_size * num_generated_tokens))
measurements["Throughput"] = {key: f"{value:.2f}tok/s" for key, value in throughput_stats.items()}
dict_to_pprint = equalize_lengths_and_collate(measurements)
pretty_print_dict(dict_to_pprint, tabs=tabs)

View File

@ -2,6 +2,5 @@ numpy>=1.21.0
psutil>=5.8.0
gpustat>=1.0.0
torch>=2.0.0
transformers>=4.30.0
datasets>=2.10.0
huggingface_hub>=0.16.0

View File

@ -80,6 +80,10 @@ if __name__ == "__main__":
logger.info(f"Benchmark run UUID: {benchmark_run_uuid}")
logger.info(f"Output directory: {args.output_dir}")
# We cannot compute ITL if we don't have at least two measurements
if any(n <= 1 for n in args.num_tokens_to_generate):
raise ValueError("--num_tokens_to_generate arguments should be larger than 1")
# Error out if one of the arguments is not provided
if len(args.batch_size) * len(args.sequence_length) * len(args.num_tokens_to_generate) == 0:
raise ValueError(

View File

@ -508,16 +508,16 @@ BERT `_init_weights` Methode:
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
```
Sie können weitere benutzerdefinierte Schemata verwenden, wenn Sie eine spezielle Initialisierung für einige Module benötigen. Zum Beispiel in
@ -533,9 +533,9 @@ def _init_weights(self, module):
module.project_hid._is_hf_initialized = True
module.project_q._is_hf_initialized = True
elif isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
```
Das Flag `_is_hf_initialized` wird intern verwendet, um sicherzustellen, dass wir ein Submodul nur einmal initialisieren. Wenn Sie es auf

View File

@ -1008,6 +1008,8 @@
title: AltCLIP
- local: model_doc/aria
title: Aria
- local: model_doc/audioflamingo3
title: AudioFlamingo3
- local: model_doc/aya_vision
title: AyaVision
- local: model_doc/blip

View File

@ -314,16 +314,16 @@ Random initialization occurs in the `_init_weights` method of `BrandNewLlamaPreT
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
```
The initialization scheme can look different if you need to adapt it to your model. For example, [`Wav2Vec2ForPreTraining`] initializes [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) in its last two linear layers.
@ -339,9 +339,9 @@ def _init_weights(self, module):
module.project_hid._is_hf_initialized = True
module.project_q._is_hf_initialized = True
elif isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
```
### Convert checkpoints to Transformers

View File

@ -0,0 +1,402 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
*This model was released on 2025-07-10 and added to Hugging Face Transformers on 2025-11-11.*
# Audio Flamingo 3
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
## Overview
Audio Flamingo 3 (AF3) is a fully open large audiolanguage model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audiotext fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.
The model checkpoint is available at: [nvidia/audio-flamingo-3-hf](https://huggingface.co/nvidia/audio-flamingo-3-hf)
Highlights:
- Unified audio encoder across speech, sound, and music.
- **Long-audio support via windowing and post-pool alignment (up to 10 minutes maximum).** The model processes audio in 30-second windows with a hard limit of 20 windows (10 minutes total). Audio longer than 10 minutes will be truncated.
- Deterministic fusion that preserves sequence length by replacing audio placeholder tokens with audio embeddings.
This model was contributed by [Lasha Koroshinadze](https://huggingface.co/lashahub) and [Eric Bezzam](https://huggingface.co/bezzam).
### Paper
[Audio Flamingo 3](https://huggingface.co/papers/2507.08128): Advancing Audio Intelligence with Fully Open Large Audio Language Models
A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, B. Catanzaro
NVIDIA and University of Maryland
Project: https://research.nvidia.com/labs/adlr/AF3/
## Usage
### Audio Instruct Mode
The model supports audio-text instructions, including multi-turn interactions, all processed in batches.
➡️ audio + text instruction
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe the input speech."},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
],
}
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ multi-turn:
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Instruction: How does the tone of female speech change throughout the audio? Choose the correct option among the options below: (A) Sad to happy (B) Happy to sad (C) Neutral to happy (D) Happy to neutral.",
},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/000000786159.31.wav"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "(A) Sad to happy"}],
},
{
"role": "user",
"content": [
{"type": "text", "text": "Why do you think so?"},
],
},
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ text only:
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "What is the capital of France?"},
],
}
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ audio only:
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
{
"role": "user",
"content": [
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
],
}
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ batched inference!
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversations = [
[
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe the input speech."},
{
"type": "audio",
"path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav",
},
],
}
],
[
{
"role": "user",
"content": [
{
"type": "text",
"text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
],
}
],
]
inputs = processor.apply_chat_template(
conversations,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ Training:
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
model.train()
conversation = [
[
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe the input speech."},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "The transcription of the audio is 'summer follows spring the days grow longer and the nights are warm'."}],
}
],
[
{
"role": "user",
"content": [
{
"type": "text",
"text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "The transcription of the audio is 'some transcription of the audio'."}],
}
]
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
output_labels=True,
).to(model.device)
loss = model(**inputs).loss
loss.backward()
```
➡️ transcription shortcut
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
inputs = processor.apply_transcription_request(audio="https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, strip_prefix=True)
print(decoded_outputs)
```
The model is trained to emit transcriptions prefixed with assistant framing such as `The spoken content of the audio is "<text>".`. Use `strip_prefix=True` (as shown above) to remove the fixed assistant sentence and surrounding quotes so that only the transcription remains.
## How the model works
### Architecture
* **AudioFlamingo3Encoder**
Whisper-style feature extractor + encoder → average-pool over time (stride 2) → LayerNorm.
Produces per-frame hidden states at the post-pool rate.
* **AudioFlamingo3MultiModalProjector**
A small MLP that maps encoder features to the language models hidden size.
* **AudioFlamingo3ForConditionalGeneration**
A causal language model that accepts text embeddings where each audio placeholder token slot is replaced, in place, by an audio frame embedding. No sequence-length change is introduced by fusion.
### Processor-level alignment
1. Each raw waveform is split into fixed-length windows based on the feature extractors `chunk_length` (seconds) and `sampling_rate` (Hz).
2. For each window, the processor computes the number of post-pool frames `post_pool_len` that the encoder will output (matching the conv/pool schedule).
3. The processor expands the audio placeholder token by the total number of post-pool frames across all windows.
4. The model later replaces those token positions with the corresponding projected audio embeddings.
## Usage patterns
### Transcription shortcut
For automatic speech recognition you can skip writing the default instruction each time and call
[`~transformers.AudioFlamingo3Processor.apply_transcription_request`]:
```python
inputs = processor.apply_transcription_request(audio=audio_array)
```
Pass `prompt="Transcribe the input speech."` (or a list of prompts for batch audio) to customize the instruction while
keeping the audio placeholder handling.
`audio` accepts in-memory arrays, local file paths, or URLs. Any processor kwargs (`text_kwargs`, `audio_kwargs`, etc.)
are forwarded, so you can tweak padding or tensor formats just like when calling `processor(...)`.
## Long audio and windowing
**Important: Maximum audio length is 10 minutes.** Audio longer than this will be truncated.
* The default setup processes 30-second windows at 16 kHz mono.
* **The processor enforces a hard limit of 20 windows per sample, resulting in a maximum of 10 minutes of audio (20 windows × 30 seconds).**
* For each window:
* `mel_len` is the padded mel length.
* A conv stack reduces time as `conv_output_len = (mel_len - 1) // 2 + 1`.
* Post-pool frames per window: `post_pool_len = (conv_output_len - 2) // 2 + 1`.
* An audio placeholder token is expanded to the sum of `post_pool_len` across all windows.
## Padding, attention, and caching
* **Left padding vs right padding**
For generation with mixed prompt lengths in a batch, left padding is usually preferable.
For training, right padding is common; AF3s fusion mechanism itself is padding-agnostic because it replaces in place.
* **Attention masks**
The processor returns `attention_mask` (text) and `input_features_mask` (audio). The model builds an internal 4-D mask on the encoders pre-pool axis with negative infinity at pad positions.
* **Caching**
During generation, `input_features` and `input_features_mask` are only passed on the first step. Subsequent steps use cached keys/values from the language model.
## Troubleshooting
* Empty or truncated outputs when batching
Use left padding for batched generation and decode only the new tokens after the prompt length, as shown in the quickstart.
## AudioFlamingo3Config
[[autodoc]] AudioFlamingo3Config
## AudioFlamingo3EncoderConfig
[[autodoc]] AudioFlamingo3EncoderConfig
## AudioFlamingo3Processor
[[autodoc]] AudioFlamingo3Processor
## AudioFlamingo3Encoder
[[autodoc]] AudioFlamingo3Encoder
- forward
## AudioFlamingo3ForConditionalGeneration
[[autodoc]] AudioFlamingo3ForConditionalGeneration
- forward

View File

@ -169,6 +169,9 @@ print("Pooled output shape:", pooled_output.shape)
[[autodoc]] DINOv3ViTModel
- forward
## DINOv3ViTBackbone
[[autodoc]] DINOv3ViTBackbone
## DINOv3ConvNextModel
[[autodoc]] DINOv3ConvNextModel

View File

@ -159,7 +159,7 @@ conversation3 = [
conversations = [conversation1, conversation2, conversation3]
inputs = processor.apply_chat_template(
conversation,
conversations,
add_generation_prompt=True,
tokenize=True,
return_dict=True,

View File

@ -149,7 +149,7 @@ The example below packs `up_proj` and `gate_proj` into a single `gate_up_proj` m
```python
class Llama4TextExperts(nn.Module):
...
self.gate_up_proj = nn.Parameter(torch.empty(self.num_experts, self.hidden_size, 2 * self.expert_dim))
self.gate_up_proj = nn.Parameter(torch.zeros(self.num_experts, self.hidden_size, 2 * self.expert_dim))
```
Batch matrix multiplication can be used in the `forward` pass to compute the output of the `gate_up_proj` module.

View File

@ -329,7 +329,7 @@ from torchao.dtypes import Int4XPULayout
from torchao.quantization.quant_primitives import ZeroPointDomain
quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4XPULayout(), zero_point_domain=ZeroPointDomain.INT)
quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4XPULayout(), zero_point_domain=ZeroPointDomain.INT, int4_packing_format="plain_int32")
quantization_config = TorchAoConfig(quant_type=quant_config)
# Load and quantize the model
@ -342,7 +342,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
@ -395,7 +395,7 @@ from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int4WeightOnlyConfig
from torchao.dtypes import Int4CPULayout
quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout())
quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout(), int4_packing_format="opaque")
quantization_config = TorchAoConfig(quant_type=quant_config)
# Load and quantize the model
@ -422,7 +422,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
#### 1. Skip quantization for certain layers
With `ModuleFqnToConfig` we can specify a default configuration for all layers while skipping quantization for certain layers.
With `FqnToConfig` we can specify a default configuration for all layers while skipping quantization for certain layers.
```py
import torch
@ -430,11 +430,11 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
model_id = "meta-llama/Llama-3.1-8B-Instruct"
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig
from torchao.quantization import Int4WeightOnlyConfig, FqnToConfig
config = Int4WeightOnlyConfig(group_size=128)
# set default to int4 (for linears), and skip quantizing `model.layers.0.self_attn.q_proj`
quant_config = ModuleFqnToConfig({"_default": config, "model.layers.0.self_attn.q_proj": None})
quant_config = FqnToConfig({"_default": config, "model.layers.0.self_attn.q_proj": None})
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16, quantization_config=quantization_config)
# lm_head is not quantized and model.layers.0.self_attn.q_proj is not quantized
@ -459,7 +459,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
model_id = "facebook/opt-125m"
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig, Int8DynamicActivationInt4WeightConfig, IntxWeightOnlyConfig, PerAxis, MappingType
from torchao.quantization import Int4WeightOnlyConfig, FqnToConfig, Int8DynamicActivationInt4WeightConfig, IntxWeightOnlyConfig, PerAxis, MappingType
weight_dtype = torch.int8
granularity = PerAxis(0)
@ -470,7 +470,7 @@ embedding_config = IntxWeightOnlyConfig(
mapping_type=mapping_type,
)
linear_config = Int8DynamicActivationInt4WeightConfig(group_size=128)
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.decoder.embed_tokens": embedding_config, "model.decoder.embed_positions": None})
quant_config = FqnToConfig({"_default": linear_config, "model.decoder.embed_tokens": embedding_config, "model.decoder.embed_positions": None})
# set `include_embedding` to True in order to include embedding in quantization
# when `include_embedding` is True, we'll remove input embedding from `modules_not_to_convert` as well
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True)
@ -521,7 +521,7 @@ from torchao.quantization import (
IntxWeightOnlyConfig,
PerRow,
PerAxis,
ModuleFqnToConfig,
FqnToConfig,
Float8Tensor,
Int4TilePackedTo4dTensor,
IntxUnpackedToInt8Tensor,
@ -550,7 +550,7 @@ qconfig_dict = {
"_default": intxwo,
}
quant_config = ModuleFqnToConfig(qconfig_dict)
quant_config = FqnToConfig(qconfig_dict)
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(
model_id,

View File

@ -170,7 +170,7 @@ Per quanto riguarda la classe `TrainingArguments`:
- L'argomento `evaluate_during_training` di `TrainingArguments` è deprecato a favore di `eval_strategy`.
Per quanto riguarda il modello Transfo-XL:
- L'attributo di configurazione `tie_weight` di Transfo-XL diventa `tie_words_embeddings`.
- L'attributo di configurazione `tie_weight` di Transfo-XL diventa `tie_word_embeddings`.
- Il metodo di modellazione `reset_length` di Transfo-XL diventa `reset_memory_length`.
Per quanto riguarda le pipeline:

View File

@ -406,16 +406,16 @@ model = BrandNewBertModel(BrandNewBertConfig())
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
```
特定のモジュールに特別な初期化が必要な場合、カスタムスキームをさらに持つことができます。たとえば、
@ -431,9 +431,9 @@ def _init_weights(self, module):
module.project_hid._is_hf_initialized = True
module.project_q._is_hf_initialized = True
elif isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
```
`_is_hf_initialized`フラグは、サブモジュールを一度だけ初期化することを確実にするために内部で使用されます。

View File

@ -348,16 +348,16 @@ model = BrandNewBertModel(BrandNewBertConfig())
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
```
몇 가지 모듈에 대해 특별한 초기화가 필요한 경우 사용자 정의 방식을 사용할 수도 있습니다. 예를 들어, `Wav2Vec2ForPreTraining`에서 마지막 두 개의 선형 레이어는 일반적인 PyTorch `nn.Linear`의 초기화를 가져야 하지만, 다른 모든 레이어는 위와 같은 초기화를 사용해야 합니다. 이는 다음과 같이 코드화됩니다:
@ -371,9 +371,9 @@ def _init_weights(self, module):
module.project_hid._is_hf_initialized = True
module.project_q._is_hf_initialized = True
elif isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
```
`_is_hf_initialized` 플래그는 서브모듈을 한 번만 초기화하도록 내부적으로 사용됩니다. `module.project_q``module.project_hid`에 대해 `True`로 설정함으로써, 우리가 수행한 사용자 정의 초기화가 이후에 덮어쓰이지 않도록 합니다. 즉, `_init_weights` 함수가 이들에게 적용되지 않습니다.

View File

@ -152,7 +152,7 @@ class ParallelInterface(MutableMapping):
```python
class Llama4TextExperts(nn.Module):
...
self.gate_up_proj = nn.Parameter(torch.empty(self.num_experts, self.hidden_size, 2 * self.expert_dim))
self.gate_up_proj = nn.Parameter(torch.zeros(self.num_experts, self.hidden_size, 2 * self.expert_dim))
```
배치 행렬 곱셈을 `forward` 패스에서 사용하여 `gate_up_proj` 모듈의 출력을 계산할 수 있습니다.

View File

@ -502,16 +502,10 @@ class DummyBertLMPredictionHead(nn.Module):
# The output weights are the same as the input embeddings, but there is
# an output-only bias for each token.
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=True)
self.bias = nn.Parameter(torch.zeros(config.vocab_size))
# Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
self.decoder.bias = self.bias
def _tie_weights(self):
self.decoder.bias = self.bias
def forward(self, hidden_states):
hidden_states = self.transform(hidden_states)
hidden_states = self.decoder(hidden_states)
@ -536,18 +530,18 @@ class DummyBertPreTrainedModel(PreTrainedModel):
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
elif isinstance(module, DummyBertLMPredictionHead):
module.bias.data.zero_()
module.bias.zero_()
@auto_docstring(

View File

@ -265,7 +265,7 @@ class MyNewModel2PreTrainedModel(PreTrainedModel):
# We initialize with 0s to be 1 centered as the RMSNorm here does (1 + weight)
if "RMSNorm" in module.__class__.__name__:
module.weight.data.zero_()
module.weight.zero_()
class MyNewModel2ForSequenceClassification(GenericForSequenceClassification, MyNewModel2PreTrainedModel):

View File

@ -104,9 +104,9 @@ class NewTaskModelPreTrainedModel(PreTrainedModel):
std = getattr(self.config, "initializer_range", self.config.get_text_config().initializer_range)
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
module.weight.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
def token_type_ids_mask_function(
@ -428,7 +428,7 @@ class NewTaskModelForNewTask(NewTaskModelPreTrainedModel, GenerationMixin):
"^multi_modal_projector": "model.multi_modal_projector",
"^language_model.lm_head": "lm_head",
}
_tied_weights_keys = ["lm_head.weight"]
_tied_weights_keys = {"lm_head.weight": "model.language_model.embed_tokens.weight"}
main_input_name: ClassVar[str] = "doc_input_ids" # transformers-related
def __init__(self, config):
@ -440,7 +440,15 @@ class NewTaskModelForNewTask(NewTaskModelPreTrainedModel, GenerationMixin):
self.custom_text_proj = nn.Linear(self.config.text_config.hidden_size, self.embedding_dim)
if self.language_model._tied_weights_keys is not None:
self._tied_weights_keys = [f"model.language_model.{k}" for k in self.language_model._tied_weights_keys]
prefix = "model.language_model."
prefixed_mapping = {
f"{prefix}{target}": f"{prefix}{source}"
for target, source in self.language_model._tied_weights_keys.items()
}
if isinstance(self._tied_weights_keys, dict):
self._tied_weights_keys.update(prefixed_mapping)
else:
self._tied_weights_keys = prefixed_mapping
self.post_init()
def get_input_embeddings(self):

View File

@ -505,16 +505,10 @@ class RobertaLMPredictionHead(nn.Module):
# The output weights are the same as the input embeddings, but there is
# an output-only bias for each token.
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=True)
self.bias = nn.Parameter(torch.zeros(config.vocab_size))
# Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
self.decoder.bias = self.bias
def _tie_weights(self):
self.decoder.bias = self.bias
def forward(self, hidden_states):
hidden_states = self.transform(hidden_states)
hidden_states = self.decoder(hidden_states)
@ -539,18 +533,18 @@ class RobertaPreTrainedModel(PreTrainedModel):
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
elif isinstance(module, RobertaLMPredictionHead):
module.bias.data.zero_()
module.bias.zero_()
@auto_docstring(

View File

@ -846,11 +846,11 @@ class TestDetrPreTrainedModel(PreTrainedModel):
nn.init.xavier_uniform_(module.output_proj.weight.data)
nn.init.constant_(module.output_proj.bias.data, 0.0)
elif isinstance(module, (nn.Linear, nn.Conv2d, nn.BatchNorm2d)):
module.weight.data.normal_(mean=0.0, std=std)
module.weight.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
module.weight.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
if hasattr(module, "reference_points") and not self.config.two_stage:

View File

@ -19,7 +19,15 @@ class NewTaskModelForNewTask(PaliGemmaForConditionalGeneration):
self.custom_text_proj = nn.Linear(self.config.text_config.hidden_size, self.embedding_dim)
if self.language_model._tied_weights_keys is not None:
self._tied_weights_keys = [f"model.language_model.{k}" for k in self.language_model._tied_weights_keys]
prefix = "model.language_model."
prefixed_mapping = {
f"{prefix}{target}": f"{prefix}{source}"
for target, source in self.language_model._tied_weights_keys.items()
}
if isinstance(self._tied_weights_keys, dict):
self._tied_weights_keys.update(prefixed_mapping)
else:
self._tied_weights_keys = prefixed_mapping
self.post_init()

View File

@ -127,7 +127,7 @@ def parse_args():
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",

View File

@ -132,7 +132,7 @@ def parse_args():
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
help="If passed, will use a slow tokenizer (not backed by the Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",

View File

@ -130,7 +130,7 @@ def parse_args():
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",

View File

@ -128,7 +128,7 @@ def parse_args():
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
help="If passed, will use a slow tokenizer (not backed by the HuggingFace Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",

View File

@ -151,7 +151,7 @@ def parse_args():
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",

View File

@ -223,7 +223,7 @@ def parse_args():
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",

View File

@ -74,6 +74,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
def tearDownClass(cls):
shutil.rmtree(cls.tmpdir)
@slow
@mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"})
def test_run_glue_no_trainer(self):
tmp_dir = self.get_auto_remove_tmp_dir()
@ -147,6 +148,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
self.assertTrue(os.path.exists(os.path.join(tmp_dir, "epoch_0")))
self.assertTrue(os.path.exists(os.path.join(tmp_dir, "mlm_no_trainer")))
@slow
@mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"})
def test_run_ner_no_trainer(self):
# with so little data distributed training needs more epochs to get the score on par with 0/1 gpu
@ -175,6 +177,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
self.assertTrue(os.path.exists(os.path.join(tmp_dir, "epoch_0")))
self.assertTrue(os.path.exists(os.path.join(tmp_dir, "ner_no_trainer")))
@slow
@mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"})
def test_run_squad_no_trainer(self):
tmp_dir = self.get_auto_remove_tmp_dir()
@ -203,6 +206,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
self.assertTrue(os.path.exists(os.path.join(tmp_dir, "epoch_0")))
self.assertTrue(os.path.exists(os.path.join(tmp_dir, "qa_no_trainer")))
@slow
@mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"})
def test_run_swag_no_trainer(self):
tmp_dir = self.get_auto_remove_tmp_dir()
@ -305,6 +309,7 @@ class ExamplesTestsNoTrainer(TestCasePlus):
result = get_results(tmp_dir)
self.assertGreaterEqual(result["eval_overall_accuracy"], 0.10)
@slow
@mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"})
def test_run_image_classification_no_trainer(self):
tmp_dir = self.get_auto_remove_tmp_dir()

View File

@ -374,6 +374,7 @@ class ExamplesTests(TestCasePlus):
result = get_results(tmp_dir)
self.assertGreaterEqual(result["eval_bleu"], 30)
@slow
def test_run_image_classification(self):
tmp_dir = self.get_auto_remove_tmp_dir()
testargs = f"""
@ -403,6 +404,7 @@ class ExamplesTests(TestCasePlus):
result = get_results(tmp_dir)
self.assertGreaterEqual(result["eval_accuracy"], 0.8)
@slow
def test_run_speech_recognition_ctc(self):
tmp_dir = self.get_auto_remove_tmp_dir()
testargs = f"""
@ -573,6 +575,7 @@ class ExamplesTests(TestCasePlus):
model = ViTMAEForPreTraining.from_pretrained(tmp_dir)
self.assertIsNotNone(model)
@slow
def test_run_semantic_segmentation(self):
tmp_dir = self.get_auto_remove_tmp_dir()
testargs = f"""
@ -597,6 +600,7 @@ class ExamplesTests(TestCasePlus):
result = get_results(tmp_dir)
self.assertGreaterEqual(result["eval_overall_accuracy"], 0.1)
@slow
@patch.dict(os.environ, {"WANDB_DISABLED": "true"})
def test_run_object_detection(self):
tmp_dir = self.get_auto_remove_tmp_dir()
@ -624,6 +628,7 @@ class ExamplesTests(TestCasePlus):
result = get_results(tmp_dir)
self.assertGreaterEqual(result["test_map"], 0.1)
@slow
@patch.dict(os.environ, {"WANDB_DISABLED": "true"})
def test_run_instance_segmentation(self):
tmp_dir = self.get_auto_remove_tmp_dir()

View File

@ -120,7 +120,7 @@ def parse_args():
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",

View File

@ -212,7 +212,7 @@ def parse_args():
parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
)
parser.add_argument(
"--per_device_train_batch_size",

View File

@ -50,6 +50,7 @@ checkpoint: 检查点
</p>
<p align="center">
<a href="https://huggingface.co/models"><img alt="Checkpoints on Hub" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen"></a>
<a href="https://circleci.com/gh/huggingface/transformers"><img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/transformers/main"></a>
<a href="https://github.com/huggingface/transformers/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/huggingface/transformers.svg?color=blue"></a>
<a href="https://huggingface.co/docs/transformers/index"><img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/transformers/index.svg?down_color=red&down_message=offline&up_message=online"></a>
@ -60,7 +61,7 @@ checkpoint: 检查点
<h4 align="center">
<p>
<a href="https://github.com/huggingface/transformers/">English</a> |
<a href="https://github.com/huggingface/transformers/blob/main/README.md">English</a> |
<b>简体中文</b> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_zh-hant.md">繁體中文</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ko.md">한국어</a> |
@ -68,7 +69,7 @@ checkpoint: 检查点
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ja.md">日本語</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_hd.md">हिन्दी</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ru.md">Русский</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_pt-br.md">Рortuguês</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_pt-br.md">Português</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_te.md">తెలుగు</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_fr.md">Français</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_de.md">Deutsch</a> |
@ -81,182 +82,258 @@ checkpoint: 检查点
</h4>
<h3 align="center">
<p> Jax、PyTorch 和 TensorFlow 打造的先进的自然语言处理函数库</p>
<p>文本、视觉、音频、视频与多模态提供推理与训练的先进预训练模型</p>
</h3>
<h3 align="center">
<a href="https://hf.co/course"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/course_banner.png"></a>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
</h3>
🤗 Transformers 提供了数以千计的预训练模型,支持 100 多种语言的文本分类、信息抽取、问答、摘要、翻译、文本生成。它的宗旨是让最先进的 NLP 技术人人易用
Transformers 充当跨文本、计算机视觉、音频、视频与多模态的最先进机器学习模型的「模型定义框架」,同时覆盖推理与训练
🤗 Transformers 提供了便于快速下载和使用的API让你可以把预训练模型用在给定文本、在你的数据集上微调然后通过 [model hub](https://huggingface.co/models) 与社区共享。同时,每个定义的 Python 模块都是完全独立的,便于修改和快速进行研究实验
它将模型的定义集中化,使整个生态系统对该定义达成一致。`transformers` 是跨框架的枢纽:一旦某模型定义被支持,它通常就能兼容多数训练框架(如 Axolotl、Unsloth、DeepSpeed、FSDP、PyTorchLightning 等)、推理引擎(如 vLLM、SGLang、TGI 等),以及依赖 `transformers` 模型定义的相关库(如 llama.cpp、mlx 等)
🤗 Transformers 支持三个最热门的深度学习库: [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) 以及 [TensorFlow](https://www.tensorflow.org/) — 并与之无缝整合。你可以直接使用一个框架训练你的模型然后用另一个加载和推理
我们的目标是持续支持新的最先进模型,并通过让模型定义保持简单、可定制且高效来普及其使用
## 在线演示
你可以直接在模型页面上测试大多数 [model hub](https://huggingface.co/models) 上的模型。 我们也提供了 [私有模型托管、模型版本管理以及推理API](https://huggingface.co/pricing)
这里是一些例子:
- [用 BERT 做掩码填词](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France)
- [用 Electra 做命名实体识别](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city)
- [用 GPT-2 做文本生成](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+)
- [用 RoBERTa 做自然语言推理](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal)
- [用 BART 做文本摘要](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct)
- [用 DistilBERT 做问答](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species)
- [用 T5 做翻译](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin)
**[Write With Transformer](https://transformer.huggingface.co)**,由 Hugging Face 团队打造,是一个文本生成的官方 demo。
## 如果你在寻找由 Hugging Face 团队提供的定制化支持服务
<a target="_blank" href="https://huggingface.co/support">
<img alt="HuggingFace Expert Acceleration Program" src="https://huggingface.co/front/thumbnails/support.png" style="max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
</a><br>
## 快速上手
我们为快速使用模型提供了 `pipeline` API。Pipeline 聚合了预训练模型和对应的文本预处理。下面是一个快速使用 pipeline 去判断正负面情绪的例子:
```python
>>> from transformers import pipeline
# 使用情绪分析 pipeline
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to introduce pipeline to the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9996980428695679}]
```
第二行代码下载并缓存了 pipeline 使用的预训练模型,而第三行代码则在给定的文本上进行了评估。这里的答案"正面" (positive) 具有 99 的置信度。
许多的 NLP 任务都有开箱即用的预训练 `pipeline`。比如说,我们可以轻松的从给定文本中抽取问题答案:
``` python
>>> from transformers import pipeline
# 使用问答 pipeline
>>> question_answerer = pipeline('question-answering')
>>> question_answerer({
... 'question': 'What is the name of the repository ?',
... 'context': 'Pipeline has been included in the huggingface/transformers repository'
... })
{'score': 0.30970096588134766, 'start': 34, 'end': 58, 'answer': 'huggingface/transformers'}
```
除了给出答案,预训练模型还给出了对应的置信度分数、答案在词符化 (tokenized) 后的文本中开始和结束的位置。你可以从[这个教程](https://huggingface.co/docs/transformers/task_summary)了解更多 `pipeline` API 支持的任务。
要在你的任务上下载和使用任意预训练模型也很简单,只需三行代码。这里是 PyTorch 版的示例:
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)
```
这里是等效的 TensorFlow 代码:
```python
>>> from transformers import AutoTokenizer, TFAutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased")
>>> inputs = tokenizer("Hello world!", return_tensors="tf")
>>> outputs = model(**inputs)
```
词符化器 (tokenizer) 为所有的预训练模型提供了预处理,并可以直接对单个字符串进行调用(比如上面的例子)或对列表 (list) 调用。它会输出一个你可以在下游代码里使用或直接通过 `**` 解包表达式传给模型的词典 (dict)。
模型本身是一个常规的 [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) 或 [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)(取决于你的后端),可以常规方式使用。 [这个教程](https://huggingface.co/transformers/training.html)解释了如何将这样的模型整合到经典的 PyTorch 或 TensorFlow 训练循环中,或是如何使用我们的 `Trainer` 训练器API 来在一个新的数据集上快速微调。
## 为什么要用 transformers
1. 便于使用的先进模型:
- NLU 和 NLG 上表现优越
- 对教学和实践友好且低门槛
- 高级抽象,只需了解三个类
- 对所有模型统一的API
1. 更低计算开销,更少的碳排放:
- 研究人员可以分享已训练的模型而非每次从头开始训练
- 工程师可以减少计算用时和生产环境开销
- 数十种模型架构、两千多个预训练模型、100多种语言支持
1. 对于模型生命周期的每一个部分都面面俱到:
- 训练先进的模型,只需 3 行代码
- 模型在不同深度学习框架间任意转移,随你心意
- 为训练、评估和生产选择最适合的框架,衔接无缝
1. 为你的需求轻松定制专属模型和用例:
- 我们为每种模型架构提供了多个用例来复现原论文结果
- 模型内部结构保持透明一致
- 模型文件可单独使用,方便修改和快速实验
## 什么情况下我不该用 transformers
- 本库并不是模块化的神经网络工具箱。模型文件中的代码特意呈若璞玉,未经额外抽象封装,以便研究人员快速迭代修改而不致溺于抽象和文件跳转之中。
- `Trainer` API 并非兼容任何模型,只为本库之模型优化。若是在寻找适用于通用机器学习的训练循环实现,请另觅他库。
- 尽管我们已尽力而为,[examples 目录](https://github.com/huggingface/transformers/tree/main/examples)中的脚本也仅为用例而已。对于你的特定问题,它们并不一定开箱即用,可能需要改几行代码以适之。
目前在 [Hugging Face Hub](https://huggingface.com/models) 上有超过 1M+ 使用 `transformers` 的[模型检查点](https://huggingface.co/models?library=transformers&sort=trending),可随取随用。
今天就去探索 Hub找到一个模型并用 Transformers 立刻开始吧
## 安装
### 使用 pip
Transformers 支持 Python 3.9+,以及 [PyTorch](https://pytorch.org/get-started/locally/) 2.1+。
这个仓库已在 Python 3.9+、Flax 0.4.1+、PyTorch 2.1+ 和 TensorFlow 2.6+ 下经过测试。
使用 [venv](https://docs.python.org/3/library/venv.html) 或 [uv](https://docs.astral.sh/uv/)(一个基于 Rust 的快速 Python 包与项目管理器)创建并激活虚拟环境:
你可以在[虚拟环境](https://docs.python.org/3/library/venv.html)中安装 🤗 Transformers。如果你还不熟悉 Python 的虚拟环境,请阅此[用户说明](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)。
首先,用你打算使用的版本的 Python 创建一个虚拟环境并激活。
然后,你需要安装 Flax、PyTorch 或 TensorFlow 其中之一。关于在你使用的平台上安装这些框架,请参阅 [TensorFlow 安装页](https://www.tensorflow.org/install/), [PyTorch 安装页](https://pytorch.org/get-started/locally/#start-locally) 或 [Flax 安装页](https://github.com/google/flax#quick-install)。
当这些后端之一安装成功后, 🤗 Transformers 可依此安装:
```bash
pip install transformers
```py
# venv
python -m venv .my-env
source .my-env/bin/activate
# uv
uv venv .my-env
source .my-env/bin/activate
```
如果你想要试试用例或者想在正式发布前使用最新的开发中代码,你得[从源代码安装](https://huggingface.co/docs/transformers/installation#installing-from-source)。
在虚拟环境中安装 Transformers
### 使用 conda
```py
# pip
pip install "transformers[torch]"
🤗 Transformers 可以通过 conda 依此安装:
```shell script
conda install conda-forge::transformers
# uv
uv pip install "transformers[torch]"
```
> **_笔记:_** 从 `huggingface` 渠道安装 `transformers` 已被废弃。
如果你需要库中的最新改动或计划参与贡献,可从源码安装(注意:最新版可能不稳定;如遇错误,欢迎在 [issues](https://github.com/huggingface/transformers/issues) 中反馈):
要通过 conda 安装 Flax、PyTorch 或 TensorFlow 其中之一,请参阅它们各自安装页的说明。
```shell
git clone https://github.com/huggingface/transformers.git
cd transformers
## 模型架构
# pip
pip install '.[torch]'
🤗 Transformers 支持的[**所有的模型检查点**](https://huggingface.co/models)由[用户](https://huggingface.co/users)和[组织](https://huggingface.co/organizations)上传,均与 huggingface.co [model hub](https://huggingface.co) 无缝整合。
# uv
uv pip install '.[torch]'
```
目前的检查点数量: ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen)
## 快速上手
🤗 Transformers 目前支持如下的架构: 模型概述请阅[这里](https://huggingface.co/docs/transformers/model_summary).
使用 [Pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) API 一步上手。`Pipeline` 是一个高级推理类,支持文本、音频、视觉与多模态任务,负责输入预处理并返回适配的输出。
要检查某个模型是否已有 Flax、PyTorch 或 TensorFlow 的实现,或其是否在 🤗 Tokenizers 库中有对应词符化器tokenizer敬请参阅[此表](https://huggingface.co/docs/transformers/index#supported-frameworks)。
实例化一个用于文本生成的 pipeline指定使用的模型。模型会被下载并缓存方便复用。最后传入文本作为提示
这些实现均已于多个数据集测试(请参看用例脚本)并应于原版实现表现相当。你可以在用例文档的[此节](https://huggingface.co/docs/transformers/examples)中了解表现的细节。
```py
from transformers import pipeline
pipeline = pipeline(task="text-generation", model="Qwen/Qwen2.5-1.5B")
pipeline("the secret to baking a really good cake is ")
[{'generated_text': 'the secret to baking a really good cake is 1) to use the right ingredients and 2) to follow the recipe exactly. the recipe for the cake is as follows: 1 cup of sugar, 1 cup of flour, 1 cup of milk, 1 cup of butter, 1 cup of eggs, 1 cup of chocolate chips. if you want to make 2 cakes, how much sugar do you need? To make 2 cakes, you will need 2 cups of sugar.'}]
```
要与模型进行「聊天」,用法也一致。唯一不同是需要构造一段「聊天历史」(即 `Pipeline` 的输入):
> [!TIP]
> 你也可以直接在命令行与模型聊天:
> ```shell
> transformers chat Qwen/Qwen2.5-0.5B-Instruct
> ```
```py
import torch
from transformers import pipeline
chat = [
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", dtype=torch.bfloat16, device_map="auto")
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
```
展开下方示例,查看 `Pipeline` 在不同模态与任务中的用法。
<details>
<summary>自动语音识别</summary>
```py
from transformers import pipeline
pipeline = pipeline(task="automatic-speech-recognition", model="openai/whisper-large-v3")
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
```
</details>
<details>
<summary>图像分类</summary>
<h3 align="center">
<a><img src="https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png"></a>
</h3>
```py
from transformers import pipeline
pipeline = pipeline(task="image-classification", model="facebook/dinov2-small-imagenet1k-1-layer")
pipeline("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{"label": "macaw", "score": 0.997848391532898},
{"label": "sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita",
"score": 0.0016551691805943847},
{"label": "lorikeet", "score": 0.00018523589824326336},
{"label": "African grey, African gray, Psittacus erithacus",
"score": 7.85409429227002e-05},
{"label": "quail", "score": 5.502637941390276e-05}]
```
</details>
<details>
<summary>视觉问答</summary>
<h3 align="center">
<a><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg"></a>
</h3>
```py
from transformers import pipeline
pipeline = pipeline(task="visual-question-answering", model="Salesforce/blip-vqa-base")
pipeline(
image="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg",
question="What is in the image?",
)
[{"answer": "statue of liberty"}]
```
</details>
## 为什么要用 Transformers
1. 易于使用的最先进模型:
- 在自然语言理解与生成、计算机视觉、音频、视频与多模态任务上表现优越。
- 对研究者、工程师与开发者友好且低门槛。
- 少量用户侧抽象,仅需学习三个类。
- 统一的 API使用所有预训练模型体验一致。
1. 更低计算开销与更小碳足迹:
- 共享已训练的模型,而非每次从零开始训练。
- 减少计算时间与生产环境成本。
- 覆盖数十种模型架构,跨所有模态提供 1M+ 预训练检查点。
1. 在模型生命周期的每个阶段都可以选用合适的框架:
- 3 行代码即可训练最先进模型。
- 在 PyTorch/JAX/TF2.0 间自由迁移同一个模型。
- 为训练、评估与生产挑选最合适的框架。
1. 轻松定制模型或用例:
- 为每个架构提供示例以复现原论文结果。
- 尽可能一致地暴露模型内部。
- 模型文件可独立于库使用,便于快速实验。
<a target="_blank" href="https://huggingface.co/enterprise">
<img alt="Hugging Face Enterprise Hub" src="https://github.com/user-attachments/assets/247fb16d-d251-4583-96c4-d3d76dda4925">
</a><br>
## 为什么我不该用 Transformers
- 该库不是一个可自由拼搭的神经网络模块化工具箱。模型文件中的代码刻意减少额外抽象,以便研究者能快速在各个模型上迭代,而无需深入更多抽象或文件跳转。
- 训练 API 优化用于 Transformers 提供的 PyTorch 模型。若需要通用的机器学习训练循环,请使用其它库,如 [Accelerate](https://huggingface.co/docs/accelerate)。
- [示例脚本](https://github.com/huggingface/transformers/tree/main/examples)只是「示例」。它们不一定能直接适配你的具体用例,需要你进行必要的改动。
## 了解更多
## 100 个使用 Transformers 的项目
| 章节 | 描述 |
|-|-|
| [文档](https://huggingface.co/docs/transformers/) | 完整的 API 文档和教程 |
| [任务总结](https://huggingface.co/docs/transformers/task_summary) | 🤗 Transformers 支持的任务 |
| [预处理教程](https://huggingface.co/docs/transformers/preprocessing) | 使用 `Tokenizer` 来为模型准备数据 |
| [训练和微调](https://huggingface.co/docs/transformers/training) | 在 PyTorch/TensorFlow 的训练循环或 `Trainer` API 中使用 🤗 Transformers 提供的模型 |
| [快速上手:微调和用例脚本](https://github.com/huggingface/transformers/tree/main/examples) | 为各种任务提供的用例脚本 |
| [模型分享和上传](https://huggingface.co/docs/transformers/model_sharing) | 和社区上传和分享你微调的模型 |
| [迁移](https://huggingface.co/docs/transformers/migration) | 从 `pytorch-transformers` 或 `pytorch-pretrained-bert` 迁移到 🤗 Transformers |
Transformers 不止是一个使用预训练模型的工具包,它还是围绕 Hugging Face Hub 构建的项目社区。我们希望 Transformers 能助力开发者、研究人员、学生、老师、工程师与任何人构建理想项目。
为庆祝 Transformers 获得 100,000 颗星,我们制作了 [awesome-transformers](./awesome-transformers.md) 页面,展示了 100 个由社区构建的优秀项目。
如果你拥有或使用某个项目,认为它应该在列表中出现,欢迎提交 PR 添加它!
## 示例模型
你可以直接在它们的 [Hub 模型页](https://huggingface.co/models) 上测试我们的多数模型。
展开每个模态以查看不同用例中的部分示例模型。
<details>
<summary>音频</summary>
- 使用 [Whisper](https://huggingface.co/openai/whisper-large-v3-turbo) 进行音频分类
- 使用 [Moonshine](https://huggingface.co/UsefulSensors/moonshine) 进行自动语音识别
- 使用 [Wav2Vec2](https://huggingface.co/superb/wav2vec2-base-superb-ks) 进行关键词检索
- 使用 [Moshi](https://huggingface.co/kyutai/moshiko-pytorch-bf16) 进行语音到语音生成
- 使用 [MusicGen](https://huggingface.co/facebook/musicgen-large) 文本到音频生成
- 使用 [Bark](https://huggingface.co/suno/bark) 文本到语音生成
</details>
<details>
<summary>计算机视觉</summary>
- 使用 [SAM](https://huggingface.co/facebook/sam-vit-base) 自动生成掩码
- 使用 [DepthPro](https://huggingface.co/apple/DepthPro-hf) 进行深度估计
- 使用 [DINO v2](https://huggingface.co/facebook/dinov2-base) 进行图像分类
- 使用 [SuperPoint](https://huggingface.co/magic-leap-community/superpoint) 进行关键点检测
- 使用 [SuperGlue](https://huggingface.co/magic-leap-community/superglue_outdoor) 进行关键点匹配
- 使用 [RT-DETRv2](https://huggingface.co/PekingU/rtdetr_v2_r50vd) 进行目标检测
- 使用 [VitPose](https://huggingface.co/usyd-community/vitpose-base-simple) 进行姿态估计
- 使用 [OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_swin_large) 进行通用分割
- 使用 [VideoMAE](https://huggingface.co/MCG-NJU/videomae-large) 进行视频分类
</details>
<details>
<summary>多模态</summary>
- 使用 [Qwen2-Audio](https://huggingface.co/Qwen/Qwen2-Audio-7B) 实现音频或文本到文本
- 使用 [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) 进行文档问答
- 使用 [Qwen-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) 实现图像或文本到文本
- 使用 [BLIP-2](https://huggingface.co/Salesforce/blip2-opt-2.7b) 进行图文描述
- 使用 [GOT-OCR2](https://huggingface.co/stepfun-ai/GOT-OCR-2.0-hf) 进行基于 OCR 的文档理解
- 使用 [TAPAS](https://huggingface.co/google/tapas-base) 进行表格问答
- 使用 [Emu3](https://huggingface.co/BAAI/Emu3-Gen) 进行统一的多模态理解与生成
- 使用 [Llava-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) 视觉到文本
- 使用 [Llava](https://huggingface.co/llava-hf/llava-1.5-7b-hf) 进行视觉问答
- 使用 [Kosmos-2](https://huggingface.co/microsoft/kosmos-2-patch14-224) 进行视觉指代表达分割
</details>
<details>
<summary>NLP</summary>
- 使用 [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) 进行掩码词填充
- 使用 [Gemma](https://huggingface.co/google/gemma-2-2b) 进行命名实体识别NER
- 使用 [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) 进行问答
- 使用 [BART](https://huggingface.co/facebook/bart-large-cnn) 进行摘要
- 使用 [T5](https://huggingface.co/google-t5/t5-base) 进行翻译
- 使用 [Llama](https://huggingface.co/meta-llama/Llama-3.2-1B) 进行文本生成
- 使用 [Qwen](https://huggingface.co/Qwen/Qwen2.5-0.5B) 进行文本分类
</details>
## 引用

View File

@ -14,43 +14,6 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
<!---
A useful guide for English-Traditional Chinese translation of Hugging Face documentation
- Add space around English words and numbers when they appear between Chinese characters. E.g., 共 100 多種語言; 使用 transformers 函式庫。
- Use square quotes, e.g.,「引用」
- Some of terms in the file can be found at National Academy for Educational Research (https://terms.naer.edu.tw/), an official website providing bilingual translations between English and Traditional Chinese.
Dictionary
API: API (不翻譯)
add: 加入
checkpoint: 檢查點
code: 程式碼
community: 社群
confidence: 信賴度
dataset: 資料集
documentation: 文件
example: 基本翻譯為「範例」,或依語意翻為「例子」
finetune: 微調
Hugging Face: Hugging Face不翻譯
implementation: 實作
inference: 推論
library: 函式庫
module: 模組
NLP/Natural Language Processing: 以 NLP 出現時不翻譯,以 Natural Language Processing 出現時翻譯為自然語言處理
online demos: 線上Demo
pipeline: pipeline不翻譯
pretrained/pretrain: 預訓練
Python data structures (e.g., list, set, dict): 翻譯為串列,集合,字典,並用括號標註原英文
repository: repository不翻譯
summary: 概覽
token-: token-(不翻譯)
Trainer: Trainer不翻譯
transformer: transformer不翻譯
tutorial: 教學
user: 使用者
-->
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://huggingface.co/datasets/huggingface/documentation-images/raw/main/transformers-logo-dark.svg">
@ -62,6 +25,7 @@ user: 使用者
</p>
<p align="center">
<a href="https://huggingface.com/models"><img alt="Checkpoints on Hub" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen"></a>
<a href="https://circleci.com/gh/huggingface/transformers"><img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/transformers/main"></a>
<a href="https://github.com/huggingface/transformers/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/huggingface/transformers.svg?color=blue"></a>
<a href="https://huggingface.co/docs/transformers/index"><img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/transformers/index.svg?down_color=red&down_message=offline&up_message=online"></a>
@ -72,7 +36,7 @@ user: 使用者
<h4 align="center">
<p>
<a href="https://github.com/huggingface/transformers/">English</a> |
<a href="https://github.com/huggingface/transformers/blob/main/README.md">English</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_zh-hans.md">简体中文</a> |
<b>繁體中文</b> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ko.md">한국어</a> |
@ -80,7 +44,7 @@ user: 使用者
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ja.md">日本語</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_hd.md">हिन्दी</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ru.md">Русский</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_pt-br.md">Рortuguês</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_pt-br.md">Português</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_te.md">తెలుగు</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_fr.md">Français</a> |
<a href="https://github.com/huggingface/transformers/blob/main/i18n/README_de.md">Deutsch</a> |
@ -93,186 +57,261 @@ user: 使用者
</h4>
<h3 align="center">
<p>為 Jax、PyTorch 以及 TensorFlow 打造的先進自然語言處理函式庫</p>
<p>最先進的預訓練模型,專為推理與訓練而生</p>
</h3>
<h3 align="center">
<a href="https://hf.co/course"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/course_banner.png"></a>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
</h3>
🤗 Transformers 提供了數以千計的預訓練模型,支援 100 多種語言的文本分類、資訊擷取、問答、摘要、翻譯、文本生成。它的宗旨是讓最先進的 NLP 技術人人易用
Transformers 是一個為最先進的機器學習模型(涵蓋文字、電腦視覺、音訊、影片及多模態)提供推理和訓練支援的模型定義框架
🤗 Transformers 提供了便於快速下載和使用的API讓你可以將預訓練模型用在給定文本、在你的資料集上微調然後經由 [model hub](https://huggingface.co/models) 與社群共享。同時,每個定義的 Python 模組架構均完全獨立,方便修改和快速研究實驗
它將模型定義集中化,使得該定義在整個生態系中能夠達成共識。`transformers` 是貫穿各個框架的樞紐:如果一個模型定義受到支援,它將與大多數訓練框架(如 Axolotl、Unsloth、DeepSpeed、FSDP、PyTorch-Lightning 等)、推理引擎(如 vLLM、SGLang、TGI 等)以及利用 `transformers` 模型定義的周邊建模函式庫(如 llama.cpp、mlx 等)相容
🤗 Transformers 支援三個最熱門的深度學習函式庫: [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) 以及 [TensorFlow](https://www.tensorflow.org/) — 並與之完美整合。你可以直接使用其中一個框架訓練你的模型,然後用另一個載入和推論
我們致力於支援最新的頂尖模型,並透過使其模型定義變得簡單、可客製化且高效,來普及它們的應用
## 線上Demo
在 [Hugging Face Hub](https://huggingface.com/models) 上,有超過 100 萬個 Transformers [模型檢查點](https://huggingface.co/models?library=transformers&sort=trending) 供您使用。
你可以直接在 [model hub](https://huggingface.co/models) 上測試大多數的模型。我們也提供了 [私有模型託管、模型版本管理以及推論API](https://huggingface.co/pricing)
這裡是一些範例:
- [用 BERT 做遮蓋填詞](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France)
- [用 Electra 做專有名詞辨識](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city)
- [用 GPT-2 做文本生成](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+)
- [用 RoBERTa 做自然語言推論](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal)
- [用 BART 做文本摘要](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct)
- [用 DistilBERT 做問答](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species)
- [用 T5 做翻譯](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin)
**[Write With Transformer](https://transformer.huggingface.co)**,由 Hugging Face 團隊所打造,是一個文本生成的官方 demo。
## 如果你在尋找由 Hugging Face 團隊所提供的客製化支援服務
<a target="_blank" href="https://huggingface.co/support">
<img alt="HuggingFace Expert Acceleration Program" src="https://huggingface.co/front/thumbnails/support.png" style="max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
</a><br>
## 快速上手
我們為快速使用模型提供了 `pipeline` API。 Pipeline 包含了預訓練模型和對應的文本預處理。下面是一個快速使用 pipeline 去判斷正負面情緒的例子:
```python
>>> from transformers import pipeline
# 使用情緒分析 pipeline
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to introduce pipeline to the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9996980428695679}]
```
第二行程式碼下載並快取 pipeline 使用的預訓練模型,而第三行程式碼則在給定的文本上進行了評估。這裡的答案“正面” (positive) 具有 99.97% 的信賴度。
許多的 NLP 任務都有隨選即用的預訓練 `pipeline`。例如,我們可以輕鬆地從給定文本中擷取問題答案:
``` python
>>> from transformers import pipeline
# 使用問答 pipeline
>>> question_answerer = pipeline('question-answering')
>>> question_answerer({
... 'question': 'What is the name of the repository ?',
... 'context': 'Pipeline has been included in the huggingface/transformers repository'
... })
{'score': 0.30970096588134766, 'start': 34, 'end': 58, 'answer': 'huggingface/transformers'}
```
除了提供問題解答,預訓練模型還提供了對應的信賴度分數以及解答在 tokenized 後的文本中開始和結束的位置。你可以從[這個教學](https://huggingface.co/docs/transformers/task_summary)了解更多 `pipeline` API支援的任務。
要在你的任務中下載和使用任何預訓練模型很簡單,只需三行程式碼。這裡是 PyTorch 版的範例:
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)
```
這裡是對應的 TensorFlow 程式碼:
```python
>>> from transformers import AutoTokenizer, TFAutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased")
>>> inputs = tokenizer("Hello world!", return_tensors="tf")
>>> outputs = model(**inputs)
```
Tokenizer 為所有的預訓練模型提供了預處理,並可以直接轉換單一字串(比如上面的例子)或串列 (list)。它會輸出一個的字典 (dict) 讓你可以在下游程式碼裡使用或直接藉由 `**` 運算式傳給模型。
模型本身是一個常規的 [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) 或 [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)(取決於你的後端),可依常規方式使用。 [這個教學](https://huggingface.co/transformers/training.html)解釋了如何將這樣的模型整合到一般的 PyTorch 或 TensorFlow 訓練迴圈中,或是如何使用我們的 `Trainer` API 在一個新的資料集上快速進行微調。
## 為什麼要用 transformers
1. 便於使用的先進模型:
- NLU 和 NLG 上性能卓越
- 對教學和實作友好且低門檻
- 高度抽象,使用者只須學習 3 個類別
- 對所有模型使用的制式化API
1. 更低的運算成本,更少的碳排放:
- 研究人員可以分享已訓練的模型而非每次從頭開始訓練
- 工程師可以減少計算時間以及生產成本
- 數十種模型架構、兩千多個預訓練模型、100多種語言支援
1. 對於模型生命週期的每一個部分都面面俱到:
- 訓練先進的模型,只需 3 行程式碼
- 模型可以在不同深度學習框架之間任意轉換
- 為訓練、評估和生產選擇最適合的框架,並完美銜接
1. 為你的需求輕鬆客製化專屬模型和範例:
- 我們為每種模型架構提供了多個範例來重現原論文結果
- 一致的模型內部架構
- 模型檔案可單獨使用,便於修改和快速實驗
## 什麼情況下我不該用 transformers
- 本函式庫並不是模組化的神經網絡工具箱。模型文件中的程式碼並未做額外的抽象封裝,以便研究人員快速地翻閱及修改程式碼,而不會深陷複雜的類別包裝之中。
- `Trainer` API 並非相容任何模型,它只為本函式庫中的模型最佳化。對於一般的機器學習用途,請使用其他函式庫。
- 儘管我們已盡力而為,[examples 目錄](https://github.com/huggingface/transformers/tree/main/examples)中的腳本也僅為範例而已。對於特定問題,它們並不一定隨選即用,可能需要修改幾行程式碼以符合需求。
立即探索 [Hub](https://huggingface.com/),尋找合適的模型,並使用 Transformers 幫助您快速上手
## 安裝
### 使用 pip
Transformers 支援 Python 3.9+ 和 [PyTorch](https://pytorch.org/get-started/locally/) 2.1+。
這個 Repository 已在 Python 3.9+、Flax 0.4.1+、PyTorch 2.1+ 和 TensorFlow 2.6+ 下經過測試
使用 [venv](https://docs.python.org/3/library/venv.html) 或基於 Rust 的高速 Python 套件及專案管理器 [uv](https://docs.astral.sh/uv/) 來建立並啟用虛擬環境
你可以在[虛擬環境](https://docs.python.org/3/library/venv.html)中安裝 🤗 Transformers。如果你還不熟悉 Python 的虛擬環境,請閱此[使用者指引](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)。
首先,用你打算使用的版本的 Python 創建一個虛擬環境並進入。
然後,你需要安裝 Flax、PyTorch 或 TensorFlow 其中之一。對於該如何在你使用的平台上安裝這些框架,請參閱 [TensorFlow 安裝頁面](https://www.tensorflow.org/install/), [PyTorch 安裝頁面](https://pytorch.org/get-started/locally/#start-locally) 或 [Flax 安裝頁面](https://github.com/google/flax#quick-install)。
當其中一個後端安裝成功後,🤗 Transformers 可依此安裝:
```bash
pip install transformers
```py
# venv
python -m venv .my-env
source .my-env/bin/activate
# uv
uv venv .my-env
source .my-env/bin/activate
```
如果你想要試試範例或者想在正式發布前使用最新開發中的程式碼,你必須[從原始碼安裝](https://huggingface.co/docs/transformers/installation#installing-from-source)
在您的虛擬環境中安裝 Transformers
### 使用 conda
```py
# pip
pip install "transformers[torch]"
🤗 Transformers 可以藉由 conda 依此安裝:
```shell script
conda install conda-forge::transformers
# uv
uv pip install "transformers[torch]"
```
> **_筆記:_** 從 `huggingface` 頻道安裝 `transformers` 已被淘汰
如果您想使用函式庫的最新變更或有興趣參與貢獻,可以從原始碼安裝 Transformers。然而*最新*版本可能不穩定。如果您遇到任何錯誤,歡迎隨時提交一個 [issue](https://github.com/huggingface/transformers/issues)
要藉由 conda 安裝 Flax、PyTorch 或 TensorFlow 其中之一,請參閱它們各自安裝頁面的說明。
```shell
git clone https://github.com/huggingface/transformers.git
cd transformers
## 模型架構
# pip
pip install '.[torch]'
**🤗 Transformers 支援的[所有的模型檢查點](https://huggingface.co/models)**,由[使用者](https://huggingface.co/users)和[組織](https://huggingface.co/organizations)上傳,均與 huggingface.co [model hub](https://huggingface.co) 完美結合。
# uv
uv pip install '.[torch]'
```
目前的檢查點數量: ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen)
## 快速入門
🤗 Transformers 目前支援以下的架構: 模型概覽請參閱[這裡](https://huggingface.co/docs/transformers/model_summary).
透過 [Pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) API 快速開始使用 Transformers。`Pipeline` 是一個高階的推理類別,支援文字、音訊、視覺和多模態任務。它負責處理輸入資料的預處理,並回傳適當的輸出。
要檢查某個模型是否已有 Flax、PyTorch 或 TensorFlow 的實作,或其是否在🤗 Tokenizers 函式庫中有對應的 tokenizer敬請參閱[此表](https://huggingface.co/docs/transformers/index#supported-frameworks)
實例化一個 pipeline 並指定用於文字生成的模型。該模型會被下載並快取,方便您之後輕鬆複用。最後,傳入一些文字來提示模型
這些實作均已於多個資料集測試(請參閱範例腳本)並應與原版實作表現相當。你可以在範例文件的[此節](https://huggingface.co/docs/transformers/examples)中了解實作的細節。
```py
from transformers import pipeline
pipeline = pipeline(task="text-generation", model="Qwen/Qwen2.5-1.5B")
pipeline("the secret to baking a really good cake is ")
[{'generated_text': 'the secret to baking a really good cake is 1) to use the right ingredients and 2) to follow the recipe exactly. the recipe for the cake is as follows: 1 cup of sugar, 1 cup of flour, 1 cup of milk, 1 cup of butter, 1 cup of eggs, 1 cup of chocolate chips. if you want to make 2 cakes, how much sugar do you need? To make 2 cakes, you will need 2 cups of sugar.'}]
```
## 了解更多
與模型進行聊天,使用模式是相同的。唯一的區別是您需要建構一個您與系統之間的聊天歷史(作為 `Pipeline` 的輸入)。
| 章節 | 描述 |
|-|-|
| [文件](https://huggingface.co/transformers/) | 完整的 API 文件和教學 |
| [任務概覽](https://huggingface.co/docs/transformers/task_summary) | 🤗 Transformers 支援的任務 |
| [預處理教學](https://huggingface.co/docs/transformers/preprocessing) | 使用 `Tokenizer` 來為模型準備資料 |
| [訓練和微調](https://huggingface.co/docs/transformers/training) | 使用 PyTorch/TensorFlow 的內建的訓練方式或於 `Trainer` API 中使用 🤗 Transformers 提供的模型 |
| [快速上手:微調和範例腳本](https://github.com/huggingface/transformers/tree/main/examples) | 為各種任務提供的範例腳本 |
| [模型分享和上傳](https://huggingface.co/docs/transformers/model_sharing) | 上傳並與社群分享你微調的模型 |
| [遷移](https://huggingface.co/docs/transformers/migration) | 從 `pytorch-transformers` 或 `pytorch-pretrained-bert` 遷移到 🤗 Transformers |
> [!TIP]
> 你也可以直接在命令列中與模型聊天。
> ```shell
> transformers chat Qwen/Qwen2.5-0.5B-Instruct
> ```
```py
import torch
from transformers import pipeline
chat = [
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", dtype=torch.bfloat16, device_map="auto")
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
```
展開下面的範例,查看 `Pipeline` 如何在不同模態和任務上運作。
<details>
<summary>自動語音辨識</summary>
```py
from transformers import pipeline
pipeline = pipeline(task="automatic-speech-recognition", model="openai/whisper-large-v3")
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
```
</details>
<details>
<summary>影像分類</summary>
<h3 align="center">
<a><img src="https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png"></a>
</h3>
```py
from transformers import pipeline
pipeline = pipeline(task="image-classification", model="facebook/dinov2-small-imagenet1k-1-layer")
pipeline("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'label': 'macaw', 'score': 0.997848391532898},
{'label': 'sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita',
'score': 0.0016551691805943847},
{'label': 'lorikeet', 'score': 0.00018523589824326336},
{'label': 'African grey, African gray, Psittacus erithacus',
'score': 7.85409429227002e-05},
{'label': 'quail', 'score': 5.502637941390276e-05}]
```
</details>
<details>
<summary>視覺問答</summary>
<h3 align="center">
<a><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg"></a>
</h3>
```py
from transformers import pipeline
pipeline = pipeline(task="visual-question-answering", model="Salesforce/blip-vqa-base")
pipeline(
image="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg",
question="What is in the image?",
)
[{'answer': 'statue of liberty'}]
```
</details>
## 為什麼我應該使用 Transformers
1. 易於使用的最先進模型:
* 在自然語言理解與生成、電腦視覺、音訊、影片和多模態任務上表現卓越。
* 為研究人員、工程師與開發者提供了低門檻的入門途徑。
* 面向使用者的抽象層級少,只需學習三個核心類別。
* 為所有預訓練模型提供了統一的 API 介面。
2. 更低的運算成本,更小的碳足跡:
* 分享訓練好的模型,而不是從零開始訓練。
* 減少運算時間和生產成本。
* 擁有數十種模型架構和超過100萬個橫跨所有模態的預訓練檢查點。
3. 為模型的每個生命週期階段選擇合適的框架:
* 僅用3行程式碼即可訓練最先進的模型。
* 在PyTorch/JAX/TF2.0框架之間輕鬆切換單一模型。
* 為訓練、評估和生產選擇最合適的框架。
4. 輕鬆根據您的需求客製化模型或範例:
* 我們為每個架構提供了範例,以重現其原作者發表的結果。
* 模型內部結構盡可能保持一致地暴露給使用者。
* 模型檔案可以獨立於函式庫使用,便於快速實驗。
<a target="_blank" href="https://huggingface.co/enterprise">
<img alt="Hugging Face Enterprise Hub" src="https://github.com/user-attachments/assets/247fb16d-d251-4583-96c4-d3d76dda4925">
</a><br>
## 為什麼我不應該使用 Transformers
- 本函式庫並非一個用於建構神經網路的模組化工具箱。模型檔案中的程式碼為了讓研究人員能快速在模型上迭代,而沒有進行過度的抽象重構,避免了深入額外的抽象層/檔案。
- 訓練 API 針對 Transformers 提供的 PyTorch 模型進行了最佳化。對於通用的機器學習迴圈,您應該使用像 [Accelerate](https://huggingface.co/docs/accelerate) 這樣的其他函式庫。
- [範例指令稿](https://github.com/huggingface/transformers/tree/main/examples)僅僅是*範例*。它們不一定能在您的特定用例上開箱即用,您可能需要修改程式碼才能使其正常運作。
## 100個使用 Transformers 的專案
Transformers 不僅僅是一個使用預訓練模型的工具包,它還是一個圍繞它和 Hugging Face Hub 建構的專案社群。我們希望 Transformers 能夠賦能開發者、研究人員、學生、教授、工程師以及其他任何人,去建構他們夢想中的專案。
為了慶祝 Transformers 獲得 10 萬顆星標,我們希望透過 [awesome-transformers](./awesome-transformers.md) 頁面來聚焦社群該頁面列出了100個基於 Transformers 建構的精彩專案。
如果您擁有或使用一個您認為應該被列入其中的專案,請隨時提交 PR 將其加入!
## 範例模型
您可以在我們大多數模型的 [Hub 模型頁面](https://huggingface.co/models) 上直接進行測試。
展開下面的每個模態,查看一些用於不同用例的範例模型。
<details>
<summary>音訊</summary>
- Audio classification with [Whisper](https://huggingface.co/openai/whisper-large-v3-turbo)
- Automatic speech recognition with [Moonshine](https://huggingface.co/UsefulSensors/moonshine)
- Keyword spotting with [Wav2Vec2](https://huggingface.co/superb/wav2vec2-base-superb-ks)
- Speech to speech generation with [Moshi](https://huggingface.co/kyutai/moshiko-pytorch-bf16)
- Text to audio with [MusicGen](https://huggingface.co/facebook/musicgen-large)
- Text to speech with [Bark](https://huggingface.co/suno/bark)
</details>
<details>
<summary>電腦視覺</summary>
- Automatic mask generation with [SAM](https://huggingface.co/facebook/sam-vit-base)
- Depth estimation with [DepthPro](https://huggingface.co/apple/DepthPro-hf)
- Image classification with [DINO v2](https://huggingface.co/facebook/dinov2-base)
- Keypoint detection with [SuperPoint](https://huggingface.co/magic-leap-community/superpoint)
- Keypoint matching with [SuperGlue](https://huggingface.co/magic-leap-community/superglue_outdoor)
- Object detection with [RT-DETRv2](https://huggingface.co/PekingU/rtdetr_v2_r50vd)
- Pose Estimation with [VitPose](https://huggingface.co/usyd-community/vitpose-base-simple)
- Universal segmentation with [OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_swin_large)
- Video classification with [VideoMAE](https://huggingface.co/MCG-NJU/videomae-large)
</details>
<details>
<summary>多模態</summary>
- Audio or text to text with [Qwen2-Audio](https://huggingface.co/Qwen/Qwen2-Audio-7B)
- Document question answering with [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base)
- Image or text to text with [Qwen-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
- Image captioning [BLIP-2](https://huggingface.co/Salesforce/blip2-opt-2.7b)
- OCR-based document understanding with [GOT-OCR2](https://huggingface.co/stepfun-ai/GOT-OCR-2.0-hf)
- Table question answering with [TAPAS](https://huggingface.co/google/tapas-base)
- Unified multimodal understanding and generation with [Emu3](https://huggingface.co/BAAI/Emu3-Gen)
- Vision to text with [Llava-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf)
- Visual question answering with [Llava](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
- Visual referring expression segmentation with [Kosmos-2](https://huggingface.co/microsoft/kosmos-2-patch14-224)
</details>
<details>
<summary>自然語言處理 (NLP)</summary>
- Masked word completion with [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base)
- Named entity recognition with [Gemma](https://huggingface.co/google/gemma-2-2b)
- Question answering with [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)
- Summarization with [BART](https://huggingface.co/facebook/bart-large-cnn)
- Translation with [T5](https://huggingface.co/google-t5/t5-base)
- Text generation with [Llama](https://huggingface.co/meta-llama/Llama-3.2-1B)
- Text classification with [Qwen](https://huggingface.co/Qwen/Qwen2.5-0.5B)
</details>
## 引用
我們已將此函式庫的[論文](https://www.aclweb.org/anthology/2020.emnlp-demos.6/)正式發表。如果你使用了 🤗 Transformers 函式庫,可以引用
現在我們有一篇可供您引用的關於 🤗 Transformers 函式庫的 [論文](https://www.aclweb.org/anthology/2020.emnlp-demos.6/)
```bibtex
@inproceedings{wolf-etal-2020-transformers,
title = "Transformers: State-of-the-Art Natural Language Processing",
@ -285,4 +324,4 @@ conda install conda-forge::transformers
url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
pages = "38--45"
}
```
```

View File

@ -137,8 +137,8 @@ _deps = [
"psutil",
"pyyaml>=5.1",
"pydantic>=2",
"pytest>=7.2.0",
"pytest-asyncio",
"pytest>=7.2.0,<9.0.0",
"pytest-asyncio>=1.2.0",
"pytest-rerunfailures<16.0",
"pytest-timeout",
"pytest-xdist",

View File

@ -813,7 +813,7 @@ class Serve:
# TODO (Joao, Lysandre): this should also work with tool support
inputs = processor.apply_chat_template(req["messages"], return_tensors="pt", add_generation_prompt=True).to(
model.device
)[0]
)["input_ids"][0]
def stream_chat_completion(request_id, decode_stream):
try:
@ -1237,7 +1237,7 @@ class Serve:
else:
raise TypeError("inputs should be a list, dict, or str")
inputs = processor.apply_chat_template(inputs, add_generation_prompt=True, return_tensors="pt")
inputs = processor.apply_chat_template(inputs, add_generation_prompt=True, return_tensors="pt")["input_ids"]
inputs = inputs.to(model.device)
request_id = req.get("previous_response_id", "req_0")
@ -1541,7 +1541,7 @@ class Serve:
else:
raise ValueError("inputs should be a list, dict, or str")
inputs = processor.apply_chat_template(inputs, add_generation_prompt=True, return_tensors="pt")
inputs = processor.apply_chat_template(inputs, add_generation_prompt=True, return_tensors="pt")["input_ids"]
inputs = inputs.to(model.device)
request_id = req.get("previous_response_id", "req_0")

View File

@ -876,7 +876,7 @@ class PreTrainedConfig(PushToHubMixin):
if hasattr(self, "quantization_config"):
serializable_config_dict["quantization_config"] = (
self.quantization_config.to_dict()
if not isinstance(self.quantization_config, dict)
if not isinstance(self.quantization_config, dict) and self.quantization_config is not None
else self.quantization_config
)
self.dict_dtype_to_str(serializable_config_dict)
@ -910,7 +910,7 @@ class PreTrainedConfig(PushToHubMixin):
if hasattr(self, "quantization_config"):
output["quantization_config"] = (
self.quantization_config.to_dict()
if not isinstance(self.quantization_config, dict)
if not isinstance(self.quantization_config, dict) and self.quantization_config is not None
else self.quantization_config
)
self.dict_dtype_to_str(output)

View File

@ -0,0 +1,136 @@
# coding=utf-8
# Copyright (C) 2025 the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from copy import deepcopy
from .core_model_loading import Concatenate, MergeModulelist, WeightConverter
from .utils import is_torch_available
if is_torch_available():
import torch
def _build_checkpoint_conversion_mapping():
mapping = {
"mixtral": [
WeightConverter(
source_keys=[
"block_sparse_moe.experts.*.w1.weight",
"block_sparse_moe.experts.*.w3.weight",
], # you give me a list of 2 keys, I collect a list of a list of tensors
target_keys="mlp.experts.gate_up_proj", # target key gets the list of two tensors
operations=[
MergeModulelist(
dim=0
), # each process has two lists of tensors, we cat each list. -> we end up with 2 tensors
Concatenate(dim=1), # each process has 2 tensors, gate and up, we concat them into gate_up
], # we want the loading to add this shard operation here. Though we can't shard after concats and merge, needs to be first
),
WeightConverter(
source_keys=[
"block_sparse_moe.experts.*.w2.weight",
],
target_keys="mlp.experts.down_proj", # target key gets the list of two tensors
operations=[
MergeModulelist(
dim=0
), # each process has two lists of tensors, we cat each list. -> we end up with 2 tensors
], # we want the loading to add this shard operation here. Though we can't shard after concats and merge, needs to be first
),
# WeightConverter(
# ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
# "self_attn.qkv_proj",
# operations=[Concatenate(dim=0)], # more like stack?
# ),
WeightConverter("*.block_sparse_moe.", "*.mlp."),
],
"qwen2_moe": [
WeightConverter(
source_keys=[
"mlp.experts.*.gate_proj.weight",
"mlp.experts.*.up_proj.weight",
],
target_keys="mlp.experts.gate_up_proj",
operations=[MergeModulelist(dim=0), Concatenate(dim=1)],
),
WeightConverter(
source_keys=["mlp.experts.*.down_proj.weight"],
target_keys="mlp.experts.down_proj",
operations=[MergeModulelist(dim=0)],
),
],
"legacy": [
WeightConverter(
source_keys="LayerNorm.gamma",
target_keys="LayerNorm.weight",
),
WeightConverter(
source_keys="LayerNorm.beta",
target_keys="LayerNorm.bias",
),
],
}
if hasattr(torch.nn.utils.parametrizations, "weight_norm"):
mapping["legacy"] += [
WeightConverter(
source_keys="weight_g",
target_keys="parametrizations.weight.original0",
),
WeightConverter(
source_keys="weight_v",
target_keys="parametrizations.weight.original1",
),
]
else:
mapping["legacy"] += [
WeightConverter(
source_keys="parametrizations.weight.original0",
target_keys="weight_g",
),
WeightConverter(
source_keys="parametrizations.weight.original1",
target_keys="weight_v",
),
]
mapping["phimoe"] = mapping["mixtral"].copy()
mapping["deepseek_v2"] = mapping["qwen2_moe"].copy()
mapping["deepseek_v3"] = mapping["qwen2_moe"].copy()
mapping["dot1"] = mapping["qwen2_moe"].copy()
mapping["ernie_4_5_moe"] = mapping["qwen2_moe"].copy()
mapping["glm4_moe"] = mapping["qwen2_moe"].copy()
mapping["glm4v_moe"] = mapping["qwen2_moe"].copy()
mapping["jamba"] = mapping["qwen2_moe"].copy()
mapping["lfm2_moe"] = mapping["mixtral"].copy()
mapping["long_cat_flash"] = mapping["qwen2_moe"].copy()
mapping["qwen3_moe"] = mapping["qwen2_moe"].copy()
mapping["qwen3_omni_moe"] = mapping["qwen2_moe"].copy()
mapping["qwen3_next"] = mapping["qwen2_moe"].copy()
mapping["qwen3_vl_moe"] = mapping["qwen2_moe"].copy()
mapping["hunyuan_v1_moe"] = mapping["qwen2_moe"].copy()
mapping["minimax"] = mapping["mixtral"].copy()
return mapping
_checkpoint_conversion_mapping_cache = None
def get_checkpoint_conversion_mapping(model_type):
global _checkpoint_conversion_mapping_cache
_checkpoint_conversion_mapping_cache = _build_checkpoint_conversion_mapping()
globals()["_checkpoint_conversion_mapping"] = _checkpoint_conversion_mapping_cache
return deepcopy(_checkpoint_conversion_mapping_cache.get(model_type, None))

View File

@ -0,0 +1,732 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Core helpers for loading model checkpoints."""
from __future__ import annotations
import itertools
import os
import re
from abc import abstractmethod
from collections import defaultdict
from collections.abc import MutableMapping, MutableSet, Sequence
from concurrent.futures import Future, ThreadPoolExecutor
from contextlib import contextmanager
from dataclasses import dataclass, field
from functools import partial
from types import MethodType
from typing import TYPE_CHECKING, Any, Optional, Union
import torch
from .integrations.tensor_parallel import ALL_PARALLEL_STYLES, DTensor, Replicate, TensorParallelLayer
from .utils import is_torch_greater_or_equal, logging
_torch_distributed_available = torch.distributed.is_available()
_is_dtensor_available = _torch_distributed_available and is_torch_greater_or_equal("2.5")
if _is_dtensor_available:
from torch.distributed.tensor import DTensor
if TYPE_CHECKING:
from .modeling_utils import PreTrainedModel
from .quantizers import HfQuantizer
logger = logging.get_logger(__name__)
str_to_torch_dtype = {
"BOOL": torch.bool,
"U8": torch.uint8,
"I8": torch.int8,
"I16": torch.int16,
"F16": torch.float16,
"BF16": torch.bfloat16,
"I32": torch.int32,
"F32": torch.float32,
"F64": torch.float64,
"I64": torch.int64,
"F8_E4M3": torch.float8_e4m3fn,
"F8_E5M2": torch.float8_e5m2,
}
logger = logging.get_logger(__name__)
def _glob_to_regex_src(glob: str, *, digits_only: bool = True) -> str:
"""
Convert a glob with '*' into a regex *source* string. We don't use `glob.translate`
'*' matches (\\d+) if digits_only else (.+). Inner groups are non-capturing.
"""
star = r"(\d+)" if digits_only else r"(.+)"
return glob.replace(r"\*", star)
def build_glob_alt(
globs: list[str],
) -> tuple[re.Pattern, dict[str, str]]:
r"""
Build one compiled regex alternation with a named group per glob. This allows to run a single
re.match and get the correct group name to finally get which pattern matched.
Returns (compiled_regex, name->glob map).
Example:
```py
>>> reg, map_ = build_glob_alt(["mlp.*.w1", "mlp.*.w2"])
>>> print(reg)
(re.compile(r'(?P<g0>.*mlp\.(\d+)\.w1)|(?P<g1>.*mlp\.(\d+)\.w2)', re.UNICODE),
>>> print(map_)
{'g0': 'mlp.*.w1', 'g1': 'mlp.*.w2'})
>>> match_ = reg.match("model.layers.0.mlp.0.w1.weight")
>>> print(match_.lastgroup)
'g0'
>>> print(map_[match_.lastgroup])
mlp.*.w1
```
"""
name_map: dict[str, str] = {}
parts: list[str] = []
for i, g in enumerate(globs):
name = f"g{i}"
name_map[name] = g
pat_src = _glob_to_regex_src(g)
prefix_src = ""
if pat_src.startswith("*"):
prefix_src = "."
elif not pat_src.startswith(r"\^") and not pat_src.startswith(r".*"):
prefix_src = ".*"
parts.append(f"(?P<{name}>{prefix_src}{pat_src}.*)")
alt_src = "|".join(parts).replace("\\^", "^").replace("\\.", r"\.")
try:
reg = re.compile(alt_src)
except re.error as e:
logger.error(f"Error compiling regex for alternation: {alt_src}")
raise e
return reg, name_map
def match_glob(key: str, alt: re.Pattern, name_map: dict[str, str]) -> Optional[str]:
"""
Match the key against the alternation; return the original glob string that matched.
"""
m = alt.match(key)
if not m:
return None
return name_map.get(m.lastgroup)
class ConversionOps:
"""Base class for weight conversion operations."""
# The inverse operation class, will be used when saving the checkpoint
reverse_op: type[ConversionOps]
@abstractmethod
def convert(
self, value: Union[dict[str, torch.Tensor], Sequence[torch.Tensor], torch.Tensor], *args, **kwargs
) -> torch.Tensor:
raise NotImplementedError
class Chunk(ConversionOps):
"""Split a tensor along ``dim`` into equally sized chunks or using explicit ``sizes``."""
reverse_op: type[ConversionOps]
def __init__(self, dim: int = 0, chunks: Optional[int] = None, sizes: Optional[Sequence[int]] = None):
if chunks is None and sizes is None:
raise ValueError("`chunks` or `sizes` must be provided for Chunk operations.")
if chunks is not None and chunks <= 0:
raise ValueError("`chunks` must be a strictly positive integer.")
self.dim = dim
self.chunks = chunks
self.sizes = list(sizes) if sizes is not None else None
self.reverse_op = Concatenate
def convert(self, value: torch.Tensor, *args, **kwargs) -> list[torch.Tensor]:
# chunk requires a single tensor input
if len(value) != 1 or len(value[0]) != 1:
raise ValueError("Chunk operation requires a single tensor input.")
return list(torch.chunk(value[0][0], self.chunks, dim=self.dim))
class Concatenate(ConversionOps):
"""Concatenate tensors along `dim` using a reusable buffer."""
reverse_op: type[ConversionOps]
def __init__(self, dim: int = 0):
self.dim = dim
self.reverse_op = Chunk
@torch.no_grad
def convert(self, value: Sequence[torch.Tensor], *args, **kwargs) -> torch.Tensor:
if isinstance(value[0], list):
value = [v[0] for v in value]
tensors = value
if not tensors:
raise ValueError("Fuse requires at least one tensor to concatenate.")
return torch.cat(tuple(tensors), dim=self.dim)
class MergeModulelist(Concatenate):
"""
Merge a list of tensors into a single tensor along the first dimension.
We explicitly define this because for EP or TP you want to make sure you know what you are doing!
"""
def __init__(self, dim: int = 0):
super().__init__(dim=dim)
self.reverse_op = SplitModulelist
@torch.no_grad
def convert(self, value: Sequence[torch.Tensor], *args, **kwargs) -> list[torch.Tensor]:
merged = []
for group in value:
if not isinstance(group, Sequence) or len(group) == 0:
raise ValueError("MergeModulelist requires non-empty sub-sequences.")
group = [k for k in group if k.ndim]
merged.append(torch.stack(group, dim=self.dim))
return merged
class SplitModulelist(ConversionOps):
"""Inverse of :class:`MergeModulelist` using explicit split sizes per group."""
def __init__(self, sizes: Sequence[Sequence[int]], dim: int = 0):
if not isinstance(sizes, Sequence) or not all(isinstance(sub, Sequence) and sub for sub in sizes):
raise ValueError("`sizes` must be a sequence of non-empty sequences of integers.")
self.sizes = [list(sub) for sub in sizes]
self.dim = dim
self.reverse_op = MergeModulelist
@torch.no_grad
def convert(self, value: Sequence[torch.Tensor], *, context: dict[str, Any]) -> list[list[torch.Tensor]]:
if not isinstance(value, Sequence):
raise TypeError("SplitModulelist expects a sequence of tensors.")
if len(value) != len(self.sizes):
raise ValueError("Number of tensors does not match the provided split specifications.")
result: list[list[torch.Tensor]] = []
for tensor, split_sizes in zip(value, self.sizes):
if not isinstance(tensor, torch.Tensor):
raise TypeError("SplitModulelist can only split torch.Tensor instances.")
splits = torch.split(tensor, split_sizes, dim=self.dim)
result.append(list(splits))
return result
class PermuteForRope(ConversionOps):
"""
Applies the permutation required to convert complex RoPE weights to the split sin/cos format.
"""
def __init__(self):
pass
def _apply(self, tensor: torch.Tensor) -> torch.Tensor:
dim1, dim2 = tensor.shape
n_heads = self.config.getattr("num_attention_heads", 1)
tensor = tensor.view(n_heads, dim1 // n_heads // 2, 2, dim2)
tensor = tensor.transpose(1, 2).reshape(dim1, dim2)
return tensor
@torch.no_grad
def convert(
self, value: Union[dict[str, torch.Tensor], Sequence[torch.Tensor], torch.Tensor], config
) -> Union[dict[str, torch.Tensor], list[torch.Tensor], torch.Tensor]:
self.config = config
out = [[self._apply(x) for x in inner] if isinstance(inner, list) else self._apply(inner) for inner in value]
return out
@dataclass(slots=True)
class WeightConverter:
r"""
A weight convert that acts on a pattern of source keys.
The keys need to be collected based on the target keys.
With wild card, glob patterns are matched, so you have to be detailed with what to match. If you match:
`model.layers.*.experts.*` -> it will act on all of them
{"model.layers.*.experts.*": []}
but
`experts.*.mlp` will be layer specific.
{"model.layers.1.experts.*": [], }
- source_keys: str | list[str] (wildcards '*' match digits)
- target_keys: str | list[str] | None
- distributed_operation / operations / quantization_operations are ALWAYS lists.
TODO: for BNB we need to collect model.weight.quant_state_keys
"""
source_keys: Union[str, list[str]]
target_keys: Optional[Union[str, list[str]]] = None
operations: list[ConversionOps] = field(default_factory=list, repr=False)
distributed_operation: Optional[TensorParallelLayer] = None
quantization_operation: Optional[ConversionOps] = None
def __post_init__(self):
if not isinstance(self.source_keys, list):
self.source_keys = [self.source_keys]
targets_were_none = False
if not isinstance(self.target_keys, list):
if self.target_keys is None:
self.target_keys = list(self.source_keys)
targets_were_none = True
else:
self.target_keys = [self.target_keys]
if not targets_were_none and bool(len(self.source_keys) - 1) + bool(len(self.target_keys) - 1) >= 2:
raise ValueError(
f"source keys={self.source_keys}, target_keys={self.target_keys} but you can only have one to many, one to one or many to one."
)
@dataclass(slots=True)
class ConversionEntry:
weight_converter: WeightConverter
collected_tensors: dict = field(default_factory=lambda: defaultdict(dict))
GLOBAL_WORKERS = min(16, (os.cpu_count() or 8) * 2) # NVMe: 8-16; HDD/NFS: 2-4
# Factory function to create LoadedParameter subclasses dynamically
def get_loaded_parameter_class(base_cls):
"""
base_cls: an nn.Parameter subclass (or nn.Parameter) or a Tensor
Returns a new class that combines the base_cls with LoadedParameterMixin
"""
class LoadedParam(base_cls):
_inplace_methods = [
"add_",
"mul_",
"clamp_",
"zero_",
"fill_",
"normal_",
"uniform_",
"copy_",
"erfinv_",
"log_",
"__getitem__",
"neg_",
"exp_",
"sub_",
]
def __new__(cls, from_existing, **kwargs):
if isinstance(from_existing, torch.nn.Parameter):
inst = super().__new__(cls, from_existing.data, from_existing.requires_grad, **from_existing.__dict__)
else:
inst = super().__new__(cls, from_existing)
# we store the original object to get it back later on
inst._original = from_existing
# Explicitly override all in-place methods per instance
for method_name in inst._inplace_methods:
setattr(inst, method_name, MethodType(inst._skip, inst))
return inst
def _skip(self, *args, **kwargs):
"""Helper to skip in-place operations."""
return self
def __repr__(self):
return f"LoadedParameter(data={self.data})"
@property
def data(self):
return super().data
@data.setter
def data(self, new):
pass
def __lt__(self, other):
return torch.Tensor.__lt__(self, other)
def __le__(self, other):
return torch.Tensor.__le__(self, other)
def __gt__(self, other):
return torch.Tensor.__gt__(self, other)
def __ge__(self, other):
return torch.Tensor.__ge__(self, other)
def __eq__(self, other):
return torch.Tensor.__eq__(self, other)
def __ne__(self, other):
return torch.Tensor.__ne__(self, other)
def __iadd__(self, *args, **kwargs):
return self
def __isub__(self, *args, **kwargs):
return self
def __imul__(self, *args, **kwargs):
return self
def __imatmul__(self, *args, **kwargs):
return self
def __itruediv__(self, *args, **kwargs):
return self
def __ifloordiv__(self, *args, **kwargs):
return self
def __imod__(self, *args, **kwargs):
return self
def __ipow__(self, *args, **kwargs):
return self
def __iand__(self, *args, **kwargs):
return self
def __ior__(self, *args, **kwargs):
return self
def __ixor__(self, *args, **kwargs):
return self
def __ilshift__(self, *args, **kwargs):
return self
def __irshift__(self, *args, **kwargs):
return self
return LoadedParam
def _materialize_copy(tensor, dtype=None):
tensor = tensor[...]
if dtype is not None:
tensor = tensor.to(dtype)
return tensor
def spawn_materialize(thread_pool, tensor, dtype=None) -> Future:
def _job():
return _materialize_copy(tensor, dtype)
return thread_pool.submit(_job)
def spawn_tp_materialize(thread_pool, tensor, sharding_method, tensor_idx, dtype=None) -> Future:
def _job():
return sharding_method.shard_tensor(tensor, param_casting_dtype=dtype, tensor_idx=tensor_idx)[0]
return thread_pool.submit(_job)
def dot_natural_key(s: str):
parts = s.split(".")
for i, p in enumerate(parts):
# whole-segment digits -> int; otherwise leave as str
if p.isdigit():
parts[i] = int(p)
return parts
@contextmanager
def log_to_misc(
layer_name: str,
misc: MutableMapping[str, str],
extras: Any = None,
op: Union[list[ConversionOps], ConversionOps, None] = None,
):
# A simple helper to handle errors with contextual messages.
try:
yield
except Exception as e:
def _format_op_name(curr_op: Union[list[ConversionOps], ConversionOps, None]) -> Optional[str]:
if curr_op is None:
return None
if isinstance(curr_op, (list, tuple, set)):
names = [o.__class__.__name__ for o in curr_op if o is not None]
if not names:
return None
return ", ".join(names)
return curr_op.__class__.__name__
op_name = _format_op_name(op)
if isinstance(extras, tuple) and len(extras) == 2:
values, target_keys = extras
descriptor = f"{op_name} " if op_name else ""
misc[layer_name] = (
f"{e}\nError: {descriptor}on tensors destined for {target_keys}. Ckpt contains: {len(values[0])}"
)
elif isinstance(extras, str):
suffix = f" via {op_name}" if op_name else ""
misc[layer_name] = f"{e}\nError{suffix} when processing parameter {extras}"
elif extras is None and op_name:
misc[layer_name] = f"{op_name}: {e}"
else:
misc[layer_name] = f"{extras} |Error: {e}"
raise SkipLayer()
def set_param_for_module(
model: PreTrainedModel,
layer_name: str,
param_value: torch.Tensor,
mismatch_keys: MutableSet[tuple[str, torch.Size, torch.Size]],
missing_keys: MutableSet[str],
misc: MutableMapping[str, Any],
distributed_operation: Optional[TensorParallelLayer],
):
with log_to_misc(layer_name, misc, layer_name):
module_path, _, param_name = layer_name.rpartition(".")
module_obj = model.get_submodule(module_path) if module_path else model
param_value = param_value[0] if isinstance(param_value, list) else param_value[...]
ref = getattr(module_obj, param_name)
use_dtensor = hasattr(distributed_operation, "use_dtensor") and distributed_operation.use_dtensor
if not isinstance(param_value, torch.nn.Parameter):
if distributed_operation is not None:
param_value = DTensor.from_local(
param_value,
distributed_operation.device_mesh,
getattr(distributed_operation, "shard", Replicate()),
run_check=False,
shape=ref.size(),
stride=ref.stride(),
)
if not use_dtensor:
# we convert to local
param_value = param_value.to_local()
if param_name not in module_obj._buffers:
param_value = torch.nn.Parameter(param_value, requires_grad=param_value.is_floating_point())
param_value = get_loaded_parameter_class(param_value.__class__)(from_existing=param_value)
# Remove from missing keys (it's either mismatched, or all good)
missing_keys.discard(layer_name)
if ref is not None and ref.shape != param_value.shape:
mismatch_keys.add((layer_name, param_value.shape, ref.shape))
module_obj.param_name._is_hf_initialized = False # Needs to be initialized
else:
param_value._is_hf_initialized = True # super important otherwise _init_weight re-initi if bias is missing
setattr(module_obj, param_name, param_value)
class SkipLayer(Exception):
"""Control-flow sentinel: abort processing of the current layer only."""
pass
def convert_and_load_state_dict_in_model(
model: PreTrainedModel,
state_dict: dict[str, Any],
weight_mapping: dict[str, WeightConverter] | None,
tp_plan: dict[str, str] | None,
quantizer: HfQuantizer | None,
dtype: torch.dtype | None = None,
device_map: dict | None = None,
dtype_plan: dict | None = None,
device_mesh: torch.distributed.device_mesh.DeviceMesh | None = None,
):
"""
Convert a state dict according to a weight mapping (one WeightConverter per glob pattern),
collecting tensors per *layer instance* (the concrete indices captured from '*').
"""
prefix = model.base_model_prefix
tp_plan = tp_plan or {} # {glob_pattern: plan_obj_or_key}
device_map = device_map or {} # {exact_target_key: device}
dtype_plan = dtype_plan or {} # {glob_pattern: dtype}
weight_mapping = weight_mapping or {} # {glob_pattern: WeightConverter}
meta_model_state_dict = model.state_dict()
missing_keys = set(meta_model_state_dict.keys())
misc = {}
mismatch_keys = set()
unexpected_keys = set()
# Global thread_pool
thread_pool = ThreadPoolExecutor(max_workers=GLOBAL_WORKERS)
_patterns = list(itertools.chain.from_iterable([k.source_keys for k in weight_mapping]))
source_to_target = {sk: k for k in weight_mapping for sk in k.source_keys}
weight_pattern_alt, weight_pattern_by_group_name = build_glob_alt(_patterns)
tp_plan_alt, tp_plan_by_group_name = build_glob_alt(list(tp_plan.keys()))
dtype_policy_alt, dtype_policy_by_group_name = build_glob_alt(list(dtype_plan.keys()))
state_dict = sorted(state_dict.items(), key=lambda kv: dot_natural_key(kv[0]))
# 1. Create the conversion entries
by_conversion_pattern: dict[str, ConversionEntry] = {}
for original_key, tensor in state_dict:
matched_pattern = match_glob(original_key, weight_pattern_alt, weight_pattern_by_group_name)
if matched_pattern is not None:
converter = source_to_target[matched_pattern] # TODO make sure its the ref
sub_with_extractor = partial(re.sub, matched_pattern.replace("*", r"(\d+)"), string=original_key)
entry_key = "|".join(converter.target_keys)
target_key = "|".join(map(sub_with_extractor, [k.replace("*", "\\1") for k in converter.target_keys]))
entry: ConversionEntry = by_conversion_pattern.setdefault(entry_key, ConversionEntry(converter))
converter_key = sub_with_extractor(matched_pattern)
else:
converter = WeightConverter(original_key)
converter_key = entry_key = target_key = original_key
entry = by_conversion_pattern.setdefault(converter_key, ConversionEntry(converter))
_dtype = dtype
new_target_key = [] # test_load_with_mismatched_shapes for AutoModel.from_pretrained(AutoForCausal, vocab=10)
for t in target_key.split("|"):
if t.startswith(prefix) and meta_model_state_dict.get(re.sub(f"^{prefix}.", "", t, count=1)) is not None:
t = re.sub(f"^{prefix}.", "", t, count=1)
elif meta_model_state_dict.get(f"{prefix}.{t}") is not None:
t = f"{prefix}.{t}"
new_target_key.append(t)
empty_param = meta_model_state_dict.get(t)
# If it does not exist, it's unexpected
if empty_param is None:
unexpected_keys.add(t)
continue
if quantizer is not None and quantizer.param_needs_quantization(model, t):
if quantizer.__class__.__name__ == "FineGrainedFP8HfQuantizer":
from .integrations.finegrained_fp8 import Fp8Quantize
converter.quantization_operation = Fp8Quantize() # TODO support other methods
else:
raise ValueError("This quantization method is gonna be supported SOOOON")
else:
_dtype = dtype
matched_dtype_pattern = match_glob(t, dtype_policy_alt, dtype_policy_by_group_name)
if matched_dtype_pattern is not None:
_dtype = dtype_plan[matched_dtype_pattern]
elif empty_param.dtype != _dtype:
_dtype = empty_param.dtype
first_target_key = new_target_key[0]
target_key = "|".join(new_target_key)
future = None
if device_mesh:
if matched_tp_pattern := match_glob(first_target_key, tp_plan_alt, tp_plan_by_group_name):
empty_param = meta_model_state_dict.get(first_target_key)
if getattr(converter, "distributed_operation", {}) is None:
tp_layer = ALL_PARALLEL_STYLES[model.tp_plan[matched_tp_pattern]].__class__
converter.distributed_operation = tp_layer(
device_mesh=device_mesh, rank=device_map[""].index, empty_param=empty_param.clone()
)
# VERY IMPORTANT: this tells us wether we collected stuffs or not.
shard_index = len(entry.collected_tensors[target_key].get(converter_key, []))
future = spawn_tp_materialize(
thread_pool,
tensor,
_dtype,
converter.distributed_operation,
shard_index,
)
if future is None: # If not TP, async materialize the tensors. TODO handle disk offload?
future = spawn_materialize(thread_pool, tensor, _dtype)
entry.collected_tensors[target_key].setdefault(converter_key, []).append(future)
# 2. Actually convert the ckpt
inverse_converters = {}
keys = list(by_conversion_pattern.keys())
with logging.tqdm(total=len(keys), desc="Loading weights") as pbar:
for key in keys[::-1]: # revert to process simple keys first
group = by_conversion_pattern.pop(key)
converter = group.weight_converter
operations = converter.operations if isinstance(converter.operations, list) else [converter.operations]
for layer_name, tensors_for_this_layer in group.collected_tensors.items():
pbar.update(1)
pbar.set_postfix({"Materializing param": layer_name})
pbar.refresh()
concrete_target_keys = layer_name.split("|")
try:
if bool(set(concrete_target_keys) - unexpected_keys):
with log_to_misc(layer_name, misc):
values = [[k.result() for k in inner] for inner in tensors_for_this_layer.values()]
for op in operations:
with log_to_misc(layer_name, misc, (values, concrete_target_keys), operations):
values = op.convert(values, model.config)
values = [values] if not isinstance(values, list) else values
with log_to_misc(layer_name, misc, (values, concrete_target_keys), operations):
realized_value = {
k: t for k, t in zip(concrete_target_keys, values) if k not in unexpected_keys
}
for k in list(realized_value.keys()).copy():
if op := converter.quantization_operation:
with log_to_misc(layer_name, misc, op=op):
realized_value.update(
op.convert(
{k: realized_value.pop(k)}, quant_config=quantizer.quantization_config
)
)
for k, output_value in realized_value.items():
for src in converter.source_keys: # what should happen to k when we meet k at saving
inverse_converters[k] = {src: converter}
set_param_for_module(
model,
k,
output_value,
mismatch_keys,
missing_keys,
misc,
converter.distributed_operation,
)
except SkipLayer:
continue
del group
model.inverse_converters = inverse_converters
thread_pool.shutdown(wait=False)
return missing_keys, unexpected_keys, mismatch_keys, misc
# TODO this is not done yet!
def revert_weight_conversion(model, state_dict):
mapping = getattr(model, "_checkpoint_conversion_mapping", {}) # IDK why but setting this will fail all llava.
reverse_key_mapping = [(v, k) for k, v in mapping.items()]
original_state_dict = {}
for key, value in state_dict.items():
for pattern, inverse_converter in reverse_key_mapping:
# TODO FIXME you name it
replacement = inverse_converter.lstrip("^") # strip off un-needed chars and patterns
replacement = re.sub(r"\(.*\)", "", replacement)
key, n_replace = re.subn(pattern, replacement, key)
# Early exit of the loop
if n_replace > 0:
break
original_state_dict[key] = value
state_dict = original_state_dict
return state_dict

View File

@ -723,7 +723,7 @@ class DataCollatorForLanguageModeling(DataCollatorMixin):
if self.mask_replace_prob < 1:
warnings.warn(
"Random token replacement is not supported with whole word masking.",
"Random token replacement is not supported with whole word masking. "
"Setting mask_replace_prob to 1.",
)
self.mask_replace_prob = 1

View File

@ -82,7 +82,7 @@ class GlueDataset(Dataset):
cache_dir: Optional[str] = None,
):
warnings.warn(
"This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
"This dataset will be removed from the library soon, preprocessing should be handled with the Hugging Face Datasets "
"library. You can have a look at this example script for pointers: "
"https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py",
FutureWarning,

View File

@ -21,7 +21,7 @@ if is_sklearn_available():
DEPRECATION_WARNING = (
"This metric will be removed from the library soon, metrics should be handled with the 🤗 Evaluate "
"This metric will be removed from the library soon, metrics should be handled with the Hugging Face Evaluate "
"library. You can have a look at this example script for pointers: "
"https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py"
)

View File

@ -28,7 +28,7 @@ from .utils import DataProcessor, InputExample, InputFeatures
logger = logging.get_logger(__name__)
DEPRECATION_WARNING = (
"This {0} will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
"This {0} will be removed from the library soon, preprocessing should be handled with the Hugging Face Datasets "
"library. You can have a look at this example script for pointers: "
"https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py"
)

View File

@ -47,8 +47,8 @@ deps = {
"psutil": "psutil",
"pyyaml": "pyyaml>=5.1",
"pydantic": "pydantic>=2",
"pytest": "pytest>=7.2.0",
"pytest-asyncio": "pytest-asyncio",
"pytest": "pytest>=7.2.0,<9.0.0",
"pytest-asyncio": "pytest-asyncio>=1.2.0",
"pytest-rerunfailures": "pytest-rerunfailures<16.0",
"pytest-timeout": "pytest-timeout",
"pytest-xdist": "pytest-xdist",

View File

@ -39,6 +39,7 @@ from .utils import (
is_torch_dtype,
logging,
requires_backends,
safe_load_json_file,
)
from .utils.hub import cached_file
@ -427,35 +428,42 @@ class FeatureExtractionMixin(PushToHubMixin):
feature_extractor_file = os.path.join(pretrained_model_name_or_path, FEATURE_EXTRACTOR_NAME)
if os.path.isfile(pretrained_model_name_or_path):
resolved_feature_extractor_file = pretrained_model_name_or_path
resolved_processor_file = None
is_local = True
elif is_remote_url(pretrained_model_name_or_path):
feature_extractor_file = pretrained_model_name_or_path
resolved_processor_file = None
resolved_feature_extractor_file = download_url(pretrained_model_name_or_path)
else:
feature_extractor_file = FEATURE_EXTRACTOR_NAME
try:
# Load from local folder or from cache or download from model Hub and cache
resolved_feature_extractor_files = [
resolved_file
for filename in [feature_extractor_file, PROCESSOR_NAME]
if (
resolved_file := cached_file(
pretrained_model_name_or_path,
filename=filename,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
local_files_only=local_files_only,
subfolder=subfolder,
token=token,
user_agent=user_agent,
revision=revision,
_raise_exceptions_for_missing_entries=False,
)
)
is not None
]
resolved_feature_extractor_file = resolved_feature_extractor_files[0]
resolved_processor_file = cached_file(
pretrained_model_name_or_path,
filename=PROCESSOR_NAME,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
local_files_only=local_files_only,
token=token,
user_agent=user_agent,
revision=revision,
subfolder=subfolder,
_raise_exceptions_for_missing_entries=False,
)
resolved_feature_extractor_file = cached_file(
pretrained_model_name_or_path,
filename=feature_extractor_file,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
local_files_only=local_files_only,
token=token,
user_agent=user_agent,
revision=revision,
subfolder=subfolder,
_raise_exceptions_for_missing_entries=False,
)
except OSError:
# Raise any environment error raise by `cached_file`. It will have a helpful error message adapted to
# the original exception.
@ -469,19 +477,24 @@ class FeatureExtractionMixin(PushToHubMixin):
f" directory containing a {FEATURE_EXTRACTOR_NAME} file"
)
try:
# Load feature_extractor dict
with open(resolved_feature_extractor_file, encoding="utf-8") as reader:
text = reader.read()
feature_extractor_dict = json.loads(text)
if "audio_processor" in feature_extractor_dict:
feature_extractor_dict = feature_extractor_dict["audio_processor"]
else:
feature_extractor_dict = feature_extractor_dict.get("feature_extractor", feature_extractor_dict)
# Load feature_extractor dict. Priority goes as (nested config if found -> image processor config)
# We are downloading both configs because almost all models have a `processor_config.json` but
# not all of these are nested. We need to check if it was saved recebtly as nested or if it is legacy style
feature_extractor_dict = None
if resolved_processor_file is not None:
processor_dict = safe_load_json_file(resolved_processor_file)
if "feature_extractor" in processor_dict or "audio_processor" in processor_dict:
feature_extractor_dict = processor_dict.get("feature_extractor", processor_dict.get("audio_processor"))
except json.JSONDecodeError:
if resolved_feature_extractor_file is not None and feature_extractor_dict is None:
feature_extractor_dict = safe_load_json_file(resolved_feature_extractor_file)
if feature_extractor_dict is None:
raise OSError(
f"It looks like the config file at '{resolved_feature_extractor_file}' is not a valid JSON file."
f"Can't load feature extractor for '{pretrained_model_name_or_path}'. If you were trying to load"
" it from 'https://huggingface.co/models', make sure you don't have a local directory with the"
f" same name. Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a"
f" directory containing a {feature_extractor_file} file"
)
if is_local:

View File

@ -189,7 +189,9 @@ class PagedAttentionCache:
num_blocks, max_batch_tokens = memory_handler.infer_num_blocks_and_max_batch_tokens(
num_blocks=getattr(generation_config, "num_blocks", None),
max_batch_tokens=getattr(generation_config, "max_batch_tokens", None),
max_memory_percent=getattr(generation_config, "max_memory", 0.9),
max_memory_percent=getattr(
generation_config, "max_memory", 0.8
), # FIXME: it seems we overcommit memory, was changed from 0.9 which caused OOMs in our benchmarking CI
cache_dtype=self.dtype,
)
@ -414,7 +416,7 @@ class PagedAttentionMemoryHandler:
self,
num_blocks: Optional[int] = None,
max_batch_tokens: Optional[int] = None,
max_memory_percent: float = 0.9,
max_memory_percent: float = 0.8, # FIXME: it seems we overcommit memory, was changed from 0.9 which caused OOMs in our benchmarking CI
cache_dtype: torch.dtype = torch.float16,
) -> tuple[int, int]:
"""Determine optimal number of blocks and maximum number of tokens per batch based on available memory and
@ -454,7 +456,7 @@ class PagedAttentionMemoryHandler:
def compute_num_blocks_and_max_batch_tokens(
self,
max_memory_percent: float = 0.9,
max_memory_percent: float,
cache_dtype: torch.dtype = torch.float16,
m: float = 0.01,
) -> tuple[int, int]:
@ -503,7 +505,7 @@ class PagedAttentionMemoryHandler:
def compute_max_batch_tokens(
self,
num_blocks: int,
max_memory_percent: float = 0.9,
max_memory_percent: float,
cache_dtype: torch.dtype = torch.float16,
) -> int:
"""Calculate maximum batch tokens M given a fixed number of cache blocks. The formula for M is given by:
@ -531,7 +533,7 @@ class PagedAttentionMemoryHandler:
def compute_num_blocks(
self,
max_batch_tokens: int,
max_memory_percent: float = 0.9,
max_memory_percent: float,
cache_dtype: torch.dtype = torch.float16,
) -> int:
"""Calculate number of cache blocks N given a fixed maximum token per token M. The formula for N is given by:

View File

@ -807,7 +807,7 @@ class ContinuousBatchingManager:
"""Check if the background generation thread is running."""
return self._generation_thread is not None and self._generation_thread.is_alive()
def stop(self, block: bool = False, timeout: Optional[float] = None) -> None:
def stop(self, block: bool = True, timeout: Optional[float] = None) -> None:
"""Signal the background thread to stop.
Args:
@ -818,14 +818,17 @@ class ContinuousBatchingManager:
logger.warning("Manager not started.")
return
stop_trigger_time = perf_counter()
if not self.stop_event.is_set():
self.stop_event.set()
logger.info("Stopping continuous batching manager...")
if block:
self.join(timeout)
self.join(stop_trigger_time, timeout)
def join(self, timeout: Optional[float] = None) -> None:
self.batch_processor = None
def join(self, stop_trigger_time: float, timeout: Optional[float] = None) -> None:
"""Wait for the background thread to finish.
Args:
@ -834,9 +837,10 @@ class ContinuousBatchingManager:
if self._generation_thread is not None:
self._generation_thread.join(timeout=timeout)
if self._generation_thread.is_alive():
logger.warning("Generation thread did not exit after join timeout.")
logger.warning(f"Generation thread did not exit after join timeout ({timeout}).")
else:
logger.info("Continuous Batching Manager stopped.")
end = perf_counter()
logger.info(f"Continuous Batching Manager stopped after {end - stop_trigger_time:.2f}s.")
self._generation_thread = None
def add_request(
@ -877,9 +881,11 @@ class ContinuousBatchingManager:
self.input_queue.put(state, block=True, timeout=10) # XXX: pass timeout as fn arg?
return request_id
def add_requests(self, inputs: list[list[int]], max_new_tokens: Optional[int] = None) -> None:
def add_requests(
self, inputs: list[list[int]], max_new_tokens: Optional[int] = None, streaming: bool = False
) -> None:
for input_ids in inputs:
self.add_request(input_ids, max_new_tokens=max_new_tokens)
self.add_request(input_ids, max_new_tokens=max_new_tokens, streaming=streaming)
def cancel_request(self, request_id: str) -> None:
"""Cancel a request by its ID.
@ -890,6 +896,7 @@ class ContinuousBatchingManager:
if self.batch_processor is not None:
self.batch_processor.scheduler.set_request_cancellation(request_id)
# TODO:handle benchmarking properly when updating / fixing the requeue logic
def get_result(
self, request_id: Optional[str] = None, timeout: Optional[float] = None
) -> Optional[GenerationOutput]:
@ -905,6 +912,7 @@ class ContinuousBatchingManager:
return None
try:
result = self.output_queue.get(block=True, timeout=timeout)
# NOTE: requeue logic here
if request_id is not None and result.request_id != request_id:
self.output_queue.put(result)
return None
@ -1092,6 +1100,7 @@ class ContinuousMixin:
num_kv_cuda_graphs=num_kv_cuda_graphs,
)
# TODO: support streaming
@traced
@torch.inference_mode()
def generate_batch(
@ -1148,7 +1157,7 @@ class ContinuousMixin:
result = manager.get_result(timeout=1)
if result:
req_id = result.request_id
if result.status == RequestStatus.FINISHED:
if result.is_finished():
results[req_id] = result
finished_count += 1
pbar.update(1)

View File

@ -19,6 +19,7 @@ from typing import Optional
import torch
from ...utils import is_torch_xpu_available
from ...utils.logging import logging
from ...utils.metrics import traced
@ -35,6 +36,13 @@ def get_device_and_memory_breakdown() -> tuple[torch.device, int, int, int]:
total_memory = torch.cuda.get_device_properties(device).total_memory
reserved_memory = torch.cuda.memory_reserved(device)
allocated_memory = torch.cuda.memory_allocated(device)
elif is_torch_xpu_available():
device = torch.device("xpu")
torch.xpu.empty_cache()
torch.xpu.synchronize()
total_memory = torch.xpu.get_device_properties(device).total_memory
reserved_memory = torch.xpu.memory_reserved(device)
allocated_memory = torch.xpu.memory_allocated(device)
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
device = torch.device("mps")
# MPS memory reporting (PyTorch 2.0+)
@ -83,6 +91,9 @@ class GenerationOutput:
status: RequestStatus = RequestStatus.PENDING
created_time: float = field(default_factory=time.time)
def is_finished(self) -> bool:
return self.status == RequestStatus.FINISHED
@dataclass
class RequestState:

View File

@ -411,7 +411,7 @@ class GenerationMixin(ContinuousMixin):
"Generation config file not found, using a generation config created from the model config."
)
# Load custom generate function if `pretrained_model_name_or_path` defines it (and override `generate`)
if hasattr(self, "load_custom_generate"):
if hasattr(self, "load_custom_generate") and trust_remote_code:
try:
custom_generate = self.load_custom_generate(
pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **repo_loading_kwargs
@ -608,7 +608,7 @@ class GenerationMixin(ContinuousMixin):
use_cache = kwargs.get("use_cache")
if use_cache is None:
use_cache = getattr(self.config, "use_cache", False)
if past_key_values is None or use_cache:
if past_key_values is not None or use_cache:
# TODO (joao): handle the case where cache length == input_ids length. The function below results in an
# exception because we get empty input_ids after slicing. In essence, we need to roll back the cache 1
# token to recompute the logits for the first token to be generated (but not all caches support roll backs)
@ -1635,7 +1635,12 @@ class GenerationMixin(ContinuousMixin):
# TransformersKwargs are model-agnostic attention and generation arguments such as 'output_attentions'
for key, value in model_kwargs.items():
if value is not None and key not in model_args and key not in TransformersKwargs.__optional_keys__:
if (
value is not None
and key not in model_args
and key not in TransformersKwargs.__optional_keys__
and key != "debug_io"
):
unused_model_args.append(key)
if unused_model_args:
@ -2170,7 +2175,7 @@ class GenerationMixin(ContinuousMixin):
return False
# Base logic
valid_hardware = self.device.type == "cuda" or bool(
valid_hardware = self.device.type in ["cuda", "xpu"] or bool(
generation_config.compile_config is not None and generation_config.compile_config._compile_all_devices
)
using_compilable_cache = (

View File

@ -383,10 +383,11 @@ class BayesianDetectorModel(PreTrainedModel):
)
self.prior = torch.nn.Parameter(torch.tensor([self.base_rate]))
@torch.no_grad()
def _init_weights(self, module):
"""Initialize the weights."""
if isinstance(module, nn.Parameter):
module.weight.data.normal_(mean=0.0, std=0.02)
module.weight.normal_(mean=0.0, std=0.02)
def _compute_posterior(
self,

View File

@ -32,6 +32,7 @@ from .utils import (
is_offline_mode,
is_remote_url,
logging,
safe_load_json_file,
)
from .utils.hub import cached_file
@ -280,35 +281,41 @@ class ImageProcessingMixin(PushToHubMixin):
image_processor_file = os.path.join(pretrained_model_name_or_path, image_processor_filename)
if os.path.isfile(pretrained_model_name_or_path):
resolved_image_processor_file = pretrained_model_name_or_path
resolved_processor_file = None
is_local = True
elif is_remote_url(pretrained_model_name_or_path):
image_processor_file = pretrained_model_name_or_path
resolved_processor_file = None
resolved_image_processor_file = download_url(pretrained_model_name_or_path)
else:
image_processor_file = image_processor_filename
try:
# Load from local folder or from cache or download from model Hub and cache
resolved_image_processor_files = [
resolved_file
for filename in [image_processor_file, PROCESSOR_NAME]
if (
resolved_file := cached_file(
pretrained_model_name_or_path,
filename=filename,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
local_files_only=local_files_only,
token=token,
user_agent=user_agent,
revision=revision,
subfolder=subfolder,
_raise_exceptions_for_missing_entries=False,
)
)
is not None
]
resolved_image_processor_file = resolved_image_processor_files[0]
resolved_processor_file = cached_file(
pretrained_model_name_or_path,
filename=PROCESSOR_NAME,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
local_files_only=local_files_only,
token=token,
user_agent=user_agent,
revision=revision,
subfolder=subfolder,
_raise_exceptions_for_missing_entries=False,
)
resolved_image_processor_file = cached_file(
pretrained_model_name_or_path,
filename=image_processor_file,
cache_dir=cache_dir,
force_download=force_download,
proxies=proxies,
local_files_only=local_files_only,
token=token,
user_agent=user_agent,
revision=revision,
subfolder=subfolder,
_raise_exceptions_for_missing_entries=False,
)
except OSError:
# Raise any environment error raise by `cached_file`. It will have a helpful error message adapted to
# the original exception.
@ -322,16 +329,24 @@ class ImageProcessingMixin(PushToHubMixin):
f" directory containing a {image_processor_filename} file"
)
try:
# Load image_processor dict
with open(resolved_image_processor_file, encoding="utf-8") as reader:
text = reader.read()
image_processor_dict = json.loads(text)
image_processor_dict = image_processor_dict.get("image_processor", image_processor_dict)
# Load image_processor dict. Priority goes as (nested config if found -> image processor config)
# We are downloading both configs because almost all models have a `processor_config.json` but
# not all of these are nested. We need to check if it was saved recebtly as nested or if it is legacy style
image_processor_dict = None
if resolved_processor_file is not None:
processor_dict = safe_load_json_file(resolved_processor_file)
if "image_processor" in processor_dict:
image_processor_dict = processor_dict["image_processor"]
except json.JSONDecodeError:
if resolved_image_processor_file is not None and image_processor_dict is None:
image_processor_dict = safe_load_json_file(resolved_image_processor_file)
if image_processor_dict is None:
raise OSError(
f"It looks like the config file at '{resolved_image_processor_file}' is not a valid JSON file."
f"Can't load image processor for '{pretrained_model_name_or_path}'. If you were trying to load"
" it from 'https://huggingface.co/models', make sure you don't have a local directory with the"
f" same name. Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a"
f" directory containing a {image_processor_filename} file"
)
if is_local:

View File

@ -821,14 +821,26 @@ def split_to_tiles(images: "torch.Tensor", num_tiles_height: int, num_tiles_widt
return image
def _cast_tensor_to_float(x):
if x.is_floating_point():
return x
return x.float()
def _group_images_by_shape(nested_images, *paired_inputs, is_nested: bool = False):
"""Helper function to flatten a single level of nested image and batch structures and group by shape."""
"""
Helper function to flatten a single level of nested image and batch structures and group by shape.
Args:
nested_images (list):
A list of images or a single tensor
paired_inputs (Any, *optional*):
Zero or more lists that mirror the structure of `nested_images` (flat list, or list of lists when
`is_nested=True`). Each element is paired 1:1 with the corresponding image so it can be grouped by the
same shape key. These paired values are grouped alongside `nested_images` but are not stacked in the output, so
they do not need to be tensors.
is_nested (bool, *optional*, defaults to False):
Whether the images are nested.
Returns:
tuple[dict, ...]:
- A dictionary with shape as key and list of images with that shape as value
- A dictionary with shape as key and list of paired values with that shape as value
- A dictionary mapping original indices to (shape, index) tuples
- A dictionary mapping original indices to (shape, index) tuples for each paired input
"""
grouped_images = defaultdict(list)
grouped_images_index = {}
paired_grouped_values = [defaultdict(list) for _ in paired_inputs]
@ -880,27 +892,20 @@ def _reconstruct_nested_structure(indices, processed_images):
return result
def _disable_grouping_output_nested(images, *paired_inputs):
"""Build the disable_grouping output tuple for a single-level nested structure."""
outer_range = range(len(images))
inner_ranges = [range(len(images[i])) for i in outer_range]
def _iterate_items(items, is_nested: bool):
"""
Helper function to iterate over items yielding (key, item) pairs.
# Precompute all (i, j) pairs
ij_pairs = [(i, j) for i in outer_range for j in inner_ranges[i]]
images_dict = {(i, j): images[i][j].unsqueeze(0) for (i, j) in ij_pairs}
paired_dicts = [{(i, j): paired_list[i][j].unsqueeze(0) for (i, j) in ij_pairs} for paired_list in paired_inputs]
index_map = {(i, j): ((i, j), 0) for (i, j) in ij_pairs}
return images_dict, *paired_dicts, index_map
def _disable_grouping_output_flat(images, *paired_inputs):
"""Build the disable_grouping output tuple for a flat list structure."""
idx_range = range(len(images))
images_dict = {i: images[i].unsqueeze(0) for i in idx_range}
paired_dicts = [{i: paired_list[i].unsqueeze(0) for i in idx_range} for paired_list in paired_inputs]
index_map = {i: (i, 0) for i in idx_range}
return images_dict, *paired_dicts, index_map
For nested structures, yields ((row_index, col_index), item).
For flat structures, yields (index, item).
"""
if is_nested:
for i, row in enumerate(items):
for j, item in enumerate(row):
yield (i, j), item
else:
for i, item in enumerate(items):
yield i, item
def group_images_by_shape(
@ -920,7 +925,7 @@ def group_images_by_shape(
Args:
images (Union[list["torch.Tensor"], "torch.Tensor"]):
A list of images or a single tensor
*paired_inputs (Any):
paired_inputs (Any, *optional*):
Zero or more lists that mirror the structure of `images` (flat list, or list of lists when
`is_nested=True`). Each element is paired 1:1 with the corresponding image so it can be grouped by the
same shape key. These paired values are grouped alongside `images` but are not stacked in the output, so
@ -944,10 +949,14 @@ def group_images_by_shape(
disable_grouping = device == "cpu"
if disable_grouping:
if is_nested:
return _disable_grouping_output_nested(images, *paired_inputs)
else:
return _disable_grouping_output_flat(images, *paired_inputs)
return (
{key: img.unsqueeze(0) for key, img in _iterate_items(images, is_nested)},
*[
{key: item.unsqueeze(0) for key, item in _iterate_items(paired_list, is_nested)}
for paired_list in paired_inputs
],
{key: (key, 0) for key, _ in _iterate_items(images, is_nested)},
)
# Handle single level nested structure
grouped_images, *paired_grouped_values, grouped_images_index = _group_images_by_shape(
@ -990,14 +999,3 @@ def reorder_images(
]
return _reconstruct_nested_structure(grouped_images_index, processed_images)
class NumpyToTensor:
"""
Convert a numpy array to a PyTorch tensor.
"""
def __call__(self, image: np.ndarray):
# Same as in PyTorch, we assume incoming numpy images are in HWC format
# c.f. https://github.com/pytorch/vision/blob/61d97f41bc209e1407dcfbd685d2ee2da9c1cdad/torchvision/transforms/functional.py#L154
return torch.from_numpy(image.transpose(2, 0, 1)).contiguous()

View File

@ -512,10 +512,8 @@ def accelerate_disk_offload(
checkpoint_files,
device_map,
checkpoint_keys,
key_renaming_mapping,
sharded_metadata,
dtype,
reverse_key_renaming_mapping,
):
disk_only_shard_files = []
if disk_offload_folder is not None:
@ -534,19 +532,13 @@ def accelerate_disk_offload(
weight_map = dict.fromkeys(checkpoint_keys, checkpoint_files[0])
else:
folder = os.path.sep.join(checkpoint_files[0].split(os.path.sep)[:-1])
# Fix the weight map keys according to the key mapping
weight_map = {
key_renaming_mapping[k]: v
for k, v in sharded_metadata["weight_map"].items()
if k in key_renaming_mapping
}
weight_map = {k: os.path.join(folder, v) for k, v in weight_map.items()}
# Find potential checkpoints containing only offloaded weights
disk_only_shard_files = get_disk_only_shard_files(device_map, weight_map)
disk_offload_index = {
name: {
"safetensors_file": file,
"weight_name": reverse_key_renaming_mapping[name],
"weight_name": name,
"dtype": str_dtype,
}
for name, file in weight_map.items()

View File

@ -1,5 +1,4 @@
import inspect
from copy import deepcopy
from inspect import signature
from ..utils import (
@ -24,7 +23,6 @@ if is_accelerate_available():
import accelerate
from accelerate import init_empty_weights
from accelerate.hooks import add_hook_to_module, remove_hook_from_module
from accelerate.utils import find_tied_parameters
logger = logging.get_logger(__name__)
@ -151,52 +149,6 @@ def replace_with_bnb_linear(model, modules_to_not_convert=None, current_key_name
return model
def get_keys_to_not_convert(model):
r"""
An utility function to get the key of the module to keep in full precision if any For example for CausalLM modules
we may want to keep the lm_head in full precision for numerical stability reasons. For other architectures, we want
to keep the tied weights of the model. The function will return a list of the keys of the modules to not convert in
int8.
Parameters:
model (`torch.nn.Module`):
Input model
"""
# Create a copy of the model and tie the weights, then
# check if it contains tied weights
tied_model = deepcopy(model) # this has 0 cost since it is done inside `init_empty_weights` context manager`
tied_model.tie_weights()
tied_params = find_tied_parameters(tied_model)
tied_keys = sum(tied_params, [])
has_tied_params = len(tied_keys) > 0
# If there is not tied weights, we want to keep the lm_headoutput_embedding) in full precision
if not has_tied_params:
output_emb = model.get_output_embeddings()
if output_emb is not None:
list_last_module = [name for name, module in model.named_modules() if id(module) == id(output_emb)]
return list_last_module
# otherwise, no tied weights, no output embedding defined, simply keep the last module in full precision
list_modules = list(model.named_parameters())
list_last_module = [list_modules[-1][0]]
# add last module together with tied weights
intersection = set(list_last_module) - set(tied_keys)
list_untouched = list(set(tied_keys)) + list(intersection)
# remove ".weight" from the keys
names_to_remove = [".weight", ".bias"]
filtered_module_names = []
for name in list_untouched:
for name_to_remove in names_to_remove:
if name_to_remove in name:
name = name.replace(name_to_remove, "")
filtered_module_names.append(name)
return filtered_module_names
# Copied from PEFT: https://github.com/huggingface/peft/blob/47b3712898539569c02ec5b3ed4a6c36811331a1/src/peft/utils/integrations.py#L41
def dequantize_bnb_weight(weight: "torch.nn.Parameter", dtype: "torch.dtype", state=None):
"""

View File

@ -11,7 +11,6 @@
# specific language governing permissions and limitations under the License.
import logging
from collections.abc import Callable
from typing import Optional
import torch
@ -24,13 +23,7 @@ from ..cache_utils import (
StaticCache,
)
from ..generation.configuration_utils import GenerationConfig
from ..masking_utils import (
ALL_MASK_ATTENTION_FUNCTIONS,
_ignore_causal_mask_sdpa,
_is_torch_greater_or_equal_than_2_5,
prepare_padding_mask,
)
from ..modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
from ..modeling_utils import PreTrainedModel
from ..pytorch_utils import (
is_torch_greater_or_equal,
is_torch_greater_or_equal_than_2_3,
@ -229,10 +222,6 @@ class TorchExportableModuleForDecoderOnlyLM(torch.nn.Module):
"Using `StaticCache` for export as `layer_types` is not specified or `sliding_window` is `null` in the config."
)
self.model = TorchExportableModuleWithStaticCache(model, batch_size, max_cache_len, device)
# This is the same as sdpa, but mask creation does not use `vmap` which is not exportable
ALL_MASK_ATTENTION_FUNCTIONS.register("sdpa_without_vmap", sdpa_mask_without_vmap)
ALL_ATTENTION_FUNCTIONS.register("sdpa_without_vmap", ALL_ATTENTION_FUNCTIONS["sdpa"])
self.model.model.config._attn_implementation = "sdpa_without_vmap"
def forward(
self,
@ -768,11 +757,6 @@ def convert_and_export_with_cache(
import torch.export._trace
# This is the same as sdpa, but mask creation does not use `vmap` which is not exportable
ALL_MASK_ATTENTION_FUNCTIONS.register("sdpa_without_vmap", sdpa_mask_without_vmap)
ALL_ATTENTION_FUNCTIONS.register("sdpa_without_vmap", ALL_ATTENTION_FUNCTIONS["sdpa"])
model.config._attn_implementation = "sdpa_without_vmap"
with torch.no_grad():
# TODO: The default inputs only work for text models. We need to add support for vision/audio models.
example_input_ids = (
@ -1036,11 +1020,6 @@ def export_with_dynamic_cache(
if not is_torch_greater_or_equal_than_2_3:
raise ImportError("torch >= 2.3 is required.")
# This is the same as sdpa, but mask creation does not use `vmap` which is not exportable
ALL_MASK_ATTENTION_FUNCTIONS.register("sdpa_without_vmap", sdpa_mask_without_vmap)
ALL_ATTENTION_FUNCTIONS.register("sdpa_without_vmap", ALL_ATTENTION_FUNCTIONS["sdpa"])
model.config._attn_implementation = "sdpa_without_vmap"
register_dynamic_cache_export_support()
with torch.no_grad():
@ -1109,92 +1088,3 @@ def _unflatten_dynamic_cache(values, context: torch.utils._pytree.Context):
value = value_list[idx] if idx < len(value_list) else None
cache.update(key, value, idx)
return cache
def sdpa_mask_without_vmap(
batch_size: int,
cache_position: torch.Tensor,
kv_length: int,
kv_offset: int = 0,
mask_function: Optional[Callable] = None,
attention_mask: Optional[torch.Tensor] = None,
local_size: Optional[int] = None,
allow_is_causal_skip: bool = True,
allow_torch_fix: bool = True,
**kwargs,
) -> Optional[torch.Tensor]:
"""
Create a 4D boolean mask of shape `(batch_size, 1, query_length, kv_length)` where a value of True indicates that
the element should take part in the attention computation, and False that it should not.
This is similar to `masking_utils.sdpa_mask` but does not use `vmap` which is incompatible with export.
Args:
batch_size (`int`):
The batch size of the input sequence.
cache_position (`torch.Tensor`):
A tensor of shape (query_length,) indicating the current indices of the input sequence elements.
kv_length (`int`):
The size that the key and value states will have during the attention computation.
kv_offset (`int`, optional):
An optional offset to indicate at which first position the key and values states will refer to.
mask_function (`Callable`):
The mask factory function describing the mask pattern.
attention_mask (`torch.Tensor`, optional):
The 2D attention mask corresponding to padded tokens of shape (batch_size, number_of_seen_tokens+q_length)
local_size (`int`, optional):
The size of the local attention, if we do not use full attention. This is used only if `allow_is_causal_skip=True`
to try to skip mask creation if possible.
allow_is_causal_skip (`bool`, optional):
Whether to allow to return `None` for the mask under conditions where we can use the `is_causal` argument in
`torch.sdpa` instead. Default to `True`.
allow_torch_fix (`bool`, optional):
Whether to update the mask in case a query is not attending to any tokens, to solve a bug in torch's older
versions. We need an arg to skip it when using eager. By default `True`.
"""
q_length = cache_position.shape[0]
# Potentially pad the 2D mask, and slice it correctly
padding_mask = prepare_padding_mask(attention_mask, kv_length, kv_offset)
# Under specific conditions, we can avoid materializing the mask, instead relying on the `is_causal` argument
if allow_is_causal_skip and _ignore_causal_mask_sdpa(padding_mask, q_length, kv_length, local_size):
return None
# Similar to `kv_arange = torch.arange(start=kv_offset, end=kv_offset + kv_length, device=cache_position.device)`
# but without data-dependent slicing (i.e. torch.compile friendly)
kv_arange = torch.arange(kv_length, device=cache_position.device)
kv_arange += kv_offset
reshaped_cache_position = cache_position.view(-1, 1)
# This is a bit hacky to know what pattern we are using, but all mask creation function actually forward
# the config through kwargs anyway, so it allows to rely on it
# Usually, the `mask_function` is the only entry-point to define the pattern - we could do for loops over it,
# but this is more efficient
sliding_window = getattr(kwargs["config"], "sliding_window", None)
chunk_size = getattr(kwargs["config"], "attention_chunk_size", None)
if sliding_window is not None and chunk_size is not None:
raise ValueError("Cannot use both `sliding_window` and `attention_chunk_size`")
# Simplest and most efficient way to obtain a causal mask
causal_mask = kv_arange <= reshaped_cache_position
# If using sliding window, add the sliding mask
if sliding_window is not None:
sliding_mask_overlay = kv_arange > reshaped_cache_position - sliding_window
causal_mask *= sliding_mask_overlay
# If using chunk attention, add the chunked mask
elif chunk_size is not None:
chunked_mask_overlay = kv_arange // chunk_size == reshaped_cache_position // chunk_size
causal_mask *= chunked_mask_overlay
causal_mask = causal_mask[None, None, :, :].expand(batch_size, -1, -1, -1)
if padding_mask is not None:
causal_mask = causal_mask * padding_mask[:, None, None, :]
# Due to a bug in some older torch version, we need to update the mask in case a query is not attending to any
# tokens (due to padding). See details in https://github.com/pytorch/pytorch/issues/110213
if not _is_torch_greater_or_equal_than_2_5 and allow_torch_fix:
causal_mask |= torch.all(~causal_mask, dim=-1, keepdim=True)
return causal_mask

View File

@ -13,8 +13,11 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional
import re
from collections.abc import Sequence
from typing import Any, Optional, Union
from ..core_model_loading import ConversionOps
from ..utils import is_accelerate_available, is_torch_accelerator_available, is_torch_available, logging
@ -30,6 +33,18 @@ if is_accelerate_available():
logger = logging.get_logger(__name__)
try:
_FP8_DTYPE = torch.float8_e4m3fn
_FP8_MIN = torch.finfo(_FP8_DTYPE).min
_FP8_MAX = torch.finfo(_FP8_DTYPE).max
_FP8_IS_INT = False
except AttributeError:
_FP8_DTYPE = torch.int8
_FP8_MIN, _FP8_MAX = -127, 127
_FP8_IS_INT = True
logger.warning_once(
"torch.float8_e4m3fn not available; falling back to int8 emulation for Fp8Quantize operations."
)
# Copied from https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/inference/kernel.py
@ -332,6 +347,12 @@ class FP8Linear(nn.Linear):
if self.weight.element_size() > 1:
return F.linear(input, self.weight, self.bias)
else:
if isinstance(self.weight, torch.distributed.tensor.DTensor):
weight = self.weight._local_tensor.contiguous()
scale_inv = self.weight_scale_inv._local_tensor.contiguous()
else:
weight = self.weight.contiguous()
scale_inv = self.weight_scale_inv.contiguous()
# Context manager used to switch among the available accelerators
device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
torch_accelerator_module = getattr(torch, device_type, torch.cuda)
@ -339,9 +360,9 @@ class FP8Linear(nn.Linear):
qinput, scale = act_quant(input, self.block_size[1])
output = w8a8_block_fp8_matmul_triton(
qinput,
self.weight,
weight,
scale,
self.weight_scale_inv,
scale_inv,
self.block_size,
output_dtype=input.dtype,
)
@ -350,9 +371,124 @@ class FP8Linear(nn.Linear):
torch_accelerator_module.synchronize()
if self.bias is not None:
output = output + self.bias
output = torch.nan_to_num(output, nan=0.0)
return output.to(dtype=input.dtype)
def _ceil_div(a, b):
return (a + b - 1) // b
class FP8Expert(nn.Module):
dtype = torch.float8_e4m3fn
def __init__(self, config, block_size, device):
super().__init__()
from ..activations import ACT2FN
self.block_size = block_size
self.num_experts = config.num_local_experts
self.hidden_dim = config.hidden_size
self.intermediate_dim = config.intermediate_size
Wg_out, Wg_in = 2 * self.intermediate_dim, self.hidden_dim
Wd_out, Wd_in = self.hidden_dim, self.intermediate_dim
self.gate_up_proj = nn.Parameter(
torch.zeros(self.num_experts, Wg_out, Wg_in, dtype=FP8Expert.dtype, device=device)
)
self.down_proj = nn.Parameter(
torch.zeros(self.num_experts, Wd_out, Wd_in, dtype=FP8Expert.dtype, device=device)
)
# Create inverse scale tiles only when using 1-byte types (fp8)
if self.gate_up_proj.element_size() == 1:
bo, bi = self.block_size
# gate_up tiles: ceil(Wg_out/bo) x ceil(Wg_in/bi)
gu_scale_o = _ceil_div(Wg_out, bo)
gu_scale_i = _ceil_div(Wg_in, bi)
self.gate_up_proj_scales_inv = nn.Parameter(
torch.zeros(self.num_experts, gu_scale_o, gu_scale_i, dtype=torch.float32, device=device)
)
# down tiles: ceil(Wd_out/bo) x ceil(Wd_in/bi)
dp_scale_o = _ceil_div(Wd_out, bo)
dp_scale_i = _ceil_div(Wd_in, bi)
self.down_proj_scales_inv = nn.Parameter(
torch.zeros(self.num_experts, dp_scale_o, dp_scale_i, dtype=torch.float32, device=device)
)
else:
# Match FP8Linear behavior when not using 1-byte weights
self.register_parameter("gate_up_proj_scale_inv", None)
self.register_parameter("down_proj_scale_inv", None)
# (Optional) bias per projection — many MoEs omit bias; keep None to match your FP8Linear default
self.register_parameter("gate_up_bias", None)
self.register_parameter("down_bias", None)
# Activation used in the MLP (same as your config / ACT2FN)
# Keep a handle here; actual usage happens in forward of your MoE block
self.act_fn = ACT2FN[config.hidden_act]
def forward(
self,
hidden_states: torch.Tensor,
top_k_index: torch.Tensor,
top_k_weights: torch.Tensor,
) -> torch.Tensor:
final_hidden_states = torch.zeros_like(hidden_states)
num_experts = top_k_weights.shape[1]
with torch.no_grad():
expert_mask = torch.nn.functional.one_hot(top_k_index, num_classes=num_experts + 1)
expert_mask = expert_mask.permute(2, 1, 0)
expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()
for expert_idx in expert_hit:
expert_idx = expert_idx[0]
if expert_idx == num_experts:
continue
_, token_idx = torch.where(expert_mask[expert_idx])
current_state = hidden_states.index_select(0, token_idx)
gate, up = self.linear(
current_state, self.gate_up_proj[expert_idx], self.gate_up_proj_scales_inv[expert_idx]
).chunk(2, dim=-1)
current_hidden_states = self.act_fn(gate) * up
current_hidden_states = self.linear(
current_hidden_states, self.down_proj[expert_idx], self.down_proj_scales_inv[expert_idx]
)
routing_weights = top_k_weights[token_idx, expert_idx].unsqueeze(-1)
current_hidden_states = current_hidden_states * routing_weights.to(current_hidden_states.dtype)
final_hidden_states.index_add_(0, token_idx, current_hidden_states.to(final_hidden_states.dtype))
return final_hidden_states
def linear(self, input: torch.Tensor, weight: torch.Tensor, weight_scale_inv: torch.Tensor) -> torch.Tensor:
if weight.element_size() > 1:
return F.linear(input, weight, None)
else:
# Context manager used to switch among the available accelerators
device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
torch_accelerator_module = getattr(torch, device_type, torch.cuda)
with torch_accelerator_module.device(input.device):
qinput, scale = act_quant(input, self.block_size[1])
output = w8a8_block_fp8_matmul_triton(
qinput,
weight,
scale,
weight_scale_inv,
self.block_size,
output_dtype=input.dtype,
)
# Blocks the CPU until all accelerator operations on the specified device are complete. It is used to ensure that the results of the
# preceding operations are ready before proceeding
torch_accelerator_module.synchronize()
return output.to(dtype=input.dtype)
# TODO: we do need this.... but not recursive...
def _replace_with_fp8_linear(
model,
tp_plan=None,
@ -361,40 +497,48 @@ def _replace_with_fp8_linear(
quantization_config=None,
has_been_replaced=False,
):
"""Replace Linear layers with FP8Linear."""
if current_key_name is None:
current_key_name = []
iterator = list(model.named_parameters()).copy()
for name, empty_tensor in iterator:
current_key_name = name
name = name.rsplit(".", 1)[0] if "." in name else name
module = model.get_submodule(name)
for name, module in model.named_children():
current_key_name.append(name)
if isinstance(module, nn.Linear) and name not in (modules_to_not_convert or []):
current_key_name_str = ".".join(current_key_name)
if not any(key in current_key_name_str for key in (modules_to_not_convert or [])):
with init_empty_weights():
model._modules[name] = FP8Linear(
in_features=module.in_features,
out_features=module.out_features,
bias=module.bias is not None,
device=module.weight.device,
dtype=module.weight.dtype,
activation_scheme=quantization_config.activation_scheme,
block_size=quantization_config.weight_block_size,
current_key_name_str = re.sub(r"\d+", "*", current_key_name)
if not any(key in current_key_name_str for key in (modules_to_not_convert or [])):
with init_empty_weights():
if (
"gate_up_proj" in current_key_name
or "down_proj" in current_key_name
and "experts" in current_key_name
): # Experts!
in_features = empty_tensor.size(-2)
out_features = empty_tensor.size(-1)
model.set_submodule(
name,
FP8Expert(
config=model.config,
block_size=quantization_config.weight_block_size,
device=empty_tensor.device,
),
)
has_been_replaced = True
# when changing a layer the TP PLAN for that layer should be updated. TODO
if len(list(module.children())) > 0:
_, has_been_replaced = _replace_with_fp8_linear(
module,
tp_plan,
modules_to_not_convert,
current_key_name,
quantization_config,
has_been_replaced=has_been_replaced,
)
current_key_name.pop(-1)
elif isinstance(module, nn.Linear):
in_features = module.in_features
out_features = module.out_features
model.set_submodule(
name,
FP8Linear(
in_features=in_features,
out_features=out_features,
bias=module.bias is not None,
device=module.weight.device,
dtype=module.weight.dtype,
activation_scheme=quantization_config.activation_scheme,
block_size=quantization_config.weight_block_size,
),
)
has_been_replaced = True
# when changing a layer the TP PLAN for that layer should be updated. TODO
return model, has_been_replaced
@ -405,7 +549,7 @@ def replace_with_fp8_linear(
quantization_config=None,
):
"""Helper function to replace model layers with FP8 versions."""
modules_to_not_convert = ["lm_head"] if modules_to_not_convert is None else modules_to_not_convert
modules_to_not_convert += ["lm_head"]
if quantization_config.modules_to_not_convert is not None:
modules_to_not_convert.extend(quantization_config.modules_to_not_convert)
@ -424,3 +568,133 @@ def replace_with_fp8_linear(
)
return model
class QuantizationOp(ConversionOps):
"""Base class for quantization operations."""
pass
class Fp8Quantize(QuantizationOp):
"""
A quantization operation that creates two tensors, weight and scale out of a weight.
"""
reverse_op: type[ConversionOps]
def __init__(self, block_size: Optional[tuple[int, int]] = None):
self.block_size = block_size
self.reverse_op = Fp8Dequantize
def convert(self, input_dict: torch.Tensor, *, quant_config: dict[str, Any]) -> dict[str, torch.Tensor]:
# Unpack single key/value (value may be wrapped in a list)
target_keys, value = tuple(input_dict.items())[0]
value = value[0] if isinstance(value, list) else value
# Resolve block size (support dict-like or attr-like quant_config)
block_size = None
if quant_config is not None:
if isinstance(quant_config, dict):
block_size = quant_config.get("weight_block_size")
else:
block_size = getattr(quant_config, "weight_block_size", None)
if block_size is None:
block_size = (value.shape[-2], value.shape[-1])
block_m, block_n = block_size
rows, cols = value.shape[-2], value.shape[-1]
# Enforce exact tiling like your original
if rows % block_m != 0 or cols % block_n != 0:
raise ValueError(
f"Matrix dimensions ({rows}, {cols}) must be divisible by block sizes ({block_m}, {block_n}). for {target_keys}"
)
# Leading dims can be empty (2D) or include num_experts/... (3D+)
leading_shape = value.shape[:-2]
rows_tiles = rows // block_m
cols_tiles = cols // block_n
original_shape = value.shape
value_fp32 = value.to(torch.float32)
# Reshape to (..., rows_tiles, block_m, cols_tiles, block_n)
reshaped = value_fp32.reshape(*leading_shape, rows_tiles, block_m, cols_tiles, block_n)
# Per-tile max-abs over the block dims
# dims: block_m is at -3, block_n is at -1 after the reshape
max_abs = reshaped.abs().amax(dim=(-3, -1))
safe_max_abs = torch.where(max_abs > 0, max_abs, torch.ones_like(max_abs))
# Tile scale (we store inverse scale like your Linear: weight_scale_inv)
scales = _FP8_MAX / safe_max_abs
scales = torch.where(max_abs > 0, scales, torch.ones_like(scales)) # keep zeros stable
# Broadcast scales back over the block dims and quantize
# max_abs/scales shape: (..., rows_tiles, cols_tiles)
scales_broadcast = scales.unsqueeze(-1).unsqueeze(-3) # -> (..., rows_tiles, 1, cols_tiles, 1)
scaled = reshaped * scales_broadcast
if _FP8_IS_INT:
quantized = torch.clamp(scaled.round(), min=_FP8_MIN, max=_FP8_MAX).to(_FP8_DTYPE)
else:
quantized = torch.clamp(scaled, min=_FP8_MIN, max=_FP8_MAX).to(_FP8_DTYPE)
quantized = quantized.reshape(original_shape)
inv_scales = (1.0 / scales).to(torch.float32) # shape: (*leading, rows_tiles, cols_tiles)
if target_keys.endswith("weight"):
scale_key = target_keys.rsplit(".", 1)[0] + ".weight_scale_inv"
else:
scale_key = target_keys + "_scales_inv"
# Return both quantized weights and per-tile inverse scales (keeps leading dims, e.g., num_experts)
return {
target_keys: quantized,
scale_key: inv_scales,
}
class Fp8Dequantize(QuantizationOp):
"""Inverse operation of :class:`Fp8Quantize`. Takes a pair (weight, scale) and reconstructs the fp32 tensor."""
def __init__(self, block_size: Optional[tuple[int, int]] = None):
self.block_size = block_size
self.reverse_op = Fp8Quantize
def convert(
self,
value: Union[Sequence[torch.Tensor], dict[str, torch.Tensor]],
*,
context: dict[str, Any],
) -> torch.Tensor:
if isinstance(value, dict):
tensors = list(value.values())
else:
tensors = list(value) if isinstance(value, Sequence) else [value]
if len(tensors) != 2:
raise ValueError("Fp8Dequantize expects exactly two tensors: quantized weights and scales.")
quantized, scales = tensors
if not isinstance(quantized, torch.Tensor) or not isinstance(scales, torch.Tensor):
raise TypeError("Fp8Dequantize expects tensors as inputs.")
quantized_fp32 = quantized.to(torch.float32)
rows, cols = quantized_fp32.shape[-2:]
block_size = self.block_size
if block_size is None:
quant_config = context.get("quantization_config")
block_size = getattr(quant_config, "weight_block_size", None)
if block_size is None:
block_size = (rows, cols)
block_m, block_n = block_size
if rows % block_m != 0 or cols % block_n != 0:
raise ValueError(
f"Matrix dimensions ({rows}, {cols}) must be divisible by block sizes ({block_m}, {block_n})."
)
reshaped = quantized_fp32.reshape(-1, rows // block_m, block_m, cols // block_n, block_n)
expanded_scales = scales.to(torch.float32).reshape(-1, rows // block_m, cols // block_n)
expanded_scales = expanded_scales.unsqueeze(-1).unsqueeze(2)
dequantized = reshaped * expanded_scales
return dequantized.reshape(quantized_fp32.shape)

View File

@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
from collections.abc import Callable
from functools import partial
@ -18,7 +19,7 @@ from types import ModuleType
from typing import Optional, Union
from ..modeling_flash_attention_utils import lazy_import_flash_attention
from ..utils import logging
from ..utils import ENV_VARS_TRUE_VALUES, logging
from ..utils.import_utils import is_kernels_available
from .flash_attention import flash_attention_forward
@ -33,10 +34,22 @@ try:
get_kernel,
register_kernel_mapping,
replace_kernel_forward_from_hub,
use_kernel_forward_from_hub,
)
_TRANSFORMERS_USE_HUB_KERNELS = os.environ.get("USE_HUB_KERNELS", "YES").upper()
_kernels_available = True
_kernels_enabled = _TRANSFORMERS_USE_HUB_KERNELS in ENV_VARS_TRUE_VALUES
def use_kernel_forward_from_hub(layer_name: str):
if _kernels_enabled:
from kernels import use_kernel_forward_from_hub as _kernels_use_kernel_forward_from_hub
return _kernels_use_kernel_forward_from_hub(layer_name)
else:
logger.warning_once(
f"kernels hub usage is disabled through the environment USE_HUB_KERNELS={_TRANSFORMERS_USE_HUB_KERNELS}"
)
return lambda cls: cls
_KERNEL_MAPPING: dict[str, dict[Union[Device, str], LayerRepository]] = {
"MultiScaleDeformableAttention": {
@ -71,6 +84,12 @@ try:
layer_name="RMSNorm",
)
},
"npu": {
Mode.INFERENCE: LayerRepository(
repo_id="kernels-community/liger_kernels",
layer_name="LigerRMSNorm",
)
},
},
"MLP": {
"cuda": LayerRepository(
@ -161,6 +180,7 @@ try:
except ImportError:
_kernels_available = False
_kernels_enabled = False
# Stub to make decorators int transformers work when `kernels`
# is not installed.

View File

@ -38,7 +38,7 @@ from transformers.utils.import_utils import _is_package_available
if os.getenv("WANDB_MODE") == "offline":
print("⚙️ Running in WANDB offline mode")
print("[INFO] Running in WANDB offline mode")
from .. import PreTrainedModel, TrainingArguments
from .. import __version__ as version

View File

@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from ..utils import is_accelerate_available, is_torch_available, logging
from ..utils import is_accelerate_available, is_torch_available, is_torch_xpu_available, logging
if is_torch_available():
@ -114,6 +114,9 @@ def convert_moe_packed_tensors(
if not blocks.is_cuda and torch.cuda.is_available():
blocks = blocks.cuda()
scales = scales.cuda()
elif (blocks.device.type != "xpu") and is_torch_xpu_available():
blocks = blocks.to("xpu")
scales = scales.to("xpu")
scales = scales.to(torch.int32) - 127 # TODO that's because 128=2**7
@ -351,6 +354,8 @@ def dequantize(module, param_name, param_value, target_device, dq_param_name, **
dequantized = convert_moe_packed_tensors(getattr(module, blocks_attr), getattr(module, scales_attr))
if target_device == "cpu" and torch.cuda.is_available():
torch.cuda.empty_cache()
elif target_device == "cpu" and is_torch_xpu_available():
torch.xpu.empty_cache()
setattr(module, proj, torch.nn.Parameter(dequantized.to(target_device)))
delattr(module, blocks_attr)
delattr(module, scales_attr)
@ -395,7 +400,7 @@ def load_and_swizzle_mxfp4(module, param_name, param_value, target_device, trito
else:
blocks = blocks.reshape(local_experts, -1, module.intermediate_size // 2)
if getattr(target_device, "type", target_device) == "cpu":
target_device = "cuda"
target_device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
blocks = blocks.to(target_device).contiguous()
scales = scales.to(target_device).contiguous()
with on_device(target_device):

View File

@ -236,7 +236,7 @@ class PeftAdapterMixin:
**adapter_kwargs,
)
peft_config.inference_mode = not is_trainable
# TODO: WE NEED TOO APPLY OUR DYNAMIC WEIGHT CONVERSION AT SOME POINT HERE!
# Create and add fresh new adapters into the model.
inject_adapter_in_model(peft_config, self, adapter_name, **peft_load_kwargs)

View File

@ -63,9 +63,6 @@ def sdpa_attention_forward(
else:
sdpa_kwargs = {"enable_gqa": True}
if attention_mask is not None and attention_mask.ndim == 4:
attention_mask = attention_mask[:, :, :, : key.shape[-2]]
# Instead of relying on the value set in the module directly, we use the is_causal passed in kwargs if it is presented
is_causal = is_causal if is_causal is not None else getattr(module, "is_causal", True)

View File

@ -18,6 +18,7 @@ import operator
import os
import re
from functools import partial, reduce
from typing import Optional
import torch
import torch.distributed as dist
@ -140,6 +141,16 @@ def _blocks_to_block_sizes(total_size: int, blocks: int | list[int]) -> list[int
return [single_size] * blocks
def replace_layer_number_by_wildcard(name: str) -> str:
"""
Replace the numbers in the `name` by wildcards, only if they are in-between dots (`.`) or if they are between
a dot (`.`) and the end of the string.
This matches how modules are named/numbered when using a nn.ModuleList or nn.Sequential, but will NOT match
numbers in a parameter name itself, e.g. if the param is named `"w1"` or `"w2"`.
"""
return re.sub(r"\.\d+(\.|$)", lambda m: ".*" + m.group(1), name)
def _get_parameter_tp_plan(parameter_name: str, tp_plan: dict[str, str], is_weight=True) -> str | None:
"""
Get the TP style for a parameter from the TP plan.
@ -150,11 +161,11 @@ def _get_parameter_tp_plan(parameter_name: str, tp_plan: dict[str, str], is_weig
The `is_weight` is important because for weights, we want to support `.weights` and `.bias` cases seamlessly! but
not parent classes for `post_init` calls
"""
generic_param_name = re.sub(r"\d+", "*", parameter_name)
generic_param_name = replace_layer_number_by_wildcard(parameter_name)
if generic_param_name in tp_plan:
return tp_plan[generic_param_name]
elif "." in generic_param_name and generic_param_name.rsplit(".", 1)[0] in tp_plan and is_weight:
return tp_plan[generic_param_name.rsplit(".", 1)[0]]
elif is_weight and "." in generic_param_name and (module_name := generic_param_name.rsplit(".", 1)[0]) in tp_plan:
return tp_plan[module_name]
return None
@ -306,7 +317,7 @@ def repack_weights(
return final_ordered_tensor
def get_tensor_shard(param, empty_param, device_mesh, rank, dim):
def get_tensor_shard(param, empty_param, device_mesh, rank, dim, tensor_idx: Optional[int] = None):
"""
Generalized tensor sharding across a multi-dimensional device mesh.
Extract only the fraction of the parameter owned by the given `rank` when the parameter would have gone sharding at provided `dim`.
@ -358,32 +369,57 @@ def get_tensor_shard(param, empty_param, device_mesh, rank, dim):
rank (int): Global rank of the current process/device.
dim (int): Dimension along which to shard the tensor.
"""
param_dim = empty_param.dim()
if dim < 0:
dim = param_dim + dim
if dim >= param_dim:
raise ValueError(f"dim {dim} is out of bounds for tensor of dimension {param_dim}")
param_dim = empty_param.ndim
# Flatten the mesh to get the total number of devices
mesh_shape = device_mesh.shape
world_size = reduce(operator.mul, mesh_shape)
if dim < 0:
dim = param_dim + dim
if empty_param.dim() == 3 and dim == 1 and len(param.get_shape()) == 2:
dim = 0
elif empty_param.dim() == 3 and dim == 2 and len(param.get_shape()) == 2:
dim = 0
shard_size = math.ceil(empty_param.size(dim) / world_size)
start = rank * shard_size
end = min(start + shard_size, empty_param.size(dim))
if dim >= param_dim:
raise ValueError(f"dim {dim} is out of bounds for tensor of dimension {param_dim}")
if rank >= world_size:
raise ValueError(f"Rank {rank} is out of bounds for mesh size {world_size}")
shard_size = math.ceil(empty_param.shape[dim] / world_size)
start = rank * shard_size
# we have the full tensor not 1 part of it.
# in that case, we just assume that the weight was properly saved
# and thus because we TP if the layer is colwise it should not use this. Layer should be packed_colwise
# to inform that it needs to read form a packed tensor. It will also take care of the module list thingy.
# here we take care of potential chunking / layer split / layer chunking.
# The only "hard" case is? if we collect q,k,v -> merge it into qkv. In that case
# actually we still shard dim=0 does not change
# so only case is if the dim of the empty param is 3 and the shard dim is 0 -> we put the
# tensor on a certain device (with the input tensor_index)
dimensions = param.get_shape()
# Construct slicing index dynamically
end = min(start + shard_size, empty_param.shape[dim])
slice_indices = [slice(None)] * param_dim
if start < empty_param.shape[dim]:
if empty_param.dim() == 3 and dim == 0 and len(param.get_shape()) == 2:
# special case we don't "shard" just send this entire tensor to the correct rank.
if start <= tensor_idx < end:
# this tensor does need to be materialized on this device:
return param[:]
else:
return torch.empty([], dtype=torch.int64, device=rank)
slice_indices = [slice(None)] * len(param.get_shape())
if start < param.get_shape()[dim]:
slice_indices[dim] = slice(start, end)
return param[tuple(slice_indices)]
dimensions = list(param.shape)
param = param[tuple(slice_indices)]
if isinstance(param, list): # TODO handle the modulelist case!
param = [p[:] for p in param]
return param
dimensions[dim] = 0
return torch.empty(tuple(dimensions), dtype=torch.int64)
return torch.empty(tuple(dimensions), dtype=torch.int64) # empty allocates memory....
def distribute_module(
@ -410,6 +446,19 @@ class TensorParallelLayer:
"""
use_dtensor = True
device_mesh = None
rank = None
# Used to compare the shape of the original tensor
empty_param = None
# Used to init the corresponding DTensor
shard = None
def __init__(self, device_mesh=None, rank=None, empty_param=None):
self.rank = rank
self.device_mesh = device_mesh
self.empty_param = empty_param
@staticmethod
def _prepare_input_fn(input_layouts, desired_input_layouts, mod, inputs, device_mesh): ...
@ -439,12 +488,12 @@ class GatherParallel(TensorParallelLayer):
def __init__(
self,
*,
input_layouts: Placement | None = None,
output_layouts: Placement | None = None,
use_local_output: bool = True,
**kwargs,
):
super().__init__()
super().__init__(**kwargs)
self.input_layouts = (input_layouts or Replicate(),)
self.output_layouts = output_layouts
self.desired_input_layouts = (Replicate(),)
@ -465,6 +514,21 @@ class GatherParallel(TensorParallelLayer):
dist.all_reduce(outputs[0], op=dist.ReduceOp.SUM, async_op=False)
return outputs
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
shard = [Replicate()]
parameter = param[...].to(param_casting_dtype)
self.shard = shard
return parameter, shard
def prepare_module_tp(self, module: nn.Module, device_mesh) -> nn.Module:
distribute_module(
module,
@ -493,6 +557,23 @@ class IsolatedParallel(TensorParallelLayer):
# TODO: figure out dynamo support for instance method and switch this to instance method
return outputs
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
mesh = device_mesh or self.device_mesh
parameter = param[...].to(param_casting_dtype)
if mesh is not None:
parameter = parameter / mesh.size()
self.shard = None
return parameter, None
def partition_tensor(self, param, empty_param, param_type, param_casting_dtype, to_contiguous, rank, device_mesh):
param = param[...].to(param_casting_dtype)
if to_contiguous:
@ -515,8 +596,8 @@ class ReplicateParallel(TensorParallelLayer):
This class is used to replicate computation in a TP layer (used in SP regions when we don't use sequence parallelism for example)
"""
def __init__(self, *, use_dtensor=True, use_local_output=True):
super().__init__()
def __init__(self, use_dtensor=True, use_local_output=True, **kwargs):
super().__init__(**kwargs)
self.input_layouts = (Replicate(),)
self.output_layouts = (Replicate(),)
self.desired_input_layouts = (Replicate(),)
@ -537,12 +618,33 @@ class ReplicateParallel(TensorParallelLayer):
def _prepare_output_fn(output_layouts, use_local_output, mod, outputs, device_mesh):
return outputs.to_local() if use_local_output and isinstance(outputs, DTensor) else outputs
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
parameter = param[...].to(param_casting_dtype)
shard = [Replicate()]
self.shard = shard
return parameter, shard
def partition_tensor(self, param, empty_param, param_type, param_casting_dtype, to_contiguous, rank, device_mesh):
param = param[...].to(param_casting_dtype)
if to_contiguous:
param = param.contiguous()
param = DTensor.from_local(param, device_mesh, [Replicate()], run_check=False)
return param
parameter, shard = self.shard_tensor(
param,
param_type=param_type,
param_casting_dtype=param_casting_dtype,
to_contiguous=to_contiguous,
rank=rank,
device_mesh=device_mesh,
)
if self.use_dtensor:
parameter = DTensor.from_local(parameter, device_mesh, shard, run_check=False)
return parameter
class ColwiseParallel(TensorParallelLayer):
@ -552,13 +654,13 @@ class ColwiseParallel(TensorParallelLayer):
def __init__(
self,
*,
input_layouts: Placement | None = None,
output_layouts: Placement | None = None,
use_local_output: bool = True,
use_dtensor=True,
**kwargs,
):
super().__init__()
super().__init__(**kwargs)
self.input_layouts = (input_layouts or Replicate(),)
self.output_layouts = (output_layouts or Shard(-1),)
self.desired_input_layouts = (Replicate(),)
@ -578,18 +680,34 @@ class ColwiseParallel(TensorParallelLayer):
input_tensor = input_tensor.redistribute(placements=desired_input_layouts, async_op=False)
return input_tensor
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
device_mesh = self.device_mesh
empty_param = self.empty_param
rank = self.rank
if param_type == "bias":
parameter = get_tensor_shard(param, empty_param, device_mesh, rank, -1, tensor_idx)
shard = [Shard(-1)]
else:
shard = [Shard(-2)]
parameter = get_tensor_shard(param, empty_param, device_mesh, rank, -2, tensor_idx)
parameter = parameter.to(param_casting_dtype)
self.shard = shard
return parameter, shard
def partition_tensor(self, param, empty_param, param_type, param_casting_dtype, to_contiguous, rank, device_mesh):
# colwise shard weight/bias to Shard(0), weight be Shard(-2) (0 if you have 1 dim only)
# means Colwise as Linear is input * weight^T + bias, where
# weight would become Shard(1)
if param_type == "bias":
parameter = get_tensor_shard(param, empty_param, device_mesh, rank, -1)
shard = [Shard(-1)]
else:
shard = [Shard(-2)]
parameter = get_tensor_shard(param, empty_param, device_mesh, rank, -2)
parameter = parameter.to(param_casting_dtype)
parameter, shard = self.shard_tensor(param, param_type, param_casting_dtype, to_contiguous, rank, device_mesh)
if to_contiguous:
parameter = parameter.contiguous()
if self.use_dtensor:
@ -608,6 +726,21 @@ class ColwiseParallel(TensorParallelLayer):
class PackedColwiseParallel(ColwiseParallel):
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
device_mesh = device_mesh or self.device_mesh
empty_param = self.empty_param
rank = rank if rank is not None else self.rank
return get_packed_weights(param, empty_param, device_mesh, rank, -2).to(param_casting_dtype), [Shard(-2)]
def partition_tensor(self, param, empty_param, param_type, param_casting_dtype, to_contiguous, rank, device_mesh):
# colwise shard weight/bias to Shard(0), weight be Shard(-2) (0 if you have 1 dim only)
# means Colwise as Linear is input * weight^T + bias, where
@ -642,18 +775,41 @@ class RowwiseParallel(TensorParallelLayer):
def __init__(
self,
*,
input_layouts: Placement | None = None,
output_layouts: Placement | None = None,
use_local_output: bool = True,
use_dtensor=True,
**kwargs,
):
super().__init__()
super().__init__(**kwargs)
self.input_layouts = (input_layouts or Shard(-1),)
self.output_layouts = (output_layouts or Replicate(),)
self.use_local_output = use_local_output
self.use_dtensor = use_dtensor
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
device_mesh = device_mesh or self.device_mesh
empty_param = self.empty_param
rank = rank if rank is not None else self.rank
if param_type == "bias":
shard = [Replicate()]
parameter = param[...]
else:
parameter = get_tensor_shard(param, empty_param, device_mesh, rank, -1, tensor_idx=tensor_idx)
shard = [Shard(-1)]
parameter = parameter.to(param_casting_dtype)
self.shard = shard
return parameter, shard
def partition_tensor(self, param, empty_param, param_type, param_casting_dtype, to_contiguous, rank, device_mesh):
# Rowwise shard weight to Shard(1), bias to Replicate(), weight be Shard(1)
# means Rowwise as nn.Linear is input * weight^T + bias, where
@ -725,6 +881,21 @@ class RowwiseParallel(TensorParallelLayer):
class PackedRowwiseParallel(RowwiseParallel):
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
device_mesh = device_mesh or self.device_mesh
empty_param = self.empty_param
rank = rank if rank is not None else self.rank
return get_packed_weights(param, empty_param, device_mesh, rank, -1), [Shard(-1)]
def partition_tensor(self, param, empty_param, param_type, param_casting_dtype, to_contiguous, rank, device_mesh):
# colwise shard weight/bias to Shard(0), weight be Shard(-2) (0 if you have 1 dim only)
# means Colwise as Linear is input * weight^T + bias, where
@ -783,8 +954,8 @@ class SequenceParallel(TensorParallelLayer):
to ensure that they are replicated.
"""
def __init__(self, *, sequence_dim: int = 1, use_local_output: bool = False, use_dtensor=False):
super().__init__()
def __init__(self, sequence_dim: int = 1, use_local_output: bool = False, use_dtensor=False, **kwargs):
super().__init__(**kwargs)
self.input_layouts = (Replicate(),)
self.desired_input_layouts = (Shard(1),)
self.output_layouts = (Replicate(),)
@ -793,6 +964,21 @@ class SequenceParallel(TensorParallelLayer):
self.sequence_sharding = (Shard(sequence_dim),)
self.use_local_output = use_local_output
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
parameter = param[...].to(param_casting_dtype)
shard = [Replicate()]
self.shard = shard
return parameter, shard
@staticmethod
def _prepare_input_fn(input_layouts, desired_input_layouts, mod, inputs, device_mesh):
input_tensor = inputs[0]
@ -827,10 +1013,34 @@ class GroupedGemmParallel(TensorParallelLayer):
Applies Expert Parallelism to MoE experts by loading the correct experts on each device.
"""
def __init__(self):
super().__init__()
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.use_dtensor = False
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
empty_param = self.empty_param
ep_rank = self.rank
device_mesh = self.device_mesh
global_num_experts = empty_param.shape[0]
if global_num_experts % device_mesh.size() != 0:
raise ValueError(
f"Global number of experts must be divisible by number of devices: {global_num_experts} % {device_mesh.size()} != 0"
)
local_num_experts = global_num_experts // device_mesh.size()
parameter = param[ep_rank * local_num_experts : (ep_rank + 1) * local_num_experts].to(param_casting_dtype)
self.shard = None
return parameter, None
def partition_tensor(self, param, empty_param, param_type, param_casting_dtype, to_contiguous, rank, device_mesh):
ep_rank = rank
global_num_experts = empty_param.shape[0]
@ -851,8 +1061,8 @@ class RouterParallel(TensorParallelLayer):
"""
def __init__(self, *args, **kwargs):
super().__init__(**kwargs)
self.args = args
self.kwargs = kwargs
self.use_dtensor = False
@staticmethod
@ -917,6 +1127,20 @@ class RouterParallel(TensorParallelLayer):
) # masking class for one hot
return router_scores, router_indices
def shard_tensor(
self,
param,
param_type=None,
param_casting_dtype=None,
to_contiguous=None,
rank=None,
device_mesh=None,
tensor_idx=None,
):
parameter = param[...].to(param_casting_dtype)
self.shard = None
return parameter, None
def partition_tensor(self, param, empty_param, param_type, param_casting_dtype, to_contiguous, rank, device_mesh):
# TODO: i'd like for this to be the default
param = param[...].to(param_casting_dtype)
@ -1059,6 +1283,9 @@ def shard_and_distribute_module(
if current_shard_plan is not None:
try:
tp_layer = ALL_PARALLEL_STYLES[current_shard_plan]
tp_layer.empty_param = empty_param
tp_layer.device_mesh = device_mesh
tp_layer.rank = rank
param = tp_layer.partition_tensor(
param, empty_param, param_type, param_casting_dtype, is_contiguous, rank, device_mesh
)
@ -1086,7 +1313,7 @@ def verify_tp_plan(expected_keys: list[str], tp_plan: dict[str, str] | None):
if tp_plan is None:
return
generic_keys = {re.sub(r"\d+", "*", key) for key in expected_keys}
generic_keys = {replace_layer_number_by_wildcard(key) for key in expected_keys}
unsharded_layers = set(generic_keys)
unused_rules = tp_plan

View File

@ -82,8 +82,10 @@ def causal_mask_function(batch_idx: int, head_idx: int, q_idx: int, kv_idx: int)
def bidirectional_mask_function(batch_idx: int, head_idx: int, q_idx: int, kv_idx: int) -> bool:
"""
This creates a full bidirectional mask.
NOTE: It is important to keep an index-based version for non-vmap expansion.
"""
return q_idx.new_ones((), dtype=torch.bool)
return q_idx >= 0
def sliding_window_overlay(sliding_window: int) -> Callable:
@ -110,18 +112,6 @@ def chunked_overlay(chunk_size: int, left_padding: torch.Tensor) -> Callable:
return inner_mask
def _legacy_chunked_overlay(chunk_size: int) -> Callable:
"""
Same as the above function, but do not correctly account for left padding tokens.
Only kept for compatibility with older torch versions (< 2.6).
"""
def inner_mask(batch_idx: int, head_idx: int, q_idx: int, kv_idx: int) -> bool:
return kv_idx // chunk_size == q_idx // chunk_size
return inner_mask
def sliding_window_causal_mask_function(sliding_window: int) -> Callable:
"""
This return the mask_function function to create a sliding window mask.
@ -133,8 +123,6 @@ def chunked_causal_mask_function(chunk_size: int, left_padding: torch.Tensor) ->
"""
This return the mask_function function to create a chunked attention mask.
"""
if not _is_torch_greater_or_equal_than_2_6:
return and_masks(_legacy_chunked_overlay(chunk_size), causal_mask_function)
return and_masks(chunked_overlay(chunk_size, left_padding), causal_mask_function)
@ -175,55 +163,56 @@ def add_offsets_to_mask_function(mask_function: Callable, q_offset: int, kv_offs
return inner_mask
def _vmap_for_bhqkv(mask_function: Callable, bh_indices: bool = True) -> Callable:
"""
Used to vmap our mask_functions over the q_idx and kv_idx dimensions of the inputs. Optionally, vmap over
the batch and head indices as well if `bh_indices=True`.
Using vmap here allows us to keep the performance of vectorized ops, while having a single set of primitive
functions between attention interfaces (i.e. between flex and sdpa/eager, FA2 being a bit different).
Args:
mask_function (`Callable`):
The mask_function to vmap.
bh_indices (`bool`, optional):
Whether to vmap over the batch and head indices as well, or only q and kv indices.
Returns:
Callable: The vmapped function.
"""
# We vmap the function 2 times, broadcasting the [q_idx, kv_idx] dimensions
dimensions = [(None, None, None, 0), (None, None, 0, None)]
if bh_indices:
# We extend broadcasting over the [batch_idx, head_idx] dimensions
dimensions.extend([(None, 0, None, None), (0, None, None, None)])
for dims in dimensions:
mask_function = torch.vmap(mask_function, in_dims=dims, out_dims=0)
return mask_function
def prepare_padding_mask(
attention_mask: Optional[torch.Tensor], kv_length: int, kv_offset: int, _slice: bool = True
attention_mask: Optional[torch.Tensor], kv_length: int, kv_offset: int
) -> Optional[torch.Tensor]:
"""
From the 2D attention mask, prepare the correct padding mask to use by potentially padding it, and slicing
according to the `kv_offset` if `_slice` is `True`.
From the 2D attention mask, prepare the correct padding mask to use by potentially padding it.
"""
local_padding_mask = attention_mask
if attention_mask is not None:
# Pad it if necessary
if (padding_length := kv_length + kv_offset - attention_mask.shape[-1]) > 0:
local_padding_mask = torch.nn.functional.pad(attention_mask, (0, padding_length))
# For flex, we should not slice them, only use an offset
if _slice:
# Equivalent to: `local_padding_mask = attention_mask[:, kv_offset : kv_offset + kv_length]`,
# but without data-dependent slicing (i.e. torch.compile friendly)
mask_indices = torch.arange(kv_length, device=local_padding_mask.device)
mask_indices += kv_offset
local_padding_mask = local_padding_mask[:, mask_indices]
return local_padding_mask
def _can_skip_causal_mask_xpu(
padding_mask: Optional[torch.Tensor],
query_length: int,
kv_length: int,
local_attention_size: Optional[int],
) -> bool:
"""
XPU-specific logic for determining if we can skip causal mask creation.
For XPU devices, we have special handling:
- Single query tokens (query_length == 1) use the same logic as CUDA
- Multi-query tokens can skip if padding_mask is provided and correctly structured
The mask must have all True values in the query window and all False after
"""
if is_tracing(padding_mask):
return False
# Check local attention constraint (same as CUDA)
if local_attention_size is not None and kv_length >= local_attention_size:
return False
if padding_mask is None:
# Without padding mask, can skip if single query token or full causal attention
return query_length == 1 or kv_length == query_length
# XPU allows skipping under additional conditions when padding_mask is provided
if query_length == 1:
# Single query token: skip only if no padding tokens present
return padding_mask.all()
# XPU-specific: check if query window is all True and rest is all False
# This allows XPU to optimize the 1st token in static cache
return padding_mask[:, :query_length].all() and not padding_mask[:, query_length:].any()
def _ignore_causal_mask_sdpa(
padding_mask: Optional[torch.Tensor],
query_length: int,
@ -244,6 +233,12 @@ def _ignore_causal_mask_sdpa(
mask_indices += kv_offset
padding_mask = padding_mask[:, mask_indices]
if _is_torch_xpu_available:
# XPU devices have special handling for mask skipping:
# - Single query tokens use the same logic as CUDA
# - Multi-query tokens can skip if padding_mask is provided and correctly structured
# (all True in query window, all False after)
return _can_skip_causal_mask_xpu(padding_mask, query_length, kv_length, local_attention_size)
# When using `torch.export` or `torch.onnx.dynamo_export`, we must pass an example input, and `is_causal` behavior is
# hard-coded to the forward. If a user exports a model with query_length > 1, the exported model will hard-code `is_causal=True`
# which is in general wrong (see https://github.com/pytorch/pytorch/issues/108108). Thus, we only set
@ -251,18 +246,11 @@ def _ignore_causal_mask_sdpa(
if (
not is_tracing(padding_mask)
# only cases when lower and upper diags are the same, see https://github.com/pytorch/pytorch/issues/108108
and (query_length == 1 or (kv_length == query_length or _is_torch_xpu_available))
and (query_length == 1 or kv_length == query_length)
# in this case we need to add special patterns to the mask so cannot be skipped otherwise
and (local_attention_size is None or kv_length < local_attention_size)
# In this case, we need to add padding to the mask, so cannot be skipped otherwise
and (
padding_mask is None
or (
padding_mask.all()
if not _is_torch_xpu_available or query_length == 1
else padding_mask[:, :query_length].all()
)
)
and (padding_mask is None or padding_mask.all())
):
return True
@ -282,7 +270,39 @@ def _ignore_bidirectional_mask_sdpa(padding_mask: Optional[torch.Tensor]) -> boo
return False
def sdpa_mask_recent_torch(
def _vmap_expansion_sdpa(mask_function: Callable) -> Callable:
"""
Used to vmap our mask_functions over the all 4 dimensions (b_idx, h_idx, q_idx, kv_idx) of the inputs.
Using vmap here allows us to keep the performance of vectorized ops, while having a single set of primitive
functions between attention interfaces (i.e. between flex and sdpa/eager, FA2 being a bit different).
"""
# We vmap the function over all 4 dimensions, broadcasting [b_idx, h_idx, q_idx, kv_idx]
dimensions = [(None, None, None, 0), (None, None, 0, None), (None, 0, None, None), (0, None, None, None)]
for dims in dimensions:
mask_function = torch.vmap(mask_function, in_dims=dims, out_dims=0)
return mask_function
def _non_vmap_expansion_sdpa(
batch_indices: torch.Tensor, head_indices: torch.Tensor, q_indices: torch.Tensor, kv_indices: torch.Tensor
):
"""
Used to broadcast our mask_functions over the all 4 dimensions (b_idx, h_idx, q_idx, kv_idx) of the inputs.
Allows the usage of any index-based mask function without relying on vmap.
NOTE: This is limited to index based functions only and is not guaranteed to work otherwise.
Reference:
- https://github.com/huggingface/optimum-onnx/blob/c123e8f4fab61b54a8e0e31ce74462bcacca576e/optimum/exporters/onnx/model_patcher.py#L362-L365
"""
batch_indices = batch_indices[:, None, None, None]
head_indices = head_indices[None, :, None, None]
q_indices = q_indices[None, None, :, None]
kv_indices = kv_indices[None, None, None, :]
return batch_indices, head_indices, q_indices, kv_indices
def sdpa_mask(
batch_size: int,
cache_position: torch.Tensor,
kv_length: int,
@ -292,6 +312,8 @@ def sdpa_mask_recent_torch(
local_size: Optional[int] = None,
allow_is_causal_skip: bool = True,
allow_is_bidirectional_skip: bool = False,
allow_torch_fix: bool = True,
use_vmap: bool = False,
**kwargs,
) -> Optional[torch.Tensor]:
"""
@ -324,6 +346,12 @@ def sdpa_mask_recent_torch(
allow_is_bidirectional_skip (`bool`, optional):
Whether to allow to return `None` for the mask under conditions where we do not have to add any bias,
i.e. full attention without any padding. Default to `False`.
allow_torch_fix (`bool`, optional):
Whether to update the mask in case a query is not attending to any tokens, to solve a bug in torch's older
versions. We need an arg to skip it when using eager. By default `True`.
use_vmap (`bool`, optional):
Whether to use `vmap` during the mask construction or not. Allows powerful custom patterns that may not be
index-based (for the cost of speed performance). By default `False`.
## Creating a simple causal mask:
@ -391,97 +419,8 @@ def sdpa_mask_recent_torch(
"""
q_length = cache_position.shape[0]
# Potentially pad the 2D mask, and slice it correctly
padding_mask = prepare_padding_mask(attention_mask, kv_length, kv_offset, _slice=False)
# Under specific conditions, we can avoid materializing the mask
# 1. Causal masks can rely on the `is_causal` argument
# 2. Bidirectional do not need any further processing (no bias)
if allow_is_causal_skip and _ignore_causal_mask_sdpa(padding_mask, q_length, kv_length, kv_offset, local_size):
return None
if allow_is_bidirectional_skip and _ignore_bidirectional_mask_sdpa(padding_mask):
return None
# vmap can incur performance issues as reported in #41566 for bidirectional mask as we only need to expand the
# padding mask. Thus, we allow early exit here if we do not detect any modification to the base mask function
if mask_function is bidirectional_mask_function:
if padding_mask is not None:
# used for slicing without data-dependent slicing
mask_indices = torch.arange(kv_length, device=cache_position.device) + kv_offset
return padding_mask[:, None, None, mask_indices].expand(-1, -1, q_length, -1)
else:
return torch.ones(batch_size, 1, q_length, kv_length, dtype=torch.bool, device=cache_position.device)
# Similar to `kv_arange = torch.arange(start=kv_offset, end=kv_offset + kv_length, device=cache_position.device)`
# but without data-dependent slicing (i.e. torch.compile friendly)
kv_arange = torch.arange(kv_length, device=cache_position.device)
kv_arange += kv_offset
# Potentially add the padding 2D mask
if padding_mask is not None:
mask_function = and_masks(mask_function, padding_mask_function(padding_mask))
batch_arange = torch.arange(batch_size, device=cache_position.device)
head_arange = torch.arange(1, device=cache_position.device)
# This creates the 4D mask easily. Note that we need this context manager as vmap cannot handle slicing a tensor from
# scalar tensor (it internally calls `.item()` which vmap does not allow, but this context works around it
# We don't need to add an offset to the mask_function either, as we vmap directly the correct indices for k and kv indices
with TransformGetItemToIndex():
causal_mask = _vmap_for_bhqkv(mask_function)(batch_arange, head_arange, cache_position, kv_arange)
return causal_mask
def sdpa_mask_older_torch(
batch_size: int,
cache_position: torch.Tensor,
kv_length: int,
kv_offset: int = 0,
mask_function: Callable = causal_mask_function,
attention_mask: Optional[torch.Tensor] = None,
local_size: Optional[int] = None,
allow_is_causal_skip: bool = True,
allow_torch_fix: bool = True,
allow_is_bidirectional_skip: bool = False,
**kwargs,
) -> Optional[torch.Tensor]:
"""
NOTE: This function is only used when torch version is torch<2.5 - see `sdpa_mask_recent_torch` otherwise.
Create a 4D boolean mask of shape `(batch_size, 1, query_length, kv_length)` where a value of True indicates that
the element should take part in the attention computation, and False that it should not.
If `allow_torch_fix=True` (the default), rows corresponding to query tokens that do not attend
to any other tokens (due to padding) will be fully attended to instead, in order to avoid `nan` propagation (this does
not change the final result).
Args:
batch_size (`int`):
The batch size of the input sequence.
cache_position (`torch.Tensor`):
A tensor of shape (query_length,) indicating the current indices of the input sequence elements.
kv_length (`int`):
The size that the key and value states will have during the attention computation.
kv_offset (`int`, optional):
An optional offset to indicate at which first position the key and values states will refer to.
mask_function (`Callable`):
The mask factory function describing the mask pattern.
attention_mask (`torch.Tensor`, optional):
The 2D attention mask corresponding to padded tokens of shape (batch_size, number_of_seen_tokens+q_length)
local_size (`int`, optional):
The size of the local attention, if we do not use full attention. This is used only if `allow_is_causal_skip=True`
to try to skip mask creation if possible.
allow_is_causal_skip (`bool`, optional):
Whether to allow to return `None` for the mask under conditions where we can use the `is_causal` argument in
`torch.sdpa` instead. Default to `True`.
allow_torch_fix (`bool`, optional):
Whether to update the mask in case a query is not attending to any tokens, to solve a bug in torch's older
versions. We need an arg to skip it when using eager. By default `True`.
allow_is_bidirectional_skip (`bool`, optional):
Whether to allow to return `None` for the mask under conditions where we do not have to add any bias,
i.e. full attention without any padding. Default to `False`.
"""
q_length = cache_position.shape[0]
# Potentially pad the 2D mask, and slice it correctly
# Potentially pad the 2D mask
padding_mask = prepare_padding_mask(attention_mask, kv_length, kv_offset)
# Under specific conditions, we can avoid materializing the mask
@ -492,38 +431,45 @@ def sdpa_mask_older_torch(
if allow_is_bidirectional_skip and _ignore_bidirectional_mask_sdpa(padding_mask):
return None
# vmap can incur performance issues as reported in #41566 for bidirectional mask as we only need to expand the
# padding mask. Thus, we allow early exit here if we do not detect any modification to the base mask function
if mask_function is bidirectional_mask_function:
if padding_mask is not None:
return padding_mask[:, None, None, :].expand(-1, -1, q_length, -1)
else:
return torch.ones(batch_size, 1, q_length, kv_length, dtype=torch.bool, device=cache_position.device)
# Potentially add the padding 2D mask
if padding_mask is not None:
mask_function = and_masks(mask_function, padding_mask_function(padding_mask))
batch_arange = torch.arange(batch_size, device=cache_position.device)
head_arange = torch.arange(1, device=cache_position.device)
# Similar to `kv_arange = torch.arange(start=kv_offset, end=kv_offset + kv_length, device=cache_position.device)`
# but without data-dependent slicing (i.e. torch.compile friendly)
kv_arange = torch.arange(kv_length, device=cache_position.device)
kv_arange += kv_offset
kv_arange = torch.arange(kv_length, device=cache_position.device) + kv_offset
# This creates the 4D mask easily. Note that we do not include vmap over the batch_idx dimension as well,
# as vmap cannot handle slicing a tensor from scalar tensor (it internally calls `.item()` which vmap does not allow
# However, in more recent version of Pytorch, a trick was introduced to handle it - which is the reason we have
# `sdpa_mask_recent_torch`, as it allows more general `mask_function`
causal_mask = _vmap_for_bhqkv(mask_function, bh_indices=False)(None, None, cache_position, kv_arange)
causal_mask = causal_mask[None, None, :, :].expand(batch_size, -1, -1, -1)
if padding_mask is not None:
causal_mask = causal_mask * padding_mask[:, None, None, :]
# Actual mask creation
# Option 1: Fast non-vmap mask creation (default)
if not use_vmap:
# Apply mask function element-wise through broadcasting
attention_mask = mask_function(*_non_vmap_expansion_sdpa(batch_arange, head_arange, cache_position, kv_arange))
# Expand the mask to match batch size and query length if they weren't used in the mask function
attention_mask = attention_mask.expand(batch_size, -1, q_length, kv_length)
# Option 2: Vmap mask creation (torch>=2.6 and custom patterns)
elif _is_torch_greater_or_equal_than_2_6:
# This creates the 4D mask easily. Note that we need this context manager as vmap cannot handle slicing a tensor from
# scalar tensor (it internally calls `.item()` which vmap does not allow, but this context works around it
# We don't need to add an offset to the mask_function either, as we vmap directly the correct indices for k and kv indices
with TransformGetItemToIndex():
attention_mask = _vmap_expansion_sdpa(mask_function)(batch_arange, head_arange, cache_position, kv_arange)
# Option 3: Error out since it indicates that the user did something custom, which they shouldn't have (torch<2.6)
else:
raise ValueError(
"The vmap functionality for mask creation is only supported from torch>=2.6. "
"Please update your torch version or use `use_vmap=False` with index-based masks."
)
# Due to a bug in versions of torch<2.5, we need to update the mask in case a query is not attending to any
# tokens (due to padding). See details in https://github.com/pytorch/pytorch/issues/110213
if not _is_torch_greater_or_equal_than_2_5 and allow_torch_fix:
causal_mask |= torch.all(~causal_mask, dim=-1, keepdim=True)
return causal_mask
attention_mask = attention_mask | torch.all(~attention_mask, dim=-1, keepdim=True)
# We use the version with newer torch whenever possible, as it is more general and can handle arbitrary mask functions
# (especially mask_function indexing a tensor, such as the padding mask function)
sdpa_mask = sdpa_mask_recent_torch if _is_torch_greater_or_equal_than_2_6 else sdpa_mask_older_torch
return attention_mask
def eager_mask(
@ -534,6 +480,7 @@ def eager_mask(
mask_function: Callable = causal_mask_function,
attention_mask: Optional[torch.Tensor] = None,
dtype: torch.dtype = torch.float32,
use_vmap: bool = False,
**kwargs,
) -> torch.Tensor:
"""
@ -556,10 +503,14 @@ def eager_mask(
The 2D attention mask corresponding to padded tokens of shape (batch_size, number_of_seen_tokens+q_length)
dtype (`torch.dtype`, optional):
The dtype to use for the mask. By default, `torch.float32`.
use_vmap (`bool`, optional):
Whether to use `vmap` during the mask construction or not. Allows powerful custom patterns that may not be
index-based (for the cost of speed performance). By default `False`.
"""
# The masks for eager attention are simply boolean mask from sdpa, casted to 0 and -inf
_ = kwargs.pop("allow_is_causal_skip", None)
_ = kwargs.pop("allow_is_bidirectional_skip", None)
_ = kwargs.pop("allow_torch_fix", None)
mask = sdpa_mask(
batch_size=batch_size,
cache_position=cache_position,
@ -570,6 +521,7 @@ def eager_mask(
allow_is_causal_skip=False,
allow_is_bidirectional_skip=False,
allow_torch_fix=False,
use_vmap=use_vmap,
**kwargs,
)
min_dtype = torch.finfo(dtype).min
@ -655,7 +607,7 @@ def flex_attention_mask(
if not _is_torch_greater_or_equal_than_2_6 and pad_len > 0:
attention_mask = torch.nn.functional.pad(attention_mask, value=0, pad=(0, pad_len))
padding_mask = prepare_padding_mask(attention_mask, kv_length, kv_offset, _slice=False)
padding_mask = prepare_padding_mask(attention_mask, kv_length, kv_offset)
mask_function = and_masks(mask_function, padding_mask_function(padding_mask))
# Add the offsets on top (because flex interface only allows length, not start and end indices)
@ -781,9 +733,19 @@ def _preprocess_mask_arguments(
# If using a cache, it can give all information about mask sizes based on seen tokens
if past_key_values is not None:
kv_length, kv_offset = past_key_values.get_mask_sizes(cache_position, layer_idx)
# Otherwise, the sizes are simply the input sizes
# Otherwise, we infer based on our input
else:
kv_length, kv_offset = input_embeds.shape[1], 0
# 1. Rely on input directly
if attention_mask is None:
kv_length, kv_offset = input_embeds.shape[1], 0
# 2. Rely on the mask instead - needed for special cases like prefix tuning in PEFT
#
# This is a very unique and special case where an encoder utilizes a cache and expects its length
# to be accounted for (usually, they should never use a cache). In general, the mask should always
# match with the input sizes nonetheless (i.e. it does not affect others).
# Conclusion: "prefix tuning is evil"
else:
kv_length, kv_offset = attention_mask.shape[-1], 0
# We check the position_ids for potential packed sequence format (only if the 2D attention mask is explicitly None,
# and we don't have past_key_values, i.e. generally a training setup)
@ -851,6 +813,11 @@ def create_causal_mask(
mask_factory_function = causal_mask_function
mask_interface = ALL_MASK_ATTENTION_FUNCTIONS[config._attn_implementation]
# Defaulting to using non-vmap based mask creations except when detecting
# users passing custom mask functions (as we cannot guarantee that they
# are properly index-based as required by our implementation).
use_vmap = False
# Do not allow skip if we are compiling (this is to match BC)
# TODO: cyril -> probably revisit and remove this, but a lot of tests rely on it
if _is_torch_xpu_available:
@ -867,14 +834,16 @@ def create_causal_mask(
raise ValueError("Using `or_mask_function` or `and_mask_function` arguments require torch>=2.6")
mask_factory_function = or_masks(mask_factory_function, or_mask_function)
allow_is_causal_skip = False
use_vmap = True
if and_mask_function is not None:
if not _is_torch_greater_or_equal_than_2_6:
raise ValueError("Using `or_mask_function` or `and_mask_function` arguments require torch>=2.6")
mask_factory_function = and_masks(mask_factory_function, and_mask_function)
allow_is_causal_skip = False
use_vmap = True
# If we detected packing format
if packed_sequence_mask is not None and _is_torch_greater_or_equal_than_2_6:
if packed_sequence_mask is not None:
mask_factory_function = and_masks(mask_factory_function, packed_sequence_mask_function(packed_sequence_mask))
allow_is_causal_skip = False
@ -889,6 +858,7 @@ def create_causal_mask(
allow_is_causal_skip=allow_is_causal_skip, # additional kwarg for sdpa
dtype=dtype, # Additional kwarg for eager
config=config, # Pass the config as well, in case someone wants to easily have their own mask_interface
use_vmap=use_vmap, # Short-circuit to non-vmap expansions for the mask
)
return causal_mask
@ -942,6 +912,10 @@ def create_bidirectional_mask(
# Allow skipping the mask creation except we have additional masking operators (and/or masks)
allow_is_bidirectional_skip = True
# Defaulting to using non-vmap based mask creations except when detecting
# users passing custom mask functions (as we cannot guarantee that they
# are properly index-based as required by our implementation).
use_vmap = False
# Allow slight deviations from the base mask
# Note that it is very important to apply this before any other deviations of the mask (such as packed sequence mask,
@ -951,11 +925,13 @@ def create_bidirectional_mask(
raise ValueError("Using `or_mask_function` or `and_mask_function` arguments require torch>=2.6")
mask_factory_function = or_masks(mask_factory_function, or_mask_function)
allow_is_bidirectional_skip = False
use_vmap = True
if and_mask_function is not None:
if not _is_torch_greater_or_equal_than_2_6:
raise ValueError("Using `or_mask_function` or `and_mask_function` arguments require torch>=2.6")
mask_factory_function = and_masks(mask_factory_function, and_mask_function)
allow_is_bidirectional_skip = False
use_vmap = True
# We now create the mask
attention_mask = mask_interface(
@ -970,6 +946,7 @@ def create_bidirectional_mask(
allow_is_bidirectional_skip=allow_is_bidirectional_skip,
dtype=dtype, # Additional kwarg for eager
config=config, # Pass the config as well, in case someone wants to easily have their own mask_interface
use_vmap=use_vmap, # Short-circuit to non-vmap expansions for the mask
)
return attention_mask
@ -1032,6 +1009,10 @@ def create_sliding_window_causal_mask(
mask_factory_function = sliding_window_causal_mask_function(sliding_window)
mask_interface = ALL_MASK_ATTENTION_FUNCTIONS[config._attn_implementation]
# Defaulting to using non-vmap based mask creations except when detecting
# users passing custom mask functions (as we cannot guarantee that they
# are properly index-based as required by our implementation).
use_vmap = False
# Do not allow skip if we are compiling (this is to match BC)
# TODO: cyril -> probably revisit and remove this, but a lot of tests rely on it
allow_is_causal_skip = not getattr(past_key_values, "is_compileable", False)
@ -1044,14 +1025,16 @@ def create_sliding_window_causal_mask(
raise ValueError("Using `or_mask_function` or `and_mask_function` arguments require torch>=2.6")
mask_factory_function = or_masks(mask_factory_function, or_mask_function)
allow_is_causal_skip = False
use_vmap = True
if and_mask_function is not None:
if not _is_torch_greater_or_equal_than_2_6:
raise ValueError("Using `or_mask_function` or `and_mask_function` arguments require torch>=2.6")
mask_factory_function = and_masks(mask_factory_function, and_mask_function)
allow_is_causal_skip = False
use_vmap = True
# If we detected packing format
if packed_sequence_mask is not None and _is_torch_greater_or_equal_than_2_6:
if packed_sequence_mask is not None:
mask_factory_function = and_masks(mask_factory_function, packed_sequence_mask_function(packed_sequence_mask))
allow_is_causal_skip = False
@ -1067,6 +1050,7 @@ def create_sliding_window_causal_mask(
local_size=sliding_window, # Additional kwarg for sdpa
dtype=dtype, # Additional kwarg for eager
config=config, # Pass the config as well, in case someone wants to easily have their own mask_interface
use_vmap=use_vmap, # Short-circuit to non-vmap expansions for the mask
)
return causal_mask
@ -1140,20 +1124,13 @@ def create_chunked_causal_mask(
left_padding_tokens = (attention_mask.cumsum(dim=-1) == torch.zeros_like(attention_mask)).sum(dim=-1)
else:
left_padding_tokens = torch.zeros(batch_size, device=cache_position.device, dtype=int)
# Raise a warning for older versions if the problematic left-padding situation arises
if (
not _is_torch_greater_or_equal_than_2_6
and kv_length + kv_offset > chunk_size
and (left_padding_tokens > 0).any()
):
logger.warning_once(
"Due to limitations of your current torch version, we cannot correctly account for the left-padding "
"when computing the chunked attention pattern. This will lead to a wrong attention mask for the padded "
"sequences. Behavior will be undefined. Please upgrade to `torch>=2.6` to solve this issue."
)
mask_factory_function = chunked_causal_mask_function(chunk_size, left_padding_tokens)
mask_interface = ALL_MASK_ATTENTION_FUNCTIONS[config._attn_implementation]
# Defaulting to using non-vmap based mask creations except when detecting
# users passing custom mask functions (as we cannot guarantee that they
# are properly index-based as required by our implementation).
use_vmap = False
# Do not allow skip if we are compiling (this is to match BC)
# TODO: cyril -> probably revisit and remove this, but a lot of tests rely on it
allow_is_causal_skip = not getattr(past_key_values, "is_compileable", False)
@ -1166,14 +1143,16 @@ def create_chunked_causal_mask(
raise ValueError("Using `or_mask_function` or `and_mask_function` arguments require torch>=2.6")
mask_factory_function = or_masks(mask_factory_function, or_mask_function)
allow_is_causal_skip = False
use_vmap = True
if and_mask_function is not None:
if not _is_torch_greater_or_equal_than_2_6:
raise ValueError("Using `or_mask_function` or `and_mask_function` arguments require torch>=2.6")
mask_factory_function = and_masks(mask_factory_function, and_mask_function)
allow_is_causal_skip = False
use_vmap = True
# If we detected packing format
if packed_sequence_mask is not None and _is_torch_greater_or_equal_than_2_6:
if packed_sequence_mask is not None:
mask_factory_function = and_masks(mask_factory_function, packed_sequence_mask_function(packed_sequence_mask))
allow_is_causal_skip = False
@ -1189,6 +1168,7 @@ def create_chunked_causal_mask(
local_size=chunk_size, # Additional kwarg for sdpa
dtype=dtype, # Additional kwarg for eager
config=config, # Pass the config as well, in case someone wants to easily have their own mask_interface
use_vmap=use_vmap, # Short-circuit to non-vmap expansions for the mask
)
return causal_mask

View File

@ -697,6 +697,7 @@ def _validate_yarn_parameters(
"original_max_position_embeddings",
"mscale",
"mscale_all_dim",
"truncate",
}
received_keys = set(rope_parameters.keys())
rope_type = rope_parameters["rope_type"]

File diff suppressed because it is too large Load Diff

View File

@ -25,6 +25,7 @@ if TYPE_CHECKING:
from .arcee import *
from .aria import *
from .audio_spectrogram_transformer import *
from .audioflamingo3 import *
from .auto import *
from .autoformer import *
from .aya_vision import *

View File

@ -406,13 +406,14 @@ class Aimv2PreTrainedModel(PreTrainedModel):
_supports_flash_attn = True
_supports_flex_attn = True
@torch.no_grad()
def _init_weights(self, module):
super()._init_weights(module)
if hasattr(module, "logit_scale"):
if isinstance(module.logit_scale, nn.Parameter):
module.logit_scale.data.fill_(math.log(1 / 0.07))
module.logit_scale.fill_(math.log(1 / 0.07))
elif isinstance(module, Aimv2AttentionPoolingHead):
module.cls_token.data.normal_(mean=0.0, std=self.config.initializer_range)
module.cls_token.normal_(mean=0.0, std=self.config.initializer_range)
@auto_docstring(

View File

@ -449,13 +449,14 @@ class Aimv2PreTrainedModel(PreTrainedModel):
_supports_flash_attn = True
_supports_flex_attn = True
@torch.no_grad()
def _init_weights(self, module):
super()._init_weights(module)
if hasattr(module, "logit_scale"):
if isinstance(module.logit_scale, nn.Parameter):
module.logit_scale.data.fill_(math.log(1 / 0.07))
module.logit_scale.fill_(math.log(1 / 0.07))
elif isinstance(module, Aimv2AttentionPoolingHead):
module.cls_token.data.normal_(mean=0.0, std=self.config.initializer_range)
module.cls_token.normal_(mean=0.0, std=self.config.initializer_range)
@auto_docstring(

View File

@ -302,21 +302,22 @@ class AlbertPreTrainedModel(PreTrainedModel):
"attentions": AlbertAttention,
}
@torch.no_grad()
def _init_weights(self, module):
"""Initialize the weights."""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
module.weight[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
elif isinstance(module, AlbertMLMHead):
module.bias.data.zero_()
module.bias.zero_()
@dataclass
@ -425,7 +426,10 @@ class AlbertModel(AlbertPreTrainedModel):
"""
)
class AlbertForPreTraining(AlbertPreTrainedModel):
_tied_weights_keys = ["predictions.decoder.bias", "predictions.decoder.weight"]
_tied_weights_keys = {
"predictions.decoder.weight": "albert.embeddings.word_embeddings.weight",
"predictions.decoder.bias": "predictions.bias",
}
def __init__(self, config: AlbertConfig):
super().__init__(config)
@ -525,7 +529,6 @@ class AlbertMLMHead(nn.Module):
self.dense = nn.Linear(config.hidden_size, config.embedding_size)
self.decoder = nn.Linear(config.embedding_size, config.vocab_size)
self.activation = ACT2FN[config.hidden_act]
self.decoder.bias = self.bias
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
hidden_states = self.dense(hidden_states)
@ -537,14 +540,6 @@ class AlbertMLMHead(nn.Module):
return prediction_scores
def _tie_weights(self) -> None:
# For accelerate compatibility and to not break backward compatibility
if self.decoder.bias.device.type == "meta":
self.decoder.bias = self.bias
else:
# To tie those two weights if they get disconnected (on TPU or when the bias is resized)
self.bias = self.decoder.bias
class AlbertSOPHead(nn.Module):
def __init__(self, config: AlbertConfig):
@ -561,7 +556,10 @@ class AlbertSOPHead(nn.Module):
@auto_docstring
class AlbertForMaskedLM(AlbertPreTrainedModel):
_tied_weights_keys = ["predictions.decoder.bias", "predictions.decoder.weight"]
_tied_weights_keys = {
"predictions.decoder.weight": "albert.embeddings.word_embeddings.weight",
"predictions.decoder.bias": "predictions.bias",
}
def __init__(self, config):
super().__init__(config)

View File

@ -823,24 +823,25 @@ class AlignPreTrainedModel(PreTrainedModel):
input_modalities = ["image", "text"]
supports_gradient_checkpointing = True
@torch.no_grad()
def _init_weights(self, module: nn.Module):
"""Initialize the weights"""
std = self.config.initializer_range
if isinstance(module, (nn.Linear, nn.Conv2d)):
module.weight.data.normal_(mean=0.0, std=std)
module.weight.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, AlignModel):
nn.init.xavier_uniform_(module.text_projection.weight)
module.text_projection.bias.data.zero_()
module.temperature.data.fill_(self.config.temperature_init_value)
module.text_projection.bias.zero_()
module.temperature.fill_(self.config.temperature_init_value)
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
module.weight.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
module.weight[module.padding_idx].zero_()
if isinstance(module, (nn.LayerNorm, nn.BatchNorm2d)):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
@auto_docstring(

View File

@ -59,9 +59,6 @@ class AlignProcessor(ProcessorMixin):
"""
attributes = ["image_processor", "tokenizer"]
image_processor_class = "EfficientNetImageProcessor"
tokenizer_class = ("BertTokenizer", "BertTokenizerFast")
valid_processor_kwargs = AlignProcessorKwargs
def __init__(self, image_processor, tokenizer):

View File

@ -770,6 +770,7 @@ class AltCLIPPreTrainedModel(PreTrainedModel):
supports_gradient_checkpointing = True
_no_split_module = []
@torch.no_grad()
def _init_weights(self, module):
"""Initialize the weights"""
factor = self.config.initializer_factor
@ -797,23 +798,21 @@ class AltCLIPPreTrainedModel(PreTrainedModel):
module.text_projection.weight,
std=module.text_embed_dim**-0.5 * self.config.initializer_factor,
)
module.text_projection._is_hf_initialized = True
nn.init.normal_(
module.visual_projection.weight,
std=module.vision_embed_dim**-0.5 * self.config.initializer_factor,
)
module.visual_projection._is_hf_initialized = True
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
elif isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_factor)
module.weight.normal_(mean=0.0, std=self.config.initializer_factor)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_factor)
module.weight.normal_(mean=0.0, std=self.config.initializer_factor)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
module.weight[module.padding_idx].zero_()
class AltCLIPVisionTransformer(nn.Module):

View File

@ -35,10 +35,6 @@ class AltCLIPProcessor(ProcessorMixin):
The tokenizer is a required input.
"""
attributes = ["image_processor", "tokenizer"]
image_processor_class = ("CLIPImageProcessor", "CLIPImageProcessorFast")
tokenizer_class = ("XLMRobertaTokenizer", "XLMRobertaTokenizerFast")
@deprecate_kwarg(old_name="feature_extractor", version="5.0.0", new_name="image_processor")
def __init__(self, image_processor=None, tokenizer=None):
super().__init__(image_processor, tokenizer)

View File

@ -106,7 +106,6 @@ class ApertusConfig(PreTrainedConfig):
"layers.*.self_attn.o_proj": "rowwise_rep", # we need to replicate here due to the added norm on q and k
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise",
}
base_model_pp_plan = {
"embed_tokens": (["input_ids"], ["inputs_embeds"]),

View File

@ -429,7 +429,7 @@ class ApertusModel(ApertusPreTrainedModel):
@auto_docstring
class ApertusForCausalLM(ApertusPreTrainedModel, GenerationMixin):
_tied_weights_keys = ["lm_head.weight"]
_tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"}
_tp_plan = {"lm_head": "colwise_rep"}
_pp_plan = {"lm_head": (["hidden_states"], ["logits"])}

View File

@ -123,7 +123,6 @@ class ApertusConfig(LlamaConfig):
"layers.*.self_attn.o_proj": "rowwise_rep", # we need to replicate here due to the added norm on q and k
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise",
}
def __init__(

View File

@ -434,7 +434,7 @@ class ArceeModel(ArceePreTrainedModel):
@auto_docstring(checkpoint="arcee-ai/AFM-4.5B")
class ArceeForCausalLM(ArceePreTrainedModel, GenerationMixin):
_tied_weights_keys = ["lm_head.weight"]
_tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"}
_tp_plan = {"lm_head": "colwise_rep"}
_pp_plan = {"lm_head": (["hidden_states"], ["logits"])}

View File

@ -99,15 +99,14 @@ class AriaTextConfig(PreTrainedConfig):
model_type = "aria_text"
keys_to_ignore_at_inference = ["past_key_values"]
# Default tensor parallel plan for base model `AriaTextModel`
base_model_tp_plan = {
"layers.*.self_attn.q_proj": "colwise",
"layers.*.self_attn.k_proj": "colwise",
"layers.*.self_attn.v_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.gate_proj": "colwise",
"layers.*.mlp.up_proj": "colwise",
"layers.*.mlp.down_proj": "rowwise",
"layers.*.mlp.shared_experts.gate_proj": "colwise",
"layers.*.mlp.shared_experts.up_proj": "colwise",
"layers.*.mlp.shared_experts.down_proj": "rowwise",
}
base_model_pp_plan = {
"embed_tokens": (["input_ids"], ["inputs_embeds"]),

View File

@ -585,10 +585,11 @@ class AriaTextPreTrainedModel(PreTrainedModel):
"attentions": AriaTextAttention,
}
@torch.no_grad()
def _init_weights(self, module):
super()._init_weights(module)
if isinstance(module, AriaGroupedExpertsGemm):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
@auto_docstring
@ -608,6 +609,7 @@ class AriaPreTrainedModel(PreTrainedModel):
"attentions": AriaTextAttention,
}
@torch.no_grad()
def _init_weights(self, module):
super()._init_weights(module)
if isinstance(module, AriaProjector):
@ -760,7 +762,7 @@ class AriaTextModel(AriaTextPreTrainedModel):
@auto_docstring
class AriaTextForCausalLM(AriaTextPreTrainedModel, GenerationMixin):
_tied_weights_keys = ["lm_head.weight"]
_tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"}
_tp_plan = {"lm_head": "colwise_rep"}
_pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
@ -890,8 +892,6 @@ class AriaModelOutputWithPast(BaseModelOutputWithPast):
"""
)
class AriaModel(AriaPreTrainedModel):
_checkpoint_conversion_mapping = {"language_model.model": "language_model"}
def __init__(self, config: AriaConfig):
super().__init__(config)
self.vision_tower = AutoModel.from_config(config.vision_config)
@ -1048,12 +1048,12 @@ class AriaModel(AriaPreTrainedModel):
)
class AriaForConditionalGeneration(AriaPreTrainedModel, GenerationMixin):
_checkpoint_conversion_mapping = {
"^language_model.model": "model.language_model",
"^vision_tower": "model.vision_tower",
"^multi_modal_projector": "model.multi_modal_projector",
"^language_model.lm_head": "lm_head",
r"^language_model.model": "model.language_model",
r"^vision_tower": "model.vision_tower",
r"^multi_modal_projector": "model.multi_modal_projector",
r"^language_model.lm_head": "lm_head",
}
_tied_weights_keys = ["lm_head.weight"]
_tied_weights_keys = {"lm_head.weight": "model.language_model.embed_tokens.weight"}
def __init__(self, config: AriaConfig):
super().__init__(config)

View File

@ -169,6 +169,15 @@ class AriaTextConfig(LlamaConfig):
model_type = "aria_text"
base_config_key = "text_config"
base_model_tp_plan = {
"layers.*.self_attn.q_proj": "colwise",
"layers.*.self_attn.k_proj": "colwise",
"layers.*.self_attn.v_proj": "colwise",
"layers.*.self_attn.o_proj": "rowwise",
"layers.*.mlp.shared_experts.gate_proj": "colwise",
"layers.*.mlp.shared_experts.up_proj": "colwise",
"layers.*.mlp.shared_experts.down_proj": "rowwise",
}
def __init__(
self,
@ -906,10 +915,6 @@ class AriaProcessor(ProcessorMixin):
A dictionary indicating size conversions for images.
"""
attributes = ["image_processor", "tokenizer"]
image_processor_class = "AriaImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self,
image_processor=None,
@ -1191,10 +1196,11 @@ class AriaTextPreTrainedModel(PreTrainedModel):
"attentions": AriaTextAttention,
}
@torch.no_grad()
def _init_weights(self, module):
super()._init_weights(module)
if isinstance(module, AriaGroupedExpertsGemm):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
class AriaPreTrainedModel(LlamaPreTrainedModel):
@ -1203,6 +1209,7 @@ class AriaPreTrainedModel(LlamaPreTrainedModel):
_can_compile_fullgraph = False # MoE models don't work with torch.compile (dynamic slicing)
_supports_attention_backend = True
@torch.no_grad()
def _init_weights(self, module):
PreTrainedModel._init_weights(self, module)
if isinstance(module, AriaProjector):
@ -1220,7 +1227,7 @@ class AriaTextModel(LlamaModel):
class AriaTextForCausalLM(AriaTextPreTrainedModel, LlamaForCausalLM):
_tied_weights_keys = ["lm_head.weight"]
_tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"}
def __init__(self, config: AriaTextConfig):
super().__init__(config)
@ -1359,6 +1366,8 @@ class AriaModel(LlavaModel):
"""
)
class AriaForConditionalGeneration(LlavaForConditionalGeneration):
_tied_weights_keys = {"lm_head.weight": "model.language_model.embed_tokens.weight"}
def get_image_features(
self,
pixel_values: torch.FloatTensor,

View File

@ -67,10 +67,6 @@ class AriaProcessor(ProcessorMixin):
A dictionary indicating size conversions for images.
"""
attributes = ["image_processor", "tokenizer"]
image_processor_class = "AriaImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self,
image_processor=None,

View File

@ -272,7 +272,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)
args = parser.parse_args()

View File

@ -300,23 +300,26 @@ class ASTPreTrainedModel(PreTrainedModel):
"attentions": ASTSelfAttention,
}
@torch.no_grad()
def _init_weights(self, module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> None:
"""Initialize the weights"""
if isinstance(module, (nn.Linear, nn.Conv2d)):
# Upcast the input in `fp32` and cast it back to desired `dtype` to avoid
# `trunc_normal_cpu` not implemented in `half` issues
module.weight.data = nn.init.trunc_normal_(
module.weight.data.to(torch.float32), mean=0.0, std=self.config.initializer_range
).to(module.weight.dtype)
module.weight.copy_(
nn.init.trunc_normal_(module.weight.to(torch.float32), mean=0.0, std=self.config.initializer_range).to(
module.weight.dtype
)
)
if module.bias is not None:
module.bias.data.zero_()
module.bias.zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
module.bias.zero_()
module.weight.fill_(1.0)
elif isinstance(module, ASTEmbeddings):
module.cls_token.data.zero_()
module.position_embeddings.data.zero_()
module.distillation_token.data.zero_()
module.cls_token.zero_()
module.position_embeddings.zero_()
module.distillation_token.zero_()
@auto_docstring

View File

@ -0,0 +1,31 @@
# coding=utf-8
# Copyright 2025 NVIDIA CORPORATION and the HuggingFace Inc. team. All rights
# reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure
if TYPE_CHECKING:
from .configuration_audioflamingo3 import *
from .modeling_audioflamingo3 import *
from .processing_audioflamingo3 import *
else:
import sys
_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)

View File

@ -0,0 +1,210 @@
# coding=utf-8
# Copyright 2025 NVIDIA CORPORATION and the HuggingFace Inc. team. All rights
# reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ..auto import CONFIG_MAPPING, AutoConfig
logger = logging.get_logger(__name__)
class AudioFlamingo3EncoderConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of an [`AudioFlamingo3Encoder`]. It is used to instantiate an
AudioFlamingo3 audio encoder according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the audio encoder of the AudioFlamingo3
architecture.
e.g. [nvidia/audio-flamingo-3-hf](https://huggingface.co/nvidia/audio-flamingo-3-hf)
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
num_mel_bins (`int`, *optional*, defaults to 128):
Number of mel features used per input features. Should correspond to the value used in the
`AudioFlamingo3Processor` class.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of encoder layers.
num_attention_heads (`int`, *optional*, defaults to 20):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (`int`, *optional*, defaults to 5120):
Dimensionality of the "intermediate" (often named feed-forward) layer in encoder.
layerdrop (`float`, *optional*, defaults to 0.0):
The LayerDrop probability for the encoder. See the [LayerDrop paper](https://huggingface.co/papers/1909.11556)
for more details.
activation_function (`str`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_size (`int`, *optional*, defaults to 1280):
Dimensionality of the layers.
dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
scale_embedding (`bool`, *optional*, defaults to `False`):
Scale embeddings by dividing by sqrt(hidden_size).
max_source_positions (`int`, *optional*, defaults to 1500):
The maximum sequence length of log-mel filter-bank features that this model might ever be used with.
Example:
```python
>>> from transformers import AudioFlamingo3EncoderConfig, AudioFlamingo3Encoder
>>> # Initializing an AudioFlamingo3EncoderConfig
>>> configuration = AudioFlamingo3EncoderConfig()
>>> # Initializing an AudioFlamingo3Encoder (with random weights)
>>> model = AudioFlamingo3Encoder(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "audioflamingo3_encoder"
attribute_map = {
"d_model": "hidden_size",
"encoder_layers": "num_hidden_layers",
"encoder_attention_heads": "num_attention_heads",
"encoder_ffn_dim": "intermediate_size",
"encoder_layerdrop": "layerdrop",
}
def __init__(
self,
num_mel_bins=128,
num_hidden_layers=32,
num_attention_heads=20,
intermediate_size=5120,
layerdrop=0.0,
activation_function="gelu",
hidden_size=1280,
dropout=0.0,
attention_dropout=0.0,
activation_dropout=0.0,
initializer_range=0.02,
scale_embedding=False,
max_source_positions=1500,
**kwargs,
):
super().__init__(**kwargs)
self.num_mel_bins = num_mel_bins
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.initializer_range = initializer_range
self.layerdrop = layerdrop
self.num_hidden_layers = num_hidden_layers
self.scale_embedding = scale_embedding
self.max_source_positions = max_source_positions
class AudioFlamingo3Config(PretrainedConfig):
r"""
This is the configuration class to store the configuration of an [`AudioFlamingo3ForConditionalGeneration`]. It is used to instantiate an
AudioFlamingo3 model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the AudioFlamingo3.
e.g. [nvidia/audio-flamingo-3-hf](https://huggingface.co/nvidia/audio-flamingo-3-hf)
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
audio_config (`Union[AudioFlamingo3EncoderConfig, dict]`, *optional*, defaults to `AudioFlamingo3EncoderConfig`):
The config object or dictionary of the audio backbone.
text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `Qwen2Config`):
The config object or dictionary of the text backbone.
audio_token_id (`int`, *optional*, defaults to 151669):
The audio token index to encode the audio prompt.
projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
Activation function used in the projector.
projector_bias (`bool`, *optional*, defaults to `True`):
Whether to include bias terms in the projector.
Example:
```python
>>> from transformers import AudioFlamingo3ForConditionalGeneration, AudioFlamingo3Config, AudioFlamingo3EncoderConfig, Qwen2Config
>>> # Initializing an AudioFlamingo3Encoder config
>>> audio_config = AudioFlamingo3EncoderConfig()
>>> # Initializing a Qwen2 config
>>> text_config = Qwen2Config()
>>> # Initializing an AudioFlamingo3 configuration
>>> configuration = AudioFlamingo3Config(audio_config, text_config)
>>> # Initializing a model from the audioflamingo3 style configuration
>>> model = AudioFlamingo3ForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "audioflamingo3"
sub_configs = {
"audio_config": AudioFlamingo3EncoderConfig,
"text_config": AutoConfig,
}
def __init__(
self,
audio_config=None,
text_config=None,
audio_token_id=151669,
projector_hidden_act="gelu",
projector_bias=True,
**kwargs,
):
self.audio_token_id = audio_token_id
if isinstance(audio_config, dict):
audio_config["model_type"] = audio_config.get("model_type", "audioflamingo3_encoder")
audio_config = CONFIG_MAPPING[audio_config["model_type"]](**audio_config)
elif audio_config is None:
audio_config = CONFIG_MAPPING["audioflamingo3_encoder"]()
self.audio_config = audio_config
if isinstance(text_config, dict):
text_config["model_type"] = text_config.get("model_type", "qwen2")
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
elif text_config is None:
text_config = CONFIG_MAPPING["qwen2"]()
self.text_config = text_config
self.projector_hidden_act = projector_hidden_act
self.projector_bias = projector_bias
super().__init__(**kwargs)
__all__ = ["AudioFlamingo3Config", "AudioFlamingo3EncoderConfig"]

View File

@ -0,0 +1,286 @@
# coding=utf-8
# Copyright 2025 NVIDIA CORPORATION and the HuggingFace Inc. team. All rights
# reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert AudioFlamingo3 checkpoints into a Hugging Face repository layout."""
from __future__ import annotations
import argparse
import json
import logging
from collections import defaultdict
from pathlib import Path
from typing import Any
import torch
from safetensors.torch import safe_open
from transformers import (
AudioFlamingo3Config,
AudioFlamingo3ForConditionalGeneration,
AudioFlamingo3Processor,
AutoTokenizer,
GenerationConfig,
Qwen2Config,
WhisperFeatureExtractor,
)
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
def _load_json(p: Path):
if not p.is_file():
raise FileNotFoundError(f"Missing JSON: {p}")
with p.open("r", encoding="utf-8") as f:
return json.load(f)
def write_processor(src_root: Path, dst_root: Path):
llm_dir = src_root / "llm"
# fmt: off
tokenizer_chat_template = (
"{% if messages[0]['role'] != 'system' %}"
"{{ '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}"
"{% endif %}"
"{% for message in messages if message['content'] is not none %}"
"{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n' }}"
"{% endfor %}"
"{% if add_generation_prompt %}"
"{{ '<|im_start|>assistant\\n' }}"
"{% endif %}"
)
# fmt: on
# fmt: off
processor_chat_template = (
"{% if messages[0]['role'] != 'system' %}"
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
"{% endif %}"
"{% for m in messages if m['content'] is not none %}"
"<|im_start|>{{ m['role'] }}\n"
"{% if m['content'] is string %}"
"{{ m['content'] }}"
"{% else %}"
"{% set audio = namespace(found=False) %}"
"{% set text_buf = namespace(v='') %}"
"{% for c in m['content'] %}"
"{% if c.get('type') == 'audio' or 'audio' in c %}"
"{% set audio.found = True %}"
"{% elif c.get('type') == 'text' or 'text' in c %}"
"{% set text_buf.v = text_buf.v + c['text'] %}"
"{% endif %}"
"{% endfor %}"
"{% if audio.found %}{{ '<sound>' }}{% endif %}{{ text_buf.v }}"
"{% endif %}"
"<|im_end|>\n"
"{% endfor %}"
"{% if add_generation_prompt %}"
"<|im_start|>assistant\n"
"{% endif %}"
)
# fmt: on
processor = AudioFlamingo3Processor(
feature_extractor=WhisperFeatureExtractor(feature_size=128, return_attention_mask=True),
tokenizer=AutoTokenizer.from_pretrained(str(llm_dir), chat_template=tokenizer_chat_template, use_fast=True),
chat_template=processor_chat_template,
)
processor.save_pretrained(str(dst_root))
logger.info("processor (tokenizer + preprocessor)")
return processor
PREFIX_MAP = {
"llm": "language_model",
"sound_tower": "audio_tower",
"sound_mm_projector": "multi_modal_projector",
}
def _resolve_component_dir(dirpath: Path):
if not dirpath.is_dir():
return None
idx = dirpath / "model.safetensors.index.json"
mono = dirpath / "model.safetensors"
if idx.exists():
wm = _load_json(idx).get("weight_map") or {}
by_shard: dict[str, list[str]] = defaultdict(list)
for k, shard in wm.items():
by_shard[shard].append(k)
return ("sharded", dirpath, {k: sorted(v) for k, v in sorted(by_shard.items())})
if mono.exists():
return ("file", mono)
cands = sorted([x for x in dirpath.iterdir() if x.suffix == ".safetensors"])
return ("file", cands[0]) if len(cands) == 1 else None
def merge_and_shard_weights(src_root: Path, dst_root: Path, processor: AudioFlamingo3Processor):
state: dict[str, Any] = {}
for tag in PREFIX_MAP.keys():
comp = _resolve_component_dir(src_root / tag)
if not comp:
continue
out_prefix = PREFIX_MAP.get(tag, tag)
if comp[0] == "file":
fp: Path = comp[1]
with safe_open(str(fp), framework="pt", device="cpu") as f:
for k in f.keys():
if k == "__metadata__":
continue
state[f"{out_prefix}.{k}"] = f.get_tensor(k)
else:
base: Path = comp[1]
shard_map: dict[str, list[str]] = comp[2]
for shard, keys in shard_map.items():
sp = base / shard
with safe_open(str(sp), framework="pt", device="cpu") as f:
for k in keys:
state[f"{out_prefix}.{k}"] = f.get_tensor(k)
if not state:
raise FileNotFoundError("No tensors found in llm/, sound_tower/, or sound_mm_projector/.")
tok = processor.tokenizer
text_config = Qwen2Config(
bos_token_id=tok.bos_token_id,
eos_token_id=tok.eos_token_id,
pad_token_id=tok.pad_token_id,
vocab_size=len(tok),
hidden_size=3584,
intermediate_size=18944,
model_max_length=8192,
num_attention_heads=28,
num_hidden_layers=28,
num_key_value_heads=4,
rope_theta=1000000.0,
use_cache=False,
)
config = AudioFlamingo3Config(text_config=text_config, audio_token_id=tok.get_vocab()["<sound>"])
model = AudioFlamingo3ForConditionalGeneration(config).to(dtype=torch.bfloat16)
# Update state dict to new key names if necessary
projector_key_mapping = {
"multi_modal_projector.layers.0.weight": "multi_modal_projector.linear_1.weight",
"multi_modal_projector.layers.0.bias": "multi_modal_projector.linear_1.bias",
"multi_modal_projector.layers.2.weight": "multi_modal_projector.linear_2.weight",
"multi_modal_projector.layers.2.bias": "multi_modal_projector.linear_2.bias",
}
for old_key, new_key in projector_key_mapping.items():
if old_key in state:
state[new_key] = state.pop(old_key)
# Load weights into the instantiated model so we can push via `push_to_hub` later.
load_res = model.load_state_dict(state, strict=True)
# Enforce a clean load
if getattr(load_res, "missing_keys", None) and load_res.missing_keys:
mk = load_res.missing_keys
raise ValueError(f"Missing keys when loading: {mk[:10]}{' ...' if len(mk) > 10 else ''}")
if getattr(load_res, "unexpected_keys", None) and load_res.unexpected_keys:
uk = load_res.unexpected_keys
raise ValueError(f"Unexpected keys when loading: {uk[:10]}{' ...' if len(uk) > 10 else ''}")
generation_config = GenerationConfig(
bos_token_id=tok.bos_token_id,
eos_token_id=tok.eos_token_id,
pad_token_id=tok.pad_token_id,
max_new_tokens=2048,
)
model.generation_config = generation_config
model.save_pretrained(save_directory=str(dst_root))
logger.info("model.safetensors index and shards")
return model
"""
Reproducible Usage
==================
1) Download the original AudioFlamingo-3 weights from NVIDIA (requires Git LFS):
```
git lfs install
git clone https://huggingface.co/nvidia/audio-flamingo-3
```
This will create a folder `audio-flamingo-3/` containing the original components:
`llm/`, `sound_tower/`, and `sound_mm_projector/`.
2) Convert to the Hugging Face Transformers format (locally):
```
python src/transformers/models/audioflamingo3/convert_audioflamingo3_to_hf.py \
--src_dir audio-flamingo-3 \
--dst_dir audio-flamingo-3-hf
```
3) Convert and push directly to the Hub (requires `huggingface-cli login` or `HF_TOKEN`):
```
python src/transformers/models/audioflamingo3/convert_audioflamingo3_to_hf.py \
--src_dir audio-flamingo-3 \
--dst_dir audio-flamingo-3-hf \
--push_to_hub <username-or-org>/audio-flamingo-3
```
This command uploads both the processor (tokenizer + feature extractor) and the converted
model (sharded safetensors + configs) to the specified Hub repository.
"""
def main() -> None:
ap = argparse.ArgumentParser(description="Convert AudioFlamingo3 to Hugging Face format.")
ap.add_argument("--src_dir", required=True, help="Source model root directory")
ap.add_argument("--dst_dir", required=True, help="Destination directory for converted model")
ap.add_argument(
"--push_to_hub",
default=None,
type=str,
help=(
"Optional repository ID to push the converted assets to the Hugging Face Hub, "
"e.g. 'username/audio-flamingo-3'."
),
)
args = ap.parse_args()
src_root = Path(args.src_dir).resolve()
if not src_root.is_dir():
raise FileNotFoundError(f"Source directory not found: {src_root}")
dst_root = Path(args.dst_dir).resolve()
if dst_root.exists():
raise FileExistsError(f"Destination already exists: {dst_root}")
processor = write_processor(src_root, dst_root)
model = merge_and_shard_weights(src_root, dst_root, processor)
# Optionally push converted assets using native push_to_hub only
if args.push_to_hub:
logger.info("Pushing processor to the Hub ...")
processor.push_to_hub(args.push_to_hub)
logger.info("Pushing model to the Hub ...")
model.push_to_hub(args.push_to_hub)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,625 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/audioflamingo3/modular_audioflamingo3.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_audioflamingo3.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# coding=utf-8
# Copyright 2025 NVIDIA CORPORATION and the HuggingFace Inc. team. All rights
# reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from collections.abc import Callable
from typing import Optional, Union
import torch
from torch import nn
from ...activations import ACT2FN
from ...cache_utils import Cache, EncoderDecoderCache
from ...generation import GenerationMixin
from ...masking_utils import eager_mask, padding_mask_function
from ...modeling_flash_attention_utils import FlashAttentionKwargs
from ...modeling_layers import GradientCheckpointingLayer
from ...modeling_outputs import BaseModelOutput, CausalLMOutputWithPast
from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
from ...processing_utils import Unpack
from ...utils import TransformersKwargs, auto_docstring, can_return_tuple, logging
from ..auto import AutoModel, AutoModelForCausalLM
from .configuration_audioflamingo3 import AudioFlamingo3Config, AudioFlamingo3EncoderConfig
logger = logging.get_logger(__name__)
def eager_attention_forward(
module: nn.Module,
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: Optional[torch.Tensor],
scaling: Optional[float] = None,
dropout: float = 0.0,
**kwargs,
):
if scaling is None:
scaling = query.size(-1) ** -0.5
attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
if attention_mask is not None and attention_mask.ndim == 4:
attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]]
attn_weights = nn.functional.softmax(attn_weights, dim=-1)
attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
attn_output = torch.matmul(attn_weights, value)
attn_output = attn_output.transpose(1, 2).contiguous()
return attn_output, attn_weights
class AudioFlamingo3Attention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
def __init__(
self,
embed_dim: int,
num_heads: int,
dropout: float = 0.0,
is_decoder: bool = False,
bias: bool = True,
is_causal: bool = False,
layer_idx: Optional[int] = None,
config: Optional[AudioFlamingo3Config] = None,
):
super().__init__()
self.embed_dim = embed_dim
self.num_heads = num_heads
self.dropout = dropout
self.head_dim = embed_dim // num_heads
self.config = config
if (self.head_dim * num_heads) != self.embed_dim:
raise ValueError(
f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}"
f" and `num_heads`: {num_heads})."
)
self.scaling = self.head_dim**-0.5
self.is_decoder = is_decoder
self.is_causal = is_causal
if layer_idx is None and is_decoder:
logger.warning_once(
f"Instantiating a decoder {self.__class__.__name__} without passing `layer_idx` is not recommended and "
"will to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
"when creating this class."
)
self.layer_idx = layer_idx
self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
def forward(
self,
hidden_states: torch.Tensor,
key_value_states: Optional[torch.Tensor] = None,
past_key_values: Optional[Cache] = None,
attention_mask: Optional[torch.Tensor] = None,
output_attentions: bool = False,
cache_position: Optional[torch.Tensor] = None,
# TODO: we need a refactor so that the different attention modules can get their specific kwargs
# ATM, we have mixed things encoder, decoder, and encoder-decoder attn
**kwargs: Unpack[FlashAttentionKwargs],
) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
"""Input shape: Batch x Time x Channel"""
# if key_value_states are provided this layer is used as a cross-attention layer
# for the decoder
is_cross_attention = key_value_states is not None
# determine input shapes
bsz, tgt_len = hidden_states.shape[:-1]
q_input_shape = (bsz, tgt_len, -1, self.head_dim)
# Scaling is susceptible to floating point arithmetics' inprecisions
# which can lead to different results (this is dependent from model
# to model, e.g. audioflamingo3 is one such case). We therefore keep the
# original order of scaling to follow the original implementation
# and enforce no scaling (1.0) in the attention call below.
query_states = self.q_proj(hidden_states) * self.scaling
query_states = query_states.view(*q_input_shape)
query_states = query_states.transpose(1, 2).contiguous()
# Check is encoder-decoder model is being used. Otherwise we'll get `DynamicCache`
if past_key_values is not None and isinstance(past_key_values, EncoderDecoderCache):
is_updated = past_key_values.is_updated.get(self.layer_idx)
if is_cross_attention:
# after the first generated id, we can subsequently re-use all key/value_states from cache
past_key_values.is_updated[self.layer_idx] = True
past_key_values = past_key_values.cross_attention_cache
else:
past_key_values = past_key_values.self_attention_cache
# use key_value_states if cross attention
current_states = key_value_states if key_value_states is not None else hidden_states
if is_cross_attention and past_key_values and is_updated:
# reuse k,v, cross_attentions
key_states = past_key_values.layers[self.layer_idx].keys
value_states = past_key_values.layers[self.layer_idx].values
else:
key_states = self.k_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim)
value_states = self.v_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim)
key_states = key_states.transpose(1, 2).contiguous()
value_states = value_states.transpose(1, 2).contiguous()
if past_key_values is not None:
# save all key/value_states to cache to be re-used for fast auto-regressive generation
cache_position = cache_position if not is_cross_attention else None
key_states, value_states = past_key_values.update(
key_states, value_states, self.layer_idx, {"cache_position": cache_position}
)
attention_interface: Callable = eager_attention_forward
if self.config._attn_implementation != "eager":
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
attn_output, attn_weights = attention_interface(
self,
query_states,
key_states,
value_states,
attention_mask,
dropout=0.0 if not self.training else self.dropout,
scaling=1.0,
output_attentions=output_attentions,
**kwargs,
)
attn_output = attn_output.reshape(bsz, tgt_len, -1).contiguous()
attn_output = self.out_proj(attn_output)
return attn_output, attn_weights
class AudioFlamingo3EncoderLayer(GradientCheckpointingLayer):
def __init__(self, config: AudioFlamingo3Config):
super().__init__()
self.embed_dim = config.d_model
self.self_attn = AudioFlamingo3Attention(
embed_dim=self.embed_dim,
num_heads=config.encoder_attention_heads,
dropout=config.attention_dropout,
config=config,
)
self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.dropout = config.dropout
self.activation_fn = ACT2FN[config.activation_function]
self.activation_dropout = config.activation_dropout
self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)
self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)
self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
output_attentions: bool = False,
) -> torch.Tensor:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`): attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
"""
residual = hidden_states
hidden_states = self.self_attn_layer_norm(hidden_states)
hidden_states, attn_weights = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
output_attentions=output_attentions,
)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
residual = hidden_states
hidden_states = self.final_layer_norm(hidden_states)
hidden_states = self.activation_fn(self.fc1(hidden_states))
hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training)
hidden_states = self.fc2(hidden_states)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
if hidden_states.dtype == torch.float16:
clamp_value = torch.finfo(hidden_states.dtype).max - 1000
hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
return hidden_states, attn_weights
@auto_docstring
class AudioFlamingo3PreTrainedModel(PreTrainedModel):
config: AudioFlamingo3Config
base_model_prefix = "model"
input_modalities = ["audio", "text"]
supports_gradient_checkpointing = True
_no_split_modules = ["AudioFlamingo3Attention"]
_skip_keys_device_placement = "past_key_values"
_supports_flash_attn = True
_supports_sdpa = True
@torch.no_grad()
def _init_weights(self, module):
# important: this ported version of AudioFlamingo3 isn't meant for training from scratch - only
# inference and fine-tuning - so the proper init weights code has been removed
std = (
self.config.initializer_range
if hasattr(self.config, "initializer_range")
else self.config.audio_config.initializer_range
)
if isinstance(module, (nn.Linear, nn.Conv1d)):
module.weight.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.zero_()
elif isinstance(module, nn.LayerNorm):
module.weight.fill_(1.0)
module.bias.zero_()
elif isinstance(module, nn.Embedding):
module.weight.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight[module.padding_idx].zero_()
@auto_docstring(
custom_intro="""
The audio model from AudioFlamingo3 without any head or projection on top.
"""
)
class AudioFlamingo3Encoder(AudioFlamingo3PreTrainedModel):
"""
AudioFlamingo3 encoder: Whisper encoder, average pool (time/2), then LayerNorm.
"""
# Ignore copy
config: AudioFlamingo3EncoderConfig
main_input_name = "input_features"
input_modalities = "audio"
_no_split_modules = ["AudioFlamingo3EncoderLayer"]
def __init__(self, config: AudioFlamingo3EncoderConfig):
super().__init__(config)
self.dropout = config.dropout
self.layerdrop = config.encoder_layerdrop
embed_dim = config.d_model
self.num_mel_bins = config.num_mel_bins
self.padding_idx = config.pad_token_id
self.max_source_positions = config.max_source_positions
self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0
self.conv1 = nn.Conv1d(self.num_mel_bins, embed_dim, kernel_size=3, padding=1)
self.conv2 = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1)
self.embed_positions = nn.Embedding(self.max_source_positions, embed_dim)
self.embed_positions.requires_grad_(False)
self.layers = nn.ModuleList([AudioFlamingo3EncoderLayer(config) for _ in range(config.encoder_layers)])
self.layer_norm = nn.LayerNorm(config.d_model)
# Ignore copy
self.avg_pooler = nn.AvgPool1d(2, stride=2)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def _freeze_parameters(self):
for param in self.parameters():
param.requires_grad = False
self._requires_grad = False
def get_input_embeddings(self) -> nn.Module:
return self.conv1
def set_input_embeddings(self, value: nn.Module):
self.conv1 = value
@can_return_tuple
def forward(
self,
input_features: torch.Tensor,
input_features_mask: Optional[torch.Tensor] = None,
):
r"""
Args:
input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, sequence_length)`):
Log-Mel features extracted from raw audio. Use the processor/feature extractor to compute and pad
these features from waveform input.
input_features_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding feature indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
"""
# Prepare attention mask for transformer layers
batch_size = input_features.shape[0]
seq_len = (input_features.shape[-1] - 1) // 2 + 1 # After conv2 downsampling
input_features_lengths = input_features_mask.sum(-1)
input_features_lengths = (input_features_lengths - 1) // 2 + 1 # conv2 downsampling
input_features_mask = torch.arange(seq_len, device=input_features.device) < input_features_lengths[:, None]
attention_mask = eager_mask(
batch_size=batch_size,
cache_position=torch.arange(seq_len, device=input_features.device),
kv_length=seq_len,
mask_function=padding_mask_function(input_features_mask),
dtype=self.conv1.weight.dtype,
)
# Conv front-end
inputs_embeds = nn.functional.gelu(self.conv1(input_features))
inputs_embeds = nn.functional.gelu(self.conv2(inputs_embeds))
inputs_embeds = inputs_embeds.permute(0, 2, 1)
# Add positions, dropout
hidden_states = inputs_embeds + self.embed_positions.weight
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
# Transformer stack
for layer in self.layers:
drop = self.training and torch.rand([]) < self.layerdrop
if not drop:
hidden_states = layer(hidden_states, attention_mask)[0]
# AvgPool (time/2) + LayerNorm
hidden_states = hidden_states.permute(0, 2, 1)
hidden_states = self.avg_pooler(hidden_states).permute(0, 2, 1)
hidden_states = self.layer_norm(hidden_states)
return BaseModelOutput(
last_hidden_state=hidden_states,
)
# Ignore copy
def _get_feat_extract_output_lengths(self, input_lengths: torch.LongTensor):
"""
Computes the output length of the convolutional layers and the output length of the audio encoder
"""
input_lengths = (input_lengths - 1) // 2 + 1
output_lengths = (input_lengths - 2) // 2 + 1
return input_lengths, output_lengths
class AudioFlamingo3MultiModalProjector(nn.Module):
"""
Audio adaptor (small MLP) that projects AudioFlamingo3Encoder features
to the LLM embedding space so they can replace `<sound>` tokens.
"""
def __init__(self, config: AudioFlamingo3Config):
super().__init__()
self.linear_1 = nn.Linear(
config.audio_config.hidden_size, config.text_config.hidden_size, bias=config.projector_bias
)
self.act = ACT2FN[config.projector_hidden_act]
self.linear_2 = nn.Linear(
config.text_config.hidden_size, config.text_config.hidden_size, bias=config.projector_bias
)
def forward(self, audio_features):
hidden_states = self.linear_1(audio_features)
hidden_states = self.act(hidden_states)
hidden_states = self.linear_2(hidden_states)
return hidden_states
@auto_docstring(
custom_intro="""
The AudioFlamingo3 model which consists of a fine-tuned Whisper encoder, a multi-modal projector and a Qwen2 language model.
"""
)
class AudioFlamingo3ForConditionalGeneration(AudioFlamingo3PreTrainedModel, GenerationMixin):
_keep_in_fp32_modules_strict = None
_tp_plan = None
_pp_plan = None
def __init__(self, config):
super().__init__(config)
self.vocab_size = config.text_config.vocab_size
self.audio_tower = AutoModel.from_config(config.audio_config)
self.language_model = AutoModelForCausalLM.from_config(config.text_config)
self.multi_modal_projector = AudioFlamingo3MultiModalProjector(config)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.language_model.get_input_embeddings()
def set_input_embeddings(self, value):
self.language_model.set_input_embeddings(value)
def get_output_embeddings(self):
return self.language_model.get_output_embeddings()
def set_output_embeddings(self, new_embeddings):
self.language_model.set_output_embeddings(new_embeddings)
def set_decoder(self, decoder):
self.language_model.set_decoder(decoder)
def get_decoder(self):
return self.language_model.get_decoder()
def get_audio_features(
self, input_features: torch.FloatTensor, input_features_mask: torch.Tensor
) -> torch.FloatTensor:
"""
This method is used to get the audio embeddings from input features (a log mel spectrogram), meaning inferring the audio encoder and the multi-modal projector.
Args:
input_features (`torch.FloatTensor`):
Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
`numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
`input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`]
input_features_mask (`torch.Tensor` of shape `(batch_size, feature_sequence_length)`):
Mask to avoid performing attention on padded feature indices.
Returns:
`torch.FloatTensor`:
The audio embeddings.
"""
# Encode audio
encoder_output = self.audio_tower(input_features, input_features_mask=input_features_mask)
audio_embeds = self.multi_modal_projector(encoder_output.last_hidden_state)
# Mask according to avg pooling (which is after attention blocks)
post_lengths = (input_features_mask.sum(-1) - 2) // 2 + 1
valid_mask = torch.arange(audio_embeds.shape[1], device=post_lengths.device)[None, :] < post_lengths[:, None]
audio_embeds = audio_embeds[valid_mask.to(audio_embeds.device)]
return audio_embeds
@can_return_tuple
@auto_docstring
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
input_features: Optional[torch.FloatTensor] = None,
input_features_mask: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs: Unpack[TransformersKwargs],
) -> CausalLMOutputWithPast:
r"""
input_features_mask (`torch.Tensor` of shape `(batch_size, feature_sequence_length)`):
Mask to avoid performing attention on padding feature indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Example:
```python
>>> from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
>>> model_id = "nvidia/audio-flamingo-3-hf"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
>>> conversations = [
>>> [
>>> {
>>> "role": "user",
>>> "content": [
>>> {"type": "text", "text": "Transcribe the input speech."},
>>> {
>>> "type": "audio",
>>> "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav",
>>> },
>>> ],
>>> }
>>> ],
>>> [
>>> {
>>> "role": "user",
>>> "content": [
>>> {
>>> "type": "text",
>>> "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
>>> },
>>> {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
>>> ],
>>> }
>>> ],
>>> ]
>>> inputs = processor.apply_chat_template(
>>> conversations,
>>> tokenize=True,
>>> add_generation_prompt=True,
>>> return_dict=True,
>>> ).to(model.device)
>>> outputs = model.generate(**inputs, max_new_tokens=500)
>>> decoded_outputs = processor.batch_decode(
>>> outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
>>> )
>>> print(decoded_outputs)
["The spoken content of the audio is...", "The track's calming and meditative feel can be attributed to..."]
```"""
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
if input_features is not None and input_ids is not None:
audio_embeds = self.get_audio_features(input_features, input_features_mask)
# replace text-audio token placeholders with audio embeddings
audio_token_mask = (input_ids == self.config.audio_token_id).unsqueeze(-1)
inputs_embeds = inputs_embeds.masked_scatter(
audio_token_mask.to(inputs_embeds.device), audio_embeds.to(inputs_embeds.device)
)
outputs: CausalLMOutputWithPast = self.language_model(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
labels=labels,
use_cache=use_cache,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
return outputs
def prepare_inputs_for_generation(self, *args, **kwargs):
# Overwritten -- we should not pass input_features when we are in cached decoding stage
input_features = kwargs.pop("input_features", None)
input_features_mask = kwargs.pop("input_features_mask", None)
cache_position = kwargs.get("cache_position")
model_inputs = super().prepare_inputs_for_generation(*args, **kwargs)
if cache_position is not None and cache_position[0] == 0:
# input_features should only be passed when we are not in cached decoding stage
if input_features is not None:
model_inputs["input_features"] = input_features
if input_features_mask is not None:
model_inputs["input_features_mask"] = input_features_mask
return model_inputs
__all__ = ["AudioFlamingo3ForConditionalGeneration", "AudioFlamingo3PreTrainedModel", "AudioFlamingo3Encoder"]

View File

@ -0,0 +1,303 @@
# coding=utf-8
# Copyright 2025 NVIDIA CORPORATION and the HuggingFace Inc. team. All rights
# reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional, Union
import torch
from torch import nn
from ...activations import ACT2FN
from ...cache_utils import Cache
from ...masking_utils import eager_mask, padding_mask_function
from ...modeling_outputs import BaseModelOutput, CausalLMOutputWithPast
from ...processing_utils import Unpack
from ...utils import TransformersKwargs, auto_docstring, can_return_tuple, logging
from ..qwen2_audio.modeling_qwen2_audio import (
Qwen2AudioEncoder,
Qwen2AudioPreTrainedModel,
)
from ..voxtral.modeling_voxtral import VoxtralForConditionalGeneration, VoxtralMultiModalProjector
from ..whisper.modeling_whisper import WhisperEncoderLayer
from .configuration_audioflamingo3 import AudioFlamingo3Config
logger = logging.get_logger(__name__)
class AudioFlamingo3EncoderLayer(WhisperEncoderLayer):
pass
class AudioFlamingo3PreTrainedModel(Qwen2AudioPreTrainedModel):
pass
@auto_docstring(
custom_intro="""
The audio model from AudioFlamingo3 without any head or projection on top.
"""
)
class AudioFlamingo3Encoder(Qwen2AudioEncoder):
"""
AudioFlamingo3 encoder: Whisper encoder, average pool (time/2), then LayerNorm.
"""
@can_return_tuple
def forward(
self,
input_features: torch.Tensor,
input_features_mask: Optional[torch.Tensor] = None,
):
r"""
Args:
input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, sequence_length)`):
Log-Mel features extracted from raw audio. Use the processor/feature extractor to compute and pad
these features from waveform input.
input_features_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding feature indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
"""
# Prepare attention mask for transformer layers
batch_size = input_features.shape[0]
seq_len = (input_features.shape[-1] - 1) // 2 + 1 # After conv2 downsampling
input_features_lengths = input_features_mask.sum(-1)
input_features_lengths = (input_features_lengths - 1) // 2 + 1 # conv2 downsampling
input_features_mask = torch.arange(seq_len, device=input_features.device) < input_features_lengths[:, None]
attention_mask = eager_mask(
batch_size=batch_size,
cache_position=torch.arange(seq_len, device=input_features.device),
kv_length=seq_len,
mask_function=padding_mask_function(input_features_mask),
dtype=self.conv1.weight.dtype,
)
# Conv front-end
inputs_embeds = nn.functional.gelu(self.conv1(input_features))
inputs_embeds = nn.functional.gelu(self.conv2(inputs_embeds))
inputs_embeds = inputs_embeds.permute(0, 2, 1)
# Add positions, dropout
hidden_states = inputs_embeds + self.embed_positions.weight
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
# Transformer stack
for layer in self.layers:
drop = self.training and torch.rand([]) < self.layerdrop
if not drop:
hidden_states = layer(hidden_states, attention_mask)[0]
# AvgPool (time/2) + LayerNorm
hidden_states = hidden_states.permute(0, 2, 1)
hidden_states = self.avg_pooler(hidden_states).permute(0, 2, 1)
hidden_states = self.layer_norm(hidden_states)
return BaseModelOutput(
last_hidden_state=hidden_states,
)
class AudioFlamingo3MultiModalProjector(VoxtralMultiModalProjector):
"""
Audio adaptor (small MLP) that projects AudioFlamingo3Encoder features
to the LLM embedding space so they can replace `<sound>` tokens.
"""
def __init__(self, config: AudioFlamingo3Config):
super().__init__()
self.linear_1 = nn.Linear(
config.audio_config.hidden_size, config.text_config.hidden_size, bias=config.projector_bias
)
self.act = ACT2FN[config.projector_hidden_act]
self.linear_2 = nn.Linear(
config.text_config.hidden_size, config.text_config.hidden_size, bias=config.projector_bias
)
@auto_docstring(
custom_intro="""
The AudioFlamingo3 model which consists of a fine-tuned Whisper encoder, a multi-modal projector and a Qwen2 language model.
"""
)
class AudioFlamingo3ForConditionalGeneration(VoxtralForConditionalGeneration):
_tp_plan = None
_pp_plan = None
_keep_in_fp32_modules_strict = None
def __init__(self, config):
super().__init__(config)
def get_audio_features(
self, input_features: torch.FloatTensor, input_features_mask: torch.Tensor
) -> torch.FloatTensor:
"""
This method is used to get the audio embeddings from input features (a log mel spectrogram), meaning inferring the audio encoder and the multi-modal projector.
Args:
input_features (`torch.FloatTensor`):
Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
`numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
`input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`]
input_features_mask (`torch.Tensor` of shape `(batch_size, feature_sequence_length)`):
Mask to avoid performing attention on padded feature indices.
Returns:
`torch.FloatTensor`:
The audio embeddings.
"""
# Encode audio
encoder_output = self.audio_tower(input_features, input_features_mask=input_features_mask)
audio_embeds = self.multi_modal_projector(encoder_output.last_hidden_state)
# Mask according to avg pooling (which is after attention blocks)
post_lengths = (input_features_mask.sum(-1) - 2) // 2 + 1
valid_mask = torch.arange(audio_embeds.shape[1], device=post_lengths.device)[None, :] < post_lengths[:, None]
audio_embeds = audio_embeds[valid_mask.to(audio_embeds.device)]
return audio_embeds
def get_audio_embeds(self):
raise NotImplementedError("This method is not supported for AudioFlamingo3ForConditionalGeneration.")
@can_return_tuple
@auto_docstring
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
input_features: Optional[torch.FloatTensor] = None,
input_features_mask: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs: Unpack[TransformersKwargs],
) -> CausalLMOutputWithPast:
r"""
input_features_mask (`torch.Tensor` of shape `(batch_size, feature_sequence_length)`):
Mask to avoid performing attention on padding feature indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Example:
```python
>>> from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
>>> model_id = "nvidia/audio-flamingo-3-hf"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
>>> conversations = [
>>> [
>>> {
>>> "role": "user",
>>> "content": [
>>> {"type": "text", "text": "Transcribe the input speech."},
>>> {
>>> "type": "audio",
>>> "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav",
>>> },
>>> ],
>>> }
>>> ],
>>> [
>>> {
>>> "role": "user",
>>> "content": [
>>> {
>>> "type": "text",
>>> "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
>>> },
>>> {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
>>> ],
>>> }
>>> ],
>>> ]
>>> inputs = processor.apply_chat_template(
>>> conversations,
>>> tokenize=True,
>>> add_generation_prompt=True,
>>> return_dict=True,
>>> ).to(model.device)
>>> outputs = model.generate(**inputs, max_new_tokens=500)
>>> decoded_outputs = processor.batch_decode(
>>> outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
>>> )
>>> print(decoded_outputs)
["The spoken content of the audio is...", "The track's calming and meditative feel can be attributed to..."]
```"""
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
if input_features is not None and input_ids is not None:
audio_embeds = self.get_audio_features(input_features, input_features_mask)
# replace text-audio token placeholders with audio embeddings
audio_token_mask = (input_ids == self.config.audio_token_id).unsqueeze(-1)
inputs_embeds = inputs_embeds.masked_scatter(
audio_token_mask.to(inputs_embeds.device), audio_embeds.to(inputs_embeds.device)
)
outputs: CausalLMOutputWithPast = self.language_model(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
labels=labels,
use_cache=use_cache,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
return outputs
def prepare_inputs_for_generation(self, *args, **kwargs):
# Overwritten -- we should not pass input_features when we are in cached decoding stage
input_features = kwargs.pop("input_features", None)
input_features_mask = kwargs.pop("input_features_mask", None)
cache_position = kwargs.get("cache_position")
model_inputs = super().prepare_inputs_for_generation(*args, **kwargs)
if cache_position is not None and cache_position[0] == 0:
# input_features should only be passed when we are not in cached decoding stage
if input_features is not None:
model_inputs["input_features"] = input_features
if input_features_mask is not None:
model_inputs["input_features_mask"] = input_features_mask
return model_inputs
__all__ = ["AudioFlamingo3ForConditionalGeneration", "AudioFlamingo3PreTrainedModel", "AudioFlamingo3Encoder"]

Some files were not shown because too many files have changed in this diff Show More