Release: v4.57.0

v4.57.0 Branch (#41310)
* Update expected values for one more `test_speculative_generation` after #40949 (#40967) fix Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * FIX(trainer): ensure final checkpoint is saved when resuming training (#40347) * fix(trainer): ensure final checkpoint is saved when resuming training * add test * make style && slight fix of test * make style again * move test code to test_trainer * remove outdated test file * Apply style fixes --------- Co-authored-by: rangehow <rangehow@foxmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * Add new model LFM2-VL (#40624) * Add LFM2-VL support * add tests * linting, formatting, misc review changes * add siglip2 to auto config and instantiate it in lfm2-vl configuration * decouple image processor from processor * remove torch import from configuration * replace | with Optional * remove layer truncation from modeling file * fix copies * update everything * fix test case to use tiny model * update the test cases * fix finally the image processor and add slow tests * fixup * typo in docs * fix tests * the doc name uses underscore * address comments from Yoni * delete tests and unsuffling * relative import * do we really handle imports better now? 
* fix test * slow tests * found a bug in ordering + slow tests * fix copies * dont run compile test --------- Co-authored-by: Anna <anna@liquid.ai> Co-authored-by: Anna Banaszak <48625325+ankke@users.noreply.github.com> * Fix outdated version checks of accelerator (#40969) * Fix outdated version checks of accelerator Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix outdated version checks of accelerator Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Use `skip_predictor=True` in vjepa2 `get_vision_features` (#40966) use skip_predictor in vjepa2 `get_vision_features` * [Trainer] Fix DP loss (#40799) * fix * style * Fix fp16 * style --------- Co-authored-by: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com> * [timm_wrapper] better handling of "Unknown model" exception in timm (#40951) * fix(timm): Add exception handling for unknown Gemma3n model * nit: Let’s cater to this specific issue * nit: Simplify error handling * Fix Issue #39030: AutoTokenizer.from_pretrained does not propagate token (#40956) * fix merge conflicts * change token typing --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-27-253.ec2.internal> * [tests] Really use small models in all fast tests (#40945) * start * xcodec * chameleon * start * layoutlm2 * layoutlm * remove skip * oups * timm_wrapper * add default * doc * consistency * Add captured actual outputs to CI artifacts (#40965) * fix * fix * Remove `# TODO: ???` as it make me `???` * fix * fix * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Revert change in `compile_friendly_resize` (#40645) fix * Track the CI (model) jobs that don't produce test output files (process being killed etc.) 
(#40981)
  * fix
  * fix
  ---------
  Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Remove `set_model_tester_for_less_flaky_tests` (#40982)
  remove
* Benchmarking v2 GH workflows (#40716)
  * WIP benchmark v2 workflow
  * Container was missing
  * Change to sandbox branch name
  * Wrong place for image name
  * Variable declarations
  * Remove references to file logging
  * Remove unnecessary step
  * Fix deps install
  * Syntax
  * Add workdir
  * Add upload feature
  * typo
  * No need for hf_transfer
  * Pass in runner
  * Runner config
  * Runner config
  * Runner config
  * Runner config
  * Runner config
  * mi325 caller
  * Name workflow runs properly
  * Copy-paste error
  * Add final repo IDs and schedule
  * Review comments
  * Remove wf params
  * Remove parametrization from workflow files
  * Fix callers
  * Change push trigger to pull_request + label
  * Add back schedule event
  * Push to the same dataset
  * Simplify parameter description
* ENH: Enable readline support for transformers chat (#40911)

  ENH Enable readline support for chat

  This small change enables GNU readline support for the transformers chat command. This includes, among others:

  - advanced navigation and editing: ctrl + a, ctrl + e, alt + b, alt + f, ctrl + k, alt + d, etc.
  - navigate and search history: arrow up/down, ctrl + p, ctrl + n, ctrl + r
  - undo: ctrl + _
  - clear screen: ctrl + l

  Implementation

  Although it may look strange, just importing readline is enough to enable it in Python; see https://docs.python.org/3/library/functions.html#input. As readline is not available on some platforms (https://docs.python.org/3/library/readline.html), the import is guarded. Readline should work on Linux, macOS, and with WSL; I'm not sure about Windows, though. Ideally, someone can give it a try. It's possible that Windows users would have to install pyreadline3 (https://pypi.org/project/pyreadline3/).
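The guarded import described in the readline entry above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `HAS_READLINE` flag and the `describe_line_editing` helper are hypothetical names introduced here only to make the effect observable.

```python
# Minimal sketch of a guarded readline import (assumed shape, not the PR's code).
# Importing readline is enough: once the module is loaded, the built-in input()
# uses it for line editing and history.
try:
    import readline  # noqa: F401  (imported only for its side effect on input())

    HAS_READLINE = True  # hypothetical flag, for illustration only
except ImportError:
    # readline is not available on every platform, so fall back silently.
    HAS_READLINE = False


def describe_line_editing() -> str:
    """Hypothetical helper: report whether input() gets readline features."""
    return "readline enabled" if HAS_READLINE else "plain input()"


if __name__ == "__main__":
    print(describe_line_editing())
```

Because the import is wrapped in `try`/`except ImportError`, the same code path runs on platforms without readline; the only difference is that `input()` then offers no history or editing shortcuts.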
* [testing] test `num_hidden_layers` being small in model tester (#40992) fix Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * blt wip (#38579) * blt wip * cpu version * cpu friendly with full entropy model (real time patching) * adding config file instead of args file * enable MPS * refactoring unused code * single config class in config file * inherit from PreTrainedModel * refactor LMTransformer --> BLTPatcher * add conversion script * load from new checkpoing with form_pretrained * fixed demo from_pretrained * clean up * clean a few comments * cleanup folder * clean up dir * cleaned up modeling further * rename classes * adding transformers Attention class and RotaryEmbedding class * exchanged blt modules for transformers modules: attention, rotary_emb, create_causal_mask, etc * seperate out patcher config, update modeling and conversion script * rename vars to be more transformers-like * rm unused functions * adding cross attention from transformers * pass arg * rename weights * updated conversion script * overwritten commit! 
fixing PR * apply feedback * adding BLTRMSNorm like Llama * add repeat_kv and eager_attention_forward copied from * BLTMLP identical to MllamTextMLP * clean up some args' * more like mllama, but busier inits * BLTTransformerLayer config * decoder, encoder, global configs * wip working on modular file * cleaning up patch and configs * clean up patcher helpers * clean up patcher helpers further * clean up * some config renaming * clean up unused configs * clean up configs * clean up configs * update modular * clean * update demo * config more like mllama, seperated subconfigs from subdicts * read from config instead of self args * update demo file * model weights to causal lm weights * missed file * added tied weights keys * BLTForCausalLM * adding files after add-new-model-like * update demo * working on tests * first running integration tests * added integration tests * adding tokenization tests, integration tests, and cleaned up tokenization file, + ruff * tokenizer clean up * modular file * fixing rebase * ruff * adding correct basemodel output and updating config with checkpoint vals (for testing) * BLTModelTests git status * enabling inputs_embeds, although won't be equal to input_ids since need ids for patching logic * fix sdpa == causal tests * fix small model test and some gradient checkpointing * skip training GC tests * fix test * updated modular * update modular * ruff * adding modular + modeling * modular * more modern is_casual check * cleaning up modular * more modular reduction * ruff * modular fix * fix styling * return 2 * return 2 * fix some tests * fix bltcrossattention after modular break * some fixes / feedback * try cache generate fix * try cache generate fix * fix generate tests * attn_impl workaround * refactoring to use recent TransformersKwargs changes * fix hidden_states shape test * refactor to new outputs * simplify outputs a bit * rm unneeded decoderlayer overwriting * rename blt * forgot tokenizer test renamed * Reorder * Reorder * 
working on modular * updates from modular * new modular * ruff and such * update pretrainedmodel modular * using cohere2 apply_rotary_pos_emb * small changes * apply feedback r2 * fix cross_attention * apply more feedback * update modeling fix * load submodules from pretrainedmodel * set initializer_range to subconfigs * rm cross_attnetion_states pass when not needed * add 7b projection layer support * check repo * make copies * lost cohere2 rotate_half * ruff * copies? * don't tie weights for submodules * tie weights setting * check docstrings * apply feedback * rebase * rebased modeling * update docs * applying feedback * few more fixes * fix can_record_outputs * fast tokenizer * no more modulelist * tok auto * rm tokenizersss * fix docs * ruff * fix after rebase * fix test, configs are not subscriptable --------- Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-168-30.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-161-103.ec2.internal> Co-authored-by: Lysandre <hi@lysand.re> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-174-36.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-45.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-173-121.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-160-103.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-161-178.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-162-79.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-169-239.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-167-111.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-160-100.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-161-153.ec2.internal> 
Co-authored-by:
ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-168-34.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-166-68.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-167-175.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-170-160.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-168-95.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-172-73.ec2.internal> * [`RMSNorm`] Fix rms norm init for models that center around 1 (#40796) * fix * fixup inits * oops * fixup gemma * fixup modular order * how does this keep happen lol * vaultgemma is new i forgot * remove init check * Make `EfficientLoFTRModelTest` faster (#41000) * fix * fix * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix typoes in src and tests (#40845) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix more dates in model cards and wrong modalities in _toctree.yml (#40955) * Fix model cards and modalities in toctree * fix new models * RUFF fix on CI scripts (#40805) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * fix dict like init for ModelOutput (#41002) * fix dict like init * style * [tests] update `test_left_padding_compatibility` (and minimize overwrites) (#40980) * update test (and overwrites) * better test comment * 0 as a default for * Patch more `unittest.case.TestCase.assertXXX` methods (#41008) fix Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * 🚨 [lightglue] fix: matches order changed because of early stopped indices (#40859) * fix: bug that made early stop change order of matches * fix: applied code suggestion Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com> * fix: applied code suggestion to modular * fix: integration tests --------- Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com> * Fix `PhimoeIntegrationTest` (#41007) * fix * fix * fix * fix * fix 
--------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix Glm4v test (#41011) fix * Update after #41007 (#41014) * fix * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix benchmark runner argument name (#41012) * Adding support for Qwen3Omni (#41025) * Add Qwen3Omni * make fix-copies, import properly * nit * fix wrong setup. Why was audio_token_id renamed ? * upds * more processing fixes * yup * fix more generation tests * down to 1? * fix import issue * style, update check repo * up * fix quality at my best * final quality? * fix doc building * FINAL COMMIT: SKIP IMPORTANT BUT FAILING TESTS FOR MERGE * SKIP THE TEMPLATE ONE --------- Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com> Co-authored-by: Arthur <arthur.zucker@gmail.com> * Making compute_loss_func always take priority in Trainer (#40632) * logger warn, if-else logic improved * redundant if condition fix * Modify Qwen3Omni parameter name since VL changed it (#41045) Modify parameter name since VL changed it Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com> * Fix Qwen video tests (#41049) fix test * [testing] Fix `qwen2_audio` (#41018) * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix typing of tuples (#41028) * Fix tuple typing Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * More fixes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * More fixes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Remove optax (#41030) Remove optax dep Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix typos in English/Chinese documentation (#41031) * Fix typos and formatting in English docs Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix typos and formatting in Chinese docs Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen 
<cyyever@outlook.com> * Use torch.autocast (#40975) * Use torch.autocast Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Format code Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * docs: improved RoPE function Docstrings (#41004) * docs: improved RoPE functuon docstrings * Update src/transformers/modeling_rope_utils.py Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> --------- Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> * Fix condition for emitting warning when generation exceeds max model length (#40775) correct warning when generation exceeds max model length Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com> * Fix outdated torch version check (#40925) Update torch minimum version check to 2.2 Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Add Whole Word Masking and Padding Strategy to DataCollatorForLanguageModeling (#39485) * Add whole word masking * Vectorize whole word masking functions * Unit test whole word masking * Remove support for TF in whole word masking * [testing] Fix `seed_oss` (#41052) * fix * fix * fix * fix * fix * fix * Update tests/models/seed_oss/test_modeling_seed_oss.py Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * Remove repeated import (#40937) * Remove repeated import Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix conflict Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Simplify unnecessary Optional typing (#40839) Remove Optional Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Add write token for uploading benchmark results to the Hub (#41047) * Separate write token for Hub upload * Address review comments * Address review comments * Ci utils (#40978) * Add CI 
reports dir to gitignore * Add utils to run local CI * Review compliance * Style * License * Fix CI jobs being all red 🔴 (false positive) (#41059) fix Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Update quantization CI (#41068) * fix * new everything * fix * [i18n-bn] Add Bengali language README file (#40935) * [i18n-bn] Add Bengali language README file and update links in existing language files * Update Bengali README for clarity and consistency in model descriptions * Improve documentation and errors in Mamba2-based models (#41063) * fix bug in Mamba2 docs * correct 'because on of' issue * link to other Mamba2 model types * github URL is not changed * update error message in generated files * Update team member list for some CI workflows (#41094) * update list * update list --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * fix crash when using chat to send 2+ request to gptoss (#40536) Signed-off-by: Wang, Yi <yi.a.wang@intel.com> * Minor addition, no split modules for VideoMAEE (#41051) * added no split modules * fixed typo --------- Co-authored-by: Raushan Turganbay <raushan@huggingface.co> * Switch to `python:3.10-slim` for CircleCI docker images (#41067) fix Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix argument name in benchmarking script (#41086) * Fix argument name in benchmarking script * Adjust vars * Fix typos in documentation (#41087) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix typing (#40788) * Fix optional typing Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix optional typing Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix schema typing Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix typing * Fix typing * Fix typing * Fix typing * Use np.ndarray Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix typing Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Format code Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Use np.ndarray Signed-off-by: 
Yuanyuan Chen <cyyever@outlook.com> * Improve typing Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix quote string of np.ndarray Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * More fixes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix code * Format Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Remove unused arguments (#40916) * Fix unused arguments Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * More fixes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * fix wrong height and width when read video use torchvision (#41091) * docs: Fix Tool Use links and remove dead RAG links (#41104) docs: Fix tool use links. Remove dead RAG links. Fix style * [tests] gpt2 + `CausalLMModelTester` (#41003) * tmp commit * tmp commit * tmp commit * rm old GPT2ModelTester * nit bug * add facilities for encoder-decoder tests; add comments on ALL overwrites/extra fns * vision_encoder_decoder * Fix `_get_test_info` for inherited tests (#41106) * fix _get_test_info * fix patched * add comment * ruff --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Remove bad test skips (#41109) * remove bad skips * remove more * fix inits * Format empty lines and white space in markdown files. 
(#41100) * Remove additional white space and empty lines from markdown files Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Add empty lines around code Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Update ruff to 0.13.1 + target Python 3.10 + apply fixes (#37809) Update ruff to 0.13.1 target it to Python 3.10 and apply its fixes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> * Support loading LFM2 GGUF (#41111) * add gguf config mapping for lfm2 * add lfm2 tensor process to unsqueeze conv weights * adjust values from gguf config to HF config * add test for lfm2 gguf * ruff --------- Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * [torchao safetensors] integrate torchao safetensors support with transformers (#40735) * enable torchao safetensors * enable torchao safetensors support * add more version checking * [Qwen3-next] Fix dimension mismatch in torch_chunk_gated_delta_rule and torch_recurrent_gated_delta_rule (#40963) (#41036) * fix mismatched dims for qwen3 next * propagate changes * chore: renamed tot_heads to total_sequence_length * Apply suggestion from @vasqu Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * minor fix to modular qwen3 next file --------- Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * Fix the error where a keyword argument appearing before *args (#41099) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix broken `` expressions in markdown files (#41113) Fix broken expressions in markdown files Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Remove self-assignment (#41062) * Remove self-assignment Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Update src/transformers/integrations/flash_paged.py Co-authored-by: Matt <Rocketknight1@users.noreply.github.com> * Clear pass Signed-off-by: Yuanyuan Chen 
<cyyever@outlook.com> * Clear pass Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Clear pass Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> Co-authored-by: Matt <Rocketknight1@users.noreply.github.com> * Fixed MXFP4 model storage issue (#41118) * Fixed loading LongT5 from legacy checkpoints (#40724) * Fixed loading LongT5 from legacy checkpoints * Adapted the fix to work with missing lm_head * dummy commit (#41133) * dummy commit, nothing interesting * dummy commit, nothing interesting * dummy commit, nothing interesting * dummy commit, nothing interesting --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix loading logic flaw with regards to unexpected and missing keys (#40850) * Unexpected keys should be ignored at load with device map * remove them all * fix logic flaw * fix * simplify * style * fix * revert caching allocator change * add other test * add nice doc --------- Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com> * Fix: align Qwen2.5-VL inference rope index with training by passing s… (#41153) Fix: align Qwen2.5-VL inference rope index with training by passing second_per_grid_ts * Fix single quotes in markdown (#41154) Fix typos Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * extend gemma3n integration ut cases on XPU (#41071) Signed-off-by: Yao, Matrix <matrix.yao@intel.com> * Add Parakeet (#39062) * first commit Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * update to handle masking for bs>1 Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * Add tests and docs Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * update model ids Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * update docs and improve style Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * update librosa location Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * import guard torch too Signed-off-by: nithinraok 
<nithinrao.koluguri@gmail.com> * ruff code checks fix Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * ruff format check Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * updated to parakeet names Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * update script Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * Add tokenizer decoding Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * Remove other model dependency Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * clean tests Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * fix tests Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * linting Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * fix ruff lint warnings Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * move to seperate folders Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * add parakeet ctc model code Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * simplify encoder structure Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * update documentation Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * add parakeet to toctree Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * fix tests Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * add parakeet doc Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * Address comments Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * Update featurizer to compute lens directly Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * fix ruff tests Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * fix encoding format Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * fix minor ctc decoding Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * revert modular_model_converter.py changes * revert check_config_attributes.py changes * refactor: fastconformer & parakeet_ctc -> parakeet * modeling update * test update * propagate feature extractor updates * 
propagate doc changes * propagate doc changes * propagate tokenization changes * propagate conversion changes * remove fastconformer tests * remove modular * update processor * update processor * tset update * diverse fixes * 100% macthing greedy batched * Update conversion script. * Refactor docs. * Reafactor auto loading. * Refactor and fix tokenization and processing. * Update integration test. * Modeling fixes: - ensure correct attention mask shape - ensure layer drop returns valid output - correct blank token ID when computing CTC loss * Format and repo consistency. * Update model doc. * Fix feature extraction tests. * Fix (most) tokenizer tests. * Add pipeline example. * Fixes * Use eager_attention_forward from Llama. * Small tweaks. * Replace Sequential with ModuleList * Add check if not all layers copied * Clean tokenizer. * Standardize FastSpeech2ConformerConvolutionModule for Parakeet. * Switch to modular for modeling and processing. * Add processor tests. * Fix modeling tests. * Formating and docstrings. * Add `return_attention_mask` like other feature extractors. * clean up after merging main. * nits on modeling * configuration update * nit * simplification: use PretrainedTokenizerFast, simplify processor * add dtype arg to mel_filter_bank * feature extraction: simplify! 
* modeling update * change to ParakeetTokenizerFast * correct attention mask handling * auto update * proc update * test update * feature extraction fixes * modeling update * conversion script update * udpate tests feature integration * update tokenization and tests * processor tests * revert audio_utils * config docstring update * blank_token -> pad_token * modeling udpate * doc update * fix tests * fix test * fix tests * address review comments * add comment * add comment * explicitly not support flash * atttention straightforward masking * fix * tokenizer update: skipping blank tokens by default * doc update * fix max_positions_embeddings handling * nits * change atol faeture extraction integration tests * doc update + fix loss * doc update * nit * update integration test for A10 * repo id name * nit --------- Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com> Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com> Co-authored-by: Eric B <ebezzam@gmail.com> * Fix format of compressed_tensors.md (#41155) * Fix table format Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix format Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Simplify and improve model loading logic (#41103) * remove unexpected keys from inputs (they have nothing to do there) * remove input * simplify a lot init * fix * fix check for non-persistent buffer * revert because too many old and bad models... 
* remove comment * type hint * make it a real test * remove model_to_load -> always use the same model * typo * remove legacy offload_folder (we never waste that memory anymore) * do not change prefix anymore * change very bad function name * create adjust method * remove useless method * restrict * BC * remove unused method * CI * remove unused args * small fix * fix * CI * CI * avoid too many loops * fix regex * cleaner * typo * fix * fix * Force new vision models addition to include a fast image processor (#40802) * add test * fix test and change cutoff date * Add documentation to test * Add language specifiers to code blocks of markdown files (#41114) * Add language specifiers to code blocks of markdown files Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Update docs/source/en/model_doc/qwen3_omni_moe.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/chat_templating_writing.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/chat_templating_writing.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/chat_templating_writing.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * More fixes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Update nemotron.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update phimoe.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update README.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Fix syntax error Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Improve `add_dates` script (#41167) * utils/add_dates.py * put lfm2-vl in correct category * Fix flash-attn for paged_attention when no kernels (#41078) * Fix non-kernels flash attention 
paged implementation * Cover all cases * Style * Update src/transformers/integrations/flash_paged.py Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com> * Apply style fixes --------- Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * Remove data from examples (#41168) Remove telemetry * Enable fa in amd docker (#41069) * Add FA to docker * Use caching mechanism for qwen2_5 * Fix a typo in important models list * Partial fixes for gemma3 * Added a commit ID for FA repo * Detailed the expectation storage format * Rebase fix * Apply style fixes --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * handle flash slow tests (#41072) * handle flash slow tests * update patch mask to 1/0 for flash * don't skip flash * flash * raise tols * rm flash support :( * nits --------- Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-173-7.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-171-230.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-168-95.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-166-214.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-163-147.ec2.internal> * Modernbert fix (#41056) * Add FA to docker * Fixed padding for modernbert * Fixed logits and hidden states extraction in ModernBertForMultipleChoice * Added a test for ModernBertForMultipleChoice * fixes * More fixes and GREEN CI * consistency * moar consistency * CI Runners - move amd runners mi355 and 325 to runner group (#41193) * Update CI workflows to use devmi355 branch * Add workflow trigger for AMD scheduled CI caller * Remove unnecessary blank line in workflow YAML * Add trigger for workflow_run on main branch * Update workflow references from devmi355 to
main * Change runner_scale_set to runner_group in CI config * [XPU] Add MXFP4 support for XPU (#41117) * XPU supports gpt-oss MXFP4 * Complete MXFP4 UT file and comment information * Complete MXFP4 UT file and comment information * Fix code style * Fix code style --------- Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * [tests] `CausalLMTester` automatically infers other test classes from `base_model_class` 🐛 🔫 (#41066) * halfway through the models * update test checks * refactor all * another one * use tuples * more deletions * solve bad inheritance patterns * type * PR ready? * automatic model class inference from the base class * vaultgemma * make fixup * make fixup * rebase with gpt2 * make fixup :'( * gpt2 is special * More typing fixes (#41102) * Fix noqa Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * fix typing Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Use np.ndarray Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * More fixes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * remove noqa Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix chars Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * More fixes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * enable flex attention ut cases on XPU (#40989) * enable flex attention ut cases on XPU Signed-off-by: Yao, Matrix <matrix.yao@intel.com> * fix style Signed-off-by: Yao, Matrix <matrix.yao@intel.com> --------- Signed-off-by: Yao, Matrix <matrix.yao@intel.com> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * fix(trainer): Avoid moving model with device_map (#41032) * fix(trainer): Avoid moving model with device_map When a model is loaded with `device_map="auto"` and is too large to fit on a single GPU, `accelerate` will offload some layers to the CPU or disk. 
The `Trainer` would previously attempt to move the entire model to the specified device, causing a `RuntimeError` because a model dispatched with `accelerate` hooks cannot be moved. This commit fixes the issue by adding a check in `_move_model_to_device` to see if the model has an `hf_device_map` attribute. If it does, the device placement is assumed to be handled by `accelerate`, and the `model.to(device)` call is skipped. A regression test is added to ensure the `Trainer` can be initialized with a model that has an `hf_device_map` that simulates offloading without raising an error. * Added the logger warning for the move model --------- Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com> * Fix attention sink implementation in flex attention (#41083) * Fix attention sink implementation in flex attention * fix dim * fix * Remove print * raise error when return_lse is False yet s_aux is provided * Clean test files for merge * Update src/transformers/integrations/flex_attention.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * force return lse * Add to doc --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Separate docker images for Nvidia and AMD in benchmarking (#41119) Separate docker images for Nvidia and AMD * Make quantizers good citizens loading-wise (#41138) * fix param_needs_quantization * rewrite most hqq * clean * fix * comment * remove it from exception of safetensors * start on bnb 4bits * post-rebase fix * make bnb4 bit a good citizen * remove forgotten print * make bnb 8bits a good citizen * better hqq * fix * clean * remove state dict from signature * switch method * make torchao a good citizen * fixes * fix torchao * add check * typo * [`Kernels Attention`] Change fallback logic to error out on explicit kernels request and include FA3 (#41010) * fix * be more strict * change logic to include fa3 * fix the case where nothing is requested *
modify old tests + add kernels related tests * style * Add EdgeTAM (#39800) * initial comment * test * initial conversion for outline * intermediate commit for configuration * chore:init files for sam2 * adding arbitrary undefined config * check * add vision * make style * init sam2 base model * Fix imports * Linting * chore:sam to sam2 classes * Linting * Add sam2 to models.__init__ * chore:match prompt encoder with sam2 code * chore:prepare kwargs for mask decoder * Add image/video predictors * Add CUDA kernel * Add output classes * linting * Add logging info * tmp commit * docs for sam2 * enable image processing * check difference of original SAM2 - difference is the order of ToTensor() - please see https://pytorch.org/vision/main/_modules/torchvision/transforms/functional.html#resize * enable promptencoder of sam2 * fix promptencoder * Confirmed that PromptEncoder is exactly same (Be aware of bfloat16 and float32 difference) * Confirmed that ImageEncoder is exactly same (Be aware the linting of init) * Confirmed that MaskDecoder is exactly same (TO DO: lint variable name) * SamModel is now available (Need more chore for name) * make fix-copies * make style * make CI happy * Refactor VisionEncoder and PositionEmbedding * TO DO : fix the image_embeddings and sparse_embeddings part * pure image inference done * reusable features fix and make style * styling * refactor memoryattention * tmp * tmp * refactor memoryencoder TO DO : convert and inference the video pipeline * TO DO : fix the image_encoder shape * conversion finish TO DO: need to check video inference * make style * remove video model * lint * change * python utils/check_docstrings.py --check_all * python utils/check_config_attributes.py * remove copies for sam2promptencoder due to configuration * change __init__.py * remove tensorflow version * fix that to not use direct comparison * make style * add missing import * fix image_embedding_size * refactor Sam2 Attention * add fully working video inference
(refactoring todo) * clarify _prepare_memory_conditioned_features * simplify modeling code, remove unused paths * use one model * use auto_docstring * refactor rope embeddings * nit * not using multimask when several points given * add all sam2.1 * add video tmp * add Sam2VideoSessionState + fast image proc + video proc * remove init_states from model * fix batch inference * add image integration tests * uniformize modeling code with other sam models and use modular * pass vision tests and most model tests * All tests passing * add offloading inference state and video to cpu * fix inference from image embedding and existing mask * fix multi_boxes mask inference * Fix batch images + batch boxes inference * improve processing for image inference * add support for mask generation pipeline * add support for get_connected_components post processing in mask generation * add fast image processor sam, image processor tests and use modular for sam2 image processor * fix mistake in sam after #39120 * fix init weights * refactor convert * add integration tests for video + other improvements * add needed missing docstrings * Improve docstrings and * improve inference speed by avoiding cuda sync * add test * skip test for vision_model * minor fix for vision_model * fix vision_model by adding sam2model and change the torch dependencies * remove patch_size * remove image_embedding_size * fix patch_size * fix test * make style * Separate hieradet and vision encoder in sam2 * fixup * review changes part 1 * remove MemoryEncoderConfig and MemoryAttentionConfig * pass q_stride instead of q_pool module * add inference on streamed videos * explicitly process streamed frames * nit * Improve docstrings in Sam2Model * update sam2 modeling with better management of inference state and cache, and separate Sam2Model and Sam2VideoModel * improve video inference api * change inference_state to inference_session * use modular for Sam2Model * fix convert sam2 hf * modular * Update
src/transformers/models/sam2/video_processing_sam2.py Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com> * fix minor config * fix attention loading error * update modeling tests to use hub checkpoints * Use CI A10 runner for integration tests values + higher tolerance for video integration tests * PR review part 1 * fix doc * nit improvements * enforce one input format for points, labels and boxes * nit * last few nits from PR review * fix style * fix the input type * fix docs * add sam2 model as conversion script * improve sam2 doc * add rough necessary changes * first working edgetam * fix issue with object pointers * Use modular as much as possible * nit fixes + optimization * refactor spatial perceiver * cleanup after merge * add working edgetam * improve perceiver resampler code * simplify/unify rope attention logic * Improve comments in apply_rotary_pos_emb_2d * add working tests * fix test timmwrapper * add docs * make fixup * nits * fix modular * fix modular * PR review part 1 * split apply_rotary_pos_emb_2d * add granularity to _prepare_memory_conditioned_features * add dates to doc * add separate mlp for memory attention * Fix memory on wrong device * store processed frames in dict * update checkpoints in tests * update dates --------- Co-authored-by: sangbumchoi <danielsejong55@gmail.com> Co-authored-by: RUFFY-369 <prakarshkaushik369@gmail.com> Co-authored-by: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com> Co-authored-by: Haitham Khedr <haithamkhedr@meta.com> Co-authored-by: sangbum choi <sangbumchoi@sangbumui-MacBookAir.local> Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com> * Fix EXAONE-4.0 dummy id (#41089) * Fix EXAONE-4.0 dummy id * Fix exaone4 dummy (#1) * fix * fix * fix * fix * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> --------- Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix 8bit bnb loading (#41200) *
Fix 8bit * oups forgot the case where it is not prequantized * Fix docker quantization (#41201) * launch docker * remove gptq for now * run tests * Revert "run tests" This reverts commit f85718ce3a21d5937bf7405b8925c125c67d1a3e. * revert * Embed interactive timeline in docs (#41015) * embed timeline in docs (test web component and iframe) * test scaling * test multiple scales * compensate scale in width * set correct style and scale * remove bottom space created by scale * add timeline as a separate page * reformulate docs after review * [docs] Fix links (#41110) fix * Remove unnecessary Optional typing (#41198) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * docs/examples(speech): pin CTC commands to Hub datasets; add Windows notes (#41027) * examples(speech): load Common Voice from Hub; remove deprecated dataset-script references (Windows-friendly notes) * docs/examples(speech): pin CTC streaming & other CTC commands to Hub datasets; add Windows notes * make style * examples(speech): align DataTrainingArguments help with datasets docs; minor wording fixes * docs/examples(speech): address review remove Hub subsection & Whisper tip; align dataset help text * style: apply ruff/black/usort/codespell on examples/speech-recognition * Apply style fixes * Update examples/pytorch/speech-recognition/README.md * update doc to match load_dataset --------- Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com> Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * Fix Qwen3-Omni audio_token_id serialization issue (#41192) Fix Qwen3-Omni audio_token_id serialization by overriding parent's attribute_map - Override attribute_map in Qwen3OmniMoeThinkerConfig to prevent inheritance of incorrect mapping - Parent class maps audio_token_id -> audio_token_index, but implementation uses audio_token_id directly - Fixes issue where custom audio_token_id values were not preserved during
save_pretrained/from_pretrained cycles Fixes #41191 * Wait for main process in _save_checkpoint to ensure best checkpoint exists (#40923) * Update trainer.py * fix * fix format * move barrier, delete redundant * Avoid assumption that model has config attribute in deepspeed (#41207) Avoid assumption that model has config in deepspeed * Trainer: Pass `num_items_in_batch` to `compute_loss` in `prediction_step` (#41183) * Add num_items_in_batch computation to predict_step. * address comments. * Fix test cases. * fixup --------- Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * [ESM] add accepts_loss_kwargs=False to EsmPreTrainedModel (#41006) add accepts_loss_kwargs=False to EsmPreTrainedModel Signed-off-by: Peter St. John <pstjohn@nvidia.com> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * Align pull request template to bug report template (#41220) The only difference is that I don't direct users to https://discuss.huggingface.co/ for hub issues. * [generate] cache missing custom generate file (#41216) * cache missing custom generate file * make fixup * Remove old Python code (#41226) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Adapt to the SDPA interface to enable the NPU to call FlashAttentionScore (#41143) Adapt to the SDPA interface to enable the NPU to call FlashAttentionScore.
Co-authored-by: frozenleaves <frozen@Mac.local> * update code owners (#41221) Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Unify is_torchvision_v2_available with is_torchvision_available (#41227) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix typing of train_args (#41142) * Fix typing Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix fsdp typing Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix sliding window attn mask (#41228) * Fix sliding window attn mask * Clearer test * Apply style fixes * If Picasso made ascii drawings he would have made this --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * Revert "Fix DeepSpeed mixed precision precedence over Accelerate defaults" (#41124) * Revert "Fix DeepSpeed mixed precision precedence over Accelerate defaults (#3…" This reverts commit df67cd35f0ca1a1cbf7147b2576db31b16200cf4. * fix * [docs] Fix tp_plan (#41205) remove manual * Fix white space in documentation (#41157) * Fix white space Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Revert changes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix autodoc Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * fix qwen text config (#41158) * fix qwen text config * fix tests * fix one more test * address comments * Video processor accepts single frames on cuda (#41218) * fix * why was it np if input is in torch * Use math.log2 (#41241) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * fix TrainerIntegrationDeepSpeed UT failures (#41236) Signed-off-by: Yao, Matrix <matrix.yao@intel.com> * [repo utils] Update `models_to_deprecate.py` (#41231) * update models_to_deprecate * exclude this file * handle typos and aliases * don't commit files * PR suggestions; make fixup * Use removeprefix and removesuffix (#41240) * Use removeprefix and removesuffix
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * More fixes Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix pylint warnings (#41222) * Remove unused variables Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Remove reimported packages Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix code Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix pylint warnings Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Simplify Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Remove all instances of `is_safetensors_available` (#41233) * safetensors is a core dep * fix * ok * simplify branching * keep it for now --------- Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com> * FP-Quant NVFP4 and Python 3.9 support (#39876) * quartet * quartet qat -> quartet * format * bf16 backward * interfaces * forward_method * quartet -> fp_quant * style * List -> list * list typing * fixed format and annotations * test_fp_quant * docstrings and default dtypes * better docstring and removed noop checks * docs * pseudoquantization support to test on non-blackwell * pseudoquant * Pseudoquant docs * Update docs/source/en/quantization/fp_quant.md Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * Update docs/source/en/quantization/fp_quant.md * Update docs/source/en/quantization/fp_quant.md * Update src/transformers/utils/quantization_config.py Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com> * Update tests/quantization/fp_quant_integration/test_fp_quant.py Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com> * Update tests/quantization/fp_quant_integration/test_fp_quant.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * small test fixes * dockerfile update * spec link * removed `_process_model_after_weight_loading` * toctree * nvfp4 * nvfp4 
tests * FP-Quant version bumped * nvfp4 default and docs update * trainable * cpu if pseudoquant * proper group size selection * gsr * qutlass requirement version bump * Upstream docker copy * docs update --------- Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com> * [`FA3`] Fix masking and loading logic in same process (#41217) fix loading and fa3 masking * [t5gemma] fix `get_text_config` and related fixes (#40939) * tmp commit * t5gemma fixes * Don't convert to `safetensors` on the fly if the call is from testing (#41194) * don't convert * disable * Update src/transformers/modeling_utils.py Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co> * fix * disable * disable * disable --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co> * Resolve remote custom module path warnings (#41243) * add peft team members to issue/pr template (#41262) * add * Update .github/PULL_REQUEST_TEMPLATE.md Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com> --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com> * docs: update bitsandbytes platform support (#41266) * add more activation kernels, follow up (#40944) * add more activation kernels * fixing style * fix version * fix asr pipeline ut failures (#41275) * fix asr pipeline ut failures Signed-off-by: Yao, Matrix <matrix.yao@intel.com> * make style Signed-off-by: Yao, Matrix <matrix.yao@intel.com> --------- Signed-off-by: Yao, Matrix <matrix.yao@intel.com> * Use regex defailed flags (#41264) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix multi-video timestamp bug in Qwen-3-VL and GLM4V (#41229) * fix multi-video timestamp bug in qwen3vl,glm4v * run make fix-copies to sync modular files * run make fix-copies to sync modular files ---------
Co-authored-by: UBT <daqin.luo@ubtrobot.com> * Fix binding of video frames to video placeholder in `InternVL` model (#41237) * Fix binding video frames to video placeholder in prompt Signed-off-by: Daniel Bershatsky <daniel.bershatsky@gmail.com> * Add test on binding video frames to prompt Signed-off-by: Daniel Bershatsky <daniel.bershatsky@gmail.com> * Fix code style issues Signed-off-by: Daniel Bershatsky <daniel.bershatsky@gmail.com> * Fix broken tests on `InternVLProcessor` Signed-off-by: Daniel Bershatsky <daniel.bershatsky@gmail.com> * Add `return_tensors` to video processor defaults Signed-off-by: Daniel Bershatsky <daniel.bershatsky@gmail.com> --------- Signed-off-by: Daniel Bershatsky <daniel.bershatsky@gmail.com> * Deprecate Trackio environment variables and deploy to Spaces by default (#40950) * allow private space id for trackio * complete docstring * Deprecate environment variables for Trackio integration; use TrainingArguments instead and deploy by default * style * Enhance documentation for Trackio Space ID in TrainingArguments * Allow private Space id for Trackio (#40948) * allow private space id for trackio * complete docstring * fix async client for transformers chat (#41255) * fix-client * fix * Unify is_torchvision_v2_available with is_torchvision_available (#41259) Fix is_torchvision_v2_available Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Use max/min (#41280) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Biogptlogits (#41270) added logits slicing to BioGpt for seq classifier Signed-off-by: Aviral <aviralkamaljain@gmail.com> * Fix unnecessary single-item container checks (#41279) Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> * Fix pylint generator warnings (#41258) Fix pylint generator warnings Signed-off-by: cyy <cyyever@outlook.com> * feat: use `aws-highcpu-32-priv` for amd docker img build (#41285) * feat: use `aws-highcpu-32-priv` for amd docker img build * feat: add `workflow_dispatch` event to docker build CI * Add
processor and integration test for qwen3vl (#41277) * support aux loss in qwen3vlmoe * update qwen3vl processor test! * add integration tests for qwen3vl-30a3 * remove duplicated decorator * code clean * fix consistency * do not inherit from nn.Linear for better quantization * pass check * Remove `test_initialization` (#41261) remove it * Remove some previous team members from allow list of triggering Github Actions (#41263) * delete * delete --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Build doc in 2 jobs: `en` and `other languages` (#41290) * separate * separate --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix mxfp4 dequantization (#41292) fix * [`Flex Attn`] Fix lse x attention sinks logic (#41249) fix * FIX: Bug in PEFT integration delete_adapter method (#41252) The main content of this PR is to fix a bug in the delete_adapter method of the PeftAdapterMixin. Previously, it did not take into account auxiliary modules from PEFT, e.g. those added by modules_to_save. This PR fixes this oversight. Note that the PR uses a new functionality from PEFT that exposes integration functions like delete_adapter. Those will be contained in the next PEFT release, 0.18.0 (yet unreleased). Therefore, the bug is only fixed when users have a PEFT version fulfilling this requirement. I ensured that with old PEFT versions, the integration still works the same as previously. The newly added test for this is skipped if the PEFT version is too low. (Note: I tested locally that the test passes with PEFT 0.18.0.) While working on this, I also cleaned up the following: - The active_adapter property has been deprecated for more than 2 years (#26407). It is safe to remove it now. - There were numerous small errors or outdated pieces of information in the docstrings, which have been addressed. When PEFT < 0.18.0 is used, although we cannot delete modules_to_save, we can still detect them and warn about it.
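The version-gated behaviour described for #41252 can be pictured with a small stand-alone sketch. It is not the code from `PeftAdapterMixin`: `ToyAdapterModel`, `parse_version`, and this `delete_adapter` function are hypothetical names invented for illustration, and the only detail taken from the description above is the 0.18.0 threshold and the "warn instead of delete" fallback for older PEFT versions.

```python
import re
import warnings

# Threshold taken from the PR description; everything else here is a toy stand-in.
MIN_PEFT_VERSION = (0, 18, 0)


def parse_version(text):
    """Turn '0.17.1' or '0.18.0.dev0' into a 3-tuple of ints.

    Pre-release suffixes are ignored, so '0.18.0.dev0' compares equal to 0.18.0.
    """
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


class ToyAdapterModel:
    """Hypothetical model holding adapter weights plus auxiliary modules_to_save."""

    def __init__(self):
        self.adapters = {"default": object()}
        self.modules_to_save = {"default": object()}


def delete_adapter(model, name, peft_version):
    """Delete an adapter; return True if auxiliary modules were removed as well."""
    if name not in model.adapters:
        raise ValueError(f"Adapter {name!r} not found")
    del model.adapters[name]

    if parse_version(peft_version) >= MIN_PEFT_VERSION:
        # Newer PEFT: auxiliary modules can be cleaned up too.
        model.modules_to_save.pop(name, None)
        return True

    # Older PEFT: auxiliary modules cannot be deleted, so detect them and warn.
    if name in model.modules_to_save:
        warnings.warn(
            f"modules_to_save for adapter {name!r} were not deleted; "
            "upgrade PEFT to 0.18.0 or later to remove them."
        )
    return False
```

Calling `delete_adapter(ToyAdapterModel(), "default", "0.17.1")` removes the adapter entry, leaves the auxiliary module in place, and emits a warning; with `"0.18.0"` both are removed.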
* Italian translation for README.md (#41269) chore: add Italian translation for README.md * Fix README.md error when installing from source (#41303) * download and use HF Hub Cache (#41181) use hub cache Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * fix some merge issues * [test_all] * [test-all] --------- Signed-off-by: Yuanyuan Chen <cyyever@outlook.com> Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com> Signed-off-by: Wang, Yi <yi.a.wang@intel.com> Signed-off-by: Yao, Matrix <matrix.yao@intel.com> Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> Signed-off-by: Peter St. John <pstjohn@nvidia.com> Signed-off-by: Daniel Bershatsky <daniel.bershatsky@gmail.com> Signed-off-by: Aviral <aviralkamaljain@gmail.com> Signed-off-by: cyy <cyyever@outlook.com> Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: Rangehow <88258534+rangehow@users.noreply.github.com> Co-authored-by: rangehow <rangehow@foxmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> Co-authored-by: Raushan Turganbay <raushan@huggingface.co> Co-authored-by: Anna <anna@liquid.ai> Co-authored-by: Anna Banaszak <48625325+ankke@users.noreply.github.com> Co-authored-by: Yuanyuan Chen <cyyever@outlook.com> Co-authored-by: Hamish Scott <41787553+hamishs@users.noreply.github.com> Co-authored-by: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com> Co-authored-by: Harshal Janjani <75426551+harshaljanjani@users.noreply.github.com> Co-authored-by: Branden <brandenkmurray@gmail.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-27-253.ec2.internal> Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co> Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com> Co-authored-by: Ákos Hadnagy <akos@ahadnagy.com> Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com> 
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-168-30.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-161-103.ec2.internal> Co-authored-by: Lysandre <hi@lysand.re> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-174-36.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-45.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-173-121.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-160-103.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-161-178.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-162-79.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-169-239.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-167-111.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-160-100.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-161-153.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-166-15.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-165-131.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-161-138.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-174-215.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-172-142.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-172-147.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-0.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-163-58.ec2.internal> 
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-165-202.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-166-244.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-174-186.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-160-192.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-162-14.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-171-249.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-75.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-161-78.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-163-134.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-162-180.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-175-241.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-160-225.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-167-9.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-168-34.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-166-68.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-167-175.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-170-160.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-168-95.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-172-73.ec2.internal> Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com> Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> 
Co-authored-by: StevenBucaille <steven.bucaille@gmail.com> Co-authored-by: BakerBunker <17872844+BakerBunker@users.noreply.github.com> Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com> Co-authored-by: Ayush <ayushtanwar1729@gmail.com> Co-authored-by: Ryan Mullins <ryan@ryanmullins.org> Co-authored-by: Yannick Schnider <Yannick.Schnider1@ibm.com> Co-authored-by: Ralph Gleaton <70818603+rjgleaton@users.noreply.github.com> Co-authored-by: Rémi Ouazan <83456801+remi-or@users.noreply.github.com> Co-authored-by: Saidur Rahman Pulok <59414463+saidurpulok@users.noreply.github.com> Co-authored-by: Nick Doiron <ndoiron@mapmeld.com> Co-authored-by: Wang, Yi <yi.a.wang@intel.com> Co-authored-by: Duygu Altinok <duygu.altinok12@gmail.com> Co-authored-by: Jinde.Song <juude.song@gmail.com> Co-authored-by: Ryan Mullins <ryanmullins@google.com> Co-authored-by: hbenoit <60629420+HaroldBenoit@users.noreply.github.com> Co-authored-by: liangel-02 <liangel@meta.com> Co-authored-by: nnul <107971634+notkisk@users.noreply.github.com> Co-authored-by: Matt <Rocketknight1@users.noreply.github.com> Co-authored-by: YangKai0616 <kai.yang@intel.com> Co-authored-by: Karol Szustakowski <61427290+Szustarol@users.noreply.github.com> Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com> Co-authored-by: Qile Xu <87457840+Xqle@users.noreply.github.com> Co-authored-by: Yao Matrix <matrix.yao@intel.com> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com> Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com> Co-authored-by: Eric B <ebezzam@gmail.com> Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-173-7.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-171-230.ec2.internal> Co-authored-by: 
ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-166-214.ec2.internal> Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-163-147.ec2.internal> Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com> Co-authored-by: Pk Patel <46714886+The5cheduler@users.noreply.github.com> Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com> Co-authored-by: Samuel Barry <127697809+SamuelBarryCS@users.noreply.github.com> Co-authored-by: sangbumchoi <danielsejong55@gmail.com> Co-authored-by: RUFFY-369 <prakarshkaushik369@gmail.com> Co-authored-by: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com> Co-authored-by: Haitham Khedr <haithamkhedr@meta.com> Co-authored-by: sangbum choi <sangbumchoi@sangbumui-MacBookAir.local> Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com> Co-authored-by: OMOTAYO OMOYEMI <58476114+tayo4christ@users.noreply.github.com> Co-authored-by: eun2ce <joeun2ce@gmail.com> Co-authored-by: Sam Sharpe <ssharpe42y@gmail.com> Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Co-authored-by: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Co-authored-by: Peter St. 
John <pstjohn@nvidia.com> Co-authored-by: 魅影 <46097299+frozenleaves@users.noreply.github.com> Co-authored-by: frozenleaves <frozen@Mac.local> Co-authored-by: Andrei Panferov <andrei@panferov.org> Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn> Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com> Co-authored-by: tim120526 <43242086+tim120526@users.noreply.github.com> Co-authored-by: UBT <daqin.luo@ubtrobot.com> Co-authored-by: Daniel Bershatsky <daskol@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: 0xAvi <aviralkamaljain@gmail.com> Co-authored-by: Luc Georges <McPatate@users.noreply.github.com> Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com> Co-authored-by: Federico Moretti <hello@federicomoretti.it> Co-authored-by: Yangshen⚡Deng <yangshen.d@outlook.com>
2025-10-22 02:08:58 +08:00 · 2025-10-03 18:32:49 +02:00 · 2025-10-03 18:29:51 +02:00
1489 changed files with 37915 additions and 93002 deletions
--- a/.circleci/create_circleci_config.py
+++ b/.circleci/create_circleci_config.py
@@ -16,10 +16,9 @@
 import argparse
 import copy
 import os
-import random
 from dataclasses import dataclass
-from typing import Any, Dict, List, Optional
-import glob
+from typing import Any, Optional
+
 import yaml


@@ -30,6 +29,7 @@ COMMON_ENV_VARIABLES = {
    "RUN_PIPELINE_TESTS": False,
    # will be adjust in `CircleCIJob.to_dict`.
    "RUN_FLAKY": True,
+    "DISABLE_SAFETENSORS_CONVERSION": True,
 }
 # Disable the use of {"s": None} as the output is way too long, causing the navigation on CircleCI impractical
 COMMON_PYTEST_OPTIONS = {"max-worker-restart": 0, "vvv": None, "rsfE":None}
@@ -82,15 +82,15 @@ class EmptyJob:
 @dataclass
 class CircleCIJob:
    name: str
-    additional_env: Dict[str, Any] = None
-    docker_image: List[Dict[str, str]] = None
-    install_steps: List[str] = None
+    additional_env: dict[str, Any] = None
+    docker_image: list[dict[str, str]] = None
+    install_steps: list[str] = None
    marker: Optional[str] = None
    parallelism: Optional[int] = 0
    pytest_num_workers: int = 8
-    pytest_options: Dict[str, Any] = None
+    pytest_options: dict[str, Any] = None
    resource_class: Optional[str] = "xlarge"
-    tests_to_run: Optional[List[str]] = None
+    tests_to_run: Optional[list[str]] = None
    num_test_files_per_worker: Optional[int] = 10
    # This should be only used for doctest job!
    command_timeout: Optional[int] = None
@@ -130,6 +130,12 @@ class CircleCIJob:

    def to_dict(self):
        env = COMMON_ENV_VARIABLES.copy()
+        if self.job_name != "tests_hub":
+            # fmt: off
+            # not critical
+            env.update({"HF_TOKEN": "".join(["h", "f", "_", "H", "o", "d", "V", "u", "M", "q", "b", "R", "m", "t", "b", "z", "F", "Q", "O", "Q", "A", "J", "G", "D", "l", "V", "Q", "r", "R", "N", "w", "D", "M", "V", "C", "s", "d"])})
+            # fmt: on
+
        # Do not run tests decorated by @is_flaky on pull requests
        env['RUN_FLAKY'] = os.environ.get("CIRCLE_PULL_REQUEST", "") == ""
        env.update(self.additional_env)
@@ -149,7 +155,7 @@ class CircleCIJob:
                # Examples special case: we need to download NLTK files in advance to avoid cuncurrency issues
        timeout_cmd = f"timeout {self.command_timeout} " if self.command_timeout else ""
        marker_cmd = f"-m '{self.marker}'" if self.marker is not None else ""
-        junit_flags = f" -p no:warning -o junit_family=xunit1 --junitxml=test-results/junit.xml"
+        junit_flags = " -p no:warning -o junit_family=xunit1 --junitxml=test-results/junit.xml"
        joined_flaky_patterns = "|".join(FLAKY_TEST_FAILURE_PATTERNS)
        repeat_on_failure_flags = f"--reruns 5 --reruns-delay 2 --only-rerun '({joined_flaky_patterns})'"
        parallel = f' << pipeline.parameters.{self.job_name}_parallelism >> '
@@ -180,6 +186,7 @@ class CircleCIJob:
            # During the CircleCI docker images build time, we might already (or not) download the data.
            # If it's done already, the files are inside the directory `/test_data/`.
            {"run": {"name": "fetch hub objects before pytest", "command": "cp -r /test_data/* . 2>/dev/null || true; python3 utils/fetch_hub_objects_for_ci.py"}},
+            {"run": {"name": "download and unzip hub cache", "command": 'curl -L -o huggingface-cache.tar.gz https://huggingface.co/datasets/hf-internal-testing/hf_hub_cache/resolve/main/huggingface-cache.tar.gz && apt-get install pigz && tar --use-compress-program="pigz -d -p 8" -xf huggingface-cache.tar.gz && mv -n hub/* /root/.cache/huggingface/hub/ && ls -la /root/.cache/huggingface/hub/'}},
            {"run": {
                "name": "Run tests",
                "command": f"({timeout_cmd} python3 -m pytest {marker_cmd} -n {self.pytest_num_workers} {junit_flags} {repeat_on_failure_flags} {' '.join(pytest_flags)} $(cat splitted_tests.txt) | tee tests_output.txt)"}
@@ -200,9 +207,9 @@ class CircleCIJob:
                        fi"""
                },
            },
-            {"run": {"name": "Expand to show skipped tests", "when": "always", "command": f"python3 .circleci/parse_test_outputs.py --file tests_output.txt --skip"}},
-            {"run": {"name": "Failed tests: show reasons",   "when": "always", "command": f"python3 .circleci/parse_test_outputs.py --file tests_output.txt --fail"}},
-            {"run": {"name": "Errors",                       "when": "always", "command": f"python3 .circleci/parse_test_outputs.py --file tests_output.txt --errors"}},
+            {"run": {"name": "Expand to show skipped tests", "when": "always", "command": "python3 .circleci/parse_test_outputs.py --file tests_output.txt --skip"}},
+            {"run": {"name": "Failed tests: show reasons",   "when": "always", "command": "python3 .circleci/parse_test_outputs.py --file tests_output.txt --fail"}},
+            {"run": {"name": "Errors",                       "when": "always", "command": "python3 .circleci/parse_test_outputs.py --file tests_output.txt --errors"}},
            {"store_test_results": {"path": "test-results"}},
            {"store_artifacts": {"path": "test-results/junit.xml"}},
            {"store_artifacts": {"path": "reports"}},
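Reviewer note on the `to_dict` hunk above: the environment is built by copying the common variables, deriving `RUN_FLAKY` from whether `CIRCLE_PULL_REQUEST` is set, and then applying the per-job overrides last. A minimal sketch of that order of operations follows. It assumes a hypothetical standalone helper named `build_env` (not part of the script), lists only the common variables visible in the hunk, treats a missing `additional_env` as empty, and leaves out the token injection.

```python
from typing import Any, Optional

# Only the entries visible in the diff hunk; the real dict may hold more.
COMMON_ENV_VARIABLES = {
    "RUN_PIPELINE_TESTS": False,
    "RUN_FLAKY": True,
    "DISABLE_SAFETENSORS_CONVERSION": True,
}


def build_env(additional_env: Optional[dict[str, Any]], environ: dict[str, str]) -> dict[str, Any]:
    """Hypothetical helper mirroring the update order in `CircleCIJob.to_dict`."""
    # Copy so that the module-level defaults are never mutated.
    env = COMMON_ENV_VARIABLES.copy()
    # Flaky tests run only when the job was not triggered by a pull request.
    env["RUN_FLAKY"] = environ.get("CIRCLE_PULL_REQUEST", "") == ""
    # Per-job overrides are applied last, so they win over the defaults.
    # Assumption: a missing `additional_env` is treated as an empty mapping.
    env.update(additional_env or {})
    return env


# Illustrative inputs; the URL is a placeholder, not a real pull request.
print(build_env({"EXTRA": "1"}, {"CIRCLE_PULL_REQUEST": "https://example.invalid/pull/1"}))
print(build_env(None, {}))
```

Because the overrides come last, a job that sets `RUN_FLAKY` in `additional_env` would still override the pull-request check.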
--- a/.circleci/parse_test_outputs.py
+++ b/.circleci/parse_test_outputs.py
@@ -1,5 +1,6 @@
-import re
 import argparse
+import re
+

 def parse_pytest_output(file_path):
    skipped_tests = {}
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -61,6 +61,7 @@ body:
          - Big Model Inference: @SunMarc
          - quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber
          - kernels: @MekkCyber @drbh
+          - peft: @BenjaminBossan @githubnemo
        
        Devices/Backends:
        
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -39,20 +39,23 @@ members/contributors who may be interested in your PR.

 Models:

-- text models: @ArthurZucker
-- vision models: @amyeroberts, @qubvel
-- speech models: @eustlb
+- text models: @ArthurZucker @Cyrilvallez
+- vision models: @yonigozlan @molbap
+- audio models: @eustlb @ebezzam @vasqu
+- multimodal models: @zucchini-nlp
 - graph models: @clefourrier

 Library:

-- flax: @gante and @Rocketknight1
 - generate: @zucchini-nlp (visual-language models) or @gante (all others)
+- continuous batching: @remi-or @ArthurZucker @McPatate
 - pipelines: @Rocketknight1
-- tensorflow: @gante and @Rocketknight1
-- tokenizers: @ArthurZucker
-- trainer: @zach-huggingface, @SunMarc and @qgallouedec
-- chat templates: @Rocketknight1
+- tokenizers: @ArthurZucker and @itazap
+- trainer: @zach-huggingface @SunMarc
+- attention: @vasqu @ArthurZucker @CyrilVallez
+- model loading (from pretrained, etc): @CyrilVallez
+- distributed: @3outeille @ArthurZucker @S1ro1
+- CIs: @ydshieh

 Integrations:

@@ -60,20 +63,17 @@ Integrations:
 - ray/raytune: @richardliaw, @amogkam
 - Big Model Inference: @SunMarc
 - quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber
+- kernels: @MekkCyber @drbh
+- peft: @BenjaminBossan @githubnemo
+
+Devices/Backends:
+
+- AMD ROCm: @ivarflakstad
+- Intel XPU: @IlyasMoutawwakil
+- Ascend NPU: @ivarflakstad 

 Documentation: @stevhliu

-HF projects:
-
-- accelerate: [different repo](https://github.com/huggingface/accelerate)
-- datasets: [different repo](https://github.com/huggingface/datasets)
-- diffusers: [different repo](https://github.com/huggingface/diffusers)
-- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
-
-Maintained examples (not research project or legacy):
-
-- Flax: @Rocketknight1
-- PyTorch: See Models above and tag the person corresponding to the modality of the example.
-- TensorFlow: @Rocketknight1
+Research projects are not maintained and should be taken as is.

 -->
--- a/.github/scripts/assign_reviewers.py
+++ b/.github/scripts/assign_reviewers.py
@@ -13,14 +13,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import os
-import github
 import json
-from github import Github
+import os
 import re
 from collections import Counter
 from pathlib import Path

+import github
+from github import Github
+
+
 def pattern_to_regex(pattern):
    if pattern.startswith("/"):
        start_anchor = True
--- a/.github/scripts/codeowners_for_review_action
+++ b/.github/scripts/codeowners_for_review_action
@@ -7,8 +7,8 @@ docs/ @stevhliu
 /docker/ @ydshieh @ArthurZucker

 # More high-level globs catch cases when specific rules later don't apply
-/src/transformers/models/*/processing* @molbap @yonigozlan @qubvel
-/src/transformers/models/*/image_processing* @qubvel
+/src/transformers/models/*/processing* @molbap @yonigozlan
+/src/transformers/models/*/image_processing* @yonigozlan
 /src/transformers/models/*/image_processing_*_fast* @yonigozlan

 # Owners of subsections of the library
@@ -186,65 +186,65 @@ trainer_utils.py @zach-huggingface @SunMarc
 /src/transformers/models/zamba/mod*_zamba* @ArthurZucker

 # Vision models
-/src/transformers/models/beit/mod*_beit* @amyeroberts @qubvel
-/src/transformers/models/bit/mod*_bit* @amyeroberts @qubvel
-/src/transformers/models/conditional_detr/mod*_conditional_detr* @amyeroberts @qubvel
-/src/transformers/models/convnext/mod*_convnext* @amyeroberts @qubvel
-/src/transformers/models/convnextv2/mod*_convnextv2* @amyeroberts @qubvel
-/src/transformers/models/cvt/mod*_cvt* @amyeroberts @qubvel
-/src/transformers/models/deformable_detr/mod*_deformable_detr* @amyeroberts @qubvel
-/src/transformers/models/deit/mod*_deit* @amyeroberts @qubvel
-/src/transformers/models/depth_anything/mod*_depth_anything* @amyeroberts @qubvel
-/src/transformers/models/depth_anything_v2/mod*_depth_anything_v2* @amyeroberts @qubvel
-/src/transformers/models/deta/mod*_deta* @amyeroberts @qubvel
-/src/transformers/models/detr/mod*_detr* @amyeroberts @qubvel
-/src/transformers/models/dinat/mod*_dinat* @amyeroberts @qubvel
-/src/transformers/models/dinov2/mod*_dinov2* @amyeroberts @qubvel
-/src/transformers/models/dinov2_with_registers/mod*_dinov2_with_registers* @amyeroberts @qubvel
-/src/transformers/models/dit/mod*_dit* @amyeroberts @qubvel
-/src/transformers/models/dpt/mod*_dpt* @amyeroberts @qubvel
-/src/transformers/models/efficientformer/mod*_efficientformer* @amyeroberts @qubvel
-/src/transformers/models/efficientnet/mod*_efficientnet* @amyeroberts @qubvel
-/src/transformers/models/focalnet/mod*_focalnet* @amyeroberts @qubvel
-/src/transformers/models/glpn/mod*_glpn* @amyeroberts @qubvel
-/src/transformers/models/hiera/mod*_hiera* @amyeroberts @qubvel
-/src/transformers/models/ijepa/mod*_ijepa* @amyeroberts @qubvel
-/src/transformers/models/imagegpt/mod*_imagegpt* @amyeroberts @qubvel
-/src/transformers/models/levit/mod*_levit* @amyeroberts @qubvel
-/src/transformers/models/mask2former/mod*_mask2former* @amyeroberts @qubvel
-/src/transformers/models/maskformer/mod*_maskformer* @amyeroberts @qubvel
-/src/transformers/models/mobilenet_v1/mod*_mobilenet_v1* @amyeroberts @qubvel
-/src/transformers/models/mobilenet_v2/mod*_mobilenet_v2* @amyeroberts @qubvel
-/src/transformers/models/mobilevit/mod*_mobilevit* @amyeroberts @qubvel
-/src/transformers/models/mobilevitv2/mod*_mobilevitv2* @amyeroberts @qubvel
-/src/transformers/models/nat/mod*_nat* @amyeroberts @qubvel
-/src/transformers/models/poolformer/mod*_poolformer* @amyeroberts @qubvel
-/src/transformers/models/pvt/mod*_pvt* @amyeroberts @qubvel
-/src/transformers/models/pvt_v2/mod*_pvt_v2* @amyeroberts @qubvel
-/src/transformers/models/regnet/mod*_regnet* @amyeroberts @qubvel
-/src/transformers/models/resnet/mod*_resnet* @amyeroberts @qubvel
-/src/transformers/models/rt_detr/mod*_rt_detr* @amyeroberts @qubvel
-/src/transformers/models/segformer/mod*_segformer* @amyeroberts @qubvel
-/src/transformers/models/seggpt/mod*_seggpt* @amyeroberts @qubvel
-/src/transformers/models/superpoint/mod*_superpoint* @amyeroberts @qubvel
-/src/transformers/models/swiftformer/mod*_swiftformer* @amyeroberts @qubvel
-/src/transformers/models/swin/mod*_swin* @amyeroberts @qubvel
-/src/transformers/models/swinv2/mod*_swinv2* @amyeroberts @qubvel
-/src/transformers/models/swin2sr/mod*_swin2sr* @amyeroberts @qubvel
-/src/transformers/models/table_transformer/mod*_table_transformer* @amyeroberts @qubvel
-/src/transformers/models/textnet/mod*_textnet* @amyeroberts @qubvel
-/src/transformers/models/timm_wrapper/mod*_timm_wrapper* @amyeroberts @qubvel
-/src/transformers/models/upernet/mod*_upernet* @amyeroberts @qubvel
-/src/transformers/models/van/mod*_van* @amyeroberts @qubvel
-/src/transformers/models/vit/mod*_vit* @amyeroberts @qubvel
-/src/transformers/models/vit_hybrid/mod*_vit_hybrid* @amyeroberts @qubvel
-/src/transformers/models/vitdet/mod*_vitdet* @amyeroberts @qubvel
-/src/transformers/models/vit_mae/mod*_vit_mae* @amyeroberts @qubvel
-/src/transformers/models/vitmatte/mod*_vitmatte* @amyeroberts @qubvel
-/src/transformers/models/vit_msn/mod*_vit_msn* @amyeroberts @qubvel
-/src/transformers/models/vitpose/mod*_vitpose* @amyeroberts @qubvel
-/src/transformers/models/yolos/mod*_yolos* @amyeroberts @qubvel
-/src/transformers/models/zoedepth/mod*_zoedepth* @amyeroberts @qubvel
+/src/transformers/models/beit/mod*_beit* @yonigozlan @molbap
+/src/transformers/models/bit/mod*_bit* @yonigozlan @molbap
+/src/transformers/models/conditional_detr/mod*_conditional_detr* @yonigozlan @molbap
+/src/transformers/models/convnext/mod*_convnext* @yonigozlan @molbap
+/src/transformers/models/convnextv2/mod*_convnextv2* @yonigozlan @molbap
+/src/transformers/models/cvt/mod*_cvt* @yonigozlan @molbap
+/src/transformers/models/deformable_detr/mod*_deformable_detr* @yonigozlan @molbap
+/src/transformers/models/deit/mod*_deit* @yonigozlan @molbap
+/src/transformers/models/depth_anything/mod*_depth_anything* @yonigozlan @molbap
+/src/transformers/models/depth_anything_v2/mod*_depth_anything_v2* @yonigozlan @molbap
+/src/transformers/models/deta/mod*_deta* @yonigozlan @molbap
+/src/transformers/models/detr/mod*_detr* @yonigozlan @molbap
+/src/transformers/models/dinat/mod*_dinat* @yonigozlan @molbap
+/src/transformers/models/dinov2/mod*_dinov2* @yonigozlan @molbap
+/src/transformers/models/dinov2_with_registers/mod*_dinov2_with_registers* @yonigozlan @molbap
+/src/transformers/models/dit/mod*_dit* @yonigozlan @molbap
+/src/transformers/models/dpt/mod*_dpt* @yonigozlan @molbap
+/src/transformers/models/efficientformer/mod*_efficientformer* @yonigozlan @molbap
+/src/transformers/models/efficientnet/mod*_efficientnet* @yonigozlan @molbap
+/src/transformers/models/focalnet/mod*_focalnet* @yonigozlan @molbap
+/src/transformers/models/glpn/mod*_glpn* @yonigozlan @molbap
+/src/transformers/models/hiera/mod*_hiera* @yonigozlan @molbap
+/src/transformers/models/ijepa/mod*_ijepa* @yonigozlan @molbap
+/src/transformers/models/imagegpt/mod*_imagegpt* @yonigozlan @molbap
+/src/transformers/models/levit/mod*_levit* @yonigozlan @molbap
+/src/transformers/models/mask2former/mod*_mask2former* @yonigozlan @molbap
+/src/transformers/models/maskformer/mod*_maskformer* @yonigozlan @molbap
+/src/transformers/models/mobilenet_v1/mod*_mobilenet_v1* @yonigozlan @molbap
+/src/transformers/models/mobilenet_v2/mod*_mobilenet_v2* @yonigozlan @molbap
+/src/transformers/models/mobilevit/mod*_mobilevit* @yonigozlan @molbap
+/src/transformers/models/mobilevitv2/mod*_mobilevitv2* @yonigozlan @molbap
+/src/transformers/models/nat/mod*_nat* @yonigozlan @molbap
+/src/transformers/models/poolformer/mod*_poolformer* @yonigozlan @molbap
+/src/transformers/models/pvt/mod*_pvt* @yonigozlan @molbap
+/src/transformers/models/pvt_v2/mod*_pvt_v2* @yonigozlan @molbap
+/src/transformers/models/regnet/mod*_regnet* @yonigozlan @molbap
+/src/transformers/models/resnet/mod*_resnet* @yonigozlan @molbap
+/src/transformers/models/rt_detr/mod*_rt_detr* @yonigozlan @molbap
+/src/transformers/models/segformer/mod*_segformer* @yonigozlan @molbap
+/src/transformers/models/seggpt/mod*_seggpt* @yonigozlan @molbap
+/src/transformers/models/superpoint/mod*_superpoint* @yonigozlan @molbap
+/src/transformers/models/swiftformer/mod*_swiftformer* @yonigozlan @molbap
+/src/transformers/models/swin/mod*_swin* @yonigozlan @molbap
+/src/transformers/models/swinv2/mod*_swinv2* @yonigozlan @molbap
+/src/transformers/models/swin2sr/mod*_swin2sr* @yonigozlan @molbap
+/src/transformers/models/table_transformer/mod*_table_transformer* @yonigozlan @molbap
+/src/transformers/models/textnet/mod*_textnet* @yonigozlan @molbap
+/src/transformers/models/timm_wrapper/mod*_timm_wrapper* @yonigozlan @molbap
+/src/transformers/models/upernet/mod*_upernet* @yonigozlan @molbap
+/src/transformers/models/van/mod*_van* @yonigozlan @molbap
+/src/transformers/models/vit/mod*_vit* @yonigozlan @molbap
+/src/transformers/models/vit_hybrid/mod*_vit_hybrid* @yonigozlan @molbap
+/src/transformers/models/vitdet/mod*_vitdet* @yonigozlan @molbap
+/src/transformers/models/vit_mae/mod*_vit_mae* @yonigozlan @molbap
+/src/transformers/models/vitmatte/mod*_vitmatte* @yonigozlan @molbap
+/src/transformers/models/vit_msn/mod*_vit_msn* @yonigozlan @molbap
+/src/transformers/models/vitpose/mod*_vitpose* @yonigozlan @molbap
+/src/transformers/models/yolos/mod*_yolos* @yonigozlan @molbap
+/src/transformers/models/zoedepth/mod*_zoedepth* @yonigozlan @molbap

 # Audio models
 /src/transformers/models/audio_spectrogram_transformer/mod*_audio_spectrogram_transformer* @eustlb
@@ -304,7 +304,7 @@ trainer_utils.py @zach-huggingface @SunMarc
 /src/transformers/models/donut/mod*_donut* @zucchini-nlp
 /src/transformers/models/flava/mod*_flava* @zucchini-nlp
 /src/transformers/models/git/mod*_git* @zucchini-nlp
-/src/transformers/models/grounding_dino/mod*_grounding_dino* @qubvel
+/src/transformers/models/grounding_dino/mod*_grounding_dino* @yonigozlan
 /src/transformers/models/groupvit/mod*_groupvit* @zucchini-nlp
 /src/transformers/models/idefics/mod*_idefics* @zucchini-nlp
 /src/transformers/models/idefics2/mod*_idefics2* @zucchini-nlp
@@ -326,10 +326,10 @@ trainer_utils.py @zach-huggingface @SunMarc
 /src/transformers/models/mgp_str/mod*_mgp_str* @zucchini-nlp
 /src/transformers/models/mllama/mod*_mllama* @zucchini-nlp
 /src/transformers/models/nougat/mod*_nougat* @NielsRogge
-/src/transformers/models/omdet_turbo/mod*_omdet_turbo* @qubvel @yonigozlan
+/src/transformers/models/omdet_turbo/mod*_omdet_turbo* @yonigozlan
 /src/transformers/models/oneformer/mod*_oneformer* @zucchini-nlp
-/src/transformers/models/owlvit/mod*_owlvit* @qubvel
-/src/transformers/models/owlv2/mod*_owlv2* @qubvel
+/src/transformers/models/owlvit/mod*_owlvit* @yonigozlan
+/src/transformers/models/owlv2/mod*_owlv2* @yonigozlan
 /src/transformers/models/paligemma/mod*_paligemma* @zucchini-nlp @molbap
 /src/transformers/models/perceiver/mod*_perceiver* @zucchini-nlp
 /src/transformers/models/pix2struct/mod*_pix2struct* @zucchini-nlp
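Reviewer note on the codeowners changes above: the file's own comment says that the high-level globs catch cases where the later, more specific rules do not apply, which suggests that later matching rules take precedence. The sketch below illustrates that resolution order under stated assumptions. `glob_to_regex` and `owners_for` are hypothetical helpers, they handle only root-anchored patterns containing `*`, and they follow GitHub's documented CODEOWNERS precedence (last match wins). The repository's own `assign_reviewers.py` may resolve matches differently.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Simplified translation: only root-anchored patterns and `*` are handled."""
    if not pattern.startswith("/"):
        raise ValueError("this sketch only handles patterns with a leading '/'")
    # `*` matches any run of characters within a single path segment.
    body = "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern[1:])
    # Also accept the pattern as a directory prefix of a longer path.
    return re.compile(rf"^{body}(/.*)?$")


def owners_for(path: str, rules: list[tuple[str, list[str]]]) -> list[str]:
    """Return the owners of the last rule matching `path` (later rules win)."""
    owners: list[str] = []
    for pattern, rule_owners in rules:
        if glob_to_regex(pattern).match(path):
            owners = rule_owners
    return owners


# Two rules copied from the new side of the diff, in file order.
RULES = [
    ("/src/transformers/models/*/image_processing*", ["@yonigozlan"]),
    ("/src/transformers/models/beit/mod*_beit*", ["@yonigozlan", "@molbap"]),
]

print(owners_for("src/transformers/models/beit/modeling_beit.py", RULES))
print(owners_for("src/transformers/models/vit/image_processing_vit.py", RULES))
```

A path that matches no rule returns an empty list, in which case nobody would be auto-assigned under this sketch.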
--- a/.github/workflows/benchmark_v2.yml
+++ b/.github/workflows/benchmark_v2.yml
@@ -0,0 +1,85 @@
+name: Benchmark v2 Framework
+
+on:
+  workflow_call:
+    inputs:
+      runner:
+        description: 'GH Actions runner group to use'
+        required: true
+        type: string
+      container_image:
+        description: 'Docker image to use'
+        required: true
+        type: string
+      container_options:
+        description: 'Container options to use'
+        required: true
+        type: string
+      commit_sha:
+        description: 'Commit SHA to benchmark'
+        required: false
+        type: string
+        default: ''
+      run_id:
+        description: 'Custom run ID for organizing results (auto-generated if not provided)'
+        required: false
+        type: string
+        default: ''
+      benchmark_repo_id:
+        description: 'HuggingFace Dataset to upload results to (e.g., "org/benchmark-results")'
+        required: false
+        type: string
+        default: ''
+
+env:
+  HF_HOME: /mnt/cache
+  TRANSFORMERS_IS_CI: yes
+  # For gated repositories, we still need to agree to share information on the Hub repo. page in order to get access.
+  # This token is created under the bot `hf-transformers-bot`.
+  HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
+
+jobs:
+  benchmark-v2:
+    name: Benchmark v2
+    runs-on: ${{ inputs.runner }}
+    if: |
+      (github.event_name == 'pull_request' && contains( github.event.pull_request.labels.*.name, 'run-benchmark')) ||
+      (github.event_name == 'schedule')
+    container:
+      image: ${{ inputs.container_image }}
+      options: ${{ inputs.container_options }}
+    steps:
+      - name: Get repo
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.commit_sha || github.sha }}
+
+      - name: Install benchmark dependencies
+        run: |
+          python3 -m pip install -r benchmark_v2/requirements.txt
+
+      - name: Reinstall transformers in edit mode
+        run: |
+          python3 -m pip uninstall -y transformers
+          python3 -m pip install -e ".[torch]"
+
+      - name: Show installed libraries and their versions
+        run: |
+          python3 -m pip list
+          python3 -c "import torch; print(f'PyTorch version: {torch.__version__}')"
+          python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
+          python3 -c "import torch; print(f'CUDA device count: {torch.cuda.device_count()}')" || true
+          nvidia-smi || true
+
+      - name: Run benchmark v2
+        working-directory: benchmark_v2
+        run: |
+          echo "Running benchmarks"
+          python3 run_benchmarks.py \
+          --commit-id '${{ inputs.commit_sha || github.sha }}' \
+          --run-id '${{ inputs.run_id }}' \
+          --push-to-hub '${{ inputs.benchmark_repo_id}}' \
+          --token '${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}' \
+          --log-level INFO
+        env:
+          HF_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
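Reviewer note on the new `benchmark_v2.yml` workflow above: the job-level `if:` runs the benchmark either on a pull request labelled `run-benchmark` or on a scheduled event, and the checkout step falls back to `github.sha` when `inputs.commit_sha` is empty. The sketch below restates both rules in Python. The function names are hypothetical, and the lower-casing reflects the documented case-insensitivity of GitHub's `contains` function.

```python
def should_run_benchmark(event_name: str, pr_labels: list[str]) -> bool:
    """Hypothetical restatement of the job-level `if:` expression."""
    if event_name == "pull_request":
        # GitHub's `contains` compares strings case-insensitively.
        return "run-benchmark" in {label.lower() for label in pr_labels}
    return event_name == "schedule"


def resolve_commit(commit_sha_input: str, github_sha: str) -> str:
    """Mirror `inputs.commit_sha || github.sha`: an empty input falls back."""
    return commit_sha_input or github_sha


# Illustrative values only; "abc123" is a placeholder, not a real SHA.
print(should_run_benchmark("pull_request", ["run-benchmark"]))
print(should_run_benchmark("pull_request", []))
print(should_run_benchmark("schedule", []))
print(resolve_commit("", "abc123"))
```

One consequence is that an unlabelled pull request still triggers the caller workflow, but the benchmark job itself is skipped.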
--- a/.github/workflows/benchmark_v2_a10_caller.yml
+++ b/.github/workflows/benchmark_v2_a10_caller.yml
@@ -0,0 +1,21 @@
+name: Benchmark v2 Scheduled Runner - A10 Single-GPU
+
+on:
+  schedule:
+    # Run daily at 16:30 UTC
+    - cron: "30 16 * * *"
+  pull_request:
+    types: [ opened, labeled, reopened, synchronize ]
+
+jobs:
+  benchmark-v2-default:
+    name: Benchmark v2 - Default Models
+    uses: ./.github/workflows/benchmark_v2.yml
+    with:
+      runner: aws-g5-4xlarge-cache-use1-public-80
+      container_image: huggingface/transformers-pytorch-gpu
+      container_options: --gpus all --privileged --ipc host --shm-size "16gb"
+      commit_sha: ${{ github.sha }}
+      run_id: ${{ github.run_id }}
+      benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks
+    secrets: inherit
--- a/.github/workflows/benchmark_v2_mi325_caller.yml
+++ b/.github/workflows/benchmark_v2_mi325_caller.yml
@@ -0,0 +1,21 @@
+name: Benchmark v2 Scheduled Runner - MI325 Single-GPU
+
+on:
+  schedule:
+    # Run daily at 16:30 UTC
+    - cron: "30 16 * * *"
+  pull_request:
+    types: [ opened, labeled, reopened, synchronize ]
+
+jobs:
+  benchmark-v2-default:
+    name: Benchmark v2 - Default Models
+    uses: ./.github/workflows/benchmark_v2.yml
+    with:
+      runner: amd-mi325-ci-1gpu
+      container_image: huggingface/transformers-pytorch-amd-gpu
+      container_options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache
+      commit_sha: ${{ github.sha }}
+      run_id: ${{ github.run_id }}
+      benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks
+    secrets: inherit
--- a/.github/workflows/build-docker-images.yml
+++ b/.github/workflows/build-docker-images.yml
@@ -5,6 +5,7 @@ on:
    branches:
      - build_ci_docker_image*
  repository_dispatch:
+  workflow_dispatch:
  workflow_call:
    inputs:
      image_postfix:
@@ -221,7 +222,7 @@ jobs:
  latest-pytorch-amd:
    name: "Latest PyTorch (AMD) [dev]"
    runs-on:
-      group: aws-general-8-plus
+      group: aws-highcpu-32-priv
    steps:
      -
        name: Set up Docker Buildx
--- a/.github/workflows/build_documentation.yml
+++ b/.github/workflows/build_documentation.yml
@@ -16,7 +16,19 @@ jobs:
      commit_sha: ${{ github.sha }}
      package: transformers
      notebook_folder: transformers_doc
-      languages: ar de en es fr hi it ko pt tr zh ja te
+      languages: en
+      custom_container: huggingface/transformers-doc-builder
+    secrets:
+      token: ${{ secrets.HUGGINGFACE_PUSH }}
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
+
+   build_other_lang:
+    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
+    with:
+      commit_sha: ${{ github.sha }}
+      package: transformers
+      notebook_folder: transformers_doc
+      languages: ar de es fr hi it ja ko pt zh
      custom_container: huggingface/transformers-doc-builder
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
--- a/.github/workflows/model_jobs.yml
+++ b/.github/workflows/model_jobs.yml
@@ -128,28 +128,47 @@ jobs:
          echo "machine_type=$machine_type" >> $GITHUB_ENV
          echo "machine_type=$machine_type" >> $GITHUB_OUTPUT

+      - name: Create report directory if it doesn't exist
+        shell: bash
+        run: |
+          mkdir -p /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports
+          echo "dummy" > /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports/dummy.txt
+          ls -la /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports
+
      - name: Run all tests on GPU
        working-directory: /transformers
-        run: python3 -m pytest -rsfE -v --make-reports=${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ matrix.folders }}_test_reports tests/${{ matrix.folders }}
+        run: |
+          script -q -c "PATCH_TESTING_METHODS_TO_COLLECT_OUTPUTS=yes _PATCHED_TESTING_METHODS_OUTPUT_DIR=/transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports python3 -m pytest -rsfE -v --make-reports=${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports tests/${{ matrix.folders }}" test_outputs.txt
+          ls -la
+          # Extract the exit code from the output file
+          EXIT_CODE=$(tail -1 test_outputs.txt | grep -o 'COMMAND_EXIT_CODE="[0-9]*"' | cut -d'"' -f2)
+          exit ${EXIT_CODE:-1}

      - name: Failure short reports
        if: ${{ failure() }}
+        # This step is only to show information on Github Actions log.
+        # Always mark this step as successful, even if the report directory or the file `failures_short.txt` in it doesn't exist
        continue-on-error: true
-        run: cat /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ matrix.folders }}_test_reports/failures_short.txt
+        run: cat /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports/failures_short.txt

-      - name: Run test
-        shell: bash
+      - name: Captured information
+        if: ${{ failure() }}
+        continue-on-error: true
        run: |
-          mkdir -p /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ matrix.folders }}_test_reports
-          echo "hello" > /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ matrix.folders }}_test_reports/hello.txt
-          echo "${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ matrix.folders }}_test_reports"
+          cat /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports/captured_info.txt
+
+      - name: Copy test_outputs.txt
+        if: ${{ always() }}
+        continue-on-error: true
+        run: |
+          cp /transformers/test_outputs.txt /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports

      - name: "Test suite reports artifacts: ${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports"
        if: ${{ always() }}
        uses: actions/upload-artifact@v4
        with:
          name: ${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports
-          path: /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ matrix.folders }}_test_reports
+          path: /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports

  collated_reports:
    name: Collated Reports
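Reviewer note on the "Run all tests on GPU" step above: pytest now runs under `script`, so the step recovers the real exit status from the last line of `test_outputs.txt` and defaults to `1` when no status is found. The sketch below is a Python equivalent of the `tail | grep | cut` pipeline. `exit_code_from_log` is a hypothetical name, and the sample trailer line is illustrative; it assumes the `[COMMAND_EXIT_CODE="N"]` footer written by util-linux `script`.

```python
import re

# Same pattern as the `grep -o` call in the workflow step.
EXIT_CODE_RE = re.compile(r'COMMAND_EXIT_CODE="([0-9]*)"')


def exit_code_from_log(text: str) -> int:
    """Return the exit code recorded on the last line, or 1 if it is absent."""
    lines = text.splitlines()
    last_line = lines[-1] if lines else ""  # equivalent of `tail -1`
    match = EXIT_CODE_RE.search(last_line)
    if match and match.group(1):
        return int(match.group(1))
    # Mirrors `${EXIT_CODE:-1}`: a missing or empty value means failure.
    return 1


# Illustrative log contents, not captured from a real run.
sample = 'collected 3 items\nScript done [COMMAND_EXIT_CODE="2"]\n'
print(exit_code_from_log(sample))
print(exit_code_from_log("process killed before the footer was written\n"))
```

Defaulting to `1` means a run whose process was killed before `script` wrote its footer is reported as a failure and not silently passed.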
--- a/.github/workflows/pr_build_doc_with_comment.yml
+++ b/.github/workflows/pr_build_doc_with_comment.yml
@@ -14,7 +14,7 @@ permissions: {}
 jobs:
  get-pr-number:
    name: Get PR number
-    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "qubvel", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "muellerzr", "eustlb", "MekkCyber", "manueldeprada", "vasqu", "ivarflakstad", "stevhliu", "ebezzam"]'), github.actor) && (startsWith(github.event.comment.body, 'build-doc')) }}
+    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "itazap"]'), github.actor) && (startsWith(github.event.comment.body, 'build-doc')) }}
    uses: ./.github/workflows/get-pr-number.yml

  get-pr-info:
--- a/.github/workflows/self-comment-ci.yml
+++ b/.github/workflows/self-comment-ci.yml
@ -29,7 +29,7 @@ jobs:
    runs-on: ubuntu-22.04
    name: Get PR number
    # For security: only allow team members to run
-    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "qubvel", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "muellerzr", "eustlb", "MekkCyber", "manueldeprada", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "remi-or"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
+    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "remi-or", "itazap"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
    outputs:
      PR_NUMBER: ${{ steps.set_pr_number.outputs.PR_NUMBER }}
    steps:
--- a/.github/workflows/self-scheduled-amd-mi325-caller.yml
+++ b/.github/workflows/self-scheduled-amd-mi325-caller.yml
@ -20,7 +20,7 @@ jobs:
    with:
      job: run_models_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi325-ci
+      runner_group: amd-mi325
      docker: huggingface/transformers-pytorch-amd-gpu
      ci_event: Scheduled CI (AMD) - mi325
      report_repo_id: optimum-amd/transformers_daily_ci
@ -33,7 +33,7 @@ jobs:
    with:
      job: run_pipelines_torch_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi325-ci
+      runner_group: amd-mi325
      docker: huggingface/transformers-pytorch-amd-gpu
      ci_event: Scheduled CI (AMD) - mi325
      report_repo_id: optimum-amd/transformers_daily_ci
@ -46,7 +46,7 @@ jobs:
    with:
      job: run_examples_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi325-ci
+      runner_group: amd-mi325
      docker: huggingface/transformers-pytorch-amd-gpu
      ci_event: Scheduled CI (AMD) - mi325
      report_repo_id: optimum-amd/transformers_daily_ci
@ -59,7 +59,7 @@ jobs:
    with:
      job: run_torch_cuda_extensions_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi325-ci
+      runner_group: amd-mi325
      docker: huggingface/transformers-pytorch-deepspeed-amd-gpu
      ci_event: Scheduled CI (AMD) - mi325
      report_repo_id: optimum-amd/transformers_daily_ci
--- a/.github/workflows/self-scheduled-amd-mi355-caller.yml
+++ b/.github/workflows/self-scheduled-amd-mi355-caller.yml
@ -20,7 +20,7 @@ jobs:
    with:
      job: run_models_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi355-ci
+      runner_group: hfc-amd-mi355
      docker: huggingface/testing-rocm7.0-preview
      ci_event: Scheduled CI (AMD) - mi355
      report_repo_id: hf-transformers-bot/transformers-ci-dummy
@ -32,7 +32,7 @@ jobs:
    with:
      job: run_pipelines_torch_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi355-ci
+      runner_group: hfc-amd-mi355
      docker: huggingface/testing-rocm7.0-preview
      ci_event: Scheduled CI (AMD) - mi355
      report_repo_id: hf-transformers-bot/transformers-ci-dummy
@ -44,7 +44,7 @@ jobs:
    with:
      job: run_examples_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi355-ci
+      runner_group: hfc-amd-mi355
      docker: huggingface/testing-rocm7.0-preview
      ci_event: Scheduled CI (AMD) - mi355
      report_repo_id: hf-transformers-bot/transformers-ci-dummy
@ -56,7 +56,7 @@ jobs:
    with:  
      job: run_torch_cuda_extensions_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi355-ci
+      runner_group: hfc-amd-mi355
      docker: huggingface/testing-rocm7.0-preview
      ci_event: Scheduled CI (AMD) - mi355
      report_repo_id: hf-transformers-bot/transformers-ci-dummy
--- a/.gitignore
+++ b/.gitignore
@ -13,6 +13,7 @@ tests/fixtures/cached_*_text.txt
 logs/
 lightning_logs/
 lang_code_data/
+reports/

 # Distribution / packaging
 .Python
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -278,13 +278,14 @@ are working on it).<br>
 useful to avoid duplicated work, and to differentiate it from PRs ready to be merged.<br>
 ☐ Make sure existing tests pass.<br>
 ☐ If adding a new feature, also add tests for it.<br>
-   - If you are adding a new model, make sure you use
+
+- If you are adding a new model, make sure you use
     `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)` to trigger the common tests.
-   - If you are adding new `@slow` tests, make sure they pass using
+- If you are adding new `@slow` tests, make sure they pass using
     `RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py`.
-   - If you are adding a new tokenizer, write tests and make sure
+- If you are adding a new tokenizer, write tests and make sure
     `RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py` passes.
-   - CircleCI does not run the slow tests, but GitHub Actions does every night!<br>
+- CircleCI does not run the slow tests, but GitHub Actions does every night!<br>

 ☐ All public methods must have informative docstrings (see
 [`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py)
@ -340,6 +341,7 @@ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/t
 ```

 Like the slow tests, there are other environment variables available which are not enabled by default during testing:
+
 - `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers.

 More environment variables and additional information can be found in the [testing_utils.py](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py).
--- a/ISSUES.md
+++ b/ISSUES.md
@ -38,7 +38,6 @@ In particular all "Please explain" questions or objectively very user-specific f

 * "How to train T5 on De->En translation?"

-
 ## The GitHub Issues

 Everything which hints at a bug should be opened as an [issue](https://github.com/huggingface/transformers/issues).
@ -247,7 +246,6 @@ You are not required to read the following guidelines before opening an issue. H

    Try not use italics and bold text too much as these often make the text more difficult to read.

-
 12. If you are cross-referencing a specific comment in a given thread or another issue, always link to that specific comment, rather than using the issue link. If you do the latter it could be quite impossible to find which specific comment you're referring to.

    To get the link to the specific comment do not copy the url from the location bar of your browser, but instead, click the `...` icon in the upper right corner of the comment and then select "Copy Link".
@ -257,7 +255,6 @@ You are not required to read the following guidelines before opening an issue. H
    1. https://github.com/huggingface/transformers/issues/9257
    2. https://github.com/huggingface/transformers/issues/9257#issuecomment-749945162

-
 13. If you are replying to a last comment, it's totally fine to make your reply with just your comment in it. The readers can follow the information flow here.

    But if you're replying to a comment that happened some comments back it's always a good practice to quote just the relevant lines you're replying it. The `>` is used for quoting, or you can always use the menu to do so. For example your editor box will look like:
--- a/README.md
+++ b/README.md
@ -48,9 +48,11 @@ limitations under the License.
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_te.md">తెలుగు</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_fr.md">Français</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_de.md">Deutsch</a> |
+        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_it.md">Italiano</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_vi.md">Tiếng Việt</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ar.md">العربية</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ur.md">اردو</a> |
+        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_bn.md">বাংলা</a> |
    </p>
 </h4>

@ -62,7 +64,6 @@ limitations under the License.
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
 </h3>

-
 Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
 vision, audio, video, and multimodal model, for both inference and training.

@ -110,10 +111,10 @@ git clone https://github.com/huggingface/transformers.git
 cd transformers

 # pip
-pip install .[torch]
+pip install '.[torch]'

 # uv
-uv pip install .[torch]
+uv pip install '.[torch]'
 ```

 ## Quickstart
@ -193,7 +194,6 @@ pipeline("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.pn
 <details>
 <summary>Visual question answering</summary>

-
 <h3 align="center">
    <a><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg"></a>
 </h3>
--- a/awesome-transformers.md
+++ b/awesome-transformers.md
@ -606,4 +606,3 @@ Keywords: BentoML, Framework, Deployment, AI Applications
 [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) offers a user-friendly fine-tuning framework that incorporates PEFT. The repository includes training(fine-tuning) and inference examples for LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, and other LLMs. A ChatGLM version is also available in [ChatGLM-Efficient-Tuning](https://github.com/hiyouga/ChatGLM-Efficient-Tuning).

 Keywords: PEFT, fine-tuning, LLaMA-2, ChatGLM, Qwen
-
--- a/benchmark_v2/README.md
+++ b/benchmark_v2/README.md
@ -21,6 +21,46 @@ python run_benchmarks.py \
    --num-tokens-to-generate 200
 ```

+### Uploading Results to HuggingFace Dataset
+
+You can automatically upload benchmark results to a HuggingFace Dataset for tracking and analysis:
+
+```bash
+# Upload to a public dataset with auto-generated run ID
+python run_benchmarks.py --push-to-hub username/benchmark-results
+
+# Upload with a custom run ID for easy identification
+python run_benchmarks.py --push-to-hub username/benchmark-results --run-id experiment_v1
+
+# Upload with custom HuggingFace token (if not set in environment)
+python run_benchmarks.py --push-to-hub username/benchmark-results --token hf_your_token_here
+```
+
+**Dataset Directory Structure:**
+```
+dataset_name/
+├── 2025-01-15/
+│   ├── runs/                       # Non-scheduled runs (manual, PR, etc.)
+│   │   └── 123-1245151651/         # GitHub run number and ID
+│   │       └── benchmark_results/
+│   │           ├── benchmark_summary_20250115_143022.json
+│   │           └── model-name/
+│   │               └── model-name_benchmark_20250115_143022.json
+│   └── benchmark_results_abc123de/ # Scheduled runs (daily CI)
+│       ├── benchmark_summary_20250115_143022.json
+│       └── model-name/
+│           └── model-name_benchmark_20250115_143022.json
+└── 2025-01-16/
+    └── ...
+```
+
+**Authentication for Uploads:**
+
+To upload results, you need a HuggingFace token with write permission on the target dataset. You can provide the token in two ways (in order of precedence):
+
+1. Command line: `--token hf_your_token_here`
+2. Environment variable: `HF_TOKEN`
+
 ### Running Specific Benchmarks

 ```bash
--- a/benchmark_v2/benches/llama.py
+++ b/benchmark_v2/benches/llama.py
@ -20,7 +20,6 @@ import torch
 from benchmark_framework import ModelBenchmark


-os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
 os.environ["TOKENIZERS_PARALLELISM"] = "1"
 torch.set_float32_matmul_precision("high")

--- a/benchmark_v2/requirements.txt
+++ b/benchmark_v2/requirements.txt
@ -4,3 +4,4 @@ gpustat>=1.0.0
 torch>=2.0.0
 transformers>=4.30.0
 datasets>=2.10.0
+huggingface_hub>=0.16.0
--- a/benchmark_v2/run_benchmarks.py
+++ b/benchmark_v2/run_benchmarks.py
@ -24,6 +24,7 @@ import json
 import logging
 import os
 import sys
+import uuid
 from datetime import datetime
 from pathlib import Path
 from typing import Any, Optional
@ -160,7 +161,12 @@ def run_single_benchmark(
        return None


-def generate_summary_report(output_dir: str, benchmark_results: dict[str, Any], logger: logging.Logger) -> str:
+def generate_summary_report(
+    output_dir: str,
+    benchmark_results: dict[str, Any],
+    logger: logging.Logger,
+    benchmark_run_uuid: Optional[str] = None,
+) -> str:
    """Generate a summary report of all benchmark runs."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    summary_file = os.path.join(output_dir, f"benchmark_summary_{timestamp}.json")
@ -168,6 +174,7 @@ def generate_summary_report(output_dir: str, benchmark_results: dict[str, Any],
    summary_data = {
        "run_metadata": {
            "timestamp": datetime.utcnow().isoformat(),
+            "benchmark_run_uuid": benchmark_run_uuid,
            "total_benchmarks": len(benchmark_results),
            "successful_benchmarks": len([r for r in benchmark_results.values() if r is not None]),
            "failed_benchmarks": len([r for r in benchmark_results.values() if r is None]),
@ -183,9 +190,114 @@ def generate_summary_report(output_dir: str, benchmark_results: dict[str, Any],
    return summary_file


+def upload_results_to_hf_dataset(
+    output_dir: str,
+    summary_file: str,
+    dataset_name: str,
+    run_id: Optional[str] = None,
+    token: Optional[str] = None,
+    logger: Optional[logging.Logger] = None,
+) -> Optional[str]:
+    """
+    Upload benchmark results to a HuggingFace Dataset.
+    Based on upload_collated_report() from utils/collated_reports.py
+    Args:
+        output_dir: Local output directory containing results
+        summary_file: Path to the summary file
+        dataset_name: Name of the HuggingFace dataset to upload to
+        run_id: Unique run identifier (if None, built from GITHUB_RUN_NUMBER and GITHUB_RUN_ID when both are set)
+        token: HuggingFace token for authentication (if None, will use environment variables)
+        logger: Logger instance
+    Returns:
+        The run_id used for the upload, None if upload failed
+    """
+    if logger is None:
+        logger = logging.getLogger(__name__)
+
+    import os
+
+    from huggingface_hub import HfApi
+
+    api = HfApi()
+
+    if run_id is None:
+        github_run_number = os.getenv("GITHUB_RUN_NUMBER")
+        github_run_id = os.getenv("GITHUB_RUN_ID")
+        if github_run_number and github_run_id:
+            run_id = f"{github_run_number}-{github_run_id}"
+
+    date_folder = datetime.now().strftime("%Y-%m-%d")
+
+    github_event_name = os.getenv("GITHUB_EVENT_NAME")
+    if github_event_name != "schedule":
+        # Non-scheduled runs go under a runs subfolder
+        repo_path = f"{date_folder}/runs/{run_id}/benchmark_results"
+    else:
+        # Scheduled runs go directly under the date
+        repo_path = f"{date_folder}/{run_id}/benchmark_results"
+
+    logger.info(f"Uploading benchmark results to dataset '{dataset_name}' at path '{repo_path}'")
+
+    try:
+        # Upload all files in the output directory
+        from pathlib import Path
+
+        output_path = Path(output_dir)
+
+        for file_path in output_path.rglob("*"):
+            if file_path.is_file():
+                # Calculate relative path from output_dir
+                relative_path = file_path.relative_to(output_path)
+                path_in_repo = f"{repo_path}/{relative_path}"
+
+                logger.debug(f"Uploading {file_path} to {path_in_repo}")
+
+                api.upload_file(
+                    path_or_fileobj=str(file_path),
+                    path_in_repo=path_in_repo,
+                    repo_id=dataset_name,
+                    repo_type="dataset",
+                    token=token,
+                    commit_message=f"Upload benchmark results for run {run_id}",
+                )
+
+        logger.info(
+            f"Successfully uploaded results to: https://huggingface.co/datasets/{dataset_name}/tree/main/{repo_path}"
+        )
+
+        return run_id
+
+    except Exception as upload_error:
+        logger.error(f"Failed to upload results: {upload_error}")
+        import traceback
+
+        logger.debug(traceback.format_exc())
+        return None
+
+
 def main():
    """Main entry point for the benchmarking script."""
-    parser = argparse.ArgumentParser(description="Run all benchmarks in the ./benches directory")
+    # Generate a short run identifier: the first 8 characters of a random UUID
+    benchmark_run_uuid = str(uuid.uuid4())[:8]
+
+    parser = argparse.ArgumentParser(
+        description="Run all benchmarks in the ./benches directory",
+        epilog="""
+Examples:
+  # Run all available benchmarks
+  python3 run_benchmarks.py
+  
+  # Run with specific model and upload to HuggingFace Dataset
+  python3 run_benchmarks.py --model-id meta-llama/Llama-2-7b-hf --push-to-hub username/benchmark-results
+  
+  # Run with custom run ID and upload to HuggingFace Dataset
+  python3 run_benchmarks.py --run-id experiment_v1 --push-to-hub org/benchmarks
+  
+  # Run only specific benchmarks with file logging
+  python3 run_benchmarks.py --include llama --enable-file-logging
+        """,  # noqa: W293
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )

    parser.add_argument(
        "--output-dir",
@ -228,20 +340,35 @@ def main():

    parser.add_argument("--exclude", type=str, nargs="*", help="Exclude benchmarks matching these names")

-    parser.add_argument("--enable-mock", action="store_true", help="Enable mock benchmark (skipped by default)")
-
    parser.add_argument("--enable-file-logging", action="store_true", help="Enable file logging (disabled by default)")

    parser.add_argument(
        "--commit-id", type=str, help="Git commit ID for metadata (if not provided, will auto-detect from git)"
    )

+    parser.add_argument(
+        "--push-to-hub",
+        type=str,
+        help="Upload results to HuggingFace Dataset (provide dataset name, e.g., 'username/benchmark-results')",
+    )
+
+    parser.add_argument(
+        "--run-id", type=str, help="Custom run ID for organizing results (if not provided, will generate a unique ID)"
+    )
+
+    parser.add_argument(
+        "--token",
+        type=str,
+        help="HuggingFace token for dataset uploads (if not provided, will use HF_TOKEN environment variable)",
+    )
+
    args = parser.parse_args()

    # Setup logging
    logger = setup_logging(args.log_level, args.enable_file_logging)

    logger.info("Starting benchmark discovery and execution")
+    logger.info(f"Benchmark run UUID: {benchmark_run_uuid}")
    logger.info(f"Output directory: {args.output_dir}")
    logger.info(f"Benches directory: {args.benches_dir}")

@ -286,9 +413,6 @@ def main():
        if args.model_id:
            benchmark_kwargs["model_id"] = args.model_id

-        # Add enable_mock flag for mock benchmark
-        benchmark_kwargs["enable_mock"] = args.enable_mock
-
        # Add commit_id if provided
        if args.commit_id:
            benchmark_kwargs["commit_id"] = args.commit_id
@ -306,7 +430,28 @@ def main():
                successful_count += 1

        # Generate summary report
-        summary_file = generate_summary_report(args.output_dir, benchmark_results, logger)
+        summary_file = generate_summary_report(args.output_dir, benchmark_results, logger, benchmark_run_uuid)
+
+        # Upload results to HuggingFace Dataset if requested
+        upload_run_id = None
+        if args.push_to_hub:
+            logger.info("=" * 60)
+            logger.info("UPLOADING TO HUGGINGFACE DATASET")
+            logger.info("=" * 60)
+            # Use provided run_id or fallback to benchmark run UUID
+            effective_run_id = args.run_id or benchmark_run_uuid
+            upload_run_id = upload_results_to_hf_dataset(
+                output_dir=args.output_dir,
+                summary_file=summary_file,
+                dataset_name=args.push_to_hub,
+                run_id=effective_run_id,
+                token=args.token,
+                logger=logger,
+            )
+            if upload_run_id:
+                logger.info(f"Upload completed with run ID: {upload_run_id}")
+            else:
+                logger.warning("Upload failed - continuing with local results")

        # Final summary
        total_benchmarks = len(filtered_benchmarks)
@ -321,6 +466,16 @@ def main():
        logger.info(f"Output directory: {args.output_dir}")
        logger.info(f"Summary report: {summary_file}")

+        if args.push_to_hub:
+            if upload_run_id:
+                logger.info(f"HuggingFace Dataset: {args.push_to_hub}")
+                logger.info(f"Run ID: {upload_run_id}")
+                logger.info(
+                    f"View results: https://huggingface.co/datasets/{args.push_to_hub}/tree/main/{datetime.now().strftime('%Y-%m-%d')}/runs/{upload_run_id}"
+                )
+            else:
+                logger.warning("Upload to HuggingFace Dataset failed")
+
        if failed_count > 0:
            logger.warning(f"{failed_count} benchmark(s) failed. Check logs for details.")
            return 1
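
Review note: the destination folder used by `upload_results_to_hf_dataset` depends only on the date, the run ID, and the triggering event, so it can be checked in isolation. The following is a minimal sketch of that logic, not code from this patch: `resolve_run_id` and `build_repo_path` are hypothetical helper names, and the event name is passed in as an argument here instead of being read from `GITHUB_EVENT_NAME`.

```python
from datetime import datetime
from typing import Optional


def resolve_run_id(cli_run_id: Optional[str], short_uuid: str) -> str:
    """Prefer an explicit --run-id value; otherwise fall back to the short run UUID."""
    return cli_run_id or short_uuid


def build_repo_path(run_id: str, event_name: Optional[str], now: Optional[datetime] = None) -> str:
    """Return the dataset-relative folder for one benchmark run (hypothetical helper)."""
    date_folder = (now or datetime.now()).strftime("%Y-%m-%d")
    if event_name != "schedule":
        # Non-scheduled runs (manual, PR, etc.) go under a runs/ subfolder.
        return f"{date_folder}/runs/{run_id}/benchmark_results"
    # Scheduled runs go directly under the date folder.
    return f"{date_folder}/{run_id}/benchmark_results"


if __name__ == "__main__":
    day = datetime(2025, 1, 15)
    print(build_repo_path(resolve_run_id(None, "abc123de"), "schedule", day))
    print(build_repo_path(resolve_run_id("experiment_v1", "abc123de"), "workflow_dispatch", day))
```

Keeping the path computation free of environment lookups and network calls would make it straightforward to unit-test the scheduled and non-scheduled layouts separately from the upload itself.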
--- a/conftest.py
+++ b/conftest.py
@ -54,7 +54,6 @@ NOT_DEVICE_TESTS = {
    "test_gradient_checkpointing_backward_compatibility",
    "test_gradient_checkpointing_enable_disable",
    "test_torch_save_load",
-    "test_initialization",
    "test_forward_signature",
    "test_model_get_set_embeddings",
    "test_model_main_input_name",
@ -64,8 +63,7 @@ NOT_DEVICE_TESTS = {
    "test_load_save_without_tied_weights",
    "test_tied_weights_keys",
    "test_model_weights_reload_no_missing_tied_weights",
-    "test_mismatched_shapes_have_properly_initialized_weights",
-    "test_matched_shapes_have_loaded_weights_when_some_mismatched_shapes_exist",
+    "test_can_load_ignoring_mismatched_shapes",
    "test_model_is_small",
    "test_tf_from_pt_safetensors",
    "test_flax_from_pt_safetensors",
@ -93,6 +91,8 @@ def pytest_configure(config):
    config.addinivalue_line("markers", "torch_compile_test: mark test which tests torch compile functionality")
    config.addinivalue_line("markers", "torch_export_test: mark test which tests torch export functionality")

+    os.environ['DISABLE_SAFETENSORS_CONVERSION'] = 'true'
+

 def pytest_collection_modifyitems(items):
    for item in items:
--- a/docker/consistency.dockerfile
+++ b/docker/consistency.dockerfile
@ -1,4 +1,4 @@
-FROM python:3.9-slim
+FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 USER root
 ARG REF=main
--- a/docker/custom-tokenizers.dockerfile
+++ b/docker/custom-tokenizers.dockerfile
@ -1,4 +1,4 @@
-FROM python:3.9-slim
+FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 ARG REF=main
 USER root
--- a/docker/examples-torch.dockerfile
+++ b/docker/examples-torch.dockerfile
@ -1,4 +1,4 @@
-FROM python:3.9-slim
+FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 ARG REF=main
 USER root
--- a/docker/exotic-models.dockerfile
+++ b/docker/exotic-models.dockerfile
@ -1,4 +1,4 @@
-FROM python:3.9-slim
+FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 ARG REF=main
 USER root
--- a/docker/pipeline-torch.dockerfile
+++ b/docker/pipeline-torch.dockerfile
@ -1,4 +1,4 @@
-FROM python:3.9-slim
+FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 ARG REF=main
 USER root
--- a/docker/quality.dockerfile
+++ b/docker/quality.dockerfile
@ -1,4 +1,4 @@
-FROM python:3.9-slim
+FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 ARG REF=main
 USER root
--- a/docker/torch-light.dockerfile
+++ b/docker/torch-light.dockerfile
@ -1,4 +1,4 @@
-FROM python:3.9-slim
+FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 ARG REF=main
 USER root
--- a/docker/transformers-pytorch-amd-gpu/Dockerfile
+++ b/docker/transformers-pytorch-amd-gpu/Dockerfile
@ -38,3 +38,10 @@ RUN python3 -m pip uninstall -y kernels

 # On ROCm, torchcodec is required to decode audio files and 0.4 or 0.6 fails
 RUN python3 -m pip install --no-cache-dir "torchcodec==0.5"
+
+# Install flash attention from source. Tested with commit 6387433156558135a998d5568a9d74c1778666d8
+RUN git clone https://github.com/ROCm/flash-attention/ -b tridao && \
+    cd flash-attention && \
+    GPU_ARCHS="gfx942" python setup.py install
+
+RUN python3 -m pip install --no-cache-dir einops
--- a/docker/transformers-quantization-latest-gpu/Dockerfile
+++ b/docker/transformers-quantization-latest-gpu/Dockerfile
@ -1,4 +1,4 @@
-FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
+FROM nvidia/cuda:12.6.0-cudnn-devel-ubuntu22.04
 LABEL maintainer="Hugging Face"

 ARG DEBIAN_FRONTEND=noninteractive
@ -9,9 +9,9 @@ SHELL ["sh", "-lc"]
 # The following `ARG` are mainly used to specify the versions explicitly & directly in this docker file, and not meant
 # to be used as arguments for docker build (so far).

-ARG PYTORCH='2.6.0'
+ARG PYTORCH='2.8.0'
 # Example: `cu102`, `cu113`, etc.
-ARG CUDA='cu121'
+ARG CUDA='cu126'
 # Disable kernel mapping for quantization tests
 ENV DISABLE_KERNEL_MAPPING=1

@ -30,31 +30,20 @@ RUN python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio tor

 RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate

-# needed in bnb and awq
-RUN python3 -m pip install --no-cache-dir einops
-
-# Add bitsandbytes for mixed int8 testing
-RUN python3 -m pip install --no-cache-dir bitsandbytes
-
-# Add gptqmodel for gtpq quantization testing, installed from source for pytorch==2.6.0 compatibility
-RUN python3 -m pip install lm_eval
-RUN git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel && pip install -v . --no-build-isolation
-
 # Add optimum for gptq quantization testing
 RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum

 # Add PEFT
 RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/peft@main#egg=peft

-# Add aqlm for quantization testing
-RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2
+# needed in bnb and awq
+RUN python3 -m pip install --no-cache-dir einops

-# Add vptq for quantization testing
-RUN pip install vptq
+# Add bitsandbytes
+RUN python3 -m pip install --no-cache-dir bitsandbytes

-# Add spqr for quantization testing
-# Commented for now as No matching distribution found we need to reach out to the authors
-# RUN python3 -m pip install --no-cache-dir spqr_quant[gpu]
+# # Add gptqmodel
+# RUN python3 -m pip install --no-cache-dir gptqmodel

 # Add hqq for quantization testing
 RUN python3 -m pip install --no-cache-dir hqq
@ -63,25 +52,11 @@ RUN python3 -m pip install --no-cache-dir hqq
 RUN python3 -m pip install --no-cache-dir gguf

 # Add autoawq for quantization testing
-# New release v0.2.8
 RUN python3 -m pip install --no-cache-dir autoawq[kernels]

 # Add quanto for quantization testing
 RUN python3 -m pip install --no-cache-dir optimum-quanto

-# Add eetq for quantization testing
-RUN git clone https://github.com/NetEase-FuXi/EETQ.git && cd EETQ/ && git submodule update --init --recursive && pip install .
-
-# # Add flute-kernel and fast_hadamard_transform for quantization testing
-# # Commented for now as they cause issues with the build
-# # TODO: create a new workflow to test them
-# RUN python3 -m pip install --no-cache-dir flute-kernel==0.4.1
-# RUN python3 -m pip install --no-cache-dir git+https://github.com/Dao-AILab/fast-hadamard-transform.git
-
-# Add fp-quant for quantization testing
-# Requires py3.11 but our CI runs on 3.9
-# RUN python3 -m pip install --no-cache-dir "fp-quant>=0.1.6"
-
 # Add compressed-tensors for quantization testing
 RUN python3 -m pip install --no-cache-dir compressed-tensors

@ -89,7 +64,10 @@ RUN python3 -m pip install --no-cache-dir compressed-tensors
 RUN python3 -m pip install --no-cache-dir amd-quark

 # Add AutoRound for quantization testing
-RUN python3 -m pip install --no-cache-dir "auto-round>=0.5.0"
+RUN python3 -m pip install --no-cache-dir auto-round
+
+# Add torchao for quantization testing
+RUN python3 -m pip install --no-cache-dir torchao

 # Add transformers in editable mode
 RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch]
@ -103,3 +81,27 @@ RUN python3 -m pip uninstall -y flash-attn
 # When installing in editable mode, `transformers` is not recognized as a package.
 # this line must be added in order for python to be aware of transformers.
 RUN cd transformers && python3 setup.py develop
+
+# Add fp-quant for quantization testing
+RUN python3 -m pip install --no-cache-dir "fp-quant>=0.2.0"
+
+# Low usage or incompatible lib, will enable later on
+
+# # Add aqlm for quantization testing
+# RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2
+
+# # Add vptq for quantization testing
+# RUN pip install vptq
+
+# Add spqr for quantization testing
+# Commented out for now: pip reports "No matching distribution found"; we need to reach out to the authors
+# RUN python3 -m pip install --no-cache-dir spqr_quant[gpu]
+
+# # Add eetq for quantization testing
+# RUN git clone https://github.com/NetEase-FuXi/EETQ.git && cd EETQ/ && git submodule update --init --recursive && pip install .
+
+# # Add flute-kernel and fast_hadamard_transform for quantization testing
+# # Commented for now as they cause issues with the build
+# # TODO: create a new workflow to test them
+# RUN python3 -m pip install --no-cache-dir flute-kernel==0.4.1
+# RUN python3 -m pip install --no-cache-dir git+https://github.com/Dao-AILab/fast-hadamard-transform.git
--- a/docs/TRANSLATING.md
+++ b/docs/TRANSLATING.md
@ -50,7 +50,7 @@ Begin translating the text!

 1. Start with the `_toctree.yml` file that corresponds to your documentation chapter. This file is essential for rendering the table of contents on the website.

-    - If the `_toctree.yml` file doesn’t exist for your language, create one by copying the English version and removing unrelated sections.
+    - If the `_toctree.yml` file doesn't exist for your language, create one by copying the English version and removing unrelated sections.
    - Ensure it is placed in the `docs/source/LANG-ID/` directory.

    Here’s an example structure for the `_toctree.yml` file:
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@ -307,6 +307,8 @@
    title: Glossary
  - local: philosophy
    title: Philosophy
+  - local: models_timeline
+    title: Models Timeline
  - local: notebooks
    title: Notebooks with examples
  - local: community
@ -411,6 +413,8 @@
        title: Blenderbot Small
      - local: model_doc/bloom
        title: BLOOM
+      - local: model_doc/blt
+        title: BLT
      - local: model_doc/bort
        title: BORT
      - local: model_doc/byt5
@ -441,6 +445,8 @@
        title: DeBERTa
      - local: model_doc/deberta-v2
        title: DeBERTa-v2
+      - local: model_doc/deepseek_v2
+        title: DeepSeek-V2
      - local: model_doc/deepseek_v3
        title: DeepSeek-V3
      - local: model_doc/dialogpt
@ -763,12 +769,6 @@
        title: D-FINE
      - local: model_doc/dab-detr
        title: DAB-DETR
-      - local: model_doc/deepseek_v2
-        title: DeepSeek-V2
-      - local: model_doc/deepseek_vl
-        title: DeepseekVL
-      - local: model_doc/deepseek_vl_hybrid
-        title: DeepseekVLHybrid
      - local: model_doc/deformable_detr
        title: Deformable DETR
      - local: model_doc/deit
@ -851,10 +851,16 @@
        title: RT-DETR
      - local: model_doc/rt_detr_v2
        title: RT-DETRv2
+      - local: model_doc/sam2
+        title: SAM2
      - local: model_doc/segformer
        title: SegFormer
      - local: model_doc/seggpt
        title: SegGpt
+      - local: model_doc/sam
+        title: Segment Anything
+      - local: model_doc/sam_hq
+        title: Segment Anything High Quality
      - local: model_doc/superglue
        title: SuperGlue
      - local: model_doc/superpoint
@ -933,6 +939,8 @@
        title: MusicGen
      - local: model_doc/musicgen_melody
        title: MusicGen Melody
+      - local: model_doc/parakeet
+        title: Parakeet
      - local: model_doc/pop2piano
        title: Pop2Piano
      - local: model_doc/seamless_m4t
@ -977,6 +985,8 @@
        title: XLSR-Wav2Vec2
      title: Audio models
    - sections:
+      - local: model_doc/sam2_video
+        title: SAM2 Video
      - local: model_doc/timesformer
        title: TimeSformer
      - local: model_doc/vjepa2
@ -1021,10 +1031,18 @@
        title: ColQwen2
      - local: model_doc/data2vec
        title: Data2Vec
+      - local: model_doc/deepseek_vl
+        title: DeepseekVL
+      - local: model_doc/deepseek_vl_hybrid
+        title: DeepseekVLHybrid
      - local: model_doc/deplot
        title: DePlot
      - local: model_doc/donut
        title: Donut
+      - local: model_doc/edgetam
+        title: EdgeTAM
+      - local: model_doc/edgetam_video
+        title: EdgeTamVideo
      - local: model_doc/emu3
        title: Emu3
      - local: model_doc/evolla
@ -1077,6 +1095,8 @@
        title: LayoutLMV3
      - local: model_doc/layoutxlm
        title: LayoutXLM
+      - local: model_doc/lfm2_vl
+        title: LFM2-VL
      - local: model_doc/lilt
        title: LiLT
      - local: model_doc/llama4
@ -1135,18 +1155,12 @@
        title: Qwen2Audio
      - local: model_doc/qwen2_vl
        title: Qwen2VL
+      - local: model_doc/qwen3_omni_moe
+        title: Qwen3-Omni-MoE
      - local: model_doc/qwen3_vl
        title: Qwen3VL
      - local: model_doc/qwen3_vl_moe
        title: Qwen3VLMoe
-      - local: model_doc/sam2
-        title: SAM2
-      - local: model_doc/sam2_video
-        title: SAM2 Video
-      - local: model_doc/sam
-        title: Segment Anything
-      - local: model_doc/sam_hq
-        title: Segment Anything High Quality
      - local: model_doc/shieldgemma2
        title: ShieldGemma2
      - local: model_doc/siglip
--- a/docs/source/en/accelerator_selection.md
+++ b/docs/source/en/accelerator_selection.md
@ -69,7 +69,6 @@ CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
 Only GPUs 0 and 2 are "visible" to PyTorch and are mapped to `cuda:0` and `cuda:1` respectively.  
 To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):

-
 ```bash
 CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
 ```
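
The remapping described above can be sketched in plain Python. This is only an illustration: `visible_device_map` is a hypothetical helper (not a PyTorch API), and it assumes the variable holds comma-separated integer indices rather than GPU UUIDs.

```python
def visible_device_map(cuda_visible_devices: str) -> dict[str, int]:
    """Map logical PyTorch device names to physical GPU indices (illustrative helper)."""
    physical_ids = [int(part) for part in cuda_visible_devices.split(",") if part.strip()]
    # The position in the list becomes the logical index that PyTorch sees.
    return {f"cuda:{logical}": physical for logical, physical in enumerate(physical_ids)}


print(visible_device_map("0,2"))  # {'cuda:0': 0, 'cuda:1': 2}
print(visible_device_map("2,0"))  # {'cuda:0': 2, 'cuda:1': 0}
```

The order of the list is what matters: swapping the entries swaps which physical GPU is addressed as `cuda:0`.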
@ -108,7 +107,6 @@ To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
 ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ...
 ```

-
 You can also control the order of Intel XPUs with:

 ```bash
@ -120,7 +118,5 @@ For more information about device enumeration and sorting on Intel XPU, please r
 </hfoption>
 </hfoptions>

-
-
 > [!WARNING]
 > Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line.
--- a/docs/source/en/auto_docstring.md
+++ b/docs/source/en/auto_docstring.md
@ -145,7 +145,6 @@ Arguments can also be passed directly to `@auto_docstring` for more control. Use

 The `Returns` and `Examples` parts of the docstring can also be manually specified.

-
 ```python
 MODEL_COMMON_CUSTOM_ARGS = r"""
    common_arg_1 (`torch.Tensor`, *optional*, defaults to `default_value`):
@ -202,7 +201,6 @@ There are some rules for documenting different types of arguments and they're li

    If a standard argument behaves differently in your model, then you can override it locally in a `r""" """` block. This local definition has a higher priority. For example, the `labels` argument is often customized per model and typically requires overriding.

-
 - New or custom arguments should be documented within an `r""" """` block after the signature if it is a function or in the `__init__` method's docstring if it is a class.

    ```py
@ -212,9 +210,9 @@ There are some rules for documenting different types of arguments and they're li
        This can span multiple lines.
    ```

-    * Include `type` in backticks.
-    * Add *optional* if the argument is not required or has a default value.
-    * Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`.
+  * Include `type` in backticks.
+  * Add *optional* if the argument is not required or has a default value.
+  * Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`.

    These arguments can also be passed to `@auto_docstring` as a `custom_args` argument. It is used to define the docstring block for new arguments once if they are repeated in multiple places in the modeling file.

--- a/docs/source/en/cache_explanation.md
+++ b/docs/source/en/cache_explanation.md
@ -62,8 +62,6 @@ Refer to the table below to compare how caching improves efficiency.
 | for each step, recompute all previous `K` and `V`  | for each step, only compute current `K` and `V` |
 | attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) |

-
-
 ## Cache class

 A basic KV cache interface takes a key and value tensor for the current token and returns the updated `K` and `V` tensors. This is internally managed by a model's `forward` method.
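
As a rough sketch of that interface, the toy class below uses plain Python lists in place of tensors. `ToyKVCache` and its `update` signature are assumptions made for illustration and are not the library's [`Cache`] implementation.

```python
class ToyKVCache:
    """Toy stand-in for a KV cache: plain lists replace tensors (illustrative only)."""

    def __init__(self):
        self.keys = []    # one entry per cached token
        self.values = []  # one entry per cached token

    def update(self, key, value):
        # Append the current token's key/value and return everything cached so far.
        self.keys.append(key)
        self.values.append(value)
        return self.keys, self.values


cache = ToyKVCache()
for step in range(3):
    # Only the current token's key/value is computed at each step.
    keys, values = cache.update(key=[float(step)], value=[float(step) * 2])

print(len(keys), len(values))  # 3 3
```

Each call adds one entry, so the work per step stays constant while the stored `K` and `V` grow with the sequence.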
@ -138,12 +136,11 @@ The cache position tracks where to insert new tokens in the attention cache. It

 Cache position is used internally for two purposes:

-1. Selecting new tokens to process in the input sequence and ensuring only tokens that haven’t been cached yet are passed to the model's `forward`.
+1. Selecting new tokens to process in the input sequence and ensuring only tokens that haven't been cached yet are passed to the model's `forward`.
 2. Storing key/value pairs at the correct positions in the cache. This is especially important for fixed-size caches, which pre-allocate a specific cache length.

 The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots.

-
 ```py
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache, infer_device
@ -160,12 +157,12 @@ generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10)

 ```

-
 ## Legacy cache format

 Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].

 The legacy format is essentially the same data structure but organized differently.
+
 - It's a tuple of tuples, where each inner tuple contains the key and value tensors for a layer.
 - The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`.
 - The format is less flexible and doesn't support features like quantization or offloading.
--- a/docs/source/en/chat_extras.md
+++ b/docs/source/en/chat_extras.md
@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

 # Tool use

-Chat models are commonly trained with support for "function-calling" or "tool-use". Tools are functions supplied by the user, which the model can choose to call as part of its response. For example, models could have access to a calculator tool to perform arithmetic without having to it internally.
+Chat models are commonly trained with support for "function-calling" or "tool-use". Tools are functions supplied by the user, which the model can choose to call as part of its response. For example, models could have access to a calculator tool to perform arithmetic without having to perform the computation internally.

 This guide will demonstrate how to define tools, how to pass them to a chat model, and how to handle the model's output when it calls a tool.

@ -29,7 +29,6 @@ the arguments, argument types, and function docstring are parsed in order to gen
 Although passing Python functions is very convenient, the parser can only handle [Google-style](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)
 docstrings. Refer to the examples below for how to format a tool-ready function.

-
 ```py
 def get_current_temperature(location: str, unit: str):
    """
@ -103,7 +102,6 @@ Hold the call in the `tool_calls` key of an `assistant` message. This is the rec
 > [!WARNING]
 > Although `tool_calls` is similar to the OpenAI API, the OpenAI API uses a JSON string as its `tool_calls` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.
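
If a tool call arrives in the OpenAI style, one possible way to adapt it is to parse the JSON string into a dict before appending it to the chat. The helper below is a minimal sketch; `normalize_tool_call` is a hypothetical name, not part of Transformers.

```python
import json


def normalize_tool_call(tool_call: dict) -> dict:
    """Return a copy of the tool call whose `arguments` is a dict (illustrative helper)."""
    function = dict(tool_call["function"])
    arguments = function.get("arguments")
    if isinstance(arguments, str):
        # OpenAI-style payloads carry the arguments as a JSON-encoded string.
        function["arguments"] = json.loads(arguments)
    return {"type": "function", "function": function}


openai_style = {
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "arguments": '{"location": "Paris, France", "unit": "celsius"}',
    },
}
normalized = normalize_tool_call(openai_style)
print(normalized["function"]["arguments"]["unit"])  # celsius
```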

-
 ```py
 tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
 messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
@ -131,7 +129,6 @@ The temperature in Paris, France right now is 22°C.<|im_end|>
 > Although the key in the assistant message is called `tool_calls`, in most cases, models only emit a single tool call at a time. Some older models emit multiple tool calls at the same time, but this is a
 > significantly more complex process, as you need to handle multiple tool responses at once and disambiguate them, often using tool call IDs. Please refer to the model card to see exactly what format a model expects for tool calls.

-
 ## JSON schemas

 Another way to define tools is by passing a [JSON schema](https://json-schema.org/learn/getting-started-step-by-step).
--- a/docs/source/en/chat_templating.md
+++ b/docs/source/en/chat_templating.md
@ -43,6 +43,7 @@ chat = [

 tokenizer.apply_chat_template(chat, tokenize=False)
 ```
+
 ```md
 <s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]
 ```
@ -62,6 +63,7 @@ chat = [

 tokenizer.apply_chat_template(chat, tokenize=False)
 ```
+
 ```md
 <|user|>\nHello, how are you?</s>\n<|assistant|>\nI'm doing great. How can I help you today?</s>\n<|user|>\nI'd like to show off how chat templating works!</s>\n
 ```
@ -75,9 +77,9 @@ Mistral-7B-Instruct uses `[INST]` and `[/INST]` tokens to indicate the start and

 The input to `apply_chat_template` should be structured as a list of dictionaries with `role` and `content` keys. The `role` key specifies the speaker, and the `content` key contains the message. The common roles are:

- - `user` for messages from the user
- - `assistant` for messages from the model
- - `system` for directives on how the model should act (usually placed at the beginning of the chat)
+- `user` for messages from the user
+- `assistant` for messages from the model
+- `system` for directives on how the model should act (usually placed at the beginning of the chat)

 [`apply_chat_template`] takes this list and returns a formatted sequence. Set `tokenize=True` if you want to tokenize the sequence.
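
The expected input structure can be checked before calling the template. The snippet below is a sketch only: `validate_chat` is a hypothetical helper, and it checks just the presence of the `role` and `content` keys described above.

```python
def validate_chat(chat):
    """Check that `chat` is a list of dicts with `role` and `content` keys (illustrative helper)."""
    if not isinstance(chat, list):
        raise TypeError("chat must be a list of message dicts")
    for index, message in enumerate(chat):
        if not isinstance(message, dict):
            raise TypeError(f"message {index} is not a dict")
        missing = {"role", "content"} - set(message)
        if missing:
            raise ValueError(f"message {index} is missing keys: {sorted(missing)}")
    return chat


chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]
validate_chat(chat)  # passes silently

try:
    validate_chat([{"role": "user"}])
except ValueError as error:
    print(error)  # message 0 is missing keys: ['content']
```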

@ -110,6 +112,7 @@ Pass the tokenized chat to [`~GenerationMixin.generate`] to generate a response.
 outputs = model.generate(tokenized_chat, max_new_tokens=128) 
 print(tokenizer.decode(outputs[0]))
 ```
+
 ```md
 <|system|>
 You are a friendly chatbot who always responds in the style of a pirate</s>
@ -121,7 +124,7 @@ Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopte

 > [!WARNING]
 > Some tokenizers add special `<bos>` and `<eos>` tokens. Chat templates should already include all the necessary special tokens, and adding additional special tokens is often incorrect or duplicated, hurting model performance. When you format text with `apply_chat_template(tokenize=False)`, make sure you set `add_special_tokens=False` if you tokenize later to avoid duplicating these tokens.
-> This isn’t an issue if you use `apply_chat_template(tokenize=True)`, which means it's usually the safer option!
+> This isn't an issue if you use `apply_chat_template(tokenize=True)`, which means it's usually the safer option!

 ### add_generation_prompt

@ -135,6 +138,7 @@ Let's see an example to understand what `add_generation_prompt` is actually doin
 tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
 tokenized_chat
 ```
+
 ```md
 <|im_start|>user
 Hi there!<|im_end|>
@ -150,6 +154,7 @@ Now, let's format the same chat with `add_generation_prompt=True`:
 tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 tokenized_chat
 ```
+
 ```md
 <|im_start|>user
 Hi there!<|im_end|>
@ -163,7 +168,7 @@ Can I ask a question?<|im_end|>

 When `add_generation_prompt=True`, `<|im_start|>assistant` is added at the end to indicate the start of an `assistant` message. This lets the model know an `assistant` response is next.
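
The effect can be imitated with a simplified formatter. This is a sketch under assumptions: `render_chatml` is a hypothetical function that mimics the ChatML-style layout shown above, not the model's actual Jinja template.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Format messages in a simplified ChatML-like layout (illustrative only)."""
    text = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model writes a reply instead of continuing the user turn.
        text += "<|im_start|>assistant\n"
    return text


messages = [{"role": "user", "content": "Can I ask a question?"}]
print(render_chatml(messages, add_generation_prompt=True))
```

Without the flag, the rendered text ends at the last `<|im_end|>` and nothing signals that an `assistant` turn should follow.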

-Not all models require generation prompts, and some models, like [Llama](./model_doc/llama), don’t have any special tokens before the `assistant` response. In these cases, [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) has no effect.
+Not all models require generation prompts, and some models, like [Llama](./model_doc/llama), don't have any special tokens before the `assistant` response. In these cases, [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) has no effect.

 ### continue_final_message

@ -182,14 +187,13 @@ model.generate(**formatted_chat)
 ```

 > [!WARNING]
-> You shouldn’t use [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) and [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) together. The former adds tokens that start a new message, while the latter removes end of sequence tokens. Using them together returns an error.
-
-[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don’t support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.
+> You shouldn't use [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) and [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) together. The former adds tokens that start a new message, while the latter removes end of sequence tokens. Using them together returns an error.

+[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don't support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.

 ## Model training

-Training a model with a chat template is a good way to ensure the template matches the tokens the model was trained on. Apply the chat template as a preprocessing step to your dataset. Set `add_generation_prompt=False` because the additional tokens to prompt an assistant response aren’t helpful during training.
+Training a model with a chat template is a good way to ensure the template matches the tokens the model was trained on. Apply the chat template as a preprocessing step to your dataset. Set `add_generation_prompt=False` because the additional tokens to prompt an assistant response aren't helpful during training.

 An example of preprocessing a dataset with a chat template is shown below.

@ -212,6 +216,7 @@ dataset = Dataset.from_dict({"chat": [chat1, chat2]})
 dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
 print(dataset['formatted_chat'][0])
 ```
+
 ```md
 <|user|>
 Which is bigger, the moon or the sun?</s>
--- a/docs/source/en/chat_templating_multimodal.md
+++ b/docs/source/en/chat_templating_multimodal.md
@ -18,7 +18,6 @@ rendered properly in your Markdown viewer.

 Multimodal chat models accept inputs like images, audio or video, in addition to text. The `content` key in a multimodal chat history is a list containing multiple items of different types. This is unlike text-only chat models whose `content` key is a single string.

-
 In the same way the [Tokenizer](./fast_tokenizer) class handles chat templates and tokenization for text-only models,
 the [Processor](./processors) class handles preprocessing, tokenization and chat templates for multimodal models. Their [`~ProcessorMixin.apply_chat_template`] methods are almost identical.

@ -46,7 +45,7 @@ messages = [
 ]
 ```

-Create an [`ImageTextToTextPipeline`] and pass the chat to it. For large models, setting [device_map=“auto”](./models#big-model-inference) helps load the model quicker and automatically places it on the fastest device available. Setting the data type to [auto](./models#model-data-type) also helps save memory and improve speed.
+Create an [`ImageTextToTextPipeline`] and pass the chat to it. For large models, setting [device_map="auto"](./models#big-model-inference) helps load the model quicker and automatically places it on the fastest device available. Setting the data type to [auto](./models#model-data-type) also helps save memory and improve speed.

 ```python
 import torch
@ -57,8 +56,7 @@ out = pipe(text=messages, max_new_tokens=128)
 print(out[0]['generated_text'][-1]['content'])
 ```

-
-```
+```text
 Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
 ```

@ -69,7 +67,6 @@ Aside from the gradual descent from pirate-speak into modern American English (i
 Like [text-only models](./chat_templating), use the [`~ProcessorMixin.apply_chat_template`] method to prepare the chat messages for multimodal models.
 This method handles the tokenization and formatting of the chat messages, including images and other media types. The resulting inputs are passed to the model for generation.

-
 ```python
 from transformers import AutoProcessor, AutoModelForImageTextToText

@ -99,8 +96,7 @@ processed_chat = processor.apply_chat_template(messages, add_generation_prompt=T
 print(list(processed_chat.keys()))
 ```

-
-```
+```text
 ['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw']
 ```

@ -113,14 +109,13 @@ print(processor.decode(out[0]))

 The decoded output contains the full conversation so far, including the user message and the placeholder tokens that contain the image information. You may need to trim the previous conversation from the output before displaying it to the user.

-
 ## Video inputs

 Some vision models also support video inputs. The message format is very similar to the format for [image inputs](#image-inputs).

 - The content `"type"` should be `"video"` to indicate the content is a video.
 - For videos, it can be a link to the video (`"url"`) or it could be a file path (`"path"`). Videos loaded from a URL can only be decoded with [PyAV](https://pyav.basswood-io.com/docs/stable/) or [Decord](https://github.com/dmlc/decord).
- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if you’ve already preprocessed or decoded video frames elsewhere in memory (e.g., using OpenCV, decord, or torchvision). You don't need to save to files or store it in an URL.
+- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if you've already preprocessed or decoded video frames elsewhere in memory (e.g., using OpenCV, decord, or torchvision). You don't need to save the frames to a file or host them at a URL.

 > [!WARNING]
 > Loading a video from `"url"` is only supported by the PyAV or Decord backends.
@ -148,6 +143,7 @@ messages = [
 ```

 ### Example: Passing decoded video objects
+
 ```python
 import numpy as np

@ -167,7 +163,9 @@ messages = [
    },
 ]
 ```
+
 You can also use the existing `load_video()` function to load a video, edit it in memory, and pass it in the messages.
+
 ```python

 # Make sure a video backend library (pyav, decord, or torchvision) is available.
@ -200,7 +198,6 @@ Pass `messages` to [`~ProcessorMixin.apply_chat_template`] to tokenize the input

 The `num_frames` parameter controls how many frames to uniformly sample from the video. Each checkpoint has a maximum frame count it was pretrained with and exceeding this count can significantly lower generation quality. It's important to choose a frame count that fits both the model capacity and your hardware resources. If `num_frames` isn't specified, the entire video is loaded without any frame sampling.

-
 ```python
 processed_chat = processor.apply_chat_template(
    messages,
@ -265,4 +262,3 @@ print(processed_chat.keys())

 </hfoption>
 </hfoptions>
-
--- a/docs/source/en/chat_templating_writing.md
+++ b/docs/source/en/chat_templating_writing.md
@ -18,7 +18,6 @@ rendered properly in your Markdown viewer.

 A chat template is a [Jinja](https://jinja.palletsprojects.com/en/stable/templates/) template stored in the tokenizer's [chat_template](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.chat_template) attribute. Jinja is a templating language that allows you to write Python-like code and syntax.

-
 ```jinja
 {%- for message in messages %}
    {{- '<|' + message['role'] + '|>\n' }}
@ -108,7 +107,6 @@ We strongly recommend using `-` to ensure only the intended content is printed.

 ### Special variables and callables

-
 The only constants in a template are the `messages` variable and the `add_generation_prompt` boolean. However, you have
 access to **any other keyword arguments that are passed** to the [`~PreTrainedTokenizerBase.apply_chat_template`] method.

@ -133,7 +131,7 @@ Make the changes below to ensure compatibility across all Jinja implementations.

 ### Big templates

-Newer models or models with features like [tool-calling](./chat_extras#tools) and [RAG](./chat_extras#retrieval-augmented-generation-rag) require larger templates that can be longer than 100 lines. It may be easier to write larger templates in a separate file. The line numbers in the separate file corresponds exactly to the line numbers in template parsing or execution errors, making it easier to debug any potential issues.
+Newer models or models with features like [tool-calling](./chat_extras) and RAG require larger templates that can be longer than 100 lines. It may be easier to write larger templates in a separate file. The line numbers in the separate file correspond exactly to the line numbers in template parsing or execution errors, making it easier to debug any potential issues.

 Write the template in a separate file and extract it to the chat template.

@ -190,7 +188,7 @@ The example below shows how a tool is defined in JSON schema format.

 An example of handling tool definitions in a chat template is shown below. The specific tokens and layouts should be changed to match the ones the model was trained with.

-```
+```jinja
 {%- if tools %}
    {%- for tool in tools %}
        {{- '<tool>' + tool['function']['name'] + '\n' }}
@ -228,7 +226,7 @@ Tool calls are generally passed in the `tool_calls` key of an `"assistant”` me

 A common pattern for handling tool calls is shown below. You can use this as a starting point, but make sure your template actually matches the format the model was trained with!

-```
+```jinja
 {%- if message['role'] == 'assistant' and 'tool_calls' in message %}
    {%- for tool_call in message['tool_calls'] %}
            {{- '<tool_call>' + tool_call['function']['name'] + '\n' + tool_call['function']['arguments']|tojson + '\n</tool_call>' }}
@ -251,7 +249,7 @@ Tool responses are message dicts with the `tool` role. They are much simpler tha

 Some templates may not even need the `name` key, in which case, you can write your template to only read the `content` key.

-```
+```jinja
 {%- if message['role'] == 'tool' %}
    {{- "<tool_result>" + message['content'] + "</tool_result>" }}
 {%- endif %}
--- a/docs/source/en/conversations.md
+++ b/docs/source/en/conversations.md
@ -48,7 +48,6 @@ transformers chat -h

 The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating). It uses the `transformers serve` CLI under the hood ([docs](./serving.md#serve-cli)).

-
 ## TextGenerationPipeline

 [`TextGenerationPipeline`] is a high-level text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is [properly formatted](./llm_tutorial#wrong-prompt-format).
--- a/docs/source/en/cursor.md
+++ b/docs/source/en/cursor.md
@ -21,9 +21,10 @@ where `port` is the port used by `transformers serve` (`8000` by default). On th
 </h3>

 You're now ready to set things up on the app side! In Cursor, while you can't set a new provider, you can change the endpoint for OpenAI requests in the model selection settings. First, navigate to "Settings" > "Cursor Settings", "Models" tab, and expand the "API Keys" collapsible. To set your `transformers serve` endpoint, follow this order:
+
 1. Unselect ALL models in the list above (e.g. `gpt4`, ...);
 2. Add and select the model you want to use (e.g. `Qwen/Qwen3-4B`)
-3. Add some random text to OpenAI API Key. This field won't be used, but it can’t be empty;
+3. Add some random text to OpenAI API Key. This field won't be used, but it can't be empty;
 4. Add the https address from `ngrok` to the "Override OpenAI Base URL" field, appending `/v1` to the address (i.e. `https://(...).ngrok-free.app/v1`);
 5. Hit "Verify".

@ -38,5 +39,3 @@ You are now ready to use your local model in Cursor! For instance, if you toggle
 <h3 align="center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_serve_cursor_chat.png"/>
 </h3>
-
-
--- a/docs/source/en/debugging.md
+++ b/docs/source/en/debugging.md
@ -35,7 +35,7 @@ pip install deepspeed

 PyTorch comes with its own CUDA toolkit, but to use DeepSpeed with PyTorch, you need to have an identical version of CUDA installed system-wide. For example, if you installed PyTorch with `cudatoolkit==10.2` in your Python environment, then you'll also need to have CUDA 10.2 installed everywhere.

-The exact location can vary from system to system, but `usr/local/cuda-10.2` is the most common location on many Unix systems. When CUDA is correctly set up and added to your `PATH` environment variable, you can find the installation location with the following command.
+The exact location can vary from system to system, but `/usr/local/cuda-10.2` is the most common location on many Unix systems. When CUDA is correctly set up and added to your `PATH` environment variable, you can find the installation location with the following command.

 ```bash
 which nvcc
@ -45,7 +45,7 @@ which nvcc

 You may also have more than one CUDA toolkit installed on your system.

-```bash
+```text
 /usr/local/cuda-10.2
 /usr/local/cuda-11.0
 ```
--- a/docs/source/en/deepspeed.md
+++ b/docs/source/en/deepspeed.md
@ -294,7 +294,7 @@ Consider running a [benchmark](https://github.com/microsoft/DeepSpeed/issues/998

 The example ZeRO-3 and ZeRO-Infinity config below sets most of the parameter values to `auto`, but you can also configure these values manually.

-```yaml
+```json
 {
    "fp16": {
        "enabled": "auto",
@ -383,7 +383,7 @@ Gradient checkpointing saves memory by only storing *some* of the intermediate a

 The batch size can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets `train_micro_batch_size_per_gpu` to the value of `per_device_train_batch_size` and `train_batch_size` to the value of `world_size * per_device_train_batch_size * gradient_accumulation_steps`.
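
The arithmetic can be sketched as follows. `deepspeed_batch_sizes` is a hypothetical helper written for illustration. It assumes the micro batch size is taken from `per_device_train_batch_size` and relies on DeepSpeed's rule that `train_batch_size` equals the micro batch size times the gradient accumulation steps times the number of GPUs.

```python
def deepspeed_batch_sizes(world_size, per_device_train_batch_size, gradient_accumulation_steps):
    """Compute the two batch-size values for a DeepSpeed config (illustrative helper)."""
    micro_batch = per_device_train_batch_size
    # Effective batch size consumed per optimizer step across all GPUs.
    total_batch = world_size * per_device_train_batch_size * gradient_accumulation_steps
    return {
        "train_micro_batch_size_per_gpu": micro_batch,
        "train_batch_size": total_batch,
    }


print(deepspeed_batch_sizes(world_size=8, per_device_train_batch_size=4, gradient_accumulation_steps=2))
# {'train_micro_batch_size_per_gpu': 4, 'train_batch_size': 64}
```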

-```yaml
+```json
 {
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto"
@ -400,7 +400,7 @@ Reduce operations are lossy, for example, when gradients are averaged across mul

 Choose the communication data type by setting the `communication_data_type` parameter in the config file. For example, choosing fp32 adds a small amount of overhead but ensures the reduction is accumulated in fp32 and, once complete, downcast to whichever half-precision data type you're training in.

-```yaml
+```json
 {
    "communication_data_type": "fp32"
 }
@ -412,7 +412,7 @@ Gradient accumulation accumulates gradients over several mini-batches of data be

 Gradient accumulation can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets it to the value of `gradient_accumulation_steps`.

-```yaml
+```json
 {
    "gradient_accumulation_steps": "auto"
 }
@ -424,7 +424,7 @@ Gradient clipping is useful for preventing exploding gradients which can lead to

 Gradient clipping can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets it to the value of `max_grad_norm`.

-```yaml
+```json
 {
    "gradient_clipping": "auto"
 }
@ -439,7 +439,7 @@ Mixed precision accelerates training speed by performing some calculations in ha

 Train in fp32 if a model wasn't pretrained in mixed precision because it may cause underflow or overflow errors. Disable fp16, the default, in this case.

-```yaml
+```json
 {
    "fp16": {
        "enabled": false
@ -454,7 +454,7 @@ For Ampere GPUs and PyTorch 1.7+, the more efficient [tf32](https://pytorch.org/

 To configure AMP-like fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically enables or disables fp16 based on the value of `fp16_backend`, and the rest of the config can be set by you. fp16 is enabled from the command line when the following arguments are passed: `--fp16`, `--fp16_backend amp` or `--fp16_full_eval`.

-```yaml
+```json
 {
    "fp16": {
        "enabled": "auto",
@ -471,7 +471,7 @@ For additional DeepSpeed fp16 training options, take a look at the [FP16 Trainin

 To configure Apex-like fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically configures `amp` based on the values of `fp16_backend` and `fp16_opt_level`. It can also be enabled from the command line when the following arguments are passed: `--fp16`, `--fp16_backend apex` or `--fp16_opt_level O1`.

-```yaml
+```json
 {
    "amp": {
        "enabled": "auto",
@ -486,11 +486,11 @@ To configure Apex-like fp16 mixed precision, set up the config as shown below wi
 > [!TIP]
 > bf16 requires DeepSpeed 0.6.0.

-bf16 has the same dynamic range as fp32, and doesn’t require loss scaling unlike fp16. However, if you use [gradient accumulation](#gradient-accumulation) with bf16, gradients are accumulated in bf16 which may not be desirable because the lower precision can lead to lossy accumulation.
+bf16 has the same dynamic range as fp32, and doesn't require loss scaling unlike fp16. However, if you use [gradient accumulation](#gradient-accumulation) with bf16, gradients are accumulated in bf16 which may not be desirable because the lower precision can lead to lossy accumulation.
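The lossy accumulation mentioned above can be illustrated without a GPU. The sketch below is a pure-Python emulation, not DeepSpeed's behavior: the helpers `to_bf16`, `to_fp32`, and `accumulate` are hypothetical names introduced here, and bf16 is approximated by keeping the top 16 bits of the fp32 encoding with round-to-nearest-even.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of the fp32 encoding.

    Uses round-to-nearest-even; NaN and infinity are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(grad: float, steps: int, round_fn) -> float:
    """Sum `grad` `steps` times, rounding the running total after every add."""
    total = 0.0
    for _ in range(steps):
        total = round_fn(total + grad)
    return total


grad = to_bf16(1e-3)  # a small per-micro-batch gradient value
acc_bf16 = accumulate(grad, 1000, to_bf16)
acc_fp32 = accumulate(grad, 1000, to_fp32)
print(f"exact: {1000 * grad:.6f}  bf16: {acc_bf16:.6f}  fp32: {acc_fp32:.6f}")
```

In this toy run the bf16 total stalls well short of the exact sum, because each new addend eventually becomes smaller than half the spacing between neighbouring bf16 values, while the fp32 total stays close to it.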

 bf16 can be set up in the config file or enabled from the command line when the following arguments are passed: `--bf16` or `--bf16_full_eval`.

-```yaml
+```json
 {
    "bf16": {
        "enabled": "auto"
@ -514,7 +514,7 @@ DeepSpeed offers several [optimizers](https://www.deepspeed.ai/docs/config-json/

 You can set the parameters to `"auto"` or manually input your own values.

-```yaml
+```json
 {
   "optimizer": {
       "type": "AdamW",
@ -530,7 +530,7 @@ You can set the parameters to `"auto"` or manually input your own values.

 Use an unsupported optimizer by adding the following to the top level configuration.

-```yaml
+```json
 {
   "zero_allow_untested_optimizer": true
 }
@ -538,7 +538,7 @@ Use an unsupported optimizer by adding the following to the top level configurat

 From DeepSpeed 0.8.3+, if you want to use offload, you'll also need to add the following to the top level configuration because offload works best with DeepSpeed's CPU Adam optimizer.

-```yaml
+```json
 {
   "zero_force_ds_cpu_optimizer": false
 }
@ -558,7 +558,7 @@ If you don't configure the scheduler in the config file, [`Trainer`] automatical

 You can set the parameters to `"auto"` or manually input your own values.

-```yaml
+```json
 {
   "scheduler": {
         "type": "WarmupDecayLR",
@ -581,7 +581,7 @@ You can set the parameters to `"auto"` or manually input your own values.

 Resume training with a Universal checkpoint by setting `load_universal` to `true` in the config file.

-```yaml
+```json
 {
    "checkpoint": {
        "load_universal": true
@ -640,7 +640,7 @@ deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \

 A multi-node setup consists of multiple nodes, where each node has one or more GPUs running a workload. DeepSpeed expects a shared storage system, but if this is not the case, you need to adjust the config file to include a [checkpoint](https://www.deepspeed.ai/docs/config-json/#checkpoint-options) to allow loading without access to a shared filesystem.

-```yaml
+```json
 {
  "checkpoint": {
    "use_node_local_storage": true
@ -824,7 +824,7 @@ ZeRO-2 saves the model weights in fp16. To save the weights in fp16 for ZeRO-3,

 If you don't, [`Trainer`] won't save the weights in fp16 and won't create a `pytorch_model.bin` file. This is because DeepSpeed's state_dict contains a placeholder instead of the real weights, so you won't be able to load it.

-```yaml
+```json
 {
    "zero_optimization": {
        "stage": 3,
@ -986,7 +986,7 @@ NaN loss often occurs when a model is pretrained in bf16 and you try to use it w

 It is also possible that fp16 is causing overflow. For example, if your config file looks like the one below, you may see the following overflow errors in the logs.

-```yaml
+```json
 {
    "fp16": {
        "enabled": "auto",
--- a/docs/source/en/fast_tokenizers.md
+++ b/docs/source/en/fast_tokenizers.md
@ -226,7 +226,7 @@ tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")

 <Youtube id="Yffk5aydLzg"/>

-A Transformers model expects the input to be a PyTorch or NumPy tensor. A tokenizers job is to preprocess text into those tensors. Specify the framework tensor type to return with the `return_tensors` parameter.
+A Transformers model expects the input to be a PyTorch or NumPy tensor. A tokenizer's job is to preprocess text into those tensors. Specify the framework tensor type to return with the `return_tensors` parameter.

 ```py
 from transformers import AutoTokenizer
--- a/docs/source/en/generation_strategies.md
+++ b/docs/source/en/generation_strategies.md
@ -229,6 +229,7 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True)
 ## Custom generation methods

 Custom generation methods enable specialized behavior such as:
+
 - have the model continue thinking if it is uncertain;
 - roll back generation if the model gets stuck;
 - handle special tokens with custom logic;
@ -289,7 +290,7 @@ print(tokenizer.batch_decode(gen_out)[0])

 If the custom method has pinned Python requirements that your environment doesn't meet, you'll get an exception about missing requirements. For instance, [transformers-community/custom_generate_bad_requirements](https://huggingface.co/transformers-community/custom_generate_bad_requirements) has an impossible set of requirements defined in its `custom_generate/requirements.txt` file, and you'll see the error message below if you try to run it.

-```
+```text
 ImportError: Missing requirements in your local environment for `transformers-community/custom_generate_bad_requirements`:
 foo (installed: None)
 bar==0.0.0 (installed: None)
@ -301,6 +302,7 @@ Updating your Python requirements accordingly will remove this error message.
 ### Creating a custom generation method

 To create a new generation method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.
+
 1. The model you've designed your generation method with.
 2. `custom_generate/generate.py`, which contains all the logic for your custom generation method.
 3. `custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method.
@ -308,7 +310,7 @@ To create a new generation method, you need to create a new [**Model**](https://

 After you've added all required files, your repository should look like this:

-```
+```text
 your_repo/
 ├── README.md          # include the 'custom_generate' tag
 ├── config.json
@ -377,6 +379,7 @@ def generate(model, input_ids, generation_config=None, left_padding=None, **kwar
 ```

 Follow the recommended practices below to ensure your custom generation method works as expected.
+
 - Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`].
 - Pin the `transformers` version in the requirements if you use any private method/attribute in `model`.
 - Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.
@ -389,7 +392,6 @@ from .utils import some_function

 Only relative imports from the same-level `custom_generate` folder are supported. Parent/sibling folder imports are not valid. The `custom_generate` argument also works locally with any directory that contains a `custom_generate` structure. This is the recommended workflow for developing your custom generation method.

-
 #### requirements.txt

 You can optionally specify additional Python requirements in a `requirements.txt` file inside the `custom_generate` folder. These are checked at runtime and an exception will be thrown if they're missing, nudging users to update their environment accordingly.
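The runtime check described above could look roughly like the sketch below. It is not the Transformers implementation: `find_unmet_requirements` is a hypothetical helper that only understands bare package names and `name==version` pins, and it relies solely on the standard library's `importlib.metadata`.

```python
import re
from importlib import metadata


def find_unmet_requirements(requirement_lines):
    """Return one message per requirement that is missing or pinned to another version.

    Hypothetical helper: only bare names and `name==version` pins are understood.
    """
    unmet = []
    for line in requirement_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = re.fullmatch(r"([A-Za-z0-9._-]+)\s*(?:==\s*(\S+))?", line)
        if match is None:
            continue  # other version specifiers are out of scope for this sketch
        name, pinned = match.groups()
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed is None or (pinned is not None and installed != pinned):
            unmet.append(f"{line} (installed: {installed})")
    return unmet


# Hypothetical package names that are very unlikely to be installed.
problems = find_unmet_requirements(["not-a-real-package-xyz", "also-not-real-xyz==0.0.0"])
if problems:
    print("Missing requirements in your local environment:\n" + "\n".join(problems))
```

A real check would typically parse the full requirement syntax (for example with the `packaging` library) instead of a regular expression.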
@ -400,7 +402,7 @@ The root level `README.md` in the model repository usually describes the model t

 For discoverability, we highly recommend adding the `custom_generate` tag to your repository. To do so, the top of your `README.md` file should look like the example below. After you push the file, you should see the tag in your repository!

-```
+```text
 ---
 library_name: transformers
 tags:
@ -411,13 +413,14 @@ tags:
 ```

 Recommended practices:
+
 - Document input and output differences in [`~GenerationMixin.generate`].
 - Add self-contained examples to enable quick experimentation.
 - Describe soft-requirements such as if the method only works well with a certain family of models.

-### Reusing `generate`’s input preparation
+### Reusing `generate`'s input preparation

-If you're adding a new decoding loop, you might want to preserve the input preparation present in `generate` (batch expansion, attention masks, logits processors, stopping criteria, etc.). You can also pass a **callable** to `custom_generate` to reuse [`~GenerationMixin.generate`]’s full preparation pipeline while overriding only the decoding loop.
+If you're adding a new decoding loop, you might want to preserve the input preparation present in `generate` (batch expansion, attention masks, logits processors, stopping criteria, etc.). You can also pass a **callable** to `custom_generate` to reuse [`~GenerationMixin.generate`]'s full preparation pipeline while overriding only the decoding loop.

 ```py
 def custom_loop(model, input_ids, attention_mask, logits_processor, stopping_criteria, generation_config, **model_kwargs):
@ -438,11 +441,12 @@ output = model.generate(
 ```

 > [!TIP]
-> If you publish a `custom_generate` repository, your `generate` implementation can itself define a callable and pass it to `model.generate()`. This lets you customize the decoding loop while still benefiting from Transformers’ built-in input preparation logic.
+> If you publish a `custom_generate` repository, your `generate` implementation can itself define a callable and pass it to `model.generate()`. This lets you customize the decoding loop while still benefiting from Transformers' built-in input preparation logic.

 ### Finding custom generation methods

 You can find all custom generation methods by [searching for their custom tag](https://huggingface.co/models?other=custom_generate), `custom_generate`. In addition to the tag, we curate two collections of `custom_generate` methods:
+
 - [Custom generation methods - Community](https://huggingface.co/collections/transformers-community/custom-generation-methods-community-6888fb1da0efbc592d3a8ab6) -- a collection of powerful methods contributed by the community;
 - [Custom generation methods - Tutorials](https://huggingface.co/collections/transformers-community/custom-generation-methods-tutorials-6823589657a94940ea02cfec) -- a collection of reference implementations for methods that previously were part of `transformers`, as well as tutorials for `custom_generate`.

--- a/docs/source/en/glossary.md
+++ b/docs/source/en/glossary.md
@ -185,9 +185,9 @@ See the [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/

 The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension. There is a different model head for each task. For example:

-  * [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
-  * [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
-  * [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
+* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
+* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
+* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].

 ## I

--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@ -19,7 +19,6 @@ rendered properly in your Markdown viewer.
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
 </h3>

-
 Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
 vision, audio, video, and multimodal models, for both inference and training.

@ -35,6 +34,10 @@ There are over 1M+ Transformers [model checkpoints](https://huggingface.co/model

 Explore the [Hub](https://huggingface.com/) today to find a model and use Transformers to help you get started right away.

+Explore the [Models Timeline](./models_timeline) to discover the latest text, vision, audio and multimodal model architectures in Transformers.
+
+
+
 ## Features

 Transformers provides everything you need for inference or training with state-of-the-art pretrained models. Some of the main features include:
--- a/docs/source/en/internal/file_utils.md
+++ b/docs/source/en/internal/file_utils.md
@ -20,7 +20,6 @@ This page lists all of Transformers general utility functions that are found in

 Most of those are only useful if you are studying the general code in the library.

-
 ## Enums and namedtuples

 [[autodoc]] utils.ExplicitEnum
--- a/docs/source/en/internal/generation_utils.md
+++ b/docs/source/en/internal/generation_utils.md
@ -65,7 +65,6 @@ values. Here, for instance, it has two keys that are `sequences` and `scores`.

 We document here all output types.

-
 [[autodoc]] generation.GenerateDecoderOnlyOutput

 [[autodoc]] generation.GenerateEncoderDecoderOutput
@ -74,13 +73,11 @@ We document here all output types.

 [[autodoc]] generation.GenerateBeamEncoderDecoderOutput

-
 ## LogitsProcessor

 A [`LogitsProcessor`] can be used to modify the prediction scores of a language model head for
 generation.

-
 [[autodoc]] AlternatingCodebooksLogitsProcessor
    - __call__

@ -174,8 +171,6 @@ generation.
 [[autodoc]] WatermarkLogitsProcessor
    - __call__

-
-
 ## StoppingCriteria

 A [`StoppingCriteria`] can be used to change when to stop generation (other than EOS token). Please note that this is exclusively available to our PyTorch implementations.
@ -300,7 +295,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens
    - to_legacy_cache
    - from_legacy_cache

-
 ## Watermark Utils

 [[autodoc]] WatermarkingConfig
--- a/docs/source/en/internal/model_debugging_utils.md
+++ b/docs/source/en/internal/model_debugging_utils.md
@ -21,10 +21,8 @@ provides for it.

 Most of those are only useful if you are adding new models in the library.

-
 ## Model addition debuggers

-
 ### Model addition debugger - context manager for model adders

 This context manager is a power user tool intended for model adders. It tracks all forward calls within a model forward
@ -72,7 +70,6 @@ with model_addition_debugger_context(

 ```

-
 ### Reading results

 The debugger generates two files from the forward call, both with the same base name, but ending either with
@ -221,9 +218,9 @@ path reference to the associated `.safetensors` file. Each tensor is written to
 the state dictionary. File names are constructed using the `module_path` as a prefix with a few possible postfixes that
 are built recursively.

-*   Module inputs are denoted with the `_inputs` and outputs by `_outputs`.
-*   `list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`.
-*   `dict` instances will be postfixed with `_{key}`.
+* Module inputs are denoted with the `_inputs` and outputs by `_outputs`.
+* `list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`.
+* `dict` instances will be postfixed with `_{key}`.
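The naming rules above can be sketched as a small recursive function. This is an illustration under assumptions, not the debugger's actual code: `tensor_file_names` is a hypothetical helper, the module path and input layout are made up, and plain strings stand in for tensors.

```python
def tensor_file_names(prefix, value):
    """Recursively build file-name stems for every leaf found in `value`."""
    if isinstance(value, (list, tuple)):
        names = []
        for index, item in enumerate(value):
            # `list` and `tuple` entries are postfixed with `_{index}`
            names.extend(tensor_file_names(f"{prefix}_{index}", item))
        return names
    if isinstance(value, dict):
        names = []
        for key, item in value.items():
            # `dict` entries are postfixed with `_{key}`
            names.extend(tensor_file_names(f"{prefix}_{key}", item))
        return names
    return [prefix]  # a leaf (a tensor in practice) gets exactly one file


module_path = "layers.0.self_attn"  # hypothetical module path
inputs = {"args": ("hidden_states", "position_ids"), "kwargs": {"attention_mask": "mask"}}

for name in tensor_file_names(f"{module_path}_inputs", inputs):
    print(name)
# layers.0.self_attn_inputs_args_0
# layers.0.self_attn_inputs_args_1
# layers.0.self_attn_inputs_kwargs_attention_mask
```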

 ### Comparing between implementations

@ -231,10 +228,8 @@ Once the forward passes of two models have been traced by the debugger, one can
 below: we can see slight differences between these two implementations' key projection layer. Inputs are mostly
 identical, but not quite. Looking through the file differences makes it easier to pinpoint which layer is wrong.

-
 ![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/files_difference_debugging.png)

-
 ### Limitations and scope

 This feature will only work for torch-based models, and would require more work and case-by-case approach for say
@ -261,6 +256,7 @@ how many tests are being skipped and for which models.
 When porting models to transformers, tests fail as they should, and sometimes `test_modeling_common` feels irreconcilable with the peculiarities of our brand new model. But how can we be sure we're not breaking everything by adding a seemingly innocent skip?

 This utility:
+
 - scans all test_modeling_common methods
 - looks for times where a method is skipped
 - returns a summary json you can load as a DataFrame/inspect
@ -269,7 +265,6 @@ This utility:

 ![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/f7f671f69b88ce4967e19179172c248958d35742/transformers/tests_skipped_visualisation.png)

-
 ### Usage

 You can run the skipped test analyzer in two ways:
@ -286,7 +281,7 @@ python utils/scan_skipped_tests.py --output_dir path/to/output

 **Example output:**

-```
+```text
 🔬 Parsing 331 model test files once each...
 📝 Aggregating 224 tests...
  (224/224) test_update_candidate_strategy_with_matches_1es_3d_is_nonecodet_schedule_fa_kwargs
--- a/docs/source/en/internal/pipelines_utils.md
+++ b/docs/source/en/internal/pipelines_utils.md
@ -20,7 +20,6 @@ This page lists all the utility functions the library provides for pipelines.

 Most of those are only useful if you are studying the code of the models in the library.

-
 ## Argument handling

 [[autodoc]] pipelines.ArgumentHandler
--- a/docs/source/en/jan.md
+++ b/docs/source/en/jan.md
@ -25,7 +25,7 @@ You are now ready to chat!

 To conclude this example, let's look into a more advanced use-case. If you have a beefy machine to serve models with, but prefer using Jan on a different device, you need to add port forwarding. If you have `ssh` access from your Jan machine into your server, this can be accomplished by running the following command in your Jan machine's terminal:

-```
+```bash
 ssh -N -f -L 8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server
 ```

--- a/docs/source/en/kv_cache.md
+++ b/docs/source/en/kv_cache.md
@ -213,7 +213,7 @@ A cache can also work in iterative generation settings where there is back-and-f

 For iterative generation with a cache, start by initializing an empty cache class and then you can feed in your new prompts. Keep track of dialogue history with a [chat template](./chat_templating).

-The following example demonstrates [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If you’re using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently. It might cut out important tokens depending on how the Jinja template is written.
+The following example demonstrates [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If you're using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently. It might cut out important tokens depending on how the Jinja template is written.

 For example, some models use special `<think> ... </think>` tokens during reasoning. These could get lost during re-encoding, causing indexing issues. You might need to manually remove or adjust extra tokens from the completions to keep things stable.

--- a/docs/source/en/llm_tutorial.md
+++ b/docs/source/en/llm_tutorial.md
@ -35,6 +35,7 @@ Before you begin, it's helpful to install [bitsandbytes](https://hf.co/docs/bits
 ```bash
 !pip install -U transformers bitsandbytes
 ```
+
 Bitsandbytes supports multiple backends in addition to CUDA-based GPUs. Refer to the multi-backend installation [guide](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend) to learn more.

 Load an LLM with [`~PreTrainedModel.from_pretrained`] and add the following two parameters to reduce the memory requirements.
@ -92,6 +93,7 @@ model.generate(**inputs, num_beams=4, do_sample=True)
 ```

 [`~GenerationMixin.generate`] can also be extended with external libraries or custom code:
+
 1. the `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manipulating the next token probability distribution;
 2. the `stopping_criteria` parameter supports custom [`StoppingCriteria`] to stop text generation;
 3. other custom generation methods can be loaded through the `custom_generate` flag ([docs](generation_strategies.md/#custom-decoding-methods)).
@ -154,7 +156,6 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
 | `repetition_penalty` | `float` | Set it to `>1.0` if you're seeing the model repeat itself often. Larger values apply a larger penalty. |
 | `eos_token_id` | `list[int]` | The token(s) that will cause generation to stop. The default value is usually good, but you can specify a different token. |

-
 ## Pitfalls

 The section below covers some common issues you may encounter during text generation and how to solve them.
--- a/docs/source/en/llm_tutorial_optimization.md
+++ b/docs/source/en/llm_tutorial_optimization.md
@ -66,6 +66,7 @@ If you have access to an 8 x 80GB A100 node, you could load BLOOM as follows
 ```bash
 !pip install transformers accelerate bitsandbytes optimum
 ```
+
 ```python
 from transformers import AutoModelForCausalLM

@ -98,7 +99,8 @@ result
 ```

 **Output**:
-```
+
+```text
 Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n    return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
 ```

@ -116,7 +118,8 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
 ```

 **Output**:
-```bash
+
+```text
 29.0260648727417
 ```

@ -127,7 +130,6 @@ Note that if we had tried to run the model in full float32 precision, a whopping

 If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"dtype"`, *e.g.* [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., dtype=...)` except when the original type is float32 in which case you can use either `float16` or `bfloat16` for inference.

-
 Let's define a `flush(...)` function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.

 ```python
@ -148,6 +150,7 @@ Let's call it now for the next experiment.
 ```python
 flush()
 ```
+
 From the Accelerate library, you can also use a device-agnostic utility method called [release_memory](https://github.com/huggingface/accelerate/blob/29be4788629b772a3b722076e433b5b3b5c85da3/src/accelerate/utils/memory.py#L63), which takes various hardware backends like XPU, MLU, NPU, MPS, and more into account.

 ```python
@ -204,7 +207,8 @@ result
 ```

 **Output**:
-```
+
+```text
 Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n    return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
 ```

@ -215,15 +219,16 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
 ```

 **Output**:
-```
+
+```text
 15.219234466552734
 ```

 Significantly less! We're down to just a bit over 15 GBs and could therefore run this model on consumer GPUs like the 4090.
 We're seeing a very nice gain in memory efficiency and more or less no degradation to the model's output. However, we can also notice a slight slow-down during inference.

-
 We delete the models and flush the memory again.
+
 ```python
 del model
 del pipe
@ -245,7 +250,8 @@ result
 ```

 **Output**:
-```
+
+```text
 Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n    return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument
 ```

@ -256,7 +262,8 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
 ```

 **Output**:
-```
+
+```text
 9.543574333190918
 ```

@ -270,6 +277,7 @@ Also note that inference here was again a bit slower compared to 8-bit quantizat
 del model
 del pipe
 ```
+
 ```python
 flush()
 ```
@ -384,6 +392,7 @@ def alternating(list1, list2):
 -----
 """
 ```
+
 For demonstration purposes, we duplicate the system prompt by ten so that the input length is long enough to observe Flash Attention's memory savings.
 We append the original text prompt `"Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"`

@ -413,7 +422,8 @@ result
 ```

 **Output**:
-```
+
+```text
 Generated in 10.96854019165039 seconds.
 Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n   return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
 ````
@ -429,7 +439,8 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
 ```

 **Output**:
-```bash
+
+```text
 37.668193340301514
 ```

@ -460,7 +471,8 @@ result
 ```

 **Output**:
-```
+
+```text
 Generated in 3.0211617946624756 seconds.
 Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n   return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
 ```
@ -474,7 +486,8 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
 ```

 **Output**:
-```
+
+```text
 32.617331981658936
 ```

@ -604,7 +617,8 @@ generated_text
 ```

 **Output**:
-```
+
+```text
 shape of input_ids torch.Size([1, 21])
 shape of input_ids torch.Size([1, 22])
 shape of input_ids torch.Size([1, 23])
@ -641,7 +655,8 @@ generated_text
 ```

 **Output**:
-```
+
+```text
 shape of input_ids torch.Size([1, 1])
 length of key-value cache 20
 shape of input_ids torch.Size([1, 1])
@ -675,7 +690,7 @@ Note that, despite our advice to use key-value caches, your LLM output may be sl

 The key-value cache is especially useful for applications such as chat where multiple passes of auto-regressive decoding are required. Let's look at an example.

-```
+```text
 User: How many people live in France?
 Assistant: Roughly 75 million people live in France
 User: And how many are in Germany?
@ -712,7 +727,8 @@ tokenizer.batch_decode(generation_output.sequences)[0][len(prompt):]
 ```

 **Output**:
-```
+
+```text
 is a modified version of the function that returns Mega bytes instead.

 def bytes_to_megabytes(bytes):
@ -733,7 +749,8 @@ config = model.config
 ```

 **Output**:
-```
+
+```text
 7864320000
 ```
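The figure above can be reproduced by hand. The sketch below assumes a 16,000-token sequence and a hypothetical configuration with 40 layers, 48 attention heads, and a head dimension of 128; these values are assumptions chosen because they yield the same number, and `kv_cache_floats` is a helper name introduced here.

```python
def kv_cache_floats(seq_len: int, n_layer: int, n_kv_heads: int, head_dim: int) -> int:
    """Number of values held in the key-value cache for a single sequence."""
    # Factor 2: one key vector and one value vector per token, layer, and head.
    return 2 * seq_len * n_layer * n_kv_heads * head_dim


def bytes_to_giga_bytes(num_bytes: int) -> float:
    return num_bytes / 1024 / 1024 / 1024


n_floats = kv_cache_floats(seq_len=16_000, n_layer=40, n_kv_heads=48, head_dim=128)
print(n_floats)  # 7864320000

# Memory footprint at 2 bytes per value (float16 or bfloat16).
cache_gb = bytes_to_giga_bytes(2 * n_floats)
print(cache_gb)

# With a single shared key-value head, the cache shrinks by the number of heads.
n_floats_single_head = kv_cache_floats(seq_len=16_000, n_layer=40, n_kv_heads=1, head_dim=128)
print(n_floats // n_floats_single_head)  # 48
```

Because the cache grows linearly with the number of key-value heads, sharing key-value projections across heads reduces it by the same factor.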

@ -773,7 +790,6 @@ The most notable application of GQA is [Llama-v2](https://huggingface.co/meta-ll

 > As a conclusion, it is strongly recommended to make use of either GQA or MQA if the LLM is deployed with auto-regressive decoding and is required to handle large input sequences as is the case for example for chat.

-
 ## Conclusion

 The research community is constantly coming up with new, nifty ways to speed up inference time for ever-larger LLMs. As an example, one such promising research direction is [speculative decoding](https://huggingface.co/papers/2211.17192) where "easy tokens" are generated by smaller, faster language models and only "hard tokens" are generated by the LLM itself. Going into more detail is out of the scope of this notebook, but you can read more in this [nice blog post](https://huggingface.co/blog/assisted-generation).
--- a/docs/source/en/main_classes/callback.md
+++ b/docs/source/en/main_classes/callback.md
@ -54,7 +54,6 @@ The main class that implements callbacks is [`TrainerCallback`]. It gets the
 Trainer's internal state via [`TrainerState`], and can take some actions on the training loop via
 [`TrainerControl`].

-
 ## Available Callbacks

 Here is the list of the available [`TrainerCallback`] in the library:
--- a/docs/source/en/main_classes/configuration.md
+++ b/docs/source/en/main_classes/configuration.md
@ -24,7 +24,6 @@ Each derived config class implements model specific attributes. Common attribute
 `hidden_size`, `num_attention_heads`, and `num_hidden_layers`. Text models further implement:
 `vocab_size`.

-
 ## PretrainedConfig

 [[autodoc]] PretrainedConfig
--- a/docs/source/en/main_classes/data_collator.md
+++ b/docs/source/en/main_classes/data_collator.md
@ -25,7 +25,6 @@ on the formed batch.

 Examples of use can be found in the [example scripts](../examples) or [example notebooks](../notebooks).

-
 ## Default data collator

 [[autodoc]] data.data_collator.default_data_collator
--- a/docs/source/en/main_classes/executorch.md
+++ b/docs/source/en/main_classes/executorch.md
@ -15,14 +15,12 @@ rendered properly in your Markdown viewer.

 -->

-
 # ExecuTorch

 [`ExecuTorch`](https://github.com/pytorch/executorch) is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.

 ExecuTorch introduces well defined entry points to perform model, device, and/or use-case specific optimizations such as backend delegation, user-defined compiler transformations, memory planning, and more. The first step in preparing a PyTorch model for execution on an edge device using ExecuTorch is to export the model. This is achieved through the use of a PyTorch API called [`torch.export`](https://pytorch.org/docs/stable/export.html).

-
 ## ExecuTorch Integration

 An integration point is being developed to ensure that 🤗 Transformers can be exported using `torch.export`. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in `ExecuTorch`, particularly for mobile and edge use cases.
--- a/docs/source/en/main_classes/feature_extractor.md
+++ b/docs/source/en/main_classes/feature_extractor.md
@ -18,7 +18,6 @@ rendered properly in your Markdown viewer.

 A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction from sequences, e.g., pre-processing audio files to generate Log-Mel Spectrogram features, feature extraction from images, e.g., cropping image files, but also padding, normalization, and conversion to NumPy and PyTorch tensors.

-
 ## FeatureExtractionMixin

 [[autodoc]] feature_extraction_utils.FeatureExtractionMixin
--- a/docs/source/en/main_classes/image_processor.md
+++ b/docs/source/en/main_classes/image_processor.md
@ -26,6 +26,7 @@ from transformers import AutoImageProcessor

 processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
 ```
+
 Note that `use_fast` will be set to `True` by default in a future release.

 When using a fast image processor, you can also set the `device` argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise.
@ -57,7 +58,6 @@ Here are some speed comparisons between the base and fast image processors for t

 These benchmarks were run on an [AWS EC2 g5.2xlarge instance](https://aws.amazon.com/ec2/instance-types/g5/), utilizing an NVIDIA A10G Tensor Core GPU.

-
 ## ImageProcessingMixin

 [[autodoc]] image_processing_utils.ImageProcessingMixin
@ -72,7 +72,6 @@ These benchmarks were run on an [AWS EC2 g5.2xlarge instance](https://aws.amazon

 [[autodoc]] image_processing_utils.BaseImageProcessor

-
 ## BaseImageProcessorFast

 [[autodoc]] image_processing_utils_fast.BaseImageProcessorFast
--- a/docs/source/en/main_classes/logging.md
+++ b/docs/source/en/main_classes/logging.md
@ -55,7 +55,6 @@ logger.info("INFO")
 logger.warning("WARN")
 ```

-
 All the methods of this logging module are documented below, the main ones are
 [`logging.get_verbosity`] to get the current level of verbosity in the logger and
 [`logging.set_verbosity`] to set the verbosity to the level of your choice. In order (from the least
@ -81,6 +80,7 @@ We use both in the `transformers` library. We leverage and adapt `logging`'s `ca
 management of these warning messages by the verbosity setters above.

 What does that mean for developers of the library? We should respect the following heuristics:
+
 - `warnings` should be favored for developers of the library and libraries dependent on `transformers`
 - `logging` should be used for end-users of the library using it in every-day projects

--- a/docs/source/en/main_classes/model.md
+++ b/docs/source/en/main_classes/model.md
@ -26,7 +26,6 @@ file or directory, or from a pretrained model configuration provided by the libr

 The other methods that are common to each model are defined in [`~modeling_utils.ModuleUtilsMixin`] and [`~generation.GenerationMixin`].

-
 ## PreTrainedModel

 [[autodoc]] PreTrainedModel
--- a/docs/source/en/main_classes/onnx.md
+++ b/docs/source/en/main_classes/onnx.md
@ -51,4 +51,3 @@ to export models for different types of topologies or tasks.
 ### FeaturesManager

 [[autodoc]] onnx.features.FeaturesManager
-
--- a/docs/source/en/main_classes/optimizer_schedules.md
+++ b/docs/source/en/main_classes/optimizer_schedules.md
@ -22,7 +22,6 @@ The `.optimization` module provides:
 - several schedules in the form of schedule objects that inherit from `_LRSchedule`:
 - a gradient accumulation class to accumulate the gradients of multiple batches

-
 ## AdaFactor

 [[autodoc]] Adafactor
--- a/docs/source/en/main_classes/output.md
+++ b/docs/source/en/main_classes/output.md
@ -47,7 +47,6 @@ However, this is not always the case. Some models apply normalization or subsequ

 </Tip>

-
 You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you
 will get `None`. Here for instance `outputs.loss` is the loss computed by the model, and `outputs.attentions` is
 `None`.
--- a/docs/source/en/main_classes/pipelines.md
+++ b/docs/source/en/main_classes/pipelines.md
@ -81,7 +81,6 @@ for out in tqdm(pipe(KeyDataset(dataset, "file"))):

 For ease of use, a generator is also possible:

-
 ```python
 from transformers import pipeline

@ -160,7 +159,7 @@ for batch_size in [1, 8, 64, 256]:
        pass
 ```

-```
+```text
 # On GTX 970
 ------------------------------
 Streaming no batching
@ -196,8 +195,7 @@ This is a occasional very long sentence compared to the other. In that case, the
 tokens long, so the whole batch will be [64, 400] instead of [64, 4], leading to the high slowdown. Even worse, on
 bigger batches, the program simply crashes.

-
-```
+```text
 ------------------------------
 Streaming no batching
 100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 183.69it/s]
@ -245,7 +243,6 @@ multiple forward pass of a model. Under normal circumstances, this would yield i
 In order to circumvent this issue, both of these pipelines are a bit specific, they are `ChunkPipeline` instead of
 regular `Pipeline`. In short:

-
 ```python
 preprocessed = pipe.preprocess(inputs)
 model_outputs = pipe.forward(preprocessed)
@ -254,7 +251,6 @@ outputs = pipe.postprocess(model_outputs)

 Now becomes:

-
 ```python
 all_model_outputs = []
 for preprocessed in pipe.preprocess(inputs):
@ -282,7 +278,6 @@ If you want to override a specific pipeline.
 Don't hesitate to create an issue for your task at hand, the goal of the pipeline is to be easy to use and support most
 cases, so `transformers` could maybe support your use case.

-
 If you want to try simply you can:

 - Subclass your pipeline of choice
@ -302,7 +297,6 @@ my_pipeline = pipeline(model="xxxx", pipeline_class=MyPipeline)

 That should enable you to do all the custom code you want.

-
 ## Implementing a pipeline

 [Implementing a new pipeline](../add_new_pipeline)
@ -329,7 +323,6 @@ Pipelines available for audio tasks include the following.
    - __call__
    - all

-
 ### ZeroShotAudioClassificationPipeline

 [[autodoc]] ZeroShotAudioClassificationPipeline
--- a/docs/source/en/main_classes/processors.md
+++ b/docs/source/en/main_classes/processors.md
@ -17,6 +17,7 @@ rendered properly in your Markdown viewer.
 # Processors

 Processors can mean two different things in the Transformers library:
+
 - the objects that pre-process inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text)
  or [CLIP](../model_doc/clip) (text and vision)
 - deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD.
@ -71,7 +72,6 @@ Additionally, the following method can be used to load values from a data file a

 [[autodoc]] data.processors.glue.glue_convert_examples_to_features

-
 ## XNLI

 [The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the
@ -88,7 +88,6 @@ Please note that since the gold labels are available on the test set, evaluation

 An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script.

-
 ## SQuAD

 [The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that
@ -115,11 +114,9 @@ Additionally, the following method can be used to convert SQuAD examples into

 [[autodoc]] data.processors.squad.squad_convert_examples_to_features

-
 These processors as well as the aforementioned method can be used with files containing the data as well as with the
 *tensorflow_datasets* package. Examples are given below.

-
 ### Example usage

 Here is an example using the processors as well as the conversion method using data files:
--- a/docs/source/en/main_classes/text_generation.md
+++ b/docs/source/en/main_classes/text_generation.md
@ -30,15 +30,15 @@ like token streaming.
 ## GenerationConfig

 [[autodoc]] generation.GenerationConfig
-	- from_pretrained
-	- from_model_config
-	- save_pretrained
-	- update
-	- validate
-	- get_generation_mode
+    - from_pretrained
+    - from_model_config
+    - save_pretrained
+    - update
+    - validate
+    - get_generation_mode

 ## GenerationMixin

 [[autodoc]] GenerationMixin
-	- generate
-	- compute_transition_scores
+    - generate
+    - compute_transition_scores
--- a/docs/source/en/main_classes/tokenizer.md
+++ b/docs/source/en/main_classes/tokenizer.md
@ -50,7 +50,6 @@ several advanced alignment methods which can be used to map between the original
 token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
 to a given token).

-
 # Multimodal Tokenizer

 Apart from that each tokenizer can be a "multimodal" tokenizer which means that the tokenizer will hold all relevant special tokens
--- a/docs/source/en/main_classes/video_processor.md
+++ b/docs/source/en/main_classes/video_processor.md
@ -22,7 +22,6 @@ The video processor extends the functionality of image processors by allowing Vi

 When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't updated your VLM, the processor will try to load video related configurations from a file named `preprocessing_config.json`.

-
 ### Usage Example
 Here's an example of how to load a video processor with [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model:

@ -59,7 +58,6 @@ The video processor can also sample video frames using the technique best suited

 </Tip>

-
 ```python
 from transformers import AutoVideoProcessor

@ -92,4 +90,3 @@ print(processed_video_inputs.pixel_values_videos.shape)
 ## BaseVideoProcessor

 [[autodoc]] video_processing_utils.BaseVideoProcessor
-
--- a/docs/source/en/model_doc/aimv2.md
+++ b/docs/source/en/model_doc/aimv2.md
@ -25,7 +25,6 @@ The abstract from the paper is the following:

 *We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.*

-
 This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali).
 The original code can be found [here](https://github.com/apple/ml-aim).

--- a/docs/source/en/model_doc/align.md
+++ b/docs/source/en/model_doc/align.md
@ -148,6 +148,7 @@ for label, score in zip(candidate_labels, probs):
  ```

 ## Resources
+
 - Refer to the [Kakao Brain’s Open Source ViT, ALIGN, and the New COYO Text-Image Dataset](https://huggingface.co/blog/vit-align) blog post for more details.

 ## AlignConfig
--- a/docs/source/en/model_doc/aria.md
+++ b/docs/source/en/model_doc/aria.md
@ -142,7 +142,6 @@ response = processor.decode(output_ids, skip_special_tokens=True)
 print(response)
 ```

-
 ## AriaImageProcessor

 [[autodoc]] AriaImageProcessor
--- a/docs/source/en/model_doc/audio-spectrogram-transformer.md
+++ b/docs/source/en/model_doc/audio-spectrogram-transformer.md
@ -61,7 +61,7 @@ page for more information.
 SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
 `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

-```
+```py
 from transformers import ASTForAudioClassification
 model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", attn_implementation="sdpa", dtype=torch.float16)
 ...
--- a/docs/source/en/model_doc/auto.md
+++ b/docs/source/en/model_doc/auto.md
@ -23,7 +23,6 @@ automatically retrieve the relevant model given the name/path to the pretrained
 Instantiating one of [`AutoConfig`], [`AutoModel`], and
 [`AutoTokenizer`] will directly create a class of the relevant architecture. For instance

-
 ```python
 model = AutoModel.from_pretrained("google-bert/bert-base-cased")
 ```
--- a/docs/source/en/model_doc/bark.md
+++ b/docs/source/en/model_doc/bark.md
@ -86,7 +86,6 @@ Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-fe
 pip install -U flash-attn --no-build-isolation
 ```

-
 ##### Usage

 To load a model using Flash Attention 2, we can pass the `attn_implementation="flash_attention_2"` flag to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:
@ -97,7 +96,6 @@ model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_i

 ##### Performance comparison

-
 The following diagram shows the latency for the native attention implementation (no optimisation) against Better Transformer and Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1. Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase:

 <div style="text-align: center">
@ -108,7 +106,6 @@ To put this into perspective, on an NVIDIA A100 and when generating 400 semantic

 At batch size 8, on an NVIDIA A100, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%.

-
 #### Combining optimization techniques

 You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 (or 🤗 Better Transformer) all at once.
@ -165,7 +162,6 @@ Bark can generate highly realistic, **multilingual** speech as well as other aud

 The model can also produce **nonverbal communications** like laughing, sighing and crying.

-
 ```python
 >>> # Adding non-speech cues to the input text
 >>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")
@ -235,4 +231,3 @@ To save the audio, simply take the sample rate from the model config and some sc

 [[autodoc]] BarkSemanticConfig
    - all
-
--- a/docs/source/en/model_doc/bart.md
+++ b/docs/source/en/model_doc/bart.md
@ -15,7 +15,6 @@ rendered properly in your Markdown viewer.
 -->
 *This model was released on 2019-10-29 and added to Hugging Face Transformers on 2020-11-16.*

-
 <div style="float: right;">
    <div class="flex flex-wrap space-x-1">
    <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
@ -24,7 +23,7 @@ rendered properly in your Markdown viewer.
 </div>

 # BART
-[BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining objectives from BERT and GPT. It’s pretrained by corrupting text in different ways like deleting words, shuffling sentences, or masking tokens and learning how to fix it. The encoder encodes the corrupted document and the corrupted text is fixed by the decoder. As it learns to recover the original text, BART gets really good at both understanding and generating language.
+[BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining objectives from BERT and GPT. It's pretrained by corrupting text in different ways like deleting words, shuffling sentences, or masking tokens and learning how to fix it. The encoder encodes the corrupted document and the corrupted text is fixed by the decoder. As it learns to recover the original text, BART gets really good at both understanding and generating language.

 You can find all the original BART checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=bart) organization.

@ -46,6 +45,7 @@ pipeline = pipeline(
 pipeline("Plants create <mask> through a process known as photosynthesis.")

 ```
+
 </hfoption>
 <hfoption id="AutoModel">

@ -89,7 +89,7 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran

 - Inputs should be padded on the right because BERT uses absolute position embeddings.
 - The [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) checkpoint doesn't include `mask_token_id` which means it can't perform mask-filling tasks.
- BART doesn’t use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or [`~PreTrainedTokenizerBase.encode`] to get the proper splitting.
+- BART doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or [`~PreTrainedTokenizerBase.encode`] to get the proper splitting.
 - The forward pass of [`BartModel`] creates the `decoder_input_ids` if they're not passed. This can be different from other model APIs, but it is a useful feature for mask-filling tasks.
 - Model predictions are intended to be identical to the original implementation when `forced_bos_token_id=0`. This only works if the text passed to `fairseq.encode` begins with a space.
 - [`~GenerationMixin.generate`] should be used for conditional generation tasks like summarization.
--- a/docs/source/en/model_doc/barthez.md
+++ b/docs/source/en/model_doc/barthez.md
@ -31,7 +31,6 @@ You can find all of the original BARThez checkpoints under the [BARThez](https:/
 > This model was contributed by [moussakam](https://huggingface.co/moussakam).
 > Refer to the [BART](./bart) docs for more usage examples.

-
 The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.

 <hfoptions id="usage">
--- a/docs/source/en/model_doc/bartpho.md
+++ b/docs/source/en/model_doc/bartpho.md
@ -33,12 +33,9 @@ You can find all the original checkpoints under the [VinAI](https://huggingface.

 The example below demonstrates how to summarize text with [`Pipeline`] or the [`AutoModel`] class.

-
 <hfoptions id="usage">
 <hfoption id="Pipeline">

-
-
 ```python
 import torch
 from transformers import pipeline
@ -98,8 +95,6 @@ transformers run --task summarization --model vinai/bartpho-word --device 0
 </hfoption>
 </hfoptions>

-
-
 ## Notes

 - BARTpho uses the large architecture of BART with an additional layer-normalization layer on top of the encoder and decoder. The BART-specific classes should be replaced with the mBART-specific classes.
--- a/docs/source/en/model_doc/beit.md
+++ b/docs/source/en/model_doc/beit.md
@ -87,7 +87,7 @@ page for more information.
 SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
 `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

-```
+```py
 from transformers import BeitForImageClassification
 model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
 ...
@ -123,6 +123,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 - See also: [Image classification task guide](../tasks/image_classification)

 **Semantic segmentation**
+
 - [Semantic segmentation task guide](../tasks/semantic_segmentation)

 If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
--- a/docs/source/en/model_doc/bert-generation.md
+++ b/docs/source/en/model_doc/bert-generation.md
@ -13,6 +13,7 @@ specific language governing permissions and limitations under the License.
 rendered properly in your Markdown viewer.

 -->
+*This model was released on 2019-07-29 and added to Hugging Face Transformers on 2020-11-16.*

 <div style="float: right;">
    <div class="flex flex-wrap space-x-1">
--- a/docs/source/en/model_doc/bert-japanese.md
+++ b/docs/source/en/model_doc/bert-japanese.md
@ -81,7 +81,6 @@ API reference information.

 </Tip>

-
 ## BertJapaneseTokenizer

 [[autodoc]] BertJapaneseTokenizer
--- a/docs/source/en/model_doc/bertweet.md
+++ b/docs/source/en/model_doc/bertweet.md
@ -24,8 +24,7 @@ rendered properly in your Markdown viewer.

 ## BERTweet

-[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it’s pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.
-
+[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it's pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.

 You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.

@ -49,6 +48,7 @@ pipeline = pipeline(
 )
 pipeline("Plants create <mask> through a process known as photosynthesis.")
 ```
+
 </hfoption>
 <hfoption id="AutoModel">

@ -88,7 +88,8 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran
 </hfoptions>

 ## Notes
- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it’s preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
+
+- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it's preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
 - Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.

 ## BertweetTokenizer
--- a/docs/source/en/model_doc/big_bird.md
+++ b/docs/source/en/model_doc/big_bird.md
@ -47,6 +47,7 @@ pipeline = pipeline(
 )
 pipeline("Plants create [MASK] through a process known as photosynthesis.")
 ```
+
 </hfoption>
 <hfoption id="AutoModel">

@ -81,10 +82,12 @@ print(f"The predicted token is: {predicted_token}")
 ```bash
 !echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers-cli run --task fill-mask --model google/bigbird-roberta-base --device 0
 ```
+
 </hfoption>
 </hfoptions>

 ## Notes
+
 - Inputs should be padded on the right because BigBird uses absolute position embeddings.
 - BigBird supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs.
 - The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`.
--- a/docs/source/en/model_doc/bigbird_pegasus.md
+++ b/docs/source/en/model_doc/bigbird_pegasus.md
@ -52,6 +52,7 @@ Through photosynthesis, plants capture energy from sunlight using a green pigmen
 These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
 This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle.""")
 ```
+
 </hfoption>
 <hfoption id="AutoModel">

@ -77,6 +78,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
 output = model.generate(**input_ids, cache_implementation="static")
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
+
 </hfoption>
 <hfoption id="transformers-cli">

--- a/docs/source/en/model_doc/biogpt.md
+++ b/docs/source/en/model_doc/biogpt.md
@ -135,31 +135,26 @@ print(output)

 [[autodoc]] BioGptConfig

-
 ## BioGptTokenizer

 [[autodoc]] BioGptTokenizer
    - save_vocabulary

-
 ## BioGptModel

 [[autodoc]] BioGptModel
    - forward

-
 ## BioGptForCausalLM

 [[autodoc]] BioGptForCausalLM
    - forward

-
 ## BioGptForTokenClassification

 [[autodoc]] BioGptForTokenClassification
    - forward

-
 ## BioGptForSequenceClassification

 [[autodoc]] BioGptForSequenceClassification
--- a/docs/source/en/model_doc/bit.md
+++ b/docs/source/en/model_doc/bit.md
@ -36,6 +36,7 @@ The original code can be found [here](https://github.com/google-research/big_tra
 ## Usage tips

 - BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://huggingface.co/papers/1803.08494),
+
 2) [weight standardization](https://huggingface.co/papers/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant
 impact on transfer learning.

--- a/docs/source/en/model_doc/bitnet.md
+++ b/docs/source/en/model_doc/bitnet.md
@ -35,33 +35,29 @@ Several versions of the model weights are available on Hugging Face:

 * [**`microsoft/bitnet-b1.58-2B-4T-gguf`**](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf): Contains the model weights in GGUF format, compatible with the `bitnet.cpp` library for CPU inference.

-
 ### Model Details

-
 * **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework).
-    * Uses Rotary Position Embeddings (RoPE).
-    * Uses squared ReLU (ReLU²) activation in FFN layers.
-    * Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization.
-    * No bias terms in linear or normalization layers.
+  * Uses Rotary Position Embeddings (RoPE).
+  * Uses squared ReLU (ReLU²) activation in FFN layers.
+  * Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization.
+  * No bias terms in linear or normalization layers.
 * **Quantization:** Native 1.58-bit weights and 8-bit activations (W1.58A8).
-    * Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
-    * Activations are quantized to 8-bit integers using absmax quantization (per-token).
-    * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
+  * Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
+  * Activations are quantized to 8-bit integers using absmax quantization (per-token).
+  * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
 * **Parameters:** ~2 Billion
 * **Training Tokens:** 4 Trillion
-*   **Context Length:** Maximum sequence length of **4096 tokens**.
-    *   *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
+* **Context Length:** Maximum sequence length of **4096 tokens**.
+  * *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
 * **Training Stages:**
-    1.  **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
-    2.  **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
-    3.  **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
+    1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
+    2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
+    3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
 * **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256).

-
 ## Usage tips

-
 **VERY IMPORTANT NOTE ON EFFICIENCY**

 > Please do NOT expect performance efficiency gains (in terms of speed, latency, or energy consumption) when using this model with the standard transformers library.
@ -106,7 +102,6 @@ response = tokenizer.decode(chat_outputs[0][chat_input.shape[-1]:], skip_special
 print("\nAssistant Response:", response)
 ```

-
 ## BitNetConfig

 [[autodoc]] BitNetConfig
--- a/docs/source/en/model_doc/blenderbot-small.md
+++ b/docs/source/en/model_doc/blenderbot-small.md
@ -55,7 +55,6 @@ found [here](https://github.com/facebookresearch/ParlAI).
 Blenderbot Small is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
 the left.

-
 ## Resources

 - [Causal language modeling task guide](../tasks/language_modeling)
--- a/docs/source/en/model_doc/blenderbot.md
+++ b/docs/source/en/model_doc/blenderbot.md
@ -71,7 +71,6 @@ An example:
  `facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with
  [BlenderbotSmall](blenderbot-small).

-
 ## Resources

 - [Causal language modeling task guide](../tasks/language_modeling)
--- a/docs/source/en/model_doc/blip.md
+++ b/docs/source/en/model_doc/blip.md
@ -25,7 +25,6 @@ rendered properly in your Markdown viewer.

 [BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapped Language-Image Pretraining) is a vision-language pretraining (VLP) framework designed for *both* understanding and generation tasks. Most existing pretrained models are only good at one or the other. It uses a captioner to generate captions and a filter to remove the noisy captions. This increases training data quality and more effectively uses the messy web data.

-
 You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.

 > [!TIP]
@ -129,7 +128,7 @@ Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/exam
 ## BlipTextLMHeadModel

 [[autodoc]] BlipTextLMHeadModel
- forward
+    - forward

 ## BlipVisionModel

--- a/docs/source/en/model_doc/bloom.md
+++ b/docs/source/en/model_doc/bloom.md
@ -43,17 +43,19 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 - [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).

 See also:
+
 - [Causal language modeling task guide](../tasks/language_modeling)
 - [Text classification task guide](../tasks/sequence_classification)
 - [Token classification task guide](../tasks/token_classification)
 - [Question answering task guide](../tasks/question_answering)

-
 ⚡️ Inference
+
 - A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
 - A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).

 ⚙️ Training
+
 - A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed).

 ## BloomConfig
--- a/docs/source/en/model_doc/blt.md
+++ b/docs/source/en/model_doc/blt.md
@ -0,0 +1,97 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+*This model was released on 2024-12-13 and added to Hugging Face Transformers on 2025-09-19.*
+
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZC
D9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC
+        ">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>
+
+# Byte Latent Transformer (BLT)
+
+## Overview
+
+The BLT model was proposed in [Byte Latent Transformer: Patches Scale Better Than Tokens](https://huggingface.co/papers/2412.09871) by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer.
+BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.
+
+The abstract from the paper is the following:
+
+*We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference
+efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating
+more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.*
+
+## Usage Tips
+
+- **Dual Model Architecture**: BLT consists of two separately trained models:
+  - **Patcher (Entropy Model)**: A smaller transformer model that predicts byte-level entropy to determine patch boundaries and segment input.
+  - **Main Transformer Model**: The primary model that processes the patches through a Local Encoder, Global Transformer, and Local Decoder.
+
+- **Dynamic Patching**: The model uses entropy-based dynamic patching where:
+  - High-entropy regions (complex data) get shorter patches with more computational attention
+  - Low-entropy regions (predictable data) get longer patches for efficiency
+  - This allows the model to allocate compute resources where they're most needed
+
+- **Local Encoder**: Processes byte sequences with cross-attention to patch embeddings
+- **Global Transformer**: Processes patch-level representations with full attention across patches
+- **Local Decoder**: Generates output with cross-attention back to the original byte sequence
+
+- **Byte-Level Tokenizer**: Unlike traditional tokenizers that use learned vocabularies, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token ID. There is no need for a vocabulary.
+
+The model can be loaded via:
+
+<hfoption id="AutoModel">
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+tokenizer = AutoTokenizer.from_pretrained("itazap/blt-1b-hf")
+model = AutoModelForCausalLM.from_pretrained(
+    "itazap/blt-1b-hf",
+    device_map="auto",
+)
+
+prompt = "my name is"
+NUM_TOKENS_TO_GENERATE = 20  # example value; adjust as needed
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+generated_ids = model.generate(
+    **inputs, max_new_tokens=NUM_TOKENS_TO_GENERATE, do_sample=False, use_cache=False
+)
+
+print(tokenizer.decode(generated_ids[0]))
+```
+
+</hfoption>
+
+This model was contributed by [itazap](https://huggingface.co/itazap).
+The original code can be found [here](https://github.com/facebookresearch/blt).
+
+## BltConfig
+
+[[autodoc]] BltConfig
+
+[[autodoc]] BltModel
+    - forward
+
+## BltForCausalLM
+
+[[autodoc]] BltForCausalLM
+    - forward
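
Note on the `blt.md` usage tips above: the byte-level tokenizer and the entropy-based patching can be illustrated with a small standalone sketch. This is not BLT's implementation. The helper names, the `offset` parameter (a hypothetical shift that would leave room for special token IDs), and the threshold are assumptions made for illustration, and a bigram count over the input itself stands in for the trained Patcher (entropy model).

```python
import math
from collections import Counter


def bytes_to_ids(text: str, offset: int = 0) -> list[int]:
    """Map each UTF-8 byte of `text` to a token ID (byte value + hypothetical offset)."""
    return [b + offset for b in text.encode("utf-8")]


def ids_to_text(ids: list[int], offset: int = 0) -> str:
    """Invert bytes_to_ids: subtract the offset and decode the bytes as UTF-8."""
    return bytes(i - offset for i in ids).decode("utf-8", errors="replace")


def next_byte_entropies(data: bytes) -> list[float]:
    """Toy entropy model: entropy (in bits) of each byte given the previous byte,
    estimated from bigram counts over `data` itself."""
    if not data:
        return []
    follow: dict[int, Counter] = {}
    for prev, nxt in zip(data, data[1:]):
        follow.setdefault(prev, Counter())[nxt] += 1
    # The first byte has no context, so assume maximal uncertainty (8 bits).
    entropies = [8.0]
    for prev in data[:-1]:
        counts = follow[prev]
        total = sum(counts.values())
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


def split_into_patches(data: bytes, threshold: float) -> list[bytes]:
    """Start a new patch wherever the estimated next-byte entropy exceeds `threshold`."""
    entropies = next_byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


ids = bytes_to_ids("hé")  # [104, 195, 169]: "é" takes two UTF-8 bytes
patches = split_into_patches("the cat sat on the mat".encode("utf-8"), threshold=0.5)
```

In this sketch, raising the threshold yields fewer, longer patches and lowering it yields more, shorter ones, which mirrors the trade-off described in the Dynamic Patching tip.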
--- a/docs/source/en/model_doc/bridgetower.md
+++ b/docs/source/en/model_doc/bridgetower.md
@@ -54,6 +54,7 @@ The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImagePr
 encode the text and prepare the images respectively.

 The following example shows how to run contrastive learning using [`BridgeTowerProcessor`] and [`BridgeTowerForContrastiveLearning`].
+
 ```python
 >>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
 >>> import requests
@@ -76,6 +77,7 @@ The following example shows how to run contrastive learning using [`BridgeTowerP
 ```

 The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`].
+
 ```python
 >>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
 >>> import requests
@@ -130,7 +132,6 @@ Tips:
 - Please refer to [Table 5](https://huggingface.co/papers/2206.08657) for BridgeTower's performance on Image Retrieval and other down stream tasks.
 - The PyTorch version of this model is only available in torch 1.10 and higher.

-
 ## BridgeTowerConfig

 [[autodoc]] BridgeTowerConfig
@@ -177,4 +178,3 @@ Tips:

 [[autodoc]] BridgeTowerForImageAndTextRetrieval
    - forward
-