Compare commits

..

2 Commits

Author SHA1 Message Date
faefe8ea1b Update examples/README.md
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
2025-09-25 18:00:39 +02:00
8680971980 Add a line to make telemetry in examples more explicit 2025-09-24 10:36:37 +02:00
1069 changed files with 15396 additions and 19255 deletions

View File

@ -129,12 +129,6 @@ class CircleCIJob:
def to_dict(self):
env = COMMON_ENV_VARIABLES.copy()
if self.job_name != "tests_hub":
# fmt: off
# not critical
env.update({"HF_TOKEN": "".join(["h", "f", "_", "H", "o", "d", "V", "u", "M", "q", "b", "R", "m", "t", "b", "z", "F", "Q", "O", "Q", "A", "J", "G", "D", "l", "V", "Q", "r", "R", "N", "w", "D", "M", "V", "C", "s", "d"])})
# fmt: on
# Do not run tests decorated by @is_flaky on pull requests
env['RUN_FLAKY'] = os.environ.get("CIRCLE_PULL_REQUEST", "") == ""
env.update(self.additional_env)

View File

@ -39,23 +39,20 @@ members/contributors who may be interested in your PR.
Models:
- text models: @ArthurZucker @Cyrilvallez
- vision models: @yonigozlan @molbap
- audio models: @eustlb @ebezzam @vasqu
- multimodal models: @zucchini-nlp
- text models: @ArthurZucker
- vision models: @amyeroberts, @qubvel
- speech models: @eustlb
- graph models: @clefourrier
Library:
- flax: @gante and @Rocketknight1
- generate: @zucchini-nlp (visual-language models) or @gante (all others)
- continuous batching: @remi-or @ArthurZucker @McPatate
- pipelines: @Rocketknight1
- tokenizers: @ArthurZucker and @itazap
- trainer: @zach-huggingface @SunMarc
- attention: @vasqu @ArthurZucker @CyrilVallez
- model loading (from pretrained, etc): @CyrilVallez
- distributed: @3outeille @ArthurZucker @S1ro1
- CIs: @ydshieh
- tensorflow: @gante and @Rocketknight1
- tokenizers: @ArthurZucker
- trainer: @zach-huggingface, @SunMarc and @qgallouedec
- chat templates: @Rocketknight1
Integrations:
@ -63,16 +60,20 @@ Integrations:
- ray/raytune: @richardliaw, @amogkam
- Big Model Inference: @SunMarc
- quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber
- kernels: @MekkCyber @drbh
Devices/Backends:
- AMD ROCm: @ivarflakstad
- Intel XPU: @IlyasMoutawwakil
- Ascend NPU: @ivarflakstad
Documentation: @stevhliu
Research projects are not maintained and should be taken as is.
HF projects:
- accelerate: [different repo](https://github.com/huggingface/accelerate)
- datasets: [different repo](https://github.com/huggingface/datasets)
- diffusers: [different repo](https://github.com/huggingface/diffusers)
- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
Maintained examples (not research project or legacy):
- Flax: @Rocketknight1
- PyTorch: See Models above and tag the person corresponding to the modality of the example.
- TensorFlow: @Rocketknight1
-->

View File

@ -7,8 +7,8 @@ docs/ @stevhliu
/docker/ @ydshieh @ArthurZucker
# More high-level globs catch cases when specific rules later don't apply
/src/transformers/models/*/processing* @molbap @yonigozlan
/src/transformers/models/*/image_processing* @yonigozlan
/src/transformers/models/*/processing* @molbap @yonigozlan @qubvel
/src/transformers/models/*/image_processing* @qubvel
/src/transformers/models/*/image_processing_*_fast* @yonigozlan
# Owners of subsections of the library
@ -186,65 +186,65 @@ trainer_utils.py @zach-huggingface @SunMarc
/src/transformers/models/zamba/mod*_zamba* @ArthurZucker
# Vision models
/src/transformers/models/beit/mod*_beit* @yonigozlan @molbap
/src/transformers/models/bit/mod*_bit* @yonigozlan @molbap
/src/transformers/models/conditional_detr/mod*_conditional_detr* @yonigozlan @molbap
/src/transformers/models/convnext/mod*_convnext* @yonigozlan @molbap
/src/transformers/models/convnextv2/mod*_convnextv2* @yonigozlan @molbap
/src/transformers/models/cvt/mod*_cvt* @yonigozlan @molbap
/src/transformers/models/deformable_detr/mod*_deformable_detr* @yonigozlan @molbap
/src/transformers/models/deit/mod*_deit* @yonigozlan @molbap
/src/transformers/models/depth_anything/mod*_depth_anything* @yonigozlan @molbap
/src/transformers/models/depth_anything_v2/mod*_depth_anything_v2* @yonigozlan @molbap
/src/transformers/models/deta/mod*_deta* @yonigozlan @molbap
/src/transformers/models/detr/mod*_detr* @yonigozlan @molbap
/src/transformers/models/dinat/mod*_dinat* @yonigozlan @molbap
/src/transformers/models/dinov2/mod*_dinov2* @yonigozlan @molbap
/src/transformers/models/dinov2_with_registers/mod*_dinov2_with_registers* @yonigozlan @molbap
/src/transformers/models/dit/mod*_dit* @yonigozlan @molbap
/src/transformers/models/dpt/mod*_dpt* @yonigozlan @molbap
/src/transformers/models/efficientformer/mod*_efficientformer* @yonigozlan @molbap
/src/transformers/models/efficientnet/mod*_efficientnet* @yonigozlan @molbap
/src/transformers/models/focalnet/mod*_focalnet* @yonigozlan @molbap
/src/transformers/models/glpn/mod*_glpn* @yonigozlan @molbap
/src/transformers/models/hiera/mod*_hiera* @yonigozlan @molbap
/src/transformers/models/ijepa/mod*_ijepa* @yonigozlan @molbap
/src/transformers/models/imagegpt/mod*_imagegpt* @yonigozlan @molbap
/src/transformers/models/levit/mod*_levit* @yonigozlan @molbap
/src/transformers/models/mask2former/mod*_mask2former* @yonigozlan @molbap
/src/transformers/models/maskformer/mod*_maskformer* @yonigozlan @molbap
/src/transformers/models/mobilenet_v1/mod*_mobilenet_v1* @yonigozlan @molbap
/src/transformers/models/mobilenet_v2/mod*_mobilenet_v2* @yonigozlan @molbap
/src/transformers/models/mobilevit/mod*_mobilevit* @yonigozlan @molbap
/src/transformers/models/mobilevitv2/mod*_mobilevitv2* @yonigozlan @molbap
/src/transformers/models/nat/mod*_nat* @yonigozlan @molbap
/src/transformers/models/poolformer/mod*_poolformer* @yonigozlan @molbap
/src/transformers/models/pvt/mod*_pvt* @yonigozlan @molbap
/src/transformers/models/pvt_v2/mod*_pvt_v2* @yonigozlan @molbap
/src/transformers/models/regnet/mod*_regnet* @yonigozlan @molbap
/src/transformers/models/resnet/mod*_resnet* @yonigozlan @molbap
/src/transformers/models/rt_detr/mod*_rt_detr* @yonigozlan @molbap
/src/transformers/models/segformer/mod*_segformer* @yonigozlan @molbap
/src/transformers/models/seggpt/mod*_seggpt* @yonigozlan @molbap
/src/transformers/models/superpoint/mod*_superpoint* @yonigozlan @molbap
/src/transformers/models/swiftformer/mod*_swiftformer* @yonigozlan @molbap
/src/transformers/models/swin/mod*_swin* @yonigozlan @molbap
/src/transformers/models/swinv2/mod*_swinv2* @yonigozlan @molbap
/src/transformers/models/swin2sr/mod*_swin2sr* @yonigozlan @molbap
/src/transformers/models/table_transformer/mod*_table_transformer* @yonigozlan @molbap
/src/transformers/models/textnet/mod*_textnet* @yonigozlan @molbap
/src/transformers/models/timm_wrapper/mod*_timm_wrapper* @yonigozlan @molbap
/src/transformers/models/upernet/mod*_upernet* @yonigozlan @molbap
/src/transformers/models/van/mod*_van* @yonigozlan @molbap
/src/transformers/models/vit/mod*_vit* @yonigozlan @molbap
/src/transformers/models/vit_hybrid/mod*_vit_hybrid* @yonigozlan @molbap
/src/transformers/models/vitdet/mod*_vitdet* @yonigozlan @molbap
/src/transformers/models/vit_mae/mod*_vit_mae* @yonigozlan @molbap
/src/transformers/models/vitmatte/mod*_vitmatte* @yonigozlan @molbap
/src/transformers/models/vit_msn/mod*_vit_msn* @yonigozlan @molbap
/src/transformers/models/vitpose/mod*_vitpose* @yonigozlan @molbap
/src/transformers/models/yolos/mod*_yolos* @yonigozlan @molbap
/src/transformers/models/zoedepth/mod*_zoedepth* @yonigozlan @molbap
/src/transformers/models/beit/mod*_beit* @amyeroberts @qubvel
/src/transformers/models/bit/mod*_bit* @amyeroberts @qubvel
/src/transformers/models/conditional_detr/mod*_conditional_detr* @amyeroberts @qubvel
/src/transformers/models/convnext/mod*_convnext* @amyeroberts @qubvel
/src/transformers/models/convnextv2/mod*_convnextv2* @amyeroberts @qubvel
/src/transformers/models/cvt/mod*_cvt* @amyeroberts @qubvel
/src/transformers/models/deformable_detr/mod*_deformable_detr* @amyeroberts @qubvel
/src/transformers/models/deit/mod*_deit* @amyeroberts @qubvel
/src/transformers/models/depth_anything/mod*_depth_anything* @amyeroberts @qubvel
/src/transformers/models/depth_anything_v2/mod*_depth_anything_v2* @amyeroberts @qubvel
/src/transformers/models/deta/mod*_deta* @amyeroberts @qubvel
/src/transformers/models/detr/mod*_detr* @amyeroberts @qubvel
/src/transformers/models/dinat/mod*_dinat* @amyeroberts @qubvel
/src/transformers/models/dinov2/mod*_dinov2* @amyeroberts @qubvel
/src/transformers/models/dinov2_with_registers/mod*_dinov2_with_registers* @amyeroberts @qubvel
/src/transformers/models/dit/mod*_dit* @amyeroberts @qubvel
/src/transformers/models/dpt/mod*_dpt* @amyeroberts @qubvel
/src/transformers/models/efficientformer/mod*_efficientformer* @amyeroberts @qubvel
/src/transformers/models/efficientnet/mod*_efficientnet* @amyeroberts @qubvel
/src/transformers/models/focalnet/mod*_focalnet* @amyeroberts @qubvel
/src/transformers/models/glpn/mod*_glpn* @amyeroberts @qubvel
/src/transformers/models/hiera/mod*_hiera* @amyeroberts @qubvel
/src/transformers/models/ijepa/mod*_ijepa* @amyeroberts @qubvel
/src/transformers/models/imagegpt/mod*_imagegpt* @amyeroberts @qubvel
/src/transformers/models/levit/mod*_levit* @amyeroberts @qubvel
/src/transformers/models/mask2former/mod*_mask2former* @amyeroberts @qubvel
/src/transformers/models/maskformer/mod*_maskformer* @amyeroberts @qubvel
/src/transformers/models/mobilenet_v1/mod*_mobilenet_v1* @amyeroberts @qubvel
/src/transformers/models/mobilenet_v2/mod*_mobilenet_v2* @amyeroberts @qubvel
/src/transformers/models/mobilevit/mod*_mobilevit* @amyeroberts @qubvel
/src/transformers/models/mobilevitv2/mod*_mobilevitv2* @amyeroberts @qubvel
/src/transformers/models/nat/mod*_nat* @amyeroberts @qubvel
/src/transformers/models/poolformer/mod*_poolformer* @amyeroberts @qubvel
/src/transformers/models/pvt/mod*_pvt* @amyeroberts @qubvel
/src/transformers/models/pvt_v2/mod*_pvt_v2* @amyeroberts @qubvel
/src/transformers/models/regnet/mod*_regnet* @amyeroberts @qubvel
/src/transformers/models/resnet/mod*_resnet* @amyeroberts @qubvel
/src/transformers/models/rt_detr/mod*_rt_detr* @amyeroberts @qubvel
/src/transformers/models/segformer/mod*_segformer* @amyeroberts @qubvel
/src/transformers/models/seggpt/mod*_seggpt* @amyeroberts @qubvel
/src/transformers/models/superpoint/mod*_superpoint* @amyeroberts @qubvel
/src/transformers/models/swiftformer/mod*_swiftformer* @amyeroberts @qubvel
/src/transformers/models/swin/mod*_swin* @amyeroberts @qubvel
/src/transformers/models/swinv2/mod*_swinv2* @amyeroberts @qubvel
/src/transformers/models/swin2sr/mod*_swin2sr* @amyeroberts @qubvel
/src/transformers/models/table_transformer/mod*_table_transformer* @amyeroberts @qubvel
/src/transformers/models/textnet/mod*_textnet* @amyeroberts @qubvel
/src/transformers/models/timm_wrapper/mod*_timm_wrapper* @amyeroberts @qubvel
/src/transformers/models/upernet/mod*_upernet* @amyeroberts @qubvel
/src/transformers/models/van/mod*_van* @amyeroberts @qubvel
/src/transformers/models/vit/mod*_vit* @amyeroberts @qubvel
/src/transformers/models/vit_hybrid/mod*_vit_hybrid* @amyeroberts @qubvel
/src/transformers/models/vitdet/mod*_vitdet* @amyeroberts @qubvel
/src/transformers/models/vit_mae/mod*_vit_mae* @amyeroberts @qubvel
/src/transformers/models/vitmatte/mod*_vitmatte* @amyeroberts @qubvel
/src/transformers/models/vit_msn/mod*_vit_msn* @amyeroberts @qubvel
/src/transformers/models/vitpose/mod*_vitpose* @amyeroberts @qubvel
/src/transformers/models/yolos/mod*_yolos* @amyeroberts @qubvel
/src/transformers/models/zoedepth/mod*_zoedepth* @amyeroberts @qubvel
# Audio models
/src/transformers/models/audio_spectrogram_transformer/mod*_audio_spectrogram_transformer* @eustlb
@ -304,7 +304,7 @@ trainer_utils.py @zach-huggingface @SunMarc
/src/transformers/models/donut/mod*_donut* @zucchini-nlp
/src/transformers/models/flava/mod*_flava* @zucchini-nlp
/src/transformers/models/git/mod*_git* @zucchini-nlp
/src/transformers/models/grounding_dino/mod*_grounding_dino* @yonigozlan
/src/transformers/models/grounding_dino/mod*_grounding_dino* @qubvel
/src/transformers/models/groupvit/mod*_groupvit* @zucchini-nlp
/src/transformers/models/idefics/mod*_idefics* @zucchini-nlp
/src/transformers/models/idefics2/mod*_idefics2* @zucchini-nlp
@ -326,10 +326,10 @@ trainer_utils.py @zach-huggingface @SunMarc
/src/transformers/models/mgp_str/mod*_mgp_str* @zucchini-nlp
/src/transformers/models/mllama/mod*_mllama* @zucchini-nlp
/src/transformers/models/nougat/mod*_nougat* @NielsRogge
/src/transformers/models/omdet_turbo/mod*_omdet_turbo* @yonigozlan
/src/transformers/models/omdet_turbo/mod*_omdet_turbo* @qubvel @yonigozlan
/src/transformers/models/oneformer/mod*_oneformer* @zucchini-nlp
/src/transformers/models/owlvit/mod*_owlvit* @yonigozlan
/src/transformers/models/owlv2/mod*_owlv2* @yonigozlan
/src/transformers/models/owlvit/mod*_owlvit* @qubvel
/src/transformers/models/owlv2/mod*_owlv2* @qubvel
/src/transformers/models/paligemma/mod*_paligemma* @zucchini-nlp @molbap
/src/transformers/models/perceiver/mod*_perceiver* @zucchini-nlp
/src/transformers/models/pix2struct/mod*_pix2struct* @zucchini-nlp

View File

@ -7,14 +7,6 @@ on:
description: 'GH Actions runner group to use'
required: true
type: string
container_image:
description: 'Docker image to use'
required: true
type: string
container_options:
description: 'Container options to use'
required: true
type: string
commit_sha:
description: 'Commit SHA to benchmark'
required: false
@ -46,8 +38,8 @@ jobs:
(github.event_name == 'pull_request' && contains( github.event.pull_request.labels.*.name, 'run-benchmark')) ||
(github.event_name == 'schedule')
container:
image: ${{ inputs.container_image }}
options: ${{ inputs.container_options }}
image: huggingface/transformers-pytorch-gpu
options: --gpus all --privileged --ipc host --shm-size "16gb"
steps:
- name: Get repo
uses: actions/checkout@v4

View File

@ -13,8 +13,6 @@ jobs:
uses: ./.github/workflows/benchmark_v2.yml
with:
runner: aws-g5-4xlarge-cache-use1-public-80
container_image: huggingface/transformers-pytorch-gpu
container_options: --gpus all --privileged --ipc host --shm-size "16gb"
commit_sha: ${{ github.sha }}
run_id: ${{ github.run_id }}
benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks

View File

@ -13,8 +13,6 @@ jobs:
uses: ./.github/workflows/benchmark_v2.yml
with:
runner: amd-mi325-ci-1gpu
container_image: huggingface/transformers-pytorch-amd-gpu
container_options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache
commit_sha: ${{ github.sha }}
run_id: ${{ github.run_id }}
benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks

View File

@ -20,7 +20,7 @@ jobs:
with:
job: run_models_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: amd-mi325
runner_scale_set: amd-mi325-ci
docker: huggingface/transformers-pytorch-amd-gpu
ci_event: Scheduled CI (AMD) - mi325
report_repo_id: optimum-amd/transformers_daily_ci
@ -33,7 +33,7 @@ jobs:
with:
job: run_pipelines_torch_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: amd-mi325
runner_scale_set: amd-mi325-ci
docker: huggingface/transformers-pytorch-amd-gpu
ci_event: Scheduled CI (AMD) - mi325
report_repo_id: optimum-amd/transformers_daily_ci
@ -46,7 +46,7 @@ jobs:
with:
job: run_examples_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: amd-mi325
runner_scale_set: amd-mi325-ci
docker: huggingface/transformers-pytorch-amd-gpu
ci_event: Scheduled CI (AMD) - mi325
report_repo_id: optimum-amd/transformers_daily_ci
@ -59,7 +59,7 @@ jobs:
with:
job: run_torch_cuda_extensions_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: amd-mi325
runner_scale_set: amd-mi325-ci
docker: huggingface/transformers-pytorch-deepspeed-amd-gpu
ci_event: Scheduled CI (AMD) - mi325
report_repo_id: optimum-amd/transformers_daily_ci

View File

@ -20,7 +20,7 @@ jobs:
with:
job: run_models_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: hfc-amd-mi355
runner_scale_set: amd-mi355-ci
docker: huggingface/testing-rocm7.0-preview
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: hf-transformers-bot/transformers-ci-dummy
@ -32,7 +32,7 @@ jobs:
with:
job: run_pipelines_torch_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: hfc-amd-mi355
runner_scale_set: amd-mi355-ci
docker: huggingface/testing-rocm7.0-preview
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: hf-transformers-bot/transformers-ci-dummy
@ -44,7 +44,7 @@ jobs:
with:
job: run_examples_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: hfc-amd-mi355
runner_scale_set: amd-mi355-ci
docker: huggingface/testing-rocm7.0-preview
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: hf-transformers-bot/transformers-ci-dummy
@ -56,7 +56,7 @@ jobs:
with:
job: run_torch_cuda_extensions_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: hfc-amd-mi355
runner_scale_set: amd-mi355-ci
docker: huggingface/testing-rocm7.0-preview
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: hf-transformers-bot/transformers-ci-dummy

View File

@ -278,7 +278,6 @@ are working on it).<br>
useful to avoid duplicated work, and to differentiate it from PRs ready to be merged.<br>
☐ Make sure existing tests pass.<br>
☐ If adding a new feature, also add tests for it.<br>
- If you are adding a new model, make sure you use
`ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)` to trigger the common tests.
- If you are adding new `@slow` tests, make sure they pass using
@ -341,7 +340,6 @@ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/t
```
Like the slow tests, there are other environment variables available which are not enabled by default during testing:
- `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers.
More environment variables and additional information can be found in the [testing_utils.py](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py).

View File

@ -64,7 +64,8 @@ NOT_DEVICE_TESTS = {
"test_load_save_without_tied_weights",
"test_tied_weights_keys",
"test_model_weights_reload_no_missing_tied_weights",
"test_can_load_ignoring_mismatched_shapes",
"test_mismatched_shapes_have_properly_initialized_weights",
"test_matched_shapes_have_loaded_weights_when_some_mismatched_shapes_exist",
"test_model_is_small",
"ModelTest::test_pipeline_", # None of the pipeline tests from PipelineTesterMixin (of which XxxModelTest inherits from) are running on device
"ModelTester::test_pipeline_",

View File

@ -35,10 +35,3 @@ RUN python3 -m pip uninstall -y kernels
# On ROCm, torchcodec is required to decode audio files and 0.4 or 0.6 fails
RUN python3 -m pip install --no-cache-dir "torchcodec==0.5"
# Install flash attention from source. Tested with commit 6387433156558135a998d5568a9d74c1778666d8
RUN git clone https://github.com/ROCm/flash-attention/ -b tridao && \
cd flash-attention && \
GPU_ARCHS="gfx942" python setup.py install
RUN python3 -m pip install --no-cache-dir einops

View File

@ -30,21 +30,22 @@ RUN python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio tor
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
# needed in bnb and awq
RUN python3 -m pip install --no-cache-dir einops
# Add bitsandbytes for mixed int8 testing
RUN python3 -m pip install --no-cache-dir bitsandbytes
# Add gptqmodel for gtpq quantization testing, installed from source for pytorch==2.6.0 compatibility
RUN python3 -m pip install lm_eval
RUN git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel && pip install -v . --no-build-isolation
# Add optimum for gptq quantization testing
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum
# Add PEFT
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/peft@main#egg=peft
# needed in bnb and awq
RUN python3 -m pip install --no-cache-dir einops
# Add bitsandbytes
RUN python3 -m pip install --no-cache-dir bitsandbytes
# # Add gptqmodel
# RUN python3 -m pip install --no-cache-dir gptqmodel
# Add hqq for quantization testing
RUN python3 -m pip install --no-cache-dir hqq

View File

@ -50,7 +50,7 @@ Begin translating the text!
1. Start with the `_toctree.yml` file that corresponds to your documentation chapter. This file is essential for rendering the table of contents on the website.
- If the `_toctree.yml` file doesn't exist for your language, create one by copying the English version and removing unrelated sections.
- If the `_toctree.yml` file doesnt exist for your language, create one by copying the English version and removing unrelated sections.
- Ensure it is placed in the `docs/source/LANG-ID/` directory.
Heres an example structure for the `_toctree.yml` file:

View File

@ -305,8 +305,6 @@
title: Glossary
- local: philosophy
title: Philosophy
- local: models_timeline
title: Models Timeline
- local: notebooks
title: Notebooks with examples
- local: community
@ -445,8 +443,6 @@
title: DeepSeek-V2
- local: model_doc/deepseek_v3
title: DeepSeek-V3
- local: model_doc/deepseek_v32
title: DeepseekV32
- local: model_doc/dialogpt
title: DialoGPT
- local: model_doc/diffllama
@ -559,6 +555,8 @@
title: LED
- local: model_doc/lfm2
title: LFM2
- local: model_doc/lfm2_vl
title: LFM2-VL
- local: model_doc/llama
title: LLaMA
- local: model_doc/llama2
@ -937,8 +935,6 @@
title: MusicGen
- local: model_doc/musicgen_melody
title: MusicGen Melody
- local: model_doc/parakeet
title: Parakeet
- local: model_doc/pop2piano
title: Pop2Piano
- local: model_doc/seamless_m4t
@ -1037,10 +1033,6 @@
title: DePlot
- local: model_doc/donut
title: Donut
- local: model_doc/edgetam
title: EdgeTAM
- local: model_doc/edgetam_video
title: EdgeTamVideo
- local: model_doc/emu3
title: Emu3
- local: model_doc/evolla
@ -1093,8 +1085,6 @@
title: LayoutLMV3
- local: model_doc/layoutxlm
title: LayoutXLM
- local: model_doc/lfm2_vl
title: LFM2-VL
- local: model_doc/lilt
title: LiLT
- local: model_doc/llama4

View File

@ -136,7 +136,7 @@ The cache position tracks where to insert new tokens in the attention cache. It
Cache position is used internally for two purposes:
1. Selecting new tokens to process in the input sequence and ensuring only tokens that haven't been cached yet are passed to the model's `forward`.
1. Selecting new tokens to process in the input sequence and ensuring only tokens that havent been cached yet are passed to the model's `forward`.
2. Storing key/value pairs at the correct positions in the cache. This is especially important for fixed-size caches, that pre-allocates a specific cache length.
The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots.
@ -162,7 +162,6 @@ generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10)
Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].
The legacy format is essentially the same data structure but organized differently.
- It's a tuple of tuples, where each inner tuple contains the key and value tensors for a layer.
- The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`.
- The format is less flexible and doesn't support features like quantization or offloading.

View File

@ -124,7 +124,7 @@ Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopte
> [!WARNING]
> Some tokenizers add special `<bos>` and `<eos>` tokens. Chat templates should already include all the necessary special tokens, and adding additional special tokens is often incorrect or duplicated, hurting model performance. When you format text with `apply_chat_template(tokenize=False)`, make sure you set `add_special_tokens=False` if you tokenize later to avoid duplicating these tokens.
> This isn't an issue if you use `apply_chat_template(tokenize=True)`, which means it's usually the safer option!
> This isnt an issue if you use `apply_chat_template(tokenize=True)`, which means it's usually the safer option!
### add_generation_prompt
@ -168,7 +168,7 @@ Can I ask a question?<|im_end|>
When `add_generation_prompt=True`, `<|im_start|>assistant` is added at the end to indicate the start of an `assistant` message. This lets the model know an `assistant` response is next.
Not all models require generation prompts, and some models, like [Llama](./model_doc/llama), don't have any special tokens before the `assistant` response. In these cases, [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) has no effect.
Not all models require generation prompts, and some models, like [Llama](./model_doc/llama), dont have any special tokens before the `assistant` response. In these cases, [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) has no effect.
### continue_final_message
@ -187,9 +187,9 @@ model.generate(**formatted_chat)
```
> [!WARNING]
> You shouldn't use [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) and [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) together. The former adds tokens that start a new message, while the latter removes end of sequence tokens. Using them together returns an error.
> You shouldnt use [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) and [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) together. The former adds tokens that start a new message, while the latter removes end of sequence tokens. Using them together returns an error.
[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don't support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.
[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models dont support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.
## Model training

View File

@ -56,7 +56,7 @@ out = pipe(text=messages, max_new_tokens=128)
print(out[0]['generated_text'][-1]['content'])
```
```text
```
Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
```
@ -96,7 +96,7 @@ processed_chat = processor.apply_chat_template(messages, add_generation_prompt=T
print(list(processed_chat.keys()))
```
```text
```
['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw']
```
@ -115,7 +115,7 @@ Some vision models also support video inputs. The message format is very similar
- The content `"type"` should be `"video"` to indicate the content is a video.
- For videos, it can be a link to the video (`"url"`) or it could be a file path (`"path"`). Videos loaded from a URL can only be decoded with [PyAV](https://pyav.basswood-io.com/docs/stable/) or [Decord](https://github.com/dmlc/decord).
- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if you've already preprocessed or decoded video frames elsewhere in memory (e.g., using OpenCV, decord, or torchvision). You don't need to save to files or store it in an URL.
- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if youve already preprocessed or decoded video frames elsewhere in memory (e.g., using OpenCV, decord, or torchvision). You don't need to save to files or store it in an URL.
> [!WARNING]
> Loading a video from `"url"` is only supported by the PyAV or Decord backends.

View File

@ -188,7 +188,7 @@ The example below shows how a tool is defined in JSON schema format.
An example of handling tool definitions in a chat template is shown below. The specific tokens and layouts should be changed to match the ones the model was trained with.
```jinja
```
{%- if tools %}
{%- for tool in tools %}
{{- '<tool>' + tool['function']['name'] + '\n' }}
@ -226,7 +226,7 @@ Tool calls are generally passed in the `tool_calls` key of an `"assistant”` me
A common pattern for handling tool calls is shown below. You can use this as a starting point, but make sure you template actually matches the format the model was trained with!
```jinja
```
{%- if message['role'] == 'assistant' and 'tool_calls' in message %}
{%- for tool_call in message['tool_calls'] %}
{{- '<tool_call>' + tool_call['function']['name'] + '\n' + tool_call['function']['arguments']|tojson + '\n</tool_call>' }}
@ -249,7 +249,7 @@ Tool responses are message dicts with the `tool` role. They are much simpler tha
Some templates may not even need the `name` key, in which case, you can write your template to only read the `content` key.
```jinja
```
{%- if message['role'] == 'tool' %}
{{- "<tool_result>" + message['content'] + "</tool_result>" }}
{%- endif %}

View File

@ -21,10 +21,9 @@ where `port` is the port used by `transformers serve` (`8000` by default). On th
</h3>
You're now ready to set things up on the app side! In Cursor, while you can't set a new provider, you can change the endpoint for OpenAI requests in the model selection settings. First, navigate to "Settings" > "Cursor Settings", "Models" tab, and expand the "API Keys" collapsible. To set your `transformers serve` endpoint, follow this order:
1. Unselect ALL models in the list above (e.g. `gpt4`, ...);
2. Add and select the model you want to use (e.g. `Qwen/Qwen3-4B`)
3. Add some random text to OpenAI API Key. This field won't be used, but it can't be empty;
3. Add some random text to OpenAI API Key. This field won't be used, but it cant be empty;
4. Add the https address from `ngrok` to the "Override OpenAI Base URL" field, appending `/v1` to the address (i.e. `https://(...).ngrok-free.app/v1`);
5. Hit "Verify".

View File

@ -45,7 +45,7 @@ which nvcc
You may also have more than one CUDA toolkit installed on your system.
```text
```bash
/usr/local/cuda-10.2
/usr/local/cuda-11.0
```

View File

@ -294,7 +294,7 @@ Consider running a [benchmark](https://github.com/microsoft/DeepSpeed/issues/998
The example ZeRO-3 and ZeRO-Infinity config below sets most of the parameter values to `auto`, but you can also manually set configure these values.
```json
```yaml
{
"fp16": {
"enabled": "auto",
@ -383,7 +383,7 @@ Gradient checkpointing saves memory by only storing *some* of the intermediate a
The batch size can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets `train_micro_batch_size_per_gpu` and `train_batch_size` to the value of `world_size * per_device_train_batch_size * gradient_accumulation_steps`.
```json
```yaml
{
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto"
@ -400,7 +400,7 @@ Reduce operations are lossy, for example, when gradients are averaged across mul
Choose the communication data type by setting the `communication_data_type` parameter in the config file. For example, choosing fp32 adds a small amount of overhead but ensures the reduction operation is accumulated in fp32 and when it is ready, it's downcasted to whichever half-precision data type you're training in.
```json
```yaml
{
"communication_data_type": "fp32"
}
@ -412,7 +412,7 @@ Gradient accumulation accumulates gradients over several mini-batches of data be
Gradient accumulation can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets it to the value of `gradient_accumulation_steps`.
```json
```yaml
{
"gradient_accumulation_steps": "auto"
}
@ -424,7 +424,7 @@ Gradient clipping is useful for preventing exploding gradients which can lead to
Gradient clipping can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets it to the value of `max_grad_norm`.
```json
```yaml
{
"gradient_clipping": "auto"
}
@ -439,7 +439,7 @@ Mixed precision accelerates training speed by performing some calculations in ha
Train in fp32 if a model wasn't pretrained in mixed precision because it may cause underflow or overflow errors. Disable fp16, the default, in this case.
```json
```yaml
{
"fp16": {
"enabled": false
@ -452,9 +452,9 @@ For Ampere GPUs and PyTorch 1.7+, the more efficient [tf32](https://pytorch.org/
</hfoption>
<hfoption id="fp16">
To configure fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically enables or disables fp16 based on the value of `fp16` or `fp16_full_eval`, and the rest of the config can be set by you. fp16 is enabled from the command line when the following arguments are passed: `--fp16` or `--fp16_full_eval` also.
To configure AMP-like fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically enables or disables fp16 based on the value of `fp16_backend`, and the rest of the config can be set by you. fp16 is enabled from the command line when the following arguments are passed: `--fp16`, `--fp16_backend amp` or `--fp16_full_eval`.
```json
```yaml
{
"fp16": {
"enabled": "auto",
@ -469,17 +469,28 @@ To configure fp16 mixed precision, set up the config as shown below with `"auto"
For additional DeepSpeed fp16 training options, take a look at the [FP16 Training Options](https://www.deepspeed.ai/docs/config-json/#fp16-training-options) reference.
To configure Apex-like fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically configures `amp` based on the values of `fp16_backend` and `fp16_opt_level`. It can also be enabled from the command line when the following arguments are passed: `--fp16`, `--fp16_backend apex` or `--fp16_opt_level 01`.
```yaml
{
"amp": {
"enabled": "auto",
"opt_level": "auto"
}
}
```
</hfoption>
<hfoption id="bf16">
> [!TIP]
> bf16 requires DeepSpeed 0.6.0.
bf16 has the same dynamic range as fp32, and doesn't require loss scaling unlike fp16. However, if you use [gradient accumulation](#gradient-accumulation) with bf16, gradients are accumulated in bf16 which may not be desirable because the lower precision can lead to lossy accumulation.
bf16 has the same dynamic range as fp32, and doesnt require loss scaling unlike fp16. However, if you use [gradient accumulation](#gradient-accumulation) with bf16, gradients are accumulated in bf16 which may not be desirable because the lower precision can lead to lossy accumulation.
bf16 can be set up in the config file or enabled from the command line when the following arguments are passed: `--bf16` or `--bf16_full_eval`.
```json
```yaml
{
"bf16": {
"enabled": "auto"
@ -503,7 +514,7 @@ DeepSpeed offers several [optimizers](https://www.deepspeed.ai/docs/config-json/
You can set the parameters to `"auto"` or manually input your own values.
```json
```yaml
{
"optimizer": {
"type": "AdamW",
@ -519,7 +530,7 @@ You can set the parameters to `"auto"` or manually input your own values.
Use an unsupported optimizer by adding the following to the top level configuration.
```json
```yaml
{
"zero_allow_untested_optimizer": true
}
@ -527,7 +538,7 @@ Use an unsupported optimizer by adding the following to the top level configurat
From DeepSpeed 0.8.3+, if you want to use offload, you'll also need to add the following to the top level configuration because offload works best with DeepSpeed's CPU Adam optimizer.
```json
```yaml
{
"zero_force_ds_cpu_optimizer": false
}
@ -547,7 +558,7 @@ If you don't configure the scheduler in the config file, [`Trainer`] automatical
You can set the parameters to `"auto"` or manually input your own values.
```json
```yaml
{
"scheduler": {
"type": "WarmupDecayLR",
@ -570,7 +581,7 @@ You can set the parameters to `"auto"` or manually input your own values.
Resume training with a Universal checkpoint by setting `load_universal` to `true` in the config file.
```json
```yaml
{
"checkpoint": {
"load_universal": true
@ -629,7 +640,7 @@ deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
A multi-node setup consists of multiple nodes, where each node has one of more GPUs running a workload. DeepSpeed expects a shared storage system, but if this is not the case, you need to adjust the config file to include a [checkpoint](https://www.deepspeed.ai/docs/config-json/#checkpoint-options) to allow loading without access to a shared filesystem.
```json
```yaml
{
"checkpoint": {
"use_node_local_storage": true
@ -813,7 +824,7 @@ ZeRO-2 saves the model weights in fp16. To save the weights in fp16 for ZeRO-3,
If you don't, [`Trainer`] won't save the weights in fp16 and won't create a `pytorch_model.bin` file. This is because DeepSpeed's state_dict contains a placeholder instead of the real weights, so you won't be able to load it.
```json
```yaml
{
"zero_optimization": {
"stage": 3,
@ -975,7 +986,7 @@ NaN loss often occurs when a model is pretrained in bf16 and you try to use it w
It is also possible that fp16 is causing overflow. For example, if your config file looks like the one below, you may see the following overflow errors in the logs.
```json
```yaml
{
"fp16": {
"enabled": "auto",

View File

@ -229,7 +229,6 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True)
## Custom generation methods
Custom generation methods enable specialized behavior such as:
- have the model continue thinking if it is uncertain;
- roll back generation if the model gets stuck;
- handle special tokens with custom logic;
@ -290,7 +289,7 @@ print(tokenizer.batch_decode(gen_out)[0])
If the custom method has pinned Python requirements that your environment doesn't meet, you'll get an exception about missing requirements. For instance, [transformers-community/custom_generate_bad_requirements](https://huggingface.co/transformers-community/custom_generate_bad_requirements) has an impossible set of requirements defined in its `custom_generate/requirements.txt` file, and you'll see the error message below if you try to run it.
```text
```
ImportError: Missing requirements in your local environment for `transformers-community/custom_generate_bad_requirements`:
foo (installed: None)
bar==0.0.0 (installed: None)
@ -302,7 +301,6 @@ Updating your Python requirements accordingly will remove this error message.
### Creating a custom generation method
To create a new generation method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.
1. The model you've designed your generation method with.
2. `custom_generate/generate.py`, which contains all the logic for your custom generation method.
3. `custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method.
@ -310,7 +308,7 @@ To create a new generation method, you need to create a new [**Model**](https://
After you've added all required files, your repository should look like this
```text
```
your_repo/
├── README.md # include the 'custom_generate' tag
├── config.json
@ -379,7 +377,6 @@ def generate(model, input_ids, generation_config=None, left_padding=None, **kwar
```
Follow the recommended practices below to ensure your custom generation method works as expected.
- Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`].
- Pin the `transformers` version in the requirements if you use any private method/attribute in `model`.
- Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.
@ -402,7 +399,7 @@ The root level `README.md` in the model repository usually describes the model t
For discoverability, we highly recommend you to add the `custom_generate` tag to your repository. To do so, the top of your `README.md` file should look like the example below. After you push the file, you should see the tag in your repository!
```text
```
---
library_name: transformers
tags:
@ -413,14 +410,13 @@ tags:
```
Recommended practices:
- Document input and output differences in [`~GenerationMixin.generate`].
- Add self-contained examples to enable quick experimentation.
- Describe soft-requirements such as if the method only works well with a certain family of models.
### Reusing `generate`'s input preparation
### Reusing `generate`s input preparation
If you're adding a new decoding loop, you might want to preserve the input preparation present in `generate` (batch expansion, attention masks, logits processors, stopping criteria, etc.). You can also pass a **callable** to `custom_generate` to reuse [`~GenerationMixin.generate`]'s full preparation pipeline while overriding only the decoding loop.
If you're adding a new decoding loop, you might want to preserve the input preparation present in `generate` (batch expansion, attention masks, logits processors, stopping criteria, etc.). You can also pass a **callable** to `custom_generate` to reuse [`~GenerationMixin.generate`]s full preparation pipeline while overriding only the decoding loop.
```py
def custom_loop(model, input_ids, attention_mask, logits_processor, stopping_criteria, generation_config, **model_kwargs):
@ -441,12 +437,11 @@ output = model.generate(
```
> [!TIP]
> If you publish a `custom_generate` repository, your `generate` implementation can itself define a callable and pass it to `model.generate()`. This lets you customize the decoding loop while still benefiting from Transformers' built-in input preparation logic.
> If you publish a `custom_generate` repository, your `generate` implementation can itself define a callable and pass it to `model.generate()`. This lets you customize the decoding loop while still benefiting from Transformers built-in input preparation logic.
### Finding custom generation methods
You can find all custom generation methods by [searching for their custom tag.](https://huggingface.co/models?other=custom_generate), `custom_generate`. In addition to the tag, we curate two collections of `custom_generate` methods:
- [Custom generation methods - Community](https://huggingface.co/collections/transformers-community/custom-generation-methods-community-6888fb1da0efbc592d3a8ab6) -- a collection of powerful methods contributed by the community;
- [Custom generation methods - Tutorials](https://huggingface.co/collections/transformers-community/custom-generation-methods-tutorials-6823589657a94940ea02cfec) -- a collection of reference implementations for methods that previously were part of `transformers`, as well as tutorials for `custom_generate`.

View File

@ -34,10 +34,6 @@ There are over 1M+ Transformers [model checkpoints](https://huggingface.co/model
Explore the [Hub](https://huggingface.com/) today to find a model and use Transformers to help you get started right away.
Explore the [Models Timeline](./models_timeline) to discover the latest text, vision, audio and multimodal model architectures in Transformers.
## Features
Transformers provides everything you need for inference or training with state-of-the-art pretrained models. Some of the main features include:

View File

@ -255,7 +255,6 @@ how many tests are being skipped and for which models.
When porting models to transformers, tests fail as they should, and sometimes `test_modeling_common` feels irreconcilable with the peculiarities of our brand new model. But how can we be sure we're not breaking everything by adding a seemingly innocent skip?
This utility:
- scans all test_modeling_common methods
- looks for times where a method is skipped
- returns a summary json you can load as a DataFrame/inspect
@ -280,7 +279,7 @@ python utils/scan_skipped_tests.py --output_dir path/to/output
**Example output:**
```text
```
🔬 Parsing 331 model test files once each...
📝 Aggregating 224 tests...
(224/224) test_update_candidate_strategy_with_matches_1es_3d_is_nonecodet_schedule_fa_kwargs

View File

@ -36,6 +36,10 @@ Most of those are only useful if you are studying the code of the Trainer in the
[[autodoc]] trainer_callback.CallbackHandler
## Distributed Evaluation
[[autodoc]] trainer_pt_utils.DistributedTensorGatherer
## Trainer Argument Parser
[[autodoc]] HfArgumentParser

View File

@ -25,7 +25,7 @@ You are now ready to chat!
To conclude this example, let's look into a more advanced use-case. If you have a beefy machine to serve models with, but prefer using Jan on a different device, you need to add port forwarding. If you have `ssh` access from your Jan machine into your server, this can be accomplished by typing the following to your Jan machine's terminal
```bash
```
ssh -N -f -L 8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server
```

View File

@ -213,7 +213,7 @@ A cache can also work in iterative generation settings where there is back-and-f
For iterative generation with a cache, start by initializing an empty cache class and then you can feed in your new prompts. Keep track of dialogue history with a [chat template](./chat_templating).
The following example demonstrates [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If you're using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently. It might cut out important tokens depending on how the Jinja template is written.
The following example demonstrates [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If youre using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently. It might cut out important tokens depending on how the Jinja template is written.
For example, some models use special `<think> ... </think>` tokens during reasoning. These could get lost during re-encoding, causing indexing issues. You might need to manually remove or adjust extra tokens from the completions to keep things stable.

View File

@ -94,7 +94,6 @@ model.generate(**inputs, num_beams=4, do_sample=True)
```
[`~GenerationMixin.generate`] can also be extended with external libraries or custom code:
1. the `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manipulating the next token probability distribution;
2. the `stopping_criteria` parameters supports custom [`StoppingCriteria`] to stop text generation;
3. other custom generation methods can be loaded through the `custom_generate` flag ([docs](generation_strategies.md/#custom-decoding-methods)).

View File

@ -100,7 +100,7 @@ result
**Output**:
```text
```
Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
```
@ -119,7 +119,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
**Output**:
```text
```bash
29.0260648727417
```
@ -208,7 +208,7 @@ result
**Output**:
```text
```
Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
```
@ -220,7 +220,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
**Output**:
```text
```
15.219234466552734
```
@ -251,7 +251,7 @@ result
**Output**:
```text
```
Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument
```
@ -263,7 +263,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
**Output**:
```text
```
9.543574333190918
```
@ -423,7 +423,7 @@ result
**Output**:
```text
```
Generated in 10.96854019165039 seconds.
Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
````
@ -440,7 +440,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
**Output**:
```text
```bash
37.668193340301514
```
@ -472,7 +472,7 @@ result
**Output**:
```text
```
Generated in 3.0211617946624756 seconds.
Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
```
@ -487,7 +487,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
**Output**:
```text
```
32.617331981658936
```
@ -618,7 +618,7 @@ generated_text
**Output**:
```text
```
shape of input_ids torch.Size([1, 21])
shape of input_ids torch.Size([1, 22])
shape of input_ids torch.Size([1, 23])
@ -656,7 +656,7 @@ generated_text
**Output**:
```text
```
shape of input_ids torch.Size([1, 1])
length of key-value cache 20
shape of input_ids torch.Size([1, 1])
@ -690,7 +690,7 @@ Note that, despite our advice to use key-value caches, your LLM output may be sl
The key-value cache is especially useful for applications such as chat where multiple passes of auto-regressive decoding are required. Let's look at an example.
```text
```
User: How many people live in France?
Assistant: Roughly 75 million people live in France
User: And how many are in Germany?
@ -728,7 +728,7 @@ tokenizer.batch_decode(generation_output.sequences)[0][len(prompt):]
**Output**:
```text
```
is a modified version of the function that returns Mega bytes instead.
def bytes_to_megabytes(bytes):
@ -750,7 +750,7 @@ config = model.config
**Output**:
```text
```
7864320000
```

View File

@ -80,7 +80,6 @@ We use both in the `transformers` library. We leverage and adapt `logging`'s `ca
management of these warning messages by the verbosity setters above.
What does that mean for developers of the library? We should respect the following heuristics:
- `warnings` should be favored for developers of the library and libraries dependent on `transformers`
- `logging` should be used for end-users of the library using it in every-day projects

View File

@ -159,7 +159,7 @@ for batch_size in [1, 8, 64, 256]:
pass
```
```text
```
# On GTX 970
------------------------------
Streaming no batching
@ -195,7 +195,7 @@ This is a occasional very long sentence compared to the other. In that case, the
tokens long, so the whole batch will be [64, 400] instead of [64, 4], leading to the high slowdown. Even worse, on
bigger batches, the program simply crashes.
```text
```
------------------------------
Streaming no batching
100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 183.69it/s]

View File

@ -17,7 +17,6 @@ rendered properly in your Markdown viewer.
# Processors
Processors can mean two different things in the Transformers library:
- the objects that pre-process inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text)
or [CLIP](../model_doc/clip) (text and vision)
- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD.

View File

@ -148,7 +148,6 @@ for label, score in zip(candidate_labels, probs):
```
## Resources
- Refer to the [Kakao Brains Open Source ViT, ALIGN, and the New COYO Text-Image Dataset](https://huggingface.co/blog/vit-align) blog post for more details.
## AlignConfig

View File

@ -61,7 +61,7 @@ page for more information.
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
```py
```
from transformers import ASTForAudioClassification
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", attn_implementation="sdpa", dtype=torch.float16)
...

View File

@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.
</div>
# BART
[BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining objectives from BERT and GPT. It's pretrained by corrupting text in different ways like deleting words, shuffling sentences, or masking tokens and learning how to fix it. The encoder encodes the corrupted document and the corrupted text is fixed by the decoder. As it learns to recover the original text, BART gets really good at both understanding and generating language.
[BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining objectives from BERT and GPT. Its pretrained by corrupting text in different ways like deleting words, shuffling sentences, or masking tokens and learning how to fix it. The encoder encodes the corrupted document and the corrupted text is fixed by the decoder. As it learns to recover the original text, BART gets really good at both understanding and generating language.
You can find all the original BART checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=bart) organization.
@ -89,7 +89,7 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran
- Inputs should be padded on the right because BERT uses absolute position embeddings.
- The [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) checkpoint doesn't include `mask_token_id` which means it can't perform mask-filling tasks.
- BART doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or [`~PreTrainedTokenizerBase.encode`] to get the proper splitting.
- BART doesnt use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or [`~PreTrainedTokenizerBase.encode`] to get the proper splitting.
- The forward pass of [`BartModel`] creates the `decoder_input_ids` if they're not passed. This can be different from other model APIs, but it is a useful feature for mask-filling tasks.
- Model predictions are intended to be identical to the original implementation when `forced_bos_token_id=0`. This only works if the text passed to `fairseq.encode` begins with a space.
- [`~GenerationMixin.generate`] should be used for conditional generation tasks like summarization.

View File

@ -87,7 +87,7 @@ page for more information.
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
```py
```
from transformers import BeitForImageClassification
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
...
@ -123,7 +123,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- See also: [Image classification task guide](../tasks/image_classification)
**Semantic segmentation**
- [Semantic segmentation task guide](../tasks/semantic_segmentation)
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

View File

@ -24,7 +24,7 @@ rendered properly in your Markdown viewer.
## BERTweet
[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it's pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.
[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but its pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.
You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.
@ -88,8 +88,7 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran
</hfoptions>
## Notes
- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it's preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because its preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
- Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.
## BertweetTokenizer

View File

@ -87,7 +87,6 @@ print(f"The predicted token is: {predicted_token}")
</hfoptions>
## Notes
- Inputs should be padded on the right because BigBird uses absolute position embeddings.
- BigBird supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs.
- The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`.

View File

@ -121,6 +121,7 @@ print(output)
- Pad inputs on the right because BioGPT uses absolute position embeddings.
- BioGPT can reuse previously computed key-value attention pairs. Access this feature with the [past_key_values](https://huggingface.co/docs/transformers/main/en/model_doc/biogpt#transformers.BioGptModel.forward.past_key_values) parameter in [`BioGPTModel.forward`].
- The `head_mask` argument is ignored when using an attention implementation other than "eager". If you want to use `head_mask`, make sure `attn_implementation="eager"`).
```py
from transformers import AutoModelForCausalLM

View File

@ -36,7 +36,6 @@ The original code can be found [here](https://github.com/google-research/big_tra
## Usage tips
- BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://huggingface.co/papers/1803.08494),
2) [weight standardization](https://huggingface.co/papers/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant
impact on transfer learning.

View File

@ -43,19 +43,16 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
See also:
- [Causal language modeling task guide](../tasks/language_modeling)
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
⚡️ Inference
- A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
- A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).
⚙️ Training
- A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed).
## BloomConfig

View File

@ -13,7 +13,6 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2024-12-13 and added to Hugging Face Transformers on 2025-09-19.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
@ -29,7 +28,7 @@ rendered properly in your Markdown viewer.
## Overview
The BLT model was proposed in [Byte Latent Transformer: Patches Scale Better Than Tokens](https://huggingface.co/papers/2412.09871) by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li1, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman†, Srinivasan Iyer.
The BLT model was proposed in [Byte Latent Transformer: Patches Scale Better Than Tokens](<https://arxiv.org/pdf/2412.09871>) by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li1, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman†, Srinivasan Iyer.
BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.
The abstract from the paper is the following:

View File

@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.
# CANINE
[CANINE](https://huggingface.co/papers/2103.06874) is a tokenization-free Transformer. It skips the usual step of splitting text into subwords or wordpieces and processes text character by character. That means it works directly with raw Unicode, making it especially useful for languages with complex or inconsistent tokenization rules and even noisy inputs like typos. Since working with characters means handling longer sequences, CANINE uses a smart trick. The model compresses the input early on (called downsampling) so the transformer doesn't have to process every character individually. This keeps things fast and efficient.
[CANINE](https://huggingface.co/papers/2103.06874) is a tokenization-free Transformer. It skips the usual step of splitting text into subwords or wordpieces and processes text character by character. That means it works directly with raw Unicode, making it especially useful for languages with complex or inconsistent tokenization rules and even noisy inputs like typos. Since working with characters means handling longer sequences, CANINE uses a smart trick. The model compresses the input early on (called downsampling) so the transformer doesnt have to process every character individually. This keeps things fast and efficient.
You can find all the original CANINE checkpoints under the [Google](https://huggingface.co/google?search_models=canine) organization.

View File

@ -122,8 +122,7 @@ visualizer("Plants create energy through a process known as")
</div>
## Notes
- Don't use the dtype parameter in [`~AutoModel.from_pretrained`] if you're using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
- Dont use the dtype parameter in [`~AutoModel.from_pretrained`] if youre using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
## CohereConfig

View File

@ -80,7 +80,7 @@ for result in results:
This should output
```text
```
cat: 0.87 [14.7, 49.39, 320.52, 469.28]
remote: 0.86 [41.08, 72.37, 173.39, 117.2]
cat: 0.86 [344.45, 19.43, 639.85, 367.86]

View File

@ -53,6 +53,7 @@ The original code for vision can be found [here](https://github.com/facebookrese
- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
### Using Scaled Dot Product Attention (SDPA)
@ -67,7 +68,7 @@ SDPA is used by default for `torch>=2.1.1` when an implementation is available,
The SDPA implementation is currently available for the Data2VecAudio and Data2VecVision models.
```py
```
from transformers import Data2VecVisionForImageClassification
model = Data2VecVisionForImageClassification.from_pretrained("facebook/data2vec-vision-base", attn_implementation="sdpa", dtype=torch.float16)
...
@ -103,7 +104,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`Data2VecVisionForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
**Data2VecText documentation resources**
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
@ -112,12 +112,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [Multiple choice task guide](../tasks/multiple_choice)
**Data2VecAudio documentation resources**
- [Audio classification task guide](../tasks/audio_classification)
- [Automatic speech recognition task guide](../tasks/asr)
**Data2VecVision documentation resources**
- [Image classification](../tasks/image_classification)
- [Semantic segmentation](../tasks/semantic_segmentation)

View File

@ -92,7 +92,6 @@ echo -e '{"text": "A soccer game with multiple people playing.", "text_pair": "S
</hfoptions>
## Notes
- DeBERTa uses **relative position embeddings**, so it does not require **right-padding** like BERT.
- For best results, use DeBERTa on sentence-level or sentence-pair classification tasks like MNLI, RTE, or SST-2.
- If you're using DeBERTa for token-level tasks like masked language modeling, make sure to load a checkpoint specifically pretrained or fine-tuned for token-level tasks.

View File

@ -64,7 +64,7 @@ print(time.time()-start)
This generated:
``````text
``````
<Assistant><think>
Okay, the user wants to demonstrate how chat templating works. Let me break down what that means. Chat templating is about structuring the conversation data, especially for models that need specific input formats. Maybe they're referring to something like how messages are formatted with roles (user, assistant, system) in APIs like OpenAI.
@ -138,7 +138,7 @@ Applying the template to our `messages` list would produce:
This tells the model:
1. The conversation history (user/assistant turns).
2. The model's turn to generate a response (`<|assistant|>` at the end).
2. The models turn to generate a response (`<|assistant|>` at the end).
---

View File

@ -1,63 +0,0 @@
<!--Copyright 2025 the HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
-->
# DeepseekV32
## Overview
The DeepseekV32 model was proposed in [<INSERT PAPER NAME HERE>](<INSERT PAPER LINK HERE>) by <INSERT AUTHORS HERE>.
<INSERT SHORT SUMMARY HERE>
The abstract from the paper is the following:
<INSERT PAPER ABSTRACT HERE>
Tips:
<INSERT TIPS ABOUT MODEL HERE>
This model was contributed by [INSERT YOUR HF USERNAME HERE](https://huggingface.co/<INSERT YOUR HF USERNAME HERE>).
The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).
## Usage examples
<INSERT SOME NICE EXAMPLES HERE>
## DeepseekV32Config
[[autodoc]] DeepseekV32Config
## DeepseekV32PreTrainedModel
[[autodoc]] DeepseekV32PreTrainedModel
- forward
## DeepseekV32Model
[[autodoc]] DeepseekV32Model
- forward
## DeepseekV32ForCausalLM
[[autodoc]] DeepseekV32ForCausalLM
## DeepseekV32ForSequenceClassification
[[autodoc]] DeepseekV32ForSequenceClassification

View File

@ -24,7 +24,7 @@ rendered properly in your Markdown viewer.
# DeepseekVLHybrid
[Deepseek-VL-Hybrid](https://huggingface.co/papers/2403.05525) was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages [LLaMA](./llama) as its text encoder, while [SigLip](./siglip) is used for encoding low-resolution images and [SAM (Segment Anything Model)](./sam) is incorporated to handle high-resolution image encoding, enhancing the model's ability to process fine-grained visual details. Deepseek-VL-Hybrid is a variant of Deepseek-VL that uses [SAM (Segment Anything Model)](./sam) to handle high-resolution image encoding.
[Deepseek-VL-Hybrid](https://huggingface.co/papers/2403.05525) was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages [LLaMA](./llama) as its text encoder, while [SigLip](./siglip) is used for encoding low-resolution images and [SAM (Segment Anything Model)](./sam) is incorporated to handle high-resolution image encoding, enhancing the models ability to process fine-grained visual details. Deepseek-VL-Hybrid is a variant of Deepseek-VL that uses [SAM (Segment Anything Model)](./sam) to handle high-resolution image encoding.
You can find all the original Deepseek-VL-Hybrid checkpoints under the [DeepSeek-community](https://huggingface.co/deepseek-community) organization.

View File

@ -86,7 +86,7 @@ page for more information.
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
```py
```
from transformers import DeiTForImageClassification
model = DeiTForImageClassification.from_pretrained("facebook/deit-base-distilled-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
...

View File

@ -84,7 +84,6 @@ alt="drawing" width="600"/>
The `DepthProForDepthEstimation` model uses a `DepthProEncoder`, for encoding the input image and a `FeatureFusionStage` for fusing the output features from encoder.
The `DepthProEncoder` further uses two encoders:
- `patch_encoder`
- Input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
- Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`.

View File

@ -65,7 +65,6 @@ DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.
Notes:
- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention.
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`.
Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.

View File

@ -24,8 +24,7 @@ The [Vision Transformer](vit) (ViT) is a transformer encoder model (BERT-like) o
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include [DINOv2](dinov2) and [MAE](vit_mae).
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It's due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called "register" tokens), which you only use during pre-training (and throw away afterwards). This results in:
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. Its due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called "register" tokens), which you only use during pre-training (and throw away afterwards). This results in:
- no artifacts
- interpretable attention maps
- and improved performances.

View File

@ -1,331 +0,0 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
*This model was released on 2025-01-13 and added to Hugging Face Transformers on 2025-09-29.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
</div>
</div>
# EdgeTAM
## Overview
The EdgeTAM model was proposed in [EdgeTAM: On-Device Track Anything Model](https://huggingface.co/papers/2501.07256) Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, Bilge Soran.
EdgeTAM is an efficient adaptation of SAM 2 that introduces a 2D Spatial Perceiver architecture to optimize memory attention mechanisms for real-time video segmentation on mobile devices.
The abstract from the paper is the following:
*On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.*
This model was contributed by [yonigozlan](https://huggingface.co/yonigozlan).
The original code can be found [here](https://github.com/facebookresearch/EdgeTAM).
## Usage example
### Automatic Mask Generation with Pipeline
EdgeTAM can be used for automatic mask generation to segment all objects in an image using the `mask-generation` pipeline:
```python
>>> from transformers import pipeline
>>> generator = pipeline("mask-generation", model="yonigozlan/edgetam-1", device=0)
>>> image_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/truck.jpg"
>>> outputs = generator(image_url, points_per_batch=64)
>>> len(outputs["masks"]) # Number of masks generated
39
```
### Basic Image Segmentation
#### Single Point Click
You can segment objects by providing a single point click on the object you want to segment:
```python
>>> from transformers import Sam2Processor, EdgeTamModel, infer_device
>>> import torch
>>> from PIL import Image
>>> import requests
>>> device = infer_device()
>>> model = EdgeTamModel.from_pretrained("yonigozlan/edgetam-1").to(device)
>>> processor = Sam2Processor.from_pretrained("yonigozlan/edgetam-1")
>>> image_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/truck.jpg"
>>> raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
>>> input_points = [[[[500, 375]]]] # Single point click, 4 dimensions (image_dim, object_dim, point_per_object_dim, coordinates)
>>> input_labels = [[[1]]] # 1 for positive click, 0 for negative click, 3 dimensions (image_dim, object_dim, point_label)
>>> inputs = processor(images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(model.device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])[0]
>>> # The model outputs multiple mask predictions ranked by quality score
>>> print(f"Generated {masks.shape[1]} masks with shape {masks.shape}")
Generated 3 masks with shape torch.Size([1, 3, 1200, 1800])
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.0463, 0.4859, 0.7616], device='cuda:0')
```
#### Multiple Points for Refinement
You can provide multiple points to refine the segmentation:
```python
>>> # Add both positive and negative points to refine the mask
>>> input_points = [[[[500, 375], [1125, 625]]]] # Multiple points for refinement
>>> input_labels = [[[1, 1]]] # Both positive clicks
>>> inputs = processor(images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])[0]
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.8362, 0.6900, 0.2120], device='cuda:0')
```
#### Bounding Box Input
EdgeTAM also supports bounding box inputs for segmentation:
```python
>>> # Define bounding box as [x_min, y_min, x_max, y_max]
>>> input_boxes = [[[75, 275, 1725, 850]]]
>>> inputs = processor(images=raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])[0]
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.9301, 0.9348, 0.6605], device='cuda:0')
```
#### Multiple Objects Segmentation
You can segment multiple objects simultaneously:
```python
>>> # Define points for two different objects
>>> input_points = [[[[500, 375]], [[650, 750]]]] # Points for two objects in same image
>>> input_labels = [[[1], [1]]] # Positive clicks for both objects
>>> inputs = processor(images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs, multimask_output=False)
>>> # Each object gets its own mask
>>> masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])[0]
>>> print(f"Generated masks for {masks.shape[0]} objects")
Generated masks for 2 objects
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.7616, 0.9465], device='cuda:0')
```
### Batch Inference
#### Batched Images
Process multiple images simultaneously for improved efficiency:
```python
>>> from transformers import Sam2Processor, EdgeTamModel, infer_device
>>> import torch
>>> from PIL import Image
>>> import requests
>>> device = infer_device()
>>> model = EdgeTamModel.from_pretrained("yonigozlan/edgetam-1").to(device)
>>> processor = Sam2Processor.from_pretrained("yonigozlan/edgetam-1")
>>> # Load multiple images
>>> image_urls = [
... "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/truck.jpg",
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dog-sam.png"
... ]
>>> raw_images = [Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in image_urls]
>>> # Single point per image
>>> input_points = [[[[500, 375]]], [[[770, 200]]]] # One point for each image
>>> input_labels = [[[1]], [[1]]] # Positive clicks for both images
>>> inputs = processor(images=raw_images, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(model.device)
>>> with torch.no_grad():
... outputs = model(**inputs, multimask_output=False)
>>> # Post-process masks for each image
>>> all_masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])
>>> print(f"Processed {len(all_masks)} images, each with {all_masks[0].shape[0]} objects")
Processed 2 images, each with 1 objects
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.7618, 0.7999], device='cuda:0')
```
#### Batched Objects per Image
Segment multiple objects within each image using batch inference:
```python
>>> # Multiple objects per image - different numbers of objects per image
>>> input_points = [
... [[[500, 375]], [[650, 750]]], # Truck image: 2 objects
... [[[770, 200]]] # Dog image: 1 object
... ]
>>> input_labels = [
... [[1], [1]], # Truck image: positive clicks for both objects
... [[1]] # Dog image: positive click for the object
... ]
>>> inputs = processor(images=raw_images, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs, multimask_output=False)
>>> all_masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])
```
#### Batched Images with Batched Objects and Multiple Points
Handle complex batch scenarios with multiple points per object:
```python
>>> # Add groceries image for more complex example
>>> groceries_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/groceries.jpg"
>>> groceries_image = Image.open(requests.get(groceries_url, stream=True).raw).convert("RGB")
>>> raw_images = [raw_images[0], groceries_image] # Use truck and groceries images
>>> # Complex batching: multiple images, multiple objects, multiple points per object
>>> input_points = [
... [[[500, 375]], [[650, 750]]], # Truck image: 2 objects with 1 point each
... [[[400, 300]], [[630, 300], [550, 300]]] # Groceries image: obj1 has 1 point, obj2 has 2 points
... ]
>>> input_labels = [
... [[1], [1]], # Truck image: positive clicks
... [[1], [1, 1]] # Groceries image: positive clicks for refinement
... ]
>>> inputs = processor(images=raw_images, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs, multimask_output=False)
>>> all_masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])
```
#### Batched Bounding Boxes
Process multiple images with bounding box inputs:
```python
>>> # Multiple bounding boxes per image (using truck and groceries images)
>>> input_boxes = [
... [[75, 275, 1725, 850], [425, 600, 700, 875], [1375, 550, 1650, 800], [1240, 675, 1400, 750]], # Truck image: 4 boxes
... [[450, 170, 520, 350], [350, 190, 450, 350], [500, 170, 580, 350], [580, 170, 640, 350]] # Groceries image: 4 boxes
... ]
>>> # Update images for this example
>>> raw_images = [raw_images[0], groceries_image] # truck and groceries
>>> inputs = processor(images=raw_images, input_boxes=input_boxes, return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs, multimask_output=False)
>>> all_masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])
>>> print(f"Processed {len(input_boxes)} images with {len(input_boxes[0])} and {len(input_boxes[1])} boxes respectively")
Processed 2 images with 4 and 4 boxes respectively
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.9301, 0.9348, 0.6605, 0.9465], device='cuda:0')
```
### Using Previous Masks as Input
EdgeTAM can use masks from previous predictions as input to refine segmentation:
```python
>>> # Get initial segmentation
>>> input_points = [[[[500, 375]]]]
>>> input_labels = [[[1]]]
>>> inputs = processor(images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> # Use the best mask as input for refinement
>>> mask_input = outputs.pred_masks[:, :, torch.argmax(outputs.iou_scores.squeeze())]
>>> # Add additional points with the mask input
>>> new_input_points = [[[[500, 375], [450, 300]]]]
>>> new_input_labels = [[[1, 1]]]
>>> inputs = processor(
... input_points=new_input_points,
... input_labels=new_input_labels,
... original_sizes=inputs["original_sizes"],
... return_tensors="pt",
... ).to(device)
>>> with torch.no_grad():
... refined_outputs = model(
... **inputs,
... input_masks=mask_input,
... image_embeddings=outputs.image_embeddings,
... multimask_output=False,
... )
```
## EdgeTamConfig
[[autodoc]] EdgeTamConfig
## EdgeTamVisionConfig
[[autodoc]] EdgeTamVisionConfig
## EdgeTamMaskDecoderConfig
[[autodoc]] EdgeTamMaskDecoderConfig
## EdgeTamPromptEncoderConfig
[[autodoc]] EdgeTamPromptEncoderConfig
## EdgeTamVisionModel
[[autodoc]] EdgeTamVisionModel
- forward
## EdgeTamModel
[[autodoc]] EdgeTamModel
- forward

View File

@ -1,297 +0,0 @@
<!--Copyright 2025 the HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
-->
*This model was released on 2025-01-13 and added to Hugging Face Transformers on 2025-09-29.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
</div>
</div>
# EdgeTAMVideo
## Overview
The EdgeTAM model was proposed in [EdgeTAM: On-Device Track Anything Model](https://huggingface.co/papers/2501.07256) Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, Bilge Soran.
EdgeTAM is an efficient adaptation of SAM 2 that introduces a 2D Spatial Perceiver architecture to optimize memory attention mechanisms for real-time video segmentation on mobile devices.
The abstract from the paper is the following:
*On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.*
This model was contributed by [yonigozlan](https://huggingface.co/yonigozlan).
The original code can be found [here](https://github.com/facebookresearch/EdgeTAM).
## Usage example
### Video Segmentation and Tracking
EdgeTAM Video's key strength is its ability to track objects across video frames efficiently on mobile devices. Here's how to use it for video segmentation:
#### Basic Video Tracking
```python
>>> from transformers import EdgeTamVideoModel, Sam2VideoProcessor, infer_device
>>> import torch
>>> device = infer_device()
>>> model = EdgeTamVideoModel.from_pretrained("yonigozlan/edgetam-video-1").to(device, dtype=torch.bfloat16)
>>> processor = Sam2VideoProcessor.from_pretrained("yonigozlan/edgetam-video-1")
>>> # Load video frames (example assumes you have a list of PIL Images)
>>> # video_frames = [Image.open(f"frame_{i:05d}.jpg") for i in range(num_frames)]
>>> # For this example, we'll use the video loading utility
>>> from transformers.video_utils import load_video
>>> video_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/bedroom.mp4"
>>> video_frames, _ = load_video(video_url)
>>> # Initialize video inference session
>>> inference_session = processor.init_video_session(
... video=video_frames,
... inference_device=device,
... dtype=torch.bfloat16,
... )
>>> # Add click on first frame to select object
>>> ann_frame_idx = 0
>>> ann_obj_id = 1
>>> points = [[[[210, 350]]]]
>>> labels = [[[1]]]
>>> processor.add_inputs_to_inference_session(
... inference_session=inference_session,
... frame_idx=ann_frame_idx,
... obj_ids=ann_obj_id,
... input_points=points,
... input_labels=labels,
... )
>>> # Segment the object on the first frame
>>> outputs = model(
... inference_session=inference_session,
... frame_idx=ann_frame_idx,
... )
>>> video_res_masks = processor.post_process_masks(
... [outputs.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
... )[0]
>>> print(f"Segmentation shape: {video_res_masks.shape}")
Segmentation shape: torch.Size([1, 1, 540, 960])
>>> # Propagate through the entire video
>>> video_segments = {}
>>> for sam2_video_output in model.propagate_in_video_iterator(inference_session):
... video_res_masks = processor.post_process_masks(
... [sam2_video_output.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
... )[0]
... video_segments[sam2_video_output.frame_idx] = video_res_masks
>>> print(f"Tracked object through {len(video_segments)} frames")
Tracked object through 200 frames
```
#### Multi-Object Video Tracking
Track multiple objects simultaneously across video frames:
```python
>>> # Reset for new tracking session
>>> inference_session.reset_inference_session()
>>> # Add multiple objects on the first frame
>>> ann_frame_idx = 0
>>> obj_ids = [2, 3]
>>> input_points = [[[[200, 300]], [[400, 150]]]] # Points for two objects (batched)
>>> input_labels = [[[1], [1]]]
>>> processor.add_inputs_to_inference_session(
... inference_session=inference_session,
... frame_idx=ann_frame_idx,
... obj_ids=obj_ids,
... input_points=input_points,
... input_labels=input_labels,
... )
>>> # Get masks for both objects on first frame
>>> outputs = model(
... inference_session=inference_session,
... frame_idx=ann_frame_idx,
... )
>>> # Propagate both objects through video
>>> video_segments = {}
>>> for sam2_video_output in model.propagate_in_video_iterator(inference_session):
... video_res_masks = processor.post_process_masks(
... [sam2_video_output.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
... )[0]
... video_segments[sam2_video_output.frame_idx] = {
... obj_id: video_res_masks[i]
... for i, obj_id in enumerate(inference_session.obj_ids)
... }
>>> print(f"Tracked {len(inference_session.obj_ids)} objects through {len(video_segments)} frames")
Tracked 2 objects through 200 frames
```
#### Refining Video Segmentation
You can add additional clicks on any frame to refine the tracking:
```python
>>> # Add refinement click on a later frame
>>> refine_frame_idx = 50
>>> ann_obj_id = 2 # Refining first object
>>> points = [[[[220, 280]]]] # Additional point
>>> labels = [[[1]]] # Positive click
>>> processor.add_inputs_to_inference_session(
... inference_session=inference_session,
... frame_idx=refine_frame_idx,
... obj_ids=ann_obj_id,
... input_points=points,
... input_labels=labels,
... )
>>> # Re-propagate with the additional information
>>> video_segments = {}
>>> for sam2_video_output in model.propagate_in_video_iterator(inference_session):
... video_res_masks = processor.post_process_masks(
... [sam2_video_output.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
... )[0]
... video_segments[sam2_video_output.frame_idx] = video_res_masks
```
### Streaming Video Inference
For real-time applications, EdgeTAM Video supports processing video frames as they arrive:
```python
>>> # Initialize session for streaming
>>> inference_session = processor.init_video_session(
... inference_device=device,
... dtype=torch.bfloat16,
... )
>>> # Process frames one by one
>>> for frame_idx, frame in enumerate(video_frames[:10]): # Process first 10 frames
... inputs = processor(images=frame, device=device, return_tensors="pt")
...
... if frame_idx == 0:
... # Add point input on first frame
... processor.add_inputs_to_inference_session(
... inference_session=inference_session,
... frame_idx=0,
... obj_ids=1,
... input_points=[[[[210, 350], [250, 220]]]],
... input_labels=[[[1, 1]]],
... original_size=inputs.original_sizes[0], # need to be provided when using streaming video inference
... )
...
... # Process current frame
... sam2_video_output = model(inference_session=inference_session, frame=inputs.pixel_values[0])
...
... video_res_masks = processor.post_process_masks(
... [sam2_video_output.pred_masks], original_sizes=inputs.original_sizes, binarize=False
... )[0]
... print(f"Frame {frame_idx}: mask shape {video_res_masks.shape}")
Frame 0: mask shape torch.Size([1, 1, 540, 960])
...
```
#### Video Batch Processing for Multiple Objects
Track multiple objects simultaneously in video by adding them all at once:
```python
>>> # Initialize video session
>>> inference_session = processor.init_video_session(
... video=video_frames,
... inference_device=device,
... dtype=torch.bfloat16,
... )
>>> # Add multiple objects on the first frame using batch processing
>>> ann_frame_idx = 0
>>> obj_ids = [2, 3] # Track two different objects
>>> input_points = [
... [[[200, 300], [230, 250], [275, 175]], [[400, 150]]]
... ] # Object 2: 3 points (2 positive, 1 negative); Object 3: 1 point
>>> input_labels = [
... [[1, 1, 0], [1]]
... ] # Object 2: positive, positive, negative; Object 3: positive
>>> processor.add_inputs_to_inference_session(
... inference_session=inference_session,
... frame_idx=ann_frame_idx,
... obj_ids=obj_ids,
... input_points=input_points,
... input_labels=input_labels,
... )
>>> # Get masks for all objects on the first frame
>>> outputs = model(
... inference_session=inference_session,
... frame_idx=ann_frame_idx,
... )
>>> video_res_masks = processor.post_process_masks(
... [outputs.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
... )[0]
>>> print(f"Generated masks for {video_res_masks.shape[0]} objects")
Generated masks for 2 objects
>>> # Propagate all objects through the video
>>> video_segments = {}
>>> for sam2_video_output in model.propagate_in_video_iterator(inference_session):
... video_res_masks = processor.post_process_masks(
... [sam2_video_output.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
... )[0]
... video_segments[sam2_video_output.frame_idx] = {
... obj_id: video_res_masks[i]
... for i, obj_id in enumerate(inference_session.obj_ids)
... }
>>> print(f"Tracked {len(inference_session.obj_ids)} objects through {len(video_segments)} frames")
Tracked 2 objects through 200 frames
```
## EdgeTamVideoMaskDecoderConfig
[[autodoc]] EdgeTamVideoMaskDecoderConfig
## EdgeTamVideoPromptEncoderConfig
[[autodoc]] EdgeTamVideoPromptEncoderConfig
## EdgeTamVideoConfig
[[autodoc]] EdgeTamVideoConfig
## EdgeTamVideoInferenceSession
[[autodoc]] EdgeTamVideoInferenceSession
## EdgeTamVideoModel
[[autodoc]] EdgeTamVideoModel
- forward

View File

@ -144,6 +144,7 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
## EfficientLoFTRImageProcessor
[[autodoc]] EfficientLoFTRImageProcessor
- preprocess
- post_process_keypoint_matching
- visualize_keypoint_matching
@ -151,6 +152,7 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
## EfficientLoFTRImageProcessorFast
[[autodoc]] EfficientLoFTRImageProcessorFast
- preprocess
- post_process_keypoint_matching
- visualize_keypoint_matching
@ -158,9 +160,11 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
## EfficientLoFTRModel
[[autodoc]] EfficientLoFTRModel
- forward
## EfficientLoFTRForKeypointMatching
[[autodoc]] EfficientLoFTRForKeypointMatching
- forward

View File

@ -25,7 +25,7 @@ Evolla is an advanced 80-billion-parameter protein-language generative model des
The abstract from the paper is the following:
*Proteins, nature's intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - that is, understanding how protein sequences and structures encode and determine biological functions - remains a corner-stone challenge in modern biology. Here, we introduce Evolla, an 80 billion frontier protein-language generative model designed to decode the molecular language of proteins. By integrating information from protein sequences, structures, and user queries, Evolla generates precise and contextually nuanced insights into protein function. A key innovation of Evolla lies in its training on an unprecedented AI-generated dataset: 546 million protein question-answer pairs and 150 billion word tokens, designed to reflect the immense complexity and functional diversity of proteins. Post-pretraining, Evolla integrates Direct Preference Optimization (DPO) to refine the model based on preference signals and Retrieval-Augmented Generation (RAG) for external knowledge incorporation, improving response quality and relevance. To evaluate its performance, we propose a novel framework, Instructional Response Space (IRS), demonstrating that Evolla delivers expert-level insights, advancing research in proteomics and functional genomics while shedding light on the molecular logic encoded in proteins. The online demo is available at http://www.chat-protein.com/.*
*Proteins, natures intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - that is, understanding how protein sequences and structures encode and determine biological functions - remains a corner-stone challenge in modern biology. Here, we introduce Evolla, an 80 billion frontier protein-language generative model designed to decode the molecular language of proteins. By integrating information from protein sequences, structures, and user queries, Evolla generates precise and contextually nuanced insights into protein function. A key innovation of Evolla lies in its training on an unprecedented AI-generated dataset: 546 million protein question-answer pairs and 150 billion word tokens, designed to reflect the immense complexity and functional diversity of proteins. Post-pretraining, Evolla integrates Direct Preference Optimization (DPO) to refine the model based on preference signals and Retrieval-Augmented Generation (RAG) for external knowledge incorporation, improving response quality and relevance. To evaluate its performance, we propose a novel framework, Instructional Response Space (IRS), demonstrating that Evolla delivers expert-level insights, advancing research in proteomics and functional genomics while shedding light on the molecular logic encoded in proteins. The online demo is available at http://www.chat-protein.com/.*
Examples:

View File

@ -30,6 +30,5 @@ Depth up-scaling for improved reasoning: Building on recent studies on the effec
Knowledge distillation for better tiny models: To provide compact and efficient alternatives, we developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
## Resources
- [Blog post](https://huggingface.co/blog/falcon3)
- [Models on Huggingface](https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026)

View File

@ -44,7 +44,6 @@ community for further reproducible experiments in French NLP.*
This model was contributed by [formiel](https://huggingface.co/formiel). The original code can be found [here](https://github.com/getalp/Flaubert).
Tips:
- Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
## Resources

View File

@ -80,7 +80,7 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</hfoption>
<hfoption id="transformers CLI">
```bash
```
echo -e "Explain quantum computing simply." | transformers run --task text-generation --model google/gemma-2-2b --device 0
```

View File

@ -21,12 +21,12 @@ rendered properly in your Markdown viewer.
The GLM family welcomes new members [GLM-4-0414](https://huggingface.co/papers/2406.12793) series models.
The **GLM-4-32B-0414** series models, featuring 32 billion parameters. Its performance is comparable to OpenAI's GPT
series and DeepSeek's V3/R1 series. It also supports very user-friendly local deployment features. GLM-4-32B-Base-0414
The **GLM-4-32B-0414** series models, featuring 32 billion parameters. Its performance is comparable to OpenAIs GPT
series and DeepSeeks V3/R1 series. It also supports very user-friendly local deployment features. GLM-4-32B-Base-0414
was pre-trained on 15T of high-quality data, including substantial reasoning-type synthetic data. This lays the
foundation for subsequent reinforcement learning extensions. In the post-training stage, we employed human preference
alignment for dialogue scenarios. Additionally, using techniques like rejection sampling and reinforcement learning, we
enhanced the model's performance in instruction following, engineering code, and function calling, thus strengthening
enhanced the models performance in instruction following, engineering code, and function calling, thus strengthening
the atomic capabilities required for agent tasks. GLM-4-32B-0414 achieves good results in engineering code, Artifact
generation, function calling, search-based Q&A, and report generation. In particular, on several benchmarks, such as
code generation or specific Q&A tasks, GLM-4-32B-Base-0414 achieves comparable performance with those larger models like

View File

@ -35,7 +35,6 @@ Through our open-source work, we aim to explore the technological frontier toget
![bench_45](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg)
Beyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle diverse types of visual content, enabling full-spectrum vision reasoning, including:
- **Image reasoning** (scene understanding, complex multi-image analysis, spatial recognition)
- **Video understanding** (long video segmentation and event recognition)
- **GUI tasks** (screen reading, icon recognition, desktop operation assistance)

View File

@ -75,7 +75,7 @@ echo -e "Hello, I'm a language model" | transformers run --task text-generation
One can also serve the model using vLLM with the `transformers backend`.
```bash
```
vllm serve openai-community/gpt2 --model-imp transformers
```

View File

@ -36,7 +36,6 @@ The model is an optimized [GPT2 model](https://huggingface.co/docs/transformers/
## Implementation details
The main differences compared to GPT2.
- Added support for Multi-Query Attention.
- Use `gelu_pytorch_tanh` instead of classic `gelu`.
- Avoid unnecessary synchronizations (this has since been added to GPT2 in #20061, but wasn't in the reference codebase).
@ -50,6 +49,9 @@ The main differences compared to GPT2.
You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575)
> [!NOTE]
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
## Combining Starcoder and Flash Attention 2
First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.

View File

@ -35,8 +35,6 @@ The abstract from the paper is the following:
*<INSERT PAPER ABSTRACT HERE>*
Tips:
- **Attention Sinks with Flex Attention**: When using flex attention, attention sinks require special handling. Unlike with standard attention implementations where sinks can be added directly to attention scores, flex attention `score_mod` function operates on individual score elements rather than the full attention matrix. Therefore, attention sinks renormalization have to be applied after the flex attention computations by renormalizing the outputs using the log-sum-exp (LSE) values returned by flex attention.
<INSERT TIPS ABOUT MODEL HERE>

View File

@ -133,7 +133,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`GPTJForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling), [text generation example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-generation), and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
**Documentation resources**
- [Text classification task guide](../tasks/sequence_classification)
- [Question answering task guide](../tasks/question_answering)
- [Causal language modeling task guide](../tasks/language_modeling)

View File

@ -37,7 +37,6 @@ Note that most of the aforementioned components are implemented generically to e
This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9944), [Avihu Dekel](https://huggingface.co/Avihu), and [George Saon](https://huggingface.co/gsaon).
## Usage tips
- This model bundles its own LoRA adapter, which will be automatically loaded and enabled/disabled as needed during inference calls. Be sure to install [PEFT](https://github.com/huggingface/peft) to ensure the LoRA is correctly applied!
<!-- TODO (@alex-jw-brooks) Add an example here once the model compatible with the transformers implementation is released -->

View File

@ -22,7 +22,6 @@ rendered properly in your Markdown viewer.
The [Granite Vision](https://www.ibm.com/new/announcements/ibm-granite-3-1-powerful-performance-long-context-and-more) model is a variant of [LLaVA-NeXT](llava_next), leveraging a [Granite](granite) language model alongside a [SigLIP](SigLIP) visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to [VipLlava](vipllava). It also uses a larger set of image grid pinpoints than the original LlaVa-NeXT models to support additional aspect ratios.
Tips:
- This model is loaded into Transformers as an instance of LlaVA-Next. The usage and tips from [LLaVA-NeXT](llava_next) apply to this model as well.
- You can apply the chat template on the tokenizer / processor in the same way as well. Example chat format:

View File

@ -114,6 +114,11 @@ print(transcription[0])
## Notes
- HuBERT models expect raw audio input as a 1D float array sampled at 16kHz.
- If you want to use a `head_mask`, use the model with `attn_implementation="eager"`.
```python
model = HubertModel.from_pretrained("facebook/hubert-base-ls960", attn_implementation="eager")
```
## HubertConfig

View File

@ -9,7 +9,7 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
*This model was released on 2023-09-20 and added to Hugging Face Transformers on 2025-08-19.*
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-08-19.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">

View File

@ -18,8 +18,7 @@ rendered properly in your Markdown viewer.
# Kyutai Speech-To-Text
## Overview
[Kyutai STT](https://kyutai.org/next/stt) is a speech-to-text model architecture based on the [Mimi codec](https://huggingface.co/docs/transformers/en/model_doc/mimi), which encodes audio into discrete tokens in a streaming fashion, and a [Moshi-like](https://huggingface.co/docs/transformers/en/model_doc/moshi) autoregressive decoder. Kyutai's lab has released two model checkpoints:
[Kyutai STT](https://kyutai.org/next/stt) is a speech-to-text model architecture based on the [Mimi codec](https://huggingface.co/docs/transformers/en/model_doc/mimi), which encodes audio into discrete tokens in a streaming fashion, and a [Moshi-like](https://huggingface.co/docs/transformers/en/model_doc/moshi) autoregressive decoder. Kyutais lab has released two model checkpoints:
- [kyutai/stt-1b-en_fr](https://huggingface.co/kyutai/stt-1b-en_fr): a 1B-parameter model capable of transcribing both English and French
- [kyutai/stt-2.6b-en](https://huggingface.co/kyutai/stt-2.6b-en): a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy

View File

@ -73,7 +73,6 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
- [Question answering task guide](../tasks/question_answering)
**Document question answering**
- [Document question answering task guide](../tasks/document_question_answering)
## LayoutLMv3Config

View File

@ -28,7 +28,6 @@ rendered properly in your Markdown viewer.
## Architecture
LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:
* Shape-optimized (400M) for more fine-grained vision capabilities for LFM2-VL-1.6B
* Base (86M) for fast image processing for LFM2-VL-450M

View File

@ -143,6 +143,7 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
## LightGlueImageProcessor
[[autodoc]] LightGlueImageProcessor
- preprocess
- post_process_keypoint_matching
- visualize_keypoint_matching
@ -150,4 +151,5 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
## LightGlueForKeypointMatching
[[autodoc]] LightGlueForKeypointMatching
- forward

View File

@ -62,7 +62,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- Demo notebooks for LiLT can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LiLT).
**Documentation resources**
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)

View File

@ -60,7 +60,7 @@ Tips:
- Weights for the Llama3 models can be obtained by filling out [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)
- The architecture is exactly the same as Llama2.
- The tokenizer is a BPE model based on [tiktoken](https://github.com/openai/tiktoken) (vs the one based on sentencepiece implementation for Llama2). The main difference that it ignores BPE merge rules when an input token is part of the vocab. This means that if no merge exist to produce `"hugging"`, instead of having the smallest units, like `["hug","ging"]` form 2 tokens, if `"hugging"` is part of the vocab, it will be automatically returned as a token.
- The tokenizer is a BPE model based on [tiktoken](https://github.com/openai/tiktoken) (vs the one based on sentencepiece implementation for Llama2). The main difference that it ignores BPE merge rules when an input token is part of the vocab. This means that if no merge exist to produce `"hugging"`, instead of having the smallest units, like `["hug","ging"] form 2 tokens, if `"hugging"` is part of the vocab, it will be automatically returned as a token.
- The original model uses `pad_id = -1` which means that there is no padding token. We can't have the same logic, make sure to add a padding token using `tokenizer.add_special_tokens({"pad_token":"<pad>"})` and resize the token embedding accordingly. You should also set the `model.config.pad_token_id`. The `embed_tokens` layer of the model is initialized with `self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx)`, which makes sure that encoding the padding token will output zeros, so passing it when initializing is recommended.
- The original checkpoint can be converted using the [conversion script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py). The script can be called with the following (example) command:

View File

@ -27,11 +27,9 @@ rendered properly in your Markdown viewer.
[Llama 4](https://ai.meta.com/blog/llama-4-multimodal-intelligence/), developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture.
This generation includes two models:
- The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts.
- The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts.
-
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs.
Maverick and Scout are both trained on up to 40 trillion tokens on data encompassing 200 languages
(with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).

View File

@ -48,21 +48,20 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/
- Note the model has not been explicitly trained to process multiple images in the same prompt, although this is technically possible, you may experience inaccurate results.
> [!NOTE]
> LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add the attributes to the processor if you own the model checkpoint, or open a PR if it is not owned by you.
> LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add the attributes to the processor if you own the model checkpoint, or open a PR if it is not owned by you.
Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `<image>` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure that the text is not truncated as otherwise there will be failure when merging the embeddings.
The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.
### Formatting Prompts with Chat Templates
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's `apply_chat_template` method.
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processors `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.
Here's an example of how to structure your input.
Heres an example of how to structure your input.
We will use [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows:
```python

View File

@ -34,7 +34,7 @@ The introduction from the blog is the following:
On January 30, 2024, we released LLaVA-NeXT, an open-source Large Multimodal Model (LMM) that has been trained exclusively on text-image data. With the proposed AnyRes technique, it boosts capabilities in reasoning, OCR, and world knowledge, demonstrating remarkable performance across a spectrum of image-based multimodal understanding tasks, and even exceeding Gemini-Pro on several image benchmarks, e.g. MMMU and MathVista.
**In today's exploration, we delve into the performance of LLaVA-NeXT within the realm of video understanding tasks. We reveal that LLaVA-NeXT surprisingly has strong performance in understanding video content. The current version of LLaVA-NeXT for videos has several improvements:
**In todays exploration, we delve into the performance of LLaVA-NeXT within the realm of video understanding tasks. We reveal that LLaVA-NeXT surprisingly has strong performance in understanding video content. The current version of LLaVA-NeXT for videos has several improvements:
- Zero-shot video representation capabilities with AnyRes: The AnyRes technique naturally represents a high-resolution image into multiple images that a pre-trained VIT is able to digest, and forms them into a concatenated sequence. This technique is naturally generalizable to represent videos (consisting of multiple frames), allowing the image-only-trained LLaVA-Next model to perform surprisingly well on video tasks. Notably, this is the first time that LMMs show strong zero-shot modality transfer ability.
- Inference with length generalization improves on longer videos. The linear scaling technique enables length generalization, allowing LLaVA-NeXT to effectively handle long-video beyond the limitation of the "max_token_length" of the LLM.
@ -55,21 +55,20 @@ The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tre
</Tip>
> [!NOTE]
> LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add the attributes to the processor if you own the model checkpoint, or open a PR if it is not owned by you.
> LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add the attributes to the processor if you own the model checkpoint, or open a PR if it is not owned by you.
Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `<image>` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure that the text is not truncated as otherwise there will be failure when merging the embeddings.
The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.
### Formatting Prompts with Chat Templates
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's `apply_chat_template` method.
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processors `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.
Here's an example of how to structure your input. We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images.
Heres an example of how to structure your input. We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images.
```python
from transformers import LlavaNextVideoProcessor

View File

@ -59,7 +59,6 @@ Tips:
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processors `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.

View File

@ -51,6 +51,9 @@ multilingual it expects the sequences in a certain format: A special language id
source and target text. The source text format is `[lang_code] X [eos]`, where `lang_code` is source language
id for source text and target language id for target text, with `X` being the source or target text.
> [!NOTE]
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the
examples. To install `sentencepiece` run `pip install sentencepiece`.

View File

@ -30,7 +30,6 @@ performance, similar to [LayoutLM](layoutlm).
The model can be used for tasks like question answering on web pages or information extraction from web pages. It obtains
state-of-the-art results on 2 important benchmarks:
- [WebSRC](https://x-lance.github.io/WebSRC/), a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages)
- [SWDE](https://www.researchgate.net/publication/221299838_From_one_tree_to_a_forest_a_unified_solution_for_structured_web_data_extraction), a dataset
for information extraction from web pages (basically named-entity recognition on web pages)

View File

@ -34,6 +34,9 @@ You can find all the original mBART checkpoints under the [AI at Meta](https://h
> [!TIP]
> Click on the mBART models in the right sidebar for more examples of applying mBART to different language tasks.
> [!NOTE]
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
The example below demonstrates how to translate text with [`Pipeline`] or the [`AutoModel`] class.
<hfoptions id="usage">

View File

@ -42,7 +42,6 @@ Mixtral-8x7B is a decoder-only Transformer with the following architectural choi
- Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length.
The following implementation details are shared with Mistral AI's first model [Mistral-7B](mistral):
- Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
- GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
- Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens.
@ -56,7 +55,6 @@ For more details refer to the [release blog post](https://mistral.ai/news/mixtra
## Usage tips
The Mistral team has released 2 checkpoints:
- a base model, [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), which has been pre-trained to predict the next token on internet-scale data.
- an instruction tuned model, [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), which is the base model optimized for chat purposes using supervised fine-tuning (SFT) and direct preference optimization (DPO).

View File

@ -28,7 +28,6 @@ Unless required by applicable law or agreed to in writing, software distributed
You can find all the original MobileViT checkpoints under the [Apple](https://huggingface.co/apple/models?search=mobilevit) organization.
> [!TIP]
>
> - This model was contributed by [matthijs](https://huggingface.co/Matthijs).
>
> Click on the MobileViT models in the right sidebar for more examples of how to apply MobileViT to different vision tasks.

View File

@ -119,7 +119,7 @@ print(f"Prediction probabilities: {predictions}")
<hfoption id="AutoModel (w/quantization)">
```py
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

View File

@ -38,7 +38,6 @@ The abstract from the paper is the following:
*We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning— such as emotion or non-speech sounds— is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this “Inner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.*
Moshi deals with 3 streams of information:
1. The user's audio
2. Moshi's audio
3. Moshi's textual output
@ -71,7 +70,6 @@ The original checkpoints can be converted using the conversion script `src/trans
### How to use the model:
This implementation has two main aims:
1. quickly test model generation by simplifying the original API
2. simplify training. A training guide will come soon, but user contributions are welcomed!
@ -86,7 +84,6 @@ It is designed for intermediate use. We strongly recommend using the original [i
Moshi is a streaming auto-regressive model with two streams of audio. To put it differently, one audio stream corresponds to what the model said/will say and the other audio stream corresponds to what the user said/will say.
[`MoshiForConditionalGeneration.generate`] thus needs 3 inputs:
1. `input_ids` - corresponding to the text token history
2. `moshi_input_values` or `moshi_audio_codes`- corresponding to the model audio history
3. `user_input_values` or `user_audio_codes` - corresponding to the user audio history
@ -94,7 +91,6 @@ Moshi is a streaming auto-regressive model with two streams of audio. To put it
These three inputs must be synchronized. Meaning that their lengths must correspond to the same number of tokens.
You can dynamically use the 3 inputs depending on what you want to test:
1. Simply check the model response to an user prompt - in that case, `input_ids` can be filled with pad tokens and `user_input_values` can be a zero tensor of the same shape than the user prompt.
2. Test more complex behaviour - in that case, you must be careful about how the input tokens are synchronized with the audios.

View File

@ -63,6 +63,9 @@ python src/transformers/models/musicgen/convert_musicgen_transformers.py \
--checkpoint small --pytorch_dump_folder /output/path --safe_serialization
```
> [!NOTE]
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
## Generation
MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly
@ -230,7 +233,6 @@ generation config.
## Model Structure
The MusicGen model can be de-composed into three distinct stages:
1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5
2. MusicGen decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations
3. Audio encoder/decoder: used to encode an audio prompt to use as prompt tokens, and recover the audio waveform from the audio tokens predicted by the decoder
@ -257,7 +259,6 @@ be combined with the frozen text encoder and audio encoder/decoders to recover t
model.
Tips:
* MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model.
* Sampling mode tends to deliver better results than greedy - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenForConditionalGeneration.generate`]

View File

@ -40,10 +40,12 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o
## Difference with [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen)
There are two key differences with MusicGen:
1. The audio prompt is used here as a conditional signal for the generated audio sample, whereas it's used for audio continuation in [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen).
2. Conditional text and audio signals are concatenated to the decoder's hidden states instead of being used as a cross-attention signal, as in MusicGen.
> [!NOTE]
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
## Generation
MusicGen Melody is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly better results than greedy, thus we encourage sampling mode to be used where possible. Sampling is enabled by default, and can be explicitly specified by setting `do_sample=True` in the call to [`MusicgenMelodyForConditionalGeneration.generate`], or by overriding the model's generation config (see below).
@ -56,7 +58,7 @@ The model can generate an audio sample conditioned on a text and an audio prompt
In the following examples, we load an audio file using the 🤗 Datasets library, which can be pip installed through the command below:
```bash
```
pip install --upgrade pip
pip install datasets[audio]
```
@ -225,7 +227,6 @@ Note that any arguments passed to the generate method will **supersede** those i
## Model Structure
The MusicGen model can be de-composed into three distinct stages:
1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5.
2. MusicGen Melody decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations
3. Audio decoder: used to recover the audio waveform from the audio tokens predicted by the decoder.
@ -255,7 +256,6 @@ python src/transformers/models/musicgen_melody/convert_musicgen_melody_transform
```
Tips:
* MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model.
* Sampling mode tends to deliver better results than greedy - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenMelodyForConditionalGeneration.generate`]

View File

@ -68,7 +68,6 @@ The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, widt
`(batch_size, height, width, num_channels)`.
Notes:
- NAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention.
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten),
or build on your system by running `pip install natten`.

View File

@ -109,7 +109,7 @@ Please refer to our [paper](https://huggingface.co/papers/2407.14679) for the fu
If you find our work helpful, please consider citing our paper:
```bibtex
```
@article{minitron2024,
title={Compact Language Models via Pruning and Knowledge Distillation},
author={Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov},

View File

@ -101,6 +101,8 @@ tokenizer.batch_decode(generated_ids)[0]
- OPT adds an `EOS` token `</s>` to the beginning of every prompt.
- The `head_mask` argument is ignored if the attention implementation isn't `"eager"`. Set `attn_implementation="eager"` to enable the `head_mask`.
## Resources
- Refer to this [notebook](https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing) for an example of fine-tuning OPT with PEFT, bitsandbytes, and Transformers.

View File

@ -1,221 +0,0 @@
<!--Copyright 2025 The NVIDIA NeMo Team and The HuggingFace Inc. team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-09-25.*
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
# Parakeet
## Overview
Parakeet models, [introduced by NVIDIA NeMo](https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/), are models that combine a [Fast Conformer](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) encoder with connectionist temporal classification (CTC), recurrent neural network transducer (RNNT) or token and duration transducer (TDT) decoder for automatic speech recognition.
**Model Architecture**
- **Fast Conformer Encoder**: A linearly scalable Conformer architecture that processes mel-spectrogram features and reduces sequence length through subsampling. This is more efficient version of the Conformer Encoder found in [FastSpeech2Conformer](./fastspeech2_conformer.md) (see [`ParakeetEncoder`] for the encoder implementation and details).
- [**ParakeetForCTC**](#parakeetforctc): a Fast Conformer Encoder + a CTC decoder
- **CTC Decoder**: Simple but effective decoder consisting of:
- 1D convolution projection from encoder hidden size to vocabulary size (for optimal NeMo compatibility).
- CTC loss computation for training.
- Greedy CTC decoding for inference.
The original implementation can be found in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
Model checkpoints are to be found under [the NVIDIA organization](https://huggingface.co/nvidia/models?search=parakeet).
This model was contributed by [Nithin Rao Koluguri](https://huggingface.co/nithinraok), [Eustache Le Bihan](https://huggingface.co/eustlb) and [Eric Bezzam](https://huggingface.co/bezzam).
## Usage
### Basic usage
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-1.1b")
out = pipe("https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3")
print(out)
```
</hfoption>
<hfoption id="AutoModel">
```py
from transformers import AutoModelForCTC, AutoProcessor
from datasets import load_dataset, Audio
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-1.1b")
model = AutoModelForCTC.from_pretrained("nvidia/parakeet-ctc-1.1b", dtype="auto", device_map=device)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:5]]
inputs = processor(speech_samples, sampling_rate=processor.feature_extractor.sampling_rate)
inputs.to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs)
print(processor.batch_decode(outputs))
```
</hfoption>
</hfoptions>
### Making The Model Go Brrr
Parakeet supports full-graph compilation with CUDA graphs! This optimization is most effective when you know the maximum audio length you want to transcribe. The key idea is using static input shapes to avoid recompilation. For example, if you know your audio will be under 30 seconds, you can use the processor to pad all inputs to 30 seconds, preparing consistent input features and attention masks. See the example below!
```python
from transformers import AutoModelForCTC, AutoProcessor
from datasets import load_dataset, Audio
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-1.1b")
model = AutoModelForCTC.from_pretrained("nvidia/parakeet-ctc-1.1b", dtype="auto", device_map=device)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:5]]
# Compile the generate method with fullgraph and CUDA graphs
model.generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")
# let's define processor kwargs to pad to 30 seconds
processor_kwargs = {
"padding": "max_length",
"max_length": 30 * processor.feature_extractor.sampling_rate,
}
# Define a timing context using CUDA events
class TimerContext:
def __init__(self, name="Execution"):
self.name = name
self.start_event = None
self.end_event = None
def __enter__(self):
# Use CUDA events for more accurate GPU timing
self.start_event = torch.cuda.Event(enable_timing=True)
self.end_event = torch.cuda.Event(enable_timing=True)
self.start_event.record()
return self
def __exit__(self, *args):
self.end_event.record()
torch.cuda.synchronize()
elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
print(f"{self.name} time: {elapsed_time:.4f} seconds")
inputs = processor(speech_samples[0], **processor_kwargs)
inputs.to(device, dtype=model.dtype)
print("\n" + "="*50)
print("First generation - compiling...")
# Generate with the compiled model
with TimerContext("First generation"):
outputs = model.generate(**inputs)
print(processor.batch_decode(outputs))
inputs = processor(speech_samples[1], **processor_kwargs)
inputs.to(device, dtype=model.dtype)
print("\n" + "="*50)
print("Second generation - recording CUDA graphs...")
with TimerContext("Second generation"):
outputs = model.generate(**inputs)
print(processor.batch_decode(outputs))
inputs = processor(speech_samples[2], **processor_kwargs)
inputs.to(device, dtype=model.dtype)
print("\n" + "="*50)
print("Third generation - fast !!!")
with TimerContext("Third generation"):
outputs = model.generate(**inputs)
print(processor.batch_decode(outputs))
inputs = processor(speech_samples[3], **processor_kwargs)
inputs.to(device, dtype=model.dtype)
print("\n" + "="*50)
print("Fourth generation - still fast !!!")
with TimerContext("Fourth generation"):
outputs = model.generate(**inputs)
print(processor.batch_decode(outputs))
```
### Training
```python
from transformers import AutoModelForCTC, AutoProcessor
from datasets import load_dataset, Audio
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-1.1b")
model = AutoModelForCTC.from_pretrained("nvidia/parakeet-ctc-1.1b", dtype="auto", device_map=device)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:5]]
text_samples = [el for el in ds["text"][:5]]
# passing `text` to the processor will prepare inputs' `labels` key
inputs = processor(audio=speech_samples, text=text_samples, sampling_rate=processor.feature_extractor.sampling_rate)
inputs.to(device, dtype=model.dtype)
outputs = model(**inputs)
outputs.loss.backward()
```
## ParakeetTokenizerFast
[[autodoc]] ParakeetTokenizerFast
## ParakeetFeatureExtractor
[[autodoc]] ParakeetFeatureExtractor
- __call__
## ParakeetProcessor
[[autodoc]] ParakeetProcessor
- __call__
- batch_decode
- decode
## ParakeetEncoderConfig
[[autodoc]] ParakeetEncoderConfig
## ParakeetCTCConfig
[[autodoc]] ParakeetCTCConfig
## ParakeetEncoder
[[autodoc]] ParakeetEncoder
## ParakeetForCTC
[[autodoc]] ParakeetForCTC

View File

@ -45,14 +45,13 @@ The original code for PhiMoE can be found [here](https://huggingface.co/microsof
<Tip warning={true}>
Phi-3.5-MoE-instruct has been integrated in the development version (4.44.2.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing the following:
* When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function.
The current `transformers` version can be verified with: `pip list | grep transformers`.
Examples of required packages:
```bash
```
flash_attn==2.5.8
torch==2.3.1
accelerate==0.31.0

View File

@ -59,7 +59,6 @@ pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy
```
Please note that you may need to restart your runtime after installation.
* Pop2Piano is an Encoder-Decoder based model like T5.
* Pop2Piano can be used to generate midi-audio files for a given audio sequence.
* Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to variety of different results.

View File

@ -115,7 +115,7 @@ tensors. After setting up the tensor quantizers, one can use the following examp
The goal of exporting to ONNX is to deploy inference by [TensorRT](https://developer.nvidia.com/tensorrt). Fake
quantization will be broken into a pair of QuantizeLinear/DequantizeLinear ONNX ops. After setting static member of
TensorQuantizer to use Pytorch's own fake quantization functions, fake quantized model can be exported to ONNX, follow
TensorQuantizer to use Pytorchs own fake quantization functions, fake quantized model can be exported to ONNX, follow
the instructions in [torch.onnx](https://pytorch.org/docs/stable/onnx.html). Example:
```python

View File

@ -29,7 +29,7 @@ The [Qwen2.5-Omni](https://qwenlm.github.io/blog/qwen2.5-omni/) model is a unifi
The abstract from the technical report is the following:
*We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.*
*We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omnis streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.*
## Notes
@ -273,7 +273,7 @@ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min
#### Prompt for audio output
If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.
```python
```
{
"role": "system",
"content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",

View File

@ -40,6 +40,9 @@ The abstract from the paper is the following:
`Qwen2-Audio-7B` and `Qwen2-Audio-7B-Instruct` can be found on the [Huggingface Hub](https://huggingface.co/Qwen)
> [!NOTE]
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
### Inference
```python

View File

@ -19,7 +19,6 @@ rendered properly in your Markdown viewer.
The Qwen3-Next series represents our next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency.
The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:
- **Hybrid Attention**: Replaces standard attention with the combination of **Gated DeltaNet** and **Gated Attention**, enabling efficient context modeling.
- **High-Sparsity MoE**: Achieves an extreme low activation ratio as 1:50 in MoE layers — drastically reducing FLOPs per token while preserving model capacity.
- **Multi-Token Prediction(MTP)**: Boosts pretraining model performance, and accelerates inference.

Some files were not shown because too many files have changed in this diff Show More