Mirror of https://github.com/huggingface/transformers.git
Synced 2025-11-13 09:45:07 +08:00

Compare commits: logger_ref...kernels-op (21 commits)
| SHA1 |
|---|
| 1d3de2d89d |
| 2b8068c306 |
| 33c60a5254 |
| fa22b56903 |
| f30c22500b |
| 496c283615 |
| df45a92cea |
| 3ff0e69f84 |
| 31839d741a |
| 2072f3059e |
| 3760afb21c |
| 3c0b2b101e |
| e869e9df54 |
| 37d48bbb48 |
| 21913b2e10 |
| fddc1de0a4 |
| bfb29a0259 |
| 0f45870a77 |
| 51c803299a |
| ebc7977b4f |
| 87bce70355 |
@@ -169,6 +169,9 @@ print("Pooled output shape:", pooled_output.shape)
 [[autodoc]] DINOv3ViTModel
     - forward
 
+## DINOv3ViTBackbone
+
+[[autodoc]] DINOv3ViTBackbone
 
 ## DINOv3ConvNextModel
 
 [[autodoc]] DINOv3ConvNextModel
@@ -127,7 +127,7 @@ def parse_args():
     parser.add_argument(
         "--use_slow_tokenizer",
         action="store_true",
-        help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
+        help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
     )
     parser.add_argument(
         "--per_device_train_batch_size",

@@ -132,7 +132,7 @@ def parse_args():
     parser.add_argument(
         "--use_slow_tokenizer",
         action="store_true",
-        help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
+        help="If passed, will use a slow tokenizer (not backed by the Tokenizers library).",
     )
     parser.add_argument(
         "--per_device_train_batch_size",

@@ -130,7 +130,7 @@ def parse_args():
     parser.add_argument(
         "--use_slow_tokenizer",
         action="store_true",
-        help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
+        help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
     )
     parser.add_argument(
         "--per_device_train_batch_size",

@@ -128,7 +128,7 @@ def parse_args():
     parser.add_argument(
         "--use_slow_tokenizer",
         action="store_true",
-        help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
+        help="If passed, will use a slow tokenizer (not backed by the HuggingFace Tokenizers library).",
     )
     parser.add_argument(
         "--per_device_train_batch_size",

@@ -151,7 +151,7 @@ def parse_args():
     parser.add_argument(
         "--use_slow_tokenizer",
         action="store_true",
-        help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
+        help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
     )
     parser.add_argument(
         "--per_device_train_batch_size",

@@ -223,7 +223,7 @@ def parse_args():
     parser.add_argument(
         "--use_slow_tokenizer",
         action="store_true",
-        help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
+        help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
     )
     parser.add_argument(
         "--per_device_train_batch_size",

@@ -120,7 +120,7 @@ def parse_args():
     parser.add_argument(
         "--use_slow_tokenizer",
         action="store_true",
-        help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
+        help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
     )
     parser.add_argument(
         "--per_device_train_batch_size",

@@ -212,7 +212,7 @@ def parse_args():
     parser.add_argument(
         "--use_slow_tokenizer",
         action="store_true",
-        help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
+        help="If passed, will use a slow tokenizer (not backed by the Hugging Face Tokenizers library).",
     )
     parser.add_argument(
         "--per_device_train_batch_size",
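These hunks only reword a help string; the behavior behind `--use_slow_tokenizer` is the choice between the Rust-backed fast tokenizer and the pure-Python slow one. A minimal sketch of what the flag typically maps to in these example scripts (the checkpoint name is illustrative):

```python
from transformers import AutoTokenizer

# --use_slow_tokenizer in the example scripts corresponds to use_fast=False here
slow_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
fast_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

print(type(slow_tok).__name__)  # BertTokenizer (pure Python)
print(type(fast_tok).__name__)  # BertTokenizerFast (backed by the Tokenizers library)
```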
setup.py
@@ -138,7 +138,7 @@ _deps = [
     "pyyaml>=5.1",
     "pydantic>=2",
     "pytest>=7.2.0",
-    "pytest-asyncio",
+    "pytest-asyncio>=1.2.0",
     "pytest-rerunfailures<16.0",
     "pytest-timeout",
     "pytest-xdist",
@@ -723,7 +723,7 @@ class DataCollatorForLanguageModeling(DataCollatorMixin):
 
         if self.mask_replace_prob < 1:
             warnings.warn(
-                "Random token replacement is not supported with whole word masking.",
+                "Random token replacement is not supported with whole word masking. "
                 "Setting mask_replace_prob to 1.",
             )
             self.mask_replace_prob = 1
@@ -82,7 +82,7 @@ class GlueDataset(Dataset):
         cache_dir: Optional[str] = None,
     ):
         warnings.warn(
-            "This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
+            "This dataset will be removed from the library soon, preprocessing should be handled with the Hugging Face Datasets "
            "library. You can have a look at this example script for pointers: "
            "https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py",
            FutureWarning,
@@ -21,7 +21,7 @@ if is_sklearn_available():
 
 
 DEPRECATION_WARNING = (
-    "This metric will be removed from the library soon, metrics should be handled with the 🤗 Evaluate "
+    "This metric will be removed from the library soon, metrics should be handled with the Hugging Face Evaluate "
     "library. You can have a look at this example script for pointers: "
     "https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py"
 )
@@ -28,7 +28,7 @@ from .utils import DataProcessor, InputExample, InputFeatures
 logger = logging.get_logger(__name__)
 
 DEPRECATION_WARNING = (
-    "This {0} will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
+    "This {0} will be removed from the library soon, preprocessing should be handled with the Hugging Face Datasets "
     "library. You can have a look at this example script for pointers: "
     "https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py"
 )
@@ -48,7 +48,7 @@ deps = {
     "pyyaml": "pyyaml>=5.1",
     "pydantic": "pydantic>=2",
     "pytest": "pytest>=7.2.0",
-    "pytest-asyncio": "pytest-asyncio",
+    "pytest-asyncio": "pytest-asyncio>=1.2.0",
     "pytest-rerunfailures": "pytest-rerunfailures<16.0",
     "pytest-timeout": "pytest-timeout",
     "pytest-xdist": "pytest-xdist",
@@ -608,7 +608,7 @@ class GenerationMixin(ContinuousMixin):
         use_cache = kwargs.get("use_cache")
         if use_cache is None:
             use_cache = getattr(self.config, "use_cache", False)
-        if past_key_values is None or use_cache:
+        if past_key_values is not None or use_cache:
            # TODO (joao): handle the case where cache length == input_ids length. The function below results in an
            # exception because we get empty input_ids after slicing. In essence, we need to roll back the cache 1
            # token to recompute the logits for the first token to be generated (but not all caches support roll backs)
@@ -821,14 +821,26 @@ def split_to_tiles(images: "torch.Tensor", num_tiles_height: int, num_tiles_widt
     return image
 
 
 def _cast_tensor_to_float(x):
     if x.is_floating_point():
         return x
     return x.float()
 
 
 def _group_images_by_shape(nested_images, *paired_inputs, is_nested: bool = False):
-    """Helper function to flatten a single level of nested image and batch structures and group by shape."""
+    """
+    Helper function to flatten a single level of nested image and batch structures and group by shape.
+    Args:
+        nested_images (list):
+            A list of images or a single tensor
+        paired_inputs (Any, *optional*):
+            Zero or more lists that mirror the structure of `nested_images` (flat list, or list of lists when
+            `is_nested=True`). Each element is paired 1:1 with the corresponding image so it can be grouped by the
+            same shape key. These paired values are grouped alongside `nested_images` but are not stacked in the output, so
+            they do not need to be tensors.
+        is_nested (bool, *optional*, defaults to False):
+            Whether the images are nested.
+    Returns:
+        tuple[dict, ...]:
+            - A dictionary with shape as key and list of images with that shape as value
+            - A dictionary with shape as key and list of paired values with that shape as value
+            - A dictionary mapping original indices to (shape, index) tuples
+            - A dictionary mapping original indices to (shape, index) tuples for each paired input
+    """
     grouped_images = defaultdict(list)
     grouped_images_index = {}
     paired_grouped_values = [defaultdict(list) for _ in paired_inputs]

@@ -880,27 +892,20 @@ def _reconstruct_nested_structure(indices, processed_images):
     return result
 
 
-def _disable_grouping_output_nested(images, *paired_inputs):
-    """Build the disable_grouping output tuple for a single-level nested structure."""
-    outer_range = range(len(images))
-    inner_ranges = [range(len(images[i])) for i in outer_range]
-
-    # Precompute all (i, j) pairs
-    ij_pairs = [(i, j) for i in outer_range for j in inner_ranges[i]]
-
-    images_dict = {(i, j): images[i][j].unsqueeze(0) for (i, j) in ij_pairs}
-    paired_dicts = [{(i, j): paired_list[i][j].unsqueeze(0) for (i, j) in ij_pairs} for paired_list in paired_inputs]
-    index_map = {(i, j): ((i, j), 0) for (i, j) in ij_pairs}
-    return images_dict, *paired_dicts, index_map
-
-
-def _disable_grouping_output_flat(images, *paired_inputs):
-    """Build the disable_grouping output tuple for a flat list structure."""
-    idx_range = range(len(images))
-    images_dict = {i: images[i].unsqueeze(0) for i in idx_range}
-    paired_dicts = [{i: paired_list[i].unsqueeze(0) for i in idx_range} for paired_list in paired_inputs]
-    index_map = {i: (i, 0) for i in idx_range}
-    return images_dict, *paired_dicts, index_map
+def _iterate_items(items, is_nested: bool):
+    """
+    Helper function to iterate over items yielding (key, item) pairs.
+
+    For nested structures, yields ((row_index, col_index), item).
+    For flat structures, yields (index, item).
+    """
+    if is_nested:
+        for i, row in enumerate(items):
+            for j, item in enumerate(row):
+                yield (i, j), item
+    else:
+        for i, item in enumerate(items):
+            yield i, item
 
 
 def group_images_by_shape(

@@ -920,7 +925,7 @@ def group_images_by_shape(
     Args:
         images (Union[list["torch.Tensor"], "torch.Tensor"]):
             A list of images or a single tensor
-        *paired_inputs (Any):
+        paired_inputs (Any, *optional*):
            Zero or more lists that mirror the structure of `images` (flat list, or list of lists when
            `is_nested=True`). Each element is paired 1:1 with the corresponding image so it can be grouped by the
            same shape key. These paired values are grouped alongside `images` but are not stacked in the output, so

@@ -944,10 +949,14 @@ def group_images_by_shape(
         disable_grouping = device == "cpu"
 
     if disable_grouping:
-        if is_nested:
-            return _disable_grouping_output_nested(images, *paired_inputs)
-        else:
-            return _disable_grouping_output_flat(images, *paired_inputs)
+        return (
+            {key: img.unsqueeze(0) for key, img in _iterate_items(images, is_nested)},
+            *[
+                {key: item.unsqueeze(0) for key, item in _iterate_items(paired_list, is_nested)}
+                for paired_list in paired_inputs
+            ],
+            {key: (key, 0) for key, _ in _iterate_items(images, is_nested)},
+        )
 
     # Handle single level nested structure
     grouped_images, *paired_grouped_values, grouped_images_index = _group_images_by_shape(

@@ -990,14 +999,3 @@ def reorder_images(
     ]
 
     return _reconstruct_nested_structure(grouped_images_index, processed_images)
-
-
-class NumpyToTensor:
-    """
-    Convert a numpy array to a PyTorch tensor.
-    """
-
-    def __call__(self, image: np.ndarray):
-        # Same as in PyTorch, we assume incoming numpy images are in HWC format
-        # c.f. https://github.com/pytorch/vision/blob/61d97f41bc209e1407dcfbd685d2ee2da9c1cdad/torchvision/transforms/functional.py#L154
-        return torch.from_numpy(image.transpose(2, 0, 1)).contiguous()
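The hunks above generalize `group_images_by_shape` so that extra per-image payloads can be grouped by the same shape key as the images without being stacked. A rough usage sketch, assuming the helpers live in `transformers.image_processing_utils_fast` (import path inferred from the surrounding code, not verified):

```python
import torch
# assumed import path for the helpers shown in the diff above
from transformers.image_processing_utils_fast import group_images_by_shape, reorder_images

images = [torch.rand(3, 224, 224), torch.rand(3, 256, 256), torch.rand(3, 224, 224)]
masks = [torch.zeros(1, 224, 224), torch.zeros(1, 256, 256), torch.ones(1, 224, 224)]

# masks ride along as a "paired input": grouped under the same shape key, never stacked
grouped_images, grouped_masks, index = group_images_by_shape(images, masks, disable_grouping=False)

# process each shape group as a batch, then restore the original per-image order
processed = {shape: batch * 2.0 for shape, batch in grouped_images.items()}
restored = reorder_images(processed, index)
assert len(restored) == len(images)
```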
@@ -11,6 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import os
 import re
 from collections.abc import Callable
 from functools import partial

@@ -18,7 +19,7 @@ from types import ModuleType
 from typing import Optional, Union
 
 from ..modeling_flash_attention_utils import lazy_import_flash_attention
-from ..utils import logging
+from ..utils import ENV_VARS_TRUE_VALUES, logging
 from ..utils.import_utils import is_kernels_available
 from .flash_attention import flash_attention_forward

@@ -33,10 +34,22 @@ try:
         get_kernel,
         register_kernel_mapping,
         replace_kernel_forward_from_hub,
-        use_kernel_forward_from_hub,
     )
 
+    _TRANSFORMERS_USE_HUB_KERNELS = os.environ.get("USE_HUB_KERNELS", "YES").upper()
     _kernels_available = True
+    _kernels_enabled = _TRANSFORMERS_USE_HUB_KERNELS in ENV_VARS_TRUE_VALUES
+
+    def use_kernel_forward_from_hub(layer_name: str):
+        if _kernels_enabled:
+            from kernels import use_kernel_forward_from_hub as _kernels_use_kernel_forward_from_hub
+
+            return _kernels_use_kernel_forward_from_hub(layer_name)
+        else:
+            logger.warning_once(
+                f"kernels hub usage is disabled through the environment USE_HUB_KERNELS={_TRANSFORMERS_USE_HUB_KERNELS}"
+            )
+            return lambda cls: cls
 
     _KERNEL_MAPPING: dict[str, dict[Union[Device, str], LayerRepository]] = {
         "MultiScaleDeformableAttention": {

@@ -161,6 +174,7 @@ try:
 
 except ImportError:
     _kernels_available = False
+    _kernels_enabled = False
 
     # Stub to make decorators int transformers work when `kernels`
     # is not installed.
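The new `USE_HUB_KERNELS` switch (default `"YES"`) makes it possible to opt out of Hub-provided kernels even when the `kernels` package is installed. A small sketch of toggling it; the env var name and default come from the diff, everything else is an assumption:

```python
import os

# Must be set before transformers imports its kernels integration, since the
# module reads the variable at import time. Any value outside ENV_VARS_TRUE_VALUES
# (e.g. "NO") disables hub kernels and keeps the stock forward implementations.
os.environ["USE_HUB_KERNELS"] = "NO"

from transformers.utils import ENV_VARS_TRUE_VALUES

print("hub kernels enabled:", os.environ["USE_HUB_KERNELS"].upper() in ENV_VARS_TRUE_VALUES)
```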
@@ -38,7 +38,7 @@ from transformers.utils.import_utils import _is_package_available
 
 
 if os.getenv("WANDB_MODE") == "offline":
-    print("⚙️ Running in WANDB offline mode")
+    print("[INFO] Running in WANDB offline mode")
 
 from .. import PreTrainedModel, TrainingArguments
 from .. import __version__ as version
@@ -272,7 +272,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
 
     args = parser.parse_args()
@@ -1700,6 +1700,7 @@ MODEL_FOR_BACKBONE_MAPPING_NAMES = OrderedDict(
         ("dinov2", "Dinov2Backbone"),
         ("dinov2_with_registers", "Dinov2WithRegistersBackbone"),
         ("dinov3_convnext", "DINOv3ConvNextBackbone"),
+        ("dinov3_vit", "DINOv3ViTBackbone"),
         ("focalnet", "FocalNetBackbone"),
         ("hgnet_v2", "HGNetV2Backbone"),
         ("hiera", "HieraBackbone"),
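Registering `DINOv3ViTBackbone` in `MODEL_FOR_BACKBONE_MAPPING_NAMES` is what lets `AutoBackbone` resolve a `dinov3_vit` config to the new class. A hedged sketch (the checkpoint id below is a placeholder, not a verified repository):

```python
from transformers import AutoBackbone

# placeholder checkpoint name, for illustration only
backbone = AutoBackbone.from_pretrained("org/dinov3-vit-checkpoint", out_indices=[-1])
print(type(backbone).__name__)  # expected: DINOv3ViTBackbone
```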
@@ -138,7 +138,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--push_to_hub",
         action="store_true",
-        help="Whether or not to push the converted model and processor to the 🤗 hub.",
+        help="Whether or not to push the converted model and processor to the Hugging Face hub.",
     )
 
     args = parser.parse_args()

@@ -257,7 +257,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
 
     args = parser.parse_args()

@@ -274,7 +274,7 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
     )
     parser.add_argument(
-        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
     )
     parser.add_argument("--sample_rate", default=None, type=str, help="Sample rate used by DacFeatureExtractor")
     args = parser.parse_args()
@@ -258,7 +258,7 @@ def convert_wav2vec2_checkpoint(
     max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()
     print(f"max_absolute_diff = {max_absolute_diff}")  # ~ 1e-7
     success = torch.allclose(our_output, their_output, atol=1e-3)
-    print("Do both models output the same tensors?", "🔥" if success else "💩")
+    print("Do both models output the same tensors?", "[PASS]" if success else "[FAIL]")
     if not success:
         raise Exception("Something went wRoNg")

@@ -180,7 +180,7 @@ def convert_data2vec_checkpoint_to_pytorch(
     max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()
     print(f"max_absolute_diff = {max_absolute_diff}")  # ~ 1e-7
     success = torch.allclose(our_output, their_output, atol=1e-3)
-    print("Do both models output the same tensors?", "🔥" if success else "💩")
+    print("Do both models output the same tensors?", "[PASS]" if success else "[FAIL]")
     if not success:
         raise Exception("Something went wRoNg")

@@ -341,7 +341,7 @@ def main():
 
     print(f"max_absolute_diff = {max_absolute_diff}")
     success = torch.allclose(hf_output, orig_model_output, atol=1e-3)
-    print("Do both models output the same tensors?", "🔥" if success else "💩")
+    print("Do both models output the same tensors?", "[PASS]" if success else "[FAIL]")
     if not success:
         raise Exception("Something went wRoNg")
@@ -222,7 +222,9 @@ if __name__ == "__main__":
         help="Path to the folder to output PyTorch model.",
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     args = parser.parse_args()
     convert_deformable_detr_checkpoint(

@@ -299,9 +299,9 @@ def convert_bort_checkpoint_to_pytorch(bort_checkpoint_path: str, pytorch_dump_f
     success = np.allclose(gluon_layer, hf_layer, atol=1e-3)
 
     if success:
-        print("✔️ Both model do output the same tensors")
+        print("[SUCCESS] Both models do output the same tensors")
     else:
-        print("❌ Both model do **NOT** output the same tensors")
+        print("[FAIL] Both models do **NOT** output the same tensors")
     print("Absolute difference is:", max_absolute_diff)

@@ -313,7 +313,9 @@ if __name__ == "__main__":
         help="Path to the folder to output PyTorch model.",
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     args = parser.parse_args()
     convert_deta_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)

@@ -320,7 +320,9 @@ if __name__ == "__main__":
         help="Path to the folder to output PyTorch model.",
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     args = parser.parse_args()
     convert_deta_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
@@ -22,7 +22,7 @@ from ....activations import ACT2FN
 from ....cache_utils import Cache
 from ....modeling_outputs import MoECausalLMOutputWithPast, MoEModelOutputWithPastAndCrossAttentions
 from ....modeling_utils import PreTrainedModel
-from ....utils import DUMMY_INPUTS, DUMMY_MASK, auto_docstring
+from ....utils import DUMMY_INPUTS, DUMMY_MASK
 from .configuration_gptsan_japanese import GPTSanJapaneseConfig

@@ -635,7 +635,6 @@ class GPTSanJapaneseModel(GPTSanJapanesePreTrainedModel):
     def set_input_embeddings(self, new_embeddings):
         self.embed_tokens = new_embeddings
 
-    @auto_docstring
     def forward(
         self,
         input_ids: Optional[torch.LongTensor] = None,
@@ -278,7 +278,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
 
     args = parser.parse_args()

@@ -284,7 +284,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
 
     args = parser.parse_args()
@@ -18,12 +18,13 @@ from typing import Optional
 
 from ...configuration_utils import PreTrainedConfig
 from ...utils import logging
+from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
 
 
 logger = logging.get_logger(__name__)
 
 
-class DINOv3ViTConfig(PreTrainedConfig):
+class DINOv3ViTConfig(BackboneConfigMixin, PreTrainedConfig):
     r"""
     This is the configuration class to store the configuration of a [`DINOv3Model`]. It is used to instantiate an
     DINOv3 model according to the specified arguments, defining the model architecture. Instantiating a configuration

@@ -86,6 +87,16 @@ class DINOv3ViTConfig(PreTrainedConfig):
         pos_embed_rescale (`float`, *optional*, defaults to 2.0):
             Amount to randomly rescale position embedding coordinates in log-uniform value in [1/rescale, rescale],
             applied only in training mode if not `None`.
+        out_features (`list[str]`, *optional*):
+            If used as backbone, list of features to output. Can be any of `"stem"`, `"stage1"`, `"stage2"`, etc.
+            (depending on how many stages the model has). Will default to the last stage if unset.
+        out_indices (`list[int]`, *optional*):
+            If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc.
+            (depending on how many stages the model has). Will default to the last stage if unset.
+        apply_layernorm (`bool`, *optional*, defaults to `True`):
+            Whether to apply layer normalization to the feature maps when used as backbone.
+        reshape_hidden_states (`bool`, *optional*, defaults to `True`):
+            Whether to reshape the hidden states to spatial dimensions when used as backbone.
 
     Example:
 

@@ -131,6 +142,10 @@ class DINOv3ViTConfig(PreTrainedConfig):
         pos_embed_shift: Optional[float] = None,
         pos_embed_jitter: Optional[float] = None,
         pos_embed_rescale: Optional[float] = 2.0,
+        out_features: Optional[list[str]] = None,
+        out_indices: Optional[list[int]] = None,
+        apply_layernorm: bool = True,
+        reshape_hidden_states: bool = True,
         **kwargs,
     ):
         super().__init__(**kwargs)

@@ -161,6 +176,18 @@ class DINOv3ViTConfig(PreTrainedConfig):
         self.pos_embed_shift = pos_embed_shift
         self.pos_embed_jitter = pos_embed_jitter
         self.pos_embed_rescale = pos_embed_rescale
+        # Initialize backbone-specific configuration
+        self.apply_layernorm = apply_layernorm
+        self.reshape_hidden_states = reshape_hidden_states
+
+        # Initialize backbone stage names
+        stage_names = ["stem"] + [f"stage{i}" for i in range(1, num_hidden_layers + 1)]
+        self.stage_names = stage_names
+
+        # Initialize backbone features/indices
+        self._out_features, self._out_indices = get_aligned_output_features_output_indices(
+            out_features=out_features, out_indices=out_indices, stage_names=stage_names
+        )
 
 
 __all__ = ["DINOv3ViTConfig"]
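With `BackboneConfigMixin` in place, the config now describes which stages the backbone exposes. A short sketch of setting the new fields by hand (top-level import and at least 12 hidden layers are assumptions):

```python
from transformers import DINOv3ViTConfig  # top-level import assumed

config = DINOv3ViTConfig(out_indices=[4, 8, 12])  # new backbone fields from this hunk
print(config.stage_names[:3])  # ['stem', 'stage1', 'stage2']
print(config.out_features)     # e.g. ['stage4', 'stage8', 'stage12']
```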
@@ -29,11 +29,12 @@ from torch import nn
 
 from ...activations import ACT2FN
 from ...modeling_layers import GradientCheckpointingLayer
-from ...modeling_outputs import BaseModelOutputWithPooling
+from ...modeling_outputs import BackboneOutput, BaseModelOutputWithPooling
 from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
 from ...processing_utils import Unpack
 from ...pytorch_utils import compile_compatible_method_lru_cache
-from ...utils import TransformersKwargs, auto_docstring
+from ...utils import TransformersKwargs, auto_docstring, can_return_tuple
+from ...utils.backbone_utils import BackboneMixin
 from ...utils.generic import check_model_inputs
 from .configuration_dinov3_vit import DINOv3ViTConfig

@@ -522,10 +523,79 @@ class DINOv3ViTModel(DINOv3ViTPreTrainedModel):
         sequence_output = self.norm(hidden_states)
         pooled_output = sequence_output[:, 0, :]
 
-        return BaseModelOutputWithPooling(
-            last_hidden_state=sequence_output,
-            pooler_output=pooled_output,
-        )
+        return BaseModelOutputWithPooling(last_hidden_state=sequence_output, pooler_output=pooled_output)
 
 
-__all__ = ["DINOv3ViTModel", "DINOv3ViTPreTrainedModel"]
+@auto_docstring
+class DINOv3ViTBackbone(DINOv3ViTPreTrainedModel, BackboneMixin):
+    def __init__(self, config):
+        super().__init__(config)
+        super()._init_backbone(config)
+
+        self.embeddings = DINOv3ViTEmbeddings(config)
+        self.rope_embeddings = DINOv3ViTRopePositionEmbedding(config)
+        self.layer = nn.ModuleList([DINOv3ViTLayer(config) for _ in range(config.num_hidden_layers)])
+        self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.gradient_checkpointing = False
+
+        self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)]
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.embeddings.patch_embeddings
+
+    @check_model_inputs()
+    @can_return_tuple
+    def forward(
+        self,
+        pixel_values: torch.Tensor,
+        output_hidden_states: Optional[bool] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> BackboneOutput:
+        pixel_values = pixel_values.to(self.embeddings.patch_embeddings.weight.dtype)
+        hidden_states = self.embeddings(pixel_values)
+        position_embeddings = self.rope_embeddings(pixel_values)
+
+        stage_hidden_states: list[torch.Tensor] = [hidden_states]
+
+        for layer_module in self.layer:
+            hidden_states = layer_module(hidden_states, position_embeddings=position_embeddings)
+            stage_hidden_states.append(hidden_states)
+
+        batch_size, _, image_height, image_width = pixel_values.shape
+        patch_size = self.config.patch_size
+        num_patches_height = image_height // patch_size
+        num_patches_width = image_width // patch_size
+
+        num_prefix = 1 + getattr(self.config, "num_register_tokens", 0)
+
+        feature_maps = []
+        sequence_output = None
+        last_stage_idx = len(self.stage_names) - 1
+        for idx, (stage_name, hidden_state) in enumerate(zip(self.stage_names, stage_hidden_states)):
+            if idx == last_stage_idx:
+                hidden_state = self.norm(hidden_state)
+                sequence_output = hidden_state
+            elif self.config.apply_layernorm:
+                hidden_state = self.norm(hidden_state)
+
+            if stage_name in self.out_features:
+                patch_tokens = hidden_state[:, num_prefix:, :]
+                if self.config.reshape_hidden_states:
+                    fmap = (
+                        patch_tokens.reshape(batch_size, num_patches_height, num_patches_width, patch_tokens.shape[-1])
+                        .permute(0, 3, 1, 2)
+                        .contiguous()
+                    )
+                else:
+                    fmap = patch_tokens
+
+                feature_maps.append(fmap)
+
+        output = BackboneOutput(feature_maps=tuple(feature_maps))
+        output.last_hidden_state = sequence_output
+
+        return output
+
+
+__all__ = ["DINOv3ViTModel", "DINOv3ViTPreTrainedModel", "DINOv3ViTBackbone"]
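The new `DINOv3ViTBackbone` returns a `BackboneOutput` whose `feature_maps` are reshaped to `(batch, hidden_size, height // patch_size, width // patch_size)` when `reshape_hidden_states=True`. A sketch of exercising it with a randomly initialized model, for shape checking only (module path inferred from the file names in the diff):

```python
import torch
from transformers import DINOv3ViTConfig
from transformers.models.dinov3_vit.modeling_dinov3_vit import DINOv3ViTBackbone  # path assumed

config = DINOv3ViTConfig()            # out_features defaults to the last stage
backbone = DINOv3ViTBackbone(config)  # random weights, shape check only

size = getattr(config, "image_size", 224)
pixel_values = torch.rand(1, 3, size, size)
with torch.no_grad():
    outputs = backbone(pixel_values)

# one feature map per entry in config.out_features
print([tuple(fm.shape) for fm in outputs.feature_maps])
```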
@@ -33,11 +33,12 @@ from transformers.models.llama.modeling_llama import LlamaMLP
 from transformers.models.pixtral.modeling_pixtral import PixtralAttention, rotate_half
 
 from ...modeling_layers import GradientCheckpointingLayer
-from ...modeling_outputs import BaseModelOutputWithPooling
+from ...modeling_outputs import BackboneOutput, BaseModelOutputWithPooling
 from ...modeling_utils import ALL_ATTENTION_FUNCTIONS
 from ...processing_utils import Unpack
 from ...pytorch_utils import compile_compatible_method_lru_cache
-from ...utils import TransformersKwargs, auto_docstring, logging
+from ...utils import TransformersKwargs, auto_docstring, can_return_tuple, logging
+from ...utils.backbone_utils import BackboneMixin
 from ...utils.generic import check_model_inputs
 from .configuration_dinov3_vit import DINOv3ViTConfig

@@ -417,10 +418,79 @@ class DINOv3ViTModel(DINOv3ViTPreTrainedModel):
         sequence_output = self.norm(hidden_states)
         pooled_output = sequence_output[:, 0, :]
 
-        return BaseModelOutputWithPooling(
-            last_hidden_state=sequence_output,
-            pooler_output=pooled_output,
-        )
+        return BaseModelOutputWithPooling(last_hidden_state=sequence_output, pooler_output=pooled_output)
 
 
-__all__ = ["DINOv3ViTModel", "DINOv3ViTPreTrainedModel"]
+@auto_docstring
+class DINOv3ViTBackbone(DINOv3ViTPreTrainedModel, BackboneMixin):
+    def __init__(self, config):
+        super().__init__(config)
+        super()._init_backbone(config)
+
+        self.embeddings = DINOv3ViTEmbeddings(config)
+        self.rope_embeddings = DINOv3ViTRopePositionEmbedding(config)
+        self.layer = nn.ModuleList([DINOv3ViTLayer(config) for _ in range(config.num_hidden_layers)])
+        self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.gradient_checkpointing = False
+
+        self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)]
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.embeddings.patch_embeddings
+
+    @check_model_inputs()
+    @can_return_tuple
+    def forward(
+        self,
+        pixel_values: torch.Tensor,
+        output_hidden_states: Optional[bool] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> BackboneOutput:
+        pixel_values = pixel_values.to(self.embeddings.patch_embeddings.weight.dtype)
+        hidden_states = self.embeddings(pixel_values)
+        position_embeddings = self.rope_embeddings(pixel_values)
+
+        stage_hidden_states: list[torch.Tensor] = [hidden_states]
+
+        for layer_module in self.layer:
+            hidden_states = layer_module(hidden_states, position_embeddings=position_embeddings)
+            stage_hidden_states.append(hidden_states)
+
+        batch_size, _, image_height, image_width = pixel_values.shape
+        patch_size = self.config.patch_size
+        num_patches_height = image_height // patch_size
+        num_patches_width = image_width // patch_size
+
+        num_prefix = 1 + getattr(self.config, "num_register_tokens", 0)
+
+        feature_maps = []
+        sequence_output = None
+        last_stage_idx = len(self.stage_names) - 1
+        for idx, (stage_name, hidden_state) in enumerate(zip(self.stage_names, stage_hidden_states)):
+            if idx == last_stage_idx:
+                hidden_state = self.norm(hidden_state)
+                sequence_output = hidden_state
+            elif self.config.apply_layernorm:
+                hidden_state = self.norm(hidden_state)
+
+            if stage_name in self.out_features:
+                patch_tokens = hidden_state[:, num_prefix:, :]
+                if self.config.reshape_hidden_states:
+                    fmap = (
+                        patch_tokens.reshape(batch_size, num_patches_height, num_patches_width, patch_tokens.shape[-1])
+                        .permute(0, 3, 1, 2)
+                        .contiguous()
+                    )
+                else:
+                    fmap = patch_tokens
+
+                feature_maps.append(fmap)
+
+        output = BackboneOutput(feature_maps=tuple(feature_maps))
+        output.last_hidden_state = sequence_output
+
+        return output
+
+
+__all__ = ["DINOv3ViTModel", "DINOv3ViTPreTrainedModel", "DINOv3ViTBackbone"]
@@ -227,7 +227,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--push_to_hub",
         action="store_true",
-        help="Whether or not to push the converted model and processor to the 🤗 hub.",
+        help="Whether or not to push the converted model and processor to the Hugging Face hub.",
     )
 
     args = parser.parse_args()

@@ -352,7 +352,7 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
     )
     parser.add_argument(
-        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
     )
 
     args = parser.parse_args()

@@ -316,7 +316,7 @@ def convert_esm_checkpoint_to_pytorch(
     hf_tokens = hf_tokenizer([row[1] for row in sample_data], return_tensors="pt", padding=True)
     success = torch.all(hf_tokens["input_ids"] == batch_tokens)
 
-    print("Do both models tokenizers output the same tokens?", "🔥" if success else "💩")
+    print("Do both models tokenizers output the same tokens?", "[PASS]" if success else "[FAIL]")
     if not success:
         raise Exception("Tokenization does not match!")

@@ -348,7 +348,7 @@ def convert_esm_checkpoint_to_pytorch(
     success = torch.allclose(our_output, their_output, atol=1e-5)
 
     print(f"max_absolute_diff = {max_absolute_diff}")  # ~ 1e-5
-    print("Do both models output the same tensors?", "🔥" if success else "💩")
+    print("Do both models output the same tensors?", "[PASS]" if success else "[FAIL]")
 
     if not success:
         raise Exception("Something went wRoNg")

@@ -362,7 +362,7 @@ def convert_esm_checkpoint_to_pytorch(
 
     print("Contact prediction testing:")
     print(f"max_absolute_diff = {max_absolute_diff}")  # ~ 1e-5
-    print("Do both models output the same tensors?", "🔥" if success else "💩")
+    print("Do both models output the same tensors?", "[PASS]" if success else "[FAIL]")
 
     if not success:
         raise Exception("Something went wRoNg")

@@ -198,7 +198,7 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
     )
     parser.add_argument(
-        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
     )
 
     args = parser.parse_args()

@@ -122,7 +122,7 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
     )
     parser.add_argument(
-        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
    )
 
     args = parser.parse_args()

@@ -89,7 +89,7 @@ if __name__ == "__main__":
         help="Path to the output `FastSpeech2ConformerModel` PyTorch model.",
     )
     parser.add_argument(
-        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
     )
 
     args = parser.parse_args()

@@ -250,7 +250,9 @@ def main():
     )
 
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     args = parser.parse_args()
     write_tokenizer(
@@ -26,6 +26,7 @@ from ...cache_utils import Cache, DynamicCache, EncoderDecoderCache
 from ...generation import GenerationMixin
 from ...masking_utils import create_causal_mask
 from ...modeling_flash_attention_utils import is_flash_attn_available
+from ...modeling_layers import GradientCheckpointingLayer
 from ...modeling_outputs import (
     BaseModelOutputWithPastAndCrossAttentions,
     CausalLMOutputWithCrossAttentions,

@@ -266,7 +267,7 @@ class GPTBigCodeMLP(nn.Module):
         return hidden_states
 
 
-class GPTBigCodeBlock(nn.Module):
+class GPTBigCodeBlock(GradientCheckpointingLayer):
     def __init__(self, config, layer_idx=None):
         super().__init__()
         hidden_size = config.hidden_size

@@ -291,9 +292,9 @@ class GPTBigCodeBlock(nn.Module):
     def forward(
         self,
         hidden_states: Optional[tuple[torch.Tensor]],
+        encoder_hidden_states: Optional[torch.Tensor] = None,
         layer_past: Optional[Cache] = None,
         attention_mask: Optional[torch.Tensor] = None,
-        encoder_hidden_states: Optional[torch.Tensor] = None,
         encoder_attention_mask: Optional[torch.Tensor] = None,
         use_cache: Optional[bool] = False,
         output_attentions: Optional[bool] = False,

@@ -536,10 +537,10 @@ class GPTBigCodeModel(GPTBigCodePreTrainedModel):
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
             outputs = block(
-                hidden_states,
-                past_key_values,
-                causal_mask,
+                hidden_states,  # as a positional argument for gradient checkpointing
+                encoder_hidden_states,  # as a positional argument for gradient checkpointing
+                layer_past=past_key_values,  # as keyword argument so it can be removed by GradientCheckpointingLayer
+                attention_mask=causal_mask,
                 encoder_attention_mask=encoder_attention_mask,
                 use_cache=use_cache,
                 output_attentions=output_attentions,
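Deriving `GPTBigCodeBlock` from `GradientCheckpointingLayer` is what allows keyword arguments such as `layer_past` to be dropped when the block is re-run under activation checkpointing. From the user side nothing changes except that the usual switch now works through these blocks; a minimal sketch (the checkpoint id is only an example):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigcode/gpt_bigcode-santacoder")
model.gradient_checkpointing_enable()  # blocks recompute activations during backward to save memory
model.train()
```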
@@ -481,7 +481,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     parser.add_argument(
         "--verify_logits", action="store_false", help="Whether or not to verify logits after conversion."

@@ -210,7 +210,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--push_to_hub",
         action="store_true",
-        help="Whether or not to push the converted model and processor to the 🤗 hub using the provided `model_name`.",
+        help="Whether or not to push the converted model and processor to the Hugging Face hub using the provided `model_name`.",
     )
     args = parser.parse_args()

@@ -353,7 +353,9 @@ if __name__ == "__main__":
         help="Whether or not to verify the logits against the original implementation.",
     )
     parser.add_argument(
-        "--push-to-hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push-to-hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     parser.add_argument(
         "--base-model",

@@ -253,7 +253,7 @@ def main():
     parser.add_argument(
         "--push_to_hub",
         action="store_true",
-        help="Whether or not to push the model to the 🤗 Hub.",
+        help="Whether or not to push the model to the Hugging Face Hub.",
     )
     parser.add_argument(
         "--verify_logits", action="store_false", help="Whether or not to verify logits after conversion."

@@ -433,7 +433,9 @@ def main():
     )
 
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     args = parser.parse_args()
     write_tokenizer(

@@ -387,7 +387,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", type=str, required=True, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     args = parser.parse_args()

@@ -267,7 +267,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     args = parser.parse_args()

@@ -379,7 +379,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", type=str, required=True, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
     args = parser.parse_args()
@@ -826,7 +826,7 @@ class OriginalMask2FormerCheckpointToOursConverter:
         checkpoints: list[Path] = checkpoints_dir.glob("**/*.pkl")
 
         for checkpoint in checkpoints:
-            logger.info(f"💪 Converting {checkpoint.stem}")
+            logger.info(f"Converting {checkpoint.stem}")
             # find associated config file
 
             # dataset_name e.g 'coco'

@@ -902,7 +902,7 @@ def test(
         "The predicted masks are not the same."
     )
 
-    logger.info("✅ Test passed!")
+    logger.info("Test passed!")
 
 
 def get_model_name(checkpoint_file: Path):

@@ -1012,9 +1012,9 @@ if __name__ == "__main__":
         if model_name in high_tolerance_models:
             tolerance = 3e-1
 
-        logger.info(f"🪄 Testing {model_name}...")
+        logger.info(f"Testing {model_name}...")
         test(original_model, mask2former_for_segmentation, image_processor, tolerance)
-        logger.info(f"🪄 Pushing {model_name} to hub...")
+        logger.info(f"Pushing {model_name} to hub...")
 
         image_processor.push_to_hub(model_name)
         mask2former_for_segmentation.push_to_hub(model_name)
@@ -109,9 +109,6 @@ class Mask2FormerImageProcessorFast(BaseImageProcessorFast):
     valid_kwargs = Mask2FormerImageProcessorKwargs
 
     def __init__(self, **kwargs: Unpack[Mask2FormerImageProcessorKwargs]) -> None:
-        if "pad_and_return_pixel_mask" in kwargs:
-            kwargs["do_pad"] = kwargs.pop("pad_and_return_pixel_mask")
-
         size = kwargs.pop("size", None)
         max_size = kwargs.pop("max_size", None)

@@ -224,7 +221,7 @@ class Mask2FormerImageProcessorFast(BaseImageProcessorFast):
         padding = [0, 0, padding_right, padding_bottom]
         images = F.pad(images, padding, fill=fill)
         if segmentation_maps is not None:
-            segmentation_maps = F.pad(segmentation_maps, padding, fill=ignore_index)
+            segmentation_maps = [F.pad(mask, padding, fill=ignore_index) for mask in segmentation_maps]
 
         # Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding.
         pixel_mask = torch.zeros((images.shape[0], *padded_size), dtype=torch.int64, device=images.device)

@@ -318,9 +315,11 @@ class Mask2FormerImageProcessorFast(BaseImageProcessorFast):
                 stacked_images = self.resize(
                     image=stacked_images, size=size, size_divisor=size_divisor, interpolation=interpolation
                 )
-                if segmentation_maps is not None:
+            if segmentation_maps is not None:
+                stacked_segmentation_maps = grouped_segmentation_maps[shape]
+                if do_resize:
                     stacked_segmentation_maps = self.resize(
-                        image=grouped_segmentation_maps[shape],
+                        image=stacked_segmentation_maps,
                         size=size,
                         size_divisor=size_divisor,
                         interpolation=F.InterpolationMode.NEAREST_EXACT,

@@ -357,14 +356,18 @@ class Mask2FormerImageProcessorFast(BaseImageProcessorFast):
             mask_labels.append(masks)
             class_labels.append(classes)
 
-        grouped_images, grouped_images_index = group_images_by_shape(resized_images, disable_grouping=disable_grouping)
-        processed_images_grouped = {}
-        processed_pixel_masks_grouped = {}
         if segmentation_maps is not None:
-            grouped_segmentation_maps, grouped_segmentation_maps_index = group_images_by_shape(
-                mask_labels, disable_grouping=disable_grouping
+            # group mask_labels as paired inputs and not images so as not to stack them
+            grouped_images, grouped_segmentation_maps, grouped_images_index = group_images_by_shape(
+                resized_images, mask_labels, disable_grouping=disable_grouping
             )
             processed_segmentation_maps_grouped = {}
+        else:
+            grouped_images, grouped_images_index = group_images_by_shape(
+                resized_images, disable_grouping=disable_grouping
+            )
+        processed_images_grouped = {}
+        processed_pixel_masks_grouped = {}
         for shape, stacked_images in grouped_images.items():
             # Fused rescale and normalize
             stacked_images = self.rescale_and_normalize(

@@ -379,7 +382,8 @@ class Mask2FormerImageProcessorFast(BaseImageProcessorFast):
             processed_images_grouped[shape] = padded_images
             processed_pixel_masks_grouped[shape] = pixel_masks
             if segmentation_maps is not None:
-                processed_segmentation_maps_grouped[shape] = padded_segmentation_maps.squeeze(1)
+                processed_segmentation_maps_grouped[shape] = padded_segmentation_maps
 
         processed_images = reorder_images(processed_images_grouped, grouped_images_index)
         processed_pixel_masks = reorder_images(processed_pixel_masks_grouped, grouped_images_index)
         encoded_inputs = BatchFeature(

@@ -390,7 +394,7 @@ class Mask2FormerImageProcessorFast(BaseImageProcessorFast):
             tensor_type=return_tensors,
         )
         if segmentation_maps is not None:
-            mask_labels = reorder_images(processed_segmentation_maps_grouped, grouped_segmentation_maps_index)
+            mask_labels = reorder_images(processed_segmentation_maps_grouped, grouped_images_index)
             # we cannot batch them since they don't share a common class size
             encoded_inputs["mask_labels"] = mask_labels
             encoded_inputs["class_labels"] = class_labels
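The net effect of these processor changes is that segmentation maps travel through `group_images_by_shape` as paired inputs, staying aligned with their images without being stacked. A hedged usage sketch (the checkpoint id is an example; any Mask2Former checkpoint with a fast image processor should behave similarly):

```python
import numpy as np
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained(
    "facebook/mask2former-swin-tiny-ade-semantic", use_fast=True
)

image = np.random.randint(0, 255, (512, 683, 3), dtype=np.uint8)
seg_map = np.random.randint(0, 10, (512, 683), dtype=np.uint8)

inputs = processor(images=[image], segmentation_maps=[seg_map], return_tensors="pt")
print(inputs["pixel_values"].shape, len(inputs["mask_labels"]), len(inputs["class_labels"]))
```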
@@ -548,7 +548,7 @@ class OriginalMaskFormerCheckpointToOursConverter:
         checkpoints: list[Path] = checkpoints_dir.glob("**/*.pkl")
 
         for checkpoint in checkpoints:
-            logger.info(f"💪 Converting {checkpoint.stem}")
+            logger.info(f"Converting {checkpoint.stem}")
             # find associated config file
             config: Path = config_dir / checkpoint.parents[0].stem / "swin" / f"{checkpoint.stem}.yaml"

@@ -607,7 +607,7 @@ def test(original_model, our_model: MaskFormerForInstanceSegmentation, image_pro
         "The segmentation image is not the same."
     )
 
-    logger.info("✅ Test passed!")
+    logger.info("Test passed!")
 
 
 def get_name(checkpoint_file: Path):

@@ -715,7 +715,7 @@ if __name__ == "__main__":
         test(original_model, mask_former_for_instance_segmentation, image_processor)
 
         model_name = get_name(checkpoint_file)
-        logger.info(f"🪄 Saving {model_name}")
+        logger.info(f"Saving {model_name}")
 
         image_processor.save_pretrained(save_directory / model_name)
         mask_former_for_instance_segmentation.save_pretrained(save_directory / model_name)
@@ -391,7 +391,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
 
     args = parser.parse_args()

@@ -334,7 +334,9 @@ if __name__ == "__main__":
         "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
     )
     parser.add_argument(
-        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+        "--push_to_hub",
+        action="store_true",
+        help="Whether or not to push the converted model to the Hugging Face hub.",
     )
 
     args = parser.parse_args()
@@ -114,9 +114,6 @@ class MaskFormerImageProcessorFast(BaseImageProcessorFast):
     valid_kwargs = MaskFormerImageProcessorKwargs
 
     def __init__(self, **kwargs: Unpack[MaskFormerImageProcessorKwargs]) -> None:
-        if "pad_and_return_pixel_mask" in kwargs:
-            kwargs["do_pad"] = kwargs.pop("pad_and_return_pixel_mask")
-
         size = kwargs.pop("size", None)
         max_size = kwargs.pop("max_size", None)

@@ -229,7 +226,7 @@ class MaskFormerImageProcessorFast(BaseImageProcessorFast):
         padding = [0, 0, padding_right, padding_bottom]
         images = F.pad(images, padding, fill=fill)
         if segmentation_maps is not None:
-            segmentation_maps = F.pad(segmentation_maps, padding, fill=ignore_index)
+            segmentation_maps = [F.pad(mask, padding, fill=ignore_index) for mask in segmentation_maps]
 
         # Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding.
         pixel_mask = torch.zeros((images.shape[0], *padded_size), dtype=torch.int64, device=images.device)

@@ -323,9 +320,11 @@ class MaskFormerImageProcessorFast(BaseImageProcessorFast):
                 stacked_images = self.resize(
                     image=stacked_images, size=size, size_divisor=size_divisor, interpolation=interpolation
                 )
-                if segmentation_maps is not None:
+            if segmentation_maps is not None:
+                stacked_segmentation_maps = grouped_segmentation_maps[shape]
+                if do_resize:
                     stacked_segmentation_maps = self.resize(
-                        image=grouped_segmentation_maps[shape],
+                        image=stacked_segmentation_maps,
                         size=size,
                         size_divisor=size_divisor,
                         interpolation=F.InterpolationMode.NEAREST_EXACT,

@@ -362,14 +361,18 @@ class MaskFormerImageProcessorFast(BaseImageProcessorFast):
             mask_labels.append(masks)
             class_labels.append(classes)
 
-        grouped_images, grouped_images_index = group_images_by_shape(resized_images, disable_grouping=disable_grouping)
-        processed_images_grouped = {}
-        processed_pixel_masks_grouped = {}
        if segmentation_maps is not None:
-            grouped_segmentation_maps, grouped_segmentation_maps_index = group_images_by_shape(
-                mask_labels, disable_grouping=disable_grouping
+            # group mask_labels as paired inputs and not images so as not to stack them
+            grouped_images, grouped_segmentation_maps, grouped_images_index = group_images_by_shape(
+                resized_images, mask_labels, disable_grouping=disable_grouping
            )
            processed_segmentation_maps_grouped = {}
+        else:
+            grouped_images, grouped_images_index = group_images_by_shape(
+                resized_images, disable_grouping=disable_grouping
+            )
+        processed_images_grouped = {}
+        processed_pixel_masks_grouped = {}
        for shape, stacked_images in grouped_images.items():
            # Fused rescale and normalize
            stacked_images = self.rescale_and_normalize(

@@ -384,7 +387,8 @@ class MaskFormerImageProcessorFast(BaseImageProcessorFast):
             processed_images_grouped[shape] = padded_images
             processed_pixel_masks_grouped[shape] = pixel_masks
             if segmentation_maps is not None:
-                processed_segmentation_maps_grouped[shape] = padded_segmentation_maps.squeeze(1)
+                processed_segmentation_maps_grouped[shape] = padded_segmentation_maps
 
         processed_images = reorder_images(processed_images_grouped, grouped_images_index)
         processed_pixel_masks = reorder_images(processed_pixel_masks_grouped, grouped_images_index)
         encoded_inputs = BatchFeature(

@@ -395,7 +399,7 @@ class MaskFormerImageProcessorFast(BaseImageProcessorFast):
             tensor_type=return_tensors,
         )
         if segmentation_maps is not None:
-            mask_labels = reorder_images(processed_segmentation_maps_grouped, grouped_segmentation_maps_index)
+            mask_labels = reorder_images(processed_segmentation_maps_grouped, grouped_images_index)
             # we cannot batch them since they don't share a common class size
             encoded_inputs["mask_labels"] = mask_labels
             encoded_inputs["class_labels"] = class_labels
@@ -349,10 +349,10 @@ def verify_conversion(
 
     # Check if they're close
     if orig_logits.shape == hf_logits.shape and torch.allclose(orig_logits, hf_logits, atol=1e-4):
-        print("✅ Conversion verified! Outputs match.")
+        print("[SUCCESS] Conversion verified! Outputs match.")
         return True
     else:
-        print("❌ Conversion failed! Outputs don't match.")
+        print("[FAIL] Conversion failed! Outputs don't match.")
         if orig_logits.numel() > 0 and hf_logits.numel() > 0:
             print(f"Max difference: {(orig_logits - hf_logits).abs().max()}")
         return False

@@ -365,9 +365,9 @@ def push_to_hub(hf_model: MetaClip2Model, processor: CLIPProcessor, repo_name: s
     try:
         hf_model.push_to_hub(repo_name)
         processor.push_to_hub(repo_name)
-        print(f"✅ Successfully pushed to {repo_name}")
+        print(f"[SUCCESS] Successfully pushed to {repo_name}")
     except Exception as e:
-        print(f"❌ Failed to push to hub: {e}")
+        print(f"[FAIL] Failed to push to hub: {e}")
 
 
 def main():
@ -186,7 +186,7 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
)

args = parser.parse_args()

@ -327,7 +327,9 @@ if __name__ == "__main__":
help="Whether to verify hidden_state against the original implementation.",
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -234,7 +234,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", required=True, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -336,7 +336,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", required=True, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -302,7 +302,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", required=True, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -268,7 +268,7 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
)

args = parser.parse_args()
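These conversion-script hunks all apply the same two edits: the 🤗 emoji in the `--push_to_hub` help text becomes "Hugging Face", and the call is reflowed to one keyword per line where it no longer fits on one. A minimal standalone example of the resulting style, using only the standard `argparse` module (hypothetical toy script, not any specific converter):

```python
import argparse

parser = argparse.ArgumentParser(description="Toy checkpoint conversion script.")
parser.add_argument(
    "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
    "--push_to_hub",
    action="store_true",
    help="Whether or not to push the converted model to the Hugging Face hub.",
)
args = parser.parse_args()
print(args.push_to_hub)
```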
@ -25,7 +25,7 @@ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from ...activations import ACT2FN
from ...cache_utils import Cache, DynamicCache, EncoderDecoderCache
from ...generation import GenerationMixin
from ...modeling_attn_mask_utils import AttentionMaskConverter
from ...masking_utils import create_bidirectional_mask, create_causal_mask
from ...modeling_layers import GradientCheckpointingLayer
from ...modeling_outputs import (
BaseModelOutput,
@ -41,18 +41,11 @@ from ...utils import (
DUMMY_INPUTS,
DUMMY_MASK,
auto_docstring,
is_torch_flex_attn_available,
is_torchdynamo_compiling,
logging,
)
from .configuration_mt5 import MT5Config


if is_torch_flex_attn_available():
from torch.nn.attention.flex_attention import BlockMask

from ...integrations.flex_attention import make_flex_block_causal_mask

logger = logging.get_logger(__name__)


@ -735,40 +728,31 @@ class MT5Stack(MT5PreTrainedModel):
past_key_values_length, past_key_values_length + seq_length, device=inputs_embeds.device
)

if attention_mask is None and not is_torchdynamo_compiling():
# required mask seq length can be calculated via length of past cache
mask_seq_length = past_key_values_length + seq_length
attention_mask = torch.ones(batch_size, mask_seq_length, device=inputs_embeds.device)

if self.config.is_decoder:
causal_mask = self._update_causal_mask(
attention_mask,
inputs_embeds,
cache_position,
past_key_values.self_attention_cache
attention_mask = create_causal_mask(
config=self.config,
input_embeds=inputs_embeds,
attention_mask=attention_mask,
cache_position=cache_position,
past_key_values=past_key_values.self_attention_cache
if isinstance(past_key_values, EncoderDecoderCache)
else past_key_values,
output_attentions,
)
elif attention_mask is not None:
causal_mask = attention_mask[:, None, None, :]
causal_mask = causal_mask.to(dtype=inputs_embeds.dtype)
causal_mask = (1.0 - causal_mask) * torch.finfo(inputs_embeds.dtype).min
else:
causal_mask = None
attention_mask = create_bidirectional_mask(
config=self.config,
input_embeds=inputs_embeds,
attention_mask=attention_mask,
)

# If a 2D or 3D attention mask is provided for the cross-attention
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
encoder_extended_attention_mask = None
if self.is_decoder and encoder_hidden_states is not None:
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
if encoder_attention_mask is None:
encoder_attention_mask = torch.ones(
encoder_hidden_shape, device=inputs_embeds.device, dtype=torch.long
)
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
else:
encoder_extended_attention_mask = None
encoder_extended_attention_mask = create_bidirectional_mask(
config=self.config,
input_embeds=inputs_embeds,
attention_mask=encoder_attention_mask,
encoder_hidden_states=encoder_hidden_states,
)

all_hidden_states = () if output_hidden_states else None
all_attentions = () if output_attentions else None
@ -778,13 +762,13 @@ class MT5Stack(MT5PreTrainedModel):

hidden_states = self.dropout(inputs_embeds)

for i, layer_module in enumerate(self.block):
for layer_module in self.block:
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)

layer_outputs = layer_module(
hidden_states,
causal_mask,
attention_mask,
position_bias,
encoder_hidden_states,
encoder_extended_attention_mask,
@ -837,131 +821,6 @@ class MT5Stack(MT5PreTrainedModel):
cross_attentions=all_cross_attentions,
)
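The hunks above swap MT5Stack's hand-rolled `_update_causal_mask` path for the shared `create_causal_mask` / `create_bidirectional_mask` helpers and drop the explicit `torch.ones` fallback mask. Conceptually, the causal helper turns a 2D padding mask into a 4D additive mask; a rough standalone approximation of that output (plain PyTorch, hypothetical function name, ignoring the cache handling and alternative attention backends the real helpers also cover):

```python
import torch

def additive_causal_mask(padding_mask: torch.Tensor, dtype=torch.float32) -> torch.Tensor:
    """Build a (batch, 1, seq, seq) additive mask: 0 where attention is allowed,
    dtype-min where it is disallowed (future positions or padded key positions)."""
    batch, seq = padding_mask.shape
    min_value = torch.finfo(dtype).min
    # Upper triangle above the diagonal blocks attention to future tokens.
    causal = torch.triu(torch.full((seq, seq), min_value, dtype=dtype), diagonal=1)
    mask = causal[None, None, :, :].expand(batch, 1, seq, seq).clone()
    # Also block attention to padded key positions.
    mask = mask.masked_fill(padding_mask[:, None, None, :] == 0, min_value)
    return mask

padding = torch.tensor([[1, 1, 1, 0]])  # one padded position at the end
print(additive_causal_mask(padding)[0, 0])
```

The encoder/cross-attention side uses the bidirectional variant instead, which only masks padded positions and never the future.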
# Copied from transformers.models.gptj.modeling_gptj.GPTJModel._update_causal_mask
|
||||
def _update_causal_mask(
|
||||
self,
|
||||
attention_mask: Union[torch.Tensor, "BlockMask"],
|
||||
input_tensor: torch.Tensor,
|
||||
cache_position: torch.Tensor,
|
||||
past_key_values: Cache,
|
||||
output_attentions: bool = False,
|
||||
):
|
||||
if self.config._attn_implementation == "flash_attention_2":
|
||||
if attention_mask is not None and (attention_mask == 0.0).any():
|
||||
return attention_mask
|
||||
return None
|
||||
if self.config._attn_implementation == "flex_attention":
|
||||
if isinstance(attention_mask, torch.Tensor):
|
||||
attention_mask = make_flex_block_causal_mask(attention_mask)
|
||||
return attention_mask
|
||||
|
||||
# For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
|
||||
# order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
|
||||
# to infer the attention mask.
|
||||
past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
|
||||
using_compilable_cache = past_key_values.is_compileable if past_key_values is not None else False
|
||||
|
||||
# When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
|
||||
if self.config._attn_implementation == "sdpa" and not using_compilable_cache and not output_attentions:
|
||||
if AttentionMaskConverter._ignore_causal_mask_sdpa(
|
||||
attention_mask,
|
||||
inputs_embeds=input_tensor,
|
||||
past_key_values_length=past_seen_tokens,
|
||||
is_training=self.training,
|
||||
):
|
||||
return None
|
||||
|
||||
dtype = input_tensor.dtype
|
||||
sequence_length = input_tensor.shape[1]
|
||||
if using_compilable_cache:
|
||||
target_length = past_key_values.get_max_cache_shape()
|
||||
else:
|
||||
target_length = (
|
||||
attention_mask.shape[-1]
|
||||
if isinstance(attention_mask, torch.Tensor)
|
||||
else past_seen_tokens + sequence_length + 1
|
||||
)
|
||||
|
||||
# In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
|
||||
causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
|
||||
attention_mask,
|
||||
sequence_length=sequence_length,
|
||||
target_length=target_length,
|
||||
dtype=dtype,
|
||||
cache_position=cache_position,
|
||||
batch_size=input_tensor.shape[0],
|
||||
)
|
||||
|
||||
if (
|
||||
self.config._attn_implementation == "sdpa"
|
||||
and attention_mask is not None
|
||||
and attention_mask.device.type in ["cuda", "xpu", "npu"]
|
||||
and not output_attentions
|
||||
):
|
||||
# Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
|
||||
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
|
||||
# Details: https://github.com/pytorch/pytorch/issues/110213
|
||||
min_dtype = torch.finfo(dtype).min
|
||||
causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
|
||||
|
||||
return causal_mask
|
||||
|
||||
@staticmethod
|
||||
# Copied from transformers.models.gptj.modeling_gptj.GPTJModel._prepare_4d_causal_attention_mask_with_cache_position
|
||||
def _prepare_4d_causal_attention_mask_with_cache_position(
|
||||
attention_mask: torch.Tensor,
|
||||
sequence_length: int,
|
||||
target_length: int,
|
||||
dtype: torch.dtype,
|
||||
cache_position: torch.Tensor,
|
||||
batch_size: int,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
|
||||
`(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
|
||||
|
||||
Args:
|
||||
attention_mask (`torch.Tensor`):
|
||||
A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
|
||||
`(batch_size, 1, query_length, key_value_length)`.
|
||||
sequence_length (`int`):
|
||||
The sequence length being processed.
|
||||
target_length (`int`):
|
||||
The target length: when generating with static cache, the mask should be as long as the static cache,
|
||||
to account for the 0 padding, the part of the cache that is not filled yet.
|
||||
dtype (`torch.dtype`):
|
||||
The dtype to use for the 4D attention mask.
|
||||
cache_position (`torch.Tensor`):
|
||||
Indices depicting the position of the input sequence tokens in the sequence.
|
||||
batch_size (`torch.Tensor`):
|
||||
Batch size.
|
||||
"""
|
||||
if attention_mask is not None and attention_mask.dim() == 4:
|
||||
# In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
|
||||
causal_mask = attention_mask
|
||||
else:
|
||||
min_dtype = torch.finfo(dtype).min
|
||||
causal_mask = torch.full(
|
||||
(sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=cache_position.device
|
||||
)
|
||||
if sequence_length != 1:
|
||||
causal_mask = torch.triu(causal_mask, diagonal=1)
|
||||
causal_mask *= torch.arange(target_length, device=cache_position.device) > cache_position.reshape(-1, 1)
|
||||
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
|
||||
if attention_mask is not None:
|
||||
causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
|
||||
mask_length = attention_mask.shape[-1]
|
||||
padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
|
||||
causal_mask.device
|
||||
)
|
||||
padding_mask = padding_mask == 0
|
||||
causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
|
||||
padding_mask, min_dtype
|
||||
)
|
||||
|
||||
return causal_mask
|
||||
|
||||
|
||||
@auto_docstring
|
||||
class MT5Model(MT5PreTrainedModel):
|
||||
|
||||
@ -221,7 +221,7 @@ if __name__ == "__main__":
help="Path to the output PyTorch model directory.",
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
)
parser.add_argument(
"--device", default="cpu", type=str, help="Torch device to run the conversion, either cpu or cuda."

@ -254,7 +254,7 @@ if __name__ == "__main__":
"--push_to_hub",
default="musicgen-melody",
type=str,
help="Where to upload the converted model on the 🤗 hub.",
help="Where to upload the converted model on the Hugging Face hub.",
)
parser.add_argument(
"--device", default="cpu", type=str, help="Torch device to run the conversion, either cpu or cuda."

@ -275,7 +275,7 @@ if __name__ == "__main__":
parser.add_argument(
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model and processor to the 🤗 hub.",
help="Whether or not to push the converted model and processor to the Hugging Face hub.",
)

args = parser.parse_args()
@ -23,9 +23,7 @@ from typing import Optional, Union

import numpy as np

from transformers.tokenization_utils_base import INIT_TOKENIZER_DOCSTRING
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
from transformers.utils import add_end_docstrings

from ...utils import is_levenshtein_available, is_nltk_available, logging, requires_backends

@ -40,16 +38,6 @@ if is_nltk_available():
logger = logging.get_logger(__name__)


INIT_TOKENIZER_DOCSTRING += """
tokenizer_object ([`tokenizers.Tokenizer`]):
A [`tokenizers.Tokenizer`] object from 🤗 tokenizers to instantiate from. See [Using tokenizers from 🤗
tokenizers](../fast_tokenizers) for more information.
tokenizer_file ([`str`]):
A path to a local JSON file representing a previously serialized [`tokenizers.Tokenizer`] object from 🤗
tokenizers.
"""


VOCAB_FILES_NAMES = {"tokenizer_file": "tokenizer.json"}


@ -358,7 +346,6 @@ def remove_slice_from_lines(lines, clean_text, slice) -> str:
return to_delete.strip()


@add_end_docstrings(INIT_TOKENIZER_DOCSTRING)
class NougatTokenizerFast(PreTrainedTokenizerFast):
"""
Fast tokenizer for Nougat (backed by HuggingFace tokenizers library).
@ -339,7 +339,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)
parser.add_argument(
"--use_timm_backbone", action="store_true", help="Whether or not to use timm backbone for vision backbone."

@ -926,7 +926,7 @@ class OriginalOneFormerCheckpointToOursConverter:
checkpoints: list[Path] = checkpoints_dir.glob("**/*.pth")

for checkpoint in checkpoints:
logger.info(f"💪 Converting {checkpoint.stem}")
logger.info(f"Converting {checkpoint.stem}")
# find associated config file
config: Path = config_dir / f"{checkpoint.stem}.yaml"

@ -1054,7 +1054,7 @@ def test(
"The segmentation image is not the same."
)

logger.info("✅ Test passed!")
logger.info("Test passed!")


def get_name(checkpoint_file: Path):
@ -1175,7 +1175,7 @@ if __name__ == "__main__":
)

model_name = get_name(checkpoint_file)
logger.info(f"🪄 Saving {model_name}")
logger.info(f"Saving {model_name}")

processor.save_pretrained(save_directory / model_name)
oneformer_for_universal_segmentation.save_pretrained(save_directory / model_name)
@ -839,6 +839,7 @@ class Qwen2_5_VLProcessorKwargs(ProcessingKwargs, total=False):
"padding": False,
"return_mm_token_type_ids": False,
},
"videos_kwargs": {"return_metadata": True},
}


@ -922,10 +923,17 @@ class Qwen2_5_VLProcessor(Qwen2VLProcessor):
image_grid_thw = image_inputs["image_grid_thw"]

if videos is not None:
fps = output_kwargs["videos_kwargs"].get("fps", 2.0)
videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
video_grid_thw = videos_inputs["video_grid_thw"]

# Get video metadata
if not kwargs.get("return_metadata"):
video_metadata = videos_inputs.pop("video_metadata")
else:
video_metadata = videos_inputs["video_metadata"]

fps = [metadata.sampled_fps for metadata in video_metadata]

if isinstance(fps, (int, float)):
second_per_grid_ts = [self.video_processor.temporal_patch_size / fps] * len(video_grid_thw)
elif hasattr(fps, "__len__") and len(fps) == len(video_grid_thw):

@ -41,6 +41,7 @@ class Qwen2_5_VLProcessorKwargs(ProcessingKwargs, total=False):
"padding": False,
"return_mm_token_type_ids": False,
},
"videos_kwargs": {"return_metadata": True},
}


@ -129,10 +130,17 @@ class Qwen2_5_VLProcessor(ProcessorMixin):
image_grid_thw = image_inputs["image_grid_thw"]

if videos is not None:
fps = output_kwargs["videos_kwargs"].get("fps", 2.0)
videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
video_grid_thw = videos_inputs["video_grid_thw"]

# Get video metadata
if not kwargs.get("return_metadata"):
video_metadata = videos_inputs.pop("video_metadata")
else:
video_metadata = videos_inputs["video_metadata"]

fps = [metadata.sampled_fps for metadata in video_metadata]

if isinstance(fps, (int, float)):
second_per_grid_ts = [self.video_processor.temporal_patch_size / fps] * len(video_grid_thw)
elif hasattr(fps, "__len__") and len(fps) == len(video_grid_thw):
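In both copies of the processor, the change stops assuming a single `fps` kwarg and instead reads each video's `sampled_fps` from the returned metadata before computing `second_per_grid_ts`. A toy standalone version of that per-video computation (hypothetical values; `temporal_patch_size = 2` is an assumption standing in for the video processor's configured value):

```python
from dataclasses import dataclass

@dataclass
class VideoMetadata:
    sampled_fps: float  # frames per second actually kept after frame sampling

temporal_patch_size = 2  # assumed; the real value comes from the video processor config
video_metadata = [VideoMetadata(sampled_fps=2.0), VideoMetadata(sampled_fps=1.0)]

# One temporal grid step covers `temporal_patch_size` sampled frames, so its
# duration in seconds is patch_size / sampled_fps for that particular video.
fps = [m.sampled_fps for m in video_metadata]
second_per_grid_ts = [temporal_patch_size / f for f in fps]
print(second_per_grid_ts)  # [1.0, 2.0]
```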
@ -86,7 +86,7 @@ class RagTokenizer:
**kwargs,
) -> BatchEncoding:
warnings.warn(
"`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of 🤗 Transformers. Use the "
"`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of Hugging Face Transformers. Use the "
"regular `__call__` method to prepare your inputs and the tokenizer under the `with_target_tokenizer` "
"context manager to prepare your targets. See the documentation of your specific tokenizer for more "
"details",
@ -150,7 +150,7 @@ def convert_roberta_checkpoint_to_pytorch(
max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()
print(f"max_absolute_diff = {max_absolute_diff}")  # ~ 1e-7
success = torch.allclose(our_output, their_output, atol=1e-3)
print("Do both models output the same tensors?", "🔥" if success else "💩")
print("Do both models output the same tensors?", "[PASS]" if success else "[FAIL]")
if not success:
raise Exception("Something went wRoNg")


@ -214,7 +214,9 @@ if __name__ == "__main__":
help="Whether or not to verify the logits against the original implementation.",
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -525,7 +525,9 @@ if __name__ == "__main__":
help="Whether to verify logits against the original implementation.",
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -431,7 +431,9 @@ if __name__ == "__main__":
help="Whether to verify logits against the original implementation.",
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -95,7 +95,7 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
)

args = parser.parse_args()

@ -387,7 +387,7 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
)

args = parser.parse_args()
@ -21,6 +21,7 @@ import torch
from torch import nn

from ...activations import ACT2CLS
from ...modeling_layers import GradientCheckpointingLayer
from ...modeling_outputs import BaseModelOutputWithNoAttention, ImageClassifierOutputWithNoAttention
from ...modeling_utils import PreTrainedModel
from ...utils import auto_docstring, logging
@ -295,7 +296,7 @@ class SwiftFormerEncoderBlock(nn.Module):
return x


class SwiftFormerStage(nn.Module):
class SwiftFormerStage(GradientCheckpointingLayer):
"""
A Swiftformer stage consisting of a series of `SwiftFormerConvEncoder` blocks and a final
`SwiftFormerEncoderBlock`.

@ -175,7 +175,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()
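Making `SwiftFormerStage` a `GradientCheckpointingLayer` lets its forward pass be recomputed during backprop instead of keeping all intermediate activations in memory. A minimal standalone illustration of that trade-off with plain `torch.utils.checkpoint` (toy module, not SwiftFormer's actual code):

```python
import torch
from torch.utils.checkpoint import checkpoint

stage = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU(), torch.nn.Linear(16, 16))
x = torch.randn(4, 16, requires_grad=True)

# Activations inside `stage` are not stored; they are recomputed in backward,
# trading extra compute for lower peak memory.
y = checkpoint(stage, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)
```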
@ -25,7 +25,7 @@ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
|
||||
from ...activations import ACT2FN
|
||||
from ...cache_utils import Cache, DynamicCache, EncoderDecoderCache
|
||||
from ...generation import GenerationMixin
|
||||
from ...modeling_attn_mask_utils import AttentionMaskConverter
|
||||
from ...masking_utils import create_bidirectional_mask, create_causal_mask
|
||||
from ...modeling_layers import GradientCheckpointingLayer
|
||||
from ...modeling_outputs import (
|
||||
BaseModelOutput,
|
||||
@ -41,19 +41,11 @@ from ...utils import (
|
||||
DUMMY_INPUTS,
|
||||
DUMMY_MASK,
|
||||
auto_docstring,
|
||||
is_torch_flex_attn_available,
|
||||
is_torchdynamo_compiling,
|
||||
logging,
|
||||
)
|
||||
from .configuration_t5 import T5Config
|
||||
|
||||
|
||||
if is_torch_flex_attn_available():
|
||||
from torch.nn.attention.flex_attention import BlockMask
|
||||
|
||||
from ...integrations.flex_attention import make_flex_block_causal_mask
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
@ -738,40 +730,31 @@ class T5Stack(T5PreTrainedModel):
|
||||
past_key_values_length, past_key_values_length + seq_length, device=inputs_embeds.device
|
||||
)
|
||||
|
||||
if attention_mask is None and not is_torchdynamo_compiling():
|
||||
# required mask seq length can be calculated via length of past cache
|
||||
mask_seq_length = past_key_values_length + seq_length
|
||||
attention_mask = torch.ones(batch_size, mask_seq_length, device=inputs_embeds.device)
|
||||
|
||||
if self.config.is_decoder:
|
||||
causal_mask = self._update_causal_mask(
|
||||
attention_mask,
|
||||
inputs_embeds,
|
||||
cache_position,
|
||||
past_key_values.self_attention_cache
|
||||
attention_mask = create_causal_mask(
|
||||
config=self.config,
|
||||
input_embeds=inputs_embeds,
|
||||
attention_mask=attention_mask,
|
||||
cache_position=cache_position,
|
||||
past_key_values=past_key_values.self_attention_cache
|
||||
if isinstance(past_key_values, EncoderDecoderCache)
|
||||
else past_key_values,
|
||||
output_attentions,
|
||||
)
|
||||
elif attention_mask is not None:
|
||||
causal_mask = attention_mask[:, None, None, :]
|
||||
causal_mask = causal_mask.to(dtype=inputs_embeds.dtype)
|
||||
causal_mask = (1.0 - causal_mask) * torch.finfo(inputs_embeds.dtype).min
|
||||
else:
|
||||
causal_mask = None
|
||||
attention_mask = create_bidirectional_mask(
|
||||
config=self.config,
|
||||
input_embeds=inputs_embeds,
|
||||
attention_mask=attention_mask,
|
||||
)
|
||||
|
||||
# If a 2D or 3D attention mask is provided for the cross-attention
|
||||
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
|
||||
encoder_extended_attention_mask = None
|
||||
if self.is_decoder and encoder_hidden_states is not None:
|
||||
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
|
||||
encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
|
||||
if encoder_attention_mask is None:
|
||||
encoder_attention_mask = torch.ones(
|
||||
encoder_hidden_shape, device=inputs_embeds.device, dtype=torch.long
|
||||
)
|
||||
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
|
||||
else:
|
||||
encoder_extended_attention_mask = None
|
||||
encoder_extended_attention_mask = create_bidirectional_mask(
|
||||
config=self.config,
|
||||
input_embeds=inputs_embeds,
|
||||
attention_mask=encoder_attention_mask,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
)
|
||||
|
||||
all_hidden_states = () if output_hidden_states else None
|
||||
all_attentions = () if output_attentions else None
|
||||
@ -781,13 +764,13 @@ class T5Stack(T5PreTrainedModel):
|
||||
|
||||
hidden_states = self.dropout(inputs_embeds)
|
||||
|
||||
for i, layer_module in enumerate(self.block):
|
||||
for layer_module in self.block:
|
||||
if output_hidden_states:
|
||||
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||
|
||||
layer_outputs = layer_module(
|
||||
hidden_states,
|
||||
causal_mask,
|
||||
attention_mask,
|
||||
position_bias,
|
||||
encoder_hidden_states,
|
||||
encoder_extended_attention_mask,
|
||||
@ -840,131 +823,6 @@ class T5Stack(T5PreTrainedModel):
|
||||
cross_attentions=all_cross_attentions,
|
||||
)
|
||||
|
||||
# Copied from transformers.models.gptj.modeling_gptj.GPTJModel._update_causal_mask
|
||||
def _update_causal_mask(
|
||||
self,
|
||||
attention_mask: Union[torch.Tensor, "BlockMask"],
|
||||
input_tensor: torch.Tensor,
|
||||
cache_position: torch.Tensor,
|
||||
past_key_values: Cache,
|
||||
output_attentions: bool = False,
|
||||
):
|
||||
if self.config._attn_implementation == "flash_attention_2":
|
||||
if attention_mask is not None and (attention_mask == 0.0).any():
|
||||
return attention_mask
|
||||
return None
|
||||
if self.config._attn_implementation == "flex_attention":
|
||||
if isinstance(attention_mask, torch.Tensor):
|
||||
attention_mask = make_flex_block_causal_mask(attention_mask)
|
||||
return attention_mask
|
||||
|
||||
# For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
|
||||
# order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
|
||||
# to infer the attention mask.
|
||||
past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
|
||||
using_compilable_cache = past_key_values.is_compileable if past_key_values is not None else False
|
||||
|
||||
# When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
|
||||
if self.config._attn_implementation == "sdpa" and not using_compilable_cache and not output_attentions:
|
||||
if AttentionMaskConverter._ignore_causal_mask_sdpa(
|
||||
attention_mask,
|
||||
inputs_embeds=input_tensor,
|
||||
past_key_values_length=past_seen_tokens,
|
||||
is_training=self.training,
|
||||
):
|
||||
return None
|
||||
|
||||
dtype = input_tensor.dtype
|
||||
sequence_length = input_tensor.shape[1]
|
||||
if using_compilable_cache:
|
||||
target_length = past_key_values.get_max_cache_shape()
|
||||
else:
|
||||
target_length = (
|
||||
attention_mask.shape[-1]
|
||||
if isinstance(attention_mask, torch.Tensor)
|
||||
else past_seen_tokens + sequence_length + 1
|
||||
)
|
||||
|
||||
# In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
|
||||
causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
|
||||
attention_mask,
|
||||
sequence_length=sequence_length,
|
||||
target_length=target_length,
|
||||
dtype=dtype,
|
||||
cache_position=cache_position,
|
||||
batch_size=input_tensor.shape[0],
|
||||
)
|
||||
|
||||
if (
|
||||
self.config._attn_implementation == "sdpa"
|
||||
and attention_mask is not None
|
||||
and attention_mask.device.type in ["cuda", "xpu", "npu"]
|
||||
and not output_attentions
|
||||
):
|
||||
# Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
|
||||
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
|
||||
# Details: https://github.com/pytorch/pytorch/issues/110213
|
||||
min_dtype = torch.finfo(dtype).min
|
||||
causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
|
||||
|
||||
return causal_mask
|
||||
|
||||
@staticmethod
|
||||
# Copied from transformers.models.gptj.modeling_gptj.GPTJModel._prepare_4d_causal_attention_mask_with_cache_position
|
||||
def _prepare_4d_causal_attention_mask_with_cache_position(
|
||||
attention_mask: torch.Tensor,
|
||||
sequence_length: int,
|
||||
target_length: int,
|
||||
dtype: torch.dtype,
|
||||
cache_position: torch.Tensor,
|
||||
batch_size: int,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
|
||||
`(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
|
||||
|
||||
Args:
|
||||
attention_mask (`torch.Tensor`):
|
||||
A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
|
||||
`(batch_size, 1, query_length, key_value_length)`.
|
||||
sequence_length (`int`):
|
||||
The sequence length being processed.
|
||||
target_length (`int`):
|
||||
The target length: when generating with static cache, the mask should be as long as the static cache,
|
||||
to account for the 0 padding, the part of the cache that is not filled yet.
|
||||
dtype (`torch.dtype`):
|
||||
The dtype to use for the 4D attention mask.
|
||||
cache_position (`torch.Tensor`):
|
||||
Indices depicting the position of the input sequence tokens in the sequence.
|
||||
batch_size (`torch.Tensor`):
|
||||
Batch size.
|
||||
"""
|
||||
if attention_mask is not None and attention_mask.dim() == 4:
|
||||
# In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
|
||||
causal_mask = attention_mask
|
||||
else:
|
||||
min_dtype = torch.finfo(dtype).min
|
||||
causal_mask = torch.full(
|
||||
(sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=cache_position.device
|
||||
)
|
||||
if sequence_length != 1:
|
||||
causal_mask = torch.triu(causal_mask, diagonal=1)
|
||||
causal_mask *= torch.arange(target_length, device=cache_position.device) > cache_position.reshape(-1, 1)
|
||||
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
|
||||
if attention_mask is not None:
|
||||
causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
|
||||
mask_length = attention_mask.shape[-1]
|
||||
padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
|
||||
causal_mask.device
|
||||
)
|
||||
padding_mask = padding_mask == 0
|
||||
causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
|
||||
padding_mask, min_dtype
|
||||
)
|
||||
|
||||
return causal_mask
|
||||
|
||||
|
||||
@auto_docstring
|
||||
class T5Model(T5PreTrainedModel):
|
||||
|
||||
@ -448,11 +448,28 @@ class T5GemmaEncoderLayer(GradientCheckpointingLayer):
return hidden_states


class T5GemmaDecoderLayer(T5GemmaEncoderLayer):
class T5GemmaDecoderLayer(GradientCheckpointingLayer):
"""Decoder sub-layer: an extra cross-attention layer."""

def __init__(self, config, layer_idx: int):
super().__init__(config, layer_idx)
super().__init__()
self.hidden_size = config.hidden_size
self.config = config
self.layer_idx = layer_idx
self.attention_type = config.layer_types[layer_idx]

self.self_attn = T5GemmaSelfAttention(
config=config,
layer_idx=layer_idx,
)
self.pre_self_attn_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_self_attn_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

self.mlp = T5GemmaMLP(config)
self.pre_feedforward_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_feedforward_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

self.dropout = nn.Dropout(config.dropout_rate)
self.cross_attn = T5GemmaCrossAttention(config=config, layer_idx=layer_idx)
self.pre_cross_attn_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_cross_attn_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
@ -732,7 +749,7 @@ class T5GemmaEncoder(T5GemmaPreTrainedModel):
)


class T5GemmaDecoder(T5GemmaEncoder):
class T5GemmaDecoder(T5GemmaPreTrainedModel):
_can_record_outputs = {
"attentions": OutputRecorder(T5GemmaSelfAttention, index=1),
"cross_attentions": OutputRecorder(T5GemmaCrossAttention, index=1),
@ -741,11 +758,20 @@ class T5GemmaDecoder(T5GemmaEncoder):

def __init__(self, config):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size

self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
self.norm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False

self.layers = nn.ModuleList(
[T5GemmaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
self.dropout = nn.Dropout(config.dropout_rate)
self.rotary_emb = T5GemmaRotaryEmbedding(config=config)

# Initialize weights and apply final processing
self.post_init()

@check_model_inputs()
@ -771,7 +797,9 @@ class T5GemmaDecoder(T5GemmaEncoder):
inputs_embeds = self.embed_tokens(input_ids)

if not self.training and use_cache and past_key_values is None:
past_key_values = EncoderDecoderCache(DynamicCache(config=self.config), DynamicCache(config=self.config))
# We do not pass the config to the cross attn cache to avoid initializing SWA
# --> we use full attention between our cross attentions
past_key_values = EncoderDecoderCache(DynamicCache(config=self.config), DynamicCache())
if cache_position is None:
past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
cache_position = torch.arange(

@ -517,11 +517,28 @@ class T5GemmaEncoderLayer(GradientCheckpointingLayer):
return hidden_states


class T5GemmaDecoderLayer(T5GemmaEncoderLayer):
class T5GemmaDecoderLayer(GradientCheckpointingLayer):
"""Decoder sub-layer: an extra cross-attention layer."""

def __init__(self, config, layer_idx: int):
super().__init__(config, layer_idx)
super().__init__()
self.hidden_size = config.hidden_size
self.config = config
self.layer_idx = layer_idx
self.attention_type = config.layer_types[layer_idx]

self.self_attn = T5GemmaSelfAttention(
config=config,
layer_idx=layer_idx,
)
self.pre_self_attn_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_self_attn_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

self.mlp = T5GemmaMLP(config)
self.pre_feedforward_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_feedforward_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

self.dropout = nn.Dropout(config.dropout_rate)
self.cross_attn = T5GemmaCrossAttention(config=config, layer_idx=layer_idx)
self.pre_cross_attn_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_cross_attn_layernorm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
@ -770,7 +787,7 @@ class T5GemmaEncoder(T5GemmaPreTrainedModel):
)


class T5GemmaDecoder(T5GemmaEncoder):
class T5GemmaDecoder(T5GemmaPreTrainedModel):
_can_record_outputs = {
"attentions": OutputRecorder(T5GemmaSelfAttention, index=1),
"cross_attentions": OutputRecorder(T5GemmaCrossAttention, index=1),
@ -779,11 +796,20 @@ class T5GemmaDecoder(T5GemmaEncoder):

def __init__(self, config):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size

self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
self.norm = T5GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False

self.layers = nn.ModuleList(
[T5GemmaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
self.dropout = nn.Dropout(config.dropout_rate)
self.rotary_emb = T5GemmaRotaryEmbedding(config=config)

# Initialize weights and apply final processing
self.post_init()

@check_model_inputs()
@ -809,7 +835,9 @@ class T5GemmaDecoder(T5GemmaEncoder):
inputs_embeds = self.embed_tokens(input_ids)

if not self.training and use_cache and past_key_values is None:
past_key_values = EncoderDecoderCache(DynamicCache(config=self.config), DynamicCache(config=self.config))
# We do not pass the config to the cross attn cache to avoid initializing SWA
# --> we use full attention between our cross attentions
past_key_values = EncoderDecoderCache(DynamicCache(config=self.config), DynamicCache())
if cache_position is None:
past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
cache_position = torch.arange(
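The standalone `__init__` above makes the decoder layer's structure explicit: self-attention, cross-attention, and the MLP are each wrapped in a pre-norm and a post-norm, with dropout on every residual branch. A toy, self-contained block with the same norm placement (plain PyTorch, simplified multi-head attention, not the actual T5Gemma implementation):

```python
import torch
from torch import nn

class ToyDecoderBlock(nn.Module):
    def __init__(self, hidden: int = 32, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
        # pre-/post-norm pairs around each sub-layer, mirroring the layout above
        self.pre_self, self.post_self = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
        self.pre_cross, self.post_cross = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
        self.pre_mlp, self.post_mlp = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_states):
        h = self.pre_self(x)
        h, _ = self.self_attn(h, h, h, need_weights=False)
        x = x + self.dropout(self.post_self(h))

        h = self.pre_cross(x)
        h, _ = self.cross_attn(h, encoder_states, encoder_states, need_weights=False)
        x = x + self.dropout(self.post_cross(h))

        h = self.post_mlp(self.mlp(self.pre_mlp(x)))
        return x + self.dropout(h)

block = ToyDecoderBlock()
out = block(torch.randn(2, 5, 32), torch.randn(2, 7, 32))
print(out.shape)  # torch.Size([2, 5, 32])
```

The cache change in the same hunks keeps the self-attention cache config-aware while leaving the cross-attention cache config-free, so cross-attention never picks up sliding-window behavior.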
@ -311,7 +311,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the folder to output PyTorch model."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)
args = parser.parse_args()
convert_table_transformer_checkpoint(args.checkpoint_url, args.pytorch_dump_folder_path, args.push_to_hub)

@ -428,7 +428,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the folder to output PyTorch model."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)
args = parser.parse_args()
convert_table_transformer_checkpoint(args.checkpoint_url, args.pytorch_dump_folder_path, args.push_to_hub)

@ -244,7 +244,9 @@ if __name__ == "__main__":
)
parser.add_argument("--model_name", default="timesformer-base-finetuned-k400", type=str, help="Name of the model.")
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -217,7 +217,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -141,7 +141,7 @@ def main():
"--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
)
parser.add_argument(
"--safe_serialization", action="store_true", help="Whether to save the model using `safetensors`."

@ -207,7 +207,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -290,7 +290,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -317,7 +317,9 @@ if __name__ == "__main__":
)
parser.add_argument("--model_name", default="videomae-base", type=str, help="Name of the model.")
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -163,7 +163,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()

@ -410,7 +410,9 @@ def main():
"--pytorch_dump_folder_path", default=None, type=str, help="Path to store the converted model."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)
parser.add_argument(
"--check_logits", action="store_false", help="Whether or not to verify the logits of the converted model."

@ -374,7 +374,7 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the Hugging Face hub."
)

args = parser.parse_args()

@ -336,7 +336,7 @@ if __name__ == "__main__":
parser.add_argument(
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the 🤗 hub.",
help="Whether or not to push the converted model to the Hugging Face hub.",
)
parser.add_argument("--upload_original", action="store_true", help="upload the original checkpoint")


@ -379,7 +379,9 @@ if __name__ == "__main__":
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
)
parser.add_argument(
"--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
"--push_to_hub",
action="store_true",
help="Whether or not to push the converted model to the Hugging Face hub.",
)

args = parser.parse_args()
Some files were not shown because too many files have changed in this diff.