This model was released on 2024-03-08 and added to Hugging Face Transformers on 2025-07-25.

PyTorch SDPA

DeepseekVLHybrid

Deepseek-VL-Hybrid was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip encodes low-resolution images and SAM (Segment Anything Model) handles high-resolution image encoding, enhancing the model's ability to process fine-grained visual details. Deepseek-VL-Hybrid is a variant of Deepseek-VL that uses SAM (Segment Anything Model) for high-resolution image encoding.

You can find all the original Deepseek-VL-Hybrid checkpoints under the DeepSeek-community organization.
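
To see how the LLaMA text backbone and the two vision towers are wired together, you can inspect the checkpoint configuration. The sketch below is minimal; only the printed output is guaranteed, and the commented sub-config names (text_config, vision_config, high_res_vision_config) are assumptions about the current implementation that may differ across versions.

from transformers import AutoConfig

# Downloads only config.json, not the model weights.
config = AutoConfig.from_pretrained("deepseek-community/deepseek-vl-7b-chat")

# Printing the config shows the nested sub-configurations for the text
# backbone and the two vision encoders.
print(config)

# Assumed (unverified) attribute names for the nested configs:
# config.text_config             -> LLaMA text backbone
# config.vision_config           -> SigLip low-resolution vision encoder
# config.high_res_vision_config  -> SAM high-resolution vision encoder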

Tip

Click on the Deepseek-VL-Hybrid models in the right sidebar for more examples of how to apply Deepseek-VL-Hybrid to different vision and language tasks.

The example below demonstrates how to generate text based on an image with [Pipeline] or the [AutoModel] class.

import torch
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="deepseek-community/deepseek-vl-7b-chat",
    device=0,
    dtype=torch.float16
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            { "type": "text", "text": "Describe this image."},
        ]
    }
]

pipe(text=messages, max_new_tokens=20, return_full_text=False)

The same task with the [AutoModel] class:

import torch
from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor

model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-7b-chat",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
            },
            {
                "type": "text",
                "text": "Describe this image."
            }
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.

The example below uses torchao to quantize only the weights to int4.

import torch
from transformers import TorchAoConfig, DeepseekVLHybridForConditionalGeneration, AutoProcessor

quantization_config = TorchAoConfig(
    "int4_weight_only",
    group_size=128
)

model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-7b-chat",
    dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)
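
After loading, the quantized model is used the same way as the unquantized checkpoint. The following is a short sketch mirroring the generation example above (the image URL is reused from earlier and is only illustrative):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt tokens before decoding, as in the example above.
print(processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True))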

Notes

  • Do inference with multiple images in a single conversation.

    import torch
    from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor
    
    model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
        "deepseek-community/deepseek-vl-7b-chat",
        dtype=torch.float16,
        device_map="auto",
        attn_implementation="sdpa"
    )
    
    processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")
    
    messages = [
        [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Whats the difference between"},
                    {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
                    {"type": "text", "text": " and "},
                    {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}
                ]
            }
        ],
        [
            {
                "role": "user",
                "content": [
                    {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
                    {"type": "text", "text": "What do you see in this image?"}
                ]
            }
        ]
    ]
    
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        padding=True,
        truncation=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device, dtype=model.dtype)
    
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    
    print(output_text)
    

DeepseekVLHybridConfig

autodoc DeepseekVLHybridConfig

DeepseekVLHybridProcessor

autodoc DeepseekVLHybridProcessor

DeepseekVLHybridImageProcessor

autodoc DeepseekVLHybridImageProcessor

DeepseekVLHybridImageProcessorFast

autodoc DeepseekVLHybridImageProcessorFast

DeepseekVLHybridModel

autodoc DeepseekVLHybridModel - forward

DeepseekVLHybridForConditionalGeneration

autodoc DeepseekVLHybridForConditionalGeneration - forward