This model was released on 2022-01-28 and added to Hugging Face Transformers on 2022-12-21.
BLIP
BLIP (Bootstrapped Language-Image Pretraining) is a vision-language pretraining (VLP) framework designed for both understanding and generation tasks, whereas most existing pretrained models handle only one or the other. It uses a captioner to generate synthetic captions and a filter to remove the noisy ones, which improves training data quality and makes more effective use of messy web data.
You can find all the original BLIP checkpoints under the BLIP collection.
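As a conceptual sketch of this bootstrapping (CapFilt) idea, rather than the original training code, the example below uses BlipForConditionalGeneration as the captioner and BlipForImageTextRetrieval as the filter; the two checkpoints and the 0.5 acceptance threshold are assumptions chosen for illustration.
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, BlipForImageTextRetrieval

# Captioner: proposes a synthetic caption for a web image
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Filter: scores how well a caption matches the image with the image-text matching head
itm_processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
itm_model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# 1. Generate a caption for the image
cap_inputs = cap_processor(images=image, return_tensors="pt")
caption = cap_processor.decode(captioner.generate(**cap_inputs)[0], skip_special_tokens=True)

# 2. Keep the caption only if the matching head scores it as a good description
itm_inputs = itm_processor(images=image, text=caption, return_tensors="pt")
with torch.no_grad():
    itm_logits = itm_model(**itm_inputs).itm_score
match_prob = torch.softmax(itm_logits, dim=1)[:, 1].item()
keep = match_prob > 0.5  # threshold is an assumption for this sketch
print(f"caption={caption!r}, match probability={match_prob:.2f}, kept={keep}")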
Tip
This model was contributed by ybelkada.
Click on the BLIP models in the right sidebar for more examples of how to apply BLIP to different vision language tasks.
The example below demonstrates how to perform visual question answering with Pipeline or the AutoModel class.
import torch
from transformers import pipeline

# Build a visual question answering pipeline with BLIP in half precision on GPU 0
pipeline = pipeline(
    task="visual-question-answering",
    model="Salesforce/blip-vqa-base",
    dtype=torch.float16,
    device=0
)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
pipeline(question="What is the weather in this image?", image=url)
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering

# Load the processor and model in half precision, letting device_map place the weights
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = AutoModelForVisualQuestionAnswering.from_pretrained(
    "Salesforce/blip-vqa-base",
    dtype=torch.float16,
    device_map="auto"
)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
question = "What is the weather in this image?"

# Preprocess the image-question pair, then move it to the model's device and dtype
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs)
processor.batch_decode(output, skip_special_tokens=True)[0]
Resources
Refer to this notebook to learn how to fine-tune BLIP for image captioning on a custom dataset.
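For orientation, here is a minimal sketch of that captioning fine-tuning loop; the toy (image path, caption) pairs, learning rate, and epoch count are placeholders, and the notebook covers proper dataset and dataloader handling.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical toy dataset: replace with your own (image path, caption) pairs
pairs = [("cat.jpg", "a cat sitting on a couch"), ("dog.jpg", "a dog playing in the grass")]

for epoch in range(3):
    for image_path, caption in pairs:
        image = Image.open(image_path)
        inputs = processor(images=image, text=caption, return_tensors="pt")
        # The captioning (language modeling) loss is returned when labels are provided
        outputs = model(input_ids=inputs.input_ids,
                        pixel_values=inputs.pixel_values,
                        labels=inputs.input_ids)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()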
BlipConfig
autodoc BlipConfig - from_text_vision_configs
BlipTextConfig
autodoc BlipTextConfig
BlipVisionConfig
autodoc BlipVisionConfig
BlipProcessor
autodoc BlipProcessor
BlipImageProcessor
autodoc BlipImageProcessor - preprocess
BlipImageProcessorFast
autodoc BlipImageProcessorFast - preprocess
BlipModel
BlipModel is going to be deprecated in a future release. Use BlipForConditionalGeneration, BlipForImageTextRetrieval, or BlipForQuestionAnswering instead, depending on your use case.
autodoc BlipModel - forward - get_text_features - get_image_features
BlipTextModel
autodoc BlipTextModel - forward
BlipTextLMHeadModel
autodoc BlipTextLMHeadModel - forward
BlipVisionModel
autodoc BlipVisionModel - forward
BlipForConditionalGeneration
autodoc BlipForConditionalGeneration - forward
BlipForImageTextRetrieval
autodoc BlipForImageTextRetrieval - forward
BlipForQuestionAnswering
autodoc BlipForQuestionAnswering - forward