# Fuyu

This model was released on {release_date} and added to Hugging Face Transformers on 2023-10-19. It was contributed by Molbap.
Fuyu is a small, open-source multimodal model designed for AI agents that can handle both text and images. Unlike most multimodal models, it uses a decoder-only Transformer without a separate image encoder, projecting image patches directly into the transformer and supporting arbitrary image resolutions. This simplified architecture allows fast inference—under 100 milliseconds for large images—and streamlines training by removing multiple specialized stages. Despite its small size, Fuyu-8B achieves competitive performance on standard image understanding benchmarks like VQAv2, OKVQA, COCO Captions, and AI2D, outperforming larger models on several metrics while being easier to scale and deploy.
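Conceptually, the image pathway is just a linear projection of flattened image patches into the decoder's embedding space. The sketch below illustrates that idea only; the patch size and hidden size are assumed values for demonstration, not Fuyu's actual configuration:

```py
# Illustrative only: Fuyu's image pathway reduces to one linear projection.
# The patch size and hidden size below are assumptions, not Fuyu's config.
import torch
import torch.nn as nn

patch_size, hidden_size = 30, 4096
patch_dim = 3 * patch_size * patch_size       # flattened RGB patch

project = nn.Linear(patch_dim, hidden_size)   # no separate image encoder

image = torch.randn(1, 3, 300, 600)           # any resolution divisible by patch_size
patches = (
    image.unfold(2, patch_size, patch_size)   # split height into patches
         .unfold(3, patch_size, patch_size)   # split width into patches
         .permute(0, 2, 3, 1, 4, 5)           # (batch, rows, cols, C, ps, ps)
         .reshape(1, -1, patch_dim)           # (batch, num_patches, patch_dim)
)
image_tokens = project(patches)               # fed to the decoder like text embeddings
```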
```py
import torch
from transformers import pipeline

# Text-only generation; without an image, Fuyu acts as a standard language model
pipeline = pipeline(task="text-generation", model="adept/fuyu-8b", dtype="auto")
pipeline("Plants generate energy through a process known as ")
```
```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("adept/fuyu-8b")
model = AutoModelForCausalLM.from_pretrained("adept/fuyu-8b", dtype="auto")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Generate a coco-style caption.\n"

inputs = processor(images=image, text=prompt, return_tensors="pt")

# generate returns the prompt tokens followed by the new tokens,
# so keep only the last max_new_tokens ids before decoding
generated_ids = model.generate(**inputs, max_new_tokens=7)
generation_text = processor.batch_decode(generated_ids[:, -7:], skip_special_tokens=True)
print(generation_text[0])
```
## Usage tips
- Fuyu models were trained with `bfloat16`, but the original inference uses `float16`. Checkpoints uploaded to the Hub use `dtype='float16'`, and the `AutoModel` API casts them from `torch.float32` to `torch.float16`.
- The dtype of the online weights only matters when using `dtype="auto"`. The model is downloaded first (using the dtype of the checkpoint), then cast to torch's default dtype (`torch.float32`). Specify your desired dtype, or it defaults to `torch.float32` (see the loading sketch after this list).
- Don't fine-tune in `float16`; it produces NaN values. Fine-tune in `bfloat16` instead.
- Clone the original repository to convert the model: `git clone https://github.com/persimmon-ai-labs/adept-inference`.
- Pass inputs through a specific processor to get the correct formats. A processor needs an `image_processor` and a `tokenizer` (see the construction sketch after this list).
- Fuyu uses a sentencepiece-based tokenizer with a Unigram model. It supports bytefallback, which is available in `tokenizers==0.14.0` for the fast tokenizer. [`LlamaTokenizer`] is used as a standard wrapper around sentencepiece.
- Use this prompt for image captioning: `f"Generate a coco-style caption.\n"`.
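For instance, to skip the implicit cast to `torch.float32`, pass the dtype explicitly when loading. A minimal sketch, using `torch.bfloat16` to match the training dtype:

```py
import torch
from transformers import AutoModelForCausalLM

# Load directly in bfloat16 (the training dtype) instead of the
# torch.float32 default you get when no dtype is specified
model = AutoModelForCausalLM.from_pretrained("adept/fuyu-8b", dtype=torch.bfloat16)
```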
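And a sketch of assembling the processor from its two components; this assumes `FuyuProcessor` accepts `image_processor` and `tokenizer` keyword arguments, as most Transformers processors do:

```py
from transformers import FuyuImageProcessor, FuyuProcessor, LlamaTokenizerFast

# A Fuyu processor wraps an image processor and a sentencepiece tokenizer
image_processor = FuyuImageProcessor()
tokenizer = LlamaTokenizerFast.from_pretrained("adept/fuyu-8b")
processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)
```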
## FuyuConfig

[[autodoc]] FuyuConfig

## FuyuForCausalLM

[[autodoc]] FuyuForCausalLM
    - forward

## FuyuModel

[[autodoc]] FuyuModel
    - forward

## FuyuImageProcessor

[[autodoc]] FuyuImageProcessor
    - __call__

## FuyuProcessor

[[autodoc]] FuyuProcessor
    - __call__