make model docs device agnostic (2) (#40256)

* doc cont.

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* more models

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update mixtral.md

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Author: Yao Matrix
Date: 2025-08-19 13:10:03 -07:00
Committed by: GitHub
Parent: 42fe769928
Commit: eaa48c81e9

59 changed files with 157 additions and 159 deletions
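
In essence, the hunks below swap hard-coded `"cuda"` device strings for device-agnostic equivalents: `infer_device()` to pick an available accelerator up front, and `model.device` to send inputs to wherever the model weights were placed. A minimal sketch of the two patterns, adapted from the GPT-2 and quicktour snippets further down in this diff (the checkpoints are just illustrative):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, infer_device

# Pick whichever accelerator is available instead of assuming "cuda".
device = infer_device()

# Pattern 1: pass the detected device explicitly, e.g. to a pipeline.
generator = pipeline("text-generation", model="openai-community/gpt2", device=device)
generator("The secret to baking a good cake is ", max_new_tokens=20)

# Pattern 2: with device_map="auto", move inputs to wherever the weights were placed.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", torch_dtype="auto", device_map="auto")
input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```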


@ -65,7 +65,7 @@ model = AutoModelForMaskedLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -68,7 +68,7 @@ model = AutoModelForMaskedLM.from_pretrained(
torch_dtype=torch.float16,
device_map="auto",
)
inputs = tokenizer("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt").to("cuda")
inputs = tokenizer("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -82,7 +82,7 @@ Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận v
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ
"""
inputs = tokenizer(text, return_tensors="pt").to("cuda")
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(inputs["input_ids"], num_beams=2, min_length=0, max_length=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]


@ -67,7 +67,7 @@ model = AutoModelForMaskedLM.from_pretrained(
torch_dtype=torch.float16,
device_map="auto"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -64,7 +64,7 @@ model = AutoModelForMaskedLM.from_pretrained(
torch_dtype=torch.float16,
device_map="auto",
)
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -72,7 +72,7 @@ input_text = """Plants are among the most remarkable and essential life forms on
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -115,7 +115,7 @@ input_text = """Plants are among the most remarkable and essential life forms on
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -73,7 +73,7 @@ url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/
image = Image.open(requests.get(url, stream=True).raw)
question = "What is the weather in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs)
processor.batch_decode(output, skip_special_tokens=True)[0]


@ -48,7 +48,7 @@ tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
texts = ["the sound of a cat", "the sound of a dog", "music playing"]
inputs = tokenizer(texts, padding=True, return_tensors="pt").to("cuda")
inputs = tokenizer(texts, padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
text_features = model.get_text_features(**inputs)


@ -74,7 +74,7 @@ model = AutoModelForCausalLM.from_pretrained(
# basic code generation
prompt = "# Function to calculate the factorial of a number\ndef factorial(n):"
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
**input_ids,
@ -121,7 +121,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
prompt = "# Write a Python function to check if a string is a palindrome\ndef is_palindrome(s):"
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, max_new_tokens=200, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -38,10 +38,10 @@ CSM can be used to simply generate speech from a text prompt:
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from transformers import CsmForConditionalGeneration, AutoProcessor, infer_device
model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
@ -72,11 +72,11 @@ CSM can be used to generate speech given a conversation, allowing consistency in
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from transformers import CsmForConditionalGeneration, AutoProcessor, infer_device
from datasets import load_dataset, Audio
model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
@ -117,11 +117,11 @@ CSM supports batched inference!
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from transformers import CsmForConditionalGeneration, AutoProcessor, infer_device
from datasets import load_dataset, Audio
model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
@ -306,11 +306,11 @@ print("="*50)
CSM Transformers integration supports training!
```python
from transformers import CsmForConditionalGeneration, AutoProcessor
from transformers import CsmForConditionalGeneration, AutoProcessor, infer_device
from datasets import load_dataset, Audio
model_id = "sesame/csm-1b"
device = "cuda"
device = infer_device()
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)


@ -69,7 +69,7 @@ model = AutoModelForImageClassification.from_pretrained(
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")
inputs = image_processor(image, return_tensors="pt").to(model.device)
with torch.no_grad():
logits = model(**inputs).logits


@ -58,7 +58,7 @@ model = DbrxForCausalLM.from_pretrained(
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
@ -80,7 +80,7 @@ model = DbrxForCausalLM.from_pretrained(
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
@ -102,7 +102,7 @@ model = DbrxForCausalLM.from_pretrained(
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))


@ -46,9 +46,9 @@ The DepthPro model processes an input image by first downsampling it at multiple
>>> import requests
>>> from PIL import Image
>>> import torch
>>> from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation
>>> from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation, infer_device
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> device = infer_device()
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)
@ -56,7 +56,7 @@ The DepthPro model processes an input image by first downsampling it at multiple
>>> image_processor = DepthProImageProcessorFast.from_pretrained("apple/DepthPro-hf")
>>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf").to(device)
>>> inputs = image_processor(images=image, return_tensors="pt").to(device)
>>> inputs = image_processor(images=image, return_tensors="pt").to(model.device)
>>> with torch.no_grad():
... outputs = model(**inputs)


@ -42,9 +42,9 @@ tokens and decodes them back into audio.
### Generation with Text
```python
from transformers import AutoProcessor, DiaForConditionalGeneration
from transformers import AutoProcessor, DiaForConditionalGeneration, infer_device
torch_device = "cuda"
torch_device = infer_device()
model_checkpoint = "nari-labs/Dia-1.6B-0626"
text = ["[S1] Dia is an open weights text to dialogue model."]
@ -64,9 +64,9 @@ processor.save_audio(outputs, "example.wav")
```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration
from transformers import AutoProcessor, DiaForConditionalGeneration, infer_device
torch_device = "cuda"
torch_device = infer_device()
model_checkpoint = "nari-labs/Dia-1.6B-0626"
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
@ -91,9 +91,9 @@ processor.save_audio(outputs, "example_with_audio.wav")
```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration
from transformers import AutoProcessor, DiaForConditionalGeneration, infer_device
torch_device = "cuda"
torch_device = infer_device()
model_checkpoint = "nari-labs/Dia-1.6B-0626"
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")


@ -72,7 +72,7 @@ model = AutoModelForSequenceClassification.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("I love using Hugging Face Transformers!", return_tensors="pt").to("cuda")
inputs = tokenizer("I love using Hugging Face Transformers!", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -70,7 +70,7 @@ model = AutoModelForImageClassification.from_pretrained(
)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dit-example.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")
inputs = image_processor(image, return_tensors="pt").to(model.device)
with torch.no_grad():
logits = model(**inputs).logits


@ -52,7 +52,7 @@ pipe = pipeline(
task="text-generation",
model="google/gemma-2-9b",
torch_dtype=torch.bfloat16,
device="cuda",
device_map="auto",
)
pipe("Explain quantum computing simply. ", max_new_tokens=50)
@ -74,7 +74,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
input_text = "Explain quantum computing simply."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
@ -108,7 +108,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
input_text = "Explain quantum computing simply."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


@ -61,8 +61,8 @@ Tips:
In the following, we demonstrate how to use `glm-4-9b-chat` for the inference. Note that we have used the ChatML format for dialog, in this demo we show how to leverage `apply_chat_template` for this purpose.
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
>>> device = infer_device() # the device to load the model onto
>>> model = AutoModelForCausalLM.from_pretrained("THUDM/glm-4-9b-chat", device_map="auto", trust_remote_code=True)
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat")
@ -73,7 +73,7 @@ In the following, we demonstrate how to use `glm-4-9b-chat` for the inference. N
>>> text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> model_inputs = tokenizer([text], return_tensors="pt").to(device)
>>> model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)


@ -58,7 +58,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to("cuda")
input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -102,7 +102,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-xl")
inputs = tokenizer("Once upon a time, there was a magical forest", return_tensors="pt").to("cuda")
inputs = tokenizer("Once upon a time, there was a magical forest", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```


@ -67,15 +67,15 @@ To load and run a model using Flash Attention 2, refer to the snippet below:
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
>>> device = infer_device() # the device to load the model onto
>>> model = AutoModelForCausalLM.from_pretrained("bigcode/gpt_bigcode-santacoder", torch_dtype=torch.float16, attn_implementation="flash_attention_2")
>>> tokenizer = AutoTokenizer.from_pretrained("bigcode/gpt_bigcode-santacoder")
>>> prompt = "def hello_world():"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> model.to(device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=30, do_sample=False)


@ -41,10 +41,10 @@ This model was contributed by [Stella Biderman](https://huggingface.co/stellaath
which could be used to further minimize the RAM usage:
```python
>>> from transformers import GPTJForCausalLM
>>> from transformers import GPTJForCausalLM, infer_device
>>> import torch
>>> device = "cuda"
>>> device = infer_device()
>>> model = GPTJForCausalLM.from_pretrained(
... "EleutherAI/gpt-j-6B",
... revision="float16",
@ -96,10 +96,10 @@ model.
...or in float16 precision:
```python
>>> from transformers import GPTJForCausalLM, AutoTokenizer
>>> from transformers import GPTJForCausalLM, AutoTokenizer, infer_device
>>> import torch
>>> device = "cuda"
>>> device = infer_device()
>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
@ -109,7 +109,7 @@ model.
... "researchers was the fact that the unicorns spoke perfect English."
... )
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
>>> gen_tokens = model.generate(
... input_ids,


@ -119,14 +119,13 @@ In the following, we demonstrate how to use `helium-1-preview` for the inference
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto
>>> model = AutoModelForCausalLM.from_pretrained("kyutai/helium-1-preview-2b", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/helium-1-preview-2b")
>>> prompt = "Give me a short introduction to large language model."
>>> model_inputs = tokenizer(prompt, return_tensors="pt").to(device)
>>> model_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)


@ -86,10 +86,9 @@ This example demonstrates how to perform inference on a single image with the In
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
... {
@ -118,10 +117,9 @@ This example shows how to generate text using the InternVL model without providi
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
... {
@ -148,10 +146,9 @@ InternVL models also support batched image and text inputs.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
... [
@ -192,10 +189,9 @@ This implementation of the InternVL models supports batched text-images inputs w
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
...     [
@ -275,10 +271,9 @@ This example showcases how to handle a batch of chat conversations with interlea
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
...     [


@ -70,7 +70,7 @@ model = AutoModelForCausalLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -33,10 +33,10 @@ rendered properly in your Markdown viewer.
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
from transformers import infer_device, KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
torch_device = infer_device()
model_id = "kyutai/stt-2.6b-en-trfs"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
@ -52,7 +52,7 @@ ds = ds.cast_column("audio", Audio(sampling_rate=24000))
inputs = processor(
ds[0]["audio"]["array"],
)
inputs.to(torch_device)
inputs.to(model.device)
# 4. infer the model
output_tokens = model.generate(**inputs)
@ -66,10 +66,10 @@ print(processor.batch_decode(output_tokens, skip_special_tokens=True))
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
from transformers import infer_device, KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
torch_device = infer_device()
model_id = "kyutai/stt-2.6b-en-trfs"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
@ -84,7 +84,7 @@ ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)
inputs = inputs.to(model.device)
# 4. infer the model
output_tokens = model.generate(**inputs)


@ -69,7 +69,7 @@ model = AutoModelForCausalLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -103,7 +103,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-30b")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -69,7 +69,7 @@ model = AutoModelForCausalLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -103,7 +103,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -75,7 +75,7 @@ The complexity of this mechanism is `O(l(r + l/k))`.
>>> dataset = load_dataset("scientific_papers", "pubmed", split="validation")
>>> model = (
... LongT5ForConditionalGeneration.from_pretrained("Stancld/longt5-tglobal-large-16384-pubmed-3k_steps")
... .to("cuda")
... .to("auto")
... .half()
... )
>>> tokenizer = AutoTokenizer.from_pretrained("Stancld/longt5-tglobal-large-16384-pubmed-3k_steps")
@ -85,8 +85,8 @@ The complexity of this mechanism is `O(l(r + l/k))`.
... inputs_dict = tokenizer(
... batch["article"], max_length=16384, padding="max_length", truncation=True, return_tensors="pt"
... )
... input_ids = inputs_dict.input_ids.to("cuda")
... attention_mask = inputs_dict.attention_mask.to("cuda")
... input_ids = inputs_dict.input_ids.to(model.device)
... attention_mask = inputs_dict.attention_mask.to(model.device)
... output_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=512, num_beams=2)
... batch["predicted_abstract"] = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
... return batch


@ -59,7 +59,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-130m-hf", torch_dtype=torch.float16, device_map="auto",)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -88,7 +88,7 @@ quantization_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)
tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-2.8b-hf", torch_dtype=torch.bfloat16, quantization_config=quantization_config, device_map="auto",)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -57,7 +57,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mamba-Codestral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mamba-Codestral-7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -84,7 +84,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mamba-Codestral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mamba-Codestral-7B-v0.1", torch_dtype=torch.bfloat16, quantization_config=quantization_config, device_map="auto")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -72,7 +72,7 @@ model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-m
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "en_XX"
encoded_hi = tokenizer(article_en, return_tensors="pt").to("cuda")
encoded_hi = tokenizer(article_en, return_tensors="pt").to(model.device)
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"], cache_implementation="static")
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```
@ -208,4 +208,4 @@ print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
- decode
</jax>
</frameworkcontent>
</frameworkcontent>


@ -67,7 +67,7 @@ The pre-trained model can be used as follows:
... {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -99,7 +99,7 @@ To load and run a model using Flash Attention-2, refer to the snippet below:
>>> prompt = "My favourite condiment is"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> model.to(device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
@ -142,7 +142,7 @@ Quantizing a model is as simple as passing a `quantization_config` to the model.
... {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]


@ -68,8 +68,7 @@ The base model can be used as follows:
>>> prompt = "My favourite condiment is"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
>>> model.to(device)
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -90,7 +89,7 @@ The instruction tuned model can be used as follows:
... {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -122,8 +121,7 @@ To load and run a model using Flash Attention-2, refer to the snippet below:
>>> prompt = "My favourite condiment is"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
>>> model.to(device)
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -173,7 +171,7 @@ Quantizing a model is as simple as passing a `quantization_config` to the model.
... {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -191,7 +189,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- A demo notebook to perform supervised fine-tuning (SFT) of Mixtral-8x7B can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb). 🌎
- A [blog post](https://medium.com/@prakharsaxena11111/finetuning-mixtral-7bx8-6071b0ebf114) on fine-tuning Mixtral-8x7B using PEFT. 🌎
- The [Alignment Handbook](https://github.com/huggingface/alignment-handbook) by Hugging Face includes scripts and recipes to perform supervised fine-tuning (SFT) and direct preference optimization with Mistral-7B. This includes scripts for full fine-tuning, QLoRa on a single GPU as well as multi-GPU fine-tuning.
- The [Alignment Handbook](https://github.com/huggingface/alignment-handbook) by Hugging Face includes scripts and recipes to perform supervised fine-tuning (SFT) and direct preference optimization with Mistral-7B. This includes scripts for full fine-tuning, QLoRa on a single accelerator as well as multi-accelerator fine-tuning.
- [Causal language modeling task guide](../tasks/language_modeling)
## MixtralConfig


@ -39,13 +39,13 @@ The example below demonstrates how to generate text based on an image with the [
```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor, infer_device
from transformers.image_utils import load_image
# Prepare processor and model
model_id = "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)


@ -75,11 +75,11 @@ image_processor = AutoImageProcessor.from_pretrained(
"apple/mobilevit-small",
use_fast=True,
)
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-small")
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-small", device_map="auto")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")
inputs = image_processor(image, return_tensors="pt").to(model.device)
with torch.no_grad():
logits = model(**inputs).logits


@ -66,7 +66,7 @@ model = AutoModelForMaskedLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -116,13 +116,13 @@ To follow the example of the following image, `"Hello, I'm Moshi"` could be tran
```python
>>> from datasets import load_dataset, Audio
>>> import torch, math
>>> from transformers import MoshiForConditionalGeneration, AutoFeatureExtractor, AutoTokenizer
>>> from transformers import MoshiForConditionalGeneration, AutoFeatureExtractor, AutoTokenizer, infer_device
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16")
>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/moshiko-pytorch-bf16")
>>> device = "cuda"
>>> device = infer_device()
>>> dtype = torch.bfloat16
>>> # prepare user input audio


@ -73,7 +73,7 @@ model = AutoModelForSeq2SeqLM.from_pretrained(
input_text = """Plants are remarkable organisms that produce their own food using a method called photosynthesis.
This process involves converting sunlight, carbon dioxide, and water into glucose, which provides energy for growth.
Plants play a crucial role in sustaining life on Earth by generating oxygen and serving as the foundation of most ecosystems."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -115,7 +115,7 @@ tokenizer = AutoTokenizer.from_pretrained(
input_text = """Plants are remarkable organisms that produce their own food using a method called photosynthesis.
This process involves converting sunlight, carbon dioxide, and water into glucose, which provides energy for growth.
Plants play a crucial role in sustaining life on Earth by generating oxygen and serving as the foundation of most ecosystems."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -65,14 +65,14 @@ into a single instance to both extract the input features and decode the predict
>>> import re
>>> from PIL import Image
>>> from transformers import NougatProcessor, VisionEncoderDecoderModel
>>> from transformers import NougatProcessor, VisionEncoderDecoderModel, infer_device
>>> from datasets import load_dataset
>>> import torch
>>> processor = NougatProcessor.from_pretrained("facebook/nougat-base")
>>> model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> device = infer_device()
>>> model.to(device) # doctest: +IGNORE_RESULT
>>> # prepare PDF image for the model
@ -125,4 +125,4 @@ The model is identical to [Donut](donut) in terms of architecture.
- save_pretrained
- batch_decode
- decode
- post_process_generation
- post_process_generation


@ -59,9 +59,9 @@ print(result)
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924", attn_implementation="sdpa", torch_dtype="auto", device_map="auto").to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
@ -79,9 +79,9 @@ The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quan
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, infer_device
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,


@ -77,7 +77,7 @@ processor = AutoProcessor.from_pretrained(
prompt = "What is in this image?"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(image, prompt, return_tensors="pt").to("cuda")
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
print(processor.decode(output[0], skip_special_tokens=True))
@ -111,7 +111,7 @@ processor = AutoProcessor.from_pretrained(
prompt = "What is in this image?"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(image, prompt, return_tensors="pt").to("cuda")
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
print(processor.decode(output[0], skip_special_tokens=True))


@ -75,7 +75,7 @@ model = AutoModelForSeq2SeqLM.from_pretrained(
input_text = """Plants are remarkable organisms that produce their own food using a method called photosynthesis.
This process involves converting sunlight, carbon dioxide, and water into glucose, which provides energy for growth.
Plants play a crucial role in sustaining life on Earth by generating oxygen and serving as the foundation of most ecosystems."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -117,7 +117,7 @@ tokenizer = AutoTokenizer.from_pretrained(
input_text = """Plants are remarkable organisms that produce their own food using a method called photosynthesis.
This process involves converting sunlight, carbon dioxide, and water into glucose, which provides energy for growth.
Plants play a crucial role in sustaining life on Earth by generating oxygen and serving as the foundation of most ecosystems."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -60,7 +60,7 @@ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype=torc
input_ids = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt").to("cuda")
"""''', return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -91,7 +91,7 @@ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype=torc
input_ids = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt").to("cuda")
"""''', return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -116,7 +116,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
input_ids = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt").to("cuda")
"""''', return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -67,7 +67,7 @@ torch.random.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3.5-MoE-instruct",
device_map="cuda",
device_map="auto",
torch_dtype="auto",
trust_remote_code=True,
)


@ -58,7 +58,7 @@ from transformers import AutoProcessor, LlavaForConditionalGeneration
model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="cuda")
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
chat = [
{


@ -82,7 +82,7 @@ text = tokenizer.apply_chat_template(
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
model_inputs.input_ids,
@ -137,7 +137,7 @@ model = AutoModelForCausalLM.from_pretrained(
attn_implementation="flash_attention_2"
)
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to("cuda")
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```


@ -82,7 +82,7 @@ text = tokenizer.apply_chat_template(
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
model_inputs.input_ids,
@ -131,7 +131,7 @@ model = AutoModelForCausalLM.from_pretrained(
attn_implementation="flash_attention_2"
)
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to("cuda")
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```


@ -69,7 +69,7 @@ model = AutoModelForMaskedLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -65,7 +65,7 @@ model = AutoModelForMaskedLM.from_pretrained(
torch_dtype=torch.float16,
device_map="auto",
)
inputs = tokenizer("這家餐廳的拉麵是我[MASK]過的最好的拉麵之", return_tensors="pt").to("cuda")
inputs = tokenizer("這家餐廳的拉麵是我[MASK]過的最好的拉麵之", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -52,9 +52,9 @@ Below is an example on how to run mask generation given an image and a 2D point:
import torch
from PIL import Image
import requests
from transformers import SamModel, SamProcessor
from transformers import SamModel, SamProcessor, infer_device
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
@ -78,9 +78,9 @@ You can also process your own masks alongside the input images in the processor
import torch
from PIL import Image
import requests
from transformers import SamModel, SamProcessor
from transformers import SamModel, SamProcessor, infer_device
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")


@ -52,10 +52,10 @@ SAM2's key strength is its ability to track objects across video frames. Here's
#### Basic Video Tracking
```python
>>> from transformers import Sam2VideoModel, Sam2VideoProcessor
>>> from transformers import Sam2VideoModel, Sam2VideoProcessor, infer_device
>>> import torch
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> device = infer_device()
>>> model = Sam2VideoModel.from_pretrained("facebook/sam2.1-hiera-tiny").to(device, dtype=torch.bfloat16)
>>> processor = Sam2VideoProcessor.from_pretrained("facebook/sam2.1-hiera-tiny")


@ -74,7 +74,7 @@ candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to("cuda")
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)
@ -102,7 +102,7 @@ candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing higher values e.g. `max_num_patches=512`
inputs = processor(text=texts, images=image, padding="max_length", max_num_patches=256, return_tensors="pt").to("cuda")
inputs = processor(text=texts, images=image, padding="max_length", max_num_patches=256, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)
@ -137,7 +137,7 @@ candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to("cuda")
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)
@ -149,7 +149,7 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
## Notes
- Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities which may limit the scalability of batch size.
- Training is supported for DDP and FSDP on single-node multi-accelerator setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities which may limit the scalability of batch size.
- When using the standalone [`GemmaTokenizerFast`] make sure to pass `padding="max_length"` and `max_length=64` as that's how the model was trained.
- Model was trained with *lowercased* text, so make sure your text labels are preprocessed the same way.
- To get the same results as the [`Pipeline`], a prompt template of `"This is a photo of {label}."` should be passed to the processor.


@ -63,7 +63,7 @@ processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Ins
model = AutoModelForImageTextToText.from_pretrained(
"HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda"
device_map="auto"
)
conversation = [
@ -125,7 +125,7 @@ processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Ins
model = AutoModelForImageTextToText.from_pretrained(
"HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda"
device_map="auto"
)
# Conversation for the first image


@ -49,7 +49,7 @@ These ready-to-use checkpoints can be downloaded and used via the HuggingFace Hu
>>> prompt = "def print_hello_world():"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=10, do_sample=False)
>>> tokenizer.batch_decode(generated_ids)[0]


@ -47,7 +47,7 @@ model = TimesFmModelForPrediction.from_pretrained(
"google/timesfm-2.0-500m-pytorch",
torch_dtype=torch.bfloat16,
attn_implementation="sdpa",
device_map="cuda" if torch.cuda.is_available() else None
device_map="auto"
)
@ -61,12 +61,10 @@ frequency_input = [0, 1, 2]
# Convert inputs to sequence of tensors
forecast_input_tensor = [
torch.tensor(ts, dtype=torch.bfloat16).to("cuda" if torch.cuda.is_available() else "cpu")
torch.tensor(ts, dtype=torch.bfloat16).to(model.device)
for ts in forecast_input
]
frequency_input_tensor = torch.tensor(frequency_input, dtype=torch.long).to(
"cuda" if torch.cuda.is_available() else "cpu"
)
frequency_input_tensor = torch.tensor(frequency_input, dtype=torch.long).to(model.device)
# Get predictions from the pre-trained model
with torch.no_grad():


@ -159,13 +159,14 @@ Otherwise, [`~Wav2Vec2ProcessorWithLM.batch_decode`] performance will be slower
```python
>>> # Let's see how to use a user-managed pool for batch decoding multiple audios
>>> from multiprocessing import get_context
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC, infer_device
>>> from datasets import load_dataset
>>> import datasets
>>> import torch
>>> device = infer_device()
>>> # import model, feature extractor, tokenizer
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm").to("cuda")
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm").to(device)
>>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>> # load example dataset
@ -183,8 +184,9 @@ Otherwise, [`~Wav2Vec2ProcessorWithLM.batch_decode`] performance will be slower
>>> def map_to_pred(batch, pool):
... device = infer_device()
... inputs = processor(batch["speech"], sampling_rate=16_000, padding=True, return_tensors="pt")
... inputs = {k: v.to("cuda") for k, v in inputs.items()}
... inputs = {k: v.to(device) for k, v in inputs.items()}
... with torch.no_grad():
... logits = model(**inputs).logits


@ -60,7 +60,7 @@ tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba-7B-v1")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba-7B-v1", device_map="auto", torch_dtype=torch.bfloat16)
input_text = "A funny prompt would be "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))


@ -72,7 +72,7 @@ model = AutoModelForDepthEstimation.from_pretrained(
)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")
inputs = image_processor(image, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(inputs)
@ -128,4 +128,4 @@ Image.fromarray(depth.astype("uint8"))
## ZoeDepthForDepthEstimation
[[autodoc]] ZoeDepthForDepthEstimation
- forward
- forward


@ -99,7 +99,7 @@ Use [`~PreTrainedModel.from_pretrained`] to load the weights and configuration f
When you load a model, configure the following parameters to ensure the model is optimally loaded.
- `device_map="auto"` automatically allocates the model weights to your fastest device first, which is typically the GPU.
- `device_map="auto"` automatically allocates the model weights to your fastest device first.
- `torch_dtype="auto"` directly initializes the model weights in the data type they're stored in, which can help avoid loading the weights twice (PyTorch loads weights in `torch.float32` by default).
```py
@ -109,10 +109,10 @@ model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_d
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```
Tokenize the text and return PyTorch tensors with the tokenizer. Move the model to a GPU if it's available to accelerate inference.
Tokenize the text and return PyTorch tensors with the tokenizer. Move the model to an accelerator if it's available to accelerate inference.
```py
model_inputs = tokenizer(["The secret to baking a good cake is "], return_tensors="pt").to("cuda")
model_inputs = tokenizer(["The secret to baking a good cake is "], return_tensors="pt").to(model.device)
```
The model is now ready for inference or training.
@ -169,12 +169,14 @@ Create a [`Pipeline`] object and select a task. By default, [`Pipeline`] downloa
<hfoptions id="pipeline-tasks">
<hfoption id="text generation">
Set `device="cuda"` to accelerate inference with a GPU.
Use [`~infer_device`] to automatically detect an available accelerator for inference.
```py
from transformers import pipeline
from transformers import pipeline, infer_device
pipeline = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf", device="cuda")
device = infer_device()
pipeline = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf", device=device)
```
Prompt [`Pipeline`] with some initial text to generate more text.
@ -187,12 +189,14 @@ pipeline("The secret to baking a good cake is ", max_length=50)
</hfoption>
<hfoption id="image segmentation">
Set `device="cuda"` to accelerate inference with a GPU.
Use [`~infer_device`] to automatically detect an available accelerator for inference.
```py
from transformers import pipeline
from transformers import pipeline, infer_device
pipeline = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic", device="cuda")
device = infer_device()
pipeline = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic", device=device)
```
Pass an image - a URL or local path to the image - to [`Pipeline`].
@ -212,12 +216,14 @@ segments[1]["label"]
</hfoption>
<hfoption id="automatic speech recognition">
Set `device="cuda"` to accelerate inference with a GPU.
Use [`~infer_device`] to automatically detect an available accelerator for inference.
```py
from transformers import pipeline
from transformers import pipeline, infer_device
pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3", device="cuda")
device = infer_device()
pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3", device=device)
```
Pass an audio file to [`Pipeline`].