make model docs device agnostic (2) (#40256)

* doc cont.

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* more models

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quicktour.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update mixtral.md

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Author: Yao Matrix
Date: 2025-08-19 13:10:03 -07:00
Committed by: GitHub
Parent: 42fe769928
Commit: eaa48c81e9

59 changed files with 157 additions and 159 deletions
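
In essence, the hunks below swap hard-coded `"cuda"` device strings for device-agnostic equivalents: `infer_device()` to pick an available accelerator up front, and `model.device` to send inputs to wherever the model weights were placed. A minimal sketch of the two patterns, adapted from the GPT-2 and quicktour snippets further down in this diff (the checkpoints are just illustrative):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, infer_device

# Pick whichever accelerator is available instead of assuming "cuda".
device = infer_device()

# Pattern 1: pass the detected device explicitly, e.g. to a pipeline.
generator = pipeline("text-generation", model="openai-community/gpt2", device=device)
generator("The secret to baking a good cake is ", max_new_tokens=20)

# Pattern 2: with device_map="auto", move inputs to wherever the weights were placed.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", torch_dtype="auto", device_map="auto")
input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```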


@ -65,7 +65,7 @@ model = AutoModelForMaskedLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -68,7 +68,7 @@ model = AutoModelForMaskedLM.from_pretrained(
torch_dtype=torch.float16,
device_map="auto",
)
inputs = tokenizer("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt").to("cuda")
inputs = tokenizer("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -82,7 +82,7 @@ Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận v
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ
"""
inputs = tokenizer(text, return_tensors="pt").to("cuda")
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(inputs["input_ids"], num_beams=2, min_length=0, max_length=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]


@ -67,7 +67,7 @@ model = AutoModelForMaskedLM.from_pretrained(
torch_dtype=torch.float16,
device_map="auto"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -64,7 +64,7 @@ model = AutoModelForMaskedLM.from_pretrained(
torch_dtype=torch.float16,
device_map="auto",
)
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -72,7 +72,7 @@ input_text = """Plants are among the most remarkable and essential life forms on
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -115,7 +115,7 @@ input_text = """Plants are among the most remarkable and essential life forms on
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -73,7 +73,7 @@ url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/
image = Image.open(requests.get(url, stream=True).raw)
question = "What is the weather in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs)
processor.batch_decode(output, skip_special_tokens=True)[0]


@ -48,7 +48,7 @@ tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
texts = ["the sound of a cat", "the sound of a dog", "music playing"]
inputs = tokenizer(texts, padding=True, return_tensors="pt").to("cuda")
inputs = tokenizer(texts, padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
text_features = model.get_text_features(**inputs)


@ -74,7 +74,7 @@ model = AutoModelForCausalLM.from_pretrained(
# basic code generation
prompt = "# Function to calculate the factorial of a number\ndef factorial(n):"
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
**input_ids,
@ -121,7 +121,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
prompt = "# Write a Python function to check if a string is a palindrome\ndef is_palindrome(s):"
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, max_new_tokens=200, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -38,10 +38,10 @@ CSM can be used to simply generate speech from a text prompt:
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from transformers import CsmForConditionalGeneration, AutoProcessor, infer_device
model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
@ -72,11 +72,11 @@ CSM can be used to generate speech given a conversation, allowing consistency in
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from transformers import CsmForConditionalGeneration, AutoProcessor, infer_device
from datasets import load_dataset, Audio
model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
@ -117,11 +117,11 @@ CSM supports batched inference!
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from transformers import CsmForConditionalGeneration, AutoProcessor, infer_device
from datasets import load_dataset, Audio
model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
@ -306,11 +306,11 @@ print("="*50)
CSM Transformers integration supports training!
```python
from transformers import CsmForConditionalGeneration, AutoProcessor
from transformers import CsmForConditionalGeneration, AutoProcessor, infer_device
from datasets import load_dataset, Audio
model_id = "sesame/csm-1b"
device = "cuda"
device = infer_device()
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)


@ -69,7 +69,7 @@ model = AutoModelForImageClassification.from_pretrained(
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")
inputs = image_processor(image, return_tensors="pt").to(model.device)
with torch.no_grad():
logits = model(**inputs).logits


@ -58,7 +58,7 @@ model = DbrxForCausalLM.from_pretrained(
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
@ -80,7 +80,7 @@ model = DbrxForCausalLM.from_pretrained(
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
@ -102,7 +102,7 @@ model = DbrxForCausalLM.from_pretrained(
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))


@ -46,9 +46,9 @@ The DepthPro model processes an input image by first downsampling it at multiple
>>> import requests
>>> from PIL import Image
>>> import torch
>>> from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation
>>> from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation, infer_device
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> device = infer_device()
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)
@ -56,7 +56,7 @@ The DepthPro model processes an input image by first downsampling it at multiple
>>> image_processor = DepthProImageProcessorFast.from_pretrained("apple/DepthPro-hf")
>>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf").to(device)
>>> inputs = image_processor(images=image, return_tensors="pt").to(device)
>>> inputs = image_processor(images=image, return_tensors="pt").to(model.device)
>>> with torch.no_grad():
... outputs = model(**inputs)


@ -42,9 +42,9 @@ tokens and decodes them back into audio.
### Generation with Text
```python
from transformers import AutoProcessor, DiaForConditionalGeneration
from transformers import AutoProcessor, DiaForConditionalGeneration, infer_device
torch_device = "cuda"
torch_device = infer_device()
model_checkpoint = "nari-labs/Dia-1.6B-0626"
text = ["[S1] Dia is an open weights text to dialogue model."]
@ -64,9 +64,9 @@ processor.save_audio(outputs, "example.wav")
```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration
from transformers import AutoProcessor, DiaForConditionalGeneration, infer_device
torch_device = "cuda"
torch_device = infer_device()
model_checkpoint = "nari-labs/Dia-1.6B-0626"
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
@ -91,9 +91,9 @@ processor.save_audio(outputs, "example_with_audio.wav")
```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration
from transformers import AutoProcessor, DiaForConditionalGeneration, infer_device
torch_device = "cuda"
torch_device = infer_device()
model_checkpoint = "nari-labs/Dia-1.6B-0626"
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")


@ -72,7 +72,7 @@ model = AutoModelForSequenceClassification.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("I love using Hugging Face Transformers!", return_tensors="pt").to("cuda")
inputs = tokenizer("I love using Hugging Face Transformers!", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -70,7 +70,7 @@ model = AutoModelForImageClassification.from_pretrained(
)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dit-example.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")
inputs = image_processor(image, return_tensors="pt").to(model.device)
with torch.no_grad():
logits = model(**inputs).logits


@ -52,7 +52,7 @@ pipe = pipeline(
task="text-generation",
model="google/gemma-2-9b",
torch_dtype=torch.bfloat16,
device="cuda",
device_map="auto",
)
pipe("Explain quantum computing simply. ", max_new_tokens=50)
@ -74,7 +74,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
input_text = "Explain quantum computing simply."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
@ -108,7 +108,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
input_text = "Explain quantum computing simply."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


@ -61,8 +61,8 @@ Tips:
In the following, we demonstrate how to use `glm-4-9b-chat` for the inference. Note that we have used the ChatML format for dialog, in this demo we show how to leverage `apply_chat_template` for this purpose.
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
>>> device = infer_device() # the device to load the model onto
>>> model = AutoModelForCausalLM.from_pretrained("THUDM/glm-4-9b-chat", device_map="auto", trust_remote_code=True)
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat")
@ -73,7 +73,7 @@ In the following, we demonstrate how to use `glm-4-9b-chat` for the inference. N
>>> text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> model_inputs = tokenizer([text], return_tensors="pt").to(device)
>>> model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)


@ -58,7 +58,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to("cuda")
input_ids = tokenizer("Hello, I'm a language model", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -102,7 +102,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-xl")
inputs = tokenizer("Once upon a time, there was a magical forest", return_tensors="pt").to("cuda")
inputs = tokenizer("Once upon a time, there was a magical forest", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```


@ -67,15 +67,15 @@ To load and run a model using Flash Attention 2, refer to the snippet below:
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
>>> device = infer_device() # the device to load the model onto
>>> model = AutoModelForCausalLM.from_pretrained("bigcode/gpt_bigcode-santacoder", torch_dtype=torch.float16, attn_implementation="flash_attention_2")
>>> tokenizer = AutoTokenizer.from_pretrained("bigcode/gpt_bigcode-santacoder")
>>> prompt = "def hello_world():"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> model.to(device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=30, do_sample=False)


@ -41,10 +41,10 @@ This model was contributed by [Stella Biderman](https://huggingface.co/stellaath
which could be used to further minimize the RAM usage:
```python
>>> from transformers import GPTJForCausalLM
>>> from transformers import GPTJForCausalLM, infer_device
>>> import torch
>>> device = "cuda"
>>> device = infer_device()
>>> model = GPTJForCausalLM.from_pretrained(
... "EleutherAI/gpt-j-6B",
... revision="float16",
@ -96,10 +96,10 @@ model.
...or in float16 precision:
```python
>>> from transformers import GPTJForCausalLM, AutoTokenizer
>>> from transformers import GPTJForCausalLM, AutoTokenizer, infer_device
>>> import torch
>>> device = "cuda"
>>> device = infer_device()
>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to(device)
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
@ -109,7 +109,7 @@ model.
... "researchers was the fact that the unicorns spoke perfect English."
... )
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
>>> gen_tokens = model.generate(
... input_ids,


@ -119,14 +119,13 @@ In the following, we demonstrate how to use `helium-1-preview` for the inference
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto
>>> model = AutoModelForCausalLM.from_pretrained("kyutai/helium-1-preview-2b", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/helium-1-preview-2b")
>>> prompt = "Give me a short introduction to large language model."
>>> model_inputs = tokenizer(prompt, return_tensors="pt").to(device)
>>> model_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)


@ -86,10 +86,9 @@ This example demonstrates how to perform inference on a single image with the In
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
... {
@ -118,10 +117,9 @@ This example shows how to generate text using the InternVL model without providi
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
... {
@ -148,10 +146,9 @@ InternVL models also support batched image and text inputs.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
... [
@ -192,10 +189,9 @@ This implementation of the InternVL models supports batched text-images inputs w
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
...     [
@ -275,10 +271,9 @@ This example showcases how to handle a batch of chat conversations with interlea
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
>>> messages = [
...     [


@ -70,7 +70,7 @@ model = AutoModelForCausalLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -33,10 +33,10 @@ rendered properly in your Markdown viewer.
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
from transformers import infer_device, KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
torch_device = infer_device()
model_id = "kyutai/stt-2.6b-en-trfs"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
@ -52,7 +52,7 @@ ds = ds.cast_column("audio", Audio(sampling_rate=24000))
inputs = processor(
ds[0]["audio"]["array"],
)
inputs.to(torch_device)
inputs.to(model.device)
# 4. infer the model
output_tokens = model.generate(**inputs)
@ -66,10 +66,10 @@ print(processor.batch_decode(output_tokens, skip_special_tokens=True))
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
from transformers import infer_device, KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
torch_device = infer_device()
model_id = "kyutai/stt-2.6b-en-trfs"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
@ -84,7 +84,7 @@ ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)
inputs = inputs.to(model.device)
# 4. infer the model
output_tokens = model.generate(**inputs)


@ -69,7 +69,7 @@ model = AutoModelForCausalLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -103,7 +103,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-30b")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -69,7 +69,7 @@ model = AutoModelForCausalLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -103,7 +103,7 @@ model = AutoModelForCausalLM.from_pretrained(
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -75,7 +75,7 @@ The complexity of this mechanism is `O(l(r + l/k))`.
>>> dataset = load_dataset("scientific_papers", "pubmed", split="validation")
>>> model = (
... LongT5ForConditionalGeneration.from_pretrained("Stancld/longt5-tglobal-large-16384-pubmed-3k_steps")
... .to("cuda")
... .to("auto")
... .half()
... )
>>> tokenizer = AutoTokenizer.from_pretrained("Stancld/longt5-tglobal-large-16384-pubmed-3k_steps")
@ -85,8 +85,8 @@ The complexity of this mechanism is `O(l(r + l/k))`.
... inputs_dict = tokenizer(
... batch["article"], max_length=16384, padding="max_length", truncation=True, return_tensors="pt"
... )
... input_ids = inputs_dict.input_ids.to("cuda")
... attention_mask = inputs_dict.attention_mask.to("cuda")
... input_ids = inputs_dict.input_ids.to(model.device)
... attention_mask = inputs_dict.attention_mask.to(model.device)
... output_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=512, num_beams=2)
... batch["predicted_abstract"] = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
... return batch


@ -59,7 +59,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-130m-hf", torch_dtype=torch.float16, device_map="auto",)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -88,7 +88,7 @@ quantization_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)
tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-2.8b-hf", torch_dtype=torch.bfloat16, quantization_config=quantization_config, device_map="auto",)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -57,7 +57,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mamba-Codestral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mamba-Codestral-7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -84,7 +84,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mamba-Codestral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mamba-Codestral-7B-v0.1", torch_dtype=torch.bfloat16, quantization_config=quantization_config, device_map="auto")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -72,7 +72,7 @@ model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-m
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "en_XX"
encoded_hi = tokenizer(article_en, return_tensors="pt").to("cuda")
encoded_hi = tokenizer(article_en, return_tensors="pt").to(model.device)
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"], cache_implementation="static")
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```
@ -208,4 +208,4 @@ print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
- decode
</jax>
</frameworkcontent>
</frameworkcontent>


@ -67,7 +67,7 @@ The pre-trained model can be used as follows:
... {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -99,7 +99,7 @@ To load and run a model using Flash Attention-2, refer to the snippet below:
>>> prompt = "My favourite condiment is"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> model.to(device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
@ -142,7 +142,7 @@ Quantizing a model is as simple as passing a `quantization_config` to the model.
... {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]


@ -68,8 +68,7 @@ The base model can be used as follows:
>>> prompt = "My favourite condiment is"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
>>> model.to(device)
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -90,7 +89,7 @@ The instruction tuned model can be used as follows:
... {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -122,8 +121,7 @@ To load and run a model using Flash Attention-2, refer to the snippet below:
>>> prompt = "My favourite condiment is"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
>>> model.to(device)
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -173,7 +171,7 @@ Quantizing a model is as simple as passing a `quantization_config` to the model.
... {"role": "user", "content": "Do you have mayonnaise recipes?"}
... ]
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]
@ -191,7 +189,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- A demo notebook to perform supervised fine-tuning (SFT) of Mixtral-8x7B can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb). 🌎
- A [blog post](https://medium.com/@prakharsaxena11111/finetuning-mixtral-7bx8-6071b0ebf114) on fine-tuning Mixtral-8x7B using PEFT. 🌎
- The [Alignment Handbook](https://github.com/huggingface/alignment-handbook) by Hugging Face includes scripts and recipes to perform supervised fine-tuning (SFT) and direct preference optimization with Mistral-7B. This includes scripts for full fine-tuning, QLoRa on a single GPU as well as multi-GPU fine-tuning.
- The [Alignment Handbook](https://github.com/huggingface/alignment-handbook) by Hugging Face includes scripts and recipes to perform supervised fine-tuning (SFT) and direct preference optimization with Mistral-7B. This includes scripts for full fine-tuning, QLoRa on a single accelerator as well as multi-accelerator fine-tuning.
- [Causal language modeling task guide](../tasks/language_modeling)
## MixtralConfig


@ -39,13 +39,13 @@ The example below demonstrates how to generate text based on an image with the [
```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor, infer_device
from transformers.image_utils import load_image
# Prepare processor and model
model_id = "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)


@ -75,11 +75,11 @@ image_processor = AutoImageProcessor.from_pretrained(
"apple/mobilevit-small",
use_fast=True,
)
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-small")
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-small", device_map="auto")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")
inputs = image_processor(image, return_tensors="pt").to(model.device)
with torch.no_grad():
logits = model(**inputs).logits


@ -66,7 +66,7 @@ model = AutoModelForMaskedLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -116,13 +116,13 @@ To follow the example of the following image, `"Hello, I'm Moshi"` could be tran
```python
>>> from datasets import load_dataset, Audio
>>> import torch, math
>>> from transformers import MoshiForConditionalGeneration, AutoFeatureExtractor, AutoTokenizer
>>> from transformers import MoshiForConditionalGeneration, AutoFeatureExtractor, AutoTokenizer, infer_device
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16")
>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/moshiko-pytorch-bf16")
>>> device = "cuda"
>>> device = infer_device()
>>> dtype = torch.bfloat16
>>> # prepare user input audio


@ -73,7 +73,7 @@ model = AutoModelForSeq2SeqLM.from_pretrained(
input_text = """Plants are remarkable organisms that produce their own food using a method called photosynthesis.
This process involves converting sunlight, carbon dioxide, and water into glucose, which provides energy for growth.
Plants play a crucial role in sustaining life on Earth by generating oxygen and serving as the foundation of most ecosystems."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -115,7 +115,7 @@ tokenizer = AutoTokenizer.from_pretrained(
input_text = """Plants are remarkable organisms that produce their own food using a method called photosynthesis.
This process involves converting sunlight, carbon dioxide, and water into glucose, which provides energy for growth.
Plants play a crucial role in sustaining life on Earth by generating oxygen and serving as the foundation of most ecosystems."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -65,14 +65,14 @@ into a single instance to both extract the input features and decode the predict
>>> import re
>>> from PIL import Image
>>> from transformers import NougatProcessor, VisionEncoderDecoderModel
>>> from transformers import NougatProcessor, VisionEncoderDecoderModel, infer_device
>>> from datasets import load_dataset
>>> import torch
>>> processor = NougatProcessor.from_pretrained("facebook/nougat-base")
>>> model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> device = infer_device()
>>> model.to(device) # doctest: +IGNORE_RESULT
>>> # prepare PDF image for the model
@ -125,4 +125,4 @@ The model is identical to [Donut](donut) in terms of architecture.
- save_pretrained
- batch_decode
- decode
- post_process_generation
- post_process_generation


@ -59,9 +59,9 @@ print(result)
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924", attn_implementation="sdpa", torch_dtype="auto", device_map="auto").to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
@ -79,9 +79,9 @@ The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quan
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, infer_device
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,


@ -77,7 +77,7 @@ processor = AutoProcessor.from_pretrained(
prompt = "What is in this image?"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(image, prompt, return_tensors="pt").to("cuda")
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
print(processor.decode(output[0], skip_special_tokens=True))
@ -111,7 +111,7 @@ processor = AutoProcessor.from_pretrained(
prompt = "What is in this image?"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(image, prompt, return_tensors="pt").to("cuda")
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
print(processor.decode(output[0], skip_special_tokens=True))


@ -75,7 +75,7 @@ model = AutoModelForSeq2SeqLM.from_pretrained(
input_text = """Plants are remarkable organisms that produce their own food using a method called photosynthesis.
This process involves converting sunlight, carbon dioxide, and water into glucose, which provides energy for growth.
Plants play a crucial role in sustaining life on Earth by generating oxygen and serving as the foundation of most ecosystems."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -117,7 +117,7 @@ tokenizer = AutoTokenizer.from_pretrained(
input_text = """Plants are remarkable organisms that produce their own food using a method called photosynthesis.
This process involves converting sunlight, carbon dioxide, and water into glucose, which provides energy for growth.
Plants play a crucial role in sustaining life on Earth by generating oxygen and serving as the foundation of most ecosystems."""
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -60,7 +60,7 @@ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype=torc
input_ids = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt").to("cuda")
"""''', return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -91,7 +91,7 @@ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype=torc
input_ids = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt").to("cuda")
"""''', return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
@ -116,7 +116,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
input_ids = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt").to("cuda")
"""''', return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))


@ -67,7 +67,7 @@ torch.random.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3.5-MoE-instruct",
device_map="cuda",
device_map="auto",
torch_dtype="auto",
trust_remote_code=True,
)


@ -58,7 +58,7 @@ from transformers import AutoProcessor, LlavaForConditionalGeneration
model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="cuda")
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
chat = [
{


@ -82,7 +82,7 @@ text = tokenizer.apply_chat_template(
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
model_inputs.input_ids,
@ -137,7 +137,7 @@ model = AutoModelForCausalLM.from_pretrained(
attn_implementation="flash_attention_2"
)
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to("cuda")
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```


@ -82,7 +82,7 @@ text = tokenizer.apply_chat_template(
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
model_inputs.input_ids,
@ -131,7 +131,7 @@ model = AutoModelForCausalLM.from_pretrained(
attn_implementation="flash_attention_2"
)
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to("cuda")
inputs = tokenizer("The Qwen2 model family is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```


@ -69,7 +69,7 @@ model = AutoModelForMaskedLM.from_pretrained(
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to("cuda")
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -65,7 +65,7 @@ model = AutoModelForMaskedLM.from_pretrained(
torch_dtype=torch.float16,
device_map="auto",
)
inputs = tokenizer("這家餐廳的拉麵是我[MASK]過的最好的拉麵之", return_tensors="pt").to("cuda")
inputs = tokenizer("這家餐廳的拉麵是我[MASK]過的最好的拉麵之", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)


@ -52,9 +52,9 @@ Below is an example on how to run mask generation given an image and a 2D point:
import torch
from PIL import Image
import requests
from transformers import SamModel, SamProcessor
from transformers import SamModel, SamProcessor, infer_device
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
@ -78,9 +78,9 @@ You can also process your own masks alongside the input images in the processor
import torch
from PIL import Image
import requests
from transformers import SamModel, SamProcessor
from transformers import SamModel, SamProcessor, infer_device
device = "cuda" if torch.cuda.is_available() else "cpu"
device = infer_device()
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")


@ -52,10 +52,10 @@ SAM2's key strength is its ability to track objects across video frames. Here's
#### Basic Video Tracking
```python
>>> from transformers import Sam2VideoModel, Sam2VideoProcessor
>>> from transformers import Sam2VideoModel, Sam2VideoProcessor, infer_device
>>> import torch
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> device = infer_device()
>>> model = Sam2VideoModel.from_pretrained("facebook/sam2.1-hiera-tiny").to(device, dtype=torch.bfloat16)
>>> processor = Sam2VideoProcessor.from_pretrained("facebook/sam2.1-hiera-tiny")


@ -74,7 +74,7 @@ candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to("cuda")
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)
@ -102,7 +102,7 @@ candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing higher values e.g. `max_num_patches=512`
inputs = processor(text=texts, images=image, padding="max_length", max_num_patches=256, return_tensors="pt").to("cuda")
inputs = processor(text=texts, images=image, padding="max_length", max_num_patches=256, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)
@ -137,7 +137,7 @@ candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
texts = [f'This is a photo of {label}.' for label in candidate_labels]
# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to("cuda")
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)
@ -149,7 +149,7 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
## Notes
- Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities which may limit the scalability of batch size.
- Training is supported for DDP and FSDP on single-node multi-accelerator setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities which may limit the scalability of batch size.
- When using the standalone [`GemmaTokenizerFast`] make sure to pass `padding="max_length"` and `max_length=64` as that's how the model was trained.
- Model was trained with *lowercased* text, so make sure your text labels are preprocessed the same way.
- To get the same results as the [`Pipeline`], a prompt template of `"This is a photo of {label}."` should be passed to the processor.


@ -63,7 +63,7 @@ processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Ins
model = AutoModelForImageTextToText.from_pretrained(
"HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda"
device_map="auto"
)
conversation = [
@ -125,7 +125,7 @@ processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Ins
model = AutoModelForImageTextToText.from_pretrained(
"HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda"
device_map="auto"
)
# Conversation for the first image


@ -49,7 +49,7 @@ These ready-to-use checkpoints can be downloaded and used via the HuggingFace Hu
>>> prompt = "def print_hello_world():"
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**model_inputs, max_new_tokens=10, do_sample=False)
>>> tokenizer.batch_decode(generated_ids)[0]


@ -47,7 +47,7 @@ model = TimesFmModelForPrediction.from_pretrained(
"google/timesfm-2.0-500m-pytorch",
torch_dtype=torch.bfloat16,
attn_implementation="sdpa",
device_map="cuda" if torch.cuda.is_available() else None
device_map="auto"
)
@ -61,12 +61,10 @@ frequency_input = [0, 1, 2]
# Convert inputs to sequence of tensors
forecast_input_tensor = [
torch.tensor(ts, dtype=torch.bfloat16).to("cuda" if torch.cuda.is_available() else "cpu")
torch.tensor(ts, dtype=torch.bfloat16).to(model.device)
for ts in forecast_input
]
frequency_input_tensor = torch.tensor(frequency_input, dtype=torch.long).to(
"cuda" if torch.cuda.is_available() else "cpu"
)
frequency_input_tensor = torch.tensor(frequency_input, dtype=torch.long).to(model.device)
# Get predictions from the pre-trained model
with torch.no_grad():


@ -159,13 +159,14 @@ Otherwise, [`~Wav2Vec2ProcessorWithLM.batch_decode`] performance will be slower
```python
>>> # Let's see how to use a user-managed pool for batch decoding multiple audios
>>> from multiprocessing import get_context
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC, infer_device
>>> from datasets import load_dataset
>>> import datasets
>>> import torch
>>> device = infer_device()
>>> # import model, feature extractor, tokenizer
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm").to("cuda")
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm").to(device)
>>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>> # load example dataset
@ -183,8 +184,9 @@ Otherwise, [`~Wav2Vec2ProcessorWithLM.batch_decode`] performance will be slower
>>> def map_to_pred(batch, pool):
... device = infer_device()
... inputs = processor(batch["speech"], sampling_rate=16_000, padding=True, return_tensors="pt")
... inputs = {k: v.to("cuda") for k, v in inputs.items()}
... inputs = {k: v.to(device) for k, v in inputs.items()}
... with torch.no_grad():
... logits = model(**inputs).logits


@ -60,7 +60,7 @@ tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba-7B-v1")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba-7B-v1", device_map="auto", torch_dtype=torch.bfloat16)
input_text = "A funny prompt would be "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))


@ -72,7 +72,7 @@ model = AutoModelForDepthEstimation.from_pretrained(
)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")
inputs = image_processor(image, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(inputs)
@ -128,4 +128,4 @@ Image.fromarray(depth.astype("uint8"))
## ZoeDepthForDepthEstimation
[[autodoc]] ZoeDepthForDepthEstimation
- forward
- forward


@ -99,7 +99,7 @@ Use [`~PreTrainedModel.from_pretrained`] to load the weights and configuration f
When you load a model, configure the following parameters to ensure the model is optimally loaded.
- `device_map="auto"` automatically allocates the model weights to your fastest device first, which is typically the GPU.
- `device_map="auto"` automatically allocates the model weights to your fastest device first.
- `torch_dtype="auto"` directly initializes the model weights in the data type they're stored in, which can help avoid loading the weights twice (PyTorch loads weights in `torch.float32` by default).
```py
@ -109,10 +109,10 @@ model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_d
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```
Tokenize the text and return PyTorch tensors with the tokenizer. Move the model to a GPU if it's available to accelerate inference.
Tokenize the text and return PyTorch tensors with the tokenizer. Move the model to an accelerator if it's available to accelerate inference.
```py
model_inputs = tokenizer(["The secret to baking a good cake is "], return_tensors="pt").to("cuda")
model_inputs = tokenizer(["The secret to baking a good cake is "], return_tensors="pt").to(model.device)
```
The model is now ready for inference or training.
@ -169,12 +169,14 @@ Create a [`Pipeline`] object and select a task. By default, [`Pipeline`] downloa
<hfoptions id="pipeline-tasks">
<hfoption id="text generation">
Set `device="cuda"` to accelerate inference with a GPU.
Use [`~infer_device`] to automatically detect an available accelerator for inference.
```py
from transformers import pipeline
from transformers import pipeline, infer_device
pipeline = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf", device="cuda")
device = infer_device()
pipeline = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf", device=device)
```
Prompt [`Pipeline`] with some initial text to generate more text.
@ -187,12 +189,14 @@ pipeline("The secret to baking a good cake is ", max_length=50)
</hfoption>
<hfoption id="image segmentation">
Set `device="cuda"` to accelerate inference with a GPU.
Use [`~infer_device`] to automatically detect an available accelerator for inference.
```py
from transformers import pipeline
from transformers import pipeline, infer_device
pipeline = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic", device="cuda")
device = infer_device()
pipeline = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic", device=device)
```
Pass an image - a URL or local path to the image - to [`Pipeline`].
@ -212,12 +216,14 @@ segments[1]["label"]
</hfoption>
<hfoption id="automatic speech recognition">
Set `device="cuda"` to accelerate inference with a GPU.
Use [`~infer_device`] to automatically detect an available accelerator for inference.
```py
from transformers import pipeline
from transformers import pipeline, infer_device
pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3", device="cuda")
device = infer_device()
pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3", device=device)
```
Pass an audio file to [`Pipeline`].