Commit Graph

10 Commits

Author SHA1 Message Date
fd40516923 Update GH org references (#6998)
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
2025-02-05 00:56:50 +00:00
249c1db2fb Add fp16 support of Qwen1.5MoE models (A2.7B) to DeepSpeed-FastGen (#5403)
This PR adds support for the Qwen1.5-MoE-A2.7B model, addressing the request in https://github.com/microsoft/DeepSpeed-MII/issues/457.

### Test Code

For the MII pipeline:
```python
import mii

# Non-persistent pipeline; shards across all ranks when launched with deepspeed.
pipe = mii.pipeline("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
responses = pipe("DeepSpeed is", max_new_tokens=128, do_sample=False)
if pipe.is_rank_0:
    print(responses[0])
```
For Hugging Face:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
# Load in fp16 to match the precision used by FastGen.
model = AutoModelForCausalLM.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True).eval()
print(model)
inputs = tokenizer('DeepSpeed is', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs, max_new_tokens=128, do_sample=False, repetition_penalty=1.0)
text = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(text)
```

### Qwen1.5-MoE-A2.7B
Hugging Face output with prompt "DeepSpeed is":
```
 a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.

DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.

One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```
DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
 a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.

DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.

One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```

DeepSpeed-FastGen output with prompt "DeepSpeed is" with 8-way sharding:
```
 a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.

DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.

One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```
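
For the sharded run, a persistent deployment is an alternative to the one-off pipeline. A minimal sketch using mii.serve (the deployment name here is illustrative; tensor_parallel=8 matches the 8-way sharding above):
```python
import mii

# Persistent FastGen deployment sharded across 8 GPUs.
client = mii.serve(
    "/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B",
    deployment_name="qwen1.5-moe-a2.7b",  # illustrative name
    tensor_parallel=8,
)
responses = client.generate("DeepSpeed is", max_new_tokens=128)
print(responses[0])
client.terminate_server()
```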

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>
2024-08-01 10:27:24 -07:00
6a163e03f4 Add support for Microsoft Phi-3 model to DeepSpeed-FastGen (#5559)
This PR adds support for the Microsoft Phi-3 model to FastGen.

DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
an AI-powered platform designed to optimize and scale distributed deep learning models across clusters.**

DeepSpeed is a cutting-edge AI-driven toolkit that empowers users to enhance and scale deep learning models across distributed computing environments. By harnessing the power of artificial intelligence, DeepSpeed provides innovative solutions for optimizing resource allocation, managing data synchronization, and improving model parallelism. This enables efficient scaling and execution of complex deep learning tasks, unlocking the full potential of distributed computing systems.

### Key Features of DeepSpeed:

1.
```
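
For reference, a smoke test along the lines of the test code in the other model PRs (a sketch; the checkpoint id assumes the public microsoft/Phi-3-mini-4k-instruct release):
```python
import mii

# Hypothetical quick check of the Phi-3 path in FastGen.
pipe = mii.pipeline("microsoft/Phi-3-mini-4k-instruct")  # assumed checkpoint id
responses = pipe("DeepSpeed is", max_new_tokens=128, do_sample=False)
if pipe.is_rank_0:
    print(responses[0])
```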

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-20 03:18:36 +00:00
ed10cc7382 Add support of Qwen models (7b, 14b, 72b) to DeepSpeed-FastGen (#4913)
This PR adds support for the Qwen 7B, 14B, and 72B models.

### Test Code

For the MII pipeline:
```python
from mii import pipeline

pipe = pipeline("Qwen/Qwen-7B-Chat")
# 151643 is the id of Qwen's <|endoftext|> token; set it explicitly so
# generation stops at the right place.
pipe.tokenizer.tokenizer.eos_token_id = 151643
output = pipe(["DeepSpeed is"], max_new_tokens=128, do_sample=False)
print(output)
```
For Hugging Face:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# fp16=True is handled by Qwen's custom modeling code (trust_remote_code).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
inputs = tokenizer('DeepSpeed is', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs, max_new_tokens=128, do_sample=False, repetition_penalty=1.0)
text = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(text)
```

### Qwen 7B
Hugging Face output with prompt "DeepSpeed is":
```
 a high-performance, low-latency database management system designed for real-time analytics and machine learning applications. It is built on top of Apache Arrow, a columnar in-memory data format, and is optimized for processing large volumes of data in parallel.\n\nDeepSpeed offers several key features that make it well-suited for real-time analytics and machine learning applications:\n\n1. High Performance: DeepSpeed is designed to deliver high performance by leveraging parallel processing and optimized data structures. It can process large volumes of data in real-time, making it ideal for applications that require real-time analytics.\n\n2. Low Latency: DeepSpeed is designed to minimize latency by
```
DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
 a high-performance, low-latency database management system designed for real-time analytics and machine learning applications. It is built on top of Apache Arrow, a columnar in-memory data format, and is optimized for processing large volumes of data in parallel.\n\nDeepSpeed offers several key features that make it well-suited for real-time analytics and machine learning applications:\n\n1. High Performance: DeepSpeed is designed to deliver high performance by leveraging parallel processing and optimized data structures. It can process large volumes of data in real-time, making it ideal for applications that require real-time analytics.\n\n2. Low Latency: DeepSpeed is designed to minimize latency by
```

### Qwen 72B
Hugging Face output with prompt "DeepSpeed is" (the model responds in Chinese; translated here):
```
 is an open-source deep learning optimization library that provides a variety of optimization techniques, including model parallelism, data parallelism, hybrid parallelism, ZeRO memory optimization, and more. It helps users train deep learning models on large-scale GPU clusters, improving training speed and reducing memory usage.\nIn Deepspeed, model parallelism is a technique that assigns different parts of a model to different GPUs. This handles the case where a model is too large to fit on a single GPU. Data parallelism splits the dataset into multiple parts, each trained on a different GPU. Hybrid parallelism combines model parallelism and data parallelism to make more efficient use of GPU resources
```
DeepSpeed-FastGen output with prompt "DeepSpeed is" with 8-way sharding (identical Chinese output; translated here):
```
 is an open-source deep learning optimization library that provides a variety of optimization techniques, including model parallelism, data parallelism, hybrid parallelism, ZeRO memory optimization, and more. It helps users train deep learning models on large-scale GPU clusters, improving training speed and reducing memory usage.\nIn Deepspeed, model parallelism is a technique that assigns different parts of a model to different GPUs. This handles the case where a model is too large to fit on a single GPU. Data parallelism splits the dataset into multiple parts, each trained on a different GPU. Hybrid parallelism combines model parallelism and data parallelism to make more efficient use of GPU resources
```
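
The 8-way sharded run uses the same pipeline script, launched across 8 GPUs with the deepspeed launcher; mii.pipeline then shards the model over all visible ranks. A sketch (the script name is illustrative):
```python
# Launch with: deepspeed --num_gpus 8 qwen72b_pipeline.py
from mii import pipeline

pipe = pipeline("Qwen/Qwen-72B-Chat")
pipe.tokenizer.tokenizer.eos_token_id = 151643  # Qwen's <|endoftext|> id
output = pipe(["DeepSpeed is"], max_new_tokens=128, do_sample=False)
if pipe.is_rank_0:
    print(output)
```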

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-01-10 23:05:27 +00:00
834272531a Add support of Microsoft Phi-2 model to DeepSpeed-FastGen (#4812)
This PR adds support for the Microsoft Phi-2 model.

Hugging Face output with prompt "DeepSpeed is":
```
a company that helps make videos and movies look really good. They have a special way of making videos that makes them look like they were made in a movie theater. This is called "4K Ultra HD" and it makes the videos look very clear and detailed. DeepSpeed also has a special way of making videos that makes them look like they were made in a movie theater. This is called "4K Ultra HD" and it makes the videos look very clear and detailed. DeepSpeed also has a special way of making videos that makes them look like they were made in a movie theater. This is called "4K Ultra HD"
```

DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
a company that helps make videos and movies look really good. They have a special way of making videos that makes them look like they were made in a movie theater. This is called "4K Ultra HD" and it makes the videos look very clear and detailed. DeepSpeed also has a special way of making videos that makes them look like they were made in a movie theater. This is called "4K Ultra HD" and it makes the videos look very clear and detailed. DeepSpeed also has a special way of making videos that makes them look like they were made in a movie theater. This is called "4K Ultra HD"
```
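
These model PRs validate support by checking that greedy Hugging Face and FastGen outputs match. A hedged sketch of automating that comparison (the checkpoint id assumes the public microsoft/phi-2 release, and the generated_text attribute is assumed from current MII Response objects):
```python
import mii
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "microsoft/phi-2"  # assumed public checkpoint id
PROMPT = "DeepSpeed is"

def hf_generate() -> str:
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.float16, device_map="auto"
    ).eval()
    ids = tok(PROMPT, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=128, do_sample=False)
    # Return only the continuation, to match FastGen's output.
    return tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

def fastgen_generate() -> str:
    pipe = mii.pipeline(MODEL)
    responses = pipe(PROMPT, max_new_tokens=128, do_sample=False)
    return responses[0].generated_text  # assumed MII Response attribute

# With greedy decoding, the two continuations should match.
print(hf_generate() == fastgen_generate())
```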

---------

Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-01-03 13:27:06 -08:00
a7900bcc3d Add support of Falcon models (7b, 40b, 180b) to DeepSpeed-FastGen (#4790) 2023-12-12 13:31:44 -08:00
ab6b1e16bb Add Japanese blog for DeepSpeed-FastGen (#4651)
This PR adds a Japanese version of the DeepSpeed-FastGen blog.
(It also includes small fixes for typos in the original blog.)

---------

Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-11-07 10:10:45 -08:00
d89027be61 Fix figure in FastGen blog (#4624)
Fix the latency-throughput figure for 13B.
2023-11-05 10:23:46 -08:00
ff53c22485 Add number for latency comparison (#4612)
This PR adds numbers for the latency comparison.
2023-11-03 16:57:00 -07:00
1d9e256c03 DeepSpeed-FastGen blog (#4607)
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
2023-11-03 15:32:40 -07:00