Summary:

We want to test autoquant on relevant LLM models. Right now this only covers Llama-2 and Mixtral, but we want to extend it to more models, e.g. those in https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models.
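For readers unfamiliar with autoquant: the usual torchao pattern is to wrap the compiled model with `torchao.autoquant`, so that quantized kernel choices are benchmarked per layer on the first forward pass. Below is a minimal sketch of that pattern, assuming a recent torchao install; it is not the benchmark code itself, and `MyTransformer`, the input shapes, and the compile flags are placeholders that may differ from what benchmark.py actually does.

```
# Minimal sketch of the torchao autoquant pattern (not benchmark.py itself).
# `MyTransformer` is a placeholder model; the real benchmark may use
# different compile flags, dtypes, and inputs.
import torch
import torchao


class MyTransformer(torch.nn.Module):  # placeholder stand-in for the LLM
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.proj = torch.nn.Linear(dim, vocab)

    def forward(self, tokens):
        return self.proj(self.emb(tokens))


model = MyTransformer().to(device="cuda", dtype=torch.bfloat16).eval()

# Wrap the compiled model; on the first run autoquant benchmarks several
# quantized kernels per linear layer and keeps the fastest configuration.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

tokens = torch.randint(0, 32000, (1, 128), device="cuda")
with torch.no_grad():
    model(tokens)  # triggers kernel selection and compilation
```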
Test Plan:

Average tokens/sec:
```
                      Llama-2-7b-chat-hf   Mixtral-8x7B-v0.1
gpt-fast int8         112.98               147.92
torchao autoquant     87.41                85.90
torchao autoquantv2   131.12               79.59
```

https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch

In pytorch/benchmarks/gpt_fast:
```
python benchmark.py
```

output:
```
Loading model Llama-2-7b-chat-hf
Using int8 weight-only quantization!
Time to load model: 2.80 seconds
Compilation time: 170.24 seconds
Average tokens/sec: 112.98 tokens/sec
Average bandwidth achieved: 746.86 GB/s
Memory used: 7.95 GB

Loading model Mixtral-8x7B-v0.1
Using int8 weight-only quantization!
Time to load model: 0.24 seconds
Compilation time: 181.81 seconds
Average tokens/sec: 147.92 tokens/sec
Average bandwidth achieved: 953.06 GB/s
Memory used: 32.45 GB

Loading model Llama-2-7b-chat-hf
Time to load model: 0.11 seconds
Using autoquant
Compilation time: 109.31 seconds
Average tokens/sec: 87.17 tokens/sec
Average bandwidth achieved: 1151.86 GB/s
Memory used: 32.45 GB

Loading model Llama-2-7b-chat-hf
Time to load model: 0.11 seconds
Compilation time: 48.08 seconds
Average tokens/sec: 87.41 tokens/sec
Average bandwidth achieved: 1155.05 GB/s
Memory used: 36.86 GB

Loading model Mixtral-8x7B-v0.1
Time to load model: 0.20 seconds
Using autoquant
Compilation time: 47.32 seconds
Average tokens/sec: 85.90 tokens/sec
Average bandwidth achieved: 1106.37 GB/s
Memory used: 66.81 GB

local test (autoquant v2):

Loading model Mixtral-8x7B-v0.1
Compilation time: 124.40 seconds
Average tokens/sec: 90.41 tokens/sec
Average bandwidth achieved: 1164.47 GB/s
Memory used: 53.91 GB

Loading model Llama-2-7b-chat-hf
TODO
```

gpt_fast_benchmark.csv:
```
name,metric,target,actual,dtype,device,arch,is_model
Llama-2-7b-chat-hf,token_per_sec,144,112.98,int8,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,746.86,int8,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),136,170.24,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,token_per_sec,175,147.92,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,953.06,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,compilation_time(s),133,181.81,int8,cuda,NVIDIA PG509-210,True
gemv,memory_bandwidth(GB/s),870,867.06,int8,cuda,NVIDIA PG509-210,False
gemv,memory_bandwidth(GB/s),990,1092.43,bfloat16,cuda,NVIDIA PG509-210,False
layer_norm,memory_bandwidth(GB/s),950,573.57,bfloat16,cuda,NVIDIA PG509-210,False
Llama-2-7b-chat-hf,token_per_sec,144,87.17,autoquant,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,1151.86,autoquant,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),136,109.31,autoquant,cuda,NVIDIA PG509-210,True
gather_gemv,memory_bandwidth(GB/s),990,945.38,int8,cuda,NVIDIA PG509-210,False
gather_gemv,memory_bandwidth(GB/s),1060,1188.29,bfloat16,cuda,NVIDIA PG509-210,False
mlp_layer_norm_gelu,flops_utilization,0.8,0.82,bfloat16,cuda,NVIDIA PG509-210,False
Llama-2-7b-chat-hf,token_per_sec,94,87.41,bfloat16,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,1155.05,bfloat16,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),133,48.08,bfloat16,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,token_per_sec,175,85.90,autoquant,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,1106.37,autoquant,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,compilation_time(s),133,47.32,autoquant,cuda,NVIDIA PG509-210,True
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140627
Approved by: https://github.com/huydhn