pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Author	SHA1	Message	Date
Huy Do	eb553ae3cf	Fix broken gpt_fast micro benchmark after #144315 (#145235 ) The benchmark is failing with the following error: ``` File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 333, in <module> main(output_file=args.output, only_model=args.only) File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 308, in main lst = func(device) File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 66, in run_mlp_layer_norm_gelu us_per_iter = benchmarker.benchmark(compiled_mod, (x,)) * 1000 File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper return fn(self, *args, **kwargs) TypeError: benchmark() missing 1 required positional argument: 'fn_kwargs' ``` An example failure is https://github.com/pytorch/pytorch/actions/runs/12862761823/job/35858912555 I also assign `oncall: pt2` as the owner of this job going forward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145235 Approved by: https://github.com/nmacchioni	2025-01-21 17:42:24 +00:00
Nicolas Macchioni	4375c2c534	Cleanup gpt_fast benchmark (#144517 ) This is an exact copy of https://github.com/pytorch/pytorch/pull/144484, I bricked the last PR running ghstack land :( Pull Request resolved: https://github.com/pytorch/pytorch/pull/144517 Approved by: https://github.com/davidberard98, https://github.com/huydhn	2025-01-10 05:22:13 +00:00
Yanbo Liang	792e6184c5	[GPT-fast] Support running a specific model or micro-benchmark (#143607 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143607 Approved by: https://github.com/BoyuanFeng, https://github.com/jerryzh168, https://github.com/huydhn	2024-12-20 19:58:07 +00:00
Huy Do	fe68f61c59	Migrate micro benchmark results to benchmark database schema v3 (#141745 ) Similar to https://github.com/pytorch/pytorch/pull/141087, this uploads the micro benchmark results to benchmark database with its new schema v3. The data can then be queried. ~I'm testing with `inductor-micro-benchmark-x86` which should be sufficient because `inductor-micro-benchmark` is broken atm. The CSV output stays for now until the dashboard is migrated to schema v3.~ https://github.com/pytorch/pytorch/issues/141747 has been resolved, so inductor-micro-benchmark should work now Pull Request resolved: https://github.com/pytorch/pytorch/pull/141745 Approved by: https://github.com/yanboliang	2024-12-02 19:45:51 +00:00
Jerry Zhang	a962ae511d	Extend gpt-fast LLM dashboard to support torchao autoquant (#140627 ) Summary: We want to test autoquant on relevant LLM models right now only llama2 and mixtral, but want to extend to more models like https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models Test Plan: ``` Llama-2-7b-chat-hf Mixtral-8x7B-v0.1 gpt-fast int8 112.98 147.92 torchao autoquant 87.41 85.90 torchao autoquantv2 131.12 79.59 ``` https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch in pytorch/benchmarks/gpt_fast ``` python benchmark.py ``` output: ``` Loading model Llama-2-7b-chat-hf Using int8 weight-only quantization! Time to load model: 2.80 seconds Compilation time: 170.24 seconds Average tokens/sec: 112.98 tokens/sec Average bandwidth achieved: 746.86 GB/s Memory used: 7.95 GB Loading model Mixtral-8x7B-v0.1 Using int8 weight-only quantization! Time to load model: 0.24 seconds Compilation time: 181.81 seconds Average tokens/sec: 147.92 tokens/sec Average bandwidth achieved: 953.06 GB/s Memory used: 32.45 GB Loading model Llama-2-7b-chat-hf Time to load model: 0.11 seconds Using autoquant Compilation time: 109.31 seconds Average tokens/sec: 87.17 tokens/sec Average bandwidth achieved: 1151.86 GB/s Memory used: 32.45 GB Loading model Llama-2-7b-chat-hf Time to load model: 0.11 seconds Compilation time: 48.08 seconds Average tokens/sec: 87.41 tokens/sec Average bandwidth achieved: 1155.05 GB/s Memory used: 36.86 GB Loading model Mixtral-8x7B-v0.1 Time to load model: 0.20 seconds Using autoquant Compilation time: 47.32 seconds Average tokens/sec: 85.90 tokens/sec Average bandwidth achieved: 1106.37 GB/s Memory used: 66.81 GB local test (autoquant v2): Loading model Mixtral-8x7B-v0.1 Compilation time: 124.40 seconds Average tokens/sec: 90.41 tokens/sec Average bandwidth achieved: 1164.47 GB/s Memory used: 53.91 GB Loading model Llama-2-7b-chat-hf TODO ``` gpt_fast_benchmark.csv: ``` name,metric,target,actual,dtype,device,arch,is_model 
Llama-2-7b-chat-hf,token_per_sec,144,112.98,int8,cuda,NVIDIA PG509-210,True Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,746.86,int8,cuda,NVIDIA PG509-210,True Llama-2-7b-chat-hf,compilation_time(s),136,170.24,int8,cuda,NVIDIA PG509-210,True Mixtral-8x7B-v0.1,token_per_sec,175,147.92,int8,cuda,NVIDIA PG509-210,True Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,953.06,int8,cuda,NVIDIA PG509-210,True Mixtral-8x7B-v0.1,compilation_time(s),133,181.81,int8,cuda,NVIDIA PG509-210,True gemv,memory_bandwidth(GB/s),870,867.06,int8,cuda,NVIDIA PG509-210,False gemv,memory_bandwidth(GB/s),990,1092.43,bfloat16,cuda,NVIDIA PG509-210,False layer_norm,memory_bandwidth(GB/s),950,573.57,bfloat16,cuda,NVIDIA PG509-210,False Llama-2-7b-chat-hf,token_per_sec,144,87.17,autoquant,cuda,NVIDIA PG509-210,True Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,1151.86,autoquant,cuda,NVIDIA PG509-210,True Llama-2-7b-chat-hf,compilation_time(s),136,109.31,autoquant,cuda,NVIDIA PG509-210,True gather_gemv,memory_bandwidth(GB/s),990,945.38,int8,cuda,NVIDIA PG509-210,False gather_gemv,memory_bandwidth(GB/s),1060,1188.29,bfloat16,cuda,NVIDIA PG509-210,False mlp_layer_norm_gelu,flops_utilization,0.8,0.82,bfloat16,cuda,NVIDIA PG509-210,False Llama-2-7b-chat-hf,token_per_sec,94,87.41,bfloat16,cuda,NVIDIA PG509-210,True Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,1155.05,bfloat16,cuda,NVIDIA PG509-210,True Llama-2-7b-chat-hf,compilation_time(s),133,48.08,bfloat16,cuda,NVIDIA PG509-210,True Mixtral-8x7B-v0.1,token_per_sec,175,85.90,autoquant,cuda,NVIDIA PG509-210,True Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,1106.37,autoquant,cuda,NVIDIA PG509-210,True Mixtral-8x7B-v0.1,compilation_time(s),133,47.32,autoquant,cuda,NVIDIA PG509-210,True ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/140627 Approved by: https://github.com/huydhn	2024-11-27 21:57:48 +00:00
Huy Do	24a223c49d	Run inductor micro benchmark on x86 metal runner (#135042 ) This enables inductor micro benchmark on CPU (x86): * Running on AWS metal runner for more accurate benchmark * I add a new `arch` column, which will be either x86_64 or arm64 for CPU or GPU name for GPU. We can use this later to differentiate between different setups, e.g. cuda (a100) vs cuda (a10g) or cpu (x86_64) vs cpu (arm64) The next step would be to run this on cpu arm64, and cuda (a10g). ### Testing Here are the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180 ``` name,metric,target,actual,dtype,device,arch,is_model mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042 Approved by: https://github.com/yanboliang	2024-09-05 21:31:36 +00:00
Nicolas Macchioni	5cb05a82b4	[BC breaking] move benchmarking + prefer inductor path (#132827 ) move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827 Approved by: https://github.com/eellison	2024-08-08 00:47:45 +00:00
Xuehai Pan	c0ed38e644	[BE][Easy][3/19] enforce style for empty lines in import segments in `benchmarks/` (#129754 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754 Approved by: https://github.com/ezyang	2024-07-17 14:34:42 +00:00
Yanbo Liang	7b5a8424a1	[GPT-fast] Update micro benchmark numbers as A100-50G (#129799 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129799 Approved by: https://github.com/Chillee	2024-06-29 04:36:07 +00:00
Yanbo Liang	9554a9af87	[GPT-benchmark] Distinguish LLM models and micro-benchmarks (#129498 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129498 Approved by: https://github.com/huydhn	2024-06-26 00:25:05 +00:00
Huy Do	9e8443b56f	Remove dtype from gpt-fast micro benchmark experiments model name (#128789 ) Per comments on https://github.com/pytorch/test-infra/pull/5344, we already have a dtype column with the same information Pull Request resolved: https://github.com/pytorch/pytorch/pull/128789 Approved by: https://github.com/yanboliang	2024-06-18 01:26:45 +00:00
Huy Do	f37121bb74	Add model name, quantization and device to gpt_fast micro benchmark output (#128091 ) A small enhancement to https://hud.pytorch.org/benchmark/llms with these columns in the output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128091 Approved by: https://github.com/yanboliang	2024-06-15 01:39:48 +00:00
Yanbo Liang	1fb4effe7a	[GPT-fast benchmark] Add MLP, gather + gemv, gemv micro benchmark (#128002 ) Output example: ``` | name | metric | target | actual | |------------------------------|---------------------------|---------|---------| | layer_norm_bfloat16 | memory_bandwidth(GB/s) | 1017 | 1000.01 | | mlp_layer_norm_gelu_bfloat16 | flops_utilization | 0.71 | 0.71 | | gemv_int8 | memory_bandwidth(GB/s) | 990 | 984.06 | | gemv_bfloat16 | memory_bandwidth(GB/s) | 1137 | 1137.92 | | gather_gemv_int8 | memory_bandwidth(GB/s) | 1113 | 1111.09 | | gather_gemv_bfloat16 | memory_bandwidth(GB/s) | 1249 | 1248.15 | ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128002 Approved by: https://github.com/Chillee	2024-06-14 17:03:22 +00:00
Yanbo Liang	0be06b08fc	[GPT-fast benchmark] Merge GPT-fast and micro benchmark output as one CSV file (#127586 ) Consolidate GPT-fast models benchmark with micro-benchmark, and save output as one CSV file with the same format as https://github.com/pytorch/pytorch/pull/126754#issue-2307296847. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127586 Approved by: https://github.com/Chillee	2024-05-31 18:50:49 +00:00
Xuehai Pan	26f4f10ac8	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980	2024-05-27 14:49:57 +00:00
PyTorch MergeBot	55c0ab2887	Revert "[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 )" This reverts commit 7763c83af67eebfdd5185dbe6ce15ece2b992a0f. Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))	2024-05-27 09:22:08 +00:00
Xuehai Pan	7763c83af6	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980 ghstack dependencies: #127122, #127123, #127124, #127125	2024-05-27 04:22:18 +00:00
Yanbo Liang	a174c536f8	GPT-fast benchmark: adding memory bandwidth and use A100-40GB as target (#125881 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125881 Approved by: https://github.com/Chillee	2024-05-11 10:46:54 +00:00
Huy Do	9dee3ef919	Ingest gpt-fast benchmark results from S3 to Rockset (#125891 ) A follow-up of https://github.com/pytorch/pytorch/pull/125450, this extends the `tools/stats/upload_dynamo_perf_stats.py` script to upload arbitrary benchmark results in CSV format. * Upload gpt-fast benchmarks to a new Rockset collection `benchmarks/oss_ci_benchmark`. The file is in the following format: ``` $ cat test/test-reports/gpt_fast_benchmark.csv name,mode,target,actual,percentage Llama-2-7b-chat-hf,bfloat16,104,104.754128,100.73% ``` * The CSV output needs to be kept in `test/test-reports` directory. * Re-use the existing `.github/workflows/upload-test-stats.yml` workflow ### Testing Run the commands manually ``` (py3.11) huydo@huydo-mbp pytorch % python3 -m tools.stats.upload_artifacts --workflow-run-id 9026179545 --workflow-run-attempt 1 --repo "pytorch/pytorch" Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp6eug3cdz Downloading test-jsons-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp6eug3cdz/test-jsons-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip to s3://gha-artifacts/pytorch/pytorch/9026179545/1/artifact/test-jsons-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Downloading test-reports-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp6eug3cdz/test-reports-runattempt1-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip to s3://gha-artifacts/pytorch/pytorch/9026179545/1/artifact/test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip (py3.11) huydo@huydo-mbp pytorch % python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9026179545 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "ciflow/inductor-micro-benchmark/125891" --rockset-collection 
oss_ci_benchmark --rockset-workspace benchmarks --match-filename "^gpt_fast_benchmark" Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmp8xr4sdxk Downloading test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Extracting test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip to unzipped-test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212 Processing gpt_fast_benchmark from test-reports-test-inductor-micro-benchmark-1-1-linux.gcp.a100_24803987212.zip Writing 3 documents to Rockset Done! ``` Also run a sanity check on ingesting inductor benchmark results: ``` (py3.11) huydo@huydo-mbp pytorch % python -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 8997654356 --workflow-run-attempt 1 --repo pytorch/pytorch --head-branch main --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --match-filename "^inductor_" ... Writing 4904 documents to Rockset Done! ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125891 Approved by: https://github.com/yanboliang	2024-05-11 04:16:36 +00:00
Yanbo Liang	f87fbfdb01	GPT-fast benchmark: remove Embedding layer from model size (#125901 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125901 Approved by: https://github.com/Chillee	2024-05-10 08:18:13 +00:00
Yanbo Liang	8c74162074	Reduce the number of layers for mixtral moe model to adapt CI memory limitation (#125608 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125608 Approved by: https://github.com/Chillee, https://github.com/huydhn	2024-05-06 21:52:25 +00:00
Aaron Gokaslan	1d6c5972c1	[BE]: Optimize min/max/sum comprehensions C419 (#123960 ) Automatic fixes that replace certain list comprehensions with generator expressions where appropriate, so that they are consumed immediately. This is preview functionality in ruff for rule C419 and it was automatically applied. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960 Approved by: https://github.com/malfet	2024-04-12 23:54:15 +00:00
chilli	ed37fbdf60	made gpt_fast benchmark run faster (#122872 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122872 Approved by: https://github.com/msaroufim, https://github.com/yifuwang ghstack dependencies: #122848	2024-03-29 03:49:19 +00:00
Yanbo Liang	43e243180b	Add gpt-fast as a static benchmark (#121886 ) Run: ``` python benchmarks/gpt_fast/benchmark.py ``` It generates a CSV file `gpt_fast_benchmark.csv` with content like: ``` name,mode,target,actual,percentage Llama-2-7b-chat-hf,bfloat16,104,103.458618,99.48% Llama-2-7b-chat-hf,int8,155,158.964615,102.56% Mixtral-8x7B-v0.1,int8,97,99.760132,102.85% ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121886 Approved by: https://github.com/Chillee	2024-03-14 21:46:59 +00:00

24 Commits
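
Several of the commits above quote `gpt_fast_benchmark.csv` with the header `name,metric,target,actual,dtype,device,arch,is_model`. The following is a minimal sketch, not code from the repository, of how such a file could be read and compared against its targets using only the Python standard library. The function name `load_rows`, the `ratio` field, and the inline `SAMPLE` text are illustrative assumptions; the two sample rows are copied from the CSV quoted in #140627.

```python
import csv
import io

# Two rows copied from the CSV quoted in the #140627 commit message.
SAMPLE = """\
name,metric,target,actual,dtype,device,arch,is_model
Llama-2-7b-chat-hf,token_per_sec,144,112.98,int8,cuda,NVIDIA PG509-210,True
gemv,memory_bandwidth(GB/s),870,867.06,int8,cuda,NVIDIA PG509-210,False
"""


def load_rows(text):
    """Parse benchmark CSV text into dicts with typed fields.

    `ratio` (actual / target) is a hypothetical convenience field, not a
    column of the real file.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        target = float(rec["target"])
        actual = float(rec["actual"])
        rows.append(
            {
                **rec,
                "target": target,
                "actual": actual,
                # The file stores booleans as the strings "True"/"False".
                "is_model": rec["is_model"] == "True",
                "ratio": actual / target,
            }
        )
    return rows


if __name__ == "__main__":
    for row in load_rows(SAMPLE):
        kind = "model" if row["is_model"] else "micro-benchmark"
        print(f'{row["name"]} ({kind}) {row["metric"]}: {row["ratio"]:.2%} of target')
```

A ratio above 1 is better for throughput-style metrics such as `token_per_sec` and `memory_bandwidth(GB/s)`, but worse for `compilation_time(s)`, so any pass/fail rule built on it would need to be chosen per metric.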