pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Xuehai Pan	42015db6a9	[BE] fix typos in benchmarks/ (#156077) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156077 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #156069	2025-06-17 13:12:18 +00:00
Mikayla Gawarecki	19207b9183	Allow more backend worker threads with each using a separate cuda stream (#116190) Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolWorker` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized. Ran benchmark for 2-4 workers with `compile=False` (since compile is not thread-safe). Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190 Approved by: https://github.com/albanD ghstack dependencies: #115286, #116187, #116188, #116189	2023-12-20 22:08:29 +00:00
Mikayla Gawarecki	3793ad6a7e	Fix bugs in metrics calculation in inference benchmark and rerun baseline (#116188) Before this PR, each `request_time` was separated by the time for a `torch.randn(...)` to create the fake `data` tensor on CPU. This meant that the gap between `request_times` scaled with the batch_size, so the latency comparisons across batch sizes were inaccurate. In this PR we generate all the fake data outside the loop to avoid this. Other bug fixes: - Only start polling GPU utilization after warmup event is complete - Correct calculation of throughput: previously `(num_batches * batch_size) / sum(response_times)`, should have been `(num_batches * batch_size) / (last_response_time - first_request_time)` - Make sure that response sent back to frontend is on CPU - Use a lock to ensure that writes to `metrics_dict` from `metrics_thread` and `gpu_utilization_thread` are thread-safe Pull Request resolved: https://github.com/pytorch/pytorch/pull/116188 Approved by: https://github.com/albanD ghstack dependencies: #115286, #116187	2023-12-20 22:08:22 +00:00
Mikayla Gawarecki	b0c9ccdc4b	Add standard deviation of metrics over runs to inference benchmark (#113309) Run each `(batch_size, compile)` benchmark 10 times in `./runner.sh` and get mean and standard deviation of metrics in output table. Only report `warmup latency`, `average_latency`, `throughput` and `gpu_util`. Break `output.md` file into a single markdown file per `(batch_size, compile)` configuration. Further runs of `./runner.sh` will append one row to the table in each file for easy comparison. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113309 Approved by: https://github.com/albanD	2023-11-09 18:38:05 +00:00
Mikayla Gawarecki	df149581bc	Tabulate outputs in inference benchmark (#112900) - Fix error where script was always compiling model - Make `runner.sh` parse outputs into nice `.md` format Pull Request resolved: https://github.com/pytorch/pytorch/pull/112900 Approved by: https://github.com/albanD ghstack dependencies: #112582, #112863	2023-11-03 23:53:30 +00:00
Mikayla Gawarecki	c799689437	Refactor inference benchmark and add runner script to do sweep (#112863) - Added `runner.sh` that does a sweep over `batch_size=(1, 32, 64, 128, 256)` and `compile=(True, False)` - Added GPU utilization as a metric - Converted frontend from 2 processes (one putting requests into `request_queue` and one reading from `response_queue` and collecting metrics) to a single process with 3 threads (one putting requests into `request_queue`, one reading from `response_queue` and collecting metrics, and one polling `nvidia-smi` for gpu utilization) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112863 Approved by: https://github.com/albanD ghstack dependencies: #112582	2023-11-03 20:26:43 +00:00
Mikayla Gawarecki	7cbf9869d5	Add v0 inference benchmark script (#112582) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112582 Approved by: https://github.com/albanD	2023-11-02 17:21:15 +00:00

7 Commits
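
The throughput fix described in 3793ad6a7e can be illustrated with a small sketch. This is not the benchmark's code: the function names and the timestamps are hypothetical, and the "previous" formula is read here as dividing by the sum of per-request latencies, which is an assumption about what `sum(response_times)` meant. It only shows why the denominator should be the wall-clock span from the first request to the last response when requests overlap in time.

```python
from typing import Sequence


def throughput_wall_clock(
    batch_size: int,
    request_times: Sequence[float],
    response_times: Sequence[float],
) -> float:
    """Samples per second over the span first request -> last response."""
    num_batches = len(response_times)
    # min/max are used so the result does not depend on list ordering.
    elapsed = max(response_times) - min(request_times)
    if elapsed <= 0:
        raise ValueError("last response must come after first request")
    return (num_batches * batch_size) / elapsed


def throughput_summed_latency(
    batch_size: int,
    request_times: Sequence[float],
    response_times: Sequence[float],
) -> float:
    """Assumed reading of the earlier formula: divide by summed latencies.

    When requests overlap, the same wall-clock interval is counted
    several times, so this underestimates throughput.
    """
    latencies = [resp - req for req, resp in zip(request_times, response_times)]
    return (len(latencies) * batch_size) / sum(latencies)


# Hypothetical timestamps (seconds) for 4 overlapping batches of 32 samples.
requests = [0.0, 0.25, 0.5, 0.75]
responses = [1.0, 1.25, 1.5, 2.0]

corrected = throughput_wall_clock(32, requests, responses)
previous = throughput_summed_latency(32, requests, responses)
print(f"corrected: {corrected:.1f} samples/s, previous: {previous:.1f} samples/s")
```

With these made-up timestamps all four batches finish within 2.0 seconds of wall-clock time, while the per-request latencies sum to 4.25 seconds, so the summed-latency version reports less than half the wall-clock throughput.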