Commit Graph

10 Commits

Author SHA1 Message Date
7763c83af6 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Apart from `pyproject.toml`, all changes were generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
2024-05-27 04:22:18 +00:00
19207b9183 Allow more backend worker threads with each using a separate cuda stream (#116190)
Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolWorker` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized.

Ran the benchmark for 2-4 workers with `compile=False` (since compile is not thread-safe).
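A minimal, CUDA-free sketch of the per-worker-stream pattern (a placeholder object stands in where the PR would create `torch.cuda.Stream()`, so the threading structure is runnable anywhere):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Thread-local storage: each worker thread gets its own "stream" handle.
_local = threading.local()

def _init_worker():
    # Runs once per worker thread when the pool spins it up.
    _local.stream = object()  # stand-in for torch.cuda.Stream()

def predict(batch):
    # Each worker launches its kernels on its own stream, so work from
    # different workers can overlap on the GPU.
    stream = _local.stream  # with torch: `with torch.cuda.stream(stream): ...`
    return ("result", id(stream))

pool = ThreadPoolExecutor(max_workers=4, initializer=_init_worker)
results = list(pool.map(predict, range(8)))
pool.shutdown()
```

The `initializer` hook creates the stream exactly once per thread, matching the PR's description of a stream created when the worker thread is initialized.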

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188, #116189
2023-12-20 22:08:29 +00:00
0dd64174bd Do H2D/D2H of input/result on separate threads/cuda.Streams (#116189)
Added two `ThreadPoolExecutor`s with 1 worker each for D2H and H2D copies. Each uses its own `cuda.Stream`. The purpose is to try to overlap D2H and H2D with compute and allow the worker handling prediction to launch compute kernels without being blocked by D2H/H2D.
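A rough sketch of the dedicated-copy-executor idea, with plain functions standing in for the actual tensor copies and model forward (the real PR pairs each single-worker executor with its own `torch.cuda.Stream`):

```python
from concurrent.futures import ThreadPoolExecutor

# One single-worker executor per copy direction, so H2D and D2H copies
# queue on their own threads instead of blocking the prediction worker.
h2d_pool = ThreadPoolExecutor(max_workers=1)      # host -> device copies
d2h_pool = ThreadPoolExecutor(max_workers=1)      # device -> host copies
compute_pool = ThreadPoolExecutor(max_workers=1)  # model prediction

def copy_to_device(batch):  # stand-in for tensor.to("cuda", non_blocking=True)
    return ("on_device", batch)

def predict(device_batch):  # stand-in for the model forward pass
    return ("prediction", device_batch[1])

def copy_to_host(result):   # stand-in for tensor.cpu()
    return ("on_host", result[1])

def serve(batch):
    # Chain the stages; each runs on its dedicated thread, so copies for
    # one request can overlap with compute for another.
    dev = h2d_pool.submit(copy_to_device, batch).result()
    out = compute_pool.submit(predict, dev).result()
    return d2h_pool.submit(copy_to_host, out).result()

responses = [serve(b) for b in range(3)]
```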

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116189
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188
2023-12-20 22:08:29 +00:00
3793ad6a7e Fix bugs in metrics calculation in inference benchmark and rerun baseline (#116188)
Before this PR, each `request_time` was separated by the time for a `torch.randn(...)` to create the fake `data` tensor on CPU. This meant that the gap between `request_times` **scaled with the batch_size**. So the latency comparisons across batch sizes were inaccurate. In this PR we generate all the fake data outside the loop to avoid this.

Other bug fixes:
- Only start polling GPU utilization after warmup event is complete
- Correct calculation of throughput: previously `(num_batches * batch_size) / sum(response_times)`, should have been `(num_batches * batch_size) / (last_response_time - first_request_time)`
- Make sure that response sent back to frontend is on CPU
- Use a lock to ensure that `metrics_thread` and `gpu_utilization_thread` write to `metrics_dict` in a thread-safe manner
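The throughput fix above can be sketched as a small helper (names are illustrative, not the benchmark's actual identifiers):

```python
def throughput(num_batches, batch_size, request_times, response_times):
    # The buggy version divided by sum(response_times), which over-counts
    # time when batches are in flight concurrently. The fix measures
    # wall-clock time from the first request to the last response.
    elapsed = max(response_times) - min(request_times)
    return (num_batches * batch_size) / elapsed

# Example: 4 batches of 32, first request at t=0s, last response at t=2s
rate = throughput(4, 32, [0.0, 0.5, 1.0, 1.5], [0.6, 1.1, 1.6, 2.0])
print(rate)  # 64.0 items/s
```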

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116188
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187
2023-12-20 22:08:22 +00:00
75a4b10d56 [easy] Add option for profiling backend in inference benchmark (#116187)
Some misc fixes; also added an option to set an experiment name that is added to the result table

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116187
Approved by: https://github.com/albanD
ghstack dependencies: #115286
2023-12-20 22:08:11 +00:00
31f21e033e Run inference in an Executor (#115286)
Experiment: run model predictions in the backend in a `ThreadPoolExecutor` so that each model prediction does not block reading requests from the queue.

The baseline is reset in the PR above, which fixes several bugs in the metrics calculations, but I kept the metrics here anyway.
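A minimal sketch of the non-blocking dispatch loop (a toy `predict` and a `None` sentinel are assumed for illustration):

```python
import queue
from concurrent.futures import ThreadPoolExecutor

request_queue = queue.Queue()

def predict(data):
    return data * 2  # stand-in for model(data)

# Fill the queue, then a sentinel to stop the dispatcher.
for i in range(5):
    request_queue.put(i)
request_queue.put(None)

pool = ThreadPoolExecutor(max_workers=2)
futures = []
while True:
    req = request_queue.get()
    if req is None:
        break
    # submit() returns immediately, so the loop goes straight back to
    # the queue instead of blocking on the prediction.
    futures.append(pool.submit(predict, req))

responses = [f.result() for f in futures]
pool.shutdown()
```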

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115286
Approved by: https://github.com/albanD
2023-12-20 22:08:02 +00:00
b0c9ccdc4b Add standard deviation of metrics over runs to inference benchmark (#113309)
Run each `(batch_size, compile)` benchmark 10 times in `./runner.sh` and report the mean and standard deviation of each metric in the output table

Only report `warmup latency`, `average_latency`, `throughput` and `gpu_util`

Break the `output.md` file into one markdown file per `(batch_size, compile)` configuration. Further runs of `./runner.sh` will append one row to the table in each file for easy comparison
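The aggregation step can be sketched with the standard library's `statistics` module (the run values below are made up for illustration):

```python
import statistics

# Several benchmark runs of one (batch_size, compile) configuration;
# each metric is reported as "mean +/- stdev" in the per-config file.
runs = {
    "average_latency_ms": [12.1, 11.8, 12.4, 12.0, 11.9],
    "throughput": [260.0, 265.0, 255.0, 262.0, 258.0],
}
summary = {
    name: (statistics.mean(vals), statistics.stdev(vals))
    for name, vals in runs.items()
}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.2f} +/- {std:.2f}")
```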

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113309
Approved by: https://github.com/albanD
2023-11-09 18:38:05 +00:00
df149581bc Tabulate outputs in inference benchmark (#112900)
- Fix an error where the script was always compiling the model
- Make `runner.sh` parse outputs into a nice `.md` format

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112900
Approved by: https://github.com/albanD
ghstack dependencies: #112582, #112863
2023-11-03 23:53:30 +00:00
c799689437 Refactor inference benchmark and add runner script to do sweep (#112863)
- Added `runner.sh` that does a sweep over `batch_size=(1, 32, 64, 128, 256)` and `compile=(True, False)`
- Added GPU utilization as a metric
- Converted the frontend from 2 processes (one putting requests into `request_queue`, one reading from `response_queue` and collecting metrics) to a single process with 3 threads (one putting requests into `request_queue`, one reading from `response_queue` and collecting metrics, and one polling `nvidia-smi` for GPU utilization)
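The three-thread frontend can be sketched as below; a trivial in-process echo stands in for the backend, and a constant stands in for the `nvidia-smi` poll, so the structure is runnable without a GPU:

```python
import queue
import threading

request_queue = queue.Queue()
response_queue = queue.Queue()
stop = threading.Event()
metrics = {"responses": [], "gpu_util": []}

def send_requests():
    for i in range(5):
        request_queue.put(i)

def collect_responses():
    for _ in range(5):
        metrics["responses"].append(response_queue.get())

def poll_gpu():
    # Stand-in for shelling out to `nvidia-smi` on an interval.
    while True:
        metrics["gpu_util"].append(0.0)
        if stop.wait(0.01):
            break

threads = [threading.Thread(target=f)
           for f in (send_requests, collect_responses, poll_gpu)]
for t in threads:
    t.start()

# Toy "backend": echo each request back so the collector can finish.
for _ in range(5):
    response_queue.put(request_queue.get())

stop.set()
for t in threads:
    t.join()
```

All three threads share one process, so metrics live in one dict instead of being shuttled between processes.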

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112863
Approved by: https://github.com/albanD
ghstack dependencies: #112582
2023-11-03 20:26:43 +00:00
7cbf9869d5 Add v0 inference benchmark script (#112582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112582
Approved by: https://github.com/albanD
2023-11-02 17:21:15 +00:00