## Inference benchmarks

This folder contains a work-in-progress simulation of a Python inference server.

The v0 version of this has a backend worker that is a single process. It loads a
ResNet-18 checkpoint to 'cuda:0' and compiles the model. It accepts requests in
the form of (tensor, request_time) from a `multiprocessing.Queue`, runs
inference on the request, and returns (output, request_time) via a separate
response `multiprocessing.Queue`.
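
A minimal sketch of what such a backend worker might look like; the checkpoint
filename, the shutdown sentinel, and the use of torchvision's ResNet-18
constructor are assumptions for illustration, not the actual `server.py`
implementation:

```python
import torch
import torchvision

def backend_worker(request_queue, response_queue, model_dir=".", compile_model=True):
    # Load the checkpoint onto the GPU (the checkpoint filename is assumed).
    model = torchvision.models.resnet18()
    model.load_state_dict(torch.load(f"{model_dir}/resnet18.pt"))
    model = model.to("cuda:0").eval()
    if compile_model:
        model.compile()  # returns fast; real compilation happens on the first call

    with torch.no_grad():
        while True:
            item = request_queue.get()
            if item is None:  # assumed shutdown sentinel
                break
            tensor, request_time = item
            output = model(tensor.to("cuda:0"))
            response_queue.put((output.cpu(), request_time))
```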

The frontend worker is a process with three threads (sketched after this list):

1. A thread that generates fake data of a given batch size in the form of CPU
   tensors and puts the data into the request queue.
2. A thread that reads responses from the response queue and collects metrics:
   the latency of the first response (which corresponds to the cold start time),
   the average, minimum, and maximum response latency, and the throughput.
3. A thread that polls `nvidia-smi` for GPU utilization metrics.
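
A rough sketch of those three threads; the input shape, the polling interval,
and the exact metric bookkeeping are assumptions:

```python
import subprocess
import time

import torch

def request_thread(request_queue, num_iters, batch_size):
    # Generate fake CPU tensors and enqueue them with their send timestamp.
    for _ in range(num_iters):
        data = torch.randn(batch_size, 3, 224, 224)  # assumed input shape
        request_queue.put((data, time.time()))

def response_thread(response_queue, num_iters, latencies):
    # Record per-request latency; the first entry is the cold-start latency.
    for _ in range(num_iters):
        _, request_time = response_queue.get()
        latencies.append(time.time() - request_time)

def gpu_poll_thread(stop_event, samples, interval=0.1):
    # Poll nvidia-smi for GPU utilization until asked to stop.
    while not stop_event.is_set():
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"]
        )
        samples.append(int(out.decode().strip().splitlines()[0]))
        time.sleep(interval)
```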

For now we omit data preprocessing as well as result post-processing.

### Running a single benchmark

The configurable command line arguments to the script are as follows (see the
parser sketch after this list):

- `num_iters` (default: 100): how many requests to send to the backend,
  excluding the first warmup request
- `batch_size` (default: 32): the batch size of the requests
- `model_dir` (default: '.'): the directory to load the checkpoint from
- `compile` (default: on): whether to `torch.compile()` the model; pass
  `--no-compile` to disable it
- `output_file` (default: output.csv): the name of the csv file to write the outputs to in the `results/` directory
- `num_workers` (default: 2): the `max_workers` passed to the `ThreadPoolExecutor` in charge of model prediction
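
A minimal sketch of how this flag set might be declared with `argparse`; the
names mirror the list above, but help strings and the exact parser layout are
assumptions:

```python
import argparse

parser = argparse.ArgumentParser(description="Inference server benchmark")
parser.add_argument("--num_iters", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--model_dir", type=str, default=".")
# BooleanOptionalAction (Python 3.9+) gives the paired --compile/--no-compile flags.
parser.add_argument("--compile", default=True, action=argparse.BooleanOptionalAction)
parser.add_argument("--output_file", type=str, default="output.csv")
parser.add_argument("--num_workers", type=int, default=2)
args = parser.parse_args()
```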

For example, a sample command to run the benchmark:

```
python -W ignore server.py --num_iters 1000 --batch_size 32
```

The results will be found in `results/output.csv`, which will be appended to if the file already exists.
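
The append-if-exists behavior can be implemented along the following lines; the
helper name and the column names are hypothetical:

```python
import csv
import os

def append_row(path, row):
    # Write the header only when creating the file; otherwise just append.
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(row)

append_row("results/output.csv", {"batch_size": 32, "avg_latency_s": 0.021})
```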

Note that the `m.compile()` time in the csv file is not the time for the model to be compiled,
which happens during the first iteration, but rather the time for PT2 components
to be lazily imported (e.g. triton).
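
In other words, the measurement splits roughly like this (a sketch; the
stand-in model is an assumption):

```python
import time

import torch

model = torch.nn.Linear(8, 8).to("cuda:0")  # stand-in for the ResNet-18
example_input = torch.randn(1, 8, device="cuda:0")

t0 = time.time()
model.compile()  # returns quickly; cost is mostly lazy imports (e.g. triton)
compile_call_time = time.time() - t0  # what the csv reports as m.compile() time

t0 = time.time()
model(example_input)  # the first call triggers the actual compilation
first_iter_time = time.time() - t0  # includes the real compile work
```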

### Running a sweep

The script `runner.sh` runs a sweep of the benchmark over different batch
sizes, with compile on and off, and collects the mean and standard deviation of
the warmup latency, average latency, throughput, and GPU utilization for each
configuration. The `results/` directory will contain the metrics from running a
sweep as we develop this benchmark, where `results/output_{batch_size}_{compile}.md`
will contain the mean and standard deviation of results for a given batch size
and compile setting. If the file already exists, the metrics from the run will
be appended as a new row in the markdown table.
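
`runner.sh` itself is a shell script; for illustration, an equivalent sweep
loop might look like the following in Python (the batch sizes and the output
file naming are assumptions):

```python
import subprocess

# Hypothetical sweep values; runner.sh defines the actual ones.
for batch_size in (32, 64, 128, 256):
    for compile_flag in ("--compile", "--no-compile"):
        label = compile_flag.lstrip("-")
        subprocess.run(
            ["python", "-W", "ignore", "server.py",
             "--num_iters", "1000",
             "--batch_size", str(batch_size),
             "--output_file", f"output_{batch_size}_{label}.csv",
             compile_flag],
            check=True,
        )
```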