Allow more backend worker threads with each using a separate cuda stream (#116190)
Added a `--num_workers` option to `server.py` that allows more than one worker in the `ThreadPoolExecutor` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized.

Ran the benchmark for 2-4 workers with `compile=False` (since compile is not thread-safe).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188, #116189
committed by PyTorch MergeBot
parent 0dd64174bd
commit 19207b9183
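For context, a minimal self-contained sketch of the pattern this commit introduces: one `cuda.Stream` per worker thread, created by a `ThreadPoolExecutor` initializer. Names like `run_inference` are illustrative assumptions, not the benchmark's actual code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import torch

# One CUDA stream per worker thread, keyed by native thread id.
stream_map = {}

def worker_initializer():
    # Runs once in each worker thread when the pool spins it up.
    stream_map[threading.get_native_id()] = torch.cuda.Stream()

def run_inference(model, batch):
    # Kernels launched here land on this worker's private stream, so
    # predictions from different workers can overlap on the GPU.
    stream = stream_map[threading.get_native_id()]
    with torch.cuda.stream(stream), torch.no_grad():
        return model(batch)

pool = ThreadPoolExecutor(max_workers=4, initializer=worker_initializer)
```

Submitting work with `pool.submit(run_inference, model, batch)` then keeps each prediction on its own stream.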
@ -32,3 +32,18 @@ Semaphores are used in conjunction with `cuda.Event`s to ensure proper synchronization
* Average latency decreased
* Throughput increased
* GPU utilization increased

### [#116190](https://github.com/pytorch/pytorch/pull/116190)

* Added a `--num_workers` option to `server.py` that allows more than one worker in the `ThreadPoolExecutor` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized.

##### Results:

Benchmarks were only run for `compile=False` since `torch.compile()` is not thread-safe. Benchmarks were run with `num_workers={2, 3, 4}`.

For the 2 worker case:

* All metrics improved compared to the single worker case across all batch sizes.
* For batch sizes 1, 32 and 64, the metrics were still slightly worse than the baseline.
* For batch sizes 128 and 256, all metrics beat the baseline (e.g. ~300 samples/sec increase in throughput, ~5s decrease in average latency and ~2s decrease in warmup latency for bs=256).

![Average latency](src/avg_latency_plot.png)

![Throughput](src/throughput_plot.png)
@ -28,6 +28,7 @@ The togglable command line arguments to the script are as follows:
- `compile` (default: compile): or `--no-compile`; whether to `torch.compile()` the model
- `output_file` (default: output.csv): The name of the csv file to write the outputs to in the `results/` directory.
- `num_workers` (default: 4): The `max_workers` passed to the `ThreadPoolExecutor` in charge of model prediction

e.g. A sample command to run the benchmark
@ -47,5 +48,5 @@ The script `runner.sh` will run a sweep of the benchmark over different batch
sizes with compile on and off and collect the mean and standard deviation of warmup latency,
average latency, throughput and GPU utilization for each. The `results/` directory will contain the metrics
from running a sweep as we develop this benchmark where `results/output_{batch_size}_{compile}.md`
- will contain the mean and standard deviation of results for a given batch size and compile setting,
- if the file already exists, the metrics form the run will be appended as a new row in the markdown table.
+ will contain the mean and standard deviation of results for a given batch size and compile setting.
+ If the file already exists, the metrics from the run will be appended as a new row in the markdown table.
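To illustrate that append behavior, a sketch in Python (hypothetical helper; the real logic lives in `runner.sh` and the benchmark scripts):

```python
import os

def append_markdown_row(path, name, metrics):
    """Append one 'mean +/- std' row, writing a header if the file is new.

    metrics: list of (mean, std) tuples in column order.
    """
    is_new = not os.path.exists(path)
    with open(path, "a") as f:
        if is_new:
            f.write("| name | warmup latency (s) | avg latency (s) "
                    "| throughput (samples/s) | GPU util (%) |\n")
            f.write("| --- | --- | --- | --- | --- |\n")
        cells = " | ".join(f"{mean:.3f} +/- {std:.3f}" for mean, std in metrics)
        f.write(f"| {name} | {cells} |\n")
```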
@ -4,3 +4,6 @@
| name | warmup latency (s) | avg latency (s) | throughput (samples/s) | GPU util (%) |
| ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
| original | 5.355 +/- 0.293 | 14.172 +/- 0.267 | 518.884 +/- 7.854 | 56.108 +/- 0.862 |
| h2d_d2h_threads | 3.810 +/- 0.319 | 14.146 +/- 1.145 | 551.909 +/- 34.079 | 52.057 +/- 4.121 |
| 2_predict_workers | 3.639 +/- 0.037 | 11.161 +/- 0.160 | 636.701 +/- 14.753 | 53.279 +/- 2.659 |
| 3_predict_workers | 4.930 +/- 0.060 | 10.532 +/- 0.801 | 677.736 +/- 25.115 | 53.806 +/- 1.324 |
| 4_predict_workers | 3.819 +/- 0.253 | 11.451 +/- 0.439 | 638.146 +/- 22.611 | 50.129 +/- 1.764 |
@ -4,3 +4,6 @@
| name | warmup latency (s) | avg latency (s) | throughput (samples/s) | GPU util (%) |
| ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
| original | 5.497 +/- 0.681 | 0.383 +/- 0.007 | 168.818 +/- 2.169 | 12.574 +/- 2.205 |
| h2d_d2h_threads | 4.039 +/- 1.175 | 0.593 +/- 0.246 | 127.449 +/- 34.358 | 11.445 +/- 1.925 |
| 2_predict_workers | 3.766 +/- 0.610 | 0.369 +/- 0.010 | 186.485 +/- 2.657 | 11.857 +/- 2.591 |
| 3_predict_workers | 3.460 +/- 0.063 | 0.407 +/- 0.033 | 168.863 +/- 9.227 | 12.427 +/- 1.430 |
| 4_predict_workers | 4.184 +/- 0.636 | 0.692 +/- 0.182 | 110.231 +/- 31.420 | 8.490 +/- 1.826 |
@ -4,3 +4,6 @@
| name | warmup latency (s) | avg latency (s) | throughput (samples/s) | GPU util (%) |
| ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
| original | 6.424 +/- 1.141 | 26.361 +/- 0.557 | 573.027 +/- 9.661 | 64.000 +/- 3.405 |
| h2d_d2h_threads | 4.600 +/- 0.724 | 21.314 +/- 0.403 | 704.344 +/- 9.843 | 71.963 +/- 1.558 |
| 2_predict_workers | 4.199 +/- 0.363 | 16.772 +/- 1.435 | 864.678 +/- 32.353 | 70.026 +/- 1.403 |
| 3_predict_workers | 4.496 +/- 0.755 | 15.983 +/- 0.455 | 912.386 +/- 18.299 | 68.283 +/- 2.226 |
| 4_predict_workers | 4.252 +/- 0.515 | 14.702 +/- 0.259 | 951.261 +/- 7.986 | 70.716 +/- 2.774 |
@ -4,3 +4,6 @@
| name | warmup latency (s) | avg latency (s) | throughput (samples/s) | GPU util (%) |
| ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
| original | 5.680 +/- 0.919 | 4.785 +/- 0.864 | 394.178 +/- 81.705 | 38.515 +/- 11.152 |
| h2d_d2h_threads | 4.856 +/- 0.142 | 6.694 +/- 0.497 | 287.201 +/- 41.480 | 27.028 +/- 4.773 |
| 2_predict_workers | 3.465 +/- 0.082 | 5.369 +/- 0.900 | 334.981 +/- 50.292 | 31.635 +/- 4.492 |
| 3_predict_workers | 3.819 +/- 0.617 | 4.409 +/- 0.149 | 402.236 +/- 22.151 | 35.893 +/- 0.877 |
| 4_predict_workers | 3.994 +/- 0.509 | 6.007 +/- 0.408 | 296.260 +/- 16.524 | 25.751 +/- 1.810 |
@ -4,3 +4,6 @@
| name | warmup latency (s) | avg latency (s) | throughput (samples/s) | GPU util (%) |
| ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
| original | 5.823 +/- 0.541 | 8.024 +/- 0.508 | 460.474 +/- 29.896 | 47.112 +/- 6.576 |
| h2d_d2h_threads | 3.971 +/- 0.709 | 9.583 +/- 1.030 | 393.394 +/- 26.110 | 37.550 +/- 1.381 |
| 2_predict_workers | 3.535 +/- 0.124 | 8.258 +/- 1.188 | 441.640 +/- 85.705 | 38.630 +/- 3.499 |
| 3_predict_workers | 3.724 +/- 0.312 | 7.614 +/- 0.951 | 467.236 +/- 36.619 | 37.834 +/- 2.418 |
| 4_predict_workers | 4.346 +/- 0.744 | 8.417 +/- 0.541 | 430.672 +/- 17.979 | 34.461 +/- 3.257 |
@ -164,6 +164,7 @@ class BackendWorker:
        response_queue,
        read_requests_event,
        batch_size,
        num_workers,
        model_dir=".",
        compile_model=True,
    ):
@ -174,11 +175,14 @@ class BackendWorker:
        self.response_queue = response_queue
        self.read_requests_event = read_requests_event
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.model_dir = model_dir
        self.compile_model = compile_model
        self._setup_complete = False
        self.h2d_stream = torch.cuda.Stream()
        self.d2h_stream = torch.cuda.Stream()
        # maps thread_id to the cuda.Stream associated with that worker thread
        self.stream_map = dict()

    def _setup(self):
        import time
@ -221,7 +225,8 @@ class BackendWorker:
    ):
        # copy_sem makes sure copy_event has been recorded in the data copying thread
        copy_sem.acquire()
-        torch.cuda.current_stream().wait_event(copy_event)
+        self.stream_map[threading.get_native_id()].wait_event(copy_event)
        with torch.cuda.stream(self.stream_map[threading.get_native_id()]):
            with torch.no_grad():
                response_list.append(model(input_buffer))
                compute_event.record()
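The `-`/`+` pair above moves the event wait from the default stream onto the calling worker's own stream. To make the synchronization shape concrete, here is a reduced sketch of the semaphore-plus-`cuda.Event` handoff between the copy thread and a compute worker (hypothetical tensors and names; pinned host memory assumed for a truly asynchronous copy):

```python
import threading

import torch

copy_sem = threading.Semaphore(0)
copy_event = torch.cuda.Event()
h2d_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
gpu_buffer = torch.empty(8, device="cuda")

def copy_thread(host_batch):
    # Stage the host-to-device copy on a dedicated stream, record the
    # event on that stream, then signal that the event now exists.
    with torch.cuda.stream(h2d_stream):
        gpu_buffer.copy_(host_batch, non_blocking=True)
        copy_event.record()
    copy_sem.release()

def compute_worker(model):
    # The semaphore guarantees copy_event has been recorded before we
    # wait on it; wait_event orders this stream after the copy without
    # blocking the host thread.
    copy_sem.acquire()
    compute_stream.wait_event(copy_event)
    with torch.cuda.stream(compute_stream), torch.no_grad():
        return model(gpu_buffer)
```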
@ -243,7 +248,12 @@ class BackendWorker:
        self.response_queue.put((response_list[0].cpu(), request_time))

    async def run(self):
-        worker_pool = ThreadPoolExecutor(max_workers=1)
+        def worker_initializer():
+            self.stream_map[threading.get_native_id()] = torch.cuda.Stream()
+
+        worker_pool = ThreadPoolExecutor(
+            max_workers=self.num_workers, initializer=worker_initializer
+        )
        h2d_pool = ThreadPoolExecutor(max_workers=1)
        d2h_pool = ThreadPoolExecutor(max_workers=1)
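A private stream per worker matters because kernels issued on distinct streams may overlap on the device, whereas a single shared stream serializes them. A toy illustration (hypothetical sizes, not benchmark code):

```python
import torch

streams = [torch.cuda.Stream() for _ in range(2)]
inputs = [torch.randn(4096, 4096, device="cuda") for _ in streams]
torch.cuda.synchronize()  # default-stream allocations are now safe to reuse

# Each matmul is issued on its own stream, so the GPU may run them
# concurrently; on one stream they would execute back to back.
for stream, x in zip(streams, inputs):
    with torch.cuda.stream(stream):
        x @ x

torch.cuda.synchronize()
```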
@ -314,6 +324,7 @@ if __name__ == "__main__":
    parser.add_argument(
        "--profile", default=False, action=argparse.BooleanOptionalAction
    )
    parser.add_argument("--num_workers", type=int, default=4)
    args = parser.parse_args()

    downloaded_checkpoint = False

@ -354,6 +365,7 @@ if __name__ == "__main__":
        response_queue,
        read_requests_event,
        args.batch_size,
        args.num_workers,
        args.model_dir,
        args.compile,
    )
BIN benchmarks/inference/src/avg_latency_plot.png (new file, 21 KiB, binary file not shown)
BIN benchmarks/inference/src/throughput_plot.png (new file, 22 KiB, binary file not shown)