Allow more backend worker threads with each using a separate cuda stream (#116190)

Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolExecutor` used for model predictions. Each worker uses its own `cuda.Stream()`, created when the worker thread is initialized.

Ran the benchmark with 2-4 workers and `compile=False` (since `torch.compile()` is not thread-safe).
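The core pattern, distilled from the `server.py` diff below (names simplified; `run_on_worker_stream` is an illustrative helper, not a function in the PR):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import torch

# One CUDA stream per worker thread, created when the thread starts.
stream_map = {}

def worker_initializer():
    stream_map[threading.get_native_id()] = torch.cuda.Stream()

worker_pool = ThreadPoolExecutor(max_workers=4, initializer=worker_initializer)

def run_on_worker_stream(model, inp):
    # Run the forward pass on this thread's dedicated stream so that
    # predictions from different workers can overlap on the GPU.
    with torch.cuda.stream(stream_map[threading.get_native_id()]):
        with torch.no_grad():
            return model(inp)
```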

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188, #116189
Author: Mikayla Gawarecki
Date: 2023-12-20 10:58:16 -08:00
Committed by: PyTorch MergeBot
Parent: 0dd64174bd
Commit: 19207b9183
10 changed files with 51 additions and 8 deletions

View File

@@ -32,3 +32,18 @@ Semaphores are used in conjunction with `cuda.Event`s to ensure proper synchronization
 * Average latency decreased
 * Throughput increased
 * GPU utilization increased
+
+### [#116190](https://github.com/pytorch/pytorch/pull/116190)
+* Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolExecutor` used for model predictions. Each worker uses its own `cuda.Stream()`, created when the worker thread is initialized.
+
+##### Results:
+Benchmarks were only run for `compile=False`, since `torch.compile()` is not thread-safe. Benchmarks were run with `num_workers={2, 3, 4}`.
+
+For the 2-worker case:
+* All metrics improved over the single-worker case across all batch sizes.
+* For batch sizes 1, 32, and 64, the metrics were still slightly worse than the baseline.
+* For batch sizes 128 and 256, all metrics beat the baseline (e.g. for bs=256, throughput increased by ~300 samples/sec, average latency decreased by ~5 s, and warmup latency decreased by ~2 s).
+
+![Throughput against batch size](./src/throughput_plot.png)
+![Avg latency against batch size](./src/avg_latency_plot.png)

View File

@@ -28,6 +28,7 @@ The toggleable command line arguments to the script are as follows:
 - `compile` (default: compile): or `--no-compile` whether to `torch.compile()`
   the model
 - `output_file` (default: output.csv): The name of the csv file to write the outputs to in the `results/` directory.
+- `num_workers` (default: 4): The `max_workers` passed to the `ThreadPoolExecutor` in charge of model prediction.

 e.g. A sample command to run the benchmark
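For illustration (the sample command itself is not part of this diff), a run combining the flags documented above might look like `python server.py --num_workers 2 --no-compile`.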
@@ -47,5 +48,5 @@ The script `runner.sh` will run a sweep of the benchmark over different batch
 sizes with compile on and off and collect the mean and standard deviation of warmup latency,
 average latency, throughput and GPU utilization for each. The `results/` directory will contain the metrics
 from running a sweep as we develop this benchmark where `results/output_{batch_size}_{compile}.md`
-will contain the mean and standard deviation of results for a given batch size and compile setting,
-if the file already exists, the metrics form the run will be appended as a new row in the markdown table.
+will contain the mean and standard deviation of results for a given batch size and compile setting.
+If the file already exists, the metrics from the run will be appended as a new row in the markdown table.

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 5.355 +/- 0.293 | 14.172 +/- 0.267 | 518.884 +/- 7.854 | 56.108 +/- 0.862 |
 | h2d_d2h_threads | 3.810 +/- 0.319 | 14.146 +/- 1.145 | 551.909 +/- 34.079 | 52.057 +/- 4.121 |
+| 2_predict_workers | 3.639 +/- 0.037 | 11.161 +/- 0.160 | 636.701 +/- 14.753 | 53.279 +/- 2.659 |
+| 3_predict_workers | 4.930 +/- 0.060 | 10.532 +/- 0.801 | 677.736 +/- 25.115 | 53.806 +/- 1.324 |
+| 4_predict_workers | 3.819 +/- 0.253 | 11.451 +/- 0.439 | 638.146 +/- 22.611 | 50.129 +/- 1.764 |

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 5.497 +/- 0.681 | 0.383 +/- 0.007 | 168.818 +/- 2.169 | 12.574 +/- 2.205 |
 | h2d_d2h_threads | 4.039 +/- 1.175 | 0.593 +/- 0.246 | 127.449 +/- 34.358 | 11.445 +/- 1.925 |
+| 2_predict_workers | 3.766 +/- 0.610 | 0.369 +/- 0.010 | 186.485 +/- 2.657 | 11.857 +/- 2.591 |
+| 3_predict_workers | 3.460 +/- 0.063 | 0.407 +/- 0.033 | 168.863 +/- 9.227 | 12.427 +/- 1.430 |
+| 4_predict_workers | 4.184 +/- 0.636 | 0.692 +/- 0.182 | 110.231 +/- 31.420 | 8.490 +/- 1.826 |

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 6.424 +/- 1.141 | 26.361 +/- 0.557 | 573.027 +/- 9.661 | 64.000 +/- 3.405 |
 | h2d_d2h_threads | 4.600 +/- 0.724 | 21.314 +/- 0.403 | 704.344 +/- 9.843 | 71.963 +/- 1.558 |
+| 2_predict_workers | 4.199 +/- 0.363 | 16.772 +/- 1.435 | 864.678 +/- 32.353 | 70.026 +/- 1.403 |
+| 3_predict_workers | 4.496 +/- 0.755 | 15.983 +/- 0.455 | 912.386 +/- 18.299 | 68.283 +/- 2.226 |
+| 4_predict_workers | 4.252 +/- 0.515 | 14.702 +/- 0.259 | 951.261 +/- 7.986 | 70.716 +/- 2.774 |

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 5.680 +/- 0.919 | 4.785 +/- 0.864 | 394.178 +/- 81.705 | 38.515 +/- 11.152 |
 | h2d_d2h_threads | 4.856 +/- 0.142 | 6.694 +/- 0.497 | 287.201 +/- 41.480 | 27.028 +/- 4.773 |
+| 2_predict_workers | 3.465 +/- 0.082 | 5.369 +/- 0.900 | 334.981 +/- 50.292 | 31.635 +/- 4.492 |
+| 3_predict_workers | 3.819 +/- 0.617 | 4.409 +/- 0.149 | 402.236 +/- 22.151 | 35.893 +/- 0.877 |
+| 4_predict_workers | 3.994 +/- 0.509 | 6.007 +/- 0.408 | 296.260 +/- 16.524 | 25.751 +/- 1.810 |

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 5.823 +/- 0.541 | 8.024 +/- 0.508 | 460.474 +/- 29.896 | 47.112 +/- 6.576 |
 | h2d_d2h_threads | 3.971 +/- 0.709 | 9.583 +/- 1.030 | 393.394 +/- 26.110 | 37.550 +/- 1.381 |
+| 2_predict_workers | 3.535 +/- 0.124 | 8.258 +/- 1.188 | 441.640 +/- 85.705 | 38.630 +/- 3.499 |
+| 3_predict_workers | 3.724 +/- 0.312 | 7.614 +/- 0.951 | 467.236 +/- 36.619 | 37.834 +/- 2.418 |
+| 4_predict_workers | 4.346 +/- 0.744 | 8.417 +/- 0.541 | 430.672 +/- 17.979 | 34.461 +/- 3.257 |

View File

@@ -164,6 +164,7 @@ class BackendWorker:
         response_queue,
         read_requests_event,
         batch_size,
+        num_workers,
         model_dir=".",
         compile_model=True,
     ):
@@ -174,11 +175,14 @@
         self.response_queue = response_queue
         self.read_requests_event = read_requests_event
         self.batch_size = batch_size
+        self.num_workers = num_workers
         self.model_dir = model_dir
         self.compile_model = compile_model
         self._setup_complete = False
         self.h2d_stream = torch.cuda.Stream()
         self.d2h_stream = torch.cuda.Stream()
+        # maps thread_id to the cuda.Stream associated with that worker thread
+        self.stream_map = dict()

     def _setup(self):
         import time
@@ -221,11 +225,12 @@
     ):
         # copy_sem makes sure copy_event has been recorded in the data copying thread
         copy_sem.acquire()
-        torch.cuda.current_stream().wait_event(copy_event)
-        with torch.no_grad():
-            response_list.append(model(input_buffer))
-        compute_event.record()
-        compute_sem.release()
+        self.stream_map[threading.get_native_id()].wait_event(copy_event)
+        with torch.cuda.stream(self.stream_map[threading.get_native_id()]):
+            with torch.no_grad():
+                response_list.append(model(input_buffer))
+            compute_event.record()
+        compute_sem.release()
         del input_buffer

     def copy_data(self, input_buffer, data, copy_event, copy_sem):
@@ -243,7 +248,12 @@
         self.response_queue.put((response_list[0].cpu(), request_time))

     async def run(self):
-        worker_pool = ThreadPoolExecutor(max_workers=1)
+        def worker_initializer():
+            self.stream_map[threading.get_native_id()] = torch.cuda.Stream()
+
+        worker_pool = ThreadPoolExecutor(
+            max_workers=self.num_workers, initializer=worker_initializer
+        )
         h2d_pool = ThreadPoolExecutor(max_workers=1)
         d2h_pool = ThreadPoolExecutor(max_workers=1)
@@ -314,6 +324,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--profile", default=False, action=argparse.BooleanOptionalAction
     )
+    parser.add_argument("--num_workers", type=int, default=4)
     args = parser.parse_args()

     downloaded_checkpoint = False
@@ -354,6 +365,7 @@ if __name__ == "__main__":
         response_queue,
         read_requests_event,
         args.batch_size,
+        args.num_workers,
         args.model_dir,
         args.compile,
     )
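For readers skimming the diff, here is a minimal self-contained sketch of the copy/compute handoff being modified; function and variable names are simplified from `server.py`, and the semaphore guarantees `copy_event` has been recorded before a worker's stream waits on it:

```python
import threading

import torch

h2d_stream = torch.cuda.Stream()
stream_map = {}  # thread native id -> that worker thread's cuda.Stream

def worker_initializer():
    stream_map[threading.get_native_id()] = torch.cuda.Stream()

def copy_data(input_buffer, data, copy_event, copy_sem):
    # H2D copy on a dedicated stream; record the event, then release the
    # semaphore so a predict worker knows the event is safe to wait on.
    with torch.cuda.stream(h2d_stream):
        input_buffer.copy_(data, non_blocking=True)
        copy_event.record()
    copy_sem.release()

def predict(model, input_buffer, copy_event, copy_sem, response_list):
    copy_sem.acquire()  # ensure copy_event has been recorded
    stream = stream_map[threading.get_native_id()]
    stream.wait_event(copy_event)  # order this stream after the H2D copy
    with torch.cuda.stream(stream):
        with torch.no_grad():
            response_list.append(model(input_buffer))
```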

Two binary image files added (21 KiB and 22 KiB), not shown: the plots referenced above as `./src/throughput_plot.png` and `./src/avg_latency_plot.png`.