Allow more backend worker threads with each using a separate cuda stream (#116190)

Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolExecutor` used for model predictions. Each worker uses its own `cuda.Stream()`, created when the worker thread is initialized.

Ran the benchmark with 2-4 workers and `compile=False` (since `torch.compile()` is not thread-safe).
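The core pattern, distilled from the `server.py` diff below (names simplified; `run_on_worker_stream` is an illustrative helper, not a function in the PR):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import torch

# One CUDA stream per worker thread, created when the thread starts.
stream_map = {}

def worker_initializer():
    stream_map[threading.get_native_id()] = torch.cuda.Stream()

worker_pool = ThreadPoolExecutor(max_workers=4, initializer=worker_initializer)

def run_on_worker_stream(model, inp):
    # Run the forward pass on this thread's dedicated stream so that
    # predictions from different workers can overlap on the GPU.
    with torch.cuda.stream(stream_map[threading.get_native_id()]):
        with torch.no_grad():
            return model(inp)
```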

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188, #116189
Author: Mikayla Gawarecki
Date: 2023-12-20 10:58:16 -08:00
Committed by: PyTorch MergeBot
Parent: 0dd64174bd
Commit: 19207b9183
10 changed files with 51 additions and 8 deletions

View File

@@ -32,3 +32,18 @@ Semaphores are used in conjunction with `cuda.Event`s to ensure proper synchronization
 * Average latency decreased
 * Throughput increased
 * GPU utilization increased
+
+### [#116190](https://github.com/pytorch/pytorch/pull/116190)
+* Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolExecutor` used for model predictions. Each worker uses its own `cuda.Stream()`, created when the worker thread is initialized.
+
+##### Results:
+Benchmarks were only run for `compile=False`, since `torch.compile()` is not thread-safe. Benchmarks were run with `num_workers={2, 3, 4}`.
+
+For the 2-worker case:
+* All metrics improved over the single-worker case across all batch sizes.
+* For batch sizes 1, 32, and 64, the metrics were still slightly worse than the baseline.
+* For batch sizes 128 and 256, all metrics beat the baseline (e.g. for bs=256, throughput increased by ~300 samples/sec, average latency decreased by ~5 s, and warmup latency decreased by ~2 s).
+
+![Throughput against batch size](./src/throughput_plot.png)
+![Avg latency against batch size](./src/avg_latency_plot.png)

View File

@@ -28,6 +28,7 @@ The toggleable command line arguments to the script are as follows:
 - `compile` (default: compile): or `--no-compile` whether to `torch.compile()`
   the model
 - `output_file` (default: output.csv): The name of the csv file to write the outputs to in the `results/` directory.
+- `num_workers` (default: 4): The `max_workers` passed to the `ThreadPoolExecutor` in charge of model prediction.

 e.g. A sample command to run the benchmark
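For illustration (the sample command itself is not part of this diff), a run combining the flags documented above might look like `python server.py --num_workers 2 --no-compile`.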
@@ -47,5 +48,5 @@ The script `runner.sh` will run a sweep of the benchmark over different batch
 sizes with compile on and off and collect the mean and standard deviation of warmup latency,
 average latency, throughput and GPU utilization for each. The `results/` directory will contain the metrics
 from running a sweep as we develop this benchmark where `results/output_{batch_size}_{compile}.md`
-will contain the mean and standard deviation of results for a given batch size and compile setting,
-if the file already exists, the metrics form the run will be appended as a new row in the markdown table.
+will contain the mean and standard deviation of results for a given batch size and compile setting.
+If the file already exists, the metrics from the run will be appended as a new row in the markdown table.

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 5.355 +/- 0.293 | 14.172 +/- 0.267 | 518.884 +/- 7.854 | 56.108 +/- 0.862 |
 | h2d_d2h_threads | 3.810 +/- 0.319 | 14.146 +/- 1.145 | 551.909 +/- 34.079 | 52.057 +/- 4.121 |
+| 2_predict_workers | 3.639 +/- 0.037 | 11.161 +/- 0.160 | 636.701 +/- 14.753 | 53.279 +/- 2.659 |
+| 3_predict_workers | 4.930 +/- 0.060 | 10.532 +/- 0.801 | 677.736 +/- 25.115 | 53.806 +/- 1.324 |
+| 4_predict_workers | 3.819 +/- 0.253 | 11.451 +/- 0.439 | 638.146 +/- 22.611 | 50.129 +/- 1.764 |

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 5.497 +/- 0.681 | 0.383 +/- 0.007 | 168.818 +/- 2.169 | 12.574 +/- 2.205 |
 | h2d_d2h_threads | 4.039 +/- 1.175 | 0.593 +/- 0.246 | 127.449 +/- 34.358 | 11.445 +/- 1.925 |
+| 2_predict_workers | 3.766 +/- 0.610 | 0.369 +/- 0.010 | 186.485 +/- 2.657 | 11.857 +/- 2.591 |
+| 3_predict_workers | 3.460 +/- 0.063 | 0.407 +/- 0.033 | 168.863 +/- 9.227 | 12.427 +/- 1.430 |
+| 4_predict_workers | 4.184 +/- 0.636 | 0.692 +/- 0.182 | 110.231 +/- 31.420 | 8.490 +/- 1.826 |

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 6.424 +/- 1.141 | 26.361 +/- 0.557 | 573.027 +/- 9.661 | 64.000 +/- 3.405 |
 | h2d_d2h_threads | 4.600 +/- 0.724 | 21.314 +/- 0.403 | 704.344 +/- 9.843 | 71.963 +/- 1.558 |
+| 2_predict_workers | 4.199 +/- 0.363 | 16.772 +/- 1.435 | 864.678 +/- 32.353 | 70.026 +/- 1.403 |
+| 3_predict_workers | 4.496 +/- 0.755 | 15.983 +/- 0.455 | 912.386 +/- 18.299 | 68.283 +/- 2.226 |
+| 4_predict_workers | 4.252 +/- 0.515 | 14.702 +/- 0.259 | 951.261 +/- 7.986 | 70.716 +/- 2.774 |

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 5.680 +/- 0.919 | 4.785 +/- 0.864 | 394.178 +/- 81.705 | 38.515 +/- 11.152 |
 | h2d_d2h_threads | 4.856 +/- 0.142 | 6.694 +/- 0.497 | 287.201 +/- 41.480 | 27.028 +/- 4.773 |
+| 2_predict_workers | 3.465 +/- 0.082 | 5.369 +/- 0.900 | 334.981 +/- 50.292 | 31.635 +/- 4.492 |
+| 3_predict_workers | 3.819 +/- 0.617 | 4.409 +/- 0.149 | 402.236 +/- 22.151 | 35.893 +/- 0.877 |
+| 4_predict_workers | 3.994 +/- 0.509 | 6.007 +/- 0.408 | 296.260 +/- 16.524 | 25.751 +/- 1.810 |

View File

@@ -4,3 +4,6 @@
 | ---------- | ------------------ | ------------------- | ------------------------ | ------------------- |
 | original | 5.823 +/- 0.541 | 8.024 +/- 0.508 | 460.474 +/- 29.896 | 47.112 +/- 6.576 |
 | h2d_d2h_threads | 3.971 +/- 0.709 | 9.583 +/- 1.030 | 393.394 +/- 26.110 | 37.550 +/- 1.381 |
+| 2_predict_workers | 3.535 +/- 0.124 | 8.258 +/- 1.188 | 441.640 +/- 85.705 | 38.630 +/- 3.499 |
+| 3_predict_workers | 3.724 +/- 0.312 | 7.614 +/- 0.951 | 467.236 +/- 36.619 | 37.834 +/- 2.418 |
+| 4_predict_workers | 4.346 +/- 0.744 | 8.417 +/- 0.541 | 430.672 +/- 17.979 | 34.461 +/- 3.257 |

View File

@@ -164,6 +164,7 @@ class BackendWorker:
         response_queue,
         read_requests_event,
         batch_size,
+        num_workers,
         model_dir=".",
         compile_model=True,
     ):
@@ -174,11 +175,14 @@
         self.response_queue = response_queue
         self.read_requests_event = read_requests_event
         self.batch_size = batch_size
+        self.num_workers = num_workers
         self.model_dir = model_dir
         self.compile_model = compile_model
         self._setup_complete = False
         self.h2d_stream = torch.cuda.Stream()
         self.d2h_stream = torch.cuda.Stream()
+        # maps thread_id to the cuda.Stream associated with that worker thread
+        self.stream_map = dict()

     def _setup(self):
         import time
@@ -221,11 +225,12 @@
     ):
         # copy_sem makes sure copy_event has been recorded in the data copying thread
         copy_sem.acquire()
-        torch.cuda.current_stream().wait_event(copy_event)
-        with torch.no_grad():
-            response_list.append(model(input_buffer))
-        compute_event.record()
-        compute_sem.release()
+        self.stream_map[threading.get_native_id()].wait_event(copy_event)
+        with torch.cuda.stream(self.stream_map[threading.get_native_id()]):
+            with torch.no_grad():
+                response_list.append(model(input_buffer))
+            compute_event.record()
+        compute_sem.release()
         del input_buffer

     def copy_data(self, input_buffer, data, copy_event, copy_sem):
@@ -243,7 +248,12 @@
         self.response_queue.put((response_list[0].cpu(), request_time))

     async def run(self):
-        worker_pool = ThreadPoolExecutor(max_workers=1)
+        def worker_initializer():
+            self.stream_map[threading.get_native_id()] = torch.cuda.Stream()
+
+        worker_pool = ThreadPoolExecutor(
+            max_workers=self.num_workers, initializer=worker_initializer
+        )
         h2d_pool = ThreadPoolExecutor(max_workers=1)
         d2h_pool = ThreadPoolExecutor(max_workers=1)
@@ -314,6 +324,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--profile", default=False, action=argparse.BooleanOptionalAction
     )
+    parser.add_argument("--num_workers", type=int, default=4)
     args = parser.parse_args()

     downloaded_checkpoint = False
@@ -354,6 +365,7 @@ if __name__ == "__main__":
         response_queue,
         read_requests_event,
         args.batch_size,
+        args.num_workers,
         args.model_dir,
         args.compile,
     )
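For readers skimming the diff, here is a minimal self-contained sketch of the copy/compute handoff being modified; function and variable names are simplified from `server.py`, and the semaphore guarantees `copy_event` has been recorded before a worker's stream waits on it:

```python
import threading

import torch

h2d_stream = torch.cuda.Stream()
stream_map = {}  # thread native id -> that worker thread's cuda.Stream

def worker_initializer():
    stream_map[threading.get_native_id()] = torch.cuda.Stream()

def copy_data(input_buffer, data, copy_event, copy_sem):
    # H2D copy on a dedicated stream; record the event, then release the
    # semaphore so a predict worker knows the event is safe to wait on.
    with torch.cuda.stream(h2d_stream):
        input_buffer.copy_(data, non_blocking=True)
        copy_event.record()
    copy_sem.release()

def predict(model, input_buffer, copy_event, copy_sem, response_list):
    copy_sem.acquire()  # ensure copy_event has been recorded
    stream = stream_map[threading.get_native_id()]
    stream.wait_event(copy_event)  # order this stream after the H2D copy
    with torch.cuda.stream(stream):
        with torch.no_grad():
            response_list.append(model(input_buffer))
```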

Two binary image files added (21 KiB and 22 KiB), not shown: the plots referenced above as `./src/throughput_plot.png` and `./src/avg_latency_plot.png`.