close blobs queues when stopping + test

Summary: Mysterious deadlocks after an epoch has finished have occurred randomly but quite frequently recently for myself, vigneshr and others. Looking at a stack trace of vigneshr's job (P57129798), I noticed a couple of threads were calling BlobsQueue.blockingWrite (or something like that). That call gets stuck when the caffe2/C++ side queue is at capacity (we use a capacity of 4 with data workers). So when this call was in progress while the script was being terminated, the thread did not exit and the whole process did not exit either (not completely sure why, since the thread is a daemon thread, but this might be a flow-related issue since we run inside a flow container).

This is quite easy to fix: just call CloseBlobsQueue() when terminating the process. I modified coordinator.stop() and wait_for_finish() to return a status code based on whether the threads that were joined actually closed within the 1.0 sec timeout. This allowed creating a unit test for this issue. Before my change, the unit test failed.

Reviewed By: pietern

Differential Revision: D4619638

fbshipit-source-id: d96314ca783977517274fc7aadf8db4ee5636bdf
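
Illustration (not part of the commit): a minimal pure-Python sketch of the shutdown problem described in the summary. It does not use caffe2. `ClosableQueue`, `worker`, and `stop` are hypothetical names standing in for the C++ blobs queue, the enqueue thread, and coordinator.stop(). The sketch only shows why a writer blocked on a full queue never exits unless the queue is closed, and how a join with a timeout can report that as a status.

```python
import threading
from collections import deque


class ClosableQueue:
    """Hypothetical bounded queue whose blocking put() is released by close()."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._closed = False
        self._cond = threading.Condition()

    def put(self, item):
        """Block while the queue is full; return False once it is closed."""
        with self._cond:
            while len(self._items) >= self._capacity and not self._closed:
                self._cond.wait()
            if self._closed:
                return False
            self._items.append(item)
            return True

    def close(self):
        """Mark the queue closed and wake every blocked writer."""
        with self._cond:
            self._closed = True
            self._cond.notify_all()


def worker(q):
    # Stand-in for an enqueue thread: writes until the queue is closed.
    while q.put(0):
        pass


def stop(q, threads, close_queue=True, timeout=1.0):
    """Join all threads; return True only if every one of them exited in time."""
    if close_queue:
        q.close()
    for t in threads:
        t.join(timeout)
    return not any(t.is_alive() for t in threads)
```

With nothing draining the queue, the workers fill it and block in put(). Calling stop() without closing the queue reports failure because the joins time out. Closing the queue first lets the writers return, so stop() reports success.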
2025-11-06 00:54:56 +08:00 · 2017-02-27 09:58:20 -08:00
parent 97f95bb247
commit 449f8997ab
2 changed files with 45 additions and 2 deletions
--- a/caffe2/python/data_workers_test.py
+++ b/caffe2/python/data_workers_test.py
@@ -5,6 +5,7 @@ from __future__ import unicode_literals

 import numpy as np
 import unittest
+import time

 from caffe2.python import workspace, cnn
 from caffe2.python import timeout_guard
@@ -56,3 +57,33 @@ class DataWorkersTest(unittest.TestCase):
                self.assertEqual(labels[j], data[j, 2])

        coordinator.stop()
+
+    def testGracefulShutdown(self):
+        model = cnn.CNNModelHelper(name="test")
+        coordinator = data_workers.init_data_input_workers(
+            model,
+            ["data", "label"],
+            dummy_fetcher,
+            32,
+            2,
+        )
+        self.assertEqual(coordinator._fetcher_id_seq, 2)
+        coordinator.start()
+
+        workspace.RunNetOnce(model.param_init_net)
+        workspace.CreateNet(model.net)
+
+        while coordinator._coordinators[0]._inputs < 100:
+            time.sleep(0.01)
+
+        # Run a couple of rounds
+        workspace.RunNet(model.net.Proto().name)
+        workspace.RunNet(model.net.Proto().name)
+
+        # Wait for the enqueue thread to get blocked
+        time.sleep(0.2)
+
+        # We don't dequeue on caffe2 side (as we don't run the net)
+        # so the enqueue thread should be blocked.
+        # Let's now shutdown and see it succeeds.
+        self.assertTrue(coordinator.stop())