close blobs queues when stopping + test

Summary:
Mysterious deadlocks after an epoch has finished have occurred randomly but quite frequently recently for myself, vigneshr and others. Looking at a stack trace of vigneshr's job (P57129798), I noticed a couple of threads were calling BlobsQueue.blockingWrite (or something like that). That call blocks when the caffe2/C++ side queue is at capacity (we use a capacity of 4 with data workers). So when this call was in progress just as the script was being terminated, the thread did not exit and the whole process did not exit either (not completely sure why, since the thread is a daemon thread, but this might be a flow-related issue since we run inside a flow container).
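
The blocking behaviour described above can be reproduced in miniature. The sketch below is not caffe2 code: it uses Python's standard `queue.Queue` as a stand-in for the C++ blobs queue, and `enqueue_forever` is a hypothetical worker. It only illustrates that a writer on a full bounded queue never returns when nothing dequeues, so a timed join on it fails.

```python
import queue
import threading

CAPACITY = 4  # the capacity the message above mentions for data workers


def enqueue_forever(q, produced):
    """Hypothetical worker: writes batches until put() blocks on a full queue."""
    i = 0
    while True:
        q.put(i)  # blocks indefinitely once the queue is at capacity
        produced.append(i)
        i += 1


q = queue.Queue(maxsize=CAPACITY)
produced = []
worker = threading.Thread(
    target=enqueue_forever, args=(q, produced), daemon=True
)
worker.start()

# Nothing dequeues, so the worker fills the queue and then blocks in put().
worker.join(timeout=0.5)
stuck = worker.is_alive()
print("worker still alive after join timeout:", stuck)
```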

This is quite easy to fix: just call CloseBlobsQueue() when terminating the process. I modified coordinator.stop() and wait_for_finish() to return a status code based on whether the threads that were joined actually exited within the 1.0 sec timeout. This made it possible to write a unit test for this issue. Before my change, the unit test failed.
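
The shape of the fix can be sketched in plain Python. This is an illustration under assumptions, not the code in this commit: `ClosableQueue` and `Coordinator` are hypothetical stand-ins for the caffe2 blobs queue and the data workers coordinator, and the `close_queue` flag exists only so the pre-fix behaviour can be shown side by side. The idea is that closing the queue wakes any blocked writer, and `stop()` reports whether every thread exited within the join timeout.

```python
import threading
import time


class ClosableQueue:
    """Bounded queue whose blocked writers wake up when close() is called."""

    def __init__(self, capacity):
        self._items = []
        self._capacity = capacity
        self._closed = False
        self._cond = threading.Condition()

    def blocking_write(self, item):
        """Return True if the item was written, False if the queue is closed."""
        with self._cond:
            while len(self._items) >= self._capacity and not self._closed:
                self._cond.wait()
            if self._closed:
                return False
            self._items.append(item)
            return True

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()  # wake every writer blocked in wait()


class Coordinator:
    """Hypothetical coordinator owning the enqueue threads."""

    def __init__(self, q, num_workers=2):
        self._queue = q
        self._active = True
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]

    def _run(self):
        while self._active:
            if not self._queue.blocking_write(object()):
                return  # queue was closed, exit the thread

    def start(self):
        for w in self._workers:
            w.start()

    def stop(self, close_queue=True):
        """Stop workers; return True only if all of them exited in time."""
        self._active = False
        if close_queue:
            self._queue.close()
        success = True
        for w in self._workers:
            w.join(1.0)  # same 1.0 sec timeout as described above
            if w.is_alive():
                success = False
        return success


fixed = Coordinator(ClosableQueue(capacity=4))
fixed.start()
time.sleep(0.2)  # let the workers fill the queue and block
stopped_cleanly = fixed.stop()

unfixed = Coordinator(ClosableQueue(capacity=4), num_workers=1)
unfixed.start()
time.sleep(0.2)
stopped_without_close = unfixed.stop(close_queue=False)

print(stopped_cleanly, stopped_without_close)
```

Returning a boolean from `stop()` is what makes the problem testable: a test can assert on it instead of hanging when a thread is stuck.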

Reviewed By: pietern

Differential Revision: D4619638

fbshipit-source-id: d96314ca783977517274fc7aadf8db4ee5636bdf
Aapo Kyrola
2017-02-27 09:58:20 -08:00
committed by Facebook Github Bot
parent 97f95bb247
commit 449f8997ab
2 changed files with 45 additions and 2 deletions

@@ -5,6 +5,7 @@ from __future__ import unicode_literals
import numpy as np
import unittest
import time
from caffe2.python import workspace, cnn
from caffe2.python import timeout_guard
@@ -56,3 +57,33 @@ class DataWorkersTest(unittest.TestCase):
                self.assertEqual(labels[j], data[j, 2])
        coordinator.stop()

    def testGracefulShutdown(self):
        model = cnn.CNNModelHelper(name="test")
        coordinator = data_workers.init_data_input_workers(
            model,
            ["data", "label"],
            dummy_fetcher,
            32,
            2,
        )
        self.assertEqual(coordinator._fetcher_id_seq, 2)
        coordinator.start()
        workspace.RunNetOnce(model.param_init_net)
        workspace.CreateNet(model.net)
        while coordinator._coordinators[0]._inputs < 100:
            time.sleep(0.01)
        # Run a couple of rounds
        workspace.RunNet(model.net.Proto().name)
        workspace.RunNet(model.net.Proto().name)
        # Wait for the enqueue thread to get blocked
        time.sleep(0.2)
        # We don't dequeue on caffe2 side (as we don't run the net)
        # so the enqueue thread should be blocked.
        # Let's now shutdown and see it succeeds.
        self.assertTrue(coordinator.stop())