mirror of
https://github.com/pytorch/pytorch.git
synced 2025-11-06 00:54:56 +08:00
close blobs queues when stopping + test
Summary: Mysterious deadlocks after an epoch has finished have occurred randomly but quite frequently recently for myself, vigneshr and others. Looking at a stack trace of vigneshr's job (P57129798), I noticed a couple of threads were calling BlobsQueue.blockingWrite (or something like that). That call gets stuck when the caffe2/C++ side queue is at capacity (we use a capacity of 4 with data workers). So when this call was being made just as the script was about to terminate, the thread did not exit and the whole process did not exit either (not completely sure why, since the thread is a daemon thread, but this might be a flow-related issue since we run inside a flow container).

This is quite easy to fix: just call CloseBlobsQueue() when terminating the process. I modified coordinator.stop() and wait_for_finish() to return a status code based on whether the threads that were joined actually closed within the 1.0 sec timeout. This made it possible to write a unit test for this issue. Before my change, the unit test failed.

Reviewed By: pietern
Differential Revision: D4619638
fbshipit-source-id: d96314ca783977517274fc7aadf8db4ee5636bdf
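The failure mode described in the summary can be reproduced in miniature with the Python standard library alone. The sketch below is not the caffe2 implementation: `ClosableQueue`, `QueueClosed`, `producer` and `run_demo` are hypothetical names made up for illustration, and only the capacity of 4 and the 1.0 sec join timeout are taken from the summary. It shows a writer thread that blocks on a full bounded queue and is released once the queue is closed.

```python
import threading
from collections import deque


class QueueClosed(Exception):
    """Raised when writing to a queue that has been closed (hypothetical)."""


class ClosableQueue:
    """Bounded queue whose blocked writers wake up when it is closed."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._items = deque()
        self._closed = False
        self._cond = threading.Condition()

    def blocking_write(self, item):
        with self._cond:
            # Sleep while the queue is full and still open.
            while len(self._items) >= self._capacity and not self._closed:
                self._cond.wait()
            if self._closed:
                raise QueueClosed()
            self._items.append(item)

    def close(self):
        with self._cond:
            self._closed = True
            # Wake every writer that is waiting for free space.
            self._cond.notify_all()


def producer(queue, written):
    try:
        while True:
            queue.blocking_write(0)
            written.append(0)
    except QueueClosed:
        pass  # Exit cleanly once the queue has been closed.


def run_demo(capacity=4, timeout=1.0):
    queue = ClosableQueue(capacity)
    written = []
    worker = threading.Thread(
        target=producer, args=(queue, written), daemon=True
    )
    worker.start()
    # Nothing reads from the queue, so the worker fills it and then blocks.
    worker.join(0.2)
    was_blocked = worker.is_alive()
    # Closing the queue releases the blocked writer.
    queue.close()
    worker.join(timeout)
    return was_blocked, not worker.is_alive(), len(written)
```

Calling `run_demo()` returns a tuple saying whether the worker was blocked before the close, whether it exited within the timeout after the close, and how many items it managed to write. Without the `close()` call the second join would time out, which is the symptom the summary describes.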
committed by Facebook Github Bot
parent 97f95bb247
commit 449f8997ab
@@ -5,6 +5,7 @@ from __future__ import unicode_literals
 
 import numpy as np
 import unittest
+import time
 
 from caffe2.python import workspace, cnn
 from caffe2.python import timeout_guard
@@ -56,3 +57,33 @@ class DataWorkersTest(unittest.TestCase):
                 self.assertEqual(labels[j], data[j, 2])
 
         coordinator.stop()
+
+    def testGracefulShutdown(self):
+        model = cnn.CNNModelHelper(name="test")
+        coordinator = data_workers.init_data_input_workers(
+            model,
+            ["data", "label"],
+            dummy_fetcher,
+            32,
+            2,
+        )
+        self.assertEqual(coordinator._fetcher_id_seq, 2)
+        coordinator.start()
+
+        workspace.RunNetOnce(model.param_init_net)
+        workspace.CreateNet(model.net)
+
+        while coordinator._coordinators[0]._inputs < 100:
+            time.sleep(0.01)
+
+        # Run a couple of rounds
+        workspace.RunNet(model.net.Proto().name)
+        workspace.RunNet(model.net.Proto().name)
+
+        # Wait for the enqueue thread to get blocked
+        time.sleep(0.2)
+
+        # We don't dequeue on caffe2 side (as we don't run the net)
+        # so the enqueue thread should be blocked.
+        # Let's now shutdown and see it succeeds.
+        self.assertTrue(coordinator.stop())
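The final assertion in the test relies on `coordinator.stop()` reporting whether its threads really exited. As a rough illustration of that idea, and not the actual `data_workers` code, the sketch below joins each thread with a timeout and returns `False` if any of them is still alive afterwards. `MiniCoordinator`, `add_worker`, `polite_worker` and `stuck_worker` are hypothetical names. Only the 1.0 sec default timeout comes from the commit message.

```python
import threading
import time


class MiniCoordinator:
    """Toy coordinator: stop() reports whether all workers exited in time."""

    def __init__(self):
        self._active = threading.Event()
        self._threads = []

    def add_worker(self, target):
        # Daemon threads, as in the commit message, so a stuck worker
        # does not keep this demo process alive.
        thread = threading.Thread(
            target=target, args=(self._active,), daemon=True
        )
        self._threads.append(thread)

    def start(self):
        self._active.set()
        for thread in self._threads:
            thread.start()

    def stop(self, timeout=1.0):
        self._active.clear()
        success = True
        for thread in self._threads:
            thread.join(timeout)
            # join() returns None either way, so check liveness explicitly.
            if thread.is_alive():
                success = False
        return success


def polite_worker(active):
    # Checks the shared flag regularly and exits soon after stop().
    while active.is_set():
        time.sleep(0.01)


def stuck_worker(active):
    # Stands in for a thread blocked in a call that never returns.
    threading.Event().wait()
```

A coordinator that only has `polite_worker` threads returns `True` from `stop()`, while one with a `stuck_worker` returns `False` once the timeout expires. That is what lets a unit test turn a hang into an ordinary assertion failure.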