pytorch/torch/multiprocessing
Chip Turner 2ed47fecc5 Robustify torch.multiprocessing.spawn error reporting to be less deadlock prone (#114688)
multiprocessing.Queue relies on, among other things, background threads to send messages between processes.  This works in the happy path but can cause issues if a process exits by bypassing atexit handlers or by crashing: the writer to the Queue can terminate while the reader is blocked reading the queue.  The reader then sees the queue as non-empty yet, even with a timeout, blocks forever.

An example of a Queue deadlock is here: https://gist.github.com/chipturner/342f72341f087737befe9df84d0e41ce

Since the error reporting case here is a simple one-shot message from the dying child to the parent, we can just use a file-based rendezvous.  This eliminates the deadlock that occurs when a child exits while a large traceback is still being flushed to the network.
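
For illustration only, here is a minimal sketch of the file-based rendezvous idea. It is not the code from this PR: the names `_failing_child`, `run_and_collect_error`, and `error.pickle` are hypothetical, and the "fork" start method is assumed (POSIX only). The child writes its traceback to a file and exits abruptly; the parent reads the file only after the child has been joined, so no background feeder thread is involved.

```python
import multiprocessing
import os
import pickle
import tempfile
import traceback
from typing import Optional, Tuple


def _failing_child(error_file: str) -> None:
    """Hypothetical worker: fails, records its traceback, then exits abruptly."""
    try:
        # A large message stands in for a large traceback.
        raise ValueError("boom " + "x" * 100_000)
    except Exception:
        # Write to a temporary name and rename, so the parent never
        # observes a partially written file.
        tmp_file = error_file + ".tmp"
        with open(tmp_file, "wb") as f:
            pickle.dump(traceback.format_exc(), f)
        os.replace(tmp_file, error_file)
        # os._exit skips atexit handlers, mimicking a hard exit.
        os._exit(1)


def run_and_collect_error() -> Tuple[Optional[int], Optional[str]]:
    """Run the child and return (exit code, traceback text or None)."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX platform
    with tempfile.TemporaryDirectory() as tmp_dir:
        error_file = os.path.join(tmp_dir, "error.pickle")
        proc = ctx.Process(target=_failing_child, args=(error_file,))
        proc.start()
        proc.join()  # the file is read only after the child is gone
        message = None
        if os.path.exists(error_file):
            with open(error_file, "rb") as f:
                message = pickle.load(f)
        return proc.exitcode, message


if __name__ == "__main__":
    exitcode, message = run_and_collect_error()
    print("child exit code:", exitcode)
    print("traceback length:", len(message) if message else 0)
```

Because the file is fully written and closed before the child exits, the parent's read cannot block on a writer that has already died.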

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114688
Approved by: https://github.com/suo, https://github.com/yifuwang
2023-12-09 03:36:43 +00:00