mirror of
https://github.com/pytorch/pytorch.git
synced 2025-10-21 05:34:18 +08:00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325

Closes https://github.com/pytorch/pytorch/issues/32924.

There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server.

This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.

ghstack-source-id: 98401875

Test Plan: Added a UT

Differential Revision: D19871946

fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
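The ordering problem described in the summary can be illustrated with a small stand-alone sketch. This is not the PyTorch implementation: `ToyStore`, `rendezvous_before_fix`, and `rendezvous_after_fix` are hypothetical names invented for illustration, and the toy store only records which timeout was in effect when it "connected" instead of opening a socket. The 300s default is the value quoted in the summary.

```python
from datetime import timedelta

# The 300s default mentioned in the commit summary.
DEFAULT_STORE_TIMEOUT = timedelta(seconds=300)


class ToyStore:
    """Hypothetical stand-in for a store; not the real TCPStore."""

    def __init__(self, timeout=DEFAULT_STORE_TIMEOUT):
        self.timeout = timeout
        # The connection happens inside the constructor, so it can only
        # see the timeout that was passed in at construction time.
        self.connect_timeout = self._connect()

    def _connect(self):
        # A real store would block here for up to self.timeout;
        # this toy version only records which value was in effect.
        return self.timeout

    def set_timeout(self, timeout):
        # Affects later operations only; the connection is already made.
        self.timeout = timeout


def rendezvous_before_fix(timeout):
    # Old ordering: construct with the default, then set the timeout.
    store = ToyStore()
    store.set_timeout(timeout)
    return store


def rendezvous_after_fix(timeout):
    # New ordering: hand the timeout to the constructor.
    return ToyStore(timeout=timeout)


user_timeout = timedelta(seconds=10)
old = rendezvous_before_fix(user_timeout)
new = rendezvous_after_fix(user_timeout)
print(old.connect_timeout)  # 0:05:00
print(new.connect_timeout)  # 0:00:10
```

In the old ordering the user's 10 second timeout only takes effect after the connection attempt, which is the behavior the summary reports as the bug.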
9 lines
368 B
Python
from datetime import timedelta

# Default process group wide timeout, if applicable.
# This only applies to the gloo and nccl backends
# (only if NCCL_BLOCKING_WAIT is set to 1). To make an attempt at
# backwards compatibility with THD, we use an extraordinarily high default
# timeout, given that THD did not have timeouts.
default_pg_timeout = timedelta(minutes=30)
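As a rough illustration of how a module-level default like this is typically consumed, the sketch below falls back to the constant when the caller supplies no timeout. The helper `effective_timeout` is a hypothetical name and does not appear in the source; only `default_pg_timeout` and its 30 minute value are taken from the file above.

```python
from datetime import timedelta

# Same value as in the file above, repeated so the sketch is self-contained.
default_pg_timeout = timedelta(minutes=30)


def effective_timeout(timeout=None):
    """Hypothetical helper: fall back to the default when none is given."""
    if timeout is None:
        return default_pg_timeout
    if not isinstance(timeout, timedelta):
        raise TypeError("timeout must be a datetime.timedelta")
    return timeout


print(effective_timeout().total_seconds())  # 1800.0
print(effective_timeout(timedelta(seconds=10)).total_seconds())  # 10.0
```

Keeping the value as a `timedelta` rather than a bare number avoids ambiguity about units until the point where seconds or milliseconds are actually needed.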