pytorch/torch/distributed/constants.py
Rohan Varma df47a3abe0 [distributed] pass in timeout to TCP store when initializing (#33325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325

Closes https://github.com/pytorch/pytorch/issues/32924. There was a bug where, for TCPStore, we did not respect the timeout passed into `init_process_group` while constructing the store. Instead, we set the timeout only after the rendezvous had created the store, so the default timeout of 300s applied while connecting to the server. This diff forwards the timeout from `init_process_group` to rendezvous, which passes it to the TCPStore constructor, so the right timeout is in effect at construction time.
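
The ordering problem described above can be sketched with a stand-in class. This is a minimal illustration, not the TCPStore implementation: `SketchStore`, `rendezvous_before`, and `rendezvous_after` are hypothetical names, and the sketch only assumes that the store connects inside its constructor, as the summary implies.

```python
from datetime import timedelta

# Store-level default quoted in the summary above.
STORE_DEFAULT_TIMEOUT = timedelta(seconds=300)


class SketchStore:
    """Hypothetical stand-in for a store that connects in its constructor."""

    def __init__(self, timeout=STORE_DEFAULT_TIMEOUT):
        self.timeout = timeout
        # The connection attempt happens here, so this value governs it.
        self.connect_timeout = timeout

    def set_timeout(self, timeout):
        # Only affects later operations; the connection has already happened.
        self.timeout = timeout


def rendezvous_before(user_timeout):
    # Old ordering: construct with the default, then set the timeout.
    store = SketchStore()
    store.set_timeout(user_timeout)
    return store


def rendezvous_after(user_timeout):
    # New ordering: pass the timeout into the constructor.
    return SketchStore(timeout=user_timeout)


user_timeout = timedelta(seconds=10)
old = rendezvous_before(user_timeout)
new = rendezvous_after(user_timeout)
print(old.connect_timeout.total_seconds())  # 300.0
print(new.connect_timeout.total_seconds())  # 10.0
```

In the old ordering the user's value reaches the store too late to influence the connection; in the new ordering it is the value used while connecting.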

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875

Test Plan: Added a unit test.

Differential Revision: D19871946

fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
2020-02-16 17:59:44 -08:00


from datetime import timedelta

# Default process group wide timeout, if applicable.
# This only applies to the gloo and nccl backends (for nccl, only if
# NCCL_BLOCKING_WAIT is set to 1). To make an attempt at backwards
# compatibility with THD, we use an extraordinarily high default
# timeout, given that THD did not have timeouts.
default_pg_timeout = timedelta(minutes=30)
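
For a sense of scale, the constant above works out to 1800 seconds. The sketch below shows one way a caller might fall back to it when no timeout is supplied; `resolve_timeout` is a hypothetical helper written for illustration and is not part of `torch.distributed`.

```python
from datetime import timedelta

# Same value as the constant defined in this file.
default_pg_timeout = timedelta(minutes=30)


def resolve_timeout(timeout=None):
    # Hypothetical helper: use the caller's timeout if given,
    # otherwise fall back to the process group wide default.
    return default_pg_timeout if timeout is None else timeout


print(resolve_timeout().total_seconds())                        # 1800.0
print(resolve_timeout(timedelta(seconds=10)).total_seconds())   # 10.0
```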