Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434
Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not time out (timing out the rendezvous in 1s led to the flakiness). I also
generalized our retry-on-error mechanism so that it also retries on errors
caused by rendezvous timeouts.
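A minimal sketch of what such a generalized retry helper could look like (the helper name and the matched error strings are assumptions, not the exact ones in the test suite):
```
import time
from functools import wraps

# Hypothetical retry helper: retries a test on known transient errors,
# now including rendezvous/TCPStore timeouts (message wording assumed).
RETRYABLE_MESSAGES = (
    "Address already in use",
    "timed out",
)

def retry_on_known_errors(func, tries=3, delay=0.5):
    @wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(tries):
            try:
                return func(*args, **kwargs)
            except RuntimeError as e:
                if attempt + 1 == tries or not any(m in str(e) for m in RETRYABLE_MESSAGES):
                    raise
                time.sleep(delay)  # brief backoff before retrying
    return wrapper

@retry_on_known_errors
def test_tcp_store_timeout_set():
    ...
```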
ghstack-source-id: 98558377
Test Plan: Added UT test_tcp_store_timeout_set
Differential Revision: D19935390
fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325
Closes https://github.com/pytorch/pytorch/issues/32924. There was a bug where, for TCPStore, we did not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we set the timeout only after the rendezvous had created the store, meaning that the default timeout of 300s was used while connecting to the server. This diff forwards the timeout passed into `init_process_group` through rendezvous and into the TCPStore constructor, so the right timeout is used at construction time.
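For illustration, a hedged sketch of the user-facing path this fixes; with this change the timeout below should reach the `TCPStore` constructor used by tcp:// rendezvous, rather than being applied only after the store exists:
```
from datetime import timedelta
import torch.distributed as dist

# The timeout passed here is forwarded to rendezvous, which can now use it
# while constructing the TCPStore (i.e. while connecting to the server),
# instead of only applying it after the store has been created.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
    timeout=timedelta(seconds=10),
)
dist.destroy_process_group()
```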
Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875
Test Plan: Added a UT
Differential Revision: D19871946
fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32016
The previous logic raised an exception when the URL contained a query string and rank or world_size was specified.
The fix parses the URL, stitches rank and world_size into its query string, and regenerates the URL.
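A rough sketch of the fix using `urllib.parse` (illustrative, not necessarily the exact code in the patch):
```
from urllib.parse import urlparse, urlunparse

def append_rank_and_world_size(url, rank, world_size):
    """Merge rank/world_size into the URL's existing query string."""
    parsed = urlparse(url)
    query = parsed.query
    for key, value in (("rank", rank), ("world_size", world_size)):
        if value is not None:
            query = f"{query}&{key}={value}" if query else f"{key}={value}"
    return urlunparse(parsed._replace(query=query))

# Works whether or not the URL already carries a query string:
append_rank_and_world_size("tcp://127.0.0.1:29500", 0, 2)
append_rank_and_world_size("file:///tmp/store?foo=bar", 0, 2)  # no longer raises
```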
Test Plan: f161291877
Differential Revision: D19337929
fbshipit-source-id: 6bb3a07716dda5233553804000b706052ff18db8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `init_rpc` function for other `RpcAgent`s, but it is not actually used by `ProcessGroupAgent`. A cleanup was requested for this, as described in https://github.com/pytorch/pytorch/issues/29031.
To accommodate the differences between `RpcAgent`s, this adds a `RpcAgentOptions` base class, which allows leveraging inheritance to add agent-specific fields.
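A loose Python analogue of this pattern (the real options classes are C++/pybind types, and the field names below are illustrative, not the actual API): a shared base class carries the common fields, and each agent's subclass adds only what it needs.
```
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, Optional

@dataclass
class RpcAgentOptions:
    # Fields common to every RpcAgent.
    rpc_timeout: timedelta = timedelta(seconds=60)

@dataclass
class ProcessGroupAgentOptions(RpcAgentOptions):
    # ProcessGroupAgent does not need worker_to_id, so it is simply absent here.
    num_send_recv_threads: int = 4

@dataclass
class SomeOtherAgentOptions(RpcAgentOptions):
    # An agent that genuinely needs a name-to-id mapping adds it in its own subclass.
    worker_to_id: Optional[Dict[str, int]] = None
```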
ghstack-source-id: 94197295
Test Plan:
### OSS RPC + RRef tests
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```
### Prototype RRef tests
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```
### Dist autograd
```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```
Differential Revision: D18595578
fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226
# Goal
Rendezvous step should be the first step not only for `init_process_group` but also for `init_model_parallel`.
The roadblock is that there is a special step in `init_process_group` where the `rank` and `world_size` arguments passed to `init_process_group(..)` are appended to the `init_method` URL string.
We need to make this argument appending step common and re-usable for both `init_process_group` and `init_model_parallel`.
# Solution
- Put argument appending inside of `rendezvous` function.
- Remove manual `init_method` url construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.
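As a rough sketch of the resulting caller-side flow (the exact `rendezvous` signature is simplified here), both `init_process_group` and the RPC initialization path can hand `rank` and `world_size` directly to `rendezvous` instead of hand-building the `init_method?rank=...&world_size=...` URL:
```
import torch.distributed as dist

# Both entry points can now do roughly this, letting rendezvous perform
# the argument appending internally.
rendezvous_iterator = dist.rendezvous(
    "tcp://127.0.0.1:29500", rank=0, world_size=1
)
store, rank, world_size = next(rendezvous_iterator)
```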
Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```
```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```
```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```
Differential Revision: D5524494
fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
Summary:
This PR fixes a race condition for the TCP init method, where the master rank can exit earlier than the other ranks and the TCP daemon thread gets shut down before the other ranks are able to access it.
The fix has every rank (process) write a special key to the store to mark that it has completed (and thus is about to exit). The master rank (which is the server) always waits until all ranks have completed before completing itself.
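A minimal sketch of this shutdown barrier, expressed with the public store API (key names are illustrative; the actual fix lives inside the TCP init method rather than in user code):
```
from datetime import timedelta
import torch.distributed as dist

def mark_done_and_wait(store, rank, world_size):
    # Every rank marks itself as done before exiting.
    store.set(f"finished_{rank}", "1")
    if rank == 0:
        # The master rank hosts the TCP daemon, so it must not shut down
        # until every other rank has checked in.
        store.wait([f"finished_{r}" for r in range(world_size)],
                   timedelta(seconds=300))

# Single-process illustration: rank 0 is both the server and the only client.
store = dist.TCPStore("127.0.0.1", 29501, 1, True, timedelta(seconds=30))
mark_done_and_wait(store, rank=0, world_size=1)
```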
This should fix: https://github.com/pytorch/pytorch/issues/15638
Tested using the repro from https://github.com/pytorch/pytorch/issues/15638 and it works fine. test_distributed and test_c10d should already cover this.
I had to make the rendezvous test in c10d use a world size of 1, since it is single-process code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684
Differential Revision: D13570904
Pulled By: teng-li
fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446
This was a supported behavior in old torch.distributed. We want to support it in the new release.
The tests should cover every combination of scenarios where rank and world size are provided via environment variables, via arguments, or both.
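For example (a hedged sketch of the restored behavior, not the test code itself), the following should work with rank and world size given as arguments even though the init method is env://; alternatively they could be supplied via the RANK and WORLD_SIZE environment variables:
```
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# rank/world_size passed as arguments while init_method is env://
dist.init_process_group(
    backend="gloo", init_method="env://", rank=0, world_size=1
)
dist.destroy_process_group()
```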
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494
Differential Revision: D13253433
Pulled By: teng-li
fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
Summary:
This addresses https://github.com/pytorch/pytorch/issues/11874,
giving us identical file init_method behavior to the previous THD file init.
The FileStore::add bug was also pretty annoying.
Two bugs:
(1) Add doesn't append to the end of the file.
(2) Cache doesn't get updated.
Both are fixed and covered by tests.
I examined /tmp to ensure that all temp files are automatically deleted after test_c10d.py runs.
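A small sketch of the behavior the fixes restore, using the public `FileStore` bindings (the path is illustrative): repeated `add` calls must append to the file and stay consistent with the in-memory cache.
```
import torch.distributed as dist

store = dist.FileStore("/tmp/filestore_example", 1)  # single worker
store.add("counter", 1)
store.add("counter", 1)              # must append to the file, not clobber it
assert store.add("counter", 0) == 2  # and the cached value must stay in sync
```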
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13708
Reviewed By: pietern
Differential Revision: D12972810
Pulled By: teng-li
fbshipit-source-id: 917255390aa52845f6b0ad0f283875a7a704da48
Summary:
Previously, a missing environment variable raised a generic missing-key error. Now it
raises a more descriptive error stating the actual problem, for example:
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
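A rough sketch of the pattern (the real check lives in the env:// rendezvous handler; the helper names here are approximate):
```
import os

def _env_error(var):
    return ValueError(
        "Error initializing torch.distributed using env:// rendezvous: "
        f"environment variable {var} expected, but not set"
    )

def _get_env_or_raise(var):
    value = os.environ.get(var)
    if value is None:
        raise _env_error(var)  # descriptive error instead of a bare KeyError
    return value

# Raises the descriptive ValueError above if WORLD_SIZE is not set.
world_size = int(_get_env_or_raise("WORLD_SIZE"))
```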
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11782
Differential Revision: D9888962
Pulled By: pietern
fbshipit-source-id: 5947e7a7bf7aa45f13bbd7b5e997529f26cc92d6
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08