pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Jane Xu	34051d74da	Add test owner to distributed files starting with test_ (#66797 ) Summary: Action based on https://github.com/pytorch/pytorch/issues/66232 cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/66797 Reviewed By: gchanan Differential Revision: D31761389 Pulled By: janeyx99 fbshipit-source-id: c27c9ab4acec1eb71d5edd4538cd113b770dfc6c	2021-10-19 10:55:20 -07:00
Howard Huang	dc1bd6acee	Remove PROCESS GROUP rpc backend (#62411 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62411 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29990408 Pulled By: H-Huang fbshipit-source-id: 183d3b316767b12993cebbe32b73c2850fd1cc42	2021-08-02 12:26:22 -07:00
Howard Huang	43274ca145	test_store multiworker remove multiprocessing (#59599 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59599 This will fix the flakiness for these tests internally when running under TSAN. We don't need multiprocessing since we should restrict the testing to the `wait_for_workers` and `world_size` parameters of the tcp store master store. Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D28947838 Pulled By: H-Huang fbshipit-source-id: d3e3904aa7ac81ae4c744a193a3b7167c2227bc8	2021-06-08 14:38:42 -07:00
Can Balioglu	1d9c1cc00a	[4/n] [c10d] Introduce the multi-tenancy feature in TCPStore (#58331 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58331 This PR is the final part of a stack that addresses the GitHub issue #41614; it introduces the multi-tenancy feature to the `TCPStore` class allowing two server stores to be instantiated with the same host:port pair. ghstack-source-id: 130676394 Test Plan: - Run the existing and newly-introduced tests. - Run several smoke tests including the short code snippet referred in GitHub issue #41614. Reviewed By: H-Huang Differential Revision: D28453850 fbshipit-source-id: f9066b164305de0f8c257e9d5736e93fd7e21ec6	2021-06-05 07:50:07 -07:00
Andrew Gu	bbf7eceaf0	Refactor c10d and dist aliases for torch.distributed (#59456 ) Summary: Overview: This consolidates `c10d` and `dist` to only `dist` as the alias for `torch.distributed` in `test_store.py`. Both aliases were used most likely due to incremental additions to the test file and not intentional. Pull Request resolved: https://github.com/pytorch/pytorch/pull/59456 Test Plan: ``` python test/distributed/test_store.py ``` Reviewed By: agolynski Differential Revision: D28910169 Pulled By: andwgu fbshipit-source-id: f830dead29e9de48aaf2845dfa5861c9cccec15d	2021-06-04 16:07:44 -07:00
Andrew Gu	2ad4b8e58c	Extract c10d Store tests to dedicated test file (#59271 ) Summary: Partially addresses https://github.com/pytorch/pytorch/issues/55340 Overview This factors out `FileStoreTest`, `HashStoreTest`, `PrefixFileStoreTest`, `TCPStoreTest`, `PrefixTCPStoreTest`, `PythonStoreTest`, `RendezvousTest`, `RendezvousEnvTest`, `RendezvousFileTest`, and `RendezvousTCPTest` from `test_c10d_common.py` to a new file `test_store.py`. Additionally, unused import/initialization statements are removed from `test_c10d_common.py`, and the minimal set of import/initialization statements are used for `test_store.py`. Also, this changes `.jenkins/pytorch/multigpu-test.sh`, `.jenkins/pytorch/win-test-helpers/test_distributed.bat`, and `test/run_test.py` to include the new `test_store.py`. Testing All commands shown are run on an AI AWS cluster. I check the Store tests: ``` python test/distributed/test_store.py ``` I also check `test_c10d_common.py` since it is the source of the refactored code. In addition, I check `test_c10d_nccl.py` and `test_c10d_gloo.py` since they import from `test_c10d_common.py`; those two should be the only test files depending on `test_c10d_common.py`. ``` python test/distributed/test_c10d_common.py python test/distributed/test_c10d_nccl.py python test/distributed/test_c10d_gloo.py ``` `test_c10d_gloo.py` produces warnings about how using sparse tensors in TorchScript is experimental, but the warnings do not result from this PR's changes. Testing Issues (To Be Revisited) ``` WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py ``` Running the above command fails three tests (written as `[Test]`: `[Error]`): - `ProcessGroupGlooWrapperTest.test_collective_hang`: `RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.200.24.101]:15580` - `CommTest.test_broadcast_coalesced_gloo_cuda`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54` - `CommTest.test_sequence_num_incremented_gloo_default`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54` However, running each of the following yields no errors: ``` WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_collective_hang WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_broadcast_coalesced_gloo_cuda WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_sequence_num_incremented_gloo_default ``` This suggests the existence of some inadvertent state dependency between tests (e.g. improper cleanup). I have not explored this further yet. In particular, I do not have a solid understanding of the tests to be able to explain why using `pytest` and `gpurun` induces the failure (since notably, running the `.py` directly shows no issue). Similarly, running the following yields 47 errors: ``` WORLD_SIZE=4 BACKEND=nccl gpurun pytest test/distributed/test_c10d_nccl.py ``` The errors seem to all be simply complaining about the usage of `fork()` instead of `spawn()` for CUDA multiprocessing. Though, most of the tests in `test_c10d_nccl.py` ask for at least 2 CUDA devices, so I think that the `gpurun` is warranted (assuming that the test file does not need to be run partially on different machines). Both `test_c10d_common.py` and `test_store.py` work fine with `pytest`. Other Notes I noticed that `torch.distributed` is imported both as `dist` and as `c10d` and that `c10d` is used throughout the Store tests. I was curious if this is intentional (as opposed to using `dist` to refer to `torch.distributed`). Also, the original [issue](https://github.com/pytorch/pytorch/issues/55340) suggests that the Store tests do not use multiprocessing, but I saw that `torch.multiprocessing` is still used in `TCPStoreTest`. The links for the Store files in the `CONTRIBUTING.md` [file](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) are broken. This can fixed in a separate PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/59271 Reviewed By: jbschlosser, mrshenli Differential Revision: D28856920 Pulled By: andwgu fbshipit-source-id: 630950cba18d34e6b5de661f5a748f2cddc1b446	2021-06-03 10:53:33 -07:00

6 Commits