Commit Graph

588 Commits

Author SHA1 Message Date
c1b92f518d Remove ProcessGroup::getGroupRank (#19147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19147

After #14809 was merged there is no longer a need for getGroupRank.
Every ProcessGroup object has its own rank and size fields which are
accurate for the global group as well as subgroups.

Strictly speaking removing a function in a minor version bump is a big
no-no, but I highly doubt this was ever used outside of
`torch.distributed` itself. This will result in a compile error for
folks who have subclassed the ProcessGroup class though.

If this is a concern we can delay merging until a later point in time,
but eventually this will need to be cleaned up.

Differential Revision: D14889736

fbshipit-source-id: 3846fe118b3265b50a10ab8b1c75425dad06932d
2019-04-11 09:17:40 -07:00
ce166d949d ProcessGroupMPI exists only if it is valid (#14809)
Summary:
Previously, MPI process groups were created for all processes, even if
they were not part of the created group. Their MPI_Comm member field
would be MPI_COMM_NULL and they would ignore any calls. Their rank and
size were identical to that of the global process group and they had a
special groupRank and groupSize field to capture the _real_ rank.

This also meant assymetry with other process group types, where creating
a new group would either return the process group OR
GroupMember.NON_GROUP_MEMBER. For the MPI process group, it would always
return a process group and an additional check was needed to verify
whether or not a process was indeed part of a process group or not.

This commit changes this such that every MPI process group is a valid
process group, and by extension that we no longer have to special case
MPI to determine whether or not a process is part of a group. Now, if
the value returned by `new_group` is GroupMember.NON_GROUP_MEMBER, the
process is not a member, otherwise it is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14809

Differential Revision: D14887937

Pulled By: pietern

fbshipit-source-id: c5bf86d3b33e524cc5004ee68e30103178fa491d
2019-04-10 21:36:35 -07:00
ce92cf9bd1 Add tests for reducer class (#18845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18845

This adds a few CPU only test cases for the reducer class.

Reviewed By: mrshenli

Differential Revision: D14768432

fbshipit-source-id: c008a52206826304e634a95bc14167ed94c97662
2019-04-05 09:07:29 -07:00
bdfdf6c2b9 C++ handler for gradient reduction (#18251)
Summary:
This commit adds the `c10d::Reducer` class that hooks into autograd
and performs gradient bucketing and reduction. These are the core
parts of `nn.parallel.DistributedDataParallel` that up to now were
only usable for CUDA models.

This should enable the following:

* Distributed data parallelism for models defined using the C++ frontend.
* Allow overlap of gradient computation and reduction for non-CUDA models.
* Enable distributed data parallelism for models with some unused parameters.

This does not include any logic for computing bucket assignment, which
can be done separately; either by observing autograd execution order
(this is what Apex does), or by assigning buckets based on some
maximum byte size, or both.

Also see #17757 and #13273.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18251

Reviewed By: mrshenli

Differential Revision: D14571899

Pulled By: pietern

fbshipit-source-id: 20f95eefd288dfe8cfffe0a28ca22fa7c9c3cd4c
2019-04-01 14:30:02 -07:00
2e753fc753 Remove unused parameter in ProcessGroupGloo (#17718)
Summary:
This is not used anywhere and wasn't cleaned up prior to 1.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17718

Reviewed By: janewangfb

Differential Revision: D14355154

Pulled By: pietern

fbshipit-source-id: f8ff3c8f50cd6365b369a5c5b85d72d8940df048
2019-03-11 18:01:20 -07:00
b4bc55beef TCP init method race condition fix (#15684)
Summary:
This PR fixes a race condition for TCP init method, when master rank can exit earlier than slave ranks and thus the TCP daemon thread gets shutdown before other slaves are able to access it.

This will let every rank (process) write a special key to the store to mark that they are completed (and thus about to exit).  The master rank (who is the server) will always wait until all the ranks to complete before complete itself.

This should fix: https://github.com/pytorch/pytorch/issues/15638

Tested using the repro of https://github.com/pytorch/pytorch/issues/15638 and works fine. Also test_distributed and test_c10d should have already had this coverage.

I had to make rendezvous test in c10d the world size of 1, since it is a single process code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684

Differential Revision: D13570904

Pulled By: teng-li

fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
2019-01-18 02:29:38 -08:00
d6c53328f9 Large scale fix of python-related files in torch/csrc/
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14515

Differential Revision: D13247966

Pulled By: goldsborough

fbshipit-source-id: 7a127c508fc576a7a92626dd6b729f660162d628
2018-12-07 13:04:46 -08:00
9127ab3866 Fixed new_group won't work for two or more different rank groups (#14529)
Summary:
This fixed two things:

(1) NCCL group doesn't support 2 or more groups, this is because, we need a group name in ProcessGroupNCCL class to keep track of the ProcessGroup ID within that group name, and also the NCCL unique ID within that group name and process group ID.  Otherwise, different processes will create different NCCL PG in different orders and can clash on these names.  This will fix the NCCL problem.

(2)  When using new_group, each rank should enter this function and update its global group name counter to ensure that every rank always operates on the same group name.

With both fixes: repro code in: https://github.com/pytorch/pytorch/issues/14528 should work with both NCCL and Gloo backends.

```
tengli@learnfair096:~$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=30000 ~/github_issues/nccl_group.py
rank: 0 - val: 6.0
rank: 2 - val: 6.0
rank: 3 - val: 6.0
rank: 1 - val: 6.0
rank: 4 - val: 22.0
rank: 6 - val: 22.0
rank: 5 - val: 22.0
rank: 7 - val: 22.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14529

Differential Revision: D13253434

Pulled By: teng-li

fbshipit-source-id: 8eb45882b996b06d951fc9a306d5de86a42e8b84
2018-11-29 19:57:47 -08:00
4ec6bd7356 Add sourceRank() to ProcessGroup::Work (#14453)
Summary:
This function is only implemented for the subclasses where it makes
sense. If it's not overridden it will throw an error. Having this
function removes the need for a pointer passing hack to pass the
source rank of a recv operation back to the caller. Instead, the
caller can now call `source_rank` on the work object and achieve
the same result.

Closes #11804.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14453

Differential Revision: D13230898

Pulled By: pietern

fbshipit-source-id: ef38f48bfaca8ef9a364e5be122951bafc9f8e49
2018-11-29 09:16:53 -08:00
03864b7b11 Add option structs and timeout field (#14297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14297

Adds option structs for allgather and barrier such that we have one
for every collective. Add timeout member field to every one of these
such that we can support per operation timeouts.

Use default constructed options struct for every collective process
group function exposed to Python.

Reviewed By: manojkris

Differential Revision: D13158474

fbshipit-source-id: 3d28977de2f2bd6fc2f42ba3108b63a429338906
2018-11-27 10:46:38 -08:00
91c0b7159a Remove header generated at configuration time (#14244)
Summary:
The build was picking up the empty stub header instead of the generated
one. Because of the large number of include paths we end up passing to
the compiler it is brittle to have both an empty stub file and a
generated file and expect the compiler to pick up the right one.

With the recent change to compile everything from a single CMake run we
can now use native CMake facilities to propagate macros that indicate
backend support. The stanzas target_compile_definitions with the
INTERFACE flag ensure that these macros are set only for downstream
consumers of the c10d target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14244

Reviewed By: teng-li

Differential Revision: D13144293

Pulled By: pietern

fbshipit-source-id: f49324220db689c68c126b159f4f00a8b9bc1252
2018-11-21 08:45:08 -08:00
45fd77d3b7 Adding GLOO_SOCKET_IFNAME env to allow user set gloo device (#14065)
Summary:
Address https://github.com/pytorch/pytorch/issues/14063

This is a lot easier to use, follow the NCCL convention since they provide the similar NCCL_SOCKET_IFNAME.

We can later document this better.

Tested on my two hosts, and work out of the box
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14065

Differential Revision: D13095522

Pulled By: teng-li

fbshipit-source-id: 131dff212626f1aab7e752427f1b684845b909dc
2018-11-15 22:33:56 -08:00
97036d3c30 FileStore auto deletes file and FileStore::add bug fix (#13708)
Summary:
This addressed: https://github.com/pytorch/pytorch/issues/11874

and we will have the identical file init_method behavior as the previous THD file init.

Also the FileStore::add bug is pretty annoying.

Two bugs:
(1) Add doesn't append to the end of the file.
(2) Cache doesn't get updated.

Both are fixed and tests are covered.

I examined the /tmp to ensure that all temp files are auto deleted after test_c10d.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13708

Reviewed By: pietern

Differential Revision: D12972810

Pulled By: teng-li

fbshipit-source-id: 917255390aa52845f6b0ad0f283875a7a704da48
2018-11-14 01:34:22 -08:00
1413dd4bfc Added the finer bucketing option for DDP (#13607)
Summary:
We only need this for backward, for FWD cast, the non-fine-grained bucketing should be better since it's sequential anyway.

Test should be covered all by c10d test, reduced bucket size to make bucketing happen in c10d test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13607

Differential Revision: D12944515

Pulled By: teng-li

fbshipit-source-id: d982e8dca2874c91d39b30b73a85bfbeb768c508
2018-11-07 12:00:55 -08:00
044d00516c Rename DistBackend -> Backend (#11830)
Summary:
Also add docs for get_backend, Backend, and reduce_op

fixes #11803

cc The controller you requested could not be found. pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830

Differential Revision: D9927991

Pulled By: SsnL

fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48
2018-11-07 11:58:12 -08:00
74819087de Mixed precision DDP hang fix and fine-grained option for DDP perf (#13496)
Summary:
When go to mixed precision fp16 training, DDP randomly hangs.  Initially, I thought this smells like a similar NCCL bug I filed a while ago. It turns out it's not. Again, I am seeing different rank process has different size.  How could this even happen?

It turns out that take_tensors will generate a list of bucketed tensors in an un deterministic order, because, the key to the map is a pointer.  An interesting bug digging and fix.

Now fp16 DDP training should be fully working now.

Also, added another take_tensor fine grained helper that aims to improve the performance of DDP, making it a TODO to replace the DDP take_tensors with that.

Fixed: https://github.com/pytorch/pytorch/issues/12150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13496

Differential Revision: D12920985

Pulled By: teng-li

fbshipit-source-id: 26f3edae7be45a80fa7b2410a2e5a1baab212d9c
2018-11-05 16:22:15 -08:00
8c182cd89e Add overload of ProcessGroup.allreduce with list of tensors (#13576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13576

TSIA

Reviewed By: SsnL

Differential Revision: D12923457

fbshipit-source-id: 7824490548edbacac3cda81c7500bd1f851c6093
2018-11-05 11:56:49 -08:00
526460fc8b Use default timeout of 30 minutes for gloo backend (#13056)
Summary:
The existing default timeout was set at 10 seconds, which is too low
for asynchronous tasks that depend on a barrier to resynchronize.
Having a single timeout for all operations is not ideal and this will
be addressed in future commits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13056

Reviewed By: teng-li

Differential Revision: D10558746

Pulled By: pietern

fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
2018-10-25 16:35:53 -07:00
c250f6f3d5 DDP perf improvement: move sync_reduction to C++, dedicated CUDA streams for memcpy (#12954)
Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction

Added test as well.

CI should cover both DDP and unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12954

Differential Revision: D10520069

Pulled By: teng-li

fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
2018-10-24 21:37:13 -07:00
8d3e7e2fcb Move DDP queue_reduction to C++ (#12852)
Summary:
fully working version by using continuing on goldsborough 's initial version.

waiting on the stream guard to be merged before adding more stream perf logics into the c++ version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852

Differential Revision: D10468696

Pulled By: teng-li

fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
2018-10-22 16:07:46 -07:00
7535d98ec4 Add message tag parameter to send/recv
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11490

Reviewed By: teng-li

Differential Revision: D9828116

Pulled By: pietern

fbshipit-source-id: 98be1ae84b6763ffb329e63c030c5e3ec0e748b7
2018-09-14 10:55:37 -07:00
6dcdbd3a1d Make C10d support CPU only build (#11513)
Summary:
This makes torch.distributed works for CPU only build.

Also added one more CI test case to cover MPI CPU build.
All CI tests should cover this change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11513

Differential Revision: D9784546

Pulled By: teng-li

fbshipit-source-id: 0976a6b0fd199670926f0273e17ad7d2805e42e7
2018-09-11 22:10:34 -07:00
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`

Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00
01930a3145 Move sync_params to C++ (#9805)
Summary:
The next function I'm moving to C++ is `sync_params`. It is stacked on top of https://github.com/pytorch/pytorch/pull/9729, so some changes will go away when it lands and I rebase.

I also split code into a `.h` and `.cpp` file for better code organization.

The controller you requested could not be found. pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9805

Differential Revision: D9688604

Pulled By: goldsborough

fbshipit-source-id: 4467104d3f9e2354425503b9e4edbd59603e20a8
2018-09-07 12:56:40 -07:00
ec195129ec Adding setTimeout option in Store (#11265)
Summary:
This will allow users to set customized timeout option for the store.

Tested by my own debug print to make sure that C++ actually used the timeout
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11265

Differential Revision: D9666164

Pulled By: teng-li

fbshipit-source-id: 4eb6441783da106a3fd59b95457e503e83e4640f
2018-09-06 12:55:50 -07:00
3791bd12c8 PT1 Release Milestone No.2 MPI Group Support with all tests passed (#11128)
Summary:
Added MPI group support.
And this will make all previous group test cases of MPI passed.

Also, release the MPI thread level support by serializing different PG's MPI ops. This is required.

The build is fixed too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128

Differential Revision: D9602188

Pulled By: teng-li

fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
2018-08-31 12:39:56 -07:00
56539f5fe1 PT1 Distributed Release MileStone No.1 - Completed Distributed Package and CI tests (#10871)
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py' is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d

**Now all the distributed test including c10d DDP can pass with the c10d frontend API**

TODO: (in a separate PR)
MPI subgroup support, once this is added, CI group test will be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871

Differential Revision: D9554514

Pulled By: teng-li

fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
2018-08-29 12:55:57 -07:00
df2d48b42c Added PrefixStore, pybind, test for group backward compatibility (#10762)
Summary:
Added Prefix Store support.

This will make group be backward compatible.

Test is covered too.
```
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ ./FileStoreTest
Using temporary file: /tmp/testoglRl4
Using temporary file: /tmp/testepZIpB
Test succeeded
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ ./TCPStoreTest
Test succeeded
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10762

Differential Revision: D9484032

Pulled By: teng-li

fbshipit-source-id: 85754af91fe3f5605087c4a2f79ae930a9fd1387
2018-08-23 18:10:37 -07:00
44b47fd7f3 Working pybind version of MPI process group and abort() pybind (#10606)
Summary:
This will make pybind version of MPI PG work. The issue is the scope of the tensor list won't be available for the MPI worker thread. So we pass the vector by value instead.

Also added recv_anysource pybind to make it work. The front-end API will wrap one level up with an int for this function. So taking a tensor should be the easiest way for now.

Also added abort pybind and fixed the flaky test.
```
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ mpirun -np 8 ProcessGroupMPITest
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10606

Differential Revision: D9474393

Pulled By: teng-li

fbshipit-source-id: cca236c333656431e87d0d3573eeae9232c598b0
2018-08-22 18:26:04 -07:00
b6cc65afea Send, Recv, RecvAnysource, Barrier Op for MPI PG and Python Bindings (#10227)
Summary:
Based on: https://github.com/pytorch/pytorch/pull/10199
Added:
(1) send, recv, recvanysource, and barrier for MPI process group.
(2) python binding
(3) testing

Please review: 2e64f5d675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10227

Reviewed By: ailzhang

Differential Revision: D9327138

Pulled By: teng-li

fbshipit-source-id: 80496714550a3ca498eb474465ddbd1b8d657d49
2018-08-14 20:10:11 -07:00
b69b1c477b Adding python binding for MPI process group (#10199)
Summary:
Based on https://github.com/pytorch/pytorch/pull/10159

Please review ProcessGroupMPI.cpp/hpp and init.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10199

Reviewed By: yf225

Differential Revision: D9324027

Pulled By: teng-li

fbshipit-source-id: 2dd524bee0c7ca8f9594ec3b4f3ebbbb608df337
2018-08-14 15:56:33 -07:00
3c39e857ca Python binding for reduce,allgather,scatter,gather ops and python tests (#10159)
Summary:
Provided python binding for these four ops. Also provided nccl binding test.

Based on https://github.com/pytorch/pytorch/pull/10058

Please only review init.cpp, and test file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10159

Reviewed By: yf225

Differential Revision: D9323192

Pulled By: teng-li

fbshipit-source-id: b03822009d3a785ec36fecce2fc3071d23f9994e
2018-08-14 14:24:57 -07:00
7d2bda7588 Move DDP broadcast coalesced to C++ (#9729)
Summary:
This PR depends on the tests added in #9670. It moves the first, tiny function from the c10d DDP to C++: `dist_broadcast_coalesced`. Let me know if ` torch/csrc/distributed/c10d/ddp.h` will be a good place to put these rewritten functions.

pietern The controller you requested could not be found. apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9729

Differential Revision: D8985308

Pulled By: goldsborough

fbshipit-source-id: dc459fe9040273714044152063585e746974752f
2018-07-31 19:54:21 -07:00
709c300437 [c10d] Configurable number of algorithm entries per key (#8765) 2018-06-21 14:30:55 -07:00
3da27312bb Export ProcessGroupGloo options to Python (#8664)
This surfaces the options struct that can be passed to the
ProcessGroupGloo constructor to Python. By default, if no options struct
is passed at construction time, the Python bindings default to using a
struct with a TCP backed Gloo device that uses the machine's hostname to
resolve the IP address to bind to.
2018-06-20 09:08:06 -07:00
61c96811be [c10d] NCCL python binding and CI test, with bug fixes (#8357)
* [c10d] NCCL python binding and CI test, with bug fixes

* Addressed comments and further bug fix

* Made NCCL build optional, made C10D libc10d.a only

* Fixed tests so that NCCL pg won't run when not neeeded

* Addressed comments
2018-06-19 13:02:39 -07:00
5484a197d9 [c10d] Convenience wrappers for collective functions (#8292)
* [c10d] Add convenience wrappers

* Release GIL
2018-06-12 09:05:16 -07:00
695d40efc2 Create initial Python bindings for c10d (#8119)
* Build and install c10d from tools/build_pytorch_libs.sh

* Create initial Python bindings for c10d

* clang-format

* Switch link order to include more symbols

* Add bindings and tests for ProcessGroupGloo

* Add broadcast test

* Separate build flag for c10d

* Explicit PIC property

* Skip c10d tests if not available

* Remove c10d from Windows blacklist

Let it skip by itself because it won't be available anyway.

* Make lint happy

* Comments

* Move c10d module into torch.distributed

* Close tempfile such that it is deleted
2018-06-08 12:59:51 -07:00