Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19147
After #14809 was merged, there is no longer a need for getGroupRank.
Every ProcessGroup object has its own rank and size fields which are
accurate for the global group as well as subgroups.
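For illustration, a minimal frontend sketch of querying a subgroup's own rank and size; the backend, init method, subgroup ranks, and the `dist.GroupMember` access path below are assumptions for the example:
```
import torch.distributed as dist

# Each ProcessGroup carries its own rank/size, so they can be queried per
# group. Backend, init method, and the two-rank subgroup are illustrative.
dist.init_process_group(backend="gloo", init_method="env://")
subgroup = dist.new_group(ranks=[0, 2])
if subgroup is not dist.GroupMember.NON_GROUP_MEMBER:
    rank_in_group = dist.get_rank(group=subgroup)        # rank within the subgroup
    size_of_group = dist.get_world_size(group=subgroup)  # size of the subgroup
```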
Strictly speaking, removing a function in a minor version bump is a big
no-no, but I highly doubt this was ever used outside of
`torch.distributed` itself. This will result in a compile error for
folks who have subclassed the ProcessGroup class though.
If this is a concern we can delay merging until a later point in time,
but eventually this will need to be cleaned up.
Differential Revision: D14889736
fbshipit-source-id: 3846fe118b3265b50a10ab8b1c75425dad06932d
Summary:
Previously, MPI process groups were created for all processes, even if
they were not part of the created group. Their MPI_Comm member field
would be MPI_COMM_NULL and they would ignore any calls. Their rank and
size were identical to that of the global process group, and they had
special groupRank and groupSize fields to capture the _real_ rank and size.
This also meant asymmetry with other process group types, where creating
a new group would either return the process group OR
GroupMember.NON_GROUP_MEMBER. For the MPI process group, it would always
return a process group, and an additional check was needed to verify
whether or not a process was indeed part of that process group.
This commit changes this such that every MPI process group is a valid
process group, and by extension that we no longer have to special case
MPI to determine whether or not a process is part of a group. Now, if
the value returned by `new_group` is GroupMember.NON_GROUP_MEMBER, the
process is not a member, otherwise it is.
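A hedged sketch of the resulting backend-agnostic membership check (the MPI backend and group ranks are illustrative):
```
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")
group = dist.new_group(ranks=[0, 1])  # every rank calls this collectively
if group is dist.GroupMember.NON_GROUP_MEMBER:
    # This process is not a member of the new group; skip its collectives.
    pass
else:
    t = torch.ones(1)
    dist.all_reduce(t, group=group)
```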
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14809
Differential Revision: D14887937
Pulled By: pietern
fbshipit-source-id: c5bf86d3b33e524cc5004ee68e30103178fa491d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18845
This adds a few CPU-only test cases for the reducer class.
Reviewed By: mrshenli
Differential Revision: D14768432
fbshipit-source-id: c008a52206826304e634a95bc14167ed94c97662
Summary:
This commit adds the `c10d::Reducer` class that hooks into autograd
and performs gradient bucketing and reduction. These are the core
parts of `nn.parallel.DistributedDataParallel` that up to now were
only usable for CUDA models.
This should enable the following:
* Distributed data parallelism for models defined using the C++ frontend.
* Allow overlap of gradient computation and reduction for non-CUDA models.
* Enable distributed data parallelism for models with some unused parameters.
This does not include any logic for computing bucket assignment, which
can be done separately; either by observing autograd execution order
(this is what Apex does), or by assigning buckets based on some
maximum byte size, or both.
Also see #17757 and #13273.
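As an illustration of the maximum-byte-size strategy mentioned above (a sketch only, not the algorithm shipped with the Reducer):
```
def compute_bucket_assignment(tensors, bucket_cap_bytes=1 << 20):
    # Greedily fill buckets up to bucket_cap_bytes, returning lists of
    # tensor indices. Purely illustrative; ordering concerns are ignored.
    buckets, current, current_bytes = [], [], 0
    for index, tensor in enumerate(tensors):
        nbytes = tensor.numel() * tensor.element_size()
        if current and current_bytes + nbytes > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```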
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18251
Reviewed By: mrshenli
Differential Revision: D14571899
Pulled By: pietern
fbshipit-source-id: 20f95eefd288dfe8cfffe0a28ca22fa7c9c3cd4c
Summary:
This is not used anywhere and wasn't cleaned up prior to 1.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17718
Reviewed By: janewangfb
Differential Revision: D14355154
Pulled By: pietern
fbshipit-source-id: f8ff3c8f50cd6365b369a5c5b85d72d8940df048
Summary:
This PR fixes a race condition for the TCP init method, where the master rank can exit earlier than the slave ranks, causing the TCP daemon thread to be shut down before the other ranks are able to access it.
The fix lets every rank (process) write a special key to the store to mark that it has completed (and is thus about to exit). The master rank (which is the server) always waits until all ranks have completed before completing itself.
This should fix: https://github.com/pytorch/pytorch/issues/15638
Tested using the repro of https://github.com/pytorch/pytorch/issues/15638 and it works fine. Also, test_distributed and test_c10d should already have this coverage.
I had to make the rendezvous test in c10d use a world size of 1, since it is single-process code.
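A minimal sketch of this completion protocol on top of the Python `Store` bindings; the key names and the rank-0-is-server assumption are illustrative:
```
def finish_barrier(store, rank, world_size):
    # Every rank marks itself as done right before exiting.
    store.set("rank_{}_done".format(rank), "1")
    if rank == 0:
        # The server rank waits for every rank to check in, keeping the
        # TCP daemon alive until all other ranks have finished.
        store.wait(["rank_{}_done".format(i) for i in range(world_size)])
```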
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684
Differential Revision: D13570904
Pulled By: teng-li
fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
Summary:
This fixed two things:
(1) The NCCL backend doesn't support 2 or more groups. This is because we need a group name in the ProcessGroupNCCL class to keep track of the process group ID within that group name, as well as the NCCL unique ID within that group name and process group ID. Otherwise, different processes will create different NCCL PGs in different orders and can clash on these names. This fixes the NCCL problem.
(2) When using new_group, each rank should enter this function and update its global group name counter, to ensure that every rank always operates on the same group name.
With both fixes: repro code in: https://github.com/pytorch/pytorch/issues/14528 should work with both NCCL and Gloo backends.
```
tengli@learnfair096:~$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=30000 ~/github_issues/nccl_group.py
rank: 0 - val: 6.0
rank: 2 - val: 6.0
rank: 3 - val: 6.0
rank: 1 - val: 6.0
rank: 4 - val: 22.0
rank: 6 - val: 22.0
rank: 5 - val: 22.0
rank: 7 - val: 22.0
```
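For reference, a hedged sketch of the usage pattern required by fix (2): every rank calls `new_group` for every subgroup, in the same order (the 8-rank split mirrors the repro above; RANK/WORLD_SIZE are assumed to come from the launcher):
```
import torch.distributed as dist

# Launched e.g. via torch.distributed.launch, which sets the env vars.
dist.init_process_group(backend="nccl", init_method="env://")
# All ranks execute both new_group calls, in the same order, so the global
# group name counter stays in sync across processes.
group_a = dist.new_group(ranks=[0, 1, 2, 3])
group_b = dist.new_group(ranks=[4, 5, 6, 7])
```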
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14529
Differential Revision: D13253434
Pulled By: teng-li
fbshipit-source-id: 8eb45882b996b06d951fc9a306d5de86a42e8b84
Summary:
This function is only implemented for the subclasses where it makes
sense. If it's not overridden it will throw an error. Having this
function removes the need for a pointer passing hack to pass the
source rank of a recv operation back to the caller. Instead, the
caller can now call `source_rank` on the work object and achieve
the same result.
Closes #11804.
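A hedged sketch at the process group binding level; the names `recv_anysource` and `source_rank` follow the description here, and the exact signatures are assumptions:
```
def recv_from_any(pg, tensor, tag=0):
    # pg: a c10d ProcessGroup; tensor: a preallocated receive buffer.
    work = pg.recv_anysource([tensor], tag)
    work.wait()
    return work.source_rank()  # rank of the sender, read off the work object
```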
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14453
Differential Revision: D13230898
Pulled By: pietern
fbshipit-source-id: ef38f48bfaca8ef9a364e5be122951bafc9f8e49
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14297
Adds option structs for allgather and barrier so that we have one
for every collective. Adds a timeout member field to every one of these
so that we can support per-operation timeouts.
Uses a default-constructed options struct for every collective process
group function exposed to Python.
Reviewed By: manojkris
Differential Revision: D13158474
fbshipit-source-id: 3d28977de2f2bd6fc2f42ba3108b63a429338906
Summary:
The build was picking up the empty stub header instead of the generated
one. Because of the large number of include paths we end up passing to
the compiler it is brittle to have both an empty stub file and a
generated file and expect the compiler to pick up the right one.
With the recent change to compile everything from a single CMake run we
can now use native CMake facilities to propagate macros that indicate
backend support. The target_compile_definitions stanzas with the
INTERFACE flag ensure that these macros are set only for downstream
consumers of the c10d target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14244
Reviewed By: teng-li
Differential Revision: D13144293
Pulled By: pietern
fbshipit-source-id: f49324220db689c68c126b159f4f00a8b9bc1252
Summary:
Addresses https://github.com/pytorch/pytorch/issues/14063
This is a lot easier to use and follows the NCCL convention, since NCCL provides the similar NCCL_SOCKET_IFNAME.
We can document this better later.
Tested on my two hosts; it works out of the box.
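A hedged usage sketch, assuming the Gloo analogue is the GLOO_SOCKET_IFNAME environment variable (interface name and addresses are illustrative):
```
import os
import torch.distributed as dist

# Pick the network interface before initializing the process group,
# mirroring NCCL's NCCL_SOCKET_IFNAME convention.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"
dist.init_process_group(backend="gloo",
                        init_method="tcp://10.0.0.1:23456",
                        rank=0, world_size=2)
```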
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14065
Differential Revision: D13095522
Pulled By: teng-li
fbshipit-source-id: 131dff212626f1aab7e752427f1b684845b909dc
Summary:
This addresses: https://github.com/pytorch/pytorch/issues/11874
and gives us the identical file init_method behavior as the previous THD file init.
Also, the FileStore::add bug is pretty annoying.
Two bugs:
(1) Add doesn't append to the end of the file.
(2) Cache doesn't get updated.
Both are fixed and tests are covered.
I examined /tmp to ensure that all temp files are auto-deleted after test_c10d.py runs.
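For reference, a minimal sketch of the file init method (backend and path are illustrative); every rank passes the same shared file path:
```
import torch.distributed as dist

dist.init_process_group(backend="gloo",
                        init_method="file:///tmp/ddp_init_file",
                        rank=0, world_size=2)
```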
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13708
Reviewed By: pietern
Differential Revision: D12972810
Pulled By: teng-li
fbshipit-source-id: 917255390aa52845f6b0ad0f283875a7a704da48
Summary:
We only need this for the backward pass; for the forward cast, the non-fine-grained bucketing should be better since it's sequential anyway.
Tests should all be covered by the c10d test; the bucket size was reduced to make bucketing happen in the c10d test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13607
Differential Revision: D12944515
Pulled By: teng-li
fbshipit-source-id: d982e8dca2874c91d39b30b73a85bfbeb768c508
Summary:
Also add docs for get_backend, Backend, and reduce_op
Fixes #11803
cc pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830
Differential Revision: D9927991
Pulled By: SsnL
fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48
Summary:
When going to mixed precision fp16 training, DDP randomly hangs. Initially, I thought this smelled like a similar NCCL bug I filed a while ago. It turns out it's not. Again, I am seeing different rank processes with different bucket sizes. How could this even happen?
It turns out that take_tensors will generate a list of bucketed tensors in a non-deterministic order, because the key to the map is a pointer. An interesting bug to dig into and fix.
fp16 DDP training should now be fully working.
Also, added another fine-grained take_tensors helper that aims to improve the performance of DDP, with a TODO to replace the DDP take_tensors call with it.
Fixed: https://github.com/pytorch/pytorch/issues/12150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13496
Differential Revision: D12920985
Pulled By: teng-li
fbshipit-source-id: 26f3edae7be45a80fa7b2410a2e5a1baab212d9c
Summary:
The existing default timeout was set at 10 seconds, which is too low
for asynchronous tasks that depend on a barrier to resynchronize.
Having a single timeout for all operations is not ideal and this will
be addressed in future commits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13056
Reviewed By: teng-li
Differential Revision: D10558746
Pulled By: pietern
fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction
Added a test as well.
CI should cover both DDP and the unit tests.
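An illustrative sketch of the dedicated-stream pattern (not the c10d internals): run the memcpy on its own CUDA stream so it can overlap with work on the default stream.
```
import torch

def copy_on_side_stream(dst, src):
    # dst and src are CUDA tensors. The copy runs on a dedicated stream;
    # the current stream then waits for it before touching dst again.
    copy_stream = torch.cuda.Stream()
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        dst.copy_(src, non_blocking=True)
    torch.cuda.current_stream().wait_stream(copy_stream)
```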
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12954
Differential Revision: D10520069
Pulled By: teng-li
fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
Summary:
Fully working version, continuing on goldsborough's initial version.
Waiting on the stream guard to be merged before adding more stream performance logic into the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852
Differential Revision: D10468696
Pulled By: teng-li
fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
Summary:
This makes torch.distributed work for CPU-only builds.
Also added one more CI test case to cover the MPI CPU build.
All CI tests should cover this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11513
Differential Revision: D9784546
Pulled By: teng-li
fbshipit-source-id: 0976a6b0fd199670926f0273e17ad7d2805e42e7
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
Summary:
The next function I'm moving to C++ is `sync_params`. It is stacked on top of https://github.com/pytorch/pytorch/pull/9729, so some changes will go away when it lands and I rebase.
I also split code into a `.h` and `.cpp` file for better code organization.
cc pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9805
Differential Revision: D9688604
Pulled By: goldsborough
fbshipit-source-id: 4467104d3f9e2354425503b9e4edbd59603e20a8
Summary:
This will allow users to set a customized timeout option for the store.
Tested with my own debug print to make sure that C++ actually used the timeout.
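A hedged sketch, assuming the timeout is surfaced through the `timeout` keyword of `init_process_group` (backend and addresses are illustrative):
```
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(backend="gloo",
                        init_method="tcp://127.0.0.1:23456",
                        rank=0, world_size=2,
                        timeout=timedelta(minutes=5))  # custom store timeout
```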
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11265
Differential Revision: D9666164
Pulled By: teng-li
fbshipit-source-id: 4eb6441783da106a3fd59b95457e503e83e4640f
Summary:
Added MPI group support.
This makes all previous MPI group test cases pass.
Also, relaxed the required MPI thread level support by serializing different process groups' MPI ops; this serialization is required.
The build is fixed too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128
Differential Revision: D9602188
Pulled By: teng-li
fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality (a usage sketch follows this list)
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) The DDP CPU test is removed since c10d doesn't have this support yet, but this will be a very easy test to add back after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d
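A minimal sketch of the `env://` init method from item (2); the environment values below are illustrative:
```
import os

import torch.distributed as dist

# env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the
# environment (typically set by a launcher); the values here are illustrative.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
dist.init_process_group(backend="gloo", init_method="env://")
```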
**Now all the distributed tests, including c10d DDP, can pass with the c10d frontend API**
TODO (in a separate PR):
MPI subgroup support; once this is added, the CI group test will be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871
Differential Revision: D9554514
Pulled By: teng-li
fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
Summary:
Added Prefix Store support.
This will make groups backward compatible.
Tests are covered too.
```
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ ./FileStoreTest
Using temporary file: /tmp/testoglRl4
Using temporary file: /tmp/testepZIpB
Test succeeded
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ ./TCPStoreTest
Test succeeded
```
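A hedged sketch of how a prefix store scopes keys per group; the Python binding names (`FileStore`, `PrefixStore`) are assumed to mirror the C++ classes:
```
import torch.distributed as dist

base_store = dist.FileStore("/tmp/shared_store_file", 2)  # path illustrative
group_store = dist.PrefixStore("group0", base_store)      # namespaces keys under "group0"
group_store.set("ready", "1")
print(group_store.get("ready"))
```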
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10762
Differential Revision: D9484032
Pulled By: teng-li
fbshipit-source-id: 85754af91fe3f5605087c4a2f79ae930a9fd1387
Summary:
This will make the pybind version of the MPI PG work. The issue is that the tensor list would go out of scope before the MPI worker thread uses it, so we pass the vector by value instead.
Also added a recv_anysource pybind to make it work. The front-end API will wrap one level up with an int for this function, so taking a tensor should be the easiest way for now.
Also added an abort pybind and fixed the flaky test.
```
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ mpirun -np 8 ProcessGroupMPITest
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10606
Differential Revision: D9474393
Pulled By: teng-li
fbshipit-source-id: cca236c333656431e87d0d3573eeae9232c598b0
Summary:
Provided Python bindings for these four ops. Also provided an NCCL binding test.
Based on https://github.com/pytorch/pytorch/pull/10058
Please only review init.cpp and the test file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10159
Reviewed By: yf225
Differential Revision: D9323192
Pulled By: teng-li
fbshipit-source-id: b03822009d3a785ec36fecce2fc3071d23f9994e
Summary:
This PR depends on the tests added in #9670. It moves the first, tiny function from the c10d DDP to C++: `dist_broadcast_coalesced`. Let me know if `torch/csrc/distributed/c10d/ddp.h` will be a good place to put these rewritten functions.
pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9729
Differential Revision: D8985308
Pulled By: goldsborough
fbshipit-source-id: dc459fe9040273714044152063585e746974752f
This surfaces the options struct that can be passed to the
ProcessGroupGloo constructor to Python. By default, if no options struct
is passed at construction time, the Python bindings default to using a
struct with a TCP backed Gloo device that uses the machine's hostname to
resolve the IP address to bind to.
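A hedged construction sketch from Python; the binding names and constructor signatures (`TCPStore`, `ProcessGroupGloo`) are assumptions based on the description above:
```
import torch.distributed as dist

rank, world_size = 0, 2                                    # illustrative
store = dist.TCPStore("127.0.0.1", 29500, world_size, rank == 0)
# No options struct is passed, so the binding falls back to the default:
# a TCP-backed Gloo device resolved from the machine's hostname.
pg = dist.ProcessGroupGloo(store, rank, world_size)
```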
* [c10d] NCCL python binding and CI test, with bug fixes
* Addressed comments and further bug fix
* Made NCCL build optional, made C10D libc10d.a only
* Fixed tests so that NCCL pg won't run when not needed
* Addressed comments
* Build and install c10d from tools/build_pytorch_libs.sh
* Create initial Python bindings for c10d
* clang-format
* Switch link order to include more symbols
* Add bindings and tests for ProcessGroupGloo
* Add broadcast test
* Separate build flag for c10d
* Explicit PIC property
* Skip c10d tests if not available
* Remove c10d from Windows blacklist
Let it skip by itself because it won't be available anyway.
* Make lint happy
* Comments
* Move c10d module into torch.distributed
* Close tempfile such that it is deleted