pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
atalman	4ece056791	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``. For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``. We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-19 03:52:26 +00:00
PyTorch MergeBot	7622e29a37	Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 )" This reverts commit eecee5863e698d19458b33df7bfecbda0a04557a. Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))	2025-02-18 22:23:35 +00:00
atalman	eecee5863e	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``. For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``. We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-14 21:23:19 +00:00
PyTorch MergeBot	e06ee4aa9f	Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 )" This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6. Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))	2025-02-14 16:44:46 +00:00
atalman	06f4a5c0e5	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``. For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``. We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-14 15:29:59 +00:00
Nikita Shulga	df5e232563	[BE] Delete NCCL slimming (#146943 ) It was added by https://github.com/pytorch/pytorch/pull/35843 and served its purpose when everything was linked statically into libtorch_cuda.so, but it is no longer relevant for any of our releases, as nccl is now a dynamic dependency of libtorch_cuda.so. Besides, it does not work with the CXX11 ABI anyway, and it creates problems with newer versions of NCCL when two `collectives.o` files are packaged into the library archive. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146943 Approved by: https://github.com/Skylion007, https://github.com/atalman	2025-02-12 00:35:55 +00:00
Aaron Gokaslan	93f2a64d4d	Update submodule NCCL to v2.18.3 (#104993 ) Update NCCL submodule to v2.18.3, which fixes numerous bugs and performance issues, particularly on newer GPUs: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-18-3.html#rel_2-18-3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104993 Approved by: https://github.com/malfet	2023-08-18 23:43:01 +00:00
Huy Do	ee2ce3fef6	Set make max load when building libtorch (#89237 ) The nccl build still runs out of memory (OOM) sometimes when using `$(MAKE)`: ``` virtual memory exhausted: Cannot allocate memory Makefile:73: recipe for target '/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o' failed make[5]: *** [/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o] Error 1 make[5]: Leaving directory '/var/lib/jenkins/workspace/third_party/nccl/nccl/src/collectives/device' ``` * https://github.com/pytorch/pytorch/actions/runs/3476485191/jobs/5811758058 * https://github.com/pytorch/pytorch/actions/runs/3422228421/jobs/5702153639 So this tries to set the same limit here as when building with ninja. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89237 Approved by: https://github.com/malfet	2022-11-18 18:55:33 +00:00
Peter Bell	9a81da7ad1	Update NCCL to current master and remove patch step (#85367 ) The patch from #84245 has been upstreamed into NCCL, so the patch step is no longer required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85367 Approved by: https://github.com/ezyang	2022-09-21 19:23:49 +00:00
Peter Bell	fa86874bbd	Fix intermittent link errors in NCCL build (#84245 ) Should fix #13362 and fix #83790. I think I've discovered the root cause of the intermittent nccl link failures. If we look at the variable name in the redefinition error: ``` _02021d91_11_sendrecv_cu_0bc7b9c8_11152 ``` this is the name of the file being compiled plus some form of unique ID. As part of NCCL's build process, the same file is compiled multiple times with different macro definitions depending on which operator and dtype are being compiled, e.g. ``` nvcc -DNCCL_OP=0 -DNCCL_TYPE=0 -dc sendrecv.cu -o sendrecv_sum_i8.o ``` Since the filename parts are the same, if the unique IDs also happen to collide, the entire identifier collides and the link fails. So the fix here is to generate a unique `.cu` file for each object file. I've implemented this as a `.patch` file that gets applied from our cmake code, though forking nccl instead would be cleaner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84245 Approved by: https://github.com/janeyx99, https://github.com/malfet	2022-09-13 19:55:52 +00:00
Shen Li	56a37ea1a6	Set default value for nccl make MAX_JOBS if ProcessorCount returns 0 (#84231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84231 Approved by: https://github.com/malfet, https://github.com/rohan-varma	2022-08-30 16:06:34 +00:00
Peter Bell	2000eba454	NCCL: Re-enable parallel builds (#83696 ) Since #83173 was merged, I have noticed some CI jobs being slowed down by the nccl build step, e.g. if there are no C++ changes then sccache compiles everything else very quickly and nccl becomes the limiting factor. This re-enables parallel builds with some safeguards to protect against oversubscription. When `make` is the parent build system, we can use `$(MAKE)` and the `make` jobserver will coordinate job allocation with the sub-process. For other build systems, this calls `make` with the `-l` flag, which should prevent it from launching jobs when the system load average is already too high. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83696 Approved by: https://github.com/malfet	2022-08-25 05:16:01 +00:00
Jane Xu	37d3db7579	Deletes CCACHE_DISABLE and SCCACHE_DISABLE from nccl.cmake (#84007 ) Looking through the code and online, it does not look like these variables actually change anything. Regardless, this change was instituted to fix https://github.com/pytorch/pytorch/issues/13362, but we are again running into similar issues even with the workaround: see https://github.com/pytorch/pytorch/issues/83790. Thus, since 1. this change isn't preventing flakiness 2. these variables do not seem used anywhere in pytorch/pytorch nor mozilla/sccache we should remove this confusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84007 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi	2022-08-24 21:43:12 +00:00
Nikita Shulga	3a9ae518f2	Skip NCCL slimming for cxx11 libtorch builds (#83959 ) Fixes https://github.com/pytorch/pytorch/issues/83887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83959 Approved by: https://github.com/atalman	2022-08-24 18:31:27 +00:00
Peter Bell	1c83ec8f61	Build nccl single-threaded (#83173 ) Closes #82888 This is a tentative fix. make is called by ninja, so it should already run in parallel with other jobs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83173 Approved by: https://github.com/malfet	2022-08-10 21:40:46 +00:00
Nikita Shulga	c08092fdf2	Update NCCL to v2.13.4-1 (#82775 ) Also, update the slimming script to include the two instances of net.o that the new library generates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82775 Approved by: https://github.com/ngimel	2022-08-04 19:36:45 +00:00
Nikita Shulga	7c298b8244	Fix objcopy version detection (#82774 ) By extending the regex to match any characters, not just the version. On Ubuntu the version string looks as follows: ``` $ objcopy --version GNU objcopy (GNU Binutils for Ubuntu) 2.30 ``` And on some CentOS systems it looks like: ``` $ objcopy --version GNU objcopy (GNU Binutils) 2.37 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/82774 Approved by: https://github.com/ngimel	2022-08-04 16:26:31 +00:00
Brian Vaughan	2eef1f27f8	Disable ccache for nccl builds (#62208 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62208 reverts https://github.com/pytorch/pytorch/pull/55814 which removed a workaround for: https://github.com/pytorch/pytorch/issues/13362 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D29935472 Pulled By: nairbv fbshipit-source-id: 7ce9cde1408f17153632036fd128814032739746	2021-07-27 08:07:26 -07:00
Eli Uriegas	b98f011cd4	cmake: Enable (s)ccache for nccl builds (#55814 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55814 I don't really know if the original issue is resolved but let's just check and see if this passes CI so that we can potentially get some speed up on our builds Signed-off-by: Eli Uriegas <eliuriegas@fb.com> Test Plan: Imported from OSS Reviewed By: walterddr Differential Revision: D27715734 Pulled By: seemethere fbshipit-source-id: a8f90774dfd25b0abf8e57283fe3591a8d8f3c4b	2021-04-13 14:49:25 -07:00
Nikita Shulga	8a574c7104	[Cmake] Drop quotation marks around `$ENV{MAX_JOBS}` (#44557 ) Summary: Solves `the '-j' option requires a positive integer argument` error on some systems when MAX_JOBS is not defined Pull Request resolved: https://github.com/pytorch/pytorch/pull/44557 Reviewed By: vkuzo Differential Revision: D23653511 Pulled By: malfet fbshipit-source-id: 7d86fb7fb6c946c34afdc81bf2c3168a74d00a1f	2020-09-11 12:57:11 -07:00
Nikita Shulga	4d431881d1	Control NCCL build parallelism via MAX_JOBS environment var (#44167 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44167 Reviewed By: walterddr, ngimel Differential Revision: D23522419 Pulled By: malfet fbshipit-source-id: 31b25a71fef3e470bdf382eb3698e267326fa354	2020-09-04 10:02:53 -07:00
Nikita Shulga	cf7e7909d5	NCCL must depend on librt (#41978 ) Summary: Since NCCL makes calls to shm_open/shm_close it must depend on librt on Linux This should fix `DSO missing from command line` error on some platforms Pull Request resolved: https://github.com/pytorch/pytorch/pull/41978 Reviewed By: colesbury Differential Revision: D22721430 Pulled By: malfet fbshipit-source-id: d2ae08ce9da3979daaae599e677d5e4519b080f0	2020-07-24 16:47:19 -07:00
Nikita Shulga	6a45584272	Remove `__nv_relfatbin` section from nccl_static library (#35843 ) Summary: The NCCL library is built using [CUDA separate compilation](https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/), which consists of building intermediate CUDA binaries and then linking them into GPU code that can be executed on the device. Intermediate CUDA code is stored in the `__nv_relfatbin` section, and code that can be launched is stored in `.nv_fatbin`. When `nvcc` is used to link an executable/shared library, it removes those intermediate binaries, but the default host linker is not aware of that, and therefore they are kept inside the host executable. Help the compiler by removing `__nv_relfatbin` sections from the object files inside `libnccl_static.a`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/35843 Test Plan: Build pytorch with CUDA and run `test_distributed.py` Differential Revision: D20882224 Pulled By: malfet fbshipit-source-id: f23dd4aa416518324cb38b9bd6846e73a1c7dd21	2020-04-06 18:23:08 -07:00
peter	45c9ed825a	Formatting cmake (to lowercase without space for if/elseif/else/endif) (#35521 ) Summary: Running commands: ```bash shopt -s globstar sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i CMakeLists.txt sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i caffe2/**/CMakeLists.txt sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i torch/**/CMakeLists.txt sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i c10/**/CMakeLists.txt sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake.in ``` We may further convert all the commands into lowercase according to the following issue: `77543bde41`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/35521 Differential Revision: D20704382 Pulled By: malfet fbshipit-source-id: 42186b9b1660c34428ab7ceb8d3f7a0ced5d2e80	2020-03-27 14:25:17 -07:00
Hong Xu	60c46dd4df	Let CMake handle NCCL detection instead of our handcrafted Python script. (#22930 ) Summary: --- How does the current code subsume all detections in the deleted `nccl.py`? - The dependency of `USE_NCCL` on the OS and `USE_CUDA` is handled as dependency options in `CMakeLists.txt`. - The main NCCL detection happens in `FindNCCL.cmake` (`8377d4b32c/cmake/Modules/FindNCCL.cmake`), which is called by `nccl.cmake` (`8377d4b32c/cmake/External/nccl.cmake`). When `USE_SYSTEM_NCCL` is false, the previous Python code defers the detection to `find_package(NCCL)`. The change in `nccl.cmake` retains this. - `USE_STATIC_NCCL` in the previous Python code simply changes the name of the detected library. This is done in `IF (USE_STATIC_NCCL)`. - Now we only need to look at how the lines below line 20 in `nccl.cmake` are subsumed. These lines list header and library directories where NCCL headers and libraries may reside, and search these directories in turn for the key header and library files. This is done by `find_path` for headers and `find_library` for the library files in `FindNCCL.cmake`. * The call to [find_path](https://cmake.org/cmake/help/v3.8/command/find_path.html) (search for `NO_DEFAULT_PATH` in the link) by default searches for headers in `<prefix>/include` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. Like the Python code, this commit sets `CMAKE_PREFIX_PATH` to search for `<prefix>` in `NCCL_ROOT_DIR` and the CUDA home directory. `CMAKE_SYSTEM_PREFIX_PATH` includes the standard directories such as `/usr/local` and `/usr`. `NCCL_INCLUDE_DIR` is also specifically handled. * Similarly, the call to [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) (search for `NO_DEFAULT_PATH` in the link) by default searches for libraries in directories including `<prefix>/lib` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. But it also handles the edge cases the Python code was meant to solve more properly: - It only searches `<prefix>/lib64` (and `<prefix>/lib32`) if that is appropriate on the system. - It only searches `<prefix>/lib/<arch>` for the right `<arch>`, unlike the Python code, which searches `lib/<arch>` in a generic way (e.g., the Python code searches `/usr/lib/x86_64-linux-gnu`, but in reality some systems have `/usr/lib/x86_64-some-customized-name-linux-gnu`, see https://unix.stackexchange.com/a/226180/38242). --- Regarding relevant issues: - https://github.com/pytorch/pytorch/issues/12063 and https://github.com/pytorch/pytorch/issues/2877: These are properly handled, as explained in the updated comment. - https://github.com/pytorch/pytorch/issues/2941 does not change NCCL detection specifically for Windows (it changed CUDA detection). - b7e258f81ef61d19b884194cdbcd6c7089636d46 Versioned library detection is added, but the order is reversed: the unversioned library becomes preferred. This is because unversioned libraries are normally linked to versioned libraries and preferred by users, and local installations by users are often unversioned. As the documentation of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) suggests: > When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/22930 Differential Revision: D16440275 Pulled By: ezyang fbshipit-source-id: 11fe80743d4fe89b1ed6f96d5d996496e8ec01aa	2019-07-23 08:45:51 -07:00
Edward Yang	798d5d9771	Revert D16281714: Add sanity checks for NCCL detection. Differential Revision: D16281714 Original commit changeset: 396bcbf099bd fbshipit-source-id: a22cc112d1b6a62d689f9d8a7f93e8be3abe2a44	2019-07-16 13:58:27 -07:00
Hong Xu	e2046f8c1d	Add sanity checks for NCCL detection. Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22819 Test Plan: Imported from OSS Differential Revision: D16281714 Pulled By: ezyang fbshipit-source-id: 396bcbf099bd07b996cf779c6b43092096b52d90	2019-07-16 11:32:32 -07:00
Soumith Chintala	8711df89cc	fix nccl compilation to make sure it compiles for architectures that pytorch compiles for (#18739 ) Summary: resubmit of https://github.com/pytorch/pytorch/pull/18704 with additional fixes Fixes https://github.com/pytorch/pytorch/issues/18359 Pull Request resolved: https://github.com/pytorch/pytorch/pull/18739 Differential Revision: D14737274 Pulled By: soumith fbshipit-source-id: cfbbbf68b098594bd045861d1b2c085da693ea51	2019-04-03 12:52:50 -07:00
Soumith Chintala	a799751e33	Revert D14717015: [pytorch][PR] fix nccl compilation to make sure it compiles for architectures that pytorch compiles for Differential Revision: D14717015 Original commit changeset: 4aac036f57e5 fbshipit-source-id: c820b8dfb27564271e6b80e133fe655658a7c25c	2019-04-02 09:39:03 -07:00
Soumith Chintala	fc6296d777	fix nccl compilation to make sure it compiles for architectures that pytorch compiles for (#18704 ) Summary: cc: t-vi gchanan zou3519 This fixes https://github.com/pytorch/pytorch/issues/18359 Pull Request resolved: https://github.com/pytorch/pytorch/pull/18704 Differential Revision: D14717015 Pulled By: soumith fbshipit-source-id: 4aac036f57e564b05d759662e8ad7a80170901c0	2019-04-01 17:10:42 -07:00
Soumith Chintala	37627a182b	fix USE_SYSTEM_NCCL build (#14606 ) Summary: fixes https://github.com/pytorch/pytorch/issues/14537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/14606 Differential Revision: D13274156 Pulled By: soumith fbshipit-source-id: f834715e8e17dacf60be459b0efffba1d4df40ae	2018-11-29 23:36:17 -08:00
andersj	fb7e40b7eb	nccl fixes (#14195 ) Summary: This has 4 changes 1) propagate USE_SYSTEM_NCCL. Previously it was ignored and cmake always did a FindPackage 2) respect SCCACHE_DISABLE in our caffe2 sccache wrapper for circleci 3) use SCCACHE_DISABLE when building nccl, because it triggers the same bug as when using CCACHE (already tracked in https://github.com/pytorch/pytorch/issues/13362). This was hidden because we weren't respecting USE_SYSTEM_NCCL, and were never building nccl ourselves in CI 4) In one particular CI configuration (caffe2, cuda 8, cudnn 7), force USE_SYSTEM_NCCL=1. Building the bundled nccl triggers a bug in nvlink. I've done some investigation, but this looks like a tricky, preexisting bug, so rather than hold up this diff I'm tracking it separately in https://github.com/pytorch/pytorch/issues/14486 Pull Request resolved: https://github.com/pytorch/pytorch/pull/14195 Differential Revision: D13237502 Pulled By: anderspapitto fbshipit-source-id: 1100ac1269c7cd39e2e0b3ba12a56a3ce8977c55	2018-11-28 14:43:06 -08:00
Anders Papitto	44d2ca660a	Disable CCACHE while building NCCL (#13340 ) Summary: I don't have a full analysis, but ccache often appears to fail while building nccl. To work around this, run the NCCL build with CCACHE_DISABLE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/13340 Differential Revision: D12855467 Pulled By: anderspapitto fbshipit-source-id: 63eb12183ab9d03dd22090f084688ae6390fe8bd	2018-10-30 22:19:21 -07:00
Anders Papitto	c68b82ebc8	don't expand cmake variable in IF Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13331 Differential Revision: D12849306 Pulled By: anderspapitto fbshipit-source-id: 2f1f72a44ed3a176be8c7490652e49771c3fadbf	2018-10-30 15:20:43 -07:00
Anders Papitto	380d2dfb27	absorb nccl (#13150 ) Summary: always build nccl from within the main cmake build, rather than via a separate invocation in build_pytorch_libs.sh. Use the existing caffe2 codepaths Pull Request resolved: https://github.com/pytorch/pytorch/pull/13150 Differential Revision: D12815674 Pulled By: anderspapitto fbshipit-source-id: a710b6f242d159b9816911a25ee2c4b8c3f855aa	2018-10-29 12:04:32 -07:00
Teng Li	c5d7494ca1	Use open-source NCCL2 in PyTorch (#12359 ) Summary: - Removed the old nccl file - Make open-source NCCL a submodule - CMake to build NCCL itself. NCCL2 is now in the default build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12359 Reviewed By: orionr, yns88 Differential Revision: D10219665 Pulled By: teng-li fbshipit-source-id: 134ff47057512ba617b48bf390c1c816fff3f881	2018-10-08 15:39:07 -07:00
Orion Reblitz-Richardson	895994a7c3	Back out "[pytorch][PR] [Build] Use open-source NCCL2 in PyTorch" Reviewed By: The controller you requested could not be found. fbshipit-source-id: a13075339d3a7b970e81be0b1a32a7c4c3a6c68d	2018-10-04 14:12:04 -07:00
Teng Li	ae7a7fb398	Use open-source NCCL2 in PyTorch (#12312 ) Summary: - Removed the old nccl file - Make open-source NCCL a submodule - CMake to build NCCL itself. NCCL2 is now in the default build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12312 Differential Revision: D10190845 Pulled By: teng-li fbshipit-source-id: 08d42253b774149a66919d194f88b34628c39bae	2018-10-04 11:42:17 -07:00
Orion Reblitz-Richardson	4bf0202cac	[build] Have PyTorch depend on minimal libcaffe2.so instead of libATen.so (#7399 ) * Have PyTorch depend on minimal libcaffe2.so instead of libATen.so * Build ATen tests as a part of Caffe2 build * Hopefully cufft and nvcc fPIC fixes * Make ATen install components optional * Add tests back for ATen and fix TH build * Fixes for test_install.sh script * Fixes for cpp_build/build_all.sh * Fixes for aten/tools/run_tests.sh * Switch ATen cmake calls to USE_CUDA instead of NO_CUDA * Attempt at fix for aten/tools/run_tests.sh * Fix typo in last commit * Fix valgrind call after pushd * Be forgiving about USE_CUDA disable like PyTorch * More fixes on the install side * Link all libcaffe2 during test run * Make cuDNN optional for ATen right now * Potential fix for non-CUDA builds * Use NCCL_ROOT_DIR environment variable * Pass -fPIC through nvcc to base compiler/linker * Remove THCUNN.h requirement for libtorch gen * Add Mac test for -Wmaybe-uninitialized * Potential Windows and Mac fixes * Move MSVC target props to shared function * Disable cpp_build/libtorch tests on Mac * Disable sleef for Windows builds * Move protos under BUILD_CAFFE2 * Remove space from linker flags passed with -Wl * Remove ATen from Caffe2 dep libs since directly included * Potential Windows fixes * Preserve options while sleef builds * Force BUILD_SHARED_LIBS flag for Caffe2 builds * Set DYLD_LIBRARY_PATH and LD_LIBRARY_PATH for Mac testing * Pass TORCH_CUDA_ARCH_LIST directly in cuda.cmake * Fixes for the last two changes * Potential fix for Mac build failure * Switch Caffe2 to build_caffe2 dir to not conflict * Cleanup FindMKL.cmake * Another attempt at Mac cpp_build fix * Clear cpp-build directory for Mac builds * Disable test in Mac build/test to match cmake	2018-05-24 07:47:27 -07:00
Marat Dukhan	c9cc514df4	Bump minimum CMake version to 3.2 CMake 3.2 is required to properly track dependencies in projects imported via ExternalProject_Add (BUILD_BYPRODUCTS parameter). Users on Ubuntu 14.04 LTS would need to install and use the cmake3 package for configuration. Users of other popular distributions generally have a recent enough CMake package.	2018-03-06 19:57:48 -08:00
Yangqing Jia	80430501c9	Remove the use of EXTERNAL_DEPENDENCIES (#2045 ) * [cmake] Move nccl to modern cmake, and avoid using EXTERNAL_DEPENDENCIES * [cmake] Move nnpack to modern cmake and avoid using EXTERNAL_DEPENDENCIES. * [cmake] Move ATen to modern cmake and avoid using EXTERNAL_DEPENDENCIES. * Move cpufeatures to modern cmake, and avoid using EXTERNAL_DEPENDENCIES * Finally remove EXTERNAL_DEPENDENCIES. * Maratyszcza's comments	2018-02-24 16:15:28 -08:00
Junjie Bai	bd22b83d62	Fix nccl cmake files Summary: Closes https://github.com/caffe2/caffe2/pull/1963 Differential Revision: D6994392 Pulled By: bddppq fbshipit-source-id: 4ab6a8f7dcb4469bdd3e152559ff3474984776fc	2018-02-14 16:04:11 -08:00
Pieter Noordhuis	2c10b13eeb	Pass CUDA_NVCC_EXECUTABLE to NCCL build Summary: If this variable is set to a ccache symlink then the NCCL build will also use the cache. The NCCL build is the slowest component of a cached build without this change Closes https://github.com/caffe2/caffe2/pull/1416 Reviewed By: Yangqing Differential Revision: D6214008 Pulled By: pietern fbshipit-source-id: e0a90e27de9b1c5a1fdc0e5bad5fb61f9fa924c3	2017-11-01 15:32:22 -07:00
Pieter Noordhuis	45e6e71198	Tidy up CMake for NCCL Summary: Use HINTS instead of PATHS for find_library so that you can specify -DNCCL_ROOT_DIR and it will use this NCCL installation regardless of what else is installed on your system. Also add a path hint to include the default base path for NCCL 2 libraries. Closes https://github.com/caffe2/caffe2/pull/1152 Reviewed By: Yangqing Differential Revision: D5740053 Pulled By: pietern fbshipit-source-id: 43f0908a63e8a9b90320dece0bbb558827433b48	2017-08-30 15:39:56 -07:00
Luke Yeager	49c89d6664	Use add_dependencies() for ExternalProjects Summary: I closed https://github.com/caffe2/caffe2/pull/736 because one of these variables should be used after all. Here's how C1 uses this variable: https://github.com/BVLC/caffe/blob/rc5/cmake/Targets.cmake#L116 Without this fix, there is a race condition in the parallel build leading to this error: ``` make[2]: *** No rule to make target `../third_party/NNPACK/lib/libnnpack.a', needed by `caffe2/libCaffe2_CPU.so'. ``` Closes https://github.com/caffe2/caffe2/pull/737 Differential Revision: D5211794 Pulled By: Yangqing fbshipit-source-id: 9e368f09b01edaf86252727adc6f6cc40d244e29	2017-06-08 13:34:42 -07:00
Yangqing Jia	6490d58a75	Add cuda path to nccl build Summary: This was found necessary on some CentOS. aaronmarkham Closes https://github.com/caffe2/caffe2/pull/240 Differential Revision: D4819591 Pulled By: Yangqing fbshipit-source-id: 40161cd484a2c8d43f26077919ad2762440dde13	2017-04-03 10:05:26 -07:00
Pieter Noordhuis	d5880b128e	CMake support for Gloo dependency Summary: This also requires a change to cmake/External/nccl.cmake to use the static NCCL binary instead of the shared object. When the Caffe2/Gloo build uses the bundled NCCL version it should be packaged up in the resulting libraries and not cause another runtime dependency on a library that has to be installed separately. Closes https://github.com/caffe2/caffe2/pull/218 Differential Revision: D4769926 Pulled By: pietern fbshipit-source-id: 5c85559992c200d874f4218724823815ffb5adb5	2017-03-24 08:32:24 -07:00
Pieter Noordhuis	f449af378d	Explicitly pass CXX to NCCL Makefile Summary: Necessary if CXX isn't set when cmake is called. The CXX variable will then be empty which prevents make from using its own default. Closes https://github.com/caffe2/caffe2/pull/202 Differential Revision: D4711113 Pulled By: Yangqing fbshipit-source-id: 895c07044b263ba9b5440453978248506d7ac225	2017-03-14 18:33:36 -07:00
Bram Wasti	59d263280e	fix directory reference in cmake for inclusion as library Summary: This fixes builds that include caffe2 and change the value of CMAKE_BINARY_DIR to their own binary dir. In particular, it allows the generation of protobuf headers/files. Reviewed By: Yangqing Differential Revision: D4466126 fbshipit-source-id: eba264094dd2bff07a7f050b95fd2d5525462b09	2017-01-26 14:44:37 -08:00
Yangqing Jia	324ef09e01	fix typo	2017-01-03 23:16:36 -08:00


52 Commits
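Commit 7c298b8244 fixes objcopy version detection by loosening the regex so that distribution-specific text before the version number (such as `(GNU Binutils for Ubuntu)`) no longer breaks the match. The actual change was to a CMake regex that is not reproduced here; the following is a hypothetical Python sketch of the same idea, checked only against the two version strings quoted in that commit message. The name `parse_objcopy_version` is illustrative.

```python
import re
from typing import Optional

# Hypothetical analogue of the loosened regex: allow any text between
# "GNU objcopy" and the trailing version number on the same line.
_OBJCOPY_VERSION_RE = re.compile(
    r"^GNU objcopy .*?([0-9]+\.[0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE
)


def parse_objcopy_version(output: str) -> Optional[str]:
    """Return the version from `objcopy --version` output, or None if absent."""
    match = _OBJCOPY_VERSION_RE.search(output)
    return match.group(1) if match else None


print(parse_objcopy_version("GNU objcopy (GNU Binutils for Ubuntu) 2.30"))  # 2.30
print(parse_objcopy_version("GNU objcopy (GNU Binutils) 2.37"))  # 2.37
```

The lazy `.*?` keeps the vendor string out of the captured group, so both quoted formats yield just the numeric version.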
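Several commits in this log (4d431881d1, 8a574c7104, 2000eba454, 56a37ea1a6) shape how the bundled NCCL `make` step is parallelized: honor `MAX_JOBS`, avoid passing an empty value to `-j`, fall back when `ProcessorCount` reports 0, use the parent's `$(MAKE)` jobserver when `make` drives the build, and otherwise pass `-l` to cap the load average. The sketch below models that decision in Python. The function name `nccl_make_command`, the fallback of 1 job, and reusing the same number for `-j` and `-l` are assumptions for illustration, not PyTorch's actual `nccl.cmake` logic.

```python
from typing import List, Mapping, Optional


def nccl_make_command(
    env: Mapping[str, str], cpu_count: Optional[int], parent_is_make: bool
) -> List[str]:
    """Hypothetical model of choosing the command that builds bundled NCCL."""
    if parent_is_make:
        # $(MAKE) lets the child join the parent make's jobserver, which
        # hands out job slots, so no explicit -j is passed here.
        return ["$(MAKE)"]

    max_jobs = env.get("MAX_JOBS", "").strip()
    if max_jobs:
        jobs = int(max_jobs)
    else:
        # An unset or empty MAX_JOBS must not become an empty -j argument.
        # A processor count of 0 or None means "unknown"; fall back to
        # 1 job (assumed value).
        jobs = cpu_count if cpu_count else 1

    # -l tells make not to start new jobs while others are running and the
    # load average is at least this value, guarding against oversubscription.
    return ["make", f"-j{jobs}", f"-l{jobs}"]


print(nccl_make_command({"MAX_JOBS": "4"}, cpu_count=16, parent_is_make=False))
# ['make', '-j4', '-l4']
```

Returning a list rather than a single string mirrors how CMake passes a multi-argument build command, and keeps each flag a separate argument.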
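Commit 60c46dd4df lists the unversioned NCCL library name before the versioned one, following the `find_library` documentation's advice. The ordering matters because, by default, CMake tries one name at a time across every search directory before moving to the next name (the `NAMES_PER_DIR` option switches to directory-by-directory order). The Python model below is a simplified, hypothetical illustration of that name-major search: it ignores CMake's prefix/suffix expansion, cache variables, and the real default search paths, and the directory and file names are made up for the demo.

```python
import os
import tempfile
from typing import Optional, Sequence


def find_library_name_major(names: Sequence[str], dirs: Sequence[str]) -> Optional[str]:
    """Simplified model of find_library's default order: each name, then every dir."""
    for name in names:  # outer loop: names in the order given (unversioned first)
        for directory in dirs:  # inner loop: every candidate directory
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        distro = os.path.join(root, "distro_lib")  # holds only a versioned file
        local = os.path.join(root, "local_lib")  # holds only an unversioned file
        os.makedirs(distro)
        os.makedirs(local)
        open(os.path.join(distro, "libnccl.so.2"), "w").close()
        open(os.path.join(local, "libnccl.so"), "w").close()
        # distro_lib is searched first, yet the unversioned local copy wins
        # because each name is tried across all directories before the next.
        found = find_library_name_major(["libnccl.so", "libnccl.so.2"], [distro, local])
        print(os.path.basename(os.path.dirname(found)))  # local_lib
```

Swapping the order of `names` makes the distribution's versioned file win instead, which is the behavior the commit deliberately avoids.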