Commit Graph

16 Commits

Author SHA1 Message Date
7d39e73c57 Fix more URLs (#153277)
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
92b95e5243 Fix NCCL version check when nccl.h in non-standard location. (#40982)
Summary:
The NCCL discovery process fails to compile detect_nccl_version.cc when nccl.h resides in a non-standard location.
Pass __NCCL_INCLUDE_DIRS__ to _try_run(... detect_nccl_version.cc)_ to fix this.

Can reproduce with Dockerfile ..
```Dockerfile
FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04 as build
WORKDIR /stage

# install conda
ARG CONDA_VERSION=4.7.10
ARG CONDA_URL=https://repo.anaconda.com/miniconda/Miniconda3-${CONDA_VERSION}-Linux-x86_64.sh
RUN cd /stage && curl -fSsL --insecure ${CONDA_URL} -o install-conda.sh &&\
    /bin/bash ./install-conda.sh -b -p /opt/conda &&\
    /opt/conda/bin/conda clean -ya
ENV PATH=/opt/conda/bin:${PATH}

# install prerequisites
RUN conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi

# attempt compile
ENV CUDA_HOME="/usr/local/cuda" \
    CUDNN_LIBRARY="/usr/lib/x86_64-linux-gnu" \
    NCCL_INCLUDE_DIR="/usr/local/cuda/include" \
    NCCL_LIB_DIR="/usr/local/cuda/lib64" \
    USE_SYSTEM_NCCL=1
RUN apt-get -y update &&\
    apt-get -y install git &&\
    cd /stage && git clone https://github.com/pytorch/pytorch.git &&\
    cd pytorch &&\
    git submodule update --init --recursive &&\
    python setup.py bdist_wheel
```

This generates the following error ..
```
-- Found NCCL: /usr/local/cuda/include
-- Determining NCCL version from /usr/local/cuda/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - found
CMake Error at cmake/Modules/FindNCCL.cmake:78 (message):
  Found NCCL header version and library version do not match! (include:
  /usr/local/cuda/include, library: /usr/local/cuda/lib64/libnccl.so) Please
  set NCCL_INCLUDE_DIR and NCCL_LIB_DIR manually.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40982

Reviewed By: zou3519

Differential Revision: D22603911

Pulled By: malfet

fbshipit-source-id: 084870375a270fb9c7daf3c2e731992a03614ad6
2020-07-17 13:54:17 -07:00
a8edc2b5d2 Add sanity checks for NCCL detection.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22926

Differential Revision: D16546369

Pulled By: colesbury

fbshipit-source-id: 56f7ef4476e586dee19366fdb720085d1c2f2027
2019-07-29 13:47:05 -07:00
60c46dd4df Let CMake handle NCCL detection instead of our handcrafted Python script. (#22930)
Summary:
 ---

How does the current code subsume all detections in the deleted `nccl.py`?

- The dependency of `USE_NCCL` on the OS and `USE_CUDA` is handled as dependency options in `CMakeLists.txt`.

- The main NCCL detection happens in [FindNCCL.cmake](8377d4b32c/cmake/Modules/FindNCCL.cmake), which is called by [nccl.cmake](8377d4b32c/cmake/External/nccl.cmake). When `USE_SYSTEM_NCCL` is false, the previous Python code defer the detection to `find_package(NCCL)`. The change in `nccl.cmake` retains this.

- `USE_STATIC_NCCL` in the previous Python code simply changes the name of the detected library. This is done in `IF (USE_STATIC_NCCL)`.

- Now we only need to look at how the lines below line 20 in `nccl.cmake` are subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done by `find_path` for headers and `find_library` for the library files in `FindNCCL.cmake`.
  * The call of [find_path](https://cmake.org/cmake/help/v3.8/command/find_path.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for headers in `<prefix>/include` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. Like the Python code, this commit sets `CMAKE_PREFIX_PATH` to search for `<prefix>` in `NCCL_ROOT_DIR` and home to CUDA.  `CMAKE_SYSTEM_PREFIX_PATH` includes the standard directories such as `/usr/local` and `/usr`. `NCCL_INCLUDE_DIR` is also specifically handled.

  * Similarly, the call of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for libraries in directories including `<prefix>/lib` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. But it also handles the edge cases intended to be solved in the Python code more properly:
     - It only searches for `<prefix>/lib64` (and `<prefix>/lib32`) if it is appropriate on the system.
     - It only searches for `<prefix>/lib/<arch>` for the right `<arch>`, unlike the Python code searches for `lib/<arch>` in a generic way (e.g., the Python code searches for `/usr/lib/x86_64-linux-gnu` but in reality systems have `/usr/lib/x86_64-some-customized-name-linux-gnu`, see https://unix.stackexchange.com/a/226180/38242 ).

 ---

Regarding for relevant issues:

- https://github.com/pytorch/pytorch/issues/12063 and https://github.com/pytorch/pytorch/issues/2877: These are properly handled, as explained in the updated comment.
- https://github.com/pytorch/pytorch/issues/2941 does not changes NCCL detection specifically for Windows (it changed CUDA detection).
- b7e258f81ef61d19b884194cdbcd6c7089636d46 A versioned library detection is added, but the order is reversed: The unversioned library becomes preferred. This is because normally unversioned libraries are linked to versioned libraries and preferred by users, and local installation by users are often unversioned. Like the document of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) suggests:

> When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22930

Differential Revision: D16440275

Pulled By: ezyang

fbshipit-source-id: 11fe80743d4fe89b1ed6f96d5d996496e8ec01aa
2019-07-23 08:45:51 -07:00
798d5d9771 Revert D16281714: Add sanity checks for NCCL detection.
Differential Revision:
D16281714

Original commit changeset: 396bcbf099bd

fbshipit-source-id: a22cc112d1b6a62d689f9d8a7f93e8be3abe2a44
2019-07-16 13:58:27 -07:00
01f03d56ee Revert D16283037: Add sanity checks for NCCL detection.
Differential Revision:
D16283037

Original commit changeset: fc09c9443a56

fbshipit-source-id: 30cdf7b1ad91498ee615d018de5571ba36f4383e
2019-07-16 13:20:43 -07:00
31497799b9 Add sanity checks for NCCL detection.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22819

Test Plan: Imported from OSS

Differential Revision: D16283037

Pulled By: ezyang

fbshipit-source-id: fc09c9443a568d9af1c78a847282a7d707c49dd6
2019-07-16 11:32:36 -07:00
e2046f8c1d Add sanity checks for NCCL detection.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22819

Test Plan: Imported from OSS

Differential Revision: D16281714

Pulled By: ezyang

fbshipit-source-id: 396bcbf099bd07b996cf779c6b43092096b52d90
2019-07-16 11:32:32 -07:00
a994b432ee [c10d] NCCL Process Group implementation (#8182)
* [c10d] Process Group NCCL implementation

* Addressed comments

* Added one missing return and clang format again

* Use cmake/Modules for everything and fix gloo build

* Fixed compiler warnings

* Deleted duplicated FindNCCL
2018-06-08 10:33:27 -07:00
4bf0202cac [build] Have PyTorch depend on minimal libcaffe2.so instead of libATen.so (#7399)
* Have PyTorch depend on minimal libcaffe2.so instead of libATen.so

* Build ATen tests as a part of Caffe2 build

* Hopefully cufft and nvcc fPIC fixes

* Make ATen install components optional

* Add tests back for ATen and fix TH build

* Fixes for test_install.sh script

* Fixes for cpp_build/build_all.sh

* Fixes for aten/tools/run_tests.sh

* Switch ATen cmake calls to USE_CUDA instead of NO_CUDA

* Attempt at fix for aten/tools/run_tests.sh

* Fix typo in last commit

* Fix valgrind call after pushd

* Be forgiving about USE_CUDA disable like PyTorch

* More fixes on the install side

* Link all libcaffe2 during test run

* Make cuDNN optional for ATen right now

* Potential fix for non-CUDA builds

* Use NCCL_ROOT_DIR environment variable

* Pass -fPIC through nvcc to base compiler/linker

* Remove THCUNN.h requirement for libtorch gen

* Add Mac test for -Wmaybe-uninitialized

* Potential Windows and Mac fixes

* Move MSVC target props to shared function

* Disable cpp_build/libtorch tests on Mac

* Disable sleef for Windows builds

* Move protos under BUILD_CAFFE2

* Remove space from linker flags passed with -Wl

* Remove ATen from Caffe2 dep libs since directly included

* Potential Windows fixes

* Preserve options while sleef builds

* Force BUILD_SHARED_LIBS flag for Caffe2 builds

* Set DYLD_LIBRARY_PATH and LD_LIBRARY_PATH for Mac testing

* Pass TORCH_CUDA_ARCH_LIST directly in cuda.cmake

* Fixes for the last two changes

* Potential fix for Mac build failure

* Switch Caffe2 to build_caffe2 dir to not conflict

* Cleanup FindMKL.cmake

* Another attempt at Mac cpp_build fix

* Clear cpp-build directory for Mac builds

* Disable test in Mac build/test to match cmake
2018-05-24 07:47:27 -07:00
108f5c197f [pytorch] add static linkage support for CuDNN and NCCL (#6410)
* when linking static CUDA libs, additional dep on culibos.a

* add USE_STATIC_NCCL option

* add USE_STATIC_CUDNN option

* remove libATen soversion

* add caffe, caffe2 folders to setup.py exclude list
2018-04-08 22:54:18 -04:00
54342287fe Look for NCCL in CUDA_TOOLKIT_ROOT_DIR
Summary: Closes https://github.com/caffe2/caffe2/pull/1611

Reviewed By: dzhulgakov

Differential Revision: D6550168

Pulled By: pietern

fbshipit-source-id: e034ce4057d37bfc8b53949c56cbcb701ea5d958
2017-12-12 21:50:49 -08:00
db06e91097 Bump gloo
Summary:
Latest version of Gloo takes care of MPI_Init/MPI_Finalize for us, so
this commit removes handling that from caffe2/contrib/gloo. It also
imports CMake NCCL module changes from Gloo to stay consistent and
allow setting NCCL_INCLUDE_DIR and NCCL_LIB_DIR separately.
Closes https://github.com/caffe2/caffe2/pull/1295

Reviewed By: dzhulgakov

Differential Revision: D5979364

Pulled By: pietern

fbshipit-source-id: 794b00b0a445317c30a13cc8f0f4dc38e590cc77
2017-10-05 16:57:59 -07:00
45e6e71198 Tidy up CMake for NCCL
Summary:
Use HINTS instead of PATHS for find_library so that you can specify
-DNCCL_ROOT_DIR and it will use this NCCL installation regardless of
what else is installed on your system. Also add a path hint to include
the default base path for NCCL 2 libraries.
Closes https://github.com/caffe2/caffe2/pull/1152

Reviewed By: Yangqing

Differential Revision: D5740053

Pulled By: pietern

fbshipit-source-id: 43f0908a63e8a9b90320dece0bbb558827433b48
2017-08-30 15:39:56 -07:00
fbbb87cd46 Enhancements
Add BLAS chooser
Move cuDNN detection from Cuda -> FindCuDNN
Refactor main C2 libs, should enable no-GPU build (untested)
2016-12-13 09:29:01 -05:00
52f09fe2c9 Initial building with deps 2016-12-13 09:29:01 -05:00