Summary:
This fixes rebuild issues with the ninja part of the build. With this patch, all of the ninja builds will now report `nothing to do` if nothing has changed, assuming `BUILD_CAFFE2_OPS=0`.
1. The Python file processing for Caffe2 is now only done when BUILD_CAFFE2_OPS=1. That part of the build is written in such a way that it always has to rerun, and it can take substantial time moving files around even in a no-op build. In the future it should be rewritten to either copy the files with a faster method, or treat the copies as build rules that only run when the files are out of date.
2. This points `sleef` to a patched version that fixes a dead build output that is causing everything to relink all the time. See https://github.com/shibatch/sleef/pull/231#partial-pull-merging for the upstream change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14969
Reviewed By: soumith
Differential Revision: D13395998
Pulled By: zdevito
fbshipit-source-id: ca85b7be9e99c5c578103c144ef0f2c3b927e724
Summary:
- Removed the old nccl file
- Made open-source NCCL a submodule
- Added CMake rules to build NCCL itself
NCCL2 is now in the default build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12359
Reviewed By: orionr, yns88
Differential Revision: D10219665
Pulled By: teng-li
fbshipit-source-id: 134ff47057512ba617b48bf390c1c816fff3f881
Summary:
- Removed the old nccl file
- Made open-source NCCL a submodule
- Added CMake rules to build NCCL itself
NCCL2 is now in the default build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12312
Differential Revision: D10190845
Pulled By: teng-li
fbshipit-source-id: 08d42253b774149a66919d194f88b34628c39bae
* Dump autogradpp into PyTorch
* Fixed up CMake for autogradpp/C++ API
* Made cereal a submodule
* Change search location of autogradpp's MNIST directory
* Add test_api to CI
* Download MNIST from the internet instead of storing in repo
* Fix warnings
- gloo, pybind11, nanopb and nccl now live in third_party.
- ATen builds in aten/build rather than torch/lib/build/aten
- A bit of faffing about in the scripts was necessary, because they used to assume that everything lived in the same directory. Now you are expected to cd into the correct directory before calling one of the build functions. The actual builder script lives in tools.
- Lint now just unconditionally ignores third_party, rather than enumerating folders explicitly
This PR addresses #5648. In particular, following the discussion there:
- it adds Catch as a submodule (https://github.com/catchorg/Catch2) in torch/aten/utils
- it ports all ATen tests to Catch
- it ports torch/csrc/jit/test_jit.cpp to Catch (libtorch only, Python build is unaffected)
#5481 was reverted due to a strange test bug. This PR attempts to fix that.
This diff adds vectorization to ATen. It uses Intel intrinsics to build a general vec256 class that represents 256-bit wide vector types, which can then be treated like regular variables. Using it, torch.sum() is implemented for the contiguous case. It uses Intel TBB for multithreading, which allows work stealing, and chunks the reduction operations based on an experimentally chosen value (_THRESHOLD). It uses cpuinfo to pick the right code path depending on the host's capabilities.
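As a rough, hypothetical sketch of the idea (not the actual class or API from this diff), a 256-bit float wrapper over AVX intrinsics and a contiguous sum built on it could look like the following; the names `Vec256f` and `sum_contiguous` are made up for illustration:

```cpp
// Hypothetical 256-bit float wrapper over AVX intrinsics (illustrative only; the
// actual vec256 class in this diff is more general and supports more dtypes).
// Compile with -mavx.
#include <immintrin.h>

struct Vec256f {
  __m256 values;  // eight packed 32-bit floats

  static Vec256f loadu(const float* ptr) { return {_mm256_loadu_ps(ptr)}; }
  void storeu(float* ptr) const { _mm256_storeu_ps(ptr, values); }

  // Overloaded operators let the vector be used like a regular variable.
  Vec256f operator+(const Vec256f& other) const {
    return {_mm256_add_ps(values, other.values)};
  }
};

// Contiguous sum, 8 floats at a time; assumes n is a positive multiple of 8.
float sum_contiguous(const float* data, long n) {
  Vec256f acc = Vec256f::loadu(data);
  for (long i = 8; i < n; i += 8) {
    acc = acc + Vec256f::loadu(data + i);
  }
  float lanes[8];
  acc.storeu(lanes);
  float result = 0.0f;
  for (float l : lanes) result += l;
  return result;
}
```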
The kernels are implemented under native/cpu. Each .cpp file is compiled three times: with -avx, with -avx2, and with no additional flags. A macro is used to append AVX, AVX2, or NONE to each function name, so the header then needs to declare each function three times, once per capability. This could be improved by either changing the CMake file a bit or by generating the declarations with a Python script.
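For illustration only (the macro names here are hypothetical, not necessarily the ones used in native/cpu), the suffixing could work roughly like this:

```cpp
// Hypothetical sketch of the capability suffixing: the same source file is compiled
// three times, e.g. with "-mavx2 -DCPU_CAPABILITY_AVX2", "-mavx -DCPU_CAPABILITY_AVX",
// and with no extra flags, and the macro picks the suffix for each symbol.
#if defined(CPU_CAPABILITY_AVX2)
  #define CPU_CAPABILITY AVX2
#elif defined(CPU_CAPABILITY_AVX)
  #define CPU_CAPABILITY AVX
#else
  #define CPU_CAPABILITY NONE
#endif

#define CONCAT_(name, suffix) name##_##suffix
#define CONCAT(name, suffix) CONCAT_(name, suffix)  // extra level so CPU_CAPABILITY expands
#define CPU_KERNEL(name) CONCAT(name, CPU_CAPABILITY)

// Expands to sum_kernel_AVX2, sum_kernel_AVX, or sum_kernel_NONE depending on how
// this translation unit was compiled.
void CPU_KERNEL(sum_kernel)(float* out, const float* in, long n);
```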
For the non-contiguous case this defaults to the current implementation within TH. For CUDA it defaults entirely to the implementation within THC.
There probably needs to be a bit of debate around the design decisions here: the additional dependencies, the parallelization strategy, clarity, etc. The numerical results also diverge from NumPy for larger tensors, which is expected, since we sum, for example, 8 numbers at a time and then add the result to the running sum, instead of adding each number one by one. But there might be something to be said about accumulating into a double for floats, the degree of divergence, the behavior with respect to CUDA, etc.
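As a standalone illustration (not code from this diff) of why the order of additions changes the float result, compare a one-by-one accumulation with an 8-lane accumulation that is combined at the end:

```cpp
// Standalone illustration: an 8-way "vectorized" sum differs slightly from a
// sequential sum in float because the additions happen in a different order.
#include <cstdio>

int main() {
  const int n = 1 << 20;
  float sequential = 0.0f;
  float lanes[8] = {0.0f};
  for (int i = 0; i < n; ++i) {
    float x = 1.0f / (i + 1);  // arbitrary data
    sequential += x;           // add each number one by one
    lanes[i % 8] += x;         // add into 8 running lanes, as a vectorized sum would
  }
  float chunked = 0.0f;
  for (float l : lanes) chunked += l;  // combine the lanes at the end
  std::printf("sequential=%.7f chunked=%.7f diff=%g\n",
              sequential, chunked, sequential - chunked);
  return 0;
}
```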
I wrote a [small Python script](https://github.com/cpuhrsch/benchmark/blob/sumall/benchmarks/sum_bench.py) to compare the results with NumPy, both numerically and in terms of timing. I ran this script to create timings on both master and this branch.
Here is the command for 1 core
`OMP_NUM_THREADS=1 taskset -c 0 python sum_bench.py --enable_numpy 200`
Here is the command for all cores
`python sum_bench.py --enable_numpy 200`
Here are the results of each:
[Master, 1 core](https://paste.fedoraproject.org/paste/Nho9JzHpPVK9av8a6mByjQ)
[This branch, 1 core](https://paste.fedoraproject.org/paste/6xLHkYvcVJx9z~5MoHxN4w)
[Master, all cores](https://paste.fedoraproject.org/paste/5l3V1d5zGqvJcMXIUteMRw)
[This branch, all cores](https://paste.fedoraproject.org/paste/J4RuDU-0Drz0aZwtphQwEA)
To test correctness, the command is
`python sum_bench.py --test 200`
[This branch, test results](https://paste.fedoraproject.org/paste/kTEoUC~oWgXA6XWMAfNfNw)
For this test we look at the average absolute value of the differences. This does not take into account the relative magnitude of the numbers. The numbers are sampled from a standard normal distribution.
In terms of performance this diff should bring PyTorch on par with NumPy and usually exceed it by 1.5 to 2x.
* Revert "ATen ReduceOps (#5481)"
This reverts commit 310c3735b9eb97f30cee743b773e5bb054989edc.
* Revert "Check that new cpuinfo and tbb submodules exist (#5714)"
This reverts commit 1a23c9901dbfee295bf5b3dad36e4d3ee7e86366.
This diff adds vectorization to ATen. It uses Intel intrinsics to build a general vec256 class that represents 256-bit wide vector types, which can then be treated like regular variables. Using it, torch.sum() is implemented for the contiguous case. It uses Intel TBB for multithreading, which allows work stealing, and chunks the reduction operations based on an experimentally chosen value (_THRESHOLD). It uses cpuinfo to pick the right code path depending on the host's capabilities.
The kernels are implemented under native/cpu. Each .cpp file is compiled three times: with -avx, with -avx2, and with no additional flags. A macro is used to append AVX, AVX2, or NONE to each function name, so the header then needs to declare each function three times, once per capability. This could be improved by either changing the CMake file a bit or by generating the declarations with a Python script.
For the non-contiguous case this defaults to the current implementation within TH. For CUDA it defaults entirely to the implementation within THC.
There probably needs to be a bit of debate around the design decisions here: the additional dependencies, the parallelization strategy, clarity, etc. The numerical results also diverge from NumPy for larger tensors, which is expected, since we sum, for example, 8 numbers at a time and then add the result to the running sum, instead of adding each number one by one. But there might be something to be said about accumulating into a double for floats, the degree of divergence, the behavior with respect to CUDA, etc.
I wrote a [small Python script](https://github.com/cpuhrsch/benchmark/blob/sumall/benchmarks/sum_bench.py) to compare the results with NumPy, both numerically and in terms of timing. I ran this script to create timings on both master and this branch.
Here is the command for 1 core
`OMP_NUM_THREADS=1 taskset -c 0 python sum_bench.py --enable_numpy 200`
Here is the command for all cores
`python sum_bench.py --enable_numpy 200`
Here are the results of each:
[Master, 1 core](https://paste.fedoraproject.org/paste/Nho9JzHpPVK9av8a6mByjQ)
[This branch, 1 core](https://paste.fedoraproject.org/paste/6xLHkYvcVJx9z~5MoHxN4w)
[Master, all cores](https://paste.fedoraproject.org/paste/5l3V1d5zGqvJcMXIUteMRw)
[This branch, all cores](https://paste.fedoraproject.org/paste/J4RuDU-0Drz0aZwtphQwEA)
To test correctness, the command is
`python sum_bench.py --test 200`
[This branch, test results](https://paste.fedoraproject.org/paste/kTEoUC~oWgXA6XWMAfNfNw)
For this test we look at the average absolute value of the differences. This does not take into account the relative magnitude of the numbers. The numbers are sampled from a standard normal distribution.
In terms of performance this diff should bring PyTorch on par with NumPy and usually exceed it by 1.5 to 2x.
Summary:
Include six, enum34, and PeachPy as Caffe2 submodules, and use the versions from the submodules instead of downloading them at configuration time
Closes https://github.com/caffe2/caffe2/pull/1917
Reviewed By: orionr
Differential Revision: D6938735
Pulled By: Maratyszcza
fbshipit-source-id: 841a6c47a1cd003a19f48f6c256aa4d9eb2cc6e4
Summary:
Original commit changeset: d0c1c7681605
Reverting because this commit broke the OSS build
Reviewed By: bddppq
Differential Revision: D6935666
fbshipit-source-id: 955cfeb6d5a4ed265b2e099094cfb5bfe960ff95
Summary:
Include six, enum34, and PeachPy as Caffe2 submodules, and use the versions from the submodules instead of downloading them at configuration time
Closes https://github.com/caffe2/caffe2/pull/1901
Differential Revision: D6930731
Pulled By: Maratyszcza
fbshipit-source-id: d0c1c7681605d957de6f51bd24fbb25afc0f282f
Summary:
This is in order for us to share the compression ops with OSS.
Closes https://github.com/caffe2/caffe2/pull/1463
Reviewed By: hlu1
Differential Revision: D6319101
Pulled By: Yangqing
fbshipit-source-id: 16c94e71fc3efe256054a648170aaf7702e5bcfe
Summary:
This operator allows the use of Torch's underlying TH libraries (TH, THC, THNN, and THCUNN)
through the ATen tensor library. Use of the operator is described in the README.
The operator itself is generated from ATen's Declarations.yaml file, which describes ATen's public API.
Closes https://github.com/caffe2/caffe2/pull/1235
Reviewed By: dzhulgakov
Differential Revision: D5876944
Pulled By: zdevito
fbshipit-source-id: b558e8563a5e82a0e6278705a4a359bd7df4e70a