Summary:
Recent versions of GCC split unaligned load and store intrinsics into
two 128-bit instructions. On old processors (Sandy Bridge) this was a
bit faster for unaligned data, but a bit slower for aligned data. On new
processors (Intel Haswell+, recent AMD) splitting loads is slower for
both aligned and unaligned data.
Clang, MSVC, and ICC do not split unaligned load and store intrinsics.
There's a good explanation here:
https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd#tab-top
Splitting load and store intrinsics makes no sense in our AVX2
configuration because the CPUs that support AVX2 instructions are the
same CPUs on which splitting is disadvantageous for all data alignments.
Note that this doesn't change the AVX configuration (used by CPUs that
support AVX but not AVX2). It's possible this would be beneficial for
that configuration too (our data is usually 32-byte aligned), but I'd
prefer the conservative change for now.
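As a concrete illustration (the function is mine, not from this diff), a single unaligned 256-bit copy shows the difference; GCC's -mavx256-split-unaligned-load / -mavx256-split-unaligned-store options control the splitting:
```
#include <immintrin.h>

/* Minimal sketch (not from the diff): with splitting on, GCC compiles
 * these intrinsics into two 128-bit halves (vmovups xmm plus
 * vinsertf128/vextractf128); with -mno-avx256-split-unaligned-load and
 * -mno-avx256-split-unaligned-store it emits single 256-bit vmovups. */
void copy_8_floats(const float *src, float *dst) {
  __m256 v = _mm256_loadu_ps(src); /* unaligned 256-bit load  */
  _mm256_storeu_ps(dst, v);        /* unaligned 256-bit store */
}
```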
torch.add generated assembly (hot loop, GCC 7.3.0):
before:
https://gist.github.com/colesbury/066376537bccd514daf8fe4ab54d8295
after:
https://gist.github.com/colesbury/8b4b948145001d44b225c51d2428bb91
Timing of `torch.add(x, y, out=z)` for size 10240 (1 thread, Broadwell,
no turbo):
before: 7.35 us
after: 6.39 us
(Take the torch.add timings with a grain of salt. The difference in timings
is much larger than I would expect.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20609
Differential Revision: D15385800
Pulled By: colesbury
fbshipit-source-id: 66415b148a3b19360b9de9881af594ab46547b6f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19285
The either type is a tagged union with two members.
This is going to be used in a diff stacked on top to allow a function to return one of two types.
Also, generally, `either<Error, Result>` is a great pattern for returning a value-or-error result from a function without using exceptions, and we could use this class for that later.
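For illustration only (this sketch uses std::variant rather than the either class added here, and all names are mine), the value-or-error shape looks like:
```
#include <iostream>
#include <string>
#include <variant>

// A tagged union with two members holds exactly one of them at a time:
// here the "left" member carries an error, the "right" a result.
using Error = std::string;

std::variant<Error, int> parse_port(const std::string& s) {
  if (s.empty()) return Error("empty string");
  int value = 0;
  for (char c : s) {
    if (c < '0' || c > '9') return Error("not a number: " + s);
    value = value * 10 + (c - '0');
  }
  return value;  // the "right" member: a successful result
}

int main() {
  auto r = parse_port("8080");
  if (auto* port = std::get_if<int>(&r)) {
    std::cout << "port " << *port << "\n";
  } else {
    std::cout << "error: " << std::get<Error>(r) << "\n";
  }
}
```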
Reviewed By: dzhulgakov
Differential Revision: D14931923
fbshipit-source-id: 7d1dd77b3e5b655f331444394dcdeab24772ab3a
Summary:
Dear All,
The proposed patch fixes the test code snippets used in the CMake infrastructure and the resulting silent failure to set the ```CAFFE2_COMPILER_SUPPORTS_AVX2_EXTENSIONS``` flag properly. Without the fix, libcaffe2.so ends up with unresolved (```UND```) AVX2-related references, rendering it unusable.
* With GCC 9, the test code from the CMake build infra always fails to compile:
```
$ gcc -O2 -g -pipe -Wall -m64 -mtune=generic -fopenmp -DCXX_HAS_AVX_1 -fPIE -o test.o -c test.c -mavx2
test.c: In function ‘main’:
test.c:11:26: error: incompatible type for argument 1 of ‘_mm256_extract_epi64’
11 | _mm256_extract_epi64(x, 0); // we rely on this in our AVX2 code
| ^
| |
| __m256 {aka __vector(8) float}
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/immintrin.h:51,
from test.c:4:
/usr/lib/gcc/x86_64-redhat-linux/9/include/avxintrin.h:550:31: note: expected ‘__m256i’ {aka ‘__vector(4) long long int’} but argument is of type ‘__m256’ {aka ‘__vector(8) float’}
550 | _mm256_extract_epi64 (__m256i __X, const int __N)
|
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 9.0.1 20190328 (Red Hat 9.0.1-0.12) (GCC)
```
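The probe only needs an integer vector; a minimal corrected sketch (the actual patch may differ) looks like:
```
#include <immintrin.h>

/* Sketch of a corrected probe: _mm256_extract_epi64 takes a __m256i
 * integer vector, so the test value must not be a __m256 float vector. */
int main() {
  __m256i x = _mm256_set1_epi64x(1);
  return (int)_mm256_extract_epi64(x, 0); /* we rely on this in our AVX2 code */
}
```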
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18991
Differential Revision: D14821838
Pulled By: ezyang
fbshipit-source-id: 7eb3a854a1a831f6fda8ed7ad089746230b529d7
Summary:
Multiple configurations are the default on Windows (e.g. Release;Debug), and this check always broke that setup because CMAKE_BUILD_TYPE was not set. The workaround was to always set CMAKE_BUILD_TYPE to Debug or Release, which was very unfortunate.
The correct method is to use generator expressions that expand depending on the current CONFIG being processed.
Side note: Anywhere else CMAKE_BUILD_TYPE is checked should probably be fixed too.
Note that the CMakeLists.txt forces it into Release mode. However, I came across this error when importing the prebuilt Config into another project, where CMAKE_BUILD_TYPE was not set.
> 3>CMake Error at pre_built/pytorch-1.0.1/share/cmake/Caffe2/public/cuda.cmake:380 (message):
> 3> Unknown cmake build type:
Proper support for configurations would mean we can build debug and release at the same time and as you can see, it is less CMake code.
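For illustration (the target and definition are hypothetical, not from this PR), the pattern replaces a build-type branch with a per-configuration expression:
```
# Instead of branching on CMAKE_BUILD_TYPE, which is empty for
# multi-config generators such as Visual Studio:
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
  target_compile_definitions(my_target PRIVATE MY_DEBUG_CHECKS=1)
endif()

# use a generator expression evaluated for the configuration being built:
target_compile_definitions(my_target PRIVATE $<$<CONFIG:Debug>:MY_DEBUG_CHECKS=1>)
```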
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18548
Differential Revision: D14730790
Pulled By: ezyang
fbshipit-source-id: 70ae16832870d742c577c34a50ec7564c3da0afb
Summary:
Previously the build would look for the Config file even if it was not written.
Fixes #18419
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18421
Differential Revision: D14597139
Pulled By: ezyang
fbshipit-source-id: c212cbf5dc91564c12d9d07e507c8285e11c6bdf
Summary:
Our AVX2 routines use functions such as _mm256_extract_epi64
that do not exist on 32 bit systems even when they have AVX2.
This disables AVX2 when _mm256_extract_epi64 does not exist.
This fixes the "local" part of #17901 (except for disabling FBGEMM),
but SLEEF still needs to be updated and NNPACK fixed; see the bug
report for further discussion.
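The actual change is a CMake compile check, but the idea, expressed as a preprocessor sketch (macro name hypothetical), is:
```
/* Sketch (macro name hypothetical): _mm256_extract_epi64 is only
 * declared for 64-bit targets, so AVX2 kernels are gated on a 64-bit
 * check in addition to AVX2 support. */
#if defined(__AVX2__) && (defined(__x86_64__) || defined(_M_X64))
#define CAFFE2_HAS_USABLE_AVX2 1
#endif
```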
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17915
Differential Revision: D14437338
Pulled By: soumith
fbshipit-source-id: d4ef7e0801b5d1222a855a38ec207dd88b4680da
Summary:
This changes the libnvToolsExt dependency to go through CMake find_library.
I have a machine where the CUDA libs, and libnvToolsExt in particular, are in the "usual library locations". It would be neat if we could find libnvToolsExt there, falling back to the currently hardcoded path as the default.
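A sketch of the find_library approach (variable name and hint paths hypothetical, not the exact hunk from this PR):
```
# Let CMake locate libnvToolsExt instead of hardcoding one path.
find_library(CUDA_NVTX_LIBRARY
  NAMES nvToolsExt
  HINTS ${CUDA_TOOLKIT_ROOT_DIR}/lib64 ${CUDA_TOOLKIT_ROOT_DIR}/lib)
```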
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16714
Differential Revision: D14020315
Pulled By: ezyang
fbshipit-source-id: 00be27be10b1863ca92fd585f273d50bded850f8
Summary:
Rehash of previous attempts. This tries a different approach where we accept the install as specified in cmake (leaving bin/, include/, and lib/ alone), and then try to adjust the rest of the files to this more standard layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16414
Differential Revision: D13863635
Pulled By: zdevito
fbshipit-source-id: 23725f5c64d7509bf3ca8f472dcdcad074de9828
Summary:
Initial enabling of the upcoming hip-clang compiler for the PyTorch source base.
Changes:
* update the Eigen submodule to a version that includes our upstreamed hip-clang enablement
* modify a few ifdef guards to work with the `__HIP__` macro defined by hip-clang (see the sketch after this list)
* use `__lane_id` instead of `hc::__lane_id`
* add Debug flags for ROCm to the cmake infrastructure
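A minimal sketch of such a guard (placement and function name hypothetical; `__HCC__` is the macro the older hcc toolchain defines):
```
// hip-clang defines __HIP__ while the older hcc toolchain defines
// __HCC__, so device code guards accept either; the free function
// __lane_id() replaces hc::__lane_id().
#if defined(__HCC__) || defined(__HIP__)
__device__ unsigned int current_lane() {
  return __lane_id();
}
#endif
```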
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16085
Differential Revision: D13709459
Pulled By: ezyang
fbshipit-source-id: 1b7b33fe810a0434766180580d4443ea177eb7c7
Summary:
This tests the water for adding back NNPACK in PyTorch; it's a lot better than the fallback THNN versions.
In #6151, we (ezyang and soumith) removed NNPACK support from PyTorch. Of course Maratyszcza might have advice, too. (Or an opinion on the CMake changes.)
The only functional changes are to use NNPACK more aggressively on mobile and to add a .contiguous() to match NNPACK's assumption (I stumbled over that while using NNPACK for style transfer).
The CMake changes try to use the NNPack we already have in git.
In terms of lines of code this is a large part of the diff of https://lernapparat.de/pytorch-jit-android/ . As far as I can tell, we don't have MKLDNN on mobile and the native THNN implementations are prohibitively expensive in terms of both CPU and memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15924
Differential Revision: D13709576
Pulled By: ezyang
fbshipit-source-id: f2e287739909451c173abf046588209a7450ca2c
Summary: This PR aims to remove support for cuDNN 6.
Differential Revision: D13709595
Pulled By: ezyang
fbshipit-source-id: 853624db1cf66b0534d7028654c38c2806fb4107
Summary:
1. Add some gloo communication operators to the related fallback lists;
2. Work around compile errors when using a fallback operator whose CPU operator inherits directly from 'OperatorBase', like PrefetchOperator;
3. Add new CPU context support for some Python module files and the resnet50 training example file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11330
Reviewed By: yinghai
Differential Revision: D13624519
Pulled By: wesolwsk
fbshipit-source-id: ce39d57ddb8cd7786db2e873bfe954069d972f4f
Summary:
bypass-lint
- Change all Caffe2 builds to use setup.py instead of cmake
- Add a -cmake- Caffe2 build configuration that uses cmake and only builds cpp
- Move skipIfCI logic from onnx test scripts to the rest of CI logic
- Removal of old PYTHONPATH/LD_LIBRARY_PATH/etc. env management
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15917
Reviewed By: orionr
Differential Revision: D13637583
Pulled By: pjh5
fbshipit-source-id: c5c5639db0251ba12b6e4b51b2ac3b26a8953153