Compare commits


107 Commits

Author SHA1 Message Date
3c31d73c87 [ONNX] Fix pow op export [1.5.1] (#39791)
* [ONNX] Fix pow op export (#38065)

Summary:
Fix pow type cast for opset 9 and update opset 12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38065

Differential Revision: D21485353

Pulled By: malfet

fbshipit-source-id: 3993e835ffad07b2e6585eb5cf1cb7c8474de2ec

* Update ort-nightly version as suggested in https://github.com/pytorch/pytorch/pull/39685#issuecomment-641452470

* Apply changes from https://github.com/pytorch/pytorch/pull/37846 to  `test_topk_smallest_unsorted`

Co-authored-by: neginraoof <neginmr@utexas.edu>
2020-06-11 15:26:46 -07:00
dfe8cdff5a [v1.5.1] add dtype checks for scatter/gather family of functions (#39773)
* add dtype checks for scatter/gather family of functions [1.5.1]

Adds additional dtype checks for scatter/gather family of functions, namely:
1. Checks whether `index` is of type `Long`
2. Checks whether `src.dtype == self.dtype`.

This is a rather involved rework of https://github.com/pytorch/pytorch/pull/38646
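
For illustration, a minimal sketch of the two misuses these checks now reject (tensor shapes and values are arbitrary, chosen only for the example):

```python
import torch

dst = torch.zeros(3, 5)
src = torch.ones(3, 5)

# 1. `index` must be a LongTensor; an int32 index is now rejected up front.
bad_index = torch.zeros(3, 5, dtype=torch.int32)
try:
    dst.scatter_(1, bad_index, src)
except RuntimeError as err:
    print("index dtype check:", err)

# 2. `src.dtype` must match `self.dtype`; mixing float32 and float64 is rejected.
long_index = torch.zeros(3, 5, dtype=torch.long)
try:
    dst.scatter_(1, long_index, src.double())
except RuntimeError as err:
    print("src dtype check:", err)
```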

* Adjust test to match both TH and ATen exception patterns
2020-06-10 10:29:54 -07:00
e7a6ed8151 [v1.5.1] add dtype checking for gather and scatter (#38025)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/37996

In `cpu_scatter_gather_base_kernel`, the index pointer is interpreted as `int64_t` regardless of the actual dtype.
2b41b9bceb/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp (L106)
Adding an index dtype check avoids the nasty index-out-of-bounds error. As using `int64_t` for indices is the convention in ATen code (i.e., a known limitation), no further fix is needed at the moment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38025

Differential Revision: D21498146

Pulled By: ezyang

fbshipit-source-id: b1f96f394a460c4bc63d21ec8d4a2cfbf3e97b03
2020-06-09 10:59:51 -04:00
fc0dde5db3 Fix weight quantization in RNNs
Weight quantization was done incorrectly for LSTMs: the statistics for all weights (across layers) were combined in a single observer. This meant that weights for later layers in an LSTM would use sub-optimal scales, impacting accuracy. The problem gets worse as the number of layers increases.
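
The fix matters when dynamically quantizing multi-layer recurrent networks. A minimal sketch of the affected path, assuming the standard dynamic-quantization entry point (sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers: previously one observer pooled the weight statistics
# of both layers, so the later layer's weights got a sub-optimal scale.
lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2)

quantized = torch.quantization.quantize_dynamic(lstm, {nn.LSTM}, dtype=torch.qint8)

x = torch.randn(5, 3, 16)        # (seq_len, batch, input_size)
out, (h, c) = quantized(x)
print(out.shape)                 # torch.Size([5, 3, 32])
```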

Differential Revision: [D20842145](https://our.internmc.facebook.com/intern/diff/D20842145/)

[ghstack-poisoned]
2020-06-09 10:39:57 -04:00
83edd5164a [1.5.1] Check illegal output dtype for torch.{min, max} (#39686)
* Check illegal output dtype for torch.{min, max}

Summary:
The test is currently only enabled for CPU, and it will be enabled for CUDA after the migration of `min` and `max` from THC to ATen is done.
This is a cherry-pick of https://github.com/pytorch/pytorch/pull/38850
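
A hedged sketch of the kind of call the new check rejects: the `out=` values tensor has a dtype that does not match the input.

```python
import torch

x = torch.randn(4, 5)                            # float32 input
values = torch.empty(5, dtype=torch.float64)     # illegal: dtype differs from input
indices = torch.empty(5, dtype=torch.long)

try:
    torch.max(x, dim=0, out=(values, indices))
except RuntimeError as err:
    print(err)   # the dtype mismatch is now reported up front
```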

* Skip test_minmax_illegal_dtype for XLA

Co-authored-by: Xiang Gao <qasdfgtyuiop@gmail.com>
2020-06-08 21:25:10 -07:00
833c4201ad allow user passing relative paths in include_dirs within setuptools.setup (#38264)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38264

Test Plan: Imported from OSS

Differential Revision: D21509277

Pulled By: glaringlee

fbshipit-source-id: b0bc17d375a89b96b1bdacde5987b4f4baa9468e
2020-06-08 11:35:49 -04:00
5579c9e4c2 [v1.5.1] Remove duplicate 'with_gil' declaration.
This gets picked up by mypy as an error in 1.5.1; not sure whether it's a different mypy version or setting, but we might as well fix it.

ghstack-source-id: 016f8d4bdb0444dd8285f1f29bdc8f2db2265c12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39540
2020-06-08 11:34:31 -04:00
367901e1f9 [v1.5.1 cherry-pick] Work around building onnx in older rocm docker images (#39253) (#39547)
Summary:
Cherry-pick of https://github.com/pytorch/pytorch/pull/39253

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2020-06-04 21:10:57 -07:00
c4903bde61 [1.5.1] Bug fix for argmin/argmax (#39212) 2020-06-03 19:26:54 -04:00
7d2fcd505c [v1.5.1 cherry pick] fix the device inconsistency for import convert_sync_batchnorm (#39344)
* resolve merge conflict

* Remove wrong merge

Co-authored-by: jiej <jiej@nvidia.com>
2020-06-03 17:27:26 -04:00
bb33e5fc85 as_strided : add size and stride length check (#39301)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39281
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39301

Differential Revision: D21849082

Pulled By: gchanan

fbshipit-source-id: 5d30ef10767c4d35c6cb59c5e6a9acbfe0270a40
2020-06-03 16:52:29 -04:00
c5424a85dc Make _C extension a thin C wrapper (#39422)
Summary:
The `_C` extension now just depends on a single `torch_python` library.
Since the C library does not depend on the standard C++ library, this closes https://github.com/pytorch/pytorch/issues/36941
This is a cherry-pick of https://github.com/pytorch/pytorch/pull/39375 into release/1.5 branch
2020-06-03 07:44:48 -07:00
5d01f87e58 fix asserts in cuda code (#39047)
Summary:
Gets rid of some in-kernel asserts where they can be replaced with static_asserts.
Replaces a bare in-kernel `assert` in one case with `CUDA_KERNEL_ASSERT` where necessary.
Replaces host-code `assert`s with `TORCH_INTERNAL_ASSERT`.
Another group of asserts is in the fractional max pooling kernels, which should be fixed regardless (https://github.com/pytorch/pytorch/issues/39044); the problems there are not just the asserts.
I've audited the remaining cases of in-kernel asserts; they are more like `TORCH_INTERNAL_ASSERT`, so they should not fire with valid user data. I think it's ok to leave them as is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39047

Differential Revision: D21750392

Pulled By: ngimel

fbshipit-source-id: e9417523a2c672284de3515933cb7ed166e56719
2020-06-03 10:01:08 -04:00
82f549b0a8 [v1.5.1][JIT] make torch.unique compilable (#38156)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/37986

Follows the stack in https://github.com/pytorch/pytorch/pull/33783 to make functions in `torch/functional.py` resolve to their Python implementations. Because the return type of `torch.unique` depends on `return_inverse` and `return_counts`, I had to refactor the implementation to use our boolean_dispatch mechanism.
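
A small sketch of what this enables: calling `torch.unique` from TorchScript, with the return type fixed by the boolean flags.

```python
import torch

@torch.jit.script
def unique_with_counts(x: torch.Tensor):
    # The (values, counts) return type is selected statically by the flags,
    # which is why the Python implementation goes through boolean_dispatch.
    return torch.unique(x, sorted=True, return_counts=True)

vals, counts = unique_with_counts(torch.tensor([1, 1, 2, 3, 3, 3]))
print(vals)    # tensor([1, 2, 3])
print(counts)  # tensor([2, 1, 3])
```
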
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38156

Differential Revision: D21504449

Pulled By: eellison

fbshipit-source-id: 7efb1dff3b5c00655da10168403ac4817286ff59
2020-06-02 12:00:57 -04:00
f306655d49 [v1.5.1] Implement CUDA_KERNEL_ASSERT for MSVC (#39218) (#39288)
* Implement CUDA_KERNEL_ASSERT for MSVC (#39218)

Summary:
Tested locally on CPU/GPU + Debug/Release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39218

Differential Revision: D21786500

Pulled By: malfet

fbshipit-source-id: 7e871003d3509436952932b5ff3599e36bb8f205

# Conflicts:
#	test/test_cuda.py

* Fix one more conflict
2020-06-01 16:01:27 -07:00
409e42e3b8 Restore thread_local states in continuation thread on RPC servers (#38512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38512

As we gradually make RPC non-blocking on the server side, the
processing of the same request can yield and resume on different threads.
Hence, we need to populate thread-local states (e.g., the context id) in
the continuation thread.

Fixes #38439

Test Plan: Imported from OSS

Differential Revision: D21583642

Pulled By: mrshenli

fbshipit-source-id: a79bce1cb207fd11f1fa02b08465e49badda65fc
2020-06-01 17:56:25 -04:00
6151405f6c Fix DDP bug in single process multiple device use cases (#36503)
This is a commit to merge #36503 into the 1.5.1 release. It fixes
single-process multi-GPU DDP use cases by explicitly exposing the
model replicas' parameters to DDP. #36656 landed in master at 8d6a8d2.
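
A hedged sketch of the single-process multi-GPU pattern this targets; it assumes two visible GPUs and a process group that has already been initialized for a single rank.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# One process drives two GPUs: DDP keeps a model replica per device, and the
# fix makes the replicas' parameters visible to DDP's reducer.
model = nn.Linear(10, 10).to("cuda:0")
ddp_model = DDP(model, device_ids=[0, 1], output_device=0)

out = ddp_model(torch.randn(8, 10, device="cuda:0"))
out.sum().backward()
```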

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36503

Test Plan: Imported from OSS

Differential Revision: D21179274

Pulled By: mrshenli

fbshipit-source-id: 0afce30ae0ddda753d1e240584a0f80df9aec4c2
2020-06-01 17:56:01 -04:00
d01065e50c fix argmin/argmax behavior wrt autograd 2020-06-01 11:17:47 -04:00
67508dadaa Update FBGEMM hash (#39278)
Includes FBGEMM-1.5.0 hash + cherry-picked https://github.com/pytorch/FBGEMM/pull/381
2020-06-01 08:07:39 -07:00
b54a731c8e [v1.5.1] fix clip_grad_norm to work with parameters on the different devices (#38615)
Summary:
Per title.
We move all the individual gradient norms to a single device before stacking (a no-op if all the gradients are already on a single device). `clip_coef` is copied to the device of each gradient, which may be suboptimal as there could be multiple copies, but it is no worse than when we synchronized for each parameter. In the simple case where all gradients are on a single device, there should be no synchronization.
Also, we no longer error out if the parameter list is empty or none of the parameters have gradients; a total_norm of 0 is returned instead.
Fixes https://github.com/pytorch/pytorch/issues/38605
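
A sketch of the two behaviors described above (it assumes a CUDA device is available for the mixed-device case):

```python
import torch
from torch.nn.utils import clip_grad_norm_

# Gradients living on different devices are gathered onto one device before
# their norms are stacked; there is no longer a sync per parameter.
a = torch.randn(3, requires_grad=True)                    # CPU
b = torch.randn(3, requires_grad=True, device="cuda")     # GPU
a.sum().backward()
b.sum().backward()
total_norm = clip_grad_norm_([a, b], max_norm=1.0)

# Parameters without gradients no longer raise; the total norm is simply 0.
c = torch.randn(3, requires_grad=True)                    # c.grad is None
print(clip_grad_norm_([c], max_norm=1.0))                 # 0
```
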
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38615

Reviewed By: ailzhang

Differential Revision: D21634588

Pulled By: ngimel

fbshipit-source-id: ea4d08d4f3445438260052820c7ca285231a156b
2020-05-29 19:12:26 -04:00
3920c1d173 Support paths with spaces when building ninja extension (#38670)
Summary:
Generates the following `build.ninja` file, which builds successfully:
```
cflags = -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA '-I/scratch/yuxinwu/space space/detectron2/layers/csrc' -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include/TH -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include/THC -I/public/apps/cuda/10.1/include -I/private/home/yuxinwu/miniconda3/include/python3.7m -c
post_cflags = -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cuda_cflags = -DWITH_CUDA '-I/scratch/yuxinwu/space space/detectron2/layers/csrc' -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include/TH -I/private/home/yuxinwu/miniconda3/lib/python3.7/site-packages/torch/include/THC -I/public/apps/cuda/10.1/include -I/private/home/yuxinwu/miniconda3/include/python3.7m -c
cuda_post_cflags = -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -ccbin=/public/apps/gcc/7.1.0/bin/gcc -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -std=c++14
ldflags =

rule compile
  command = $cxx -MMD -MF $out.d $cflags -c $in -o $out $post_cflags
  depfile = $out.d
  deps = gcc

rule cuda_compile
  command = $nvcc $cuda_cflags -c $in -o $out $cuda_post_cflags

build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/vision.o: compile /scratch/yuxinwu/space$ space/detectron2/layers/csrc/vision.cpp
build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/box_iou_rotated/box_iou_rotated_cpu.o: compile /scratch/yuxinwu/space$ space/detectron2/layers/csrc/box_iou_rotated/box_iou_rotated_cpu.cpp
build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/ROIAlignRotated/ROIAlignRotated_cpu.o: compile /scratch/yuxinwu/space$ space/detectron2/layers/csrc/ROIAlignRotated/ROIAlignRotated_cpu.cpp
build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/nms_rotated/nms_rotated_cpu.o: compile /scratch/yuxinwu/space$ space/detectron2/layers/csrc/nms_rotated/nms_rotated_cpu.cpp
build /scratch/yuxinwu/space$ space/build/temp.linux-x86_64-3.7/scratch/yuxinwu/space$ space/detectron2/layers/csrc/ROIAlign/ROIAlign_cpu.o: compile /scratch/yuxinwu/space$ space/detectron2/layers/csrc/ROIAlign/ROIAlign_cpu.cpp

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38670

Differential Revision: D21689613

Pulled By: ppwwyyxx

fbshipit-source-id: 1f71b12433e18f6b0c6aad5e1b390b4438654563
2020-05-29 09:48:49 -04:00
8d48a6490a Fix cpp extension build failure if path contains space (#38860)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38860

Differential Revision: D21686335

Pulled By: ezyang

fbshipit-source-id: 2675f4f70b48ae3b58ea597a2b584b446d03c704
2020-05-29 09:48:40 -04:00
17eae0e0cd restore proper cuda assert behavior with DNDEBUG (#38943)
Summary:
Per title. https://github.com/pytorch/pytorch/issues/32719 essentially disabled asserts in CUDA kernels in release builds. Asserts in CUDA kernels are typically used to prevent invalid reads/writes, so without them invalid reads/writes become silent errors in most cases (sometimes they would still cause "illegal memory access" errors, but because of the caching allocator this usually won't happen).
We don't need two macros (CUDA_ALWAYS_ASSERT and CUDA_KERNEL_ASSERT), because all current asserts in CUDA kernels are important to prevent illegal memory accesses and should never be disabled.
This PR removes the CUDA_ALWAYS_ASSERT macro and instead makes CUDA_KERNEL_ASSERT (the one commonly used in the kernels) an assertion in both release and debug builds.
Fixes https://github.com/pytorch/pytorch/issues/38771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38943

Differential Revision: D21723767

Pulled By: ngimel

fbshipit-source-id: d88d8aa1b047b476d5340e69311e65aff4da5074
2020-05-28 19:00:31 -04:00
4a9e45d50e [v1.5.1] Reduction should not coalesce_dimensions when splitting for 32bit indexing (#37788)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37788

Differential Revision: D21387325

Pulled By: ngimel

fbshipit-source-id: dbd0f5a23e06d8c4cc68cd21b09b4b0221c4bba7
2020-05-28 19:00:16 -04:00
eb387a0a2b Give _VariableFunctions class a different name, so pickling works (#38033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38033

Pickles require class names to be actually accessible from the module
in question.  _VariableFunction was not!  This fixes it.

Fixes https://github.com/pytorch/pytorch/issues/37703

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21458068

Pulled By: ezyang

fbshipit-source-id: 2a5ac41f9d1972e300724981b9b4b84364ddc18c
2020-05-28 14:15:35 -04:00
420c6dc43d [v1.5.1] Fixes floordiv dunder registrations (#38695)
Summary:
floordiv was missing a couple of dunder registrations, which was causing __ifloordiv__ to not be called when it should. This adds the appropriate registrations and adds a test verifying that the inplace dunders are actually occurring in place.
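
A minimal sketch of the behavior the registration restores: `//=` now dispatches to `__ifloordiv__` and modifies the tensor in place.

```python
import torch

t = torch.arange(10.)
storage_before = t.data_ptr()

t //= 3                               # __ifloordiv__: in-place after the fix
assert t.data_ptr() == storage_before
print(t)                              # tensor([0., 0., 0., 1., 1., 1., 2., 2., 2., 3.])
```
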
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38695

Differential Revision: D21633980

Pulled By: mruberry

fbshipit-source-id: a423f5ec327cdc062fd6d9d56abd36fe44ac8198
2020-05-28 14:13:29 -04:00
39f0a2752a fix multinomial kernels to properly advance random states (#38046)
Summary:
Before, multinomial kernels did not advance random states enough, which led to the same sequence being generated over and over with a shift of 4. This PR fixes that.
Fixes https://github.com/pytorch/pytorch/issues/37403
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38046

Differential Revision: D21516542

Pulled By: ngimel

fbshipit-source-id: 23248a8c3a5c44316c4c35cd71a8c3b5f76c90f2
2020-05-28 14:07:13 -04:00
366026ab10 Fix memory usage increase reported in #38568 (#38674)
Summary:
Update to the in-place version of the bias add in convolution; this saves an unnecessary memory allocation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38674

Differential Revision: D21626080

Pulled By: ngimel

fbshipit-source-id: 4f52a3ae2e5aefae372d8ea5188336216f910da3
2020-05-28 13:54:00 -04:00
408e158df9 skip ctc_loss test on Windows (#35069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35069

It is flaky on Windows only, so disable for now:
https://github.com/pytorch/pytorch/issues/34870

Test Plan: Imported from OSS

Differential Revision: D20544736

Pulled By: suo

fbshipit-source-id: 49e35a4b4f0d1d20157769a4dff22cb4fe86770c
2020-05-28 13:52:47 -04:00
3598dea7ad Pin flake8 to 3.7.9 (#38269)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38269

Test Plan: Imported from OSS

Differential Revision: D21510318

Pulled By: mrshenli

fbshipit-source-id: ac57a0ffed7401c13b7983b8685a8706b8181142
2020-05-27 18:12:10 -04:00
a5b05e8867 Correct Javadoc link (#39039)
Correct Javadoc link to match the 1.4 version: https://github.com/pytorch/pytorch/blob/release/1.4/docs/source/index.rst
2020-05-27 12:41:47 -07:00
7fc2433458 Fix conv non zero padding being applied in wrong dim (#37881)
Summary:
It turns out `F.pad` takes padding for dimensions in reverse order (last dimension first). Fixes https://github.com/pytorch/pytorch/issues/37844
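
For reference, a small sketch of the ordering: the pad tuple is consumed starting from the last dimension.

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 4, 5)        # (N, C, H, W)

# (left, right) pads the last dim (W); (left, right, top, bottom) also pads H.
y = F.pad(x, (1, 2))               # W: 5 -> 8
z = F.pad(x, (1, 2, 3, 4))         # W: 5 -> 8, H: 4 -> 11
print(y.shape)                     # torch.Size([1, 1, 4, 8])
print(z.shape)                     # torch.Size([1, 1, 11, 8])
```
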
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37881

Differential Revision: D21554011

Pulled By: soumith

fbshipit-source-id: a85a7f6db9f981d915728965903c5c57b6617c93
2020-05-18 11:26:00 -04:00
aba610b9e8 add slope == 0 case into standard leaky relu nn test (#37559)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37559

Test Plan: Imported from OSS

Differential Revision: D21319922

Pulled By: glaringlee

fbshipit-source-id: 212ef8e9d0f0d55a312d282693cd5990e0376c6a
2020-05-05 11:32:04 -04:00
dc30c519dd allow inplace leaky_relu backward calc when slope == 0 (#37453)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37453

to fix (#37345)

Test Plan: Imported from OSS

Differential Revision: D21290911

Pulled By: glaringlee

fbshipit-source-id: 81677e9e195298bc1bde82b77c51f52d58aa5422
2020-05-05 11:32:04 -04:00
9bf2aaa659 Fix cpp extension compile failure on some envs (#37221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37221

Test Plan: Imported from OSS

Differential Revision: D21226873

Pulled By: glaringlee

fbshipit-source-id: 0a390bbeaf153ee5ec355943f92c2dbcc5e04b59
2020-05-05 11:32:04 -04:00
25621d05df Don't use NonVariableTypeMode in custom ops (#37355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37355

Potentially fixes https://github.com/pytorch/pytorch/issues/37306
ghstack-source-id: 103073537

Test Plan: waitforsandcastle

Differential Revision: D21261946

fbshipit-source-id: 454652b528dcf942bec5438f89201822de40bbf0
2020-04-29 21:10:19 -07:00
96f218d7dd Add experimental tag 2020-04-21 10:12:52 -07:00
f810011c40 Update persons_of_interest.rst (#37001)
Co-authored-by: Joseph Spisak <spisakjo@gmail.com>
2020-04-21 12:13:52 -04:00
5f8bb352c3 Move rpc.rst back to the source folder to preserve existing doc URLs (#36675) (#36732)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36675

Test Plan: Imported from OSS

Differential Revision: D21048628

Pulled By: mrshenli

fbshipit-source-id: 3cb1b35ddc1f40c673b0db9048d77dfa024be1e7

Co-authored-by: Shen Li <shenli@devfair017.maas>
2020-04-21 07:55:53 -07:00
52469a512b run the simple executor for jit tests by default, add profiling jobs for fusion tests (#36933)
* run the simple executor for jit tests by default, add profiling jobs for fusion tests

* fix flake8 warnings

* fix ci failures

* fix test_determination.py
2020-04-21 10:52:39 -04:00
c56adee862 Add new C++ landing page and update in index.rst (#36972)
* Add cpp landing page

* Update C++ to go to cpp_index.rst
2020-04-21 00:35:45 -07:00
4ff3872a20 [v.1.5.0] Ensure linearIndex of advanced indexing backwards is contig… (#36962)
* [v.1.5.0] Ensure linearIndex of advanced indexing backwards is contiguous.

This is a more straightforward solution to the problem than https://github.com/pytorch/pytorch/pull/36957; I don't know about the relative performance.

Fixes: #36956

ghstack-source-id: 43c48eaee7232cd3ed2b108edbbee24c11e8321a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36959

* Fix test.
2020-04-20 19:59:38 -04:00
d7bdffabed [v1.5 Patch] Disable flaky test_backward_node_failure_python_udf test in dist_autograd_test.py
This test is flaky on 1.5 release branch. Below is a failed CI run:
https://app.circleci.com/pipelines/github/pytorch/pytorch/157331/workflows/b3e0bd6b-6c55-4d14-bde8-96b8345cf9e2/jobs/5190025
2020-04-20 14:25:32 -04:00
9ba0a89489 Overwrite bazel if /usr/bin/bazel already exists. 2020-04-20 14:24:42 -04:00
c164fbccb1 Add TorchServe 2020-04-19 21:44:32 -07:00
9a51e477ac make simple executor the default for OSS 2020-04-17 20:00:53 -04:00
375566fb78 Handle log_sigmoid(out=) properly.
Fixes: https://github.com/pytorch/pytorch/issues/36499

Changes:
1) Moves some bindings from LegacyNNDefinitions to Activation so all of log_sigmoid lives together
2) Properly handles non-contiguous / incorrectly sized out parameters to log_sigmoid. This is done by copying from a buffer if necessary.
3) Requires that the internal buffer (distinct from the one in 2)) is contiguous. This should always be the case because it's always created internally.
4) Adds a test
2020-04-17 15:43:35 -04:00
dfdc788076 Fix incorrect merge of #34136.
If you look at https://github.com/pytorch/pytorch/pull/34136/, you will notice a commit (80c15c087c) that didn't get merged.
This is to address that, to avoid crashing on remainder when the rhs is 0.

ghstack-source-id: e805e290bd4b7d3165fd78d4e537e56e4c459162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36760
2020-04-17 15:42:20 -04:00
9e6ef814cc [v1.5.0] Print keyword-only arg symbol for function signature suggestions.
Fixes: https://github.com/pytorch/pytorch/issues/36773

ghstack-source-id: 6b08839ffc8b228e9533a47b7fd034367fc93dec
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36780
2020-04-17 15:42:04 -04:00
31461800f6 Migrate release CI jobs to CircleCI for Windows (v1.5 Release) (#36658)
* Migrate release CI jobs to CircleCI for Windows (v1.5 Release)

* Fix comments
2020-04-16 12:18:27 -04:00
Jie
e741839b0e Fixing SyncBN dgrad (#36382)
Summary:
The previous PR, https://github.com/pytorch/pytorch/issues/22248, which added support for varying batch sizes across processes, doesn't account for mean_dy/mean_dy_xmu on the backward path, which produces a wrong dgrad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36382

Differential Revision: D20984446

Pulled By: ngimel

fbshipit-source-id: 80066eee83760b275d61e2cdd4e86facca5577fd
2020-04-16 10:58:16 -04:00
8eb39c9cfd [CI] fix test_distributed for python 3.8+ (#36542)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36542

Python 3.8 set the default multiprocessing start method to spawn, but we
need fork in these tests; otherwise there are pickling issues.
Test: Ensure that these tests succeed when run with Python 3.8
ghstack-source-id: 102093824
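
A generic sketch of pinning the start method (independent of the PyTorch test harness); fork is only available on Unix.

```python
import multiprocessing as mp

def worker(q):
    q.put("hello from the child")

if __name__ == "__main__":
    # Request "fork" explicitly instead of relying on the interpreter default,
    # so objects that are not picklable can still be inherited by the child.
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```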

Test Plan: Ensure success with python 3.8

Differential Revision: D21007753

fbshipit-source-id: 4b39844c6ba76a53293c0dfde7c98ec5a78fe113
2020-04-16 10:54:57 -04:00
b5e4c0993d Add a warning for Single-Process Multi-GPU DDP 2020-04-15 19:08:24 -04:00
6bc6832bda fix syntax 2020-04-15 19:00:11 -04:00
593594839c Update docs for 1.5 to remove Python 2 references (#36338)
* Remove python 2 from jit.rst

* Remove python 2 from jit_language_reference.rst

* Remove python 2 from multiprocessing.rst

* Remove python 2 from named_tensor.rst

* Remove python 2 from multiprocessing.rst

* Remove python 2 from windows.rst

* Update multiprocessing.rst

* Remove python 2 from notes/multiprocessing.rst
2020-04-14 15:57:02 -07:00
cf65c8ef15 Fix torch.min docs (#36319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36319

On the way to resolving #35216.
This is a fix for just the master branch but once this goes in,
I'll send a cherry-pick to release/1.5

The problem is that we were not calling `format` on a string that had
templates (e.g., '{input}', '{dim}'). This change makes it so that we
call format on the entire docstring for `torch.min`.

Test Plan:
- The `torch.max` docs are OK:
https://pytorch.org/docs/master/torch.html#torch.max and don't need
changing.
- `torch.min` docs, before this change: see second screenshot in #35216.
- after this change: <Insert link here on github>

![image](https://user-images.githubusercontent.com/5652049/78921702-4e2acc00-7a63-11ea-9ea0-89636ff6fb0a.png)

Differential Revision: D20946702

Pulled By: zou3519

fbshipit-source-id: a1a28707e41136a9bb170c8a4191786cf037a0c2
2020-04-13 19:03:03 -04:00
ca0dc1fcdc skip test in 3.8 because of inconsistent regex 2020-04-10 11:06:47 -07:00
b58f89b2e4 Use counter instead of vector of futures in _parallel_run (#36159) (#36334)
Summary:
This should be faster than allocating one mutex, flag, and condition variable per task.

Using `std::atomic<size_t>` to count remaining tasks is not sufficient,
because modifying the remaining counter and signalling the condition variable must happen atomically;
otherwise `wait()` might get invoked after `notify_one()` was called.
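
The same pattern sketched in Python with `threading` (not the C++ implementation itself): the counter is decremented and the condition variable signalled under one lock, so the waiter cannot miss the notification, and the waiter re-checks the predicate to tolerate spurious wakeups.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

remaining = 5
lock = threading.Lock()
done = threading.Condition(lock)

def task(i):
    global remaining
    # ... per-task work would go here ...
    with lock:                # decrement and notify while holding the lock,
        remaining -= 1        # so the waiter cannot observe the old count,
        if remaining == 0:    # get preempted, and then miss the signal
            done.notify()

with ThreadPoolExecutor(max_workers=5) as pool:
    for i in range(5):
        pool.submit(task, i)
    with done:
        while remaining > 0:  # predicate re-check handles spurious wakeups
            done.wait()
print("all tasks finished")
```
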
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36159

Test Plan: CI

Differential Revision: D20905411

Pulled By: malfet

fbshipit-source-id: facaf599693649c3f43edafc49f369e90d2f60de
(cherry picked from commit 986a8fdd6a18d9110f8bde59361967139450966b)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Co-authored-by: Nikita Shulga <nshulga@fb.com>
2020-04-09 14:08:57 -07:00
87b6685c6b repr and _*state_dict for qRNN (#31540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31540

Fixes #31468

Test Plan: Imported from OSS

Differential Revision: D19205894

Pulled By: z-a-f

fbshipit-source-id: 80c36f74aa20a125ea8d74a54e9905576f1bc6d7
2020-04-09 12:26:56 -04:00
f746f1b746 Revert "Avoid clone for sparse tensors during accumulation of grads. (#33427)"
This reverts commit b185359fb4ba4dcb0c048fd1d049da23eff88b27.
2020-04-09 11:33:55 -04:00
1379415150 Revert "AccumulateGrad: ensure sparse tensor indices and values refcount is always 1 (#34559)"
This reverts commit 2ce9513b0c8894987f6d42bfb57ff95b22e32c95.
2020-04-09 11:33:55 -04:00
7d638d2596 [v1.5.0] fix is_float_scale_factor warning (python and c++) (#36274)
* fix is_float_scale_factor warning

* fix python impl

Co-authored-by: Robin Lobel <divide@divideconcept.net>
Co-authored-by: Will Feng <willfeng@fb.com>
2020-04-09 11:31:13 -04:00
bad005d331 .circleci: Add binary builds/tests to run on release branches (#36283)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-04-08 16:37:24 -07:00
16d8a52407 [pytorch] Add error when PyTorch used with Python 2 (#36151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36151

Python 2 has reached end-of-life and is no longer supported by PyTorch. To avoid confusing behavior when trying to use PyTorch with Python 2, detect this case early and fail with a clear message.  This commit covers `import torch` only and not C++  for now.
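
A minimal sketch of this kind of early guard (not the exact code added to `torch/__init__.py`):

```python
import sys

if sys.version_info[0] < 3:
    raise Exception(
        "Python 2 has reached end-of-life and is no longer supported by PyTorch."
    )
```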

Test Plan: waitforsandcastle

Reviewed By: dreiss

Differential Revision: D20894381

fbshipit-source-id: a1073b7a648e07cf10cda5a99a2cf4eee5a89230
2020-04-08 18:55:58 -04:00
a33b264588 Revert "Update docs for 1.5 to remove Python 2 references (#36116)"
This reverts commit 63dcd9eccc90136afdfb5d8130077ff1e917ba2e.
2020-04-08 18:51:13 -04:00
3a67e00889 [1.5 cherrypick] C++ Adam optimizer - corrected messages for check of default options (#36245)
* Corrected messages for check of default options

* Added 0<= betas < 1 range check, match python messages for check of betas

Co-authored-by: meganset <meganset@gmail.com>
2020-04-08 18:06:16 -04:00
6bd039551d Remove determine_from from test/run_test.py (#36256)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-04-08 14:58:23 -07:00
b6c3058d61 Exclude torch/csrc/cuda/*nccl* from clang-tidy (#36251)
Since the workflow configures PyTorch with `USE_NCCL` set to 0, we cannot tidy those files.

(cherry picked from commit e172a6ef920b6838b67eb8f0020d78031df8cde5)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Co-authored-by: Nikita Shulga <nshulga@fb.com>
2020-04-08 13:37:16 -07:00
ed908b4fbc [release/1.5] Move all nccl from torch_python to torch_cuda (#36229)
* Remove dead code

`THCPModule_useNccl()` doesn't seem to be used anywhere

* Move all nccl calls from `torch_python` to `torch_cuda`

Because `torch_python` is supposed to be a thin wrapper around torch

This ensures API parity between C++ and Python and reduces the `torch_python` binary size

Co-authored-by: Nikita Shulga <nshulga@fb.com>
2020-04-08 10:39:20 -07:00
b66e0af58b s/repo.continuum.io/repo.anaconda.com/
Followup after  https://github.com/pytorch/pytorch/pull/36201

Per https://github.com/conda/conda/issues/6886  `repo.anaconda.com` should have been used since Feb 2019

Test Plan: CI
2020-04-08 13:05:04 -04:00
bf8a5ede96 [ONNX] fix size for opset 11 (#35984)
Summary:
Fixing size, as the ATen op has been updated to support 0 inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35984

Reviewed By: hl475

Differential Revision: D20858214

Pulled By: houseroad

fbshipit-source-id: 8ad0a0174a569455e89da6798eed403c8b162a47
2020-04-08 11:50:59 -04:00
c2bc5c56c5 Use repo.anaconda.com instead of repo.continuum.io (#36201)
Summary:
Per https://github.com/conda/conda/issues/6886  `repo.anaconda.com` should have been used since Feb 2019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36201

Test Plan: CI

Differential Revision: D20910667

Pulled By: malfet

fbshipit-source-id: 3a191e2cae293e6f96dbb323853e84c07cd7aabc
2020-04-08 08:39:52 -07:00
db3c3ed662 Move test to test_jit_py3.py 2020-04-08 11:15:33 -04:00
9de4770bbd [v1.5.0] Group libraries in TOC and add PyTorch Elastic
Move XLA out of Notes and group with other libraries. Also adds link to PyTorch Elastic.
2020-04-08 11:08:39 -04:00
911a2a6b63 [BugFix] Fix compare_exchange_weak in DispatchStub.h (#35794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35794

### Summary

As PyTorch has been in production on iOS for about a week, we've spotted a few crashes (90 out of 20.3k) related to DispatchStub.h. The main part of the crash log is pasted below (full crash information can be found at `bunnylol logview 1d285dc9172c877b679d0f8539da58f0`):

```
FBCameraFramework void at::native::DispatchStub<void (*)(at::TensorIterator&, c10::Scalar), at::native::add_stub>::operator()<at::TensorIterator&, c10::Scalar&>(c10::DeviceType, at::TensorIterator&, c10::Scalar&)(DispatchStub.h:0)
+FBCameraFramework at::native::add(at::Tensor const&, at::Tensor const&, c10::Scalar)(BinaryOps.cpp:53)
+FBCameraFramework at::CPUType::add_Tensor(at::Tensor const&, at::Tensor const&, c10::Scalar)(CPUType.cpp:55)
+FBCameraFramework at::add(at::Tensor const&, at::Tensor const&, c10::Scalar)(Functions.h:1805)
+FBCameraFramework [inlined] c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::intrusive_ptr(c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>&&)(intrusive_ptr.h:0)
+FBCameraFramework [inlined] c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::intrusive_ptr(c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>&&)(intrusive_ptr.h:221)
+FBCameraFramework [inlined] at::Tensor::Tensor(at::Tensor&&)(TensorBody.h:93)
+FBCameraFramework [inlined] at::Tensor::Tensor(at::Tensor&&)(TensorBody.h:93)
+FBCameraFramework c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >::operator()(at::Tensor, at::Tensor, c10::Scalar)(kernel_lambda.h:23)
+FBCameraFramework [inlined] c10::guts::infer_function_traits<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> > >::type::return_type c10::detail::call_functor_with_args_from_stack_<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >, false, 0ul, 1ul, 2ul>(c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*, std::__1::vector<c10::IValue, c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*::allocator<std::__1::vector> >*, c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*::integer_sequence<unsigned long, 0ul, 1ul, 2ul>)(kernel_functor.h:210)
+FBCameraFramework [inlined] c10::guts::infer_function_traits<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> > >::type::return_type c10::detail::call_functor_with_args_from_stack<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >, false>(c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*, std::__1::vector<c10::IValue, c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >*::allocator<std::__1::vector> >*)(kernel_functor.h:218)
+FBCameraFramework c10::detail::make_boxed_from_unboxed_functor<c10::detail::WrapRuntimeKernelFunctor_<(anonymous namespace)::$_3, at::Tensor, c10::guts::typelist::typelist<at::Tensor, at::Tensor, c10::Scalar> >, false, void>::call(c10::OperatorKernel*, c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >*)(kernel_functor.h:250)
+FBCameraFramework [inlined] (anonymous namespace)::variable_fallback_kernel(c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >*)(VariableFallbackKernel.cpp:32)
+FBCameraFramework void c10::KernelFunction::make_boxed_function<&((anonymous namespace)::variable_fallback_kernel(c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >*))>(c10::OperatorKernel*, c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >*)(KernelFunction_impl.h:21)
+FBCameraFramework torch::jit::mobile::InterpreterState::run(std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >&)(interpreter.cpp:0)
+FBCameraFramework torch::jit::mobile::Function::run(std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >&) const(function.cpp:59)
+FBCameraFramework torch::jit::mobile::Module::run_method(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >)(module.cpp:51)
+FBCameraFramework [inlined] torch::jit::mobile::Module::forward(std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >)(module.h:28)
```
The problem is that `compare_exchange_weak` is not guaranteed to succeed in one shot, as described in [C++ Concurrency in Action (2nd Edition)](https://livebook.manning.com/book/c-plus-plus-concurrency-in-action-second-edition/chapter-5/79). This might result in `cpu_dispatch_ptr` being a null pointer in concurrent situations, thus leading to the crash. As suggested in the book, because of spurious failures, `compare_exchange_weak` is typically used in a loop. There is also a [Stack Overflow discussion](https://stackoverflow.com/questions/25199838/understanding-stdatomiccompare-exchange-weak-in-c11) about this. Feel free to drop comments below if there is a better option.

### The original PR

- [Enhance DispatchStub to be thread safe from a TSAN point of view](https://github.com/pytorch/pytorch/pull/32148)

### Test Plan

- Keep observing the crash reports in QE

Test Plan: Imported from OSS

Differential Revision: D20808751

Pulled By: xta0

fbshipit-source-id: 52f5c865b70c59b332ef9f0865315e76d97f6eaa
2020-04-08 10:56:07 -04:00
60375bcfdf [1.5.0] Attempt to fix the pytorch_cpp_doc_push build by pinning breathe. 2020-04-08 10:54:56 -04:00
63dcd9eccc Update docs for 1.5 to remove Python 2 references (#36116) 2020-04-07 16:03:44 -07:00
e8236d2ed4 fix max_pool2d cuda version Dimension out of range issue(#36046) (#36095)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36095

Test Plan: Imported from OSS

Differential Revision: D20876733

Pulled By: glaringlee

fbshipit-source-id: a2b92fd2dd0254c5443af469e3fb2faa2323e5c9
2020-04-07 18:52:21 -04:00
0058b1bb7e [1.5 cherrypick][JIT] Fix fake_range() 2020-04-07 18:47:22 -04:00
419283e291 Improve C++ API autograd and indexing docs (#35777)
Summary:
This PR adds docs for the following components:
1. Tensor autograd APIs (such as `is_leaf` / `backward` / `detach` / `detach_` / `retain_grad` / `grad` / `register_hook` / `remove_hook`)
2. Autograd APIs: `torch::autograd::backward` / `grad` / `Function` / `AutogradContext`, `torch::NoGradGuard` / `torch::AutoGradMode`
3. Tensor indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35777

Differential Revision: D20810616

Pulled By: yf225

fbshipit-source-id: 60526ec0c5b051021901d89bc3b56861c68758e8
2020-04-07 18:37:27 -04:00
0e6f6ba218 [pytorch] Remove python2 support from tests and torch.jit (#35042) (#36162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35042

Removing python2 tests and some compat code in torch.jit. Check if dependent projects and external tests have any issues after these changes.

Test Plan: waitforsandcastle

Reviewed By: suo, seemethere

Differential Revision: D18942633

fbshipit-source-id: d76cc41ff20bee147dd8d44d70563c10d8a95a35
(cherry picked from commit 8240db11e193b0334a60a33d9fc907ebc6ba6987)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Co-authored-by: Orion Reblitz-Richardson <orionr@fb.com>
2020-04-07 13:55:50 -07:00
ec8dbaf920 Add more alternative filters in places people forgot to add them. (#36082) (#36148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36082

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20874618

Pulled By: ezyang

fbshipit-source-id: b6f12100a247564428eb7272f803a03c9cad3a97
(cherry picked from commit 449a4ca3408774ed961f1702ca31a549f5818b80)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Co-authored-by: Edward Yang <ezyang@fb.com>
2020-04-07 09:59:33 -07:00
7e168d134f Pin Sphinx to 2.4.4 (take 2), fix docs CIs (#36072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36072

Update to https://github.com/pytorch/pytorch/pull/36065/ which was
almost there

Test Plan: - Wait for CI

Differential Revision: D20871661

Pulled By: zou3519

fbshipit-source-id: 2bf5ce382e879aafd232700ff1c0d61fc17ea52d
2020-04-07 10:54:36 -04:00
6daae58871 Remove __nv_relfatbin section from nccl_static library (#35907)
Test Plan: CI

(cherry picked from commit 04e06b419990328157f0e2108a95b2848f66d75f)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Co-authored-by: Nikita Shulga <nshulga@fb.com>
2020-04-06 16:57:03 -07:00
fee0ff1bf6 May fix TopKTypeConfig<at::Half> without an additional Bitfield specialization 2020-04-06 19:41:17 -04:00
deaf3b65cf Compile THCTensorTopK per dtype.
ROCm builds fail inconsistently on this file by timing out.

ghstack-source-id: 4a8f22731aa82c02d464a8cba522e856afbe49b8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36074
2020-04-06 19:41:17 -04:00
dca9c2501d Revert "Revert "Fix handling of non-finite values in topk (#35253)" (#35582)"
This reverts commit dacdbc22d195f80e0b529b4e9111c8ca9a172914.
2020-04-06 19:41:17 -04:00
842cd47416 Refactor and turn on C++ API parity test in CI
gh-metadata: pytorch pytorch 35190 gh/yf225/106/head
2020-04-06 15:40:35 -04:00
a30b49085c Move NewModuleTest and NewCriterionTest from test_nn.py to common_nn.py (#35189)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35189

Test Plan: Imported from OSS

Differential Revision: D20588197

Pulled By: yf225

fbshipit-source-id: 5a28159b653895678c250cbc0c1ddd51bc7a3123
2020-04-06 15:40:35 -04:00
82626f8ad9 More generic dedupe MKL fix (#35966)
* Stop linking against MKL

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Perform test for build size

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* fixup

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* One more MSVC fix

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Revert "Perform test for build size"

This reverts commit 8b5ed8eac81cc880b5cedb33cb3b86f584abacb7.
2020-04-06 11:50:48 -07:00
27fddfda4f Use std::abs instead of abs in lbfgs.cpp (#35974)
Summary:
This supersedes https://github.com/pytorch/pytorch/pull/35698.

`abs` is a C-style function that takes only an integral argument;
`std::abs` is polymorphic and can be applied to both integral and floating-point types.

This PR also increases `kBatchSize` in `test_optimizer_xor` function in `test/cpp/api/optim.cpp` to fix `OptimTest.XORConvergence_LBFGS` failure under ASAN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35974

Test Plan: CI

Reviewed By: pbelevich

Differential Revision: D20853570

Pulled By: yf225

fbshipit-source-id: 6135588df2426c5b974e4e097b416955d1907bd4
2020-04-06 14:50:18 -04:00
7ecf6a1c10 [release/1.5] Bump libtorch to 3.7, remove python2 (#36080)
* .cirlceci: Remove Python 2.7 builds, switch libtorch to 3.7

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

* .circleci: Bump libtorch builds to 3.7

The image is actually using Python 3.7.2 so we should reflect that
within our circleci configs

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
(cherry picked from commit b3f2572aaf83d1f5383369187f6263e6f926103b)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-04-06 11:10:48 -07:00
beb07a44c4 Ports integer division callsite cleanup 2020-04-02 20:17:31 -04:00
a01c3bd1fe [BC] Fix the BC test for 1.5 (#35733)
* [BC] Fix the BC test for 1.5

* Skip RRef

* Skip more

* Skip more

* Fix whitelist

* Fix whitelist
2020-04-02 19:36:18 -04:00
ffd010f8a0 Make test_leaky_relu_inplace_with_neg_slope device-generic and skipIfRocm. (#35816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35816

Fixes https://github.com/pytorch/pytorch/issues/35689.

Test Plan: Imported from OSS

Differential Revision: D20796656

Pulled By: gchanan

fbshipit-source-id: 474790fe07899d9944644f6b3d7a15db1c2b96db
2020-04-02 17:05:23 -04:00
8ad59f03a8 Skip ROCm test in test/test_cpp_extensions_aot.py (#35838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35838

It may be flaky.

Test Plan: Imported from OSS

Differential Revision: D20807409

Pulled By: gchanan

fbshipit-source-id: f085d05bcb6a04d304f3cd048c38d2e8453125d6
2020-04-02 17:04:54 -04:00
ed3640df68 Fix another case of float2::x and float2::y may not be the same on ROCm (#35785)
Summary:
This is another case of the issue fixed in https://github.com/pytorch/pytorch/pull/35783. Mirroring 35786.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35785

Differential Revision: D20800317

Pulled By: ezyang

fbshipit-source-id: de5f32839755d5ff5aefff8408df69adbab4d0a1
2020-04-02 17:01:27 -04:00
fb88942f6c Fix typo 2020-04-02 13:53:13 -04:00
5d05c51887 Refactored rpc docs (#35109)
Summary:
Reorganize as per jlin27's comments. Screenshots added in comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35109

Differential Revision: D20788774

Pulled By: rohan-varma

fbshipit-source-id: 7d64be70ef76ed6ff303d05d39c338293c234766
2020-04-02 13:53:13 -04:00
df5986fbf3 [1.5 Release] Disabled complex tensor construction (#35579)
* disabled complex tensor construction

* minor

* doc fix

* added docs back and updated complex dtype check

* removed test_complex.py

* removed complexfloat reg test

* debug
2020-04-01 11:11:05 -04:00
165403f614 [v1.5.0] float2::x and float2::y may not be the same as float on ROCm (#35593)
Summary:
This causes ambiguity and can be triggered sometimes (e.g., by https://github.com/pytorch/pytorch/issues/35217). Explicitly convert them to float.

    error: conditional expression is ambiguous; 'const
    hip_impl::Scalar_accessor<float, Native_vec_, 0>' can be converted to
    'float' and vice versa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35593

Differential Revision: D20735663

Pulled By: ezyang

fbshipit-source-id: ae6a38a08e59821bae13eb0b9f9bdf21a008d5c0
2020-03-31 19:58:40 -04:00
fbf18c34ff ports disabling imag 2020-03-31 18:55:45 -04:00
84f806c821 ports real and imag fixes 2020-03-31 13:34:39 -04:00
94139a7d95 Add warnings that amp is incomplete in 1.5 2020-03-31 10:49:45 -04:00
75e36186b2 [v1.5.0] Fix Caffe2 mobile compilation
Ports #35288
2020-03-30 17:17:59 -04:00
f4a0b406dd Warn a known autograd issue on XLA backend. 2020-03-30 17:16:39 -04:00
e884e720f0 [Windows] make torch_cuda's forced link also work for CMake
Was only working for ninja
2020-03-30 17:13:51 -04:00
198 changed files with 6801 additions and 3562 deletions

View File

@ -466,7 +466,7 @@ But if you want to try, then I'd recommend
# Always install miniconda 3, even if building for Python <3
new_conda="~/my_new_conda"
conda_sh="$new_conda/install_miniconda.sh"
curl -o "$conda_sh" https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
curl -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x "$conda_sh"
"$conda_sh" -b -p "$MINICONDA_ROOT"
rm -f "$conda_sh"

View File

@ -41,7 +41,7 @@ LINUX_PACKAGE_VARIANTS = OrderedDict(
],
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"2.7m",
"3.7m",
],
)
@ -51,11 +51,21 @@ CONFIG_TREE_DATA = OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"2.7",
"3.7",
],
)),
windows=(dimensions.CUDA_VERSIONS, OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"3.7",
],
)),
)
CONFIG_TREE_DATA_NO_WINDOWS = CONFIG_TREE_DATA.copy()
CONFIG_TREE_DATA_NO_WINDOWS.pop("windows")
# GCC config variants:
#
# All the nightlies (except libtorch with new gcc ABI) are built with devtoolset7,
@ -72,6 +82,11 @@ LINUX_GCC_CONFIG_VARIANTS = OrderedDict(
],
)
WINDOWS_LIBTORCH_CONFIG_VARIANTS = [
"debug",
"release",
]
class TopLevelNode(ConfigNode):
def __init__(self, node_name, config_tree_data, smoke):
@ -106,6 +121,8 @@ class PackageFormatConfigNode(ConfigNode):
def get_children(self):
if self.find_prop("os_name") == "linux":
return [LinuxGccConfigNode(self, v) for v in LINUX_GCC_CONFIG_VARIANTS[self.find_prop("package_format")]]
elif self.find_prop("os_name") == "windows" and self.find_prop("package_format") == "libtorch":
return [WindowsLibtorchConfigNode(self, v) for v in WINDOWS_LIBTORCH_CONFIG_VARIANTS]
else:
return [ArchConfigNode(self, v) for v in self.find_prop("cuda_versions")]
@ -127,6 +144,16 @@ class LinuxGccConfigNode(ConfigNode):
return [ArchConfigNode(self, v) for v in cuda_versions]
class WindowsLibtorchConfigNode(ConfigNode):
def __init__(self, parent, libtorch_config_variant):
super(WindowsLibtorchConfigNode, self).__init__(parent, "LIBTORCH_CONFIG_VARIANT=" + str(libtorch_config_variant))
self.props["libtorch_config_variant"] = libtorch_config_variant
def get_children(self):
return [ArchConfigNode(self, v) for v in self.find_prop("cuda_versions")]
class ArchConfigNode(ConfigNode):
def __init__(self, parent, cu):
super(ArchConfigNode, self).__init__(parent, get_processor_arch_name(cu))

View File

@ -6,7 +6,7 @@ import cimodel.lib.miniutils as miniutils
class Conf(object):
def __init__(self, os, cuda_version, pydistro, parms, smoke, libtorch_variant, gcc_config_variant):
def __init__(self, os, cuda_version, pydistro, parms, smoke, libtorch_variant, gcc_config_variant, libtorch_config_variant):
self.os = os
self.cuda_version = cuda_version
@ -15,11 +15,14 @@ class Conf(object):
self.smoke = smoke
self.libtorch_variant = libtorch_variant
self.gcc_config_variant = gcc_config_variant
self.libtorch_config_variant = libtorch_config_variant
def gen_build_env_parms(self):
elems = [self.pydistro] + self.parms + [binary_build_data.get_processor_arch_name(self.cuda_version)]
if self.gcc_config_variant is not None:
elems.append(str(self.gcc_config_variant))
if self.libtorch_config_variant is not None:
elems.append(str(self.libtorch_config_variant))
return elems
def gen_docker_image(self):
@ -67,9 +70,14 @@ class Conf(object):
job_def["requires"].append("update_s3_htmls_for_nightlies_devtoolset7")
job_def["filters"] = {"branches": {"only": "postnightly"}}
else:
filter_branches = ["nightly"]
# we only want to add the release branch filter if we aren't
# uploading
if phase not in ["upload"]:
filter_branches.append(r"/release\/.*/")
job_def["filters"] = {
"branches": {
"only": "nightly"
"only": filter_branches
},
# Will run on tags like v1.5.0-rc1, etc.
"tags": {
@ -105,11 +113,18 @@ class Conf(object):
def get_root(smoke, name):
return binary_build_data.TopLevelNode(
name,
binary_build_data.CONFIG_TREE_DATA,
smoke,
)
if smoke:
return binary_build_data.TopLevelNode(
name,
binary_build_data.CONFIG_TREE_DATA_NO_WINDOWS,
smoke,
)
else:
return binary_build_data.TopLevelNode(
name,
binary_build_data.CONFIG_TREE_DATA,
smoke,
)
def gen_build_env_list(smoke):
@ -127,6 +142,7 @@ def gen_build_env_list(smoke):
c.find_prop("smoke"),
c.find_prop("libtorch_variant"),
c.find_prop("gcc_config_variant"),
c.find_prop("libtorch_config_variant"),
)
newlist.append(conf)

View File

@ -4,7 +4,6 @@ from cimodel.lib.conf_tree import Ver
CONFIG_TREE_DATA = [
(Ver("ubuntu", "16.04"), [
([Ver("gcc", "5")], [XImportant("onnx_py2")]),
([Ver("clang", "7")], [XImportant("onnx_main_py3.6"),
XImportant("onnx_ort1_py3.6"),
XImportant("onnx_ort2_py3.6")]),

View File

@ -33,8 +33,7 @@ class Conf:
# TODO: Eventually we can probably just remove the cudnn7 everywhere.
def get_cudnn_insertion(self):
omit = self.language == "onnx_py2" \
or self.language == "onnx_main_py3.6" \
omit = self.language == "onnx_main_py3.6" \
or self.language == "onnx_ort1_py3.6" \
or self.language == "onnx_ort2_py3.6" \
or set(self.compiler_names).intersection({"android", "mkl", "clang"}) \
@ -71,11 +70,10 @@ class Conf:
def gen_docker_image(self):
lang_substitutions = {
"onnx_py2": "py2",
"onnx_main_py3.6": "py3.6",
"onnx_ort1_py3.6": "py3.6",
"onnx_ort2_py3.6": "py3.6",
"cmake": "py2",
"cmake": "py3",
}
lang = miniutils.override(self.language, lang_substitutions)
@ -85,7 +83,7 @@ class Conf:
def gen_workflow_params(self, phase):
parameters = OrderedDict()
lang_substitutions = {
"onnx_py2": "onnx-py2",
"onnx_py3": "onnx-py3",
"onnx_main_py3.6": "onnx-main-py3.6",
"onnx_ort1_py3.6": "onnx-ort1-py3.6",
"onnx_ort2_py3.6": "onnx-ort2-py3.6",
@ -129,7 +127,7 @@ class Conf:
job_name = "caffe2_" + self.get_platform() + "_build"
if not self.is_important:
job_def["filters"] = {"branches": {"only": ["master", r"/ci-all\/.*/"]}}
job_def["filters"] = {"branches": {"only": ["master", r"/ci-all\/.*/", r"/release\/.*/"]}}
job_def.update(self.gen_workflow_params(phase))
return {job_name : job_def}

View File

@ -8,7 +8,6 @@ CUDA_VERSIONS = [
]
STANDARD_PYTHON_VERSIONS = [
"2.7",
"3.5",
"3.6",
"3.7",

View File

@ -114,7 +114,7 @@ class Conf:
if not self.is_important:
# If you update this, update
# caffe2_build_definitions.py too
job_def["filters"] = {"branches": {"only": ["master", r"/ci-all\/.*/"]}}
job_def["filters"] = {"branches": {"only": ["master", r"/ci-all\/.*/", r"/release\/.*/"]}}
job_def.update(self.gen_workflow_params(phase))
return {job_name : job_def}

File diff suppressed because it is too large

View File

@ -4,7 +4,7 @@ set -ex
# Optionally install conda
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.continuum.io/miniconda"
BASE_URL="https://repo.anaconda.com/miniconda"
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)

View File

@ -10,6 +10,11 @@ retry () {
if [[ "$(uname)" == Darwin ]]; then
# macos executor (builds and tests)
workdir="/Users/distiller/project"
elif [[ "$OSTYPE" == "msys" ]]; then
# windows executor (builds and tests)
rm -rf /c/w
ln -s "/c/Users/circleci/project" /c/w
workdir="/c/w"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
workdir="/home/circleci/project"
@ -19,8 +24,14 @@ else
fi
# It is very important that this stays in sync with binary_populate_env.sh
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
if [[ "$OSTYPE" == "msys" ]]; then
# We need to make the paths as short as possible on Windows
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
# Clone the Pytorch branch
retry git clone https://github.com/pytorch/pytorch.git "$PYTORCH_ROOT"

View File

@ -31,9 +31,9 @@ fi
conda_sh="$workdir/install_miniconda.sh"
if [[ "$(uname)" == Darwin ]]; then
curl --retry 3 -o "$conda_sh" https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
curl --retry 3 -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
else
curl --retry 3 -o "$conda_sh" https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
curl --retry 3 -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
fi
chmod +x "$conda_sh"
"$conda_sh" -b -p "$MINICONDA_ROOT"

View File

@ -5,7 +5,11 @@ export TZ=UTC
tagged_version() {
# Grabs version from either the env variable CIRCLE_TAG
# or the pytorch git described version
GIT_DESCRIBE="git --git-dir ${workdir}/pytorch/.git describe"
if [[ "$OSTYPE" == "msys" ]]; then
GIT_DESCRIBE="git --git-dir ${workdir}/p/.git describe"
else
GIT_DESCRIBE="git --git-dir ${workdir}/pytorch/.git describe"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
echo "${CIRCLE_TAG}"
elif ${GIT_DESCRIBE} --exact --tags >/dev/null; then
@ -20,6 +24,9 @@ tagged_version() {
if [[ "$(uname)" == Darwin ]]; then
# macos executor (builds and tests)
workdir="/Users/distiller/project"
elif [[ "$OSTYPE" == "msys" ]]; then
# windows executor (builds and tests)
workdir="/c/w"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
workdir="/home/circleci/project"
@ -36,7 +43,15 @@ configs=($BUILD_ENVIRONMENT)
export PACKAGE_TYPE="${configs[0]}"
export DESIRED_PYTHON="${configs[1]}"
export DESIRED_CUDA="${configs[2]}"
export DESIRED_DEVTOOLSET="${configs[3]:-}"
if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
export DESIRED_DEVTOOLSET=""
export LIBTORCH_CONFIG="${configs[3]:-}"
if [[ "$LIBTORCH_CONFIG" == 'debug' ]]; then
export DEBUG=1
fi
else
export DESIRED_DEVTOOLSET="${configs[3]:-}"
fi
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
export BUILD_PYTHONLESS=1
fi
@ -109,6 +124,10 @@ export DESIRED_CUDA="$DESIRED_CUDA"
export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"
export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"
export DESIRED_DEVTOOLSET="$DESIRED_DEVTOOLSET"
if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"
export DEBUG="${DEBUG:-}"
fi
export DATE="$DATE"
export NIGHTLIES_DATE_PREAMBLE=1.5.0.dev
@ -128,8 +147,13 @@ export DOCKER_IMAGE="$DOCKER_IMAGE"
export workdir="$workdir"
export MAC_PACKAGE_WORK_DIR="$workdir"
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
if [[ "$OSTYPE" == "msys" ]]; then
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
export MINICONDA_ROOT="$workdir/miniconda"
export PYTORCH_FINAL_PACKAGE_DIR="$workdir/final_pkgs"

View File

@ -0,0 +1,33 @@
#!/bin/bash
set -eux -o pipefail
source "/c/w/env"
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export VC_YEAR=2017
export USE_SCCACHE=1
export SCCACHE_BUCKET=ossci-compiler-cache-windows
export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT"
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}
set -x
if [[ "$CIRCLECI" == 'true' && -d "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019" ]]; then
rm -rf "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019"
fi
echo "Free space on filesystem before build:"
df -h
pushd "$BUILDER_ROOT"
if [[ "$PACKAGE_TYPE" == 'conda' ]]; then
./windows/internal/build_conda.bat
elif [[ "$PACKAGE_TYPE" == 'wheel' || "$PACKAGE_TYPE" == 'libtorch' ]]; then
./windows/internal/build_wheels.bat
fi
echo "Free space on filesystem after build:"
df -h

View File

@ -0,0 +1,37 @@
#!/bin/bash
set -eu -o pipefail
set +x
declare -x "AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
declare -x "AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
set -eux -o pipefail
source "/env"
# This gets set in binary_populate_env.sh, but let's have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly/}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if there
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
pushd /root/workspace/final_pkgs
# Upload the package to the final location
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry conda install -c conda-forge -yq awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
else
retry conda install -c conda-forge -yq awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
fi

View File

@ -72,10 +72,10 @@ time python tools/setup_helpers/generate_code.py \
# Build the docs
pushd docs/cpp
pip install breathe>=4.13.0 bs4 lxml six
pip install breathe==4.13.0 bs4 lxml six
pip install --no-cache-dir -e "git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme"
pip install exhale>=0.2.1
pip install sphinx>=2.0
pip install sphinx==2.4.4
# Uncomment once it is fixed
# pip install -r requirements.txt
time make VERBOSE=1 html -j

View File

@ -52,3 +52,12 @@ binary_mac_params: &binary_mac_params
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
binary_windows_params: &binary_windows_params
parameters:
build_environment:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
BUILD_FOR_SYSTEM: windows

View File

@ -275,3 +275,46 @@
script="/Users/distiller/project/.circleci/scripts/binary_ios_upload.sh"
cat "$script"
source "$script"
binary_windows_build:
<<: *binary_windows_params
executor:
name: windows-cpu-with-nvidia-cuda
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- attach_scripts
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Build
no_output_timeout: "1h"
command: |
set -eux -o pipefail
script="/c/w/p/.circleci/scripts/binary_windows_build.sh"
cat "$script"
source "$script"
- persist_to_workspace:
root: "C:/w"
paths: final_pkgs
binary_windows_upload:
<<: *binary_windows_params
docker:
- image: continuumio/miniconda
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- attach_scripts
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Upload
no_output_timeout: "10m"
command: |
set -eux -o pipefail
script="/pytorch/.circleci/scripts/binary_windows_upload.sh"
cat "$script"
source "$script"

View File

@ -151,7 +151,7 @@
# Install Anaconda if we need to
if [ -n "${CAFFE2_USE_ANACONDA}" ]; then
rm -rf ${TMPDIR}/anaconda
curl --retry 3 -o ${TMPDIR}/conda.sh https://repo.continuum.io/miniconda/Miniconda${ANACONDA_VERSION}-latest-MacOSX-x86_64.sh
curl --retry 3 -o ${TMPDIR}/conda.sh https://repo.anaconda.com/miniconda/Miniconda${ANACONDA_VERSION}-latest-MacOSX-x86_64.sh
chmod +x ${TMPDIR}/conda.sh
/bin/bash ${TMPDIR}/conda.sh -b -p ${TMPDIR}/anaconda
rm -f ${TMPDIR}/conda.sh

View File

@ -15,6 +15,7 @@
only:
- master
- /ci-all\/.*/
- /release\/.*/
- pytorch_windows_test:
name: pytorch_windows_vs2017_14.11_py36_cuda10.1_test1
test_name: pytorch-windows-test1
@ -32,6 +33,7 @@
only:
- master
- /ci-all\/.*/
- /release\/.*/
- pytorch_windows_test:
name: pytorch_windows_vs2017_14.11_py36_cuda10.1_test2
test_name: pytorch-windows-test2
@ -49,6 +51,7 @@
only:
- master
- /ci-all\/.*/
- /release\/.*/
- pytorch_windows_build:
name: pytorch_windows_vs2017_14.16_py36_cuda10.1_build
cuda_version: "10"
@ -64,6 +67,7 @@
only:
- master
- /ci-all\/.*/
- /release\/.*/
- pytorch_windows_test:
name: pytorch_windows_vs2017_14.16_py36_cuda10.1_test1
test_name: pytorch-windows-test1
@ -81,6 +85,7 @@
only:
- master
- /ci-all\/.*/
- /release\/.*/
- pytorch_windows_test:
name: pytorch_windows_vs2017_14.16_py36_cuda10.1_test2
test_name: pytorch-windows-test2
@ -98,6 +103,7 @@
only:
- master
- /ci-all\/.*/
- /release\/.*/
- pytorch_windows_build:
name: pytorch_windows_vs2019_py36_cuda10.1_build
cuda_version: "10"

View File

@ -7,12 +7,6 @@
# pytorch-ci-hud to adjust the list of whitelisted builds
# at https://github.com/ezyang/pytorch-ci-hud/blob/master/src/BuildHistoryDisplay.js
- binary_linux_build:
name: binary_linux_manywheel_2_7mu_cpu_devtoolset7_build
build_environment: "manywheel 2.7mu cpu devtoolset7"
requires:
- setup
docker_image: "pytorch/manylinux-cuda102"
- binary_linux_build:
name: binary_linux_manywheel_3_7m_cu102_devtoolset7_build
build_environment: "manywheel 3.7m cu102 devtoolset7"
@ -23,24 +17,21 @@
branches:
only:
- master
- binary_linux_build:
name: binary_linux_conda_2_7_cpu_devtoolset7_build
build_environment: "conda 2.7 cpu devtoolset7"
requires:
- setup
docker_image: "pytorch/conda-cuda"
- /ci-all\/.*/
- /release\/.*/
# This binary build is currently broken, see https://github.com/pytorch/pytorch/issues/16710
# - binary_linux_conda_3_6_cu90_devtoolset7_build
# TODO rename to remove python version for libtorch
- binary_linux_build:
name: binary_linux_libtorch_2_7m_cpu_devtoolset7_shared-with-deps_build
build_environment: "libtorch 2.7m cpu devtoolset7"
name: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build
build_environment: "libtorch 3.7m cpu devtoolset7"
requires:
- setup
libtorch_variant: "shared-with-deps"
docker_image: "pytorch/manylinux-cuda102"
- binary_linux_build:
name: binary_linux_libtorch_2_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build
build_environment: "libtorch 2.7m cpu gcc5.4_cxx11-abi"
name: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build
build_environment: "libtorch 3.7m cpu gcc5.4_cxx11-abi"
requires:
- setup
libtorch_variant: "shared-with-deps"
@ -48,45 +39,51 @@
# TODO we should test a libtorch cuda build, but they take too long
# - binary_linux_libtorch_2_7m_cu90_devtoolset7_static-without-deps_build
- binary_mac_build:
name: binary_macos_wheel_3_6_cpu_build
build_environment: "wheel 3.6 cpu"
requires:
- setup
filters:
branches:
only:
- master
- binary_mac_build:
name: binary_macos_conda_2_7_cpu_build
build_environment: "conda 2.7 cpu"
name: binary_macos_wheel_3_7_cpu_build
build_environment: "wheel 3.7 cpu"
requires:
- setup
filters:
branches:
only:
- master
- /ci-all\/.*/
- /release\/.*/
# This job has an average run time of 3 hours o.O
# Now only running this on master to reduce overhead
# TODO rename to remove python version for libtorch
- binary_mac_build:
name: binary_macos_libtorch_2_7_cpu_build
build_environment: "libtorch 2.7 cpu"
name: binary_macos_libtorch_3_7_cpu_build
build_environment: "libtorch 3.7 cpu"
requires:
- setup
filters:
branches:
only:
- master
- binary_linux_test:
name: binary_linux_manywheel_2_7mu_cpu_devtoolset7_test
build_environment: "manywheel 2.7mu cpu devtoolset7"
- /ci-all\/.*/
- /release\/.*/
- binary_windows_build:
name: binary_windows_libtorch_3_7_cpu_debug_build
build_environment: "libtorch 3.7 cpu debug"
requires:
- setup
- binary_windows_build:
name: binary_windows_libtorch_3_7_cpu_release_build
build_environment: "libtorch 3.7 cpu release"
requires:
- setup
- binary_windows_build:
name: binary_windows_wheel_3_7_cu102_build
build_environment: "wheel 3.7 cu102"
requires:
- setup
- binary_linux_manywheel_2_7mu_cpu_devtoolset7_build
docker_image: "pytorch/manylinux-cuda102"
filters:
branches:
only:
- master
- /ci-all\/.*/
- /release\/.*/
- binary_linux_test:
name: binary_linux_manywheel_3_7m_cu102_devtoolset7_test
build_environment: "manywheel 3.7m cu102 devtoolset7"
@ -100,29 +97,25 @@
branches:
only:
- master
- binary_linux_test:
name: binary_linux_conda_2_7_cpu_devtoolset7_test
build_environment: "conda 2.7 cpu devtoolset7"
requires:
- setup
- binary_linux_conda_2_7_cpu_devtoolset7_build
docker_image: "pytorch/conda-cuda"
- /ci-all\/.*/
- /release\/.*/
# This binary build is currently broken, see https://github.com/pytorch/pytorch/issues/16710
# - binary_linux_conda_3_6_cu90_devtoolset7_test:
# TODO rename to remove python version for libtorch
- binary_linux_test:
name: binary_linux_libtorch_2_7m_cpu_devtoolset7_shared-with-deps_test
build_environment: "libtorch 2.7m cpu devtoolset7"
name: binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test
build_environment: "libtorch 3.7m cpu devtoolset7"
requires:
- setup
- binary_linux_libtorch_2_7m_cpu_devtoolset7_shared-with-deps_build
- binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build
libtorch_variant: "shared-with-deps"
docker_image: "pytorch/manylinux-cuda102"
- binary_linux_test:
name: binary_linux_libtorch_2_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test
build_environment: "libtorch 2.7m cpu gcc5.4_cxx11-abi"
name: binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test
build_environment: "libtorch 3.7m cpu gcc5.4_cxx11-abi"
requires:
- setup
- binary_linux_libtorch_2_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build
- binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_build
libtorch_variant: "shared-with-deps"
docker_image: "pytorch/pytorch-binary-docker-image-ubuntu16.04:latest"

View File

@ -20,21 +20,12 @@
- docker_build_job:
name: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
image_name: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
- docker_build_job:
name: "pytorch-linux-xenial-cuda9-cudnn7-py2"
image_name: "pytorch-linux-xenial-cuda9-cudnn7-py2"
- docker_build_job:
name: "pytorch-linux-xenial-cuda9-cudnn7-py3"
image_name: "pytorch-linux-xenial-cuda9-cudnn7-py3"
- docker_build_job:
name: "pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7"
image_name: "pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7"
- docker_build_job:
name: "pytorch-linux-xenial-py2.7.9"
image_name: "pytorch-linux-xenial-py2.7.9"
- docker_build_job:
name: "pytorch-linux-xenial-py2.7"
image_name: "pytorch-linux-xenial-py2.7"
- docker_build_job:
name: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
image_name: "pytorch-linux-xenial-py3-clang5-android-ndk-r19c"

View File

@ -4,6 +4,8 @@
branches:
only:
- master
- /ci-all\/.*/
- /release\/.*/
requires:
- pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build
@ -13,6 +15,8 @@
branches:
only:
- master
- /ci-all\/.*/
- /release\/.*/
requires:
- pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build
- pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_64_build

View File

@ -7,10 +7,10 @@
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:f990c76a-a798-42bb-852f-5be5006f8026"
resource_class: large
- pytorch_linux_test:
name: pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test
name: pytorch_linux_xenial_py3_6_gcc5_4_ge_config_profiling_test
requires:
- setup
- pytorch_linux_xenial_py3_6_gcc5_4_build
build_environment: "pytorch-linux-xenial-py3.6-gcc5.4-ge_config_simple-test"
build_environment: "pytorch-linux-xenial-py3.6-gcc5.4-ge_config_profiling-test"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:f990c76a-a798-42bb-852f-5be5006f8026"
resource_class: large

View File

@ -31,6 +31,7 @@
only:
- master
- /ci-all\/.*/
- /release\/.*/
build_environment: "pytorch-linux-xenial-py3-clang5-mobile-code-analysis"
build_only: "1"
# Use LLVM-DEV toolchain in android-ndk-r19c docker image

View File

@ -67,7 +67,7 @@ jobs:
- name: Run flake8
run: |
set -eux
pip install flake8 flake8-mypy flake8-bugbear flake8-comprehensions flake8-executable flake8-pyi mccabe pycodestyle pyflakes
pip install flake8==3.7.9 flake8-mypy flake8-bugbear flake8-comprehensions flake8-executable flake8-pyi mccabe pycodestyle==2.5.0 pyflakes==2.1.1
flake8 --version
flake8 --exit-zero > ${GITHUB_WORKSPACE}/flake8-output.txt
cat ${GITHUB_WORKSPACE}/flake8-output.txt
@ -81,44 +81,6 @@ jobs:
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
flake8-py2:
runs-on: ubuntu-latest
steps:
- name: Setup Python
uses: actions/setup-python@v1
with:
python-version: 2.x
architecture: x64
- name: Fetch PyTorch
uses: actions/checkout@v1
- name: Checkout PR tip
run: |
set -eux
if [[ "${{ github.event_name }}" == "pull_request" ]]; then
# We are on a PR, so actions/checkout leaves us on a merge commit.
# Check out the actual tip of the branch.
git checkout ${{ github.event.pull_request.head.sha }}
fi
echo ::set-output name=commit_sha::$(git rev-parse HEAD)
id: get_pr_tip
- name: Run flake8
run: |
set -eux
pip install flake8
rm -rf .circleci tools/clang_format_new.py
flake8 --exit-zero > ${GITHUB_WORKSPACE}/flake8-output.txt
cat ${GITHUB_WORKSPACE}/flake8-output.txt
- name: Add annotations
uses: pytorch/add-annotations-github-action@master
with:
check_name: 'flake8-py2'
linter_output_path: 'flake8-output.txt'
commit_sha: ${{ steps.get_pr_tip.outputs.commit_sha }}
regex: '^(?<filename>.*?):(?<lineNumber>\d+):(?<columnNumber>\d+): (?<errorCode>\w\d+) (?<errorDesc>.*)'
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
clang-tidy:
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
@ -198,6 +160,8 @@ jobs:
-g"-torch/csrc/jit/export.cpp" \
-g"-torch/csrc/jit/import.cpp" \
-g"-torch/csrc/jit/netdef_converter.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
"$@" > ${GITHUB_WORKSPACE}/clang-tidy-output.txt
cat ${GITHUB_WORKSPACE}/clang-tidy-output.txt

View File

@ -6,6 +6,69 @@ TEST_DIR="$ROOT_DIR/caffe2_tests"
gtest_reports_dir="${TEST_DIR}/cpp"
pytest_reports_dir="${TEST_DIR}/python"
# This is needed to work around ROCm using old docker images until
# the transition to new images is complete.
# TODO: Remove once ROCm CI is using new images.
if [[ $BUILD_ENVIRONMENT == py3.6-devtoolset7-rocmrpm-centos* ]]; then
# This file is sourced multiple times, only install conda the first time.
# We must install conda where we have write access.
CONDA_DIR="$ROOT_DIR/conda"
if [[ ! -d $CONDA_DIR ]]; then
ANACONDA_PYTHON_VERSION=3.6
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
mkdir $CONDA_DIR
pushd /tmp
wget -q "${BASE_URL}/${CONDA_FILE}"
chmod +x "${CONDA_FILE}"
./"${CONDA_FILE}" -b -f -p "$CONDA_DIR"
popd
export PATH="$CONDA_DIR/bin:$PATH"
# Ensure we run conda in a directory that jenkins has write access to
pushd $CONDA_DIR
# Track latest conda update
conda update -n base conda
# Install correct Python version
conda install python="$ANACONDA_PYTHON_VERSION"
conda_install() {
# Ensure that the install command doesn't upgrade/downgrade Python
# This should be called as
# conda_install pkg1 pkg2 ... [-c channel]
conda install -q -y python="$ANACONDA_PYTHON_VERSION" $*
}
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
conda_install numpy pyyaml mkl mkl-include setuptools cffi typing future six
# TODO: This isn't working atm
conda_install nnpack -c killeent
# Install some other packages
# Need networkx 2.0 because bellman_ford was moved in 2.1. Scikit-image by
# default installs the most recent networkx version, so we install this lower
# version explicitly before scikit-image pulls it in as a dependency
pip install networkx==2.0
# TODO: Why is scipy pinned
# numba & llvmlite is pinned because of https://github.com/numba/numba/issues/4368
# scikit-learn is pinned because of
# https://github.com/scikit-learn/scikit-learn/issues/14485 (affects gcc 5.5
# only)
pip install --progress-bar off pytest scipy==1.1.0 scikit-learn==0.20.3 scikit-image librosa>=0.6.2 psutil numba==0.46.0 llvmlite==0.30.0
# click - onnx
# hypothesis - tests
# jupyter - for tutorials
pip install --progress-bar off click hypothesis jupyter protobuf tabulate virtualenv mock typing-extensions
popd
else
export PATH="$CONDA_DIR/bin:$PATH"
fi
fi
# Figure out which Python to use
PYTHON="$(which python)"
if [[ "${BUILD_ENVIRONMENT}" =~ py((2|3)\.?[0-9]?\.?[0-9]?) ]]; then

View File

@ -144,7 +144,7 @@ if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
# default pip version is too old (9.0.2), unable to support tag `manylinux2010`.
# Fix the pip error: Couldn't find a version that satisfies the requirement
sudo pip install --upgrade pip
pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.1.0.dev1228
pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.3.0.dev202005123
fi
"$ROOT_DIR/scripts/onnx/test.sh"
fi

View File

@ -259,7 +259,7 @@ if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
# XLA build requires Bazel
# We use bazelisk to avoid updating Bazel version manually.
sudo npm install -g @bazel/bazelisk
sudo ln -s "$(command -v bazelisk)" /usr/bin/bazel
sudo ln -sf "$(command -v bazelisk)" /usr/bin/bazel
# Install bazels3cache for cloud cache
sudo npm install -g bazels3cache

View File

@ -13,7 +13,7 @@ mkdir -p ${WORKSPACE_DIR}
# If a local installation of conda doesn't exist, we download and install conda
if [ ! -d "${WORKSPACE_DIR}/miniconda3" ]; then
mkdir -p ${WORKSPACE_DIR}
curl --retry 3 https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o ${WORKSPACE_DIR}/miniconda3.sh
curl --retry 3 https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o ${WORKSPACE_DIR}/miniconda3.sh
retry bash ${WORKSPACE_DIR}/miniconda3.sh -b -p ${WORKSPACE_DIR}/miniconda3
fi
export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH"

View File

@ -20,7 +20,7 @@ if [ -n "${IN_CIRCLECI}" ]; then
sudo apt-get install -y --allow-downgrades --allow-change-held-packages libnccl-dev=2.5.6-1+cuda10.1 libnccl2=2.5.6-1+cuda10.1
fi
if [[ "$BUILD_ENVIRONMENT" == *-xenial-cuda9-cudnn7-py2* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *-xenial-cuda10.1-cudnn7-py3* ]]; then
# TODO: move this to Docker
sudo apt-get update
sudo apt-get install -y --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev

View File

@ -21,7 +21,7 @@ if [ -n "${IN_CIRCLECI}" ]; then
sudo apt-get -qq install --allow-downgrades --allow-change-held-packages libnccl-dev=2.5.6-1+cuda10.1 libnccl2=2.5.6-1+cuda10.1
fi
if [[ "$BUILD_ENVIRONMENT" == *-xenial-cuda9-cudnn7-py2* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *-xenial-cuda10.1-cudnn7-py3* ]]; then
# TODO: move this to Docker
sudo apt-get -qq update
sudo apt-get -qq install --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev
@ -141,8 +141,8 @@ test_python_nn() {
assert_git_not_dirty
}
test_python_ge_config_simple() {
time python test/run_test.py --include test_jit_simple --verbose --determine-from="$DETERMINE_FROM"
test_python_ge_config_profiling() {
time python test/run_test.py --include test_jit_profiling test_jit_fuser_profiling --verbose --determine-from="$DETERMINE_FROM"
assert_git_not_dirty
}
@ -152,7 +152,7 @@ test_python_ge_config_legacy() {
}
test_python_all_except_nn() {
time python test/run_test.py --exclude test_nn test_jit_simple test_jit_legacy test_jit_fuser_legacy --verbose --bring-to-front test_quantization test_quantized test_quantized_tensor test_quantized_nn_mods --determine-from="$DETERMINE_FROM"
time python test/run_test.py --exclude test_nn test_jit_profiling test_jit_legacy test_jit_fuser_legacy test_jit_fuser_profiling --verbose --bring-to-front test_quantization test_quantized test_quantized_tensor test_quantized_nn_mods --determine-from="$DETERMINE_FROM"
assert_git_not_dirty
}
@ -244,7 +244,7 @@ test_backward_compatibility() {
pushd test/backward_compatibility
python dump_all_function_schemas.py --filename new_schemas.txt
pip_uninstall torch
pip_install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
pip_install torch==1.4.0+cpu torchvision==0.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
python check_backward_compatibility.py --new-schemas new_schemas.txt
popd
set +x
@ -264,8 +264,8 @@ elif [[ "${BUILD_ENVIRONMENT}" == *xla* || "${JOB_BASE_NAME}" == *xla* ]]; then
test_xla
elif [[ "${BUILD_ENVIRONMENT}" == *ge_config_legacy* || "${JOB_BASE_NAME}" == *ge_config_legacy* ]]; then
test_python_ge_config_legacy
elif [[ "${BUILD_ENVIRONMENT}" == *ge_config_simple* || "${JOB_BASE_NAME}" == *ge_config_simple* ]]; then
test_python_ge_config_simple
elif [[ "${BUILD_ENVIRONMENT}" == *ge_config_profiling* || "${JOB_BASE_NAME}" == *ge_config_profiling* ]]; then
test_python_ge_config_profiling
elif [[ "${BUILD_ENVIRONMENT}" == *libtorch* ]]; then
# TODO: run some C++ tests
echo "no-op at the moment"

View File

@ -5,7 +5,7 @@ if "%BUILD_ENVIRONMENT%"=="" (
)
if "%REBUILD%"=="" (
IF EXIST %CONDA_PARENT_DIR%\Miniconda3 ( rd /s /q %CONDA_PARENT_DIR%\Miniconda3 )
curl --retry 3 -k https://repo.continuum.io/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe
curl --retry 3 -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe
%TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_PARENT_DIR%\Miniconda3
)
call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3

View File

@ -13,7 +13,7 @@ if "%BUILD_ENVIRONMENT%"=="" (
)
if NOT "%BUILD_ENVIRONMENT%"=="" (
IF EXIST %CONDA_PARENT_DIR%\Miniconda3 ( rd /s /q %CONDA_PARENT_DIR%\Miniconda3 )
curl --retry 3 https://repo.continuum.io/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe
curl --retry 3 https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe
%TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_PARENT_DIR%\Miniconda3
)
call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3

View File

@ -1,3 +1,3 @@
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
cd test && python run_test.py --exclude test_nn test_jit_simple test_jit_legacy test_jit_fuser_legacy --verbose --determine-from="%1" && cd ..
cd test && python run_test.py --exclude test_nn test_jit_profiling test_jit_legacy test_jit_fuser_legacy test_jit_fuser_profiling --verbose --determine-from="%1" && cd ..
if ERRORLEVEL 1 exit /b 1

View File

@ -160,20 +160,18 @@ ENDIF(BLAS_FOUND)
IF(LAPACK_FOUND)
list(APPEND ATen_CPU_DEPENDENCY_LIBS ${LAPACK_LIBRARIES})
if(USE_CUDA)
if(USE_CUDA AND MSVC)
# Although Lapack provides CPU (and thus, one might expect that ATen_cuda
# would not need this at all), some of our libraries (magma in particular)
# backend to CPU BLAS/LAPACK implementations, and so it is very important
# we get the *right* implementation, because even if the symbols are the
same, LAPACK implementations may have different calling conventions.
# This caused https://github.com/pytorch/pytorch/issues/7353
#
# We do NOT do this on Linux, since we just rely on torch_cpu to
# provide all of the symbols we need
list(APPEND ATen_CUDA_DEPENDENCY_LIBS ${LAPACK_LIBRARIES})
endif()
if(USE_ROCM)
# It's not altogether clear that HIP behaves the same way, but it
# seems safer to assume that it needs it too
list(APPEND ATen_HIP_DEPENDENCY_LIBS ${LAPACK_LIBRARIES})
endif()
ENDIF(LAPACK_FOUND)
IF (UNIX AND NOT APPLE)
@ -331,8 +329,12 @@ IF(USE_CUDA AND NOT USE_ROCM)
IF(USE_MAGMA)
list(APPEND ATen_CUDA_DEPENDENCY_LIBS ${MAGMA_LIBRARIES})
IF ($ENV{TH_BINARY_BUILD})
list(APPEND ATen_CUDA_DEPENDENCY_LIBS
"${BLAS_LIBRARIES};${BLAS_LIBRARIES};${BLAS_LIBRARIES}")
IF (MSVC)
# Do not do this on Linux: see Note [Extra MKL symbols for MAGMA in torch_cpu]
# in caffe2/CMakeLists.txt
list(APPEND ATen_CUDA_DEPENDENCY_LIBS
"${BLAS_LIBRARIES};${BLAS_LIBRARIES};${BLAS_LIBRARIES}")
ENDIF(MSVC)
ENDIF($ENV{TH_BINARY_BUILD})
ENDIF(USE_MAGMA)
IF ($ENV{ATEN_STATIC_CUDA})

View File

@ -125,13 +125,15 @@ void _parallel_run(
std::tie(num_tasks, chunk_size) =
internal::calc_num_tasks_and_chunk_size(begin, end, grain_size);
std::atomic_flag err_flag = ATOMIC_FLAG_INIT;
std::exception_ptr eptr;
std::vector<std::shared_ptr<c10::ivalue::Future>> futures(num_tasks);
for (size_t task_id = 0; task_id < num_tasks; ++task_id) {
futures[task_id] = std::make_shared<c10::ivalue::Future>(c10::NoneType::get());
}
auto task = [f, &eptr, &err_flag, &futures, begin, end, chunk_size]
struct {
std::atomic_flag err_flag = ATOMIC_FLAG_INIT;
std::exception_ptr eptr;
std::mutex mutex;
volatile size_t remaining;
std::condition_variable cv;
} state;
auto task = [f, &state, begin, end, chunk_size]
(int /* unused */, size_t task_id) {
int64_t local_start = begin + task_id * chunk_size;
if (local_start < end) {
@ -140,21 +142,30 @@ void _parallel_run(
ParallelRegionGuard guard(task_id);
f(local_start, local_end, task_id);
} catch (...) {
if (!err_flag.test_and_set()) {
eptr = std::current_exception();
if (!state.err_flag.test_and_set()) {
state.eptr = std::current_exception();
}
}
}
futures[task_id]->markCompleted();
{
std::unique_lock<std::mutex> lk(state.mutex);
if (--state.remaining == 0) {
state.cv.notify_one();
}
}
};
state.remaining = num_tasks;
_run_with_pool(task, num_tasks);
// Wait for all tasks to finish.
for (size_t task_id = 0; task_id < num_tasks; ++task_id) {
futures[task_id]->wait();
{
std::unique_lock<std::mutex> lk(state.mutex);
if (state.remaining != 0) {
state.cv.wait(lk);
}
}
if (eptr) {
std::rethrow_exception(eptr);
if (state.eptr) {
std::rethrow_exception(state.eptr);
}
}
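The rework above drops the per-task ivalue::Future objects in favour of a single shared state: an atomic error flag plus exception pointer for the first failure, and a mutex, remaining-task counter, and condition variable that let the launching thread block until every chunk has finished. A minimal standalone sketch of the same completion pattern, using plain std::thread instead of ATen's thread pool (the names and the work body are placeholders):

    #include <atomic>
    #include <condition_variable>
    #include <exception>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Illustrative sketch only: run `num_tasks` chunks, wait for all of them,
    // and rethrow the first exception seen (mirrors the pattern in the diff).
    void run_chunks(int num_tasks) {
      struct {
        std::atomic_flag err_flag = ATOMIC_FLAG_INIT;
        std::exception_ptr eptr;
        std::mutex mutex;
        size_t remaining;
        std::condition_variable cv;
      } state;
      state.remaining = num_tasks;

      std::vector<std::thread> workers;
      for (int task_id = 0; task_id < num_tasks; ++task_id) {
        workers.emplace_back([&state, task_id] {
          try {
            (void)task_id;  // would select this chunk's range in real code
            // ... work for this chunk goes here ...
          } catch (...) {
            // Only the first failing task records its exception.
            if (!state.err_flag.test_and_set()) {
              state.eptr = std::current_exception();
            }
          }
          // Completion bookkeeping: the last task wakes the waiter.
          std::unique_lock<std::mutex> lk(state.mutex);
          if (--state.remaining == 0) {
            state.cv.notify_one();
          }
        });
      }

      {
        // Wait until every task has decremented the counter.
        std::unique_lock<std::mutex> lk(state.mutex);
        state.cv.wait(lk, [&] { return state.remaining == 0; });
      }
      for (auto& t : workers) {
        t.join();
      }
      if (state.eptr) {
        std::rethrow_exception(state.eptr);
      }
    }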

View File

@ -27,14 +27,9 @@ using c10::KernelFunction;
namespace {
void variable_fallback_kernel(const OperatorHandle& op, Stack* stack) {
at::AutoNonVariableTypeMode _var_guard(true);
Dispatcher::singleton().callBoxed(op, stack);
}
static auto registry = Dispatcher::singleton().registerBackendFallbackKernel(
static auto registry = c10::Dispatcher::singleton().registerBackendFallbackKernel(
DispatchKey::VariableTensorId,
KernelFunction::makeFromBoxedFunction<&variable_fallback_kernel>()
KernelFunction::makeFallthrough()
);
}

View File

@ -550,7 +550,6 @@ FunctionOption = TypedDict('FunctionOption', {
'type_method_definition_dispatch': str,
'type_method_formals': List[str],
'variants': str,
'with_gil': bool,
'zero_dim_dispatch_when_scalar': str,
})

View File

@ -673,11 +673,11 @@ Tensor & leaky_relu_(
return at::leaky_relu_out(self, self, neg_val);
}
// Note: leakyReLu backward calculation doesn't support in-place call with non-positive slope.
// Note: leakyReLu backward calculation doesn't support in-place call with negative slope.
// The reason is that for in-place forward call, the forward result will be saved into autograd
// node instead of the input itself, when calculating backward gradient, there is no way to know
// whether the original input for current node is positive or not if the input slope is
// non-positive. eg. forward is 2, slope is -0.2, the original input for this node could be
// negative. eg. forward is 2, slope is -0.2, the original input for this node could be
// either 2, or -10, so no way to get a correct backward gradient in this case.
Tensor leaky_relu_backward(
const Tensor& grad_output,
@ -685,11 +685,11 @@ Tensor leaky_relu_backward(
Scalar negval,
bool is_result) {
TORCH_CHECK(
!is_result || negval.to<double>() > 0.0,
"In-place leakyReLu backward calculation is triggered with a non-positive slope which is not supported. "
"This is caused by calling in-place forward function with a non-positive slope, "
!is_result || negval.to<double>() >= 0.0,
"In-place leakyReLu backward calculation is triggered with a negative slope which is not supported. "
"This is caused by calling in-place forward function with a negative slope, "
"please call out-of-place version instead. File an issue at https://github.com/pytorch/pytorch if you do "
"require supporting in-place leakRelu backward calculation with non-positive slope");
"require supporting in-place leakRelu backward calculation with negative slope");
Tensor result;
auto iter = TensorIterator::binary_op(result, self_or_result, grad_output);
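The comment and TORCH_CHECK rewrites above narrow the unsupported case from "non-positive" to "negative" slope: the check is relaxed from > 0.0 to >= 0.0, since with a slope of exactly zero the sign of the saved forward output still determines which branch the input took, while with a negative slope it does not. The ambiguity named in the comment (forward value 2, slope -0.2) is easy to reproduce numerically; a tiny standalone check:

    #include <iostream>

    // leaky_relu forward for a single value.
    double leaky_relu(double x, double slope) {
      return x > 0 ? x : slope * x;
    }

    int main() {
      const double slope = -0.2;
      // Two very different inputs map to the same forward output, so the saved
      // output alone cannot tell the backward pass whether to use a gradient of
      // 1 (positive branch) or `slope` (negative branch).
      std::cout << leaky_relu(2.0, slope) << "\n";    // prints 2
      std::cout << leaky_relu(-10.0, slope) << "\n";  // prints 2
      return 0;
    }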
@ -698,17 +698,34 @@ Tensor leaky_relu_backward(
}
std::tuple<Tensor, Tensor> log_sigmoid_forward_cpu(const Tensor& input) {
auto result = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
auto buffer = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
// FIXME: do these actually need to be zeros_like or can they be empty_like?
auto result = at::zeros_like(input, at::MemoryFormat::Contiguous);
auto buffer = at::zeros_like(input, at::MemoryFormat::Contiguous);
log_sigmoid_cpu_stub(kCPU, result, buffer, input.contiguous());
return std::make_tuple(result, buffer);
}
std::tuple<Tensor&, Tensor&> log_sigmoid_forward_out_cpu(Tensor& result, Tensor& buffer, const Tensor& input) {
log_sigmoid_cpu_stub(kCPU, result, buffer, input);
result.resize_as_(input);
buffer.resize_as_(input, at::MemoryFormat::Contiguous);
TORCH_CHECK(buffer.is_contiguous(), "Contiguous buffer required for log_sigmoid with out parameter");
Tensor result_tmp = result.is_contiguous() ? result : at::empty_like(result, at::MemoryFormat::Contiguous);
log_sigmoid_cpu_stub(kCPU, result_tmp, buffer, input.contiguous());
if (!result.is_contiguous()) {
result.copy_(result_tmp);
}
return std::forward_as_tuple(result, buffer);
}
Tensor & log_sigmoid_out(Tensor & output, const Tensor & self) {
Tensor buffer = at::empty({0}, self.options());
return std::get<0>(at::log_sigmoid_forward_out(output, buffer, self));
}
Tensor log_sigmoid(const Tensor & self) {
return std::get<0>(at::log_sigmoid_forward(self));
}
Tensor log_sigmoid_backward_cpu(const Tensor& grad_output, const Tensor& input, const Tensor& buffer) {
Tensor grad_input;
auto iter = at::TensorIterator();
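The log_sigmoid changes in the same file follow a related theme: the stub kernel assumes contiguous buffers, so the out-variant now checks that the user-supplied buffer is contiguous and, when the result tensor is not, runs the kernel into a contiguous temporary and copies back. A sketch of that "compute into a temporary, copy into the non-contiguous out" pattern in plain C++ terms (hypothetical helper names, not the ATen kernel):

    #include <algorithm>
    #include <vector>

    // Illustrative only: `kernel` requires a dense output buffer, so when the
    // caller's destination is strided we compute densely and scatter afterwards.
    void compute_into(std::vector<double>& dense_tmp,
                      const std::vector<double>& input) {
      std::transform(input.begin(), input.end(), dense_tmp.begin(),
                     [](double x) { return -x; });  // stand-in for the real op
    }

    void compute_out(double* out, size_t out_stride,
                     const std::vector<double>& input) {
      std::vector<double> tmp(input.size());
      compute_into(tmp, input);                      // dense temporary
      for (size_t i = 0; i < tmp.size(); ++i) {
        out[i * out_stride] = tmp[i];                // copy back into strided out
      }
    }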

View File

@ -609,7 +609,7 @@ at::Tensor _convolution(
auto weight_view = at::_unsafe_view(weight, -1);
auto out = input*weight_view[0];
if (bias.defined())
out = out + bias[0];
out.add_(bias[0]);
return out.view(o);
}
@ -639,7 +639,7 @@ at::Tensor _convolution(
input.contiguous(cudnn_memory_format), weight,
padding, stride, dilation, params.groups, params.benchmark, params.deterministic);
if (bias.defined()) {
output = output + reshape_bias(input.dim(), bias);
output.add_(reshape_bias(input.dim(), bias));
}
} else if (params.use_miopen(input, bias.defined())){
@ -662,14 +662,14 @@ at::Tensor _convolution(
input.contiguous(cudnn_memory_format), weight,
params.padding, params.output_padding, params.stride, params.dilation, params.groups, params.benchmark, params.deterministic);
if (bias.defined()) {
output = output + reshape_bias(input.dim(), bias);
output.add_(reshape_bias(input.dim(), bias));
}
} else {
output = at::cudnn_convolution(
input.contiguous(cudnn_memory_format), weight,
params.padding, params.stride, params.dilation, params.groups, params.benchmark, params.deterministic);
if (bias.defined()) {
output = output + reshape_bias(input.dim(), bias);
output.add_(reshape_bias(input.dim(), bias));
}
}
} else if (params.use_miopen(input, bias.defined())) {

View File

@ -70,8 +70,8 @@ struct CAFFE2_API DispatchStub<rT (*)(Args...), T> {
// they will still compute the same value for cpu_dispatch_ptr.
if (!cpu_dispatch_ptr.load(std::memory_order_relaxed)) {
FnPtr tmp_cpu_dispatch_ptr = nullptr;
cpu_dispatch_ptr.compare_exchange_weak(
tmp_cpu_dispatch_ptr, choose_cpu_impl(), std::memory_order_relaxed);
while(!cpu_dispatch_ptr.compare_exchange_weak(
tmp_cpu_dispatch_ptr, choose_cpu_impl(), std::memory_order_relaxed));
}
return (*cpu_dispatch_ptr)(std::forward<ArgTypes>(args)...);
} else if (device_type == DeviceType::CUDA) {
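compare_exchange_weak is permitted to fail spuriously even when the stored value equals the expected one, so the previous single call could return with cpu_dispatch_ptr still unset and the very next line would dereference a null function pointer. Wrapping it in a loop, as above, retries until either this thread installs the implementation or it observes one installed by a concurrent caller. A self-contained sketch of the idiom with placeholder names (not the actual DispatchStub types):

    #include <atomic>

    using FnPtr = void (*)();

    void some_cpu_impl() {}
    FnPtr choose_cpu_impl() { return &some_cpu_impl; }

    std::atomic<FnPtr> cpu_dispatch_ptr{nullptr};

    FnPtr get_call_ptr() {
      if (!cpu_dispatch_ptr.load(std::memory_order_relaxed)) {
        FnPtr expected = nullptr;
        // On failure (spurious or due to a concurrent winner) `expected` is
        // refreshed with the current value, so the loop terminates with a
        // non-null pointer installed either way.
        while (!cpu_dispatch_ptr.compare_exchange_weak(
            expected, choose_cpu_impl(), std::memory_order_relaxed)) {
        }
      }
      return cpu_dispatch_ptr.load(std::memory_order_relaxed);
    }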

View File

@ -31,15 +31,6 @@ Tensor nll_loss2d(const Tensor & self, const Tensor & target, const Tensor & wei
return std::get<0>(at::nll_loss2d_forward(self, target, weight, reduction, ignore_index));
}
Tensor & log_sigmoid_out(Tensor & output, const Tensor & self) {
Tensor buffer = at::empty({0}, self.options());
return std::get<0>(at::log_sigmoid_forward_out(output, buffer, self));
}
Tensor log_sigmoid(const Tensor & self) {
return std::get<0>(at::log_sigmoid_forward(self));
}
Tensor & thnn_conv2d_out(Tensor & output, const Tensor & self, const Tensor & weight, IntArrayRef kernel_size, const Tensor & bias, IntArrayRef stride, IntArrayRef padding) {
Tensor finput = at::empty({0}, self.options());
Tensor fgrad_input = at::empty({0}, self.options());

View File

@ -533,7 +533,7 @@ Tensor frobenius_norm(const Tensor& self, IntArrayRef dim, bool keepdim) {
return at::norm(self, 2, dim, keepdim, self.scalar_type());
}
if (self.is_complex()){
return at::sqrt(at::sum((self.conj() * self).real(), dim, keepdim));
return at::sqrt(at::sum(at::real(self.conj() * self), dim, keepdim));
} else {
return at::sqrt(at::sum((self * self), dim, keepdim));
}
@ -553,7 +553,7 @@ Tensor &frobenius_norm_out(
return at::norm_out(result, self, 2, dim, keepdim, self.scalar_type());
}
if (self.is_complex()){
return at::sqrt_out(result, at::sum((self.conj() * self).real(), dim, keepdim));
return at::sqrt_out(result, at::sum(at::real(self.conj() * self), dim, keepdim));
} else {
return at::sqrt_out(result, at::sum((self * self), dim, keepdim));
}
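For a complex tensor the Frobenius norm above is computed as sqrt(sum(real(conj(x) * x))): conj(x) * x equals |x|^2 with a zero imaginary part, so at::real just drops that zero component before the reduction (the change itself only swaps the .real() method for the free function). A scalar illustration with plain std::complex, not ATen:

    #include <cmath>
    #include <complex>
    #include <iostream>
    #include <vector>

    int main() {
      std::vector<std::complex<double>> x = {{3.0, 4.0}, {1.0, -2.0}};
      double sum_sq = 0.0;
      for (const auto& v : x) {
        // conj(v) * v == |v|^2 + 0i, so only the real part contributes.
        sum_sq += std::real(std::conj(v) * v);
      }
      // sqrt(|3+4i|^2 + |1-2i|^2) = sqrt(25 + 5) ~ 5.477
      std::cout << std::sqrt(sum_sq) << "\n";
      return 0;
    }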

View File

@ -799,7 +799,7 @@ static Tensor &std_var_out(Tensor &result, const Tensor &self, IntArrayRef dim,
if (at::isComplexType(self.scalar_type())){
ScalarType dtype = c10::toValueType(get_dtype(result, self, {}, true));
Tensor real_in = self.real().to(dtype);
Tensor real_in = at::real(self).to(dtype);
Tensor real_out = at::empty({0}, self.options().dtype(dtype));
auto iter = make_reduction("std or var", real_out, real_in, dim, keepdim, dtype);
if (iter.numel() == 0) {
@ -807,7 +807,7 @@ static Tensor &std_var_out(Tensor &result, const Tensor &self, IntArrayRef dim,
} else {
std_var_stub(iter.device_type(), iter, unbiased, false);
}
Tensor imag_in = self.imag().to(dtype);
Tensor imag_in = at::imag(self).to(dtype);
Tensor imag_out = at::empty({0}, self.options().dtype(dtype));
iter = make_reduction("std or var", imag_out, imag_in, dim, keepdim, dtype);
if (iter.numel() == 0) {
@ -845,7 +845,7 @@ static std::tuple<Tensor&,Tensor&> std_var_mean_out(const char* fname, Tensor &r
".");
if (at::isComplexType(self.scalar_type())){
ScalarType dtype = c10::toValueType(get_dtype(result1, self, {}, true));
Tensor real_in = self.real().to(dtype);
Tensor real_in = at::real(self).to(dtype);
Tensor real_out_var = at::empty({0}, self.options().dtype(dtype));
Tensor real_out_mean = at::empty({0}, self.options().dtype(dtype));
auto iter = make_reduction(fname, real_out_var, real_out_mean, real_in, dim, keepdim, dtype);
@ -855,7 +855,7 @@ static std::tuple<Tensor&,Tensor&> std_var_mean_out(const char* fname, Tensor &r
} else {
std_var_stub(iter.device_type(), iter, unbiased, false);
}
Tensor imag_in = self.imag().to(dtype);
Tensor imag_in = at::imag(self).to(dtype);
Tensor imag_out_var = at::empty({0}, self.options().dtype(dtype));
Tensor imag_out_mean = at::empty({0}, self.options().dtype(dtype));
iter = make_reduction(fname, imag_out_var, imag_out_mean, imag_in, dim, keepdim, dtype);
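The std/var reductions handle a complex input by converting to the corresponding real value type and running the kernel separately on the real and imaginary parts (their combination happens outside the hunks shown); the change here is again just at::real / at::imag replacing the member accessors. The underlying identity is that the variance of a complex sample is the sum of the variances of its real and imaginary components, since |z - mean|^2 = (Re z - Re mean)^2 + (Im z - Im mean)^2. A small numeric check with made-up values:

    #include <complex>
    #include <iostream>
    #include <vector>

    // Unbiased variance of a real sample.
    double var(const std::vector<double>& v) {
      double mean = 0.0;
      for (double x : v) mean += x;
      mean /= v.size();
      double s = 0.0;
      for (double x : v) s += (x - mean) * (x - mean);
      return s / (v.size() - 1);
    }

    int main() {
      std::vector<std::complex<double>> z = {{1.0, 2.0}, {3.0, -1.0}, {0.0, 0.0}};
      std::vector<double> re, im;
      for (const auto& c : z) {
        re.push_back(c.real());
        im.push_back(c.imag());
      }
      // The complex variance splits into a real-part reduction plus an
      // imaginary-part reduction.
      std::cout << var(re) + var(im) << "\n";
      return 0;
    }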

View File

@ -85,6 +85,7 @@ inline void setStrided(
IntArrayRef size,
IntArrayRef stride,
int64_t storage_offset) {
TORCH_CHECK(size.size() == stride.size(), "mismatch in length of strides and shape");
auto* self_ = self.unsafeGetTensorImpl();
checkInBoundsForStorage(size, stride, storage_offset, self_->storage());
@ -93,7 +94,6 @@ inline void setStrided(
self_->set_storage_offset(storage_offset);
/* size and stride */
AT_ASSERT(size.size() == stride.size());
if (self_->sizes() == size && self_->strides() == stride) {
return;
}
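setStrided now rejects a size/stride length mismatch up front with a user-facing TORCH_CHECK, rather than relying on the internal AT_ASSERT that used to sit further down. From the libtorch side the visible effect is that a malformed as_strided call fails with a readable error; a short usage sketch, assuming a build that includes this change:

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      auto t = torch::arange(4);
      try {
        // One stride entry for a two-entry size list: caught by the new
        // "mismatch in length of strides and shape" check.
        auto v = t.as_strided({2, 2}, {2});
      } catch (const c10::Error& e) {
        std::cout << "as_strided rejected mismatched size/stride lengths\n";
      }
      return 0;
    }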

View File

@ -130,6 +130,28 @@ static Tensor reshape_indexer(const Tensor& index, int64_t dims_before, int64_t
return index.reshape(shape);
}
// checks whether index.dtype == int64
// and self.dtype == src.dtype if src is a Tensor
static void scatter_gather_dtype_check(
const std::string& method_name,
const Tensor& self,
const Tensor& index,
const c10::optional<const Tensor>& src_opt = c10::nullopt
) {
TORCH_CHECK(
index.scalar_type() == at::ScalarType::Long,
method_name, "(): Expected dtype int64 for index"
);
if (src_opt.has_value()) {
auto src = src_opt.value();
TORCH_CHECK(
self.scalar_type() == src.scalar_type(),
method_name, "(): Expected self.dtype to be equal to src.dtype"
);
}
}
AdvancedIndex::AdvancedIndex(const Tensor& src, TensorList indices_list)
{
int64_t element_size_bytes = src.element_size();
@ -493,40 +515,48 @@ Tensor index_fill(const Tensor & self, int64_t dim, const Tensor & index, const
}
Tensor & gather_out_cpu(Tensor & result, const Tensor & self, int64_t dim, const Tensor & index, bool sparse_grad) {
scatter_gather_dtype_check("gather_out_cpu", self, index, result);
result.resize_(index.sizes());
gather_stub(result.device().type(), result, self, dim, index);
return result;
}
Tensor gather_cpu(const Tensor & self, int64_t dim, const Tensor & index, bool sparse_grad) {
scatter_gather_dtype_check("gather_cpu", self, index);
Tensor result = at::empty({0}, self.options());
return gather_out_cpu(result, self, dim, index, sparse_grad);
}
Tensor & scatter_cpu_(Tensor & self, int64_t dim, const Tensor & index, const Tensor & src) {
scatter_gather_dtype_check("scatter_cpu", self, index, src);
scatter_stub(self.device().type(), self, dim, index, src);
return self;
}
Tensor & scatter_fill_cpu_(Tensor & self, int64_t dim, const Tensor & index, Scalar src) {
scatter_gather_dtype_check("scatter_fill_cpu", self, index);
scatter_fill_stub(self.device().type(), self, dim, index, src);
return self;
}
Tensor scatter(const Tensor & self, int64_t dim, const Tensor & index, const Tensor & source) {
scatter_gather_dtype_check("scatter", self, index, source);
return self.clone(at::MemoryFormat::Preserve).scatter_(dim, index, source);
}
Tensor scatter(const Tensor & self, int64_t dim, const Tensor & index, Scalar source) {
scatter_gather_dtype_check("scatter", self, index);
return self.clone(at::MemoryFormat::Preserve).scatter_(dim, index, source);
}
Tensor & scatter_add_cpu_(Tensor & self, int64_t dim, const Tensor & index, const Tensor & src) {
scatter_gather_dtype_check("scatter_add_cpu", self, index, src);
scatter_add_stub(self.device().type(), self, dim, index, src);
return self;
}
Tensor scatter_add(const Tensor & self, int64_t dim, const Tensor & index, const Tensor & source) {
scatter_gather_dtype_check("scatter_add", self, index, source);
return self.clone(at::MemoryFormat::Preserve).scatter_add_(dim, index, source);
}
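scatter_gather_dtype_check is now called on entry to every CPU gather/scatter variant above, so an index tensor whose dtype is not int64, or a src whose dtype differs from self, is rejected with a readable TORCH_CHECK message before any kernel runs. A short libtorch usage sketch, assuming a build that includes this change:

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      auto self = torch::arange(6, torch::kFloat).reshape({2, 3});

      // An int32 index is now rejected up front by the dtype check.
      auto bad_index = torch::zeros({2, 3}, torch::kInt32);
      try {
        auto out = torch::gather(self, 1, bad_index);
      } catch (const c10::Error& e) {
        std::cout << "gather rejected a non-int64 index tensor\n";
      }

      // An int64 index with matching self/src dtypes works as before.
      auto index = torch::zeros({2, 3}, torch::kInt64);
      std::cout << torch::gather(self, 1, index) << "\n";
      return 0;
    }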

View File

@ -99,7 +99,7 @@ Tensor _dim_arange(const Tensor& like, int64_t dim) {
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ empty ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tensor empty_cpu(IntArrayRef size, const TensorOptions& options_, c10::optional<c10::MemoryFormat> optional_memory_format) {
TORCH_CHECK(!isComplexType(at::typeMetaToScalarType(options_.dtype())), "Complex dtype not supported.");
TORCH_CHECK(
!(options_.has_memory_format() && optional_memory_format.has_value()),
"Cannot set memory_format both in TensorOptions and explicit argument; please delete "

View File

@ -638,7 +638,7 @@ void TensorIterator::narrow(int dim, int64_t start, int64_t size) {
for (auto& op : operands_) {
op.data = ((char*)op.data) + op.stride_bytes[dim] * start;
}
if (size == 1) {
if (size == 1 && !is_reduction_) {
coalesce_dimensions();
}
}
@ -891,10 +891,13 @@ std::unique_ptr<TensorIterator> TensorIterator::split(int dim) {
}
int TensorIterator::get_dim_to_split() const {
TORCH_INTERNAL_ASSERT(ndim() >= 1 && shape()[ndim() - 1] >= 2);
TORCH_INTERNAL_ASSERT(ndim() >= 1);
int64_t max_extent = -1;
int dim_to_split = -1;
for (int dim = ndim() - 1; dim >= 0; dim--) {
if (shape_[dim] == 0) {
continue;
}
int64_t size = shape_[dim];
for (auto& op : operands_) {
int64_t extent = (size - 1) * op.stride_bytes[dim];

View File

@ -73,11 +73,17 @@ Tensor& abs_(Tensor& self) { return unary_op_impl_(self, at::abs_out); }
Tensor& angle_out(Tensor& result, const Tensor& self) { return unary_op_impl_out(result, self, angle_stub); }
Tensor angle(const Tensor& self) { return unary_op_impl(self, at::angle_out); }
Tensor& real_out(Tensor& result, const Tensor& self) { return unary_op_impl_out(result, self, real_stub); }
Tensor real(const Tensor& self) { return unary_op_impl(self, at::real_out); }
Tensor real(const Tensor& self) {
TORCH_CHECK(!self.is_complex(), "real is not yet implemented for complex tensors.");
return self;
}
Tensor& imag_out(Tensor& result, const Tensor& self) { return unary_op_impl_out(result, self, imag_stub); }
Tensor imag(const Tensor& self) { return unary_op_impl(self, at::imag_out); }
Tensor imag(const Tensor& self) {
TORCH_CHECK(false, "imag is not yet implemented.");
// Note: unreachable
return at::zeros_like(self);
}
Tensor& conj_out(Tensor& result, const Tensor& self) { return unary_op_impl_out(result, self, conj_stub); }
Tensor conj(const Tensor& self) { return unary_op_impl(self, at::conj_out); }

View File

@ -87,6 +87,10 @@ static void max_kernel_impl(
Tensor& max_indices,
const Tensor& self,
c10::optional<int64_t> dim) {
TORCH_CHECK(max.scalar_type() == self.scalar_type() && max_indices.scalar_type() == kLong,
"Expect dtype ", self.scalar_type(), "and torch.long, but got ", max.scalar_type(), "and", max_indices.scalar_type());
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND(ScalarType::Bool, self.scalar_type(), "max", [&] {
Reduction<scalar_t, int64_t>::apply(max, max_indices, self, dim, true);
});
@ -97,6 +101,10 @@ static void min_kernel_impl(
Tensor& min_indices,
const Tensor& self,
c10::optional<int64_t> dim) {
TORCH_CHECK(min.scalar_type() == self.scalar_type() && min_indices.scalar_type() == kLong,
"Expect dtype ", self.scalar_type(), "and torch.long, but got ", min.scalar_type(), "and", min_indices.scalar_type());
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND(ScalarType::Bool, self.scalar_type(), "min", [&] {
Reduction<scalar_t, int64_t>::apply(min, min_indices, self, dim, false);
});

View File

@ -69,7 +69,6 @@ void remainder_kernel_cuda(TensorIterator& iter) {
AT_DISPATCH_INTEGRAL_TYPES(iter.dtype(), "remainder_cuda", [&]() {
using thrust_t = typename ztype_cuda<scalar_t>::thrust_t;
gpu_kernel_with_scalars(iter, []GPU_LAMBDA(thrust_t a, thrust_t b) -> thrust_t {
CUDA_KERNEL_ASSERT(b != 0);
thrust_t r = a % b;
if ((r != 0) && ((r < 0) != (b < 0))) {
r += b;
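The kernel above implements floored (Python-style) modulo on top of C++'s truncated %: take r = a % b, and if r is non-zero with a sign different from b, add b so the result follows the divisor's sign. (The removed CUDA_KERNEL_ASSERT(b != 0) means an integral division by zero is no longer trapped inside the kernel itself.) A host-side sketch of the sign fix-up with a couple of spot checks:

    #include <cassert>
    #include <cstdint>

    // Floored modulo built from truncated %: the result takes the sign of the
    // divisor, matching Python's a % b.
    int64_t floored_remainder(int64_t a, int64_t b) {
      int64_t r = a % b;  // truncated remainder, sign follows a
      if (r != 0 && ((r < 0) != (b < 0))) {
        r += b;
      }
      return r;
    }

    int main() {
      assert(floored_remainder(-7, 3) == 2);   // plain % gives -1
      assert(floored_remainder(7, -3) == -2);  // plain % gives 1
      assert(floored_remainder(7, 3) == 1);
      return 0;
    }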

View File

@ -358,7 +358,7 @@ void max_pool2d_with_indices_out_cuda_template(
Tensor input = input_.contiguous(memory_format);
const int64_t in_stride_n = input.stride(-4);
const int64_t in_stride_n = input_.ndimension() == 4 ? input.stride(-4) : 0;
const int64_t in_stride_c = input.stride(-3);
const int64_t in_stride_h = input.stride(-2);
const int64_t in_stride_w = input.stride(-1);
@ -506,7 +506,7 @@ void max_pool2d_with_indices_backward_out_cuda_template(
const int64_t inputHeight = input.size(-2);
const int64_t inputWidth = input.size(-1);
const int64_t in_stride_n = input.stride(-4);
const int64_t in_stride_n = input.ndimension() == 4 ? input.stride(-4) : 0;
const int64_t in_stride_c = input.stride(-3);
const int64_t in_stride_h = input.stride(-2);
const int64_t in_stride_w = input.stride(-1);

View File

@ -54,7 +54,7 @@ __global__ void EmbeddingBag_updateOutputKernel(
scalar_t *weightFeat = weight + featureDim * weight_stride1;
int64_t begin = bag == 0 ? 0 : offsets[bag]; // forces first offset to be 0 instead of asserting on it
int64_t end = (bag < numBags - 1) ? (offsets[bag + 1]) : numIndices;
assert(end >= begin);
CUDA_KERNEL_ASSERT(end >= begin);
accscalar_t weightFeatSum = 0;
scalar_t weightFeatMax;

View File

@ -192,13 +192,13 @@ void index_put_accum_kernel(Tensor & self, TensorList indices, const Tensor & va
if (num_indices > 0 && sliceSize > 0) {
const bool permuted = !src.is_contiguous();
auto src_ = permuted ? src.contiguous() : src;
linearIndex = linearIndex.view(-1);
linearIndex = linearIndex.reshape(-1);
auto sorted_indices = at::empty_like(linearIndex, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
auto orig_indices = at::empty_like(linearIndex, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
using device_ptr = thrust::device_ptr<int64_t>;
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
linearIndex.div_(sliceSize);
linearIndex.floor_divide_(sliceSize);
{
sorted_indices.copy_(linearIndex);
auto allocator = THCThrustAllocator(globalContext().lazyInitCUDA());

View File

@ -35,13 +35,13 @@ __global__ void renormRowsL1(scalar_t* dist, long rows, long cols) {
scalar_t sum = static_cast<scalar_t>(0);
for (int64_t col = threadIdx.x; col < cols; col += blockDim.x) {
val = dist[row * cols + col];
CUDA_ALWAYS_ASSERT(!THCNumerics<scalar_t>::lt(val, zero)); // ! < 0 for NaN handling
CUDA_KERNEL_ASSERT(!THCNumerics<scalar_t>::lt(val, zero)); // ! < 0 for NaN handling
sum = sum + val;
}
sum = reduceBlock(smem, blockDim.x, sum, ReduceAdd<scalar_t>(), zero);
if (threadIdx.x == 0) {
CUDA_ALWAYS_ASSERT(!THCNumerics<scalar_t>::lt(val, zero)); // ! < 0 for NaN handling
CUDA_KERNEL_ASSERT(!THCNumerics<scalar_t>::lt(val, zero)); // ! < 0 for NaN handling
smem[0] = sum;
}
__syncthreads();
@ -61,7 +61,7 @@ void renormRows(Tensor& t) {
int64_t cols = t.size(1);
auto props = at::cuda::getCurrentDeviceProperties();
CUDA_ALWAYS_ASSERT(props != NULL);
CUDA_KERNEL_ASSERT(props != NULL);
int numSM = props->multiProcessorCount;
int maxThreads = props->maxThreadsPerBlock;
@ -84,7 +84,7 @@ __device__ int binarySearchForMultinomial(scalar_t* cumdist,
int start = 0;
int end = size;
// cumdist[size - 1] = 0 => all zero prob dist
CUDA_ALWAYS_ASSERT(cumdist[size - 1] > static_cast<scalar_t>(0));
CUDA_KERNEL_ASSERT(cumdist[size - 1] > static_cast<scalar_t>(0));
while (end - start > 0) {
int mid = start + (end - start) / 2;
@ -124,36 +124,33 @@ sampleMultinomialWithReplacement(std::pair<uint64_t, uint64_t> seeds,
// search due to divergence. It seems possible to compute multiple
// values and limit divergence though later on.
// global index formula for 1D grid of 2D blocks
int idx = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
// global index formula for 2D grid of 1D blocks
int idx = blockIdx.y * gridDim.x * blockDim.x + blockIdx.x * blockDim.x + threadIdx.x;
curandStatePhilox4_32_10_t state;
curand_init(seeds.first, idx, seeds.second, &state);
// The block determines the distribution for which we generate a point
for (int64_t curDist = blockIdx.x;
for (int64_t curDist = blockIdx.y;
curDist < distributions;
curDist += gridDim.x) {
for (int sampleBase = 0;
sampleBase < totalSamples; sampleBase += blockDim.y) {
// The warp determines the sample
int sample = sampleBase + threadIdx.y;
curDist += gridDim.y) {
for (int sample = blockIdx.x*blockDim.x + threadIdx.x;
sample < totalSamples; sample += blockDim.x*gridDim.x) {
// All threads participate in this
//we are losing 3 out of 4 generated numbers but it's ok
//this kernel is not very efficient anyway
auto rand = curand_uniform4(&state);
scalar_t r = static_cast<scalar_t>(rand.x);
if (threadIdx.x == 0 && sample < totalSamples) {
// Find the bucket that a uniform sample lies in
int choice = binarySearchForMultinomial<scalar_t>(
normDistPrefixSum + curDist * categories,
normDist + curDist * categories,
categories,
r);
// Find the bucket that a uniform sample lies in
int choice = binarySearchForMultinomial<scalar_t>(
normDistPrefixSum + curDist * categories,
normDist + curDist * categories,
categories,
r);
dest[curDist * totalSamples + sample] = choice;
// Torch indices are 1-based
dest[curDist * totalSamples + sample] = choice;
}
}
}
}
@ -180,17 +177,14 @@ sampleMultinomialWithoutReplacement(std::pair<uint64_t, uint64_t> seeds,
// The block and warp determines the distribution for which we
// generate a point
for (int64_t curDistBase = blockIdx.x * blockDim.y;
curDistBase < distributions;
curDistBase += gridDim.x * blockDim.y) {
// The warp determines the distribution
int64_t curDist = curDistBase + threadIdx.y;
for (int64_t curDist = blockIdx.x * blockDim.y + threadIdx.y;
curDist < distributions;
curDist += gridDim.x * blockDim.y) {
// All threads must participate in this
auto rand = curand_uniform4(&state);
scalar_t r = static_cast<scalar_t>(rand.x);
if (threadIdx.x == 0 && curDist < distributions) {
if (threadIdx.x == 0) {
// Find the bucket that a uniform sample lies in
int choice = binarySearchForMultinomial<scalar_t>(
normDistPrefixSum + curDist * categories,
@ -240,9 +234,9 @@ sampleMultinomialOnce(int64_t* dest,
scalar_t val;
for (int cat = threadIdx.x; cat < categories; cat += blockDim.x) {
val = dist[curDist * stride_dist + cat * stride_categories];
CUDA_ALWAYS_ASSERT(val >= zero);
CUDA_ALWAYS_ASSERT(!THCNumerics<scalar_t>::isinf(val));
CUDA_ALWAYS_ASSERT(!THCNumerics<scalar_t>::isnan(val));
CUDA_KERNEL_ASSERT(val >= zero);
CUDA_KERNEL_ASSERT(!THCNumerics<scalar_t>::isinf(val));
CUDA_KERNEL_ASSERT(!THCNumerics<scalar_t>::isnan(val));
sum = sum + static_cast<accscalar_t>(val);
}
@ -252,8 +246,8 @@ sampleMultinomialOnce(int64_t* dest,
// Broadcast sum and sample value
if (threadIdx.x == 0) {
// Make sure the sum of our distribution didn't overflow
CUDA_ALWAYS_ASSERT(!THCNumerics<accscalar_t>::isinf(sum));
CUDA_ALWAYS_ASSERT(sum > accZero);
CUDA_KERNEL_ASSERT(!THCNumerics<accscalar_t>::isinf(sum));
CUDA_KERNEL_ASSERT(sum > accZero);
asmem[0] = sum;
smem[0] = sampled[curDist];
@ -363,7 +357,7 @@ void multinomial_kernel_impl(Tensor& result, const Tensor& self, const int64_t n
AT_DISPATCH_FLOATING_TYPES_AND_HALF(self_v.scalar_type(), "multinomial_kernel_cuda", [&] {
using accscalar_t = at::acc_type<scalar_t, true>;
auto props = at::cuda::getCurrentDeviceProperties();
CUDA_ALWAYS_ASSERT(props != NULL);
CUDA_KERNEL_ASSERT(props != NULL);
int numSM = props->multiProcessorCount;
int maxThreads = props->maxThreadsPerBlock;
int maxShared = props->sharedMemPerBlock;
@ -415,26 +409,27 @@ void multinomial_kernel_impl(Tensor& result, const Tensor& self, const int64_t n
std::pair<uint64_t, uint64_t> rng_engine_inputs;
if (with_replacement) {
// Binary search is warp divergent (so effectively we're running
// with just a single thread), but for better utilization,
// we need each block to have at least 4 warps.
dim3 block(128);
// Each block will generate a sample from one
// distribution concurrently.
int grid_y=std::min<int>(numDist, at::cuda::getCurrentDeviceProperties()->maxGridSize[1]);
dim3 grid((n_sample-1)/block.x+1, grid_y);
{
// See Note [Acquire lock when using random generators]
std::lock_guard<std::mutex> lock(gen->mutex_);
// each thread will utilize one random, however, since we have to use
// each thread generates a single sample for (numdist/numblocks.y) distributions, however, since we have to use
// curand_uniform4 (See Note [Register spilling in curand call for CUDA < 10]),
// offset is 4.
rng_engine_inputs = gen->philox_engine_inputs(4);
// offset is 4 times that.
auto offset = ((numDist-1)/grid.y+1)*4;
rng_engine_inputs = gen->philox_engine_inputs(offset);
}
// Sample with replacement
// Binary search is warp divergent (so effectively we're running
// with just a single thread), but for better utilization,
// we need each block to have at least 4 warps.
dim3 block(32, 4);
// Each warp in a block will generate a sample from one
// distribution concurrently.
dim3 grid(numDist < MAX_NUM_BLOCKS ? numDist : MAX_NUM_BLOCKS);
sampleMultinomialWithReplacement
<<<grid, block, 0, at::cuda::getCurrentCUDAStream()>>>(
rng_engine_inputs,
@ -470,10 +465,11 @@ void multinomial_kernel_impl(Tensor& result, const Tensor& self, const int64_t n
// See Note [Acquire lock when using random generators]
std::lock_guard<std::mutex> lock(gen->mutex_);
// each thread will utilize one random, however, since we have to use
// each thread will utilize distributions/(gridDim.x*blockDim.y) randoms, however, since we have to use
// curand_uniform4 (See Note [Register spilling in curand call for CUDA < 10]),
// offset is 4.
rng_engine_inputs = gen->philox_engine_inputs(4);
// offset is 4 times that.
auto offset = ((numDist-1)/(grid.x*block.y)+1)*4;
rng_engine_inputs = gen->philox_engine_inputs(offset);
}
// The kernel can only draw one sample before we have to
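Both sampling kernels map a uniform draw to a category with binarySearchForMultinomial, which is effectively a lower_bound over the normalized inclusive prefix sums of the distribution: the chosen bucket is the first one whose prefix sum reaches the drawn value r. The restructuring above only changes which threads perform that search (a 2D grid of 1D blocks with one sample per thread, and a philox offset sized so each thread has enough random numbers); the search itself is unchanged. A host-side sketch with a made-up distribution, not the CUDA kernel:

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
      // Hypothetical categorical distribution over 4 buckets.
      std::vector<double> probs = {0.1, 0.4, 0.3, 0.2};
      std::vector<double> prefix(probs.size());
      std::partial_sum(probs.begin(), probs.end(), prefix.begin());  // 0.1 0.5 0.8 1.0

      std::mt19937 gen(0);
      std::uniform_real_distribution<double> uniform(0.0, 1.0);

      std::vector<int> counts(probs.size(), 0);
      for (int i = 0; i < 100000; ++i) {
        double r = uniform(gen);
        // First bucket whose inclusive prefix sum is >= r.
        auto it = std::lower_bound(prefix.begin(), prefix.end(), r);
        if (it == prefix.end()) --it;  // guard against round-off at the top end
        ++counts[it - prefix.begin()];
      }
      for (int c : counts) std::cout << c << " ";  // roughly 10k 40k 30k 20k
      std::cout << "\n";
      return 0;
    }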

View File

@ -431,13 +431,12 @@ __global__ void batch_norm_backward_reduce_kernel(
const GenericPackedTensorAccessor<input_scalar_t, 3, DefaultPtrTraits, index_t> grad_output,
GenericPackedTensorAccessor<stat_accscalar_t, 1, DefaultPtrTraits, index_t> mean,
GenericPackedTensorAccessor<stat_accscalar_t, 1, DefaultPtrTraits, index_t> invstd,
GenericPackedTensorAccessor<stat_accscalar_t, 1, DefaultPtrTraits, index_t> mean_dy,
GenericPackedTensorAccessor<stat_accscalar_t, 1, DefaultPtrTraits, index_t> mean_dy_xmu,
GenericPackedTensorAccessor<stat_accscalar_t, 1, DefaultPtrTraits, index_t> sum_dy,
GenericPackedTensorAccessor<stat_accscalar_t, 1, DefaultPtrTraits, index_t> sum_dy_xmu,
GenericPackedTensorAccessor<stat_scalar_t, 1, DefaultPtrTraits, index_t> grad_weight,
GenericPackedTensorAccessor<stat_scalar_t, 1, DefaultPtrTraits, index_t> grad_bias) {
index_t plane = blockIdx.x;
index_t N = input.size(0) * input.size(2);
stat_accscalar_t r_mean = mean[plane];
stat_accscalar_t factor = invstd[plane];
@ -446,7 +445,6 @@ __global__ void batch_norm_backward_reduce_kernel(
Float2<input_scalar_t, stat_accscalar_t> res = reduce<Float2<input_scalar_t, stat_accscalar_t>, GradOp<input_scalar_t, stat_accscalar_t,
GenericPackedTensorAccessor<input_scalar_t, 3, DefaultPtrTraits, index_t>>>(g, grad_output, plane);
stat_accscalar_t norm = stat_accscalar_t(1) / N;
if (threadIdx.x == 0) {
if (grad_weight.size(0) > 0) {
grad_weight[plane] = static_cast<stat_scalar_t>(res.v2 * factor);
@ -454,11 +452,11 @@ __global__ void batch_norm_backward_reduce_kernel(
if (grad_bias.size(0) > 0) {
grad_bias[plane] = static_cast<stat_scalar_t>(res.v1);
}
if (mean_dy.size(0) > 0) {
mean_dy[plane] = static_cast<stat_accscalar_t>(res.v1 * norm);
if (sum_dy.size(0) > 0) {
sum_dy[plane] = static_cast<stat_accscalar_t>(res.v1);
}
if (mean_dy_xmu.size(0) > 0) {
mean_dy_xmu[plane] = static_cast<stat_accscalar_t>(res.v2 * norm);
if (sum_dy_xmu.size(0) > 0) {
sum_dy_xmu[plane] = static_cast<stat_accscalar_t>(res.v2);
}
}
}
@ -740,16 +738,16 @@ std::tuple<Tensor, Tensor, Tensor, Tensor> batch_norm_backward_reduce_cuda_templ
using stat_accscalar_t = at::acc_type<stat_scalar_t, true>;
int64_t n_input = input_.size(1);
Tensor mean_dy_;
Tensor mean_dy_xmu_;
Tensor sum_dy_;
Tensor sum_dy_xmu_;
Tensor grad_weight_;
Tensor grad_bias_;
auto input_reshaped = input_.reshape({input_.size(0), input_.size(1), -1}); // internally we merge the feature dimensions
auto grad_output_reshaped = grad_out_.reshape(input_reshaped.sizes());
if (input_g) {
mean_dy_ = at::empty_like(mean_, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
mean_dy_xmu_ = at::empty_like(mean_, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
sum_dy_ = at::empty_like(mean_, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
sum_dy_xmu_ = at::empty_like(mean_, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
}
if (weight_g) {
grad_weight_ = at::empty({n_input}, weight_.options());
@ -764,8 +762,8 @@ std::tuple<Tensor, Tensor, Tensor, Tensor> batch_norm_backward_reduce_cuda_templ
auto grad_bias = packed_accessor_or_dummy<stat_scalar_t, 1, DefaultPtrTraits, index_t>(grad_bias_);
auto mean = packed_accessor_or_dummy<stat_accscalar_t, 1, DefaultPtrTraits, index_t>(mean_);
auto invstd = packed_accessor_or_dummy<stat_accscalar_t, 1, DefaultPtrTraits, index_t>(invstd_);
auto mean_dy = packed_accessor_or_dummy<stat_accscalar_t, 1, DefaultPtrTraits, index_t>(mean_dy_);
auto mean_dy_xmu = packed_accessor_or_dummy<stat_accscalar_t, 1, DefaultPtrTraits, index_t>(mean_dy_xmu_);
auto sum_dy = packed_accessor_or_dummy<stat_accscalar_t, 1, DefaultPtrTraits, index_t>(sum_dy_);
auto sum_dy_xmu = packed_accessor_or_dummy<stat_accscalar_t, 1, DefaultPtrTraits, index_t>(sum_dy_xmu_);
auto batch_size = input_reshaped.size(0);
auto feature_size = input_reshaped.size(2);
@ -778,10 +776,10 @@ std::tuple<Tensor, Tensor, Tensor, Tensor> batch_norm_backward_reduce_cuda_templ
const dim3 grid(n_input);
batch_norm_backward_reduce_kernel<input_scalar_t, stat_scalar_t, stat_accscalar_t, index_t> <<<grid, block, 0, stream>>>
(input, grad_output, mean, invstd, mean_dy, mean_dy_xmu, grad_weight, grad_bias);
(input, grad_output, mean, invstd, sum_dy, sum_dy_xmu, grad_weight, grad_bias);
AT_CUDA_CHECK(cudaGetLastError());
return std::make_tuple(mean_dy_, mean_dy_xmu_, grad_weight_, grad_bias_);
return std::make_tuple(sum_dy_, sum_dy_xmu_, grad_weight_, grad_bias_);
}
template<typename input_scalar_t, typename stat_scalar_t, typename index_t>
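The reduce kernel above now returns raw per-channel sums (sum_dy, sum_dy_xmu) instead of pre-dividing by N = input.size(0) * input.size(2): the norm = 1/N factor disappears and the outputs are renamed accordingly, so a caller that actually wants the means divides by its own element count. A trivial sketch of the relationship, with made-up numbers:

    #include <iostream>
    #include <vector>

    int main() {
      // Hypothetical per-channel gradients over N = 4 elements.
      std::vector<double> dy = {0.5, -1.0, 2.0, 0.5};
      double sum_dy = 0.0;
      for (double v : dy) sum_dy += v;

      const double N = dy.size();
      // The kernel used to hand back mean_dy = sum_dy / N; it now hands back
      // sum_dy and leaves the division to whoever consumes it.
      std::cout << "sum_dy = " << sum_dy << ", mean_dy = " << sum_dy / N << "\n";
      return 0;
    }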

View File

@ -65,7 +65,7 @@ struct TopKTypeConfig<int16_t> {
typedef uint32_t RadixType;
static inline __device__ RadixType convert(int16_t v) {
assert(sizeof(short) == 2);
static_assert(sizeof(short) == 2, "");
return 32768u + v;
}
@ -79,7 +79,7 @@ struct TopKTypeConfig<int32_t> {
typedef uint32_t RadixType;
static inline __device__ RadixType convert(int32_t v) {
assert(sizeof(int) == 4);
static_assert(sizeof(int) == 4, "");
return 2147483648u + v;
}
@ -93,7 +93,7 @@ struct TopKTypeConfig<int64_t> {
typedef uint64_t RadixType;
static inline __device__ RadixType convert(int64_t v) {
assert(sizeof(int64_t) == 8);
static_assert(sizeof(int64_t) == 8, "");
return 9223372036854775808ull + v;
}
@ -125,7 +125,7 @@ struct TopKTypeConfig<at::Half> {
static inline __device__ RadixType convert(at::Half v) {
#if defined(__CUDA_ARCH__) || defined(__HIP_PLATFORM_HCC__)
RadixType x = __half_as_ushort(v);
RadixType mask = -((x >> 15)) | 0x8000;
RadixType mask = (x & 0x00008000) ? 0x0000ffff : 0x00008000;
return (v == v) ? (x ^ mask) : 0xffff;
#else
assert(false);
@ -135,7 +135,7 @@ struct TopKTypeConfig<at::Half> {
static inline __device__ at::Half deconvert(RadixType v) {
#if defined(__CUDA_ARCH__) || defined(__HIP_PLATFORM_HCC__)
RadixType mask = ((v >> 15) - 1) | 0x8000;
RadixType mask = (v & 0x00008000) ? 0x00008000 : 0x0000ffff;
return __ushort_as_half(v ^ mask);
#else
assert(false);

View File

@ -44,6 +44,7 @@ Tensor& eye_out_cuda(Tensor& result, int64_t n, int64_t m) {
}
Tensor empty_cuda(IntArrayRef size, const TensorOptions& options, c10::optional<MemoryFormat> optional_memory_format) {
TORCH_CHECK(!isComplexType(at::typeMetaToScalarType(options.dtype())), "Complex dtype not supported.");
AT_ASSERT(options.device().type() == at::DeviceType::CUDA);
TORCH_INTERNAL_ASSERT(impl::variable_excluded_from_dispatch());
TORCH_CHECK(!options.pinned_memory(), "Only dense CPU tensors can be pinned");

View File

@ -238,18 +238,12 @@
- func: real(Tensor self) -> Tensor
use_c10_dispatcher: full
variants: function, method
supports_named_tensor: True
- func: real.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
variants: function
supports_named_tensor: True
- func: imag(Tensor self) -> Tensor
use_c10_dispatcher: full
variants: function, method
supports_named_tensor: True
- func: imag.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
variants: function
supports_named_tensor: True
- func: conj(Tensor self) -> Tensor

View File

@ -138,7 +138,7 @@ SparseTensor coalesce_sparse_cuda(const SparseTensor& self) {
// broadcasting logic; instead, it will blast the elements from one
// to the other so long as the numel is the same
indicesSlice.copy_(indices1D);
indices1D.div_(self.size(d));
indices1D.floor_divide_(self.size(d));
indicesSlice.add_(indices1D, -self.size(d));
}
}

View File

@ -423,6 +423,85 @@ class CAFFE2_API Tensor {
// ~~~~~ Autograd API ~~~~~
/// \fn bool is_leaf() const;
///
/// All Tensors that have `requires_grad()` which is ``false`` will be leaf Tensors by convention.
///
/// For Tensors that have `requires_grad()` which is ``true``, they will be leaf Tensors if they were
/// created by the user. This means that they are not the result of an operation and so
/// `grad_fn()` is `nullptr`.
///
/// Only leaf Tensors will have their `grad()` populated during a call to `backward()`.
/// To get `grad()` populated for non-leaf Tensors, you can use `retain_grad()`.
///
/// Example:
/// @code
/// auto a = torch::rand(10, torch::requires_grad());
/// std::cout << a.is_leaf() << std::endl; // prints `true`
///
/// auto b = torch::rand(10, torch::requires_grad()).to(torch::kCUDA);
/// std::cout << b.is_leaf() << std::endl; // prints `false`
/// // b was created by the operation that cast a cpu Tensor into a cuda Tensor
///
/// auto c = torch::rand(10, torch::requires_grad()) + 2;
/// std::cout << c.is_leaf() << std::endl; // prints `false`
/// // c was created by the addition operation
///
/// auto d = torch::rand(10).cuda();
/// std::cout << d.is_leaf() << std::endl; // prints `true`
/// // d does not require gradients and so has no operation creating it (that is tracked by the autograd engine)
///
/// auto e = torch::rand(10).cuda().requires_grad_();
/// std::cout << e.is_leaf() << std::endl; // prints `true`
/// // e requires gradients and has no operations creating it
///
/// auto f = torch::rand(10, torch::device(torch::kCUDA).requires_grad(true));
/// std::cout << f.is_leaf() << std::endl; // prints `true`
/// // f requires grad, has no operation creating it
/// @endcode
/// \fn void backward(const Tensor & gradient={}, bool keep_graph=false, bool create_graph=false) const;
///
/// Computes the gradient of current tensor with respect to graph leaves.
///
/// The graph is differentiated using the chain rule. If the tensor is
/// non-scalar (i.e. its data has more than one element) and requires
/// gradient, the function additionally requires specifying ``gradient``.
/// It should be a tensor of matching type and location, that contains
/// the gradient of the differentiated function w.r.t. this Tensor.
///
/// This function accumulates gradients in the leaves - you might need to
/// zero them before calling it.
///
/// \param gradient Gradient w.r.t. the
/// tensor. If it is a tensor, it will be automatically converted
/// to a Tensor that does not require grad unless ``create_graph`` is True.
/// None values can be specified for scalar Tensors or ones that
/// don't require grad. If a None value would be acceptable then
/// this argument is optional.
/// \param keep_graph If ``false``, the graph used to compute
/// the grads will be freed. Note that in nearly all cases setting
/// this option to True is not needed and often can be worked around
/// in a much more efficient way. Defaults to the value of
/// ``create_graph``.
/// \param create_graph If ``true``, graph of the derivative will
/// be constructed, allowing higher order derivative
/// products to be computed. Defaults to ``false``.
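As a companion to the ``backward()`` documentation above, a minimal sketch of calling it from C++ (shapes and values here are illustrative assumptions):

```cpp
// Minimal sketch: backward() on a non-scalar tensor needs an explicit gradient
// argument of matching shape; leaf gradients accumulate into grad().
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::rand({3}, torch::requires_grad());
  auto y = x * 2;                      // non-scalar result
  y.backward(torch::ones({3}));        // seed gradient d(loss)/d(y)
  std::cout << x.grad() << std::endl;  // a tensor of 2s
  return 0;
}
```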
/// \fn Tensor detach() const;
///
/// Returns a new Tensor, detached from the current graph.
/// The result will never require gradient.
/// \fn Tensor & detach_() const;
///
/// Detaches the Tensor from the graph that created it, making it a leaf.
/// Views cannot be detached in-place.
/// \fn void retain_grad() const;
///
/// Enables .grad() for non-leaf Tensors.
Tensor& set_requires_grad(bool requires_grad) {
impl_->set_requires_grad(requires_grad);
return *this;
@ -431,9 +510,16 @@ class CAFFE2_API Tensor {
return impl_->requires_grad();
}
/// Return a mutable reference to the gradient. This is conventionally
/// used as `t.grad() = x` to set a gradient to a completely new tensor.
Tensor& grad() {
return impl_->grad();
}
/// This function returns an undefined tensor by default and returns a defined tensor
/// the first time a call to `backward()` computes gradients for this Tensor.
/// The attribute will then contain the gradients computed and future calls
/// to `backward()` will accumulate (add) gradients into it.
const Tensor& grad() const {
return impl_->grad();
}
@ -505,11 +591,38 @@ class CAFFE2_API Tensor {
template <typename T>
using hook_return_var_t = std::enable_if_t<std::is_same<typename std::result_of<T&(Tensor)>::type, Tensor>::value, unsigned>;
// Returns the index of the hook in the list which can be used to remove hook
// Register a hook with no return value
/// Registers a backward hook.
///
/// The hook will be called every time a gradient with respect to the Tensor is computed.
/// The hook should have one of the following signatures:
/// ```
/// hook(Tensor grad) -> Tensor
/// ```
/// ```
/// hook(Tensor grad) -> void
/// ```
/// The hook should not modify its argument, but it can optionally return a new gradient
/// which will be used in place of `grad`.
///
/// This function returns the index of the hook in the list, which can be used to remove the hook.
///
/// Example:
/// @code
/// auto v = torch::tensor({0., 0., 0.}, torch::requires_grad());
/// auto h = v.register_hook([](torch::Tensor grad){ return grad * 2; }); // double the gradient
/// v.backward(torch::tensor({1., 2., 3.}));
/// // This prints:
/// // ```
/// // 2
/// // 4
/// // 6
/// // [ CPUFloatType{3} ]
/// // ```
/// std::cout << v.grad() << std::endl;
/// v.remove_hook(h); // removes the hook
/// @endcode
template <typename T>
hook_return_void_t<T> register_hook(T&& hook) const;
// Register a hook with variable return value
template <typename T>
hook_return_var_t<T> register_hook(T&& hook) const;
@ -518,7 +631,7 @@ private:
public:
// Remove hook at given position
/// Remove hook at given position
void remove_hook(unsigned pos) const;
// View Variables

View File

@ -9,7 +9,7 @@ set(extra_src)
# loop over all types
foreach(THC_TYPE Byte Char Short Int Long Half Float Double)
# loop over files which need to be split between types (because of long compile times)
foreach(THC_FILE TensorSort TensorMathPointwise TensorMathReduce TensorMasked)
foreach(THC_FILE TensorSort TensorMathPointwise TensorMathReduce TensorMasked TensorTopK)
if(NOT EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/generated/THC${THC_FILE}${THC_TYPE}.cu")
FILE(WRITE "${CMAKE_CURRENT_SOURCE_DIR}/generated/THC${THC_FILE}${THC_TYPE}.cu"
"#include <THC/THC${THC_FILE}.cuh>\n#include <THC/THCTensor.hpp>\n\n#include <THC/generic/THC${THC_FILE}.cu>\n#include <THC/THCGenerate${THC_TYPE}Type.h>\n")
@ -56,7 +56,6 @@ set(ATen_CUDA_SRCS ${ATen_CUDA_SRCS}
${CMAKE_CURRENT_SOURCE_DIR}/THCTensorIndex.cu
${CMAKE_CURRENT_SOURCE_DIR}/THCTensorRandom.cu
${CMAKE_CURRENT_SOURCE_DIR}/THCTensorScatterGather.cu
${CMAKE_CURRENT_SOURCE_DIR}/THCTensorTopK.cu
${CMAKE_CURRENT_SOURCE_DIR}/THCTensorSort.cu
${CMAKE_CURRENT_SOURCE_DIR}/THCSortUtils.cu
${CMAKE_CURRENT_SOURCE_DIR}/THCTensorMode.cu

View File

@ -73,7 +73,7 @@ TensorInfo<T, IndexType>::TensorInfo(T* p,
template <typename T, typename IndexType>
void
TensorInfo<T, IndexType>::reduceDim(int dim) {
assert(dim < dims && dim >= 0);
TORCH_INTERNAL_ASSERT(dim < dims && dim >= 0);
sizes[dim] = 1;
}
@ -81,7 +81,7 @@ template <typename T, typename IndexType>
int
TensorInfo<T, IndexType>::collapseDims(const int excludeDim) {
assert(excludeDim >= -1 && excludeDim < dims);
TORCH_INTERNAL_ASSERT(excludeDim >= -1 && excludeDim < dims);
int stopDim = (excludeDim == -1) ? dims : excludeDim;
int newIndex = -1;

View File

@ -1,19 +0,0 @@
#include <THC/THC.h>
#include <THC/THCReduceApplyUtils.cuh>
#include <THC/THCTensorCopy.h>
#include <THC/THCTensorMath.h>
#include <THC/THCAsmUtils.cuh>
#include <THC/THCScanUtils.cuh>
#include <THC/THCTensorTypeUtils.cuh>
#include <THC/THCTensorMathReduce.cuh>
#include <ATen/WrapDimUtils.h>
#include <algorithm> // for std::min
#if CUDA_VERSION >= 7000 || defined __HIP_PLATFORM_HCC__
#include <thrust/system/cuda/execution_policy.h>
#endif
#include <THC/THCTensorTopK.cuh>
#include <THC/generic/THCTensorTopK.cu>
#include <THC/THCGenerateAllTypes.h>

View File

@ -1,6 +1,21 @@
#ifndef THC_TENSOR_TOPK_CUH
#define THC_TENSOR_TOPK_CUH
#include <THC/THC.h>
#include <THC/THCReduceApplyUtils.cuh>
#include <THC/THCTensorCopy.h>
#include <THC/THCTensorMath.h>
#include <THC/THCAsmUtils.cuh>
#include <THC/THCScanUtils.cuh>
#include <THC/THCTensorTypeUtils.cuh>
#include <THC/THCTensorMathReduce.cuh>
#include <ATen/WrapDimUtils.h>
#include <algorithm> // for std::min
#if CUDA_VERSION >= 7000 || defined __HIP_PLATFORM_HCC__
#include <thrust/system/cuda/execution_policy.h>
#endif
#include <c10/macros/Macros.h>
#include <ATen/native/cuda/SortingRadixSelect.cuh>
@ -52,6 +67,7 @@ __global__ void gatherTopK(TensorInfo<T, IndexType> input,
inputSliceStart, outputSliceSize,
inputSliceSize, inputWithinSliceStride,
smem, &topKValue);
const auto topKConverted = at::native::TopKTypeConfig<T>::convert(topKValue);
// Every value that is strictly less/greater than `pattern`
// (depending on sort dir) in sorted int format is in the top-K.
@ -74,11 +90,12 @@ __global__ void gatherTopK(TensorInfo<T, IndexType> input,
bool inRange = (i < inputSliceSize);
T v =
inRange ? doLdg(&inputSliceStart[i * inputWithinSliceStride]) : ScalarConvert<int, T>::to(0);
const auto convertedV = at::native::TopKTypeConfig<T>::convert(v);
bool hasTopK;
if (Order) {
hasTopK = inRange && (THCNumerics<T>::gt(v, topKValue));
hasTopK = inRange && (convertedV > topKConverted);
} else {
hasTopK = inRange && (THCNumerics<T>::lt(v, topKValue));
hasTopK = inRange && (convertedV < topKConverted);
}
int index;
@ -111,7 +128,8 @@ __global__ void gatherTopK(TensorInfo<T, IndexType> input,
bool inRange = (i < inputSliceSize);
T v =
inRange ? doLdg(&inputSliceStart[i * inputWithinSliceStride]) : ScalarConvert<int, T>::to(0);
bool hasTopK = inRange && (THCNumerics<T>::eq(v, topKValue));
const auto convertedV = at::native::TopKTypeConfig<T>::convert(v);
bool hasTopK = inRange && (convertedV == topKConverted);
int index;
int carry;

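The comparison change above relies on ``TopKTypeConfig<T>::convert`` mapping values into an unsigned radix space that preserves ordering, so plain integer comparison of converted values matches the numeric comparison that ``THCNumerics`` used to perform. A small host-side sketch of that property for ``int16_t`` (the helper name is hypothetical):

```cpp
// Hypothetical stand-in for TopKTypeConfig<int16_t>::convert: adding 32768
// shifts the signed range into unsigned space while preserving order.
#include <cassert>
#include <cstdint>

static uint32_t convert_int16(int16_t v) { return 32768u + v; }

int main() {
  const int16_t a = -5, b = 3;
  assert((a < b) == (convert_int16(a) < convert_int16(b)));
  assert(convert_int16(INT16_MIN) == 0u);      // most negative maps to 0
  assert(convert_int16(INT16_MAX) == 65535u);  // most positive maps to max
  return 0;
}
```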
View File

@ -0,0 +1,5 @@
#include <THC/THCTensorTopK.cuh>
#include <THC/THCTensor.hpp>
#include <THC/generic/THCTensorTopK.cu>
#include <THC/THCGenerateByteType.h>

View File

@ -0,0 +1,5 @@
#include <THC/THCTensorTopK.cuh>
#include <THC/THCTensor.hpp>
#include <THC/generic/THCTensorTopK.cu>
#include <THC/THCGenerateCharType.h>

View File

@ -0,0 +1,5 @@
#include <THC/THCTensorTopK.cuh>
#include <THC/THCTensor.hpp>
#include <THC/generic/THCTensorTopK.cu>
#include <THC/THCGenerateDoubleType.h>

View File

@ -0,0 +1,5 @@
#include <THC/THCTensorTopK.cuh>
#include <THC/THCTensor.hpp>
#include <THC/generic/THCTensorTopK.cu>
#include <THC/THCGenerateFloatType.h>

View File

@ -0,0 +1,5 @@
#include <THC/THCTensorTopK.cuh>
#include <THC/THCTensor.hpp>
#include <THC/generic/THCTensorTopK.cu>
#include <THC/THCGenerateHalfType.h>

View File

@ -0,0 +1,5 @@
#include <THC/THCTensorTopK.cuh>
#include <THC/THCTensor.hpp>
#include <THC/generic/THCTensorTopK.cu>
#include <THC/THCGenerateIntType.h>

View File

@ -0,0 +1,5 @@
#include <THC/THCTensorTopK.cuh>
#include <THC/THCTensor.hpp>
#include <THC/generic/THCTensorTopK.cu>
#include <THC/THCGenerateLongType.h>

View File

@ -0,0 +1,5 @@
#include <THC/THCTensorTopK.cuh>
#include <THC/THCTensor.hpp>
#include <THC/generic/THCTensorTopK.cu>
#include <THC/THCGenerateShortType.h>

View File

@ -269,7 +269,7 @@ void THCTensor_(mode)(THCState *state,
break;
case 1:
default:
assert(false);
TORCH_INTERNAL_ASSERT(false);
}
THCudaCheck(cudaGetLastError());

View File

@ -101,7 +101,7 @@ void THCTensor_(sortKeyValueInplace)(THCState* state,
/* Nothing to do, data already sorted */ \
break; \
default: \
assert(false); \
TORCH_INTERNAL_ASSERT(false); \
} \
}

View File

@ -204,25 +204,29 @@ constexpr uint32_t CUDA_THREADS_PER_BLOCK_FALLBACK = 256;
#define __func__ __FUNCTION__
#endif
// CUDA_KERNEL_ASSERT is a macro that wraps an assert() call inside cuda
// kernels. This is not supported by Apple platforms so we special case it.
// See http://docs.nvidia.com/cuda/cuda-c-programming-guide/#assertion
#if defined(__APPLE__) || defined(__HIP_PLATFORM_HCC__)
#define CUDA_KERNEL_ASSERT(...)
#else // __APPLE__
#define CUDA_KERNEL_ASSERT(...) assert(__VA_ARGS__)
#endif // __APPLE__
// CUDA_ALWAYS_ASSERT is similar to CUDA_KERNEL_ASSERT but checks the assertion
// CUDA_KERNEL_ASSERT checks the assertion
// even when NDEBUG is defined. This is useful for important assertions in CUDA
// code that should run even when building Release.
#if defined(__APPLE__) || defined(__HIP_PLATFORM_HCC__)
// Those platforms do not support assert()
#define CUDA_ALWAYS_ASSERT(cond)
#define CUDA_KERNEL_ASSERT(cond)
#elif defined(_MSC_VER)
// TODO: This should be defined but I don't have the environment to properly
// test it. See e.g., https://github.com/pytorch/pytorch/pull/32719#discussion_r379918384
#define CUDA_ALWAYS_ASSERT(cond)
#if defined(NDEBUG)
extern "C" {
C10_IMPORT
#if defined(__CUDA_ARCH__) || defined(__HIP_ARCH__) || defined(__HIP__)
__host__ __device__
#endif // __CUDA_ARCH__
void _wassert(
wchar_t const* _Message,
wchar_t const* _File,
unsigned _Line);
}
#endif
#define CUDA_KERNEL_ASSERT(cond) \
if (C10_UNLIKELY(!(cond))) { \
(void)(_wassert(_CRT_WIDE(#cond), _CRT_WIDE(__FILE__), static_cast<unsigned>(__LINE__)), 0); \
}
#else // __APPLE__, _MSC_VER
#if defined(NDEBUG)
extern "C" {
@ -241,7 +245,7 @@ __host__ __device__
const char* function) throw();
}
#endif // NDEBUG
#define CUDA_ALWAYS_ASSERT(cond) \
#define CUDA_KERNEL_ASSERT(cond) \
if (C10_UNLIKELY(!(cond))) { \
__assert_fail(#cond, __FILE__, static_cast<unsigned int>(__LINE__), \
__func__); \

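The net effect of this hunk is that ``CUDA_KERNEL_ASSERT`` keeps firing even in ``NDEBUG`` builds, by calling the platform assert handler (``_wassert`` on MSVC, ``__assert_fail`` elsewhere) directly. A simplified host-only sketch of the same idea, given as an assumption for illustration rather than the actual c10 macro:

```cpp
// Simplified host-side sketch of an assert that survives NDEBUG, mirroring the
// intent of CUDA_KERNEL_ASSERT after this change (not the real c10 macro).
#include <cstdio>
#include <cstdlib>

#define ALWAYS_ASSERT(cond)                                          \
  do {                                                               \
    if (!(cond)) {                                                   \
      std::fprintf(stderr, "assertion failed: %s (%s:%d)\n", #cond,  \
                   __FILE__, __LINE__);                              \
      std::abort();                                                  \
    }                                                                \
  } while (0)

int main() {
  ALWAYS_ASSERT(2 + 2 == 4);  // passes even when NDEBUG is defined
  return 0;
}
```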
View File

@ -748,7 +748,7 @@ if (NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE)
target_include_directories(torch_cuda PUBLIC "${NVTOOLEXT_HOME}/include")
# -INCLUDE is used to ensure torch_cuda is linked against in a project that relies on it.
# Related issue: https://github.com/pytorch/pytorch/issues/31611
target_link_libraries(torch_cuda INTERFACE "-INCLUDE:\"?warp_size@cuda@at@@YAHXZ\"")
target_link_libraries(torch_cuda INTERFACE "-INCLUDE:?warp_size@cuda@at@@YAHXZ")
elseif(APPLE)
set(TORCH_CUDA_LIBRARIES
@ -949,6 +949,31 @@ if (USE_OPENMP AND OPENMP_FOUND)
target_link_libraries(torch_cpu PRIVATE ${OpenMP_CXX_LIBRARIES})
endif()
if ($ENV{TH_BINARY_BUILD})
if (NOT MSVC AND USE_CUDA AND NOT APPLE)
# Note [Extra MKL symbols for MAGMA in torch_cpu]
#
# When we build CUDA libraries and link against MAGMA, MAGMA makes use of
# some BLAS symbols in its CPU fallbacks when it has no GPU versions
# of kernels. Previously, we ensured the BLAS symbols were filled in by
# MKL by linking torch_cuda with BLAS, but when we are statically linking
# against MKL (when we do wheel builds), this actually ends up pulling in a
# decent chunk of MKL into torch_cuda, inflating our torch_cuda binary
# size by 8M. torch_cpu exposes most of the MKL symbols we need, but
# empirically we determined that there are four which it doesn't provide. If
# we link torch_cpu with these --undefined symbols, we can ensure they
# do get pulled in, and then we can avoid statically linking in MKL to
# torch_cuda at all!
#
# We aren't really optimizing for binary size on Windows (and this link
# line doesn't work on Windows), so don't do it there.
#
# These linker commands do not work on OS X, do not attempt this there.
# (It shouldn't matter anyway, though, because OS X has dropped CUDA support)
set_target_properties(torch_cpu PROPERTIES LINK_FLAGS "-Wl,--undefined=mkl_lapack_slaed0 -Wl,--undefined=mkl_lapack_dlaed0 -Wl,--undefined=mkl_lapack_dormql -Wl,--undefined=mkl_lapack_sormql")
endif()
endif()
target_link_libraries(torch_cpu PUBLIC c10)
target_link_libraries(torch_cpu PUBLIC ${Caffe2_PUBLIC_DEPENDENCY_LIBS})
target_link_libraries(torch_cpu PRIVATE ${Caffe2_DEPENDENCY_LIBS})

View File

@ -261,15 +261,6 @@ CAFFE2_CUDA_API const char* curandGetErrorString(curandStatus_t error);
for (size_t j = blockIdx.y * blockDim.y + threadIdx.y; j < (m); \
j += blockDim.y * gridDim.y)
// CUDA_KERNEL_ASSERT is a macro that wraps an assert() call inside cuda
// kernels. This is not supported by Apple platforms so we special case it.
// See http://docs.nvidia.com/cuda/cuda-c-programming-guide/#assertion
#if defined(__APPLE__) || defined(__HIP_PLATFORM_HCC__)
#define CUDA_KERNEL_ASSERT(...)
#else // __APPLE__
#define CUDA_KERNEL_ASSERT(...) assert(__VA_ARGS__)
#endif // __APPLE__
// The following helper functions are here so that you can write a kernel call
// when you are not particularly interested in maxing out the kernels'
// performance. Usually, this will give you a reasonable speed, but if you

View File

@ -1,6 +1,8 @@
#include "caffe2/operators/fused_rowwise_nbitfake_conversion_ops.h"
#include <fp16.h>
#ifdef __AVX__
#include <immintrin.h>
#endif
#include "c10/util/Registry.h"
namespace caffe2 {

View File

@ -50,8 +50,13 @@ __global__ void ReluCUDAKernel<half2>(const int N, const half2* X, half2* Y) {
Y[i] = __hmul2(__hgt2(__ldg(X + i), kZero), __ldg(X + i));
#else
const float2 xx = __half22float2(X[i]);
Y[i] =
__floats2half2_rn(xx.x > 0.0f ? xx.x : 0.0f, xx.y > 0.0f ? xx.y : 0.0f);
// There are explicit cast to float here, because it may otherwise cause ambiguity on ROCm and can be triggered
// sometimes:
//
// error: conditional expression is ambiguous; 'const hip_impl::Scalar_accessor<float, Native_vec_, 0>' can be
// converted to 'float' and vice versa
Y[i] = __floats2half2_rn(xx.x > 0.0f ? static_cast<float>(xx.x) : 0.0f,
xx.y > 0.0f ? static_cast<float>(xx.y) : 0.0f);
#endif
}
}
@ -100,8 +105,14 @@ __global__ void ReluGradientCUDAKernel<half2>(
#else
const float2 dy = __half22float2(dY[i]);
const float2 yy = __half22float2(Y[i]);
dX[i] =
__floats2half2_rn(yy.x > 0.0f ? dy.x : 0.0f, yy.y > 0.0f ? dy.y : 0.0f);
// There are explicit cast to float here, because it may otherwise cause ambiguity on ROCm and can be triggered
// sometimes:
//
// error: conditional expression is ambiguous; 'const hip_impl::Scalar_accessor<float, Native_vec_, 1>' can be
// converted to 'float' and vice versa
dX[i] = __floats2half2_rn(yy.x > 0.0f ? static_cast<float>(dy.x) : 0.0f,
yy.y > 0.0f ? static_cast<float>(dy.y) : 0.0f);
#endif
}
}

View File

@ -76,7 +76,7 @@ struct TopKTypeConfig<short> {
typedef unsigned int RadixType;
static inline __device__ RadixType convert(short v) {
CUDA_KERNEL_ASSERT(sizeof(short) == 2);
static_assert(sizeof(short) == 2, "");
return 32768u + v;
}
@ -90,7 +90,7 @@ struct TopKTypeConfig<int> {
typedef unsigned int RadixType;
static inline __device__ RadixType convert(int v) {
CUDA_KERNEL_ASSERT(sizeof(int) == 4);
static_assert(sizeof(int) == 4, "");
return 2147483648u + v;
}
@ -104,6 +104,7 @@ struct TopKTypeConfig<long> {
typedef unsigned long long int RadixType;
static inline __device__ RadixType convert(long v) {
//static_assert fails on windows, so leave it as CUDA_KERNEL_ASSERT
CUDA_KERNEL_ASSERT(sizeof(long) == 8);
return 9223372036854775808ull + v;
}

View File

@ -15,6 +15,7 @@ if (NOT __NCCL_INCLUDED)
# this second replacement is needed when there are multiple archs
string(REPLACE ";-gencode" " -gencode" NVCC_GENCODE "${NVCC_GENCODE}")
set(__NCCL_BUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}/nccl")
ExternalProject_Add(nccl_external
SOURCE_DIR ${PROJECT_SOURCE_DIR}/third_party/nccl/nccl
BUILD_IN_SOURCE 1
@ -30,20 +31,49 @@ if (NOT __NCCL_INCLUDED)
"CUDA_HOME=${CUDA_TOOLKIT_ROOT_DIR}"
"NVCC=${CUDA_NVCC_EXECUTABLE}"
"NVCC_GENCODE=${NVCC_GENCODE}"
"BUILDDIR=${CMAKE_CURRENT_BINARY_DIR}/nccl"
"BUILDDIR=${__NCCL_BUILD_DIR}"
"VERBOSE=0"
"-j"
BUILD_BYPRODUCTS "${CMAKE_CURRENT_BINARY_DIR}/nccl/lib/libnccl_static.a"
BUILD_BYPRODUCTS "${__NCCL_BUILD_DIR}/lib/libnccl_static.a"
INSTALL_COMMAND ""
)
# Detect objcopy version
execute_process (COMMAND "${CMAKE_OBJCOPY}" "--version" OUTPUT_VARIABLE OBJCOPY_VERSION_STR)
string(REGEX REPLACE "GNU objcopy version ([0-9])\\.([0-9]+).*" "\\1" OBJCOPY_VERSION_MAJOR ${OBJCOPY_VERSION_STR})
string(REGEX REPLACE "GNU objcopy version ([0-9])\\.([0-9]+).*" "\\2" OBJCOPY_VERSION_MINOR ${OBJCOPY_VERSION_STR})
if ((${OBJCOPY_VERSION_MAJOR} GREATER 2) OR ((${OBJCOPY_VERSION_MAJOR} EQUAL 2) AND (${OBJCOPY_VERSION_MINOR} GREATER 27)))
message(WARNING "Enabling NCCL library slimming")
add_custom_command(
OUTPUT "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a"
DEPENDS nccl_external
COMMAND "${CMAKE_COMMAND}" -E make_directory "${__NCCL_BUILD_DIR}/objects"
COMMAND cd objects
COMMAND "${CMAKE_AR}" x "${__NCCL_BUILD_DIR}/lib/libnccl_static.a"
COMMAND for obj in all_gather_* all_reduce_* broadcast_* reduce_*.o$<SEMICOLON> do "${CMAKE_OBJCOPY}" --remove-relocations .nvFatBinSegment --remove-section __nv_relfatbin $$obj$<SEMICOLON> done
COMMAND "${CMAKE_AR}" cr "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a" "*.o"
COMMAND cd -
COMMAND "${CMAKE_COMMAND}" -E remove_directory "${__NCCL_BUILD_DIR}/objects"
WORKING_DIRECTORY "${__NCCL_BUILD_DIR}"
COMMENT "Slimming NCCL"
)
add_custom_target(nccl_slim_external DEPENDS "${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a")
set(__NCCL_LIBRARY_DEP nccl_slim_external)
set(NCCL_LIBRARIES ${__NCCL_BUILD_DIR}/lib/libnccl_slim_static.a)
else()
message(WARNING "Objcopy version is too old to support NCCL library slimming")
set(__NCCL_LIBRARY_DEP nccl_external)
set(NCCL_LIBRARIES ${__NCCL_BUILD_DIR}/lib/libnccl_static.a)
endif()
set(NCCL_FOUND TRUE)
add_library(__caffe2_nccl INTERFACE)
# The following old-style variables are set so that other libs, such as Gloo,
# can still use it.
set(NCCL_INCLUDE_DIRS ${CMAKE_CURRENT_BINARY_DIR}/nccl/include)
set(NCCL_LIBRARIES ${CMAKE_CURRENT_BINARY_DIR}/nccl/lib/libnccl_static.a)
add_dependencies(__caffe2_nccl nccl_external)
set(NCCL_INCLUDE_DIRS ${__NCCL_BUILD_DIR}/include)
add_dependencies(__caffe2_nccl ${__NCCL_LIBRARY_DEP})
target_link_libraries(__caffe2_nccl INTERFACE ${NCCL_LIBRARIES})
target_include_directories(__caffe2_nccl INTERFACE ${NCCL_INCLUDE_DIRS})
endif()

View File

@ -56,6 +56,10 @@ INPUT = ../../../aten/src/ATen/ATen.h \
../../../c10/cuda/CUDAStream.h \
../../../torch/csrc/api/include \
../../../torch/csrc/api/src \
../../../torch/csrc/autograd/autograd.h \
../../../torch/csrc/autograd/custom_function.h \
../../../torch/csrc/autograd/function.h \
../../../torch/csrc/autograd/variable.h \
../../../torch/csrc/autograd/generated/variable_factories.h \
../../../torch/csrc/jit/runtime/custom_operator.h \
../../../torch/csrc/jit/serialization/import.h \

View File

@ -281,7 +281,9 @@ change one property, this is quite practical.
In conclusion, we can now compare how ``TensorOptions`` defaults, together with
the abbreviated API for creating ``TensorOptions`` using free functions, allow
tensor creation in C++ with the same convenience as in Python. Compare this
call in Python::
call in Python:
.. code-block:: python
torch.randn(3, 4, dtype=torch.float32, device=torch.device('cuda', 1), requires_grad=True)
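For reference, the C++ counterpart that this note is building toward looks roughly like the following sketch using the free-function ``TensorOptions`` API (device index and shape mirror the Python call above):

```cpp
// Sketch of the C++ equivalent of the Python call above, built from the
// abbreviated TensorOptions free functions.
#include <torch/torch.h>

int main() {
  auto t = torch::randn({3, 4},
                        torch::dtype(torch::kFloat32)
                            .device(torch::kCUDA, 1)
                            .requires_grad(true));
  return 0;
}
```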

View File

@ -0,0 +1,99 @@
Tensor Indexing API
===================
Indexing a tensor in the PyTorch C++ API works very similarly to the Python API.
All index types such as ``None`` / ``...`` / integer / boolean / slice / tensor
are available in the C++ API, making translation from Python indexing code to C++
very simple. The main difference is that, instead of using the ``[]``-operator
similar to the Python API syntax, in the C++ API the indexing methods are:
- ``torch::Tensor::index`` (`link <https://pytorch.org/cppdocs/api/classat_1_1_tensor.html#_CPPv4NK2at6Tensor5indexE8ArrayRefIN2at8indexing11TensorIndexEE>`_)
- ``torch::Tensor::index_put_`` (`link <https://pytorch.org/cppdocs/api/classat_1_1_tensor.html#_CPPv4N2at6Tensor10index_put_E8ArrayRefIN2at8indexing11TensorIndexEERK6Tensor>`_)
It's also important to note that index types such as ``None`` / ``Ellipsis`` / ``Slice``
live in the ``torch::indexing`` namespace, and it's recommended to put ``using namespace torch::indexing``
before any indexing code for convenient use of those index types.
Here are some examples of translating Python indexing code to C++:
Getter
------
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| Python | C++ (assuming ``using namespace torch::indexing``) |
+==========================================================+======================================================================================+
| ``tensor[None]`` | ``tensor.index({None})`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[Ellipsis, ...]`` | ``tensor.index({Ellipsis, "..."})`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[1, 2]`` | ``tensor.index({1, 2})`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[True, False]`` | ``tensor.index({true, false})`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[1::2]`` | ``tensor.index({Slice(1, None, 2)})`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[torch.tensor([1, 2])]`` | ``tensor.index({torch::tensor({1, 2})})`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[..., 0, True, 1::2, torch.tensor([1, 2])]`` | ``tensor.index({"...", 0, true, Slice(1, None, 2), torch::tensor({1, 2})})`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
Setter
------
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| Python | C++ (assuming ``using namespace torch::indexing``) |
+==========================================================+======================================================================================+
| ``tensor[None] = 1`` | ``tensor.index_put_({None}, 1)`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[Ellipsis, ...] = 1`` | ``tensor.index_put_({Ellipsis, "..."}, 1)`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[1, 2] = 1`` | ``tensor.index_put_({1, 2}, 1)`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[True, False] = 1`` | ``tensor.index_put_({true, false}, 1)`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[1::2] = 1`` | ``tensor.index_put_({Slice(1, None, 2)}, 1)`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[torch.tensor([1, 2])] = 1`` | ``tensor.index_put_({torch::tensor({1, 2})}, 1)`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
| ``tensor[..., 0, True, 1::2, torch.tensor([1, 2])] = 1`` | ``tensor.index_put_({"...", 0, true, Slice(1, None, 2), torch::tensor({1, 2})}, 1)`` |
+----------------------------------------------------------+--------------------------------------------------------------------------------------+
Translating between Python/C++ index types
------------------------------------------
The one-to-one translation between Python and C++ index types is as follows:
+-------------------------+------------------------------------------------------------------------+
| Python | C++ (assuming ``using namespace torch::indexing``) |
+=========================+========================================================================+
| ``None`` | ``None`` |
+-------------------------+------------------------------------------------------------------------+
| ``Ellipsis`` | ``Ellipsis`` |
+-------------------------+------------------------------------------------------------------------+
| ``...`` | ``"..."`` |
+-------------------------+------------------------------------------------------------------------+
| ``123`` | ``123`` |
+-------------------------+------------------------------------------------------------------------+
| ``True`` | ``true`` |
+-------------------------+------------------------------------------------------------------------+
| ``False`` | ``false`` |
+-------------------------+------------------------------------------------------------------------+
| ``:`` or ``::`` | ``Slice()`` or ``Slice(None, None)`` or ``Slice(None, None, None)`` |
+-------------------------+------------------------------------------------------------------------+
| ``1:`` or ``1::`` | ``Slice(1, None)`` or ``Slice(1, None, None)`` |
+-------------------------+------------------------------------------------------------------------+
| ``:3`` or ``:3:`` | ``Slice(None, 3)`` or ``Slice(None, 3, None)`` |
+-------------------------+------------------------------------------------------------------------+
| ``::2`` | ``Slice(None, None, 2)`` |
+-------------------------+------------------------------------------------------------------------+
| ``1:3`` | ``Slice(1, 3)`` |
+-------------------------+------------------------------------------------------------------------+
| ``1::2`` | ``Slice(1, None, 2)`` |
+-------------------------+------------------------------------------------------------------------+
| ``:3:2`` | ``Slice(None, 3, 2)`` |
+-------------------------+------------------------------------------------------------------------+
| ``1:3:2`` | ``Slice(1, 3, 2)`` |
+-------------------------+------------------------------------------------------------------------+
| ``torch.tensor([1, 2])``| ``torch::tensor({1, 2})`` |
+-------------------------+------------------------------------------------------------------------+
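Putting a few of the rows above together, a short end-to-end sketch of the getter/setter API (the tensor contents are illustrative):

```cpp
// Small sketch combining the indexing translations listed above.
#include <torch/torch.h>
using namespace torch::indexing;

int main() {
  auto t = torch::arange(12).reshape({3, 4});
  auto row = t.index({1, Slice()});                // Python: t[1, :]
  t.index_put_({Slice(), 0}, -1);                  // Python: t[:, 0] = -1
  auto picked = t.index({torch::tensor({0, 2})});  // Python: t[torch.tensor([0, 2])]
  return 0;
}
```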

View File

@ -1,4 +1,4 @@
sphinx
sphinx==2.4.4
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
sphinxcontrib.katex
matplotlib

View File

@ -13,6 +13,13 @@ use ``torch.float16`` (``half``). Some operations, like linear layers and convol
are much faster in ``float16``. Other operations, like reductions, often require the dynamic
range of ``float32``. Networks running in mixed precision try to match each operation to its appropriate datatype.
.. warning::
:class:`torch.cuda.amp.GradScaler` is not a complete implementation of automatic mixed precision.
:class:`GradScaler` is only useful if you manually run regions of your model in ``float16``.
If you aren't sure how to choose op precision manually, the master branch and nightly pip/conda
builds include a context manager that chooses op precision automatically wherever it's enabled.
See the `master documentation <https://pytorch.org/docs/master/amp.html>`_ for details.
.. contents:: :local:
.. _gradient-scaling:

View File

@ -95,14 +95,6 @@ MKLDNN
- Junjie Bai (`bddppq <https://github.com/bddppq>`__)
- Yinghai Lu (`yinghai <https://github.com/yinghai>`__)
XLA
~~~
- Ailing Zhang (`ailzhang <https://github.com/ailzhang>`__)
- Gregory Chanan (`gchanan <https://github.com/gchanan>`__)
- Davide Libenzi (`dlibenzi <https://github.com/dlibenzi>`__)
- Alex Suhan (`asuhan <https://github.com/asuhan>`__)
AMD/ROCm/HIP
~~~~~~~~~~~~
@ -151,3 +143,40 @@ PowerPC
~~~~~~~
- Alfredo Mendoza (`avmgithub <https://github.com/avmgithub>`__)
Library-level maintainers
------------------------
XLA
~~~
- Ailing Zhang (`ailzhang <https://github.com/ailzhang>`__)
- Gregory Chanan (`gchanan <https://github.com/gchanan>`__)
- Davide Libenzi (`dlibenzi <https://github.com/dlibenzi>`__)
- Alex Suhan (`asuhan <https://github.com/asuhan>`__)
TorchServe
~~~~~~~~~~
- Manoj Rao (`mycpuorg <https://github.com/mycpuorg>`__)
- Vamshi Dantu (`vdantu <https://github.com/vdantu>`__)
- Dhanasekar Karuppasamy (`dhanainme <https://github.com/dhanainme>`__)
TorchVision
~~~~~~~~~~~
- Francisco Massa (`fmassa <https://github.com/fmassa>`__)
TorchText
~~~~~~~~~
- Guanheng George Zhang (`zhangguanheng66 <https://github.com/zhangguanheng66>`__)
- Christian Puhrsch (`cpuhrsch <https://github.com/cpuhrsch>`__)
TorchAudio
~~~~~~~~~~
- Vincent QB (`vincentqb <https://github.com/vincentqb>`__)

docs/source/cpp_index.rst (new file, 43 lines)
View File

@ -0,0 +1,43 @@
C++
===================================
.. Note::
If you are looking for the PyTorch C++ API docs, go directly `here <https://pytorch.org/cppdocs/>`__.
PyTorch provides several features for working with C++, and it's best to choose from them based on your needs. At a high level, the following support is available:
TorchScript C++ API
--------------------
`TorchScript <https://pytorch.org/docs/stable/jit.html>`__ allows PyTorch models defined in Python to be serialized and then loaded and run in C++, capturing the model code via compilation or tracing its execution. You can learn more in the `Loading a TorchScript Model in C++ tutorial <https://pytorch.org/tutorials/advanced/cpp_export.html>`__. This means you can define your models in Python as much as possible, but subsequently export them via TorchScript for doing no-Python execution in production or embedded environments. The TorchScript C++ API is used to interact with these models and the TorchScript execution engine, including the following (a minimal loading sketch comes after the list):
* Loading serialized TorchScript models saved from Python
* Doing simple model modifications if needed (e.g. pulling out submodules)
* Constructing the input and doing preprocessing using C++ Tensor API
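A minimal loading sketch for the workflow in the list above (the file name and input shape are placeholders, not taken from the document):

```cpp
// Minimal sketch: load a serialized TorchScript module and run it from C++.
#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
  torch::jit::script::Module module = torch::jit::load("model.pt");  // placeholder path
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 3, 224, 224}));  // placeholder input shape
  at::Tensor output = module.forward(inputs).toTensor();
  std::cout << output.sizes() << std::endl;
  return 0;
}
```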
Extending PyTorch and TorchScript with C++ Extensions
------------------------------------------------------
TorchScript can be augmented with user-supplied code through custom operators and custom classes.
Once registered with TorchScript, these operators and classes can be invoked in TorchScript code run from
Python or from C++ as part of a serialized TorchScript model. The `Extending TorchScript with Custom C++ Operators <https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html>`__ tutorial walks through interfacing TorchScript with OpenCV. In addition to wrapping a function call with a custom operator, C++ classes and structs can be bound into TorchScript through a pybind11-like interface which is explained in the `Extending TorchScript with Custom C++ Classes <https://pytorch.org/tutorials/advanced/torch_script_custom_classes.html>`__ tutorial.
Tensor and Autograd in C++
---------------------------
Most of the tensor and autograd operations in PyTorch Python API are also available in the C++ API. These include:
* ``torch::Tensor`` methods such as ``add`` / ``reshape`` / ``clone``. For the full list of methods available, please see: https://pytorch.org/cppdocs/api/classat_1_1_tensor.html
* C++ tensor indexing API that looks and behaves the same as the Python API. For details on its usage, please see: https://pytorch.org/cppdocs/notes/tensor_indexing.html
* The tensor autograd APIs and the ``torch::autograd`` package that are crucial for building dynamic neural networks in C++ frontend. For more details, please see: https://pytorch.org/tutorials/advanced/cpp_autograd.html
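As a small illustration of the autograd bullet above, a sketch using the ``torch::autograd`` package (the values are illustrative):

```cpp
// Sketch: computing gradients with torch::autograd::grad from the C++ frontend.
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::ones({2, 2}, torch::requires_grad());
  auto y = (x * x).sum();
  auto grads = torch::autograd::grad({y}, {x});  // d(y)/d(x)
  std::cout << grads[0] << std::endl;            // a tensor of 2s
  return 0;
}
```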
Authoring Models in C++
------------------------
The "author in TorchScript, infer in C++" workflow requires model authoring to be done in TorchScript.
However, there might be cases where the model has to be authored in C++ (e.g. in workflows where a Python
component is undesirable). To serve such use cases, we provide the full capability of authoring and training a neural net model purely in C++, with familiar components such as ``torch::nn`` / ``torch::nn::functional`` / ``torch::optim`` that closely resemble the Python API.
* For an overview of the PyTorch C++ model authoring and training API, please see: https://pytorch.org/cppdocs/frontend.html
* For a detailed tutorial on how to use the API, please see: https://pytorch.org/tutorials/advanced/cpp_frontend.html
* Docs for components such as ``torch::nn`` / ``torch::nn::functional`` / ``torch::optim`` can be found at: https://pytorch.org/cppdocs/api/library_root.html
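To make the authoring workflow above concrete, a tiny sketch of defining a model and taking one optimization step purely in C++ (layer sizes and data are made up):

```cpp
// Sketch: one training step with torch::nn and torch::optim in C++.
#include <torch/torch.h>

int main() {
  torch::nn::Linear model(4, 1);
  torch::optim::SGD optimizer(model->parameters(), /*lr=*/0.1);

  auto input = torch::rand({8, 4});
  auto target = torch::rand({8, 1});

  optimizer.zero_grad();
  auto loss = torch::mse_loss(model(input), target);
  loss.backward();
  optimizer.step();
  return 0;
}
```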
Packaging for C++
------------------
For guidance on how to install and link with libtorch (the library that contains all of the above C++ APIs), please see: https://pytorch.org/cppdocs/installing.html. Note that on Linux there are two types of libtorch binaries provided: one compiled with GCC pre-cxx11 ABI and the other with GCC cxx11 ABI, and you should make the selection based on the GCC ABI your system is using.

View File

@ -16,14 +16,13 @@ PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
:caption: Notes
notes/*
PyTorch on XLA Devices <http://pytorch.org/xla/>
.. toctree::
:maxdepth: 1
:caption: Language Bindings
C++ API <https://pytorch.org/cppdocs/>
packages
cpp_index
Javadoc <https://pytorch.org/javadoc/>
.. toctree::
:maxdepth: 1
@ -46,7 +45,7 @@ PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
onnx
optim
quantization
rpc
rpc/index.rst
torch.random <random>
sparse
storage
@ -62,24 +61,16 @@ PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
name_inference
torch.__config__ <__config__>
.. toctree::
:glob:
:maxdepth: 2
:caption: torchvision Reference
torchvision/index
.. toctree::
:maxdepth: 1
:caption: torchaudio Reference
:caption: Libraries
torchaudio <https://pytorch.org/audio>
.. toctree::
:maxdepth: 1
:caption: torchtext Reference
torchtext <https://pytorch.org/text>
torchvision/index
TorchElastic <https://pytorch.org/elastic/>
TorchServe <https://pytorch.org/serve>
PyTorch on XLA Devices <http://pytorch.org/xla/>
.. toctree::
:glob:

View File

@ -790,21 +790,6 @@ New API:
m = torch.jit.script(MyModule())
Python 2
""""""""
If you are stuck on Python 2 and cannot use the class annotation syntax, you can use the ``__annotations__`` class member to directly apply type annotations.
.. testcode::
from typing import Dict
class MyModule(torch.jit.ScriptModule):
__annotations__ = {'my_dict': Dict[str, int]}
def __init__(self):
super(MyModule, self).__init__()
self.my_dict = {}
self.my_int = 20
Constants
^^^^^^^^^

View File

@ -185,13 +185,10 @@ MyPy-style type annotations using the types listed above.
...
In our examples, we use comment-based type hints to ensure Python 2
compatibility as well.
An empty list is assumed to be ``List[Tensor]`` and empty dicts
``Dict[str, Tensor]``. To instantiate an empty list or dict of other types,
use `Python 3 type hints`_. If you are on Python 2, you can use ``torch.jit.annotate``.
use `Python 3 type hints`_.
Example (type annotations for Python 3):
@ -217,31 +214,6 @@ Example (type annotations for Python 3):
x = torch.jit.script(EmptyDataStructures())
Example (``torch.jit.annotate`` for Python 2):
.. testcode::
import torch
import torch.nn as nn
from typing import Dict, List, Tuple
class EmptyDataStructures(torch.nn.Module):
def __init__(self):
super(EmptyDataStructures, self).__init__()
def forward(self, x):
# type: (Tensor) -> Tuple[List[Tuple[int, float]], Dict[str, int]]
# This annotates the list to be a `List[Tuple[int, float]]`
my_list = torch.jit.annotate(List[Tuple[int, float]], [])
for i in range(10):
my_list.append((i, float(x.item())))
my_dict = torch.jit.annotate(Dict[str, int], {})
return my_list, my_dict
x = torch.jit.script(EmptyDataStructures())
Optional Type Refinement
@ -856,28 +828,8 @@ Supported constant Python types are
* tuples containing supported types
* ``torch.nn.ModuleList`` which can be used in a TorchScript for loop
.. note::
If you are on Python 2, you can mark an attribute as a constant by adding
its name to the ``__constants__`` property of the class:
.. testcode::
import torch
import torch.nn as nn
class Foo(nn.Module):
__constants__ = ['a']
def __init__(self):
super(Foo, self).__init__()
self.a = 1 + 4
def forward(self, input):
return self.a + input
f = torch.jit.script(Foo())
|
.. _module attributes:
@ -924,32 +876,3 @@ Example:
f = torch.jit.script(Foo({'hi': 2}))
.. note::
If you are on Python 2, you can mark an attribute's type by adding it to
the ``__annotations__`` class property as a dictionary of attribute name to
type
.. testcode::
from typing import List, Dict
class Foo(nn.Module):
__annotations__ = {'words': List[str], 'some_dict': Dict[str, int]}
def __init__(self, a_dict):
super(Foo, self).__init__()
self.words = []
self.some_dict = a_dict
# `int`s can be inferred
self.my_int = 10
def forward(self, input):
# type: (str) -> int
self.words.append(input)
return self.some_dict[input] + self.my_int
f = torch.jit.script(Foo({'hi': 2}))
|

View File

@ -25,7 +25,6 @@ are not bound on `torch` or because Python expects a different schema than
TorchScript.
* :func:`torch.tensordot`
* :func:`torch.unique`
* :func:`torch.unique_consecutive`
* :func:`torch.nn.init.calculate_gain`
* :func:`torch.nn.init.eye_`

View File

@ -30,9 +30,7 @@ Sharing CUDA tensors
--------------------
Sharing CUDA tensors between processes is supported only in Python 3, using
a ``spawn`` or ``forkserver`` start methods. :mod:`python:multiprocessing` in
Python 2 can only create subprocesses using ``fork``, and it's not supported
by the CUDA runtime.
the ``spawn`` or ``forkserver`` start methods.
Unlike CPU tensors, the sending process is required to keep the original tensor
as long as the receiving process retains a copy of the tensor. The refcounting is

View File

@ -187,7 +187,7 @@ mentioning all of them as in required by :meth:`~Tensor.permute`.
# Move the F (dim 5) and E dimension (dim 4) to the front while keeping
# the rest in the same order
>>> tensor.permute(5, 4, 0, 1, 2, 3)
>>> named_tensor.align_to('F', 'E', ...) # Use '...' instead in Python 2
>>> named_tensor.align_to('F', 'E', ...)
Use :meth:`~Tensor.flatten` and :meth:`~Tensor.unflatten` to flatten and unflatten
dimensions, respectively. These methods are more verbose than :meth:`~Tensor.view`
@ -317,4 +317,3 @@ operators, see :ref:`name_inference_reference-doc`.
.. warning::
The named tensor API is experimental and subject to change.

View File

@ -5,6 +5,13 @@ Automatic Mixed Precision examples
.. currentmodule:: torch.cuda.amp
.. warning::
:class:`torch.cuda.amp.GradScaler` is not a complete implementation of automatic mixed precision.
:class:`GradScaler` is only useful if you manually run regions of your model in ``float16``.
If you aren't sure how to choose op precision manually, the master branch and nightly pip/conda
builds include a context manager that chooses op precision automatically wherever it's enabled.
See the `master documentation <https://pytorch.org/docs/master/amp.html>`_ for details.
.. contents:: :local:
.. _gradient-scaling-examples:

View File

@ -27,10 +27,7 @@ others that require asynchronous operation.
CUDA in multiprocessing
-----------------------
The CUDA runtime does not support the ``fork`` start method. However,
:mod:`python:multiprocessing` in Python 2 can only create subprocesses using
``fork``. So Python 3 and either ``spawn`` or ``forkserver`` start method are
required to use CUDA in subprocesses.
The CUDA runtime does not support the ``fork`` start method; either the ``spawn`` or ``forkserver`` start method is required to use CUDA in subprocesses.
.. note::
The start method can be set via either creating a context with

View File

@ -151,11 +151,6 @@ Package not found in win-32 channel.
PyTorch doesn't work on 32-bit system. Please use Windows and
Python 64-bit version.
Why are there no Python 2 packages for Windows?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Because it's not stable enough. There're some issues that need to
be solved before we officially release it. You can build it by yourself.
Import error
^^^^^^^^^^^^
@ -290,4 +285,3 @@ tensors cannot succeed, there are two alternatives for this.
2. Share CPU tensors instead. Make sure your custom
:class:`~torch.utils.data.DataSet` returns CPU tensors.

View File

@ -3,6 +3,9 @@
Quantization
===========================
.. warning ::
Quantization is experimental and subject to change.
Introduction to Quantization
----------------------------

View File

@ -8,6 +8,9 @@ training through a set of primitives to allow for remote communication, and a
higher-level API to automatically differentiate models split across several
machines.
.. warning ::
APIs in the RPC package are stable. There are multiple ongoing work items
to improve performance and error handling, which will ship in future releases.
Basics

Some files were not shown because too many files have changed in this diff.