Compare commits

70 Commits

Author SHA1 Message Date
cefb9e0cd6 Update pthreadpool to pthreadpool:029c88620802e1361ccf41d1970bd5b07fd6b7bb. (#40524) (#41190)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40524

Reviewed By: ezyang

Differential Revision: D22215742

Pulled By: AshkanAliabadi

fbshipit-source-id: ef594e0901337a92b21ddd44e554da66c723eb7c
2020-07-10 09:11:32 -07:00
d9e9e0087a [v1.6] [RPC docs] Remove mention of TensorPipe's SHM and CMA backends as they're not built (#41229)
Summary:
In short, we messed up. The SHM and CMA backends of TensorPipe are Linux-specific and thus they are guarded by a #ifdef in the agent's code. Due to a mishap with CMake (due to the fact that TensorPipe has two CMake files, one for PyTorch and a "standalone" one) we were not correctly propagating some flags and these #ifdefs were always false. This means that these two backends have always been disabled and have thus never been covered by our OSS CI. It would be irresponsible to enable them now in v1.6, so instead we remove any mention of them from the docs.

Note that this is perhaps not as bad as it sounds. These two backends provided better performance (lower latency) when the two endpoints were on the same machine. However, I suspect that most RPC users will only do transfers across machines, for which SHM and CMA wouldn't have played any role.

Original PR against master: #41200 (merged as dde3d5f4a8f713ecc4649d776565b68ca75ae5c8)

Test Plan: Docs only
2020-07-10 09:02:08 -07:00
43d746305c Preserve CUDA gencode flags (#41212)
Summary:
Add `torch._C._cuda_getArchFlags()`, which returns the list of architectures `torch_cuda` was compiled with
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods, which return the architecture list and gencode flags PyTorch was compiled with
Print a warning if any of the GPUs is not compatible with any of the CUBINs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173

Differential Revision: D22459998

Pulled By: malfet

fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
2020-07-09 17:34:50 -07:00
9409e03903 [ONNX][1.6] Update interpolate recompute_scale_factor default (#41117)
* Update interpolate recompute_scale_factor default

* Update upsampling.h

* Update functional.py
2020-07-09 17:24:53 -07:00
c9a1853d2f [1.6] Make IterableDataset DataLoader.__len__ warning clearer (#41185)
* make IterableDataset DataLoader.__len__ warning clearer

* typo
2020-07-09 14:07:58 -07:00
7fa9b2923b quantizer.cpp: fix cuda memory pinning (#41139) (#41194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41139

Fixes the test case in https://github.com/pytorch/pytorch/issues/41115
by using PyTorch's CUDA allocator instead of the old Caffe2 one.

Test Plan:
run the test case from the issue:
https://gist.github.com/vkuzo/6d013aa1645cb986d0d4464a931c779b

let's run CI and see what it uncovers

Imported from OSS

Reviewed By: malfet

Differential Revision: D22438787

fbshipit-source-id: 0853b0115d198a99c43e6176aef34ea951bf5c2e

Co-authored-by: Vasiliy Kuznetsov <vasiliy@fb.com>
2020-07-09 14:06:11 -07:00
40bf15a8ac Remove copy_ warnings for angle and abs for complex tensors (#41152) (#41191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41152

fixes https://github.com/pytorch/pytorch/issues/40838

Test Plan: Imported from OSS

Differential Revision: D22444357

Pulled By: anjali411

fbshipit-source-id: 2879d0cffc0a011c624eb8e00c7b64bd33522cc3

Co-authored-by: anjali411 <chourdiaanjali123@gmail.com>
2020-07-09 13:41:15 -07:00
c164fc4d7f Patch #40883 to 1.6 release. (#41033) 2020-07-09 10:25:39 -07:00
e0b7480f34 Revert "make IterableDataset DataLoader.__len__ warning clearer (#41183)"
This reverts commit 89d7f194d8ea19f36c9afb52585a00b5b7d0ffeb.
2020-07-09 08:05:24 -07:00
89d7f194d8 make IterableDataset DataLoader.__len__ warning clearer (#41183) 2020-07-09 08:00:00 -07:00
59bb44a8e8 Add a link in RPC doc page to point to PT Distributed overview (#41108) (#41156)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41108

Test Plan: Imported from OSS

Differential Revision: D22440751

Pulled By: mrshenli

fbshipit-source-id: 9e7b002091a3161ae385fdfcc26484ae8fc243bb
2020-07-09 07:49:10 -07:00
8f4d01d9f1 Disables unary op casting to output dtype (#41097) (#41160)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41047.

Some CPU kernel implementations don't call `cast_outputs()`, so when CPU temporaries were created to hold their outputs they weren't copied back to the out parameters correctly. Instead of fixing that issue, for simplicity this PR disables the behavior. The corresponding test in test_type_promotion.py is expanded with more operations to verify that unary ops can no longer have out arguments with different dtypes than their inputs (except in special cases like torch.abs which maps complex inputs to float outputs and torch.deg2rad which is secretly torch.mul).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41097

Differential Revision: D22422352

Pulled By: mruberry

fbshipit-source-id: 8e61d34ef1c9608790b35cf035302fd226fd9421

Co-authored-by: Mike Ruberry <mruberry@devfair044.maas>
2020-07-08 22:06:48 -07:00
77ffb25925 Add guard for non-default stream in DDP's autograd engine callback (#40115) (#41151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40115

Closes https://github.com/pytorch/pytorch/issues/37790
Closes https://github.com/pytorch/pytorch/issues/37944

A user may wish to run DDP's forward + backwards step under a non-default CUDA stream, such as one selected with `with torch.cuda.stream(stream)`. In this case, the user should be responsible for synchronizing events on this stream with other streams used in the program (per the documentation at https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics), but currently DDP has a bug which causes DDP under non-default streams to fail.

If a user does the following:
```
model = DDP(...)
loss = model(input).sum()
loss.backward()
grad = model.module.weight.grad
average = dist.all_reduce(grad)
```

There is a chance that `average` and `grad` will not be equal. This is because the CUDA kernels corresponding to the  `all_reduce` call may run before `loss.backward()`'s kernels are finished. Specifically, in DDP we copy the allreduced gradients back to the model parameter gradients in an autograd engine callback, but this callback runs on the default stream. Note that this can also be fixed by the application synchronizing on the current stream, although this should not be expected, since the application is not using the current stream at all.

This PR fixes the issue by passing the current stream into DDP's callback.

Tested by adding a UT `test_DistributedDataParallel_non_default_stream` that fails without this PR
ghstack-source-id: 106481208

Differential Revision: D22073353

fbshipit-source-id: 70da9b44e5f546ff8b6d8c42022ecc846dff033e
2020-07-08 21:08:17 -07:00
af9600b1f5 [Caffe2] Move in-header virtual function implementation to .cc files (#41090)
* Move OperatorSchema default inference function implementations to .cc… (#40845)

Summary:
… file

This prevents the implementations of those functions (as lambdas) from being embedded as weak symbols in every shared library that includes this header.

Combination of this and https://github.com/pytorch/pytorch/pull/40844 reduces size of `libcaffe2_module_test_dynamic.so` from 500 KB to 50 KB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40845

Differential Revision: D22334779

Pulled By: malfet

fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455

* Move `OperatorBase::AddRelatedBlobInfo` implementation to .cc file (#40844)

Summary:
If a virtual function is implemented in a header file, its implementation will be included as a weak symbol in every shared library that includes this header, along with all of its dependencies.

This was one of the reasons why the size of libcaffe2_module_test_dynamic.so was 500 KB (the AddRelatedBlobInfo implementation pulled in a quarter of libprotobuf.a with it)

Combination of this and https://github.com/pytorch/pytorch/issues/40845 reduces size of `libcaffe2_module_test_dynamic.so` from 500 KB to 50 KB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40844

Differential Revision: D22334725

Pulled By: malfet

fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
2020-07-07 21:17:11 -07:00
83262b1ba1 torch._six.PY37 should be true for Python-3.8 as well (#40868) (#41091)
Summary:
Right now it is used to check whether `math.remainder` exists, which is the case for both Python-3.7 and 3.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40868
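
A minimal sketch of the idea, not the actual `torch._six` code: a flag that gates on the existence of `math.remainder` has to mean "3.7 or later" rather than "exactly 3.7". The names `PY37_OR_LATER` and `remainder` below are hypothetical and exist only for illustration.

```python
import math
import sys

# Hypothetical flag: true on 3.7 and on every later version, not only on 3.7.
PY37_OR_LATER = sys.version_info >= (3, 7)


def remainder(x, y):
    """Remainder with the quotient rounded to nearest, as math.remainder does."""
    if PY37_OR_LATER:
        return math.remainder(x, y)
    # Approximate fallback for older interpreters (assumption: good enough
    # for illustration; it can differ from math.remainder in edge cases).
    return x - y * round(x / y)
```

With a flag defined as `sys.version_info[:2] == (3, 7)` the same function would silently take the fallback path on Python 3.8.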

Differential Revision: D22343454

Pulled By: malfet

fbshipit-source-id: 6b6d4869705b64c4b952309120f92c04ac7e39fd
2020-07-07 17:15:01 -07:00
f862a6ba4d Remove unused Logger in get_matching_activations (#41023) (#41087)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41023

Remove Logger in get_matching_activations since it's not used.
ghstack-source-id: 107237046

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22394957

fbshipit-source-id: 7d59e0f35e9f4c304b8487460d48236ee6e5a872

Co-authored-by: Haixin Liu <haixin@fb.com>
2020-07-07 16:09:37 -07:00
f3c1ea7455 [PyTorch Numeric Suite] Remove unnecessary Logger in input arguments (#40890) (#41086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40890

Remove unnecessary Logger in input arguments and simplify the API.
ghstack-source-id: 107110487

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22345477

fbshipit-source-id: d8b4eb3d6cb3049aa3296dead8ba29bf5467bd1c

Co-authored-by: Haixin Liu <haixin@fb.com>
2020-07-07 16:09:11 -07:00
2ed3ad2891 fix autodoc for torch.distributed.launch (#40963) (#41089)
Summary:
The doc for `torch.distributed.launch` has been missing since v1.2.0 (see issue https://github.com/pytorch/pytorch/issues/36386) because PR https://github.com/pytorch/pytorch/issues/22501 added some imports at the first line.
542ac74987/torch/distributed/launch.py (L1-L5)
I moved them below the docstring to make the autodoc in Sphinx work normally.
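
The underlying Python rule can be checked with the standard library alone: a string literal only counts as the module docstring when it is the first statement. The sketch below uses two made-up source snippets (`IMPORTS_FIRST`, `DOCSTRING_FIRST`) and a hypothetical helper `module_docstring`; it does not touch the real `launch.py`.

```python
import ast

# Imports placed before the string: the string is no longer the module docstring.
IMPORTS_FIRST = 'import sys\n"""Launch utility."""\n'
# String placed first: it is recognized as the module docstring.
DOCSTRING_FIRST = '"""Launch utility."""\nimport sys\n'


def module_docstring(source):
    """Return the docstring of a module given its source text, or None."""
    return ast.get_docstring(ast.parse(source))
```

Here `module_docstring(IMPORTS_FIRST)` is `None`, which is the situation a documentation tool reading the module's `__doc__` would run into.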

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40963

Differential Revision: D22380816

Pulled By: mrshenli

fbshipit-source-id: ee8406785b9a198bbf3fc65e589854379179496f

Co-authored-by: Xin Yao <yaox12@outlook.com>
2020-07-07 14:23:31 -07:00
a857af50a4 [quant][graphmode][fix] cloning schema in insert_observers (#40624) (#40934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40624

Previously we didn't clone the schema, so the default schema was used, which
caused issues for some models

Test Plan: Imported from OSS

Differential Revision: D22259519

fbshipit-source-id: e2a393a54cb18f55da0c7152a74ddc22079ac350
2020-07-07 13:27:36 -07:00
d0045e5520 Some fixes for graph mode quantization (#40935)
* [quant] aten::repeat work for quantized tensor (#40644)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40644

Test Plan: Imported from OSS

Differential Revision: D22268558

fbshipit-source-id: 3bc9a129bece1b547c519772ecc6b980780fb904

* [quant][graphmode][fix] remove unsupported ops in the list (#40653)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40653

(Note: this ignores all push blocking failures!)

Test Plan: Imported from OSS

Differential Revision: D22271413

fbshipit-source-id: a01611b5d90849ac673fa5a310f910c858e907a3
2020-07-07 13:26:27 -07:00
0406b69b79 [quant][graphmode][fix] Fold conv bn (#40865) (#40970)
* [quant][graphmode][fix] Fold conv bn (#40865)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40865

1. applied filter for the module types
2. removed the assumption that the conv and bn are immediate children of the parent module

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses

Imported from OSS

Differential Revision: D22338074

fbshipit-source-id: 64739a5e56c0a74249a1dbc2c8454b88ec32aa9e

* [quant][graphmode][fix] Print the node in error message (#40889)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40889

Test Plan: Imported from OSS

Differential Revision: D22348266

fbshipit-source-id: eed2ece5c94fcfaf187d6770bed4a7109f0c0b4a
2020-07-07 13:25:39 -07:00
6220cc4380 [quant][graphmode][fix] dequantize propagation for {add/mul}_scalar + aten::repeat (#40933)
* [quant][graphmode][fix] dequantize propagation for {add/mul}_scalar (#40596)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40596

Previously the fusion patterns for {add/mul}_scalar were inconsistent, since the op pattern
produced a non-quantized tensor and the op replacement graph produced a quantized tensor

Test Plan: Imported from OSS

Differential Revision: D22251072

fbshipit-source-id: e16eb92cf6611578cca1ed8ebde961f8d0610137

* [quant][graphmode] Support quantization for `aten::append` (#40743)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40743

`aten::append` modifies its input in place and the output is ignored. Such ops are not
supported right now, so we first need to make `aten::append` non-inplace
by changing
```
ignored = aten::append(list, x)
```
to
```
x_list = aten::ListConstruct(x)
result = aten::add(list, x_list)
```
and then quantize the aten::add instead.
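
As a plain-Python analogy of the rewrite above (not the actual JIT pass), the in-place append becomes a list construction followed by a concatenation that returns a new value. The function names are hypothetical.

```python
def append_inplace(items, x):
    """Mutates `items`; callers ignore the return value (None)."""
    items.append(x)


def append_functional(items, x):
    """Builds a one-element list and concatenates, leaving `items` untouched."""
    x_list = [x]           # plays the role of aten::ListConstruct(x)
    return items + x_list  # plays the role of aten::add(list, x_list)
```

Because the functional form produces an explicit result, a later pass has an output value to attach quantization to.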

Test Plan:
TestQuantizeJitOps.test_general_shape_ops

Imported from OSS

Differential Revision: D22302151

fbshipit-source-id: 931000388e7501e9dd17bec2fad8a96b71a5efc5
2020-07-07 13:25:02 -07:00
eaf3f2fd34 Added index_put to promotelist (#41036)
* Added index_put to promotelist

* docstring

Co-authored-by: Michael Carilli <mcarilli@nvidia.com>
2020-07-07 13:00:32 -07:00
c35b4c770b Bucket of shape analysis fixes (#41044)
* [JIT] fix unfold shape analysis (#40749)

Summary:
unfold on a 0-dimensioned tensor returns a 1-dim tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40749

Differential Revision: D22361481

Pulled By: eellison

fbshipit-source-id: 621597e5f97f6e39953eb86f8b85bb4142527a9f

* shape analysis fix for default dtype

ghstack-source-id: 723aa27c2685417715a0891f5ca1ae885d4c9832
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40938

* fix grad thrashing of shape analysis

ghstack-source-id: dd8742b1da52d17e9d6ab6c81ff0b27520b09417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40939

Co-authored-by: Elias Ellison <eellison@fb.com>
2020-07-07 12:59:47 -07:00
11baccf1b5 [release/1.6] .circleci: Output binary sizes, store binaries (#41075)
We need an easy way to quickly visually grep binary sizes from builds
and then have a way to test out those binaries quickly.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
(cherry picked from commit 66813515d4dec66f319442ba967c64b87c0286cd)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-07-07 11:27:00 -07:00
f0f0cbdd4a Docstring changes for dynamic quantized classes (#40931) (#41032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40931

Fix docstrings for dynamic quantized Linear/LSTM and associated classes
ghstack-source-id: 107064446

Test Plan: Docs show up correctly

Differential Revision: D22360787

fbshipit-source-id: 8e357e081dc59ee42fd7f12ea5079ce5d0cc9df2
2020-07-06 21:37:53 -07:00
11b70b0041 [JIT] Switch executor from Simple to Legacy. (#41017)
* properly skip legacy tests regardless of the default executor (#40381)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40381

Differential Revision: D22173938

Pulled By: Krovatkin

fbshipit-source-id: 305fc4484977e828cc4cee6e053a1e1ab9f0d6c7

* [JIT] Switch executor from Simple to Legacy.

This is done for 1.6 only in order to recover performance regressions
caused by the Legacy->Simple switch that was done in 1.5. On master we
still plan to use Simple executor and fix the performance issues in 1.7
without falling back to the Legacy executor.

Co-authored-by: Nikolay Korovaiko <korovaikon@gmail.com>
2020-07-06 21:35:02 -07:00
01e9562313 [1.6 cherrypick] Fix delegating to jit.load from torch.load (#41013) 2020-07-06 16:55:00 -07:00
3f13c9a2c8 infer tensor properties based on an input tensor rather than defaults for xxx_like ctors (#40895) (#41016)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40895

Reviewed By: eellison

Differential Revision: D22358878

Pulled By: Krovatkin

fbshipit-source-id: 2db2429aa89c180d8e52a6bb1265308483da46a2
2020-07-06 16:52:59 -07:00
63a94c021a shape inference of undefined for prim::grad (#40866) (#41015)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40866

Reviewed By: pbelevich

Differential Revision: D22358988

Pulled By: Krovatkin

fbshipit-source-id: 7118d7f8d4eaf056cfb71dc0d588d38b1dfb0fc7
2020-07-06 16:51:37 -07:00
2b175ba909 update requires_grad on loop inputs correctly (master) (#40926) (#41014)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40926

Reviewed By: eellison

Differential Revision: D22359471

Pulled By: Krovatkin

fbshipit-source-id: 823e87674e2d2917f075255ec926e0485972f4e2
2020-07-06 16:30:14 -07:00
8c3f662224 Update FP16 to FP16:4dfe081cf6bcd15db339cf2680b9281b8451eeb3. (#40956) 2020-07-06 06:59:41 -07:00
0ffdd5aa1d Update cpuinfo to cpuinfo:63b254577ed77a8004a9be6ac707f3dccc4e1fd9. (#40955) 2020-07-06 06:59:30 -07:00
d53427c541 Update FXdiv to FXdiv:b408327ac2a15ec3e43352421954f5b1967701d1. (#40954) 2020-07-06 06:59:17 -07:00
b44b1d868e Update psimd to psimd:072586a71b55b7f8c584153d223e95687148a900 (#40953) 2020-07-06 06:59:01 -07:00
9184c9832e Re-apply PyTorch pthreadpool changes (#40951)
* Re-apply PyTorch pthreadpool changes

Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.

Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`

Reviewed By: xcheng16

Differential Revision: D22199952

fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5

* Enable XNNPACK ops on iOS and macOS.

Test Plan: buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/pytext/pytext_mobile_inference.json --platform ios --framework pytorch --remote --devices D221AP-12.0.1

Reviewed By: xta0

Differential Revision: D21886736

fbshipit-source-id: ac482619dc1b41a110a3c4c79cc0339e5555edeb

* Respect user set thread count. (#40707)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40707

Test Plan: Imported from OSS

Differential Revision: D22318197

Pulled By: AshkanAliabadi

fbshipit-source-id: f11b7302a6e91d11d750df100d2a3d8d96b5d1db

* Fix and reenable threaded QNNPACK linear (#40587)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40587

Previously, this was causing divide-by-zero only in the multithreaded
empty-batch case, while calculating tiling parameters for the threads.
In my opinion, the bug here is using a value that is allowed to be zero
(batch size) for an argument that should not be zero (tile size), so I
fixed the bug by bailing out right before the call to
pthreadpool_compute_4d_tiled.
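
A toy sketch of the failure mode and the guard, assuming (as the description says) that the batch size is passed where a tile size is expected. `count_tiles` is a hypothetical stand-in for the tiling arithmetic, not pthreadpool's real code.

```python
def count_tiles(range_size, tile_size):
    """Tiles needed to cover `range_size` items; divides by `tile_size`."""
    return (range_size + tile_size - 1) // tile_size


def dispatch_unguarded(batch_size):
    # The batch size doubles as the tile size, so an empty batch divides by zero.
    return count_tiles(batch_size, batch_size)


def dispatch_guarded(batch_size):
    # Bail out before the tiled dispatch when there is nothing to compute.
    if batch_size == 0:
        return 0
    return count_tiles(batch_size, batch_size)
```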

Test Plan: TestQuantizedOps.test_empty_batch

Differential Revision: D22264414

Pulled By: dreiss

fbshipit-source-id: 9446d5231ff65ef19003686f3989e62f04cf18c9

* Fix batch size zero for QNNPACK linear_dynamic (#40588)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40588

Two bugs were preventing this from working.  One was a divide by zero
when multithreading was enabled, fixed similarly to the fix for static
quantized linear in the previous commit.  The other was computation of
min and max to determine qparams. FBGEMM uses [0,0] for [min,max] of
empty input, so we do the same.
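
The second fix can be illustrated in a few lines of Python. The helper name `observed_range` is hypothetical; only the `[0,0]` convention for empty input comes from the description above.

```python
def observed_range(values):
    """Return (min, max) of `values`, using (0.0, 0.0) for empty input."""
    if not values:
        # min() and max() raise ValueError on an empty sequence,
        # so the empty case needs an explicit convention.
        return 0.0, 0.0
    return min(values), max(values)
```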

Test Plan: Added a unit test.

Differential Revision: D22264415

Pulled By: dreiss

fbshipit-source-id: 6ca9cf48107dd998ef4834e5540279a8826bc754

Co-authored-by: David Reiss <dreiss@fb.com>
2020-07-06 06:58:25 -07:00
e89c4f0dec [quant] Fix fuse linear pass (#40549) (#40751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40549

Previously we didn't check whether %weight_t is produced by `aten::t`, so the pass would fuse some `matmul`/`addmm` calls that are
not 2d into `aten::linear`, which is incorrect

Test Plan: Imported from OSS

Differential Revision: D22225921

fbshipit-source-id: 9723e82fdbac6d8e1a7ade22f3a9791321ab12b6
2020-07-02 10:23:22 -07:00
ea273c68f9 Inplace construct of TorchScript Module and inplace option for quantization (#40750)
* [WIP][JIT] Add ScriptModule._reconstruct (#39979)

Summary:
**Summary**
This commit adds an instance method `_reconstruct` that permits users
to reconstruct a `ScriptModule` from a given C++ `Module` instance.

**Testing**
This commit adds a unit test for `_reconstruct`.

**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/33912.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39979

Differential Revision: D22172323

Pulled By: SplitInfinity

fbshipit-source-id: 9aa6551c422a5a324b822a09cd8d7c660f99ca5c

* [quant][graphmode] Enable inplace option for top level API (#40414)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40414

Now that `_reconstruct` is supported in RecursiveScriptModule (https://github.com/pytorch/pytorch/pull/39979),
we can support the inplace option in the quantization API

Test Plan: Imported from OSS

Differential Revision: D22178326

fbshipit-source-id: c78bc2bcf2c42b06280c12262bb31aebcadc6c32

Co-authored-by: Meghan Lele <meghanl@fb.com>
2020-07-02 10:22:45 -07:00
4dd37bfbf7 [jit] Remove unnecessary clone APIs for script::Module and RecursiveScriptModule (#40297) (#40748)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40297

Test Plan: Imported from OSS

Differential Revision: D22191660

fbshipit-source-id: 4b338ca82caaca04784bffe01fdae3d180c192f4
2020-07-02 10:22:27 -07:00
2533b9da83 Fix complex printing for sci_mode=True (#40513) (#40919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40513

This PR makes the following changes:
1. Complex printing now uses print formatting for its real and imaginary values, and they are joined at the end.
2. Change 1 naturally fixes the printing of complex tensors in sci_mode=True

```
>>> torch.tensor(float('inf')+float('inf')*1j)
tensor(nan+infj)
>>> torch.randn(2000, dtype=torch.cfloat)
tensor([ 0.3015-0.2502j, -1.1102+1.2218j, -0.6324+0.0640j,  ...,
        -1.0200-0.2302j,  0.6511-0.1889j, -0.1069+0.1702j])
>>> torch.tensor([1e-3, 3+4j, 1e-5j, 1e-2+3j, 5+1e-6j])
tensor([1.0000e-03+0.0000e+00j, 3.0000e+00+4.0000e+00j, 0.0000e+00+1.0000e-05j,
        1.0000e-02+3.0000e+00j, 5.0000e+00+1.0000e-06j])
>>> torch.randn(3, dtype=torch.cfloat)
tensor([ 1.0992-0.4459j,  1.1073+0.1202j, -0.2177-0.6342j])
>>> x = torch.tensor([1e2, 1e-2])
>>> torch.set_printoptions(sci_mode=False)
>>> x
tensor([  100.0000,     0.0100])
>>> x = torch.tensor([1e2, 1e-2j])
>>> x
tensor([100.+0.0000j,   0.+0.0100j])
```
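
The "format the parts separately, then join" approach can be sketched in plain Python. This is only an illustration of the idea, with a hypothetical helper `format_complex`; it is not the code in PyTorch's tensor printing module.

```python
def format_complex(z, precision=4):
    """Format real and imaginary parts separately in scientific notation, then join."""
    real = f"{z.real:.{precision}e}"
    imag = f"{z.imag:+.{precision}e}"  # '+' forces an explicit sign on the imaginary part
    return f"{real}{imag}j"
```

For example, `format_complex(3 + 4j)` gives `3.0000e+00+4.0000e+00j`, which matches the shape of the scientific-mode output shown above.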

Test Plan: Imported from OSS

Differential Revision: D22309294

Pulled By: anjali411

fbshipit-source-id: 20edf9e28063725aeff39f3a246a2d7f348ff1e8

Co-authored-by: anjali411 <chourdiaanjali123@gmail.com>
2020-07-02 09:45:35 -07:00
c5c8a85a82 If ninja is being used, force build_ext to run. (#40881)
As ninja has accurate dependency tracking, if there is nothing to do,
then we will very quickly noop.  But this is important for correctness:
if a change was made to a header that is not listed explicitly in
the distutils Extension, then distutils will come to the wrong
conclusion about whether or not recompilation is needed (but Ninja
will work it out.)

This caused https://github.com/pytorch/vision/issues/2367

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 6409595c8ac091f3863f305c123266b9d3a167ad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40837
2020-07-02 08:05:25 -07:00
b4b8f5b9d4 Release GIL during DDP construction. (#40877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40495

As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.

1) Rank 0 would go into DDP construction, hold GIL and wait for broadcast in
DDP construction.
2) Rank 3 is a little slower and performs an RRef fetch call before the DDP
construction.
3) The RRef fetch call is done on Rank 0 and tries to acquire GIL.
4) We now have a deadlock since Rank 0 is waiting for Rank 3 to enter the
collective and Rank 3 is waiting for Rank 0 to release GIL.
ghstack-source-id: 106534442

Test Plan:
1) Ran ddp_under_dist_autograd 500 times.
2) waitforbuildbot

Differential Revision: D22205180

fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a

Co-authored-by: Pritam Damania <pritam.damania@fb.com>
2020-07-01 13:36:50 -07:00
41816dc97f [1.6] Fix dictConstruct ordering and enable dict mix (#40797)
A combination of https://github.com/pytorch/pytorch/pull/39601 and
https://github.com/pytorch/pytorch/pull/40424. Both are approved and
merged in master.
2020-07-01 09:30:16 -07:00
31d9776c04 [1.6] fix autograd doc subsubsection display issue (#40796)
Master branch PR: https://github.com/pytorch/pytorch/pull/40582
2020-07-01 09:28:25 -07:00
ddea6c552f Ports full dtype inference deprecation to 1.6 (#40799)
* ports full deprecation

* fixtures

* Fixes lint

* Trying to fix phantom lint issue

* nuclear lint option

* Paradoxical linter fix

Co-authored-by: Mike Ruberry <mruberry@devfair044.maas>
2020-07-01 09:27:27 -07:00
091537a764 [JIT][1.6] Shape analysis fixes. (#40716)
* [JIT] Update type of the unsqueeze's output in shape analysis.

* [JIT] Fix shape analysis for aten::masked_select.

The reference says that this op always returns a 1-D tensor, even if
the input and the mask are 0-D.
2020-07-01 08:41:05 -07:00
bf4d905ea1 Fix wrong MSVC version constraint for CUDA 9.2 (#40794) (#40849)
Summary:
Tested with https://github.com/pytorch/pytorch/pull/40782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40794

Differential Revision: D22318045

Pulled By: malfet

fbshipit-source-id: a737ffd7cb8a6a9efb62b84378318f4c3800ad8f
2020-07-01 08:37:40 -07:00
415e499330 Fix zip serialization for file > 2GiB for Windows (#40852) 2020-07-01 08:36:40 -07:00
eaf7dad5d6 [1.6 cherrypick] Support Pathlike for zipfile serialization (#40793) 2020-06-30 10:38:00 -07:00
75a074abdc 1.6 Port: Dynamic Versioning (#40542)
Co-authored-by: Mike Ruberry <mruberry@devfair044.maas>
2020-06-30 10:18:18 -07:00
dede34eab7 [1.6 cherrypick] Doc fix for complex views
Cherry-pick of https://github.com/pytorch/pytorch/pull/40450

Test Plan: Imported from OSS
2020-06-30 09:37:02 -07:00
0c90b6da5c [1.6 cherrypick] Fix zip serialization for file > 2GiB (#40757)
* [1.6 cherrypick] Fix zip serialization for file > 2GiB

* Update test/test_serialization.py

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2020-06-30 07:10:02 -07:00
4316199832 Add examples and tests for combining static/class method with async execution (#40619) (#40688)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40619

Test Plan: Imported from OSS

Differential Revision: D22258407

Pulled By: mrshenli

fbshipit-source-id: 036d85a2affc4505efd2df197fc513dba010e359
2020-06-29 19:34:23 -07:00
f993e5ac88 [1.6] Update TensorPipe submodule (#40634)
Upstream PR: #40614

Summary:
This update pulls in a oneliner fix, which sets the TCP_NODELAY option on the TCP sockets of the UV transport. This leads to exceptional performance gains in terms of latency, with about a 25x improvement in one simple benchmark. This thus resolves a regression that TensorPipe had compared to the ProcessGroup agent and, in fact, ends up beating it by 2x.
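
For readers unfamiliar with the option, this is what setting `TCP_NODELAY` looks like with Python's standard `socket` module. It is a sketch of the socket option only, not TensorPipe's UV transport code, and the helper name `make_low_latency_socket` is hypothetical.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled via TCP_NODELAY."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

With the option set, small writes are sent immediately instead of being coalesced, which matters for request/response traffic made of many tiny messages.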

The benchmark I ran is this, with the two endpoints pinned to different cores of the same machine:
```
@torch.jit.script
def remote_fn(t: int):
    return t

@torch.jit.script
def local_fn():
    for _ in range(1_000_000):
        fut = rpc.rpc_async("rhs", remote_fn, (42,))
        fut.wait()
```

And the average round-trip time (one iteration) is:
- TensorPipe with SHM: 97.2 us
- TensorPipe with UV _after the fix_: 205 us
- Gloo: 440 us
- TensorPipe with UV _before the fix_: 5 ms
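
The "about 25x" and "2x" figures follow directly from these timings. The short calculation below assumes that the "Gloo" row is the ProcessGroup agent baseline referred to in the summary; the dictionary keys are made-up labels.

```python
# Average round-trip times quoted above, in microseconds.
rtt_us = {
    "tensorpipe_shm": 97.2,
    "tensorpipe_uv_after": 205.0,
    "gloo": 440.0,                   # assumed to be the ProcessGroup agent baseline
    "tensorpipe_uv_before": 5000.0,  # 5 ms
}


def speedup(slow, fast):
    """How many times faster `fast` is than `slow`."""
    return rtt_us[slow] / rtt_us[fast]


uv_fix_gain = speedup("tensorpipe_uv_before", "tensorpipe_uv_after")  # roughly 24.4
uv_vs_gloo = speedup("gloo", "tensorpipe_uv_after")                   # roughly 2.1
```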

Test Plan: Ran PyTorch RPC test suite
2020-06-29 19:33:32 -07:00
c5bd737f0c [JIT] script if tracing fix (#40468) (#40572)
Summary:
Currently, torchvision annotates `batched_nms` with `torch.jit.script` so the function gets compiled when it is traced and ONNX will work. Unfortunately, this means we are eagerly compiling batched_nms, which fails if torchvision isn't built with `torchvision.ops.nms`. As a result, torchvision doesn't work on torch hub right now.

`_script_if_tracing` could solve our problem here, but right now it does not correctly interact with recursive compilation. This PR fixes that bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40468

Reviewed By: jamesr66a

Differential Revision: D22195771

Pulled By: eellison

fbshipit-source-id: 83022ca0bab6d389a48a478aec03052c9282d2b7

Co-authored-by: Elias Ellison <eellison@fb.com>
2020-06-29 19:30:41 -07:00
fe45c2c986 Allow slicing sequential container (#40538)
- fixes #38034
- works around missing slice functionality in Sequential
  by casting to tuple and slicing that instead
- supports iterating on the resulting slice but not call()
2020-06-29 19:29:19 -07:00
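
The tuple-casting workaround described in the commit above can be sketched in plain Python. `TinySequential` is a hypothetical stand-in written for this example, not `torch.nn.Sequential`, and the layers are ordinary functions.

```python
from typing import Callable, Iterator, Tuple, Union

Layer = Callable[[float], float]


class TinySequential:
    """Hypothetical stand-in for a sequential container."""

    def __init__(self, *layers: Layer) -> None:
        self._layers = list(layers)

    def __iter__(self) -> Iterator[Layer]:
        return iter(self._layers)

    def __getitem__(self, index: Union[int, slice]) -> Union[Layer, Tuple[Layer, ...]]:
        if isinstance(index, slice):
            # Cast to a tuple and slice that; the result can be iterated
            # over but cannot be called like a container.
            return tuple(self._layers)[index]
        return self._layers[index]


model = TinySequential(lambda x: x + 1, lambda x: x * 2, lambda x: x - 3)
value = 5.0
for layer in model[:2]:  # iterate over the slice instead of calling it
    value = layer(value)
```

Because the slice is a plain tuple, callers apply its layers in a loop rather than invoking the slice itself.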
a9996bb482 Fixes caffe2 loading issues on Windows (#39513) (#40487)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/27840#issuecomment-638715422.
Contains a bunch of fixes (https://github.com/pytorch/pytorch/pull/39376 + https://github.com/pytorch/pytorch/pull/39334 + https://github.com/pytorch/pytorch/pull/38302 + https://github.com/pytorch/pytorch/pull/35362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39513

Differential Revision: D22190761

Pulled By: malfet

fbshipit-source-id: b2d52f6cb16c233d16071e9c0670dfff7da2710e
(cherry picked from commit e2201e2ed8ed7bf9c6226f8c484192949d94c248)
2020-06-29 19:17:34 -07:00
bdfcbfa18c [release/1.6] .jenkins: Install torch from test channel (#40706)
We're on a test branch so we should install from the test channel

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-06-29 13:53:14 -07:00
ea1b0dba18 Remove constexpr for NVCC on Windows (#40676) 2020-06-29 13:48:50 -07:00
6d85b2c989 Pin XLA CI to use r1.6 release branch. (#40721) 2020-06-29 13:41:14 -07:00
44f79651a7 Tweak file_diff_from_base for release/1.6 branch (#40712) 2020-06-29 11:41:46 -07:00
8682ac147b Docs merge (#40569)
Co-authored-by: Elias Ellison <eellison@fb.com>
2020-06-26 12:24:08 -07:00
4cc605e80a (1.6) Update docs feature classifications (#40539)
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2020-06-26 12:23:02 -07:00
b0cce716f7 Add beta warning for quant docs (#40540)
Add a beta warning to match stable and master docs: https://github.com/pytorch/pytorch/blob/master/docs/source/quantization.rst
2020-06-26 12:20:06 -07:00
0dc93ac119 [v1.6.0 patch] Install method docstrings from PyRRef to RRef (#40620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40461

It turned out that `:inherited-members:` (see [doc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass)) is not really usable.

This is because pybind11 generates a docstring that writes the type of `self` as the parent class, `rpc.PyRRef`.

As a workaround, I am pulling the docstrings of the parent class, `PyRRef`, into the subclass, `RRef`, and doing surgery on the docstrings generated by pybind11.

{F241283111}

ghstack-source-id: 106472496

P134031188

Differential Revision: D7933834

fbshipit-source-id: c03a8a4c9d98888b64492a8caba1591595bfe247

Co-authored-by: Shihao Xu <shihaoxu@fb.com>
2020-06-26 12:15:28 -07:00
bb848df10b [1.6] Remove table of contents at the top of rpc.rst (#40482)
Master PR: https://github.com/pytorch/pytorch/pull/40205

Remove the table of contents created by the `.. contents:: :local: :depth: 2` directive, since this page isn't one of the large documentation pages (https://github.com/pytorch/pytorch/issues/38010) and is simply a landing page for the Distributed RPC Framework.

Changes made in this original PR: f10fbcc820 (diff-250b9b23fd6f1a5c15aecdb72afb9d7d)
2020-06-26 08:37:49 -07:00
2dc0b84aca Skip test_mem_leak on Windows (#40498)
(cherry picked from commit 3fb6f038256a3a5ce43e857409ce4ffb807d93a5)
2020-06-25 16:45:48 -07:00
168cddf5f1 .circleci: Fix upload to backup directory
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-06-23 20:57:42 -07:00
bc8760b3db .circleci: Fix pip installation of awscli
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-06-23 19:05:48 -07:00
4269b9a8fc .circleci: Fix backup uploads
awscli was not loaded on conda builds and the backup upload did not work
since it was a recursive copy instead of just specifically copying what
we want.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-06-23 18:18:06 -07:00
195 changed files with 2682 additions and 2853 deletions

View File

@ -958,6 +958,11 @@ jobs:
no_output_timeout: "1h" no_output_timeout: "1h"
command: | command: |
source "/pytorch/.circleci/scripts/binary_linux_build.sh" source "/pytorch/.circleci/scripts/binary_linux_build.sh"
- run:
name: Output binary sizes
no_output_timeout: "1m"
command: |
ls -lah /final_pkgs
- run: - run:
name: save binary size name: save binary size
no_output_timeout: "5m" no_output_timeout: "5m"
@ -972,6 +977,9 @@ jobs:
root: / root: /
paths: final_pkgs paths: final_pkgs
- store_artifacts:
path: /final_pkgs
# This should really just be another step of the binary_linux_build job above. # This should really just be another step of the binary_linux_build job above.
# This isn't possible right now b/c the build job uses the docker executor # This isn't possible right now b/c the build job uses the docker executor
# (otherwise they'd be really really slow) but this one uses the macine # (otherwise they'd be really really slow) but this one uses the macine

View File

@ -14,7 +14,7 @@ mkdir -p ${ZIP_DIR}/src
cp -R ${ARTIFACTS_DIR}/arm64/include ${ZIP_DIR}/install/ cp -R ${ARTIFACTS_DIR}/arm64/include ${ZIP_DIR}/install/
# build a FAT bianry # build a FAT bianry
cd ${ZIP_DIR}/install/lib cd ${ZIP_DIR}/install/lib
target_libs=(libc10.a libclog.a libcpuinfo.a libeigen_blas.a libpytorch_qnnpack.a libtorch_cpu.a libtorch.a libXNNPACK.a) target_libs=(libc10.a libclog.a libcpuinfo.a libeigen_blas.a libpthreadpool.a libpytorch_qnnpack.a libtorch_cpu.a libtorch.a libXNNPACK.a)
for lib in ${target_libs[*]} for lib in ${target_libs[*]}
do do
if [ -f "${ARTIFACTS_DIR}/x86_64/lib/${lib}" ] && [ -f "${ARTIFACTS_DIR}/arm64/lib/${lib}" ]; then if [ -f "${ARTIFACTS_DIR}/x86_64/lib/${lib}" ] && [ -f "${ARTIFACTS_DIR}/arm64/lib/${lib}" ]; then

View File

@ -20,6 +20,7 @@ PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly}
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::') CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup" BACKUP_BUCKET="s3://pytorch-backup"
retry pip install -q awscli
# Upload the package to the final location # Upload the package to the final location
pushd /home/circleci/project/final_pkgs pushd /home/circleci/project/final_pkgs
if [[ "$PACKAGE_TYPE" == conda ]]; then if [[ "$PACKAGE_TYPE" == conda ]]; then
@ -30,14 +31,12 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//') subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}" BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry pip install -q awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else else
retry pip install -q awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
@ -45,5 +44,5 @@ fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}" s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp . "$s3_dir" retry aws s3 cp --recursive . "$s3_dir"
fi fi

View File

@ -21,6 +21,7 @@ PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly}
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::') CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup" BACKUP_BUCKET="s3://pytorch-backup"
retry pip install -q awscli
pushd "$workdir/final_pkgs" pushd "$workdir/final_pkgs"
if [[ "$PACKAGE_TYPE" == conda ]]; then if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client retry conda install -yq anaconda-client
@ -30,14 +31,12 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//') subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}" BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry pip install -q awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else else
retry pip install -q awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
@ -45,5 +44,5 @@ fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}" s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp . "$s3_dir" retry aws s3 cp --recursive . "$s3_dir"
fi fi

View File

@ -19,6 +19,7 @@ PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly/}
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::') CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup" BACKUP_BUCKET="s3://pytorch-backup"
retry pip install -q awscli
pushd /root/workspace/final_pkgs pushd /root/workspace/final_pkgs
# Upload the package to the final location # Upload the package to the final location
if [[ "$PACKAGE_TYPE" == conda ]]; then if [[ "$PACKAGE_TYPE" == conda ]]; then
@ -29,14 +30,12 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//') subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}" BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry conda install -c conda-forge -yq awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else else
retry conda install -c conda-forge -yq awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/" BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
@ -44,5 +43,5 @@ fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}" s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp . "$s3_dir" retry aws s3 cp --recursive . "$s3_dir"
fi fi

View File

@ -41,6 +41,11 @@
no_output_timeout: "1h" no_output_timeout: "1h"
command: | command: |
source "/pytorch/.circleci/scripts/binary_linux_build.sh" source "/pytorch/.circleci/scripts/binary_linux_build.sh"
- run:
name: Output binary sizes
no_output_timeout: "1m"
command: |
ls -lah /final_pkgs
- run: - run:
name: save binary size name: save binary size
no_output_timeout: "5m" no_output_timeout: "5m"
@ -55,6 +60,9 @@
root: / root: /
paths: final_pkgs paths: final_pkgs
- store_artifacts:
path: /final_pkgs
# This should really just be another step of the binary_linux_build job above. # This should really just be another step of the binary_linux_build job above.
# This isn't possible right now b/c the build job uses the docker executor # This isn't possible right now b/c the build job uses the docker executor
# (otherwise they'd be really really slow) but this one uses the macine # (otherwise they'd be really really slow) but this one uses the macine

View File

@ -181,7 +181,7 @@ fi
# Patch required to build xla # Patch required to build xla
if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
git clone --recursive https://github.com/pytorch/xla.git git clone --recursive -b r1.6 https://github.com/pytorch/xla.git
./xla/scripts/apply_patches.sh ./xla/scripts/apply_patches.sh
fi fi

View File

@ -185,9 +185,9 @@ function get_exit_code() {
function file_diff_from_base() { function file_diff_from_base() {
# The fetch may fail on Docker hosts, but it's not always necessary. # The fetch may fail on Docker hosts, but it's not always necessary.
set +e set +e
git fetch origin master --quiet git fetch origin release/1.6 --quiet
set -e set -e
git diff --name-only "$(git merge-base origin/master HEAD)" > "$1" git diff --name-only "$(git merge-base origin/release/1.6 HEAD)" > "$1"
} }
function get_bazel() { function get_bazel() {
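
The hunk above swaps a hard-coded base branch for another. The sketch below shows the same three git invocations with the base branch as a parameter; `merge_base_diff_commands` is a hypothetical helper that only builds argument lists and leaves running them to the caller.

```python
from typing import List


def merge_base_diff_commands(base_branch: str) -> List[List[str]]:
    """Return the git invocations that list files changed since `base_branch`."""
    remote_ref = f"origin/{base_branch}"
    return [
        # The fetch may fail on some hosts; the shell version tolerates that.
        ["git", "fetch", "origin", base_branch, "--quiet"],
        ["git", "merge-base", remote_ref, "HEAD"],
        # The SHA printed by the previous command replaces <merge-base>.
        ["git", "diff", "--name-only", "<merge-base>"],
    ]


commands = merge_base_diff_commands("release/1.6")
```

Parameterizing the branch name would avoid editing the function on every release branch, at the cost of threading one more value through the CI scripts.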

View File

@ -289,7 +289,7 @@ test_backward_compatibility() {
pushd test/backward_compatibility pushd test/backward_compatibility
python dump_all_function_schemas.py --filename new_schemas.txt python dump_all_function_schemas.py --filename new_schemas.txt
pip_uninstall torch pip_uninstall torch
pip_install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html pip_install --pre torch -f https://download.pytorch.org/whl/test/cpu/torch_test.html
python check_backward_compatibility.py --new-schemas new_schemas.txt python check_backward_compatibility.py --new-schemas new_schemas.txt
popd popd
set +x set +x
@ -341,8 +341,8 @@ elif [[ "${BUILD_ENVIRONMENT}" == *-test2 || "${JOB_BASE_NAME}" == *-test2 ]]; t
elif [[ "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then elif [[ "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
test_bazel test_bazel
elif [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4* ]]; then elif [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4* ]]; then
# test cpp extension for xenial + cuda 9.2 + gcc 5.4 to make sure # test cpp extension for xenial + cuda 9.2 + gcc 5.4 to make sure
# cpp extension can be built correctly under this old env # cpp extension can be built correctly under this old env
test_cpp_extensions test_cpp_extensions
else else
test_torchvision test_torchvision

View File

@ -1350,7 +1350,6 @@ filegroup(
"caffe2/utils/smart_tensor_printer.cc", "caffe2/utils/smart_tensor_printer.cc",
"caffe2/utils/string_utils.cc", "caffe2/utils/string_utils.cc",
"caffe2/utils/threadpool/ThreadPool.cc", "caffe2/utils/threadpool/ThreadPool.cc",
"caffe2/utils/threadpool/ThreadPoolMobile.cc",
"caffe2/utils/threadpool/pthreadpool.cc", "caffe2/utils/threadpool/pthreadpool.cc",
"caffe2/utils/threadpool/pthreadpool_impl.cc", "caffe2/utils/threadpool/pthreadpool_impl.cc",
], ],

View File

@ -481,7 +481,7 @@ if(USE_PYTORCH_QNNPACK)
endif() endif()
if(USE_XNNPACK) if(USE_XNNPACK)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DUSE_XNNPACK")
endif() endif()
if(USE_VULKAN) if(USE_VULKAN)

View File

@ -99,6 +99,7 @@ if(ANDROID_ABI)
import_static_lib(libnnpack) import_static_lib(libnnpack)
import_static_lib(libXNNPACK) import_static_lib(libXNNPACK)
import_static_lib(libpytorch_qnnpack) import_static_lib(libpytorch_qnnpack)
import_static_lib(libpthreadpool)
import_static_lib(libeigen_blas) import_static_lib(libeigen_blas)
import_static_lib(libcpuinfo) import_static_lib(libcpuinfo)
import_static_lib(libclog) import_static_lib(libclog)
@ -115,6 +116,7 @@ if(ANDROID_ABI)
libnnpack libnnpack
libXNNPACK libXNNPACK
libpytorch_qnnpack libpytorch_qnnpack
libpthreadpool
libeigen_blas libeigen_blas
libcpuinfo libcpuinfo
libclog libclog
@ -129,6 +131,7 @@ else()
nnpack nnpack
XNNPACK XNNPACK
pytorch_qnnpack pytorch_qnnpack
pthreadpool
cpuinfo cpuinfo
clog clog
) )

View File

@ -8,8 +8,10 @@
#include "pytorch_jni_common.h" #include "pytorch_jni_common.h"
#if defined(__ANDROID__) #if defined(__ANDROID__)
#include <caffe2/utils/threadpool/ThreadPool.h> #ifndef USE_PTHREADPOOL
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #define USE_PTHREADPOOL
#endif /* USE_PTHREADPOOL */
#include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#endif #endif
namespace pytorch_jni { namespace pytorch_jni {
@ -605,7 +607,7 @@ class PyTorchAndroidJni : public facebook::jni::JavaClass<PyTorchAndroidJni> {
} }
static void setNumThreads(facebook::jni::alias_ref<jclass>, jint numThreads) { static void setNumThreads(facebook::jni::alias_ref<jclass>, jint numThreads) {
caffe2::mobile_threadpool()->setNumThreads(numThreads); caffe2::pthreadpool()->set_thread_count(numThreads);
} }
}; };
#endif #endif

View File

@ -6,8 +6,7 @@
#ifndef C10_MOBILE #ifndef C10_MOBILE
#include <c10/core/thread_pool.h> #include <c10/core/thread_pool.h>
#else #else
#include <caffe2/utils/threadpool/ThreadPool.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h>
#endif // C10_MOBILE #endif // C10_MOBILE
#include <atomic> #include <atomic>
@ -88,15 +87,15 @@ void _run_with_pool(const std::function<void(int, size_t)>& fn, size_t range) {
// Run the first task on the current thread directly. // Run the first task on the current thread directly.
fn(0, 0); fn(0, 0);
#else #else
caffe2::ThreadPool* pool = caffe2::mobile_threadpool(); caffe2::PThreadPool* const pool = caffe2::pthreadpool();
if (pool) { TORCH_INTERNAL_ASSERT(pool, "Invalid thread pool!");
// caffe2::ThreadPool can utilize the current thread.
pool->run(fn, range); pool->run(
} else { // PThreadPool::run() is blocking. A std::function [const] reference to
for (size_t i = 0; i < range; ++i) { // this lambda cannot go out of scope before PThreadPool::run() returns.
fn(0, i); [&fn](const size_t task_id) {
} fn(0 /* unused */, task_id);
} }, range);
#endif // C10_MOBILE #endif // C10_MOBILE
} }
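
The adapter in the hunk above wraps a two-argument callback in a one-argument task and relies on the pool's `run()` being blocking. A rough Python analogue, using `concurrent.futures` instead of pthreadpool, might look like this; `run_with_pool` and `fill_square` are names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_with_pool(fn: Callable[[int, int], None], task_count: int, workers: int = 4) -> None:
    """Blocking data-parallel run: call fn(0, task_id) for every task_id."""

    # Adapter from a one-argument task to the two-argument `fn`; the first
    # argument is unused, mirroring `fn(0 /* unused */, task_id)`.
    def task(task_id: int) -> None:
        fn(0, task_id)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the results so worker exceptions surface here, and
        # the `with` block does not exit until every task has finished.
        list(pool.map(task, range(task_count)))


squares: List[int] = [0] * 8


def fill_square(_unused: int, task_id: int) -> None:
    squares[task_id] = task_id * task_id


run_with_pool(fill_square, len(squares))
```

Since the call blocks until all tasks are done, the closure can safely refer to `fn` by reference, which is the point made in the C++ comment.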
@ -184,7 +183,7 @@ void init_num_threads() {
#endif #endif
#ifdef C10_MOBILE #ifdef C10_MOBILE
caffe2::mobile_threadpool(); caffe2::pthreadpool();
#endif #endif
} }
@ -208,7 +207,9 @@ void set_num_threads(int nthreads) {
} }
} }
#else #else
TORCH_CHECK(false, "set_num_threads is not supported for mobile."); caffe2::PThreadPool* const pool = caffe2::pthreadpool();
TORCH_INTERNAL_ASSERT(pool, "Invalid thread pool!");
pool->set_thread_count(nthreads);
#endif // C10_MOBILE #endif // C10_MOBILE
} }
@ -226,9 +227,9 @@ int get_num_threads() {
return _get_intraop_pool().size() + 1; return _get_intraop_pool().size() + 1;
} }
#else #else
caffe2::ThreadPool* pool = caffe2::mobile_threadpool(); caffe2::PThreadPool* const pool = caffe2::pthreadpool();
// caffe2::ThreadPool::getNumThreads() counts the current thread. TORCH_INTERNAL_ASSERT(pool, "Invalid thread pool!")
return !pool || in_parallel_region() ? 1 /* current thread */ : pool->getNumThreads(); return in_parallel_region() ? 1 /* current thread */ : pool->get_thread_count();
#endif // C10_MOBILE #endif // C10_MOBILE
} }
@ -257,8 +258,8 @@ void intraop_launch(std::function<void()> func) {
func(); func();
} }
#else #else
// TODO: caffe2::ThreadPool doesn't support submitting tasks separately and // TODO: caffe2::PThreadPool only provides a data-parallel API.
// running in parallel. Should fix it when this API becomes popular. // Task parallelism is not currently supported.
func(); func();
#endif // C10_MOBILE #endif // C10_MOBILE
} }
@ -280,8 +281,8 @@ std::shared_ptr<c10::ivalue::Future> intraop_launch_future(
} }
return future; return future;
#else #else
// TODO: caffe2::ThreadPool doesn't support submitting tasks separately and // TODO: caffe2::PThreadPool only provides a data-parallel API.
// running in parallel. Should fix it when this API becomes popular. // Task parallelism is not currently supported.
auto future = std::make_shared<c10::ivalue::Future>(NoneType::get()); auto future = std::make_shared<c10::ivalue::Future>(NoneType::get());
func(); func();
future->markCompleted(); future->markCompleted();

View File

@ -135,6 +135,7 @@ UPTOb( bool , equal , (const Tensor &A, const Tensor &B) )
UPTOb( Tensor, cat , (TensorList A, int64_t B) ) UPTOb( Tensor, cat , (TensorList A, int64_t B) )
UPTOb( Tensor, cat , (TensorList A, Dimname B) ) UPTOb( Tensor, cat , (TensorList A, Dimname B) )
UPTOb( Tensor, _cat , (TensorList A, int64_t B) ) UPTOb( Tensor, _cat , (TensorList A, int64_t B) )
UPTOd( Tensor, index_put, (const Tensor &A, TensorList B, const Tensor & C, bool D) )
UPTOb( Tensor, stack , (TensorList A, int64_t B) ) UPTOb( Tensor, stack , (TensorList A, int64_t B) )
#undef UPTOa #undef UPTOa

View File

@ -482,15 +482,16 @@ TORCH_LIBRARY_IMPL(aten, Autocast, m) {
KERNEL(ADD_NS(addcdiv), "addcdiv", Tensor (const Tensor &, const Tensor &, const Tensor &, Scalar), promote) KERNEL(ADD_NS(addcdiv), "addcdiv", Tensor (const Tensor &, const Tensor &, const Tensor &, Scalar), promote)
KERNEL(ADD_NS(addcmul), "addcmul", Tensor (const Tensor &, const Tensor &, const Tensor &, Scalar), promote) KERNEL(ADD_NS(addcmul), "addcmul", Tensor (const Tensor &, const Tensor &, const Tensor &, Scalar), promote)
KERNEL(ADD_NS(atan2), "atan2", Tensor (const Tensor &, const Tensor &), promote) KERNEL(ADD_NS(atan2), "atan2", Tensor (const Tensor &, const Tensor &), promote)
KERNEL(ADD_NS(cross), "cross", Tensor (const Tensor &, const Tensor &, c10::optional<int64_t>), promote)
KERNEL_UNBOXED_ONLY(ADD_NS(bilinear), "bilinear", Tensor (const Tensor &, const Tensor &, const Tensor &, const Tensor &), promote) KERNEL_UNBOXED_ONLY(ADD_NS(bilinear), "bilinear", Tensor (const Tensor &, const Tensor &, const Tensor &, const Tensor &), promote)
KERNEL_UNBOXED_ONLY(ADD_NS(tensordot), "tensordot", Tensor (const Tensor &, const Tensor &, IntArrayRef, IntArrayRef), promote)
KERNEL_UNBOXED_ONLY(ADD_NS(dot), "dot", Tensor (const Tensor &, const Tensor &), promote)
KERNEL(ADD_NS(equal), "equal", bool (const Tensor &, const Tensor &), promote)
KERNEL(ADD_NS(cat), "cat", Tensor (TensorList, int64_t), promote) KERNEL(ADD_NS(cat), "cat", Tensor (TensorList, int64_t), promote)
KERNEL_UNBOXED_ONLY(ADD_NS(cat), "cat.names", Tensor (TensorList, Dimname), promote) KERNEL_UNBOXED_ONLY(ADD_NS(cat), "cat.names", Tensor (TensorList, Dimname), promote)
KERNEL(ADD_NS(_cat), "_cat", Tensor (TensorList, int64_t), promote) KERNEL(ADD_NS(_cat), "_cat", Tensor (TensorList, int64_t), promote)
KERNEL(ADD_NS(cross), "cross", Tensor (const Tensor &, const Tensor &, c10::optional<int64_t>), promote)
KERNEL_UNBOXED_ONLY(ADD_NS(dot), "dot", Tensor (const Tensor &, const Tensor &), promote)
KERNEL(ADD_NS(equal), "equal", bool (const Tensor &, const Tensor &), promote)
KERNEL_UNBOXED_ONLY(ADD_NS(index_put), "index_put", Tensor (const Tensor &, TensorList, const Tensor &, bool), promote)
KERNEL(ADD_NS(stack), "stack", Tensor (TensorList, int64_t), promote) KERNEL(ADD_NS(stack), "stack", Tensor (TensorList, int64_t), promote)
KERNEL_UNBOXED_ONLY(ADD_NS(tensordot), "tensordot", Tensor (const Tensor &, const Tensor &, IntArrayRef, IntArrayRef), promote)
m.impl_UNBOXED("binary_cross_entropy", &at::autocast::binary_cross_entropy_banned); m.impl_UNBOXED("binary_cross_entropy", &at::autocast::binary_cross_entropy_banned);
} }
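
Every op in the list above is registered with the `promote` policy. As a simplified model of that policy, restricted to float16 and float32 inputs, the op runs in float32 if any input is float32 and in float16 otherwise. The function below is a hypothetical illustration that uses dtype names as strings, not autocast's implementation.

```python
from typing import Sequence


def promoted_dtype(input_dtypes: Sequence[str]) -> str:
    """Widest floating dtype among the inputs (float16 < float32)."""
    return "float32" if "float32" in input_dtypes else "float16"


# e.g. index_put with a float32 destination and float16 values
mixed = promoted_dtype(["float32", "float16"])
# all-float16 inputs stay in float16
uniform = promoted_dtype(["float16", "float16"])
```

This is why adding `index_put` to the list lets mixed-precision inputs reach the op with matching dtypes.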

View File

@ -188,6 +188,7 @@ namespace c10 {
_(prim, unchecked_unwrap_optional) \ _(prim, unchecked_unwrap_optional) \
_(aten, __contains__) \ _(aten, __contains__) \
_(prim, BailoutTemplate) \ _(prim, BailoutTemplate) \
_(prim, grad) \
_(aten, zero_) \ _(aten, zero_) \
_(aten, fill_) \ _(aten, fill_) \
FORALL_ATEN_BASE_SYMBOLS(_) \ FORALL_ATEN_BASE_SYMBOLS(_) \

View File

@ -1481,7 +1481,7 @@ inline TypePtr TensorType::fromBoolType() {
inline c10::optional<c10::ScalarType> tryScalarTypeFromJitType(const c10::TypePtr & type) { inline c10::optional<c10::ScalarType> tryScalarTypeFromJitType(const c10::TypePtr & type) {
if (type == FloatType::get()) { if (type == FloatType::get()) {
return at::ScalarType::Double; return at::typeMetaToScalarType(c10::get_default_dtype());
} else if (type == IntType::get()) { } else if (type == IntType::get()) {
return at::ScalarType::Long; return at::ScalarType::Long;
} else if (type == BoolType::get()) { } else if (type == BoolType::get()) {
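
The effect of this change can be modeled in a few lines of Python. The sketch covers only the two branches whose results are visible in the hunk, and it uses dtype names as strings; the function and its `default_dtype` parameter are stand-ins for the C++ code and the global default dtype.

```python
from typing import Optional


def try_scalar_type_from_jit_type(jit_type: str, default_dtype: str = "float32") -> Optional[str]:
    """Hypothetical Python model of the two branches visible in the hunk."""
    if jit_type == "float":
        # New behavior: follow the default dtype (old behavior: always "float64").
        return default_dtype
    if jit_type == "int":
        return "int64"  # ScalarType::Long
    # Remaining branches fall outside the hunk and are not modeled here.
    return None


with_default = try_scalar_type_from_jit_type("float")
with_double_default = try_scalar_type_from_jit_type("float", default_dtype="float64")
```

With the stock default dtype (float32), a JIT `float` now maps to float32 rather than float64.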

View File

@ -181,6 +181,10 @@ Allocator* CUDAHooks::getPinnedMemoryAllocator() const {
return at::cuda::getPinnedMemoryAllocator(); return at::cuda::getPinnedMemoryAllocator();
} }
Allocator* CUDAHooks::getCUDADeviceAllocator() const {
return at::cuda::getCUDADeviceAllocator();
}
bool CUDAHooks::compiledWithCuDNN() const { bool CUDAHooks::compiledWithCuDNN() const {
return AT_CUDNN_ENABLED(); return AT_CUDNN_ENABLED();
} }

View File

@ -22,6 +22,7 @@ struct CUDAHooks : public at::CUDAHooksInterface {
int64_t current_device() const override; int64_t current_device() const override;
bool hasPrimaryContext(int64_t device_index) const override; bool hasPrimaryContext(int64_t device_index) const override;
c10::optional<int64_t> getDevceIndexWithPrimaryContext() const override; c10::optional<int64_t> getDevceIndexWithPrimaryContext() const override;
Allocator* getCUDADeviceAllocator() const override;
Allocator* getPinnedMemoryAllocator() const override; Allocator* getPinnedMemoryAllocator() const override;
bool compiledWithCuDNN() const override; bool compiledWithCuDNN() const override;
bool compiledWithMIOpen() const override; bool compiledWithMIOpen() const override;

View File

@ -121,6 +121,10 @@ struct CAFFE2_API CUDAHooksInterface {
TORCH_CHECK(false, "Pinned memory requires CUDA. ", CUDA_HELP); TORCH_CHECK(false, "Pinned memory requires CUDA. ", CUDA_HELP);
} }
virtual Allocator* getCUDADeviceAllocator() const {
TORCH_CHECK(false, "CUDADeviceAllocator requires CUDA. ", CUDA_HELP);
}
virtual bool compiledWithCuDNN() const { virtual bool compiledWithCuDNN() const {
return false; return false;
} }
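
The pattern in these three hunks is a base interface whose default method fails with a helpful message, overridden by the CUDA-backed implementation. A minimal Python sketch of that pattern follows; the class names and the returned placeholder string are invented for the example.

```python
class HooksInterface:
    """Default hooks used when CUDA support is not available."""

    def get_device_allocator(self) -> str:
        # Mirrors TORCH_CHECK(false, ...): always fails with an explanation.
        raise RuntimeError("CUDADeviceAllocator requires CUDA.")


class CudaHooks(HooksInterface):
    """Hooks registered by a CUDA-enabled build."""

    def get_device_allocator(self) -> str:
        return "cuda-device-allocator"  # placeholder for the real allocator


try:
    HooksInterface().get_device_allocator()
    default_failed = False
except RuntimeError:
    default_failed = True

allocator = CudaHooks().get_device_allocator()
```

Callers can then ask the hooks object for the allocator without compiling against CUDA themselves.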

View File

@ -262,9 +262,7 @@ auto ConvParams::use_xnnpack(
const at::Tensor& input, const at::Tensor& input,
const at::Tensor& weight, const at::Tensor& weight,
const at::Tensor& bias) const -> bool { const at::Tensor& bias) const -> bool {
// Disable the xnnpack operators for both iOS and macOS temporarily due to the crash in pthreadpool #if defined(C10_MOBILE)
// TODO:T66297472 remove `!defined(__APPLE__)` once we figure out the root cause of the crash.
#if defined(C10_MOBILE) && !defined(__APPLE__)
if (!transposed) { if (!transposed) {
return (input.size(1) == groups) && return (input.size(1) == groups) &&
xnnpack::use_convolution2d( xnnpack::use_convolution2d(

View File

@ -17,9 +17,7 @@ Tensor linear(const Tensor& input, const Tensor& weight, const Tensor& bias) {
if (input.is_mkldnn()) { if (input.is_mkldnn()) {
return at::mkldnn_linear(input, weight, bias); return at::mkldnn_linear(input, weight, bias);
} }
// Disable the xnnpack operators for both iOS and macOS temporarily due to the crash in pthreadpool #if defined(C10_MOBILE)
// TODO:T66297472 remove `!defined(__APPLE__)` once we figure out the root cause of the crash.
#if defined(C10_MOBILE) && !defined(__APPLE__)
if (xnnpack::use_linear(input, weight, bias)) { if (xnnpack::use_linear(input, weight, bias)) {
return xnnpack::linear(input, weight, bias); return xnnpack::linear(input, weight, bias);
} }

View File

@ -58,8 +58,9 @@ bool _nnpack_available() {
#include <nnpack.h> #include <nnpack.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <ATen/native/ConvUtils.h> #include <ATen/native/ConvUtils.h>
#include <ATen/Parallel.h>
namespace at { namespace at {
namespace native { namespace native {
@ -87,15 +88,9 @@ static bool init_nnpack() {
} }
static pthreadpool_t nnpack_threadpool() { static pthreadpool_t nnpack_threadpool() {
// Try initializing a threadpool for NNPACK's use. If we fail to
// successfully initialize an implementation, return nullptr which will
// instruct NNPACK to run single threaded.
#ifdef C10_MOBILE #ifdef C10_MOBILE
// If building for mobile, use Caffe 2's mobile-friendly threadpool. return caffe2::pthreadpool_();
return caffe2::mobile_pthreadpool();
#else #else
// Otherwise, try using pthreadpool if we manage to initialize it successfully.
static pthreadpool_t nnpack_threadpool_ = nullptr; static pthreadpool_t nnpack_threadpool_ = nullptr;
static bool called_nnpack_threadpool_ = false; static bool called_nnpack_threadpool_ = false;

View File

@ -135,9 +135,7 @@ Tensor max_pool2d(
self, kernel_size, stride, padding, dilation, ceil_mode); self, kernel_size, stride, padding, dilation, ceil_mode);
} }
// Disable the xnnpack operators for both iOS and macOS temporarily due to the crash in pthreadpool #if defined(C10_MOBILE)
// TODO:T66297472 remove `!defined(__APPLE__)` once we figure out the root cause of the crash.
#if defined(C10_MOBILE) && !defined(__APPLE__)
if(xnnpack::use_max_pool2d(self, kernel_size, padding, stride, if(xnnpack::use_max_pool2d(self, kernel_size, padding, stride,
dilation, ceil_mode)) { dilation, ceil_mode)) {
return xnnpack::max_pool2d( return xnnpack::max_pool2d(

View File

@ -355,13 +355,12 @@ TensorOptions infer_full_options(
if (!options.has_dtype()) { if (!options.has_dtype()) {
if (fill_value.isIntegral(true)) { if (fill_value.isIntegral(true)) {
TORCH_WARN_ONCE( TORCH_CHECK(false,
"Deprecation warning: In a future PyTorch release torch.full ", "Providing a bool or integral fill value without setting the optional ",
"will no longer return tensors of floating dtype by default. ", "`dtype` or `out` arguments is currently unsupported. In PyTorch 1.7, ",
"Instead, a bool fill_value will return a tensor of torch.bool dtype, ", "when `dtype` and `out` are not set a bool fill value will ",
"and an integral fill_value will return a tensor of torch.long dtype. ", "return a tensor of torch.bool dtype, and an integral fill value ",
"Set the optional `dtype` or `out` arguments to suppress this warning." "will return a tensor of torch.long dtype.");
);
} else if (fill_value.isComplex()) { } else if (fill_value.isComplex()) {
auto scalar_type = (get_default_dtype() == ScalarType::Double) ? auto scalar_type = (get_default_dtype() == ScalarType::Double) ?
ScalarType::ComplexDouble : ScalarType::ComplexDouble :
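
The hunk above turns the former deprecation warning into a hard error. The check can be modeled as follows; `check_full_arguments` is a hypothetical function that ignores the `out` argument and the complex branch, and it raises `RuntimeError` because that is how a failed `TORCH_CHECK` typically surfaces in Python.

```python
from typing import Optional, Union

Number = Union[bool, int, float, complex]


def check_full_arguments(fill_value: Number, dtype: Optional[str] = None) -> None:
    """Reject bool/int fill values when no dtype is given (simplified model)."""
    if dtype is None and isinstance(fill_value, (bool, int)):
        raise RuntimeError(
            "Providing a bool or integral fill value without setting the optional "
            "`dtype` or `out` arguments is currently unsupported."
        )


check_full_arguments(1.5)               # float fill: accepted
check_full_arguments(7, dtype="int64")  # explicit dtype: accepted
try:
    check_full_arguments(7)             # integral fill, no dtype: rejected
    rejected = False
except RuntimeError:
    rejected = True
```

Passing an explicit `dtype` is therefore the way to keep integral fill values working on this release.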

View File

@ -706,8 +706,9 @@ TensorIterator TensorIterator::unary_op(Tensor& out, const Tensor& a,
.set_check_mem_overlap(check_mem_overlap) .set_check_mem_overlap(check_mem_overlap)
.add_output(out) .add_output(out)
.add_input(a) .add_input(a)
.cast_common_dtype_to_outputs(true) .cast_common_dtype_to_outputs(false)
.enforce_safe_casting_to_output(true) .enforce_safe_casting_to_output(false)
.check_all_same_dtype(true)
.build(); .build();
} }

View File

@ -762,7 +762,12 @@ Tensor repeat(const Tensor& self, IntArrayRef repeats) {
Tensor xtensor = self.expand(padded_size); Tensor xtensor = self.expand(padded_size);
Tensor result = at::empty(target_size, self.options()); Tensor result;
if (self.is_quantized()) {
result = at::empty_quantized(target_size, self);
} else {
result = at::empty(target_size, self.options());
}
// return an empty tensor if one of the repeat dimensions is zero // return an empty tensor if one of the repeat dimensions is zero
if (zero_tensor) { if (zero_tensor) {

View File

@ -67,7 +67,7 @@ static inline Tensor& unary_op_impl_with_complex_to_float_out(Tensor& result, co
// Copies the complex result to the actual result and returns it // Copies the complex result to the actual result and returns it
result.resize_(complex_result.sizes()); result.resize_(complex_result.sizes());
result.copy_(complex_result); result.copy_(at::real(complex_result));
return result; return result;
} }
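
The one-line change in this hunk copies only the real component of the complex intermediate into the non-complex result. As a rough illustration outside ATen, the sketch below does the same with `std::complex`; `real_parts` is a hypothetical helper, not a PyTorch function, and it only models the element-wise "take the real part, then copy" step.

```cpp
#include <complex>
#include <vector>

// Hypothetical helper (not part of ATen): copy only the real component of
// each complex element into a float buffer, discarding the imaginary part.
std::vector<float> real_parts(const std::vector<std::complex<float>>& in) {
  std::vector<float> out;
  out.reserve(in.size());
  for (const auto& z : in) {
    out.push_back(z.real());
  }
  return out;
}
```

Calling `real_parts` on a vector holding `3+4i` and `-1+0.5i` yields a float vector holding `3` and `-1`.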

@ -1127,6 +1127,12 @@
variants: method variants: method
device_guard: False device_guard: False
- func: empty_quantized(int[] size, Tensor qtensor) -> Tensor
variants: function
dispatch:
QuantizedCPU: empty_quantized
QuantizedCUDA: empty_quantized
- func: empty.out(int[] size, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!) - func: empty.out(int[] size, *, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!)
device_guard: False device_guard: False
@ -5108,6 +5114,8 @@
dispatch: dispatch:
CPU: unfold CPU: unfold
CUDA: unfold CUDA: unfold
QuantizedCPU: unfold
QuantizedCUDA: unfold
- func: unfold_backward(Tensor grad_in, int[] input_sizes, int dim, int size, int step) -> Tensor - func: unfold_backward(Tensor grad_in, int[] input_sizes, int dim, int size, int step) -> Tensor
variants: function variants: function

@ -76,5 +76,28 @@ Tensor empty_per_channel_affine_quantized_other_backends_stub(
TORCH_CHECK(false, "Creation of quantized tensor requires quantized dtype like torch.quint8"); TORCH_CHECK(false, "Creation of quantized tensor requires quantized dtype like torch.quint8");
} }
// Create an empty quantized Tensor with size, based on the options
// and quantization parameters of the input quantized Tensor
Tensor empty_quantized(IntArrayRef size, const Tensor& qtensor) {
Tensor output;
if (qtensor.qscheme() == kPerTensorAffine) {
output = at::_empty_affine_quantized(size, qtensor.options(),
qtensor.q_scale(),
qtensor.q_zero_point());
} else if (qtensor.qscheme() == kPerChannelAffine) {
output = at::_empty_per_channel_affine_quantized(
size,
qtensor.q_per_channel_scales(),
qtensor.q_per_channel_zero_points(),
qtensor.q_per_channel_axis(),
qtensor.options());
} else {
TORCH_CHECK(false,
"QScheme not supported by empty_quantized:",
toString(qtensor.qscheme()));
}
return output;
}
} // namespace native } // namespace native
} // namespace at } // namespace at

@ -5,7 +5,7 @@
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <ATen/native/quantized/cpu/quantized_ops.h> #include <ATen/native/quantized/cpu/quantized_ops.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <c10/util/math_compat.h> #include <c10/util/math_compat.h>
#include <algorithm> #include <algorithm>
@ -375,7 +375,7 @@ Tensor qnnpack_avg_pool2d(
CAFFE_ENFORCE( CAFFE_ENFORCE(
setupStatus == pytorch_qnnp_status_success, setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK Average Pooling operator"); "failed to setup QNNPACK Average Pooling operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(qnnpack_operator, threadpool); pytorch_qnnp_run_operator(qnnpack_operator, threadpool);
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(

@ -5,7 +5,6 @@
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <ATen/native/quantized/cpu/quantized_ops.h> #include <ATen/native/quantized/cpu/quantized_ops.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h>
#include <c10/util/math_compat.h> #include <c10/util/math_compat.h>
#include <algorithm> #include <algorithm>

@ -7,7 +7,7 @@
#include <ATen/native/quantized/cpu/quantized_ops.h> #include <ATen/native/quantized/cpu/quantized_ops.h>
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <algorithm> #include <algorithm>
@ -194,7 +194,7 @@ Tensor qnnpack_add(Tensor qa, Tensor qb, double scale, int64_t zero_point) {
setupStatus == pytorch_qnnp_status_success, setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK Add operator"); "failed to setup QNNPACK Add operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(qnnpack_operator, threadpool); pytorch_qnnp_run_operator(qnnpack_operator, threadpool);

@ -8,7 +8,7 @@
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <c10/core/TensorOptions.h> #include <c10/core/TensorOptions.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <algorithm> #include <algorithm>
@ -82,7 +82,7 @@ Tensor quantized_channel_shuffle_impl(
setupStatus == pytorch_qnnp_status_success, setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK ChannelShuffle operator"); "failed to setup QNNPACK ChannelShuffle operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(qnnpack_operator, threadpool); pytorch_qnnp_run_operator(qnnpack_operator, threadpool);
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(

@ -7,7 +7,7 @@
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <ATen/quantized/Quantizer.h> #include <ATen/quantized/Quantizer.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <algorithm> #include <algorithm>
@ -64,7 +64,7 @@ Tensor qnnpack_clamp(Tensor input, Scalar min, Scalar max) {
TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success, TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK Clamp operator"); "failed to setup QNNPACK Clamp operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(clamp_op, threadpool); pytorch_qnnp_run_operator(clamp_op, threadpool);

@ -10,7 +10,7 @@
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <ATen/native/quantized/cpu/quant_utils.h> #include <ATen/native/quantized/cpu/quant_utils.h>
#include <ATen/native/quantized/cpu/conv_packed_params.h> #include <ATen/native/quantized/cpu/conv_packed_params.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
template <int kSpatialDim = 2> template <int kSpatialDim = 2>
bool ConvDimChecks( bool ConvDimChecks(
@ -603,7 +603,7 @@ at::Tensor PackedConvWeightsQnnp<kSpatialDim>::apply_impl(
output_min, output_min,
output_max, output_max,
reinterpret_cast<uint8_t*>(output.template data_ptr<c10::quint8>()), reinterpret_cast<uint8_t*>(output.template data_ptr<c10::quint8>()),
caffe2::mobile_pthreadpool()); caffe2::pthreadpool_());
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(
run_status == pytorch_qnnp_status_success, run_status == pytorch_qnnp_status_success,

@ -5,7 +5,7 @@
#include <ATen/native/quantized/cpu/quantized_ops.h> #include <ATen/native/quantized/cpu/quantized_ops.h>
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <algorithm> #include <algorithm>
@ -57,7 +57,7 @@ Tensor qnnpack_hardsigmoid(Tensor input) {
TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success, TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK Hardsigmoid operator"); "failed to setup QNNPACK Hardsigmoid operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(hardsigmoid_op, threadpool); pytorch_qnnp_run_operator(hardsigmoid_op, threadpool);

@ -5,7 +5,7 @@
#include <ATen/native/quantized/cpu/quantized_ops.h> #include <ATen/native/quantized/cpu/quantized_ops.h>
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <algorithm> #include <algorithm>
@ -51,7 +51,7 @@ Tensor qnnpack_hardswish(const Tensor& qx, Tensor& qy) {
TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success, TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK Hardswish operator"); "failed to setup QNNPACK Hardswish operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(hardswish_op, threadpool); pytorch_qnnp_run_operator(hardswish_op, threadpool);

@ -4,7 +4,7 @@
#include <ATen/native/quantized/cpu/fbgemm_utils.h> #include <ATen/native/quantized/cpu/fbgemm_utils.h>
#include <ATen/native/quantized/cpu/packed_params.h> #include <ATen/native/quantized/cpu/packed_params.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <torch/custom_class.h> #include <torch/custom_class.h>
#include <torch/library.h> #include <torch/library.h>
@ -341,7 +341,9 @@ at::Tensor PackedLinearWeightsQnnp::apply_impl(
packB->getPackedWeights(), packB->getPackedWeights(),
(uint8_t*)output.data_ptr<c10::quint8>(), (uint8_t*)output.data_ptr<c10::quint8>(),
rows_w /* output_stride */, rows_w /* output_stride */,
caffe2::mobile_pthreadpool() /* threadpool */); // TODO (Ashkan): Disabling temporarily.
// Throws a floating point exception with OSS pthreadpool.
caffe2::pthreadpool_() /* threadpool */);
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(
runStatus == pytorch_qnnp_status_success, runStatus == pytorch_qnnp_status_success,

@ -5,7 +5,7 @@
#include <ATen/native/quantized/cpu/packed_params.h> #include <ATen/native/quantized/cpu/packed_params.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <ATen/native/quantized/cpu/quant_utils.h> #include <ATen/native/quantized/cpu/quant_utils.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <torch/library.h> #include <torch/library.h>
#include <torch/custom_class.h> #include <torch/custom_class.h>
@ -241,8 +241,17 @@ at::Tensor PackedLinearWeightsQnnp::apply_dynamic_impl(at::Tensor input) {
// Calculate statistics for quantization of input Tensor // Calculate statistics for quantization of input Tensor
// TODO: optimized kernel // TODO: optimized kernel
float x_min = input_contig.min().item<float>(); float x_min;
float x_max = input_contig.max().item<float>(); float x_max;
if (input.numel() > 0) {
x_min = input_contig.min().item<float>();
x_max = input_contig.max().item<float>();
} else {
// On empty input, no output data will be generated,
// so use arbitrary qparams.
x_min = 0;
x_max = 0;
}
auto q_params = quant_utils::ChooseQuantizationParams( auto q_params = quant_utils::ChooseQuantizationParams(
/*min=*/x_min, /*min=*/x_min,
@ -327,7 +336,7 @@ at::Tensor PackedLinearWeightsQnnp::apply_dynamic_impl(at::Tensor input) {
bias_ptr, bias_ptr,
output.data_ptr<float>(), output.data_ptr<float>(),
rows_w /* output_stride */, rows_w /* output_stride */,
caffe2::mobile_pthreadpool() /* threadpool */); caffe2::pthreadpool_() /* threadpool */);
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(
runStatus == pytorch_qnnp_status_success, runStatus == pytorch_qnnp_status_success,
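
The first hunk above guards the min/max computation so that an empty input falls back to arbitrary quantization parameters, presumably because a min/max reduction over zero elements has no well-defined result. The sketch below shows the same guard on a plain `std::vector<float>`; `min_max_or_zero` is a hypothetical helper, not part of PyTorch.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: returns {min, max} of `values`. For an empty input it
// returns {0, 0}; no output is produced in that case, so the range is
// arbitrary, which mirrors the comment in the diff.
std::pair<float, float> min_max_or_zero(const std::vector<float>& values) {
  if (values.empty()) {
    return {0.0f, 0.0f};
  }
  const auto mm = std::minmax_element(values.begin(), values.end());
  return {*mm.first, *mm.second};
}
```

Without the `empty()` check, dereferencing the iterators returned by `std::minmax_element` on an empty range would be undefined behavior, which is the standard-library analogue of the failure this hunk avoids.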

@ -100,6 +100,12 @@ enum pytorch_qnnp_status qnnpackLinearDynamic(
.ukernel = pytorch_qnnp_params.q8conv.gemm_dq, .ukernel = pytorch_qnnp_params.q8conv.gemm_dq,
}; };
if (output_size == 0) {
// pthreadpool can tolerate a range of 0, but not a tile of 0.
// We use output_size as a tile size, so bail here if it's 0.
return pytorch_qnnp_status_success;
}
pthreadpool_compute_4d_tiled( pthreadpool_compute_4d_tiled(
threadpool, threadpool,
(pthreadpool_function_4d_tiled_t)compute_q8gemm_dq, (pthreadpool_function_4d_tiled_t)compute_q8gemm_dq,

@ -98,6 +98,12 @@ enum pytorch_qnnp_status qnnpackLinear(
.ukernel = pytorch_qnnp_params.q8conv.gemm, .ukernel = pytorch_qnnp_params.q8conv.gemm,
}; };
if (output_size == 0) {
// pthreadpool can tolerate a range of 0, but not a tile of 0.
// We use output_size as a tile size, so bail here if it's 0.
return pytorch_qnnp_status_success;
}
pthreadpool_compute_4d_tiled( pthreadpool_compute_4d_tiled(
threadpool, threadpool,
(pthreadpool_function_4d_tiled_t) compute_q8gemm, (pthreadpool_function_4d_tiled_t) compute_q8gemm,
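
The early return added to both `qnnpackLinearDynamic` and `qnnpackLinear` can be illustrated in isolation. The sketch below is not pthreadpool's code: `tile_count` and `run_tiled` are hypothetical names, and it assumes that a tiled loop derives its iteration count by dividing the range by the tile size and rounding up. Under that assumption a tile of 0 is an integer division by zero (on x86 this typically surfaces as SIGFPE, reported as a "floating point exception"), whereas a range of 0 simply produces zero iterations.

```cpp
#include <cstddef>

// Hypothetical helper: number of tiles needed to cover `range` elements with
// tiles of `tile` elements each (ceiling division).
// Precondition: tile != 0, otherwise this divides by zero.
std::size_t tile_count(std::size_t range, std::size_t tile) {
  return (range + tile - 1) / tile;
}

// Hypothetical stand-in for a tiled dispatch; returns how many tiles it would
// process. Mirrors the guard in the diff: a tile size of 0 means there is no
// work, so bail out before any division happens.
std::size_t run_tiled(std::size_t range, std::size_t tile) {
  if (tile == 0) {
    return 0;
  }
  return tile_count(range, tile);
}
```

With this guard, `run_tiled(10, 4)` covers the range in 3 tiles, `run_tiled(0, 4)` does no work, and `run_tiled(8, 0)` returns immediately instead of dividing by zero.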

@ -9,7 +9,7 @@
#include <ATen/native/quantized/cpu/quantized_ops.h> #include <ATen/native/quantized/cpu/quantized_ops.h>
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <algorithm> #include <algorithm>
#include <vector> #include <vector>
@ -346,7 +346,7 @@ void check_maxpool2d_params(
setupStatus == pytorch_qnnp_status_success, setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK MaxPool operator"); "failed to setup QNNPACK MaxPool operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(qnnpack_operator, threadpool); pytorch_qnnp_run_operator(qnnpack_operator, threadpool);
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(

@ -3,7 +3,7 @@
#include <ATen/NativeFunctions.h> #include <ATen/NativeFunctions.h>
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
namespace at { namespace at {
namespace native { namespace native {
@ -66,7 +66,7 @@ Tensor qnnpack_mean(const Tensor& input, IntArrayRef dim) {
CAFFE_ENFORCE( CAFFE_ENFORCE(
setupStatus == pytorch_qnnp_status_success, setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK Global Average Pooling operator"); "failed to setup QNNPACK Global Average Pooling operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(qnnpack_operator, threadpool); pytorch_qnnp_run_operator(qnnpack_operator, threadpool);
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(

@ -6,7 +6,7 @@
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <ATen/native/quantized/cpu/quantized_ops.h> #include <ATen/native/quantized/cpu/quantized_ops.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <torch/library.h> #include <torch/library.h>
#include <algorithm> #include <algorithm>
@ -69,7 +69,7 @@ Tensor qnnpack_relu(Tensor input) {
setupStatus == pytorch_qnnp_status_success, setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK Relu operator"); "failed to setup QNNPACK Relu operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(qnnpack_operator, threadpool); pytorch_qnnp_run_operator(qnnpack_operator, threadpool);

@ -7,7 +7,7 @@
#include <ATen/native/quantized/cpu/quantized_ops.h> #include <ATen/native/quantized/cpu/quantized_ops.h>
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <algorithm> #include <algorithm>
@ -66,7 +66,7 @@ Tensor qnnpack_sigmoid(Tensor input) {
TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success, TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK sigmoid operator"); "failed to setup QNNPACK sigmoid operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(sigmoid_op, threadpool); pytorch_qnnp_run_operator(sigmoid_op, threadpool);

@ -7,7 +7,7 @@
#include <ATen/native/quantized/cpu/quantized_ops.h> #include <ATen/native/quantized/cpu/quantized_ops.h>
#include <ATen/native/quantized/cpu/init_qnnpack.h> #include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h> #include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <algorithm> #include <algorithm>
@ -64,7 +64,7 @@ Tensor qnnpack_tanh(Tensor input) {
TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success, TORCH_INTERNAL_ASSERT(setupStatus == pytorch_qnnp_status_success,
"failed to setup QNNPACK TanH operator"); "failed to setup QNNPACK TanH operator");
pthreadpool_t threadpool = caffe2::mobile_pthreadpool(); pthreadpool_t threadpool = caffe2::pthreadpool_();
const pytorch_qnnp_status runStatus = const pytorch_qnnp_status runStatus =
pytorch_qnnp_run_operator(tanh_op, threadpool); pytorch_qnnp_run_operator(tanh_op, threadpool);

@ -5,7 +5,7 @@
#ifdef USE_XNNPACK #ifdef USE_XNNPACK
#include <xnnpack.h> #include <xnnpack.h>
#include <caffe2/utils/threadpool/ThreadPoolXNNPACK.h> #include <caffe2/utils/threadpool/pthreadpool-cpp.h>
namespace at { namespace at {
namespace native { namespace native {

@ -208,15 +208,15 @@ Tensor run(
padded_input_nhwc.size(Layout::Activation4D::width), // input_width padded_input_nhwc.size(Layout::Activation4D::width), // input_width
padded_input_nhwc.data_ptr<float>(), // input padded_input_nhwc.data_ptr<float>(), // input
output.data_ptr<float>(), // output output.data_ptr<float>(), // output
caffe2::xnnpack_threadpool()); // threadpool caffe2::pthreadpool_()); // threadpool
TORCH_CHECK( TORCH_CHECK(
xnn_status_success == setup_status, xnn_status_success == setup_status,
"xnn_setup_convolution2d_nhwc_f32 failed!"); "xnn_setup_convolution2d_nhwc_f32 failed!");
const xnn_status run_status = xnn_run_operator( const xnn_status run_status = xnn_run_operator(
context.op.get(), // operator context.op.get(), // operator
caffe2::xnnpack_threadpool()); // threadpool caffe2::pthreadpool_()); // threadpool
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(
xnn_status_success == run_status, xnn_status_success == run_status,

@ -137,15 +137,15 @@ Tensor run(
Layout::ActivationND::batch(padded_input.sizes()), // Batch, Layout::ActivationND::batch(padded_input.sizes()), // Batch,
padded_input.data_ptr<float>(), // input padded_input.data_ptr<float>(), // input
output.data_ptr<float>(), // output output.data_ptr<float>(), // output
caffe2::xnnpack_threadpool()); // threadpool caffe2::pthreadpool_()); // threadpool
TORCH_CHECK( TORCH_CHECK(
xnn_status_success == setup_status, xnn_status_success == setup_status,
"xnn_setup_fully_connected_nc_f32 failed!"); "xnn_setup_fully_connected_nc_f32 failed!");
const xnn_status run_status = xnn_run_operator( const xnn_status run_status = xnn_run_operator(
context.op.get(), // operator context.op.get(), // operator
caffe2::xnnpack_threadpool()); // threadpool caffe2::pthreadpool_()); // threadpool
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(
xnn_status_success == run_status, xnn_status_success == run_status,

@ -219,15 +219,15 @@ Tensor max_pool2d(
input_padded_contig_nhwc.size(Layout::Activation4D::width), // input_width input_padded_contig_nhwc.size(Layout::Activation4D::width), // input_width
input_padded_contig_nhwc.data_ptr<float>(), // input input_padded_contig_nhwc.data_ptr<float>(), // input
output_padded_contig_nhwc.data_ptr<float>(), // output output_padded_contig_nhwc.data_ptr<float>(), // output
caffe2::xnnpack_threadpool()); // threadpool caffe2::pthreadpool_()); // threadpool
TORCH_CHECK( TORCH_CHECK(
xnn_status_success == setup_status, xnn_status_success == setup_status,
"xnn_setup_max_pooling2d_nhwc_f32 failed!"); "xnn_setup_max_pooling2d_nhwc_f32 failed!");
const xnn_status run_status = xnn_run_operator( const xnn_status run_status = xnn_run_operator(
max_pool_op, // operator max_pool_op, // operator
caffe2::xnnpack_threadpool()); // threadpool caffe2::pthreadpool_()); // threadpool
TORCH_INTERNAL_ASSERT( TORCH_INTERNAL_ASSERT(
xnn_status_success == run_status, xnn_status_success == run_status,

@ -4,10 +4,10 @@
#include <ATen/NativeFunctions.h> #include <ATen/NativeFunctions.h>
#include <ATen/Parallel.h> #include <ATen/Parallel.h>
#include <ATen/core/Tensor.h> #include <ATen/core/Tensor.h>
#include <ATen/detail/CUDAHooksInterface.h>
#include <ATen/native/TensorFactories.h> #include <ATen/native/TensorFactories.h>
#include <ATen/native/quantized/affine_quantizer.h> #include <ATen/native/quantized/affine_quantizer.h>
#include <ATen/quantized/QTensorImpl.h> #include <ATen/quantized/QTensorImpl.h>
#include <c10/core/Allocator.h>
#include <c10/core/CPUAllocator.h> #include <c10/core/CPUAllocator.h>
#include <cmath> #include <cmath>
#include <typeinfo> #include <typeinfo>
@ -66,7 +66,9 @@ inline Tensor new_qtensor(
const TensorOptions& options, const TensorOptions& options,
QuantizerPtr quantizer) { QuantizerPtr quantizer) {
auto memory_format = options.memory_format_opt().value_or(MemoryFormat::Contiguous); auto memory_format = options.memory_format_opt().value_or(MemoryFormat::Contiguous);
at::Allocator* allocator = GetAllocator(options.device().type()); at::Allocator* allocator = options.device().type() == DeviceType::CUDA
? at::detail::getCUDAHooks().getCUDADeviceAllocator()
: at::getCPUAllocator();
#ifdef USE_PYTORCH_QNNPACK #ifdef USE_PYTORCH_QNNPACK
if (at::globalContext().qEngine() == at::QEngine::QNNPACK) { if (at::globalContext().qEngine() == at::QEngine::QNNPACK) {

@ -87,7 +87,6 @@ endif()
# Note: the folders that are being commented out have not been properly # Note: the folders that are being commented out have not been properly
# addressed yet. # addressed yet.
# For pthreadpool_new_if_impl. TODO: Remove when threadpools are unitied.
if(NOT MSVC AND USE_XNNPACK) if(NOT MSVC AND USE_XNNPACK)
if(NOT TARGET fxdiv) if(NOT TARGET fxdiv)
set(FXDIV_BUILD_TESTS OFF CACHE BOOL "") set(FXDIV_BUILD_TESTS OFF CACHE BOOL "")
@ -96,10 +95,6 @@ if(NOT MSVC AND USE_XNNPACK)
"${FXDIV_SOURCE_DIR}" "${FXDIV_SOURCE_DIR}"
"${CMAKE_BINARY_DIR}/FXdiv") "${CMAKE_BINARY_DIR}/FXdiv")
endif() endif()
if(NOT (INTERN_BUILD_MOBILE AND NOT BUILD_CAFFE2_MOBILE))
set_source_files_properties(
utils/threadpool/pthreadpool_new_if_impl.c PROPERTIES COMPILE_FLAGS -fno-openmp)
endif()
endif() endif()
add_subdirectory(core) add_subdirectory(core)

@ -818,6 +818,67 @@ c10::optional<int> OperatorBase::argumentIndexWithName(
#endif #endif
} }
bool OperatorBase::RunAsync(int stream_id) {
try {
auto result = Run(stream_id);
if (result) {
if (HasAsyncPart()) {
RecordEvent();
} else {
SetEventFinished();
}
} else {
SetEventFinished(getErrorMsg().c_str());
}
return result;
} catch (EnforceNotMet& err) {
SetEventFinishedWithException(err.what());
throw;
} catch (const std::exception& err) {
SetEventFinishedWithException(err.what());
throw;
} catch (...) {
SetEventFinishedWithException(getErrorMsg().c_str());
throw;
}
}
void OperatorBase::AddRelatedBlobInfo(EnforceNotMet* err) {
CAFFE_ENFORCE(
isLegacyOperator(),
"AddRelatedBlobInfo(err) not supported for operators exported to c10.");
if (!has_debug_def()) {
return;
}
bool found_input = false;
bool found_output = false;
if (err->caller() != nullptr) {
std::ostringstream oss;
for (size_t i = 0; i < inputs_.size(); i++) {
if (inputs_[i]->GetRaw() == err->caller()) {
found_input = true;
oss << "while accessing input: " << debug_def().input(i);
break;
}
}
for (size_t i = 0; i < outputs_.size(); i++) {
if (outputs_[i]->GetRaw() == err->caller()) {
found_output = true;
if (found_input) {
oss << " OR ";
}
oss << "while accessing output: " << debug_def().output(i);
break;
}
}
if (found_input || found_output) {
err->add_context(oss.str());
}
}
}
OperatorBase::~OperatorBase() noexcept = default; OperatorBase::~OperatorBase() noexcept = default;
#ifndef C10_MOBILE #ifndef C10_MOBILE

@ -480,70 +480,13 @@ class CAFFE2_API OperatorBase : public Observable<OperatorBase> {
virtual void CancelAsyncCallback() {} virtual void CancelAsyncCallback() {}
// RunAsync, if implemenented by the specific operators, will schedule the // RunAsync, if implemented by the specific operators, will schedule the
// computation on the corresponding context and record the event in its // computation on the corresponding context and record the event in its
// event_ member object. If the specific operator does not support RunAsync, // event_ member object. If the specific operator does not support RunAsync,
// it will simply be synchronous as a fallback. // it will simply be synchronous as a fallback.
virtual bool RunAsync(int stream_id = 0) { virtual bool RunAsync(int stream_id = 0);
try {
auto result = Run(stream_id);
if (result) {
if (HasAsyncPart()) {
RecordEvent();
} else {
SetEventFinished();
}
} else {
SetEventFinished(getErrorMsg().c_str());
}
return result;
} catch (EnforceNotMet& err) {
SetEventFinishedWithException(err.what());
throw;
} catch (const std::exception& err) {
SetEventFinishedWithException(err.what());
throw;
} catch (...) {
SetEventFinishedWithException(getErrorMsg().c_str());
throw;
}
}
virtual void AddRelatedBlobInfo(EnforceNotMet* err) { virtual void AddRelatedBlobInfo(EnforceNotMet* err);
CAFFE_ENFORCE(
isLegacyOperator(),
"AddRelatedBlobInfo(err) not supported for operators exported to c10.");
if (!has_debug_def()) {
return;
}
bool found_input = false;
bool found_output = false;
if (err->caller() != nullptr) {
std::ostringstream oss;
for (size_t i = 0; i < inputs_.size(); i++) {
if (inputs_[i]->GetRaw() == err->caller()) {
found_input = true;
oss << "while accessing input: " << debug_def().input(i);
break;
}
}
for (size_t i = 0; i < outputs_.size(); i++) {
if (outputs_[i]->GetRaw() == err->caller()) {
found_output = true;
if (found_input) {
oss << " OR ";
}
oss << "while accessing output: " << debug_def().output(i);
break;
}
}
if (found_input || found_output) {
err->add_context(oss.str());
}
}
}
virtual std::string debug_info_string() const { virtual std::string debug_info_string() const {
return ""; return "";

@ -3,6 +3,25 @@
namespace caffe2 { namespace caffe2 {
OpSchema::OpSchema(const string& type, const string& file, const int line)
: type_(type), file_(file), line_(line), tensor_inference_function_(
[](const OperatorDef& def, const vector<TensorShape>&) {
vector<TensorShape> out;
for (int i = 0; i < def.output_size(); i++) {
TensorShape ts;
ts.set_unknown_shape(true);
out.push_back(ts);
}
return out;
}), device_inference_function_(
[](const OperatorDef& def) {
auto op_device =
def.has_device_option() ? def.device_option() : DeviceOption();
vector<DeviceOption> in_dev(def.input_size(), op_device);
vector<DeviceOption> out_dev(def.output_size(), op_device);
return std::make_pair(in_dev, out_dev);
}) {}
bool OpSchema::Verify(const OperatorDef& def) const { bool OpSchema::Verify(const OperatorDef& def) const {
// Check the number of inputs. // Check the number of inputs.
if (def.input_size() < min_input_ || def.input_size() > max_input_) { if (def.input_size() < min_input_ || def.input_size() > max_input_) {

@ -39,9 +39,8 @@ constexpr int kCannotComputeNumOutputs = -1;
*/ */
class CAFFE2_API OpSchema { class CAFFE2_API OpSchema {
public: public:
OpSchema() : type_("unknown"), file_("unknown"), line_(0) {} OpSchema() : OpSchema("unknown", "unknown", 0) {}
OpSchema(const string& type, const string& file, const int line) OpSchema(const string& type, const string& file, const int line);
: type_(type), file_(file), line_(line) {}
/** /**
* @brief Returns the file that the op schema is registered from. * @brief Returns the file that the op schema is registered from.
@ -443,25 +442,9 @@ class CAFFE2_API OpSchema {
std::function<bool(int, int)> inplace_enforced_ = [](int, int) { std::function<bool(int, int)> inplace_enforced_ = [](int, int) {
return false; return false;
}; };
TensorInferenceFunctionType tensor_inference_function_ = TensorInferenceFunctionType tensor_inference_function_;
[](const OperatorDef& def, const vector<TensorShape>&) {
vector<TensorShape> out;
for (int i = 0; i < def.output_size(); i++) {
TensorShape ts;
ts.set_unknown_shape(true);
out.push_back(ts);
}
return out;
};
std::unique_ptr<CostInferenceFunctionType> cost_inference_function_ = nullptr; std::unique_ptr<CostInferenceFunctionType> cost_inference_function_ = nullptr;
DeviceInferenceFunctionType device_inference_function_ = DeviceInferenceFunctionType device_inference_function_;
[](const OperatorDef& def) {
auto op_device =
def.has_device_option() ? def.device_option() : DeviceOption();
vector<DeviceOption> in_dev(def.input_size(), op_device);
vector<DeviceOption> out_dev(def.output_size(), op_device);
return std::make_pair(in_dev, out_dev);
};
std::function<std::vector<TensorFiller>( std::function<std::vector<TensorFiller>(
const std::vector<std::vector<int64_t>>&)> const std::vector<std::vector<int64_t>>&)>
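
The `OpSchema` hunks above move the default `std::function` members out of in-class initializers and into a single out-of-line constructor, with the default constructor delegating to it. The diff does not state the motivation; one plausible reading is that it keeps the lambda bodies out of every translation unit that includes the header. The sketch below is a hypothetical miniature of that pattern, not the real `OpSchema`; the class `Schema`, its members, and the `-1` "unknown" sentinel are all invented for illustration.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical miniature of the pattern; not the real OpSchema.
class Schema {
 public:
  // The default constructor delegates, so the defaults live in one place.
  Schema() : Schema("unknown", 0) {}

  // In a header/source split, only the declaration would stay in the header
  // and this definition would move to the .cpp file.
  Schema(std::string type, int line)
      : type_(std::move(type)),
        line_(line),
        // Default behavior: report every output as "unknown" (-1 here).
        infer_([](int /*output_index*/) { return -1; }) {}

  const std::string& type() const { return type_; }
  int line() const { return line_; }
  int infer(int output_index) const { return infer_(output_index); }

 private:
  std::string type_;
  int line_;
  std::function<int(int)> infer_;  // no in-class initializer anymore
};
```

A default-constructed `Schema` reports type `"unknown"`, line `0`, and the default inference result, exactly as if it had been built through the three-argument path, because both constructors share one initialization list.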

@ -88,7 +88,7 @@ class Int8AddOp final : public Operator<CPUContext> {
setupStatus == qnnp_status_success, setupStatus == qnnp_status_success,
"failed to setup QNNPACK add operator"); "failed to setup QNNPACK add operator");
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */);
#else #else


@ -80,7 +80,7 @@ class Int8AveragePoolOp final : public ConvPoolOpBase<CPUContext> {
setupStatus == qnnp_status_success, setupStatus == qnnp_status_success,
"failed to setup QNNPACK Global Average Pooling operator"); "failed to setup QNNPACK Global Average Pooling operator");
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackGlobalOperator_, qnnp_run_operator(this->qnnpackGlobalOperator_,
nullptr /* thread pool */); nullptr /* thread pool */);
@ -122,7 +122,7 @@ class Int8AveragePoolOp final : public ConvPoolOpBase<CPUContext> {
setupStatus == qnnp_status_success, setupStatus == qnnp_status_success,
"failed to setup QNNPACK Average Pooling operator"); "failed to setup QNNPACK Average Pooling operator");
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */);
#else #else


@ -72,7 +72,7 @@ class Int8ChannelShuffleOp final : public ConvPoolOpBase<CPUContext> {
setupStatus == qnnp_status_success, setupStatus == qnnp_status_success,
"failed to setup QNNPACK channel shuffle operator"); "failed to setup QNNPACK channel shuffle operator");
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */);
#else #else


@ -141,7 +141,7 @@ class Int8ConvOp final : public ConvPoolOpBase<CPUContext> {
lastOutputPointer_ = Y->t.template mutable_data<uint8_t>(); lastOutputPointer_ = Y->t.template mutable_data<uint8_t>();
} }
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackObject_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackObject_, nullptr /* thread pool */);
#else #else


@ -140,7 +140,7 @@ class Int8ConvTransposeOp final : public ConvTransposeUnpoolBase<CPUContext> {
lastOutputPointer_ = Y->t.template mutable_data<uint8_t>(); lastOutputPointer_ = Y->t.template mutable_data<uint8_t>();
} }
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackObject_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackObject_, nullptr /* thread pool */);
#else #else


@ -104,7 +104,7 @@ class Int8FCOp final : public Operator<CPUContext> {
lastOutputPointer_ = Y->t.template mutable_data<uint8_t>(); lastOutputPointer_ = Y->t.template mutable_data<uint8_t>();
} }
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackObject_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackObject_, nullptr /* thread pool */);
#else #else


@ -80,7 +80,7 @@ class Int8LeakyReluOp final : public Operator<CPUContext> {
setupStatus == qnnp_status_success, setupStatus == qnnp_status_success,
"failed to setup QNNPACK Leaky ReLU operator"); "failed to setup QNNPACK Leaky ReLU operator");
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */);
#else #else


@ -74,7 +74,7 @@ class Int8MaxPoolOp final : public ConvPoolOpBase<CPUContext> {
setupStatus == qnnp_status_success, setupStatus == qnnp_status_success,
"failed to setup QNNPACK Max Pooling operator"); "failed to setup QNNPACK Max Pooling operator");
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */);
#else #else


@ -65,7 +65,7 @@ class Int8ReluOp final : public Operator<CPUContext> {
setupStatus == qnnp_status_success, setupStatus == qnnp_status_success,
"failed to setup QNNPACK Clamp operator"); "failed to setup QNNPACK Clamp operator");
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */);
#else #else


@ -73,7 +73,7 @@ class Int8SigmoidOp final : public Operator<CPUContext> {
setupStatus == qnnp_status_success, setupStatus == qnnp_status_success,
"failed to setup QNNPACK Sigmoid operator"); "failed to setup QNNPACK Sigmoid operator");
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */);
#else #else


@ -73,7 +73,7 @@ class Int8SoftmaxOp final : public Operator<CPUContext> {
setupStatus == qnnp_status_success, setupStatus == qnnp_status_success,
"failed to setup QNNPACK SoftArgMax operator"); "failed to setup QNNPACK SoftArgMax operator");
#ifdef FBCODE_CAFFE2 #if defined(FBCODE_CAFFE2) || !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
const qnnp_status runStatus = const qnnp_status runStatus =
qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */); qnnp_run_operator(this->qnnpackOperator_, nullptr /* thread pool */);
#else #else


@ -42,13 +42,48 @@ if platform.system() == 'Windows':
else: else:
cuda_path = '' cuda_path = ''
if not is_conda and sys.version_info >= (3, 8): import ctypes
dll_paths = list(filter(os.path.exists, [th_dll_path, py_dll_path, nvtoolsext_dll_path, cuda_path])) kernel32 = ctypes.WinDLL('kernel32.dll', use_last_error=True)
dll_paths = list(filter(os.path.exists, [th_dll_path, py_dll_path, nvtoolsext_dll_path, cuda_path]))
with_load_library_flags = hasattr(kernel32, 'AddDllDirectory')
prev_error_mode = kernel32.SetErrorMode(0x0001)
for dll_path in dll_paths: kernel32.LoadLibraryW.restype = ctypes.c_void_p
if with_load_library_flags:
kernel32.AddDllDirectory.restype = ctypes.c_void_p
kernel32.LoadLibraryExW.restype = ctypes.c_void_p
for dll_path in dll_paths:
if sys.version_info >= (3, 8):
os.add_dll_directory(dll_path) os.add_dll_directory(dll_path)
else: elif with_load_library_flags:
dll_paths = [th_dll_path, py_dll_path, nvtoolsext_dll_path, cuda_path] res = kernel32.AddDllDirectory(dll_path)
dll_paths = list(filter(os.path.exists, dll_paths)) + [os.environ['PATH']] if res is None:
err = ctypes.WinError(ctypes.get_last_error())
err.strerror += ' Error adding "{}" to the DLL directories.'.format(dll_path)
raise err
os.environ['PATH'] = ';'.join(dll_paths) dlls = glob.glob(os.path.join(th_dll_path, '*.dll'))
path_patched = False
for dll in dlls:
is_loaded = False
if with_load_library_flags:
res = kernel32.LoadLibraryExW(dll, None, 0x00001100)
last_error = ctypes.get_last_error()
if res is None and last_error != 126:
err = ctypes.WinError(last_error)
err.strerror += ' Error loading "{}" or one of its dependencies.'.format(dll)
raise err
elif res is not None:
is_loaded = True
if not is_loaded:
if not path_patched:
os.environ['PATH'] = ';'.join(dll_paths + [os.environ['PATH']])
path_patched = True
res = kernel32.LoadLibraryW(dll)
if res is None:
err = ctypes.WinError(ctypes.get_last_error())
err.strerror += ' Error loading "{}" or one of its dependencies.'.format(dll)
raise err
kernel32.SetErrorMode(prev_error_mode)


@ -4,6 +4,7 @@
#include <istream> #include <istream>
#include <ostream> #include <ostream>
#include <fstream> #include <fstream>
#include <algorithm>
#include <c10/core/Allocator.h> #include <c10/core/Allocator.h>
#include <c10/core/Backend.h> #include <c10/core/Backend.h>
@ -303,10 +304,10 @@ void PyTorchStreamWriter::setup(const string& file_name) {
mz_zip_writer_init_v2(ar_.get(), 0, MZ_ZIP_FLAG_WRITE_ZIP64); mz_zip_writer_init_v2(ar_.get(), 0, MZ_ZIP_FLAG_WRITE_ZIP64);
valid("initializing archive ", file_name.c_str()); valid("initializing archive ", file_name.c_str());
}
std::string version = c10::to_string(kProducedFileFormatVersion); void PyTorchStreamWriter::setMinVersion(const uint64_t version) {
version.push_back('\n'); version_ = std::max(version, version_);
writeRecord("version", version.c_str(), version.size());
} }
void PyTorchStreamWriter::writeRecord( void PyTorchStreamWriter::writeRecord(
@ -339,6 +340,11 @@ void PyTorchStreamWriter::writeRecord(
} }
void PyTorchStreamWriter::writeEndOfFile() { void PyTorchStreamWriter::writeEndOfFile() {
// Writes version info
std::string version = c10::to_string(version_);
version.push_back('\n');
writeRecord("version", version.c_str(), version.size());
AT_ASSERT(!finalized_); AT_ASSERT(!finalized_);
finalized_ = true; finalized_ = true;
mz_zip_writer_finalize_archive(ar_.get()); mz_zip_writer_finalize_archive(ar_.get());


@ -94,14 +94,45 @@ constexpr uint64_t kMinSupportedFileFormatVersion = 0x1L;
constexpr uint64_t kMaxSupportedFileFormatVersion = 0x5L; constexpr uint64_t kMaxSupportedFileFormatVersion = 0x5L;
// Versions (i.e. why was the version number bumped?) // Versions (i.e. why was the version number bumped?)
// Note [Dynamic Versions and torch.jit.save vs. torch.save]
//
// Our versioning scheme has a "produced file format version" which
// describes how an archive is to be read. The version written in an archive
// is at least this current produced file format version, but may be greater
// if it includes certain symbols. We refer to these conditional versions
// as "dynamic," since they are identified at runtime.
//
// Dynamic versioning is useful when an operator's semantics are updated.
// When using torch.jit.save we want those semantics to be preserved. If
// we bumped the produced file format version on every change, however,
// then older versions of PyTorch couldn't read even simple archives, like
// a single tensor, from newer versions of PyTorch. Instead, we
// assign dynamic versions to these changes that override the
// produced file format version as needed. That is, when the semantics
// of torch.div changed it was assigned dynamic version 4, and when
// torch.jit.saving modules that use torch.div those archives also have
// (at least) version 4. This prevents earlier versions of PyTorch
// from accidentally performing the wrong kind of division. Modules
// that don't use torch.div or other operators with dynamic versions
// can write the produced file format version, and these programs will
// run as expected on earlier versions of PyTorch.
//
// While torch.jit.save attempts to preserve operator semantics,
// torch.save does not. torch.save is analogous to pickling Python, so
// a function that uses torch.div will have different behavior if torch.saved
// and torch.loaded across PyTorch versions. From a technical perspective,
// torch.save ignores dynamic versioning.
// 1. Initial version // 1. Initial version
// 2. Removed op_version_set version numbers // 2. Removed op_version_set version numbers
// 3. Added type tags to pickle serialization of container types // 3. Added type tags to pickle serialization of container types
// 4. Stopped integer division using torch.div // 4. (Dynamic) Stopped integer division using torch.div
// (a versioned symbol preserves the historic behavior of versions 1--3) // (a versioned symbol preserves the historic behavior of versions 1--3)
// 5. (Read-only) Stops torch.full inferring a floating point dtype // 5. (Dynamic) Stops torch.full inferring a floating point dtype
// when given integer fill values. // when given bool or integer fill values.
constexpr uint64_t kProducedFileFormatVersion = 0x4L; // (a versioned symbol preserves the historic behavior of versions 1--4)
constexpr uint64_t kProducedFileFormatVersion = 0x3L;
// Writer-specific constants // Writer-specific constants
constexpr uint64_t kFieldAlignment = 64; constexpr uint64_t kFieldAlignment = 64;
@ -144,6 +175,8 @@ class CAFFE2_API PyTorchStreamWriter final {
explicit PyTorchStreamWriter( explicit PyTorchStreamWriter(
const std::function<size_t(const void*, size_t)>& writer_func); const std::function<size_t(const void*, size_t)>& writer_func);
void setMinVersion(const uint64_t version);
void writeRecord( void writeRecord(
const std::string& name, const std::string& name,
const void* data, const void* data,
@ -171,6 +204,7 @@ class CAFFE2_API PyTorchStreamWriter final {
std::string padding_; std::string padding_;
std::ofstream file_stream_; std::ofstream file_stream_;
std::function<size_t(const void*, size_t)> writer_func_; std::function<size_t(const void*, size_t)> writer_func_;
uint64_t version_ = kProducedFileFormatVersion;
bool finalized_ = false; bool finalized_ = false;
bool err_seen_ = false; bool err_seen_ = false;
friend size_t ostream_write_func( friend size_t ostream_write_func(
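
The dynamic-versioning note above can be illustrated with a minimal sketch. It models only the version bookkeeping (a floor that can be raised but never lowered, written out once at the end), not the zip archive itself. The names `VersionTracker`, `kBaseVersion`, and `versionRecord` are hypothetical and are not part of the diff; `kBaseVersion` merely stands in for `kProducedFileFormatVersion`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for kProducedFileFormatVersion (assumed to be 3 here).
constexpr uint64_t kBaseVersion = 3;

// Toy model of the writer's version bookkeeping only.
class VersionTracker {
 public:
  // Raise the version floor. A lower request is ignored, which mirrors
  // the std::max call in setMinVersion in the diff.
  void setMinVersion(uint64_t version) {
    version_ = std::max(version, version_);
  }

  // Build the text of the "version" record: the decimal number plus a newline.
  std::string versionRecord() const {
    // Invariant: the recorded version never drops below the base version.
    assert(version_ >= kBaseVersion);
    std::string record = std::to_string(version_);
    record.push_back('\n');
    return record;
  }

 private:
  uint64_t version_ = kBaseVersion;
};
```

Under these assumptions, an archive that never requests a dynamic version records the base version, while one that uses a versioned operator records the highest version requested, regardless of the order of the requests.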


@ -195,7 +195,12 @@ bool NNPACKConvOp::RunOnDeviceWithOrderNCHW() {
const nnp_size output_subsample = {.width = static_cast<size_t>(stride_w()), const nnp_size output_subsample = {.width = static_cast<size_t>(stride_w()),
.height = static_cast<size_t>(stride_h())}; .height = static_cast<size_t>(stride_h())};
initNNPACK(); initNNPACK();
#if !defined(USE_INTERNAL_PTHREADPOOL_IMPL)
pthreadpool_t pool = nullptr;
#else
pthreadpool_t pool = reinterpret_cast<pthreadpool_t>(ws_->GetThreadPool()); pthreadpool_t pool = reinterpret_cast<pthreadpool_t>(ws_->GetThreadPool());
#endif
runWithSharedBuffer<CPUContext>(ws_, [&](Tensor* buffer) { runWithSharedBuffer<CPUContext>(ws_, [&](Tensor* buffer) {
if (transformStrategy_ == nnp_convolution_transform_strategy_precompute) { if (transformStrategy_ == nnp_convolution_transform_strategy_precompute) {


@ -1,15 +1,8 @@
# TODO: Add ThreadPoolXNNPACK.cc when XNNPACK integration is updated
# to pass the actual threadpool ptr instead of nullptr.
if(INTERN_BUILD_MOBILE AND NOT BUILD_CAFFE2_MOBILE) if(INTERN_BUILD_MOBILE AND NOT BUILD_CAFFE2_MOBILE)
add_definitions(-DUSE_INTERNAL_THREADPOOL_IMPL)
list(APPEND Caffe2_CPU_SRCS list(APPEND Caffe2_CPU_SRCS
utils/string_utils.cc utils/string_utils.cc
utils/threadpool/pthreadpool.cc utils/threadpool/pthreadpool-cpp.cc
utils/threadpool/pthreadpool_impl.cc
utils/threadpool/pthreadpool_new_if_impl.c
utils/threadpool/ThreadPool.cc utils/threadpool/ThreadPool.cc
utils/threadpool/ThreadPoolMobile.cc
utils/threadpool/ThreadPoolXNNPACK.cc
) )
set(Caffe2_CPU_SRCS ${Caffe2_CPU_SRCS} PARENT_SCOPE) set(Caffe2_CPU_SRCS ${Caffe2_CPU_SRCS} PARENT_SCOPE)
return() return()
@ -28,23 +21,19 @@ list(APPEND Caffe2_CPU_SRCS
utils/proto_convert.cc utils/proto_convert.cc
utils/proto_utils.cc utils/proto_utils.cc
utils/proto_wrap.cc utils/proto_wrap.cc
utils/threadpool/ThreadPool.cc
utils/signal_handler.cc utils/signal_handler.cc
utils/smart_tensor_printer.cc utils/smart_tensor_printer.cc
utils/string_utils.cc utils/string_utils.cc)
utils/threadpool/ThreadPool.cc)
# ---[ threadpool/pthreadpool* is a local modification of the NNPACK if(USE_PTHREADPOOL)
# pthreadpool with a very similar interface. Neither NNPACK, nor this list(APPEND Caffe2_CPU_SRCS
# thread pool supports Windows. utils/threadpool/pthreadpool-cpp.cc)
if(NOT MSVC AND USE_XNNPACK) if(USE_INTERNAL_PTHREADPOOL_IMPL)
add_definitions(-DUSE_INTERNAL_THREADPOOL_IMPL) list(APPEND Caffe2_CPU_SRCS
set(Caffe2_CPU_SRCS ${Caffe2_CPU_SRCS} utils/threadpool/pthreadpool.cc
utils/threadpool/pthreadpool.cc utils/threadpool/pthreadpool_impl.cc)
utils/threadpool/pthreadpool_impl.cc endif()
utils/threadpool/pthreadpool_new_if_impl.c
utils/threadpool/ThreadPoolMobile.cc
utils/threadpool/ThreadPoolXNNPACK.cc
)
endif() endif()
set(Caffe2_GPU_SRCS ${Caffe2_GPU_SRCS} set(Caffe2_GPU_SRCS ${Caffe2_GPU_SRCS}


@ -1,21 +0,0 @@
#include <caffe2/utils/threadpool/ThreadPoolMobile.h>
#include <caffe2/utils/threadpool/ThreadPool.h>
#include <caffe2/utils/threadpool/pthreadpool.h>
namespace caffe2 {
caffe2::ThreadPool* mobile_threadpool() {
#ifdef C10_MOBILE
static std::unique_ptr<caffe2::ThreadPool> thread_pool =
caffe2::ThreadPool::defaultThreadPool();
return thread_pool.get();
#else
return nullptr;
#endif
}
pthreadpool_t mobile_pthreadpool() {
return reinterpret_cast<pthreadpool_t>(mobile_threadpool());
}
} // namespace caffe2


@ -1,24 +0,0 @@
#pragma once
#include <caffe2/utils/threadpool/pthreadpool.h>
// TODO Implement a parallel_for version for Mobile here, add to Aten/Parallel.h
namespace caffe2 {
class ThreadPool;
// Return a singleton instance of caffe2::ThreadPool for ATen/TH multithreading.
ThreadPool* mobile_threadpool();
// NOTE: This interface is temporary and should not be used.
// Please use Aten/Parallel.h for parallel primitives in pytorch.
// This implementation will be used by pytorch mobile, specifically
// NNPACK/QNNPACK. For mobile we need to use caffe2::ThreadPool instead of the
// 3rd party pthreadpool. Future work (TODO) Implement a mobile version of
// "at::parallel_for" using caffe2::ThreadPool so all ATen/TH multithreading
// usage is mobile friendly; refactor QNNPACK or pthreadpool to explicitly use the
// "at::parallel_for" primitive to replace pthreadpool_compute_1d for PyTorch.
pthreadpool_t mobile_pthreadpool();
size_t getDefaultNumThreads();
} // namespace caffe2


@ -1,22 +0,0 @@
#include <caffe2/utils/threadpool/pthreadpool.h>
#include <caffe2/utils/threadpool/ThreadPoolMobile.h>
#include <caffe2/utils/threadpool/ThreadPoolXNNPACK.h>
#include <memory>
namespace caffe2 {
// Will be unified.
pthreadpool_t xnnpack_threadpool() {
// Depending on internal implementation vs. OSS we will link against pthreadpool_create_xnnpack
// or pthreadpool_create. This is only temporary. It will be unified soon.
#ifdef USE_INTERNAL_THREADPOOL_IMPL
static std::unique_ptr<pthreadpool, decltype(&pthreadpool_destroy_xnnpack)>
threadpool(pthreadpool_create_xnnpack(getDefaultNumThreads()), pthreadpool_destroy_xnnpack);
#else
static std::unique_ptr<pthreadpool, decltype(&pthreadpool_destroy)>
threadpool(pthreadpool_create(getDefaultNumThreads()), pthreadpool_destroy);
#endif
return threadpool.get();
}
} // namespace caffe2


@ -1,7 +0,0 @@
#pragma once
// Creating a separate .h/.cc file for creating threadpool for XNNPACK
// to avoid touching existing internal builds.
// When we unify threadpools this should all go away.
namespace caffe2 {
pthreadpool_t xnnpack_threadpool();
} // namespace caffe2


@ -0,0 +1,71 @@
#include <caffe2/utils/threadpool/pthreadpool-cpp.h>
#include <c10/util/Exception.h>
namespace caffe2 {
PThreadPool::PThreadPool(const size_t thread_count)
: threadpool_(pthreadpool_create(thread_count), pthreadpool_destroy) {}
size_t PThreadPool::get_thread_count() const {
std::lock_guard<std::mutex> lock{mutex_};
TORCH_INTERNAL_ASSERT(threadpool_.get(), "Invalid threadpool!");
return pthreadpool_get_threads_count(threadpool_.get());
}
void PThreadPool::set_thread_count(const size_t thread_count) {
std::lock_guard<std::mutex> lock{mutex_};
// As it stands, pthreadpool is an entirely data parallel framework with no
// support for task parallelism. Hence, all functions are blocking, and no
// user-provided tasks can be in flight when the control is returned to the
// user of the API, which means re-initializing the library, without the
// need to wait on any pending tasks, is all one needs to do to re-adjust
// the thread count.
threadpool_.reset(pthreadpool_create(thread_count));
}
void PThreadPool::run(
const std::function<void(size_t)>& fn,
const size_t range) {
std::lock_guard<std::mutex> lock{mutex_};
TORCH_INTERNAL_ASSERT(threadpool_.get(), "Invalid threadpool!");
struct Context final {
const std::function<void(size_t)>& fn;
} context{
fn,
};
pthreadpool_parallelize_1d(
threadpool_.get(),
// Note: pthreadpool_parallelize_1d() is a blocking function. The
// function pointer to this lambda passed on to
// pthreadpool_parallelize_1d() cannot go out of scope until
// pthreadpool_parallelize_1d() returns.
[](void* const context, const size_t item) {
reinterpret_cast<Context*>(context)->fn(item);
},
&context,
range,
0u);
}
// Forward declaration
size_t getDefaultNumThreads();
PThreadPool* pthreadpool() {
static std::unique_ptr<PThreadPool> threadpool =
std::make_unique<PThreadPool>(getDefaultNumThreads());
return threadpool.get();
}
pthreadpool_t pthreadpool_() {
PThreadPool* const threadpool = pthreadpool();
TORCH_INTERNAL_ASSERT(
threadpool, "Failed to acquire an instance of PThreadPool!");
return threadpool->threadpool_.get();
}
} // namespace caffe2
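
The context-plus-lambda technique used by `PThreadPool::run` above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the library's code: `serial_parallelize_1d` and `task_1d_t` are hypothetical stand-ins for the C-style pthreadpool entry point, and the stand-in runs the items serially instead of on worker threads.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical C-style callback type: task(context, item_index).
using task_1d_t = void (*)(void*, std::size_t);

// Hypothetical stand-in for a blocking 1D parallel-for. It calls the task
// once per index in [0, range), serially, and returns when all are done.
inline void serial_parallelize_1d(
    task_1d_t task, void* context, std::size_t range) {
  assert(task != nullptr);
  for (std::size_t i = 0; i < range; ++i) {
    task(context, i);
  }
}

// Adapts a std::function to the C-style callback through a context struct.
inline void run(const std::function<void(std::size_t)>& fn, std::size_t range) {
  struct Context final {
    const std::function<void(std::size_t)>& fn;
  } context{fn};

  serial_parallelize_1d(
      // A capture-less lambda converts implicitly to a plain function pointer.
      [](void* ctx, std::size_t item) {
        static_cast<Context*>(ctx)->fn(item);
      },
      &context,
      range);
  // Because the call above is blocking, `context` safely outlives every task.
}
```

The design relies on the callee being blocking: the context lives on the caller's stack, so an asynchronous callee would leave the tasks with a dangling pointer.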


@ -0,0 +1,54 @@
#pragma once
#ifdef USE_PTHREADPOOL
#ifdef USE_INTERNAL_PTHREADPOOL_IMPL
#include <caffe2/utils/threadpool/pthreadpool.h>
#else
#include <pthreadpool.h>
#endif
#include <functional>
#include <memory>
#include <mutex>
namespace caffe2 {
class PThreadPool final {
public:
explicit PThreadPool(size_t thread_count);
~PThreadPool() = default;
PThreadPool(const PThreadPool&) = delete;
PThreadPool& operator=(const PThreadPool&) = delete;
PThreadPool(PThreadPool&&) = delete;
PThreadPool& operator=(PThreadPool&&) = delete;
size_t get_thread_count() const;
void set_thread_count(size_t thread_count);
// Run, in parallel, function fn(task_id) over task_id in range [0, range).
// This function is blocking. All input is processed by the time it returns.
void run(const std::function<void(size_t)>& fn, size_t range);
private:
friend pthreadpool_t pthreadpool_();
private:
mutable std::mutex mutex_;
std::unique_ptr<pthreadpool, decltype(&pthreadpool_destroy)> threadpool_;
};
// Return a singleton instance of PThreadPool for ATen/TH multithreading.
PThreadPool* pthreadpool();
// Exposes the underlying implementation of PThreadPool.
// Only for use in external libraries so as to unify threading across
// internal (i.e. ATen, etc.) and external (e.g. NNPACK, QNNPACK, XNNPACK)
// use cases.
pthreadpool_t pthreadpool_();
} // namespace caffe2
#endif /* USE_PTHREADPOOL */
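
The ownership pattern declared in this header (a `std::unique_ptr` with a C-style destroy function as its deleter, re-created under a mutex when the thread count changes) can be sketched as follows. The names `fake_pool`, `fake_pool_create`, `fake_pool_destroy`, and `PoolHandle` are hypothetical stand-ins used so the example compiles without the pthreadpool library.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical C-style resource standing in for the real pool handle.
struct fake_pool {
  std::size_t threads;
};
inline fake_pool* fake_pool_create(std::size_t threads) {
  return new fake_pool{threads};
}
inline void fake_pool_destroy(fake_pool* pool) {
  delete pool;
}

class PoolHandle final {
 public:
  explicit PoolHandle(std::size_t thread_count)
      : pool_(fake_pool_create(thread_count), fake_pool_destroy) {}

  // Non-copyable: exactly one owner for the underlying resource.
  PoolHandle(const PoolHandle&) = delete;
  PoolHandle& operator=(const PoolHandle&) = delete;

  std::size_t get_thread_count() const {
    std::lock_guard<std::mutex> lock{mutex_};
    assert(pool_ != nullptr);
    return pool_->threads;
  }

  void set_thread_count(std::size_t thread_count) {
    std::lock_guard<std::mutex> lock{mutex_};
    // reset() adopts the new pool, then destroys the old one with the
    // stored deleter.
    pool_.reset(fake_pool_create(thread_count));
  }

 private:
  mutable std::mutex mutex_;
  std::unique_ptr<fake_pool, decltype(&fake_pool_destroy)> pool_;
};
```

Re-creating the resource is only safe here because, as the comment in `set_thread_count` explains, every operation on the pool is blocking and holds the same mutex, so no work can be in flight during the swap.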


@ -32,7 +32,7 @@ static inline size_t min(size_t a, size_t b) {
} }
struct compute_1d_tiled_context { struct compute_1d_tiled_context {
pthreadpool_function_1d_tiled_t function; legacy_pthreadpool_function_1d_tiled_t function;
void* argument; void* argument;
size_t range; size_t range;
size_t tile; size_t tile;
@ -46,9 +46,9 @@ static void compute_1d_tiled(void* context_, size_t linear_index) {
context->function(context->argument, index, tile); context->function(context->argument, index, tile);
} }
void pthreadpool_compute_1d_tiled( void legacy_pthreadpool_compute_1d_tiled(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_1d_tiled_t function, legacy_pthreadpool_function_1d_tiled_t function,
void* argument, void* argument,
size_t range, size_t range,
size_t tile) size_t tile)
@ -65,12 +65,12 @@ void pthreadpool_compute_1d_tiled(
/*.argument = */ argument, /*.argument = */ argument,
/*.range = */ range, /*.range = */ range,
/*.tile = */ tile}; /*.tile = */ tile};
pthreadpool_compute_1d(threadpool, (pthreadpool_function_1d_t) compute_1d_tiled, &context, tile_range); legacy_pthreadpool_compute_1d(threadpool, (legacy_pthreadpool_function_1d_t) compute_1d_tiled, &context, tile_range);
} }
} }
struct compute_2d_context { struct compute_2d_context {
pthreadpool_function_2d_t function; legacy_pthreadpool_function_2d_t function;
void* argument; void* argument;
caffe2::FixedDivisor<int32_t> range_j; caffe2::FixedDivisor<int32_t> range_j;
}; };
@ -85,9 +85,9 @@ static void compute_2d(void* context_, size_t linear_index) {
context->function(context->argument, q, r); context->function(context->argument, q, r);
} }
void pthreadpool_compute_2d( void legacy_pthreadpool_compute_2d(
struct pthreadpool* threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_2d_t function, legacy_pthreadpool_function_2d_t function,
void* argument, void* argument,
size_t range_i, size_t range_i,
size_t range_j) size_t range_j)
@ -106,12 +106,12 @@ void pthreadpool_compute_2d(
/*.function = */ function, /*.function = */ function,
/*.argument = */ argument, /*.argument = */ argument,
/*.range_j = */ caffe2::FixedDivisor<int32_t>(range_j)}; /*.range_j = */ caffe2::FixedDivisor<int32_t>(range_j)};
pthreadpool_compute_1d(threadpool, (pthreadpool_function_1d_t) compute_2d, &context, range_i * range_j); legacy_pthreadpool_compute_1d(threadpool, (legacy_pthreadpool_function_1d_t) compute_2d, &context, range_i * range_j);
} }
} }
struct compute_2d_tiled_context { struct compute_2d_tiled_context {
pthreadpool_function_2d_tiled_t function; legacy_pthreadpool_function_2d_tiled_t function;
void* argument; void* argument;
caffe2::FixedDivisor<int32_t> tile_range_j; caffe2::FixedDivisor<int32_t> tile_range_j;
size_t range_i; size_t range_i;
@ -135,9 +135,9 @@ static void compute_2d_tiled(void* context_, size_t linear_index) {
context->function(context->argument, index_i, index_j, tile_i, tile_j); context->function(context->argument, index_i, index_j, tile_i, tile_j);
} }
void pthreadpool_compute_2d_tiled( void legacy_pthreadpool_compute_2d_tiled(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_2d_tiled_t function, legacy_pthreadpool_function_2d_tiled_t function,
void* argument, void* argument,
size_t range_i, size_t range_i,
size_t range_j, size_t range_j,
@ -166,12 +166,12 @@ void pthreadpool_compute_2d_tiled(
/*.range_j = */ range_j, /*.range_j = */ range_j,
/*.tile_i = */ tile_i, /*.tile_i = */ tile_i,
/*.tile_j = */ tile_j}; /*.tile_j = */ tile_j};
pthreadpool_compute_1d(threadpool, (pthreadpool_function_1d_t) compute_2d_tiled, &context, tile_range_i * tile_range_j); legacy_pthreadpool_compute_1d(threadpool, (legacy_pthreadpool_function_1d_t) compute_2d_tiled, &context, tile_range_i * tile_range_j);
} }
} }
struct compute_3d_tiled_context { struct compute_3d_tiled_context {
pthreadpool_function_3d_tiled_t function; legacy_pthreadpool_function_3d_tiled_t function;
void* argument; void* argument;
caffe2::FixedDivisor<int32_t> tile_range_j; caffe2::FixedDivisor<int32_t> tile_range_j;
caffe2::FixedDivisor<int32_t> tile_range_k; caffe2::FixedDivisor<int32_t> tile_range_k;
@ -205,9 +205,9 @@ static void compute_3d_tiled(
context->argument, index_i, index_j, index_k, tile_i, tile_j, tile_k); context->argument, index_i, index_j, index_k, tile_i, tile_j, tile_k);
} }
void pthreadpool_compute_3d_tiled( void legacy_pthreadpool_compute_3d_tiled(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_3d_tiled_t function, legacy_pthreadpool_function_3d_tiled_t function,
void* argument, void* argument,
size_t range_i, size_t range_i,
size_t range_j, size_t range_j,
@ -251,16 +251,16 @@ void pthreadpool_compute_3d_tiled(
/*.tile_i = */ tile_i, /*.tile_i = */ tile_i,
/*.tile_j = */ tile_j, /*.tile_j = */ tile_j,
/*.tile_k = */ tile_k}; /*.tile_k = */ tile_k};
pthreadpool_compute_1d( legacy_pthreadpool_compute_1d(
threadpool, threadpool,
(pthreadpool_function_1d_t)compute_3d_tiled, (legacy_pthreadpool_function_1d_t)compute_3d_tiled,
&context, &context,
tile_range_i * tile_range_j * tile_range_k); tile_range_i * tile_range_j * tile_range_k);
} }
} }
struct compute_4d_tiled_context { struct compute_4d_tiled_context {
pthreadpool_function_4d_tiled_t function; legacy_pthreadpool_function_4d_tiled_t function;
void* argument; void* argument;
caffe2::FixedDivisor<int32_t> tile_range_kl; caffe2::FixedDivisor<int32_t> tile_range_kl;
caffe2::FixedDivisor<int32_t> tile_range_j; caffe2::FixedDivisor<int32_t> tile_range_j;
@ -310,9 +310,9 @@ static void compute_4d_tiled(
tile_l); tile_l);
} }
void pthreadpool_compute_4d_tiled( void legacy_pthreadpool_compute_4d_tiled(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_4d_tiled_t function, legacy_pthreadpool_function_4d_tiled_t function,
void* argument, void* argument,
size_t range_i, size_t range_i,
size_t range_j, size_t range_j,
@ -367,9 +367,9 @@ void pthreadpool_compute_4d_tiled(
/*.tile_j = */ tile_j, /*.tile_j = */ tile_j,
/*.tile_k = */ tile_k, /*.tile_k = */ tile_k,
/*.tile_l = */ tile_l}; /*.tile_l = */ tile_l};
pthreadpool_compute_1d( legacy_pthreadpool_compute_1d(
threadpool, threadpool,
(pthreadpool_function_1d_t)compute_4d_tiled, (legacy_pthreadpool_function_1d_t)compute_4d_tiled,
&context, &context,
tile_range_i * tile_range_j * tile_range_k * tile_range_l); tile_range_i * tile_range_j * tile_range_k * tile_range_l);
} }
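
The wrappers above all reduce a multi-dimensional loop to a single `legacy_pthreadpool_compute_1d` call over a linear index. A minimal sketch of the index arithmetic follows. The helper names `unflatten_2d` and `divide_round_up` are hypothetical, plain division is used in place of `caffe2::FixedDivisor`, and the round-up rule for counting tiles is an assumption, since that part of the file is not shown in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Recover (i, j) from a linear index in [0, range_i * range_j).
// The quotient is the row index and the remainder is the column index.
inline std::pair<std::size_t, std::size_t> unflatten_2d(
    std::size_t linear_index, std::size_t range_j) {
  assert(range_j > 0);
  return {linear_index / range_j, linear_index % range_j};
}

// Number of tiles needed to cover `range` items with tiles of size `tile`
// (assumed rule: the last tile may be partial).
inline std::size_t divide_round_up(std::size_t range, std::size_t tile) {
  assert(tile > 0);
  return range / tile + (range % tile != 0 ? 1 : 0);
}
```

For example, with `range_j = 3` the linear index 7 maps to `(2, 1)`, and covering 10 items with tiles of 4 takes 3 tiles.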


@ -5,49 +5,16 @@
#include "ThreadPoolCommon.h" #include "ThreadPoolCommon.h"
#include <stddef.h> // for size_t #include <stddef.h> // for size_t
typedef struct pthreadpool* pthreadpool_t;
typedef void (*pthreadpool_function_1d_t)(void*, size_t);
typedef void (*pthreadpool_function_1d_tiled_t)(void*, size_t, size_t);
typedef void (*pthreadpool_function_2d_t)(void*, size_t, size_t);
typedef void (*pthreadpool_function_2d_tiled_t)(void*, size_t, size_t, size_t, size_t);
typedef void (*pthreadpool_function_3d_tiled_t)(
void*,
size_t,
size_t,
size_t,
size_t,
size_t,
size_t);
typedef void (*pthreadpool_function_4d_tiled_t)(
void*,
size_t,
size_t,
size_t,
size_t,
size_t,
size_t,
size_t,
size_t);
#include <stdint.h> // for uint32_t #include <stdint.h> // for uint32_t
typedef void (*pthreadpool_task_1d_t)(void*, size_t); typedef struct pthreadpool* legacy_pthreadpool_t;
typedef void (*pthreadpool_task_1d_tile_1d_t)(void*, size_t, size_t);
typedef void (*pthreadpool_task_2d_t)(void*, size_t, size_t); typedef void (*legacy_pthreadpool_function_1d_t)(void*, size_t);
typedef void (*pthreadpool_task_2d_tile_1d_t)(void*, size_t, size_t, size_t); typedef void (*legacy_pthreadpool_function_1d_tiled_t)(void*, size_t, size_t);
typedef void (*pthreadpool_task_2d_tile_2d_t)(void*, size_t, size_t, size_t, size_t); typedef void (*legacy_pthreadpool_function_2d_t)(void*, size_t, size_t);
typedef void (*pthreadpool_task_3d_tile_2d_t)( typedef void (*legacy_pthreadpool_function_2d_tiled_t)(void*, size_t, size_t, size_t, size_t);
void*, typedef void (*legacy_pthreadpool_function_3d_tiled_t)(
size_t,
size_t,
size_t,
size_t,
size_t);
typedef void (*pthreadpool_task_4d_tile_2d_t)(
void*, void*,
size_t, size_t,
size_t, size_t,
@ -55,16 +22,7 @@ typedef void (*pthreadpool_task_4d_tile_2d_t)(
size_t, size_t,
size_t, size_t,
size_t); size_t);
typedef void (*pthreadpool_task_5d_tile_2d_t)( typedef void (*legacy_pthreadpool_function_4d_tiled_t)(
void*,
size_t,
size_t,
size_t,
size_t,
size_t,
size_t,
size_t);
typedef void (*pthreadpool_task_6d_tile_2d_t)(
void*, void*,
size_t, size_t,
size_t, size_t,
@ -90,8 +48,8 @@ extern "C" {
* On error the function returns NULL and sets errno accordingly. * On error the function returns NULL and sets errno accordingly.
*/ */
//Returns internal threadpool impl. // Returns internal threadpool impl.
pthreadpool_t pthreadpool_create(size_t threads_count); legacy_pthreadpool_t legacy_pthreadpool_create(size_t threads_count);
/** /**
* Queries the number of threads in a thread pool. * Queries the number of threads in a thread pool.
@ -100,7 +58,7 @@ pthreadpool_t pthreadpool_create(size_t threads_count);
* *
* @returns The number of threads in the thread pool. * @returns The number of threads in the thread pool.
*/ */
size_t pthreadpool_get_threads_count(pthreadpool_t threadpool); size_t legacy_pthreadpool_get_threads_count(legacy_pthreadpool_t threadpool);
/** /**
* Processes items in parallel using threads from a thread pool. * Processes items in parallel using threads from a thread pool.
@ -117,38 +75,45 @@ size_t pthreadpool_get_threads_count(pthreadpool_t threadpool);
* @param[in] items The number of items to process. The @a function * @param[in] items The number of items to process. The @a function
* will be called once for each item. * will be called once for each item.
*/ */
void pthreadpool_compute_1d( void legacy_pthreadpool_compute_1d(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_1d_t function, legacy_pthreadpool_function_1d_t function,
void* argument, void* argument,
size_t range); size_t range);
void pthreadpool_compute_1d_tiled( void legacy_pthreadpool_parallelize_1d(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_1d_tiled_t function, legacy_pthreadpool_function_1d_t function,
void* argument,
size_t range,
uint32_t flags);
void legacy_pthreadpool_compute_1d_tiled(
legacy_pthreadpool_t threadpool,
legacy_pthreadpool_function_1d_tiled_t function,
void* argument, void* argument,
size_t range, size_t range,
size_t tile); size_t tile);
void pthreadpool_compute_2d( void legacy_pthreadpool_compute_2d(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_2d_t function, legacy_pthreadpool_function_2d_t function,
void* argument, void* argument,
size_t range_i, size_t range_i,
size_t range_j); size_t range_j);
void pthreadpool_compute_2d_tiled( void legacy_pthreadpool_compute_2d_tiled(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_2d_tiled_t function, legacy_pthreadpool_function_2d_tiled_t function,
void* argument, void* argument,
size_t range_i, size_t range_i,
size_t range_j, size_t range_j,
size_t tile_i, size_t tile_i,
size_t tile_j); size_t tile_j);
void pthreadpool_compute_3d_tiled( void legacy_pthreadpool_compute_3d_tiled(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_3d_tiled_t function, legacy_pthreadpool_function_3d_tiled_t function,
void* argument, void* argument,
size_t range_i, size_t range_i,
size_t range_j, size_t range_j,
@ -157,9 +122,9 @@ void pthreadpool_compute_3d_tiled(
size_t tile_j, size_t tile_j,
size_t tile_k); size_t tile_k);
void pthreadpool_compute_4d_tiled( void legacy_pthreadpool_compute_4d_tiled(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_4d_tiled_t function, legacy_pthreadpool_function_4d_tiled_t function,
void* argument, void* argument,
size_t range_i, size_t range_i,
size_t range_j, size_t range_j,
@ -178,129 +143,29 @@ void pthreadpool_compute_4d_tiled(
* *
* @param[in,out] threadpool The thread pool to destroy. * @param[in,out] threadpool The thread pool to destroy.
*/ */
void pthreadpool_destroy(pthreadpool_t threadpool); void legacy_pthreadpool_destroy(legacy_pthreadpool_t threadpool);
// New interface copy/pasted from pthreadpool. #ifdef USE_INTERNAL_PTHREADPOOL_IMPL
// We will merge the internal and third-party/pthreadpool eventually.
// For now copy-paste to get past build issues.
#define PTHREADPOOL_FLAG_DISABLE_DENORMALS 0x00000001 #define pthreadpool_t legacy_pthreadpool_t
#define pthreadpool_function_1d_t legacy_pthreadpool_function_1d_t
#define pthreadpool_function_1d_tiled_t legacy_pthreadpool_function_1d_tiled_t
#define pthreadpool_function_2d_t legacy_pthreadpool_function_2d_t
#define pthreadpool_function_2d_tiled_t legacy_pthreadpool_function_2d_tiled_t
#define pthreadpool_function_3d_tiled_t legacy_pthreadpool_function_3d_tiled_t
#define pthreadpool_function_4d_tiled_t legacy_pthreadpool_function_4d_tiled_t
#define pthreadpool_create legacy_pthreadpool_create
#define pthreadpool_destroy legacy_pthreadpool_destroy
#define pthreadpool_get_threads_count legacy_pthreadpool_get_threads_count
#define pthreadpool_compute_1d legacy_pthreadpool_compute_1d
#define pthreadpool_parallelize_1d legacy_pthreadpool_parallelize_1d
#define pthreadpool_compute_1d_tiled legacy_pthreadpool_compute_1d_tiled
#define pthreadpool_compute_2d legacy_pthreadpool_compute_2d
#define pthreadpool_compute_2d_tiled legacy_pthreadpool_compute_2d_tiled
#define pthreadpool_compute_3d_tiled legacy_pthreadpool_compute_3d_tiled
#define pthreadpool_compute_4d_tiled legacy_pthreadpool_compute_4d_tiled
// Returns the copied threadpool impl of third-party/pthreadpool #endif /* USE_INTERNAL_PTHREADPOOL_IMPL */
pthreadpool_t pthreadpool_create_xnnpack(size_t threads_count);
// Copied third-party impl.
size_t pthreadpool_get_threads_count_xnnpack(pthreadpool_t threadpool);
// Copied third-party impl.
void pthreadpool_destroy_xnnpack(pthreadpool_t threadpool);
/**
* Processes items in parallel using threads from a thread pool.
*
* When the call returns, all items have been processed and the thread pool is
* ready for a new task.
*
* @note If multiple threads call this function with the same thread pool, the
* calls are serialized.
*
* @param[in] threadpool The thread pool to use for parallelisation.
* @param[in] function The function to call for each item.
* @param[in] argument The first argument passed to the @a function.
* @param[in] items The number of items to process. The @a function
* will be called once for each item.
*/
void pthreadpool_parallelize_1d(
pthreadpool_t threadpool,
pthreadpool_task_1d_t function,
void* argument,
size_t range,
uint32_t flags);
void pthreadpool_parallelize_1d_tile_1d(
pthreadpool_t threadpool,
pthreadpool_task_1d_tile_1d_t function,
void* argument,
size_t range,
size_t tile,
uint32_t flags);
void pthreadpool_parallelize_2d(
pthreadpool_t threadpool,
pthreadpool_task_2d_t function,
void* argument,
size_t range_i,
size_t range_j,
uint32_t flags);
void pthreadpool_parallelize_2d_tile_1d(
pthreadpool_t threadpool,
pthreadpool_task_2d_tile_1d_t function,
void* argument,
size_t range_i,
size_t range_j,
size_t tile_j,
uint32_t flags);
void pthreadpool_parallelize_2d_tile_2d(
pthreadpool_t threadpool,
pthreadpool_task_2d_tile_2d_t function,
void* argument,
size_t range_i,
size_t range_j,
size_t tile_i,
size_t tile_j,
uint32_t flags);
void pthreadpool_parallelize_3d_tile_2d(
pthreadpool_t threadpool,
pthreadpool_task_3d_tile_2d_t function,
void* argument,
size_t range_i,
size_t range_j,
size_t range_k,
size_t tile_j,
size_t tile_k,
uint32_t flags);
void pthreadpool_parallelize_4d_tile_2d(
pthreadpool_t threadpool,
pthreadpool_task_4d_tile_2d_t function,
void* argument,
size_t range_i,
size_t range_j,
size_t range_k,
size_t range_l,
size_t tile_k,
size_t tile_l,
uint32_t flags);
void pthreadpool_parallelize_5d_tile_2d(
pthreadpool_t threadpool,
pthreadpool_task_5d_tile_2d_t function,
void* argument,
size_t range_i,
size_t range_j,
size_t range_k,
size_t range_l,
size_t range_m,
size_t tile_l,
size_t tile_m,
uint32_t flags);
void pthreadpool_parallelize_6d_tile_2d(
pthreadpool_t threadpool,
pthreadpool_task_6d_tile_2d_t function,
void* argument,
size_t range_i,
size_t range_j,
size_t range_k,
size_t range_l,
size_t range_m,
size_t range_n,
size_t tile_m,
size_t tile_n,
uint32_t flags);
#ifdef __cplusplus #ifdef __cplusplus
} /* extern "C" */ } /* extern "C" */
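
Reviewer sketch (not part of the diff): the `USE_INTERNAL_PTHREADPOOL_IMPL` block above maps the unprefixed `pthreadpool_*` names onto their `legacy_pthreadpool_*` counterparts with the preprocessor. The minimal example below shows only that aliasing pattern. `toy_pool` and `count_threads_via_alias` are hypothetical stand-ins invented for illustration, and the function bodies are assumptions rather than the real caffe2 implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the legacy pool; NOT the real caffe2 type.
struct toy_pool {
  std::size_t threads;
};
typedef toy_pool* legacy_pthreadpool_t;

inline legacy_pthreadpool_t legacy_pthreadpool_create(std::size_t threads_count) {
  return new toy_pool{threads_count};
}

inline std::size_t legacy_pthreadpool_get_threads_count(legacy_pthreadpool_t pool) {
  // Mirrors the nullptr convention visible in the diff: no pool means one thread.
  return pool == nullptr ? 1 : pool->threads;
}

inline void legacy_pthreadpool_destroy(legacy_pthreadpool_t pool) {
  delete pool;
}

// Same aliasing pattern as the header: callers keep the unprefixed names.
#define pthreadpool_t legacy_pthreadpool_t
#define pthreadpool_create legacy_pthreadpool_create
#define pthreadpool_get_threads_count legacy_pthreadpool_get_threads_count
#define pthreadpool_destroy legacy_pthreadpool_destroy

// Written against the unprefixed API; the preprocessor redirects every call
// to the legacy_* symbols defined above.
inline std::size_t count_threads_via_alias(std::size_t requested) {
  pthreadpool_t pool = pthreadpool_create(requested);
  const std::size_t n = pthreadpool_get_threads_count(pool);
  pthreadpool_destroy(pool);
  return n;
}
```

Because the redirection happens at preprocessing time, the compiled object only ever references the `legacy_*` symbols, so it should not clash at link time with a library that exports the unprefixed names.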


@ -6,9 +6,9 @@
// External API // External API
// //
void pthreadpool_compute_1d( void legacy_pthreadpool_compute_1d(
pthreadpool_t threadpool, legacy_pthreadpool_t threadpool,
pthreadpool_function_1d_t function, legacy_pthreadpool_function_1d_t function,
void* argument, void* argument,
size_t range) { size_t range) {
if (threadpool == nullptr) { if (threadpool == nullptr) {
@ -27,30 +27,31 @@ void pthreadpool_compute_1d(
range); range);
} }
size_t pthreadpool_get_threads_count(pthreadpool_t threadpool) { void legacy_pthreadpool_parallelize_1d(
// The current fix only useful when XNNPACK calls pthreadpool_get_threads_count with nullptr. const legacy_pthreadpool_t threadpool,
const legacy_pthreadpool_function_1d_t function,
void* const argument,
const size_t range,
uint32_t) {
legacy_pthreadpool_compute_1d(threadpool, function, argument, range);
}
size_t legacy_pthreadpool_get_threads_count(legacy_pthreadpool_t threadpool) {
// The current fix only useful when XNNPACK calls legacy_pthreadpool_get_threads_count with nullptr.
if (threadpool == nullptr) { if (threadpool == nullptr) {
return 1; return 1;
} }
return reinterpret_cast<caffe2::ThreadPool*>(threadpool)->getNumThreads(); return reinterpret_cast<caffe2::ThreadPool*>(threadpool)->getNumThreads();
// TODO: Future fix: If we keep maintaining two different threadpools.
// Old C2 and new one for XNNPACK, then the we have two different pthreadpool pointer
// types. One is caffe2::Thredpool*, the other is pthreadpool* (pthreadpool_new_if_impl.c)
// XNNPACK calls pthreadpool_get_threads_count during op setup using pthreadpool*, and
// uses _parallelize_ interface for for actual work.
// While NNPACK uses caffe2::Threadpool*.
// Thus if pthreadpool_get_threads_count is getting called from XNNPACK we cannot
// reinterpret_cast it to ThreadPool. It will seg fault or worse will have unedfined behavior.
} }
pthreadpool_t pthreadpool_create(size_t threads_count) { legacy_pthreadpool_t legacy_pthreadpool_create(size_t threads_count) {
std::mutex thread_pool_creation_mutex_; std::mutex thread_pool_creation_mutex_;
std::lock_guard<std::mutex> guard(thread_pool_creation_mutex_); std::lock_guard<std::mutex> guard(thread_pool_creation_mutex_);
return reinterpret_cast<pthreadpool_t>(new caffe2::ThreadPool(threads_count)); return reinterpret_cast<legacy_pthreadpool_t>(new caffe2::ThreadPool(threads_count));
} }
void pthreadpool_destroy(pthreadpool_t pthreadpool) { void legacy_pthreadpool_destroy(legacy_pthreadpool_t pthreadpool) {
if (pthreadpool) { if (pthreadpool) {
caffe2::ThreadPool* threadpool = caffe2::ThreadPool* threadpool =
reinterpret_cast<caffe2::ThreadPool*>(pthreadpool); reinterpret_cast<caffe2::ThreadPool*>(pthreadpool);
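
Reviewer sketch (not part of the diff): the implementation above hands out an opaque handle that is really a C++ thread pool object, recovers it with `reinterpret_cast`, and implements `legacy_pthreadpool_parallelize_1d` by dropping the `flags` argument and forwarding to the `compute_1d` entry point. The example below reproduces that shape under stated assumptions: `ToyThreadPool` and every `toy_*` name are hypothetical, the pool runs items serially, and the serial fallback for a null pool is an assumption about code the hunk does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for caffe2::ThreadPool. It runs every item serially
// on the calling thread, which is enough to show the handle plumbing.
class ToyThreadPool {
 public:
  explicit ToyThreadPool(std::size_t n) : num_threads_(n) {}
  std::size_t getNumThreads() const { return num_threads_; }
  void run(void (*fn)(void*, std::size_t), void* arg, std::size_t range) {
    for (std::size_t i = 0; i < range; ++i) fn(arg, i);
  }

 private:
  std::size_t num_threads_;
};

struct toy_pool_handle;  // opaque: declared, never defined
typedef toy_pool_handle* toy_pool_t;
typedef void (*toy_function_1d_t)(void*, std::size_t);

inline toy_pool_t toy_pool_create(std::size_t threads_count) {
  return reinterpret_cast<toy_pool_t>(new ToyThreadPool(threads_count));
}

inline std::size_t toy_pool_get_threads_count(toy_pool_t pool) {
  if (pool == nullptr) return 1;  // same convention as the diff
  return reinterpret_cast<ToyThreadPool*>(pool)->getNumThreads();
}

inline void toy_pool_compute_1d(toy_pool_t pool, toy_function_1d_t fn,
                                void* arg, std::size_t range) {
  if (pool == nullptr) {
    // Assumption: without a pool, process the items serially on the caller.
    for (std::size_t i = 0; i < range; ++i) fn(arg, i);
    return;
  }
  reinterpret_cast<ToyThreadPool*>(pool)->run(fn, arg, range);
}

// Newer-style entry point: accepts flags but ignores them and forwards.
inline void toy_pool_parallelize_1d(toy_pool_t pool, toy_function_1d_t fn,
                                    void* arg, std::size_t range,
                                    std::uint32_t /*flags*/) {
  toy_pool_compute_1d(pool, fn, arg, range);
}

inline void toy_pool_destroy(toy_pool_t pool) {
  delete reinterpret_cast<ToyThreadPool*>(pool);  // deleting nullptr is a no-op
}

// Task used for the demonstration: adds the item index to an accumulator.
inline void add_index(void* arg, std::size_t i) {
  *static_cast<std::size_t*>(arg) += i;
}

inline std::size_t sum_indices(std::size_t threads, std::size_t range) {
  toy_pool_t pool = toy_pool_create(threads);
  std::size_t sum = 0;
  toy_pool_parallelize_1d(pool, add_index, &sum, range, 0);
  toy_pool_destroy(pool);
  return sum;
}
```

The cast is only sound when the handle was produced by the matching create function. That is the hazard described in the comment this diff removes: a handle created by a different pool implementation must never reach these functions.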

File diff suppressed because it is too large.


@ -1,62 +0,0 @@
#pragma once
#include <stdint.h>
#if defined(__SSE__) || defined(__x86_64__)
#include <xmmintrin.h>
#endif
struct fpu_state {
#if defined(__SSE__) || defined(__x86_64__)
uint32_t mxcsr;
#elif defined(__arm__) && defined(__ARM_FP) && (__ARM_FP != 0)
uint32_t fpscr;
#elif defined(__aarch64__)
uint64_t fpcr;
#else
char unused;
#endif
};
static inline struct fpu_state get_fpu_state() {
struct fpu_state state = { 0 };
#if defined(__SSE__) || defined(__x86_64__)
state.mxcsr = (uint32_t) _mm_getcsr();
#elif defined(__arm__) && defined(__ARM_FP) && (__ARM_FP != 0)
__asm__ __volatile__("VMRS %[fpscr], fpscr" : [fpscr] "=r" (state.fpscr));
#elif defined(__aarch64__)
__asm__ __volatile__("MRS %[fpcr], fpcr" : [fpcr] "=r" (state.fpcr));
#endif
return state;
}
static inline void set_fpu_state(const struct fpu_state state) {
#if defined(__SSE__) || defined(__x86_64__)
_mm_setcsr((unsigned int) state.mxcsr);
#elif defined(__arm__) && defined(__ARM_FP) && (__ARM_FP != 0)
__asm__ __volatile__("VMSR fpscr, %[fpscr]" : : [fpscr] "r" (state.fpscr));
#elif defined(__aarch64__)
__asm__ __volatile__("MSR fpcr, %[fpcr]" : : [fpcr] "r" (state.fpcr));
#endif
}
static inline void disable_fpu_denormals() {
#if defined(__SSE__) || defined(__x86_64__)
_mm_setcsr(_mm_getcsr() | 0x8040);
#elif defined(__arm__) && defined(__ARM_FP) && (__ARM_FP != 0)
uint32_t fpscr;
__asm__ __volatile__(
"VMRS %[fpscr], fpscr\n"
"ORR %[fpscr], #0x1000000\n"
"VMSR fpscr, %[fpscr]\n"
: [fpscr] "=r" (fpscr));
#elif defined(__aarch64__)
uint64_t fpcr;
__asm__ __volatile__(
"MRS %[fpcr], fpcr\n"
"ORR %w[fpcr], %w[fpcr], 0x1000000\n"
"ORR %w[fpcr], %w[fpcr], 0x80000\n"
"MSR fpcr, %[fpcr]\n"
: [fpcr] "=r" (fpcr));
#endif
}
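
Reviewer sketch (not part of the diff): the header removed above saves, restores, and modifies the floating-point control word. On x86 the mask `0x8040` combines flush-to-zero (bit 15, `0x8000`) with denormals-are-zero (bit 6, `0x0040`) in MXCSR. The example below wraps the same save/modify/restore sequence in a scope guard. The RAII class and all names in it are hypothetical, chosen for illustration rather than taken from the removed file, and on non-x86 targets the sketch deliberately does nothing.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__) || defined(__x86_64__)
#include <xmmintrin.h>
#define TOY_HAS_MXCSR 1
#else
#define TOY_HAS_MXCSR 0
#endif

// MXCSR bits: flush-to-zero (bit 15) and denormals-are-zero (bit 6).
constexpr std::uint32_t kFlushToZero = 0x8000;
constexpr std::uint32_t kDenormalsAreZero = 0x0040;
constexpr std::uint32_t kDenormalMask = kFlushToZero | kDenormalsAreZero;
static_assert(kDenormalMask == 0x8040, "matches the mask in the removed header");

inline std::uint32_t read_fpu_control() {
#if TOY_HAS_MXCSR
  return static_cast<std::uint32_t>(_mm_getcsr());
#else
  return 0;  // no-op fallback for targets this sketch does not cover
#endif
}

inline void write_fpu_control(std::uint32_t value) {
#if TOY_HAS_MXCSR
  _mm_setcsr(static_cast<unsigned int>(value));
#else
  (void)value;
#endif
}

// Hypothetical scope guard: disables denormals for its lifetime, then
// restores whatever control word was active before.
class ScopedDenormalsDisabled {
 public:
  ScopedDenormalsDisabled() : saved_(read_fpu_control()) {
    write_fpu_control(saved_ | kDenormalMask);
  }
  ~ScopedDenormalsDisabled() { write_fpu_control(saved_); }
  ScopedDenormalsDisabled(const ScopedDenormalsDisabled&) = delete;
  ScopedDenormalsDisabled& operator=(const ScopedDenormalsDisabled&) = delete;

 private:
  std::uint32_t saved_;
};

// True when both denormal bits are set while the guard is alive.
inline bool denormal_bits_set_inside_scope() {
  ScopedDenormalsDisabled guard;
#if TOY_HAS_MXCSR
  return (read_fpu_control() & kDenormalMask) == kDenormalMask;
#else
  return true;
#endif
}

// True when the control word after the scope equals the one before it.
inline bool control_word_restored_after_scope() {
  const std::uint32_t before = read_fpu_control();
  { ScopedDenormalsDisabled guard; }
  return read_fpu_control() == before;
}
```

The control word is per-thread state, so a pool that wants this behavior would need to apply the guard on each worker thread that runs the task, not only on the caller.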


@ -239,10 +239,10 @@ if(USE_NNPACK OR USE_QNNPACK OR USE_PYTORCH_QNNPACK OR USE_XNNPACK)
endif() endif()
if(DISABLE_NNPACK_AND_FAMILY) if(DISABLE_NNPACK_AND_FAMILY)
set(USE_NNPACK OFF) caffe2_update_option(USE_NNPACK OFF)
set(USE_QNNPACK OFF) caffe2_update_option(USE_QNNPACK OFF)
set(USE_PYTORCH_QNNPACK OFF) caffe2_update_option(USE_PYTORCH_QNNPACK OFF)
set(USE_XNNPACK OFF) caffe2_update_option(USE_XNNPACK OFF)
else() else()
set(CAFFE2_THIRD_PARTY_ROOT "${PROJECT_SOURCE_DIR}/third_party") set(CAFFE2_THIRD_PARTY_ROOT "${PROJECT_SOURCE_DIR}/third_party")
@ -261,11 +261,9 @@ if(USE_NNPACK OR USE_QNNPACK OR USE_PYTORCH_QNNPACK OR USE_XNNPACK)
if(NOT DEFINED PTHREADPOOL_SOURCE_DIR) if(NOT DEFINED PTHREADPOOL_SOURCE_DIR)
set(PTHREADPOOL_SOURCE_DIR "${CAFFE2_THIRD_PARTY_ROOT}/pthreadpool" CACHE STRING "pthreadpool source directory") set(PTHREADPOOL_SOURCE_DIR "${CAFFE2_THIRD_PARTY_ROOT}/pthreadpool" CACHE STRING "pthreadpool source directory")
endif() endif()
set(CPUINFO_LIBRARY_TYPE "static" CACHE STRING "")
set(CPUINFO_LOG_LEVEL "error" CACHE STRING "")
set(PTHREADPOOL_LIBRARY_TYPE "static" CACHE STRING "")
endif() endif()
else()
set(DISABLE_NNPACK_AND_FAMILY ON)
endif() endif()
set(CONFU_DEPENDENCIES_SOURCE_DIR ${PROJECT_BINARY_DIR}/confu-srcs set(CONFU_DEPENDENCIES_SOURCE_DIR ${PROJECT_BINARY_DIR}/confu-srcs
@ -281,45 +279,48 @@ if(INTERN_BUILD_MOBILE AND INTERN_USE_EIGEN_BLAS)
endif() endif()
# ---[ pthreadpool # ---[ pthreadpool
# QNNPACK and NNPACK both depend on pthreadpool, but when building with libtorch # Only add a dependency on pthreadpool if we are on a mobile build
# they should use the pthreadpool implementation under caffe2/utils/threadpool # or are building any of the libraries in the {Q/X}NNPACK family.
# instead of the default implementation. To avoid confusion, add pthreadpool if(INTERN_BUILD_MOBILE OR NOT DISABLE_NNPACK_AND_FAMILY)
# subdirectory explicitly with EXCLUDE_FROM_ALL property prior to QNNPACK/NNPACK set(USE_PTHREADPOOL ON CACHE BOOL "" FORCE)
# does so, which will prevent it from installing the default pthreadpool library. set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DUSE_PTHREADPOOL")
if(INTERN_BUILD_MOBILE AND NOT BUILD_CAFFE2_MOBILE AND (USE_QNNPACK OR USE_NNPACK OR USE_XNNPACK))
if(NOT DEFINED PTHREADPOOL_SOURCE_DIR) # Always use third_party/pthreadpool.
set(CAFFE2_THIRD_PARTY_ROOT "${PROJECT_SOURCE_DIR}/third_party") set(USE_INTERNAL_PTHREADPOOL_IMPL OFF CACHE BOOL "" FORCE)
set(PTHREADPOOL_SOURCE_DIR "${CAFFE2_THIRD_PARTY_ROOT}/pthreadpool" CACHE STRING "pthreadpool source directory")
endif()
if(NOT TARGET pthreadpool) if(NOT TARGET pthreadpool)
set(PTHREADPOOL_BUILD_TESTS OFF CACHE BOOL "") if(USE_SYSTEM_PTHREADPOOL)
set(PTHREADPOOL_BUILD_BENCHMARKS OFF CACHE BOOL "") add_library(pthreadpool SHARED IMPORTED)
add_subdirectory( find_library(PTHREADPOOL_LIBRARY pthreadpool)
"${PTHREADPOOL_SOURCE_DIR}" set_property(TARGET pthreadpool PROPERTY IMPORTED_LOCATION "${PTHREADPOOL_LIBRARY}")
"${CONFU_DEPENDENCIES_BINARY_DIR}/pthreadpool" if(NOT PTHREADPOOL_LIBRARY)
EXCLUDE_FROM_ALL) message(FATAL_ERROR "Cannot find pthreadpool")
endif() endif()
endif() message("-- Found pthreadpool: ${PTHREADPOOL_LIBRARY}")
elseif(NOT USE_INTERNAL_PTHREADPOOL_IMPL)
if(NOT DEFINED PTHREADPOOL_SOURCE_DIR)
set(CAFFE2_THIRD_PARTY_ROOT "${PROJECT_SOURCE_DIR}/third_party")
set(PTHREADPOOL_SOURCE_DIR "${CAFFE2_THIRD_PARTY_ROOT}/pthreadpool" CACHE STRING "pthreadpool source directory")
endif()
# XNNPACK has not option of like QNNPACK_CUSTOM_THREADPOOL set(PTHREADPOOL_BUILD_TESTS OFF CACHE BOOL "")
# that allows us to hijack pthreadpool interface. set(PTHREADPOOL_BUILD_BENCHMARKS OFF CACHE BOOL "")
# Thus not doing this ends up building pthreadpool as well as set(PTHREADPOOL_LIBRARY_TYPE "static" CACHE STRING "")
# the internal implemenation of pthreadpool which results in symbol conflicts. set(PTHREADPOOL_ALLOW_DEPRECATED_API ON CACHE BOOL "")
if(USE_XNNPACK AND NOT USE_SYSTEM_XNNPACK) add_subdirectory(
if(NOT DEFINED PTHREADPOOL_SOURCE_DIR) "${PTHREADPOOL_SOURCE_DIR}"
set(CAFFE2_THIRD_PARTY_ROOT "${PROJECT_SOURCE_DIR}/third_party") "${CONFU_DEPENDENCIES_BINARY_DIR}/pthreadpool")
set(PTHREADPOOL_SOURCE_DIR "${CAFFE2_THIRD_PARTY_ROOT}/pthreadpool" CACHE STRING "pthreadpool source directory") set_property(TARGET pthreadpool PROPERTY POSITION_INDEPENDENT_CODE ON)
endif() endif()
if(NOT TARGET pthreadpool) if(USE_INTERNAL_PTHREADPOOL_IMPL)
set(PTHREADPOOL_BUILD_TESTS OFF CACHE BOOL "") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DUSE_INTERNAL_PTHREADPOOL_IMPL")
set(PTHREADPOOL_BUILD_BENCHMARKS OFF CACHE BOOL "") else()
add_subdirectory( list(APPEND Caffe2_DEPENDENCY_LIBS pthreadpool)
"${PTHREADPOOL_SOURCE_DIR}" endif()
"${CONFU_DEPENDENCIES_BINARY_DIR}/pthreadpool"
EXCLUDE_FROM_ALL)
endif() endif()
else()
set(USE_PTHREADPOOL OFF CACHE BOOL "" FORCE)
endif() endif()
# ---[ Caffe2 uses cpuinfo library in the thread pool # ---[ Caffe2 uses cpuinfo library in the thread pool
@ -369,9 +370,12 @@ if(USE_QNNPACK)
endif() endif()
if(NOT TARGET qnnpack) if(NOT TARGET qnnpack)
if(NOT USE_SYSTEM_PTHREADPOOL AND USE_INTERNAL_PTHREADPOOL_IMPL)
set(QNNPACK_CUSTOM_THREADPOOL ON CACHE BOOL "")
endif()
set(QNNPACK_BUILD_TESTS OFF CACHE BOOL "") set(QNNPACK_BUILD_TESTS OFF CACHE BOOL "")
set(QNNPACK_BUILD_BENCHMARKS OFF CACHE BOOL "") set(QNNPACK_BUILD_BENCHMARKS OFF CACHE BOOL "")
set(QNNPACK_CUSTOM_THREADPOOL ON CACHE BOOL "")
set(QNNPACK_LIBRARY_TYPE "static" CACHE STRING "") set(QNNPACK_LIBRARY_TYPE "static" CACHE STRING "")
add_subdirectory( add_subdirectory(
"${QNNPACK_SOURCE_DIR}" "${QNNPACK_SOURCE_DIR}"
@ -379,8 +383,29 @@ if(USE_QNNPACK)
# We build static versions of QNNPACK and pthreadpool but link # We build static versions of QNNPACK and pthreadpool but link
# them into a shared library for Caffe2, so they need PIC. # them into a shared library for Caffe2, so they need PIC.
set_property(TARGET qnnpack PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET qnnpack PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET pthreadpool PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET cpuinfo PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET cpuinfo PROPERTY POSITION_INDEPENDENT_CODE ON)
if(QNNPACK_CUSTOM_THREADPOOL)
target_compile_definitions(
qnnpack PRIVATE
pthreadpool_t=legacy_pthreadpool_t
pthreadpool_function_1d_t=legacy_pthreadpool_function_1d_t
pthreadpool_function_1d_tiled_t=legacy_pthreadpool_function_1d_tiled_t
pthreadpool_function_2d_t=legacy_pthreadpool_function_2d_t
pthreadpool_function_2d_tiled_t=legacy_pthreadpool_function_2d_tiled_t
pthreadpool_function_3d_tiled_t=legacy_pthreadpool_function_3d_tiled_t
pthreadpool_function_4d_tiled_t=legacy_pthreadpool_function_4d_tiled_t
pthreadpool_create=legacy_pthreadpool_create
pthreadpool_destroy=legacy_pthreadpool_destroy
pthreadpool_get_threads_count=legacy_pthreadpool_get_threads_count
pthreadpool_compute_1d=legacy_pthreadpool_compute_1d
pthreadpool_parallelize_1d=legacy_pthreadpool_parallelize_1d
pthreadpool_compute_1d_tiled=legacy_pthreadpool_compute_1d_tiled
pthreadpool_compute_2d=legacy_pthreadpool_compute_2d
pthreadpool_compute_2d_tiled=legacy_pthreadpool_compute_2d_tiled
pthreadpool_compute_3d_tiled=legacy_pthreadpool_compute_3d_tiled
pthreadpool_compute_4d_tiled=legacy_pthreadpool_compute_4d_tiled)
endif()
endif() endif()
list(APPEND Caffe2_DEPENDENCY_LIBS qnnpack) list(APPEND Caffe2_DEPENDENCY_LIBS qnnpack)
@ -400,9 +425,12 @@ if(USE_PYTORCH_QNNPACK)
endif() endif()
if(NOT TARGET pytorch_qnnpack) if(NOT TARGET pytorch_qnnpack)
if(NOT USE_SYSTEM_PTHREADPOOL AND USE_INTERNAL_PTHREADPOOL_IMPL)
set(PYTORCH_QNNPACK_CUSTOM_THREADPOOL ON CACHE BOOL "")
endif()
set(PYTORCH_QNNPACK_BUILD_TESTS OFF CACHE BOOL "") set(PYTORCH_QNNPACK_BUILD_TESTS OFF CACHE BOOL "")
set(PYTORCH_QNNPACK_BUILD_BENCHMARKS OFF CACHE BOOL "") set(PYTORCH_QNNPACK_BUILD_BENCHMARKS OFF CACHE BOOL "")
set(PYTORCH_QNNPACK_CUSTOM_THREADPOOL ON CACHE BOOL "")
set(PYTORCH_QNNPACK_LIBRARY_TYPE "static" CACHE STRING "") set(PYTORCH_QNNPACK_LIBRARY_TYPE "static" CACHE STRING "")
add_subdirectory( add_subdirectory(
"${PYTORCH_QNNPACK_SOURCE_DIR}" "${PYTORCH_QNNPACK_SOURCE_DIR}"
@ -410,10 +438,29 @@ if(USE_PYTORCH_QNNPACK)
# We build static versions of QNNPACK and pthreadpool but link # We build static versions of QNNPACK and pthreadpool but link
# them into a shared library for Caffe2, so they need PIC. # them into a shared library for Caffe2, so they need PIC.
set_property(TARGET pytorch_qnnpack PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET pytorch_qnnpack PROPERTY POSITION_INDEPENDENT_CODE ON)
if(NOT USE_SYSTEM_PTHREADPOOL)
set_property(TARGET pthreadpool PROPERTY POSITION_INDEPENDENT_CODE ON)
endif()
set_property(TARGET cpuinfo PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET cpuinfo PROPERTY POSITION_INDEPENDENT_CODE ON)
if(PYTORCH_QNNPACK_CUSTOM_THREADPOOL)
target_compile_definitions(
pytorch_qnnpack PRIVATE
pthreadpool_t=legacy_pthreadpool_t
pthreadpool_function_1d_t=legacy_pthreadpool_function_1d_t
pthreadpool_function_1d_tiled_t=legacy_pthreadpool_function_1d_tiled_t
pthreadpool_function_2d_t=legacy_pthreadpool_function_2d_t
pthreadpool_function_2d_tiled_t=legacy_pthreadpool_function_2d_tiled_t
pthreadpool_function_3d_tiled_t=legacy_pthreadpool_function_3d_tiled_t
pthreadpool_function_4d_tiled_t=legacy_pthreadpool_function_4d_tiled_t
pthreadpool_create=legacy_pthreadpool_create
pthreadpool_destroy=legacy_pthreadpool_destroy
pthreadpool_get_threads_count=legacy_pthreadpool_get_threads_count
pthreadpool_compute_1d=legacy_pthreadpool_compute_1d
pthreadpool_parallelize_1d=legacy_pthreadpool_parallelize_1d
pthreadpool_compute_1d_tiled=legacy_pthreadpool_compute_1d_tiled
pthreadpool_compute_2d=legacy_pthreadpool_compute_2d
pthreadpool_compute_2d_tiled=legacy_pthreadpool_compute_2d_tiled
pthreadpool_compute_3d_tiled=legacy_pthreadpool_compute_3d_tiled
pthreadpool_compute_4d_tiled=legacy_pthreadpool_compute_4d_tiled)
endif()
endif() endif()
list(APPEND Caffe2_DEPENDENCY_LIBS pytorch_qnnpack) list(APPEND Caffe2_DEPENDENCY_LIBS pytorch_qnnpack)
@ -447,7 +494,6 @@ if(USE_XNNPACK AND NOT USE_SYSTEM_XNNPACK)
endif() endif()
if(NOT TARGET XNNPACK) if(NOT TARGET XNNPACK)
set(XNNPACK_CUSTOM_THREADPOOL ON CACHE BOOL "")
set(XNNPACK_LIBRARY_TYPE "static" CACHE STRING "") set(XNNPACK_LIBRARY_TYPE "static" CACHE STRING "")
set(XNNPACK_BUILD_BENCHMARKS OFF CACHE BOOL "") set(XNNPACK_BUILD_BENCHMARKS OFF CACHE BOOL "")
set(XNNPACK_BUILD_TESTS OFF CACHE BOOL "") set(XNNPACK_BUILD_TESTS OFF CACHE BOOL "")
@ -457,15 +503,6 @@ if(USE_XNNPACK AND NOT USE_SYSTEM_XNNPACK)
"${CONFU_DEPENDENCIES_BINARY_DIR}/XNNPACK") "${CONFU_DEPENDENCIES_BINARY_DIR}/XNNPACK")
set_property(TARGET XNNPACK PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET XNNPACK PROPERTY POSITION_INDEPENDENT_CODE ON)
# Context: pthreadpool_get_threads_count implementation that is built in pytorch, uses
# implementation defined in caffe2/utils/threadpool/pthreadpool_impl.cc. This implementation
# assumes the the pthreadpool* passed is of type caffe2::ThradPool and thus does reinterpret cast.
# This is not valid when we create pthreadpool via caffe2::xnnpack_threadpool, which is of type
# compatible with new pthreadpool interface and is used in PT's XNNPACK integration.
# Thus all the calls for pthreadpool_get_threads_count originating from XNNPACK must be routed
# appropriately to pthreadpool_get_threads_count_xnnpack, which does not do the aforementioned
# casting to caffe2::ThradPool. Once the threadpools are unified, we will not need this.
target_compile_definitions(XNNPACK PRIVATE -Dpthreadpool_get_threads_count=pthreadpool_get_threads_count_xnnpack)
endif() endif()
include_directories(SYSTEM ${XNNPACK_INCLUDE_DIR}) include_directories(SYSTEM ${XNNPACK_INCLUDE_DIR})


@ -59,9 +59,12 @@ if(ANDROID OR IOS OR ${CMAKE_SYSTEM_NAME} STREQUAL "Linux" OR ${CMAKE_SYSTEM_NAM
set(GOOGLETEST_SOURCE_DIR "${CAFFE2_THIRD_PARTY_ROOT}/googletest" CACHE STRING "Google Test source directory") set(GOOGLETEST_SOURCE_DIR "${CAFFE2_THIRD_PARTY_ROOT}/googletest" CACHE STRING "Google Test source directory")
if(NOT TARGET nnpack) if(NOT TARGET nnpack)
if(NOT USE_SYSTEM_PTHREADPOOL AND USE_INTERNAL_PTHREADPOOL_IMPL)
set(NNPACK_CUSTOM_THREADPOOL ON CACHE BOOL "")
endif()
set(NNPACK_BUILD_TESTS OFF CACHE BOOL "") set(NNPACK_BUILD_TESTS OFF CACHE BOOL "")
set(NNPACK_BUILD_BENCHMARKS OFF CACHE BOOL "") set(NNPACK_BUILD_BENCHMARKS OFF CACHE BOOL "")
set(NNPACK_CUSTOM_THREADPOOL ON CACHE BOOL "")
set(NNPACK_LIBRARY_TYPE "static" CACHE STRING "") set(NNPACK_LIBRARY_TYPE "static" CACHE STRING "")
set(PTHREADPOOL_LIBRARY_TYPE "static" CACHE STRING "") set(PTHREADPOOL_LIBRARY_TYPE "static" CACHE STRING "")
set(CPUINFO_LIBRARY_TYPE "static" CACHE STRING "") set(CPUINFO_LIBRARY_TYPE "static" CACHE STRING "")
@ -73,6 +76,28 @@ if(ANDROID OR IOS OR ${CMAKE_SYSTEM_NAME} STREQUAL "Linux" OR ${CMAKE_SYSTEM_NAM
set_property(TARGET nnpack PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET nnpack PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET pthreadpool PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET pthreadpool PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET cpuinfo PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET cpuinfo PROPERTY POSITION_INDEPENDENT_CODE ON)
if(NNPACK_CUSTOM_THREADPOOL)
target_compile_definitions(
nnpack PRIVATE
pthreadpool_t=legacy_pthreadpool_t
pthreadpool_function_1d_t=legacy_pthreadpool_function_1d_t
pthreadpool_function_1d_tiled_t=legacy_pthreadpool_function_1d_tiled_t
pthreadpool_function_2d_t=legacy_pthreadpool_function_2d_t
pthreadpool_function_2d_tiled_t=legacy_pthreadpool_function_2d_tiled_t
pthreadpool_function_3d_tiled_t=legacy_pthreadpool_function_3d_tiled_t
pthreadpool_function_4d_tiled_t=legacy_pthreadpool_function_4d_tiled_t
pthreadpool_create=legacy_pthreadpool_create
pthreadpool_destroy=legacy_pthreadpool_destroy
pthreadpool_get_threads_count=legacy_pthreadpool_get_threads_count
pthreadpool_compute_1d=legacy_pthreadpool_compute_1d
pthreadpool_parallelize_1d=legacy_pthreadpool_parallelize_1d
pthreadpool_compute_1d_tiled=legacy_pthreadpool_compute_1d_tiled
pthreadpool_compute_2d=legacy_pthreadpool_compute_2d
pthreadpool_compute_2d_tiled=legacy_pthreadpool_compute_2d_tiled
pthreadpool_compute_3d_tiled=legacy_pthreadpool_compute_3d_tiled
pthreadpool_compute_4d_tiled=legacy_pthreadpool_compute_4d_tiled)
endif()
endif() endif()
set(NNPACK_FOUND TRUE) set(NNPACK_FOUND TRUE)


@ -69,6 +69,11 @@ if(NOT @BUILD_SHARED_LIBS@)
list(APPEND TORCH_LIBRARIES ${XNNPACK_LIBRARY}) list(APPEND TORCH_LIBRARIES ${XNNPACK_LIBRARY})
endif() endif()
if(NOT @USE_INTERNAL_PTHREADPOOL_IMPL@)
find_library(PTHREADPOOL_LIBRARY pthreadpool PATHS "${TORCH_INSTALL_PREFIX}/lib")
list(APPEND TORCH_LIBRARIES ${PTHREADPOOL_LIBRARY})
endif()
if(@INTERN_USE_EIGEN_BLAS@) if(@INTERN_USE_EIGEN_BLAS@)
find_library(EIGEN_BLAS_LIBRARY eigen_blas PATHS "${TORCH_INSTALL_PREFIX}/lib") find_library(EIGEN_BLAS_LIBRARY eigen_blas PATHS "${TORCH_INSTALL_PREFIX}/lib")
list(APPEND TORCH_LIBRARIES ${EIGEN_BLAS_LIBRARY}) list(APPEND TORCH_LIBRARIES ${EIGEN_BLAS_LIBRARY})


@ -404,15 +404,15 @@ if((CUDA_VERSION VERSION_EQUAL 9.0) OR
endif() endif()
elseif(CUDA_VERSION VERSION_EQUAL 9.2) elseif(CUDA_VERSION VERSION_EQUAL 9.2)
if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC" AND if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC" AND
NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 19.13 AND NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 19.14 AND
NOT DEFINED ENV{CUDAHOSTCXX}) NOT DEFINED ENV{CUDAHOSTCXX})
message(FATAL_ERROR message(FATAL_ERROR
"CUDA ${CUDA_VERSION} is not compatible with MSVC toolchain version " "CUDA ${CUDA_VERSION} is not compatible with MSVC toolchain version "
">= 19.13. (a.k.a Visual Studio 2017 Update 7, VS 15.7) " ">= 19.14. (a.k.a Visual Studio 2017 Update 7, VS 15.7) "
"Please upgrade to CUDA >= 10.0 or set the following environment " "Please upgrade to CUDA >= 10.0 or set the following environment "
"variable to use another version (for example): \n" "variable to use another version (for example): \n"
" set \"CUDAHOSTCXX=C:\\Program Files (x86)\\Microsoft Visual Studio" " set \"CUDAHOSTCXX=C:\\Program Files (x86)\\Microsoft Visual Studio"
"\\2017\\Enterprise\\VC\\Tools\\MSVC\\14.12.25827\\bin\\HostX64\\x64\\cl.exe\"\n") "\\2017\\Enterprise\\VC\\Tools\\MSVC\\14.13.26132\\bin\\HostX64\\x64\\cl.exe\"\n")
endif() endif()
elseif(CUDA_VERSION VERSION_EQUAL 10.0) elseif(CUDA_VERSION VERSION_EQUAL 10.0)
if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC" AND if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC" AND


@ -186,6 +186,7 @@ autocast casts all inputs to ``float32`` and runs the op in ``float32``.
``cross``, ``cross``,
``dot``, ``dot``,
``equal``, ``equal``,
``index_put``,
``stack``, ``stack``,
``tensordot`` ``tensordot``


@ -17,7 +17,7 @@ Functional higher level API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. warning:: .. warning::
This API is experimental. Even though the function signatures are very unlikely to change, major This API is in beta. Even though the function signatures are very unlikely to change, major
improvements to performances are planned before we consider this stable. improvements to performances are planned before we consider this stable.
This section contains the higher level API for the autograd that builds on the basic API above This section contains the higher level API for the autograd that builds on the basic API above


@ -4,6 +4,10 @@
Distributed communication package - torch.distributed Distributed communication package - torch.distributed
===================================================== =====================================================
.. note ::
Please refer to `PyTorch Distributed Overview <https://pytorch.org/tutorials/beginner/dist_overview.html>`__
for a brief introduction to all features related to distributed training.
.. automodule:: torch.distributed .. automodule:: torch.distributed
.. currentmodule:: torch.distributed .. currentmodule:: torch.distributed


@ -46,8 +46,11 @@ Creating TorchScript Code
script script
trace trace
trace_module trace_module
fork
wait
ScriptModule ScriptModule
ScriptFunction ScriptFunction
freeze
save save
load load
ignore ignore

View File

@ -316,4 +316,4 @@ operators, see :ref:`name_inference_reference-doc`.
(('N', 'features'), torch.Size([32, 49152])) (('N', 'features'), torch.Size([32, 49152]))
.. warning:: .. warning::
The named tensor API is experimental and subject to change. The named tensor API is a prototype feature and subject to change.


@ -11,7 +11,7 @@ programs, and can aid you in debugging.
.. _excluding-subgraphs: .. _excluding-subgraphs:
Excluding subgraphs from backward Excluding subgraphs from backward
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ---------------------------------
Every Tensor has a flag: :attr:`requires_grad` that allows for fine grained Every Tensor has a flag: :attr:`requires_grad` that allows for fine grained
exclusion of subgraphs from gradient computation and can increase efficiency. exclusion of subgraphs from gradient computation and can increase efficiency.
@ -19,7 +19,7 @@ exclusion of subgraphs from gradient computation and can increase efficiency.
.. _excluding-requires_grad: .. _excluding-requires_grad:
``requires_grad`` ``requires_grad``
~~~~~~~~~~~~~~~~~ ^^^^^^^^^^^^^^^^^
If there's a single input to an operation that requires gradient, its output If there's a single input to an operation that requires gradient, its output
will also require gradient. Conversely, only if all inputs don't require will also require gradient. Conversely, only if all inputs don't require
@ -61,7 +61,7 @@ will also require them.
.. _how-autograd-encodes-history: .. _how-autograd-encodes-history:
How autograd encodes the history How autograd encodes the history
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ --------------------------------
Autograd is reverse automatic differentiation system. Conceptually, Autograd is reverse automatic differentiation system. Conceptually,
autograd records a graph recording all of the operations that created autograd records a graph recording all of the operations that created
@ -87,7 +87,7 @@ every iteration. You don't have to encode all possible paths before you
launch the training - what you run is what you differentiate. launch the training - what you run is what you differentiate.
In-place operations with autograd In-place operations with autograd
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ---------------------------------
Supporting in-place operations in autograd is a hard matter, and we discourage Supporting in-place operations in autograd is a hard matter, and we discourage
their use in most cases. Autograd's aggressive buffer freeing and reuse makes their use in most cases. Autograd's aggressive buffer freeing and reuse makes
@ -121,7 +121,8 @@ functions and not seeing any errors, you can be sure that the computed
gradients are correct. gradients are correct.
Multithreaded Autograd Multithreaded Autograd
^^^^^^^^^^^^^^^^^^^^^^^^^^^ ----------------------
The autograd engine is responsible for running all the backward operations The autograd engine is responsible for running all the backward operations
necessary to compute the backward pass. This section will describe all the details necessary to compute the backward pass. This section will describe all the details
that can help you make the best use of it in a multithreaded environment.(this is that can help you make the best use of it in a multithreaded environment.(this is
@ -156,7 +157,7 @@ does not block on the concurrent backward computations, example code could be:
Note that some behaviors that user should be aware of: Note that some behaviors that user should be aware of:
Concurrency on CPU Concurrency on CPU
------------------ ^^^^^^^^^^^^^^^^^^
When you run ``backward()`` or ``grad()`` via python or C++ API in multiple When you run ``backward()`` or ``grad()`` via python or C++ API in multiple
threads on CPU, you are expecting to see extra concurrency instead of threads on CPU, you are expecting to see extra concurrency instead of
@ -164,7 +165,7 @@ serializing all the backward calls in a specific order during execution
(behavior before PyTorch 1.6). (behavior before PyTorch 1.6).
Non-determinism Non-determinism
------------------ ^^^^^^^^^^^^^^^
If you are calling ``backward()`` on multiple thread concurrently but with If you are calling ``backward()`` on multiple thread concurrently but with
shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared inputs (i.e. Hogwild CPU training). Since parameters are automatically
@ -180,7 +181,7 @@ to happen. User could use the functional API :func:`torch.autograd.grad` to
calculate the gradients instead of ``backward()`` to avoid non-determinism. calculate the gradients instead of ``backward()`` to avoid non-determinism.
Graph retaining Graph retaining
------------------ ^^^^^^^^^^^^^^^
If part of the autograd graph is shared between threads, i.e. run first If part of the autograd graph is shared between threads, i.e. run first
part of forward single thread, then run second part in multiple threads, part of forward single thread, then run second part in multiple threads,
@ -192,7 +193,7 @@ crash in this case. Autograd will error out to the user similar to what call
they should use ``retain_graph=True``. they should use ``retain_graph=True``.
Thread Safety on Autograd Node Thread Safety on Autograd Node
------------------------------ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Since Autograd allows the caller thread to drive its backward execution for Since Autograd allows the caller thread to drive its backward execution for
potential parallelism, it's important that we ensure thread safety on CPU with potential parallelism, it's important that we ensure thread safety on CPU with
@ -204,7 +205,7 @@ for built-in C++ Autograd Nodes(e.g. AccumulateGrad, CopySlices) and custom
thread safety on autograd Nodes that might have state write/read. thread safety on autograd Nodes that might have state write/read.
No thread safety on C++ hooks No thread safety on C++ hooks
------------------------------ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Autograd relies on the user to write thread safe C++ hooks. If you want the hook Autograd relies on the user to write thread safe C++ hooks. If you want the hook
to be correctly applied in multithreading environment, you will need to write to be correctly applied in multithreading environment, you will need to write

@ -3,6 +3,9 @@
Quantization Quantization
============ ============
.. warning ::
Quantization is in beta and subject to change.
Introduction to Quantization Introduction to Quantization
---------------------------- ----------------------------

@ -1,8 +1,3 @@
:orphan:
.. contents:: :local:
:depth: 2
.. _distributed-rpc-framework: .. _distributed-rpc-framework:
Distributed RPC Framework Distributed RPC Framework
@ -17,6 +12,9 @@ machines.
APIs in the RPC package are stable. There are multiple ongoing work items APIs in the RPC package are stable. There are multiple ongoing work items
to improve performance and error handling, which will ship in future releases. to improve performance and error handling, which will ship in future releases.
.. note ::
Please refer to `PyTorch Distributed Overview <https://pytorch.org/tutorials/beginner/dist_overview.html>`__
for a brief introduction to all features related to distributed training.
Basics Basics
------ ------
@ -97,7 +95,7 @@ applications can always explicitly move the input tensors to CPU on the caller
and move it to the desired devices on the callee if necessary. and move it to the desired devices on the callee if necessary.
.. warning:: .. warning::
TorchScript support in RPC is experimental and subject to change. Since TorchScript support in RPC is a prototype feature and subject to change. Since
v1.5.0, ``torch.distributed.rpc`` supports calling TorchScript functions as v1.5.0, ``torch.distributed.rpc`` supports calling TorchScript functions as
RPC target functions, and this will help improve parallelism on the callee RPC target functions, and this will help improve parallelism on the callee
side as executing TorchScript functions does not require GIL. side as executing TorchScript functions does not require GIL.
@ -115,7 +113,7 @@ The RPC package also provides decorators which allow applications to specify
how a given function should be treated on the callee side. how a given function should be treated on the callee side.
.. warning:: .. warning::
The ``rpc.functions`` package is experimental and subject to change. The ``rpc.functions`` package is a prototype feature and subject to change.
.. autofunction:: torch.distributed.rpc.functions.async_execution .. autofunction:: torch.distributed.rpc.functions.async_execution
@ -204,12 +202,8 @@ The TensorPipe backend has been introduced in PyTorch v1.6 and is being actively
developed. At the moment, it only supports CPU tensors, with GPU support coming developed. At the moment, it only supports CPU tensors, with GPU support coming
soon. It comes with a TCP-based transport, just like Gloo. It is also able to soon. It comes with a TCP-based transport, just like Gloo. It is also able to
automatically chunk and multiplex large tensors over multiple sockets and automatically chunk and multiplex large tensors over multiple sockets and
threads in order to achieve very high bandwidths. In addition to that, it packs threads in order to achieve very high bandwidths. The agent will be able to pick
two Linux-specific transports for communication between processes on a same the best transport on its own, with no intervention required.
machine (one based on ringbuffers stored in shared memory, the other on the
cross-memory attach syscalls) which can achieve lower latencies than TCP.
The agent will be able to pick the best transport on its own, with no
intervention required.
Example:: Example::
@ -252,7 +246,7 @@ parameters during training. See :ref:`remote-reference-protocol` for more
details. details.
.. autoclass:: RRef .. autoclass:: RRef
:inherited-members: :members:
.. toctree:: .. toctree::
@ -301,4 +295,4 @@ Tutorials
The RPC tutorial introduces users to the RPC framework and provides two example applications using :ref:`torch.distributed.rpc<distributed-rpc-framework>` APIs. The RPC tutorial introduces users to the RPC framework and provides two example applications using :ref:`torch.distributed.rpc<distributed-rpc-framework>` APIs.
- `Getting started with Distributed RPC Framework <https://pytorch.org/tutorials/intermediate/rpc_tutorial.html>`__ - `Getting started with Distributed RPC Framework <https://pytorch.org/tutorials/intermediate/rpc_tutorial.html>`__
- `Implementing a Parameter Server using Distributed RPC Framework <https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html>`__ - `Implementing a Parameter Server using Distributed RPC Framework <https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html>`__

@ -7,7 +7,7 @@ torch.sparse
.. warning:: .. warning::
This API is currently experimental and may change in the near future. This API is in beta and may change in the near future.
Torch supports sparse tensors in COO(rdinate) format, which can Torch supports sparse tensors in COO(rdinate) format, which can
efficiently store and process tensors for which the majority of elements efficiently store and process tensors for which the majority of elements

@ -208,7 +208,7 @@ torch.layout
A :class:`torch.layout` is an object that represents the memory layout of a A :class:`torch.layout` is an object that represents the memory layout of a
:class:`torch.Tensor`. Currently, we support ``torch.strided`` (dense Tensors) :class:`torch.Tensor`. Currently, we support ``torch.strided`` (dense Tensors)
and have experimental support for ``torch.sparse_coo`` (sparse COO Tensors). and have beta support for ``torch.sparse_coo`` (sparse COO Tensors).
``torch.strided`` represents dense Tensors and is the memory layout that ``torch.strided`` represents dense Tensors and is the memory layout that
is most commonly used. Each strided tensor has an associated is most commonly used. Each strided tensor has an associated

@ -63,7 +63,7 @@ targets.each do |target|
target.resources_build_phase.add_file_reference(config_file_ref, true) target.resources_build_phase.add_file_reference(config_file_ref, true)
end end
puts "Linking static libraries..." puts "Linking static libraries..."
libs = ['libc10.a', 'libclog.a', 'libXNNPACK.a', 'libeigen_blas.a', 'libcpuinfo.a', 'libpytorch_qnnpack.a', 'libtorch_cpu.a', 'libtorch.a'] libs = ['libc10.a', 'libclog.a', 'libpthreadpool.a', 'libXNNPACK.a', 'libeigen_blas.a', 'libcpuinfo.a', 'libpytorch_qnnpack.a', 'libtorch_cpu.a', 'libtorch.a']
targets.each do |target| targets.each do |target|
target.frameworks_build_phases.clear target.frameworks_build_phases.clear
for lib in libs do for lib in libs do

Some files were not shown because too many files have changed in this diff