159 Commits

732255f031 [vulkan] Add VMA as a third_party subrepo (#83906)
The [VulkanMemoryAllocator](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) is a popular library for GPU memory allocation using Vulkan. The Vulkan backend depends on it, but since it is only a single header file we currently include it by checking it into the repo under [aten/src/ATen/native/vulkan/api/vk_mem_alloc.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h). However, it is better to check it in as a third-party submodule, since that allows better version tracking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83906
Approved by: https://github.com/kimishpatel
2022-08-23 18:42:46 +00:00
3c7044728b Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289)
A more detailed description of the benefits can be found at #41001. This is Intel's counterpart to NVIDIA's NVTX (https://pytorch.org/docs/stable/autograd.html#torch.autograd.profiler.emit_nvtx).

ITT is an API for labeling trace data during application execution across different Intel tools.
To integrate Intel(R) VTune Profiler into Kineto, ITT first needs to be integrated into PyTorch. It works with the standalone VTune Profiler (https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html) today, and with Kineto-integrated VTune functionality in the future.
It works on both Intel CPU and Intel XPU devices.

Pitch
Add VTune Profiler's ITT API calls to annotate PyTorch ops, as well as developer-customized code scopes, on CPU, analogous to NVTX for NVIDIA GPUs.

This PR rebases the code changes at https://github.com/pytorch/pytorch/pull/61335 to the latest master branch.

Usage example:
```python
with torch.autograd.profiler.emit_itt():
    for i in range(10):
        torch.itt.range_push('step_{}'.format(i))
        model(input)
        torch.itt.range_pop()
```

cc @ilia-cher @robieta @chaekit @gdankel @bitfort @ngimel @orionr @nbcsm @guotuofeng @guyang3532 @gaoteng-git
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63289
Approved by: https://github.com/malfet
2022-07-13 13:50:15 +00:00
1454515253 Revert "Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289)"
This reverts commit f988aa2b3ff77d5aa010bdaae4e52c6ee345c04d.

Reverted https://github.com/pytorch/pytorch/pull/63289 on behalf of https://github.com/malfet due to broke trunk, see f988aa2b3f
2022-06-30 12:49:41 +00:00
f988aa2b3f Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289)
A more detailed description of the benefits can be found at #41001. This is Intel's counterpart to NVIDIA's NVTX (https://pytorch.org/docs/stable/autograd.html#torch.autograd.profiler.emit_nvtx).

ITT is an API for labeling trace data during application execution across different Intel tools.
To integrate Intel(R) VTune Profiler into Kineto, ITT first needs to be integrated into PyTorch. It works with the standalone VTune Profiler (https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html) today, and with Kineto-integrated VTune functionality in the future.
It works on both Intel CPU and Intel XPU devices.

Pitch
Add VTune Profiler's ITT API calls to annotate PyTorch ops, as well as developer-customized code scopes, on CPU, analogous to NVTX for NVIDIA GPUs.

This PR rebases the code changes at https://github.com/pytorch/pytorch/pull/61335 to the latest master branch.

Usage example:
```python
with torch.autograd.profiler.emit_itt():
    for i in range(10):
        torch.itt.range_push('step_{}'.format(i))
        model(input)
        torch.itt.range_pop()
```

cc @ilia-cher @robieta @chaekit @gdankel @bitfort @ngimel @orionr @nbcsm @guotuofeng @guyang3532 @gaoteng-git
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63289
Approved by: https://github.com/malfet
2022-06-30 05:14:03 +00:00
e487ba7333 Add nlohmann/json submodule (#80322)
Summary: Introduce nlohmann/json as a submodule within pytorch/third_party. This library is already a transitive dependency and is included in our licenses file. Adding it directly to third_party will enable its use by the CoreML backend.

Test Plan: There are no code changes, so sync the submodules and perform the steps outlined in the building-from-source section of the PyTorch README.

Differential Revision: D37449817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80322
Approved by: https://github.com/mcr229
2022-06-28 23:54:33 +00:00
ec4be38ba9 Revert "To add hipify_torch as a submodule in pytorch/third_party (#74704)"
This reverts commit 93b0fec39dd112d5c06106ad0186d55d61f1531a.

Reverted https://github.com/pytorch/pytorch/pull/74704 on behalf of https://github.com/malfet due to broke torchvision
2022-06-21 23:54:00 +00:00
93b0fec39d To add hipify_torch as a submodule in pytorch/third_party (#74704)
Add `hipify_torch` as a submodule in `pytorch/third_party`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74704
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2022-06-21 18:56:49 +00:00
fa7117c64a Update PeachPy submodule (#78326)
Forked the repo, merged the latest changes into the pre-generated branch, and
updated the pre-generated opcodes.

Re-enabled NNPACK builds on macOS.

Picking up f8ef1a3c0a fixes https://github.com/pytorch/pytorch/issues/76094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78326
Approved by: https://github.com/atalman, https://github.com/albanD
2022-05-26 13:58:36 +00:00
8473173c36 Remove breakpad dependency
This functionality does not seem to be used,
and there are some requests to update the dependency.

Add `third_party` to torch_cpu include directories if compiling with
Caffe2 support, as `caffe2/quantization/server/conv_dnnlowp_op.cc` depends on `third_party/fbgemm/src/RefImplementations.h`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75394
Approved by: https://github.com/janeyx99, https://github.com/seemethere
2022-05-03 20:21:55 +00:00
d79d9fa283 Revert "Remove breakpad dependency"
This reverts commit 9aa3c7fd8389735b04622bf07f6ef85c608374d0.

Reverted https://github.com/pytorch/pytorch/pull/75394 on behalf of https://github.com/malfet
2022-04-17 17:58:51 +00:00
9aa3c7fd83 Remove breakpad dependency
This functionality does not seem to be used,
and there are some requests to update the dependency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75394
Approved by: https://github.com/janeyx99, https://github.com/seemethere
2022-04-17 17:43:45 +00:00
f98881b1bf update eigen submodule to latest release (3.4.0) with rocm fixes
Fixes #73177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73178
Approved by: https://github.com/jeffdaily, https://github.com/suo, https://github.com/malfet
2022-04-07 18:54:58 +00:00
1bc3571078 [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer (#70201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70201

Included functions:
* save_mobile_module -> saves a mobile::Module to flatbuffer
* load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
* parse_mobile_module -> parses from bytes or deserialized flatbuffer module object

Compared to previous attempts, this diff only adds flatbuffer to the CMake target and leaves the fbcode/xplat ones unchanged.

Test Plan: unittest

Reviewed By: malfet, gmagogsfm

Differential Revision: D33239362

fbshipit-source-id: b9ca36b83d6af2d78cc50b9eb9e2a6fa7fce0763
2022-01-12 16:30:39 -08:00
17f3179d60 Back out "[pytorch][PR] Add ability for a mobile::Module to save as flatbuffer" (#69796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69796

(Note: this ignores all push blocking failures!)

Test Plan: External CI + Sandcastle

Reviewed By: zhxchen17

Differential Revision: D33032671

fbshipit-source-id: dbf6690e960e25d6a5f19043cbe792add2acd7ef
2021-12-10 21:29:53 -08:00
d3649309e6 [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer (#69306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69306

Included functions:

* save_mobile_module -> saves a mobile::Module to flatbuffer
* load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
* parse_mobile_module -> parses from bytes or deserialized flatbuffer Module object

Test Plan: unittests

Reviewed By: gmagogsfm

Differential Revision: D32806835

fbshipit-source-id: 71913c6650e225634f878946bd16960d377a7f57
2021-12-09 14:53:31 -08:00
00ebbd5ef6 Revert D32010095: [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer
Test Plan: revert-hammer

Differential Revision:
D32010095 (41d35dc201)

Original commit changeset: d763b0557780

fbshipit-source-id: bf746a0389135c9f5f67f00f449435ce08fb5f6d
2021-12-02 06:41:40 -08:00
41d35dc201 Add ability for a mobile::Module to save as flatbuffer (#67351)
Summary:
Included functions:

* save_mobile_module -> saves a mobile::Module to flatbuffer
* load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
* parse_mobile_module -> parses from bytes or deserialized flatbuffer Module object

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67351

Reviewed By: iseeyuan

Differential Revision: D32010095

Pulled By: qihqi

fbshipit-source-id: d763b0557780f7c2661b6485105b045e41a5e8f1
2021-12-01 23:58:15 -08:00
bd8608cd5c Use CMake for breakpad (#63186)
Summary:
We currently build breakpad from [this fork](https://github.com/driazati/breakpad) to include extra logic to restore signal handlers that were previously present. With some [new additions](https://github.com/google/breakpad/compare/main...driazati:main) this fork now includes a CMake-based build, so we can add breakpad as a proper dependency rather than rely on including it in Docker images as a system library, which is error-prone (we have a bunch of images) and hard to extend to macOS / Windows. This also includes some changes to the crash handling code to support macOS / Windows in a similar way to Linux.

```python
import torch

# On Windows this writes crashes to C:\Users\<user>\AppData\pytorch_crashes
# On macOS/Linux this writes crashes to /tmp/pytorch_crashes
torch.utils._crash_handler.enable_minidumps()

# Easy way to cause a segfault and trigger the handler
torch.bincount(input=torch.tensor([9223372036854775807]))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63186

Reviewed By: malfet, seemethere

Differential Revision: D30318404

Pulled By: driazati

fbshipit-source-id: 0d7daf3701cfaba5451cc529a0730272ab1eb1dc
2021-08-19 10:42:01 -07:00
6e5d065b2b Add pocketfft as submodule (#62841)
Summary:
Using https://github.com/mreineck/pocketfft

Also delete the explicit installation of pocketfft during the build, as it will be available via the submodule.

Limit PocketFFT support to CMake 3.10 or newer, as `set_source_files_properties` does not seem to work as expected with CMake 3.5.

Partially addresses https://github.com/pytorch/pytorch/issues/62821
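
For context (not part of this PR): PocketFFT is the library behind `torch.fft` on CPU, so a minimal round-trip is enough to exercise it:
```python
import torch

x = torch.randn(64)
X = torch.fft.fft(x)        # forward FFT on CPU, served by PocketFFT
x_back = torch.fft.ifft(X)  # inverse transform for a round-trip check
assert torch.allclose(x, x_back.real, atol=1e-5)
```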

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62841

Reviewed By: seemethere

Differential Revision: D30140441

Pulled By: malfet

fbshipit-source-id: d1a1cf1b43375321f5ec5b3d0b538f58082f7825
2021-08-17 15:29:56 -07:00
6c70cbedb6 step 0 of cuDNN v8 convolution API integration (#51390)
Summary:
This PR is step 0 of adding PyTorch convolution bindings that use the cuDNN frontend. The cuDNN frontend is the recommended way of using the cuDNN v8 API. It is supposed to have faster release cycles, so that, for example, if people find that a specific kernel has a bug, they can report it, that kernel will be blocked in the cuDNN frontend, and frameworks can just update the submodule without waiting for a whole cuDNN release.

The work is not complete, and this PR is only step 0.

**What this PR does:**
- Add cudnn-frontend as a submodule.
- Modify cmake to build that submodule.
- Add bindings for convolution forward in `Conv_v8.cpp`, which is disabled by a macro by default.
- Tested manually by enabling the macro and running `test_nn.py`. All tests pass except those mentioned below.

**What this PR doesn't:**
- Only convolution forward, no backward. The backward will use the v7 API.
- No 64-bit-indexing support for some configurations. This is a known issue of cuDNN, and will be fixed in a later cuDNN version. PyTorch will not implement any workaround for this issue; instead, the v8 API should be disabled on problematic cuDNN versions.
- No test beyond PyTorch's unit tests.
  - Not tested for correctness on real models.
  - Not benchmarked for performance.
- Benchmark cache is not thread-safe. (This is marked as `FIXME` in the code, and will be fixed in a follow-up PR)
- cuDNN benchmark is not supported.
- There are failing tests, which will be resolved later:
  ```
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (in...
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (...
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_large_cuda - RuntimeError: CUDNN_BACKEND_OPERATION: cudnnFinalize Failed cudnn_status: 9
  FAILED test/test_nn.py::TestNN::test_Conv2d_depthwise_naive_groups_cuda - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=1e-05, found 64 element(s) (out of 64) whose difference(s) exceeded the margin of error (including 0 an...
  FAILED test/test_nn.py::TestNN::test_Conv2d_deterministic_cudnn - RuntimeError: not supported yet
  FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_fp32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
  FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_tf32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
  ```

Although this is not a complete implementation of the cuDNN v8 API binding, I still want to merge this first. It lets me work in small, incremental steps, for ease of development and review.
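
As a hypothetical smoke test (not from the PR), the convolution path can be compared against the CPU reference; with the macro enabled this exercises the new v8 bindings, otherwise the existing v7 path:
```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 3, 32, 32, device="cuda")
w = torch.randn(16, 3, 3, 3, device="cuda")
y = F.conv2d(x, w, padding=1)                  # cuDNN-backed forward on CUDA
y_ref = F.conv2d(x.cpu(), w.cpu(), padding=1)  # CPU reference implementation
assert torch.allclose(y.cpu(), y_ref, rtol=1e-3, atol=1e-3)
```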

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51390

Reviewed By: malfet

Differential Revision: D28513167

Pulled By: ngimel

fbshipit-source-id: 9cc20c9dec5bbbcb1f94ac9e0f59b10c34f62740
2021-05-19 12:54:09 -07:00
19f77700ec clean up typos in submodule (#54372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54372

Reviewed By: heitorschueroff

Differential Revision: D27233797

Pulled By: walterddr

fbshipit-source-id: f8d321199b6ae8b482e2ac3f10575402551365ef
2021-03-22 11:13:06 -07:00
6f3aa58d80 Fix autograd thread crash with python-3.9 (#50998)
Summary:
Update the pybind repo to include the `gil_scoped_acquire::disarm()` methods.
In python_engine, allocate `gil_scoped_acquire` as a `unique_ptr` and leak it if the engine is finalizing, for Python 3.9+.

Fixes https://github.com/pytorch/pytorch/issues/50014 and https://github.com/pytorch/pytorch/issues/50893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50998

Reviewed By: ezyang

Differential Revision: D26038314

Pulled By: malfet

fbshipit-source-id: 035411e22825e8fdcf1348fed36da0bc33e16f60
2021-01-26 13:29:47 -08:00
fdc62c74a6 Add Kineto submodule (separate PR) (#48332)
Summary:
Separate PR to add the Kineto submodule; mirrors the one I used
in my stack (#45887).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48332

Reviewed By: gdankel

Differential Revision: D25139969

Pulled By: ilia-cher

fbshipit-source-id: b9ca2be5f15647655eeb4b2fbf4c82f84eee3dd8
2020-11-20 23:46:34 -08:00
aa8aa30a0b third_party: Update pybind to point to fork (#48117)
Summary:
There are specific patches we need for Python 3.9 compatibility, and that
process is currently hung up on separate issues.

Let's update to a newer version of our forked pybind to grab the Python
3.9 fixes while we wait for them to be upstreamed.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48117

Relates to: https://github.com/pybind/pybind11/pull/2657

Full comparison for this update looks like this: 59a2ac2745...seemethere:v2.6-fb

Fixes https://github.com/pytorch/pytorch/issues/47776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48120

Reviewed By: gchanan

Differential Revision: D25030688

Pulled By: seemethere

fbshipit-source-id: 10889c813aeaa70ef1298adad5c631e6b5a39d72
2020-11-19 19:30:09 -08:00
49af421143 Embed callgrind headers (#45914)
Summary:
Because access to https://sourceware.org/git/valgrind.git can be really slow, especially in some regions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45914

Reviewed By: seemethere

Differential Revision: D24144420

Pulled By: malfet

fbshipit-source-id: a454c8c3182c570ec344bf6468bb5e55d8b8da79
2020-10-06 17:51:10 -07:00
2b13d9413e Re-land: Add callgrind collection to Timer #44717 (#45586)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45586

Test Plan: The unit test has been softened to be less platform-sensitive.

Reviewed By: mruberry

Differential Revision: D24025415

Pulled By: robieta

fbshipit-source-id: ee986933b984e736cf1525e1297de6b21ac1f0cf
2020-09-30 17:43:06 -07:00
51d0ae9207 Revert D24010742: [pytorch][PR] Add callgrind collection to Timer
Test Plan: revert-hammer

Differential Revision:
D24010742 (9b27e0926b)

Original commit changeset: df6bc765f8ef

fbshipit-source-id: 4c1edd57ea932896f7052716427059c924222501
2020-09-30 10:15:46 -07:00
9b27e0926b Add callgrind collection to Timer (#44717)
Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low-friction way of getting signal. I considered using JIT as an alternative, but:

A) Python-specific overheads (e.g. parsing) are important
B) JIT might do rewrites that would complicate measurement.

Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```python
from torch.utils._benchmark import Timer
counts = Timer(
    "x.backward()",
    setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()

for c, fn in counts[:20]:
    print(f"{c:>12}  {fn}")
```

```
      812800  ???:_dl_update_slotinfo
      355600  ???:update_get_addr
      308300  work/Python/ceval.c:_PyEval_EvalFrameDefault'2
      304800  ???:__tls_get_addr
      196059  ???:_int_free
      152400  ???:__tls_get_addr_slow
      138400  build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
      126526  work/Objects/dictobject.c:_PyDict_LoadGlobal
      114268  ???:malloc
      101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
       85900  work/Python/ceval.c:_PyEval_EvalFrameDefault
       79946  work/Objects/typeobject.c:_PyType_Lookup
       72000  build/../c10/core/Device.h:c10::Device::validate()
       70000  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
       66400  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
       63000  ???:pthread_mutex_lock
       61200  work/Objects/dictobject.c:PyDict_GetItem
       59800  ???:free
       58400  work/Objects/tupleobject.c:tupledealloc
       56707  work/Objects/dictobject.c:lookdict_unicode_nodummy
```

Moreover, if we backport this PR to 1.6 (just copy the `_benchmarks` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```python
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions:  {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
    _ = count_dict.setdefault(fn, 0)
    count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
    print(f"{c:>8}  {fn}")
```

```
Head instructions: 7609547
1.6 instructions:  6059648
  169600  ???:_dl_update_slotinfo
  101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
   74200  ???:update_get_addr
   63600  ???:__tls_get_addr
   46800  work/Python/ceval.c:_PyEval_EvalFrameDefault
   33512  work/Objects/dictobject.c:_PyDict_LoadGlobal
   31800  ???:__tls_get_addr_slow
   31700  build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
   28300  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
   27800  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
   27401  work/Objects/dictobject.c:lookdict_unicode_nodummy
   24115  work/Objects/typeobject.c:_PyType_Lookup
   24080  ???:_int_free
   21700  work/Objects/dictobject.c:PyDict_GetItemWithError
   20700  work/Objects/dictobject.c:PyDict_GetItem
          ...
   -3200  build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
   -3400  build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
   -3500  /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
   -3700  build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
   -4207  work/Objects/obmalloc.c:PyMem_Calloc
   -4500  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
   -4800  build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
   -5000  build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
   -5300  work/Objects/listobject.c:PyList_New
   -5400  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
   -5600  /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
   -6231  work/Objects/obmalloc.c:PyMem_Free
   -6300  work/Objects/listobject.c:list_repeat
  -11200  work/Objects/listobject.c:list_dealloc
  -28900  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```

Remaining TODOs:
  * Include a timer in the generated script for CUDA sync.
  * Add valgrind to CircleCI machines and add a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717

Reviewed By: soumith

Differential Revision: D24010742

Pulled By: robieta

fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623
2020-09-30 05:52:54 -07:00
6a6c29c1c9 Update TensorPipe submodule (#37729)
Summary:
In order to include these fixes that were blocking https://github.com/pytorch/pytorch/pull/35483:
- 673eda9efc
- ff8d1733ad
- c73367836f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37729

Reviewed By: beauby

Differential Revision: D21378972

Pulled By: lw

fbshipit-source-id: 3375fe1fa6e79817da3bb033127c3c8f31c3ffc3
2020-05-04 04:44:57 -07:00
8a30553738 [TensorPipe/RPC] Add TensorPipe dependency (#36695)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36695

Reviewed By: lw

Differential Revision: D21312297

Pulled By: beauby

fbshipit-source-id: 39fdc3de91efa4ac97dd169f09fb304b273b0050
2020-04-30 11:05:15 -07:00
68895eda9d add fmt, take 7 (#37356)
Summary:
fmt is a formatting library for C++. It has several properties that make it nice
for inclusion in PyTorch:
- Widely used
- Basically copies how Python does it
- Support for all the compilers and platforms we care about
- Standards track (C++20)
- Small code size
- Header only

This PR includes it as a submodule and sets up the build.
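
Since the syntax basically copies Python's, the Python format mini-language is a handy mental model; the values below are made up for illustration:
```python
# The same format spec works in both libraries:
#   C++ (fmt):  fmt::format("{:>8} | {:.3f}", tag, value)
#   Python:
print("{:>8} | {:.3f}".format("loss", 0.12345))  # -> "    loss | 0.123"
```
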
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37356

Differential Revision: D21262619

Pulled By: suo

fbshipit-source-id: 1d9a1a5ed08a634213748e7b02fc718ef8dac4c9
2020-04-29 09:08:24 -07:00
0e52627358 Fixing pthreadpool symbol conflict issue. (#33869)
Summary:
Mainly renaming C2's pthread_create, the only conflicting symbol referred to
internally by NNPACK, to pthread_create_c2.
Removed 2 other conflicting symbols that are not used internally at all.
Pointing XNNPACK to the original repo instead of the fork.

Copy-pasted the new interface and implementation to
caffe2/utils/threadpool, so that internal builds compile against
this.

When the threadpool is unified, this will be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33869

Differential Revision: D20140580

Pulled By: kimishpatel

fbshipit-source-id: de70df0af9c7d6bc065e85ede0e1c4dd6a9e6be3
2020-02-28 21:23:18 -08:00
6aecfd1e80 Mobile Backend: NHWC memory layout + XNNPACK integration. (#33722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33722

In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build, but the actual integration is gated with USE_XNNPACK preprocessor guards.  This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs.  This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, which defaults to ON, is enabled, but will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed.  The less efficient implementation would be to hook these operators into their corresponding native implementations, granted that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today for instance.

Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, the above integration would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it does not provide a way to compute one-time operations ahead of time and factor them out of the innermost forward() loop.

The more optimal solution, and one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that directly call into these implementations through c10 or some other mechanism, likewise allowing op creation to be decoupled from op execution.

This PR does not include any of the front end changes  mentioned above.  Neither does it include the mobile threadpool unification present in the original https://github.com/pytorch/pytorch/issues/30644.  Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which can potentially allow us to remove NNPACK from aten to make the codebase a little simpler, granted that there is widespread support for such a move.

Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.
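
For reference (not in this PR): a JIT rewrite pass along the lines contemplated above later shipped as `torch.utils.mobile_optimizer.optimize_for_mobile`. A minimal sketch, assuming a build with USE_XNNPACK=ON:
```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()

scripted = torch.jit.script(model)
optimized = optimize_for_mobile(scripted)  # rewrites eligible ops to XNNPACK prepacked variants
optimized(torch.randn(1, 3, 32, 32))
```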

Pull Request resolved: https://github.com/pytorch/pytorch/pull/32509

Test Plan:
Build: CI
Functionality: Not exposed

Reviewed By: dreiss

Differential Revision: D20069796

Pulled By: AshkanAliabadi

fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c
2020-02-24 21:58:56 -08:00
039dc90854 Revert D19521853: [pytorch][PR] Mobile Backend: NHWC memory layout + XNNPACK integration.
Test Plan: revert-hammer

Differential Revision:
D19521853

Original commit changeset: 99a1fab31d0e

fbshipit-source-id: 76dfc1f481797ba2386997533cf19957637687d6
2020-02-23 22:07:19 -08:00
941b42428a Mobile Backend: NHWC memory layout + XNNPACK integration. (#32509)
Summary:
In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build, but the actual integration is gated with USE_XNNPACK preprocessor guards.  This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs.  This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, which defaults to ON, is enabled, but will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed.  The less efficient implementation would be to hook these operators into their corresponding **native** implementations, granted that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today for instance.

Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, the above integration would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it does not provide a way to compute one-time operations ahead of time and factor them out of the innermost forward() loop.

The more optimal solution, and one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that directly call into these implementations through c10 or some other mechanism, likewise allowing op creation to be decoupled from op execution.

This PR does not include any of the front end changes  mentioned above.  Neither does it include the mobile threadpool unification present in the original https://github.com/pytorch/pytorch/issues/30644.  Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which can potentially allow us to remove NNPACK from aten to make the codebase a little simpler, granted that there is widespread support for such a move.

Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32509

Reviewed By: dreiss

Differential Revision: D19521853

Pulled By: AshkanAliabadi

fbshipit-source-id: 99a1fab31d0ece64961df074003bb852c36acaaa
2020-02-23 19:08:42 -08:00
42faf961c8 Update fbjni submodule to new upstream and latest version
Summary:
The central fbjni repository is now public, so point to it and
take the latest version, which includes support for host builds
and some condensed syntax.

Test Plan: CI

Differential Revision: D18217840

fbshipit-source-id: 454e3e081f7e3155704fed692506251c4018b2a1
2019-10-31 11:48:25 -07:00
ee6cdb5726 Upgrade sleef to v3.4.0. (#26749)
Summary:
This resets the sleef submodule to upstream, since everything else except a small build sanity fix
<191f655caa>
has been merged upstream. The new release includes an important fix for trigonometric functions on macOS, which would unblock https://github.com/pytorch/pytorch/issues/26431.
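
As a sanity sketch (not part of this PR): on vectorized CPU builds, `torch.sin`/`torch.cos` dispatch to SLEEF's kernels, so a Pythagorean-identity check exercises the code paths this fix touches:
```python
import torch

x = torch.linspace(-3.14159, 3.14159, 1024)
# sin/cos over CPU float tensors go through SLEEF's vectorized kernels
assert torch.allclose(torch.sin(x) ** 2 + torch.cos(x) ** 2,
                      torch.ones_like(x), atol=1e-6)
```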

This should supersede https://github.com/pytorch/pytorch/issues/20536.

Close https://github.com/pytorch/pytorch/issues/20536.

cc colesbury resistor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26749

Differential Revision: D17572783

Pulled By: ezyang

fbshipit-source-id: dd7827e8c8500a0050e3e318d184134c792d3ecc
2019-09-25 08:25:43 -07:00
d62bca9792 jni-java wrapper for pytorchScript api (#25084)
Summary:
TL;DR: initial commit of the Android Java/JNI wrapper for the PyTorchScript C++ API.

The main idea is to provide a Java interface for Android developers to use PyTorchScript modules.
The Java API tries to mirror the semantics of the C++ and Python PyTorchScript APIs.

org.pytorch.Module (wrapper of torch::jit::script::Module)
 - static Module load(String path)
 - IValue forward(IValue... inputs)
 - IValue runMethod(String methodName, IValue... inputs)

org.pytorch.Tensor (semantics of at::Tensor)
 - newFloatTensor(long[] dims, float[] data)
 - newFloatTensor(long[] dims, FloatBuffer data)

 - newIntTensor(long[] dims, int[] data)
 - newIntTensor(long[] dims, IntBuffer data)

 - newByteTensor(long[] dims, byte[] data)
 - newByteTensor(long[] dims, ByteBuffer data)

org.pytorch.IValue (semantics of at::IValue)
 - static factory methods to create pytorchscript supported types

Examples of API usage can be found in PytorchInstrumentedTests.java:

Module module = Module.load(path);
IValue input = IValue.tensor(Tensor.newByteTensor(new long[]{1}, Tensor.allocateByteBuffer(1)));
IValue output = module.forward(input);
Tensor outputTensor = output.getTensor();

Thread safety:
The API is not thread-safe; all synchronization must be done on the caller side.

Mutability:
The org.pytorch.Tensor buffer is a DirectBuffer with native byte order, and can be created with static factory methods specifying a DirectBuffer.
At the moment org.pytorch.Tensor does not hold an at::Tensor on the JNI side; it has: long[] dimensions, type, DirectByteBuffer blobData

Input tensors are mutable (they can be modified and used for the next inference);
values are read from the buffer at the moment of the Module#forward or Module#runMethod call.
The buffers of input tensors are used directly by the input at::Tensor.

Output is copied from the output at::Tensor and is immutable.

Dependencies:
The JNI level is implemented using the fbjni library, which was developed at Facebook
and has already been used and open-sourced in several open-source projects.
It is added to the repo as a submodule from a personal account, so that the submodule can be switched
when fbjni is open-sourced separately.

ghstack-source-id: b39c848359a70d717f2830a15265e4aa122279c0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25084
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25105

Reviewed By: dreiss

Differential Revision: D16988107

Pulled By: IvanKobzarev

fbshipit-source-id: 41ca7c9869f8370b8504c2ef8a96047cc16516d4
2019-08-23 10:42:44 -07:00
eb51131fb4 Revert D16423217: [pytorch][PR] Update sleef to master, fixes #20535
Differential Revision:
D16423217

Original commit changeset: 587de3f10e83

fbshipit-source-id: 466e56eab73ce669cc179d08b7f39d2c8b0ffb34
2019-07-24 11:10:15 -07:00
7203612f85 Update sleef to master, fixes #20535 (#23168)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/20535

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23168

Differential Revision: D16423217

Pulled By: ezyang

fbshipit-source-id: 587de3f10e839b94f51f673741b5fda8849e32f6
2019-07-24 08:18:14 -07:00
580eab6562 Restore TBB module (#20454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20454
ghimport-source-id: 14aca1dedbe647d41e55e7538a6b7eeab0fc4384

Differential Revision: D15326062

Pulled By: ilia-cher

fbshipit-source-id: 02b005a679b10dc7a264978e87a8d2bb98ab972f
2019-05-28 02:49:36 -07:00
ecf012213b Update submodule URL based on redirection. (#20973)
Summary:
Changes:
  - protobuf was moved to protocolbuffers/protobuf a while ago.
  - cpuinfo was moved to pytorch/cpuinfo and updated in FBGEMM recently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20973

Differential Revision: D15511926

Pulled By: soumith

fbshipit-source-id: 2c50373c9b245524f839bd1059870dd2b84e3b81
2019-05-26 22:29:21 -07:00
785583a435 Use ignore=dirty in submodules. (#20135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20135
ghimport-source-id: 73a07e07ed9485f80374262de2fb9b87e687a47a

Differential Revision: D15214187

Pulled By: zdevito

fbshipit-source-id: 2f2272f0ee7dad3935e6c31897a0b635b4e66133
2019-05-07 15:41:19 -07:00
a3933b87c6 Back out "Revert D14613517: [pytorch][PR] Updating onnxtrt submodule to master branch" (#18514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18514

Original commit changeset: d6267ddfc339

Reviewed By: bddppq

Differential Revision: D14634476

fbshipit-source-id: 2633b0b4c512d71001e5c20cd79c0c0d7856f942
2019-03-26 23:44:33 -07:00
66e8c74814 Revert D14613517: [pytorch][PR] Updating onnxtrt submodule to master branch
Differential Revision:
D14613517

Original commit changeset: dd20d718db55

fbshipit-source-id: d6267ddfc339d04f182e2de1750a601c8d6bf8c6
2019-03-26 17:37:55 -07:00
bbe110f4e1 Updating onnxtrt submodule to master branch
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18441

Differential Revision: D14613517

Pulled By: bddppq

fbshipit-source-id: dd20d718db55942df9cce7acd1151d6902bc57ff
2019-03-26 14:25:55 -07:00
0fe6e8c870 Remove ComputeLibrary submodule
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18052

Reviewed By: ezyang

Differential Revision: D14477355

fbshipit-source-id: c56b802f6d69701596c327cf9af6782f30e335fa
2019-03-16 09:06:42 -07:00
e6cf3c886d add foxi submodule (#17184) 2019-02-20 16:25:05 -05:00
aefc83f46d fixing some rebuild issues (#14969)
Summary:
This fixes rebuild issues with the ninja part of the build. With this patch, all ninja files will now report `nothing to do` if nothing has changed, assuming `BUILD_CAFFE2_OPS=0`.

1. This only does the Python file processing for caffe2 when BUILD_CAFFE2_OPS=1; this part of the build file is written in such a way that it always has to rerun and can take substantial time moving files around in a no-op build. In the future this part should be rewritten to use a faster method of copying the files, or should treat copying the files as part of the build rules and only run when the files are out of date.

2. This points `sleef` to a patched version that fixes a dead build output that was causing everything to relink all the time. See https://github.com/shibatch/sleef/pull/231#partial-pull-merging for the upstream change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14969

Reviewed By: soumith

Differential Revision: D13395998

Pulled By: zdevito

fbshipit-source-id: ca85b7be9e99c5c578103c144ef0f2c3b927e724
2018-12-09 16:32:19 -08:00
5e06fa0baf ONNX changes to use int32_t (instead of enum) to store data type
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14926

Reviewed By: houseroad

Differential Revision: D13390642

Pulled By: bddppq

fbshipit-source-id: c2314b24d9384f188fda2b9a5cc16465ad39581e
2018-12-08 01:06:08 -08:00