Commit Graph

405 Commits

Author SHA1 Message Date
ddff4efa26 Don't use RTLD_GLOBAL to load _C. (#31162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31162

This should help us resolve a multitude of weird segfaults and crashes
when PyTorch is imported along with other packages. Those would often
happen because libtorch symbols were exposed globally and could be used
as a source of relocations in shared libraries loaded after libtorch.

Fixes #3059.

Some of the subtleties in preparing this patch:

* Getting ASAN to play ball was a pain in the ass. The basic problem is that when we load with `RTLD_LOCAL`, we may now load a library multiple times into the address space; this happens when we have custom C++ extensions. Since the libraries are usually identical, this is usually benign, but it is technically undefined behavior and UBSAN hates it. I applied a few fixes to get things to "work" correctly: I preload libstdc++ (so that it is seen consistently over all library loads) and turned off vptr checks entirely. Another possibility is to have a mode where we use RTLD_GLOBAL to load _C, which would be acceptable in environments where you're sure the C++ lines up correctly. There's a long comment in the test script going into more detail about this.
* Making some of our shared library dependencies load with `RTLD_LOCAL` breaks them. OpenMPI and MKL don't work; they play linker shenanigans to look up their symbols, which doesn't work when loaded locally, and if we load a library with `RTLD_LOCAL` we aren't able to subsequently see it with `ctypes`. To solve this problem, we employ a clever device invented by apaszke: we create a dummy library `torch_global_deps` with dependencies on all of the libraries which need to be loaded globally, and then load that with `RTLD_GLOBAL`. As long as none of these libraries have C++ symbols, we can avoid confusion about the C++ standard library. (A minimal ctypes sketch of this loading strategy follows below.)
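
For illustration, here is a minimal ctypes sketch of the loading strategy described above; the library filenames are placeholders, not the actual files or code used by `torch/__init__.py`:

```python
import ctypes

# Placeholder filenames for illustration only.
GLOBAL_DEPS_LIB = "libtorch_global_deps.so"  # dummy lib linking MKL, OpenMPI, ...
EXTENSION_LIB = "_C.so"                      # the Python C++ extension module

# Load the dependency shim with RTLD_GLOBAL so libraries that rely on global
# symbol lookup (e.g. OpenMPI, MKL) keep working and remain visible to ctypes.
ctypes.CDLL(GLOBAL_DEPS_LIB, mode=ctypes.RTLD_GLOBAL)

# Load the extension itself with RTLD_LOCAL so libtorch symbols are not
# exported globally and cannot serve as relocation targets for unrelated
# shared libraries loaded afterwards.
ctypes.CDLL(EXTENSION_LIB, mode=ctypes.RTLD_LOCAL)
```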

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D19262579

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: 06a48a5d2c9036aacd535f7e8a4de0e8fe1639f2
2020-01-09 07:28:15 -08:00
f362cd510d Move prim ops from JIT registration to C10 (#30612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30612

This is the first version of moving prim ops to c10 registration. Once reviewers are fine with the initial changes, more operators will be moved in the same style.

Test Plan: Imported from OSS

Differential Revision: D19237648

Pulled By: iseeyuan

fbshipit-source-id: c5a519604efffb80564a556536f17d829f71d9f9
2020-01-04 13:47:44 -08:00
7d630278da Separate torchbind from Python (#30242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/29501

Currently blocked on a schema serialization issue.

Test Plan: Imported from OSS

Differential Revision: D18463063

Pulled By: jamesr66a

fbshipit-source-id: c12a1b644eb9bf04e68ff93cccf91d6cb3e75359
2019-12-21 22:52:40 -08:00
e3fecabdcb Setup operator registration for distributed package (#31214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31214

This sets up the basic infrastructure for distributed autograd and RPC to
bind their operators to TorchScript. Since the whole distributed package
is built behind the `USE_DISTRIBUTED` flag, we separate the
registration and build it only when the flag is on.

Test Plan: Imported from OSS

Differential Revision: D19137160

fbshipit-source-id: ff47dc4c380ebe273fe0eea9e5e3fccfbd6466d7
2019-12-17 17:26:43 -08:00
58eb15f41c JIT Type parser for mobile (#30391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30391

A Type parser that parses the Python string form of a Type. For example,
"Tuple[str, Optional[float], Dict[str, List[Tensor]], int]".
Please refer to test_type_parser.cpp for usage.

One of the use cases is in the lite interpreter, where types need to be serialized (by directly calling python_str() on the Type) and deserialized (by calling parseType(str)).
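
To illustrate the kind of recursive-descent parsing involved, here is a small Python sketch; the actual parser in this PR is written in C++, and the names and structure below are for exposition only (well-formed input assumed):

```python
def parse_type(s: str) -> dict:
    """Parse a type string such as "Dict[str, List[Tensor]]" into a tree."""
    pos = 0

    def parse() -> dict:
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in "[],":
            pos += 1
        node = {"name": s[start:pos].strip(), "args": []}
        if pos < len(s) and s[pos] == "[":   # contained types, e.g. List[Tensor]
            pos += 1                          # consume '['
            while True:
                node["args"].append(parse())
                if s[pos] == ",":
                    pos += 1                  # consume ',' and parse the next arg
                else:
                    pos += 1                  # consume the closing ']'
                    break
        return node

    return parse()

print(parse_type("Tuple[str, Optional[float], Dict[str, List[Tensor]], int]"))
```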

Test Plan: Imported from OSS

Differential Revision: D18924268

Pulled By: iseeyuan

fbshipit-source-id: 830d411563abfbeec023f01e7f8f4a1796f9a59a
2019-12-14 20:29:42 -08:00
20a2e526ef build a generic future<T> (#29579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29579

Per #28923, this diff moves Future<Message> to torch::utils and extends it to a generic Future<T>; most of the implementation is copied from FutureMessage and ivalue::Future. Merging ivalue::Future with Future<T> will be done separately.

The main difference between Future<T> and FutureMessage is error handling: instead of checking the message type inside the Future to handle errors, Future<T> owns has_error_ and error_ states.

This future also passes its value_, has_error_, and error_ states to callbacks so that they can easily read the future's state.

In the next diff, a TorchScript RPC async API will be created. Before the API returns, it will create an ivalue::Future and pass it to Future<T>'s callback, where the state of the ivalue::Future will be set. In this way, the TorchScript RPC async API can still return an ivalue::Future and call wait() to get its state afterwards.
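
As a rough illustration of the state-plus-callback design described above, here is a minimal Python sketch; the real implementation is C++ code in torch::utils, and this class is not a torch API:

```python
import threading

class SketchFuture:
    """Owns value/error state and passes that state to callbacks."""

    def __init__(self):
        self._cv = threading.Condition()
        self._completed = False
        self._value = None
        self._has_error = False
        self._error = None
        self._callbacks = []

    def add_callback(self, cb):
        with self._cv:
            if not self._completed:
                self._callbacks.append(cb)
                return
        # Already completed: run the callback immediately with the states.
        cb(self._value, self._has_error, self._error)

    def mark_completed(self, value=None, error=None):
        with self._cv:
            self._completed = True
            self._value = value
            self._has_error = error is not None
            self._error = error
            callbacks, self._callbacks = self._callbacks, []
            self._cv.notify_all()
        for cb in callbacks:
            cb(self._value, self._has_error, self._error)

    def wait(self):
        with self._cv:
            self._cv.wait_for(lambda: self._completed)
        if self._has_error:
            raise RuntimeError(self._error)
        return self._value
```
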
ghstack-source-id: 95479525

Test Plan: unit tests

Differential Revision: D18263023

fbshipit-source-id: 48a65712656a72c2feb0bb3ec8b308c0528986a6
2019-12-12 16:57:14 -08:00
38986e1dea Split libtorch.so back into libtorch_{cpu,cuda,hip} (#30315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30315

The new structure is that libtorch_cpu contains the bulk of our
code, and libtorch depends on libtorch_cpu and libtorch_cuda.
This is a reland of https://github.com/pytorch/pytorch/pull/29731 but
I've extracted all of the prep work into separate PRs which can be
landed before this one.

Some things of note:

* torch/csrc/cuda/nccl.cpp was added to the wrong list of SRCS, now fixed (this didn't matter before because previously they were all in the same library)
* The dummy file for libtorch was brought back from the dead; it was previously deleted in #20774
* In an initial version of the patch, I forgot to make torch_cuda explicitly depend on torch_cpu. This led to some very odd errors, most notably "bin/blob_test: hidden symbol `_ZNK6google8protobuf5Arena17OnArenaAllocationEPKSt9type_infom' in lib/libprotobuf.a(arena.cc.o) is referenced by DSO"
* A number of places in Android/iOS builds have to add torch_cuda explicitly as a library, as they do not have transitive dependency calculation working correctly
* I had to make torch_cpu/torch_cuda caffe2_interface_library so that they get whole-archive linked into torch when you statically link. And I had to do this in an *exported* fashion because torch needs to depend on torch_cpu_library. In the end I exported everything and removed the redefinition in Caffe2Config.cmake. I am not too sure why the old code did it that way in the first place, but it doesn't seem to have broken anything to switch it.
* There are some uses of `__HIP_PLATFORM_HCC__` still in `torch_cpu` code, so I had to apply it to that library too (UGH). This manifests as a failure when trying to run the CUDA fuser. This doesn't really matter substantively right now because we still in-place HIPify, but it would be good to fix eventually. This was a bit difficult to debug because of an unrelated HIP bug, see https://github.com/ROCm-Developer-Tools/HIP/issues/1706

Fixes #27215 (as our libraries are smaller), and executes on
part of the plan in #29235.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18790941

Pulled By: ezyang

fbshipit-source-id: 01296f6089d3de5e8365251b490c51e694f2d6c7
2019-12-04 08:04:57 -08:00
53785771a7 Don't build test_cpp_rpc if torch is built without distributed support (#30587)
Summary:
On the latest master, I get link errors when building one of the tests:

```sh
/home/pbell/git/pytorch/build/../test/cpp/rpc/test_wire_serialization.cpp:23:
undefined reference to `torch::distributed::rpc::wireDeserialize(void const*, unsigned long)'
```

This seems to be caused by PR https://github.com/pytorch/pytorch/issues/29785 not working with `USE_DISTRIBUTED=0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30587

Differential Revision: D18758625

Pulled By: jjlilley

fbshipit-source-id: 0ad0703acdbbac22bb4b8317370fbe2606fcb67e
2019-12-01 16:43:12 -08:00
f4e7e9039d Improve process_group_agent() serialization speed (#29785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29785

TLDR: This change improves process_group's serialization speed:
  Serialize_Tensor64:     12.38us ->   1.99us  (~-84%)
  Deserialize_Tensor64:   33.89us ->   5.62us  (~-84%)
  Serialize_Tensor1M:    525.74us -> 285.43us  (~-45%)
  Deserialize_Tensor1M:  892.61us -> 273.68us  (~-70%)

After speaking with the jit team, we had consensus that torch::save()/load()
are somewhat high-overhead for RPC serialization, mostly intended for
persistent disk data.

(In particular, for large tensors, 35% of the time is spent in CRC checking, even
with the fb-side changes that substitute 40x faster SSE-accelerated CRC checking;
also, for small tensors, the zip container overhead is considerable, as is the
overhead of lexing/parsing an embedded text Python program for each RPC.)

The jit team encouraged us to use jit::pickler, with the WriteableTensorData
way of outputting result tensors (rather than the default side-tensor table
or pickling the actual tensors). This ends up just pickling some tensor
metadata and giving us some tensor blobs that we can mindlessly
blit over the wire (they are copied to CPU memory if needed).

There is as yet no standardized container format for the pickled data
(jit::pickle_save() is checked in, but it's experimental and
no load function is provided yet), but they encouraged us to just use
something sensible for this and possibly revisit later. For now, I made
the directory headers slightly HTTP-inspired.
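
As a rough Python sketch of the metadata-plus-blobs idea (the real implementation uses the C++ jit::pickler; the function names and header layout below are made up for illustration):

```python
import pickle
import struct

import numpy as np
import torch

def wire_serialize(tensors):
    """Pickle only lightweight metadata; append raw storage blobs after it."""
    arrays = [t.detach().contiguous().cpu().numpy() for t in tensors]
    meta = pickle.dumps([(a.dtype.str, a.shape) for a in arrays])
    parts = [struct.pack("<I", len(meta)), meta]
    for a in arrays:
        blob = a.tobytes()
        parts.append(struct.pack("<Q", len(blob)))
        parts.append(blob)
    return b"".join(parts)

def wire_deserialize(data):
    (meta_len,) = struct.unpack_from("<I", data, 0)
    offset = 4
    meta = pickle.loads(data[offset:offset + meta_len])
    offset += meta_len
    tensors = []
    for dtype_str, shape in meta:
        (blob_len,) = struct.unpack_from("<Q", data, offset)
        offset += 8
        count = int(np.prod(shape, dtype=np.int64))
        arr = np.frombuffer(data, dtype=dtype_str, count=count, offset=offset)
        tensors.append(torch.from_numpy(arr.copy().reshape(shape)))
        offset += blob_len
    return tensors

# Round-trip check.
original = [torch.randn(3, 4), torch.arange(10)]
restored = wire_deserialize(wire_serialize(original))
assert all(torch.equal(a, b) for a, b in zip(original, restored))
```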

Note that serialization is just one component of the pipeline, but that
said, we also see reasonable reductions in end-to-end echo times (noisier):
   ProcessGroupAgent_Echo(Tensor_Small)   855.25us -> 492.65us  (~-42%)
   ProcessGroupAgent_Echo(Tensor_1M)       10.82ms -> 6.94ms    (~-35%)
   ProcessGroupAgent_Echo(Small_NoTensor) 688.82us -> 301.72us  (~-56%)
   ProcessGroupAgent_Echo(1MB_NoTensor)     4.65ms -> 3.71ms    (~-20%)

I moved the "wire serialization" logic to a separate file to assist with
unit testing.
ghstack-source-id: 94694682

Test Plan:
buck test mode/dev-nosan caffe2/test/cpp/api:serialize
  buck test mode/dev-nosan caffe2/test/...

Differential Revision: D18493938

fbshipit-source-id: 07ddfe87dbe56472bc944f7d070627052c94a8f4
2019-11-28 09:57:52 -08:00
168570b0da move module_save.cpp to non-mobile build section in cmake (#30221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30221

PR #29881 moved the Module::save() methods to a separate source file
and removed the C10_MOBILE gating logic. It seems the file should stay with
export_module.cpp (which is in the "NOT INTERN_BUILD_MOBILE" section);
otherwise it causes a link error with build_mobile.sh.

Test:
- build locally
- check CI

Test Plan: Imported from OSS

Differential Revision: D18649234

Pulled By: ljk53

fbshipit-source-id: b6c90a532d191c41ce10c1047a869d8f73854c4d
2019-11-21 18:56:34 -08:00
352731bd6e Revert D18632773: Split libtorch.so back into libtorch_{cpu,cuda,hip}
Test Plan: revert-hammer

Differential Revision:
D18632773

Original commit changeset: ea717c81e0d7

fbshipit-source-id: 18601439f9f81c9f389020e5a0e4e04adb21772d
2019-11-21 15:01:09 -08:00
ec30d9028a Split libtorch.so back into libtorch_{cpu,cuda,hip} (#29731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29731

The new structure is that libtorch_cpu contains the bulk of our
code, and libtorch depends on libtorch_cpu and libtorch_cuda.

Some subtleties about the patch:
- There were a few functions that crossed the CPU-CUDA boundary without API macros. I just added them, easy enough. An inverse situation was aten/src/THC/THCTensorRandom.cu, where we weren't supposed to put API macros directly in a cpp file.
- DispatchStub wasn't getting all of the symbols related to its static members exported properly. I tried a few fixes, but in the end I just moved everyone off using DispatchStub to dispatch CUDA/HIP (so they just use normal dispatch for those cases). Additionally, there were some places where people were failing to actually import the declaration of the dispatch stub, so I added includes for those cases.
- torch/csrc/cuda/nccl.cpp was added to the wrong list of SRCS, now fixed (this didn't matter before because previously they were all in the same library)
- The dummy file for libtorch was brought back from the dead; it was previously deleted in #20774
- In an initial version of the patch, I forgot to make torch_cuda explicitly depend on torch_cpu. This led to some very odd errors, most notably "bin/blob_test: hidden symbol `_ZNK6google8protobuf5Arena17OnArenaAllocationEPKSt9type_infom' in lib/libprotobuf.a(arena.cc.o) is referenced by DSO"
- A number of places in Android/iOS builds have to add torch_cuda explicitly as a library, as they do not have transitive dependency calculation working correctly. This situation also happens with custom C++ extensions.
- There's a ROCm compiler bug where extern "C" on functions is not respected. There's a little workaround to handle this.
- Because I was too lazy to check if HIPify was converting TORCH_CUDA_API into TORCH_HIP_API, I just made it so HIP build also triggers the TORCH_CUDA_API macro. Eventually, we should translate and keep the nature of TORCH_CUDA_API constant in all cases.

Fixes #27215 (as our libraries are smaller), and executes on
part of the plan in #29235.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18632773

Pulled By: ezyang

fbshipit-source-id: ea717c81e0d7554ede1dc404108603455a81da82
2019-11-21 11:27:33 -08:00
d934cf484b call find_package(OpenMP) only when USE_OPENMP=ON (#30223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30223

I ran into a find_package(OpenMP) failure in some Linux environments when
USE_OPENMP=OFF. I figured this workaround would unblock things; I'm not sure how hard
it would be to find and fix the root cause of the find_package() failure.

Test:
- works in my case;
- will check CI;

Test Plan: Imported from OSS

Differential Revision: D18640309

Pulled By: ljk53

fbshipit-source-id: b5b30623f5da4edbe59574a8b35286b74c3225d3
2019-11-21 10:35:15 -08:00
43fb0015db custom build script (#30144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30144

Create a script to produce a libtorch that only contains the ops needed by specific
models. Developers can use this workflow to further optimize mobile build size.

We need to keep a dummy stub for unused (stripped) ops because some JIT-side
logic requires certain function schemas to exist in the JIT op
registry.

Test Steps:
1. Build "dump_operator_names" binary and use it to dump root ops needed
by a specific model:
```
build/bin/dump_operator_names --model=mobilenetv2.pk --output=mobilenetv2.yaml
```

2. The MobileNetV2 model should use the following ops:
```
- aten::t
- aten::dropout
- aten::mean.dim
- aten::add.Tensor
- prim::ListConstruct
- aten::addmm
- aten::_convolution
- aten::batch_norm
- aten::hardtanh_
- aten::mm
```
NOTE that for some reason it outputs "aten::addmm" but actually uses "aten::mm".
You need to fix this manually for now.

3. Run custom build script locally (use Android as an example):
```
SELECTED_OP_LIST=mobilenetv2.yaml scripts/build_pytorch_android.sh armeabi-v7a
```

4. Checkout demo app that uses locally built library instead of
downloading from jcenter repo:
```
git clone --single-branch --branch custom_build git@github.com:ljk53/android-demo-app.git
```

5. Copy locally built libraries to demo app folder:
```
find ${HOME}/src/pytorch/android -name '*.aar' -exec cp {} ${HOME}/src/android-demo-app/HelloWorldApp/app/libs/ \;
```

6. Build demo app with locally built libtorch:
```
cd ${HOME}/src/android-demo-app/HelloWorldApp
./gradlew clean && ./gradlew assembleDebug
```

7. Install and run the demo app.

In-APK arm-v7 libpytorch_jni.so build size reduced from 5.5M to 2.9M.

Test Plan: Imported from OSS

Differential Revision: D18612127

Pulled By: ljk53

fbshipit-source-id: fa8d5e1d3259143c7346abd1c862773be8c7e29a
2019-11-20 13:16:02 -08:00
fbcb88e8b3 Split module.cpp and export.cpp to support saving on mobile (#29881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29881

Breaking these into separate files allows us to have three different builds:
- Mobile inference-only.
- Mobile with module saving.
- Server with module saving and other export functions like ONNX.

And this can be accomplished just by selecting which cpp files to compile,
without setting any preprocessor flags.

Test Plan: CI.  Local mobile+saving build.

Reviewed By: smessmer

Differential Revision: D18509296

fbshipit-source-id: 9438273bac4624df5c7f035b2bacb901cce43053
2019-11-20 10:47:21 -08:00
ec52d911bd InstanceNorm{1,2,3}d (#28790)
Summary:
Hi yf225,

I have a few doubts related to the implementation:
1) What tests do I have to write?
2) What does _load_state_from_dict do?
3) Do I need to override the reset() function, as I cannot see its utility?
4) InstanceNormOptions could be replaced with BatchNormOptions, but I find that
`track_running_status` is not defined; instead `stateful` is defined.

InstanceNorm{1,2,3}d https://github.com/pytorch/pytorch/issues/25883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28790

Differential Revision: D18588666

Pulled By: yf225

fbshipit-source-id: bb9b81f01f62c3fc8765fa0ba0716768087ee155
2019-11-19 16:57:01 -08:00
0e5200adfe Refactor target_compile_options into torch_compile_options (#29730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29730

Back in the day, Caffe2 had a good idea: instead of spattering
target_compile_options all over the codebase, define a helper
function which sets all the options for a target.  This is especially
helpful if I want to split libtorch.so into libtorch_cpu.so
and libtorch_cuda.so; I need a way to easily apply options
to multiple targets.  A shared helper function is just the ticket.

I moved every target_compile_options call in caffe2/CMakeLists.txt
that didn't seem target dependent (exclusions included OpenMP flags,
API-related macros, ONNX related macros and HIP flags) into
torch_compile_options.  I slavishly preserved the structure:
there's a nearly redundant WERROR if() in the output but I preserved
it.

There is one thing I don't like about this, which is that now
the compile options are off in a random directory that no one would
expect.  But c'est la vie...

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18571166

Pulled By: ezyang

fbshipit-source-id: 21cd5f7663485077600782078fbb1787fab09035
2019-11-18 07:05:48 -08:00
1381301d46 Remove AT_LINK_STYLE entirely. (#29729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29729

It already errored out as no longer supported when you built with CUDA/HIP support;
now I expunge it entirely. Along the way, I delete useless INTERFACE
libraries (which aren't used anywhere else in the cmake).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18571167

Pulled By: ezyang

fbshipit-source-id: f88c73a16fad3b61eaa7745a2d15514c68704bec
2019-11-18 07:05:43 -08:00
18bdf97dbb Factor Module into Object and Module
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29500

Test Plan: Imported from OSS

Differential Revision: D18463064

Pulled By: jamesr66a

fbshipit-source-id: d37bef242a8626593d4b8754042152cfc0f0acb2
2019-11-17 22:58:50 -08:00
3003c5f91b OPN ops TupleConstruct/Unpack and format. (#29635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29635

TupleConstruct/Unpack as OPN ops.

Test Plan: Imported from OSS

Differential Revision: D18499602

fbshipit-source-id: 389b21d3ea532ef6fa729d67ce34214d86700cd2
2019-11-15 16:22:42 -08:00
77bb41c965 Rename dist_autograd_context and dist_autograd_container. (#29696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29696

The paths distributed/autograd/context/dist_autograd_context.h and
distributed/autograd/context/dist_autograd_container.h were repetitive.

Therefore renaming these to distributed/autograd/context/context.h and
distributed/autograd/context/container.h
ghstack-source-id: 93850266

Test Plan: waitforbuildbot

Differential Revision: D18467624

fbshipit-source-id: bbf3905396f553006851af296c880c1bd106ec47
2019-11-14 14:49:34 -08:00
01d76145fc Fix typo: Caffe2_MAIN_LIB to Caffe2_MAIN_LIBS (#29746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29746

I don't know if this actually broke anything because I just discovered
the typo while reading the cmake.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18504546

Pulled By: ezyang

fbshipit-source-id: 6cb5fb1e71721e5cf8fc2f7b5552dc7c514f065f
2019-11-14 07:55:09 -08:00
604fc9ec41 F::embedding, F::embedding_bag, moved Embedding and EmbeddingBag options to embedding.h in options
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28669

Differential Revision: D18377609

Pulled By: anjali411

fbshipit-source-id: 6a2c547368849ebd1a2f8828cfbe7252152b26a2
2019-11-11 11:51:26 -08:00
309b28ee3a Trace module calls
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29261

Test Plan: Imported from OSS

Differential Revision: D18343363

Pulled By: jamesr66a

fbshipit-source-id: 0c6394205e2c0ea8708028d20df83fe17b466ff4
2019-11-06 15:05:49 -08:00
a5d356cb39 Delete THP_CORE macro; partially replace with THP_BUILD_MAIN_LIB (#29143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29143

THP_CORE macro is a very old macro that appeared to have served
two purposes:

1. The torch-python equivalent of CAFFE2_BUILD_MAIN_LIB, to toggle
   symbol visibility headers

2. Some sort of ad hoc way of hiding certain definitions from headers
   so external clients can't get at them.

It did (2) in a very confusing manner, because we set THP_CORE in both
torch and torch-python (it shouldn't do anything in torch).  In this
PR I just get rid of use case (2) entirely (so everything shows up in
headers all the time), and then redo (1) using a new THP_BUILD_MAIN_LIB
macro.  This cleans up some of the macro definitions and makes my life
easier for working on #27215.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18309594

Pulled By: ezyang

fbshipit-source-id: adcb6d7cb387cd818480137e2b94e5e761dbfefc
2019-11-06 15:02:02 -08:00
6389c18709 C++ parity, nn::CrossMapLRN2d (#29039)
Summary:
yf225 https://github.com/pytorch/pytorch/issues/25883
Re-submitted pull request because of a rebase mistake!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29039

Differential Revision: D18326829

Pulled By: yf225

fbshipit-source-id: 5ed737f6275e4463efa4951d9b7f45c6f2723c82
2019-11-05 15:27:08 -08:00
6c3915643b Rename PythonUDF{Call,Resp} (#27530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27530

Per discussion in #27286, the `UDF` part is superfluous.

This makes the naming consistent with the `MessageType` enum.

Test Plan: Imported from OSS

Differential Revision: D17808211

Pulled By: pietern

fbshipit-source-id: 0ff925de26d027951ce285750ad276ed17fee4c6
2019-11-05 06:25:26 -08:00
dbbb2fc9e5 Remove the linkage to CUDA libraries when ROCM is used. (#29009)
Summary:
Currently, when ROCm is used, CUDA libraries are still linked. There has
been no error because USE_CUDA is set to OFF upon a preliminary check in
tools/setup_helper/cuda.py, and no CUDA variable is set. Hence, these
lines can pass simply because those variables are always undefined and thus expand to empty strings. But this
cannot be safely relied on, and it is causing https://github.com/pytorch/pytorch/issues/28617 to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29009

Differential Revision: D18273472

Pulled By: ezyang

fbshipit-source-id: b8b6580e8a44d874ac678ed9073412d4d2e393ee
2019-11-01 11:18:21 -07:00
47faee2fae Switching tests to ProfilingExecutor (rebased)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28535

Differential Revision: D18197932

Pulled By: Krovatkin

fbshipit-source-id: 2639b205e899f800787ee57c157447d54e4669c3
2019-10-29 11:41:42 -07:00
52dd587123 C++ API parity: Upsample (#28413)
Summary:
Adds `interpolate` functional and `Upsample` module support for the C++ API.

**Issue**: https://github.com/pytorch/pytorch/issues/25883

**Reviewer**: yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28413

Differential Revision: D18165014

Pulled By: yf225

fbshipit-source-id: ecae2f432a301b1f4afa7c038b2d104cbad139f2
2019-10-28 21:34:44 -07:00
aea94de067 Exclude more files in torch/csrc/distributed when USE_DISTRIBUTED=0 (#28621)
Summary:
Changelog:
- Guard inclusion of certain files in torch/csrc/distributed included in caffe2/CMakeLists.txt when USE_DISTRIBUTED=0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28621

Test Plan:
- Builds should be successful
- Tests should pass

Differential Revision: D18145330

Pulled By: ezyang

fbshipit-source-id: 7167a356b03ae783e6b0120f2ad3552db2b3ed86
2019-10-28 08:03:30 -07:00
e885ce6130 C++ parity, grid_sample functional (#28354)
Summary:
https://github.com/pytorch/pytorch/issues/25883
I put grid_sample in vision.h with affine grid.

I have a question about the string arguments (interpolation mode, padding mode):
I reuse torch::native::detail::GridSamplerInterpolation from GridSampler.h instead of using strings.
This follows the way the reduction enum is used in loss functions.
I am not sure this is right.

yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28354

Differential Revision: D18109333

Pulled By: yf225

fbshipit-source-id: 1bf972b671b107464f73b937bbe0de76fb259fbf
2019-10-24 15:14:37 -07:00
78375c02b8 C++ nn::ReflectionPad1d and nn::ReflectionPad2d
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28538

Test Plan: Imported from OSS

Differential Revision: D18115608

Pulled By: yf225

fbshipit-source-id: 3a48d8c11721f013076db2965f5f75b71662c78e
2019-10-24 12:02:51 -07:00
d8c66c1576 autograd/profiler: make python record_function use JIT methods
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28264

Test Plan: buck test caffe2/test:autograd caffe2/test/cpp/jit:jit

Reviewed By: bddppq

Differential Revision: D17997612

fbshipit-source-id: 8a29ae50c28ce905f63c732fe0aa49edfc9d99e3
2019-10-24 10:28:32 -07:00
7b59174882 torch::nn::LayerNorm
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28032

Differential Revision: D18047371

Pulled By: anjali411

fbshipit-source-id: fb61aea52d6622a67ec1d84950e17e85686461ae
2019-10-22 12:50:22 -07:00
079b3cc02c Add C++ nn::functional pad
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26601

Test Plan: Imported from OSS

Differential Revision: D17517468

Pulled By: yf225

fbshipit-source-id: 9ee8b93b88a60f91f2ae78c242f9eaa246b3293c
2019-10-21 22:20:38 -07:00
a3902c901a Revert "Fix early expansion of CUDA_TOOLKIT_ROOT_DIR in libtorch builds (#27887)" (#28310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28310

This reverts commit 3d3bff5ff1bc277306d15a3caa96c2a6fdb924bb.

Test Plan: Imported from OSS

Differential Revision: D18042859

Pulled By: ezyang

fbshipit-source-id: cded781dda6fcc04199af6abd07ac09fdc0405de
2019-10-21 14:45:17 -07:00
d9b4788e5d cleanup dist autograd context on other nodes when it is released on one node (#27951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27951

We want to clean up the distributed autograd context across the other nodes when a single node is done (here, "done" means it has exited the context manager `with dist_autograd.context() as context_id: ...`).

This PR does a few things to implement the above:
1) Add classes to encapsulate messages for requesting this context release and the response
2) Handling of this request in `request_callback_impl.cpp`. When we receive this request, we get the context from a given context_id and release it.
3) RPC call in `DistAutogradContainer::releaseContext` to send this command. This currently does not wait for an ack or implement any sort of retrying. We send the RPC to all the workerIds we have come into contact with (implemented in https://github.com/pytorch/pytorch/pull/26324)
4) Relevant unit tests

In follow up PRs, we will add error checking + retries for this call.

ghstack-source-id: 92269279

Test Plan: Added/modified unit tests in `test/dist_autograd_test.py`

Differential Revision: D17920137

fbshipit-source-id: 7403512ab5fcbc28d21c548b2e45319dd472e26a
2019-10-21 07:34:08 -07:00
a1e14a6626 PixelShuffle module and functional (#28140)
Summary:
Added `PixelShuffle` module and functional https://github.com/pytorch/pytorch/issues/25883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28140

Differential Revision: D18008474

Pulled By: yf225

fbshipit-source-id: f482495bb56998701c79a61ef065a121bf5a5154
2019-10-18 15:54:14 -07:00
0aa694ebe5 Move Method::lowered_graph to a separate pass out of the Method class. (#28242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28242

There is no reason to have it in a general API of Module/Method - it's
just another graph pass. It was there because some time ago modules were
not first class and all graphs were lowered. After that changed, this
API was added for easier transition, but now we don't need it anymore.

Test Plan: Imported from OSS

Differential Revision: D17986724

Pulled By: ZolotukhinM

fbshipit-source-id: 279a1ec450cd8fac8164ee581515b09f1d755630
2019-10-18 12:48:40 -07:00
aad5071206 Use torch::variant for enums in C++ API
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26837

Test Plan: Imported from OSS

Differential Revision: D17579438

Pulled By: yf225

fbshipit-source-id: 9ac59df28a317fdb3be2cc02c65962ad99117127
2019-10-16 22:40:57 -07:00
3d3bff5ff1 Fix early expansion of CUDA_TOOLKIT_ROOT_DIR in libtorch builds (#27887)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/15476, supersedes https://github.com/pytorch/pytorch/issues/23496, supersedes and closes https://github.com/pytorch/pytorch/issues/27607

As explained by rgommers in https://github.com/pytorch/pytorch/issues/23496, linking against the expanded library path for `libculibos` in `cmake/Dependencies.cmake` hard codes the path into the distributed cmake files.

Instead, I only link against the targets (e.g. `caffe2::cudnn`) and move the dependency on `libculibos` into the CUDA import targets declared in `cmake/public/cuda.cmake`. That file is distributed with the other cmake files, so the variable is expanded on the user's machine. I am now also using `CMAKE_STATIC_LIBRARY_SUFFIX` instead of `.a` to fix the Windows issue from https://github.com/pytorch/pytorch/issues/15828. I don't have a Windows setup to confirm, though.

Finally, to get pytorch to compile with the extra libraries enabled, I also had to link `__caffe2_nccl` to `torch_python`; otherwise I was getting include errors as the hard coded include directory was wrong. `nccl` is built into `build` not `third_party/build`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27887

Differential Revision: D17929440

Pulled By: ezyang

fbshipit-source-id: 3db6bd94d758fca2e1d6a64f4f5eea03cc07cf64
2019-10-16 09:21:47 -07:00
cc5c34a0d0 Add nn::functional::normalize() to C++ Frontend (#27280)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/27048

PR Summary:

Files Added:

- torch/csrc/api/include/torch/nn/options/normalization.h
- torch/csrc/api/include/torch/nn/functional/normalization.h

Files Modified:

- test/cpp/api/functional.cpp
- torch/csrc/api/include/torch/nn/functional.h

---

yf225: I couldn't find a C++ equivalent of gradcheck(). Is there such a function, or is it sufficient to call .backward() in the test body? I don't think any solutions are checked for the Python tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27280

Differential Revision: D17902109

Pulled By: yf225

fbshipit-source-id: 1bce1a88103d0f1848633fec90fde95ea8f3d1ed
2019-10-14 08:39:02 -07:00
3bccd3fc0d Distributed Autograd - FAST mode backward pass implementation. (#27022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022

This change implements the "FAST" mode distributed autograd backward
pass as described in https://github.com/pytorch/pytorch/issues/23110.

At a high level the backward pass works as follows:
1. We start by computing dependencies on the node that calls
`torch.distributed.backward`.
2. This node computes the dependencies starting from the root nodes provided in
the backward call and all the 'send' functions present in the current autograd
context. The "FAST" mode assumes all 'send' functions are part of the autograd
computation.
3. Once the dependency computation is done, the distributed autograd engine
calls the local autograd engine to execute the autograd graph. Note that the
autograd graph on a single node is not necessarily connected because of
inter-node communication. As a result, we have special handling to ensure the
local autograd engine executes the entire graph starting from the
provided roots and all 'send' functions on the node.
4. When the local autograd engine hits a 'recv' function, it performs an async
RPC to send the gradients over to the appropriate node and stores a future in
the autograd context to keep track of this RPC.
5. On the destination node, the appropriate 'send' function is looked up and
enqueued on the local autograd engine. If this is the first time the node is
hearing about this autograd context id on the backward pass, then the node
computes dependencies for the local autograd engine.
6. As part of computing dependencies, the distributed autograd engine discovers
all leaf nodes and ensures those are passed as 'outputs' to the local autograd
engine. This avoids running the 'AccumulateGrad' function.
7. The gradients computed for the leaf nodes are then actually accumulated in
`DistAutogradContext` for the appropriate autograd context id.
8. The distributed autograd engine waits for the local autograd engine
to complete and also waits for all the 'Futures' (stored in 4.) for respective
RPCs to finish.

We have made the following changes to the local autograd engine for this
purpose:

1. Expose GraphTask and NodeTask so that the distributed autograd engine can
use them.
2. Expose an `execute_with_graph_task` API which allows the distributed engine
to build a GraphTask and pass it to the local autograd engine.
3. Expose an `enqueue_on_cpu` API, which allows the distributed engine to build
a `NodeTask` for a 'send' function and enqueue it on the local autograd engine.

In addition to this a few general improvements:
1. Added a `PropagateGradients` RPC call for the 'recv' function to pass
gradients to the appropriate node during the backward pass.
2. Use IValues as much as possible in serialization for RpcWithAutograd.
3. If the message from Future.wait() has type EXCEPTION, we throw an appropriate
exception instead of just returning the message. This is in line with what most
Future.wait() APIs do.
4. Added a `get_gradients(context_id)` API which allows users to retrieve a map
from each Tensor to its respective gradient for the provided context_id on the local
node (a brief usage sketch follows below).
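
For orientation, here is a hedged sketch of the user-facing flow this enables; module paths and exact signatures are assumptions and may differ from the version in this commit:

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

# Assumes rpc.init_rpc(...) has already been called on every worker, and that
# backward()/get_gradients() match this commit's signatures (later releases
# changed backward() to also take the context id).
with dist_autograd.context() as context_id:
    t = torch.rand(3, 3, requires_grad=True)
    # The forward pass crosses workers via RPC; 'send'/'recv' autograd
    # functions are attached at each boundary automatically.
    out = rpc.rpc_sync("worker1", torch.add, args=(t, t))
    loss = out.sum()
    # Kick off the FAST-mode distributed backward pass from this node.
    dist_autograd.backward([loss])
    # Gradients are accumulated per context rather than in .grad fields.
    grads = dist_autograd.get_gradients(context_id)
```
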
ghstack-source-id: 91794926

Test Plan: unit tests.

Differential Revision: D17652615

fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3
2019-10-12 09:47:49 -07:00
d57124823b Regenerate aten_op.h when native_functions.yaml changes. (#27253)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/10127.

This ensures that aten_op.h is regenerated whenever a native kernel
is removed. Previously it was only being regenerated when new native
kernels were added, because that generated new source files which this
cmake target depended on. However, if a native kernel is removed, then
there is no dependent target and the header is never regenerated.

Explicitly depending on native_functions.yaml ensures that the header
is regenerated even if a kernel is removed.

I'm no cmake expert so alternative approaches or reasons why this is
obviously incorrect are very appreciated!

EDIT: reflecting comments below we now depend on `Dependencies.yaml` instead of `native_functions.yaml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27253

Differential Revision: D17813659

Pulled By: ezyang

fbshipit-source-id: 2c754a88ba62495c14de8a9649f6675d2dad0b7d
2019-10-08 11:54:51 -07:00
24242e86fa Ensure NCCL error handling code is disabled for NCCL versions < 2.4 (#27124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27124

ncclCommAbort() and ncclGetAsyncError() were two APIs added in NCCL
2.4 to detect errors in NCCL communicators. These were used as part of
ProcessGroupNCCL, and we also enforced that only NCCL versions 2.4+ were
supported. However, there is still legitimate use for older NCCL versions, and
hence we should still support those.

For that purpose, in this change I've ensured we disable NCCL error checking
for versions < 2.4.
ghstack-source-id: 91452959

Test Plan:
1) Test with 2.4.8
2) Test with 2.2.13
3) unit tests.

Differential Revision: D17178988

fbshipit-source-id: 5dc44b5f7b4b00466c67fd452315f1d4f5c47698
2019-10-07 17:39:32 -07:00
0b6186d778 Remove Tensor.h, TensorMethods.h from src/core. (#27086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27086

This is a major source of merge conflicts, and AFAICT isn't necessary anymore (it may have been necessary for some mobile build stuff in the past).

This is a commandeer of #25031

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D17687345

Pulled By: ezyang

fbshipit-source-id: bf6131af835ed1f9e3c10699c81d4454a240445f
2019-10-06 09:37:50 -07:00
2486b0ba82 Add Python RRef as args and return value (#25499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499

See #23110 for model parallel design details, and #26759 for the RRef
protocol. This commit adds support for using RRefs as Python UDF arguments
and return values. RRefs can now be shared from owner to user, from user to
owner, or from user to user.

Limitations:
1. No implicit type conversion yet. (#27099)
2. No failure handling and retry. (#26116)
3. UDF is not yet blocked until all RRefs are confirmed. (#27098)
4. Internal RRef control messages are not idempotent yet. (#26116)
5. Cannot delete RRefs correctly when there are circular dependencies. (#27096)

Main changes:

1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations.
2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages.
3. New message request handling code is added to `functions.cpp`, and message format is added in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`.
4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork related control messages will be sent during RRef pickling/unpickling procedure.
5.  Update `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs.
6. RRef context (reference count, etc.) is tracked in `rref_context.h` and `rref_context.cpp`. (A usage sketch follows below.)
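
To ground this, here is a hedged usage sketch of passing RRefs to a Python UDF; the worker names are placeholders, and the API surface shown may differ slightly from the version in this commit:

```python
import torch
import torch.distributed.rpc as rpc

# Assumes rpc.init_rpc("worker0", rank=0, world_size=2) has been called here
# and that a peer named "worker1" is up.

def add_rrefs(rref_a, rref_b):
    # A Python UDF that receives RRefs as arguments; to_here() fetches the
    # values from their owner.
    return rref_a.to_here() + rref_b.to_here()

# remote() returns an RRef whose value is created and owned by "worker1".
rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 2))

# The RRefs themselves are passed as UDF arguments; only the references are
# pickled and forked, not the tensor data.
result = rpc.rpc_sync("worker1", add_rrefs, args=(rref1, rref2))
```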

Test Plan:
Imported from OSS

buck test mode/dev-nosan //caffe2/test:rpc_fork

Differential Revision: D17184146

Pulled By: mrshenli

fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265
2019-10-03 17:47:12 -07:00
fe4170bda8 Add send and recv backward functions for builtin operators RPC. (#25527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527

Master GH issue: https://github.com/pytorch/pytorch/issues/23110.

This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed RPC
for builtin operators. It does not address distributed RPC for Python
UDFs; that will be addressed in follow-up PRs.

Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
ghstack-source-id: 91240466

Test Plan: unit tests.

Differential Revision: D17148077

fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233
2019-10-03 01:18:46 -07:00
515e3b85da C++ API parity: Hardshrink
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27035

Test Plan: Imported from OSS

Differential Revision: D17682403

Pulled By: pbelevich

fbshipit-source-id: 186377fe577abfdd53acc95751a7ed845b51af95
2019-10-02 08:30:20 -07:00