Commit Graph

405 Commits

Author SHA1 Message Date
b9adbb5002 Fix/relax CMake linter rules (#35574)
Summary:
Ignore mixed upper-case/lower-case style for now.
Fix the space-between-function-and-its-arguments violation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35574

Test Plan: CI

Differential Revision: D20712969

Pulled By: malfet

fbshipit-source-id: 0012d430aed916b4518599a0b535e82d15721f78
2020-03-27 16:52:33 -07:00
77ad3c5aeb Revert D20683972: [pytorch][PR] Fix PyTorch separate compilation
Test Plan: revert-hammer

Differential Revision:
D20683972

Original commit changeset: bc1492aa9d1d

fbshipit-source-id: 8994cbb36877d4338b8677ac6bc807dd16efa67c
2020-03-27 09:18:48 -07:00
2e739f822b Fix PyTorch separate compilation (#34863)
Summary:
Looks like there is a bug in the CUDA device linker: kernels that use `thrust::sort_by_key` cannot be linked with other kernels.
Solve the problem by splitting 5 thrust-heavy .cu files into a `__torch_cuda_sp` library which is statically linked into `torch_cuda`.
For the default compilation workflow this should not make any difference.

Test Plan: Compile with `-DCUDA_SEPARABLE_COMPILATION=YES` and observe the library size difference: 310MB before, 173MB after when compiled for sm_75
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34863

Differential Revision: D20683972

Pulled By: malfet

fbshipit-source-id: bc1492aa9d1d2d21c48e8764a8a7b403feaec5da
2020-03-26 17:49:07 -07:00
a4ea16dbc6 Put prim ops used in full jit only in a separate file (#35232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35232

Some prim operators, like profile and fusion, are not used on mobile (at least in the short term) and are coupled with JIT code. Put them in a separate file (register_prim_ops_fulljit.cpp).
ghstack-source-id: 100807055

Test Plan: buck build //xplat/caffe2:torch

Reviewed By: dreiss

Differential Revision: D20408827

fbshipit-source-id: 9013093357cf75723ef00c34bbfdb6b7ea40a4cf
2020-03-25 14:15:34 -07:00
512bcf68be [Formatting] if ( -> if( in CMakeLists.txt (#35343)
Summary:
Same for `else`, `endif` and `elseif`.
Also prefer the lowercase forms over the uppercase ones
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35343

Test Plan: None at all

Differential Revision: D20638789

Pulled By: malfet

fbshipit-source-id: 8058075693185e66f5dda7b825b725e139d0d000
2020-03-25 13:48:42 -07:00
361eed6a6e Use JIT op registration directly for lite interpreter. (#34070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34070

The first step to make all operators available for the lite interpreter. The original code used manual registration for lite interpreter ops with a "_" prefix, for two reasons:
1. To minimize the build size.
2. To avoid duplicate registration in OSS (mainly feature testing and unit tests).

Now that we have more and more models to support, the manual registration approach is not practical. To make this process automatic while keeping the binary size under control, we plan to:
1. Make all necessary ops callable from the lite interpreter.
2. The binary size would increase because of step 1. Use ljk53's custom build to selectively build the binary with only the ops used in specific models. The ops will be automatically collected using get_opnames (see the sketch after this list).
3. The temporary "register_mobile_ops.cpp" can be removed.
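
A minimal sketch of the op-name collection step in item 2, assuming `torch.jit.export_opnames` is the collection entry point (the exact helper name and the sample output are assumptions, not taken from this PR):

```python
import torch

class MyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

scripted = torch.jit.script(MyModel())
# Collect the root operator names this model actually calls; the list can then
# feed a selective/custom mobile build so only these ops are compiled in.
op_names = torch.jit.export_opnames(scripted)
print(op_names)  # e.g. something like ['aten::add.Scalar', 'aten::relu']
```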

Test Plan: Imported from OSS

Differential Revision: D20291596

Pulled By: iseeyuan

fbshipit-source-id: 553b4699619cd71fea20658f3bc8c2d48852ef5c
2020-03-25 07:21:51 -07:00
5b2f8cef08 [JIT] Functional Graph Pass (#33020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33020

This is a pass to create functional blocks. The other PRs in the stack help avoid some of the limitations that are often found in graphs. It's possible that this would work well with a graph that is frozen. Follow-up work items that will help this pass:

- We don't currently have any capacity in alias analysis to tell whether a Value that came from the wildcard set "re-escapes" back into the wildcard set.
- More comments on the semantics of the graph and correctness conditions
- We could consider using dynamic dag if the perf of this is a limitation.
- Potentially make Functional Graphs into Functional Blocks instead, so that we do not repeatedly copy constants and the IR reads more easily.

Test Plan: Imported from OSS

Differential Revision: D20603188

Pulled By: eellison

fbshipit-source-id: 6822a6e65f4cc2676f8f6445fe8aa1cb858ebeeb
2020-03-24 23:44:18 -07:00
a7f8655314 Revert D20624571: [pytorch][PR] [TensorExpr] Extend arithmetic simplifier to work with multi variable expressions
Test Plan: revert-hammer

Differential Revision:
D20624571

Original commit changeset: e49049377bee

fbshipit-source-id: 7d8dda0c3b44be1c3236a0313bbfa128b7015de7
2020-03-24 16:59:51 -07:00
fce67800f4 [TensorExpr] Extend arithmetic simplifier to work with multi variable expressions (#35127)
Summary:
A new version of the IR simplifier used by the jit/tensorexpr fuser. This is capable of simplifying expressions containing (shock) multiple variables, e.g.:

```(m * (1 * n_1) + (n  + 1)) - (m *  (1 * n_1) + n) => 1```

Similar to the previous IR Simplifier it uses a two stage approach:
1. Traverse the tree, combining subtrees of commutable operations into a flat structure. In this implementation we have two intermediate Exprs: Term (expressing products of sub-expressions) and Polynomial (expressing sums of sub-expressions).
2. Traverse the tree, expanding Terms and Polynomials into their component operators.

Using the example above, we execute a process like this to simplify:
```
   (m * (1 * n_1) + (n  + 1)) - (m *  (1 * n_1) + n)
# Using PolynomialTransformer:
=> Sub(Add(Mul(m, Mul(1, n_1)), Add(n, 1)), Add(Mul(m, Mul(1, n_1)), n))
=> Sub(Polynomial(Term(m, n_1), n, 1), Polynomial(Term(m, n_1), n))
=> Polynomial(Term(m, n_1), Term(-1, m, n_1), n, -n, 1)
=> Polynomial(1)
# Using TermExpander
=> 1
```

The IRSimplifier supports arithmetic simplifications of operators Add, Sub and Mul and constant folding of all binary Exprs and Intrinsics, but does not attempt expansion of multiplication of Polynomials to the canonical form since that generally leads to less efficient representations. It will do scalar factorization if it results in removal of operators, and will merge chains of multilane primitives (such as Broadcast and Ramp) down into a single operator. The ir_simplifier unit tests are a short tour of its capabilities.
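
To make the two-stage idea concrete, here is a toy Python sketch (the helper names are hypothetical; the real simplifier works on the C++ Expr IR, not on tuples). Stage 1 flattens sums of products into coefficient/variable terms and combines like terms, which is enough to cancel the example above down to the constant 1:

```python
from collections import Counter

def term(coeff, *variables):
    # A product of variables with a numeric coefficient, e.g. term(1, "m", "n_1").
    return (coeff, tuple(sorted(variables)))

def simplify(*terms):
    # Stage 1: flatten into a "Polynomial" keyed by the variable part and
    # combine like terms by summing coefficients.
    acc = Counter()
    for coeff, variables in terms:
        acc[variables] += coeff
    # Stage 2: expand back, dropping terms that cancelled to zero.
    return [(coeff, variables) for variables, coeff in acc.items() if coeff != 0]

# (m * (1 * n_1) + (n + 1)) - (m * (1 * n_1) + n)
lhs = [term(1, "m", "n_1"), term(1, "n"), term(1)]
rhs = [term(-1, "m", "n_1"), term(-1, "n")]
print(simplify(*lhs, *rhs))  # [(1, ())] -> the constant 1
```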

The existing simplifier has a bug where it will sometimes reorder operations on floating point types which are not associative. This causes (at least) the pyhpc equation_of_state benchmark to produce incorrect results. I have fixed that issue in this version and verified that that benchmark produces the same results with and without the simplifier.

Tests: all cpp & py tensorexpr tests, and the pyhpc benchmark:
```
benchmarks.equation_of_state
============================
Running on CPU

size          backend     calls     mean      stdev     min       25%       median    75%       max   Δ
------------------------------------------------------------------------------------------------------------------
   4,194,304  pytorch           10     0.246     0.002     0.243     0.245     0.246     0.248     0.250     1.000
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35127

Differential Revision: D20624571

Pulled By: nickgg

fbshipit-source-id: e49049377beee69e02dcf26eb922bef1447ae776
2020-03-24 14:16:07 -07:00
65cea95777 [TensorExpr] Rename schedule.{cpp,h} to loopnest.{cpp,h}. (#35119)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35119

Differential Revision: D20567927

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 1fb6d03bd4c6e66aca62140d2b537692577f261d
2020-03-20 23:37:51 -07:00
7065c46ea2 Respect dist autograd context in torch.jit._fork. (#34360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34360

The distributed autograd context sets up a thread-local context id
which is used to perform appropriate bookkeeping and autograd recording of RPC
functions in the forward pass.

However, if we use torch.jit._fork within the distributed autograd context, the
code executed within torch.jit._fork will lose this context since it is run in
a separate JIT thread and the thread local is not set in that thread.

To fix this problem, we pass in the distributed autograd context to
torch.jit._fork similar to what we did in
https://github.com/pytorch/pytorch/pull/16101.
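
A minimal sketch of the usage this enables, assuming a single-worker RPC setup (the worker name, rank, and rendezvous addresses are assumptions for illustration):

```python
import os
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

@torch.jit.script
def double(t):
    # type: (Tensor) -> Tensor
    return t * 2

# Single-process rendezvous for illustration; address/port values are assumptions.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker0", rank=0, world_size=1)

with dist_autograd.context() as context_id:
    # With this change, the forked task records into the same distributed
    # autograd context instead of losing it on the separate JIT thread.
    fut = torch.jit._fork(double, torch.ones(2, 2, requires_grad=True))
    out = torch.jit._wait(fut)

rpc.shutdown()
```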
ghstack-source-id: 100445465

Test Plan: waitforbuildbot

Differential Revision: D20301352

fbshipit-source-id: aa3fffe69c2b40722c66213351a4e0d77484a621
2020-03-19 14:12:28 -07:00
96860af870 Revert D20164420: [1.5 Release][Dist Autograd][Better Engineering] Notify Workers on Failure during Distributed Autograd
Test Plan: revert-hammer

Differential Revision:
D20164420

Original commit changeset: 3d4ed7423096

fbshipit-source-id: 67f0f9c11cee84df6dbe37db7821dd601227df66
2020-03-19 08:02:07 -07:00
5f67c923f1 [1.5 Release][Dist Autograd][Better Engineering] Notify Workers on Failure during Distributed Autograd (#34638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34638

Fixes: https://github.com/pytorch/pytorch/issues/27643

This PR manages notifying workers in the event of a failure during distributed autograd. Gracefully handles propagating errors across all nodes in the backward pass and sets state in the local autograd engines accordingly.

(Note: this ignores all push blocking failures!)

Test Plan: Added 2 new tests checking errors when they are thrown in an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests pass.

Differential Revision: D20164420

fbshipit-source-id: 3d4ed74230969ac70bb763f1b5b1c16d979f66a2
2020-03-18 18:56:14 -07:00
cfab65d90d Fix CMake Dev warning in caffe2/CMakeLists.txt (#34886)
Summary:
If the arguments of an `ENDIF()` block are non-empty, they should match the corresponding `IF()` block
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34886

Test Plan: CI

Differential Revision: D20494631

Pulled By: malfet

fbshipit-source-id: 5fed86239b4a0cb4b3aedd02c950c1b800199d2d
2020-03-17 12:19:42 -07:00
ea5c86c276 [TensorExpr] Add LLVM codegen. (#34228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34228

This PR adds LLVM codegen to tensor expressions. LLVM is added as an
optional build dependency specified with `USE_LLVM=<path_to_llvm>`
variable. If this variable is not set or LLVM is not found in the
specified path, the LLVM codegen is completely disabled.

Differential Revision: D20251832

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 77e203ab4421eb03afc64f8da17e0daab277ecc2
2020-03-16 11:49:34 -07:00
35e7efeb9a [TensorExpr] Add CUDA codegen. (#34227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34227

This PR adds a CUDA support to tensor expressions.

Differential Revision: D20251836

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: ab36a55834cceff30c8371fef6cca1054a32f017
2020-03-16 11:49:29 -07:00
42b2c8c65d [TensorExpr] Add a fuser pass based on tensor expressions. (#34226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34226

LLVM and CUDA backends are added in subsequent PRs, so at this point the fuser is pretty useless, but it can still be tested and its logic is not going to change with the addition of the codegens.

Differential Revision: D20251838

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 82b0d221fa89904ed526689d02a6c7676a8ce8de
2020-03-16 11:49:24 -07:00
e31d462e92 [TensorExpr] Pull changes to core classes for representing expressions and statements from the side branch. (#34224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34224

Our development has been happening on a side branch `pytorch_fusion` in
`bertmaher/pytorch` fork. This PR moves changes to the core classes
representing expressions and transformations on them.

At this moment, the tensor expressions are only used in tests.
Subsequent PRs add LLVM and CUDA codegen for tensor expressions and
implement fuser on top of these.

This PR is huge as it is a squashed version of changes in the side
branch. It is not practical to pull changes one by one from the branch,
so here is the squashed version. If you're interested in seeing the
history of changes, please refer to https://github.com/bertmaher/pytorch

Differential Revision: D20251835

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: 1a871acc09cf3c6f7fb4af40d408cdbb82dc7dab
2020-03-16 11:47:47 -07:00
24c9e61e79 Enable JIT tests on Windows (#27029)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27029

Reviewed By: eellison

Differential Revision: D20458664

Pulled By: jamesr66a

fbshipit-source-id: 22be918543703869f471e89b3478423198351bf3
2020-03-16 11:26:21 -07:00
4da5569300 Pass to remove prepacking ops. (#34319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34319

Removes prepacking ops and installs them as attributes of the top-level
module. Freezing needs to run as the first pass.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20290726

fbshipit-source-id: 633ceaa867ff7d5c8e69bd814c0362018394cb3a
2020-03-14 12:53:31 -07:00
7dd5da2026 JIT pass to insert XNNPACK ops (#34048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34048

Rewrites the graph to insert xnnpack prepack and packed run ops for
conv2d and linear.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Differential Revision: D20185658

fbshipit-source-id: c4c073c912ad33e822e7beb4ed86c9f895129d55
2020-03-14 12:53:27 -07:00
9e6cd98c3f Ensure torch_cuda is linked against on Windows (#34288)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31611.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34288

Differential Revision: D20314251

Pulled By: seemethere

fbshipit-source-id: 15ab2d4de665d553a1622a2d366148697deb6c02
2020-03-12 12:16:44 -07:00
a54416d208 [C++ API] Remove deprecated torch::nn::BatchNorm / FeatureDropout / modules_ordered_dict and torch::nn::init::Nonlinearity / FanMode (#34508)
Summary:
This PR is BC-breaking in the following way:
- The deprecated `torch::nn::BatchNorm` is removed in favor of `torch::nn::BatchNorm{1,2,3}d`
- The deprecated `torch::nn::FeatureDropout` is removed in favor of `torch::nn::Dropout{2,3}d`
- The deprecated `torch::nn::modules_ordered_dict` is removed. User should do `Sequential sequential({{"m1", MyModule(1)}, {"m2", MyModule(2)}})` instead.
- The deprecated `torch::nn::init::Nonlinearity` is removed, in favor of the following enums:
    - `torch::kLinear`
    - `torch::kConv1D`
    - `torch::kConv2D`
    - `torch::kConv3D`
    - `torch::kConvTranspose1D`
    - `torch::kConvTranspose2D`
    - `torch::kConvTranspose3D`
    - `torch::kSigmoid`
    - `torch::kTanh`
    - `torch::kReLU`
    - `torch::kLeakyReLU`
- The deprecated `torch::nn::init::FanMode` is removed, in favor of the following enums:
    - `torch::kFanIn`
    - `torch::kFanOut`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34508

Differential Revision: D20351601

Pulled By: yf225

fbshipit-source-id: cca0cd112f29a31bb023e348ca8f82780e42bea3
2020-03-12 10:09:58 -07:00
e95657b87e [C++ API] AdaptiveLogSoftmaxWithLoss (#29076)
Summary:
Implemented AdaptiveLogSoftmaxWithLoss and some tests for modules. Reference https://github.com/pytorch/pytorch/issues/25883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29076

Differential Revision: D20404588

Pulled By: yf225

fbshipit-source-id: edbadf432b8173cbcc6caf83c9c03dd92dc31a37
2020-03-12 09:53:58 -07:00
965146b818 [jit] delete netdef converter (#33807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33807

afaik this is unused, so removing it from the source tree. RIP :(

Test Plan: Imported from OSS

Differential Revision: D20122118

Pulled By: suo

fbshipit-source-id: cb45943f5b9f969482301a2f9fe540326dbc78f2
2020-03-09 22:25:16 -07:00
45a504dd2d [JIT] Introduce BuiltinOpFunction and integrate into torchbind (#34098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34098

* #33900 [JIT] Move stuff out of class_type.cpp

Test Plan: Imported from OSS

Differential Revision: D20229166

Pulled By: jamesr66a

fbshipit-source-id: d658a63a5d6e372e675f35b8456adc8de82b49f3
2020-03-07 10:03:56 -08:00
60e8615a6d [JIT] Virtualize Function (#33921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33921

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.intern.facebook.com/intern/diff/D20153092/)!

Test Plan: Imported from OSS

Differential Revision: D20177227

Pulled By: jamesr66a

fbshipit-source-id: 87f3e484c4f873d60f76f50f6789c1b4a73bdfde
2020-03-07 10:03:50 -08:00
9a5e9d8cec [pytorch][mobile] change mobile build scripts to build PyTorch by default (#34203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34203

Currently the CMake and mobile build scripts still build libcaffe2 by
default. To build PyTorch mobile, users have to set the environment variable
BUILD_PYTORCH_MOBILE=1 or the CMake option BUILD_CAFFE2_MOBILE=OFF.

PyTorch mobile has been released for a while. It's about time to change
CMake and build scripts to build libtorch by default.

Changed the caffe2 CI job to build libcaffe2 by setting the BUILD_CAFFE2_MOBILE=1
environment variable. Only found Android CI for libcaffe2 - do we ever
have iOS CI for libcaffe2?

Test Plan: Imported from OSS

Differential Revision: D20267274

Pulled By: ljk53

fbshipit-source-id: 9d997032a599c874d62fbcfc4f5d4fbf8323a12e
2020-03-05 23:40:47 -08:00
Jie
2b79bab029 [CUDA_FUSER] Fork CUDA fuser (#33527)
Summary:
Separating CUDA fuser from CPU fuser.

1. New node in IR - prim::CudaFusionGroup:
   This enables the CUDA fuser to co-exist alongside the old fuser and allows us
   to incrementally build and expand the CUDA fuser.

2. Copied the FuseGraph optimization passes to CudaFuserGraph:
   We will refactor & reuse Chunk/Concat from the old fuser logic, which is
   handled in the optimization pass at this moment. Unfortunately much of the
   code in the pass is tightly bound to the legacy fuser, which makes code
   sharing difficult.
   The CudaFusionGraph will support only a subset of operations compared to the
   legacy fuser (CUDA only). It is registered as a custom pass post fusion via
     ```torch._C._jit_register_cuda_fuser()```
   To have it take effect, you should also turn off fusion on the GPU via
     ```torch._C._jit_override_can_fuse_on_gpu(False)```
   (see the sketch after this list).

3. We don't have codegen in this PR yet (WIP). Currently we just fall back to
   the old fuser.
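
A minimal sketch combining the two knobs quoted in item 2; the scripted function and inputs are illustrative, and a CUDA-capable device is assumed:

```python
import torch

# Register the forked CUDA fuser as a custom post-fusion pass and turn off the
# legacy GPU fuser so prim::CudaFusionGroup handles fusible subgraphs instead.
torch._C._jit_register_cuda_fuser()
torch._C._jit_override_can_fuse_on_gpu(False)

@torch.jit.script
def scale_shift(x, w, b):
    return x * w + b

x = torch.randn(1024, device="cuda")
w = torch.randn(1024, device="cuda")
b = torch.randn(1024, device="cuda")
out = scale_shift(x, w, b)  # profiled and fused on subsequent runs
```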
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33527

Differential Revision: D20171598

Pulled By: ZolotukhinM

fbshipit-source-id: 9a3c0f06f46da7eaa80ae7551c04869f5b03ef71
2020-03-04 20:25:08 -08:00
f097ca503d Add and test training in lite interpreter. (#32359)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32359

Test Plan: Imported from OSS

Differential Revision: D19450614

Pulled By: iseeyuan

fbshipit-source-id: 6bafff39d7880a5b7fb9cd70c33a4e584812be12
2020-03-03 23:33:43 -08:00
7d01888a75 [JIT] Register rpc.rpc_async(..) as a JIT operator (#33329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33329

# Use case

```
@torch.jit.script
def send_rpc_async(dst_worker_name, user_callable_qual_name, tensor):
    # type: (str, str, Tensor) -> None
    rpc._rpc_async_torchscript(
        dst_worker_name, user_callable_qual_name, args=(tensor,)
    )
```

# Problem

```
torch.jit.frontend.NotSupportedError: keyword-arg expansion is not supported:
  File "/data/users/shihaoxu/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/rpc_spawn#binary,link-tree/torch/distributed/rpc/api.py", line 722
    args = args if args else ()
    kwargs = kwargs if kwargs else {}
    fut = _invoke_rpc_torchscript(to, qualified_name, *args, **kwargs)
                                                               ~~~~~~ <--- HERE
    return fut
```

# Solution

Register `rpc.rpc_async(..)` as a JIT operator to handle variable-length argument list.

# Plan

This PR contains the required changes to make `rpc.rpc_async(..)` a JIT prim operator, which can dynamically handle different numbers of arguments.

- Register "prim::rpc_async" as a `Symbol` in "interned_string.h"
- Add an if branch in "python_sugared_value.cpp" `toSugarValue(py::object, ..)`, the entry utility function, to set up how the JIT frontend converts the `torch.distributed.rpc.rpc_async(..)` Python function (Python object) into a `SpecialFormValue` (IR SugaredValue).
- Add a switch case for the "prim::rpc_async" Symbol in "ir_emitter.cpp" and `emitApplySpecialForm(..)` to set up how the JIT compiler provides inputs to the "prim::rpc_async" Operator.
- Register "prim::rpc_async" as a `jit::Operator` and provide implementation in "register_distributed_ops.cpp".

Notice: since the distributed module is an optional part when building PyTorch, the code added in this PR should be wrapped within a preprocessor macro.
```
#ifdef USE_DISTRIBUTED
new code here
#endif
```

Test Plan:
Items that need to be confirmed in the test cases

https://fb.quip.com/DCvdA9ZLjeO0

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork  \
\
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_call_python_function_remotely_from_script_not_supported
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_spawn
```

```
buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:layer_norm_op_test-2.7 -- test_layer_norm_op_jit
```

Differential Revision: D5738300

fbshipit-source-id: a4604fe762e00be062dc8232ca9790df31fb2074
2020-03-03 19:57:42 -08:00
9b39ad7f2c [jit] Fix iOS build (#34180)
Summary:
`unpickler.cpp` depends on the mobile type parser all the time, so include it regardless of whether it's a mobile build or not
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34180

Pulled By: driazati

Differential Revision: D20241881

fbshipit-source-id: a998dd2b3f1c7f58e55bb7851dc595c8ddf9eacb
2020-03-03 19:44:43 -08:00
cab8772c6c Freezing Torchscript modules (#32178)
Summary:
This patch enables folding GetAttr nodes with their corresponding
values. The _jit_pass_freeze_module API returns a new TorchScript module
where all function calls and get attributes are inlined.
Usage:

frozen_model = torch._C._freeze_module(scripted_model._c)
frozen_model.forward(...)

This API currently optimizes the forward method. We will follow up to
preserve and optimize methods and attributes that are annotated as
torch.jit.interface.

Several future improvements to JIT optimizations are required to further
clean up/de-sugar the graph and eliminate redundancies.
Ideally, we want to produce a graph that can easily be lowered to
GLOW and other low-level backends.
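
A slightly fuller sketch of the usage above; the module definition is illustrative and only the `_freeze_module` call and `forward` usage are taken from this description:

```python
import torch

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(2, 2))

    def forward(self, x):
        return x + self.weight

scripted_model = torch.jit.script(MyModule())
# Freezing folds GetAttr nodes (e.g. self.weight) into the graph as constants
# and inlines function calls, returning a new frozen module.
frozen_model = torch._C._freeze_module(scripted_model._c)
out = frozen_model.forward(torch.randn(2, 2))
```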
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32178

Differential Revision: D19419640

Pulled By: bzinodev

fbshipit-source-id: 52baffaba9bca2cd60a8e747baa68d57711ad42b
2020-03-02 11:38:36 -08:00
0e52627358 Fixing pthreadpool symbol conflict issue. (#33869)
Summary:
Mainly renames C2's pthread_create, the only one referenced internally in NNPACK
that was conflicting, to pthread_create_c2.
Removed 2 other conflicting symbols that are not used internally at all.
Pointed XNNPACK to the original repo instead of the fork.

Copy-pasted the new interface and implementation to
caffe2/utils/threadpool, so that for internal builds we compile against
this.

When threadpool is unified this will be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33869

Differential Revision: D20140580

Pulled By: kimishpatel

fbshipit-source-id: de70df0af9c7d6bc065e85ede0e1c4dd6a9e6be3
2020-02-28 21:23:18 -08:00
b678256bfb Move glu to Aten(CPU) (#33179)
Summary:
This PR moves glu to ATen (CPU).
Test script:
```
import torch
import torch.nn.functional as F
import time

torch.manual_seed(0)

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"

#warm up
for n in [10, 100, 1000, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n // 2, device=device)
    for i in range(1000):
        output = F.glu(input)
        output.backward(grad_output)

for n in [10, 100, 1000, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n // 2, device=device)
    for i in range(10000):
        t1 = _time()
        output = F.glu(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test device: **skx-8180.**
Before:
```
input size(128, 10) forward time is 0.04 (ms); backward avg time is 0.08 (ms).
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.14 (ms).
input size(128, 1000) forward time is 0.11 (ms); backward avg time is 0.31 (ms).
input size(128, 10000) forward time is 1.52 (ms); backward avg time is 2.04 (ms).
```
After:
```
input size(128, 10) forward time is 0.02 (ms); backward avg time is 0.05 (ms).
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.09 (ms).
input size(128, 1000) forward time is 0.07 (ms); backward avg time is 0.17 (ms).
input size(128, 10000) forward time is 0.13 (ms); backward avg time is 1.03 (ms).
```
Fix https://github.com/pytorch/pytorch/issues/24707, https://github.com/pytorch/pytorch/issues/24708.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33179

Differential Revision: D19839835

Pulled By: VitalyFedyunin

fbshipit-source-id: e4d3438556a1068da2c4a7e573d6bbf8d2a6e2b9
2020-02-28 14:54:38 -08:00
dbe850af5b [jit] do the code reorg (#33851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33851

Rationale and context described in #33828.

Script to reproduce the move:
https://gist.github.com/suo/16cbefaaeb67ca5a7c6caffd49b7f6e9
ghstack-source-id: 99079645

Test Plan: Make sure CI passes

Reviewed By: jamesr66a

Differential Revision: D20133869

fbshipit-source-id: 390e9241a9c85366d9005c492ac31f10aa96488e
2020-02-27 13:02:51 -08:00
bf00b4d305 [TensorExpr] Add a boilerplate pass for future TensorExpr fusion pass. (#33464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33464

I added a python-exposed knob to register this pass in the custom passes pipeline. If the knob is not used, the pass is not registered and thus not run at all.

Differential Revision: D19958217

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: fecdd98567fcda069fbdf8995c796899a3dbfa5c
2020-02-24 18:47:31 -08:00
4d9b649261 jit pickling rref (#32959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32959

In the RPC TorchScript call path we need to pickle/unpickle RRefs; this diff makes the JIT pickler/unpickler able to pickle/unpickle an RRef. It is similar to what is implemented for PyRRef::pickle() and PyRRef::unpickle().
The pickling/unpickling design assumes it is always coupled with RPC calls. It is not meant for checkpointing a model with an RRef; before checkpointing the model, the user should call rref.to_here() to get the value inside the RRef.
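
A minimal sketch of the checkpointing guidance above, assuming an already-initialized two-worker RPC group (the worker names and the remote call are illustrative):

```python
import torch
import torch.distributed.rpc as rpc

# Assumes rpc.init_rpc("worker0", rank=0, world_size=2) has already been called
# and that "worker1" is a peer in the same group.
rref = rpc.remote("worker1", torch.add, args=(torch.ones(2, 2), 1))

# RRef pickling is coupled with RPC; for checkpointing, materialize the value
# locally first instead of trying to save the RRef itself.
value = rref.to_here()
torch.save(value, "checkpoint.pt")
```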

The pickling process is:
1. Push the torch.distributed.rpc.rref global string.
2. Call rref.fork() and create rrefForkData, which is a few IDs and the type str of the value held inside the rref; the IDs include the rref id, fork id, caller worker id, callee worker id, and owner worker id.
3. Push the rrefForkData.

The unpickling process is:
1. Read the torch.distributed.rpc.rref global string, and retrieve the cached global lambda function.
2. The global lambda function will get the rrefForkData.
3. If the callee is also the owner worker, then get the owner rref based on the IDs inside the rrefForkData and return the ownerRRef.
4. If the callee is not the owner worker, then create a user rref using the rrefForkData and return the userRRef.
5. Meanwhile the owner rref will be notified and will do reference counting correctly.

During unpickling, a type_resolver is needed to parse the type str. This type_resolver has a Python dependency, so we get it from the rpc_agent and pass it to the unpickler during construction. So we added a type_resolver argument to the jit unpickler constructor in this diff.
ghstack-source-id: 98814793

Test Plan: unit test

Differential Revision: D19713293

fbshipit-source-id: 4fd776cdd4ce8f457c4034d79acdfb4cd095c52e
2020-02-24 11:16:35 -08:00
bb5181b716 [TensorExpr] Add IR Printer. (#33220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33220

Test Plan: Imported from OSS

Differential Revision: D19848379

Pulled By: ZolotukhinM

fbshipit-source-id: 1c6ab4f63080d4506dedc3c47938de92fb4bfba2
2020-02-21 13:10:26 -08:00
fc70fc3610 [TensorExpr] Add IR visitor, IR mutator, and IR evaluator. (#33219)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33219

Test Plan: Imported from OSS

Differential Revision: D19848381

Pulled By: ZolotukhinM

fbshipit-source-id: 44ca7cd99c25e290a8ffd8146785c19f9c785dfd
2020-02-21 13:10:22 -08:00
49af9425a7 [TensorExpr] Add core classes for representing expressions and statements. (#33218)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33218

Test Plan: Imported from OSS

Differential Revision: D19848378

Pulled By: ZolotukhinM

fbshipit-source-id: 48399f8651324d5ad0607e08573d5d7b2026bb23
2020-02-21 13:10:17 -08:00
1a4f997178 [TensorExpr] Add a class for representing data type. (#33217)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33217

Test Plan: Imported from OSS

Differential Revision: D19848380

Pulled By: ZolotukhinM

fbshipit-source-id: d8683f8fc4555d2456cd2a7c827d8e8231915b49
2020-02-21 13:10:12 -08:00
089d658153 [TensorExpr] Add classes for memory management in tensor expressions. (#33216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33216

All tensor expressions belong to a kernel arena and are freed when the
arena is destroyed. Until it is destroyed, all expressions stay valid.

Test Plan: Imported from OSS

Differential Revision: D19848382

Pulled By: ZolotukhinM

fbshipit-source-id: a581ea2b635b9ba2cc53949616a13d8d3a47caae
2020-02-21 13:08:50 -08:00
806e7daa1f Rename TorchScript compiler to IR emitter to better reflect its function. (#33127)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33127

Test Plan: Imported from OSS

Differential Revision: D19806503

Pulled By: ZolotukhinM

fbshipit-source-id: ab78bdbbac5f12dbcc6c2e2573f5862a16ffcf3d
2020-02-12 18:45:13 -08:00
12bcfa7c77 Remove Python dependency (toPyTuple/fromPyTuple, jitCompilationUnit, deserialize) in rref_impl.h/cpp (#32753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32753

Functions to be bound as ATen operators cannot have a Python dependency.

This PR refactors the code to remove the Python dependency.
ghstack-source-id: 97485800

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_script_functions_not_supported

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_script_functions_not_supported
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call
```

Differential Revision: D5741675

fbshipit-source-id: 31ee60955be8d815d0773f3699e3ff2f1f9d8849
2020-01-30 17:52:48 -08:00
fb159b5236 Some work on eager op binding codegen (gen_python_functions.py) (#29986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29986

Previously in addition to generating a python binding for each op,
we would generate an almost-trivial helper for each overload.
This PR eliminates the helpers, simplifying codegen logic a bit and
reducing the source-level indirection by a step.
Perf should be unchanged.

codegen diff: 1f2f07fb60

Note: in the interests of keeping the diff contained, there's only
some light cleanup here beyond what's necessary for the codegen changes.
Plan is to do some more substantial refactoring in followup PRs that
leave generated code unchanged.

Test Plan: Imported from OSS

Differential Revision: D18567980

Pulled By: bhosmer

fbshipit-source-id: eb9a81babb4489abd470842757af45580d4c9906
2020-01-30 00:29:53 -08:00
25d33a2ee8 [JIT] Use Type Level Granularity in Alias Analysis Wildcards (#32251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32251

Previously wildcard sets were associated by TypeKind, meaning all Lists were in one alias set, all Classes were in one alias set, etc. We can improve analysis by bucketing wildcard sets by TypePtr instead. Any two mutable types which can unify should be in the same wildcard set bucket.

This also allows us to do much simpler `mayContainAlias` analysis, and also improves `analyzeConservative` analysis because now we can recurse through all contained memory locations and mark writes, instead of recursing only one level deep in contained elements.

Test Plan: Imported from OSS

Differential Revision: D19563263

Pulled By: eellison

fbshipit-source-id: 371a37d1a8596abc6c53f41c09840b6c140ea362
2020-01-28 18:07:48 -08:00
465ebd58ba [JIT] pickle serialization for custom bound classes
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32604

Test Plan: Imported from OSS

Differential Revision: D19566633

fbshipit-source-id: 9387d3ff45cbd6ccde49ce190a52859481cc301c
2020-01-28 11:02:59 -08:00
0ac31a99be run code analysis against mobile interpreter (#32276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32276

Include mobile interpreter in mobile code analysis pass, which has some
manually registered ops in temporary namespaces.

The mobile interpreter is still under development and these ops will be
removed in the future. This is a temporary step for internal build
experiment.

Test Plan: Imported from OSS

Differential Revision: D19426818

Pulled By: ljk53

fbshipit-source-id: 507453dc801e5f93208f1baea12400beccda9ca5
2020-01-17 17:21:28 -08:00
ab5eb65e74 gate torch_global_deps with BUILD_SHARED_LIBS flag (#32011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32011

Ran into a build problem with Ninja + the code analysis build as follows:
```
The install of the torch_global_deps target requires changing an RPATH from
the build tree, but this is not supported with the Ninja generator unless
on an ELF-based platform.
```

Seems we don't need to build the target in static build mode?

Verified that the code analyzer works with the patch.

Test Plan: Imported from OSS

Differential Revision: D19336818

Pulled By: ljk53

fbshipit-source-id: 37f45a9392c45ce92c1df40d739b23954e50a13a
2020-01-10 11:37:24 -08:00