Summary:
By default, TorchScript execution is single threaded and uses the caller's thread pool. For distributed inference we want a way to customize this behavior so that the TorchScript interpreter's work can be executed elsewhere. This diff allows passing an explicit taskLauncher to the TorchScript interpreter.
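A task launcher here is just a callable that receives the interpreter's pending work and decides where to run it. As a rough conceptual sketch in Python (not the actual C++ TaskLauncher API; names are made up), a custom launcher might hand work to a dedicated pool instead of the caller's thread:
```
from concurrent.futures import ThreadPoolExecutor

# Hypothetical illustration only: a pool chosen by the embedding application,
# e.g. the thread pool of a distributed-inference server.
pool = ThreadPoolExecutor(max_workers=4)

def custom_task_launcher(work):
    # Instead of executing `work` inline on the caller's thread, submit it to
    # the pool so the interpreter's continuations run where we choose.
    pool.submit(work)
```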
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46865
Test Plan:
Unit tests pass.
fbshipit-source-id: 1d7b003926c0d1f8facc53206efb960cff8897ac
Reviewed By: houseroad
Differential Revision: D24616102
Pulled By: garroud
fbshipit-source-id: 79202b62f92d0b0baf72e4bf7aa3f05e0da91d59
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45789
Making sure that more tests invoke a run with a Fusion Group.
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision: D24169535
Pulled By: eellison
fbshipit-source-id: 54d7af434772ba52144b12d15d32ae30460c0c3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44588
1) SOURCE_DUMP crashes when invoked on a backward graph since
`prim::GradOf` nodes can't be printed as sources (they don't have
schema).
2) Dumping the graph each time we execute an optimized plan produces a lot of
output in tests where we run the graph multiple times (e.g.
benchmarks). Emitting all of that at the lowest verbosity level seems
like overkill.
3) A duplicated log statement is removed.
Differential Revision: D23666812
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: b9a30e34fd39c85f3e13c3f1e3594e157e1c130f
Summary:
When backward ops execute via the autograd engine's evaluate_function(), fn.release_variables() is called to release the SavedVariables. For eager mode ops, this releases the saved inputs that were required for the backward grad function. With TorchScript, however, we get a DifferentiableGraph, and DifferentiableGraphBackward() doesn't implement release_variables(), so the SavedVariables stay alive longer than necessary. This change implements release_variables() for DifferentiableGraphBackward to release these SavedVariables early.
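A minimal Python sketch (hypothetical shapes and names) of where these SavedVariables come from: the multiply below has to save its inputs for the backward pass, and under TorchScript they are held by the DifferentiableGraph's backward node until backward finishes.
```
import torch

def f(x, w):
    # x * w must save x and w for the backward computation
    return (x * w).sum()

scripted_f = torch.jit.script(f)

x = torch.randn(1024, 1024, requires_grad=True)
w = torch.randn(1024, 1024, requires_grad=True)

loss = scripted_f(x, w)
# Without release_variables() on DifferentiableGraphBackward, x and w stay
# alive as SavedVariables longer than in eager mode; with it implemented they
# can be dropped early during backward, matching eager behavior.
loss.backward()
```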
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42994
Reviewed By: izdeby
Differential Revision: D23503172
Pulled By: albanD
fbshipit-source-id: d87127498cfa72883ae6bb31d0e6c7056c4c36d4
Summary:
This PR adds an API to package unoptimized/fallback blocks as function calls. It's mainly meant to be used by the TensorExpressionsFuser and SpecializeAutogradZero passes: both specialize the original graph but would also like to provide a fallback path in case the assumptions under which the graph was specialized do not hold for some inputs.
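As a rough conceptual sketch (not the actual pass API; names and the guard are made up), the resulting shape is a guarded fast path that calls back into the original, unspecialized code packaged as a function:
```
# Conceptual illustration only.
def fallback(x):
    # stands in for the unoptimized block packaged as a function call
    return x + x.sum()

def specialized(x):
    if x.dim() == 2:            # assumption the graph was specialized under
        return x + x.sum()      # optimized path (e.g. a fused kernel)
    return fallback(x)          # bail out via the fallback function call
```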
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43274
Reviewed By: malfet
Differential Revision: D23406961
Pulled By: Krovatkin
fbshipit-source-id: ef21fc9ad886953461b09418d02c75c58375490c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43630
No functional changes here - just refactoring specialize autograd zero into a class and standardizing its API to take a shared_ptr<Graph>.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358805
Pulled By: eellison
fbshipit-source-id: 42e19ef2e14df66b44592252497a47d03cb07a7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42688
Both the profiling executor and the legacy executor have the debug
logging now.
Ideally, if we had a pass manager, this could be done as a part of it,
but since we have none, I had to insert the debug statements manually.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D22981675
Pulled By: ZolotukhinM
fbshipit-source-id: 22b8789e860aa90d5802fc72a4113b22c6fc4da5
Summary:
This should be in its own file...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41137
Reviewed By: jamesr66a
Differential Revision: D22437922
Pulled By: eellison
fbshipit-source-id: 1b62dde1a4ebac673b5c60aea4f398f734d62501
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034
c10 takes a Stack* in boxed functions while JIT took Stack&.
c10 doesn't return anything while JIT returns an int which is always zero.
This changes JIT to follow the c10 behavior.
ghstack-source-id: 106834069
Test Plan: unit tests
Differential Revision: D20567950
fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
Summary:
* Disable the mode where PE can still run the old fuser.
* Clean up
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38591
Differential Revision: D21643664
Pulled By: Krovatkin
fbshipit-source-id: 6753ed6bdc544698a1340e59a624608ff3abf7f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38734
As far as I can tell, this pass only exists to canonicalize ops that are generated in the graph fuser, so its name is kind of a misnomer.
Test Plan: Imported from OSS
Differential Revision: D21673109
Pulled By: eellison
fbshipit-source-id: b7bedf34ccaf1fcd442bfb2bbb990e64915f51d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37970
This change makes the pass friendlier for users who try to invoke it
directly.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D21444832
Pulled By: ZolotukhinM
fbshipit-source-id: 8be4b5028b3bd84082874e16f38a70b245af5d19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35913
The pass itself is still disabled by default, but with this change we
don't need to register it as a custom pass anymore. It allows us to
control its behavior with env variables more easily.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D20827189
Pulled By: ZolotukhinM
fbshipit-source-id: e74d90b5e46422e7ab7bc40974a805220da50fbc
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide, but the implementation is ground up. The fusion pass itself is similar to the default CUDA fuser; however, it has undergone some refactoring and uses the new code generation infrastructure. For those interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_. One of the largest differences between our approach and that of TVM/Halide is the concept of "TensorView". A TensorView should, at a high level, be thought of similarly to how we think of working with Tensors in PyTorch: it is an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at in TVM; they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.
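The transformations above reshape how a tensor's iteration domain is traversed. A toy Python illustration (conceptual only, not the TensorView API; N and the factor 4 are arbitrary) of a "split": a single loop over N elements becomes a nested loop whose inner extent could later be mapped onto CUDA threads:
```
N = 16
a = list(range(N))
out = [0.0] * N

# Before the split this was: for i in range(N): out[i] = 2.0 * a[i]
for i_outer in range(N // 4):
    for i_inner in range(4):
        i = i_outer * 4 + i_inner
        out[i] = 2.0 * a[i]
```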
**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.
**Short term goals:**
Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout
**Mid-term goals:**
- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785
Reviewed By: ZolotukhinM
Differential Revision: D20650977
Pulled By: soumith
fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34109
This change adds glue to GraphExecutor to give the RPC server
access to the future-based Interpreter::runAsync() api.
Previously, if a server encountered a TorchScript continuation-based block
with fork/wait, it would simply block in the server thread until the handler
completed, since it used the synchronous Interpreter::run() API.
With the ivalue::Future returned by the Interpreter, we can run the
TorchScript code asynchronously from c++ simply by connecting its
callback to the server callback.
We add test cases to cover the new logic, both rpc_async and remote.
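A short Python sketch (hypothetical worker name; assumes rpc.init_rpc() has already been called) of the pattern this makes non-blocking on the server side: a TorchScript function using fork/wait, invoked remotely with rpc_async.
```
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def forked_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # fork/wait compiles to a continuation-based block in the interpreter
    fut = torch.jit.fork(torch.add, x, y)
    return torch.jit.wait(fut)

def caller():
    # With runAsync() wired into the RPC server, the server no longer parks a
    # thread while the forked work completes.
    ret_fut = rpc.rpc_async("worker1", forked_add,
                            args=(torch.ones(2), torch.ones(2)))
    return ret_fut.wait()
```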
ghstack-source-id: 101245438
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc/...
Differential Revision: D20194321
fbshipit-source-id: 16785ec5d9ed0b16cb1ffab0a9771a77de30fcb0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115
This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.
Testing:
Ran the script, CI.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D20568523
Pulled By: SplitInfinity
fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34588
I constructed the patch by deleting OperatorOptions and then rerouting
all queries for AliasAnalysisKind to FunctionSchema. Some of the
behavior is kind of bogus: we really shouldn't be mutating FunctionSchema
after the fact, but that won't get fixed until we actually switch to
true schema merging.
Reland of https://github.com/pytorch/pytorch/pull/34160
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D20387079
Pulled By: ezyang
fbshipit-source-id: d189f7a6ad8cd186b88b6fbfa3f189994eea14e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34160
I constructed the patch by deleting OperatorOptions and then rerouting
all queries for AliasAnalysisKind to FunctionSchema. Some of the
behavior is kind of bogus: we really shouldn't be mutating FunctionSchema
after the fact, but that won't get fixed until we actually switch to
true schema merging.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D20282846
Pulled By: ezyang
fbshipit-source-id: ba7bca6e8adc3365789639b88e54c4e881b1692e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33834
This changes how we report Tracebacks to make them more clear when
there are both serialized and non-serialized ranges. It now looks like:
```
Traceback (most recent call last):
File "foo.py", line 25, in <module>
s2(a, b)
File "/scratch/zdevito/pytorch/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__.py", line 7, in forward
x: Tensor,
y: Tensor) -> Tensor:
return (self).bar(x, y, )
~~~~~~~~~ <--- HERE
def bar(self: __torch__.Moo,
x: Tensor,
File "code/__torch__.py", line 11, in bar
x: Tensor,
y: Tensor) -> Tensor:
_0 = (self).baz(x, y, )
~~~~~~~~~ <--- HERE
_1 = torch.ones([3], dtype=None, layout=None, device=None, pin_memory=None)
return torch.add(_0, _1, alpha=1)
File "code/__torch__.py", line 17, in baz
x: Tensor,
y: Tensor) -> Tensor:
return torch.add(x, y, alpha=1)
~~~~~~~~~ <--- HERE
Traceback of TorchScript, original code (most recent call last):
File "foo.py", line 11, in forward
def forward(self, x, y):
return self.bar(x, y)
~~~~~~~~ <--- HERE
File "foo.py", line 9, in bar
def bar(self, x, y):
return self.baz(x, y) + torch.ones(3)
~~~~~~~~ <--- HERE
File "foo.py", line 7, in baz
def baz(self, x, y):
return x + y
~~~~~ <--- HERE
RuntimeError: The size of tensor a (4) must match the size of tensor b (5) at non-singleton dimension 1
```
It follows the Python convention of putting the most important information last,
so the trace reads from the bottom up.
Changes:
* Moved the error message to the end, to copy Python
* Report original traceback separate from serialized traceback
* Make sure root functions have names in the interpreter trace.
Test Plan: Imported from OSS
Differential Revision: D20126136
Pulled By: zdevito
fbshipit-source-id: fd01f9985e5d74e04c4d064c02e8bc320f4fac13
Summary:
This patch enables folding GetAttr nodes with their corresponding
values. The _jit_pass_freeze_module API returns a new TorchScript module
where all function calls and get attributes are inlined.
Usage:
frozen_model = torch._C._freeze_module(scripted_model._c)
frozen_model.forward(...)
This API currently optimizes the forward method. We will follow up to
preserve and optimize methods and attributes that are annotated as
torch.jit.interface. Several future improvements to JIT optimizations are
required to further clean up/de-sugar the graph and eliminate redundancies.
Ideally, we want to produce a graph that can easily be lowered to
GLOW and other low-level backends.
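Expanding the usage above into a self-contained sketch (module name and shapes are made up):
```
import torch

class Moo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(4, 4))

    def forward(self, x):
        return x @ self.weight

scripted_model = torch.jit.script(Moo())
scripted_model.eval()

# Freezing inlines method calls and folds GetAttr nodes (e.g. self.weight)
# into the graph, so forward no longer reads module attributes at runtime.
frozen_model = torch._C._freeze_module(scripted_model._c)
frozen_model.forward(torch.randn(2, 4))
```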
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32178
Differential Revision: D19419640
Pulled By: bzinodev
fbshipit-source-id: 52baffaba9bca2cd60a8e747baa68d57711ad42b