Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387
Added a customized NNC implementation for the signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.
Also added a static runtime (SR) microbenchmark for this kernel, which shows the performance improvement.
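For context, the fused op computes `sign(x) * log1p(|x|)` in one pass over the tensor. A minimal eager-mode reference of the semantics (illustrative only; the actual kernel is written in NNC):
```
#include <ATen/ATen.h>

// Reference semantics of signed log1p. The NNC kernel fuses these four
// pointwise ops (sign, abs, log1p, mul) into a single loop over the input.
at::Tensor signed_log1p_reference(const at::Tensor& input) {
  return at::sign(input) * at::log1p(at::abs(input));
}
```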
Without fusion:
```
--------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16              1953 ns         1953 ns       358746
BM_signed_log1p/64              2049 ns         2049 ns       342145
BM_signed_log1p/512             3291 ns         3291 ns       214342
BM_signed_log1p/4096           15559 ns        15559 ns        44420
BM_signed_log1p/32768         101936 ns       101935 ns         6843
BM_signed_log1p/65536         194792 ns       194789 ns         3615
```
With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16               369 ns          369 ns      1896179
BM_signed_log1p/64               497 ns          497 ns      1406995
BM_signed_log1p/512             1618 ns         1618 ns       430209
BM_signed_log1p/4096           11327 ns        11326 ns        61463
BM_signed_log1p/32768          84099 ns        84086 ns         8325
BM_signed_log1p/65536         166531 ns       166510 ns         4186
```
This shows a clear improvement from NNC fusion: roughly 15% faster at the largest sizes and up to ~5x faster at the smallest.
On the inline_cvr local model, there is a small improvement in the profiled time spent on these ops:
without fusion: `0.9%` (computed by adding the % spent on all 4 ops involved)
with NNC fusion: `0.55%`
Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`
Also ran the accuracy test with inline_cvr, as described here: https://fb.quip.com/qmdDAJzEmPtf, on the full-size model (285298536_1):
```
get 57220 prediction values
get 57220 prediction values
max_error: 0 total: 0
```
Reviewed By: hlu1
Differential Revision: D30609492
fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65118
Cloning the module can increase memory use. By freezing the module directly without cloning it first, we can avoid this memory usage increase.
Reviewed By: eellison, movefast1990
Differential Revision: D30955053
fbshipit-source-id: 2feb738eddcf66aa68c92bf695cc05b57bd990f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934
Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from:
```
%0, %1 = prim::TupleUnpack(%a)
%2, %3 = prim::TupleUnpack(%b)
```
into:
```
%0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b)
```
The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about.
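For illustration, the core rewrite for one contiguous run of `prim::TupleUnpack` nodes might look roughly like the sketch below (written against the JIT IR API; collecting the runs and the surrounding checks are omitted, and the helper name is made up):
```
#include <torch/csrc/jit/ir/ir.h>

#include <vector>

// Replace one contiguous run of prim::TupleUnpack nodes with a single
// static_runtime::VarTupleUnpack node that takes all tuples as inputs and
// produces all unpacked values as outputs. Illustrative sketch only.
void fuseRun(torch::jit::Graph& graph, const std::vector<torch::jit::Node*>& run) {
  auto* fused = graph.create(
      c10::Symbol::fromQualString("static_runtime::VarTupleUnpack"),
      /*num_outputs=*/0);
  fused->insertBefore(run.front());
  for (auto* unpack : run) {
    fused->addInput(unpack->input());
    for (auto* out : unpack->outputs()) {
      auto* new_out = fused->addOutput()->copyMetadata(out);
      out->replaceAllUsesWith(new_out);
    }
  }
  // All uses have been redirected, so the original nodes are dead.
  for (auto* unpack : run) {
    unpack->destroy();
  }
}
```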
Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack`
Reviewed By: d1jang
Differential Revision: D30872109
fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123
This change re-reverts D30883290 (0e11454d19), which broke the OSS build because it implicitly removed the default move constructor of `StaticRuntime`.
```
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57 return torch::jit::StaticRuntime(*smod);
Sep 15 15:39:57 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57 auto sr = getStaticRuntime();
Sep 15 15:39:57 ^ ~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57 ^
Sep 15 15:39:57 2 errors generated.
```
This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine).
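A minimal sketch of the shape of the fix (simplified; the exact member that suppressed the implicit move constructor, e.g. a user-declared destructor, is an assumption here):
```
#include <memory>

class MemoryPlanner {};  // stand-in for the real class

class StaticRuntime {
 public:
  StaticRuntime() = default;
  // A user-declared destructor suppresses the implicitly-generated move
  // constructor, and the unique_ptr member deletes the copy constructor,
  // so returning StaticRuntime by value stops compiling.
  ~StaticRuntime() = default;
  // The fix: bring the move constructor back explicitly. The type stays
  // non-copyable because std::unique_ptr is move-only.
  StaticRuntime(StaticRuntime&&) noexcept = default;

 private:
  std::unique_ptr<MemoryPlanner> planner_;
};
```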
Original Summary:
This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.
`MemoryPlanner` performs an independent sub-task: it statically analyzes the graph, creates the memory plan, and allocates/deallocates the managed Tensors.
This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.
Test Plan: - Confirm that OSS build went well (See External Tests section).
Reviewed By: mikeiovine
Differential Revision: D30983292
fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013
This change enhances the current memory overlap check to include outputs: it enforces the constraint that the outputs of a node must NOT overlap with each other, since the node writes all of its outputs at the same time.
This check will detect a problem like T97393697 immediately in debug mode.
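A rough sketch of the kind of pairwise check this adds (hypothetical helper; the real check hooks into `ProcessedNode` and may be more precise than a storage-level comparison):
```
#include <ATen/ATen.h>

#include <vector>

// Debug-only verification: no two outputs of a node may alias each other.
// Conservatively treats tensors that share a storage as overlapping.
bool outputs_do_not_overlap(const std::vector<at::Tensor>& outputs) {
  for (size_t i = 0; i < outputs.size(); ++i) {
    for (size_t j = i + 1; j < outputs.size(); ++j) {
      if (outputs[i].is_alias_of(outputs[j])) {
        return false;
      }
    }
  }
  return true;
}
```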
Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`
- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.
Reviewed By: hlu1
Differential Revision: D30211705
fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64209
Add a new fusion pass that transforms the following pattern:
```
graph(%input):
%0 : Tensor = aten::sign(%input)
%1 : Tensor = aten::abs(%input)
%2 : Tensor = aten::log1p(%1)
%res : Tensor = aten::mul(%0, %2)
return (%res)
```
Into a single op:
```
graph(%input):
%res : Tensor = static_runtime::signed_log1p(%input)
return (%res)
```
The intent is to reduce the number of passes over the tensor. However, enabling this pass actually causes a performance regression, probably due to a lack of vectorization in the fused implementation. Because of this issue, this diff **does not** enable this pass.
Followup: navahgar will add an NNC kernel which is faster than the unfused version and enable this pass. We still need this version as a fallback since the NNC kernel will not support all dtypes.
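One way such a pattern replacement can be expressed is with the JIT's `SubgraphRewriter`; a hedged sketch (the actual pass may be implemented differently, and it also has to guard against cases where the fusion is not valid):
```
#include <torch/csrc/jit/passes/subgraph_rewrite.h>

#include <memory>
#include <string>

void fuseSignedLog1p(std::shared_ptr<torch::jit::Graph>& graph) {
  // Pattern: sign(x) * log1p(abs(x)), spread over four nodes.
  const std::string pattern = R"IR(
    graph(%input):
        %0 : Tensor = aten::sign(%input)
        %1 : Tensor = aten::abs(%input)
        %2 : Tensor = aten::log1p(%1)
        %res : Tensor = aten::mul(%0, %2)
        return (%res))IR";

  // Replacement: a single fused op.
  const std::string fused = R"IR(
    graph(%input):
        %res : Tensor = static_runtime::signed_log1p(%input)
        return (%res))IR";

  torch::jit::SubgraphRewriter rewriter;
  rewriter.RegisterRewritePattern(pattern, fused);
  rewriter.runOnGraph(graph);
}
```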
Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`
Test passed with new graph pass disabled and enabled.
Reviewed By: hlu1
Differential Revision: D30559929
fbshipit-source-id: e4e080cb2e6a705cfdde1fc98bee92b723f8132a
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64159
Test Plan:
Confirm out variant is called for both versions:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```
Reviewed By: mikeiovine
Differential Revision: D30622819
fbshipit-source-id: a2c8c7f969dae5f507718fb3d513e1fb4f026736
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64157
The UseVariadicCat optimization is not applied to `aten::cat` if the list input to the op cannot be moved to a position before the op (https://fburl.com/diffusion/l6kweimu). For these cases we will need an out variant for SR.
Test Plan:
Confirm out variant is called:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```
Reviewed By: d1jang
Differential Revision: D30598574
fbshipit-source-id: 74cfa8291dc8b5df4aef58adfb1ab2a16f10d90a
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64070
Test Plan:
Confirm out variant is called for both versions:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```
Reviewed By: d1jang
Differential Revision: D30595816
fbshipit-source-id: e88d88d4fc698774e83a98efce66b8fa4e281563
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078
This change converts `aten::layer_norm -> output Tensor` into `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the `tmp1` and `tmp2` Tensors are managed by the static runtime.
Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
at::Tensor mean = create_empty_from({M}, *X);
at::Tensor rstd = create_empty_from({M}, *X);
```
that the static runtime misses an opportunity to manage.
This change turns them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime currently chooses to manage only output tensors.
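Conceptually, the placeholder op is just layer_norm with its temporaries surfaced as extra outputs. A hedged sketch, assuming the `at::native_layer_norm` overload that already returns `(output, mean, rstd)`:
```
#include <ATen/ATen.h>

#include <tuple>

// Illustrative only: same math as aten::layer_norm, but the mean/rstd
// temporaries become (otherwise unused) outputs, so the static runtime's
// memory planner (which only manages op outputs) can reuse their buffers.
std::tuple<at::Tensor, at::Tensor, at::Tensor> layer_norm_with_temps(
    const at::Tensor& input,
    at::IntArrayRef normalized_shape,
    const c10::optional<at::Tensor>& weight,
    const c10::optional<at::Tensor>& bias,
    double eps) {
  return at::native_layer_norm(input, normalized_shape, weight, bias, eps);
}
```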
Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.
- Confirmed that the new op gets activated during testing:
```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)
```
Reviewed By: hlu1
Differential Revision: D30486475
fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64024
`aten::expand_as` creates a view of the input tensor. This change adds its native op implementation for the static runtime.
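For reference, `expand_as` is view-only; it is roughly equivalent to this sketch:
```
#include <ATen/ATen.h>

// expand_as returns a view: no data is copied, only sizes/strides change.
at::Tensor expand_as_view(const at::Tensor& self, const at::Tensor& other) {
  return self.expand(other.sizes());
}
```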
Test Plan: - Added `StaticRuntime.IndividualOps_ExpandAs`
Reviewed By: hlu1
Differential Revision: D30546851
fbshipit-source-id: e53483048af890bc41b6192a1ab0c5ba0ee2bdc0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579
Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`
Reviewed By: navahgar
Differential Revision: D30410525
fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63398
This change provides a native `__getitem__` implementation for lists to avoid overhead associated with falling back to the JIT interpreter.
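A minimal sketch of the idea (hypothetical helper; the real implementation is registered as a static runtime native op and works on `IValue`s):
```
#include <ATen/core/ivalue.h>
#include <c10/util/Exception.h>

// Index a generic list directly instead of round-tripping through the JIT
// interpreter. Supports Python-style negative indices.
c10::IValue list_getitem(const c10::List<c10::IValue>& list, int64_t idx) {
  const auto size = static_cast<int64_t>(list.size());
  const int64_t i = idx < 0 ? idx + size : idx;
  TORCH_CHECK(i >= 0 && i < size, "list index out of range");
  return list.get(i);
}
```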
Test Plan: Unit tests: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D30368464
fbshipit-source-id: e0e0971508cd5d9bcf6025606993dc24ecbf6764
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63350
Add a native implementation for `aten::append`, the list append op.
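Semantically the op is a one-liner; a sketch (hypothetical helper; note that `c10::List` has reference semantics, so the push is visible through every alias of the list, which is what `aten::append`'s in-place behavior relies on):
```
#include <ATen/core/ivalue.h>

#include <utility>

// aten::append mutates the list in place and returns the same list.
c10::List<c10::IValue> list_append(c10::List<c10::IValue> list, c10::IValue element) {
  list.push_back(std::move(element));
  return list;
}
```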
Test Plan: New unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Append`
Reviewed By: hlu1
Differential Revision: D30326461
fbshipit-source-id: 0dbdf6cc82e78c7c36db39583256f6b87385e3d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62347
This diff includes tests for all `aten` ops that did not already have test coverage.
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D29968280
fbshipit-source-id: 768655ca535f9e37422711673168dce193de45d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62335
This change ensures that unittests only use out variants or native ops.
- Our unittests currently assume that a graph fed to the static runtime correctly replaces interpreter ops with their corresponding out variants / native ops, but this was not actually checked by the unittests. This change adds that check.
- We relied on manual inspection of log messages to see whether an out variant was used for a specific workload, even when unittesting. This change frees us from doing that.
- `aten::add` is excluded from this check since it's only enabled for an internal workload. Also some unittests are excluded by using `expect_interpreter_op = true` since they are written to use interpreter ops by design.
Test Plan: Ran `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest` successfully.
Reviewed By: mikeiovine, hlu1
Differential Revision: D29952381
fbshipit-source-id: e60e70b80ccf45e91c6654b4ad53f92ffd5ab702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62622
This allows us to catch cases where an out variant is being tested but the test author forgot to call `.clone()` in the test script. Having more than 2 ops does not guarantee that the memory planner is being exercised, but having fewer than 2 guarantees that it is not.
Reviewed By: hlu1
Differential Revision: D30058050
fbshipit-source-id: 5bc053736f1cc6fd1ffcf8254bf38874ac18c34b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62064
`testStaticRuntime` was previously only available in `test_static_runtime.cc`. It has been moved to a common library `test_utils` to facilitate code re-use. This also lets us test dynamic shapes in `test_fb_operators`
Reviewed By: hlu1
Differential Revision: D29858928
fbshipit-source-id: 68a94760166ddb745972b0f1fc24bed594937d1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62067
The wrapper for aten::cat is no longer needed after the variadic cat change in D29565344 (ae58a4c45d).
Also added a simple test for dynamic shapes, i.e., the input tensors in args2 are larger than those in args1.
Reviewed By: navahgar, mikeiovine
Differential Revision: D29864600
fbshipit-source-id: 44a712c2e776815c09e0bf5631412149b81274b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62098
The build was broken by D29821533 (1d2ea76afb). The `clamp` overloads used in `deep_wide.h` are no longer available in the `at::native` namespace.
Use `at::cpu::clamp` and the `clip_out` overload (which should be an alias for clamp) instead.
Reviewed By: hlu1
Differential Revision: D29880187
fbshipit-source-id: 210b6d2be8a8142e7af1a0ba07e55a95b1a77d25
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61783
Implement two new prim operators for static runtime: `isinstance` and `TypeCheck`. `isinstance` is very straightforward, but there were a few wrinkles with implementing `TypeCheck`:
1. There is no way to directly generate `TypeCheck` nodes from TorchScript; they are generated by the JIT at runtime. This makes testing a little difficult. I had to make some modifications to `testStaticRuntime` to allow for both IR and TorchScript tests.
2. The behavior of `prim::TypeCheck` as implemented here does not match up 1:1 with the version implemented in the interpreter! This is because grad mode is disabled in static runtime. Here's an example.
IR is the same as the one included in this test, but with `requires_grad == 1`
```
graph(%a.1 : Tensor,
%b.1 : Tensor):
%t0 : Float(2, 2, strides=[2, 1], device=cpu, requires_grad=1), %t1 : Float(3, 3, strides=[3, 1]), %type_matched : bool = prim::TypeCheck[types=[Float(2, 2, strides=[2, 1], device=cpu, requires_grad=1), Float(3, 3, strides=[3, 1])]](%a.1, %b.1)
return (%t0, %t1, %type_matched)
```
And in the test setup:
```
auto a = at::zeros({2, 2}, at::kFloat);
a.to(at::kCPU);
a.set_requires_grad(true);
auto b = at::ones({3, 3}, at::kFloat);
std::vector<IValue> args_correct = {a, b};
// prim::TypeCheck should be true with args_correct,
// but we get false when using static runtime!
```
Reviewed By: hlu1
Differential Revision: D29743862
fbshipit-source-id: db1788f0f5de42bab42602e8cc24eee04cbcc280
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61595
Add out variant wrapper for `aten::linear` in the static runtime
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D29684236
fbshipit-source-id: 94df6d7267b3f269b2cadf065f207648777147df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61566
This change uses `at::allclose` to compare results from the sigmoid implementations (ATen CPU vs. NNC) instead of `Tensor::equals`, due to small numerical differences between them.
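For reference, the comparison becomes tolerance-based rather than exact; a sketch (the tolerances shown are just `at::allclose`'s defaults, not necessarily what the test uses):
```
#include <ATen/ATen.h>

// The ATen CPU and NNC sigmoid kernels can differ by a few ULPs, so compare
// with tolerances instead of exact equality.
bool sigmoid_outputs_match(const at::Tensor& ref, const at::Tensor& nnc) {
  return at::allclose(ref, nnc, /*rtol=*/1e-5, /*atol=*/1e-8);
}
```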
Test Plan:
I confirmed that the flakiness of `StaticRuntime.Sigmoid` is gone with this change:
```
[djang@devvm1999.ftw0 ~/fbsource/fbcode] buck-out/gen/caffe2/benchmarks/static_runtime/static_runtime_cpptest -v 3 --gtest_filter=StaticRuntime.Sigmoid --gtest_repeat=100 &> output.txt
[djang@devvm1999.ftw0 ~/fbsource/fbcode] grep PASSED output.txt | wc
100 500 2100
```
Reviewed By: bertmaher
Differential Revision: D29671203
fbshipit-source-id: 99a7b16d18ea047c9aad444f36d8368f9d0b088d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61301
This change adds a `DCHECK` to ensure that outputs do not overlap with immutable inputs.
Test Plan:
Added unittests as follows:
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithImmutableArguments`
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithMutableArguments`
Reviewed By: hlu1
Differential Revision: D29564158
fbshipit-source-id: bf14b4978ab544af79010cf724ed28202b4521cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61000
Add unit tests for the bmm and addmm operators in static runtime.
Test Plan:
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
{F628935117}
Reviewed By: hlu1
Differential Revision: D29459679
fbshipit-source-id: 5c7fa5c9b0675c1c84f3ae3110204d663255009c
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60669
Test Plan: Added unit test to check for nested outputs.
Reviewed By: ajyu
Differential Revision: D29322025
fbshipit-source-id: a3c8d3c5f0bb7cf7fda4bc5f579adb8fa7bc3724
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60503
Fixed a few issues in the static_runtime::to_copy impl:
- fixed a bug with memory_format
- copy strides when appropriate. This is necessary to make sure that the fbgemm path in the copy kernel gets hit.
- fix the schema in the `ReplaceWithCopy` pass
- add registration of `static_runtime::to_copy.other`
Add more unit tests:
- test dynamic shapes
- test strided input tensor to `aten::to`
- test alias case (same input/output)
- test `to.other`
Reviewed By: ajyu
Differential Revision: D26838933
fbshipit-source-id: ec0d1a2deebe998fcfe8858e772e1ef429cb4522
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60579
- Modify testStaticRuntime to take two sets of inputs so that, if the second set of inputs has bigger shapes, it triggers memory allocations in resize_ calls.
- Modify test scripts so that the output of the test op is managed by the memory planner, as explained in comments.
Reviewed By: ajyu
Differential Revision: D29221452
fbshipit-source-id: 09f0f7eb384dc8ca67594f1fa76e1e31392ee6ca