Commit Graph

64 Commits

09157c76c0 [Static Runtime] Add schema checks for aten::list (#83753)
Summary:
The previous implementation assumed that there was only one overload and unconditionally tried to convert its input into a string, which caused crashes for some users. Added handling for the list overload and schema checks.
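
For illustration, a hedged TorchScript sketch (hypothetical function names) of the two overloads involved — the string overload that splits into characters and the list overload that copies its input:

```
from typing import List

import torch

@torch.jit.script
def string_to_chars(s: str) -> List[str]:
    # aten::list(str) -> List[str]: splits a string into its characters
    return list(s)

@torch.jit.script
def copy_list(xs: List[int]) -> List[int]:
    # aten::list.t(List[t]) -> List[t]: returns a shallow copy of the input list
    return list(xs)
```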

Also, I managed to uncover another bug when writing tests for this case (yikes). Returning inputs didn't work because the input cleanup process would destroy the output. Extended `CreateOwnedRefsForSpecialIValues` to fix that.

Test Plan: CI + new unit tests

Differential Revision: D38870803

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83753
Approved by: https://github.com/tenpercent, https://github.com/albanD
2022-08-22 13:42:47 +00:00
4f34cd6d1e Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032)
Avoid exposing defines that conflict with google logging, since this blocks external usage of libtorch in certain cases.

All the 'interesting' changes should be in these two files, and the rest should just be mechanical changes via sed.
c10/util/logging_is_not_google_glog.h
c10/util/logging_is_google_glog.h

Fixes https://github.com/pytorch/pytorch/issues/81415

cc @miladm @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82032
Approved by: https://github.com/soumith, https://github.com/miladm
2022-07-26 01:20:44 +00:00
ced2f2965c [Static Runtime] support forked subgraph execution on parent graph's executor (#80381)
Summary:
- Support async execution of forked nodes on a custom executor
- Previously, forked subgraph execution was performed on the inter-op thread pool executor by default
- Handle async execution of the forked graph on a custom executor when the parent graph is executed with the runAsync() API, passing the executor for async ops

Differential Revision: D37466525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80381
Approved by: https://github.com/mikeiovine
2022-06-29 23:31:07 +00:00
7f8e852dff [Static Runtime] Support Futures in Static Runtime Engine (#80162)
Summary: - Static Runtime now exports a runAsync() API which returns an intrusive_ptr to the c10::Future type

Differential Revision: D37385849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80162
Approved by: https://github.com/mikeiovine
2022-06-28 23:57:26 +00:00
3afc802c5a [Static Runtime] Add Metadata to ProcessedNode depending upon the op type (#79961)
Summary:
- The ProcessedNodeMetadata class wraps the possible metadata for ProcessedNode. Depending upon the nature of the op, a ProcessedNode can have one of the following kinds of metadata:
1. prim::If/prim::Loop ops contain block_runners_ as their metadata
2. prim::fork ops contain a TaskLauncher (std::function) responsible for execution of the forked subgraph

Differential Revision: D37320704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79961
Approved by: https://github.com/mikeiovine
2022-06-24 06:03:06 +00:00
0545c85f74 [static runtime] Add JIT prim ops: aten::cpu, aten::list, aten::numel, aten::__range_length (#79111)
Summary: This adds the missing JIT prim ops that appear in the non-ads models for the c2->pt mitigation: aten::cpu, aten::list, aten::numel, aten::__range_length
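
As a hedged illustration, TorchScript along these lines should lower to the ops listed above (function name and loop bounds are made up):

```
import torch

@torch.jit.script
def uses_prim_ops(x: torch.Tensor, s: str) -> int:
    y = x.cpu()              # aten::cpu
    chars = list(s)          # aten::list
    n = y.numel()            # aten::numel
    total = 0
    for i in range(0, n, 2): # trip count should be computed via aten::__range_length
        total += i
    return total + len(chars)
```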

Test Plan: static runtime unit tests

Differential Revision: D36984960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79111
Approved by: https://github.com/davidberard98
2022-06-18 16:38:58 +00:00
ca7ab1708b [Static runtime] Pass parent graph metadata to forked subgraphs (#79578)
Summary:
- Remove creation of new StaticModuleOptions for the forked subgraph. Use parent graph's options for creation of runtime for forked subtraph
- StaticRuntimeMetdata extends CustomClassHolder which can be casted to IValue and attached to IR node's attributes.

Differential Revision: D37159684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79578
Approved by: https://github.com/mikeiovine
2022-06-16 20:35:46 +00:00
8d7fcfa8f1 [static runtime] Add native ops: aten::index_put, aten::item, aten::tensor_split (#79065)
Summary: This adds the PyTorch operators that are currently missing in non-ads models from the c2->pt mitigation: aten::index_put, aten::item, aten::tensor_split
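
A hedged TorchScript sketch of surface syntax that exercises these ops (names are illustrative):

```
import torch

@torch.jit.script
def uses_native_ops(x: torch.Tensor, idx: torch.Tensor, v: torch.Tensor) -> float:
    x[idx] = v                         # advanced-index assignment (aten::index_put_)
    chunks = torch.tensor_split(x, 3)  # aten::tensor_split
    s = chunks[0].sum().item()         # aten::item (single-element tensor -> number)
    return float(s)
```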

Test Plan: buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest

Differential Revision: D36984961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79065
Approved by: https://github.com/davidberard98
2022-06-15 19:15:34 +00:00
20675977bc [Static Runtime] Performance optimization for fork operation (#79482)
Summary:
- StaticModule was being created at runtime, which added overhead to the fork operation
- Move StaticModule creation outside of runtime so that the StaticRuntime instance can be created on top of the same StaticModule, which is created only once

Differential Revision: D37126923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79482
Approved by: https://github.com/tenpercent
2022-06-14 22:31:15 +00:00
65a37923f9 [Static Runtime] Exception handling during fork subgraph execution (#79292)
Summary:
- Exception handling was not performed during forked subgraph execution
- The forked subgraph runtime can throw a runtime exception. The future returned by prim::fork needs to capture exceptions so that aten::wait can propagate them (see the sketch below).
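
A hedged sketch of the behavior being tested — an exception raised inside the forked subgraph should be captured by the future and re-raised by aten::wait (function names are illustrative):

```
import torch

@torch.jit.script
def may_fail(x: torch.Tensor) -> torch.Tensor:
    if x.numel() == 0:
        raise Exception("empty input")
    return x * 2

@torch.jit.script
def fork_and_wait(x: torch.Tensor) -> torch.Tensor:
    fut = torch.jit.fork(may_fail, x)  # prim::fork launches the subgraph
    return torch.jit.wait(fut)         # aten::wait should re-raise any captured exception
```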

Test Plan:
local test cases:
  - buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
  - buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
  - buck test mode/opt caffe2/test:static_runtime

Async execution of the subgraph is tested by adding PyTorch profiler hooks on the StaticRuntime execution via the code below. Async execution in the threadpool is verified by checking the trace:
  with profile(activities=[ProfilerActivity.CPU]) as prof:
      static_runtime_module(inputs)
  prof.export_chrome_trace("trace.json")

Differential Revision: D37072493

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79292
Approved by: https://github.com/mikeiovine
2022-06-11 03:11:49 +00:00
26f2376b78 [Static Runtime] support fork and wait operations on Static Runtime (#79211)
Summary:
- Initial support for fork was done on the JIT interpreter. This patch enables async execution on Static Runtime

- For each forked node, a separate runtime is created for the execution of the subgraph. Async execution is handled by the aten::ParallelThreadPoolNative threadpool

- aten::wait waits on the future of the fork to be completed

Test Plan:
local test cases:
    - buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
    - buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
    - buck test mode/opt caffe2/test:static_runtime

Async execution of the subgraph is tested by adding PyTorch profiler hooks on the StaticRuntime execution via the code below. Async execution in the threadpool is verified by checking the trace:

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        static_runtime_module(inputs)
    prof.export_chrome_trace("trace.json")

Differential Revision: D37044513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79211
Approved by: https://github.com/mikeiovine
2022-06-10 19:11:03 +00:00
bf80f6c7b0 [Static Runtime] prim::fork asynchronous execution on JIT interpreter (#78858)
Summary:
prim::fork was executed synchronously in the main thread
- Added changes that execute the prim::fork calls asynchronously on one of the threads from TaskThreadPoolBase defined in ATen
- Changes are tested via PyTorch profiler tracing. Fork calls are executed on different threads

Test Plan:
Local test scripts executed:
- buck run mode/opt caffe2/test:static_runtime
- buck run caffe2/benchmarks/static_runtime/fb:test_fb_operators
- buck run caffe2/benchmarks/static_runtime:static_runtime_cpptest

Executed the PyTorch profiler to see the spawned fork operations executing in parallel on different threads

Differential Revision: D36909308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78858
Approved by: https://github.com/mikeiovine
2022-06-06 22:05:19 +00:00
720cb5023a [Static Runtime] Implement prim::Fork and aten::wait (#78780)
Summary:
Basic implementation of prim::fork and aten::wait

- The current implementation uses the JIT interpreter to call the forked subgraph
- The interpreter call is to be replaced in the future
- Added custom test cases for fork/wait procedures in the graph (see the sketch below)
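
A minimal illustrative TorchScript example of the fork/wait pattern exercised here, assuming the standard torch.jit.fork/torch.jit.wait API (function names are made up):

```
import torch

@torch.jit.script
def child(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.mm(a, b)

@torch.jit.script
def parent(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    fut = torch.jit.fork(child, a, b)   # prim::fork: run the subgraph asynchronously
    other = a + b                       # work that can overlap with the forked subgraph
    return torch.jit.wait(fut) + other  # aten::wait: block on the future's result
```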

Test Plan:
Custom tests are added in test_static_runtime.py to verify Static Runtime output against the reference PyTorch output.

Test commands:
- buck run caffe2/test:static_runtime
- buck run caffe2/benchmarks/static_runtime:static_runtime_cpptest
- buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators

Differential Revision: D36881214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78780
Approved by: https://github.com/tenpercent
2022-06-03 23:39:04 +00:00
6ba29d715e [BE] Fix deprecated usages of isIntegral
By passing the `includeBool=` parameter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75524
Approved by: https://github.com/seemethere, https://github.com/janeyx99
2022-04-08 20:43:41 +00:00
11c412a8ec [static-runtime] optimize empty if blocks at runtime (#74987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74987

Add specializations to the `prim::If` operator at runtime to save resources when some of the sub-blocks are empty
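
A hedged sketch of the shape being optimized — a prim::If whose else sub-block is empty (function name is illustrative):

```
import torch

@torch.jit.script
def maybe_relu(x: torch.Tensor, flag: bool) -> torch.Tensor:
    if flag:          # prim::If with an empty else sub-block
        x = x.relu()
    return x
```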

Test Plan:
`buck build //caffe2:torch-cpp-cpu`
`buck test //caffe2/benchmarks/static_runtime/...`
Add unit test:
`buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.EmptyIfBlock`

Reviewed By: mikeiovine

Differential Revision: D35262952

fbshipit-source-id: 324f88471f33f035f4d8a9b212716530d8e59df2
(cherry picked from commit 2db1b1a6833b1376fa376f54791effc8e12fb77f)
2022-04-01 05:43:33 +00:00
3f37337ed0 [SR] Native implementation for reshape_as (#74585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74585

Native static runtime for `aten::reshape_as`
ghstack-source-id: 152340038

Test Plan: New unit test

Reviewed By: hlu1

Differential Revision: D35060895

fbshipit-source-id: c4e6f8a04c7df3821c7e654bfaf584e5a72ea701
(cherry picked from commit 6fa596cd866a024b6653239e0e30ddad42de242f)
2022-03-28 17:02:14 +00:00
9f2344aa40 [SR] Native implementation for select (#74568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74568

Native static runtime implementation for `aten::select(Tensor, int, int)` overload
ghstack-source-id: 152340037

Test Plan: New unit test

Reviewed By: hlu1

Differential Revision: D35053900

fbshipit-source-id: c315d4202a4dfca3360325547af805aea33ecc9f
(cherry picked from commit 8683f214dbd8c081365bad727007bbff969b64d0)
2022-03-28 17:02:14 +00:00
facdbe6d72 [SR] Native implementation for IntImplicit (#74562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74562

Add a native implementation for `aten::IntImplicit`, which is similar to `aten::Int` except for a few extra checks it must do
ghstack-source-id: 152340039

Test Plan: New unit tests

Reviewed By: hlu1

Differential Revision: D35052997

fbshipit-source-id: cb2f0faf7c62382e3f13750d8e1280c49c6b9e42
(cherry picked from commit 359c7493f8deaeccebc27e1b6e6e9777850010c1)
2022-03-28 17:02:14 +00:00
93be0e2053 [SR] Avoid boxing inputs in DictConstruct/ListUnpack (#74250)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74250

The `DictConstruct`/`ListUnpack` implementations currently put all of their inputs onto a stack before calling the JIT implementation in `vararg_functions.cpp`. This was done to avoid code duplication, but it's quite wasteful since it causes extra heap allocations and, potentially, refcount bumps.

Given that these two ops are quite common and the code duplication is only a few lines, it seems reasonable to avoid this cost.
ghstack-source-id: 151897634
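
A hedged TorchScript sketch of the two ops in question — constructing a dict and unpacking a list (function name is illustrative):

```
from typing import Dict, List

import torch

@torch.jit.script
def build_and_unpack(xs: List[torch.Tensor]) -> Dict[str, torch.Tensor]:
    a, b = xs                          # prim::ListUnpack (length checked at runtime)
    return {"first": a, "second": b}   # prim::DictConstruct
```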

Test Plan: Existing unit tests

Reviewed By: navahgar

Differential Revision: D34901245

fbshipit-source-id: ece0618a6134a35720f214e79c64f12045f074d0
(cherry picked from commit 1f8e223c1887ed205c84a7ac4587813f94b11bad)
2022-03-22 23:05:58 +00:00
9e8bda0e93 [Static Runtime] Use IValue::toListRef for aten::len to address comment on D34705231 (#74192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74192

This change uses toListRef() to avoid creating a new list object for `aten::len`, addressing hlu1's comment on D34705231 (87564a1bd7)

Test Plan: Existing tests, StaticRuntime.Len*

Reviewed By: mikeiovine

Differential Revision: D34863266

fbshipit-source-id: 65daf36944a64dfd7afde1103aab5aee1681ac87
(cherry picked from commit 3a0f3798f2fcc203f6cb01e59b91e195ecabe1bc)
2022-03-14 22:31:10 +00:00
87564a1bd7 [Static Runtime] Add native op support for aten::len (#73899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73899

This change adds native op wrappers to Static Runtime as they appear in JIT (https://www.internalfb.com/code/fbsource/[429d233b9beb5e6f60df7304b792e2ff332f6ecd]/fbcode/caffe2/torch/csrc/jit/runtime/register_prim_ops.cpp?lines=613 , search for "aten::len" in that file).
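
An illustrative TorchScript snippet that exercises the wrapped op (function name is made up):

```
from typing import Dict, List

import torch

@torch.jit.script
def total_len(xs: List[int], d: Dict[str, int]) -> int:
    # len() on lists and dicts should lower to aten::len overloads
    return len(xs) + len(d)
```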

Test Plan: Added unittests, "StaticRuntime.LenWith*", and confirmed they are passing with `V0307 17:39:39.817956 3516654 impl.cpp:1792] Switch to native impl for node: %2 : int = aten::len(%input.1)` per added unittest: P485159811

Reviewed By: mikeiovine

Differential Revision: D34705231

fbshipit-source-id: 916b1f8bdbc92def07bc3f98ce1db22f0f5ce206
(cherry picked from commit 66d2bb9a0a294b55e1bc87ae33f5553b1460e74b)
2022-03-10 02:57:51 +00:00
c62de0ac15 [Static Runtime] [Code Cleanup] Use SROperator for operators' function type (#73450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73450

This change uses `SROperator` for operators' function type

Test Plan: N/A

Reviewed By: mikeiovine

Differential Revision: D34483246

fbshipit-source-id: ed544bb91b676ed08983dc8dc78cedd0f77d499f
(cherry picked from commit eb9de3ad8de043990c02f30ffa48a29c8e5e81f2)
2022-03-01 02:30:48 +00:00
d1c5f9e439 [JIT][SR] Introduce prim::IfThenElse (#72587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587

This pattern frequently appears in a few graphs:

```
%result = prim::If(%condition)
  block0():
    -> (%a)
  block1():
    -> (%b)
```

This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.

This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:

```
%result = prim::IfThenElse(%condition, %a, %b)
```
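
A hedged Python-level sketch of source that produces the prim::If pattern above (function name is illustrative):

```
import torch

@torch.jit.script
def pick(condition: bool, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # A ternary whose branches only forward existing values; the JIT emits the
    # prim::If graph shown above, which the replacement turns into prim::IfThenElse.
    return a if condition else b
```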

Test Plan: New unit tests

Reviewed By: eellison

Differential Revision: D34091789

fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b83f402bda2dcdd9ecb411e0b67c651)
2022-02-17 18:22:48 +00:00
c975b928ab [SR][easy] CPU fuser uses native control flow (#72544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72544

Now that static runtime supports control flow, there's no need to fall back to the JIT. We get better performance with the native control flow since we avoid heap allocation/ref count bumps during stack construction.

I've left the old `prim::TensorExprDynamicGroup` around in case we need to support it in the future. I've also added native support for a few scalar ops that are used inside the control flow sub-blocks.
ghstack-source-id: 148825816

Test Plan: New unit tests

Reviewed By: d1jang

Differential Revision: D34083080

fbshipit-source-id: a7ffc0fda39ab3df3ba47e44a03d857131dc1e50
(cherry picked from commit 2ef39e0e54d5e9da76af9e617a11233ffc81b011)
2022-02-10 18:40:39 +00:00
84729cef70 [Static Runtime] Fix a bug in aten::slice to honor optional arguments (#72530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72530

This bug was revealed from a failed attempt to run a feed/story model.
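
A hedged TorchScript sketch of slicing with omitted (optional) arguments, the case this fix covers (function name is illustrative):

```
import torch

@torch.jit.script
def slices(x: torch.Tensor) -> torch.Tensor:
    a = x[1:]      # aten::slice with end and step left as None (optional)
    b = x[:, ::2]  # omitted start/end with an explicit step
    return a.sum() + b.sum()
```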

Test Plan:
- This fix was tested to successfully run the failed model: P479037453
- Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34055801

fbshipit-source-id: 4a3e06bbb3b9fa78b0514c9c67aa4a0b79f46a8d
(cherry picked from commit bfa2bfb81ceaadad399522e422863fcea4aa13f1)
2022-02-09 17:05:45 +00:00
6c0521b919 [SR] Add native implementations for converted prim ops (#71474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71474

The PyTorch edge team is working on promoting some prim ops to interpreter instructions (see D33398092). Since the JIT fallback ops will be unavailable soon, we need to implement these ops in static runtime.

Ops not included in this diff:
* `aten::__is__` and `aten::__isnot__`: disabled in static runtime for unrelated reasons
* `prim::NumToTensor` and `aten::__get__.Dict` already exist
ghstack-source-id: 148641179

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D33657816

fbshipit-source-id: 6d15244ae1024a56d3b25e51a433fa104ce8ee5e
(cherry picked from commit 33f8f861ff88a6dda6a545c12515e92c893027d4)
2022-02-08 23:25:34 +00:00
0bb3158eae [SR] Implement prim::CreateObject (#71854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71854

Support `prim::CreateObject` - this is a native interpreter instruction, so we can't fall back to the JIT for this op.
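
A hedged TorchScript sketch of object creation that lowers to prim::CreateObject (class and function names are made up):

```
import torch

@torch.jit.script
class Counter:
    def __init__(self) -> None:
        self.n = 0

    def bump(self) -> int:
        self.n += 1
        return self.n

@torch.jit.script
def use_counter() -> int:
    c = Counter()  # object construction lowers to prim::CreateObject + the __init__ call
    c.bump()
    return c.bump()
```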

Test Plan: New unit test exercises creating and modifying custom objects

Reviewed By: d1jang

Differential Revision: D33783759

fbshipit-source-id: 8185ff71b5d441597d712a5d4aab7fc4dddf7034
(cherry picked from commit bd3f52d8e2cd8e20a8d66e2d2b802c1d92088e4e)
2022-02-03 12:18:46 +00:00
2d5296b0e7 [SR] Implement prim::Loop (#69838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838

Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186483
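
A hedged sketch of a scripted loop that lowers to prim::Loop (function name is illustrative):

```
import torch

@torch.jit.script
def running_sum(x: torch.Tensor, steps: int) -> torch.Tensor:
    acc = torch.zeros_like(x)
    for _ in range(steps):  # lowered to prim::Loop with a single sub-block
        acc = acc + x
    return acc
```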

Test Plan: New unit tests: `buck test caffe2/benchmark/static_runtime/...`

Reviewed By: d1jang

Differential Revision: D33049595

fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532
(cherry picked from commit 8baf1753af34f4c166b4680e42589517fd2e508d)
2022-02-02 19:30:50 +00:00
2aa699505d [SR] Implement prim::If (#69837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837

Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186475

Test Plan:
New unit tests: `buck test caffe2/benchmarks/static_runtime/...`

Accuracy test at top of stack

Reviewed By: d1jang

Differential Revision: D33045908

fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55
(cherry picked from commit de3b12dc0871e8ca09891c257e1dfd7cd352aa7c)
2022-02-02 19:30:50 +00:00
238dded10f [SR] Graph pass to create owned refs of special IValues (#69835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835

`StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`.

Note that this also applies to returned values in sub-blocks that are from outer scopes.
ghstack-source-id: 148186452
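
A hedged sketch of the corner case — a graph whose output is a constant (function name is made up):

```
import torch

@torch.jit.script
def always_ready(x: torch.Tensor) -> bool:
    # The returned value is a graph constant; without the owned-ref pass, moving the
    # output out of StaticRuntimeBlockRunner could invalidate the constants_ entry.
    return True
```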

Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`

Added a new unit test with a graph that simply returns a constant.

Tests with sub-blocks at top of stack.

Reviewed By: d1jang

Differential Revision: D33047519

fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b
(cherry picked from commit a8f89a12ee726aa7d7e546dee25d696eef868ce7)
2022-02-02 19:30:50 +00:00
3a77fb244b [PyTorch][Static Runtime] Delete cleanup_activations option (#71501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501

This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it.
ghstack-source-id: 147385569

Test Plan: CI, rerun broken test from task

Reviewed By: mikeiovine

Differential Revision: D33669290

fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a
(cherry picked from commit 5e4fefb109b6c92d59fc7e24d69f1b6b2780c776)
2022-01-21 18:15:43 +00:00
fcbc34a5eb [PyTorch][Static Runtime] Avoid recomputing input size in dict_unpack (#71252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71252

Same old problem, same old solution.

Interestingly, I tried using c10::irange instead, but that caused really bad assembly to be generated -- we lost inlining for lots of the loop body!
ghstack-source-id: 146939573

Test Plan:
CI

Spot-checked assembly before/after and confirmed that loop termination value was recomputed before and not after

Reviewed By: mikeiovine

Differential Revision: D33558118

fbshipit-source-id: 9fda2f1f89bacba2e8b5e61ba432871e973201fe
2022-01-14 14:33:56 -08:00
bf82d2012e [PyTorch] Add IValue::toDimVector & mostly replace toIntVector with it (#71247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247

Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it.
ghstack-source-id: 146933314

Test Plan: CI -- if we think DimVector is good in general then I think we have to think this change is good?

Reviewed By: mikeiovine

Differential Revision: D33556198

fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b
2022-01-14 14:32:40 -08:00
3cc34a4502 [PyTorch][Static Runtime] s/toObject/toObjectRef/ in native ops (#71238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71238

Saves a refcount bump for these.
ghstack-source-id: 146927203

Test Plan: CI

Reviewed By: mikeiovine

Differential Revision: D33554385

fbshipit-source-id: b2f8d5afdc0eb80c8765d88560d0e547376f28d1
2022-01-12 18:44:40 -08:00
ffdc0e23af [SR] Add various missing native ops (#71113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71113

This diff adds a variety of missing ~~out variants~~/native ops. Most of these are trivial, so I included them all in one diff.

Native ops
* `aten::mul` (list variant)
* `aten::sub` (int variant)
* `aten::add` (list variant)
* `aten::Int`

Out variants
* ~~`aten::gt`~~ (codegen will handle)
* ~~`aten::eq`~~ (codegen will handle)
ghstack-source-id: 146927552
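
A hedged TorchScript sketch of surface syntax for the native ops listed above (function name is illustrative; the mapping from Python syntax to the exact overloads is an assumption):

```
from typing import List

import torch

@torch.jit.script
def scalar_and_list_ops(a: List[int], b: List[int], n: int, x: torch.Tensor):
    joined = a + b    # list concatenation (aten::add, list variant)
    repeated = a * 2  # list repetition (aten::mul, list variant)
    diff = n - 1      # aten::sub (int variant)
    as_int = int(x)   # aten::Int: single-element tensor to int
    return joined, repeated, diff, as_int
```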

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D33510756

fbshipit-source-id: df385958b9561955b2e866dab2e4c050abd26766
2022-01-12 18:40:31 -08:00
10b40acbdb [PyTorch][Static Runtime] Fast aliasing in select_tensor by manual borrowing (#68122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122

See code comments for details; in brief, we repurpose support
for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor`
output a borrowed IValue that we have to clean up manually.

If we have any other ops that always create a new reference to an
existing Tensor, we can easily apply this same optimization.
ghstack-source-id: 146482212

Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)

--do_profile output for local_ro (updated Dec 10):

```
swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt
First iter time: 0.989023 ms
Number of operators: 2037
Total number of managed tensors: 1597
Total number of managed output tensors: 0
Total number of unmanaged values: 2568
Number of unmanaged values requiring cleanup: 2568
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 50368 bytes
Total number of reused tensors: 1010
Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%)
swolchok@devbig032 /d/u/s/f/fbcode> ttail TMCC^C
swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt
First iter time: 0.994703 ms
Number of operators: 2551
Total number of managed tensors: 1146
Total number of managed output tensors: 0
Total number of unmanaged values: 4047
Number of unmanaged values requiring cleanup: 3533
Number of unmanaged values not requiring cleanup: 514
Total memory managed: 50048 bytes
Total number of reused tensors: 559
Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%)
```

for local: (also Dec 10):

```
==> Stable.local.profile.txt <==
First iter time: 9.0909 ms
Number of operators: 1766
Total number of managed tensors: 1894
Total number of managed output tensors: 0
Total number of unmanaged values: 2014
Number of unmanaged values requiring cleanup: 2014
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 4541440 bytes
Total number of reused tensors: 847
Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%)

==> TMCOFastAliasing.local.profile.txt <==
First iter time: 7.5512 ms
Number of operators: 2378
Total number of managed tensors: 1629
Total number of managed output tensors: 0
Total number of unmanaged values: 3503
Number of unmanaged values requiring cleanup: 2891
Number of unmanaged values not requiring cleanup: 612
Total memory managed: 3949312 bytes
Total number of reused tensors: 586
Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%)
```

Reviewed By: hlu1

Differential Revision: D32318674

fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783
2022-01-04 22:36:13 -08:00
4d8fc8693c [PyTorch][Static Runtime] Support memory planning for torch.to() w/o requiring copying (#67223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223

ghstack-source-id: 146482215

Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)

Reviewed By: hlu1

Differential Revision: D31776259

fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
2022-01-04 22:36:10 -08:00
24f16de987 [Static Runtime] Support native op split_with_sizes (#69999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69999

This adds support for the split_with_sizes operator in Static Runtime by adding native operators. Those operators have less overhead compared to their JIT fallbacks (no dispatching, no stack construction at runtime).

split_with_sizes can be called directly from the C++ API, or via `torch.split` when `split_sizes` is a list. This diff adds support for both use cases.
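
An illustrative snippet showing both ways split_with_sizes gets invoked (tensor shapes are made up):

```
import torch

x = torch.randn(10, 4)

# Called directly as a method:
parts_a = x.split_with_sizes([2, 3, 5], dim=0)

# Or via torch.split when split_sizes is a list:
parts_b = torch.split(x, [2, 3, 5], dim=0)

assert [p.shape[0] for p in parts_a] == [2, 3, 5]
```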

Test Plan:
- Added unit tests. Made sure the operators are used
- Benchmark
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/data/users/dxd/305797439_0.predictor.precompute.remote_request_only \
--method_name=user.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=500 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \
--input_type="recordio" --pt_inputs=/data/users/dxd/305797439_0_user.inputs.recordio \
--recordio_use_ivalue_format=1 --do_profile=1 --do_benchmark=1
```

#### Before
```
Static runtime ms per iter: 3.62073. Iters per second: 276.187
0.0471904 ms.    1.31501%. aten::split_with_sizes (5 nodes)
```
#### After
```
Static runtime ms per iter: 3.44374. Iters per second: 290.382
0.0432057 ms.    1.34276%. aten::split_with_sizes (5 nodes, native)
```

Reviewed By: swolchok

Differential Revision: D33141006

fbshipit-source-id: feae34c4c873fc22d48a8ff3bf4d71c0e00bb365
2021-12-20 18:32:54 -08:00
3e20a74b55 [SR] Update memory planner docs (#69559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69559

We have a lot of special cases. Document them so they're easy to learn about.
ghstack-source-id: 145226542

Test Plan: Spell check? :)

Reviewed By: d1jang

Differential Revision: D32929416

fbshipit-source-id: 2362410f25a27cdb74a4939903446192cef61978
2021-12-09 14:22:33 -08:00
8954c92529 [PyTorch][Static Runtime] Borrow outputs in static_runtime::VarTupleUnpack (#68161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68161

Continuing rollout of borrowing outputs for native ops.
ghstack-source-id: 143424920

Test Plan:
Compare CMF local_ro perf again.

Previous diff:
```
I1110 22:05:23.245435 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03272. Iters per second: 968.313
I1110 22:05:23.822196 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.06478. Iters per second: 939.163
I1110 22:05:24.395256 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.035. Iters per second: 966.186
I1110 22:05:24.964169 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.02786. Iters per second: 972.898
I1110 22:05:25.536558 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03205. Iters per second: 968.946
I1110 22:05:26.109027 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04256. Iters per second: 959.174
I1110 22:05:26.679611 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03245. Iters per second: 968.567
I1110 22:05:27.253048 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04493. Iters per second: 957.005
I1110 22:05:27.822629 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0299. Iters per second: 970.971
I1110 22:05:28.393326 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03039. Iters per second: 970.509
I1110 22:05:28.393368 113949 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.03726, standard deviation: 0.0111053
```

This diff:
```
I1110 22:18:48.453075 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.931188. Iters per second: 1073.9
I1110 22:18:48.967614 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.933196. Iters per second: 1071.59
I1110 22:18:49.483338 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.932087. Iters per second: 1072.86
I1110 22:18:49.997144 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.930877. Iters per second: 1074.26
I1110 22:18:50.529383 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.936981. Iters per second: 1067.26
I1110 22:18:51.085038 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.953214. Iters per second: 1049.08
I1110 22:18:51.607192 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.940719. Iters per second: 1063.02
I1110 22:18:52.126169 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.942638. Iters per second: 1060.85
I1110 22:18:52.644445 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.937574. Iters per second: 1066.58
I1110 22:18:53.163486 191647 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 0.941636. Iters per second: 1061.98
I1110 22:18:53.163537 191647 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 0.938011, standard deviation: 0.00691196
```

0.099 (9.5%!) usec/iter improvement over previous diff

Reviewed By: hlu1

Differential Revision: D32347900

fbshipit-source-id: 8169ebcadf1248e555a18bbffa99eef6cac1ba85
2021-11-16 12:32:15 -08:00
755be54c77 [PyTorch][Static Runtime] Borrow outputs in static_runtime::dict_unpack (#68160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68160

This generalizes the mechanism D32318674 added for letting native ops borrow their outputs and uses it in dict_unpack.
ghstack-source-id: 143424919

Test Plan:
4.5% in CMF local_ro compared to D32318674 (previous two diffs were necessary steps but didn't get the full win yet):

```
FastAliasingInSelectTensor, local_ro
========================================
I1110 22:06:37.549811 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08488. Iters per second: 921.76
I1110 22:06:38.147949 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08675. Iters per second: 920.171
I1110 22:06:38.766340 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08626. Iters per second: 920.592
I1110 22:06:39.366608 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08376. Iters per second: 922.717
I1110 22:06:39.964979 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08362. Iters per second: 922.833
I1110 22:06:40.565248 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08423. Iters per second: 922.312
I1110 22:06:41.167326 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0945. Iters per second: 913.659
I1110 22:06:41.766187 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08373. Iters per second: 922.742
I1110 22:06:42.367816 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08995. Iters per second: 917.475
I1110 22:06:42.968391 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08854. Iters per second: 918.665
I1110 22:06:42.968446 119627 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.08662, standard deviation: 0.00351662

BorrowDictUnpackOutputs, local_ro
========================================

I1110 22:05:23.245435 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03272. Iters per second: 968.313
I1110 22:05:23.822196 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.06478. Iters per second: 939.163
I1110 22:05:24.395256 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.035. Iters per second: 966.186
I1110 22:05:24.964169 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.02786. Iters per second: 972.898
I1110 22:05:25.536558 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03205. Iters per second: 968.946
I1110 22:05:26.109027 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04256. Iters per second: 959.174
I1110 22:05:26.679611 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03245. Iters per second: 968.567
I1110 22:05:27.253048 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04493. Iters per second: 957.005
I1110 22:05:27.822629 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0299. Iters per second: 970.971
I1110 22:05:28.393326 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03039. Iters per second: 970.509
I1110 22:05:28.393368 113949 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.03726, standard deviation: 0.0111053
```

0.04936 (4.5%) usec/iter improvement

Reviewed By: hlu1

Differential Revision: D32347390

fbshipit-source-id: e636ddafacf30ed2a2d84a6e15fff97481342fdb
2021-11-16 12:31:03 -08:00
bbc24222d2 [PyTorch][Static Runtime] Refcount bump pass in native_ops (#68159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68159

These all look like they'll cause unnecessary refcount bumps to me.
ghstack-source-id: 143424917

Test Plan:
CI

TODO profile local_ro?

Reviewed By: hlu1

Differential Revision: D32347392

fbshipit-source-id: d8ed91b5855b86765db00c61ad3650273302c7b6
2021-11-16 12:27:12 -08:00
6acde23bec [PyTorch][Static Runtime] Switch input/output repr to 2-byte offsets (#67934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934

This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode.
ghstack-source-id: 143429113

Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.

Previous diff, CMF local:

```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```

This diff, CMF local:

```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```

72912 bytes (17%) savings

Perf looks neutral; see next diff (D32216573) test plan for details.

Reviewed By: hlu1

Differential Revision: D32190751

fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
2021-11-16 10:19:50 -08:00
1f07efd0f2 [SR] Fix aten::split schema (#68135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68135

Update the schema to reflect the changes in  D31935573 (6b44e75f6b).

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Confirmed native implementation is used.

Reviewed By: hlu1

Differential Revision: D32326865

fbshipit-source-id: 7f607f57ceb6690a2782d94d9ee736ba64e7d242
2021-11-10 20:03:30 -08:00
ecd5b1a8d4 [SR] Native implementation for aten::split (#67476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67476

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D31994040

fbshipit-source-id: 9de57d8d7925ee46544478eae8229952ca5f248a
2021-11-10 10:23:03 -08:00
82f7f8d471 [PyTorch] Adopt IValue::toTupleRef() where obvious (#65505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65505

Generated with

`fastmod -m 'toTuple\(\)(\s*)->' 'toTupleRef()${1}.'`

, followed by

`fastmod '(std::move\(.*)toTupleRef\(\).' '${1}toTuple()->'`

to unbreak 2 callsites.
ghstack-source-id: 142065835

Test Plan: CI

Reviewed By: gchanan

Differential Revision: D31131025

fbshipit-source-id: 54457ae5bbeb38db9c7f196d469b98521c3d3f34
2021-11-02 10:22:18 -07:00
d9bac7c316 [PyTorch] Add IValue::toTupleRef() (#65504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65504

We should be able to borrow a Tuple from an IValue without incurring refcount bumps.
ghstack-source-id: 142065833

Test Plan:
Added test coverage.

Profiled static runtime on the local_ro net for ctr_mobile_feed. Inclusive time spent in VarTupleUnpack decreased about 0.3%, which roughly matches the 0.36% of runtime that was previously spent in IValue::toTuple().

Reviewed By: hlu1

Differential Revision: D31130570

fbshipit-source-id: afa14f46445539e449068fd908d547b8da7f402c
2021-11-02 10:16:25 -07:00
39ad7b670e [SR] Native implementation for aten::squeeze (#67441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67441

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31992093

fbshipit-source-id: 88191c13d229ffeac4e5b17b78e25f51d3f7f23e
2021-11-01 08:22:57 -07:00
354363b57a [SR] Native implementation for aten::size (#67346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67346

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D31965159

fbshipit-source-id: 86a69c395f401c4a4c55daa4c5fe80764383c8e5
2021-10-28 14:18:17 -07:00
afb8434440 [SR] Native implementation for aten::view (#67341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67341

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like `TupleUnpack`). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31962589

fbshipit-source-id: 3107fb169c1b02fb2bafbb355c005669b5fa8435
2021-10-28 13:37:46 -07:00