Summary:
The previous implementation assumed there was only one overload and unconditionally tried to convert its input into a string. Some users were running into crashes because of this. Added support for the list overload along with schema checks.
Also, I managed to uncover another bug when writing tests for this case (yikes). Returning inputs didn't work because the input cleanup process would destroy the output. Extended `CreateOwnedRefsForSpecialIValues` to fix that.
Test Plan: CI + new unit tests
Differential Revision: D38870803
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83753
Approved by: https://github.com/tenpercent, https://github.com/albanD
Summary:
- Support async execution of forked nodes on a custom executor
- Fork subgraph execution was previously performed on the inter-op thread pool executor by default
- Handle async execution of the forked graph on a custom executor when the parent graph is executed with the runAsync() API, which passes the executor for async ops
Differential Revision: D37466525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80381
Approved by: https://github.com/mikeiovine
Summary:
- The ProcessedNodeMetadata class wraps the possible metadata for ProcessedNode. Depending on the nature of the op, a ProcessedNode can hold one of the following kinds of metadata:
1. prim::If/prim::Loop ops contain block_runners_ as their metadata
2. The prim::fork op contains a TaskLauncher (std::function) responsible for executing the forked subgraph
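A minimal sketch of the wrapper this describes, with illustrative member names only (the actual class and its accessors live in static runtime and may differ):
```
#include <functional>
#include <memory>
#include <vector>

class BlockRunner;  // stand-in for static runtime's per-block runner

// Mirrors the std::function launcher type mentioned above.
using TaskLauncher = std::function<void(std::function<void()>)>;

class ProcessedNodeMetadataSketch {
 public:
  // prim::If / prim::Loop: block runners for the node's sub-blocks.
  std::vector<std::shared_ptr<BlockRunner>> block_runners;
  // prim::fork: launcher that schedules execution of the forked subgraph.
  TaskLauncher launcher;
};
```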
Differential Revision: D37320704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79961
Approved by: https://github.com/mikeiovine
Summary: This adds the missing JIT prim ops that appear in the non-ads models for the c2->pt mitigation: aten::cpu, aten::list, aten::numel, aten::__range_length
Test Plan: static runtime unit tests
Differential Revision: D36984960
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79111
Approved by: https://github.com/davidberard98
Summary:
- Remove creation of a new StaticModuleOptions for the forked subgraph. Use the parent graph's options to create the runtime for the forked subgraph
- StaticRuntimeMetadata extends CustomClassHolder, which can be cast to an IValue and attached to the IR node's attributes.
Differential Revision: D37159684
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79578
Approved by: https://github.com/mikeiovine
Summary: This adds the PyTorch operators that are currently missing in non-ads models for the c2->pt mitigation: aten::index_put, aten::item, aten::tensor_split
Test Plan: buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
Differential Revision: D36984961
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79065
Approved by: https://github.com/davidberard98
Summary:
- StaticModule was being created at runtime, which added overhead to the forked operation
- Move StaticModule creation outside of the runtime so that StaticRuntime instances can be created on top of the same StaticModule, which is created only once
Differential Revision: D37126923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79482
Approved by: https://github.com/tenpercent
Summary:
- Exception handling was not performed during forked subgraph execution
- The forked subgraph runtime can throw a runtime exception. The future returned by prim::fork needs to capture exceptions so that aten::wait can handle them.
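A minimal sketch of the idea, assuming a hypothetical `run_subgraph` callable standing in for the forked subgraph's runtime (this is not the actual static runtime code):
```
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>
#include <functional>

// Run the forked work and propagate any exception through the future that
// prim::fork returns, so aten::wait can observe it.
c10::intrusive_ptr<c10::ivalue::Future> run_forked(
    std::function<c10::IValue()> run_subgraph) {
  auto future = c10::make_intrusive<c10::ivalue::Future>(c10::AnyType::get());
  try {
    future->markCompleted(run_subgraph());
  } catch (...) {
    future->setError(std::current_exception());  // surfaced by aten::wait
  }
  return future;
}
```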
Test Plan:
local test cases:
- buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
- buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
- buck test mode/opt caffe2/test:static_runtime
Async execution of the subgraph is tested by adding PyTorch profiler hooks to the StaticRuntime execution via the code below. Async execution in the thread pool is verified by checking the trace:
with profile(activities=[ProfilerActivity.CPU]) as prof:
    static_runtime_module(inputs)
prof.export_chrome_trace("trace.json")
Differential Revision: D37072493
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79292
Approved by: https://github.com/mikeiovine
Summary:
- Initial support for fork was done on the JIT interpreter. This patch enables async execution on static runtime
- For each forked node, a separate runtime is created for executing the subgraph. Async execution is handled by the aten::ParallelThreadPoolNative thread pool
- aten::wait waits for the future returned by fork to complete
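A small sketch of the aten::wait behaviour described above, written against the public c10 Future API rather than the actual static runtime kernel:
```
#include <ATen/core/ivalue.h>

// Block until the forked subgraph's future completes, then forward its
// result (value() rethrows if the subgraph raised an exception).
c10::IValue wait_on_fork(const c10::IValue& forked) {
  auto future = forked.toFuture();
  future->wait();
  return future->value();
}
```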
Test Plan:
local test cases:
- buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
- buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
- buck test mode/opt caffe2/test:static_runtime
Async execution of the subgraph is tested by adding PyTorch profiler hooks to the StaticRuntime execution via the code below. Async execution in the thread pool is verified by checking the trace:
with profile(activities=[ProfilerActivity.CPU]) as prof:
    static_runtime_module(inputs)
prof.export_chrome_trace("trace.json")
Differential Revision: D37044513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79211
Approved by: https://github.com/mikeiovine
Summary:
prim::fork was executed synchronously in the main thread
- Added changes that execute prim::fork calls asynchronously on one of the threads from TaskThreadPoolBase defined in ATen (see the sketch after this list)
- Changes are tested via PyTorch profiler tracing; fork calls are executed on different threads
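A hedged sketch of dispatching the forked work onto ATen's inter-op thread pool via at::launch; `run_subgraph` is a hypothetical callable, not static runtime's actual fork implementation:
```
#include <ATen/Parallel.h>
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>
#include <functional>

c10::intrusive_ptr<c10::ivalue::Future> fork_async(
    std::function<c10::IValue()> run_subgraph) {
  auto future = c10::make_intrusive<c10::ivalue::Future>(c10::AnyType::get());
  // at::launch schedules the closure on the inter-op thread pool, so the
  // main thread is free to continue until aten::wait is reached.
  at::launch([future, run_subgraph = std::move(run_subgraph)]() {
    future->markCompleted(run_subgraph());
  });
  return future;
}
```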
Test Plan:
Local test scripts executed:
- buck run mode/opt caffe2/test:static_runtime
- buck run caffe2/benchmarks/static_runtime/fb:test_fb_operators
- buck run caffe2/benchmarks/static_runtime:static_runtime_cpptest
Executed the PyTorch profiler to confirm the spawned fork operations run in parallel on different threads
Differential Revision: D36909308
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78858
Approved by: https://github.com/mikeiovine
Summary:
Basic implementation of prim::fork and aten::wait
- The current implementation uses the interpreter to call the forked subgraph
- The interpreter call is to be replaced in the future
- Added custom test cases for fork/wait procedures in the graph
Test Plan:
Custom tests were created in the test_static_runtime.py file to verify the static_runtime output against the reference PyTorch output.
Test commands:
- buck run caffe2/test:static_runtime
- buck run caffe2/benchmarks/static_runtime:static_runtime_cpptest
- buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
Differential Revision: D36881214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78780
Approved by: https://github.com/tenpercent
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74987
Add specializations to the `prim::If` operator at runtime to save resources when some of the sub-blocks are empty
Test Plan:
`buck build //caffe2:torch-cpp-cpu`
`buck test //caffe2/benchmarks/static_runtime/...`
Add unit test:
`buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.EmptyIfBlock`
Reviewed By: mikeiovine
Differential Revision: D35262952
fbshipit-source-id: 324f88471f33f035f4d8a9b212716530d8e59df2
(cherry picked from commit 2db1b1a6833b1376fa376f54791effc8e12fb77f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74585
Native static runtime for `aten::reshape_as`
ghstack-source-id: 152340038
Test Plan: New unit test
Reviewed By: hlu1
Differential Revision: D35060895
fbshipit-source-id: c4e6f8a04c7df3821c7e654bfaf584e5a72ea701
(cherry picked from commit 6fa596cd866a024b6653239e0e30ddad42de242f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74562
Add a native implementation for `aten::IntImplicit`, which is similar to `aten::Int` except for a few extra checks it must do
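For context, a rough sketch of the kind of extra checks an implicit tensor-to-int conversion performs before extracting the scalar (based on the JIT's behaviour for implicit conversions; illustrative, not this diff's exact code):
```
#include <ATen/ATen.h>

int64_t int_implicit_sketch(const at::Tensor& t) {
  TORCH_CHECK(!t.requires_grad(),
              "Cannot convert a tensor that requires grad to a scalar");
  TORCH_CHECK(t.sizes().empty(),
              "Cannot convert a tensor of dimension other than 0 to a scalar");
  return t.item<int64_t>();  // same extraction aten::Int would do
}
```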
ghstack-source-id: 152340039
Test Plan: New unit tests
Reviewed By: hlu1
Differential Revision: D35052997
fbshipit-source-id: cb2f0faf7c62382e3f13750d8e1280c49c6b9e42
(cherry picked from commit 359c7493f8deaeccebc27e1b6e6e9777850010c1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74250
The `DictConstruct`/`ListUnpack` implementations currently put all of their inputs onto a stack before calling the JIT implementation in `vararg_functions.cpp`. This was done to avoid code duplication, but it's quite wasteful since it causes extra heap allocations and, potentially, refcount bumps.
Given that these two ops are quite common and the code duplication is only a few lines, it seems reasonable to avoid this cost.
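To make the point concrete, here is a minimal sketch of building the dict directly from a node's inputs rather than pushing them onto a JIT stack first (generic c10 API usage, not the actual op body):
```
#include <ATen/core/Dict.h>
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>
#include <vector>

c10::IValue dict_construct_direct(const std::vector<c10::IValue>& inputs) {
  // Inputs come in key/value pairs; insert them straight into the dict,
  // avoiding the extra stack allocation and refcount traffic.
  auto dict = c10::impl::GenericDict(c10::AnyType::get(), c10::AnyType::get());
  dict.reserve(inputs.size() / 2);
  for (size_t i = 0; i + 1 < inputs.size(); i += 2) {
    dict.insert(inputs[i], inputs[i + 1]);
  }
  return dict;
}
```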
ghstack-source-id: 151897634
Test Plan: Existing unit tests
Reviewed By: navahgar
Differential Revision: D34901245
fbshipit-source-id: ece0618a6134a35720f214e79c64f12045f074d0
(cherry picked from commit 1f8e223c1887ed205c84a7ac4587813f94b11bad)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74192
This change uses toListRef() for `aten::len` to avoid creating a new list object, addressing hlu1's comment on D34705231 (87564a1bd7)
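As a small illustration of the point, the length can be read through an ArrayRef view of the existing list instead of materializing a new one (usage sketch, not the verbatim kernel):
```
#include <ATen/core/ivalue.h>

int64_t list_len(const c10::IValue& list) {
  // toListRef() returns a view over the existing list storage, so no new
  // list object (and no refcount bump) is created just to take its size.
  return static_cast<int64_t>(list.toListRef().size());
}
```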
Test Plan: Existing tests, StaticRuntime.Len*
Reviewed By: mikeiovine
Differential Revision: D34863266
fbshipit-source-id: 65daf36944a64dfd7afde1103aab5aee1681ac87
(cherry picked from commit 3a0f3798f2fcc203f6cb01e59b91e195ecabe1bc)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73450
This change uses `SROperator` for operators' function type
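For reference, the alias in question is essentially a function over a ProcessedNode pointer; a sketch from memory (treat as illustrative):
```
#include <functional>

namespace torch::jit {
class ProcessedNode;  // defined in static runtime
using SROperator = std::function<void(ProcessedNode*)>;
} // namespace torch::jit
```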
Test Plan: N/A
Reviewed By: mikeiovine
Differential Revision: D34483246
fbshipit-source-id: ed544bb91b676ed08983dc8dc78cedd0f77d499f
(cherry picked from commit eb9de3ad8de043990c02f30ffa48a29c8e5e81f2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587
This pattern frequently appears in a few graphs:
```
%result = prim::If(%condition)
block0():
-> (%a)
block1():
-> (%b)
```
This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.
This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:
```
%result = prim::IfThenElse(%condition, %a, %b)
```
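At runtime the new op is just a ternary select over already-computed inputs; a hedged sketch of its semantics (not the registered static runtime kernel):
```
#include <ATen/core/ivalue.h>

c10::IValue if_then_else(const c10::IValue& cond,
                         const c10::IValue& a,
                         const c10::IValue& b) {
  // Both branches are plain values here, so no sub-blocks, block runners,
  // or memory planners are needed.
  return cond.toBool() ? a : b;
}
```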
Test Plan: New unit tests
Reviewed By: eellison
Differential Revision: D34091789
fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b83f402bda2dcdd9ecb411e0b67c651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72544
Now that static runtime supports control flow, there's no need to fall back to the JIT. We get better performance with the native control flow since we avoid heap allocation/ref count bumps during stack construction.
I've left the old `prim::TensorExprDynamicGroup` around in case we need to support it in the future. I've also added native support for a few scalar ops that are used inside the control flow sub-blocks.
ghstack-source-id: 148825816
Test Plan: New unit tests
Reviewed By: d1jang
Differential Revision: D34083080
fbshipit-source-id: a7ffc0fda39ab3df3ba47e44a03d857131dc1e50
(cherry picked from commit 2ef39e0e54d5e9da76af9e617a11233ffc81b011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72530
This bug was revealed from a failed attempt to run a feed/story model.
Test Plan:
- This fix was tested to successfully run the failed model: P479037453
- Added a unittest
Reviewed By: mikeiovine
Differential Revision: D34055801
fbshipit-source-id: 4a3e06bbb3b9fa78b0514c9c67aa4a0b79f46a8d
(cherry picked from commit bfa2bfb81ceaadad399522e422863fcea4aa13f1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71474
The PyTorch edge team is working on promoting some prim ops to interpreter instructions (see D33398092). Since the JIT fallback ops will be unavailable soon, we need to implement these ops in static runtime.
Ops not included in this diff:
* `aten::__is__` and `aten::__isnot__`: disabled in static runtime for unrelated reasons
* `prim::NumToTensor` and `aten::__get__.Dict` already exist
ghstack-source-id: 148641179
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D33657816
fbshipit-source-id: 6d15244ae1024a56d3b25e51a433fa104ce8ee5e
(cherry picked from commit 33f8f861ff88a6dda6a545c12515e92c893027d4)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71854
Support `prim::CreateObject` - this is a native interpreter instruction, so we can't fall back to the JIT for this op.
Test Plan: New unit test exercises creating and modifying custom objects
Reviewed By: d1jang
Differential Revision: D33783759
fbshipit-source-id: 8185ff71b5d441597d712a5d4aab7fc4dddf7034
(cherry picked from commit bd3f52d8e2cd8e20a8d66e2d2b802c1d92088e4e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838
Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186483
Test Plan: New unit tests: `buck test caffe2/benchmark/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33049595
fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532
(cherry picked from commit 8baf1753af34f4c166b4680e42589517fd2e508d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837
Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186475
Test Plan:
New unit tests: `buck test caffe2/benchmarks/static_runtime/...`
Accuracy test at top of stack
Reviewed By: d1jang
Differential Revision: D33045908
fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55
(cherry picked from commit de3b12dc0871e8ca09891c257e1dfd7cd352aa7c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835
`StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`.
Note that this also applies to returned values in sub-blocks that are from outer scopes.
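The op itself can be as simple as copying the IValue, since an IValue copy takes a new (owned) reference; a minimal sketch under that assumption:
```
#include <ATen/core/ivalue.h>

c10::IValue create_owned_ref(const c10::IValue& input) {
  // Copying the IValue bumps the refcount, so moving the result out of the
  // block runner can no longer destroy the only reference held elsewhere
  // (e.g. in the constants array or an outer scope).
  return input;  // copy => owned reference
}
```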
ghstack-source-id: 148186452
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
Added a new unit test with a graph that simply returns a constant.
Tests with sub-blocks at top of stack.
Reviewed By: d1jang
Differential Revision: D33047519
fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b
(cherry picked from commit a8f89a12ee726aa7d7e546dee25d696eef868ce7)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501
This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it.
ghstack-source-id: 147385569
Test Plan: CI, rerun broken test from task
Reviewed By: mikeiovine
Differential Revision: D33669290
fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a
(cherry picked from commit 5e4fefb109b6c92d59fc7e24d69f1b6b2780c776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71252
Same old problem, same old solution.
Interestingly, I tried using c10::irange instead, but that caused really bad assembly to be generated -- we lost inlining for lots of the loop body!
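The "same old solution" is hoisting the termination value out of the loop header; a generic illustration with a hypothetical loop (the diff applies this to static runtime's internal loops, where aliasing keeps the compiler from hoisting it automatically):
```
#include <cstddef>
#include <vector>

void process(int);  // stand-in for the loop body

// Before: the bound may be re-evaluated every iteration if the compiler
// cannot prove it is loop-invariant.
void before(const std::vector<int>& values) {
  for (size_t i = 0; i < values.size(); ++i) {
    process(values[i]);
  }
}

// After: compute the termination value exactly once.
void after(const std::vector<int>& values) {
  const size_t n = values.size();
  for (size_t i = 0; i < n; ++i) {
    process(values[i]);
  }
}
```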
ghstack-source-id: 146939573
Test Plan:
CI
Spot-checked assembly before/after and confirmed that the loop termination value was recomputed before the change and not after
Reviewed By: mikeiovine
Differential Revision: D33558118
fbshipit-source-id: 9fda2f1f89bacba2e8b5e61ba432871e973201fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247
Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it.
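A sketch of the substitution, assuming a shape arriving as a list IValue (illustrative only, not the diff's exact call sites):
```
#include <ATen/ATen.h>

at::DimVector shape_from_ivalue(const c10::IValue& sizes) {
  // DimVector is a SmallVector sized for typical tensor ranks, so building
  // the shape into it avoids the heap allocation of toIntVector().
  at::DimVector shape;
  for (const auto& d : sizes.toListRef()) {
    shape.push_back(d.toInt());
  }
  return shape;
}
```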
ghstack-source-id: 146933314
Test Plan: CI -- if we think DimVector is good in general then I think we have to think this change is good?
Reviewed By: mikeiovine
Differential Revision: D33556198
fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71113
This diff adds a variety of missing ~~out variants~~/native ops. Most of these are trivial, so I included them all in one diff.
Native ops
* `aten::mul` (list variant)
* `aten::sub` (int variant)
* `aten::add` (list variant)
* `aten::Int`
Out variants
* ~~`aten::gt`~~ (codegen will handle)
* ~~`aten::eq`~~ (codegen will handle)
ghstack-source-id: 146927552
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D33510756
fbshipit-source-id: df385958b9561955b2e866dab2e4c050abd26766
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122
See code comments for details; in brief, we repurpose support
for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor`
output a borrowed IValue that we have to clean up manually.
If we have any other ops that always create a new reference to an
existing Tensor, we can easily apply this same optimization.
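A minimal usage sketch of the borrowing primitive being repurposed here (standard c10 API, not the `select_tensor` change itself):
```
#include <ATen/core/Tensor.h>
#include <c10/util/MaybeOwned.h>

c10::MaybeOwned<at::Tensor> borrow_tensor(const at::Tensor& t) {
  // A borrow refers to the existing Tensor without taking ownership (no
  // refcount bump); the caller must guarantee `t` outlives the borrow,
  // which is why the borrowed IValue needs manual cleanup in static runtime.
  return c10::MaybeOwned<at::Tensor>::borrowed(t);
}
```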
ghstack-source-id: 146482212
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
--do_profile output for local_ro (updated Dec 10):
```
swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt
First iter time: 0.989023 ms
Number of operators: 2037
Total number of managed tensors: 1597
Total number of managed output tensors: 0
Total number of unmanaged values: 2568
Number of unmanaged values requiring cleanup: 2568
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 50368 bytes
Total number of reused tensors: 1010
Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%)
swolchok@devbig032 /d/u/s/f/fbcode> ttail TMCC^C
swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt
First iter time: 0.994703 ms
Number of operators: 2551
Total number of managed tensors: 1146
Total number of managed output tensors: 0
Total number of unmanaged values: 4047
Number of unmanaged values requiring cleanup: 3533
Number of unmanaged values not requiring cleanup: 514
Total memory managed: 50048 bytes
Total number of reused tensors: 559
Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%)
```
for local: (also Dec 10):
```
==> Stable.local.profile.txt <==
First iter time: 9.0909 ms
Number of operators: 1766
Total number of managed tensors: 1894
Total number of managed output tensors: 0
Total number of unmanaged values: 2014
Number of unmanaged values requiring cleanup: 2014
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 4541440 bytes
Total number of reused tensors: 847
Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%)
==> TMCOFastAliasing.local.profile.txt <==
First iter time: 7.5512 ms
Number of operators: 2378
Total number of managed tensors: 1629
Total number of managed output tensors: 0
Total number of unmanaged values: 3503
Number of unmanaged values requiring cleanup: 2891
Number of unmanaged values not requiring cleanup: 612
Total memory managed: 3949312 bytes
Total number of reused tensors: 586
Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%)
```
Reviewed By: hlu1
Differential Revision: D32318674
fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223
ghstack-source-id: 146482215
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
Reviewed By: hlu1
Differential Revision: D31776259
fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69999
This adds support for the split_with_sizes operator in static runtime by adding native operators. These operators have less overhead compared to their JIT fallbacks (no dispatching, no stack construction at runtime).
split_with_sizes can be called directly from the C++ API, or through `torch.split` when `split_sizes` is a list. This diff adds support for both use cases.
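Both entry points boil down to the same ATen call; a small usage example (plain ATen, independent of static runtime):
```
#include <ATen/ATen.h>
#include <vector>

void split_example() {
  at::Tensor x = at::randn({10, 4});
  // Split the 10 rows into chunks of sizes 2, 3, and 5 along dim 0;
  // torch.split(x, [2, 3, 5]) in Python lowers to the same op.
  std::vector<at::Tensor> parts = at::split_with_sizes(x, {2, 3, 5}, /*dim=*/0);
}
```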
Test Plan:
- Added unit tests. Made sure the operators are used
- Benchmark
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/data/users/dxd/305797439_0.predictor.precompute.remote_request_only \
--method_name=user.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=500 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \
--input_type="recordio" --pt_inputs=/data/users/dxd/305797439_0_user.inputs.recordio \
--recordio_use_ivalue_format=1 --do_profile=1 --do_benchmark=1
```
#### Before
```
Static runtime ms per iter: 3.62073. Iters per second: 276.187
0.0471904 ms. 1.31501%. aten::split_with_sizes (5 nodes)
```
#### After
```
Static runtime ms per iter: 3.44374. Iters per second: 290.382
0.0432057 ms. 1.34276%. aten::split_with_sizes (5 nodes, native)
```
Reviewed By: swolchok
Differential Revision: D33141006
fbshipit-source-id: feae34c4c873fc22d48a8ff3bf4d71c0e00bb365
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69559
We have a lot of special cases. Document them so they're easy to learn about.
ghstack-source-id: 145226542
Test Plan: Spell check? :)
Reviewed By: d1jang
Differential Revision: D32929416
fbshipit-source-id: 2362410f25a27cdb74a4939903446192cef61978
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68160
This generalizes the mechanism D32318674 added for letting native ops borrow their outputs and uses it in dict_unpack.
ghstack-source-id: 143424919
Test Plan:
4.5% in CMF local_ro compared to D32318674 (previous two diffs were necessary steps but didn't get the full win yet):
```
FastAliasingInSelectTensor, local_ro
========================================
I1110 22:06:37.549811 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08488. Iters per second: 921.76
I1110 22:06:38.147949 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08675. Iters per second: 920.171
I1110 22:06:38.766340 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08626. Iters per second: 920.592
I1110 22:06:39.366608 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08376. Iters per second: 922.717
I1110 22:06:39.964979 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08362. Iters per second: 922.833
I1110 22:06:40.565248 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08423. Iters per second: 922.312
I1110 22:06:41.167326 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0945. Iters per second: 913.659
I1110 22:06:41.766187 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08373. Iters per second: 922.742
I1110 22:06:42.367816 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08995. Iters per second: 917.475
I1110 22:06:42.968391 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08854. Iters per second: 918.665
I1110 22:06:42.968446 119627 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.08662, standard deviation: 0.00351662
BorrowDictUnpackOutputs, local_ro
========================================
I1110 22:05:23.245435 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03272. Iters per second: 968.313
I1110 22:05:23.822196 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.06478. Iters per second: 939.163
I1110 22:05:24.395256 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.035. Iters per second: 966.186
I1110 22:05:24.964169 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.02786. Iters per second: 972.898
I1110 22:05:25.536558 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03205. Iters per second: 968.946
I1110 22:05:26.109027 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04256. Iters per second: 959.174
I1110 22:05:26.679611 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03245. Iters per second: 968.567
I1110 22:05:27.253048 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04493. Iters per second: 957.005
I1110 22:05:27.822629 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0299. Iters per second: 970.971
I1110 22:05:28.393326 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03039. Iters per second: 970.509
I1110 22:05:28.393368 113949 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.03726, standard deviation: 0.0111053
```
0.04936 (4.5%) usec/iter improvement
Reviewed By: hlu1
Differential Revision: D32347390
fbshipit-source-id: e636ddafacf30ed2a2d84a6e15fff97481342fdb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68159
These all look like they'll cause unnecessary refcount bumps to me.
ghstack-source-id: 143424917
Test Plan:
CI
TODO profile local_ro?
Reviewed By: hlu1
Differential Revision: D32347392
fbshipit-source-id: d8ed91b5855b86765db00c61ad3650273302c7b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934
This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode.
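A sketch of the layout this describes, with illustrative field names (not the actual members):
```
#include <ATen/core/ivalue.h>
#include <cstdint>

struct ProcessedNodeOutputsSketch {
  // Outputs live contiguously in one array shared by all nodes, so each
  // node only needs a 16-bit offset and a 16-bit count (at most 2**16 - 1
  // values overall) instead of its own vector of IValues.
  uint16_t outputs_offset;
  uint16_t num_outputs;

  c10::IValue& output(c10::IValue* shared_values, uint16_t i) const {
    return shared_values[outputs_offset + i];
  }
};
```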
ghstack-source-id: 143429113
Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.
Previous diff, CMF local:
```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```
This diff, CMF local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
72912 bytes (17%) savings
Perf looks neutral; see next diff (D32216573) test plan for details.
Reviewed By: hlu1
Differential Revision: D32190751
fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68135
Update the schema to reflect the changes in D31935573 (6b44e75f6b).
Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Confirmed native implementation is used.
Reviewed By: hlu1
Differential Revision: D32326865
fbshipit-source-id: 7f607f57ceb6690a2782d94d9ee736ba64e7d242
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67476
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31994040
fbshipit-source-id: 9de57d8d7925ee46544478eae8229952ca5f248a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65504
We should be able to borrow a Tuple from an IValue without incurring refcount bumps.
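A usage sketch of such a borrow via IValue::toTupleRef(), which hands back a reference to the underlying tuple rather than copying the intrusive_ptr (shown as an API example, not necessarily this diff's exact change):
```
#include <ATen/core/ivalue.h>

size_t tuple_size_borrowed(const c10::IValue& iv) {
  // No refcount bump: we only borrow a reference to the existing tuple.
  const auto& tup = iv.toTupleRef();
  return tup.elements().size();
}
```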
ghstack-source-id: 142065833
Test Plan:
Added test coverage.
Profiled static runtime on the local_ro net for ctr_mobile_feed. Inclusive time spent in VarTupleUnpack decreased about 0.3%, which roughly matches with the 0.36% of runtime that was previously spent in IValue::toTuple().
Reviewed By: hlu1
Differential Revision: D31130570
fbshipit-source-id: afa14f46445539e449068fd908d547b8da7f402c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67441
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31992093
fbshipit-source-id: 88191c13d229ffeac4e5b17b78e25f51d3f7f23e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67346
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31965159
fbshipit-source-id: 86a69c395f401c4a4c55daa4c5fe80764383c8e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67341
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like `TupleUnpack`). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31962589
fbshipit-source-id: 3107fb169c1b02fb2bafbb355c005669b5fa8435