Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71986
This addresses concerns over the space increase from control flow.
`op_name_` was only stored as a minor optimization to avoid a name lookup during logging, so we can safely get rid of it. Thanks to the sampling mechanism, `get_op_name()` is called very infrequently, so this shouldn't cause much of a regression.
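A tiny illustration of the trade-off, with hypothetical names rather than the actual SR classes: the name is recovered from the underlying node on demand instead of being cached in a member.
```
#include <string>

// Stand-in for torch::jit::Node.
struct NodeSketch {
  std::string qualified_name;
};

class ProcessedNodeSketch {
 public:
  explicit ProcessedNodeSketch(const NodeSketch* node) : node_(node) {}

  // No cached op_name_ member; the lookup only happens when a sampled
  // record-function actually fires, so it sits on a cold path.
  const std::string& get_op_name() const {
    return node_->qualified_name;
  }

 private:
  const NodeSketch* node_;
};
```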
ghstack-source-id: 148086244
Test Plan: CI
Reviewed By: d1jang
Differential Revision: D33821005
fbshipit-source-id: 6f74eb30a54a046ca90768aebbcde22e8c435f35
(cherry picked from commit 361ba32e97dbd130938bae10b5159730822c518c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834
* Modify the `StaticModule` constructor to handle index initialization for sub-blocks.
* Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes!
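A minimal sketch of that ownership split, with illustrative names rather than the actual SR classes: the runtime owns the flat `values_` array, and each block runner only borrows a pointer into it, which is what lets nested sub-blocks resolve values defined in enclosing scopes.
```
#include <cstddef>
#include <vector>

// Stand-in for c10::IValue.
struct Value {};

// Borrows the runtime's values array; owns nothing. Because every runner
// (including runners for nested sub-blocks) indexes into the same array,
// a sub-block can look up values defined in an enclosing scope.
class BlockRunnerSketch {
 public:
  explicit BlockRunnerSketch(Value* values) : values_(values) {}
  Value& value(std::size_t i) { return values_[i]; }

 private:
  Value* values_;  // unowned; owned by the runtime below
};

class RuntimeSketch {
 public:
  explicit RuntimeSketch(std::size_t num_values) : values_(num_values) {
    // One runner per block; all of them share values_.data().
    block_runners_.emplace_back(values_.data());
  }

 private:
  std::vector<Value> values_;                    // the single owner
  std::vector<BlockRunnerSketch> block_runners_;
};
```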
ghstack-source-id: 148086245
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33028039
fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6
(cherry picked from commit 3a9feffd929869120c717d35aa55aad8a382783d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70944
Added special net-level/op-level scopes for static runtime. We can use these to add special behavior in record functions when they are invoked from a static runtime context.
Reviewed By: navahgar
Differential Revision: D33458211
fbshipit-source-id: 0b7022100e9f5ac872f4cb5bfba14e92af2c71b0
(cherry picked from commit b486548544c5e822803071756c85e675e37d2dad)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71705
This fixes a crash in `resetMemory` caused by trying to access a `TensorImpl` via a borrowed `IValue` after it had already been destroyed. We need to clean up all borrows *before* we destroy the owning `IValue`, not after.
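A minimal sketch of the required ordering, with hypothetical names: all borrows are released first, and only then are the owning values destroyed.
```
#include <vector>

struct IValueSketch {
  bool is_borrow = false;
  void destroy_borrow() { /* release the borrowed reference */ }
  void reset() { /* destroy the owned TensorImpl */ }
};

// Correct order: release every borrow while its owner is still alive, then
// destroy the owners. The reverse order leaves borrows pointing at freed
// TensorImpls, which is the crash described above.
void reset_memory_sketch(std::vector<IValueSketch>& values) {
  for (auto& v : values) {
    if (v.is_borrow) {
      v.destroy_borrow();
    }
  }
  for (auto& v : values) {
    if (!v.is_borrow) {
      v.reset();
    }
  }
}
```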
ghstack-source-id: 147688982
Test Plan:
New unit test covers this case
ICE w/ inline_cvr v0 [finishes successfully](https://www.internalfb.com/intern/unidash/dashboard/ads_infra_cost_estimation/a_metrics/?e[select_ESTIMATION_RUN_ID]=ICE_mikeiovine_16431103211c65), didn't see any nnpi errors
Reviewed By: ajyu
Differential Revision: D33725435
fbshipit-source-id: f8dd109382b5cf54df6f194f8dcb5c0812b174bb
(cherry picked from commit 31339d9d38e63248d2ac3646be71008ed731f16c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501
This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it.
ghstack-source-id: 147385569
Test Plan: CI, rerun broken test from task
Reviewed By: mikeiovine
Differential Revision: D33669290
fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a
(cherry picked from commit 5e4fefb109b6c92d59fc7e24d69f1b6b2780c776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69853
We can implement this overload more efficiently.
ghstack-source-id: 146924693
Test Plan:
patched alias_analysis tests
The time reported by static runtime to initialize a predictor for the ctr_mobile_feed local_ro net is 9.5s instead of 10.5s.
Reviewed By: mikeiovine
Differential Revision: D33039731
fbshipit-source-id: 52559d678e9eb00e335b9e0db304e7a5840ea397
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122
See code comments for details; in brief, we repurpose support
for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor`
output a borrowed IValue that we have to clean up manually.
If we have any other ops that always create a new reference to an
existing Tensor, we can easily apply this same optimization.
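A hedged sketch of the borrowing idea using the public `c10::MaybeOwned` API; the real change also tags the resulting IValue as borrowed and cleans it up manually at the end of the iteration, which is elided here.
```
#include <ATen/core/Tensor.h>
#include <c10/util/MaybeOwned.h>

// A select_tensor-style op only picks one of its existing inputs, so the
// "output" can borrow the chosen tensor instead of bumping its refcount.
c10::MaybeOwned<at::Tensor> select_tensor_sketch(
    const at::Tensor& a,
    const at::Tensor& b,
    bool use_b) {
  // borrowed() does not take ownership: the caller must guarantee the
  // borrowed tensor outlives the returned MaybeOwned, which SR does by
  // cleaning up the borrow manually at the end of the iteration.
  return use_b ? c10::MaybeOwned<at::Tensor>::borrowed(b)
               : c10::MaybeOwned<at::Tensor>::borrowed(a);
}
```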
ghstack-source-id: 146482212
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
--do_profile output for local_ro (updated Dec 10):
```
swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt
First iter time: 0.989023 ms
Number of operators: 2037
Total number of managed tensors: 1597
Total number of managed output tensors: 0
Total number of unmanaged values: 2568
Number of unmanaged values requiring cleanup: 2568
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 50368 bytes
Total number of reused tensors: 1010
Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%)
swolchok@devbig032 /d/u/s/f/fbcode> ttail TMCC^C
swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt
First iter time: 0.994703 ms
Number of operators: 2551
Total number of managed tensors: 1146
Total number of managed output tensors: 0
Total number of unmanaged values: 4047
Number of unmanaged values requiring cleanup: 3533
Number of unmanaged values not requiring cleanup: 514
Total memory managed: 50048 bytes
Total number of reused tensors: 559
Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%)
```
for local: (also Dec 10):
```
==> Stable.local.profile.txt <==
First iter time: 9.0909 ms
Number of operators: 1766
Total number of managed tensors: 1894
Total number of managed output tensors: 0
Total number of unmanaged values: 2014
Number of unmanaged values requiring cleanup: 2014
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 4541440 bytes
Total number of reused tensors: 847
Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%)
==> TMCOFastAliasing.local.profile.txt <==
First iter time: 7.5512 ms
Number of operators: 2378
Total number of managed tensors: 1629
Total number of managed output tensors: 0
Total number of unmanaged values: 3503
Number of unmanaged values requiring cleanup: 2891
Number of unmanaged values not requiring cleanup: 612
Total memory managed: 3949312 bytes
Total number of reused tensors: 586
Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%)
```
Reviewed By: hlu1
Differential Revision: D32318674
fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223
ghstack-source-id: 146482215
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
Reviewed By: hlu1
Differential Revision: D31776259
fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67221
Update memory leak checks to not require that output tensors are cleaned up.
ghstack-source-id: 146464297
Test Plan: Tests should still pass; reviewers to confirm that this is OK in principle
Reviewed By: d1jang
Differential Revision: D31847567
fbshipit-source-id: bb7ff2f2ed701e2d7de07d8032a1281fccabd6a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D33336948
Pulled By: albanD
fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69774
We recently ran into a nasty bug caused by incorrect schema annotations on an `aten::split` overload. `verify_and_correct_memory_overlap` is supposed to prevent crashes in this scenario, but it didn't because it did not handle `Tensor[]` outputs.
This change extends the memory correction mechanism to handle tensor lists.
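A rough sketch of what the correction does for list outputs; `is_alias_of` stands in for SR's actual schema/alias bookkeeping, and the function names are illustrative.
```
#include <ATen/ATen.h>
#include <vector>

// If an op with a bad schema returns a Tensor[] whose elements alias an
// input, clone the offending elements so later reuse of the input's memory
// cannot corrupt the outputs.
void correct_tensor_list_overlap_sketch(
    std::vector<at::Tensor>& outputs,
    const std::vector<at::Tensor>& inputs) {
  for (auto& out : outputs) {
    for (const auto& in : inputs) {
      // Stand-in check; the real mechanism consults schema/alias info.
      if (out.is_alias_of(in)) {
        out = out.clone();
        break;
      }
    }
  }
}
```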
ghstack-source-id: 146152478
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: hlu1
Differential Revision: D33022494
fbshipit-source-id: 8d1d41ca1d4fd5dfb7c8a66028c391ba63551eb0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69475
This diff adds TensorExpr fusion with dynamic shapes in SR. This includes tracing the input graph with sample inputs, and then performing fusion with generalization to get fused graphs with dynamic shapes.
ghstack-source-id: 146059043
Test Plan:
```
buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Reviewed By: d1jang
Differential Revision: D32320088
fbshipit-source-id: 397f498878ddfcee9dad7a839652f79f034fefe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69473
This diff refactors StaticModule and its uses to pass in sample inputs. These inputs need to be passed into the constructor because they are needed to perform TensorExpr fusion before other optimizations are performed on the input graph.
ghstack-source-id: 146059041
Test Plan: buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test
Reviewed By: donaldong
Differential Revision: D32320084
fbshipit-source-id: b8bd46d442be4cc90ca60f521e0416fdb88eea60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69406
Most files that include `interned_strings.h` don't actually depend on
anything generated from `FORALL_NS_SYMBOLS`, yet because everything lives
in a single file, they need to be recompiled whenever a new symbol is
added. Here I move the class definition into a separate file so this
doesn't happen.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D32923637
Pulled By: albanD
fbshipit-source-id: 6e488cbfcfe2c041a99d9ff22e167dbddf3f46d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: jbschlosser, malfet
Differential Revision: D32596264
Pulled By: albanD
fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69087
This diff includes a variety of improvements to `set_inputs` to unify behavior with `torch::jit::Module`:
1. Eliminate code duplication between rvalue/lvalue overloads
2. Add type checks
3. Make the input length check a `TORCH_CHECK` instead of a debug check - we have to fail when the wrong number of inputs is passed (a minimal check is sketched after this list).
4. `schema` now always includes `self`, even if we release `module_`. This is consistent with `torch::jit::Module`.
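A minimal sketch of the hard input validation, assuming a flat list of `IValue` arguments; the real `set_inputs` also checks argument types against the schema, which is elided here.
```
#include <ATen/core/ivalue.h>
#include <c10/util/Exception.h>
#include <cstddef>
#include <vector>

// The length check must be a hard TORCH_CHECK (not a debug-only assert):
// running with the wrong number of inputs is never recoverable.
void check_inputs_sketch(
    const std::vector<c10::IValue>& args,
    std::size_t expected_num_inputs) {
  TORCH_CHECK(
      args.size() == expected_num_inputs,
      "Expected ", expected_num_inputs, " inputs, but got ", args.size());
  for (const auto& arg : args) {
    // Illustrative type check; the real code validates against schema types.
    TORCH_CHECK(!arg.isNone(), "Received an undefined (None) input");
  }
}
```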
ghstack-source-id: 145599837
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32711705
fbshipit-source-id: fe97c10b4f03801ba59868b452e7d02b26b3106b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69755
Per swolchok's suggestion on D32609915 (1c43b1602c): hide the value offset indices behind accessors to provide more flexibility if we ever decide to change the layout of the values array.
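An illustrative accessor layer (field names and the assumed layout are hypothetical); the point is simply that callers never compute raw offsets into the values array themselves, so the layout can change without touching call sites.
```
#include <cstdint>

class ValueLayoutSketch {
 public:
  ValueLayoutSketch(uint32_t num_inputs, uint32_t num_constants)
      : num_inputs_(num_inputs), num_constants_(num_constants) {}

  // Assumed layout: inputs first, then constants, then intermediates.
  uint32_t input_offset(uint32_t i) const { return i; }
  uint32_t constant_offset(uint32_t i) const { return num_inputs_ + i; }
  uint32_t intermediate_offset(uint32_t i) const {
    return num_inputs_ + num_constants_ + i;
  }

 private:
  uint32_t num_inputs_;
  uint32_t num_constants_;
};
```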
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32838145
fbshipit-source-id: cf805c077672de4c2fded9b41da01eca6d84b388
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69809
SR options are only printed out once per model per net. Logging them is actually pretty helpful for debugging.
Test Plan: CI
Reviewed By: donaldong
Differential Revision: D33046814
fbshipit-source-id: 536b34e00fbc8a273c5eb4d8ae5caca0dc1f4c24
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595
This change encapsulates the `function` object in `ProcessedFunction` objects instead of exposing it unnecessarily just to execute it.
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D32908341
fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68795
This change improves static runtime exception safety. Added a scope exit guard that invokes `MemoryPlanner::deallocate` in its destructor.
Caveat: we have to be really careful with the exception behavior of `MemoryPlanner::deallocate` and `MemoryPlanner`'s constructor, because they're now both potentially called in the destructor of the scope exit guard. Letting exceptions potentially escape destructors is playing with fire since 1) the destructor of `Deallocator` is (implicitly) `noexcept`, 2) even if it wasn't, `std::terminate` will be called if an exception escapes and the stack is already unwinding. To get around this, we wrap the deallocation stuff in a try/catch. If deallocation throws, then we simply reset all of the memory planner stuff and carry on.
There's a catch: the code path that we take when handling the deallocation exception can't throw. However, this code path is much simpler than memory planner construction/deallocation, so it's much easier to manually audit the correctness here.
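A minimal sketch of such a scope-exit guard under the constraints described above; the real `Deallocator` is SR-internal, and its fallback path resets the memory planner state. The callbacks here are placeholders.
```
#include <functional>
#include <utility>

class DeallocatorSketch {
 public:
  DeallocatorSketch(std::function<void()> deallocate,
                    std::function<void()> reset_on_failure)
      : deallocate_(std::move(deallocate)),
        reset_on_failure_(std::move(reset_on_failure)) {}

  // The destructor is implicitly noexcept, so the cleanup call is wrapped in
  // try/catch: if deallocation throws, fall back to a simpler recovery path
  // instead of letting the exception escape (which would call std::terminate).
  ~DeallocatorSketch() {
    try {
      deallocate_();  // may throw (e.g. MemoryPlanner::deallocate)
    } catch (...) {
      // Must not throw: drop the planner state and carry on.
      reset_on_failure_();
    }
  }

 private:
  std::function<void()> deallocate_;
  std::function<void()> reset_on_failure_;
};
```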
Test Plan:
**New unit tests**
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32609915
fbshipit-source-id: 71fbe6994fd573ca6b7dd859b2e6fbd7eeabcd9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69489
This change avoids pulling out `Node*` out of `ProcessedNode*` to evaluate expressions related to `Node*` at op execution time.
A perf gain is expected but not measurable; the main purpose of this change is to make SR's code more self-contained (calling more code from SR rather than JIT) at execution time.
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D32893265
fbshipit-source-id: f0f397666b3556f985d45112af8fe0b08de22139
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302
Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis.
Test Plan:
## **Re-use metrics**
`inline_cvr` (294738512_58)
**Before**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4601984 bytes
Total number of reused tensors: 1183
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29696 bytes
Total number of reused tensors: 959
```
**After**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4520000 bytes
Total number of reused tensors: 1198
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29120 bytes
Total number of reused tensors: 963
```
Reviewed By: hlu1
Differential Revision: D32370424
fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69274
`impl.h` is the main header file that defines the interface of Static Runtime to its clients.
However, it is currently filled with implementation details that should not be leaked to our clients: 1) this unnecessarily exposes our internals, which makes them hard to change later, and 2) it causes unnecessary merge conflicts when multiple people are touching this enormous impl.cpp file.
To alleviate the situation, this change moves the implementation details from impl.h into a new file, internal.h, which is kept internal so the details are not leaked to our clients.
This change will be followed by another change to rename `impl.h` to `runtime.h` or something better, since `impl.h` is currently not about implementation but about SR's interface.
Note that this change is NOT complete since the remaining declarations in impl.h still contain a lot of implementation details. Therefore, we should keep working on minimizing the interface to prevent our API from being bloated unnecessarily. Also we need to work on modularizing our implementations into separate pieces organized by separate files in the near future.
Test Plan: Existing unittests
Reviewed By: donaldong
Differential Revision: D32780415
fbshipit-source-id: 119b7aedbf563b195641c5674572a9348732145f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68884
This diff uses std::vector::reserve in GetLivenessMap to set the capacity of all local containers up front and avoid runtime resizing.
The changes should theoretically improve performance a little.
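For illustration, the pattern is just the standard reserve-then-fill idiom; the container and sizes here are made up, not the actual GetLivenessMap code.
```
#include <cstddef>
#include <vector>

void build_liveness_sketch(std::size_t num_values) {
  std::vector<int> live_values;
  // One allocation up front instead of repeated regrowth while filling.
  live_values.reserve(num_values);
  for (std::size_t i = 0; i < num_values; ++i) {
    live_values.push_back(static_cast<int>(i));
  }
}
```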
Test Plan:
- [x] `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1`
- [x]
```
seq 1 10 | xargs -I{} ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/data/users/dxd/302008423_0.predictor.disagg.local \
--method_name=local_request_only.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=0 --warmup_iters=0 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \
--input_type="recordio" --pt_inputs=/data/users/dxd/302008423_0.local_ro.inputs.recordio \
--recordio_use_ivalue_format=1
```
### Before
```
I1201 12:04:46.753311 2874563 PyTorchPredictorBenchLib.cpp:336] Took 10.9826 sec to initialize a predictor.
I1201 12:05:00.617139 2875780 PyTorchPredictorBenchLib.cpp:336] Took 11.1078 sec to initialize a predictor.
I1201 12:05:15.279667 2876813 PyTorchPredictorBenchLib.cpp:336] Took 11.7979 sec to initialize a predictor.
I1201 12:05:30.201207 2877554 PyTorchPredictorBenchLib.cpp:336] Took 11.8901 sec to initialize a predictor.
I1201 12:05:44.386926 2879713 PyTorchPredictorBenchLib.cpp:336] Took 11.2722 sec to initialize a predictor.
I1201 12:05:58.003582 2881426 PyTorchPredictorBenchLib.cpp:336] Took 10.8046 sec to initialize a predictor.
I1201 12:06:12.004778 2882604 PyTorchPredictorBenchLib.cpp:336] Took 11.2754 sec to initialize a predictor.
I1201 12:06:26.101241 2884888 PyTorchPredictorBenchLib.cpp:336] Took 11.3355 sec to initialize a predictor.
I1201 12:06:40.364817 2886572 PyTorchPredictorBenchLib.cpp:336] Took 11.401 sec to initialize a predictor.
I1201 12:06:54.483794 2888614 PyTorchPredictorBenchLib.cpp:336] Took 11.3498 sec to initialize a predictor.
```
### After
```
I1201 11:51:53.775239 2818391 PyTorchPredictorBenchLib.cpp:336] Took 10.9113 sec to initialize a predictor.
I1201 11:52:07.412720 2819530 PyTorchPredictorBenchLib.cpp:336] Took 10.8413 sec to initialize a predictor.
I1201 11:52:21.202816 2820359 PyTorchPredictorBenchLib.cpp:336] Took 11.0216 sec to initialize a predictor.
I1201 11:52:35.513288 2821029 PyTorchPredictorBenchLib.cpp:336] Took 11.4216 sec to initialize a predictor.
I1201 11:52:49.145979 2821930 PyTorchPredictorBenchLib.cpp:336] Took 10.8272 sec to initialize a predictor.
I1201 11:53:02.908790 2822859 PyTorchPredictorBenchLib.cpp:336] Took 11.0262 sec to initialize a predictor.
I1201 11:53:16.276015 2823657 PyTorchPredictorBenchLib.cpp:336] Took 10.6893 sec to initialize a predictor.
I1201 11:53:30.103283 2824382 PyTorchPredictorBenchLib.cpp:336] Took 11.1854 sec to initialize a predictor.
I1201 11:53:44.298514 2825365 PyTorchPredictorBenchLib.cpp:336] Took 11.4796 sec to initialize a predictor.
I1201 11:53:58.258708 2826128 PyTorchPredictorBenchLib.cpp:336] Took 11.2652 sec to initialize a predictor.
```
Reviewed By: swolchok
Differential Revision: D32649252
fbshipit-source-id: 5cd296d12b12e5b15e85e4f1a8a236e293f37f9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69164
We have lots of methods that take `std::unordered_map<std::string, c10::IValue>` now. That's kind of ugly and cumbersome to type, so add a `KWargs` typedef.
Also made the `operator()` default `kwargs` to empty. Note that we could have another overload that doesn't take `kwargs` at all, but the perf gain is so minuscule it's probably not worth it.
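Roughly what the typedef and the defaulted parameter look like; the surrounding class and the actual binding logic are elided, and `run_sketch` is a stand-in name.
```
#include <ATen/core/ivalue.h>
#include <string>
#include <unordered_map>
#include <vector>

using KWargs = std::unordered_map<std::string, c10::IValue>;

// Callers that have no keyword arguments can simply omit the parameter.
c10::IValue run_sketch(
    const std::vector<c10::IValue>& args,
    const KWargs& kwargs = KWargs()) {
  // ... bind args and kwargs to graph inputs, then execute ...
  (void)args;
  (void)kwargs;
  return c10::IValue();
}
```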
ghstack-source-id: 144691899
Test Plan: CI
Reviewed By: d1jang
Differential Revision: D32734677
fbshipit-source-id: 8d6496a6d1ec2dc71253151d2f6408f1387966cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68295
There's no reason we can't figure out what tensors we need to manage at model load time. It's also useful to have the set of ranges available at load time for integrating the ranges algorithm introduced in the previous diff.
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: hlu1
Differential Revision: D32400593
fbshipit-source-id: 0466b2641166ddc9c14f72774f4ba151407be400
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67220
Specifically we log AliasDb and same_storage_values, and are chattier about the aliasing logs in the liveness analysis.
ghstack-source-id: 144507289
Test Plan: Used to help develop D31776259
Reviewed By: hlu1
Differential Revision: D31847561
fbshipit-source-id: 8371455d060c17dace91cd90e4034b7618f820a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639
Fix all problems related to `ProcessedNode::verify_no_memory_overlap()`
- Only enable this check for native and fallback ops that are not inplace or view ops
- Enable `ProcessedNode::verify_no_memory_overlap()` in debug mode and enforce it
- Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas
`fb::expand_dims`'s schema turned out to be incorrect once this check was re-enabled; it was fixed in D32556204 (39ab417107).
Reviewed By: mikeiovine
Differential Revision: D32553708
fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68834
This diff uses std::vector::reserve for constructing constants in StaticModule. We can also avoid two extra iterations over all the graph nodes.
This diff should technically improve its performance by a tiny bit.
Test Plan: - [x] buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1
Reviewed By: mikeiovine
Differential Revision: D32628806
fbshipit-source-id: 99dd2a7a36e86899ca1fe5300f3aa90d30a43726
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284
Add a new class `ManagedTensorRanges` that determines when managed tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`.
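An interface-level sketch matching the description above; how the ranges are computed (last uses traced through aliases) is elided, and the member layout is hypothetical.
```
#include <torch/csrc/jit/ir/ir.h>
#include <unordered_map>
#include <vector>

class ManagedTensorRangesSketch {
 public:
  // Managed values whose last use (direct or via any alias) is at or before
  // `node`, so their backing storage can be reused by later nodes.
  std::vector<torch::jit::Value*> availableTensors(
      torch::jit::Node* node) const {
    auto it = available_after_.find(node);
    return it == available_after_.end()
        ? std::vector<torch::jit::Value*>{}
        : it->second;
  }

 private:
  std::unordered_map<torch::jit::Node*, std::vector<torch::jit::Value*>>
      available_after_;
};
```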
Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: swolchok
Differential Revision: D32397207
fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67075
Sharing storage when values `mayAlias` is incorrect, as the old comment notes; sharing when they `mustAlias` would be nice but, as the new comment notes, would not matter.
ghstack-source-id: 143749553
Test Plan: CI
Reviewed By: hlu1
Differential Revision: D31851893
fbshipit-source-id: 5bdc8de984d5919332c9010e8b0160211d96bc2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368
Currently, each instance of `StaticRuntime` has its own copy of a `std::function` object wrapped in a `ProcessedNode::Function` object in order to invoke the actual op implementation.
However, all instances of `StaticRuntime` derived from the same `StaticModule` invoke exactly the same op implementations, so this duplication is avoidable.
This change adds a `StaticModule::functions_` member variable to keep a list of unique `ProcessedFunction` objects. A newly constructed `StaticRuntime` takes pointers to these `ProcessedFunction` objects instead of whole function objects. This can save a substantial amount of memory per `StaticRuntime` instance.
This comes with a small cost at execution time: now that a `ProcessedNode` instance keeps a pointer to the function object, executing a node involves an extra pointer dereference. However, this cost was shown to be negligible in local performance tests.
Thanks to hlu1 for proposing this non-intrusive improvement idea :D
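A structural sketch of the ownership change, using stand-in types rather than the real `ProcessedFunction`/`ProcessedNode`: the module owns each function object once, and every runtime's nodes just hold pointers into that list.
```
#include <functional>
#include <vector>

struct ProcessedFunctionSketch {
  std::function<void()> f;  // the op implementation
};

struct ProcessedNodeSketch {
  const ProcessedFunctionSketch* fn;  // unowned; one extra dereference to call
  void run() const { fn->f(); }
};

struct StaticModuleSketch {
  // Single owner of the function objects, shared by every runtime cloned
  // from this module.
  std::vector<ProcessedFunctionSketch> functions_;
};

struct StaticRuntimeSketch {
  explicit StaticRuntimeSketch(const StaticModuleSketch& m) {
    nodes_.reserve(m.functions_.size());
    for (const auto& fn : m.functions_) {
      nodes_.push_back(ProcessedNodeSketch{&fn});
    }
  }
  std::vector<ProcessedNodeSketch> nodes_;
};
```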
Test Plan:
This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) for CMF/local and by 8% for CMF/local_ro (measured by patching D32181666 to print the memory turnover from instantiating a StaticRuntime instance). No noticeable latency regression was observed.
==AFTER
* CMF/local
memory turnover: 393608
latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087
* CMF/local_ro
memory turnover: 387288
latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101
==BEFORE
* CMF/local
memory turnover: 459888
latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18
* CMF/local_ro
memory turnover: 420832
latency: PyTorch run finished. Milliseconds per iter: 7.43756. Iters per second: 134.453
==Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr:
==AFTER
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
==BEFORE
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
Reviewed By: swolchok
Differential Revision: D32337548
fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68477
We're printing a lot of unnecessary logs in prod. Change these from LOG(INFO) to VLOG(1) so you can easily flip them back for testing.
Test Plan: CI
Reviewed By: ajyu, d1jang
Differential Revision: D32439776
fbshipit-source-id: 40fa57f4eeb6ca0b610008062cc94aed62fb6981
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67702
This isn't a particularly large optimization and it does
nothing before select_tensor is introduced (I'm surprised that no
operators have optimizable outputs!), but it seems like we should probably get the savings.
ghstack-source-id: 143424918
Test Plan: CI; checked `--do_profile=1` output with the following diff and we save tracking hundreds of values, as expected.
Reviewed By: hlu1
Differential Revision: D32112522
fbshipit-source-id: 1804b77992a73670bfc1e36af608b852b8261bd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67935
Rationale should be documented in code comments. In short, we
can avoid heap-allocating arrays of input indexes for operators with 5
or fewer inputs, at the cost of a tag bit check on access.
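An illustrative small-buffer layout for the input indices (the exact packing and field names in the real `ProcessedNode` differ): up to five 16-bit indices live inline, larger nodes fall back to a heap array, and a tag distinguishes the two representations on access.
```
#include <cstdint>
#include <cstring>

class InputIndicesSketch {
 private:
  static constexpr uint16_t kInlineCapacity = 5;
  union Repr {
    uint16_t inline_indices[kInlineCapacity];  // no allocation for <= 5 inputs
    uint16_t* heap_indices;                    // fallback for larger nodes
  } repr_;
  uint16_t size_;
  uint8_t inline_tag_;

 public:
  InputIndicesSketch(const uint16_t* indices, uint16_t n) : size_(n) {
    if (n <= kInlineCapacity) {
      inline_tag_ = 1;
      std::memcpy(repr_.inline_indices, indices, n * sizeof(uint16_t));
    } else {
      inline_tag_ = 0;
      repr_.heap_indices = new uint16_t[n];
      std::memcpy(repr_.heap_indices, indices, n * sizeof(uint16_t));
    }
  }

  ~InputIndicesSketch() {
    if (!inline_tag_) {
      delete[] repr_.heap_indices;
    }
  }

  InputIndicesSketch(const InputIndicesSketch&) = delete;
  InputIndicesSketch& operator=(const InputIndicesSketch&) = delete;

  uint16_t size() const { return size_; }

  // The tag check is the per-access cost paid for avoiding the allocation.
  uint16_t operator[](uint16_t i) const {
    return inline_tag_ ? repr_.inline_indices[i] : repr_.heap_indices[i];
  }
};
```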
ghstack-source-id: 143429112
Test Plan:
Patched d1jang's D32181666, which prints static runtime memory usage.
Previous diff, local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
This diff, local:
```
I1105 12:48:35.820663 1066520 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 338064
```
4.5% savings (16144 bytes)
Ran 10 repetitions of CMF local_ro with core pinning: P467095603. This diff is perf neutral compared to the previous diff.
Reviewed By: hlu1
Differential Revision: D32216573
fbshipit-source-id: d18483db255f75f1d90e610ecded7727c6ffe65c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934
This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode.
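Roughly the layout being described, with stand-in types: outputs for all nodes live in one shared array owned by the runtime, and each node records only a 2-byte offset and a 2-byte count into it.
```
#include <cstdint>
#include <vector>

// Stand-in for c10::IValue.
struct ValueSketch {};

// With 16-bit offsets, at most 2**16 - 1 values are addressable, which is
// well above what current models need per the summary above.
struct ProcessedNodeOutputsSketch {
  uint16_t outputs_offset;
  uint16_t num_outputs;

  ValueSketch& output(std::vector<ValueSketch>& shared_outputs,
                      uint16_t i) const {
    return shared_outputs[outputs_offset + i];
  }
};
```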
ghstack-source-id: 143429113
Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.
Previous diff, CMF local:
```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```
This diff, CMF local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
72912 bytes (17%) savings
Perf looks neutral; see next diff (D32216573) test plan for details.
Reviewed By: hlu1
Differential Revision: D32190751
fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67860
We don't need 8-byte sizes for inputs and outputs, and we only need op names when profiling is enabled.
ghstack-source-id: 143429111
Test Plan:
Ran CMF local & local_ro with recordio inputs. I'm calling
the result inconclusive/neutral because I saw some noise (as you'll
see below), but that's fine with me since this is a clear memory win.
```
Nov4Stable, local_ro
========================================
I1104 09:53:08.875444 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.19925. Iters per second: 833.851
I1104 09:53:10.200443 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.1996. Iters per second: 833.608
I1104 09:53:11.524045 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.19746. Iters per second: 835.103
I1104 09:53:12.851861 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.20479. Iters per second: 830.019
I1104 09:53:14.183387 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.20487. Iters per second: 829.964
I1104 09:53:14.183427 505783 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.2012, standard deviation: 0.00341762
re-ran stable in light of baffling regression (see next entry), and sure enough we still have some significant run-to-run-variation:
I1104 09:56:15.244969 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24956. Iters per second: 800.28
I1104 09:56:16.621292 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24776. Iters per second: 801.437
I1104 09:56:18.018808 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.25247. Iters per second: 798.42
I1104 09:56:19.399660 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.25054. Iters per second: 799.656
I1104 09:56:20.781828 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.25052. Iters per second: 799.664
I1104 09:56:20.781878 524012 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.25017, standard deviation: 0.00171396
Nov4SaveTwoWordsInProcessedNode, local_ro
========================================
I1104 09:53:42.070139 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.2411. Iters per second: 805.736
I1104 09:53:43.438390 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24102. Iters per second: 805.788
I1104 09:53:44.773303 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.20682. Iters per second: 828.621
I1104 09:53:46.110538 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.21216. Iters per second: 824.973
I1104 09:53:47.448279 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.21265. Iters per second: 824.639
I1104 09:53:47.448334 508309 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22275, standard deviation: 0.0168698
early runs look like a glitch, rerunning
I1104 09:54:20.999117 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24558. Iters per second: 802.841
I1104 09:54:22.376780 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24436. Iters per second: 803.623
I1104 09:54:23.738584 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23176. Iters per second: 811.845
I1104 09:54:25.113063 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24938. Iters per second: 800.395
I1104 09:54:26.476349 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23552. Iters per second: 809.377
I1104 09:54:26.476395 511022 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.24132, standard deviation: 0.00737197
Nov4Stable, local
========================================
I1104 09:57:56.854537 533814 PyTorchPredictorBenchLib.cpp:346] memory turnover after getPredictor: 177885632
I1104 09:58:02.829813 533814 PrepareModelInputs.cpp:190] Loaded 696 records.
I1104 09:58:03.010681 533814 PyTorchPredictorBenchLib.cpp:353] memory turnover before benchmarking: 4590507056
I1104 09:58:03.010710 533814 PyTorchPredictorBenchLib.cpp:154] PyTorch predictor: number of prediction threads 1
I1104 09:58:58.839010 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0567. Iters per second: 49.8586
I1104 09:59:54.797755 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.1007. Iters per second: 49.7494
I1104 10:00:50.696525 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0657. Iters per second: 49.8363
I1104 10:01:46.514736 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0696. Iters per second: 49.8265
I1104 10:02:42.378270 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0641. Iters per second: 49.8402
I1104 10:02:42.378316 533814 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 20.0714, standard deviation: 0.0170605
I1104 10:02:42.378325 533814 PyTorchPredictorBenchLib.cpp:366] memory turnover after benchmarking: 4591882400
Nov4SaveTwoWordsInProcessedNode, local
========================================
I1104 10:38:15.543320 733514 PyTorchPredictorBenchLib.cpp:346] memory turnover after getPredictor: 177721792
I1104 10:38:21.224673 733514 PrepareModelInputs.cpp:190] Loaded 696 records.
I1104 10:38:21.382973 733514 PyTorchPredictorBenchLib.cpp:353] memory turnover before benchmarking: 4590343216
I1104 10:38:21.382992 733514 PyTorchPredictorBenchLib.cpp:154] PyTorch predictor: number of prediction threads 1
I1104 10:39:17.005359 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.9498. Iters per second: 50.1257
I1104 10:40:12.545269 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.9279. Iters per second: 50.1808
I1104 10:41:08.138119 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.999. Iters per second: 50.0026
I1104 10:42:03.686841 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.9115. Iters per second: 50.2222
I1104 10:42:55.137498 733539 Proxy2Connection.cpp:343] Received NotRegisteredException from Configerator Proxy2.
I1104 10:42:55.138715 733539 ReadOnlyConnectionIf.h:91] Mark connection as healthy.
I1104 10:42:55.384534 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6297. Iters per second: 50.9433
I1104 10:42:55.384579 733514 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.8836, standard deviation: 0.14571
I1104 10:42:55.384588 733514 PyTorchPredictorBenchLib.cpp:366] memory turnover after benchmarking: 4591711760
```
Reviewed By: d1jang
Differential Revision: D32177531
fbshipit-source-id: 267e38a151d2dbab34fd648135d173cfbee1c22e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68091
Add record functions for recording perf stats on the entire network.
Note that this is backed by the same pre-sampling mechanism as the op record functions, so net-level stats get logged relatively infrequently. (If this is not acceptable, we can skip pre-sampling at the cost of a little perf: every inference would then require an RNG call.)
Reviewed By: hlu1
Differential Revision: D32296756
fbshipit-source-id: 09ff16c942f3bfc8f4435d6cca2be4a6b8dc6091
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67939
With `manage_output_tensor` enabled, a client of `StaticRuntime` is required to call it via `PyTorchPredictor::predict_managed_result`. If the client uses `PyTorchPredictor::operator()` instead, it will crash (intended behavior to avoid leaking memory of managed output tensors). This mistake can cause a catastrophic failure in production if it happens (via gatekeeper, config changes, etc.).
Considering the complexity in how `PyTorchPredictor` is used in different settings, the chance that this bug hits production is non-zero.
This change introduces `StaticRuntime::disableManageOutputTensor` to disable the `manage_output_tensor` feature when a client mistakenly uses `PyTorchPredictor::operator()`, instead of crashing. When `StaticRuntime` is invoked via `PyTorchPredictor::operator()`, it first calls `StaticRuntime::disableManageOutputTensor` to disable the feature, so that it can return non-managed output tensors to the client safely.
A slight perf degradation is expected from forcefully disabling `manage_output_tensors`, but the robustness benefit outweighs the risk of a catastrophic failure from crashing at a high rate.
Test Plan: Added a unittest `StaticRuntime, DisableManageOutputTensors` to cover the newly added code.
Reviewed By: swolchok
Differential Revision: D32219731
fbshipit-source-id: caf5c910b34726c570e17435ede7d888443e90cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825
The comment explains how it works.
Test Plan:
There is a small regression on local and local_ro if we only enable it for fallback ops.
```
## local_ro
# before
I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247
I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319
I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759
I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619
# after
# # only enable for fall back ops: 0.7%
I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481
I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262
I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51
I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424
# enable for all (no clone): 4.7%
I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005
I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14
I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. Iters per second: 781.506
I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765
# local
# before
I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233
I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641
I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735
I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206
# after
# # only enable for fall back ops: 0.1%
I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349
I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601
I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659
I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658
# enable for all (no clone): 0.9%
I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783
I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611
I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. Iters per second: 50.3037
I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498
```
Reviewed By: d1jang
Differential Revision: D32124812
fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67911
If we can remove `self` from the graph inputs, there is no need for `StaticModule` to hold onto its `Module` reference anymore.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32190755
fbshipit-source-id: 9c4649a63b6e68c7d2e47395a23572985d2babb1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67437
Certain ops do nothing on the forward pass and can be discarded after training: `aten::detach` and `fb::scale_gradient` are examples of this.
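A hedged sketch of the rewrite for the `aten::detach` case (the actual pass runs during freezing and also handles `fb::scale_gradient` and nested blocks, which are omitted here): every use of the node's output is redirected to its input, then the node is destroyed.
```
#include <torch/csrc/jit/ir/ir.h>
#include <memory>
#include <vector>

void remove_detach_sketch(std::shared_ptr<torch::jit::Graph>& graph) {
  const auto detach = c10::Symbol::fromQualString("aten::detach");
  std::vector<torch::jit::Node*> to_remove;
  for (torch::jit::Node* node : graph->nodes()) {
    if (node->kind() == detach) {
      // detach is a no-op on the forward pass: forward its input directly.
      node->output()->replaceAllUsesWith(node->input(0));
      to_remove.push_back(node);
    }
  }
  // Destroy after iteration so the node list is not invalidated mid-loop.
  for (torch::jit::Node* node : to_remove) {
    node->destroy();
  }
}
```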
Test Plan: `buck test caffe2/test:jit -- test_freezing`
Reviewed By: hlu1
Differential Revision: D31980843
fbshipit-source-id: 0045b6babcfae786a2ce801b2f5997a078205bc0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66130
We're reusing backing storage for these tensors, which is only safe because they have non-overlapping lifetimes. Accordingly, it seems that they can also share their StorageImpl.
ghstack-source-id: 142427752
Test Plan:
benchmarked ctr_mobile_feed local and local_ro:
Using recordio inputs for model 302008423_0
```
swolchok@devbig032 ~/f/fbcode> env MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 > environment^C
swolchok@devbig032 ~/f/fbcode> sudo ~/fbsource2/fbcode/scripts/bertrand/noise/denoise-env.sh \
/tmp/ptvsc2_predictor_benchNov1ArenaAllocateStorageImpls \
--scripted_model=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.predictor.disagg.local \
--method_name=local.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=2 --warmup_iters=2 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 --repetitions=5 --recordio_use_ivalue_format=1 --recordio_inputs=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.local.inputs.recordio
Stable
========================================
I1101 14:19:16.473964 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0131. Iters per second: 49.9673
I1101 14:20:12.193130 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0155. Iters per second: 49.9612
I1101 14:21:07.761898 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9751. Iters per second: 50.0624
I1101 14:22:03.218066 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9104. Iters per second: 50.2249
I1101 14:22:58.723256 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.956. Iters per second: 50.1102
I1101 14:22:58.723306 2748837 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.974, standard deviation: 0.043643
ArenaAllocateStorageImpls
========================================
I1101 14:08:57.070914 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9771. Iters per second: 50.0572
I1101 14:09:52.605121 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.924. Iters per second: 50.1907
I1101 14:10:48.098287 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9353. Iters per second: 50.1624
I1101 14:11:43.645395 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9723. Iters per second: 50.0694
I1101 14:12:39.171636 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9673. Iters per second: 50.0819
I1101 14:12:39.171685 2695478 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.9552, standard deviation: 0.0239318
difference: 0.0188 (0.09%), which is less than 1 standard deviation
Stable, local_ro
========================================
I1101 14:26:10.796161 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25991. Iters per second: 793.708
I1101 14:26:12.194727 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26862. Iters per second: 788.26
I1101 14:26:13.591312 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26549. Iters per second: 790.207
I1101 14:26:14.982439 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25943. Iters per second: 794.01
I1101 14:26:16.377033 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25995. Iters per second: 793.68
I1101 14:26:16.377094 2787930 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.26268, standard deviation: 0.00414788
ArenaAllocateStorageImpls, local_ro
========================================
I1101 14:26:45.875073 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20987. Iters per second: 826.536
I1101 14:26:47.207271 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20827. Iters per second: 827.633
I1101 14:26:48.533766 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20023. Iters per second: 833.174
I1101 14:26:49.850610 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19206. Iters per second: 838.884
I1101 14:26:51.172356 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19958. Iters per second: 833.622
I1101 14:26:51.172411 2790009 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.202, standard deviation: 0.00722754
Difference: 0.06 usec/iter (4.8%), which is much more than 1 standard deviation
```
We can see that this is a large relative improvement on local_ro but has no effect on local.
Reviewed By: hlu1
Differential Revision: D31357486
fbshipit-source-id: 229c003677da76e89c659d0e0639002accced76e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67166
This optimization is not really the same thing as `FuseListUnpack`, and mixing the logic in that pass is confusing and error-prone. It should really be its own pass.
It's slower since we have to do another pass over the graph, but this is not perf critical code; readability is more important.
Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: hlu1
Differential Revision: D31887458
fbshipit-source-id: 289e281d512435861fccfe19f017751ad015688c