Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74730
Motivation: I am working on implementing a new, more efficient memory planning algorithm. This algorithm cannot replace the old one entirely, because it is only practical for models that have sample inputs to warm up with. We need a way to make the memory planner's strategy extensible.
My first attempt crammed everything into the same class, but it became a nightmare to manage (a ton of `if (use_new_strategy)` statements everywhere). It was also clumsy because some concepts make sense for one algorithm but not the other (like `StorageGroup`).
It's much cleaner if we instead turn `MemoryPlanner` into an abstract base class and have different subclasses implement their strategies in `allocateManagedTensors` and `deallocateManagedTensors`.
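A minimal sketch of the new shape (the two virtual method names are from this diff; the driver methods and the derived class are illustrative):
```cpp
// Sketch only: the real classes live in Static Runtime's implementation.
class MemoryPlanner {
 public:
  virtual ~MemoryPlanner() = default;

  // Shared driver logic stays in the base class...
  void allocate() { allocateManagedTensors(); }
  void deallocate() { deallocateManagedTensors(); }

 protected:
  // ...and each strategy supplies these two hooks.
  virtual void allocateManagedTensors() = 0;
  virtual void deallocateManagedTensors() = 0;
};

// The old strategy keeps concepts like StorageGroup private to itself
// instead of leaking them into a shared class.
class StandardMemoryPlanner : public MemoryPlanner {
 protected:
  void allocateManagedTensors() override { /* group-based allocation */ }
  void deallocateManagedTensors() override { /* ... */ }
};
```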
ghstack-source-id: 153288210
Test Plan: Existing unit tests
Reviewed By: navahgar, hlu1
Differential Revision: D35132124
fbshipit-source-id: c5ef5ae6361b44dedf97090201e244a76e1e6bce
(cherry picked from commit c96f6827c8db88f28c4eb379865ad208beae2034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966
It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this:
```
def forward(self, cond: bool, a, b):
    lst = []
    if cond:
        res = a + b  # res should not be managed!!!
        lst.append(res)
    return lst
```
The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`.
The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`.
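Sketched, the classification becomes something like this (assuming `AliasDb`'s `mayContainAlias`/`mayAliasWildcard` queries; the real logic lives in `ValueGroup`):
```cpp
#include <torch/csrc/jit/ir/alias_analysis.h>

using torch::jit::AliasDb;
using torch::jit::Value;

// Sketch: a value escapes its scope if it may alias a graph output OR the
// wildcard set, so both cases must land in the external aliases.
bool escapes_scope(AliasDb& db, Value* v, at::ArrayRef<Value*> graph_outputs) {
  return db.mayContainAlias(v, graph_outputs) || db.mayAliasWildcard(v);
}
```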
This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that, if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is not true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy.
ghstack-source-id: 153288188
Test Plan: New unit tests cover the problematic case
Reviewed By: navahgar
Differential Revision: D35257087
fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7
(cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74927
The move ctor was broken because `BlockRunner` stores a reference to `values_`. When moving runtime instances, the pointer to the root block would be moved, but the reference inside it would not be updated.
Pass `BlockRunner` a raw pointer to the heap-allocated IValues instead to avoid this issue.
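A stripped-down illustration of the bug and the fix (stand-in types, not SR's real members):
```cpp
#include <memory>
#include <vector>

using IValueArray = std::vector<int>;  // stand-in for the real IValue array

// Bug: the sub-runner holds a reference into its owning runtime. The
// compiler-generated move ctor rebinds nothing, so after a move the
// reference still points at the moved-from instance's array.
struct BrokenRuntime {
  IValueArray values_;
  IValueArray& block_values_ = values_;  // not updated on move
};

// Fix: heap-allocate the array and hand sub-runners a raw pointer. The heap
// allocation's address is stable no matter how many times the runtime moves.
struct FixedRuntime {
  std::unique_ptr<IValueArray> values_ = std::make_unique<IValueArray>();
  IValueArray* block_values_ = values_.get();
};
```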
ghstack-source-id: 153168602
Test Plan: New unit test/CI
Reviewed By: navahgar
Differential Revision: D35228467
fbshipit-source-id: 04e198b39f898b82677a0e41e1cdf00c2b0c09f3
(cherry picked from commit 03e2c591ac3a907d68025eae9500ed7226dec17e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481
This diff fixes an interesting performance issue related to `permute_copy`.
We see this pattern frequently:
```
y = torch.permute(x, (0, 2, 1))
z = torch.sum(y, dim=-1)
```
With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589
But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597
But the permute is actually unnecessary; we can statically rewrite the graph as follows, ensuring the fast kernel is hit even with copy variants on:
```
z = torch.sum(x, dim=1)
```
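A quick libtorch sanity check of the equivalence (arbitrary shapes):
```cpp
#include <torch/torch.h>

int main() {
  auto x = torch::randn({2, 3, 4});
  // Reducing the last dim of the permuted tensor touches the same elements
  // as reducing dim 1 of x directly, so the permute can be dropped.
  auto via_permute = x.permute({0, 2, 1}).sum({-1});
  auto direct = x.sum({1});
  TORCH_CHECK(torch::allclose(via_permute, direct));
}
```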
ghstack-source-id: 152003888
Reviewed By: navahgar
Differential Revision: D34992319
fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f
(cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606
The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
ghstack-source-id: 151394116
Test Plan: Existing unit tests
Reviewed By: hlu1
Differential Revision: D34562131
fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73681
Static Runtime rejects legal calls made through the kwargs API when the schema contains parameters with default values.
ghstack-source-id: 150433627
Test Plan: Added unit test to cover this case
Reviewed By: navahgar, d1jang
Differential Revision: D34588804
fbshipit-source-id: 74d7ef5bee74f9d16b02b0c8ceda4285ea776755
(cherry picked from commit 9c3db19cb45f6022e646deeb1e8056daa04f363f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73536
Currently, the `ProcessedNode` class plays two distinct roles that are not obvious:
1) a "template" owned by `StaticModule` that holds the metadata of a node executable by the runtime;
2) a fully instantiated node owned by `StaticRuntime`.
Merging these two use cases into one class is error-prone, since illegal copies can happen uncontrollably; the only copy we want is from objects of kind (1) into objects of kind (2) when a `StaticRuntime` instance is created.
To address this issue, this change introduces a separate class so the two use cases are clearly distinguished in the code: `StaticNodeInfo` is now for (1) and `ProcessedNode` is for (2).
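In outline (fields are illustrative, not the real layout):
```cpp
class ProcessedFunction;  // shared, immutable op implementation

// (1) Owned by StaticModule: an immutable template describing one node.
class StaticNodeInfo {
 public:
  const ProcessedFunction* fn() const { return fn_; }

 private:
  const ProcessedFunction* fn_ = nullptr;
  // input/output index metadata, etc.
};

// (2) Owned by StaticRuntime: the executable instance. The only legal copy
// direction is StaticNodeInfo -> ProcessedNode, at runtime creation.
class ProcessedNode {
 public:
  explicit ProcessedNode(const StaticNodeInfo& info) : fn_(info.fn()) {}

 private:
  const ProcessedFunction* fn_;
};
```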
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D33985600
fbshipit-source-id: 0c79cea2bf982dd956a35f48eaf6027e5b6e390c
(cherry picked from commit 0d8acc4a2b6eeb3e4af3ad2c99f4cd667680f8df)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72946
The passes that replace ops with their copy variants run after TensorExpr fusion, so the resulting graph does not conform to the assumptions made by the fuser.
Hence, even if the `use_copy_variants` / `use_maybe_copy_variants` flags are turned on, the corresponding passes will not be executed when TensorExpr fusion is enabled.
ghstack-source-id: 149429753
Test Plan: Tested locally.
Reviewed By: mikeiovine
Differential Revision: D34283842
fbshipit-source-id: 74edea517a00c85dff0319f9c8b3ac8befe09018
(cherry picked from commit 3798af7f1b8c9b3c072862f58ebf16af6294db14)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73032
Currently, ptvsc2_predictor_bench reports nothing when the input size is zero. However, Static Runtime's module creation produces useful information even before any iterations run.
This change reports static op statistics when the given input size is zero, including the out-variant coverage percentage, which is crucial for establishing Static Runtime's baseline performance.
Test Plan: - Ran `ptvsc2_predictor_bench` with this change as seen above.
Reviewed By: mikeiovine
Differential Revision: D34294803
fbshipit-source-id: 80c02199075dae9280657d6edecc7c679c1c27f4
(cherry picked from commit 83aec141a25a9ede5d22e5c17c0b6b07307faf39)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587
This pattern frequently appears in a few graphs:
```
%result = prim::If(%condition)
  block0():
    -> (%a)
  block1():
    -> (%b)
```
This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.
This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:
```
%result = prim::IfThenElse(%condition, %a, %b)
```
Test Plan: New unit tests
Reviewed By: eellison
Differential Revision: D34091789
fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b83f402bda2dcdd9ecb411e0b67c651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72592
`ProcessedNode::num_outputs_` is only read on code paths that are not perf-critical, and it is a static property of the op that the `ProcessedNode` instance executes.
Therefore, it's better to move `ProcessedNode::num_outputs_` into `ProcessedFunction::num_outputs_` and let `ProcessedNode` access it via `ProcessedNode::fn_` for its occasional use. Note that this avoids duplicating `num_outputs_` per node and per Static Runtime instance, since `ProcessedFunction` instances are shared across all runtime instances.
Local instrumentation confirms that this change reduces `sizeof(ProcessedNode)` by 14%:
- Before: `sizeof(ProcessedNode)`: 56
- After: `sizeof(ProcessedNode)`: 48
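Sketch of the layout change (members illustrative):
```cpp
#include <cstdint>

// Per-op metadata lives on the shared function object...
class ProcessedFunction {
 public:
  uint32_t num_outputs() const { return num_outputs_; }

 private:
  uint32_t num_outputs_ = 0;  // stored once, shared by all runtime instances
};

// ...and ProcessedNode forwards to it on the (cold) paths that need it,
// instead of duplicating the value per node per runtime instance.
class ProcessedNode {
 public:
  uint32_t num_outputs() const { return fn_->num_outputs(); }

 private:
  const ProcessedFunction* fn_ = nullptr;
};
```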
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: mikeiovine
Differential Revision: D33984792
fbshipit-source-id: e29ffc97b799e679215f42e1e85cd3fcd7e88983
(cherry picked from commit 0f7003f4dfd6473a70355ca3c6f51498abf1d7be)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71102
This graph pass is causing a major perf regression on some models. Ideally we would introduce maybe_copy variants for all these ops. But since those are tricky to write, I've introduced a flag to just turn the pass off for now.
ghstack-source-id: 148541673
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: navahgar
Differential Revision: D33510080
fbshipit-source-id: bb4847f26561197ea5e6bbad0a4d25db4ef468eb
(cherry picked from commit 8f333d3e8138e2a7ba04bea7509ad84dd97844eb)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71807
There's no need to completely disallow `aten::__is__` and `aten::__isnot__`. The only problematic case is when the comparison is between two tensors, e.g. in
```
def forward(x):
    y = x.detach()
    # Should be False, but we get True
    # after our EliminateNoOps pass
    return x is y
```
Test Plan: New unit test covers this case
Reviewed By: d1jang
Differential Revision: D33783668
fbshipit-source-id: c9f57fa96937ecce38a21554f12b69c45cc58fe4
(cherry picked from commit 019588f4ca3fcd2b3ae51bccab102f0538745b15)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838
Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186483
Test Plan: New unit tests: `buck test caffe2/benchmark/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33049595
fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532
(cherry picked from commit 8baf1753af34f4c166b4680e42589517fd2e508d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837
Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186475
Test Plan:
New unit tests: `buck test caffe2/benchmarks/static_runtime/...`
Accuracy test at top of stack
Reviewed By: d1jang
Differential Revision: D33045908
fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55
(cherry picked from commit de3b12dc0871e8ca09891c257e1dfd7cd352aa7c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69836
It is technically possible for the sub-blocks to return zero outputs. This is problematic for `StaticRuntimeBlockRunner`, because it assumes that at least one output is being returned.
Rather than slowing down SR with special logic for this corner case, we can simply force these sub-blocks to return `None`.
ghstack-source-id: 148186453
Test Plan: Sub-blocks with no return values tested at top of stack
Reviewed By: d1jang
Differential Revision: D33050420
fbshipit-source-id: 17d9e19fda6431aa9fd0b155131349bac42bc149
(cherry picked from commit c97fd07bf53e1e253a0e6c733db5ea7c86698fc9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835
`StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`.
Note that this also applies to returned values in sub-blocks that are from outer scopes.
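The new op's kernel is essentially just a copy, which is what makes the later move safe (sketch; the actual op name and registration are in this diff):
```cpp
#include <ATen/core/ivalue.h>

// Copying an IValue bumps the underlying refcount, so the result is an owned
// reference that StaticRuntimeBlockRunner can move out without disturbing
// the constants_ array (or a value owned by an outer scope).
c10::IValue create_owned_ref(const c10::IValue& v) {
  return v;  // copy, not move
}
```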
ghstack-source-id: 148186452
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
Added a new unit test with a graph that simply returns a constant.
Tests with sub-blocks at top of stack.
Reviewed By: d1jang
Differential Revision: D33047519
fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b
(cherry picked from commit a8f89a12ee726aa7d7e546dee25d696eef868ce7)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71986
This addresses concerns over the space increase from control flow.
`op_name_` was only stored as a minor optimization to avoid name lookup during logging, so we can safely get rid of it. Thanks to the sampling mechanism, `get_op_name()` is called very infrequently, so this shouldn't cause much of a regression.
ghstack-source-id: 148086244
Test Plan: CI
Reviewed By: d1jang
Differential Revision: D33821005
fbshipit-source-id: 6f74eb30a54a046ca90768aebbcde22e8c435f35
(cherry picked from commit 361ba32e97dbd130938bae10b5159730822c518c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834
* Modify the `StaticModule` constructor to handle index initialization for sub-blocks.
* Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes!
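Sketch of the ownership split (simplified):
```cpp
#include <ATen/core/ivalue.h>

#include <cstddef>
#include <vector>

// A block runner sees, but does not own, the runtime-wide value array;
// nested block runners index into the same array, which is how sub-blocks
// read values produced in enclosing scopes.
class StaticRuntimeBlockRunner {
 public:
  explicit StaticRuntimeBlockRunner(c10::IValue* values) : values_(values) {}
  c10::IValue& value(size_t i) { return values_[i]; }

 private:
  c10::IValue* values_;  // unowned
};

class StaticRuntime {
 public:
  StaticRuntime() : values_(64), root_(values_.data()) {}

 private:
  std::vector<c10::IValue> values_;  // single owner
  StaticRuntimeBlockRunner root_;    // sub-block runners get values_.data() too
};
```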
ghstack-source-id: 148086245
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33028039
fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6
(cherry picked from commit 3a9feffd929869120c717d35aa55aad8a382783d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70944
Added special net-level/op-level scopes for static runtime. We can use these to add special behavior in record functions when they are invoked from a static runtime context.
Reviewed By: navahgar
Differential Revision: D33458211
fbshipit-source-id: 0b7022100e9f5ac872f4cb5bfba14e92af2c71b0
(cherry picked from commit b486548544c5e822803071756c85e675e37d2dad)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71705
This fixes a crash in `resetMemory` caused by trying to access a `TensorImpl` via a borrowed `IValue` after it had already been destroyed. We need to clean up all borrows *before* we destroy the owning `IValue`, not after.
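The ordering constraint, sketched (member names illustrative; the real code uses a borrow-aware teardown rather than plain assignment):
```cpp
#include <ATen/core/ivalue.h>

#include <vector>

struct CleanupSketch {
  std::vector<c10::IValue*> borrowed_values_;
  std::vector<c10::IValue*> owned_values_;

  void resetMemory() {
    // Drop borrows first, while the IValues they point into are still alive.
    for (c10::IValue* b : borrowed_values_) {
      *b = c10::IValue();  // stands in for "destroy the borrow"
    }
    // Only then destroy the owners; the old order read freed TensorImpls.
    for (c10::IValue* o : owned_values_) {
      *o = c10::IValue();
    }
  }
};
```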
ghstack-source-id: 147688982
Test Plan:
New unit test covers this case
ICE w/ inline_cvr v0 [finishes successfully](https://www.internalfb.com/intern/unidash/dashboard/ads_infra_cost_estimation/a_metrics/?e[select_ESTIMATION_RUN_ID]=ICE_mikeiovine_16431103211c65), didn't see any nnpi errors
Reviewed By: ajyu
Differential Revision: D33725435
fbshipit-source-id: f8dd109382b5cf54df6f194f8dcb5c0812b174bb
(cherry picked from commit 31339d9d38e63248d2ac3646be71008ed731f16c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501
This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it.
ghstack-source-id: 147385569
Test Plan: CI, rerun broken test from task
Reviewed By: mikeiovine
Differential Revision: D33669290
fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a
(cherry picked from commit 5e4fefb109b6c92d59fc7e24d69f1b6b2780c776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69853
We can implement this overload more efficiently.
ghstack-source-id: 146924693
Test Plan:
patched alias_analysis tests
Static Runtime's reported predictor initialization time for the ctr_mobile_feed local_ro net drops from 10.5s to 9.5s.
Reviewed By: mikeiovine
Differential Revision: D33039731
fbshipit-source-id: 52559d678e9eb00e335b9e0db304e7a5840ea397
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122
See code comments for details; in brief, we repurpose support
for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor`
output a borrowed IValue that we have to clean up manually.
If we have any other ops that always create a new reference to an
existing Tensor, we can easily apply this same optimization.
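The core trick in isolation:
```cpp
#include <ATen/ATen.h>
#include <c10/util/MaybeOwned.h>

// A borrowed MaybeOwned<Tensor> skips the atomic refcount bump; in exchange,
// whoever creates it must guarantee the owner outlives it and must tear it
// down explicitly -- which is exactly the manual cleanup this diff adds for
// the select_tensor output.
void demo() {
  at::Tensor owner = at::zeros({4});
  c10::MaybeOwned<at::Tensor> borrow =
      c10::MaybeOwned<at::Tensor>::borrowed(owner);
  const at::Tensor& t = *borrow;  // no refcount traffic
  (void)t;
}  // `borrow` dies before `owner` here; SR enforces the same order by hand
```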
ghstack-source-id: 146482212
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
--do_profile output for local_ro (updated Dec 10):
```
swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt
First iter time: 0.989023 ms
Number of operators: 2037
Total number of managed tensors: 1597
Total number of managed output tensors: 0
Total number of unmanaged values: 2568
Number of unmanaged values requiring cleanup: 2568
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 50368 bytes
Total number of reused tensors: 1010
Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%)
swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt
First iter time: 0.994703 ms
Number of operators: 2551
Total number of managed tensors: 1146
Total number of managed output tensors: 0
Total number of unmanaged values: 4047
Number of unmanaged values requiring cleanup: 3533
Number of unmanaged values not requiring cleanup: 514
Total memory managed: 50048 bytes
Total number of reused tensors: 559
Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%)
```
for local: (also Dec 10):
```
==> Stable.local.profile.txt <==
First iter time: 9.0909 ms
Number of operators: 1766
Total number of managed tensors: 1894
Total number of managed output tensors: 0
Total number of unmanaged values: 2014
Number of unmanaged values requiring cleanup: 2014
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 4541440 bytes
Total number of reused tensors: 847
Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%)
==> TMCOFastAliasing.local.profile.txt <==
First iter time: 7.5512 ms
Number of operators: 2378
Total number of managed tensors: 1629
Total number of managed output tensors: 0
Total number of unmanaged values: 3503
Number of unmanaged values requiring cleanup: 2891
Number of unmanaged values not requiring cleanup: 612
Total memory managed: 3949312 bytes
Total number of reused tensors: 586
Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%)
```
Reviewed By: hlu1
Differential Revision: D32318674
fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223
ghstack-source-id: 146482215
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
Reviewed By: hlu1
Differential Revision: D31776259
fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67221
Update memory leak checks to not require that output tensors are cleaned up.
ghstack-source-id: 146464297
Test Plan: Tests should still pass; reviewers to confirm that this is OK in principle
Reviewed By: d1jang
Differential Revision: D31847567
fbshipit-source-id: bb7ff2f2ed701e2d7de07d8032a1281fccabd6a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D33336948
Pulled By: albanD
fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69774
We recently ran into a nasty bug caused by incorrect schema annotations on an `aten::split` overload. `verify_and_correct_memory_overlap` is supposed to prevent crashes in this scenario, but it didn't because it did not handle `Tensor[]` outputs.
This change extends the memory correction mechanism to handle tensor lists.
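Roughly, the correction now also walks list outputs (sketch; the real check compares each tensor against the op's inputs and corrects on overlap):
```cpp
#include <ATen/core/ivalue.h>

// Before, only plain Tensor outputs were inspected; now each element of a
// Tensor[] output gets the same overlap verification.
void verify_output(const c10::IValue& out) {
  if (out.isTensor()) {
    // check_and_correct(out.toTensor());  // hypothetical helper
  } else if (out.isTensorList()) {
    for (at::Tensor t : out.toTensorList()) {
      (void)t;  // check_and_correct(t);
    }
  }
}
```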
ghstack-source-id: 146152478
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: hlu1
Differential Revision: D33022494
fbshipit-source-id: 8d1d41ca1d4fd5dfb7c8a66028c391ba63551eb0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69475
This diff adds TensorExpr fusion with dynamic shapes in SR. This includes tracing the input graph with sample inputs, and then performing fusion with generalization to get fused graphs with dynamic shapes.
ghstack-source-id: 146059043
Test Plan:
```
buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Reviewed By: d1jang
Differential Revision: D32320088
fbshipit-source-id: 397f498878ddfcee9dad7a839652f79f034fefe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69473
This diff refactors StaticModule and its uses to pass in sample inputs. These inputs need to be passed into the constructor because they are needed to perform TensorExpr fusion before other optimizations are performed on the input graph.
ghstack-source-id: 146059041
Test Plan: buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test
Reviewed By: donaldong
Differential Revision: D32320084
fbshipit-source-id: b8bd46d442be4cc90ca60f521e0416fdb88eea60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69406
Most files that include `interned_strings.h` don't actually depend on
anything generated from `FORALL_NS_SYMBOLS`, yet because the symbols live
in a single file, every includer must be recompiled whenever a new symbol
is added. Here I move the class definition into a separate file so this
doesn't happen.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D32923637
Pulled By: albanD
fbshipit-source-id: 6e488cbfcfe2c041a99d9ff22e167dbddf3f46d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: jbschlosser, malfet
Differential Revision: D32596264
Pulled By: albanD
fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69087
This diff includes a variety of improvements to `set_inputs` to unify behavior with `torch::jit::Module`:
1. Eliminate code duplication between rvalue/lvalue overloads
2. Add type checks
3. Make the input length check a `TORCH_CHECK` instead of a debug check - we have to fail when the wrong number of inputs is passed.
4. `schema` now always includes `self`, even if we release `module_`. This is consistent with `torch::jit::Module`.
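For (3), a sketch of the hard check:
```cpp
#include <ATen/core/ivalue.h>
#include <c10/util/Exception.h>

#include <cstddef>
#include <vector>

// A check that fires in release builds too, unlike the debug-only assert it
// replaces; `expected` is the schema's input count (illustrative helper).
void check_num_inputs(const std::vector<c10::IValue>& args, size_t expected) {
  TORCH_CHECK(args.size() == expected, "StaticModule expected ", expected,
              " inputs, but got ", args.size());
}
```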
ghstack-source-id: 145599837
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32711705
fbshipit-source-id: fe97c10b4f03801ba59868b452e7d02b26b3106b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69755
Per swolchok's suggestion on D32609915 (1c43b1602c). Hide the value offset indices behind accessors to provide more flexibility if we ever decide to change the layout of the values array.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32838145
fbshipit-source-id: cf805c077672de4c2fded9b41da01eca6d84b388
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69809
The SR options are printed only once per model per net. Logging them is actually pretty helpful for debugging.
Test Plan: CI
Reviewed By: donaldong
Differential Revision: D33046814
fbshipit-source-id: 536b34e00fbc8a273c5eb4d8ae5caca0dc1f4c24
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595
This change encapsulates the `function` object in `ProcessedFunction` instead of exposing it unnecessarily just to execute it.
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D32908341
fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68795
This change improves static runtime exception safety. Added a scope exit guard that invokes `MemoryPlanner::deallocate` in its destructor.
Caveat: we have to be really careful with the exception behavior of `MemoryPlanner::deallocate` and `MemoryPlanner`'s constructor, because they're now both potentially called in the destructor of the scope exit guard. Letting exceptions potentially escape destructors is playing with fire since 1) the destructor of `Deallocator` is (implicitly) `noexcept`, 2) even if it wasn't, `std::terminate` will be called if an exception escapes and the stack is already unwinding. To get around this, we wrap the deallocation stuff in a try/catch. If deallocation throws, then we simply reset all of the memory planner stuff and carry on.
There's a catch: the code path that we take when handling the deallocation exception can't throw. However, this code path is much simpler than memory planner construction/deallocation, so it's much easier to manually audit the correctness here.
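The guard in outline (names illustrative; bodies elided):
```cpp
// Sketch of the scope-exit guard described above.
class Deallocator {
 public:
  ~Deallocator() {
    // ~Deallocator is implicitly noexcept and may run during unwinding,
    // so nothing may escape from here.
    try {
      deallocateImpl();  // MemoryPlanner::deallocate + friends; may throw
    } catch (...) {
      cleanupImpl();  // simple path, manually audited to be non-throwing
    }
  }

 private:
  void deallocateImpl() { /* normal deallocation; may throw */ }
  void cleanupImpl() noexcept { /* reset planner state, drop references */ }
};

void run_impl_sketch() {
  Deallocator guard{};  // deallocation now happens even if execution throws
  // ... execute ops; any exception still triggers ~Deallocator ...
}
```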
Test Plan:
**New unit tests**
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32609915
fbshipit-source-id: 71fbe6994fd573ca6b7dd859b2e6fbd7eeabcd9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69489
This change avoids pulling the `Node*` out of `ProcessedNode` to evaluate `Node`-related expressions at op execution time.
Any perf gain is expected to be real but unmeasurable; the main purpose of this change is to make SR's code more self-contained (calling more code from SR, not JIT) during execution.
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D32893265
fbshipit-source-id: f0f397666b3556f985d45112af8fe0b08de22139
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302
Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis.
Test Plan:
## **Re-use metrics**
`inline_cvr` (294738512_58)
**Before**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4601984 bytes
Total number of reused tensors: 1183
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29696 bytes
Total number of reused tensors: 959
```
**After**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4520000 bytes
Total number of reused tensors: 1198
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29120 bytes
Total number of reused tensors: 963
```
Reviewed By: hlu1
Differential Revision: D32370424
fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69274
`impl.h` is the main header file that defines the interface of Static Runtime to its clients.
However, it is currently filled with implementation details that should not be leaked to our clients. This 1) unnecessarily exposes our internals, making them hard to change later, and 2) causes unnecessary merge conflicts when multiple people touch the enormous impl.cpp file.
To alleviate the situation, this change moves the implementation details from impl.h into a new file, internal.h, that's internally kept without leaking the details to our clients.
This change will be followed by another change to rename `impl.h` into `runtime.h` or anything better since `impl.h` is currently not about implementation but SR's interface.
Note that this change is NOT complete since the remaining declarations in impl.h still contain a lot of implementation details. Therefore, we should keep working on minimizing the interface to prevent our API from being bloated unnecessarily. Also we need to work on modularizing our implementations into separate pieces organized by separate files in the near future.
Test Plan: Existing unittests
Reviewed By: donaldong
Differential Revision: D32780415
fbshipit-source-id: 119b7aedbf563b195641c5674572a9348732145f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68884
This diff uses `std::vector::reserve` in `GetLivenessMap` to set the capacity of all local containers up front, avoiding runtime resizing.
The change should theoretically improve performance by a little.
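The pattern, for reference (`num_values` is whatever upper bound the graph provides; illustrative):
```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

void build_liveness(size_t num_values) {
  std::vector<int> order;
  order.reserve(num_values);  // one allocation instead of repeated regrowth

  std::unordered_map<int, int> liveness;
  liveness.reserve(num_values);  // pre-sizes the bucket array, avoiding rehash
}
```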
Test Plan:
- [x] `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1`
- [x]
```
seq 1 10 | xargs -I{} ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/data/users/dxd/302008423_0.predictor.disagg.local \
--method_name=local_request_only.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=0 --warmup_iters=0 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \
--input_type="recordio" --pt_inputs=/data/users/dxd/302008423_0.local_ro.inputs.recordio \
--recordio_use_ivalue_format=1
```
### Before
```
I1201 12:04:46.753311 2874563 PyTorchPredictorBenchLib.cpp:336] Took 10.9826 sec to initialize a predictor.
I1201 12:05:00.617139 2875780 PyTorchPredictorBenchLib.cpp:336] Took 11.1078 sec to initialize a predictor.
I1201 12:05:15.279667 2876813 PyTorchPredictorBenchLib.cpp:336] Took 11.7979 sec to initialize a predictor.
I1201 12:05:30.201207 2877554 PyTorchPredictorBenchLib.cpp:336] Took 11.8901 sec to initialize a predictor.
I1201 12:05:44.386926 2879713 PyTorchPredictorBenchLib.cpp:336] Took 11.2722 sec to initialize a predictor.
I1201 12:05:58.003582 2881426 PyTorchPredictorBenchLib.cpp:336] Took 10.8046 sec to initialize a predictor.
I1201 12:06:12.004778 2882604 PyTorchPredictorBenchLib.cpp:336] Took 11.2754 sec to initialize a predictor.
I1201 12:06:26.101241 2884888 PyTorchPredictorBenchLib.cpp:336] Took 11.3355 sec to initialize a predictor.
I1201 12:06:40.364817 2886572 PyTorchPredictorBenchLib.cpp:336] Took 11.401 sec to initialize a predictor.
I1201 12:06:54.483794 2888614 PyTorchPredictorBenchLib.cpp:336] Took 11.3498 sec to initialize a predictor.
```
### After
```
I1201 11:51:53.775239 2818391 PyTorchPredictorBenchLib.cpp:336] Took 10.9113 sec to initialize a predictor.
I1201 11:52:07.412720 2819530 PyTorchPredictorBenchLib.cpp:336] Took 10.8413 sec to initialize a predictor.
I1201 11:52:21.202816 2820359 PyTorchPredictorBenchLib.cpp:336] Took 11.0216 sec to initialize a predictor.
I1201 11:52:35.513288 2821029 PyTorchPredictorBenchLib.cpp:336] Took 11.4216 sec to initialize a predictor.
I1201 11:52:49.145979 2821930 PyTorchPredictorBenchLib.cpp:336] Took 10.8272 sec to initialize a predictor.
I1201 11:53:02.908790 2822859 PyTorchPredictorBenchLib.cpp:336] Took 11.0262 sec to initialize a predictor.
I1201 11:53:16.276015 2823657 PyTorchPredictorBenchLib.cpp:336] Took 10.6893 sec to initialize a predictor.
I1201 11:53:30.103283 2824382 PyTorchPredictorBenchLib.cpp:336] Took 11.1854 sec to initialize a predictor.
I1201 11:53:44.298514 2825365 PyTorchPredictorBenchLib.cpp:336] Took 11.4796 sec to initialize a predictor.
I1201 11:53:58.258708 2826128 PyTorchPredictorBenchLib.cpp:336] Took 11.2652 sec to initialize a predictor.
```
Reviewed By: swolchok
Differential Revision: D32649252
fbshipit-source-id: 5cd296d12b12e5b15e85e4f1a8a236e293f37f9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69164
We have lots of methods that take `std::unordered_map<std::string, c10::IValue>` now. That's kind of ugly and cumbersome to type, so add a `KWargs` typedef.
Also made the `operator()` default `kwargs` to empty. Note that we could have another overload that doesn't take `kwargs` at all, but the perf gain is so minuscule it's probably not worth it.
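The typedef itself, plus the defaulted parameter (the surrounding struct is illustrative):
```cpp
#include <ATen/core/ivalue.h>

#include <string>
#include <unordered_map>
#include <vector>

using KWargs = std::unordered_map<std::string, c10::IValue>;

struct RunnerSketch {
  // kwargs defaults to empty, so `runner(args)` and `runner(args, kwargs)`
  // both work without a separate overload.
  c10::IValue operator()(std::vector<c10::IValue> args,
                         const KWargs& kwargs = KWargs()) {
    (void)args;
    (void)kwargs;
    return c10::IValue();
  }
};
```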
ghstack-source-id: 144691899
Test Plan: CI
Reviewed By: d1jang
Differential Revision: D32734677
fbshipit-source-id: 8d6496a6d1ec2dc71253151d2f6408f1387966cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68295
There's no reason we can't figure out what tensors we need to manage at model load time. It's also useful to have the set of ranges available at load time for integrating the ranges algorithm introduced in the previous diff.
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: hlu1
Differential Revision: D32400593
fbshipit-source-id: 0466b2641166ddc9c14f72774f4ba151407be400