Not only is this change usually shorter and more readable, it can also yield better performance. `size()` is not always a constant-time operation (e.g. on linked lists), but `empty()` always is.
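The point about container emptiness checks can be illustrated with a minimal sketch (all names here are hypothetical, not from the PR): a singly linked list whose `size()` must walk every node, while an emptiness check only inspects the head pointer.

```python
# Hypothetical singly linked list: computing the size is O(n),
# but checking emptiness is O(1).
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class LinkedList:
    def __init__(self, values=()):
        self.head = None
        for v in reversed(values):
            self.head = Node(v, self.head)

    def size(self):   # O(n): walks every node
        n, node = 0, self.head
        while node is not None:
            n, node = n + 1, node.next
        return n

    def empty(self):  # O(1): looks at the head pointer only
        return self.head is None

lst = LinkedList([1, 2, 3])
assert not lst.empty() and lst.size() == 3
assert LinkedList().empty()
```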
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
This PR fixes a number of bugs found by the Svace static analyzer:
1. DEREF_AFTER_FREE at qnnpack_utils.h:
Pointer '&convolution->zero_buffer' is dereferenced at qnnpack_utils.h:258 after the referenced memory was deallocated at operator-delete.c:25 by passing as 1st parameter to function 'pytorch_qnnp_delete_operator' at qnnpack_utils.h:251.
2. DEREF_AFTER_NULL at impl.cpp:
After having been compared to NULL value at impl.cpp:1892, pointer 'schema' is passed as 2nd parameter in call to function 'c10::operator<<' at impl.cpp:1921, where it is dereferenced at function_schema_inl.h:13.
3. DEREF_OF_NULL at stmt.h:
After having been compared to NULL value at stmt.h:744, pointer 'body->_M_ptr' is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at stmt.h:745, where it is dereferenced at exceptions.h:67.
4. DEREF_OF_NULL at loopnest.h:
Pointer 'f->ptr' that can have only NULL value (checked at loopnest.cpp:1482), is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at loopnest.cpp:1483, where it is dereferenced at exceptions.h:67.
This is the same error as 3: forwarding a nullptr to malformed_input().
5. TAINTED_INT.LOOP in python_arg_parser:
Integer value 'this->size' obtained from untrusted source at python_arg_parser.cpp:118 without checking its bounds is used as a loop bound at python_arg_parser.cpp:698 by calling function 'torch::FunctionParameter::set_default_str' at python_arg_parser.cpp:133.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85705
Approved by: https://github.com/kit1980
Summary:
The pass introduces an `fb::` operator and thus cannot be used in OSS.
The test failure was not exposed because the Static Runtime tests have been disabled in OSS for a while. The Dev Infra folks encountered this failure when re-enabling the tests.
Test Plan: Existing tests
Differential Revision: D40724547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87799
Approved by: https://github.com/huydhn
Summary:
Someone was running into problems where
1) Static Runtime enablement would fail,
2) we would try to fall back to the JIT interpreter *after trying to create `StaticModule`*, and
3) the fallback would fail because Static Runtime had mangled the graph.
Due to memory concerns, we don't want to prevent Static Runtime from mutating its input. The intent of `canEnableStaticRuntime` is to catch issues in the module before Static Runtime messes with it.
With this diff, `StaticModule` instantiation can be avoided by querying `canEnableStaticRuntime` and the issue is fixed.
Test Plan: New unit test
Differential Revision: D40564452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87396
Approved by: https://github.com/tenpercent
Summary: Split `quantized_linear_unpacked_weight_v2` into `linear_prepack` and `quantized_linear` so that the prepacking operation may be eliminated by constant folding.
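The benefit of the split can be sketched with a toy constant-folding pass (the IR representation and helper names below are hypothetical, not Static Runtime's actual implementation): once prepacking is a pure node whose only inputs are constant weights, it can be evaluated once at load time instead of on every inference.

```python
# Toy constant folding: nodes whose inputs are all constants are
# evaluated eagerly and removed from the runtime graph.
def constant_fold(nodes, constants):
    remaining = []
    for name, op, inputs, output in nodes:
        if all(i in constants for i in inputs):  # every input is constant
            constants[output] = op(*(constants[i] for i in inputs))
        else:
            remaining.append((name, op, inputs, output))
    return remaining

prepack = lambda w: ("packed", w)                # stand-in for linear_prepack
linear = lambda x, pw: [xi * pw[1] for xi in x]  # stand-in for quantized_linear

constants = {"w": 2}
nodes = [
    ("prepack", prepack, ["w"], "pw"),           # foldable: weight is constant
    ("linear", linear, ["x", "pw"], "y"),        # not foldable: x is a runtime input
]
nodes = constant_fold(nodes, constants)
assert [n[0] for n in nodes] == ["linear"]       # prepack ran at fold time
assert constants["pw"] == ("packed", 2)
```

The fused `quantized_linear_unpacked_weight_v2` cannot be folded this way because it also depends on the runtime activation input.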
Test Plan:
Fixes a huge regression in an internal model:
```
Before
89.6141 ms. 99.0923%. fb::quantized_linear_unpacked_weight_v2 (12 nodes)
After
0.806852 ms. 53.5365%. quantized::linear (12 nodes, out variant)
(prepacking eliminated)
```
Differential Revision: D39622530
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85289
Approved by: https://github.com/davidberard98
Summary: Apparently static runtime's list construct return value is always a `GenericList`, so we cannot use the `toOptionalTensorList` method in the general case -- we must convert each item individually.
Test Plan: New unit test
Differential Revision: D39628979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85298
Approved by: https://github.com/tenpercent
Summary:
- The `ProcessedNodeMetadata` class wraps the possible metadata for `ProcessedNode`. Depending on the nature of the op, a `ProcessedNode` can have one of the following kinds of metadata:
1. `prim::If`/`prim::Loop` ops contain `block_runners_` as their metadata
2. The `prim::fork` op contains a `TaskLauncher` (`std::function`) responsible for execution of the forked subgraph
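A minimal sketch of the wrapper idea (hypothetical names, not the C++ class itself): the metadata object carries either sub-block runners or a launcher callable, but never both.

```python
# Hypothetical metadata wrapper: a node carries either sub-block
# runners (prim::If / prim::Loop) or a task launcher (prim::fork).
class ProcessedNodeMetadata:
    def __init__(self, block_runners=None, launcher=None):
        assert block_runners is None or launcher is None  # mutually exclusive
        self.block_runners = block_runners  # for prim::If / prim::Loop
        self.launcher = launcher            # for prim::fork

if_meta = ProcessedNodeMetadata(block_runners=["then_runner", "else_runner"])
fork_meta = ProcessedNodeMetadata(launcher=lambda task: task())

assert if_meta.launcher is None and len(if_meta.block_runners) == 2
assert fork_meta.launcher(lambda: 42) == 42
```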
Differential Revision: D37320704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79961
Approved by: https://github.com/mikeiovine
Per many C++ code-style guides (for [example](https://google.github.io/styleguide/cppguide.html#Enumerator_Names)), members of an `enum` should be CamelCased,
and only defines should be ALL_CAPS.
Changes `MemOverlap`, `MemOverlapStatus` and `CmpEvalResult` enum values.
Also, `YES`, `NO`, `TRUE` and `FALSE` are often system defines.
Fixes, among other things, the current iOS build regression, which manifests as follows (see [this](6e90572bb9)):
```
/Users/runner/work/pytorch/pytorch/aten/src/ATen/MemoryOverlap.h:19:29: error: expected identifier
enum class MemOverlap { NO, YES, TOO_HARD };
^
/Applications/Xcode_12.4.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator14.4.sdk/usr/include/objc/objc.h:89:13: note: expanded from macro 'YES'
#define YES __objc_yes
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79772
Approved by: https://github.com/drisspg, https://github.com/kulinseth
Summary:
- Remove creation of new `StaticModuleOptions` for the forked subgraph. Use the parent graph's options when creating the runtime for the forked subgraph.
- `StaticRuntimeMetadata` extends `CustomClassHolder`, so it can be cast to `IValue` and attached to the IR node's attributes.
Differential Revision: D37159684
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79578
Approved by: https://github.com/mikeiovine
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75775
fbgemm already implements the fused kernel; there's no reason not to use it
ghstack-source-id: 155450342
Test Plan: New unit tests
Reviewed By: navahgar
Differential Revision: D35633297
fbshipit-source-id: a744a33a65ce7dbb9ce8900dbe091b6d56dd4e48
(cherry picked from commit b1361b349862715aa17e6318c5e658cd6401a464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774
`list[0:]` is a no-op. This should really be eliminated on the modeling side; implement it as a graph pass for now until we can get this fix into prod models.
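The pattern is easy to check in plain Python: a full slice yields an equal list, so the op contributes nothing to the result and only costs a copy.

```python
# Slicing a list from index 0 to the end is a semantic no-op...
lst = [1, 2, 3]
assert lst[0:] == lst        # same contents
assert lst[0:] is not lst    # ...but it allocates a fresh copy,
                             # which eliminating the op avoids
```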
Test Plan: New unit tests
Reviewed By: navahgar
Differential Revision: D35632947
fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d
(cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993
Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0
{F723683014}
More details in https://fb.quip.com/MKumAjz1YD4 (1f47a80e88)a#temp:C:FPD3 (ecd5567980)e5a0871ae5d481286b511ef7
The last 3 outputs of embedding_bag are unused in the graph: P495814049.
* max_indices output isn't necessary for the main output, so remove it when it's not used in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now.
Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`
Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`
Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
0.11156 ms. 99.0457%. aten::embedding_bag (52 nodes, out variant)
After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
0.0789221 ms. 98.7096%. static_runtime::embedding_bag (52 nodes, out variant)
* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/
Reviewed By: mikeiovine
Differential Revision: D35726594
fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)
Summary:
The [comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims it was added for consistency with the top-level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.
Fix violations in 50+ files that were added in the interim, either by removing unused variables or by decorating the code with `C10_UNUSED` when a local variable is likely used to extend an object's lifetime until the end of the block.
The suppressed warnings caused a preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538
Reviewed By: anjali411
Differential Revision: D35747333
Pulled By: malfet
fbshipit-source-id: 3fc5828e44a4c05ba0e89e92613e6ebbdb260626
(cherry picked from commit c179fba21cfa2a0093fad50ccad5a22dd7cff52c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75807
There is a tension in RecordFunction between two use cases:
1) In the normal eager path we don't run any callbacks, so we need to bail out of the profiling path as soon as possible to minimize eager overhead.
2) When profiling we want to determine which callbacks to run as efficiently as possible to minimize instrumentation overhead.
The confounding factor in all of this is sampling callbacks because they change which callbacks will run on each call, even in steady state operation. This has traditionally been handled with a two stage procedure: first we flip a coin to determine if a sampled callback *might* run. If false (which it usually is), do nothing. This solves (1). If true, check to see if we need to build the full callback set or if it was a false positive. This procedure has two negative effects:
* It forces us to rebuild the set of callbacks to run on every step when profiling
* It leaks the sampling abstraction, requiring other parts of the code to bump certain values and forces RecordFunction to lazily initialize.
This change introduces a multi-level cache which can (in the common case) quickly determine which callbacks *will* run, rather than if callbacks *might* run. This means that rather than call `shouldRunRecordFunction`, we can simply get the callbacks for an invocation and check if they are empty. (And completely removes the pre-sampling heuristic.) Another major benefit of the new cache structure is that it allows thread-safe registration and unregistration of global callbacks.
It's worth briefly discussing how this maintains eager performance. In the standard eager case (only sampling callbacks registered) the cache first checks that the global callbacks haven't changed (atomic read), decrements a counter to see if a sampling callback fired, and then returns the active callbacks, which are simply a `SmallVector` of pointer pairs and a couple of POD values (scope, needs inputs/outputs/ids). The biggest cost according to perf is the `SmallVector` logic; we could consider adopting a hard limit on active callbacks, since more than half a dozen callbacks *running* in a single step would be quite a lot. But the total cost relative to `PYTORCH_DISABLE_PER_OP_PROFILING` is only ~10ns, so it's debatable whether switching to `std::array` is worth it.
The primary change is in `record_function.cpp`, which has a more detailed description of the new cache structure. `record_function.h` has some minor changes to align with the new calling convention and the remaining files are simply changes to the call sites.
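The caching idea can be sketched in a few lines (hypothetical names, and a simplification of the real design: this sketch omits the per-step sampling countdown and only re-decides sampled callbacks when the registry version changes): a global registry bumps a version counter on mutation, and each caller caches the set of callbacks that *will* run.

```python
# Sketch of a version-checked callback cache: the common case is an
# equality check on the version counter plus returning a cached list.
import random

class Registry:
    def __init__(self):
        self.version = 0
        self.callbacks = []          # (fn, sampling_prob) pairs

    def add(self, fn, prob=1.0):
        self.callbacks.append((fn, prob))
        self.version += 1            # invalidates every cache

class CallbackCache:
    def __init__(self, registry):
        self.registry = registry
        self.version = -1
        self.active = []

    def get_active(self):
        if self.version != self.registry.version:
            # Rebuild: decide *now* which callbacks will run.
            self.active = [fn for fn, prob in self.registry.callbacks
                           if prob >= 1.0 or random.random() < prob]
            self.version = self.registry.version
        return self.active           # common case: one compare + return

reg = Registry()
cache = CallbackCache(reg)
assert cache.get_active() == []      # nothing registered yet
reg.add(lambda: "observer")
assert len(cache.get_active()) == 1  # rebuilt once after the version bump
```

Checking whether the returned list is empty then replaces the old `shouldRunRecordFunction`-style "might run" query.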
Future work:
* RecordFunction no longer needs to be lazily initialized.
* We can deprecate the disable/reenable APIs, since we can now safely add and remove global callbacks.
Test Plan:
I tested eager mode performance using the overhead benchmark and found that the non-profiled path was unaffected. However the no-op observer dropped from 0.41us to 0.37us (0.25us if no observers are active) which is about 1/3rd reduction in the cost of the callback selection machinery.
I also added several C++ unit tests, as the core RecordFunction machinery (especially sampling) was largely untested.
Reviewed By: swolchok, davidberard98
Differential Revision: D35276158
fbshipit-source-id: 35135f444724fba4eb97c0ae7f3f710f0f9016fd
(cherry picked from commit 9e359b87422c18f2a195185f32e7e85c82f956fd)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74730
Motivation: I am working on implementing a new, more efficient memory planning algorithm. This algorithm cannot replace the old one entirely, because it can only be practically done for models that have sample inputs to warm up with. We need a way to make the memory planner's strategy extensible.
My first pass attempt at implementing the new algorithm crammed everything into the same class, but it became a nightmare to manage (a ton of `if (use_new_strategy)` statements everywhere). Additionally, it was a little clumsy since there are some concepts that make sense for one algorithm but not the other (like `StorageGroup`).
It's much cleaner if we instead turn `MemoryPlanner` into an abstract base class and have different subclasses implement their strategies in `allocateManagedTensors` and `deallocateManagedTensors`.
ghstack-source-id: 153288210
Test Plan: Existing unit tests
Reviewed By: navahgar, hlu1
Differential Revision: D35132124
fbshipit-source-id: c5ef5ae6361b44dedf97090201e244a76e1e6bce
(cherry picked from commit c96f6827c8db88f28c4eb379865ad208beae2034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966
It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this:
```
def forward(self, cond: bool, a, b):
lst = []
if cond:
res = a + b # res should not be managed!!!
lst.append(res)
return lst
```
The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`.
The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`.
This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that, if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is not true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy.
ghstack-source-id: 153288188
Test Plan: New unit tests cover the problematic case
Reviewed By: navahgar
Differential Revision: D35257087
fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7
(cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74927
The move ctor was broken because `BlockRunner` stores a reference to `values_`. When moving runtime instances, the pointer to the root block would be moved, but the reference inside it would not be updated.
Pass `BlockRunner` a raw pointer to the heap-allocated IValues instead to avoid this issue.
ghstack-source-id: 153168602
Test Plan: New unit test/CI
Reviewed By: navahgar
Differential Revision: D35228467
fbshipit-source-id: 04e198b39f898b82677a0e41e1cdf00c2b0c09f3
(cherry picked from commit 03e2c591ac3a907d68025eae9500ed7226dec17e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481
This diff fixes an interesting performance issue related to `permute_copy`.
We see this pattern frequently:
```
y = torch.permute(x, (0, 2, 1))
z = torch.sum(y, dim=-1)
```
With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589
But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597
But the permute is actually unnecessary, we can just statically turn the graph into this to ensure that the fast kernel is hit with copy variants on:
```
z = torch.sum(x, dim=1)
```
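The rewrite relies on a simple identity: summing a permuted tensor over its last dimension equals summing the original over the dimension that was moved there. A plain-Python check on a nested-list "tensor" (no torch involved):

```python
# x has shape (1, 3, 2); permute(x, (0, 2, 1)) has shape (1, 2, 3).
x = [[[1, 2], [3, 4], [5, 6]]]
y = [[[x[i][k][j] for k in range(3)] for j in range(2)] for i in range(1)]

# z1 = sum(y, dim=-1) and z2 = sum(x, dim=1) -- both shape (1, 2)
z1 = [[sum(row) for row in plane] for plane in y]
z2 = [[sum(x[i][k][j] for k in range(3)) for j in range(2)] for i in range(1)]
assert z1 == z2 == [[9, 12]]
```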
ghstack-source-id: 152003888
Reviewed By: navahgar
Differential Revision: D34992319
fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f
(cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606
The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
ghstack-source-id: 151394116
Test Plan: Existing unit tests
Reviewed By: hlu1
Differential Revision: D34562131
fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73681
Static runtime is rejecting legal calls made with the kwargs API when there are parameters with default values.
ghstack-source-id: 150433627
Test Plan: Added unit test to cover this case
Reviewed By: navahgar, d1jang
Differential Revision: D34588804
fbshipit-source-id: 74d7ef5bee74f9d16b02b0c8ceda4285ea776755
(cherry picked from commit 9c3db19cb45f6022e646deeb1e8056daa04f363f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73536
Currently, the `ProcessedNode` class assumes two distinct roles that are not too obvious:
1) a "template" that contains the metadata of an actual executable node, owned by `StaticModule`
2) fully instantiated nodes that are owned by `StaticRuntime`
We currently merge these two use cases into one class, which can be error-prone if illegal copying happens uncontrollably. Currently, we only copy objects of kind (1) into objects of kind (2) when a `StaticRuntime` instance is created.
To address this issue, this change introduces `StaticNodeInfo`, a separate class, to distinguish the aforementioned two use cases more clearly in the code. With this, `StaticNodeInfo` is for (1) and `ProcessedNode` is for (2).
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D33985600
fbshipit-source-id: 0c79cea2bf982dd956a35f48eaf6027e5b6e390c
(cherry picked from commit 0d8acc4a2b6eeb3e4af3ad2c99f4cd667680f8df)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72946
The passes that replace ops with copy variants run after TensorExpr fusion. As a result, the graph they would produce does not conform to the assumptions made in the fuser.
So, even if the `use_copy_variants` and `use_maybe_copy_variants` flags are turned on, the corresponding passes will not be executed if TensorExpr fusion is enabled.
ghstack-source-id: 149429753
Test Plan: Tested locally.
Reviewed By: mikeiovine
Differential Revision: D34283842
fbshipit-source-id: 74edea517a00c85dff0319f9c8b3ac8befe09018
(cherry picked from commit 3798af7f1b8c9b3c072862f58ebf16af6294db14)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73032
Currently, `ptvsc2_predictor_bench` reports nothing when the input size is zero. However, Static Runtime's module creation yields some useful information even when only a model has been loaded.
This change reports static op statistics when the given input's size is zero. In addition, it enables reporting the out-variant coverage percentage, which is crucial for establishing the baseline performance of Static Runtime.
Test Plan: - Ran `ptvsc2_predictor_bench` with this change as seen above.
Reviewed By: mikeiovine
Differential Revision: D34294803
fbshipit-source-id: 80c02199075dae9280657d6edecc7c679c1c27f4
(cherry picked from commit 83aec141a25a9ede5d22e5c17c0b6b07307faf39)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587
This pattern frequently appears in a few graphs:
```
%result = prim::If(%condition)
block0():
-> (%a)
block1():
-> (%b)
```
This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.
This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:
```
%result = prim::IfThenElse(%condition, %a, %b)
```
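The rewrite above can be mirrored in plain Python: a two-block `prim::If` that merely forwards `%a` or `%b` is exactly a ternary select, with no sub-blocks (and hence no block runners or memory planners) to manage.

```python
# The original pattern: an if with two trivial forwarding blocks.
def if_blocks(condition, a, b):
    if condition:      # block0
        result = a
    else:              # block1
        result = b
    return result

# The fused op: a single ternary select.
def if_then_else(condition, a, b):
    return a if condition else b

for cond in (True, False):
    assert if_blocks(cond, "a", "b") == if_then_else(cond, "a", "b")
```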
Test Plan: New unit tests
Reviewed By: eellison
Differential Revision: D34091789
fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b83f402bda2dcdd9ecb411e0b67c651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72592
Only code paths that are not perf-critical read `ProcessedNode::num_outputs_`, and the value is a static feature of the op that the `ProcessedNode` instance is executing.
Therefore, it's better to move `ProcessedNode::num_outputs_` into `ProcessedFunction::num_outputs_` and let `ProcessedNode` access it via `ProcessedNode::fn_` for its occasional use. Note that this avoids duplicating `num_outputs_` per node and per Static Runtime instance, since `ProcessedFunction` instances are shared across all runtime instances.
It's confirmed that this change reduces `sizeof(ProcessedNode)` by 14%, per local instrumentation:
- Before: `sizeof(ProcessedNode)`: 56
- After: `sizeof(ProcessedNode)`: 48
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: mikeiovine
Differential Revision: D33984792
fbshipit-source-id: e29ffc97b799e679215f42e1e85cd3fcd7e88983
(cherry picked from commit 0f7003f4dfd6473a70355ca3c6f51498abf1d7be)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71102
This graph pass is causing a major perf regression on some models. Ideally we would introduce maybe_copy variants for all these ops. But since those are tricky to write, I've introduced a flag to just turn the pass off for now.
ghstack-source-id: 148541673
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: navahgar
Differential Revision: D33510080
fbshipit-source-id: bb4847f26561197ea5e6bbad0a4d25db4ef468eb
(cherry picked from commit 8f333d3e8138e2a7ba04bea7509ad84dd97844eb)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71807
There's no need to completely disallow `aten::__is__` and `aten::__isnot__`. The only problematic case is when the comparison is between two tensors, e.g. in
```
def forward(x):
y = x.detach()
# Should be false, but we get True
# after our EliminateNoOps pass
return x is y
```
Test Plan: New unit test covers this case
Reviewed By: d1jang
Differential Revision: D33783668
fbshipit-source-id: c9f57fa96937ecce38a21554f12b69c45cc58fe4
(cherry picked from commit 019588f4ca3fcd2b3ae51bccab102f0538745b15)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838
Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186483
Test Plan: New unit tests: `buck test caffe2/benchmark/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33049595
fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532
(cherry picked from commit 8baf1753af34f4c166b4680e42589517fd2e508d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837
Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186475
Test Plan:
New unit tests: `buck test caffe2/benchmarks/static_runtime/...`
Accuracy test at top of stack
Reviewed By: d1jang
Differential Revision: D33045908
fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55
(cherry picked from commit de3b12dc0871e8ca09891c257e1dfd7cd352aa7c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69836
It is technically possible for the sub-blocks to return zero outputs. This is problematic for `StaticRuntimeBlockRunner`, because it assumes that at least one output is being returned.
Rather than slowing down SR with special logic for this corner case, we can simply force these sub-blocks to return `None`.
ghstack-source-id: 148186453
Test Plan: Sub-blocks with no return values tested at top of stack
Reviewed By: d1jang
Differential Revision: D33050420
fbshipit-source-id: 17d9e19fda6431aa9fd0b155131349bac42bc149
(cherry picked from commit c97fd07bf53e1e253a0e6c733db5ea7c86698fc9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835
`StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`.
Note that this also applies to returned values in sub-blocks that are from outer scopes.
ghstack-source-id: 148186452
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
Added a new unit test with a graph that simply returns a constant.
Tests with sub-blocks at top of stack.
Reviewed By: d1jang
Differential Revision: D33047519
fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b
(cherry picked from commit a8f89a12ee726aa7d7e546dee25d696eef868ce7)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71986
To address concerns over space increase from control flow.
`op_name_` was only stored as a minor optimization to avoid name lookup during logging; we can safely get rid of it. Thanks to the sampling mechanism, `get_op_name()` is called very infrequently, so this shouldn't cause too much of a regression.
ghstack-source-id: 148086244
Test Plan: CI
Reviewed By: d1jang
Differential Revision: D33821005
fbshipit-source-id: 6f74eb30a54a046ca90768aebbcde22e8c435f35
(cherry picked from commit 361ba32e97dbd130938bae10b5159730822c518c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834
* Modify the `StaticModule` constructor to handle index initialization for sub-blocks.
* Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes!
ghstack-source-id: 148086245
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33028039
fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6
(cherry picked from commit 3a9feffd929869120c717d35aa55aad8a382783d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70944
Added special net-level/op-level scopes for static runtime. We can use these to add special behavior in record functions when they are invoked from a static runtime context.
Reviewed By: navahgar
Differential Revision: D33458211
fbshipit-source-id: 0b7022100e9f5ac872f4cb5bfba14e92af2c71b0
(cherry picked from commit b486548544c5e822803071756c85e675e37d2dad)