pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-22 06:11:27 +08:00

Author	SHA1	Message	Date
Mike Iovine	74fe57079f	[SR] Add new `fb::split_and_squeeze` op (#73252 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73252 Test Plan: New unit tests Reviewed By: d1jang Differential Revision: D34399345 fbshipit-source-id: 80cbf3e1556f2152576a07d1eeb3bb6ef97096cf (cherry picked from commit 259956749dad88f0960eecb60135cd466f1b56f4)	2022-03-02 19:31:42 +00:00
Raghavan Raman	2724e4c039	[Static Runtime] Do not replace with copy variants if TE fuser is enabled (#72946 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72946 The passes to replace with copy variants are run after TensorExpr fusion. Due to this the resulting graph does not conform to the assumptions made in the fuser. So, even if these flags `use_copy_variants`, `use_maybe_copy_variants` are turned on, the corresponding passes will not be executed if TensorExpr fusion is enabled. ghstack-source-id: 149429753 Test Plan: Tested locally. Reviewed By: mikeiovine Differential Revision: D34283842 fbshipit-source-id: 74edea517a00c85dff0319f9c8b3ac8befe09018 (cherry picked from commit 3798af7f1b8c9b3c072862f58ebf16af6294db14)	2022-02-18 18:34:50 +00:00
Don Jang	39fb771423	[Static Runtime] Report static op statistics from graph when input size is zero (#73032 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73032 Currently, ptvsc2_predictor_bench reports nothing when the input size is zero. However, Static Runtime's module creation has some useful information even after loading a model. This change reports static op statistics when the given input's size is zero. In addition to that, this enables it to report the out variant coverage percentage, which is crucial to establish the baseline performance of Static Runtime. Test Plan: - Ran `ptvsc2_predictor_bench` with this change as seen above. Reviewed By: mikeiovine Differential Revision: D34294803 fbshipit-source-id: 80c02199075dae9280657d6edecc7c679c1c27f4 (cherry picked from commit 83aec141a25a9ede5d22e5c17c0b6b07307faf39)	2022-02-17 23:58:32 +00:00
Mike Iovine	d1c5f9e439	[JIT][SR] Introduce prim::IfThenElse (#72587 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587 This pattern frequently appears in a few graphs: ``` %result = prim::If(%condition) block0(): -> (%a) block1(): -> (%b) ``` This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation. This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator: ``` %result = prim::IfThenElse(%condition, %a, %b) ``` Test Plan: New unit tests Reviewed By: eellison Differential Revision: D34091789 fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c (cherry picked from commit 0f1b335e5b83f402bda2dcdd9ecb411e0b67c651)	2022-02-17 18:22:48 +00:00
Don Jang	5ea74b4996	[Static Runtime] Remove ProcessedNode::num_outputs_ (#72592 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72592 Only code paths that are not perf-critical read `ProcessedNode::num_outputs_`, and it is a static feature of the op that the `ProcessedNode` instance is executing. Therefore, it's better to move `ProcessedNode::num_outputs_` into `ProcessedFunction::num_outputs_` and let `ProcessedNode` access it via `ProcessedNode::fn_` for its occasional use. Note that this prevents duplicating num_outputs_ per node & per Static Runtime instance since `ProcessedFunction` instances are shared across all runtime instances. Local instrumentation confirms that this change reduces `sizeof(ProcessedNode)` by 14%: - Before -- sizeof(ProcessedNode): 56 - After -- sizeof(ProcessedNode): 48 Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: mikeiovine Differential Revision: D33984792 fbshipit-source-id: e29ffc97b799e679215f42e1e85cd3fcd7e88983 (cherry picked from commit 0f7003f4dfd6473a70355ca3c6f51498abf1d7be)	2022-02-17 05:09:17 +00:00
Mike Iovine	d51d2bd608	[SR] Add a flag to disable copy variants (#71102 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71102 This graph pass is causing a major perf regression on some models. Ideally we would introduce maybe_copy variants for all these ops. But since those are tricky to write, I've introduced a flag to just turn the pass off for now. ghstack-source-id: 148541673 Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: navahgar Differential Revision: D33510080 fbshipit-source-id: bb4847f26561197ea5e6bbad0a4d25db4ef468eb (cherry picked from commit 8f333d3e8138e2a7ba04bea7509ad84dd97844eb)	2022-02-08 02:43:07 +00:00
Mike Iovine	cff5e22a72	[SR] Relax aten::__is__ constraint for SR enablement (#71807 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71807 There's no need to completely disallow `aten::__is__` and `aten::__isnot__`. The only problematic case is when the comparison is between two tensors, e.g. in ``` def forward(x): y = x.detach() # Should be false, but we get True # after our EliminateNoOps pass return x is y ``` Test Plan: New unit test covers this case Reviewed By: d1jang Differential Revision: D33783668 fbshipit-source-id: c9f57fa96937ecce38a21554f12b69c45cc58fe4 (cherry picked from commit 019588f4ca3fcd2b3ae51bccab102f0538745b15)	2022-02-03 12:18:46 +00:00
Mike Iovine	2d5296b0e7	[SR] Implement prim::Loop (#69838 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838 Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction. ghstack-source-id: 148186483 Test Plan: New unit tests: `buck test caffe2/benchmark/static_runtime/...` Reviewed By: d1jang Differential Revision: D33049595 fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532 (cherry picked from commit 8baf1753af34f4c166b4680e42589517fd2e508d)	2022-02-02 19:30:50 +00:00
Mike Iovine	2aa699505d	[SR] Implement prim::If (#69837 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837 Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction. ghstack-source-id: 148186475 Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime/...` Accuracy test at top of stack Reviewed By: d1jang Differential Revision: D33045908 fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55 (cherry picked from commit de3b12dc0871e8ca09891c257e1dfd7cd352aa7c)	2022-02-02 19:30:50 +00:00
Mike Iovine	d2599701fd	[SR] Force sub-blocks to return at least one output (#69836 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69836 It is technically possible for the sub-blocks to return zero outputs. This is problematic for `StaticRuntimeBlockRunner`, because it assumes that at least one output is being returned. Rather than slowing down SR with special logic for this corner case, we can simply force these sub-blocks to return `None`. ghstack-source-id: 148186453 Test Plan: Sub-blocks with no return values tested at top of stack Reviewed By: d1jang Differential Revision: D33050420 fbshipit-source-id: 17d9e19fda6431aa9fd0b155131349bac42bc149 (cherry picked from commit c97fd07bf53e1e253a0e6c733db5ea7c86698fc9)	2022-02-02 19:30:50 +00:00
Mike Iovine	238dded10f	[SR] Graph pass to create owned refs of special IValues (#69835 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835 `StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`. Note that this also applies to returned values in sub-blocks that are from outer scopes. ghstack-source-id: 148186452 Test Plan: `buck test caffe2/benchmarks/static_runtime/...` Added a new unit test with a graph that simply returns a constant. Tests with sub-blocks at top of stack. Reviewed By: d1jang Differential Revision: D33047519 fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b (cherry picked from commit a8f89a12ee726aa7d7e546dee25d696eef868ce7)	2022-02-02 19:30:50 +00:00
Mike Iovine	3dce68fdf4	[SR] Eliminate op_name_ in ProcessedNode (#71986 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71986 This addresses concerns over the space increase from control flow. `op_name_` was only stored as a minor optimization to avoid name lookup during logging, so we can safely get rid of it. Thanks to the sampling mechanism, `get_op_name()` is called very infrequently, so this shouldn't cause too much of a regression. ghstack-source-id: 148086244 Test Plan: CI Reviewed By: d1jang Differential Revision: D33821005 fbshipit-source-id: 6f74eb30a54a046ca90768aebbcde22e8c435f35 (cherry picked from commit 361ba32e97dbd130938bae10b5159730822c518c)	2022-02-01 21:22:26 +00:00
Mike Iovine	4b789df68b	[SR] Add BlockRunner and handle sub-blocks (#69834 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834 * Modify the `StaticModule` constructor to handle index initialization for sub-blocks. * Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes! ghstack-source-id: 148086245 Test Plan: `buck test caffe2/benchmarks/static_runtime/...` Reviewed By: d1jang Differential Revision: D33028039 fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6 (cherry picked from commit 3a9feffd929869120c717d35aa55aad8a382783d)	2022-02-01 17:20:55 +00:00
Mike Iovine	a49f2412e4	[SR] Add static runtime scopes to record function (#70944 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70944 Added special net-level/op-level scopes for static runtime. We can use these to add special behavior in record functions when they are invoked from a static runtime context. Reviewed By: navahgar Differential Revision: D33458211 fbshipit-source-id: 0b7022100e9f5ac872f4cb5bfba14e92af2c71b0 (cherry picked from commit b486548544c5e822803071756c85e675e37d2dad)	2022-01-27 18:00:08 +00:00
Mike Iovine	7e6312a5df	[SR] Reverse iteration order in resetMemory (#71705 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71705 This fixes a crash in `resetMemory` caused by trying to access a `TensorImpl` via a borrowed `IValue` after it had already been destroyed. We need to clean up all borrows before we destroy the owning `IValue`, not after. ghstack-source-id: 147688982 Test Plan: New unit test covers this case ICE w/ inline_cvr v0 [finishes successfully](https://www.internalfb.com/intern/unidash/dashboard/ads_infra_cost_estimation/a_metrics/?e[select_ESTIMATION_RUN_ID]=ICE_mikeiovine_16431103211c65), didn't see any nnpi errors Reviewed By: ajyu Differential Revision: D33725435 fbshipit-source-id: f8dd109382b5cf54df6f194f8dcb5c0812b174bb (cherry picked from commit 31339d9d38e63248d2ac3646be71008ed731f16c)	2022-01-26 17:35:03 +00:00
Scott Wolchok	3a77fb244b	[PyTorch][Static Runtime] Delete cleanup_activations option (#71501 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501 This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it. ghstack-source-id: 147385569 Test Plan: CI, rerun broken test from task Reviewed By: mikeiovine Differential Revision: D33669290 fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a (cherry picked from commit 5e4fefb109b6c92d59fc7e24d69f1b6b2780c776)	2022-01-21 18:15:43 +00:00
Scott Wolchok	1bbea3c3a2	[PyTorch][JIT] Support mayContainAlias(Value, ArrayRef<Value>) (#69853 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69853 We can implement this overload more efficiently. ghstack-source-id: 146924693 Test Plan: patched alias_analysis tests Time reported to initialize a predictor by static runtime when given ctr_mobile_feed local_ro net is 9.5s instead of 10.5s. Reviewed By: mikeiovine Differential Revision: D33039731 fbshipit-source-id: 52559d678e9eb00e335b9e0db304e7a5840ea397	2022-01-12 16:53:54 -08:00
Scott Wolchok	cd253938a9	[PyTorch][SR][easy] s/input_or_constant_aliases/external_aliases/ (#69852 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69852 Looks like a stale comment. ghstack-source-id: 146924694 Test Plan: review Reviewed By: hlu1 Differential Revision: D33033264 fbshipit-source-id: aa0eff463c42716bdd7142d4662d8668af439f68	2022-01-12 16:52:26 -08:00
Elias Ellison	9bccb31306	Remove precise tuple construct flag (#71121 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71121 Test Plan: Imported from OSS Reviewed By: d1jang Differential Revision: D33515234 Pulled By: eellison fbshipit-source-id: 57cfe171b583a6bb4d3493a34b159061e97a11b8	2022-01-11 22:12:36 -08:00
Scott Wolchok	10b40acbdb	[PyTorch][Static Runtime] Fast aliasing in select_tensor by manual borrowing (#68122 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122 See code comments for details; in brief, we repurpose support for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor` output a borrowed IValue that we have to clean up manually. If we have any other ops that always create a new reference to an existing Tensor, we can easily apply this same optimization. ghstack-source-id: 146482212 Test Plan: See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421 (local is neutral: P467267554) --do_profile output for local_ro (updated Dec 10): ``` swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt First iter time: 0.989023 ms Number of operators: 2037 Total number of managed tensors: 1597 Total number of managed output tensors: 0 Total number of unmanaged values: 2568 Number of unmanaged values requiring cleanup: 2568 Number of unmanaged values not requiring cleanup: 0 Total memory managed: 50368 bytes Total number of reused tensors: 1010 Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%) swolchok@devbig032 /d/u/s/f/fbcode> ttail TMCC^C swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt First iter time: 0.994703 ms Number of operators: 2551 Total number of managed tensors: 1146 Total number of managed output tensors: 0 Total number of unmanaged values: 4047 Number of unmanaged values requiring cleanup: 3533 Number of unmanaged values not requiring cleanup: 514 Total memory managed: 50048 bytes Total number of reused tensors: 559 Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%) ``` for local: (also Dec 10): ``` ==> Stable.local.profile.txt <== First iter time: 9.0909 ms Number of operators: 1766 Total number of managed tensors: 1894 Total number of managed output tensors: 0 Total number of unmanaged values: 2014 Number of 
unmanaged values requiring cleanup: 2014 Number of unmanaged values not requiring cleanup: 0 Total memory managed: 4541440 bytes Total number of reused tensors: 847 Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%) ==> TMCOFastAliasing.local.profile.txt <== First iter time: 7.5512 ms Number of operators: 2378 Total number of managed tensors: 1629 Total number of managed output tensors: 0 Total number of unmanaged values: 3503 Number of unmanaged values requiring cleanup: 2891 Number of unmanaged values not requiring cleanup: 612 Total memory managed: 3949312 bytes Total number of reused tensors: 586 Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%) ``` Reviewed By: hlu1 Differential Revision: D32318674 fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783	2022-01-04 22:36:13 -08:00
Scott Wolchok	4d8fc8693c	[PyTorch][Static Runtime] Support memory planning for torch.to() w/o requiring copying (#67223 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223 ghstack-source-id: 146482215 Test Plan: See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421 (local is neutral: P467267554) Reviewed By: hlu1 Differential Revision: D31776259 fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3	2022-01-04 22:36:10 -08:00
Scott Wolchok	1507ce90b2	[PyTorch][Static Runtime] Avoid managed output tensor DCHECK (#67221 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67221 Update memory leak checks to not require that output tensors are cleaned up. ghstack-source-id: 146464297 Test Plan: Tests should still pass; reviewers to confirm that this is OK in principle Reviewed By: d1jang Differential Revision: D31847567 fbshipit-source-id: bb7ff2f2ed701e2d7de07d8032a1281fccabd6a9	2022-01-04 22:36:07 -08:00
Peter Bell	fa09099ba3	Codegen: TraceType only includes operators being registered (#68691 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691 TraceType is a sharded file, so by only including specific operator headers, we ensure that changing one (non-method) operator only needs one shard to be re-compiled. This also changes all the included autograd and jit headers from including `ATen/ATen.h` to just including `ATen/core/Tensor.h`. Test Plan: Imported from OSS Reviewed By: gchanan Differential Revision: D33336948 Pulled By: albanD fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113	2022-01-02 13:09:19 -08:00
Mike Iovine	682fab19d4	[SR] verify_and_correct_memory_overlap handles tensor lists (#69774 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69774 We recently ran into a nasty bug caused by incorrect schema annotations on an `aten::split` overload. `verify_and_correct_memory_overlap` is supposed to prevent crashes in this scenario, but it didn't because it did not handle `Tensor[]` outputs. This change extends the memory correction mechanism to handle tensor lists. ghstack-source-id: 146152478 Test Plan: `buck test caffe2/benchmarks/static_runtime/...` Reviewed By: hlu1 Differential Revision: D33022494 fbshipit-source-id: 8d1d41ca1d4fd5dfb7c8a66028c391ba63551eb0	2021-12-22 17:18:18 -08:00
Raghavan Raman	a6f953156e	[StaticRuntime] Add TensorExpr fusion with dynamic shapes in SR (#69475 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69475 This diff adds TensorExpr fusion with dynamic shapes in SR. This includes tracing the input graph with sample inputs, and then performing fusion with generalization to get fused graphs with dynamic shapes. ghstack-source-id: 146059043 Test Plan: ``` buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test ``` Reviewed By: d1jang Differential Revision: D32320088 fbshipit-source-id: 397f498878ddfcee9dad7a839652f79f034fefe3	2021-12-21 12:41:02 -08:00
Raghavan Raman	91da2d5fa1	[StaticRuntime] Refactor StaticModule to pass in sample inputs (#69473 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69473 This diff refactors StaticModule and its uses to pass in sample inputs. These inputs need to be passed into the constructor because they are needed to perform TensorExpr fusion before other optimizations are performed on the input graph. ghstack-source-id: 146059041 Test Plan: buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test Reviewed By: donaldong Differential Revision: D32320084 fbshipit-source-id: b8bd46d442be4cc90ca60f521e0416fdb88eea60	2021-12-21 11:20:25 -08:00
Peter Bell	ef70174f2e	Separate c10::Symbol header from list of interned strings (#69406 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69406 Most files that include `interned_strings.h` don't actually depend on anything generated from `FORALL_NS_SYMBOLS`, yet because everything is in a single file they need to be recompiled whenever a new symbol is added. Here I move the class definition into a separate file so this doesn't happen. Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D32923637 Pulled By: albanD fbshipit-source-id: 6e488cbfcfe2c041a99d9ff22e167dbddf3f46d7	2021-12-19 14:52:26 -08:00
Nikita Shulga	26e32988bd	Revert D32596264: Codegen: TraceType only includes operators being registered Test Plan: revert-hammer Differential Revision: D32596264 (`e66a8ab4f5`) Original commit changeset: 2f28b62d7b99 Original Phabricator Diff: D32596264 (`e66a8ab4f5`) fbshipit-source-id: 7d18c4e77ce30dd7817a95f9c39b565cb246cd12	2021-12-17 11:20:12 -08:00
Peter Bell	e66a8ab4f5	Codegen: TraceType only includes operators being registered (#68691 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691 TraceType is a sharded file, so by only including specific operator headers, we ensure that changing one (non-method) operator only needs one shard to be re-compiled. This also changes all the included autograd and jit headers from including `ATen/ATen.h` to just including `ATen/core/Tensor.h`. Test Plan: Imported from OSS Reviewed By: jbschlosser, malfet Differential Revision: D32596264 Pulled By: albanD fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621	2021-12-17 10:35:05 -08:00
Mike Iovine	873585da2b	[SR] Improve set_inputs (#69087 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69087 This diff includes a variety of improvements to `set_inputs` to unify behavior with `torch::jit::Module`: 1. Eliminate code duplication between rvalue/lvalue overloads 2. Add type checks 3. Make input length check a `TORCH_CHECK` instead of a debug check - we have to fail when the wrong number of inputs are passed. 4. `schema` now always includes `self`, even if we release `module_`. This is consistent with `torch::jit::Module`. ghstack-source-id: 145599837 Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D32711705 fbshipit-source-id: fe97c10b4f03801ba59868b452e7d02b26b3106b	2021-12-15 09:31:19 -08:00
Mike Iovine	c6c3b43498	[SR][easy] Accessors for value array offsets (#69755 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69755 Per swolchok's suggestion on D32609915 (`1c43b1602c`). Hide the value offset indices behind accessors to provide more flexibility if we ever decide to change the layout of the values array. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D32838145 fbshipit-source-id: cf805c077672de4c2fded9b41da01eca6d84b388	2021-12-13 15:31:39 -08:00
Hao Lu	0420de3539	[SR] Log SR options (#69809 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69809 SR options are only printed out once per model per net. Logging them is actually pretty helpful for debugging. Test Plan: CI Reviewed By: donaldong Differential Revision: D33046814 fbshipit-source-id: 536b34e00fbc8a273c5eb4d8ae5caca0dc1f4c24	2021-12-12 16:32:00 -08:00
Hao Lu	a5996a6857	[SR] Wrap check_for_memory_leak with DCHECK (#69588 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69588 Code cleanup Reviewed By: mikeiovine Differential Revision: D32938333 fbshipit-source-id: d15dc405b281411c4c3c27a1dabf82f430c3ed08	2021-12-09 22:11:21 -08:00
Don Jang	9aa1b3e396	[Static Runtime] [Code Cleanup] Encapsulate function objects within ProcessedFunction (#69595 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595 This change encapsulates the `function` object in `ProcessedFunction` objects instead of exposing it unnecessarily just for executing it. Test Plan: Existing tests Reviewed By: mikeiovine Differential Revision: D32908341 fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364	2021-12-09 15:11:03 -08:00
Mike Iovine	1c43b1602c	[SR] Scope exit guard for memory planner deallocation (#68795 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68795 This change improves static runtime exception safety. Added a scope exit guard that invokes `MemoryPlanner::deallocate` in its destructor. Caveat: we have to be really careful with the exception behavior of `MemoryPlanner::deallocate` and `MemoryPlanner`'s constructor, because they're now both potentially called in the destructor of the scope exit guard. Letting exceptions potentially escape destructors is playing with fire since 1) the destructor of `Deallocator` is (implicitly) `noexcept`, 2) even if it wasn't, `std::terminate` will be called if an exception escapes and the stack is already unwinding. To get around this, we wrap the deallocation stuff in a try/catch. If deallocation throws, then we simply reset all of the memory planner stuff and carry on. There's a catch: the code path that we take when handling the deallocation exception can't throw. However, this code path is much simpler than memory planner construction/deallocation, so it's much easier to manually audit the correctness here. Test Plan: New unit tests `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D32609915 fbshipit-source-id: 71fbe6994fd573ca6b7dd859b2e6fbd7eeabcd9e	2021-12-08 16:41:52 -08:00
Don Jang	afaa184b44	[Static Runtime] Avoid evaluating expressions of `Node` for interpreter fallback op (#69489 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69489 This change avoids pulling `Node` out of `ProcessedNode` to evaluate expressions related to `Node` at op execution time. A perf gain is expected but is not measurable; the purpose of this change is to make SR's code more self-contained (calling more code from SR, not JIT) at execution time. Test Plan: Existing tests Reviewed By: mikeiovine Differential Revision: D32893265 fbshipit-source-id: f0f397666b3556f985d45112af8fe0b08de22139	2021-12-08 08:40:30 -08:00
Mike Iovine	008469c5e2	[SR] Simplify memory re-use algorithm (#68302 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302 Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis. Test Plan: ## Re-use metrics `inline_cvr` (294738512_58) Before * `local` ``` Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 4601984 bytes Total number of reused tensors: 1183 ``` * `local_ro` ``` Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 29696 bytes Total number of reused tensors: 959 ``` After * `local` ``` Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 4520000 bytes Total number of reused tensors: 1198 ``` * `local_ro` ``` Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 29120 bytes Total number of reused tensors: 963 ``` Reviewed By: hlu1 Differential Revision: D32370424 fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf	2021-12-07 13:25:42 -08:00
Don Jang	c97dc9286d	Revert D32780415: [Static Runtime] Move implementation details from impl.h into internal.h Test Plan: revert-hammer Differential Revision: D32780415 (`999e93e6a8`) Original commit changeset: 119b7aedbf56 fbshipit-source-id: 1aa777e8c1854ab27e86bc625188f7170097fac8	2021-12-04 19:44:07 -08:00
Don Jang	999e93e6a8	[Static Runtime] Move implementation details from impl.h into internal.h (#69274 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69274 `impl.h` is the main header file that defines the interface of Static Runtime to its clients. However, it is currently filled with implementation details that should not be leaked to our clients. 1) this can unnecessarily leak our internals to our clients which can make it hard to change them later 2) cause unnecessary merge conflicts when multiple people are touching this enormous impl.cpp file. To alleviate the situation, this change moves the implementation details from impl.h into a new file, internal.h, that's internally kept without leaking the details to our clients. This change will be followed by another change to rename `impl.h` into `runtime.h` or anything better since `impl.h` is currently not about implementation but SR's interface. Note that this change is NOT complete since the remaining declarations in impl.h still contain a lot of implementation details. Therefore, we should keep working on minimizing the interface to prevent our API from being bloated unnecessarily. Also we need to work on modularizing our implementations into separate pieces organized by separate files in the near future. Test Plan: Existing unittests Reviewed By: donaldong Differential Revision: D32780415 fbshipit-source-id: 119b7aedbf563b195641c5674572a9348732145f	2021-12-04 14:48:28 -08:00
Donald Dong	6f7a5ddffc	[SR] Use std::vector::reserve in GetLivenessMap (#68884 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68884 This diff uses std::vector::reserve in GetLivenessMap to set container capacity for all local containers to avoid runtime resizing. The change should theoretically improve performance slightly. Test Plan: - [x] `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1` - [x] ``` seq 1 10 | xargs -I{} ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \ --scripted_model=/data/users/dxd/302008423_0.predictor.disagg.local \ --method_name=local_request_only.forward --pt_cleanup_activations=1 \ --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=0 --warmup_iters=0 \ --num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \ --input_type="recordio" --pt_inputs=/data/users/dxd/302008423_0.local_ro.inputs.recordio \ --recordio_use_ivalue_format=1 ``` ### Before ``` I1201 12:04:46.753311 2874563 PyTorchPredictorBenchLib.cpp:336] Took 10.9826 sec to initialize a predictor. I1201 12:05:00.617139 2875780 PyTorchPredictorBenchLib.cpp:336] Took 11.1078 sec to initialize a predictor. I1201 12:05:15.279667 2876813 PyTorchPredictorBenchLib.cpp:336] Took 11.7979 sec to initialize a predictor. I1201 12:05:30.201207 2877554 PyTorchPredictorBenchLib.cpp:336] Took 11.8901 sec to initialize a predictor. I1201 12:05:44.386926 2879713 PyTorchPredictorBenchLib.cpp:336] Took 11.2722 sec to initialize a predictor. I1201 12:05:58.003582 2881426 PyTorchPredictorBenchLib.cpp:336] Took 10.8046 sec to initialize a predictor. I1201 12:06:12.004778 2882604 PyTorchPredictorBenchLib.cpp:336] Took 11.2754 sec to initialize a predictor. I1201 12:06:26.101241 2884888 PyTorchPredictorBenchLib.cpp:336] Took 11.3355 sec to initialize a predictor. I1201 12:06:40.364817 2886572 PyTorchPredictorBenchLib.cpp:336] Took 11.401 sec to initialize a predictor.
I1201 12:06:54.483794 2888614 PyTorchPredictorBenchLib.cpp:336] Took 11.3498 sec to initialize a predictor. ``` ### After ``` I1201 11:51:53.775239 2818391 PyTorchPredictorBenchLib.cpp:336] Took 10.9113 sec to initialize a predictor. I1201 11:52:07.412720 2819530 PyTorchPredictorBenchLib.cpp:336] Took 10.8413 sec to initialize a predictor. I1201 11:52:21.202816 2820359 PyTorchPredictorBenchLib.cpp:336] Took 11.0216 sec to initialize a predictor. I1201 11:52:35.513288 2821029 PyTorchPredictorBenchLib.cpp:336] Took 11.4216 sec to initialize a predictor. I1201 11:52:49.145979 2821930 PyTorchPredictorBenchLib.cpp:336] Took 10.8272 sec to initialize a predictor. I1201 11:53:02.908790 2822859 PyTorchPredictorBenchLib.cpp:336] Took 11.0262 sec to initialize a predictor. I1201 11:53:16.276015 2823657 PyTorchPredictorBenchLib.cpp:336] Took 10.6893 sec to initialize a predictor. I1201 11:53:30.103283 2824382 PyTorchPredictorBenchLib.cpp:336] Took 11.1854 sec to initialize a predictor. I1201 11:53:44.298514 2825365 PyTorchPredictorBenchLib.cpp:336] Took 11.4796 sec to initialize a predictor. I1201 11:53:58.258708 2826128 PyTorchPredictorBenchLib.cpp:336] Took 11.2652 sec to initialize a predictor. ``` Reviewed By: swolchok Differential Revision: D32649252 fbshipit-source-id: 5cd296d12b12e5b15e85e4f1a8a236e293f37f9c	2021-12-03 12:18:06 -08:00
Mike Iovine	6ed7354435	[SR][Code cleanup] Typedef/default for kwargs (#69164 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69164 We have lots of methods that take `std::unordered_map<std::string, c10::IValue>` now. That's kind of ugly and cumbersome to type, so add a `KWargs` typedef. Also made the `operator()` default `kwargs` to empty. Note that we could have another overload that doesn't take `kwargs` at all, but the perf gain is so minuscule it's probably not worth it. ghstack-source-id: 144691899 Test Plan: CI Reviewed By: d1jang Differential Revision: D32734677 fbshipit-source-id: 8d6496a6d1ec2dc71253151d2f6408f1387966cf	2021-12-03 09:27:37 -08:00
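A rough sketch of the pattern this commit describes, under stated assumptions: `Value` is a stand-in for `c10::IValue` so the example compiles without PyTorch, and `Runner` is a hypothetical callable rather than the real `StaticModule` or `StaticRuntime` interface. Only the `KWargs` alias name and the defaulted `kwargs` parameter come from the commit message.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for c10::IValue (assumption: the real type is far richer).
struct Value {
  double number;
};

// One alias instead of spelling out the map type at every call site.
using KWargs = std::unordered_map<std::string, Value>;

// Hypothetical callable: the defaulted parameter lets callers omit kwargs.
struct Runner {
  std::size_t operator()(const std::vector<Value>& args,
                         const KWargs& kwargs = KWargs()) const {
    // Placeholder body: report how many arguments were received.
    return args.size() + kwargs.size();
  }
};
```

With the default in place, `run(args)` and `run(args, kwargs)` both resolve to the same overload. The cost is constructing an empty map on calls that omit `kwargs`, which the commit judges too small to justify a second overload.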
Mike Iovine	cc46dc45e1	[SR] Factor logic that determines managed tensors out of MemoryPlanner (#68295 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68295 There's no reason we can't figure out what tensors we need to manage at model load time. It's also useful to have the set of ranges available at load time for integrating the ranges algorithm introduced in the previous diff. Test Plan: `buck test caffe2/benchmarks/static_runtime/...` Reviewed By: hlu1 Differential Revision: D32400593 fbshipit-source-id: 0466b2641166ddc9c14f72774f4ba151407be400	2021-12-03 04:45:27 -08:00
Scott Wolchok	21686923e8	[PyTorch][SR] More debug logging (#67220 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67220 Specifically we log AliasDb and same_storage_values, and are chattier about the aliasing logs in the liveness analysis. ghstack-source-id: 144507289 Test Plan: Used to help develop D31776259 Reviewed By: hlu1 Differential Revision: D31847561 fbshipit-source-id: 8371455d060c17dace91cd90e4034b7618f820a6	2021-12-02 10:36:23 -08:00
Hao Lu	ed3b73fd4d	[Static Runtime] Skip ProcessedNode::verify_no_memory_overlap() for out variants (#68639 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639 Fix all problems related to `ProcessedNode::verify_no_memory_overlap()` - Only enable this check for native and fallback ops that are not inplace or view ops - Enable ProcessedNode::verify_no_memory_overlap() in debug mode and enforce it - Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas. fb::expand_dims's schema was not correct after this check was re-enabled. It's fixed in D32556204 (`39ab417107`) Reviewed By: mikeiovine Differential Revision: D32553708 fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13	2021-12-02 05:03:12 -08:00
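The kind of check this commit refers to can be sketched as a byte-range intersection test. This is an assumption about the general idea only, not the body of `ProcessedNode::verify_no_memory_overlap()`; `Buffer` and `overlaps` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of a buffer: start address and size in bytes.
struct Buffer {
  const void* data;
  std::size_t nbytes;
};

// True when the half-open byte ranges [begin, begin + nbytes) intersect.
bool overlaps(const Buffer& a, const Buffer& b) {
  if (a.nbytes == 0 || b.nbytes == 0) {
    return false;  // an empty buffer cannot overlap anything
  }
  const std::uintptr_t a_begin = reinterpret_cast<std::uintptr_t>(a.data);
  const std::uintptr_t b_begin = reinterpret_cast<std::uintptr_t>(b.data);
  return a_begin < b_begin + b.nbytes && b_begin < a_begin + a.nbytes;
}
```

A test like this only makes sense for ops whose outputs are expected to own fresh storage. In-place and view ops alias their inputs by design, which is presumably why the commit excludes them.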
Donald Dong	d9f3feb5a2	[SR] Use std::vector::reserve for StaticModule constants (#68834 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68834 This diff uses std::vector::reserve for constructing constants in StaticModule. We can also avoid two extra iterations over all the graph nodes. This diff should technically improve its performance by a tiny bit. Test Plan: - [x] buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1 Reviewed By: mikeiovine Differential Revision: D32628806 fbshipit-source-id: 99dd2a7a36e86899ca1fe5300f3aa90d30a43726	2021-11-23 18:00:04 -08:00
Mike Iovine	ee4cfaa286	[SR] Add utility class to determine tensor ranges (#68284 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284 Add a new class `ManagedTensorRanges` that determines when managed tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`. Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: swolchok Differential Revision: D32397207 fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6	2021-11-19 13:10:55 -08:00
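The idea behind `ManagedTensorRanges` can be approximated with a last-use analysis over a linear list of nodes. The sketch below is a simplification under loud assumptions: `Node`, `LastUseRanges`, and `availableAfter` are hypothetical names, values are identified by strings instead of `Value*`, and aliasing is ignored entirely, whereas the class in the commit accounts for aliases.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified node: just the names of the values it reads.
struct Node {
  std::vector<std::string> inputs;
};

// Records, for each node index, the values whose final use is that node.
// Aliasing is ignored here (assumption made to keep the sketch short).
class LastUseRanges {
 public:
  explicit LastUseRanges(const std::vector<Node>& nodes)
      : freed_after_(nodes.size()) {
    std::map<std::string, std::size_t> last_use;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
      for (const std::string& v : nodes[i].inputs) {
        last_use[v] = i;  // later uses overwrite earlier ones
      }
    }
    for (const auto& entry : last_use) {
      freed_after_[entry.second].push_back(entry.first);
    }
  }

  // Values that are never read again once node `index` has run.
  const std::vector<std::string>& availableAfter(std::size_t index) const {
    return freed_after_.at(index);
  }

 private:
  std::vector<std::vector<std::string>> freed_after_;
};
```

For three nodes reading `{a, b}`, `{a, c}`, and `{c}`, the sketch reports `b` as available after the first node, `a` after the second, and `c` after the third. A memory planner could hand those buffers to later outputs from that point on.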
Scott Wolchok	ced57eb490	[PyTorch][Static Runtime] Delete incorrect alias analysis code (#67075 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67075 Sharing storage if `mayAlias` is incorrect, as the old comment notes; sharing if `mustAlias` would be nice but, as the new comment notes, would not matter. ghstack-source-id: 143749553 Test Plan: CI Reviewed By: hlu1 Differential Revision: D31851893 fbshipit-source-id: 5bdc8de984d5919332c9010e8b0160211d96bc2f	2021-11-18 22:34:50 -08:00
Don Jang	aa9ee8d02a	[Static Runtime] Avoid copying function objects per StaticRuntime instance (#68368 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368 Currently, each instance of `StaticRuntime` has its own copy of the `std::function` object wrapped in a `ProcessedNode::Function` object, in order to invoke the actual operation implementation. However, all instances of `StaticRuntime` derived from the same `StaticModule` object invoke exactly the same op implementation, so these copies are avoidable. This change adds a `StaticModule::functions_` member variable to keep a list of unique `ProcessedFunction` objects. A newly constructed `StaticRuntime` takes pointers to the `ProcessedFunction` objects instead of the whole function objects. This can save a substantial amount of memory per `StaticRuntime` instance. This comes with a sacrifice in execution time. Now that a `ProcessedNode` instance keeps a pointer to the function object, executing a node involves an extra pointer dereference. However, this cost proved to be negligible in local performance tests. Thanks to hlu1 for proposing this non-intrusive improvement idea :D Test Plan: This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) (patched D32181666 to print the memory turnover from instantiating a StaticRuntime instance) for CMF/local (and 8% for CMF/local_ro). No noticeable latency regression was observed. ==AFTER * CMF/local memory turnover: 393608 latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087 * CMF/local_ro memory turnover: 387288 latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101 ==BEFORE * CMF/local memory turnover: 459888 latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18 * CMF/local_ro memory turnover: 420832 latency: PyTorch run finished. Milliseconds per iter: 7.43756.
Iters per second: 134.453 ==Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr: ==AFTER Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 1496896 bytes Total number of reused tensors: 1183 Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%) Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 39040 bytes Total number of reused tensors: 959 Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%) Total number of managed tensors: 1293 Total number of managed output tensors: 0 Total number of unmanaged values: 14 Total memory managed: 5293824 bytes Total number of reused tensors: 771 Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%) ==BEFORE Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 1496896 bytes Total number of reused tensors: 1183 Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%) Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 39040 bytes Total number of reused tensors: 959 Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%) Total number of managed tensors: 1293 Total number of managed output tensors: 0 Total number of unmanaged values: 14 Total memory managed: 5293824 bytes Total number of reused tensors: 771 Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%) Reviewed By: swolchok Differential Revision: D32337548 fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a	2021-11-16 20:28:48 -08:00
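The ownership change this commit describes can be sketched as follows. `Module`, `Runtime`, `RuntimeNode`, and `OpFn` are hypothetical stand-ins for `StaticModule`, `StaticRuntime`, `ProcessedNode`, and `ProcessedFunction`; the real classes carry much more state. The sketch assumes the module outlives every runtime built from it.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for the per-op function object.
using OpFn = std::function<int(int)>;

// Owns exactly one function object per node; shared by all runtimes.
struct Module {
  std::vector<OpFn> functions;
};

// Per-instance node: stores a pointer instead of a full std::function copy.
struct RuntimeNode {
  const OpFn* fn;
  int run(int x) const { return (*fn)(x); }  // one extra dereference
};

struct Runtime {
  // Assumption: `m` outlives this Runtime, so the pointers stay valid.
  explicit Runtime(const Module& m) {
    nodes.reserve(m.functions.size());
    for (const OpFn& f : m.functions) {
      nodes.push_back(RuntimeNode{&f});
    }
  }

  // Runs the nodes in order, feeding each result into the next node.
  int run(int x) const {
    for (const RuntimeNode& n : nodes) {
      x = n.run(x);
    }
    return x;
  }

  std::vector<RuntimeNode> nodes;
};
```

Two `Runtime` objects built from one `Module` end up holding identical pointers, so the per-instance cost per node falls from one `std::function` to one pointer. The trade-off is the extra indirection on each call, which the commit reports as negligible in its measurements.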
Hao Lu	75ccb07b26	[SR] LOG->VLOG (#68477 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68477 We're printing a lot of unnecessary logs in prod. Change these from LOG(INFO) to VLOG(1) so you can easily flip them back for testing. Test Plan: CI Reviewed By: ajyu, d1jang Differential Revision: D32439776 fbshipit-source-id: 40fa57f4eeb6ca0b610008062cc94aed62fb6981	2021-11-16 17:09:52 -08:00
Scott Wolchok	10e9d80ad1	[PyTorch][Static Runtime] Don't track scalar ivalues (#67702 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67702 This isn't a particularly large optimization and it does nothing before select_tensor is introduced (I'm surprised that no operators have optimizable outputs!), but it seems like we should probably get the savings. ghstack-source-id: 143424918 Test Plan: CI; checked `--do_profile=1` output with following diff and we save tracking hundreds of values, as expected. Reviewed By: hlu1 Differential Revision: D32112522 fbshipit-source-id: 1804b77992a73670bfc1e36af608b852b8261bd2	2021-11-16 11:05:42 -08:00

201 Commits