Summary:
The previous implementation assumed there was only one overload and unconditionally tried to convert its input into a string. Some users were running into crashes because of this. Added support for the list overload along with schema checks.
Also, I managed to uncover another bug when writing tests for this case (yikes). Returning inputs didn't work because the input cleanup process would destroy the output. Extended `CreateOwnedRefsForSpecialIValues` to fix that.
Test Plan: CI + new unit tests
Differential Revision: D38870803
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83753
Approved by: https://github.com/tenpercent, https://github.com/albanD
Summary:
- Support async execution of forked nodes on a custom executor
- Fork subgraph execution was previously performed on the inter-op thread pool executor by default
- Handle async execution of the forked graph on a custom executor when the parent graph is executed with the runAsync() API, which passes the executor for async ops
Differential Revision: D37466525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80381
Approved by: https://github.com/mikeiovine
Summary:
- The ProcessedNodeMetadata class wraps the possible metadata for ProcessedNode. Depending on the nature of the op, a ProcessedNode can hold one of the following kinds of metadata:
1. prim::If/prim::Loop ops contain block_runners_ as their metadata
2. The prim::fork op contains a TaskLauncher (std::function) responsible for executing the forked subgraph
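A minimal sketch of the wrapper this describes, with illustrative member names only (the actual class and its accessors live in static runtime and may differ):
```
#include <functional>
#include <memory>
#include <vector>

class BlockRunner;  // stand-in for static runtime's per-block runner

// Mirrors the std::function launcher type mentioned above.
using TaskLauncher = std::function<void(std::function<void()>)>;

class ProcessedNodeMetadataSketch {
 public:
  // prim::If / prim::Loop: block runners for the node's sub-blocks.
  std::vector<std::shared_ptr<BlockRunner>> block_runners;
  // prim::fork: launcher that schedules execution of the forked subgraph.
  TaskLauncher launcher;
};
```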
Differential Revision: D37320704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79961
Approved by: https://github.com/mikeiovine
Summary: This adds the missing JIT prim ops that appear in the non-ads models for the c2->pt mitigation: aten::cpu, aten::list, aten::numel, aten::__range_length
Test Plan: static runtime unit tests
Differential Revision: D36984960
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79111
Approved by: https://github.com/davidberard98
Summary:
- Remove creation of a new StaticModuleOptions for the forked subgraph. Use the parent graph's options to create the runtime for the forked subgraph
- StaticRuntimeMetadata extends CustomClassHolder, which can be cast to an IValue and attached to the IR node's attributes.
Differential Revision: D37159684
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79578
Approved by: https://github.com/mikeiovine
Summary: This adds the PyTorch operators that are currently missing in non-ads models for the c2->pt mitigation: aten::index_put, aten::item, aten::tensor_split
Test Plan: buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
Differential Revision: D36984961
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79065
Approved by: https://github.com/davidberard98
Summary:
- StaticModule was being created at runtime, which added overhead to the forked operation
- Move StaticModule creation outside of the runtime so that StaticRuntime instances can be created on top of the same StaticModule, which is created only once
Differential Revision: D37126923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79482
Approved by: https://github.com/tenpercent
Summary:
- Exception handling was not performed during forked subgraph execution
- The forked subgraph runtime can throw a runtime exception. The future returned by prim::fork needs to capture exceptions so that aten::wait can handle them.
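A minimal sketch of the idea, assuming a hypothetical `run_subgraph` callable standing in for the forked subgraph's runtime (this is not the actual static runtime code):
```
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>
#include <functional>

// Run the forked work and propagate any exception through the future that
// prim::fork returns, so aten::wait can observe it.
c10::intrusive_ptr<c10::ivalue::Future> run_forked(
    std::function<c10::IValue()> run_subgraph) {
  auto future = c10::make_intrusive<c10::ivalue::Future>(c10::AnyType::get());
  try {
    future->markCompleted(run_subgraph());
  } catch (...) {
    future->setError(std::current_exception());  // surfaced by aten::wait
  }
  return future;
}
```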
Test Plan:
local test cases:
- buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
- buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
- buck test mode/opt caffe2/test:static_runtime
Async execution of the subgraph is tested by adding PyTorch profiler hooks to the StaticRuntime execution via the code below. Async execution in the thread pool is verified by checking the trace:
with profile(activities=[ProfilerActivity.CPU]) as prof:
    static_runtime_module(inputs)
prof.export_chrome_trace("trace.json")
Differential Revision: D37072493
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79292
Approved by: https://github.com/mikeiovine
Summary:
- Initial support for fork was done on the JIT interpreter. This patch enables async execution on static runtime
- For each forked node, a separate runtime is created for executing the subgraph. Async execution is handled by the aten::ParallelThreadPoolNative thread pool
- aten::wait waits for the future returned by fork to complete
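A small sketch of the aten::wait behaviour described above, written against the public c10 Future API rather than the actual static runtime kernel:
```
#include <ATen/core/ivalue.h>

// Block until the forked subgraph's future completes, then forward its
// result (value() rethrows if the subgraph raised an exception).
c10::IValue wait_on_fork(const c10::IValue& forked) {
  auto future = forked.toFuture();
  future->wait();
  return future->value();
}
```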
Test Plan:
local test cases:
- buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
- buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
- buck test mode/opt caffe2/test:static_runtime
Async execution of the subgraph is tested by adding PyTorch profiler hooks to the StaticRuntime execution via the code below. Async execution in the thread pool is verified by checking the trace:
with profile(activities=[ProfilerActivity.CPU]) as prof:
    static_runtime_module(inputs)
prof.export_chrome_trace("trace.json")
Differential Revision: D37044513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79211
Approved by: https://github.com/mikeiovine
Summary:
prim::fork was executed synchronously in the main thread
- Added changes that execute prim::fork calls asynchronously on one of the threads from TaskThreadPoolBase defined in ATen (see the sketch after this list)
- Changes are tested via PyTorch profiler tracing; fork calls are executed on different threads
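A hedged sketch of dispatching the forked work onto ATen's inter-op thread pool via at::launch; `run_subgraph` is a hypothetical callable, not static runtime's actual fork implementation:
```
#include <ATen/Parallel.h>
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>
#include <functional>

c10::intrusive_ptr<c10::ivalue::Future> fork_async(
    std::function<c10::IValue()> run_subgraph) {
  auto future = c10::make_intrusive<c10::ivalue::Future>(c10::AnyType::get());
  // at::launch schedules the closure on the inter-op thread pool, so the
  // main thread is free to continue until aten::wait is reached.
  at::launch([future, run_subgraph = std::move(run_subgraph)]() {
    future->markCompleted(run_subgraph());
  });
  return future;
}
```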
Test Plan:
Local test scripts executed:
- buck run mode/opt caffe2/test:static_runtime
- buck run caffe2/benchmarks/static_runtime/fb:test_fb_operators
- buck run caffe2/benchmarks/static_runtime:static_runtime_cpptest
Executed the PyTorch profiler to confirm the spawned fork operations run in parallel on different threads
Differential Revision: D36909308
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78858
Approved by: https://github.com/mikeiovine
Summary:
Basic implementation of prim::fork and aten::wait
- The current implementation uses the interpreter to call the forked subgraph
- The interpreter call is to be replaced in the future
- Added custom test cases for fork/wait procedures in the graph
Test Plan:
Custom tests were created in the test_static_runtime.py file to verify the static_runtime output against the reference PyTorch output.
Test commands:
- buck run caffe2/test:static_runtime
- buck run caffe2/benchmarks/static_runtime:static_runtime_cpptest
- buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
Differential Revision: D36881214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78780
Approved by: https://github.com/tenpercent
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74987
Add specializations to the `prim::If` operator at runtime to save resources when some of the sub-blocks are empty
Test Plan:
`buck build //caffe2:torch-cpp-cpu`
`buck test //caffe2/benchmarks/static_runtime/...`
Add unit test:
`buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.EmptyIfBlock`
Reviewed By: mikeiovine
Differential Revision: D35262952
fbshipit-source-id: 324f88471f33f035f4d8a9b212716530d8e59df2
(cherry picked from commit 2db1b1a6833b1376fa376f54791effc8e12fb77f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74585
Native static runtime for `aten::reshape_as`
ghstack-source-id: 152340038
Test Plan: New unit test
Reviewed By: hlu1
Differential Revision: D35060895
fbshipit-source-id: c4e6f8a04c7df3821c7e654bfaf584e5a72ea701
(cherry picked from commit 6fa596cd866a024b6653239e0e30ddad42de242f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74562
Add a native implementation for `aten::IntImplicit`, which is similar to `aten::Int` except for a few extra checks it must do
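For context, a rough sketch of the kind of extra checks an implicit tensor-to-int conversion performs before extracting the scalar (based on the JIT's behaviour for implicit conversions; illustrative, not this diff's exact code):
```
#include <ATen/ATen.h>

int64_t int_implicit_sketch(const at::Tensor& t) {
  TORCH_CHECK(!t.requires_grad(),
              "Cannot convert a tensor that requires grad to a scalar");
  TORCH_CHECK(t.sizes().empty(),
              "Cannot convert a tensor of dimension other than 0 to a scalar");
  return t.item<int64_t>();  // same extraction aten::Int would do
}
```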
ghstack-source-id: 152340039
Test Plan: New unit tests
Reviewed By: hlu1
Differential Revision: D35052997
fbshipit-source-id: cb2f0faf7c62382e3f13750d8e1280c49c6b9e42
(cherry picked from commit 359c7493f8deaeccebc27e1b6e6e9777850010c1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74250
The `DictConstruct`/`ListUnpack` implementations currently put all of their inputs onto a stack before calling the JIT implementation in `vararg_functions.cpp`. This was done to avoid code duplication, but it's quite wasteful since it causes extra heap allocations and, potentially, refcount bumps.
Given that these two ops are quite common and the code duplication is only a few lines, it seems reasonable to avoid this cost.
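To make the point concrete, here is a minimal sketch of building the dict directly from a node's inputs rather than pushing them onto a JIT stack first (generic c10 API usage, not the actual op body):
```
#include <ATen/core/Dict.h>
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>
#include <vector>

c10::IValue dict_construct_direct(const std::vector<c10::IValue>& inputs) {
  // Inputs come in key/value pairs; insert them straight into the dict,
  // avoiding the extra stack allocation and refcount traffic.
  auto dict = c10::impl::GenericDict(c10::AnyType::get(), c10::AnyType::get());
  dict.reserve(inputs.size() / 2);
  for (size_t i = 0; i + 1 < inputs.size(); i += 2) {
    dict.insert(inputs[i], inputs[i + 1]);
  }
  return dict;
}
```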
ghstack-source-id: 151897634
Test Plan: Existing unit tests
Reviewed By: navahgar
Differential Revision: D34901245
fbshipit-source-id: ece0618a6134a35720f214e79c64f12045f074d0
(cherry picked from commit 1f8e223c1887ed205c84a7ac4587813f94b11bad)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74192
This change uses toListRef() for `aten::len` to avoid creating a new list object, addressing hlu1's comment on D34705231 (87564a1bd7)
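As a small illustration of the point, the length can be read through an ArrayRef view of the existing list instead of materializing a new one (usage sketch, not the verbatim kernel):
```
#include <ATen/core/ivalue.h>

int64_t list_len(const c10::IValue& list) {
  // toListRef() returns a view over the existing list storage, so no new
  // list object (and no refcount bump) is created just to take its size.
  return static_cast<int64_t>(list.toListRef().size());
}
```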
Test Plan: Existing tests, StaticRuntime.Len*
Reviewed By: mikeiovine
Differential Revision: D34863266
fbshipit-source-id: 65daf36944a64dfd7afde1103aab5aee1681ac87
(cherry picked from commit 3a0f3798f2fcc203f6cb01e59b91e195ecabe1bc)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73450
This change uses `SROperator` for operators' function type
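For reference, the alias in question is essentially a function over a ProcessedNode pointer; a sketch from memory (treat as illustrative):
```
#include <functional>

namespace torch::jit {
class ProcessedNode;  // defined in static runtime
using SROperator = std::function<void(ProcessedNode*)>;
} // namespace torch::jit
```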
Test Plan: N/A
Reviewed By: mikeiovine
Differential Revision: D34483246
fbshipit-source-id: ed544bb91b676ed08983dc8dc78cedd0f77d499f
(cherry picked from commit eb9de3ad8de043990c02f30ffa48a29c8e5e81f2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587
This pattern frequently appears in a few graphs:
```
%result = prim::If(%condition)
block0():
-> (%a)
block1():
-> (%b)
```
This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.
This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:
```
%result = prim::IfThenElse(%condition, %a, %b)
```
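At runtime the new op is just a ternary select over already-computed inputs; a hedged sketch of its semantics (not the registered static runtime kernel):
```
#include <ATen/core/ivalue.h>

c10::IValue if_then_else(const c10::IValue& cond,
                         const c10::IValue& a,
                         const c10::IValue& b) {
  // Both branches are plain values here, so no sub-blocks, block runners,
  // or memory planners are needed.
  return cond.toBool() ? a : b;
}
```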
Test Plan: New unit tests
Reviewed By: eellison
Differential Revision: D34091789
fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b83f402bda2dcdd9ecb411e0b67c651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72544
Now that static runtime supports control flow, there's no need to fall back to the JIT. We get better performance with the native control flow since we avoid heap allocation/ref count bumps during stack construction.
I've left the old `prim::TensorExprDynamicGroup` around in case we need to support it in the future. I've also added native support for a few scalar ops that are used inside the control flow sub-blocks.
ghstack-source-id: 148825816
Test Plan: New unit tests
Reviewed By: d1jang
Differential Revision: D34083080
fbshipit-source-id: a7ffc0fda39ab3df3ba47e44a03d857131dc1e50
(cherry picked from commit 2ef39e0e54d5e9da76af9e617a11233ffc81b011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72530
This bug was revealed from a failed attempt to run a feed/story model.
Test Plan:
- This fix was tested to successfully run the failed model: P479037453
- Added a unittest
Reviewed By: mikeiovine
Differential Revision: D34055801
fbshipit-source-id: 4a3e06bbb3b9fa78b0514c9c67aa4a0b79f46a8d
(cherry picked from commit bfa2bfb81ceaadad399522e422863fcea4aa13f1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71474
The PyTorch edge team is working on promoting some prim ops to interpreter instructions (see D33398092). Since the JIT fallback ops will be unavailable soon, we need to implement these ops in static runtime.
Ops not included in this diff:
* `aten::__is__` and `aten::__isnot__`: disabled in static runtime for unrelated reasons
* `prim::NumToTensor` and `aten::__get__.Dict` already exist
ghstack-source-id: 148641179
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D33657816
fbshipit-source-id: 6d15244ae1024a56d3b25e51a433fa104ce8ee5e
(cherry picked from commit 33f8f861ff88a6dda6a545c12515e92c893027d4)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71854
Support `prim::CreateObject` - this is a native interpreter instruction, so we can't fall back to the JIT for this op.
Test Plan: New unit test exercises creating and modifying custom objects
Reviewed By: d1jang
Differential Revision: D33783759
fbshipit-source-id: 8185ff71b5d441597d712a5d4aab7fc4dddf7034
(cherry picked from commit bd3f52d8e2cd8e20a8d66e2d2b802c1d92088e4e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838
Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186483
Test Plan: New unit tests: `buck test caffe2/benchmark/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33049595
fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532
(cherry picked from commit 8baf1753af34f4c166b4680e42589517fd2e508d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837
Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186475
Test Plan:
New unit tests: `buck test caffe2/benchmarks/static_runtime/...`
Accuracy test at top of stack
Reviewed By: d1jang
Differential Revision: D33045908
fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55
(cherry picked from commit de3b12dc0871e8ca09891c257e1dfd7cd352aa7c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835
`StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`.
Note that this also applies to returned values in sub-blocks that are from outer scopes.
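The op itself can be as simple as copying the IValue, since an IValue copy takes a new (owned) reference; a minimal sketch under that assumption:
```
#include <ATen/core/ivalue.h>

c10::IValue create_owned_ref(const c10::IValue& input) {
  // Copying the IValue bumps the refcount, so moving the result out of the
  // block runner can no longer destroy the only reference held elsewhere
  // (e.g. in the constants array or an outer scope).
  return input;  // copy => owned reference
}
```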
ghstack-source-id: 148186452
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
Added a new unit test with a graph that simply returns a constant.
Tests with sub-blocks at top of stack.
Reviewed By: d1jang
Differential Revision: D33047519
fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b
(cherry picked from commit a8f89a12ee726aa7d7e546dee25d696eef868ce7)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501
This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it.
ghstack-source-id: 147385569
Test Plan: CI, rerun broken test from task
Reviewed By: mikeiovine
Differential Revision: D33669290
fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a
(cherry picked from commit 5e4fefb109b6c92d59fc7e24d69f1b6b2780c776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71252
Same old problem, same old solution.
Interestingly, I tried using c10::irange instead, but that caused really bad assembly to be generated -- we lost inlining for lots of the loop body!
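The "same old solution" is hoisting the termination value out of the loop header; a generic illustration with a hypothetical loop (the diff applies this to static runtime's internal loops, where aliasing keeps the compiler from hoisting it automatically):
```
#include <cstddef>
#include <vector>

void process(int);  // stand-in for the loop body

// Before: the bound may be re-evaluated every iteration if the compiler
// cannot prove it is loop-invariant.
void before(const std::vector<int>& values) {
  for (size_t i = 0; i < values.size(); ++i) {
    process(values[i]);
  }
}

// After: compute the termination value exactly once.
void after(const std::vector<int>& values) {
  const size_t n = values.size();
  for (size_t i = 0; i < n; ++i) {
    process(values[i]);
  }
}
```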
ghstack-source-id: 146939573
Test Plan:
CI
Spot-checked assembly before/after and confirmed that the loop termination value was recomputed before the change and not after
Reviewed By: mikeiovine
Differential Revision: D33558118
fbshipit-source-id: 9fda2f1f89bacba2e8b5e61ba432871e973201fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247
Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it.
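A sketch of the substitution, assuming a shape arriving as a list IValue (illustrative only, not the diff's exact call sites):
```
#include <ATen/ATen.h>

at::DimVector shape_from_ivalue(const c10::IValue& sizes) {
  // DimVector is a SmallVector sized for typical tensor ranks, so building
  // the shape into it avoids the heap allocation of toIntVector().
  at::DimVector shape;
  for (const auto& d : sizes.toListRef()) {
    shape.push_back(d.toInt());
  }
  return shape;
}
```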
ghstack-source-id: 146933314
Test Plan: CI -- if we think DimVector is good in general then I think we have to think this change is good?
Reviewed By: mikeiovine
Differential Revision: D33556198
fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71113
This diff adds a variety of missing ~~out variants~~/native ops. Most of these are trivial, so I included them all in one diff.
Native ops
* `aten::mul` (list variant)
* `aten::sub` (int variant)
* `aten::add` (list variant)
* `aten::Int`
Out variants
* ~~`aten::gt`~~ (codegen will handle)
* ~~`aten::eq`~~ (codegen will handle)
ghstack-source-id: 146927552
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D33510756
fbshipit-source-id: df385958b9561955b2e866dab2e4c050abd26766
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122
See code comments for details; in brief, we repurpose support
for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor`
output a borrowed IValue that we have to clean up manually.
If we have any other ops that always create a new reference to an
existing Tensor, we can easily apply this same optimization.
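A minimal usage sketch of the borrowing primitive being repurposed here (standard c10 API, not the `select_tensor` change itself):
```
#include <ATen/core/Tensor.h>
#include <c10/util/MaybeOwned.h>

c10::MaybeOwned<at::Tensor> borrow_tensor(const at::Tensor& t) {
  // A borrow refers to the existing Tensor without taking ownership (no
  // refcount bump); the caller must guarantee `t` outlives the borrow,
  // which is why the borrowed IValue needs manual cleanup in static runtime.
  return c10::MaybeOwned<at::Tensor>::borrowed(t);
}
```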
ghstack-source-id: 146482212
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
--do_profile output for local_ro (updated Dec 10):
```
swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt
First iter time: 0.989023 ms
Number of operators: 2037
Total number of managed tensors: 1597
Total number of managed output tensors: 0
Total number of unmanaged values: 2568
Number of unmanaged values requiring cleanup: 2568
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 50368 bytes
Total number of reused tensors: 1010
Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%)
swolchok@devbig032 /d/u/s/f/fbcode> ttail TMCC^C
swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt
First iter time: 0.994703 ms
Number of operators: 2551
Total number of managed tensors: 1146
Total number of managed output tensors: 0
Total number of unmanaged values: 4047
Number of unmanaged values requiring cleanup: 3533
Number of unmanaged values not requiring cleanup: 514
Total memory managed: 50048 bytes
Total number of reused tensors: 559
Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%)
```
for local: (also Dec 10):
```
==> Stable.local.profile.txt <==
First iter time: 9.0909 ms
Number of operators: 1766
Total number of managed tensors: 1894
Total number of managed output tensors: 0
Total number of unmanaged values: 2014
Number of unmanaged values requiring cleanup: 2014
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 4541440 bytes
Total number of reused tensors: 847
Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%)
==> TMCOFastAliasing.local.profile.txt <==
First iter time: 7.5512 ms
Number of operators: 2378
Total number of managed tensors: 1629
Total number of managed output tensors: 0
Total number of unmanaged values: 3503
Number of unmanaged values requiring cleanup: 2891
Number of unmanaged values not requiring cleanup: 612
Total memory managed: 3949312 bytes
Total number of reused tensors: 586
Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%)
```
Reviewed By: hlu1
Differential Revision: D32318674
fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223
ghstack-source-id: 146482215
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
Reviewed By: hlu1
Differential Revision: D31776259
fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69999
This adds support for the split_with_sizes operator in static runtime by adding native operators. These operators have less overhead compared to their JIT fallbacks (no dispatching, no stack construction at runtime).
split_with_sizes can be called directly from the C++ API, or through `torch.split` when `split_sizes` is a list. This diff adds support for both use cases.
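Both entry points boil down to the same ATen call; a small usage example (plain ATen, independent of static runtime):
```
#include <ATen/ATen.h>
#include <vector>

void split_example() {
  at::Tensor x = at::randn({10, 4});
  // Split the 10 rows into chunks of sizes 2, 3, and 5 along dim 0;
  // torch.split(x, [2, 3, 5]) in Python lowers to the same op.
  std::vector<at::Tensor> parts = at::split_with_sizes(x, {2, 3, 5}, /*dim=*/0);
}
```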
Test Plan:
- Added unit tests. Made sure the operators are used
- Benchmark
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/data/users/dxd/305797439_0.predictor.precompute.remote_request_only \
--method_name=user.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=500 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \
--input_type="recordio" --pt_inputs=/data/users/dxd/305797439_0_user.inputs.recordio \
--recordio_use_ivalue_format=1 --do_profile=1 --do_benchmark=1
```
#### Before
```
Static runtime ms per iter: 3.62073. Iters per second: 276.187
0.0471904 ms. 1.31501%. aten::split_with_sizes (5 nodes)
```
#### After
```
Static runtime ms per iter: 3.44374. Iters per second: 290.382
0.0432057 ms. 1.34276%. aten::split_with_sizes (5 nodes, native)
```
Reviewed By: swolchok
Differential Revision: D33141006
fbshipit-source-id: feae34c4c873fc22d48a8ff3bf4d71c0e00bb365
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69559
We have a lot of special cases. Document them so they're easy to learn about.
ghstack-source-id: 145226542
Test Plan: Spell check? :)
Reviewed By: d1jang
Differential Revision: D32929416
fbshipit-source-id: 2362410f25a27cdb74a4939903446192cef61978
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68160
This generalizes the mechanism D32318674 added for letting native ops borrow their outputs and uses it in dict_unpack.
ghstack-source-id: 143424919
Test Plan:
4.5% in CMF local_ro compared to D32318674 (previous two diffs were necessary steps but didn't get the full win yet):
```
FastAliasingInSelectTensor, local_ro
========================================
I1110 22:06:37.549811 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08488. Iters per second: 921.76
I1110 22:06:38.147949 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08675. Iters per second: 920.171
I1110 22:06:38.766340 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08626. Iters per second: 920.592
I1110 22:06:39.366608 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08376. Iters per second: 922.717
I1110 22:06:39.964979 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08362. Iters per second: 922.833
I1110 22:06:40.565248 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08423. Iters per second: 922.312
I1110 22:06:41.167326 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0945. Iters per second: 913.659
I1110 22:06:41.766187 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08373. Iters per second: 922.742
I1110 22:06:42.367816 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08995. Iters per second: 917.475
I1110 22:06:42.968391 119627 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.08854. Iters per second: 918.665
I1110 22:06:42.968446 119627 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.08662, standard deviation: 0.00351662
BorrowDictUnpackOutputs, local_ro
========================================
I1110 22:05:23.245435 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03272. Iters per second: 968.313
I1110 22:05:23.822196 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.06478. Iters per second: 939.163
I1110 22:05:24.395256 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.035. Iters per second: 966.186
I1110 22:05:24.964169 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.02786. Iters per second: 972.898
I1110 22:05:25.536558 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03205. Iters per second: 968.946
I1110 22:05:26.109027 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04256. Iters per second: 959.174
I1110 22:05:26.679611 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03245. Iters per second: 968.567
I1110 22:05:27.253048 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.04493. Iters per second: 957.005
I1110 22:05:27.822629 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.0299. Iters per second: 970.971
I1110 22:05:28.393326 113949 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.03039. Iters per second: 970.509
I1110 22:05:28.393368 113949 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.03726, standard deviation: 0.0111053
```
0.04936 (4.5%) usec/iter improvement
Reviewed By: hlu1
Differential Revision: D32347390
fbshipit-source-id: e636ddafacf30ed2a2d84a6e15fff97481342fdb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68159
These all look like they'll cause unnecessary refcount bumps to me.
ghstack-source-id: 143424917
Test Plan:
CI
TODO profile local_ro?
Reviewed By: hlu1
Differential Revision: D32347392
fbshipit-source-id: d8ed91b5855b86765db00c61ad3650273302c7b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934
This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode.
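A sketch of the layout this describes, with illustrative field names (not the actual members):
```
#include <ATen/core/ivalue.h>
#include <cstdint>

struct ProcessedNodeOutputsSketch {
  // Outputs live contiguously in one array shared by all nodes, so each
  // node only needs a 16-bit offset and a 16-bit count (at most 2**16 - 1
  // values overall) instead of its own vector of IValues.
  uint16_t outputs_offset;
  uint16_t num_outputs;

  c10::IValue& output(c10::IValue* shared_values, uint16_t i) const {
    return shared_values[outputs_offset + i];
  }
};
```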
ghstack-source-id: 143429113
Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.
Previous diff, CMF local:
```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```
This diff, CMF local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
72912 bytes (17%) savings
Perf looks neutral; see next diff (D32216573) test plan for details.
Reviewed By: hlu1
Differential Revision: D32190751
fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68135
Update the schema to reflect the changes in D31935573 (6b44e75f6b).
Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Confirmed native implementation is used.
Reviewed By: hlu1
Differential Revision: D32326865
fbshipit-source-id: 7f607f57ceb6690a2782d94d9ee736ba64e7d242
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67476
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31994040
fbshipit-source-id: 9de57d8d7925ee46544478eae8229952ca5f248a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65504
We should be able to borrow a Tuple from an IValue without incurring refcount bumps.
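A usage sketch of such a borrow via IValue::toTupleRef(), which hands back a reference to the underlying tuple rather than copying the intrusive_ptr (shown as an API example, not necessarily this diff's exact change):
```
#include <ATen/core/ivalue.h>

size_t tuple_size_borrowed(const c10::IValue& iv) {
  // No refcount bump: we only borrow a reference to the existing tuple.
  const auto& tup = iv.toTupleRef();
  return tup.elements().size();
}
```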
ghstack-source-id: 142065833
Test Plan:
Added test coverage.
Profiled static runtime on the local_ro net for ctr_mobile_feed. Inclusive time spent in VarTupleUnpack decreased about 0.3%, which roughly matches with the 0.36% of runtime that was previously spent in IValue::toTuple().
Reviewed By: hlu1
Differential Revision: D31130570
fbshipit-source-id: afa14f46445539e449068fd908d547b8da7f402c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67441
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31992093
fbshipit-source-id: 88191c13d229ffeac4e5b17b78e25f51d3f7f23e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67346
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31965159
fbshipit-source-id: 86a69c395f401c4a4c55daa4c5fe80764383c8e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67341
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like `TupleUnpack`). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31962589
fbshipit-source-id: 3107fb169c1b02fb2bafbb355c005669b5fa8435