Not only is this change usually shorter and more readable, it can also yield better performance. `size()` is not always a constant-time operation (e.g. on linked lists), but `empty()` always is.
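The point about container emptiness checks can be illustrated with a minimal sketch (all names here are hypothetical, not from the PR): a singly linked list whose `size()` must walk every node, while an emptiness check only inspects the head pointer.

```python
# Hypothetical singly linked list: computing the size is O(n),
# but checking emptiness is O(1).
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class LinkedList:
    def __init__(self, values=()):
        self.head = None
        for v in reversed(values):
            self.head = Node(v, self.head)

    def size(self):   # O(n): walks every node
        n, node = 0, self.head
        while node is not None:
            n, node = n + 1, node.next
        return n

    def empty(self):  # O(1): looks at the head pointer only
        return self.head is None

lst = LinkedList([1, 2, 3])
assert not lst.empty() and lst.size() == 3
assert LinkedList().empty()
```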
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
This PR fixes a number of bugs found by the Svace static analyzer:
1. DEREF_AFTER_FREE at qnnpack_utils.h:
Pointer '&convolution->zero_buffer' is dereferenced at qnnpack_utils.h:258 after the referenced memory was deallocated at operator-delete.c:25 by passing as 1st parameter to function 'pytorch_qnnp_delete_operator' at qnnpack_utils.h:251.
2. DEREF_AFTER_NULL at impl.cpp:
After having been compared to NULL value at impl.cpp:1892, pointer 'schema' is passed as 2nd parameter in call to function 'c10::operator<<' at impl.cpp:1921, where it is dereferenced at function_schema_inl.h:13.
3. DEREF_OF_NULL at stmt.h:
After having been compared to NULL value at stmt.h:744, pointer 'body->_M_ptr' is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at stmt.h:745, where it is dereferenced at exceptions.h:67.
4. DEREF_OF_NULL at loopnest.h:
Pointer 'f->ptr' that can have only NULL value (checked at loopnest.cpp:1482), is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at loopnest.cpp:1483, where it is dereferenced at exceptions.h:67.
This is the same error as 3: forwarding a nullptr to malformed_input().
5. TAINTED_INT.LOOP in python_arg_parser:
Integer value 'this->size' obtained from untrusted source at python_arg_parser.cpp:118 without checking its bounds is used as a loop bound at python_arg_parser.cpp:698 by calling function 'torch::FunctionParameter::set_default_str' at python_arg_parser.cpp:133.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85705
Approved by: https://github.com/kit1980
Summary:
The pass introduces an `fb::` operator and thus cannot be used in OSS.
The test failure was not exposed because the Static Runtime tests have been disabled in OSS for a while. The Dev Infra folks encountered this failure when re-enabling the tests.
Test Plan: Existing tests
Differential Revision: D40724547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87799
Approved by: https://github.com/huydhn
Summary:
Someone was running into problems where
1) Static Runtime enablement would fail,
2) we would try to fall back to the JIT interpreter *after trying to create `StaticModule`*, and
3) the fallback would fail because Static Runtime had mangled the graph.
Due to memory concerns, we don't want to prevent Static Runtime from mutating its input. The intent of `canEnableStaticRuntime` is to catch issues in the module before Static Runtime messes with it.
With this diff, `StaticModule` instantiation can be avoided by querying `canEnableStaticRuntime` and the issue is fixed.
Test Plan: New unit test
Differential Revision: D40564452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87396
Approved by: https://github.com/tenpercent
Summary: Split `quantized_linear_unpacked_weight_v2` into `linear_prepack` and `quantized_linear` so that the prepacking operation may be eliminated by constant folding.
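The benefit of the split can be sketched with a toy constant-folding pass (the IR representation and helper names below are hypothetical, not Static Runtime's actual implementation): once prepacking is a pure node whose only inputs are constant weights, it can be evaluated once at load time instead of on every inference.

```python
# Toy constant folding: nodes whose inputs are all constants are
# evaluated eagerly and removed from the runtime graph.
def constant_fold(nodes, constants):
    remaining = []
    for name, op, inputs, output in nodes:
        if all(i in constants for i in inputs):  # every input is constant
            constants[output] = op(*(constants[i] for i in inputs))
        else:
            remaining.append((name, op, inputs, output))
    return remaining

prepack = lambda w: ("packed", w)                # stand-in for linear_prepack
linear = lambda x, pw: [xi * pw[1] for xi in x]  # stand-in for quantized_linear

constants = {"w": 2}
nodes = [
    ("prepack", prepack, ["w"], "pw"),           # foldable: weight is constant
    ("linear", linear, ["x", "pw"], "y"),        # not foldable: x is a runtime input
]
nodes = constant_fold(nodes, constants)
assert [n[0] for n in nodes] == ["linear"]       # prepack ran at fold time
assert constants["pw"] == ("packed", 2)
```

The fused `quantized_linear_unpacked_weight_v2` cannot be folded this way because it also depends on the runtime activation input.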
Test Plan:
Fixes a huge regression in an internal model:
```
Before
89.6141 ms. 99.0923%. fb::quantized_linear_unpacked_weight_v2 (12 nodes)
After
0.806852 ms. 53.5365%. quantized::linear (12 nodes, out variant)
(prepacking eliminated)
```
Differential Revision: D39622530
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85289
Approved by: https://github.com/davidberard98
Summary: Apparently static runtime's list construct return value is always a `GenericList`, so we cannot use the `toOptionalTensorList` method in the general case -- we must convert each item individually.
Test Plan: New unit test
Differential Revision: D39628979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85298
Approved by: https://github.com/tenpercent
Summary:
- The `ProcessedNodeMetadata` class wraps the possible metadata for `ProcessedNode`. Depending on the nature of the op, a `ProcessedNode` can have one of the following kinds of metadata:
1. `prim::If`/`prim::Loop` ops contain `block_runners_` as their metadata
2. The `prim::fork` op contains a `TaskLauncher` (`std::function`) responsible for execution of the forked subgraph
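A minimal sketch of the wrapper idea (hypothetical names, not the C++ class itself): the metadata object carries either sub-block runners or a launcher callable, but never both.

```python
# Hypothetical metadata wrapper: a node carries either sub-block
# runners (prim::If / prim::Loop) or a task launcher (prim::fork).
class ProcessedNodeMetadata:
    def __init__(self, block_runners=None, launcher=None):
        assert block_runners is None or launcher is None  # mutually exclusive
        self.block_runners = block_runners  # for prim::If / prim::Loop
        self.launcher = launcher            # for prim::fork

if_meta = ProcessedNodeMetadata(block_runners=["then_runner", "else_runner"])
fork_meta = ProcessedNodeMetadata(launcher=lambda task: task())

assert if_meta.launcher is None and len(if_meta.block_runners) == 2
assert fork_meta.launcher(lambda: 42) == 42
```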
Differential Revision: D37320704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79961
Approved by: https://github.com/mikeiovine
Per many C++ code-style guides (for [example](https://google.github.io/styleguide/cppguide.html#Enumerator_Names)), members of an `enum` should be CamelCased,
and only defines should be ALL_CAPS.
Changes `MemOverlap`, `MemOverlapStatus` and `CmpEvalResult` enum values.
Also, `YES`, `NO`, `TRUE` and `FALSE` are often system defines.
Fixes, among other things, the current iOS build regression, which manifests as follows (see [this](6e90572bb9)):
```
/Users/runner/work/pytorch/pytorch/aten/src/ATen/MemoryOverlap.h:19:29: error: expected identifier
enum class MemOverlap { NO, YES, TOO_HARD };
^
/Applications/Xcode_12.4.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator14.4.sdk/usr/include/objc/objc.h:89:13: note: expanded from macro 'YES'
#define YES __objc_yes
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79772
Approved by: https://github.com/drisspg, https://github.com/kulinseth
Summary:
- Remove creation of new `StaticModuleOptions` for the forked subgraph. Use the parent graph's options when creating the runtime for the forked subgraph.
- `StaticRuntimeMetadata` extends `CustomClassHolder`, so it can be cast to `IValue` and attached to the IR node's attributes.
Differential Revision: D37159684
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79578
Approved by: https://github.com/mikeiovine
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75775
fbgemm already implements the fused kernel; there's no reason not to use it
ghstack-source-id: 155450342
Test Plan: New unit tests
Reviewed By: navahgar
Differential Revision: D35633297
fbshipit-source-id: a744a33a65ce7dbb9ce8900dbe091b6d56dd4e48
(cherry picked from commit b1361b349862715aa17e6318c5e658cd6401a464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774
`list[0:]` is a no-op. This should really be eliminated on the modeling side; implement it as a graph pass for now until we can get this fix into prod models.
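The pattern is easy to check in plain Python: a full slice yields an equal list, so the op contributes nothing to the result and only costs a copy.

```python
# Slicing a list from index 0 to the end is a semantic no-op...
lst = [1, 2, 3]
assert lst[0:] == lst        # same contents
assert lst[0:] is not lst    # ...but it allocates a fresh copy,
                             # which eliminating the op avoids
```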
Test Plan: New unit tests
Reviewed By: navahgar
Differential Revision: D35632947
fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d
(cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993
Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0
{F723683014}
More details in https://fb.quip.com/MKumAjz1YD4 (1f47a80e88)a#temp:C:FPD3 (ecd5567980)e5a0871ae5d481286b511ef7
The last 3 outputs of embedding_bag are unused in the graph: P495814049.
* max_indices output isn't necessary for the main output, so remove it when it's not used in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now.
Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`
Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`
Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
0.11156 ms. 99.0457%. aten::embedding_bag (52 nodes, out variant)
After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
0.0789221 ms. 98.7096%. static_runtime::embedding_bag (52 nodes, out variant)
* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/
Reviewed By: mikeiovine
Differential Revision: D35726594
fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)
Summary:
The [comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims it was added for consistency with the top-level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.
Fix violations in 50+ files that were added in the interim, either by removing unused variables or by decorating the code with `C10_UNUSED` when a local variable is likely used to extend an object's lifetime until the end of the block.
The suppressed warnings caused a preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538
Reviewed By: anjali411
Differential Revision: D35747333
Pulled By: malfet
fbshipit-source-id: 3fc5828e44a4c05ba0e89e92613e6ebbdb260626
(cherry picked from commit c179fba21cfa2a0093fad50ccad5a22dd7cff52c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75807
There is a tension in RecordFunction between two use cases:
1) In the normal eager path we don't run any callbacks, so we need to bail out of the profiling path as soon as possible to minimize eager overhead.
2) When profiling we want to determine which callbacks to run as efficiently as possible to minimize instrumentation overhead.
The confounding factor in all of this is sampling callbacks because they change which callbacks will run on each call, even in steady state operation. This has traditionally been handled with a two stage procedure: first we flip a coin to determine if a sampled callback *might* run. If false (which it usually is), do nothing. This solves (1). If true, check to see if we need to build the full callback set or if it was a false positive. This procedure has two negative effects:
* It forces us to rebuild the set of callbacks to run on every step when profiling
* It leaks the sampling abstraction, requiring other parts of the code to bump certain values and forces RecordFunction to lazily initialize.
This change introduces a multi-level cache which can (in the common case) quickly determine which callbacks *will* run, rather than if callbacks *might* run. This means that rather than call `shouldRunRecordFunction`, we can simply get the callbacks for an invocation and check if they are empty. (And completely removes the pre-sampling heuristic.) Another major benefit of the new cache structure is that it allows thread-safe registration and unregistration of global callbacks.
It's worth briefly discussing how this maintains eager performance. In the standard eager case (only sampling callbacks registered) the cache first checks that the global callbacks haven't changed (atomic read), decrements a counter to see if a sampling callback fired, and then returns the active callbacks, which are simply a `SmallVector` of pointer pairs and a couple of POD values (scope, needs inputs/outputs/ids). The biggest cost according to perf is the `SmallVector` logic; we could consider adopting a hard limit on active callbacks, since more than half a dozen callbacks *running* in a single step would be quite a lot. But the total cost relative to `PYTORCH_DISABLE_PER_OP_PROFILING` is only ~10ns, so it's debatable whether switching to `std::array` is worth it.
The primary change is in `record_function.cpp`, which has a more detailed description of the new cache structure. `record_function.h` has some minor changes to align with the new calling convention and the remaining files are simply changes to the call sites.
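The caching idea can be sketched in a few lines (hypothetical names, and a simplification of the real design: this sketch omits the per-step sampling countdown and only re-decides sampled callbacks when the registry version changes): a global registry bumps a version counter on mutation, and each caller caches the set of callbacks that *will* run.

```python
# Sketch of a version-checked callback cache: the common case is an
# equality check on the version counter plus returning a cached list.
import random

class Registry:
    def __init__(self):
        self.version = 0
        self.callbacks = []          # (fn, sampling_prob) pairs

    def add(self, fn, prob=1.0):
        self.callbacks.append((fn, prob))
        self.version += 1            # invalidates every cache

class CallbackCache:
    def __init__(self, registry):
        self.registry = registry
        self.version = -1
        self.active = []

    def get_active(self):
        if self.version != self.registry.version:
            # Rebuild: decide *now* which callbacks will run.
            self.active = [fn for fn, prob in self.registry.callbacks
                           if prob >= 1.0 or random.random() < prob]
            self.version = self.registry.version
        return self.active           # common case: one compare + return

reg = Registry()
cache = CallbackCache(reg)
assert cache.get_active() == []      # nothing registered yet
reg.add(lambda: "observer")
assert len(cache.get_active()) == 1  # rebuilt once after the version bump
```

Checking whether the returned list is empty then replaces the old `shouldRunRecordFunction`-style "might run" query.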
Future work:
* RecordFunction no longer needs to be lazily initialized.
* We can deprecate the disable/reenable APIs, since we can now safely add and remove global callbacks.
Test Plan:
I tested eager mode performance using the overhead benchmark and found that the non-profiled path was unaffected. However the no-op observer dropped from 0.41us to 0.37us (0.25us if no observers are active) which is about 1/3rd reduction in the cost of the callback selection machinery.
I also added several C++ unit tests, as the core RecordFunction machinery (especially sampling) was largely untested.
Reviewed By: swolchok, davidberard98
Differential Revision: D35276158
fbshipit-source-id: 35135f444724fba4eb97c0ae7f3f710f0f9016fd
(cherry picked from commit 9e359b87422c18f2a195185f32e7e85c82f956fd)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74730
Motivation: I am working on implementing a new, more efficient memory planning algorithm. This algorithm cannot replace the old one entirely, because it can only be practically done for models that have sample inputs to warm up with. We need a way to make the memory planner's strategy extensible.
My first pass attempt at implementing the new algorithm crammed everything into the same class, but it became a nightmare to manage (a ton of `if (use_new_strategy)` statements everywhere). Additionally, it was a little clumsy since there are some concepts that make sense for one algorithm but not the other (like `StorageGroup`).
It's much cleaner if we instead turn `MemoryPlanner` into an abstract base class and have different subclasses implement their strategies in `allocateManagedTensors` and `deallocateManagedTensors`.
ghstack-source-id: 153288210
Test Plan: Existing unit tests
Reviewed By: navahgar, hlu1
Differential Revision: D35132124
fbshipit-source-id: c5ef5ae6361b44dedf97090201e244a76e1e6bce
(cherry picked from commit c96f6827c8db88f28c4eb379865ad208beae2034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966
It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this:
```
def forward(self, cond: bool, a, b):
lst = []
if cond:
res = a + b # res should not be managed!!!
lst.append(res)
return lst
```
The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`.
The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`.
This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that, if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is not true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy.
ghstack-source-id: 153288188
Test Plan: New unit tests cover the problematic case
Reviewed By: navahgar
Differential Revision: D35257087
fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7
(cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74927
The move ctor was broken because `BlockRunner` stores a reference to `values_`. When moving runtime instances, the pointer to the root block would be moved, but the reference inside it would not be updated.
Pass `BlockRunner` a raw pointer to the heap-allocated IValues instead to avoid this issue.
ghstack-source-id: 153168602
Test Plan: New unit test/CI
Reviewed By: navahgar
Differential Revision: D35228467
fbshipit-source-id: 04e198b39f898b82677a0e41e1cdf00c2b0c09f3
(cherry picked from commit 03e2c591ac3a907d68025eae9500ed7226dec17e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481
This diff fixes an interesting performance issue related to `permute_copy`.
We see this pattern frequently:
```
y = torch.permute(x, (0, 2, 1))
z = torch.sum(y, dim=-1)
```
With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589
But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597
But the permute is actually unnecessary, we can just statically turn the graph into this to ensure that the fast kernel is hit with copy variants on:
```
z = torch.sum(x, dim=1)
```
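The rewrite relies on a simple identity: summing a permuted tensor over its last dimension equals summing the original over the dimension that was moved there. A plain-Python check on a nested-list "tensor" (no torch involved):

```python
# x has shape (1, 3, 2); permute(x, (0, 2, 1)) has shape (1, 2, 3).
x = [[[1, 2], [3, 4], [5, 6]]]
y = [[[x[i][k][j] for k in range(3)] for j in range(2)] for i in range(1)]

# z1 = sum(y, dim=-1) and z2 = sum(x, dim=1) -- both shape (1, 2)
z1 = [[sum(row) for row in plane] for plane in y]
z2 = [[sum(x[i][k][j] for k in range(3)) for j in range(2)] for i in range(1)]
assert z1 == z2 == [[9, 12]]
```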
ghstack-source-id: 152003888
Reviewed By: navahgar
Differential Revision: D34992319
fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f
(cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606
The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
ghstack-source-id: 151394116
Test Plan: Existing unit tests
Reviewed By: hlu1
Differential Revision: D34562131
fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73681
Static runtime is rejecting legal calls made with the kwargs API when there are parameters with default values.
ghstack-source-id: 150433627
Test Plan: Added unit test to cover this case
Reviewed By: navahgar, d1jang
Differential Revision: D34588804
fbshipit-source-id: 74d7ef5bee74f9d16b02b0c8ceda4285ea776755
(cherry picked from commit 9c3db19cb45f6022e646deeb1e8056daa04f363f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73536
Currently, the `ProcessedNode` class assumes two distinct roles that are not too obvious:
1) a "template" that contains the metadata of an actual executable node, owned by `StaticModule`
2) fully instantiated nodes that are owned by `StaticRuntime`
We currently merge these two use cases into one class, which can be error-prone if illegal copying happens uncontrollably. Currently, we only copy objects of kind (1) into objects of kind (2) when a `StaticRuntime` instance is created.
To address this issue, this change introduces `StaticNodeInfo`, a separate class, to distinguish the aforementioned two use cases more clearly in the code. With this, `StaticNodeInfo` is for (1) and `ProcessedNode` is for (2).
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D33985600
fbshipit-source-id: 0c79cea2bf982dd956a35f48eaf6027e5b6e390c
(cherry picked from commit 0d8acc4a2b6eeb3e4af3ad2c99f4cd667680f8df)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72946
The passes that replace ops with copy variants run after TensorExpr fusion. As a result, the graph they would produce does not conform to the assumptions made in the fuser.
So, even if the `use_copy_variants` and `use_maybe_copy_variants` flags are turned on, the corresponding passes will not be executed if TensorExpr fusion is enabled.
ghstack-source-id: 149429753
Test Plan: Tested locally.
Reviewed By: mikeiovine
Differential Revision: D34283842
fbshipit-source-id: 74edea517a00c85dff0319f9c8b3ac8befe09018
(cherry picked from commit 3798af7f1b8c9b3c072862f58ebf16af6294db14)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73032
Currently, `ptvsc2_predictor_bench` reports nothing when the input size is zero. However, Static Runtime's module creation yields some useful information even when only a model has been loaded.
This change reports static op statistics when the given input's size is zero. In addition, it enables reporting the out-variant coverage percentage, which is crucial for establishing the baseline performance of Static Runtime.
Test Plan: - Ran `ptvsc2_predictor_bench` with this change as seen above.
Reviewed By: mikeiovine
Differential Revision: D34294803
fbshipit-source-id: 80c02199075dae9280657d6edecc7c679c1c27f4
(cherry picked from commit 83aec141a25a9ede5d22e5c17c0b6b07307faf39)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587
This pattern frequently appears in a few graphs:
```
%result = prim::If(%condition)
block0():
-> (%a)
block1():
-> (%b)
```
This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.
This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:
```
%result = prim::IfThenElse(%condition, %a, %b)
```
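The rewrite above can be mirrored in plain Python: a two-block `prim::If` that merely forwards `%a` or `%b` is exactly a ternary select, with no sub-blocks (and hence no block runners or memory planners) to manage.

```python
# The original pattern: an if with two trivial forwarding blocks.
def if_blocks(condition, a, b):
    if condition:      # block0
        result = a
    else:              # block1
        result = b
    return result

# The fused op: a single ternary select.
def if_then_else(condition, a, b):
    return a if condition else b

for cond in (True, False):
    assert if_blocks(cond, "a", "b") == if_then_else(cond, "a", "b")
```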
Test Plan: New unit tests
Reviewed By: eellison
Differential Revision: D34091789
fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b83f402bda2dcdd9ecb411e0b67c651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72592
Only code paths that are not perf-critical read `ProcessedNode::num_outputs_`, and the value is a static feature of the op that the `ProcessedNode` instance is executing.
Therefore, it's better to move `ProcessedNode::num_outputs_` into `ProcessedFunction::num_outputs_` and let `ProcessedNode` access it via `ProcessedNode::fn_` for its occasional use. Note that this avoids duplicating `num_outputs_` per node and per Static Runtime instance, since `ProcessedFunction` instances are shared across all runtime instances.
It's confirmed that this change reduces `sizeof(ProcessedNode)` by 14%, per local instrumentation:
- Before: `sizeof(ProcessedNode)`: 56
- After: `sizeof(ProcessedNode)`: 48
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: mikeiovine
Differential Revision: D33984792
fbshipit-source-id: e29ffc97b799e679215f42e1e85cd3fcd7e88983
(cherry picked from commit 0f7003f4dfd6473a70355ca3c6f51498abf1d7be)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71102
This graph pass is causing a major perf regression on some models. Ideally we would introduce maybe_copy variants for all these ops. But since those are tricky to write, I've introduced a flag to just turn the pass off for now.
ghstack-source-id: 148541673
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: navahgar
Differential Revision: D33510080
fbshipit-source-id: bb4847f26561197ea5e6bbad0a4d25db4ef468eb
(cherry picked from commit 8f333d3e8138e2a7ba04bea7509ad84dd97844eb)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71807
There's no need to completely disallow `aten::__is__` and `aten::__isnot__`. The only problematic case is when the comparison is between two tensors, e.g. in
```
def forward(x):
y = x.detach()
# Should be false, but we get True
# after our EliminateNoOps pass
return x is y
```
Test Plan: New unit test covers this case
Reviewed By: d1jang
Differential Revision: D33783668
fbshipit-source-id: c9f57fa96937ecce38a21554f12b69c45cc58fe4
(cherry picked from commit 019588f4ca3fcd2b3ae51bccab102f0538745b15)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838
Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186483
Test Plan: New unit tests: `buck test caffe2/benchmark/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33049595
fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532
(cherry picked from commit 8baf1753af34f4c166b4680e42589517fd2e508d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837
Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186475
Test Plan:
New unit tests: `buck test caffe2/benchmarks/static_runtime/...`
Accuracy test at top of stack
Reviewed By: d1jang
Differential Revision: D33045908
fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55
(cherry picked from commit de3b12dc0871e8ca09891c257e1dfd7cd352aa7c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69836
It is technically possible for the sub-blocks to return zero outputs. This is problematic for `StaticRuntimeBlockRunner`, because it assumes that at least one output is being returned.
Rather than slowing down SR with special logic for this corner case, we can simply force these sub-blocks to return `None`.
ghstack-source-id: 148186453
Test Plan: Sub-blocks with no return values tested at top of stack
Reviewed By: d1jang
Differential Revision: D33050420
fbshipit-source-id: 17d9e19fda6431aa9fd0b155131349bac42bc149
(cherry picked from commit c97fd07bf53e1e253a0e6c733db5ea7c86698fc9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835
`StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`.
Note that this also applies to returned values in sub-blocks that are from outer scopes.
ghstack-source-id: 148186452
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
Added a new unit test with a graph that simply returns a constant.
Tests with sub-blocks at top of stack.
Reviewed By: d1jang
Differential Revision: D33047519
fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b
(cherry picked from commit a8f89a12ee726aa7d7e546dee25d696eef868ce7)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71986
To address concerns over space increase from control flow.
`op_name_` was only stored as a minor optimization to avoid name lookup during logging; we can safely get rid of it. Thanks to the sampling mechanism, `get_op_name()` is called very infrequently, so this shouldn't cause too much of a regression.
ghstack-source-id: 148086244
Test Plan: CI
Reviewed By: d1jang
Differential Revision: D33821005
fbshipit-source-id: 6f74eb30a54a046ca90768aebbcde22e8c435f35
(cherry picked from commit 361ba32e97dbd130938bae10b5159730822c518c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834
* Modify the `StaticModule` constructor to handle index initialization for sub-blocks.
* Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes!
ghstack-source-id: 148086245
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33028039
fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6
(cherry picked from commit 3a9feffd929869120c717d35aa55aad8a382783d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70944
Added special net-level/op-level scopes for static runtime. We can use these to add special behavior in record functions when they are invoked from a static runtime context.
Reviewed By: navahgar
Differential Revision: D33458211
fbshipit-source-id: 0b7022100e9f5ac872f4cb5bfba14e92af2c71b0
(cherry picked from commit b486548544c5e822803071756c85e675e37d2dad)