Commit Graph

257 Commits

Author SHA1 Message Date
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
ad8aef0f98 [BE] [3/N] Use nested namespaces (#110314)
Mostly in torch/csrc/jit/runtime and in `ATen/cuda/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110314
Approved by: https://github.com/seemethere
2023-09-30 02:23:48 +00:00
7ce69d5dbe [RELAND] Remove some unnecessary <iostream> includes from headers (#108150)
In almost all cases this is only included for writing the output formatter, which
only uses `std::ostream` so including `<ostream>` is sufficient.

The istream header is ~1000 lines so the difference is non-trivial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108150
Approved by: https://github.com/albanD, https://github.com/malfet
ghstack dependencies: #108149
2023-09-20 21:55:15 +00:00
39ff80125f Add support for an operator level thread local observer (#108822)
Summary: Add support for an operator level thread local observer

Test Plan: Verified the interception as part of a pytorch model evaluation with static runtime.

Differential Revision: D49082250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108822
Approved by: https://github.com/davidberard98
2023-09-08 19:32:03 +00:00
378ffde8c1 Revert "Remove some unnecessary <iostream> includes from headers (#106914)"
This reverts commit a6c29b722772816804d54eed070fbb38450d3e6f.

Reverted https://github.com/pytorch/pytorch/pull/106914 on behalf of https://github.com/izaitsevfb due to Causing metal breakage internally, see D48709279 ([comment](https://github.com/pytorch/pytorch/pull/106914#issuecomment-1696670027))
2023-08-29 02:22:33 +00:00
a6c29b7227 Remove some unnecessary <iostream> includes from headers (#106914)
In almost all cases this is only included for writing the output formatter, which
only uses `std::ostream` so including `<ostream>` is sufficient.

The istream header is ~1000 lines so the difference is non-trivial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106914
Approved by: https://github.com/lezcano
2023-08-25 18:24:05 +00:00
28dc1a093f Revert "Remove some unnecessary <iostream> includes from headers (#106914)"
This reverts commit 60936e4c296e79f56cac2431a560970bb4529d03.

Reverted https://github.com/pytorch/pytorch/pull/106914 on behalf of https://github.com/ZainRizvi due to Sorry, but this is breaking internal builds. Seems like a lot of internal code depends on some of the removed imports ([comment](https://github.com/pytorch/pytorch/pull/106914#issuecomment-1688605975))
2023-08-22 17:16:48 +00:00
60936e4c29 Remove some unnecessary <iostream> includes from headers (#106914)
In almost all cases this is only included for writing the output formatter, which
only uses `std::ostream` so including `<ostream>` is sufficient.

The istream header is ~1000 lines so the difference is non-trivial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106914
Approved by: https://github.com/lezcano
2023-08-19 20:21:58 +00:00
cyy
77f2883c41 [Reland2] fix missing-prototypes warnings in torch_cpu (Part 4) (#102228)
This PR relands the changes introduced in PR https://github.com/pytorch/pytorch/pull/100849. The old PR turnd nnc_* functions into  static. We now add declarations for them and hope that inter builds will pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102228
Approved by: https://github.com/albanD
2023-06-02 22:04:44 +00:00
32ce06a5ab Revert "[Reland] fix missing-prototypes warnings in torch_cpu (Part 4) (#101949)"
This reverts commit 4f2c007a1b5170c2aa0d47e388ff9e07c7a7d354.

Reverted https://github.com/pytorch/pytorch/pull/101949 on behalf of https://github.com/osalpekar due to As noted in @izaitsevfb's comment, we are still seeing linker errors, this time due to `nnc_prepacked_linear_clamp_run` being made a static function. ([comment](https://github.com/pytorch/pytorch/pull/101949#issuecomment-1560226880))
2023-05-23 22:53:47 +00:00
cyy
4f2c007a1b [Reland] fix missing-prototypes warnings in torch_cpu (Part 4) (#101949)
This PR relands the changes introduced in PR #100849. The old PR turnd  nnc_aten_embedding  into a static function, however, it is actually used in torch/csrc/jit/tensorexpr/operators/misc.cpp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101949
Approved by: https://github.com/albanD
2023-05-22 10:53:07 +00:00
498c34e8e8 Revert " fix missing-prototypes warnings in torch_cpu (Part 4) (#100849)"
This reverts commit c2f28d1c1df0db78f2951e4df5dde264f80f07eb.

Reverted https://github.com/pytorch/pytorch/pull/100849 on behalf of https://github.com/izaitsevfb due to fails internal Meta builds, including fbcode and android, see D46009888: ld.lld: error: undefined symbol: nnc_aten_embedding ([comment](https://github.com/pytorch/pytorch/pull/100849#issuecomment-1555105800))
2023-05-19 19:05:15 +00:00
cyy
c2f28d1c1d fix missing-prototypes warnings in torch_cpu (Part 4) (#100849)
This PR fixes more missing-prototypes violations in the torch_cpu source following PRs #100053, #100147 and #100245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100849
Approved by: https://github.com/albanD
2023-05-18 03:49:45 +00:00
e7874eea7a fix the use of incomplete vector<T> for C++20 compatibilities (#93978)
Avoid referring to std::vector<T> members and constructor/desctructors when T is incomplete.

Referring to incomplete members is [not legal](https://timsong-cpp.github.io/cppwp/n4868/vector#overview-4) according to the C++ standard.

Non-noexcept constructors need access to members' destructors. As of C++20, std::vector's destructor is constexpr and so forcefully requires a complete type for the vector's elements.

These issues cause build errors in newer toolchains under c++20 mode.

Fix them by moving code that needs complete types to a different place where the type is already defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93978
Approved by: https://github.com/Skylion007
2023-04-04 07:47:43 +00:00
d70f9c7888 Fix typo under torch/csrc/jit/runtime directory (#97243)
This PR fixes typo in comments and messages under `torch/csrc/jit/runtime` directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97243
Approved by: https://github.com/davidberard98
2023-03-29 20:17:10 +00:00
57bb5b159d [static-runtime] one more attempt to improve crash log readability (#96903)
Summary:
* add human readable type and ivalue printout
* fix internal linter warnings

Test Plan:
error message now looks like e.g.
```
E0315 16:27:32.409082 422313 ExceptionTracer.cpp:222] exception stack complete
terminate called after throwing an instance of 'c10::Error'
  what():  List[int] is not a subtype of List[int]; schema arg name: 'split_sizes', ivalue: [1, 1]
```

Differential Revision: D44112297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96903
Approved by: https://github.com/davidberard98
2023-03-17 17:56:26 +00:00
cc798f1a4f [PyTorch] add c10/util/FbcodeMaps.h (#96359)
Allow us to use folly maps in fbcode and std maps for compatibility in OSS, extending what static runtime is already doing.

Differential Revision: [D43926670](https://our.internmc.facebook.com/intern/diff/D43926670/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96359
Approved by: https://github.com/ezyang
2023-03-10 02:18:16 +00:00
b90a9c7db2 [static-runtime] fix one forwarding usage (#96271)
Summary: as titled

Test Plan: ci

Differential Revision: D43897369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96271
Approved by: https://github.com/davidberard98
2023-03-08 07:38:21 +00:00
a5aceba61f [static-runtime] a pass through checks throwing exceptions (#95983)
Summary: increasing verbosity where possible

Test Plan: CI

Differential Revision: D43761268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95983
Approved by: https://github.com/davidberard98
2023-03-07 19:16:27 +00:00
3bb76e6ced [static-runtime] increase verbosity for schema check (#95937)
Summary: as titled

Differential Revision: D43758690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95937
Approved by: https://github.com/wushirong, https://github.com/hl475, https://github.com/houseroad
2023-03-03 06:50:28 +00:00
0247ed27cc Apply Clang-Tidy readability-container-size-empty (#93236)
Not only is this change usually shorter and more readable, it also can yield better performance. size() is not always a constant time operation (such as on LinkedLists), but empty() always is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
2023-01-29 23:28:19 +00:00
1eba3f220e Fix bugs found by static analysis (#85705)
These PR fixes a number of bugs found by Svace static analyzer:

1. DEREF_AFTER_FREE at qnnpack_utils.h:
Pointer '&convolution->zero_buffer' is dereferenced at qnnpack_utils.h:258 after the referenced memory was deallocated at operator-delete.c:25 by passing as 1st parameter to function 'pytorch_qnnp_delete_operator' at qnnpack_utils.h:251.
2. DEREF_AFTER_NULL at impl.cpp:
After having been compared to NULL value at impl.cpp:1892, pointer 'schema' is passed as 2nd parameter in call to function 'c10::operator<<' at impl.cpp:1921, where it is dereferenced at function_schema_inl.h:13.
3. DEREF_OF_NULL  at stmt.h:
After having been compared to NULL value at stmt.h:744, pointer 'body->_M_ptr' is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at stmt.h:745, where it is dereferenced at exceptions.h:67.
4. DEREF_OF_NULL  at loopnest.h:
Pointer 'f->ptr' that can have only NULL value (checked at loopnest.cpp:1482), is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at loopnest.cpp:1483, where it is dereferenced at exceptions.h:67.
This is the same error as 3: forwarding a nullptr to malformed_input().
4. TAINTED_INT.LOOP in python_arg_parser:
Integer value 'this->size' obtained from untrusted source at python_arg_parser.cpp:118 without checking its bounds is used as a loop bound at python_arg_parser.cpp:698 by calling function 'torch::FunctionParameter::set_default_str' at python_arg_parser.cpp:133.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85705
Approved by: https://github.com/kit1980
2022-10-28 23:51:55 +00:00
81c4049f4d [Static Runtime] Move PrepackWeights to internal-only graph passes (#87799)
Summary:
The pass introduces an `fb::` operator and thus cannot be used in OSS.

The test failure was not exposed because the Static Runtime tests have been disabled in OSS for a while. The Dev Infra folks encountered this failure when re-enabling the tests.

Test Plan: Existing tests

Differential Revision: D40724547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87799
Approved by: https://github.com/huydhn
2022-10-28 01:28:34 +00:00
ed7a8ab436 [Static Runtime] Make canEnableStaticRuntime examine sub-blocks (#87396)
Summary:
Someone was running into problems where

1) Static Runtime enablement would fail
2) We would try to fall back to the JIT interpreter *after trying to create `StaticModule`*
3) The fallback fails because Static Runtime mangled the graph.

We don't want to prevent Static Runtime from mutating its input due to memory concerns. The intent of `canEnableStaticRuntime` is to catch issues in the module before Static Runtime messes with it.

With this diff, `StaticModule` instantiation can be avoided by querying `canEnableStaticRuntime` and the issue is fixed.

Test Plan: New unit test

Differential Revision: D40564452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87396
Approved by: https://github.com/tenpercent
2022-10-26 14:34:29 +00:00
63c1f2fef9 [Static Runtime] Fold linear prepack ops (#85289)
Summary: Split `quantized_linear_unpacked_weight_v2` into `linear_prepack` and `quantized_linear` so that the prepacking operation may be eliminated by constant folding.

Test Plan:
Fixes a huge regression in an internal model:

```
Before
        89.6141 ms.    99.0923%. fb::quantized_linear_unpacked_weight_v2 (12 nodes)
After
       0.806852 ms.    53.5365%. quantized::linear (12 nodes, out variant)
(prepacking eliminated)
```

Differential Revision: D39622530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85289
Approved by: https://github.com/davidberard98
2022-09-22 20:23:07 +00:00
e4899764b2 [Static Runtime] Fix aten::index_put list conversions (#85298)
Summary: Apparently static runtime's list construct return value is always a `GenericList`, so we cannot use the `toOptionalTensorList` method in the general case -- we must convert each item individually.

Test Plan: New unit test

Differential Revision: D39628979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85298
Approved by: https://github.com/tenpercent
2022-09-22 20:21:52 +00:00
4f34cd6d1e Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032)
Avoid exposing defines that conflict with google logging, since this blocks external usage of libtorch in certain cases.

All the 'interesting' changes should be in these two files, and the rest should just be mechanical changes via sed.
c10/util/logging_is_not_google_glog.h
c10/util/logging_is_google_glog.h

Fixes https://github.com/pytorch/pytorch/issues/81415

cc @miladm @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82032
Approved by: https://github.com/soumith, https://github.com/miladm
2022-07-26 01:20:44 +00:00
937ca69f15 [TorchArrow][efficiency][3/n] variadic versions of op fused /unfused inference_wrapper_run_flat (#81133)
Summary:
Added `variadic` version (just an optimization) of the registered fused and unfused ops.

Reviewed By: tenpercent

Differential Revision: D37456033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81133
Approved by: https://github.com/tenpercent, https://github.com/qxy11
2022-07-13 10:30:32 +00:00
94a8a8aa32 [di] avoid copying optional input for get_real_inputs_from_optional_inputs_v2 when possible (#81137)
Summary: avoid copies and casting by moving out of the inputs list

Differential Revision: D37572556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81137
Approved by: https://github.com/mikeiovine
2022-07-13 04:45:56 +00:00
7f8e852dff [Static Runtime] Support Futures in Static Runtime Engine (#80162)
Summary: - Static Runtime now exports runAsync() API which returns an intrusive_ptr to c10:Future type

Differential Revision: D37385849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80162
Approved by: https://github.com/mikeiovine
2022-06-28 23:57:26 +00:00
3afc802c5a [Static Runtime] Add Metadata to ProcessedNode depending upon the op type (#79961)
Summary:
- ProcessedNodeMetadata class wraps the possible metadata for ProcessedNode. Depending upon the nature of op, processedNode can have one of the below possibilities of metadata:
1. prim::If/prim::Loop ops contains block_runners_ as their metadata
2. prim::fork op contains TaskLauncher (std::function) responsible for execution of forked subgraph

Differential Revision: D37320704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79961
Approved by: https://github.com/mikeiovine
2022-06-24 06:03:06 +00:00
4a4890cfb2 [BE] Use CamelCase for enum class members (#79772)
Per many C++ code-style guides members(for [example](https://google.github.io/styleguide/cppguide.html#Enumerator_Names) ) members of `enum` should be CamelCased,
and only defines should be ALL_CAPS

Changes `MemOverlap`, `MemOverlapStatus` and `CmpEvalResult` enum values

Also, `YES`, `NO`, `TRUE` and `FALSE` are often system defines

Fixes among other things, current iOS build regression, see, which manifests as follows (see [this](6e90572bb9):
```
/Users/runner/work/pytorch/pytorch/aten/src/ATen/MemoryOverlap.h:19:29: error: expected identifier
enum class MemOverlap { NO, YES, TOO_HARD };
                            ^
/Applications/Xcode_12.4.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator14.4.sdk/usr/include/objc/objc.h:89:13: note: expanded from macro 'YES'
#define YES __objc_yes
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79772
Approved by: https://github.com/drisspg, https://github.com/kulinseth
2022-06-17 05:53:57 +00:00
ca7ab1708b [Static runtime] Pass parent graph metadata to forked subgraphs (#79578)
Summary:
- Remove creation of new StaticModuleOptions for the forked subgraph. Use parent graph's options for creation of runtime for forked subtraph
- StaticRuntimeMetdata extends CustomClassHolder which can be casted to IValue and attached to IR node's attributes.

Differential Revision: D37159684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79578
Approved by: https://github.com/mikeiovine
2022-06-16 20:35:46 +00:00
0bc1b9e039 [static runtime] Add logging info for unspported ops (#79064)
Summary: This adds logging info for unspported ops in static runitme.

Test Plan: static runtime unit tests

Differential Revision: D36984962

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79064
Approved by: https://github.com/davidberard98
2022-06-15 18:38:16 +00:00
c083489f46 [kineto] Optimize getStepCallbacks for common case of no active callbacks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77804

IIUC, the result of this function will be empty and unused if there are no sampled callbacks, which is the common case. We can accelerate this case by wrapping the result in an optional to save initializing an empty SmallVector.

Differential Revision: [D36497279](https://our.internmc.facebook.com/intern/diff/D36497279/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36497279/)!

Approved by: https://github.com/robieta
2022-05-24 19:38:01 +00:00
2ae3c59e4b [SR] Remove linear/relu fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77620

Apparently, this is not implemented in fbgemm, so it's strictly worse than using NNC.

Differential Revision: [D36431811](https://our.internmc.facebook.com/intern/diff/D36431811/)

Approved by: https://github.com/hlu1
2022-05-23 21:46:27 +00:00
c60d2ef4eb [StaticRuntime] Replace Permute with copy version only when it's followed by reshape or flatten (#77832)
Reviewed By: mikeiovine

Differential Revision: D36466622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77832
Approved by: https://github.com/mikeiovine
2022-05-20 03:14:01 +00:00
02713221e3 [SR] Fuse clamp/nan_to_num
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094

Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speed up on many models. We can avoid comparisons since clamp potentially gets rid of all of the `inf`s in the input tensor.

Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)!

Approved by: https://github.com/navahgar
2022-05-10 23:33:59 +00:00
52af4fc5ba [PyTorch] Make RecordFunction store inputs as ArrayRef (#72484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72484

Stepping stone toward stack-allocating array of inputs.

Funnily enough, this seems to improve performance too.
ghstack-source-id: 155492056

Test Plan:
1) CI
2) framework overhead benchmark with --stressTestRecordFunction --captureRecordFunctionInputs goes from 0.76 usec/iter to 0.72.

Reviewed By: chaekit, robieta

Differential Revision: D34061169

fbshipit-source-id: 073fedf1d3d162f927c4e9867cfda7dbfabba215
(cherry picked from commit dae77cf1cd8813d902d73999ad97133a3ef8e291)
2022-05-05 21:38:42 +00:00
fc64dbdc01 [SR] Fuse quantized linear/relu (#75775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75775

fbgemm kernels already implement the fused kernel, no reason not to use it
ghstack-source-id: 155450342

Test Plan: New unit tests

Reviewed By: navahgar

Differential Revision: D35633297

fbshipit-source-id: a744a33a65ce7dbb9ce8900dbe091b6d56dd4e48
(cherry picked from commit b1361b349862715aa17e6318c5e658cd6401a464)
2022-05-04 19:01:14 +00:00
b02b3f25db [SR] Quick hack to eliminate no-op slice (#75774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774

`list[0:]` is a no-op. This should really be eliminated on the modeling side, implement as a graph pass for now until we can get this into prod models.

Test Plan: New unit tests

Reviewed By: navahgar

Differential Revision: D35632947

fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d
(cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)
2022-05-03 19:29:46 +00:00
dcdc7b2ffc [RF][scuba] add pytorch_operator_stats column for Static Runtime out variant (#76566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76566

Add static_runtime_out_variant field to pytorch_operator_stats scuba.

Add field for static_runtime_out_variant to RecordFunction.

Test Plan:
`ptail perfpipe_pytorch_operator_stats_dev | grep devbig371`
No out variant, SR on: P498206546
Out variant: P498206634

Check column shows up in scuba: https://fburl.com/scuba/pytorch_operator_stats_dev/tfgmth1t

CMF 4M test https://www.internalfb.com/intern/servicelab/802987274/
ICVR 4M https://www.internalfb.com/intern/servicelab/802987272/

AF prod canary
https://our.intern.facebook.com/intern/ads/canary/443234131523314631

Reviewed By: mikeiovine

Differential Revision: D36016857

fbshipit-source-id: f3af315d1d2b0d8b147be76df63daa8ab872bf8e
(cherry picked from commit 208db7cd15fb3e1be66db2d1834eebaf0964d017)
2022-05-03 17:20:28 +00:00
5f98145fd9 [scuba] log to pytorch_model_stats when we've tried and failed to enable static runtime
Differential Revision: D35848179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76240
Approved by: https://github.com/huiguoo, https://github.com/osalpekar
2022-05-02 18:25:39 +00:00
ee636e2fd1 [sr] remove max_indices argument of embedding_bag when unncessary (#75993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993

Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0
{F723683014}

More details in https://fb.quip.com/MKumAjz1YD4 (1f47a80e88)a#temp:C:FPD3 (ecd5567980)e5a0871ae5d481286b511ef7

The last 3 outputs of embedding_bag are unused in the graph: P495814049.
* max_indices output isn't necessary for the main output, so remove it when it's not used in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now.

Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`

Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`

Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
        0.11156 ms.    99.0457%. aten::embedding_bag (52 nodes, out variant)

After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
      0.0789221 ms.    98.7096%. static_runtime::embedding_bag (52 nodes, out variant)

* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/

Reviewed By: mikeiovine

Differential Revision: D35726594

fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)
2022-04-22 15:36:35 +00:00
f6c275f55d Remove -Wno-unused-variable from utils.cmake (take 2) (#75538)
Summary:
[Comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims, it got added for consistency with  top level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.

Modify violations in 50+ files that were added in the interim by either removing unused variables, or decorating the code with `C10_UNUSED` if local variable is likely used to extend object lifetime until the end of the block.

Caused preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538

Reviewed By: anjali411

Differential Revision: D35747333

Pulled By: malfet

fbshipit-source-id: 3fc5828e44a4c05ba0e89e92613e6ebbdb260626
(cherry picked from commit c179fba21cfa2a0093fad50ccad5a22dd7cff52c)
2022-04-20 17:41:59 +00:00
a5e338a826 [RecordFunction] More effecient machinery to determine which callbacks to run. (#75807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75807

There is a tension in RecordFunction between two use cases:
1) In the normal eager path we don't run any callbacks, so we need to bail out of the profiling path as soon as possible to minimize eager overhead.
2) When profiling we want to determine which callbacks to run as efficiently as possible to minimize instrumentation overhead.

The confounding factor in all of this is sampling callbacks because they change which callbacks will run on each call, even in steady state operation. This has traditionally been handled with a two stage procedure: first we flip a coin to determine if a sampled callback *might* run. If false (which it usually is), do nothing. This solves (1). If true, check to see if we need to build the full callback set or if it was a false positive. This procedure has two negative effects:
* It forces us to rebuild the set of callbacks to run on every step when profiling
* It leaks the sampling abstraction, requiring other parts of the code to bump certain values and forces RecordFunction to lazily initialize.

This change introduces a multi-level cache which can (in the common case) quickly determine which callbacks *will* run, rather than if callbacks *might* run. This means that rather than call `shouldRunRecordFunction`, we can simply get the callbacks for an invocation and check if they are empty. (And completely removes the pre-sampling heuristic.) Another major benefit of the new cache structure is that it allows thread-safe registration and unregistration of global callbacks.

It's worth briefly discussing how this maintains eager performance. In the standard eager case (only sampling callbacks registered) the cache first checks that the global callbacks haven't changed (atomic read), decrements a counter to see if a sampling callback fired, and then returns the active callbacks which is simply a SmallVector of pointer pairs and a couple POD values (scope, needs inputs/outputs/ids). The biggest cost according to perf is the SmallVector logic; we could consider adopting a hard limit on active callbacks; more than half a dozen callbacks *running* in a single step would be quite a lot. But the total cost relative to `PYTORCH_DISABLE_PER_OP_PROFILING` is only ~10ns, so debatable if it's worth it to switch to `std::array`.

The primary change is in `record_function.cpp`, which has a more detailed description of the new cache structure. `record_function.h` has some minor changes to align with the new calling convention and the remaining files are simply changes to the call sites.

Future work:
  * RecordFunction no longer needs to be lazily initialized.
  * We can deprecate the disable/reenable APIs, since we can not safely add and remove global callbacks.

Test Plan:
I tested eager mode performance using the overhead benchmark and found that the non-profiled path was unaffected. However the no-op observer dropped from 0.41us to 0.37us (0.25us if no observers are active) which is about 1/3rd reduction in the cost of the callback selection machinery.

I also added several C++ unit tests, as the core RecordFunction machinery (especially sampling) was largely untested.

Reviewed By: swolchok, davidberard98

Differential Revision: D35276158

fbshipit-source-id: 35135f444724fba4eb97c0ae7f3f710f0f9016fd
(cherry picked from commit 9e359b87422c18f2a195185f32e7e85c82f956fd)
2022-04-19 20:46:16 +00:00
5c56b2286b Revert "Remove -Wno-unused-variable from utils.cmake"
This reverts commit 018cbe1f5ccccb9194394d6e737310f837f8ad7a.

Reverted https://github.com/pytorch/pytorch/pull/75538 on behalf of https://github.com/seemethere
2022-04-19 17:19:09 +00:00
018cbe1f5c Remove -Wno-unused-variable from utils.cmake
[Comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims, it got added for consistency with  top level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.

Modify violations in 50+ files that were added in the interim by either removing unused variables, or decorating the code with `C10_UNUSED` if local variable is likely used to extend object lifetime until the end of the block.

Caused preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538
Approved by: https://github.com/cpuhrsch
2022-04-19 15:26:55 +00:00
b7682d351a [SR] Refactor memory planner to prepare for new algorithm (#74730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74730

Motivation: I am working on implementing a new, more efficient memory planning algorithm. This algorithm cannot replace the old one entirely, because it can only be practically done for models that have sample inputs to warm up with. We need a way to make the memory planner's strategy extensible.

My first pass attempt at implementing the new algorithm crammed everything into the same class, but it became a nightmare to manage (a ton of `if (use_new_strategy)` statements everywhere). Additionally, it was a little clumsy since there are some concepts that make sense for one algorithm but not the other (like `StorageGroup`).

It's much cleaner if we instead turn `MemoryPlanner` into an abstract base class and have different subclasses implement their strategies in `allocateManagedTensors` and `deallocateManagedTensors`.
ghstack-source-id: 153288210

Test Plan: Existing unit tests

Reviewed By: navahgar, hlu1

Differential Revision: D35132124

fbshipit-source-id: c5ef5ae6361b44dedf97090201e244a76e1e6bce
(cherry picked from commit c96f6827c8db88f28c4eb379865ad208beae2034)
2022-04-07 22:11:19 +00:00
2f98fa9147 [SR] Do not manage tensors that escape scope via container (#74966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966

It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this:

```
def forward(self, cond: bool, a, b):
    lst = []
    if cond:
        res = a + b # res should not be managed!!!
        lst.append(res)
    return lst
```

The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`.

The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`.

This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that, if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is not true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy.
ghstack-source-id: 153288188

Test Plan: New unit tests cover the problematic case

Reviewed By: navahgar

Differential Revision: D35257087

fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7
(cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)
2022-04-07 19:57:57 +00:00