pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Scott Wolchok	c083489f46	[kineto] Optimize getStepCallbacks for common case of no active callbacks Pull Request resolved: https://github.com/pytorch/pytorch/pull/77804 IIUC, the result of this function will be empty and unused if there are no sampled callbacks, which is the common case. We can accelerate this case by wrapping the result in an optional to save initializing an empty SmallVector. Differential Revision: [D36497279](https://our.internmc.facebook.com/intern/diff/D36497279/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36497279/)! Approved by: https://github.com/robieta	2022-05-24 19:38:01 +00:00
Taylor Robie	a5e338a826	[RecordFunction] More efficient machinery to determine which callbacks to run. (#75807) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75807 There is a tension in RecordFunction between two use cases: 1) In the normal eager path we don't run any callbacks, so we need to bail out of the profiling path as soon as possible to minimize eager overhead. 2) When profiling we want to determine which callbacks to run as efficiently as possible to minimize instrumentation overhead. The confounding factor in all of this is sampling callbacks because they change which callbacks will run on each call, even in steady state operation. This has traditionally been handled with a two stage procedure: first we flip a coin to determine if a sampled callback might run. If false (which it usually is), do nothing. This solves (1). If true, check to see if we need to build the full callback set or if it was a false positive. This procedure has two negative effects: * It forces us to rebuild the set of callbacks to run on every step when profiling * It leaks the sampling abstraction, requiring other parts of the code to bump certain values and forces RecordFunction to lazily initialize. This change introduces a multi-level cache which can (in the common case) quickly determine which callbacks will run, rather than if callbacks might run. This means that rather than call `shouldRunRecordFunction`, we can simply get the callbacks for an invocation and check if they are empty. (And completely removes the pre-sampling heuristic.) Another major benefit of the new cache structure is that it allows thread-safe registration and unregistration of global callbacks. It's worth briefly discussing how this maintains eager performance. 
In the standard eager case (only sampling callbacks registered) the cache first checks that the global callbacks haven't changed (atomic read), decrements a counter to see if a sampling callback fired, and then returns the active callbacks which is simply a SmallVector of pointer pairs and a couple POD values (scope, needs inputs/outputs/ids). The biggest cost according to perf is the SmallVector logic; we could consider adopting a hard limit on active callbacks; more than half a dozen callbacks running in a single step would be quite a lot. But the total cost relative to `PYTORCH_DISABLE_PER_OP_PROFILING` is only ~10ns, so debatable if it's worth it to switch to `std::array`. The primary change is in `record_function.cpp`, which has a more detailed description of the new cache structure. `record_function.h` has some minor changes to align with the new calling convention and the remaining files are simply changes to the call sites. Future work: * RecordFunction no longer needs to be lazily initialized. * We can deprecate the disable/reenable APIs, since we can now safely add and remove global callbacks. Test Plan: I tested eager mode performance using the overhead benchmark and found that the non-profiled path was unaffected. However the no-op observer dropped from 0.41us to 0.37us (0.25us if no observers are active) which is about 1/3rd reduction in the cost of the callback selection machinery. I also added several C++ unit tests, as the core RecordFunction machinery (especially sampling) was largely untested. Reviewed By: swolchok, davidberard98 Differential Revision: D35276158 fbshipit-source-id: 35135f444724fba4eb97c0ae7f3f710f0f9016fd (cherry picked from commit 9e359b87422c18f2a195185f32e7e85c82f956fd)	2022-04-19 20:46:16 +00:00
Scott Wolchok	22c6dafd33	[PyTorch] Use plain old function pointer for RecordFunctionCallback (reapply) (#49408) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408 Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback. ghstack-source-id: 118665808 Test Plan: Wait for GitHub CI since we had C++14-specific issues with this one in previous PR https://github.com/pytorch/pytorch/pull/48629 Reviewed By: malfet Differential Revision: D25563207 fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d	2020-12-15 19:16:01 -08:00
Mike Ruberry	25bc906281	Revert D25135415: [PyTorch] Use plain old function pointer for RecordFunctionCallback Test Plan: revert-hammer Differential Revision: D25135415 (`7e23ee1598`) Original commit changeset: 5e92dc79da64 fbshipit-source-id: 45b1634a100084c84dca158a1f16ca760fef6988	2020-12-14 21:04:27 -08:00
Scott Wolchok	7e23ee1598	[PyTorch] Use plain old function pointer for RecordFunctionCallback (#48629) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629 Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback. ghstack-source-id: 118568240 Test Plan: CI Reviewed By: dhruvbird Differential Revision: D25135415 fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3	2020-12-14 20:08:16 -08:00
Scott Wolchok	900aa4ee97	[PyTorch] remove convenience RecordFunctionCallback interface (#48620) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620 In preparation for storing bare function pointer (8 bytes) instead of std::function (32 bytes). ghstack-source-id: 118568242 Test Plan: CI Reviewed By: ezyang Differential Revision: D25132183 fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da	2020-12-14 20:03:15 -08:00
Ilia Cherniavskii	db5e5b439c	Extra sampling of record function events [resend] (#49114) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49114 resend of https://github.com/pytorch/pytorch/pull/48289 Test Plan: see 48289 Reviewed By: robieta Differential Revision: D25443365 Pulled By: ilia-cher fbshipit-source-id: c15ac312222bb4d744e10199ed79801cccae8227	2020-12-11 12:53:37 -08:00
Mike Ruberry	9f7fb54693	Revert D25111515: Extra sampling of record function events Test Plan: revert-hammer Differential Revision: D25111515 (`09b974c2d5`) Original commit changeset: 0d572a3636fe fbshipit-source-id: d558d8052924d937d86db7dd40dc6388e6d28823	2020-12-09 08:37:17 -08:00
Ilia Cherniavskii	09b974c2d5	Extra sampling of record function events (#48289) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48289 Adding extra sampling step when dispatching RecordFunction. (Note: this ignores all push blocking failures!) Reviewed By: swolchok Differential Revision: D25111515 Pulled By: ilia-cher fbshipit-source-id: 0d572a3636fe649a47ec47901826bbfc08368937	2020-12-09 02:29:13 -08:00
Ilia Cherniavskii	35596d39e9	Coalesce TLS accesses in RecordFunction constructor (#44970) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44970 Right now, when RecordFunction is not active (usual case), we do two TLS accesses (check for thread local callbacks, and check for thread local boolean). Experimenting with reducing number of TLS accesses in RecordFunction constructor. Test Plan: record_function_benchmark Reviewed By: dzhulgakov Differential Revision: D23791165 Pulled By: ilia-cher fbshipit-source-id: 6137ce4bface46f540ece325df9864fdde50e0a4	2020-09-28 21:42:23 -07:00
Ilia Cherniavskii	8e0714a60d	[rfc] Reduce number of coin flips in RecordFunction (#40758 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758 Currently we flip a coin for each sampled callback each time we run RecordFunction, this PR is an attempt to skip most of the coin flips (for the low-probability observers) and keep the distribution close to the original one Test Plan: CI and record_function_benchmark ``` (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 30108 us. Time per iteration (1x1): 1496.78 us. Time per iteration (16x16): 2142.46 us. Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 19051 us. Time per iteration (1x1): 1581.89 us. Time per iteration (16x16): 2195.67 us. Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18715 us. Time per iteration (1x1): 1566.11 us. Time per iteration (16x16): 2131.17 us. Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18814 us. Time per iteration (1x1): 1536.2 us. Time per iteration (16x16): 1985.82 us. Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18278 us. Time per iteration (1x1): 1526.32 us. Time per iteration (16x16): 2093.77 us. 
Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013 (python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark Warmup time: 18545 us. Time per iteration (1x1): 1524.65 us. Time per iteration (16x16): 2080 us. Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048 ``` Reviewed By: dzhulgakov Differential Revision: D22320879 Pulled By: ilia-cher fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc	2020-06-30 17:23:00 -07:00
generatedunixname89002005287564	42f0ea49ca	[Codemod][GleanFbcode] Remove dead includes in caffe2/binaries Reviewed By: ilia-cher Differential Revision: D21949969 fbshipit-source-id: 80336f82e9507dd001d079644cba5012bc5c8eed	2020-06-15 12:16:52 -07:00
Ilia Cherniavskii	2d708cefcc	Move RecordFunction into ATen (#37548) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37548 Moving RecordFunction from torch::autograd::profiler into at namespace Test Plan: CI Imported from OSS Differential Revision: D21315852 fbshipit-source-id: 4a4dbabf116c162f9aef0da8606590ec3f3847aa	2020-05-07 14:52:39 -07:00
Ilia Cherniavskii	c24c5f9684	Make RecordFunction callbacks thread local and modernize interface (#37491) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37491 This PR modernizes RecordFunction API and adds thread local callbacks in addition to the global ones Changes: - support for TLS callbacks, this is going to be the foundation of profiler and other tools - modernize interface around simple set of functions (add|remove|has|clear)(Global|ThreadLocal)(Callback) and adding RecordFunctionCallback to easily construct callbacks to be passed - we also add `.setShouldRun` into the callback interface to support cases when simple uniform sampling is not enough - to properly support add/remove introduce the idea of callback handle returned by add - internal implementation still uses SmallVector to store intermediate state (as before) - in this case these are vector of handles of callbacks that were picked to run - to speed up runtime we keep these vectors sorted, this way we can quickly enumerate callbacks that need to be run - added tests for new functionality Test Plan: BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install ./build/bin/test_jit CI record_function_benchmark: https://gist.github.com/ilia-cher/f1e094dae47fe23e55e7672ac4dcda2f Imported from OSS Differential Revision: D21300448 fbshipit-source-id: 6d55c26dbf20b33d35c3f1604dcc07bb063c8c43	2020-05-07 14:51:02 -07:00
Ilia Cherniavskii	800d5617c0	Recording of TorchScript functions (#34710) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34710 Extending RecordFunction API to support new recording scopes (such as TorchScript functions), as well as giving more flexibility to set sampling rate. Test Plan: unit test (test_misc.cpp/testRecordFunction) Reviewed By: gdankel, dzhulgakov Differential Revision: D20158523 fbshipit-source-id: a9e0819d21cc06f4952d92d43246587c36137582	2020-03-31 00:33:23 -07:00

15 Commits
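
Illustrative note (not part of the commit history): commits 8e0714a60d and a5e338a826 describe replacing a per-call coin flip with a counter that is decremented until a sampled callback fires. The sketch below shows one way such a countdown could work, assuming the gap between firings is drawn from a geometric distribution so that the firing rate matches a per-call coin flip with the same probability. `CountdownSampler` and `countFires` are hypothetical names chosen for this example; this is not the code in `record_function.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper, NOT PyTorch's implementation: decides whether a
// sampled callback fires on each call without drawing a random number
// on every call.
class CountdownSampler {
 public:
  // probability must lie strictly between 0 and 1.
  CountdownSampler(double probability, std::uint64_t seed)
      : rng_(seed), dist_(probability), remaining_(dist_(rng_)) {}

  // Fast path: a single decrement. The random number generator is only
  // touched on the (rare) calls where the callback actually fires.
  bool shouldFire() {
    if (remaining_ > 0) {
      --remaining_;
      return false;
    }
    // std::geometric_distribution yields the number of failures before
    // the next success, i.e. how many calls to skip before firing again.
    remaining_ = dist_(rng_);
    return true;
  }

 private:
  std::mt19937_64 rng_;
  std::geometric_distribution<std::uint64_t> dist_;
  std::uint64_t remaining_;
};

// Counts how many of `calls` invocations fire for a given probability.
std::uint64_t countFires(double probability, std::uint64_t calls,
                         std::uint64_t seed) {
  CountdownSampler sampler(probability, seed);
  std::uint64_t fires = 0;
  for (std::uint64_t i = 0; i < calls; ++i) {
    if (sampler.shouldFire()) {
      ++fires;
    }
  }
  return fires;
}
```

With a probability of 0.01, `countFires(0.01, 1000000, seed)` should land near 10,000 firings, while the random number generator is consulted only about once per firing instead of once per call. The exact count depends on the seed and on the standard library's distribution implementation.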