Commit Graph

30 Commits

Author SHA1 Message Date
3916d7a575 Apply modernize-use-emplace to aten, c10, torch (#91077)
Apply clang-tidy check modernize-use-emplace. This is slightly more efficient by using an inplace constructor and is the recommended style in parts of the codebase covered by clang-tidy. This just manually applies the check to rest of the codebase. Pinging @ezyang as this is related to my other PRs he reviewed like #89000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91077
Approved by: https://github.com/ezyang
2022-12-19 07:49:56 +00:00
6e6f929b2c [Profiler] Restructure inputs and capture TensorLists. (#87825)
This PR unifies and rationalizes some of the input representation in Result. The current approach of storing separate types in separate vectors is tedious for two types (Tensors and scalars), but would be even more annoying with the addition of TensorLists. A similar disconnection exists with sizes and strides which the user is also expected to zip with tensor_metadata.

I simplified things by moving inputs to a variant and moving sizes and strides into TensorMetadata. This also forced collection of sizes and strides in python tracer which helps to bring it in line with op profiling. Collection of TensorLists is fairly straightforward; `InputOutputEncoder` already has a spot for them (I actually collected them in the original TorchTidy prototype) so it was just a matter of plumbing things through.

Differential Revision: [D40734451](https://our.internmc.facebook.com/intern/diff/D40734451/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87825
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
2022-11-08 21:48:43 +00:00
b16b5fb802 [Profiler] Hold weak reference to prevent TensorImpl address reuse during profiling. (#87244)
A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87244
Approved by: https://github.com/slgong-fb, https://github.com/albanD
2022-10-27 06:38:11 +00:00
b0e10292fa [Profiler] Tensor IDs for Module and Optimizer variables (#86754)
More sophisticated profiling will increasingly rely on python tracer to contextualize observed results. This PR adds Tensors which are observed by the python tracer to the identity assignment loop.

Differential Revision: [D39852885](https://our.internmc.facebook.com/intern/diff/D39852885/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86754
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
2022-10-23 19:23:42 +00:00
be2d647ea6 [Profiler] Use parameter as key for optimizer state recording. (#86753)
While optimizer can store state however it likes, in practice most optimizer state corresponds to a particular parameter. (This is the case for all `torch.optim` optimizers.) Thus, it turns out to be ergonomic to collect using that structure. Note that this doesn't lock us into anything; we can always collect state with non Tensor keys if the use case arises.

One simplification that arises is that Module and Optimizer collection has very similar structure. So similar, in fact, that it is possible to use a common template for config. I also found that a lot of the `check_and_store` logic could be simplified and inlined by this joining of collected optimizer state.

Differential Revision: [D40210703](https://our.internmc.facebook.com/intern/diff/D40210703/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86753
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
2022-10-23 19:23:39 +00:00
dbea07b6aa [Profiler] record gradient from nnModule (#86355)
Summary:
- catch .grad tensor info
- update data type and `check_and_store`, etc
- update unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler

Reviewed By: chaekit

Differential Revision: D39711295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86355
Approved by: https://github.com/chaekit
2022-10-07 09:58:50 +00:00
a117fde86f [Profiler] Apply TensorMetadata for Optimizer and nnModule (#86047)
Summary: - Use `TensorMetadat` struct in saving tensor info from Optimizer and nnModule.

Test Plan: buck run mode/opt //caffe2/test:profiler

Reviewed By: chaekit

Differential Revision: D39682205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86047
Approved by: https://github.com/chaekit, https://github.com/robieta
2022-10-06 06:18:56 +00:00
3cfc61b846 [Profiler][trivial] Optimizer states (part 4 of Record Optimizer) (#85840)
Summary: - add states into OptInfo and update unit testcase

Test Plan: buck run mode/opt //caffe2/test:profiler

Reviewed By: chaekit

Differential Revision: D39406540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85840
Approved by: https://github.com/robieta
2022-09-29 07:28:33 +00:00
7628603aee [Profiler] bug fix: python object reference counting (#85847)
Summary:
Wrong reference counting of Python Objects has made intermittent and corner-case-only segfault.
- before : increment once decrement in a loop.
- after: increment and decrement in different but consistent loops.

Test Plan: buck run mode/opt //caffe2/test:profiler

Differential Revision: D39902973

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85847
Approved by: https://github.com/robieta, https://github.com/aaronenyeshi
2022-09-29 03:58:34 +00:00
d776693701 [Profiler] Optimizer param_groups (part 3 of Record Optimizer) (#85784)
Summary:
- use TensorMetadata struct
- check_and_store util as overloading
- param_groups
- clean up unit test cases

Test Plan: buck run mode/opt //caffe2/test:profiler

Reviewed By: chaekit

Differential Revision: D39406072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85784
Approved by: https://github.com/aaronenyeshi, https://github.com/robieta
2022-09-28 19:18:12 +00:00
f80ef73d1c [Profiler] tracking Optimizer (part 2 of Record Optimizer) (#84920)
Summary:
Part 2 of Record Optimizer param_groups and states (https://github.com/pytorch/pytorch/pull/84063)
- hooking from optimizer step
- PyOptCall Type
- declare data type for collection
- python binding
- simple unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler

Differential Revision: D39402667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84920
Approved by: https://github.com/robieta
2022-09-28 02:48:07 +00:00
dc865bff4e [Profiler] set_class util (part 1 of Record Optimizer) (#84779)
Summary:
Part 1 of Record Optimizer param_groups and states (https://github.com/pytorch/pytorch/pull/84063)
- nnModule and Optimizer have duplicated parts
- create a util function to avoid duplication

Test Plan: buck run mode/opt //caffe2/test:profiler

Differential Revision: D39397210

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84779
Approved by: https://github.com/robieta
2022-09-13 01:48:41 +00:00
daffff9986 [Profiler] Make RecordQueue manage the lifetime of PythonTracer. (#83964)
`PythonTracer` holds a pointer to an owning `RecordQueue`, however that relationship is not enforced and it is possible to dangle that pointer if the ProfilerState owning the `RecordQueue` is destroyed without proper cleanup.

We currently use a singleton to enforce the requirement that only one python tracer is active at a time, however a better formulation is to simply enforce that with an atomic bool and manage object lifetime through composition. In this new architecture, `RecordQueue` explicitly holds a unique_ptr to the python tracer instance. That way if `~RecordQueue` is called it will call `~PythonTracer` which can then clean up any state. Overall it is just a simpler ownership model, and less prone to unexpected failures.

Differential Revision: [D38955616](https://our.internmc.facebook.com/intern/diff/D38955616/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83964
Approved by: https://github.com/slgong-fb
2022-09-09 19:04:08 +00:00
328538700a [Profiler][Trivial] Move PythonTracerBase to torch/csrc/profiler/orchestration (#83895)
The ownership model between `RecordQueue` and `PythonTracer` is brittle; if a profiler is popped without proper shutdown it can dangle a reference in `PythonTracer` which will segfault when dereferenced. The next PR will address this; to start we simply move the code into `torch/csrc/profiler/orchestration` to limit the sloc delta when making actual changes.

Differential Revision: [D38933962](https://our.internmc.facebook.com/intern/diff/D38933962/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38933962/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83895
Approved by: https://github.com/slgong-fb
2022-09-09 19:04:08 +00:00
fa241fd50e [Profiler] record nn.Module's parameters (#83209)
Summary:
Record nn.Module's parameters for detaild memory profiling:
- extend 'module_' in value cache  & NNModuleInfo to save parameters
- python binding and unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule

Differential Revision: D38379717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209
Approved by: https://github.com/robieta
2022-08-24 08:17:20 +00:00
09e837634b [Profiler][Minor] Set end time on python events when profiling stops. (#83621)
We don't have an end event for calls that are ongoing when profiling stops. (e.g. main) This cropped up when I was adding checks for negative durations.

I also refactored `populate` to use a pop method. This not only allows me to implement this fix, but should also provide a convenient entry point for https://github.com/pytorch/pytorch/pull/82154

Differential Revision: [D38426342](https://our.internmc.facebook.com/intern/diff/D38426342/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83621
Approved by: https://github.com/slgong-fb
2022-08-21 00:22:11 +00:00
7edd947178 [Profiler][Python tracer] Add ephemeral inputs to the value cache. (#81958)
There are a couple of bugs in the python tracer related to how we cache values. The first is that `ValueCache::store<CallType::PyModuleCall>` wrongly assumes that it will only be called from the profiling callback and calls `PyEval_GetFrame`, effectively violating the encapsulation of the cache by accessing global state. Secondly, we use `arg` to cache bound C functions. This turns out not to be correct, and collisions are resulting in incorrect traces.

In both cases, we can solve the problem by introducing a concept of ephemeral data which is used to materialize a cached value, but is not part of the cache key. (And the author is responsible for making sure that is done correctly.)

Differential Revision: [D38062921](https://our.internmc.facebook.com/intern/diff/D38062921/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81958
Approved by: https://github.com/ngimel
2022-07-29 05:12:09 +00:00
4b7de26556 Fix C API to be compatible with latest 3.11 beta (#81242)
Based off https://github.com/pytorch/pytorch/pull/80511 with extra changes:
- Update pybind to the latest release as it contains some needed fixes
- Extend the compat header to do reduce changes in code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81242
Approved by: https://github.com/malfet, https://github.com/mattip
2022-07-27 08:37:10 +00:00
72de816f5c GIL acquire needed in ValueCache::trimPrefixes (#81061)
Summary: Dubugged a segfault issue in Ondemand python tracing. Committing as a a separate diff from D37410204.

Test Plan:
- run a python test case with the following command for on-demand flow:
echo -e "PYTHON_STACK_TRACE=true" > /tmp/scott_kineto.conf && dyno gputrace --gputrace_duration 300ms --gpuconf /tmp/scott_kineto.conf

Reviewed By: chaekit

Differential Revision: D37662988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81061
Approved by: https://github.com/albanD
2022-07-19 01:00:36 +00:00
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
9f2e2aa28b Revert "Revert "[Profiler] Move python tracing to unified event type (Part 2)""
This reverts commit 4305f8e9bda34f18eb7aacab51c63651cfc61802.

replace TEST_CUDA with torch.has_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79173

Approved by: https://github.com/ezyang
2022-06-09 19:45:02 +00:00
4305f8e9bd Revert "[Profiler] Move python tracing to unified event type (Part 2)"
This reverts commit c2a3c8186c3f3798684cecd60d62a991c223eeef.

Reverted https://github.com/pytorch/pytorch/pull/78164 on behalf of https://github.com/malfet due to Broke cuda-on-cpu tests, see c2a3c8186c
2022-06-08 02:21:16 +00:00
c2a3c8186c [Profiler] Move python tracing to unified event type (Part 2)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78164

This PR finishes moving over the python tracer to use the unified event type. Things that changed:

1) The hacky after-the-fact splicing of python events in profiler_kineto.cpp is gone and python events now simply fold into the rest. (Yay!!!) This is a major BE win.
2) Added `ExtraFields<EventType::PyCall>` and `ExtraFields<EventType::PyCCall>`
3) The enter events (time + TraceKey) are now handled by RecordQueue for performance.
4) Python tracing now uses TSC for lower overhead.

Simplifications in profiler_python WRT part 1:
1) Rather than ValueCache emitting an intermediate value_t that gets further converted, load methods can now directly emit ExtraFields<...>
2) The complicated replay in profiler_python.cpp is replaced with a much simpler (and safer) pass to just pair start and end times.
3) During post processing we can now use `CallTypeHelper::map` to automatically pull in all events instead of having to loop over each the entries for each type manually. This will make it simpler to add new types of Python event later.

Differential Revision: [D36515869](https://our.internmc.facebook.com/intern/diff/D36515869/)

Approved by: https://github.com/aaronenyeshi
2022-06-07 23:42:00 +00:00
a173613f6d [Profiler] Move python tracing to unified event type (Part 1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78163

The python function tracer is complicated and separate from the other profile types, so I've chosen to break the change into two diff. The first (this one) reworks the cache structure to make it amenable to integration (as well as some other nice tweaks) and the next one actually moves it over.

The old cache scheme worked very hard to pack all the information about an event into a small struct via bit packing, with a couple secondary caches for things like names. Because of the space constraints on that struct (and the fact that it had to represent all call and return types) there were a lot of subtle invariants swirling around that made it hard to offload anything to a different component. The new cache system is more modular and also, as it turns out, a bit faster. (Benchmarks in part 2)

There is a more detailed description of the cache hierarchy in the PR, but the gist is that I use various specializations to handle the different event types (python call, nn module, c function) and lean on the type system to keep everything safe and organized. (One nice thing about using unique IDs is that they also implicitly encode the event type. They implicitly encode everything!) Given that we are going to want to expand the semantics (e.g. torch ops, DataLoader, etc) this will give a nice way to capture richer semantics without significantly increasing the complexity of the profiler.

Differential Revision: [D36379147](https://our.internmc.facebook.com/intern/diff/D36379147/)

Approved by: https://github.com/aaronenyeshi
2022-06-07 23:42:00 +00:00
e0a071a47e [Profiler] Abstract interface for Python tracer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77699

The current machinery to connect libtorch to libtorch_python for profiling is... meh. Adequite for separate components that mostly just need to send a trigger, but not really clean. This PR makes an abstract interface class that the python tracer subclasses so the profiler can actually get at the tracer singleton, albeit through a restricted interface. This will help fold Python tracing into the new unified event structure.

Differential Revision: [D36325739](https://our.internmc.facebook.com/intern/diff/D36325739/)

Approved by: https://github.com/aaronenyeshi
2022-05-25 16:11:01 +00:00
7b8cf1f736 [pytorch][PR] [Profiler][Trivial] Format profiler_python.cpp
There are some unfortunate style issues, like four space indents and various other minor issues. There is a pretty big overhaul coming to the python tracer, so I want to be able to commit them with more style compliant code.

Differential Revision: [D36070201](https://our.internmc.facebook.com/intern/diff/D36070201/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77692

Approved by: https://github.com/aaronenyeshi
2022-05-18 03:52:19 +00:00
748790588c Upgrading the loop to use irange (#70326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70326

See D24145988 for context: it allows loops such as for(int i=0;i<10;i++) to be expressed as for(const auto i : c10::irange(10)). This is nice because it auto-types the loops and adds const-safety to the iteration variable.

Test Plan: buck run //caffe2/torch/fb/sparsenn:test

Reviewed By: r-barnes

Differential Revision: D33243400

fbshipit-source-id: b1f1b4163f4bf662031baea9e5268459b40c69a3
2022-01-06 07:06:53 -08:00
33e9a0b5f6 [Reland] Python tracer. (#68325)
Summary:
There were two issues with the original PR:
1) My assumption that bound C functions could be trusted to stay alive was not valid. I'm still not entirely sure what was dying, but I've just added a cache so that the first time I see a function I collect the repr just like I was already doing with Python functions.

2) `std::regex` is known to be badly broken and prone to segfaults. Because I'm just doing a very simple prefix prune it's fine to do it manually; see `trimPrefix`. Long term we should move all of PyTorch to `re2` as the internal lint suggests, but CMake is hard and I couldn't get it to work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68325

Reviewed By: chaekit

Differential Revision: D32432596

Pulled By: robieta

fbshipit-source-id: 06fb4bcdc6933a3e76f6021ca69dc77a467e4b2e
2021-11-15 23:32:49 -08:00
8bf150f21b Revert D32178667: [pytorch][PR] Python tracer for profiler
Test Plan: revert-hammer

Differential Revision:
D32178667 (33353fb828)

Original commit changeset: 118547104a7d

fbshipit-source-id: 47510607589fc39c730ba913f47c01a7d107b7b0
2021-11-12 14:53:52 -08:00
33353fb828 Python tracer for profiler (#67407)
Summary:
This PR instruments the CPython interpreter and integrates the resulting trace into the PyTorch profiler.

The python tracing logic works by enabling `PyEval_SetProfile`, and then logging the minimal information to track every time python calls or returns from a function. A great deal of care has gone into keeping this process very lightweight; the `RawEvent` struct is only two words and doesn't do anything fancy. When a python function is called, we have to do extra work. If the call is to `nn.Module.__call__`, we simply incref to extend the life of the module. Otherwise we check if we have seen the function before, and if not go through the (somewhat expensive) task of saving the strings which we then cache.

To actually get a useful timeline, we have to replay the events to determine the state of the python stack at any given point. A second round of stack replay is needed to figure out what the last python function was for each torch op so we can reconstruct the correct python stack. All of this is done during post processing, so while we want to be reasonably performant it is no longer imperative to shave every last bit.

I still need to do a bit of refinement (particularly where the tracer interfaces with the profiler), but this should give a good sense of the general structure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67407

Test Plan:
```
import torch

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.linear(x)
        return self.relu(x)

def call_module():
    m = MyModule()
    for _ in range(4):
        m(torch.ones((2, 2)))

def top_level_fn():
    with torch.profiler.profile(with_stack=True) as p:
        call_module()

    p.export_chrome_trace("test_trace.json")

top_level_fn()
```
<img width="1043" alt="Screen Shot 2021-10-27 at 6 43 18 PM" src="https://user-images.githubusercontent.com/13089297/139171803-f95e70f3-24aa-45e6-9d4b-6d437a3f108d.png">

PS: I've tried to comment liberally, particularly around some of the more magical parts. However I do plan on doing another linting and commenting pass. Hopefully it's not too bad right now.

Reviewed By: gdankel, chaekit

Differential Revision: D32178667

Pulled By: robieta

fbshipit-source-id: 118547104a7d887e830f17b94d3a29ee4f8c482f
2021-11-12 11:58:12 -08:00