# Profiler Overview
This README describes the details of how the profiler is implemented.
The profiler instruments PyTorch to collect information about the model's execution. Its main features are:
- Instrumenting op calls on the CPU side
- Interfacing with Kineto to collect information from the GPU (or other accelerators)
- Collecting python stack traces
- Exporting this information, e.g. as a Chrome trace, or for processing by downstream tools like HTA
## Table of Contents

- Codebase Structure
- RecordFunction
- Autograd Integration
- Torch Operation Collection
- Allocation Event Collection
- Kineto Integration
- Python Tracing
- Clock Alignment
## Codebase Structure

This section highlights the directories and files that are most relevant to the profiler. Less relevant files, directories, and modules are omitted.
```
torch/
│
├── profiler/                        # Main package containing the core frontend logic
│   ├── __init__.py                  # Initialization file for profiler package
│   ├── profiler.py                  # Main profiler frontend class
│   └── _utils.py                    # FunctionEvent utils
│
├── autograd/                        # Autograd package
│   ├── __init__.py                  # Initialization file for autograd package
│   ├── profiler.py                  # Main profiler backend class
│   └── profiler_utils.py            # FunctionEvent utils
│
└── csrc/                            # C and C++ source code
    ├── profiler/                    # Profiler C++ source code
    │   ├── collection.cpp           # Main collection logic
    │   ├── collection.h             # Collection definitions
    │   ├── kineto_client_interface.cpp  # Interface to call profiler from Kineto (on-demand only)
    │   ├── kineto_client_interface.h    # Client interface definitions
    │   ├── kineto_shim.cpp          # Shim to call Kineto from the profiler
    │   ├── kineto_shim.h            # Shim definitions
    │   ├── util.cpp                 # Utils for handling args in profiler events
    │   ├── util.h                   # Util definitions
    │   └── README.md                # This file
    └── autograd/                    # Autograd C++ source code
        ├── profiler_python.cpp      # Main Python stack collection logic
        ├── profiler_python.h        # Python stack collection definitions
        ├── profiler_kineto.cpp      # Profiler backend logic for starting collection/Kineto
        └── profiler_kineto.h        # Profiler backend definitions for starting collection/Kineto

aten/
└── src/
    └── ATen/                        # ATen C++ source code
        ├── record_function.cpp      # RecordFunction collection logic
        └── record_function.h        # RecordFunction definitions

LICENSE                              # License information
```
## RecordFunction

`RecordFunction` (`aten/src/ATen/record_function.h`) is used by the profiler to instrument CPU-side events.

`RecordFunction` is a general mechanism for instrumenting function calls in PyTorch. It can also be used for other applications, e.g. see Features for Large-Scale Deployments. PyTorch already includes it at some important locations; notably, in the dispatcher, surrounding every op.

Users (or PyTorch itself) can register callbacks that are executed whenever a `RecordFunction` guard is encountered. The profiler uses this mechanism to record the start and end times of each op call, as well as user-provided `RecordFunction` annotations. The `RecordFunction` machinery is designed to have relatively low overhead, especially when no callbacks are registered; nevertheless, there can still be some overhead.

There is also a Python binding for `RecordFunction` (`torch.profiler.record_function`); users often use it to annotate module-level events, as in the sketch below.
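The following is a minimal sketch using the public `torch.profiler` API: the label passed to `record_function` appears as a user annotation surrounding the ops executed inside the block.

```python
import torch
from torch.profiler import profile, record_function

model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)

with profile() as prof:
    # The label below shows up as a user annotation in the trace,
    # wrapping every op dispatched inside the block.
    with record_function("my_forward_block"):
        y = model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```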
## Autograd Integration

The autograd engine is responsible for automatically computing gradients.

The profiler records two pieces of information from the autograd engine:

- Sequence number: a unique-per-thread index assigned to each op call(*) in the forward pass. When a backward op is triggered, it is assigned the sequence number of the forward op that caused it to be executed. Using this information, the profiler can match forward and backward ops; in Chrome traces, this feature can be enabled with the "fwd_bwd" flow events.
- Forward thread ID: autograd can be used in multi-threaded environments, and the forward thread ID records the thread on which the forward op was executed. This information is needed because the sequence number is only unique within a thread; the forward thread ID differentiates ops that share a sequence number.

(*) Note that only op invocations whose inputs require gradients are assigned a sequence number.
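For example, the forward/backward matching can be observed by exporting a trace that includes a backward pass (a minimal sketch; the flow arrows are rendered when the trace is opened in Perfetto or chrome://tracing):

```python
import torch
from torch.profiler import profile

model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8, requires_grad=True)

with profile() as prof:
    loss = model(x).sum()
    loss.backward()

# Forward and backward ops that share a sequence number are linked
# by "fwd_bwd" flow events in the exported trace.
prof.export_chrome_trace("fwd_bwd_trace.json")
```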
## Torch Operation Collection

This section describes the general flow for collecting torch operations during auto-trace (in-process, synchronous tracing). For details on on-demand tracing (out-of-process, asynchronous), please refer to the Libkineto README.

When a trace begins, the autograd/profiler backend calls into `profiler_kineto.cpp` to prepare, start, or stop collection. At the start of tracing, the `onFunctionEnter` and `onFunctionExit` callbacks defined in `profiler_kineto.cpp` are registered.

Callback registration can be either global or local, depending on the `ExperimentalConfig` used:

- Global: the callback is registered to all threads throughout execution.
- Local: the callback is registered only to threads present at the start of tracing.

Within `onFunctionEnter`, the profiler creates a `ThreadLocalSubqueue` instance for each thread, ensuring that each CPU operation is associated with the thread on which it was executed. When a torch operation is entered, the profiler calls `begin_op` (defined in `collection.cpp`) to record the necessary information. The `begin_op` routine is intentionally lightweight, as it is on the "hot path" during profiling; excessive overhead here would distort the profile and reduce its usefulness. Therefore, only minimal information is collected during the callback; most logic occurs during post-processing.
## Allocation Event Collection

Unlike torch operations, which have a start and a stop, allocation events are represented as `cpu_instant_event`s (zero duration). As a result, `RecordFunction` is bypassed for these events; instead, `emplace_allocation_event` is called directly to enqueue the event into the appropriate `ThreadLocalSubqueue`.
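These allocation events are collected when memory profiling is enabled. A minimal sketch using the public API:

```python
import torch
from torch.profiler import profile

with profile(profile_memory=True) as prof:
    x = torch.randn(1024, 1024)  # recorded as an allocation instant event
    del x                        # recorded as a free

print(prof.key_averages().table(
    sort_by="self_cpu_memory_usage", row_limit=5))
```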
## Kineto Integration
Kineto serves as an abstraction layer for collecting events across multiple architectures. It interacts with libraries such as CUPTI to receive GPU and accelerator events, which are then forwarded to the frontend profiler. Kineto requires time to "prepare" (also referred to as "warmup") these third-party modules to avoid distorting the profile with initialization routines. While this could theoretically be done at job startup, keeping a heavy library like CUPTI running unnecessarily introduces significant overhead.
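From the user's perspective, this preparation is driven by the profiler schedule: the warmup phase gives Kineto time to initialize libraries like CUPTI before the active phase starts recording. A minimal sketch:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(64, 64)
x = torch.randn(16, 64)

with profile(
    activities=[ProfilerActivity.CPU],
    # 1 idle step, 1 warmup step (Kineto prepares), 3 recorded steps.
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=lambda p: p.export_chrome_trace("kineto_trace.json"),
) as prof:
    for _ in range(5):
        model(x)
        prof.step()  # advances the schedule
```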
As previously mentioned, `profiler_kineto.cpp` is used in the backend to invoke the appropriate profiler stage. It also calls into `kineto_shim.cpp`, which triggers the corresponding routines in Kineto. Once a trace is complete, all events collected by Kineto are forwarded to the profiler for two main reasons:

- To coalesce all data and complete any post-processing between profiler and Kineto events.
- To forward these events to the Python frontend as `FunctionEvent`s.

The final step in integration is file export. After all events have been collected and post-processed, they can be exported to a JSON file for visualization in Perfetto or the Chrome trace viewer. This is done by calling Kineto's `ActivityTraceInterface::save`, which writes all event information to disk.
## Python Tracing

When `with_stack=True` is set in the profiler, the Python stack tracer is created using the `make` function defined in `PythonTracerBase`. The implementation resides in `profiler_python.cpp`.

To profile the stack, `PyEval_SetProfile` is used to trace and handle various execution events within a Python program. This enables comprehensive profiling by monitoring and responding to specific cases:
- Python function calls (`PyTrace_CALL`): the `recordPyCall` method logs each Python function call, capturing essential details for later analysis.
- C function calls (`PyTrace_C_CALL`): the `recordCCall` method documents calls to C functions, including relevant arguments, providing a complete view of the program's execution flow.
- Python function returns (`PyTrace_RETURN`): exit times of Python functions are recorded, enabling precise measurement of function execution durations.
- C function returns and exceptions (`PyTrace_C_RETURN` and `PyTrace_C_EXCEPTION`): exit times for C functions are tracked, whether they conclude normally or due to an exception, ensuring all execution paths are accounted for.

This setup allows for detailed and accurate data collection on both Python and C function executions, facilitating thorough post-processing and analysis. After profiling, the accumulated event stacks are processed to match entrances and exits, constructing complete events for further analysis by the profiler.

Note: for Python 3.12.0–3.12.4, a bug in CPython requires the use of `sys.monitoring` as a workaround.
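A minimal sketch of enabling the Python stack tracer through the public API:

```python
import torch
from torch.profiler import profile

def inner():
    return torch.randn(128, 128).sum()

def outer():
    return inner()

with profile(with_stack=True) as prof:
    outer()

# Group events by the top 5 frames of their Python call stack.
print(prof.key_averages(group_by_stack_n=5).table(
    sort_by="cpu_time_total", row_limit=5))
```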
## Clock Alignment

Depending on the system environment, the profiler uses the most efficient available clock when creating timestamps. The default on most Linux systems is TSC, which records time as a count of CPU cycles. To convert from this count to Unix time in nanoseconds, the profiler creates a clock converter. If Kineto is included in the profiler, this converter is also passed to Kineto to ensure that timestamps from both sources are aligned.
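The idea behind the converter can be illustrated with a minimal sketch. This is purely conceptual, not the profiler's actual implementation: it uses `time.perf_counter_ns` as a stand-in for a TSC-style raw counter, samples both clocks at two points, and builds a linear mapping from counter values to Unix nanoseconds.

```python
import time

def make_clock_converter(calibration_sleep_s: float = 0.01):
    """Build a linear mapping from a raw monotonic counter to Unix ns.

    perf_counter_ns stands in for a TSC-style cycle counter here; the
    real profiler calibrates against the actual TSC.
    """
    raw0, unix0 = time.perf_counter_ns(), time.time_ns()
    time.sleep(calibration_sleep_s)
    raw1, unix1 = time.perf_counter_ns(), time.time_ns()

    # Nanoseconds of Unix time per unit of the raw counter.
    scale = (unix1 - unix0) / (raw1 - raw0)

    def to_unix_ns(raw: int) -> int:
        return int(unix0 + (raw - raw0) * scale)

    return to_unix_ns

to_unix_ns = make_clock_converter()
print(to_unix_ns(time.perf_counter_ns()))  # approximately time.time_ns()
```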