Today, we always create and record an event in two places:

1) Upon seeing the first producer, we record an event on the producer, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event; (2) prior to doing accumulation, the accumulation stream waits for this event.

2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.

We do not actually need to record the event in the cases where the first producer stream is the same as the consumer stream and the accumulation stream, and where the accumulation stream is the same as the consumer stream.

Removing this unnecessary create + record event should save a few microseconds for each instance avoided.

Fixes https://github.com/pytorch/pytorch/issues/157407

----

Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
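One way to read the two conditions in the commit description is as a small decision function over stream identities. The sketch below is only an illustration of that reading: `StreamId`, `EventPlan`, and `plan_events` are hypothetical names, streams are modeled as plain integers, and none of this is the engine's actual code.

```cpp
#include <cassert>

// Hypothetical stand-in for a device stream; only identity matters here.
using StreamId = int;

// Which of the two events described above would still be recorded.
struct EventPlan {
  bool record_on_producer;      // event 1: recorded on the first producer
  bool record_on_accumulation;  // event 2: recorded after accumulation
};

// Assumed reading of the commit description, not the real implementation.
inline EventPlan plan_events(StreamId first_producer,
                             StreamId accumulation,
                             StreamId consumer) {
  EventPlan plan{};
  // Event 1 is waited on by the consumer stream and the accumulation stream.
  // If both are the producer stream itself, nobody on another stream waits.
  plan.record_on_producer =
      !(first_producer == consumer && first_producer == accumulation);
  // Event 2 is waited on only by the consumer stream.
  plan.record_on_accumulation = (accumulation != consumer);
  return plan;
}

// Usage: when everything runs on one stream, no event is needed at all.
inline void example_single_stream() {
  const EventPlan plan = plan_events(0, 0, 0);
  assert(!plan.record_on_producer && !plan.record_on_accumulation);
}
```

Under this reading, each event is skipped independently, so a mixed case (for example, accumulation and consumer sharing a stream while the producer runs elsewhere) still records the first event but not the second.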
Autograd
Autograd is a hotspot for PyTorch performance, so most of the heavy lifting is implemented in C++. This implies that we have to do some shuffling between Python and C++, and in general we want data to be in a form that is convenient to manipulate from C++.
Our general model is that for any key data type that autograd manipulates, there are two implementations: a C++ type and a Python object type. For example, consider variables in autograd: we have both `Variable` in `variable.h` (the C++ type) and `THPVariable` in `python_variable.h` (the Python type). (By the way, THP stands for TorcH Python, not to be confused with THPP, TorcH C++.) `Variable` contains the payload of a variable, while `THPVariable` just contains a `shared_ptr` reference to `Variable`, as well as references to other Python objects which the Python runtime needs to know about. A lot of data accessor implementations in `python_variable.cpp` simply reach through to the underlying `Variable` and return the appropriate value.
The most complicated application of this principle is Function, which also supports users implementing custom behavior in Python. We have the following classes:
- `Node` in `function.h`, the C++ type.
- `THPFunction` in `python_function.h`, the Python object type. In `python_function.cpp`, you can see the boilerplate that tells the Python interpreter about this object.
- `PyNode` in `python_function.h`, a subclass of `Node` which forwards `apply` to a Python `THPFunction`. (NOT a Python object, despite its name!)
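The relationship between the three classes can be sketched as follows. This is a hedged illustration, not the code in `function.h` or `python_function.h`: `NodeSketch`, `PyFunctionSketch`, and `PyNodeSketch` are hypothetical names, a `std::function` stands in for user-defined Python behavior, and the `apply` signature is an assumption chosen for brevity.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Values = std::vector<double>;

// Hypothetical stand-in for the C++ base type.
struct NodeSketch {
  virtual ~NodeSketch() = default;
  virtual Values apply(const Values& inputs) = 0;
};

// Hypothetical stand-in for the Python object type. The std::function
// plays the role of behavior that a user would implement in Python.
struct PyFunctionSketch {
  std::function<Values(const Values&)> backward;
};

// An ordinary C++ subclass whose only job is to forward apply to the
// Python-side object. It is itself not a Python object.
struct PyNodeSketch : NodeSketch {
  explicit PyNodeSketch(PyFunctionSketch* obj) : obj_(obj) {
    assert(obj_ != nullptr);
  }

  Values apply(const Values& inputs) override {
    return obj_->backward(inputs);
  }

 private:
  PyFunctionSketch* obj_;  // non-owning in this sketch; real ownership differs
};
```

Code that only knows about the base type can call `apply` through a `NodeSketch&` and never needs to know whether the behavior behind it is native or user-supplied.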
Outside of `PyNode`, the C++ objects largely avoid referencing Python objects. There are a few exceptions: `pyobj` in `Variable`; `PyNode`, whose whole point is to let C++ call into Python; and `pyobj` in `Node`, which ensures uniqueness of the associated Python wrapper (if it exists).
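To illustrate why a back-pointer helps with wrapper uniqueness, here is a minimal sketch under simplifying assumptions. `NodeWithPyobj`, `WrapperSketch`, `wrap_node`, and `release_wrapper` are hypothetical names, and raw `new`/`delete` stands in for Python reference counting, which the real code has to handle with far more care.

```cpp
#include <cassert>
#include <memory>

struct WrapperSketch;  // hypothetical stand-in for the Python wrapper

// Hypothetical stand-in for the C++ object. pyobj is a non-owning
// back-pointer to the wrapper, or nullptr if no wrapper exists.
struct NodeWithPyobj {
  WrapperSketch* pyobj = nullptr;
};

struct WrapperSketch {
  std::shared_ptr<NodeWithPyobj> cdata;  // keeps the C++ object alive
};

// Return the wrapper for a node, creating one only if none exists yet,
// so repeated calls hand back the same wrapper.
inline WrapperSketch* wrap_node(const std::shared_ptr<NodeWithPyobj>& node) {
  assert(node != nullptr);
  if (node->pyobj == nullptr) {
    node->pyobj = new WrapperSketch{node};
  }
  return node->pyobj;
}

// Destroy a wrapper and clear the back-pointer so that it never dangles.
inline void release_wrapper(WrapperSketch* wrapper) {
  assert(wrapper != nullptr);
  if (wrapper->cdata) {
    wrapper->cdata->pyobj = nullptr;
  }
  delete wrapper;
}
```

With this arrangement, identity comparisons on the wrapper side stay meaningful: two lookups of the same C++ object yield the same wrapper rather than two distinct ones.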