pytorch/torch/csrc/autograd
soulitzer 8bda95228f [autograd] Avoid creating and recording event when unnecessary (#157503)
Today, we always create and record an event in two places:
1) Upon seeing the first producer, we record an event on the producer stream and wait for it in two places: (a) when the engine goes to run the consumer, the consumer stream waits for this event; (b) prior to doing accumulation, the accumulation stream waits for this event.

2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.

We do not actually need to record these events when the streams involved are the same: the first event is unnecessary when the first producer stream is the same as both the consumer stream and the accumulation stream, and the second is unnecessary when the accumulation stream is the same as the consumer stream.

Removing this unnecessary event creation and recording should save a few µs for each instance avoided.
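
As a rough illustration of the two conditions described above, the decision could be sketched as follows. This is not the engine's code: `StreamId`, `EventPlan`, and `plan_events` are hypothetical names introduced here, and streams are reduced to plain integers.

```cpp
// Hypothetical stand-in for a device stream; the real engine uses stream
// objects, not integers.
using StreamId = int;

// Which of the two events described above still need to be created and
// recorded. Both names are illustrative only.
struct EventPlan {
  bool record_on_producer;         // event 1: recorded on the first producer stream
  bool record_after_accumulation;  // event 2: recorded on the accumulation stream
};

// Decide which events are needed for one (producer, accumulation, consumer)
// stream triple.
inline EventPlan plan_events(StreamId first_producer,
                             StreamId accumulation,
                             StreamId consumer) {
  EventPlan plan{};
  // Event 1 is waited on by the consumer stream and by the accumulation
  // stream, so it is only redundant when both equal the producer stream.
  plan.record_on_producer =
      !(first_producer == consumer && first_producer == accumulation);
  // Event 2 is waited on only by the consumer stream, so it is redundant
  // when accumulation already happens on the consumer stream.
  plan.record_after_accumulation = (accumulation != consumer);
  return plan;
}
```

When all three streams are the same, neither event is recorded in this sketch, which is the case the change targets.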

Fixes https://github.com/pytorch/pytorch/issues/157407

----

Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
2025-07-09 03:36:14 +00:00

Autograd

Autograd is a hotspot for PyTorch performance, so most of the heavy lifting is implemented in C++. This means we have to do some shuffling between Python and C++, and in general we want data to be in a form that is convenient to manipulate from C++.

Our general model is that for any key data type that autograd manipulates, there are two implementations: a C++ type and a Python object type. For example, consider variables in autograd: we have both Variable in variable.h (the C++ type) and THPVariable in python_variable.h (the Python type). (By the way, THP stands for TorcH Python, not to be confused with THPP, TorcH C++.) Variable contains the payload of a variable, while THPVariable just contains a shared_ptr reference to Variable, as well as references to other Python objects which the Python runtime needs to know about. Many of the data accessor implementations in python_variable.cpp simply reach through to the underlying Variable and return the appropriate value.
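
The split between a payload type and a thin wrapper can be sketched in plain C++ without the Python C API. This is only an illustration of the pattern: `VariableSketch`, `PyWrapperSketch`, and `wrapper_requires_grad` are hypothetical names, and the real THPVariable also carries a Python object header and other fields not shown here.

```cpp
#include <memory>
#include <string>

// Stand-in for the C++ type that owns the actual payload.
struct VariableSketch {
  std::string name;
  bool requires_grad = false;
};

// Stand-in for the Python-side object: it holds no payload of its own,
// only a shared reference to the C++ object.
struct PyWrapperSketch {
  std::shared_ptr<VariableSketch> cdata;
};

// An accessor on the wrapper simply reaches through to the underlying
// C++ object and returns the value found there.
inline bool wrapper_requires_grad(const PyWrapperSketch& self) {
  return self.cdata->requires_grad;
}
```

Because the wrapper only holds a reference, several wrappers pointing at the same object observe the same state, and the payload stays alive for as long as any of them does.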

The most complicated application of this principle is Function, which also supports users implementing custom behavior in Python. We have the following classes:

  • Node in function.h, the C++ type.
  • THPFunction in python_function.h, the Python object type. In python_function.cpp, you can see the boilerplate that tells the Python interpreter about this object.
  • PyNode in python_function.h, a subclass of Node which forwards apply to a Python THPFunction. (NOT a Python object, despite its name!)

Outside of PyNode, the C++ objects largely avoid referencing Python objects. There are a few exceptions: pyobj in Variable; PyNode, whose whole point is to let C++ call into Python; and pyobj in Node, which ensures uniqueness of the associated Python wrapper (if it exists).
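
The relationship between Node and PyNode can be sketched as an ordinary virtual-dispatch pattern. The names below (`NodeSketch`, `NegateNode`, `ForwardingNode`, `Values`) are hypothetical, and a `std::function` stands in for the Python object that the real PyNode would call into; the actual classes have a much richer interface.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Simplified stand-in for the list of values a node receives and returns.
using Values = std::vector<double>;

// Stand-in for the C++ base type: the engine only ever calls apply().
struct NodeSketch {
  virtual ~NodeSketch() = default;
  virtual Values apply(Values inputs) = 0;
};

// A node implemented entirely in C++.
struct NegateNode : NodeSketch {
  Values apply(Values inputs) override {
    for (double& v : inputs) {
      v = -v;
    }
    return inputs;
  }
};

// A C++ subclass whose apply() forwards to user-supplied behavior. Here a
// std::function plays the role of the Python-side function object.
struct ForwardingNode : NodeSketch {
  explicit ForwardingNode(std::function<Values(Values)> callback)
      : callback_(std::move(callback)) {}

  Values apply(Values inputs) override {
    return callback_(std::move(inputs));
  }

 private:
  std::function<Values(Values)> callback_;
};
```

Code that holds a `NodeSketch&` does not need to know whether the behavior behind it is native or forwarded, which is the property the forwarding subclass exists to provide.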