Files
pytorch/torch/compiler
Aaron Orenstein e20736bf1d Dont't GC as often when collecting cudagraphs (#158193)
TL;DR: Cuts vLLM cudagraph collection from 80s -> 24s

Stop garbage collecting by default on every cudagraph recording. The old behavior can be re-enabled by setting `TORCH_CUDAGRAPH_GC=1` or the config `force_cudagraph_gc`.

We were previously garbage collecting at the beginning of each cudagraph
capture. vLLM collects 5427 graphs and most of those garbage collections weren't
actually collecting any memory (CPU or GPU). This changes it to not collect more
than every 10s so if we're capturing in a loop we don't burn all our cycles
looking for garbage.

(These number have a lot of variance from run to run but give the correct
general scale)
```
       | calls | total | synchronize |  gcs | collect | empty cache | sys freed | cuda freed |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
before |  5427 |   78s |       1.48s | 5427 |  53.22s |       1.21s |    145855 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
after  |  5427 |   24s |          0s |    3 |   1.53s |       0.84s |       592 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
```
total - this is the total time reported by vLLM's "Graph capturing finished" log.
The rest of these are measured in torch.cuda.graphs.graph.__enter__():
  calls - number of times torch.cuda.graphs.graph.__enter__ was called
  synchronize - this is the duration taken by the cuda.synchronize call
  gcs - number of times gc.collect was called
  collect - this is the duration taken by the gc.collect call
  empty cache - this is the duration taken by the torch.cuda.empty_cache call
  sys freed - the number of bytes reported freed by gc.collect
  cuda freed - the number of bytes reported freed by torch.cuda.memory_reserved

So it seems like the heavy lifting is done by torch.cuda.empty_cache() which is
fairly quick.

Cudagraph results from the TorchInductor Performance DashBoard (this is from the original version using the GC clock so the real results will be slightly better than this):
<img width="1494" height="382" alt="image" src="https://github.com/user-attachments/assets/69b705ef-47ce-4b6e-9733-1ec941cad93d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158193
Approved by: https://github.com/ngimel
2025-07-24 21:37:11 +00:00
..