**Problem:** Fusion can accumulate a large number of reads, which can lead to a significant increase in peak memory utilization. Imagine we have the following code snippet:

```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```

The default execution is memory efficient, as only two tensors of size N-by-N are in memory at any given time. With fusion, however, the additions are fused into a single operation and the execution becomes something like:

```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```

Though this is run-time efficient, it is not memory efficient for large `N` and/or large `r`, since all `r` input buffers must be live at once.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:** Our proposed solution is to ban fusion in cases where a large amount of reads would be accumulated. This is in addition to some existing logic during torch compile:

* During lowering (i.e., in `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this check is oblivious to the size of those buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size_ of the buffers that can be accumulated (see the usage sketch at the end of this description).
* During scheduling (i.e., in `scheduler.py`), additional fusions are performed, so we also need to catch this pattern there. The decisions are implemented in `choices.py`.

**Results:** For a small example similar to the one in the test case (but with a larger `N` and a higher number of loop repeats), the memory snapshots before and after the change are shown below. Note that the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
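As a rough usage sketch of the thresholds described above: this assumes both knobs are exposed as attributes of `torch._inductor.config` (where inductor configs typically live); the threshold's unit and the chosen values here are illustrative assumptions, not confirmed by this PR.

```
import torch
import torch._inductor.config as inductor_config

# Existing knob: cap on the *number* of read buffers accumulated per
# operator (defaults to 8, per the description above).
inductor_config.realize_acc_reads_threshold = 8

# New knob from this PR: cap on the *size* of accumulated read buffers.
# The exact unit (bytes vs. elements) and a sensible value are
# assumptions here.
inductor_config.realize_acc_reads_size_threshold = 4 * 1024**2

N, r = 2048, 64  # hypothetical sizes for illustration

def accumulate():
    # Same pattern as the Problem section: r chained additions that
    # fusion would otherwise collapse into one many-input kernel.
    total = torch.rand(N, N)
    for _ in range(r):
        total = total + torch.rand(N, N)
    return total

# With the size threshold in effect, inductor should realize
# intermediate buffers instead of accumulating all r reads into a
# single fused op, keeping peak memory closer to the eager baseline.
out = torch.compile(accumulate)()
```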
# PyTorch Benchmarks
This folder contains scripts that produce reproducible timings of various PyTorch features.
It also provides mechanisms to compare PyTorch with other frameworks.
## Setup environment
Make sure you're on a machine with CUDA, torchvision, and pytorch installed. Install in the following order:

```
# Install torchvision. It comes with the pytorch stable release binary
python -m pip install torch torchvision

# Install the latest pytorch master from source.
# It should supersede the installation from the release binary.
cd $PYTORCH_HOME
python -m pip install --no-build-isolation -v -e .

# Check the pytorch installation version
python -c "import torch; print(torch.__version__)"
```
## Benchmark List
Please refer to each subfolder to discover each benchmark suite. Links are provided where descriptions exist: