PyTorch OSS benchmark infra

Architecture overview

At a high level, the PyTorch OSS benchmark infrastructure consists of 5 key components:

  1. Benchmark Servers - These servers come from various sources based on availability. Notable examples include:
    1. CUDA benchmarks like torch.compile on linux.aws.h100
    2. ROCm benchmarks on linux.rocm.gpu.mi300.2
    3. x86 CPU benchmarks on linux.24xl.spr-metal
    4. aarch64 CPU benchmarks on linux.arm64.m7g.metal
    5. MPS benchmarks on macos-m2-15
    6. Android and iOS benchmarks on AWS Device Farm
  2. Integration Layer - Where benchmark results are processed. To support different use cases across the PyTorch org, we don't dictate what benchmarks to run or how. Instead, we provide an integration point on GitHub for CI and an API to upload benchmark results when running in a local environment. This gives teams the flexibility to run benchmarks their own way, as long as the results are saved in a standardized format. The format is documented in the Benchmark results format section below.
  3. Centralized Benchmark Database - Located on ClickHouse Cloud at https://console.clickhouse.cloud, in the benchmark database, oss_ci_benchmark_v3 table.
  4. HUD Benchmark Dashboards - The family of benchmark dashboards, with code in PyTorch test-infra.
  5. UPCOMING Benchmark Tooling Collection:
    1. A querying API for programmatic benchmark data access
    2. A regression notification mechanism (via Grafana)
    3. A bisecting tool to identify root causes of regressions

Benchmark results format

Your benchmark results should be formatted as a list of metrics as shown below. All fields are optional unless specified as required.

// The list of all benchmark metrics
[
  {
    // Information about the benchmark
    benchmark: Tuple(
      name,  // Required. The name of the benchmark
      mode,  // Training or inference
      dtype,  // The dtype used by the benchmark
      extra_info: {},  // Any additional information about the benchmark
    ),

    // Information about the model or the test
    model: Tuple(
      name,  // Required. The model or the test name
      type,  // Additional information, for example whether this is a HF model or a custom micro-benchmark layer
      backend,  // Any delegation backend used here, e.g. XNNPACK
      origins,  // Where this comes from, e.g. HF
      extra_info: {},  // Any additional information about the model or the test
    ),
    ),

    // Information about the benchmark result
    metric: Tuple(
      name,  // Required. The name of the metric. It's good practice to include its unit, e.g. compilation_time(ms)
      benchmark_values,  // Required. A list of float values, because a benchmark is usually run multiple times
      target_value,  // Float. The optional target value used to indicate if there is a regression
      extra_info: {},  // Any additional information about the benchmark result
    ),

    // Optional information about any inputs used by the benchmark
    inputs: {
      name: Tuple(
        dtype,  // The dtype of the input
        extra_info: {},  // Any additional information about the input
      )
    },
  },

  {
    ... Same structure as the first record
  },
  ...
]

Note that using a JSON list is optional. Writing one JSON record per line (JSONEachRow) is also accepted.
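
For illustration, here is a minimal sketch of a single record in this format, generated with Python and written as JSONEachRow. The benchmark, model, and metric names below are hypothetical placeholders, not real benchmarks; only the fields marked as required above must be present.

import json

# A hypothetical record; names and values are placeholders
record = {
    "benchmark": {
        "name": "my_transformer_benchmark",  # Required
        "mode": "inference",
        "dtype": "bfloat16",
        "extra_info": {"batch_size": 8},
    },
    "model": {
        "name": "my_custom_layer",  # Required
        "type": "micro-benchmark",
        "backend": "inductor",
        "extra_info": {},
    },
    "metric": {
        "name": "latency(ms)",  # Required. Include the unit in the name
        "benchmark_values": [12.3, 12.1, 12.4],  # Required. One value per run
        "extra_info": {},
    },
}

# JSONEachRow: one JSON record per line
with open("benchmark-results.json", "w") as f:
    f.write(json.dumps(record) + "\n")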

Upload the benchmark results

Upload API

The fastest way to upload benchmark results is to use the upload_benchmark_results.py script. The script requires UPLOADER_[USERNAME|PASSWORD] credentials, so please contact PyTorch Dev Infra if you need access. Once written to the database, benchmark results should be considered immutable, as updating or deleting them is complex and costly.

Here is an example usage:

export UPLOADER_USERNAME=<REDACT>
export UPLOADER_PASSWORD=<REDACT>

# For CUDA benchmarks; other devices have different names. Any string works as long as it's consistent
export GPU_DEVICE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader | awk '{print $2}')

git clone https://github.com/pytorch/pytorch-integration-testing
cd pytorch-integration-testing/.github/scripts

# The script dependencies
pip install -r requirements.txt

# --repo is where the repo is checked out and built
# --benchmark-name is a unique string used to identify the benchmark
# --benchmark-results is where the JSON benchmark result files are kept
# --dry-run is to prepare everything except writing the results to S3
python upload_benchmark_results.py \
  --repo pytorch \
  --benchmark-name "My PyTorch benchmark" \
  --benchmark-results benchmark-results-dir \
  --device "${GPU_DEVICE}" \
  --dry-run

You can also set the repo metadata manually, for example when using nightly or release binaries:

# Use PyTorch 2.7 release
python upload_benchmark_results.py \
  --repo-name "pytorch/pytorch" \
  --head-branch "release/2.7" \
  --head-sha "e2d141dbde55c2a4370fac5165b0561b6af4798b" \
  --benchmark-name "My PyTorch benchmark" \
  --benchmark-results benchmark-results-dir \
  --device "${GPU_DEVICE}" \
  --dry-run

# Use PyTorch nightly
python upload_benchmark_results.py \
  --repo-name "pytorch/pytorch" \
  --head-branch "nightly" \
  --head-sha "be2ad70cfa1360da5c23a04ff6ca3480fa02f278" \
  --benchmark-name "My PyTorch benchmark" \
  --benchmark-results benchmark-results-dir \
  --device "${GPU_DEVICE}" \
  --dry-run

Behind the scenes, we have an API deployed at https://kvvka55vt7t2dzl6qlxys72kra0xtirv.lambda-url.us-east-1.on.aws that accepts the benchmark result JSON and an S3 path where it will be stored.

import json
import os

import requests

# repo_name, head_branch, head_sha, device, and benchmark_results come from your own benchmark run
# This path is an example - any path under the v3 directory is acceptable. If the path already exists, the API will not overwrite it
s3_path = f"v3/{repo_name}/{head_branch}/{head_sha}/{device}/benchmark_results.json"

payload = {
    "username": os.environ["UPLOADER_USERNAME"],
    "password": os.environ["UPLOADER_PASSWORD"],
    "s3_path": s3_path,
    "content": json.dumps(benchmark_results),
}

headers = {"content-type": "application/json"}

requests.post(
    # One current limitation of the API is that AWS limits the maximum size of the JSON payload to less than 6MB
    "https://kvvka55vt7t2dzl6qlxys72kra0xtirv.lambda-url.us-east-1.on.aws",
    json=payload,
    headers=headers,
)
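
If your results risk exceeding that ~6MB payload limit, one workaround is to split them across several uploads, each with its own S3 path. Below is a minimal sketch under that assumption; the chunking helper and the index suffix convention are ours for illustration, not part of the API.

import json

MAX_BYTES = 6 * 1024 * 1024  # Approximate payload limit noted above

def chunk_results(benchmark_results, max_bytes=MAX_BYTES):
    # Split a list of benchmark records into chunks whose serialized size stays under the limit
    chunks, current, size = [], [], 0
    for record in benchmark_results:
        record_size = len(json.dumps(record))
        if current and size + record_size > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(record)
        size += record_size
    if current:
        chunks.append(current)
    return chunks

# Each chunk then gets its own path, for example by appending an index to the file name
for i, chunk in enumerate(chunk_results(benchmark_results)):
    s3_path = f"v3/{repo_name}/{head_branch}/{head_sha}/{device}/benchmark_results_{i}.json"
    # POST each chunk with the same payload structure as above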

GitHub CI

  1. If you are using PyTorch AWS self-hosted runners, they already have permission to upload benchmark results. No additional preparation is needed.
  2. If you are using non-AWS runners (such as ROCm runners), please contact the PyTorch Dev Infra team (POC: @huydhn) to create a GitHub environment with S3 write permissions. This environment is called upload-benchmark-results. See android-perf.yml for an example.

A sample job on AWS self-hosted runners

name: A sample benchmark job that runs on all main commits
on:
  push:
    branches:
      - main

jobs:
  benchmark:
    runs-on: linux.2xlarge
    steps:
      - uses: actions/checkout@v3

      - name: Run your own benchmark logic
        shell: bash
        run: |
          set -eux

          mkdir -p "${{ runner.temp }}/benchmark-results"

          # Run your benchmark script and write the results to benchmark-results.json, whose format is defined in the previous section
          python run_my_benchmark_script.py > ${{ runner.temp }}/benchmark-results/benchmark-results.json

          # It's also ok to write the results into multiple JSON files, for example
          python run_my_benchmark_script.py --output-dir ${{ runner.temp }}/benchmark-results

      - name: Upload the benchmark results to the OSS benchmark database for the dashboard
        uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
        with:
          benchmark-results-dir: ${{ runner.temp }}/benchmark-results
          dry-run: false
          schema-version: v3
          github-token: ${{ secrets.GITHUB_TOKEN }}

A sample job on non-AWS runners

name: A sample benchmark job that runs on all main commits
on:
  push:
    branches:
      - main

jobs:
  benchmark:
    runs-on: linux.rocm.gpu.2  # An example non-AWS runner
    environment: upload-benchmark-results  # This environment has S3 write access to upload the results
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v3

      - name: Authenticate with AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_upload-benchmark-results
          # The max duration enforced by the server side
          role-duration-seconds: 18000
          aws-region: us-east-1

      - name: Run your own benchmark logic
        shell: bash
        run: |
          set -eux

          mkdir -p "${{ runner.temp }}/benchmark-results"

          # Run your benchmark script and write the results to benchmark-results.json, whose format is defined in the previous section
          python run_my_benchmark_script.py > ${{ runner.temp }}/benchmark-results/benchmark-results.json

          # It's also ok to write the results into multiple JSON files, for example
          python run_my_benchmark_script.py --output-dir ${{ runner.temp }}/benchmark-results

      - name: Upload the benchmark results to the OSS benchmark database for the dashboard
        uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
        with:
          benchmark-results-dir: ${{ runner.temp }}/benchmark-results
          dry-run: false
          schema-version: v3
          github-token: ${{ secrets.GITHUB_TOKEN }}

Query benchmark results

[Experimental] Query API

An experimental query API is available at https://queries.clickhouse.cloud/run/84649f4e-52c4-4cf9-bd6e-0a105ea145c8 for querying benchmark results from the database. Please contact PyTorch Dev Infra (@huydhn) if you need credentials to access it:

import os
import json
import requests

username = os.environ.get("CLICKHOUSE_API_USERNAME")
password = os.environ.get("CLICKHOUSE_API_PASSWORD")

params = {
    "format": "JSONEachRow",
    "queryVariables": {       
        # REQUIRED: The repo name in org/repo format
        "repo": "pytorch/pytorch",
        # REQUIRED: The name of the benchmark
        "benchmark": "TorchInductor",
        # REQUIRED: YYYY-MM-DDThh:mm:ss
        "startTime": "2025-06-06T00:00:00",
        # REQUIRED: YYYY-MM-DDThh:mm:ss
        "stopTime": "2025-06-13T00:00:00",

        # OPTIONAL: Only query benchmark results for these models.  Leaving this as an empty array [] will fetch all of them
        "models": ["BERT_pytorch"],
        # OPTIONAL: Only fetch these metrics.  Leaving this as an empty array [] will fetch all of them
        "metrics": ["speedup"],

        # OPTIONAL: Filter the benchmark results by device, e.g. cuda, and arch, e.g. H100. Leave them empty to get all devices
        "device": "",
        "arch": "",

        # OPTIONAL: Use this when you only care about the benchmark results from a specific branch and commit
        "branch": "",
        "commit": "",
    }
}

api_url = "https://queries.clickhouse.cloud/run/84649f4e-52c4-4cf9-bd6e-0a105ea145c8"

r = requests.post(api_url, json=params, auth=(username, password))
with open("benchmark_results.txt", "w") as f:
    print(r.text, file=f)
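
Because the response is in JSONEachRow format, each non-empty line of the body is a standalone JSON object. Here is a minimal sketch of turning the response into Python dictionaries; the exact fields in each row depend on the query, so they are not enumerated here.

import json

# Each non-empty line of a JSONEachRow response is an independent JSON object
rows = [json.loads(line) for line in r.text.splitlines() if line.strip()]
print(f"Fetched {len(rows)} rows")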

The benchmarks currently available are:

"pytorch-labs/tritonbench": "compile_time"
"pytorch-labs/tritonbench": "nightly"
"pytorch/ao": "TorchAO benchmark"
"pytorch/ao": "micro-benchmark api"
"pytorch/benchmark": "TorchInductor"
"pytorch/executorch": "ExecuTorch"
"pytorch/pytorch": "PyTorch gpt-fast benchmark"
"pytorch/pytorch": "PyTorch operator benchmark"
"pytorch/pytorch": "TorchCache Benchmark"
"pytorch/pytorch": "TorchInductor"
"pytorch/pytorch": "cache_benchmarks"
"pytorch/pytorch": "pr_time_benchmarks"
"vllm-project/vllm": "vLLM benchmark"

Here is the Bento notebook N7397718 illustrating some use cases from the TorchInductor benchmark.

Flambeau

To explore the benchmark database, we recommend using https://hud.pytorch.org/flambeau. You'll need write access to PyTorch to use the agent. The tool incorporates our clickhouse-mcp for database exploration.

For example, you can use this prompt to list available benchmarks: "List all the benchmark names from different GitHub repositories from Jun 10th to Jun 16th. List each name only once."

Benchmark database

The benchmark database on ClickHouse Cloud is accessible to all Metamates. We also provide a ClickHouse MCP server that you can install to access the database through AI agents like Claude Code.

Follow these steps to access the database:

  1. Log in to https://console.clickhouse.cloud. Metamates can log in with their Meta email via SSO and request access. Read-only access is granted by default.
  2. Select the benchmark database
  3. Run a sample query:
select
    head_branch,
    head_sha,
    benchmark,
    model.name as model,
    metric.name as name,
    arrayAvg(metric.benchmark_values) as value
from
    oss_ci_benchmark_v3
where
    tupleElement(benchmark, 'name') = 'TorchAO benchmark'
    and oss_ci_benchmark_v3.timestamp < 1733870813
    and oss_ci_benchmark_v3.timestamp > 1733784413
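
If you prefer to run such queries from a script instead of the console, a client library such as clickhouse-connect works with the same read-only credentials. The sketch below is an assumption about tooling rather than an officially supported path; the host name is the one shown in your ClickHouse Cloud console and is read from environment variables here.

import os

import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(
    host=os.environ["CLICKHOUSE_HOST"],  # From the ClickHouse Cloud console
    username=os.environ["CLICKHOUSE_USERNAME"],
    password=os.environ["CLICKHOUSE_PASSWORD"],
    secure=True,
    database="benchmark",
)

result = client.query(
    """
    select
        head_branch,
        head_sha,
        model.name as model,
        metric.name as metric,
        arrayAvg(metric.benchmark_values) as value
    from oss_ci_benchmark_v3
    where tupleElement(benchmark, 'name') = 'TorchAO benchmark'
        and timestamp < 1733870813
        and timestamp > 1733784413
    """
)
for row in result.result_rows:
    print(row)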