## Architecture overview
<p align="center" width="100%">
<img width="60%" src="https://github.com/user-attachments/assets/aa7b56f9-a610-4f99-b793-1989b819d023">
<img width="60%" src="https://github.com/user-attachments/assets/f6bc50d0-3c78-47b7-bfd1-80c4d9f3bc18">
</p>

At a high level, PyTorch OSS benchmark infra consists of 5 different components:
1. The servers where the benchmarks are run. They come from various sources depending on availability. Some notable ones are:
   1. CUDA benchmarks like `torch.compile` on [linux.aws.h100](https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-perf-test-nightly-h100.yml)
   2. ROCm benchmarks on [linux.rocm.gpu.mi300.2](https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-perf-test-nightly-rocm.yml)
   3. x86 CPU benchmarks on [linux.24xl.spr-metal](https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-perf-test-nightly-x86.yml)
   4. aarch64 CPU benchmarks on [linux.arm64.m7g.metal](https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-perf-test-nightly-aarch64.yml)
   5. MPS benchmarks on [macos-m2-15](https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-perf-test-nightly-macos.yml)
   6. [Android](https://github.com/pytorch/executorch/blob/main/.github/workflows/android-perf.yml) and [iOS](https://github.com/pytorch/executorch/blob/main/.github/workflows/apple-perf.yml) benchmarks on AWS Device Farm
2. The integration layer where benchmark results are processed. To support different use cases across the PyTorch org, we don't dictate which benchmarks can be run or how. Instead, we provide an integration touch point on GitHub for CI and an API to upload benchmark results when running in a local environment. This gives onboarding teams the flexibility to run their benchmarks in their own way, as long as the results are saved in a standardized format. The format is documented [here](https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-infra#output-format)
3. The centralized benchmark database, located on ClickHouse Cloud at https://console.clickhouse.cloud under the `benchmark` database and the `oss_ci_benchmark_v3` table
4. The family of [HUD benchmark dashboards](https://hud.pytorch.org), whose code lives in PyTorch [test-infra](https://github.com/pytorch/test-infra/tree/main/torchci/pages/benchmark)
5. An UPCOMING collection of benchmark-related tooling, including:
   1. A querying API to access the benchmark data programmatically
   2. A regression notification mechanism (via Grafana)
   3. A bisecting tool to root-cause the regression
## Benchmark results format
Your benchmark results need to be a list of metrics in the following format. All fields are optional unless specified otherwise.

Note that the JSON list is optional. Writing one JSON record per line ([JSONEachRow](https://clickhouse.com/docs/en/interfaces/formats#jsoneachrow)) is also accepted.
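
As a rough illustration only, the sketch below builds one record and serializes it both ways. It shows just the handful of fields that also appear in the sample query in the Database section (`benchmark.name`, `model.name`, `metric.name`, `metric.benchmark_values`); all names and values here are placeholders, so refer to the format specification above for the full schema.

```
# Minimal sketch of one benchmark result record; field values are placeholders.
import json

record = {
    "benchmark": {"name": "My PyTorch benchmark"},
    "model": {"name": "my_model"},
    "metric": {"name": "latency_ms", "benchmark_values": [12.3, 12.5, 12.1]},
}

# Serialized as a JSON list ...
print(json.dumps([record]))
# ... or as one JSON record per line (JSONEachRow)
print(json.dumps(record))
```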
## Upload the benchmark results
### Upload API
The API is currently deployed at `https://kvvka55vt7t2dzl6qlxys72kra0xtirv.lambda-url.us-east-1.on.aws`. It accepts the benchmark result JSON together with an S3 path indicating where to store it. The API is gated behind a credential at the moment, so please reach out to PyTorch Dev Infra if you need to use it.
```
# A sketch of calling the upload API with Python requests. The S3 path below is
# only an example: any path under the v3/ directory is OK, and if the path
# already exists, the API will refuse to overwrite it.
import json

import requests

s3_path = f"v3/{repo_name}/{head_branch}/{head_sha}/{device}/benchmark_results.json"

payload = {
    "username": UPLOADER_USERNAME,
    "password": UPLOADER_PASSWORD,
    "s3_path": s3_path,
    "content": json.dumps(benchmark_results),
}
headers = {"content-type": "application/json"}

requests.post(
    "https://kvvka55vt7t2dzl6qlxys72kra0xtirv.lambda-url.us-east-1.on.aws",
    json=payload,
    headers=headers,
)
```
You can also use the [upload_benchmark_results.py](https://github.com/pytorch/pytorch-integration-testing/blob/main/vllm-benchmarks/upload_benchmark_results.py) script, which implements the above logic, in your bash script. For example:
```
# Credentials for the upload API (request them from PyTorch Dev Infra)
UPLOADER_USERNAME=<REDACT>
UPLOADER_PASSWORD=<REDACT>
# Use the GPU model name (e.g. H100) as the device
GPU_DEVICE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader | awk '{print $2}')

python upload_benchmark_results.py \
  --repo pytorch \
  --benchmark-name "My PyTorch benchmark" \
  --benchmark-results benchmark-results-dir \
  --device "${GPU_DEVICE}" \
  --dry-run
```
### GitHub CI
1. If you are using PyTorch AWS self-hosted runners, they already have permission to upload the benchmark results. There is nothing else to prepare.
2. If you are using something else (non-AWS), for example ROCm runners, please reach out to the PyTorch Dev Infra team (PoC @huydhn) to create a GitHub environment with permission to write to S3. The environment is called `upload-benchmark-results`. For example, see [android-perf.yml](https://github.com/pytorch/executorch/blob/9666ee8259065a80858603b3bc9b95a71ecfe460/.github/workflows/android-perf.yml#L385)
#### A sample job on AWS self-hosted runners
```
name: A sample benchmark job that runs on all main commits
# ...
      github-token: ${{ secrets.GITHUB_TOKEN }}
```
#### A sample job on non-AWS runners
```
name: A sample benchmark job that runs on all main commits
# ...
      dry-run: false
      schema-version: v3
      github-token: ${{ secrets.GITHUB_TOKEN }}
```
## Database
The benchmark database on ClickHouse Cloud is currently accessible to all Metamates. We also provide a [ClickHouse MCP server](https://github.com/izaitsevfb/clickhouse-mcp) that you can install to access the database via an AI agent like Claude Code.

A quick way to check your access to the database is to follow these steps:
1. Log in to https://console.clickhouse.cloud. Metamates can log in with SSO using their Meta email and request access; read-only access is granted by default.
2. Select the `benchmark` database
3. Run a sample query:
```
select
    head_branch,
    head_sha,
    benchmark,
    model.name as model,
    metric.name as name,
    arrayAvg(metric.benchmark_values) as value
from
    oss_ci_benchmark_v3
where
    tupleElement(benchmark, 'name') = 'TorchAO benchmark'
    and oss_ci_benchmark_v3.timestamp < 1733870813
    and oss_ci_benchmark_v3.timestamp > 1733784413
```
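
If you prefer to query the database programmatically rather than through the console, a sketch along the lines below should work with the [clickhouse-connect](https://clickhouse.com/docs/en/integrations/python) Python client. The host, username, and password are placeholders for your own read-only ClickHouse Cloud credentials; this is just direct database access, not the upcoming querying API mentioned in the architecture overview.

```
# Sketch: run the sample query with clickhouse-connect (pip install clickhouse-connect).
# Host and credentials below are placeholders for your read-only access.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="<your-clickhouse-cloud-host>",
    username="<read-only-user>",
    password="<password>",
    database="benchmark",
    secure=True,
)

result = client.query(
    """
    select
        head_branch,
        head_sha,
        tupleElement(benchmark, 'name') as benchmark,
        model.name as model,
        metric.name as metric,
        arrayAvg(metric.benchmark_values) as value
    from oss_ci_benchmark_v3
    where tupleElement(benchmark, 'name') = 'TorchAO benchmark'
      and timestamp > 1733784413
      and timestamp < 1733870813
    """
)

# Each row is (head_branch, head_sha, benchmark, model, metric, value)
for row in result.result_rows:
    print(row)
```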