[reland] Report test time regressions (#50171)

Summary:
This is a followup to https://github.com/pytorch/pytorch/issues/49190. The goal is to make it easy to identify test time regressions introduced by PRs. Eventually the hope is to use this information to edit Dr CI comments, but this particular PR only does the analysis and prints it to stdout, so a followup PR would be needed to edit the actual comments on GitHub.

**Important:** this PR moves the `print_test_stats.py` file (partly so that the new unit tests in `test/test_testing.py` can import it).

- *Before:* `test/print_test_stats.py`
- *After:* `torch/testing/_internal/print_test_stats.py`

Notes on the approach:

- Just getting the mean and stdev for the total job time of the last _N_ commits isn't sufficient, because e.g. if `master` was broken 5 commits ago, then a lot of those job times will be much shorter, breaking the statistics.
- We use the commit history to make better estimates of the mean and stdev of individual test (and suite) times, but only from historical commits where the test is present and its status matches that of the base commit (see the sketch after this list).
- We list all the tests that were removed or added, or whose status changed (e.g. skipped to not skipped, or vice versa), along with time (estimate) info for that test case and its containing suite.
- We don't list tests whose time changed a lot if their status didn't change, because there's a lot of noise and it's unclear how to do that well without too many false positives.
- We show a human-readable commit graph that indicates exactly how many commits are in the pool of commits that could be causing regressions (e.g. if a PR has multiple commits in it, or if the base commit on `master` doesn't have a report in S3).
- We don't show an overall estimate of whether the PR increased or decreased the total test job time, because it's noisy and it's a bit tricky to aggregate stdevs up from individual tests to the whole job level. This might change in a followup PR.
- Instead, we simply show a summary at the bottom which says how many tests were removed/added/modified (where "modified" means that the status changed), and our best estimates of the mean times (and stdevs) of those changes.
- Importantly, the summary at the bottom is only for the test cases that were already shown in the more verbose diff report, and does not include any information about tests whose status didn't change but whose running time got much longer.
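
To make the estimation concrete, here is a small self-contained sketch of the idea (the real logic lives in `matching_test_times`, `list_stat`, and `sum_normals` in `torch/testing/_internal/print_test_stats.py`; the helper names and toy numbers below are made up):
```
import math
import statistics
from typing import Dict, List, Optional, Tuple

Status = Optional[str]               # 'errored', 'failed', 'skipped', or None
Case = Tuple[float, Status]          # (seconds, status)
Report = Dict[str, Dict[str, Case]]  # suite name -> test name -> case

def matching_times(history: List[Report], suite: str, name: str, status: Status) -> List[float]:
    """Collect historical times for one test, but only from reports where its status matches."""
    return [
        report[suite][name][0]
        for report in history
        if suite in report and name in report[suite] and report[suite][name][1] == status
    ]

def stat(times: List[float]) -> Tuple[float, Optional[float]]:
    """Mean and (when there are at least two samples) stdev."""
    return statistics.mean(times), statistics.stdev(times) if len(times) > 1 else None

def suite_spread(spreads: List[Optional[float]]) -> float:
    """Aggregate per-test stdevs to a suite-level stdev, treating tests as independent."""
    return math.sqrt(sum((s or 0.0) ** 2 for s in spreads))

# Toy history: reports from three base commits (numbers are invented).
history: List[Report] = [
    {'Foo': {'test_a': (1.0, None), 'test_b': (0.01, 'skipped')}},
    {'Foo': {'test_a': (1.2, None), 'test_b': (0.02, 'skipped')}},
    {'Foo': {'test_a': (5.0, 'failed')}},  # status differs, so excluded from test_a's stats
]
mean_a, sd_a = stat(matching_times(history, 'Foo', 'test_a', None))
mean_b, sd_b = stat(matching_times(history, 'Foo', 'test_b', 'skipped'))
print(f'test_a: {mean_a:.3f}s ± {sd_a:.3f}s')              # 1.100s ± 0.141s
print(f'Foo spread: ± {suite_spread([sd_a, sd_b]):.3f}s')  # combine per-test stdevs
```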

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50171

Test Plan:
To run the unit tests:
```
$ python test/test_testing.py
$ python torch/testing/_internal/print_test_stats.py
```
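To run only the tests added by this PR, unittest-style filtering should also work (assuming `run_tests()` forwards positional arguments to `unittest.main` as usual; the class name `TestPrintTestStats` comes from the new code in `test/test_testing.py`):
```
$ python test/test_testing.py TestPrintTestStats
$ python test/test_testing.py TestPrintTestStats.test_graph
```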

To verify that this works, check the [CircleCI logs](https://app.circleci.com/pipelines/github/pytorch/pytorch/258628/workflows/9cfadc34-e042-485e-b3b3-dc251f160307) for a test job run on this PR; for example:
- pytorch_linux_bionic_py3_6_clang9_test

To test locally, use the following steps.

First run an arbitrary test suite (you need to have some XML reports so that `torch/testing/_internal/print_test_stats.py` will run, but we'll be ignoring them here via the `--use-json` CLI option):
```
$ DATA_DIR=/tmp
$ ARBITRARY_TEST=testing
$ python test/test_$ARBITRARY_TEST.py --save-xml=$DATA_DIR/test/test_$ARBITRARY_TEST
```
Now choose a commit and a test job (the commit has to be on `master` since we're going to grab the test time data from S3, and [we only upload test times to S3 on the `master`, `nightly`, and `release` branches](https://github.com/pytorch/pytorch/pull/49645)):
```
$ export CIRCLE_SHA1=c39fb9771d89632c5c3a163d3c00af3bef1bd489
$ export CIRCLE_JOB=pytorch_linux_bionic_py3_6_clang9_test
```
Download the `*.json.bz2` file(s) for that commit/job pair:
```
$ aws s3 cp s3://ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB/ $DATA_DIR/ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB --recursive
```
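Optionally, peek at one of the downloaded reports to see the schema that `print_test_stats.py` consumes (a quick sketch assuming a single `*.json.bz2` file was downloaded; the `build_job`, `total_seconds`, and `suites` keys come from the `Report` type in the new module):
```
$ bzip2 -kdc $DATA_DIR/ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB/*Z.json.bz2 | python -c "import json, sys; r = json.load(sys.stdin); print(r['build_job'], r['total_seconds'], sorted(r['suites'])[:5])"
```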
And feed everything into `torch/testing/_internal/print_test_stats.py`:
```
$ bzip2 -kdc $DATA_DIR/ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB/*Z.json.bz2 | torch/testing/_internal/print_test_stats.py --compare-with-s3 --use-json=/dev/stdin $DATA_DIR/test/test_$ARBITRARY_TEST
```
The first part of the output should be the same as before this PR; here is the new part, at the end of the output:

- https://pastebin.com/Jj1svhAn

Reviewed By: malfet, izdeby

Differential Revision: D26317769

Pulled By: samestep

fbshipit-source-id: 1ba06cec0fafac77f9e7341d57079543052d73db
This commit is contained in:
Sam Estep
2021-02-08 15:27:11 -08:00
committed by Facebook GitHub Bot
parent 9e4f3b89c4
commit 21ef248fb8
8 changed files with 1325 additions and 259 deletions

View File

@@ -642,9 +642,9 @@ jobs:
export CIRCLE_JOB="$CIRCLE_JOB"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
cd workspace
python test/print_test_stats.py --upload-to-s3 test
python torch/testing/_internal/print_test_stats.py --upload-to-s3 --compare-with-s3 test
EOL
echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh
echo "(cat docker_commands.sh | docker exec -u jenkins -e LANG=C.UTF-8 -i "$id" bash) 2>&1" > command.sh
unbuffer bash command.sh | ts
echo "Retrieving test reports"

View File

@@ -201,9 +201,9 @@ jobs:
export CIRCLE_JOB="$CIRCLE_JOB"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
cd workspace
python test/print_test_stats.py --upload-to-s3 test
python torch/testing/_internal/print_test_stats.py --upload-to-s3 --compare-with-s3 test
EOL
echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh
echo "(cat docker_commands.sh | docker exec -u jenkins -e LANG=C.UTF-8 -i "$id" bash) 2>&1" > command.sh
unbuffer bash command.sh | ts
echo "Retrieving test reports"

View File

@@ -37,6 +37,7 @@ files = tools/codegen/gen.py,
tools/autograd/*.py,
tools/pyi/*.py,
torch/testing/_internal/mypy_wrapper.py,
torch/testing/_internal/print_test_stats.py,
torch/utils/benchmark/utils/common.py,
torch/utils/benchmark/utils/timer.py,
torch/utils/benchmark/utils/valgrind_wrapper/*.py,

View File

@@ -242,3 +242,9 @@ ignore_missing_imports = True
[mypy-mypy.*]
ignore_missing_imports = True
[mypy-xml.*]
ignore_missing_imports = True
[mypy-boto3.*]
ignore_missing_imports = True

View File

@@ -1,254 +0,0 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Read and print test results statistics
from xml.dom import minidom
from glob import glob
import bz2
import json
import os
import statistics
import time
import boto3
import datetime
import requests
class TestCase:
def __init__(self, dom):
self.class_name = str(dom.attributes['classname'].value)
self.name = str(dom.attributes['name'].value)
self.time = float(dom.attributes['time'].value)
self.errored = len(dom.getElementsByTagName('error')) > 0
self.failed = len(dom.getElementsByTagName('failure')) > 0
self.skipped = len(dom.getElementsByTagName('skipped')) > 0
class TestSuite:
def __init__(self, name):
self.name = name
self.test_cases = []
self.failed_count = 0
self.skipped_count = 0
self.errored_count = 0
self.total_time = 0.0
def __repr__(self):
rc = f'{self.name} run_time: {self.total_time:.2f} tests: {len(self.test_cases)}'
if self.skipped_count > 0:
rc += f' skipped: {self.skipped_count}'
return f'TestSuite({rc})'
def append(self, test_case):
self.test_cases.append(test_case)
self.total_time += test_case.time
self.failed_count += 1 if test_case.failed else 0
self.skipped_count += 1 if test_case.skipped else 0
self.errored_count += 1 if test_case.errored else 0
def print_report(self, num_longest=3):
sorted_tests = sorted(self.test_cases, key=lambda x: x.time)
test_count = len(sorted_tests)
print(f"class {self.name}:")
print(f" tests: {test_count} failed: {self.failed_count} skipped: {self.skipped_count} errored: {self.errored_count}")
print(f" run_time: {self.total_time:.2f} seconds")
print(f" avg_time: {self.total_time/test_count:.2f} seconds")
if test_count >= 2:
print(f" median_time: {statistics.median(x.time for x in sorted_tests):.2f} seconds")
sorted_tests = sorted_tests[-num_longest:]
print(f" {len(sorted_tests)} longest tests:")
for test in reversed(sorted_tests):
print(f" {test.name} time: {test.time:.2f} seconds")
print("")
def parse_report(path):
dom = minidom.parse(path)
for test_case in dom.getElementsByTagName('testcase'):
yield TestCase(test_case)
def parse_reports(folder):
reports = glob(os.path.join(folder, '**', '*.xml'), recursive=True)
tests_by_class = dict()
for report in reports:
for test_case in parse_report(report):
class_name = test_case.class_name
if class_name not in tests_by_class:
tests_by_class[class_name] = TestSuite(class_name)
tests_by_class[class_name].append(test_case)
return tests_by_class
def build_info():
return {
"build_pr": os.environ.get("CIRCLE_PR_NUMBER"),
"build_tag": os.environ.get("CIRCLE_TAG"),
"build_sha1": os.environ.get("CIRCLE_SHA1"),
"build_branch": os.environ.get("CIRCLE_BRANCH"),
"build_job": os.environ.get("CIRCLE_JOB"),
"build_workflow_id": os.environ.get("CIRCLE_WORKFLOW_ID"),
}
def build_message(test_case):
return {
"normal": {
**build_info(),
"test_suite_name": test_case.class_name,
"test_case_name": test_case.name,
},
"int": {
"time": int(time.time()),
"test_total_count": 1,
"test_total_time": int(test_case.time * 1000),
"test_failed_count": 1 if test_case.failed > 0 else 0,
"test_skipped_count": 1 if test_case.skipped > 0 else 0,
"test_errored_count": 1 if test_case.errored > 0 else 0,
},
}
def send_report_to_scribe(reports):
access_token = os.environ.get("SCRIBE_GRAPHQL_ACCESS_TOKEN")
if not access_token:
print("No scribe access token provided, skip sending report!")
return
print("Scribe access token provided, sending report...")
url = "https://graph.facebook.com/scribe_logs"
r = requests.post(
url,
data={
"access_token": access_token,
"logs": json.dumps(
[
{
"category": "perfpipe_pytorch_test_times",
"message": json.dumps(build_message(test_case)),
"line_escape": False,
}
for name in sorted(reports.keys())
for test_case in reports[name].test_cases
]
),
},
)
r.raise_for_status()
def send_report_to_s3(reports, *, total_seconds):
job = os.environ.get('CIRCLE_JOB')
sha1 = os.environ.get('CIRCLE_SHA1')
branch = os.environ.get('CIRCLE_BRANCH', '')
if branch not in ['master', 'nightly'] and not branch.startswith("release/"):
print("S3 upload only enabled on master, nightly and release branches.")
print(f"skipping test report on branch: {branch}")
return
now = datetime.datetime.utcnow().isoformat()
key = f'test_time/{sha1}/{job}/{now}Z.json.bz2' # Z meaning UTC
s3 = boto3.resource('s3')
try:
s3.get_bucket_acl(Bucket='ossci-metrics')
except Exception as e:
print(f"AWS ACL failed: {e}")
print("AWS credential found, uploading to S3...")
obj = s3.Object('ossci-metrics', key)
print("")
# use bz2 because the results are smaller than gzip, and the
# compression time penalty we pay is only about half a second for
# input files of a few megabytes in size like these JSON files, and
# because for some reason zlib doesn't seem to play nice with the
# gunzip command whereas Python's bz2 does work with bzip2
obj.put(Body=bz2.compress(json.dumps({
**build_info(),
'total_seconds': total_seconds,
'suites': {
name: {
'total_seconds': suite.total_time,
'cases': [
{
'name': case.name,
'seconds': case.time,
'errored': case.errored,
'failed': case.failed,
'skipped': case.skipped,
}
for case in suite.test_cases
],
}
for name, suite in reports.items()
}
}).encode()))
def positive_integer(value):
parsed = int(value)
if parsed < 1:
raise argparse.ArgumentTypeError(f"{value} is not a natural number")
return parsed
def positive_float(value):
parsed = float(value)
if parsed <= 0.0:
raise argparse.ArgumentTypeError(f"{value} is not a positive rational number")
return parsed
if __name__ == '__main__':
import argparse
import sys
parser = argparse.ArgumentParser(
"Print statistics from test XML output.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--longest-of-class",
type=positive_integer,
default=3,
metavar="N",
help="how many longest tests to show for each class",
)
parser.add_argument(
"--class-print-threshold",
type=positive_float,
default=1.0,
metavar="N",
help="Minimal total time to warrant class report",
)
parser.add_argument(
"--longest-of-run",
type=positive_integer,
default=10,
metavar="N",
help="how many longest tests to show from the entire run",
)
parser.add_argument(
"--upload-to-s3",
action="store_true",
help="upload test time to S3 bucket",
)
parser.add_argument(
"folder",
help="test report folder",
)
args = parser.parse_args()
reports = parse_reports(args.folder)
if len(reports) == 0:
print(f"No test reports found in {args.folder}")
sys.exit(0)
send_report_to_scribe(reports)
longest_tests = []
total_time = 0
for name in sorted(reports.keys()):
test_suite = reports[name]
if test_suite.total_time >= args.class_print_threshold:
test_suite.print_report(args.longest_of_class)
total_time += test_suite.total_time
longest_tests.extend(test_suite.test_cases)
longest_tests = sorted(longest_tests, key=lambda x: x.time)[-args.longest_of_run:]
if args.upload_to_s3:
send_report_to_s3(reports, total_seconds=total_time)
print(f"Total runtime is {datetime.timedelta(seconds=int(total_time))}")
print(f"{len(longest_tests)} longest tests of entire run:")
for test_case in reversed(longest_tests):
print(f" {test_case.class_name}.{test_case.name} time: {test_case.time:.2f} seconds")

View File

@@ -8,6 +8,7 @@ from torch.testing._internal.common_utils import \
from torch.testing._internal.common_device_type import \
(instantiate_device_type_tests, onlyCUDA, onlyOnCPUAndCUDA, dtypes)
from torch.testing._internal import mypy_wrapper
from torch.testing._internal import print_test_stats
# For testing TestCase methods and torch.testing functions
class TestTesting(TestCase):
@@ -545,5 +546,456 @@ class TestMypyWrapper(TestCase):
))
def fakehash(char):
return char * 40
def makecase(name, seconds, *, errored=False, failed=False, skipped=False):
return {
'name': name,
'seconds': seconds,
'errored': errored,
'failed': failed,
'skipped': skipped,
}
def makereport(tests):
suites = {
suite_name: {
'total_seconds': sum(case['seconds'] for case in cases),
'cases': cases,
}
for suite_name, cases in tests.items()
}
return {
'total_seconds': sum(s['total_seconds'] for s in suites.values()),
'suites': suites,
}
class TestPrintTestStats(TestCase):
maxDiff = None
def test_analysis(self):
head_report = makereport({
# input ordering of the suites is ignored
'Grault': [
# not printed: status same and time similar
makecase('test_grault0', 4.78, failed=True),
# status same, but time increased a lot
makecase('test_grault2', 1.473, errored=True),
],
# individual tests times changed, not overall suite
'Qux': [
# input ordering of the test cases is ignored
makecase('test_qux1', 0.001, skipped=True),
makecase('test_qux6', 0.002, skipped=True),
# time in bounds, but status changed
makecase('test_qux4', 7.158, failed=True),
# not printed because it's the same as before
makecase('test_qux7', 0.003, skipped=True),
makecase('test_qux5', 11.968),
makecase('test_qux3', 23.496),
],
# new test suite
'Bar': [
makecase('test_bar2', 3.742, failed=True),
makecase('test_bar1', 50.447),
],
# overall suite time changed but no individual tests
'Norf': [
makecase('test_norf1', 3),
makecase('test_norf2', 3),
makecase('test_norf3', 3),
makecase('test_norf4', 3),
],
# suite doesn't show up if it doesn't change enough
'Foo': [
makecase('test_foo1', 42),
makecase('test_foo2', 56),
],
})
base_reports = {
# bbbb has no reports, so base is cccc instead
fakehash('b'): [],
fakehash('c'): [
makereport({
'Baz': [
makecase('test_baz2', 13.605),
# no recent suites have & skip this test
makecase('test_baz1', 0.004, skipped=True),
],
'Foo': [
makecase('test_foo1', 43),
# test added since dddd
makecase('test_foo2', 57),
],
'Grault': [
makecase('test_grault0', 4.88, failed=True),
makecase('test_grault1', 11.967, failed=True),
makecase('test_grault2', 0.395, errored=True),
makecase('test_grault3', 30.460),
],
'Norf': [
makecase('test_norf1', 2),
makecase('test_norf2', 2),
makecase('test_norf3', 2),
makecase('test_norf4', 2),
],
'Qux': [
makecase('test_qux3', 4.978, errored=True),
makecase('test_qux7', 0.002, skipped=True),
makecase('test_qux2', 5.618),
makecase('test_qux4', 7.766, errored=True),
makecase('test_qux6', 23.589, failed=True),
],
}),
],
fakehash('d'): [
makereport({
'Foo': [
makecase('test_foo1', 40),
# removed in cccc
makecase('test_foo3', 17),
],
'Baz': [
# not skipped, so not included in stdev
makecase('test_baz1', 3.14),
],
'Qux': [
makecase('test_qux7', 0.004, skipped=True),
makecase('test_qux2', 6.02),
makecase('test_qux4', 20.932),
],
'Norf': [
makecase('test_norf1', 3),
makecase('test_norf2', 3),
makecase('test_norf3', 3),
makecase('test_norf4', 3),
],
'Grault': [
makecase('test_grault0', 5, failed=True),
makecase('test_grault1', 14.325, failed=True),
makecase('test_grault2', 0.31, errored=True),
],
}),
],
fakehash('e'): [],
fakehash('f'): [
makereport({
'Foo': [
makecase('test_foo3', 24),
makecase('test_foo1', 43),
],
'Baz': [
makecase('test_baz2', 16.857),
],
'Qux': [
makecase('test_qux2', 6.422),
makecase('test_qux4', 6.382, errored=True),
],
'Norf': [
makecase('test_norf1', 0.9),
makecase('test_norf3', 0.9),
makecase('test_norf2', 0.9),
makecase('test_norf4', 0.9),
],
'Grault': [
makecase('test_grault0', 4.7, failed=True),
makecase('test_grault1', 13.146, failed=True),
makecase('test_grault2', 0.48, errored=True),
],
}),
],
}
simpler_head = print_test_stats.simplify(head_report)
simpler_base = {}
for commit, reports in base_reports.items():
simpler_base[commit] = [print_test_stats.simplify(r) for r in reports]
analysis = print_test_stats.analyze(
head_report=simpler_head,
base_reports=simpler_base,
)
self.assertEqual(
'''\
- class Baz:
- # was 15.23s ± 2.30s
-
- def test_baz1: ...
- # was 0.004s (skipped)
-
- def test_baz2: ...
- # was 15.231s ± 2.300s
class Grault:
# was 48.86s ± 1.19s
# now 6.25s
- def test_grault1: ...
- # was 13.146s ± 1.179s (failed)
- def test_grault3: ...
- # was 30.460s
class Qux:
# was 41.66s ± 1.06s
# now 42.63s
- def test_qux2: ...
- # was 6.020s ± 0.402s
! def test_qux3: ...
! # was 4.978s (errored)
! # now 23.496s
! def test_qux4: ...
! # was 7.074s ± 0.979s (errored)
! # now 7.158s (failed)
! def test_qux6: ...
! # was 23.589s (failed)
! # now 0.002s (skipped)
+ def test_qux1: ...
+ # now 0.001s (skipped)
+ def test_qux5: ...
+ # now 11.968s
+ class Bar:
+ # now 54.19s
+
+ def test_bar1: ...
+ # now 50.447s
+
+ def test_bar2: ...
+ # now 3.742s (failed)
''',
print_test_stats.anomalies(analysis),
)
def test_graph(self):
# HEAD is on master
self.assertEqual(
'''\
Commit graph (base is most recent master ancestor with at least one S3 report):
: (master)
|
* aaaaaaaaaa (HEAD) total time 502.99s
* bbbbbbbbbb (base) 1 report, total time 47.84s
* cccccccccc 1 report, total time 332.50s
* dddddddddd 0 reports
|
:
''',
print_test_stats.graph(
head_sha=fakehash('a'),
head_seconds=502.99,
base_seconds={
fakehash('b'): [47.84],
fakehash('c'): [332.50],
fakehash('d'): [],
},
on_master=True,
)
)
self.assertEqual(
'''\
Commit graph (base is most recent master ancestor with at least one S3 report):
: (master)
|
| * aaaaaaaaaa (HEAD) total time 9988.77s
|/
* bbbbbbbbbb (base) 121 reports, total time 7654.32s ± 55.55s
* cccccccccc 20 reports, total time 5555.55s ± 253.19s
* dddddddddd 1 report, total time 1234.56s
|
:
''',
print_test_stats.graph(
head_sha=fakehash('a'),
head_seconds=9988.77,
base_seconds={
fakehash('b'): [7598.77] * 60 + [7654.32] + [7709.87] * 60,
fakehash('c'): [5308.77] * 10 + [5802.33] * 10,
fakehash('d'): [1234.56],
},
on_master=False,
)
)
self.assertEqual(
'''\
Commit graph (base is most recent master ancestor with at least one S3 report):
: (master)
|
| * aaaaaaaaaa (HEAD) total time 25.52s
| |
| : (5 commits)
|/
* bbbbbbbbbb 0 reports
* cccccccccc 0 reports
* dddddddddd (base) 15 reports, total time 58.92s ± 25.82s
|
:
''',
print_test_stats.graph(
head_sha=fakehash('a'),
head_seconds=25.52,
base_seconds={
fakehash('b'): [],
fakehash('c'): [],
fakehash('d'): [52.25] * 14 + [152.26],
},
on_master=False,
ancestry_path=5,
)
)
self.assertEqual(
'''\
Commit graph (base is most recent master ancestor with at least one S3 report):
: (master)
|
| * aaaaaaaaaa (HEAD) total time 0.08s
|/|
| : (1 commit)
|
* bbbbbbbbbb 0 reports
* cccccccccc (base) 1 report, total time 0.09s
* dddddddddd 3 reports, total time 0.10s ± 0.05s
|
:
''',
print_test_stats.graph(
head_sha=fakehash('a'),
head_seconds=0.08,
base_seconds={
fakehash('b'): [],
fakehash('c'): [0.09],
fakehash('d'): [0.05, 0.10, 0.15],
},
on_master=False,
other_ancestors=1,
)
)
self.assertEqual(
'''\
Commit graph (base is most recent master ancestor with at least one S3 report):
: (master)
|
| * aaaaaaaaaa (HEAD) total time 5.98s
| |
| : (1 commit)
|/|
| : (7 commits)
|
* bbbbbbbbbb (base) 2 reports, total time 6.02s ± 1.71s
* cccccccccc 0 reports
* dddddddddd 10 reports, total time 5.84s ± 0.92s
|
:
''',
print_test_stats.graph(
head_sha=fakehash('a'),
head_seconds=5.98,
base_seconds={
fakehash('b'): [4.81, 7.23],
fakehash('c'): [],
fakehash('d'): [4.97] * 5 + [6.71] * 5,
},
on_master=False,
ancestry_path=1,
other_ancestors=7,
)
)
def test_regression_info(self):
self.assertEqual(
'''\
----- Historic stats comparison result ------
job: foo_job
commit: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
class Foo:
# was 42.50s ± 2.12s
# now 3.02s
- def test_bar: ...
- # was 1.000s
! def test_foo: ...
! # was 41.500s ± 2.121s
! # now 0.020s (skipped)
+ def test_baz: ...
+ # now 3.000s
Commit graph (base is most recent master ancestor with at least one S3 report):
: (master)
|
| * aaaaaaaaaa (HEAD) total time 3.02s
|/
* bbbbbbbbbb (base) 1 report, total time 41.00s
* cccccccccc 1 report, total time 43.00s
|
:
Removed (across 1 suite) 1 test, totaling - 1.00s
Modified (across 1 suite) 1 test, totaling - 41.48s ± 2.12s
Added (across 1 suite) 1 test, totaling + 3.00s
''',
print_test_stats.regression_info(
head_sha=fakehash('a'),
head_report=makereport({
'Foo': [
makecase('test_foo', 0.02, skipped=True),
makecase('test_baz', 3),
]}),
base_reports={
fakehash('b'): [
makereport({
'Foo': [
makecase('test_foo', 40),
makecase('test_bar', 1),
],
}),
],
fakehash('c'): [
makereport({
'Foo': [
makecase('test_foo', 43),
],
}),
],
},
job_name='foo_job',
on_master=False,
ancestry_path=0,
other_ancestors=0,
)
)
if __name__ == '__main__':
run_tests()

View File

@@ -31,7 +31,6 @@ from .utils import (
BLOCKED_PYTHON_TESTS = {
"run_test.py",
"print_test_stats.py",
"test_dataloader.py",
"test_multiprocessing.py",
"test_multiprocessing_spawn.py",

View File

@@ -0,0 +1,862 @@
#!/usr/bin/env python
import bz2
import datetime
import json
import math
import os
import statistics
import subprocess
import time
from collections import defaultdict
from glob import glob
from pathlib import Path
from typing import (Any, DefaultDict, Dict, Iterable, Iterator, List, Optional,
Tuple)
from xml.dom import minidom # type: ignore[import]
import requests
from typing_extensions import TypedDict
try:
import boto3 # type: ignore[import]
HAVE_BOTO3 = True
except ImportError:
HAVE_BOTO3 = False
Commit = str # 40-digit SHA-1 hex string
Status = Optional[str] # errored, failed, skipped, or None
# represent suite as dict because indexing is useful
SimplerCase = Tuple[float, Status]
SimplerSuite = Dict[str, SimplerCase]
SimplerReport = Dict[str, SimplerSuite]
class Case(TypedDict):
name: str
seconds: float
errored: bool
failed: bool
skipped: bool
class Suite(TypedDict):
total_seconds: float
cases: List[Case]
class ReportMeta(TypedDict, total=False):
build_pr: str
build_tag: str
build_sha1: Commit
build_branch: str
build_job: str
build_workflow_id: str
class Report(ReportMeta):
total_seconds: float
suites: Dict[str, Suite]
class Stat(TypedDict):
center: float
spread: Optional[float]
class CaseDiff(TypedDict):
margin: str
name: str
was: Optional[Tuple[Stat, Status]]
now: Optional[SimplerCase]
class SuiteDiff(TypedDict):
margin: str
name: str
was: Optional[Stat]
now: Optional[float]
cases: List[CaseDiff]
def case_status(case: Case) -> Status:
for k in {'errored', 'failed', 'skipped'}:
if case[k]: # type: ignore[misc]
return k
return None
def simplify(report: Report) -> SimplerReport:
return {
suite_name: {
case['name']: (case['seconds'], case_status(case))
for case in suite['cases']
}
for suite_name, suite in report['suites'].items()
}
def plural(n: int) -> str:
return '' if n == 1 else 's'
def display_stat(
x: Stat,
format: Tuple[Tuple[int, int], Tuple[int, int]],
) -> str:
spread_len = format[1][0] + 1 + format[1][1]
spread = x['spread']
if spread is not None:
spread_str = f' ± {spread:{spread_len}.{format[1][1]}f}s'
else:
spread_str = ' ' * (3 + spread_len + 1)
mean_len = format[0][0] + 1 + format[0][1]
return f'{x["center"]:{mean_len}.{format[0][1]}f}s{spread_str}'
def list_stat(l: List[float]) -> Stat:
return {
'center': statistics.mean(l),
'spread': statistics.stdev(l) if len(l) > 1 else None
}
def zero_stat() -> Stat:
return {'center': 0, 'spread': None}
def recenter(was: Stat, now: float) -> Stat:
return {'center': now - was['center'], 'spread': was['spread']}
def sum_normals(stats: Iterable[Stat]) -> Stat:
"""
Returns a stat corresponding to the sum of the given stats.
Assumes that the center and spread for each of the given stats are
mean and stdev, respectively.
"""
l = list(stats)
spread: Optional[float]
if any(stat['spread'] is not None for stat in l):
spread = math.sqrt(sum((stat['spread'] or 0)**2 for stat in l))
else:
spread = None
return {
'center': sum(stat['center'] for stat in l),
'spread': spread,
}
def format_seconds(seconds: List[float]) -> str:
if len(seconds) > 0:
x = list_stat(seconds)
return f'total time {display_stat(x, ((5, 2), (4, 2)))}'.strip()
return ''
def show_ancestors(num_commits: int) -> str:
return f' | : ({num_commits} commit{plural(num_commits)})'
def unlines(lines: List[str]) -> str:
return ''.join(f'{line}\n' for line in lines)
def matching_test_times(
base_reports: Dict[Commit, List[SimplerReport]],
suite_name: str,
case_name: str,
status: Status,
) -> List[float]:
times: List[float] = []
for reports in base_reports.values():
for report in reports:
suite = report.get(suite_name)
if suite:
case = suite.get(case_name)
if case:
t, s = case
if s == status:
times.append(t)
return times
def analyze(
*,
head_report: SimplerReport,
base_reports: Dict[Commit, List[SimplerReport]],
) -> List[SuiteDiff]:
# most recent master ancestor with at least one S3 report
base_sha = next(sha for sha, reports in base_reports.items() if reports)
# find all relevant suites (those in either base or head or both)
all_reports = [head_report] + base_reports[base_sha]
all_suites = {k for r in all_reports for k in r.keys()}
removed_suites: List[SuiteDiff] = []
modified_suites: List[SuiteDiff] = []
added_suites: List[SuiteDiff] = []
for suite_name in sorted(all_suites):
case_diffs: List[CaseDiff] = []
head_suite = head_report.get(suite_name)
base_cases: Dict[str, Status] = dict(sorted(set.intersection(*[
{(n, s) for n, (_, s) in report.get(suite_name, {}).items()}
for report in base_reports[base_sha]
] or [set()])))
case_stats: Dict[str, Stat] = {}
if head_suite:
now = sum(case[0] for case in head_suite.values())
if any(suite_name in report for report in base_reports[base_sha]):
removed_cases: List[CaseDiff] = []
for case_name, case_status in base_cases.items():
case_stats[case_name] = list_stat(matching_test_times(
base_reports,
suite_name,
case_name,
case_status,
))
if case_name not in head_suite:
removed_cases.append({
'margin': '-',
'name': case_name,
'was': (case_stats[case_name], case_status),
'now': None,
})
modified_cases: List[CaseDiff] = []
added_cases: List[CaseDiff] = []
for head_case_name in sorted(head_suite):
head_case = head_suite[head_case_name]
if head_case_name in base_cases:
stat = case_stats[head_case_name]
base_status = base_cases[head_case_name]
if head_case[1] != base_status:
modified_cases.append({
'margin': '!',
'name': head_case_name,
'was': (stat, base_status),
'now': head_case,
})
else:
added_cases.append({
'margin': '+',
'name': head_case_name,
'was': None,
'now': head_case,
})
# there might be a bug calculating this stdev, not sure
was = sum_normals(case_stats.values())
case_diffs = removed_cases + modified_cases + added_cases
if case_diffs:
modified_suites.append({
'margin': ' ',
'name': suite_name,
'was': was,
'now': now,
'cases': case_diffs,
})
else:
for head_case_name in sorted(head_suite):
head_case = head_suite[head_case_name]
case_diffs.append({
'margin': ' ',
'name': head_case_name,
'was': None,
'now': head_case,
})
added_suites.append({
'margin': '+',
'name': suite_name,
'was': None,
'now': now,
'cases': case_diffs,
})
else:
for case_name, case_status in base_cases.items():
case_stats[case_name] = list_stat(matching_test_times(
base_reports,
suite_name,
case_name,
case_status,
))
case_diffs.append({
'margin': ' ',
'name': case_name,
'was': (case_stats[case_name], case_status),
'now': None,
})
removed_suites.append({
'margin': '-',
'name': suite_name,
# there might be a bug calculating this stdev, not sure
'was': sum_normals(case_stats.values()),
'now': None,
'cases': case_diffs,
})
return removed_suites + modified_suites + added_suites
def case_diff_lines(diff: CaseDiff) -> List[str]:
lines = [f'def {diff["name"]}: ...']
case_fmt = ((3, 3), (2, 3))
was = diff['was']
if was:
was_line = f' # was {display_stat(was[0], case_fmt)}'
was_status = was[1]
if was_status:
was_line += f' ({was_status})'
lines.append(was_line)
now = diff['now']
if now:
now_stat: Stat = {'center': now[0], 'spread': None}
now_line = f' # now {display_stat(now_stat, case_fmt)}'
now_status = now[1]
if now_status:
now_line += f' ({now_status})'
lines.append(now_line)
return [''] + [f'{diff["margin"]} {l}' for l in lines]
def display_suite_diff(diff: SuiteDiff) -> str:
lines = [f'class {diff["name"]}:']
suite_fmt = ((4, 2), (3, 2))
was = diff['was']
if was:
lines.append(f' # was {display_stat(was, suite_fmt)}')
now = diff['now']
if now is not None:
now_stat: Stat = {'center': now, 'spread': None}
lines.append(f' # now {display_stat(now_stat, suite_fmt)}')
for case_diff in diff['cases']:
lines.extend([f' {l}' for l in case_diff_lines(case_diff)])
return unlines([''] + [f'{diff["margin"]} {l}'.rstrip() for l in lines] + [''])
def anomalies(diffs: List[SuiteDiff]) -> str:
return ''.join(map(display_suite_diff, diffs))
def graph(
*,
head_sha: Commit,
head_seconds: float,
base_seconds: Dict[Commit, List[float]],
on_master: bool,
ancestry_path: int = 0,
other_ancestors: int = 0,
) -> str:
lines = [
'Commit graph (base is most recent master ancestor with at least one S3 report):',
'',
' : (master)',
' |',
]
head_time_str = f' {format_seconds([head_seconds])}'
if on_master:
lines.append(f' * {head_sha[:10]} (HEAD) {head_time_str}')
else:
lines.append(f' | * {head_sha[:10]} (HEAD) {head_time_str}')
if ancestry_path > 0:
lines += [
' | |',
show_ancestors(ancestry_path),
]
if other_ancestors > 0:
lines += [
' |/|',
show_ancestors(other_ancestors),
' |',
]
else:
lines.append(' |/')
is_first = True
for sha, seconds in base_seconds.items():
num_runs = len(seconds)
prefix = str(num_runs).rjust(3)
base = '(base)' if is_first and num_runs > 0 else ' '
if num_runs > 0:
is_first = False
t = format_seconds(seconds)
p = plural(num_runs)
if t:
p = f'{p}, '.ljust(3)
lines.append(f' * {sha[:10]} {base} {prefix} report{p}{t}')
lines.extend([' |', ' :'])
return unlines(lines)
def case_delta(case: CaseDiff) -> Stat:
was = case['was']
now = case['now']
return recenter(
was[0] if was else zero_stat(),
now[0] if now else 0,
)
def display_final_stat(stat: Stat) -> str:
center = stat['center']
spread = stat['spread']
displayed = display_stat(
{'center': abs(center), 'spread': spread},
((4, 2), (3, 2)),
)
if center < 0:
sign = '-'
elif center > 0:
sign = '+'
else:
sign = ' '
return f'{sign}{displayed}'.rstrip()
def summary_line(message: str, d: DefaultDict[str, List[CaseDiff]]) -> str:
all_cases = [c for cs in d.values() for c in cs]
tests = len(all_cases)
suites = len(d)
sp = f'{plural(suites)})'.ljust(2)
tp = f'{plural(tests)},'.ljust(2)
# there might be a bug calculating this stdev, not sure
stat = sum_normals(case_delta(c) for c in all_cases)
return ''.join([
f'{message} (across {suites:>4} suite{sp}',
f'{tests:>6} test{tp}',
f' totaling {display_final_stat(stat)}',
])
def summary(analysis: List[SuiteDiff]) -> str:
removed_tests: DefaultDict[str, List[CaseDiff]] = defaultdict(list)
modified_tests: DefaultDict[str, List[CaseDiff]] = defaultdict(list)
added_tests: DefaultDict[str, List[CaseDiff]] = defaultdict(list)
for diff in analysis:
# the use of 'margin' here is not the most elegant
name = diff['name']
margin = diff['margin']
cases = diff['cases']
if margin == '-':
removed_tests[name] += cases
elif margin == '+':
added_tests[name] += cases
else:
removed = list(filter(lambda c: c['margin'] == '-', cases))
added = list(filter(lambda c: c['margin'] == '+', cases))
modified = list(filter(lambda c: c['margin'] == '!', cases))
if removed:
removed_tests[name] += removed
if added:
added_tests[name] += added
if modified:
modified_tests[name] += modified
return unlines([
summary_line('Removed ', removed_tests),
summary_line('Modified', modified_tests),
summary_line('Added ', added_tests),
])
def regression_info(
*,
head_sha: Commit,
head_report: Report,
base_reports: Dict[Commit, List[Report]],
job_name: str,
on_master: bool,
ancestry_path: int,
other_ancestors: int,
) -> str:
"""
Return a human-readable report describing any test time regressions.
The head_sha and head_report args give info about the current commit
and its test times. Since Python dicts maintain insertion order
(guaranteed as part of the language spec since 3.7), the
base_reports argument must list the head's several most recent
master commits, from newest to oldest (so the merge-base is
list(base_reports)[0]).
"""
simpler_head = simplify(head_report)
simpler_base: Dict[Commit, List[SimplerReport]] = {}
for commit, reports in base_reports.items():
simpler_base[commit] = [simplify(r) for r in reports]
analysis = analyze(
head_report=simpler_head,
base_reports=simpler_base,
)
return '\n'.join([
unlines([
'----- Historic stats comparison result ------',
'',
f' job: {job_name}',
f' commit: {head_sha}',
]),
anomalies(analysis),
graph(
head_sha=head_sha,
head_seconds=head_report['total_seconds'],
base_seconds={
c: [r['total_seconds'] for r in rs]
for c, rs in base_reports.items()
},
on_master=on_master,
ancestry_path=ancestry_path,
other_ancestors=other_ancestors,
),
summary(analysis),
])
class TestCase:
def __init__(self, dom: Any) -> None:
self.class_name = str(dom.attributes['classname'].value)
self.name = str(dom.attributes['name'].value)
self.time = float(dom.attributes['time'].value)
self.errored = len(dom.getElementsByTagName('error')) > 0
self.failed = len(dom.getElementsByTagName('failure')) > 0
self.skipped = len(dom.getElementsByTagName('skipped')) > 0
class TestSuite:
def __init__(self, name: str) -> None:
self.name = name
self.test_cases: List[TestCase] = []
self.failed_count = 0
self.skipped_count = 0
self.errored_count = 0
self.total_time = 0.0
def __repr__(self) -> str:
rc = f'{self.name} run_time: {self.total_time:.2f} tests: {len(self.test_cases)}'
if self.skipped_count > 0:
rc += f' skipped: {self.skipped_count}'
return f'TestSuite({rc})'
def append(self, test_case: TestCase) -> None:
self.test_cases.append(test_case)
self.total_time += test_case.time
self.failed_count += 1 if test_case.failed else 0
self.skipped_count += 1 if test_case.skipped else 0
self.errored_count += 1 if test_case.errored else 0
def print_report(self, num_longest: int = 3) -> None:
sorted_tests = sorted(self.test_cases, key=lambda x: x.time)
test_count = len(sorted_tests)
print(f"class {self.name}:")
print(f" tests: {test_count} failed: {self.failed_count} skipped: {self.skipped_count} errored: {self.errored_count}")
print(f" run_time: {self.total_time:.2f} seconds")
print(f" avg_time: {self.total_time/test_count:.2f} seconds")
if test_count >= 2:
print(f" median_time: {statistics.median(x.time for x in sorted_tests):.2f} seconds")
sorted_tests = sorted_tests[-num_longest:]
print(f" {len(sorted_tests)} longest tests:")
for test in reversed(sorted_tests):
print(f" {test.name} time: {test.time:.2f} seconds")
print("")
def parse_report(path: str) -> Iterator[TestCase]:
dom = minidom.parse(path)
for test_case in dom.getElementsByTagName('testcase'):
yield TestCase(test_case)
def parse_reports(folder: str) -> Dict[str, TestSuite]:
reports = glob(os.path.join(folder, '**', '*.xml'), recursive=True)
tests_by_class = dict()
for report in reports:
for test_case in parse_report(report):
class_name = test_case.class_name
if class_name not in tests_by_class:
tests_by_class[class_name] = TestSuite(class_name)
tests_by_class[class_name].append(test_case)
return tests_by_class
def build_info() -> ReportMeta:
return {
"build_pr": os.environ.get("CIRCLE_PR_NUMBER", ""),
"build_tag": os.environ.get("CIRCLE_TAG", ""),
"build_sha1": os.environ.get("CIRCLE_SHA1", ""),
"build_branch": os.environ.get("CIRCLE_BRANCH", ""),
"build_job": os.environ.get("CIRCLE_JOB", ""),
"build_workflow_id": os.environ.get("CIRCLE_WORKFLOW_ID", ""),
}
def build_message(test_case: TestCase) -> Dict[str, Dict[str, Any]]:
return {
"normal": {
**build_info(),
"test_suite_name": test_case.class_name,
"test_case_name": test_case.name,
},
"int": {
"time": int(time.time()),
"test_total_count": 1,
"test_total_time": int(test_case.time * 1000),
"test_failed_count": 1 if test_case.failed > 0 else 0,
"test_skipped_count": 1 if test_case.skipped > 0 else 0,
"test_errored_count": 1 if test_case.errored > 0 else 0,
},
}
def send_report_to_scribe(reports: Dict[str, TestSuite]) -> None:
access_token = os.environ.get("SCRIBE_GRAPHQL_ACCESS_TOKEN")
if not access_token:
print("No scribe access token provided, skip sending report!")
return
print("Scribe access token provided, sending report...")
url = "https://graph.facebook.com/scribe_logs"
r = requests.post(
url,
data={
"access_token": access_token,
"logs": json.dumps(
[
{
"category": "perfpipe_pytorch_test_times",
"message": json.dumps(build_message(test_case)),
"line_escape": False,
}
for name in sorted(reports.keys())
for test_case in reports[name].test_cases
]
),
},
)
r.raise_for_status()
def assemble_s3_object(
reports: Dict[str, TestSuite],
*,
total_seconds: float,
) -> Report:
return {
**build_info(), # type: ignore[misc]
'total_seconds': total_seconds,
'suites': {
name: {
'total_seconds': suite.total_time,
'cases': [
{
'name': case.name,
'seconds': case.time,
'errored': case.errored,
'failed': case.failed,
'skipped': case.skipped,
}
for case in suite.test_cases
],
}
for name, suite in reports.items()
}
}
def send_report_to_s3(head_report: Report) -> None:
job = os.environ.get('CIRCLE_JOB')
sha1 = os.environ.get('CIRCLE_SHA1')
branch = os.environ.get('CIRCLE_BRANCH', '')
if branch not in ['master', 'nightly'] and not branch.startswith("release/"):
print("S3 upload only enabled on master, nightly and release branches.")
print(f"skipping test report on branch: {branch}")
return
now = datetime.datetime.utcnow().isoformat()
key = f'test_time/{sha1}/{job}/{now}Z.json.bz2' # Z meaning UTC
s3 = boto3.resource('s3')
obj = s3.Object('ossci-metrics', key)
# use bz2 because the results are smaller than gzip, and the
# compression time penalty we pay is only about half a second for
# input files of a few megabytes in size like these JSON files, and
# because for some reason zlib doesn't seem to play nice with the
# gunzip command whereas Python's bz2 does work with bzip2
obj.put(Body=bz2.compress(json.dumps(head_report).encode()))
def print_regressions(head_report: Report, *, num_prev_commits: int) -> None:
sha1 = os.environ.get("CIRCLE_SHA1", "HEAD")
base = subprocess.check_output(
["git", "merge-base", sha1, "origin/master"],
encoding="ascii",
).strip()
count_spec = f"{base}..{sha1}"
intermediate_commits = int(subprocess.check_output(
["git", "rev-list", "--count", count_spec],
encoding="ascii"
))
ancestry_path = int(subprocess.check_output(
["git", "rev-list", "--ancestry-path", "--count", count_spec],
encoding="ascii",
))
# if current commit is already on master, we need to exclude it from
# this history; otherwise we include the merge-base
commits = subprocess.check_output(
["git", "rev-list", f"--max-count={num_prev_commits+1}", base],
encoding="ascii",
).splitlines()
on_master = False
if base == sha1:
on_master = True
commits = commits[1:]
else:
commits = commits[:-1]
job = os.environ.get("CIRCLE_JOB", "")
s3 = boto3.resource("s3")
bucket = s3.Bucket(name="ossci-metrics")
index = {}
for commit in commits:
summaries = bucket.objects.filter(Prefix=f"test_time/{commit}/{job}/")
index[commit] = list(summaries)
objects: Dict[Commit, List[Report]] = {}
# should we do these in parallel?
for commit, summaries in index.items():
objects[commit] = []
for summary in summaries:
binary = summary.get()["Body"].read()
string = bz2.decompress(binary).decode("utf-8")
objects[commit].append(json.loads(string))
print()
print(regression_info(
head_sha=sha1,
head_report=head_report,
base_reports=objects,
job_name=job,
on_master=on_master,
ancestry_path=ancestry_path - 1,
other_ancestors=intermediate_commits - ancestry_path,
), end="")
def positive_integer(value: str) -> float:
parsed = int(value)
if parsed < 1:
raise argparse.ArgumentTypeError(f"{value} is not a natural number")
return parsed
def positive_float(value: str) -> float:
parsed = float(value)
if parsed <= 0.0:
raise argparse.ArgumentTypeError(f"{value} is not a positive rational number")
return parsed
if __name__ == '__main__':
import argparse
import sys
parser = argparse.ArgumentParser(
"Print statistics from test XML output.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--longest-of-class",
type=positive_integer,
default=3,
metavar="N",
help="how many longest tests to show for each class",
)
parser.add_argument(
"--class-print-threshold",
type=positive_float,
default=1.0,
metavar="N",
help="Minimal total time to warrant class report",
)
parser.add_argument(
"--longest-of-run",
type=positive_integer,
default=10,
metavar="N",
help="how many longest tests to show from the entire run",
)
if HAVE_BOTO3:
parser.add_argument(
"--upload-to-s3",
action="store_true",
help="upload test time to S3 bucket",
)
parser.add_argument(
"--compare-with-s3",
action="store_true",
help="download test times for base commits and compare",
)
parser.add_argument(
"--num-prev-commits",
type=positive_integer,
default=10,
metavar="N",
help="how many previous commits to compare test times with",
)
parser.add_argument(
"--use-json",
metavar="FILE.json",
help="compare S3 with JSON file, instead of the test report folder",
)
parser.add_argument(
"folder",
help="test report folder",
)
args = parser.parse_args()
reports = parse_reports(args.folder)
if len(reports) == 0:
print(f"No test reports found in {args.folder}")
sys.exit(0)
send_report_to_scribe(reports)
longest_tests = []
total_time = 0.0
for name in sorted(reports.keys()):
test_suite = reports[name]
if test_suite.total_time >= args.class_print_threshold:
test_suite.print_report(args.longest_of_class)
total_time += test_suite.total_time
longest_tests.extend(test_suite.test_cases)
longest_tests = sorted(longest_tests, key=lambda x: x.time)[-args.longest_of_run:]
obj = assemble_s3_object(reports, total_seconds=total_time)
if args.upload_to_s3:
send_report_to_s3(obj)
print(f"Total runtime is {datetime.timedelta(seconds=int(total_time))}")
print(f"{len(longest_tests)} longest tests of entire run:")
for test_case in reversed(longest_tests):
print(f" {test_case.class_name}.{test_case.name} time: {test_case.time:.2f} seconds")
if args.compare_with_s3:
head_json = obj
if args.use_json:
head_json = json.loads(Path(args.use_json).read_text())
print_regressions(head_json, num_prev_commits=args.num_prev_commits)