In https://github.com/pytorch/pytorch/pull/110362, the failure was flaky, but the merge bot treated it as an actual failure. This is a regression after https://github.com/pytorch/test-infra/pull/4604, where the name returned by Dr.CI now includes the workflow name. For example, the name is `trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)` in the JSON response:
```
{"FAILED": [], "FLAKY": [{"workflowId": 6372581477, "id": 17297638807, "name": "trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)", "jobName": "macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)", "conclusion": "failure", "completed_at": "2023-10-01T22:18:28Z", "html_url": "https://github.com/pytorch/pytorch/actions/runs/6372581477/job/17297638807", "head_branch": "ciflow/trunk/110362", "pr_number": 110362, "head_sha": "03f51e36dedf234931006d1db61677b229c9a119", "failure_captures": ["Failure: There is only 4671284KB free space left in /, which is less than the minimum requirement of"], "failure_line": "Failure: There is only 4671284KB free space left in /, which is less than the minimum requirement of 6291456KB for macOS", "time": "2023-10-01T22:17:53.847751Z"}], "BROKEN_TRUNK": [], "UNSTABLE": []}
```
I updated the merge bot to handle this better by considering the workflow name, the job name, and the combined full name.
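A minimal sketch of what such matching could look like; the helper name and signature are invented for illustration, and only the `name`/`jobName` fields shown in the JSON response above are assumed:

```python
def matches_flaky_job(checkrun_name: str, flaky_job: dict, workflow_name: str = "") -> bool:
    # Dr.CI now returns "workflow / job" in "name" and the bare job in "jobName",
    # so accept a match against either form (and an explicit workflow-prefixed one)
    candidates = {
        flaky_job["jobName"],  # bare job name
        flaky_job["name"],     # "workflow / job" combined full name
    }
    if workflow_name:
        candidates.add(f"{workflow_name} / {flaky_job['jobName']}")
    return checkrun_name in candidates
```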
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110661
Approved by: https://github.com/clee2000
After https://github.com/pytorch/test-infra/pull/4589, we can now query Dr.CI to get the list of flaky failures there. This change queries the Dr.CI API endpoint and checks whether a failure is flaky using an `is_flaky` function.
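A sketch of what the check could look like, assuming a hypothetical endpoint URL and a JSON response shaped like the `{"FAILED": [...], "FLAKY": [...], ...}` payload shown elsewhere in these notes; function names and the auth header are illustrative:

```python
import json
import urllib.request


def fetch_drci_failures(pr_num: int, bot_key: str, endpoint: str) -> dict:
    # Query the Dr.CI endpoint for this PR; the key comes from DRCI_BOT_KEY
    req = urllib.request.Request(
        f"{endpoint}?prNumber={pr_num}",
        headers={"Authorization": bot_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def is_flaky(job_id: int, drci_failures: dict) -> bool:
    # A failed job counts as flaky if its ID appears in the FLAKY bucket
    return any(job["id"] == job_id for job in drci_failures.get("FLAKY", []))
```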
Because the change is relatively large, I'm breaking it down to several smaller PRs in this order:
* [x] This PR queries Dr.CI and adds `is_flaky` check
* [ ] Clean up the flaky rules logic because it has already been implemented on Dr. CI
* [ ] Clean up the broken trunk logic for the same reason
### Testing
* Create a new `drci_mocks.json` file to capture the JSON response from the Dr.CI API endpoint. The API requires `DRCI_BOT_KEY`.
* `pytest -v test_trymerge.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110054
Approved by: https://github.com/clee2000
I noticed a curious case on https://github.com/pytorch/pytorch/pull/107508 where there was one broken trunk failure and the PR was merged with `merge -ic`. Because the failure had been classified as unrelated, I expected to see a no-op force merge here. However, it showed up as a force merge with a failure.

The record on Rockset reveals https://github.com/pytorch/pytorch/pull/107508 has:
* 0 broken trunk checks (unexpected; this should be 1, as Dr. CI clearly says so)
* 1 ignore current check (unexpected; this should be 0, and the failure should be counted as broken trunk instead)
* 3 unstable ROCm jobs (expected)
It turns out that ignore current takes precedence over the flaky and broken trunk classifications. This might have been the expectation in the past, but I think that's not the case now. The bot should be consistent with what is shown on Dr. CI. The change here makes the flaky, unstable, and broken trunk classifications take precedence over ignore current. Basically, we only need to ignore new or unrecognized failures that have not yet been classified.
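The new precedence can be sketched as follows; the function name, bucket names, and return values are illustrative, not the actual trymerge implementation:

```python
def classify_failure(name: str, drci: dict, ignore_current_checks: list) -> str:
    # Dr.CI classifications win over `-ic`: a failure already known to be
    # flaky, broken trunk, or unstable keeps that classification
    for bucket in ("FLAKY", "BROKEN_TRUNK", "UNSTABLE"):
        if any(job["name"] == name for job in drci.get(bucket, [])):
            return bucket
    # ignore current only covers new or unrecognized failures
    if name in ignore_current_checks:
        return "IGNORE_CURRENT"
    return "FAILED"
```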
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107761
Approved by: https://github.com/clee2000
At the moment, we only record the list of pending and failed checks on Rockset merge records. This is enough to compute the force merge KPI(s), but isn't enough for more in-depth analysis of what happened at the time of the merge:
* If the number of `ok_failed_checks` is less than `ok_failed_checks_threshold`, the list of `failed_checks` would be empty (as expected). So Rockset would only record an empty list.
* We support retries in PRs, so the classifications on Dr.CI could differ from what the developer observed at the time of the merge if a retry completed successfully
### Testing
`python .github/scripts/trymerge.py --comment-id 1654010315 106095 --dry-run` (need to comment out some of the code to actually write a test record to Rockset), then manually verify it with
```
SELECT
*
FROM
commons.merges
WHERE
pr_num = 106095
```
to see that `ignore_current_checks`, `broken_trunk_checks`, `flaky_checks`, and `unstable_checks` show up correctly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106162
Approved by: https://github.com/clee2000
No need to wait if the job classification is unstable, as it would be ignored anyway. This avoids waiting for scarce resources like ROCm runners, which are also frequently in unstable mode (there is a ROCm queue at the moment)
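The idea can be sketched in a few lines; the helper name is invented and the real trymerge logic may track richer check objects than bare names:

```python
def checks_to_wait_for(pending_checks: list, unstable_names: set) -> list:
    # Pending checks classified as unstable are ignored at merge time,
    # so there is no point blocking the merge on them
    return [name for name in pending_checks if name not in unstable_names]
```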
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106095
Approved by: https://github.com/clee2000
Given the number of unstable jobs at the moment (ROCm, distributed), having a limit of 3 for ignorable failures is too low. When I manually looked into force merges, I could find many examples like https://github.com/pytorch/pytorch/pull/105848 where there are 3+ unrelated failures. As the classification is getting more accurate, we can aim to ignore all flaky and broken trunk failures.
* Default `ok_failed_checks_threshold` to `-1` to ignore all unrelated failures
* Increase `IGNORABLE_FAILED_CHECKS_THESHOLD` to 10. The only concern I have before setting it to `-1` is the fog-of-war situation when a SEV occurs. So 10 is a good middle ground before we agree to set it to `-1`
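The threshold semantics above can be sketched as follows; the function name is illustrative, but the `-1` convention matches the description:

```python
def failures_are_ignorable(ok_failed_checks: list, threshold: int) -> bool:
    # A negative threshold (e.g. the new default of -1) means: ignore any
    # number of unrelated (flaky / broken trunk / unstable) failures
    if threshold < 0:
        return True
    # Otherwise the count of ignorable failures must stay within the limit
    return len(ok_failed_checks) <= threshold
```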
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105998
Approved by: https://github.com/clee2000
During a revert, use the title of the "Meta Internal-Only Changes Check" to determine whether an internal diff is associated with the PR. When a PR is merged/closed, the "Meta Internal-Only Changes Check" status is always success, but the title message can differ:
- "There is no internal Diff connected, this can be merged now" means that there are no internal changes associated with the PR (or it was landed via the GitHub First workflow)
- "The internal Diff has landed, this can be merged now" means that the PR has an associated internal diff, and the OSS and internal reverts must happen in sync using internal tooling (or a revert PR can be authored in OSS)
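The title-based check above could be sketched like this; the two message strings are the ones quoted above, while the function name and error handling are illustrative:

```python
def has_internal_diff(check_title: str) -> bool:
    # Check status is always success, so only the title distinguishes the cases
    if "There is no internal Diff connected" in check_title:
        return False
    if "The internal Diff has landed" in check_title:
        return True
    raise RuntimeError(f"Unrecognized Meta Internal-Only Changes Check title: {check_title}")
```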
Add a regression test for https://github.com/pytorch/pytorch/pull/100652, which originated from an internal diff but was merged as an OSS PR.
Fixes https://github.com/pytorch/pytorch/issues/104232
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104344
Approved by: https://github.com/bigfootjon, https://github.com/huydhn
Per title, after https://github.com/pytorch/pytorch/pull/102426 landed, it makes sense to have a new category for UNSTABLE jobs and handle them accordingly in trymerge.
* The simple approach is to check for `unstable` in the check (job) name. I plan to roll this out first and then see if we need to cover the more complicated, but less common, case of an unstable build job. Specifically, an unstable build job has no `unstable` in its name
* An unstable job is ignored by trymerge. This is the same behavior we have at the moment when a job is moved to unstable: it's completely ignored
* The update to Dr. CI will come later, so that unstable failures will also be hidden like broken trunk or flaky ones
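The simple name-based classification amounts to a substring check; the example job names below are made up for illustration:

```python
def is_unstable(job_name: str) -> bool:
    # e.g. "unstable / win-cpu / test" matches; an unstable *build* job whose
    # name lacks "unstable" would not (the known gap mentioned above)
    return "unstable" in job_name.lower()
```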
### Testing
Leverage the currently broken trunk Windows CPU job and mark Windows CPU jobs as unstable per https://github.com/pytorch/pytorch/issues/102297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102784
Approved by: https://github.com/clee2000
Prevent the error message from becoming a single column of characters
Thanks @clee2000 for explaining how it worked before
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at fef1e25</samp>
> _`reject_reason` fixed_
> _Syntax error caused trouble_
> _Autumn of bugs ends_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101745
Approved by: https://github.com/kit1980, https://github.com/osalpekar
I've noticed that 3-4 functions in trymerge implement similar tail recursion for flaky network retries.
Unify them using a single wrapper in `gitutils.py`
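A sketch of what such a unified wrapper could look like; the real helper in `gitutils.py` may differ in name and signature, and the fallback-value parameter `rc` is an assumption:

```python
from functools import wraps
from typing import Any, Callable


def retries_decorator(rc: Any = None, num_retries: int = 3) -> Callable:
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Retry the flaky call a fixed number of times
            for _ in range(num_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f"Attempt failed with {e}, retrying")
            # Every attempt failed: return the caller-supplied fallback value
            return rc
        return wrapper
    return decorator
```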
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 8d40631</samp>
> _`retries_decorator`_
> _adds resilience to GitHub scripts_
> _autumn of errors_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101227
Approved by: https://github.com/kit1980
During the regular merge process, when the `GitHubPR` object is created, it does not have the `merging` label, and when the label is added later, the existing `GitHubPR` object is not updated either
To fix the problem, call the REST API wrapper `gh_remove_label` directly. The worst case, if the label has already been removed at this point, is that an error is printed to stderr, which is not rendered on HUD anyway
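The underlying call uses GitHub's documented "remove a label from an issue" REST endpoint; the request-builder shape below is a sketch, not the actual `gh_remove_label` signature:

```python
import urllib.request


def build_remove_label_request(org: str, repo: str, pr_num: int, label: str, token: str) -> urllib.request.Request:
    # DELETE /repos/{owner}/{repo}/issues/{issue_number}/labels/{name}
    # (PRs share the issues namespace for labels on GitHub)
    return urllib.request.Request(
        f"https://api.github.com/repos/{org}/{repo}/issues/{pr_num}/labels/{label}",
        method="DELETE",
        headers={"Authorization": f"token {token}"},
    )
```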
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100433
Approved by: https://github.com/PaliC, https://github.com/kit1980
During the regular merge process, the `GitHubPR` and `GitHubRepo` objects are first created in main() and then re-created in `merge()` instead of being passed by reference, which results in making the same GraphQL requests to the repo twice
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ee4e23e</samp>
> _Sing, O Muse, of the skillful coder who refactored_
> _The `merge` function, to accept a `GitHubPR` object,_
> _And thus reduced the calls to the divine API_
> _And the duplication of code, that source of errors._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100434
Approved by: https://github.com/kit1980, https://github.com/PaliC, https://github.com/huydhn, https://github.com/ZainRizvi
Mostly `s/@master/@main/` in numerous `.yml` files.
Keep `master` in `weekly.yml` as it refers to `xla` repo and in `test_trymerge.py` as it refers to a branch PR originates from.
Small QoL improvement so that `add_numbered_label` works more intuitively. Now, if we push different labels, instead of having `[reverted, mergedX2, revertX3, mergedX4, revertedX5, mergedX6]` we have `[reverted, merged, revertX2, mergedX2, revertedX3, mergedX3]`
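A sketch of the more intuitive numbering: the suffix counts occurrences of the *same* base label rather than the total number of labels pushed so far. The helper name and signature are illustrative:

```python
def next_numbered_label(base: str, existing_labels: list) -> str:
    # Count prior pushes of this base label, including its numbered variants
    count = sum(1 for label in existing_labels if label == base or label.startswith(f"{base}X"))
    # First push gets the bare label; later pushes get X2, X3, ...
    return base if count == 0 else f"{base}X{count + 1}"
```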
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98551
Approved by: https://github.com/huydhn
This uploads a record to a new Rockset `merges` collection in the `commons` workspace in the following format:
```
{
"id": comment_id,
"pr_num": pr_num,
"owner": owner,
"project": project,
"pending_checks": pending_checks, # At the time of the merge
"failed_checks": failed_checks, # At the time of the merge
"is_failed": is_failed, # This is set to True if the merge fails to get through for whatever reason
"dry_run": dry_run,
"skip_mandatory_checks": skip_mandatory_checks,
"ignore_current": ignore_current,
"error": error, # The same Exception message that will be shown on PR
}
```
To achieve this, I need to tweak `find_matching_merge_rule` a bit to return the list of pending and failed checks in addition to the matching merge rule. As this function is also used internally, I have confirmed that the internal call doesn't need the return values. Thus, the change is safe to land.
### Testing
* Unit testing
* Dry-run locally `python3 .github/scripts/trymerge.py --comment-id 1478678477 --dry-run 97293` using an older PR. The merge obviously failed, but the record was created successfully on Rockset
```
{
"_id": "52d3152b-ec35-4b5a-91fc-0e7298fc54b5-1",
"_event_time": "2023-03-23T21:10:32.754368Z",
"_meta": null,
"owner": "pytorch",
"is_failed": true,
"id": 1478678477,
"failed_checks": [],
"dry_run": true,
"error": "Command `git -C pytorch cherry-pick -x cc0d2e0fba648bb5deda34a9056f2c4192b22314` returned non-zero exit code 1...",
"ignore_current": false,
"project": "pytorch",
"pr_num": 97293,
"skip_mandatory_checks": false,
"pending_checks": []
}
```
* Dry-run locally with this PR `python3 .github/scripts/trymerge.py --comment-id 1481949104 --dry-run --force 97471` with `--force`
```
{
"_id": "dd7d2580-f6e5-47e7-9441-17df86056c14-1",
"_event_time": "2023-03-23T21:43:53.915911Z",
"_meta": null,
"owner": "pytorch",
"is_failed": true,
"id": 1481949104,
"failed_checks": [],
"dry_run": true,
"error": "PR #97471 has not been reviewed yet",
"ignore_current": false,
"project": "pytorch",
"pr_num": 97471,
"skip_mandatory_checks": true,
"pending_checks": []
}
```
* Dry-run locally with this PR `python3 .github/scripts/trymerge.py --comment-id 1481949104 --dry-run 97471` again with approval rule commented out
```
{
"_id": "5d7de4e3-1af1-4869-a3b7-d1a9dbced6ce-1",
"_event_time": "2023-03-24T00:10:41.914111Z",
"_meta": null,
"is_failed": false,
"id": 1481949104,
"failed_checks": [],
"error": "",
"last_commit_sha": "4657400513f0360a0a4f73d46e1aff0882221687",
"merge_commit_sha": "416bac5b813a181753afade781ae30f4f0843586",
"ignore_current": false,
"pending_checks": [
[
"pull / linux-focal-py3.8-gcc7 / test (default, 1, 3, linux.2xlarge)",
"https://github.com/pytorch/pytorch/actions/runs/4506464828/jobs/7933518379",
12239935788
],
...
[
"trunk / linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 5, 5, linux.4xlarge.nvidia.gpu)",
"https://github.com/pytorch/pytorch/actions/runs/4506465633/jobs/7933621958",
12240067113
],
...
],
"owner": "pytorch",
"skip_mandatory_checks": true,
"author": "Huy Do <huydhn@gmail.com>",
"project": "pytorch",
"merge_base_sha": "a3b30c5025e3381022fa00b127b0d881f4ef66d4",
"pr_num": 97471
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97471
Approved by: https://github.com/clee2000
This has been bugging me for a while: as I'm working on these Python scripts, they are not tracked by the ufmt linter. So I added these scripts to that linter.
```
[[linter]]
code = 'UFMT'
include_patterns = [
'.github/**/*.py',
'test/run_test.py',
```
This change should just work and not break anything, as the ufmt (black + usort) linter is very safe to use for standalone util scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97588
Approved by: https://github.com/kit1980
Remove all references to land checks (rebase on viable/strict in a different branch) since it's no longer used. Adding ciflow/trunk on merge and/or rebasing the entire PR is preferred.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96401
Approved by: https://github.com/huydhn
The file count from the PR info includes only unique files, which is not necessarily the number of file changes (both are technically correct, depending on how you view it)
I'm trying to merge https://github.com/pytorch/pytorch/pull/95233, which makes `.github/ci_commit_pins/triton.txt` a softlink. So the PR includes 2 changes to that file: 1) deleting the file and 2) adding it back as a symlink.
```
[
".ci/docker/build.sh",
".ci/docker/ci_commit_pins/triton.txt",
".ci/docker/common/common_utils.sh",
".ci/docker/common/install_triton.sh",
".ci/docker/requirements-ci.txt",
".ci/docker/ubuntu-cuda/Dockerfile",
".ci/docker/ubuntu/Dockerfile",
".github/ci_commit_pins/triton.txt", <--
".github/ci_commit_pins/triton.txt", <--
".github/workflows/build-triton-wheel.yml"
]
```
Trymerge doesn't like that and rejects the merge due to `Changed file count mismatch` https://github.com/pytorch/pytorch/actions/runs/4295438799/jobs/7485853815. This is because the PRInfo GraphQL result from GitHub only counts 9 of them https://paste.sh/zVsOnWoT#p_3RKX_VMjj-e71vwsTeA01W (search for `changedFiles`). It means that the names are deduplicated, so only unique file names are counted.
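The fix therefore boils down to comparing GitHub's reported count against the deduplicated file list; a minimal sketch, with an invented helper name:

```python
def file_counts_match(pr_files: list, reported_changed_files: int) -> bool:
    # GitHub's `changedFiles` counts unique file names, so deduplicate the
    # listed files (a delete + re-add-as-symlink yields the same path twice)
    return len(set(pr_files)) == reported_changed_files
```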
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95720
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/ZainRizvi