**Background:**
Almost all the tests in `test/test_openreg.py` are designed for `torch_openreg`, so placing these testcases in the test directory is not a good idea. Instead, they should be moved to the `tests` directory under `torch_openreg`, coordinating these tests with their corresponding functional logic.
**How to do:**
So how do we verify the quality of the third-party device integration mechanism?
We will maintain a `test_openreg` entrypoint in `test/run_test.py`.
This entrypoint will install `torch_openreg` and run all the testcases located in `torch_openreg`. As long as all testcases pass, we can guarantee that the out-of-tree backend integration mechanism is available.
**Next:**
We will also improve `torch_openreg's` test coverage in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161917
Approved by: https://github.com/albanD
Automatically replaces split with rsplit when relevant and only performs the split up to the first ( or last value). This allows early return of the split function and improve efficiency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107
Approved by: https://github.com/albanD
3 procs were used for sm86, but we switched to sm89 and the check failed so it switched back to 2
sm90 is H100, but idk what unittests we have running there, but I assume they also have a lot of memory
They use larger runners, which have more GPU memory, so its usually ok. I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (cuda context maybe 1GB)
I've applied skips to the ones that OOMed
Time decreases from ~2.7hr per test job -> ~2hr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
3 procs were used for sm86, but we switched to sm89 and the check failed so it switched back to 2
sm90 is H100, but idk what unittests we have running there, but I assume they also have a lot of memory
They use larger runners, which have more GPU memory, so its usually ok. I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (cuda context maybe 1GB)
I've applied skips to the ones that OOMed
Time decreases from ~2.7hr per test job -> ~2hr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
I guess this is more of an RFC
Goal:
Enable keep going so that we can get information immediately for failures. We want be aware of failures as soon as possible, especially on the main branch, this is so that reverts can happen quickly.
Proposal:
A job with `keep-going` will continue through errors in `python run_test.py`. If a test fails, before it runs the next test, it will upload a fake log that should have enough information in it so that viewing the log will be able to tell you what failed and any stack traces/error logs, and should be able to be parsed by log classifier to get a line.
I am getting the log by concating the test logs in test/test-reports, which is all the text outputted by pytest (unless someone runs with `ci-verbose-test-logs` label). There are obviously many things this won't catch, ex output outside of run_test.py, some output inside of run_test.py, but it should be enough.
After a log finishes, eventually its raw log is uploaded to ossci-raw-job-status s3 bucket and the log classifier will read it to do classification. This means we will have to change log classifier to read from this bucket as well.
I'm thinking just add an input parameter to log classifier like https://github.com/pytorch/test-infra/pull/6723/files
Also upload the temp results to a temp attribute instead of the real one
To overwrite the conclusion on HUD, I'm thinking a lambda that is s3 put trigger on the fake log being put into s3, that does something similar to log classifier where it just mutates the entry 13a990b678/aws/lambda/log-classifier/src/network.rs (L85) to add a new field like "will_fail": true, and also triggers the log classifier to run
Then we change HUD/ClickHouse to point the raw log url to the alternate place, the new "will_fail" field as the conclusion, and the temp log classifier result if needed
Why always write to temp attribution/column? I am unsure about overwriting the real results with fake ones
Pros:
Not many changes outside of HUD/UI
Cons:
Lots of moving parts, lots of temp fields that will require adjustment for queries, temp fields never really get deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371
Approved by: https://github.com/malfet
Manually map test_cpp_extensions_aot_ninja to files in test/cpp_extensions since test_cpp_extensions_aot_ninja isn't an actual file you can edit, but a wrapper for files in test/cpp_extensions.
Idk if this is a good idea, feels very manual. Maybe it would be better to classify this the same as any other TD failure where TD simply can't figure out the tests it needs to run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148992
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/janeyx99
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
> f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
* Add pytorchbot to list of approvers for file
* Add labels to the auto created PR
The auto generated PR is currently not merging due to some failing tests on slow workflow that were supposed to be moved back to normal
idk if this has much value, clearly we've been managing without the update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135390
Approved by: https://github.com/ZainRizvi
Move the slow test json to be in the pytorch/pytorch repo and make a job that will update it weekly. The job uses the same environment as the commit hash. It uses similar code to the hash updates, but the hash update contains a lot of code that is specific to the hash update, so I chose to pick out the parts that are relevant
Remove references to the old file and set up testing to read from the new file instead
The old update cadence was every day, the new one is every week
The auto slow test infra + the lack of pinning between pytorch and test-infra makes it really hard to tell if a test started failing because of a change or because of the slow test json changing. While this can have benefits, like disable test issues being effective everywhere immediately, it can also be very confusing, especially since we don't have the same insight into slow tests like we do for disable issues.
Example PR made: https://github.com/pytorch/pytorch/pull/132383 (with all the changes from this PR because it was working on top of this)
We should just get rid of this at some point in favor of the slowTest decorator, but there are some tests that take 5+ minutes to run and I don't want to track them down right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132379
Approved by: https://github.com/huydhn
test_public_bindings should be run on anything that changes the public API - need to figure out in the future what is part of the public api, currently I'm using anything in torch/
flex_attention should be run on anything involving autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
test_public_bindings should be run on anything that changes the public API - need to figure out in the future what is part of the public api, currently I'm using anything in torch/
flex_attention should be run on anything involving autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet