mirror of
https://github.com/pytorch/pytorch.git
synced 2025-10-20 21:14:14 +08:00
I guess this is more of an RFC
Goal:
Enable keep going so that we can get information immediately for failures. We want be aware of failures as soon as possible, especially on the main branch, this is so that reverts can happen quickly.
Proposal:
A job with `keep-going` will continue through errors in `python run_test.py`. If a test fails, before it runs the next test, it will upload a fake log that should have enough information in it so that viewing the log will be able to tell you what failed and any stack traces/error logs, and should be able to be parsed by log classifier to get a line.
I am getting the log by concating the test logs in test/test-reports, which is all the text outputted by pytest (unless someone runs with `ci-verbose-test-logs` label). There are obviously many things this won't catch, ex output outside of run_test.py, some output inside of run_test.py, but it should be enough.
After a log finishes, eventually its raw log is uploaded to ossci-raw-job-status s3 bucket and the log classifier will read it to do classification. This means we will have to change log classifier to read from this bucket as well.
I'm thinking just add an input parameter to log classifier like https://github.com/pytorch/test-infra/pull/6723/files
Also upload the temp results to a temp attribute instead of the real one
To overwrite the conclusion on HUD, I'm thinking a lambda that is s3 put trigger on the fake log being put into s3, that does something similar to log classifier where it just mutates the entry 13a990b678/aws/lambda/log-classifier/src/network.rs (L85)
to add a new field like "will_fail": true, and also triggers the log classifier to run
Then we change HUD/ClickHouse to point the raw log url to the alternate place, the new "will_fail" field as the conclusion, and the temp log classifier result if needed
Why always write to temp attribution/column? I am unsure about overwriting the real results with fake ones
Pros:
Not many changes outside of HUD/UI
Cons:
Lots of moving parts, lots of temp fields that will require adjustment for queries, temp fields never really get deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371
Approved by: https://github.com/malfet