You can trigger ciflow tags on main branch commits, so we should be more conservative when checking to see if a workflow is a PR/on the main branch.
`get_pr_number` checks for the PR number based on the PR_NUMBER env var or a tag of the form `ciflow/workflow/pr number`
If we fail to find something like this, then assume it is on the main branch
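A rough sketch of the lookup described above. This is not the actual `get_pr_number` code: the name `get_pr_number_sketch`, reading the tag from `GITHUB_REF`, and the `refs/tags/ciflow/<workflow>/<pr number>` layout are assumptions for illustration. Only the PR_NUMBER env var and the `ciflow/workflow/pr number` tag form come from the description.

```python
import re
from typing import Mapping, Optional


def get_pr_number_sketch(env: Mapping[str, str]) -> Optional[int]:
    """Return the PR number, or None to mean "assume main branch"."""
    # 1. Prefer the explicit PR_NUMBER env var when it holds a number.
    pr_number = env.get("PR_NUMBER", "")
    if pr_number.isdigit():
        return int(pr_number)

    # 2. Otherwise look for a tag of the form ciflow/<workflow>/<pr number>.
    #    Assumption: the tag is exposed as refs/tags/... in GITHUB_REF.
    ref = env.get("GITHUB_REF", "")
    match = re.fullmatch(r"refs/tags/ciflow/[^/]+/(\d+)", ref)
    if match:
        return int(match.group(1))

    # 3. Nothing matched: be conservative and treat it as the main branch.
    return None
```

Passing the environment in as a mapping (for example `os.environ`) keeps the sketch easy to exercise without touching the real process environment.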
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125485
Approved by: https://github.com/huydhn
yolo
Also
* Ensure that at least 1 test always gets run (`//` does truncation which results in 0 if you have too few tests discovered)
* Don't run test removal on slow tests - I'm not touching that yet
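The truncation issue in the first bullet can be illustrated with a small sketch. The helper name `num_tests_to_keep` and the percentage parameter are hypothetical, not the code in this PR; the point is only that `//` rounds down to 0 for small inputs and needs a floor of 1.

```python
def num_tests_to_keep(num_discovered: int, keep_percent: int) -> int:
    """How many of the discovered tests to keep (hypothetical helper)."""
    if num_discovered == 0:
        # Nothing was discovered, so there is nothing to run.
        return 0
    # Floor division truncates: with 3 tests and 25%, 3 * 25 // 100 == 0.
    truncated = num_discovered * keep_percent // 100
    # Clamp so that at least 1 test always gets run.
    return max(1, truncated)
```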
I am avoiding everything other than pull + trunk workflows, so I am not doing this on Windows CUDA, which runs on periodic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125049
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
Test the generic torch.Stream/Event with a fake device guard and hooks. Since we added a fake device backend, it is mutually exclusive with other backends. Tests will be skipped if TEST_CUDA or TEST_ROCM is true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
Add serial marker for individual tests so the test file can be removed from the ci serial list
Run serial marked tests first in serial
Run all other tests afterwards in parallel
Slowly reduce list and mark individual tests as serial instead
Hope # of serial tests is small so sharding evenness doesn't get too messed up
Hopefully can do 3 procs for sm86 and cpu?
serial no longer looks like a real word to me
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085
Approved by: https://github.com/seemethere, https://github.com/malfet
Our prior approach to epilogue fusion was to select a choice from a set of Triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in the following ways:
- We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster
- We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing.
In this PR we wait to select either the Triton Template or Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice we finalize the MultiTemplateBuffer as a specific template. If no fusion occurs we'll finalize the MultiTemplateBuffer after fusion.
Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogues are worth fusing. We could potentially defer choosing a template in this case in a follow-up, at the expense of compile time.
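The decision rule described above can be sketched in plain Python. This is only an illustration under assumed inputs, not Inductor's implementation: the function `pick_fused_template`, the choice names, and the timing dictionaries are all hypothetical stand-ins for real benchmark results.

```python
from typing import Dict, Optional


def pick_fused_template(
    unfused_ms: Dict[str, float],  # choice name -> kernel-only time
    epilogue_ms: float,            # time of the epilogue run as a separate kernel
    fused_ms: Dict[str, float],    # Triton template name -> fused template + epilogue time
) -> Optional[str]:
    """Return the template to finalize as, or None if fusion does not pay off."""
    if not fused_ms:
        return None
    # Baseline: fastest unfused choice followed by a separate epilogue kernel.
    unfused_total = min(unfused_ms.values()) + epilogue_ms
    # Best fused candidate, benchmarked together with its epilogue.
    best_name, best_fused = min(fused_ms.items(), key=lambda kv: kv[1])
    # Fuse only when the fused kernel actually beats the unfused baseline,
    # which guards against cases such as register spilling.
    return best_name if best_fused < unfused_total else None
```

A `None` result corresponds to the case where no fusion occurs and the buffer is finalized after fusion with the fastest unfused choice.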
Gives a 4% HF training win and a 10% TIMM inference win. Increases compilation time, which I will be trying to address more in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
Make a test that fails on purpose to trigger retries. Check the opposite of success (that env vars exist)
It's a bit hacky because I want it to fail on the normal flow in order to trigger reruns but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
Give TD its own job so that each shard can get the results from this one job artifact; they will always be in sync with each other and there is no longer any need to worry about consistency issues
* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
* Cannot do cpp test discovery before building pytorch
* Move TD calculation to own file that will create a json file with the final results
* TD is now job/build env agnostic
* TD will rank all tests, including those that test jobs may not want to run (ex it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, test_decomp, and other really long running test files are not getting cut, which makes the top 25% take a really long time (still 90+ min)
The original plan was to test on rocm but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min, and crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
Convert from a list/bucket based TD system to just a numbers based TD system. Looks like a massive change but a decent amount of it is tests and removing code.
Main file of interest is interface.py, which Github is collapsing by default due to size
The test files pretty much got rewritten entirely since a lot of the old tests are no longer relevant.
Other notable changes:
* Use Frozenset to make TestRun hashable
* Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests
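A minimal sketch of why a frozenset helps with hashability. The class `SketchTestRun` and its fields are hypothetical and much simpler than the real TestRun; it only shows that an immutable set field lets instances be hashed, compared, and deduplicated.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SketchTestRun:
    """Hypothetical stand-in for a test run: a file plus selected test classes."""

    test_file: str
    # frozenset is immutable and hashable; a plain set here would make
    # hash(instance) raise TypeError.
    included: FrozenSet[str] = frozenset()


# Order of the included classes does not matter, so these two are equal,
# hash the same, and collapse to one entry in a set.
run_a = SketchTestRun("test_ops", frozenset({"TestA", "TestB"}))
run_b = SketchTestRun("test_ops", frozenset({"TestB", "TestA"}))
unique_runs = {run_a, run_b}
```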
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901
Approved by: https://github.com/osalpekar, https://github.com/huydhn
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards
Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3 min, 3 min, 2 min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
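The arithmetic in the example above, written out as a sketch. The helper `serial_shard_plan` is hypothetical and only reproduces the numbers; the real sharding code works on per-test times rather than totals.

```python
import math
from typing import Tuple


def serial_shard_plan(
    serial_min: float, parallel_min: float, procs: int, num_shards: int
) -> Tuple[float, float, int]:
    """Return (total minutes, minutes per shard, shards needed for serial tests)."""
    # Parallel tests are split across procs, serial tests are not.
    total = serial_min + parallel_min / procs
    # Even split of the total across machines.
    per_shard = total / num_shards
    # Pack serial tests onto as few shards as possible.
    serial_shards = math.ceil(serial_min / per_shard)
    return total, per_shard, serial_shards


# 8 min serial, 20 min parallel, 2 procs per machine, 6 machines
plan = serial_shard_plan(8, 20, 2, 6)
```

With these inputs the serial tests need 3 shards, and the third only holds 2 minutes of serial work, which is why it is shared with parallel tests.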
Move serial tests to run first
If I want to move to a purely numbers based sharding, this ensures that parallel tests are run with parallel tests as much as possible instead of interleaving serial + parallel tests, which decreases effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.
See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
env=None (which is the default) inherits the env from the calling process. Explicitly set the env to the calling process env so that things can be added to it later
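A small sketch of the pattern, assuming a hypothetical helper `run_with_extra_env`. Copying `os.environ` gives the child the same environment it would get with `env=None`, while leaving a dict that extra variables can be added to.

```python
import os
import subprocess
import sys
from typing import Dict, List


def run_with_extra_env(
    cmd: List[str], extra: Dict[str, str]
) -> "subprocess.CompletedProcess[str]":
    """Run cmd with the calling process env plus some extra variables."""
    # Explicit copy of the calling process env (what env=None would inherit).
    env = dict(os.environ)
    # Now that env is a real dict, things can be added to it.
    env.update(extra)
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# The child sees both the inherited variables and the added one.
result = run_with_extra_env(
    [sys.executable, "-c", "import os; print(os.environ['SKETCH_EXTRA_VAR'])"],
    {"SKETCH_EXTRA_VAR": "1"},
)
```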
Tested in: e7b4d8ec88
Checked that test-reports (which depend on the CI env var) get made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251
Approved by: https://github.com/huydhn
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I am keeping the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` don't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests are probably not going to happen. I have never seen a flaky C++ test that needs to be disabled before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
Every day I move closer and closer to just using numbers
* number of heuristics that marked it as high, probable, low, none etc
* order of heuristics in the `__init__` file as well as how the heuristic ordered the tests
* take the historical edited files and profiling heuristics out of trial mode
* briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
Moves test discovery into a file that doesn't have import torch so test listing can be done without having torch installed.
Helpful when you don't have torch installed (aka me when I'm feeling lazy)
I want to move TD into its own job that doesn't need to wait for build to finish, so this is part of that.
The first commit is nothing more than a copy paste of the selected functions/vars into a new file; the second commit has various changes that should be checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
when doing print(f.read().decode etc etc) it prints an extra new line, so manually splitlines and strip to see if that helps
My guess is windows line ending differences
Also always save log file regardless of success or failure
See 476b81a9bf for what it looks like now
Swapped to opening in text mode instead of binary, seems to be ok now.
42483193bf024983060a234dc0262f4840aef4b8 for example
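A sketch of the difference between the two read modes, assuming the extra blank lines do come from Windows line endings (which is only a guess above). The helper `read_both_ways` is hypothetical. Reading bytes and decoding keeps `\r\n` as-is, while Python's default text mode translates it to `\n`.

```python
import os
import tempfile
from typing import Tuple


def read_both_ways(raw: bytes) -> Tuple[str, str]:
    """Write raw to a temp file, then read it back in binary and in text mode."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(raw)
        path = f.name
    try:
        # Binary read + decode: carriage returns survive untouched.
        with open(path, "rb") as f:
            as_binary = f.read().decode("utf-8")
        # Text mode (default newline=None): "\r\n" is translated to "\n".
        with open(path, "r", encoding="utf-8") as f:
            as_text = f.read()
    finally:
        os.remove(path)
    return as_binary, as_text


binary_view, text_view = read_both_ways(b"line 1\r\nline 2\r\n")
```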
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
Test [ci-verbose-test-logs] (this worked: the test logs print while running, are interleaved, and are really long)
Settings for no timeout (step timeout still applies, only gets rid of ~30 min timeout for shard of test file) and no piping logs/extra verbose test logs (good for debugging deadlocks but results in very long and possibly interleaved logs).
Also allows these to be set via pr body if the label name is in brackets ex [label name] or the test above.
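The check described above could look roughly like this. The function `setting_enabled` and its arguments are hypothetical, not the code in this PR; it only shows the "label, or label name in brackets in the PR body" rule.

```python
from typing import Iterable


def setting_enabled(setting: str, labels: Iterable[str], pr_body: str) -> bool:
    """True if the setting is on via a PR label or via [setting] in the PR body."""
    # Either the label is applied to the PR...
    if setting in set(labels):
        return True
    # ...or the label name appears in brackets somewhere in the PR body.
    return f"[{setting}]" in pr_body
```

For example, a PR body containing `Test [ci-verbose-test-logs]` would turn on the verbose logs setting even without the label.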
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn