Commit Graph

695 Commits

Author SHA1 Message Date
47182a8f4b Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-14 18:40:23 +00:00
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
8182fce769 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit fbe6f6236e25e27e5968715f824dc8bfb0e37213.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1934589057))
2024-02-08 17:20:39 +00:00
fbe6f6236e Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-08 00:54:16 +00:00
3ed9df36a9 Clean up some obsolete TODOs in run_test and several test files (#119113)
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I am keeping the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` don't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests are probably not going to happen.  I have never seen a flaky C++ test that needs to be disabled before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
2024-02-03 23:54:30 +00:00
3b41793412 Purge redundant module init tests (#119028)
Fixes #118784

This test file is old and redundant; coverage is maintained in `test_modules.py` via the `test_factory_kwargs` set of tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119028
Approved by: https://github.com/zou3519
2024-02-02 20:17:00 +00:00
8b729fb826 [ez] Fix CI log file piping error (#118807)
Fixes https://github.com/pytorch/pytorch/issues/118764

Example log https://github.com/pytorch/pytorch/actions/runs/7737363970/job/21097159160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118807
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-02-02 03:07:56 +00:00
9391af9796 Merging heuristics (#118029)
Every day I move closer and closer to just using numbers

* number of heuristics that marked it as high, probable, low, none etc
* order of heuristics in the `__init__` file as well as how the heuristic ordered the tests
* take the historical edited files and profiling heuristics out of trial mode
* briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
2024-01-31 20:00:10 +00:00
2eefbc02a0 [ez] Discover tests without importing torch (#118574)
Moves test discovery into a file that doesn't have import torch so test listing can be done without having torch installed.

Helpful when you don't have torch installed (aka me when I'm feeling lazy)
I want to move TD into its own job that doesn't need to wait for build to finish, so this is part of that.

The first commit is nothing more than a copy-paste of the selected functions/vars into a new file; the second commit has various changes that should be checked.
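
To illustrate the idea of listing tests without importing torch, here is a minimal sketch that discovers test files purely from the filesystem. The function name `discover_tests`, the `test_*.py` pattern, and the demo directory layout are assumptions made for this example; they are not taken from the actual change.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def discover_tests(test_dir: Path) -> list[str]:
    """List test_*.py files under test_dir as extension-less relative names.

    Only the standard library is used, so this works without torch installed.
    """
    names = []
    for path in test_dir.rglob("test_*.py"):
        rel = path.relative_to(test_dir).with_suffix("")
        names.append(rel.as_posix())
    return sorted(names)


def demo() -> list[str]:
    # Build a throwaway directory tree (hypothetical layout) and list its tests.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "distributed").mkdir()
        (root / "test_ops.py").touch()
        (root / "distributed" / "test_spawn.py").touch()
        (root / "helper.py").touch()  # not a test file, should be ignored
        return discover_tests(root)
```

Because nothing here imports torch, a job that only needs the test list could run it before the build finishes.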
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
2024-01-30 03:02:29 +00:00
84251d1d71 [ez] Windows log printing + save successful test logs (#118124)
When doing print(f.read().decode etc etc), it prints an extra newline, so manually splitlines and strip to see if that helps.

My guess is Windows line ending differences

Also always save log file regardless of success or failure

See 476b81a9bf for what it looks like now

Swapped to opening in text mode instead of binary, seems to be ok now.

42483193bf024983060a234dc0262f4840aef4b8 for example
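
As a rough sketch of the line-ending issue described above: bytes read in binary mode keep their `\r\n` terminators after decoding, which can show up as extra blank lines when printed. Splitting on line boundaries and stripping trailing whitespace avoids that. The helper name `clean_log_lines` is hypothetical and this is not the code from the commit.

```python
from __future__ import annotations


def clean_log_lines(raw: bytes) -> list[str]:
    """Decode raw log bytes and drop line terminators, including \\r\\n."""
    text = raw.decode("utf-8", errors="replace")
    # str.splitlines treats \r\n as a single line break; rstrip removes
    # any stray trailing whitespace left on each line.
    return [line.rstrip() for line in text.splitlines()]


if __name__ == "__main__":
    for line in clean_log_lines(b"ok\r\nFAILED test_x\r\n"):
        print(line)
```

Opening the file in text mode, as the commit ends up doing, lets Python perform the newline translation itself, so no manual splitting is needed.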
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
2024-01-26 21:14:25 +00:00
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked; the test logs print while running, are interleaved, and are really long)

Settings for no timeout (step timeout still applies, only gets rid of ~30 min timeout for shard of test file) and no piping logs/extra verbose test logs (good for debugging deadlocks but results in very long and possibly interleaved logs).

Also allows these to be set via the PR body if the label name is in brackets, e.g. [label name] or the test above.
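
A minimal sketch of how bracketed label names could be picked out of a PR body follows. The function name `settings_from_pr_body` and the regular expression are assumptions for illustration; only the label `ci-verbose-test-logs` comes from the text above, and the second label is a made-up placeholder.

```python
from __future__ import annotations

import re

# "ci-verbose-test-logs" appears in the commit message; the other entry is
# a hypothetical placeholder for any further setting.
KNOWN_SETTINGS = {"ci-verbose-test-logs", "example-other-setting"}


def settings_from_pr_body(pr_body: str, known: set[str] = KNOWN_SETTINGS) -> set[str]:
    """Return the known setting names that appear in [brackets] in pr_body."""
    bracketed = {m.strip() for m in re.findall(r"\[([^\[\]]+)\]", pr_body)}
    # Intersect with the known names so unrelated brackets (for example
    # markdown link text) are ignored.
    return bracketed & known
```

Filtering against a fixed set keeps ordinary markdown such as `[link](url)` from being mistaken for a setting.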

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
364728b27b Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries
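
The rerun budget in the second bullet (3 tries per process across 3 processes, 9 in total) can be modeled with a small loop. This is only a sketch: `run_with_reruns` and its callback are hypothetical names, and the "processes" here are plain loop iterations rather than real subprocesses.

```python
from typing import Callable


def run_with_reruns(
    run_once: Callable[[int, int], bool],
    processes: int = 3,
    tries_per_process: int = 3,
) -> int:
    """Return the 1-based attempt number that passed, or 0 if none did.

    run_once(process_index, try_index) should return True on success.
    """
    attempt = 0
    for proc in range(processes):
        # In a real runner each outer iteration would be a fresh process,
        # so a crash in one process does not use up the remaining tries.
        for local_try in range(tries_per_process):
            attempt += 1
            if run_once(proc, local_try):
                return attempt
    return 0
```

With the defaults, a test that never passes is attempted 9 times before the function gives up.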

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-23 18:39:30 +00:00
40890ba8e7 [CI] Add python test skip logic for XPU (#117621)
Add python test skip logic for XPU

For testing purposes, #116833 and #116850 were cherry-picked first, and the XPU test passed: https://github.com/pytorch/pytorch/actions/runs/7566746218/job/20604997985?pr=117621. They have now been reverted.

Works for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117621
Approved by: https://github.com/huydhn
2024-01-23 08:20:42 +00:00
cef5b93f28 [ez] Serial when NUM_PROCS is 1 (#117977)
Makes it easier to understand what's going on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117977
Approved by: https://github.com/huydhn
2024-01-22 23:11:41 +00:00
f96b7d06d7 [export] skip export tests when test with dynamo in ci (#117988)
Fixes https://github.com/pytorch/pytorch/issues/117947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117988
Approved by: https://github.com/suo, https://github.com/zou3519
2024-01-22 22:14:36 +00:00
f684e44fd6 Revert "Reduce pytest prints (#117069)"
This reverts commit 40dbd567e04483c671f9c897171bf9d1e7162b68.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to need to handle timeout expired better ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1901270953))
2024-01-19 23:07:51 +00:00
40dbd567e0 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-19 18:42:12 +00:00
6c5c2121b1 Run some OOMing tests serially (#117759)
They were disabled due to being flaky due to OOMs but got renamed.  Seeing if running serially helps

I kind of want to keep this test disabled since the rest of the file is probably fine...

Issues in question: #113132 #113136 #113140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117759
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 16:45:35 +00:00
77cfacab55 Revert "Reduce pytest prints (#117069)"
This reverts commit 2f89ef23007626aca1a577a4a388e315253c834f.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to distributed tests are not printing items ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1899433816))
2024-01-19 00:27:03 +00:00
2f89ef2300 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-18 23:30:59 +00:00
5aa895e53e Don't run inductor tests in Dynamo shard (#117747)
In theory we could, but these get really slow once we turn on strict
mode, so we're not going to for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117747
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117729
2024-01-18 17:43:30 +00:00
db79ceb110 [ROCm] Enabling additional UTs on ROCm (#115738)
Unskips mostly for dynamo/inductor UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115738
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-01-09 08:36:07 +00:00
d455c33cca [ez][td] Pipe TD logs to log file (#116796)
It is a bit annoying to have them come up when searching through the logs.  They're also surprisingly long
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116796
Approved by: https://github.com/huydhn
2024-01-05 19:05:12 +00:00
c52b78ebc2 [ez] Remove some args from run_test.py (#115459)
Don't think anyone uses these
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115459
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-12-11 19:56:37 +00:00
641ec2115f [AOTI] move model runner into a library (#115220)
Summary: So that we can import it in fbcode and do some AOTI runs in a Python env

Test Plan: existing AOTI tests

Reviewed By: chenyang78

Differential Revision: D51780021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115220
Approved by: https://github.com/desertfire
2023-12-09 19:03:32 +00:00
3b7d60b6ff Fix keep-going (#112098)
New function for continue on error

Another solution might be to run the entire suite to the end and use last failed, but I'm worried about concurrent processes writing to the same last failed cache entry. It's also a bit different from the usual test rerunning strategy we use, especially regarding segfaults and other ways the test suite can suddenly end, and there are some cases where the entire test suite should immediately get rerun in a new process (ex cuda error that causes sync to fail).

Find example logs on commit 2f1510839727f6ef2631040d5f0edde26265015d

TODO: continue on error for --subprocess and test_distributed isn't working fully
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112098
Approved by: https://github.com/huydhn
2023-11-30 04:01:57 +00:00
2ea2421b44 Skip unit tests that fail on MI210 runners (#114613)
Taken from https://github.com/pytorch/pytorch/pull/105980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114613
Approved by: https://github.com/malfet
2023-11-27 22:25:35 +00:00
2aa486de9b vendor packaging.version (#114108)
Fixes #113940. This vendors the relevant parts of `packaging==23.2.0` to have access to `Version` and `InvalidVersion` without taking a runtime dependency on `setuptools` or `packaging`.

I didn't find any vendoring policy so I put it under `torch._vendor.packaging`. While I have only vendored the files we need, I have not touched or trimmed the files otherwise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114108
Approved by: https://github.com/malfet, https://github.com/albanD
2023-11-21 11:51:23 +00:00
ec20c9044e [TD] Fix metric emission for split test files (#113789)
Fixes a bug in TD metrics generation where it wouldn't be able to find the rank and relevance that a heuristic gave a test run if that heuristic had divided that test into multiple test runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113789
Approved by: https://github.com/clee2000
2023-11-16 23:19:40 +00:00
87aeb248c9 More random stepcurrent (#113620)
Distributed tests for different backends have the same name, so they clash when using the current stepcurrent key, and as a result some tests were not being run.
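
To make the clash concrete, here is a small sketch of a key derived only from the test file name versus one with a random suffix. The function `stepcurrent_key` and the key format are hypothetical and only illustrate the idea; they are not the scheme used in the commit.

```python
import uuid


def stepcurrent_key(test_file: str, unique: bool) -> str:
    """Build a cache key for a test run (hypothetical format)."""
    key = test_file.replace("/", "_")
    if unique:
        # A random suffix keeps runs of the same file (for example under
        # different distributed backends) from sharing one cache entry.
        key += "_" + uuid.uuid4().hex[:8]
    return key


if __name__ == "__main__":
    name = "distributed/test_distributed_spawn"
    print(stepcurrent_key(name, unique=False))
    print(stepcurrent_key(name, unique=True))
```

Two runs of the same file get identical keys without the suffix, so the second run would resume from the first run's saved position and could skip tests.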

Disabled the following tests because they are failing:
test_ddp_has_finalized

test_broadcast_object_list
<details>

```

2023-11-14T06:44:01.0428686Z
2023-11-14T06:44:01.0430447Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_broadcast_object_list <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.0431048Z [1699943450.893723] [99f90b6e6ff3:10028:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0431625Z [1699943450.914385] [99f90b6e6ff3:10029:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0432314Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0433178Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0434677Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0435435Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0436895Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0437500Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0438917Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0439637Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0441122Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0441873Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0443340Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0444077Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0445769Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0446732Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0448433Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0449187Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0450553Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0451621Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0453161Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0454065Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0455441Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0456183Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0457775Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0458649Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0460923Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 1][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0461471Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0462430Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0463552Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0464082Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0465136Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0465945Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.0466605Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0467303Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0467972Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0468743Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0470233Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0471106Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0472581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0473162Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0474581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0475314Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0476776Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0477535Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0478993Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0479886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0481593Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0482429Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0484145Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0484886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0486271Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0487018Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0488559Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0489470Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0491078Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0491912Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0493369Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0494419Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0496679Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0497211Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0498198Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0499291Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0499838Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0500881Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0501667Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.0502343Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0503024Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0503411Z ('RERUN', {'yellow': True}) [6.1102s] [100%]
```
</details>

test_ddp_sync_bn_training_vs_eval

<details>

```

2023-11-14T06:44:01.1494815Z
2023-11-14T06:44:01.1496630Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_sync_bn_training_vs_eval <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.1497290Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1498119Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1498808Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1499465Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1500160Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1500820Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1501556Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502239Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502952Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1503678Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1504350Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1505119Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1506729Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1507492Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1508992Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1509578Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1510994Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1511725Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1513193Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1513962Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1515697Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1516529Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1518019Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1518910Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1520177Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1521062Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1522238Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1523099Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1523923Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1524470Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1525481Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1526632Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1527180Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1528223Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1529029Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.1529786Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1530576Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1532383Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1533127Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1534608Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1535194Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1536817Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1537575Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1539036Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1539800Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1541531Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1542388Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1544015Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1544907Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1546061Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1546944Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1548142Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1548991Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1549806Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1550350Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1551304Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1552462Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1553095Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1554166Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1554976Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.1555235Z ('RERUN', {'yellow': True}) [6.6107s] [100%]
```
</details>

test_backend_full_group
<details>

```
2023-11-14T22:51:56.4502470Z FAILED [5.2125s] distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_backend_full_group - RuntimeError: Process 0 exited with error code 10 and exception:
2023-11-14T22:51:56.4502665Z Traceback (most recent call last):
2023-11-14T22:51:56.4503603Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4503796Z     getattr(self, test_name)()
2023-11-14T22:51:56.4504710Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4504845Z     fn()
2023-11-14T22:51:56.4505737Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4505896Z     method(*args, **kwargs)
2023-11-14T22:51:56.4506823Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4506992Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4508285Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4508640Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4509798Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4510104Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4510629Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4510650Z
2023-11-14T22:51:56.4510987Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4511525Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4511545Z
2023-11-14T22:51:56.4511970Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4511989Z
2023-11-14T22:51:56.4512242Z Process 1 exited with error code 10 and exception:
2023-11-14T22:51:56.4512454Z Traceback (most recent call last):
2023-11-14T22:51:56.4513380Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4513687Z     getattr(self, test_name)()
2023-11-14T22:51:56.4514612Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4514746Z     fn()
2023-11-14T22:51:56.4515633Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4515791Z     method(*args, **kwargs)
2023-11-14T22:51:56.4516708Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4516895Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4518008Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4518352Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4519509Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4519813Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4520334Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4520355Z
2023-11-14T22:51:56.4528843Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4529492Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4529681Z
2023-11-14T22:51:56.4530122Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4530423Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
```
</details>

pretty sure the solution for this one is to add ucc in _test_group_override_backend
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430019
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113620
Approved by: https://github.com/huydhn
2023-11-15 21:56:10 +00:00
0c448526a4 [experiment][TD] Rating number system (#112676)
Emits an excessive amount of heuristic info, but that just means I can do more with it later?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112676
Approved by: https://github.com/ZainRizvi
2023-11-07 19:40:11 +00:00
e2e5897269 [CI] Do not use packaging in run_tests.py (#112873)
It used to check that CUDA is newer than 11.6, but all of them are now.

Yet another mitigation towards missing `packaging` on MacOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112873
Approved by: https://github.com/huydhn
2023-11-03 17:22:46 +00:00
4e67c69a7d [TD] Support downgrading test relevance (#112671)
Allow heuristics to actually downgrade the relevance of a test.  Note that NONE/UNLIKELY tests will still get executed, but they will be run at the end of the CI.

The Relevance chosen affects the outcome when Heuristics offer conflicting predictions. A relevance higher up in this list means higher confidence in the declared relevance:

HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED

Given that we assume ordering based on the list in init right now since the lists are appended, do a similar thing for UNLIKELY and NONE
e.g. HEURISTICS = [a, b, c, d]
currently all things in b.high are added after a.high
if b.none includes things in a.high, a.high trumps
if b.none includes things in a.probable, then b.none trumps since none is stronger than probable
if b.unlikely includes things from a.high/probable, a.high/probable trumps since high and probable are at a higher strength than unlikely
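
The conflict rules above can be sketched as follows. This is only an illustration under stated assumptions, not the code from this PR: the `Relevance` enum and the `resolve` helper are hypothetical names, and each heuristic's output is modeled as a plain dict from test name to relevance.

```python
from enum import IntEnum


class Relevance(IntEnum):
    # Hypothetical encoding: a larger value is a stronger claim, following
    # HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED from the description.
    UNRANKED = 0
    UNLIKELY = 1
    PROBABLE = 2
    NONE = 3
    HIGH = 4


def resolve(ratings_per_heuristic):
    """Merge ratings in heuristic order; the stronger relevance wins.

    On a tie, the rating from the earlier heuristic is kept.
    """
    merged = {}
    for ratings in ratings_per_heuristic:
        for test, relevance in ratings.items():
            current = merged.get(test, Relevance.UNRANKED)
            merged[test] = relevance if relevance > current else current
    return merged


# Heuristic a runs before heuristic b, as in HEURISTICS = [a, b, ...].
a = {"test_x": Relevance.HIGH, "test_y": Relevance.PROBABLE, "test_z": Relevance.HIGH}
b = {"test_x": Relevance.NONE, "test_y": Relevance.NONE, "test_z": Relevance.UNLIKELY}
result = resolve([a, b])
```

Here `test_x` stays HIGH (a.high trumps b.none), `test_y` becomes NONE (none is stronger than probable), and `test_z` stays HIGH (a.high trumps b.unlikely).
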
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112671
Approved by: https://github.com/clee2000
2023-11-02 21:02:40 +00:00
a5641bc56b [TD] Enable Test Class granularity on heuristics (#112161)
Changes the heuristic framework to support prioritizing individual classes within a test file.

Components of this included:
- Updating TestPrioritizations to accept individual test classes being prioritized. Previously, when a heuristic wanted to prioritize a test file, it would pass in the test's name; now, to prioritize a class within a test, it uses the notation "test::classname"
- Changes are fully backwards compatible with existing heuristics
- Test sharding now supports sharding individual tests (for when they're prioritized)
- When a TestClass is prioritized, we pass the appropriate "-k" flags down to pytest
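
A minimal sketch of how the "test::classname" notation could be split and turned into pytest "-k" arguments. The helper names `parse_prioritized` and `pytest_k_args` are hypothetical and are not taken from the PR; only the "::" notation and the use of "-k" come from the description above.

```python
def parse_prioritized(entry):
    """Split 'test_file::ClassName' into (test_file, class_name or None)."""
    test_file, sep, class_name = entry.partition("::")
    # Without a '::' separator the whole file is prioritized.
    return test_file, (class_name if sep else None)


def pytest_k_args(class_names):
    """Build a pytest '-k' expression that selects any of the given classes."""
    if not class_names:
        return []
    return ["-k", " or ".join(class_names)]


whole_file = parse_prioritized("test_ops")
one_class = parse_prioritized("test_ops::TestCommonCPU")
k_args = pytest_k_args(["TestCommonCPU", "TestMathBitsCPU"])
```

Plain file names still parse to a whole-file entry, which is one way the backwards compatibility mentioned above could be preserved.
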

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112161
Approved by: https://github.com/huydhn
2023-10-31 18:11:05 +00:00
3b5b7ebd09 [ci] Save various json files from test infra into folder (#111516)
We pull a lot of files from https://github.com/pytorch/test-infra/blob/generated-stats/stats and name them separately when we add them to the artifacts in the build, so stick them in a folder and just add that instead.

Slow test and disabled test jsons remain as they were since they are pulled during the test step and do not need to be included in the artifacts during build since they are not used for sharding.

Sanity checked that test times could be found for linux, mac, windows, and rocm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111516
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-10-23 20:38:25 +00:00
e9a51a6a07 [BE] Revive test_typing (#111428)
`test_typing.py` was written to use `pytest` in https://github.com/pytorch/pytorch/pull/54234 which unfortunately rendered it incompatible with run_test.py, and therefore it was not running in CI all this time.

In this PR, the same functionality is rewritten using the unittest framework and `parametrize` from `torch.testing._internal.common_utils`.

Validate `test_typing.py` with ufmt

Disable `fail/bitwise_ops.py` and `pass/jit.py`, as they regressed at some point, as well as one of the examples in `namedtuple.py`, since the `torch.linalg.qr` type is no longer revealed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111428
Approved by: https://github.com/clee2000
2023-10-18 02:19:49 +00:00
6b92c367c5 Add test_jit_cuda_fuser to ROCM_BLOCKLIST (#110440)
Adds the nvfuser-related unit test suite to ROCM_BLOCKLIST, as it should not be run on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110440
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/lezcano
2023-10-06 08:47:15 +00:00
8a09fe4a05 [ez] Remove print in heuristics aggregation (#110621)
Move print to the beginning instead because putting it at the end makes it so you have to scroll through when debugging, and nothing in that function indicates that it should be printing anything

Also move the line for printing disabled issues out of the for loop
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110621
Approved by: https://github.com/huydhn
2023-10-06 02:04:53 +00:00
d6e5898e8d Quieter logs in CI (#110033)
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest.  Zip the log into an artifact.  The line listing all the test names is really long, but if you view source of the raw logs, it will not wrap so it will only be one line.  The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [  9%]`
* for failures/reruns, print logs.  Do not zip.

Also
* change log artifact name

Examples of various logs:
a074db0f7f failures
1b439e24c4 failures

possibly controversial haha
should i include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
2023-10-05 16:40:37 +00:00
f69e9c8c91 run_tests.py minor logging changes (#110188)
Minor logging changes that just kind of annoyed me:
* prevent constant printing of `No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'` by moving import within the function (idk if this is ok)
* prevent constant printing of `Ignoring disabled issues:  ['']` (no idea why it was not gated behind a function or main)
* change all prints in run_tests.py to be through stderr so there's no weird interleaving (although if everything goes through stderr, might as well just print everything through stdout...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110188
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
2023-10-03 01:22:47 +00:00
1277d0e834 [BE] Add sharding data by default to metrics (#110035)
Extend metric library to allow setting global metrics on a process level which will always be emitted.

Current use case for them is to include shard information every time a metric is emitted by run_test.py
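
A rough sketch of the idea of process-level global metrics, assuming a simple in-memory registry. The names `add_global_metric` and `build_metric_payload` and the field names are illustrative assumptions and may not match the actual metric library.

```python
# Hypothetical process-wide registry of values attached to every metric.
_GLOBAL_METRICS = {}


def add_global_metric(name, value):
    """Register a value that is included in every metric emitted afterwards."""
    _GLOBAL_METRICS[name] = value


def build_metric_payload(metric_name, metrics):
    """Combine global metrics with per-call metrics; per-call values win."""
    return {**_GLOBAL_METRICS, "metric_name": metric_name, **metrics}


# Set once at process start, e.g. with the shard information.
add_global_metric("shard", 2)
add_global_metric("num_shards", 5)

payload = build_metric_payload("test_duration", {"seconds": 12.5})
```

Setting the shard fields once means individual call sites do not need to pass them every time a metric is emitted.
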

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 0cae92c</samp>

> _`run_test` refactored_
> _Sharding metrics in Rockset_
> _Autumn of testing_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035
Approved by: https://github.com/clee2000
2023-09-26 17:06:49 +00:00
47adcd412f Increase timeout for slow tests (#109206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109206
Approved by: https://github.com/huydhn
2023-09-26 16:18:38 +00:00
0d3db1048a remove nvfuser test in upstream pytorch (#109918)
Removing nvfuser related tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109918
Approved by: https://github.com/msaroufim
2023-09-24 13:49:37 +00:00
fe198f3141 inductor/test_max_autotune serial in CI (#109209)
Trying to figure out why this keeps timing out; wondering if it's due to parallelization weirdness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109209
Approved by: https://github.com/huydhn
2023-09-13 17:04:43 +00:00
a4138b1f99 [ez] Fix small type error in run_test (#109036)
This is really small but it has tripped me up at least 3 times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109036
Approved by: https://github.com/kit1980
2023-09-11 21:11:20 +00:00
c67ebae344 Put logging in run_tests (#107987)
Logging regarding which tests are serial + parallel + what tests actually get run on the shard got removed, which can be pretty helpful, so this adds it back in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107987
Approved by: https://github.com/huydhn, https://github.com/Neilblaze
2023-09-01 20:23:30 +00:00
5727b07ac6 TD: logging bugfix (#108288)
Fix bug where logging metrics don't get emitted unless the 'keep-going' label is specified on the PR

Also adds some extra logging to make debugging easier
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108288
Approved by: https://github.com/Skylion007
2023-08-31 16:51:49 +00:00
238cc84af9 [TD] Emit metrics to compare heuristic quality (#108192)
When a test fails, we will now emit fine grained details about how accurately heuristics predicted the relevance of that test.

## Context
Why only look at failing tests? Our only signal that a PR is most likely relevant to a test is whether or not a test fails on it. Green tests don't tell us if the success was due to the code being good vs being irrelevant.  This isn't a perfect measure, since it can miscategorize unstable and flaky failures as having been "missed" by the heuristics, but it's a reasonable approximation.

## What's measured?
The metrics this PR collects are designed to answer the following questions

### How comprehensive are the heuristics?
- What's the false negative rate, the % of failures that ideally should have been prioritized but weren't? (Both at an aggregate level and at a per heuristic level)

### How precise are the heuristics?
- What % of failed tests were prioritized by a given heuristic? What % was prioritized overall?
- How relevant was a failed test considered to be? (Both at an aggregate level and at a per heuristic level)
- What % of time was a given heuristic prioritizing a failing test higher than any other heuristic?
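
As an illustration of the first question, the false negative rate could be computed along these lines. This is a sketch under assumptions: `false_negative_rate` is a hypothetical helper, and the test names are made up for the example.

```python
def false_negative_rate(failed_tests, prioritized_tests):
    """Fraction of failed tests that were not prioritized."""
    failed = set(failed_tests)
    if not failed:
        # No failures means there is nothing the heuristics could have missed.
        return 0.0
    missed = failed - set(prioritized_tests)
    return len(missed) / len(failed)


failed = ["test_a", "test_b", "test_c", "test_d"]
prioritized = ["test_a", "test_x"]
rate = false_negative_rate(failed, prioritized)
```

The same function can be applied per heuristic by passing only the tests that a single heuristic prioritized.
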

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108192
Approved by: https://github.com/huydhn
ghstack dependencies: #108117
2023-08-30 18:28:18 +00:00
620d267ef3 Refactor TestPrioritizations to support more priorities and reduce risk of accidental mutations (#108117)
Refactor TD code to make it easier to add additional categories later and also support the changes required to enable the metrics needed for TD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108117
Approved by: https://github.com/huydhn
2023-08-30 04:14:28 +00:00
36399d067a Port existing heuristics to TD framework (#107071)
This PR looks big, but it's mostly just refactorings with a bit of dead code deletion. Exceptions are:
- Some metric emissions were changed to comply with the new TD format
- Some logging changes
- We now run tests in three batches (highly_relevant, probably_relevant, unranked_relevance) instead of the previous two (prioritized and general)

Refactorings done:
- Moves all test reordering code to the new TD framework
- Refactors run_test.py to cleanly support multiple levels of test priorities
- Deletes some dead code that was originally written for logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107071
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-08-23 21:23:23 +00:00