pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 12:54:11 +08:00

Author	SHA1	Message	Date
Nikita Shulga	cbf212e9c7	[CI] Fix doctest job if build without distributed (#165449 ) Guard test with `TORCH_DOCTEST_DISTRIBUTED` and set it to true in run_test.py to be able to pass doctest for PyTorch build without distributed support. This is a regression introduced by https://github.com/pytorch/pytorch/pull/164806 Fixes https://github.com/pytorch/pytorch/issues/165343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165449 Approved by: https://github.com/seemethere	2025-10-14 19:19:03 +00:00
Catherine Lee	a4925c0ce0	[testing] Print something for log classifier to better differentiate reruns vs real failures (#165163 ) The normal pytest/unittest failure patterns also match flaky tests (specifically I think tests that fail -> succeed on rerun in a new subprocess) So print something specifically for log classifier that it can match against Pull Request resolved: https://github.com/pytorch/pytorch/pull/165163 Approved by: https://github.com/izaitsevfb	2025-10-10 19:28:13 +00:00
FFFrog	e0abcee3b5	[Code Clean] Remove support of python3.9 (#163846 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163846 Approved by: https://github.com/ezyang	2025-10-10 11:11:56 +00:00
PyTorch MergeBot	91040f4934	Revert "[Code Clean] Remove support of python3.9 (#163846 )" This reverts commit bc1690c7e859dee8c47a7f0bbd3c43cc27c6fd2a. Reverted https://github.com/pytorch/pytorch/pull/163846 on behalf of https://github.com/izaitsevfb due to breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/163846#issuecomment-3386855437))	2025-10-09 17:27:08 +00:00
FFFrog	bc1690c7e8	[Code Clean] Remove support of python3.9 (#163846 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163846 Approved by: https://github.com/ezyang	2025-10-09 11:54:10 +00:00
Han Qi	b5c4f46bb9	Add functions to setup PrivateUse1 as a python backend device. (#157859 ) Fixes #156052 and #156444. This PR sets up the privateuseone key in Python to be used as a python backend for pytorch. Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that take that Tensor. Changes done in this PR: 1. Register a vanilla Device Guard: I extended NoOpDeviceGuard to allow a device index of 0 and to not raise errors when event related functions are accessed. If I don't do those, when calling backward I would get errors. (CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.) 2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfault when calling into ops that expect a storage. Here we have a different device so will not call into those ops. 3. python function that invokes the other incantations to setup the privateusekey backend. This took inspiration from https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859 Approved by: https://github.com/albanD	2025-10-01 21:32:59 +00:00
PyTorch MergeBot	410ed3006b	Revert "Add functions to setup PrivateUse1 as a python backend device. (#157859 )" This reverts commit 1310d6a1f9194ddcf6753f7e12fb78f278451f8a. Reverted https://github.com/pytorch/pytorch/pull/157859 on behalf of https://github.com/jeanschmidt due to introduce linting errors ([comment](https://github.com/pytorch/pytorch/pull/157859#issuecomment-3352140098))	2025-09-30 13:24:37 +00:00
Han Qi	1310d6a1f9	Add functions to setup PrivateUse1 as a python backend device. (#157859 ) Fixes #156052 and #156444. This PR sets up the privateuseone key in Python to be used as a python backend for pytorch. Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that take that Tensor. Changes done in this PR: 1. Register a vanilla Device Guard: I extended NoOpDeviceGuard to allow a device index of 0 and to not raise errors when event related functions are accessed. If I don't do those, when calling backward I would get errors. (CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.) 2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfault when calling into ops that expect a storage. Here we have a different device so will not call into those ops. 3. python function that invokes the other incantations to setup the privateusekey backend. This took inspiration from https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859 Approved by: https://github.com/albanD	2025-09-30 08:39:36 +00:00
Klaus Zimmermann	50d418f69f	Replace setup.py bdist_wheel with python -m build --wheel (#156712 ) Previously we already replaced most use of `python setup.py develop/install`. This PR also replaces the use of `setup.py bdist_wheel` with the modern `python -m build --wheel` alternative. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156712 Approved by: https://github.com/atalman ghstack dependencies: #156711	2025-09-29 21:51:32 +00:00
Angel Li	3b73841f43	update test_quantization tests to run weekly (#163077 ) Fixes #162854 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163077 Approved by: https://github.com/huydhn	2025-09-24 11:31:11 +00:00
Ke Wen	7924b083c1	[CI] disable rerun of distributed tests (#163025 ) #162978 identified an issue that distributed test failures were wrongly muted. Per discussion with @malfet, one solution is to disable rerun of distributed tests in `run_test.py`. The PR makes use of the `is_distributed_test` flag to identify those tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163025 Approved by: https://github.com/malfet	2025-09-16 03:11:50 +00:00
FFFrog	27daa6af6a	[OpenReg] Strengthen Openreg's execution limits to minimize the waste of computing resources (#161918 ) Currently, OpenReg supports Linux, Windows, and OS X, ensuring stability and ease of integration with third-party devices across all three platforms. It also doesn't rely on any other accelerators (such as CUDA or MPS). Therefore, to minimize computational resource usage, `test_openreg` can be added to certain BLOCKLISTS to prevent its execution, limiting OpenReg's execution to only necessary scenarios. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161918 Approved by: https://github.com/albanD ghstack dependencies: #161917	2025-09-12 23:53:17 +00:00
FFFrog	9b429846e8	[OpenReg] Migrate OpenReg Tests from tests/test_openreg.py into torch_openreg/tests (#161917 ) Background: Almost all the tests in `test/test_openreg.py` are designed for `torch_openreg`, so placing these testcases in the test directory is not a good idea. Instead, they should be moved to the `tests` directory under `torch_openreg`, coordinating these tests with their corresponding functional logic. How to do: So how do we verify the quality of the third-party device integration mechanism? We will maintain a `test_openreg` entrypoint in `test/run_test.py`. This entrypoint will install `torch_openreg` and run all the testcases located in `torch_openreg`. As long as all testcases pass, we can guarantee that the out-of-tree backend integration mechanism is available. Next: We will also improve `torch_openreg's` test coverage in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161917 Approved by: https://github.com/albanD	2025-09-12 23:53:17 +00:00
Dmitry Rogozhkin	ee53ad2dd0	xpu: test py_limited_api with SyclExtension (#162546 ) Commit extends existing CUDA test to cover XPU SyclExtension case for the same feature - `py_limited_api`. Commit required a fix for xpu to install some Aten header files (#145902) which got resolved after the merge of #159621. See: https://github.com/pytorch/pytorch/issues/145902 Requires: https://github.com/pytorch/pytorch/pull/159621 Requires: https://github.com/intel/torch-xpu-ops/pull/1743 CC: @guangyey, @EikanWang Pull Request resolved: https://github.com/pytorch/pytorch/pull/162546 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/janeyx99	2025-09-12 21:57:01 +00:00
Jeff Daily	f03d635dc6	[ROCm][CI] skip test_max_autotune until resolved (#162496 ) many tests taking >30 min and causing timeouts Pull Request resolved: https://github.com/pytorch/pytorch/pull/162496 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-09 16:34:01 +00:00
FFFrog	827f0d4054	Using get_paths() to get correct installation path for PYTHONPATY (#161947 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161947 Approved by: https://github.com/albanD ghstack dependencies: #161845, #161903	2025-09-03 06:38:03 +00:00
FFFrog	dac8a4b91c	Using pip3 install instead of python setup.py develop/install (#161903 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161903 Approved by: https://github.com/ezyang ghstack dependencies: #161845	2025-09-03 03:12:18 +00:00
xinan.lin	81b7b16618	Reland "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142 )" (#161949 ) This PR reland #161142 which is reverted to be able to revert other PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161949 Approved by: https://github.com/jansel	2025-09-02 23:43:27 +00:00
PyTorch MergeBot	54e275e0d8	Revert "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142 )" This reverts commit c83cbd2f2a2de2e3258f07de77d8740743df6d2d. Reverted https://github.com/pytorch/pytorch/pull/161142 on behalf of https://github.com/jeanschmidt due to This PR needs to be reverted to be able to revert another PR, this is due to merge conflicts, I am sorry for this. Please feel free to rebase and merge at your earliest convenience ([comment](https://github.com/pytorch/pytorch/pull/161142#issuecomment-3242937640))	2025-09-01 17:03:50 +00:00
xinan.lin	c83cbd2f2a	[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142 ) Fixes #161384, Fixes #161162, Fixes #160946, Fixes #160947, Fixes #160948 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161142 Approved by: https://github.com/jansel	2025-08-30 11:09:07 +00:00
Aleksei Nikiforov	1069a08dac	Enable more nightly tests on s390x (#160893 ) Enable more nightly tests on s390x Pull Request resolved: https://github.com/pytorch/pytorch/pull/160893 Approved by: https://github.com/malfet	2025-08-28 22:20:55 +00:00
Jeff Daily	262640fd22	[ROCm][CI] restore test_flex_attention tests (#161519 ) Reverts #161450 and targets specific subtests to skip on MI200. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161519 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-26 19:31:30 +00:00
amdfaa	85adf80cf1	Disable inductor/test_flex_attention.py (#161450 ) Currently inductor/test_flex_attention.py is causing rocm pytorch mi250 shard 1 to go over the timeout limit. This PR is for disabling that test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161450 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-26 01:28:51 +00:00
FFFrog	56ebed627a	[OpenReg] Add OSX/Windows Support for OpenReg (#159441 ) As the title stated. Changes: - Abstract platform-specific APIs - Add OSX/Windows support - Set default symbol visibility to "hidden" Co-authored-by: @can-gaa-hou Original PR:https://github.com/pytorch/pytorch/pull/159029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159441 Approved by: https://github.com/albanD Co-authored-by: jiahaochen666 <jiahaochen535@gmail.com>	2025-08-25 08:03:27 +00:00
rzou	8ab5868a21	Actually run the einops tests in CI (#159776 ) The test filter was wrong, it should not start with "test/". Test Plan: - wait for CI - Tested locally with `python test/run_test.py --einops --verbose` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159776 Approved by: https://github.com/atalman, https://github.com/StrongerXi	2025-08-07 15:23:06 +00:00
angelayi	b1ec088113	[mps] Turn on inductor dynamic shapes tests (#159456 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-08-05 22:27:06 +00:00
PyTorch MergeBot	fb8f32ef52	Revert "[mps] Turn on inductor dynamic shapes tests (#159456 )" This reverts commit 19f1f9960db7f29f2110a7f49f06a1a23c651ecf. Reverted https://github.com/pytorch/pytorch/pull/159456 on behalf of https://github.com/davidberard98 due to Sorry - this causes a merge conflict with https://github.com/pytorch/pytorch/pull/159798, which I'm trying to land with co-dev to resolve a sev ([comment](https://github.com/pytorch/pytorch/pull/159456#issuecomment-3152751821))	2025-08-04 23:11:05 +00:00
angelayi	19f1f9960d	[mps] Turn on inductor dynamic shapes tests (#159456 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-08-04 22:44:31 +00:00
Aleksei Nikiforov	e5a81aa7ba	Fix conversion of values in libtorch agnostic tests (#155115 ) Due to different byteorder, when copying data, it has to be put into last bytes to ensure that int32_t converted to int64_t keeps same value. Same has to be done when it's converted back. This change fixes test TestLibtorchAgnosticCPU::test_my_ones_like_cpu from cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py on s390x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155115 Approved by: https://github.com/huydhn	2025-08-04 13:40:22 +00:00
Aleksei Nikiforov	6646461764	S390X: fix detection of magic number placeholder in inductor (#157784 ) This change fixes multiple tests in test/inductor/test_aot_inductor_arrayref.py such as test_cond_with_parameters_cpu_with_stack_allocation, test_issue_140766_cpu_with_stack_allocation, test_model_modified_weights_cpu_with_stack_allocation, test_nested_tensor_from_jagged_cpu_with_stack_allocation. Enable tests in test/inductor/test_aot_inductor_arrayref.py This change is split off from https://github.com/pytorch/pytorch/pull/150116 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784 Approved by: https://github.com/huydhn	2025-08-04 12:42:31 +00:00
FFFrog	4261e26a8b	[OpenReg] move fallback tests into test_openreg.py (#158441 ) ---- - move fallback tests into test_openreg - remove the test_cpp_extensions_open_device_registration.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/158441 Approved by: https://github.com/albanD ghstack dependencies: #158415, #158440	2025-07-25 02:39:41 +00:00
Catherine Lee	a00442421a	[CI][TD] Enable TD on all test configs (#158163 ) I think the main one that was missing is dynamo_wrapped There's also slow and inductor, but the filter later for workflows stops TD from running on those anyways dynamo_wrapped is the second longest jobs for pull right now <img width="1265" height="311" alt="image" src="https://github.com/user-attachments/assets/d4ca034c-a8f0-4b31-a80f-0f4f21fce32a" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158163 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2025-07-17 21:05:25 +00:00
angelayi	3cb11877aa	[aoti][mps] Enable test_aot_inductor.py tests (#155598 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155598 Approved by: https://github.com/yushangdi	2025-07-16 22:26:57 +00:00
FFFrog	1b389025ba	Refactor and Improve the OpenReg Module (#158090 ) ---- # Refactor and Improve the OpenReg Module ## Background Since PrivateUse1 has become the main path for integrating new devices with PyTorch, there have been some feature requests related to PrivateUse1 regarding interfaces, documentation, reference examples, etc., such as the following: - https://github.com/pytorch/pytorch/issues/155864 - https://github.com/pytorch/pytorch/issues/144955 - https://github.com/pytorch/pytorch/issues/144845 Taking these requests into consideration and combining them with the position of OpenReg, which is currently used as the test backend for PrivateUse1, I'm planning to make the following optimizations: - Optimize the implementation of OpenReg to make it align with the standard specifications for real backend (C++) access, serving as a reference for new device integration code. - Add comprehensive documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html) to guide new accelerator integration, functioning as a reference manual. ## Design Principles: - Minimization Principle: Keep the code small and clear; only implement the minimum set of code required for verification and as an integration reference. - Authenticity Principle: Integrate OpenReg in the same way that real accelerators access PyTorch. ## More Info: Please refer to [this](`6b8020f1ab/test/cpp_extensions/open_registration_extension/torch_openreg/README.md`) for more information about `OpenReg`. ## Current Progress: - Refer to the implementation of [torch_xla](https://github.com/pytorch/xla) to refactor all of OpenReg's code, making it easier to understand. - Ensure all tests in [test/test_openreg.py](https://github.com/FFFrog/pytorch/blob/openreg/test/test_openreg.py) pass after refactoring. ## Next Steps: - Add more features to cover all integration points. - Gradually add user guides and documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html). Pull Request resolved: https://github.com/pytorch/pytorch/pull/158090 Approved by: https://github.com/seemethere, https://github.com/albanD	2025-07-15 08:10:05 +00:00
FFFrog	aab949aa96	Deprecated pkg_resources and use distributions instead (#151915 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151915 Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/albanD	2025-07-10 01:51:26 +00:00
Catherine Lee	b4e3c9ea34	[ez][CI][testing] Set upload artifacts while running to default true if in CI (#157868 ) I was confused about why the distributed tests weren't showing up quickly on HUD. It's because the call of run_tests.py for distributed didn't include the upload artifacts while running flag, so set it to default to IS_CI so I don't need to put the flag everywhere Pull Request resolved: https://github.com/pytorch/pytorch/pull/157868 Approved by: https://github.com/huydhn	2025-07-09 15:21:25 +00:00
Xuehai Pan	4dce5b71a0	[build] modernize build-frontend: `python setup.py develop/install` -> `[uv ]pip install --no-build-isolation [-e ].` (#156027 ) Modernize the development installation: ```bash # python setup.py develop python -m pip install --no-build-isolation -e . # python setup.py install python -m pip install --no-build-isolation . ``` Now, the `python setup.py develop` is a wrapper around `python -m pip install -e .` since `setuptools>=80.0`: - pypa/setuptools#4955 `python setup.py install` is deprecated and will emit a warning during run. The warning will become an error on October 31, 2025. - `9c4d383631/setuptools/command/install.py (L58-L67)` > ```python > SetuptoolsDeprecationWarning.emit( > "setup.py install is deprecated.", > """ > Please avoid running ``setup.py`` directly. > Instead, use pypa/build, pypa/installer or other > standards-based tools. > """, > see_url="https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html", > due_date=(2025, 10, 31), > ) > ``` - pypa/setuptools#3849 Additional Resource: - [Why you shouldn't invoke setup.py directly](https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156027 Approved by: https://github.com/ezyang	2025-07-09 11:24:27 +00:00
Jithun Nair	38757d94f1	Enable target-determination (TD) for ROCm CI (#156545 ) Target determination sorts the tests in a PR CI run based on heuristics about which tests are more relevant to the PR's changes. This can help provide faster CI signal as well as help alleviate capacity concerns as job durations should decrease due to catching failures earlier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156545 Approved by: https://github.com/jeffdaily, https://github.com/clee2000	2025-07-08 06:27:40 +00:00
Aaron Orenstein	edf7bb4f51	Fix unbound local when an error occurs before pool is initialized (#156750 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156750 Approved by: https://github.com/jamesjwu	2025-07-08 00:28:21 +00:00
rzou	e3fe001d9e	Add einops x torch.compile testing in PyTorch CI (#157416 ) Fixes #146782. This PR adds testing for multiple einops versions in PyTorch CI. This occurs in a new "einops" CI job that runs for both Python 3.9 and 3.13 (aka, what we test Dynamo over). Test Plan: - wait for CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/157416 Approved by: https://github.com/guilhermeleobas, https://github.com/arogozhnikov, https://github.com/anijain2305	2025-07-03 17:36:39 +00:00
Aleksei Nikiforov	c11888e7a6	Skip more tests on s390x (#155210 ) Make CI for s390x green before fixing and restoring tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155210 Approved by: https://github.com/seemethere	2025-06-18 12:07:17 +00:00
Catherine Lee	32c1611263	[CI][run_test] Fix rerun logic for failing at exit (#155853 ) Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly. The name of the last run test is saved in order to do more efficient reruns (target the last run test for a rerun without rerunning the entire file). This is usually correct, e.g. test fails and pytest catches it -> lastrun = the test that failed, test segfaults (pytest doesn't catch) -> lastrun is the test that segfaulted. But sometimes pytest reports a success, but the process has non zero exit code. The two cases I know of are hangs and double freeing at exit. In this case, it's unclear which test caused the failure, so lastrun is set to be the first test that ran in that session, so that during the next session it will start from the beginning in an attempt to replicate the error (an alternate solution would be to just fail and not rerun, which might be the better option). But then it reruns with runsingle, which prevents lastrun from being reset (not sure why, I'm pretty sure there's no difference between resetting and not normally), so lastrun becomes the last test that ran, and it's not always true that lastrun is the one that caused it. Then on the next run, it starts from the last test and the process now exits cleanly. Short term solution here: ensure the lastrun is always set to the initial value if the session succeeds. This is correct even in the normal path because initial value shouldn't change in that case. Things that still need to be fixed: * log says "running single test" which is not true * no xml reports get generated here * also no xml reports get generated on segfault * docs for this I think I have a PR that fixes the above but it's old so I need to take another look Testing: This is from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref: `cc862d2c14` Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853 Approved by: https://github.com/malfet	2025-06-17 17:51:40 +00:00
Catherine Lee	0079c80b35	[CI] Do not constrain memory for ROCm testing in CI (#156115 ) Fixes ROCm OOMs introduced by https://github.com/pytorch/pytorch/pull/155631 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156115 Approved by: https://github.com/jeffdaily	2025-06-17 15:30:36 +00:00
Catherine Lee	eef253d9f6	[CI] Keep going display on HUD: upload log when test fails (#155371 ) I guess this is more of an RFC Goal: Enable keep going so that we can get information immediately for failures. We want to be aware of failures as soon as possible, especially on the main branch, so that reverts can happen quickly. Proposal: A job with `keep-going` will continue through errors in `python run_test.py`. If a test fails, before it runs the next test, it will upload a fake log that should have enough information in it so that viewing the log will be able to tell you what failed and any stack traces/error logs, and should be able to be parsed by log classifier to get a line. I am getting the log by concatenating the test logs in test/test-reports, which is all the text outputted by pytest (unless someone runs with `ci-verbose-test-logs` label). There are obviously many things this won't catch, e.g. output outside of run_test.py, some output inside of run_test.py, but it should be enough. After a log finishes, eventually its raw log is uploaded to ossci-raw-job-status s3 bucket and the log classifier will read it to do classification. This means we will have to change log classifier to read from this bucket as well. I'm thinking just add an input parameter to log classifier like https://github.com/pytorch/test-infra/pull/6723/files Also upload the temp results to a temp attribute instead of the real one To overwrite the conclusion on HUD, I'm thinking a lambda that is s3 put trigger on the fake log being put into s3, that does something similar to log classifier where it just mutates the entry `13a990b678/aws/lambda/log-classifier/src/network.rs (L85)` to add a new field like "will_fail": true, and also triggers the log classifier to run Then we change HUD/ClickHouse to point the raw log url to the alternate place, the new "will_fail" field as the conclusion, and the temp log classifier result if needed Why always write to temp attribution/column? I am unsure about overwriting the real results with fake ones Pros: Not many changes outside of HUD/UI Cons: Lots of moving parts, lots of temp fields that will require adjustment for queries, temp fields never really get deleted Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371 Approved by: https://github.com/malfet	2025-06-13 21:21:55 +00:00
Catherine Lee	9b122aab5d	Fix set per proc memory fraction when running tests (#155631 ) env setting needs to happen before pool creation for it to take effect In theory this should fix some OOMs and also cause some OOMs, but this PR is green so idk alt options: use initializer? Pull Request resolved: https://github.com/pytorch/pytorch/pull/155631 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman	2025-06-12 01:28:08 +00:00
PyTorch MergeBot	8347268edc	Revert "Make open device registration tests standalone (#153855 )" This reverts commit 8823138e47a3200c313f6bf2d21eb689d8150f39. Reverted https://github.com/pytorch/pytorch/pull/153855 on behalf of https://github.com/clee2000 due to causing some linux aarch64 tests to fail [GH job link](https://github.com/pytorch/pytorch/actions/runs/15566289293/job/43832373302) [HUD commit link](`8823138e47`), should be easy fix, rename in places where its mentioned, there might be more than just aarch64 though ([comment](https://github.com/pytorch/pytorch/pull/153855#issuecomment-2960191503))	2025-06-10 18:11:24 +00:00
Joel Schlosser	8823138e47	Make open device registration tests standalone (#153855 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153855 Approved by: https://github.com/janeyx99	2025-06-10 17:33:26 +00:00
soulitzer	2af78d368f	Skip another test file that doesn't run gradcheck for slow gradcheck (#154852 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154852 Approved by: https://github.com/albanD	2025-06-04 07:47:09 +00:00
Alessandro Sangiorgi	f57754e815	[Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#154618 ) This is a follow-up PR of the reverted one https://github.com/pytorch/pytorch/pull/148981 re-opening for visibility : Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key. Motivation Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belongs to a given best config. The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging. Also, since Triton already stores cubin/hsaco in its cache, developers/researchers can avoid to set store_cubin = True since they can get the cubin/hsaco in the Triton cache and with the code provided in this PR, they can easily match the best_config with the right Triton cache directory for the "best" kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154618 Approved by: https://github.com/jansel	2025-05-30 19:30:25 +00:00
soulitzer	733e684b11	Skip test file that doesn't run gradcheck for slow gradcheck (#154509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154509 Approved by: https://github.com/malfet	2025-05-29 16:32:26 +00:00
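
Note on commit e5a81aa7ba (#155115): the message says that, because of byte order, a copied int32_t has to land in the last bytes of the wider slot so that the int64_t keeps the same value. The sketch below is only a Python illustration of that point, not the C++ code from the PR. The helper name `copy_int32_into_int64_slot` is hypothetical, and the sketch assumes non-negative values so that zero padding is enough.

```python
def copy_int32_into_int64_slot(value: int, byteorder: str, at_end: bool) -> int:
    """Hypothetical helper: copy the 4 bytes of a non-negative int32 into a
    zeroed 8-byte slot, then read the whole slot back as a 64-bit integer."""
    raw = value.to_bytes(4, byteorder)   # the 4 source bytes in the given byte order
    slot = bytearray(8)                  # zero-initialised 8-byte destination
    if at_end:
        slot[4:] = raw                   # place the bytes in the last 4 positions
    else:
        slot[:4] = raw                   # place the bytes in the first 4 positions
    return int.from_bytes(slot, byteorder)


# Little-endian: copying into the first bytes preserves the value.
little_first = copy_int32_into_int64_slot(7, "little", at_end=False)

# Big-endian (as on s390x): the same copy shifts the value by 32 bits...
big_first = copy_int32_into_int64_slot(7, "big", at_end=False)

# ...while copying into the last bytes preserves it.
big_last = copy_int32_into_int64_slot(7, "big", at_end=True)

print(little_first, big_first, big_last)
```

On a big-endian layout the low-order bytes sit at the end of the slot, which is why the copy target differs between the two byte orders.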


784 Commits