pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 13:44:15 +08:00

Author	SHA1	Message	Date
PaliC	09ffba3cf7	[docs] Decorator to create a deprecation warning (#155127 ) This PR adds the `@deprecate` decorator for internal functions which we are prepping for deprecation. Add it on top of an internal function to emit a deprecation warning and keep backward compatibility (BC) with the non-internal version of the function. Tested with `python test/test_utils.py TestDeprecate.test_deprecated ` Furthermore, testing with a modified version of the test in the PR gives something like this, which is what we want: ``` /home/sahanp/repos/pytorch/test/test_utils.py:1239: UserWarning: deprecated_api is DEPRECATED, please consider using an alternative API(s). deprecated_api(1, 2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/155127 Approved by: https://github.com/albanD Co-authored-by: albanD <desmaison.alban@gmail.com>	2025-06-25 18:09:04 +00:00
Kshiteej K	694028f502	update get_default_device to also respect torch.device ctx manager (#148621 ) Fixes https://github.com/pytorch/pytorch/issues/131328 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148621 Approved by: https://github.com/ezyang	2025-06-07 14:26:17 +00:00
Marc Horowitz	f2ad2cdf1c	[utils] add try_import method for importing optional modules (#145528 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145528 Approved by: https://github.com/albanD	2025-01-25 00:14:07 +00:00
albanD	0d28188cc8	Move privateuse1 test out of test_utils and make them serial (#145380 ) Fixes https://github.com/pytorch/pytorch/issues/132720 The reason is that changing the privateuse1 module is global and so can race when other tests happen to check if it is enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145380 Approved by: https://github.com/Skylion007, https://github.com/janeyx99	2025-01-23 00:31:39 +00:00
cyy	df458be4e5	[4/N] Apply py39 ruff and pyupgrade fixes (#143257 ) ```torch/fx/passes/annotate_getitem_nodes.py``` was changed to support the new type hinting annotations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143257 Approved by: https://github.com/justinchuby, https://github.com/albanD	2025-01-04 10:47:51 +00:00
albanD	80a42399bb	Various fix for memory leak in test autograd and dataloader (#143323 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143323 Approved by: https://github.com/andrewkho, https://github.com/soulitzer ghstack dependencies: #143225	2024-12-18 13:56:59 +00:00
Oguz Ulgen	221350e3a4	Add None return type to init -- tests (#132352 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352 Approved by: https://github.com/ezyang ghstack dependencies: #132335, #132351	2024-08-01 15:44:51 +00:00
Xuehai Pan	ba48cf6535	[BE][Easy][6/19] enforce style for empty lines in import segments in `test/` (#129757 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757 Approved by: https://github.com/ezyang	2024-07-17 06:42:37 +00:00
Aaron Orenstein	3c971d2ef3	Flip default value for mypy disallow_untyped_defs [final] (#127836 ) Not requiring all functions to have types allows a lot of 'Any' types to slip in, which poison types and make mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types. The preceding stack of PRs (cut up simply to keep the number of file changes per PR "reasonable") adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts it will probably be necessary to have several passes through before landing this final PR which turns the option on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-06-12 15:28:42 +00:00
Aaron Orenstein	dcfa7702c3	Flip default value for mypy disallow_untyped_defs [1/11] (#127838 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838 Approved by: https://github.com/oulgen	2024-06-08 18:16:33 +00:00
zong	196661255f	Enable UFMT format on test/test_utils.py (#125996 ) Fixes some files in #123062 Run lintrunner on files: test/test_utils.py ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125996 Approved by: https://github.com/ezyang	2024-05-15 18:22:57 +00:00
soulitzer	fab5bd5359	[checkpoint] Improve error message when use_reentrant=True is used with .grad() (#125155 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125155 Approved by: https://github.com/albanD	2024-04-29 18:57:35 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
Andrew Gu	1d6fc0d4de	Fixed `_infer_device_type` warning in `checkpoint` (#122726 ) Previously, we were checking `len(device_types)` where `device_types` is a `list`. This meant that if there were multiple inputs, we would see something like `device_types = ["cuda", "cuda"]` and a false positive warning. We should check `len(set(device_types))`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122726 Approved by: https://github.com/soulitzer	2024-03-27 18:38:42 +00:00
Pritam Damania	512251c8f3	Use tree_map to get device ids and device types for activation checkpointing (#121462 ) `get_device_states` doesn't recursively look into nested lists/dicts to find tensors. As a result, activation checkpointing for such inputs silently produces incorrect results: `get_device_states` returns an empty result, so `fwd_device_states` is empty and no RNG state is saved here: https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L188. Fixed this by using `tree_map` for both `get_device_states` and `_infer_device_type`. Also added appropriate unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121462 Approved by: https://github.com/soulitzer	2024-03-20 21:09:21 +00:00
PyTorch MergeBot	a9d9077f12	Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639 )" This reverts commit 7c556428c74a79c6d9c272826344a0828d3f66f5. Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54286923 ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1969634480))	2024-02-28 18:57:09 +00:00
Tobias Ringwald	7c556428c7	Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639 ) Fixes #115331. This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary: - `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`. - Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`. - Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this. - Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS` [^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639 Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/huydhn	2024-02-27 07:05:48 +00:00
dilililiwhy	a358b23a6a	Keep test order due to rename_privateuse1_backend is disposable (#120464 ) Follows the change in https://github.com/pytorch/pytorch/pull/120399. As rename_privateuse1_backend is disposable, running test_external_module_register with a renamed backend may cause problems. Try to change the testcase name and keep the right order (ASCII). Pull Request resolved: https://github.com/pytorch/pytorch/pull/120464 Approved by: https://github.com/albanD	2024-02-27 05:38:43 +00:00
dilililiwhy	77692736d1	Use privateuseone during external module register test (#120399 ) Fixes #120397 Use privateuseone instead of xpu in test_external_module_register. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120399 Approved by: https://github.com/albanD, https://github.com/malfet	2024-02-22 21:32:59 +00:00
PyTorch MergeBot	fff9d98e58	Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639 )" This reverts commit e0268821dd2ea0e8a51b81c0ef3b18e77f68a33d. Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the Window failures are legit as they are failing now in trunk, i.e. `450339ab2d` ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1958428416))	2024-02-22 00:12:54 +00:00
Tobias Ringwald	e0268821dd	Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639 ) Fixes #115331. This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary: - `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`. - Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`. - Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this. - Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS` [^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639 Approved by: https://github.com/cyyever, https://github.com/albanD	2024-02-21 21:10:49 +00:00
hongxyan	b374f8987d	[ROCm] Hipify trie re-engineering and adding unit tests (#118433 ) Fixes #[117504](https://github.com/pytorch/pytorch/issues/117504) Re-engineering Hipify Trie: (1) Re-engineering Trie. (2) More documentation or comments for easier understanding (3) Created a set of unit test (class `TestHipifyTrie`) to test the Trie data structure and APIs. Test: ``` root@xxx:/development/pytorch# pytest test/test_utils.py -k TestHipifyTrie ==================================================================================================== test session starts ==================================================================================================== platform linux -- Python 3.9.18, pytest-7.3.2, pluggy-1.3.0 rootdir: /dockerx/development/pytorch configfile: pytest.ini plugins: flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, shard-0.1.2, hypothesis-5.35.1 collected 11453 items / 11445 deselected / 8 selected Running 8 items in this shard test/test_utils.py ........ [100%] ============================================================================================ 8 passed, 11445 deselected in 3.84s ============================================================================================ root@xxx:/development/pytorch# ``` Also performed diff on modified and generated contents by this tool with the original code and the new code of the hipify_python.py script. Verified no difference. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118433 Approved by: https://github.com/malfet, https://github.com/jeffdaily	2024-02-02 16:04:59 +00:00
FFFrog	327bdcdb14	Some tiny modification about torch.set/get_default_device (#116014 ) 1. fix bug of torch.set_default_device in multi-threading 2. add new interface named torch.get_default_device Fixes #115333 Fixes #115917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116014 Approved by: https://github.com/malfet, https://github.com/jansel	2023-12-19 05:08:06 +00:00
Jez Ng	c41a32a3bf	Move test_utils.py back to MYPY (#113745 ) Since MYPYNOFOLLOW is about to turn on import following, there's no reason to keep test_utils.py in the MYPYNOFOLLOW config. Moreover, I'm not sure it still takes 10 minutes to typecheck this file; adding it to the MYPY config takes `lintrunner --take MYPY --all-files` from 53s to 57s on my machine, which is substantial but not horrible. I guess we'll see how it fares on CI. (Note that we cannot simply merge MYPY and MYPYNOFOLLOW because the latter config turns on `disallow_any_generics` and so is in that sense stricter than the MYPY config.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113745 Approved by: https://github.com/clee2000	2023-11-16 01:57:58 +00:00
Catherine Lee	defb364adf	Clean up test_external_module_register (#110254 ) Caused by #109866. The test registers a new device module; the above PR checks for xpu, sees that it got registered and uses it, but it's a dummy module. This causes any test after it to fail, so I "clean up" the registered module. Another possible solution would be to run this test last lol Pull Request resolved: https://github.com/pytorch/pytorch/pull/110254 Approved by: https://github.com/huydhn	2023-09-29 17:02:13 +00:00
Edward Z. Yang	36bb7a1f42	Add fast traceback utilities (#107358 ) This adds some utilities for conveniently working with fast combined CapturedTraceback from Python. The main goal of these utilities is to make it easier for people to use CapturedTraceback as a drop-in replacement for `traceback.extract_stack`, which is 20x slower than CapturedTraceback. I port symbolic shapes to use the new CapturedTraceback code, to validate that the APIs work and are useful. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/107358 Approved by: https://github.com/zdevito, https://github.com/albanD ghstack dependencies: #107438	2023-08-18 19:05:54 +00:00
Justin Chu	73e1455327	[BE] Enable ruff's UP rules and autoformat test/ (#105434 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434 Approved by: https://github.com/albanD	2023-07-19 20:36:06 +00:00
Edward Z. Yang	e03800a93a	Add torch._utils.render_call, improve printoptions (#102623 ) - Add get_printoptions and printoptions context manager - Improve edgeitems handling when it is zero - Add render_call which can be used to conveniently print command line arguments of a function call, while suppressing actual tensor data Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/102623 Approved by: https://github.com/albanD	2023-05-31 22:08:04 +00:00
Edward Z. Yang	96ee23e198	Print restarting analysis at INFO level with a exception breadcrumb (#101573 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/101573 Approved by: https://github.com/albanD	2023-05-19 20:29:18 +00:00
Edward Z. Yang	9ba64cba55	Fix torch.utils._traceback on Python 3.11 (#101277 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/101277 Approved by: https://github.com/albanD, https://github.com/Skylion007	2023-05-14 19:03:16 +00:00
shibo	9a2a6fcfa5	add get_device_index for custom device (#98804 ) Fixes #ISSUE_NUMBER as the title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98804 Approved by: https://github.com/ngimel	2023-04-12 23:58:31 +00:00
shibo	d03799f9a5	optimize the AMP func name in custom_device_mod (#98052 ) Fixes #ISSUE_NUMBER 1. Optimize the func name of AMP in the custom device module: use `torch.foo.set_autocast_enable` instead of `torch.foo.set_autocast_foo_enable`. 2. In AMP with a custom device, use `custom_device_mod.set_autocast_enable` instead of `getattr(custom_device_mod, "set_autocast_enable")`, because we have already checked that `custom_device_mod` has the attribute `set_autocast_enable`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98052 Approved by: https://github.com/bdhirsh	2023-03-31 17:04:32 +00:00
soulitzer	51c3fd39a5	Modify all calls to checkpoint pass use_reentrant explicitly (#97376 ) Fixes #ISSUE_NUMBER This is the first step toward making use_reentrant=False the default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97376 Approved by: https://github.com/albanD	2023-03-27 13:37:42 +00:00
shibo	6b691b99da	add amp support for custom backend (#96188 ) Fixes #ISSUE_NUMBER 1. Add amp support for custom backend. 2. Optimize the file `backend_registration.py` and rename it to `custom_backend_registration.py`. Later we would register other funcs for custom backend there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96188 Approved by: https://github.com/bdhirsh	2023-03-20 20:27:35 +00:00
PyTorch MergeBot	a8f36dd646	Revert "add amp support for custom backend (#96188 )" This reverts commit cf12edee02a44009c4f06e36efa97d9a7372ab35. Reverted https://github.com/pytorch/pytorch/pull/96188 on behalf of https://github.com/kit1980 due to Broke some linalg tests : https://github.com/pytorch/pytorch/actions/runs/4420037607/jobs/7750708339	2023-03-15 00:03:19 +00:00
shibo	cf12edee02	add amp support for custom backend (#96188 ) Fixes #ISSUE_NUMBER 1. Add amp support for custom backend. 2. Optimize the file `backend_registration.py` and rename it to `custom_backend_registration.py`. Later we would register other funcs for custom backend there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96188 Approved by: https://github.com/bdhirsh	2023-03-14 20:43:21 +00:00
soulitzer	d30db9a251	Replace non-reentrant checkpoint with a rewrite that can be nested and contain grad (#90105 ) Changes: - bc-breaking change: The main difference between this and the old non-reentrant impl that it replaces is that we clear recomputed tensors on backward immediately upon unpack, even if retain_graph=True. This has the following additional implications: - Accessing _saved_tensors multiple times will silently recompute forward multiple times. - Accessing ctx.saved_tensor twice in the same backward will now raise an error. - To avoid dealing with the potential consequences, early stopping has been hidden behind a global flag that is by default False, and can be enabled via a context manager. We can remove this in a follow up. Some features of nesting as a result do not work by default. Before land: - import to check for more bc-breakingness - implement any workarounds for the bc-breaking-ness, if we decide on any - update docs to reflect new lifetime of recomputed variables - update docs to mention the early stop feature Follow ups: - enable early-stopping by default - update docs/tutorial to feature nested use cases Related docs: - code comment: https://github.com/pytorch/pytorch/pull/90105/files#diff-9dcd955620b52ce128e18e3567be88edbb238810460d1288a86fabc20e483b30R448 - design doc: https://docs.google.com/document/d/1UDLhTNv6_kvuDTRlsjfj9WdqtNaQNr8ahrvdBIB6914/edit# - retains_grad <> checkpoint https://docs.google.com/document/d/1maiGmuFUxysQL0AdYUU88kngAaXh_L0XpDcLDh_5Ors/edit Pull Request resolved: https://github.com/pytorch/pytorch/pull/90105 Approved by: https://github.com/albanD	2023-03-14 20:38:36 +00:00
Xuehai Pan	046e88a291	[BE] [3/3] Rewrite `super()` calls in test (#94592 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-12 22:20:53 +00:00
Aaron Gokaslan	8fce9a09cd	[BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308 ) Apply parts of pyupgrade to torch (starting with the safest changes). This PR only does two things: removes the need to inherit from object and removes unused future imports. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-02-07 21:10:56 +00:00
albanD	0b2dc3b3ac	[Py-3.11] Skip dynamo related tests (#94187 ) The quantization test fails to import Dynamo as expected. The traceback tool looks a lot more tricky, opened https://github.com/pytorch/pytorch/issues/94189 to investigate further. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94187 Approved by: https://github.com/malfet	2023-02-07 16:40:55 +00:00
Edward Z. Yang	8b00c54425	Add utility report_compile_source_on_error (#91069 ) Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/91069 Approved by: https://github.com/soumith, https://github.com/albanD	2023-01-11 22:54:46 +00:00
Edward Z. Yang	333540a458	Reland "Add torch.utils.device_mode" (#91796 ) Original PR https://github.com/pytorch/pytorch/pull/91525 Signed-off-by: Edward Z. Yang <ezyangfb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/91796 Approved by: https://github.com/albanD	2023-01-09 20:57:12 +00:00
PyTorch MergeBot	9b415240d4	Revert "Reland "Add torch.utils.device_mode" (#91796 )" This reverts commit 81b5eff3c383f5308416e129861a2689d717702c. Reverted https://github.com/pytorch/pytorch/pull/91796 on behalf of https://github.com/huydhn due to This breaks trunk with the following failed test https://hud.pytorch.org/failure/test_jit_save%2CTestTracer	2023-01-09 04:45:47 +00:00
Edward Z. Yang	81b5eff3c3	Reland "Add torch.utils.device_mode" (#91796 ) Original PR https://github.com/pytorch/pytorch/pull/91525 Signed-off-by: Edward Z. Yang <ezyangfb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/91796 Approved by: https://github.com/albanD	2023-01-08 03:44:56 +00:00
PyTorch MergeBot	f571ae4fdb	Revert "Make torch.device usable as a context manager (#91525 )" This reverts commit 619d52a5d296bc236ac98f40c7f7de54ab7c9d37. Reverted https://github.com/pytorch/pytorch/pull/91525 on behalf of https://github.com/mehtanirav due to Internal breakages	2023-01-05 21:34:50 +00:00
Edward Z. Yang	619d52a5d2	Make torch.device usable as a context manager (#91525 ) Fixes https://github.com/pytorch/pytorch/issues/82296 Fixes https://github.com/pytorch/pytorch/issues/27878 Fixes https://github.com/pytorch/pytorch/issues/260 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/91525 Approved by: https://github.com/albanD	2023-01-04 01:32:00 +00:00
Huy Do	0417da2288	Set a timeout value when testing multiprocess DataLoader (#91476 ) Setting a timeout value when testing multiprocess DataLoader to prevent ASAN jobs timing out after 4 hours. We are seeing multiple timeout issue running ASAN tests on HUD https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=asan for examples * Without mem leak check enabled https://github.com/pytorch/pytorch/actions/runs/3794216079/jobs/6455118197 * With mem leak check https://github.com/pytorch/pytorch/actions/runs/3792743994/jobs/6449356306 Looking a bit closer into the test, the hanging happens when multiprocess DataLoader is used in `test_utils`. Here is the snapshot of those processes when I log into the hang runner: ``` UID PID PPID C STIME TTY TIME CMD jenkins 1 0 0 Dec28 pts/0 00:00:00 bash jenkins 8 0 0 Dec28 pts/1 00:00:00 sh -c pip install dist/torch-2.0.0a0+git97db9fd-cp37-cp37m-linux_x86_64.whl[opt-einsum] && .jenkins/pytorch/test.sh jenkins 20 8 0 Dec28 pts/1 00:00:00 /bin/bash .jenkins/pytorch/test.sh jenkins 764 20 0 Dec28 pts/1 00:00:07 python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 5 5 --verbose jenkins 788 764 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -c from multiprocessing.semaphore_tracker import main;main(6) jenkins 3743 764 0 Dec28 pts/1 00:00:05 /opt/conda/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=7, pipe_handle=11) --multiprocessing-fork jenkins 3766 3743 0 Dec28 pts/1 00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3878 3766 0 Dec28 pts/1 00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3879 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3880 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3881 
3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3893 0 0 01:45 pts/2 00:00:00 /bin/bash jenkins 3904 3893 0 01:46 pts/2 00:00:00 ps -ef ``` The specific hanging test was `test_random_seed` which spawned 4 subprocesses to load data. After I killed one of them, the test could continue and printed the following stacktrace: ``` test_random_seed (__main__.TestDataLoaderUtils) ... [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel 
backend (function set_num_threads) ERROR (9345.840s) test_random_seed (__main__.TestDataLoaderUtils) ... test_random_seed errored - num_retries_left: 3 Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data data = self._data_queue.get(timeout=timeout) File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 104, in get if not self._poll(timeout): File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 257, in poll return self._poll(timeout) File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 414, in _poll r = wait([self], timeout) File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 921, in wait ready = selector.select(timeout) File "/opt/conda/lib/python3.7/selectors.py", line 415, in select fd_event_list = self._selector.poll(timeout) File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler _error_if_any_worker_fails() RuntimeError: DataLoader worker (pid 3878) is killed by signal: Terminated. 
The above exception was the direct cause of the following exception: Traceback (most recent call last): File "test_utils.py", line 469, in test_random_seed x2 = run() File "test_utils.py", line 464, in run return next(iter(dataloader)) File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 635, in __next__ data = self._next_data() File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data idx, data = self._get_data() File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data success, data = self._try_get_data() File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e RuntimeError: DataLoader worker (pid(s) 3878) exited unexpectedly [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after 
set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) ok (0.137s) ``` This doesn't fix the issue which I'll need to follow up to see why they hang. However, this should allow the test to terminate gracefully and report errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91476 Approved by: https://github.com/kit1980	2022-12-29 17:50:37 +00:00
mikey dagitses	3a1bdfee67	skip environment collection test in fbcode (#88744 ) Summary: This runs pip, which we don't have in the fbcode environment. Test Plan: Rely on CI. Differential Revision: D41156589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88744 Approved by: https://github.com/zou3519	2022-11-09 18:20:04 +00:00
soulitzer	c18eead2df	Update saved variable hooks to no longer trigger on wrapped numbers (#87316 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87316 Approved by: https://github.com/ezyang, https://github.com/albanD	2022-10-20 03:01:11 +00:00
Rohan Varma	7a411952fb	CheckpointSequential support non-reentrant (#86331 ) Closes https://github.com/pytorch/pytorch/issues/86328 Adds `use_reentrant` argument to `checkpoint_sequential`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86331 Approved by: https://github.com/zhaojuanmao, https://github.com/albanD	2022-10-06 23:10:18 +00:00
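
The device-index wraparound described in commits e0268821dd and 7c556428c7 (#119639) can be reproduced without PyTorch. The following is a minimal sketch that uses Python's `ctypes` fixed-width integers to mimic the narrowing; it does not call any PyTorch API, and the helper names `as_int8` and `roundtrip_through_uint8` are hypothetical, introduced only for illustration.

```python
import ctypes


def as_int8(value: int) -> int:
    """Truncate a Python int to a signed 8-bit value, as an int8_t field would."""
    return ctypes.c_int8(value).value


def roundtrip_through_uint8(index: int, signed_type) -> int:
    """Store index in an unsigned 8-bit field, then read it back as signed_type."""
    stored = ctypes.c_uint8(index).value  # -1 is stored as 255
    return signed_type(stored).value


# An index of 200 does not fit in int8_t and wraps to a negative value.
print(as_int8(200))  # -56

# The default index -1 survives the 8-bit field only if the reader is 8-bit too.
print(roundtrip_through_uint8(-1, ctypes.c_int8))   # -1
print(roundtrip_through_uint8(-1, ctypes.c_int16))  # 255
```

The last two lines mirror the footnote in the commit message: once the index type is 16 bits wide, the value read back from the unsigned 8-bit field stays 255 instead of wrapping back to -1.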

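
Commit 1d6fc0d4de (#122726) in the listing above describes a false-positive warning caused by counting list entries rather than distinct values. A minimal sketch of that difference follows, with plain strings standing in for device types; `has_mixed_devices_buggy` and `has_mixed_devices_fixed` are hypothetical names, not functions from `torch.utils.checkpoint`.

```python
from typing import List


def has_mixed_devices_buggy(device_types: List[str]) -> bool:
    """Counts entries, so two inputs on the same device look like two devices."""
    return len(device_types) > 1


def has_mixed_devices_fixed(device_types: List[str]) -> bool:
    """Counts distinct device types, as the commit message recommends."""
    return len(set(device_types)) > 1


same_device = ["cuda", "cuda"]
print(has_mixed_devices_buggy(same_device))  # True (false positive)
print(has_mixed_devices_fixed(same_device))  # False

print(has_mixed_devices_fixed(["cuda", "cpu"]))  # True
```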

213 Commits