frozenleaves/pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Commit Graph

Author	SHA1	Message	Date
Sam Larsen

8bb8c3997b

[inductor] parallel compile: add import of thread_safe_fork for internal (#137155)

Summary: We had a report of crashes in parallel compile subprocesses linked to reading justknobs. See https://fburl.com/workplace/14a4mcbh internally. This is a known issue with justknobs, and we don't have much control over where knobs are evaluated: some are read in inductor (`pytorch/remote_cache:autotune_memcache_version`), but many are read by the triton compiler. According to the advice at https://fburl.com/workplace/imx9lsx3, we can import thread_safe_fork, which installs functionality to destroy certain singletons before forking and re-enable them afterward. This approach works for the failing workload.
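The destroy-before-fork, re-enable-after idea described above can be sketched with the standard library's `os.register_at_fork`. This is only an illustration of the general pattern, not the internal thread_safe_fork module, whose implementation is not shown here. `KnobClient`, `install_fork_hooks`, and `child_sees_fresh_client` are hypothetical names invented for the example, and the sketch assumes a POSIX platform where `os.fork` is available.

```python
import os


class KnobClient:
    """Hypothetical stand-in for a process-wide singleton that is unsafe to share across fork."""

    _instance = None

    def __init__(self):
        # Record which process created this instance.
        self.owner_pid = os.getpid()

    @classmethod
    def get(cls):
        # Lazily create the singleton on first use.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    @classmethod
    def destroy(cls):
        cls._instance = None


def _before_fork():
    # Tear the singleton down so the child does not inherit its state.
    KnobClient.destroy()


def _after_fork():
    # Rebuild the singleton; runs in both the parent and the child.
    KnobClient.get()


def install_fork_hooks():
    # Assumption: a real module would do this as an import-time side effect.
    os.register_at_fork(
        before=_before_fork,
        after_in_parent=_after_fork,
        after_in_child=_after_fork,
    )


def child_sees_fresh_client() -> bool:
    """Fork once and report whether the child got a singleton it created itself."""
    KnobClient.get()  # make sure the parent holds a live instance before forking
    pid = os.fork()
    if pid == 0:
        ok = KnobClient.get().owner_pid == os.getpid()
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    install_fork_hooks()
    print(child_sees_fresh_client())
```

Without the hooks, the child in this sketch would inherit the parent's instance, which is the kind of stale cross-process state the commit works around by importing a module for its side effects.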

Test Plan: See D63719673 where the reporting user was kind enough to provide us with a local repro. Without the relevant import, we can reproduce the crash. With the import, the training runs successfully to completion.

Differential Revision: D63736829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137155
Approved by: https://github.com/xmfan, https://github.com/eellison

2024-10-03 17:37:21 +00:00

1 Commit