Summary: Because we increment the counter after running the callback, the increment never happens when the callback raises, which trips the assertion error. Increment first to avoid this.
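A minimal sketch of the ordering fix (the class and method names below are illustrative, not the actual handler API in torch._dynamo):

```python
class CompileEventCounter:
    """Illustrative stand-in for the callback handler's internal counter."""

    def __init__(self) -> None:
        self.active = 0

    def on_start(self, callback) -> None:
        # Increment *before* running the callback: if the callback raises,
        # the count still matches the decrement performed in on_end(),
        # so the balance assertion below cannot fire spuriously.
        self.active += 1
        callback()

    def on_end(self, callback) -> None:
        callback()
        self.active -= 1
        assert self.active >= 0, "unbalanced compilation callbacks"
```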
Test Plan:
tba
Rollback Plan:
Differential Revision: D77475650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157185
Approved by: https://github.com/xmfan
These hooks are used by internal stuck job detection to associate compilation events with the compile lease. Previously, we only had events for Dynamo and Inductor compilation, and since the callback handler was recently updated to ignore nested events, the Inductor event was effectively only used by lazy backward.
Here, I remove the Inductor event and add an explicit lazy backward one. Additionally, I add other runtime compilation events: autotuning and cudagraphs. I also expose the CompileId as a string to avoid imports; this will let internal UIs track each graph's contribution to the timeout.
```python
import enum


class CallbackTrigger(enum.Enum):
    # most common case, dynamo attempts to trace a new frame
    DYNAMO = 1
    # backward compilation can be deferred to runtime
    LAZY_BACKWARD = 2
    # some backends autotune at runtime
    TRITON_AUTOTUNING = 3
    # cudagraphs record at runtime
    CUDAGRAPH_RECORDING = 4
```
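Continuing from the enum above, here is a hypothetical consumer-side sketch; the callback signature and registration mechanism are assumptions, not the real API. The point is that, because the CompileId is exposed as a plain string, the consumer needs no torch imports to attribute time per graph:

```python
from collections import defaultdict

# Accumulated compile time per graph, keyed by the CompileId string.
time_per_graph = defaultdict(float)


def on_compile_event(trigger, compile_id: str, duration_s: float) -> None:
    # Hypothetical callback: record each graph's contribution to the
    # stuck-job-detection timeout and extend the compile lease.
    time_per_graph[compile_id] += duration_s
    print(f"extend lease: trigger={trigger.name} graph={compile_id}")
```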
Differential Revision: [D75092426](https://our.internmc.facebook.com/intern/diff/D75092426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153596
Approved by: https://github.com/masnesral
Summary:
In SJD (stuck job detection), we register callbacks to be notified of an active compilation. Using this information, we can grant additional time to the training loop.
The callbacks currently do not account for the entire compilation time, and in several cases the end callback is not called at all.
This leads to a bunch of APS jobs getting terminated incorrectly: https://fburl.com/scuba/mast_hpc_job_run_status/ondwzt2w
In this diff, we install a context manager that calls the start and end callbacks, similar to how we log counters and other information.
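A minimal sketch of the approach (the hook names are illustrative; the real change wires this into the existing dynamo callback handler):

```python
import contextlib


@contextlib.contextmanager
def compilation_callbacks(start_callbacks, end_callbacks):
    # Fire the start callbacks on entry and always fire the end callbacks on
    # exit, even if compilation raises, so SJD sees a complete start/end pair
    # and the extra time granted to the training loop is bounded.
    for cb in start_callbacks:
        cb()
    try:
        yield
    finally:
        for cb in end_callbacks:
            cb()
```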
Test Plan:
```
buck2 run mode/opt //aps_models/examples/dlrm:dlrm_train_app -- --config-name train_mast_fsdp_torchdynamo launcher.data_project=apf_ai_infra launcher.fbl_entitlement=ai_infra_training_rnd_tc launcher.hardware=TC_ANY_80G
```
Led to https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-atuljangra-ef2285ba9a?job_attempt=0&version=0&env=prod
https://fburl.com/ai_infra/sv0a213y confirms that the callback was correctly called and a lease was properly installed, which takes over the training loop lease.
{F1965137027}
Differential Revision: D66347023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141323
Approved by: https://github.com/ezyang