Summary: Because we increment the counter after running the callback, the increment never happens when the callback raises, which trips the assertion error. Increment first to avoid this.
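A minimal sketch of the ordering fix (the class and method names below are illustrative, not the actual handler API in torch._dynamo):

```python
class CompileEventCounter:
    """Illustrative stand-in for the callback handler's internal counter."""

    def __init__(self) -> None:
        self.active = 0

    def on_start(self, callback) -> None:
        # Increment *before* running the callback: if the callback raises,
        # the count still matches the decrement performed in on_end(),
        # so the balance assertion below cannot fire spuriously.
        self.active += 1
        callback()

    def on_end(self, callback) -> None:
        callback()
        self.active -= 1
        assert self.active >= 0, "unbalanced compilation callbacks"
```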
Test Plan:
tba
Rollback Plan:
Differential Revision: D77475650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157185
Approved by: https://github.com/xmfan
These hooks are used by internal stuck job detection to associate compilation events with the compile lease. Previously, we only had events for Dynamo and Inductor compilation, and since the callback handler was recently updated to ignore nested events, the Inductor event was effectively only used by lazy backward.
Here, I remove the Inductor event and add an explicit lazy backward one. Additionally, I add other runtime compilation events: autotuning and cudagraphs. I also expose the CompileId as a string to avoid imports; this will let internal UIs track each graph's contribution to the timeout.
```python
import enum


class CallbackTrigger(enum.Enum):
    # most common case, dynamo attempts to trace a new frame
    DYNAMO = 1
    # backward compilation can be deferred to runtime
    LAZY_BACKWARD = 2
    # some backends autotune at runtime
    TRITON_AUTOTUNING = 3
    # cudagraphs record at runtime
    CUDAGRAPH_RECORDING = 4
```
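Continuing from the enum above, here is a hypothetical consumer-side sketch; the callback signature and registration mechanism are assumptions, not the real API. The point is that, because the CompileId is exposed as a plain string, the consumer needs no torch imports to attribute time per graph:

```python
from collections import defaultdict

# Accumulated compile time per graph, keyed by the CompileId string.
time_per_graph = defaultdict(float)


def on_compile_event(trigger, compile_id: str, duration_s: float) -> None:
    # Hypothetical callback: record each graph's contribution to the
    # stuck-job-detection timeout and extend the compile lease.
    time_per_graph[compile_id] += duration_s
    print(f"extend lease: trigger={trigger.name} graph={compile_id}")
```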
Differential Revision: [D75092426](https://our.internmc.facebook.com/intern/diff/D75092426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153596
Approved by: https://github.com/masnesral
Summary:
In SJD (stuck job detection), we register callbacks to be notified of an active compilation. Using this information, we can grant additional time to the training loop.
The callbacks currently do not account for the entire compilation time, and in several cases the end callback is not called at all.
This leads to a bunch of APS jobs getting terminated incorrectly: https://fburl.com/scuba/mast_hpc_job_run_status/ondwzt2w
In this diff, we install a context manager that calls the start and end callbacks, similar to how we log counters and other information.
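A minimal sketch of the approach (the hook names are illustrative; the real change wires this into the existing dynamo callback handler):

```python
import contextlib


@contextlib.contextmanager
def compilation_callbacks(start_callbacks, end_callbacks):
    # Fire the start callbacks on entry and always fire the end callbacks on
    # exit, even if compilation raises, so SJD sees a complete start/end pair
    # and the extra time granted to the training loop is bounded.
    for cb in start_callbacks:
        cb()
    try:
        yield
    finally:
        for cb in end_callbacks:
            cb()
```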
Test Plan:
```
buck2 run mode/opt //aps_models/examples/dlrm:dlrm_train_app -- --config-name train_mast_fsdp_torchdynamo launcher.data_project=apf_ai_infra launcher.fbl_entitlement=ai_infra_training_rnd_tc launcher.hardware=TC_ANY_80G
```
Led to https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-atuljangra-ef2285ba9a?job_attempt=0&version=0&env=prod
https://fburl.com/ai_infra/sv0a213y confirms that the callback was correctly called and a lease was properly installed, which takes over the training loop lease.
{F1965137027}
Differential Revision: D66347023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141323
Approved by: https://github.com/ezyang