mirror of
https://github.com/pytorch/pytorch.git
synced 2025-10-20 12:54:11 +08:00
Summary: # Context: When memory leak happens, it usually trigger the OOM in the later iterations. The snapshot of full iteration will be huge and hard to interpret. On CUDA side, they provide OOM observer which generates snapshot when OOM happens with latest 1,500,000 entries for debugging. In this diff, we want to implement the feature on MTIA side Test Plan: Run this test with last diff in the stack. ``` buck run @//mode/opt kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test ``` As shown, the memory_snapshot is generated when oom happens Log: P1794792326 Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355} Differential Revision: D71993315 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160 Approved by: https://github.com/sraikund16