Summary:
When a custom deleter was used, InefficientStdFunctionContext stored the
pointer in a std::unique_ptr<> and relied on it to call the deleter, but
std::unique_ptr<> does not call the deleter when the pointer is null. Since we
have a separate holder class anyway, take out the std::unique_ptr<> and call
the deleter directly.
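
A minimal sketch of the behavior difference described above. It is not the actual InefficientStdFunctionContext code: `counting_deleter`, `AlwaysDeleteHolder`, and the two helper functions are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <memory>

// Hypothetical counter so the example can observe deleter calls.
static int g_deleter_calls = 0;

void counting_deleter(void* /*ptr*/) { ++g_deleter_calls; }

// std::unique_ptr only invokes its deleter when the stored pointer is non-null.
int calls_via_unique_ptr(void* ptr) {
  g_deleter_calls = 0;
  {
    std::unique_ptr<void, void (*)(void*)> holder(ptr, &counting_deleter);
  }  // holder is destroyed here
  return g_deleter_calls;
}

// Hypothetical holder that always invokes the deleter, even for nullptr.
class AlwaysDeleteHolder {
 public:
  AlwaysDeleteHolder(void* ptr, void (*deleter)(void*))
      : ptr_(ptr), deleter_(deleter) {
    assert(deleter_ != nullptr);
  }
  AlwaysDeleteHolder(const AlwaysDeleteHolder&) = delete;
  AlwaysDeleteHolder& operator=(const AlwaysDeleteHolder&) = delete;
  ~AlwaysDeleteHolder() { deleter_(ptr_); }

 private:
  void* ptr_;
  void (*deleter_)(void*);
};

int calls_via_holder(void* ptr) {
  g_deleter_calls = 0;
  {
    AlwaysDeleteHolder holder(ptr, &counting_deleter);
  }  // holder is destroyed here
  return g_deleter_calls;
}
```

With a null pointer, `calls_via_unique_ptr` observes no deleter call, while `calls_via_holder` observes exactly one.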
Fixes #117273
Test Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117418
Approved by: https://github.com/wjakob, https://github.com/yanboliang
Summary: Similar to the alloc and dealloc events already reported in the PyTorch profiler, Out of Memory events are now reported as well. This is useful for performance troubleshooting.
Test Plan: Added test_oom_tracing to test/test_profiler.py
Differential Revision: D36268132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80050
Approved by: https://github.com/robieta
Summary:
Report the size of the memory pointed to, the total allocated memory, and the total reserved size all in one report.
`ptr` and `alloc_size` will be used for associating with op trace.
`allocated_size`, `reserved_size` will be used for memory trace.
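
A rough sketch of what a combined report could look like. Only the four field names come from the summary above; `MemoryEvent`, `MemoryReporter`, the fixed reserved size, and the convention that a negative `alloc_size` means a free are assumptions made for illustration, not the profiler's actual interface.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record combining per-pointer and running-total information.
struct MemoryEvent {
  void* ptr;                   // associates the event with an op trace
  std::int64_t alloc_size;     // assumed: positive = alloc, negative = free
  std::int64_t allocated_size; // total allocated after this event
  std::int64_t reserved_size;  // total reserved after this event
};

// Hypothetical reporter that keeps the running totals.
class MemoryReporter {
 public:
  explicit MemoryReporter(std::int64_t reserved) : reserved_(reserved) {}

  void report(void* ptr, std::int64_t alloc_size) {
    // The running total should never drop below zero.
    assert(allocated_ + alloc_size >= 0);
    allocated_ += alloc_size;
    events_.push_back(MemoryEvent{ptr, alloc_size, allocated_, reserved_});
  }

  const std::vector<MemoryEvent>& events() const { return events_; }

 private:
  std::int64_t allocated_ = 0;
  std::int64_t reserved_;
  std::vector<MemoryEvent> events_;
};
```

Each event carries both the per-pointer fields and the totals, so one stream can feed an op trace and a memory trace.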
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61282
Reviewed By: ejguan
Differential Revision: D29796282
Pulled By: chaekit
fbshipit-source-id: 5314c867632d3af1fa9a3811b35eaa5e931a5d87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830
Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.
Test Plan: CI
Reviewed By: zertosh
Differential Revision: D27979080
fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51360
Invariant should be satisfied by call sites of allocator
ensuring that the device type makes sense.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: swolchok
Differential Revision: D26170202
Pulled By: ezyang
fbshipit-source-id: f23681f34187c0d3da794f7a8c869ea8da88365d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50843
AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D26074365
Pulled By: ezyang
fbshipit-source-id: 46e13588fad4e24828f3cc99635e9cb2223a6c2c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47796
`ThreadLocalDebugInfo::get()` is a hot function. For example, it is called by `DefaultCPUAllocator::allocate()`. Most callers do not even bother to keep the returned `shared_ptr` around, proving that they have no lifetime issues currently. For the rest, it appears that the only way that the returned pointer could become invalid is if they then called a function that swapped out `ThreadLocalDebugInfo` using `ThreadLocalStateGuard`. There are very few such paths, and it doesn't look like any current callers of `ThreadLocalDebugInfo::get()` needed a `shared_ptr` at all.
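
A simplified sketch of the idea, assuming a thread-local `shared_ptr` that owns the info while the accessor hands out a raw pointer. `DebugInfo` and `ThreadLocalInfo` are hypothetical stand-ins, not the real `ThreadLocalDebugInfo` interface.

```cpp
#include <memory>

// Hypothetical payload type.
struct DebugInfo {
  int value;
};

class ThreadLocalInfo {
 public:
  // Hot path: no reference-count traffic, just a pointer read.
  // The result is only valid until the thread-local slot is swapped.
  static DebugInfo* get() { return current_.get(); }

  // Cold path: install a new value and return the previous owner.
  static std::shared_ptr<DebugInfo> swap(std::shared_ptr<DebugInfo> next) {
    current_.swap(next);
    return next;  // now holds the previous value
  }

 private:
  static thread_local std::shared_ptr<DebugInfo> current_;
};

thread_local std::shared_ptr<DebugInfo> ThreadLocalInfo::current_;
```

The trade-off matches the audit described above: callers must not hold the raw pointer across a call that swaps the thread-local state.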
ghstack-source-id: 116979577
Test Plan:
1) reviewers to double-check audit of safety
2) run framework overhead benchmarks
Reviewed By: dzhulgakov
Differential Revision: D24902978
fbshipit-source-id: d684737cc2568534cac7cd3fb8d623b971c2fd28
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37640
Enable the oversize arena to reduce memory fragmentation. Memory requests with a large size (configurable with FLAGS_caffe2_oversize_threshold) are fulfilled from a dedicated arena separate from the existing huge page arena.
Two additional parameters are introduced to configure the 2-phase decay of the memory arena:
- caffe2_dirty_decay_ms
- caffe2_muzzy_decay_ms
In the current JEMalloc implementation, oversized allocations are purged immediately, regardless of whether they are placed in an arena. Therefore we need to extend the decay time to indefinite. Currently the default for caffe2_muzzy_decay_ms is set to -1.
We now enable the arena allocator statically. To ensure it is correctly installed regardless of static initialization order, we add a priority flag in c10::SetAllocator, and only higher priority allocators can overwrite existing ones.
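
A minimal sketch of priority-based registration, to illustrate why static initialization order stops mattering. `Allocator` and `AllocatorRegistry` are hypothetical stand-ins rather than the c10 types, and the handling of equal priorities (the existing allocator is kept) is an assumption based on the wording above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an allocator.
struct Allocator {
  const char* name;
};

class AllocatorRegistry {
 public:
  // Installs `alloc` for `device` if the slot is empty or `priority` is
  // strictly higher than the installed one. Returns true if installed.
  bool set(std::size_t device, Allocator* alloc, std::uint8_t priority = 0) {
    assert(alloc != nullptr);
    Slot& slot = slots_.at(device);
    if (slot.alloc != nullptr && priority <= slot.priority) {
      return false;  // assumed: ties keep the existing allocator
    }
    slot.alloc = alloc;
    slot.priority = priority;
    return true;
  }

  Allocator* get(std::size_t device) const { return slots_.at(device).alloc; }

 private:
  struct Slot {
    Allocator* alloc = nullptr;
    std::uint8_t priority = 0;
  };
  std::array<Slot, 4> slots_{};
};
```

Whichever order the registrations run in, the highest-priority allocator ends up installed.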
ghstack-source-id: 103276877
Test Plan:
buck test mode/dev //caffe2/caffe2/fb/init:huge_pages_allocator_test
Benchmarking a known CV model that benefits from the page arena:
```
PyTorchModelBench.cpp:183] test / base : 86.9532%
```
By adjusting `dirty_decay_ms` and `muzzy_decay_ms`, we have the following plots:
https://pxl.cl/15SWW
https://pxl.cl/15TnL
From the figures above we can see performance does not change much until dirty decay time is indefinite (set to -1). Either setting muzzy decay or dirty decay time to -1 will reach best performance, regardless of which one it is. Even setting the decay time to very long (100s, which is longer than the run), does not change the performance by much.
## Observe performance difference in production with a variety of models (WIP)
Reviewed By: dzhulgakov
Differential Revision: D21258581
fbshipit-source-id: c006f8b94f28aef0666e52f48d4e82cf0d3a48af