Commit Graph

51 Commits

Author SHA1 Message Date
a1114beed2 Deprecate overlapped functions in CUDAAllocatorConfig (#165289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165289
Approved by: https://github.com/albanD
ghstack dependencies: #165288
2025-10-19 15:34:26 +00:00
219fb6aafc Refactor CUDAAllocatorConfig using ConfigTokenizer (#165281)
* #165129
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165281
Approved by: https://github.com/albanD
ghstack dependencies: #165129, #165131, #165135, #165136
2025-10-16 15:26:50 +00:00
515b5ff539 Remove unused code in CUDAAllocatorConfig (#165136)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165136
Approved by: https://github.com/Skylion007
ghstack dependencies: #165129, #165131, #165135
2025-10-16 15:26:50 +00:00
608a6d4a26 Reuse AcceleratorAllocatorConfig in CUDAAllocatorConfig (#165135)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165135
Approved by: https://github.com/Skylion007
ghstack dependencies: #165129, #165131
2025-10-16 15:26:40 +00:00
03e5dbb26e Register CUDAAllocatorConfig to AcceleratorAllocatorConfig (#165131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165131
Approved by: https://github.com/Skylion007
ghstack dependencies: #165129
2025-10-16 15:26:28 +00:00
f39789cdab [PyTorch Pinned Allocator] Add support of reserved pinned memory segment to avoid slow paths (#164501)
Summary:
This diff adds the ability to allocate a large pinned memory segment upfront based on the provided config. This large segment is then used to serve all small pinned memory requests, avoiding expensive device-level APIs (slow paths).

Example:

PYTORCH_CUDA_ALLOC_CONF=pinned_reserve_segment_size_mb:2048

This reserves a 2GB pinned memory segment for the process; all incoming small requests are then served from this segment, and no cudaHostAlloc/cudaHostRegister APIs are called.
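
A hedged usage sketch, assuming the option is read from `PYTORCH_CUDA_ALLOC_CONF` before the first pinned allocation (the tensor sizes below are arbitrary):

```python
import os

# Reserve a 2GB pinned segment upfront; the config must be in place before
# the CUDA host allocator makes its first allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "pinned_reserve_segment_size_mb:2048"

import torch

# Small pinned allocations should now be carved out of the reserved segment
# instead of hitting cudaHostAlloc/cudaHostRegister each time.
buffers = [torch.empty(1 << 20, dtype=torch.uint8, pin_memory=True) for _ in range(8)]

# Pinned buffers behave as usual, e.g. for async host-to-device copies.
device_copy = buffers[0].to("cuda", non_blocking=True)
```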

Differential Revision: D83779074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164501
Approved by: https://github.com/yangw-dev
2025-10-03 18:11:27 +00:00
0c0e056a9e [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operations are not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG of work (we call it the **capturing graph**). We can use the capturing graph to determine when a block's old lifetime is fully ordered before any future work, and safely reuse the block within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.
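
A hedged sketch of enabling the flag from Python, assuming it is passed through `PYTORCH_CUDA_ALLOC_CONF` like other caching-allocator options (the exact plumbing is not spelled out here):

```python
import os

# Assumption: the experimental flag is set as an allocator config option.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "graph_capture_record_stream_reuse:True"

import torch

x = torch.randn(1024, 1024, device="cuda")

# Warm up on a side stream before capture, as recommended for graph capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    y = (x @ x).relu_()
torch.cuda.current_stream().wait_stream(side)

# During capture, freed blocks may now be reused within the same capture
# whenever the per-stream rule described below can be proven.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = (x @ x).relu_()

g.replay()
torch.cuda.synchronize()
```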

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of a stream (or of the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
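
A toy Python illustration of this per-stream check (not the allocator's actual C++ implementation); the DAG below roughly corresponds to Case 1 in the examples further down, and the node names and helpers are invented for the sketch:

```python
from collections import defaultdict

# Toy capturing graph: node -> nodes that depend on it. Both free markers are
# joined before stream 1's terminal, so the block is reusable on stream 1.
edges = {
    "last_use_s0": ["free_marker_s0"],
    "last_use_s1": ["free_marker_s1"],
    "free_marker_s0": ["join"],
    "free_marker_s1": ["join"],
    "join": ["terminal_s1"],
}

def predecessors(node):
    """All nodes ordered strictly before `node` in the capturing graph."""
    reverse = defaultdict(set)
    for src, dsts in edges.items():
        for dst in dsts:
            reverse[dst].add(src)
    seen, stack = set(), [node]
    while stack:
        for parent in reverse[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def reusable_on_stream(free_markers, terminal_of_s):
    # Per-stream rule: every free marker precedes every node in S's terminal set.
    return all(m in predecessors(t) for t in terminal_of_s for m in free_markers)

print(reusable_on_stream({"free_marker_s0", "free_marker_s1"}, {"terminal_s1"}))  # True
```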

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it was allocated on this stream and whether each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).

In case 4, we have a missed opportunity. Because the block's last use is not explicitly marked, we cannot tell that it may have occurred much earlier, long before the join, so we must wait for the join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel, https://github.com/eqy

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-04 17:21:26 +00:00
63a9c23fe9 Revert "[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)"
This reverts commit 190c391a28845a14df26abb228d26aa813efb20c.

Reverted https://github.com/pytorch/pytorch/pull/158352 on behalf of https://github.com/atalman due to Broke cuda 13.0 nightly builds https://github.com/pytorch/pytorch/actions/runs/17382188549/job/49341981474 ([comment](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3242871629))
2025-09-01 16:27:03 +00:00
190c391a28 [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operations are not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG of work (we call it the **capturing graph**). We can use the capturing graph to determine when a block's old lifetime is fully ordered before any future work, and safely reuse the block within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of a stream (or of the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it was allocated on this stream and whether each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).

In case 4, we have a missed opportunity. Because the block's last use is not explicitly marked, we cannot tell that it may have occurred much earlier, long before the join, so we must wait for the join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-01 09:25:01 +00:00
b9c6aa1e17 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)" (#161628)
This reverts commit ae1a706444d6c0a6019ffc936c8b36574335a5d5.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161628
Approved by: https://github.com/atalman
ghstack dependencies: #161625, #161626, #161627
2025-08-27 21:37:14 +00:00
b7b9fb9962 Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)" (#161627)
This reverts commit c1145852a5eac96f5551b5d1805109ce4dc5e1fa.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161627
Approved by: https://github.com/atalman
ghstack dependencies: #161625, #161626
2025-08-27 21:37:14 +00:00
06ddaf1e0a Revert "Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)" (#160999)" (#161625)
This reverts commit a818fa77e3a72271f144514ef349c5a666313205.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161625
Approved by: https://github.com/atalman
2025-08-27 21:34:12 +00:00
69c7b16e6f Revert "Back out "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)" (#161002)"
This reverts commit a03cc53e6f6e2fe67316cb8c74c25f5b953f445b.

Reverted https://github.com/pytorch/pytorch/pull/161002 on behalf of https://github.com/guangyey due to This PR breaks CI TestCudaMallocAsync::test_allocator_settings ([comment](https://github.com/pytorch/pytorch/pull/161002#issuecomment-3228980897))
2025-08-27 16:52:22 +00:00
a03cc53e6f Back out "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)" (#161002)
Summary: reverting this diff since it caused S551328. Please see D80217492 for details.

Test Plan:
NA

Rollback Plan:

Differential Revision: D80553588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161002
Approved by: https://github.com/jingsh, https://github.com/izaitsevfb
2025-08-26 19:04:13 +00:00
a818fa77e3 Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)" (#160999)
Summary: reverting this diff since it caused S551328. Please see D80217492 for details.

Test Plan:
NA

Rollback Plan:

Differential Revision: D80553314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160999
Approved by: https://github.com/izaitsevfb, https://github.com/jingsh
2025-08-20 15:04:36 +00:00
c1145852a5 Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312
2025-08-05 04:08:42 +00:00
ae1a706444 Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #159629
2025-08-05 04:08:04 +00:00
5cc6a0abc1 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)"
This reverts commit dfacf11f66d6512396382bdf5088f0ba9de00406.

Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))
2025-08-01 03:24:54 +00:00
90f13f3b2a Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit 1fc010a9d8ea95bb74e54b31d17eba56ef16c27c.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))
2025-08-01 03:24:54 +00:00
1fc010a9d8 Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
2025-07-30 06:37:15 +00:00
dfacf11f66 Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
2025-07-30 06:37:06 +00:00
41971335c9 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)"
This reverts commit e241a07e6b88aa49d604803bc5a6562f0d9f94d2.

Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/huydhn due to Sorry for reverting your change but because https://github.com/pytorch/pytorch/pull/157908 has been reverted + this PR caused issue earlier, I think it is better to revert the whole stack and reland it from scratch to be sure ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3074897532))
2025-07-15 18:24:36 +00:00
ea5f88dca6 Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit e40ade5182233f548b25f2732effe3719d16e9ad.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but because https://github.com/pytorch/pytorch/pull/157908 has been reverted + this PR caused issue earlier, I think it is better to revert the whole stack and reland it from scratch to be sure ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3074897532))
2025-07-15 18:24:36 +00:00
e40ade5182 Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #150312
2025-07-15 10:14:35 +00:00
e241a07e6b Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
2025-07-15 10:14:35 +00:00
6fe7456aa1 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)"
This reverts commit 03b307575a98dc1d953c9d3521a9489e0e61e70c.

Reverted https://github.com/pytorch/pytorch/pull/150312 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing to build PyTorch internally ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3070218901))
2025-07-14 16:33:48 +00:00
e8cca7bac7 Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit 85857181ebca86e9c709e9922a9d9ef41a9c4ef9.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing to build PyTorch internally ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3070218901))
2025-07-14 16:33:48 +00:00
85857181eb Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
2025-07-11 11:41:34 +00:00
03b307575a Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
2025-07-11 11:25:43 +00:00
402ae09e41 [BE] fix typos in c10/ (#156078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156078
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-06-18 10:24:44 +00:00
a5df6ffbc2 Improve IPC for Expandable Segments to use fabric handle when possible (#156074)
Improving upon https://github.com/pytorch/pytorch/pull/130890, and inspired by https://github.com/pytorch/pytorch/pull/130890#issuecomment-2278882984, we can now automatically use the fabric handle for IPC when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156074
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-06-18 05:22:06 +00:00
cyy
ce94b212c7 [Environment Variable][Rebase] Use thread-safe getenv functions (#140200)
Use our thread-safe getenv wrappers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200
Approved by: https://github.com/kwen2501, https://github.com/eqy
2025-05-02 00:41:49 +00:00
6cbf97ede8 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/izaitsevfb

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 23:42:35 +00:00
e1d143cb7b Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit ee1a2b7810126258ce64d1e22b59fae81a3f7bcb.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/izaitsevfb due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2738115728))
2025-03-19 21:12:13 +00:00
ee1a2b7810 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 03:59:55 +00:00
9d37b501db Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit 2e02c07a5d1c432547542f90de2885be9ffd13cf.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, might you be able to help get this PR landed? See D71214814 for more details on the failure. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2730104736))
2025-03-17 16:17:02 +00:00
2e02c07a5d [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily
2025-03-14 18:21:27 +00:00
a575ce0dc6 [PyTorch Pinned Allocator] Add support of background thread to process events (#135524)
Summary: Currently we process events in the regular allocation path, calling cudaEventQuery to check on the events, and this path can take locks in the libcuda driver. Processing events in the allocation path is not strictly necessary: we can move it to a background thread that keeps processing events regularly and puts freed blocks back on the free list.
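
An illustrative Python sketch of the idea (the real change lives in the C++ pinned allocator; the queue, thread, and free-list names are invented):

```python
import queue
import threading
import time

import torch

free_list = []           # blocks ready for reuse
pending = queue.Queue()  # (cuda_event, block) pairs awaiting completion

def event_processor():
    """Background loop: poll events off the hot allocation path."""
    backlog = []
    while True:
        while not pending.empty():
            backlog.append(pending.get())
        still_pending = []
        for event, block in backlog:
            if event.query():            # non-blocking completion check
                free_list.append(block)  # block becomes reusable
            else:
                still_pending.append((event, block))
        backlog = still_pending
        time.sleep(0.001)                # avoid spinning

threading.Thread(target=event_processor, daemon=True).start()

# On "free": record an event on the using stream and hand it to the worker,
# instead of polling cudaEventQuery inside the next allocation call.
def free_block(block, stream):
    event = torch.cuda.Event()
    event.record(stream)
    pending.put((event, block))
```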

Differential Revision: D62396585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524
Approved by: https://github.com/zyan0
2024-09-17 21:08:10 +00:00
48d18fbd4c [PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174)
Summary:
This diff adds an option to round the non-split blocks in the caching allocator so that they can be reused without causing lots of fragmentation for large memory segments.

For example, if we specify the max_split memory size as 400MB, then all allocations larger than 400MB will not be split. Let's say we allocated some 1024MB blocks and these are cached in the allocator. If we request a new 500MB block, we round it to the nearest power-of-2 division, that is 512MB, and add the default kLargeBuffer of 20MB, giving 532MB; since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation and a new 512MB block will be created instead. In this diff, we expose the rounding buffer as a configurable option (max_non_split_rounding_size): if 512MB + max_non_split_rounding_size is greater than 1024MB, we will use the 1024MB block and won't create a new 512MB block with cudaMalloc. This option lets us pre-allocate some large blocks and reuse them as much as possible, so we don't stall on calling cudaMalloc.
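
A small worked sketch of the decision described above, using the 500MB/1024MB numbers from the example; the helpers are invented and this is not the allocator's actual code:

```python
MB = 1024 * 1024
K_LARGE_BUFFER = 20 * MB  # default rounding buffer from the example

def round_up_pow2(size_bytes):
    """Simplified rounding: next power of two (500MB -> 512MB)."""
    p = 1
    while p < size_bytes:
        p <<= 1
    return p

def can_reuse_cached_block(request, cached_block, rounding=K_LARGE_BUFFER):
    """Reuse a cached non-split block if it fits within request + rounding."""
    return cached_block <= round_up_pow2(request) + rounding

# Default 20MB buffer: 512MB + 20MB = 532MB < 1024MB, so a fresh 512MB block
# is allocated with cudaMalloc.
print(can_reuse_cached_block(500 * MB, 1024 * MB))            # False
# With a larger configured rounding size (e.g. 512MB), the cached 1024MB
# block is reused and no cudaMalloc call is needed.
print(can_reuse_cached_block(500 * MB, 1024 * MB, 512 * MB))  # True
```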

Differential Revision: D62758758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174
Approved by: https://github.com/zyan0
2024-09-17 19:08:44 +00:00
277ab8a4c0 Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)"
This reverts commit a56e057814565b2ae33b2106b4d0136179aa18f8.

Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/jeanschmidt due to Broken internal signals, @albanD please help get this sorted :) ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2069716129))
2024-04-22 14:44:44 +00:00
cyy
a56e057814 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/malfet, https://github.com/albanD
2024-04-19 13:39:41 +00:00
61bc188f42 Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)"
This reverts commit b51f66c1950a582dd18d1b2ee67df840a8c4dbbe.

Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/malfet due to Broke gcc9 builds ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2064936414))
2024-04-18 18:53:59 +00:00
cyy
b51f66c195 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/albanD
2024-04-18 13:35:48 +00:00
f5049de242 Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)"
This reverts commit 5bef127c2ea49280e7fda4f9fa7cad6fa4078e7d.

Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/PaliC due to your using TORCH_INTERNAL_ASSERT incorrectly ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2062696010))
2024-04-17 23:44:00 +00:00
cyy
5bef127c2e [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/albanD
2024-04-16 04:39:20 +00:00
cyy
fb10e13000 [Clang-tidy header][24/N] Fix clang-tidy warnings on c10/cuda/*.{cpp,h} (#120781)
This PR begins to clean clang-tidy warnings of code in c10/cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120781
Approved by: https://github.com/ezyang
2024-03-15 05:03:22 +00:00
7973ac586d [Memory Snapshot] Add CUDAAllocatorConfig details into snapshot metadata (#119404)
Summary:
Include the CUDAAllocatorConfig at the time of snapshot into the snapshot file. These include adding variables:

```
  double garbage_collection_threshold;
  size_t max_split_size;
  size_t pinned_num_register_threads;
  bool expandable_segments;
  bool release_lock_on_cudamalloc;
  bool pinned_use_cuda_host_register;
  std::string last_allocator_settings;
  std::vector<size_t> roundup_power2_divisions;
```

Test Plan:
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'expandable_segments:True',
 'max_split_size': -1,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': True,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 0,
  '2': 0,
  '4': 0,
  '8': 0,
  '16': 0,
  '32': 0,
  '64': 0,
  '128': 0,
  '256': 0,
  '512': 0,
  '1024': 0,
  '2048': 0,
  '4096': 0,
  '8192': 0,
  '16384': 0,
  '32768': 0}}
```
`PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]',
 'max_split_size': 2097152000,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': False,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 1, '2': 1, '4': 1, '8': 1, '16': 1, '32': 1, '64': 1, '128': 1, '256': 1, '512': 2, '1024': 8, '2048': 8, '4096': 8, '8192': 8, '16384': 8, '32768': 8}
}
```
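
A hedged sketch of inspecting this from Python, assuming the config fields are surfaced in the dictionary returned by the private `torch.cuda.memory._snapshot()` API as shown in the Test Plan output (the exact location of the keys in the snapshot is an assumption):

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# Touch the allocator so there is something to snapshot.
x = torch.randn(1024, device="cuda")

snapshot = torch.cuda.memory._snapshot()
# Assumption: the CUDAAllocatorConfig values sit alongside the usual
# segment/trace data; print whichever documented keys are present.
for key in ("PYTORCH_CUDA_ALLOC_CONF", "max_split_size",
            "garbage_collection_threshold", "expandable_segments"):
    if key in snapshot:
        print(key, "=", snapshot[key])
```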

Differential Revision: D53536199

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119404
Approved by: https://github.com/zdevito
2024-02-17 01:16:37 +00:00
cyy
b72ddbab60 [Clang-tidy header][15/N] Enable clang-tidy on headers in c10/cuda and c10/mobile (#116602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116602
Approved by: https://github.com/ezyang
2024-01-18 08:15:50 +00:00
53e32d12c4 [c10] Use nested namespace in c10/cuda (#116464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116464
Approved by: https://github.com/Skylion007
2023-12-27 23:14:00 +00:00
64583c4d04 [CUDA Host Allocator] Add support of CudaHostRegister (#108488)
Summary: This diff adds another option to create CUDA pinned memory using cudaHostRegister.
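
A minimal sketch of opting into this path, assuming the option is exposed through `PYTORCH_CUDA_ALLOC_CONF` as `pinned_use_cuda_host_register` (the key that also appears in the memory-snapshot metadata earlier in this log), together with the related `pinned_num_register_threads` knob:

```python
import os

# Assumption: keys match the snapshot metadata listed earlier in this log.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "pinned_use_cuda_host_register:True,pinned_num_register_threads:4"
)

import torch

# Pinned host memory is now obtained via malloc + cudaHostRegister rather
# than cudaHostAlloc; usage from Python is unchanged.
staging = torch.empty(64 * 1024 * 1024, dtype=torch.uint8, pin_memory=True)
gpu_copy = staging.to("cuda", non_blocking=True)
```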

Differential Revision: D45843715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108488
Approved by: https://github.com/zdevito
2023-10-06 04:13:02 +00:00