pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
PyTorch MergeBot	1e42fde45e	Revert "[CUDA] Add experimental green context support for SM carveout (#159104 )" This reverts commit 746fe78ecd52f3e9cfddda41f0ac82dada7bdd0b. Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/malfet due to Breaks Windows CD build ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3378675515))	2025-10-07 20:51:22 +00:00
Eddie Yan	746fe78ecd	[CUDA] Add experimental green context support for SM carveout (#159104 ) Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here... Built on top of @drisspg 's branch Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104 Approved by: https://github.com/ngimel Co-authored-by: drisspg <drisspguessous@gmail.com>	2025-10-06 23:11:23 +00:00
Yuanyuan Chen	5103ecc5d8	[1/N] Fix clang-tidy readability checks (#164561 ) Check all `.cpp` files except `jit` files for readability thoroughly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164561 Approved by: https://github.com/Skylion007	2025-10-04 09:40:38 +00:00
PyTorch MergeBot	8ec8c14ace	Revert "[CUDA] Add experimental green context support for SM carveout (#159104 )" This reverts commit 3c59351c6ea2fc29d346903e28e95c5f4d0ccdbb. Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/clee2000 due to failed lint, pyfmt not caught pyi file, I think they need special handling since theyre not in the changed files list? ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3367077208))	2025-10-03 20:15:56 +00:00
Eddie Yan	3c59351c6e	[CUDA] Add experimental green context support for SM carveout (#159104 ) Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here... Built on top of @drisspg 's branch Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104 Approved by: https://github.com/ngimel Co-authored-by: drisspg <drisspguessous@gmail.com>	2025-10-03 18:59:12 +00:00
Banit Agrawal	6bb586eafd	[PyTorch / Sigrid GPU] Fixes in pinned stats collection and add new ODS pinned memory stats (#164412 ) We do some fixes in pinned memory allocation stats collection and better differentiate between active vs allocated bytes. Reviewed By: bbus, sayitmemory Differential Revision: D83162346 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164412 Approved by: https://github.com/mradmila	2025-10-02 08:04:05 +00:00
Valentin Andrei	bb5be56619	[torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942 ) In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators. Testing: ``` import torch if torch.cuda.is_available(): device = torch.cuda.current_device() mod = torch.get_device_module('cuda') hw = mod._device_limits.GPULimits(device) print(hw.get_tflops_per_second(torch.float16)) print(hw.get_tflops_per_second(torch.float32)) print(hw.get_tflops_per_second(torch.float64)) print(hw.get_tflops_per_second(torch.bfloat16)) print(hw.get_tflops_per_second(torch.int8)) print(hw.get_memory_bandwidth_Bps() / 1e9) print(hw.get_shared_memory_bandwidth_Bps() / 1e9) # Output on an H100 GPU 1070.53056 535.26528 66.90816 1070.53056 2141.06112 4893.696 33454.08 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942 Approved by: https://github.com/ngimel, https://github.com/albanD	2025-09-23 04:48:19 +00:00
thenumberouscode	4a160dae3c	[CUDA] revert PR 130472 (#162950 ) This change may also resolve https://github.com/pytorch/pytorch/issues/161789, though verification is still needed. PR #130472 would introduced the problem of freeing the same address without clean metadata. according to the below discussion, reverted it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162950 Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed	2025-09-19 19:50:44 +00:00
PyTorch MergeBot	4b7aed89d8	Revert "[torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942 )" This reverts commit 627482a7b7780752c0e7aea034a2eb2db5899fcc. Reverted https://github.com/pytorch/pytorch/pull/162942 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it needs some fixes for CUDA 13 ([comment](https://github.com/pytorch/pytorch/pull/162942#issuecomment-3308784448))	2025-09-18 17:49:16 +00:00
vandrei	627482a7b7	[torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942 ) In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators. Testing: ``` import torch if torch.cuda.is_available(): device = torch.cuda.current_device() mod = torch.get_device_module('cuda') hw = mod._device_limits.GPULimits(device) print(hw.get_tflops_per_second(torch.float16)) print(hw.get_tflops_per_second(torch.float32)) print(hw.get_tflops_per_second(torch.float64)) print(hw.get_tflops_per_second(torch.bfloat16)) print(hw.get_tflops_per_second(torch.int8)) print(hw.get_memory_bandwidth_Bps() / 1e9) print(hw.get_shared_memory_bandwidth_Bps() / 1e9) # Output on an H100 GPU 1070.53056 535.26528 66.90816 1070.53056 2141.06112 4893.696 33454.08 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942 Approved by: https://github.com/ngimel	2025-09-18 06:40:07 +00:00
Frank Lin	0c0e056a9e	[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 ) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it capturing graph) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True\|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * Free marker: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * Terminal: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered reusable on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)). In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352 Approved by: https://github.com/ngimel, https://github.com/eqy Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-04 17:21:26 +00:00
Yu, Guangye	f8746b878d	Add uuid to XPU device properties (#161392 ) # Motivation Fix https://github.com/intel/torch-xpu-ops/issues/1955 Refer to https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#device-uuid, `ext::intel::info::device::uuid` returns `std::array<unsigned char, 16>` as the UUID. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161392 Approved by: https://github.com/EikanWang, https://github.com/albanD	2025-09-02 06:41:32 +00:00
PyTorch MergeBot	63a9c23fe9	Revert "[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 )" This reverts commit 190c391a28845a14df26abb228d26aa813efb20c. Reverted https://github.com/pytorch/pytorch/pull/158352 on behalf of https://github.com/atalman due to Broke cuda 13.0 nightly builds https://github.com/pytorch/pytorch/actions/runs/17382188549/job/49341981474 ([comment](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3242871629))	2025-09-01 16:27:03 +00:00
Frank Lin	190c391a28	[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 ) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it capturing graph) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True\|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * Free marker: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * Terminal: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered reusable on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)). In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352 Approved by: https://github.com/ngimel Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-01 09:25:01 +00:00
Yu, Guangye	b7b9fb9962	Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" (#161627 ) This reverts commit c1145852a5eac96f5551b5d1805109ce4dc5e1fa. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161627 Approved by: https://github.com/atalman ghstack dependencies: #161625, #161626	2025-08-27 21:37:14 +00:00
Yu, Guangye	c03d8d4082	Revert "Generalize torch._C._set_allocator_settings to be generic (#156175 )" (#161626 ) This reverts commit 908c5cc4c0f22d141776bde47c296b5186691855. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161626 Approved by: https://github.com/atalman ghstack dependencies: #161625	2025-08-27 21:37:14 +00:00
Yu, Guangye	06ddaf1e0a	Revert "Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" (#160999 )" (#161625 ) This reverts commit a818fa77e3a72271f144514ef349c5a666313205. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161625 Approved by: https://github.com/atalman	2025-08-27 21:34:12 +00:00
Joshua Su	a818fa77e3	Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" (#160999 ) Summary: reverting this diff since it caused S551328. Please see D80217492 for dertails. Test Plan: NA Rollback Plan: Differential Revision: D80553314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160999 Approved by: https://github.com/izaitsevfb, https://github.com/jingsh	2025-08-20 15:04:36 +00:00
Yu, Guangye	908c5cc4c0	Generalize torch._C._set_allocator_settings to be generic (#156175 ) # Motivation This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`. Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175 Approved by: https://github.com/albanD ghstack dependencies: #159629, #150312, #156165	2025-08-05 04:08:42 +00:00
Yu, Guangye	c1145852a5	Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165 Approved by: https://github.com/albanD ghstack dependencies: #159629, #150312	2025-08-05 04:08:42 +00:00
PyTorch MergeBot	90f13f3b2a	Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" This reverts commit 1fc010a9d8ea95bb74e54b31d17eba56ef16c27c. Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))	2025-08-01 03:24:54 +00:00
PyTorch MergeBot	cb9b74872b	Revert "Generalize torch._C._set_allocator_settings to be generic (#156175 )" This reverts commit d3ce45012ed42cd1e13d5048b046b781f0feabe0. Reverted https://github.com/pytorch/pytorch/pull/156175 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))	2025-08-01 03:24:54 +00:00
Yu, Guangye	d3ce45012e	Generalize torch._C._set_allocator_settings to be generic (#156175 ) # Motivation This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`. Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175 Approved by: https://github.com/albanD ghstack dependencies: #149601, #157908, #150312, #156165	2025-07-30 06:37:15 +00:00
Yu, Guangye	1fc010a9d8	Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165 Approved by: https://github.com/albanD ghstack dependencies: #149601, #157908, #150312	2025-07-30 06:37:15 +00:00
PyTorch MergeBot	ea5f88dca6	Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" This reverts commit e40ade5182233f548b25f2732effe3719d16e9ad. Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but because https://github.com/pytorch/pytorch/pull/157908 has been reverted + this PR caused issue earlier, I think it is better to revert the whole stack and reland it from scratch to be sure ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3074897532))	2025-07-15 18:24:36 +00:00
Yu, Guangye	e40ade5182	Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165 Approved by: https://github.com/albanD ghstack dependencies: #150312	2025-07-15 10:14:35 +00:00
PyTorch MergeBot	e8cca7bac7	Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 )" This reverts commit 85857181ebca86e9c709e9922a9d9ef41a9c4ef9. Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing to build PyTorch internally ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3070218901))	2025-07-14 16:33:48 +00:00
Yu, Guangye	85857181eb	Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165 Approved by: https://github.com/albanD ghstack dependencies: #149601, #157908, #150312	2025-07-11 11:41:34 +00:00
Aidyn-A	4a26bb8a12	[C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900 ) Fixes #155668 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900 Approved by: https://github.com/ngimel	2025-06-17 18:59:44 +00:00
PyTorch MergeBot	365ce465f3	Revert "[C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900 )" This reverts commit 8142a0286016e63a0e91b5667e1fb1a5e868ffd7. Reverted https://github.com/pytorch/pytorch/pull/155900 on behalf of https://github.com/clee2000 due to causing some sort of hang? in test_distributed_spawn [GH job link](https://github.com/pytorch/pytorch/actions/runs/15678895788/job/44168117193) [HUD commit link](`8142a02860`) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/155900#issuecomment-2977365699))	2025-06-16 16:59:25 +00:00
Aidyn-A	8142a02860	[C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900 ) Fixes #155668 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900 Approved by: https://github.com/ngimel	2025-06-16 10:55:47 +00:00
Yu, Guangye	d84efde3f0	Move _storage_Use_Count to be gerneric (#155451 ) # Motivation `torch._C._storage_Use_Count` should be a generic API that is not aware of device type. It is also used in `337cd7c53d/torchtune/training/_activation_offloading.py (L323)` to do some memory optimization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155451 Approved by: https://github.com/albanD	2025-06-12 01:39:04 +00:00
Shivam Raikundalia	1083bc749d	[Memory Snapshot] Add Flag to Toggle Global and Local Callbacks for Annotations (#154932 ) Summary: There are some cases where we want only local annotations for memory snapshot such as executing inside the cudastream callback, which cannot execute CUDA operators. Thus the cuda errors happen: Exception in RecordFunction callback: CUDA error: operation not permitted However, we need to have an option to turn on the globally so that on-demand snapshot can get annotations. Additionally, there may be some cases in which auto-trace will also want annotations using record functions so we expose the flag to the auto-trace as well. Test Plan: Run MVAI executable and see that the errors go away Rollback Plan: Differential Revision: D75831687 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154932 Approved by: https://github.com/mzzchy, https://github.com/sanrise	2025-06-04 23:15:19 +00:00
Natalia Gimelshein	f01e628e3b	Resubmit Remove MemPoolContext (#154042 ) (#154746 ) Summary: Per title Test Plan: Added tests + existing tests Differential Revision: D75695030 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154746 Approved by: https://github.com/malfet	2025-05-31 01:21:54 +00:00
PyTorch MergeBot	d173ba5a75	Revert "Remove MemPoolContext (#154042 )" This reverts commit 3b38989b5f8f918cf1ad38bdade059608544af4b. Reverted https://github.com/pytorch/pytorch/pull/154042 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154042#issuecomment-2921401100))	2025-05-30 06:53:37 +00:00
Natalia Gimelshein	3b38989b5f	Remove MemPoolContext (#154042 ) Removes MemPoolContext from custom user mempools. The ground truth for which pool should be used is in graph_pools active pool, and MemPoolContext just introduced an opportunity for the pool pointed to by MemPoolContext and active pool in graph_pools to go out of sync (see all the asserts in the code to make sure that happens, and yet it still could happen in a multithread scenario, see my recent PRs (#153990). Pull Request resolved: https://github.com/pytorch/pytorch/pull/154042 Approved by: https://github.com/albanD, https://github.com/syed-ahmed	2025-05-28 16:35:48 +00:00
Shivam Raikundalia	dbb4444ce3	[Memento] Add PT2 to Memory Snapshot (#152707 ) Summary: To add PT2 information to memory snapshot we piggyback off of the Kineto implementation using record_function similar to adding the user annotations. To do this we add the following: 1. Stack implementation that we instantiate to keep track of which compile context stack we are currently in (top element of the stack). The stack will be per device and thread-local since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl since no two threads will touch a given stack 2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in the fact that we just register them lazily and DO NOT unregister them. This is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we are registering this at the FUNCTION scope which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag set to off on default so existing jobs will be unaffected 3. Piping for compile context to pickle output Test Plan: In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658} Differential Revision: D74028214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707 Approved by: https://github.com/eqy	2025-05-12 21:12:51 +00:00
FFFrog	fd8fd01d25	[OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914 ) As the title stated. Changes: - Add get_rng_state & set_rng_state support for OpenReg - Add _lazy_init support for OpenReg - Remove redundant code for cuda/Module.cpp Pull Request resolved: https://github.com/pytorch/pytorch/pull/151914 Approved by: https://github.com/albanD	2025-05-04 09:42:08 +00:00
Boyuan Feng	d969e2ec33	[CUDAGraph Trees] support memory allocation on side stream (#152472 ) I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more. However, this approach is unsafe for multithreading. When multiple run_eager happens concurrently, we expect memory allocation to different mem_pool. Since beginAllocateToPool does not check stream, these memory allocation may happen on the same mem_pool. So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id, and during runtime checks if the current thread id matches the launching thread id. Fixes #151199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472 Approved by: https://github.com/eellison, https://github.com/ngimel	2025-05-02 04:26:35 +00:00
PyTorch MergeBot	56039b5778	Revert "[CUDAGraph Trees] support memory allocation on side stream (#152472 )" This reverts commit c620763ec2be83e37f9b31ad6663c6e82d6c0ab0. Reverted https://github.com/pytorch/pytorch/pull/152472 on behalf of https://github.com/BoyuanFeng due to should use tid instead pid ([comment](https://github.com/pytorch/pytorch/pull/152472#issuecomment-2843491656))	2025-04-30 22:18:10 +00:00
Boyuan Feng	c620763ec2	[CUDAGraph Trees] support memory allocation on side stream (#152472 ) I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more. However, this approach is unsafe for multithreading. When multiple run_eager happens concurrently, we expect memory allocation to different mem_pool. Since beginAllocateToPool does not check stream, these memory allocation may happen on the same mem_pool. So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id, and during runtime checks if the current thread id matches the launching thread id. Fixes #151199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472 Approved by: https://github.com/eellison	2025-04-30 17:45:07 +00:00
PyTorch MergeBot	3962b8f1e0	Revert "[OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914 )" This reverts commit 64a55b531f4f4ae2b35175ab5d9a30a856b0d6ef. Reverted https://github.com/pytorch/pytorch/pull/151914 on behalf of https://github.com/malfet due to Looks like breaks number of ROCM jobs, see `797768cd90/1` ([comment](https://github.com/pytorch/pytorch/pull/151914#issuecomment-2839691038))	2025-04-29 17:36:12 +00:00
FFFrog	64a55b531f	[OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914 ) As the title stated. Changes: - Add get_rng_state & set_rng_state support for OpenReg - Add _lazy_init support for OpenReg - Remove redundant code for cuda/Module.cpp Pull Request resolved: https://github.com/pytorch/pytorch/pull/151914 Approved by: https://github.com/albanD	2025-04-29 11:18:12 +00:00
cyy	b34146a093	Fix initGdsBindings declaration (#152277 ) Move initGdsBindings into the correct namespace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152277 Approved by: https://github.com/Skylion007	2025-04-27 17:04:56 +00:00
Yu, Guangye	14e3ffb1ff	Deprecate host allocator legacy APIs (#151437 ) # Motivation This PR aims to deprecate the host allocator legacy API and recommend users to use the unified API `getHostAllocator(device_type)` APIs, such as: ```cpp at::getHostAllocator(device_type)->allocate(...); at::getHostAllocator(device_type)->empty_cache(); at::getHostAllocator(device_type)->record_event(...); at::getHostAllocator(device_type)->get_stats(); at::getHostAllocator(device_type)->reset_accumulated_stats(); at::getHostAllocator(device_type)->reset_peak_stats(); ``` # Additional Context TODO: - [ ] Move is_pinned from `AcceleratorHookInterface` to `HostAllocator` - [ ] Deprecate `getPinnedMemoryAllocator` inside `AcceleratorHookInterface` and recommend using `getHostAllocator` instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151437 Approved by: https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #151403, #151431	2025-04-22 03:13:24 +00:00
Evgeny Fiksman	1a6effc5d8	[torch] Expose PCI info from CUDA device (#151672 ) Summary: PR#125083 add cuda device UUID info, but due to meta internal [version of ROCM the code was excluded](https://github.com/pytorch/pytorch/pull/125083?fbclid=IwY2xjawJvLnNleHRuA2FlbQIxMQABHlY55crrkTqWBWTsr2HVfuqnZ3R1GHR3o9Kf1o3h3uvyawEmCEdhdT48iY1P_aem_8tfrGrWE9SxFYasGfH8kCQ#issuecomment-2103315320). This change will ensure meta internal code is built and PCI info is available Test Plan: pass CI Differential Revision: D73253426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151672 Approved by: https://github.com/Skylion007	2025-04-21 19:55:19 +00:00
Xiaodong Wang	88b0553c58	[AMD] Remove fbcode limit for uuid (#151652 ) Summary: We're now w/ later rocm version so ok to add uuid back. Test Plan: sandcastle Differential Revision: D73240086 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151652 Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/houseroad	2025-04-18 20:37:09 +00:00
Yu, Guangye	b0810168a3	Generalize poison fork logic for each device backend (#144664 ) # Motivation Generalize the posion_fork code to make it reusable across different devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664 Approved by: https://github.com/EikanWang, https://github.com/albanD	2025-04-13 09:54:30 +00:00
PyTorch MergeBot	a0ab243c3a	Revert "Generalize poison fork logic for each device backend (#144664 )" This reverts commit 83bd0b63b55f224fada6d5f6dd7eb5b4cb3072fb. Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2795157082))	2025-04-10 21:02:14 +00:00
Yu, Guangye	83bd0b63b5	Generalize poison fork logic for each device backend (#144664 ) # Motivation Generalize the posion_fork code to make it reusable across different devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664 Approved by: https://github.com/EikanWang, https://github.com/albanD	2025-04-10 02:34:53 +00:00

1 2 3 4 5 ...

358 Commits