pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Yuanyuan Chen	85012fe167	Remove unnecessary list comprehensions (#164103 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/164103 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos	2025-09-30 03:56:54 +00:00
can-gaa-hou	7c7ae86991	[Fix] Adding missing `f` prefixes to formatted strings [2/N] (#164066 ) As stated in the title. * #164068 * #164067 * __->__ #164066 * #164065 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164066 Approved by: https://github.com/Skylion007	2025-09-29 04:40:44 +00:00
Yuanyuan Chen	3cda34ebde	[2/N] Apply ruff UP035 check in torch files (#164054 ) This is the result of applying the ruff `UP035` check. `Callable` is imported from `collections.abc` instead of `typing`. `TypeAlias` is also imported from `typing`. This PR is the follow-up of #163947. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164054 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2025-09-29 03:35:32 +00:00
Min Si	b1a4efc302	[amd] Add cudaHostFn_t to cuda_to_hip_mappings (#164007 ) Summary: See title Test Plan: ``` buck build --flagfile fbcode//mode/opt-amd-gpu fbcode//comms/ctran/algos/common/tests:ctran_algo_gpe_kernel_sync_test ``` After fix: https://www.internalfb.com/buck2/362ff91e-53f2-4b82-9536-cb84c91384a2 Before fix: failed in D83294731 (version 1): https://www.internalfb.com/sandcastle/workflow/1792432651703947243 Differential Revision: D83375414 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164007 Approved by: https://github.com/llxxee	2025-09-27 06:09:50 +00:00
Nan Zhang	8d474bdc14	Change python grid calc for MTIA back to python mode (#163601 ) Differential Revision: D83000165 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163601 Approved by: https://github.com/blaine-rister	2025-09-27 00:12:53 +00:00
Yuanyuan Chen	5daa79fd6e	Remove dataclass_slots (#163623 ) `dataclass` now has `slots` kwarg. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163623 Approved by: https://github.com/Skylion007	2025-09-26 00:54:42 +00:00
Sherlock Huang	f9821b1be7	DebugMode supports_higher_order_operators=True (#163824 ) Make DebugMode supports HOP Pull Request resolved: https://github.com/pytorch/pytorch/pull/163824 Approved by: https://github.com/ydwu4	2025-09-25 17:11:43 +00:00
Bob Ren	ad869c58f5	remove allow-untyped-defs from ./torch/utils/benchmark/op_fuzzers/sparse_unary.py (#163476 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163476 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #163478, #163475, #163471	2025-09-25 06:48:44 +00:00
dolpm	cde5c9aebd	fix pickling for BitwiseFn (#163571 ) Summary: ran into AttributeError: Can't get local object 'make_opaque_bitwise_fn.<locals>.BitwiseFn' looks like it was fixed for UnaryFn but not BitwiseFn in https://github.com/pytorch/pytorch/pull/138395 Fixes #147841 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163571 Approved by: https://github.com/jamesjwu	2025-09-25 04:52:11 +00:00
Sherlock Huang	4c2c401ccf	Record redistribute_local_tensor in DebugMode (#163704 ) Explicit redistribute_local_tensor API call could also results in communication, record it! Pull Request resolved: https://github.com/pytorch/pytorch/pull/163704 Approved by: https://github.com/ezyang	2025-09-24 16:11:26 +00:00
FFFrog	a635505a99	[Code Clean] Remove deadcodes about Python3.9 [6/N] (#163645 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163645 Approved by: https://github.com/albanD ghstack dependencies: #163626, #163627, #163629, #163643, #163644	2025-09-24 07:30:50 +00:00
FFFrog	ec0cd81c38	[Code Clean] Remove deadcodes about Python3.9 [4/N] (#163643 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163643 Approved by: https://github.com/albanD ghstack dependencies: #163626, #163627, #163629	2025-09-24 07:30:50 +00:00
Nikita Shulga	f9fa138a39	[BE] Delete all pre py-3.10 checks (#163653 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163653 Approved by: https://github.com/jansel ghstack dependencies: #163648, #163649	2025-09-23 23:22:53 +00:00
Yuanyuan Chen	e3b392bdfd	[BC breaking] Remove deprecated imports for torch.utils.data.datapipes.iter.grouping (#163438 ) This PR removes import tricks of `SHARDING_PRIORITIES` and `ShardingFilterIterDataPipe` from `torch.utils.data.datapipes.iter.grouping`. They are declared to be removed in PyTorch 2.1 but not. Before change: ``` import torch.utils.data.datapipes.iter.grouping.SHARDING_PRIORITIES import torch.utils.data.datapipes.iter.grouping.ShardingFilterIterDataPipe ``` works After change: there is an import error exception. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163438 Approved by: https://github.com/janeyx99	2025-09-23 05:02:06 +00:00
Sherlock Huang	95ac7d724e	Rename to _debug_mode.py to make it private (#163534 ) rename debug_mode.py to _debug_mode.py to make it private, per @alban's request. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163534 Approved by: https://github.com/albanD	2025-09-23 04:27:10 +00:00
Yuanyuan Chen	6a48f57d2f	[1/N] Remove 'type: ignore' suppressions (#163468 ) Remove some unnecessary 'type: ignore' suppressions from python code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163468 Approved by: https://github.com/Skylion007, https://github.com/janeyx99	2025-09-23 03:53:11 +00:00
Bob Ren	46e1b7d70b	remove allow-untyped-defs from ./torch/utils/data/datapipes/iter/fileopener.py (#163469 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163469 Approved by: https://github.com/aorenste, https://github.com/Skylion007 ghstack dependencies: #163246	2025-09-22 20:29:09 +00:00
Yuanyuan Chen	f591bb5056	Remove data_source argument from Sampler (#163134 ) `data_source` is declared being removed in PT 2.2 but not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163134 Approved by: https://github.com/ezyang	2025-09-21 05:44:41 +00:00
Edward Yang	2887f3fde4	[BE] Slight improvements to documentation in python_dispatch (#162963 ) I was briefly confused which way I should iterate stack, here's the comments I wanted. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162963 Approved by: https://github.com/albanD, https://github.com/SherlockNoMad	2025-09-21 01:45:46 +00:00
dsashidh	a87aea03f7	Update RandomSampler docstring. data_source must be Sized not Dataset (#158857 ) Fixes #158631 The docstring said data_source was a Dataset, but RandomSampler only needs something that implements __len__. This updates the docstring to use Sized instead, which matches the actual type used in the constructor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158857 Approved by: https://github.com/divyanshk	2025-09-20 04:05:25 +00:00
Sherlock Huang	a1df0b42ce	Lazy import to avoid circular import issue for DebugMode (#163381 ) as title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163381 Approved by: https://github.com/dolpm	2025-09-20 01:54:57 +00:00
Sherlock Huang	04842ac2b0	Change DebugMode record_torchfunction default to False (#163293 ) Result is too noisy with `record_torchfunction = True`. Change it to False, to make it clean. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163293 Approved by: https://github.com/zpcore	2025-09-19 00:30:53 +00:00
PaliC	c43ccfbc2d	[BE] Remove bottleneck (#163210 ) Some cleanup related to this RFC: https://github.com/pytorch/pytorch/issues/68742 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163210 Approved by: https://github.com/ezyang	2025-09-18 12:08:13 +00:00
Sherlock Huang	df4ebddbe0	DisableTorchFunction in debug_string (#163096 ) debug_string() invokes some torch functions under the hood. Use DisableTorchFunction() to avoid re-invoking __torch_function__ when calling debug_sting() inside DebugMode() Pull Request resolved: https://github.com/pytorch/pytorch/pull/163096 Approved by: https://github.com/zpcore	2025-09-17 00:19:49 +00:00
joshuamarkovic	559e8d1c20	[doc]: Small typos (#162982 ) Small typo fixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-09-16 17:42:19 +00:00
Sherlock Huang	f8d379d29e	[DTensor] Introduce DebugMode (#162665 ) Introduce a lightweight TorchDispatchMode for understanding the magic behind DTensor. - Tracks redistribution, see `redistribute_input(input_idx, from_placement, to_placement)` - Optionally tracks torch-level functions, via `__torch_function__` - Optionally tracks FakeTensor operations, which was needed for propagating tensor meta as a step of sharding propagation - Optionally tracks real tensor operations, including functional c10d op, and regular ops - Calls are shown in the hierarchical structure! - shorthand representation - dt: DTesnor, ft: FakeTensor, t: Tensor - DM(2, 2) == DeviceMesh(shape = [2, 2]) - [R, P, S(0)] == Placement[Replicate, Partial, Shard(0)] - f32[8,8] == float32 with shape[8, 8] ``` debug_mode = DTensorDebugMode(record_faketensor=False, record_realtensor=True) with debug_mode: torch.mm(x_dtensor, y_dtensor) print(debug_mode.debug_string()) ``` produces: ``` torch.mm(dt: f32[8, 8][S(0)], dt: f32[8, 32][S(0)]) aten::mm(dt: f32[8, 8][S(0)], dt: f32[8, 32][S(0)]) redistribute_input(1, [S(0)], [R]) _c10d_functional::all_gather_into_tensor(t: f32[1, 32], 8, 0) _c10d_functional::wait_tensor(t: f32[8, 32]) aten::mm(t: f32[1, 8], t: f32[8, 32]) ``` Another example, for torch.einsum ``` torch.functional.einsum(bld,dnh->blnh, dt: f32[16, 6, 8][P, R], dt: f32[8, 4, 4][R, P]) aten::unsqueeze(dt: f32[16, 6, 8][P, R], 3) aten::unsqueeze(t: f32[16, 6, 8], 3) aten::unsqueeze(dt: f32[16, 6, 8, 1][P, R], 4) aten::unsqueeze(t: f32[16, 6, 8, 1], 4) aten::permute(dt: f32[16, 6, 8, 1, 1][P, R], [0, 1, 3, 4, 2]) aten::permute(t: f32[16, 6, 8, 1, 1], [0, 1, 3, 4, 2]) aten::unsqueeze(dt: f32[8, 4, 4][R, P], 3) aten::unsqueeze(t: f32[8, 4, 4], 3) aten::unsqueeze(dt: f32[8, 4, 4, 1][R, P], 4) aten::unsqueeze(t: f32[8, 4, 4, 1], 4) aten::permute(dt: f32[8, 4, 4, 1, 1][R, P], [3, 4, 1, 2, 0]) aten::permute(t: f32[8, 4, 4, 1, 1], [3, 4, 1, 2, 0]) aten::permute(dt: f32[16, 6, 1, 1, 8][P, R], [0, 1, 4, 2, 3]) aten::permute(t: f32[16, 6, 1, 1, 8], [0, 1, 4, 2, 3]) aten::view(dt: f32[16, 6, 8, 1, 1][P, R], [1, 96, 8]) aten::view(t: f32[16, 6, 8, 1, 1], [1, 96, 8]) aten::permute(dt: f32[1, 1, 4, 4, 8][R, P], [4, 2, 3, 0, 1]) aten::permute(t: f32[1, 1, 4, 4, 8], [4, 2, 3, 0, 1]) aten::view(dt: f32[8, 4, 4, 1, 1][R, P], [1, 8, 16]) aten::view(t: f32[8, 4, 4, 1, 1], [1, 8, 16]) aten::bmm(dt: f32[1, 96, 8][P, R], dt: f32[1, 8, 16][R, P]) redistribute_input(0, [P, R], [S(2), S(2)]) aten::chunk(t: f32[1, 96, 8], 4, 2) aten::cat(['t: f32[1, 96, 2]', 't: f32[1, 96, 2]', 't: f32[1, 96, 2]', 't: f32[1, 96, 2]']) _c10d_functional::reduce_scatter_tensor(t: f32[4, 96, 2], sum, 4, 2) aten::clone(t: f32[1, 96, 1]) redistribute_input(1, [R, P], [S(1), S(1)]) aten::chunk(t: f32[1, 8, 16], 4, 1) aten::clone(t: f32[1, 2, 16]) aten::chunk(t: f32[1, 2, 16], 2, 1) aten::cat(['t: f32[1, 1, 16]', 't: f32[1, 1, 16]']) _c10d_functional::reduce_scatter_tensor(t: f32[2, 1, 16], sum, 2, 3) _c10d_functional::wait_tensor(t: f32[1, 1, 16]) aten::bmm(t: f32[1, 96, 1], t: f32[1, 1, 16]) aten::view(dt: f32[1, 96, 16][P, P], [16, 6, 1, 4, 4]) aten::view(t: f32[1, 96, 16], [16, 6, 1, 4, 4]) aten::permute(dt: f32[16, 6, 1, 4, 4][P, P], [0, 1, 3, 4, 2]) aten::permute(t: f32[16, 6, 1, 4, 4], [0, 1, 3, 4, 2]) aten::view(dt: f32[16, 6, 4, 4, 1][P, P], [16, 6, 4, 4]) aten::view(t: f32[16, 6, 4, 4, 1], [16, 6, 4, 4]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162665 Approved by: https://github.com/ezyang	2025-09-16 07:30:05 +00:00
Scott Rostrup	b68a5115a4	Workaround for mtia double init issue in has_triton (#162974 ) Summary: This change adds a new environment variable (`TORCHINDUCTOR_TRITON_DISABLE_DEVICE_DETECTION`) and configuration in `torch._inductor.config` which can be set to `"1"` to allow a user to disable triton's device detection logic in [torch/utils/_triton.py:has_triton()](`c9e57d7e9f/torch/utils/_triton.py (L128)`). This function is used at import scope in several places but the function has a side effect of initializing the mtia device if it is available which is causing some of our autotuning workflows to crash. Worth noting that when enabled this configuration disables all device detection not just mtia and this is because the logic in has_triton will initialize the mtia device as a side effect even when checking for a cuda or other device via the [get_interface_for_device()](`c9e57d7e9f/torch/_dynamo/device_interface.py (L570)`) function. I've tagged it `topic: not user facing` since I don't anticipate any outside of meta users making use of this, however this is my first PR here, so please indicate if it should be handled differently. Test Plan: This has been tested in the context of internal workflows. Differential Revision: D82347853 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162974 Approved by: https://github.com/xmfan	2025-09-16 04:46:11 +00:00
Nick Riasanovsky	955e195c7d	[Triton] [Inductor] Add a Blackwell specific Template for persistent matmul (#162916 ) Summary: This adds the Triton Tutorial Matmul persistent matmul with device side TMA for Blackwell and adds it as a template option for blackwell. This uses newer Triton features such as automatic warp specialization and loop flattening, which while still containing flaws can improve performance on blackwell. This does not include the Epilogue subtiling section, as that will be a followup PR. This PR doesn't include any tuning. I am doing a larger benchmarking run to determine the best initial configs for tuning and will open a followup PR with better defaults soon. Test Plan: Tested on a Blackwell machine with test_max_autotune.py and confirmed the new tests pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162916 Approved by: https://github.com/NikhilAPatel	2025-09-15 23:23:04 +00:00
Clark Kang	6d64bc3990	[data foundation][vizard] Prevent checking the device type of numpy object in Tensorboard logger (#162888 ) Summary: The check is introduced in D82262053 - `scalar_value` could be a numpy object - Move the check of `device.type` into `make_np` method where it happens only when it's a `torch.Tensor`. Test Plan: ``` vizard launch -j 1x8 --launch=flow --config-path=pkg://vizard_projects.image_classification.configs --config-name=resnet50 ++flow.secure_group=ml_sensors ++flow.entitlement=ai_frameworks_pnb ++max_train_steps_per_epoch=10 ++max_epochs=5 ++log_every_n_steps=10 ++profiler=null ++max_eval_steps_per_epoch=10 ``` Rollback Plan: Differential Revision: D82383428 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162888 Approved by: https://github.com/xush6528	2025-09-14 08:09:08 +00:00
Aaryaman Vasishta	52af91e4c1	[ROCm/Windows] Support load_inline on windows (#162577 ) Supports `torch.utils.cpp_extension.load_inline` on Windows with ROCm. Tested on Windows with gfx1201. Note that it currently only works when CC and CXX are set to `clang-cl`. This is also needed when building extensions via. `setuptools` due to linker errors when using `cl` directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162577 Approved by: https://github.com/ezyang	2025-09-12 08:10:07 +00:00
Shangdi Yu	636a511084	[aoti] add config for libtorch free so (#162655 ) Users can specify the following to get a libtorch_free `.so`. "aot_inductor.use_libtorch": False, The following config is only used for torchnative (see https://github.com/meta-pytorch/torchnative/pull/110). It's not intended to be used by executorch. The reason we need it for torchnative is because a lot of the symbol definitions in torchnative repo is only in header files. "aot_inductor.libtorch_free_header": "/data/users/shangdiy/torchnative/standalone,/data/users/shangdiy/torchnative/" (or their custom headers) The main motivating use case is for executorch to produce a libtorch free `.so`. TODO for follow-up PR: this flag should be consolidated with the `compile_standalone` flag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162655 Approved by: https://github.com/angelayi	2025-09-12 07:31:04 +00:00
Mark Saroufim	f7e8321961	fix cpp extension distributed warning spew (#162764 ) With the new change we only log the warning if we're running non distributed code or if we're in rank 0. Unit testing that certain messages get printed on certain ranks only feels kinda jank so test plan is below instead Test plan ```python # torchrun --nproc_per_node=2 demo_fix.py import os import logging logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG) import torch if 'RANK' in os.environ: torch.distributed.init_process_group('nccl') from torch.utils.cpp_extension import _get_cuda_arch_flags _get_cuda_arch_flags() print(f"Rank {os.environ.get('RANK', '0')} done") ``` Logs showing how how `TORCH_CUDA_ARCH_LIST`only shows up once if we explicitly set the the logging level to `logging.DEBUG`. It also improves the debug message to explain what the actual behavior will be ``` (source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *************************************** W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] ************************************* [rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override. Rank 0 done Rank 1 done ``` But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)` Then we get ``` (source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] ************************************* W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *************************************** Rank 0 done Rank 1 done (source) [marksaroufim@devgpu005]~% ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-09-12 06:12:46 +00:00
Laith Sakka	189a054cfb	Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] (#160869 ) [relanding again after fixing internal build] Summary: This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous() but want to find those call sites to handle this properly by calling is_contiguous_or_false() and not is_contiguous() explitly when appropriate. I had to fix one issue after removing the implicit size oblivious reasoning. here is context we defined in this https://github.com/pytorch/pytorch/pull/157472 sym_is_contiguous to be the function computing contiguity for dynamic shapes in c++. It returns a symbolic expression that represents contiguity and guaranteed not to throw a DDE. when people call is_contiguous we do sym_is_contiguous().guard_bool() when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false() one issue not handled well was this path ``` c10::SymBool TensorImpl::sym_is_contiguous_custom( at::MemoryFormat memory_format) const { if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) { return pyobj_slot_.load_pyobj_interpreter()->is_contiguous( this, memory_format); } return sym_is_contiguous_default(memory_format); } ``` namely if we call sym_is_contiguous_custom but we have matches_python_custom(SizesStridesPolicy::CustomStrides) return true , then we used to call is_contiguous(this, memory_format); This used to go through the load_pyobj_interpreter and end up calling the python is_contiguous call which used implicit size oblivious reasoning. once we removed that implicit size oblivious reasoning, the right thing we want is to call return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format); otherwise we would get DDE even if the caller is doing sym_is_contiguous. so I had to define it for pyinterpreter, and then I had to override it for nested tensors. Approved by: https://github.com/ezyang Test Plan: contbuild & OSS CI, see `e444cd24d4` Rollback Plan: Differential Revision: D80435179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869 Approved by: https://github.com/ezyang	2025-09-08 22:59:13 +00:00
Frank Lin	0c0e056a9e	[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 ) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it capturing graph) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True\|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * Free marker: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * Terminal: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered reusable on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)). In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352 Approved by: https://github.com/ngimel, https://github.com/eqy Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-04 17:21:26 +00:00
Xinya Zhang	98efc9e93d	[ROCm] Bump AOTriton to 0.11b (#161754 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b: * Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements - AITER ASM kernels deliver over 500TFLOPS training performance. See [AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more details. * Now returns natural based `logsumexp` tensor, matching CUDA's behavior - PR #156903 is reverted in this PR as well since it is not needed anymore. * Enables `CausalVariant.LOWER_RIGHT` The build system changes drastically along with new packaging scheme of AOTriton 0.11 * AOTriton 0.11 packs GPU images separately from AOTriton runtime * `aotriton.cmake` now selectively downloads image packs according to `PYTORCH_ROCM_ARCH` * `aotriton.cmake` now only use pre-compiled runtime library that exactly matches the ROCM in the build environment. For PyTorch builds with ROCm versions not listed in the file, the build process will build AOTriton runtime without GPU images from source - This avoids any further ABI breaks like ROCM 6.4 -> 7.0 - recursive git clone is disabled since building AOTriton runtime does not require submodules. Bug fixes: * Fix a kernel bug introduced when implementing SWA Known Problems: * gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status due to accuracy issues. Triton compiler fixes are needed to restore the support status. * Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0. This issue is under investigation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161754 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-09-03 20:45:44 +00:00
Yulun Wang	850e1382a9	[hipify] Replace cudaStreamCaptureStatusNone (#161992 ) Replacing additional cuda symbols to hip symbols Differential Revision: D81420086 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161992 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007	2025-09-03 20:23:32 +00:00
Guilherme Leobas	789d494212	Defer loading hipify until it is needed (#160824 ) Saves a few milliseconds when running a test case: Before: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 1.497s ``` After: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 0.909s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824 Approved by: https://github.com/zou3519	2025-09-02 15:27:37 +00:00
PyTorch MergeBot	13b65196db	Revert "Defer loading hipify until it is needed (#160824 )" This reverts commit 403a3a393cda7e60f503f3b04b8805a845dcf45d. Reverted https://github.com/pytorch/pytorch/pull/160824 on behalf of https://github.com/atalman due to Broke slow tests test_utils.py::TestHipifyTrie::test_special_char_export_trie_to_regex [GH job link](https://github.com/pytorch/pytorch/actions/runs/17387051351/job/49355619371) [HUD commit link](`403a3a393c`) ([comment](https://github.com/pytorch/pytorch/pull/160824#issuecomment-3243281628))	2025-09-01 21:34:13 +00:00
Guilherme Leobas	403a3a393c	Defer loading hipify until it is needed (#160824 ) Saves a few milliseconds when running a test case: Before: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 1.497s ``` After: ``` $ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow frames [('total', 1), ('ok', 1)] inline_call [] . ---------------------------------------------------------------------- Ran 1 test in 0.909s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824 Approved by: https://github.com/zou3519	2025-09-01 20:57:41 +00:00
PyTorch MergeBot	63a9c23fe9	Revert "[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 )" This reverts commit 190c391a28845a14df26abb228d26aa813efb20c. Reverted https://github.com/pytorch/pytorch/pull/158352 on behalf of https://github.com/atalman due to Broke cuda 13.0 nightly builds https://github.com/pytorch/pytorch/actions/runs/17382188549/job/49341981474 ([comment](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3242871629))	2025-09-01 16:27:03 +00:00
Frank Lin	190c391a28	[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 ) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it capturing graph) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True\|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * Free marker: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * Terminal: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered reusable on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)). In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352 Approved by: https://github.com/ngimel Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-01 09:25:01 +00:00
Scott Wolchok	302d860157	Improve assert perf in _python_dispatch._correct_storage_aliasing (#161317 ) This assertion was expensive because of is_traceable_wrapper_subclass. Finding a cheap check to run first that's likely to let us skip the rest seems to improve things significantly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161317 Approved by: https://github.com/ezyang, https://github.com/XilunWu, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304, #161308, #161315	2025-08-30 06:55:42 +00:00
Scott Wolchok	0c459f2921	Fix pybind enum efficiency issue in return_and_correct_aliasing (#161315 ) Scanning a list of pybind enums with `in` is slow. See NOTE in code for full explanation. This is a significant optimization; will be updating the torchdispatch/return_and_correct_aliasing portion of this stack with benchmark and results soonish. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161315 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304, #161308	2025-08-30 06:55:42 +00:00
Scott Wolchok	b96bcb9fdb	Optimize _python_dispatch.return_and_correct_aliasing.get_write_alias (#161308 ) - Empty containers are Falsey - Hoist cheap checks first - Microbenchmarked single-element set access method Benchmark code: ``` import timeit to_test = [ ('list(x)', 'x = set([3])'), ('x[0]', 'x = [3]'), ('list(x)[0]', 'x = set([3])'), ('next(iter(x))', 'x = set([3])'), ] for (stmt, setup) in to_test: res = timeit.timeit(stmt=stmt, setup=setup) print(f"Time for `{stmt}`: {res}") ``` Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux) ``` Time for `list(x)`: 0.03418 Time for `x[0]`: 0.00852 Time for `list(x)[0]`: 0.03561 Time for `next(iter(x))`: 0.02278 ``` FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark! Pull Request resolved: https://github.com/pytorch/pytorch/pull/161308 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304	2025-08-30 06:55:42 +00:00
Scott Wolchok	2089ed3d5e	Use `is`, not ==, to check exact type matches in _python_dispatch (#161304 ) `is` checks object identity and is more efficient. Google seems to confirm it is the correct way to do an exact type check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161304 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: #161301, #161292	2025-08-30 06:55:42 +00:00
Scott Wolchok	1a64bf2636	Stop accessing func._schema in _python_dispatch.correct_storage_aliasing (#161292 ) func._schema is a pybind, accessing the arguments/returns is expensive, we have no reason to do it anyway, and even though #161301 makes accessing the arguments/returns less expensive, this still seems to improve performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161292 Approved by: https://github.com/wconstab, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: #161301	2025-08-30 06:55:42 +00:00
xinan.lin	3519969e4f	[Intel GPU] Enable tensor memory descriptor in triton template for XPU. (#161600 ) As Intel Triton now supports tensor descriptor, this PR updates the pinned Intel Triton version and introduces support for Triton MM template with tensor descriptor on XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161600 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-08-28 12:39:58 +00:00
rzou	5edc3d814f	Add option for TorchDispatchMode to ignore torch.compile internals (#161648 ) If TorchDispatchMode.ignore_compile_internals() is True, then we turn off the TorchDispatchMode during the compilation process, instead turning it back on during runtime of the compiled artifact. Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/161648 Approved by: https://github.com/bdhirsh	2025-08-28 02:41:33 +00:00
Benjamin Glass	cbc53b7696	Update pybind11 submodule to 3.0.1 (#160754 ) Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling. Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent. Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754 Approved by: https://github.com/Skylion007	2025-08-27 21:15:01 +00:00
Scott Wolchok	80bf883d21	Replace manual cache in _python_dispatch.get_alias_info with functools.cache (#161286 ) In addition to being more code, the manual cache was doing an extra dictionary lookup on each cache hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161286 Approved by: https://github.com/wconstab	2025-08-27 00:17:51 +00:00

1 2 3 4 5 ...

2592 Commits