pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Edward Yang	2dd529df00	A basic CLAUDE.md based on bad things I see claude code doing (#162163 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162163 Approved by: https://github.com/albanD, https://github.com/Skylion007	2025-09-05 14:52:36 +00:00
Shunting Zhang	a714437093	[ez][inductor] add a few outer dimension reduction cases for LOAF (#162028 ) For the not able to fuse issue reported here: https://github.com/pytorch/pytorch/issues/93718 , LOAF can fuse the outer dimension softmax into a single kernel and brings 1.87x speedup for the example shape mentioned in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162028 Approved by: https://github.com/jansel, https://github.com/eellison	2025-09-05 09:30:13 +00:00
atalman	bffc7dd1f3	[CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds (#161916 ) Related to https://github.com/pytorch/pytorch/issues/159779 Adding CUDA 13.0 libtorch builds, followup after https://github.com/pytorch/pytorch/pull/160956 Removing CUDA 12.9 builds, See https://github.com/pytorch/pytorch/issues/159980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161916 Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007 Co-authored-by: Ting Lu <tingl@nvidia.com>	2025-09-05 07:47:54 +00:00
Zeng, Xiangdong	5c473e9f5e	[1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. We could enable Intel GPU with following methods and try the best to keep the original code styles: - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - enabled XPU for some test path - skip some test cases which Intel GPU does not support Pull Request resolved: https://github.com/pytorch/pytorch/pull/159118 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-05 05:52:15 +00:00
Pian Pawakapan	5da573c42c	[PGO] handle PGO profile merges (#162097 ) Avoid merges from extra PGO key, if same source has different rank. Unlikely to happen (needs code hash match & source variable type to change), but being safe. Differential Revision: D81299840 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162097 Approved by: https://github.com/bobrenjc93	2025-09-05 04:58:15 +00:00
PyTorch UpdateBot	494878a11b	[audio hash update] update the pinned audio hash (#162114 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162114 Approved by: https://github.com/pytorchbot	2025-09-05 04:32:16 +00:00
PyTorch UpdateBot	3bbc2e3e4f	[vllm hash update] update the pinned vllm hash (#162226 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162226 Approved by: https://github.com/pytorchbot	2025-09-05 04:32:08 +00:00
Nick Riasanovsky	b67c410398	[BE] [Inductor] Add Kernel name to all coor-desc tuning (#161409 ) Summary: When running coordinate descent tuning the logging is difficult to parse if the results are parallelized at all. This includes the kernel name in each step so post-processing can unify the results, even if run in parallel. Test Plan: NFC. Just a logging change. Rollback Plan: Differential Revision: D80942794 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161409 Approved by: https://github.com/PaulZhang12	2025-09-05 02:53:13 +00:00
Colin L Reliability Rice	be5b03dde9	Allow for using a dedicated binary for the torch subproc pool. (#162093 ) Summary: The binary torch is running inside of can be larger than needed and in certain situations, this can cause a loss of memory. Test Plan: We've manually run tests via ``` TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_WORKER_SUPPRESS_LOGGING=0 make mc8-train-publish-cint-datafm-toy -C minimal_viable_ai/models/ifr_mtml/main_v1/ 2>&1 \| tee ~/run_out ``` and overriding the binary used to be the built fbpkg in /packages. We've also kicked off manual runs at ``` fire-feid-20250903-1051-ae8c6827 ``` Which do show the binary running - https://fburl.com/scuba/procprint/e6lwv32m Rollback Plan: steps: - jk.update: jk: pytorch/compiler:subproc_worker_binary constant_bool: null consistent_pass_rate: null fractional_host_rollout: null sampling_rate: null - manual.note: content: '' Differential Revision: D81616624 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162093 Approved by: https://github.com/masnesral	2025-09-05 01:43:46 +00:00
Eddie Yan	73eb4511fb	[B200][NVFP4] Fix argument passing in `test_blockwise_mxfp8_nvfp4_mxfp4_numerics_` (#162185 ) to unblock https://github.com/pytorch/pytorch/pull/159494 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162185 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2025-09-05 01:24:59 +00:00
Jeffro	29280864d9	Add new parameter for gen_pyi.py to make it more configureable. (#161772 ) This is a reposting of PR #128519. This change is important to how we maintain PyTorch at Google. From the previous PR: " This will make the script more flexible for the directory where it is executed. ... We plan to use the deprecated_yaml from a blaze genrule that invokes pyi.py. As the input to the pyi.py, genrule requires the input file to be explicitly listed out. When we feed the value of tools/autograd/deprecated.yaml to genrule, it failed to resolve since tools/autograd is a package from blaze perspective. Any file under a blaze package will a proper blaze target to be access. " Pull Request resolved: https://github.com/pytorch/pytorch/pull/161772 Approved by: https://github.com/albanD Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>	2025-09-05 00:48:15 +00:00
angelayi	5c67426d68	[dynamo] Add support for const prop on .item (#162204 ) Fixes some of the errors in https://fb.workplace.com/groups/1028545332188949/permalink/1303030824740397/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/162204 Approved by: https://github.com/williamwen42	2025-09-05 00:28:49 +00:00
Nikita Shulga	d2d4c8e9b2	[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999 ) Followup after https://github.com/pytorch/pytorch/pull/154012 Fixes CPU part of https://github.com/pytorch/pytorch/issues/160841 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161999 Approved by: https://github.com/drisspg	2025-09-04 23:35:27 +00:00
Eddie Yan	c7e41071a0	[B200][MXFP8] Fix regex in `test_blockwise_mxfp8_nvfp4_error_messages_recipe_mxfp8_cuda` (#162180 ) to unblock https://github.com/pytorch/pytorch/pull/159494 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162180 Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/nWEIdia	2025-09-04 23:29:10 +00:00
xinan.lin	9499c8761c	[Inductor][Intel GPU] Register triton template heuristic for addmm tma. (#162132 ) Fixes #162048 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162132 Approved by: https://github.com/jansel	2025-09-04 23:01:57 +00:00
Nan Zhang	3a207816cc	Forward fix for user defined triton kernel grid calc (#162162 ) Summary: This change fixes the test: inductor:fxir_backend - test_custom_triton_autotune_dynamic which was broken by https://github.com/pytorch/pytorch/pull/160997 Test Plan: inductor:fxir_backend - test_custom_triton_autotune_dynamic Rollback Plan: Differential Revision: D81679217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162162 Approved by: https://github.com/eellison, https://github.com/jansel	2025-09-04 22:51:23 +00:00
Yiming Zhou	09be1890d7	[export] Fix torch.export.load with storage offset (#162172 ) Summary: As titled Test Plan: CI Rollback Plan: Differential Revision: D81687701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162172 Approved by: https://github.com/angelayi	2025-09-04 22:50:33 +00:00
Pian Pawakapan	0d84ff3b78	[PGO] log add_extra_remote PGO to tlparse (#161751 ) Summary: log when additional PGO profile is merged in, from added read key Test Plan: test_pgo Rollback Plan: Differential Revision: D81284190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161751 Approved by: https://github.com/bobrenjc93	2025-09-04 22:47:03 +00:00
PyTorch MergeBot	1ec2c15914	Revert "Fix Arm64 OSS pytorch build with FBGEMM (#161527 )" This reverts commit dbec08729fb9848bebed6048c63831b87170d061. Reverted https://github.com/pytorch/pytorch/pull/161527 on behalf of https://github.com/malfet due to This breaks all Mac builds, see `b04e922712/1` ([comment](https://github.com/pytorch/pytorch/pull/161527#issuecomment-3256034443))	2025-09-04 22:29:38 +00:00
Shangdi Yu	b04e922712	Fix memory leak in AOTI when calling `aoti_torch_as_strided` (#162118 ) Summary: Fix memory leak in AOTI when calling `aoti_torch_as_strided` If you have something like `AtenTensorHandle buf_handle`; and you allocated memory to it, you have to make it a `RAIIAtenTensorHandle` to release the ownership. Otherwise you have leaked the memory because even when the program ends, there's still a pointer pointing to the underlying storage of `buf_handle_restrided`, and the storage is never freed. Test Plan: ``` buck run fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_pad_non_zero_memory_leak ``` Also verified by looking at `print(f"Allocated memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")` Differential Revision: D81640339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162118 Approved by: https://github.com/angelayi	2025-09-04 22:17:06 +00:00
Brian Hirsh	0d71a9dd5b	fix incorrect interaction between DDPOptimizer and donated buffers (#160745 ) This should fix https://x.com/wightmanr/status/1953147089518772254?t=ng_R4t0-tRhO_qQE8NqOhw&s=19. Still working on adding a reasonable test. You can see more of a description of the problem in the code comments. But the TLDR is that: * When using DDPOptimizer, we partition the graph and compile several subgraphs. So 1 dynamo graphs becomes N AOT/inductor artifacts * We have some existing logic to stash graph metadata (`fw_metadata`) in dynamo's TracingContext. When using DDPOptimizer, we generate one `fw_metadata` per AOT graph, and we stash it on the 1 TracingContext from dynamo. So we end up clobbering the `fw_metadata` for graph i-1 when AOT and inductor start compiling graph i * This is normally ok, but it becomes a problem if inductor ever wants to read from this `fw_metadata` during backward compilation. Why? We (by default) compile the backwards lazily. So when using DDPOptimizer, we will compile backward graph N, then bw graph N-1, etc. But... at the time that we have stated compiling bw graph N-1, its corresponding fw_metadata has already been clobbered! So we end up reusing graph N's metadata for all of our backward graph compilations. With donated buffer metadata, that means we end up donated and writing into incorrect input buffers The fix that I added was to add more dedicated DDPOptimizer metadata into the TracingContext, so we can properly switch between these N different `fw_metadata` objects in the backward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160745 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-09-04 21:57:27 +00:00
Ke Wen	89d41d3f61	[SymmMem] Feed tensor.data_ptr instead of handle.buffer_ptr into kernels (#162193 ) After MemPool support, `get_buffer_ptrs` points to base address of allocation segment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162193 Approved by: https://github.com/ngimel	2025-09-04 21:26:05 +00:00
Ke Wen	9bdcee01f8	[SymmMem] Add root argument to broadcast op (#161090 ) It was missing earlier. Also added range check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161090 Approved by: https://github.com/fegin	2025-09-04 21:09:54 +00:00
Prachi Gupta	b9ba612f7a	[ROCm] Enabling several UTs (#161715 ) All these UTs are working as is, just removing the skip - test_p2p_ipc - test_repros.py: working, added fp8 support - test_activation_checkpointing.py - test_content_store.py - test_cuda_multigpu.py - test_compute_comm_reordering.py - test_segment_reductions.py - test_dataloader.py - test_math_ops.py - test_loop_ordering.py - test_control_flow.py - distributed_test.py - test_mem_tracker.py - test_fsdp_optim_state.py - test_fully_shard_mixed_precision.py: skippped for < ROCm7.0 - test_aot_inductor_custom_ops.py - test_c10d_ops_nccl.py - test_eager_transforms.py - test_sparse_csr.py - test_inductor_collectives.py - test_fake_tensor.py - test_cupy_as_tensor.py - test_cuda.py: enable UTs that are working - test_matmul_cuda.py: enable UTs that are working Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-09-04 20:43:03 +00:00
PyTorch MergeBot	d5b38410b5	Revert "[SymmMem] Add root argument to broadcast op (#161090 )" This reverts commit 3c0ff1b569c45cfa6935ad8031a9d4cf1551aa3f. Reverted https://github.com/pytorch/pytorch/pull/161090 on behalf of https://github.com/jeanschmidt due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/161090#issuecomment-3255574093))	2025-09-04 20:42:31 +00:00
PyTorch MergeBot	48bedd753d	Revert "Fix usage of forwarding references (#161094 )" This reverts commit 1ebd70d0c0d562d3be9abdee2a21906584af7d99. Reverted https://github.com/pytorch/pytorch/pull/161094 on behalf of https://github.com/jeanschmidt due to checking if revert will fix https://github.com/pytorch/pytorch/actions/runs/17470601839/job/49621447581 ([comment](https://github.com/pytorch/pytorch/pull/161094#issuecomment-3255541480))	2025-09-04 20:35:41 +00:00
Wang, Eikan	a3d72b09ae	Apply Triton tensor descriptor for flex-decoding for performance (#161643 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161643 Approved by: https://github.com/drisspg	2025-09-04 20:10:41 +00:00
Edward Z. Yang	ef3be6726f	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-04 20:05:50 +00:00
PyTorch MergeBot	95ee0bfea9	Revert "[nativert] triton runtime implementation (#161798 )" This reverts commit 3dde5d7f9bf80dd6623a712bc429e9e4302464b5. Reverted https://github.com/pytorch/pytorch/pull/161798 on behalf of https://github.com/jeanschmidt due to introducing linting failures ([comment](https://github.com/pytorch/pytorch/pull/161798#issuecomment-3255412085))	2025-09-04 20:05:24 +00:00
Ben Niu	dbec08729f	Fix Arm64 OSS pytorch build with FBGEMM (#161527 ) Summary: X-link: https://github.com/pytorch/FBGEMM/pull/4775 Without this change, Arm64 OSS pytorch build with FBGEMM failed with the following error. Undefined symbols for architecture arm64: "fbgemm::FindMinMax(float const, float, float*, long long)", referenced from: at::native::fbgemm_linear_int8_weight_fp32_activation(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&, at::Tensor const&) in QuantizedLinear.cpp.o at::native::fbgemm_linear_quantize_weight(at::Tensor const&) in QuantizedLinear.cpp.o PackedConvWeight<2>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o PackedConvWeight<3>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o at::Tensor PackedLinearWeight::apply_dynamic_impl<false>(at::Tensor, bool) in qlinear_dynamic.cpp.o at::Tensor PackedLinearWeight::apply_dynamic_impl<true>(at::Tensor, bool) in qlinear_dynamic.cpp.o ld: symbol(s) not found for architecture arm64 This change fixed the issue by moving FindMinMax's implementation from QuantUtilsAvx2.cc to QuantUtils.cc. FindMinMax is a platform-agnostic function with AVX2-specific optimizations so conceptually it can be put in QuantUtils.cc. Test Plan: With this change, Arm64 OSS pytorch built successfully with FBGEMM enabled. Rollback Plan: Reviewed By: q10 Differential Revision: D81052327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161527 Approved by: https://github.com/q10	2025-09-04 20:01:13 +00:00
PyTorch MergeBot	c3d54dea9f	Revert "[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999 )" This reverts commit 02c83f13348631d80aa23f57aaff6b7d1223bbdd. Reverted https://github.com/pytorch/pytorch/pull/161999 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](https://github.com/pytorch/pytorch/pull/161999#issuecomment-3255381925))	2025-09-04 19:56:48 +00:00
PyTorch MergeBot	afa6e5604d	Revert "[BE] Cleanup stale comments/copy from `gemm` (#162001 )" This reverts commit b40d9432be44a6b5974ee62e7d19c3c61c5ece37. Reverted https://github.com/pytorch/pytorch/pull/162001 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](https://github.com/pytorch/pytorch/pull/161999#issuecomment-3255381925))	2025-09-04 19:56:48 +00:00
PyTorch MergeBot	9e5247f51d	Revert "[MPS] enable cat op for sparse (#162007 )" This reverts commit 2c03f0acc53ed13fe8ebfe809129f25996e009a0. Reverted https://github.com/pytorch/pytorch/pull/162007 on behalf of https://github.com/jeanschmidt due to Breaks internal builds see [D81588372](https://www.internalfb.com/diff/D81588372), @malfet may you help the author? ([comment](https://github.com/pytorch/pytorch/pull/162007#issuecomment-3255357336))	2025-09-04 19:49:44 +00:00
Edward Yang	c37103234a	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-04 19:43:17 +00:00
dolpm	3dde5d7f9b	[nativert] triton runtime implementation (#161798 ) Summary: att Test Plan: ci Rollback Plan: Reviewed By: minjang Differential Revision: D80828148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161798 Approved by: https://github.com/minjang, https://github.com/SherlockNoMad	2025-09-04 19:00:15 +00:00
Aaron Gokaslan	1f51056bd6	[BE]: Update cpp-httplib submodule to 0.26.0 (#162181 ) Update cpp-httplib with better error handling, bugfixes, and performance. Header only library update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162181 Approved by: https://github.com/jansel	2025-09-04 18:59:32 +00:00
Animesh Jain	6b1900c22f	[dynamo][hops] Remove const outputs from the speculated subgraph (#161355 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161355 Approved by: https://github.com/zou3519	2025-09-04 18:52:01 +00:00
mansiag05	9480cdc0b6	Modified the docs to add example for torch.is_floating_point and torc… (#161951 ) …h.is_complex. The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values. Fixes #161859 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161951 Approved by: https://github.com/zou3519	2025-09-04 18:50:19 +00:00
eqy	6f7608d603	[cuDNN][SDPA] Enable cuDNN SDPA by default for SM 9.0, SM 10.0 (#162073 ) for 2.9 🙏 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162073 Approved by: https://github.com/drisspg	2025-09-04 18:46:28 +00:00
Albert W	d1a15abfdc	export: add explicit decomposition for aten.expand_copy and unit test (#161688 ) Fixes #161080 torch.export.export fails with TypeError: expand() got an unexpected keyword argument 'implicit' when calling torch.expand_copy(..., implicit=True). This happened because expand_copy = _make_copy_from_view(aten.expand) register aten. expand as the decomposition path for aten.expand_copy, which doesn’t accept the implicit argument. I have added an explicit a decomposition for aten.expand_copy in torch/_decomp/decompositions.py to ignore the implicit argument, and a simple unit test to demonstrate the bug being fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161688 Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou	2025-09-04 18:16:56 +00:00
Animesh Jain	33028597bf	[dynamo] Make the MRO walk more narrow (#162105 ) I dont have a failing test case but just saw an extra guard somewhere. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162105 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel	2025-09-04 17:54:33 +00:00
vasiliy	9eadb37cdd	enable float32 and float16 in `torch._grouped_mm` fallback (#162059 ) Summary: Enables `torch.float32` and `torch.float16` options in `torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`, `mat_b`, and `out_dtype` are `torch.bfloat16`. Saving for future PRs: 1. enabling testing on more platforms 2. supporting out_dtype != mat_a.dtype 3. opinfo 4. better compile support Test Plan: ```bash // on A100 and H100 pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x // on H100 pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/162059 Approved by: https://github.com/ngimel, https://github.com/eqy ghstack dependencies: #161407, #161717	2025-09-04 17:48:52 +00:00
vasiliy	61fb632cfb	move `_grouped_mm` fallback to composite explicit autograd (#161717 ) Summary: Moves the `torch._grouped_mm` fallback from cuda-only code to a place where it can be used by multiple backends. Specifically: 1. make the fallback path and util functions reusable and move them to `ATen/native/GroupedMMUtils.h` 2. register a backend-agnostic kernel to composite explicit autograd key 3. refactor the grouped_mm tests to their own test case and enable CPU At the end of this PR, here is the support matrix: * CUDA SM90+: fast path with test coverage (no change) * CUDA SM80+: fallback with test coverage (no change) * CPU: fallback works, but without test coverage (new in this PR) * other SM versions and other backends: will probably already work, but let's leave this to future PRs * float32/float16: will probably already work, but let's leave this to future PRs Test Plan: ```bash pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/161717 Approved by: https://github.com/ngimel, https://github.com/drisspg ghstack dependencies: #161407	2025-09-04 17:48:52 +00:00
vasiliy	8a736fa1ea	create torch._grouped_mm fallback path with for loops / bmm (#161407 ) Summary: Creates a fallback path for `torch._grouped_mm`, using the naive for loop implementation (or bmm). For the sake of keeping the PR small, this PR only enables SM80+ (CUDA capability 8.0 and up), since I am testing this on an A100 machine. In future PRs, we can increase the coverage of the fallback to: 1. float32 and float16, which will extend the GPU coverage 2. cpu Test Plan: ```bash pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/161407 Approved by: https://github.com/drisspg, https://github.com/eqy	2025-09-04 17:48:44 +00:00
Ke Wen	8bb213b6d5	[SymmMem] Increase signal pad size for NVL72 (#162026 ) so that the signal calls do not step on each other's foot. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162026 Approved by: https://github.com/ngimel	2025-09-04 17:41:38 +00:00
Ke Wen	869cbcc16e	[SymmMem] Add a helper API to distinguish intra- and inter- node (#161984 ) Added a helper API to tell if the world is entirely within a P2P domain or crosses network. This is mainly for nblocks tuning purpose. (In later PRs) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161984 Approved by: https://github.com/ngimel ghstack dependencies: #161983	2025-09-04 17:37:59 +00:00
Frank Lin	0c0e056a9e	[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 ) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it capturing graph) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True\|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * Free marker: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * Terminal: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered reusable on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)). In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352 Approved by: https://github.com/ngimel, https://github.com/eqy Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-04 17:21:26 +00:00
William Wen	f36f285953	[dynamo] change error_on_graph_break/fullgraph semantics (#161747 ) This PR implements the semantics change to `torch._dynamo.error_on_graph_break`: - ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~ - `error_on_graph_break` is a new internal `torch.compile `setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks. - `error_on_graph_break` does nothing when `fullgraph=True` - `error_on_graph_break` does NOT guarantee a single graph Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation: - `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled - `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time - `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time Pull Request resolved: https://github.com/pytorch/pytorch/pull/161747 Approved by: https://github.com/mlazos ghstack dependencies: #161739	2025-09-04 17:10:17 +00:00
Cui, Yifeng	ba7f546ccc	Update torch-xpu-ops commit pin (#162062 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](`83c5a5a551`), includes: - Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed - Fallback lu_factor kernel to CPU for single batch - Enable aten::linalg_inv and aten::linalg_inv_ex on XPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/162062 Approved by: https://github.com/EikanWang	2025-09-04 17:05:33 +00:00
Lakshay Garg	43b7c86a2c	Add dependency-groups.dev to pyproject.toml (#161216 ) [PEP 735](https://peps.python.org/pep-0735) introduces the [dependency-groups] table for a number of use-cases one of which includes specifying development dependencies for projects. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161216 Approved by: https://github.com/seemethere	2025-09-04 16:51:36 +00:00

1 2 3 4 5 ...

92595 Commits