pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Nikita Shulga	8be8b94793	Update SECURITY.md with reporting guidelines (#162608 ) Added clarification that all reports will be disclosed within 90 days Pull Request resolved: https://github.com/pytorch/pytorch/pull/162608 Approved by: https://github.com/seemethere, https://github.com/albanD	2025-09-11 16:30:29 +00:00
suo	fe8cc619b8	[torch][c10d] fix split_group in mixed backend case (#162424 ) Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options. However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends. This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead. Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs. As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424 Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang	2025-09-11 16:29:32 +00:00
atalman	2f5a24c2a2	Smoke tests don't run nvshmem on Windows (#162646 ) Only available for linux x86 and aarch64: https://pypi.org/project/nvidia-nvshmem-cu13/#files nvshmem is available only on linux: `"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | "` https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L57 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162646 Approved by: https://github.com/kwen2501	2025-09-11 16:09:20 +00:00
Nikita Shulga	24492cbab2	[BE] Cleanup stale comments/copy from `gemm` (#162001 ) Followup after https://github.com/pytorch/pytorch/pull/154012 Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate temporary array and then manually implement the `beta` logic in the codebase Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001 Approved by: https://github.com/drisspg ghstack dependencies: #161999	2025-09-11 15:48:43 +00:00
Avik Chaudhuri	3f6d88f04c	paths to exclude shape guards (#162684 ) Summary: Easier to land than https://www.internalfb.com/diff/D82030581 Test Plan: everything blamed by https://www.internalfb.com/diff/D80713603 (except some old exir tests) Rollback Plan: Differential Revision: D82180349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162684 Approved by: https://github.com/tugsbayasgalan	2025-09-11 15:34:06 +00:00
PyTorch MergeBot	94db2ad51d	Revert "Move prioritized text linker optimization code from setup.py to cmake (#160078 )" This reverts commit 26b3ae58908becbb03b28636f7384d2972a8c9a5. Reverted https://github.com/pytorch/pytorch/pull/160078 on behalf of https://github.com/atalman due to Sorry reverting this broke linux aarch64 CUDA nightlies [pytorch/pytorch/actions/runs/17637486681/job/50146967503](https://github.com/pytorch/pytorch/actions/runs/17637486681/job/50146967503) ([comment](https://github.com/pytorch/pytorch/pull/160078#issuecomment-3281426631))	2025-09-11 15:29:29 +00:00
PyTorch MergeBot	9f783e172d	Revert "Build and Install Arm Compute Library in manylinux docker image (#159737 )" This reverts commit 582d278983b28a91ac0cedd035183f2495bb6887. Reverted https://github.com/pytorch/pytorch/pull/159737 on behalf of https://github.com/atalman due to Sorry reverting this broke linux aarch64 CUDA nightlies [pytorch/pytorch/actions/runs/17637486681/job/50146967503](https://github.com/pytorch/pytorch/actions/runs/17637486681/job/50146967503) ([comment](https://github.com/pytorch/pytorch/pull/159737#issuecomment-3281398272))	2025-09-11 15:25:24 +00:00
Animesh Jain	a8432bcaad	[dynamo][guards] Fail on an unknown framelocals to dict conversion (#162695 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162695 Approved by: https://github.com/williamwen42 ghstack dependencies: #162694	2025-09-11 15:01:00 +00:00
Animesh Jain	a3a40cb741	[dynamo][guards] Do not construct framelocals to dict on GlobalsGuardAccessor (#162694 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162694 Approved by: https://github.com/williamwen42	2025-09-11 15:01:00 +00:00
Tugsbayasgalan Manlaibaatar	c924c675d0	Fix persistent buffer bug (#162190 ) For non-persistent buffers, we should properly register them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162190 Approved by: https://github.com/zhxchen17	2025-09-11 14:56:26 +00:00
Jithun Nair	c3f30eca9e	Remove tests-to-include from rocm-mi300 workflow (#162721 ) Accidentally introduced by https://github.com/pytorch/pytorch/pull/162288 (was meant to be a temporary change) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162721 Approved by: https://github.com/jeffdaily	2025-09-11 14:36:07 +00:00
Jeff Daily	1e710552c1	[ROCm][CI] benchmark must patch fbgemm_gpu with tbb dep (#162649 ) fbgemm adds tbb as a dep only for rocm to avoid missing tbb symbols at import. But the way it was done was in setup.py to add the linker flag to CMAKE_CXX_FLAGS and it wasn't working for reasons unknown to me. But what did work was to add tbb as a dep in the cmake file. [We have a PR against upstream fbgemm](https://github.com/pytorch/FBGEMM/pull/4859) for that. Meanwhile, a much smaller patch is applied here in this PR until the fbgemm rocm ci commit hash is moved forward to include the tbb patch from upstream. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162649 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-11 14:10:51 +00:00
Sun,Jiabin	7c39b2ecbe	use torch.accelerator and device_module instead of cuda to make DataParallel more device agnostic. (#162573 ) Use torch.accelerator and `_get_device_module` instead of cuda to make DataParallel more device agnostic. Fixes #162152 While adding support for my own privateuse1 backend to the DataParallel module, I found CUDA-specific APIs in parallel_apply.py, which forced me to monkey-patch the DataParallel module to run DP on my backend. This change replaces cuda.xxx with accelerator.xxx and acquires the device module via `_get_device_module`. This is my first contribution to PyTorch, so please let me know if there is any problem with the change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162573 Approved by: https://github.com/ezyang, https://github.com/guangyey Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com> Co-authored-by: Edward Z. Yang <ezyang@mit.edu>	2025-09-11 10:04:27 +00:00
Naveen Suda	afdd4247a2	[torchao][pt2e] Make prepare and convert faster by caching (#162550 ) Summary: D79674759 tried to fix the expensive prepare and convert steps, as `assert_and_get_unique_device` was called multiple times. This change fixes that issue by using `functools.cache` decorator. Test Plan: Verified on llm export to QNN. LLM Quantization prepare time of ~20min reduced to ~3min. Rollback Plan: Differential Revision: D82073679 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162550 Approved by: https://github.com/andrewor14	2025-09-11 07:59:22 +00:00
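
The caching idea in #162550 can be illustrated with a minimal sketch. Only `functools.cache` comes from the commit message; `find_unique_device` and the `CALLS` counter are hypothetical stand-ins, not the torchao implementation.

```python
import functools

CALLS = 0  # counts how often the expensive body actually runs


@functools.cache
def find_unique_device(devices: tuple) -> str:
    """Hypothetical stand-in for an expensive 'unique device' check."""
    global CALLS
    CALLS += 1
    unique = set(devices)
    assert len(unique) == 1, f"expected exactly one device, got {unique}"
    return next(iter(unique))


first = find_unique_device(("cpu", "cpu"))
second = find_unique_device(("cpu", "cpu"))  # same argument: served from the cache
```

`functools.cache` keys on the arguments, so they must be hashable (a tuple here rather than a list).
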
Lucy Qiu	22df9332da	[serialization] Add pte file to archive (#162520 ) Summary: Add _package_executorch_files to archive apis. Allow us to package a PTE file into the archive. I don't think there's a use-case to have more than one PTE file at the moment, but left it as `EXECUTORCH_FILES` just in case. Test Plan: Tested in D81992612 Rollback Plan: Differential Revision: D81977483 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162520 Approved by: https://github.com/angelayi	2025-09-11 07:59:11 +00:00
Sun, Jiayi	6b9b7ce6fe	fix torch.sparse.log_softmax on CPU (#161959 ) Fix https://github.com/pytorch/pytorch/issues/152293. Example: ``` import torch from torch.sparse import log_softmax as sparse_log_softmax def test_bug(): a = torch.rand(4, 3) b = a - 10000000.0 b_sparse = b.to_sparse() cpu_out_sparse = sparse_log_softmax(b_sparse, dim=1).to_dense() print('cpu_out_sparse =', cpu_out_sparse) b_sparse_double = b.double().to_sparse() cpu_out_sparse_double = sparse_log_softmax(b_sparse_double, dim=1).to_dense() print('cpu_out_sparse_double =', cpu_out_sparse_double) if __name__ == '__main__': test_bug() ``` Output: - before ``` cpu_out_sparse = tensor([[-2., -1., -2.], [-1., -1., -1.], [-1., -2., -2.], [-1., -1., -2.]]) cpu_out_sparse_double = tensor([[-1.5514, -0.5514, -1.5514], [-1.0986, -1.0986, -1.0986], [-0.5514, -1.5514, -1.5514], [-0.8620, -0.8620, -1.8620]], dtype=torch.float64) ``` - after ``` cpu_out_sparse = tensor([[-0.8620, -1.8620, -0.8620], [-1.0986, -1.0986, -1.0986], [-1.8620, -0.8620, -0.8620], [-1.0986, -1.0986, -1.0986]]) cpu_out_sparse_double = tensor([[-0.8620, -1.8620, -0.8620], [-1.0986, -1.0986, -1.0986], [-1.8620, -0.8620, -0.8620], [-1.0986, -1.0986, -1.0986]], dtype=torch.float64) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161959 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/mingfeima	2025-09-11 07:52:05 +00:00
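
For reference, the "after" values in #161959 are what the textbook max-subtraction form of log-softmax produces. The sketch below is plain Python, not the sparse CPU kernel, and it does not claim to show what the PR changed; the input row is an assumption chosen to mimic float32 rounding of values near -1e7.

```python
import math


def log_softmax(row):
    """Dense log-softmax using the max-subtraction trick for stability."""
    m = max(row)
    shifted = [x - m for x in row]  # largest entry becomes 0
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]


# Assumed input: near -1e7, float32 can only represent whole numbers.
row = [-9999999.0, -10000000.0, -9999999.0]
out = log_softmax(row)
```

Without the shift, `math.exp(-1e7)` underflows to zero and the logarithm of the sum is undefined.
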
Scott Wolchok	1274297e06	Remove __torch_dispatch__ check in THPVariable_make_dtensor (#162337 ) We control DTensor, so we can just guarantee there isn't a programming error with __torch_dispatch__. (The guard is already less-than-perfect; see the note that the deleted comment refers to.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162337 Approved by: https://github.com/Skylion007 ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220, #162218, #161596	2025-09-11 06:58:35 +00:00
Scott Wolchok	f68f76d8c7	Remove logger.debug statements in DTensor dispatch (#161596 ) These seem to have been costing us 5-10 usec per detach (out of ~95 usec total). If they need to ship let's talk about requirements and how we can make this more efficient given that we would prefer if an entire DTensor op could finish in 10 usec. Differential Revision: [D81530106](https://our.internmc.facebook.com/intern/diff/D81530106) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161596 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220, #162218	2025-09-11 06:58:35 +00:00
Deng, Daisy	fa1d409e83	[2/N]Port several test files under test/distributed to Intel GPU (#159473 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR covers some test files under test/distributed. We enable Intel GPU with the following methods while trying to keep the original code style: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - use requires_accelerator_dist_backend to allow both nccl and xccl tests - enable XPU for some test paths - change the hardcoded world_size according to device_count - unify some common code under torch/testing/_internal for multiple backends, for example: added xpu for Backend.backend_capability and dist.Backend.register_backend() Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-11 06:44:26 +00:00
Xu Han	52d4660ae9	[AOTI] Fix Windows fail to zip opened file. (#162617 ) Original issue (screenshot): https://github.com/user-attachments/assets/9de90d50-217f-4049-8f19-77ff1660c8b0 reproducer: ```cmd pytest test\inductor\test_aot_inductor.py -v -k test_weight_on_disk_legacy_cpu ``` Fixes: 1. `WritableTempFile`'s `__exit__` function automatically unlinks the file, which raises an error if the file is still open. Ignore that error on Windows. 2. Opening the zip file fails if the file is already open. Switch to `_wfsopen` with a shared-access flag, which can open the file with shared access. Local test passed (screenshot): https://github.com/user-attachments/assets/935cbf2e-52db-41f1-80fa-617569b92a96 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162617 Approved by: https://github.com/jansel	2025-09-11 06:22:21 +00:00
Mark Saroufim	7345454e2e	compile_kernel: Handle python floats as c double (#162626 ) This was an open todo in the code and probably a footgun in waiting Pull Request resolved: https://github.com/pytorch/pytorch/pull/162626 Approved by: https://github.com/malfet	2025-09-11 06:03:25 +00:00
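
The footgun mentioned in #162626 is easy to reproduce with `ctypes` alone. The helper below is a hypothetical sketch of mapping Python scalars to C argument types; it is not the `compile_kernel` code, and `to_c_arg` is an invented name.

```python
import ctypes


def to_c_arg(value):
    """Hypothetical helper: wrap a Python scalar in a matching ctypes type."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return ctypes.c_bool(value)
    if isinstance(value, int):
        return ctypes.c_int(value)
    if isinstance(value, float):
        # Python floats are IEEE-754 doubles, so c_double preserves the value.
        return ctypes.c_double(value)
    raise TypeError(f"unsupported argument type: {type(value).__name__}")


arg = to_c_arg(0.1)
narrowed = ctypes.c_float(0.1)  # single precision loses bits of 0.1
```

Passing a Python float as a C `float` silently changes the value, which is why the double mapping is the safer default in this sketch.
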
PyTorch MergeBot	23170dfebc	Revert "Move inductor jobs 3.9->3.10 (#162323 )" This reverts commit 0663bdb12383b9717af49d58aed9d88de0dd0ecc. Reverted https://github.com/pytorch/pytorch/pull/162323 on behalf of https://github.com/huydhn due to Not sure what had happened, but some inductor unit tests start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/162323#issuecomment-3278125192))	2025-09-11 05:57:13 +00:00
Mark Saroufim	12e993f533	compile_kernel large shared memory fix (#162647 ) Alternate solution to https://github.com/pytorch/pytorch/pull/162328 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162647 Approved by: https://github.com/eqy	2025-09-11 05:52:46 +00:00
PyTorch UpdateBot	07d2531672	[vllm hash update] update the pinned vllm hash (#162551 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162551 Approved by: https://github.com/pytorchbot	2025-09-11 04:56:04 +00:00
Jagadish Krishnamoorthy	6944d4b639	[ROCm] rocblas Aten GEMM overload for FP32 output from FP16/BF16 inputs (#162600 ) Fix ROCm GEMM helper to set output type (C/D) based on C_Dtype template parameter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162600 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony	2025-09-11 03:34:07 +00:00
Isuru Fernando	f654cff566	[inductor] Add shape to load_input in matmul templates (#162513 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162513 Approved by: https://github.com/eellison ghstack dependencies: #162426	2025-09-11 01:51:15 +00:00
Isuru Fernando	f17c5e0789	[inductor] Add shape for store_output in matmul templates (#162426 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162426 Approved by: https://github.com/eellison	2025-09-11 01:51:15 +00:00
Tianyu Liu	435c18fb4a	[DTensor] add op support for aten.unbind.int (#162560 ) As titled. It seems unbind returns views of the original tensor. E.g. see https://stackoverflow.com/questions/78910951/does-unbind-return-the-views-of-tensors-in-pytorch So we error out when `shard_dim == unbind_dim`. This is similar to why we error out in view ops. https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_view_ops.py#L544-L546 This PR also refactors some other tensor ops code, by creating two utils function `shift_shard_dims_after_insert`, `shift_shard_dims_after_remove`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162560 Approved by: https://github.com/zpcore	2025-09-11 00:58:23 +00:00
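
The dimension bookkeeping described in #162560 can be sketched in plain Python. This is an illustration under assumptions: `shift_dims_after_remove` is a hypothetical function operating on bare integer dims, and the real `shift_shard_dims_after_remove` utility may have a different signature and work on placement objects.

```python
def shift_dims_after_remove(shard_dims, removed_dim):
    """Renumber sharded dims after `removed_dim` disappears from the tensor.

    Dims after the removed one move down by one. Sharding on the removed
    dim itself is rejected, mirroring the error for shard_dim == unbind_dim.
    """
    shifted = []
    for dim in shard_dims:
        if dim == removed_dim:
            raise ValueError(f"cannot remove sharded dim {dim}")
        shifted.append(dim - 1 if dim > removed_dim else dim)
    return shifted


# Unbinding along dim 1 of a tensor sharded on dims 0 and 2.
result = shift_dims_after_remove([0, 2], removed_dim=1)
```
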
dolpm	612cdc8f48	-ldl for nativert tests (#162643 ) Fixes #162640 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162643 Approved by: https://github.com/yiming0416, https://github.com/robert-hardwick	2025-09-11 00:35:57 +00:00
Edward Yang	da5069f289	Don't include cuh header when USE_NVSHMEM is off (#162635 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162635 Approved by: https://github.com/kwen2501	2025-09-11 00:24:50 +00:00
Mark Saroufim	4fd2a2b273	Add cuda headers automatically for compile_kernel (#162634 ) Issue was pointed out before by @ngimel and more recently by https://gau-nernst.github.io/nvrtc-matmul/#missing-cuda-and-c-headers- by @gau-nernst Benefit is now we can add `#include <cuda_fp16.h>` without crapping out Pull Request resolved: https://github.com/pytorch/pytorch/pull/162634 Approved by: https://github.com/ngimel	2025-09-11 00:20:33 +00:00
Ting Lu	bb1d53bc47	[CD] CUDA 13 specific followup changes (#162455 ) Follow up for CUDA 13 bring up https://github.com/pytorch/pytorch/issues/159779 sm50-70 should not be added to sbsa build arch list, as previous archs had no support for arm. remove platform_machine from PYTORCH_EXTRA_INSTALL_REQUIREMENTS Pull Request resolved: https://github.com/pytorch/pytorch/pull/162455 Approved by: https://github.com/atalman	2025-09-11 00:03:47 +00:00
Ben Niu	36338fc7f2	Relax fences for intrusive ptr's refcnt (#162072 ) Summary: Relax fences for intrusive ptr's refcnt dec op for performance testing. lock needs acquire ordering when the op succeeds and relaxed ordering when it does not. In addition, the expire call and the following refcnt reads were merged to remove one extra read. incref does not need any fences because the caller should already have a valid reference. use_count follows the same reasoning. decref only needs a release fence to make sure every write op prior to it has finished. When the refcnt goes to zero, there should be an acquire fence to make sure no read op reads stale data before the object is destructed. However, a microbenchmark showed that the optimal fence for decref does not perform noticeably better than the current decref with acq-rel, so we keep decref as-is. This change should have no material impact on x86, but for Arm64 (and other CPUs with weak memory models), it should boost performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162072 Approved by: https://github.com/swolchok, https://github.com/yfeldblum	2025-09-10 23:17:01 +00:00
Daniel Vega-Myhre	e0c910149c	Build fbgemm_gpu for TORCH_CUDA_ARCH_LIST=10.0 and CUDA 12.8 and 12.9 (#162544 ) ## Summary - PyTorch is not built for the `a` variants of SM architectures (such as sm100a), due to non-portability. However, we need fbgemm_gpu kernels built for sm100a (see #162209) ## Changes - Setting USE_FBGEMM_GENAI for CUDA builds: fbgemm_gpu builds for sm100a if using CUDA 12.8 or 12.9 ([source](`2033a0a08f/.github/scripts/nova_dir.bash (L29-L32)`)), so I follow the same rule here. - Extra nvcc flags: if USE_FBGEMM_GENAI and USE_CUDA are set, we add extra nvcc flags for sm100a ## Test plan Test build: ``` echo $CUDA_HOME /usr/local/cuda-12.9 export TORCH_CUDA_ARCH_LIST=10.0 python -m pip install --no-build-isolation -v -e . ``` Check build logs: ``` CMake Warning at CMakeLists.txt:901 (message): Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a ``` Run unit tests: - `pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162544 Approved by: https://github.com/drisspg	2025-09-10 22:59:41 +00:00
eellison	f4aeceaa9d	Use upper bound for persistent rblock (#162441 ) Previously, we were using 128 and increasing to upper bound. We should be setting at the upper bound and raising to next power of 2. Differential Revision: [D81984103](https://our.internmc.facebook.com/intern/diff/D81984103) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162441 Approved by: https://github.com/PaulZhang12	2025-09-10 22:29:02 +00:00
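
The "raise to the next power of 2" step in #162441 is a standard rounding operation. The sketch below shows only that arithmetic; `next_power_of_2` and `upper_bound` are illustrative names, and how Inductor actually picks the persistent rblock is not shown here.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


upper_bound = 1000  # hypothetical reduction-size upper bound
rblock = next_power_of_2(upper_bound)
```
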
Michael Lazos	d8e6b2fddc	[Cutlass] Add exp and sigmoid activations (#162536 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162536 Approved by: https://github.com/henrylhtsang, https://github.com/eellison ghstack dependencies: #162535	2025-09-10 21:44:26 +00:00
Michael Lazos	31c25c7d01	[Cutlass] Add tanh activation and test case for activations (#162535 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162535 Approved by: https://github.com/henrylhtsang	2025-09-10 21:44:26 +00:00
eqy	5dbee5691c	[cuDNN][Convolution][TF32][64bit] Add `tf32_on_and_off` decorator to conv3d 64bit test (#161004 ) cuDNN has new generated kernels that can use TF32. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161004 Approved by: https://github.com/janeyx99, https://github.com/Skylion007	2025-09-10 21:39:35 +00:00
drisspg	864ffe12d7	Fix some edge cases (#162295 ) ``` Summary. Columns: attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops BWD (base) | TFlops BWD (no_peel) | no_peel_speedup_over_base | pct_delta. 🔝 Top 5 Performance Differences (by absolute %): sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 56.937931 | 58.960459 | 1.035522 | 3.552163; noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 89.221306 | 86.295642 | 0.967209 | -3.27911; causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 111.552594 | 114.380841 | 1.025353 | 2.535349; alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 74.830149 | 76.685445 | 1.024793 | 2.479344; alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 55.279932 | 56.369312 | 1.019707 | 1.97066. 🔺 Top 5 Cases Where no_peel (change) is Faster than base (baseline): sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 56.937931 | 58.960459 | 1.035522 | 3.552163; causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 111.552594 | 114.380841 | 1.025353 | 2.535349; alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 74.830149 | 76.685445 | 1.024793 | 2.479344; alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 55.279932 | 56.369312 | 1.019707 | 1.97066; causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 111.08814 | 112.447047 | 1.012233 | 1.22327. 🔻 Top 5 Cases Where no_peel (change) is Slower than base (baseline): noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 89.221306 | 86.295642 | 0.967209 | -3.27911; causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 78.23082 | 76.693169 | 0.980345 | -1.965531; sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 96.95663 | 95.573333 | 0.985733 | -1.426717; alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 93.373473 | 92.294147 | 0.988441 | -1.155924; alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 96.95147 | 96.105389 | 0.991273 | -0.872685. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162295 Approved by: https://github.com/mlazos, https://github.com/v0i0	2025-09-10 21:33:45 +00:00
Yuhui Shi	4e35594674	[Lowering] Fix the edge case of empty subgraph split due to dataclass node (#161716 ) Summary: Fix the edge case by allowing `call_function` nodes with no deps as graph entry (starter_nodes) in the splitter. Test Plan: The test shall pass in the current diff (after fix), and fail in the parent diff (before fix) ``` buck test mode/opt //glow/fb/fx/lowering:split_tests -- test_dataclass_as_graph_entry ``` Rollback Plan: Differential Revision: D81232435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161716 Approved by: https://github.com/ezyang	2025-09-10 21:23:42 +00:00
Gabriel Ferns	35d7b32159	Improve device info with new flops and bandwidth formula based on hardware libraries (#162245 ) Previously, DeviceInfo provided theoretical hardware information based on a hardcoded list manually created from various datasheets. This update: - Attempts to gather the information from a hardware library like `pynvml`, improving accuracy and expanding support to devices that don't have entries in the datasheet list. - Adjusts the flops and bw calculation based on these hardware values. For example, if the memory or SMs are underclocked, it adjusts the theoretical max flops/bw accordingly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162245 Approved by: https://github.com/v0i0, https://github.com/shunting314	2025-09-10 21:19:13 +00:00
atalman	0663bdb123	Move inductor jobs 3.9->3.10 (#162323 ) Related to: https://github.com/pytorch/pytorch/issues/161167 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162323 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2025-09-10 20:58:41 +00:00
PyTorch MergeBot	40ea6e418a	Revert "Fix decorators skipping NCCL tests (#158846 )" This reverts commit c2388201fc85b0748173212de5a17514c7a71f21. Reverted https://github.com/pytorch/pytorch/pull/158846 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some inductor tests ([comment](https://github.com/pytorch/pytorch/pull/158846#issuecomment-3276471387))	2025-09-10 20:51:31 +00:00
Colin Peppler	348303ebd2	[ez] add docstring/typing for codegen_kernel_benchmark (#162609 ) ``` lintrunner init && lintrunner -m origin/main ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162609 Approved by: https://github.com/coconutruben ghstack dependencies: #162442	2025-09-10 20:49:38 +00:00
Colin Peppler	94755e81c4	[inductor] Enable combo kernels with unbacked inputs (#162442 ) Internal user tried enabling combo kernels, but ran into "Cannot convert symbols to int". This PR is to enable combo kernels on inputs with data-dependent shapes. ### Example exception ``` File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 4997, in benchmark_combo_kernel kernel_code_list = self.generate_combo_kernel_code( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/simd.py", line 1849, in generate_combo_kernel_code src_code = kernel.codegen_kernel() ^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 802, in codegen_kernel code.splice(self.codegen_kernel_benchmark(num_gb=0)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 852, in codegen_kernel_benchmark var_names.extend(self.kernel_benchmark_extra_args()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 733, in kernel_benchmark_extra_args extra_args.append(str(V.graph.sizevars.size_hint(tree.numel))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 584, in size_hint return int(out) ^^^^^^^^ File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/site-packages/sympy/core/expr.py", line 307, in __int__ raise TypeError("Cannot convert symbols to int") torch._inductor.exc.InductorError: TypeError: Cannot convert symbols to int ``` Differential Revision: [D82042230](https://our.internmc.facebook.com/intern/diff/D82042230) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162442 Approved by: https://github.com/jansel	2025-09-10 20:49:38 +00:00
Tugsbayasgalan Manlaibaatar	6d65737aee	testing infra and some fixes (#162183 ) This PR is quite large in that it covers most of the rough edges in the new strict export flow: 1. Handle nn_module_stack correctly now that we are tracing the wrapper module. 2. module_call_spec needs to be queried from the source directly because we are not running the bytecode anymore. 3. Correct input and output handling. @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183 Approved by: https://github.com/zhxchen17	2025-09-10 20:48:12 +00:00
PyTorch MergeBot	053251b98d	Revert "Make functorch notebook symlinks PEP 517 valid (#157813 )" This reverts commit b494547f0bd6cb1ce5d8d104cb419802434c9c08. Reverted https://github.com/pytorch/pytorch/pull/157813 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but this surfaces a weird discrepancy between GitHub and Mercurial used internally ([comment](https://github.com/pytorch/pytorch/pull/157813#issuecomment-3276442242))	2025-09-10 20:45:48 +00:00
Justin Chu	7e2e83cdbe	[ONNX] Update export docstring (#162622 ) Update export docstring to reflect the latest configuration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162622 Approved by: https://github.com/titaiwangms	2025-09-10 20:29:46 +00:00
PyTorch MergeBot	d033d11d26	Revert "[torch][c10d] fix split_group in mixed backend case (#162424 )" This reverts commit 2dc26131801a430e030a773c4fbfe874e263259d. Reverted https://github.com/pytorch/pytorch/pull/162424 on behalf of https://github.com/clee2000 due to failure seems related, maybe a hang/timeout distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_model_diff_shape_across_ranks log classifier is pointing at the wrong line ([comment](https://github.com/pytorch/pytorch/pull/162424#issuecomment-3276360494))	2025-09-10 20:13:44 +00:00
PyTorch MergeBot	80d4da893c	Revert "Put torchao (0.13.0) back to benchmark workflow (#162227 )" This reverts commit 00985970e312c3c5e674e8e14d39fe77c226600e. Reverted https://github.com/pytorch/pytorch/pull/162227 on behalf of https://github.com/huydhn due to Crashing some inductor jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/162227#issuecomment-3276355034))	2025-09-10 20:11:37 +00:00

92886 Commits