pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 12:54:11 +08:00

Author	SHA1	Message	Date
Aaron Gokaslan	ceb11a584d	[BE]: Update kleidiai submodule to v1.15.0 (#165842 ) This mostly just adds a few new kernels, fixes some IMAs, and improves the performance of previous kernels. Also improves compiler support. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165842 Approved by: https://github.com/albanD	2025-10-19 08:25:03 +00:00
Aaron Gokaslan	33adb276fe	[BE][Ez]: Update Eigen to 5.0.0. C++14 support and more! (#165840 ) Update Eigen pin to 5.0.0. Tons of new features and perf improvements. Most importantly, it raises the minimum standard from C++03 to C++14, enabling a ton of performance optimizations like properly implemented move operators, simplified code, etc. Also improves vectorization, particularly on ARM. We really only use this library as a fallback for sparse operators, but it is still useful to update it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165840 Approved by: https://github.com/albanD	2025-10-19 08:00:06 +00:00
Simon Layton	23417ae50f	[Submodule] Bump FBGEMM to latest (#165544 ) Summary: * FBGEMM submodule updated to main * CMake updated to reflect necessary changes * Notably pulls in NVFP4 grouped gemm kernels Test Plan: Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/165544 Approved by: https://github.com/cyyever, https://github.com/jeffdaily	2025-10-18 03:58:08 +00:00
Aaron Gokaslan	de09bab4b6	[BE]: Update cudnn frontend submodule to 1.15.0 (#165776 ) Update cudnn frontend submodule to 1.15.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165776 Approved by: https://github.com/eqy	2025-10-18 02:23:27 +00:00
Cui, Yifeng	9e89b1c4c7	Update torch-xpu-ops commit pin (#165321 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@ce9db1](`ce9db15136`), includes: - Fix test_barrier hang by using static global rank in ProcessGroupXCCL - Update install_xpu_headers only when content should change to speedup recompilation - Add global rank information to communication logging - Remove duplicate normalization from FFT methods Pull Request resolved: https://github.com/pytorch/pytorch/pull/165321 Approved by: https://github.com/EikanWang	2025-10-14 09:07:24 +00:00
Cui, Yifeng	53f5af8c92	Update torch-xpu-ops commit pin (#164237 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@f30173](`f301733b03`), includes: - Install xpu internal headers to PyTorch - Fix error handling for BatchLinearAlgebra Ops - Fix unnecessary double data type conversion - Fix overflow when calculating workgroups count - Fix segmentation fault and calculation error in AveragePool2dKernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/164237 Approved by: https://github.com/EikanWang	2025-10-09 10:38:59 +00:00
Henry Tsang	7d7ae4d7b2	[submodule] upgrade cutlass version to 4.2.1 and completely resolved python/cutlass name collision (#164156 ) Differential Revision: D83489362 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164156 Approved by: https://github.com/Skylion007, https://github.com/mlazos	2025-09-30 17:04:57 +00:00
Cui, Yifeng	4783e3ff49	Update torch-xpu-ops commit pin (#163758 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@229e8b](`229e8ba104`), includes: - Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL to fix memory leak - Enable SYCL warnings on Linux - Fix accuracy issues with CTC loss - Enable aten::nonzero_static on XPU backend - Stop recursive calculations in polynomial kernels if tensor has NaNs Pull Request resolved: https://github.com/pytorch/pytorch/pull/163758 Approved by: https://github.com/EikanWang	2025-09-26 09:05:08 +00:00
Shivam Raikundalia	45d9dcccc5	Update Kineto Submodule (#162222 ) Summary: Update Test Plan: CI Rollback Plan: Differential Revision: D81727392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162222 Approved by: https://github.com/sanrise	2025-09-23 06:08:55 +00:00
Chris Thi	e310cc5e06	Update fbgemm submodule (#163411 ) Test Plan: As titled, includes some new changes fbgemm to see if CUDA13 breakage is fixed. Reviewers: Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/163411 Approved by: https://github.com/Skylion007	2025-09-22 15:46:11 +00:00
Yuanyuan Chen	8a281d7214	[submodule] Bump libfmt to 12.0.0 (#163441 ) libfmt 12.0 brings new optimisations and fixes some compilation issues for clang 21 (https://github.com/fmtlib/fmt/pull/4477). For a detailed release log, see https://github.com/fmtlib/fmt/releases/tag/12.0.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163441 Approved by: https://github.com/Skylion007	2025-09-21 22:37:25 +00:00
PyTorch MergeBot	607469bdad	Revert "[ROCm] Bump FBGEMM commit to avoid CK errors (#162590 )" This reverts commit c9b80c4d4b48deb1931e5f8641ab345d7cc7b639. Reverted https://github.com/pytorch/pytorch/pull/162590 on behalf of https://github.com/malfet due to This breaks CUDA 13 builds ([comment](https://github.com/pytorch/pytorch/pull/162590#issuecomment-3313263772))	2025-09-19 18:13:00 +00:00
Han Chao	e134bb340a	Update torch-xpu-ops commit pin (#163244 ) Update the torch-xpu-ops commit to `24fab67b6e`, includes: - Clean up getDeviceIndexOfCurrentQueue - Fix hardswish gradients corner case - Fix xccl contiguous check - Move checks from nonzero kernel to operator - support high priority stream for xccl Pull Request resolved: https://github.com/pytorch/pytorch/pull/163244 Approved by: https://github.com/EikanWang	2025-09-19 02:04:40 +00:00
Prachi Gupta	c9b80c4d4b	[ROCm] Bump FBGEMM commit to avoid CK errors (#162590 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/162590 Approved by: https://github.com/jeffdaily	2025-09-19 01:14:20 +00:00
henrylhtsang	a81a2e54ed	[submodule] CUTLASS upgrade to 4.2.0 and change cutlass to cutlass_cppgen (#163092 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163092 Approved by: https://github.com/drisspg, https://github.com/Skylion007	2025-09-18 18:03:51 +00:00
Eddie Yan	9b7a8c4d05	[cuDNN][SDPA][submodule] Roll-back cuDNN frontend upgrade, update Meta registration (#163104 ) For https://github.com/pytorch/torchtitan/issues/1713 Also note that we will need to rollback the cuDNN frontend upgrade in 2.9 as it currently introduces a segmentation fault by assuming tensors have their strides and sizes populated at graph creation time `1a7b4b78db/include/cudnn_frontend/node/sdpa_support_surface.h (L447)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163104 Approved by: https://github.com/drisspg	2025-09-17 15:48:54 +00:00
Nikita Shulga	65845d7291	Update Gloo submodule (#163112 ) Which makes PyTorch buildable with gcc-15, tested by running the build inside `fedora:44` docker ``` docker run --rm -it fedora:44 bash -c "yum install -y g++ python3-devel git; git clone https://github.com/pytorch/pytorch; cd pytorch; git checkout 8f710acce8332979c9a7bf97e72666dfd35c43e6; python3 -mpip install -r requirements.txt; python3 setup.py bdist_wheel" ``` Fixes https://github.com/pytorch/pytorch/issues/156595 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163112 Approved by: https://github.com/huydhn	2025-09-17 03:04:09 +00:00
Cui, Yifeng	9786243b64	Update torch-xpu-ops commit pin (#162804 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@d8c3ee](`d8c3eefc29`), includes: - Optimize adaptive average pool for channel-last memory format - Add unregister wait_tensor - Replace deprecated `[[intel::reqd_sub_group_size(SgSize)]]` with `[[sycl::reqd_sub_group_size(SIMD)]]` and remove unnecessary attributes - Revert "Roll back to original usage of sycl::get_kernel_bundle" Pull Request resolved: https://github.com/pytorch/pytorch/pull/162804 Approved by: https://github.com/EikanWang	2025-09-16 06:30:48 +00:00
Xu Han	52d4660ae9	[AOTI] Fix Windows fail to zip opened file. (#162617 ) Reproducer: ```cmd pytest test\inductor\test_aot_inductor.py -v -k test_weight_on_disk_legacy_cpu ``` Fixed list: 1. `WritableTempFile`'s `__exit__` function automatically unlinks the file; if the file is still open, the unlink raises an error. Ignore that error on Windows. 2. Opening the zip file fails if the file is already open. Switch to `_wfsopen` with a shared access flag, which can open the file with shared access. Local test passed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162617 Approved by: https://github.com/jansel	2025-09-11 06:22:21 +00:00
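The first item in the fix list above can be illustrated with a minimal Python sketch. This is not the actual `WritableTempFile` code from the pull request; the class name `TempFileSketch` is hypothetical, and the only assumption carried over from the commit message is that a failed unlink of a still-open file is ignored on Windows and remains an error elsewhere.

```python
import os
import sys
import tempfile


class TempFileSketch:
    """Hypothetical stand-in for a temp-file context manager (not PyTorch's class)."""

    def __enter__(self):
        # Create the file and close our own handle so callers can reopen it by path.
        fd, self.path = tempfile.mkstemp()
        os.close(fd)
        return self.path

    def __exit__(self, exc_type, exc, tb):
        try:
            os.unlink(self.path)
        except PermissionError:
            # On Windows, unlinking a file that another handle still holds open
            # raises PermissionError. Ignore it there; re-raise everywhere else.
            if sys.platform != "win32":
                raise
        return False  # never swallow exceptions raised inside the with-block


with TempFileSketch() as path:
    with open(path, "w") as f:
        f.write("payload")
```

When no other handle is open at exit time, the file is removed on every platform; the `except` branch only matters when something still holds the file open on Windows.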
PyTorch MergeBot	96ef26f71a	Revert "[ROCm] Integrate AITER Fav3 fwd kernels (#160105 )" This reverts commit d2393c2d7da03a1523a12e6f80edb6bd7b464ec5. Reverted https://github.com/pytorch/pytorch/pull/160105 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internal ROCm build ([comment](https://github.com/pytorch/pytorch/pull/160105#issuecomment-3273297183))	2025-09-10 04:42:28 +00:00
PyTorch MergeBot	2281d009e5	Revert "[ROCm] Add specific compile options for CK SDPA (#161759 )" This reverts commit d22d916719eb7daff8455a01d216d65f81899a9e. Reverted https://github.com/pytorch/pytorch/pull/161759 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to break internal ROCm jobs ([comment](https://github.com/pytorch/pytorch/pull/161759#issuecomment-3272807726))	2025-09-10 00:44:30 +00:00
Andy Lugo	d2393c2d7d	[ROCm] Integrate AITER Fav3 fwd kernels (#160105 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/160105 Approved by: https://github.com/jeffdaily	2025-09-09 22:30:12 +00:00
Andy Lugo	d22d916719	[ROCm] Add specific compile options for CK SDPA (#161759 ) Updates CK version and adds CK specific compilation options Pull Request resolved: https://github.com/pytorch/pytorch/pull/161759 Approved by: https://github.com/jeffdaily	2025-09-09 20:04:19 +00:00
Aaron Gokaslan	ec2c1371af	[BE]: Update cudnn frontend submodule to 1.14.1 (#162347 ) Fixes a few bugs introduced to CUDNN 1.11 which affects all our CUDA13 builds. Also adds support for new CUDNN features whenever we choose to update. @eqy pretty sure this addresses the concern you had over the previous upgrade since that bugfix is now merged. This is a simple header only update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162347 Approved by: https://github.com/eqy, https://github.com/atalman	2025-09-08 20:03:23 +00:00
Daniel Vega-Myhre	b6d0a9ea90	MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump (#162209 ) ## Summary - We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816 - This is needed for backward pass of mxfp8 MoE training with grouped gemms - Changes: - Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm` - Add meta registration input validation for mxfp8 grouped gemm, for composability with compile - Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs - Bump FBGEMM third party submodule to include: - https://github.com/pytorch/FBGEMM/pull/4816 - https://github.com/pytorch/FBGEMM/pull/4820 - https://github.com/pytorch/FBGEMM/pull/4821 - https://github.com/pytorch/FBGEMM/pull/4823 #### How fbgemm dependency was bumped Documenting this since I haven't found it documented elsewhere: - `cd ~/pytorch/third_party/fbgemm` - `git fetch` - `git checkout <hash>` - `cd ~/pytorch` - `git add third_party/fbgemm` ## Test plan #### Test build ``` USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e . ... Successfully installed torch-2.9.0a0+gitf5070f3 ``` [full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581) #### Unit tests ``` pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_ ... test/test_matmul_cuda.py ......... [100%] ============================================================== 9 passed, 1668 deselected in 5.34s =============================================================== ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209 Approved by: https://github.com/ngimel	2025-09-06 15:25:30 +00:00
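The manual submodule bump steps documented in the commit above can be sketched as a small dry-run helper. The function names `submodule_bump_commands` and `run_commands` are hypothetical, the repository path `pytorch` is a placeholder, and `<hash>` is left unfilled exactly as in the commit message; only the `git fetch`, `git checkout`, and `git add` steps come from the source.

```python
import subprocess
from pathlib import Path


def submodule_bump_commands(repo_root, submodule, commit):
    """Return (cwd, argv) pairs mirroring the manual steps; nothing is executed."""
    repo = Path(repo_root)
    sub = repo / submodule
    return [
        (sub, ["git", "fetch"]),             # cd third_party/fbgemm && git fetch
        (sub, ["git", "checkout", commit]),  # git checkout <hash>
        (repo, ["git", "add", submodule]),   # cd back to the repo root && git add third_party/fbgemm
    ]


def run_commands(commands):
    """Execute the planned commands in order, stopping at the first failure."""
    for cwd, argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)


# Dry run: print the plan instead of touching a real checkout.
plan = submodule_bump_commands("pytorch", "third_party/fbgemm", "<hash>")
for cwd, argv in plan:
    print(f"(cd {cwd} && {' '.join(argv)})")
```

Keeping the plan separate from its execution lets the steps be reviewed or logged before `run_commands(plan)` is called against an actual working tree.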
Aaron Gokaslan	1f51056bd6	[BE]: Update cpp-httplib submodule to 0.26.0 (#162181 ) Update cpp-httplib with better error handling, bugfixes, and performance. Header only library update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162181 Approved by: https://github.com/jansel	2025-09-04 18:59:32 +00:00
Cui, Yifeng	ba7f546ccc	Update torch-xpu-ops commit pin (#162062 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](`83c5a5a551`), includes: - Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed - Fallback lu_factor kernel to CPU for single batch - Enable aten::linalg_inv and aten::linalg_inv_ex on XPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/162062 Approved by: https://github.com/EikanWang	2025-09-04 17:05:33 +00:00
PyTorch MergeBot	4cdaf8265d	Revert "Update Kineto submodule (#161572 )" This reverts commit d33840c542b387ab08ba49aa6c45aa9567fd9be7. Reverted https://github.com/pytorch/pytorch/pull/161572 on behalf of https://github.com/seemethere due to This appears as though it's causing downstream build failures in inductor workflows and for developers working locally. Going to revert out of an abundance of caution. ([comment](https://github.com/pytorch/pytorch/pull/161572#issuecomment-3247121981))	2025-09-02 23:28:19 +00:00
Yu, Guangye	a99d8d39bc	Update torch-xpu-ops commit pin (#161919 ) # Motivation 1. Fallback some linalg functionality such as `linalg_eig`, `linalg_householder_product`, `linalg_solve_triangular` to CPU; 2. Fix codegen dependency bug. # Additional Context This PR aims to fix https://github.com/pytorch/pytorch/issues/161498 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161919 Approved by: https://github.com/EikanWang	2025-09-02 17:09:07 +00:00
Shivam Raikundalia	d33840c542	Update Kineto submodule (#161572 ) Differential Revision: D81087601 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161572 Approved by: https://github.com/cyyever, https://github.com/aaronenyeshi	2025-09-02 16:31:55 +00:00
yucai-intel	f44ad54bc6	Update torch-xpu-ops commit pin (#161152 ) Update the torch-xpu-ops commit to [8b58040ee32689487f660462f655085f31506dab](`8b58040ee3`), includes: - Add vectorization path on maxpool forward channel last - Add FlightRecorder support for ProcessGroupXCCL - Fix random build failure on codegen - Suppress dllexport warning on Windows - Make torch-xpu-ops build depend on ATen XPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/161152 Approved by: https://github.com/EikanWang Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-08-30 07:19:24 +00:00
Benjamin Glass	cbc53b7696	Update pybind11 submodule to 3.0.1 (#160754 ) Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling. Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent. Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754 Approved by: https://github.com/Skylion007	2025-08-27 21:15:01 +00:00
PyTorch MergeBot	1b34e04485	Revert "Update pybind11 submodule to 3.0.1 (#160754 )" This reverts commit 660b0b8128181d11165176ea3f979fa899f24db1. Reverted https://github.com/pytorch/pytorch/pull/160754 on behalf of https://github.com/atalman due to please see https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226051449 ([comment](https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226078102))	2025-08-26 23:35:22 +00:00
Benjamin Glass	660b0b8128	Update pybind11 submodule to 3.0.1 (#160754 ) Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling. Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent. Test plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754 Approved by: https://github.com/Skylion007	2025-08-26 01:21:18 +00:00
frost-intel	9b4adc4db7	[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568 ) Adds support for FlightRecorder in ProcessGroupXCCL. See https://github.com/intel/torch-xpu-ops/pull/1867 for XCCL implementation and more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158568 Approved by: https://github.com/guangyey, https://github.com/fduwjj	2025-08-22 09:03:35 +00:00
Johnny	691d17a5c6	Update TensorPipe submodule (#160808 ) To a commit containing https://github.com/pytorch/tensorpipe/pull/464 that fixes compilation with CUDA-13 Fixes https://github.com/pytorch/pytorch/issues/160104 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160808 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007, https://github.com/malfet	2025-08-17 14:11:41 +00:00
chunhuanMeng	663da17b62	Update torch-xpu-ops commit pin (#160062 ) Update the torch-xpu-ops commit to [77cc792cd265179745d335579d233e6d4f9a2667](`77cc792cd2`), includes: - Ensures that the XPU cache is cleared before creating tensors during the test - Add unused variable warning - Fix test_linalg and test_torch issue with bf32_on_and_off updates - Fix deterministic indexing with broadcast - Fix dist.gather with noncontiguous tensor - Improve accuracy of index put deterministic kernel - Add generate file rely avoid build before generate - optimize embedding bag Fixes #160661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160062 Approved by: https://github.com/EikanWang	2025-08-15 15:27:24 +00:00
cyy	c184cb3852	[submodule] Bump fbgemm to latest (#158210 ) Merge the recent commits of FBGEMM and remove unnecessary CMake code. Specifically, we 1. enable `fbgemm_autovec` since the target is now correctly handled. 2. remove option `USE_FAKELOWP` which is not used. 3. remove `CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS` check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158210 Approved by: https://github.com/q10	2025-08-11 13:48:02 +00:00
Frank Seide	b8ef60b6bc	Enable XNNPACK aarch64 builds (#159762 ) Summary: This fixes the build of TorchScript's XNNPACK dependency for our aarch64 device. Thanks to andrewjcg for proposing this fix. Rollback Plan: Reviewed By: andrewjcg Differential Revision: D79497613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159762 Approved by: https://github.com/frankseide, https://github.com/malfet Co-authored-by: Frank Seide <seide@meta.com>	2025-08-06 20:20:32 +00:00
Nikita Shulga	9b953bb3fb	[BE] Update TensorPipe pin (#159834 ) No functional changes, just: - Update C++ standard to C++17 - Update `cmake` min version to 3.18 - Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10) - Replace boost optional implementation with `std::optional` wrapper - Make it compilable with gcc-14.x plus by including `cstddef` in few headers - Avoid using deprecated enums for MacOS builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834 Approved by: https://github.com/Skylion007	2025-08-05 20:45:09 +00:00
Cui, Yifeng	57ab39f7e4	Update torch-xpu-ops commit pin (#159621 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](`1f7a57f507`) includes: - Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization - Add optional NaN checks to XCCL - Fix NllLossForwardReduce2DKernelFunctor accuracy - Extend the existing communication logging to include the reduction operation for collective calls - [Reland] Install xpu codegen header to torch/include Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621 Approved by: https://github.com/EikanWang	2025-08-05 01:46:15 +00:00
Andy Lugo	06d28de17a	Update CK Kernel generation and update ck submodule (#157964 ) changes required to reduce the number of ck kernels generated. This change depends on https://github.com/ROCm/composable_kernel/pull/2480 to be merged first. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157964 Approved by: https://github.com/842974287	2025-08-01 22:24:27 +00:00
Chris Thi	c400c8e2e0	[ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update (#159075 ) Summary: In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is [currently supported on Nvidia](`9faef3d17c/aten/src/ATen/native/cuda/Blas.cpp (L1614)`), this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950#top) #157950. The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported. In the near future we plan to add support for MI350X as well. For data types we support FP8 e4m3. The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds. Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm. 
Test Plan:

Hipify & build
```
python tools/amd_build/build_amd.py
USE_FBGEMM_GENAI=1 python setup.py develop
```

Unit tests
```
python test/test_matmul_cuda.py -- TestFP8MatmulCUDA
Ran 488 tests in 32.969s
OK (skipped=454)
```

Performance Sample

| G | M | N | K | Runtime Ms | GB/S | TFLOPS |
| -- | -- | -- | -- | -- | -- | -- |
| 128 | 1 | 2048 | 5120 | 0.37 | 3590 | 7.17 |
| 128 | 64 | 2048 | 5120 | 0.51 | 2792 | 338.34 |
| 128 | 128 | 2048 | 5120 | 0.66 | 2272 | 522.72 |
| 128 | 1 | 5120 | 1024 | 0.21 | 3224 | 6.43 |
| 128 | 64 | 5120 | 1024 | 0.29 | 2590 | 291.40 |
| 128 | 128 | 5120 | 1024 | 0.40 | 2165 | 434.76 |
| 128 | 1 | 4096 | 4096 | 0.69 | 3126 | 6.25 |
| 128 | 64 | 4096 | 4096 | 0.85 | 2655 | 324.66 |
| 128 | 128 | 4096 | 4096 | 1.10 | 2142 | 501.40 |
| 128 | 1 | 8192 | 8192 | 2.45 | 3508 | 7.01 |
| 128 | 64 | 8192 | 8192 | 3.27 | 2692 | 336.74 |
| 128 | 128 | 8192 | 8192 | 4.04 | 2224 | 543.76 |
| 16 | 1 | 2048 | 5120 | 0.04 | 3928 | 7.85 |
| 16 | 64 | 2048 | 5120 | 0.05 | 3295 | 399.29 |
| 16 | 128 | 2048 | 5120 | 0.07 | 2558 | 588.69 |
| 16 | 1 | 5120 | 1024 | 0.03 | 3119 | 6.23 |
| 16 | 64 | 5120 | 1024 | 0.03 | 2849 | 320.62 |
| 16 | 128 | 5120 | 1024 | 0.05 | 2013 | 404.11 |
| 16 | 1 | 4096 | 4096 | 0.06 | 4512 | 9.02 |
| 16 | 64 | 4096 | 4096 | 0.09 | 3124 | 381.95 |
| 16 | 128 | 4096 | 4096 | 0.13 | 2340 | 547.67 |
| 16 | 1 | 8192 | 8192 | 0.32 | 3374 | 6.75 |
| 16 | 64 | 8192 | 8192 | 0.42 | 2593 | 324.28 |
| 16 | 128 | 8192 | 8192 | 0.53 | 2120 | 518.36 |

- Using ROCm 6.4.1
- Collected through `triton.testing.do_bench_cudagraph`

Binary size with gfx942 arch
Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so
After: 118860960 Jul 23 14:29 build/lib/libtorch_hip.so
The difference is 2757104 bytes (~2.6 MiB).
Reviewers: @drisspg @ngimel @jwfromm @jeffdaily Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075 Approved by: https://github.com/drisspg	2025-07-30 23:53:58 +00:00
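The figures reported in the commit above can be cross-checked with a few lines of Python. The binary sizes are taken directly from the message. The TFLOPS formula is an assumption, not something the commit states: it treats each of the G groups as an M x N x K GEMM costing 2*M*N*K floating-point operations. The helper name `grouped_gemm_tflops` is hypothetical.

```python
MIB = 2**20

# Sizes of build/lib/libtorch_hip.so reported in the commit message.
before_bytes = 116103856
after_bytes = 118860960
delta_bytes = after_bytes - before_bytes
delta_mib = delta_bytes / MIB


def grouped_gemm_tflops(g, m, n, k, runtime_ms):
    """Assumed model: G independent M x N x K GEMMs, 2*M*N*K FLOPs each."""
    flops = 2 * g * m * n * k
    return flops / (runtime_ms * 1e-3) / 1e12


print(delta_bytes, round(delta_mib, 2))  # 2757104 2.63
print(round(grouped_gemm_tflops(128, 1, 8192, 8192, 2.45), 2))  # 7.01
```

Under this assumption the G=128, M=1, N=K=8192 row reproduces the reported 7.01 TFLOPS. Rows with very short runtimes deviate more, since the runtime column is rounded to two decimals.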
Aaron Gokaslan	22492848b6	[BE]: Update CUTLASS submodule to 4.1.0 (#158854 ) Update the CUTLASS submodule to the latest version with new supported architectures and new features we can use. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158854 Approved by: https://github.com/henrylhtsang	2025-07-30 17:44:38 +00:00
Max Ren	86df3ff1f1	fix xnnpack build on mac (#158881 ) Summary: Fix a bug for not getting the correct sources Test Plan: CI on my mac: ``` buck2 build @//fbobjc/mode/profile --show-full-output //xplat/executorch/examples/portable/executor_runner:executor_runner_opt File changed: fbsource//xplat/caffe2/third_party/xnnpack.buck.bzl Buck UI: https://www.internalfb.com/buck2/67b59179-4de8-462a-9202-0b9c34a35aef Network: Up: 2.4MiB Down: 1.3KiB (reSessionID-f687a7cd-5961-4851-bc67-b07043baa52a) Loading targets. Remaining 0/1 504 targets declared Analyzing targets. Remaining 0/42 1960 actions, 2424 artifacts declared Executing actions. Remaining 0/975 37.2s exec time total Command: build. Finished 40 local Time elapsed: 7.7s BUILD SUCCEEDED fbsource//xplat/executorch/examples/portable/executor_runner:executor_runner_opt /Users/maxren/fbsource/buck-out/v2/gen/fbsource/267ffdee31edf15e/xplat/executorch/examples/portable/executor_runner/__executor_runner_opt__/executor_runner_opt ``` Rollback Plan: Reviewed By: swolchok Differential Revision: D78771697 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158881 Approved by: https://github.com/digantdesai	2025-07-23 22:06:27 +00:00
saienduri	5619bf9971	Enable MI355X PyTorch CI testing. (#158889 ) This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes. - Rework aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION`, as aotriton depends on HIP. HIP loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build. - Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](`df6023e305`) for mi350 compatibility. - Extend the change docker permissions step to the MI355x runners as well. This step is included to apply the required permission change to the test folder for a successful upload of artifacts in k8s docker. - Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST. - Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: `ca7d5fae11 (rocm-mi300)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889 Approved by: https://github.com/jeffdaily	2025-07-23 21:50:31 +00:00
PyTorch MergeBot	0142d5f4e2	Revert "Remove is_arvr_mode() from xnnpack.buck.bzl (#158682 )" This reverts commit f09a484b8164aaadd57a79354f0ccf47733f365e. Reverted https://github.com/pytorch/pytorch/pull/158682 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158682#issuecomment-3101648365))	2025-07-22 08:33:08 +00:00
Ketan Ambati	f09a484b81	Remove is_arvr_mode() from xnnpack.buck.bzl (#158682 ) Summary: Changes * Deleted function import from build definition utilities * Removed `load("//tools/build_defs:fbsource_utils.bzl", "is_arvr_mode")` * Replaced is_arvr_mode() function calls with direct references to configuration flags * Changed from `is_arvr_mode()` to `"ovr_config//build_mode:arvr_mode"` * Changed conditional expressions to Buck `select()` statements Test Plan: Check if CI passes Rollback Plan: Differential Revision: D78520947 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158682 Approved by: https://github.com/malfet	2025-07-21 22:49:26 +00:00
Eddie Yan	590607c599	[cuDNN][SDPA] Bump cuDNN frontend submodule version to 1.12.1 (#158044 ) Really we are just interested in this change which fixes an apparent regression for d=256 support on Hopper `bc5f4fd88d` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158044 Approved by: https://github.com/Skylion007	2025-07-10 22:01:18 +00:00
Aaron Gokaslan	ed6ae20cf0	[BE][Ez]: Update mimalloc submodule to 2.2.4 (#157794 ) Fixes a few minor bugs in the previous release and improves compiler support. Should be a NOOP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157794 Approved by: https://github.com/atalman	2025-07-09 14:03:07 +00:00


1867 Commits