1867 Commits

87bfd66c3c gloo: update to latest version (#149985)
This updates the Gloo submodule to the latest version and brings a number of benefits:

* connection retries d2609ab5e8
* better error messages 5ca057d6cc
* multi_get support for larger scale jobs 4ff6edf45f
* metadata exchange optimizations 20dc202dd8
* miscellaneous other fixes

Old commit: 5354032ea0

Test plan:

This is already being used in production environments at scale.

PyTorch CI

```
pytest -v test/distributed/test_c10d_gloo.py
```
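
For a quick local sanity check beyond the suite above, the Gloo backend can also be exercised through `torch.distributed` in a single process (illustrative sketch; the address, port, and tensor are arbitrary):

```python
import torch
import torch.distributed as dist

# Single-process smoke test for the Gloo backend (world_size=1, so the
# all_reduce is effectively a no-op round-trip through the backend).
dist.init_process_group(
    backend="gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
t = torch.ones(4)
dist.all_reduce(t)
assert t.eq(1).all()
dist.destroy_process_group()
```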

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149985
Approved by: https://github.com/fduwjj, https://github.com/malfet
2025-03-26 19:19:31 +00:00
ce54c430c0 [Submodule] [cpuinfo] cpuinfo update (#149305)
Updating the `cpuinfo` submodule.

Relevant:
https://github.com/pytorch/cpuinfo/issues/270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149305
Approved by: https://github.com/malfet
2025-03-25 22:44:50 +00:00
6bcf9c6ce3 [xnnpack] Expose subgraph symbols (#149397)
Summary: The main XNNPACK target code uses symbols from subgraph, so they need to be exported. This was uncovered on macOS, where the symbols were not visible after linking.

Test Plan: CI / used for a macOS build on top of the stack.

Differential Revision: D71315023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149397
Approved by: https://github.com/digantdesai
2025-03-19 01:14:46 +00:00
e84cc4c052 Update Kineto Submodule (#149089)
Summary: We have made a lot of changes in Kineto this month. It is a good idea to update the submodule now, especially since the roctracer-sdk change will be very large.

Test Plan: CI

Differential Revision: D71082829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149089
Approved by: https://github.com/Skylion007
2025-03-13 17:18:16 +00:00
e9c12e819d Update torch-xpu-ops commit pin (#148881)
Update the torch-xpu-ops commit to [026b2c8c7c92a7b2cec5d26334006e3423251cc6](026b2c8c7c), which includes:

- Enable AOT for LNL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148881
Approved by: https://github.com/EikanWang
2025-03-10 20:31:51 +00:00
f2f25a5444 Upgrade submodule oneDNN to v3.7.1 (#148293)
This PR upgrades the oneDNN submodule to v3.7.1.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA and implemented fp16 and bf16 GEMM kernels in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 convolution performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.
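
For reference, whether a given build links oneDNN, and which version it bundles, can be checked from Python (an illustrative check, not part of this PR's validation; oneDNN may be reported as MKL-DNN in the config string):

```python
import torch

# True when this build links against oneDNN (a.k.a. mkldnn).
print(torch.backends.mkldnn.is_available())

# The build-config string includes the bundled DNN library versions.
print([line for line in torch.__config__.show().splitlines() if "DNN" in line.upper()])
```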

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is the same as the baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148293
Approved by: https://github.com/mingfeima, https://github.com/atalman
2025-03-04 13:56:45 +00:00
6d70b42810 [BE][Ez]: Update fmt submodule to 11.1.4 (#148264)
This minor release is mostly bugfixes, ABI fixes, and compiler support fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148264
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-02 19:00:00 +00:00
3a69dee955 [Submodule][FlashAttention] Bump to 2.7.4 (#148147)
# Summary
This makes me happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148147
Approved by: https://github.com/Skylion007
2025-02-28 22:40:02 +00:00
21bd5fe203 Update torch-xpu-ops commit pin (#147968)
Update the torch-xpu-ops commit to [86aaaf8a9dd6932c088b7afcac0c0856b23d341a](86aaaf8a9d), which includes:

- Bugfix (PT2E/BatchNorm)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147968
Approved by: https://github.com/Skylion007
2025-02-27 05:01:12 +00:00
7bd2e3bca1 Update torch-xpu-ops commit pin (#147743)
Update the torch-xpu-ops commit to [306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7](306a0ffb6e), which includes:

- Bugfix (LayerNorm/Nonzeros)
- Update AOT target

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147743
Approved by: https://github.com/EikanWang
2025-02-25 08:06:35 +00:00
e72b4c61bf Revert "Upgrade submodule oneDNN to v3.7 (#147498)"
This reverts commit 576ed1e400d069ec2fff6162f82a71ff0bd81f7c.

Reverted https://github.com/pytorch/pytorch/pull/147498 on behalf of https://github.com/wdvr due to failing some tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147498#issuecomment-2679867286))
2025-02-24 22:57:39 +00:00
576ed1e400 Upgrade submodule oneDNN to v3.7 (#147498)
This PR upgrades the oneDNN submodule to v3.7.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA and implemented fp16 and bf16 GEMM kernels in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 convolution performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is the same as the baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147498
Approved by: https://github.com/fadara01, https://github.com/mingfeima, https://github.com/atalman
2025-02-24 14:32:51 +00:00
db15cb0988 [Submodule] [Cutlass] Update to 3.8.0 tag (#147655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147655
Approved by: https://github.com/henrylhtsang, https://github.com/eqy
2025-02-22 20:05:31 +00:00
4ece056791 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``.
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``.
We use a pinned version of NCCL rather than a submodule.
Moves the NCCL location from ``third_party/nccl/nccl`` to ``third_party/nccl``.
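
For reference, the NCCL version a given build picked up can be checked at runtime (assuming a CUDA-enabled build):

```python
import torch

# Reports the NCCL version this build links against, e.g. (2, 25, 1).
print(torch.cuda.nccl.version())
```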

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-19 03:52:26 +00:00
7622e29a37 Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit eecee5863e698d19458b33df7bfecbda0a04557a.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))
2025-02-18 22:23:35 +00:00
5d675de754 Update ck (#144799)
Updates the CK version and re-implements kernel generation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144799
Approved by: https://github.com/jianyuh
2025-02-18 17:00:27 +00:00
6edc419d69 Update torch-xpu-ops commit pin (#147358)
Update the torch-xpu-ops commit to [a14d1eaa834a616705068103dc8129319087e864](a14d1eaa83), which includes:

- SparseCSR XPU support
- Refine build system

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147358
Approved by: https://github.com/EikanWang
2025-02-18 16:05:25 +00:00
6a2bb629ec Update torch-xpu-ops commit pin (#147302)
Update the torch-xpu-ops commit to [b421032c8fed40df5eaee395c2e7f5f8a7bcc815](b421032c8f), which includes:

- Correct int4 weight pack implementation
- Enhance build system: only build one shared library for the user

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147302
Approved by: https://github.com/EikanWang
2025-02-18 05:04:15 +00:00
4233a77960 update kineto submodule to include fix for windows build (#147195)
Fixes an issue causing Windows builds to fail:
https://github.com/pytorch/kineto/pull/1039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147195
Approved by: https://github.com/cyyever, https://github.com/davidberard98, https://github.com/sraikund16
2025-02-15 01:53:16 +00:00
eecee5863e Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``.
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``.
We use a pinned version of NCCL rather than a submodule.
Moves the NCCL location from ``third_party/nccl/nccl`` to ``third_party/nccl``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 21:23:19 +00:00
e06ee4aa9f Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))
2025-02-14 16:44:46 +00:00
059dfe2081 Revert "update kineto submodule (#147015)"
This reverts commit d1997b610f5b974af7ebad6b9903d2d8f751d927.

Reverted https://github.com/pytorch/pytorch/pull/147015 on behalf of https://github.com/atalman due to broke windows builds ([comment](https://github.com/pytorch/pytorch/pull/147015#issuecomment-2659730304))
2025-02-14 16:11:08 +00:00
06f4a5c0e5 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``.
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``.
We use a pinned version of NCCL rather than a submodule.
Moves the NCCL location from ``third_party/nccl/nccl`` to ``third_party/nccl``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 15:29:59 +00:00
de26ddfbdc Update torch-xpu-ops commit pin (#146671)
Update the torch-xpu-ops commit to [80c375570e2b6b2989a8610da1871f8a50dfddc7](80c375570e), which includes:

- Aten operator coverage improvement
- SYCL kernel optimization
- Nested Tensor OPs support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146671
Approved by: https://github.com/EikanWang
2025-02-14 09:30:36 +00:00
020232ec9f [Submodule]: Update KleidiAI submodule to v1.3.0 (#146480)
Change-Id: I687255982c72ee7daca438a15b718f07298963cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146480
Approved by: https://github.com/digantdesai, https://github.com/malfet
2025-02-13 15:23:04 +00:00
d1997b610f update kineto submodule (#147015)
Fix https://github.com/pytorch/kineto/issues/1032
See https://github.com/pytorch/kineto/pull/1035 for the test plan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147015
Approved by: https://github.com/sraikund16, https://github.com/Skylion007
2025-02-13 15:13:18 +00:00
3e4172d985 [BE][Ez]: Update fmtlib submodule to 11.1.3 (#146985)
This submodule update fixes a number of miscellaneous bugs involving ABI compatibility, compiler warnings, workarounds for older compilers, performance, and edge cases in formatting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146985
Approved by: https://github.com/drisspg
2025-02-13 06:47:11 +00:00
bb2fb554a9 [BE]: Update CUTLASS submodule to 3.7.0 (#145172)
* This has a couple of new features, but mostly has a lot of bugfixes for the prior releases
* This is the last Hopper-focused release of CUTLASS before Blackwell drops, so let's upgrade to it.
* Most of the remaining diff noise is copyright year updates on the CUTLASS submodule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145172
Approved by: https://github.com/eqy, https://github.com/henrylhtsang
2025-01-29 21:48:01 +00:00
f388ba5986 Update CUDNN frontend submodule to 1.10.0 (#145780)
Update to CUDNN Frontend 1.10.0. Most of this release is about supporting new APIs needed for Blackwell integration and new features in the corresponding CUDNN version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145780
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/malfet
2025-01-28 22:54:24 +00:00
72da0a8a42 [Submodule] Add flash as third-party submodule [Prep for later PRs] (#145502)
# Context

Prototyped in https://github.com/pytorch/pytorch/pull/144120, we are going to make flash-attention a third-party submodule. We will then use the C++ sources and include them in our build of libtorch.so.

Getting this to work requires both external and internal changes. Since internal changes are needed, we have to co-dev, and in the co-dev environment I haven't found a way to sync submodule changes together with internal-only changes.

This is unused for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145502
Approved by: https://github.com/Skylion007
2025-01-24 09:21:41 +00:00
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to a KleidiAI clone issue.

1. This reverts commit 0940eb6d44f3cf69dd840db990245cbe1f78e770 (https://github.com/pytorch/pytorch/pull/145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm's GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
3afc5170d4 [Submodule] Upgrade to Cutlass 3.6 part deux (#144911)
# Summary
Take 2 of [D67866269](https://www.internalfb.com/diff/D67866269)
The main change is that we identified and fixed the FA2 regression. More details can be found at https://github.com/pytorch/pytorch/issues/144729; the fix landed before this in [D68194635](https://www.internalfb.com/diff/D68194635).

Differential Revision: D68194470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144911
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-17 00:53:42 +00:00
6470b0ea6f Update torch-xpu-ops commit pin (#144739)
Update the torch-xpu-ops commit to [22cc419e4e60f469341712a5a103fa309a7dfd48](22cc419e4e), which includes:

- Fix building issue https://github.com/intel/torch-xpu-ops/issues/1279
- Aten operator coverage improvement

Note: the new torch-xpu-ops commit does not support bundle 0.5.3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144739
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-01-16 15:12:37 +00:00
db787181b5 Back out "[Submodule] Upgrade to Cutlass 3.6" (#144738)
Summary: Revert due to perf regressions see: https://github.com/pytorch/pytorch/issues/144729

Test Plan: Sandcastle

Differential Revision: D68137326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144738
Approved by: https://github.com/huydhn
2025-01-15 02:57:14 +00:00
c9afa00a85 update sleef for disable libm on Windows [submodule Sleef] (#142245)
This PR is the implementation of RFC https://github.com/pytorch/pytorch/issues/141946.
Changes:
1. Update `Sleef` to contain its PR https://github.com/shibatch/sleef/pull/603.
2. Set `SLEEF_BUILD_WITH_LIBM` to `OFF`, which turns off `Sleef`'s CMake `find_library(libm)` lookup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142245
Approved by: https://github.com/EikanWang, https://github.com/atalman

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
2025-01-11 00:11:55 +00:00
bd1f5d1c32 update xnnpack for disable libm on Windows [submodule XNNPACK] (#141943)
This PR is the implementation of RFC https://github.com/pytorch/pytorch/issues/141946.
Changes:
1. Update `XNNPACK` to contain its PRs https://github.com/google/XNNPACK/pull/7456 and https://github.com/google/XNNPACK/pull/7535, plus other build-fix PRs.
2. Set `XNNPACK_BUILD_WITH_LIBM` to `OFF`, which turns off `XNNPACK`'s CMake `find_library(libm)` lookup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141943
Approved by: https://github.com/atalman
2025-01-10 00:47:41 +00:00
206a932f23 [Submodule] Upgrade to Cutlass 3.6 (#144180)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-09 21:56:53 +00:00
f71688f30d Revert "[Submodule] Upgrade to Cutlass 3.6 (#144180)"
This reverts commit f2c103317814eecf2b622e322e4d0877c16af943.

Reverted https://github.com/pytorch/pytorch/pull/144180 on behalf of https://github.com/huydhn due to Ops, this fails some slow tests.  Please help fix and reland this ([comment](https://github.com/pytorch/pytorch/pull/144180#issuecomment-2581302233))
2025-01-09 21:45:39 +00:00
f2c1033178 [Submodule] Upgrade to Cutlass 3.6 (#144180)
Differential Revision: [D67866269](https://our.internmc.facebook.com/intern/diff/D67866269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-09 17:29:58 +00:00
dcc3cf7066 [BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
The fixes are generated by:

```bash
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```
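
For illustration, E226 flags arithmetic operators without surrounding whitespace, which with ruff 0.8 is now also checked inside f-strings. A hypothetical before/after:

```python
total, n = 10, 4

# Flagged by E226: missing whitespace around an arithmetic operator.
print(f"mean={total/n}")

# After the autofix:
print(f"mean={total / n}")
```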

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-08 21:55:00 +00:00
5c783bf410 [BE][Ez]: Update CUDNN Frontend submodule to 1.9.0 (#144200)
* Update CUDNN Frontend to 1.9.0, which includes some API improvements, new features, and bugfixes. This is a header-only library fix, so it should be pretty straightforward.
* The nicest feature is that it now logs/prints a warning when the compiled CUDNN version does not match the dynamically loaded one.
* Fixes corrupted/truncated log lines from being printed by CUDNN Frontend.
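
For reference, the cuDNN version a PyTorch build reports can be checked from Python (illustrative, assuming a CUDA build):

```python
import torch

# Integer-encoded cuDNN version, e.g. 90100 -> 9.1.0 (None without CUDA/cuDNN).
print(torch.backends.cudnn.version())
```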
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144200
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-01-06 17:33:38 +00:00
1e881ceecf Update torch-xpu-ops commit pin (#143984)
Update the torch-xpu-ops commit to [28cfac20ec662abdb0ac98faf122450013e8f520](28cfac20ec), which includes:

- Disable batch_norm vectorization path to fix accuracy issues.
- Fix the LSTM/RNN implementation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143984
Approved by: https://github.com/EikanWang, https://github.com/ruidazeng, https://github.com/desertfire, https://github.com/jansel
2025-01-05 09:01:36 +00:00
005a4b9537 [Submodule] Bump Cutlass to 3.5.1 OSS PR (#144000)
## Summary
Follow-up PR to https://github.com/pytorch/pytorch/pull/143515. That PR added a bunch of macro switches to ensure both 3.4 and 3.5.1 built successfully. This PR actually bumps the CUTLASS pin to 3.5.1.

I am going to do a stack on top to add conditional gates for 3.6, hijacking the 3.4 switches. We will leapfrog our way to the top :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144000
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet
2025-01-04 18:04:03 +00:00
baee623691 [BE][Ez]: Update fmtlib submodule to 11.1.1 (#143937)
* Exactly the same as the previous fmtlib release, except it fixes an edge case that could affect ABI compatibility between fmtlib versions.
* Seems safe to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143937
Approved by: https://github.com/albanD
2024-12-30 19:46:27 +00:00
2ed4d65af0 Update torch-xpu-ops commit pin (#143853)
Update the torch-xpu-ops commit to [214f33](214f33b9d9), which includes:

- Fix building issue for transformer related operators
- Improve XPU operator coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143853
Approved by: https://github.com/EikanWang
2024-12-30 02:38:16 +00:00
e05bfb8ee3 [Submodule] Bump libfmt to 11.1.0 (#143843)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143843
Approved by: https://github.com/Skylion007
2024-12-26 04:49:11 +00:00
b77406a9ec [BE][CI] bump ruff to 0.8.4 (#143753)
Changes:

1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings
3. Change arguments with the `__` prefix to positional-only arguments with the `/` separator in the function signature (see the sketch below).
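
To make change 3 concrete, a hypothetical before/after of the positional-only rewrite:

```python
# Before: the double-underscore prefix marks parameters positional-only by convention.
def clamp(__x, __lo, __hi):
    return max(__lo, min(__x, __hi))

# After: PEP 570's `/` separator enforces positional-only at the language level.
def clamp(x, lo, hi, /):
    return max(lo, min(x, hi))
```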

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
2024-12-24 12:24:10 +00:00
94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include:
the input tensor, packed_weights, groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289
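
To make step 1 concrete, here is a rough sketch of symmetric group-wise 4-bit quantization with nibble packing (the shapes, nibble order, and scale layout are assumptions for illustration; the exact layout the aten ops expect is defined by the kernel, see the API-usage link above):

```python
import torch

# Illustrative group-wise symmetric 4-bit quantization (step 1 above).
out_features, in_features, groupsize = 8, 64, 32
w = torch.randn(out_features, in_features)

g = w.reshape(out_features, in_features // groupsize, groupsize)
scales = g.abs().amax(dim=-1, keepdim=True) / 7.0      # symmetric: zero-point is 0
q = torch.clamp(torch.round(g / scales), -8, 7)        # signed 4-bit range
q = (q + 8).to(torch.uint8).reshape(out_features, -1)  # shift to unsigned nibbles

packed = (q[:, 0::2] << 4) | q[:, 1::2]                # two 4-bit weights per uint8

# Round-trip sanity check: max error is half a quantization step per group.
dequant = (q.float() - 8).reshape_as(g) * scales
assert (dequant - g).abs().max() <= scales.max() / 2 + 1e-6
```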

Model perf:
7B Transformer model:
Prefill : 340 t/s
Decode  : 40 t/s
2B Transformer model:
Prefill : 747 t/s
Decode  : 80 t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452


Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
2daa666591 update kineto to XPU Windows fixed PR. [submodule kineto] (#143445)
Includes the XPU Windows fix PR: https://github.com/pytorch/kineto/pull/1012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143445
Approved by: https://github.com/sraikund16
2024-12-20 05:57:30 +00:00