87bfd66c3c
gloo: update to latest version ( #149985 )
...
This updates the Gloo submodule to the latest version and brings a number of benefits:
* connection retries d2609ab5e8
* better error messages 5ca057d6cc
* multi_get support for larger scale jobs 4ff6edf45f
* metadata exchange optimizations 20dc202dd8
* miscellaneous other fixes
Old commit: 5354032ea0
Test plan:
This is already being used in production environments at scale.
PyTorch CI
```
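# run the Gloo-backend c10d distributed test suite (test plan from the PR)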
pytest -v test/distributed/test_c10d_gloo.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149985
Approved by: https://github.com/fduwjj , https://github.com/malfet
2025-03-26 19:19:31 +00:00
ce54c430c0
[Submodule] [cpuinfo] cpuinfo update ( #149305 )
...
Updating the `cpuinfo` submodule.
Relevant:
https://github.com/pytorch/cpuinfo/issues/270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149305
Approved by: https://github.com/malfet
2025-03-25 22:44:50 +00:00
6bcf9c6ce3
[xnnpack] Expose subgraph symbols ( #149397 )
...
Summary: The main XNNPACK target code uses symbols from subgraph, so they need to be exported. This was uncovered on macOS, where the symbols were not visible after linking.
Test Plan: CI / used for a macOS build on top of the stack.
Differential Revision: D71315023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149397
Approved by: https://github.com/digantdesai
2025-03-19 01:14:46 +00:00
e84cc4c052
Update Kineto Submodule ( #149089 )
...
Summary: We have made a lot of changes in Kineto this month. It is a good idea to update the submodule now, especially since the roctracer-sdk change will be very large.
Test Plan: CI
Differential Revision: D71082829
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149089
Approved by: https://github.com/Skylion007
2025-03-13 17:18:16 +00:00
e9c12e819d
Update torch-xpu-ops commit pin ( #148881 )
...
Update the torch-xpu-ops commit to `026b2c8c7c92a7b2cec5d26334006e3423251cc6`, which includes:
- Enable AOT for LNL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148881
Approved by: https://github.com/EikanWang
2025-03-10 20:31:51 +00:00
f2f25a5444
Upgrade submodule oneDNN to v3.7.1 ( #148293 )
...
This PR upgrades submodule oneDNN to v3.7.1.
## Improvements
- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 convolution performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.
Fixes https://github.com/pytorch/pytorch/issues/136348 .
## Validation results on CPU
1. NLP models accuracy/inference/training
2. Torchbench cpu userbenchmark inference & training
3. Inductor quantization
4. Dynamo benchmarks
## Validation results on XPU
Accuracy is the same as baseline; performance is shown in the PR.
## Validation results on ARM
(The validation tables for CPU, XPU, and ARM were attached to the PR as images and are not reproduced here.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148293
Approved by: https://github.com/mingfeima , https://github.com/atalman
2025-03-04 13:56:45 +00:00
6d70b42810
[BE][Ez]: Update fmt submodule to 11.1.4 ( #148264 )
...
This minor release is mostly bugfixes, ABI fixes, and compiler support fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148264
Approved by: https://github.com/jansel , https://github.com/cyyever
2025-03-02 19:00:00 +00:00
3a69dee955
[Submodule][FlashAttention] Bump to 2.7.4 ( #148147 )
...
# Summary
This makes me happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148147
Approved by: https://github.com/Skylion007
2025-02-28 22:40:02 +00:00
21bd5fe203
Update torch-xpu-ops commit pin ( #147968 )
...
Update the torch-xpu-ops commit to `86aaaf8a9dd6932c088b7afcac0c0856b23d341a`, which includes:
- Bugfix (PT2E/BatchNorm)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147968
Approved by: https://github.com/Skylion007
2025-02-27 05:01:12 +00:00
7bd2e3bca1
Update torch-xpu-ops commit pin ( #147743 )
...
Update the torch-xpu-ops commit to `306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7`, which includes:
- Bugfix (LayerNorm/Nonzeros)
- Update AOT target
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147743
Approved by: https://github.com/EikanWang
2025-02-25 08:06:35 +00:00
e72b4c61bf
Revert "Upgrade submodule oneDNN to v3.7 ( #147498 )"
...
This reverts commit 576ed1e400d069ec2fff6162f82a71ff0bd81f7c.
Reverted https://github.com/pytorch/pytorch/pull/147498 on behalf of https://github.com/wdvr due to failing some tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147498#issuecomment-2679867286 ))
2025-02-24 22:57:39 +00:00
576ed1e400
Upgrade submodule oneDNN to v3.7 ( #147498 )
...
This PR upgrades submodule oneDNN to v3.7.
## Improvements
- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 convolution performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.
Fixes https://github.com/pytorch/pytorch/issues/136348 .
## Validation results on CPU
1. NLP models accuracy/inference/training
2. Torchbench cpu userbenchmark inference & training
3. Inductor quantization
4. Dynamo benchmarks
## Validation results on XPU
Accuracy is the same as baseline; performance is shown in the PR.
## Validation results on ARM
(The validation tables for CPU, XPU, and ARM were attached to the PR as images and are not reproduced here.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147498
Approved by: https://github.com/fadara01 , https://github.com/mingfeima , https://github.com/atalman
2025-02-24 14:32:51 +00:00
db15cb0988
[Submodule] [Cutlass] Update to 3.8.0 tag ( #147655 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147655
Approved by: https://github.com/henrylhtsang , https://github.com/eqy
2025-02-22 20:05:31 +00:00
4ece056791
Nccl update to 2.25.1 for cuda 12.4-12.8 ( #146073 )
...
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Moved the nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007 , https://github.com/malfet , https://github.com/kwen2501 , https://github.com/fduwjj
2025-02-19 03:52:26 +00:00
7622e29a37
Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 ( #146073 )"
...
This reverts commit eecee5863e698d19458b33df7bfecbda0a04557a.
Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179 ))
2025-02-18 22:23:35 +00:00
5d675de754
Update ck ( #144799 )
...
Updates the CK version and re-implements kernel generation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144799
Approved by: https://github.com/jianyuh
2025-02-18 17:00:27 +00:00
6edc419d69
Update torch-xpu-ops commit pin ( #147358 )
...
Update the torch-xpu-ops commit to `a14d1eaa834a616705068103dc8129319087e864`, which includes:
- SparseCSR XPU support
- Refine build system
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147358
Approved by: https://github.com/EikanWang
2025-02-18 16:05:25 +00:00
6a2bb629ec
Update torch-xpu-ops commit pin ( #147302 )
...
Update the torch-xpu-ops commit to `b421032c8fed40df5eaee395c2e7f5f8a7bcc815`, which includes:
- Correct int4 weight pack implementation
- Enhance build system: only build one shared library for the user
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147302
Approved by: https://github.com/EikanWang
2025-02-18 05:04:15 +00:00
4233a77960
update kineto submodule to include fix for Windows build ( #147195 )
...
Fixes an issue causing Windows builds to fail:
https://github.com/pytorch/kineto/pull/1039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147195
Approved by: https://github.com/cyyever , https://github.com/davidberard98 , https://github.com/sraikund16
2025-02-15 01:53:16 +00:00
eecee5863e
Nccl update to 2.25.1 for cuda 12.4-12.8 ( #146073 )
...
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Moved the nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007 , https://github.com/malfet , https://github.com/kwen2501 , https://github.com/fduwjj
2025-02-14 21:23:19 +00:00
e06ee4aa9f
Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 ( #146073 )"
...
This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6.
Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389 ))
2025-02-14 16:44:46 +00:00
059dfe2081
Revert "update kineto submodule ( #147015 )"
...
This reverts commit d1997b610f5b974af7ebad6b9903d2d8f751d927.
Reverted https://github.com/pytorch/pytorch/pull/147015 on behalf of https://github.com/atalman due to broke windows builds ([comment](https://github.com/pytorch/pytorch/pull/147015#issuecomment-2659730304 ))
2025-02-14 16:11:08 +00:00
06f4a5c0e5
Nccl update to 2.25.1 for cuda 12.4-12.8 ( #146073 )
...
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Moved the nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007 , https://github.com/malfet , https://github.com/kwen2501 , https://github.com/fduwjj
2025-02-14 15:29:59 +00:00
de26ddfbdc
Update torch-xpu-ops commit pin ( #146671 )
...
Update the torch-xpu-ops commit to `80c375570e2b6b2989a8610da1871f8a50dfddc7`, which includes:
- Aten operator coverage improvement
- SYCL kernel optimization
- Nested Tensor OPs support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146671
Approved by: https://github.com/EikanWang
2025-02-14 09:30:36 +00:00
020232ec9f
[Submodule]: Update KleidiAI submodule to v1.3.0 ( #146480 )
...
Change-Id: I687255982c72ee7daca438a15b718f07298963cc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146480
Approved by: https://github.com/digantdesai , https://github.com/malfet
2025-02-13 15:23:04 +00:00
d1997b610f
update kineto submodule ( #147015 )
...
Fix https://github.com/pytorch/kineto/issues/1032
See https://github.com/pytorch/kineto/pull/1035 for the test plan.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147015
Approved by: https://github.com/sraikund16 , https://github.com/Skylion007
2025-02-13 15:13:18 +00:00
3e4172d985
[BE][Ez]: Update fmtlib submodule to 11.1.3 ( #146985 )
...
This submodule update just fixes a bunch of miscellaneous bugs: ABI compatibility, compiler warnings, workarounds for older compilers, performance, and edge cases in formatting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146985
Approved by: https://github.com/drisspg
2025-02-13 06:47:11 +00:00
bb2fb554a9
[BE]: Update CUTLASS submodule to 3.7.0 ( #145172 )
...
* This has a couple of new features, but mostly has a lot of bugfixes for the prior releases
* This is the last Hopper-focused release of CUTLASS before blackwell drops, so let's upgrade to it.
* Most of the remaining diff noise is copyright year updates on the CUTLASS submodule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145172
Approved by: https://github.com/eqy , https://github.com/henrylhtsang
2025-01-29 21:48:01 +00:00
f388ba5986
Update CUDNN frontend submodule to 1.10.0 ( #145780 )
...
Update to CUDNN frontend 1.10.0. Most of this release is about supporting some new APIs needed for Blackwell integration, plus new features in the corresponding CUDNN version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145780
Approved by: https://github.com/albanD , https://github.com/atalman , https://github.com/malfet
2025-01-28 22:54:24 +00:00
72da0a8a42
[Submodule] Add flash as third-party submodule [Prep for later PRs] ( #145502 )
...
# Context
Prototyped here: https://github.com/pytorch/pytorch/pull/144120 , we are going to make flash-attention a third-party submodule. We will then use its C++ sources and include them in our build of libtorch.so.
This requires various changes to work, both external and internal. Since internal changes are required, we need to co-develop, and in the co-dev environment I haven't found a way to sync submodule changes together with internal-only changes.
This is unused for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145502
Approved by: https://github.com/Skylion007
2025-01-24 09:21:41 +00:00
41b38f755c
Revert "Reverting the PR adding Kleidiai-based int4 kernels ( #145392 )" ( #145505 )
...
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to a KleidiAI clone issue.
1. This reverts commit 0940eb6d44f3cf69dd840db990245cbe1f78e770 (https://github.com/pytorch/pytorch/pull/145392 ) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm's GitLab.
Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2
Fixes https://github.com/pytorch/pytorch/issues/145273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
0940eb6d44
Reverting the PR adding Kleidiai-based int4 kernels ( #145392 )
...
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi , https://github.com/malfet , https://github.com/atalman , https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
3afc5170d4
[Submodule] Upgrade to Cutlass 3.6 part deux ( #144911 )
...
# Summary
Take 2 of [D67866269](https://www.internalfb.com/diff/D67866269 )
The main change is that we identified and fixed the FA2 regression. More details can be found in https://github.com/pytorch/pytorch/issues/144729 , and that fix landed before this PR in [D68194635](https://www.internalfb.com/diff/D68194635 ).
Differential Revision: D68194470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144911
Approved by: https://github.com/eqy , https://github.com/Skylion007
2025-01-17 00:53:42 +00:00
6470b0ea6f
Update torch-xpu-ops commit pin ( #144739 )
...
Update the torch-xpu-ops commit to `22cc419e4e60f469341712a5a103fa309a7dfd48`, which includes:
- Fix for build issue https://github.com/intel/torch-xpu-ops/issues/1279
- Aten operator coverage improvement
Note: the new torch-xpu-ops commit doesn't support bundle 0.5.3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144739
Approved by: https://github.com/EikanWang , https://github.com/malfet
2025-01-16 15:12:37 +00:00
db787181b5
Back out "[Submodule] Upgrade to Cutlass 3.6" ( #144738 )
...
Summary: Revert due to perf regressions; see https://github.com/pytorch/pytorch/issues/144729
Test Plan: Sandcastle
Differential Revision: D68137326
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144738
Approved by: https://github.com/huydhn
2025-01-15 02:57:14 +00:00
c9afa00a85
update sleef to disable libm on Windows [submodule Sleef] ( #142245 )
...
This PR is an implementation of the RFC: https://github.com/pytorch/pytorch/issues/141946
Changes:
1. Update `Sleef` to include its PR: https://github.com/shibatch/sleef/pull/603
2. Set `SLEEF_BUILD_WITH_LIBM` to `OFF`, which turns off `Sleef`'s CMake `find_library(libm)` check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142245
Approved by: https://github.com/EikanWang , https://github.com/atalman
Co-authored-by: Eikan Wang <eikan.wang@intel.com >
2025-01-11 00:11:55 +00:00
bd1f5d1c32
update xnnpack to disable libm on Windows [submodule XNNPACK] ( #141943 )
...
This PR is an implementation of the RFC: https://github.com/pytorch/pytorch/issues/141946
Changes:
1. Update `XNNPACK` to include its PRs: https://github.com/google/XNNPACK/pull/7456 , https://github.com/google/XNNPACK/pull/7535 , and other build-fix PRs.
2. Set `XNNPACK_BUILD_WITH_LIBM` to `OFF`, which turns off `XNNPACK`'s CMake `find_library(libm)` check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141943
Approved by: https://github.com/atalman
2025-01-10 00:47:41 +00:00
206a932f23
[Submodule] Upgrade to Cutlass 3.6 ( #144180 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180
Approved by: https://github.com/eqy , https://github.com/Skylion007
2025-01-09 21:56:53 +00:00
f71688f30d
Revert "[Submodule] Upgrade to Cutlass 3.6 ( #144180 )"
...
This reverts commit f2c103317814eecf2b622e322e4d0877c16af943.
Reverted https://github.com/pytorch/pytorch/pull/144180 on behalf of https://github.com/huydhn due to Ops, this fails some slow tests. Please help fix and reland this ([comment](https://github.com/pytorch/pytorch/pull/144180#issuecomment-2581302233 ))
2025-01-09 21:45:39 +00:00
f2c1033178
[Submodule] Upgrade to Cutlass 3.6 ( #144180 )
...
Differential Revision: [D67866269](https://our.internmc.facebook.com/intern/diff/D67866269 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180
Approved by: https://github.com/eqy , https://github.com/Skylion007
2025-01-09 17:29:58 +00:00
dcc3cf7066
[BE] fix ruff rule E226: add missing whitespace around operator in f-strings ( #144415 )
...
The fixes are generated by:
```bash
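# auto-fix E226 (preview rules and unsafe fixes enabled), then re-run the
# RUFF and PYFMT linters over all files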
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```
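For reference, a minimal illustration of what E226 flags inside f-strings (the example is mine, not taken from the PR diff):
```python
price, qty = 3, 4

# Before: E226 flags the missing whitespace around `*` inside the f-string.
total = f"{price*qty}"

# After: the autofix inserts the whitespace.
total = f"{price * qty}"
```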
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415
Approved by: https://github.com/huydhn , https://github.com/Skylion007
2025-01-08 21:55:00 +00:00
5c783bf410
[BE][Ez]: Update CUDNN Frontend submodule to 1.9.0 ( #144200 )
...
* Update CUDNN Frontend to 1.9.0, which includes some API improvements, new features, and bugfixes. This is a header-only lib, so the update should be pretty straightforward.
* The nicest feature is that it now logs/prints warnings when the compiled CUDNN version does not match the dynamically loaded one.
* Fixes corrupted/truncated log lines being printed by CUDNN Frontend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144200
Approved by: https://github.com/cyyever , https://github.com/albanD
2025-01-06 17:33:38 +00:00
1e881ceecf
Update torch-xpu-ops commit pin ( #143984 )
...
Update the torch-xpu-ops commit to `28cfac20ec662abdb0ac98faf122450013e8f520`, which includes:
- Disable the batch_norm vectorization path to fix accuracy issues.
- Fix the LSTM/RNN implementation error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143984
Approved by: https://github.com/EikanWang , https://github.com/ruidazeng , https://github.com/desertfire , https://github.com/jansel
2025-01-05 09:01:36 +00:00
005a4b9537
[Submodule] Bump Cutlass to 3.5.1 OSS PR ( #144000 )
...
## Summary
Follow-up PR to https://github.com/pytorch/pytorch/pull/143515 . That PR added a bunch of macro switches to ensure both 3.4 and 3.5.1 built successfully. This PR actually bumps the cutlass pin to 3.5.1.
I am going to stack a PR on top to add conditional gates for 3.6, hijacking the 3.4 switches. We will leapfrog our way to the top :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144000
Approved by: https://github.com/Skylion007 , https://github.com/eqy , https://github.com/malfet
2025-01-04 18:04:03 +00:00
baee623691
[BE][Ez]: Update fmtlib submodule to 11.1.1 ( #143937 )
...
* Exactly the same as the previous fmtlib, except it fixes an edge case that could affect ABI compatibility between fmtlib versions.
* Seems safe to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143937
Approved by: https://github.com/albanD
2024-12-30 19:46:27 +00:00
2ed4d65af0
Update torch-xpu-ops commit pin ( #143853 )
...
Update the torch-xpu-ops commit to `214f33b9d9`, which includes:
- Fix building issue for transformer related operators
- Improve XPU operator coverage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143853
Approved by: https://github.com/EikanWang
2024-12-30 02:38:16 +00:00
e05bfb8ee3
[Submodule] Bump libfmt to 11.1.0 ( #143843 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143843
Approved by: https://github.com/Skylion007
2024-12-26 04:49:11 +00:00
b77406a9ec
[BE][CI] bump ruff to 0.8.4 ( #143753 )
...
Changes:
1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings.
3. Change arguments with the `__` prefix to positional-only arguments with the `/` separator in the function signature (see the sketch below).
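Illustrative before/after snippets for items 2 and 3 (the examples are mine, not taken from the PR diff):
```python
# Item 2: %-formatting -> f-string.
name, count = "ruff", 2
msg_old = "%s flagged %d files" % (name, count)  # before
msg_new = f"{name} flagged {count} files"        # after

# Item 3: `__`-prefixed arguments (positional-only by convention) become
# real positional-only parameters using PEP 570's `/` marker.
def clamp_old(__x, __lo, __hi):
    return max(__lo, min(__x, __hi))

def clamp_new(x, lo, hi, /):
    return max(lo, min(x, hi))
```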
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
2024-12-24 12:24:10 +00:00
94737e8a2a
[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend ( #134124 )
...
Description:
1. Quantize Linear Layer Weights to 4 bits:
   - Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
   - Pack two 4-bit weights into one uint8 container.
   - Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias:
   - After quantizing, obtain the quantized_weights, scales, and groupsize.
   - If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
   - Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
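# Step 3: one-time packing of the quantized weights, scales, and optional bias.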
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
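# Step 4: dynamically quantize the input and run the 4-bit matmul.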
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, in_features, and out_features. (A combined sketch follows.)
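Putting the four steps together, a minimal end-to-end sketch is shown below. The two op signatures come from the text above, but the int4 packing layout and the shapes/dtypes of `weight` and `scales_and_zeros` are my assumptions, and the underlying KleidiAI kernels target Arm CPUs, so treat this as illustrative rather than authoritative (see the API-usage issue linked below):
```python
import torch

in_features, out_features, groupsize = 64, 32, 32  # groupsize: multiple of 32
x = torch.randn(2, in_features)

# Steps 1-2 (assumed layout): two int4 values packed per uint8 byte, plus
# per-group scales; a real flow would derive these from a Linear layer.
weight = torch.randint(0, 256, (out_features, in_features // 2), dtype=torch.uint8)
scales_and_zeros = torch.rand(out_features, in_features // groupsize)
bias = None  # optional, taken from the Linear layer if present

# Step 3: pack weights, scales, and optional bias (signature from this PR).
packed = torch.ops.aten._dyn_quant_pack_4bit_weight(
    weight, scales_and_zeros, bias, groupsize, in_features, out_features
)

# Step 4: dynamically quantize `x` and multiply against the packed weights.
out = torch.ops.aten._dyn_quant_matmul_4bit(
    x, packed, groupsize, in_features, out_features
)
print(out.shape)  # expected: (2, out_features)
```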
API Usage: https://github.com/pytorch/pytorch/issues/143289
Model perf:
- 7B Transformer model: prefill 340 t/s, decode 40 t/s
- 2B Transformer model: prefill 747 t/s, decode 80 t/s
Tests:
- `python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight`: ran 1 test in 0.016s, OK
- `python test/test_linalg.py -k test__dyn_quant_matmul_4bit`: ran 8 tests in 0.077s, OK
- `python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit`: ran 8 tests in 11.454s
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai , https://github.com/malfet
2024-12-20 19:32:03 +00:00
2daa666591
update kineto to include the XPU Windows fix PR [submodule kineto] ( #143445 )
...
Includes the XPU Windows fix PR: https://github.com/pytorch/kineto/pull/1012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143445
Approved by: https://github.com/sraikund16
2024-12-20 05:57:30 +00:00