23 Commits

Author SHA1 Message Date
ab5086a7ae [WOQ] Add XPU kernel for _weight_int8pack_mm (#160938)
Summary:
This issue proposes implementing an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently supported only on CPU and CUDA.
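A hedged usage sketch (the call pattern and shapes below are assumptions based on the existing CPU/CUDA contract, not part of this PR):

```python
# Hypothetical example: calling the weight-only quantized matmul on XPU.
# Assumes activation [M, K], packed int8 weight [N, K], and per-channel
# scales [N], mirroring the CPU/CUDA behaviour of this op.
import torch

M, K, N = 4, 64, 32
x = torch.randn(M, K, dtype=torch.bfloat16, device="xpu")
w_int8 = torch.randint(-128, 127, (N, K), dtype=torch.int8, device="xpu")
scales = torch.rand(N, dtype=torch.bfloat16, device="xpu")

y = torch._weight_int8pack_mm(x, w_int8, scales)  # roughly x @ (w_int8 * scales).T
print(y.shape)  # torch.Size([4, 32])
```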

Motivation:
Same as https://github.com/pytorch/pytorch/pull/159325.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160938
Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
2025-09-19 07:37:14 +00:00
fd6a6658c3 Enable _int_mm on Intel GPU (#157769)
# Motivation

This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao.
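A hedged sketch of calling the op on XPU (shapes and dtype constraints are assumed to mirror the existing CPU/CUDA path):

```python
# Hypothetical example: int8 x int8 matmul producing an int32 result via _int_mm.
import torch

a = torch.randint(-128, 127, (32, 64), dtype=torch.int8, device="xpu")
b = torch.randint(-128, 127, (64, 32), dtype=torch.int8, device="xpu")
c = torch._int_mm(a, b)        # accumulates in int32
print(c.dtype, c.shape)        # torch.int32 torch.Size([32, 32])
```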

# Model Test Result:
We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8 dynamic quantization. The model configuration is as follows:
Precision : torch.bfloat16
quantization configuration : Int8DynamicActivationInt8WeightConfig
dataset : wikitext

Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.
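A hedged sketch of the quantization setup (a stand-in module replaces the actual Llama checkpoint; the config name is taken from the description above, and the torchao import path is assumed):

```python
# Hypothetical stand-in for the benchmarked model; the experiment above used
# meta-llama/Llama-3.1-8B-Instruct in torch.bfloat16 on wikitext.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
model = model.to(device="xpu", dtype=torch.bfloat16)

quantize_(model, Int8DynamicActivationInt8WeightConfig())  # int8 dynamic quantization
with torch.no_grad():
    out = model(torch.randn(8, 4096, device="xpu", dtype=torch.bfloat16))
```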

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-08-02 05:16:01 +00:00
3c05167489 [Intel GPU] fix matmul accuracy when offset > 0 (#154495)
This PR makes matmul tensors contiguous if they are not 64-byte aligned. oneDNN requires a minimum alignment of 64 bytes: https://uxlfoundation.github.io/oneDNN/dev_guide_c_and_cpp_apis.html#intel-r-processor-graphics-and-xe-architecture-graphics
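A hedged illustration of the situation this PR guards against (the alignment check and shapes are illustrative, not the exact backend code):

```python
# A view with a nonzero storage offset may start at an address that is not
# 64-byte aligned; copying it contiguous re-allocates aligned storage.
import torch

base = torch.randn(128, 128, device="xpu")
a = torch.randn(64, 128, device="xpu")
b = base[:, 1:]                 # 4-byte storage offset -> typically not 64-byte aligned
if b.data_ptr() % 64 != 0:      # the kind of check applied before calling oneDNN
    b = b.contiguous()          # freshly allocated, aligned memory
out = a @ b                     # (64, 128) @ (128, 127)
```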

Fixes https://github.com/intel/torch-xpu-ops/issues/1656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154495
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang
2025-05-30 09:53:51 +00:00
f12ce4e36b [Intel GPU] convolution fusion at XPU backend (#154202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154202
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/etaf
ghstack dependencies: #140365
2025-05-28 03:14:18 +00:00
0ddfd1ed43 [Intel GPU] Enable mkdnn._linear_pointwise at XPU backend (#140365)
# Motivation

This PR adds post-op fusion support for Linear. The linear-pointwise fusion is expected to be used in graph mode, e.g. torch.compile. The FusionUtils.cpp file defines utility APIs for generating primitive attributes; these APIs will also be used for conv-pointwise fusion in #140372.
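A hedged sketch of the graph-mode path this fusion targets (whether the fused kernel is actually selected depends on the Inductor configuration):

```python
# Hypothetical example: Linear followed by a pointwise op, compiled so the
# pattern matcher can lower it to the fused linear-pointwise kernel on XPU.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.GELU()).to("xpu")
x = torch.randn(32, 128, device="xpu")

compiled = torch.compile(model)
with torch.no_grad():
    y = compiled(x)
```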

# Validation
```bash
   python test/xpu/test_fusion.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140365
Approved by: https://github.com/etaf, https://github.com/guangyey, https://github.com/EikanWang
2025-05-27 15:57:15 +00:00
616e20b4f1 [Intel GPU] scalar tensor case handling in addmm, baddmm (#153051)
# Motivation
This PR adds scalar-tensor handling (`t.numel() == 1`, `t.sizes().empty() == true`, `t.dim() == 0`) in addmm and baddbmm. The issue was found while upgrading oneDNN: the new version requires the post-op binary (`self` in addmm) to have the same number of dimensions as the output tensor, so we now need to explicitly expand the shape of the `self` tensor. The former oneDNN version did the broadcasting internally.

This PR could fix issues in https://github.com/intel/torch-xpu-ops/issues/1612 and CI error in https://github.com/pytorch/pytorch/pull/151767.

# Implementation
We treat the scalar tensor as a normal tensor by `unsqueeze`-ing it into a 1-dimensional tensor. Combined with the existing shape-handling logic, it is then further `unsqueeze`-d to a 2D or 3D shape.
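A hedged illustration of the handled case and of the expansion described above (the expansion loop paraphrases the idea, it is not the actual backend code):

```python
import torch

self_scalar = torch.tensor(0.5, device="xpu")   # dim() == 0, one element
mat1 = torch.randn(8, 16, device="xpu")
mat2 = torch.randn(16, 32, device="xpu")

out = torch.addmm(self_scalar, mat1, mat2)       # 0.5 + mat1 @ mat2

# Roughly what the backend now does before handing `self` to oneDNN as a
# binary post-op: unsqueeze the scalar until it matches the output's rank.
expanded = self_scalar
while expanded.dim() < out.dim():
    expanded = expanded.unsqueeze(0)             # 0-D -> 1-D -> 2-D (3-D for baddbmm)
```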

# UT testing
```
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmm_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmv_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float16
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153051
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-05-21 12:24:37 +00:00
9f98e37eb4 [Intel GPU] add tf32 support for matmul on XPU (#144240)
Support XPU tf32 matmul using `torch.backends.mkldnn.allow_tf32`; we will discuss in the future whether we need a new API to control matmul only.
~~Support xpu tf32 matmul using torch.set_float32_matmul_precision. For conv, check https://github.com/pytorch/pytorch/pull/137570~~
~~We decided not to follow torch.backends.cuda.matmul.allow_tf32 because that API actually calls setAllowTF32CuBLAS to set matmul_precision to high. We also avoid other related tf32 changes (i.e. in Inductor) by not introducing a new API.~~
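A hedged usage sketch of the flag on XPU matmul (the flag name is the one cited above; whether a given kernel actually uses TF32 depends on the oneDNN dispatch):

```python
import torch

torch.backends.mkldnn.allow_tf32 = True    # allow TF32 math in oneDNN kernels
a = torch.randn(1024, 1024, device="xpu")
b = torch.randn(1024, 1024, device="xpu")
c = a @ b                                  # may run with fpmath_mode=tf32

torch.backends.mkldnn.allow_tf32 = False   # back to strict fp32
```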

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144240
Approved by: https://github.com/EikanWang
2025-05-13 14:03:01 +00:00
e06a08059a Add device guard for xpu conv on multi device (#153067)
# Motivation
fixes https://github.com/pytorch/pytorch/issues/153022
The root cause is that the XPU backend registers the convolution op using `m.impl`, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.

# Additional Context
Run the following script:
```python
import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)

with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")
```
The output is
```bash
         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153067
Approved by: https://github.com/albanD, https://github.com/EikanWang
2025-05-09 09:41:51 +00:00
52d172eafd Facilitate at::_weight_int4pack_mm_with_scale_and_zeros related registration (#147962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147962
Approved by: https://github.com/jerryzh168, https://github.com/guangyey, https://github.com/EikanWang
ghstack dependencies: #137566

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-04-08 15:36:07 +00:00
ae5f7fec82 [Intel GPU] Enable fp64 GEMM (#140677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140677
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/desertfire
2025-02-17 08:15:55 +00:00
ae351d4d0e [Intel GPU] allow_tf32 for oneDNN backend - XPU part (#137570)
# Motivation
Add the context variable `torch.backends.mkldnn.allow_tf32` to control tf32 computation in convolution kernels on the XPU side. The tf32 data type helps improve the performance of deep learning workloads during training/inference. This PR uses the [oneDNN fpmath_mode API](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#the-floating-point-math-mode-attribute) to trigger tf32 acceleration in convolution kernels.
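A hedged usage sketch (the conv shape matches the first primitive in the verbose log below; whether TF32 is actually used depends on the oneDNN dispatch):

```python
import torch
import torch.nn as nn

torch.backends.mkldnn.allow_tf32 = True              # enable TF32 for oneDNN conv
conv = nn.Conv2d(16, 33, kernel_size=3, stride=2).to("xpu")
x = torch.randn(20, 16, 50, 100, device="xpu")
y = conv(x)                                          # expect attr-fpmath:tf32 in verbose
```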

# Validation
* UT to test the context variable
`python test/xpu/test_conv.py -k test_mkldnn_allow_tf32_get_set`

* Runtime exemplification
```
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.649902
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.151855
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_data,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_undef::undef::: dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.167969
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.26709
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.219971

```
According to the `fpmath:tf32` field in the verbose output, the current context setting successfully triggers tf32 computation in conv forward/backward_data/backward_weights kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137570
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2025-02-17 01:46:43 +00:00
ade5af9430 [XPU] Align XPU convolution_backward output layout between fake tensor and real output tensor. (#146880)
Fix #146879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146880
Approved by: https://github.com/EikanWang, https://github.com/jansel
ghstack dependencies: #146762, #145248
2025-02-14 01:39:18 +00:00
01554c7b5a fix incorrect literal strings / accidental tuples (#146037)
* `expr,` is short for `(expr,)`
* literal strings spanning multiple lines need to escape the newline with `\` or be wrapped in `(...)` (see the sketch below).
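A hedged illustration of both patterns (hypothetical snippets, not code from this PR):

```python
# 1. A stray trailing comma turns the value into a one-element tuple:
msg = "something went wrong",        # msg is ("something went wrong",), not a str
msg = "something went wrong"         # intended

# 2. A string literal cannot simply continue on the next line; escape the
#    newline or use implicit concatenation inside parentheses:
msg = ("first half of a long message, "
       "second half of the message")
```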

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146037
Approved by: https://github.com/Skylion007
2025-02-03 15:08:11 +00:00
5aa5a5763e [inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684)
Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by:

1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format.
2. Using that function to explicitly disable TF32 generation when calling Triton, where needed.

To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` without this fix on a GPU with CUDA compute capability < 8 (e.g. a pre-Ampere NVIDIA consumer GPU).
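A hedged sketch of the capability check described in item 1 (the helper name added by the PR is not reproduced here; `get_device_capability` is the underlying query):

```python
import torch

def _tf32_capable(device=None):
    # TF32 tensor cores were introduced with Ampere (compute capability 8.0).
    major, _minor = torch.cuda.get_device_capability(device)
    return major >= 8

# Only allow TF32 code generation when both the user setting and the hardware permit it.
allow_tf32 = torch.backends.cuda.matmul.allow_tf32 and _tf32_capable()
```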

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684
Approved by: https://github.com/eqy
2025-01-28 22:01:08 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
24dee99cb7 Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`).
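A hedged sketch of the idea (not the actual implementation in torch/__init__.py):

```python
import sys
import types

def _register_submodules(module: types.ModuleType, prefix: str) -> None:
    # Walk attributes of the C extension and publish each submodule under its
    # dotted name so `import torch._C._dynamo.eval_frame` resolves.
    for name in dir(module):
        attr = getattr(module, name)
        if isinstance(attr, types.ModuleType):
            full_name = f"{prefix}.{name}"
            if full_name not in sys.modules:
                sys.modules[full_name] = attr
                _register_submodules(attr, full_name)   # recurse into nested submodules
```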

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-08 10:20:25 +00:00
ff81ca8e0c Revert "Populate submodules of torch._C to sys.modules recursively (#132216)"
This reverts commit 672ce4610e41386da9763e07375b0879dc351905.

Reverted https://github.com/pytorch/pytorch/pull/132216 on behalf of https://github.com/PaliC due to was breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/132216#issuecomment-2274112397))
2024-08-07 18:45:00 +00:00
672ce4610e Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-01 12:04:59 +00:00
548c460bf1 [BE][Easy][7/19] enforce style for empty lines in import segments in test/[a-c]*/ and test/[q-z]*/ (#129758)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129758
Approved by: https://github.com/ezyang
2024-07-31 10:54:03 +00:00
c6aa03bd4e Add allow_xpu to enable XPU UTs (#130312)
# Motivation
Enable UTs under the test/xpu/ folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130312
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-07-17 02:40:28 +00:00
36af9c0d7d [Aten] Fix XPU convolution_overrideable input memory format. (#124841)
[Aten] Fix convolution_overrideable input memory format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124841
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-04-26 15:55:01 +00:00
ad7b5d32b6 Intel GPU oneDNN Upstreaming: Convolution operators support (#117529)
# Motivation

This PR is a part of RFC #114848. This PR depends on oneDNN compilation in #117098, basic integration support in #117112, and Conv integration code in #117512. Some runtime support is needed from #116019.

This PR implements the convolution and deconvolution operators for XPU that should be defined in the `aten` libraries. Backward computation is also supported.

With this PR, the conv-related operators should be functionally ready.
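A hedged smoke test of the newly enabled path (forward plus backward), assuming an XPU device is available:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1).to("xpu")
x = torch.randn(2, 3, 32, 32, device="xpu", requires_grad=True)
y = conv(x)
y.sum().backward()           # exercises the backward kernels added here
print(x.grad.shape)          # torch.Size([2, 3, 32, 32])
```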

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117529
Approved by: https://github.com/EikanWang, https://github.com/malfet
ghstack dependencies: #117512
2024-04-22 13:22:36 +00:00
9875a834e4 [Intel GPU] oneDNN GPU GEMM support (#117202)
# Motivation

This PR is a part of RFC #114848 and is a successor of #116249 and #116019. It depends on oneDNN compilation in #116249; some runtime support is needed from #116019.

ATen operators like `addmm` and `baddbmm` are defined in `Blas.cpp` under `aten/src/ATen/native/mkldnn/xpu/`.
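A hedged smoke test of the GEMM operators mentioned above, assuming an XPU device is available:

```python
import torch

bias = torch.randn(32, device="xpu")
a = torch.randn(8, 16, device="xpu")
b = torch.randn(16, 32, device="xpu")
out = torch.addmm(bias, a, b)                        # bias + a @ b

batch_bias = torch.randn(4, 8, 32, device="xpu")
batch_a = torch.randn(4, 8, 16, device="xpu")
batch_b = torch.randn(4, 16, 32, device="xpu")
batch_out = torch.baddbmm(batch_bias, batch_a, batch_b)
```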

Alongside the files providing the core functionality, `BlasImpl.h`, `Utils.h`, and other files provide basic utilities for them. For instance, `Utils.h` provides common memory-descriptor query utilities for `Matmul.h`, and these utility functions will also be used by other primitives, like `convolution`. `BlasImpl.h` is a header file that provides helpers for shape-info processing in matmul-related operators. It helps not only basic GEMM operators like `addmm` and `baddbmm`, but also fusion operators used in `torch.compile`, like `linear_pointwise` in #117824.

In the next stage, we will continue to complete the oneDNN support by enabling the `matmul fusion` and `convolution` related code.

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117202
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117098, #117112
2024-04-17 23:06:38 +00:00