We received reports that the AOTriton kernels mishandle the bias pointer, causing NaNs during fine-tuning of the llama3.2-11b vision model. This PR fixes the problem.
Note: AOTriton 0.8.1b adds support for head dimension 512, which increases the binary size, but this support is considered experimental and will not be enabled right now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145508
Approved by: https://github.com/jeffdaily
According to the [APL documentation](https://developer.arm.com/documentation/101004/2404/General-information/Arm-Performance-Libraries-example-programs), libraries ending with `_mp` are OpenMP multi-threaded libraries.
When a project is compiled with MSVC and the `-openmp` flag, the vcomp library (the Visual C++ implementation of OpenMP) is used for runtime calls.
However, the current APL implementation uses the `libomp.dll` (LLVM) variant.
As a result, there are unexpected behaviors at runtime.
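One quick way to see which OpenMP runtime and thread settings PyTorch picked up at runtime (a diagnostic sketch, not part of this PR) is shown below:
```python
import torch

# Prints ATen/Parallel build and runtime info, including the OpenMP
# runtime in use and the intra-op thread count; on an affected build
# the numbers here disagree with what OpenMP actually provides.
print(torch.__config__.parallel_info())
print("intra-op threads:", torch.get_num_threads())
```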
---
For example:
```python
import torch
# Create a sparse tensor
# Input (Sparse Tensor):
# [[0, 1],
# [1, 0]]
indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1, 1], dtype=torch.float32)
size = torch.Size([2, 2])
sparse_tensor = torch.sparse_coo_tensor(indices, values, size)
# Convert sparse tensor to dense tensor
dense_tensor = sparse_tensor.to_dense()
# Expected Output (Dense Tensor):
# [[0, 1],
# [1, 0]]
print("\nDense Tensor:")
print(dense_tensor)
```
However, it prints unexpected outputs such as:
```python
# [[0, 11],
# [10, 0]]
```
The issue arises because the following code does not function as expected at runtime:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h#L30
```c++
// Returns 1; however, since OpenMP is enabled, it should return the total number of threads
int64_t num_threads = omp_get_num_threads();
```
---
At runtime, loading multiple OpenMP runtimes (in this case `libomp` and `vcomp`) causes unexpected behaviors.
So, we've changed the libraries from the `_mp` to the non-`_mp` versions, and we use `vcomp` for OpenMP calls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145215
Approved by: https://github.com/ozanMSFT, https://github.com/malfet
Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
Useful for code reuse in the Metal shader build, both for eager mode and MPSInductor, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses and embeds the headers into the shader to be used in dynamic compilation.
Tested using:
- `TestMetalLibrary.test_metal_include`
- Moving the `i0`/`i1` implementations to `c10/util/metal_special_math.h` and calling them from the `SpecialOps.metal` shader, which now looks much more compact:
```metal
template <typename T, typename Tout = T>
void kernel
i0(constant T* input,
device Tout* output,
uint index [[thread_position_in_grid]]) {
output[index] = c10::i0(static_cast<Tout>(input[index]));
}
```
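For a quick smoke test of the resulting op (a sketch, assuming a build where MPS is available), one can compare against the CPU implementation:
```python
import torch

# Exercises the bundled i0 Metal kernel through the public API and
# checks it against the CPU result.
if torch.backends.mps.is_available():
    x = torch.linspace(0.0, 4.0, 16)
    print(torch.allclose(torch.special.i0(x.to("mps")).cpu(),
                         torch.special.i0(x), atol=1e-5))
```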
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087
Approved by: https://github.com/dcci
ghstack dependencies: #145023
We no longer need `install_aotriton.sh` and `aotriton_version.txt` since `aotriton.cmake` now installs the best binary release package as the default option when building PyTorch.
This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.
Also, with the changes in this PR, the cmake build process now handles copying the aotriton `.so` and images directory from `torch/lib` to the installation path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Replaces https://github.com/pytorch/pytorch/pull/138947 for re-import.
Replaces https://github.com/ROCm/pytorch/pull/1592
This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in `"aotriton"` or `"default"`. As you'd expect, not setting this option will result in aotriton being used as the backend. In the case of CK, if pytorch deems flash attention usable, it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e., flash attention vs. memory-efficient attention vs. math, etc.). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and selected at runtime by the existing heuristics.
Files located in `pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha*` have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of @tridao's hard work; he is credited as a co-author.
NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.
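For illustration, a minimal sketch of forcing the CK path (assuming a ROCm build with USE_CK_FLASH_ATTENTION=1 and an input for which the existing heuristics pick flash attention):
```python
import torch
import torch.nn.functional as F

# Select the composable_kernel flash-attention backend; passing
# "aotriton" or "default" restores the incumbent implementation.
torch.backends.cuda.preferred_rocm_fa_library("ck")

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
# Dispatches to the CK kernels wherever aotriton would have been used.
out = F.scaled_dot_product_attention(q, k, v)
```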
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
I experienced an error while doing a DEBUG build of PyTorch on ROCm:
```
additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
```
Based on discussions in the similar issue #138427, I fixed it by adding `--offload-compress` to the `HIP_HIPCC_FLAGS`, which let the DEBUG build succeed on my node.
I further updated the PR to enable the flag for non-DEBUG builds as well, due to the size reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143986
Approved by: https://github.com/jeffdaily
Currently, the roctracer is not available for Windows. This PR disables any mentions of its usage on Windows and creates dummy functions for Windows to keep compatibility with existing code; these warn the user that roctracer is not available on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143329
Approved by: https://github.com/sraikund16
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, and the in_features and out_features.
API Usage: https://github.com/pytorch/pytorch/issues/143289
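Putting the four steps together, a minimal end-to-end sketch (the symmetric-quantization helper, shapes, and exact tensor layouts expected by the packing op are illustrative assumptions based on the description above):
```python
import torch

def quantize_4bit_symmetric(w, groupsize):
    # Illustrative helper (not a PyTorch API): group-wise symmetric
    # quantization of an [out_features, in_features] weight to 4 bits,
    # packing two 4-bit values into one uint8 container.
    out_features, in_features = w.shape
    groups = w.reshape(out_features, -1, groupsize)
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scales), -8, 7) + 8  # shift to [0, 15]
    q = q.to(torch.uint8).reshape(out_features, in_features)
    packed = (q[:, 1::2] << 4) | q[:, ::2]  # two nibbles per byte
    return packed, scales.reshape(out_features, -1)

in_features, out_features, groupsize = 128, 64, 32  # groupsize: multiple of 32
linear = torch.nn.Linear(in_features, out_features)
weights_4bit, scales = quantize_4bit_symmetric(linear.weight.detach(), groupsize)

packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
    weights_4bit, scales, linear.bias.detach(), groupsize, in_features, out_features)
x = torch.randn(2, in_features)
output = torch.ops.aten._dyn_quant_matmul_4bit(
    x, packed_weights, groupsize, in_features, out_features)
print(output.shape)  # expected: torch.Size([2, 64])
```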
Model Perf:
7B Transformer model:
Prefill: 340 t/s
Decode: 40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode: 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK
python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
This PR introduces support for finding HIP-SDK Libraries on Windows.
Since reading the code changes in the diff view is a bit cumbersome due to the introduced if branch, let me explain what was changed:
- The Linux-specific steps to find HIP packages have been moved into the `if(UNIX)` block
- Windows steps follow in the `else()` clause
The separation was needed because of several factors:
- HIP SDK for Windows typically uses `hip` in its component names (for example: `hip_version.h` instead of `rocm_version.h`, `HIP_VERSION_DEV_MAJOR` instead of `ROCM_VERSION_DEV_MAJOR`, etc.)
- The libraries included in HIP SDK are only a subset of what is available in Linux ROCm (missing hsa-rt, rccl, roctx)
- MIOpen isn't part of the HIP SDK, but it can be built separately; as of now it requires an additional path to be defined using an env var.
- Windows can only find the hip package (in a version greater than 1.0) and its libraries if the lowercase `find_package(hip ...)` is invoked first. This is because the lowercase `hip` name causes the mechanism to find hip's packages using [config mode](https://cmake.org/cmake/help/latest/command/find_package.html#search-modes), which is the only one supported on Windows, assuming we also want to [include its libraries](https://rocm.docs.amd.com/en/latest/conceptual/cmake-packages.html#consuming-the-hip-api-in-c-code). The upper-case, module-mode-searched `find_package(HIP)` is used later for the inclusion of macros such as `hip_add_library` and related macros.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137279
Approved by: https://github.com/jeffdaily
Using the same `tools/generate_torch_version.py` script
It's already available at the Python level, but not at the C++ one.
Please note that updating the commit hash will force recompilation of fewer than 10 files, according to:
```
% touch caffe2/core/macros.h; ninja -d explain -j1 -v -n torch_python
ninja explain: output caffe2/torch/CMakeFiles/gen_torch_version doesn't exist
ninja explain: caffe2/torch/CMakeFiles/gen_torch_version is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/version.py is dirty
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546390618881 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546233600752 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546651089243 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546224176845 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546464535054 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301550062608920 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301547538843492 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o is dirty
```
Differential Revision: [D66468257](https://our.internmc.facebook.com/intern/diff/D66468257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141313
Approved by: https://github.com/ezyang
Leverage the existing FindGloo CMake module to locate the system's library and headers. Add the system's gloo headers to the include path rather than the gloo from third_party when `USE_SYSTEM_GLOO` is specified.
Fixes #140274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140275
Approved by: https://github.com/malfet
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.
# Additional Context
The details are described below. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, the sycl library is named `sycl7.lib` with the old compiler but `sycl.lib` with the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` with the old compiler, but we can link `libsycl.so` with the new compiler and get the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that our new code keeps good backward compatibility with the old compiler. The new features (Event elapsed_time, memory summary, and device architecture property) introduced by the new compiler are now gated by the `SYCL_COMPILER_VERSION` macro.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.
Motivation: There are two parts to the aten ops for XPU: one is the in-tree ops, like the GEMM-related ops, and the other is the out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
The PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently, more and more often, one had to implement custom kernels that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first metal kernel in the MPS backend was added in https://github.com/pytorch/pytorch/pull/82307).
Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see https://github.com/pytorch/pytorch/pull/125550).
But as the number of kernels keeps growing, it's time to take the next step and properly compile them into a `.metallib`.
This PR does exactly that by:
- Moving shader sources into separate `.metal` files
- Adding a check on whether full Xcode is installed or just the DeveloperTools CLI
- If full Xcode is installed, compiling and linking the shaders into a `.metallib` for the Metal 3.0 (available on macOS 13) and Metal 3.1 (available on macOS 14, can use bfloat) standards, and bundling both using the `-sectcreate` linker option and the `getsectiondata` API call. The `metallib_dummy.cpp` file is used to properly express dependencies between the metallib build and torch_cpu link stages. Logic for generating the metal libraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
- If only the DeveloperTools CLI is installed, automatically wrapping the `.metal` file into a `_metallib.h` that contains the shader source wrapped in a `MetalShaderLibrary`
The bulk of the changes introduced in this PR are just moving code around. I.e., for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder, and the embedded shader definition is replaced with the following:
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```
Some historical stats:
| PyTorch Version | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12 | 0 | |
| 1.13 | 2 | bitwise_ops and index.out |
| 2.0 | 4 | cross, repeat and view |
| 2.1 | 9 | unary_ops, histogram, renorm, binary_ops |
| 2.2 | 11 | gamma and bucketization |
| 2.3 | 12 | naive_matmul (to work around a crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |
Pros:
- Better code structure/readability
- Eventually allows one to use shared headers (and implement something like `TensorIterator`)
- Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels
Cons:
- Build process is a bit more complicated than it used to be
- Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
We did a lot of optimization for PyTorch Windows and made good progress. But some models still show a performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, we found that `ResNet50` is a typical case.
Let's focus on `ResNet50` and collect a profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
model_inference 3.91% 682.427ms 100.00% 17.448s 17.448s 1
aten::conv2d 0.18% 30.906ms 64.79% 11.305s 2.133ms 5300
aten::convolution 0.45% 78.031ms 64.62% 11.275s 2.127ms 5300
aten::_convolution 0.30% 51.670ms 64.17% 11.196s 2.113ms 5300
aten::mkldnn_convolution 63.58% 11.093s 63.87% 11.145s 2.103ms 5300
aten::batch_norm 0.13% 23.536ms 20.10% 3.506s 661.580us 5300
aten::_batch_norm_impl_index 0.28% 49.486ms 19.96% 3.483s 657.139us 5300
aten::native_batch_norm 19.26% 3.360s 19.64% 3.427s 646.615us 5300
aten::max_pool2d 0.01% 1.038ms 5.84% 1.018s 10.181ms 100
aten::max_pool2d_with_indices 5.83% 1.017s 5.83% 1.017s 10.171ms 100
aten::add_ 3.38% 588.907ms 3.38% 588.907ms 85.349us 6900
aten::relu_ 0.35% 60.358ms 1.67% 292.155ms 59.624us 4900
aten::clamp_min_ 1.33% 231.797ms 1.33% 231.797ms 47.306us 4900
aten::empty 0.46% 80.195ms 0.46% 80.195ms 1.513us 53000
aten::linear 0.01% 927.300us 0.23% 39.353ms 393.532us 100
aten::addmm 0.20% 35.379ms 0.21% 37.016ms 370.155us 100
aten::empty_like 0.12% 20.455ms 0.17% 29.976ms 5.656us 5300
aten::as_strided_ 0.11% 18.830ms 0.11% 18.830ms 3.553us 5300
aten::adaptive_avg_pool2d 0.00% 419.900us 0.08% 14.265ms 142.647us 100
aten::mean 0.01% 1.737ms 0.08% 13.845ms 138.448us 100
aten::sum 0.05% 8.113ms 0.05% 8.648ms 86.479us 100
aten::resize_ 0.03% 5.182ms 0.03% 5.182ms 0.978us 5300
aten::div_ 0.01% 1.445ms 0.02% 3.460ms 34.600us 100
aten::to 0.00% 337.000us 0.01% 2.015ms 20.154us 100
aten::_to_copy 0.01% 977.500us 0.01% 1.678ms 16.784us 100
aten::copy_ 0.01% 1.474ms 0.01% 1.474ms 7.371us 200
aten::t 0.00% 775.900us 0.01% 1.410ms 14.104us 100
aten::flatten 0.00% 420.900us 0.01% 1.311ms 13.106us 100
aten::view 0.01% 889.700us 0.01% 889.700us 8.897us 100
aten::transpose 0.00% 410.700us 0.00% 634.500us 6.345us 100
aten::expand 0.00% 496.800us 0.00% 566.800us 5.668us 100
aten::fill_ 0.00% 534.800us 0.00% 534.800us 5.348us 100
aten::as_strided 0.00% 293.800us 0.00% 293.800us 1.469us 200
aten::empty_strided 0.00% 241.700us 0.00% 241.700us 2.417us 100
aten::resolve_conj 0.00% 54.800us 0.00% 54.800us 0.274us 200
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 17.448s
Execution time: 20.02380895614624
```
We found that the major kernel consuming CPU resources is `aten::mkldnn_convolution`, which is dispatched to `MKLDNN`.
Actually, we had already optimized memory allocation by integrating mimalloc into the PyTorch C10 module. That helped boost PyTorch Windows a lot, but it does not cover `MKL`'s and `MKLDNN`'s intermediate temporary memory.
We still have the potential to improve PyTorch Windows performance by optimizing `MKL`'s and `MKLDNN`'s intermediate temporary memory.
So I discussed with the Intel MKL team and got a method to register a high-performance memory allocation API with MKL, which helps MKL boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html
This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:
1. Add a cmake option, `USE_MIMALLOC_ON_MKL`; it is a sub-option of `USE_MIMALLOC`.
2. Wrap and export the mi_malloc APIs in C10 when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register the allocation APIs with MKL when `USE_MIMALLOC_ON_MKL` is `ON`.
For `oneDNN`, this is still being tracked in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
## Summary
We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages.
This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior).
## Test Plan
The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287
Approved by: https://github.com/albanD