This PR changes the underlying functions to take `std::optional<Generator>&`, avoiding unnecessary copy and move operations. The torchgen code was updated to generate the new type.
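As a rough illustration of the kind of signature change involved (the names below are placeholders, not the actual ATen functions):
```
#include <optional>

// "Generator" stands in here for the real generator type, which is
// non-trivial to copy or move.
struct Generator { /* reference-counted generator state */ };

// Before: the optional is taken by value, so each call copies or moves it.
void bernoulli_impl_by_value(std::optional<Generator> gen);

// After: the optional is passed by reference, so no copy/move is required.
void bernoulli_impl_by_ref(std::optional<Generator>& gen);
```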
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
Summary: This ROCm-specific logic will make the aten-cpu code diverge between ROCm and CUDA. This is not good because we won't be able to share aten-cpu.so between ROCm and CUDA. More specifically, it will prevent us from building aten-hip by default and would require us to set up ROCm-specific rules, which is an extra burden for our build system.
Test Plan: sandcastle + oss ci
Differential Revision: D54453492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
Fixes #121093
Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad3d
```
To fix this, condition checks were added that raise a RuntimeError with a descriptive message specifying the required input dimensions.
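A minimal sketch of the kind of check added, assuming the usual `TORCH_CHECK` pattern (the exact condition and message in the PR may differ):
```
#include <ATen/core/Tensor.h>
#include <c10/util/Exception.h>

// Reject inputs with the wrong rank up front so the kernel never indexes
// out-of-range sizes; TORCH_CHECK surfaces as a RuntimeError in Python.
void check_replication_pad1d_input(const at::Tensor& input) {
  TORCH_CHECK(
      input.dim() == 2 || input.dim() == 3,
      "replication_pad1d expects a 2D (C, W) or 3D (N, C, W) input, but got ",
      input.dim(), " dimension(s)");
}
```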
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
Drop the `inline` qualifier from the `__global__` kernel in layer_norm_kernel.cu, since the qualifier is ignored according to:
```
[18/263] Building CUDA object
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121246
Approved by: https://github.com/eqy, https://github.com/malfet
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend-agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is CUDA numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first and then move them to CUDA later may silently
see worse accuracy even when cuDNN is installed,
because they end up using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy, which consolidates the
existing batch norm ops, will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
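A hedged sketch of the backend selection the new op performs, with illustrative names only (the real selection lives in the ATen dispatcher and its backend-selection code, not in a standalone helper like this):
```
// Illustrative only: pick the batch norm kernel based on device and on which
// vendor libraries are available, mirroring the hierarchy shown above.
enum class BatchNormKernel { Cpu, Cuda, Cudnn, Miopen };

BatchNormKernel pick_batch_norm_kernel(
    bool input_is_cuda, bool cudnn_available, bool miopen_available) {
  if (!input_is_cuda) {
    return BatchNormKernel::Cpu;    // _batch_norm_cpu
  }
  if (cudnn_available) {
    return BatchNormKernel::Cudnn;  // cudnn_batch_norm
  }
  if (miopen_available) {
    return BatchNormKernel::Miopen; // miopen_batch_norm
  }
  return BatchNormKernel::Cuda;     // _batch_norm_cuda
}
```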
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. That will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
Summary: Update the macros to avoid using `__grid_constant__` when compiling for devices newer than sm80 with a CUDA version older than 11.8.
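A rough sketch of the kind of guard described, assuming a wrapper macro (the macro name and exact condition are assumptions, not the ones in the PyTorch source):
```
// Only expand to __grid_constant__ when the combination is supported: skip it
// when targeting an arch above sm_80 while building with CUDA older than 11.8.
#if defined(__CUDA_ARCH__) && \
    (__CUDA_ARCH__ <= 800 || (defined(CUDA_VERSION) && CUDA_VERSION >= 11080))
#define MAYBE_GRID_CONSTANT __grid_constant__
#else
#define MAYBE_GRID_CONSTANT
#endif
```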
Test Plan: buck2 build --keep-going --config buck2.log_configured_graph_size=true --flagfile fbcode//mode/dev fbcode//sigrid/predictor/client/python:ig_sigrid_client_pybinding
Differential Revision: D54556796
Co-authored-by: Driss Guessous <drisspg@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121275
Approved by: https://github.com/drisspg
Summary: The QNNPACK library merge fails on some applications. This fix implements the Android build team's recommendation to prevent library merging for QNNPACK.
Test Plan:
1. Measure the binary size impact
1. Release build failed previously; now it should succeed
Differential Revision: D54048156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120676
Approved by: https://github.com/kimishpatel
Fixes #83149
There is a limitation of `TensorIterator` reductions:
a non-permuted input tensor is coalesced down to a 2-d tensor by `TensorIterator`, whereas the permuted case may become a >2-d operation (for example, two reduced dimensions and one non-reduced dimension).
Since the CPU reduction loop of `TensorIterator` only operates on two dimensions at a time, the intermediate sums get truncated to lower precision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108559
Approved by: https://github.com/mingfeima, https://github.com/peterbell10
`linalg_eigvals_out` calls into a dispatch stub, so it only supports CPU and CUDA
strided tensors, but it was incorrectly registered as a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op,
since not all types support out variants. Instead, I add a new helper,
`_linalg_eigvals`, which does the same thing as a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
Differential Revision: D54447700
## Context
This changeset updates Vulkan SPIR-V codegen to introduce a global SPIR-V shader registry and register shaders dynamically at static initialization time. This change makes it possible to define and link custom shader libraries to the ATen-Vulkan runtime.
Before:
* `gen_vulkan_spv.py` generated two files, `spv.h` and `spv.cpp`, which contained the definition and initialization of the Vulkan shader registry variables.
After:
* Introduce the `ShaderRegistry` class in `api/`, which encapsulates functionality of the `ShaderRegistry` class previously defined in the generated `spv.h` file
* Introduce a global shader registry (defined as a static variable in the `api::shader_registry()` function)
* Define a `ShaderRegisterInit` class (taking inspiration from `TorchLibraryInit`) that allows for dynamic shader registration, as sketched below
* `gen_vulkan_spv.py` now only generates `spv.cpp`, which defines a static `ShaderRegisterInit` instance that triggers registration of the compiled shaders to the global shader registry.
Benefits:
* Cleaner code base; we no longer have `ShaderRegistry` defined in a generated file, and don't need a separate implementation file (`impl/Registry.*`) to handle shader lookup. All that logic now lives under `api/ShaderRegistry.*`
* Makes it possible to compile and link separate shader libraries, providing similar flexibility as defining and linking custom ATen operators
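A minimal sketch of the registration pattern described above, under the assumption that registration just records a shader by name (the member functions, registration payload, and shader name are placeholders, not the actual `api::` interfaces):
```
#include <string>
#include <unordered_set>

// Placeholder registry: the real one stores shader binaries and metadata.
class ShaderRegistry {
 public:
  void register_shader(const std::string& name) {
    registered_.insert(name);
  }

 private:
  std::unordered_set<std::string> registered_;
};

// Global registry, constructed on first use (mirrors api::shader_registry()).
ShaderRegistry& shader_registry() {
  static ShaderRegistry registry;
  return registry;
}

// Runs a registration function at static initialization time, in the spirit
// of TorchLibraryInit: each shader library defines one static instance.
class ShaderRegisterInit {
 public:
  using InitFn = void (*)(ShaderRegistry&);
  explicit ShaderRegisterInit(InitFn fn) {
    fn(shader_registry());
  }
};

// What a generated spv.cpp might contain for a compiled shader library:
static const ShaderRegisterInit register_my_shaders([](ShaderRegistry& r) {
  r.register_shader("add_texture2d");
});
```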
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121088
Approved by: https://github.com/manuelcandales, https://github.com/jorgep31415
Differential Revision: D54487619
## Context
Allow the descriptor pool of an `api::Context` object to be initialized in a deferred fashion, instead of forcing initialization upon construction. This mode of operation will be used in the ExecuTorch Vulkan delegate, where the exact number of descriptor sets can be determined once the graph is built, instead of needing to "guess" an adequate amount.
## Implementation Details
* Check `config.descriptorPoolMaxSets > 0` to determine whether the descriptor pool should be initialized eagerly
* Introduce a `DescriptorPool::init()` function to trigger deferred initialization
* Introduce safeguards against using an uninitialized descriptor pool (see the sketch below)
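A hedged sketch of the deferred-initialization pattern described above; the class shape and member names are assumptions modeled on the bullet points, not the actual `api::` code:
```
#include <cstdint>
#include <stdexcept>

struct DescriptorPoolConfig {
  uint32_t descriptorPoolMaxSets = 0u;  // 0 signals "initialize later"
};

class DescriptorPool {
 public:
  explicit DescriptorPool(const DescriptorPoolConfig& config) {
    // Only initialize eagerly when the config carries a real size.
    if (config.descriptorPoolMaxSets > 0u) {
      init(config);
    }
  }

  // Called later, e.g. once the graph is built and the set count is known.
  void init(const DescriptorPoolConfig& config) {
    if (initialized_) {
      throw std::runtime_error("DescriptorPool is already initialized!");
    }
    // ... create the underlying descriptor pool with descriptorPoolMaxSets ...
    initialized_ = true;
  }

  void allocate_descriptor_set() {
    // Safeguard against using an uninitialized pool.
    if (!initialized_) {
      throw std::runtime_error("DescriptorPool has not been initialized!");
    }
    // ... allocate from the underlying descriptor pool ...
  }

 private:
  bool initialized_ = false;
};
```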
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121134
Approved by: https://github.com/manuelcandales
**Description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` handled the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variant, `onednn.qlinear_pointwise.tensor`, to support this case.
This feature is targeting the PyTorch 2.3 release.
**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```
**Performance before and after lowering `choose_qparam` to Inductor**
| Shape        | Latency before (ms) | Latency after (ms) |
|--------------|---------------------|--------------------|
| (32, 32)     | 0.151               | 0.049              |
| (128, 128)   | 0.153               | 0.052              |
| (1024, 1024) | 0.247               | 0.133              |
Test method: A module with a single Linear layer is dynamically quantized and lowered via X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Summary:
## Context
Introduce some simple fixes for bugs in the Vulkan Compute API that were causing errors on Android.
1. When using deferred allocation for image textures, it is undefined behaviour to create a `VkImageView` for a `VkImage` that has not yet been bound to memory. Fix this by creating the image view only after the `VkImage` has been bound to memory (see the sketch after this list).
2. When flushing the `api::Context`, the command pool is flushed but any current command buffers are not invalidated. This causes a segmentation fault if a command buffer is not submitted prior to calling `flush()`, because subsequent calls to `submit_*_job()` will use the old command buffer, which will have been freed when the command pool was flushed. To fix, invalidate any existing command buffers when calling `flush()`.
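A minimal sketch of the fix in (1), assuming a small wrapper class around the raw Vulkan handles (the class and member names are placeholders, not the actual `api::` types):
```
#include <vulkan/vulkan.h>
#include <cassert>

class ImageTexture {
 public:
  // Bind the (deferred-allocated) memory to the image first.
  void bind_memory(VkDevice device, VkDeviceMemory memory, VkDeviceSize offset) {
    vkBindImageMemory(device, image_, memory, offset);
    bound_ = true;
  }

  // Create the view lazily: doing this before the bind is undefined behaviour.
  VkImageView image_view(VkDevice device, const VkImageViewCreateInfo& info) {
    assert(bound_ && "image must be bound to memory before creating a view");
    if (view_ == VK_NULL_HANDLE) {
      vkCreateImageView(device, &info, nullptr, &view_);
    }
    return view_;
  }

 private:
  VkImage image_ = VK_NULL_HANDLE;
  VkImageView view_ = VK_NULL_HANDLE;
  bool bound_ = false;
};
```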
Test Plan:
Build the test binary for Android:
```
buck build --target-platforms=ovr_config//platform/android:arm64-fbsource -c ndk.custom_libcxx=false //xplat/caffe2:pt_vulkan_api_test_bin --show-output
```
Push and run the test binary on a local android phone.
Differential Revision: D54425370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121027
Approved by: https://github.com/mcr229, https://github.com/cbilgin