Commit Graph

14022 Commits

862b99b571 Revert "[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)"
This reverts commit 3239f86a3df133b5977d988324639e0de7af8749.

Reverted https://github.com/pytorch/pytorch/pull/120925 on behalf of https://github.com/malfet due to Breaks internal tests, likely due to the increased memory requirements ([comment](https://github.com/pytorch/pytorch/pull/120925#issuecomment-1983875400))
2024-03-07 16:16:07 +00:00
cyy
3aa512cd72 [Clang-tidy header][23/N] Enable clang-tidy coverage on aten/src/ATen/*.{cpp,h} (#121380)
This PR finishes the work begun in https://github.com/pytorch/pytorch/pull/120763 by enabling clang-tidy on aten/src/ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121380
Approved by: https://github.com/Skylion007
2024-03-07 15:11:07 +00:00
cyy
4305c64fea Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR changes the underlying functions to take const std::optional&lt;Generator&gt;&amp; in order to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.
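
Since only the underlying C++ signature changes, Python callers should be unaffected; a minimal sketch of the unchanged Python-side usage:

```
import torch

# Passing a Generator from Python is unaffected; only the C++ signature of
# the underlying functions now takes const std::optional<Generator>&.
g = torch.Generator().manual_seed(0)
x = torch.randn(4, generator=g)
```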

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-07 09:52:21 +00:00
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00
291ce86a6c Modify StorageImplCreateHelper (#118459)
I want to use tensor.untyped_storage()[a:b] for the ``PrivateUse1`` backend but it fails. The code goes into ``THPStorage_get``:
bb6eba189f/torch/csrc/Storage.cpp (L525-L540)

Here ``torch`` creates a new ``c10::StorageImpl`` without considering the ``PrivateUse1`` backend.
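
A minimal sketch of the motivating usage on CPU (the ``PrivateUse1`` path is what this PR fixes; shapes are illustrative):

```
import torch

# Slicing an untyped storage goes through THPStorage_get, which builds a new
# c10::StorageImpl; before this PR that path did not use the custom
# StorageImpl creation helper registered for the PrivateUse1 backend.
t = torch.arange(8, dtype=torch.uint8)
s = t.untyped_storage()
sub = s[2:6]           # a new storage viewing bytes 2..5
print(sub.nbytes())    # 4
```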

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118459
Approved by: https://github.com/albanD
2024-03-07 06:26:55 +00:00
967dd31621 [cuDNN] Cleanup cuDNN < 8.1 ifdefs (#120862)
Follow-up of #95722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120862
Approved by: https://github.com/Skylion007
2024-03-07 01:46:25 +00:00
cyy
5cc511f72f Use c10::irange and fix other index types in ForeachReduceOp.cu (#121123)
This PR follows the suggestions in #121066 and changes most loops to c10::irange.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121123
Approved by: https://github.com/soulitzer
2024-03-07 00:11:27 +00:00
c268ce4a6d Make ATen-cpu cuda/rocm agnostic (#121082)
Summary: This rocm-specific logic makes the aten-cpu code diverge between rocm and cuda, which is undesirable because we then cannot share aten-cpu.so between them. More specifically, it prevents us from building aten-hip by default and requires us to set up rocm-specific rules, an extra burden for our build system.

Test Plan: sandcastle + oss ci

Differential Revision: D54453492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
2024-03-06 23:51:40 +00:00
69cedc16c5 Add padding dimension checks and tests (#121298)
Fixes #121093

Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad2d, torch._C._nn.replication_pad3d
```

To fix this, condition checks were added that raise a runtime error with a message specifying the required dimensions instead.
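
A hypothetical repro sketch of the new behavior (the exact invalid call is illustrative, not taken from the issue):

```
import torch

# replication_pad3d expects a 4-D or 5-D input; before this PR a call with
# mismatched dimensions could segfault, and with the added checks it raises
# a RuntimeError describing the required dimensions instead.
x = torch.randn(2, 2)  # too few dimensions for a 3-D padding op
try:
    torch._C._nn.replication_pad3d(x, (1, 1, 1, 1, 1, 1))
except RuntimeError as e:
    print("caught:", e)
```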
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
2024-03-06 21:55:34 +00:00
cyy
5a2527db22 [Clang-tidy header][22/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#121102)
This PR continues to fix clang-tidy warnings in aten/src/ATEN/*, following #120763.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121102
Approved by: https://github.com/Skylion007
2024-03-06 18:36:31 +00:00
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5b7d4aa412a3988d3d91ca4c5679303.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
a427d90411 add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The baseline perf measured on Intel(R) Xeon(R) CPU Max 9480, single socket (56 cores), is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-06 16:25:53 +00:00
099ff51d45 torch check the division by zero in batch_norm_update_stats (#120882)
Fixes #120803

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120882
Approved by: https://github.com/CaoE, https://github.com/malfet
2024-03-06 05:40:21 +00:00
2eec0e7c5f [BE] Remove __inline__ from __global__ (#121246)
in layer_norm_kernel.cu since the qualifier seems to be ignored according to:

```
[18/263] Building CUDA object
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function

Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"

/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function

Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121246
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-06 05:16:52 +00:00
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
f72eb5ae4c __grid__constant is only supported on cuda version >= 11.8 (#121275)
Summary: Update the macros to avoid using __grid__constant when compiling for devices > sm80 with a cuda version < 11.8.

Test Plan: buck2 build --keep-going --config buck2.log_configured_graph_size=true --flagfile fbcode//mode/dev fbcode//sigrid/predictor/client/python:ig_sigrid_client_pybinding

Differential Revision: D54556796

Co-authored-by: Driss Guessous <drisspg@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121275
Approved by: https://github.com/drisspg
2024-03-06 03:44:59 +00:00
ce6a7d56fc Don't merge qnnpack (#120676)
Summary: The qnnpack library merge fails for some applications. This fix implements the Android build team's recommendation to prevent library merging for qnnpack.

Test Plan:
1. Measure the binary size impact
2. Release build failed previously; now it should succeed

Differential Revision: D54048156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120676
Approved by: https://github.com/kimishpatel
2024-03-06 01:42:13 +00:00
412c687e2e Fix permuted sum precision issue for lower precision on CPU (#108559)
Fixes #83149
There is a limitation of `TensorIterator` reductions:
The non-permuted input tensor is coalesced down to a 2-d tensor by `TensorIterator`, whereas the permuted case may become a >2d operation (for example, two reduced dimensions and a non-reduced dimension).
Since the CPU reduction loop of `TensorIterator` only operates on two dimensions at a time, the intermediate sums are truncated to the lower precision.
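
An illustrative sketch of the issue (shapes and the exact error magnitude are illustrative, not from the PR):

```
import torch

# Summing a permuted bfloat16 tensor on CPU previously truncated the
# intermediate sums to bfloat16; compare against the same reduction
# computed in double precision.
x = torch.randn(32, 64, 128, dtype=torch.bfloat16)
ref = x.double().permute(2, 0, 1).sum(dim=(1, 2))
out = x.permute(2, 0, 1).sum(dim=(1, 2))
print((out.double() - ref).abs().max())
```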

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108559
Approved by: https://github.com/mingfeima, https://github.com/peterbell10
2024-03-06 01:01:35 +00:00
34e3f6f3c9 fix segfault in torch.native_channel_shuffle when input is empty (#121199)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

fix https://github.com/pytorch/pytorch/issues/121092

`torch.channel_shuffle` handles empty inputs correctly, but `torch.native_channel_shuffle` bypassed the `numel == 0` check, which caused a division by zero in the underlying kernel.
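
A minimal repro sketch, assuming the empty-input handling described above (the tensor shape is illustrative):

```
import torch

# With an empty input, torch.channel_shuffle already returned an empty
# result; torch.native_channel_shuffle skipped the numel == 0 check and
# divided by zero in the kernel before this fix.
x = torch.empty(0, 4, 2, 2)
out1 = torch.channel_shuffle(x, 2)
out2 = torch.native_channel_shuffle(x, 2)  # previously crashed
print(out1.shape, out2.shape)
```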

* __->__ #121199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121199
Approved by: https://github.com/malfet
2024-03-06 00:46:36 +00:00
eae9751e82 Fix linalg_eigvals invalid use of composite dispatch key (#121142)
`linalg_eigvals_out` calls into a dispatch stub, so only supports CPU and CUDA
strided tensors but incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing in a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
2024-03-05 21:13:27 +00:00
8ccf8b2c47 Avoid COW input materialize in more forward ops (#121070)
Affected operators are: addr, cdist, sparse.sampled_addmm, sparse.mm,
matrix_exp, softmax, cross_entropy

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121070
Approved by: https://github.com/ezyang
2024-03-05 19:47:24 +00:00
3239f86a3d [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
According to the [cuBLAS API Reference](https://docs.nvidia.com/cuda/cublas/index.html#cublassetworkspace), the recommended workspace size for Hopper is 32 MiB and for the remaining architectures 4 MiB. This PR increases the workspace size accordingly. I am not aware of the recommended workspace size for HIP, so I am keeping it unchanged there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120925
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-05 18:13:05 +00:00
cyy
6ecd65886a Remove unnecessary const_casts (#121225)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121225
Approved by: https://github.com/soulitzer
2024-03-05 17:34:24 +00:00
42821d462a [ATen][Native][CUDA] Decrease max_threads in ctc_loss (#120746)
CUDA 12.4 introduces changes that require a smaller number of threads per block for double precision in `ctc_loss`. This PR addresses that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120746
Approved by: https://github.com/ptrblck, https://github.com/janeyx99
2024-03-05 14:14:41 +00:00
cyy
507611f9ae [CUDACachingAllocator] Turn Allocator::allocate into non-const (#120969)
Ideally, the method should be non-const since it changes the allocator state. Some const_casts are also removed along the way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120969
Approved by: https://github.com/albanD
2024-03-05 09:53:05 +00:00
ffe45a8188 [ATen-vulkan] Implement global shader registry (#121088)
Differential Revision: D54447700

## Context

This changeset updates Vulkan SPIR-V codegen to introduce a global SPIR-V shader registry and register shaders dynamically at static initialization time. This change makes it possible to define and link custom shader libraries to the ATen-Vulkan runtime.

Before:

* `gen_vulkan_spv.py` generated two files, `spv.h` and `spv.cpp` which would contain the definition and initialization of Vulkan shader registry variables.

After:

* Introduce the `ShaderRegistry` class in `api/`, which encapsulates functionality of the `ShaderRegistry` class previously defined in the generated `spv.h` file
* Introduce a global shader registry, defined as a static variable in the `api::shader_registry()` function
* Define a `ShaderRegisterInit` class (taking inspiration from `TorchLibraryInit`) that allows for dynamic shader registration
* `gen_vulkan_spv.py` now only generates `spv.cpp`, which defines a static `ShaderRegisterInit` instance that triggers registration of the compiled shaders to the global shader registry.

Benefits:

* Cleaner code base; we no longer have `ShaderRegistry` defined in a generated file, and don't need a separate implementation file (`impl/Registry.*`) to handle shader lookup. All that logic now lives under `api/ShaderRegistry.*`
* Makes it possible to compile and link separate shader libraries, providing similar flexibility as defining and linking custom ATen operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121088
Approved by: https://github.com/manuelcandales, https://github.com/jorgep31415
2024-03-05 03:56:57 +00:00
eba28a6f91 [VK-API][Op Redesign][3/n] Expose new Context and Resource APIs (#121060)
Summary: For use in the next diff.

Test Plan: sc

Differential Revision: D54397862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121060
Approved by: https://github.com/SS-JIA
2024-03-04 22:26:07 +00:00
70c23a51ac Revert "[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)"
This reverts commit 0a38a6ac8046e4d3f9cfaba86b7ec6517038646f.

Reverted https://github.com/pytorch/pytorch/pull/120925 on behalf of https://github.com/clee2000 due to broke inductor models and caused accuracy regression on nightly dashboard 0a38a6ac80 https://github.com/pytorch/pytorch/actions/runs/8118465367/job/22193590228 ([comment](https://github.com/pytorch/pytorch/pull/120925#issuecomment-1977556485))
2024-03-04 22:13:23 +00:00
6a5c7d5f95 [ATen-vulkan] Enable deferred descriptor pool initialization (#121134)
Differential Revision: D54487619

## Context

Allow the descriptor pool of an `api::Context` object to be initialized in a deferred fashion, instead of forcing initialization upon construction. This mode of operation will be used in the ExecuTorch Vulkan delegate, where the exact number of descriptor sets can be determined once the graph is built instead of needing to "guess" an adequate amount.

## Implementation Details

* Check `config.descriptorPoolMaxSets > 0` to check if the descriptor pool should be initialized
* Introduce the `DescriptorPool::init()` function to trigger initialization
* Introduce safeguards against using an uninitialized descriptor pool

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121134
Approved by: https://github.com/manuelcandales
2024-03-04 21:37:32 +00:00
0c07c0c15f Revert "add int4 packed gemm support on CPU device (#117475)"
This reverts commit 30befa592e0675cc694f87a4f6fb80894709e719.

Reverted https://github.com/pytorch/pytorch/pull/117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/117475#issuecomment-1977474686))
2024-03-04 21:20:57 +00:00
a98c17edc7 Revert "add int8 packed gemm support on CPU device (#118056)"
This reverts commit f84375ca5db623a6a53cbce2864d27dfad626228.

Reverted https://github.com/pytorch/pytorch/pull/118056 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/118056#issuecomment-1977368720))
2024-03-04 20:09:40 +00:00
9ff65d56a5 Revert "delete useless cast_outputs call in unary_op_impl_float_out (#120486)"
This reverts commit d053dcfa69a52e6b9f9f2ba997b6bffbc9b29bb5.

Reverted https://github.com/pytorch/pytorch/pull/120486 on behalf of https://github.com/izaitsevfb due to Fails meta internal tests ([comment](https://github.com/pytorch/pytorch/pull/120486#issuecomment-1977343125))
2024-03-04 19:52:23 +00:00
2e6c08a14b Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)
# Summary
Updates FlashAttention kernel code from tag [2.3.6](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6) to [2.5.5](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5).

The usual changes were then re-rolled on top of the updated kernel: changing how the dropout state is saved for backward, and removing the head_dim_pad, since it would make the kernel mutate in place, which interacts badly with functionalization.
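
A hedged usage sketch (requires a CUDA device and fp16/bf16 inputs; shapes are illustrative, not from the PR) showing how scaled_dot_product_attention can be routed through the FlashAttention backend this PR updates:

```
import torch
import torch.nn.functional as F

# Force scaled_dot_product_attention onto the flash backend.
q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```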

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118935
Approved by: https://github.com/cpuhrsch
2024-03-04 17:36:22 +00:00
8ba49d0e53 Fix compilation error: load_fp32_from_fp16’ was not declared in this scope for ppc64le (#120307)
This patch adds the missing implementation of `load_fp32_from_fp16` for half precision, fixing the error `'load_fp32_from_fp16' was not declared in this scope`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120307
Approved by: https://github.com/jgong5
2024-03-04 11:08:39 +00:00
27ac73073b Fix hipification issue (#121107)
Differential Revision: D54470055

```
buck-out/v2/gen/fbcode/713b128926d8b21f/caffe2/__ATen-hip__/buck-headers/ATen/native/hip/MemoryAccess.cuh:201:61: error: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'int' [-Werror,-Wsign-compare]
    return ((threadIdx.x  + thread_work_elem*num_threads()) < remaining);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ^ ~~~~~~~~~
```

```
buck-out/v2/gen/fbcode/713b128926d8b21f/caffe2/__ATen-hip__/buck-headers/ATen/native/hip/MemoryAccess.cuh:223:15: error: unused variable 'to' [-Werror,-Wunused-variable]
    scalar_t *to = reinterpret_cast<scalar_t *>(data[0]) + block_work_size() * idx;
              ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121107
Approved by: https://github.com/chenyang78
2024-03-04 09:41:21 +00:00
cyy
4b494d0750 Fix comparison of integer expressions of different signedness (#121066)
Fixes these warnings
```
src/aten/src/ATen/native/cuda/ForeachReduceOp.cu:190:19: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121066
Approved by: https://github.com/tringwald, https://github.com/Skylion007
2024-03-04 02:14:10 +00:00
cyy
13fadea888 [Clang-tidy header][21/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#120763)
This PR continues to fix clang-tidy warnings in aten/src/ATEN/*, following #120574.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120763
Approved by: https://github.com/Skylion007
2024-03-03 23:18:43 +00:00
83d848e1c7 [Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605)
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` covers the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support that case.
This feature targets the PyTorch 2.3 release.

**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```

**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
- latency for shape (128, 128) = 0.153 ms
- latency for shape (1024, 1024) = 0.247 ms

After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms

Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-02 05:11:17 +00:00
f84375ca5d add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
ghstack dependencies: #117475
2024-03-02 04:35:49 +00:00
5258c3645d [ATen-vulkan][EZ] Bug fixes: only create the image view when memory has been bound, invalidate cmd on flush (#121027)
Summary:
## Context

Introduce some simple bug fixes to the Vulkan Compute API that were causing errors on Android.

1. When using deferred allocation for image textures, it is undefined behaviour to create a `vkImageView` for a `vkImage` that has not yet been bound to memory. Fix this by creating the image view only after the `vkImage` has been bound to memory.
2. When flushing the `api::Context`, the command pool is flushed but any current command buffers are not invalidated. This will cause a segmentation fault if the command buffer is not submitted prior to calling `flush()`, because subsequent calls to `submit_*_job()` will use the old command buffer which will have been freed when the command pool is flushed. To fix, invalidate any existing command buffers when calling `flush()`.

Test Plan:
Build the test binary for Android:

```
buck build --target-platforms=ovr_config//platform/android:arm64-fbsource -c ndk.custom_libcxx=false //xplat/caffe2:pt_vulkan_api_test_bin --show-output
```

Push and run the test binary on a local android phone.

Differential Revision: D54425370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121027
Approved by: https://github.com/mcr229, https://github.com/cbilgin
2024-03-02 04:35:46 +00:00
2d9efad38f Add the bound check for flatten with out_dim (#120894)
Fixes #120762

The bound is not valid in the following example, but it goes unchecked:
```
a = torch.tensor([1, 2, 3])
a.flatten(start_dim=0, end_dim=1, out_dim='a')
```

The same bound is checked for the case without `out_dim`:

```
a = torch.tensor([1, 2, 3])
a.flatten(start_dim=0, end_dim=1)
```

Therefore, this PR applies the same check when `out_dim` is given.

@malfet @janeyx99
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120894
Approved by: https://github.com/malfet, https://github.com/spzala
2024-03-02 03:56:55 +00:00
30befa592e add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The baseline perf measured on Intel(R) Xeon(R) CPU Max 9480, single socket (56 cores), is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-02 00:17:34 +00:00
0a38a6ac80 [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
According to the [cuBLAS API Reference](https://docs.nvidia.com/cuda/cublas/index.html#cublassetworkspace), the recommended workspace size for Hopper is 32 MiB and for the remaining architectures 4 MiB. This PR increases the workspace size accordingly. I am not aware of the recommended workspace size for HIP, so I am keeping it unchanged there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120925
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-01 23:32:59 +00:00
b8e6ca6f76 Add sparse compressed meta tensor support (#120707)
As in the title.

Replaces https://github.com/pytorch/pytorch/pull/120498 and https://github.com/pytorch/pytorch/pull/120562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120707
Approved by: https://github.com/ezyang
ghstack dependencies: #120703
2024-03-01 13:28:47 +00:00
70d4d109f2 Make SparseCsr a functionality dispatch key (#120703)
As in the title.

To enable meta and fake tensor support for sparse compressed tensors in compliance with the meta/fake tensor support for sparse COO tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120703
Approved by: https://github.com/ezyang
2024-03-01 13:28:46 +00:00
13a54ce279 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-03-01 05:05:28 +00:00
d053dcfa69 delete useless cast_outputs call in unary_op_impl_float_out (#120486)
The cast_outputs function is only used for the CPU device, and it is already called in the cpu_xxx_vec helpers, such as cpu_kernel_vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120486
Approved by: https://github.com/ezyang
2024-03-01 04:54:11 +00:00
67c97a9aad fix the scaled dot product attention doc (#120859)
Fixes #120810

The following code verifies the broadcast behavior (from the issue):
```
import torch

B = 3
S = 5
L = 7
E = 16
EV = 32
additional_batches = [2, 4]

query_shape = [B] + additional_batches + [L, E]
key_shape = [B] + additional_batches + [S, E]
value_shape = [B] + additional_batches + [S, EV]

query = torch.rand(*query_shape)
key = torch.rand(*key_shape)
value = torch.rand(*value_shape)
mask = torch.zeros((1, 1, S), dtype=torch.bool)
mask[:, :, S // 2 :] = True

# query.to("cuda")
# key.to("cuda")
# value.to("cuda")
# mask.to("cuda")

attention = torch.nn.functional.scaled_dot_product_attention(query, key, value, mask)

print(f"query shape = {query.shape}")
print(f"key shape = {key.shape}")
print(f"value shape = {value.shape}")
print(f"mask shape = {mask.shape}")
print(f"attention shape = {attention.shape}")

#in both CPU and cuda, output shape is:
# query shape = torch.Size([3, 2, 4, 7, 16])
# key shape = torch.Size([3, 2, 4, 5, 16])
# value shape = torch.Size([3, 2, 4, 5, 32])
# mask shape = torch.Size([1, 1, 5])
# attention shape = torch.Size([3, 2, 4, 7, 32])

## test add is broadcasting mask to query@(key.mT)
res = query@(key.mT)
print(res.shape)
res2 = torch.add(res, mask)
print(res2.shape)
```

At code level, in the default backend,
ab38354887/aten/src/ATen/native/transformers/attention.cpp (L735)

the add operation broadcasts the `attn_mask` to match `auto attn = at::matmul(query, key.transpose(-2, -1) * scaling_factor);`

- Changed the doc in [torch/nn/functional.py](https://github.com/pytorch/pytorch/pull/120859/files#diff-c358c214f663ba0c8b9c6846fbe0042fa29494cf02fe4714a17dcd0d268b035b).
- Also fixed a few inconsistencies in the cpp comments.

@mikaylagawarecki

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120859
Approved by: https://github.com/drisspg
2024-03-01 02:54:08 +00:00
9b2c35b4fe [dynamo] Fix convolution meta kernel when input channel is 0 (#120944)
Addresses https://github.com/pytorch/pytorch/issues/118797

This adds the special channel handling logic from eager (setting output channels to 0 when input channels are 0):
67d3e4f2a2/aten/src/ATen/native/Convolution.cpp (L1400-L1403)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120944
Approved by: https://github.com/zou3519
2024-03-01 01:18:21 +00:00
7ebfe21724 Fix nll loss dynamo failure (#120805)
Fix for https://github.com/pytorch/pytorch/issues/119791. Part of the dynamo bug bash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120805
Approved by: https://github.com/Skylion007, https://github.com/zou3519, https://github.com/malfet
2024-02-29 22:34:49 +00:00