Commit Graph

15376 Commits

Author SHA1 Message Date
cyy
8cd7ad8b48 [Reland][Environment Variable][5/N] Use thread-safe getenv functions (#140594)
Reland of #139762 with no bug found.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140594
Approved by: https://github.com/ezyang
2024-11-18 21:45:35 +00:00
e46af7de0c [MPS] [BE] Use direct call vs virtual (#140950)
I.e. replace `at::detail::getMPSHooks().isOnMacOSorNewer` with `is_macos_13_or_newer`, which is a direct function call instead of going through a virtual method call
Hooks are only needed to provide a feature-agnostic interface for querying something even on platforms that might not support the feature, while functions implemented in `ATen/native/xxx` should be able to call those platform-specific methods directly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140950
Approved by: https://github.com/Skylion007
ghstack dependencies: #140896
2024-11-18 21:01:52 +00:00
4eed438a42 Implement deterministic scan (#140887)
Fixes #89492
Uses block-wise cub primitives.
On large inputs, this implementation is approximately 25% slower than the device cub implementation, so it is turned on only in cases where cub would have been used (floating-point inputs, cumsum that is effectively 1-D)
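
A minimal user-facing sketch of where this path should kick in (the device and the gating on deterministic mode are assumptions on my part, not taken from the PR):
```python
import torch

# Sketch only: assumes a CUDA build and that the block-wise deterministic scan
# is selected for floating-point, effectively-1D cumsum when deterministic
# algorithms are enabled.
torch.use_deterministic_algorithms(True)
x = torch.randn(1_000_000, device="cuda")
y = torch.cumsum(x, dim=0)
print(y[-1])
```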

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140887
Approved by: https://github.com/ezyang, https://github.com/kurtamohler
2024-11-18 20:56:14 +00:00
408ad45014 [MPS][BE] Introduce mtl_setArgs (#140896)
Which is a variadic template that automates the tedious (and error-prone) process of passing arguments via a series of
```cpp
  mtl_setBuffer(encoder, b1, 0);
  mtl_setBuffer(encoder, b2, 1);
  mtl_setBytes(encoder, param, 2);
```
into a compact
```
  mtl_setArgs(encoder, b1, b2, param);
```

Introduce a few more specializations of `mps_setArg`, such as:
 - Call `setBuffer` for `id<MTLBuffer>`
 - Copy double as float (as MPS does not support double precision types)
 - Accept `std::optional<at::Tensor>` that will not call setBuffer if the optional is empty

Also, re-metaprogram `mtl_setBytes` to make it usable with any trivially copyable struct, but keep a separate implementation for containers, as uploading `c10::SmallVector` (which is trivially copyable) would overwrite the next arguments; luckily this resulted in test failures of `test_cross_entropy_label_smoothing_weight_ignore_indices_mps`

Introduce `has_size_type_v`, which can be used to differentiate trivially copyable `std::array` and `c10::ArrayRef` from other trivially copyable structs.
```cpp
template <typename T>
class has_size_type {
  template <typename U>
  static constexpr std::true_type check(typename U::size_type*);
  template <typename>
  static constexpr std::false_type check(...);

 public:
  static constexpr bool value = decltype(check<T>(nullptr))::value;
};

template <typename T>
constexpr bool has_size_type_v = has_size_type<T>::value;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140896
Approved by: https://github.com/Skylion007
2024-11-18 20:35:01 +00:00
0f1a88cfba Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-11-18 18:21:17 +00:00
cca34be584 Update XNNPACK Version (#139913)
Updating XNNPACK Version to 4ea82e595b36106653175dcb04b2aa532660d0d8

submodule update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139913
Approved by: https://github.com/digantdesai, https://github.com/huydhn
2024-11-18 18:16:31 +00:00
f630799587 move c10::overflows to its own header (#140564)
Working on moving `complex<Half>` to complex.h instead of Half.h; this depends on complex and isn't used particularly widely.

Differential Revision: [D65888038](https://our.internmc.facebook.com/intern/diff/D65888038/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140564
Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/malfet
2024-11-18 15:56:21 +00:00
cyy
06dde8c157 [1/N] Remove inclusion of ATen/core/Array.h (#122064)
The functionality of Array.h largely overlaps with std::array, so it should be safe to use std::array instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122064
Approved by: https://github.com/ezyang
2024-11-18 08:50:28 +00:00
6c6f745fa7 Revert "[1/N] Remove inclusion of ATen/core/Array.h (#122064)"
This reverts commit 486b9aaa67a02807aea06f33c009b5311caab337.

Reverted https://github.com/pytorch/pytorch/pull/122064 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but lots of compilation errors show up after this lands ([comment](https://github.com/pytorch/pytorch/pull/122064#issuecomment-2482263396))
2024-11-18 08:31:38 +00:00
43edb94f8a [Quantization][PrivateUse1] Adding more support QuantizedPrivateuse1 backends (#139860)
Here are some explanations of this PR.

1. Changes in `aten/src/ATen/core/Tensor.cpp` and `c10/core/DispatchKey.cpp`: support the toString method for the `QuantizedPrivateUse1` backend so that PyTorch prints the correct backend string for it.
2. Add the header `DispatchStub.h` in `aten/src/ATen/native/quantized/IndexKernel.h`: if this header is not included, we can't use `masked_fill_kernel_quantized_stub` even if we include the `IndexKernel.h` header; it would throw an error during compilation.
3. Add multiple `TORCH_API`s in `aten/src/ATen/native/quantized/AffineQuantizer.h`: these functions are useful for other PrivateUse1 backends supporting quantization; if these `TORCH_API` annotations are missing, it would throw an error at runtime (undefined symbol)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139860
Approved by: https://github.com/bdhirsh
2024-11-18 05:09:59 +00:00
62d2c5b667 Revert "Enable XPUEvent elapsed_time function (#134666)" (#140872)
# Motivation
The reverted PR (#134666) raises an internal UT failure on XPU.
This reverts commit 4bbd6da33101a8d709f1d2921ad8ae6f9b0dc166.
# Additional Context
refer to https://github.com/pytorch/pytorch/issues/140814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140872
Approved by: https://github.com/EikanWang
2024-11-18 02:58:05 +00:00
cyy
486b9aaa67 [1/N] Remove inclusion of ATen/core/Array.h (#122064)
The functionality of Array.h largely overlaps with std::array, so it should be safe to use std::array instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122064
Approved by: https://github.com/ezyang
2024-11-18 01:31:39 +00:00
99014a297c [BE][MPS] Apply clang-format to mps headers (#140906)
It was a mistake to have missed them in the past.

All changes in this PR except ones to .lintrunner.toml are generated by running
`lintrunner -a --take CLANGFORMAT --all-files`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140906
Approved by: https://github.com/Skylion007
2024-11-17 21:06:27 +00:00
24be47f0c7 [MPS] Allow >2**32 metal dispatches (#140862)
By passing length as `NSUInteger` which should be a 64-bit value on all 64-bit systems according to https://developer.apple.com/documentation/objectivec/nsuinteger?language=objc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140862
Approved by: https://github.com/Skylion007
2024-11-17 18:05:44 +00:00
4269250a30 [BE][EZ] Use nested namespaces (#140905)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140905
Approved by: https://github.com/Skylion007
2024-11-17 17:53:00 +00:00
cyy
73602873c9 [10/N] Fix Wextra-semi warning (#140880)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140880
Approved by: https://github.com/ezyang
2024-11-17 16:12:28 +00:00
44afaac9fd [MPS][BE] Fix non-portable path warning (#140891)
I.e. fixes
```
[1082/1084] Building OBJCXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mps/operations/UpSample.mm.o
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/UpSample.mm:224:10: warning: non-portable path to file '<ATen/native/mps/UpSample_metallib.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path]
  224 | #include <ATen/native/mps/Upsample_metallib.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |          <ATen/native/mps/UpSample_metallib.h>
```
as the generated header name should have the same capitalization as the respective shader file, i.e. `kernels/UpSample.metal`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140891
Approved by: https://github.com/Skylion007
2024-11-17 15:14:05 +00:00
5df9207ba9 Don't go through dispatch for *_dot_with_fp32_arith (#140834)
We don't need to dispatch for these because they're only used from within ATen/native/cpu, which is rebuilt per-CPU_CAPABILITY anyway.

Differential Revision: [D66012283](https://our.internmc.facebook.com/intern/diff/D66012283/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140834
Approved by: https://github.com/malfet
2024-11-16 00:30:25 +00:00
109f8274a8 Revert "Add NHWC support for group normalization (#126635)"
This reverts commit ed0e63e938317fd254a705f00580caeb68768f9c.

Reverted https://github.com/pytorch/pytorch/pull/126635 on behalf of https://github.com/kit1980 due to Reverted internally at Meta, see D65979564 ([comment](https://github.com/pytorch/pytorch/pull/126635#issuecomment-2480130943))
2024-11-15 23:38:15 +00:00
80d63e7dd9 Fix softmax_backward_data CPU implementation error when argument output is noncontiguous (#139740)
The implementation of the `softmax_backward_data` operator for the CPU backend produces incorrect results when the `output` argument is non-contiguous.

Here is a test case that demonstrates this issue:

```python
import torch

torch.manual_seed(0)
op = torch.ops.aten._softmax_backward_data
grad_output = torch.ones(3, 3, 3)
temp = torch.randn(3, 10, 3)
out = temp[:, :3, :]
out = out.contiguous()
print(out.is_contiguous())
grad_input = op(grad_output, out, 1, torch.float32)
print(grad_input)
```

In this test case, the variable `grad_input` yields incorrect results if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` produces the same results whether or not `output` is contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740
Approved by: https://github.com/zou3519
2024-11-15 19:53:20 +00:00
cyy
55f1959fc1 [12/N] Fix extra warnings brought by clang-tidy-17 (#140801)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140801
Approved by: https://github.com/Skylion007
2024-11-15 16:54:30 +00:00
6c0a2d8bbf Fix the check for can_use_expanded_index_path (#140351)
Fixes #129093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140351
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2024-11-15 05:52:23 +00:00
1c1d06a22c [ROCm] remove size restrictions in gemm_and_bias (#140724)
This aligns hipblaslt behavior with CUDA_VERSION >= 12010.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140724
Approved by: https://github.com/pruthvistony, https://github.com/eqy
2024-11-15 02:23:27 +00:00
baf8686aec [BE][MPS] Remove extra semicolons (#140776)
Fixes the following warnings:
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/Generator.cpp:25:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:40:63: warning: extra ';' after member function definition [-Wextra-semi]
   40 |   void set_engine(at::Philox4_32 engine) { engine_ = engine; };
      |                                                               ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:41:46: warning: extra ';' after member function definition [-Wextra-semi]
   41 |   at::Philox4_32 engine() { return engine_; };
      |                                              ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:43:62: warning: extra ';' after member function definition [-Wextra-semi]
   43 |   static DeviceType device_type() { return DeviceType::MPS; };
      |                                                              ^
3 warnings generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140776
Approved by: https://github.com/Skylion007
2024-11-15 01:47:55 +00:00
05c3330893 use more elements per thread for narrow dtypes (#139449)
Fix a perf issue for narrow dtypes by accessing more elements per thread.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-11-14 22:50:16 +00:00
7621fc5dad Add missing boundary checks to cunn_SoftMaxForward (#140682)
This fixes an out-of-bounds (OOB) memory access for the following code:
```python
import torch
qk = torch.randn((1024,587), dtype=torch.float64, device='cuda')
smqk = torch.softmax(qk, dim=-1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140682
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-11-14 22:49:06 +00:00
27c7caf745 [ROCm] TunableOp fix for batched MM with views. (#140673)
Fixes #140278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140673
Approved by: https://github.com/jeffdaily
2024-11-14 20:22:12 +00:00
b0d681417c [MPS] Reintroduce support for convolutions with output_channels > 65536 (#140726)
This reintroduces support for high channel sizes for convs. The guard for macOS versions < 15.1 is still present to prevent reintroducing #129207.

I'm unsure about the specific macOS version support, but I'm assuming this was fixed in 15.1, and I'm relying on signals from CI for verification. I expect the new test to fail for macOS versions < 15.1 and the old test to start failing for > 15.0. I've added xfails for this and extended the version helpers to support 15.1+.
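
A rough sketch of the kind of call this re-enables (channel count and shapes are arbitrary; assumes an MPS device on macOS 15.1+):
```python
import torch

# Convolution with output channels > 65536, previously guarded off on MPS.
conv = torch.nn.Conv2d(3, 70000, kernel_size=1).to("mps")
x = torch.randn(1, 3, 8, 8, device="mps")
y = conv(x)
print(y.shape)  # torch.Size([1, 70000, 8, 8])
```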

Fixes #140722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140726
Approved by: https://github.com/malfet
2024-11-14 20:09:01 +00:00
adcff4bff0 Revert "use more elements per thread for narrow dtypes (#139449)"
This reverts commit d3fc13a9dd186ceb8d1b56b0968a41686ea645cd.

Reverted https://github.com/pytorch/pytorch/pull/139449 on behalf of https://github.com/ngimel due to breaks tests ([comment](https://github.com/pytorch/pytorch/pull/139449#issuecomment-2477012582))
2024-11-14 17:28:32 +00:00
ebeab262d9 Refine XPU device prop and fix typo (#140661)
# Motivation
`architecture` is an experimental attribute that might be used by Triton AOT codegen. It should not be in `__repr__`.
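
For illustration, a sketch of the user-visible effect (assumes an XPU build; the `architecture` attribute name is taken from this PR):
```python
import torch

props = torch.xpu.get_device_properties(0)
print(props)               # repr no longer includes the experimental `architecture` field
print(props.architecture)  # the attribute itself remains accessible
```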

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140661
Approved by: https://github.com/EikanWang
2024-11-14 11:18:01 +00:00
62eea62493 [Quant][Onednn] add linear_dynamic_fp16 ops (#140376)
**About this PR**
This PR adds the following ops for `linear_dynamic_fp16` in onednn namespace. These ops are intended for PT2E quantization eager mode.
- `onednn::linear_prepack_fp16`: packs fp32 weight to an fp16 MkldnnCPU tensor.
- `onednn::linear_dynamic_fp16`: takes an fp32 CPU tensor and an fp16 MkldnnCPU tensor and computes linear in fp32
- `onednn::linear_relu_dynamic_fp16`: similar to the former, but applies relu to the output.

**Test plan**
`python test/test_quantization.py -k test_linear_dynamic_fp16_onednn`

**Implementation**
These ops call the oneDNN library under the hood. It's worth noting that oneDNN does not support f32 * f16 -> f32 computation, so we have to convert the fp16 weight to fp32 before computation. The weight is still in plain format after packing.

**Correctness and performance**
Correctness is guaranteed by UT.
Performance of the new ops may be better than the FBGEMM implementation when the weight shape is small but worse when it is large, because the weight dtype conversion and computation are not fused.
For example, I ran benchmarks on an Intel(R) Xeon(R) Platinum 8490H machine with different core counts and shapes. When using 1 core per instance, the new implementation is generally faster for weight shapes < 1024 * 1024. When using more cores, the threshold increases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140376
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2024-11-14 05:19:18 +00:00
9d93c27025 Implement unfold_backward on MPS (#135411)
This PR adds a native implementation of unfold_backward as a Metal shader, mostly a copy-n-paste of the algorithms used in the CUDA and CPU implementations. Considering `out = in.unfold(dim, size, step)`, the following holds true:
* `out.shape[dim] == (in.shape[dim] - size) / step + 1`
* `out.shape[-1] == size`
* `out.ndim == in.ndim + 1`
`unfold_backward` Metal kernel  receives `grad_in` and returns `grad_out` such that:
* `grad_in.shape == out.shape`
* `grad_out.shape == in.shape`

For each index in `grad_out`, find the elements contributing to it and sum them up. Such an algorithm requires no synchronization between threads.
That is, `grad_out[...,out_dim_idx,...]` accumulates all values `grad_in[...,in_dim_idx,...,in_last_idx]`, where `in_dim_idx` is in the range [`(out_dim_idx - size) / step`, `out_dim_idx / step`] clamped to (0, `in_dim_size`) and `in_last_idx` equals `out_dim_idx - in_dim_idx * step`. The accumulation step is skipped if `in_last_idx` is outside the [0, size] range.
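
The shape relations above can be checked against the public `unfold` API; a small sketch (the backward kernel is exercised implicitly through autograd):
```python
import torch

x = torch.randn(2, 10, 3, requires_grad=True)
dim, size, step = 1, 4, 2
out = x.unfold(dim, size, step)
assert out.shape[dim] == (x.shape[dim] - size) // step + 1
assert out.shape[-1] == size
assert out.dim() == x.dim() + 1
out.sum().backward()   # calls unfold_backward under the hood
print(x.grad.shape)    # same shape as x
```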

This operator has been requested 16 times on https://github.com/pytorch/pytorch/issues/77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135411
Approved by: https://github.com/manuelcandales

Co-authored-by: Manuel Candales <42380156+manuelcandales@users.noreply.github.com>
2024-11-13 23:04:15 +00:00
2675ef8758 Revert " [Environment Variable][5/N] Use thread-safe getenv functions (#139762)"
This reverts commit 43f0fe60a36dc7e3bd8f77a2451bde81496679b0.

Reverted https://github.com/pytorch/pytorch/pull/139762 on behalf of https://github.com/malfet due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/139762#issuecomment-2474174813))
2024-11-13 16:50:00 +00:00
a58a565819 Revert "[Environment Variable][6/N] Use thread-safe getenv functions (#140200)"
This reverts commit 7d4f5f7508d3166af58fdcca8ff01a5b426af067.

Reverted https://github.com/pytorch/pytorch/pull/140200 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/140200#issuecomment-2473956859))
2024-11-13 15:33:23 +00:00
c6a29fc3d8 Revert "[Environment Variable][4/N] Use thread-safe getenv functions (#137843)"
This reverts commit 82eb09aafd7e4ee6e4fb0580f2221ea6253d218b.

Reverted https://github.com/pytorch/pytorch/pull/137843 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/137843#issuecomment-2473709760))
2024-11-13 14:06:52 +00:00
4a18e26ff5 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit a3cff4bbd4130d36b188dbe101a790e6d7da644f.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2473709246))
2024-11-13 14:05:01 +00:00
34743d8a16 Support dlpack for privateuse1 (#135331)
Fixes #129652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135331
Approved by: https://github.com/shink, https://github.com/FFFrog, https://github.com/ezyang

Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2024-11-13 13:13:14 +00:00
5b1c67cc60 [Intel GPU] Avoid atomic add for XPU device in scatter_add by deterministic mode (#137966)
The deterministic-mode check for the "scatter_add" op is not implemented on the XPU device; the UT expects it to report that "scatter_add_kernel" does not have a deterministic implementation.

Just like the CUDA implementation, we need to check `_deterministic_algorithms` in the scatter_add op for the XPU device.
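
A minimal sketch of the behavior being aligned with CUDA (the "xpu" device string assumes an XPU build):
```python
import torch

# With deterministic algorithms enforced, scatter_add on XPU should now raise
# the standard nondeterministic-operation alert instead of silently using atomics.
torch.use_deterministic_algorithms(True)
out = torch.zeros(3, 4, device="xpu")
index = torch.tensor([[0, 1, 2, 0]], device="xpu")
src = torch.ones(1, 4, device="xpu")
try:
    out.scatter_add_(0, index, src)
except RuntimeError as e:
    print(e)  # "scatter_add_kernel does not have a deterministic implementation ..."
```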

The UT is in: https://github.com/intel/torch-xpu-ops/blob/main/test/xpu/test_scatter_gather_ops_xpu.py. We reused [PyTorch UT code]( 96b30dcb25/test/test_scatter_gather_ops.py (L233)).
Now the UT case is [skipped in torch-xpu-ops test](4fa7921f1e/test/xpu/skip_list_common.py (L731)). Will open it when this PR is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137966
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/ezyang
2024-11-13 05:46:54 +00:00
79fb7416e7 [Intel GPU] Add device guard for XPU structured operator in torchgen (#138802)
This PR is a supplement to https://github.com/pytorch/pytorch/pull/133980. The previous PR fulfills the basic functionality of the XPU device guard, but we found it fails to address structured operators.

With the current PR, the code snippet in RegisterXPU.cpp is as follows, where we can see the device guard is successfully generated.

```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
          TORCH_INTERNAL_ASSERT(*current_device == options.device(),
            "structured kernels don't support multi-device outputs");
        } else {
          guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
          TORCH_INTERNAL_ASSERT(*current_device == options.device(),
            "structured kernels don't support multi-device outputs");
        } else {
          guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
      return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
    c10::OptionalDeviceGuard guard_;
};

```

However, without the current change, the generated code is

```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
      return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
};
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138802
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/ezyang
2024-11-13 05:40:38 +00:00
42ad54c71b [Intel GPU] Allow XPU device in LSTMCell operators (#140246)
Refine device check logic for LSTMCell.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140246
Approved by: https://github.com/soulitzer
2024-11-13 05:13:07 +00:00
4bbd6da331 Enable XPUEvent elapsed_time function (#134666)
# Motivation
This PR aims to enable the `elapsed_time` function for `XPUEvent`.

# Additional Context
This PR depends on the oneAPI 2025.0 toolchain.
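
A hedged usage sketch, assuming the `torch.xpu.Event` API mirrors `torch.cuda.Event` and an XPU device is available:
```python
import torch

start = torch.xpu.Event(enable_timing=True)
end = torch.xpu.Event(enable_timing=True)
x = torch.randn(4096, 4096, device="xpu")
start.record()
y = x @ x
end.record()
torch.xpu.synchronize()
print(f"matmul took {start.elapsed_time(end):.3f} ms")
```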

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134666
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-11-13 04:32:50 +00:00
cyy
40fb738197 Use Wextra-semi (#140236)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140236
Approved by: https://github.com/ezyang
2024-11-13 02:15:16 +00:00
3d2dd14217 [BE][Bugfix]: Add rad2deg to pointwise ops (#140290)
Adds missing pointwise tags. Apparently this allows NestedTensor to properly generate a function for OpInfo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140290
Approved by: https://github.com/jbschlosser
2024-11-13 00:02:00 +00:00
034b105d53 [BE][Ez]: Add NT unary op macro (#140213)
* Adds a macro to simplify adding more unary ops to NT.
* Adds sqrt support to NT
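
A tiny sketch of the newly supported op (values arbitrary; relies on the sqrt support added here):
```python
import torch

nt = torch.nested.nested_tensor([torch.tensor([1.0, 4.0, 9.0]), torch.tensor([16.0])])
print(nt.sqrt())  # elementwise sqrt over the ragged components
```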
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140213
Approved by: https://github.com/jbschlosser
2024-11-12 19:50:06 +00:00
cc8e832066 [AMD] use DC method for linalg.eigh (#140327)
Summary: The Jacobi method has larger numerical errors (see D64997718); use the divide-and-conquer method instead.
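
For reference, a sketch calling the affected op on a random symmetric matrix (on ROCm builds, `"cuda"` maps to HIP and, per this change, routes to the divide-and-conquer solver):
```python
import torch

A = torch.randn(512, 512, dtype=torch.float64, device="cuda")
A = A + A.mT                           # symmetrize
w, v = torch.linalg.eigh(A)
print(torch.dist(A, (v * w) @ v.mH))   # reconstruction error should be tiny
```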

Test Plan: CI

Differential Revision: D65786796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140327
Approved by: https://github.com/jianyuh
2024-11-12 19:17:25 +00:00
6a368b3fc5 Add ScalarList overload to _foreach_lerp (#134482)
Related:
- https://github.com/pytorch/pytorch/issues/133367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134482
Approved by: https://github.com/janeyx99
2024-11-12 19:03:41 +00:00
cyy
7624d625c0 [Reland][7/N] Fix Wextra-semi warning (#140342)
Reland of #140225 to fix a change in FBCODE_CAFFE2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140342
Approved by: https://github.com/kit1980
2024-11-12 18:55:31 +00:00
cyy
a3cff4bbd4 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-11-12 18:49:51 +00:00
928b8ec633 [BE]: Add pointwise tag to isfinite (#140291)
Adds pointwise tag to isfinite
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140291
Approved by: https://github.com/jbschlosser
2024-11-12 18:02:07 +00:00
7a02457053 [BE] Fix error message in torch._scaled_mm (#140343)
Follow-up to https://github.com/pytorch/pytorch/pull/140307, which fixed the error message for mat1 but not for mat2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140343
Approved by: https://github.com/kit1980
2024-11-12 17:13:41 +00:00