572 Commits

Author SHA1 Message Date
4cf2c646c2 Added torch.linalg.matrix_norm (#57127)
Summary:
This PR is focused on  the API for `linalg.matrix_norm` and delegates computations to `linalg.norm` for the moment.

The main difference between the norms is when `dim=None`. In this case
- `linalg.norm` will compute a vector norm on the flattened input if `ord=None`, otherwise it requires the input to be either 1D or 2D in order to disambiguate between vector and matrix norm
- `linalg.vector_norm` will flatten the input
- `linalg.matrix_norm` will compute the norm over the last two dimensions, treating the input as batch of matrices

In future PRs, the computations will be moved to `torch.linalg.matrix_norm` and `torch.norm` and `torch.linalg.norm` will delegate computations to either `linalg.vector_norm` or `linalg.matrix_norm` based on the arguments provided.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57127

Reviewed By: mrshenli

Differential Revision: D28186736

Pulled By: mruberry

fbshipit-source-id: 99ce2da9d1c4df3d9dd82c0a312c9570da5caf25
2021-05-09 04:50:33 -07:00
58f32fa5fd Remove compute_uv flag from torch.linalg.svd (#57180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57180

We have now a separate function for computing only the singular values.
`compute_uv` argument is not needed and it was decided in the
offline discussion to remove it. This is a BC-breaking change but our
linalg module is beta, therefore we can do it without a deprecation
notice.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28142163

Pulled By: mruberry

fbshipit-source-id: 3fac1fcae414307ad5748c9d5ff50e0aa4e1b853
2021-05-07 15:16:42 -07:00
1f1e2dab6b Remove optional type for ord parameter in vector_norm (#57662)
Summary:
As per discussion here https://github.com/pytorch/pytorch/pull/57127#discussion_r624948215

Note that we cannot remove the optional type from the `dim` parameter because the default is to flatten the input tensor which cannot be easily captured by a value other than `None`

### BC Breaking Note
This PR changes the `ord` parameter of `torch.linalg.vector_norm` so that it no longer accepts `None` arguments. The default behavior of `2` is equivalent to the previous default of `None`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57662

Reviewed By: albanD, mruberry

Differential Revision: D28228870

Pulled By: heitorschueroff

fbshipit-source-id: 040fd8055bbe013f64d3c8409bbb4b2c87c99d13
2021-05-06 17:53:25 -07:00
75a2a92b02 Add torch.linalg.cholesky_ex without checking for errors by default (#56724)
Summary:
The new function has the following signature `cholesky_ex(Tensor input, *, bool check_errors=False) -> (Tensor L, Tensor infos)`. When `check_errors=True`, an error is thrown if the decomposition fails; `check_errors=False` - responsibility for checking the decomposition is on the user.

When `check_errors=False`, we don't have host-device memory transfers for checking the values of the `info` tensor.

Rewrote the internal code for `torch.linalg.cholesky`. Added `cholesky_stub` dispatch. `linalg_cholesky` is implemented using calls to `linalg_cholesky_ex` now.

Resolves https://github.com/pytorch/pytorch/issues/57032.

Ref. https://github.com/pytorch/pytorch/issues/34272, https://github.com/pytorch/pytorch/issues/47608, https://github.com/pytorch/pytorch/issues/47953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56724

Reviewed By: ngimel

Differential Revision: D27960176

Pulled By: mruberry

fbshipit-source-id: f05f3d5d9b4aa444e41c4eec48ad9a9b6fd5dfa5
2021-05-01 18:48:27 -07:00
d4ddb47719 [special] Add xlog1py (#55138)
Summary:
Reference : https://github.com/pytorch/pytorch/issues/50345

* [x] Check Rendered Document (https://12494173-65600975-gh.circle-artifacts.com/0/docs/special.html#torch.special.xlog1py)
* [x] Tests in Binary Ufunc
* [x] OpInfo
* [x] Structured Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55138

Reviewed By: ngimel

Differential Revision: D27961461

Pulled By: mruberry

fbshipit-source-id: 30a8f41970a829bf50254aadf5615e8ce4148c7e
2021-04-30 05:51:13 -07:00
9da0f2e95e Support __pos__ and positive (#55891)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55604.

This PR implements `torch.Tensor.__pos__` and `torch.positive` for the compatibility with NumPy’s interface. (cc: mruberry, rgommers, emcastillo and kmaehashi)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55891

Reviewed By: H-Huang

Differential Revision: D28025928

Pulled By: mruberry

fbshipit-source-id: e43e329a802f31bf8805f6efab5c2c7ef34c88b9
2021-04-27 13:23:59 -07:00
3e006fc57e Adding hsplit,vsplit and dsplit methods (#53536)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53536

Reviewed By: albanD

Differential Revision: D27938880

Pulled By: iramazanli

fbshipit-source-id: f741119517783ec2bafa296622ee518b587dd127
2021-04-26 09:39:09 -07:00
298db67220 [OpInfo] Add Function Variant and Opinfo for permute (#56125)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56125

Reviewed By: ezyang

Differential Revision: D27960312

Pulled By: mruberry

fbshipit-source-id: b9dd89f7e69d7dff29f3b53828656c13df898fa5
2021-04-25 21:26:44 -07:00
d5ff432615 Add torch.linalg.svdvals (#56684)
Summary:
This PR adds `torch.linalg.svdvals(input, out=None)` that computes only the singular values of `input`.

Resolves https://github.com/pytorch/pytorch/issues/54155.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56684

Reviewed By: albanD

Differential Revision: D27938229

Pulled By: mruberry

fbshipit-source-id: 5ea79ad9cccf818df0fbda1f431299ebf8de3798
2021-04-25 03:42:24 -07:00
1f0223d6bb Fix bug in gaussian_nll_loss (#56469)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53964. cc albanD almson

## Major changes:
- Overhauled the actual loss calculation so that the shapes are now correct (in functional.py)
- added the missing doc in nn.functional.rst

## Minor changes (in functional.py):
- I removed the previous check on whether input and target were the same shape. This is to allow for broadcasting, say when you have 10 predictions that all have the same target.
- I added some comments to explain each shape check in detail. Let me know if these should be shortened/cut.

Screenshots of updated docs attached.
Let me know what you think, thanks!

## Edit: Description of change of behaviour (affecting BC):
The backwards-compatibility is only affected for the `reduction='none'` mode. This was the source of the bug. For tensors with size (N, D), the old returned loss had size (N), as incorrect summation was happening. It will now have size (N, D) as expected.

### Example
Define input tensors, all with size (2, 3).
`input = torch.tensor([[0., 1., 3.], [2., 4., 0.]], requires_grad=True)`
`target = torch.tensor([[1., 4., 2.], [-1., 2., 3.]])`
`var = 2*torch.ones(size=(2, 3), requires_grad=True)`

Initialise loss with reduction mode 'none'. We expect the returned loss to have the same size as the input tensors, (2, 3).
`loss = torch.nn.GaussianNLLLoss(reduction='none')`

Old behaviour:
`print(loss(input, target, var)) `
`# Gives tensor([3.7897, 6.5397], grad_fn=<MulBackward0>. This has size (2).`

New behaviour:
`print(loss(input, target, var)) `
`# Gives tensor([[0.5966, 2.5966, 0.5966], [2.5966, 1.3466, 2.5966]], grad_fn=<MulBackward0>)`
`# This has the expected size, (2, 3).`

To recover the old behaviour, sum along all dimensions except for the 0th:
`print(loss(input, target, var).sum(dim=1))`
`# Gives tensor([3.7897, 6.5397], grad_fn=<SumBackward1>.`

![doc1](https://user-images.githubusercontent.com/26558092/115391089-f7f47b00-a1d6-11eb-8726-e4da9057aee0.png)
![doc2](https://user-images.githubusercontent.com/26558092/115391094-f925a800-a1d6-11eb-954b-afd187f42bc7.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56469

Reviewed By: jbschlosser, agolynski

Differential Revision: D27894170

Pulled By: albanD

fbshipit-source-id: 197890189c97c22109491c47f469336b5b03a23f
2021-04-22 07:43:48 -07:00
52f1a07b63 Python API for Vitals (#53238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53238

There is a tension for the Vitals design: (1) we want a macro based logging API for C++ and (2) we want a clean python API. Furthermore, we want to this to work with "print on destruction" semantics.

The unfortunate resolution is that there are (2) ways to define vitals:
(1) Use the macros for local use only within C++ - this keeps the semantics people enjoy
(2) For vitals to be used through either C++ or Python, we use a global VitalsAPI object.

Both these go to the same place for the user: printing to stdout as the globals are destructed.

The long history on this diff shows many different ways to try to avoid having 2 different paths... we tried weak pointers & shared pointers, verbose switch cases, etc. Ultimately each ran into an ugly trade-off and this cuts the difference better the alternatives.

Test Plan:
buck test mode/dev caffe2/test:torch -- --regex vital
buck test //caffe2/aten:vitals

Reviewed By: orionr

Differential Revision: D26736443

fbshipit-source-id: ccab464224913edd07c1e8532093f673cdcb789f
2021-04-15 16:06:43 -07:00
50057e560b [special] Add i0e (#54409)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Changes:
* Add `i0e`
* Move some kernels from `UnaryOpsKernel.cu` to `UnarySpecialOpsKernel.cu` to decrease compilation time per file.

Time taken by i0e_vs_scipy tests: around 6.33.s

<details>

<summary>Test Run Log</summary>

```
(pytorch-cuda-dev) kshiteej@qgpu1:~/Pytorch/pytorch_module_special$ pytest test/test_unary_ufuncs.py -k _i0e_vs
======================================================================= test session starts ========================================================================
platform linux -- Python 3.8.6, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /home/kshiteej/Pytorch/pytorch_module_special, configfile: pytest.ini
plugins: hypothesis-5.38.1
collected 8843 items / 8833 deselected / 10 selected

test/test_unary_ufuncs.py ...sss....                                                                                                                         [100%]

========================================================================= warnings summary =========================================================================
../../.conda/envs/pytorch-cuda-dev/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py:73
test/test_unary_ufuncs.py::TestUnaryUfuncsCUDA::test_special_i0e_vs_scipy_cuda_bfloat16
  /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py:73: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
===================================================================== short test summary info ======================================================================
SKIPPED [3] test/test_unary_ufuncs.py:1182: not implemented: Could not run 'aten::_copy_from' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_copy_from' is only available for these backends: [BackendSelect, Named, InplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
InplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:56 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:8761 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_4.cpp:9348 [kernel]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:250 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
==================================================== 7 passed, 3 skipped, 8833 deselected, 2 warnings in 6.33s =====================================================
```

</details>

TODO:
* [x] Check rendered docs (https://11743402-65600975-gh.circle-artifacts.com/0/docs/special.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54409

Reviewed By: jbschlosser

Differential Revision: D27760472

Pulled By: mruberry

fbshipit-source-id: bdfbcaa798b00c51dc9513c34626246c8fc10548
2021-04-15 06:06:11 -07:00
3fe4718d16 Add padding_idx argument to EmbeddingBag (#49237)
Summary:
This PR adds a `padding_idx` parameter to `nn.EmbeddingBag` and `nn.functional.embedding_bag`. As with `nn.Embedding`'s `padding_idx` argument, if an embedding's index is equal to `padding_idx` it is ignored, so it is not included in the reduction.

This PR does not add support for `padding_idx` for quantized or ONNX `EmbeddingBag` for opset10/11 (opset9 is supported). In these cases, an error is thrown if `padding_idx` is provided.

Fixes https://github.com/pytorch/pytorch/issues/3194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49237

Reviewed By: walterddr, VitalyFedyunin

Differential Revision: D26948258

Pulled By: jbschlosser

fbshipit-source-id: 3ca672f7e768941f3261ab405fc7597c97ce3dfc
2021-04-14 09:38:01 -07:00
5fb1142702 Add CSR (compressed sparse row) layout for sparse tensors (#50937)
Summary:
Implement compressed sparse row format. Derived from the GCS implementation at https://github.com/pytorch/pytorch/pull/44190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50937

Reviewed By: mrshenli

Differential Revision: D27439865

Pulled By: ezyang

fbshipit-source-id: 3ba3dcb9679505b980ff6a5f513e913bbae2fb1d
2021-04-12 10:09:12 -07:00
902bf0bbbe [special] Alias for sigmoid and logit & follow-up (#54759)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Chages:
* Alias for sigmoid and logit
* Adds out variant for C++ API
* Updates docs to link back to `special` documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54759

Reviewed By: mrshenli

Differential Revision: D27615208

Pulled By: mruberry

fbshipit-source-id: 8bba908d1bea246e4aa9dbadb6951339af353556
2021-04-08 00:56:59 -07:00
84d18727bd Added linalg.eig, linalg.eigvals (#52491)
Summary:
This PR adds `torch.linalg.eig`, and `torch.linalg.eigvals` for NumPy compatibility.

MAGMA uses a hybrid CPU-GPU algorithm and doesn't have a GPU interface for the non-symmetric eigendecomposition. It means that it forces us to transfer inputs living in GPU memory to CPU first before calling MAGMA, and then transfer results from MAGMA to CPU. That is rather slow for smaller matrices and MAGMA is faster than CPU path only for matrices larger than 3000x3000.
Unfortunately, there is no cuSOLVER function for this operation.

Autograd support for `torch.linalg.eig` will be added in a follow-up PR.

Ref https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52491

Reviewed By: anjali411

Differential Revision: D27563616

Pulled By: mruberry

fbshipit-source-id: b42bb98afcd2ed7625d30bdd71cfc74a7ea57bb5
2021-04-06 13:53:26 -07:00
fd02fc5d71 Port put_ and take from TH to ATen (#53356)
Summary:
The two ports were don together, as they can be implemented with the same kernel. In TH, they were already implemented with the same kernel.

Resolves https://github.com/pytorch/pytorch/issues/24751
Resolves https://github.com/pytorch/pytorch/issues/24614
Resolves https://github.com/pytorch/pytorch/issues/24640
Resolves https://github.com/pytorch/pytorch/issues/24772

This port makes sure that it interacts correctly with the "deterministic algorithms" flag, as done in https://github.com/pytorch/pytorch/pull/51388

This PR also makes these two functions correct in the following aspects (all of them added to the tests as well):
- Support for complex numbers
- Correct handling of scalar inputs and zero-dimensional inputs
- Implementation that does not do any copies nor sorting of any of the input tensors
- Faster and more correct implementation of the backwards (now it works as it should when `source.shape() != index.shape()`)
- Now `put_(..., accumulate=True)` is implemented correctly with atomic operations on GPU / CPU (when possible) and is deterministic (modulo the loss of precision that might happen due to the reordering of a sum of floats)
- Adds the `torch.put` function that was missing, (`index_put` exists, for example)
- Corrected docs

It also adds a much more thorough testing to the operations and their gradients.

There is a BC-breaking change, and that is that now we check that the inputs do not overlap in the `put_` operation. This was handled (some of the cases, other cases were wrong) in the TH implementation by making contiguous copies of the inputs. How should we handle this one?

**Edit.** Benchmarks:
<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch
from itertools import product

torch.manual_seed(13)
torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, index_len, device, cmd):
    print(f"cmd: {cmd}, ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}")

    large_tensor = torch.rand(*([size] * ndims), device=device)
    small_tensor = torch.rand((index_len,), device=device)
    index = torch.randint(size * ndims, (index_len,), dtype=torch.long, device=device)
    if cmd == "put":
        command = "large_tensor.put_(index, small_tensor, accumulate=False)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "accumulate":
        command = "large_tensor.put_(index, small_tensor, accumulate=True)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "take":
        command = "torch.take(large_tensor, index)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    ipython.magic(f"timeit {command}")
    print()

for method, device in product(["accumulate", "put", "take"], [cpu, cuda]):
    run_test(3, 1000, 10, device, method)
    run_test(3, 1000, 1000, device, method)
    run_test(3, 1000, 10000, device, method)
    run_test(2, 10000, 100000, device, method)
```
</details>

```python
put_(accumulate=False)
```

<details>
<summary>ATen CPU (1.5x - 2x speedup)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.05 µs ± 2.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.15 µs ± 5.13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
21.6 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
238 µs ± 781 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
722 ns ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
4.89 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
42.5 µs ± 96.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
428 µs ± 774 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.99 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.4 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.6 µs ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.44 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.09 µs ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.77 µs ± 0.998 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.8 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

```python
put_(accumulate=True)
```

<details>
<summary>ATen CPU (x2 speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.12 µs ± 2.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.14 µs ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
20.8 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
264 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
814 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
5.11 µs ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
43.9 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
442 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (3x - 11x speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.01 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.3 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
12.6 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
34.7 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
38.2 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
61.2 µs ± 50.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
140 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

```python
take()
```

<details>
<summary>ATen CPU (1.1x speedup)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.18 µs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.79 µs ± 2.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
16.6 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
161 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.1 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.93 µs ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
18.6 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
178 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.38 µs ± 23.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.7 µs ± 9.77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.5 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.31 µs ± 7.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.52 µs ± 5.78 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.73 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.7 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53356

Reviewed By: mruberry

Differential Revision: D27520243

Pulled By: ngimel

fbshipit-source-id: e3979349c2c62d2949e09fb05e5fd4883fbc9093
2021-04-05 18:05:38 -07:00
2ee02b30b1 Replace rounding_mode="true" with rounding_mode=None (#51988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51988

* **#51988 Replace rounding_mode="true" with rounding_mode=None**

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27561817

Pulled By: mruberry

fbshipit-source-id: 60d1d9c389570f60d599fc1876518717367fb368
2021-04-05 14:53:43 -07:00
d98072b027 Deprecate torch.chain_matmul in favor of torch.linalg.multi_dot (#53453)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53453

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27406282

Pulled By: heitorschueroff

fbshipit-source-id: b6e715d1b88e0613ee6b6208cb28ba4757e31717
2021-04-01 04:50:51 -07:00
5d68b3695c [Relanding] Implemented torch.linalg.multi_dot (#52859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52859

This reverts commit 92a4ee1cf6092dd941591f80885eb7fef5b2c0d8.

Added support for bfloat16 for CUDA 11 and removed fast-path for empty input tensors that was affecting autograd graph.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27402390

Pulled By: heitorschueroff

fbshipit-source-id: 73c5ccf54f3da3d29eb63c9ed3601e2fe6951034
2021-04-01 04:49:05 -07:00
c9d0c855f7 [special] Alias for special.expm1 and special.exp2 (#54670)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54670

Reviewed By: H-Huang

Differential Revision: D27401440

Pulled By: mruberry

fbshipit-source-id: 02b1fd0e8ffd3f5a017d6b6b9229b76b92b4b745
2021-03-30 10:03:13 -07:00
c690ed0ae8 Fix override for __iter__ (#54702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54702

This fixes subclassing for __iter__ so that it returns an iterator over
subclasses properly instead of Tensor.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27352563

Pulled By: ezyang

fbshipit-source-id: 4c195a86c8f2931a6276dc07b1e74ee72002107c
2021-03-30 08:30:50 -07:00
0527d14248 [numpy] Add torch.take_along_dim (#52833)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349

Wrapper around the existing `torch.gather` with broadcasting logic.

TODO:
* [x] Add Doc entry (see if phrasing can be improved)
* [x] Add OpInfo
* [x] Add test against numpy
* [x] Handle broadcasting behaviour and when dim is not given.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52833

Reviewed By: malfet

Differential Revision: D27319038

Pulled By: mruberry

fbshipit-source-id: 00f307825f92c679d96e264997aa5509172f5ed1
2021-03-28 05:22:51 -07:00
6f8328ef44 [special] Add special.entr (#53500)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

TODO:

* [x] Verfiy docs rendering (https://11397990-65600975-gh.circle-artifacts.com/0/docs/special.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53500

Reviewed By: ngimel

Differential Revision: D27287096

Pulled By: mruberry

fbshipit-source-id: 6b3dfd53e811a0f023ee444a0b56176f825d39e9
2021-03-24 18:44:42 -07:00
7e3cf1ee24 [pytorch] Add native support for segment reduce step1: API definition (#53727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53727

This is first diff to add native support for segment reduction in PyTorch. It provides similar functionality like torch.scatter or "numpy.ufunc.reduceat".

This diff mainly focuses on API layer to make sure future improvements will not cause backward compatibility issues. Once API is settled, here are next steps I am planning:
- Add support for other major reduction types (e.g. min, sum) for 1D tensor
- Add Cuda support
- Backward support
- Documentation for the op
- Perf optimizations and benchmark util
- Support for multi dimensional tensors (on data and lengths) (not high priority)
- Support for 'indices' (not high priority)

Test Plan: Added unit test

Reviewed By: ngimel

Differential Revision: D26952075

fbshipit-source-id: 8040ec96def3013e7240cf675d499ee424437560
2021-03-23 16:00:30 -07:00
f9e7f132fb Added torch.linalg.matrix_power (#52608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52608

**TODO**

- [x] Add OpInfo
- [x] Update documentation
- [x] Add more tests and compare against NumPy

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27261532

Pulled By: heitorschueroff

fbshipit-source-id: c1e4ab297da3683f6d5751be8790602f9dc37b6b
2021-03-23 15:10:06 -07:00
bfd009836e [torch.special] Add special.erf{c, inv} (#53260)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Also adds `overrides` entry for module and the newly added functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53260

Reviewed By: agolynski

Differential Revision: D27114342

Pulled By: mruberry

fbshipit-source-id: b1dd88f373db251bb71df12d33b160382138f63f
2021-03-18 19:06:25 -07:00
4626886f21 [JIT] Add CUDNN Conv-Add-Relu fusion for Frozen Model Optimization (#52102)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52102

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D26646100

fbshipit-source-id: 7f7a82cc0b42c958b9e0c854b3b5dc6ea7cfff6c
2021-03-18 15:18:52 -07:00
382a47b493 Add torch.linalg.vector_norm function (#51099)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51099

Reviewed By: agolynski

Differential Revision: D27147360

Pulled By: mruberry

fbshipit-source-id: 1056f840e7027ad81971c9d1a9f952ab9648f1b5
2021-03-18 06:41:39 -07:00
564456ac44 Added autograd support for torch.orgqr (#52637)
Summary:
This PR adds autograd support for `torch.orgqr`.

Since `torch.orgqr` is one of few functions that expose LAPACK's naming and all other linear algebra routines were renamed a long time ago, I also added a new function with a new name and `torch.orgqr` now is an alias for it.

The new proposed name is `householder_product`. For a matrix `input` and a vector `tau` LAPACK's orgqr operation takes columns of `input` (called Householder vectors or elementary reflectors) scalars of `tau` that together represent Householder matrices and then the product of these matrices is computed. See https://www.netlib.org/lapack/lug/node128.html.
Other linear algebra libraries that I'm aware of do not expose this LAPACK function, so there is some freedom in naming it. It is usually used internally only for QR decomposition, but can be useful for deep learning tasks now when it supports differentiation.

Resolves https://github.com/pytorch/pytorch/issues/50104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52637

Reviewed By: agolynski

Differential Revision: D27114246

Pulled By: mruberry

fbshipit-source-id: 9ab51efe52aec7c137aa018c7bd486297e4111ce
2021-03-18 05:42:18 -07:00
da10ccd35f Implements cpu_kernel_multiple_outputs and torch.frexp (#51097)
Summary:
Close https://github.com/pytorch/pytorch/issues/51108
Related https://github.com/pytorch/pytorch/issues/38349

This PR implements the `cpu_kernel_multiple_outputs` to support returning multiple values in a CPU kernel.
```c++
auto iter = at::TensorIteratorConfig()
  .add_output(out1)
  .add_output(out2)
  .add_input(in1)
  .add_input(in2)
  .build();

at::native::cpu_kernel_multiple_outputs(iter,
  [=](float a, float b) -> std::tuple<float, float> {
    float add = a + b;
    float mul = a * b;
    return std::tuple<float, float>(add, mul);
  }
);
```

The `out1` will equal to `torch.add(in1, in2)`, while the result of `out2` will be `torch.mul(in1, in2)`.
It helps developers implement new torch functions that return two tensors more conveniently, such as NumPy-like functions [divmod](https://numpy.org/doc/1.18/reference/generated/numpy.divmod.html?highlight=divmod#numpy.divmod) and [frexp](https://numpy.org/doc/stable/reference/generated/numpy.frexp.html#numpy.frexp).

This PR adds `torch.frexp` function to exercise the new functionality provided by `cpu_kernel_multiple_outputs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51097

Reviewed By: albanD

Differential Revision: D26982619

Pulled By: heitorschueroff

fbshipit-source-id: cb61c7f2c79873ab72ab5a61cbdb9203531ad469
2021-03-15 10:44:32 -07:00
afa1ff8e04 Implements torch.linalg.lstsq (#49093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44378 by providing a wider range of drivers similar to what SciPy is doing.

The supported CPU drivers are `gels, gelsy, gelsd, gelss`.
The CUDA interface has only `gels` implemented but only for overdetermined systems.

The current state of this PR:
- [x] CPU interface
- [x] CUDA interface
- [x] CPU tests
- [x] CUDA tests
- [x] Memory-efficient batch-wise iteration with broadcasting which fixes https://github.com/pytorch/pytorch/issues/49252
- [x] docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49093

Reviewed By: albanD

Differential Revision: D26991788

Pulled By: mruberry

fbshipit-source-id: 8af9ada979240b255402f55210c0af1cba6a0a3c
2021-03-12 13:25:55 -08:00
758fb94fcb Prefix assert_async with underscore, fix some bugs in assert_async CUDA testing (#53276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276

- One of the tests had a syntax error (but the test
  wasn't fine grained enough to catch this; any error
  was a pass)
- Doesn't work on ROCm

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D26820048

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: ezyang

fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45
2021-03-05 17:36:01 -08:00
cfd9360d09 Revert D26837780: Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26837780

Original commit changeset: 21567cab5c0f

fbshipit-source-id: 8ea735e5fdc97e32ae3fafd40297a1b8a7cd34b0
2021-03-04 20:45:35 -08:00
1accffe450 Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26819810

Original commit changeset: e528260e1aa9

fbshipit-source-id: 21567cab5c0ff5f5e60a699d4d4678773a567c30
2021-03-04 18:48:56 -08:00
9e5e5a7d96 Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26815021

Original commit changeset: 972eaafcdf14

fbshipit-source-id: e528260e1aa91df1873c73af00aa57addd671607
2021-03-04 09:28:25 -08:00
b864457743 Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26744062 (12d63cc2f5)

Original commit changeset: be6d2653afe5

fbshipit-source-id: 972eaafcdf14d96abdec3dea6bcbd5cac1f3d759
2021-03-04 04:11:25 -08:00
12d63cc2f5 Add assert_async (#53086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53086

Fixes #36853

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26744062

Pulled By: ezyang

fbshipit-source-id: be6d2653afe584adf67a05b5d43185b40764650d
2021-03-03 16:18:07 -08:00
d30f4d1dfd Migrate apex.parallel.SyncBatchNorm channels_last to pytorch (#46906)
Summary:
per title

This PR did
- Migrate `apex.parallel.SyncBatchNorm` channels_last to pytorch `torch.nn.SyncBatchNorm`
- Fix a TODO here by fusing `sum`, `div` kernels into backward elementwise kernel
b167402e2e/torch/nn/modules/_functions.py (L76-L95)

Todo
- [x] Discuss a regression introduced in https://github.com/pytorch/pytorch/pull/37133#discussion_r512530389, which is the synchronized copy here
b167402e2e/torch/nn/modules/_functions.py (L32-L34)

**Comment**: This PR uses apex version for the size check. Test passed and I haven't seen anything wrong so far.

- [x] The restriction to use channels_last kernel will be like this
```
inline bool batch_norm_use_channels_last_kernels(const at::Tensor& self) {
  return self.is_contiguous(at::MemoryFormat::ChannelsLast) || self.ndimension() == 2;
}
```
I think we can relax that for channels_last_3d as well?

**Comment**: we don't have benchmark for this now, will check this and add functionality later when needed.
- [x] Add test
- [x] Add benchmark

Detailed benchmark is at https://github.com/xwang233/code-snippet/tree/master/syncbn-channels-last

Close https://github.com/pytorch/pytorch/issues/50781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46906

Reviewed By: albanD

Differential Revision: D26771437

Pulled By: malfet

fbshipit-source-id: d00387044e9d43ac7e6c0e32a2db22c63d1504de
2021-03-03 15:29:45 -08:00
9c2673df46 Revert D26723384: [pytorch][PR] Implements torch.linalg.lstsq
Test Plan: revert-hammer

Differential Revision:
D26723384 (3ac9013235)

Original commit changeset: c9866a95f140

fbshipit-source-id: 3e5263d71facdc91ca09d7dcbbbe3ba818ee2821
2021-03-03 15:24:25 -08:00
0f81a69a96 Make meta a device (getting rid of empty_meta) (#53143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53143

Meta is now an honest to goodness device type, like cpu, so you can use
device='meta' to trigger allocation of meta tensors.  This way better
than empty_meta since we now have working API for most factory functions
(they don't necessarily work yet, though, because need to register Meta
versions of those functions.)

Some subtleties:
- I decided to drop the concept of CPU versus CUDA meta tensors; meta
  tensors are device agnostic.  It's hard to say exactly what the
  correct level of abstraction here is, but in this particular case
  implementation considerations trump semantic considerations: it
  is way easier to have just a meta device, than to have a meta device
  AND a cpu device AND a cuda device.  This may limit the applicability
  of meta tensors for tracing models that do explicit cpu()/cuda()
  conversions (unless, perhaps, we make those operations no-ops on meta
  tensors).
- I noticed that the DeviceType uppercase strings are kind of weird.
  Are they really supposed to be all caps?  That's weird.
- I moved the Meta dispatch key to live with the rest of the "device"
  dispatch keys.
- I intentionally did NOT add a Backend for Meta.  For now, I'm going to
  hope meta tensors never exercise any of the Backend conversion code;
  even if it does, better to fix the code to just stop converting to and
  from Backend.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D26763552

Pulled By: ezyang

fbshipit-source-id: 14633b6ca738e60b921db66a763155d01795480d
2021-03-03 11:24:13 -08:00
3ac9013235 Implements torch.linalg.lstsq (#49093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44378 by providing a wider range of drivers similar to what SciPy is doing.

The supported CPU drivers are `gels, gelsy, gelsd, gelss`.
The CUDA interface has only `gels` implemented but only for overdetermined systems.

The current state of this PR:
- [x] CPU interface
- [x] CUDA interface
- [x] CPU tests
- [x] CUDA tests
- [x] Memory-efficient batch-wise iteration with broadcasting which fixes https://github.com/pytorch/pytorch/issues/49252
- [x] docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49093

Reviewed By: H-Huang

Differential Revision: D26723384

Pulled By: mruberry

fbshipit-source-id: c9866a95f14091955cf42de22f4ac9e2da009713
2021-03-02 19:00:07 -08:00
e86476f736 Huber loss (#50553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48595.

## Background

This PR implements HuberLoss, which differs from SmoothL1Loss by a factor of beta. The current implementation does not share logic between the two. Feedback is welcome for the optimal way to minimize code duplication while remaining performant.

I've done some early [benchmarking](https://pytorch.org/tutorials/recipes/recipes/benchmark.html#collecting-instruction-counts-with-callgrind) with Huber calling in to the Smooth L1 kernel and scaling afterwards; for the simple test case I used, instruction counts are as follows:
```
Huber loss calls dedicated Huber kernel: 2,795,300
Huber loss calls Smooth L1 kernel and scales afterwards: 4,523,612
```
With these numbers, instruction counts are ~62% higher when using the pre-existing Smooth L1 kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50553

Test Plan:
```
python test/test_nn.py TestNN.test_HuberLoss
python test/test_nn.py TestNN.test_HuberLoss_delta
python test/test_nn.py TestNN.test_huber_loss_invalid_delta
python test/test_nn.py TestNNDeviceTypeCPU.test_smooth_l1_loss_vs_huber_loss_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_smooth_l1_loss_vs_huber_loss_cuda
python test/test_nn.py TestNNDeviceTypeCPU.test_invalid_reduction_strings_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_invalid_reduction_strings_cuda
python test/test_nn.py TestNN.test_loss_equal_input_target_shape
python test/test_nn.py TestNN.test_pointwise_loss_broadcast
python test/test_overrides.py
python test/test_jit.py TestJitGeneratedFunctional.test_nn_huber_loss
python test/test_type_hints.py
python test/test_cpp_api_parity.py
build/bin/test_api
```

## Documentation
<img width="677" alt="Screen Shot 2021-01-14 at 4 25 08 PM" src="https://user-images.githubusercontent.com/75754324/104651224-5a445980-5685-11eb-884b-14ea517958c2.png">
<img width="677" alt="Screen Shot 2021-01-14 at 4 24 35 PM" src="https://user-images.githubusercontent.com/75754324/104651190-4e589780-5685-11eb-974d-8c63a89c050e.png">
<img width="661" alt="Screen Shot 2021-01-14 at 4 24 45 PM" src="https://user-images.githubusercontent.com/75754324/104651198-50225b00-5685-11eb-958e-136b36f6f8a8.png">
<img width="869" alt="Screen Shot 2021-01-14 at 4 25 27 PM" src="https://user-images.githubusercontent.com/75754324/104651208-53b5e200-5685-11eb-9fe4-5ff433aa13c5.png">
<img width="862" alt="Screen Shot 2021-01-14 at 4 25 48 PM" src="https://user-images.githubusercontent.com/75754324/104651209-53b5e200-5685-11eb-8051-b0cfddcb07d3.png">

Reviewed By: H-Huang

Differential Revision: D26734071

Pulled By: jbschlosser

fbshipit-source-id: c98c1b5f32a16f7a2a4e04bdce678080eceed5d5
2021-03-02 17:30:45 -08:00
92a4ee1cf6 Revert D26375734: Implemented torch.linalg.multi_dot
Test Plan: revert-hammer

Differential Revision:
D26375734 (0396f492b9)

Original commit changeset: 839642692424

fbshipit-source-id: cb64db646010128d802e1930d5e9526c1f7aa6a2
2021-02-25 00:43:57 -08:00
30cb6ac53c Introduce mlc device (ML Compute device) to PyTorch's device list (#50634)
Summary:
Apple recently announced ML Compute, a new framework available in macOS Big Sur, which enables users to accelerate the training of neural networks on Mac hardware. This PR is the first on a series of PRs that will enable the integration with ML Compute. Most of the integration code will live on a separate subrepo named `mlc`.
The integration with `mlc` (ML Compute) will be very similar to that of xla. We rely on registering our ops through:

TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
 m.impl_UNBOXED(<op_schema_name>, &customized_op_kernel)
 ...
}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50634

Reviewed By: malfet

Differential Revision: D26614213

Pulled By: smessmer

fbshipit-source-id: 3b492b346c61cc3950ac880ac01a82fbdddbc07b
2021-02-24 22:39:11 -08:00
0396f492b9 Implemented torch.linalg.multi_dot (#51807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51807

Implemented torch.linalg.multi_dot similar to [numpy.linalg.multi_dot](https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html).

This function does not support broadcasting or batched inputs at the moment.

**NOTE**
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite their docs stating these must be either 1D or 2D. This PR diverges from NumPy in that it enforces this restriction.

**TODO**
- [ ] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26375734

Pulled By: heitorschueroff

fbshipit-source-id: 839642692424c4b1783606c76dd5b34455368f0b
2021-02-24 15:32:30 -08:00
964d47dfb9 Add torch.linalg to generated annotated_args for test_overrides (#52464)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52464

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26618696

Pulled By: heitorschueroff

fbshipit-source-id: 9889fcaafcb307319b4526ee86355389653a6b61
2021-02-24 15:30:32 -08:00
9699c703c2 Stable sort for the CPU take 2. (#51790)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38681.
A duplicate of https://github.com/pytorch/pytorch/pull/50052 created to become importable to the fb internal tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51790

Reviewed By: agolynski

Differential Revision: D26279045

Pulled By: glaringlee

fbshipit-source-id: 348e171dee9c370a76002b65d0c82c329f57a421
2021-02-19 09:28:57 -08:00
33afb5f19f fake_quant cachemask: remove Python bindings (#51878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51878

`fake_quantize_per_tensor_affine_cachemask` and
`fake_quantize_per_channel_affine_cachemask` are implementation
details of `fake_quantize_per_tensor_affine` and
`fake_quantize_per_channel_affine`, removing the
Python bindings for them since there is no need to
expose them.

Test Plan:
```
python test/test_quantization.py TestFakeQuantize
```

Imported from OSS

Reviewed By: albanD, bugra

Differential Revision: D26314173

fbshipit-source-id: 733c93a3951453e739b6ed46b72fbad2244f6e97
2021-02-09 23:27:53 -08:00
b97a040f71 ENH: toggle TORCH_WARN_ONCE to TORCH_WARN for tests (#48560)
Summary:
Toward fixing https://github.com/pytorch/pytorch/issues/47624

~Step 1: add `TORCH_WARN_MAYBE` which can either warn once or every time in c++, and add a c++ function to toggle the value.
Step 2 will be to expose this to python for tests. Should I continue in this PR or should we take a different approach: add the python level exposure without changing any c++ code and then over a series of PRs change each call site to use the new macro and change the tests to make sure it is being checked?~

Step 1: add a python and c++ toggle to convert TORCH_WARN_ONCE into TORCH_WARN so the warnings can be caught in tests
Step 2: add a python-level decorator to use this toggle in tests
Step 3: (in future PRs): use the decorator to catch the warnings instead of `maybeWarnsRegex`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48560

Reviewed By: ngimel

Differential Revision: D26171175

Pulled By: mruberry

fbshipit-source-id: d83c18f131d282474a24c50f70a6eee82687158f
2021-02-08 08:21:19 -08:00