Commit Graph

397 Commits

Author SHA1 Message Date
1bac5fd0d3 add hardsigmoid FP operator to PyTorch (#34545)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34545

This is for common operator coverage, since this is widely used.  A future PR
will add the quantized version.

Some initial questions for reviewers, since it's my first FP operator
diff:
* do we need a backwards.out method for this?
* do we need CUDA? If yes, should it be in this PR, or is it OK to split it out into a follow-up?
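
For reference, a minimal numerical sketch, assuming the standard hardsigmoid definition clamp(x/6 + 0.5, 0, 1), i.e. relu6(x + 3) / 6 (illustrative only, not the kernel added here):
```
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 9)
ref = torch.clamp(x / 6 + 0.5, 0, 1)          # relu6(x + 3) / 6
print(torch.allclose(F.hardsigmoid(x), ref))  # True
```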

Test Plan:
```
// test
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardsigmoid_cpu_float32

// benchmark
python -m pt.hardsigmoid_test
...
Forward Execution Time (us) : 40.315

Forward Execution Time (us) : 42.603
```

Imported from OSS

Differential Revision: D20371692

fbshipit-source-id: 95668400da9577fd1002ce3f76b9777c6f96c327
2020-03-16 15:24:12 -07:00
44256199a9 [JIT] remove specialized list ops (#34520)
Summary:
Now that lists are no longer specialized, we can register only one operator for list ops that are generic to their element type.
This PR reorgs lists into three sets of ops:
- CREATE_GENERIC_LIST_OPS
- CREATE_SPECIALIZED_LIST_OPS
- CREATE_COMPARATOR_LIST_OPS_SPECIALIZED (we didn't bind certain specialized ops to Tensor)

This is important to land quickly because mobile is finalizing its bytecode soon, after which we would not be able to remove these ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34520

Reviewed By: iseeyuan

Differential Revision: D20429775

Pulled By: eellison

fbshipit-source-id: ae6519f9b0f731eaa2bf4ac20736317d0a66b8a0
2020-03-12 17:49:23 -07:00
514cba0661 [JIT] remove builtin interpolate functions (#34514)
Summary:
`torch.nn.functional.interpolate` was written as a builtin op when we scripted the standard library, because it has four possible overloads. As a result, whenever we make a change to `interpolate`, we need to make changes in two places, and it also makes it impossible to optimize the interpolate op. The builtin is tech debt.

I talked with ailzhang, and the symbolic script changes are good to remove (I guess that makes a third place where we needed to re-implement interpolate).

I'm trying to get rid of unnecessary builtin operators because we're standardizing mobile bytecode soon, so we should try to get this landed as soon as possible.
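
For illustration, a minimal sketch of scripting a call to `interpolate` once it goes through the regular Python definition (the shapes and mode here are assumptions, not taken from this PR):
```
import torch
import torch.nn.functional as F

@torch.jit.script
def upsample2x(x: torch.Tensor) -> torch.Tensor:
    # one of the four interpolate overloads: float scale_factor, nearest mode
    return F.interpolate(x, scale_factor=2.0, mode='nearest')

print(upsample2x(torch.randn(1, 3, 8, 8)).shape)  # torch.Size([1, 3, 16, 16])
```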
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34514

Differential Revision: D20391089

Pulled By: eellison

fbshipit-source-id: abc84cdecfac67332bcba6b308fca4db44303121
2020-03-12 09:21:33 -07:00
fddf73250d [JIT] fix resolving of functions in torch/functional. fix compilation of torch.stft (#33504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33504

Fix resolution of functions that are bound onto torch in torch/functional.py. This does not fix compilation of all of those functions; those will be done in follow-ups. torch.stft is handled as a start.

Fixes #21478
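
As a sketch of the kind of use this enables (the `return_complex` argument is a later addition passed here as an assumption for current builds; it did not exist at the time of this PR):
```
import torch

@torch.jit.script
def spec(x: torch.Tensor) -> torch.Tensor:
    # torch.stft now resolves through torch/functional.py when scripted
    return torch.stft(x, n_fft=256, hop_length=128, return_complex=True)

print(spec(torch.randn(1, 4096)).shape)  # (1, n_fft // 2 + 1, n_frames)
```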

Test Plan: Imported from OSS

Differential Revision: D20014591

Pulled By: eellison

fbshipit-source-id: bb362f1b5479adbb890e72a54111ef716679d127
2020-02-26 18:35:43 -08:00
a9cef05f5d improve EmbeddingBag performance on cuda (#33589)
Summary:
This PR improves performance of EmbeddingBag on cuda by removing 5 kernel launches (2 of those are synchronizing memcopies).
- 2 memcopies check that the values of offsets[0] and offsets[-1] are in the expected range (0 for the former, less than the number of indices for the latter). It seems strange to check only those two values: if users provide invalid offsets, invalid values can be anywhere in the array, not only in the first and last element. After this PR, the checks are skipped on CUDA; the first value is forced to 0, and if the last value is larger than expected, the CUDA kernel will assert. That is less nice than a ValueError, but then again, the kernel could already have asserted if other offset values were invalid. On the CPU, the checks are moved from functional.py into the CPU implementation and now throw RuntimeError instead of ValueError. (A small offsets sketch follows this list.)
- 3 or 4 initializations of the output tensors with .zeros() (depending on the mode) are unnecessary, because every element of those tensors is written to, so their data can be left uninitialized at the start.
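
A small offsets sketch for reference (toy sizes, assumed values): `offsets[0]` must be 0, and bag `i` covers `indices[offsets[i]:offsets[i+1]]`.
```
import torch
import torch.nn as nn

eb = nn.EmbeddingBag(10, 4, mode='sum')
indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])   # two bags: indices[0:4] and indices[4:]
out = eb(indices, offsets)
print(out.shape)                 # torch.Size([2, 4])
```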
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33589

Reviewed By: jianyuh

Differential Revision: D20078011

Pulled By: ngimel

fbshipit-source-id: 2fb2e2080313af64adc5cf1b9fc6ffbdc6efaf16
2020-02-24 21:37:34 -08:00
fa80299bdf __torch_function__ overrides for torch.functional and torch.nn.functional (#32799)
Summary:
This adds `__torch_function__` support for all functions in `torch.functional` and `torch.nn.functional`.

The changes to C++ code and codegen scripts are to facilitate adding `__torch_function__` support for the native functions in `torch._C._nn`. Note that I moved the `handle_torch_function` C++ function to a header that both `python_torch_functions.cpp` and `python_nn_functions.cpp` include. The changes to `python_nn_functions.cpp` mirror the changes I made to `python_torch_functions.cpp` when `__torch_function__` support was first added in https://github.com/pytorch/pytorch/issues/27064. Due to the somewhat different way the `torch._C` and `torch._C._nn` namespaces are initialized I needed to create a new static reference to the `torch._C._nn` namespace (`THPNNVariableFunctions`). I'm not sure if that is the best way to do this. In principle I could import these namespaces in each kernel and avoid the global variable but that would have a runtime cost.

I added `__torch_function__` support to the Python functions in `torch.nn.functional` following the approach in https://github.com/pytorch/pytorch/issues/32194.

I re-enabled the test that checks if all functions in the `torch` namespace are explicitly tested for `__torch_function__` support. I also generalized the check to work for `torch.functional` and `torch.nn.functional` as well. This test was explicitly disabled in https://github.com/pytorch/pytorch/issues/30730 and I'm happy to disable it again if you think that's appropriate. I figured now was as good a time as any to try to re-enable it.

Finally I adjusted the existing torch API tests to suppress deprecation warnings and add keyword arguments used by some of the code in `torch.nn.functional` that were missed when I originally added the tests in https://github.com/pytorch/pytorch/issues/27064.
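
For reference, a minimal sketch of the protocol as exposed to user types (the class and the single handled function are illustrative assumptions, not part of this PR):
```
import torch
import torch.nn.functional as F

class ScalarLike:
    """Toy duck-typed value that participates in __torch_function__ dispatch."""
    def __init__(self, value):
        self.value = value

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is F.relu:                    # handle one overridden function
            (x,) = args
            return ScalarLike(max(x.value, 0.0))
        return NotImplemented                 # defer everything else

print(F.relu(ScalarLike(-2.0)).value)  # 0.0
```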
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32799

Differential Revision: D19956809

Pulled By: ezyang

fbshipit-source-id: 40d34e0109cc4b9f3ef62f409d2d35a1d84e3d22
2020-02-21 08:38:37 -08:00
cfb4862673 [pytorch] correct input size check for GroupNorm (#33008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33008

Corrects D19373507 to allow valid use cases that fail now. Multiplies batch size by the number of elements in a group to get the correct number of elements over which statistics are computed.

**Details**:
The current implementation disallows applying GroupNorm to tensors of shape e.g. `(1, C, 1, 1)`, to prevent cases where statistics are computed over a single element and thus produce a tensor filled with zeros.
However, in GroupNorm the statistics are calculated across channels. So in the case of an input tensor of shape `(1, 256, 1, 1)` with `GroupNorm(32, 256)`, the statistics are computed over 8 elements and are thus meaningful.

One use case is [Atrous Spatial Pyramid Pooling (ASPPPooling)](791c172a33/torchvision/models/segmentation/deeplabv3.py (L50)), where GroupNorm could be used in place of BatchNorm [here](791c172a33/torchvision/models/segmentation/deeplabv3.py (L55)). However, this is currently prohibited and results in failures.

The proposed solution corrects the computation of the number of elements over which statistics are computed: the number of elements per group is folded into the batch size.
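
A small sketch of the case this enables (sizes taken from the example above):
```
import torch
import torch.nn as nn

# (1, 256, 1, 1) input with 32 groups: statistics per group are computed over
# batch * channels_per_group * spatial = 1 * 8 * 1 = 8 elements
gn = nn.GroupNorm(32, 256)
x = torch.randn(1, 256, 1, 1)
print(gn(x).shape)  # torch.Size([1, 256, 1, 1])
```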

Test Plan: check that existing tests pass

Reviewed By: fmassa

Differential Revision: D19723407

fbshipit-source-id: c85c244c832e6592e9aedb279d0acc867eef8f0c
2020-02-18 06:43:53 -08:00
4502d8c391 Interpolate Float [] support in ONNX (#32554)
Summary:
The PR https://github.com/pytorch/pytorch/pull/31791 adds support for float[] constants, which affects some cases of ONNX interpolate support.
This PR adds float[] constant support in ONNX, updates interpolate in ONNX, and re-enables the disabled tests.
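
A minimal export sketch exercising a float[] scale factor (the module, sizes, and opset are assumptions for illustration):
```
import io
import torch
import torch.nn.functional as F

class Up(torch.nn.Module):
    def forward(self, x):
        # float[] scale factors, one per spatial dimension
        return F.interpolate(x, scale_factor=[2.0, 3.0], mode='nearest')

buf = io.BytesIO()
torch.onnx.export(Up(), torch.randn(1, 3, 8, 8), buf, opset_version=11)
```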
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32554

Reviewed By: hl475

Differential Revision: D19566596

Pulled By: houseroad

fbshipit-source-id: 843f62c86126fdf4f9c0117b65965682a776e7e9
2020-02-04 16:14:40 -08:00
3cac9900ca Clarify when softplus is reverted to linear. (#32945)
Summary:
The default value is removed because it is explained right below.
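
For context, a small sketch of the behavior being clarified: once `input * beta` exceeds `threshold`, softplus returns the input directly for numerical stability (beta and threshold below are the defaults, spelled out as an assumption):
```
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 25.0])
# beta * x > threshold for the second element, so softplus reverts to linear
print(F.softplus(x, beta=1.0, threshold=20.0))  # tensor([ 1.3133, 25.0000])
```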
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32945

Reviewed By: soumith

Differential Revision: D19706567

Pulled By: ailzhang

fbshipit-source-id: 1b7cc87991532f69b81aaae2451d944f70dda427
2020-02-03 17:54:31 -08:00
602394e996 verify input sizes for instance norm and group norm (#29082)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/19250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29082

Differential Revision: D19373507

Pulled By: ezyang

fbshipit-source-id: 231a79280f4cd7db2c26218a60869356a124bf77
2020-01-27 09:05:56 -08:00
3ada2e0d64 [pytorch][embeddingbag] Parallelize the EmbeddingBag operator (#4049)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477

We would like to add the intra-op parallelization support for the EmbeddingBag operator.

This should bring speedup for the DLRM benchmark:
https://github.com/pytorch/pytorch/pull/24385

Benchmark code:
```
from __future__ import absolute_import, division, print_function, unicode_literals

import torch
import time

eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum')

input = torch.LongTensor(1500).random_(0, 1000000)
offsets = torch.zeros(64, dtype=torch.int64)

niter = 10000
s = time.time()
for _ in range(niter):
    out = eb(input, offsets)
time_per_iter = (time.time() - s) / niter
print('time_per_iter', time_per_iter)
print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9)
```

The following results are single core on Skylake T6:
- Before our change (with the original caffe2::EmbeddingLookup)
time_per_iter 6.313693523406982e-05
GB/s 6.341517821789133

- After our change using the EmbeddingLookupIdx API which takes the offsets instead of lengths.
time_per_iter 5.7627105712890626e-05
GB/s 6.947841559053659

- With Intel's PR: https://github.com/pytorch/pytorch/pull/24385
time_per_iter 7.393271923065185e-05
GB/s 5.415518381664018

As for multi-core performance: because Clang doesn't work with OpenMP in this setup, I can only measure the single-core performance on SKL T6.
ghstack-source-id: 97124557

Test Plan:
With D16990830:
```
buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench
```

With D17750961:
```
buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb
buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb
```

OSS test
```
python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
```

Buck test
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"

OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets"  --print-passing-details
```

Generate the AVX2 code for embedding_lookup_idx_avx2.cc:
```
python hp_emblookup_codegen.py --use-offsets
```

Differential Revision: D17768404

fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700
2020-01-23 21:29:44 -08:00
db02a4e4ce Support 3D attention mask in MultiheadAttention. (#31996)
Summary:
Support a 3D attention mask for MultiheadAttention. If `attn_mask` has the batch dimension, it will not be unsqueezed (see the shape sketch below). Fixes https://github.com/pytorch/pytorch/issues/30678
Relevant issues/PRs:
https://github.com/pytorch/pytorch/pull/25359
https://github.com/pytorch/pytorch/issues/29520
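
A shape sketch, assuming the 3D mask layout `(N * num_heads, L, S)` described in the docs (all sizes are illustrative):
```
import torch
import torch.nn as nn

embed_dim, num_heads, L, S, N = 16, 2, 5, 5, 3
mha = nn.MultiheadAttention(embed_dim, num_heads)
q = k = v = torch.randn(L, N, embed_dim)
# a 3D additive mask already has the batch dimension and is used as-is
attn_mask = torch.zeros(N * num_heads, L, S)
out, _ = mha(q, k, v, attn_mask=attn_mask)
print(out.shape)  # torch.Size([5, 3, 16])
```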
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31996

Differential Revision: D19332816

Pulled By: zhangguanheng66

fbshipit-source-id: 3448af4b219607af60e02655affe59997ad212d9
2020-01-23 13:16:48 -08:00
cc2d5b15ad F.normalize uses clamp_min_ inplace (#32360)
Summary:
We don't care about autograd when `out != None` anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32360

Differential Revision: D19452402

Pulled By: colesbury

fbshipit-source-id: c54775289f8a700019ca61e951d59ff4894ac980
2020-01-21 10:38:06 -08:00
77c78b7d28 remove .data from torch/nn doc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31481

Test Plan: Imported from OSS

Differential Revision: D19303242

Pulled By: albanD

fbshipit-source-id: 4f650df9e9e302a299175967bcc6e30a5099fa2a
2020-01-14 07:30:42 -08:00
c4f10e0fe7 Renaming scales parameter for interpolate (#31526)
Summary:
PR separated from https://github.com/pytorch/pytorch/pull/31274.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31526

Reviewed By: zou3519

Differential Revision: D19221931

Pulled By: gchanan

fbshipit-source-id: 81958a9910867ac9d62f2b47abc49384526c4e51
2020-01-02 08:19:30 -08:00
97c1e90f46 ONNX Interpolate Add Scales Params (#28324)
Summary:
Fix for : https://github.com/pytorch/pytorch/issues/27176
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28324

Reviewed By: hl475

Differential Revision: D18309133

Pulled By: houseroad

fbshipit-source-id: 348bb41393442c6b107d88fc2cd3224e0afa3ccf
2019-12-11 20:09:15 -08:00
d6ca93b353 add doc for F.softplus
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30055

Differential Revision: D18762624

Pulled By: zou3519

fbshipit-source-id: 61da88cbb8cd0f37ac26b0fb8aaacdbe85c724ba
2019-12-04 07:16:30 -08:00
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
7903fb118f Move qkv_same, kv_same into branch (#30142)
Summary:
Perf improvements to multi_head_attention_forward

- qkv_same and kv_same were not used outside of that branch. Further, kv_same was computed even though it is not used when qkv_same is true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30142

Differential Revision: D18610938

Pulled By: cpuhrsch

fbshipit-source-id: 19b7456f20aef90032b0f42d7da8c8a2d5563ee3
2019-11-22 10:40:02 -08:00
a4f60b64dc explicitly provide memory format when calling to *_like operators
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29391
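
For illustration, a minimal sketch of passing the memory format explicitly to a `*_like` call (an illustrative call, not one of the call sites changed in this PR):
```
import torch

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
# spelling out the format keeps the layout choice explicit at the call site
y = torch.empty_like(x, memory_format=torch.preserve_format)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```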

Test Plan: Imported from OSS

Differential Revision: D18429726

Pulled By: VitalyFedyunin

fbshipit-source-id: 07dfff568ad776cf792122913530566d53be55fa
2019-11-18 21:47:52 -08:00
c7ed89cf65 Migrate nll_loss2d from TH to ATen (CPU) (#28304)
Summary:
Added a check for indices in the Reduction::None case.

### Benchmark results

Note: due to the size of the input tensors, the random number generation is responsible for a significant portion of the total time here. It is better to look at the individual per-size timings (which do not include the input preparation).
Script used for the benchmark: [nnl_loss2d_benchmark.py](https://gist.github.com/andreaskoepf/5864aa91e243317cb282c1e7fe576e1b)
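
For reference, a small usage sketch of the spatial NLL loss path (shapes are assumptions for illustration):
```
import torch
import torch.nn.functional as F

N, C, H, W = 2, 3, 4, 4
logp = F.log_softmax(torch.randn(N, C, H, W), dim=1)
target = torch.randint(0, C, (N, H, W))
# 4D input dispatches to nll_loss2d; reduction='none' returns per-pixel losses
loss = F.nll_loss(logp, target, reduction='none')
print(loss.shape)  # torch.Size([2, 4, 4])
```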

#### WITH PR applied
```
using reduction:  none
CPU forward 1000 took 7.916500908322632e-05
CPU forward 10000 took 0.0002642290201038122
CPU forward 100000 took 0.003828087996225804
CPU forward 1000000 took 0.037140720000024885
CPU forward 10000000 took 0.33387596398824826
CPU forward TOTAL time 7.218988707987592

using reduction:  mean
CPU forward 1000 took 9.165197843685746e-05
CPU forward 10000 took 0.0005258890159893781
CPU forward 100000 took 0.0050761590246111155
CPU forward 1000000 took 0.047345594997750595
CPU forward 10000000 took 0.4790863030066248
CPU forward TOTAL time 7.9106070210109465
CPU for- & backward 1000 took 0.0005489500181283802
CPU for- & backward 10000 took 0.0015284279943443835
CPU for- & backward 100000 took 0.015138130984269083
CPU for- & backward 1000000 took 0.15741890601930209
CPU for- & backward 10000000 took 1.6703072849777527
CPU for- & backward TOTAL time 9.555764263990568

using reduction:  sum
CPU forward 1000 took 8.789298590272665e-05
CPU forward 10000 took 0.000514078012201935
CPU forward 100000 took 0.005135576997417957
CPU forward 1000000 took 0.04715992201818153
CPU forward 10000000 took 0.4821214270195924
CPU forward TOTAL time 7.9119505700073205
CPU for- & backward 1000 took 0.00047759301378391683
CPU for- & backward 10000 took 0.0015945070190355182
CPU for- & backward 100000 took 0.018208994006272405
CPU for- & backward 1000000 took 0.15904426100314595
CPU for- & backward 10000000 took 1.5679037219961174
CPU for- & backward TOTAL time 9.495157692988869
```

#### WITHOUT PR (original TH impl)
```
using reduction:  none
CPU forward 1000 took 0.0003981560003012419
CPU forward 10000 took 0.0035912430030293763
CPU forward 100000 took 0.035353766987100244
CPU forward 1000000 took 0.3428319719969295
CPU forward 10000000 took 3.364342701010173
CPU forward TOTAL time 11.166179805004504

using reduction:  mean
CPU forward 1000 took 8.63690220285207e-05
CPU forward 10000 took 0.0004704220045823604
CPU forward 100000 took 0.0045734510058537126
CPU forward 1000000 took 0.046232511987909675
CPU forward 10000000 took 0.4191019559802953
CPU forward TOTAL time 7.846049971994944
CPU for- & backward 1000 took 0.0005974550149403512
CPU for- & backward 10000 took 0.0014057719963602722
CPU for- & backward 100000 took 0.013776941981632262
CPU for- & backward 1000000 took 0.13876214998890646
CPU for- & backward 10000000 took 1.3666698939923663
CPU for- & backward TOTAL time 9.10526105100871

using reduction:  sum
CPU forward 1000 took 7.598899537697434e-05
CPU forward 10000 took 0.00046885499614290893
CPU forward 100000 took 0.0044489419960882515
CPU forward 1000000 took 0.04495517900795676
CPU forward 10000000 took 0.418376043002354
CPU forward TOTAL time 7.789334400993539
CPU for- & backward 1000 took 0.0004464260127861053
CPU for- & backward 10000 took 0.0017732900159899145
CPU for- & backward 100000 took 0.01626713399309665
CPU for- & backward 1000000 took 0.11790941300569102
CPU for- & backward 10000000 took 1.4346664609911386
CPU for- & backward TOTAL time 9.294745502003934
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28304

Differential Revision: D18350157

Pulled By: ezyang

fbshipit-source-id: e9437debe51386a483f4265193c475cdc90b28e4
2019-11-09 18:31:20 -08:00
2460dced8f Add torch.nn.GELU for GELU activation (#28944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28944

Add torch.nn.GELU for GELU activation
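
For reference, a minimal check against the erf-based definition 0.5 * x * (1 + erf(x / sqrt(2))) (a sketch, not the kernel added here):
```
import torch
import torch.nn as nn

gelu = nn.GELU()
x = torch.randn(4)
ref = 0.5 * x * (1.0 + torch.erf(x / 2 ** 0.5))
print(torch.allclose(gelu(x), ref, atol=1e-6))  # True
```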

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GELU"

Reviewed By: hl475, houseroad

Differential Revision: D18240946

fbshipit-source-id: 6284b30def9bd4c12bf7fb2ed08b1b2f0310bb78
2019-11-03 21:55:05 -08:00
e9a91756cd Back out "[pytorch][PR] Migrate soft_margin_loss from the TH to Aten (CUDA+CPU)"
Summary: Original commit changeset: 9ddffe4dbbfa

Test Plan: ci

Reviewed By: yf225

Differential Revision: D17939581

fbshipit-source-id: 44a3b843bf1e7059fec57b9e3d12ed4886816145
2019-10-15 21:12:10 -07:00
2aa84d927b Revert D17939700: Revert D17889288: [pytorch][PR] Migrate soft_margin_loss from the TH to Aten (CUDA+CPU)
Test Plan: revert-hammer

Differential Revision:
D17939700

Original commit changeset: 4fc6156ba388

fbshipit-source-id: dded0a2140d2c14cd2f2a574987ecc164b0e5bfe
2019-10-15 15:24:36 -07:00
c44e33b578 Revert D17889288: [pytorch][PR] Migrate soft_margin_loss from the TH to Aten (CUDA+CPU)
Test Plan: revert-hammer

Differential Revision:
D17889288

Original commit changeset: 9ddffe4dbbfa

fbshipit-source-id: 4fc6156ba38834512b2f735ac0d03e34e69b7286
2019-10-15 14:35:28 -07:00
9033ace9c4 Migrate soft_margin_loss from the TH to Aten (CUDA+CPU) (#27673)
Summary:
Replaces fused TH kernels with a 2-liner of regular Tensor functions.
Benchmarking revealed that performance improves compared to PyTorch 1.2.

Refs: https://github.com/pytorch/pytorch/issues/24631, https://github.com/pytorch/pytorch/issues/24632, https://github.com/pytorch/pytorch/issues/24764, https://github.com/pytorch/pytorch/issues/24765
VitalyFedyunin
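
For reference, a sketch of the tensor-level formulation, assuming the documented definition log(1 + exp(-y * x)) averaged over elements (not necessarily the exact two lines used in the kernel):
```
import torch
import torch.nn.functional as F

x = torch.randn(8, requires_grad=True)
y = torch.randint(0, 2, (8,)).float() * 2 - 1     # targets in {-1, +1}
ref = torch.log1p(torch.exp(-y * x)).mean()
print(torch.allclose(F.soft_margin_loss(x, y), ref))  # True
```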

### Benchmarking results on my laptop:

## 1.4.0a0+f63c9e8 output
```
PyTorch version: 1.4.0a0+f63c9e8
CPU Operator sanity check:
tensor(0.5926, grad_fn=<MeanBackward0>)
tensor([-0.0159, -0.0170, -0.0011, -0.0083, -0.0140, -0.0217, -0.0290, -0.0262,
        -0.0078, -0.0129])
double backward
tensor(-0.1540, grad_fn=<SumBackward0>)
ok

GPU Operator sanity check:
tensor(0.5601, device='cuda:0', grad_fn=<MeanBackward0>)
tensor([-0.0393, -0.0316, -0.0233, -0.0140, -0.0141, -0.0161, -0.0322, -0.0238,
        -0.0054, -0.0151], device='cuda:0')
double backward
tensor(-0.2148, device='cuda:0', grad_fn=<SumBackward0>)
ok

CPU warmup 1000 took 9.025700273923576e-05
CPU warmup 10000 took 0.0009383050055475906
CPU warmup 100000 took 0.0015631120040779933
CPU warmup TOTAL time 0.0026368020044174045
CPU forward 1000 took 6.919399311300367e-05
CPU forward 10000 took 0.00014462800754699856
CPU forward 100000 took 0.0011234670091653243
CPU forward 1000000 took 0.014555767003912479
CPU forward 10000000 took 0.13409724000666756
CPU forward 100000000 took 1.246048310000333
CPU forward TOTAL time 1.3961777170043206
CPU for- & backward 1000 took 0.0003219560021534562
CPU for- & backward 10000 took 0.00037290599721018225
CPU for- & backward 100000 took 0.001975035003852099
CPU for- & backward 1000000 took 0.02621342398924753
CPU for- & backward 10000000 took 0.2944270490115741
CPU for- & backward 100000000 took 1.6856628700043075
CPU for- & backward TOTAL time 2.0091958299890393

GPU warmup 1000 took 0.0002462909906171262
GPU warmup 10000 took 9.991199476644397e-05
GPU warmup 100000 took 0.00034347400651313365
GPU warmup TOTAL time 0.0007382350013358518
GPU forward 1000 took 9.67290106927976e-05
GPU forward 10000 took 9.349700121674687e-05
GPU forward 100000 took 9.384499571751803e-05
GPU forward 1000000 took 0.0004975290066795424
GPU forward 10000000 took 0.0017606960027478635
GPU forward 100000000 took 0.003572814996005036
GPU forward TOTAL time 0.006185991995153017
GPU for- & backward 1000 took 0.00035818999458570033
GPU for- & backward 10000 took 0.0003240450023440644
GPU for- & backward 100000 took 0.0003223370003979653
GPU for- & backward 1000000 took 0.00036740700306836516
GPU for- & backward 10000000 took 0.0003690610028570518
GPU for- & backward 100000000 took 0.0003672500024549663
GPU for- & backward TOTAL time 0.002197896988946013
```

## 1.2 output
```
PyTorch version: 1.2.0
CPU Operator sanity check:
tensor(0.5926, grad_fn=<SoftMarginLossBackward>)
tensor([-0.0159, -0.0170, -0.0011, -0.0083, -0.0140, -0.0217, -0.0290, -0.0262,
        -0.0078, -0.0129])
double backward
tensor(-0.1540, grad_fn=<SumBackward0>)
ok

GPU Operator sanity check:
tensor(0.5601, device='cuda:0', grad_fn=<SoftMarginLossBackward>)
tensor([-0.0393, -0.0316, -0.0233, -0.0140, -0.0141, -0.0161, -0.0322, -0.0238,
        -0.0054, -0.0151], device='cuda:0')
double backward
tensor(-0.2148, device='cuda:0', grad_fn=<SumBackward0>)
ok

CPU warmup 1000 took 8.422900282312185e-05
CPU warmup 10000 took 0.00036992700188420713
CPU warmup 100000 took 0.003682684007799253
CPU warmup TOTAL time 0.004169487991021015
CPU forward 1000 took 5.521099956240505e-05
CPU forward 10000 took 0.00036948200431652367
CPU forward 100000 took 0.003762389998883009
CPU forward 1000000 took 0.03725024699815549
CPU forward 10000000 took 0.3614480490068672
CPU forward 100000000 took 3.6139175269927364
CPU forward TOTAL time 4.016912263003178
CPU for- & backward 1000 took 0.0002734809968387708
CPU for- & backward 10000 took 0.0006605249946005642
CPU for- & backward 100000 took 0.005437346000690013
CPU for- & backward 1000000 took 0.051245586000732146
CPU for- & backward 10000000 took 0.5291594529990107
CPU for- & backward 100000000 took 5.23841712900321
CPU for- & backward TOTAL time 5.8253340990049765

GPU warmup 1000 took 0.0005757809994975105
GPU warmup 10000 took 0.0004058420017827302
GPU warmup 100000 took 0.0003764610009966418
GPU warmup TOTAL time 0.0013992580061312765
GPU forward 1000 took 0.0003543390048434958
GPU forward 10000 took 0.0003633670130511746
GPU forward 100000 took 0.0004807310033356771
GPU forward 1000000 took 0.0005875999922864139
GPU forward 10000000 took 0.0016903509967960417
GPU forward 100000000 took 0.014400018990272656
GPU forward TOTAL time 0.0179396449966589
GPU for- & backward 1000 took 0.0006167769897729158
GPU for- & backward 10000 took 0.0006845899915788323
GPU for- & backward 100000 took 0.000631830989732407
GPU for- & backward 1000000 took 0.0010741150035755709
GPU for- & backward 10000000 took 0.0017265130009036511
GPU for- & backward 100000000 took 0.014847910992102697
GPU for- & backward TOTAL time 0.01965981800458394
```

### Code used for performance test
```
import torch
import torch.nn.functional as F
import torch.nn as nn

from timeit import default_timer

torch.manual_seed(0)
cpu = torch.device('cpu')
gpu = torch.device('cuda')

loss_fn = F.soft_margin_loss

def run_benchmark(name, depth, require_grad, device, fn):
    total_start = default_timer()
    for i in range(3, 3 + depth):
        start = default_timer()
        n = 10 ** i
        a = torch.rand(n, requires_grad=require_grad, device=device)
        b = torch.rand(n, device=device)
        fn(a, b)
        end = default_timer()
        print('{} {} took {}'.format(name, n, end-start))
    total_end = default_timer()
    print('{} TOTAL time {}'.format(name, total_end-total_start))

def fwd_only(a, b):
    out = loss_fn(a, b)

def fwd_bck(a, b):
    out = loss_fn(a, b)
    out.backward()

def sanity_check(name, device):
    print('{} Operator sanity check:'.format(name))
    a = torch.rand(10, requires_grad=True, device=device)
    b = torch.rand(10, device=device)
    out = loss_fn(a,b)
    print(out)
    out.backward()
    print(a.grad)
    print('double backward')
    loss = loss_fn(a, b)
    loss2 = torch.autograd.grad(loss, a, create_graph=True)
    z = loss2[0].sum()
    print(z)
    z.backward()
    print('ok')
    print()

print('PyTorch version:', torch.__version__)
sanity_check('CPU', cpu)
sanity_check('GPU', gpu)
print()

run_benchmark('CPU warmup', 3, False, cpu, fwd_only)
run_benchmark('CPU forward', 6, False, cpu, fwd_only)
run_benchmark('CPU for- & backward', 6, True, cpu, fwd_bck)
print()

run_benchmark('GPU warmup', 3, False, gpu, fwd_only)
run_benchmark('GPU forward', 6, False, gpu, fwd_only)
run_benchmark('GPU for- & backward', 6, True, gpu, fwd_bck)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27673

Differential Revision: D17889288

Pulled By: ezyang

fbshipit-source-id: 9ddffe4dbbfab6180847a8fec32443910f18f0a9
2019-10-15 08:44:57 -07:00
e5d6b75319 Bag of documentation fixes; fix more sphinx warnings (#27850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27850

Many of these are real problems in the documentation (i.e., link or
bullet point doesn't display correctly).

Test Plan: - built and viewed the documentation for each change locally.

Differential Revision: D17908123

Pulled By: zou3519

fbshipit-source-id: 65c92a352c89b90fb6b508c388b0874233a3817a
2019-10-15 07:31:14 -07:00
15f9fe1d92 Add missing Optional annotation.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27564

Differential Revision: D17816121

Pulled By: ailzhang

fbshipit-source-id: 5a4ac12ed81bf5d900ec3e7ab616082cb98d832d
2019-10-11 09:04:29 -07:00
eb93200321 Fix DDP incompatibility issue with nn.MultiheadAttention. (#26826)
Summary:
Fix issue https://github.com/pytorch/pytorch/issues/26698.

With different query/keys/value dimensions, `nn.MultiheadAttention` has DDP incompatibility issue because in that case `in_proj_weight` attribute is created but not used. Fix it and add a distributed unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26826

Differential Revision: D17583807

Pulled By: zhangguanheng66

fbshipit-source-id: c393584c331ed4f57ebaf2d4015ef04589c973f6
2019-10-08 12:13:34 -07:00
d396c7332a Update ONNX Export for Interpolate in Opset 11 (#26778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26778

- Add support for linear and cubic interpolate in opset 11.
- Add support for 1d and 3d interpolate in nearest mode for opset 7 and 8.
- Add tests for all cases of interpolate in ORT tests (nearest/linear/cubic, 1d/2d/3d, upsample/downsample).
Original PR resolved: https://github.com/pytorch/pytorch/pull/24805

Reviewed By: hl475

Differential Revision: D17564911

Pulled By: houseroad

fbshipit-source-id: 591e1f5b361854ace322eca1590f8f84d29c1a5d
2019-09-25 05:43:20 -07:00
1bb895e1c1 Revert D17330801: [pytorch][PR] Update ONNX Export for Interpolate in Opset 11
Test Plan: revert-hammer

Differential Revision:
D17330801

Original commit changeset: 1bdefff9e72f

fbshipit-source-id: dff07477403170c27260f736ab6e6010f0deca9f
2019-09-24 18:56:45 -07:00
de3d4686ca Update ONNX Export for Interpolate in Opset 11 (#24805)
Summary:
- Add support for linear and cubic interpolate in opset 11.
- Add support for 1d and 3d interpolate in nearest mode for opset 7 and 8.
- Add tests for all cases of interpolate in ORT tests (nearest/linear/cubic, 1d/2d/3d, upsample/downsample).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24805

Reviewed By: hl475

Differential Revision: D17330801

Pulled By: houseroad

fbshipit-source-id: 1bdefff9e72f5e70c51f4721e1d7347478b7505b
2019-09-24 16:29:57 -07:00
883628cb5c Added documentation for nn.functional.bilinear (#24951)
Summary:
Adds documentation for `nn.functional.bilinear`, as requested in https://github.com/pytorch/pytorch/issues/9886.

The format follows that of `nn.functional.linear`, and borrows from `nn.bilinear` in its description of `Tensor` shapes.

I am happy to add more extensive documentation (e.g. "Args," "Example(s)"). From what I gather, the format of comments is inconsistent across functions in `nn.functional.py` and between modules (e.g. `nn.functional` and `nn`). It's my first PR, so guidance for contributing documentation and other code would be greatly appreciated!
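
For reference, a minimal usage sketch (shapes follow the `nn.Bilinear` convention and are assumptions for illustration):
```
import torch
import torch.nn.functional as F

x1 = torch.randn(5, 3)          # (N, in1_features)
x2 = torch.randn(5, 4)          # (N, in2_features)
W = torch.randn(2, 3, 4)        # (out_features, in1_features, in2_features)
b = torch.randn(2)
out = F.bilinear(x1, x2, W, b)  # out[n, o] = x1[n] @ W[o] @ x2[n] + b[o]
print(out.shape)                # torch.Size([5, 2])
```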
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24951

Differential Revision: D17091261

Pulled By: soumith

fbshipit-source-id: efe2ad764700dfd6f30eedc03de4e1cd0d10ac72
2019-08-28 08:19:25 -07:00
74b65c32be Add align_corners option to grid_sample and affine_grid, change default to False (#24929)
Summary:
Resolves: https://github.com/pytorch/pytorch/issues/20785
Addresses https://github.com/pytorch/pytorch/issues/24470 for `affine_grid`
Subsumes and closes: https://github.com/pytorch/pytorch/pull/24878 and likewise closes: https://github.com/pytorch/pytorch/issues/24821

Adds the `align_corners` option to `grid_sample` and `affine_grid`, paralleling the option that was added to `interpolate` in version 0.4.0.

In short, setting `align_corners` to `False` allows these functions to be resolution agnostic.
This ensures, for example, that a grid generated from a neural net trained to warp 1024x1024 images will also work to warp the same image upsampled/downsampled to other resolutions like 512x512 or 2048x2048 without producing scaling/stretching artifacts.

Refer to the documentation and https://github.com/pytorch/pytorch/issues/20785 for more details.
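
A small sketch of using the new option consistently in both calls (an identity warp, with sizes assumed for illustration):
```
import torch
import torch.nn.functional as F

theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]])        # identity affine transform
grid = F.affine_grid(theta, size=(1, 1, 8, 8), align_corners=False)
x = torch.arange(64.0).view(1, 1, 8, 8)
y = F.grid_sample(x, grid, align_corners=False)  # same setting in both calls
print(torch.allclose(x, y, atol=1e-4))           # True: the identity is preserved
```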

#### BC-Breaking Changes

- **Important**: BC-Breaking change because of new default for `align_corners`
The old functionality can still be achieved by setting `align_corners=True`, but the default is now set to `align_corners=False`, since this is the more correct setting, and since this matches the default setting of `interpolate`.

- **Should not cause BC issues**: BC-Breaking change for pathological use case
2D affine transforms on 1D coordinates and 3D affine transforms on 2D coordinates (that is, when one of the spatial dimensions has an empty span) are ill-defined, and not an intended use case of `affine_grid`. Whereas before, all grid point components along such dimension were set arbitrarily to `-1` (that is, before multiplying be the affine matrix), they are now all set instead to `0`, which is a much more consistent and defensible arbitrary choice. A warning is triggered for such cases.

#### Documentation

- Update `affine_grid` documentation to express that it does indeed support 3D affine transforms. This support was already there but not documented.
- Add documentation warnings for BC-breaking changes in `grid_sample` and `affine_grid` (see above).

#### Refactors

- `affine_grid` no longer dispatches to cuDNN under any circumstances.
The decision point for when the cuDNN `affine_grid_generator` is compatible with the native PyTorch version and when it fails is a headache to maintain (see [these conditions](5377478e94/torch/nn/_functions/vision.py (L7-L8))). The native PyTorch kernel is now used in all cases.

- The kernels for `grid_sample` are slightly refactored to make maintenance easier.

#### Tests
Two new tests are added in `test_nn.py`:
- `test_affine_grid_error_checking` for errors and warnings in `affine_grid`
- `test_affine_grid_3D` for testing `affine_grid`'s 3D functionality. The functionality existed prior to this, but wasn't tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24929

Differential Revision: D16949064

Pulled By: ailzhang

fbshipit-source-id: b133ce0d47a2a5b3e2140b9d05fb05fca9140926
2019-08-21 21:17:49 -07:00
b0737ccdc1 Revert D16887357: [pytorch][PR] [BC-BREAKING] Add align_corners option to grid_sample and affine_grid, change default to False
Differential Revision:
D16887357

Original commit changeset: ea09aad7853e

fbshipit-source-id: 0bebb159be4e6ebe479771b42c0b483f5a84a094
2019-08-19 22:05:56 -07:00
87217cfd2a Add align_corners option to grid_sample and affine_grid, change default to False (#23923)
Summary:
Resolves: https://github.com/pytorch/pytorch/issues/20785

Adds the `align_corners` option to `grid_sample` and `affine_grid`, paralleling the option that was added to `interpolate` in version 0.4.0.

In short, setting `align_corners` to `False` allows these functions to be resolution agnostic.
This ensures, for example, that a grid generated from a neural net trained to warp 1024x1024 images will also work to warp the same image upsampled/downsampled to other resolutions like 512x512 or 2048x2048 without producing scaling/stretching artifacts.

Refer to the documentation and https://github.com/pytorch/pytorch/issues/20785 for more details.

**Important**: BC-Breaking Change because of new default
The old functionality can still be achieved by setting `align_corners=True`, but the default is now set to `align_corners=False`, since this is the more correct setting, and since this matches the default setting of `interpolate`.

The vectorized 2D cpu version of `grid_sampler` is refactored a bit. I don’t suspect that this refactor would affect the runtime much, since it is mostly done in inlined functions, but I may be wrong, and this has to be verified by profiling.

~The tests are not yet updated to reflect the new default. New tests should probably also be added to test both settings of `align_corners`.~ _Tests are now updated._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23923

Differential Revision: D16887357

Pulled By: ailzhang

fbshipit-source-id: ea09aad7853ef16536e719a898db8ba31595daa5
2019-08-19 09:45:44 -07:00
33a1c30cb1 cleanup torch/nn/functional.py (#23977)
Summary:
Clean up torch/nn/functional.py now that the JIT:
- handles multiple returns
- type-checks code paths that exit early (exceptions)
- refines types via assertions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23977

Differential Revision: D16697750

Pulled By: eellison

fbshipit-source-id: 1f777d6b9ead1105de50120fffd46d523e1e6797
2019-08-07 16:31:36 -07:00
3107f1dcd5 fix align_corners doc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23707

Differential Revision: D16617565

Pulled By: ezyang

fbshipit-source-id: 9ae581e9233d8c2b92f35b9486af1dab30ce8e3a
2019-08-02 12:43:35 -07:00
b7d90332ea add notes about overshoot in bicubic mode (#23321)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/21044

Bicubic interpolation can cause overshoot.

OpenCV keeps the result dtype aligned with the input dtype:
- If the input is uint8, the result is clamped to [0, 255]
- If the input is float, the result is unclamped.

In PyTorch's case, we only accept float input, so we keep the result unclamped and add some notes so that users can explicitly call `torch.clamp()` when necessary.
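
A small sketch of the explicit clamp mentioned above (the value range is an assumption for a [0, 255] float image):
```
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 8, 8) * 255   # float image in [0, 255]
up = F.interpolate(x, scale_factor=2, mode='bicubic', align_corners=False)
up = up.clamp(0, 255)              # remove bicubic overshoot explicitly
print(float(up.min()), float(up.max()))
```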
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23321

Differential Revision: D16464796

Pulled By: ailzhang

fbshipit-source-id: 177915e525d1f54c2209e277cf73e40699ed1acd
2019-07-24 14:46:37 -07:00
c2df54d6d0 avg_pool2d avg_pool3d for LongTensor (#22433)
Summary:
Generates avg_pool2d/avg_pool3d for LongTensor on the CPU.
Adds the divisor_override parameter.
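
A small sketch of the new parameter (sizes are assumptions; `divisor_override=1` turns the average into a plain window sum):
```
import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)
print(F.avg_pool2d(x, kernel_size=2))                      # divide by kernel area
print(F.avg_pool2d(x, kernel_size=2, divisor_override=1))  # plain window sums
```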
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22433

Differential Revision: D16108809

Pulled By: ifedan

fbshipit-source-id: 8de7ff585a0479702cceafb5ccf9dfea62a9cc50
2019-07-17 19:59:09 -07:00
332824551c Fix F.one_hot doc signature
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22929

Differential Revision: D16290741

Pulled By: ezyang

fbshipit-source-id: d8b979e64d92b94c5a70bb4ffe2a83042ed6abfc
2019-07-17 13:23:25 -07:00
10c4b98ade Remove weak script (#22212)
Summary:
* Deletes all weak script decorators / associated data structures / methods
   * In order to keep supporting the standard library in script, this enables recursive script on any function defined in `torch.nn`
   * Most changes in `torch/nn` are the result of `ag -Q "weak" torch/nn/ -l | xargs sed -i '/weak/d'`, only `rnn.py` needed manual editing to use the `ignore` and `export` to continue supporting the overloaded `forward` methods
* `Sequential`/`ModuleList` no longer need to be added to constants since they are compiled on demand

This should also fix https://github.com/pytorch/pytorch/issues/22212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22212

Differential Revision: D15988346

Pulled By: driazati

fbshipit-source-id: af223e3ad0580be895377312949997a70e988e4f
2019-07-03 17:28:25 -07:00
bb0f299f27 Update MultiheadAttention module support key/value with different number of features and allow static key/value (#21288)
Summary:
The changes include:

1. Allow key/value to have a different number of features than query. This supports the case where key and value have different feature dimensions.
2. Support three separate proj_weights, in addition to a single in_proj_weight. The proj_weight of key and value may have different dimensions than that of query, so three separate proj_weights are necessary. In the case where key and value have the same dimension as query, it is preferred to use a single large proj_weight for performance reasons. However, it should be noted that choosing between a single large weight and three separate weights is a size-dependent decision.
3. Give an option to use static k and v in the multihead_attn operator (see saved_k and saved_v). Those static key/value tensors can now be re-used when training the model.
4. Add more test cases to cover the arguments.

Note: current users should not be affected by the changes.
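
A shape sketch of points 1 and 2 above (sizes are assumptions for illustration):
```
import torch
import torch.nn as nn

# query has 16 features; key/value come from an 8-dimensional source
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, kdim=8, vdim=8)
q = torch.randn(5, 2, 16)   # (L, N, embed_dim)
k = torch.randn(7, 2, 8)    # (S, N, kdim)
v = torch.randn(7, 2, 8)    # (S, N, vdim)
out, _ = mha(q, k, v)
print(out.shape)            # torch.Size([5, 2, 16])
```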
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21288

Differential Revision: D15738808

Pulled By: zhangguanheng66

fbshipit-source-id: 288b995787ad55fba374184b3d15b5c6fe9abb5c
2019-07-02 18:06:25 -07:00
34aee933f9 ONNX Export Interpolate (Resize) for opset version 10
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21434

Reviewed By: zrphercule

Differential Revision: D15777197

Pulled By: houseroad

fbshipit-source-id: 517b06a54a234ffdb762401e83f5a732023ed259
2019-06-19 13:40:27 -07:00
0f675f9cbc Port im2col and vol2col (#21769)
Summary:
Partially resolves https://github.com/pytorch/pytorch/issues/18353
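
For context, im2col is exposed in Python as `F.unfold`; a minimal sketch (sizes assumed):
```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
cols = F.unfold(x, kernel_size=3)  # each column is a flattened 3x3x3 patch
print(cols.shape)                  # torch.Size([1, 27, 36])
```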
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21769

Differential Revision: D15854530

Pulled By: ezyang

fbshipit-source-id: 574853c068010d1b7588047d2ab7450077471447
2019-06-17 10:06:26 -07:00
efd20de276 fix multihead attention for half (#21658)
Summary:
Currently, multi-head attention is broken for the half dtype:
```
  File "/home/ngimel/pytorch/torch/nn/functional.py", line 3279, in multi_head_attention_forward
    attn_output = torch.bmm(attn_output_weights, v)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #2 'mat2'
```
because softmax converts half inputs into fp32 inputs. This is unnecessary - all the computations in softmax will be done in fp32 anyway, and the results need to be converted into fp16 for the subsequent batch matrix multiply, so nothing is gained by writing them out in fp32. This PR gets rid of type casting in softmax, so that half works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21658

Differential Revision: D15807487

Pulled By: zhangguanheng66

fbshipit-source-id: 4709ec71a36383d0d35a8f01021e12e22b94992d
2019-06-13 15:17:04 -07:00
26bcadcc61 Gumbel-Softmax Arxiv Docs Link Fix (#21376)
Summary:
Links separated #20297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21376

Differential Revision: D15696413

Pulled By: ezyang

fbshipit-source-id: 513bd430e41c109aa2d0fbaa9a242acb2a12059b
2019-06-06 10:11:18 -07:00
0c6efbd410 Fix gelu documents (#21265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21265

Fix gelu documents

Reviewed By: hl475

Differential Revision: D15598958

fbshipit-source-id: 483040069102daada705401c36c8990598142d3d
2019-06-02 20:17:56 -07:00
93ae040ff0 Add gelu activation in pytorch (#20665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20665

Add gelu activation forward on CPU in pytorch

Compared to the current Python implementation of GELU used in the BERT model, e.g.

  def gelu(self, x):
      return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))

The torch.nn.functional.gelu function can reduce the forward time from 333ms to 109ms (with MKL) / 112ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.

Reviewed By: zheng-xq

Differential Revision: D15400974

fbshipit-source-id: f606b43d1dd64e3c42a12c4991411d47551a8121
2019-06-02 09:08:47 -07:00
8e3311c5e2 Remove functionality unsupported by the JIT from multi_head_attention_forward. (#20653)
Summary:
Remove the internal functions in multi_head_attention_forward. Those internal functions cause a 10-15% performance regression, and there is possibly a JIT issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20653

Differential Revision: D15398888

Pulled By: cpuhrsch

fbshipit-source-id: 0a3f053a4ade5009e73d3974fa6733c2bff9d929
2019-05-27 15:12:58 -07:00