Commit Graph

469 Commits

Author SHA1 Message Date
14004cbef6 Native batch norm (#13263)
Summary:
- Move batch norm from TH(CU)NN to native
- Speedups in many cases (e.g. #12006) for CUDA due to new block/grid layout and Welford-type mean/variance calculations (the latter for training mode)
- It splits the forward kernel in two pieces and reuses the evaluation kernel for the transformation.
- We change the meaning of save_mean and save_invstd (aka save_var) to accscalar to maintain reasonable precision.

Compared to the ill-fated #12368
- I changed the CPU kernel to not call `.sum()` from within parallel for. This seemed to have caused the breakage (NaN-results) in TestModels.test_dcgan_netG (thank you houseroad for the repro, errors in assessment of the fix are my own)
- I updated the Half->Float upcasting in tensors to go through `t.type().scalarType()` instead of `t.dtype()`.
- I have merged master
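
For reference, a minimal generic sketch of the Welford-style single-pass mean/variance update mentioned above (plain Python, not the actual CUDA kernel):
```
def welford_mean_var(xs):
    # One-pass, numerically stable running mean/variance (Welford update).
    # Assumes xs is non-empty.
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / n  # biased variance, as used for batch statistics
```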
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13263

Differential Revision: D12946254

Pulled By: SsnL

fbshipit-source-id: 3bb717ee250fbccaf10afe73722996aa4713d10d
2018-11-06 20:05:54 -08:00
2cd912bcc2 Fix more spectral norm bugs (#13350)
Summary:
Problems with SN and DP after #12671 :
1. in eval mode, `weight_orig` is not getting correct gradient #12737 .

    Fix: keep `v` vector around as a buffer and always calculate `W = W_orig / (u @ W_orig @ v)` even in eval.

2. in training mode, the `weight` buffer of the parallelized module is never updated if someone touches `weight_orig` and/or `weight` and makes them no longer share storage. So in `eval` the weight used is wrong.

    Fix: Make `weight` not a buffer anymore and always calculate it as above.

3. #12671 changed SN to update `u` in-place to make DP work correctly, but then it breaks backward through two forwards (e.g., the common GAN loss `D(real) - D(fake)`) because the vectors needed to backprop through the 1st forward are changed in the 2nd forward.

    Fix: This PR clones `u` and `v` before using them.

To maintain BC, I added a hook interface for producing and loading state_dict. This is ugly and we should really have a better interface for spectral_norm, but for the purpose of fixing this issue I am making this patch. Even with a better interface, a BC mechanism for loading legacy state_dicts would still be needed.
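
For illustration, a rough sketch (a hypothetical helper, not the actual `spectral_norm` code) of the recomputation described in fixes 1 and 2, where the weight is always derived from `weight_orig`, `u`, and `v`:
```
import torch

def recompute_sn_weight(weight_orig, u, v):
    # sigma = u^T @ W_orig @ v; the normalized weight is W_orig / sigma.
    # u and v are cloned first so a second forward does not clobber
    # the tensors needed to backprop through the first forward (fix 3).
    u, v = u.clone(), v.clone()
    weight_mat = weight_orig.reshape(weight_orig.size(0), -1)
    sigma = torch.dot(u, torch.mv(weight_mat, v))
    return weight_orig / sigma
```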

cc crcrpar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13350

Differential Revision: D12931044

Pulled By: SsnL

fbshipit-source-id: 8be6f934eaa62414d76d2c644dedd7e1b7eb31ef
2018-11-06 19:16:13 -08:00
a7ee632dff Various Test and build fixes (#13556)
Summary:
- fixes weights-contiguous requirement for THCUNN Convolutions
- Add tests that conv backward pass works for non-contiguous weights
- fix RNN tests / error messages to be consistent and pass
- relax weight grad precision for fp16 for a particular test
- fix regression of CMAKE_PREFIX_PATH not passing through
- add missing skipIfNoLapack annotations where needed

Differential Revision: D12918456

Pulled By: soumith

fbshipit-source-id: 8642d36bffcc6f2957800d6afa1e10bef2a91d05
2018-11-06 07:13:47 -08:00
98f5c005da Speed up CPU threshold and relu implementation (#13182)
Summary:
```
The previous threshold implementation was not vectorized or parallelized.
This speeds up ResNet-50 CPU inference [1] from ~88 ms to ~67 ms

CPU timings:
https://gist.github.com/colesbury/d0d1be6974841d62696dbde329a8fde8

1 thread (before vs. after)
10240:  17.4 us vs. 6.9 µs per loop
102400: 141 us vs. 39.8 µs per loop

16 threads (before vs. after)
10240:  17.4 us vs. 6.7 µs per loop
102400: 141 us vs. 14.3 µs per loop

CUDA timings are not measurably different.

[1]: compiled with MKL-DNN, 8 threads, batch norm merged into convolutions
https://gist.github.com/colesbury/8a64897dae97558b3b82da665048c782
```
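
The quoted numbers come from the linked gists; a rough, hedged way to reproduce this kind of per-op timing locally would be something like:
```
import timeit
import torch

x = torch.randn(102400)
t = timeit.timeit(lambda: torch.relu(x), number=1000) / 1000
print('{:.1f} us per loop'.format(t * 1e6))
```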
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13182

Reviewed By: soumith

Differential Revision: D12825105

Pulled By: colesbury

fbshipit-source-id: 557da608ebb87db8a04adbb0d2882af4f2eb3c15
2018-11-05 12:51:29 -08:00
9f2b2cac37 Fix handling all empty bags in CUDA embedding bag (#13483)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11847
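
A small sketch of the all-empty-bags case being fixed (expected behavior assumed from the issue: zero rows rather than a crash; move the module and tensors to CUDA to hit the kernel in question):
```
import torch
import torch.nn as nn

emb = nn.EmbeddingBag(10, 3, mode='sum')
indices = torch.tensor([], dtype=torch.long)   # no indices at all
offsets = torch.zeros(4, dtype=torch.long)     # four bags, all empty
out = emb(indices, offsets)                    # expected shape (4, 3), all zeros
```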
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13483

Differential Revision: D12902914

Pulled By: SsnL

fbshipit-source-id: 577a53e815231e988da716b1ee5667e1f36408ca
2018-11-02 10:21:14 -07:00
99a5d19591 Rename elementwise_mean to mean (#13419)
Summary:
Closes #12459
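
A minimal usage sketch of the renamed reduction value (assuming the functional losses accept it after this change):
```
import torch
import torch.nn.functional as F

inp = torch.randn(3, 5, requires_grad=True)
target = torch.randint(5, (3,))

# 'mean' replaces the old 'elementwise_mean' spelling of the reduction.
loss = F.cross_entropy(inp, target, reduction='mean')
```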
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13419

Differential Revision: D12883299

Pulled By: SsnL

fbshipit-source-id: 8b4512ff73b66fdc674412904dbb3bf497ba70a7
2018-11-01 10:31:26 -07:00
488d393ea6 Fix pointwise loss broadcast (#12996)
Summary: Fixes #12129 , #12327

Differential Revision: D10513781

Pulled By: ailzhang

fbshipit-source-id: a210008a39ff6c3f056c9fbe3f0576cfcce638ec
2018-10-31 10:17:25 -07:00
f8864f0505 Revert "Move batch_norm to ATen/native, speed up (#12368)" (#13191)
Summary:
Revert #12368 since it's causing ONNX-related test cases to fail.
https://github.com/pytorch/pytorch/pull/12368

SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13191

Reviewed By: BIT-silence

Differential Revision: D12810778

Pulled By: houseroad

fbshipit-source-id: 1c373b92628580097cffcd237dccc5b3d8697577
2018-10-26 23:05:50 -07:00
dae7616078 Shard all tests based on how many tests exist. (#13160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13160

Reduces pytorch_core build from 2 hours to 30 minutes

Reviewed By: soumith, dzhulgakov

Differential Revision: D10524261

fbshipit-source-id: 97270ac73404b5ea4c264cd0e9d8d4b1be79b0e9
2018-10-26 18:20:34 -07:00
dc211c7de4 Move batch_norm to ATen/native, speed up (#12368)
Summary:
- Speed up the case of #12006 in the forward
- The backward still isn't as fast as one might hope (factor 2-3 in the #12006 case).
- More extensive benchmarking shows not so great performance compared
  to CuDNN for cases with many channels, e.g. bs=8-128 / c=1024 / f=1024.
- We change the meaning of save_mean and save_invstd (aka save_var) to accscalar to
  maintain reasonable precision.

Needless to say, I would happily split the TensorAccessor fixes into a separate PR, as they are standalone fixes and unrelated to the rest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12368

Differential Revision: D10559696

Pulled By: SsnL

fbshipit-source-id: f0d0d1e0912e17b15b8fb7a2c03d0fe757598419
2018-10-25 23:41:10 -07:00
7863c17b26 Fix convtranspose3d output_size calculation (#12952)
Summary:
Closes #2119.

There was a small bug where the output_size got sliced with `[-2:]`
where we really meant to slice it as `[2:]` (to remove the batch and
channel dimensions).

Added a new test for this.
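
A small sketch of the affected call path (hypothetical sizes; the point is that `output_size` includes batch and channel dims, which should be trimmed with `[2:]`):
```
import torch
import torch.nn as nn

m = nn.ConvTranspose3d(8, 4, kernel_size=3, stride=2)
x = torch.randn(2, 8, 5, 5, 5)
# Full output_size with batch/channel dims; only the spatial part should be used.
y = m(x, output_size=(2, 4, 11, 11, 11))
print(y.shape)  # torch.Size([2, 4, 11, 11, 11])
```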
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12952

Differential Revision: D10510678

Pulled By: zou3519

fbshipit-source-id: 4c04a5007fc6d002e1806d6fe981b43d33d6a4f2
2018-10-24 09:23:05 -07:00
cf235e0894 fix lint after new flake8 release added new style constraints (#13047)
Summary:
fix lint after new flake8 release added new style constraints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13047

Differential Revision: D10527804

Pulled By: soumith

fbshipit-source-id: 6f4d02662570b6339f69117b61037c8394b0bbd8
2018-10-24 09:03:38 -07:00
710191e292 fix error message of large kernel size in conv2D (#12791)
Summary:
- fix #12565
- test plan:
with this fix, we have:
```
>>> m = nn.Conv2d(in_channels=3, out_channels=33, kernel_size=10, stride=1, bias=True)
>>> input = torch.randn(1, 3, 1, 1)
>>> output = m(input)
```
RuntimeError: Calculated padded input size per channel: (1 x 1). Kernel size: (10 x 10). Kernel size can't be greater than actual input size at ~/pytorch/aten/src/THNN/generic/SpatialConvolutionMM.c:50

not sure why these are `int` instead of `int64_t`:
5ccdd7a626/aten/src/THNN/generic/SpatialConvolutionMM.c (L10)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12791

Differential Revision: D10443045

Pulled By: weiyangfb

fbshipit-source-id: 2620acb40bdd49d29cec06337f6dfb4653d1987c
2018-10-18 00:51:16 -07:00
f4944f0f8a Rename test/common.py to test/common_utils.py (#12794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794

common.py is used in base_module for almost all tests in test/. The
name of this file is so common that it can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.

Reviewed By: orionr

Differential Revision: D10438204

fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
2018-10-17 23:04:29 -07:00
ba25e13782 Forbid Module.to with copy argument. (#12617)
Summary:
Module.to uses the Tensor.to parsing facility.
It should not, however, accept "copy" as a keyword/fourth positional
argument.

See #12571 for discussion.

Thank you SsnL for noticing.
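
A short sketch of the intended behavior (argument forms assumed from `Tensor.to`; the `copy` form below is the one being rejected):
```
import torch
import torch.nn as nn

m = nn.Linear(2, 2)

# These forms remain valid:
m.to(torch.float64)
m.to(torch.device('cpu'), torch.float64)

# This should now be rejected rather than silently parsed:
# m.to(torch.float64, copy=True)
```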
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12617

Differential Revision: D10392053

Pulled By: ezyang

fbshipit-source-id: b67a5def7993189b4b47193abc7b741b7d07512c
2018-10-16 20:31:44 -07:00
ac994f2c78 Fix SpectralNorm with DataParallel (#12671)
Summary:
There were two problems with SN + DP:

1. In SN, the updated _u vector is saved back to module via a `setattr`. However, in DP, everything is run on a replica, so those updates are lost.
2. In DP, the buffers are broadcast via a `broadcast_coalesced`, so on replicas they are all views. Therefore, the `detach_` call won't work.

Fixes are:
1. Update _u vector in-place so, by the shared storage between 1st replica and the parallelized module, the update is retained
2. Do not call `detach_`.
3. Added comments in SN about the subtlety.
4. Added a note to the DP doc on this particular behavior of DP.

cc crcrpar taesung89 yaoshengfu

Fixes https://github.com/pytorch/pytorch/issues/11476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12671

Differential Revision: D10410232

Pulled By: SsnL

fbshipit-source-id: c447951844a30366d8c196bf9436340e88f3b6d9
2018-10-16 16:02:17 -07:00
e15501fb68 fix bce_with_logits with legacy reduce (#12689)
Summary:
Fix #12624 . internal usecase of legacy `reduce`.
Add test in test_nn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12689

Reviewed By: ezyang

Differential Revision: D10391195

Pulled By: ailzhang

fbshipit-source-id: 1af2b258c4abb2b6527eaaeac63e8bf1762c66a1
2018-10-16 09:46:58 -07:00
a98958d3bd dtype option for softmax (#11719)
Summary:
Add dtype argument to softmax/log_softmax functions.
Computing softmax in fp32 precision is necessary for mixed precision training, and converting the output of the previous layer into fp32 and then reading it as fp32 in softmax is expensive, both memory- and perf-wise; this PR allows one to avoid that.
For most input data/dtype combinations, input data is converted to dtype and then softmax is computed. If input data is half type and dtype is fp32, kernels with the corresponding template arguments are called.
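
A minimal usage sketch of the new argument (the exact kernel dispatch is internal; this just shows the API):
```
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(8, 1000, device=device,
                dtype=torch.half if device == 'cuda' else torch.float)

# Compute the softmax in fp32 even for fp16 inputs, without an explicit
# up-cast of the whole activation tensor beforehand.
probs = F.softmax(x, dim=-1, dtype=torch.float32)
```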
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11719

Reviewed By: ezyang

Differential Revision: D10175514

Pulled By: zou3519

fbshipit-source-id: 06d285af91a0b659932236d41ad63b787eeed243
2018-10-13 17:57:10 -07:00
d400502b1d Fix a bunch of warnings in TestNN
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12453

Differential Revision: D10244130

Pulled By: SsnL

fbshipit-source-id: e425c76bfb721fe118a32ddd1fa6eca3a3cd86f0
2018-10-08 17:38:23 -07:00
f8086845aa Fix bug in grad.py when conv bias != None (#12281)
Summary:
Obviously, the grads of conv weight and conv input do not depend on the bias, but the original `convXd_input` and `convXd_weight` methods receive a `bias` parameter. What's more, while the doc says `bias` should have the shape `(out_channels,)`, one gets a `RuntimeError` if bias != None and in_channels != out_channels, because the weight of a transposed conv has the shape `(in_channels, out_channels, kH, kW)` while the weight of a vanilla conv has the shape `(out_channels, in_channels, kH, kW)`:
```
RuntimeError: Given transposed=1, weight of size [channel1, channel2, kH, kW], expected bias to be 1-dimensional with channel2 elements, but got bias of size [channel1] instead
```
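
A sketch of using the gradient helpers without a bias (shapes are illustrative; helper names as in `torch.nn.grad`):
```
import torch
import torch.nn.functional as F
import torch.nn.grad as G

inp = torch.randn(1, 3, 8, 8)
weight = torch.randn(6, 3, 3, 3)
grad_out = torch.randn_like(F.conv2d(inp, weight))

# The input/weight gradients of a convolution do not depend on the bias,
# so no bias needs to be passed here.
grad_inp = G.conv2d_input(inp.shape, weight, grad_out)
grad_w = G.conv2d_weight(inp, weight.shape, grad_out)
```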
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12281

Differential Revision: D10217370

Pulled By: ezyang

fbshipit-source-id: bc00b439e5ae539276a5e678bdb92af700197bb2
2018-10-05 12:55:14 -07:00
c9f7d7b506 mark unit tests as working, skip failing unit test (#12313)
Summary:
* enabled fp16 tests for test_torch

* enable fp16 tests for test_nn

* enabled multilabelmargin loss for fp16

* removed skip for test_pdist_empty_col

* Enable test_nn tests that pass with compiler fixes etc.

* Enable test_legacy_nn tests that pass with compiler fixes etc.

ezyang bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12313

Differential Revision: D10189922

Pulled By: bddppq

fbshipit-source-id: a5592817c04b14e355cb062d42ebea406f0c92b6
2018-10-03 23:56:26 -07:00
a2ebbccc9f fix unit tests on CI
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12187

Differential Revision: D10118483

Pulled By: bddppq

fbshipit-source-id: 986c8fb48d61e00103c713548a50e74489a0e442
2018-09-28 23:11:55 -07:00
de11fe0c83 migrate PReLU to ATen (#11758)
Summary:
- fixes https://github.com/pytorch/pytorch/issues/10723
- migrate PReLU to ATen and deprecate legacy PReLU
- performance:

CPU with weight.numel() = 1
```
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
100 loops, best of 100: 9.43 ms per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
10 loops, best of 100: 24.4 ms per loop

>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 695 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.47 ms per loop
```

CPU with weight.numel() = channels
```
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 603 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 13.3 ms per loop

>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 655 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.45 ms per loop
```

CUDA with weight.numel() = 1
```
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 187 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.01 ms per loop

>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 195 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.28 ms per loop
```

CUDA with weight.numel() = channel
```
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 174 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.27 ms per loop

>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 181 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.26 ms per loop
```

The huge CPU performance regression when weight.numel() = 1 is addressed by replacing at::CPU_tensor_apply* with parallelized kernels.

ezyang SsnL zou3519  soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11758

Differential Revision: D9995799

Pulled By: weiyangfb

fbshipit-source-id: d289937c78075f46a54dafbde92fab0cc4b5b86e
2018-09-21 16:26:04 -07:00
775358e4c2 Add non-legacy test of bilinear (#11935)
Summary:
Fixes: #11905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11935

Differential Revision: D9991120

Pulled By: soumith

fbshipit-source-id: b00ad4f405440664ae5228b229a2ba0a5d3d92f6
2018-09-21 12:43:35 -07:00
d8f6be686d Remove torch/legacy (#11823)
Summary:
Largely unused and hinders current development
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11823

Differential Revision: D9925094

Pulled By: cpuhrsch

fbshipit-source-id: c797f62180e2128f9a567b0c57c8347957470ea5
2018-09-20 14:00:54 -07:00
8aedc27a63 checking device types of input and weights at RNN (#10185)
Summary:
- fixes #9534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10185

Differential Revision: D9141222

Pulled By: weiyangfb

fbshipit-source-id: bb652e42cc15917019df080d6bce2926b18f3476
2018-09-18 20:26:02 -07:00
e2bc95e1bd add ModuleList.insert (#11664)
Summary:
fixes #11652
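
A one-line usage sketch of the new method:
```
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])
layers.insert(1, nn.ReLU())   # list-style insertion at an arbitrary index
```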
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11664

Differential Revision: D9892845

Pulled By: ezyang

fbshipit-source-id: 2c910d6bc0b28a999e25beca6e398fd0f35535c5
2018-09-18 07:41:28 -07:00
7df6650e9c Fix empty embedding bag on cuda (#11740)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11739
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11740

Differential Revision: D9881392

Pulled By: SsnL

fbshipit-source-id: 2964d314f199dd9b4bb69e36592b67efdf5e0760
2018-09-17 14:40:03 -07:00
6f6b03566b Vectorize grid sample 2d CPU kernels (#10980)
Summary:
This PR vectorizes the CPU grid sample 2d forward and backward kernels. Specifically,

 1. add `.data()` in `TensorAccessor`
 2. support non-void return value for declaring CPU kernel stub
 3. add `bool at::geometry_is_contiguous(IntList sizes, IntList strides)`
1. The following vectorized CPU primitives are added:

    + `gather<scale>(baseaddr, vindex)`: `result[i] = baseaddr[vindex[i] * scale]`
    + `mask_gather<scale>(src, baseaddr, vindex, mask)`: `result[i] = mask[i] ? baseaddr[vindex[i] * scale] : src[i]`.
    + comparison ops
    + binary logical ops
    + `min(a, b)`
    + `cast<dst_t, src_t>(src_vec)`: changing dtype but keeping the bit representation
    + `blendv(a, b, mask)`: `result[i] = mask[i] ? b[i] : a[i]`.
    + ctor with multiple values (i.e., `setr`)
    + `arange(start = 0, step = 1)`: constructs a vector with values specified by the arange parameters
    + `convert_to_int_of_same_size(vec)`: convert floating point vector to corresponding integral type of same size
    + `interleave2(a, b)` & `deinterleave2(x, y)`: interleaves or deinterleaves two vectors. E.g., for `interleave`:
        ```
        inputs:
          {a0, a1, a2, a3, a4, a5, a6, a7}
          {b0, b1, b2, b3, b4, b5, b6, b7}
        outputs:
          {a0, b0, a1, b1, a2, b2, a3, b3}
          {a4, b4, a5, b5, a6, b6, a7, b7}
        ```

  2. Grid sample CPU kernel implementations are described in the following note (also in `GridSampleKernel.cpp`):

  ```
   NOTE [ Grid Sample CPU Kernels ]

   Implementation of vectorized grid sample CPU kernels is divided into three
   parts:

   1. `ComputeLocation` struct
      Transforms grid values into interpolation locations of the input tensor
      for a particular spatial dimension, based on the size of that dimension
      in the input tensor and the padding mode.
```
```cpp
      template<typename scalar_t, GridSamplerPadding padding>
      struct ComputeLocation {
        using Vec = Vec256<scalar_t>;

        // ctor
        ComputeLocation(int64_t size);

        // Given grid values `in`, return the interpolation locations after
        // un-normalization and padding mechanism (elementwise).
        Vec apply(const Vec &in) const;

        // Similar to `apply`, but also returns `d apply(in) / d in`
        // (elementwise).
        // this is often used in gradient computation.
        std::pair<Vec, Vec> apply_get_grad(const Vec &in) const;
      };
```
```
   2. `ApplyGridSample` struct
      Owns N `ComputeLocation` structs, where N is the number of spatial
      dimensions. Given N input grid vectors (one for each spatial dimension)
      and spatial offset, it gets the interpolation locations from
      `ComputeLocation`s, applies interpolation procedure, and then writes to
      the output (or grad_input & grad_grid in backward).
```
```cpp
      template<typename scalar_t, int spatial_dim,
               GridSamplerInterpolation interp,
               GridSamplerPadding padding>
      struct ApplyGridSample {

        // ctor
        ApplyGridSample(const TensorAccessor<scalar_t, 4>& input);

        // Applies grid sampling (forward) procedure:
        //   1. computes interpolation locations from grid values `grid_x` and
        //      `grid_y`,
        //   2. interpolates output values using the locations and input data
        //      in `inp_slice`, and
        //   3. writes the first `len` values in the interpolated vector to
        //      `out_slice` with spatial offset being `offset`.
        //
        // This assumes that `grid_x` and `grid_y` all contain valid grid
        // values \in [-1, 1], even at indices greater than `len`.
        //
        // The `*_slice` argument names refer to samples within a batch (i.e.,
        // with the batch dimension sliced out).
        void forward(TensorAccessor<scalar_t, 3>& out_slice,
                     const TensorAccessor<scalar_t, 3>& inp_slice,
                     int64_t offset, const Vec& grid_x, const Vec& grid_y,
                     int64_t len) const;

        // Applies grid sampling (backward) procedure. Arguments semantics
        // and strategy are similar to those of `forward`.
        void backward(TensorAccessor<scalar_t, 3>& gInp_slice,
                      TensorAccessor<scalar_t, 3>& gGrid_slice,
                      const TensorAccessor<scalar_t, 3>& gOut_slice,
                      const TensorAccessor<scalar_t, 3>& inp_slice,
                      int64_t offset, const Vec& grid_x, const Vec& grid_y,
                      int64_t len) const;
      }
```
```
   3. `grid_sample_2d_grid_slice_iterator` function
      Among the tensors we work with, we know that the output tensors are
      contiguous (i.e., `output` in forward, and `grad_input` & `grad_grid` in
      backward), we need to randomly read `input` anyways, and `grad_output`
      usually comes from autograd and is often contiguous. So we base our
      iterating strategy on the geometry of grid.
      The `grid_sample_2d_grid_slice_iterator` function provides an abstraction to
      efficiently iterate through a `grid` slice (without the batch dimension).
      See comments of that function on the specific cases and strategies used.
```
```cpp
      template<typename scalar_t, typename ApplyFn>
      void grid_sample_2d_grid_slice_iterator(
        const TensorAccessor<scalar_t, 3>& grid_slice,
        const ApplyFn &apply_fn);

      // `apply_fn` is a function/lambda that can be called as if it has
      // declaration:
      //   void apply_fn(const Vec256<scalar_t>& grid_x,
      //                 const Vec256<scalar_t>& grid_y,
      //                 int64_t spatial_offset, int64_t len);
```
```
      `apply_fn` will be called multiple times, and together cover the entire
      output spatial space. Therefore, e.g., to implement forward 2d grid
      sample, we can do
```
```cpp
      ApplyGridSample<scalar_t, 2, interp, padding> grid_sample(input_accessor);

      for (int n = 0; n < input_accessor.size(0); n++) {
        grid_sample_2d_grid_slice_iterator(
          grid_accessor[n],
          [&](const Vec256<scalar_t>& grid_x, const Vec256<scalar_t>& grid_y,
              int64_t spatial_offset, int64_t len) {
            grid_sample.forward(out_accessor[n], input_accessor[n],
                                spatial_offset, grid_x, grid_y, len);
          });
      }
   ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10980

Differential Revision: D9564867

Pulled By: SsnL

fbshipit-source-id: 5b7c3c7ea63af00eec230ae9ee1c3e6c6c9679b4
2018-09-16 20:41:10 -07:00
74197c7115 Restore support for dim=None on WeightNorm. (#11661)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
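
A minimal sketch of the restored form (assuming `dim=None` normalizes over the whole weight tensor):
```
import torch.nn as nn
from torch.nn.utils import weight_norm

# dim=None: a single norm over the entire weight, instead of per output channel.
m = weight_norm(nn.Linear(10, 20), name='weight', dim=None)
```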
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11661

Reviewed By: veenix

Differential Revision: D9826799

Pulled By: ezyang

fbshipit-source-id: 9eec57bb27a365406669e412f6eb88741b22ed3d
2018-09-14 07:39:43 -07:00
54107ae8cf convert output_device at data_parallel from torch.device to index (#10189)
Summary:
- fixes #9984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10189

Differential Revision: D9545390

Pulled By: weiyangfb

fbshipit-source-id: 3a6a705437553ba319e9fd4b7f676ff73857a27e
2018-09-11 20:27:07 -07:00
91089a7e17 Add GPU implementation of pdist (#11102)
Summary:
Add the gpu kernel version.

The parallelism I went with performs poorly when there is a large number of vectors that are all short, as I don't allocate the thread pool to wrap the computation in that case.

Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```

Current performance specs are a little underwhelming; I'm in the process of debugging.

size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms
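
For context, a minimal usage sketch of the op being benchmarked above:
```
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(1024, 16, device=device)

# Condensed pairwise distances between rows, shape (n * (n - 1) / 2,).
d = torch.pdist(x, p=2)
```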
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11102

Differential Revision: D9697305

Pulled By: erikbrinkman

fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
2018-09-07 09:09:46 -07:00
9de2085806 Use custom hcc/HIP, purge hcSPARSE (#11198)
Summary:
* purge hcSPARSE now that rocSPARSE is available
* integrate a custom hcc and HIP
* hcc brings two important compiler fixes (fixes hundreds of unit tests)
* HIP brings a smart dispatcher that allows us to avoid a lot of static_casts (we haven't yet removed the automatic static_casts but this catches some occurrences the script did not catch)
* mark 5 unit tests skipping that have regressed w/ the new hcc (we don't know yet what is at fault)
* optimize bitonic sort - the comparator is always an empty struct - therefore passing it by value saves at least 3 bytes. It also removes an ambiguity around passing references to `__global__` functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11198

Differential Revision: D9652340

Pulled By: ezyang

fbshipit-source-id: f5af1d891189da820e3d13b7bed91a7a43154690
2018-09-06 19:38:07 -07:00
33c7cc13ca improve docker packages, fix bugs, enable tests, enable FFT (#10893)
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893

Differential Revision: D9615053

Pulled By: ezyang

fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
2018-09-02 08:54:42 -07:00
e85f3fccb3 Fix relying on UB in test_data_parallel_nested_output (#11092)
Summary:
We shouldn't rely on plain `dict` ordering. Example failure: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-xenial-cuda8-cudnn6-py3-test1/8417/console
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11092

Reviewed By: ezyang

Differential Revision: D9583274

Pulled By: SsnL

fbshipit-source-id: ba80b96648c98c24c2ec5fa6fd9aa566c095cce7
2018-08-30 13:10:25 -07:00
611a608517 Add ATen pdist CPU kernel (#10782)
Summary:
Also add single grad whitelist to the jit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10782

Reviewed By: ezyang

Differential Revision: D9583378

Pulled By: erikbrinkman

fbshipit-source-id: 069e5ae68ea7f3524dec39cf1d5fe9cd53941944
2018-08-30 11:55:27 -07:00
b14f2e899c Preserve sparse tensor shape and dim invariants, and add scalar tensor support (#9279)
Summary:
When 0-sized dimension support is added, we expect an empty sparse tensor to be a 1-dimensional tensor of size `[0]`, with `sparseDims == 1` and `denseDims == 0`. Also, we expect the following invariants to be preserved at all times:

```
_sparseDims + _denseDims = len(shape)
_indices.shape: dimensionality: 2,  shape: (_sparseDims, nnz)
_values.shape:  dimensionality: 1 + _denseDims.  shape: (nnz, shape[_sparseDims:])
```

This PR fixes various places where the invariants are not strictly enforced when 0-sized dimension support is enabled.

Tested and `test_sparse.py` passes locally on both CPU and CUDA with the `USE_TH_SIZE_ZERO_DIM` flag.
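
A small sketch of an empty sparse tensor satisfying the invariants above (written against the current public constructor, not the internal `USE_TH_SIZE_ZERO_DIM` code path):
```
import torch

# shape [0], 1 sparse dim, 0 dense dims, nnz == 0
t = torch.sparse_coo_tensor(torch.empty(1, 0, dtype=torch.long),
                            torch.empty(0), size=(0,))
print(t.shape, t._nnz())        # torch.Size([0]) 0
print(t._indices().shape)       # torch.Size([1, 0])
print(t._values().shape)        # torch.Size([0])
```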
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9279

Differential Revision: D8936683

Pulled By: yf225

fbshipit-source-id: 12f5cd7f52233d3b26af6edc20b4cdee045bcb5e
2018-08-23 10:10:24 -07:00
de11a5fb28 Resubmit #8322 with scipy version check
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10775

Differential Revision: D9458207

Pulled By: SsnL

fbshipit-source-id: f2b0dbf2d236134afded9b15d8bf55ff98f50e7b
2018-08-22 13:39:49 -07:00
c5c1c051ca Fix dropout fused kernel applied in eval mode (#10621)
Summary:
fixes https://github.com/pytorch/pytorch/issues/10584

cc apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10621

Differential Revision: D9379397

Pulled By: SsnL

fbshipit-source-id: 5ff2939ba794af082ce597ef289a09ee757636dc
2018-08-17 14:54:42 -07:00
afd7477eaa Add `buffers(), named_buffers()` methods. (#10554)
Summary:
This commit adds the ``buffers()`` and ``named_buffers()`` methods as
analogues of ``parameters()`` and ``named_parameters()``.
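
A quick usage sketch of the new methods:
```
import torch.nn as nn

m = nn.BatchNorm1d(4)

# Iterate over buffers the same way parameters are iterated.
for name, buf in m.named_buffers():
    print(name, tuple(buf.shape))

n_buffer_elems = sum(b.numel() for b in m.buffers())
```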
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554

Reviewed By: SsnL

Differential Revision: D9367762

Pulled By: jma127

fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e
2018-08-16 16:26:48 -07:00
a129f9ad3b Revert D9332335: [pytorch][PR] Implements volumetric (5d) affine grid generation.
Differential Revision:
D9332335

Original commit changeset: 1b3a91d078ef

fbshipit-source-id: 3dcce680257a6da121f5d67918ed4236e0c5bfec
2018-08-15 15:25:11 -07:00
86363e1d8e Move RNN implementations to C++ (#10481)
Summary:
This is the first of two changes that are supposed to improve how we handle RNNs in the JIT. They still get traced as `PythonOp`s, but now it will be much easier to actually expose them to the JIT as e.g. `aten::lstm`, and ignore the Python interpreter entirely. This needs some symbolic adjustments that will be part of a second PR.

Even when we fix symbolics, there will still be a bit of a problem with statefulness of the cuDNN API (we need a mutable cache for the dropout state, but our IR has no way of representing that).

zdevito ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10481

Reviewed By: ezyang

Differential Revision: D9341113

Pulled By: apaszke

fbshipit-source-id: 0ae30ead72a1b12044b7c12369d11e5ca8ec30b5
2018-08-15 13:25:41 -07:00
254dedf604 Propagate NaN through threshold (#10277)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/10238
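
A tiny sketch of the intended post-fix behavior (assumed from the issue: NaN inputs should stay NaN instead of being clamped):
```
import torch
import torch.nn.functional as F

x = torch.tensor([float('nan'), -1.0, 2.0])
print(F.relu(x))                   # expected: tensor([nan, 0., 2.])
print(F.threshold(x, 0.0, 0.0))    # same propagation for the generic threshold
```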
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10277

Reviewed By: SsnL

Differential Revision: D9199825

Pulled By: soumith

fbshipit-source-id: 8ee7f9a72d9546d429f311c3f6028461d3c93fe2
2018-08-15 12:59:31 -07:00
9cffe783f1 relax tolerance for two torch.half (float16) tests (#10519)
Summary:
Two tests in the 'nn' test bucket may fail when the torch.half
(float16) data type is used. The assertions used in the tests
intend to allow slight floating point imprecision in the results,
but the tolerances used for the comparisons are too strict for
the half type.

Relax the tolerances so that slight float16 imprecision won't
cause test failures.

The affected tests are:

- test_variable_sequence_cuda
- test_Conv2d_groups_nobias

For more information, see issue:

https://github.com/pytorch/pytorch/issues/7420
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10519

Differential Revision: D9343751

Pulled By: soumith

fbshipit-source-id: 90aedf48f6e22dd4fed9c7bde7cd7c7b6885845a
2018-08-15 12:11:20 -07:00
f5a4dd89b5 Implements volumetric (5d) affine grid generation. (#8322)
Summary:
I've implemented affine grid generation for volumetric (5d) inputs. The implementation is based off of the spatial implementation, extended by one dimension. I have a few questions about my implementation vs. the existing one that I will add inline.

I have some extensive test cases for the forward pass here: https://gist.github.com/elistevens/6e3bfb20d8d0652b83bd16b3e911285b However, they use `pytest.fixture` extensively, so I'm not sure of the best way to incorporate them into the pytorch test suite. Suggestions? I have not tested the backward pass at all.

Diff probably best viewed with whitespace changes ignored.

Thanks for considering!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8322

Differential Revision: D9332335

Pulled By: SsnL

fbshipit-source-id: 1b3a91d078ef41a6d0a800514e49298fd817e4df
2018-08-15 11:02:08 -07:00
6a55238a3f Grid sampler: nearest interpolation & reflection padding (#10051)
Summary:
closes #9702 .

cc jph00

Commit structure:

1. Change the index calculation logic. I will explain using 1-D for simplicity.

	Previously we have (in pseudo code):

	```
	// 1. get the float locations from grid
	scalar_t x = from_grid()

	// 2. find the integral surrounding indices
	int x_left = floor(x)
	int x_right = x_left + 1

	// 3. calculate the linear interpolate weights
	scalar_t w_left = x_right - x
	scalar_t w_right = x - x_left

	// 4. manipulate the integral surrounding indices if needed
	// (e.g., clip for border padding_mode)
	x_left = manipulate(x_left, padding_mode)
	x_right = manipulate(x_right, padding_mode)

	// 5. interpolate
	output_val = interpolate(w_left, w_right, x_left, x_right)
	```

	This is actually incorrect (and also unintuitive) because it calculates the
	weights before manipulating out-of-boundary indices. Fortunately, this
	isn't manifested in either of the currently supported modes, `'zeros'` and
	`'border'` padding:

	+ `'zeros'`: doesn't clip
	+ `'border'`: clips, but for out-of-bound `x` both `x_left` and `x_right` are
	  clipped to the same value, so weights don't matter

	But this is a problem with reflection padding, since after each time we reflect,
	the values of `w_left` and `w_right` should be swapped.

	So in this commit I change the algorithm to (numbers corresponding to the
        ordering in the above pseudo-code)

	```
	1. get float location
	4. clip the float location
	2. find the integral surrounding indices
	3. calculate the linear interpolate weights
	```

	In the backward, because of this change, I need to add new variables to track
	`d manipulate_output / d manipulate_input`, which is basically a multiplier
	on the gradient calculated for `grid`. From benchmarking, this addition doesn't
	cause obvious slowdowns.

2. Implement reflection padding. The indices will keep being reflected until
	they fall within the boundary.

	Added variant of `clip_coordinates` and `reflect_coordinates` to be used in
	backward. E.g.,
	```cpp
	// clip_coordinates_set_grad works similarly to clip_coordinates except that
	// it also returns the `d output / d input` via pointer argument `grad_in`.
	// This is useful in the backward pass of grid_sampler.
	scalar_t clip_coordinates_set_grad(scalar_t in, int64_t clip_limit, scalar_t *grad_in)
	```
	For example, if `in` is clipped in `'border'` mode, `grad_in` is set to `0`.
	If `in` is reflected **odd** times in `'reflection'` mode, `grad_in`
	is set to `-1`.

3. Implement nearest interpolation.

4. Add test cases

5. Add better input checking
  Discussed with goldsborough about moving `operator<<` of `at::Device`,
  `at::DeviceType` and `at::Layout` into `at` namespace. (Otherwise
  `AT_CHECK` can't find them.)

6. Support empty tensors. cc gchanan

    + Make empty tensors not acceptable by cudnn.
    + Add `AT_ASSERT(kernel block size  > 0)` if using `GET_BLOCKS`
   + Cache `numel` in `TensorGeometry`
      I was going to use `numel` to test whether the cudnn descriptor should accept a
      tensor, but it ended up unused. I can revert this if needed.

7. Add more test cases, including on input checking and empty tensors

8. Remove an obsolete comment

9. Update docs. Manually tested by generating docs.
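
A minimal usage sketch combining the two new options added here (shapes are illustrative):
```
import torch
import torch.nn.functional as F

inp = torch.randn(1, 3, 8, 8)
grid = torch.rand(1, 4, 4, 2) * 2 - 1   # normalized sampling locations in [-1, 1]

out = F.grid_sample(inp, grid, mode='nearest', padding_mode='reflection')
print(out.shape)  # torch.Size([1, 3, 4, 4])
```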
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10051

Differential Revision: D9123950

Pulled By: SsnL

fbshipit-source-id: ac3b4a0a36b39b5d02e83666cc6730111ce216f6
2018-08-10 12:43:27 -07:00
5bb21493fd add fused dropout kernels (#9666)
Summary:
While waiting for dropout to be fully ported to ATen, here's a performance fix for the most common dropout case. Dropout is still a Python function; I just added an efficient path to it. I could not make inplace work, because the generator always generates `return self` for inplace functions and I need to return both the original tensor and the mask, so the inplace case goes through the existing path. Even with the non-inplace version, since the mask is now a ByteTensor, the memory used is just a little larger than for inplace dropout, due to savings on the mask.
Once dropout is moved to aten, these kernels still can be used for efficient implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9666

Reviewed By: SsnL

Differential Revision: D8948077

Pulled By: ezyang

fbshipit-source-id: 52990ef769471d957e464af635e5f9b4e519567a
2018-08-07 13:34:53 -07:00
149d4f776b use logsigmoid at multilabel_soft_margin_loss, and change output from shape=(N, C) to (N,) (#9965)
Summary:
- fixes #9141, #9301
- use logsigmoid in multilabel_soft_margin_loss to make it more stable (NOT fixing legacy MultiLabelSoftMarginCriterion)
- return (N,) instead of (N, C) to match the behavior of MultiMarginLoss
- Note that with this PR, the following behavior is expected:
```
loss = F.multilabel_soft_margin_loss(outputs, labels, reduction='none')
loss_mean = F.multilabel_soft_margin_loss(outputs, labels, reduction='elementwise_mean')
loss_sum = F.multilabel_soft_margin_loss(outputs, labels, reduction='sum')

loss.sum() == loss_sum  # True
loss.mean() == loss_mean  # True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9965

Differential Revision: D9038402

Pulled By: weiyangfb

fbshipit-source-id: 0fa94c7b3cd370ea62bd6333f1a0e9bd0b8ccbb9
2018-08-03 17:54:19 -07:00
6456b944fd ctc_loss odds and ends (#10112)
Summary:
- Add convenience wrapper to pass tensors as input_lengths, target_lengths
- Fix documentation example
- Check BLANK >= 0

Thank you, Simon and Soumith for the suggestions!
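
A short sketch of the convenience being added, passing the lengths as tensors (shapes are illustrative):
```
import torch
import torch.nn.functional as F

T, N, C, S = 50, 4, 20, 10
log_probs = F.log_softmax(torch.randn(T, N, C), dim=-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)

input_lengths = torch.full((N,), T, dtype=torch.long)    # tensors, not Python lists
target_lengths = torch.full((N,), S, dtype=torch.long)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```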
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10112

Differential Revision: D9130737

Pulled By: SsnL

fbshipit-source-id: f9a0022a969788bda3db9f360e2564b519ebf2e6
2018-08-03 13:25:18 -07:00
43b151224e Move grid sampler to ATen (#9961)
Summary:
Spatial version benchmark

|                           | CPUFloat THNN | CPUFloat ATen | CPUDouble THNN | CPUDouble ATen | CUDAHalf THNN | CUDAHalf ATen | CUDAFloat THNN | CUDAFloat ATen | CUDADouble THNN | CUDADouble ATen |
|---------------------------|---------------|---------------|----------------|----------------|---------------|---------------|----------------|----------------|-----------------|-----------------|
| [1024x1x28x28] zero pad   | 2.19281888s   | 0.21280479s   | 2.52922535s    | 0.23944831s    | 0.17494774s   | 0.06242800s   | 0.31270599s    | 0.03706479s    | 0.40542483s     | 0.07391024s     |
| [1024x1x28x28] border pad | 3.04329610s   | 0.24705672s   | 2.29205394s    | 0.22336411s    | 0.17980361s   | 0.06212497s   | 0.31415701s    | 0.03847790s    | 0.43020391s     | 0.07540464s     |
| [32x3x244x244] zero pad   | 18.29301333s  | 2.18566656s   | 19.01662397s   | 3.51552224s    | 1.72487235s   | 0.28933954s   | 2.02466702s    | 0.18178749s    | 2.63671613s     | 0.41391206s     |
| [32x3x244x244] border pad | 18.72205329s  | 2.02600884s   | 20.13017297s   | 3.25979590s    | 1.96455693s   | 0.33070564s   | 2.18666625s    | 0.19546938s    | 2.91268897s     | 0.38465047s     |

For #9702

basics:
+ grid tensors have dimensions `[N, H, W, 2]` (or `[N, D, H, W, 3]` for 3d).
+ input/output tensors have dimensions `[N, C, H, W]` (or `[N, C, D, H ,W]` for 3d)
+ grid sampler maps `input([N, C, inp_H, inp_W]), grid([N, H, W, 2])` to `output([N, C, H, W])` (3d case is similar).

variable naming:
+ `tensor_sH` means the stride of `tensor` at the dimension of `H`.
+ `tensor_ptr_NCH` is a data pointer that always points to the beginning of the `tensor[n][c][h]` slice in the loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9961

Differential Revision: D9057175

Pulled By: SsnL

fbshipit-source-id: 9ed8f1dc376ed10229f047fdcf3c90dbd250bee6
2018-08-01 07:54:46 -07:00