Summary:
- Move batch norm from TH(CU)NN to native
- Speedups in many cases (e.g. #12006) for CUDA due to new block/grid layout and Welford-type mean/variance calculations (the latter for training mode)
- It splits the forward kernel in two pieces and reuses the evaluation kernel for the transformation.
- We change the meaning of save_mean and save_invstd (aka save_var) to accscalar to maintain reasonable precision.
Compared to the ill-fated #12368
- I changed the CPU kernel to not call `.sum()` from within parallel for. This seemed to have caused the breakage (NaN-results) in TestModels.test_dcgan_netG (thank you houseroad for the repro, errors in assessment of the fix are my own)
- I updated the Half->Float upcasting in tensors to go through `t.type().scalarType()` instead of `t.dtype()`.
- I have merged master
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13263
Differential Revision: D12946254
Pulled By: SsnL
fbshipit-source-id: 3bb717ee250fbccaf10afe73722996aa4713d10d
Summary:
Problems with SN and DP after #12671 :
1. in eval mode, `weight_orig` is not getting correct gradient #12737 .
Fix: keep `v` vector around as a buffer and always calculate `W = W_orig / (u @ W_orig @ v)` even in eval.
2. in training mode, the `weight` buffer of the parallelized module is never updated, if someone touches `weight_orig` and/or `weight` and makes them not sharing storage. So in `eval` the weight used is wrong.
Fix: Make `weight` not a buffer anymore and always calculate it as above.
3. #12671 changed SN to update `u` in-place to make DP work correctly, but then it breaks backward through two forwards (e.g., the common GAN loss `D(real) - D(fake)`) because the vectors needed to backprop the 1st forward is changed in the 2nd forward.
Fix: This PR clones `u` and `v` before using them.
To maintain BC, I added a hook interface for producing and loading state_dict. This is ugly and we should really have better interface for spectral_norm. But for the purpose to fix this issue, I make this patch. Even if we have a better interface, BC mechanism for legacy loading legacy state_dict still needs to be done.
cc The controller you requested could not be found. crcrpar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13350
Differential Revision: D12931044
Pulled By: SsnL
fbshipit-source-id: 8be6f934eaa62414d76d2c644dedd7e1b7eb31ef
Summary:
- fixes weights-contiguous requirement for THCUNN Convolutions
- Add tests that conv backward pass works for non-contiguous weights
- fix RNN tests / error messages to be consistent and pass
- relax weight grad precision for fp16 for a particular test
- fix regression of CMAKE_PREFIX_PATH not passing through
- add missing skipIfNoLapack annotations where needed
Differential Revision: D12918456
Pulled By: soumith
fbshipit-source-id: 8642d36bffcc6f2957800d6afa1e10bef2a91d05
Summary:
```
The previous threshold implementation was not vectorized or parallelized.
This speeds up ResNet-50 CPU inference [1] from ~88 ms to ~67 ms
CPU timings:
https://gist.github.com/colesbury/d0d1be6974841d62696dbde329a8fde8
1 thread (before vs. after)
10240: 17.4 us vs. 6.9 µs per loop
102400: 141 us vs. 39.8 µs per loop
16 threads (before vs. after)
10240: 17.4 us vs. 6.7 µs per loop
102400: 141 us vs. 14.3 µs per loop
CUDA timings are not measurably different.
[1]: compiled with MKL-DNN, 8 threads, batch norm merged into convolutions
https://gist.github.com/colesbury/8a64897dae97558b3b82da665048c782
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13182
Reviewed By: soumith
Differential Revision: D12825105
Pulled By: colesbury
fbshipit-source-id: 557da608ebb87db8a04adbb0d2882af4f2eb3c15
Summary:
- Speed up the case of #12006 in the forward
- The backward still isn't as fast as one might hope (factor 2-3 in the #12006 case).
- More extensive benchmarking shows not so great performance compared
to CuDNN for cases with many channels, e.g. bs=8-128 / c=1024 / f=1024.
- We change the meaning of save_mean and save_invstd (aka save_var) to accscalar to
maintain reasonable precision.
Needless to say that I would happily separate the TensorAccessor fixes in a separate PR, as they're fixes and unrelated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12368
Differential Revision: D10559696
Pulled By: SsnL
fbshipit-source-id: f0d0d1e0912e17b15b8fb7a2c03d0fe757598419
Summary:
Closes#2119.
There was a small bug where the output_size got sliced with `[-2:]`
where we really meant to slice it as `[2:]` (to remove the batch and
channel dimensions).
Added a new test for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12952
Differential Revision: D10510678
Pulled By: zou3519
fbshipit-source-id: 4c04a5007fc6d002e1806d6fe981b43d33d6a4f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794
common.py is used in base_module for almost all tests in test/. The
name of this file is so common that can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.
Reviewed By: orionr
Differential Revision: D10438204
fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
Summary:
Module.to uses the Tensor.to parsing facility.
It should not, however, accept "copy" as a keyword/fourth positional
argument.
See #12571 for discussion.
Thank you SsnL for noticing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12617
Differential Revision: D10392053
Pulled By: ezyang
fbshipit-source-id: b67a5def7993189b4b47193abc7b741b7d07512c
Summary:
There were two problems with SN + DP:
1. In SN, the updated _u vector is saved back to module via a `setattr`. However, in DP, everything is run on a replica, so those updates are lost.
2. In DP, the buffers are broadcast via a `broadcast_coalesced`, so on replicas they are all views. Therefore, the `detach_` call won't work.
Fixes are:
1. Update _u vector in-place so, by the shared storage between 1st replica and the parallelized module, the update is retained
2. Do not call `detach_`.
3. Added comments in SN about the subtlety.
4. Added a note to the DP doc on this particular behavior of DP.
cc crcrpar taesung89 The controller you requested could not be found. yaoshengfu
Fixes https://github.com/pytorch/pytorch/issues/11476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12671
Differential Revision: D10410232
Pulled By: SsnL
fbshipit-source-id: c447951844a30366d8c196bf9436340e88f3b6d9
Summary:
Add dtype argument to softmax/log_softmax functions.
Computing softmax in fp32 precision is necessary for mixed precision training, and converting output of the previous layer into fp32 and then reading it as fp32 in softmax is expensive, memory and perf-wise, this PR allows one to avoid it.
For most input data/dtype combinations, input data is converted to dtype and then softmax is computed. If input data is half type and dtype is fp32, kernels with the corresponding template arguments are called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11719
Reviewed By: ezyang
Differential Revision: D10175514
Pulled By: zou3519
fbshipit-source-id: 06d285af91a0b659932236d41ad63b787eeed243
Summary:
Obviously, the grads of conv weight and conv input are not relevant to the bias, but the original `convXd_input` and `convXd_weight` methods receive a `bias` parameter. What's more, while the doc says `bias` should have the shape `(out_channels,)`, one will get a `RuntimeError` if the bias != None and in_channels != out_channels, for the weight of transposed conv has the shape `(in_channels, out_channels, kH, kW)` while the weight of vanilla conv has the shape `(out_channels, in_channels, kH, kW)`
```
RuntimeError: Given transposed=1, weight of size [channel1, channel2, kH, kW], expected bias to be 1-dimensional with channel2 elements, but got bias of size [channel1] instead
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12281
Differential Revision: D10217370
Pulled By: ezyang
fbshipit-source-id: bc00b439e5ae539276a5e678bdb92af700197bb2
Summary:
- fixes https://github.com/pytorch/pytorch/issues/10723
- migrate PReLU to ATen and deprecate legacy PReLU
- performance:
CPU with weight.numel() = 1
```
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
100 loops, best of 100: 9.43 ms per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
10 loops, best of 100: 24.4 ms per loop
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 695 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.47 ms per loop
```
CPU with weight.numel() = channels
```
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 603 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 13.3 ms per loop
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 655 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.45 ms per loop
```
CUDA with weight.numel() = 1
```
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 187 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.01 ms per loop
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 195 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.28 ms per loop
```
CUDA with weight.numel() = channel
```
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 174 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.27 ms per loop
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 181 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.26 ms per loop
```
The huge performance regression in CPU when weight.numel() = 1 is addressed by replacing at::CPU_tensor_apply* with parallelized kernels.
ezyang SsnL zou3519 soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11758
Differential Revision: D9995799
Pulled By: weiyangfb
fbshipit-source-id: d289937c78075f46a54dafbde92fab0cc4b5b86e
Summary:
This PR vectorizes the CPU grid sample 2d forward and backward kernels. Specifically,
1. add `.data()` in `TensorAccessor`
2. support non-void return value for declaring CPU kernel stub
2. add `bool at:: geometry_is_contiguous(IntList sizes, IntList strides)`
1. The following vectorized CPU primitives are added:
+ `gather<scale>(baseaddr, vindex)`: `result[i] = baseaddr[vindex[i] * scale]`
+ `mask_gather<scale>(src, baseaddr, vindex, mask)`: `result[i] = mask[i] ? baseaddr[vindex[i] * scale] : src[i]`.
+ comparison ops
+ binary logical ops
+ `min(a, b)`
+ `cast<dst_t, src_t>(src_vec)`: changing dtype but keeping the bit representation
+ `blendv(a, b, mask)`: `result[i] = mask[i] ? b[i] : a[i]`.
+ ctor with multiple values (i.e., `setr`)
+ `arange(start = 0, step = 1)`: constructs a vector with values specified by the arange parameters
+ `convert_to_int_of_same_size(vec)`: convert floating point vector to corresponding integral type of same size
+ `interleave2(a, b)` & `deinterleave2(x, y)`: interleave or deinterleaves two vectors. E.g., for `interleave`:
```
inputs:
{a0, a1, a2, a3, a4, a5, a6, a7}
{b0, b1, b2, b3, b4, b5, b6, b7}
outputs:
{a0, b0, a1, b1, a2, b2, a3, b3}
{a4, b4, a5, b5, a6, b6, a7, b7}
```
2. Grid sample CPU kernel implementations are described in the following note (also in `GridSampleKernel.cpp`:
```
NOTE [ Grid Sample CPU Kernels ]
Implementation of vectorized grid sample CPU kernels is divided into three
parts:
1. `ComputeLocation` struct
Transforms grid values into interpolation locations of the input tensor
for a particular spatial dimension, basing on the size of that dimension
in input tensor, and the padding mode.
```
```cpp
template<typename scalar_t, GridSamplerPadding padding>
struct ComputeLocation {
using Vec = Vec256<scalar_t>;
// ctor
ComputeLocation(int64_t size);
// Given grid values `in`, return the interpolation locations after
// un-normalization and padding mechanism (elementwise).
Vec apply(const Vec &in) const;
// Similar to `apply`, but also returns `d apply(in) / d in`
// (elementwise).
// this is often used in gradient computation.
std::pair<Vec, Vec> apply_get_grad(const Vec &in) const;
};
```
```
2. `ApplyGridSample` struct
Owns N `ComputeLocation` structs, where N is the number of spatial
dimensions. Given N input grid vectors (one for each spatial dimension)
and spatial offset, it gets the interpolation locations from
`ComputeLocation`s, applies interpolation procedure, and then writes to
the output (or grad_input & grad_grid in backward).
```
```cpp
template<typename scalar_t, int spatial_dim,
GridSamplerInterpolation interp,
GridSamplerPadding padding>
struct ApplyGridSample {
// ctor
ApplyGridSample(const TensorAccessor<scalar_t, 4>& input);
// Applies grid sampling (forward) procedure:
// 1. computes interpolation locations from grid values `grid_x` and
// `grid_y`,
// 2. interpolates output values using the locations and input data
// in `inp_slice`, and
// 3. writes the first `len` values in the interpolated vector to
// `out_slice` with spatial offset being `offset`.
//
// This assimes that `grid_x` and `grid_y` all contain valid grid
// values \in [-1, 1], even at indices greater than `len`.
//
// The `*_slice` argument namess mean samples within a batch (i.e.,
// with the batch dimension sliced out).
void forward(TensorAccessor<scalar_t, 3>& out_slice,
const TensorAccessor<scalar_t, 3>& inp_slice,
int64_t offset, const Vec& grid_x, const Vec& grid_y,
int64_t len) const;
// Applies grid sampling (backward) procedure. Arguments semantics
// and strategy are similar to those of `forward`.
void backward(TensorAccessor<scalar_t, 3>& gInp_slice,
TensorAccessor<scalar_t, 3>& gGrid_slice,
const TensorAccessor<scalar_t, 3>& gOut_slice,
const TensorAccessor<scalar_t, 3>& inp_slice,
int64_t offset, const Vec& grid_x, const Vec& grid_y,
int64_t len) const;
}
```
```
3. `grid_sample_2d_grid_slice_iterator` function
Among the tensors we work with, we know that the output tensors are
contiguous (i.e., `output` in forward, and `grad_input` & `grad_grid` in
backward), we need to randomly read `input` anyways, and `grad_output`
usually comes from autograd and is often contiguous. So we base our
iterating strategy on the geometry of grid.
`grid_sample_2d_grid_slice_iterator` function provides an abstract to
efficiently iterates through a `grid` slice (without batch dimension).
See comments of that function on the specific cases and strategies used.
```
```cpp
template<typename scalar_t, typename ApplyFn>
void grid_sample_2d_grid_slice_iterator(
const TensorAccessor<scalar_t, 3>& grid_slice,
const ApplyFn &apply_fn);
// `apply_fn` is a function/lambda that can be called as if it has
// declaration:
// void apply_fn(const Vec256<scalar_t>& grid_x,
// const Vec256<scalar_t>& grid_y,
// int64_t spatial_offset, int64_t len);
```
```
`apply_fn` will be called multiple times, and together cover the entire
output spatial space. Therefore, e.g., to implement forward 2d grid
sample, we can do
```
```cpp
ApplyGridSample<scalar_t, 2, interp, padding> grid_sample(input_accessor);
for (int n = 0; n < input_accessor.size(0); n++) {
grid_sample_2d_grid_slice_iterator(
grid_accessor[n],
[&](const Vec256<scalar_t>& grid_x, const Vec256<scalar_t>& grid_y,
int64_t spatial_offset, int64_t len) {
grid_sample.forward(out_accessor[n], input_accessor[n],
spatial_offset, grid_x, grid_y, len);
});
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10980
Differential Revision: D9564867
Pulled By: SsnL
fbshipit-source-id: 5b7c3c7ea63af00eec230ae9ee1c3e6c6c9679b4
Summary:
Add the gpu kernel version.
The parallelism I went with performs poorly when there are a large number of vectors, but they're all short, as I don't allocate the thread pool to wrap in that case.
Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```
Current performance specs are a little underwhelming, I'm in the process of debugging.
size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11102
Differential Revision: D9697305
Pulled By: erikbrinkman
fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
Summary:
* purge hcSPARSE now that rocSPARSE is available
* integrate a custom hcc and HIP
* hcc brings two important compiler fixes (fixes hundreds of unit tests)
* HIP brings a smart dispatcher that allows us to avoid a lot of static_casts (we haven't yet removed the automatic static_casts but this catches some occurrences the script did not catch)
* mark 5 unit tests skipping that have regressed w/ the new hcc (we don't know yet what is at fault)
* optimize bitonic sort - the comparator is always an empty struct - therefore passing it by value saves at least 3 bytes. It also removes an ambiguity around passing references to `__global__` functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11198
Differential Revision: D9652340
Pulled By: ezyang
fbshipit-source-id: f5af1d891189da820e3d13b7bed91a7a43154690
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893
Differential Revision: D9615053
Pulled By: ezyang
fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
Summary:
Also add single grad whitelist to the jit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10782
Reviewed By: ezyang
Differential Revision: D9583378
Pulled By: erikbrinkman
fbshipit-source-id: 069e5ae68ea7f3524dec39cf1d5fe9cd53941944
Summary:
When 0-sized dimension support is added, we expect an empty sparse tensor to be a 1-dimensional tensor of size `[0]`, with `sparseDims == 1` and `denseDims == 0`. Also, we expect the following invariants to be preserved at all times:
```
_sparseDims + _denseDims = len(shape)
_indices.shape: dimensionality: 2, shape: (_sparseDims, nnz)
_values.shape: dimensionality: 1 + _denseDims. shape: (nnz, shape[_sparseDims:])
```
This PR fixes various places where the invariants are not strictly enforced when 0-sized dimension support is enabled.
Tested and `test_sparse.py` passes locally on both CPU and CUDA with the `USE_TH_SIZE_ZERO_DIM` flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9279
Differential Revision: D8936683
Pulled By: yf225
fbshipit-source-id: 12f5cd7f52233d3b26af6edc20b4cdee045bcb5e
Summary:
This commit adds the ``buffers()`` and ``named_buffers()`` methods as
analogues of ``parameters()`` and ``named_parameters()``.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554
Reviewed By: SsnL
Differential Revision: D9367762
Pulled By: jma127
fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e
Summary:
This is the first of two changes that are supposed to improve how we handle RNNs in the JIT. They still get traced as `PythonOp`s, but now it will be much easier to actually expose them to the JIT as e.g. `aten::lstm`, and ignore the Python interpreter entirely. This needs some symbolic adjustments that will be part of a second PR.
Even when we fix symbolics, there will still be a bit of a problem with statefulness of the cuDNN API (we need a mutable cache for the dropout state, but our IR has no way of representing that).
zdevito ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10481
Reviewed By: ezyang
Differential Revision: D9341113
Pulled By: apaszke
fbshipit-source-id: 0ae30ead72a1b12044b7c12369d11e5ca8ec30b5
Summary:
Two tests in the 'nn' test bucket may fail when the torch.half
(float16) data type is used. The assertions used in the tests
intend to allow slight floating point imprecision in the results,
but the tolerances used for the comparisons are too strict for
the half type.
Relax the tolerances so that slight float16 imprecision won't
cause test failures.
The affected tests are:
- test_variable_sequence_cuda
- test_Conv2d_groups_nobias
For more information, see issue:
https://github.com/pytorch/pytorch/issues/7420
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10519
Differential Revision: D9343751
Pulled By: soumith
fbshipit-source-id: 90aedf48f6e22dd4fed9c7bde7cd7c7b6885845a
Summary:
I've implemented affine grid generation for volumetric (5d) inputs. The implementation is based off of the spatial implementation, extended by one dimension. I have a few questions about my implementation vs. the existing one that I will add inline.
I have some extensive test cases for the forward pass here: https://gist.github.com/elistevens/6e3bfb20d8d0652b83bd16b3e911285b However, they use `pytest.fixture` extensively, so I'm not sure the best way to incorporate them into the pytorch test suite. Suggestions? I have not tested backwards at all.
Diff probably best viewed with whitespace changes ignored.
Thanks for considering!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8322
Differential Revision: D9332335
Pulled By: SsnL
fbshipit-source-id: 1b3a91d078ef41a6d0a800514e49298fd817e4df
Summary:
closes#9702 .
cc jph00
Commit structure:
1. Change the index calculation logic. I will explain using 1-D for simplicity.
Previously we have (in pseudo code):
```
// 1. get the float locations from grid
scalar_t x = from_grid()
// 2. find the integral surrounding indices
int x_left = floor(x)
int x_right = x_left + 1
// 3. calculate the linear interpolate weights
scalar_t w_left = x_right - x
scalar_t w_right = x - x_left
// 4. manipulate the integral surrounding indices if needed
// (e.g., clip for border padding_mode)
x_left = manipulate(x_left, padding_mode)
x_right = manipulate(x_right, padding_mode)
// 5. interpolate
output_val = interpolate(w_left, w_right, x_left, x_right)
```
This is actually incorrect (and also unintuitive) because it calculates the
weights before manipulate out-of-boundary indices. Fortunately, this
isn't manifested in both of the current supported modes, `'zeros'` and
`'border'` padding:
+ `'zeros'`: doesn't clip
+ `'border'`: clips, but for out-of-bound `x` both `x_left` and `x_right` are
clipped to the same value, so weights don't matter
But this is a problem with reflection padding, since after each time we reflect,
the values of `w_left` and `w_right` should be swapped.
So in this commit I change the algorithm to (numbers corresponding to the
ordering in the above pseudo-code)
```
1. get float location
4. clip the float location
2. find the integral surrounding indices
3. calculate the linear interpolate weights
```
In the backward, because of this change, I need to add new variables to track
`d manipulate_output / d manipulate_input`, which is basically a multiplier
on the gradient calculated for `grid`. From benchmarking this addition doesn't
cause obvious slow downs.
2. Implement reflection padding. The indices will keep being reflected until
they become within boundary.
Added variant of `clip_coordinates` and `reflect_coordinates` to be used in
backward. E.g.,
```cpp
// clip_coordinates_set_grad works similarly to clip_coordinates except that
// it also returns the `d output / d input` via pointer argument `grad_in`.
// This is useful in the backward pass of grid_sampler.
scalar_t clip_coordinates_set_grad(scalar_t in, int64_t clip_limit, scalar_t *grad_in)
```
For example, if `in` is clipped in `'border'` mode, `grad_in` is set to `0`.
If `in` is reflected **odd** times in `'reflection'` mode, `grad_in`
is set to `-1`.
3. Implement nearest interpolation.
4. Add test cases
5. Add better input checking
Discussed with goldsborough for moving `operator<<` of `at::Device`,
`at::DeviceType` and `at::Layout` into `at` namespace. (Otherwise
`AT_CHECK` can't find them.)
6. Support empty tensors. cc gchanan
+ Make empty tensors not acceptable by cudnn.
+ Add `AT_ASSERT(kernel block size > 0)` if using `GET_BLOCKS`
+ Cache `numel` in `TensorGeometry`
I was going to use `numel` to test if cudnn descriptor should accept a
tensor, but it isn't used eventually. I can revert this if needed.
7. Add more test cases, including on input checking and empty tensors
8. Remove an obsolete comment
9. Update docs. Manually tested by generating docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10051
Differential Revision: D9123950
Pulled By: SsnL
fbshipit-source-id: ac3b4a0a36b39b5d02e83666cc6730111ce216f6
Summary:
While waiting for dropout to be fully ported to ATen, here's performance fix for the most common dropout case. Dropout is still in python function, I just added efficient path to it. I could not make inplace work, because generator always generates `return self` for inplace function, and I need to return both original tensor and mask, so inplace goes on the existing pass. Even with non-inplace version, since mask is now a ByteTensor, memory used is just a little larger than for inplace dropout, due to savings on mask.
Once dropout is moved to aten, these kernels still can be used for efficient implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9666
Reviewed By: SsnL
Differential Revision: D8948077
Pulled By: ezyang
fbshipit-source-id: 52990ef769471d957e464af635e5f9b4e519567a
Summary:
- fixes#9141, #9301
- use logsigmoid at multilabel_soft_margin_loss to make it more stable (NOT fixing legacy MultiLabelSoftMarginCriterion)
- return (N) instead of (N, C) to match the same behavior as MultiMarginLoss
- Note that with this PR, the following behavior is expected:
```
loss = F.multilabel_soft_margin_loss(outputs, labels, reduction='none')
loss_mean = F.multilabel_soft_margin_loss(outputs, labels, reduction='elementwise_mean')
loss_sum = F.multilabel_soft_margin_loss(outputs, labels, reduction='sum')
loss.sum() == loss_sum # True
loss.mean() == loss_mean # True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9965
Differential Revision: D9038402
Pulled By: weiyangfb
fbshipit-source-id: 0fa94c7b3cd370ea62bd6333f1a0e9bd0b8ccbb9