pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Thomas Viehmann	14004cbef6	Native batch norm (#13263 ) Summary: - Move batch norm from TH(CU)NN to native - Speedups in many cases (e.g. #12006) for CUDA due to new block/grid layout and Welford-type mean/variance calculations (the latter for training mode) - It splits the forward kernel in two pieces and reuses the evaluation kernel for the transformation. - We change the meaning of save_mean and save_invstd (aka save_var) to accscalar to maintain reasonable precision. Compared to the ill-fated #12368 - I changed the CPU kernel to not call `.sum()` from within parallel for. This seemed to have caused the breakage (NaN-results) in TestModels.test_dcgan_netG (thank you houseroad for the repro, errors in assessment of the fix are my own) - I updated the Half->Float upcasting in tensors to go through `t.type().scalarType()` instead of `t.dtype()`. - I have merged master Pull Request resolved: https://github.com/pytorch/pytorch/pull/13263 Differential Revision: D12946254 Pulled By: SsnL fbshipit-source-id: 3bb717ee250fbccaf10afe73722996aa4713d10d	2018-11-06 20:05:54 -08:00
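The Welford-type mean/variance calculation mentioned in the commit above can be illustrated in plain Python. This is only a sketch of the general single-pass technique, not the CUDA kernel from the PR. The function name `welford_mean_var` is hypothetical, and returning the biased (population) variance is an assumption made for the example.

```python
def welford_mean_var(values):
    """Single-pass mean and biased variance using Welford's update (illustrative)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # second factor uses the already-updated mean
    if count == 0:
        raise ValueError("need at least one value")
    return mean, m2 / count


mean, var = welford_mean_var([1.0, 2.0, 3.0, 4.0])
print(mean, var)
```

The update avoids accumulating a separate sum of squares, the quantity that tends to lose precision when the mean is large relative to the spread.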
Tongzhou Wang	2cd912bcc2	Fix more spectral norm bugs (#13350 ) Summary: Problems with SN and DP after #12671 : 1. in eval mode, `weight_orig` is not getting correct gradient #12737 . Fix: keep `v` vector around as a buffer and always calculate `W = W_orig / (u @ W_orig @ v)` even in eval. 2. in training mode, the `weight` buffer of the parallelized module is never updated, if someone touches `weight_orig` and/or `weight` and makes them stop sharing storage. So in `eval` the weight used is wrong. Fix: Make `weight` not a buffer anymore and always calculate it as above. 3. #12671 changed SN to update `u` in-place to make DP work correctly, but then it breaks backward through two forwards (e.g., the common GAN loss `D(real) - D(fake)`) because the vectors needed to backprop the 1st forward are changed in the 2nd forward. Fix: This PR clones `u` and `v` before using them. To maintain BC, I added a hook interface for producing and loading state_dict. This is ugly and we should really have a better interface for spectral_norm. But for the purpose of fixing this issue, I make this patch. Even if we have a better interface, a BC mechanism for loading legacy state_dict still needs to be done. cc crcrpar Pull Request resolved: https://github.com/pytorch/pytorch/pull/13350 Differential Revision: D12931044 Pulled By: SsnL fbshipit-source-id: 8be6f934eaa62414d76d2c644dedd7e1b7eb31ef	2018-11-06 19:16:13 -08:00
Soumith Chintala	a7ee632dff	Various Test and build fixes (#13556 ) Summary: - fixes weights-contiguous requirement for THCUNN Convolutions - Add tests that conv backward pass works for non-contiguous weights - fix RNN tests / error messages to be consistent and pass - relax weight grad precision for fp16 for a particular test - fix regression of CMAKE_PREFIX_PATH not passing through - add missing skipIfNoLapack annotations where needed Differential Revision: D12918456 Pulled By: soumith fbshipit-source-id: 8642d36bffcc6f2957800d6afa1e10bef2a91d05	2018-11-06 07:13:47 -08:00
Sam Gross	98f5c005da	Speed up CPU threshold and relu implementation (#13182 ) Summary: ``` The previous threshold implementation was not vectorized or parallelized. This speeds up ResNet-50 CPU inference [1] from ~88 ms to ~67 ms CPU timings: https://gist.github.com/colesbury/d0d1be6974841d62696dbde329a8fde8 1 thread (before vs. after) 10240: 17.4 us vs. 6.9 µs per loop 102400: 141 us vs. 39.8 µs per loop 16 threads (before vs. after) 10240: 17.4 us vs. 6.7 µs per loop 102400: 141 us vs. 14.3 µs per loop CUDA timings are not measurably different. [1]: compiled with MKL-DNN, 8 threads, batch norm merged into convolutions https://gist.github.com/colesbury/8a64897dae97558b3b82da665048c782 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/13182 Reviewed By: soumith Differential Revision: D12825105 Pulled By: colesbury fbshipit-source-id: 557da608ebb87db8a04adbb0d2882af4f2eb3c15	2018-11-05 12:51:29 -08:00
Tongzhou Wang	9f2b2cac37	Fix handling all empty bags in CUDA embedding bag (#13483 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/11847 Pull Request resolved: https://github.com/pytorch/pytorch/pull/13483 Differential Revision: D12902914 Pulled By: SsnL fbshipit-source-id: 577a53e815231e988da716b1ee5667e1f36408ca	2018-11-02 10:21:14 -07:00
Tongzhou Wang	99a5d19591	Rename elementwise_mean to mean (#13419 ) Summary: Closes #12459 Pull Request resolved: https://github.com/pytorch/pytorch/pull/13419 Differential Revision: D12883299 Pulled By: SsnL fbshipit-source-id: 8b4512ff73b66fdc674412904dbb3bf497ba70a7	2018-11-01 10:31:26 -07:00
Ailing Zhang	488d393ea6	Fix pointwise loss broadcast (#12996 ) Summary: Fixes #12129 , #12327 Differential Revision: D10513781 Pulled By: ailzhang fbshipit-source-id: a210008a39ff6c3f056c9fbe3f0576cfcce638ec	2018-10-31 10:17:25 -07:00
Lu Fang	f8864f0505	Revert "Move batch_norm to ATen/native, speed up (#12368 )" (#13191 ) Summary: Revert #12368 since it's causing onnx related test cases failing. https://github.com/pytorch/pytorch/pull/12368 SsnL Pull Request resolved: https://github.com/pytorch/pytorch/pull/13191 Reviewed By: BIT-silence Differential Revision: D12810778 Pulled By: houseroad fbshipit-source-id: 1c373b92628580097cffcd237dccc5b3d8697577	2018-10-26 23:05:50 -07:00
Zachary DeVito	dae7616078	Shard all of tests based on how many tests exist. (#13160 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13160 Reduces pytorch_core build from 2 hours to 30 minutes Reviewed By: soumith, dzhulgakov Differential Revision: D10524261 fbshipit-source-id: 97270ac73404b5ea4c264cd0e9d8d4b1be79b0e9	2018-10-26 18:20:34 -07:00
Thomas Viehmann	dc211c7de4	Move batch_norm to ATen/native, speed up (#12368 ) Summary: - Speed up the case of #12006 in the forward - The backward still isn't as fast as one might hope (factor 2-3 in the #12006 case). - More extensive benchmarking shows not so great performance compared to CuDNN for cases with many channels, e.g. bs=8-128 / c=1024 / f=1024. - We change the meaning of save_mean and save_invstd (aka save_var) to accscalar to maintain reasonable precision. Needless to say that I would happily separate the TensorAccessor fixes in a separate PR, as they're fixes and unrelated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12368 Differential Revision: D10559696 Pulled By: SsnL fbshipit-source-id: f0d0d1e0912e17b15b8fb7a2c03d0fe757598419	2018-10-25 23:41:10 -07:00
Richard Zou	7863c17b26	Fix convtranspose3d output_size calculation (#12952 ) Summary: Closes #2119. There was a small bug where the output_size got sliced with `[-2:]` where we really meant to slice it as `[2:]` (to remove the batch and channel dimensions). Added a new test for this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12952 Differential Revision: D10510678 Pulled By: zou3519 fbshipit-source-id: 4c04a5007fc6d002e1806d6fe981b43d33d6a4f2	2018-10-24 09:23:05 -07:00
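The slicing bug described in the commit above is easy to reproduce with plain lists. The sketch below is not the PyTorch source. The helper name `spatial_dims` and the size values are made up for illustration, and the layout `(N, C, D, H, W)` is assumed for the 3d case.

```python
def spatial_dims(output_size):
    """Drop the leading batch and channel entries of a full output size."""
    return list(output_size[2:])


full_size_3d = [8, 16, 10, 20, 30]  # (N, C, D, H, W), illustrative values
wrong = list(full_size_3d[-2:])     # keeps only the last two entries (H, W)
right = spatial_dims(full_size_3d)  # keeps all three spatial entries (D, H, W)
print(wrong, right)

full_size_2d = [8, 16, 20, 30]      # (N, C, H, W)
# With four entries both slices select the same two elements.
print(list(full_size_2d[-2:]) == spatial_dims(full_size_2d))
```

For a four-element size the two slices agree, so the difference only shows up once a third spatial dimension is present.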
Soumith Chintala	cf235e0894	fix lint after new flake8 release added new style constraints (#13047 ) Summary: fix lint after new flake8 release added new style constraints Pull Request resolved: https://github.com/pytorch/pytorch/pull/13047 Differential Revision: D10527804 Pulled By: soumith fbshipit-source-id: 6f4d02662570b6339f69117b61037c8394b0bbd8	2018-10-24 09:03:38 -07:00
Wei Yang	710191e292	fix error message of large kernel size in conv2D (#12791 ) Summary: - fix #12565 - test plan: with this fix, we have: ``` >>> m = nn.Conv2d(in_channels=3, out_channels=33, kernel_size=10, stride=1, bias=True) >>> input = torch.randn(1, 3, 1, 1) >>> output = m(input) ``` RuntimeError: Calculated padded input size per channel: (1 x 1). Kernel size: (10 x 10). Kernel size can't be greater than actual input size at ~/pytorch/aten/src/THNN/generic/SpatialConvolutionMM.c:50 not sure why these are `int` instead of `int64_t`: `5ccdd7a626/aten/src/THNN/generic/SpatialConvolutionMM.c (L10)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/12791 Differential Revision: D10443045 Pulled By: weiyangfb fbshipit-source-id: 2620acb40bdd49d29cec06337f6dfb4653d1987c	2018-10-18 00:51:16 -07:00
James Sun	f4944f0f8a	Rename test/common.py to test/common_utils.py (#12794 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794 common.py is used in base_module for almost all tests in test/. The name of this file is so common that it can easily conflict with other dependencies if they happen to have another common.py in the base module. Rename the file to avoid conflict. Reviewed By: orionr Differential Revision: D10438204 fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380	2018-10-17 23:04:29 -07:00
Thomas Viehmann	ba25e13782	Forbid Module.to with copy argument. (#12617 ) Summary: Module.to uses the Tensor.to parsing facility. It should not, however, accept "copy" as a keyword/fourth positional argument. See #12571 for discussion. Thank you SsnL for noticing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12617 Differential Revision: D10392053 Pulled By: ezyang fbshipit-source-id: b67a5def7993189b4b47193abc7b741b7d07512c	2018-10-16 20:31:44 -07:00
Tongzhou Wang	ac994f2c78	Fix SpectralNorm with DataParallel (#12671 ) Summary: There were two problems with SN + DP: 1. In SN, the updated _u vector is saved back to module via a `setattr`. However, in DP, everything is run on a replica, so those updates are lost. 2. In DP, the buffers are broadcast via a `broadcast_coalesced`, so on replicas they are all views. Therefore, the `detach_` call won't work. Fixes are: 1. Update _u vector in-place so, by the shared storage between 1st replica and the parallelized module, the update is retained 2. Do not call `detach_`. 3. Added comments in SN about the subtlety. 4. Added a note to the DP doc on this particular behavior of DP. cc crcrpar taesung89 yaoshengfu Fixes https://github.com/pytorch/pytorch/issues/11476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/12671 Differential Revision: D10410232 Pulled By: SsnL fbshipit-source-id: c447951844a30366d8c196bf9436340e88f3b6d9	2018-10-16 16:02:17 -07:00
Ailing Zhang	e15501fb68	fix bce_with_logits with legacy reduce (#12689 ) Summary: Fix #12624 . internal usecase of legacy `reduce`. Add test in test_nn Pull Request resolved: https://github.com/pytorch/pytorch/pull/12689 Reviewed By: ezyang Differential Revision: D10391195 Pulled By: ailzhang fbshipit-source-id: 1af2b258c4abb2b6527eaaeac63e8bf1762c66a1	2018-10-16 09:46:58 -07:00
Natalia Gimelshein	a98958d3bd	dtype option for softmax (#11719 ) Summary: Add dtype argument to softmax/log_softmax functions. Computing softmax in fp32 precision is necessary for mixed precision training, and converting output of the previous layer into fp32 and then reading it as fp32 in softmax is expensive, memory and perf-wise, this PR allows one to avoid it. For most input data/dtype combinations, input data is converted to dtype and then softmax is computed. If input data is half type and dtype is fp32, kernels with the corresponding template arguments are called. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11719 Reviewed By: ezyang Differential Revision: D10175514 Pulled By: zou3519 fbshipit-source-id: 06d285af91a0b659932236d41ad63b787eeed243	2018-10-13 17:57:10 -07:00
Tongzhou Wang	d400502b1d	Fix a bunch of warnings in TestNN Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12453 Differential Revision: D10244130 Pulled By: SsnL fbshipit-source-id: e425c76bfb721fe118a32ddd1fa6eca3a3cd86f0	2018-10-08 17:38:23 -07:00
daquexian	f8086845aa	Fix bug in grad.py when conv bias != None (#12281 ) Summary: Obviously, the grads of conv weight and conv input are not relevant to the bias, but the original `convXd_input` and `convXd_weight` methods receive a `bias` parameter. What's more, while the doc says `bias` should have the shape `(out_channels,)`, one will get a `RuntimeError` if the bias != None and in_channels != out_channels, for the weight of transposed conv has the shape `(in_channels, out_channels, kH, kW)` while the weight of vanilla conv has the shape `(out_channels, in_channels, kH, kW)` ``` RuntimeError: Given transposed=1, weight of size [channel1, channel2, kH, kW], expected bias to be 1-dimensional with channel2 elements, but got bias of size [channel1] instead ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/12281 Differential Revision: D10217370 Pulled By: ezyang fbshipit-source-id: bc00b439e5ae539276a5e678bdb92af700197bb2	2018-10-05 12:55:14 -07:00
Johannes M Dieterich	c9f7d7b506	mark unit tests as working, skip failing unit test (#12313 ) Summary: * enabled fp16 tests for test_torch * enable fp16 tests for test_nn * enabled multilabelmargin loss for fp16 * removed skip for test_pdist_empty_col * Enable test_nn tests that pass with compiler fixes etc. * Enable test_legacy_nn tests that pass with compiler fixes etc. ezyang bddppq Pull Request resolved: https://github.com/pytorch/pytorch/pull/12313 Differential Revision: D10189922 Pulled By: bddppq fbshipit-source-id: a5592817c04b14e355cb062d42ebea406f0c92b6	2018-10-03 23:56:26 -07:00
iotamudelta	a2ebbccc9f	fix unit tests on CI Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12187 Differential Revision: D10118483 Pulled By: bddppq fbshipit-source-id: 986c8fb48d61e00103c713548a50e74489a0e442	2018-09-28 23:11:55 -07:00
Wei Yang	de11fe0c83	migrate PReLU to ATen (#11758 ) Summary: - fixes https://github.com/pytorch/pytorch/issues/10723 - migrate PReLU to ATen and deprecate legacy PReLU - performance: CPU with weight.numel() = 1 ``` >>> m = nn.PReLU() >>> x = torch.randn(100, 100, 100, requires_grad=True) >>> %timeit -r 100 y = m(x) 100 loops, best of 100: 9.43 ms per loop >>> y = m(x).sum() >>> %timeit -r 100 y.backward(retain_graph=True) 10 loops, best of 100: 24.4 ms per loop >>> m = nn.PReLU() >>> x = torch.randn(100, 100, 100, requires_grad=True) >>> %timeit -r 100 y = m(x) 1000 loops, best of 100: 695 µs per loop >>> y = m(x).sum() >>> %timeit -r 100 y.backward(retain_graph=True) 100 loops, best of 100: 2.47 ms per loop ``` CPU with weight.numel() = channels ``` >>> m = nn.PReLU(100) >>> x = torch.randn(100, 100, 100, requires_grad=True) >>> %timeit -r 100 y = m(x) 1000 loops, best of 100: 603 µs per loop >>> y = m(x).sum() >>> %timeit -r 100 y.backward(retain_graph=True) 100 loops, best of 100: 13.3 ms per loop >>> m = nn.PReLU(100) >>> x = torch.randn(100, 100, 100, requires_grad=True) >>> %timeit -r 100 y = m(x) 1000 loops, best of 100: 655 µs per loop >>> y = m(x).sum() >>> %timeit -r 100 y.backward(retain_graph=True) 100 loops, best of 100: 2.45 ms per loop ``` CUDA with weight.numel() = 1 ``` >>> m = nn.PReLU().cuda() >>> x = torch.randn(100, 100, 100, requires_grad=True).cuda() >>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize(); 10000 loops, best of 100: 187 µs per loop >>> y = m(x).sum() >>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize(); 100 loops, best of 100: 2.01 ms per loop >>> m = nn.PReLU().cuda() >>> x = torch.randn(100, 100, 100, requires_grad=True).cuda() >>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize(); 1000 loops, best of 100: 195 µs per loop >>> y = m(x).sum() >>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); 
torch.cuda.synchronize(); 100 loops, best of 100: 2.28 ms per loop ``` CUDA with weight.numel() = channel ``` >>> m = nn.PReLU(100).cuda() >>> x = torch.randn(100, 100, 100, requires_grad=True).cuda() >>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize(); 1000 loops, best of 100: 174 µs per loop >>> y = m(x).sum() >>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize(); 100 loops, best of 100: 2.27 ms per loop >>> m = nn.PReLU(100).cuda() >>> x = torch.randn(100, 100, 100, requires_grad=True).cuda() >>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize(); 10000 loops, best of 100: 181 µs per loop >>> y = m(x).sum() >>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize(); 100 loops, best of 100: 2.26 ms per loop ``` The huge performance regression in CPU when weight.numel() = 1 is addressed by replacing at::CPU_tensor_apply* with parallelized kernels. ezyang SsnL zou3519 soumith Pull Request resolved: https://github.com/pytorch/pytorch/pull/11758 Differential Revision: D9995799 Pulled By: weiyangfb fbshipit-source-id: d289937c78075f46a54dafbde92fab0cc4b5b86e	2018-09-21 16:26:04 -07:00
Thomas Viehmann	775358e4c2	Add non-legacy test of bilinear (#11935 ) Summary: Fixes: #11905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/11935 Differential Revision: D9991120 Pulled By: soumith fbshipit-source-id: b00ad4f405440664ae5228b229a2ba0a5d3d92f6	2018-09-21 12:43:35 -07:00
Christian Puhrsch	d8f6be686d	Remove torch/legacy (#11823 ) Summary: Largely unused and hinders current development Pull Request resolved: https://github.com/pytorch/pytorch/pull/11823 Differential Revision: D9925094 Pulled By: cpuhrsch fbshipit-source-id: c797f62180e2128f9a567b0c57c8347957470ea5	2018-09-20 14:00:54 -07:00
Wei Yang	8aedc27a63	checking device types of input and weights at RNN (#10185 ) Summary: - fixes #9534 Pull Request resolved: https://github.com/pytorch/pytorch/pull/10185 Differential Revision: D9141222 Pulled By: weiyangfb fbshipit-source-id: bb652e42cc15917019df080d6bce2926b18f3476	2018-09-18 20:26:02 -07:00
Xingdong Zuo	e2bc95e1bd	add `ModuleList.insert` (#11664 ) Summary: fixes #11652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/11664 Differential Revision: D9892845 Pulled By: ezyang fbshipit-source-id: 2c910d6bc0b28a999e25beca6e398fd0f35535c5	2018-09-18 07:41:28 -07:00
Tongzhou Wang	7df6650e9c	Fix empty embedding bag on cuda (#11740 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/11739 Pull Request resolved: https://github.com/pytorch/pytorch/pull/11740 Differential Revision: D9881392 Pulled By: SsnL fbshipit-source-id: 2964d314f199dd9b4bb69e36592b67efdf5e0760	2018-09-17 14:40:03 -07:00
Tongzhou Wang	6f6b03566b	Vectorize grid sample 2d CPU kernels (#10980 ) Summary: This PR vectorizes the CPU grid sample 2d forward and backward kernels. Specifically, 1. add `.data()` in `TensorAccessor` 2. support non-void return value for declaring CPU kernel stub 2. add `bool at::geometry_is_contiguous(IntList sizes, IntList strides)` 1. The following vectorized CPU primitives are added: + `gather<scale>(baseaddr, vindex)`: `result[i] = baseaddr[vindex[i] * scale]` + `mask_gather<scale>(src, baseaddr, vindex, mask)`: `result[i] = mask[i] ? baseaddr[vindex[i] * scale] : src[i]`. + comparison ops + binary logical ops + `min(a, b)` + `cast<dst_t, src_t>(src_vec)`: changing dtype but keeping the bit representation + `blendv(a, b, mask)`: `result[i] = mask[i] ? b[i] : a[i]`. + ctor with multiple values (i.e., `setr`) + `arange(start = 0, step = 1)`: constructs a vector with values specified by the arange parameters + `convert_to_int_of_same_size(vec)`: convert floating point vector to corresponding integral type of same size + `interleave2(a, b)` & `deinterleave2(x, y)`: interleave or deinterleaves two vectors. E.g., for `interleave`: ``` inputs: {a0, a1, a2, a3, a4, a5, a6, a7} {b0, b1, b2, b3, b4, b5, b6, b7} outputs: {a0, b0, a1, b1, a2, b2, a3, b3} {a4, b4, a5, b5, a6, b6, a7, b7} ``` 2. Grid sample CPU kernel implementations are described in the following note (also in `GridSampleKernel.cpp`): ``` NOTE [ Grid Sample CPU Kernels ] Implementation of vectorized grid sample CPU kernels is divided into three parts: 1. `ComputeLocation` struct Transforms grid values into interpolation locations of the input tensor for a particular spatial dimension, based on the size of that dimension in input tensor, and the padding mode. 
``` ```cpp template<typename scalar_t, GridSamplerPadding padding> struct ComputeLocation { using Vec = Vec256<scalar_t>; // ctor ComputeLocation(int64_t size); // Given grid values `in`, return the interpolation locations after // un-normalization and padding mechanism (elementwise). Vec apply(const Vec &in) const; // Similar to `apply`, but also returns `d apply(in) / d in` // (elementwise). // this is often used in gradient computation. std::pair<Vec, Vec> apply_get_grad(const Vec &in) const; }; ``` ``` 2. `ApplyGridSample` struct Owns N `ComputeLocation` structs, where N is the number of spatial dimensions. Given N input grid vectors (one for each spatial dimension) and spatial offset, it gets the interpolation locations from `ComputeLocation`s, applies interpolation procedure, and then writes to the output (or grad_input & grad_grid in backward). ``` ```cpp template<typename scalar_t, int spatial_dim, GridSamplerInterpolation interp, GridSamplerPadding padding> struct ApplyGridSample { // ctor ApplyGridSample(const TensorAccessor<scalar_t, 4>& input); // Applies grid sampling (forward) procedure: // 1. computes interpolation locations from grid values `grid_x` and // `grid_y`, // 2. interpolates output values using the locations and input data // in `inp_slice`, and // 3. writes the first `len` values in the interpolated vector to // `out_slice` with spatial offset being `offset`. // // This assumes that `grid_x` and `grid_y` all contain valid grid // values \in [-1, 1], even at indices greater than `len`. // // The `*_slice` argument names mean samples within a batch (i.e., // with the batch dimension sliced out). void forward(TensorAccessor<scalar_t, 3>& out_slice, const TensorAccessor<scalar_t, 3>& inp_slice, int64_t offset, const Vec& grid_x, const Vec& grid_y, int64_t len) const; // Applies grid sampling (backward) procedure. Arguments semantics // and strategy are similar to those of `forward`. 
void backward(TensorAccessor<scalar_t, 3>& gInp_slice, TensorAccessor<scalar_t, 3>& gGrid_slice, const TensorAccessor<scalar_t, 3>& gOut_slice, const TensorAccessor<scalar_t, 3>& inp_slice, int64_t offset, const Vec& grid_x, const Vec& grid_y, int64_t len) const; } ``` ``` 3. `grid_sample_2d_grid_slice_iterator` function Among the tensors we work with, we know that the output tensors are contiguous (i.e., `output` in forward, and `grad_input` & `grad_grid` in backward), we need to randomly read `input` anyways, and `grad_output` usually comes from autograd and is often contiguous. So we base our iterating strategy on the geometry of grid. `grid_sample_2d_grid_slice_iterator` function provides an abstraction to efficiently iterate through a `grid` slice (without batch dimension). See comments of that function on the specific cases and strategies used. ``` ```cpp template<typename scalar_t, typename ApplyFn> void grid_sample_2d_grid_slice_iterator( const TensorAccessor<scalar_t, 3>& grid_slice, const ApplyFn &apply_fn); // `apply_fn` is a function/lambda that can be called as if it has // declaration: // void apply_fn(const Vec256<scalar_t>& grid_x, // const Vec256<scalar_t>& grid_y, // int64_t spatial_offset, int64_t len); ``` ``` `apply_fn` will be called multiple times, and together cover the entire output spatial space. 
Therefore, e.g., to implement forward 2d grid sample, we can do ``` ```cpp ApplyGridSample<scalar_t, 2, interp, padding> grid_sample(input_accessor); for (int n = 0; n < input_accessor.size(0); n++) { grid_sample_2d_grid_slice_iterator( grid_accessor[n], [&](const Vec256<scalar_t>& grid_x, const Vec256<scalar_t>& grid_y, int64_t spatial_offset, int64_t len) { grid_sample.forward(out_accessor[n], input_accessor[n], spatial_offset, grid_x, grid_y, len); }); } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/10980 Differential Revision: D9564867 Pulled By: SsnL fbshipit-source-id: 5b7c3c7ea63af00eec230ae9ee1c3e6c6c9679b4	2018-09-16 20:41:10 -07:00
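The `interleave2` / `deinterleave2` semantics listed in the commit above can be mirrored with plain Python lists. This is a scalar sketch of the element ordering only, not the `Vec256` implementation; the Python function signatures are assumptions made for the example.

```python
def interleave2(a, b):
    """Interleave two equal-length sequences and split the result in half."""
    assert len(a) == len(b)
    mixed = [v for pair in zip(a, b) for v in pair]  # a0, b0, a1, b1, ...
    half = len(a)
    return mixed[:half], mixed[half:]


def deinterleave2(x, y):
    """Inverse of interleave2: recover the two original sequences."""
    mixed = list(x) + list(y)
    return mixed[0::2], mixed[1::2]


a = ["a%d" % i for i in range(8)]
b = ["b%d" % i for i in range(8)]
lo, hi = interleave2(a, b)
print(lo)  # first output vector
print(hi)  # second output vector
```

Applying `deinterleave2(lo, hi)` returns the original `a` and `b`, which is the round trip the two primitives are meant to provide.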
Edward Yang	74197c7115	Restore support for dim=None on WeightNorm. (#11661 ) Summary: Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/11661 Reviewed By: veenix Differential Revision: D9826799 Pulled By: ezyang fbshipit-source-id: 9eec57bb27a365406669e412f6eb88741b22ed3d	2018-09-14 07:39:43 -07:00
Wei Yang	54107ae8cf	convert output_device at data_parallel from torch.device to index (#10189 ) Summary: - fixes #9984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/10189 Differential Revision: D9545390 Pulled By: weiyangfb fbshipit-source-id: 3a6a705437553ba319e9fd4b7f676ff73857a27e	2018-09-11 20:27:07 -07:00
Erik Brinkman	91089a7e17	Add GPU implementation of pdist (#11102 ) Summary: Add the gpu kernel version. The parallelism I went with performs poorly when there are a large number of vectors, but they're all short, as I don't allocate the thread pool to wrap in that case. Test Plan --------- ``` python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist ``` Current performance specs are a little underwhelming, I'm in the process of debugging.

size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/11102 Differential Revision: D9697305 Pulled By: erikbrinkman fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99	2018-09-07 09:09:46 -07:00
iotamudelta	9de2085806	Use custom hcc/HIP, purge hcSPARSE (#11198 ) Summary: * purge hcSPARSE now that rocSPARSE is available * integrate a custom hcc and HIP * hcc brings two important compiler fixes (fixes hundreds of unit tests) * HIP brings a smart dispatcher that allows us to avoid a lot of static_casts (we haven't yet removed the automatic static_casts but this catches some occurrences the script did not catch) * mark 5 unit tests skipping that have regressed w/ the new hcc (we don't know yet what is at fault) * optimize bitonic sort - the comparator is always an empty struct - therefore passing it by value saves at least 3 bytes. It also removes an ambiguity around passing references to `__global__` functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/11198 Differential Revision: D9652340 Pulled By: ezyang fbshipit-source-id: f5af1d891189da820e3d13b7bed91a7a43154690	2018-09-06 19:38:07 -07:00
iotamudelta	33c7cc13ca	improve docker packages, fix bugs, enable tests, enable FFT (#10893 ) Summary: * improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs) * integrate rocFFT (i.e., enable Fourier functionality) * fix bugs in ROCm caused by wrong warp size * enable more test sets, skip the tests that don't work on ROCm yet * don't disable asserts any longer in hipification * small improvements Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893 Differential Revision: D9615053 Pulled By: ezyang fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b	2018-09-02 08:54:42 -07:00
Tongzhou Wang	e85f3fccb3	Fix relying on UB in test_data_parallel_nested_output (#11092 ) Summary: We shouldn't rely on plain `dict` ordering. Example failure: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-xenial-cuda8-cudnn6-py3-test1/8417/console Pull Request resolved: https://github.com/pytorch/pytorch/pull/11092 Reviewed By: ezyang Differential Revision: D9583274 Pulled By: SsnL fbshipit-source-id: ba80b96648c98c24c2ec5fa6fd9aa566c095cce7	2018-08-30 13:10:25 -07:00
Erik Brinkman	611a608517	Add ATen pdist CPU kernel (#10782 ) Summary: Also add single grad whitelist to the jit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/10782 Reviewed By: ezyang Differential Revision: D9583378 Pulled By: erikbrinkman fbshipit-source-id: 069e5ae68ea7f3524dec39cf1d5fe9cd53941944	2018-08-30 11:55:27 -07:00
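For readers unfamiliar with the operation, a pairwise-distance computation of the kind the `pdist` commits refer to can be written as a slow reference in pure Python. This is a sketch under assumptions: the name `pdist_reference` is hypothetical, the output order (row pairs `(i, j)` with `i < j`, row-major) follows the usual condensed form, and none of this reflects how the ATen kernel is organized.

```python
def pdist_reference(rows, p=2.0):
    """p-norm distance between every pair of rows, for i < j (illustrative)."""
    n = len(rows)
    out = []
    for i in range(n):
        for j in range(i + 1, n):
            total = sum(abs(a - b) ** p for a, b in zip(rows[i], rows[j]))
            out.append(total ** (1.0 / p))
    return out


points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
distances = pdist_reference(points)
print(distances)
```

An input with `n` rows yields `n * (n - 1) / 2` distances, so the work grows quadratically with the number of vectors.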
Will Feng	b14f2e899c	Preserve sparse tensor shape and dim invariants, and add scalar tensor support (#9279 ) Summary: When 0-sized dimension support is added, we expect an empty sparse tensor to be a 1-dimensional tensor of size `[0]`, with `sparseDims == 1` and `denseDims == 0`. Also, we expect the following invariants to be preserved at all times: ``` _sparseDims + _denseDims = len(shape) _indices.shape: dimensionality: 2, shape: (_sparseDims, nnz) _values.shape: dimensionality: 1 + _denseDims. shape: (nnz, shape[_sparseDims:]) ``` This PR fixes various places where the invariants are not strictly enforced when 0-sized dimension support is enabled. Tested and `test_sparse.py` passes locally on both CPU and CUDA with the `USE_TH_SIZE_ZERO_DIM` flag. Pull Request resolved: https://github.com/pytorch/pytorch/pull/9279 Differential Revision: D8936683 Pulled By: yf225 fbshipit-source-id: 12f5cd7f52233d3b26af6edc20b4cdee045bcb5e	2018-08-23 10:10:24 -07:00
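The invariants quoted in the commit above can be expressed as a small shape check. The sketch below works on plain tuples rather than real sparse tensors; the function name `check_sparse_invariants` and its argument names are hypothetical and chosen only for this illustration.

```python
def check_sparse_invariants(shape, sparse_dims, dense_dims,
                            indices_shape, values_shape):
    """Return True if the shapes satisfy the sparse-tensor invariants."""
    if sparse_dims + dense_dims != len(shape):
        return False
    if len(indices_shape) != 2:  # indices must be 2-dimensional
        return False
    nnz = indices_shape[1]
    if tuple(indices_shape) != (sparse_dims, nnz):
        return False
    # values: one leading nnz dimension followed by the dense part of shape
    return tuple(values_shape) == (nnz,) + tuple(shape[sparse_dims:])


# A (4, 5, 3) tensor with 2 sparse dims, 1 dense dim, and 6 stored entries.
ok = check_sparse_invariants((4, 5, 3), 2, 1, (2, 6), (6, 3))
# The empty case from the commit message: size [0], sparseDims == 1, denseDims == 0.
empty_ok = check_sparse_invariants((0,), 1, 0, (1, 0), (0,))
print(ok, empty_ok)
```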
Tongzhou Wang	de11a5fb28	Resubmit #8322 with scipy version check Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10775 Differential Revision: D9458207 Pulled By: SsnL fbshipit-source-id: f2b0dbf2d236134afded9b15d8bf55ff98f50e7b	2018-08-22 13:39:49 -07:00
Tongzhou Wang	c5c1c051ca	Fix dropout fused kernel applied in eval mode (#10621 ) Summary: fixes https://github.com/pytorch/pytorch/issues/10584 cc apaszke Pull Request resolved: https://github.com/pytorch/pytorch/pull/10621 Differential Revision: D9379397 Pulled By: SsnL fbshipit-source-id: 5ff2939ba794af082ce597ef289a09ee757636dc	2018-08-17 14:54:42 -07:00
Jerry Ma	afd7477eaa	Add ``buffers()``, ``named_buffers()`` methods. (#10554 ) Summary: This commit adds the ``buffers()`` and ``named_buffers()`` methods as analogues of ``parameters()`` and ``named_parameters()``. Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554 Reviewed By: SsnL Differential Revision: D9367762 Pulled By: jma127 fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e	2018-08-16 16:26:48 -07:00
Simon Wang	a129f9ad3b	Revert D9332335: [pytorch][PR] Implements volumetric (5d) affine grid generation. Differential Revision: D9332335 Original commit changeset: 1b3a91d078ef fbshipit-source-id: 3dcce680257a6da121f5d67918ed4236e0c5bfec	2018-08-15 15:25:11 -07:00
Adam Paszke	86363e1d8e	Move RNN implementations to C++ (#10481 ) Summary: This is the first of two changes that are supposed to improve how we handle RNNs in the JIT. They still get traced as `PythonOp`s, but now it will be much easier to actually expose them to the JIT as e.g. `aten::lstm`, and ignore the Python interpreter entirely. This needs some symbolic adjustments that will be part of a second PR. Even when we fix symbolics, there will still be a bit of a problem with statefulness of the cuDNN API (we need a mutable cache for the dropout state, but our IR has no way of representing that). zdevito ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/10481 Reviewed By: ezyang Differential Revision: D9341113 Pulled By: apaszke fbshipit-source-id: 0ae30ead72a1b12044b7c12369d11e5ca8ec30b5	2018-08-15 13:25:41 -07:00
Tongzhou Wang	254dedf604	Propagate NaN through threshold (#10277 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/10238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/10277 Reviewed By: SsnL Differential Revision: D9199825 Pulled By: soumith fbshipit-source-id: 8ee7f9a72d9546d429f311c3f6028461d3c93fe2	2018-08-15 12:59:31 -07:00
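Why a threshold operation can silently swallow NaN comes down to how the comparison is written, since every ordered comparison with NaN is false. The scalar sketch below shows two formulations; both function names are hypothetical, and whether the PR's kernels use exactly this form is an assumption, not something stated in the commit message.

```python
import math


def threshold_drops_nan(x, threshold, value):
    """`x > threshold` is False for NaN, so NaN is replaced by `value`."""
    return x if x > threshold else value


def threshold_keeps_nan(x, threshold, value):
    """`x <= threshold` is also False for NaN, so NaN falls through as `x`."""
    return value if x <= threshold else x


nan = float("nan")
print(threshold_drops_nan(nan, 0.0, 0.0))  # NaN is lost
print(threshold_keeps_nan(nan, 0.0, 0.0))  # NaN propagates
```

For ordinary inputs the two functions agree; they differ only when `x` is NaN.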
Brian Hart	9cffe783f1	relax tolerance for two torch.half (float16) tests (#10519 ) Summary: Two tests in the 'nn' test bucket may fail when the torch.half (float16) data type is used. The assertions used in the tests intend to allow slight floating point imprecision in the results, but the tolerances used for the comparisons are too strict for the half type. Relax the tolerances so that slight float16 imprecision won't cause test failures. The affected tests are: - test_variable_sequence_cuda - test_Conv2d_groups_nobias For more information, see issue: https://github.com/pytorch/pytorch/issues/7420 Pull Request resolved: https://github.com/pytorch/pytorch/pull/10519 Differential Revision: D9343751 Pulled By: soumith fbshipit-source-id: 90aedf48f6e22dd4fed9c7bde7cd7c7b6885845a	2018-08-15 12:11:20 -07:00
Eli Stevens	f5a4dd89b5	Implements volumetric (5d) affine grid generation. (#8322 ) Summary: I've implemented affine grid generation for volumetric (5d) inputs. The implementation is based off of the spatial implementation, extended by one dimension. I have a few questions about my implementation vs. the existing one that I will add inline. I have some extensive test cases for the forward pass here: https://gist.github.com/elistevens/6e3bfb20d8d0652b83bd16b3e911285b However, they use `pytest.fixture` extensively, so I'm not sure the best way to incorporate them into the pytorch test suite. Suggestions? I have not tested backwards at all. Diff probably best viewed with whitespace changes ignored. Thanks for considering! Pull Request resolved: https://github.com/pytorch/pytorch/pull/8322 Differential Revision: D9332335 Pulled By: SsnL fbshipit-source-id: 1b3a91d078ef41a6d0a800514e49298fd817e4df	2018-08-15 11:02:08 -07:00
Tongzhou Wang	6a55238a3f	Grid sampler: nearest interpolation & reflection padding (#10051 ) Summary: closes #9702 . cc jph00 Commit structure: 1. Change the index calculation logic. I will explain using 1-D for simplicity. Previously we have (in pseudo code): ``` // 1. get the float locations from grid scalar_t x = from_grid() // 2. find the integral surrounding indices int x_left = floor(x) int x_right = x_left + 1 // 3. calculate the linear interpolate weights scalar_t w_left = x_right - x scalar_t w_right = x - x_left // 4. manipulate the integral surrounding indices if needed // (e.g., clip for border padding_mode) x_left = manipulate(x_left, padding_mode) x_right = manipulate(x_right, padding_mode) // 5. interpolate output_val = interpolate(w_left, w_right, x_left, x_right) ``` This is actually incorrect (and also unintuitive) because it calculates the weights before manipulate out-of-boundary indices. Fortunately, this isn't manifested in both of the current supported modes, `'zeros'` and `'border'` padding: + `'zeros'`: doesn't clip + `'border'`: clips, but for out-of-bound `x` both `x_left` and `x_right` are clipped to the same value, so weights don't matter But this is a problem with reflection padding, since after each time we reflect, the values of `w_left` and `w_right` should be swapped. So in this commit I change the algorithm to (numbers corresponding to the ordering in the above pseudo-code) ``` 1. get float location 4. clip the float location 2. find the integral surrounding indices 3. calculate the linear interpolate weights ``` In the backward, because of this change, I need to add new variables to track `d manipulate_output / d manipulate_input`, which is basically a multiplier on the gradient calculated for `grid`. From benchmarking this addition doesn't cause obvious slow downs. 2. Implement reflection padding. The indices will keep being reflected until they become within boundary. 
Added variants of `clip_coordinates` and `reflect_coordinates` to be used in backward. E.g., ```cpp // clip_coordinates_set_grad works similarly to clip_coordinates except that // it also returns the `d output / d input` via pointer argument `grad_in`. // This is useful in the backward pass of grid_sampler. scalar_t clip_coordinates_set_grad(scalar_t in, int64_t clip_limit, scalar_t *grad_in) ``` For example, if `in` is clipped in `'border'` mode, `grad_in` is set to `0`. If `in` is reflected an odd number of times in `'reflection'` mode, `grad_in` is set to `-1`. 3. Implement nearest interpolation. 4. Add test cases 5. Add better input checking Discussed with goldsborough for moving `operator<<` of `at::Device`, `at::DeviceType` and `at::Layout` into `at` namespace. (Otherwise `AT_CHECK` can't find them.) 6. Support empty tensors. cc gchanan + Make empty tensors not acceptable by cudnn. + Add `AT_ASSERT(kernel block size > 0)` if using `GET_BLOCKS` + Cache `numel` in `TensorGeometry` I was going to use `numel` to test if cudnn descriptor should accept a tensor, but it isn't used eventually. I can revert this if needed. 7. Add more test cases, including on input checking and empty tensors 8. Remove an obsolete comment 9. Update docs. Manually tested by generating docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/10051 Differential Revision: D9123950 Pulled By: SsnL fbshipit-source-id: ac3b4a0a36b39b5d02e83666cc6730111ce216f6	2018-08-10 12:43:27 -07:00
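The reordered index calculation described in #10051 can be illustrated with a minimal 1-D sketch in plain Python. This is not the ATen kernel: the helper names `reflect_coordinate` and `sample_1d` are hypothetical, and the sketch assumes reflection happens about the first and last sample positions (`0` and `size - 1`). It only shows the ordering "reflect the float location first, then find neighbours and weights", plus the `+1`/`-1` multiplier that the backward pass would need.

```python
import math


def reflect_coordinate(x, size):
    """Reflect x into [0, size - 1] and return (reflected_x, d_out / d_in).

    Hypothetical helper: the derivative is -1.0 after an odd number of
    reflections and +1.0 after an even number, as the commit describes.
    """
    if size <= 1:
        # A single sample: every location maps to index 0, gradient is zero.
        return 0.0, 0.0
    span = size - 1
    grad = 1.0
    if x < 0:
        # Reflect about 0 once.
        x = -x
        grad = -grad
    # Count how many full spans x overshoots by; each one is a reflection.
    flips, extra = divmod(x, span)
    if int(flips) % 2 == 0:
        return extra, grad
    return span - extra, -grad


def sample_1d(values, x):
    """Linear interpolation of `values` at float location x with reflection."""
    size = len(values)
    x, _ = reflect_coordinate(x, size)       # manipulate the float location first
    left = int(math.floor(x))                # then find the surrounding indices
    right = min(left + 1, size - 1)
    w_right = x - left                       # then compute the linear weights
    w_left = 1.0 - w_right
    return w_left * values[left] + w_right * values[right]


values = [0.0, 10.0, 20.0, 30.0]
print(sample_1d(values, 1.25))   # in-bounds location
print(sample_1d(values, 3.5))    # out of bounds on the right, reflected to 2.5
print(sample_1d(values, -0.5))   # out of bounds on the left, reflected to 0.5
```

Because the weights are computed after the reflection, they never need to be swapped, which is the problem the commit message points out with the previous ordering.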
Natalia Gimelshein	5bb21493fd	add fused dropout kernels (#9666 ) Summary: While waiting for dropout to be fully ported to ATen, here's performance fix for the most common dropout case. Dropout is still in python function, I just added efficient path to it. I could not make inplace work, because generator always generates `return self` for inplace function, and I need to return both original tensor and mask, so inplace goes on the existing pass. Even with non-inplace version, since mask is now a ByteTensor, memory used is just a little larger than for inplace dropout, due to savings on mask. Once dropout is moved to aten, these kernels still can be used for efficient implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/9666 Reviewed By: SsnL Differential Revision: D8948077 Pulled By: ezyang fbshipit-source-id: 52990ef769471d957e464af635e5f9b4e519567a	2018-08-07 13:34:53 -07:00
Wei Yang	149d4f776b	use logsigmoid at multilabel_soft_margin_loss, and change output from shape=(N, C)to (N,) (#9965 ) Summary: - fixes #9141, #9301 - use logsigmoid at multilabel_soft_margin_loss to make it more stable (NOT fixing legacy MultiLabelSoftMarginCriterion) - return (N) instead of (N, C) to match the same behavior as MultiMarginLoss - Note that with this PR, the following behavior is expected: ``` loss = F.multilabel_soft_margin_loss(outputs, labels, reduction='none') loss_mean = F.multilabel_soft_margin_loss(outputs, labels, reduction='elementwise_mean') loss_sum = F.multilabel_soft_margin_loss(outputs, labels, reduction='sum') loss.sum() == loss_sum # True loss.mean() == loss_mean # True ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/9965 Differential Revision: D9038402 Pulled By: weiyangfb fbshipit-source-id: 0fa94c7b3cd370ea62bd6333f1a0e9bd0b8ccbb9	2018-08-03 17:54:19 -07:00
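As a rough illustration of why #9965 switches to logsigmoid, here is a plain-Python sketch (not the PyTorch implementation; the function names are made up for this example). It assumes the usual per-sample form of the loss, `-(1/C) * sum_i [y_i * log(sigmoid(x_i)) + (1 - y_i) * log(sigmoid(-x_i))]`, and returns one value per sample, i.e. shape `(N,)` rather than `(N, C)`.

```python
import math


def logsigmoid(x):
    """Numerically stable log(sigmoid(x)); avoids overflow for large |x|."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def multilabel_soft_margin(outputs, labels):
    """Return one loss per sample (shape (N,)) for N x C nested lists."""
    losses = []
    for row, target in zip(outputs, labels):
        total = sum(
            y * logsigmoid(x) + (1 - y) * logsigmoid(-x)
            for x, y in zip(row, target)
        )
        losses.append(-total / len(row))  # average over the C classes
    return losses


outputs = [[0.0, 0.0], [1000.0, -1000.0]]
labels = [[1, 0], [1, 0]]
loss = multilabel_soft_margin(outputs, labels)
print(loss)                   # per-sample losses, reduction='none'
print(sum(loss))              # matches reduction='sum'
print(sum(loss) / len(loss))  # matches a mean over the N samples
```

A naive `math.log(1 / (1 + math.exp(-x)))` raises `OverflowError` for `x = -1000.0`, while the stable form above returns `-1000.0`.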
Thomas Viehmann	6456b944fd	ctc_loss odds and ends (#10112 ) Summary: - Add convenience wrapper to pass tensors as input_lengths, target_lengths - Fix documentation example - Check BLANK >= 0 Thank you, Simon and Soumith for the suggestions! Pull Request resolved: https://github.com/pytorch/pytorch/pull/10112 Differential Revision: D9130737 Pulled By: SsnL fbshipit-source-id: f9a0022a969788bda3db9f360e2564b519ebf2e6	2018-08-03 13:25:18 -07:00
Tongzhou Wang	43b151224e	Move grid sampler to ATen (#9961 ) Summary: Spatial version benchmark

| | CPUFloat THNN | CPUFloat ATen | CPUDouble THNN | CPUDouble ATen | CUDAHalf THNN | CUDAHalf ATen | CUDAFloat THNN | CUDAFloat ATen | CUDADouble THNN | CUDADouble ATen |
|---------------------------|---------------|---------------|----------------|----------------|---------------|---------------|----------------|----------------|-----------------|-----------------|
| [1024x1x28x28] zero pad | 2.19281888s | 0.21280479s | 2.52922535s | 0.23944831s | 0.17494774s | 0.06242800s | 0.31270599s | 0.03706479s | 0.40542483s | 0.07391024s |
| [1024x1x28x28] border pad | 3.04329610s | 0.24705672s | 2.29205394s | 0.22336411s | 0.17980361s | 0.06212497s | 0.31415701s | 0.03847790s | 0.43020391s | 0.07540464s |
| [32x3x244x244] zero pad | 18.29301333s | 2.18566656s | 19.01662397s | 3.51552224s | 1.72487235s | 0.28933954s | 2.02466702s | 0.18178749s | 2.63671613s | 0.41391206s |
| [32x3x244x244] border pad | 18.72205329s | 2.02600884s | 20.13017297s | 3.25979590s | 1.96455693s | 0.33070564s | 2.18666625s | 0.19546938s | 2.91268897s | 0.38465047s |

For #9702 basics: + grid tensors have dimensions `[N, H, W, 2]` (or `[N, D, H, W, 3]` for 3d). + input/output tensors have dimensions `[N, C, H, W]` (or `[N, C, D, H, W]` for 3d) + grid sampler maps `input([N, C, inp_H, inp_W]), grid([N, H, W, 2])` to `output([N, C, H, W])` (3d case is similar). variable naming: + `tensor_sH` means the stride of `tensor` at the dimension of `H`. + `tensor_ptr_NCH` is a data pointer that always points to the beginning of the `tensor[n][c][h]` slice in the loop. Pull Request resolved: https://github.com/pytorch/pytorch/pull/9961 Differential Revision: D9057175 Pulled By: SsnL fbshipit-source-id: 9ed8f1dc376ed10229f047fdcf3c90dbd250bee6	2018-08-01 07:54:46 -07:00

469 Commits