Summary:
I'm just doing the honors and bumping the version to 1.0.0.
1.0 preview and RC releases will have the 1.0.0.dev{date} tag
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11717
Reviewed By: SsnL
Differential Revision: D9840857
Pulled By: soumith
fbshipit-source-id: 4c9c2e01dccb3c521dab26c49e1569d970a87ace
Summary:
Previously, it was necessary to include TensorMethods.h after Tensor.h in order to get the tensor method definitions.
We abstracted this away from users by making sure ATen.h did this correctly; but we don't have any equivalent for ATen/core.
In order to solve this dependency issue, we now forward declare Tensor in the Type declaration, which breaks the dependency cycle.
Type.h now includes Tensor.h (for backwards compatibility) and Tensor.h now includes TensorMethods.h, so there are no longer any include-order restrictions.
We could get rid of TensorMethods.h completely now, but that would involve coordinating a code generation change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11720
Reviewed By: ezyang
Differential Revision: D9841488
Pulled By: gchanan
fbshipit-source-id: 1668199095e096c1790e646b5dc9f61ec1b33c0a
Summary:
A couple fixes I deem necessary to the TorchScript C++ API after writing the tutorial:
1. When I was creating the custom op API, I created `torch/op.h` as the one-stop header for creating custom ops. I now notice that there is no good header for the TorchScript C++ story altogether, i.e. when you just want to load a script module in C++ without any custom ops necessarily. The `torch/op.h` header suits that purpose just as well of course, but I think we should rename it to `torch/script.h`, which seems like a great name for this feature.
2. The current API for the CMake we provided was that we defined a bunch of variables like `TORCH_LIBRARY_DIRS` and `TORCH_INCLUDES` and then expected users to add those variables to their targets. We also had a CMake function that did that for you automatically. I now realized a much smarter way of doing this is to create an `IMPORTED` target for the libtorch library in CMake, and then add all this stuff to the link interface of that target. Then all downstream users have to do is `target_link_libraries(my_target torch)` and they get all the proper includes, libraries and compiler flags added to their target. This means we can get rid of the CMake function and all that stuff. orionr AFAIK this is a much, much better way of doing all of this, no?
3. Since we distribute libtorch with `-D_GLIBCXX_USE_CXX11_ABI=0`, dependent libraries must set this flag too. I now add this to the interface compile options of this imported target.
4. Fixes to JIT docs.
These could likely be 4 different PRs but given the release I wouldn't mind landing them all asap.
zdevito dzhulgakov soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11682
Differential Revision: D9839431
Pulled By: goldsborough
fbshipit-source-id: fdc47b95f83f22d53e1995aa683e09613b4bfe65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11706
This is necessary to handle use cases when Storage is not set (because the
tensor in question doesn't have a notion of storage).
Reviewed By: orionr
Differential Revision: D9833361
fbshipit-source-id: e90a384019f44f57682b687d129b54e85b6fabb9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11701
There's one extra multiply from TypeMeta::itemsize() which needs
to be characterized. For all existing Caffe2 uses, storage_offset
is zero.
Reviewed By: li-roy
Differential Revision: D9831230
fbshipit-source-id: 353678edf76d2ccc297a73475a34f6ab2a20d1e1
Summary:
This will allow us to break the dependency cycle between Tensor and Type, because currently Type has defaulted Tensor (reference) arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11675
Reviewed By: ezyang
Differential Revision: D9819720
Pulled By: gchanan
fbshipit-source-id: a9577ac34a358120075129ab0654e7862d1dace6
Summary:
This way it shows up in all current and future setup.py commands; otherwise we'd have to override every one of them to have them all call copy_protos. This is needed because the nightly packages still do not include caffe2_pb2, since setup.py bdist does not go through setup.py install or setup.py develop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11726
Reviewed By: orionr
Differential Revision: D9844075
Pulled By: pjh5
fbshipit-source-id: 57b469e48010aacd0c08c214ba8a7e5d757feefa
Summary:
We use these annotations during function declarations, not definitions. See the description of compiler error [C2491](https://msdn.microsoft.com/en-us/library/62688esh.aspx) for more details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11367
Reviewed By: ezyang
Differential Revision: D9697923
Pulled By: orionr
fbshipit-source-id: 1e539c02957851386f887e6d0510ce83117a1695
Summary:
This PR vectorizes the CPU grid sample 2d forward and backward kernels. Specifically,
1. add `.data()` in `TensorAccessor`
2. support non-void return value for declaring CPU kernel stub
3. add `bool at::geometry_is_contiguous(IntList sizes, IntList strides)`
4. The following vectorized CPU primitives are added:
+ `gather<scale>(baseaddr, vindex)`: `result[i] = baseaddr[vindex[i] * scale]`
+ `mask_gather<scale>(src, baseaddr, vindex, mask)`: `result[i] = mask[i] ? baseaddr[vindex[i] * scale] : src[i]`.
+ comparison ops
+ binary logical ops
+ `min(a, b)`
+ `cast<dst_t, src_t>(src_vec)`: changing dtype but keeping the bit representation
+ `blendv(a, b, mask)`: `result[i] = mask[i] ? b[i] : a[i]`.
+ ctor with multiple values (i.e., `setr`)
+ `arange(start = 0, step = 1)`: constructs a vector with values specified by the arange parameters
+ `convert_to_int_of_same_size(vec)`: convert floating point vector to corresponding integral type of same size
+ `interleave2(a, b)` & `deinterleave2(x, y)`: interleaves or deinterleaves two vectors. E.g., for `interleave`:
```
inputs:
{a0, a1, a2, a3, a4, a5, a6, a7}
{b0, b1, b2, b3, b4, b5, b6, b7}
outputs:
{a0, b0, a1, b1, a2, b2, a3, b3}
{a4, b4, a5, b5, a6, b6, a7, b7}
```
5. Grid sample CPU kernel implementations are described in the following note (also in `GridSampleKernel.cpp`):
```
NOTE [ Grid Sample CPU Kernels ]
Implementation of vectorized grid sample CPU kernels is divided into three
parts:
1. `ComputeLocation` struct
Transforms grid values into interpolation locations of the input tensor
for a particular spatial dimension, based on the size of that dimension
in input tensor, and the padding mode.
```
```cpp
template<typename scalar_t, GridSamplerPadding padding>
struct ComputeLocation {
using Vec = Vec256<scalar_t>;
// ctor
ComputeLocation(int64_t size);
// Given grid values `in`, return the interpolation locations after
// un-normalization and padding mechanism (elementwise).
Vec apply(const Vec &in) const;
// Similar to `apply`, but also returns `d apply(in) / d in`
// (elementwise).
// this is often used in gradient computation.
std::pair<Vec, Vec> apply_get_grad(const Vec &in) const;
};
```
```
2. `ApplyGridSample` struct
Owns N `ComputeLocation` structs, where N is the number of spatial
dimensions. Given N input grid vectors (one for each spatial dimension)
and spatial offset, it gets the interpolation locations from
`ComputeLocation`s, applies interpolation procedure, and then writes to
the output (or grad_input & grad_grid in backward).
```
```cpp
template<typename scalar_t, int spatial_dim,
GridSamplerInterpolation interp,
GridSamplerPadding padding>
struct ApplyGridSample {
// ctor
ApplyGridSample(const TensorAccessor<scalar_t, 4>& input);
// Applies grid sampling (forward) procedure:
// 1. computes interpolation locations from grid values `grid_x` and
// `grid_y`,
// 2. interpolates output values using the locations and input data
// in `inp_slice`, and
// 3. writes the first `len` values in the interpolated vector to
// `out_slice` with spatial offset being `offset`.
//
// This assumes that `grid_x` and `grid_y` all contain valid grid
// values \in [-1, 1], even at indices greater than `len`.
//
// The `*_slice` argument names mean samples within a batch (i.e.,
// with the batch dimension sliced out).
void forward(TensorAccessor<scalar_t, 3>& out_slice,
const TensorAccessor<scalar_t, 3>& inp_slice,
int64_t offset, const Vec& grid_x, const Vec& grid_y,
int64_t len) const;
// Applies grid sampling (backward) procedure. Arguments semantics
// and strategy are similar to those of `forward`.
void backward(TensorAccessor<scalar_t, 3>& gInp_slice,
TensorAccessor<scalar_t, 3>& gGrid_slice,
const TensorAccessor<scalar_t, 3>& gOut_slice,
const TensorAccessor<scalar_t, 3>& inp_slice,
int64_t offset, const Vec& grid_x, const Vec& grid_y,
int64_t len) const;
};
```
```
3. `grid_sample_2d_grid_slice_iterator` function
Among the tensors we work with, we know that the output tensors are
contiguous (i.e., `output` in forward, and `grad_input` & `grad_grid` in
backward), we need to randomly read `input` anyways, and `grad_output`
usually comes from autograd and is often contiguous. So we base our
iterating strategy on the geometry of grid.
`grid_sample_2d_grid_slice_iterator` function provides an abstraction to
efficiently iterate through a `grid` slice (without batch dimension).
See comments of that function on the specific cases and strategies used.
```
```cpp
template<typename scalar_t, typename ApplyFn>
void grid_sample_2d_grid_slice_iterator(
const TensorAccessor<scalar_t, 3>& grid_slice,
const ApplyFn &apply_fn);
// `apply_fn` is a function/lambda that can be called as if it has
// declaration:
// void apply_fn(const Vec256<scalar_t>& grid_x,
// const Vec256<scalar_t>& grid_y,
// int64_t spatial_offset, int64_t len);
```
```
`apply_fn` will be called multiple times, and together cover the entire
output spatial space. Therefore, e.g., to implement forward 2d grid
sample, we can do
```
```cpp
ApplyGridSample<scalar_t, 2, interp, padding> grid_sample(input_accessor);
for (int n = 0; n < input_accessor.size(0); n++) {
grid_sample_2d_grid_slice_iterator(
grid_accessor[n],
[&](const Vec256<scalar_t>& grid_x, const Vec256<scalar_t>& grid_y,
int64_t spatial_offset, int64_t len) {
grid_sample.forward(out_accessor[n], input_accessor[n],
spatial_offset, grid_x, grid_y, len);
});
}
```
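For reference, a minimal Python-level exercise of the op these kernels implement (shapes and mode arguments here are just illustrative):
```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 2, 8, 8, requires_grad=True)          # (N, C, H_in, W_in)
grid = (torch.rand(1, 4, 4, 2) * 2 - 1).requires_grad_()   # (N, H_out, W_out, 2), values in [-1, 1]

out = F.grid_sample(inp, grid, mode='bilinear', padding_mode='zeros')
out.sum().backward()   # exercises both the grad_input and grad_grid paths
print(out.shape)       # torch.Size([1, 2, 4, 4])
```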
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10980
Differential Revision: D9564867
Pulled By: SsnL
fbshipit-source-id: 5b7c3c7ea63af00eec230ae9ee1c3e6c6c9679b4
Summary:
Change `max` to `fmaxf` in the `LabelCrossEntropy` kernel so that it works correctly under HIP.
bddppq petrex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11733
Differential Revision: D9846783
Pulled By: bddppq
fbshipit-source-id: c1b394d2ba7ee0e819f7bf3b36b53d1962de5522
Summary:
Fixes #11452.
Based on the discussion with SsnL and soumith, we want to bring back Upsample as a module instead of introducing a new nn.interpolate module for now. If anyone wants to downsample, they should use `nn.functional.interpolate` instead.
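A small illustrative snippet of the intended split (module for upsampling, functional for everything else):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 3, 16, 16)
up = nn.Upsample(scale_factor=2, mode='bilinear')  # module form: upsampling
print(up(x).shape)                                 # torch.Size([1, 3, 32, 32])

down = F.interpolate(x, scale_factor=0.5)          # functional form: also handles downsampling
print(down.shape)                                  # torch.Size([1, 3, 8, 8])
```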
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11568
Differential Revision: D9804359
Pulled By: ailzhang
fbshipit-source-id: 2b232d55fc83c2b581bf336f1ee8d1cf1c1159ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11355
There is no reason to implement refcounting manually in this case.
Given the correct NullType, toIntrusivePtr() and moveToIntrusivePtr() will do the right thing.
Reviewed By: ezyang
Differential Revision: D9694918
fbshipit-source-id: 8aae4d66aec32ca5f85c438d66339bd80b72b656
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11353
Before, there was one extra member in the union that had to be at least as large as the largest other member, because it was used for copying.
Now, this isn't needed anymore and we copy the union directly.
Reviewed By: ezyang
Differential Revision: D9694326
fbshipit-source-id: 42b2f7d51ac5d4ea5ebafea3a598b018e10fed68
Summary:
Current behavior is that each process (main and workers) will print a traceback from `KeyboardInterrupt`, and the main process will also print
```
RuntimeError: DataLoader worker (pid 46045) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
```
due to our SIGCLD handler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11718
Differential Revision: D9840844
Pulled By: SsnL
fbshipit-source-id: 1a05060bb02907fef5aac3f274d2c84f9f42d187
Summary:
Otherwise each build produces 65MB of warnings log, which makes the CI hard to debug.
iotamudelta Jorghi12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11698
Differential Revision: D9840356
Pulled By: bddppq
fbshipit-source-id: b69bf6a5c38a97b188221f9c084c608ffc9b37c8
Summary:
1. Document the Sequential module in the C++ API at a high level (why does this exist) and a low level (how to use it)
2. Change the Sequential tests to be in a style that makes them easier to convert to gtest. No code changes.
ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11648
Differential Revision: D9834526
Pulled By: goldsborough
fbshipit-source-id: 39f2f5c6cbbf8ed5a1b69986978c8ef127036de1
Summary:
This PR splits the CPU and CUDA fusion compilers, putting them into a new jit/fusers/ directory with jit/fusers/common for common components. In particular:
- A fusion interface is created that allows "fusion handles" to be requested
- The CPU and CUDA fusers implement this interface, with dispatch determined by device
- The fusion compilers, fusion function specializations and resource strings are split
- CPU-specific classes like TempFile and DynamicLibrary are in the CPU fuser
- Common classes like TensorDesc and the base fusion function class are in jit/fusers/common
- There is still some specialization in jit/fusers/common, but these specializations are small(-ish)
- Updates the build system to remove the dummy interface on Windows and minimize the use of macros
This structure should allow in-flight PRs to easily rebase while providing a clear interface to the fusers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10981
Reviewed By: soumith
Differential Revision: D9701999
Pulled By: apaszke
fbshipit-source-id: 3b6bec7b97e0444b2a93caa38d9b897f2e68c1b3
Summary:
Fixes #11663.
`TensorIterator` was replacing the op tensors with type casted tensors
which ended up producing side effects in binary ops like `a.float() * b`
where `a` and `b` are `LongTensor`s.
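A minimal illustration of the behavior being protected here (the dtypes shown follow the usual promotion rules; the point is that the original operands must stay untouched):
```python
import torch

a = torch.ones(4, dtype=torch.long)
b = torch.ones(4, dtype=torch.long)

out = a.float() * b
print(out.dtype)          # torch.float32
print(a.dtype, b.dtype)   # torch.int64 torch.int64 -- `a` and `b` are not modified
```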
colesbury ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11708
Differential Revision: D9834016
Pulled By: driazati
fbshipit-source-id: 4082eb9710b31dfc741161a0fbdb9a8eba8fe39d
Summary:
Often, we find ourselves looking at some long-running kernel or emit_nvtx range on an nvvp profile and trying to connect it to the offending line in a training script. If the op is in the forward pass that's easy: ops are enqueued explicitly from the Python side, so tracking it down with manual nvtx ranges supplemented by the built-in emit_nvtx ranges is straightforward. If the op is in the backward pass, it's much more difficult. From the Python side, all you can do is wrap loss.backward() in an nvtx range, and if you also use emit_nvtx, the automatic ranges provide only local information. Right now, the only consistent way to connect backward-pass kernels to their associated forward-pass lines of Python is to understand your script line by line, and know exactly where in the backward pass you are.
This PR augments the existing nvtx machinery to bridge the gap between forward and backward, allowing connection of backward-pass Function apply calls to the forward-pass operations that required/created those Functions.
The method is simple and surgical. During the forward pass, when running with emit_nvtx, the nvtx range for each function in VariableType is tagged with the current sequence number. During the backward pass, the nvtx range associated with each Function's operator() is tagged with that Function's stashed sequence number, which can be compared to "current sequence numbers" from the forward pass to locate the associated op.
Double-backward is not a problem. If a backward pass with create_graph = True is underway, the relationship between backward and double-backward is conceptually the same as the relationship between forward and backward: The functions in VariableType still spit out current-sequence-number-tagged ranges, the Function objects they create still stash those sequence numbers, and in the eventual double-backward execution, their operator() ranges are still tagged with the stashed numbers, which can be compared to "current sequence numbers" from the backward pass.
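A minimal sketch of how this is typically driven (run the script under `nvprof`/`nvvp`; the model and shapes are placeholders). Forward-pass ranges then carry the current sequence number, and backward-pass `Function` ranges carry the stashed number to match against:
```python
import torch
from torch.autograd import profiler

model = torch.nn.Linear(64, 64).cuda()
x = torch.randn(32, 64, device='cuda')

with torch.cuda.profiler.profile():
    with profiler.emit_nvtx():
        loss = model(x).sum()
        loss.backward()
```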
Minor caveats:
- The sequence number is thread-local, and many VariableType functions (specifically, those without a derivative explicitly defined in derivatives.yaml) don't create an associated function object (instead delegating that to sub-functions further down the call chain, perhaps called from within at::native functions that route back through VariableType by calling at::function_name). So the correspondence of stashed sequence numbers in Function operator() ranges with numbers in forward-pass ranges is not guaranteed to be 1 to 1. However, it's still a vast improvement over the current situation, and I don't think this issue should be a blocker.
- Feel free to litigate my use of stringstream in profiler.cpp. I did it because it was easy and clean. If that's too big a hammer, let's figure out something more lightweight.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10881
Differential Revision: D9833371
Pulled By: apaszke
fbshipit-source-id: 1844f2e697117880ef5e31394e36e801d1de6088
Summary:
This is causing codegen problems in caffe2, when we try to remove the circular Tensor/Type declarations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11673
Differential Revision: D9819341
Pulled By: gchanan
fbshipit-source-id: f2c2cd96e8a16f6de6aa4889e71b8a78e12e9256
Summary:
We'll have separate docs for the C++ frontend; right now this file is just misleading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11703
Differential Revision: D9832847
Pulled By: goldsborough
fbshipit-source-id: 2e8b30ccf6b5cba9d0526e6261160f7c6211a35c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11597
We should always CHECK pointers which we plan to dereference
if they are inputs to the function. Nobody knows how the function will
be called in the future.
Reviewed By: yinghai
Differential Revision: D9800002
fbshipit-source-id: 7fd05f4717f2256d1b09a9e75475b12de6685b03
Summary:
…cuda())
While I was at it, I audited all other ways I know how we might get a CUDA
type from PyTorch and fixed more constructors which don't work.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11533
Differential Revision: D9775786
Pulled By: ezyang
fbshipit-source-id: cd07cdd375fdf74945539ec475a48bf08cbc0c17
Summary:
There's no reason they need to be in Type.h and this moves us along the path of not having circular dependencies (so we can get rid of TensorMethods.h).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11650
Reviewed By: ezyang
Differential Revision: D9812271
Pulled By: gchanan
fbshipit-source-id: 8b70db9a5eb0a332398ab2e8998eeaf7d2eea6d7
Summary:
This adds a small check in `Dirichlet` and `Categorical` `__init__` methods to ensure that scalar parameters are not admissible.
**Motivation**
Currently, `Dirichlet` throws no error when provided with a scalar parameter, but if we `expand` a scalar instance, it inherits the empty event shape from the original instance and gives unexpected results.
The alternative to this check is to promote `event_shape` to be `torch.Size((1,))` if the original instance was a scalar, but that seems to add a bit more complexity (and changes the behavior of `expand` in that it would affect the `event_shape` as well as the `batch_shape` now). Does this seem reasonable? cc. alicanb, fritzo.
```python
In [4]: d = dist.Dirichlet(torch.tensor(1.))
In [5]: d.sample()
Out[5]: tensor(1.0000)
In [6]: d.log_prob(d.sample())
Out[6]: tensor(0.)
In [7]: e = d.expand([3])
In [8]: e.sample()
Out[8]: tensor([0.3953, 0.1797, 0.4250]) # interpreted as events
In [9]: e.log_prob(e.sample())
Out[9]: tensor(0.6931) # wrongly summed out
In [10]: e.batch_shape
Out[10]: torch.Size([3])
In [11]: e.event_shape
Out[11]: torch.Size([]) # cannot be empty
```
Additionally, based on review comments, this removes `real_vector` constraint. This was only being used in `MultivariateNormal`, but I am happy to revert this if we want to keep it around for backwards compatibility.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11589
Differential Revision: D9818271
Pulled By: soumith
fbshipit-source-id: f9bbba90ed6f04e0b5bdfa169e70ca20b280fc74
Summary:
This PR:
- adds a `.expand` method for `TransformedDistribution` along the lines of #11341.
- uses this method to simplify `.expand` in distribution classes that subclass off of `TransformedDistribution`.
- restores testing of `TransformedDistribution` fixtures.
- fixes some bugs wherein we were not setting certain attributes in the expanded instances, and adds tests for `.mean` and `.variance` which use these attributes.
There are many cases where users directly use `TransformedDistribution` rather than subclassing off it. In such cases, it seems rather inconvenient to have to write a separate class just to define a `.expand` method. The default implementation should suffice in these cases.
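A small sketch of the default `.expand` on a directly-constructed `TransformedDistribution` (the log-normal here is just an example):
```python
import torch
import torch.distributions as dist
from torch.distributions.transforms import ExpTransform

base = dist.Normal(torch.zeros(3), torch.ones(3))
log_normal = dist.TransformedDistribution(base, [ExpTransform()])

expanded = log_normal.expand(torch.Size([5, 3]))
print(expanded.batch_shape)       # torch.Size([5, 3])
print(expanded.sample().shape)    # torch.Size([5, 3])
```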
cc. fritzo, vishwakftw, alicanb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11607
Differential Revision: D9818225
Pulled By: soumith
fbshipit-source-id: 2c4b3812b9a03e6985278cfce0f9a127ce536f23
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11576
Previously, they were spattered throughout the codebase.
We now follow this convention:
- LegacyTypeDispatch gives you Type
- Context gives you TypeExtendedInterface
- Tensor::type() gives you Type
- at::getType() gives you TypeExtendedInterface
I change some sites to use getType() over type().
Reviewed By: SsnL
Differential Revision: D9790187
fbshipit-source-id: 5e2577cb590a5bbf5df530f3763d3b3c0b4625ca
Summary:
This adds tests in tests/test_distributions.py to ensure that all methods of `Distribution` objects are jittable.
I've replaced a few samplers with jittable versions (a small sketch of the pattern follows this list):
- `.uniform_()` -> `torch.rand()`
- `.exponential_()` -> `-(-torch.rand()).log1p()`
- `.normal_()` -> `torch.normal(torch.zeros(...), torch.ones(...), ...)`
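A minimal sketch of one of these replacements (the exponential case), just to show the pattern of swapping an in-place sampler for an expression the JIT can trace:
```python
import torch

shape = (4,)
u_inplace = torch.empty(shape).exponential_()    # in-place sampler, not jittable
u_jittable = -(-torch.rand(shape)).log1p()       # same distribution via inverse CDF of torch.rand
```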
Some jit failures remain, and are marked in test_distributions.py
- `Cauchy` and `HalfCauchy` do not support sampling due to missing `.cauchy_()`
- `Binomial` does not support `.enumerate_support()` due to `arange` ignoring its first arg.
- `MultivariateNormal`, `LowRankMultivariateNormal` do not support `.mean`, `.entropy`
- [x] Currently some tests fail (I've skipped those) due to unavailability of `aten::uniform` and `aten::cauchy` in the jit. Can someone suggest how to add these? I tried to add declarations to `torch/csrc/ir.cpp` and `torch/csrc/passes/shape_analysis.cpp`, but that resulted in "Couldn't find operator" errors.
- [x] There are still lots of `TracerWarning`s that something doesn't match something. I'm not sure whether these are real.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11560
Differential Revision: D9816327
Pulled By: apaszke
fbshipit-source-id: 72ec998ea13fc4c76d1ed003d9502e0fbaf728b8
Summary:
Currently, torch.norm() runs sequentially on the CPU. This PR parallelizes and vectorizes torch.norm() on the ATen CPU path, providing roughly two orders of magnitude of performance improvement.
Performance was benchmarked on a Xeon Skylake 8180 (2×28 cores @ 2.5GHz), using the following script:
```python
import torch
from time import time
count = 1000
size = 1000*1000
def test_norm(p=2):
a = torch.randn(size)
tstart = time()
for i in range(count):
torch.norm(a, p)
tend = time()
print("norm on size %d tensor p = %d: %f s" % (size, p, (tend-tstart)))
for p in range(4):
test_norm(p)
```
without this optimization,
```
(intel-pytorch) [mingfeim@mlt-skx065 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 1.071235 s
norm on size 1000000 tensor p = 1: 1.069149 s
norm on size 1000000 tensor p = 2: 1.068212 s
norm on size 1000000 tensor p = 3: 69.735312 s
```
and with this optimization,
```
(pytorch-tf) [mingfeim@mlt-skx053 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 0.127507 s
norm on size 1000000 tensor p = 1: 0.011867 s
norm on size 1000000 tensor p = 2: 0.011907 s
norm on size 1000000 tensor p = 3: 0.014470 s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11565
Differential Revision: D9804484
Pulled By: ezyang
fbshipit-source-id: 52899f30ac26139d00684d07edfb47cb9b25d871
Summary:
Previously, we would pretty much assume that all floating point tensors do require grad, which might result in some unnecessary compute.
I don't really like the fact that `TensorType` uses `tensor.is_variable() && tensor.requires_grad()` to infer the value of `requires_grad`, but changing constants to keep variables turns out to be pretty hard. I got halfway there, but it would still need some more work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11586
Reviewed By: ezyang
Differential Revision: D9813648
Pulled By: apaszke
fbshipit-source-id: 77f77756d18ff7632fca3aa68ce855e1d7f3bdb8
Summary:
I'm reading the doc of `torch.nn.functional.pad` and it looks a bit confusing to me. Hopefully this PR makes it clearer.
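For reference, a tiny example of the padding-order convention the doc describes: the pairs in `pad` apply starting from the last dimension and moving backwards.
```python
import torch
import torch.nn.functional as F

x = torch.ones(1, 3, 4, 5)    # (N, C, H, W)
y = F.pad(x, (1, 2, 3, 4))    # (left, right, top, bottom): last dim first
print(y.shape)                # torch.Size([1, 3, 11, 8])
```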
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11623
Differential Revision: D9818255
Pulled By: soumith
fbshipit-source-id: 4f6b17b0211c6927007f44bfdf42df5f84d47536
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11659
This is less error-prone and less code.
Reviewed By: smessmer
Differential Revision: D9814536
fbshipit-source-id: 028510e31e2fa7a9fa11c1398b0743c5cd085dd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11657
Previously, we had a constructor in TensorImpl for every constructor in Tensor.
This was unnecessary and wordy: Tensor is the user-visible class, so it deserves
the constructors, but TensorImpl is internal and doesn't need it. So
I replaced TensorImpl with a single, Storage accepting constructor, and then
rewrote Tensor to use that constructor.
Reviewed By: jerryzh168
Differential Revision: D9813742
fbshipit-source-id: 7501b54fe5f39180f1bc07573fd7c1640b0f4e89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11656
The mis-capitalization really sticks in my craw. I know why (we
already have a static function named GetDeviceType), but let's
name it differently.
```
codemod -d . --extensions cc,cpp,cu,cuh,h,py,hpp,TARGETS GetDevicetype device_type
```
Reviewed By: jerryzh168
Differential Revision: D9813544
fbshipit-source-id: fe462f4bc40b03e74921f8cf5ebd9cfc52e7e636
Summary:
The isCompleted function is changed to be non-const to accommodate
setting some internal status on the work object in the case of
completion. Previously, it was only checking a member field, but for the
MPI backend it calls MPI_Test to poll for completion of an asynchronous
request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11630
Reviewed By: SsnL
Differential Revision: D9808008
Pulled By: pietern
fbshipit-source-id: 18b70825b1fb4d561a552fa75e9475a522852cd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11396
std::move and std::forward in C++11 aren't constexpr (they are in C++14).
This caused a build issue orionr was working on.
It should be fixed by this diff
Reviewed By: orionr
Differential Revision: D9724805
fbshipit-source-id: 0d9047dce611385d659cc71a6c04cc7a6a40a5ae
Summary:
Requires https://github.com/onnx/onnx/pull/1377
This PR makes it so that slices with dynamic boundary values can be exported from pytorch and run in caffe2 via ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11255
Differential Revision: D9790216
Pulled By: jamesr66a
fbshipit-source-id: 6adfcddc5788df4d34d7ca98341077140402a3e2
Summary:
Previously, because of some setup.py logic, `ninja` caching of the `generate_code.py` build step was broken. This resulted in `generate_code.py` running on every single build, regardless of whether its inputs changed.
This updated logic fixes the input caching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11644
Reviewed By: orionr
Differential Revision: D9814348
Pulled By: soumith
fbshipit-source-id: 2012960908d0f600488d410094095cfd72adc34f
Summary:
This also removes the usage of torch.onnx.symbolic_override in instance_norm. Fixes #8439.
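A minimal export sketch of the case this covers (file name and shapes are placeholders):
```python
import torch
import torch.nn as nn

model = nn.InstanceNorm2d(3, affine=True)
x = torch.randn(1, 3, 8, 8)
torch.onnx.export(model, x, "instance_norm.onnx")  # exported through the standard symbolic
```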
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10792
Differential Revision: D9800643
Pulled By: li-roy
fbshipit-source-id: fa13a57de5a31fbfa2d4d02639d214c867b9e1f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11642
This is just a preparatory change to help with future
refactoring:
- I want to reduce the number of includes that tensor_impl.h
depends on, but
- I need to keep tensor.h providing all Caffe2 headers, because
users may be relying on tensor.h transitively providing those
headers.
Introducing a level of indirection lets me do both at the same time.
Reviewed By: jerryzh168
Differential Revision: D9810823
fbshipit-source-id: 8dfaac4b8768051a22898be8fcaf787ecc57eb13
Summary:
Before this PR it would warn that "dropout is non deterministic and can
cause problems when checking trace", so I disabled the trace checking.
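For context, a minimal sketch of the kind of workaround used here, assuming the `check_trace` flag on `torch.jit.trace`:
```python
import torch
import torch.nn.functional as F

def f(x):
    # dropout is nondeterministic, so a traced replay would not match exactly
    return F.dropout(x, p=0.5, training=True)

x = torch.randn(4, 4)
traced = torch.jit.trace(f, x, check_trace=False)
```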
cc zdevito apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11639
Differential Revision: D9812493
Pulled By: zou3519
fbshipit-source-id: fab86928a5fba8b218b47543533aaf7c82a10b4a
Summary:
Arg parser allowed additional positional args to be parsed into keyword-only params.
Fixes a couple cases:
- The positional argument happens to be of the right type, and it just works silently. Now, we fail as expected.
- The positional argument fails later down the line. Now, we fail at the appropriate time and get a better error message.
Pre-fix:
```
>>> torch.cuda.LongTensor((6, 0), 1, 1, 0)
tensor([6, 0], device='cuda:1')
```
Post-fix:
```
>>> torch.cuda.LongTensor((6, 0), 1, 1, 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: new() received an invalid combination of arguments - got (tuple, int, int, int), but expected one of:
* (torch.device device)
* (torch.Storage storage)
* (Tensor other)
* (tuple of ints size, torch.device device)
* (object data, torch.device device)
```
Pre-fix:
```
>>> a = torch.tensor(5)
>>> a.new_zeros((5,5), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: new_zeros(): argument 'dtype' (position 2) must be torch.dtype, not int
```
Post-fix:
```
>>> a = torch.tensor(5)
>>> a.new_zeros((5,5), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: new_zeros() takes 1 positional argument but 2 were given
```
Fixes #8351.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10499
Differential Revision: D9811093
Pulled By: li-roy
fbshipit-source-id: ce946270fd11b264ff1b09765db3300879491f76
Summary:
In order to comply with Python's rules on implicit casting of
non-booleans to booleans, this PR removes implicit casting in favor of
explicit casts via `bool()`
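A small sketch of what the explicit form looks like in a script function (the function itself is just an example):
```python
import torch

@torch.jit.script
def relu_or_neg(x):
    # explicit bool() cast instead of relying on implicit tensor-to-bool conversion
    if bool(x.sum() > 0):
        return x
    return -x
```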
cc zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11503
Differential Revision: D9780869
Pulled By: driazati
fbshipit-source-id: c753acaca27f4e79dddf424c6b04674f44a6aad9
Summary:
- Just a simple fix to support `fill_`
- And a fix for indexing in `pytorch-complex`
Differential Revision: D9804061
Pulled By: ezyang
fbshipit-source-id: 631129b3fa220a9670770b3766f14a8e03633bdf
Summary:
Add guards against using sparse tensor by checking the conversion from IValue -> PyObject & PyObject -> IValue.
This diff also changes the behavior of constant propagation to not run python ops even if all of their inputs are constant, because of possible mutation of global state. This came up in trying to run get_sparse(), and I'm including it here to make it easier to land.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11550
Differential Revision: D9804712
Pulled By: eellison
fbshipit-source-id: 9fe7daf721c6d6e48df4925c0f9c775873bcdc77
Summary:
Clean up some generated tests now that we have nice new features like varargs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11403
Differential Revision: D9800545
Pulled By: wanchaol
fbshipit-source-id: e9973b113f78dc38cf99a81b6ede3fa3485f1cfa
Summary:
* Many ops in the LSTM part of the model don't have implementations in ideep/mkl, and it doesn't make sense to copy back and forth for the few available ops because the majority of the RNN will run on CPU
* Thus the strategy is to enable mkl only for the resnet18 part of the model, then switch to the default CPU engine for the LSTM part
* The net may contain some external_inputs falsely added during ONNX->Caffe2 conversion. A canary in the service shows their existence could lead to a service crash (presumably because these blobs somehow get shared between threads). They're now manually removed, which seems to be enough to avoid the crash.
Reviewed By: viswanathgs
Differential Revision: D8888763
fbshipit-source-id: da7761bcb7d876ff7bbb6640ae4b24712c0b1de6
Summary:
After discussions in #11584, this is a new PR for just the test skip and hgemm integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11593
Differential Revision: D9798527
Pulled By: ezyang
fbshipit-source-id: e2ef5609676571caef2f8e6844909fe3a11d8b3e
Summary:
I am working on unifying the C++ extensions and C++ API, and one constraint for this is that we will want to be able to build the C++ API without cereal, since we won't want to ship it with the Python `torch` package.
For this I introduce a `TORCH_WITH_CEREAL` option to CMake. If on, the C++ API will be built with cereal and thus serialization support. If off, serialization functions will throw exceptions, but the library will otherwise still compile the same. __This option is on by default, so for regular C++ API users nothing will change__. However, from C++ extensions, we'll be able to turn it off. This effectively means we won't be searching for any cereal headers from C++ API headers, which wouldn't be installed in the Python package.
ebetica ezyang soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11498
Differential Revision: D9784803
Pulled By: goldsborough
fbshipit-source-id: 5d0a1f2501993012d28cf3d730f45932b483abc4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11470
In order to reduce build sizes, we are identifying files that can be split up into smaller units, allowing us to only include the ops we need.
Reviewed By: orionr, ajtulloch
Differential Revision: D9725819
fbshipit-source-id: def1074a33dffe99bd6a7e6e48aa9e5be3d04a6a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11587
To help debug the issue in T33295362, we add some checks in the function.
Possible crashing sites in `GetTensorInfo`:
1. tc is nullptr, which is checked.
2. tc->capacity_nbytes() hits nullptr; this is unlikely because storage is not a pointer and the computation of capacity_nbytes doesn't involve pointers. It's numel * itemsize().
3. tc->ExtractDeviceOption hits nullptr. One possibility is that raw_data() is nullptr, because tc->ExtractDeviceOption will use it. This is checked.
4. Tensor itself, which is not a reference. This is also checked.
Reviewed By: salexspb
Differential Revision: D9793484
fbshipit-source-id: 3fc72746fc310a23ae45553bbe0d269a4b9edb72
Summary:
Documents the `AnyModule` class in the C++ API.
Also changed the API to be friendlier by default. Calling `AnyModule::forward` used to return an `AnyModule::Value` which you had to call `.get<T>()` on to cast to a concrete type. I changed the name of that `forward` method to `any_forward` and instead made `forward` templated on a `ReturnType` template parameter which you can supply to do the `.get<T>` cast for you automatically. I default this parameter to `torch::Tensor` so that it can often be omitted. So where you used to have to write
```cpp
any_module.forward(...).get<int>();
any_module.forward(...).get<torch::Tensor>();
```
you now write
```cpp
any_module.forward<int>(...);
any_module.forward(...);
```
ebetica ezyang soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11580
Differential Revision: D9798626
Pulled By: goldsborough
fbshipit-source-id: 060b4ea28facaffc417f53b80b846a9dff9acb73
Summary:
This eliminates the need for any heuristics regarding stack size limits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11534
Differential Revision: D9779866
Pulled By: resistor
fbshipit-source-id: 96753eead7904bbdc2869fb01f7bd42141032347
Summary:
This PR contains a C++ implementation of weight norm. The user-side exposure of weight norm through torch.nn.utils.weight_norm is unchanged.
If running on the GPU, and the norm is requested over the first or last dimension of the weight tensor, the forward pass is carried out using the fused kernels I wrote for our Fairseq GTC hero run, which offer superior performance to primitive ops and superior numerical stability when running in FP16. In the common case that the backward pass is not itself constructing a graph (ie not attempting to set up double backward) the backward pass will be carried out using another fused kernel. If the backward pass is constructing a graph, an alternate code path is taken, which does the math using differentiable primitive ops. In this way, the implementation allows double backward, even if the fused kernel was used in forward (although in this case, you don't benefit from the performance and stability of the fused backward kernel).
If running on the CPU, or if norming over an interior dim, the forward pass is carried out using double-differentiable primitive ops.
Figuring out how to generate all the right plumbing for this was tricky, but it was a fun experience learning how the autogenerator works and how the graph is constructed. Thanks to colesbury for useful guidance on this front.
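For reference, the unchanged user-side API that ends up dispatching to this implementation (layer sizes are arbitrary):
```python
import torch
import torch.nn as nn

# reparameterizes `weight` as g * v / ||v|| along `dim`
layer = nn.utils.weight_norm(nn.Linear(20, 40), name='weight', dim=0)
x = torch.randn(8, 20)
layer(x).sum().backward()
print(layer.weight_g.shape, layer.weight_v.shape)  # torch.Size([40, 1]) torch.Size([40, 20])
```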
I do have a few lingering questions:
- Should I unify my return statements (ie by default-constructing Tensors outside if blocks and using operator= within)?
- What is the significance of `non_blocking` when calling e.g. `auto norms = saved_norms.to(saved_g.type().scalarType(), non_blocking=True/False);`? I am currently omitting `non_blocking`, so it defaults to False, but I didn't see any associated synchronizes on the timeline, so I'm wondering what it means.
- Is there an "official" mapping from at::ScalarTypes to corresponding accumulate types, as there are for the PODs + Half in [AccumulateType.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/AccumulateType.h)? I looked for an equivalent mapping for ScalarTypes, didn't find one, and ended up rigging it myself (` at::ScalarType AccType = g.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float : g.type().scalarType();`).
- Are sparse tensors a concern? Should I include another check for sparse tensors in the `_weight_norm` entry point, and send those along the fallback CPU path as well?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10842
Differential Revision: D9735531
Pulled By: ezyang
fbshipit-source-id: 24431d46532cf5503876b3bd450d5ca775b3eaee
Summary:
This changes the way module import works so that when a module
is reloaded in python it becomes a ScriptModule and not a _C.ScriptModule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11552
Differential Revision: D9782751
Pulled By: zdevito
fbshipit-source-id: 9576850b75494b228ce3def94c0d371a4a44b11d
Summary:
Also adds two additional tests that check for memory leaks while the relevant graph executors are alive:
- (minimal test): Create a ScriptModule, keep it alive, and test that it does not leak memory while it is alive
- (large test) Do MNIST training with a traced MNIST module and test that no memory is leaked while the traced module (with graph executor) is alive
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11544
Reviewed By: apaszke
Differential Revision: D9778479
Pulled By: zou3519
fbshipit-source-id: 2d6cdea81dd1264f2c0396b662f70fdafecb3647
Summary:
Fixes:
```
/bin/ld: warning: libnccl.so.1, needed by /data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so, not found (try using -rpath or -rpath-link)
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclAllReduce'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclBcast'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclCommInitAll'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclGetErrorString'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclReduceScatter'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclAllGather'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclReduce'
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11575
Differential Revision: D9789956
Pulled By: ezyang
fbshipit-source-id: 63e48763cc233be9d137cec721b239159b511a24
Summary:
This PR:
1. Documents `BatchNorm`,
2. Makes a number of API changes after reconsidering some quirks:
1. The default value for the `stateful` parameter used to be `false`, but the most common usage of `BatchNorm` out of the wild is certainly stateful, and the default in Python is also statefulness. So we change the default to stateful.
2. The `pure_forward` function used to use the internal running mean and variance variables instead of the ones supplied to that function call when `stateful` was true, which certainly seems odd. When you call `pure_forward` you would certainly expect the values you pass explicitly to be used. This is now fixed.
3. Adds tests for `BatchNorm`, finally.
ebetica apaszke ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11484
Reviewed By: pjh5
Differential Revision: D9779618
Pulled By: goldsborough
fbshipit-source-id: 59ba760e085c01454b75644b24b22317b688e459
Summary:
- Incorporates the MKL addition by mingfeima. Thank you! (but all errors are my own)
- Native CPU implementation: defer to matrix multiplication for
small batches and parallelize over batch dimension for large
batches.
- Add bmm test for CUDA just to be sure.
This is a partial fix for #10661, getting down to a factor ~5.
Considerable overhead is incurred for the setup in einsum. It might
be more efficient to eventually define optimized contraction
functions for arbitrary and multiple dimensions.
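A small sketch of the kind of einsum that reduces to a batched matrix multiply on this path (sizes are arbitrary):
```python
import torch

a = torch.randn(64, 128, 32)
b = torch.randn(64, 32, 256)

out = torch.einsum('bij,bjk->bik', a, b)     # dispatches to a batched matmul
print(out.shape)                             # torch.Size([64, 128, 256])
print(torch.allclose(out, torch.bmm(a, b)))  # True
```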
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11292
Differential Revision: D9784941
Pulled By: ezyang
fbshipit-source-id: f6dded2c6f5e8f0461fb38f31f9a824992a58358
Summary:
Make the test work with CPU-only builds; this also fixes test failures that had been present for a long time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11567
Differential Revision: D9785740
Pulled By: teng-li
fbshipit-source-id: 61c43b758c1ee53117e30de8074583e6faea863a
Summary:
This makes torch.distributed work for CPU-only builds.
Also added one more CI test case to cover the MPI CPU build.
All CI tests should cover this change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11513
Differential Revision: D9784546
Pulled By: teng-li
fbshipit-source-id: 0976a6b0fd199670926f0273e17ad7d2805e42e7
Summary:
Also, fix a performance bug in `ensureUnique`. Previously it formatted the warning string even though we weren't tracing, so all that work would *always* happen in the hot path and be for nothing.
A sample of how the new warnings look like:
```
tmp.py:4: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
int(x)
tmp.py:5: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
torch.tensor([1.])
tmp.py:6: TracerWarning: There are 2 live references to the data region being modified when tracing in-place operator add_. This might cause the trace to be incorrect, because all other views that also reference this data will not reflect this change in the trace! On the other hand, if all other views use the same memory, but are disjoint (e.g. are outputs of torch.split), this might still be safe.
torch.split(y, 2, dim=1)[0].add_(2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11545
Differential Revision: D9782975
Pulled By: apaszke
fbshipit-source-id: 5b3abd31366e59c69e0b7ff278042b5563deb5a9
Summary:
This fixes the build when CuDNN was not found on the system.
From the `git blame`, it looks like the bug has been around for 2 years :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11562
Differential Revision: D9784589
Pulled By: soumith
fbshipit-source-id: b33153436dced0a503c9833cdf52f7093f3394b4
Summary:
This adds a Note on making experiments reproducible.
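For reference, a sketch of the sort of settings such a note typically covers (the authoritative recommendations are in the note itself):
```python
import random
import numpy as np
import torch

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```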
It also adds Instructions for building the Documentation to `README.md`. Please ping if I missed any requirements.
I'm not sure what to do about the submodule changes. Please advise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11329
Differential Revision: D9784939
Pulled By: ezyang
fbshipit-source-id: 5c5acbe343d1fffb15bdcb84c6d8d925c2ffcc5e
Summary:
Ping ezyang
This addresses your comment in #114. Strangely, when running the doc build (`make html`) none of my changes are actually showing, could you point out what I'm doing wrong?
Once #11329 is merged it might make sense to link to the reproducibility note everywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11434
Differential Revision: D9751208
Pulled By: ezyang
fbshipit-source-id: cc672472449564ff099323c39603e8ff2b2d35c9
Summary: The PyTorch C++ API has `torch.nn.init` equivalents that the RNNG can use to initialize the state of its StackRNNs. This gets rid of the `fanInOut_` methods on `Parser` and tidies up `xavierInitialState` a little.
Reviewed By: wowitsmrinal
Differential Revision: D9472595
fbshipit-source-id: c202116f32383d3b4bba064c2c0d2656311e1170
Summary:
This PR does two things:
1. Replaces the implementation of the `Dropout` module with a call to the ATen function,
2. Replaces `Dropout2d` with a new `FeatureDropout` module that shall take the place of `Dropout2d` and `Dropout3d`. I contemplated calling it `Dropout2d` and making `Dropout3d` an alias for it, but similar to our decision for `BatchNorm{1,2,3}d` (c.f. https://github.com/pytorch/pytorch/pull/9188), we can deviate from Python PyTorch in favor of the ideal-world solution, which is to have a single module, since both actually just call `feature_dropout`.
I also replaced the implementation of `dropout3d` with a call to `dropout2d` in Python. The code is the same and it's easier for developers to parse than having to manually match the tokens to make sure it's really 100% the same code (which it is, if I matched the tokens correctly).
ebetica ezyang SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11458
Differential Revision: D9756603
Pulled By: goldsborough
fbshipit-source-id: fe847cd2cda2b6da8b06779255d76e32a974807c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11520
Previously, we had Type which was a catch all interface for all
functions and methods we could possibly want to do dynamic dispatch
on. However, we want to check in a non-autogenerated Tensor class
to ATen/core, and to do this, we must also check in a non-autogenerated
Type class which we can do dispatch on. In principle, we could
put the full Type interface in ATen/core, but this would be
a bad developer experience, since any time you add a new free
function, you'd have to regenerate the checked in Type header.
For a better dev experience, we split Type into a two parts,
Type, which will be checked in (though not in this diff), and
TypeExtendedInterface, which will NOT be checked in. Type contains
just enough methods to let Tensor be defined, and leaves the
rest to TypeExtendedInterface.
Some complications:
- We (very unfortunately) have overloaded virtual methods. Because
of C++'s rules, we cannot move one overload without doing some
extra work to make sure that overload in a superclass and an
overload in a subclass resolve together. I've chosen to resolve
this problem simply by moving ALL overloads of a method which
occurs in Tensor to Type.
- There are some places where we take a type() object and call
a method on it, which is not a Tensor base method. I've eliminated
some where possible, but in other cases calling the method on type
is the ONLY way to invoke it; in that case, I've just inserted
a cast. Further refactoring is necessary.
Reviewed By: gchanan
Differential Revision: D9771708
fbshipit-source-id: c59d39fe919cd6f42be6dca699d474346ea3c614
Summary:
The previous error was caused by mpi_test not depending on MPI_CXX_LIBRARIES. This might solve the problem.
Not tested locally - waiting for CI test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11416
Reviewed By: mingzhe09088
Differential Revision: D9771694
Pulled By: Yangqing
fbshipit-source-id: 53e7b4f64eadc88313bc4dd9b8e3f7931cda6e91
Summary:
This works around #11535 by avoiding `arange(n, out=x)` and `eye(n, out=x)` in `torch.distributions`. I've confirmed that the `.enumerate_support()` methods are now jittable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11542
Differential Revision: D9777805
Pulled By: apaszke
fbshipit-source-id: fa38f2f1acfc0a289f725fd8c92478573cfdbefb
Summary: Printing for complex numbers requires loading and storing between `Py_complex` and `std::complex`. This patch aims to support this for the plugin.
Differential Revision: D9771808
Pulled By: ezyang
fbshipit-source-id: 024865f1945d63ddb5efc775a35438c8ea06408e
Summary:
This whitelists train/eval functions in script modules, and tests that nested nn.Modules still work.
This also changes the code for calling python functions from script to allow non-tensor inputs/outputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11505
Differential Revision: D9765466
Pulled By: zdevito
fbshipit-source-id: 1177bff931324422b69e18fa0bbaa82e3c98ec69
Summary:
ezyang delivering my promise to you :)
Basically, now aten tests can use gtest as part of our test harness unification effort. I also converted one test (atest.cpp) to show how one can do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11429
Reviewed By: ezyang
Differential Revision: D9762934
Pulled By: Yangqing
fbshipit-source-id: 68ec3a748403c6bd88399b1e756200985a4e07e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11413
LengthsTileOp was implemented using a sequence of device memcopies initiated on the CPU. This was very slow. I changed it to use a kernel. TUM benchmark QPS improved from 13k QPS to 20k QPS as a result.
Reviewed By: manojkris, xianjiec
Differential Revision: D9724988
fbshipit-source-id: 2f98c697730982734d7c6a26d0b6967310d49900
Summary:
Provide a TensorAccessor-like interface for CUDA as discussed in #8366.
Compared to TensorAccessor
- the CUDATensorAccessor copies the sizes and strides while on the host (I didn't implement a host indexing function, though) to enable transfer to the device; on the device, `[]` works like for TensorAccessors,
- instantiation is from TensorAccessors in order to allow using `.accessor<..>`. The drawback is that you cannot use `auto` for the variable declaration, but the alternative would be a cuda-specific `.accessor`-like function,
- there is a PtrTraits argument to enable `__restrict__`,
Example for the intended use:
```
...
template <typename scalar_t>
__global__ void
apply_homography_2d_kernel(cuda::CUDATensorAccessor<scalar_t, 4> dest_a,
cuda::CUDATensorAccessor<scalar_t, 4> src_a,
cuda::CUDATensorAccessor<float, 2> transform) {
...
}
template <typename scalar_t>
Tensor apply_homography_2d_template(Tensor& res, const Tensor& image, const Tensor& transform) {
...
cuda::CUDATensorAccessor<scalar_t, 4> image_a(image.accessor<scalar_t, 4>());
cuda::CUDATensorAccessor<scalar_t, 4> res_a(res.accessor<scalar_t, 4>());
cuda::CUDATensorAccessor<float, 2> transform_a(transform.accessor<float, 2>());
auto stream = at::cuda::getCurrentCUDAStream();
apply_homography_2d_kernel<scalar_t>
<<<grid, block, 0, stream>>>(res_a, image_a, transform_a);
return res;
}
...
```
I could use a hint where to put a test for this (e.g. doing a plain vanilla matrix multiplication with a custom kernel) and comparing with the aten mm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11373
Differential Revision: D9735573
Pulled By: ezyang
fbshipit-source-id: 482b218a0d514e19a8b692bbc77c0e37082cfded
Summary: Considering these increase the size of the message stack, I didn't touch the code outside `ATen/native`
Differential Revision: D9754283
Pulled By: soumith
fbshipit-source-id: 04198ec4fd0c4abae09eeba92c493a783408537a
Summary:
This PR adds the "merge to master" step before the build step in CircleCI, so that all PR commits are built against master instead of against the PR's branch. Note that all PRs still need to rebase to master to pick up this new config, so it won't apply to old PR branches retroactively.
To check in CI: make sure it's performing the git merge to master appropriately in "Merge Onto Master" step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11443
Differential Revision: D9775628
Pulled By: yf225
fbshipit-source-id: 8083db6b098d234a44ae4481f40a486e9906f6f8
Summary:
Disable all CircleCI jobs until we are ready to move forward with them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11523
Differential Revision: D9774462
Pulled By: yf225
fbshipit-source-id: c5724e71eb68bac4df958b4f7bcc380050668b3c
Summary:
Need to link CUDA statically for benchmarking purpose.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10596
Reviewed By: llyfacebook
Differential Revision: D9370738
Pulled By: sf-wind
fbshipit-source-id: 4464d62473e95fe8db65b0bd3b301f262bf269bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11486
I discovered these by narrowing the interface on Type, and then
fixing call sites outside of core plumbing code which depended
on these methods being provided.
Reviewed By: cpuhrsch
Differential Revision: D9757935
fbshipit-source-id: 3abda0c98919a448a326a757671d438964f6909f
Summary:
I noticed warnings from within pybind11 being shown when building C++ extensions. This can be avoided by including non-user-supplied headers with `-isystem` instead of `-I`
I hope this works on Windows.
soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11459
Differential Revision: D9764444
Pulled By: goldsborough
fbshipit-source-id: b288572106078f347f0342f158f9e2b63a58c235
Summary:
This speeds up incremental builds by doing the following changes:
- Uses `rsync` instead of `cp` (when `rsync` is found) which is a bit smarter in doing "maybe copy"
- Introduces a `rebuild` mode which does not rerun `cmake` in `build_pytorch_libs.sh`.
*Note: `rebuild` should only be used if you don't add/remove files to the build, as `cmake` is not rerun*
Current no-op rebuild speedup:
- 1m 15s -> 20s
There are some lingering bugs. No-op rebuilds rerun `cmake` for two rebuilds (likely because the cmake logic depends on the install folder, hence kicking off a rebuild).
So what you see
```
python setup.py rebuild develop # first time - ~5 mins
python setup.py rebuild develop # second time - ~3 mins
python setup.py rebuild develop # third time - ~2 mins
python setup.py rebuild develop # fourth time - ~20 seconds
python setup.py rebuild develop # fifth time - ~20 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11487
Differential Revision: D9769087
Pulled By: soumith
fbshipit-source-id: 20fbecde33af6426149c13767e8734fb3be783c5
Summary:
ATen has had a separate build target in the past, but with our move to a root-level CMakeLists.txt file this makes less sense and is harder to maintain. Also, as we blend code between Caffe2 and ATen this will become even less maintainable.
Talked to ezyang about this, but also cc zdevito, Yangqing, and soumith. If this is too difficult, I will revert, but want to see if we can simplify for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11488
Differential Revision: D9770266
Pulled By: orionr
fbshipit-source-id: c7ba52a1676d84e2d052dad4c042b666f49451cd
Summary:
This adds a `.expand` method for distributions that is akin to the `torch.Tensor.expand` method for tensors. It returns a new distribution instance with batch dimensions expanded to the desired `batch_shape`. Since this calls `torch.Tensor.expand` on the distribution's parameters, it does not allocate new memory for the expanded distribution instance's parameters.
e.g.
```python
>>> d = dist.Normal(torch.zeros(100, 1), torch.ones(100, 1))
>>> d.sample().shape
torch.Size([100, 1])
>>> d.expand([100, 10]).sample().shape
torch.Size([100, 10])
```
We have already been using the `.expand` method in Pyro in our [patch](https://github.com/uber/pyro/blob/dev/pyro/distributions/torch.py#L10) of `torch.distributions`. We use this in our models to enable dynamic broadcasting. This has also been requested by a few users on the distributions slack, and we believe will be useful to the larger community.
Note that currently, there is no convenient and efficient way to expand distribution instances:
- Many distributions use `TransformedDistribution` (or wrap another distribution instance, e.g. `OneHotCategorical` uses a `Categorical` instance) under the hood, or have lazy parameters. This makes it difficult to collect all the relevant parameters, broadcast them and construct new instances.
- In the few cases where this is even possible, the resulting implementation would be inefficient since we will go through a lot of broadcasting and args validation logic in `__init__.py` that can be avoided.
The `.expand` method allows for a safe and efficient way to expand distribution instances. Additionally, this bypasses `__init__.py` (using `__new__` and populating relevant attributes) since we do not need to do any broadcasting or args validation (which was already done when the instance was first created). This can result in significant savings as compared to constructing new instances via `__init__` (that said, the `sample` and `log_prob` methods will probably be the rate determining steps in many applications).
e.g.
```python
>>> a = dist.Bernoulli(torch.ones([10000, 1]), validate_args=True)
>>> %timeit a.expand([10000, 100])
15.2 µs ± 224 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit dist.Bernoulli(torch.ones([10000, 100]), validate_args=True)
11.8 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
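For illustration, here is a minimal sketch of the `__new__`-based expansion described above for a single concrete distribution; the attribute handling is simplified and is not the actual `torch.distributions` implementation:
```python
import torch
from torch.distributions import Normal

def expand_normal(d, batch_shape):
    # Allocate an instance without running __init__, so no re-broadcasting
    # or argument validation happens; then expand the existing parameters.
    new = Normal.__new__(Normal)
    new.loc = d.loc.expand(batch_shape)      # expand() returns views, no new storage
    new.scale = d.scale.expand(batch_shape)
    super(Normal, new).__init__(torch.Size(batch_shape), validate_args=False)
    return new

d = Normal(torch.zeros(100, 1), torch.ones(100, 1))
print(expand_normal(d, (100, 10)).sample().shape)  # torch.Size([100, 10])
```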
cc. fritzo, apaszke, vishwakftw, alicanb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11341
Differential Revision: D9728485
Pulled By: soumith
fbshipit-source-id: 3b94c23bc6a43ee704389e6287aa83d1e278d52f
Summary:
This enabled `torch.einsum` both in tracing and in script mode. It's used all over Pyro at the moment, and is needed for any use of the JIT in there.
Fixes #11157.
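A minimal sketch of the kind of script function this enables (the list-of-operands form is assumed here; the exact syntax accepted in script at the time may differ):
```python
import torch

@torch.jit.script
def outer(x, y):
    # einsum is now recognized by the script compiler / tracer
    return torch.einsum('i,j->ij', [x, y])

print(outer(torch.arange(3.), torch.arange(4.)).shape)  # torch.Size([3, 4])
```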
zdevito fritzo neerajprad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11506
Differential Revision: D9764787
Pulled By: apaszke
fbshipit-source-id: 9b5251b9e7c5897034602bd07ff67b425d33326c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11418
Several improvements that aim to make the APIs more straightforward to use:
- Get rid of the helper methods subgraph and nonTerminal. Users should now create an NNMatchGraph directly via the graph's createNode and createEdge APIs.
- Get rid of the operatorSubgraph helper method.
- The invertGraphTraversal flag applies to both the match graph and the scanned graph. This allows users to create the match graph in the same direction as the scanned graph, thus reducing confusion.
- Additional parameters of matchNode (count, includeInSubgraph, nonTerminal) are removed from the constructors and moved into setter methods. (We no longer enforce that MatchNode is immutable, but this helps improve code clarity.)
- Tests are updated to reflect the changes.
Follow-up changes:
- Possibly clean up the tests further. This change aims to minimally modify the unit tests.
- Add a validity check that enforces the current limitation of the match graph (single source node) and throws if the match graph does not satisfy the criteria.
- Have the single source node be detected automatically so callers just need to pass in the matchGraph instead of the source node reference.
Differential Revision: D9732565
fbshipit-source-id: ae8320e2bc89b867f6bb4b1c1aad635f4b219fa1
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
Summary:
Add flags for LMDB and LevelDB, default `OFF`. These can be enabled with
```
USE_LMDB=1 USE_LEVELDB=1 python setup.py build_deps
```
Also add a flag to build Caffe2 ops, which is default `ON`. Disable with
```
NO_CAFFE2_OPS=1 python setup.py build_deps
```
cc Yangqing soumith pjh5 mingzhe09088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11462
Reviewed By: soumith
Differential Revision: D9758156
Pulled By: orionr
fbshipit-source-id: 95fd206d72fdf44df54fc5d0aeab598bff900c63
Summary:
Skip torch tests as well when NO_TEST=1 environment variable is set. Also remove the separate ATen code path for not being built with Caffe2, since it will always be built with Caffe2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11415
Reviewed By: soumith
Differential Revision: D9758179
Pulled By: orionr
fbshipit-source-id: e3e3327364fccdc57a703aeaad8c4f30452973fb
Summary:
There's a bunch of legacy code where people are explicitly instantiating Variable, and these call-sites have thus far been untraceable (appearing as prim::Constant nodes with the tensor value at the time of tracing). This makes it so that the new variable inherits the traced Value* from the tensor it's being constructed from
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11463
Differential Revision: D9756529
Pulled By: jamesr66a
fbshipit-source-id: da99c6a7621957a305f2699ec9cb9def69b1b2d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10291
This new operator will do the following:
Given a LENGTHS vector and n_splits, output a "split" LENGTHS vector where:
1. Each length in input vector is split into n_splits values (thus output vector should have LENGTHS.size(0) * n_splits elements)
2. The new lengths in output should be evenly split, and if the length is not divisible by n_splits, then order new values in descending order. (e.g. n_splits = 3, length = 5 -> 2 2 1)
3. If n_splits > some element in the array, its split elements will contain 0s. (e.g. n_splits = 3, length = 2 -> 1 1 0)
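A small Python sketch of the splitting rule described above (plain Python, not the Caffe2 operator itself):
```python
def split_lengths(lengths, n_splits):
    # Each input length becomes n_splits values that sum to it, as evenly as
    # possible with larger values first; lengths shorter than n_splits pad with 0.
    out = []
    for length in lengths:
        base, rem = divmod(length, n_splits)
        out.extend([base + 1] * rem + [base] * (n_splits - rem))
    return out

assert split_lengths([5], 3) == [2, 2, 1]
assert split_lengths([2], 3) == [1, 1, 0]
```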
Reviewed By: bddppq, chocjy
Differential Revision: D9013119
fbshipit-source-id: 82bf3371ec08c41fc3379177f0007afc142e0d84
Summary:
This PR is stacked on https://github.com/pytorch/pytorch/pull/10610, and only adds changes in one file `.jenkins/pytorch/test.sh`, where we now build the custom op tests and run them.
I'd also like to take this PR to discuss whether the [`TorchConfig.cmake`](https://github.com/pytorch/pytorch/blob/master/cmake/TorchConfig.cmake.in) I made is robust enough (we will also see in the CI) orionr Yangqing dzhulgakov what do you think?
Also ezyang for CI changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10611
Differential Revision: D9597627
Pulled By: goldsborough
fbshipit-source-id: f5af8164c076894f448cef7e5b356a6b3159f8b3
Summary:
Many constructors like `torch.zeros` or `torch.randn` didn't support
size tracing correctly which is fixed by this pass. Same issue has been
fixed in legacy tensor constructors.
Additionally, new tensor constructors, which do not participate in
tracing (most notably `torch.tensor`, `torch.as_tensor` and
`torch.from_numpy`) raise a warning when they are used.
Finally, entering a traceable operation disables the tracing in its body.
This is needed because
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11288
Reviewed By: ezyang
Differential Revision: D9751183
Pulled By: apaszke
fbshipit-source-id: 51444a39d76a3e164adc396c432fd5ee3c8d5f7f
Summary:
NCCL1 uses `int` as its numerical type for fields like `count`, which makes broadcasting tensors larger than `2 << 31 - 1` impossible, and raises the opaque error `invalid arguments`. NCCL2 greatly increases the limit on many platforms by using `size_t`. This patch statically detects this type, and raises properly if the broadcast tensor exceeds the limit.
No test because I don't think our test suite should broadcast big tensors.
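Conceptually, the check amounts to something like the following Python sketch (the actual check is done in C++ against the detected NCCL count type):
```python
NCCL1_MAX_COUNT = 2 ** 31 - 1  # NCCL1 uses a C `int` for the element count

def check_broadcast_size(tensor):
    # Raise a clear error instead of NCCL's opaque "invalid arguments".
    if tensor.numel() > NCCL1_MAX_COUNT:
        raise RuntimeError(
            "Broadcast tensor has %d elements, which exceeds the NCCL1 limit of %d"
            % (tensor.numel(), NCCL1_MAX_COUNT))
```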
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11466
Differential Revision: D9754753
Pulled By: SsnL
fbshipit-source-id: 73506450cae047e06b5b225b39efdb42d5d26685
Summary:
Normalizing by the world size before the reduction is less likely to cause overflow in FP16 training.
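A hand-rolled sketch of the idea (not the DistributedDataParallel internals; `model` and `world_size` are placeholders):
```python
import torch.distributed as dist

def allreduce_grads(model, world_size):
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(world_size)  # normalize first, keeping values small in FP16
            dist.all_reduce(p.grad)  # sum of pre-divided grads = averaged grad
```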
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11109
Differential Revision: D9594708
Pulled By: myleott
fbshipit-source-id: 93ab53cb782ee1cbe1264e529b333490a0940338
Summary:
I'm 80% sure that this fixes the math bug. But I can't repro locally so I don't know.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11472
Differential Revision: D9755328
Pulled By: SsnL
fbshipit-source-id: 130be664d3c6ceee3c0c166c1a86fc9ec3b79d74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11294
The Tensor(ptr, retain) constructor is error prone and circumvents the intrusive_ptr safety.
This diff removes that and pushes the responsibility to callers.
Step by step, manual refcounting can be pushed back and possibly eliminated in the end.
Reviewed By: ezyang
Differential Revision: D9663476
fbshipit-source-id: 7f010e5e47b137a9575960201c5bf5d552c5c2f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11260
This is needed to make something like this work:
intrusive_ptr<TensorImpl, UndefinedTensorImpl> a = make_intrusive<SparseTensorImpl>(...);
Reviewed By: ezyang
Differential Revision: D9652089
fbshipit-source-id: 19c65e98460ccb27bc69e36d7e558cb9d6e67615
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11258
The two intrusive_ptr constructors in Tensor can be combined into one implementation that does both, moving and copying.
Reviewed By: ezyang
Differential Revision: D9652088
fbshipit-source-id: 5efca02654ba305c99c20bbeb83551469d17a51d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11238
- when moving an IValue, free the old value instead of keeping it allocated
- making classes final
- moving std::string
- making ConstantList const
Reviewed By: ezyang
Differential Revision: D9644700
fbshipit-source-id: ab7228368e4f00f664ba54e1242b0307d91c5e7e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11167
Narrow the Blob API as preparation for merging Blob/IValue
- get rid of templated IsType and Operator::InputIsType / OutputIsType
- Use 'using' instead of 'typedef' for DestroyCall (just for readability)
Reviewed By: ezyang
Differential Revision: D9623916
fbshipit-source-id: 952f0b0cf5a525094b02e8d2798dd57a56a9e1d8
Summary:
Checking assertExportImport for all of the generated test jit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10982
Differential Revision: D9636935
Pulled By: eellison
fbshipit-source-id: f3f1ce77d454848098f2ac7e0fa18bf8564890be
Summary:
`Process.start()` actually takes some time, as it needs to start a
process and pass the arguments over via a pipe. Therefore, we
only add a worker to the self.workers list after it has started, so
that we do not call `.join()` on a worker that never started; otherwise,
if the program dies before a worker starts and `__del__` tries to join it, we get:
AssertionError: can only join a started process.
Example trace when such error happens:
```py
[unrelated]
File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 500, in __iter__
return _DataLoaderIter(self)
File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 292, in __init__
w.start()
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
self.pid = os.fork()
KeyboardInterrupt
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7fa704d5aa60>
Traceback (most recent call last):
File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 398, in __del__
self._shutdown_workers()
File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 392, in _shutdown_workers
w.join()
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 139, in join
assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
```
No test because hard to reliably trigger.
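A simplified sketch of the ordering change (placeholder `worker_loop`/`num_workers`, not the actual DataLoader code):
```python
import multiprocessing

def start_workers(worker_loop, num_workers):
    workers = []
    for i in range(num_workers):
        w = multiprocessing.Process(target=worker_loop, args=(i,))
        w.daemon = True
        w.start()          # may be interrupted before returning
        workers.append(w)  # only track workers that actually started, so join() is safe
    return workers
```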
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11432
Reviewed By: ezyang
Differential Revision: D9735430
Pulled By: SsnL
fbshipit-source-id: a8912d9bb4063f210d6236267b178173810e2351
Summary:
as discussed with ezyang and slayton58 , this might be a nice convenience to be able to use code in extensions just as in ATen.
also split off `tracing_state.h` from `torch/jit/tracer.h` (fixes #11204) to be able to use the utility functions
pytorchbot it's not a jit patch per se.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11425
Differential Revision: D9735556
Pulled By: ezyang
fbshipit-source-id: 466c92bbdb1d7d7a970eba1c26b7583fe9756139
Summary:
A recent build regression is that we need a system GoogleTest for builds to pass.
This was because, when building with Gloo, gloo is trying to build its own tests, which look for system gtest [here](https://github.com/facebookincubator/gloo/blob/master/cmake/Dependencies.cmake#L72-L80) (because we're not using the full cmake build and making it aware of third_party/GoogleTest, but instead we are building it isolated using tools/build_pytorch_libs.sh).
Traditionally, we didn't ask Gloo to build its tests, but because we added `-DBUILD_TEST=1` by default to all builds (in refactoring variable names), we accidentally started asking Gloo to build its tests.
This PR overrides the Gloo flags and asks it to not build tests (like it used to).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11431
Differential Revision: D9736387
Pulled By: soumith
fbshipit-source-id: 59e84edae780123b793bdaea5fd9ac46156cd0af
Summary:
This PR parallels `masked_fill` on CPU, currently it runs in sequential on CPU.
the following script is used to benchmark and verify this PR. On Xeon skylake 8180 (2 sockets * 28 cores),
it runs `4.20` sec without the PR and `0.11` sec with the PR.
```python
import torch
import random
from time import time
size = 10 * 1000 * 1000
count = 100
def test_masked_fill():
    dst = torch.randn(size)
    dst_ = dst.clone()
    mask = torch.rand(size).mul(2).floor().byte()
    val = random.random()
    tstart = time()
    for i in range(count):
        dst.masked_fill_(mask, val)
    tend = time()
    print("masked_fill_: %f" % (tend - tstart))
    for i in range(size):
        if mask[i]:
            if dst[i] != val:
                print("fail")
        else:
            if dst[i] != dst_[i]:
                print("fail1")
    print("test_masked_fill: PASS")
test_masked_fill()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11359
Differential Revision: D9735578
Pulled By: ezyang
fbshipit-source-id: d437ad7c6dace1910d0c18d6d9ede80efb44fae4
Summary:
Added AVX optimizations for pdist using Vec256. This brings single threaded performance up to speed with scipy, but the current implementation greatly hurts performance without AVX enabled. Is there a way to special case out AVX on dispatch and call the non Vec256 code? Or is the way I used Vec256 completely wrong?
Single threaded comparison to scipy
============================
This is the time to compute the pdist of a 2048 x 2048 float matrix with only one thread for various values of p between torch and scipy. p = 3 is the code path for arbitrary p, and so is much slower than the other values.
p | torch | scipy
-----|-----------|------
0 | 6.27 s ± 393 ms | 7.23 s ± 498 ms
1 | 5.49 s ± 201 ms | 43.4 s ± 1.09 s
2 | 5.74 s ± 474 ms | 53.8 s ± 3.52 s
∞ | 5.59 s ± 292 ms | 47.4 s ± 2.03 s
3 | really slow | gave up
Result by AVX support
================
This is the time to compute the distance and gradient of a 2048 x 2048 float matrix with all threads by AVX support. `before` is the old code, `default` is no AVX support, etc. Interestingly the AVX optimizations provided a great benefit over the old unoptimized code, but drastically hurt performance when compiled without AVX optimizations. p = 3 is the code path for arbitrary p, and so is much slower than the other values.
Results for p = 0
----------------
avx | dist | grad
----|------|-----
before | 514 ms ± 87.5 ms | 191 µs ± 35 µs
default | 3.47 s ± 183 ms | 201 µs ± 24.6 µs
avx | 123 ms ± 18.2 ms | 281 µs ± 130 µs
avx2 | 103 ms ± 11.4 ms | 216 µs ± 74.4 µs
Results for p = 1
----------------
avx | dist | grad
----|------|-----
before | 426 ms ± 35 ms | 6.21 s ± 187 ms
default | 2.6 s ± 123 ms | 5.62 s ± 273 ms
avx | 104 ms ± 6.37 ms | 833 ms ± 44.3 ms
avx2 | 106 ms ± 3.59 ms | 924 ms ± 86.2 ms
Results for p = 2
-----------------
avx | dist | grad
----|------|-----
before | 425 ms ± 45.4 ms | 6.31 s ± 125 ms
default | 3.04 s ± 187 ms | 3.55 s ± 242 ms
avx | 110 ms ± 3.66 ms | 896 ms ± 21.8 ms
avx2 | 113 ms ± 4.68 ms | 934 ms ± 25.2 ms
Results for p = ∞
------------------
avx | dist | grad
----|------|-----
before | 501 ms ± 39.5 ms | 6.64 s ± 321 ms
default | 2.15 s ± 92.9 ms | 8.43 s ± 355 ms
avx | 104 ms ± 5.52 ms | 835 ms ± 36.7 ms
avx2 | 100 ms ± 3.41 ms | 864 ms ± 67 ms
Results for p = 3
-----------------
avx | dist | grad
----|------|-----
before | 22.6 s ± 413 ms | 11.1 s ± 242 ms
default | 24.9 s ± 1 s | 11.2 s ± 293 ms
avx | 2.69 s ± 148 ms | 5.63 s ± 88.4 ms
avx2 | 2.48 s ± 31.8 ms | 5.61 s ± 114 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11230
Differential Revision: D9735503
Pulled By: erikbrinkman
fbshipit-source-id: a9da619249e4ca2625b39ca1ca7f5543c3086bfb
Summary:
If pybind is built with cmake and installed, we should use its config file instead of the Findpybind11 shipped with caffe2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11423
Differential Revision: D9735557
Pulled By: ezyang
fbshipit-source-id: 28a39e579fa045060aa1a716e5fd7dbcf7b89569
Summary:
Fixes the issue discussed in #10838. `hidden_size` should be the last dimension regardless if we're in ONNX or PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11368
Differential Revision: D9734814
Pulled By: soumith
fbshipit-source-id: 7f69947a029964e092c7b88d1d79b188a417bf5f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11420
Surprisingly tricky! Here are the major pieces:
- We grow an even more ludicrous macro
AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF
which does what it says on the tin. This is because I was
too lazy to figure out how to define the necessary conversions
in and out of ComplexHalf without triggering ambiguity problems.
It doesn't seem to be as simple as just Half. Leave it for
when someone actually wants this.
- Scalar now can hold std::complex<double>. Internally, it is
stored as double[2] because nvcc chokes on a non-POD type
inside a union.
- overflow() checking is generalized to work with complex.
When converting *to* std::complex<T>, all we need to do is check
for overflow against T. When converting *from* complex, we
must check (1) if To is not complex, that imag() == 0
and (2) for overflow componentwise.
- convert() is generalized to work with complex<->real conversions.
Complex to real drops the imaginary component; we rely on
overflow checking to tell if this actually loses fidelity. To get
the specializations and overloads to work out, we introduce
a new Converter class that actually is specializable (see the sketch after this list).
- Complex scalars convert into Python complex numbers
- This probably fixes complex tensor printing, but there is no way
to test this right now.
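A plain-Python sketch of the conversion/overflow rules in the bullets above (not the actual ATen convert()/overflow() code):
```python
def convert_from_complex(z, to_type):
    # Converting complex -> real drops the imaginary part; the overflow check
    # is what reports the loss of fidelity when imag() != 0.
    if to_type is complex:
        return complex(z)
    if z.imag != 0:
        raise OverflowError("nonzero imaginary component cannot be represented")
    return to_type(z.real)

print(convert_from_complex(3 + 0j, float))  # 3.0
# convert_from_complex(3 + 2j, float)       # would raise OverflowError
```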
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Reviewed By: cpuhrsch
Differential Revision: D9697878
Pulled By: ezyang
fbshipit-source-id: 181519e56bbab67ed1e5b49c691b873e124d7946
Summary:
vishwakftw Your patch needed some updates because the default native function dispatches changed from `[function, method]` to `[function]`. The CI was run before that change happened so it still shows green, but the internal test caught it.
I did some changes when rebasing and updating so I didn't just force push to your branch. Let's see if this passes CI and internal test. If it does, let me know if you want me to force push to your branch or use this PR instead.
Note to reviewers: patch was already approved at #10068 .
cc yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11421
Differential Revision: D9733407
Pulled By: SsnL
fbshipit-source-id: cf2ed293bb9942dcc5158934ff4def2f63252599
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11331
In the previous commit, we added a bare-bones LegacyTypeDispatch in ATen/core.
This is not sufficient for the use cases we need: we not only need to be able to
get a Type, but we also need to be able to *initialize* the Types if its the first time
we have retrieved a CPU/CUDA/Complex type. I hemmed and hawed about how
to do this; the strategy this PR takes is to introduce a new "hooks" interface
specifically for initializing CPU/CUDA/Complex (which still lives in Context). We then
move all "user-friendly" functions to LegacyTypeDispatch.
Here were some other options which I considered, but don't work:
- Assume that Type is already initialized, because we only intend to call Type
from Tensor methods, where we already have a Tensor. This does not work
because Caffe2 created tensors will not have gone through the standard
Type codepath, and will have skipped initialization.
- Move CUDAHooks and ComplexHooks to ATen/core. Besides being sucky,
this isn't even a complete fix, because I still need to initialize CPU hooks
(so you *still* need another hooks interface).
Reviewed By: cpuhrsch
Differential Revision: D9666612
fbshipit-source-id: ac7004b230044b67d13caa81fdfaf3c6ab915e3f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11274
We don't want to put all of Context into ATen/core, but one
particular part cannot be avoided: the type registry, because
implementations of TensorMethods will need to get a Type,
and then do a virtual call on it.
I needed to do a little bit of (temporary) footwork to get this
in without also moving Type, because unique_ptr<Type> expects
to be able to see the destructor of Type (but it's forward declared
right now). So instead I put the destructor as an explicit functor. We
can get rid of this once Type actually moves in ATen/core
Reviewed By: cpuhrsch
Differential Revision: D9657449
fbshipit-source-id: 940931493bf4f1f6a8dad03f34633cacdd63dd0b
Summary:
Set the default timeout to 300 seconds to be safe. There used to be no timeout in THD.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11409
Differential Revision: D9731709
Pulled By: teng-li
fbshipit-source-id: 0ce011dcca507cbf063176ad4995405c77dd0cdd
Summary:
Currently the gradient is copied into .grad if .grad is None. This PR aims to remove the copy when it is not absolutely needed.
It is generally an improvement of speed and memory usage. And here is a case it may help a lot:
Normally, people do optimizer.zero_grad() every minibatch before backward. It will translate into a memset, and later a point-wise add.
When there is some large weight in the network, one optimization people can always do is to set parameter.grad to None instead of calling zero_grad(). This will remove the memset and change the point-wise add to a memcpy.
Here is the result of running the following script on a V100 GPU. It is 100 iterations of forward/backward/zero_grad on a single embedding of 1-billion-word-benchmark size.
`Zero grad: 2.123847723007202`
`None grad: 1.3342866897583008`
With the backend change of this PR, the unnecessary memcpy is removed, thus further speed up is achieved.
`Zero grad: 2.124978542327881`
`None grad: 0.4396955966949463`
[benchmark.txt](https://github.com/pytorch/pytorch/files/2341800/benchmark.txt)
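For context, a minimal sketch of the `.grad = None` pattern discussed above (the tiny model and inputs are placeholders):
```python
import torch

model = torch.nn.Linear(4, 2)
inputs = torch.randn(8, 4)

# Instead of optimizer.zero_grad() (memset now, point-wise add during backward),
# drop the gradients so backward can hand its buffer to .grad directly.
for p in model.parameters():
    p.grad = None

loss = model(inputs).sum()
loss.backward()  # .grad is repopulated without the extra copy this PR removes
```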
Some details on the code change:
.detach() is used because we need to get rid of new_grad being a view without copy data. This should be safe in first-order only mode.
data need to be contiguous, otherwise `grad_variable.data() += new_grad.data();` below will fail.
Only the last variable that has reference to the temp gradient will grab its buffer.
ngimel, mcarilli and mruberry helped on finalizing this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11165
Differential Revision: D9728874
Pulled By: soumith
fbshipit-source-id: b8fb822a2dff6e812bbddd215d8e384534b2fd78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11247
Previously, the default for a declaration in native_functions.yaml
was ['function', 'method'], i.e., generate both a method and
function for every binding. We now believe this is inappropriate:
the majority of new kernels added to PyTorch should live as
free functions, NOT methods. Thus, we change the default accordingly.
I also took the opportunity to de-method some "internal" functions
that had a leading underscore. While, strictly speaking, this is a
BC breaking change, I believe it is highly unlikely anyone was using
these directly.
Reviewed By: yf225
Differential Revision: D9648570
fbshipit-source-id: 8b94647b824e0899d6d18aa5585aaedc9d9957d2
Summary:
This is mainly to pick up change 20074be19a to avoid polluting the CMAKE_DEBUG_POSTFIX variable. cc orionr.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11388
Reviewed By: orionr
Differential Revision: D9720931
Pulled By: Yangqing
fbshipit-source-id: 18a60d0409e74316f74d364f4fe16bf0d0198413
Summary:
Moves the complex registration code into an out-of-line C++ extension to de-noise the test_cpp_extensions.py file. Let's keep it nice and tidy so we can point our users at it for usage examples.
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11397
Differential Revision: D9725335
Pulled By: goldsborough
fbshipit-source-id: 290618f2ee711b1895cdb8f05276034dfe315c6d
Summary:
~~This PR fixes #8525 by renaming `split_with_sizes` to `split` so that 2 `aten::split` ops are
generated (previously `aten::split(self, int, int)` and `aten::split_with_sizes(self, int[], int)` were generated)~~
~~`split_with_sizes` was made in PR #5443, but I don't see a reason for it to have
a different name than `split` rather than just overload `split`.~~
This PR fixes #8525 by adding `register_special_ops.cpp` to mirror the Python dispatching from `split` to `split` and `split_with_sizes` in [tensor.py](https://github.com/pytorch/pytorch/blob/master/torch/tensor.py#L279).
It also fixes #8520 by adding an `int[]` wherever it sees `torch.Size`.
In a follow up PR this could also be used to fix some of the other `unknown builtin op` test errors.
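For reference, the two dispatch targets the new registration mirrors (standard `torch.split` behavior):
```python
import torch

x = torch.arange(10)
chunks = torch.split(x, 2)       # int argument  -> aten::split
parts = torch.split(x, [4, 6])   # list of ints  -> aten::split_with_sizes
print([c.numel() for c in chunks])  # [2, 2, 2, 2, 2]
print([p.numel() for p in parts])   # [4, 6]
```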
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11051
Differential Revision: D9582443
Pulled By: driazati
fbshipit-source-id: d27201f85937d72e45e851eaa1460dd3dd1b61a9
Summary:
This seems to be causing different versions of OpenMPI to be picked up
by different parts of the build. It's not a good practice to include absolute
paths anyway, so let's try removing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11386
Reviewed By: teng-li
Differential Revision: D9724349
Pulled By: pietern
fbshipit-source-id: 3dfef91c81f2e97e5125284aff9e7e98f8761917
Summary:
Continuing pjh5's work to remove FULL_CAFFE2 flag completely.
With these changes you'll be able to also do something like
```
NO_TEST=1 python setup.py build_deps
```
and this will skip building tests in caffe2, aten, and c10d. By default the tests are built.
cc mingzhe09088 Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11321
Reviewed By: mingzhe09088
Differential Revision: D9694950
Pulled By: orionr
fbshipit-source-id: ff5c4937a23d1a263378a196a5eda0cba98af0a8
Summary:
In addition to documentation, this cleans up a few error message formats.
It also adds infra to find which operators are supported by the JIT automatically, which is then used in the generation of the docs.
The wording and formatting of the docs is not yet polished, but having this will allow our document writers to make faster progress.
Followup PRs will polish the docs and fix formatting issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11357
Differential Revision: D9721277
Pulled By: zdevito
fbshipit-source-id: 153a0d5be1efb314511bcfc0cec48643d78ea48b
Summary:
Add a barrier() to wait for all PGs to be created before destroying them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11391
Differential Revision: D9727383
Pulled By: teng-li
fbshipit-source-id: 689d62c978e642b68f4949dcf29982e34869ada4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11382
We found this cudnn bug in S163230 that causes accuracy loss. We fixed this in D9601217, but due to the reimplementation of spatialBN the fix was overwritten. Let's land this fix again.
Reviewed By: kuttas
Differential Revision: D9702347
fbshipit-source-id: 11547e9edaf7b2ba7f4aa7263ffb4f0281bbf078
Summary:
The next function I'm moving to C++ is `sync_params`. It is stacked on top of https://github.com/pytorch/pytorch/pull/9729, so some changes will go away when it lands and I rebase.
I also split code into a `.h` and `.cpp` file for better code organization.
pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9805
Differential Revision: D9688604
Pulled By: goldsborough
fbshipit-source-id: 4467104d3f9e2354425503b9e4edbd59603e20a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11336
Move `context_base.h` header to `ATen/core` and the implementations are in `caffe2/core/context_base.cc`
Reviewed By: ezyang
Differential Revision: D9670493
fbshipit-source-id: ce5bf2b3b4c80e9b62819f4332ce68af82720055
Summary:
This PR cleans up the `at::Tensor` class by removing all methods that start with an underscore in favor of functions in the `at::` namespace. This greatly cleans up the `Tensor` class and makes it clearer what is the public and non-public API.
For this I changed `native_functions.yaml` and `Declarations.cwrap` to make all underscore methods `variant: function` (or add such a statement to begin with), and then fixed all code locations using the underscore methods.
ezyang colesbury gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11152
Differential Revision: D9683607
Pulled By: goldsborough
fbshipit-source-id: 97f869f788fa56639c05a439e2a33be49f10f543
Summary:
Add the gpu kernel version.
The parallelism I went with performs poorly when there are a large number of vectors that are all short, as I don't allocate the thread pool to wrap in that case.
Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```
Current performance specs are a little underwhelming; I'm in the process of debugging.
size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11102
Differential Revision: D9697305
Pulled By: erikbrinkman
fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
Summary:
This is so that TensorImpl does not have to depend on Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11337
Differential Revision: D9684421
Pulled By: gchanan
fbshipit-source-id: d2af93420ca6d493429c251cfe5a34e9289c4484
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11273
This one might strike you as a bit surprising, but it's necessary
to expose this interface in ATen/core, because we need to be
able to get a true Variable type from Variable tensors, and
to do that we need to go through the hooks interface.
Reviewed By: gchanan
Differential Revision: D9656548
fbshipit-source-id: 28bb5aee6ac304e8cd5fa1e4c65452c336647161
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11270
Still need to deduplicate this with caffe2/core/registry.h,
but this will be a bit tricky because the current formulation
of the macro is namespace sensitive (i.e., the macro for classes
defined in at:: namespace won't work if you call from caffe2::
namespace).
Reviewed By: gchanan
Differential Revision: D9654871
fbshipit-source-id: 2207d1f2cc6d50bd41bf64ce0eb0b8523b05d9d9
Summary:
After submitting PR #9726, PR #10581 created a different CUDAEvent class. The CUDAEvent proposed in #9726 was similar to the c10d::CUDAEvent class with additional testing and functionality. In particular, it was movable but not copyable. The CUDAEvent created by #10581 is refcounted and copyable. This PR retains the refcounting of the latter PR while fixing several bugs, adding tests, and extending the functionality to support testing and usage like in PR #8354. In particular, this PR:
- Adds set_device() to CUDAContext
- Adds three CUDAEvent tests to stream_test.cpp
- Fixes three bugs:
- Refcounting was broken. Destroying any of the RAIIs holding a particular CUDAEvent would destroy the event UNLESS it was the last RAII (the check was backwards).
- Moving an event would cause a segfault.
- Events were not destroyed on the device they were created on. See PR #9415 (pietern)
- Adds the happened() and recordOnce() functions
- Changes the record() functions to not be const
- Adds additional assertions to verify correctness
This PR does not:
- Make c10d use the ATen CUDAEvent (this is appropriate for a separate PR)
Whether events should be refcounted is an interesting question. It adds some atomic operations and makes event creation eager. Making events movable but not copyable (like the c10d events) avoids these costs and allows events to be lazily constructed. Lazy construction is preferable when working with containers (like std::array or std::vector) and because the event's device can be set automatically to the first stream it's recorded on. With eager construction the user is required to understand that events have a device and acquire the device of the stream the event will be recorded on upfront. This can be seen here:
542aadd9a7/aten/src/ATen/native/cudnn/RNN.cpp (L1130-L1132)
and that file is the only one which currently uses the ATen CUDAEvent.
Refcounting does allow single writer multi-reader scenarios, although these scenarios can be also be supported by providing indirect access to the underlying CUDAEvent. I believe all current and planned usage scenarios do not require refcounting, and if desired I can update this PR to remove refcounting and make the ATen event movable but not copyable like the c10d event. I think not refcounting is preferable because it can improve performance, ease usability, and simplify the code (as seen with two of the above bugs).
I have decided to separate this from PR #8354 since while it's required for PR #8354 the changes are, clearly, of independent interest. PR #8354 has a new dependency on this one, however. I am closing PR #9726 in favor of this PR.
apaszke ezyang pietern
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11293
Differential Revision: D9665836
Pulled By: soumith
fbshipit-source-id: a1513fa4f9761e2f304d126e402f6b6950e1c1d2
Summary:
This adds an optional `expand=True` kwarg to the `distribution.enumerate_support()` method, to get a distribution's support without expanding the values over the distribution's `batch_shape`.
- The default `expand=True` preserves the current behavior, whereas `expand=False` collapses the batch dimensions.
e.g.
```python
In [47]: d = dist.OneHotCategorical(torch.ones(3, 5) * 0.5)
In [48]: d.batch_shape
Out[48]: torch.Size([3])
In [49]: d.enumerate_support()
Out[49]:
tensor([[[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.]],
[[0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0.]],
[[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.]],
[[0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0.]],
[[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.]]])
In [50]: d.enumerate_support().shape
Out[50]: torch.Size([5, 3, 5])
In [51]: d.enumerate_support(expand=False)
Out[51]:
tensor([[[1., 0., 0., 0., 0.]],
[[0., 1., 0., 0., 0.]],
[[0., 0., 1., 0., 0.]],
[[0., 0., 0., 1., 0.]],
[[0., 0., 0., 0., 1.]]])
In [52]: d.enumerate_support(expand=False).shape
Out[52]: torch.Size([5, 1, 5])
```
**Motivation:**
- Currently `enumerate_support` builds up tensors of size `support + batch_shape + event_shape`, but the values are *repeated* over the `batch_shape` (adding little in the way of information). This can lead to expensive matrix operations over large tensors when `batch_shape` is large (see the example above), often leading to OOM issues. We use `expand=False` in Pyro for message passing inference, e.g. when enumerating over the state space in a Hidden Markov Model. This creates sparse tensors that capture the markov dependence, and allows for the possibility of using optimized matrix operations over these sparse tensors. `expand=True`, on the other hand, will create tensors that scale exponentially in size with the length of the Markov chain.
- We have been using this in our [patch](https://github.com/uber/pyro/blob/dev/pyro/distributions/torch.py) of `torch.distributions` in Pyro. The interface has been stable, and it is already being used in a few Pyro algorithms. We think that this is more broadly applicable and will be of interest to the larger distributions community.
cc. apaszke, fritzo, alicanb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11231
Differential Revision: D9696290
Pulled By: soumith
fbshipit-source-id: c556f8ff374092e8366897ebe3f3b349538d9318
Summary:
This actually ended up being a lot more involved than I thought. The basic
problem is that in some of our build environments, thread local state is not
supported. The correct way to test if this is the case is using the
(undocumented) CAFFE2_FB_LIMITED_MOBILE_CAPABILITY macro.
On mobile, OptionGuard is not available, and you have to do everything
by hand. There's a static_assert that checks if you accidentally use
OptionGuard in this case and gives you a better error message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11244
Reviewed By: gchanan
Differential Revision: D9646190
fbshipit-source-id: cf4016f79b47705a96ee9b6142eb34c95abb2bd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11323
If you do pass it this, you'll get a pointer to
UndefinedTensor; probably not what you want!
Reviewed By: Yangqing
Differential Revision: D9676205
fbshipit-source-id: 0bd3c22c2c40ac2958f95fc7a73b908af291cf22
Summary:
We need to remove nomnigraph from the list of public libraries in order to support libtorch extensions. Easiest way to do this is to include it into the Caffe2 source like all other caffe2/core/ code.
However, because the headers are in a different place, we need to include them for linked libraries (pybind, tests, etc).
On an upside, this means that nomnigraph is now default hidden visibility too.
FYI peterjc123 xkszltl goldsborough bwasti Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11303
Reviewed By: pjh5
Differential Revision: D9694932
Pulled By: orionr
fbshipit-source-id: 5db3eb20bc5ddc873ce9151236b74663fbb33ed8
Summary:
* purge hcSPARSE now that rocSPARSE is available
* integrate a custom hcc and HIP
* hcc brings two important compiler fixes (fixes hundreds of unit tests)
* HIP brings a smart dispatcher that allows us to avoid a lot of static_casts (we haven't yet removed the automatic static_casts but this catches some occurrences the script did not catch)
* mark 5 unit tests as skipped that have regressed w/ the new hcc (we don't know yet what is at fault)
* optimize bitonic sort - the comparator is always an empty struct - therefore passing it by value saves at least 3 bytes. It also removes an ambiguity around passing references to `__global__` functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11198
Differential Revision: D9652340
Pulled By: ezyang
fbshipit-source-id: f5af1d891189da820e3d13b7bed91a7a43154690
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10888
Add a CUDA version of SpatialBNOp and also optimize SpatialBN on CPU
Reviewed By: houseroad
Differential Revision: D9512435
fbshipit-source-id: 6f828c88d56d30dc9a2f98a297a161c35cc511b1
Summary:
Fixed a few bugs that were not tested in the c10d frontend APIs, including
get_rank, get_world_size, and destroy_process_group of a given group.
These APIs are added to the CI tests.
Also added all the group related tests, including full-group, and partial groups (existing ones), since both will hit different code paths.
Also removed experimental APIs for c10d initially used in DDP, since we don't use them anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11318
Reviewed By: pietern
Differential Revision: D9675896
Pulled By: teng-li
fbshipit-source-id: a2eac2c57933effa2d139855f786e64919a95bfc
Summary:
On the way to #10774
This PR adds advanced indexing with tensors.
The approach is to desugar advanced indexing into an at::index op.
This is exactly how normal pytorch does it.
[(I used this code as reference)](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_variable_indexing.cpp)
Supporting sequences is a little tricky because JIT script doesn't have
an easy way to turn arbitrary n-dimensional python lists into a tensor
(it would be easy if we supported `torch.tensor`), so that'll come
in a future PR.
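A minimal sketch of the kind of script function this enables (exact coverage at the time of the PR may differ):
```python
import torch

@torch.jit.script
def adv_index(x, idx):
    # Tensor indexing is desugared into an aten::index call in the graph.
    return x[idx]

x = torch.arange(6).reshape(2, 3)
print(adv_index(x, torch.tensor([1, 0])))  # rows swapped
```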
cc jamesr66a zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10862
Differential Revision: D9659449
Pulled By: zou3519
fbshipit-source-id: 56d293720d44c0fd27909e18327ab3985ddfced6
Summary:
In #9466 I got rid of storage views and eliminated all places where
they were used... OR SO I THOUGHT. In actuality, under certain
conditions (specifically, if you trained a CUDA multiprocessing model
shared over CUDA IPC and then serialized your parameters), you could
also serialize storage slices to the saved model format. In #9466,
I "fixed" the case when you loaded the legacy model format (really,
just unshared the storages--not strictly kosher but if you aren't
updating the parameters, shouldn't matter), but NOT the modern model format, so
such models would fail.
So, I could have applied the legacy model format fix too, but
hyperfraise remarked that he had applied a fix that was effectively
the same as unsharing the storages, but it had caused his model to
behave differently. So I looked into it again, and realized that
using a custom deleter, I could simulate the same behavior as old
storage slices. So back they come.
In principle, I could also reimplement storage views entirely using
our allocators, but I'm not going to do that unless someone really
really wants it.
Fixes #10120.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11314
Reviewed By: ailzhang
Differential Revision: D9671966
Pulled By: ezyang
fbshipit-source-id: fd863783d03b6a6421d6b9ae21ce2f0e44a0dcce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11190
As discussed with Alexander Sidorov, params_bytes refers to the number of bytes we're reading for parameters, not the size of parameters. They only differ for sparse operators.
Reviewed By: mdschatz
Differential Revision: D9628635
fbshipit-source-id: 9e2aed0cf59388928dc69b8534cf254f0347c9c8
Summary:
This is an experimental build on top of what orionr and mingzhe09088 built.
Essentially, the idea is that we will need separate *_API versions for different shared libraries. If this theory is right, I'll try to clean up the design a bit and document it properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11266
Reviewed By: orionr
Differential Revision: D9682942
Pulled By: Yangqing
fbshipit-source-id: c79653199e67a1500c9174f39f8b0357324763f3
Summary:
We shouldn't use system Eigen in any cases when building with setup.py. If people want to use system Eigen (not from third_party) they can build with CMake for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11334
Reviewed By: pjh5
Differential Revision: D9689450
Pulled By: orionr
fbshipit-source-id: baf616b9f195692942151ad201611dcfe7d927ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11098
Added a test for testing CPU version across multiple devices.
Reviewed By: enosair, BIT-silence
Differential Revision: D9584520
fbshipit-source-id: 0d8c85e6d402bc7b34d5f8f16ef655ff9b61b49e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11338
The `min_` and `max_` values of the filler are in `double` format, but when we are filling a specific type of tensor, their values can exceed the type limits, resulting in a crash. This diff checks the type limits first, and if `min_`/`max_` is out of the limits, it clips it.
Reviewed By: highker
Differential Revision: D9684455
fbshipit-source-id: 6da98a03c57f3296abaddc7c5cfc1c836c611eb0
Summary:
This will allow users to set customized timeout option for the store.
Tested by my own debug print to make sure that C++ actually used the timeout
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11265
Differential Revision: D9666164
Pulled By: teng-li
fbshipit-source-id: 4eb6441783da106a3fd59b95457e503e83e4640f
Summary:
This lets you compile builtin functions from C++ without having a dependence on Python
```cpp
auto module = torch::jit::compile(R"JIT(
def my_script_method(x, y):
    return torch.relu(x) + y
)JIT");
IValue result = module->run_method("my_script_method", 1, 2);
```
goldsborough zdevito apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10847
Differential Revision: D9543461
Pulled By: driazati
fbshipit-source-id: 6160dae094030ca144a0df93cb9f26aa78c8cf27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11315
Rename unit test files to make them consistent with the fb cpp style guideline: "The unittest for MyFoo.cpp should be named MyFooTest.cpp."
Reviewed By: yinghai
Differential Revision: D9671519
fbshipit-source-id: 44ed6794f6e479d190916db8064eee692e3ad876
Summary:
1. Add documentation to Linear and improve documentation for RNNs
2. Fix preprocessing in C++ docs by adding correct include path
3. Make myself and ebetica codeowner of docs/cpp to improve development speed
ebetica ezyang soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11313
Differential Revision: D9683615
Pulled By: goldsborough
fbshipit-source-id: 84ea32f9ea6b4060744aabbf5db368776a30f0b5
Summary:
Turns out that a net.type explicitly set to '' is not acceptable to CreateNet,
but an unset net.type is acceptable.
Fix that in this diff. Also, this is related to T33613083.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11286
Reviewed By: Maratyszcza, wat3rBro
Differential Revision: D9659920
Pulled By: harouwu
fbshipit-source-id: d68f24b754e18e1121f029656d885c48ab101946
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11291
In S163230, we've found that the CuDNN 7 upgrade causes an accuracy drop when training convolution networks such as ResNeXt-101 (~0% accuracy) and video R(2+1)D (65 --> 63%).
Our current theory for this accuracy loss is the new "CUDNN_BATCHNORM_SPATIAL_PERSISTENT" mode in the spatialBN operator. In Caffe2, we've made this mode the default. According to the CuDNN manual (https://fburl.com/z996mr13), this mode may introduce some limitation on the input data range and cause overflow (which outputs NaN). NaN is probably not the case, because we're seeing a few percent of accuracy drop but not gradient explosion or failure. However, this "performance-optimized" code path may introduce accuracy loss (which is not caught by our unit test case because the input data range is [-0.5, 0.5]).
Reviewed By: kuttas, stephenyan1231
Differential Revision: D9601217
fbshipit-source-id: 73c2690c19cb1f02ea4e5e2200f50128df4f377b
Summary:
this is a fix that's needed for building extensions with a
pre-packaged pytorch. Consider the scenario where
(1) pytorch is compiled and packaged on machine A
(2) the package is downloaded and installed on machine B
(3) an extension is compiled on machine B, using the downloaded package
Before this patch, stage (1) would embed absolute paths to the system
installation of mkl into the generated Caffe2Config.cmake, leading to
failures in stage (3) if mkl was not at the same location on B as on
A. After this patch, only a reference to the wrapper library is
embedded, which is re-resolved on machine B.
We are already using a similar approach for cuda.
Testing: built a package on jenkins, downloaded locally and compiled an extension. Works with this patch, fails without.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11298
Differential Revision: D9683150
Pulled By: anderspapitto
fbshipit-source-id: 06a80c3cd2966860ce04f76143b358de15f94aa4
Summary:
Now that we're building everything together, making all distributed flags conditional on USE_DISTRIBUTED being set.
cc pietern cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11221
Reviewed By: Yangqing
Differential Revision: D9664267
Pulled By: orionr
fbshipit-source-id: a296cda5746ad150028c97160f8beacba955ff73
Summary:
Fixes #8560.
Unblocks #10715.
The assert (nDim <= uncompressedDims) was being triggered for a scalar
tensor because we compute nDim to be 1 for a scalar tensor but
uncompressedDim = 0.
This PR changes it so that we compute nDim to be 0 for a scalar tensor. This
works because indexing in a kernel depends on nDim. If nDim = 0, then
offset is always 0, which is what we want.
Some other (small) changes were necessary to make this work:
- One cannot define a 0-length array `IndexType arr[0]` so the code
guards against that
- Needed to change some of the maxTensorInfoSize logic to handle the
case when uncompressedDim == 0.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10952
Differential Revision: D9544607
Pulled By: zou3519
fbshipit-source-id: 2b873f47e2377125e1f94eb1b310a95cda51476c
Summary:
Distributed Data Parallel CPU module for c10d. This is basically the same code as Distributed Data Parallel CPU module for THD, since c10d now has the exact same front-end interface as torch.distributed.
We will keep both in the first release and remove the THD one once c10d is stable enough.
Tests are fully covered, just as with THD.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11168
Differential Revision: D9674963
Pulled By: teng-li
fbshipit-source-id: ecf52a7189374ca7930c2be305218167fdd822a7
Summary:
Linting `torch/csrc/` (non-recursive) and `torch/csrc/autograd` (non-recursive).
Fixed things like:
- `typedef` vs `using`
- Use `.empty()` instead of comparing with empty string/using `.size() == 0`
- Use range for loops instead of old style loops (`modernize-`)
- Remove some `virtual` + `override`
- Replace `stdint.h` with `cstdint`
- Replace `return Type(x, y)` with `return {x, y}`
- Use boolean values (`true`/`false`) instead of numbers (1/0)
- More ...
ezyang apaszke cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11050
Differential Revision: D9597505
Pulled By: goldsborough
fbshipit-source-id: cb0fb4793ade885a8dbf4b10484487b84c64c7f2
Summary: Closing the gap a bit on API, allowing users to go NetDef -> nomnigraph -> NetDef in python now
Reviewed By: duc0
Differential Revision: D9670495
fbshipit-source-id: 6497518ffc05a186deb0d657e06317980d39ddd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11256
- in deleteNode method, remove optional deleteEdge flag as it's not used
- in deleteEdge method, remove optional removeRef flag as it's not used
- in replaceNode method, remove optional newHead_ parameter as it's not used - also simplifying the implementation by just calling replaceInEdges and replaceOutEdges
- remove importNode & importEdge as they're not in use
- add getEdgeIfExists that is like getEdge() but returns nullptr instead of throwing when the edge does not exist
- reduce verbosity in the basic graph unit test and add more test cases for ReplaceEdges
Differential Revision: D9650913
fbshipit-source-id: 6c18b37bef0d2abe1b57fb4fc47bfdbcee387694
Summary:
I'm setting up an automatic sync job for cppdocs and need two fixes to the cpp docs config:
1. Right now the cppdocs use the `torch` package to figure out the version. For C++ docs all I really need from the built package are the generated Tensor.h and Functions.h files. I can actually generate those directly via `aten/src/ATen/gen.py`, so I can skip building PyTorch altogether and save 10 minutes in the sync job! For this I need to avoid using the torch package in the docs.
2. Internal proxy issues prevent using the git link for sphinx_rtd_theme. We can just use the pip package for the cppdocs (not for the normal PyTorch docs)
soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11300
Differential Revision: D9667193
Pulled By: goldsborough
fbshipit-source-id: 5567e0b3d3bdce03f5856babdb4ff76bcee91846
Summary:
This PR adds all PyTorch and Caffe2 job configs to CircleCI.
Steps for the CircleCI mini-trial:
- [ ] Make sure this PR passes Jenkins CI and fbcode internal tests
- [x] Approve this PR
- [ ] Ask CircleCI to turn up the number of build machines
- [ ] Land this PR so that the new `.circleci/config.yml` will take effect
Several Caffe2 tests are flaky on CircleCI machines and hence skipped when running on CircleCI. A proper fix for them will be worked on after a successful mini-trial.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11264
Differential Revision: D9656793
Pulled By: yf225
fbshipit-source-id: 7832e90018f3dff7651489c04a179d6742168fe1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11254
Previously we used DeviceType in caffe2.proto directly, but it's an `enum` and has implicit conversion to int, which does not have type safety, e.g. we have to explicitly check that a device type is valid in event.h:
```
template <int d>
struct EventCreateFunctionRegisterer {
  explicit EventCreateFunctionRegisterer(EventCreateFunction f) {
    static_assert(d < MaxDeviceTypes, "");
    Event::event_creator_[d] = f;
  }
};
```
at::DeviceType is an `enum class`; it does not have implicit conversion to int and provides better type safety guarantees. In this diff we have done the following refactor (taking CPU as an example):
1. caffe2::DeviceType → caffe2::DeviceTypeProto
2. caffe2::CPU → caffe2::PROTO_CPU
3. caffe2::DeviceType = at::DeviceType
4. caffe2::CPU = at::DeviceType::CPU
codemod -d caffe2/caffe2 --extensions h,cc,cpp 'device_type\(\), ' 'device_type(), PROTO_'
+ some manual changes
In short, after this diff, in c++, caffe2::CPU refers to the at::DeviceType::CPU and the old proto caffe2::CPU will be caffe2::PROTO_CPU.
On the Python side, we have a temporary workaround that aliases `caffe2_pb2.CPU = caffe2_pb2.PROTO_CPU` to make the change easier to review; this will be removed later.
Reviewed By: ezyang
Differential Revision: D9545704
fbshipit-source-id: 461a28a4ca74e616d3ee183a607078a717fd38a7
Summary:
Persistent rnns provide much better performance on V100 with half input data for a variety of cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11248
Differential Revision: D9665687
Pulled By: ezyang
fbshipit-source-id: 2bd09a7eb1f5190aadb580977b0ba956e21a7dd5
Summary:
- In Python 2, use of `/` (regardless of int/float/Tensor) causes a compiler error if
`from __future__ import division` is not imported in the file.
- The / operator is universally set to do "true" division for integers
- Added a `prim::FloorDiv` operator because it is used in loop unrolling.
If users use '/' in Python 2 without importing from `__future__`, the error
occurs when building the JIT AST.
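For illustration, a minimal sketch of the semantics described above (hypothetical snippet; in Python 3 the `__future__` import is a no-op):
```python
from __future__ import division  # required in Python 2 before '/' can be compiled

import torch

@torch.jit.script
def halve(x):
    # '/' performs true division, matching Python 3 semantics
    return x / 2

print(halve(torch.tensor([3.0, 4.0])))  # tensor([1.5000, 2.0000])
```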
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11016
Differential Revision: D9613527
Pulled By: zou3519
fbshipit-source-id: 0cebf44d5b8c92e203167733692ad33c4ec9dac6
Summary:
The existing tests had every rank run send to every other rank and only
then switch to recv mode. This only works if the send operations are
non-blocking and the passed tensors are immediately copied to some kind
of send buffer. Instead, every send must be matched with a recv on the
other side, because from the API perspective they may block.
E.g. imagine a 1GB tensor being sent to every other rank. It can only go
through if there is a recv on the other side, or it will deadlock.
This change reflects this in the send/recv unit tests.
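As a rough sketch of the matched pattern (assuming `dist.init_process_group` has already been called on every rank; names are illustrative):
```python
import torch
import torch.distributed as dist

def exchange(rank, world_size):
    # Pair every send with a recv on the peer; order by rank so that two
    # peers never both block in send at the same time.
    for other in range(world_size):
        if other == rank:
            continue
        payload = torch.full((4,), float(rank))
        buf = torch.zeros(4)
        if rank < other:
            dist.send(payload, dst=other)
            dist.recv(buf, src=other)
        else:
            dist.recv(buf, src=other)
            dist.send(payload, dst=other)
```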
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11275
Differential Revision: D9658197
Pulled By: pietern
fbshipit-source-id: fb6a3fc03b42343a9dfeed0def30d94914e76974
Summary:
Found these when compiling the new master with gcc 7.3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11257
Differential Revision: D9656612
Pulled By: SsnL
fbshipit-source-id: 7acb19e13204c010238dab7bc6973cc97b96f9a4
Summary:
This PR adds a .travis.yml check for our C++ documentation. The goal is to avoid any documentation/comments in our C++ code that would break the doxygen output and possibly ruin the C++ documentation site (currently https://pytorch.org/cppdocs).
For this, we:
1. Run doxygen and record any warnings,
2. Filter out some known bogus warnings,
3. Count the remaining warnings,
4. Fail the check if (3) is non-zero.
soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11124
Differential Revision: D9651011
Pulled By: goldsborough
fbshipit-source-id: 30f776d23bb6d6c482c54db32828b4b99547e87b
Summary:
Allows multiplication of e.g. numpy.float32 with tensors.
This came up with #9468
If you want this, I'll add tests after the other patch is done (doing so now would conflict, so I prefer to wait).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9659
Differential Revision: D8948078
Pulled By: weiyangfb
fbshipit-source-id: c7dcc57b63e2f100df837f70e1299395692f1a1b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10874
Fixes the log message "WARNING:data_workers:Warning, data loading lagging behind: name=0" where the queue size is reported instead of the source name
Reviewed By: panshen1, Novitial
Differential Revision: D9506606
fbshipit-source-id: 03717cfa9b991afb335ef877378afa3b52fd8f22
Summary:
`__repr__` currently fails for distributions with lazy attributes in PyTorch master, throwing a `KeyError`. This fixes the issue.
**Additionally:**
- Added `logits` to `arg_constraints` for distributions that accept either `probs` or `logits`. This is both to have `__repr__` display the `logits` param when available, and to be able to do validation checks (e.g. NaN checks) when the logit parametrization is used. fritzo, alicanb - I think there were reasons why we had not done so in the first place, but I am unable to recall now. It passes all the tests, but let me know if there is something that I am missing at the moment.
- There are certain distributions, e.g. `OneHotCategorical`, which won't show any parameters because they use a `categorical` instance under the hood and neither `logits` nor `probs` from `arg_constraints` is present in the instance's `__dict__`. This isn't addressed in this PR.
cc. vishwakftw, fritzo, nadavbh12, apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11263
Differential Revision: D9654959
Pulled By: apaszke
fbshipit-source-id: 16f5b20243fe8e2c13e9c528050d4df0b8ea6e45
Summary:
This PR adds a hooks interface for registering types for complex
scalar types, and a sample implementation of the hook in
test_cpp_extensions.
The hook registration is patterned off of the existing CUDA hooks.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11216
Differential Revision: D9654840
Pulled By: ezyang
fbshipit-source-id: 7b97646280d584f8ed6e14ee10a4abcd04cf2987
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11127
It's invalid to capture `predicate` by reference as it's a local variable. Capture it by value instead.
Differential Revision: D9600115
fbshipit-source-id: 92e0130d0a74908380b75ade5c3492df49e25941
Summary:
Also, make `torch.isclose` work with integral tensors and refactor `_check_trace` a bit.
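For example (a small sketch on a current build; equal integers compare as close under the default tolerances):
```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([1, 2, 4])
# Element-wise closeness now also works for integral tensors.
print(torch.isclose(a, b))  # tensor([ True,  True, False])
```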
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11246
Differential Revision: D9652701
Pulled By: apaszke
fbshipit-source-id: fb0bdbfd1952e45e153541e4d471b423a5659f25
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11123
This adds an operator that fills a tensor with uniform(min, max) values.
The implementation uses the fp32 generator and converts the result to fp16.
If performance becomes an issue, we could resort to intrinsics.
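A minimal numpy sketch of the approach (illustrative only, not the Caffe2 operator itself):
```python
import numpy as np

def uniform_fill_fp16(shape, vmin, vmax, rng=np.random):
    # Draw with the fp32 generator, then narrow the result to fp16.
    samples = rng.uniform(vmin, vmax, size=shape).astype(np.float32)
    return samples.astype(np.float16)

print(uniform_fill_fp16((2, 3), -1.0, 1.0))
```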
Reviewed By: jspark1105, chocjy
Differential Revision: D9598142
fbshipit-source-id: 5aeab99acf7c3596fa6c33611d9d2c484f7c1145
Summary:
Keep net type info when generating the model's complete net. This preserves the performance optimization option.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11032
Reviewed By: wat3rBro
Differential Revision: D9564125
Pulled By: harouwu
fbshipit-source-id: c6546af9b1d4ff5eddf6124e24a5da1b8baf47df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11215
I found these by deleting the implicit conversion of Type to
TensorOptions and then fixing sites. This isn't a complete
refactor, because I ran out of steam after fixing this many
and decided to keep the implicit conversion. Still, why
waste a perfectly good refactor?
Reviewed By: gchanan, cpuhrsch
Differential Revision: D9634750
fbshipit-source-id: 4d8fb778e13e6e24b888b1314a02709b2cb00b62
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11205
Our short term plan for supporting out of tree complex development requires an
external library to add a custom subclass of Type without access to the
code generation facilities in ATen. This commit reorganizes Type so
as to minimize the amount of boilerplate you have to write when making
a subclass of Type.
In particular, it:
- Creates new CPUTypeDefault/CUDATypeDefault classes, which you are
  intended to inherit from; they provide default CPU/CUDA implementations
  that are layout/dtype agnostic.
- Adds new getCPUAllocator() and getCUDAAllocator() functions, as
a more public API to get your hands on Allocator
- Adds allocator() and getDeviceFromPtr(), abstracting the device
specific parts of storage() methods; these methods are now
implemented in base TypeDefault.
- Delete the static typeString() method, which is now dead.
- Move is_cuda/is_sparse/is_distributed to TypeDefault.
Reviewed By: SsnL
Differential Revision: D9631619
fbshipit-source-id: 40b600d99691230e36e03eb56434c351cbc2aa3a
Summary:
Just pulling this out of https://github.com/pytorch/pytorch/pull/10611
Make sure we can support environments where libprotobuf isn't installed when we link protobuf locally.
cc goldsborough Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11161
Differential Revision: D9650282
Pulled By: orionr
fbshipit-source-id: 447b5e54cd2639973b4b10f58590d1c693a988d4
Summary:
Will use USE_DISTRIBUTED for both c10d and THD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11237
Differential Revision: D9647825
Pulled By: teng-li
fbshipit-source-id: 06e0ec9b5e2f8f38780fc88718f8499463e9e969
Summary:
This was lingering after #10731.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11240
Differential Revision: D9645437
Pulled By: pietern
fbshipit-source-id: d02c33354b094be3bb0872cf54a45721e20c4e7d
Summary:
This PR resolved the following compilation errors on devgpu:
/home/mingzhe0908/pytorch/build/lib/libcaffe2_gpud.so: undefined reference to `caffe2::CAFFE2_PLEASE_ADD_OPERATOR_SCHEMA_FOR_Tan()'
/home/mingzhe0908/pytorch/build/lib/libcaffe2_gpud.so: undefined reference to `caffe2::CAFFE2_PLEASE_ADD_OPERATOR_SCHEMA_FOR_MaxPool3D()'
....
The same error has been happening with caffe2 build with debug mode before build_caffe2 was removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11233
Reviewed By: orionr
Differential Revision: D9645527
Pulled By: mingzhe09088
fbshipit-source-id: 68a45aa7fd815cac41b7fd64cfd9838b3226345a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11060
Adding synthetic data generation to the filler.h file (the exact distribution to be replaced later on).
Reviewed By: highker
Differential Revision: D9417594
fbshipit-source-id: 5d66dfbcb254a5961c36b7d3a081332c7372dac7
Summary:
there are multiple views of the tensor live.
Also adds recording for copy_ because this is the critical in place
op where these views will cause LHS indexing to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11129
Differential Revision: D9600195
Pulled By: zdevito
fbshipit-source-id: bfd8f5befa47377e36d704dbdb11023c608fe9a3
Summary:
TSIA. apaszke pointed out that it might be better to use the third party folder by default, since system Eigen may often be out of date and may not have the version we need to compile successfully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11020
Differential Revision: D9562548
Pulled By: Yangqing
fbshipit-source-id: d8ab8a6ebe1f3d9eec638ef726cf5dc4dcf777b5
Summary:
Adding short circuit evaluation to AND and OR. The second expression of an AND or OR gets lifted into an if branch, which is conditionally evaluated.
BatchOps was using the expression `dims = dims1 or dims2`, where dims1 is often an empty tensor. This now throws an error, because dims1 gets cast to a boolean, and you can't convert an empty tensor to a scalar. It now matches the behavior of PyTorch in Python.
One thing that came up: if the second expression of an and/or in Python gets returned, it does not get coerced to a boolean.
`tensor == (False or tensor)`
`tensor == (True and tensor)`
We do not currently support this.
edit: wording
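A small sketch of the failure mode being matched (plain Python, illustrative values):
```python
import torch

dims1 = torch.tensor([])      # empty tensor
dims2 = torch.tensor([0, 1])

# `or` first converts dims1 to a bool, which fails for an empty tensor;
# scripted code now raises the same error instead of silently diverging.
try:
    dims = dims1 or dims2
except RuntimeError as err:
    print("cannot convert:", err)
```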
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11116
Differential Revision: D9618168
Pulled By: eellison
fbshipit-source-id: 93b202be2f222d41f85d38d9c95f04d1749e8343
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11189
Replaces it with an operator TensorOptions() method on
Type, reestablishing the implicit conversion. I originally
wanted to get rid of the implicit conversion entirely, but
there were a *lot* of use-sites, so I added it back to avoid
a huge codemod. In this patch, I only had to fix sites that
used the optional device_index API.
Reviewed By: cpuhrsch
Differential Revision: D9628281
fbshipit-source-id: 5fe2a68eefb77a3c9bb446f03a94ad723ef90210
Summary:
Example:
```sh
python run_test.py -i sparse -- TestSparse.test_factory_size_check -f
```
With this, the `--verbose` option is redundant (one can call `python run_test.py -- -v` instead of `python run_test.py -v`). But since this is (probably) a frequently used flag, I didn't remove the existing easier-to-use option.
cc ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11209
Differential Revision: D9632215
Pulled By: SsnL
fbshipit-source-id: ff522802da11ef0a0714578be46e4a44f6343d44
Summary:
We don't generate a corresponding Type implementations for them,
so this doesn't do anything at the moment.
We don't plan on supporting complex32 in the near future, but
it is added to reserve the name and number in case we do at
some point in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11173
Reviewed By: SsnL
Differential Revision: D9627477
Pulled By: ezyang
fbshipit-source-id: f49a44ab1c92d8a33130c249ac7b234f210a65e6
Summary:
In the state dict loading code, the error message referring to the shapes of the loaded parameters and of the parameters in the initialised model had its format arguments in the wrong order. Swapped them round to fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11200
Differential Revision: D9631160
Pulled By: SsnL
fbshipit-source-id: 03d9446303bd417fef67027b10d7a27de06486be
Summary:
Disables two of the unit tests in test_cuda that got introduced after test_cuda was enabled that fail on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11191
Differential Revision: D9628702
Pulled By: ezyang
fbshipit-source-id: 4c298c728f42bb43d39b57967aa3e44385980265
Summary:
This is necessary to allow us to use the complex header
which defines real (and is very sad if real is macro'ed).
We should also fix accreal, ureal, Real and REAL, but
only 'real' is the real blocker.
```
codemod -d aten/src/TH --extensions c,cc,cpp,cu,cuh,h,TARGETS,py,hpp '\breal\b' scalar_t
codemod -d aten/src/THC --extensions c,cc,cpp,cu,cuh,h,TARGETS,py,hpp '\breal\b' scalar_t
codemod -d aten/src/THNN --extensions c,cc,cpp,cu,cuh,h,TARGETS,py,hpp '\breal\b' scalar_t
codemod -d aten/src/THCUNN --extensions c,cc,cpp,cu,cuh,h,TARGETS,py,hpp '\breal\b' scalar_t
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11163
Reviewed By: SsnL
Differential Revision: D9619906
Pulled By: ezyang
fbshipit-source-id: 922cb3a763c0bffecbd81200c1cefc6b8ea70942
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11013
Previously, the parent class Type also contained a large number
of implementations, for things like broadcasting and native
functions that didn't need dispatch. We'd like to be able
to reference this interface from Tensor even when none of
these implementations are available.
To do this, we convert Type into a truly pure virtual interface,
and move all of the implementations to TypeDefault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11181
Differential Revision: D9561478
Pulled By: ezyang
fbshipit-source-id: 13c49d80bc547551adf524b1cf1d691bfe311133
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893
Differential Revision: D9615053
Pulled By: ezyang
fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
Summary:
This places all constants in the entry block of the graph, and de-duplicates them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10231
Differential Revision: D9601501
Pulled By: resistor
fbshipit-source-id: daa10ed8c99e9894830d6f3e5d65c8d3ab5ea899
Summary:
Previously when we had a slicing expression like `x[0:5, 0]`, where the sliced tensor was of size `5` in dimension 0, we would skip dispatching the actual slice call as an optimization.
This caused incorrect behavior under tracing, as we would not record the slice op and thus if we encountered an input with a different shape while running the trace, we would get incorrect results.
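A hedged sketch of the symptom (using the current `torch.jit.trace(fn, example_inputs)` form; shapes are illustrative):
```python
import torch

def take_first_five(x):
    return x[0:5, 0]

# The example input has exactly 5 rows, so the old optimization elided the
# slice; the resulting trace then returned all rows for larger inputs.
traced = torch.jit.trace(take_first_five, torch.rand(5, 3))
print(traced(torch.rand(8, 3)).shape)  # torch.Size([5]) once the slice is recorded
```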
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11156
Differential Revision: D9622252
Pulled By: jamesr66a
fbshipit-source-id: 822f2e8f01504e131f53bd9ef51c171c7913a7cc
Summary:
This makes it so `detach` and `detach_` are traceable and also adds a pass to erase them before ONNX export
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11038
Differential Revision: D9588038
Pulled By: jamesr66a
fbshipit-source-id: 263dd3147e24fcb0c716743f37fdb9f84c0015e7
Summary:
Will need to be accessible by caffe2
This also removes a bunch of unnecessary includes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11154
Reviewed By: ezyang
Differential Revision: D9618681
Pulled By: cpuhrsch
fbshipit-source-id: 838a87b75d9c3959e145fd5fca13b63bc5de7bd3
Summary:
```
In file included from third-party-buck/gcc-5-glibc-2.23/build/pybind11/889256a/include/pybind11/cast.h:13:0,
from third-party-buck/gcc-5-glibc-2.23/build/pybind11/889256a/include/pybind11/attr.h:13,
from third-party-buck/gcc-5-glibc-2.23/build/pybind11/889256a/include/pybind11/pybind11.h:43,
from caffe2/torch/csrc/utils/pybind.h:6,
from caffe2/torch/csrc/jit/pybind.h:5,
from caffe2/torch/csrc/jit/script/init.h:3,
from caffe2/torch/csrc/jit/script/init.cpp:1:
third-party-buck/gcc-5-glibc-2.23/build/pybind11/889256a/include/pybind11/pytypes.h:118:19: note: declared here
In file included from caffe2/torch/csrc/jit/pybind.h:12:0,
from caffe2/torch/csrc/jit/python_ir.cpp:4:
caffe2/torch/csrc/jit/pybind_utils.h: In function 'torch::jit::IValue torch::jit::argumentToIValue(const torch::jit::FunctionSchema&, size_t, pybind11::handle)':
caffe2/torch/csrc/jit/pybind_utils.h:138:226: warning: 'pybind11::str pybind11::detail::object_api<Derived>::str() const [with Derived = pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>]' is deprecated: Use py::str(obj) instead [-Wdeprecated-declarations]
```
apaszke zdevito ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11107
Differential Revision: D9598040
Pulled By: goldsborough
fbshipit-source-id: 4a055353ac08d54a2bbca49573ff099310de3666
Summary:
ATen's doc/ folder is manually maintained and can thus cause confusion with the generated file. We now have proper online documentation for ATen, which is superior to ATen doc/. Let's delete ATen/doc.
ezyang apaszke soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11158
Differential Revision: D9618782
Pulled By: goldsborough
fbshipit-source-id: 0ef14f84947601a0589aa4a41e5c8619783426fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11035
Instead, inline its definition into Tensor. We need
to do this so we can avoid needing to call getType() from
TensorImpl.
Reviewed By: cpuhrsch
Differential Revision: D9564516
fbshipit-source-id: 19fdaa2b93419e21572b9916714aee4165cb3390
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11031
The eventual plan is to get rid of TensorImpl::type()
entirely; but first we need a function to call.
Reviewed By: cpuhrsch
Differential Revision: D9564206
fbshipit-source-id: b59a9ccfaed44199f185eff392835cec89ccda8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11023
I'd like TensorOptions to not know anything about Context, so I can
move it to ATen/core without pulling in Context. To do this, the
type() method has to go, since it consults the context to get a Type.
Reviewed By: cpuhrsch
Differential Revision: D9562467
fbshipit-source-id: 61a18a76eb042a5e70b64b963501e9d68c25d4f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11144
We can move them now that TensorMethods no longer references Tensor.
Reviewed By: cpuhrsch
Differential Revision: D9613800
fbshipit-source-id: 99ad1dd7d77eb319000769230b7016294cf1980f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11027
Using swap() as a primitive, copy and move assignment become much easier.
Reviewed By: ezyang
Differential Revision: D9563753
fbshipit-source-id: e74faf39b596f097de758bfe038639565807040a
Summary:
This completely removes BUILD_CAFFE2 from CMake. There is still a little bit of "full build" stuff in setup.py that enables USE_CUDNN and BUILD_PYTHON, but otherwise everything should be enabled for PyTorch as well as Caffe2. This gets us a lot closer to full unification.
cc mingzhe09088, pjh5, ezyang, smessmer, Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8338
Reviewed By: mingzhe09088
Differential Revision: D9600513
Pulled By: orionr
fbshipit-source-id: 9f6ca49df35b920d3439dcec56e7b26ad4768b7d
Summary:
Added MPI group support.
This makes all previous MPI group test cases pass.
Also, relaxed the MPI thread level support by serializing different PGs' MPI ops. This is required.
The build is fixed too.
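A minimal sketch of what group support enables (assuming `dist.init_process_group(backend="mpi")` has been called on every rank; values are illustrative):
```python
import torch
import torch.distributed as dist

# Build a subgroup of ranks 0 and 1; collectives can then run on it.
group = dist.new_group(ranks=[0, 1])
t = torch.ones(1)
if dist.get_rank() in (0, 1):
    dist.all_reduce(t, group=group)
```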
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128
Differential Revision: D9602188
Pulled By: teng-li
fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11101
I'd like to invert the dependency between Tensor and TensorOptions
(such that Tensor includes TensorOptions); to do this, I'd prefer
there to not be a Tensor constructor. Eventually, all references
of Tensor will disappear from TensorOptions.h
Reviewed By: cpuhrsch
Differential Revision: D9585627
fbshipit-source-id: dd4a28b2c06b1e55f629762915f03c2b6c34d840
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11122
these changes add fixes to device.cc that are appropriate to create the intra-device-copies for opencl
Reviewed By: bwasti
Differential Revision: D9553292
fbshipit-source-id: e59f17916b5df30a504adee0718f9cecfe28f35a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11021
We can now store a boolean saying if we want a Variable or not,
and context can use VariableHooks to get a VariableType if we
request one.
Reviewed By: cpuhrsch
Differential Revision: D9562312
fbshipit-source-id: 84653cd789622764132252406a5ea1a83eee3360
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11096
To discourage willy-nilly use, and make it clearer that it
is not a Variable
Reviewed By: cpuhrsch
Differential Revision: D9583699
fbshipit-source-id: 4fbde0c01ae3deb2c7ef8c125a9028f089b203ae
Summary:
Generate serialized test inputs/outputs/backward graphs of tests inside `caffe2/python/operator_test` that call assertSerializedOperatorCheck(). Tests should be decorated with serialized_test.collect_tests.given_and_seeded to run hypothesis tests that are actually random plus a single fixed-seed hypothesis test.
To use:
1. Refactor your test to be a SerializedTestCase
1a. Decorate it with given_and_seeded
1b. Call testWithArgs in main
2. Run your test with -g to generate the output. Check it in.
3. Subsequent runs of the test without generating the output will check against the checked in test case.
Details:
Run your test with `python caffe2/python/operator_test/[your_test].py -g`
Outputs are in `caffe2/python/serialized_test/data`. The operator tests outputs are in a further subdirectory `operator_test`, to allow for other tests in the future (model zoo tests?)
Currently, we've only refactored weighted_sum_test to use this, but in the next diff, we'll refactor as many as possible. The directory structure may also change as usually there are multiple tests in a single file, so we may create more structure to account for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10594
Reviewed By: ezyang
Differential Revision: D9370359
Pulled By: ajyu
fbshipit-source-id: 2ce77389cd8bcc0255d3bccd61569833e545ede8
Summary:
**Review last commit only.** Stacked on top of #10949.
This commit fixes a number of issues connected to caching
differentiability status of graphs inside graph executors,
and changes the rules for optimization of differentiable subgraphs.
Previously every one of those was instantiated as a separate graph
executor, but now they are simply heavier-optimized graph regions,
and graph executors are only instantiated for their backward.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10977
Differential Revision: D9600626
Pulled By: apaszke
fbshipit-source-id: dad09a0f586e396afbd5406319c1cd54fbb8a3d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11095
We used getType to mean a lot of things.
- getVariableTypeFromBaseType: given a base Type (non-Variable type)
compute the Variable Type which corresponds to it.
- getVariableType: like at::getType, but return the Variable type
rather than the plain type.
This rename makes it clearer at the use-site what things are what,
and will make a subsequent rename of at::getType easier.
Reviewed By: gchanan, cpuhrsch
Differential Revision: D9583630
fbshipit-source-id: 2667ec98e7607bc466920c7415a8c651fd56dfca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11080
- Add a new TensorOptions(Device, ScalarType) constructor,
which serves roughly the same role as getType used to.
We shouldn't get too wild with these constructors, but
since this particular one was widely used by getType,
it seems worth adding.
- Change DLPack DeviceType conversion to at::DeviceType,
rather than at::Backend. While I'm at it, add a few more
conversions that at::DeviceType understands.
- Add a new overload of from_blob which understands strides.
Reviewed By: gchanan, cpuhrsch
Differential Revision: D9578734
fbshipit-source-id: 28288ec053aae8765e23925ab91023398d632d6b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11077
getType now supports retrieving variable types, so make it clearer
when a getType function does NOT give you a variable type.
```
codemod -d . --extensions cc,cpp,cu,cuh,h getTypeOpt getNonVariableTypeOpt
```
Reviewed By: gchanan
Differential Revision: D9578398
fbshipit-source-id: 3ee502ac5c714849917f11ddc71de8eacfdaa9d3
Summary:
Operators like aten::chunk used to return a number of tensors, but
now return a list. To make it easier to do shape prop through
aten::chunk and fuse it, I've also introduced prim::ConstantChunk,
which behaves like the previous implementation (has a variable length
output list).
The downside of this PR is that the introduction of more lists to the IR causes the LSTM and MiLSTM graphs to be considered as non-differentiable by the graph executor. I verified that they are still optimized correctly, and my next patch (that changes how the specializations/differentiation works) will restore those.
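For illustration, a small script where the chunk count is constant (a sketch on a current build; the IR node itself is an internal detail):
```python
import torch

@torch.jit.script
def split_three(x):
    # A constant number of chunks lets the compiler use a fixed-length
    # unpacking of the returned list.
    a, b, c = torch.chunk(x, 3, dim=0)
    return a + b + c

print(split_three(torch.arange(6.0)))  # tensor([6., 9.])
```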
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10949
Reviewed By: zdevito
Differential Revision: D9556823
Pulled By: apaszke
fbshipit-source-id: 33e63b17fc7247cac6cfc05eb7eb9bf069b499ee
Summary:
Update to the latest observer usage syntax and add an example of HistogramObservers.
Reviewed By: jspark1105
Differential Revision: D6878439
fbshipit-source-id: c9521f2daecfc7f0c17de6a944dce58e568e3dbe
Summary:
How did we get so many uses of `NULL` again?
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11047
Differential Revision: D9566799
Pulled By: goldsborough
fbshipit-source-id: 83469f352ac69aa65bdaf1a1a21f922d892e0db3
Summary:
I've been seeing a lot of warnings about multiple declarations of this. Hopefully this fixes it.
cc Yangqing mingzhe09088 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11045
Reviewed By: mingzhe09088
Differential Revision: D9582756
Pulled By: orionr
fbshipit-source-id: 6171485609a2f2f357d6e1c44e26b4ecfcdb4ce6
Summary:
This is needed because the JIT declares some custom autograd functions.
colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11082
Differential Revision: D9580456
Pulled By: apaszke
fbshipit-source-id: 6bf00c1188a20b2ee6ecf60e5a0099f8263ad55a
Summary:
This was done because it is surprising for a decorator to run a function
rather than wrap it, and it did not simplify the syntax for tracing modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11069
Reviewed By: jamesr66a
Differential Revision: D9583192
Pulled By: zdevito
fbshipit-source-id: b914b7ab4c73c255086465a6576eef3a22de1e13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11026
ezyang fixed a bug with moving or copying an intrusive_ptr into itself.
This diff adds test cases for it.
Reviewed By: ezyang
Differential Revision: D9563464
fbshipit-source-id: 3a3b3f681124730d2500b276c0135c3bba7875ae
Summary:
This PR creates a stream pool per issue #9646. When a new stream is requested, the device it's requested on lazily creates two pools, one low priority and one high priority, of 32 streams each. Streams are returned from these pools round-robin. That is, stream 0 is returned, then stream 1... then stream 31, then stream 0... This PR also takes the opportunity to clean up the stream API, reducing its complexity and verbosity.
Change notes:
- There are now 3 sets of streams per device, the default stream, the low priority streams, and the high priority streams. These streams live in lazily initialized pools and are destroyed on shutdown.
- All stream refcounting has been removed (the pools pattern replaces it).
- Setting a stream now sets it on its device. Streams are associated with a device and the previous
requirement to specify that device was unnecessary.
- There is no exposure for setting the flags on a stream. This may also seem like a regression but the flag was always set to cudaStreamNonBlocking.
- Streams are now low or high priority whereas previously the priority could be set with an integer. In practice, however, the range for priorities is -1 to 0 on the latest hardware. -1 is high priority, 0 is low priority (aka default priority). Low vs. high actually clarifies this behavior if people were trying to make finer separations. (E.g., if someone tried streams with priorities 0, 1, and 2, they would historically all have had priority 0, and the intended behavior would not be respected.) See the sketch after this list.
- Unused THCStream and THCState stream-related functions were removed.
- A new test of pooling behavior was added in stream_test.
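A minimal user-level sketch of the low/high priority distinction (assumes a CUDA device is available; the pools themselves are internal):
```python
import torch

if torch.cuda.is_available():
    low = torch.cuda.Stream(priority=0)    # low (default) priority
    high = torch.cuda.Stream(priority=-1)  # high priority
    x = torch.randn(1024, device="cuda")
    with torch.cuda.stream(high):
        y = x * 2                          # work enqueued on the high priority stream
    torch.cuda.current_stream().wait_stream(high)
```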
fyi: colesbury, apaszke, goldsborough
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9938
Reviewed By: SsnL
Differential Revision: D9569036
Pulled By: ezyang
fbshipit-source-id: 12ed673fe373170d0cf4d65cb570de016c53ee7d
Summary:
The warnings are erroneous as far as I can see,
so tweak things to avoid them. The (unsigned int) cast is
to avoid passing -1 to a size_t type. This was triggered
in gcc8's LTO build only, giving:
caffe2/aten/src/TH/generic/THTensor.cpp: In function ‘THFloatTensor_squeeze1d’:
lto1: error: ‘__builtin_memset’ specified size 18446744073709551608
exceeds maximum object size 9223372036854775807 [-Werror=stringop-overflow=]
In function ‘newImpl’,
inlined from ‘operator new’ at common/memory/OperatorOverride.cpp:86:23,
inlined from ‘allocate’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/ext/new_allocator.h:111:0,
inlined from ‘allocate’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/bits/alloc_traits.h:436:0,
inlined from ‘_M_allocate’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/bits/stl_vector.h:172:0,
inlined from ‘_M_default_append’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/bits/vector.tcc:571:0,
inlined from ‘resize’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/bits/stl_vector.h:671:0,
inlined from ‘THTensor_resizeDim’ at caffe2/aten/src/TH/THTensor.hpp:123:0,
inlined from ‘THFloatTensor_squeeze1d.part.198’ at caffe2/aten/src/TH/generic/THTensor.cpp:429:0,
inlined from ‘THFloatTensor_squeeze1d’:
common/memory/OperatorOverride.cpp:86:23: error:
argument 1 value ‘18446744073709551608’ exceeds maximum object size 9223372036854775807 [-Werror=alloc-size-larger-than=]
void* ptr = malloc(size);
Reviewed By: soumith
Differential Revision: D9568621
fbshipit-source-id: 4569a4be897d669caa3f283f4b84ec829e8d77ad
Summary:
Also add single grad whitelist to the jit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10782
Reviewed By: ezyang
Differential Revision: D9583378
Pulled By: erikbrinkman
fbshipit-source-id: 069e5ae68ea7f3524dec39cf1d5fe9cd53941944
Summary:
I've had `torch/lib/python3.6` show up as part of the build for some time now. It's not ignored which means I need to be extra careful about checking in files, or I end up with a thousand of them in my index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11083
Differential Revision: D9580453
Pulled By: apaszke
fbshipit-source-id: 369e4fe87962696532d111b24f2a4a99b9572bf2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10826
Add strides, and make sure the strides are consistent with sizes, and is_contiguous, for all the Caffe2 functions.
is_contiguous means strides_[dim-1] = 1 and strides_[i] = strides_[i+1] * max(size_[i+1], 1);
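A small sketch of that rule (illustrative helper, not the Caffe2 code itself):
```python
def contiguous_strides(sizes):
    # strides[dim-1] = 1; strides[i] = strides[i+1] * max(sizes[i+1], 1)
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * max(sizes[i + 1], 1)
    return strides

print(contiguous_strides([2, 3, 4]))  # [12, 4, 1]
print(contiguous_strides([2, 0, 4]))  # [4, 4, 1] -- zero-sized dims count as 1
```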
Reviewed By: ezyang
Differential Revision: D9354480
fbshipit-source-id: 3643871b70f1111b7ffdd9fdd9fe9bec82635963
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10976
The app can run in Xcode with the benchmark metrics collected.
It can also run when building with buck.
Reviewed By: llyfacebook
Differential Revision: D9546755
fbshipit-source-id: 60ad0112946f8cf57138417f6838a58ed6d2c90f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11046
As suggested by jerryzh168, the temporary fix for a new constraint that was added in D9350686 is to remove this assert. Long term, jerryzh168 is going to work out a better way of handling this.
Reviewed By: jerryzh168
Differential Revision: D9566323
fbshipit-source-id: e4630c7cbe0cc68a084974ea7048654811fae01f
Summary:
Currently our `skipIfLapack` uses a try-catch block and regex-matches the error message, which is highly unreliable. This PR adds `hasLAPACK` and `hasMAGMA` to the ATen context, and exposes the flags to Python.
Also fixes a refcounting bug with `PyModule_AddObject`. The method steals a reference, but we didn't `Py_INCREF` in some places before calling it with `Py_True` or `Py_False`.
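A hedged sketch of gating a test on the new flags (assuming they are exposed as `torch._C.has_lapack` / `torch._C.has_magma`; test name is illustrative):
```python
import unittest
import torch

skipIfNoLapack = unittest.skipIf(not torch._C.has_lapack, "PyTorch built without LAPACK")

class TestLinalg(unittest.TestCase):
    @skipIfNoLapack
    def test_inverse(self):
        # Only runs when LAPACK support was compiled in.
        a = torch.randn(3, 3)
        self.assertEqual(torch.inverse(a).shape, a.shape)
```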
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11024
Differential Revision: D9564898
Pulled By: SsnL
fbshipit-source-id: f46862ec3558d7e0058ef48991cd9c720cb317e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net `task:output` is also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for this is that ZMQ socket can't send empty blob list.
As a result, if the Task on the Worker had no output,
the master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.
Instead, we should move the creating of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and no side-effect to users.
Reviewed By: mraway
Differential Revision: D9566744
fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11052
Delete the test case for constructing a Predictor from a MetaNetDef since that constructor
has actually been deprecated. The broken PR is for constructing a predictor from a DB instance.
Reviewed By: highker
Differential Revision: D9566935
fbshipit-source-id: 5511883953a2d3f6eb0a4f1c5518a1bc4b3ffbdc
Summary:
Not subclassed except by Tensor. Also required to align further with
caffe2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11036
Reviewed By: ezyang
Differential Revision: D9565640
Pulled By: cpuhrsch
fbshipit-source-id: ff7203a2c95d3f3956282b4f2d8dda6c2b93f4a6
Summary:
Things like torch.zeros now appear in traces rather than constants.
To continue to support our current level of ONNX export, we run
constant prop to turn these back into constants where possible before
export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10935
Differential Revision: D9527427
Pulled By: zdevito
fbshipit-source-id: 552a8bcc01b911251dab7d7026faafdd7a3c758a
Summary:
Initial version of `unique` supporting a `dim` argument.
As discussed in [this issue](https://github.com/pytorch/pytorch/issues/9997) I added the `dim` argument to `torch.unique` with the same behavior as [numpy](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.unique.html).
Since the implementation is based on `std/thrust::unique`, the `tensor` always needs to be sorted. The `sorted` argument in `torch.unique` therefore has no effect, just as in the CUDA version of the plain `torch.unique`.
To check the performance and equal behavior between `torch.unique` and `np.unique`, I've used [this gist](https://gist.github.com/ptrblck/ac0dc862f4e1766f0e1036c252cdb105).
Currently we achieve the following timings for an input of `x = torch.randint(2, (1000, 1000))`:
(The values are calculated by taking the average of the times for both dimension)
| Device | PyTorch (return_inverse=False) | Numpy (return_inverse=False) | PyTorch (return_inverse=True) | Numpy (return_inverse=True) |
| --- | --- | --- | --- | --- |
| CPU | ~0.007331s | ~0.022452s | ~0.011139s | ~0.044800s |
| GPU | ~0.006154s | - | ~0.105373s | - |
Many thanks to colesbury for the awesome mentoring and the valuable advices on the general implementation and performance issues!
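A small usage sketch (current signature; values are illustrative):
```python
import torch

x = torch.tensor([[1, 2],
                  [3, 4],
                  [1, 2]])

# Unique rows; inverse maps each original row to its unique index.
values, inverse = torch.unique(x, dim=0, return_inverse=True)
print(values)   # tensor([[1, 2], [3, 4]])
print(inverse)  # tensor([0, 1, 0])
```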
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10423
Differential Revision: D9517289
Pulled By: soumith
fbshipit-source-id: a4754f805223589c2847c98b8e4e39d8c3ddb7b5
Summary: When conversion fails, dump more information to help fix up the netdef
Reviewed By: hyuen, yinghai
Differential Revision: D9558667
fbshipit-source-id: 8917cc61c9be6285697e4f8395a9dbc7135f618e
Summary:
1. Support ops needed for inference of Faster-RCNN/Mask-RCNN needed in Detectron, mostly direct fallbacks.
2. Use CPU device to hold 0-dim tensors and integer tensors in both fallback op and blob feeder, needed by Detectron models.
3. Ignore 0-dim tensor in MKL-DNN concat operator.
4. Generate dynamic library of Detectron module for CPU device.
This PR obsoletes #9164.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10157
Differential Revision: D9276837
Pulled By: yinghai
fbshipit-source-id: dc364932ae4a2e7fcefdee70b5fce3c0cee91b6f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10955
Add GPU version of HardSigmoid Op to Caffe2. Updated test file to
include GPU tests.
Reviewed By: enosair
Differential Revision: D9499353
fbshipit-source-id: fcb51902063d0c3e4b10354533a8a42cf827c545
Summary:
This probably fixes the logging test error that orionr is encountering - haven't tested locally but wanted to send out a PR to kick off CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10983
Reviewed By: ezyang
Differential Revision: D9552607
Pulled By: Yangqing
fbshipit-source-id: 9ac019031ffd9c03972144df04a836e5dcdafe02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10920
Update the black box predictor and the related code to use the
constructor with PredictorConfig.
Reviewed By: highker
Differential Revision: D9516972
fbshipit-source-id: fbd7ece934d527e17dc6bcc740b4e67e778afa1d
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d
**Now all the distributed test including c10d DDP can pass with the c10d frontend API**
TODO: (in a separate PR)
MPI subgroup support, once this is added, CI group test will be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871
Differential Revision: D9554514
Pulled By: teng-li
fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10987
some code style update to make it consistent with fb cpp style
Reviewed By: yinghai
Differential Revision: D9550130
fbshipit-source-id: 6aef9878676c08e7d384383c95e7ba8c5c9a1bce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11003
Need an interface to rewrite the graph after the net is built and after adding gradient ops.
Reviewed By: aazzolini, harouwu
Differential Revision: D9557827
fbshipit-source-id: 2e082f0321c0776e488a29e18047d950948e7c37
Summary:
The goal here is to separate out the base Type into core; as it was done previously we need all derived Types to be defined when we compile the base Type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10947
Reviewed By: gchanan
Differential Revision: D9540025
Pulled By: ezyang
fbshipit-source-id: 49f0b5acb3c378348ef3a55780abb73e4ae27edd
Summary:
Fixes #10851
Speeds up profiling results dramatically.
For the following script:
```
import torch
import time
ITER = 2000
x = torch.randn(1, 1, requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    y = x
    for i in range(ITER):
        y = 3 * y - 2 * y
    y.backward()
start = time.time()
print("Done running. Preparing prof")
x = str(prof)
print("Done preparing prof results")
end = time.time()
print("Elapsed: {}".format(end - start))
```
I get 7s before / 0.13s after these changes.
cc apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10969
Differential Revision: D9556129
Pulled By: zou3519
fbshipit-source-id: 26b421686f8a42cdaace6382567d403e6385dc12
Summary:
Breaking this out of https://github.com/pytorch/pytorch/pull/8338
mingzhe09088's fix of the docstrings for Windows builds. Unfortunately some versions of Windows seem to try and parse the `#` inside the string as a pre-processor declaration. We might need to change this to something else later, but want to get this landed first.
cc mingzhe09088 Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10998
Reviewed By: mingzhe09088
Differential Revision: D9557480
Pulled By: orionr
fbshipit-source-id: c6a6237c27b7cf35c81133fd9faefead675a9f59
Summary:
Breaking out of https://github.com/pytorch/pytorch/pull/8338
This test fails once we start building with `-DUSE_GLOG=OFF` since the non-glog logging case doesn't support flushing or streaming to the right location. For now, we just disable this test in that case.
cc Yangqing mingzhe09088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10999
Reviewed By: mingzhe09088
Differential Revision: D9557488
Pulled By: orionr
fbshipit-source-id: 8b306f210411dfc8ccc404bdccf77ddcd36a4830
Summary:
PyTorch exporting tests and end-to-end cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10924
Reviewed By: Ac2zoom
Differential Revision: D9548210
Pulled By: houseroad
fbshipit-source-id: 2381d1ad92a4e07f97060eb65c9fd09f60ad3de6
Summary:
This is part of splitting Context from what needs to go in ATen/core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10951
Differential Revision: D9540369
Pulled By: gchanan
fbshipit-source-id: 73b0e8c4493785fbab368a989f46137c51f6ea0b
Summary:
Fix #10345, which only happens in the CUDA case.
* Instead of returning some random buffer, we fill it with zeros.
* update torch.symeig doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10645
Reviewed By: soumith
Differential Revision: D9395762
Pulled By: ailzhang
fbshipit-source-id: 0f3ed9bb6a919a9c1a4b8eb45188f65a68bfa9ba
Summary:
This fixes multiple bugs in the handling of negative indices in both slicing and gather operations. These were uncovered by Elias Ellison's diff D9493614, which made it so that we actually emit negative indices when we see them in PyTorch code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10973
Reviewed By: jhcross
Differential Revision: D9546183
Pulled By: jamesr66a
fbshipit-source-id: 6cb0e84e8ad399e47e24a96c44025f644c17b375
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net `task:output` is also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for this is that ZMQ socket can't send empty blob list.
As a result, if the Task on the Worker had no output,
the master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.
Instead, we should move the creating of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and no side-effect to users.
Reviewed By: mraway
Differential Revision: D9413150
fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
Summary:
TODO: integrate into torch.onnx.export -- separate PR
*Problem:* We have a facility to trace PyTorch operations on Python code, but there are several failure modes where the trace is not representative of the actual underlying computation:
* The tracer encountered dynamic control flow
* Some computation escaped the tracer, and appeared as a Constant tensor node in the graph
* Some stateful function was traced, e.g. someone did an optimization in Python by memoizing function outputs
*Objective*: In an ideal world, this whole process would be automated and the user could trust that the system will magically capture the intended semantics from the program. Realistically speaking, we will likely have to settle for a human-in-the-loop error reporting system, allowing the user to identify problems and modify the source code to allow for tracing.
*Stage 1* (this PR): Output-level checking & graph diff. torch.jit.trace gains a kwarg 'check_inputs', which is a list of tuples of input arguments. We will iterate through the list and trace the function again for each set of check inputs. We'll also interpret the original trace with these inputs and compare output values and graphs, printing a diff of the graph if there is a difference.
Examples:
```
@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(4, 5),)])
def foo(x):
    y = torch.arange(0, x.shape[0]).float()
    return x + y.unsqueeze(1)
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Graphs differed across invocations!
Graph diff:
graph(%0 : Dynamic) {
- %1 : Dynamic = prim::Constant[value= 0 1 2 [ CPULongType{3} ]]()
? ^
+ %1 : Dynamic = prim::Constant[value= 0 1 2 3 [ CPULongType{4} ]]()
? +++ ^
%2 : int = prim::Constant[value=0]()
%3 : Dynamic = aten::_cast_Float(%1, %2)
%4 : int = prim::Constant[value=1]()
%5 : Dynamic = aten::unsqueeze(%3, %4)
%6 : int = prim::Constant[value=1]()
%7 : Dynamic = aten::add(%0, %5, %6)
return (%7);
}
Node diff:
- %1 : Dynamic = prim::Constant[value= 0 1 2 [ CPULongType{3} ]]()
? ^
+ %1 : Dynamic = prim::Constant[value= 0 1 2 3 [ CPULongType{4} ]]()
? +++ ^
Trace source location:
dank.py(5): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(402): wrapper
dank.py(3): <module>
Check source location:
dank.py(5): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(281): check_trace
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(408): wrapper
dank.py(3): <module>
ERROR: Tensor-valued Constant nodes differed in value across invocations. This often indicates that the tracer has encountered untraceable code.
Node:
%1 : Dynamic = prim::Constant[value= 0 1 2 [ CPULongType{3} ]]()
Source Location:
dank.py(5): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(402): wrapper
dank.py(3): <module>
Comparison exception:
Not equal to tolerance rtol=1e-07, atol=0
(shapes (3,), (4,) mismatch)
x: array([0, 1, 2])
y: array([0, 1, 2, 3])
```
==
```
@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(3, 4),)])
def foo(x):
    y = x.data
    return x + y
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
ERROR: Tensor-valued Constant nodes differed in value across invocations. This often indicates that the tracer has encountered untraceable code.
Node:
%1 : Dynamic = prim::Constant[value=<Tensor>]()
Source Location:
dank.py(6): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(402): wrapper
dank.py(3): <module>
Comparison exception:
Not equal to tolerance rtol=1e-07, atol=0
(mismatch 100.0%)
x: array([0.397137, 0.956105, 0.169478, 0.560292, 0.392568, 0.108441,
0.97645 , 0.34412 , 0.951246, 0.793061, 0.557595, 0.770245],
dtype=float32)
y: array([0.243178, 0.315964, 0.972041, 0.0215 , 0.927751, 0.457512,
0.951092, 0.97883 , 0.048688, 0.118066, 0.779345, 0.271272],
dtype=float32)
```
==
```
import torch

@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(4, 4),)])
def foo(x):
    for _ in range(x.size(0)):
        x = torch.neg(x)
    return x
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
ERROR: Graphs differed across invocations!
Graph diff:
graph(%0 : Dynamic) {
%1 : Dynamic = aten::neg(%0)
%2 : Dynamic = aten::neg(%1)
%3 : Dynamic = aten::neg(%2)
+ %4 : Dynamic = aten::neg(%3)
- return (%3);
? ^
+ return (%4);
? ^
}
```
==
```
import torch

def foo(x):
    if not hasattr(foo, 'cache'):
        foo.cache = torch.neg(x)
    return x + foo.cache

traced = torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(3, 4),)])(foo)
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
ERROR: Graphs differed across invocations!
Graph diff:
graph(%0 : Dynamic) {
- %1 : Dynamic = aten::neg(%0)
+ %1 : Dynamic = prim::Constant[value=<Tensor>]()
%2 : int = prim::Constant[value=1]()
%3 : Dynamic = aten::add(%0, %1, %2)
return (%3);
}
Node diff:
- %1 : Dynamic = aten::neg(%0)
+ %1 : Dynamic = prim::Constant[value=<Tensor>]()
Trace source location:
test.py(5): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(402): wrapper
test.py(8): <module>
Check source location:
test.py(6): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(281): check_trace
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(408): wrapper
test.py(8): <module>
```
The following two examples show instances where program semantics are lost in the Python -> trace transformation, and repeated invocation does not give us useful debug information. Further design is underway for catching these scenarios.
```
import torch

@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(3, 4),)])
def foo(x):
    for i in range(3):
        x[i, :] = torch.zeros(4)
    return x
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
Exception:
Not equal to tolerance rtol=1e-07, atol=0
(mismatch 100.0%)
x: array([0.830221, 0.915481, 0.940281, 0.555241], dtype=float32)
y: array([0., 0., 0., 0.], dtype=float32)
```
==
```
import torch

@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(5, 6),)])
def foo(x):
    x.view(-1).add_(-x.view(-1))
    return x
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
Exception:
Not equal to tolerance rtol=1e-07, atol=0
(mismatch 100.0%)
x: array([0.734441, 0.445327, 0.640592, 0.30076 , 0.891674, 0.124771],
dtype=float32)
y: array([0., 0., 0., 0., 0., 0.], dtype=float32)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10841
Differential Revision: D9499945
Pulled By: jamesr66a
fbshipit-source-id: 1f842a32d0b0645259cc43b29700b86d99c59a45
Summary:
This PR adds argument checking for script method invocation from C++. For this I had to:
1. The schema of a method is currently not serialized in script modules, so we now store the function schema in the `doc_string` field of the ONNX proto. Upon loading of a serialized script module, we parse the schema into the structured C++ form and assign it to the loaded method,
2. Inside `Method::operator()`, we now verify the number and types of arguments.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10872
Differential Revision: D9521219
Pulled By: goldsborough
fbshipit-source-id: 5cb3d710af6f500e7579dad176652c9b11a0487d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10929
Workspace class methods were missing on the Python side.
This enables writing the new checkpoint framework with more control over the workspace and a cleaner implementation.
Added
- ws.feed_blob(name, arr)
- ws.remove_blob(name)
Reviewed By: mraway
Differential Revision: D9486867
fbshipit-source-id: ea02d2e3a39d716a5a3da0482f57d4ac4c893763
Summary: Adds basic nomnigraph python bindings for quickly playing with the graphs.
Reviewed By: duc0
Differential Revision: D9441936
fbshipit-source-id: fd70f8ea279b28c766e40f124008800acd94bddd
Summary:
The previous NCCL all gather doesn't work as expected. This is a fully working async version. Tested on both C++ and Python Frontend.
Multi-node:
```
tengli@learnfair042:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ TMPFILE="/private/home/tengli/temp/tengli-test" RANK=0 WORLD_SIZE=2 ./ProcessGroupNCCLTest
Multi-node world size: 2 rank: 0
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
tengli@learnfair117:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ TMPFILE="/private/home/tengli/temp/tengli-test" RANK=1 WORLD_SIZE=2 ./ProcessGroupNCCLTest
Multi-node world size: 2 rank: 1
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
```
CI test:
```
test_set_get (__main__.FileStoreTest) ... ok
test_set_get (__main__.PrefixFileStoreTest) ... ok
test_set_get (__main__.PrefixTCPStoreTest) ... ok
test_allreduce_ops (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_ops (__main__.ProcessGroupGlooTest) ... ok
test_allgather_ops (__main__.ProcessGroupNCCLTest) ... ok
test_allreduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_broadcast_ops (__main__.ProcessGroupNCCLTest) ... ok
test_reduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_common_errors (__main__.RendezvousFileTest) ... ok
test_nominal (__main__.RendezvousFileTest) ... ok
test_common_errors (__main__.RendezvousTCPTest) ... ok
test_nominal (__main__.RendezvousTCPTest) ... ok
test_unknown_handler (__main__.RendezvousTest) ... ok
test_set_get (__main__.TCPStoreTest) ... ok
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10932
Differential Revision: D9542067
Pulled By: teng-li
fbshipit-source-id: 25513eddcc3119fd736875d69dfb631b10f4ac86
Summary:
Running `--accept` on a test doesn't tell you explicitly which sub-test is being updated; this PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10559
Differential Revision: D9353977
Pulled By: driazati
fbshipit-source-id: a9d4014386ff0fe388a092f3dcf50f157e460f04
Summary:
Changes the approach for resolving builtin ops so that the following works
```
add = torch.add

@torch.jit.script
def foo(x):
    return add(x, x)
```
This handles cases where people alias torch and torch.nn.functional to shorter names (see the sketch below).
This works by building a table of id -> builtin name for the known builtin ops in torch, torch.nn.functional, and for any user-defined op created by accessing torch.ops.foo.bar.
This allows us to clean up many SugaredValue types in the compiler.
Notes:
* we now consider any attributes on python modules to be constants (e.g. math.pi and torch.double).
* fixes a bug where we incorrectly allowed attribute lookup on arbitrary python objects. It is now restricted to modules only.
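A minimal sketch (not taken from the PR itself) of the aliasing pattern this enables in script, assuming the usual `torch.jit.script` entry point:
```
import math
import torch
import torch.nn.functional as F

# F.relu resolves to a builtin op and math.pi to a module-attribute constant
# when compiled as part of the scripted function.
@torch.jit.script
def scaled_relu(x):
    return F.relu(x) * math.pi
```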
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10927
Differential Revision: D9527522
Pulled By: zdevito
fbshipit-source-id: 0280422af08b4b0f48f302766d5a9c0deee47660
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10804
Make ShareData and ShareExternalPointer create new storage when the old one is used by multiple tensors.
When we need to modify a field of the storage, we'll create a new storage instead.
Reviewed By: ezyang
Differential Revision: D9350686
fbshipit-source-id: 68d2b6b886b0367b0fc4fabfd55b9a480e7388ca
Summary:
Currently we assume that cudnn includes and libraries are found under the `CUDA_HOME` root, but this is not always true. We now support a `CUDNN_HOME`/`CUDNN_PATH` environment variable pointing to a directory with its own `/include` and `/lib64` folders.
This means cudnn extensions now also get support on the FAIR cluster.
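A hypothetical sketch of pointing a build at a standalone cudnn install (the path is made up; the `<CUDNN_HOME>/include` and `<CUDNN_HOME>/lib64` layout follows the summary above):
```
import os

# Set before building the extension so the build picks up the standalone cudnn.
os.environ["CUDNN_HOME"] = "/opt/cudnn"  # expects /opt/cudnn/include and /opt/cudnn/lib64
```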
soumith fmassa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10922
Differential Revision: D9526856
Pulled By: goldsborough
fbshipit-source-id: 5c64a5ff7cd428eb736381c24736006b21f8b6db
Summary:
Since we don't need `torch.autograd.Variable` anymore, I removed `torch.autograd.Variable` from `onnx.rst`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10810
Differential Revision: D9500960
Pulled By: zou3519
fbshipit-source-id: 1bc820734c96a8c7cb5d804e6d51a95018db8e7f
Summary:
More support for tuples has uncovered a bug in constant prop where it assumed it could create constant nodes for tuples, even though we cannot easily create a single prim::Constant to represent a tuple.
This fix checks when we cannot represent an IValue as a prim::Constant and then stops propagating the node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10923
Reviewed By: orionr
Differential Revision: D9523417
Pulled By: zdevito
fbshipit-source-id: 745058c4388d9a5e0fc1553eaa2731e31bc03205
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10824
API additions:
- Tensor(c10::intrusive_ptr<TensorImpl,UndefinedTensor>&&)
- Tensor(const c10::intrusive_ptr<TensorImpl,UndefinedTensor>&)
- Tensor::operator=(Tensor&&) && (for completeness sake)
- TensorBase::unsafeGetTensorImpl()
- TensorBase::unsafeReleaseTensorImpl()
- TensorBase::getIntrusivePtr()
- TensorImpl::type_id()
- Tensor::set_data()
- Tensor::is_same(Tensor)
- Tensor::use_count()
- Tensor::type_id()
- Tensor::scalar_type()
- WeakTensor::is_same(WeakTensor)
- intrusive_ptr::weak_use_count()
- weak_intrusive_ptr::weak_use_count()
- c10::raw::intrusive_ptr::{incref,decref,make_weak}
- c10::raw::weak_intrusive_ptr::{incref,decref,lock}
API changes:
- Tensor::pImpl is no longer public (and now named tensor_impl_)
- Most methods accessed this way are now accessible on Tensor
maybe_zero_dim() and set_wrapped_number() being prominent exceptions
(they are now accessed through unsafeGetTensorImpl())
- Type is no longer friend of Tensor
- TensorBase::reset(TensorImpl*) is deleted
- TensorBase::reset(TensorImpl*, bool should_retain) is deleted
- TensorBase::swap(TensorBaseImpl&) is deleted; use std::swap instead
- TensorBase::get() is deleted; use unsafeGetTensorImpl() instead
- TensorBase::detach() is deleted; use unsafeReleaseTensorImpl() instead
- TensorBase::retain() is deleted; use _raw_incref() instead
- TensorBase::release() is deleted; use _raw_decref() instead
- WeakTensor lost most of its methods (it no longer inherits from
TensorBase)
- TensorImpl::storage() is now a const method
- Tensor(TensorBase) constructor removed, instead
we go through getIntrusivePtr(). I'm not sure about
this change; I happened to have accidentally removed the
TensorBase constructor and decided to fix call sites,
but I could go the other way.
- detail::set_data() is deleted; use Tensor::set_data() instead
- c10::raw_intrusive_ptr_target removed; use the functions in c10::raw instead.
(The reason for this change, is that it is invalid to cast an intrusive_ptr_target*
to a raw_intrusive_ptr_target* to take advantage of the methods. But there is
no reason the incref/decref methods shouldn't also work on intrusive_ptr_target;
it is primarily an API consideration. We can be more standards compliant by
keeping them as functions, which are universally applicable.)
- intrusive_ptr::reclaim() and weak_intrusive_ptr::reclaim() now work on
pointers of the NullType. (This counts as a bug fix, because the documentation
specified that pointers produced by release() are valid to reclaim(), and
a release() on a null intrusive_ptr produces the NullType::singleton())
Bug fixes:
- Dispatch code for mutable references incorrectly returned
a reference to a value argument (which would immediately
go out of scope). They now correctly return a tensor by
value.
- intrusive_ptr copy/move assignment did not work correctly when
an object was assigned to itself. We now check for this case and
no-op if so. (This bug manifested itself as a Tensor mysteriously
becoming an UndefinedTensor after lines of code like
'x = x.mul_(y)')
Other changes:
- The checked cast functions in Utils.h have now been
renamed and detemplatized into checked unwrap functions.
- Added type_id() and scalar_type() methods to Tensor
- pImpl is no longer public
- Documented what the && overloads are doing
- All occurrences of 'new TensorImpl' (and similar spellings, like 'new THTensor')
have been expunged. This is NO LONGER a valid way to create a new
tensor, and if you do this, upon your first incref, you will catch an ASSERT
failure saying that only tensors created by intrusive_ptr::release() are valid
to reclaim(). Use c10::make_intrusive instead in this situation.
- IValue is adjusted to use intrusive_ptr instead of Retainable, and all
other sub-classes of Retainable were modified to use intrusive_ptr.
When doing this, I had to make the constructors of sub-classes like
ConstantList public, so that c10::make_intrusive could invoke them. Fortunately,
if you incorrectly stack allocate a ConstantList, and then try to get an
intrusive_ptr to it, it will fail, as stack allocated ConstantLists have refcount 0.
- IValue very narrowly sidesteps the problem of handling NullType, as it
considers intrusive_ptr<TensorImpl> identical to intrusive_ptr<TensorImpl, UndefinedTensor>
which is not always true. This was always the case, but there's now a comment
explaining what's going on.
Some MSVC bugs were uncovered during the preparation of this patch.
They are documented as comments in the code.
Reviewed By: gchanan
Differential Revision: D9481140
fbshipit-source-id: 14a8ea0c231ed88b5715fb86d92730926f9f92fc
Summary:
The goal of this PR is to enable the MIOpen engine (for HIP devices) for the recurrent operator and also enable the corresponding unit test.
bddppq petrex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10840
Differential Revision: D9518980
Pulled By: bddppq
fbshipit-source-id: 214661e79a47c5dc6b712ef0fba986bd99db051f
Summary:
Previously, when tracing slicing & select, negative indices would get normalized, fixing the index to the size of the traced tensor. This change makes the behavior the same as script, so aten::select with negative indices is emitted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10560
Differential Revision: D9493614
Pulled By: eellison
fbshipit-source-id: ce7a8bae59863723247208d86b9f2948051ccc6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10877
change default value of DeviceOption.numa_node_id to 0 and use has_numa_node_id() to check existence
Reviewed By: ilia-cher
Differential Revision: D9473891
fbshipit-source-id: 91ac6a152f445644691023110c93d20a3ce80d43
Summary:
* Fix the necessary pathways so that tuples and lists can be inputs to the script.
* prevent linear algebra functions from being run in shape prop because
they frequently will error out for nonsense data.
* favor schema-driven python input conversion where possible.
remaining cases where we directly create Stacks without schema are
only for debugging
* Make the error messages when calling script/trace functions more pythonic
* Simplify FlattenTuples -- now that tuples are supported we can choose to only flatten tuples when needed. This may have to be revisited pending onnx test results, but is necessary for making tuple io work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10812
Differential Revision: D9477982
Pulled By: zdevito
fbshipit-source-id: ed06fc426e6ef6deb404602a26c435a7fc40ea0c
Summary:
The scalar situation has gotten a lot better and now we can
remove all instances of FIXME_zerol().
cc zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10900
Differential Revision: D9514206
Pulled By: zou3519
fbshipit-source-id: e4e522f324126c5454cd6de14b832d2d1f6cb0ce
Summary:
PackedSequence is never supposed to be created by users, but unfortunately some community repos are already doing this (e.g., [here](7c191048ce/torchmoji/model_def.py (L218-L229))). A change we made broke the calling pattern `PackedSequence(data=x, batch_sizes=y)`. This patch adds back support for it.
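A minimal sketch of the (discouraged) direct-construction pattern this patch restores; the concrete tensors are made up for illustration:
```
import torch
from torch.nn.utils.rnn import PackedSequence

data = torch.randn(5, 3)               # packed rows across all time steps
batch_sizes = torch.tensor([2, 2, 1])  # batch size at each time step, summing to 5
packed = PackedSequence(data=data, batch_sizes=batch_sizes)
```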
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9864
Differential Revision: D9011739
Pulled By: SsnL
fbshipit-source-id: 0e2012655d7f4863ec54803550df30874ec35d75
Summary:
Please review the expects carefully to make sure there are no regressions. I tried to go over them one by one when they changed, but it's sometimes easy to miss finer details.
Summary of changes:
- Renamed `TensorType` to `CompleteTensorType`. Added a new `TensorType` which records only the scalar type, number of dimensions, and device of a value. The rationale behind the rename is to encourage people to use `CompleteTensorType` less, as most passes will only have limited information available. To make the transition easier, `complete_type->cast<TensorType>()` works, which makes our passes work with both kinds of specialization if they don't need the extra detail.
- Renamed `ArgumentSpec` to `CompleteArgumentSpec`. Added a new `ArgumentSpec`, which matches argument only at the level of the new `TensorType`.
- Shape analysis can process graphs with both `CompleteTensorType` and `TensorType`.
- The fuser heavily relied on full shape information being available. Now, we simply try to fuse the largest possible graphs, and have to do run-time checks to make sure they match the code we generate. If they don't, we fall back to regular interpretation. The shape checks are implemented using an optimized method exploiting algebraic properties of shapes with broadcasting, and the relations of broadcasting with pointwise ops. A full written proof of correctness of the shape checking algorithm is included in a comment in `graph_fuser.cpp`.
zdevito ezyang mruberry ngimel csarofeen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10844
Differential Revision: D9498705
Pulled By: apaszke
fbshipit-source-id: 0c53c2fcebd871cc2a29c260f8d012276479cc61
Summary: Update all the caller for the new interface
Reviewed By: highker
Differential Revision: D9323167
fbshipit-source-id: a39335ceb402db0719f5f2314085ba9a81380308
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10239
Make Conv + BN fusion also work for 3D convolutions
Reviewed By: duc0
Differential Revision: D9176314
fbshipit-source-id: 6604aa569c5c3afdb4480a5810890bc617e449c4
Summary:
This disables the symbolic override hacks and makes tracing emit the recently added ATen ops for RNNs (`aten::lstm`, `aten::gru`, ...). I managed to reuse pretty much all of the translation code for their symbolics.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10638
Differential Revision: D9385830
Pulled By: apaszke
fbshipit-source-id: ff06ef7b1ae7c3b7774825e0991bc3887e1ff59b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10759
Adding a basic registry pattern to pybindstate so that we can have separate 'cc' files register module updates. This is substantially cleaner than using multiple pybind modules (which have been known to cause bugs)
Reviewed By: bddppq
Differential Revision: D9441878
fbshipit-source-id: af9e9e98385e92b58ca50e935678328c62684d8e
Summary:
**Summary**: This PR is a followup of mruberry's https://github.com/pytorch/pytorch/pull/9318/. It tries to achieve the following:
- Specializing std common math functions for `at::Half` type.
- Create `CUDANumerics.cuh` to contain necessary parts from `THCNumerics.cuh`.
- Update `THCNumerics.cuh` with new usage and comments to demonstrate the best practice for developers and hence, making way for its deprecation.
- Remove legacy/redundant code path.
- Remove unused CUDA HALF macros (see separate PR https://github.com/pytorch/pytorch/pull/10147)
**Comments**: `CUDANumerics.cuh` contains mathematical functions that are either not in the std namespace or are specialized for compilation with CUDA NVCC or CUDA NVRTC. This header is derived from the legacy `THCNumerics.cuh`. Following are some rationale behind why some functions were kept while others were removed:
- All arithmetic can now be done in ATen using binary cuda kernel or CUDA tensor pointwise apply (check https://github.com/pytorch/pytorch/pull/8919 and `CUDAApplyUtils`). `at::Half` comparisons rely on implicit conversion to float.
- Functions that are c/c++ standard compliant, have been specialized for user defined types, for instance, the std namespace has been opened up for `at::Half`, that defines math function definitions for `at::Half`. Check `Half-inl.h`
- Some standard-compliant functions are specialized here for performance reasons. For instance, `powi` is used for `pow` calculation on integral types. Moreover, `abs`, `isinf`, `isnan` are specialized to save one API call vs. when used with std, although this is subject to change depending on whether we really care about saving one API call.
- Numeric limits such as `max/min` are removed since they call standard defines. Moreover, the numeric limits for `at::Half` are present in `Half-inl.h`. I understood that HIP has some issues with `std::numeric_limits`, and this is the related GitHub issue I found: https://github.com/ROCm-Developer-Tools/HIP/issues/374. AlexVlx mentions that the issue can be avoided by launching `std::numeric_limits` in `__device__`. Since we are launching lambdas with device contexts, I don't see why `std::numeric_limits` wouldn't compile on HIP if launched with a device context within a kernel, unless I am not aware of the real reason why max/min was there in THCNumerics in the first place. (I haven't ever tried a build with HIP.)
Here are some reference PRs that was handy in refactoring TH into ATen:
- https://github.com/pytorch/pytorch/pull/6786
- https://github.com/pytorch/pytorch/pull/5475
- https://github.com/pytorch/pytorch/pull/9401
- https://github.com/pytorch/pytorch/pull/8689
- https://github.com/pytorch/pytorch/pull/8919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10301
Differential Revision: D9204758
Pulled By: soumith
fbshipit-source-id: 09f489c1656458c02367b6cd31c3eeeca5acdc8a
Summary:
This is along the way of removing Tensor as a member of the tagged union in Scalar. This simplifies ordering dependencies, because currently Scalar and Tensor both depend on each other (so we introduce a TensorBase). Also, this API isn't particularly useful publicly: we can't autograd through Scalars, so you still need a Tensor overload basically everywhere anyway.
I'm undecided what the final API should be here. We could keep a Tensor constructor on Scalar, but have it generate a local scalar; this is convenient but given this API used to be non-synchronizing, it may not be the best.
For now, I'm just using _local_scalar, which is clear, although we should get rid of the prefix _ if that's the API we intend to promote.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10852
Reviewed By: ezyang
Differential Revision: D9496766
Pulled By: gchanan
fbshipit-source-id: 16f39b57536b9707132a5a4d915650c381bb57db
Summary:
The schema_ field is a private and internal cache for nodes, and no
methods meant to manipulate it should be publicly visible. This call
wasn't even necessary at its call site, since removeInput will reset the
schema by itself.
zdevito jamesr66a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10822
Reviewed By: zdevito
Differential Revision: D9498683
Pulled By: apaszke
fbshipit-source-id: 42e1743e3737cb7d81f88e556204487d328c0e47
Summary:
When matching schema, first try to match without adding TensorToNum conversions. Then make another pass where TensorToNum conversions are allowed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10180
Differential Revision: D9438153
Pulled By: eellison
fbshipit-source-id: 80541b5abd06e9d4187e89dda751f44dab6f58c5
Summary:
Since ONNX opset version >5, Reshape changed semantics to take a shape tensor as input instead of relying on the `shape` attribute to decide what shape to reshape to. The ONNXIFI op has been postponing this change as some of the backends, such as TensorRT, were not ready. Now that the backends have adopted these semantics, we can remove the legacy mode and output opset version 7 ONNX models.
This change also flushes out some bugs and new requirements:
- Convert shape info into an int64 tensor
- Fix a bug where we output the shape tensor in the mapped workspace instead of the original workspace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10848
Reviewed By: houseroad
Differential Revision: D9495121
Pulled By: yinghai
fbshipit-source-id: a6f44a89274c35b33fae9a429813ebf21d9a3d1a
Summary:
Currently on PyTorch AMD, memory accesses on the TensorInfo struct contained in the Operators passed into the kernelPointwiseApply kernel lead to hangs on the HCC runtime. Permuting the argument order such that the operator is first alleviates this issue and the kernel hangs disappear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10829
Reviewed By: ezyang
Differential Revision: D9492561
Pulled By: Jorghi12
fbshipit-source-id: d0f0e2ab7180e55846db909f2744b8c8b110205e
Summary:
We no longer use nanopb in PyTorch (or Caffe2) so removing. All protobuf manipulation should go through standard protobuf, which is statically linked inside libcaffe2.so by default.
cc zdevito pjh5 ezyang Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10772
Reviewed By: pjh5
Differential Revision: D9465894
Pulled By: orionr
fbshipit-source-id: 8cdf9f1d3953b7a48478d381814d7107df447201
Summary:
In prep for making FULL_CAFFE2 default, users shouldn't be required to have protobuf installed.
cc pjh5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10771
Reviewed By: pjh5
Differential Revision: D9474458
Pulled By: orionr
fbshipit-source-id: 3e28f5ce64d125a0a0418ce083f9ec73aec62492
Summary:
This is a small part of the effort to remove Tensor as a tagged member in Scalar because it is inconsistent with how we normally do overloads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10828
Differential Revision: D9485049
Pulled By: gchanan
fbshipit-source-id: 103f5cc03bb7775cd2d3a0a5c0c5924838055f03
Summary:
Part of #10774.
This PR does the following:
- Support ast.ExtSlice in the frontend. This is done by returning a
list of ast.Index and ast.Slice.
- Support multidimensional indexing with ints and slices
The general approach is to desugar multidimensional indexing into
at::slice, at::select operations. This is exactly how normal pytorch
does indexing (by desugaring it into at::slice, at::select, and other ops).
I used [this code](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_variable_indexing.cpp) as reference.
We should be able to copy the rest of this to implement the missing
indexing features in script (indexing with ellipses, tensors, sequences, etc).
After I'm done implementing the missing indexing features in future prs, I can try to
templatize python_variable_indexing.cpp so that it can work with both JIT
script and normal pytorch indexing, but right now I'm not sure if that's
a good idea or not.
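A minimal sketch of the multidimensional int/slice indexing this enables in script (assuming the standard `torch.jit.script` decorator):
```
import torch

# The int index desugars to at::select and the slice to at::slice.
@torch.jit.script
def corner(x):
    return x[0, 1:3]
```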
cc zdevito jamesr66a apaszke wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10787
Differential Revision: D9481402
Pulled By: zou3519
fbshipit-source-id: 78c9fa42771a037d157879e23e20b87401cf1837
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10797
A few operators enforce in-place output (e.g., running mean/var for SpatialBN). Functional right now doesn't follow the inplace_enforced_ rules in OpSchema, and therefore RunNetOnce() will fail on OpSchema->Verify(). Edit the output_names in Functional following the rules so the check passes.
Reviewed By: jerryzh168
Differential Revision: D9470582
fbshipit-source-id: 168efeccecc32184bd1d02f3fefe8e61faa4e0f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10835
The last diff on the constructor caused a performance regression on cold runs.
This one tries to fix that.
Reviewed By: highker
Differential Revision: D9489617
fbshipit-source-id: a77c2e2c903a73e2ad9806b4f9c209cdb751442f
Summary:
Added PrefixStore support.
This will make groups backward compatible.
Tests are included too.
```
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ ./FileStoreTest
Using temporary file: /tmp/testoglRl4
Using temporary file: /tmp/testepZIpB
Test succeeded
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ ./TCPStoreTest
Test succeeded
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10762
Differential Revision: D9484032
Pulled By: teng-li
fbshipit-source-id: 85754af91fe3f5605087c4a2f79ae930a9fd1387
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10605
Make isSubgraphMatch return a subgraph and a map from MatchNodes to graph nodes in the result, which makes it easier to write graph fusion logic. Also include some more helper methods for the NN subgraph matcher.
Reviewed By: bwasti
Differential Revision: D9374931
fbshipit-source-id: 3a273295eec81a43027ec3a9e835d27f00853df9
Summary:
apaszke recently ported RNNs from Python into ATen, which means we can replace our implementation in the C++ API (written by ebetica) with the ATen implementation, which cleans up a lot of code (+99, -323). Thanks apaszke!
I also added the `bidirectional` and `batch_first` options to the C++ API RNN options, just because why not.
apaszke ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10761
Differential Revision: D9443885
Pulled By: goldsborough
fbshipit-source-id: b6ef7566b9ced2b2f0b2e1f46c295b6f250c65a8
Summary:
* first integration of MIOpen for batch norm and conv on ROCm
* workaround a ROCm compiler bug exposed by elementwise_kernel through explicit capture of variables in the densest packing
* workaround a ROCm compiler bug exposed by having `extern "C" __host__` as a definition and just `__host__` in the implementation through the hipify script
* use fabs() in accordance with C++11 for double absolute, not ::abs() which is integer-only on ROCm
* enable test_sparse set on CI, skip tests that don't work currently on ROCm
* enable more tests in test_optim after the elementwise_bug got fixed
* enable more tests in test_dataloader
* improvements to hipification and ROCm build
With this, resnet18 on CIFAR data trains without hang or crash in our tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10612
Reviewed By: bddppq
Differential Revision: D9423872
Pulled By: ezyang
fbshipit-source-id: 22c0c985217d65c593f35762b3eb16969ad96bdd
Summary:
Things like `zeros(1,2,3, dtype=torch.int)` are now supported in the script by altering tryMatchSchema to auto-construct the list `[1,2,3]` when it sees inlined members of the list as the last positional arguments.
I suggest reading the commits individually, since the first two incrementally change how we do tryMatchSchema to get it ready for adding vararg list conversion, while the third actually does the modification.
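A minimal sketch of the new call style inside script (the function name is made up for illustration):
```
import torch

# The trailing positional arguments 1, 2, 3 are auto-collected into the size list [1, 2, 3].
@torch.jit.script
def make_zeros():
    return torch.zeros(1, 2, 3, dtype=torch.int)
```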
Closes #10632, closes #8516.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10250
Differential Revision: D9478235
Pulled By: zdevito
fbshipit-source-id: 0c48caf7a6184e463d9293d97015e9884758ef9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10779
The idea is to let classes opt-in to providing these methods
by default.
Reviewed By: jerryzh168
Differential Revision: D9466076
fbshipit-source-id: b6beee084cc71d53ce446cdc171d798eeb48dc12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10766
Added a `Workspace::ForEach(...)` API for accessing the global set of
existing Workspace instances. This is used in the signal handler to print blob
info on the thread receiving a fatal signal.
Reviewed By: mraway
Differential Revision: D9147768
fbshipit-source-id: a94d0b5e6c88390a969ef259ecb8790173af01a4
Summary:
This seems to save a few percent in binary size in libcaffe2_gpu.so, but
the effect may not be real. In fact, deleting some functions can cause
the binary size to increase (perhaps due to alignment issues).
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10707
Differential Revision: D9409009
Pulled By: colesbury
fbshipit-source-id: 282931e562e84e316a33ac6da4788c04c2984f08
Summary:
To prepare THCState for refactoring into ATen, this PR removes unused THCState code paths. In particular, it:
- Removes the UVA Allocator
- Removes the THDefaultDeviceAllocator
- Respects the 1 BLAS and 1 sparse handle per device reality
- Removes kernel p2p access
- Removes setting p2p access
- Removes the GCHandler code path
- Removes many unused THCState_... functions
- Removes THCThreadLocal.h/.cpp
It does not change the preexisting external behavior of any remaining function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9735
Differential Revision: D9438558
Pulled By: SsnL
fbshipit-source-id: dde9acbec237a18bb6b75683e0526f7ff1c9a6ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10509
This diff enables CUDA implementation of LARS operator in caffe2.
Reviewed By: enosair
Differential Revision: D9318356
fbshipit-source-id: 365b9f01e3afd4d9d3ba49155e72e728119f40c5
Summary:
When 0-sized dimension support is added, we expect an empty sparse tensor to be a 1-dimensional tensor of size `[0]`, with `sparseDims == 1` and `denseDims == 0`. Also, we expect the following invariants to be preserved at all times:
```
_sparseDims + _denseDims = len(shape)
_indices.shape: dimensionality: 2, shape: (_sparseDims, nnz)
_values.shape: dimensionality: 1 + _denseDims. shape: (nnz, shape[_sparseDims:])
```
This PR fixes various places where the invariants are not strictly enforced when 0-sized dimension support is enabled.
Tested and `test_sparse.py` passes locally on both CPU and CUDA with the `USE_TH_SIZE_ZERO_DIM` flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9279
Differential Revision: D8936683
Pulled By: yf225
fbshipit-source-id: 12f5cd7f52233d3b26af6edc20b4cdee045bcb5e
Summary:
This uses zou3519's new `torch.broadcast_tensors()` #10075 to make `Categorical.log_prob()` and the `*Normal.__init__()` methods jittable. Previously `.log_prob()` was failing due to calls to `torch._C._infer_size()` with errors like
```
def log_prob(self, value):
if self._validate_args:
self._validate_sample(value)
> value_shape = torch._C._infer_size(value.size(), self.batch_shape) if self.batch_shape else value.size()
E RuntimeError: expected int at position 0, but got: Tensor
```
After this change I'm able to jit many more of Pyro's tests.
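For reference, a small sketch of `torch.broadcast_tensors`, the jittable call used here in place of the `_infer_size`-based shape computation:
```
import torch

a = torch.randn(3, 1)
b = torch.randn(1, 4)
a_b, b_b = torch.broadcast_tensors(a, b)  # both now have shape (3, 4)
```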
Reviewed By: ezyang
Differential Revision: D9477487
Pulled By: apaszke
fbshipit-source-id: 5f39b29c6b8fa606ad30b02fefe2dfb618e883d6
Summary:
When emitting if Branches, check that the types on each value returned are equivalent. As with reassignment of values, tensors are not forced to be the same shape or subtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10281
Differential Revision: D9466566
Pulled By: eellison
fbshipit-source-id: 746abdeb34a0f68806b8e73726ad5003b536911c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9483
The interface is updated to accept the config to construct the predictor.
Reviewed By: highker
Differential Revision: D8872999
fbshipit-source-id: 3ca54d644970823fc33c0ade9a005e12f52e2b24
Summary:
This makes the pybind version of the MPI process group work. The issue is that the tensor list won't remain in scope for the MPI worker thread, so we pass the vector by value instead.
Also added a recv_anysource pybind to make it work. The front-end API will wrap one level up with an int for this function, so taking a tensor should be the easiest way for now.
Also added an abort pybind and fixed the flaky test.
```
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ mpirun -np 8 ProcessGroupMPITest
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10606
Differential Revision: D9474393
Pulled By: teng-li
fbshipit-source-id: cca236c333656431e87d0d3573eeae9232c598b0
Summary:
Augassign (i.e., `x += 1`) gets desugared to an assignment of a binop (`x = x + 1`).
Right now we assert that the RHS of the binop is a tensor,
but it really doesn't have to be because we support scalar/scalar ops and also
list-list ops (i.e., `[1, 2] + [2, 3]`).
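A minimal sketch of a now-accepted augassign on a non-tensor value (the annotation style is just for illustration):
```
import torch

@torch.jit.script
def bump(n: int):
    n += 1   # desugars to n = n + 1; the RHS is a scalar, not a tensor
    return n
```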
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10730
Differential Revision: D9465110
Pulled By: zou3519
fbshipit-source-id: 7b118622701f09ce356aca81b8db743d9611097b
Summary:
Multiple failing external and internal CI signals were ignored when this commit
was landed. goldsborough please fix the test failures and resubmit this change as a
new PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10785
Reviewed By: ezyang
Differential Revision: D9466791
Pulled By: jamesr66a
fbshipit-source-id: b260e93bac95d05fd627c64e620b6aefb5045949
Summary:
ONNX doesn't support this. Instead, flatten the inputs to the ListConstruct op and inline them into the subsequent usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10713
Differential Revision: D9458508
Pulled By: jamesr66a
fbshipit-source-id: 0b41e69320e694bb2f304c6221864a39121e4694
Summary:
I included "legacy" includes in the old spots for Backend, Generator, Layout; it seemed unlikely that the other ones had direct user includes.
This is another step on the path to move Type/Tensor to ATen/core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10740
Reviewed By: ezyang
Differential Revision: D9435888
Pulled By: gchanan
fbshipit-source-id: 89f4f0f445d4498a059d3a79069ba641b22bbcac
Summary:
Don't regex against strings that may have come from the backtrace.
Better to just not regex at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10702
Reviewed By: ezyang
Differential Revision: D9406154
Pulled By: jsrmath
fbshipit-source-id: 9b17abee2a6e737a32c05f1e3963aef4b6638a47
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10053
Tensor in Pytorch 1.0 will have
Tensor -> TensorImpl -> Storage -> StorageImpl
In this diff we split Storage from Tensor in order to align with this design.
We'll have Tensor -> Storage -> StorageImpl after this diff
Reviewed By: ezyang
Differential Revision: D9384781
fbshipit-source-id: 40ded2437715a3a2cc888ef28cbca9a25b1d5350
Summary:
I've tested locally that this works to build static and non-static binaries with and without CUDA.
In terms of ongoing testing, I am working on incorporating this into the release package generation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10754
Differential Revision: D9457423
Pulled By: anderspapitto
fbshipit-source-id: aa1dcb17c67c0f0c493a9cf93aca4a6e06b21666
Summary: The code in Operator::SyncDevice had some duplicate logic and using FinishDeviceComputation sufficed in this case.
Reviewed By: yinghai
Differential Revision: D9348288
fbshipit-source-id: d8d874bab491e6d448fcd5fa561a8b99d502753b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10362
This diff implements a manual export from PyText's CRF module to the caffe2 CRF layer.
Note that most of the changes in caffe2/python/crf.py are just formatting changes, the only relevant change is the new class CRFUtils.
Reviewed By: hikushalhere
Differential Revision: D9234126
fbshipit-source-id: 1a67d709034660e8b3d5ac840560b56de63e3f69
Summary:
```
Use intrusive_ptr in Storage; replace unique_ptr<Storage> with Storage
This patch does two major changes:
- It replaces the use of Retainable in Storage with a new implementation
based on intrusive_ptr. This will be necessary because Caffe2 will
be using this class to implement intrusive_ptrs, and we need to
line these up for the merge. One good thing about the new implementation is
that the default copy/move constructors/assignment operators and destructor
work automatically, instead of needing to be hardcoded into Storage/Tensor.
- It replaces all places where we returned std::unique_ptr<Storage> with
Storage, collapsing an unnecessary double indirection that is no longer
necessary now that we have correctly working copy/move constructors.
I didn't initially want to do step (2), but it was very important to
eliminate all bare uses of new Storage and new StorageImpl, and this making
the API change was the most straightforward way to do this.
HOW TO FIX YOUR CODE IN THE NEW API
- You no longer need to dereference the result of tensor.storage() to pass
it to set. So, instead of:
x.set_(*y.storage());
just write:
x.set_(y.storage());
- If you were accessing methods on StorageImpl via the pImpl() method, you
must use the dot operator to run pImpl(). Even better; just drop pImpl,
we now have method forwarding. So, instead of:
storage->pImpl()->data();
just do:
storage->data();
// storage.pImpl()->data() works too but is not as recommended
- storage->getDevice() is no more; instead use storage->device().index()
MISC CODE UPDATES
- retain, release, weak_retain, weak_release and weak_lock are now
reimplemented using the "blessed API", and renamed to make it
clearer that their use is discouraged.
- nvcc OS X and general OS X portability improvements to intrusive_ptr
- A new comment in intrusive_ptr describing how stack allocated
intrusive_ptr_targets work differently than heap allocated ones
from c10::make_intrusive
CAVEAT EMPTOR
- THStorage_weakRetain used to work on strong pointers, but it NO LONGER
works with intrusive_ptr. You must reclaim the strong pointer into a
real strong pointer, construct a weak pointer from it, and then release
the strong and weak pointers. See StorageSharing.cpp for an example.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10488
Reviewed By: gchanan
Differential Revision: D9306134
Pulled By: ezyang
fbshipit-source-id: 02d58ef62dab8e4da6131e1a24834a65c21048e2
Summary:
The optimized code for `linear()` which uses `addmm` when a bias is given was duplicated three times in the ATen and the C++ API. Let's just have `at::linear` and use that everywhere.
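A small sketch of the addmm fast path that `linear()` takes when a bias is given (shapes are made up):
```
import torch

x, w, b = torch.randn(2, 3), torch.randn(4, 3), torch.randn(4)
fast = torch.addmm(b, x, w.t())  # bias + x @ w^T in one call
assert torch.allclose(fast, torch.nn.functional.linear(x, w, b))
```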
apaszke ezyang (who mentioned this in #10481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10755
Differential Revision: D9443881
Pulled By: goldsborough
fbshipit-source-id: a64862d1649b5961043d58401625ec267d97d9f3
Summary:
zdevito et al came to the conclusion that the ONNX spec does not mandate the widening conversion of integral types when serializing tensor data into raw_data, as opposed to serializing the data into int32_data. PyTorch recently made this change in the export code, which caused import in caffe2 to break because it did not match semantics. This fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10718
Differential Revision: D9423712
Pulled By: jamesr66a
fbshipit-source-id: 479fbae67b028bf4f9c1ca1812c2c7b0c6cccd12
Summary:
Fixes `__getattr__` to adhere to its Python API contract, and wraps the `range()` call in a list since it no longer returns one in Python 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10525
Reviewed By: ezyang
Differential Revision: D9441360
Pulled By: tomdz
fbshipit-source-id: d489c0e7cefecc4699ca866fd55ddbfa629688d4
Summary:
This PR adds support for using custom ops in ScriptModules, the last step for our custom op strategy. You can now write
```
import torch
torch.ops.load_library('libcustom_ops.so')

class Model(torch.jit.ScriptModule):
    def __init__(self):
        super(Model, self).__init__()

    @torch.jit.script_method
    def forward(self, input):
        return torch.ops.custom.op(input) + 1

model = Model()
model.forward(torch.ones(5)) # Works
model.save("model.pt") # Works
model = torch.jit.load("model.pt") # Works
```
You can then load the `model.pt` in C++ and execute its `forward` method!
Missing for this was the fact that the script compiler didn't know to convert `ops.custom.op` into a `BuiltinFunction` which then emits a function call. For this I came up with the following strategy inside `torch/csrc/jit/script/init.cpp`:
1. When we access `torch.ops`, we return a `CustomOpValue` (subclass of `PythonValue`), whose purpose is only to return a `CustomOpNamespaceValue` (subclass of `PythonValue`) whenever something under it is accessed.
2. `CustomOpNamespaceValue` will then return a `BuiltinFunction` for each field accessed on it.
This doesn't reduce performance for any calls that are not to `torch.ops` (as opposed to inspecting every function call's name at the call site, for example).
I also had to fix `BuiltinFunction` to not assume the namespace is always `aten::`.
A lot of other changes are just tidying up the Python and C++ test harness before I integrate it in CI.
zdevito dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10610
Differential Revision: D9387832
Pulled By: goldsborough
fbshipit-source-id: c00f431db56c7502a66fe1f813fe78067f428ecb
Summary:
This should resolve "error C2280: 'std::unique_ptr<caffe2::ObserverBase<caffe2::OperatorBase>,std::default_delete<_Ty>> &std::unique_ptr<_Ty,std::default_delete<_Ty>>::operator =(const std::unique_ptr<_Ty,std::default_delete<_Ty>> &)': attempting to reference a deleted function" from Visual Studio.
It should also make the error message more human-readable in case something is really messed up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10593
Reviewed By: orionr
Differential Revision: D9436397
Pulled By: mingzhe09088
fbshipit-source-id: 31711667297b4160196134a34365da734db1c61d
Summary:
Let's run CI tests to see what fails given the changes that just landed in https://github.com/pytorch/pytorch/pull/10624
cc mingzhe09088 ezyang Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10692
Reviewed By: mingzhe09088
Differential Revision: D9423617
Pulled By: orionr
fbshipit-source-id: 3bda1f118d13f8dd8e823727c93167cae747d8cf
Summary:
Set the build environment before installing sccache in order to make sure the docker images have the links set up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10640
Reviewed By: yf225
Differential Revision: D9399593
Pulled By: Jorghi12
fbshipit-source-id: a062fed8b7e83460fe9d50a7a27c0f20bcd766c4
Summary:
This is part of moving the (base) Type to ATen/core; Some Type methods have default argument of type THNN Reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10703
Differential Revision: D9406060
Pulled By: gchanan
fbshipit-source-id: 789bb3387c58bd083cd526a602649105274e1ef6
Summary:
This will make the common case more natural (no need to do `_construct_empty_tensor_list()`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10705
Differential Revision: D9411622
Pulled By: michaelsuo
fbshipit-source-id: 2d91fbc5787426748d6e1c8e7bbeee737544dc96
Summary:
The broadcast is used by default when the opset version is greater than 6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10108
Reviewed By: bddppq
Differential Revision: D9176934
Pulled By: houseroad
fbshipit-source-id: b737bd87b0ddc241c657d35856d1273c9950eeba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10651
EnsureCPUOutputOp will copy the input from another Context to CPU, but currently there is no guarantee that the Copy will be executed.
Differential Revision: D9390046
fbshipit-source-id: af3ff19cf46560264cb77d2ab8821f0cc5be74f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10551
Renaming from "subtree" -> "subgraph" to improve clarity of the subgraph matcher APIs, since they now support DAGs.
This is pure renaming; no functionality changes.
Reviewed By: bwasti
Differential Revision: D9348311
fbshipit-source-id: 4b9267845950f3029dfe385ce3257d3abb8bdad4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10549
Support DAG matching in nomnigraph. This is done by maintaining a map from nodes in the MatchGraph to nodes in the input graph, and additionally enforcing that the same node in the MatchGraph must match the same node in the input graph (with the exception of multiplicity, i.e. when count != 1 on the MatchGraph node).
In a follow up diff, I'll rename the API that refers to subtree as subgraph to improve clarity.
Reviewed By: bwasti
Differential Revision: D9347322
fbshipit-source-id: 171491b98c76852240a253279c2654e96dd12632
Summary:
Some more `ATEN_API` additions for hidden visibility.
Running CI tests to see what fails to link.
cc Yangqing mingzhe09088 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10624
Reviewed By: mingzhe09088
Differential Revision: D9392728
Pulled By: orionr
fbshipit-source-id: e0f0861496b12c9a4e40c10b6e0c9e0df18e8726
Summary:
Minor fix for the cuDNN cache. Previously, when an RNN function was called on GPU 0 and then on GPU 1 in eval mode, we would skip event re-initialization, causing an incorrect resource handle error when trying to record the event.
soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10662
Reviewed By: soumith
Differential Revision: D9393629
Pulled By: apaszke
fbshipit-source-id: e64c1c1d2860e80f5a7ba727d0b01aeb5f762d90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9888
Limiter cannot be shared or copied; just pass it to the first reader.
Reviewed By: xianjiec
Differential Revision: D9008871
fbshipit-source-id: e20cd785b26b1844e156efc3833ca77cfc3ffe82
Summary:
Trigonometry functions are newly added to ONNX in a recent PR https://github.com/onnx/onnx/pull/869
This PR makes pytorch support exporting graphs with trigonometry functions.
This PR might need to wait until it is ready to change
```python
_onnx_opset_version = 6
```
to
```python
_onnx_opset_version = 7
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/7540
Differential Revision: D9395041
Pulled By: bddppq
fbshipit-source-id: bdf3e9d212b911c8c4eacf5a0753bb092e4748d2
Summary:
There is no reason that users should need an extra import to use DistributedSampler.
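A minimal sketch, assuming DistributedSampler is now reachable directly under torch.utils.data; num_replicas/rank are passed explicitly so no process group needs to be initialized:
```
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(8))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0)
indices = list(sampler)  # this rank's shard of the dataset indices
```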
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10671
Differential Revision: D9395189
Pulled By: SsnL
fbshipit-source-id: 8f41d93813c8fb52fe012f76980c6a261a8db9b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10478
- Removed Backend constructor from Device, and fixed all
use-sites to use DeviceType::CPU instead of kCPU, or
use a new function backendToDeviceType to perform
the conversion.
- New method device_type() on Type; it gives you the
underlying device type, e.g., CPU for SparseCPU.
- We add backward compatibility for kCPU/kCUDA uses,
by introducing a new special type which is implicitly
convertible to both DeviceType and Backend. As long as
you don't define a function that's overloaded on both
DeviceType and Backend (but not on BackendOrDeviceType),
the implicit conversions will ensure that uses
of at::Device(at::kCPU) keep working. We fixed use-sites in
the library, but did NOT fix sites in the test code, so that
we can exercise this BC code.
Reviewed By: Yangqing
Differential Revision: D9301861
fbshipit-source-id: 9a9d88620500715c7b37e655b4fd761f6dd72716
Summary:
... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from #10026
This is done through the following:
1) Absorb starting chunks into FusionGroup as a part of the graph fuser
pass.
2) When compiling a kernel, emit a `std::vector<ConcatDesc>` that describes if an input (of the original graph) will be chunked.
3) When launching a kernel, use the `std::vector<ConcatDesc>` to chunk an
input tensor on the CPU. This chunk directly takes in an at::Tensor and creates
four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors.
- Expect test and correctness test to see if a single chunk is fused
by the graph fuser
- Correctness test for a variety of chunks (dimension = beginning,
middle, end) and tensors (contiguous, non-contiguous, edge case
(splitSize = 1) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel and
correctness test.
cc zdevito apaszke
LSTM forward pass, 1 layer, 512 hidden size and input size, 100 seq length, requires_grad=False on all inputs and weights.
After changes:
```
thnn cudnn jit
8.8468 6.5797 9.3470
```
Before changes:
```
thnn cudnn jit
9.9221 6.6539 11.2550
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10178
Differential Revision: D9382661
Pulled By: zou3519
fbshipit-source-id: 1f8a749208fbdd45559775ce98cf4eb9558448f8
Summary:
Take 2 of #10543.
The problem was that between the commit and the merge, one more entry point, `tools/build_libtorch.py`, was added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10659
Differential Revision: D9393540
Pulled By: soumith
fbshipit-source-id: 8ebfed600fc735fd1cb0489b161ec80e3db062e0
Summary:
Fixes#10096
If the only thing preventing a simple mappable operator from being fused
into a fusion group is that its Tensor inputs are not of the same shape as the
output, then the graph fuser inserts explicit expand nodes for those
inputs.
This helps the graph fuser not miss out on any fusion opportunities
involving simple mappable operations that have Tensor inputs. This PR
doesn't do anything for the scalar case; that can be addressed later.
Test Plan
- Simple expect test case
- Added expect tests for a raw LSTMCell. The expands help speed up the
forwards pass by allowing more operations to be fused into the LSTMCell's single
FusionGroup.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10325
Differential Revision: D9379308
Pulled By: zou3519
fbshipit-source-id: 86d2202eb97e9bb16e511667b7fe177aeaf88245
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10630
`onnxTensorDescriptorV1.name` points to the string buffer. We use a vector of strings to serve as the storage. This means we cannot reallocate the vector because that may invalidate the `onnxTensorDescriptorV1.name` pointers. Solution is to reserve a large enough vector so that it won't reallocate.
Reviewed By: bddppq, houseroad
Differential Revision: D9381838
fbshipit-source-id: f49c5719aafcc0829c79f95a2a39a175bcad7bfe
Summary:
This is on the way to resolving #9940.
Fixes#10501
This PR modifies graph fuser to fuse operations that have constant
scalar arguments. These constant scalar arguments are directly inlined
into the kernel body.
The context for this is that LSTM backward (in particular, sigmoid
backward) has many add(x, 1.) operations. This PR should be sufficient for
LSTM backward to get fused by the graph fuser.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10511
Differential Revision: D9378896
Pulled By: zou3519
fbshipit-source-id: 6a7a2987f5b6e8edaaf4b599cd200df33361650f
Summary:
This is still not the final PR, but it removes all blockers for actually using the RNN functions directly in the JIT. Next patch should be final, and will actually remove the symbolic_override code, and change it to proper symbolics for those ATen functions. Turns out the symbolic code can be also cleaned up a bit, and I'll do that too.
zdevito ezyang
colesbury (for minor DispatchStub.h) changes
There was no way to handle those in the JIT for now, and they turned
out to be completely unnecessary. It should make the Python and C++
module code much simpler too, since all the logic is now centralized
in the native functions.
The downside is that RNN modules no longer own their dropout buffers,
which are shared per-device instead (with appropriate locking and
synchronization). This might appear as a perf regression at first, but
in reality it's highly unlikely that anyone will want to run cuDNN RNNs
on the same GPU in parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10581
Reviewed By: colesbury
Differential Revision: D9365541
Pulled By: apaszke
fbshipit-source-id: 3ef8677ee5481bae60c74a9117a2508665b476b5
Summary:
This PR is the first step toward integrating the torch.nn library with the JIT. It adds tests for nn functional interfaces in trace/script mode, and tries to find the differences between torch.nn.functional ops and the ATen ops, to see what work needs to be done in order to support the full set of nn functionals in script mode.
Some statistics in summary:
- 84 useful functions in torch.nn.functional in total (the number does not include helper funcs and deprecated funcs in torch.nn.functional).
- 7 functions/ops do not support higher-order gradients, so they are excluded from the whole test.
- 36 functions differ from their ATen ops for various reasons. Among those 36 functions, a bunch (roughly 10-15) are just naming differences and simple transformations using other ops inside the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10409
Differential Revision: D9350694
Pulled By: wanchaol
fbshipit-source-id: 8fce6f30d8d25ace5a544a57b219fe61f5a092f8
Summary:
Inlining if branches which have constant inputs. If an if node gets inlined, the set of mutated variables returned by its ancestors may have changed. In the following example the block should
return a mutated set of (a) and not (a, b).
```
if cond:
if True:
a = a - 1
else:
b = b - 1
```
To calculate this we recursively update mutate variables in if branches from the leaf nodes up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10084
Reviewed By: michaelsuo
Differential Revision: D9340429
Pulled By: eellison
fbshipit-source-id: b0dd638a5cace9fdec3130460428fca655ce4b98
Summary:
https://github.com/pytorch/pytorch/pull/10100 recently made nomnigraph take external input/output. This PR makes the following adjustments:
0. Relax some of the conditions on external input.
1. Update NNModule inputs/outputs when pruning the input/output.
2. Avoid copying external input/output, as nomnigraph already takes care of it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10598
Reviewed By: bwasti
Differential Revision: D9371730
Pulled By: yinghai
fbshipit-source-id: 9273be5041dc4cc8585587f47cb6721e518a06a8
Summary:
Custom Python installations that have no aliases to `python` or `python3` can't be found by CMake's `findPythonInterp` without an extra CMake argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10543
Differential Revision: D9378844
Pulled By: ezyang
fbshipit-source-id: 022e20aab7e27a5a56b8eb91b6026151116193c7
Summary:
Fix "error LNK2019: unresolved external symbol" from "CAFFE_KNOWN_TYPE" in tests where we should use dllexport instead of AT_CORE_API(=dllimport).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10602
Differential Revision: D9377394
Pulled By: Yangqing
fbshipit-source-id: 993062a461ffce393f2321c5391db5afb9b4e7ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10282
This diff removes the unused/deprecated features from the code base.
Reviewed By: manojkris
Differential Revision: D9169859
fbshipit-source-id: d6447b7916a7c687b44b20da868112e6720ba245
Summary:
This is the last step in the custom operator implementation: providing a way to build from C++ and Python. For this I:
1. Created a `FindTorch.cmake` taken largely from ebetica with a CMake function to easily create simple custom op libraries
2. Created a `torch/op.h` header for easy inclusion of necessary headers,
3. Created a test directory `pytorch/test/custom_operator` which includes the basic setup for a custom op.
1. It defines an op in `op.{h,cpp}`
2. Registers it with the JIT using `RegisterOperators`
3. Builds it into a shared library via a `CMakeLists.txt`
4. Binds it into Python using a `setup.py`. This step makes use of our C++ extension setup that we already have. No work, yey!
The pure C++ and the Python builds are separate and not coupled in any way.
zdevito soumith dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10226
Differential Revision: D9296839
Pulled By: goldsborough
fbshipit-source-id: 32f74cafb6e3d86cada8dfca8136d0dfb1f197a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10599
Not spawning threads with spin-lock synchronization is bad because they will switch to `condvar` wait, which increases wake-up latency next time they are needed.
Reviewed By: ajtulloch
Differential Revision: D9366664
fbshipit-source-id: 3b9e4a502aeefaf0ddc4795303a855d98980b02e
Summary:
This commit adds the ``buffers()`` and ``named_buffers()`` methods as
analogues of ``parameters()`` and ``named_parameters()``.
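A small sketch of the new accessors (the module choice is arbitrary; BatchNorm registers running_mean/running_var as buffers):
```
import torch

bn = torch.nn.BatchNorm1d(4)
for name, buf in bn.named_buffers():
    print(name, tuple(buf.shape))  # mirrors named_parameters(), but for buffers
```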
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554
Reviewed By: SsnL
Differential Revision: D9367762
Pulled By: jma127
fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e
Summary:
A continuation of https://github.com/pytorch/pytorch/pull/10504 for GPU, torch, etc. builds.
I was testing with
```
FULL_CAFFE2=1 python setup.py build_deps | tee ~/log.txt
cat ~/log.txt | egrep 'undefined refer' | sort | less
```
I'll rebase on master when Yangqing's changes in 10504 land, but putting up for some testing.
cc mingzhe09088 anderspapitto ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10507
Reviewed By: Yangqing
Differential Revision: D9359606
Pulled By: orionr
fbshipit-source-id: c2a3683b3ea5839689f5d2661da0bc9055a54cd2
Summary:
Resubmit #10416 with fixed tests. This removes the implicit GPU-to-CPU conversion when calling numpy, so the behavior matches other methods.
It requires users to move the tensor back to CPU with cpu() before calling numpy functions on it.
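A minimal sketch of the now-required explicit copy (assumes a CUDA device is available):
```
import torch

t = torch.ones(3, device="cuda")
arr = t.cpu().numpy()  # calling t.numpy() directly now raises instead of copying implicitly
```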
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10553
Differential Revision: D9350212
Pulled By: ailzhang
fbshipit-source-id: 9317d8fea925d4b20ae3150e2c1b39ba5c9c9d0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10494
Adding the AllreduceBcube routines as they are now available in gloo.
Reviewed By: wesolwsk
Differential Revision: D8269473
fbshipit-source-id: 6a3a32291bbf1fbb328b3ced0f2a753dc5caf4e5
Summary:
The ONNXIFI backend will absorb the constant weight in Conv, so we should not add it as an input. This is just a test artifact. Note that the Onnxifi transformer will do the right thing when cutting the graph to absorb the weights.
rdzhabarov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10575
Reviewed By: houseroad
Differential Revision: D9357339
Pulled By: yinghai
fbshipit-source-id: a613fa3acafa687295312f5211f8e9d7f77b39cd
Summary:
delete build_caffe2.sh, replace with build_libtorch.py as suggested by peter (and copy-pasted from his draft PR). This ensures that all consumers of the torch CMake file go through as unified a path as possible.
In order to change the surrounding infrastructure as little as possible, I made some tweaks to enable build_pytorch_libs.sh to generate the test binaries relative to the current directory, rather than hardcoding to pytorch/build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10508
Differential Revision: D9354398
Pulled By: anderspapitto
fbshipit-source-id: 05b03df087935f88fca7ccefc676af477ad2d1e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10546
Have you ever written an operator<< overload in the caffe2 namespace
in a core Caffe2 header, and then been stunned when some completely
unrelated code started breaking? This diff fixes this problem!
The problem looks like this:
1. You're building against a really old version of glog (think 0.3.2,
or something like that)
2. This version of glog defines operator<< overloads for std containers
in the global namespace
3. You add a new overload in your current namespace (e.g., caffe2).
Congratulations: this overload is *preferentially* chosen over
the global namespace one for all calls to << in that namespace.
And since it doesn't actually have std::vector overloads, unrelated
Caffe2 code breaks.
Newer versions of glog have a fix for this: they have the line:
namespace std { using ::operator<<; }
in their header. So let's help old versions of glog out and do this ourselves.
In our new world order, operator<< overloads defined in the global namespace
won't work (unless they're for std containers, which work because of ADL).
So this diff also moves all those overloads to the correct namespace.
Reviewed By: dzhulgakov
Differential Revision: D9344540
fbshipit-source-id: 6246ed50b86312668ebbd7b039fcd1233a3609cf
Summary:
This PR removes the `using Tensor = autograd::Variable;` alias from `torch/tensor.h`, which means `torch::Tensor` is now `at::Tensor`. This PR fixes up some last uses of `.data()` and tidies up the resulting code. For example, I was able to remove `TensorListView` such that code like
```
auto loss = torch::stack(torch::TensorListView(policy_loss)).sum() +
torch::stack(torch::TensorListView(value_loss)).sum();
```
is now
```
auto loss = torch::stack(policy_loss).sum() + torch::stack(value_loss).sum();
```
CC jgehring
ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10516
Differential Revision: D9324691
Pulled By: goldsborough
fbshipit-source-id: a7c1cb779c9c829f89cea55f07ac539b00c78449
Summary:
Fixed the NCCL test, which is not run in CI. We should enable it soon.
```
~/new_pytorch/pytorch/test$ python test_c10d.py
...............
----------------------------------------------------------------------
Ran 15 tests in 13.099s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10557
Reviewed By: ailzhang
Differential Revision: D9353286
Pulled By: teng-li
fbshipit-source-id: 5a722975beaa601203f51c723522cc881f2d2090
Summary:
Properly annotated all APIs for the CPU front end. Checked with CMake using
`cmake -DUSE_ATEN=ON -DUSE_CUDA=OFF -DBUILD_ATEN=ON`
and the resulting libcaffe2.so has about 11k symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10504
Reviewed By: ezyang
Differential Revision: D9316491
Pulled By: Yangqing
fbshipit-source-id: 215659abf350af7032e9a4b0f28a856babab2454
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10439
Update Im2Col-related code in preparation for group conv in NHWC order.
Reviewed By: houseroad
Differential Revision: D9285344
fbshipit-source-id: 1377b0243acb880d2ad9cf73084529a787dcb97d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10528
Adding two features to core and model_helper:
- reroute_tensor, which supports op insertion at the net level
- model_helper complete net and cut net, used for full graph analysis
Differential Revision: D9330345
fbshipit-source-id: 56341d3f500e72069ee306e20266c8590ae7985a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10053
Tensor in Pytorch 1.0 will have
Tensor -> TensorImpl -> Storage -> StorageImpl
In this diff we split Storage from Tensor in order to align with this design.
We'll have Tensor -> Storage -> StorageImpl after this diff
Reviewed By: dzhulgakov
Differential Revision: D9076734
fbshipit-source-id: ea9e1094ecf8c6eaeaa642413c56c6a95fb3d14e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10526
Resubmitting these changes. Previously they caused issues with multifeed, which I fixed with D9280622
Reviewed By: yinghai
Differential Revision: D9327323
fbshipit-source-id: ec69428039b45c6221a5403b8fe9a83637857f04
Summary:
Implemented via a wrapper, thank you Richard for the suggestion!
Fixes: #9929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10067
Differential Revision: D9083388
Pulled By: soumith
fbshipit-source-id: 9ab21cd35278b01962e11d3e70781829bf4a36da
Summary:
This should make ASAN tests run faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9902
Differential Revision: D9032986
Pulled By: yf225
fbshipit-source-id: 3d2edec2d7ce78bc995d25865aa82ba6d3f971d0
Summary:
Pull in a fix in FP16 for a compilation bug when using the Intel Compiler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10548
Differential Revision: D9349469
Pulled By: Maratyszcza
fbshipit-source-id: 43e6dc5c3c18319d31eca23426770c73795feec5
Summary:
In my environment, it looks like setup.py hangs when running
```
FULL_CAFFE2=1 python setup.py build_deps
```
Removing this fixes things, but we might also want to look at `tests_require`, which came over from `setup_caffe2.py`.
cc pjh5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10530
Differential Revision: D9349597
Pulled By: orionr
fbshipit-source-id: 589145eca507dfaf16386884ee2fbe60299660b4
Summary:
This PR removes a couple of macros throughout TH* as part of the refactoring effort for ATen. Removing these macros should avoid confusion among developers who are trying to move things from TH* to ATen. This PR is part of the THCNumerics deprecation that I have been working on, following up on mruberry's https://github.com/pytorch/pytorch/pull/9318. I am separating these two commits to check that removing these macros doesn't upset the PyTorch public CI or internal builds.
- Commit 1248de7baf removes the code paths guarded by the `CUDA_HALF_INSTRUCTIONS` macro. Since the macro was removed in commit 2f186df52d, `ifdef CUDA_HALF_INSTRUCTIONS` would return false, and hence the code path that is kept after this change is for the false case of `ifdef CUDA_HALF_INSTRUCTIONS`.
- Commit 520c99b057 removes the code paths guarded by `CUDA_HALF_TENSOR` macro. Since Pytorch now provides support for only CUDA 8.0 and above, `CUDA_HALF_TENSOR` is always true since CUDA 8.0 satisfies `CUDA_HAS_FP16` and hence, the code path that is kept after this change is for the true case of `ifdef CUDA_HALF_TENSOR`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10147
Differential Revision: D9345940
Pulled By: soumith
fbshipit-source-id: c9392261dd432d304f1cdaf961760cbd164a59d0
Summary:
This is the first of two changes that are supposed to improve how we handle RNNs in the JIT. They still get traced as `PythonOp`s, but now it will be much easier to actually expose them to the JIT as e.g. `aten::lstm`, and ignore the Python interpreter entirely. This needs some symbolic adjustments that will be part of a second PR.
Even when we fix symbolics, there will still be a bit of a problem with statefulness of the cuDNN API (we need a mutable cache for the dropout state, but our IR has no way of representing that).
zdevito ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10481
Reviewed By: ezyang
Differential Revision: D9341113
Pulled By: apaszke
fbshipit-source-id: 0ae30ead72a1b12044b7c12369d11e5ca8ec30b5
Summary:
In the shortcut for n_sample=1, when category 0 has 0 weight,
we should not map the (uniform) sample 0 to category 0.
The conversion uniform->multinomial was apparently written to work on
a (0,1] range (like curand uses), but PyTorch uses a [0,1) range.
Fixes: #4858. Thank you, Roy Fejgin for reporting.
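A minimal sketch of the behavior this fix guarantees: a category with zero weight must never be drawn, even through the n_sample=1 shortcut.
```python
import torch

weights = torch.tensor([0.0, 1.0, 1.0])  # category 0 has zero weight
for _ in range(1000):
    idx = torch.multinomial(weights, num_samples=1, replacement=True).item()
    assert idx != 0  # before the fix, uniform sample 0 could map to category 0
```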
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9960
Reviewed By: soumith
Differential Revision: D9341793
Pulled By: ailzhang
fbshipit-source-id: 6b1a96419a7bc58cc594f761f34c6408ff6354cf
Summary:
Since we can't specify a version number with `choco install curl`, we should not assume that `7.57.0` is the curl version in the Windows AMI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10476
Differential Revision: D9303129
Pulled By: yf225
fbshipit-source-id: 198544be68330860fbcf93c99bc995f4e280bda7
Summary:
Support broadcasting in _kl_categorical_categorical.
This makes it possible to do:
```
import torch.distributions as dist
import torch
p_dist = dist.Categorical(torch.ones(1,10))
q_dist = dist.Categorical(torch.ones(100,10))
dist.kl_divergence(p_dist, q_dist)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10533
Differential Revision: D9341252
Pulled By: soumith
fbshipit-source-id: 34575b30160b43b6c9e4c3070dd7ef07c00ff5d7
Summary:
Two tests in the 'nn' test bucket may fail when the torch.half
(float16) data type is used. The assertions used in the tests
intend to allow slight floating point imprecision in the results,
but the tolerances used for the comparisons are too strict for
the half type.
Relax the tolerances so that slight float16 imprecision won't
cause test failures.
The affected tests are:
- test_variable_sequence_cuda
- test_Conv2d_groups_nobias
For more information, see issue:
https://github.com/pytorch/pytorch/issues/7420
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10519
Differential Revision: D9343751
Pulled By: soumith
fbshipit-source-id: 90aedf48f6e22dd4fed9c7bde7cd7c7b6885845a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10512
SubtreeMatchCriteria now becomes a graph of MatchNode
MatchNode consists of NodeMatchCriteria, nonTerminal and count. This is a cleaner internal representation of the data structure and will bring us much closer to DAG matching.
Note that I still keep the debugString method because convertToDotGraph doesn't currently work with Subgraph.
Reviewed By: bwasti
Differential Revision: D9321695
fbshipit-source-id: 58a76f007a9a95d18cf807d419c2b595e9bc847f
Summary:
Optimize max and min reductions for the ATen CPU path; the current code path from the TH module runs sequentially on the CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10343
Differential Revision: D9330799
Pulled By: ezyang
fbshipit-source-id: 5b8271e0ca3e3e73f88a9075aa541c8756001b7c
Summary:
I've implemented affine grid generation for volumetric (5d) inputs. The implementation is based off of the spatial implementation, extended by one dimension. I have a few questions about my implementation vs. the existing one that I will add inline.
I have some extensive test cases for the forward pass here: https://gist.github.com/elistevens/6e3bfb20d8d0652b83bd16b3e911285b However, they use `pytest.fixture` extensively, so I'm not sure the best way to incorporate them into the pytorch test suite. Suggestions? I have not tested backwards at all.
Diff probably best viewed with whitespace changes ignored.
Thanks for considering!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8322
Differential Revision: D9332335
Pulled By: SsnL
fbshipit-source-id: 1b3a91d078ef41a6d0a800514e49298fd817e4df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10395
Order switch ops (NCHW2NHWC and NHWC2NCHW) were only supporting 2D images.
This diff generalizes them to 1D and 3D, and also adds a unit test we didn't have.
Reviewed By: protonu
Differential Revision: D9261177
fbshipit-source-id: 56e7ec54c9a8fb71781ac1336f3f28cf024b4bda
Summary:
We can't rely on the ATen fallback pathway here because we need to parse out the constant attributes explicitly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10513
Reviewed By: dzhulgakov
Differential Revision: D9322133
Pulled By: jamesr66a
fbshipit-source-id: 52af947e6c44532ef220cb4b94838ca838b5df06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10390
Fixed a bug in box_with_nms_limit where it may produce more bounding boxes than specified.
* The original code first finds the threshold at the 'detections_per_im'-th box, and filters out boxes below that threshold.
* In some cases where multiple boxes share that threshold value, the op will return more boxes than 'detections_per_im'.
Reviewed By: wat3rBro
Differential Revision: D9252726
fbshipit-source-id: 63f40829bcd275cb181692bc7547c384cee01499
Summary:
Background: we run pytorch in embedded C++ pipelines, running in C++ GUIs in https://github.com/Kitware/VIAME and without this addition, the call was failing with the below error, but only on certain windows platforms/configurations:
OSError: [WinError 6] The handle is invalid
At:
C:\Program Files\VIAME\Python36\site-packages\torch\cuda\__init__.py(162): _lazy_init
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): <lambda>
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(182): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(176): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): cuda
C:\Program Files\VIAME\lib\python3.6None\site-packages\kwiver\arrows\pytorch\pytorch_resnet_f_extractor.py(74): __init__
C:\Program Files\VIAME\lib\python3.6None\site-packages\kwiver\processes\resnet_descriptors.py(132): _configure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10379
Differential Revision: D9330772
Pulled By: ezyang
fbshipit-source-id: 657ae7590879004558158d3c4abef2ec11d9ed57
Summary:
Breaking out of #8338
This PR is a workaround for a bug with CUDA9.2 + GCC7.
Here is the error this PR fixed:
.../pytorch/caffe2/operators/elementwise_ops.h: In constructor ‘caffe2::BinaryElementwiseWithArgsOp<InputTypes, Context, Functor, OutputTypeMap>::BinaryElementwiseWithArgsOp(const caffe2::OperatorDef&, caffe2::Workspace*)’:
.../pytorch/caffe2/operators/elementwise_ops.h:106:189: error: ‘GetSingleArgument<bool>’ is not a member of ‘caffe2::BinaryElementwiseWithArgsOp<InputTypes, Context, Functor, OutputTypeMap>’
BinaryElementwiseWithArgsOp(const OperatorDef& operator_def, Workspace* ws)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10510
Reviewed By: orionr
Differential Revision: D9319742
Pulled By: mingzhe09088
fbshipit-source-id: ce59e3db14539f071f3c20301e77ca36a6fc3f81
Summary:
Previously, it was easy to write `x[0].accessor<float, 2>()`. However, `x[0]` is a temporary, so the accessor would point to invalid strides/sizes and probably segfault. With this change, such unsafe code is a compile error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10518
Reviewed By: goldsborough
Differential Revision: D9329288
Pulled By: ebetica
fbshipit-source-id: d08763bee9a19a898b9d1ea5ba648f27baa1992f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10514
Fix the bug that breaks the Windows build in fused_rowwise_random_quantization_ops.h.
Reviewed By: ezyang, jspark1105
Differential Revision: D9322291
fbshipit-source-id: a6a27e87423b6caa973414ffd7ccb12076f2e1e4
Summary:
setup.py is the official install script; setup_caffe2.py is not used any more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10520
Reviewed By: yinghai
Differential Revision: D9325548
Pulled By: bddppq
fbshipit-source-id: 3dda87f3dff061b574fd1d5c91859044f065ee33
Summary:
After this, all combinations of {String frontend, Python AST Frontend}{Python 3-style type annotations, MyPy-style type comments}{Script method, Script function} should properly accept type annotations.
Possible TODOs:
- Clean up the functions marked HACK
- Clean up the Subscript tree-view to better match the Python AST versions
- Can we use this for Python functions? That's the only place annotations.get_signature() is still needed
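For illustration, a small sketch (not taken from the PR's tests) of a script function annotated with a MyPy-style type comment, one of the combinations described above:
```python
import torch

@torch.jit.script
def scaled_add(x, y, alpha):
    # type: (Tensor, Tensor, float) -> Tensor
    return x + alpha * y

print(scaled_add(torch.ones(2), torch.ones(2), 0.5))
```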
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10279
Differential Revision: D9319726
Pulled By: jamesr66a
fbshipit-source-id: b13f7d4f066b0283d4fc1421a1abb9305c3b28fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10267
isSubtreeMatch now returns a SubtreeMatchResult which contains a match flag and a debugMessage string that contains the reason why a subtree is not matched (if requested).
Reviewed By: bwasti
Differential Revision: D9182429
fbshipit-source-id: 530591fad592d02fb4c31fc398960a14ec90c86a
Summary:
Provided Python bindings for these four ops. Also provided an NCCL binding test.
Based on https://github.com/pytorch/pytorch/pull/10058
Please only review init.cpp and the test file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10159
Reviewed By: yf225
Differential Revision: D9323192
Pulled By: teng-li
fbshipit-source-id: b03822009d3a785ec36fecce2fc3071d23f9994e
Summary:
Added
- Reduce (both NCCL and MPI)
- AllGather (both NCCL and MPI)
- Gather (MPI)
- Scatter (MPI)
for c10d process groups. This basically finalizes all supported ops for C10d to match THD.
All ops are tested as well.
```
mpirun -np 8 ./ProcessGroupMPITest
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
```
```
./ProcessGroupNCCLTest
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10058
Reviewed By: yf225
Differential Revision: D9316312
Pulled By: teng-li
fbshipit-source-id: 6a6253268d34332327406b1f87335d1402f7133f
Summary:
After talking to users of the C++ API we found that having the tensor type be `autograd::Variable` causes more complications than having it be `at::Tensor`. It used to be a problem because `at::Tensor` didn't have the "autograd API" of variable (e.g. `detach()` or `grad()` methods), but those methods are now on `at::Tensor`. As such, we want to make a last big breaking change to have the tensor type be `at::Tensor`, while factory methods like `torch::ones` will return `Variable`s disguised as `at::Tensor`. This will make many things easier, like calling functions in ATen that take vectors of tensors.
This PR makes a small step in this direction by updating the optimizer classes to not use `.data()` on `Variable` to access the underlying `at::Tensor`. Using `.data()` is effectively a hack to work around our modification rules for tensors that require grad. The proper way of doing things is to use `with torch.no_grad` or equivalently `NoGradGuard` in C++ to guard in-place operations.
The next step can then simply redefine `torch::Tensor` to be `at::Tensor`. This transition should be smooth, since all methods available on `Variable` are at this point available on `at::Tensor`.
For this PR I:
1. Modified the implementations of optimizers to not use `.data()`. This means the implementations are now different from PyTorch, which still uses the legacy method of using `.data`.
2. To properly verify (1), I added more fine-grained test cases to our optimizer tests, e.g. `SGD` with and without `weight_decay`, then with `nesterov` etc. Generally more tests = more happy!
3. Minor cleanup of the optimizer codebase
ebetica apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10490
Differential Revision: D9318229
Pulled By: goldsborough
fbshipit-source-id: fb386700f37840542bc5d323f308ea88fe5ea5c5
Summary:
Now, when running `python test/onnx/test_operators.py --no-onnx`, we won't introduce any ONNX Python dependency. (No onnx/protobuf Python packages need to be installed.)
The major changes:
- output pbtxt from the C++ exporter directly, so the floating-point format may be slightly different. (This should be fine, since it's just to guard ONNX exporting.)
- ONNX Python packages are only imported if we run the ONNX-related checks. Those checks are disabled when using the `--no-onnx` flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10151
Reviewed By: jamesr66a
Differential Revision: D9130706
Pulled By: houseroad
fbshipit-source-id: ea28cf5db8399929179698ee535137f209e9ce6f
Summary:
There are three classes, `RNNCell`, `LSTMCell`, and `GRUCell`, inheriting from `RNNCellBase`, all defining the identical initialization function `reset_parameters`. Let's move it to the common base.
Another option is to have different initialization for RNN, LSTM and GRU. Maybe those weights whose output is processed with sigmoid (i.e. gain=1) should be initialized differently from those going to tanh (gain=5/3)?
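A minimal sketch of the refactor (not the actual torch.nn source, and the class names are made up): the shared initializer lives once in the common base, and each cell class simply inherits it.
```python
import math

import torch
import torch.nn as nn


class MyRNNCellBase(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(MyRNNCellBase, self).__init__()
        self.hidden_size = hidden_size
        self.weight_ih = nn.Parameter(torch.empty(hidden_size, input_size))
        self.weight_hh = nn.Parameter(torch.empty(hidden_size, hidden_size))
        self.reset_parameters()

    def reset_parameters(self):
        # identical logic formerly duplicated in RNNCell, LSTMCell, and GRUCell
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            nn.init.uniform_(weight, -stdv, stdv)


class MyRNNCell(MyRNNCellBase):
    pass  # reuses reset_parameters from the base
```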
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10399
Differential Revision: D9316978
Pulled By: SsnL
fbshipit-source-id: a2d9408f0b5c971a3e6c3d42e4673725cf03ecc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10244
Use CAFFE_ENFORCE_EQ(x, y) instead of CAFFE_ENFORCE(x == y) in conv_op_impl.h for error messages with more information.
Reviewed By: viswanathgs
Differential Revision: D9177091
fbshipit-source-id: cf8d10afec1ce6793d3ae0b62f05648722a4130b
Summary:
It just calls into `ninja install`. For iterative work on
libtorch.so/_C.so,
`python setup.py rebuild_libtorch develop` should provide quick iteration
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10036
Differential Revision: D9317869
Pulled By: anderspapitto
fbshipit-source-id: 45ea45a1b445821add2fb9d823a724fc319ebdd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10389
Added some unit test for box_with_nms_limit_op.
Reviewed By: wat3rBro
Differential Revision: D9237860
fbshipit-source-id: 2d65744bd387314071b68d2a0c934289fc64a731
Summary:
Test only for existence for now. I had to skip a lot of them, so there is a FIXME in the test.
Also, I'm not testing torch.* because of a namespace issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10311
Differential Revision: D9196341
Pulled By: SsnL
fbshipit-source-id: 9c2ca1ffe660bc1cc664474993f8a21198525ccc
Summary:
- Exposed get_debug_graph for ScriptModule (gets the debug graph for its
forward Method)
- Added forward/backward expect tests for lstm and milstm cells. These
are intended to prevent regressions
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10506
Differential Revision: D9316590
Pulled By: zou3519
fbshipit-source-id: 3c2510d8363e9733ccbc5c7cc015cd1d028efecf
Summary:
This commit adds the ability to insert a node with inputs, using the schema to check the inputs are valid types, fill in any default values, and perform standard implicit conversions. Since it is schema based, it will discover and use the right overload.
Constructors to `NamedValue` enable it to be constructed using `IValue` constants so it is possible to use constant values in the input list as well:
```
g.insert(aten::add, {v, 3});
```
Keyword arguments are also supported:
```
g.insert(aten::add, {v}, {{"other", t}, {"scalar", 1}});
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10198
Differential Revision: D9307252
Pulled By: zdevito
fbshipit-source-id: 644620aa85047d1eae1288383a619d50fec44d9b
Summary:
AffineChannel is being used by public Detectron models, e.g. Mask-RCNN and Faster-RCNN. This PR folds this op into convolution the same way as BN to speed up inference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10293
Differential Revision: D9276789
Pulled By: yinghai
fbshipit-source-id: fbf6dd2c1be05f5713f760752e7245b1320a122b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10100
nomnigraph has until this point tried to ignore external inputs and outputs, as they aren't very well defined (does order matter?). But for DCE and some of Keren's work they are becoming necessary, so I went ahead and added this to the core nomnigraph converter.
Reviewed By: yinghai
Differential Revision: D9105487
fbshipit-source-id: a2e10e3cc84515611d6ab7d4bc54cf99b77729c0
Summary:
Fixes #10456
The graph fuser was fusing groups containing prim::FusedConcat (the producer) together with other ops (the consumer) if the consumer was fusable. For example,
```
import torch

@torch.jit.script
def fn(x, y, z):
    x1 = x + y
    y1 = x - y
    w = torch.cat([x1, y1])
    return w + z

x = torch.randn(2, 2, dtype=torch.float, device='cpu')
y = torch.randn(2, 2, dtype=torch.float, device='cpu')
z = torch.randn(4, 2, dtype=torch.float, device='cpu')
fn(x, y, z)
fn.graph_for(x, y, z)
```
produced the following graph:
```
graph(%x : Float(2, 2)
%y : Float(2, 2)
%z : Float(4, 2)) {
%3 : int = prim::Constant[value=1]()
%y1 : Float(2, 2) = aten::sub(%x, %y, %3)
%8 : int = prim::Constant[value=0]()
%14 : Float(4, 2) = prim::FusionGroup_0[device=-1](%z, %y1, %x, %y)
return (%14);
}
with prim::FusionGroup_0 = graph(%1 : Float(4, 2)
%5 : Float(2, 2)
%7 : Float(2, 2)
%8 : Float(2, 2)) {
%11 : int = prim::Constant[value=1]()
%9 : int = prim::Constant[value=1]()
%x1 : Float(2, 2) = aten::add(%7, %8, %9)
%w : Float(4, 2) = prim::FusedConcat[dim=0](%x1, %5)
%2 : int = prim::Constant[value=1]()
%3 : Float(4, 2) = aten::add(%w, %1, %2)
return (%3);
}
```
This is a problem because it violates two invariants:
1) all inputs to the FusionGroup must have the same size
2) prim::FusedConcat's output must not be used inside the FusionGroup
This PR fixes this problem by checking if the output to a FusionGroup came from a prim::FusedConcat node when deciding whether to fuse the consumer and producer.
If the producer is a value that came from a prim::FusedConcat node in a FusionGroup, then consumer & producer do not get fused.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10466
Differential Revision: D9296686
Pulled By: zou3519
fbshipit-source-id: ed826fa9c436b42c04ca7d4d790cece804c162bd
Summary:
A bootcamper was confused by the word "locally" and thought it meant on his MacBook as opposed to his FB dev machine. Besides the confusion in the FB context, the word "locally" isn't really necessary at all.
soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10495
Reviewed By: soumith
Differential Revision: D9311480
Pulled By: goldsborough
fbshipit-source-id: 2779c7c60f903a1822a50d140ed32a346feec39e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10426
We were seeing linker errors for TanhGradientOperator in multifeed. Since we only use the float specialization, we might as well define it that way.
Reviewed By: yinghai
Differential Revision: D9280622
fbshipit-source-id: d2ffb698c73a84bb062de5e1f3bda741330e4228
Summary:
This operator implements b-bit (1/2/4/8) stochastic quantization of a floating-point
matrix in a row-wise fashion. 8/b floating values are packed into a byte
and returned in a uint8 tensor. PR: https://github.com/pytorch/pytorch/pull/8629
Reviewed By: harouwu
Differential Revision: D8493264
fbshipit-source-id: 01f64066568a1e5a2b87c6d2134bd31cdf119c02
Summary:
* some small leftovers from the last PR review
* enable more unit test sets for CI
* replace use of hcRNG w/ rocRAND (docker image was already updated w/ newer rocRAND)
* use rocBLAS instead of hipBLAS to allow convergence w/ Caffe2
* use strided_batched gemm interface also from the batched internal interface
* re-enable Dropout.cu as we now have philox w/ rocRAND
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10406
Reviewed By: Jorghi12
Differential Revision: D9277093
Pulled By: ezyang
fbshipit-source-id: 7ef2f6fe4ead77e501ed7aea5c3743afe2466ca2
Summary:
```
This removes PyObjectFinalizer. We were seeing SIGSEGV at exit in some
programs that use multiprocessing. The backtrace pointed to
StorageRef.__del__ being called from subtype_dealloc. My guess is that
the Python interpreter was shutdown before all C++ Storage objects were
deallocated. Deallocating the C++ Storage called the finalizer which
called back into Python after it was no longer safe to do so.
This avoids a callback from C++ into Python during Storage finalization.
Instead, dead Storage objects (expired weak references) are collected
periodically when shared_cache exceeds a limit. The limit is scaled with
2x the number of live references, which places an upper bound on the
amount of extra memory held by dead Storage objects. In practice, this
should be very small.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10407
Differential Revision: D9272400
Pulled By: colesbury
fbshipit-source-id: ecb14d9c6d54ffc91e134c34a4e770a4d09048a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10278
Translation to Backend happens immediately before we go into the
Type universe; otherwise we use TensorTypeId.
I allocated TensorTypeId corresponding exactly to existing ATen
Backend. Only CPUTensorId and CUDATensorId are relevant in the
Caffe2 universe.
Reviewed By: gchanan
Differential Revision: D9184060
fbshipit-source-id: 9d3989c26f70b90f1bbf98b2a96c57e2b0a46597
Summary:
This PR provides 4 fixes / features:
1. torch::nn::Cloneable inherits virtually from torch::nn::Module. We want to pass around a module with new functions, and the best way to do this is to do a diamond inheritance pattern, i.e.
```c++
struct MySuperModuleImpl : virtual public torch::nn::Module {
  virtual void myFunction() = 0;
};

template <typename Derived>
struct MySuperModule : public torch::nn::Cloneable<Derived>, public MySuperModuleImpl {};

struct MyModule : public MySuperModule<MyModule> {
  void myFunction() override;
};
```
This way, we can simply pass around MySuperModuleImpl around instead of torch::nn::Module.
2. Optimizer options are public now, since there's no way to decay the LR or modify it during training otherwise
3. Serialization functions create autograd history and call copy_! Bad!
4. Optimizers did not create buffers after add_parameters was called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9837
Reviewed By: goldsborough
Differential Revision: D9199746
Pulled By: ebetica
fbshipit-source-id: 76d6b22e589a42637b7cc0b5bcd3c6b6662fb299
Summary:
Explanation copied from code:
// Motivation about the gflags wrapper:
// (1) We would need to make sure that the gflags version and the non-gflags
// version of Caffe2 are going to expose the same flags abstraction. One should
// explicitly use caffe2::FLAGS_flag_name to access the flags.
// (2) For flag names, it is recommended to start with caffe2_ to distinguish it
// from regular gflags flags. For example, do
// CAFFE2_DEFINE_BOOL(caffe2_my_flag, true, "An example");
// to allow one to use caffe2::FLAGS_caffe2_my_flag.
// (3) Gflags has a design issue that does not properly expose the global flags,
// if one builds the library with -fvisibility=hidden. The current gflags (as of
// Aug 2018) only deals with the Windows case using dllexport, and not the Linux
// counterparts. As a result, we will explicitly use CAFFE2_EXPORT to export the
// flags defined in Caffe2. This is done via a global reference, so the flag
// itself is not duplicated - under the hood it is the same global gflags flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10444
Differential Revision: D9296726
Pulled By: Yangqing
fbshipit-source-id: a867d67260255cc46bf0a928122ff71a575d3966
Summary:
On Windows, the FindRocksDB script doesn't detect rocksdb installation built by cmake.
And it doesn't include/link the RocksDB dependencies either, like:
* `Snappy`
* `Shlwapi.lib`
* `Rpcrt4.lib`
This PR tries to detect RocksDB in config mode first before using the private find module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/7315
Differential Revision: D9287587
Pulled By: Yangqing
fbshipit-source-id: 314a36a14bfe04aa45013349c5537163fb4c5c00
Summary:
There's no need to hack.
Using `CUDA_LINK_LIBRARIES_KEYWORD` is the normal way.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10437
Differential Revision: D9287579
Pulled By: Yangqing
fbshipit-source-id: d3d575ea8c3235576ba971e4b7493ddb435f92f3
Summary:
Building caffe2 and pytorch separately ends up with duplicated symbols, as they now share some basic libs, and this is especially bad for the registry. This PR fixes our CI and builds them in one shot with shared symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10427
Reviewed By: bddppq
Differential Revision: D9282372
Pulled By: yinghai
fbshipit-source-id: 0514931ea88277029a68fa5368ff4336472f132e
Summary:
Optimize the max_pooling operation for the inference path by setting the "inference" flag for the underlying MKL-DNN, saving the computation and storage of max indices, which are only needed for training. To keep the API compatible, training mode is still the default and inference mode is set in the optimizeForIdeep path.
Tests show the speed-up of a single max_pooling operation is up to 7X on BDW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10156
Differential Revision: D9276755
Pulled By: yinghai
fbshipit-source-id: ad533d53aabb8ccb3b592da984d6269d9b794a8a
Summary:
This should just work now that sizes/strides are unified between TH and ATen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10414
Differential Revision: D9274681
Pulled By: gchanan
fbshipit-source-id: 69eb766f4e3a5b6c57b15837cffdef513b6d7817
Summary:
```
Correctly share CUDA Parameters, requires_grad and hooks.
Previously, the following was true:
- If you put a Parameter for a CUDA tensor
in multiprocessing queue (or otherwise tried to transfer it),
this failed, saying that we cannot pickle CUDA storage.
This is issue #9996.
- If you put a leaf Tensor that requires_grad=True through the
multiprocessing queue, it would come out the other end as
requires_grad=False (It should have come out the other end
as requires_grad=True). Similarly, backwards hooks were
lost.
- If you put a non-leaf Tensor that requires_grad=True through
the multiprocessing queue, it would come out the other end
as requires_grad=False.
The root cause for the first issue was that implementation of
reductions for Parameter used the superclass implementation
(tensor) in __reduce_ex__, but this always picks up the
non-ForkingPickler reduction, which doesn't work with CUDA tensors.
So, we registered a new ForkingPickler specifically for Parameter,
and adjusted the code to correctly rewrap a Tensor in a Parameter
if it was originally a parameter.
While working on this, we realized that requires_grad and backwards
hooks would not be preserved in the ForkingPickler reduction
implementation. We fixed the reducer to save these parameters.
However, Adam Paszke pointed out that we shouldn't allow sending
requires_grad=True, non-leaf Tensors over a multiprocessing
queue, since we don't actually support autograd over process
boundaries. We now throw an error in this case; this may cause
previously working code to fail, but this is easy enough to fix;
just detach() the tensor before sending it. The error message says
so.
Fixes #9996.
```
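A minimal sketch of the new restriction on non-leaf tensors (hypothetical toy code, not from the PR):
```python
import torch
import torch.multiprocessing as mp

x = torch.randn(3, requires_grad=True)
y = x * 2              # non-leaf tensor that requires grad

q = mp.Queue()
# q.put(y)             # would now error: autograd doesn't cross process boundaries
q.put(y.detach())      # detach first, as the error message suggests
print(q.get())
```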
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10220
Differential Revision: D9160746
Pulled By: ezyang
fbshipit-source-id: a39c0dbc012ba5afc7a9e646da5c7f325b3cf05c
Summary:
Closes #9702.
cc jph00
Commit structure:
1. Change the index calculation logic. I will explain using 1-D for simplicity.
Previously we have (in pseudo code):
```
// 1. get the float locations from grid
scalar_t x = from_grid()
// 2. find the integral surrounding indices
int x_left = floor(x)
int x_right = x_left + 1
// 3. calculate the linear interpolate weights
scalar_t w_left = x_right - x
scalar_t w_right = x - x_left
// 4. manipulate the integral surrounding indices if needed
// (e.g., clip for border padding_mode)
x_left = manipulate(x_left, padding_mode)
x_right = manipulate(x_right, padding_mode)
// 5. interpolate
output_val = interpolate(w_left, w_right, x_left, x_right)
```
This is actually incorrect (and also unintuitive) because it calculates the
weights before manipulating out-of-boundary indices. Fortunately, this
isn't manifested in either of the currently supported modes, `'zeros'` and
`'border'` padding:
+ `'zeros'`: doesn't clip
+ `'border'`: clips, but for out-of-bound `x` both `x_left` and `x_right` are
clipped to the same value, so weights don't matter
But this is a problem with reflection padding, since after each time we reflect,
the values of `w_left` and `w_right` should be swapped.
So in this commit I change the algorithm to (numbers corresponding to the
ordering in the above pseudo-code)
```
1. get float location
4. clip the float location
2. find the integral surrounding indices
3. calculate the linear interpolate weights
```
In the backward, because of this change, I need to add new variables to track
`d manipulate_output / d manipulate_input`, which is basically a multiplier
on the gradient calculated for `grid`. From benchmarking, this addition doesn't
cause obvious slowdowns.
2. Implement reflection padding. The indices will keep being reflected until
they become within boundary.
Added variant of `clip_coordinates` and `reflect_coordinates` to be used in
backward. E.g.,
```cpp
// clip_coordinates_set_grad works similarly to clip_coordinates except that
// it also returns the `d output / d input` via pointer argument `grad_in`.
// This is useful in the backward pass of grid_sampler.
scalar_t clip_coordinates_set_grad(scalar_t in, int64_t clip_limit, scalar_t *grad_in)
```
For example, if `in` is clipped in `'border'` mode, `grad_in` is set to `0`.
If `in` is reflected **odd** times in `'reflection'` mode, `grad_in`
is set to `-1`.
3. Implement nearest interpolation (a short usage sketch of the new options follows this list).
4. Add test cases
5. Add better input checking
Discussed with goldsborough for moving `operator<<` of `at::Device`,
`at::DeviceType` and `at::Layout` into `at` namespace. (Otherwise
`AT_CHECK` can't find them.)
6. Support empty tensors. cc gchanan
+ Make empty tensors not acceptable by cudnn.
+ Add `AT_ASSERT(kernel block size > 0)` if using `GET_BLOCKS`
+ Cache `numel` in `TensorGeometry`
I was going to use `numel` to test if cudnn descriptor should accept a
tensor, but it isn't used eventually. I can revert this if needed.
7. Add more test cases, including on input checking and empty tensors
8. Remove an obsolete comment
9. Update docs. Manually tested by generating docs.
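A short usage sketch of the two user-facing additions, reflection padding and nearest interpolation, in `grid_sample` (an identity grid is used purely for illustration):
```python
import torch
import torch.nn.functional as F

inp = torch.arange(16, dtype=torch.float).view(1, 1, 4, 4)
theta = torch.tensor([[[1., 0., 0.], [0., 1., 0.]]])   # identity affine transform
grid = F.affine_grid(theta, (1, 1, 4, 4))

out_reflect = F.grid_sample(inp, grid, padding_mode='reflection')
out_nearest = F.grid_sample(inp, grid, mode='nearest')
```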
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10051
Differential Revision: D9123950
Pulled By: SsnL
fbshipit-source-id: ac3b4a0a36b39b5d02e83666cc6730111ce216f6
Summary:
I am using this to test a CI job to upload pip packages, and so am using the Caffe2 namespace to avoid affecting the existing pytorch packages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9544
Reviewed By: orionr
Differential Revision: D9267111
Pulled By: pjh5
fbshipit-source-id: a68162ed29d2eb9ce353d8435ccb5f16c3b0b894
Summary:
This was used as a convenient way for us to convert c1 models. Now that conversion is more or less done, we should probably require any users who need to convert c1 models to explicitly install c1. This PR removes the explicit c1 proto (which was copied from c1) in favor of explicit installation.
Note that caffe_translator would still work properly, only difference is that now users need to install c1 separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10380
Differential Revision: D9267981
Pulled By: Yangqing
fbshipit-source-id: a6ce5d9463e6567976da83f2d08b2c3d94d14390
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10360
It seems `lengths_host_.CopyFrom(lengthsInput, &context_);` is asynchronous w.r.t. the host while `lengths_host_.CopyFrom(lengthsInput);` is synchronous.
However, according to jerryzh168, `lengths_host_.CopyFrom(lengths, &context_); context_.FinishDeviceComputation();` is the safest way to guarantee synchronization.
Reviewed By: jerryzh168
Differential Revision: D9197923
fbshipit-source-id: 827eb63d9d15c1274851e8301a793aed39d4fa6b
Summary:
As in the title. I also did a small refactor that let us lose almost 400 LOC. This is a first step in moving the RNN code to C++.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10305
Reviewed By: ezyang
Differential Revision: D9196227
Pulled By: apaszke
fbshipit-source-id: 54da905519aade29baa63ab1774a3ee1db5663ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10163
- Remove dependency on caffe2/core/common.h for ATen/core/typeid.h
Unfortunately, Windows seems to rely on typeid.h including this
header, so it is still included from the forwarding header
caffe2/core/typeid.h
- Deduplicate Demangle/DemangleType with their ATen equivalents
Reviewed By: smessmer
Differential Revision: D9132432
fbshipit-source-id: 21f2c89e58ca1e795f1b2caa316361b729a5231b
Summary:
Copy of #10191 because these changes didn't land with the diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10394
Differential Revision: D9260816
Pulled By: li-roy
fbshipit-source-id: 7dc16919cfab6221fda1d44e98c5b900cfb40558
Summary:
Before we had 0-dim tensors in TH, we were flexible in what we accepted with respect to the difference between size [] and size [1] tensors in backwards functions, because they were identical in TH. So, we had backwards definitions that were technically incorrect, but happened to work. This often masks shape issues, adds greatly to code complexity, and thus IMO isn't worth keeping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10382
Differential Revision: D9244618
Pulled By: gchanan
fbshipit-source-id: 2c29c53a8ffe8710843451202cad6b4323af10e8
Summary:
This makes clamp and relu faster (fixes #10276).
The extra copying was introduced when clamp moved to ATen and
the _th_clamp_ wrapper was used to forward to TH/THC,
we remove that and add _th_clamp(_out) instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10352
Reviewed By: ezyang
Differential Revision: D9233590
Pulled By: SsnL
fbshipit-source-id: 4f86a045498e5e577fb22656c71f171add7ed0ac
Summary:
If an `at::test` function is added, gcc can't figure out the `std::thread(test, -1)` resolution.
It is not a problem for the current code. I bumped into this when playing with native functions. But I think it is good to just prevent it from happening in the future by removing `using namespace at;`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10381
Differential Revision: D9241614
Pulled By: SsnL
fbshipit-source-id: 972ac3cecff3a50602b3fba463ae1ebd3f53d036
Summary:
When only part of the outputs of unbind are used in a backward,
the gradients for the others are undefined. This sets those
to zero in to_tensor_list.
Fixes: #9977
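A minimal sketch of the case this fixes: only one of the unbound slices feeds the loss, and the gradients for the unused slices are now zero instead of undefined.
```python
import torch

x = torch.randn(3, 4, requires_grad=True)
a, b, c = x.unbind(0)   # only `a` is used below
loss = a.sum()
loss.backward()
print(x.grad)           # rows corresponding to b and c are all zeros
```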
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9995
Differential Revision: D9239610
Pulled By: soumith
fbshipit-source-id: eb8d1b3f2b4e615449f9d856e10b946910df9147
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10334
Keep kEps in one place to make sure they are consistent
Reviewed By: xianjiec
Differential Revision: D9202280
fbshipit-source-id: 35d173ce1d1a361b5b8cdbf1eac423e906e7c801
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10218
SubtreeMatchCriteria now supports:
- nonTerminal flag: if this is set, it means we only match the root of the subtree and do not care about the children. Example use case: matching an "input" node without caring how the input is produced.
Additional tests for these new logic are added to subgraph_matcher_test.cc.
Subgraph matching APIs for NNGraph is also added.
(Further enhancement to make the SubgraphMatching API constructs a Subgraph object/more diagnostic information will go later).
Reviewed By: bwasti
Differential Revision: D9156092
fbshipit-source-id: 3f28ac15d9edd474b3e0cd51fd7e6f973299d061
Summary:
The current Dockerfile builds pytorch using the default python within miniconda, which happens to be Python 3.6.
This patch allows users to specify which python should be installed in the default miniconda environment used by the pytorch dockerfile. I have tested the build for python 2.7, 3.5, 3.6, and 3.7. Python 2.7 required typing and cython.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10317
Differential Revision: D9204401
Pulled By: ezyang
fbshipit-source-id: 11355cab3bf448bbe8369a2ed1de0d409c9a2d6e
Summary:
This PR adds tracing infrastructure for custom operators. It also simplifies the tracer overall, and changes the codegen to do more metaprogramming there instead of via C++ (which was necessary for the custom op tracing).
To give an example of the tracer/metaprogramming change, what used to look like this in `VariableType.cpp`:
```
jit::tracer::PreTraceInfo trace_info;
if (jit::tracer::isTracing()) {
  trace_info = jit::tracer::preRecordTrace(jit::aten::index_select, "self", self, "dim", dim, "index", index);
}
```
is now simply the inlined version of `preRecordTrace`, minus C++ metaprogramming:
```
torch::jit::Node* node = nullptr;
if (jit::tracer::isTracing()) {
  auto& graph = jit::tracer::getTracingState()->graph;
  node = graph->create(jit::aten::index_select_out, /*outputs=*/0);
  jit::tracer::recordSourceLocation(node);
  jit::tracer::addInputs(node, "result", result);
  jit::tracer::addInputs(node, "self", self);
  jit::tracer::addInputs(node, "dim", dim);
  jit::tracer::addInputs(node, "index", index);
  graph->appendNode(node);
}
```
zdevito apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10212
Differential Revision: D9199615
Pulled By: goldsborough
fbshipit-source-id: cd4b603c1dc01340ead407228e109c99bdba2cfc
Summary:
While waiting for dropout to be fully ported to ATen, here's a performance fix for the most common dropout case. Dropout is still a Python function; I just added an efficient path to it. I could not make inplace work, because the generator always generates `return self` for inplace functions, and I need to return both the original tensor and the mask, so inplace goes through the existing path. Even with the non-inplace version, since the mask is now a ByteTensor, the memory used is just a little larger than for inplace dropout, due to the savings on the mask.
Once dropout is moved to aten, these kernels still can be used for efficient implementation.
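For context, a minimal sketch of the call this speeds up (the standard functional dropout entry point; no new API is involved, and a CUDA build is assumed for the fused path):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1024, 1024, device="cuda")
y = F.dropout(x, p=0.5, training=True)  # non-inplace path with the ByteTensor mask
```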
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9666
Reviewed By: SsnL
Differential Revision: D8948077
Pulled By: ezyang
fbshipit-source-id: 52990ef769471d957e464af635e5f9b4e519567a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10228
Sometimes, for all items in the minibatch in test mode, input length will be
equal to max time steps. This avoids having to pass in an external tensor.
Differential Revision: D9174378
fbshipit-source-id: 22f7d5c311c855d9c3ac59f2a5e773279bd69974
Summary:
This PR extends the existing type and shape metadata tracing and verification done in autograd with device information. This expansion of tracing is required for #8354, is likely useful in other scenarios, and is a healthy sanity check, just like type and shape tracing.
The precise changes are:
- TypeAndShape -> InputMetadata, now includes device()
- Creating InputMetadata is simplified to just require a tensor, and callers were updated to use this simpler invocation wherever possible
- The gradient accumulator of a variable is now reset when set_data() is called if either the type or device changes, and this reset now locks to avoid contention with acquiring the gradient accumulator
- Mismatched devices during backward() will throw a runtime error, just like mismatched type and shape
- (Bonus!) Two uninitialized pointers in THCReduce are now initialized (to nullptr) to prevent build warnings
fyi colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9796
Reviewed By: goldsborough
Differential Revision: D9119325
Pulled By: ezyang
fbshipit-source-id: 76d1861b8d4f74db0575ff1f3bd965e18f9463de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10274
Good C++ libraries don't take up un-namespaced identifiers
like DISABLE_COPY_AND_ASSIGN. Re-prefix this.
Follow up fix: codemod Caffe2 to use the new macro, delete
the forwarding definition
Reviewed By: mingzhe09088
Differential Revision: D9181939
fbshipit-source-id: 857d099de1c2c0c4d0c1768c1ab772d59e28977c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10268
Running torch.distributed.init_process_group fails with more than ~64 processes, with various errors like connection refused or connection reset by peer. After some digging, it looks like the root cause is that all workers have to connect to master via TCP (both in Zeus init and in DataChannelTCP - look for `connect()`), and the listening socket only has a backlog of 64.
I increased the backlog to 1024, that seems like enough for reasonable purposes (the hard limit is 65535 in /proc/sys/net/core/somaxconn). There's probably a more correct way to do this that involves retries when connection is refused.
Reviewed By: soumith
Differential Revision: D9182216
fbshipit-source-id: 2f71c4995841db26c670cec344f1e3c7a80a7936
Summary:
Previously, `tensor[i:]` was transformed to `tensor[i:-1]`. This incorrectly leaves off the last element. Noticed this when implementing slicing for list types.
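A minimal sketch of the kind of script this affects; before the fix, the open-ended slice compiled as `x[i:-1]` and dropped the last element.
```python
import torch

@torch.jit.script
def tail(x, i):
    # type: (Tensor, int) -> Tensor
    return x[i:]

print(tail(torch.arange(5), 2))  # tensor([2, 3, 4])
```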
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10286
Differential Revision: D9193292
Pulled By: michaelsuo
fbshipit-source-id: df372b815f9a3b8029830dd9e8769f9985a890e7
Summary:
I changed the name of this builtin to match Python's native style, but forgot to change the compiler error to match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10265
Differential Revision: D9192963
Pulled By: michaelsuo
fbshipit-source-id: 225ca4cd50fbbe3b31c369deeb3123a84342aab1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10264
Since we now have DISABLE_COPY_AND_ASSIGN macro in the file,
CoreAPI is no longer an accurate name.
Reviewed By: dzhulgakov
Differential Revision: D9181687
fbshipit-source-id: a9cc5556be9c43e6aaa22671f755010707caef67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10263
Auxiliary changes that were needed:
- Add DISABLE_COPY_AND_ASSIGN to CoreAPI.h (maybe we should rename this file
now)
Reviewed By: dzhulgakov
Differential Revision: D9181321
fbshipit-source-id: 975687068285b5a94a57934817c960aeea2bbafa
Summary:
When we directly use -std=c++11, it propagates to the downstream applications.
Problems:
1. Gcc flags propagating to nvcc.
2. nvcc flags propagating to nvcc. (Which throws an error like redeclaration of std flag)
This PR will fix these propagation issues!
Similar problem:
https://github.com/FloopCZ/tensorflow_cc/pull/92
https://github.com/CGAL/cgal/issues/2775
Requires: CMake 3.12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10098
Differential Revision: D9187110
Pulled By: ezyang
fbshipit-source-id: 0e00e6aa3119c77a5b3ea56992ef3bbfecd71d80
Summary:
This PR for the ROCm target does the following:
* enable some unit tests on ROCm
* fix a missing static_cast that breaks BatchNorm call on ROCm
* fix BatchNorm to work on ROCm w/ ROCm warp sizes etc
* improve the pyhipify script by introducing kernel scope to some transpilations and other improvements
* fix a linking issue on ROCm
* for more unit test sets: mark currently broken tests broken (to be fixed)
* enable THINLTO (phase one) to parallelize linking
* address the first failing of the elementwise kernel by removing non-working ROCm specialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10266
Differential Revision: D9184178
Pulled By: ezyang
fbshipit-source-id: 03bcd1fe4ca4dd3241f09634dbd42b6a4c350297
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10261
1. Reserve
Currently, Reserve will allocate new memory while the old data in the tensor is also preserved,
and Resize relies on this behavior in some call sites, e.g. https://github.com/pytorch/pytorch/blob/master/caffe2/operators/reservoir_sampling.cc#L103, where we should be using Extend.
We want to bring semantics of Reserve to be more aligned with std::vector, i.e. we want it to be
an optimization about memory allocation and remove the semantics about preserving the data. We'll remove the guarantee that data will be preserved after Reserve, and Extend will be the only API that preserves old data when we do in-place extension of memory. This also helps with the later refactoring on split Storage from Tensor.
Also, we'll only pass in the outer dimension to Reserve which means the later dimensions should be set before we call Reserve.
2. Extend/Shrink
Previously, Extend actually means ExtendBy and Shrink means ShrinkTo. I would like to add an ExtendTo for convenience, and change Shrink to ShrinkTo.
Old functions calling Extend are still there; although it actually means ExtendBy, I think it still makes sense to keep it.
3. Usage Patterns
The expected usage patterns right now is:
```
t->Resize({0, 32, 32, 32});
t->template mutable_data<T>(); // set meta_
t->Reserve(100);
auto* t_data = t->template mutable_data<T>();
// feed data to tensor using t_data
for (int i = 0; i < 100; ++i) {
  t->Extend(1, 50, &context_);
  // you can continue to use t_data if you have reserved enough space
  // otherwise, you should call t->template mutable_data<T> again to
  // get the new data pointer since Extend will allocate new memory even
  // though the original data is preserved.
}
```
Reviewed By: ezyang
Differential Revision: D9128147
fbshipit-source-id: e765f6566d73deafe2abeef0b2cc0ebcbfebd096
Summary:
The new entrypoint is `./tools/build_pytorch_libs.sh caffe2`.
This will also speed up CI builds a bit, since we will no longer be compiling all of libtorch twice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9836
Differential Revision: D9182634
Pulled By: anderspapitto
fbshipit-source-id: 0b9a20ab04f5df2d5c4e7777e4dc468ab25b9ce2
Summary:
Turns out some people are using this via the C-API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10259
Differential Revision: D9180135
Pulled By: gchanan
fbshipit-source-id: 68f59beabf7f8093e67581d7e7ebfe8dff9e6b69
Summary:
Visual Studio Code and Visual Studio store their configurations in `FOLDER/.vscode` and `FOLDER/.vs`.
But "setup.py clean" deletes these folders because they are listed in the `.gitignore` file.
To prevent this, add a "BEGIN NOT-CLEAN-FILES" marker to the `.gitignore` file; "setup.py clean" then ignores lines after this marker.
Discussed in #10206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10233
Differential Revision: D9175515
Pulled By: ezyang
fbshipit-source-id: 24074a7e6e505a3d51382dc5ade5c65c97deda37
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10197
Support generic features in DPER2.
For now, since we only have one generic type (1), we are directly adding the parsed feature record to the embedding feature.
For new feature types with specific structure, corresponding code changes are also expected.
Reviewed By: itomatik
Differential Revision: D8788177
fbshipit-source-id: 9aaa6f35ece382acb4072ec5e57061bb0727f184
Summary:
Fixes #10032
When capturing an output, GraphExecutorAutogradFunction creates
SavedVariable with is_output=False and owns it:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/graph_executor.cpp#L87
Constructing SavedVariable with is_output=False makes it own a copy of
the shared_ptr<GraphExecutorAutogradFunction>, which causes a reference
cycle:
6456b944fd/torch/csrc/autograd/saved_variable.cpp (L27)
The solution in this PR is to construct the SavedVariable with
is_output=True if the captured value is an output.
Test Plan
Turn on cuda memory checking for JitTestCase. If the test's name
includes "cuda" or "gpu" in it, the cuda memory checking test happens.
cc zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10222
Reviewed By: ezyang
Differential Revision: D9162995
Pulled By: zou3519
fbshipit-source-id: aeace85a09160c7a7e79cf35f6ac61eac87cbf66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10175
Previously, we had at::Device::Type and caffe2::DeviceType (from protobuf),
intended to help us distinguish between CPU, CUDA, etc. devices.
This replaces at::Device::Type entirely with at::DeviceType, which in turn
is a direct, 'enum class' version of the protobuf generated caffe2::DeviceType
'enum'. We can't eliminate the 'enum' because this would be a pretty drastic
API change (enum is interconvertible with integers, enum class is not) but
we can make the two line up exactly and share code for, e.g., printing.
Reviewed By: Yangqing
Differential Revision: D9137156
fbshipit-source-id: 566385cd6efb1ed722b25e6f7849a910b50342ab
Summary:
- New concept of a message stack; you can add messages
using AppendMessage
- New concept of a caller; it's just a way to pass along
some arbitrary extra information in the exception
Coming soon is changing Caffe2 to use at::Error instead of
EnforceNotMet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10183
Differential Revision: D9139996
Pulled By: ezyang
fbshipit-source-id: 6979c289ec59bc3566a23d6619bafba2c1920de9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10224
It doesn't work with Caffe2; use AT_CORE_API from ATen/core/CoreAPI.h
instead.
Reviewed By: smessmer
Differential Revision: D9162467
fbshipit-source-id: 3c7d83c1ccb722ebac469296bdd7c3982ff461e5
Summary:
The basic game plan is to stop accessing the type_ field directly,
and instead use the stored backend_, scalar_type_ and
is_variable_ to look up the appropriate Type from Context.
Storage of backend_ and scalar_type_ are new.
At some future point in time, I'd like to look at this code
carefully to see if I can get everything in this codepath inlining.
I didn't do it in this patch because there are circular include
problems making things difficult.
Some other details:
- Added Device::backend() which does what it says on the tin
- SparseTensorImpl is temporarily hard-coded to root in at::Context
for the appropriate context. If/when we put this in shared code,
we'll have to break this dep too, but for now it should be OK.
- There's a stupid problem with globalContext() deadlocking if
you didn't actually initialize it before loading libtorch.so
(which is bringing along the variable hooks). I fixed this by
reordering the static initializers. Fixes #9784
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10210
Differential Revision: D9150697
Pulled By: ezyang
fbshipit-source-id: 89e2006c88688bcfab0dcee82dc369127c198c35
Summary:
- fixes#9141, #9301
- use logsigmoid at multilabel_soft_margin_loss to make it more stable (NOT fixing legacy MultiLabelSoftMarginCriterion)
- return (N) instead of (N, C) to match the same behavior as MultiMarginLoss
- Note that with this PR, the following behavior is expected:
```
loss = F.multilabel_soft_margin_loss(outputs, labels, reduction='none')
loss_mean = F.multilabel_soft_margin_loss(outputs, labels, reduction='elementwise_mean')
loss_sum = F.multilabel_soft_margin_loss(outputs, labels, reduction='sum')
loss.sum() == loss_sum # True
loss.mean() == loss_mean # True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9965
Differential Revision: D9038402
Pulled By: weiyangfb
fbshipit-source-id: 0fa94c7b3cd370ea62bd6333f1a0e9bd0b8ccbb9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10217
It's only used in debug printing and is not that reliable anyway. If we want to implement it later, we should do it with proper accounting for shared storages.
Reviewed By: jerryzh168
Differential Revision: D9155685
fbshipit-source-id: 48320d41a0c4155645f3ba622ef88730a4567895
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9360
This implements a first set of c10 operators, namely the ones needed for the multithread predictor benchmark.
All implementations are CPU-only and experimental. They're not meant to be used in production.
They can be used, however, to test calling simple c10 MLPs from Caffe2 or PyTorch when working on these integration paths.
Reviewed By: dzhulgakov
Differential Revision: D8811698
fbshipit-source-id: 826789c38b2bfdb125a5c0d03c5aebf627785482
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9369
This adds the capability for caffe2 to call c10 operators and adds a dummy c10 sigmoid op as a proof of concept.
I used this test script to make sure it works:
```
from caffe2.python import workspace, model_helper
import numpy as np

data1 = np.random.rand(16, 100).astype(np.float32)
workspace.FeedBlob("data1", data1)
m = model_helper.ModelHelper(name="my net")
sigmoid1 = m.net.C10Sigmoid_DontUseThisOpYet("data1", "sigmoid1")
sigmoid2 = m.net.Sigmoid("data1", "sigmoid2")
workspace.RunNetOnce(m.param_init_net)
workspace.CreateNet(m.net)
data1 = np.random.rand(16, 100).astype(np.float32)
workspace.FeedBlob("data1", data1)
workspace.RunNet(m.name, 1)
print(workspace.FetchBlob("data1"))
print(workspace.FetchBlob("sigmoid1"))
print(workspace.FetchBlob("sigmoid2"))
```
(and check that both sigmoid outputs are the same)
Reviewed By: ezyang
Differential Revision: D8814669
fbshipit-source-id: eeb0e7a854727f1617a3c592a662a7e5ae226f40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10214
Seems we're passing weak pointers over C API boundaries. Need this API there too.
Reviewed By: ezyang
Differential Revision: D9154505
fbshipit-source-id: c9889689b87dad5d918f93ba231e01704b8d2479
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10130
Update some include paths to make them internally consistent
Reviewed By: ezyang
Differential Revision: D9119906
fbshipit-source-id: b44e5cab8e8e795ee18afe9ffc6caf1f2b413467
Summary:
This PR adds a way to infer the JIT/script schema of a function from its signature, and then create an operator from the schema and implementation. The implementation function is wrapped into another function, which pops values from the stack into an argument tuple, then invokes the function and pushes the return value back onto the stack, sometimes unpacking the return value if it is a tuple.
Currently the method is called `createOperator`. We may want to think of a nicer way of registering ops in tandem with `RegisterOperators`. It might be very cumbersome to add a template constructor to `Operator`, so maybe we can come up with a chaining method on `RegisterOperators` like `RegisterOperators(schema, func).op(schema.func).op(schema, func)` -- it has to work at startup time (for a static variable) though. We can solve this in another PR.
zdevito apaszke smessmer dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10048
Differential Revision: D9125975
Pulled By: goldsborough
fbshipit-source-id: de9e59888757573284a43787ae5d94384bfe8f9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10192
- release_resources() method must be non-const because it modifies the object
- for intrusive_ptr<const MyClass>, this needs to be const_cast :(
Reviewed By: ezyang
Differential Revision: D9143808
fbshipit-source-id: 9203ff7a7ff3bec165931279371c6e75d4f0ca8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10133
This is useful for C APIs where we want to give owning pointers to/from other languages.
Reviewed By: ezyang
Differential Revision: D9121493
fbshipit-source-id: f903f5830f587b2ba69c0636ddcf1a066bbac2e0
Summary:
The PR allows int→float and float→int casts. Currently we only allow `tensor→int` and `tensor→float` casts.
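A minimal sketch of what this enables in script (assuming the usual `@torch.jit.script` entry point; the function itself is made up, not taken from the PR's tests):
```python
import torch

@torch.jit.script
def casts(x):
    a = int(2.5)   # float -> int cast, newly allowed
    b = float(3)   # int -> float cast, newly allowed
    return x + a + b
```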
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10168
Differential Revision: D9141163
Pulled By: wanchaol
fbshipit-source-id: 5e5591a98b4985a675641dfc9a385b2a0bf8e208
Summary:
Previously, `foo = [bar, baz]` would construct a TupleType of fixed arity. This would cause code like:
```
foo = [2]
if True:
    foo = [2, 2]
```
to fail to compile, since `(int)` is not the same as `(int, int)`.
This PR changes things so that list literals construct ListTypes, which can be resized.
Potentially breaking changes introduced:
- Empty list literals are now disallowed, `_constructEmptyFooList()` builtins are required to replace them.
- Iterable variable unpacking where the rhs is a list is now disallowed. (Tuples still work)
- Lists must have a single type.
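A small sketch of the new behavior, loosely based on the example above (hypothetical script function, not from the PR's tests):
```python
import torch

@torch.jit.script
def f(x):
    foo = [2]          # now a resizable ListType[int], not a fixed-arity tuple type
    if True:
        foo = [2, 2]   # reassignment with a different length now type-checks
    return x
```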
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10193
Differential Revision: D9147166
Pulled By: michaelsuo
fbshipit-source-id: bbd1b97b0b6b7cb0e6f9d6aefa1ee9c731e63039
Summary:
* Changes `insertConstant(g, val)` to `g.insertConstant(val)`.
* Moves SourceRange to its own file to enable it.
* Cleans up dead attribute code in schema matching and graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10177
Differential Revision: D9137789
Pulled By: zdevito
fbshipit-source-id: 8a73cfb01a576f02e7e4dce019be9c0a0002989d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9897
Add an IntrusivePtr class to do intrusive refcounting with a shared_ptr-like interface.
Reviewed By: ezyang
Differential Revision: D9018619
fbshipit-source-id: 5de8706aab8eea2e30bead0f59bd6a7ca4d20011
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10173
With D9024330, the `Extend` function is no longer a template, which makes
the `template` keyword here invalid. For some reason the current version of LLVM
doesn't catch this, but the latest one does.
Reviewed By: jerryzh168
Differential Revision: D9133462
fbshipit-source-id: 54ac9aad01f81b9b4e7b6e2864b8961478d2d860
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9905
This diff improves lars operator in Caffe2 by applying clipping to the computed learning rate
Reviewed By: pjh5
Differential Revision: D9020606
fbshipit-source-id: b579f1d628113c09366feac9406002f1ef4bd54f
Summary:
This PR adds strings to the AST and implements them for print statements. Strings are lifted as attributes of the print node. They must be arguments to print itself, not arguments of an object that is passed to print. If they are encountered elsewhere, an NYI exception will be thrown.
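A minimal sketch of the kind of script this enables (assuming the standard `@torch.jit.script` decorator; the function itself is made up):
```python
import torch

@torch.jit.script
def report(x):
    # the string literal is lifted as an attribute of the print node;
    # it must be an argument to print itself
    print("running forward")
    return x
```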
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9324
Reviewed By: jramseyer
Differential Revision: D8807128
Pulled By: eellison
fbshipit-source-id: 984401ff458ed18d473c6d1bd86750e56c77d078
Summary:
This is part of the process of removing THLongStorage to represent sizes/strides.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10146
Differential Revision: D9126611
Pulled By: gchanan
fbshipit-source-id: b0d995a4c51dfd54bf76dcfee9a69f37f9d01652
Summary:
In this changeset:
* improvements to `hipify-python.py`
* marking unit tests broken for ROCm
* reducing the number of jobs for the build to avoid out-of-memory issues
* switch to Thrust/cub-hip master for the CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9653
Differential Revision: D9117791
Pulled By: ezyang
fbshipit-source-id: a6c3c7b81f2bda9825974bf9bf89a97767244352
Summary:
Enabled support for generating random numbers in the fusion compiler. Currently a Philox RNG implemented by TensorFlow is used, as NVRTC couldn't resolve the curand.h header correctly. The two implementations should have the exact same behavior according to our tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9795
Differential Revision: D8999029
Pulled By: SsnL
fbshipit-source-id: f0d2616a699a942e2f370bdb02ac77b9c463d7b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10081
Add new utility that make it easier to write graph transformation. Callers now only need to take care of the actual transformation logic. The subgraph matching is simplified because callers only need to specify a simple construct for subtree matching criteria.
The utility is SubgraphMatcher::replaceSubtree
Some notes:
- replaceSubtree takes a subtree matching criteria and a lambda that takes a subtree root. It does not handle any transformations itself. Callers are responsible for the transformation part, including deleting all nodes in the matched subtree(s). We could enhance this to also handle the deletion part if it turns out to be useful.
- Only subtree matching is supported for now, but we can add general DAG subgraph support later if needed.
Reviewed By: bwasti
Differential Revision: D9073297
fbshipit-source-id: 465a0ad11caafde01196fbb2eda2d4d8e550c3b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9860
For 3D group convolution, in the case of CUDNN 7 and NCHWD order, the filter dim is (M, C/group_, k_h, k_w, k_d).
According to CUDA doc (https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#grouped-convolutions), the existing implementation is incorrect, and will crash the 3d video model training with group convolution.
In the implementation, `filter.dims(1)` is already `C/group_`, so we don't need to divide it by `group_` again.
Reviewed By: BIT-silence
Differential Revision: D9008807
fbshipit-source-id: 2f0d6eb47f4e16d7417a7e3baeba709e3254154f
Summary:
Implement IR transformation for control flow
- `prim::Constant`: clone to new graph directly
- `prim::NumToTensor`: create a `BatchTensor` from output tensor with `batch_size = 1`
- `prim::TensorToNum`: clone to new graph
- `prim::ListConstruct`: clone to new graph
- `prim::If`: execute both `if_block` and `else_block` and combine results from them using `cond`
- `prim::Loop`:
- for loop
- while loop: change while `cond` to `cond_any`, use `cond` to update outputs
test case: hand-written LSTM, greedy search, beam search
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9392
Differential Revision: D8822369
Pulled By: ChunliF
fbshipit-source-id: 8f03c95757d32e8c4580eeab3974fd1bc429a1e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10152
- Moved from namespace c10::guts to at
- I fixed the use sites, since there were only three of them
- Macro renamed from C10_ to AT_
Reviewed By: smessmer
Differential Revision: D9123652
fbshipit-source-id: bef3c0ace046ebadb82ad00ab73371f026749085
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10139
We want CaffeTypeId to be interconvertible with at::ScalarType, and
this means we should have the numbers line up exactly. Fortunately
this is not too hard to do.
Reviewed By: smessmer
Differential Revision: D9123058
fbshipit-source-id: 7e9bd59ca25a552afe9d2d0a16cedc4f6311f911
Summary:
This exposes expand_outplace to python. Fixes #8076. Fixes #10041.
I didn't name it torch.broadcast because numpy.broadcast does something
slightly different (it returns an object with the correct shape
information).
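A usage sketch, assuming the Python binding added here is `torch.broadcast_tensors` (the name is an assumption; the summary only says expand_outplace is exposed):
```python
import torch

a = torch.randn(3, 1)
b = torch.randn(1, 4)
# both outputs are expanded to the common broadcast shape (3, 4)
x, y = torch.broadcast_tensors(a, b)
print(x.shape, y.shape)
```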
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10075
Differential Revision: D9125816
Pulled By: zou3519
fbshipit-source-id: ebe17c8bb54a73ec84b8f76ce14aff3e9c56f4d1
Summary:
Previously, the parser was emitting list literals for tuples, but the IR was representing list literals internally with TupleTypes.
For implementing most list operations, I think it will be helpful to distinguish between lists (dynamic size, homogeneous types) and tuples (fixed arity, heterogeneous types)
This diff modifies the parser logic to emit tuple literals. This frees us to represent lists as ListType in the IR, while still properly mapping tuple literals to TupleTypes.
A following diff will actually switch over list literals to emit ListTypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10128
Differential Revision: D9121305
Pulled By: michaelsuo
fbshipit-source-id: e0cad07ae8bac680f7f8113d10e5129d5a1a511d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10138
Note that `TensorCPU` and `TensorGPU` both refine to `Tensor` now. Basically they are the same thing, so a check like `blob.IsType<TensorCPU>()` is no longer safe, as `TensorGPU` passes the check too.
We need to systematically weed out such usage in our codebase... jerryzh
Reviewed By: houseroad
Differential Revision: D9115273
fbshipit-source-id: 13b293c73691002eac34e095cdcd96c27183e875
Summary:
This rewrites checked_convert to use stringstreams, eliminating the use of to_string which is not available on Android stdc++.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10137
Reviewed By: smessmer
Differential Revision: D9122340
fbshipit-source-id: b7c1bff70e36217305f2b3333c51543ef8ff3d9c
Summary:
This will be needed soon because I want to move Half.h into
ATen/core, and then I cannot have a TH dependency.
I also took the liberty of making the code more strict-aliasing
safe (this is not actually useful, since we will never built Torch
with strict aliasing) by replacing pointer casts between
float and unsigned with a memcpy instead.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10134
Differential Revision: D9121920
Pulled By: ezyang
fbshipit-source-id: 3b1f86a7c5880e8ac1a589a51f0635bb72e1fd40
Summary:
…e_/is_variable_
The basic game plan is to stop accessing the type_ field directly,
and instead use the stored backend_, scalar_type_ and
is_variable_ to look up the appropriate Type from Context.
Storage of backend_ and scalar_type_ are new.
At some future point in time, I'd like to look at this code
carefully to see if I can get everything in this codepath inlining.
I didn't do it in this patch because there are circular include
problems making things difficult.
Some other details:
- Added Device::backend() which does what it says on the tin
- SparseTensorImpl is temporarily hard-coded to root in at::Context
for the appropriate context. If/when we put this in shared code,
we'll have to break this dep too, but for now it should be OK.
- There's a stupid problem with globalContext() deadlocking if
you didn't actually initialize it before loading libtorch.so
(which is bringing along the variable hooks). I didn't fix
it in this PR; it's tracked in #9784
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9787
Reviewed By: cpuhrsch
Differential Revision: D8980971
Pulled By: ezyang
fbshipit-source-id: 2b4d867abfdc3999a836a220c638c109053145a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9740
- Remove implicit ArrayRef -> vector conversion
- Fix 4 call sites that accidentally did an implicit expensive vector conversion but wouldn't have needed to
- Remove explicit vector conversion from 4 call sites that also didn't need to do that
Reviewed By: ezyang
Differential Revision: D8961693
fbshipit-source-id: 980da9f988083c0072497f9dbcbbf6f516fa311c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9610
Mostly making some stuff in ArrayRef constexpr to give it better perf.
Reviewed By: ezyang
Differential Revision: D8926785
fbshipit-source-id: af6d4b05fbc69d20855a80f3edc2b501577a742b
Summary:
in particular, make not building tests actually work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10091
Differential Revision: D9121366
Pulled By: anderspapitto
fbshipit-source-id: d7d38cf759aa46bff90d3b4f695c20f29039ae75
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10107
This header is needed for ATen/core stuff
This diff also fixes an issue in C++17.h when run in C++17 enabled compilers.
Reviewed By: ezyang
Differential Revision: D9095209
fbshipit-source-id: d45947956019a7095875f48746b88c414e8865bc
Summary:
zdevito explained that the attributed versions of `Operator`s are no longer necessary. This PR does two things:
1. Removes all code associated with attributed operators,
2. Adds a second kind of state to `Operator` where it is constructed with an `Operation` directly instead of an `OperationCreator`. This will be useful to test custom operators which don't require a node (you can just retrieve it directly).
Now rebased on top of https://github.com/pytorch/pytorch/pull/9801
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10080
Differential Revision: D9113668
Pulled By: goldsborough
fbshipit-source-id: 1276a191c7cf89da1c38488769f2105ce2664750
Summary:
Extracted from https://github.com/pytorch/pytorch/pull/8338
Updating Eigen submodule to fix an issue we saw with BUILD_ATEN and BUILD_CAFFE2 removal.
cc mingzhe09088 ezyang smessmer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10095
Reviewed By: mingzhe09088
Differential Revision: D9109877
Pulled By: orionr
fbshipit-source-id: 90e36c298d8a22398558d70dc5f68a95a7687d6b
Summary:
It's not a particularly pretty process right now, but it may as well
be documented. I'm not aware of an ideal location for this, so I'm
just dropping it in the docs/ folder for now as recommended by
soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10087
Differential Revision: D9119681
Pulled By: anderspapitto
fbshipit-source-id: cd4afb642f3778c888d66a501bc697d0b0c88388
Summary:
This also makes Backtrace more portable, by disabling its functionality for
mobile builds as well.
It also handles Caffe2 static Windows builds by introducing a new variable,
AT_CORE_STATIC_WINDOWS, which must be set if you're building
ATen on Windows as part of a static library.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10092
Reviewed By: gchanan, smessmer
Differential Revision: D9094393
Pulled By: ezyang
fbshipit-source-id: 93281f9302bd378605a26589ae308faf1dac7df4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10035
This is an initial diff which refactors some of the components in the Seq2SeqModelCaffe2EnsembleDecoder class.
Reviewed By: jmp84
Differential Revision: D9026372
fbshipit-source-id: 449635208f24494209ae2fb78a19fca872970ea8
Summary:
This PR depends on the tests added in #9670. It moves the first, tiny function from the c10d DDP to C++: `dist_broadcast_coalesced`. Let me know if `torch/csrc/distributed/c10d/ddp.h` will be a good place to put these rewritten functions.
pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9729
Differential Revision: D8985308
Pulled By: goldsborough
fbshipit-source-id: dc459fe9040273714044152063585e746974752f
Summary:
I opened an issue explaining some of my frustrations with the current state of schedulers.
While most points that I raised in [that issue](https://github.com/pytorch/pytorch/issues/8741#issuecomment-404449697) need to be discussed more thoroughly before being implemented, there are some that are not so difficult to fix.
This PR changes the way the LambdaLR scheduler gets serialized:
> The lr_lambda functions are only saved if they are callable objects (which can be stateful).
> There is no point in saving functions/lambdas as you need their definition before unpickling and they are stateless.
This has the big advantage that the scheduler is serializable, even if you use lambda functions or locally defined functions (aka a function in a function).
Does this functionality need any unit tests?
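A sketch of the pattern this enables, assuming the scheduler exposes `state_dict()` as described (the `WarmupLambda` class is a made-up example of a stateful callable):
```python
import torch
from torch.optim.lr_scheduler import LambdaLR

class WarmupLambda:
    # a callable object with state; picklable, unlike a lambda or local function
    def __init__(self, warmup_steps):
        self.warmup_steps = warmup_steps
    def __call__(self, epoch):
        return min(1.0, float(epoch + 1) / self.warmup_steps)

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = LambdaLR(opt, lr_lambda=WarmupLambda(5))
state = sched.state_dict()  # the callable object can be serialized with the scheduler
```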
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9927
Differential Revision: D9055505
Pulled By: soumith
fbshipit-source-id: 6c1cec588beedd098ec7d2bce6a9add27f29e48f
Summary:
There is a regression in softmin in 0.4.1 that was not present in 0.4.0. The behavior of softmin(x) should match softmax(-x); however, it is instead implemented (in v0.4.1) as -softmax(x). These are not the same. The fix is trivial because the bug is due to operator precedence.
This is a major regression that broke my training. I'm not sure how a unit test did not catch this.
```
x = torch.tensor([1, 2, 3.5, 4])
print(F.softmin(x, dim=0)) # this has the wrong output in 0.4.1 but correct in 0.4.0
print(F.softmax(-x, dim=0)) # this is what softmax should be
print(F.softmax(x, dim=0))
print(-F.softmax(x, dim=0)) # this is how softmin is incorrectly implemented in 0.4.1
```
In 0.4.1 this produces
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
tensor([0.6668, 0.2453, 0.0547, 0.0332])
tensor([0.0278, 0.0755, 0.3385, 0.5581])
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
In 0.4.0 this produces the correct values
tensor([ 0.6668, 0.2453, 0.0547, 0.0332])
tensor([ 0.6668, 0.2453, 0.0547, 0.0332])
tensor([ 0.0278, 0.0755, 0.3385, 0.5581])
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10066
Differential Revision: D9106995
Pulled By: soumith
fbshipit-source-id: 7332503c6077e8461ad6cd72422c749cf6ca595b
Summary:
This is a cleanup and refactoring.
In its original form (changeset 6fdf915c057a) this diff caused a 5% regression
on ads CPU. The root cause was an omission of link_whole = True, causing
symbols to be stripped in mode/opt and forcing the converter to fall back,
which left patterns unmatched in the graph transform logic. This version of
the diff tests for link_whole by including a C++ test of the transform
Reviewed By: yinghai
Differential Revision: D9040511
fbshipit-source-id: 3e19b89989aa68b021762d12af2d0b4111280b22
Summary:
The `.bat` files' EOL is LF, so the build fails on some Windows machines.
To fix this, add a `.gitattributes` file and set the batch files' EOL to CRLF.
Discussion is in #9677.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9813
Differential Revision: D9026486
Pulled By: soumith
fbshipit-source-id: 341eaa677c35f8476a7eda1bac9827385072eb29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10044
The test was subtly broken! This transform wasn't writing to the correct blob and the test did not catch that because it was looking at the old version.
thanks kerenzhou for catching this
Reviewed By: Jokeren
Differential Revision: D9075520
fbshipit-source-id: c31ff0afcd78dd2dc7ffc240e2e89eeda87f1fb4
Summary:
This should prevent slow startup times, and will not report as many
errors during static initialization time which are hard to debug
ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9801
Reviewed By: goldsborough
Differential Revision: D8986603
Pulled By: zdevito
fbshipit-source-id: 440d43ab5e8cffe0b15118cb5fda36391ed06dbc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10059
Without virtual dtor, it could induce incorrect sized deallocation, messing up the memory. And unfortunately, sized deallocation cannot be detected by ASAN, yet.
Reviewed By: jerryzh168
Differential Revision: D9080526
fbshipit-source-id: c136cf653134e75b074326be2bc03627da42446f
Summary:
The affected files are all files that are planned to be moved
to ATen/core; the includes are for headers which are NOT slated
for movement.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10085
Differential Revision: D9093746
Pulled By: ezyang
fbshipit-source-id: 2beeffdae26d03d631d2d51b40bf6303759a2f50
Summary:
This lays out initial support for taking and returning a richer set
of types than only tensors. Floats and ints are already valid, lists are
straightforward to add, tuples need some discussion.
Based on top of #9948. Review only the last commit.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9969
Reviewed By: zdevito
Differential Revision: D9076973
Pulled By: apaszke
fbshipit-source-id: 5a1fe912ea6b79ab2bfd0dcce265eb05855b5ff0
Summary:
_pointwise loss has some Python special casing; we converted reduction to ATen enums too early.
Fixes #10009
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10018
Differential Revision: D9075489
Pulled By: li-roy
fbshipit-source-id: 4bf2f5e2911e757602c699ee1ec58223c61d0162
Summary:
This PR fixes #9418.
OpenMPI 1.10 segfaults in MPI_Bcast with a CUDA buffer, and it's a retired OpenMPI version.
I've tested on 2.1.1 and 3.0.0 and they work well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10015
Reviewed By: soumith
Differential Revision: D9088103
Pulled By: ailzhang
fbshipit-source-id: fc0a45e5cd016093ef0dbb9f371cbf67170d7045
Summary:
The CPU and CUDA variants are a direct transposition of Graves et al.'s description of the algorithm, with the
modification that it is in log space.
There is also a binding for the (much faster) CuDNN implementation.
This could eventually fix #3420
I still need to add tests (TestNN seems much more elaborate than the other testing) and fix the bugs that invariably turn up during testing. Also, I want to add some more code comments.
I could use feedback on all sorts of things, including:
- Type handling (cuda vs. cpu for the int tensors, dtype for the int tensors)
- Input convention. I use log probs because that is what the gradients are for.
- Launch parameters for the kernels
- Errors and omissions and anything else I'm not even aware of.
Thank you for looking!
In terms of performance, it looks like it is superficially comparable to WarpCTC (but I have not systematically investigated this).
I have read that CuDNN is much faster than other implementations because it does *not* use log space, but also because the gathering step is much, much faster (I avoided trying tricky things there, as they seem to contribute to WarpCTC's fragility). I might think some more about which existing torch function (scatter or index..) I could learn from for that step.
Average timings for the kernels from nvprof for some size:
```
CuDNN:
60.464us compute_alphas_and_betas
16.755us compute_grads_deterministic
Cuda:
121.06us ctc_loss_backward_collect_gpu_kernel (= grads)
109.88us ctc_loss_gpu_kernel (= alphas)
98.517us ctc_loss_backward_betas_gpu_kernel (= betas)
WarpCTC:
299.74us compute_betas_and_grad_kernel
66.977us compute_alpha_kernel
```
Of course, I still have the (silly) outer blocks loop rather than computing consecutive `s` in each thread which I might change, and there are a few other things where one could look for better implementations.
Finally, it might not be unreasonable to start with these implementations, as the performance of the loss has to be seen in the context of the entire training computation, so this would likely dilute the relative speedup considerably.
My performance measuring testing script:
```
import timeit
import sys
import torch
num_labels = 10
target_length = 30
input_length = 50
eps = 1e-5
BLANK = 0#num_labels
batch_size = 16
torch.manual_seed(5)
activations = torch.randn(input_length, batch_size, num_labels + 1)
log_probs = torch.log_softmax(activations, 2)
probs = torch.exp(log_probs)
targets = torch.randint(1, num_labels+1, (batch_size * target_length,), dtype=torch.long)
targets_2d = targets.view(batch_size, target_length)
target_lengths = torch.tensor(batch_size*[target_length])
input_lengths = torch.tensor(batch_size*[input_length])
activations = log_probs.detach()
def time_cuda_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo, culog_alpha = torch._ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_cudnn_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo, cugra = torch._cudnn_ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_warp_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo = warpctc.ctc_loss(*args, blank_label=BLANK, size_average=False, length_average=False, reduce=False)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

if sys.argv[1] == 'cuda':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets_2d.cuda(), input_lengths.cuda(), target_lengths.cuda(), BLANK]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cuda_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'cudnn':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets.int(), input_lengths.int(), target_lengths.int(), BLANK, True]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cudnn_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'warpctc':
    import warpctc
    activations = activations.cuda().detach().requires_grad_()
    args = [activations, input_lengths.int(), targets.int(), target_lengths.int()]
    grout = activations.new_ones((batch_size,), device='cpu')
    torch.cuda.synchronize()
    print(timeit.repeat("time_warp_ctc_loss(grout, *args)", number=1000, globals=globals()))
```
I'll also link to a notebook that I used for writing up the algorithm in simple form and then test the against implementations against it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9628
Differential Revision: D8952453
Pulled By: ezyang
fbshipit-source-id: 18e073f40c2d01a7c96c1cdd41f6c70a06e35860
Summary:
I previously did some transformations, e.g. _nDimension,_dim -> nDimensionLegacyAll, nDimension -> nDimensionLegacyNoScalars.
But this didn't touch dim(), which needs to be updated to support scalars. Instead of doing an (ugly) move, I audited the call sites and updated the cases that could be size 1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10023
Differential Revision: D9068996
Pulled By: gchanan
fbshipit-source-id: c63820767dd1496e908a5a96c34968482193f2c5
Summary:
We missed the upsample symbolic when bumping up the opset to 7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10001
Reviewed By: bddppq
Differential Revision: D9067212
Pulled By: houseroad
fbshipit-source-id: 3e285d2800a32cb04fa82f8e7f261bdd010a8883
Summary:
ATenCore.h is a dummy header to just test that this is working at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10019
Reviewed By: smessmer
Differential Revision: D9067262
Pulled By: ezyang
fbshipit-source-id: 58bab9c0aa83b56335e36b719b9b6505400d8dee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9890
Minor cleanups for Graph.h to make it more consistent with our style guide
Also fix opt/device.cc and binary_match_test.cc to not access subgraph.nodes_ which is now private
Reviewed By: bwasti
Differential Revision: D9017108
fbshipit-source-id: 9f5cba4a2cd2a452a955005f4704f6c120bbc1d5
Summary:
Adding a constant propagation pass to the JIT. I have added examples to the expect files.
There are a couple of special cases which have not been implemented here. IF nodes with constant conditions can be inlined with the correct block. WHILE nodes can be removed if the condition is false. I have added a test for each case in test_jit.py file as expected failures.
To be consistent with DCE, python ops & CPP ops are treated as not having side-effects.
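A minimal sketch of what the pass can fold (hypothetical example; whether the pass runs by default at this point depends on the compilation pipeline):
```python
import torch

@torch.jit.script
def f(x):
    y = 2 * 3 + 1   # only constants involved, so this can fold into a single prim::Constant
    return x + y

print(f.graph)  # after constant propagation the constant arithmetic is gone
```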
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8808
Reviewed By: wanchaol
Differential Revision: D8906770
Pulled By: eellison
fbshipit-source-id: 10ad796d89f80b843566c9ddad6a0abd1f3dc74c
Summary:
This causes numpy to yield to the torch functions,
e.g. instead of numpy array/scalar __mul__ converting the tensor to
an array, it will now arrange for the Tensor __rmul__ to be called.
Fixes case 2 of #9468
It also makes cases 3 and 4 equivalent but does not fix them.
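A small sketch of the behavior described above (the result type follows the summary's claim rather than anything verified here):
```python
import numpy as np
import torch

t = torch.ones(3)
a = np.arange(3.0)
out = a * t          # previously ndarray.__mul__ converted t into an array
print(type(out))     # now numpy yields and Tensor.__rmul__ returns a torch.Tensor
```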
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9651
Differential Revision: D8948079
Pulled By: ezyang
fbshipit-source-id: bd42c04e96783da0bd340f37f4ac3559e9bbf8db
Summary:
More clang tidy cleanups in `torch/csrc`. This time:
1. `hicpp-use-equals-default` recommends `= default` instead of `{}` for constructors/destructors. This is better practice because it expresses the intent better (https://stackoverflow.com/questions/6502828/what-does-default-mean-after-a-class-function-declaration)
2. `readability-inconsistent-declaration-parameter-name` enforces that parameter names in the declaration match parameter names in the definition. This is just generally useful and can prevent confusion and bugs.
Also updated my script a little bit.
apaszke ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9737
Differential Revision: D9069069
Pulled By: goldsborough
fbshipit-source-id: f7b3f3a4eb4c9fadc30425a153566d3b613a41ae
Summary:
These could use some autograd tests, which are coming in a later PR, but using them in autograd is probably pretty rare.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9947
Reviewed By: ezyang
Differential Revision: D9032778
Pulled By: gchanan
fbshipit-source-id: fa5a6509d3bac31ea4fae25143e82de62daabfbd
Summary:
Not sure if anybody is interested, but I managed to infer a `GRU` fine in `wasm` using ATen compiled with emscripten. It was quite trivial to fix the configuration.
It also passes most of the tests, especially all scalar tensor tests.
The command line to configure was as follows, but could be simplified:
```
emconfigure cmake -DAT_LINK_STYLE=STATIC -DCAFFE2_CMAKE_BUILDING_WITH_MAIN_REPO=OFF -DCMAKE_C_FLAGS="-Wno-implicit-function-declaration -DEMSCRIPTEN -s DISABLE_EXCEPTION_CATCHING=0" -DCMAKE_CXX_FLAGS="-Wno-implicit-function-declaration -DEMSCRIPTEN -s DISABLE_EXCEPTION_CATCHING=0" -DCMAKE_INSTALL_PREFIX=/home/sugar/aten-wasm ../
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9803
Differential Revision: D9004610
Pulled By: ezyang
fbshipit-source-id: db26c59f27162ed80f6aee2973c4cb9252d3d1e4
Summary:
Fixes #9818.
It seems the original Python doesn't add `[PYTHONPATH]\Library\bin` to `PATH`. We try to add it before the DLL loading process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9920
Differential Revision: D9040825
Pulled By: soumith
fbshipit-source-id: c07fff71b2aea254a396042ab677696f6829aac7
Summary:
Minor addition to the docstring of `torch.optim.Adam`, adding the default argument description for the `amsgrad` argument to the docstring for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9971
Differential Revision: D9040820
Pulled By: soumith
fbshipit-source-id: 168744a6bb0d1422331beffd7e694b9d6f61900c
Summary:
This was introduced in #9826, following the corresponding CUDA file context_gpu.cu. Tests passed in the PR, at which point master was 94439d7df. However, during the long landing process a new master commit aebf3b4 came in that removed the `CAFFE_KNOWN_TYPE(Tensor<HIPContext>)` in the context_hip.cc file, which broke the HIP BlobStatGetter. We did NOT run the tests again during the merge, so when #9826 later landed on master the rocm tests started breaking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9973
Differential Revision: D9040671
Pulled By: bddppq
fbshipit-source-id: f3b16cabaf681fc0535ca733db0b48430868f922
Summary:
Supersedes #8925
This PR fixes #8502. It fixes the gradients problem for clamp when passing None to the function, and adds support for NoneLiteral and NoneType in script to enable clamp tests. Now we can have corner cases like:
```python
@torch.jit.script
def func():
    x = torch.randn(3, 3, requires_grad=True)
    y = torch.clamp(x, None, 0) # max = 0
    y = torch.clamp(x, min=None, max=0)
```
In both JIT and ATen, we use Scalar(NAN) as a sentinel value when passing a None type to clamp; this is the current way we support the None type in JIT and solve the gradient problem when the user explicitly passes None into clamp.
On the JIT side, we create a tensor(NAN) and an undefined Tensor if we encounter None when matching the function schema; later, in the interpreter, it is translated to Scalar(NAN) if needed.
Ideally we wouldn't need clamp_min and clamp_max in ATen native/autograd and could support only clamp after this change, but since a bunch of other operators (e.g. Activation.cpp, Loss.cpp) use clamp_min in several places, we keep those functions available; all Python invocations, however, will only call clamp instead of clamp_min/max (which calls the underlying th_max/th_min).
zdevito jamesr66a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9596
Reviewed By: zdevito
Differential Revision: D8940839
Pulled By: wanchaol
fbshipit-source-id: c543a867b82e0ab8c99384773b173fdde2605d28
Summary:
This is a follow-up to https://github.com/pytorch/pytorch/pull/9794 that contains only the serialization library and exposes a cleaner API. This should later be incorporated into the module export code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9900
Reviewed By: zdevito
Differential Revision: D9021057
Pulled By: jamesr66a
fbshipit-source-id: 01af74a7fdd1b90b2f5484644c3121d8ba9eb3b3
Summary:
If we have this "spatial" attribute and its value equals to 1, we could just remove this attribute and convert this op to caffe2 SpatialBN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9492
Differential Revision: D8988165
Pulled By: houseroad
fbshipit-source-id: a9218dc9cd5fab43deb371f290f81285f5283231
Summary:
We only support a special case. The original dim is not supported by ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9660
Reviewed By: bddppq
Differential Revision: D8965507
Pulled By: houseroad
fbshipit-source-id: 021dffdf0489c2d3a50bfd1e0c4cfd00d4a3d776
Summary:
The goal of this PR is to update the hip files to reflect relevant changes in cuda source files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9826
Differential Revision: D9032840
Pulled By: bddppq
fbshipit-source-id: 504e55c46308eebfee3c9a7beea1f294fe03470f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9747
Currently the ctc_greedy_decoder op initializes the `merge_repeated` argument only if it has been provided by the user. Change it to initialize in all cases.
Reviewed By: houseroad
Differential Revision: D8963635
fbshipit-source-id: 18955c7c26a77d9d7f5137e4dec085252ffabfeb
Summary:
```
This adds TensorIterator, a helper class for computing element-wise
operations that's intended to replace the CPU and CUDA apply utils
functions.
CPU kernels are implemented as functions that operate on strided 1-d
tensors compared to CPUApplyUtils which operated individual elements. This
allows the kernels to handle vectorization, while TensorIterator handles
parallelization and non-coalesced dimensions.
GPU kernels continue to operate on elements, but the number of
specializations is reduced. The contiguous case remains the same. The
non-contiguous case uses a single (reduced) shape for all operands and
the fast integer division from THCIntegerDivider. To avoid extra
specializations for indexing with 64-bits, large operations are split
into smaller operations that can be indexed with 32-bits.
Major semantic changes:
- No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by
TensorIterator. The autograd engine performs the reduction assuming
standard broadcasting if the gradient shape does not match the
expected shape. Functions that do not use standard broadcasting rules
should either continue to trace the expand calls or handle the
reduction in their derivative formula.
- Use ONNX v7, which supports broadcasting ops.
Performance impact:
- Small increased fixed overhead (~0.5 us)
- Larger overhead for wrapped numbers (~2.5 us)
- No significant change for ops on contiguous tensors
- Much faster worst-case performance for non-contiguous GPU tensors
- Faster CPU bias addition (~2x)
- Faster GPU bias addition (~30% faster)
Future work:
- Decrease overhead, especially for wrapping numbers in Tensors
- Handle general inter-type operations
- Extend to unary ops and reductions
- Use buffering for compute-bound operations on non-contiguous tensors
(pull in from CPUApplyUtils)
```
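A small sketch of the broadcasting semantics mentioned above (standard broadcasting plus the autograd reduction; not a TensorIterator API example, since TensorIterator itself is internal):
```python
import torch

a = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
out = a + b            # broadcasting handled directly, no s_add variant needed
out.sum().backward()
print(b.grad.shape)    # autograd reduces the broadcast gradient back to torch.Size([3])
```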
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919
Differential Revision: D8677600
Pulled By: colesbury
fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd
Summary: Minor fix for a bug introduced by D9004285
Reviewed By: anderspapitto
Differential Revision: D9028762
fbshipit-source-id: 9b9c5eef30e61d7ae19784e0418fa29bad2b5564
Summary:
I hope this helps me for the windows build failure in #9628 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9904
Differential Revision: D9026715
Pulled By: soumith
fbshipit-source-id: bb97d41d060823f5a37bfc9a1659815b8b9f4eab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later
Before, the Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserving the same semantics. For example, one has to specify a device type in order to create a Tensor - there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)),
2. The semantics of the constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context) are changed: the second context is passed in to enable us to call the templated Copy function. Previously it could be in a different context than source and target; now we enforce that the context has the same device type as src, if it is provided.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
Reviewed By: ezyang, houseroad
Differential Revision: D9024330
fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba
Summary:
This was showing up in the n-dimensional empty tests as flaky because it's reading uninitialized cuda memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9907
Differential Revision: D9021413
Pulled By: gchanan
fbshipit-source-id: 31542b7597919df9afd6e528bb108a4a3e8eaf60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9895
The primary goal here was to remove THTensor::_dim, which isn't part of the API moving forward.
Instead, we provide 3 options for getting the dimensionality (this is temporary although non-trivial to remove!):
```
nDimension corresponds to the "true" ATen dimension. TODO: implement.
nDimensionLegacyNoScalars corresponds to the ATen dimension, except scalars are viewed as 1-dimensional tensors.
nDimensionLegacyAll corresponds to the ATen dimension, except scalars are viewed as 1-dimensional tensors
and tensors with a dimension of size zero are collapsed to 0-dimensional tensors.
```
So in this patch, nDimension -> nDimensionLegacyNoScalars and _dim/_nDimension goes to nDimensionLegacyAll.
These are just codemods.
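For intuition, a Python-level sketch of the distinction (the legacy accessors are C++/TH-only; the comments restate the definitions above as an assumption):
```python
import torch

t_scalar = torch.tensor(3.0)   # 0-dimensional scalar
t_empty = torch.zeros(0)       # 1-dimensional tensor with a size-0 dimension
print(t_scalar.dim(), t_empty.dim())   # 0 1 -- the "true" ATen dimensions
# nDimensionLegacyNoScalars would report the scalar as 1-dimensional;
# nDimensionLegacyAll would additionally collapse the size-0 tensor to 0 dimensions.
```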
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9835
Reviewed By: ezyang
Differential Revision: D8999338
Pulled By: gchanan
fbshipit-source-id: a4d676ac728f6f36ca09604a41e888d545ae9311
Summary:
Hello! I just found a small spelling mistake while reading this source code. Just PRing it, thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9868
Reviewed By: gchanan, ezyang
Differential Revision: D9016030
Pulled By: soumith
fbshipit-source-id: fc3877177be080adbdbda99a169e401691292ebb
Summary:
Based on top of #9763 (first 3 commits belong to that PR). The first commits from this PR are "Stop using attributes ..."
I tried to separate the changes into fairly meaningful commits. I can't split them up into smaller PRs, because everything starts working and all tests pass only after the whole sequence, but hopefully this will make reviewing somewhat easier.
Known issues/regressions/future tasks:
- `aten::lerp` and `aten::clamp` are no longer fusable
- `CreateAutodiffSubgraphs` needs a rewrite
- It is much more strict now, and will miss a lot of opportunities, especially when viewing ops are involved. Our previous approach was "ignore the assumption on shape availability in gradient formulas to determine differentiability, and hope that shape prop will be robust enough to actually deliver them before we differentiate", which obviously doesn't scale well to more complex cases. We should either work on reducing the size dependency of grad formulas (feasible e.g. for `view`/`reshape`, unfeasible for `squeeze`/`unsqueeze`), or make `CreateAutodiffSubgraphs` integrate some kind of "I could integrate this node into an AD subgraph, but will I be able to infer the shape of its input" reasoning (kind of like a limited shape prop, that doesn't infer anything, and only tells if it *could* infer something).
- It sometimes creates constant-only (or constants + one node) graphs, which is useless
- Broken `aten::add` in auto-batching, because it gained a non-tensor input. I changed the test for pointwise operations to use `aten::mul` instead, but I needed to disable the LSTM cell test. I'm not sure how scalar constants should be implemented in this case, because I don't fully understand our format. cc: ChunliF
- Graph import does some hacks to recover type of constants. This code should be removed once we'll gain the ability to export the IR along with value types.
- There's still a fair amount of dead code that can be removed. I didn't want to make this diff any bigger, and removing it is an easy task.
- Graph fuser could be improved to use signature matching (possibly using `OperatorSet`) instead of basing on node kinds.
- Manual constant propagation for the `ListConstruct` node in `torch/onnx/utils.py` should be replaced with a proper constant propagation pass (or we should ensure that the one we have handles at least this case before we remove this code).
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9807
Reviewed By: ezyang
Differential Revision: D9004285
Pulled By: apaszke
fbshipit-source-id: fe88026a765f6b687354add034c86402362508b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9901
Added support for UINT8 datatype for additional data (prefetching and
output) by ImageInputOp
Reviewed By: ashwinb
Differential Revision: D9018964
fbshipit-source-id: f938a8a072c15c0ee521b2f16788c024b08cd37f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9855
Support production models with predictor benchmark
Two new flags are added:
`--update_prod`: pull production data (netdef, input types, input dims) from Hive and store locally
`--use_prod`: run benchmark with local production data with the same workload as in production.
By default, 300 models will be loaded.
production vs benchmark
avg net run time:
(collected by prod: https://fburl.com/scuba/6lb91zfx and bench: https://fburl.com/ngjj1dc8)
**prod: `408us` vs bench: `543us`**
(With prod data distribution, this should be even closer)
framework overhead (as of 2018-07-22):
prod:
```
9.111% BlackBoxPredictor::Run
4.602% SimpleNet::Run
2.377% Operator::Run
1.786% BlackBoxPredictor::AllocateMemory
1.372% Observable::StartAllObservers
1.358% Observable::StartObserver
1.206% Blob::GetMutable
```
bench:
```
8.577% BlackBoxPredictor::operator()
3.276% SimpleNet::Run
1.954% Operator::Run
1.697% BlackBoxPredictor::AllocateMemory
1.477% Tensor::ShareData
1.230% Blob::GetMutable
1.034% Observable::StartObserver
```
Reviewed By: yinghai
Differential Revision: D8942996
fbshipit-source-id: 27355d7bb5a9fd8d0a40195261d13a97fa24ce17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9581
Mostly to simplify code. Should also improve performance but order switch ops
don't take much time anyway.
Reviewed By: viswanathgs
Differential Revision: D8909766
fbshipit-source-id: 17a302d5bf4aba2755d88223fc01a41fd72c5919
Summary:
Follow up task of #9584.
Commit 1:
- change expect/cast to return shared pointers instead of raw pointer
- isSubtypeOf accept TypePtr instead. Use `x->isSubtypeOf(NumberType::get())` rather than `x->isSubtypeOf(*NumberType::get())`
Commit 2:
- to address enable_shared_from_this pitfalls, we make the constructor private and expose the factory method to make sure user can only create it using our factory method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9786
Reviewed By: zdevito
Differential Revision: D8980441
Pulled By: wanchaol
fbshipit-source-id: e5c923fc57a701014310e77cf29985b43bb25364
Summary:
This PR fixes #9743.
Adds backward-compatibility support for loading a checkpoint from 0.3.* containing 1-dim tensors; they are now 0-dim tensors in 0.4+.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9781
Differential Revision: D8988196
Pulled By: ailzhang
fbshipit-source-id: a7a1bc771d597394208430575d5a4d23b9653fef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9891
Add an argument to the benchmark binary to specify the number of seconds to sleep before the run and after the warmup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9880
Reviewed By: llyfacebook
Differential Revision: D9014254
Pulled By: sf-wind
fbshipit-source-id: d5566186c8ed768f1e170e9266c5f2d6077391e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9847
CTCBeamSearchDecoder and CTCGreedyDecoder do not currently support IDEEP
execution. Add fallback operators to allow IDEEP execution of models that use
these operators.
Reviewed By: yinghai
Differential Revision: D9006234
fbshipit-source-id: fc539ba67b07d1f960d28564d8adde0be8690649
Summary:
And let Gemm conversion to inspect the input `C` to try converting to FC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9870
Reviewed By: houseroad
Differential Revision: D9013198
Pulled By: bddppq
fbshipit-source-id: b4c509cfccca238262e1c406b004e66cef256321
Summary:
This is blocking the IR operator unification, because I need to be able to pass scalars to backward functions.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9763
Reviewed By: zou3519
Differential Revision: D8978457
Pulled By: apaszke
fbshipit-source-id: 570b4c3409322459cb0f2592069730a7d586ab20
Summary:
I don't think this file is used anywhere, I guess we'll find out!
(Weirdly this failed lint on one of my PRs even though it shouldn't).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9843
Differential Revision: D9003949
Pulled By: gchanan
fbshipit-source-id: 26d580d1e7cdd30e82e5f4176244e51fd7cd616d
Summary:
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later
Before, the Caffe2 Tensor class was compile-time fixed to bind to a particular device/context. With this change, we're making it a runtime property (stored inside the tensor), but preserving the same semantics. For example, one has to specify a device type in order to create a Tensor - there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument *DeviceType* to most of the constructors of the tensor, e.g. (Tensor(DeviceType type)),
2. The semantics of the constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context) are changed: the second context is passed in to enable us to call the templated Copy function. Previously it could be in a different context than source and target; now we enforce that the context has the same device type as src, if it is provided.
3. To preserve 'get-or-construct' semantics of Blob, we added specialized getter Blob::GetMutableTensor that verifies both that Blob contains a Tensor and that it's of a correct type
4. Specifically, Tensor type is not default-constructible any more (as we don't have unknown device tensors) and thus some of the code handling STL containers needs to change
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
Reviewed By: xw285cornell
Differential Revision: D8121878
fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81
Summary:
The PR contains:
Fixes for running MIOpen conv operator in a multi worker scenario, along with a performance fix
Fixing a typo in MIOpen pool op and adding some extra checks for MIOpen spatial BN op
bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9842
Differential Revision: D9012512
Pulled By: bddppq
fbshipit-source-id: 270e1323c20fbfbc4b725f9a4ff34cd073ddaaa8
Summary:
I split it into two parts, _local_scalar and _local_scalar_dense (unchecked)
so I could reuse the sparse logic in both paths.
_local_scalar became a method on Tensor to work around a circular
include problem.
This is resurrected copy of #9652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9762
Differential Revision: D8972348
Pulled By: ezyang
fbshipit-source-id: 2232dbfc8e1286b8a4a1c67d285c13a7771aad4c
Summary:
We think this will band-aid some of the new Caffe2 test failures.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9830
Differential Revision: D9008052
Pulled By: ezyang
fbshipit-source-id: 84f1c0faea429d758d760965d6cbfe9e4c72eb19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9831
Follow up to D8980903 - replace dataIterator with nodeIterator where the data isn't used.
Reviewed By: pjh5
Differential Revision: D8998351
fbshipit-source-id: c333847ecd8b6d8075352322845839b94a63aecc
Summary:
https://github.com/pytorch/pytorch/pull/9755 broke this, but it was only tested if size zero dims were turned on (it can still happen even if that isn't turned on, because we support size [0] tensors).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9825
Differential Revision: D8997303
Pulled By: gchanan
fbshipit-source-id: 911dce112f73fad0f3980a7f4f9423df0f2d923d
Summary:
This was used to build Caffe2 Docker version 170.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9828
Differential Revision: D8997808
Pulled By: ezyang
fbshipit-source-id: f48938b2b71bc86578c9d9b46c281ed05478724e
Summary:
…o dim.
Manifest:
1) The scalar boolean is now in THTensor, although it isn't hooked up at the TH level yet.
2) setScalar is gone, everything now goes through the maybeScalar equivalent (which is renamed)
3) all "scalars" in this context now refer to "zero_dim" in order to differentiate this concept from the "Scalar" class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9783
Differential Revision: D8978911
Pulled By: gchanan
fbshipit-source-id: f09254be4bebad0e4c510fefe4158b4f7e92efe1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9790
- Add way to check if a NodeRef is in a graph
- Make a nodeIterator (similar to dataIterator) but only iterate through nodes.
Reviewed By: bwasti
Differential Revision: D8980903
fbshipit-source-id: b20504a46715858752e25242303125a15a709b88
Summary:
Temporarily need this to prevent sccache from breaking when I move sccache install to the DockerFile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9810
Differential Revision: D8991684
Pulled By: Jorghi12
fbshipit-source-id: 14cd0278f53a72372f9bbe27b228980f8d3c1d4a
Summary:
The tutorials URL with http is not valid; replace it with https.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9812
Differential Revision: D8991344
Pulled By: ezyang
fbshipit-source-id: c12faa57905b50eadc320f9938c39c4139bd093b
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/6890. (backward pass for non-symmetric eigen-decomposition is not implemented in other packages, e.g. autograd, mxnet, tensorflow, presumably because the eigenvalues can be imaginary for the general case, and AFAIK we cannot support complex numbers).
This patch adds a backward function for the symmetric eigen-decomposition function `torch.symeig`. The formula used is taken from [here](http://eprints.maths.ox.ac.uk/1079/1/NA-08-01.pdf). Unit tests are added to verify correctness.
There is still one outstanding issue, which is how to handle the case where the `symeig` is called with `eigenvectors=False`. In this case, the eigenvectors are returned as a zero tensor, but the backward computation for the eigenvalues depends on the eigenvectors. There was a previous attempt to implement this in https://github.com/pytorch/pytorch/pull/2026, where apaszke mentioned that the `eigenvectors` argument should be overridden so that they are saved for the backwards pass. The forward code is autogenerated, though, and it isn't clear to me how that would be done. I'd appreciate any guidance. For now, there is a unit test that will fail until that issue is resolved.
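As a quick sanity check, a minimal sketch of how the new backward can be exercised with the standard `gradcheck` workflow (the symmetrization wrapper here is my own addition, since `symeig` assumes a symmetric input):
```python
import torch
from torch.autograd import gradcheck

# Symmetrize a leaf tensor inside the function under test and check
# gradients of the eigenvalues against numerical differentiation.
a = torch.randn(4, 4, dtype=torch.float64, requires_grad=True)

def eigvals_of_symmetric(x):
    e, v = torch.symeig(x + x.t(), eigenvectors=True)
    return e

print(gradcheck(eigvals_of_symmetric, (a,)))  # True if the backward formula is correct
```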
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8586
Reviewed By: ezyang
Differential Revision: D8872760
Pulled By: SsnL
fbshipit-source-id: 76614495d0f9c118fec163a428f32e5480b4d115
Summary:
The primary use-site of typeString was checked_cast_tensor.
I did a little more than I needed in this patch, to set
the stage for actually deleting the tensor type.
Specifically, I modified checked_cast_tensor to explicitly
take Backend and ScalarType, the idea being that once we
remove the tensor subclasses, we will delete the T template
parameter.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9764
Differential Revision: D8969196
Pulled By: ezyang
fbshipit-source-id: 9de92b974b2c28f12ddad13429917515810f24c6
Summary:
This implements the two-parameter Weibull distribution, with scale $\lambda$ and shape $k$ parameters as described on [Wikipedia](https://en.wikipedia.org/wiki/Weibull_distribution).
**Details**
- We implement as a transformed exponential distribution, as described [here](https://en.wikipedia.org/wiki/Weibull_distribution#Related_distributions).
- The `weibull_min` variance function in scipy does not yet support a vector of distributions, so our unit test uses a scalar distribution instead of a vector.
Example of the bug:
```
>>> sp.stats.expon(np.array([0.5, 1, 2])).var() # fine
array([1., 1., 1.])
>>> sp.stats.weibull_min(c=np.array([0.5, 1, 2])).var() # buggy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 490, in var
return self.dist.var(*self.args, **self.kwds)
File "/usr/local/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 1242, in var
res = self.stats(*args, **kwds)
File "/usr/local/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 1038, in stats
if np.isinf(mu):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```
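For reference, a minimal usage sketch of the new distribution, assuming it is exposed as `torch.distributions.Weibull(scale, concentration)` with `concentration` playing the role of the shape parameter k from the summary above:
```python
import torch
from torch.distributions import Weibull

# Assumed parameter names: scale (lambda) and concentration (k).
w = Weibull(scale=torch.tensor([1.0, 2.0]), concentration=torch.tensor([0.5, 1.5]))
x = w.sample((3,))        # shape (3, 2)
print(w.log_prob(x))
print(w.mean, w.variance)
```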
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9454
Differential Revision: D8863574
Pulled By: SsnL
fbshipit-source-id: 1ad3e175b469eee2b6af98e7b379ea170d3d9787
Summary:
I got some tensor->variable conversion exceptions from `torch/csrc/autograd/variable.h`, which used the `TORCH_ASSERTM` macros instead of `AT_CHECK`, so they didn't have backtraces. This was such a substantial loss for debuggability that I decided to update the whole codebase to use the backtrace-enabled ATen macros instead of `TORCH_ASSERT` and `JIT_ASSERT`, the latter having been an alias of the former.
ezyang apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9575
Differential Revision: D8924566
Pulled By: goldsborough
fbshipit-source-id: 7a4013b13eec9dbf024cef94cf49fca72f61d441
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9770
Add zero ops to operators that do not have a valid schema
Reviewed By: hlu1
Differential Revision: D8957472
fbshipit-source-id: d8d0a351183e88ace2e050a87c1e1c363af67e33
Summary:
Before I can rewrite portions of the c10d DDP in C++ I need proper tests in place to make sure I am not breaking anything as I port code. There were no tests for the c10d DDP in place so I wrote some.
I refactored the c10d tests to derive some test cases from a general `MultiGPUTestCase` and followed lots of patterns from `test_distributed.py` w.r.t. how tests are skipped (such that the main process doesn't initialize CUDA, which I found is a super important detail!!!).
I am largely unfamiliar with this code so feel free to scrutinize. The DDP test code itself is also largely taken from `test_distributed.py` but more inlined which I find easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9670
Differential Revision: D8977724
Pulled By: goldsborough
fbshipit-source-id: 186eab38a72384d7992a2ec5c89f304ad42d5944
Summary:
Fixes: #9754
Maybe this could also make its way into 0.4.1; it is a severe debugging headache if you hit this...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9755
Reviewed By: ezyang
Differential Revision: D8967178
Pulled By: zou3519
fbshipit-source-id: 151ed24e3a15a0c67014e411ac808fb893929a42
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9643
Current map interface assumes float data type, which is not always correct.
Reviewed By: kennyhorror
Differential Revision: D8455784
fbshipit-source-id: b94a31267760f7f97c15aa4b03008affc347fd10
Summary:
When building iOS apps with a caffe2 dependency, we were seeing the `caffe2/caffe2/mobile/contrib/ios/mpscnn/mpscnn.mm:33:17: error: method 'copyWithZone:' in protocol 'NSCopying' not implemented [-Werror,-Wprotocol]`. This fixes it by implementing a shallow copy with that method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9748
Reviewed By: jerryzh168
Differential Revision: D8954332
Pulled By: williamtwilson
fbshipit-source-id: 0cd44408257c0bd3f4ffb80312ea9d13d13e5ff3
Summary:
This can hardly be called an improvement (we now print
CPUFloatType instead of CPUFloatTensor) but it was the
simplest way I could think of to devirtualize this function in
the short term. We probably need some sort of native function
that gives string information about a tensor.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Approved in #9710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9758
Differential Revision: D8966935
Pulled By: ezyang
fbshipit-source-id: a4641affe0a6153f90cdd9f4f2a1100e46d1a2db
Summary:
Not in the same format. Skip at the moment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9751
Reviewed By: yinghai
Differential Revision: D8965636
Pulled By: houseroad
fbshipit-source-id: 81d39c2f5625c14c0e1ee11408b5f7267b53798f
Summary:
ebetica made me aware that `nn::Module::clone()` always clones to the current device (usually CPU) instead of preserving the device of each parameter. This PR changes the signature of `clone` from
`shared_ptr<Module> clone()`
to
`shared_ptr<Module> clone(optional<Device> device = nullopt)`
with semantics of:
1. If a `device` is given, all parameters/buffers are moved to that device,
2. If no `device` is supplied (default), parameters/buffers retain their device.
ezyang apaszke ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9609
Differential Revision: D8957367
Pulled By: goldsborough
fbshipit-source-id: 0d409ae645ed2b8d97d6fc060240de2f3d4bc6c8
Summary:
I renamed the variable in the `Embedding` module from `weight` to `table` a few months ago, because it seemed like a more meaningful name. Turns out it's not such a good idea because it deviates from PyTorch, which unnecessarily breaks C++->Python translated code.
ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9720
Differential Revision: D8955647
Pulled By: goldsborough
fbshipit-source-id: 77228b07d2b733866e8cdecaa6d0686eef4cc3ea
Summary:
The underlying reason why this is even an issue is that the conversion
into and out of the 'fictional' onnx operators is done in an unhygienic
order. This doesn't address that, but it does fix the one observable
case where this produces an incorrect result, and unblocks some other
work being done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9657
Differential Revision: D8940824
Pulled By: anderspapitto
fbshipit-source-id: ea827a24c85447fe4ae470336a746329598eee84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9718
This patch switches the interpreter to use IValue's primitive numbers rather than tensors for computing on integers and floats. In addition to preparing the interpreter for first-class support of other types, this cleans up the handling of primitive numbers, making it possible to just use the normal operator overloading dispatch to find the right implementation for numbers. As a result of this change, a lot of other functionality needed to be updated since it was the first time we use non-tensors in a lot of places in the code base.
Notes:
* Fixes code_template.py so that multi-line strings are indented correctly when used on a standalone line
* Cast operators (`int(x)`) are now functional. Some tests have additional conversions to integers because
we no longer allow implicit tensor -> integer conversions, following the same convention as in Python
* prim::ListConstruct/createList has been added to the interpreter for creating lists and this has
replaced aten::stack for integers lists
* gen_jit_dispatch.py has been refactored so that non-tensor types use operators on IValues to extract
the primitives
* IValue gains a .to<T> method that is the equivalent of tensor_as but for IValue instead of at::Tensor
* `constant_as<T>` is switched over to using IValues's `.to<T>` method, to make conversion from constant->IValue->C++ type
more consistent. This functionality combined with `toIValue(Value*)` replaces the `tensor_as` and `as_tensor` family of functions.
* conditional expressions (if, loop) and operators related to them are now computed on integers rather than tensors
* IValue gains constructors for constructing from at::Scalar and converting to it. However, IValue itself will always store
the scalars as a double or int64.
* To align with python 3 syntax, TK_INT, TK_FLOAT, and TK_BOOL have been removed from the parser, and int/float/bool are just treated as special identifiers in the compiler,
along with print. These are represented as special sugared values with a `call` method implemented. For int/float/bool this implements casting behavior.
* Dropped shared_from_this from Type/Module. They were not needed, and they made debugging harder because they internally throw/catch exceptions.
* Shape propagation has been updated to support running nodes that include floating point primitive types, this required some refactoring of internal functions.
* TensorToNum and NumToTensor have actual implementations as operators now
* register_prim_ops now contains implementations of math operators for float/int primitive types, and for mixed (prim <+> tensor) versions. This removes the need for special handling in compiler.cpp
* Primitive math is now entirely handled by letting the compiler choose the right overloads. This removes tons of special casing in the compiler.
* incorporates eellison's change to allow casting from return values. Due to the addition of primitive support, the code needed slight modifications, so I just pre-merged it here.
* stack.h gains generic vararg versions of push/pop that know how to convert to/from C++ types:
```
at::Tensor a;
at::Scalar b;
pop(stack, a, b);
at::Tensor c = a + b;
push(stack, c);
```
apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9584
Reviewed By: apaszke
Differential Revision: D8910546
Pulled By: zdevito
fbshipit-source-id: 0f3e60d4d22217f196a8f606549430e43b7e7e30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9667
MKL-DNN doesn't support 64-bit integers (cfee61bf81/include/mkldnn_types.h (L62-L75)). So force-converting from `TensorCPU<long>` to an `s32` Ideep tensor will cause memory issues. This diff gives an alternative solution, where we just fall through to TensorCPU. The reasoning is that since MKL-DNN doesn't support 64-bit integer tensors, downstream ops have to be in CPUContext, so there is no reason to force-convert to an ideep tensor and back.
Reviewed By: pjh5
Differential Revision: D8943544
fbshipit-source-id: f514903cda27e34b8887271c9df56c8220895116
Summary:
This is a modification of the strategy from https://github.com/pytorch/pytorch/pull/8919 and https://github.com/pytorch/pytorch/pull/9579.
```
Previously, the CPU architecture-specific kernels self-registered with
the DispatchStub. When linking as part of a static library, this requires
the flag --whole-archive to be passed to the linker to ensure that the
object files for the kernels are included. Caffe2 and TensorFlow use that
strategy.
We ran into some issues with --whole-archive blowing up the binary size
of some downstream projects in Facebook. This PR avoids --whole-archive
for CPU kernels. The downside is that the generic code needs to be aware
of whether kernels are compiled with AVX and with AVX2 (via
HAVE_AVX_CPU_DEFINITION and HAVE_AVX2_CPU_DEFINITION).
The CUDA kernels still self-register with DispatchStub because the CPU
library is not aware of whether the CUDA library will be available at
runtime.
There are a few major changes to DispatchStub
- The environment variable ATEN_CPU_CAPABILITY overrides the CPU
capability detection code (Previous ATEN_DISABLE_AVX/AVX2)
- DispatchStub is defined in the generic native code instead of the
CPU_CAPABILITY_DEFAULT kernel.
```
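A small sketch of the environment-variable override described above; I'm assuming the accepted values are strings like `"default"`, `"avx"`, and `"avx2"`, and that the variable must be set before any CPU kernel is dispatched:
```python
import os

# Set the override before importing torch so the capability check sees it.
os.environ["ATEN_CPU_CAPABILITY"] = "default"

import torch
x = torch.randn(1024)
print(x.sum())  # runs the generic (non-AVX) CPU kernel under this setting
```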
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9664
Differential Revision: D8943350
Pulled By: colesbury
fbshipit-source-id: 329229b0ee9ff94fc001b960287814bd734096ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9717
D8722560 was landed with some build errors; unfortunately the c10 code isn't part of contbuild yet.
Fixing them.
Differential Revision: D8954141
fbshipit-source-id: 2a082fb8041626e45ccd609f37a8ef807f6dad8a
Summary:
This is to simplify the data format during benchmarking. After this change, we can use the same benchmarking harness data conversion method to parse data from multiple binaries.
This change should be coordinated with the PR: https://github.com/facebook/FAI-PEP/pull/63
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9555
Reviewed By: pjh5
Differential Revision: D8903024
Pulled By: sf-wind
fbshipit-source-id: 61cabcff99f0873729142ec6cb6dc230c685d13a
Summary:
This pull request implements the low-rank multivariate normal distribution, where the covariance matrix has the form `W @ W.T + D`. Here D is a diagonal matrix and W has shape n x m with m << n. It uses the matrix determinant lemma and the Woodbury matrix identity to save computational cost (a usage sketch follows below).
Along the way, I also revised the MultivariateNormal distribution a bit. Here are the other changes:
+ `torch.trtrs` works with cuda tensor. So I tried to use it instead of `torch.inverse`.
+ Use `torch.matmul` instead of `torch.bmm` in `_batch_mv`. The former is faster and simpler.
+ Use `torch.diagonal` for `_batch_diag`
+ Reimplement `_batch_mahalanobis` based on `_batch_trtrs_lower`.
+ Use trtrs to compute term2 of KL.
+ `variance` relies on `scale_tril` instead of `covariance_matrix`
TODO:
- [x] Resolve the fail at `_gradcheck_log_prob`
- [x] Add test for KL
cc fritzo stepelu apaszke
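A minimal usage sketch, assuming the new class is exposed as `LowRankMultivariateNormal(loc, cov_factor, cov_diag)` with covariance `cov_factor @ cov_factor.T + diag(cov_diag)`:
```python
import torch
from torch.distributions import LowRankMultivariateNormal

n, m = 5, 2
loc = torch.zeros(n)
W = torch.randn(n, m)   # low-rank factor, m << n
D = torch.ones(n)       # diagonal term
dist = LowRankMultivariateNormal(loc, cov_factor=W, cov_diag=D)
x = dist.sample((4,))
print(dist.log_prob(x))  # evaluated cheaply via the Woodbury identity
```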
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8635
Differential Revision: D8951893
Pulled By: ezyang
fbshipit-source-id: 488ee3db6071150c33a1fb6624f3cfd9b52760c3
Summary:
…unctions.
This also unifies the error checking between scatter/scatterAdd on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9658
Differential Revision: D8941527
Pulled By: gchanan
fbshipit-source-id: 750bbac568f607985088211887c4167b67be11ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9683
This pops off `refcount_`, `storage_`, `storage_offset_`; there are now no more direct accesses to these fields and we can make them private (with appropriate friending).
Stacked on #9561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9591
Reviewed By: SsnL
Differential Revision: D8922246
Pulled By: ezyang
fbshipit-source-id: dfae023d790e29ce652e2eab9a1628bbe97b318d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9665
In data_parallel_model, we isolate the synchronizing barrier init net into its own net, separate from the param_init_net, so that we have finer-grained control over the barrier net.
Reviewed By: andrewwdye
Differential Revision: D8375389
fbshipit-source-id: ce0c8c1c8e4bd82b7078a1b07abaced3f149d578
Summary:
**REVIEW LAST COMMIT ONLY**
As discussed in our yesterday's meeting. Nodes can be now matched to particular overloads using the `matches(...)` function:
```cpp
n->matches("aten::type_as(Tensor self, Tensor other) -> Tensor")
```
This also changes the shape prop and peephole passes to use those functions for matching. This fixes a few bugs, makes them much more robust, and prepares us for removal of attributes.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9567
Reviewed By: zdevito
Differential Revision: D8938482
Pulled By: apaszke
fbshipit-source-id: eb2382eeeae99692aada2d78d5d0c87c8ef1545e
Summary:
This PR contains the change for explicit conversion between ushort and __half required for ROCm 1.8.2 support
bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9663
Differential Revision: D8943937
Pulled By: bddppq
fbshipit-source-id: 16102f9dbc68ed4ece2e8fc244825c3992c24901
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9637
Adding a method to run a plan in the background. The intended use is to run BlueWhale's data reading & preprocessing net in the background while the GPU is training.
Reviewed By: MisterTea
Differential Revision: D8906439
fbshipit-source-id: b1c73ca7327e2d87a8f873924e05ab3d161a3f1e
Summary:
ezyang noticed that the CUDAStream files lived under ATen/ despite being CUDA-specific, and suggested porting them to ATen/cuda and exposing them with a new CUDAContext. This PR does that. It also:
- Moves ATen's CUDA-specific exceptions for ATen/cudnn to ATen/cuda for consistency
- Moves getDeviceProperties() and getCurrentCUDASparseHandle() to CUDAContext from CUDAHooks
The separation between CUDAContext and CUDAHooks is straightforward. Files that are in CUDA-only builds should rely on CUDAContext, while CUDAHooks is for runtime dispatch in files that can be included in CPU-only builds. A comment in CUDAContext.h explains this pattern. Acquiring device properties and CUDA-specific handles is something only done in builds with CUDA, for example, so I moved them from CUDAHooks to CUDAContext.
This PR will conflict with #9277 and I will merge with master after #9277 goes in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9435
Reviewed By: soumith
Differential Revision: D8917236
Pulled By: ezyang
fbshipit-source-id: 219718864234fdd21a2baff1dd3932ff289b5751
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9636
Make sure that the blobs are registered to the net
Reviewed By: pjh5
Differential Revision: D8924883
fbshipit-source-id: f09422a2d4d5ba8bf6cfbfd00172097b5ab1fcd6
Summary:
In the repr function of the LPPoolNd(...) class, there was a missing '='. (`kernel_size{kernel_size}`)
Link to line in the code: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/pooling.py#L694
Original:
return 'norm_type={norm_type}, kernel_size{kernel_size}, stride={stride}, ' \
'ceil_mode={ceil_mode}'.format(**self.__dict__)
Fixed:
return 'norm_type={norm_type}, kernel_size={kernel_size}, stride={stride}, ' \
'ceil_mode={ceil_mode}'.format(**self.__dict__)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9629
Differential Revision: D8932913
Pulled By: soumith
fbshipit-source-id: 9030dff6b14659b5c7b6992d87ef53ec8891f674
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9598
The "max_length" should be passed to UnPackSegmentsOp if "max_length" is given when calling PackSegmentsOp.
Reviewed By: jerryzh168
Differential Revision: D8919799
fbshipit-source-id: 8c97aa717b69177b8a5d5d56892817d488853840
Summary:
This PR adds machinery to cache the schema in an IR node, and allows lookups of (possibly) constant inputs by their names (instead of position). The new methods are:
- `at::optional<T> get<T>(Symbol name)` - if the argument called name is a constant, then casts it to type `T` and returns it. If it's not constant returns `nullopt`. Raises an error if there's no argument with that name.
- `at::optional<IValue> get<T>(Symbol name)` - like above, but packs the result in an IValue
- `Value* getValue(Symbol name)` - retrieves a `Value*` for an argument (no need to know its position).
All above functions currently inspect the attributes as well, but that's only so that I could start using them in other places in the JIT without disrupting our current functionality. I wanted this diff to be a preparation that doesn't change the semantics too much, and so both the tracer and script create nodes with attributes. The next PR will put that to a stop, and hopefully the changes we need to make to other components will be simpler thanks to what I did here.
One more thing I'd like to do before actually stopping creating the non-attributed nodes is to have a convenient way of creating a schema programmatically, matching nodes against it, and creating them without having to pack inputs into flat argument lists (which is quite error prone).
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9505
Reviewed By: ezyang
Differential Revision: D8915496
Pulled By: apaszke
fbshipit-source-id: 39d14fc9a9d73d8494f128367bf70357dbba83f5
Summary:
This fix will prevent errors like (found in `bincount`)
```
RuntimeError: %s not implemented for '%s'bincounttorch.FloatTensor
```
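For illustration (a sketch only), the failure mode can be triggered by calling `bincount` on a floating-point tensor; with this fix the raised message is formatted properly instead of the concatenated string shown above:
```python
import torch

# bincount expects a 1-D integral tensor, so a float input raises a RuntimeError.
try:
    torch.bincount(torch.tensor([0.5, 1.0, 2.0]))
except RuntimeError as e:
    print(e)
```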
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9625
Differential Revision: D8932945
Pulled By: soumith
fbshipit-source-id: 794e3b58d662779402ab318e274661826a5db8b2
Summary:
fixes #4176 cc vishwakftw
I didn't do `:math:` and `\neg` because I am using double ticks so they render more similarly with `:attr:`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9630
Differential Revision: D8933022
Pulled By: SsnL
fbshipit-source-id: 31d8551f415b624c2ff66b25d886f20789846508
Summary:
As in the title. Lets us simplify a lot of code.
Depends on #9363, so please review only the last commit.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9414
Reviewed By: zdevito
Differential Revision: D8836496
Pulled By: apaszke
fbshipit-source-id: 9b3c3d1f001a9dc522f8478abc005b6b86cfa3e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9501
Added a new stat value to log static states like CPU and memory usage.
Reviewed By: pjh5
Differential Revision: D8872254
fbshipit-source-id: 469e94cab99029a3da55f8986dddeadac076e2a8
Summary:
This PR adds the functional version of `DataParallel` (i.e. `data_parallel`) to the C++ frontend.
For this, I had to:
1. Add "differentiable" versions of scatter and gather, which perform their inverse operation in the backward pass, to C++. I've added them under `torch/csrc/autograd/functions/comm.{h,cpp}`. I had to move some utilities from `VariableType.cpp` into `torch/csrc/autograd/functions/utils.h`, and changed them a bit to fix the `const_cast`s for which there were `TODO`s,
2. Implement the `replicate`, `parallel_apply` and the combining `data_parallel` functions in C++.
`replicate` is implemented based on our existing `clone()` interface, along with the ability to set the current device via `at::OptionsGuard` (so nice).
`parallel_apply` is implemented using `at::parallel_for` (CC cpuhrsch) and [follows the code from PyTorch](https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/parallel_apply.py).
Added lots of tests for these things.
apaszke ezyang ebetica colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9234
Differential Revision: D8865182
Pulled By: goldsborough
fbshipit-source-id: 4f1fecf2b3f3bc1540c071dfb2d23dd45de433e4
Summary:
In our pimpl system, default constructing a module holder default constructs the contained module. This means `Linear linear;` is ill-formed, since `Linear` doesn't have a default constructor. Instead we require `Linear linear = nullptr;` to get the empty state of the `Linear`. This PR makes the error message for the ill-formed case nicer.
I had to change the forwarding constructors of most of our modules for this, but that's a minor adjustment.
E.g.
```
Linear linear;
In file included from /home/psag/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:5:0,
from /home/psag/pytorch/pytorch/test/cpp/api/module.cpp:3:
/home/psag/pytorch/pytorch/torch/csrc/api/include/torch/nn/pimpl.h: In instantiation of ‘torch::nn::ModuleHolder<Contained>::ModuleHolder() [with Contained = torch::nn::LinearImpl]’:
/home/psag/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/dropout.h:45:1: required from here
/home/psag/pytorch/pytorch/torch/csrc/api/include/torch/nn/pimpl.h:46:5: error: static assertion failed: You are trying to default construct a module which has no default constructor. Use = nullptr to give it the empty state (like an empt
y std::shared_ptr).
static_assert(
```
ebetica ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9565
Differential Revision: D8903666
Pulled By: goldsborough
fbshipit-source-id: 5e6b788921a27a44359db89afdc2b057facc5cec
Summary:
This is a few files taken from https://github.com/pytorch/pytorch/pull/8919. They're unchanged from the latest versions of that PR.
```
This is part of https://github.com/pytorch/pytorch/pull/8919. It's
separated to make it easier to merge the PR in pieces.
There are a few major changes to DispatchStub
- The environment variable ATEN_CPU_CAPABILITY overrides the CPU
capability detection code (Previous ATEN_DISABLE_AVX/AVX2)
- DispatchStub is defined in the generic native code instead of the
CPU_CAPABILITY_DEFAULT kernel.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9579
Differential Revision: D8909000
Pulled By: colesbury
fbshipit-source-id: fdeb606270b06acdab3c01dba97ec9d81584ecc0
Summary:
* THTensor now stores `sizes_` and `strides_` which is a `std::vector<int64_t>`
* Anywhere a "public" API function made use of a int64_t* of sizes, I opted to just finagle it out of the tensor using THTensor_getSizePtr rather than try to rewrite all of these sites to use ArrayRef. They should use ArrayRef eventually, but not yet.
* There are new utility functions for resizing sizes/strides in one go (THTensor_resizeDim), or replacing sizes and strides with completely new values (THTensor_setSizesAndStrides)
* Anywhere you said `t->size[n] = 0`, we now say `THTensor_setSizeAt(t, n, 0)`, ditto for strides
* Anywhere you said `t->size[n]`, we now say `t->size(n)` (coming soon: ditto for strides)
Previous review of just the `std::vector` change in #9518, but I'm planning to merge this all in one go.
Note for gchanan: review from commit "ci" and after
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9561
Reviewed By: cpuhrsch
Differential Revision: D8901926
Pulled By: ezyang
fbshipit-source-id: 483cf275060ab0a13845cba1ece39dd127142510
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9594
When the input vector is a zero vector, the previous GPU code gives NaN in the backward pass. We fix this.
Reviewed By: pjh5
Differential Revision: D8849732
fbshipit-source-id: 87b1fb1ee05dfdb0d43bcbe67e36f15896fe1706
Summary:
The underlying reason why this is even an issue is that the conversion
into and out of the 'fictional' onnx operators is done in an unhygienic
order. This doesn't address that, but it does fix the one observable
case where this produces an incorrect result, and unblocks some other
work being done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9593
Differential Revision: D8919125
Pulled By: anderspapitto
fbshipit-source-id: a88ca979c3b9d439863e223717d3697180c26121
Summary:
This is mainly straightforward, with two exceptions:
1) cublasSgemv and cublasDgemv appear to have a bug where (x,0).mv(0) does not handle beta, whereas cublasSgemm and cublasDgemm do for the case (x,0).mm(0,y). This is handled by manually calling zero / mul.
2) I fixed a bug in btrifact that was broken even when dealing with non-empty tensors. Basically, if out.stride(0) was 1, because the underlying BLAS call expects column-major matrices, to get a column-major tensor, out.transpose_(0, 1) would be called. But this is just wrong, as if the batch dimension (0) doesn't match the size of the columns (1), you don't even have a tensor of the correct shape.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9573
Reviewed By: ezyang
Differential Revision: D8906144
Pulled By: gchanan
fbshipit-source-id: de44d239a58afdd74d874db02f2022850dea9a56
Summary:
0. Fixes #9479
1. rewrites `as_strided` as a native function (a small illustration follows this list). This is fine because `set_` does the scalar check.
2. allow using `self` in `python_default_init`. Previously `python_variable_methods.cpp` has `self` as an input `PyObject *`, and use `self_` as the unpacked tensor. But `python_torch_functions.cpp` just use `self` as the unpacked tensor, making it impossible to use `self` in `python_default_init`.
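A small illustration of `as_strided` used as a regular tensor method (sketch only; the sizes and strides here are arbitrary):
```python
import torch

# as_strided builds a view with the given sizes and strides over the same storage.
x = torch.arange(9.)
y = x.as_strided((3, 3), (3, 1))  # view x as a 3x3 row-major matrix
print(y)
```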
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9538
Differential Revision: D8894556
Pulled By: SsnL
fbshipit-source-id: ca7877b488e12557b7fb94e781346dcb55d3b299
Summary:
The goal of this PR is to add infrastructure to convert (hipify) CUDA ops into [HIP](https://github.com/ROCm-Developer-Tools/HIP) ops at **compile** time.
Note that HIP ops, which are portable c++ code, can run on AMD and NVIDIA platform.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9322
Differential Revision: D8884707
Pulled By: bddppq
fbshipit-source-id: dabc6319546002c308c10528238e6684f7aef0f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9509
generate_proposals_op_util_nms.h conditionally requires OpenCV in some cases,
and earlier this was checking just CV_MAJOR_VERSION macro, but that is
undefined unless opencv.hpp is included. Adding `-DCAFFE2_USE_OPENCV` to
TARGETS when opencv is included in external_deps to check for this correctly.
Thanks jinghuang for flagging this issue!
Differential Revision: D8880401
fbshipit-source-id: 65abbcf4ffe3feffc0ee2560882cb8eb0b7476f9
Summary:
This is the first step of refactoring the Predictor. In this diff the config struct
is introduced and the internal data structure of Predictor has been updated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9434
Differential Revision: D8843262
Pulled By: fishbone
fbshipit-source-id: 23f5e4751614e3fedc9a04060d69331bfdecf864
Summary:
Prior to this diff, there have been two ways of compiling the bulk of the torch codebase. There was no interaction between them - you had to pick one or the other.
1) with setup.py. This method
- used the setuptools C extension functionality
- worked on all platforms
- did not build test_jit/test_api binaries
- did not include the C++ api
- always included python functionality
- produced _C.so
2) with cpp_build. This method
- used CMake
- did not support Windows or ROCM
- was capable of building the test binaries
- included the C++ api
- did not build the python functionality
- produced libtorch.so
This diff combines the two.
1) cpp_build/CMakeLists.txt has become torch/CMakeLists.txt. This build
- is CMake-based
- works on all platforms
- builds the test binaries
- includes the C++ api
- does not include the python functionality
- produces libtorch.so
2) the setup.py build
- compiles the python functionality
- calls into the CMake build to build libtorch.so
- produces _C.so, which has a dependency on libtorch.so
In terms of code changes, this mostly means extending the cmake build to support the full variety of environments and platforms. There are also a small number of changes related to the fact that there are now two shared objects - in particular, windows requires annotating some symbols with dllimport/dllexport, and doesn't allow exposing thread_local globals directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8792
Reviewed By: ezyang
Differential Revision: D8764181
Pulled By: anderspapitto
fbshipit-source-id: abec43834f739049da25f4583a0794b38eb0a94f
Summary:
THCStream was recently moved to ATen by mruberry: https://github.com/pytorch/pytorch/pull/8997. This PR now introduces a guard class that replaces `AutoStream` from `torch/csrc/` and also uses this new stream interface.
I had to extend the `CUDAStream` interface with unchecked calls, so that we can reset the stream without throwing an exception in the guard's destructor.
colesbury apaszke ezyang
Fixes https://github.com/pytorch/pytorch/issues/7800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9277
Differential Revision: D8865183
Pulled By: goldsborough
fbshipit-source-id: 67c9bc09629d92fa5660286b5eec08fde9108cd7
Summary:
….txt setting
In the ROCm branches we will experiment with turning this on.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9543
Differential Revision: D8897990
Pulled By: ezyang
fbshipit-source-id: ae9d25d1b79ee421d49436593edf8c7e49b3a4e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9438
Current implementation of create_from_proto doesn't work as expected: it
duplicates networks and execution steps by copying original PlanDef first and
adding each step one-by-one later.
Reviewed By: pjh5
Differential Revision: D8850316
fbshipit-source-id: 9b02836d6e6ee1c91cfdd3b4c4804f14137dc22b
Summary:
The purpose of this config is to make sure that CircleCI builds
don't fail when I turn them on for pytorch/pytorch.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9537
Differential Revision: D8894497
Pulled By: ezyang
fbshipit-source-id: 22f43c84a9b8a54cd47a6572ba068f70a73f043a
Summary:
Fix RoIAlignOp GPU implementation for RoIs without batch index
According to https://caffe2.ai/docs/operators-catalogue.html#roialign, RoIs is "2D input of shape (R, 4 or 5)"
Pass RoIs 2nd dimension as kernel parameter and adjust kernel accordingly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9230
Reviewed By: houseroad
Differential Revision: D8886798
Pulled By: malfet
fbshipit-source-id: 52a8b4df85f7e350e36c842ee4428f3a1cba2588
Summary:
Fix gatherTopK template
This change makes it possible to instantiate gatherTopK() with an IndecesType other than caffe2::TIndex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9231
Reviewed By: houseroad
Differential Revision: D8886778
Pulled By: malfet
fbshipit-source-id: d5fb1f8814710cd81bc0cf65e0f96fd9fd8317da
Summary:
…CPU LAPACK routines.
Note that the LAPACK functions in general require a different approach, because direct calls with size zero dims do not work.
Here I just selected a reasonable subset of LAPACK routines to support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9522
Reviewed By: ezyang
Differential Revision: D8888180
Pulled By: gchanan
fbshipit-source-id: 16b9013937806d375d83d1c406815765fda00602
Summary:
A 0-dimensional tensor is now returned when squeezing a tensor with a single element.
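A minimal illustration of the new behavior:
```python
import torch

t = torch.ones(1, 1)      # a tensor with a single element
s = t.squeeze()
print(s.dim(), s.shape)   # 0 torch.Size([]) -- now a 0-dimensional tensor
```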
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9529
Differential Revision: D8893103
Pulled By: soumith
fbshipit-source-id: 658189ecfff283b2b7281feb16a397692d6dbd8f
Summary:
This PR contains the ROCm contributions of last week:
* documentation of the pyHIPIFY data format, originating from #8812 review comments by ezyang
* removal of most patch files from the `amd_build` directory and integration into the code base
* enabling of previously disabled_features that do compile now
* improvement to the static_cast feature in pyHIPIFY (it will only apply static_cast to kernel arguments, not launch arguments)
* addition of two workarounds to pyHIPIFY for ROCm/HIP shortcomings: a) `__forceinline__` does not imply `static`, hence change to `__inline__`, b) `std::[exp,log,pow]` math functions cannot be selected in device code, use `::[exp,log,pow]` instead. Both of these workarounds will be removed once the issues are fixed upstream. Neither of these issues have surfaced on the CI but were reproduced internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9432
Differential Revision: D8887441
Pulled By: ezyang
fbshipit-source-id: 71cf5c6b13772a66d10be369a45ebf06e4e268e1
Summary:
This command (suggested by albanD when I raised a related question in the pytorch slack) is super useful to me. I have used it several times and it worked like a charm (without it, I have to delete the entire pytorch folder and clone everything again). So I guess it is nice to have in the CONTRIBUTING doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9524
Differential Revision: D8890126
Pulled By: soumith
fbshipit-source-id: c1798ff1ab2423627fcd8e0662a66c4e85cb2413
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9520
Add random data filler to predictor bench to support production nets
Reviewed By: salexspb
Differential Revision: D8712757
fbshipit-source-id: 2c732b2ba71ab210f9222adf94d08442ca71dc03
Summary:
- I ran into this a couple of days ago, and thought it might be useful to take note of it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9504
Reviewed By: soumith
Differential Revision: D8887396
Pulled By: weiyangfb
fbshipit-source-id: d2061cf379ce140d6e43ef6c18241f7ce00dbab6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9458
The goal is to support count_include_pad in Caffe2 ONNX backend. This commit contains the first step - support 4-D tensor cases.
AveragePool with count_include_pad can be expressed as PadImage + AveragePool.
Reviewed By: houseroad
Differential Revision: D8852180
fbshipit-source-id: 4db00e9771be7a000a2d92850dfd066d9c9c38bf
Summary:
If this is good, I could write some tests to ensure collision doesn't occur within a given range.
Closes #7228
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9246
Differential Revision: D8872608
Pulled By: ezyang
fbshipit-source-id: 0ed29a73188f4167b42756f59a5c9a3d5cb37326
Summary:
It implements per-channel alpha_dropout. It also creates corresponding function classes and unifies the process of dropout and alpha_dropout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9073
Differential Revision: D8727008
Pulled By: ezyang
fbshipit-source-id: 9d509f9c5db4e98f7b698cdfc4443505a4d2b331
Summary:
This is enabled by the allocator patch; previously we could not
deduplicate THStorage_free/THCStorage_free; now we can.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9495
Reviewed By: SsnL
Differential Revision: D8875497
Pulled By: ezyang
fbshipit-source-id: 387198dff446eb9f84d2d6187066fae1d595dea7
Summary:
ebetica asked for a way to add parameters to `Optimizer`s after they are created.
ebetica ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9472
Differential Revision: D8872176
Pulled By: goldsborough
fbshipit-source-id: 39a4032c519a6d3b458dd3596361b04afea10365
Summary:
…ors (CPU).
This includes (mainly) CPU fixes; CUDA fixes are a little more involved because you can't use an empty grid.
This also includes a fix for index_copy, which checked that self.size(dim) == src.size(0), which isn't correct (the same dimension should be compared).
Finally, also includes a fix for CUDA flip (although it's not tested yet), to get the stride using multiplication rather than division to avoid divide-by-0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9482
Reviewed By: ezyang
Differential Revision: D8873047
Pulled By: gchanan
fbshipit-source-id: 86523afd3d50277834f654cd559dfbc7875cdffe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9480
Ops like Reshape sometimes take a second input tensor of long with the new
shape (can also be specified in arg). If this input tensor is passed in via
external input (which ONNX does sometimes), LoadOp fails with an exception.
Such ops anyway are executed by IDEEPFallbackOp, so this should be fine.
Reviewed By: yinghai
Differential Revision: D8872671
fbshipit-source-id: 659a02416c374e373ce041a7d65a174be828702d
Summary:
It was only used to toggle refcounting, but we ALWAYS
refcount tensors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9494
Differential Revision: D8875169
Pulled By: ezyang
fbshipit-source-id: 3a8618fb288334e62942bbaf388f3c9e473e7524
Summary:
This issue was fixed in 976f9253a5425918eda7cf865b097cf42b5da8d7
Fixes #5311.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9498
Differential Revision: D8875605
Pulled By: ezyang
fbshipit-source-id: 449ffe975d35c959f92874437ba9be37d4d3a1f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9497
Fixes #7883 by using `rfft`.
It's worth noting that this is BC breaking, and it's impossible to detect the change because the two signatures before and after this change support a common subset of calling patterns, e.g., `stft(Tensor, int, int)` (some other calling patterns will raise an error).
soumith and I plan to change the current `stft` interface because it is a bit messy and non-standard. rafaelvalle suggested to us that `librosa` is a good reference API to align with. After discussing with soumith and ezyang, and given that `stft` is only out for 1 release, I decided to go with directly changing the signature. Also, my understanding is that most researchers in this field will welcome this change, as `librosa` seems to be the golden standard here. (it doesn't yet support all `pad_mode` options, but those will become available if added to `F.pad`.)
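A minimal sketch of the librosa-style call; the keyword names `n_fft`, `hop_length`, and `window` are illustrative of the new signature this change introduces:
```python
import torch

signal = torch.randn(16000)
window = torch.hann_window(400)
spec = torch.stft(signal, n_fft=400, hop_length=160, window=window)
print(spec.shape)  # (freq_bins, frames, 2): real/imag parts in the last dim
```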
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9308
Reviewed By: ezyang
Differential Revision: D8806148
Pulled By: SsnL
fbshipit-source-id: f6e8777d0c34d4a4d7024e638dc9c63242e8bb58
Summary:
test_cuda.py uses routine 'number' to prepare many testscases.
number should return a floating point value for float-type tensor
types, or integer otherwise. But number's test to classify the type
is incorrect, so it always returns the integer value.
(type(t).__name__ is always 'torch.tensortype' so never matches
'Double', 'Float', or 'Half'.)
Update number to use the existing is_floating() helper to make the
check.
The change to number causes a few tests to fail for HalfTensor. Relax
the tolerance for those in line with other HalfTensor testcases. The
failing tests--for addcdiv and fill--were not previously relaxed for
HalfTensor so are held to the over-strict 1e-5 default tolerance.
Finally, update a couple other tests for HalfTensor type to use the
existing is_half() helper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9475
Reviewed By: yf225
Differential Revision: D8872112
Pulled By: ezyang
fbshipit-source-id: 016e3e15adb23f6606bd4c08218954c1396699db
Summary:
This change makes README.md compatible with both Github and VSTS markdown engines. Images can be reduced if necessary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9296
Differential Revision: D8874931
Pulled By: soumith
fbshipit-source-id: 0c530c1e00b06fc891301644c92c33007060bf27
Summary:
I noticed that `Sequential::clone()` does not work. This is because `Sequential` does not use `reset()` which is normally where modules have to initialize and register its submodules. Further, this is because of the way `Sequential` allows its modules to be passed in the constructor, which doesn't work with `reset()` (since it does "late" initialization).
I've added some better error messages inside `Cloneable::clone()` which makes this kind of mistake clearer for other users, and tests for `Sequential::clone()`.
I also had to give `AnyModule` a deep `clone()` method.
ebetica ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9372
Differential Revision: D8865189
Pulled By: goldsborough
fbshipit-source-id: b81586e0d3157cd3c4265b19ac8dd87c5d8dcf94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9403
In BBoxTransform and GenerateProposal ops, clip_boxes makes sure the bbox fits
within the images. For rotated boxes, this doesn't always make sense as there
could be multiple ways to clip a rotated box within an image boundary.
Moreover, clipping to a horizontal box means we leave out pixels of interest
potentially. Therefore, we clip only boxes with angle almost equal to 0 (with a
specified `angle_thresh` tolerance).
Reviewed By: pjh5
Differential Revision: D8828588
fbshipit-source-id: 39c1eafdb5d39d383780faa0a47e76149145e50c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9153
Closes https://github.com/pytorch/pytorch/pull/9153
Modified the values reported by the benchmarking platform to include tensor_shape and op_args. These values have a different naming scheme to values like flops and latency.
Reviewed By: sf-wind
Differential Revision: D8729791
fbshipit-source-id: f050200be01c6d0794bf5faaa6e8cef12a00affe
Summary:
Storage views were previously used to implement CUDA IPC sharing,
but they weren't necessary. The new strategy is described in
Note [CUDA IPC and the caching allocator].
This also fixes an unrelated bug, where we weren't actually using
the Tensor forking pickler, because we didn't register a pickler
for torch.Tensor.
Fixes #9447. Fixes #46.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
CC apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9466
Reviewed By: apaszke
Differential Revision: D8859698
Pulled By: ezyang
fbshipit-source-id: 3362cb92f6ae4aa37084c57d79b31004bd0b4a97
Summary:
IValue is short for interpreter value. It is used frequently so a short name is important.
This will allow us to implement more non-tensor types in an efficient way and remove
many hacks from the compiler.
This PR is limited. It only introduces IValue and changes interpreter to use it.
Follow up PRs will:
* Change the way aten_ops consume non-tensor types so that integer lists
are no longer represented as Tensors.
* Introduce TensorList as a fundamental type and remove all vararg handling in gen_jit_dispatch
* Change the compiler to implement math on primitive numbers rather than converting to tensors.
jamesr66a apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9368
Reviewed By: ezyang
Differential Revision: D8817598
Pulled By: zdevito
fbshipit-source-id: 29dce80611ce5f6384234de9d12a67861d2b112f
Summary:
Add `WeakTensor` - a `Tensor` counterpart which doesn't keep the data (or any other expensive resources) alive. They can be `.lock()`ed and return `at::optional<Tensor>` if they're still alive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9363
Reviewed By: ezyang
Differential Revision: D8815434
Pulled By: apaszke
fbshipit-source-id: 1b3e96503c1285d78ef124c585e65c7630f3253e
Summary:
The tests were too flaky, and the procedure for legitimately
updating versions of software too onerous, to warrant continually
testing these.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9459
Reviewed By: zou3519
Differential Revision: D8852357
Pulled By: ezyang
fbshipit-source-id: 24e99cd00b4252cdeec2a1d9af92456b4a54912a
Summary:
If the type_as operator takes in two values with the same type, remove that operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9316
Reviewed By: zdevito
Differential Revision: D8808355
fbshipit-source-id: 2d5710a6380b22f4568fc38a439061b5340c4eb1
Summary:
`test_neg` sometimes fails internally because `random_()` can generate an out-of-range value for CharTensor. This PR fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9431
Reviewed By: SsnL
Differential Revision: D8843284
Pulled By: yf225
fbshipit-source-id: bf516cceb8f780e133fa54f7364c77821eb7c013
Summary:
This PR removes `distributions.utils._log_sum_exp` in favor of `torch.logsumexp`. Also fixes some warnings with `reduce` arg. in `binary_cross_entropy_with_logits`
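For illustration, the replacement computes log(sum(exp(x))) along a dimension in a numerically stable way:
```python
import torch

x = torch.randn(3, 5)
print(torch.logsumexp(x, dim=1))           # numerically stable
print(torch.log(torch.exp(x).sum(dim=1)))  # same values, prone to overflow
```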
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9173
Reviewed By: SsnL
Differential Revision: D8764174
Pulled By: ezyang
fbshipit-source-id: b9c4136dbf0182e8ae77082e6448d23a430d5cb6
Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:
- It must support the reallocation usage-pattern, where, given
an existing Storage, we allocate a new region of memory,
copy the existing data to it, and then deallocate the old
region of memory.
- Creation of a deleter for memory MUST avoid dynamic allocations
in the common case. We've done some benchmarking in Caffe2
where dynamic allocation for deleters is ruinously expensive,
and it's really hard to avoid these performance tarpits in
very general function wrappers like std::function or
folly::Function (while benchmarking this, we discovered that
folly::Function's move constructor was way more expensive
than it should be).
- We need to be able to deallocate data that comes from external
sources, e.g., dlpack and numpy tensors. Most notably,
you often cannot deallocate these with merely the void*
data pointer; you need some extra, out-of-band information
(e.g., the managing struct) to deallocate it. Sometimes,
you may even want to resize data living in an external source!
- The "core" allocators need to support being wrapped in a Thrust
allocator, so you need to be implement the following two functions:
char* allocate(size_t);
void deallocate(char*, size_t);
- We need to support tensors which contain non-POD, non-trivially
copyable data; specifically tensors of std::string. This is
an upcoming requirement from Caffe2. It's dirty AF, but
it's really useful.
- It should use C++ standard library types like std::unique_ptr
(which is hugely problematic because std::unique_ptr doesn't
call the deleter when the pointer is null.)
Here is the billing of changes:
- Built-in support for realloc() has been DROPPED ENTIRELY.
Instead, you're expected to allocate and then copy from
the old memory to the new memory if you want to do a
reallocation. This is what you'd generally have expected
to occur; and axing realloc() from the design lets us avoid
some tricky correctness issues with std::realloc(), namely
the fact that we must refuse the realloc if the type of the
elements is not trivially copyable. If it really matters,
we can add this back, but there really needs to be a good
explanation WHY you need fast resizing reallocations (by and
large, people don't resize their storages, and it should
be acceptable to have a performance degradation when they
do).
- TH_STORAGE_FREEMEM is no more; instead, if you want a
storage which doesn't free its result, you just give it
an empty deleter.
- What we used to call an "allocator" (really, a combined
object for allocating/deleting) has been split into two
concepts, an allocator, and a smart pointer (SupervisedPtr)
which knows how to delete data.
- Unlike previously, where THAllocator/THCDeviceAllocator
could have a per-tensor context storing extra information
(e.g., a pointer to the metadata you need to actually
free the tensor), there is no context in the allocator or
the deleter of the smart pointer; instead, the smart
pointer directly holds an owning reference to the
metadata necessary to free the data. This metadata
is *freshly manufactured* upon every allocation, which
permits us to resize tensors even in the absence of
built-in support for realloc().
- By default, allocators don't support "raw" allocations
and deallocations with raw pointers. This is because
some allocations may return a different context every
time, in which case you need to reconstruct the context
at delete time (because all you got was a void*, not
a unique_ptr that carries the deleter).
- The diff between at::Allocator and THCDeviceAllocator is a
bit larger:
- It used to return a cudaError_t. Now, allocators
are expected to check the error status immediately and throw
an exception if there was an error. It turns out that this
is what was immediately done after all occurrences of
allocate/release, so it wasn't a big deal (although some
subsidiary interfaces had to themselves be converted to
not return cudaError_t).
There is one notable exception to this, and it is how
we handle CUDA OOM: if this occurs, we attempt to return
unused memory to the system and try again. This is now
handled by a catch-all try-catch block. The cost of
catching the exception is probably the least of your worries
if you're about to OOM.
- It used to take the CUDA stream to perform the allocation
on as an argument. However, it turned out that all call
sites, this stream was the stream for the current device.
So we can push this into the allocator (and the choice,
in the future, could be made explicitly by twiddling
thread local state.)
- It held two extra methods, emptyCache and cacheInfo, specifically
for interacting with some state in THCCachingAllocator.
But this "generality" was a lie, since THCCachingAllocator
was the only allocator that actually implemented these
methods, and there is actually a bunch of code in THC
which assumes that it is the caching allocator that is
the underlying allocator for CUDA allocations. So I
folded these two methods into this interface as
THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.
- It held its context directly inside the THCDeviceAllocator
struct. This context has been moved out into whatever
is holding the at::Allocator*.
- The APIs for getting at allocators/deleters is now a little different.
- Previously there were a bunch of static variables you could get
the address of (e.g., &THDefaultAllocator); now there is a
function getTHDefaultAllocator().
- Some "allocators" didn't actually know how to allocate (e.g.,
the IPC "allocator"). These have been deleted; instead, you
can wrap the produced pointers into SupervisedPtr using
an appropriate makeSupervisedPtr() static method.
- Storage sharing was a lot of work to wrangle, but I think I've
tamed the beast.
- THMapAllocator and its "subclasses" have been refactored to
be proper, honest to goodness C++ classes. I used the enum
argument trick to get "named" constructors. We use inheritance
to add refcounting and management (in libshm). What we previously
called the "Context" class (Context has been dropped from the name)
is now the supervisor for the data.
- Sometimes, we need to pull out the file descriptor from a
tensor. Previously, it was pulled out of the allocator context.
Now, we pull it out of the supervisor of the SupervisedPtr,
using the static method fromSupervisedPtr(), which uses the
deleter as the typeid, and refines the type if it matches.
- I renamed the std::function deleter into
InefficientStdFunctionSupervisor, to emphasize the fact that it does
a dynamic allocation to save the std::function deleter.
TODO:
- Windows libshm is in shambles and needs to be fixed.
Perhaps for the future:
- newFromFd is now unconditionally calling cudaPointerGetAttributes
even though this is unnecessary, because we know what the device
is from higher up in the callstack. We can fix this by making
newWithDataAndAllocator also take an explicit device argument.
- Consider statically distinguishing between allocators that
support raw_allocate/raw_deallocate, and those which don't.
The Thrust constraint applies only to the CUDA device allocator;
you never need to allocate CPU memory this way
- Really want to get rid of storage views. Ugh.
Nontrivial bugs I noticed when preparing this patch:
- I forgot to placement-new unique pointers and attempted to
assign them directly on uninitialized memory; very bad! Sam
Gross has encouraged me to replace this with a proper constructor
but I keep putting it off, because once everything goes in
StorageImpl there really will be a proper constructor.
- I rewrote a number of APIs to use newWithDataAndAllocator
instead of newWithAllocator, calling the allocator at the
call site (because they required "allocation context" which
we no longer give to "allocators"). When I did this, I forgot
to insert the multiplication with sizeof(real) to scale from
numels to number of bytes.
- The implementation of swap on storages was missing it for
scalarType and backend. It was benign (because the only case
we call swap is when these are the same), but I fixed it anyway.
- I accidentally returned a nullptr unique_ptr with no deleter,
even though there was a legitimate one. This matters, because
some code still shoves its hands in the deleter context to
get extra metadata about the function.
- I used std::move() on a unique_ptr, and then did a boolean
test on the pointer afterwards (always false!)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9358
Reviewed By: SsnL
Differential Revision: D8811822
Pulled By: ezyang
fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
Summary:
This PR adds a device_ member to CUDAEvent. This is necessary because if we create a CUDA event on one device but destroy it from another, it also creates an additional context on that device. So this device information is needed to guard the cudaEventDestroy. (cc: ngimel is this expected behavior? I can provide a simple .cu script to repro this.)
c10d tests are probably not in CI yet; please let me know how the tests are run and I can double-check.
Thanks pietern apaszke for help debugging!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9415
Reviewed By: apaszke
Differential Revision: D8839688
Pulled By: ailzhang
fbshipit-source-id: b950ba37d57b9e3c5fe71726ec92f6a9601c4d0e
Summary:
Fixes: #9419
This assumes that anyone who knows localScalar can also grep for the
error message or get a traceback.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9443
Reviewed By: soumith
Differential Revision: D8850718
Pulled By: ezyang
fbshipit-source-id: a106fee718fef97064e861810a49ca05f536f27e
Summary:
Fixes: #9421
I don't think it is easy to deal with non-contiguous arrays in CUDA topk, so I'm adding a check.
The argument number is a bit confusing when it shows up in PyTorch, but it is consistent with the other checks. (Not sure whether it would make sense to eliminate argument numbers from the TH/THC error messages given that they're probably off more than once...)
Do we need a test that it indeed refuses non-contiguous?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9441
Reviewed By: soumith
Differential Revision: D8850719
Pulled By: ezyang
fbshipit-source-id: d50561bb37ed50ab97aeaf54d8e3fc6c765bdc7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9299
Onnx has ReduceL1 and ReduceL2 operators that would facilitate this, so allow pytorch to export those and allow caffe2 to run them.
I only implemented this on CPU so far.
Reviewed By: pjh5
Differential Revision: D8757381
fbshipit-source-id: 68afc9e2f90042a70929b73ace05a499b5c670c7
Summary:
During tracing (and export) we are now introducing an unnecessary hard-coded view on the RHS of indexed assignments such as `tensor[idxs] = rhs`. This caused a regression in the PyTorch translate models because these expressions appear with variable sizes in the RHS. This change makes it so we only call view if we indeed need to strip leading 1-dimensions.
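As a rough illustration of the kind of indexed assignment involved (the shapes below are made up for this example and are not from the translate models):
```python
import torch

x = torch.zeros(4, 5)
idxs = torch.tensor([0, 2])
rhs = torch.randn(1, 2, 5)  # a leading 1-dimension that may need to be stripped

# tensor[idxs] = rhs: the traced graph now only inserts a view when rhs really
# carries extra leading 1-dimensions like this.
x[idxs] = rhs
print(x[0], x[2])
```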
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9424
Reviewed By: colesbury
Differential Revision: D8838881
Pulled By: jamesr66a
fbshipit-source-id: 399e5daa7d021f4f59f6f92b9fae581f92bfc538
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9385
The operator transforms dense features to sparse features by bucketizing. Only the features in the indices tensor will be transformed and output.
Reviewed By: bddppq
Differential Revision: D8820351
fbshipit-source-id: a66cae546b870c6b2982ac20641f198334f2e853
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8999
Closes https://github.com/pytorch/pytorch/pull/8999
Implemented the WRgrad optimizer operator for the dense case (the base case as well as the case with additional outputs for the effective learning rate and update value) and the sparse case.
Reviewed By: pjh5
Differential Revision: D8627933
fbshipit-source-id: a63cde46c04bcc6b428ab5f77a4b3b2beb66c046
Summary:
I'm cramming through clang-tidy warnings. This PR addresses the `hi-cpp-override` check which warns that `virtual` + `override` is redundant, since `override` already signifies that a function is overriding and thus virtual.
Where there was `virtual` + `override` I removed the `virtual`, where there was `virtual` and no `override` I removed `virtual` and added `override`.
ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9335
Differential Revision: D8807082
Pulled By: goldsborough
fbshipit-source-id: e0a261053f6540a22cc56ec160a24aa285af6319
Summary:
This PR improves performance of (formerly) latency-bound non-contig-dim reduction kernels by up to 20X, while maintaining determinism.
Currently, reducing across a non-contiguous dimension uses the parallelism exposed across the number of output elements. This means that performance suffers if the number of output elements is small. Example:
```
a = torch.cuda.FloatTensor(32768, 32)
a.sum(dim=0)
```
Before this PR, `a.sum`'s kernel (kernelReduceNoncontigDim_shared) took 138 microseconds on my machine. The speed-of-light estimate (based on a bandwidth of 700 GB/s) should be around 6 microseconds. After this PR's changes, `a.sum(dim=0)`'s kernel takes 6.9 microseconds on my machine.
Christian implemented some nice logic to squeeze out better performance for cases like `a.sum` using intra-block and instruction-level parallelism across the dimension being reduced, but his kernel still only launched one block for every 32 output elements. This was insufficient to saturate the device in many cases, like `a.sum` here (where only one block is launched).
My PR adds block cooperation across the dimension being reduced. Many blocks, instead of one block, help to reduce into each 32 output elements. Internally, each block leverages all of Christian's nice logic to compute a partial reduction into a per-block staging buffer, then the last block to finish combines the results to compute the final output.
Block cooperation does require THCudaMalloc-ing staging and semaphore buffers, so it's not always worthwhile. I included a set of rough heuristics to decide when the kernel should choose to use block cooperation. These heuristics are based on Python-side timings of calling sum() many times in a loop, and comparing to the old implementation.
I tested a wide range of sizes (to determine heuristics) and as long as the number of output elements is greater than 16ish, I don't think there are any remaining pathological sizes where users will encounter unexpectedly poor performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9214
Reviewed By: gchanan
Differential Revision: D8808127
Pulled By: colesbury
fbshipit-source-id: 139f310fc6ea6d187a7c983128f8eb8e1c9b4be3
Summary:
While talking to mruberry, I noticed a few places that use
special cast wrappers that are no longer necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9401
Differential Revision: D8828874
Pulled By: colesbury
fbshipit-source-id: 2b7fe7ac3af3b71be26b43a9ad3949f8065a7bc9
Summary:
This is to unify the handling of empty tensors in std/var between the dimension reduce and all reduce cases.
Also to avoid triggering ubsan errors around divide by 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9400
Reviewed By: ezyang
Differential Revision: D8828879
Pulled By: gchanan
fbshipit-source-id: 6b9306805c94251eec28bd12e234618338bff4e3
Summary:
This includes either bug fixes or NumPy semantics changes for the following methods:
chunk, diagonal, unfold, repeat, flatten, reshape, split, unsqueeze.
The n-dimensional empty tensor feature is still hidden behind a feature flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9362
Reviewed By: ezyang
Differential Revision: D8817002
Pulled By: gchanan
fbshipit-source-id: 6ff704ec96375f00b4dd39ebcd976efac0607fb4
Summary:
Pure experimental addition to guide us on delivering this
into real production systems and their threadpools. Biggest limitation
now is that we need to turn off BlackBoxPredictor activation
deallocation logic to get to sane performance
Reviewed By: highker
Differential Revision: D8798029
fbshipit-source-id: ec7962689d605fba62b2c9e0904309df567a25a4
Summary:
This was previously meant to be used for c10 code but that plan has since changed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9367
Reviewed By: orionr
Differential Revision: D8814361
Pulled By: smessmer
fbshipit-source-id: 8e35fa74e160343a2bb8432013847677aa73695a
Summary:
To allow our C++ customers to use our initialization methods as well, this PR moves some of the code from `torch.nn.init` to ATen, calls it from Python, and adds equivalent code to the C++ frontend.
Notes:
1. Happy to hear thoughts on whether it's ok to have e.g. `torch.nn.init.dirac_` *and* `torch.dirac_` (the former has a `no_grad` guard). We have this for `ones_` and stuff too, so I don't mind it.
2. I left the exception checking in Python because they throw `ValueError`s while ATen errors show as `RuntimeError`s. I imagine this would break users' error handling if someone had a `try`-`except` handler for `ValueError` (or maybe that's far-fetched).
EDIT: After discussions with zdevito, the PR now simply duplicates the code in C++ exclusively for the C++ API, and we leave the Python code as-is (to make it easier for people to read/modify).
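For reference, a minimal sketch of the Python-side usage that stays unchanged (the layer shape is arbitrary):
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)
# The init functions mutate the parameters in place under a no_grad guard.
nn.init.dirac_(conv.weight)
nn.init.constant_(conv.bias, 0.0)
print(conv.weight.sum())
```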
ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9295
Differential Revision: D8813793
Pulled By: goldsborough
fbshipit-source-id: 4b969f3f75952c1be4e837e19e23b8098e5fbd4b
Summary:
Migrated PriorCorrectionCalibration from Dper2 layer to Dper3 module.
A few notes:
1. Calibration operators need dynamic linking;
2. All calibration implementation and tests are located in /modules/calibration/
3. Added a type inference function in operator_schema.h/operator_schema.cc
Reviewed By: idning
Differential Revision: D8756832
fbshipit-source-id: 7e6300a3bb3d3feaaf3b82340ece2f35d71493fc
Summary:
This PR changes the ATen `CMakeLists.txt` slightly, to enable standalone build of ATen inside PyTorch. Currently, the tests in ATen gets linked to `libcaffe.so libcaffe2.so`. As a result, ATen can't be built standalone without building from the root pytorch directory. I know that there is a big merge happening between caffe2 and pytorch and hence, the purpose of this PR is to really start a conversation on what would be the proper way of migrating the CMakeLists to enable clean builds. We should also follow up on this PR: https://github.com/pytorch/pytorch/pull/7275. For your reference, that PR has the explanation for why `-Wl --no-as-need` is needed. Moreover, without `set(ATen_CUDA_SRCS ${all_cuda_cpp})`, the standalone build will throw unresolved references.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9377
Reviewed By: smessmer
Differential Revision: D8825921
Pulled By: orionr
fbshipit-source-id: c521159b4885639fc7990a9819202051455d07db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9396
The custom and local CUDADevice RAII wrapper has been superseded by at::DeviceGuard so it doesn't make sense to keep it around.
Reviewed By: ailzhang
Differential Revision: D8824200
fbshipit-source-id: 39fa00ffab4f495606c8001446e976bbf603e866
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9309
This is faster when you're dealing with a small number of processes.
Around the 16 processes mark the halving/doubling algorithm is faster.
Reviewed By: apaszke
Differential Revision: D8785364
fbshipit-source-id: 4a03326266e473026d943787186e149d0cc489f0
Summary:
Use the decorator `torch.jit.batch` to implement auto-batching (it calls the `to_batch` pass to do the IR transformation).
- `to_batch` pass: "to_batch.h/cpp" in csrc/jit/passes to transform a graph to a new batched graph.
- Write several basic operators for BatchTensor (add, mul, sigmoid, tanh, mm, matmul, select).
- Register the operators in a lookup table `<std::string, std::shared_ptr<Graph>>`. (use the Graph to replace the original node in IR graph)
Move BatchTensor in python from torch.BatchTensor to torch.jit.BatchTensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9198
Reviewed By: zdevito
Differential Revision: D8744466
Pulled By: ChunliF
fbshipit-source-id: 9ea56a30f55cb870f13a2069a47cc635419763ff
Summary:
In the C++ API, `Sequential` was not refcounted itself, but stored `shared_ptr<AnyModule>` to get reference semantics. This is unfortunate because most modules in the API are accessed via `->`, e.g. `Linear l(1, 2); l->forward(...);`. `Sequential` was different in that it had value semantics itself, and was thus accessed via `.`.
This PR makes `Sequential` store `AnyModule` (without extra indirection), and uses the same pImpl mechanism we use for all other modules to make `Sequential` have reference semantics itself. This makes it consistent with the rest of the library. It also removes one level of indirection inside of `Sequential`, which is cool.
One thing I had to change was that the `ModuleHolder` with which the whole pImpl thing is implemented previously did some tricks to make `Linear(3, 4)` actually construct `Linear(LinearOptions(3, 4))`. This doesn't work well with `Sequential` since it takes a variadic parameter pack. Instead, I made `ModuleHolder` forward all arguments to the underlying module, and then further pushed the trick to forward parameters to modules' options types into the actual Modules. This adds one constructor per Module in the library. This is not something user modules have to do (unless they want this nice forwarding themselves). It makes the code simpler overall.
ezyang ebetica apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9151
Reviewed By: ezyang
Differential Revision: D8809298
Pulled By: goldsborough
fbshipit-source-id: da68452c3de912fbc67af330ba93b5220de6909f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9352
I am debugging a failed workflow f61490672, and found the original error message to be uninformative.
Differential Revision: D8808181
fbshipit-source-id: 3f524ca092881186a492c5c0456124ce31d54751
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9350
Re-apply #9270
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. Fix is to isolate Eigen from headers included by cu files and processed by nvcc. This was worked on with smessmer.
Reviewed By: mingzhe09088
Differential Revision: D8794431
fbshipit-source-id: de656334af46c697802073f8e8d9a6aeb9ca65a7
Summary:
Breaking this out of #8338
This fixes some CUDA related build and runtime issues after BUILD_CAFFE2 and BUILD_ATEN are removed.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9347
Reviewed By: orionr
Differential Revision: D8806954
Pulled By: mingzhe09088
fbshipit-source-id: 9f8e3feee06478d1ac2deb30796939453352d388
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9056
Closes https://github.com/pytorch/pytorch/pull/9056
Updates bbox_transform for rotated boxes with angle info to normalize the
predicted angle to be within [angle_bound_lo, angle_bound_hi] range.
Reviewed By: pjh5
Differential Revision: D8706240
fbshipit-source-id: f3ee834cf362736136e285f0f8f0c063af94a879
Summary:
THNN was accumulating the result of reduction loss functions
into real instead of accreal. This was causing precision issues with
MSELoss.
This patch only fixes MSELoss. Some of the other losses exhibit bad precision as well (because they accumulate into real instead of accreal) and require more investigation. I will open an issue for those (#9286)
Fixes #8710
cc li-roy SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9287
Reviewed By: SsnL
Differential Revision: D8775708
Pulled By: zou3519
fbshipit-source-id: d1a1f159deee0cb90fd8e81e63b246115eea8e9e
Summary:
operator.cpp is not generated; removing the line prevents generate_code.py from always thinking it is out of date and re-running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9339
Reviewed By: ezyang
Differential Revision: D8798689
Pulled By: zdevito
fbshipit-source-id: f25a2e215fec29aa51571e6a31771f0f91e7a213
Summary:
dlpacks deserve documentation. :)
I wonder whether it might make sense to merge the various small torch.utils pages (and include a link for the larger ones, e.g. data) to enhance the structure in the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9343
Differential Revision: D8801227
Pulled By: soumith
fbshipit-source-id: 2980d271971743b86f052bec5a2cb4d146a90d9b
Summary:
Helps prevent calling functions of the base case on float/double/int subclasses that aren't supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9321
Reviewed By: colesbury
Differential Revision: D8793627
Pulled By: cpuhrsch
fbshipit-source-id: 7fde779ecd4b890dda406f3d1306b58bab40efe2
Summary:
As discussed on the call, this will allow us to keep this integral part of the effort to run PyTorch on ROCm in sync with the main code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8812
Reviewed By: ezyang
Differential Revision: D8796245
Pulled By: bddppq
fbshipit-source-id: 8e12c2acf6a7e0740f31b21e50be74e10ed8b12c
Summary:
This is a series of two commits that should probably be read separately. They are stacked on top of #9018 since the second commit requires it for correctness.
Commit 1
=======
This commit is the first in a series that will clean up how we handle declaring operators and intrinsics in the JIT to make it more modular and readable. This introduces readable declarations that can be used to register operators and switches gen_jit_dispatch to generate this schema. A follow up PR will remove the dispatch keys like "add-3" and resolve ops directly based on the registered schema, further simplifying the generation process.
* Switches schema over to parsed declarations, in the future this will allow something like:
```
registry.register_intrinsic("foo(Tensor a, Tensor b) -> Tensor", [](Stack& stack) {
...
})
```
This will allow the scalable registration of intrinsics for lists, tuples, and other ops, as well as metadata for these ops (e.g. derivatives and size propagation routines).
The declarations resemble those used by PythonArgParser but have been significantly cleaned up to minimize the number of types that can appear in the declaration. We should strive to get the other parts of PyTorch switched over to this restricted declaration set when possible, but it is too much to do in a single PR. My hope is that eventually we will use a very similar language to describe declarations in C10, and this can serve as a guide for that.
Parsing is done using the script lexer, so it is very robust to whitespace and extensible for future types.
This removes the other way we encoded schema, and makes it easier to see what schema are registered.
Current generated declarations: https://gist.github.com/zdevito/a96a17766fb3a098d69a91ee00abaaf6
* Switches how we handle attempting to use an integer in the place of a fixed-sized int list, such as in conv (e.g. 'int[3] stride=1'). Now that we can statically distinguish between int and Tensor, we handle the expansion as an implicit conversion in the compiler. This allows us to simplify the interpreter since it no longer needs to handle the conversion itself.
* Schema declarations have been changed so that they match the type system in the IR exactly. In particular, attribute_info which was used by liftConstantAttributes has been dropped and constant attributes are lifted purely based on the type of the input. Type conversions in compiler have been simplified due to this change.
* Error highlighting in ErrorReport now only reports at most 20 lines of code, to make reading where an error occurred easier.
Commit 2
=======
This commit unifies aten_dispatch and aten_schema into a single Operator object that both contains schema and implementation information. In the future we can use this object to also contain functionality like shape prop and autodiff needed by all operators. Operators are registered globally, and dispatch logic uses the schema information to figure out which variant to use. Descriptor keys, a frequent source of inscrutable debug errors, have been removed.
* Introduce Operator, to replace TensorOp. Unlike TensorOp, we use Operator for all op implementations, including primitives that may occur in the graphs. The only exceptions are ops that are only known to the interpreter like jumps, and GraphExecutors where we need to record additional debug info.
* Adds a global registry for Operator implementations. aten_dispatch.cpp turns into register_aten_ops.cpp, which registers all the Operators for aten with the operator registry. register_prim_ops.cpp now contains the implementations for primitive operators that used to be in the interpreter. This means that it is now safe to use `getOperation(node)` to lookup the true interpreter function for the node, which will simplify const-propagation passes.
* Remove addInterpreterOpHandler in favor of global operator registry.
* Instead of descriptors, we match Node arguments directly against the FunctionSchema describing expected inputs in `matchSchema`. `matchSchema` knows how to parse both attributes and positional inputs from a node and match them to the appropriate registered operator. Debug error messages when we try to run an invalid operator are significantly improved: they now automatically display the schemas registered for ops with the same name.
* Merge aten_schema into register_aten_ops. Each Operator takes a string schema which is parsed to determine when to dispatch to that op.
* Cleans up gen_jit_dispatch.py now that we do not need to write out descriptors. In particular, skip_scalar_overloads can be removed since Richard's code sorts declarations to put Tensor, Tensor declarations first.
* remove matchSchemaAndLiftConstantAttributes and use emitBuiltinCall instead to remove code duplication
* refactor stack manipulation functions into a separate header file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8885
Reviewed By: jamesr66a
Differential Revision: D8751048
Pulled By: zdevito
fbshipit-source-id: 312aabfbf88307c5f6ab947b6caf691468b94557
Summary:
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. Fix is to isolate Eigen from headers included by cu files and processed by nvcc. This was worked on with smessmer.
cc mingzhe09088 smessmer BIT-silence Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9270
Reviewed By: mingzhe09088
Differential Revision: D8768025
Pulled By: orionr
fbshipit-source-id: 5b34017aeb67e35a1b5938d962181ccd4cd37591
Summary:
Usually the DLPack consumer is expected to call the DLManagedTensor's
deleter to signal that it doesn't need the contents.
This patch calls the deleter when freeing unconsumed
DLPack capsules created by PyTorch.
Test script:
```
import torch
import torch.utils.dlpack
import gc
for i in range(10000):
    a = torch.randn(1000, 1000, dtype=torch.float32, device='cuda')
    b = torch.utils.dlpack.to_dlpack(a)
    gc.collect()
```
Before patch: consume all GPU ram.
After patch: constant GPU ram consumption.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9297
Differential Revision: D8781571
Pulled By: soumith
fbshipit-source-id: 2ebadec6c857646220d632ca64110af430dbd52f
Summary:
I'm trying to write a multi-GPU network by pipelining some layers onto different GPUs. However, the current gradient clipping requires all the parameters to be on the same device.
The CUDA launch overhead is reduced since the scalar calculation is performed on the CPU, but this introduces extra data transfers.
No performance regression is observed when running the following snippet:
```python
import time
import torch
module = torch.nn.Sequential(
    torch.nn.LSTM(1024, 1024),
    torch.nn.LSTM(256, 256),
    torch.nn.Linear(100, 10000),
).cuda()
torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()
start = time.time()
for _ in range(1000):
    torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()
time_elapse = time.time() - start
print('{} ms per clip'.format(time_elapse))
```
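As a hedged sketch of the motivating use case (layer sizes and devices are illustrative and assume two visible GPUs), clipping gradients of parameters that live on different devices now looks like this:
```python
import torch
import torch.nn as nn

part1 = nn.Linear(128, 128).to('cuda:0')
part2 = nn.Linear(128, 10).to('cuda:1')

x = torch.randn(16, 128, device='cuda:0')
loss = part2(part1(x).to('cuda:1')).sum()
loss.backward()

# Gradients live on different devices; the total norm is accumulated on the CPU.
total_norm = nn.utils.clip_grad_norm_(
    list(part1.parameters()) + list(part2.parameters()), max_norm=1.0)
print(total_norm)
```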
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9302
Differential Revision: D8781551
Pulled By: soumith
fbshipit-source-id: 9d76d01fe0531927f770a16b9523872a7e08e927
Summary:
Fixes #9264.
There can be so many elements in the output of `vol2col` that the count overflows the `int` range! This PR changes 3d conv to use `int64_t` mostly.
Also fixes some unused variable warnings (cc goldsborough).
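To see how easily the element count overflows, here is some illustrative arithmetic (the shapes are assumptions, not taken from the original issue):
```python
# vol2col materializes roughly C * kT*kH*kW * outT*outH*outW elements, which can
# exceed the int32 range for a modest-looking 3d convolution.
C = 128
kT, kH, kW = 3, 3, 3
outT, outH, outW = 64, 128, 128
elements = C * kT * kH * kW * outT * outH * outW
print(elements, elements > 2**31 - 1)  # 3623878656 True
```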
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9274
Differential Revision: D8770682
Pulled By: SsnL
fbshipit-source-id: f6e37f1aa56fe1009dd4c9bcbc042244e47252db
Summary:
The underlying use-case is the file descriptor to storage cache in
torch.multiprocessing.reductions. Previously, this was implemented by wrapping
an existing allocator with a "weak ref" allocator which also knew to null out
the weak reference when the storage died. This is terribly oblique, and
prevents us from refactoring the allocators to get rid of per-storage allocator
state.
So instead of going through this fiasco, we instead directly implement weak
pointers and finalizers in THStorage. Weak pointers to THStorage retain the
THStorage struct, but not the data_ptr. When all strong references die,
data_ptr dies and the finalizers get invoked.
There is one major hazard in this patch, which is what happens if you
repeatedly call _weak_ref on a storage. For cleanliness, we no longer
shove our grubby fingers into the finalizer struct to see if there is already
a Python object for the weak reference and return it; we just create a new one
(no one is checking these Python objects for identity). This means if you
keep calling it, we'll keep piling on finalizers. That's bad! But I am
not going to fix it until it is actually a problem for someone, because
then we need to add another caching layer.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9148
Differential Revision: D8729106
Pulled By: ezyang
fbshipit-source-id: 69710ca3b7c7e05069090e1b263f8b6b9f1cf72f
Summary:
Fix the problem if caffe2 works with old version of onnx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9284
Reviewed By: yinghai
Differential Revision: D8773894
Pulled By: houseroad
fbshipit-source-id: 99b5a962099f854edc85a2ea815cb88c82a6e175
Summary:
ONNX-TensorRT still uses an old opset (<7). Patch it for now.
A future fix would be to expose versioning in the ONNX exporter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9285
Reviewed By: houseroad
Differential Revision: D8775268
Pulled By: yinghai
fbshipit-source-id: c272073f80cce35ebd971e44ec9472e3c8fd4b9e
Summary:
This PR implements and tests N-dimensional empty tensors for indexing, factories, and reductions if compiled with -DUSE_TH_SIZE_ZERO_DIM.
Still remaining to add:
1) TensorShape functions
2) Simple linear algebra functions (matrix multiply variants)
3) Other functions that operate over a dimension (but don't reduce).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9209
Reviewed By: ezyang
Differential Revision: D8751257
Pulled By: gchanan
fbshipit-source-id: 2113374dc7af6caf31a99bf67b3893f130a29e23
Summary:
Tested on my mac on a pretty clean anaconda3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8509
Reviewed By: orionr
Differential Revision: D8702257
Pulled By: pjh5
fbshipit-source-id: eda03ef9732da9fc56b31d909af5c0e39520d689
Summary:
Breaking this out of #8338
This fixed Mac build issues after BUILD_CAFFE2 and BUILD_ATEN are removed.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9283
Reviewed By: orionr
Differential Revision: D8773459
Pulled By: mingzhe09088
fbshipit-source-id: 71942e8e6891a625e6b1a7dc0160e87444c64209
Summary:
Breaking this out of #8338
When BUILD_CAFFE2 and BUILD_ATEN are removed, we need to install typing on Mac.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9271
Reviewed By: orionr
Differential Revision: D8768701
Pulled By: mingzhe09088
fbshipit-source-id: 052b96e90e64b01e6b5dd48b91c0fb12fb96b54a
Summary:
Breaking out of #8338
This fixes the build issues with pytorch on linux machines after BUILD_CAFFE2 and BUILD_ATEN are removed.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9273
Reviewed By: orionr
Differential Revision: D8768869
Pulled By: mingzhe09088
fbshipit-source-id: 2730426ed1bed398eb5dc804c7348aeeb27c93d3
Summary:
Breaking this out of #8338
This takes care of failures we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. Specifically, smessmer fixed `std::hash` being handled in a weird way by nvcc and I fixed an nvcc template issue by moving `SparseNormalizeOp::RunOnDevice` implementation into the cc file.
cc mingzhe09088 smessmer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9269
Reviewed By: mingzhe09088
Differential Revision: D8767984
Pulled By: orionr
fbshipit-source-id: 550686bfcef6d331f16d593859c99169216c5c2e
Summary:
Breaking this out of #8338
This fixed an Android build issue after BUILD_CAFFE2 and BUILD_ATEN are removed.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9275
Reviewed By: orionr
Differential Revision: D8769913
Pulled By: mingzhe09088
fbshipit-source-id: afce52a12697757a0b2103c7c343e19ab158a9f7
Summary:
Breaking this out of https://github.com/pytorch/pytorch/pull/8338
Use a local version of `np.rot90` with an `axes` argument, since we don't have NumPy 1.12.0 in all of the test environments. Caffe2 conda2-ubuntu16.04, for example, fails. Generally, it seems better to not require a NumPy bump just for this test.
cc mingzhe09088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9267
Reviewed By: mingzhe09088
Differential Revision: D8767819
Pulled By: orionr
fbshipit-source-id: c51a6295d58366eba06e4e55e3f1ffaa8af96975
Summary:
Breaking this out of #8338
More changes required to support USE_CUDNN=OFF. We should be able to land some of our fixes before the big BUILD_CAFFE2 and BUILD_ATEN removal lands.
cc mingzhe09088 Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9268
Reviewed By: mingzhe09088
Differential Revision: D8767981
Pulled By: orionr
fbshipit-source-id: 0607ca2773253b685209c274a3adf70180d8ce58
Summary:
Commits:
1. In the extension doc, get rid of all references to `Variable`s (Closes #6947)
+ also add minor improvements
+ also added a section with links to cpp extension :) goldsborough
+ removed mentions of `autograd.Function.requires_grad` as it's not used anywhere and hardcoded to return `Py_True`.
2. Fix several sphinx warnings
3. Change `*` in equations in `module/conv.py` to `\times`
4. Fix docs for `Fold` and `Unfold`.
+ Added a better shape check for `Fold` (it previously could give bogus results when there are not enough blocks). Added tests for the checks.
5. Fix the doc saying `trtrs` is not available for CUDA (#9247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9239
Reviewed By: soumith
Differential Revision: D8762492
Pulled By: SsnL
fbshipit-source-id: 13cd91128981a94493d5efdf250c40465f84346a
Summary:
When we moved the libaten build into libcaffe2, we changed the location where it generated compile_commands.json such that it was no longer being picked up by the build script. This fixes it so it is still found.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9227
Reviewed By: goldsborough
Differential Revision: D8757984
Pulled By: zdevito
fbshipit-source-id: 73df26bf08d98f18ac841d6c0db7e332fd328ab6
Summary:
Here's an improved error message. Let me know if this change makes the errors a little clearer.
Closes https://github.com/pytorch/pytorch/pull/9212
Reviewed By: soumith
Differential Revision: D8752896
Pulled By: jramseyer
fbshipit-source-id: d2bd8462c3ddf14acd3de56a4c1aeb75a9bc4067
Summary:
This PR moves the THCStream logic (from both the THCStream and THCState APIs) to ATen. In particular, it:
+ Creates a new (THC free) at::CUDAStream class and API
+ Extends the at::Context API to expose it
+ Stubs the current THCStream and THCState APIs to use it
+ Updates THC to no longer violate stream encapsulation (stream.hpp is dead)
+ Adds an ATen cpp test of the API
+ Bonus: Removes some debug spew in test_nn.py
The new API has several advantages over the old one:
(1) It comes with an easy to use RAII, the CUDAStream. CUDAStreams have the expected copy and move semantics and are implicitly convertible to cudaStream_t.
(2) It does not depend on THCState, THCThreadLocal, or CUDA (thanks to goldsborough for suggesting the dynamic registration technique)
(3) It provides one consistent API/place for all stream operations, instead of having them split between THCStream and THCState
(4) The internals are completely encapsulated, unlike the historic THCStream
(5) It has getAndRetain semantics, which are safer than the historic gets (which allowed a gap between acquisition and retention)
There are a couple things this PR does not do, however, which are left for future work:
- It leaves the c10d:CUDAStream class as a THCStream wrapper (which now really wraps an at::CUDAStream).
- It leaves historic users of THCStream mostly untouched, except where they violated encapsulation (by using stream.hpp). A couple forward declarations were also changed.
I hope this PR allows easy usage of streams from ATen and is a useful pattern for porting more of the THCState API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8997
Differential Revision: D8683375
Pulled By: soumith
fbshipit-source-id: 2e48ad85f1f9c8817684fe63a267938e80eafdcf
Summary:
This looks like a totally cosmetic change, but for some reason it reduces the runtime by ~50% running in a single CPU thread.
```
import os
os.environ['OMP_NUM_THREADS'] = '1'  # Use one CPU thread
import torch, torch.nn as nn, time

def test_net(net, offset):
    net.eval()
    total = 0
    with torch.no_grad():
        for _ in range(100):
            x = torch.randn(100, 100, 100) + offset
            start_time = time.time()
            y = net(x)
            total += time.time() - start_time
    print(net, total * 10, 'ms')

for offset in [-1, 0, +1]:
    test_net(nn.LeakyReLU(), offset)
    test_net(nn.PReLU(), offset)
```
Closes https://github.com/pytorch/pytorch/pull/9206
Reviewed By: yf225
Differential Revision: D8749491
Pulled By: btgraham
fbshipit-source-id: 3db8049dd151c0ba9ae1dd5c05bcc58bcab97e9a
Summary:
This PR addresses #5823.
* fix docstring: upsample doesn't support LongTensor
* Enable float scale up & down sampling for linear/bilinear/trilinear modes. (following SsnL 's commit)
* Enable float scale up & down sampling for nearest mode. Note that our implementation is slightly different from TF in that there's actually no "align_corners" concept in this mode.
* Add a new `interpolate` function API to replace `upsample`, and add a deprecation warning for `upsample` (a usage sketch follows this list).
* Add an area mode which is essentially Adaptive_average_pooling into resize_image.
* Add test cases for interpolate in test_nn.py
* Add a few comments to help understand *linear interpolation code.
* There is only "*cubic" mode missing in resize_images API which is pretty useful in practice. And it's labeled as hackamonth here #1552. I discussed with SsnL that we probably want to implement all new ops in ATen instead of THNN/THCUNN. Depending on the priority, I could either put it in my queue or leave it for a HAMer.
* After the change, the files named *Upsampling*.c work for both up- and down-sampling. I could rename the files if needed.
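A small usage sketch of the new `interpolate` API mentioned above (the values are illustrative):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)

# Float scale factors now work for both up- and down-sampling.
up = F.interpolate(x, scale_factor=2.5, mode='bilinear', align_corners=False)
down = F.interpolate(x, scale_factor=0.5, mode='nearest')
print(up.shape, down.shape)  # torch.Size([1, 3, 160, 160]) torch.Size([1, 3, 32, 32])
```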
Differential Revision: D8729635
Pulled By: ailzhang
fbshipit-source-id: a98dc5e1f587fce17606b5764db695366a6bb56b
Summary:
Closes https://github.com/pytorch/pytorch/pull/9199
The input shapes are not logged correctly in production because `PerfNetObserver::Stop()` only gets called after the inference is done for the net, and in the mobile models it's common practice to reuse the blobs as much as possible to save memory. The shapes of the blobs keep changing during inference. By the time you query `InputTensorShapes()` in `PerfNetObserver::Stop()`, you only get the final shape of the blobs.
To fix this bug, I moved the 'InputTensorShapes()' query from `PerfNetObserver::Stop()` to `PerfOperatorObserver::Stop()`. The latter gets called at the end of operator->run() whereas `PerfNetObserver::Stop()` gets called at the end of net->run().
Also remove `PerfOperatorObserver::getAnalyticalCost()` since it's now done on the server side and no longer needed on mobile
Reviewed By: Maratyszcza
Differential Revision: D8743346
fbshipit-source-id: 5d2d0132e3f5e084be7d0173863e695e62a6b4a0
Summary:
Closes https://github.com/pytorch/pytorch/pull/9048
The max_length argument fixes the shape of the output to be N * max_length * D, where N is the batch size and D is the feature dimension.
Reviewed By: bddppq
Differential Revision: D8702782
fbshipit-source-id: e30555608fee1c4a61cc95922f4a71c7f54903af
Summary:
[x] get registry working
[x] move all current ops to registry
Reviewed By: yinghai
Differential Revision: D8706115
fbshipit-source-id: 8dfce79039b57dea1c15e8e291cdd74f39766ade
Summary:
As I try to replicate DP in C++, I need to move some functions into C++ from Python. This PR ports the scatter and gather primitives from Python in torch/cuda/comm.py to C++ in torch/csrc/cuda/comm.cpp. The basic infrastructure was already there, since apaszke had rewritten broadcast in C++ already.
I'm not very familiar with this code, so let me know if I'm doing something wrong. I largely just literally translated the code.
I don't know how "public" `torch.cuda.comm` is, but I feel like the `destination_index` parameter for `gather` should be changed from -1 indicating CPU to `None` indicating CPU, and `-1` indicating the default CUDA device. That would make the code clearer IMO.
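For reference, a minimal sketch of the Python-level primitives being ported (assumes at least two visible GPUs; sizes are arbitrary):
```python
import torch
import torch.cuda.comm as comm

x = torch.randn(8, 4, device='cuda:0')
chunks = comm.scatter(x, devices=[0, 1])       # split along dim 0 across GPUs
gathered = comm.gather(chunks, destination=0)  # concatenate back on GPU 0
print([c.device for c in chunks], gathered.shape)
```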
apaszke colesbury teng-li pietern
Closes https://github.com/pytorch/pytorch/pull/9117
Differential Revision: D8721729
Pulled By: goldsborough
fbshipit-source-id: 1844a488079d21fa209b32e2c73e48632cbe9e68
Summary:
Added a way to `dynamic_cast` an `nn::Module` and get a pointer to it. `nn::Module::is<T>` just checked if the return value of the `dynamic_cast` was nullptr, so I got rid of `is<T>` since it's equivalent to `as<T> != nullptr` (or just `as<T>` due to boolean conversion).
We're now at
```
if (auto* conv = module.as<nn::Conv2d>()) {
  conv->weight.data().normal_(0.0, 0.02);
} else if (auto* bn = module.as<nn::BatchNorm>()) {
  bn->weight.data().normal_(1.0, 0.02);
  bn->bias.data().fill_(0);
}
```
ezyang apaszke ebetica
Closes https://github.com/pytorch/pytorch/pull/9149
Differential Revision: D8735954
Pulled By: goldsborough
fbshipit-source-id: e2b8f6f0cea16a621f8bc0807a33cc7651d25154
Summary:
Context: I am updating jit::FunctionSchema to use `Symbol name;` rather than `std::string name`. Sometimes the name refers to a builtin thing like `prim::UnpackTuple`, sometimes to an aten operator like `aten::add`, and sometimes just to a raw string, like `my_method_foo` that really doesn't belong in any namespace and should be printed to the user in that form. For this last case, I want the ability to create a raw Symbol again, like was previously possible, that just represents an interned string. This PR enables that use, keeps the other functionality still possible, and simplifies interned_string's implementation a bit.
This changes how Symbol is implemented. Now the namespace of a symbol
is optional and the namespaces themselves are Symbols.
This allows Symbol to be used with arbitrary namespaces, and allows
you to use Symbol as a simple interned string via fromQualString
and toQualString, without :: in the string. This also simplifies the
implementation. Like with string conversion, builtin primitives go
through a fast path for namespace lookup while registered symbols require
holding a lock and reading an array entry to lookup the namespace.
Note: alexnet expect file update is from a previous commit. It doesn't run in CI because pytorch vision is not installed.
Closes https://github.com/pytorch/pytorch/pull/9018
Reviewed By: SsnL
Differential Revision: D8690449
Pulled By: zdevito
fbshipit-source-id: b65ee57704641d7294fe115c5470cf55d406458f
Summary:
Similar to https://github.com/pytorch/pytorch/pull/9187, This PR makes setting the `PYTORCH_TEST_WITH_ASAN` and `PYTORCH_TEST_WITH_UBSAN` flags easier internally, by allowing the flags to be set to `0`.
Closes https://github.com/pytorch/pytorch/pull/9202
Differential Revision: D8745533
Pulled By: yf225
fbshipit-source-id: 6293f52f2e8b1c3ef150becfdc2dd7ded56d5d80
Summary:
This is necessary for n-dimensional empty tensors, which have special native handling.
Closes https://github.com/pytorch/pytorch/pull/9197
Differential Revision: D8744083
Pulled By: gchanan
fbshipit-source-id: 3cc692a1d62cbeb169681b7c40e3df50e12953b7
Summary:
I've been cleaning up my email notifications, and noticed that this PR used a stack-allocated `random_device`. This is generally a bad idea due to this sentence from the C++ reference (emphasis mine):
> `std::random_device` may be implemented in terms of an implementation-defined pseudo-random number engine if a non-deterministic source (e.g. a hardware device) is not available to the implementation. **In this case each `std::random_device` object may generate the same number sequence.**
If this is how this object is implemented, then this `rd()` call will give the same result at every call.
cc yf225
Closes https://github.com/pytorch/pytorch/pull/9080
Differential Revision: D8748342
Pulled By: soumith
fbshipit-source-id: 22987befee61ff7faacda5ecc10138c2ac5d26ff
Summary:
Previously this used the ``.tolist`` method, which converted the
storage object into a list of Python objects, and then sent those to
pickle. For storage objects of non-trivial size, this was very slow.
Now we reuse the logic of the ``torch.save`` function to efficiently
turn the Storage object into bytes, and send those instead. This
reduces the semantic information (it's harder to interpret the bytes)
but should be orders of magnitude more efficient when serializing data
with the pickle protocol or with copy.
For future work it would be nice to develop a mechanism to get a buffer
of bytes out of a Storage object, and use that alongside the current
``from_buffer`` method.
See #9168 for context
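A rough illustration of the behavior (the size is arbitrary):
```python
import pickle
import torch

s = torch.randn(1_000_000).storage()

# Pickling (and copy, which goes through the same protocol) now serializes the
# raw bytes of the storage instead of a Python list of its elements.
blob = pickle.dumps(s)
s2 = pickle.loads(blob)
print(len(s2), float(s2[0]) == float(s[0]))  # 1000000 True
```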
Closes https://github.com/pytorch/pytorch/pull/9184
Differential Revision: D8747794
Pulled By: soumith
fbshipit-source-id: ac598e660c043788ed1ffab3d0303812886edf79
Summary:
1. Let `ModuleTest`s raise when they fail on non-contiguous inputs. Fix legacy modules.
2. Fix BN (both THNN and cuDNN) not working on non-contiguous inputs.
3. Fix CUDA EmbeddingBag not working on non-contiguous inputs. To prevent calling `.contiguous()` on the input in both `forward` and `backward`,
a. prefix all current `embedding_bag*` functions with `_`, indicating that they require input to be contiguous (there is a check in each function).
b. create `embedding_bag`, which makes input arguments `.contiguous()`, and calls `_embedding_bag`
4. Make many ATen `embedding*` functions work on non-contiguous inputs so we don't need to call `input = input.contiguous()` in Python `nn.functional.embedding` (see the sketch after this list).
5. Fix dense-sparse addition when the sparse input is not coalesced and the indices or values tensor is not contiguous. This came up in the test cases of Embedding modules with `sparse=True`. Added tests.
6. Update `TensorUtils.cpp` to use `AT_*` macros.
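A small sketch of the non-contiguous `embedding` case mentioned in item 4 (sizes are arbitrary):
```python
import torch
import torch.nn.functional as F

weight = torch.randn(10, 6)
idx = torch.arange(8).reshape(4, 2).t()  # shape (2, 4), non-contiguous
out = F.embedding(idx, weight)           # no manual .contiguous() needed any more
print(idx.is_contiguous(), out.shape)    # False torch.Size([2, 4, 6])
```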
Request:
review from cpuhrsch on the `Embedding*` changes.
review from ezyang on ATen sparse & BN changes.
Closes https://github.com/pytorch/pytorch/pull/9114
Differential Revision: D8717299
Pulled By: SsnL
fbshipit-source-id: 0acc6f1c9522b5b605361e75112c16bbe1e98527
Summary:
cc vishwakftw
Also added a check for the case where none of the input tensors in `gradcheck` have `requires_grad=True`.
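A small illustration of how `gradcheck` is typically called (the function under test here is arbitrary):
```python
import torch
from torch.autograd import gradcheck

# gradcheck wants double-precision inputs, and at least one input must have
# requires_grad=True, which the new check enforces.
x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
print(gradcheck(torch.sigmoid, (x,)))  # True
```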
Closes https://github.com/pytorch/pytorch/pull/9192
Differential Revision: D8739401
Pulled By: SsnL
fbshipit-source-id: 81bb3aa0b5c04eb209b137a4bd978e040e76cbcd
Summary:
This PR makes setting the `NO_MULTIPROCESSING_SPAWN` easier internally, by allowing the flag to be set to `0`.
Closes https://github.com/pytorch/pytorch/pull/9187
Differential Revision: D8736206
Pulled By: yf225
fbshipit-source-id: b8a34cb9a747b13bc9428777a3ed766ce441cfe1
Summary:
With the C++-ification of a few files in `TH`/`THC`, the C++ extensions got broken whenever the user uses features from `THC` in their files, when PyTorch is installed via `python setup.py install`.
This addresses issues such as
```
/home/me/.conda/envs/pytorch/lib/python3.6/site-packages/torch/lib/include/THC/THCDeviceTensorUtils.cuh:5:25: fatal error: THCTensor.hpp: No such file or directory
```
Closes https://github.com/pytorch/pytorch/pull/9182
Reviewed By: soumith
Differential Revision: D8734581
Pulled By: fmassa
fbshipit-source-id: 2a1138f208592eaccb01fcdb805a6b369d7a497a
Summary:
Closes #9147
Added a test to prevent regression in test_torch
Added entries in docs
cc ezyang weiyangfb
Closes https://github.com/pytorch/pytorch/pull/9156
Differential Revision: D8732095
Pulled By: soumith
fbshipit-source-id: 7a6892853cfc0ccb0142b4fd25015818849adf61
Summary:
This file was added in #9107 but wasn't installed. The libraries in
./torch/lib use the headers from Caffe2/ATen from their temporary
install path at torch/lib/tmp_install, and c10d was not able to find
THC/THCGeneral.hpp before this fix.
Closes https://github.com/pytorch/pytorch/pull/9159
Reviewed By: Yangqing
Differential Revision: D8731107
Pulled By: pietern
fbshipit-source-id: d6009f6f6e8e6e0f37dea24cc4c3570736943ab1
Summary:
This resolves the mismatch between the code and the comments.
Closes https://github.com/pytorch/pytorch/pull/9070
Differential Revision: D8712261
Pulled By: ezyang
fbshipit-source-id: a8a7d8af890a41ec246e11c2a62b0bde297be9c1
Summary:
The loss plugin was using the old-style loss[0] access, which in PyTorch 0.4 and
later is an attempt to index into a scalar, generating a warning.
Replaced that with loss.item().
This fixes
https://github.com/pytorch/pytorch/issues/9142
Closes https://github.com/pytorch/pytorch/pull/9143
Differential Revision: D8726403
Pulled By: ezyang
fbshipit-source-id: 6c496b140a74d22c8423f511db901b18615fd6fa
Summary:
- There were missing error messages for AT_CHECK in SparseTensorImpl::set_indices_and_values
- We have to check that the backends of all our inputs line up,
since native does not do it for us.
- Some math operations were missing shape tests.
Fixes #9110
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Closes https://github.com/pytorch/pytorch/pull/9140
Differential Revision: D8724349
Pulled By: ezyang
fbshipit-source-id: 3c75104187aca97cbe92bb0ec24f6ded07b2c3d6
Summary:
Boolean indexing was special-cased to handle a single boolean value, but didn't generally work given multiple booleans.
This PR unifies the behavior with slicing. Note that only 'True' and torch.tensor(True) behave like NumPy due to the lack of n-dimensional empty tensors.
The corresponding tests for false values have been added, but are guarded behind a flag until we add n-dimensional empty tensors.
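A short illustration of the NumPy-style behavior for true values (shapes are arbitrary):
```python
import torch

x = torch.randn(2, 3)

# A single True index inserts a new leading dimension of size 1, just like x[None];
# False would produce a leading dimension of size 0 once n-dimensional empty
# tensors are available.
print(x[True].shape)                # torch.Size([1, 2, 3])
print(x[torch.tensor(True)].shape)  # torch.Size([1, 2, 3])
```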
Closes https://github.com/pytorch/pytorch/pull/8920
Reviewed By: ezyang
Differential Revision: D8661876
Pulled By: gchanan
fbshipit-source-id: 0dc8a45a303aa41f729d04ab8908cfaf2e3ce3d7
Summary:
Closes https://github.com/pytorch/pytorch/pull/9121
This main function causes 'buck test caffe2_test_cpu' to run 0 tests
Reviewed By: orionr
Differential Revision: D8719343
fbshipit-source-id: dc1cf76b0355637eaae193be2159f5746873b9f9
Summary:
Some functions are implemented exactly in THStorage_; in those cases,
we call those functions directly.
Stacked on #9135
Closes https://github.com/pytorch/pytorch/pull/9136
Reviewed By: Yangqing
Differential Revision: D8723998
Pulled By: ezyang
fbshipit-source-id: 653d23a5e1db4b9bdda50641fa97730894cc8ed5
Summary:
There is no way to concatenate two `Sequential`s in Python either, but there it is easy to do in an immutable fashion by just writing `Sequential(first.modules() + second.modules())`. Concatenating vectors isn't as easy in C++, so I think it's fair to save users some for loops by giving them `Sequential::extend()`.
apaszke ebetica ezyang
CC jamespinkerton
Closes https://github.com/pytorch/pytorch/pull/9116
Reviewed By: ezyang
Differential Revision: D8719630
Pulled By: goldsborough
fbshipit-source-id: 840d7ac70755350e6202b493c531e30ecbb6546f
Summary:
The tests were using the old args, which caused them to emit a lot of deprecation warnings.
Closes #9103.
Reviewed By: ezyang
Differential Revision: D8720581
Pulled By: li-roy
fbshipit-source-id: 3b79527f6fe862fb48b99a6394e8d7b89fc7a8c8
Summary:
Closes https://github.com/pytorch/pytorch/pull/9107
Some details about how this was done:
- For now, the allocators for CPU and CUDA are different (unifying
the allocators is a bigger change to make, I'll contribute this in
a later patch). To smooth this over, the allocator field now
stores a void* instead of THAllocator* or THCDeviceAllocator*; to
make this clear the field is renamed to allocatorVoidPtr.
- Some THStorage functions which were generated per-scalar are now
generalized, and thus moved out of the generic/ library. This way
they can be called directly from a non-code-generated at::Storage
- THCState is moved into a C++ header. This is actually not really
related to this particular diff, but I'll need it soon to replace
THAllocator/THCDeviceAllocator with at::Allocator (C++, so I can't
mention it in a C header file.)
- THPPointer needs to be adjusted, since there is no more type refinement
between THStorage/THCStorage for it to template match over. This
is a little tricky, because I can't refer to THCStorage_free unless
we actually compile with CUDA. So there's two copies of the function
now: one for the CPU build, one for the CUDA build. If we ever split
CUDA/non-CUDA Python builds, you will have to indirect this through some
dynamic dispatch.
I want to soon replace the THCDeviceAllocator pointers in
THCState with at::Allocator, but I can't reference a C++ namespaced type
from C code, so THCState needs to move.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Closes https://github.com/pytorch/pytorch/pull/9087
Reviewed By: orionr
Differential Revision: D8712072
Pulled By: ezyang
fbshipit-source-id: c6e1ea236cd1df017b42a7fffb2dbff20d50a284
Summary:
Having circulated the C++ API a bit, I found that it would be easier for folks to access module parameters directly than through the `parameters()` map. So here I make all variables/submodules and also the configuration options for every module public.
For RNNs, I also updated the names of parameters to match PyTorch, e.g. `hhw` -> `w_hh`. This should make it easier to transition from Python.
apaszke ebetica
Closes https://github.com/pytorch/pytorch/pull/9111
Differential Revision: D8717112
Pulled By: goldsborough
fbshipit-source-id: 3d36d5e161f7a86f44db7136c9c2fa53067abe1c
Summary:
Closes https://github.com/pytorch/pytorch/pull/9108
OperatorDef ownership was given to the net in the past; we no longer
want to do that.
Reviewed By: pjh5
Differential Revision: D8705347
fbshipit-source-id: 34976de202a7a7a71b935dd13c1bc8e9c73552e0
Summary:
Since we leave `weight` as the last calculated weight in eval mode, we need to detach it from the computation graph so that backward can still be used.
The typical use case is in GANs when the discriminator has spectral norm, is in eval mode and we want to backprop through the discriminator to get weight gradients for the generator.
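A hedged sketch of that GAN use case (the architectures are placeholders, not taken from the PR):
```python
import torch
import torch.nn as nn

G = nn.Linear(16, 32)                         # generator stand-in
D = nn.utils.spectral_norm(nn.Linear(32, 1))  # discriminator stand-in
D.eval()                                      # uses the last calculated weight

z = torch.randn(4, 16)
loss = D(G(z)).mean()
loss.backward()                    # backprop through D to get gradients for G
print(G.weight.grad is not None)   # True
```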
Closes https://github.com/pytorch/pytorch/pull/9020
Reviewed By: ezyang
Differential Revision: D8694054
Pulled By: SsnL
fbshipit-source-id: 09ee5843687cac3ed4c40759ac577a14c5371730
Summary:
Closes https://github.com/pytorch/pytorch/pull/9035
This diff builds on the structure in the stacked diff to add serialization/deserialization. It supports the old format and a new suggested format.
Reviewed By: ilia-cher
Differential Revision: D8415115
fbshipit-source-id: acaacce2b015f4c6ac0ae22625455290a3f30262
Summary:
add two small bindings to recently added attributes.
Also want to leave a reference gist here: https://gist.github.com/soumith/8102ef39530bac09070912b1a5401d0f
It showcases:
- traced a module
- symbolically differentiated the forward graph, to get a forward, backward graph
- executed the subsequent forward + backward graphs correctly
- compared the jit vs non-jit results
Closes https://github.com/pytorch/pytorch/pull/8890
Reviewed By: ezyang
Differential Revision: D8677663
Pulled By: soumith
fbshipit-source-id: a29919c05baad997cd7fb7df718f933a83035118
Summary:
Closes https://github.com/pytorch/pytorch/pull/9072
Use FixedDivisor in Reduce and Broadcast CUDA kernels
Reviewed By: houseroad
Differential Revision: D8710243
fbshipit-source-id: 6f1da12234898594a1be8c979d942aa515832aeb
Summary:
This will resolve some of the timeout issues in CPU and GPU tests internally.
Closes https://github.com/pytorch/pytorch/pull/9061
Reviewed By: ezyang
Differential Revision: D8707471
Pulled By: yf225
fbshipit-source-id: 9dc82a2c9da0c540ae015442f74b9b2b1a67a246
Summary:
Fixes #9049.
When provided with a domain string that lacks proper prefix, i.e. `org.pytorch.`, an exception is thrown.
Closes https://github.com/pytorch/pytorch/pull/9053
Differential Revision: D8708264
Pulled By: ezyang
fbshipit-source-id: e2593d8d36a17d3bb26fc0b239a61b84f1c38ecb
Summary:
Closes https://github.com/pytorch/pytorch/pull/9057
Make the `_C` target depend on the `csrc-no-python` target. Also removes the `csrc` target and the with-python version of autogradpp (which is not used). Let me know if we should pick better names here.
I also ran into a nasty linker issue with only one symbol being undefined. It turns out it had been given inline linkage in the `.cpp` file, which I believe is an error.
Reviewed By: orionr
Differential Revision: D8705750
fbshipit-source-id: 8de083e371dbf5e9f12c15572d88e1c595dfa087
Summary:
Closes https://github.com/pytorch/pytorch/pull/8933
The SpatialBN implementation cannot deal with an empty batch; this diff enables the zero-batch setting:
during training, when batch_size = 0:
in forward, the output's saved_mean and saved_var are zeros.
in backward, the gradients for SCALE_GRAD and BIAS_GRAD are zeros.
Reviewed By: pjh5
Differential Revision: D8644699
fbshipit-source-id: 599ea687329d68699c987e05f56f409f4e729d1c
Summary:
1. Instead of using the non-`_out` variant, we allocate a buffer and use the `_out` variant to write the intermediate results into the buffer.
2. Reduce dimensions in order of decreasing sizes.
Benchmark:
Sum a randn tensor of shape `[200, 1, 30, 40, 20, 1, 50]` along dimensions `[4, 6, 3, 0, 2, 5]`. Averaged across 1000 times:
```
before patch:
CPU: 0.0441 s
CUDA: 0.0273 s
after patch:
CPU: 0.0234 s
CUDA: 0.0047 s
```
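For reference, a rough sketch of how the CPU number above can be reproduced (the timing harness is mine, not from the PR):
```python
import time
import torch

x = torch.randn(200, 1, 30, 40, 20, 1, 50)

start = time.time()
for _ in range(1000):
    # The reduction is staged through an `_out` buffer and the dimensions
    # are processed in order of decreasing size.
    y = x.sum(dim=[4, 6, 3, 0, 2, 5])
print('CPU: {:.4f} s per call'.format((time.time() - start) / 1000))
```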
Closes https://github.com/pytorch/pytorch/pull/8992
Differential Revision: D8681069
Pulled By: SsnL
fbshipit-source-id: 2c5d5af5c5a284f2e945181f2b24ee8c78becd50
Summary:
The goal of this PR was to add support for dropout descriptors in the C++ API's RNN class.
The end result is a 4x-5x speedup for our RNN integration tests since they can now use cuDNN instead of autograd when dropout is set.
To achieve this, I had to move `_cudnn_init_dropout_state` to the `TensorOptions` API.
I also fixed a bug around `RNN::cuda()` not flattening parameters for cuDNN.
ebetica ezyang
Closes https://github.com/pytorch/pytorch/pull/9012
Reviewed By: pjh5
Differential Revision: D8689786
Pulled By: goldsborough
fbshipit-source-id: 44fb191f5a38e41c4ded5417306b5bbc012cd56c
Summary:
Addresses #7415. Adding a note first; will do the API change if there's a need in the future.
Closes https://github.com/pytorch/pytorch/pull/9019
Differential Revision: D8694056
Pulled By: ailzhang
fbshipit-source-id: 0b6fa43fa62ac55deff3b3b099d1bc9fee74a5f9
Summary:
Add BatchTensor class
- construct from data, mask, and dims, or from a list of tensors
- can return a list of tensors from a BatchTensor instance
Next step: do IR-level transformation and operators.
Closes https://github.com/pytorch/pytorch/pull/8922
Differential Revision: D8668986
Pulled By: ChunliF
fbshipit-source-id: 8b24d2a9f46a3b42dbb397e99e9e059dfb2b326e
Summary:
Just tried these and they work now
Closes https://github.com/pytorch/pytorch/pull/9044
Reviewed By: soumith
Differential Revision: D8698819
Pulled By: jamesr66a
fbshipit-source-id: 1d5574de1819aa31fc36ad245186c7aa68587178
Summary:
Closes https://github.com/pytorch/pytorch/pull/9037
Fixes flaky test failures due to port in use.
Reviewed By: soumith
Differential Revision: D8696779
fbshipit-source-id: a05412d1eb1dcb9a4b35023dead371aa33d62c39
Summary:
Tell people to run with num_workers=0 when a DataLoader worker fails.
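A small sketch of the suggested debugging workflow (the dataset and sizes are placeholders):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# If a worker process dies with an opaque traceback, rerun with num_workers=0:
# the dataset code then executes in the main process, so the original
# exception and its stack trace become visible.
loader = DataLoader(dataset, batch_size=10, num_workers=0)
for batch in loader:
    pass
```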
Closes https://github.com/pytorch/pytorch/pull/9007
Differential Revision: D8686005
Pulled By: SsnL
fbshipit-source-id: bf872267f609c7b86e943061caab953149507bfe
Summary:
Any flags linking libraries only take effect on inputs preceding them,
so we have to call `$cxx $in $ldflags -o $out` instead of the other way
around.
This was probably not detected so far since the torch libraries are
already loaded when loading JIT-compiled extensions, so this only has an
effect on third-party libraries.
This also matches our behavior on windows.
Closes https://github.com/pytorch/pytorch/pull/9021
Reviewed By: soumith
Differential Revision: D8694049
Pulled By: ezyang
fbshipit-source-id: e35745fc3b89bf39c14f07ce90d6bd18e6a3d7cc
Summary:
This is an initial implementation of the Distributed Data Parallel module for the c10d GLOO and NCCL backends.
Performance testing confirmed that both single-GPU-per-process and multi-GPU-per-process setups are able to overlap communication with backward computation.
The idea is that DDP buckets parameters and all-reduces the buckets in reverse order. Since all c10d ops are async, no dedicated thread is needed; we simply queue the all-reduce kernels once a bucket is ready, following the deterministic reduction order.
Tested with 8 nodes / 64 GPUs on ResNet-50; hit the required accuracy within 90 epochs.
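A rough usage sketch, assuming one process per GPU; the exact module path for this c10d-based DDP at the time of the PR may differ from what is shown (the class eventually lives at `torch.nn.parallel.DistributedDataParallel`), and the NCCL/env:// process-group setup is illustrative:
```python
import torch
import torch.distributed as dist
import torch.nn as nn

# Assumes MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set in the environment.
dist.init_process_group(backend='nccl', init_method='env://')

model = nn.Linear(1024, 1024).cuda()
ddp_model = nn.parallel.DistributedDataParallel(model)

out = ddp_model(torch.randn(32, 1024).cuda())
out.sum().backward()  # gradient buckets are all-reduced asynchronously,
                      # roughly in reverse bucket order as they become ready
```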
Closes https://github.com/pytorch/pytorch/pull/8584
Reviewed By: goldsborough
Differential Revision: D8678696
Pulled By: teng-li
fbshipit-source-id: 440341b804befc6762e92acece2759ba47157cea
Summary:
Enable script for the time-sequence prediction example. This required a bunch of hacks to make script mode work, and a couple of issues were discovered along the way, all noted in #8452.
Shall we merge this PR and iteratively fix those issues thereafter?
Closes https://github.com/pytorch/pytorch/pull/8862
Differential Revision: D8677683
Pulled By: wanchaol
fbshipit-source-id: 02319cd56c87de523be898f0e6c541dd15e57cac
Summary:
When initializing weights for my C++ model, I had to write
```cpp
void initialize_weights(nn::Module& module) {
  if (module.name().find("Conv2d") != std::string::npos) {
    module.parameters()["weight"].data().normal_(0.0, 0.02);
  } else if (module.name().find("BatchNorm") != std::string::npos) {
    auto parameters = module.parameters();
    parameters["weight"].data().normal_(1.0, 0.02);
    parameters["bias"].data().fill_(0);
  }
}
```
The string-based module determination is not very nice, and not very C++-y. So I created `nn::Module::is<T>` which does a `dynamic_cast` inside. It also handles the `ModuleHolder` vs. `Module` distinction.
It now becomes
```cpp
if (module.is<nn::Conv2d>()) {
  module.parameters()["weight"].data().normal_(0.0, 0.02);
} else if (module.is<nn::BatchNorm>()) {
  auto parameters = module.parameters();
  parameters["weight"].data().normal_(1.0, 0.02);
  parameters["bias"].data().fill_(0);
}
```
ebetica ezyang apaszke
Closes https://github.com/pytorch/pytorch/pull/8970
Differential Revision: D8677476
Pulled By: goldsborough
fbshipit-source-id: 053294e19b6a58cce868167596c89639f7de91c2
Summary:
Currently the `test_RNG_after_pickle` in the PR would fail because pickling a tensor changes the RNG state. This PR aims to fix it.
Closes https://github.com/pytorch/pytorch/pull/8971
Reviewed By: ezyang
Differential Revision: D8677474
Pulled By: yf225
fbshipit-source-id: 1713d9611699ad288b66d92dbb29ce9feb34b8cf
Summary:
- fixes log1p at #8853
- added log1p of sparse tensors in ATen
- made log1p of sparse tensors non-differentiable and raise an error, because the local derivative of log1p at a zero element is 1 / (0 + 1) = 1, which would make the gradient (and hence the tensor) dense
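A minimal sketch of the new behavior (the indices and values below are arbitrary):
```python
import torch

i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([3.0, 4.0])
s = torch.sparse_coo_tensor(i, v, (2, 2))

# log1p on a sparse tensor is supported, but it is non-differentiable:
# d/dx log1p(x) = 1 at x = 0, so the implicit zeros would get nonzero
# gradients and the result would have to be densified.
out = torch.log1p(s)
print(out.to_dense())
```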
Closes https://github.com/pytorch/pytorch/pull/8969
Reviewed By: ezyang
Differential Revision: D8677491
fbshipit-source-id: 8363a613519de4bc75eda087ccd20a3eb2d18126
Summary:
The problem was a bad regex; the version hash match used to match 6
wildcards. This PR changes it to match \w+, which is sufficient for the
test because the version hash is always followed by either whitespace or
a right-paren.
Fixes #8981
Closes https://github.com/pytorch/pytorch/pull/8983
Differential Revision: D8677771
Pulled By: zou3519
fbshipit-source-id: dfdde98669bcd682335145cba98c82530a815afa
Summary:
Will bump up to opset 8 in another PR to match the current opset version.
Already tested through generating the models in current model zoo.
Closes https://github.com/pytorch/pytorch/pull/8854
Reviewed By: ezyang
Differential Revision: D8666437
Pulled By: houseroad
fbshipit-source-id: feffdf704dd3136aa59c0f1ff1830c14d1bd20aa
Summary:
Operations on `Variable`s (or `torch::Tensor`) usually return `at::Tensor`. This is usually fine, but the `AnyModule` used in the implementation of `torch::Sequential` is very picky about types, and does not understand implicit conversions like this. This means that `sequential.forward(at_tensor_that_is_actually_a_variable)` will fail unless you wrap `at_tensor_that_is_actually_a_variable` with `torch::Tensor`.
This PR adds a special case to `AnyModule` that will convert an `at::Tensor` to `torch::Tensor` when the tensor is really a variable, and else just pass the `at::Tensor`. This is a nice little usability improvement for the often-used `Sequential` class.
ebetica ezyang
Closes https://github.com/pytorch/pytorch/pull/8968
Reviewed By: ezyang
Differential Revision: D8670407
Pulled By: goldsborough
fbshipit-source-id: 3635ed6ed28238f3900ce4a876d07f1b11713831
Summary:
This PR does 3 things
- Reorder the search order of `intel_lp64` and `gf_lp64` as the first one is more essential and should have high priority.
- Avoid repetitive searching of MKL libraries in `ideep` and `mkldnn` submodule if we already found those in `FindMKL`
- Avoid adding more MKL dependencies to IDEEP if MKL is also found.
TODO: provide an option for the user to choose iomp or gomp.
Closes https://github.com/pytorch/pytorch/pull/8955
Reviewed By: bddppq
Differential Revision: D8666960
Pulled By: yinghai
fbshipit-source-id: 669d3142204a8b47c19a900444246fc44a139012
Summary:
Disable operator tests for now until we have enough ROCm workers in CI.
Closes https://github.com/pytorch/pytorch/pull/8720
Reviewed By: ezyang
Differential Revision: D8654871
Pulled By: bddppq
fbshipit-source-id: ff2504d6a7182f85f7cc15618f2df8e512447fa8
Summary:
Closes https://github.com/pytorch/pytorch/pull/8959
MKL-DNN doesn't support 0-dim tensors. As a workaround, we produce a CPUTensor instead of an Ideep tensor in the fallback ops. For those tensors, we no longer need the Ideep copy op.
Reviewed By: viswanathgs
Differential Revision: D8665168
fbshipit-source-id: 59678de2c5aed8c691ab5caaadede6d6c000dd7b
Summary:
Sets the random seed at the start of C++ tests so that everything is super deterministic.
I made sure we only generate random values from torch instead of `std::`, so that this seed always applies. I.e. I do:
```
torch::randint(2, {2}, at::kLong)
```
instead of
```
std::rand() % 2
```
Also got rid of the tests that test the random seeding, since it would interfere here. And the test is not useful since we just use ATen's seeding mechanism, which should work.
Fixes #7288, #7286, #7289
ebetica ezyang
Closes https://github.com/pytorch/pytorch/pull/8903
Differential Revision: D8667269
Pulled By: goldsborough
fbshipit-source-id: a833e86e156d5e68dae8c53a4b1c433cb0608b6c
Summary:
Closes https://github.com/pytorch/pytorch/pull/8927
Closes https://github.com/pytorch/pytorch/pull/8855
- Add parameter `enable_tracing` to the Arg field of NetDef. `net_async_tracing` will only enable the Tracer for Net instances that have this field set (unless the command line argument also includes the net name).
- Append a unique id to the JSON profiling result file because there could be multiple instances of the same net running.
- Dump the JSON profiling file regularly instead of only when the Tracer object is destroyed.
Reviewed By: ilia-cher
Differential Revision: D8372378
fbshipit-source-id: 8adc9d59f48b67456beed2e3a88235c298fdfd01
Summary:
This PR is the final step to making `torch::` the only namespace users of the C++ API ever see. Basically, I did:
``` cpp
namespace torch {
using namespace at;
}
```
And then changed `torch::` to `at::` almost everywhere. This worked surprisingly well out of the box. So users can now write `torch::relu` and `torch::log_softmax` and `torch::conv2d` instead of having to know when to use `at::` and when `torch::`. This is happy!
Another thing I did was to have `using Dtype = at::ScalarType`, which will be the eventual name anyway.
ebetica ezyang apaszke zdevito
Closes https://github.com/pytorch/pytorch/pull/8911
Reviewed By: ezyang
Differential Revision: D8668230
Pulled By: goldsborough
fbshipit-source-id: a72ccb70fca763c396c4b0997d3c4767c8cf4fd3
Summary:
Closes https://github.com/pytorch/pytorch/pull/8951
Change the default value of the max decode error rate to 1.0, which means we don't throw such a runtime error by default.
Reviewed By: avulanov
Differential Revision: D8665640
fbshipit-source-id: 9d373979dd8a97253ad528b167f8d73a28fee82a
Summary:
No longer required now that we've switched over to ShipIt on master.
Closes https://github.com/pytorch/pytorch/pull/8950
Reviewed By: Yangqing
Differential Revision: D8666175
Pulled By: orionr
fbshipit-source-id: 6d8b8b38f6558d87cabd0aa19b72a390057c137b
* add opencl + fpga context
adds an opencl context inside caffe2/fb which can be used for fpga access
* [Caffe2] Force tensor inference checks to be triggered during testing
We've started to rely on TensorInference functions more for different analysis. This diff ensures that the TensorInference function's result matches what is expected from the definition of the operator.
* Enable building //caffe2:torch with @mode/opt
In @mode/opt, python runs out of a PAR, which breaks a lot of
assumptions in the code about where templates/ folders live relative
to __file__. Rather than introduce hacks with parutil, I simply turn
template_path into a parameter for all the relevant functions and
thread it through from the top level.
* [Caffe2] Fix cost models for DotProduct and Div. Update Tensor Inference for dot product
As title. DotProduct states that the output is a 1-D tensor (https://caffe2.ai/docs/operators-catalogue.html#dotproduct), though the code suggests it is either 0- or 1-D depending on the inputs. The TensorInference function is defined to match the implementation.
* [SG-MoE] Add an option to make the experts NOT as components
* [nomnigraph] Rename and fixup convertToNeuralNetOperator API
This will make things a bit cleaner
* no longer symlink THNN.h and THCUNN.h
* forced decoder network (onnx export)
Closes https://github.com/pytorch/translate/pull/95
Add networks in ensemble_export.py to create a forced decoding network from PyTorch NMT checkpoints. This network takes an arbitrary numberized (source, target) pair and returns the model score for the translation, including penalties.
Vocabulary reduction networks are also supported, but note that target indices which are not in the possible_translation_tokens generated for the source input will be trea
* Revert schema change to fix production models
Revert schema change to fix production models
* MockLogDeviceReader - rebase on FIX
# Goal
1), Build a make_mock_log_device_reader using make_mock_reader
2), Replace the real log_device_reader here: https://fburl.com/raihwf1p
# Log by D8151734
Real log_device_reader:
```
I0529 20:29:05.373108 954994 tensor.h:839] Tensor print_net/log of type std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >. Dims: (): read_net/ParseOpenTrainingRow:0
I0529 20:29:05.373244 954994 tensor.h:839] Tensor read_net/ParseOpenTrainin
* [C2/D2][1/n]: Nonnegative-Constrained Optimization -- log barrier
implement log barrier as a regularization method
* Add teacher weight screening.
Add teacher weight screening according to teacher labels. If the teacher label is zero, we do not use the distill loss in the objective function.
* Add NormalizerContext
See task for more detail. This implementation is a copy of what exists for RegularizerContext except for how the parameters are defined in the model_definition thrift file.
I'll try an alternative implementation which overrides the default arguments of functions instead like for argscopes in tensorflow.
https://github.com/pytorch/pytorch/compare/master...MaximeBoucher:update-from-facebook-0939578c068c?expand=1
* Adding cosine similarity option in dot processor
Add pairwise cosine similarity option in dot product.
Add an option to concate dot product and cosine similarity.
Add test cases.
* [nomnigraph][redo] Concat elim for sparseNN
Same as D7962948, which was reverted because Operator Schema was not
defined
* [pytorch] Revert pytorch/pytorch#7918 'Release GIL when copying to shared memory', breaks ASAN
Revert this pytorch diff that breaks ASAN when running Filament in dev mode; in opt mode it gives "bad file descriptor" errors. Looks like a race when copying tensors to shared memory in multiple mp.Queue's (which spawn separate threads).
https://github.com/pytorch/pytorch/pull/7918/files
* [nomnigraph][mobile] Enable nomnigraph by default, use -Oz on nomnigraph related code to reduce code size
enables nomnigraph and reduces codesize
* [Warmup] Allow both offline incremental training and online training
Change plan name on saving side and reading side to support both training type
This diff depends on D8128530 and D8168651.
* Revert D7802642: [Warmup] Allow both offline incremental training and online training
This reverts commit afc213cf9b36cecf75333a788391c4d09f4afccc
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* Add legacy grad logic to fix div op on old graphs.
Add legacy grad logic to fix div op on old graphs.
* Correctly propagate operator failures
Propagate errors from operators that throw exceptions and return false
* Revert D8374829: [caffe2][nomnigraph][redo] Concat elim for sparseNN
This reverts commit 6dda028c463e54bb5c32188bbbe9202107e188a5
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [Caffe2] Added extra_info to core.DeviceOption(), enforced extra_info to be inherited in scope.DeviceScope
extra_info is a newly defined field in DeviceOption proto. This diff added extra_info to the core.DeviceOption(). And, In scope.DeviceScope(), this diff enforce the new scope to inherit the extra_info from old scope.
* [opt] hgdirsync wasn't enabled, merge diverged code
Here's the damage (P59732616): basically xplat was left behind but had
the change from assert to CAFFE_ENFORCE.
* OMP parallelism over RoIs for RoIAlign op
Simpler to parallelize over RoIs. Shouldn't affect other uses as it relies on
the number of OMP threads set during startup.
PR: https://github.com/pytorch/pytorch/pull/8562
* Use int64_t for shape in FillOps
to avoid overflow of int32
* Implement Rotated RoIAlign op
Based on Rotated RPNs as explained in https://arxiv.org/abs/1703.01086.
The idea is simple - orientation/angle is added as an RPN
anchor parameter and then the angle is further regressed similar to bbox
coords. There are some additional changes related to NMS and IoU, but besides
that it's a direct extension to Faster-RCNN. Further details in https://fb.quip.com/sZHlA1iMfWPZ.
RoIs are represented in [center_x, center_y, width, height, angle] format.
`angle` repre
* Rotated RoIAlign op CUDA forward implementation
CUDA forward impl for D8415490
* RoIAlignRotated op CUDA backward pass implementation
TSIA
* All remaining fixes to eliminate process_github.sh
Most of this diff has already been reviewed separately, except for the parts relating to _thnn/utils.py and _utils._internal.py
remove skipIf(True, 'Fbcode') line from process_github.sh
replace sed of cpp file with #ifdef to control cudnnDestroy use
undo sync-time deletion of .gitattributes, remove process_github.sh
switch to using _utils._internal rather than try-import-except
This diff also fixes the open-source bug where rebuilds have
* Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training"
Original commit changeset: 7707d2efe60e The original diff is backed out because the online trainer package is backed out. This code would only work with the new online trainer package.
* [easy] improve error log in adagrad op
as title
* re-allow use of thnn_h_path
This fixes cffi usage in OSS
* [4/4] [tum] parallelizing layerNorm for GPU full sync
as title
* add compile=False to pytorch tests, remove hack with pyc
* Add shape and type inference for RowWiseArgMax operator
See title
* Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training"
This reverts commit 78167eeef0af16b60f72c82f9dcdda9b41b4dcbd
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [fix-flaky-test] mock_hive_reader_test flaky, because GlobalCounter collects local counts intervally
# Problem
`MockHiveReader` uses `GlobalCounter` to limit `max_examples`.
GlobalCounter on server node collect local counts from worker nodes every 1 sec.
This 1 sec delay makes it impossible to limit exactly to `max_examples`; it will definitely exceed `max_examples`.
# Plan
Given,
```
Expected num_examples = max_examples + num_examples/sec (Read Speed) x 1 sec (GlobalCounter Sync Int
* [Caffe2] Fix FCGradient cost inference. Prevent overflow in cost inference
FCGradient missed a factor of 2 in the `num_outputs == 3` case. Overflow was occurring in the FLOP calculation for FC. Changed types to `uint64_t` to prevent future problems.
* Fix binary ops with empty inputs
Fix binary ops with empty inputs
* Support the filling of input blob with provided data
as title for Biz Integrity case
* Back out "Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training""
Original commit changeset: 30c55dd38816 Original diff is reverted due to introducing bad integration test. Fixed the integration test.
* [c2][easy] improve pack ops error loggings
as desc.
* Add ShapeTypeInference for LpNorm operator
As desc
* Shard test_nn to reduce runtime for each test target
Closes https://github.com/pytorch/pytorch/pull/8793
The current test_nn would time out and be disabled in GreenWarden, and we need to have an option to split it up in order to pass the stress test. Right now GreenWarden roughly allows running 100 test cases in test_nn before timing out, and here we have an option to divide test_nn into 30 shards (with ~40 tests in each shard) to allow for some test suite growth in the future.
* Change default caffe2_streams_per_gpu to 1
* Remove IN_SANDCASTLE from common.py and test_nn.py
We prefer to disable the failing tests through Sandcastle UI instead.
* Add a new class for an updated prof_dag.proto
This diff contains:
- An updated prof_dag.proto that contains blob profiles.
- A class to deserialize this information (serialization is in a follow up diff)
- Update to separate profiling information from NeuralNet (and use it as part of the class above).
- Unit tests
* Lambdarank for SparseNN
This diff adds a lambda_rank_layer for SparseNN.
changes include
1) Adds support for multi sessions in c2 op
2) Adds support for two different loss functions in c2 op
3) Unit tests for op
* Revert D8586950: Back out "Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training""
This reverts commit 012220ed63eccc35659a57b31d16a3625da6317b
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [easy] A few fixups to multithread predictor benchmark
(1) support perf on T6 server
(2) remove dead code
* fix a bug about the map size
as title
* Fix reduce sum on in-place case.
Fix reduce sum on in-place case.
* [Warmup] Reland reverted diff Allow both offline incremental training and online training
Closes https://github.com/pytorch/pytorch/pull/8827
fix net transform integration test. Allow offline and online trainer to coexist D7802642.
* Add StoreHandlerNotAvailableException
Add an exception for a store that is not available or has been
deleted.
* Use exception handling for fault tolerance, missing KV store
Remove status blobs to communication ops so that exceptions propagate on
failure.
* [C2/D2][2/n]: Nonnegative-Constrained Optimization -- bounded grad proj
for simple bounded constrained optimization, incl non-negative box constraints.
* [GanH]: Adaptive Weighting with More Estimations
With the implemented positivity optimization, we now learn adaptive weights with different
parameterizations.
This improves parameter estimation and training stability.
* Revert some changes for landing
* Remove AutoNoGIL in StorageSharing
* Temporarily disable net_tests
* Revert "[Caffe2] Force tensor inference checks to be triggered during testing"
This reverts commit 67ef05c22b2f71b4a489695384932f968384a2a4.
* Revert "Fix reduce sum on in-place case."
This reverts commit 6cb8a8e1b3db7b6d20941b0053e3f3836068eb64.
* Revert "Revert "Fix reduce sum on in-place case.""
This reverts commit 130a257c0893dc09f4bd6e6a45d112261807fd2c.
* use conda cmake in pytorch-linux-xenial-cuda8-cudnn6-py2 and pytorch-linux-xenial-cuda9-cudnn6-py3
* update test_expect
* add exit 1
* check cmake 3.5
* bump expect driver version
* add back space
* Better forward methods in C++ API
capitalize error message in test_torch.test_flatten
Support for operator()
* Add operator() to Functional
* Get rid of SigmoidLinear
* Add BoundFunction to FunctionalImpl
* Remove macro from conv because it makes errors more nasty
This should be set by the code that instantiates it, be it the Python
bindings or other C++ code. Defaulting to use localhost is not useful
beyond tests. Instead of keeping multiple default paths around we can
punt on it here and require it to be initialized elsewhere.
There is no relevant state in PinnedMemoryAllocator, so we
can have a single allocator with static lifetime.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Rework optim folder
* Removed TORCH_OPTIMIZER_CLASS macro
* Got rid of CRTP/Impl
* Removed TORCH_AUTOGRAD_KWARG
* Differentiate between Optimizer and LossClosureOptimizer
* Make Optimizers parameters based instead of model based
* Allow construction of optimizer from arbitrary vector
* Added test for zero grad
* Added test for external parameter vectors
* Now comparing against baseline values
* Documentation
* Post rebase fixes
* Different strategy for creating and accessing buffers in optimizers
* Fix member ordering
* Unify isViewable, handle n-dimensional empty tensors.
1) Unifies the two isViewable functions in ATen and TH.
2) Handle n-dimensional empty tensors in the implementation
3) Clarify some comments.
This requires an extra copy in the TH case, but that will go away.
* Also unify THCTensor version.
* Remove C-linkage from THTensor_compute_stride.
* Update comment.
* Add pos_weight argument to nn.BCEWithLogitsLoss and F.binary_cross_entropy_with_logits (#5660)
- Add an option to control precision/recall in imbalanced datasets
- Add tests (but new_criterion_tests)
* Move pos_weight to the end of args list in the documentation.
`pos_weight` was moved to the end because it is the last argument in both
`nn.BCEWithLogitsLoss` and `binary_cross_entropy_with_logits`
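A small usage sketch of the new argument (the 3:1 class imbalance below is hypothetical):
```python
import torch
import torch.nn as nn

# With roughly 3x more negatives than positives, pos_weight = 3 upweights the
# positive term of the loss, trading precision for recall on the positive class.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))

logits = torch.randn(8, 1, requires_grad=True)
targets = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, targets)
loss.backward()
```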
If CUDNN_INCLUDE_DIR, CUDNN_LIB_DIR, and/or CUDNN_ROOT_DIR were set,
but USE_CUDNN was not explicitly set, the code in
cmake/Dependencies.cmake would set USE_CUDNN=OFF even though it could
be found. This caused an issue in ATen, where it includes its CuDNN
bindings if the variable CUDNN_FOUND is set. This was the case,
because the find_package call in cmake/public/cuda.cmake searches for
CuDNN and ends up finding it. The net result is that ATen tried to
compile CuDNN bits, but the caffe2::cudnn target is never defined let
alone added as dependency, and the build fails on not being able to
find the header cudnn.h.
This change does two things:
1) Restore CuDNN autodetection by setting USE_CUDNN=ON if it is found.
2) Remove obsolete FindCuDNN.cmake module. This functionality now
lives in cmake/public/cuda.cmake.
List dependency on gloo_cuda before dependency on gloo such that
unresolved symbols in gloo_cuda are correctly resolved (since the linker
resolves from left to right).
This fixes building c10d C++ tests on GCC 4.8.
currently torch/CMakeLists doesn't know how to find nanopb without
some higher-level script (setup.py or build_all.sh) telling it where
to look, which is an obstacle towards fully CMake-ifying libtorch.so.
This change removes that dependency.
* Bag of fixes
* Rename tensor_range.h to tensor_list_view.h
* Post rebase fixes
* Rename torch::tensor namespace to torch::tensors due to name conflict
* Avoid recursion in Module::to
This commit implements the solution proposed in https://github.com/pytorch/pytorch/issues/8410
to work around the need to create zero tensors with the same shape as inputs.
It introduces the concept of a LinearBlock which marks places in the code
where we know if all the inputs to the node are zero, then the outputs
to the node are also zero. Autodiff introduces LinearBlocks around
backwards functions, which have this property. specializeUndef then
propagates Undef nodes using this information.
Notes:
* Since we do not always specialize, we have a pass LowerLinearBlocks
that replaces the block with an if statement that dynamically guards
the Undef case.
* We introduce AutogradAdd which is addition that still works when
its inputs might be undefined. In cases where we specialize this will
get removed in favor of a normal add, but there are cases where
gradient graphs do not specialize (e.g. when they are not differentiable,
but a derivative is required) so it is important for this op to be executable.
* make as_strided safer
* patching as_strided; and stop using it in backward
* Test a simple case in as_strided_backward
* a long note
* remove boundary checks of as_strided; implement slow path
* wip
* fix as_strided backward when input is overlapping
check for input overlapping too
[doc] clarify gradcheck behabior when input is overlapping
longer note
* fix a deprecation warning in test_autograd
* nits
* Created DefaultTensorOptions
* Fix TensorOptions() call which was interpreted as function decl
* Fix empty OptionsGuard
* Make options_ and mutex_ in DefaultTensorOptions class static because of dynamic linker issues
* Make DefaultOptions thread local
* Spectral norm improvements
- Don't do power iterations on the weight in eval mode.
To facilitate this, register the weight as a buffer so that a module with
spectral norm can be used in eval mode immediately after loading a
state dict (#8208)
- Use weight instead of weight_orig as the weight when removing
spectral norm
- Add a dim parameter in case the normalization should occur w.r.t.
a dimension other than 0 (#7865); see the sketch after this list
* add and update spectral norm tests
* More spectral norm tests
Thank you, Simon, for the suggestions.
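A short sketch of the new `dim` argument (the layer shapes are illustrative):
```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# ConvTranspose2d stores its weight as (in_channels, out_channels, kH, kW),
# so the dimension corresponding to the outputs is 1, not the default 0.
layer = spectral_norm(nn.ConvTranspose2d(16, 32, kernel_size=3), dim=1)
```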
* Port THCS to ATen.
General structure of the sparse implementation:
- SparseCUDATensor.{cpp, cu} and SparseCUDATensorMath.cu contain
the same functions as their CPU analogues
- SparseCUDAApplyUtils.cuh contains what used to be in
THCSTensor.cu
- SparseCUDABlas.cu contains what used to be THCSparse.cu
Unrelated improvements:
- Forward declared CUDA types in Context.h are now moved
exclusively to CUDAHooks
- New getCurrentCUDASparseHandle in Context
- Support for printing CUSPARSE_STATUS_ZERO_PIVOT error message
directly
Some unusual pieces:
- get_device got the LegacyBridge makeover, as it needs special
logic on sparse tensors (defer to the inner tensors).
- I noticed that I need to turn off device_guard codegen
for many functions in sparse, noticed because get_device
became a native function, and resulted in an infinite recursion. This was
done by adding device_guard: False to the native definitions. An alternative
strategy might be to make the heuristic for deciding when to put in a device
guard more clever.
Scaffolding removal:
- LegacyBridge now special-cases only on sparse versus dense;
no more CUDA test (hooray!)
- Native bindings get CUDA/SparseCUDA dispatch entries.
CPU sparse refactoring:
- New SparseUtils.h header, with all of the utility functions that
used to live in SparseTensor.cpp
- new_with_tensor_sparse now correctly handles both CPU and CUDA
- transpose functions in sparse/ turned out to be dead, so I killed them
Bugs I noticed while working on this:
- I used accessor<...>() on a CUDA tensor, because I thought it does
the CUDA-CPU sync. It does not.
Last mile changes:
- I killed all of the THS/THCS directories, build scripts, bindings everything.
It is now no more!
- A bunch of trampolines in LegacyBridge are no more; anything
that was "sparse only" is now done natively.
- `sparse_coo_tensor` is implemented a little funny, but we think
it's a good idea.
- HIP is handled by explicitly ifdef'ing out all kernels; we'll add support
for this at some later point in time.
- TH_INDEX_BASE is now unconditionally set to 0.
- Some uses of x.type() now replaced with x.options(), the new way of doing it.
- More notes about checked_cast_tensor, and eliminate Storage/Tensor fields in
the code gen env when they are dead.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* cache cufft plans
* use an LRU cache
* suffix CuFFTParams members with _
* import print_function for py2
* lint
* fix potential race; add dummy impl for CPU only builds
* cpp formatting; remove nccl makefile change
* Use CUDA hooks instead
* comments and doc
* update the error message
* move LRU cache to a separate file and native::detail namespace
* update comment
* specify NOTE location in CuFFTPlanCache.h
* update disabled_features.yaml to make amd ci work
* another fix for AMD CI in disabled_features.yaml
* Wrap cufft_plan_cache_* methods in __HIP_PLATFORM_HCC__
* improve the notes
* lint
* revert onnx change
* put back inlining for CUFFT_CHECK
* Resolve conflicting name, ContextManager
Concept name `Context Manager` is taken by Python. See https://docs.python.org/3.6/reference/datamodel.html#with-statement-context-managers
It says,
A context manager is an object that defines the runtime context to be established when executing a with statement. The context manager handles the entry into, and the exit from, the desired runtime context for the execution of the block of code.
The `ContextManager` here is more like a registry.
And there is a C++ registry in caffe2 codebase `caffe2/caffe2/core/registry.h`.
There is also a Caffe2DBRegistry, declared by calling `CAFFE_DECLARE_REGISTRY(Caffe2DBRegistry, DB, const string&, Mode);` in `caffe2/caffe2/core/db.h`.
I think we can follow the concept name `Registry`, calling it `ContextRegistry`.
* Make Classes and Functions internal to this module start with "_"
Make Classes and Functions internal to this module start with "_"
* Update context.py
* Update context.py
* adds fp16 support to the jit
* improves formatting
* improves formatting
* added an explanatory comment
* fixes Python2 flake8
* updates c code
* all except halfs
The goal is to be able to use at::Half throughout ATen, including in
CUDA kernels and have it operate like built-in types. This avoids the
need for cuda::from_type and cuda::to_type before every
AT_DISPATCH_ALL_TYPES_AND_HALF call.
Addresses #8177
A design doc can be found here: [gist](https://gist.github.com/zou3519/4b7f13f03cc9f3612bd9363e6405fa0a) version or [quip](https://fb.quip.com/azL1AqUckBdo) version
General approach:
- Add NumberType, FloatType, IntType to represent Python numbers, floats and ints.
- Emit these types for python literals
- Change aten_schema such that Scalars are NumberType, int64_t and bool are IntType.
- Emit aten::type_as, prim::NumToTensor, and prim::TensorToNum nodes for tensor-number math. (see examples below)
- Erase NumberType, prim::NumToTensor, and prim::TensorToNum for ONNX export
### Tensor/number math
```
import torch
@torch.jit.script
def fn(x):
return x + 1
```
```
graph(%x : Dynamic) {
%1 : int = prim::Constant[value={1}]()
%2 : Dynamic = prim::NumToTensor(%1)
%3 : Dynamic = aten::type_as(%2, %x)
%4 : Dynamic = aten::add[alpha={1}](%x, %3)
return (%4);
}
```
### Number/Number Math
```
import torch
@torch.jit.script
def fn(zero):
c = 1 + 1
return zero + c
```
```
graph(%zero : Dynamic) {
%1 : int = prim::Constant[value={1}]()
%2 : int = prim::Constant[value={1}]()
%3 : Dynamic = prim::num_to_tensor(%1)
%4 : Dynamic = prim::num_to_tensor(%2)
%5 : Dynamic = aten::add[alpha={1}](%3, %4)
%c : int = prim::TensorToNum(%6) # this is the result of the addition
...
return (%13);
}
```
List of squashed commits:
* Introduce Python Number types
Added: IntType, FloatType, NumberType with
IntType <: NumberType
FloatType <: NumberType
Changed aten_schema so arguments have corresponding types
* Emit a NumberType for python literals.
Also emit a NumberType for Scalar default values.
* Add prim::NumToTensor and prim::TensorToNum
* Add DynamicType -> NumberType implicit cast for bc
* Better ensureTensor error message
* Add ensureTensorOrNumber. Allow passing Number to some functions
Like the range() construct and slices
* Patch IntList to work.
IntList is still a DynamicType in the frontend: a tensor gets built from
a List[int].
Also, IntList[1] is a "union between int and IntList" the way it is
implemented. If the frontend sees an int being passed for an IntList[1]
arg, it converts it to a tensor as well.
* Enforce some order on schemas to avoid overload ambiguity
add(Tensor, Tensor) should appear earlier than add(Tensor, Scalar). This
matches the order in which python_arg_parser parses its arguments.
* Disable std_dim and var_dim tests.
With the new schema information, std(input, keepdim) and std(input, dim)
are ambiguous. This will need to be fixed at a later date.
* Add NumberType erasure pass.
This is used for ONNX export and to ensure that NumberType information
doesn't reach the interpreter
* Add support for mixed tensor/number math ops.
* Tests for new functionality.
Includes:
- Tensor/number math
- number/number math
- EraseNumberTypes pass test
* Patch tests
Update expect tests for:
- decompose_addmm
- loop unrolling tests
Because python numbers are now NumberType, they cannot be returned by
functions anymore. Work around this by using "torch.full", or by adding
a tensor([0]) (taken from FIXME_zerol()). Both approaches are used
because torch.full is more readable, but it is broken in some cases.
* Add erase_number_types to torch/CMakeLists.txt
* Move math back to emitSimpleExpr from emitSugaredExpr
* Remove some dead lines
* Renable some excluded script/trace tests that are fixed.
* Move some tests to expected failure
* Address some comments (more addressing to come)
* Erase relevant aten::type_as nodes in EraseNumberTypes
I also changed it so that EraseNumberTypes is only called for ONNX
export. It is no longer used to prevent
prim::NumToTensor/prim::TensorToNum from reaching shape_analysis or
interpreter.cpp.
shape_analysis infers the type of the output of these nodes to be the
same as their input.
intepreter.cpp treats both of these nodes as no-ops.
* Add reminder to fix std/var
* Call EraseNumberTypes only when exporting a script module
* Update expects after rebase
Buck doesn't support passing arguments to Python unit tests, and we have to use environment variables to pass the sharding options instead. Also, buck test doesn't go through the __name__ == '__main__' code path and we need to move the env var checking logic to top-level.
* Use env var to pass sharding options to test_nn.py
* Move env var checking to top-level
* fix lint
* Support n-dimensional empty tensors in (most of) THNN.
Most of the argument checking in THNN is directly around dimensionality, which doesn't work in general for n-dimensional empty tensors, because
you will end up dividing by 0 or similar. Instead, we change these to check for empty and give error messages for those cases as well.
In some cases, the error messages are improved as well.
* Fix bug.
* enable captured inputs for if Stmt to fix the carried deps bug in nested
blocks
* postpone captured inputs deletion and add new test case
* recursively generate captured values for nested loops
* check asSimple when recursively create captured input
* Some 0-sized dimension support, port catArray away from resizeLegacy.
The goal of this PR is to port catArray away from resizeLegacy (so we can delete the legacy resize calls), but since catArray has some weird behavior because
we don't have arbitrary 0-sized dimension support, I made some effort to fix these both in one pass.
The major changes here are:
1) catArray uses the new resize API, no longer the old resizeLegacy API.
2) As 1) is the last usage of resizeLegacy, it is deleted.
3) If compiled with USE_TH_SIZE_ZERO_DIM, catArray will work and properly check shapes for n-dimensional empty tensors.
4) However, we retain the old behavior of "ignoring" size [0] tensors in catArray. We previously allowed this because we didn't have n-dimensional empty tensors.
5) To get the above to work, we also add support for n-dimensional empty tensors for narrow and slice (ifdef USE_TH_SIZE_ZERO_DIM).
6) We change the stride formula for empty tensors to match NumPy; basically, we never multiply by 0 as the size, always at least 1, so the
strides are monotonically increasing in the empty tensor case.
7) We print the size of empty tensors if size != [0]; this matches NumPy behavior (even in cases where the size could be inferred from the brackets).
8) For test purposes, we add torch._C._use_zero_size_dim() to add tests for the above.
* Fix flake8.
* Address review comments.
* Solves #8659
This PR adds a warning to alert users about the possibility of a failure in the gradcheck
* Fix lint
* Update gradcheck.py
* Update gradcheck.py
* update error message
* Update warning message to be more descriptive
This surfaces the options struct that can be passed to the
ProcessGroupGloo constructor to Python. By default, if no options struct
is passed at construction time, the Python bindings default to using a
struct with a TCP backed Gloo device that uses the machine's hostname to
resolve the IP address to bind to.
Currently, THTensor_(nDimension) goes to _dim(), which makes it difficult to move individual usages over to the new API.
Instead, let's create a THTensor_(_nDimension) going to _dim() and have THTensor_(nDimension) go to dim(). To do this, we will redirect all current
calls and move them over as we did for _dim() and dim().
* Setup wrappers to get vectorized version of mean
* Responding to review 1
* Responding to review 2
* Use variadic AT_CHECK
* Fix AT_CHECKS in ReduceOps
* Fix broadcast copying device[0] tensor when not using NCCL; Avoids potential extra copy in flatten_dense_tensors
* use toType
* revert dense_flat changes
* address comments
* [c10d] NCCL python binding and CI test, with bug fixes
* Addressed comments and further bug fix
* Made NCCL build optional, made C10D libc10d.a only
* Fixed tests so that NCCL pg won't run when not needed
* Addressed comments
* Port all indirect calls of resizeNdLegacy to resizeNd.
* Handle 1-d to 1-d resize.
* Maintain behavior of tensor.set_().
* Fix lack of initializer_list in C :).
* Return full dimensionality from newSizeOf.
* Created TORCH_MODULE macro
Rewrote Linear
Rewrote Dropout and added default constructor to TORCH_MODULE macro
Turned TORCH_MODULE contents into a proper base class
Added some documentation
Got rid of the old Dropout module
Got rid of the old Embedding module
Got rid of the old BatchNorm module
Got rid of the old Conv module
Fixing optimizers
Rebase
Removed old RNN modules and the TORCH_ATTR macro
Removed temporary P:: namespace
Added cloning behavior to all modules
Got rid of some get() calls
self review nits
Remove noexcept from ModuleHolder methods that can throw
Remove spaces
Add missing override to reset() methods
Added examples to documentation in pimpl.h
* Post rebase fixes
catArray is more complicated because it requires real 0-size dimension support. The other changes are safe in that the functions are never called (and are now deleted), or
they are used on a result of THTensor_(newSizeOf), which has a valid size.
test_rnn_args_check generates mismatched input_shape and hidden_shape
args. To do this, it changes a dimension of input_shape or hidden_shape
to have an incorrect size.
Before, the test was changing the size of a dimension to -1. However,
this is flawed because an input of size e.g. (6, -1, 2) is invalid to begin with.
This PR fixes it so that the test changes sizes of dimensions to
`bad_size = 7`. As long as none of the other sizes (input_size,
hidden_size, num_layers, batch_size) divides this, we don't have to worry
about that dimension accidentally broadcasting into a working shape.
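A sketch of the idea behind the test (the concrete sizes below are made up; only `bad_size = 7` comes from the PR):
```python
import torch
import torch.nn as nn

input_size, hidden_size, num_layers, batch_size, seq_len = 3, 5, 2, 4, 6
bad_size = 7  # divides none of the sizes above, so it cannot be broadcast away

rnn = nn.RNN(input_size, hidden_size, num_layers)
inp = torch.randn(seq_len, batch_size, input_size)
# Deliberately wrong hidden state: the batch dimension should be batch_size.
bad_hx = torch.randn(num_layers, bad_size, hidden_size)

try:
    rnn(inp, bad_hx)
except RuntimeError as err:
    print('argument check caught the mismatch:', err)
```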
Unlike resizeLegacy / resizeNdLegacy, these don't call deprecated methods (e.g. _dim) and don't map between logical sizes (i.e. nDimension == 0 -> size [0]).
What you ask for is what you get.
The full 0-sized dimension support is hidden behind an ifdef, because it's not fully supported yet.
* Created TensorOptions
Storing the type in TensorOptions to solve the Variable problem
Created convenience creation functions for TensorOptions and added tests
Converted zeros to TensorOptions
Converted rand to TensorOptions
Fix codegen for TensorOptions and multiple arguments
Put TensorOptions convenience functions into torch namespace too
All factory functions except *_like support TensorOptions
Integrated with recent JIT changes
Support *_like functions
Fix in place modification
Some cleanups and fixes
Support sparse_coo_tensor
Fix bug in Type.cpp
Fix .empty calls in C++ API
Fix bug in Type.cpp
Trying to fix device placement
Make AutoGPU CPU compatible
Remove some auto_gpu.h uses
Fixing some headers
Fix some remaining CUDA/AutoGPU issues
Fix some AutoGPU uses
Fixes to dispatch_tensor_conversion
Reset version of new variables to zero
Implemented parsing device strings
Random fixes to tests
Self review cleanups
flake8
Undo changes to variable.{h,cpp} because they fail on gcc7.2
Add [cuda] tag to tensor_options_cuda.cpp
Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks
Fix linker error in AutoGPU.cpp
Fix bad merge conflict in native_functions.yaml
Fixed caffe2/contrib/aten
Fix new window functions added to TensorFactories.cpp
* Removed torch::TensorOptions
Added code to generate wrapper functions for factory methods
Add implicit constructor from Backend to TensorOptions
Remove Var() from C++ API and use torch:: functions
Use torch:: functions more subtly in C++ API
Make AutoGPU::set_device more exception safe
Check status directly in DynamicCUDAHooksInterface
Rename AutoGPU to DeviceGuard
Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad
remove python_default_init: self.type()
Add back original factory functions, but with deprecation warnings
Disable DeviceGuard for a couple functions in ATen
Remove print statement
Fix DeviceGuard construction from undefined tensor
Fixing CUDA device compiler issues
Moved as many methods as possible into header files
Dont generate python functions for deprecated factories
Remove merge conflict artefact
Fix tensor_options_cuda.cpp
Fix set_requires_grad not being checked
Fix tensor_new.h
TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac
Fix bug in DeviceGuard.h
Missing includes
TEMPORARILY moving a few more methods into .cpp to see if it fixes windows
Fixing linker errors
* Fix up SummaryOps to use new factories
Undo device agnostic behavior of DeviceGuard
Use -1 instead of optional for default device index
Also move DeviceGuard methods into header
Fixes around device index after optional -> int32_t switch
Fix use of DeviceGuard in new_with_tensor_copy
Fix tensor_options.cpp
* Fix Type::copy(
* Remove test_non_float_params from ONNX tests
* Set requires_grad=False in ONNX tests that use ints
* Put layout/dtype/device on Tensor
* Post merge fixes
* Change behavior of DeviceGuard to match AutoGPU
* Fix C++ API integration tests
* Fix flip functions
* Spelling fix in MultivariateNormal docstring (#7915)
* [c10d] MPI Process Group Implementation (#7783)
This provides a bare-minimum MPI Process Group implementation, the commit is on top of @pietern's Gloo Process Group PR.
* [c10d] MPI Process Group Implementation
ref: https://github.com/pytorch/pytorch/issues/7434
* Better exception, atexit func, and addressed comments
* Clang formatting changes
* Static initialization and addressed comments
* Added constness back
* Test will now launch mpi processes if found
* CMakeList Changed
* Fix Windows doc for import error (#7704)
* Fix Windows doc for import error
* Fix doc again
* Fix wrong format
* Moved condition for dilated grouped convolutions to CUDNN convolution implementation (#7465)
* Updates to caffe2 operator documentation (#7917)
* Significant updates to the operator docs in prep for merge
* [auto] Update onnx to 307995b - Update from upstream (onnx/onnx#1038)
307995b143
* Test if ASAN is actually working as part of ASAN tests. (#6050)
* Test if ASAN is actually working as part of ASAN tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Drop explicit use of libstdc++, we should not care.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Build with DEBUG=1
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Increase main thread stack size when using ASAN.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Split up detail.h (#7836)
* Fix THCUNN SpatialDepthwiseConvolution assuming contiguity (#7952)
* Fix fbcode compatibility (#7939)
* add test for correctness of transpose fusion (#7950)
* [JIT][script] Fix emitted gather and slice for dynamic indices (#7861)
* [JIT][script] Fix emitted gather for dynamic indices
* Also fix slice
* Address comments
* cache and use BLAS_SET_BY_USER so that it doesn't set itself to TRUE when run second time (#7942)
* Add unsafe flag to skip checking in prepare (#7832)
* Add unsafe flag to skip checking in prepare
* pop
* Rename cuda::type to cuda::into_type and provide cuda::from_type. (#7937)
These are used to convert Half -> half and half -> Half respectively.
from_type will be used for runtime type checking in THC.
* Try to fix TORCH_CUDA_ARCH_LIST for PyTorch again (#7936)
* try again
* use DEFINED
* use a loop
* Minor fixes
* remove sort requirement from pad-sequence (#7928)
* pad-sequence no longer requires sorting entries
pad-sequence can get the max_len from the list of sequences. Entries only need to be sorted if the output will be used for pack_padded_sequence, which can throw the error itself.
* remove sort requirement from pad-sequence
Picks up from #5974.
Removes the requirement that input sequences to pad_sequence have to be
sorted. Addressed the comments in the PR:
- Updated docstring for pad_sequence
- Remove sort requirement in pad_sequence test
- Test unsorted and sorted sequences in pad_sequence test
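A minimal sketch of the relaxed behavior (the sequence lengths below are arbitrary):
```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Sequences no longer need to be sorted by length; max_len is taken from
# the longest entry in the list.
seqs = [torch.ones(3, 4), torch.ones(7, 4), torch.ones(5, 4)]
padded = pad_sequence(seqs)                        # shape: (7, 3, 4)
padded_bf = pad_sequence(seqs, batch_first=True)   # shape: (3, 7, 4)
```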
* Fix checkBackend error message (#7926)
* Fix checkBackend error message
Fixes #7849
* Switch order of printing args
* Split CI tests in half and run them in parallel (#7867)
* Split and run tests in parallel
* Refactor tests
* Handling of scalars in torch.Size (#5676)
* Handling of scalars in torch.Size
torch.Size() constructor uses python_arg_parser
IntList in python_arg_parser can take iter/range
Have IntList take python iterables and ranges.
Address comments: don't use python_arg_parser and instead call __index__ in THPSize_pynew
Address comments
Address comments
* Rebased
* Address nit
* [JIT] Fission and fusion passes for addmm (#7938)
* Addmm decomposition pass
* Addmm peephole pass
* Fix handling of output shape in fusion pass
* Add DCE to the peephole passes
* add comments
* maybe bugfix?
* Fix GPU tests
* fix py2/3 test issue
* Set smaller grain size for some cases (#7941)
* Fix returning scalar input in Python autograd function (#7934)
* fix _wrap_outputs not working with scalar inputs
* add a test
* Prevent git autocrlf for bash scripts (#7949)
* Delete unused file (#7919)
* Fix typo in autodiff formula for addmm (#7932)
* 1) Use meshgrid for the flip() CPU implementation, which only needs one copy of the input tensor; 2) changed the kernel of the CUDA implementation, so no materialized indices tensor is needed; 3) reuse error checking code
* [caffe2] YellowFin parameter update GPU code fix. (#6993)
* [Caffe2] Keep name of caffe2_pybind11_state and caffe2_pybind11_state_gpu in debug build (#7155)
* Allowing MatMul to create a gradient even with 3 inputs. useful if you are differentiating a graph twice (#6536)
* added const for local variables
* Fix the cpp libtorch CUDA build (#7975)
* Use mingfeima's mkldnn (#7977)
* Fix the import part of the windows doc (#7979)
* Change perf test folder after git checkout (#7980)
* Move the broadcast check in MKL Add/Sum to runtime (#7978)
* Use Glog's implementation of STL logging when possible. (#7206)
Inject custom workaround into namespace std so that it can be found by ADL.
* [Hotfix] Bring back warnings and -Werror to ATen (#7866)
* Bring back warnings and -Werror to ATen
* Unbreak...
* Fix tbb errors
* Enable ONNX backend Mean tests (#7985)
* Add third way to determine IS_CONDA (#7971)
* Fix EmbeddingBag max_norm option (#7959)
* fix EmbeddingBag max_norm option
* flake8
* add warning to the embedding bag arg change
* Raise error when torch.load a storage on a non-existing device (#7921)
* Raise error when torch.load a storage on a non-existing device
Before, doing torch.load(...) on a CUDA tensor on a CPU-only machine
would raise an unreadable error:
```
~/pytorch/pytorch/torch/cuda/__init__.py in __enter__(self)
223 if self.idx is -1:
224 return
--> 225 self.prev_idx = torch._C._cuda_getDevice()
226 if self.prev_idx != self.idx:
227 torch._C._cuda_setDevice(self.idx)
AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'
```
This PR makes it so that torch.load raises a hard error if one tries to
load a storage onto a non-existing device and suggests the user to use
torch.load's map_location feature (see the sketch below).
* Address comments
* missing dep
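A small sketch of the suggested `map_location` workaround (the checkpoint filename is a placeholder):
```python
import torch

# On a CPU-only machine, remap storages that were saved on a GPU instead of
# hitting the hard error described above.
state = torch.load('checkpoint.pt', map_location='cpu')

# Equivalent spelling with a callable:
state = torch.load('checkpoint.pt', map_location=lambda storage, loc: storage)
```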
* Make THStorage / THCStorage have void* data ptr. (#7964)
* Make THStorage / THCStorage have void* data ptr.
This is the initial step in unifying the ATen and TH tensor representations; the next step is to generate only a single THStorage / THCStorage type.
The major changes here are:
1) data has been renamed to data_ptr and made void* in THStorage/THCStorage.
2) THStorage / THCStorage stores a at::ScalarType representing its data type (This will be useful when we generate a single THStorage/THCStorage).
3) APIs for Accessing the data as a real*:
a) storage->data<real>() -- this does runtime-type checking (checks that the at::ScalarType is correct).
b) storage->unsafeData<real>() -- as above, but no runtime-type checking (used in inner loops / fast code paths).
c) THStorage_(data)(storage) -- this already existed, just calls storage->data<real>().
* Add include.
* Attempt to fix clang build issues.
* Clarify comment and remove extra character.
* Rename unsafeData -> unsafe_data.
* Remove unnecessary 'to' function to get compile time rather than link time errors.
* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio. (#6834)
* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio.
* Add support of all default cmake build types for release to cuda.
* Remove python bindings for `torch.slice` (#7924)
* skip python bindings for slice
* remove tests
* convert slice test to indexing
* Build ONNX for PyTorch version of libcaffe2 (#7967)
* support loading gzip (#6490)
* support loading gzip
* address comments
* address comments
* fix lint
* fix test for python2
* Add memory leak check in CUDA tests (#7270)
* Add memory leak check in CUDA tests
* Tracking multi-GPU too
* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test
* add a comment
* skip if cuda
* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU
* Fix MaxUnpool3d forward memory leak
* Fix MultiLabelMarginCriterion forward memory leak
* Fix MultiMarginLoss backward memory leak
* default doCUDAMemoryCheck to False
* make the wrapper skip-able
* use TEST_MULTIGPU
* add align_corners=True/False tests for Upsample; fix TEST_CUDNN
* finalize interface
* VolumetricMaxUnpooling_updateOutput
* fix test_nccl
* rename THC caching allocator methods to be clearer
* make the wrapped function a method
* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp
* fix renamed var
* Revert "Set smaller grain size for some cases" (#7988)
* Entry for c10d in CODEOWNERS (#8001)
* Fix a couple of typos (#7998)
* Fix typo
* Fix typo
* Fix typo
* Fix typo
* Add on-stack observer cache for Observable (#7931)
observers_list_ stores all the observers for an observable. The list is allocated on the heap, which
can cause LLC misses. Add an on-stack observer cache for fast access. In production, we have seen a 20%
speedup for start and stop observer calls.
* Reduce grain size for Unary operations (#8003)
* [auto] Update onnx to 8ec0e5f - Add index check for Transpose's type inference function (onnx/onnx#1053)
8ec0e5fe9b
* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace. (#7935)
* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace.
This requires renaming the _cast functions which used the unqualified names.
* Separate onnx mapping of scalar type from cast name.
* Fix flake8.
* Properly cast onnx.
* Remove WITH_ROCM cmake flag/variable (use USE_ROCM solely) (#8013)
* Mention the pytorch-ci-hud on the README. (#8004)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Re-enable build env check (#7969)
* Re-enable build env check
* Fix linux test error
* Try to fix macOS test error
* Update nn.rst (#8029)
* Example for Transformed Distribution (#8011)
* [auto] Update onnx to 33e9cd4 - Remove the usage of default value to fix invalid proto3 files. (onnx/onnx#1052)
33e9cd4182
* [auto] Update onnx to 1504a33 - Convert schema assert for duplicate type names to exception (onnx/onnx#1057)
1504a33abb
* Support CUDA tensors in ProcessGroupGloo (#7694)
This adds an unconditional dependency on CUDA, which is not desirable
for the long term. Ideally we would have a split like ATen, where we have
different artifacts for different backends so you can decide at runtime
what to use.
* [auto] Update onnx to 3fb9656 - Fix for fbcode CI (onnx/onnx#1062)
3fb965666e
* propagate nan in some activations (#8033)
* propagate nan in some activations
* fix py2 not having math.nan
* flake8
* Fix profiler crash when no events register (#8034)
* Fix profiler crash when no events register
When trying to profile, attempting to print the event table throws a vague error because the event list is empty:
....
max_name_length = max(len(evt.key) for evt in events)
ValueError: max() arg is an empty sequence
This change fixes the error by returning an empty string.
* Update profiler.py
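A minimal sketch of the failure mode this guards against (hypothetical usage; it assumes the autograd profiler API of this era, where printing the profile builds the event table):
```
import torch
from torch.autograd import profiler

# Profile a region that records no events; printing the table used to raise
# "ValueError: max() arg is an empty sequence" and should now print an empty string.
with profiler.profile() as prof:
    pass
print(prof)
```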
* Allow CI testing with different AVX configs (#8020)
* Allow CI testing with different AVX configs
* Unset ATEN_DISABLE_AVX and ATEN_DISABLE_AVX2 in default config
* Support for generating ATen during the fbcode build, rather than committing the generated files (#8002)
Paint the internal bikeshed a slightly different color to appease Buck tooling.
* Factor python dependency out of interpreter (#7970)
* Factor python dependency out of interpreter
* Remove NO_PYTHON for the autograd engine
If there is no python bindings, then a default Engine is constructed
the first time it is requested.
If the python libraries are loaded, then they override the default
accessor and the default engine becomes a python Engine.
Note: it is possible for two engines to be generated if a non-python
one gets created before the python bindings are loaded. This case
is rare, and just results in additional threads being spawned.
* Fixing AlexNet test which is skipped in CI
* [auto] Update onnx to 760c928 - add missing hasNInputShapes check for bidirectionalBroadcastShapeInference (onnx/onnx#1060)
760c9283d0
* Support modules that output scalar in Gather (and data parallel) (#7973)
* Support modules that output scalar in Gather (and data parallel)
* Improve warning msg
* [auto] Update onnx to 9e7855d - Remove PyTorch generated Upsample tests cases (onnx/onnx#1064)
9e7855dcd4
* [script] Add support for torch.zeros, torch.ones, etc. (#7799)
* [script] Add support for torch.zeros, torch.ones, etc.
* modifies gen_jit_dispatch to create bindings for functions that do
not take tensor arguments, but do have an initial type argument
* adds tensor attributes to these functions for device, layout, and
dtype specification
* extends the list of valid compiler constants to include device, layout,
and dtype.
* allows functions with Generators, but only using the default generator
Known limitations:
* when using `torch.float`, we convert it to a scalar tensor and make
no checks that it is actually used only in a dtype specification.
This is similar to how we handle Python numbers, creating some situations
where the script is more permissive. Fixing this requires much more
significant changes to the IR, so is lower priority for now.
* devices specified using string literals e.g. 'cuda:1' do not work,
since we do not support string literals in general.
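A small usage sketch of what this enables in script (hypothetical example; the exact syntax accepted by the script compiler at this point may differ slightly):
```
import torch

@torch.jit.script
def make_buffer(x):
    # factory functions now accept dtype/device/layout attributes inside script
    y = torch.zeros([3, 4], dtype=torch.float)
    return y + x.sum()
```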
* Add profiling annotations to NeuralNet[Operator|Data] (#8005)
* Update from facebook 1ee4edd286a3 (#8040)
* Adding instance weight to batch distill loss
as title
* add bfloat 16-31
added bfloat 16-31 and their respective unit tests
* [CUDA9] Upgrade - fbcode
CUDA9 upgrade diff D5654023 has been out for a while thanks to Pieter. But as time goes on it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs, one touching tp2 config files, and another one touching the fbcode TARGETS file (adding an nvcc flag). These two should be a bit easier to rebase (for the detailed procedure see "Test Plan").
This diff can only be committed if:
1. CUDA 9 rpm is rolled out fleet-wide (TBD)
2. NVidia driver 390.40 is rolled out fleet-wide (done)
3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done)
4. Make sure all dependents are built (done)
5. Test all C2 operators, PyTorch (see test plan)
* Share intermediate int32 buffer across Conv ops
Adding a known type
* [C2 fix] infer function for ensure_cpu_output_op
This adds the missing device inference function for ensure_cpu_output_op.
* [int8] Add blob serializer/deserializer for Int8TensorCPU
To export to logfiledb
* [nomnigraph] Add try catch block to optimization passes in predictor
This will catch failures that happen in the optimization pass.
* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE
CAFFE_ENFORCE uses a stack trace fetcher, which is currently a
global static variable. If CAFFE_ENFORCE is used at static initialization
time, this is a SIOF. Recently CAFFE_ENFORCE was added into init
function registration, so we started to see this.
Meyers singleton is going to provide safety here. If stacktrace
fetcher was not registered yet, it will just use a dummy one.
* NUMA support in SparseNN CPU benchmark
Adding support for NUMA in SparseNN CPU benchmark
* [mobile-roofline] Add logging needed for roofline model
This should be all that's needed
* Let the operators use the same input if the operators are not chained;
otherwise, we have to change the input data dims
* fix null-pointer-use UBSAN errors in reshape_op.h
* revert previous fix on input blob name
as title
* Adding flag to let MineHardNegative automatically extract single value from dict
Model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allowing automatic extraction of the single element of the dict, which is a common use case.
* Reverting change that broke internal tests back to OSS compatible state
* Skip CUDA memory leak test on BN tests on windows (#8043)
* workaround for Sequential when one cannot retrieve python source (#8048)
* [auto] Update onnx to 0dbec2a - - Generate protoc type hints on Windows (onnx/onnx#1047)
0dbec2a047
* [auto] Update onnx to 4f8ef17 - Remove erroneous documentation around maps and sequences. (onnx/onnx#1069)
4f8ef17ad3
* [auto] Update onnx to e6a500e - Extract constant to initializer (onnx/onnx#1050)
e6a500e54c
* [auto] Update onnx to 033f956 - make gcc happy (onnx/onnx#1061)
033f956f41
* Remove NO_PYTHON macros from Exceptions.h/cpp (#8007)
Removes cases where NO_PYTHON was unnecessary in Exception.h/cpp
* [ready] Clean up torch.distributions (#8046)
* Have a single THStorage and THCStorage type. (#8030)
No longer generate data-type specific Storage types, since all Storage types are now identical anyway.
For (some) backwards compatibility and documentation purposes, the Real names, e.g. THLongStorage are now #defined as aliases to the single THStorage type
* Reduce usages of TensorUtils<T>::DataType in THC. (#8056)
TensorUtils<T> is basically ATen-dispatch-lite in that it allows one to do multi-type THC function dispatch with a single call.
However, it is templatized on the Tensor type, and since we are moving to a single Tensor type, this doesn't work.
Most of the functions in TensorUtils (e.g. getDims) can be pulled up a level, to just call THCTensor_nDimension (or directly accessing the member),
but the DataType specific functions are more problematic.
So, this PR does two things:
1) Replaces calls of 'TensorUtils<THCTensor>::DataType' with 'real' since these are identical
2) Templatizes the THC_pointwiseApplyX functions to take scalar types. To ensure this is done correctly, we static_assert that the scalar type template parameter matches the scalar type of
the corresponding tensor template parameter. We will need to get rid of these static_asserts in the future, but this is useful for now.
* Support to run ONNX Upsample operator (mode=nearest) in Caffe2 (#8037)
* Added support to run ONNX Upsample operator (mode=nearest) in Caffe2
* adding error checks to upsample
* adding error checks to upsample
* adding error checks to upsample
* changing to np.isclose
* Revert onnx submodule update
* still fixing
* [auto] Update onnx to eb12f72 - Add conv transpose test cases (onnx/onnx#886)
eb12f72a86
* [auto] Update onnx to bd98abb - Add a hook for doing post-processing on protobuf generated header files (onnx/onnx#1068)
bd98abbba0
* Skip ConvTraspose ONNX backend tests (#8074)
* Post process onnx proto (#8064)
* Post processing onnx generated protobuf files to hide global symbols
* .
* .
* Add code for TensorBoard visualization of JIT GraphExecutors (#8050)
* [auto] Update onnx to cc26486 - bump version to 7 for prelu. (onnx/onnx#1063)
cc26486541
* [auto] Update onnx to 356208d - add input tensor dimension checks to shape inference (onnx/onnx#1070)
356208d756
* Move backtrace to its own header (#8096)
* Move backtrace to its own header
* Move cxxabi.h into Backtrace.cpp
* Fix and ignore some warnings (#8081)
* Do an additional sanity check that nvcc and CUDA include dir agree. (#8094)
If you set CUDA_HOME and CUDA_NVCC_EXECUTABLE together, you may
end up in a situation where the CUDA_VERSION of your includes
mismatches the CUDA version of your nvcc. See #8092 for a concrete
case where this can occur. Explicitly detect this situation and
give a good error message in this case!
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* use regex in kwarg parser (#8061)
* Removing remaining NO_PYTHON ifdefs (#8067)
* Remove NO_PYTHON in tracing
* Remove NO_PYTHON in ir.h
* Remove NO_PYTHON in test_jit.cpp
* Replace std::size_t with size_t (#8093)
* Remove out-of-date comment (#8114)
* [Caffe2] Enabling AMD GPU Backend for Caffe2 (#7955)
* Add hip support for caffe2 core
* Add MIOPEN header/wrapper to caffe2 core
* Add HIP device into caffe2 PB
* top level makefile change for rocm/hip
* makefile scaffolding for AMD/RocM/HIP
* Makefile scaffolding for AMD/RocM/HIP; add makefile/utility for HIP files
* caffe2 PB update for AMD/ROCM HIP device
* Add AMD/RocM/Thrust dependency
* HIP threadpool update
* Fix makefile macro
* makefile fix: duplicate test/binary name
* makefile clean-up
* makefile clean-up
* add HIP operator registry
* add utilities for hip device
* Add USE_HIP to config summary
* makefile fix for BUILD_TEST
* merge latest
* Fix indentation
* code clean-up
* Guard builds without HIP and use the same cmake script as PyTorch to find HIP
* Setup rocm environment variables in build.sh (ideally should be done in the docker images)
* setup locale
* set HIP_PLATFORM
* Revert "set HIP_PLATFORM"
This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.
* continue the build script environment variables mess
* HCC_AMDGPU_TARGET
* Clean up the mess, which has been fixed in the latest docker images
* Assign protobuf field hip_gpu_id a new field number for backward compatibility
* change name to avoid conflict
* Fix duplicated thread pool flag
* Refactor cmake files to not add hip includes and libs globally
* Fix the wrong usage of environment variables detection in cmake
* Add MIOPEN CNN operators
* Revert "Add MIOPEN CNN operators"
This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.
* Resolve merge conflicts
* .
* Update GetAsyncNetHIPThreadPool
* Enable BUILD_CAFFE2 in pytorch build
* Unify USE_HIP and USE_ROCM
* always check USE_ROCM
* .
* remove unrelated change
* move all core hip files to separate subdirectory
* .
* .
* recurse glob core directory
* .
* correct include
* .
* Detect CUDNN related environment variables in cmake (#8082)
* Implement adaptive softmax (#5287)
* Implement adaptive softmax
* fix test for python 2
* add return_logprob flag
* add a test for cross-entropy path
* address review comments
* Fix docs
* pytorch 0.4 fixes
* address review comments
* don't use no_grad when computing log-probs
* add predict method
* add test for predict
* change methods order
* get rid of hardcoded int values
* Add an optional bias term to the head of AdaptiveSoftmax
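A usage sketch of the feature described above (hypothetical example; it assumes the implementation landed as nn.AdaptiveLogSoftmaxWithLoss with the predict() method mentioned in the bullets):
```
import torch
import torch.nn as nn

asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=1000, cutoffs=[100, 500])
hidden = torch.randn(8, 64)
target = torch.randint(0, 1000, (8,), dtype=torch.long)
output, loss = asm(hidden, target)   # training path: per-sample outputs and mean loss
log_probs = asm.log_prob(hidden)     # full (8, 1000) log-probability matrix
preds = asm.predict(hidden)          # cheaper argmax predictions
```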
* Make libshm also test if rt requires pthread. (#8112)
In some configurations (e.g., our internal build of GCC 5 + GLIBC 2.23),
-lrt is not sufficient to use shm_open; you also need to declare
a dependency on pthread. This patch adds a surgical extra fix to
detect this situation, in the case that I noticed it failing in the
wild.
Fixes #8110
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [auto] Update onnx to 2d5ce4a - Remove empty model (onnx/onnx#1058)
2d5ce4aeb6
* Add missing pragma once. (#8118)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [auto] Update onnx to 2a87616 - Tests for LRN operator (onnx/onnx#903)
2a876162ac
* Split SparseTensorImpl off from TensorImpl. (#7990)
* Split SparseTensorImpl off from TensorImpl.
At the moment they have the same data layout, but with the upcoming refactor
they will not, and we need a place to put all of the sparse tensor specific
fields.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Update SparseTensorImpl.h
* [Caffe2] Support non peer access in muji and fix bug when reduced_affix is empty (#6896)
* [Caffe2] Support non peer access in muji
* [Caffe2] Add test for 4 gpus and 2 groups
* [Caffe2] Add comments
* Fix bug when reduced_affix is empty
* Fix typo and add comments about cpu and amd gpu
* Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)
* Replace most remaining usages of TensorUtils<T>::DataType. (#8124)
As in https://github.com/pytorch/pytorch/pull/8056, this doesn't work with a single TensorImpl type.
This replaces the usages of TensorUtils<T>::DataType with a templatized parameter and static_asserts that the new and old are equal.
After this we can get rid of the old template parameter, but I want to ensure they are equivalent across all builds first.
* Add utf-8 header to Python file with Unicode. (#8131)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add back lrn test (#8134)
* Revert "Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)"
This reverts commit 410191c4175eaae141306cdb3c3c1c1e8a495225.
* Fix mismatched default values
* Add non_blocking to Tensor/Module.to (#7312)
* Add non_blocking to Tensor/Module.to
* flake8
* Add argparse tests
* cpp parse
* Use C++ parser
* use a common parse function with Tensor.to
* fix test_jit
* use THPObjectPtr
* increase refcount for None, True, and False
* address comments
* address comments
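A minimal sketch of the new keyword on both Tensor.to and Module.to (assumes a CUDA device is available):
```
import torch

x = torch.randn(4, 4).pin_memory()
model = torch.nn.Linear(4, 4)
if torch.cuda.is_available():
    y = x.to('cuda', non_blocking=True)     # asynchronous host-to-device copy from pinned memory
    model.to('cuda', non_blocking=True)     # Module.to accepts the same flag
```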
* Fix job name checking for AVX tests (#8135)
* Fix a corner case for ReShapeOp (#8142)
In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0,0] shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.
* cpu/ideep context converter (#8139)
* fix type mismatch while call torch._C._cuda_setDevice (#8065)
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch in scatter
* fix type mismatch in scatter
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch while call torch._C._cuda_setDevice
* docs: Add warning to torch.repeat() (#8116)
* docs: Add warning to torch.repeat()
closes #7993
* docs: Add links for numpy functions
* docs: Break the too long line
* Accelerate bernoulli number generation on CPU (#7171)
* opt bernoulli rng with vsl and openmp
* detect cpu vendor for bernoulli
* retrigger test platform
* check the vendor more strictly
* use cpuinfo to check vendor
* docs: add canonical_url and fix redirect link (#8155)
* docs: enable redirect link to work for each specific page
* docs: add canonical_url for search engines
closes #7222
* docs: update redirect link to canonical_url
* docstring support for @script and @script_method (#7898)
* docstring support for @script and @script_method
* make it python2 compatible
* improve according to review
* improve build_stmts
* use filter instead of list comprehension
* improve the way wrap is handled for script_method
* stash the original method instead
* allow dynamic attr for ScriptMethod and GraphExecutor
* a bit comment on build_Expr
* remove _build_wrap
* a bit improve on comments
* rename to __original_methods
* should be _original_methods
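A minimal sketch of what this change enables (hypothetical usage):
```
import torch

@torch.jit.script
def scale(x):
    """Docstrings on @script functions are now preserved on the compiled callable."""
    return x * 2

print(scale.__doc__)
```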
* [auto] Update onnx to 968d28d - fix Node::isBefore (onnx/onnx#1075)
968d28d901
* remove some unnecessary cudaGetDevices (#8089)
* remove unnecessary cudaGetDevices
* make curDevice argument non-optional, add explicit checks to current_device
* Fix cuda.framework error on OSX. (#8136)
When compiling OSX with CUDA, Caffe2's build system uses
find_package(cuda) to get its grubby hands on the CUDA driver
library (for some strange reason, FindCUDA doesn't save this
information as a variable). Unfortunately, on OSX, sometimes
this picks up the cuda.framework folder, and then our build
system chokes to death because it doesn't try to link against
this as a framework. (Is the folder even a framework? I have
no idea).
This commit attempts to fix this in a two pronged fashion:
1. For some users, reducing the precedence of frameworks
using CMAKE_FIND_FRAMEWORK seems to help. So we set these
variables. However, this fix is not perfect; on my laptop
it doesn't actually solve the problem.
2. PyTorch doesn't actually need the CUDA driver API. So we
only add the dep when building Caffe2.
Fixes #8022
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [C++ API] Improve and use OrderedDict for parameters / modules (#7823)
* Improve OrderedDict for C++ API
* Give OrderedDict a subject and fix review comments
* Fix OrderedDict use in torch/csrc/jit/script/init.cpp
* Fix __rshift__ bug (#8161)
* Fix __rshift__ bug
* Add small tests for __lshift__ and __rshift__ in test_cuda
* Add a more elaborate check for __lshift__ and __rshift__
* refactor the test to address @zou3519 's comments
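A quick sanity check of the operators covered by the new tests (shown on CPU; the added tests in test_cuda exercise the same thing on GPU):
```
import torch

a = torch.tensor([16, 8, 4])
print(a >> 2)   # tensor([4, 2, 1])
print(a << 1)   # tensor([32, 16, 8])
```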
* Move non-generic Storage code needed by TensorUtils to non-generic C++. (#8164)
For non-generic function call implementations in Storage used by TensorUtils, we do the following:
1) Move the declaration from generic/C to non-generic/C++; we don't need backwards compatibility on these functions and want to use e.g. at::ScalarType.
2) Move the implementation from generic/C++ to non-generic/C++.
3) Change the generic implementation to call the non-generic implementation.
This will allow us to get rid of the corresponding TensorUtils calls (once we move over the Tensor functions in the same manner).
* Pinning opencv to < 3.4 in conda builds (#7923)
* Pinning opencv to 3.1.0 in conda builds
* Also pinning numpy to 1.11
* Trying only specifying <3.4
* Adding -setup- path, and better code structure (#8122)
* Abstract parallelization to facilitate using threadpools (#8163)
* [Caffe2] Update elementwise ops to support numpy style broadcast (#8070)
* Update elementwise ops to support numpy style broadcast
Update elementwise ops to support numpy style broadcast
* Fix sqrt_op
* Fix compare ops
* Fix gradient test
* Fix optimizer legacy broadcast
* Fix legacy broadcast for elementwise ops
* Skip flaky test
* Fix eigen simple binary op
* Fix attention test
* Fix rnn test
* Fix LSTM test
* Fix tan grad
* Fix schema check
* Export getCudnnHandle (#7726)
* [JIT] Support a single TensorList argument anywhere in the argument list + index_put (#8173)
* [JIT] Support a single TensorList argument anywhere in the argument list
* [JIT] index_put
* use the correct datatype format (#8144)
* Add back onnx console scripts dropped during migration from onnx-caffe2 (#8143)
* Get rid of SOVERSION (again). (#8132)
We don't want SOVERSION because pip will lose the symlink and
double your distribution size, and also because our setup.py
accidentally links against both libcaffe2.dylib and libcaffe2.1.dylib
on OS X. This leads to a very puzzling error where you get
the error "cannot initialize CUDA without ATen_cuda", because
there are actually two copies of your registry in memory (because
there are two copies of the dynamic library). Dropping SOVERSION
makes it impossible to make this mistake.
In principle, if the shared library load is done with DYLD_GLOBAL,
that should also prevent two copies of the registry from popping up.
Worth checking at some later point, if you need to bring back
SOVERSION (because, e.g., pip finally fixed their software.)
Partially fixes #8022.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix a corner case for ReShapeOp (#8178)
In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0,0] shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.
* Better conv error message basing on weight shape (#8051)
* Add retry logic to sccache download for Windows build (#7697)
* Add retry logic to sccache download for Windows build
* fix script bug
* clean up
* fix caffe2 docker build (#7411)
* [ONNX] Fix type_as symbolic (#8183)
* [ONNX] Nuke type_as symbolic
* make it better
* Fix lookup + test
* Yangqing as an ONNX codeowner (#8185)
* Fix protobuf options (#8184)
* protobuf
* fix protobuf_MSVC_STATIC_RUNTIME
* Add a loop unrolling pass to PyTorch JIT (#7672)
* [auto] Update onnx to 4e65fd8 - fuse consecutive squeezes (onnx/onnx#1078)
4e65fd83ba
* [Caffe2] Merging setup.py with setup_caffe2.py (#8129)
* Merging setup.py files; torch works, caffe2 works up to other KP
* Fix to super call for python 2
* Works on python2 on mac
* Consolidating Caffe2 flags
* Fix scalar check for sparse tensors. (#8197)
* Fix scalar check for sparse tensors.
As discovered in #8152
If `t` is a scalar sparse tensor, `t._indices` used to return a sparse
empty tensor because the scalar check was incorrect. This PR modifies
the scalar check to return a dense tensor instead of a sparse tensor.
i.e.
```
tensor = torch.sparse_coo_tensor([], [], torch.Size([]), device=device)
out = tensor._indices() # was a sparse tensor, now is dense.
```
* Fix typos
* fix lint
* Add more annotations for arguments in ATen schema (#8192)
* use THCThrustAllocator in BCECriterion (#8188)
* Allow parallel_apply to take in list[Tensor] (#8047)
* Docs for gradcheck and gradgradcheck; expose gradgradcheck (#8166)
* Docs for gradcheck and gradgradcheck; expose gradgradcheck
* address comments
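A small usage sketch of the now-documented and exposed checkers; double precision inputs are required for the numerical Jacobian comparison to be reliable:
```
import torch
from torch.autograd import gradcheck, gradgradcheck

x = torch.randn(4, dtype=torch.double, requires_grad=True)
assert gradcheck(torch.sigmoid, (x,), eps=1e-6, atol=1e-4)
assert gradgradcheck(torch.sigmoid, (x,))
```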
* Implement randperm for CUDA (#7606)
* Implement randperm for CUDA
* Use Thrust to implement randperm
* clean up
* Fix test
* Offload small input scenario to CPU
* Fixed test
* Try to fix Windows error
* Fix Windows error and clean up
* Use fork_rng context manager
* Move test_randperm_cuda to test_cuda
* Add half tensor support
* Fix cuda::type error
* Fix CPU offloading
* Fix issues
* No need to check range for n == 0 case
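A minimal sketch of the new CUDA path (small inputs are offloaded to CPU internally, per the bullets above):
```
import torch

if torch.cuda.is_available():
    perm = torch.randperm(1000, device='cuda')
    # a permutation sorts back to 0..n-1
    assert perm.sort()[0].equal(torch.arange(1000, device='cuda'))
```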
* Update c10d build to link against Caffe2 (#8201)
This follows #7399.
* add wipe_cache option (#8204)
as title
* Replace (non-data) TensorUtils calls with non-generic THCTensor calls. (#8176)
* Replace (non-data) TensorUtils calls with non-generic THCTensor calls.
TensorUtils is templatized on the THTensor type, so to support a single tensor type (like ATen), we need to remove these.
This PR does the following:
1) Allows THCTensorTypeUtils.cuh to include THCTensor.hpp.
This involves moving includes of it outside of generic/, so we can use the new implementations.
2) Defines a single _THCTensor struct and changes THCRealTensor to be a derived type of _THCTensor.
This allows us to implement a single non-generic function and avoid static_cast or void * tricks to call it from the generic functions.
3) For functions inside of TensorUtils that don't use data pointers:
a) Implement the functions in (non-generic) THTensor.cpp and declare them in (non-generic) THTensor.hpp.
b) Have the generic versions call the non-generic versions.
c) Replace the corresponding TensorUtils<THCTensor>::fn call with (non-generic) THTensor_fn.
* Add comment about THCTensor struct.
* Error if storage is null in setStorageNd or resizeNd.
* Fix c10d compiler warnings (#8206)
Copy compiler flags from the ones used in setup.py and fix warnings.
This makes the root build that includes c10d headers warning free.
* Bump gloo submodule (#8202)
This includes facebookincubator/gloo#125.
* rm -rf aten/contrib (#8165)
* Remove aten/contrib
* Remove from CMake
* Fix tanh_op on ios build (#8207)
* Fix tanh_op on ios build
* Fix tanh
* [auto] Update onnx to f28e2f1 - fix lrn spec (onnx/onnx#1090)
f28e2f1a60
* [cmake] deprecate caffe2_* specific cuda function in cmake. (#8200)
* deprecate caffe2_* specific cuda function in cmake.
* ENV{} -> $ENV{}
* CUDA_ARCH_NAME -> TORCH_CUDA_ARCH_LIST
* .
* .
* .
* skip CUDA memory leak check on Windows altogether (#8213)
* Record shape and type in autograd to validate gradients (#8168)
The check that the gradient is defined is currently disabled because
TestJit.test_ge_optimized will trigger the error.
* [auto] Update onnx to 18d70ff - Graph should only have one (input) kParam node (onnx/onnx#1088)
18d70ff529
* Set up a c10 source folder (#7822)
* Set up a c10 source folder
* Change the benchmark log format and also log flops (#8215)
as title
* Move helper functions to unnamed namespace. (#8224)
Currently, the helper functions in this file are in the global
namespace. I am guessing the intent was to keep them local to the file.
* [auto] Update onnx to e96d823 - Update Google benchmark to 1.4.1 (onnx/onnx#1083)
e96d823e5c
* Change new bernoulli implementation to be fully generic. (#8218)
The current implementation depends on THTensor types being unique, which is not guaranteed going forward.
* Structure THTensor like THCTensor is structured. (#8217)
In particular, define a base type, _THTensor, that can be used for all THRealTensor structs.
This is just to have less cognitive load when dealing with generic THTensor/THCTensor types (as in templates).
* move THCP-related utils to cuda/utils.cpp. (#8221)
These files don't follow the usual pattern: in general the files torch/csrc/X and torch/csrc/cuda/X
both include the generic file torch/csrc/generic/X, where torch/csrc/X includes the cpu implementations and torch/csrc/cuda/X includes the cuda implementations.
(Aside: this is probably not the best structure; the torch/csrc/X files should probably be moved to torch/csrc/cpu/X.)
utils.cpp combines these, so that torch/csrc/utils.cpp has cuda specific code. This makes it impossible to declare a single THTensor and THCTensor template type (i.e. THPPointer<_THTensor>, THPPointer<_THCTensor>).
* [READY TO MERGE] Use ccache in macOS build (#8009)
* Use ccache in macOS build
* Moving to sccache
* Don't use sccache in test job
* [NEEDS REVIEW] Add nan and inf probability check to multinomial (#7647)
* Add nan and inf probs check to multinomial
* fix bug
* Spawn CUDA test in subprocess
* Make sure invalid input won't pass the test case
* Try to fix error
* Test failure cases in Python 3 only
* Try to fix Windows error
* Move CUDA test to test_cuda.py
* fix issues
* fix module name error
* no need to check for CUDA existence in test_cuda
* Use PY3
* [READY TO MERGE] Enable tests that use DataLoader with multiple workers on Windows (#6745)
* Don't import TEST_CUDA for test_dataloader on Windows
* test_partial_workers is stuck on Windows
* Don't copy unneeded grads when using a function for several derivatives (Fixes#7722) (#7759)
Trying to copy all results fails when one of them is a tensor list which
has not been populated. This blew up for CuDNN RNNs when the weights
did not require grad.
Thanks to Sylvain Gugger for reporting!
* Fix win mkldnn (#7718)
* Sync build_pytorch_libs.bat with build_pytorch_libs.sh
* fix quoting
* add warnings
* fix warnings
* Add /EHa
* [Caffe2] Add ADD operator for IDEEP (#8220)
* Add ADD operator for IDEEP
* Add broadcast check
* Comments
* Allow optional build and installation of native test binaries (#8225)
* test finetuning
* install off by default
* Turn BUILD_TEST=ON for jenkins.
* Turn on install_test in jenkins as well
* Update MKL exporter to IDEEP ops (#8228)
IDEEP exporter support
* [ideep] Add IDEEP Squeeze op (#8227)
Similar to MKLSqueezeOp at caffe2/mkl/operators/squeeze_op.cc
* [auto] Update onnx to 62e63e9 - Fix build errors inside protobuf-bench (onnx/onnx#1084)
62e63e9de8
* Use .cc since some downstream libraries are configured for C++ only. (#8234)
* Rename SparseTensor to SparseTensorRef. (#8237)
I want to introduce using SparseTensor = Tensor (as a documentary
type alias for Tensor), but the name is already taken.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [caffe2] Build Android tests and binaries in CI (#7593)
Update benchmark submodule to version with fixed Android/GNUSTL build
* Remove core and util warnings (#8239)
* Fix some signed/unsigned mismatches
* Skip unused result warning
* Explict fallthrough for murmur hash
* Enable aligned new support to eliminate warning
* Switch to int instead of unsigned in some cases
* Remove .gitmodules.aten since it is in .gitmodules now (#8232)
* Fix: gradcheck forced float32 (#8230)
* Print requires_grad and grad_fn in string repr of tensor (#8211)
For example:
>>> torch.ones(3).requires_grad_()
tensor([ 1., 1., 1.], requires_grad=True)
>>> torch.ones(3).requires_grad_() * 5
tensor([ 5., 5., 5.], grad_fn=<MulBackward0>)
The suffix (dtype, requires_grad, grad_fn) wraps to a new line if
it would cause the line to exceed the linewidth.
>>> torch.ones(10).double().requires_grad_()
tensor([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
dtype=torch.float64, requires_grad=True)
* Fix TEST_CUDA import in test_cuda (#8246)
* Fix lifting cat into its constant version (#8174)
This fixes a bug where schema including varargs lists did not lift
properly blocking correct ONNX export.
* Don't override Tensor, Storage macros defined outside torch/csrc in t… (#8243)
* Don't override Tensor, Storage macros defined outside torch/csrc in torch/csrc.
This PR does the following:
1) Removes THSTensor macros in torch/csrc, which aren't used.
2) For macros defined outside of torch/csrc (THTensor, THTensor_, THStorage, THStorage_):
a) No longer override them, i.e. previously THTensor could actually be THCTensor if a generic file was included from a file including THCP.h.
b) Instead, introduce new macros THW* (e.g. THWTensor) to represent a (potentially empty) wildcard character.
In addition to making this code easier to read and codemod, this allows us to more freely change TH/THC; for example:
currently in the THC random code, the state is casted to THByteTensor*; this happens to work because the macros don't happen to override THByteTensor.
But if THByteTensor just becomes an alias of THTensor (which is the plan for a single tensor type), then this no longer works.
The whole thing was a bit of a mess previously because you really have to understand which macros are redefined and which aren't.
We could also rename the macros that live in torch/csrc (e.g. the THPTensor macros), but since that is more self contained, I punted for now.
* Don't change the plugin.
* [auto] Update onnx to 3a035f4 - Add retry logic to model downloading (onnx/onnx#1077)
3a035f4397
* Fully genericize THC/THCUNN (except for TensorUtils and DeviceTensorUtils). (#8251)
* [cmake] Use CAFFE2_USE_* for public/cuda.cmake (#8248)
* Fix app size check (#8256)
Fix app size check
* wip on CPU impl
* Stop BCELoss from returning negative results (#8147)
* Stop BCELoss from returning negative results
* check explicitly for 0 before taking log
* add tests
* fix lint
* address comments
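A sketch of the behavior being fixed: predictions of exactly 0 or 1 could previously yield tiny negative losses, while the loss should always be non-negative.
```
import torch
import torch.nn.functional as F

pred = torch.tensor([0.0, 1.0])
target = torch.tensor([0.0, 1.0])
loss = F.binary_cross_entropy(pred, target)
assert loss.item() >= 0.0
```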
* Relax CUDA_HOME detection logic, to build when libraries are found. (#8244)
Log when no cuda runtime is found, but CUDA is found
* Added backward function for kl_div target (#7839)
* added backward fn for target
* added module test for kl_div target, and assuming targets are probabilities
* Change the output format of caffe2 observers (#8261)
as title
* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor. (#8247)
* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor.
* Fix template parameter.
* [caffe2] Move submodule onnx-tensorrt forward (#7659)
Commit 82106f833dcb0070446a150e658e60ca9428f89b is essential.
* [ideep] Add IDEEP fallbacks for Faster-RCNN ops (#8260)
TSIA
* un-genericize THCDeviceTensorUtils. (#8258)
* provide data<T>() in TH(C)Tensor.
* un-genericize THCDeviceTensorUtils.
This is used outside of generic context, so we need to un-genericize it to have a single THCTensor type.
* [caffe2] Fix ATen dispatch for ops with TensorList arg (#8226)
* [cmake] Add and export Modules_CUDA_fix (#8271)
* Add and export Modules_CUDA_fix
* actually, need to include before finding cuda
* [auto] Update onnx to 2508156 - Make error message more verbose (onnx/onnx#1097)
2508156135
* [auto] Update onnx to 39e4668 - fix optimizer does not set ir_version bug (onnx/onnx#1098)
39e46687ea
* [cmake] Make cudnn optional (#8265)
* Make cudnn optional
* Remove cudnn file from cpu file
* Move signal window functions to ATen; add Blackman window (#8130)
* Move signal window functions to ATen; add Blackman window
* fix cuda test not checking scipy
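A minimal usage sketch of the window functions now living in ATen, including the new Blackman window:
```
import torch

w = torch.blackman_window(128)              # new Blackman window
h = torch.hann_window(128, periodic=False)  # existing windows keep the same interface
```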
* [ideep] Fuse Conv-Relu after IDEEP graph rewrite, skip group conv (#8233)
IDEEP supports fusion for non-group conv
* [c10d] NCCL Process Group implementation (#8182)
* [c10d] Process Group NCCL implementation
* Addressed comments
* Added one missing return and clang format again
* Use cmake/Modules for everything and fix gloo build
* Fixed compiler warnings
* Deleted duplicated FindNCCL
* Set up CI build for CUDA 9.2 + macOS (#8274)
* Add macOS CUDA build to CI
* Fix undefined symbols issue
* Use sccache for CUDA build
* Fix sccache issues
* clean up
* c10 build setup (#8264)
* Move c10/ to caffe2/dispatch/
* Set up caffe2/utils directory
* Remove remaining TensorTypeUtils functions. (#8286)
Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.
* Create initial Python bindings for c10d (#8119)
* Build and install c10d from tools/build_pytorch_libs.sh
* Create initial Python bindings for c10d
* clang-format
* Switch link order to include more symbols
* Add bindings and tests for ProcessGroupGloo
* Add broadcast test
* Separate build flag for c10d
* Explicit PIC property
* Skip c10d tests if not available
* Remove c10d from Windows blacklist
Let it skip by itself because it won't be available anyway.
* Make lint happy
* Comments
* Move c10d module into torch.distributed
* Close tempfile such that it is deleted
* Add option USE_NVRTC which defaults to off (#8289)
* [build] Remove /torch/lib/THD/cmake in favor of /cmake (#7159)
* Remove /torch/lib/THD/cmake in favor of /cmake
* path fix
* Explicitly marking gloo to use cuda
* Fix gloo path in THD
* Have a single THTensor / THCTensor type. (#8288)
* Remove remaining TensorTypeUtils functions.
Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.
* Have a single THTensor / THCTensor type.
As was previously done with Storages, have only a single (dtype-independent) THTensor / THCTensor.
For documentation and backwards compatibility purposes, the old names, e.g. TH(Cuda)LongTensor alias the new TH(C)Tensor type.
* undef GENERATE_SPARSE.
* [auto] Update onnx to 58efe0a - add float16 support back for math and reduction ops (onnx/onnx#1102)
58efe0a9ca
* Some utils for compile-time programming (#7778)
* Add some C++17 features, implemented with C++14
* Add some type traits
* Compile-time type list abstraction
* Some utils for compile-time programming
* Fix compatibility with a larger range of compilers
* Use guts::array instead of std::array because of std::array shortcomings
* code review comments
* Use quotes for includes
* Remove THC's FindMAGMA (#8299)
* Entries for torch.distributed in CODEOWNERS (#8293)
* Add depthwise convolution test for IDEEP (#8301)
* Fix dividing by zero segfault in Reshape (#8302)
when inferring a dimension of a new shape with zero size
* Removes unused THCTensorConv (#8229)
* Replace Variables to Tensors (#8309)
* Clean up old sccache log before build (#8305)
* Remove unused grad ops on mobile to reduce app size (#8297)
Remove unused grad ops on mobile to reduce app size
* Small fixes (#8296)
* [auto] Update onnx to 5ed684e - Remove/replace /MX with /WX for MSVC build. Was typo in a previous ch… (onnx/onnx#1104)
5ed684ebe5
* Fix sample code for cuda stream (#8319)
* [auto] Update onnx to 4b4085c - Add missing warning ignoring flags to onnx_proto CMake target (onnx/onnx#1105)
4b4085c2e9
* [THD] fix broken THD build with NCCL (#8323)
* Add docstring for `torch.sparse_coo_tensor` (#8152)
* add sparse_coo_tensor docstring
* update empty tensor example
* whitespace
* whitespace again
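An example in the spirit of the new docstring: indices form a 2 x nnz matrix and values a matching 1-D tensor.
```
import torch

i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(i, v, torch.Size([2, 3]))
print(s.to_dense())
```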
* add error when backend is not supported by DDP (#8325)
* Fix collect_env.py for Windows (#8326)
* Fix collect_env.py for Windows
* Fix expect file for Win machine
* Fix the script not stopping earlier on error for MSVC and Ninja (#8277)
* Simplify the solution
* Remove the usage of set errorlevel
* Skip test_multinomial_invalid_probs_cuda on Windows (#8324)
* Support printing sparse tensors in ATen, fixes #8333. (#8334)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [C++ API] Cursors (#8190)
* Add cursors to C++ API
* Small self nits
* s/struct/class
* Use more STL like names for cursors
* Implement dim_arange operator (#8266)
* Implement arange_like operator
* add ONNX symbolic
* lint
* change name
* Comment the hack
* 1. fixed flip CPU impl for non-continuous flip dims; 2. added more tests; 3. using TensorInfo and collapseDims to speed up CUDA impl for cases where flip dim is the 1st or last dim
* nits
* 1. removed for loop in pointwise CUDA kernel; 2. using templated (int64_t) IndexType for indices in pointwise CUDA kernel
* added torch.flip.__doc__
* nits
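A minimal sketch of the flip API these fixes target; the flipped dims may be any subset of dimensions, including the first or last:
```
import torch

x = torch.arange(8).reshape(2, 2, 2)
y = torch.flip(x, dims=[0, 2])
```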
* Update operator documentation with markdown descriptions and interfaces
* Added rest of updated operator documentation to source files
* Commiting local changes for rebase
* fixed bracket typo in sqrt_op.cc file
* Added updated markdown documentation to remaining completed ops
We have 2 use cases where we want to experiment with new base ATen
tensor types:
* BatchTensor for matchbox
* Tensors that live on accelerators
It is possible to subclass TensorImpl to implement these but VariableType
does not work with them because it cannot find the equivalent variable type
in the registry.
This commit changes the way we implement type -> variable(type) lookup so that
torch::register_variable_type_for can be called on any at::Type.
Lookups are still done using arrays so there should be no perf impact from the change.
* Port THS to ATen.
The basic structure of the patch:
- All kernels in aten/src/THS got rewritten as native
functions in aten/src/ATen/native/sparse
I took the liberty to rename some of the kernels,
opting for a longer, more transparent names than
things like 'spaddcmul'.
- Instead of holding fields for sparse tensor in the TH
C struct THSTensor, they are now held in a C++ class
SparseTensorImpl (this explains why I had to do this
all in one go; I can't have *two* reps for sparse
tensors!)
Along the way, we change a key internal representation
invariant: an "empty" sparse tensor has dimI == 1 and
dimV == 0 (this is different from dimI == 0 and dimV == 0
we had before); this ensures that we maintain the invariant
that dim == dimI + dimV. "Scalar" sparse tensors are
made illegal, because there really is no way to properly
express them in COO format.
- Because we haven't ported THCS or any of the traditional
dense TH implementations, there is a new set of adapter
functions in native/LegacyBridge.cpp exclusively devoted
to deciding whether or not to go to the new native implementation
or back to the legacy TH binding (prefixed with th_).
The intent is that when everything gets ported, we can
delete this file.
- I've kept the stubs for all the THS functions, but they now all
error if you try to actually call them. Eventually, we should
replace these with calls to ATen so that everything keeps
working.
- I gobbled up SparseMM (SparseMM.cpp is no more). It was tasty.
There are some miscellaneous improvements which were needed for other
changes in this patch:
- There is now AT_FORALL_SCALAR_TYPES_EXCEPT_HALF, which does what
it says on the tin.
- axpy templated function moved to TH/BlasUtils.h, there's a new macro
which lets you easily forward to all of the TH functions. We also expose
THBlas_copy. I'm not terribly pleased with these functions but
they seem to serve a purpose they need.
- New method on Tensor to get TensorImpl*, unsafeGetTensorImpl
- accessor() is now this-const, since const-correctness on Tensor is a lie
- New toSparse()/toDense() methods on Type; now you can call these
directly without having to manually apply at::toSparse/toDense
on the Backend and then running toBackend yourself.
Changes to the kernels:
- Previously, the whole body of all kernels was compiled for
every supported scalar type. In our new implementation,
the scalar dispatch has been pushed into the smallest extent
which (1) is not in a type loop and (2) requires statically
knowing the scalar type. These sites all use
AT_DISPATCH_ALL_TYPES. I tried to use lambdas as much as
possible, but sometimes it was not possible when a OpenMP
pragma was used.
- Anywhere we tested if the nDimension of a tensor was zero,
we replaced with a test that numel is zero. Because, as we
known, nDimension of zero-size tensors in TH is zero, and
that's wrong wrong wrong (and not done this way in ATen).
Some subtleties:
- Places where previously fastget1d was used, I now use a
TensorAccessor. However, you have to be careful about grabbing
the accessor, because sometimes you will be accessor'ing
indices/values and they are empty, which means they will
be *1D* ("oh, aren't indices always 2D?" Nope. Nyet.)
So, essentially, it is only safe to grab an accessor *after*
you have checked that nnz != 0. All of these shenanigans
will go away when we properly support zero-size dimensions.
A few places, we test for this case just by wrapping the loop
in a conditional on nnz. Some other places this is not so easy,
so we instead short-circuit the function with a special case for
when nnz == 0 (usually, these implementations are degenerate).
- There is a very subtle but important difference between
_sparse_get_impl(self)->indices() and self._indices();
the latter may return a view! This is because nnz is
not guaranteed to match the dimensions of indices/values;
you can "truncate" a sparse tensor by setting the nnz.
Actually, I think this is not a good idea and we should
enforce a stronger invariant, but for this patch I slavishly
adhere to the old ways, and as such I have to be very
careful if I want to resize something, I had better use
the former and not the latter.
- I had to reimplement broadcasting by hand (thus the s_
and non-s_ functions in the sparse native files). There
is a very important distinction between foo_out and foo_,
so it is important that the LegacyBridge function always
call to the lower layer, and not try to avoid boilerplate
by calling to another LegacyBridge function first.
I did NOT put broadcasting in LegacyBridge (even though,
ultimately, that's where it must live), because the th_
functions which are invoked from LegacyBridge handle
broadcasting themselves, and I don't want to broadcast
twice.
- Sparse function MUST explicitly specify the Type they
dispatch from, otherwise Variable wrapping/unwrapping will
not work correctly. If you use _get_sparse_impl, that is
sufficient to levy this requirement.
- The "has native" tests in LegacyBridge.cpp are not 100%,
because some of the functions are mixed dense-sparse functions,
and so you can't just say, "Oh, if it's sparse and CPU, call
the native sparse implementation." This is handled on a
case by case basis. There is some especially complex
logic for add(), which has dense-dense, sparse-sparse
and dense-sparse implementations.
- I added some uses of SparseTensorRef in native_functions.yaml,
but you will notice that these are all on native_* functions,
and not the actual, top-level functions. So the SparseTensorRef
is purely documentary (helping you not call the wrong overload)
but there is no magic; we do the wrapping ourselves the hard
way. (This is in constrast to the TH binding code which is magical.)
Except for _sparse_mask; _sparse_mask is magical.
- There is a raw_copy_sparse_ method, which is really my way of
getting around the fact that copy_ has never been implemented
for sparse tensors (even before this patch), but there IS a
super secret, internal way of doing these copies that the THS
code used, and which I needed to get my hands on when I did this
port. We should refactor so that either (a) copy_ does support
sparse-sparse copy natively, or (b) we do this other ways.
- Irritatingly, I must explicitly resize_as_ before copy_ into
a tensor. This was not the case with THTensor_(copy) but I don't
have any direct binding that doesn't have this requirement.
- For some reason, the sparse tensor constructor accepts a scalar
tensor for the values tensor. This is kind of weird because
you always need an nnz-dimension. However, the old code supported
this and just expanded it into a 1D size 0 tensor; so we need some
explicit code to do this.
There are maybe a bit more AT_ASSERTs in some of the kernels
than is wise. I added them all when I was debugging and was
loathe to remove them.
Some last mile fixes after this commit went into PR
- Move expand outside of dispatch so autograd works (it used to be inside and then we lost all of the recorded broadcasts).
- Hack to duplicate the derivatives for our now two definitions TH and native. Mercifully the derivatives are short.
- Apparently, TH has a special case to make foo_ functions method only, and if you don't do this the Python arg parsing is wrong. We carefully work around this in the native bindings
- Apply DCE to a test_jit case, fixes wobbling due to DCE trick in tracing
- Update test_function's output
- Some last mile fixes for dispatch confusion in sparse_coo_tensor functions.
- New simplified regression test based on failures I saw in ONNX
- Increase tolerance on super resolution test
- More robust dynamic_type normalization, fixes ONNX bug.
The dynamic_type situation is very delicate; probably need
to stop having both Scalar and real.
- Make new_with_tensor_sparse more CUDA safe
- Note about CUDA-safety in SparseTensorImpl
- Rename dimI/dimV to sparseDims/denseDims.
- Make localScalar on SparseTensorImpl work.
- Make numel uniformly supported on all types, not just dense
types
- Add tests for is_nonzero() method (which exercises localScalar)
- Disable constant JIT autogenerated tests, which are fragile and broken
by this change, but being fixed in a parallel track.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* throw error on 0-length tensor slicing
* return empty tensor instead of throwing error
* make 0 slice work for tuples also
* add tests
* move check to aten
* Address comments
* Move empty size logic from ATen into TH/THC.
The goal here is to unify the tensor representations; since the "majority" of the representation is in TH, we push the empty size ({0}) and empty stride ({1}) logic into TH.
This PR does the following:
1) Previously THTensor/THCTensor with dim_ == 0, size == nullptr, stride == nullptr are now dim_ == 1, size == {0}, stride == {1}.
2) The logic that previously implemented this at the ATen level (e.g. THLongStorageView STRIDE_EMPTY_TENSOR) is removed.
3) The above is pretty clean except for resize/resizeNd logic -- that is still called with nDimension == 0. So, we rename these to resizeLegacy, resizeNdLegacy, map nDimension == 1
into the new regime, and will later write a empty-aware resize/resizeNd and move over the calls to resizeLegacy, resizeNdLegacy.
4) Also introduces some ifdefs that are just used for testing:
a) USE_TH_SCALAR: move scalar logic in TH
b) USE_TH_ZERO_SIZE_DIM: support arbitrary 0-sized dimensions, i.e {...,0,...}.
These are just used to write forward-looking correct code while call sites to _dim() (old TH nDimension) and resizeLegacy are updated.
* Get rid of noelem_to_empty.
* Use static_cast rather than C-style cast.
* Allocator size for empty tensors in THS/THCS.
* Add back THLongStorageView type Stride (TH and arg parsing has some magic that needs these to be nullptrs).
* 1. added hardshrink() to ATen (CPU + GPU); 2. removed nn.Hardshrink(); 3. reusing previous tests for nn.Hardshrink() and included CUDA tests at test_nn; 4. default parameter lambda=0.5 is not working yet
* optimized memory read/write
* 1. pass in lambd as scalar for CPU/CUDA_apply*; 2. removed tests for hardshrink at test_legacy_nn
* fixes test_utils
* 1. replace zeros_like with empty_like; 2. use scalar_cast in cuda
* 1. printing lambd value; 2. default lambd=0.5 is still failing
* getting around Scalar bug by removing default value of lambd from native_functions.yaml, and declaring it in nn/functional.py
* cleaned up debug printf
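A quick sketch of the ATen hardshrink behavior: values with |x| <= lambd are zeroed, others pass through unchanged.
```
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, -0.2, 0.0, 0.3, 2.0])
print(F.hardshrink(x, lambd=0.5))   # tensor([-1., 0., 0., 0., 2.])
```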
* Temporary solution for having access to the root path for python installations until Caffe2/PyTorch figure out the best way to build.
* Update build.sh
Increasing the verbosity of HIP errors.
This commit turns autograd function/method tests into tests run inside of a trace, or directly written using
script. These tests have uncovered many bugs and limited functionality
in the trace/script pathway, and these failing parts of the tests
are disabled using new exclusion sets. The size of these sets will shrink
as the bugs are fixed.
* fix a bug for SkipIndices
* IDEEP bug, revise the output to CPUTensor in SkipOutputCopy strategy
* [IDEEP] Add IDEEP fallbacks for Style-Transfer ops
* Improve TypeId:
- move it to c10 namespace to allow for easy extraction from caffe2 into c10 (i.e. reuseability from aten)
- Use unordered_map/unordered_set instead of map/set for performance
- Make TypeId a type safe class (i.e. no implicit casts from/to int)
- Make TypeId constexpr
- Some readability improvements (e.g. using instead of typedef)
- Don't explicitly implement TypeMeta copy assignment and construction - let the compiler do that for us.
- Add TypeMeta move constructor
- Make TypeMeta members noexcept
- Implement TypeMeta::operator== and operator!= as free functions instead of in-class
* CR comments
* fix
* fix windows
* Rename back to CaffeTypeId
* Remove c10::TypeId/TypeMeta
* remove C10_KNOWN_TYPE
* code review
* Implement CPU bincount feature support
* Incorporate feedback on renaming to SummaryOps file and other nits
* bincount gpu implementation
* refactor cuda code and incorporate nits
* doc fix
* cuda bincount - cast weights to double if integral type
* fix: signed unsigned comparison error
* fix: ssize_t error
* refactor
* make template typenames readable and other nits
* make compatible with v0.5
* incorporate comments
* update test cases to ensure CUDA code coverage
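A minimal sketch of the new bincount: counts of non-negative integers, optionally weighted.
```
import torch

x = torch.tensor([0, 1, 1, 3])
w = torch.tensor([0.5, 1.0, 1.0, 2.0])
print(torch.bincount(x))      # tensor([1, 2, 0, 1])
print(torch.bincount(x, w))   # weighted counts: tensor([0.5000, 2.0000, 0.0000, 2.0000])
```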
* add comparison operators to jit
* try to fix CI
* address review comments
* fix type of comparison ops result
* address review comments
* fix indentation
* add comments
* require type_as to have non-dynamic tensor arg
* Typo (should check if template argument of type_as, inputs()[1], is tensor)
* Use .at() instead of []
* Use .at() again
* Improve number formatting in tensor print
* fix bad rebase
* address comments
* fix test
* fix test
* use assertExpected for tests
* address comments
* address comments
* More efficient kernel that avoids deprecated shuffles in Embedding.cu and THCUNN/LookupTable.cu
* Using WARP_BALLOT from THCDeviceUtils.cuh, also changing WARP_BALLOT to return unsigned
* [c10d] Rendezvous skeleton
The rendezvous function takes an URL and produces a triplet of a store,
a process rank, and the process group size.
For the file and TCP handlers, the rank and size must be specified, but
other handlers may discover these parameters dynamically.
It returns a generator function, such that if a rendezvous handler
supports rerendezvous, you can write:
for store, rank, size in c10d.rendezvous(...):
pg = c10d.ProcessGroup(store, rank, size)
while the process group is valid:
# Do stuff with process group
* Add Python 2 fallback for urlparse library
* Import X as Y
* Relative import seems to fix it
* Spelling
* Gate import on c10d availability
* Modifying the build path to handle Caffe2's merge
* Update LoadHIP.cmake
Fixing typo.
* Update Dependencies.cmake
Keeping hip_include_directories since other Caffe2 libs depend on it.
* Update CMakeLists.txt
Only including for the second time if we're building with ATen.
* Update CMakeLists.txt
Adding comments to make sure future users understand why necessary commands have been added.
* [fix] fixup the bias multiplier data access issue
Hotfix for failures in conv_transpose
* [D2][Easy]: lint regularizer
lint with black
* [GanH]: Split mu in adaptive weight for diagnose
* [Dper] Add the ability to split FC weights into multiple smaller ones
* fix SumReduceLikeOp for empty blob
as desc.
* add ctc_greedy_decoder for caffe2
ctc_greedy_decoder same as tf's
* Update event callback handling
Allow multiple callbacks per event
* Add WeightedSum layer
The motivation is to do weighted sum in HoNet/crossnet; in the next diff, I'll replace model.Add with model.WeightedSum in
honet: https://fburl.com/f4rmolg2
crossnet: https://fburl.com/v7awn8se, https://fburl.com/63filbnm
* Replicate DAG's behavior
Some callers expect RunAsync to block, replicate that behavior in case of
explicit 'dag' net type
* [dper] layernorm layer
as title
* Override dag, async_dag, async_polling
Overriding dag, async_dag and async_polling with async_scheduling
* Name the thread pools
Caffe thread pools currently inherit the thread names from the thread that starts them, which can be misleading. Give them an explicit name instead.
* [Caffe2] FilleOp should support int64_t dimensions
Change argument type to int64_t for shape argument of FillerOp (used in ConstantFill, XavierFill, etc)
* Remove caffe2/caffe2/contrib/torch/
It's not used anywhere and depends on old lua torch that conflicts with Aten. Given PT1 it's not relevant any more (though it was nice and clever code!)
#accept2ship
* Fix linearWarmup multiplier check
The multiplier needs to be non-negative, not strictly positive.
* Revert D3314316
This is after 2 years and we do not seem to have a use case for this one, so
for the sake of clean API design we should potentially remove this. This would
allow us to potentially pass in arguments to optionally construct an object,
although it is indeed a little bit unclear how we can reuse existing objects if
constructor arguments are passed in. In any case, we may want to remove this
dangling feature.
* Speedup generate proposals by partial_sort.
Speedup generate proposals by partial_sort.
FACEBOOK:
- Saw speed improvement for training with this op.
- Yanghan benchmarked the op on a small dataset and saw a consistent 100% improvement in speed (6ms -> 3ms) at 420 input resolution. See next diff for details.
* More parallel processing friendly for CPP version of GenerateProposals.
More parallel processing friendly for CPP version of GenerateProposals.
* [DT] [43/n] Lift stop conditions inside reader code back to flow control
1. Split multi_reader function into local_reader and remote_reader
2. Lifted stop conditions inside Limiter back to flow control
3. Split epoch flow building logic into 3 cases:
- single machine (1 reader, 1 trainer on trainer0 node, no PS)
- (1 reader + 1 trainer) on trainer0 node, has PS
- multiple readers, readers do not share nodes with trainers, might have PS or not
* Resolve conflicts for torch/_thnn/utils.py
* [Caffe2] Handle image decoding errors
Image decoding errors can make the whole training fail. This diff is to handle them
1. Catch imdecode exceptions and check if the decoded image has zero columns or rows. These are counted as decoding errors.
2. Replace the image with an empty one in case of error.
3. Count the number of errors and throw a runtime exception if the error rate reaches a given threshold.
The empty image data is kept. It might introduce noise in the training data.
* Update MKL exporter to IDEEP ops
TSIA
* [Caffe2] GlobalInit is thread safe, fixing the comment
With the mutex and lock, GlobalInit is thread safe.
Update the comments.
* Back out "Add support for generating ATen files during fbcode build"
Original commit changeset: 28970ddba353
@override-unit-failures
(Note: this ignores all push blocking failures!)
* [DT]: fix predictor save
similar to D6610058, here we add the fix for distributed online training
* Remove net_singlethread_async_gpu.cc
Closes https://github.com/caffe2/caffe2/pull/2528
This removes net_singlethread_async_gpu.cc as part of our effort to clean
CUDAContext and the net executors.
* Inline DFS task execution
Add a DFS inline task execution mode in executor
* Add c10 folder to fbcode
This adds the c10 folder and its test cases to fbcode. Build flags are mostly taken from aten.
* add dependencies for online trainer
Add some dependencies so that the online model can use DataPipeline and PredictionTransform operators
Relevant post: https://fb.intern.facebook.com/groups/1324375037655677/permalink/1740993462660497/
* Resolve conflicts for tools/jit/gen_jit_dispatch.py
* [Fix] sparse regularization in distributed training
* Support advanced pooling options in sum processor
* support advanced pooling options in sum processor
* remove redundant code
* support attention in sum processor
* Improve shard logging in net tracing code
Make it handle arbitrary shard ids instead of just one-digit ids.
* [Caffe2] Call GlobalInit in predictor only in mobile
FACEBOOK:
Calling GlobalInit long after the program starts may not be safe. Issues arise if the following happens:
1. The user does not call GlobalInit and initFacebook after the program starts.
2. The user sets a flag manually: https://fburl.com/mcsumw7d
3. The user calls the OSS predictor.
4. The OSS predictor calls GlobalInit.
5. GlobalInit calls initFacebook.
6. initFacebook resets all flags: https://fburl.com/tolszha1
Thus, the flags the user set manually are overwritten.
This would happen anytime GlobalInit is called long after the program starts.
I suppose the intention of the user in this case is to not call GlobalInit throughout the program,
but to use Caffe2 regardless (is that desired?)
But adding GlobalInit in the OSS predictor would automatically call GlobalInit when using Caffe2.
This issue doesn't exist in mobile, since initFacebook is not called on mobile.
For now, guard the GlobalInit in predictor for mobile only.
May want to ensure the GlobalInit is always called at the start of the program. @[3501714:kutta] has seen weird issues when not calling GlobalInit at the start of the program on server side. He has made some progress on this.
* resolve conflicts for caffe2/core/logging_is_google_glog.h and test/test_torch.py
* Add empty fix for SumLikeReduceOp
Add empty fix for SumLikeReduceOp
* Revert D7962948: [caffe2][nomnigraph] Concat elim for sparseNN
This reverts commit f7f434dc5c34ca6058b9765d2ef615453d2276a9
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* Remove Declarations.yaml
* Include common.h
* Change std::stoi to caffe2::stoi
* Add thread_name.cc to the CMake file
* No need to subtract 1. Fix test segfaults
* Fix NetTest, ObserverTest
Fix tests
(cherry picked from commit 3767e66c3f365596cba3d46d3e7322c933a0ab41)
* CTCGreedyDecoderOp only has CPU implementation, test should only run on CPU
* Add a variable to avoid conversion resizing issue
* [fix] fixup the bias multiplier data access issue
Hotfix for failures in conv_transpose
* [D2][Easy]: lint regularizer
lint with black
* [GanH]: Split mu in adaptive weight for diagnose
* [Dper] Add the ability to split FC weights into multiple smaller ones
* fix SumReduceLikeOp for empty blob
as desc.
* add ctc_greedy_decoder for caffe2
Same as TensorFlow's ctc_greedy_decoder
* Update event callback handling
Allow multiple callbacks per event
* Add WeightedSum layer
The motivation is to do weighted sum in HoNet/crossnet, in the next diff, I'll replace model.Add with model.WeightedSum in
honet: https://fburl.com/f4rmolg2
crossnet: https://fburl.com/v7awn8se, https://fburl.com/63filbnm
* Replicate DAG's behavior
Some callers expect RunAsync to block, replicate that behavior in case of
explicit 'dag' net type
* [dper] layernorm layer
as title
* Override dag, async_dag, async_polling
Overriding dag, async_dag and async_polling with async_scheduling
* Remove the code per soumith's comments
* Remove the code per soumith's comments
* Remove blank lines in the end of file
* [caffe2] upgrade IDEEP and hotfix for conv op accuracy issue (#8364)
* [IDEEP] Upgrade IDEEP version
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* [IDEEP] Fix accuracy issue in conv op
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Fix build error due to lack of src in CMakeLists
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Remove the code per soumith's comments
* [ONNX] Add an ATen fallback pathway for ONNX export (#8273)
* ATen fallback for ONNX export
* Move to enum
* Fix model test
* Add comment
* Address comments
BC interface
* Remove imaginary file (#8415)
* [Caffe2] Enable AMD/MIOPEN ops for Caffe2 (#8306)
* Add hip support for caffe2 core
* Add MIOPEN header/wrapper to caffe2 core
* Add HIP device into caffe2 PB
* top level makefile change for rocm/hip
* makefile scaffolding for AMD/RocM/HIP
* Makefile scaffolding for AMD/RocM/HIP; add makefile/utility for HIP files
* caffe2 PB update for AMD/ROCM HIP device
* Add AMD/RocM/Thrust dependency
* HIP threadpool update
* Fix makefile macro
* makefile fix: duplicate test/binary name
* makefile clean-up
* makefile clean-up
* add HIP operator registry
* add utilities for hip device
* Add USE_HIP to config summary
* makefile fix for BUILD_TEST
* merge latest
* Fix indentation
* code clean-up
* Guard builds without HIP and use the same cmake script as PyTorch to find HIP
* Setup rocm environment variables in build.sh (ideally should be done in the docker images)
* setup locale
* set HIP_PLATFORM
* Revert "set HIP_PLATFORM"
This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.
* continue the build script environment variables mess
* HCC_AMDGPU_TARGET
* Cleanup the mess, has been fixed in the latest docker images
* Assign protobuf field hip_gpu_id a new field number for backward compatibility
* change name to avoid conflict
* Fix duplicated thread pool flag
* Refactor cmake files to not add hip includes and libs globally
* Fix the wrong usage of environment variables detection in cmake
* Add MIOPEN CNN operators
* Revert "Add MIOPEN CNN operators"
This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.
* Add MIOPEN pooling operator
* Add MIOPEN activation operator
* Add MIOPEN softmax operator
* Add MIOPEN spatial batch norm operator
* Add MIOPEN local response normalization operator
* Add MIOPEN conv operator
* Clean-up LRN ops
* enable fp16 in MIOPEN pool ops
* Enable fp16 for MIOPEN relu op
* Enable fp16 for MIOPEN spatial batch norm op
* code clean-up
* revert float16 support
* Create Caffe2 python binding for AMD/ROCM/HIP
* Add op fallback for HIP operator
* add hip src/test files in cmake
* exclude hip src/test files
* fix python binding for hip backend
* fix MIOPEN pooling op workspace
* hack to compile miopen operators
* fix include path for MIOPEN ops
* Fix include path
* Add HIP math utilities
* Fix path for HIP math utils
* cmake fix
* Cmake fix / hipcc for hip files
* suppress hipcc warning
* cmake fix / replace USE_HIP with USE_ROCM
* revert LoadHIP.cmake change
* fix include for thrust/cub-hip
* include path fix for conversion.h
* Updated with latest upstream changes
* clang format fixes
* Context_hip updates
* Fixed typo in rocblas handle get function
* Updated hipified math utils
* Updated math hip test util
* Updated context hip test
* Updated common_hip
* Updated net async dag for HIP
* Added MIOPEN in operator hip test
* fix
* C2 dependencies clean-up
* fix include path for building custom protobuf
* Decouple miopen pool op and conv_pool_op base
* cmake refactor
* fix operator_hip_test
* move all hip/miopen ops files into caffe2/operators/hip
* sanitize cmake
* permission issue
* remove extra parenthesis
* remove artifact from resolving merge conflict
* cont. sanitize cmake files
* fix syntax error
* sanitize conversion.h
* .
* Revert "."
This reverts commit 56020cb0e996a31ae27bf1f8f491955ed0b121b9.
* clang-format
* Enable some reduce operators' ONNX backend tests (#8418)
* fix old comment to point to the right file (#8416)
* Stop pinning nccl version. (#8421)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Expose logsumexp docs and mark log_sum_exp in distributions for internal use (#8428)
* Enable some of the ONNX backend test on broadcasting (#8423)
* Enable some of the ONNX backend test on broadcasting
* enable gemm broadcast
* Expose proto utils and ONNX (#8073)
* Expose proto utils and ONNX from PyTorch libcaffe2.so
* Try to use protobuf from _C.so
* Fix ONNX proto header include
* Adjust order of imports for ONNX until nanopb goes away
* Set and use ONNX_NAMESPACE for PyTorch builds
* Show protobuf summary for all builds
* Add ONNX_NAMESPACE for cpp_build
* Statically link libprotobuf.a into libtorch.so
* Set ONNX_NAMESPACE on Windows build
* Move core/dispatch up as well
* Add /MD flag for Windows build of _C
* Potential Windows fix for ONNX and protobuf
* Add direct linkage from _C to ONNX on Windows
* Only include protobuf wrapper for PyTorch
* Pass extra_compile_args to _nvrtc ext build
* Remove installation of .a files
* Rebase creates some weird situations, revert them manually
* Remove more weird changes due to rebase
* Need to add thread_name.cc after merge
* Revert "Stop pinning nccl version. (#8421)"
This reverts commit 3cb45bafc8b9b023049e5f979a2bcb75e3f7009d.
* Allow downgrades from libnccl2 install.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Billing of changes:
- New Jenkins script for building on rocm. For now it is a bit hacked together, but we can improve it once CI is running
- New ROCM docker image for nightly HIP, and also some legacy packages that we need temporarily
- New enabled config py2-clang3.8-rocmnightly-ubuntu16.04-build based off of the existing Caffe2 image (not built yet)
- A big pile of cmake fixes, mostly to turn bits on/off when ROCM build is involved
- Switch from hiprng to hcrng
- Apply some patches directly in the code, eliminating the separate patch files
- Use __hdiv instead of hdiv, it's more portable
- THCNumerics<T>::gt doesn't work in HIP, so simulate it with sub
- Add a few more overloads HIP needs
- Turn off use of hcc to link (we plan to turn this back on to get tests running)
- Search for hiprand, hiprng, hipblas, hipsparse
- Better Python 2 portability
The Python binding generation code doesn't understand '_out' method bindings correctly, and will compute the indices wrong if you have an '_out' function that is also a method. This is a quick check to prevent you from making this mistake.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Back out "Back out "Add support for generating ATen files during fbcode build""
Original commit changeset: 7b8de22d1613
I'm re-sending this diff exactly as it was approved and
committed. Fixes to support @mode/opt will be sent separately for ease
of review.
* Enable building //caffe2:torch with @mode/opt
In @mode/opt, python runs out of a PAR, which breaks a lot of
assumptions in the code about where templates/ folders live relative
to __file__. Rather than introduce hacks with parutil, I simply turn
template_path into a parameter for all the relevant functions and
thread it through from the top level.
There is a bug in NCCL that causes seg faults when calling ncclCommDestroy() in the destructor during program exit. According to Nvidia, "Whether the NCCL destructor will be called before or after the CUDA runtime destructor is undefined, which can lead to crashes."
As an immediate workaround, skip calling ncclCommDestroy in the NCCL destructor. This is UGLY and we'll follow up with Nvidia to solve this ASAP.
This does the following:
1) makes nDimension an int64_t (to match ATen)
2) changes the dimension value to dim_ (so we catch direct usages)
3) provide an _dim() that provides access to the "old" view (so we can migrate functions one at a time)
4) have code call ->_dim() instead of ->nDimension.
Necessary for Tensor detemplatization (D8121878) - now tensor won't have default constructor (as we don't know the device).
Thus this diff makes TypeMeta be constructible with non-default-constructible types in which case ctor() is non-null but always throws.
It's dangerous however as we won't catch potential type errors at compile time. Luckily - the only place where ctor() is used is in Blob and Tensor which have templated wrappers there (GetMutable and mutable_data respectively). We can just enforce the necessary type requirements there explicitly as a static_assert.
It also changes the failure behavior to be throw() instead of abort(). Aborting the process is not cool for the library :)
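A standalone sketch of the idea described above (simplified, hypothetical names; not the actual TypeMeta code): the registered ctor() slot is non-null for non-default-constructible types but throws, while a GetMutable/mutable_data-style templated wrapper statically requires default constructibility so its callers never hit the runtime throw.
```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <type_traits>

// Type-erased constructor slot, in the spirit of the description above.
using PlacementCtor = void (*)(void* ptr, std::size_t n);

template <typename T>
void placement_construct(void* ptr, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    new (static_cast<char*>(ptr) + i * sizeof(T)) T();
  }
}

// For non-default-constructible types the slot stays non-null, but throws
// instead of aborting the process.
template <typename T>
void throwing_ctor(void*, std::size_t) {
  throw std::runtime_error("type is not default-constructible");
}

template <typename T>
typename std::enable_if<std::is_default_constructible<T>::value, PlacementCtor>::type
pick_ctor() { return &placement_construct<T>; }

template <typename T>
typename std::enable_if<!std::is_default_constructible<T>::value, PlacementCtor>::type
pick_ctor() { return &throwing_ctor<T>; }

struct Meta { PlacementCtor ctor; };

template <typename T>
Meta make_meta() { return Meta{pick_ctor<T>()}; }

// A templated wrapper can still insist on default constructibility at
// compile time, turning the would-be runtime throw into a compile error.
template <typename T>
T* get_mutable_sketch(void* storage) {
  static_assert(std::is_default_constructible<T>::value,
                "GetMutable requires a default-constructible type");
  return static_cast<T*>(storage);
}

struct NoDefault { explicit NoDefault(int) {} };

int main() {
  Meta ok = make_meta<int>();        // ctor() really constructs ints
  Meta nd = make_meta<NoDefault>();  // ctor() is non-null but would throw
  (void)ok; (void)nd;
  return 0;
}
```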
* Add some C++17 features, implemented with C++14
* Add some type traits
* Compile-time type list abstraction (a rough sketch follows this list)
* Some utils for compile-time programming
* Fix compatibility with a larger range of compilers
* Use guts::array instead of std::array because of std::array shortcomings
* code review comments
* Use quotes for includes
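As a rough illustration of the compile-time type list abstraction mentioned in the list above (illustrative names, not the actual guts:: API), a C++14-compatible type list with a couple of metafunctions looks roughly like this:
```cpp
#include <cstddef>
#include <type_traits>

template <class... Ts>
struct typelist {
  static constexpr std::size_t size = sizeof...(Ts);
};

// head<List>::type is the first element of the list.
template <class List> struct head;
template <class T, class... Ts>
struct head<typelist<T, Ts...>> { using type = T; };

// contains<List, T>::value is true iff T occurs in List (no C++17 fold expressions).
template <class List, class T> struct contains;
template <class T>
struct contains<typelist<>, T> : std::false_type {};
template <class T, class Head, class... Tail>
struct contains<typelist<Head, Tail...>, T>
    : std::conditional<std::is_same<T, Head>::value,
                       std::true_type,
                       contains<typelist<Tail...>, T>>::type {};

using Scalars = typelist<float, double, int>;
static_assert(Scalars::size == 3, "three-element list");
static_assert(std::is_same<head<Scalars>::type, float>::value, "first is float");
static_assert(contains<Scalars, int>::value, "int is in the list");
static_assert(!contains<Scalars, char>::value, "char is not in the list");

int main() { return 0; }
```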
* Remove remaining TensorTypeUtils functions.
Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.
* Have a single THTensor / THCTensor type.
As was previously done with Storages, have only a single (dtype-independent) THTensor / THCTensor.
For documentation and backwards compatibility purposes, the old names, e.g. TH(Cuda)LongTensor alias the new TH(C)Tensor type.
* undef GENERATE_SPARSE.
* Build and install c10d from tools/build_pytorch_libs.sh
* Create initial Python bindings for c10d
* clang-format
* Switch link order to include more symbols
* Add bindings and tests for ProcessGroupGloo
* Add broadcast test
* Separate build flag for c10d
* Explicit PIC property
* Skip c10d tests if not available
* Remove c10d from Windows blacklist
Let it skip by itself because it won't be available anyway.
* Make lint happy
* Comments
* Move c10d module into torch.distributed
* Close tempfile such that it is deleted
* [c10d] Process Group NCCL implementation
* Addressed comments
* Added one missing return and clang format again
* Use cmake/Modules for everything and fix gloo build
* Fixed compiler warnings
* Deleted duplicated FindNCCL
* provide data<T>() in TH(C)Tensor.
* un-genericize THCDeviceTensorUtils.
This is used outside of generic context, so we need to un-genericize it to have a single THCTensor type.
* Don't override Tensor, Storage macros defined outside torch/csrc in torch/csrc.
This PR does the following:
1) Removes THSTensor macros in torch/csrc, which aren't used.
2) For macros defined outside of torch/csrc (THTensor, THTensor_, THStorage, THStorage_):
a) No longer override them, i.e. previously THTensor could actually be THCTensor if a generic file was included from a file including THCP.h.
b) Instead, introduce new macros THW* (e.g. THWTensor) to represent a (potentially empty) wildcard character.
In addition to making this code easier to read and codemod, this allows us to more freely change TH/THC; for example:
currently in the THC random code, the state is casted to THByteTensor*; this happens to work because the macros don't happen to override THByteTensor.
But if THByteTensor just becomes an alias of THTensor (which is the plan for a single tensor type), then this no longer works.
The whole thing was previously a bit of a mess because you really had to understand which macros are redefined and which aren't.
We could also rename the macros that live in torch/csrc (e.g. the THPTensor macros), but since that is more self contained, I punted for now.
* Don't change the plugin.
For example:
>>> torch.ones(3).requires_grad_()
tensor([ 1., 1., 1.], requires_grad=True)
>>> torch.ones(3).requires_grad_() * 5
tensor([ 5., 5., 5.], grad_fn=<MulBackward0>)
The suffix (dtype, requires_grad, grad_fn) wraps to a new line if it would cause the line to exceed the linewidth.
>>> torch.ones(10).double().requires_grad_()
tensor([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
dtype=torch.float64, requires_grad=True)
* Fix some signed/unsigned mismatches
* Skip unused result warning
* Explict fallthrough for murmur hash
* Enable aligned new support to eliminate warning
* Switch to int instead of unsigned in some cases
I want to introduce using SparseTensor = Tensor (as a documentary
type alias for Tensor), but the name is already taken.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Trying to copy all results fails when one of them is a tensor list which
has not been populated. This blew up for CuDNN RNNs when the weights
did not require grad.
Thanks to Sylvain Gugger for reporting!
* Add nan and inf probs check to multinomial
* fix bug
* Spawn CUDA test in subprocess
* Make sure invalid input won't pass the test case
* Try to fix error
* Test failure cases in Python 3 only
* Try to fix Windows error
* Move CUDA test to test_cuda.py
* fix issues
* fix module name error
* no need to check for CUDA existence in test_cuda
* Use PY3
These files don't follow the usual pattern: In general the files torch/csrc/X torch/csrc/cuda/X
both include the generic file torch/csrc/generic/X, where torch/csrc/X includes the cpu implementations and torch/csrc/cuda/X includes the cuda implementations.
(Aside: this is probably not the best structure, the torch/csrc/X files should probably be moved to torch/csrc/cpu/X).
utils.cpp combines these so that torch/csrc/utils.cpp has cuda specific code. This makes it impossible to declare a single THTensor and THCTensor template type (i.e. THPPointer<_THTensor>, THPPointer<_THCTensor>).
In particular, define a base type, _THTensor, that can be used for all THRealTensor structs.
This is just to have less cognitive load when dealing with generic THTensor/THCTensor types (as in templates).
* Replace (non-data) TensorUtils calls with non-generic THCTensor calls.
TensorUtils is templatized on the THTensor type, so to support a single tensor type (like ATen), we need to remove these.
This PR does the following:
1) Allows THCTensorTypeUtils.cuh to include THCTensor.hpp.
This involves moving includes of it outside of generic/, so we can use the new implementations.
2) Defines a single _THCTensor struct and changes THCRealTensor to be a derived type of _THCTensor.
This allows us to implement a single non-generic function and avoid static_cast or void * tricks to call it from the generic functions.
3) For functions inside of TensorUtils that don't use data pointers:
a) Implement the functions in (non-generic) THTensor.cpp and declare them in (non-generic) THTensor.hpp.
b) Have the generic versions call the non-generic versions.
c) Replace the corresponding TensorUtils<THCTensor>::fn call with (non-generic) THTensor_fn.
* Add comment about THCTensor struct.
* Error if storage is null in setStorageNd or resizeNd.
* Implement randperm for CUDA
* Use Thrust to implement randperm
* clean up
* Fix test
* Offload small input scenario to CPU
* Fixed test
* Try to fix Windows error
* Fix Windows error and clean up
* Use fork_rng context manager
* Move test_randperm_cuda to test_cuda
* Add half tensor support
* Fix cuda::type error
* Fix CPU offloading
* Fix issues
* No need to check range for n == 0 case
* Fix scalar check for sparse tensors.
As discovered in #8152
If `t` is a scalar sparse tensor, `t._indices` used to return a sparse
empty tensor because the scalar check was incorrect. This PR modifies
the scalar check to return a dense tensor instead of a sparse tensor.
i.e.
```
tensor = torch.sparse_coo_tensor([], [], torch.Size([]), device=device)
out = tensor._indices() # was a sparse tensor, now is dense.
```
* Fix typos
In my use case, in the backward propagation pass, the reshape needs to change a [0] tensor into a [0,0]-shaped tensor. The original implementation would cause an out-of-index issue. This diff fixes the problem.
We don't want SOVERSION because pip will lose the symlink and
double your distribution size, and also because our setup.py
accidentally links against both libcaffe2.dylib and libcaffe2.1.dylib
on OS X. This leads to a very puzzling error where you get
the error "cannot initialize CUDA without ATen_cuda", because
there are actually two copies of your registry in memory (because
there are two copies of the dynamic library). Dropping SOVERSION
makes it impossible to make this mistake.
In principle, if the shared library load is done with DYLD_GLOBAL,
that should also prevent two copies of the registry from popping up.
Worth checking at some later point, if you need to bring back
SOVERSION (because, e.g., pip finally fixed their software.)
Partially fixes #8022.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Update elementwise ops to support numpy-style broadcast
Update elementwise ops to support numpy-style broadcast
* Fix sqrt_op
* Fix compare ops
* Fix gradient test
* Fix optimizer legacy broadcast
* Fix legacy broadcast for elementwise ops
* Skip flaky test
* Fix eigen simple binary op
* Fix attention test
* Fix rnn test
* Fix LSTM test
* Fix tan grad
* Fix schema check
For non-generic function call implementations in Storage used by TensorUtils, we do the following:
1) Move the declaration from generic/C to non-generic/C++; we don't need backwards compatibility on these functions and want to use e.g. at::ScalarType.
2) Move the implementation from generic/C++ to non-generic/C++.
3) Change the generic implementation to call the non-generic implementation.
This will allow us to get rid of the corresponding TensorUtils calls (once we move over the Tensor functions in the same manner).
* Fix __rshift__ bug
* Add small tests for __lshift__ and __rshift__ in test_cuda
* Add a more elaborate check for __lshift__ and __rshift__
* refactor the test to address @zou3519 's comments
When compiling OSX with CUDA, Caffe2's build system uses
find_package(cuda) to get its grubby hands on the CUDA driver
library (for some strange reason, FindCUDA doesn't save this
information as a variable). Unfortunately, on OSX, sometimes
this picks up the cuda.framework folder, and then our build
system chokes to death because it doesn't try to link against
this as a framework. (Is the folder even a framework? I have
no idea).
This commit attempts to fix this in a two pronged fashion:
1. For some users, reducing the precedence of frameworks
using CMAKE_FIND_FRAMEWORK seems to help. So we set these
variables. However, this fix is not perfect; on my laptop
it doesn't actually solve the problem.
2. PyTorch doesn't actually need the CUDA driver API. So we
only add the dep when building Caffe2.
Fixes #8022
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* docstring support for @script and @script_method
* make it python2 compatible
* improve according to review
* improve build_stmts
* use filter instead of list comprehension
* improve the way wrap is handled for script_method
* stash the original method instead
* allow dynamic attr for ScriptMethod and GraphExecutor
* a bit comment on build_Expr
* remove _build_wrap
* a bit improve on comments
* rename to __original_methods
* should be _original_methods
* docs: enable redirect link to work for each specific page
* docs: add canonical_url for search engines
closes #7222
* docs: update redirect link to canonical_url
* opt bernoulli rng with vsl and openmp
* detect cpu vendor for bernoulli
* retrigger test platform
* check the vendor more severely
* use cpuinfo to check vendor
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch in scatter
* fix type mismatch in scatter
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch while call torch._C._cuda_setDevice
* Add non_blocking to Tensor/Module.to
* flake8
* Add argparse tests
* cpp parse
* Use C++ parser
* use a common parse function with Tensor.to
* fix test_jit
* use THPObjectPtr
* increase refcount for None, True, and False
* address comments
* address comments
As in https://github.com/pytorch/pytorch/pull/8056, this doesn't work with a single TensorImpl type.
This replaces the usages of with a templatized parameter and static_asserts that the new and old are equal.
After this we can get rid of the old template parameter, but I want to ensure they are equivalent across all builds first.
* [Caffe2] Support non peer access in muji
* [Caffe2] Add test for 4 gpus and 2 groups
* [Caffe2] Add comments
* Fix bug when reduced_affix is empty
* Fix typo and add comments about cpu and amd gpu
* Split SparseTensorImpl off from TensorImpl.
At the moment they have the same data layout, but with the upcoming refactor
they will not, and we need a place to put all of the sparse tensor specific
fields.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Update SparseTensorImpl.h
In some configurations (e.g., our internal build of GCC 5 + GLIBC 2.23),
-lrt is not sufficient to use shm_open; you also need to declare
a dependency on pthread. This patch adds a surgical extra fix to
detect this situation, in the case that I noticed it failing in the
wild.
Fixes#8110
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Implement adaptive softmax
* fix test for python 2
* add return_logprob flag
* add a test for cross-entropy path
* address review comments
* Fix docs
* pytorch 0.4 fixes
* address review comments
* don't use no_grad when computing log-probs
* add predict method
* add test for predict
* change methods order
* get rid of hardcoded int values
* Add an optional bias term to the head of AdaptiveSoftmax
* Resolve merge conflicts
* .
* Update GetAsyncNetHIPThreadPool
* Enable BUILD_CAFFE2 in pytorch build
* Unify USE_HIP and USE_ROCM
* always check USE_ROCM
* .
* remove unrelated change
* move all core hip files to separate subdirectory
* .
* .
* recurse glob core directory
* .
* correct include
* .
If you set CUDA_HOME and CUDA_NVCC_EXECUTABLE together, you may
end up in a situation where the CUDA_VERSION of your includes
mismatches the CUDA version of your nvcc. See #8092 for a concrete
case where this can occur. Explicitly detect this situation and
give a good error message in this case!
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Added support to run ONNX Upsample operator (mode=nearest) in Caffe2
* adding error checks to upsample
* adding error checks to upsample
* adding error checks to upsample
* changing to np.isclose
* Revert onnx submodule update
* still fixing
TensorUtils<T> is basically ATen-dispatch-lite in that it allows one to do multi-type THC function dispatch with a single call.
However, it is templatized on the Tensor type, and since we are moving to a single Tensor type, this doesn't work.
Most of the functions in TensorUtils (e.g. getDims) can be pulled up a level, to just call THCTensor_nDimension (or directly accessing the member),
but the DataType specific functions are more problematic.
So, this PR does two things:
1) Replaces calls of 'TensorUtils<THCTensor>::DataType' with 'real' since these are identical
2) Templatizes the THC_pointwiseApplyX functions to take scalar types. To ensure this is done correctly, we static_assert that the scalar type template parameter matches the scalar type of
the corresponding template parameter. We will need to get rid of these static_asserts in the future, but this is useful for now.
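A simplified sketch of what such a templatized pointwise-apply with a scalar-type static_assert can look like (hypothetical tensor structs, not the actual THC code): the caller names the scalar type explicitly, and a mismatch with the tensor's own scalar type is caught at compile time.
```cpp
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for dtype-specific tensor structs.
struct FloatTensor { using scalar_t = float; std::vector<float> data; };
struct LongTensor  { using scalar_t = long;  std::vector<long>  data; };

// The scalar type template parameter must match the tensor's scalar type,
// enforced via static_assert as described above.
template <typename ScalarT, typename TensorT, typename Op>
void pointwise_apply1(TensorT& t, Op op) {
  static_assert(std::is_same<ScalarT, typename TensorT::scalar_t>::value,
                "scalar type template parameter must match the tensor's scalar type");
  for (auto& x : t.data) op(x);
}

int main() {
  FloatTensor t{ {1.f, 2.f, 3.f} };
  pointwise_apply1<float>(t, [](float& x) { x *= 2.f; });
  // pointwise_apply1<long>(t, ...) would fail to compile.
  return 0;
}
```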
No longer generate data-type specific Storage types, since all Storage types are now identical anyway.
For (some) backwards compatibility and documentation purposes, the Real names, e.g. THLongStorage are now #defined as aliases to the single THStorage type
* Adding instance weight to batch distill loss
as title
* add bfloat 16-31
added bfloat 16-31 and their respective unit tests
* [CUDA9] Upgrade - fbcode
CUDA9 upgrade diff D5654023 has been out for a while thanks to Pieter. But with time growing it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs, one touching tp2 config files, and another one touching fbcode TARGETS file (adding nvcc flag). These two should be a bit easier to rebase (for detailed procedure see "Test Plan").
This diff can only be committed if:
1. CUDA 9 rpm is rolled out fleet-wide (TBD)
2. NVidia driver 390.40 is rolled out fleet-wide (done)
3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done)
4. Make sure all dependents are built (done)
5. Test all C2 operators, PyTorch (see test plan)
* Share intermediate int32 buffer across Conv ops
Adding a known type
* [C2 fix] infer function for ensure_cpu_output_op
this is adding the missing device function for ensure_cpu_output_op
* [int8] Add blob serializer/deserializer for Int8TensorCPU
To export to logfiledb
* [nomnigraph] Add try catch block to optimization passes in predictor
This will catch failures that happen in the optimization pass.
* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE
CAFFE_ENFORCE uses a stack trace fetcher, which is currently a global static variable. If CAFFE_ENFORCE is used at static initialization time, this is a SIOF. Recently CAFFE_ENFORCE was added into init function registration, so we started to see this.
A Meyers singleton is going to provide safety here. If the stack trace fetcher has not been registered yet, it will just use a dummy one.
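A generic sketch of the Meyers-singleton pattern being applied here (illustrative names, not the actual Caffe2 API): the fetcher slot lives in a function-local static, so it is initialized on first use rather than at namespace scope, and a dummy result is returned if nothing has been registered yet.
```cpp
#include <functional>
#include <string>
#include <utility>

// Illustrative type for a registered stack trace fetcher.
using StackTraceFetcher = std::function<std::string()>;

namespace {
// Function-local static: constructed the first time it is needed, which
// avoids the static initialization order fiasco of a namespace-scope global.
StackTraceFetcher& fetcher_slot() {
  static StackTraceFetcher fetcher;  // empty by default
  return fetcher;
}
}  // namespace

void SetStackTraceFetcher(StackTraceFetcher f) { fetcher_slot() = std::move(f); }

std::string GetCurrentStackTrace() {
  auto& f = fetcher_slot();
  // If no fetcher was registered yet (e.g. an enforce fired during static
  // init), fall back to a dummy result instead of crashing.
  return f ? f() : std::string("<no stack trace available>");
}

int main() {
  (void)GetCurrentStackTrace();  // safe even before registration
  SetStackTraceFetcher([] { return std::string("fake trace"); });
  return GetCurrentStackTrace().empty() ? 1 : 0;
}
```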
* NUMA support in SparseNN CPU benchmark
Adding support for NUMA in SparseNN CPU benchmark
* [mobile-roofline] Add logging needed for roofline model
This should be all that's needed
* Let the operators use the same input if the operators are not chained
Otherwise, we would have to change the input data dims.
* fix null-pointer-use UBSAN errors in in reshape_op.h
* revert previous fix on input blob name
as title
* Adding flag to let MineHardNegative automatically extract single value from dict
Model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allowing automatic extraction of the single element of the dict, which is a common use case.
* Reverting change that broke internal tests back to OSS compatible state
* [script] Add support for torch.zeros, torch.ones, etc.
* modifies gen_jit_dispatch to creating bindings for functions that do
not take tensor arguments, but do have an initial type argument
* adds tensor attributes to these functions for device, layout, and
dtype specification
* extends the list of valid compiler constants to include device, layout,
and dtype.
* allows functions with Generators, but only using the default generator
Known limitations:
* when using `torch.float`, we convert it to a scalar tensor and make
no checks that it is actually used only in a dtype specification.
This is similar to how we handle Python numbers, creating some situations
where the script is more permissive. Fixing this requires much more
significant changes to the IR, so is lower priority for now.
* devices specified using string literals e.g. 'cuda:1' do not work,
since we do not support string literals in general.
* Factor python dependency out of interpreter
* Remove NO_PYTHON for the autograd engine
If there is no python bindings, then a default Engine is constructed
the first time it is requested.
If the python libraries are loaded, then they override the default
accessor and the default engine becomes a python Engine.
Note: it is possible for two engines to be generated if a non-python
one gets created before the python bindings are loaded. This case
is rare, and just results in additional threads being spawned.
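A rough sketch of the accessor scheme described above (hypothetical names, not the actual autograd code): a swappable getter defaults to lazily constructing a base Engine, and the Python bindings can install their own getter that returns a Python-aware subclass.
```cpp
struct Engine {
  virtual ~Engine() = default;
  virtual void execute() {}
};

// The accessor is a swappable function pointer. By default it lazily builds
// a plain Engine the first time one is requested.
using EngineGetter = Engine& (*)();

Engine& default_engine_getter() {
  static Engine engine;  // constructed on first request
  return engine;
}

EngineGetter g_engine_getter = &default_engine_getter;

Engine& get_default_engine() { return g_engine_getter(); }

void set_default_engine_getter(EngineGetter g) { g_engine_getter = g; }

// A Python-aware build would install its own getter at module load time.
struct PythonEngine : Engine {
  void execute() override { /* would e.g. release the GIL around execution */ }
};

Engine& python_engine_getter() {
  static PythonEngine engine;
  return engine;
}

int main() {
  get_default_engine().execute();                 // base engine
  set_default_engine_getter(&python_engine_getter);
  get_default_engine().execute();                 // overridden engine
  return 0;
}
```
As the note above says, if the default engine was already created before the override is installed, two engine instances can coexist; that only costs a few extra threads.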
* Fixing AlexNet test which is skipped in CI
* Fix profiler crash when no events register
When trying to profile, attempting to print the event table throws a vague error because the event list is empty:
....
max_name_length = max(len(evt.key) for evt in events)
ValueError: max() arg is an empty sequence
This change fixes the error by returning an empty string.
* Update profiler.py
This adds an unconditional dependency on CUDA, which is not desirable
for the long term. Ideally we have split like ATen where we have
different artifacts for different backends so you can decide at runtime
what to use.
* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace.
This requires renaming the _cast functions which used the unqualified names.
* Separate onnx mapping of scalar type from cast name.
* Fix flake8.
* Properly cast onnx.
observers_list_ stores all the observers for an observable. The list is allocated on the heap, which
can cause LLC misses. Add an on-stack observer cache for fast access. In production, we have seen a 20%
speedup for start and stop observer calls.
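A simplified sketch of the on-stack observer cache idea (illustrative types, not the actual observer API): copy the heap-allocated observer list into a small fixed-size array on the stack, then iterate over the cached pointers in the hot start/stop path.
```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

struct Observer {
  virtual ~Observer() = default;
  virtual void Start() {}
  virtual void Stop() {}
};

struct Observable {
  // Heap-allocated list, as described above; walking it directly in the hot
  // path can miss in the last-level cache.
  std::vector<std::unique_ptr<Observer>> observers_list_;

  static constexpr std::size_t kCacheSize = 8;  // small on-stack cache

  void StartAllObservers() {
    std::array<Observer*, kCacheSize> cache;
    const std::size_t n = observers_list_.size();
    if (n <= kCacheSize) {
      for (std::size_t i = 0; i < n; ++i) cache[i] = observers_list_[i].get();
      for (std::size_t i = 0; i < n; ++i) cache[i]->Start();  // fast path
    } else {
      for (auto& o : observers_list_) o->Start();             // fallback
    }
  }
};

int main() {
  Observable obs;
  obs.observers_list_.emplace_back(new Observer());
  obs.StartAllObservers();
  return 0;
}
```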
* Add memory leak check in CUDA tests
* Tracking multi-GPU too
* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test
* add a comment
* skip if cuda
* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU
* Fix MaxUnpool3d forward memory leak
* Fix MultiLabelMarginCriterion forward memory leak
* Fix MultiMarginLoss backward memory leak
* default doCUDAMemoryCheck to False
* make the wrapper skip-able
* use TEST_MULTIGPU
* add align_corners=True/False tests for Upsample; fix TEST_CUDNN
* finalize interface
* VolumetricMaxUnpooling_updateOutput
* fix test_nccl
* rename THC caching allocator methods to be clearer
* make the wrapped function a method
* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp
* fix renamed var
* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio.
* Add support of all default cmake build types for release to cuda.
* Make THStorage / THCStorage have void* data ptr.
This is the initial step in unifying the ATen and TH tensor representations, next is to only generate a single THStorage / THCStorage type.
The major changes here are:
1) data has been renamed to data_ptr and made void* in THStorage/THCStorage.
2) THStorage / THCStorage stores a at::ScalarType representing its data type (This will be useful when we generate a single THStorage/THCStorage).
3) APIs for Accessing the data as a real*:
a) storage->data<real>() -- this does runtime-type checking (checks that the at::ScalarType is correct).
b) storage->unsafeData<real>() -- as above, but no runtime-type checking (used in inner loops / fast code paths).
c) THStorage_(data)(storage) -- this already existed, just calls storage->data<real>().
* Add include.
* Attempt to fix clang build issues.
* Clarify comment and remove extra character.
* Rename unsafeData -> unsafe_data.
* Remove unnecessary 'to' function to get compile time rather than link time errors.
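A condensed sketch of what such a dtype-tagged, void*-backed storage accessor can look like (simplified; enum values and names are illustrative, not the actual THStorage code): data<T>() checks the recorded scalar type at runtime, while unsafe_data<T>() skips the check for inner loops.
```cpp
#include <cstdlib>
#include <stdexcept>

enum class ScalarType { Float, Long };

template <typename T> struct scalar_type_of;
template <> struct scalar_type_of<float> { static constexpr ScalarType value = ScalarType::Float; };
template <> struct scalar_type_of<long>  { static constexpr ScalarType value = ScalarType::Long;  };

struct Storage {
  void* data_ptr;          // untyped, as described above
  ScalarType scalar_type;  // runtime record of the element type

  template <typename T>
  T* data() const {        // runtime-checked accessor
    if (scalar_type != scalar_type_of<T>::value)
      throw std::runtime_error("scalar type mismatch");
    return static_cast<T*>(data_ptr);
  }

  template <typename T>
  T* unsafe_data() const {  // unchecked accessor for fast code paths
    return static_cast<T*>(data_ptr);
  }
};

int main() {
  Storage s{std::malloc(4 * sizeof(float)), ScalarType::Float};
  s.data<float>()[0] = 1.0f;   // ok
  // s.data<long>() would throw at runtime.
  std::free(s.data_ptr);
  return 0;
}
```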
* Raise error when torch.load a storage on a non-existing device
Before, doing torch.load(...) on a CUDA tensor on a CPU-only machine
would raise an unreadable error:
```
~/pytorch/pytorch/torch/cuda/__init__.py in __enter__(self)
223 if self.idx is -1:
224 return
--> 225 self.prev_idx = torch._C._cuda_getDevice()
226 if self.prev_idx != self.idx:
227 torch._C._cuda_setDevice(self.idx)
AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'
```
This PR makes it so that torch.load raises a hard error if one tries to
load a storage onto a non-existing device and suggests the user to use
torch.load's map_location feature.
* Address comments
* missing dep
* Handling of scalars in torch.Size
torch.Size() constructor uses python_arg_parser
IntList in python_arg_parser can take iter/range
Have IntList take python iterables and ranges.
Address comments: don't use python_arg_parser and instead call __index__ in THPSize_pynew
Address comments
Address comments
* Rebased
* Address nit
* pad-sequence no longer requires sorting entries
pad_sequence can get max_len from the list of sequences. Entries only need to be sorted if the output will be used for pack_padded_sequence, which can throw the error itself.
* remove sort requirement from pad-sequence
Picks up from #5974.
Removes the requirement that input sequences to pad_sequence have to be
sorted. Addressed the comments in the PR:
- Updated docstring for pad_sequence
- Remove sort requirement in pad_sequence test
- Test unsorted and sorted sequences in pad_sequence test
* Test if ASAN is actually working as part of ASAN tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Drop explicit use of libstdc++, we should not care.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Build with DEBUG=1
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Increase main thread stack size when using ASAN.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This provides a bare-minimum MPI Process Group implementation, the commit is on top of @pietern's Gloo Process Group PR.
* [c10d] MPI Process Group Implementation
ref: https://github.com/pytorch/pytorch/issues/7434
* Better exception, atexit func, and addressed comments
* Clang formatting changes
* Static initialization and addressed comments
* Added constness back
* Test will now launch mpi processes if found
* CMakeList Changed
* [mpscnn] MPSCNNChannelShuffle
att
* [Easy] Adding tags as an argument to the functional layer
Without it "tags" would be added as an argument to the operator.
The change here is based on the assumption that there is no operator that takes "tags" as an argument.
* Fix locally_connected_op schema check.
Fix locally_connected_op schema check.
* [C2] Add TypeAndShape inference for few more operators
As desc
* [c2] Shape inference should support 0 as dimension
Tensors can have 0 in their dimension.
* Make MockHiveReader loop over and support max_examples
Replace DatasetReader with RandomDatasetReader.
So that Mock Hive Reader can simulate a large data input using a small sample file as source.
* Utility function to wipe cache between benchmark runs
Caffe2 benchmark does not wipe out the cache between runs, and this potentially creates an unrealistically optimistic picture of performance. This diff adds a utility function to wipe out the cache.
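One common way to implement such a cache-wiping helper (a guess at the approach, not the actual diff): stream through a buffer larger than the last-level cache between runs so that previously cached benchmark data is evicted.
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Touch a buffer bigger than the LLC so earlier working sets are evicted.
// The default size is a rough guess; real code would query or configure it.
void wipe_cache(std::size_t llc_bytes = 64 * 1024 * 1024) {
  static std::vector<std::uint8_t> scratch;
  scratch.resize(llc_bytes);
  volatile std::uint8_t sink = 0;
  for (std::size_t i = 0; i < scratch.size(); i += 64) {  // one touch per cache line
    scratch[i] = static_cast<std::uint8_t>(i);
    sink = sink + scratch[i];
  }
  (void)sink;
}

int main() {
  // e.g. call between benchmark iterations:
  // for (int iter = 0; iter < n; ++iter) { wipe_cache(); run_net_once(); }
  wipe_cache();
  return 0;
}
```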
* Allow caffe2 GlobalInit to be invoked multiple times
Allow caffe2 GlobalInit to be invoked multiple times. Will re-parse gflags and update logging levels on successive invocations, but will not re-run init functions or perform other one-time initialization.
* Add Caffe2 GlobalInitIsCalledGuard to base net and operator classes
Warn if caffe2's GlobalInit function has not been invoked before creating an operator or net object. This is based on discussion here: https://fb.quip.com/kqGIAbmK7vNG
* Rethrow current exception on failure
Rethrow current exception instead of copy constructing a new one on op failure.
* Make `clone()` return subclass of List/Struct
`clone()` is not working correctly when we subclass those classes
* Wipe the cache before the net run
the util function is copied from D7409424
will rebase once D7409424 is landed.
* [Caffe2] [Mobile] Support utils/cast.h::GetCastDataType with LITE_PROTO builds
* Correct includes
async_polling include -> async_base include
* Prepare execution flags for executor migration
Making async_scheduling aware of underlying net type to prepare for executor
migration
* Add operator level observers into async executor
Adding operator level observers into RunAsync operators' calls
* Cleanup TEST_Benchmark
Remove duplicate code and provide default implementation in NetBase
* [C2] Fix type and shape inference for binary comparison ops
As desc.
* Add GlobalInit to predictor to ensure initialization is always done before prediction
FACEBOOK:
Redo D7651453 the correct way.
Now use a static variable for the arguments passed to GLog
* Remove spammy log message
This method is currently used in various places inside Caffe itself.
* Disable events for operators inside a chain
We don't need to use events in operators within a chain because the chain is
always scheduled on a single stream, keeping only first and last event for
scheduling purposes
* Ensure correct finish run order
In rare cases we might call finishRun and trigger net's destruction while
another worker is still holding shared_ptr to a thread pool, that can cause
thread pool destruction from within a worker thread in case no other nets are
using the pool. This diff fixes the order of calling finishRun and also changes
pool() to return raw pointer to keep pool's ownership within the net
* Reduce unnecessary polling
Make sure we don't waste CPU by polling operators that we can set efficient callbacks on
* Squash commit of syncing 9506eeb from github to fbcode
Patch xplat buck fix
add virtual destructor to OptimizationPass
add virtual destructor to OptimizationPass
build fixes for sync
build fixes for sync
* Fix net tracing
Fix net tracing from async_scheduling
* Fix logging
It's going to define a static variable, and this was a loaded
footgun if another C++ file directly included this header.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Revert "Fix error when setting multiple arch in TORCH_CUDA_ARCH_LIST (#7879)"
This reverts commit 45cdb63d8b8022ab26f073d3bed718e75d2aedaf.
* Disable dirty test; always run all CI runs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Remove templatization of PyTypeObject in THP copy storage methods.
An in-progress refactoring of THStorage is collapsing the types of THStorages to not be ScalarType-specific.
The relevant PyTypeObject to use for the THPStorageType is currently templatized based on the current THStorage;
this doesn't work if the ScalarType is collapsed. Instead, just pass it explicitly.
* Pass src type instead of dst type.
* Line up columns.
* Avoid @generated in templates.
We want @generated only in the build products. Otherwise, templates are
locked and changes to the templates are excluded from phabricator.
Also adds @generated to autograd generated files (e.g.
VariableType.cpp).
See #7780
* Don't try to specify the template filename in generated comment
The template filename is not always the same as the generated filename.
* Make TensorMethods (fastGetSet) not depend on data type of Storage.
Currently, fastGetSet is implemented as macros that depend on the data type of Storage (i.e. that storage->data is real*).
Since we are moving to having 'void*' data this won't work in the future.
Also, due to the recentl C/C++ split, these are actually C++ implementations (because they require the struct definition which is C++),
so we move them to a generic .hpp file and implement them as static inline functions.
* Fix set functions.
* Add generic to CMakeLists.
* Not running ATEN tests on Caffe2 builds
* Keeping test directory when only aten is built
* Changing to run all aten tests too
* Skipping directories again
* .
* .
* skip aten/integer_divider_test (it hangs for unknown reason)
* Implement nn.Sequential that can be inlined into script modules
* fix bugs
* add comment
* add _ConstSequential class
* add script_method for forward in ConstSequential
* fix build bug
* refactor
* Add backward() to Tensor and Variable
* Add at:: in front of Tensor
* Trying to not move optional to appease windows?
* Move implementation into cpp file
* Undo some formatting changes
* Have PyTorch depend on minimal libcaffe2.so instead of libATen.so
* Build ATen tests as a part of Caffe2 build
* Hopefully cufft and nvcc fPIC fixes
* Make ATen install components optional
* Add tests back for ATen and fix TH build
* Fixes for test_install.sh script
* Fixes for cpp_build/build_all.sh
* Fixes for aten/tools/run_tests.sh
* Switch ATen cmake calls to USE_CUDA instead of NO_CUDA
* Attempt at fix for aten/tools/run_tests.sh
* Fix typo in last commit
* Fix valgrind call after pushd
* Be forgiving about USE_CUDA disable like PyTorch
* More fixes on the install side
* Link all libcaffe2 during test run
* Make cuDNN optional for ATen right now
* Potential fix for non-CUDA builds
* Use NCCL_ROOT_DIR environment variable
* Pass -fPIC through nvcc to base compiler/linker
* Remove THCUNN.h requirement for libtorch gen
* Add Mac test for -Wmaybe-uninitialized
* Potential Windows and Mac fixes
* Move MSVC target props to shared function
* Disable cpp_build/libtorch tests on Mac
* Disable sleef for Windows builds
* Move protos under BUILD_CAFFE2
* Remove space from linker flags passed with -Wl
* Remove ATen from Caffe2 dep libs since directly included
* Potential Windows fixes
* Preserve options while sleef builds
* Force BUILD_SHARED_LIBS flag for Caffe2 builds
* Set DYLD_LIBRARY_PATH and LD_LIBRARY_PATH for Mac testing
* Pass TORCH_CUDA_ARCH_LIST directly in cuda.cmake
* Fixes for the last two changes
* Potential fix for Mac build failure
* Switch Caffe2 to build_caffe2 dir to not conflict
* Cleanup FindMKL.cmake
* Another attempt at Mac cpp_build fix
* Clear cpp-build directory for Mac builds
* Disable test in Mac build/test to match cmake
* Skip some tests to unbreak CI
* Pass the opset_version to run_node
* Remove the stale check_graph call, caffe2_net_to_onnx_model will invoke check_model
* Add hip support for caffe2 core
* Add MIOPEN header/wrapper to caffe2 core
* Add HIP device into caffe2 PB
* top level makefile change for rocm/hip
* makefile scaffolding for AMD/RocM/HIP
* Makefile scaffolding for AMD/RocM/HIP; add makefile/utility for HIP files
* caffe2 PB update for AMD/ROCM HIP device
* Add AMD/RocM/Thrust dependency
* HIP threadpool update
* Fix makefile macro
* makefile fix: duplicate test/binary name
* makefile clean-up
* makefile clean-up
* add HIP operator registry
* add utilities for hip device
* Add USE_HIP to config summary
* makefile fix for BUILD_TEST
* merge latest
* Fix indentation
* code clean-up
* Guard builds without HIP and use the same cmake script as PyTorch to find HIP
* Setup rocm environment variables in build.sh (ideally should be done in the docker images)
* setup locale
* set HIP_PLATFORM
* Revert "set HIP_PLATFORM"
This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.
* continue the build script environment variables mess
* HCC_AMDGPU_TARGET
* Cleanup the mess; this has been fixed in the latest docker images
* Assign protobuf field hip_gpu_id a new field number for backward compatibility
* change name to avoid conflict
* Fix duplicated thread pool flag
* Refactor cmake files to not add hip includes and libs globally
* Fix the wrong usage of environment variables detection in cmake
* Add MIOPEN CNN operators
* Revert "Add MIOPEN CNN operators"
This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.
Pull new revision of NNPACK which specifies non-executable stack in assembly files. Previous revision didn't do that, and depending on toolchain could cause linker to mark stack as executable for the linked binaries.
This is a starting point and only implements allreduce for CPU tensors. It includes most base functionality like algorithm caching (similar approach as taken in the THD GlooCache) and multi-threaded execution (new).
The expectation is that function calls on the process group class are globally serialized. They execute collective functions, so members of the collective must call the same functions in the same order, or a deadlock may happen.
The algorithm cache works as follows: the ProcessGroupGloo class has a cache map from algorithm keys to algorithm entries. The algorithm key is a struct with fields that make up the signature of a collective function. It includes the dimensionality of the input/output tensors, tensor device assignment, source/destination rank, etc. For collective calls with the same key, the process group will lazily initialize and then cache a Gloo algorithm instance. For now we only keep a single algorithm instance per key, but this may be revisited in the future, if we observe contention on a single key and can exploit additional parallelism.
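A minimal Python sketch of this caching idea (illustrative only; the real cache lives in the C++ ProcessGroupGloo, and all names here are made up for the sketch):
```python
# Sketch of "algorithm key -> lazily created algorithm instance" caching.
class AlgorithmCache:
    def __init__(self):
        self._cache = {}

    def _key(self, op, tensors, src_rank=None, dst_rank=None):
        # The key is the "signature" of the collective call: dimensionality,
        # dtype and device assignment of the tensors, plus src/dst rank.
        return (
            op,
            tuple((t.dtype, tuple(t.size()), str(t.device)) for t in tensors),
            src_rank,
            dst_rank,
        )

    def get_or_create(self, op, tensors, factory, **ranks):
        key = self._key(op, tensors, **ranks)
        if key not in self._cache:
            self._cache[key] = factory()  # lazy initialization on first use
        return self._cache[key]
```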
* Change backward calls to grad to avoid memory leak from #7343; Replace unnecessary create_graph=True with retain_graph=True
* fix gradgradcheck use of make_non_contiguous
* allow non-contiguous target
* remove unnecessary .grad.zero_()
* remove contiguous_detach
* fix PReLU double backward always returning ggW as a scalar
* let noncontig gO require grad
* move requires_grad to return
* Fix handling of empty batches in SumReduceDimsOp
As titled
* Deferrable async_scheduling finishRun fix
Proper order of finishing run operations in deferrable_async_scheduling net
* Simplify exception handling in async_scheduling
Simplify exception handling; there is no need to busy wait, the thread that processes the
last task can finish the run.
* [C2]worker_coordinator_memorize_worker_ids
As titled. This is related to T28689868, where the number of blobs we want to create is equal to the number of worker ids
* Add unit test for nets with no type set
* Ignore total length argument in symbolic_pad_packed_sequence
1- There was a mistake in the code: total_length was added to the wrong symbolic function (pack_padded_sequence) instead of (pad_packed_sequence).
2- There is no need to throw an exception if total_length is given, since it is only used to enable data_parallel training on multi-GPUs and doesn't have anything to do with ONNX export, so just ignore it. https://fburl.com/tk4gciqp
* Add support for MKLDNN to async_scheduling
Just add MKLDNN as a possible CPU option to async_scheduling's pool function
* [AuFL][ensemble] support branch output for prediction
This diff supports using predictions from different branches and thus enables model ensembling (not fully independent).
* Fix a bug in add_loss in layer_model_helper
As titled.
* Support lradaption for adam
1.lr adaption operator
2.apply to dense adam
* Perf tweaks for async_scheduling
Restore single pool option + remove unnecessary (no-ops) calls
* add quantization to SparseSimdAdagradOp
add a bunch of quantization signatures to SparseSimdAdagradOp, implementations to come next
* [sr] [codemod] Change all SR callsites to use new API
@allow-large-files
This diff refactors all callsites of SR to use the slightly changed API introduced in the diff below. Really what this means is that you need to include the correct header. Also if you were using `ClientFactory::newFactory` you need to not prefix it with `ClientFactory::`.
```
cd ~/fbsource/fbcode
find ./ -type f -exec sed -i -e 's:#include "servicerouter/client/cpp2/ClientFactory.h":#include "servicerouter/client/cpp2/ServiceRouter.h":' -e 's:#include <servicerouter/client/cpp2/ClientFactory.h>:#include <servicerouter/client/cpp2/ServiceRouter.h>:' -e 's/ClientFactory::newFactory(/newFactory(/g' {} \;
```
Also manually fixed spots that couldn't be done automatically (or broke because they depended on transitive includes).
* Back out "Fix handling of empty batches in SumReduceDimsOp"
Original commit changeset: 282da1730cc2 This commit is blocking the
Github->fbcode sync, which really needs to get merged ASAP. D7881937 which this
diff depends on will be reverted in the sync D7990948 which causes this to
break. The sync diff cannot be patched with this reversion because it must be
landed against base revision 5c8c099 , and D7881937 must not be included in the
sync diff because it is breaking GPU tests that are not available in sandcastle
: https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-cuda8.0-cudnn6-ubuntu16.04-test/3638/console
for one example.
* Add the flow to support operator benchmark
1) generate model with the operator 2) upload to everstore 3) generate model spec into json file 4) start running the benchmark
* [tum][gpu] Connect DPM trainer with flow and unit tests
This diff:
- Fix some small bugs for Yiming's recent changes to parallelizer, so it suits real use cases.
- Add correct tags to the TUM code, so we can do data parallel transform
- pass extra info at instantiation time.
- add unit test for using DPM in TUM model
After this diff, we can do simple box, multi-gpu fully-sync trainer for TUM in Fblearner workflow, but may still need to do speed benchmarking.
* w/o normalized lradaption for adam dense only
The previous lr adaption includes a normalization step when performing the dot product operation. This is not exactly the same as what is proposed in the paper. I add normalization as an option. Without it, the operator performs exactly what the paper proposed. With the option, we add the normalization step.
* [fb] Use SharedPromise in DeferrableAsyncSchedulingNet
This code is to simplify DeferrableAsyncSchedulingNet by removing condition
variable + small fixes
* [tum] implement cuda sparseLengthsMean and LengthsMean
as title
* Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.
Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.
* Move feature_to_index to FeatureSpec.feature_to_index
move feature_to_index to FeatureSpec.feature_to_index to avoid override other fields
* [Caffe2] Rename bytes_moved to bytes_written
Just a rename in preparation for supporting bytes_read.
* [c2] fix ReduceFrontSumOp for empty case by setting 0
otherwise, it may use the results from the last iteration when the batch is empty.
* [Caffe2] [Int8] Improve Intel CPU performance
* [Easy] Improve PrependDim op logging
as titled
* DBFileReader expand db_path using os.path.expanduser(..)
Since there are a lot of possible use cases of `DBFileReader` to read from user home path, like `~/local/sample.db`, I want to save people's trouble of calling `os.path.expanduser(db_path)` themselves.
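For example, a small sketch of the intended convenience:
```python
import os

db_path = "~/local/sample.db"
# DBFileReader can now expand the user home directory itself,
# equivalent to the caller doing:
expanded = os.path.expanduser(db_path)  # e.g. "/home/alice/local/sample.db"
```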
* [Caffe2] Add bytes_read to cost structure
We're adding analytical read bytes to cost functions. This extends the structure accordingly for all CostInference defined operators.
Additionally, some small bug fixes were performed:
1) Cost functions now extract type information of operands instead of assuming float
* Fix sleef on aarch64 for hhvm
@bypass-lint
Rename flag
* Remove duplicated part in caffe2/ideep/operators/conv_op.cc
likely introduced by a sync error
* Rename test helper function test_adagrad_sparse_helper to adagrad_sparse_test_helper to avoid confusing pytest
* Fix various sparse transpose issues; remove dead code from Declarations.yaml.
1) Fixes some checks in t_, transpose_ that don't allow transposing empty sparse tensors.
2) Remove out= variants from docs since they don't exist (and haven't since at least v0.3.1).
3) Unify implementations of t_, transpose_, t, transpose.
4) Move dead checking code from Declarations.cwrap to actual implementations.
5) Fix test which never tested transpose_.
* Add test for error with t, t_.
* Address review comments.
* Fix jit tests.
* Fix test_jit.
* Don't allow requires_grad to be set on integer Tensor constructors in tensor_new.
* Fix autograd test.
* Fix test_distributions.
* Fix test_jit.
* Fix NN tests.
Right now, if we add a zero-filled sparse tensor with another sparse
tensor, both tensors must have the same "density" (dimI, dimV) and size
(tensor.size()) for them to be added successfully. This relaxes that
constraint so that if both tensors have the same tensor.size() and at
least one is zero-filled, they can be added successfully.
Before:
```
i = torch.LongTensor([[0, 1, 1], [2, 0, 2]])
v = torch.FloatTensor([3, 4, 5]).unsqueeze(1)
sparse_mat = torch.sparse.FloatTensor(i, v, torch.Size([2,3,1]))
zeros = torch.zeros(sparse_mat.size(), layout=torch.sparse_coo)
sparse_mat + zeros
RuntimeError: cadd operands have incompatible sizes or dimension types
at ../src/THS/generic/THSTensorMath.c:126
```
After: no error.
Compilers used to report a warning:
caffe2/core/net_async_tracing.cc: In member function 'void caffe2::tracing::Tracer::renameThreads()':
caffe2/core/net_async_tracing.cc:210:32: warning: overflow in implicit constant conversion [-Woverflow]
const long numa_multiplier = 10e9;
This patch fixes it.
* Makes accumulate_grad functions high priority in backwards passes
* Delegating constructor and comments
* Sequence_nr ain't pretty no more
* Sequence_nr ain't pretty no more
* Implemented fused builder based construction mechanism
* "weights" -> "weight"
* Use int64_t instead of size_t everywhere in RNN
* Extracted Conv::ExpandingSize into its own thing
* Rename TORCH_PARAMETER to TORCH_ATTR
* Added documentation
* Fix weight names in batchnorm module
Reference: https://github.com/pytorch/pytorch/issues/7434
* C10D: Added TCPStore to support C10D store interface
* Used pipe to terminate the store daemon and addressed all comments
* Used notify/wake for wait and addressed all comments
* Clean up nits
* Clean up all socket states when the socket is closed
* Adding LBFGS to cpp API
* Adding stop conditions
* Test cases now passing and adding closure to all algs
* Addressing code review
* Set seeds to make optim tests more deterministic
* Reduce gen_jit_dispatch options
This removes the power set of options generated for IntList[k] arguments
in aten_dispatch. Instead, the compiler now performs the broadcast using
schema information. This substantially cuts the compile time for aten_dispatch.cpp
* Make return uniform in lbfgs step
This ensures that we are returning results of the same type
in LBFGS step.
* Adding test case to exercise different exit points
Sets the tolerance_grad to negative infinity and positive
infinity to deterministically exercise the early exit branch
* Fixing lint error
* Fix python3.6 build in caffe2 CI
* Turn off onnx protobuf type stubs generation
* Revert "Turn off onnx protobuf type stubs generation"
This reverts commit 618b80911a316caa69f2d774fb12ae6b24b2a6d6.
Android unit tests failed to link because libnnpack and libcpuinfo appeared in the linker command line before libcaffe2. This patch somehow fixes it.
Fixes #7502.
Test Plan: build and test
Build output has this:
```
-- Checking prototype magma_get_sgeqrf_nb for MAGMA_V2 - True
-- Compiling with MAGMA V2 support
-- MAGMA INCLUDE DIRECTORIES: /data/users/rzou/miniconda3/include
-- MAGMA LIBRARIES: /data/users/rzou/miniconda3/lib/libmagma.a
```
* PyTorch AMD Build Script.
* Python invocation for hipify
* Adding individual hip files.
* Updating CWD
Use the actual path for the file instead of the current working directory, which depends on where the script is invoked.
* Updating folder path for amd_build
* Removing previous amd_build directory
* Updated setup.py to support WITH_ROCM
* Renaming the files for CuDNN BatchNorm & Conv since having two .cpp files with the same name results in a linking error in the HCC compiler used for ROCm/AMD.
* Removing old BatchNorm & Conv files since they've been renamed.
* Updating build path to handle ROCM
* Cleaned up the build path and created a FindHIP cmake file for setting up relevant hip paths.
* Separated the individual patch files to make it easier to detect issues while building.
* Removed CMakeLists hip files and fixed directory structure
* Adding build pytorch amd script
* Merged setup patch into PyTorch setup.py & cleaned a few issues
* Added information on where to download the hipify-python script.
* Resolved linting issues inside of build_pytorch_amd.py
* Removing many unnecessary patch files. Removing unnecessary .hip files. Fixing up the build process.
* Refactored the PR for supporting HIP
* Minimizing the number of changes inside individual patches.
* Cleaned up patch files.
* Removed patch files.
* Updating patches
* Removing HIP change from file.
* Cleaned up patches
* Added AVX/SSE avoidance due to a bug in the ROCm stack. Just temporary for now.
* Removing the other HIP file
* Removed patch file + merged ROCm into Aten/test
* Removed ATen tests patch file and updated disable_features yaml to remove headers that don't exist on the HIP stack.
* Reduced the number of patches down to 14 after Edward's suggestions.
* Transferred deletion of certain functions from patch to yaml file.
* Set default Thrust path
* Fixed aten files so we now use the templated pow/abs instead of std:: directly.
* Removed error from aten/src/THCUNN/Abs.cu
* Updated the locations of the cmake build files. Moved THCTensorRandom from a hip to a patch file. Added executable/library commands that can successfully handle either CUDA or HIP.
* Removed hip extraction from the build script and removed the old hip file.
* Replaced MACRO with function in upper level cmake.
* Added empty ELSE() block to prevent the loading of a command without CUDA or HIP. Also added IF guards around torch_cuda_based_add_executable in Aten tests.
* Updated aten tests.
* Removed the hip include from the ATen header.
* Can't throw exceptions on C++ AMP, using abort
* Missing IF guards for cuda/hip executables in aten tests.
* Removed a series of patch files.
* Added template keyword to help out the HCC compiler.
* Rebased the specific files displayed in the PR
* Fixing typo.
* Change flag from "WITH_CUDA" to "NOT NO_CUDA"
Replacing "WITH_CUDA" with "NOT NO_CUDA" after the rebase.
* Fix LoadHIP path
* Updating build files after rebasing.
* Reorganization after cpu/gpu separation.
* Removed HIPCC from setup.py & removed -shared extra linking args.
* Updated CMake / Setup build to correctly link when under ROCm stack.
* Removed the unnecessary argument from Extension constructor.
* Adding another test to be included with ROCm building.
* Updated the setup_helpers scripts in order to get around linter error
* Fix syntax issue
* Solving lint issue: line too long
Running sccache in foreground mode seems to uniformly slow down the builds and causes virtual memory exhausted errors for gcc7.2 builds. This PR moves sccache to background mode instead and print the compilation log at the end of the build.
* Run onnx integration tests in caffe2 CI
* verbose log
* turn off onnx verbose installation log
* can not install ninja
* Do not use all cores to build pytorch
* install tests require
* pip install to user dir
* use a deterministic path to improve the (s)ccache hit rate
* Do not change path in test.sh
* Add the compile cache hit trick to conda install as well
* cover jenkins in CI environment detection
This PR uses Vec256 to vectorize the softmax and logsoftmax Layers.
This comes in 4 steps:
log_softmax
softmax
log_softmax_backward
softmax_backward
* Vectorized Softmax and LogSoftmax
* Abstractions
* Style
* Remove <limits> for Kernel
* Perf investigations
* Last cleanups
Improve script builtin checking using schema
* This add aten_schema.h which provides a barebones amount of type and
argument information about each builtin operator
* emitBuiltinCall is updated to use this information rather than
aten_dispatch to ensure the operator is correct.
* handling of keyword and position arguments now matches python behavior
* There is no longer a requirement that kwargs be constant or that the
attributes of an op must be entirely constant or non-constant
* compiler now constructs a non-attributed version of the op first and
then turns it into the constant-attribute version if all attributes
are constants.
* default arguments for builtins now work
* SugaredValue::call and similar functions now have SourceRange information
for their arguments so that error reporting is more accurate
Notes:
* This does not try to merge the builtin checking with python arg parser.
Given that we will eventually have C10 schema which will replace aten_schema,
we will eventually have a C++ description of the schema, and working off that
description directly will be the easiest form to understand.
* python function calls and script method calls do not support keyword arguments yet.
When we add this support we should refactor the handling in tryEmitSchema
that resolves keywords into a common function.
* default arguments work
* keyword arguments to builtins work (still need to extend to calling python and other script methods)
* much better error reporting for incorrect builtins
Lift any constants to attributes on nodes when possible
* Schema is usable internally in the compiler as
the function signatures of script functions as well as for builtin
operators.
* Adds a List[T] class to better represent the arguments to cat/stack
as a type rather than with custom checking.
* Support kwargs for calls of script methods
A future commit will be needed to add support for:
* calls to script _functions_, which currently are GraphExecutors without schema info.
* kwargs to python functions, which will require refactoring python op
* fix for #7532: clamping the return value of uniform.cdf() to the range [0,1] (see the sketch after this list)
* removed whitespace around equals to pass flake8 tests
* added a test for uniform.cdf() with arguments outside support
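A minimal sketch of the clamping idea from the first bullet above (illustrative, not necessarily the exact code in the PR):
```python
import torch

def uniform_cdf(value, low, high):
    # CDF of Uniform(low, high); clamp so values outside the support
    # return exactly 0 or 1 instead of negative or >1 results.
    result = (value - low) / (high - low)
    return result.clamp(min=0, max=1)

print(uniform_cdf(torch.tensor([-1.0, 0.5, 2.0]), 0.0, 1.0))
# tensor([0.0000, 0.5000, 1.0000])
```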
This PR makes two improvements:
It fixes reduce kernels where accum type != type. Currently, for example, half tensors with small values may have norms that are (approximately) representable in fp16, but calling .norm() on them will result in underflow and a reported norm of zero. This PR fixes that behavior and adds a test in test_cuda.py to ensure underflow does not occur (test_tiny_half_norm).
It simplifies all reductions by removing excessive templating and the -2 contiguous special case from THC_reduceDim and THC_reduceAll. The latter was previously removed from pointwise apply. This has no performance impact as the -2 special case was already mapping to the 1D code path.
PyTorch currently attempts to handle accum type != type by either (1) writing kernels that immediately convert values to accum type after reading or (2) writing operations that take in type values and accumulate to the accum type. The latter path was not working properly (hence the current excessive half tensor underflow) and resulted in a lot of redundant code, with two reduce ops being passed to a kernel instead of one, and reduce ops frequently receiving the same template argument twice.
This PR makes the former approach THE approach. Kernels that accumulate to (potentially) different types should follow the pattern of converting their input to the accum type, performing all operations on that type, and then converting back to the appropriate type if writing their value back to the tensor. This pattern makes the second reduce op redundant and allows for simpler templating, which should improve readability, reduce build time, and reduce binary size. Also, this prevents ops from having to perform their own conversions, which could result in poor performance if the same value was operated on multiple times.
One exception to this simplification was that a new ThrustTensorDistOp was created to handle a call to thrust::inner_product(). This Op fuses the conversion and the TensorDistOp.
In addition to the expected simplification, there is also some cleanup of excessive template parameters. For example, kernelReduceAllPass2() had three template parameters: T, IndexType, and ReduceOp, but IndexType was never used.
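A small illustration of the underflow case described above (the regression test referenced in the summary is test_tiny_half_norm in test_cuda.py; the numbers here are only illustrative):
```python
import torch

# Many tiny values: each is representable in fp16, and so is the true norm
# (~3.16e-3), but accumulating the sum of squares in fp16 underflows to zero.
x = torch.full((1000,), 1e-4).half().cuda()
print(x.norm())  # with fp32 accumulation this is ~3.16e-3, not 0
```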
* wip
* Adds tests
* Fixes Python linting
* mean and norm fusions, code cleanup
* fixes file permissions
* Built-in support for rebuilding in win-build.sh
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* fixups
Signed-off-by: Jenkins <jenkins@ci.pytorch.org>
* CR comments
* CR comments
* more delayed expansion fixes
* Updates collapseDims() function and documentation
* Adds C++ tests, validates input, updates names for readability
* Removes invalid test
* stashing to merge AT_CHECK macro
* Updates asserts, removes tests on Windows
* Fix advanced indexing with negative indices
Fixes #7156
Here is some behavior before this PR:
```
In[1]:
x = torch.arange(9).view(3, 3).contiguous()
x[[0], [-1]] # Should be equivalent to x[0, -1]
Out[1]:
tensor([ 8])
```
The bug is that negative indices are added to the computed linear index
directly. In the above example, the linear index computed is "-1", which
wraps around to "8", giving the last element of a flattened view of `x`.
Instead, we should wrap negative indices around before adding them to
the linear index.
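A small Python sketch of the fix described above (the actual change is in the C++ indexing code; this only illustrates wrapping negative indices before accumulating the linear index):
```python
def linear_index(indices, sizes):
    # Wrap each (possibly negative) index into [0, size) *before*
    # folding it into the flat linear index.
    idx = 0
    for i, size in zip(indices, sizes):
        if i < 0:
            i += size          # e.g. -1 -> size - 1
        idx = idx * size + i
    return idx

# x[0, -1] on a 3x3 tensor: the correct flat index is 2, not 8
print(linear_index([0, -1], [3, 3]))  # 2
```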
* Use toCLong()
Previously, CUDAGenerator::CUDAGenerator would initialize the random
number generator on the current device. This would usually be device 0.
This is undesirable because initializing the CUDA context allocates a few
100 MBs due to all the kernels in libTHC.so.
This avoids the unnecessary call to THCRandom_getGenerator() in the
CUDAGenerator constructor.
Fixes #7320
* Fix call to get THCState
* Move ONNX integration tests from onnx-fb-universe to PyTorch repo
* Switch to use torchvision
* Delete single rnn operator tests, they have been covered in e2e tests in test_caffe2.py
* Mirror the fix in onnx-fb-universe to bypass cuda check
667326d84b
* this removes the flag controlling whether the interpreter works on variables.
* now the interpreter _always_ works on variables
* constants in the IR are still _always_ non-variables, and an assert was added to ensure this.
* as_tensor was split into as_variable and as_tensor since it is sometimes used
to construct constants in the IR
* I tried changing the IR to also always use variables but that change was much more
cross cutting and fragile and I never got it working
* [bootcamp] Improve "Shape" operator to support axes specification
To improve the .shape operator of Caffe2 to support x.shape(tensor, axes), which takes an optional int array "axes" as input. For example, x.shape(tensor, [1, 0]) will return the dimensions for axes 1 and 0, following the specified order. In the current version, the "axes" input allows duplicates and can have arbitrary length.
* Back out "Add barrier net that runs before training nets"
Original commit changeset: b373fdc9c30f. Need additional changes to some callers to support barrier failures.
* Change warning to verbose log to reduce log spam
The `LOG(WARNING)` was a bit spammy for regular use so lets just make it a `VLOG`.
* Extract the shared code from different caffe2_benchmark binaries
The OSS benchmark and Internal benchmark will share most functions in the benchmark.
* Support MFR in sequence training
As titled.
* Make knowledge distillation work with using logged prediction feature as teacher label.
1) Add loading raw dense feature as teacher label.
2) Optional calibration function for teacher label
3) Add teacher label into generic unit test
4) Deprecated TTSN workflow version using feature_options to config teacher label
* [C2/CUDA]: unjoined cross entropy sigmoid
as desc
* Add async_scheduling executor into deferrable_net_exec_test
Add async_scheduling into tests and fix some exception cases
* Fix Event disabled error
When disabling events in RNN ops, make sure we don't call Finish on a disabled
event from the op's RunAsync.
* cuda ensure cpu output op can handle both TensorCPU and TensorCUDA
as desc.
* [C2 Core] Infer input device option in C2 hypothesis_test checkers
Improve how we default input blob device options.
Previously it defaulted to wherever the op lives, but that is not necessarily the case.
For example:
CopyCPUToGPU
* [C2 Op]SplitByLengthsOp CPU/GPU implementation
[C2 Op]SplitByLengthsOp CPU/GPU implementation
* fix undefined symbol error
not sure why we're getting an undefined symbol even with link_whole = True.
Need to figure out why, but we need this workaround for now.
* Add tools in DAIPlayground platform to help debugging models
Add additional tools to allow Playground to override individual methods defined in AnyExp. This will allow users to create modules that specifically change certain default method behavior. An example included in this diff is deactivating the test model and checkpointing. When debugging model problems, switching off components helps me quickly narrow down the location of the bug. The technique is extensively used in task T27038712 (Steady memory increase in EDPM, eventually resulting in gloo/cuda.cu:34: out of memory)
* add shape and type inference for int8 conversion operator
* Fix flaky test for group_norm
Fix flaky test for group_norm
* Fix group_norm_op_test flaky
Fix group_norm_op_test flaky
* Implementation of composite learning rate policy
In many state-of-the-art deep learning works, people use a simple trick to
schedule the learning rate: use a fixed learning rate until the error plateaus
and then switch to a different fixed learning rate, and so on. In this diff,
we implement a simple version of the composite learning rate. The user gives
a set of learning rate policies and corresponding iteration counts, and the
optimizer will change the learning rate policy based on the number of iterations so far.
For example, the user gives two learning rate policies, FixedLearningRate
and PolyLearningRate, with an iteration number of 1k. Then for the first 1k iterations
we use FixedLearningRate, and for the following iterations we use PolyLearningRate.
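A minimal Python sketch of this "switch policy after N iterations" idea (illustrative only; the actual implementation is the Caffe2 learning-rate operator):
```python
def composite_lr(iteration, policies):
    """policies: list of (num_iters, lr_fn) pairs, applied in order."""
    start = 0
    for num_iters, lr_fn in policies:
        if iteration < start + num_iters:
            return lr_fn(iteration - start)   # local iteration within this policy
        start += num_iters
    # past the last boundary: stay on the final policy
    num_iters, lr_fn = policies[-1]
    return lr_fn(iteration - (start - num_iters))

fixed = lambda it: 0.1                          # FixedLearningRate-style
poly = lambda it: 0.1 * (1 - it / 10000) ** 2   # PolyLearningRate-style decay
print(composite_lr(500, [(1000, fixed), (10000, poly)]))   # 0.1
print(composite_lr(1500, [(1000, fixed), (10000, poly)]))  # poly at local iter 500
```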
* Split two use cases of CachedReader into two classes, DBFileReader and CachedReader
# Use Cases:
1). input: DB file -> output: DatasetReader.
Use DBFileReader.
2). input: Reader -> build cache DB file -> output: DatasetReader.
Use CachedReader.
# Changes to CachedReader:
1). Move db_path to the constructor.
Because in the mock reader, the cache will always be built ahead of time.
# Changes to tests:
1). Make a separate TestCase class for CachedReader and DBFileReader.
2). Make it possible to add more test functions by adding setUp, tearDown and _make_temp_path.
3). Make deleting db_path more general. `db_path` could be a file for `log_file_db`, but could also be a directory for `leveldb`.
* Back out "On Mobile phones, call GlobalInit with no arguments in predictor in case we need to perform initialization"
Original commit changeset: 4489c6133f11
* Fix LARS bug
Fixed a bug in the LARS implementation which caused all subsequent blobs not using LARS to have the LARS learning rate multiplier applied to them.
* [tum] support sparse init & add uniformFill option
as title
* Propagate exception for async nets
Capture the exception when an exception is thrown in async nets and re-throw it after wait(). This allows exceptions to be propagated up to the caller.
This diff was a part of D7752068. We split the diff so that C2 core files changes are in a separate diff.
* Automatic update of fbcode/onnx to 69894f207dfcd72d1e70497d387201cec327efbc
Previous import was 403ccfbd0161c38f0834413d790bad0874afbf9a
Included changes:
- **[69894f2](https://github.com/onnx/onnx/commit/69894f2)**: Use op schema.all tensor types in random like definitions (#865) <Scott McKay>
- **[b9d6b90](https://github.com/onnx/onnx/commit/b9d6b90)**: Clarify random like operators (#846) <Scott McKay>
- **[fc6b5fb](https://github.com/onnx/onnx/commit/fc6b5fb)**: Refactor shape inference implementation (#855) <anderspapitto>
- **[b7d8dc8](https://github.com/onnx/onnx/commit/b7d8dc8)**: fix cmake warning message (#863) <Eric S. Yu>
- **[f585c5d](https://github.com/onnx/onnx/commit/f585c5d)**: add pytorch-operator test for tile (#831) <Wenhao Hu>
- **[993fe70](https://github.com/onnx/onnx/commit/993fe70)**: add install step (#832) <Eric S. Yu>
- **[68bc26c](https://github.com/onnx/onnx/commit/68bc26c)**: add type inference for traditional ml ops except classifier ops. (#857) <Ke Zhang>
- **[9cc0cda](https://github.com/onnx/onnx/commit/9cc0cda)**: fix string representation of scalar types (#858) <G. Ramalingam>
- **[1078925](https://github.com/onnx/onnx/commit/1078925)**: fix y in pow test case to scalar (#852) <Wenhao Hu>
- **[c66fb6f](https://github.com/onnx/onnx/commit/c66fb6f)**: Add some math function shape inference (#845) <anderspapitto>
- **[ff667d1](https://github.com/onnx/onnx/commit/ff667d1)**: Refactor return type and docs for ONNXIFI_BACKEND_DIRECTX_ID (#853) <Marat Dukhan>
- **[11c6876](https://github.com/onnx/onnx/commit/11c6876)**: clear initializer names when clear initializer (#849) <Wenhao Hu>
- **[73c34ae](https://github.com/onnx/onnx/commit/73c34ae)**: Clarify FeatureVectorizer description. (#843) <Scott McKay>
- **[1befb9b](https://github.com/onnx/onnx/commit/1befb9b)**: Remove useless text in docs (#850) <Lu Fang>
- **[e84788f](https://github.com/onnx/onnx/commit/e84788f)**: Fix SELU attributes' default values (#839) <Lu Fang>
- **[ebac046](https://github.com/onnx/onnx/commit/ebac046)**: Add tile test case (#823) <Wenhao Hu>
- **[8b7a925](https://github.com/onnx/onnx/commit/8b7a925)**: a few more shape inference functions (#772) <anderspapitto>
- **[9718f42](https://github.com/onnx/onnx/commit/9718f42)**: Make the coefficient non optional for LinearClassifier (#836) <Jaliya Ekanayake>
- **[ef083d0](https://github.com/onnx/onnx/commit/ef083d0)**: Add save_tensor and load_tensor functions for Protos (#770) <Lu Fang>
- **[45ceb55](https://github.com/onnx/onnx/commit/45ceb55)**: Check if CMAKE_BUILD_TYPE set before project(). (#812) <Sergii Dymchenko>
- **[4b3d2b0](https://github.com/onnx/onnx/commit/4b3d2b0)**: [WIP] reenable shape inference tests (#834) <anderspapitto>
- **[22d17ee](https://github.com/onnx/onnx/commit/22d17ee)**: RNN tests: LSTM, GRU, SimpleRNN (#739) <Peyman Manikashani>
- **[de65b95](https://github.com/onnx/onnx/commit/de65b95)**: dimension denotation (#443) <Tian Jin>
- **[eccc76e](https://github.com/onnx/onnx/commit/eccc76e)**: fix field number issue in onnx operator proto and enable its build (#829) <Ke Zhang>
- **[d582beb](https://github.com/onnx/onnx/commit/d582beb)**: disable shape inference test to unbreak ci (#830) <Lu Fang>
- **[485b787](https://github.com/onnx/onnx/commit/485b787)**: function proto for composite op. (#802) <Ke Zhang>
- **[cd58928](https://github.com/onnx/onnx/commit/cd58928)**: specify defaults for attributes of Affine op (#820) <G. Ramalingam>
- **[7ee2cf9](https://github.com/onnx/onnx/commit/7ee2cf9)**: merge the dummy backend back into the main one (#743) <anderspapitto>
- **[1c03a5a](https://github.com/onnx/onnx/commit/1c03a5a)**: [Proposal] ONNX Interface for Framework Integration (previously ONNX Backend API) header and docs (#551) <Marat Dukhan>
- **[3769a98](https://github.com/onnx/onnx/commit/3769a98)**: Rename real model test case from VGG-16 to ZFNet (#821) <Lu Fang>
* [C2]ReluN Op
relu n op.
tf reference: https://www.tensorflow.org/api_docs/python/tf/nn/relu6
* Call destructor when assigning a blob value
* Add executor overrides
Add executor overrides flag to enable migration to async_scheduling executor
* Add barrier net that runs before training nets - attempt #2
Add a synchronize barrier net that is run before training nets. With this net, shards that are faster will wait for other shards before starting training. This reduces the chance of the faster shards timing out during GLOO AllReduce.
Removed explicit data_parallel_model.py.synchronize call in holmes workflow.
This change was landed previously but caused errors for some EDPM workflows - See https://fb.facebook.com/groups/1426530000692545/permalink/1906766366002237/ - because EDPM assumes any call to CreateOrCloneCommonWorld and Gloo ops are wrapped in exception handlers but in this case exception thrown in the barrier init net is not handled.
To address this issue, we add _CreateOrCloneCommonWorld to the param_init_net instead of a new barrier init net. Since errors in the param_init_net run are handled gracefully with re-rendezvous, this should fix the problem.
* Handle empty nets in async_scheduling
Make sure we don't get stuck on empty nets
* use CUDA_ARCH for conditional compile
* [C2 fix] infer function for ensure_cpu_output_op
* Update group_norm test to reduce flaky test
* Fix lr_multiplier for GPU
The file store implementation is new and based on the file
initialization method (which uses a single file and file locking) and
the interface of the Caffe2 store handler.
See #7434.
When tracing we record expand nodes. This is useful in some cases because
it makes it clear a broadcast happened. However, in future runs
the broadcast may be different or not needed. This change adds an
attribute to expand to track if it was implicitly added. This
takes the form of an unused input to expand with a default value.
The execution engine then removes implicit expands before execution.
Note that shape_analysis will re-add expands when it can prove by
shape analysis that they will exist and this is useful for the fuser,
so this change should not affect fusion passes.
* Split libATen.so into libATen_cpu.so and libATen_cuda.so
Previously, ATen could be built with either CPU-only support, or
CPU/CUDA support, but only via a compile-time flag, requiring
two separate builds. This means that if you have a program which
indirectly uses a CPU-only build of ATen, and a CPU/CUDA-build of
ATen, you're gonna have a bad time. And you might want a CPU-only
build of ATen, because it is 15M (versus the 300M of a CUDA build).
This commit splits libATen.so into two libraries, CPU/CUDA, so
that it's not necessary to do a full rebuild to get CPU-only
support; instead, if you link against libATen_cpu.so only, you
are CPU-only; if you additionally link/dlopen libATen_cuda.so,
this enables CUDA support. This brings ATen's dynamic library
structure more similar to Caffe2's. libATen.so is no more
(this is BC BREAKING)
The general principle for how this works is that we introduce
a *hooks* interface, which introduces a dynamic dispatch indirection
between a call site and implementation site of CUDA functionality,
mediated by a static initialization registry. This means that we can continue
to, for example, lazily initialize CUDA from Context (a core, CPU class) without
having a direct dependency on the CUDA bits. Instead, we look up
in the registry if, e.g., CUDA hooks have been loaded (this loading
process happens at static initialization time), and if they
have been we dynamic dispatch to this class. We similarly use
the hooks interface to handle Variable registration.
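A rough Python sketch of this hooks/registry indirection, purely to show the shape of the pattern (the real mechanism is a C++ registry populated at static-initialization time; all names below are illustrative):
```python
_hooks_registry = {}

def register_hooks(name, factory):
    # In C++ this happens at static initialization time when
    # libATen_cuda.so is loaded; here it is just a dict insert.
    _hooks_registry[name] = factory

class DefaultCUDAHooks:
    def has_cuda(self):
        return False
    def init_cuda(self):
        raise RuntimeError("built/loaded without CUDA support")

def get_cuda_hooks():
    # Call sites in the CPU library go through this indirection and never
    # reference CUDA symbols directly.
    factory = _hooks_registry.get("CUDAHooks", DefaultCUDAHooks)
    return factory()

# When the CUDA library is loaded, it registers its implementation:
class RealCUDAHooks(DefaultCUDAHooks):
    def has_cuda(self):
        return True
    def init_cuda(self):
        print("lazily initializing CUDA state")

register_hooks("CUDAHooks", RealCUDAHooks)
get_cuda_hooks().init_cuda()
```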
We introduce a new invariant: if the backend of a type has not
been initialized (e.g., its library has not been dlopened; for
CUDA, this also includes CUDA initialization), then the Type
pointers in the context registry are NULL. If you access the
registry directly you must maintain this invariant.
There are a few potholes along the way. I document them here:
- Previously, PyTorch maintained a separate registry for variable
types, because no provision for them was made in the Context's
type_registry. Now that we have the hooks mechanism, we can easily
have PyTorch register variables in the main registry. The code
has been refactored accordingly.
- There is a subtle ordering issue between Variable and CUDA.
We permit libATen_cuda.so and PyTorch to be loaded in either
order (in practice, CUDA is always loaded "after" PyTorch, because
it is lazily initialized.) This means that, when CUDA types are
loaded, we must subsequently also initialize their Variable equivalents.
Appropriate hooks were added to VariableHooks to make this possible;
similarly, getVariableHooks() is not referentially transparent, and
will change behavior after Variables are loaded. (This is different
to CUDAHooks, which is "burned in" after you try to initialize CUDA.)
- The cmake is adjusted to separate dependencies into either CPU
or CUDA dependencies. The generator scripts are adjusted to either
generate a file as a CUDA (cuda_file_manager) or CPU file (file_manager).
- I changed all native functions which were CUDA-only (the cudnn functions)
to have dispatches for CUDA only (making it permissible to not specify
all dispatch options.) This uncovered a bug in how we were handling
native functions which dispatch on a Type argument; I introduced a new
self_ty keyword to handle this case. I'm not 100% happy about it
but it fixed my problem.
This also exposed the fact that set_history incompletely handles
heterogeneous return tuples combining Tensor and TensorList. I
swapped this codegen to use flatten() (at the possible cost of
a slight perf regression, since we're allocating another vector now
in this code path).
- thc_state is no longer a public member of Context; use getTHCState() instead
- This PR comes with Registry from Caffe2, for handling static initialization.
I needed to make a bunch of fixes to Registry to make it more portable
- No more ##__VA_ARGS__ token pasting; instead, it is mandatory to pass at
least one argument to the var-args. CUDAHooks and VariableHooks pass a nullary
struct CUDAHooksArgs/VariableHooksArgs to solve the problem. We must get rid of
token pasting because it does not work with MSVC.
- It seems MSVC is not willing to generate code for constructors of template
classes at use sites which cross DLL boundaries. So we explicitly instantiate
the class to get around the problem. This involved tweaks to the boilerplate
generating macros, and also required us to shuffle around namespaces a bit,
because you can't specialize a template unless you are in the same namespace as
the template.
- Insertion of AT_API to appropriate places where the registry must be exported
- We have a general problem which is that on recent Ubuntu distributions,
--as-needed is enabled for shared libraries, which is (cc @apaszke who was
worrying about this in #7160 see also #7160 (comment)). For now, I've hacked
this up in the PR to pass -Wl,--no-as-needed to all of the spots necessary to
make CI work, but a more sustainable solution is to attempt to dlopen
libATen_cuda.so when CUDA functionality is requested.
- The JIT tests somehow manage to try to touch CUDA without loading libATen_cuda.so. So
we pass -Wl,--no-as-needed when linking libATen_cuda.so to _C.so
- There is a very subtle linking issue with lapack, which is solved by making sure libATen_cuda.so links against LAPACK. There's a comment in aten/src/ATen/CMakeLists.txt about this as well as a follow-up bug at #7353
- autogradpp used AT_CUDA_ENABLED directly. We've expunged these uses and added
a few more things to CUDAHooks (getNumGPUs)
- Added manualSeedAll to Generator so that we can invoke it polymorphically (it
only does something different for CUDAGenerator)
- There's a new cuda/CUDAConfig.h header for CUDA-only ifdef macros (AT_CUDNN_ENABLED, most prominently)
- CUDAHooks/VariableHooks structs live in at namespace because Registry's
namespace support is not good enough to handle it otherwise (see Registry
changes above)
- There's some modest moving around of native functions in ReduceOps and
UnaryOps to get the CUDA-only function implementations into separate files, so
they are only compiled into libATen_cuda.so. sspaddmm needed a separate CUDA
function due to object linkage boundaries.
- Some direct uses of native functions in CUDA code has to go away, since these
functions are not exported, so you have to go through the dispatcher
(at::native::empty_like to at::empty_like)
- Code in THC/THCS/THCUNN now properly use THC_API macro instead of TH_API
(which matters now that TH and THC are not in the same library)
- Added code debt in torch/_thnn/utils.py and other THNN parsing code to handle
both TH_API and THC_API
- TensorUtils.h is now properly exported with AT_API
- Dead uses of TH_EXPORTS and co expunged; we now use ATen_cpu_exports and
ATen_cuda_exports (new, in ATenCUDAGeneral.h) consistently
- Fix some incorrect type annotations on _cudnn_rnn_backward, where we didn't
declare a type as possibly undefined when we should have. We didn't catch this
previously because optional annotations are not tested on "pass-through" native
ATen ops (which don't have dispatch). Upstream issue at #7316
- There's a new cmake macro aten_compile_options for applying all of our
per-target compile time options. We use this on the cpu and cuda libraries.
- test/test_cpp_extensions.py can be run directly by invoking in Python,
assuming you've setup your PYTHONPATH setup correctly
- type_from_string does some new funny business to only query for all valid CUDA
types (which causes CUDA initialization) when we see "torch.cuda." in the
requested string
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Last mile libtorch fixes
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* pedantic fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add name() to C++ modules
* Use RTTI to get module name by default
* Add functional.cpp to CMakeLists.txt
* Call typeid() inside name() instead of constructor
* Add tests and use default constructor
In Maratyszcza/NNPACK#140 @daquexian reported an error on Faster-RCNN model with MobileNet V2, when running with NNPACK engine. The error disappears when using the latest NNPACK and cpuinfo. Updating submodules upstream to ensure others don't hit this issue.
* [ONNX] Allow specifying only a subset of input/output names
Then we can only specify the "real" names while ignoring the names for all the parameters
* fix
* Update utils.py
* Replace incorrect usages of "NotImplemented"
Fixes#7266. Replaces "NotImplemented" (which is supposed to be used for
binary ops) with the correct "NotImplementedError".
* Address comments
* Add batched linear solver to torch.gesv()
Fixes#3164
Picks up from #4502
I moved `gesv` to ATen.
Adds bindings for MAGMA's `gesv_batched` function for CUDA.
For CPU, runs `THLapack(gesv)` in a for loop.
The new function supports arbitrary batch dimensions (and broadcasting
of those dimensions). For example, the 4-d tensor `A x B x M x M` should
be treated as having batch-size `(A x B)`.
The overhead of creating the magma_queue_t is: ~350000 microseconds
the first time it's called and ~6 microseconds every time after that.
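A usage sketch, assuming the batched call keeps the existing gesv(B, A) argument order and (solution, LU) tuple return:
```python
import torch

# Batch of 2 x 3 independent 4x4 systems A x = b, solved in one call.
A = torch.randn(2, 3, 4, 4)
b = torch.randn(2, 3, 4, 6)           # 6 right-hand sides per system
x, LU = torch.gesv(b, A)              # x has shape (2, 3, 4, 6)
print((A @ x - b).abs().max())        # should be close to 0
```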
* Tests and docs
* Address comments
* Address comments
* Rebase
* Address comments
* Fix rebase
* Addressed comments
* Address comments
* Address comments
* Addressed comments
This lets aten::expand be differentiable in torchscript. It was probably
omitted from the list by accident in the past b/c gradientForNode does
already support aten::expand.
Also adds a test to check expand and its gradient in a torchscript fn.
* Make ATen buildable without all Caffe2 by root cmake
* Fix typo in aten cmake
* Set BUILD_ATEN from USE_ATEN as compat
* Only set BUILD_ATEN from USE_ATEN when on
* Have USE_GLOO only set when BUILD_CAFFE2
* Generic fuse conv relu pass for nomnigraph
* Use it in NNPACK conversion
* Comments
* Change the postprocess interface to take node instead of conv op
* Pinning conda-numpy to 1.14 to avoid SVD issue
* Adding another leveldb test to conda's ignored tests, removing a mkl-test from this
* Removing commented out section
The schema.Scalar class makes pretty strict assumptions (via its docstring)
on the spec of the shape of its underlying object. Because of idiosyncrasies
of numpy indexing and the use of np.dtype, those assumptions are broken on an
edge case (dtype = (scalar_type, 1)). This corrects the behavior of this
edge case to conform to the spec.
* Rename autograd namespace to torch and change torch.h into python.h
* Pave the way for torch::nn::Module
* Reorganize module code structure
* Undo ONNX update
* Remove sleef submodule
* ENH: add to method for PackedSequence
* ENH: return self if possible
* TST: remove extra data
* DOC: add more explanation
* TST: remove extra data
* DOC: minor fix
Apparently get() is a function of requests, not a module (not sure if in
the past get() used to be a module). Therefore, the syntax in #3280 will
always fail with ImportError, and the requests lib will never be used (kind
of defeating the purpose of that pull request).
Also, if the requests lib is used, we should add the stream=True parameter,
otherwise requests.get() will load the whole response into memory.
* Clarify patience in ReduceLROnPlateau docs
It's unclear which definition of patience we have. The two ways to
interpret it are:
- How many bad epochs can you see before you start considering changing the learning rate.
- How many bad epochs can you see before you change the learning rate.
This PR clarifies the docs with an example. If `patience = 2`, then
after 2 bad epochs, we begin considering changing the learning rate.
After seeing one more epoch (the 3rd epoch), if that epoch is also bad,
then we change the learning rate after it.
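A usage sketch of the clarified behavior:
```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD([torch.randn(2, requires_grad=True)], lr=0.1)
scheduler = ReduceLROnPlateau(optimizer, patience=2)

# With patience=2, the 1st and 2nd bad epochs are tolerated; if the 3rd
# epoch is also bad, the learning rate is reduced after it.
for val_loss in [1.0, 1.0, 1.0, 1.0]:  # baseline, then 3 bad epochs
    scheduler.step(val_loss)
```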
* address comments
* move softmax/logsoftmax to ATen
* specify cpu and gpu accum types
* use accreal for CPU
* expose softmax backward to python, fix legacy interface
* fix Distributions.cu to use common AccumulateType
* fix cuda 8 build
* delete commented out lines
* rebase on master, fix breakages
* Double-dispatch copy.
In order to split ATen's CPU/CUDA code into two separate libraries
which don't require a build flag (AT_CUDA_ENABLED) to separate them,
we need to be able to split source files based on whether or not they
handle CPU functionality only, or also touch CUDA. Copy poses a unique
challenge here, because the naive implementation involves writing
a matrix for all combinations of CPU/GPU in a single file.
This PR splits up Copy.cpp into CPUCopy.cpp and CUDACopy.cpp, respecting
the following matrix:
to\from CPU CUDA
+---------------------------
CPU | CPUCopy.cpp CUDACopy.cpp
CUDA | CUDACopy.cpp CUDACopy.cpp
When you run x.copy_(y) where x is CPU and y is CUDA, we do a second
virtual dispatch to copy_from(y, x) on y's type, so that we can get
from CPUCopy.cpp to CUDACopy.cpp
The new autogenerated code for CPU looks like this:
```
Tensor & CPUByteType::s_copy_(Tensor & dst, const Tensor & src, bool non_blocking) const {
  // code generated by copy_wrapper
  checked_cast_tensor<CPUByteTensor>(dst.pImpl, "dst", 0, false);
  switch (src.type().ID()) {
    case TypeID::CPUByte:
      THByteTensor_copyByte(static_cast<CPUByteTensor*>(dst.pImpl)->tensor, static_cast<CPUByteTensor*>(src.pImpl)->tensor);
      break;
    case TypeID::CPUChar:
      THByteTensor_copyChar(static_cast<CPUByteTensor*>(dst.pImpl)->tensor, static_cast<CPUCharTensor*>(src.pImpl)->tensor);
      break;
    ...
    default:
      return src.type().s_copy_from(src, dst, non_blocking);
```
Notice that the fall through goes to s_copy_from. s_copy_from is like s_copy
but the arguments are reversed.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Lintfix and no-CUDA fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix compilation error.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CR
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Rename autograd namespace to torch and change torch.h into python.h
* Include torch.h instead of python.h in test/cpp/api
* Change some mentions of torch.h to python.h in C++ extensions
* Set paths directly, without find_path
This makes the JIT tracer much more robust, by allowing it to record
dependencies on tensor sizes. For example, if you were to trace this
function
def fn(x):
return x.view(x.size(1), -1)
before this patch, then it would embed the actual value of x.size(1)
in the trace as a constant, making it very hard to have e.g. batch size
independent traces. Now, this will correctly record the dependency, and
will retrieve the size of x at every run.
* Refactor reduce ops to take flexible input types
* Add DISPATCH_FUNCTION macros in common_gpu.h
* Use macros to reduce switch case in dispatching cuda functions
AT_ASSERT is an internal, PyTorch specific error, so we should
give a little more debug information (than with the ordinary
errors.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
```
JIT_ASSERT(v->setUnique(x)->uniqueName() == x);
```
This works by changing any other value in the graph with name x to a
different name. This mirrors llvm behavior and is useful when you
want to ensure some names have particular values.
* Remove stale THD README
* Move common THD dependency into THD/base
The master_worker directory now no longer contains files that are
needed for building other parts of THD.
* [fix] Re-enable events in RNN ops
We have earlier added event disabling in RNN ops as back then we didn't use
events, with current use cases this is no longer true
(https://fburl.com/8vd0lp8y)
* use ops with cuda impl
* Revert D7729695: [caffe2][fix] Re-enable events in RNN ops
This reverts commit 4b215c7496fb724656ff4c776933a15bdbbcde5e
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [observer] Clean up observer_config.h
#accept2ship
* [1/n] Refactor dataio_test.py
Replace code duplication with a common function
* Add barrier net that runs before training nets
Add a synchronize barrier net that is run before training nets. With this net, shards that are faster will wait for other shards before starting training. This reduces the chance of the faster shards timing out during GLOO AllReduce.
Removed explicit data_parallel_model.py.synchronize call in holmes workflow. Similar change in speech/asr_training workflow will come in another diff.
* Support the dnnlowp backend in caffe2_benchmark
This is for SHARE operator latency evaluation
* Migrate integral_image_op to main caffe2
migrate integral_image_op(GPU version) given by https://fburl.com/yvqezigi
to caffe2/caffe2/operators and implement its CPU version. Write up a test
using the hypothesis_test mechanism
* [pos_disc, fbcode] Implement unjoined lr loss
As explained in https://our.intern.facebook.com/intern/wiki/Model_Based_Calibration/, when the dataset is a joined dataset, where labels might change later, we need to use unjoined logloss.
The implementation is almost the same as in Sigrid (https://fburl.com/1trngsls), where
loss = y (log(p) - log(1-p)) + (1-y)(log(1-p)) = xy - (1-y)x - (1-y)log(1+exp(-x))
For x < 0, to ensure stability and avoid overflow, we reformulate the above expression as
loss = xy - (1-y)x + (1-y)x - (1-y)log(1+exp(x)) = xy - (1-y)log(1+exp(x))
Then the final expression becomes
loss = xy + (y - 1) x (x >= 0) - (1 - y) log(1 + exp(x - 2 x (x >= 0)))
where y is the true label, x is the dot product and p = logistic(x).
This kind of implementation is aligned with the current implementation of the original cross entropy in
https://phabricator.intern.facebook.com/diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/cross_entropy_op.cc;0bae3b5d0f825897c5e0dd0ff10f489d7271bf25$7-13
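A direct NumPy transcription of the final expression above (a sketch for illustration, not the Caffe2 operator code):
```python
import numpy as np

def unjoined_lr_loss(x, y):
    # loss = xy + (y-1)*x*[x>=0] - (1-y)*log(1 + exp(x - 2*x*[x>=0]))
    pos = (x >= 0).astype(x.dtype)
    return x * y + (y - 1) * x * pos - (1 - y) * np.log1p(np.exp(x - 2 * x * pos))

x = np.array([-3.0, 0.5, 4.0])   # dot products
y = np.array([0.0, 1.0, 1.0])    # labels
print(unjoined_lr_loss(x, y))
```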
* Keep the array to fix the conflict
* [C2] Compute Adagrad effective LR
The AdagradWithLR op outputs an extra blob which contains the average effective learning rate across all weights in this blob.
* Open-source extractMetaNetDef & runGlobalInitialization, add new Predictor constructor from db file, and add run_map_outputs
1. Open-source extractMetaNetDef and runGlobalInitialization, for use in
2. new Predictor constructor from db file.
3. Add new run function that returns outputs as TensorMap
* Disable eigen cpu
Disable eigen cpu in transpose and reduce
* Introduce request_only/object_only property of ModelLayer
by default this is False
* A simple TC Caffe2 benchmark
We can run tunner, get MappingOptions and then use them to
compare against cuBLAS
currently broken due to LLVM issues. How to run:
hg checkout eec1ab31b59c03b8deded1c755a9abaf8c45be01
add D7401202
add D7434625
add D7506031
add D7540728
buck run @mode/dev-nosan tc/tc/benchmarks_python:caffe2_benchmark
* Move Caffe2 feature_maps_ops to open source
Need feature maps operators in open source project facebookresearch/BlueWhale
* Manually fix the conflicts in channel shuffle op
* Fix the inconsistency between different gh and fbcode
* Skip Adagrad GPU Test (Because some gpu implementation is missing)
* Fix another test to make sure it won't run on gpu when implementation is not available yet
These changes are already handled, either in native functions or via resize specifications in Declarations.cwrap.
The resize_ one is technically not handled, although in TH it is checked if the storage is actually reallocated; this is less strict, but seems okay.
* Add moments op in caffe2
* Use rsqrtf in float for group_norm
* Add docs for default behavior when axes is not provided.
* Update group_norm_op by using Eigen::sqrt on CPU
* Generate code without setup.py for C++ build
* Move code generation to CMake
* Set DEPENDS files correctly
* Fix some errors in codegen
* Fix blank line lint
* Implement torch.as_tensor, similar to numpy.asarray.
torch.as_tensor behaves like torch.tensor except it avoids copies if possible; so also somewhat like tensor.new but without the size overloads.
I didn't add a requires_grad field, because we haven't decided on the semantics such as as_param.
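A small usage sketch of the copy-avoiding behavior:
```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])
t = torch.as_tensor(a)   # shares memory with `a` when dtype/device allow
a[0] = 10.0
print(t[0])              # reflects the change: tensor(10., dtype=torch.float64)

t2 = torch.tensor(a)     # always copies
a[1] = 20.0
print(t2[1])             # still 2.0
```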
* Remove requires_grad for doc.
Enables more warnings in the C++ API build.
Fixed a bunch of things in torch/csrc/.
Mostly taken from c10
* Enable -pedantic for C++ build
* Enable more warnings
* Include CUDA and library headers with -isystem
* Fix sign-promo warning
* Make AT_ASSERT/AT_ERROR non-printf based, other tweaks
- AT_ASSERT/AT_ERROR don't take printf strings anymore; instead,
they take a comma-separated list of things you wanted to print
(bringing it inline with Caffe2's conventions).
Instead of AT_ASSERT(x == 0, "%d is not zero", x)
you write AT_ASSERT(x == 0, x, " is not zero")
This is done by way of a new variadic template at::str(), which
takes a list of arguments and cats their string reps (as per
operator<<) together.
- A bunch of the demangling logic that was in Error.h is now
moved to Error.cpp (better header hygiene.) Also, demangle
has been moved out to its own helper function, and also
a new helper demangle_type (from Caffe2) added.
- A bunch of AT_ASSERT converted into AT_CHECK, to more properly
convey which checks can be caused by user error, and which are
due to logic error in ATen.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CR
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix test failure.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* buildfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* More fixes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* One more fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Try harder
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* initial commit for spectral norm
* fix comment
* edit rst
* fix doc
* remove redundant empty line
* fix nit mistakes in doc
* replace l2normalize with F.normalize
* fix chained `by`
* fix docs
fix typos
add comments related to power iteration and epsilon
update link to the paper
make some comments specific
* fix typo
Right now, the bottleneck test_utils.py tests assume that a user's
python executable is 'python'. This may not be the case especially if
the user has multiple versions of python installed. This PR changes it
so that test_utils.py uses `sys.executable` as the python executable.
* Refactor extractMetaNetDef and runGlobalInitialization into open...
* Fix test by making get output blobs optional
* Update test instead of making output blobs optional
* Dump autogradpp into PyTorch
* Fixed up CMake for autogradpp/C++ API
* Made cereal a submodule
* Change search location of autogradpp's MNIST directory
* Add test_api to CI
* Download MNIST from the internet instead of storing in repo
* Fix warnings
Adds ability to JIT compile C++ extensions from strings
>>> from torch.utils.cpp_extension import load_inline
>>> source = '''
at::Tensor sin_add(at::Tensor x, at::Tensor y) {
return x.sin() + y.sin();
}
'''
>>> module = load_inline(name='inline_extension', cpp_sources=source, functions='sin_add')
Fixes #7012
* Inline JIT C++ Extensions
* jit_compile_sources -> jit_compile
* Split up test into CUDA and non-CUDA parts
* Documentation fixes
* Implement prologue and epilogue generation
* Remove extra newline
* Only create the CUDA source file when cuda_sources is passed
* Add max mode support to EmbeddingBag (a usage sketch follows this list)
* Lint fix
* Fix compilation issue on other platforms
* Rebase + don't waste memory when not in max mode
* Oops, missed a spot
* Fix whitespace from merge
* less precision
* Lower precision to avoid spurious failures
* Minor typo
* Switch to size()
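A hedged sketch of the EmbeddingBag max mode added above; the indices, offsets, and sizes below are made up for illustration:
```
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='max')
input = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])     # two bags: indices [1,2,4,5] and [4,3,2,9]
out = bag(input, offsets)          # per-dimension max over each bag
print(out.shape)                   # torch.Size([2, 3])
```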
* Add full impl of GroupNorm
* Fix comments in math.h
* Remove unused buffers
* Add #include <array> in gpu version
* Remove unused moments_buffer_
* Make inverse std a template.
* Add detailed comments
* Add support for dotted names in CPP Extensions
* Modify tests for cpp extensions
Test that dotted names work
* Py2 fixes
* Make run_test cpp_extensions Win-compatible
Changelist:
- Move *.c to *.cpp
- Change includes of ".c" to ".cpp"
- A bunch of cmake configuration modifying CMAKE_C_FLAGS changed
to CMAKE_CXX_FLAGS or add_compile_options, because if you do CMAKE_C_FLAGS it only applies when you compile C code
- Explicitly cast void* to T* in a number of places
- Delete extern "C" { ... } blocks; instead, properly apply TH_API to everything that should have it (TH_API handles extern "C")
- Stop using stdatomic.h, instead, use <atomic>. This resulted in a bunch of placement-new/delete to be "totally properly correct"
- Refactor of THLongStorageView to not have static constructor methods (since it no longer has a copy/move constructor)
- Documentation about how the TH C interface (and extern C business) works
- Note that THD master_worker mode is dead
- C++ headers in TH libraries are given .hpp suffix, to make it less likely that you'll confuse them with the C-compatible headers (now suffixed .h)
- New function THCStream_stream and THCStream_device to project out fields of THCStream instead of accessing fields directly
- New function THStorage_(retainIfLive), which is equivalent to a retain but only if the refcount is greater than zero.
- In general, I tried to avoid using hpp headers outside of ATen/TH. However, there were a few places where I gave up and depended on the headers for my own sanity. See Note [TH abstraction violation] for all the sites where this occurred. All other sites were refactored to use functions
- Some extra Werror fixes (char* versus const char*)
* Add missing header "caffe2/core/common.h" before "caffe/proto/caffe.pb.h" to provide CAFFE2_API macro.
This only affects the Windows build since CAFFE2_API is only defined for DLL.
* Fix ".pb.h" dependency issue about DLL build.
CAFFE2_API defined in "caffe2/core/common.h" is required by ".pb.h" generated on Windows for DLL build.
We always need to have "#include <caffe2/core/common.h>" before using any proto header.
In this case "caffe2.pb.h" is already included by "context_gpu.h" -> "common_cudnn.h" in the correct order, hence we simply remove a line.
* Enable WERROR in tests
* Also set WERROR=1 for cpp_build in CI
* Enable Werror after the compiler checks
* Remove -DWERROR because its picked up from the env var
* Had to fix some errors in aten/contrib/data
* Allow an uninitialized variable in ReduceOpsKernel.cpp
* Use CUDNN_DATA_UINT8 in cuDNN type string conversion
* Fixes and use target_compile_options
* Fix uninitialized variables in THNN
* Include Python.h earlier in tensor_types.cpp
* Use CUDNN_VERSION 7100 instead of 7000?
* More Python.h includes
* Make switch case in common_subexpression_elimination.cpp exhaustive
* Build with WERROR=0 just to see all the warnings
* Remove some Python includes
* Enable WERROR=1 again
* Bring back switch case default
* Allow `__constant__` values in a ScriptModule to be used as attributes for builtin functions
* Fix bugs in @script loops
1. while loops run shape propagation multiple times until the shapes have converged.
There were two bugs here. (a) First the 'changed' condition was not checking if it actually
changed the output, and instead would mark changed = true if the two inputs were different.
This is incorrect because the output of the block and the input of the block may always have different shapes.
Now it actually checks if it is about to change the output entry that it is writing to.
(b) expand nodes were being inserted into the graph even inside the while loop body. However, if
we iteratively discover that the input shape to one of these expands is actually dynamic, then
it was incorrect to insert the expand in the first place. This changes it so that we only insert expands
after we have converged on the shapes.
2. the way deleteExtraInputs removed loop-carried dependencies was unsafe because it would lookup
Value* elements in the loop body's environment that were previously invalidated when deleteExtraInputs
removes another input to the loop. This changes the way deleteExtraInputs works so that it never has to
read a value out of the loop body's environment to avoid using the invalidated pointers.
* Fix torch.tensor(...) device-type calculation when used with numpy and type inference.
* Fix tensor device type inference as well.
* Better variable type inference: infer cuda-ness only if device is not specified.
Switches the step/direction variable names (steps and directions are flipped
in the current implementation of the two-loop recursion). This change does
not change the numerical output of the program, but should make it easier
to follow.
* Prevent stack overflow on deletion of deep graph
Fixes #5534.
Sometimes one can end up with a very big computation graph of Functions
and Edges. Each std::shared_ptr<Function> contains a list of Edge, and
each Edge contains a std::shared_ptr<Function>. Deleting a
std::shared_ptr<Function> can trigger the recursive deletion of other
std::shared_ptr<Function>'s: this can stack overflow if the graph
is deep enough. Here is an example of such a graph:
shared_ptr<Function> -> Edge -> shared_ptr<Function> -> Edge -> ... -> shared_ptr<Function>
The solution here is to use a custom deleter with each
std::shared_ptr<Function>. The custom deleter keeps track of how many
nested deleters it is in. When this number exceeds the maximum allowed
depth, the Function* to be deleted are accumulated in a per-thread
delete queue and handled by one of the deleters.
Example code that could trigger the overflow (set ``depth`` to something >
100000) is below. I also benchmarked the below code before/after the
changes to see if there are any significant performance differences.
```
import torch

def scope():
    depth = 80000
    x = torch.randn(9, requires_grad=True)
    y = x.clone()
    # build deeply nested computation graph
    for i in range(depth):
        y = y + y * 0.000001

%timeit -n 100 scope()

With changes:
376 ms ± 3.94 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Without changes:
352 ms ± 6.58 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
With the change, the above code is 6.8% slower.
UPDATE: I did some more benchmarking. It looks like it takes 25% more time to free the computation graph in the case of the straight chain graph: https://gist.github.com/zou3519/93cf84d96ae431356ae7f7c1923ef51a
* WIP
* Add custom deleter to PyFunctions created by THPFunction
* Address some comments; pick new value
* Address some more comments
* Add more complicated test; special case the windows depth constant
* Add big warning about averaging to KLDivLoss documentation #6622
Also: An (independent) change in diagonal docstring tensor
formatting.
* Improve note with example
Thank you Richard Zou!
* use log_softmax
* Use Index rather than Long for IntList, so floating-point types convertible to ints fail the parsing.
Basically, our unpackLong code works with floating-point types that are convertible to ints, but this isn't often what you want (because of truncation).
What you actually want is to convert to an index, which will usually find such issues.
I made this the minimal change I could because:
1) I didn't want to change unpackLong because the existing code calls checkLong before unpackLong, so this should be a non-issue most of the time. And fixing this properly requires calling checkLong again, which will slow everything down.
2) An exception above is with IntList, which only checks that 1) it is a tuple or 2) it is a varargs tuple (i.e. torch.ones(1, 2, 3)).
* Fix bug.
* Don't conflict tensor and IntList bindings.
* Change function to be consistent between python 2 and 3.
* Check Index.
* Move IntList overloads in legacy new functions to below Tensor overloads.
* Implement matmul_out and dot_out.
* Fix autograd by only calling _out variants if we have an out ourselves.
* Disallow mismatched types in dot_out.
* Make sure out variant doesn't have a method.
* Do proper type conversion.
* Enhance diagonal
This patch
- adds Tensor.diagonal to complement torch.diagonal
- implements diagonal natively in ATen
- makes diagonal a view
- implements taking arbitrary diagonals
- implements diagonal backward instead of referring
to the (more limited) diag
* add tests, copy diagonal code to backward for double differentiability
* improve tests and doc comment. Thank you, Adam!
* Mark diagonal as view function in gen_autograd.py, use simple backward.
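A small sketch of the enhanced diagonal described above, showing the view semantics and an offset diagonal (values made up):
```
import torch

x = torch.arange(9.).reshape(3, 3)
d = x.diagonal()            # main diagonal: tensor([0., 4., 8.])
u = x.diagonal(offset=1)    # first super-diagonal: tensor([1., 5.])
d[0] = 100.                 # diagonal is a view, so this writes into x
print(x[0, 0])              # tensor(100.)
```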
* Workaround in onnx to get transposes into init_nets
This adds a pass to ONNX so that it can speculate Transpose
operators so that ONNX's split pass can put them into an init_net
Also fixes a potential bug in onnx peephole where an optimization
across blocks might move a Value and violate scoping.
* Perform shape propagation when embedding a program into a trace.
This ensures the trace still has type information specific to that trace, which will help onnx export succeed in more cases.
* onnx export aten::repeat to Tile
* move repeats to input
* turn repeats to a long tensor constant
* deal with case that len of repeats bigger than number of dims in input
DEPTHWISE_3x3 engine provides an optimized implementation of depthwise 3x3 convolution, e.g. for ShuffleNet, MobileNets
Implementations exist for CPU (generic), ARM CPU, and CUDA GPU.
Originally developed by @ajtulloch
* Refactor standard_gamma and implement CUDA gamma sampling
* Attempt fixes for AT_CUDA_ENABLED changes
* Gamma cuda and cpu forward as ATen native
* implement standard_gamma_grad_cuda
* update native_test.cpp, try to fix windows and various cuda version compiles
* searching a windows fix via CI... use std:: for math
* casting some constants in the calculation, compute at float for half precision
* whitespace fixes
* add acctype to do half->float computation, include HALF in generation, cast locally rather than tensors
* fix cuda8 half compilation
* always use scalar_cast with CUDACC, lock CPU generator, CPU acctype = double. Thank you for your review comments!
* Added ReLU unit to LP pooling, so the gradient does not become NAN if all inputs are zero.
* Added workaround for odd p. Added a bit of doc.
* Make the linter happy.
* Changes incorrect "overlappingIndices" call to correct "maybeOverlappingIndices"
THE PROBLEM
The current overlappingIndices() is meant to detect if a tensor defines multiple valid indices for the same data element. There are two significant issues with this function:
(1) The algorithm it attempts to implement cannot do this.
(2) That algorithm is not implemented correctly.
This call is used by pointwiseApply() and scatter(). If a tensor is readable/writable and detected as overlapped these algorithms will create a non-overlapped copy of it to work on. When tensors are improperly identified as overlapped this causes extra work. If tensors are improperly identified as non-overlapped then this would cause the operations to exhibit unexpected behavior.
For example,
ref = torch.arange(0, 32 * 5).view(4, 8, 5).cuda().double()
p = ref[:,:,::2]
p += 1
Results in a call to pointwiseApply1, which detects p as an overlapped tensor (it is not), causing a call to pointwiseApply2 that copies it into a non-overlapped temporary, and then another call to pointwiseApply2 later that copies it back to the original tensor. If, however, the original tensor is given dimensions of (4, 8, 4), instead, it is correctly detected as non-overlapped and only a single pointwiseApply1 call is made.
DISCUSSION + FIX
The algorithm that overlappingIndices() attempts to implement tests for a sufficient but not necessary condition of a tensor to be non-overlapping. That is, if its algorithm were implemented properly then it would be a conservative check that would ensure all overlapped tensors were copied (as desired), but also that some non-overlapped tensors were copied too.
The algorithm can be thought of as trying to test whether the dimensions can be ordered like "nesting dolls," with each dimension fitting within the next one larger than it. If this is true then the tensor is non-overlapping, but if it's false the tensor may or may not be overlapped. For example, a tensor with dims (2, 3) and strides (4, 3) cannot be "nested," but is non-overlapping. (The tensor looks like [[0, 3, 6], [4, 7, 10]].)
The algorithm is currently implemented improperly, as can be seen in the example above. The tensor p has dimensions [4, 8, 3] and strides [40, 5, 2]. This confuses the current implementation, which thinks the innermost dimension needs a stride of 6, which is incorrect. The first row is [0, 2, 4] and the next row begins with 5. The current implementation also improperly implemented its sorting behavior. (qsort comparators require -1, 0, and 1, not true/false return values.)
Fixing the existing algorithm is straightforward (and what this PR does, see below), but it is important to note that the algorithm never performed as intended, so its name and the documentation around it has been updated, too. A natural question is if it's possible to write an efficient overlappingIndices(), and I believe the answer is "no." Disambiguating overlapping from non-overlapping tensors is equivalent to finding a nonzero solution to a linear diophantine equation with restricted coefficients, that is, an equation of the form x_0s_0 + x_1s_1 ... = 0 where s_X is the stride in dimension X and x_X is an integer from [-size_X + 1, size_X - 1].
Another note is that the CPU does not perform this check. For example, if we run:
a = torch.FloatTensor([[0,1], [10, 11]])
b = torch.FloatTensor([[0,0],[0,0]])
b = b.set_(a.storage(), storage_offset=0, size=a.size(), stride=(1,1))
b += 1
Then b is [[1, 3], [3, 11]] because the operation is applied twice to the second element of the original tensor. This causes no warning.
Since the CPU does not perform a similar check, another question is whether the GPU code should remove its check. While it may seem that writing to overlapping tensors is an error state, running test_cuda.py reveals 171 instances of possibly overlapped tensors being copied by pointwiseApply(). (The prior incorrect version has 176 copies.) Allowing writing to overlapped tensors on the GPU may violate assumptions about memory accesses, too. In fairness, these assumptions may be violated on the CPU already.
Leaving the CPU vs GPU behavior question for the future, this fix corrects the current intended GPU behavior. This means that there will be fewer unnecessary copies and no chance of an overlapped tensor sneaking through on the GPU. The CPU behavior remains unchanged. The fix also adds a test to test_cuda.py to ensure that overlapped tensors on the GPU are written to as expected.
* cleanup
* Fixes Python formatting
* Make cuda 9 behave as cuda 8 wrt half conversions
CUDA 9 is too smart about implicit half conversions; this disables them so that CUDA 8 and CUDA 9 behave in the same way wrt half.
* try fixing windows build
* one more broken conversion
* Statically linking CUDA for Anaconda builds
* typo
* Adding a summary line
* Comments
* Typo fix
* Fix faulty parameter passing
* Removing problem CUDA modules for now
* Fixing unused debugging function
* Turning off static cuda linking until script changes are in
* Disabling mkl
THC had a concept of per-device per-stream scratch space that was
persistent in THCState. This was useful before the caching allocator
because it avoided synchronizations in kernels that needed temporary
scratch space. However, it's not thread-safe since multiple threads can
operate on the same stream: In a two-pass reduction the scratch space
may get clobbered in between the two kernels.
This removes the scratch space and just uses THCudaMalloc and THCudaFree
within the reductions.
I've kept THCState_getCurrentDeviceScratchSpaceSize for now since it's
useful to have the temporary buffer be sized based on the number of SMs.
ATen can be configured to compile without CUDA support by passing
-DNO_CUDA=1 to cmake. However, cmake will look for CuDNN independently
of that flag and may eventually find it. In cases where compilation
without CUDA support was requested on a system with CUDA installed, this
will result in linking errors while building some tests that rely only
on CuDNN being found.
Do not look for CuDNN if -DNO_CUDA=1 was provided in the cmake call
since it does not make sense to compile with CuDNN if CUDA support was
disabled.
* add threshold for ops using omp macro
* modify interface for ops using omp macro
* modify some thresholds
* implement C macros with optional parameters to avoid duplicating definitions for all pointwise operations
* add a parameter of LAB_IMPLEMENT_BASIC_FUNCTION for vectorizing
* modify the comment
* Revert "add a parameter of LAB_IMPLEMENT_BASIC_FUNCTION for vectorizing"
Modify macro LAB_IMPLEMENT_VECTORIZED_FUNCTION to enable optional parameters
This reverts commit 8ef783a0cc67b653c435e64a3beb6866a6b4216d.
Conflicts:
aten/src/TH/generic/THTensorMath.c
* fix build error on windows
* retrigger the test
The long-term fix is to remove the handling-creating pathways and
remove all the modes from PythonOp making it into an op that simply
calls a PyObject. Right now ONNX expects PythonOp to hold a
nn.Function, not a generic callable, so completely removing the legacy
pathway will also require changes to how ONNX symbolics are found.
* [jit][script] Fix a bug combining sizes/unsized tensors
This add an isSubtypeOf method to reflect that sized tensors are a subtype
of Dynamic[Tensors]. It updates the typechecking code to reflect this
relationship.
* Add index_select to shape prop
* Speed up printing of large tensors.
Instead of deciding on the format based on all of the elements of the tensor, decide based on the elements that will actually be printed.
* Fix flake8.
* Add else case.
Sebastian Messmer noticed that these iterators were writeable by
default, which seemed dangerous. Replaced with const iterators.
This doesn't seem to affect any ATen code; seems reasonable enough.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- ATen repo now has a new top-level, so Travis script has
to be adjusted to (1) be moved to the top-level and (2)
cd into the aten directory before doing anything.
- Unfortunately, this makes the import script even slower,
because I'm banging on the entire index every commit. If
anyone has better suggestions for how to twiddle the index, I'm open to them.
One possibility is to fold the ATen build into the base
.travis.yml but only activate it when a file is missing
(and then filter out that file.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Track checkpoint performance in scuba
As title.
* [C2/CUDA]: fix cross entropy sigmoid with logits
when adding log_d_trick, I forgot to add it to the cuda impl; this diff fixes
it.
* Back out "[caffe2] Unregister MKL fallbacks for NCHW conversions"
Original commit changeset: 8918dd40205a
Will land after @jongsoo's diff https://phabricator.intern.facebook.com/D7596315 lands
* [Easy][C2] Don't add blob to external outputs from output_record if it's already external output
As desc.
* On Mobile phones, call GlobalInit with no arguments in predictor in case we need to perform initialization
FACEBOOK:
The QPL logger needs the initialization code. In the past, the initialization code was put in the pipeline calling Caffe2. However, those places become obsolete quickly, as the product teams change places to call Caffe2 from time to time. We also need to track which teams use Caffe2 so that we can put the initialization code there.
With this diff, the initialization code is put in the predictor constructor, only enabled for mobile phones. This way, we can always enable QPL logging.
Once we do this, we can check how many times Caffe2 inference is called in production, and which models are more popular in production. This way, we can prioritize our effort supporting those models.
Will clean up the old code calling the init in the product in a separate diff.
* add padding op for sparse length tensor
to pad length-based sparse tensor with padding_value
* Add conv_op with cudaconvnet engine
Add conv_op with cudaconvnet engine
* [numa] Fix simple NUMA copy benchmark
Move XavierFill into init_net and also compute BW
* call roundf (device function) instead of round (host function)
* [caffe2_benchmark][observer] Make caffe2_benchmark use its own observer
1. Add ClearGlobalNetObservers()
2. Make caffe2_benchmark use its own observer and observer_reporter
* [detectron] Use roundf instead of round in the detectron module ops
* allow K larger than number of elements in top k op
one use case is to use this op together with PackSegments for sparse tensors, where the number of elements in each slice is not statically defined.
* add ChannelShuffle DNNLOWP op
* fixup math_cpu.cc break
* Support list and tuple literals: Adds support for [a, b], (a, b) and "a, "
* Allow non-tensors to reach emitBuiltinCall, each SugaredValue::call
is now responsible for checking the types of its inputs.
Add support for calling cat with a tuple to emitBuiltinOp
This PR makes it so that the collect_env.py tests ignore the least significant (most minor)
number of most version strings. It also bumps the version up to 0.5.0a
to fix the CI.
Reopening #6606 with fix for TEST_CUDA import issue on Windows and improvement to how we wait for manager exit in test_manager_unclean_exit. Loop tested on the Windows CI multiple times to make sure this actually fixes the CUDA OOM issue.
* Terminate dataloader workers properly when parent process is SIGKILL'ed
* Wait for worker processes to finish before shutting down manager process
* Add test for checking proper worker exit
* cosmetic change
* Test only if CUDA exists
* Don't call multiprocessing.set_start_method() in Python 2
* import TEST_CUDA only when we are in __main__
* Tune JOIN_TIMEOUT
* handle os.getppid() == 0 case
* Reset to original JOIN_TIMEOUT
* Use WaitForSingleObject() to check parent process status on Windows
* Fix TEST_CUDA import
* clean up
* Check main process only when index_queue.get() times out
* Change index_queues to multiprocessing.Queue
* Move manager checking logic to watchdog class
* Fix bugs in dataloader
* Fix TEST_CUDA import issue
* Don't import TEST_CUDA from common_nn
* Use event to signal manager exit in test
* fix lint
* Add comments
* Add environment collection script
Fixes #6111.
them a script to collect system environment information.
Changes include:
- Refactor out the environment collecting code from utils.bottleneck
- Add script (collect_env.py)
- Cleaned up the issues template so that it suggests using the script
and is more readable.
Testing: added expect tests to go with 4 CI configurations. Whenever one
of these configurations gets updated, the test will fail until the test
also gets updated.
* Expect tests
* Update issue template
* Fix random space
* Minor improvement to issue template; fix expect test
* Skip expect test if BUILD_ENVIRONMENT not found; test fix; split off smoke/expect test
Previously we would see errors like:
variable 'states' previously has type (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor) but is now being assigned to a value of type (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor):
since the default case in the diagnostic printout was "Tensor". This adds a virtual member function to each Type class that returns a human-readable string for better error reporting
* Improve error reporting for tuple type mismatch
* Add better Tensor printout
* Fix performance regression on simple cases of indexing
Dispatches to the old kernels
* Adapt JIT test
The test was expected to fail, but due to the change in the previous diff, it would now dispatch to index_select, which succeeds. I modified the function to go through the advanced indexing codepath
* Only do checks once, properly AutoNoGil, AutoGPU.
* Fix cross device indexing for more than 1 cuda device.
Cross device indexing is attempted from ATen, which doesn't work well because ATen doesn't have AutoGPU, etc.
Instead, before dispatching to ATen we do type conversion on the indices; it would probably be better if we
pushed all this down to ATen, but that will take some work.
* Small cleanup.
Fixes#6759.
Before, `tensor.chunk(0)` would cause a divide by 0.
`tensor.chunk(-1)` would throw an error complaining that "split_size
needs to be positive".
This PR changes it so that the error message makes it clear that
`chunks` has to be greater than 0.
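A quick illustration of the behavior this fixes (the tensor values are made up; the exact wording of the error is the part this PR improves):
```
import torch

x = torch.arange(6)
print(x.chunk(3))      # (tensor([0, 1]), tensor([2, 3]), tensor([4, 5]))
try:
    x.chunk(0)         # now raises instead of dividing by zero
except RuntimeError as e:
    print(e)           # message says `chunks` must be greater than 0
```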
* Add version counter to module, change load_state_dict to use load_local_state_dict which does class specific loading
* Clarifies version number in docs
* fix jit tests
* fix state_dict tests
* typo
* fix ddp
* exclude version numbers from state dict entries
* Fix jit test and empty modules
* address comments
* test for "."
* revert the private version change in state_dict
* make IN case a hard error
* fix not reporting error when unexpected submodule
* address comments
* disallow empty string in name and remove trailing dot
We allow variables defined inside of if statements to be defined after
if statements as long as they will be defined unconditionally. This
supports a larger subset of python programs than we supported before.
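A hedged sketch, assuming the @torch.jit.script decorator, of the pattern this enables: a variable assigned in both branches of an if, then used after it.
```
import torch

@torch.jit.script
def pick(x):
    if bool(x.sum() > 0):
        y = x * 2
    else:
        y = x - 1
    # y is defined unconditionally, so it may be used after the if
    return y

print(pick(torch.ones(3)))
```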
* More factory functions
Changes:
- Added the remaining factory and factory-like functions
- Better argument reuse via string templates
- Link under torch.rst's Creation Ops to the randomized creation ops
* Add double tick around False
* fix flake8
* Fix False
* Clarify comment: hopefully it is clearer now
* Eliminate handle_zero_dim when broadcasting is applied earlier.
This ends up not actually doing anything unless all the broadcasted tensors are scalars,
which ends up with inconsistent behavior in that case only, because the type promotion rules are different.
This is better solved with real type promotion logic.
* Change type of script comparison to long.
* Fix jit tests.
* Fix cpp jit test by being consistent about long-vs-float.
* Consistent float and long.
* Use int64_t rather than long.
Issue: "python3 test_cuda.py" currently results in a failure when using Volta hardware.
The failure is in test_advancedindex, and is caused by two "sub-tests." At line 4651 a series of indices are used to compare PyTorch's and Numpy's indexing behavior. At least two of these indices index the same element of the reference tensor multiple times. These are:
[slice(None), [[2]], [[0, 3], [4, 4]]]
[slice(None), [[0, 1], [1, 0]], [[2, 3], [3, 0]]]
The first index selects the 5th element of the third row twice, and the
second index selects the 4th element of the second row twice.
This causes the test to attempt to update the same index with two distinct values simultaneously. On my machine the Numpy created tensor will always take the "latter" of these two values, while the Volta tensor will always take the "former." (Not to say this behavior is guaranteed by either framework.)
The fix is to remove these two indices from test_torch.py. This causes all tests to pass.
While updating test_torch.py I also noticed that assert_get_eq(tensor, indexer) had a bug where it was referring to "reference" instead of "tensor." This bug had no impact on behavior. The fix is to have this function refer to its input tensor, "tensor," instead. All tests still pass after this fix.
* Sort declarations when generating Python bindings
This helps resolve ambiguities in argument parsing according to
any rules we will need.
For now, this allows us to make scalar operations more conservative
wrt. argument types, but makes them commutative again.
* Fix inconsistencies between mod with tensor and scalar
* Fix a stupid mistake
* Terminate dataloader workers properly when parent process is SIGKILL'ed
* Wait for worker processes to finish before shutting down manager process
* Add test for checking proper worker exit
* cosmetic change
* Test only if CUDA exists
* Don't call multiprocessing.set_start_method() in Python 2
* import TEST_CUDA only when we are in __main__
* Tune JOIN_TIMEOUT
* handle os.getppid() == 0 case
* Reset to original JOIN_TIMEOUT
* Use WaitForSingleObject() to check parent process status on Windows
* Fix TEST_CUDA import
* clean up
* Check main process only when index_queue.get() times out
* Change index_queues to multiprocessing.Queue
* Move manager checking logic to watchdog class
* Fix bugs in dataloader
* Fix TEST_CUDA import issue
* Create FileBaton to synchronize distributed JIT C++ extension builds
* Move FileBaton to its own file
* Autoformat code
* Respect verbose flag in cpp_extension._prepare_ldflags
ARM64 clang from Android NDK doesn't define __ARM_NEON__, which results in a perf regression on some models. I figured that some compilers define __ARM_NEON__ while others define __ARM_NEON. This patch changes all NEON-specific parts in Caffe2 to check both macros.
* Caffe2: Enhance test for CollectAndDistributeOp
This also changes the operator and the test to use stable sort
otherwise the test will fail due to differences between the op
and the test when facing ROIs of the same score.
* Caffe2: Adjust comparator to make std::nth_element and std::sort stable
Revert the removal of std::nth_element and std::sort and adding of
std::stable_sort.
* Add mutex to THC random number generator
* Add test for CUDA RNG multithread
* fix lint
* Rename gen_state to state and remove unnecessary mutex lock
* Remove RNG test from cpp_extensions
* Add CUDA RNG test to libtorch
* Build test_rng only if CUDA exists
* Move test to aten/src/ATen/test/
* Separate ATen build and test, and run ATen test in CI test phase
* Don't test ATen in ASAN build
* Fix bug in ATen scalar_test
* Fix bug in ATen native_test
* Add FIXME to some CUDA tests in scalar_tensor_test
* Valgrind doesn't work well with CUDA, seed the CPU and CUDA RNG separately instead
* Fix LSTM and GRU parameters description
* Fix previous layer time to t-1 as reviewed
* Replace 'the first layer' to 'at time 0' per review suggestion
* start at generic trilinear
* Implement einsum (fixes #1889; a usage sketch follows this list)
This provides a simple implementation of einsum. It is built on
top of the work for computing bilinear (#6110).
It uses a naive left-to-right resolution at the moment.
Autograd is able to differentiate by itself.
The obvious unsupported feature is taking diagonals (einsum('ii->i', (a,))).
* add tests and docs
* fix flake8
* clean diff
* rebase on current master to resolve conflicting String wrapping
* clean up after rebase
* better commentary in einsum and sumproduct_pair
* don't say fixme if it's fixed and rename num_outputs to num_output_dims
* adapt python wrapper to use std::string instead of String to avoid typedef at::String
* typos and some vector to array conversion
* fix accidental python<->python3 change
* really fix bad rebase
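A hedged usage sketch of the new torch.einsum, here spelling a batched matrix multiply with the operands passed as a tuple as in the example above (shapes are arbitrary):
```
import torch

a = torch.randn(3, 4, 5)
b = torch.randn(3, 5, 6)
c = torch.einsum('bij,bjk->bik', (a, b))   # batched matmul via Einstein summation
print(torch.allclose(c, torch.bmm(a, b)))  # True (up to floating point)
```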
* [GanH][Easy]: Add assertion to adaptive weighting layer
0 weight causes numeric instability and exploding ne
* [Easy] Add cast op before computing norm in diagnose options
As LpNorm only takes floats we add a manual casting here.
* Introduce a new caching device allocator
`cudaMalloc` and `cudaFree` calls are slow, and become slower the
more GPUs there are. Essentially, they grab a host-wide (not device-wide) lock
because GPU memory is transparently shared across all GPUs. Normally, this
isn't much of a concern since workloads allocate memory upfront, and reuse it
during later computation.
However, under some computation models (specifically, memory conserving
approaches like checkpoint-and-recompute, see
https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9)
this assumption is no longer true. In these situations, `cudaMalloc` and
`cudaFree` are common and frequent. Furthermore, in data parallel contexts,
these calls happen at nearly the same time from all GPUs worsening lock
contention.
A common solution to this problem is to add a custom allocator. In fact,
nVIDIA provides one out of the box: CUB, which Caffe2 already supports.
Unfortunately, the CUB allocator suffers from very high fragmentation. This is
primarily because it is a "buddy" allocator which neither splits nor merges
free cached blocks. Study
https://github.com/NVlabs/cub/blob/1.8.0/cub/util_allocator.cuh#L357 if you
want to convince yourself.
This diff adapts a caching allocator from the Torch codebase
https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp
which does splitting and merging and ends up working really well, at least for
workloads like the checkpoint-and-recompute computation models noted above.
I simplified the implementation a little bit, made it a bit more C++-like. I
also removed a bunch of stream synchronization primitives for this diff. I
plan to add them back in subsequent diffs.
* Report reader progress in fblearner workflows
Integrate with fblearner progress reporting API and add support to report training progress from reader nodes.
If the reader is constructed with batch limits, report based on finished batches vs total batches. The finished batch count may be more than the total batch count because we evaluate whether we should stop processing every time we dequeue a split.
If no limit for the reader, report based on finished splits (Hive files) vs total splits. This is fairly accurate.
* [GanH][Diagnose]: fix plotting
1. ganh diagnose needs to set plot options
2. the modifier's blob name, which is used for the metric field, needs to be fixed before
generating the net
* Automatic update of fbcode/onnx to 985af3f5a0f7e7d29bc0ee6b13047e7ead9c90c8
* Make CompositeReader stops as soon as one reader finishes
Previously, CompositeReader called all readers before stopping. This resulted in a flaky test since the last batch may be read by different threads, resulting in dropped data.
* [dper] make sure loss is not nan
as desc.
* [rosetta2] [mobile-vision] Option to export NHWC order for RoIWarp/RoIAlign
Thanks for finding this @stzpz and @wangyanghan. Looks like NHWC is more
optimized. For OCR though it doesn't yet help since NHWC uses more mem b/w but
will soon become important.
* Intra-op parallel FC operator
Intra-op parallel FC operator
* [C2 Proto] extra info in device option
passing extra information in device option
design doc: https://fb.quip.com/yAiuAXkRXZGx
* Unregister MKL fallbacks for NCHW conversions
* Tracing for more executors
Modified Tracer to work with other executors and add more tracing
* Remove ShiftActivationDevices()
* Check for blob entry iff it is present
When processing the placeholders ops, ignore if the blob is not present in the blob_to_device.
* Internalize use of eigen tensor
Move use of eigen tensor out of the header file so we don't get template partial specialization errors when building other libraries.
* feature importance for transformed features.
* - Fix unused parameter warnings
The changes in this diff comment out unused parameters.
This will allow us to enable -Wunused-parameter as an error.
#accept2ship
* add opencv dependencies to caffe2
The video input op requires additional opencv packages. This is to add them to
cmake so that it can build
* Add clip_by_value option in gradient clipping
Add clip_by_value option in gradient clipping
when the value is bigger than max or smaller than min, do the clip
* std::round compat
* Check for --noprefix option for mpiexec
--noprefix option to mpiexec is not part of the MPI standard.
It is needed in certain configurations when using OpenMPI but not
supported with other MPI implementations such as MPICH and maybe
others. This commit adds a check if the option is supported by
the current mpiexec. Also this commit fixes Issue #4965 and MPI
tests can be enabled in the CI.
Fixes: #4965
* Update run_test.py
* Codemod to update our codebase to 0.4 standard
* Update some of the test scripts
* remove Variable in test_clip_grad_value
* fix _symbolic_override_wrapper_maker
* Scope variables inside the dataloader
This clears up the memory consumed by batches inside the dataloader. It's pretty useful for long-living data loaders.
* Update dataloader.py
Changes:
- Deleted docs for old constructor. Add link to new `torch.tensor` ctor
- Add docs for `torch.tensor`
- Add some info on dtypes to the top of `tensors.rst`.
This adds the ability to trace script functions while preserving their
control flow. When the trace encounters a script function it inlines
the graph of the function into the trace rather than tracing the
function itself.
Introducing two updates.
1. Add param to He initialization scheme in torch.nn.init
Problem solved:
The function calculate_gain can take an argument to specify the type of non-linearity used. However, it wasn't possible to pass this argument directly to the He / Kaiming weight initialization function.
2. Add util to clip gradient value in torch.nn.utils.clip_grad
Problem solved:
DL libraries typically provide users with easy access to functions for clipping the gradients both using the norm and a fixed value. However, the utils clip_grad.py only had a function to clip the gradient norm.
* add param to He initialization scheme in torch.nn.init
* add util to clip gradient value in torch/nn/utils/clip_grad.py
* update doc in torch.nn.utils.clip_grad
* update and add test for torch.nn.utils.clip_grad
* update function signature in torch.nn.utils.clip_grad to match suffix_ convention
* ensure backward compatibility in torch.nn.utils.clip_grad
* remove DeprecationWarning in torch.nn.utils.clip_grad
* extend test and implementation of torch.nn.utils.clip_grad
* update test and implementation torch.nn.utils.clip_grad
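A hedged sketch of the new value-based gradient clipping util (named with the suffix_ convention mentioned above; the model and threshold below are made up):
```
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_value_

model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).sum()
loss.backward()
clip_grad_value_(model.parameters(), clip_value=0.1)   # clamp each grad element to [-0.1, 0.1]
print(max(p.grad.abs().max().item() for p in model.parameters()))  # <= 0.1
```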
* Add device docs; match constructor parameter names with attribute names.
* Use double quotes for strings.
* Update printing.
* Separate device ordinal-only construction into a separate note.
* Use current device.
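A short sketch of the device constructions the docs above cover, including the ordinal-only form:
```
import torch

d1 = torch.device('cuda:0')     # string form
d2 = torch.device('cuda', 0)    # type + ordinal form, same device
d3 = torch.device(0)            # ordinal-only construction, treated as a cuda ordinal
t = torch.zeros(2, device='cpu')
print(d1 == d2, d3, t.device)
```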
* Explicitly define all caffe2 reducer ops by name instead of string concatenating them
Explicitly define all caffe2 reducer ops by name instead of string concatenating them.
* Use recursion to make the equal() function compatible with C++11.
* Trivial change.
* Trivial change.
* Trivial change to force the flaky build system to rebuild.
* Trivial change to force the flaky build system to rebuild.
* Trivial change to force the flaky build system to rebuild.
* Trivial change to force the flaky build system to rebuild.
* Trivial change to force the flaky build system to rebuild.
* Addressed @dzhulgakov's comments.
* Addressed @dzhulgakov's comments.
* Trivial change to force the flaky build system to rebuild.
* Trivial change to force the flaky build system to rebuild.
* Add dtypes (with reasonable defaults) to sum, prod, cumsum, cumprod.
This adds optional dtypes to torch.sum, torch.prod, torch.cumsum, torch.cumprod.
By default, the dtype is torch.float64 for integral types, and the dtype of the input for floating point types.
* Don't use optional<ScalarType>, because the jit can't handle it yet.
Instead, we manually build the overloads. This is fairly painful because of default arguments, but should be easy to pull out once the jit can handle optional<ScalarType>.
* Fix keepdim with out parameters.
* Fix _cudnn_rnn_flatten_weight.
* If dtype is provided to an out function, make sure it matches the dtype of the result.
* Fix typo.
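A minimal sketch of the optional dtype argument on reductions added above (input values are arbitrary):
```
import torch

x = torch.arange(5)                        # integral input
print(x.sum(dtype=torch.float32))          # accumulate and return as float32
print(x.cumsum(0, dtype=torch.float64))    # same idea for cumulative ops
```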
* Update docs for torch.zeros factory method
If this looks good, I'll submit another PR rewriting the other factory
methods in this fashion.
* Address comments
* Better explanation for device default
* Add variable argument back
* s/set/sequence/g
* Remove class from torch.strided
This modifies the registration process so that all script methods
in a ScriptModule are defined at once.
Method gains a `method_creator` callback that gets invoked when the
method is first called to define it if it has not already been defined.
Recursive cycles in this `method_creator` are checked.
This approach was chosen over first creating all the graphs and then
inlining the call sites because it will combine better with type
propagation for non-tensor types like tuples. e.g.
```
a = foo(b)
return bar(*a)
```
Fixes #5748.
Added an unsafe version so embedding isn't slowed.
* Create safe and unsafe versions of sparse_coo_tensor
* rename sparse_coo_tensor_unsafe to _sparse_coo_tensor_unsafe
* refactor
* make helper static inline
* add sparse size check test
* fix lint
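A hedged sketch of the safe sparse_coo_tensor path referenced above; the indices and values are made up:
```
import torch

indices = torch.tensor([[0, 1, 1],
                        [2, 0, 2]])
values = torch.tensor([3., 4., 5.])
s = torch.sparse_coo_tensor(indices, values, (2, 3))   # safe path checks indices against the size
print(s.to_dense())
```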
The current implementation of bilinear uses a matrix multiplication approach. This creates a large intermediate matrix (batch * output dimension * input dimension). Relative to the previous pure python approach, this caused severe performance regression (600ms vs. 18ms for 300x100x200 weights and a batch of 50 on CPU, and also quadratic memory).
The attached change restores the performance using the previous strategy of looping over output features. It implements forward, backward, and double backward as native ATen code.
Credits:
Martin Tutek reported the regression and pinpointed the problem
Adam Paszke patiently answered my questions about ATen
I would not have been able to prepare this without you, thank you!
I referenced the old python implementation, used a python version of the naive implementation, and coded manual functions etc.
The tests have gradgradcheck etc.
* fix memory use of native bilinear
* bilinear double backward
* Move bilinear_double_backward to Functions.cpp
Addresses review comment by Tongzhou Wang. Thank you!
* add WrapDimUtilsMulti.h
* start at generic trilinear
* move to generic trilinear
* catch up on dim_list_to_bitset
* switch bilinear to use _trilinear implement _trilinear_backward
* add comments to Linear.cpp, move _trilinear in yaml
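A small usage sketch of bilinear with the shapes from the benchmark above (the data is random; only the shapes matter):
```
import torch
import torch.nn.functional as F

x1 = torch.randn(50, 100)
x2 = torch.randn(50, 200)
weight = torch.randn(300, 100, 200)      # (out_features, in1_features, in2_features)
bias = torch.randn(300)
out = F.bilinear(x1, x2, weight, bias)   # out[b, o] = x1[b] @ weight[o] @ x2[b] + bias[o]
print(out.shape)                         # torch.Size([50, 300])
```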
* Improve run_test.py to support running individual test classes and methods
Added support in run_test.py for running individual test classes and methods.
The -i/--include option can specify a list of test modules, classes or methods
like this:
python run_test.py -i autograd torch.TestTorch.test_abs \
torch.TestTorch.test_add utils.TestBottleneck
-f, -l and -x behaviour stays the same as before
* Fixed some code formatting
* Multiple fixes according to the reviews in #6344
* Split set_default_tensor_type(dtype) into set_default_dtype(dtype).
* Fix flake8.
The difference between this one and set_default_tensor_type is that it only sets the scalar type: what determines the type + device of a tensor returned from a factory function with defaults is the default tensor type + the current device (if the default tensor type is cuda). This just changes the scalar type of the default tensor type.
We do eventually want to deprecate set_default_tensor_type; it is not clear how to do that in a sensible and backwards compatible way.
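A minimal sketch of the split described above, where set_default_dtype changes only the scalar type of newly created floating-point tensors:
```
import torch

torch.set_default_dtype(torch.float64)
print(torch.empty(2).dtype)              # torch.float64; device still follows the default tensor type
torch.set_default_dtype(torch.float32)   # restore the usual default
```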
* Switch JIT passes to take a graph rather than TracingState
* Add pybind11 binding for ONNX pass from graph
* Fix canonicalize pass
* address comment
* Switch ToONNX to explicitly return new graph
* optimize_graph instead of optimize_trace
* Allow tuples to be re-assigned
This commit improves our support of tuples by making them more first-class.
In particular, it allows tuples to be re-assigned across loops and ifs.
It does this by making them first-class values in the Graph IR, and then
removing the tuples in a LowerTuples pass.
An alternative approach would have added more support for desugaring tuples
in the Environment object as they were emitted. Instead,
the current approach was chosen anticipating a future when tuples are
fully supported (including the interpreter). In that future, the current
code can be completely reused with the LowerTuples pass just becoming
an optimization that removes unneeded tuple allocations.
* More precise digamma
Fixes #6190.
This is a rebase of #3955 with some tweaks for better performance around
poles. The code is ported over from cephes with permission.
By itself, the cephes code returns inf for the poles.
For better performance around the poles with float32, one intermediate
step is always computed with double precision, regardless of dtype.
This step does `PI / tan(PI * input)`. This is necessary because small (1e-6)
rounding errors for the inputs to tan have strong effects on the output
(ie, the derivative of tan is very large at some points).
* Replace usages of finite-differences digamma with newly implemented digamma
* Better behavior near and at poles
* ScalarConvert -> scalar_cast for readability
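A tiny usage sketch of digamma; the inputs are chosen so one value sits close to the pole at -1, where the extra precision described above matters:
```
import torch

x = torch.tensor([0.5, 1.0, 5.0, -0.999])   # last value is near the pole at -1
print(torch.digamma(x))
print(torch.digamma(torch.tensor(1.0)))     # minus the Euler-Mascheroni constant, about -0.5772
```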
* Adding integrated pytorch-caffe2 package
* Updates
* Fixing more substitution
* Fix to pytorch build location
* Bugfixes, progress towards including CUDA libs in package
* Fix to sed call
* Putting off packaing CUDA libs for Caffe2
* Progress towards packaging CUDA libs
* Progress towards packaging CUDA libs
* Changes to CUDA copying
* Turning on CUDA lib packaging
* Correction to env variables passed into meta.yaml
* typo
* Adding more needed variables in build.sh
* Adding some debugging info
* Changing versioning to have dates and be in build string
* Removing version from build string
* Removing packaging CUDA logic for static linking (later)
* Changing version to mirror pytorch
* Removing env variable req in build.sh
* Change to sed to port to mac
Caffe2-NNPACK integration created blobs for precomputed kernel transforms based on the name of the Conv operator.
When Conv operators have the same name (e.g. empty string), the blobs for precomputed transforms get the same name and overwrite each other.
This patch ensures that blobs for all precomputed transforms in the network get a unique name.
* Separate cuda-ness from dtype.
There are no longer torch.cuda.int64, etc; only torch.int64 that correspond to at::ScalarType.
At the python arg parser level, the corresponding ATen type is selected from the combination of (ScalarType, Layout, Device).
There is also currently unused code in here for support ScalarType in native_functions; this will be used for specifying aggregate types
on reduction functions.
* Fix test_autograd.
* Add defaults to randint_like.
* Track is_cuda in py tensor types.
* Fix test_sparse.
* Fix multiprocessing.
* Fix rnn.
* Fix test_nn.
* Fix flake8.
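A quick sketch of the separation described above: the dtype no longer carries cuda-ness; the device does.
```
import torch

t = torch.zeros(3, dtype=torch.int64)              # torch.cuda.int64 no longer exists
print(t.dtype, t.is_cuda)                          # torch.int64 False
if torch.cuda.is_available():
    g = torch.zeros(3, dtype=torch.int64, device='cuda')
    print(g.dtype, g.is_cuda)                      # torch.int64 True -- same dtype, cuda device
```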
* Fixes to the way script handles multiple values, and other minor fixes.
This commit improves our handling of operators that return multiple values.
Builtins are now checked so that they return the right number of values,
and support for TupleValue is extended to all things that can return
multiple values.
This resolves issues where the compiler accepted things like:
a, b = c + c
This would cause the interpreter to crash. Now each operator knows
how many results it will produce and can check it against the number
of requested inputs.
Notes:
* Allow True/False literals in constant expressions
* make handling of keyword constants more consistent to support True/False
* make parsing constants match the way we construct constants from python
* improve the error messages when accessing bad graph attributes.
* switch findTensorOp to return an optional.
* check that attribute types are correct in findTensorOp
* Check the correct number of outputs for builtins
This also changes emitExpr to return a single SugaredValue
Rather than possibly returning multiple values, emitExpr now
always returns a single value, which _might_ be a tuple. This approach
more closely follows python making the code easier to follow.
Checks for returning the right number of values are now located in
the assignment operator, and occur when unpacking the tuple.
We still pass `n_binders` to function calls so that calls into python
know how many values they should return.
* Update ReduceMean
* Add reduce mean to math
* Update cuda flag
* Update Eigen::Tensor ctor
* Remove unused variables
* Skip ReduceTensorGPUTest if no gpus
* Add NOMINMAX for windows
* Fix lpnorm_op in windows
* Add openmp support for Windows
* Remove pthread from dependency list
* Revert "Add openmp support for Windows"
This reverts commit f234c124ba2b47746e197bc185c083737fee6e65.
* Don't link with msvc openmp libs
* Add support to TensorRT
* Removed License header
* Bind input/output by position
* Comments
* More comments
* Add benchmark
* Add warning for performance degradation on large batch
* Address comments
* comments
* added randint function in ATEN yaml as well as Tensorfactories.cpp
* corrected randint
* randint with overloading complete,getting tuple of ints behaviour though
* done randintlike and randint_out
Left: adding docs and tests, and removing the bug on size = (5)
* Removed my error messages, ThRandomTensor will handle all exceptions
* added docs and tests, corrected a mistake
Tested with manual seeds in some test cases as well. Seems fine to me (check documentation though)
* corrected indentation to spaces, and improved sizes argument description
* made documentation argument description shorter
* added whitespace after ',' in torch docs
* addes spaces in documentation
* added more tests (including bounds and overloading features)
* added whitespaces in test_torch
* removed trailing whitespaces
* removed whitespace from a blank line
* removed positive requirement from docs. Added dtype argument and gave eg
* made randint over randn in all files
* changed to data type for dtype in docs for randint
* added autofunction entry for randint in torch.rst
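A hedged sketch of the randint usage added above (shapes and bounds are arbitrary):
```
import torch

a = torch.randint(0, 10, (2, 3))       # integers drawn uniformly from [0, 10)
b = torch.randint(5, (4,))             # low defaults to 0
c = torch.randint_like(a, 3)           # same shape/dtype as a, values in [0, 3)
print(a.shape, b.shape, c.shape)
```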
* Better warnings
* Remove -Wc++14-extensions because gcc does not know it
* Warning fix in input_buffer.cpp
* Remove pedantic for torch/csrc/
* Also use Wextra and Wall for ATen
* Use check_env_flag
* Undo changes in shape_analysis.cpp
* Remove C linkage flag
* fix unit test for sqrt op
From the error logging:
[idx, grad, grad_estimate] are:
[[ 146. 0.5 0.45776367]
[ 147. 0.5 0.45776367]
The gradient == 0.5 is correct, which means the SqrtOp and its gradient are doing the right job. (Because y = sqrt(x), loss = y^2/2 = x/2, and then d(loss)/dx = 1/2 = 0.5; )
The test failed because of numerical problem of grad_estimate (in unit test). It can be because the step_size is small, and float precision is not high (when there are multiple elements in the tensor, we do sum(y^2) to compute loss)
This diff
- increase the step size, and also move the test cases to be further away from 0 (where sqrt(x) is not well defined) to be safe :)
- also clean up, and merge the test case for inplace Vs. non-inplace
Tested with:
`CAFFE2_HYPOTHESIS_PROFILE=debug ai_bt caffe2/caffe2/python/operator_test:elementwise_ops_test -- "test_sqrt"`
* CompositeReader & CompositeReaderBuilder
A new type of reader gluing multiple readers together.
* Back out "Revert D7394363: [GanH]: Log D Trick for Cross Entropy with Sigmoid"
Original commit changeset: 9325a4356dbe
* [dai][WIP] convert params to int8 on ps before sending to trainer
Add float->uint8 conversion in addition to float->fp16 conversion in model_saver.
* [easy] improve unit test for sparse length sum ops
as desc.
#accept2ship
* Update GitHub upstream to 771fcb3455cbfe69c2abcc4cb3bd7ef92d59af24
* move sparse hash unique ops to OOS and add unit tests
- move the SparseHash version to OOS, since 'sparsehash' is already deps of caffe2 OOS: https://fburl.com/arssw4n1
- The 'SparseHash' engine is also being used in OOS, so the SparseHash version shall be in OOS to reduce confusion: https://fburl.com/o5ea7ah2
- fix the CUDA UniqueOp for the case when batch is empty.
- add unit test
* group_norm_op for caffe2
This is the cuda op for Group Normalization (GN): https://arxiv.org/abs/1803.08494
This code implements GN in one op that computes Y=gamma * (X-mu) / sigma + beta and also its gradients. It is expected to have minimal memory consumption (similar to the BN op), without creating new blobs if GN were implemented as several ops (e.g., reshape, norm_mean/std, affine_channel).
* Resubmit D7405233: disappeared in D7464958
OOS publish causes the op missing -- however, test was still there
* [c2] add sparse hash engine for cuda unique op
The SparseHash version of UniqueOp copy input tensor to CPU, and make use of sparse hash map to get unique output, and then copy back to GPU.
* [dper][gpu] enable unit testing gpu trainer for sparse nn
to debug the GPU trainer using mock data in unit test.
make it easier to develop GPU trainer for new models.
* Reuse Gloo context for Synchronize() calls
Previously we were creating (and leaking) the Gloo context on each call to Synchronize(). Now only run the common world op and create the barrier net once, then run the barrier net on each Synchronize() call. Since timeout is associated with the Gloo context, assert that the timeout is fixed instead of trying to handle the complexity of multiple timeouts (and associated contexts).
* [GanH/WGAN][1/n]: add FC param clipping
as titled
* [mobile] minimizing changes between caffe2_benchmark and speed_benchmark
* [GanH]: enable diagnose within model
avoid finding blob names but to directly enable inside the model
* Add `net_transformer_fun` option to DPM
This callback allows for various transformations to be made to the
model after gradient operators have been added. The immediate motivation for
this is to allow transformations such has "checkpoint-and-recompute" which
allow trading off memory for additional compute.
Adding several callbacks like this has made DPM's API less than ideal at this
stage. However, I could not find any reasonable alternative.
* [DT] [33/n] Compile flow task groups
Task groups need to be compiled in order to pickle the object in fblearner. However, I also changed the Job's compile function, as creating a new object is not necessary.
* Initial commit for sparse_normalize vectorization and benchmark
* [GanH]: LB Calibration for JSD
as titled
* Tracing event in async executor
Adding event tracing through TRACE_EVENT macro in async executor
* [Resubmit] D7409751 Reseting book-keeping blobs when the reservoir is reset
D7409751 got lost in D7464958
* Visualizing realtime weights values
We want to visualize the weight values as the optimizer iterates. This diff supports visualizing the weights at an assigned index.
Currently, we assume the blob to be 2 dimensional.
* [GanH][Easy]: Fix Homotopy Weighting
Apparently, there was a bug in the homotopy weight (alpha, beta) update
* [c2] move sparse hash unique op out of oss
so that oss do not need to depend on google hash map.
* Get rid of std::round as it's not supported on Android
* Revert changes on setup.py
* Skip shaky test on Dataio
* fix
* change irfft signal_sizes arg to be the last
* add docs for fft, ifft, rfft, irfft; update doc for stft
* fix typo in window function docs
* improve gradcheck error message
* implement backward of fft, ifft, rfft, irfft
* add grad tests for fft, ifft, rfft, irfft
* fix nits and typos from #6118
* address comments
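A hedged sketch, assuming the signal_ndim-style torch.fft/torch.ifft API these commits document, where complex values are stored in a trailing dimension of size 2:
```
import torch

x = torch.randn(4, 5, 2)               # last dim holds (real, imag) pairs
y = torch.fft(x, signal_ndim=2)        # 2-D complex-to-complex FFT
x_rec = torch.ifft(y, signal_ndim=2)   # inverse transform
print(torch.allclose(x, x_rec, atol=1e-5))
```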
* Autograd container for trading compute for memory
* add a unit test for checkpoint
* address comments
* address review comments
* adding some docs for the checkpoint api
* more comments
* more comments
* repro bug
* Fix a subtle bug/apply some review comments
* Update checkpoint.py
* Run everything in grad mode
* fix flake and chunk=1
* use imperative backward as per discussion
* remove Variable and also add models and test for models
* Add a simple thread local variable to check for autograd grad mode
* remove models and models test after debugging
* address review comments
* address more comments
* address more comments
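For reference, a minimal sketch of how the resulting checkpoint API is typically used (module and shapes are illustrative):
```
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
x = torch.randn(4, 8, requires_grad=True)
y = checkpoint(block, x)   # activations inside `block` are recomputed during backward
y.sum().backward()
```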
Part of #5738. Warns users that they're not viewing the latest stable
release docs.
We should remember to delete this when cutting out 0.4.0 release docs. (we'd just delete the div in pytorch.github.io)
* Unit test for pack_padded tracing
* Move monkeypatching stuff
* Switch symbolic
* Fix stack traces and update test
* Fixup and confirm e2e working
* lint
* Move monkeypatch back to onnx
* Address comments
* remove extraneous import
* Add gradient checking
* lint
* Address comments
* improve test case
* fix fft when any of the input dimensions is not like complex type; add test for ifft+fft
* clarify the comments
* Address comments: add note; add helper function
* use at::nullopt
* add notes on conjugate symmetry; fix complex-to-real cloning condition (should be advanced data layout rather than base_istride)
* add at::sum_intlist and at::prod_intlist
* revert optional<vector> helper due to windows compiler error
* Something that works
* Tuple sugared value
* Works with commenting out input size check
* support string frontend
* Initial starred assignment
* Fix parser
* Fixup tests
* clang-format
* fix rebase error
* lint
* move star assign test to string frontend to make py2 happy
* Py2 fix: parse starargs from Call node
* Address some comments
* Fixup merge
* Remove overloaded unary operators
* Bugfix and test case
* Address a few more comments
* asValues -> asTuple
* Remove unrolledFor stuff
* Fixup getValues
* Pass CallsiteDescriptor struct and have different behavior for different call types
* Address comments and lint
* some type checks
* Address comments
* lint
* Fix mistake
Fixes #6312.
Changed bottleneck's arg parser to use argparse.REMAINDER. This lets
the user specify args as `python -m torch.utils.bottleneck script.py
[args]` (previously, a -- was needed after `bottleneck` and before
`script.py`).
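A minimal sketch of the argparse.REMAINDER pattern described above (the argument names here are illustrative, not bottleneck's actual ones):
```
import argparse

parser = argparse.ArgumentParser(prog='bottleneck')
parser.add_argument('scriptfile', help='path to the script to profile')
parser.add_argument('args', nargs=argparse.REMAINDER,
                    help='arguments passed through to the profiled script')

ns = parser.parse_args(['script.py', '--lr', '0.1', 'positional'])
print(ns.scriptfile, ns.args)   # script.py ['--lr', '0.1', 'positional']
```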
* Check mappings ONNX -> Caffe2 bear the same argument names
When adding an extra arg to an input ONNX op, if it's not supported in Caffe2, the exporter would just silently pass it to NetDef and ignore it in the implementation. That's pretty error-prone. Caffe2 also has an OpSchema description, so we can enforce that all arguments either appear explicitly in the schema or are listed explicitly in Caffe2.
See also https://github.com/caffe2/caffe2/pull/2478
Add test for C2 argument checking
* Some operators do not log arguments, which prevents argument checks.
Invite users to file an issue to fix the schema.
* Change Same as input type deduction to work for ops with multiple outputs
* Change the InferBlobShapesAndTypes definition to take a vector of pointers instead of unique_ptrs. The function doesn't own the objects, so there is no need to pass smart pointers; doing so also prevented calling the function with an existing object, since the caller had to create a unique_ptr, i.e. copy an existing object just to create the pointer.
* Switch the order of std::move<unique_ptr> and unique_ptr.get.
* Add a comma.
Caffe2 started with an option to use NNPACK pre-installed in the system.
Now this option is mostly legacy, as Caffe2 can include NNPACK in its own build on all platforms.
Due to problems when a pre-installed NNPACK is built with different dependencies or compiler options, we decided to remove this option and always build NNPACK with Caffe2.
This change makes Caffe2 always build NNPACK as part of its own build, and updates NNPACK and cpuinfo submodules.
* Add string-style devices to all tensors.
Previously, tensors only had a 'get_device' method, which would throw an exception on a CPU tensor. This made it necessary to add if/else branches to code that
was meant to be device agnostic.
This PR implements the following:
1) Adds a 'device' property to all tensors that returns a string representation of the device for all tensors.
For cpu tensors this is 'cpu'. For cuda tensors this is 'cuda:X', where X is the cuda device ordinal.
2) Adds a DeviceSpec class. This is just a helper class for separating device_type and device_index specification and to allow partial specification.
For example, you can call DeviceSpec('cuda'), DeviceSpec('cuda:0'), DeviceSpec('cuda', 1).
Also has backwards compatibility support for specifying integers, which are treated as cuda devices.
DeviceSpecs have the following properties:
a) device_type: string representation of the device type (i.e. 'cpu' or 'cuda')
b) device_index: integer for the device index (None if not specified)
c) cuda_device_index: for backwards compatibility; behaves roughly like `get_device` did previously. I.e. if a function previously took integers for cuda devices,
it can now take DeviceSpecs (or strings), and can maintain the old functionality by calling `old_index = DeviceSpec(old).cuda_device_index`.
3) tensor methods and torch. functions that took integer devices can now take integers, strings, or DeviceSpecs. For example:
torch.randn((2,3), dtype=torch.cuda.float32, device='cuda:1')
TODO in future PRs:
A) Split out cuda from dtype so you don't need to overspecify cuda-ness
B) We currently only support strings/DeviceSpecs in tensor methods and torch. functions. We should have equivalents torch.cuda.device(...), torch.cuda.device_of, etc.
at the torch. level that work on strings/DeviceSpecs
* Add deviceInt64 to python arg parser.
* device_str.
* Remove device_str.
* remove device prefix from attributes.
* Use const char * instead of string.
* Move autogpu index out of Device.
* comment on is_default.
* Rename torch.DeviceSpec to torch.device.
* comment.
* Fix tests.
* Fix flake8.
* Fix sparse_coo_tensor parameter name.
* Improve error message.
* Remove device_ prefix from C++ device object.
* Allocate static strings.
* Return not implemented from rich compare.
* Move torch::Device to THPDevice.
* Remove cuda index.
* Py_RETURN_NOTIMPLEMENTED doesn't exist in python2.
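A small sketch of the device API as it ended up after the rename to torch.device (values are illustrative):
```
import torch

d = torch.device('cuda', 1)          # device type plus index
print(d.type, d.index)               # cuda 1
t = torch.zeros(2, 3, device='cpu')
print(t.device)                      # device(type='cpu')
```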
* Revert "Add __constants__ to Script modules (#6092)"
This reverts commit 5ab30eedf33c670514685838423371f9a5df80f3.
* Revert "[ready] Implement log2 and log10 in PyTorch (#6272)"
This reverts commit 0aa35780bfade6bf9c428f1ae45426caa8a7df93.
* Revert "Use reshape({-1}) (#6281)"
This reverts commit 8ae67a444506a838e648aa60f9eb6a4da22c9b06.
* Revert "Move instruction set specific code to anonymous namespace (#6314)"
This reverts commit 6953c1b77efe2d0764ca9ba7dbf7c9284d68a80c.
* Revert "[auto] Update onnx to 54be8fa - Use cmake3 if it's available (#718) 54be8fad1e"
This reverts commit d33ec12d1e3f4739e10cacf1436764bc54ff89a3.
* Revert "default build with MKL for desktop (#6266)"
This reverts commit 5dcf7078c689f7055ca6837e67ca834cc70d6497.
* Revert "Increase # of runs for CPU perf test, and increase margin of error (#6302)"
This reverts commit 9d1a660670d55590cdab5509bb81c26e8bb3d26a.
Like `__slots__` the `__constants__` property changes the set/getattr behavior of a script module for the keys listed so they behave as constants.
This enables script methods to use them in ways that are otherwise not allowed.
* Python numbers/bools can be inlined as constants in script code.
* List of numbers can be iterated over using for loops
* nn.ModuleLists can be used in for loops as well, unrolling their content.
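A minimal sketch of the `__constants__` behavior, shown with the present-day torch.jit.script entry point rather than the ScriptModule subclassing of this change (names are illustrative):
```
import torch
import torch.nn as nn

class Scale(nn.Module):
    __constants__ = ['factor']        # `factor` is inlined as a constant in script code

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        return x * self.factor

m = torch.jit.script(Scale(2.0))
print(m(torch.ones(3)))               # tensor([2., 2., 2.])
```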
* Implemented log2 and log10
* Re-add incorrectly removed files
* Fix minor bugs
* Fix log1p docs
* Add a try-except for python2 math module in log2 test
* Revert changes made to aten/doc/*
* Fix docstring errors
* Fix windows build
The vec256 and SIMD kernels are compiled multiple times with different
headers. It's important that these functions have internal linkage so
that kernels for different architectures don't get combined during
linking. It's sufficient to label free functions "static", but class methods
must be placed in an unnamed namespace to have internal linkage (since static
means something different in the context of classes).
This fixes a bug in which the implementations of Reduction::reduce_all
for different instruction sets were getting combined during linking.
* Remove ATen's copy of FindCUDA
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Minor bugfix for updated FindCUDA.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Use cl.exe as the host compiler even when clcache.exe is set.
Upstream merge request at https://gitlab.kitware.com/cmake/cmake/merge_requests/1933
H/t peterjc123 who contributed the original version of this patch.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Include CMakeInitializeConfigs polyfill from ATen.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Tweak the regex so it actually works on Windows.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* remove patch
* check that cuda dev environment is also present before running cpp_extension cuda tests
* add OSError to list of exceptions when c++filt is not found
When no explicit hidden state is provided, a default is created by
constructing a new Variable filled with zeros. This gets traced as a
Constant operator, which hardcodes the batch size.
To fix this, we remove such constant operators in an 'optimization'
pass. We could have also fixed it by causing the code to not generate
a Constant in the first place, but this is the least invasive fix from
the perspective of the pure pytorch codebase.
* Update FindCUDA to cmake master as of 561238bb6f07a5ab31293928bd98f6f8911d8bc1
NB: I DID have to apply one local patch; it's the `include_guard` change. Should
be obvious next time you do an update.
Relevant commits:
commit 23119366e9d4e56e13c1fdec9dbff5e8f8c55ee5
Author: Edward Z. Yang <ezyang@fb.com>
Date: Wed Mar 28 11:33:56 2018 -0400
FindCUDA: Make nvcc configurable via CUDA_NVCC_EXECUTABLE env var
This is useful if, for example, you want ccache to be used
for nvcc. With the current behavior, cmake always picks up
/usr/local/cuda/bin/nvcc, even if there is a ccache nvcc
stub in the PATH. Allowing for CUDA_NVCC_EXECUTABLE lets
us work around the problem.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
commit e743fc8e9137692232f0220ac901f5a15cbd62cf
Author: Henry Fredrick Schreiner <henry.fredrick.schreiner@cern.ch>
Date: Thu Mar 15 15:30:50 2018 +0100
FindCUDA/select_compute_arch: Add support for CUDA as a language
Even though this is an internal module, we can still prepare it to
be used in another public-facing module outside of `FindCUDA`.
Issue: #16586
commit 193082a3c803a6418f0f1b5976dc34a91cf30805
Author: luz.paz <luzpaz@users.noreply.github.com>
Date: Thu Feb 8 06:27:21 2018 -0500
MAINT: Misc. typos
Found via `codespell -q 3 -I ../cmake-whitelist.txt`.
commit 9f74aaeb7d6649241c4a478410e87d092c462960
Author: Brad King <brad.king@kitware.com>
Date: Tue Jan 30 08:18:11 2018 -0500
FindCUDA: Fix regression in per-config flags
Changes in commit 48f7e2d300 (Unhardcode the CMAKE_CONFIGURATION_TYPES
values, 2017-11-27) accidentally left `CUDA_configuration_types`
undefined, but this is used in a few places to handle per-config flags.
Restore it.
Fixes: #17671
commit d91b2d9158cbe5d65bfcc8f7512503d7f226ad91
Author: luz.paz <luzpaz@users.noreply.github.com>
Date: Wed Jan 10 12:34:14 2018 -0500
MAINT: Misc. typos
Found via `codespell`
commit d08f3f551fa94b13a1d43338eaed68bcecb95cff
Merge: 1be22978e 1f4d7a071
Author: Brad King <brad.king@kitware.com>
Date: Wed Jan 10 15:34:57 2018 +0000
Merge topic 'unhardcode-configuration-types'
1f4d7a07 Help: Add references and backticks in LINK_FLAGS prop_tgt
48f7e2d3 Unhardcode the CMAKE_CONFIGURATION_TYPES values
Acked-by: Kitware Robot <kwrobot@kitware.com>
Merge-request: !1345
commit 5fbfa18fadf945963687cd95627c1bc62b68948a
Merge: bc88329e5 ff41a4b81
Author: Brad King <brad.king@kitware.com>
Date: Tue Jan 9 14:26:35 2018 +0000
Merge topic 'FindCUDA-deduplicate-c+std-host-flags'
ff41a4b8 FindCUDA: de-duplicates C++11 flag when propagating host flags.
Acked-by: Kitware Robot <kwrobot@kitware.com>
Merge-request: !1628
commit bc88329e5ba7b1a14538f23f4fa223ac8d6d5895
Merge: 89d127463 fab1b432e
Author: Brad King <brad.king@kitware.com>
Date: Tue Jan 9 14:26:16 2018 +0000
Merge topic 'msvc2017-findcuda'
fab1b432 FindCUDA: Update to properly find MSVC 2017 compiler tools
Acked-by: Kitware Robot <kwrobot@kitware.com>
Acked-by: Robert Maynard <robert.maynard@kitware.com>
Merge-request: !1631
commit 48f7e2d30000dc57c31d3e3ab81077950704a587
Author: Beren Minor <beren.minor+git@gmail.com>
Date: Mon Nov 27 19:22:11 2017 +0100
Unhardcode the CMAKE_CONFIGURATION_TYPES values
This removes duplicated code for per-config variable initialization by
providing a `cmake_initialize_per_config_variable(<PREFIX> <DOCSTRING>)`
function.
This function initializes a `<PREFIX>` cache variable from `<PREFIX>_INIT`
and unless the `CMAKE_NOT_USING_CONFIG_FLAGS` variable is defined, does
the same with `<PREFIX>_<CONFIG>` from `<PREFIX>_<CONFIG>_INIT` for every
`<CONFIG>` in `CMAKE_CONFIGURATION_TYPES` for multi-config generators or
`CMAKE_BUILD_TYPE` for single-config generators.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Polyfill CMakeInitializeConfigs
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Tweak condition for when to use bundled FindCUDA support.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Comment out include_guard.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add max_values and argmax convenience functions to ATen
* Add documentation for torch.argmax/argmin and skip max_values
* Add tests for argmax/argmin
* Dont default the dim argument
* Use dim=0 in test_torch.py for argmax tests
* Implement argmin() and argmax() without dim
* Call .contiguous() before .view(-1)
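A quick illustration of the resulting Python-level functions (input is illustrative):
```
import torch

x = torch.randn(3, 4)
print(torch.argmax(x))          # index into the flattened tensor
print(torch.argmax(x, dim=1))   # one index per row
print(torch.argmin(x, dim=0))   # one index per column
```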
* Add a CODEOWNERS file
* This will let us require review from owners of aten/ and torch/ while giving wider access (for now) to caffe2/
* This will be adjusted as we work on shared components.
* update OWNERS to cover more pytorch bits
If the source and result tensors are empty, arr_in and arr_out may be
null (and size will be 0). This previously called memcpy(null, null, 0),
which is UB according to
http://en.cppreference.com/w/cpp/string/byte/memcpy.
Note that either one of these changes would be sufficient.
(Detected by UBSan)
cpuinfo_initialize() prints an error message to the console/log when run
on an unsupported CPU/platform. Even though the code will work fine, this is a
confusing error message that shouldn't be shown to users who run
PyTorch on architectures other than those supported by cpuinfo.
* Manually bump onnx submodule to current latest
* skip _equal_ tests
* Revert "skip _equal_ tests"
This reverts commit 72db49ebc16c9f98ed12add293a8f41e7d509bf3.
* bump to include a fix
* bump
This changes type(tensor) to return `torch.Tensor` instead of
`torch.autograd.Variable`.
This requires a few implementation changes:
- torch.Tensor is now a regular Python class instead of a
pseudo-factory like torch.FloatTensor/torch.DoubleTensor
- torch.autograd.Variable is just a shell with a __new__ function.
Since no instances are constructed it doesn't have any methods.
- Adds torch.get_default_dtype() since torch.Tensor.dtype returns
<attribute 'dtype' of 'torch._C._TensorBase' objects>
Fixes #6222
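A small sketch of the observable change:
```
import torch

t = torch.tensor([1.0, 2.0])
print(type(t))                    # <class 'torch.Tensor'>, no longer Variable
print(torch.get_default_dtype())  # torch.float32 unless changed
```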
We don't need to make sure gradInput is contiguous because it's always
passed in as an empty tensor (see CUDAFloatType.cpp after it gets
codegen-ed). This was increasing the reference on gradInput and leaking
it.
I'm not sure if there's a good way to test this. I put together a script
that
1) Prints out when a tensor is allocated and deallocated
2) Checks allocations vs deallocations after running a python script
And verified that each allocation matches each deallocation.
We had a bug in the Buck build of PyTorch due to symbols from _C
being present in two shared libraries that were both loaded at
runtime. This caused global variables to be initialized twice and
destructed twice on exit. The second destruction often caused
segfaults on exit.
This attempts to detect that sort of situation early on. If
Module.cpp is compiled twice, the symbol
pytorch_duplicate_guard()::initialized will be shared. The second
initialization will print an error message and abort.
This compares the torch function against the reference math function
on a relatively small set of inputs, including integers, extremes
of some common functions, zero, a few numbers from randn, and a few
numbers near 1e6.
The idea here is not to be completely exhaustive, but rather to quickly
expose the most common bugs. For exhaustive checks, we would have to evaluate
torch functions against all ~4e9 possible float32 values.
We compare the torch function evaluated on contiguous
and non-contiguous inputs and on large vs. small tensors.
Also:
- Make torch.allclose work with nan and +/-inf
- Add torch.isclose (like numpy.isclose)
- Add torch.testing.assert_allclose (like
numpy.testing.assert_allclose)
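For example, a minimal sketch of the comparison helpers described above:
```
import torch

a = torch.tensor([1.0, float('nan'), float('inf')])
b = torch.tensor([1.0 + 1e-9, float('nan'), float('inf')])
print(torch.isclose(a, b, equal_nan=True))   # elementwise: tensor([True, True, True])
print(torch.allclose(a, b, equal_nan=True))  # single bool: True
```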
After discussion with @colesbury it turns out that avx_mathfun.h is imprecise and cannot be trusted blindly.
Turns on /fp:strict in Windows to disable replacement of trig functions with imprecise vectorized implementation.
Fixes #5719
Previously, the following would error out with an "Invalid file
descriptor" error:
```
import torch
import torch.multiprocessing as mp
q = mp.Queue()
t = torch.tensor([])
q.put(t)
```
on some OSes. The problem was that because one cannot mmap data of size
0, and that an empty tensor has a storage of size 0, the file descriptor
for the storage (referencing shared memory) was not being set. The
multiprocessing sharing code then calls DupFD on that uninitialized file
descriptor, leading to an error.
This PR special cases sharing an empty tensor on the CPU. CUDA does not
have this problem.
Unit tests for both cpu and cuda empty tensors
* [easy] allow empty tensor in cuda relu op
The diff has not enabled the unit test for empty tensors, because the MKL version of ReluOp needs extra work to support them.
* Make blob norm plotting work with distributed trainer when the old framework is used
* Introduce torch.layout and split layout from dtypes.
Tensors (and tensor types) now have a 'layout' attribute that returns either 'torch.strided' or 'torch.sparse_coo'.
Previously, dtypes were 1-to-1 with ATen types/PyTensorTypes; the impetus behind this decision was to make things easy in the common case
(i.e. specifying a type in a factory function). But this doesn't really follow for sparsity, which isn't a common case.
It also doesn't properly represent the concept of a dtype, which in numpy is a proper scalar type (i.e. roughly the type returned from indexing the
last dimension of an n-d array). But this should be the same whether or not the tensor is represented via strides, sparsity, etc.
This is accomplished by:
1) having the dtype of tensor return the (device-type, scalar-type) combination, i.e. torch.cuda.float32, so both
torch.cuda.FloatTensor and torch.cuda.sparse.FloatTensor have the same dtype
2) Adding a layout parameter to python functions, where the combination of (dtype, layout) maps to an ATen type that is used for dispatch.
* Formatting, make init throw python_error.
* Fix cuda not enabled error message.
* Fix test.
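A small sketch of the layout attribute as exposed in the Python API (tensors are illustrative):
```
import torch

dense = torch.zeros(2, 3)
sparse = torch.zeros(2, 3, layout=torch.sparse_coo)
print(dense.layout)    # torch.strided
print(sparse.layout)   # torch.sparse_coo
```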
* Change cpp_extensions.py to make it work on Windows
* Fix linting
* Show python paths
* Debug
* Debug 1
* set PYTHONPATH
* Add ATen into library
* expose essential libs and functions, and copy _C.lib
* Specify dir in header
* Update check_abi for MSVC
* Activate cl environment to compile cpp extensions
* change version string
* Redirect stderr to stdout
* Add monkey patch for windows
* Remove unnecessary self
* Fix various issues
* Append necessary flags
* add /MD flag to cuda
* Install ninja
* Use THP_API instead of THP_CLASS
* Beautify the paths
* Revert "Use THP_API instead of THP_CLASS"
This reverts commit dd7e74c44db48e4c5f85bb8e3c698ff9de71ba2d.
* Use THP_API instead of THP_CLASS(new)
This PR enables users to print extra information about their subclassed nn.Module.
Now I simply insert the user-defined string at the end of the module name, which should be discussed in this PR.
Before this PR, users had to redefine __repr__ and copy & paste the source code from Module.
* Add support for extra information on Module
* Rewrite the repr method of Module
* Fix flake8
* Change the __repr__ to get_extra_repr in Linear
* Fix extra new-line for empty line
* Add test for __repr__ method
* Fix bug of block string indent
* Add indent for multi-line repr test.
* Address review comments
* Update tutorial for creating nn.Module
* Fix flake8, add extra_repr of bilinear
* Refactor DropoutNd
* Change to extra_repr in some Modules
* Fix flake8
* Refactor padding modules
* Refactor pooling module
* Fix typo
* Change to extra_repr
* Fix bug for GroupNorm
* Fix bug for LayerNorm
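A minimal sketch of the resulting extra_repr hook (the module name and fields are illustrative):
```
import torch.nn as nn

class MyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

    def extra_repr(self):
        # this string is appended inside the module's printed representation
        return 'in_features={}, out_features={}'.format(self.in_features, self.out_features)

print(MyLinear(3, 4))   # MyLinear(in_features=3, out_features=4)
```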
This avoids promotion from python float to torch.Tensor for AffineTransform. This appears to be needed so that constraint registration works across CPU and all GPUs.
Previous discussion at 3a25db73c8 (r176361909)
Background:
There are three basic types of objects in torch.distributions:
- Distributions are flyweight objects constructed from tensor or float args. They always promote float args to tensors.
- Transforms are longer-lived objects (sometimes cached; some are static globals). They can take float arguments. This PR makes AffineTransform avoid promoting float args to tensors.
- Constraints are long-lived objects. They can take either float or tensor arguments. They do not promote floats to tensors. These are relatively symbolic and are not much more than partially evaluated comparisons, e.g. constraints.positive is basically a symbolic version of lambda x: x > 0 that can be stored in a ConstraintRegistry table.
The Problem:
Sometimes we want to apply transform_to(constraints.positive) to a torch.cuda.FloatTensor. This is fine since
transform_to(constraints.positive)(x)
= ExpTransform()(x)
= x.exp()
which works with any tensor type.
Other times we want to apply transform_to(constraints.greater_than(1.5)) to a torch.cuda.FloatTensor. This is problematic before this PR since
transform_to(constraints.greater_than(1.5))(x)
= ComposeTransform([ExpTransform(), AffineTransform(1.5, 1)])(x)
= AffineTransform(1.5, 1)(x.exp())
= t.loc + t.scale * x.exp() # where t = AffineTransform(1.5, 1)
Before this PR, AffineTransform would promote t.loc and t.scale to tensors. This promotion can happen as early as library load time for some transforms, e.g. transform_to(constraints.unit_interval). Therefore before this PR, the second example would error at t.scale * x.exp() because t.scale is a [default] torch.FloatTensor whereas x.exp() is a torch.cuda.FloatTensor.
Proposed solution:
This PR merely adds support for python floats as the .loc and .scale parameters of AffineTransform. This should suffice for most purposes since only AffineTransform and a handful of parameter-free transforms are ever stored in the global transform_to and biject_to registries.
Alternative solutions include:
- allowing promotion from torch.FloatTensor to all other tensor types, e.g. torch.cuda.FloatTensor.
- adding a handful of specific parameter-free transforms like NegateTransform() in lieu of AffineTransform(0, -1).
Tested: added a regression test
* Support python floats in AffineTransform
* Update docstrings
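A small sketch of the pattern described above, on CPU for simplicity (values are illustrative):
```
import torch
from torch.distributions import constraints, transform_to
from torch.distributions.transforms import AffineTransform

x = torch.randn(3)
t = transform_to(constraints.greater_than(1.5))   # ComposeTransform of Exp and Affine(1.5, 1)
print(t(x))                                       # all entries > 1.5

a = AffineTransform(loc=1.5, scale=1.0)           # float loc/scale stay floats, no promotion
print(a(x.exp()))
```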
Since we added cpuinfo as a vendored dependency, this created a problem
with our NNPACK integration, because NNPACK also depends on cpuinfo,
as per #6068. This is particularly difficult to resolve because we
depend on a fairly recent version of cpuinfo, which we generally cannot
assume users have installed (it is submoduled.) So, it would seem that
to fix this properly, NNPACK would have to be vendored and built against
the correct cpuinfo.
However, discussion with Christian Puhrsch and Marat Dukhan suggests
that the benefit of carrying on with NNPACK integration is not all that
great, because mkldnn has since come out with a CPU convolution implementation
that performs better than NNPACK. NNPACK's x86 implementation is not
really maintained, and its ARM support is not really relevant to PyTorch.
So rather than go through all the rigamarole of vendoring NNPACK, better
to just delete it. If you need good perf for CPU convolutions, please
make sure you build against mkldnn.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Script functions can now have no return statements, empty
return statements, or return one or more values.
Additionally fix the lexer to always emit TK_NEWLINE before
TK_DEDENT, which simplifies the parser.
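A minimal sketch of what the script frontend now accepts (functions are illustrative):
```
import torch

@torch.jit.script
def swap(x, y):
    return y, x                  # script functions may return multiple values...

@torch.jit.script
def bump_(x):
    x.add_(1.0)                  # ...or have no return statement at all

a, b = swap(torch.zeros(1), torch.ones(1))
bump_(a)
print(a, b)
```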
```
[6/179] Building NVCC (Device) object
src/ATen/CMakeFiles/ATen.dir/native/cuda/ATen_generated_SparseMM.cu.o
/home/rzou/pytorch/aten/src/ATen/native/cuda/SparseMM.cu(9): warning:
statement is unreachable
/home/rzou/pytorch/aten/src/ATen/native/cuda/SparseMM.cu(9): warning:
statement is unreachable
```
Warning was caused by unnecessary return statement.
This reverts commit d63266ccbc0c1390c58c2a71ae0b562fdec2fbc0
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
This reverts commit 05bd9bec10fad5ff9dc40be88836fd7274d50ce9
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
Providing a Python API to fetch Int8 tensors.
data, scale, zero_point = workspace.FetchInt8Blob(blob_name)
now returns a tuple if the blob contains an Int8TensorCPU:
'data' = int8 data array
'scale' = fake quantization scale
'zero_point' = fake quantization offset
Although FetchBlob shares its back-end implementation with FetchInt8Blob, we raise an
error to prevent unexpected behavior of the same method.
The changes in this diff comment out unused parameters. All changes are automated using clang-tidy.
This will allow us to enable `-Wunused-parameter` as error.
#accept2ship
Getting the CUDA device property struct with cudaGetDeviceProperties is expensive. THC caches CUDA device properties, available via THCState_getDeviceProperties, which is exposed via at::globalContext().getDeviceProperties(device) and, in Python, via torch.cuda.get_device_properties. This PR changes the two methods that previously called cudaGetDeviceProperties to directly use torch.cuda.get_device_properties in Python.
Also fixes ATen compile error when it can't find CUDA.
Fixes #4908. Using the script from that issue, we get roughly an 18x speed-up.
[ssnl@ ~] python dev.py # master
0.2826697587966919
0.00034999847412109375
0.0003493785858154297
0.000356292724609375
0.00036025047302246094
0.0003629922866821289
0.00036084651947021484
0.00035686492919921874
0.00036056041717529296
0.0003606319427490234
[ssnl@ ~] python dev.py # this PR
0.27275662422180175
2.1147727966308594e-05
1.9598007202148438e-05
1.94549560546875e-05
1.9359588623046876e-05
1.938343048095703e-05
2.0074844360351563e-05
1.952648162841797e-05
1.9311904907226562e-05
1.938343048095703e-05
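A minimal sketch of the cached property lookup used here, guarded so it only runs where CUDA is available:
```
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)   # cached, avoids repeated cudaGetDeviceProperties calls
    print(props.name, props.total_memory, props.multi_processor_count)
```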
Allows you to export an ONNX model as:
Protobuf file (this is what we have now)
Uncompressed zip archive
Compressed zip archive
Directory
* Experimental support for different ONNX export types
* Remove a copy
* Add comment
* Add test cases
* lint
* fix bug
* address comments
Small PR to allow use of RelWithDebInfo mode in CMake as per request from @ebetica, to make debugging in optimized binaries easier (i.e. don't have to suffer major decrease in performance when using DEBUG mode, but can still debug properly, not like in RELEASE mode).
From what I can see using RelWithDebInfo means -O2 -g -DNDEBUG while Release means -O3.
normal (release):
$ python setup.py build develop
$ grep -e' -fexceptions ' aten/build/build.ninja
FLAGS = -DUSE_AVX2 -msse3 -DUSE_SSE3 --std=c++11 -Wall -Wno-unknown-pragmas -Wno-vla -fexceptions -fopenmp -O3
This PR allows use of the REL_WITH_DEB_INFO environment variable:
$ REL_WITH_DEB_INFO=1 python setup.py build develop
$ grep -e' -fexceptions ' aten/build/build.ninja
FLAGS = -DUSE_AVX2 -DUSE_SSE3 --std=c++11 -Wall -Wno-unknown-pragmas -Wno-vla -fexceptions -O2 -g -DNDEBUG
* Add REL_WITH_DEB_INFO mode
* Fix batch file syntax
* Rename setup.py to setup_caffe2.py
* Also move VERSION_NUMBER under caffe2/ directory.
* Our setup*.py file needs to be at the root level.
* Add requirements.txt
Perf numbers:
https://gist.github.com/colesbury/9e28dd7b0f27b0b019f68adbd4bd4b88
I've changed the dispatch stub so that it doesn't require every kernel
to be compiled for every instruction set. Kernel implementations are
stored in the stub's table with the REGISTER_DISPATCH macro.
I've also moved vec256 to it's own folder and split up the
specializations before they get too unwieldy.
Change UnaryOpsKernel to use new DisaptchStub
- Prefer signed integers. Mixing signed and unsigned integers is a
pain and ATen mostly uses signed integers (int64_t).
- Use inline lambda instead of struct for UnaryOps
- Rename partial load overload "load_partial"
This is in preparation for splitting out sparsity (layout) from dtypes; it's complex to maintain these
and tensor.new(...) is a legacy API in any case.
Fixes #5554
Adds an error message for when NLLLoss is passed an input and target
whose batch sizes don't match. Ideally this check should live in ATen
but since there is NLLLoss logic in python the check is there right now.
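For example, mismatched batch sizes now produce a clear error (shapes are illustrative):
```
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)            # batch of 4
target = torch.tensor([1, 2, 3])       # batch of 3
try:
    F.nll_loss(F.log_softmax(logits, dim=1), target)
except ValueError as e:
    print(e)                           # mentions the mismatched batch sizes
```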
Before, using an unknown binary operator like `@`:
```
import torch
@torch.jit.script
def mm(x, y):
return x @ y
x = torch.randn(4, 3)
y = torch.randn(3, 2)
mm(x, y)
```
resulted in [this not-so-readable trace](https://gist.github.com/zou3519/052b8998108c4bc0fe0e7c85c6f5758e).
Now, it tells the user that the problem is an unknown binary operator:
```
NotSupportedError: unsupported binary operator: MatMult
@torch.jit.script
def mm(x, y):
return x @ y
~~~ <--- HERE
```
* Continuation of https://github.com/caffe2/caffe2/pull/2306 and based on Yangqing's PR at https://github.com/caffe2/caffe2/pull/2326
* Put caffe2_protos as static library and link it whole to libcaffe2.so
* For protobuf::libprotobuf, only link it to libcaffe2_protos (and hence libcaffe2.so), but not any downstream library. This avoids manipulating protobuf objects across dll boundaries.
* After the above, during linking one will receive complaint that fixed_address_empty_string is not found. This is because we compiled protobuf with hidden visibility, and the fact that the generated caffe2.pb.h has an inline function that invokes the inline function in protobuf GetEmptyStringAlreadyInited()
* Added sed-like commands to replace the generated header to use caffe2::GetEmptyStringAlreadyInited() instead. And, in proto_utils.cc, implement a function that essentially routes the function call to protobuf's internal one. The reason this works is that, caffe2::G... is visible globally, and libcaffe2.so is able to see the real protobuf one. This ensures that we are always calling protobuf functions that are inside libcaffe2.so.
While keeping compatibility, enable TensorDataset to take any number of tensors.
* Enable TensorDataset to get any number of tensors
* Update dataset.py
Fix syntax error on python 2.7
* Add several test for tensordataset
* Fix whitespaces
* Simplify args
* Update dataset.py
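A quick illustration of the generalized TensorDataset (tensors are illustrative):
```
import torch
from torch.utils.data import TensorDataset

x = torch.randn(10, 3)
y = torch.randint(0, 2, (10,))
z = torch.randn(10, 5)
ds = TensorDataset(x, y, z)    # any number of tensors with matching first dimension
print(len(ds), ds[0])          # 10 and a 3-tuple of samples
```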
* Block set from param_group['params']
Passing a set might cause `list(params)` to come out in a random order. In that case, in `load_state_dict()`, `id_map` would not be matched correctly.
* Update Error Message
* Add Warning on Optimizer Docs
* Update optimizer.py
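A minimal sketch of the recommended usage after this change (model and hyperparameters are illustrative):
```
import torch

model = torch.nn.Linear(3, 1)
# pass parameters as an ordered collection (list/generator), never as a set,
# so that load_state_dict() can match parameters deterministically
opt = torch.optim.SGD(list(model.parameters()), lr=0.1)
```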
According to the code in _torch/nn/functional.py:1399_
(```if target.size()[1:] != input.size()[2:]:```),
if the size of input is (N, C, d_1, d_2, ..., d_K), the size of target should be (N, d_1, d_2, ..., d_K).
* allow calls to non-script methods, allow python non-script attributes in methods
* add test to make sure submodules are not reassigned
* Test that we can change python attributes
- gloo, pybind11, nanopb and nccl now live in third_party.
- ATen builds in aten/build rather than torch/lib/build/aten
- A bit of faffing about in the scripts was necessary, because they used to assume that everything lived in the same directory. Now you are expected to cd into the correct directory before calling one of the build functions. The actual builder script lives in tools
- Lint now just unconditionally ignores third_party, rather than enumerating folders explicitly
Ignore backward step when there is no loss function;
For some customized model, we can encode the update directly in forward step and there is no backward step;
Added a Caffe2 math sum operator that takes integers (only int32).
Changed SumFloatIter to SumGenericIter so that it handles more than one type.
Added a sumElementInt operator.
Change the positive modulo computation to use fewer modulo operations. This should
run ~2x faster (for the modulo part alone). In addition, we should later switch to
computing the modulo via the reciprocal.
This code introduces a new class for exporting decoder step (ensemble) models trained with fbtranslate pytorch to Caffe2 models via ONNX, for the purpose of use in "component beam search" being developed concurrently in C++ by @juancarabina.
Codemoding imports from libfb.py of the format "from libfb import X". This is part of a larger codemod to remove the mapping from libfb/py to libfb, in the interest of enabling static typechecking in fbcode.
This is required to support placeholder/decorator ops which do not have an operator schema. Note that the change is made in such a way that it is a no-op if placeholder ops are not used.
Changes:
1. Since the placeholder ops always run on CPU, added a utility to infer placeholder ops blob devices.
2. Placeholder op's input/output blobs should be on CPU as well. This change takes care of dealing with output blobs - i.e. use blobs on CPU.
3. Added a Unit test - test_inject_copy_placeholder_ops
This diff is added to support the ProfileObserver in order to differentiate operators in the stepnet properly. Since copy() is only used in the context of RNNs, the name has been changed to reflect that.
* Add numpy.array-like type inference to torch.tensor.
* Temporary fix for int/double types.
* Treat python floats as the default (scalar) dtype.
* Also make 0-length sequences the default scalar type and add more tests.
* Add type inference to sparse_coo_tensor.
* Fix sparse test.
* Remove allow_variables.
* Check numpy platform bits.
* Address review comments.
* Make suggested changes to constraints.
* More checking windows builds.
* Fix test for windows.
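A quick illustration of the resulting type inference (inputs are illustrative):
```
import torch

print(torch.tensor([1, 2, 3]).dtype)    # torch.int64: integers stay integral
print(torch.tensor([1.0, 2.0]).dtype)   # Python floats use the default (float) dtype
print(torch.tensor([]).dtype)           # empty sequences also use the default dtype
```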
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Add axis to top_k_op. (#2416)
* Revert update on top_k_op
* Add axis to top_k_op
Add axis to top_k_op
* [auto] Update onnx to a8e4648 - Adjust link flags when built in Windows Debug mode (#647)
a8e4648a7d
* [auto] Update onnx to f4acf28 - Remove allowconsumed enforceconsumed from op schema. (#617)
f4acf281ef
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Initialize cpuinfo in the thread pool
The thread pool called cpuinfo_get_processors_count() without initializing cpuinfo. Only by luck did this not make Caffe2 single-threaded: the threadpool is initialized after NNPACK, and NNPACK initializes cpuinfo itself.
This commit also updates cpuinfo to a version that aborts with a fatal error if it is used uninitialized.
* Updated Python Op and Image Pre-Processing Pipeline tutorials && Added CIFAR-10 Part 1 tutorial (#2286)
* Updated Basics tutorial: (1) Added Python 3 support with __future__ statements; (2) Various grammatical/typo fixes and minor refactoring of Markdown
* Added Python 3 support and made minor typo fixes
* Added Python 3 support with future imports, refactored and corrected errors in Markdown, added comments
* Added Python 3 support with future imports, Added use of caffe_translator.py to translate downloaded .caffemodel file to .pb files
* Upgrades to Image Pre-Processing Pipeline tutorial
* Updated Python Op tutorial
* removed markdown with empty links
* Added Part 1 of an end-to-end CIFAR-10 tutorial
* Updated MNIST Dataset and Databases tutorial with python3 support and markdown fixes
* Tweaks to markup, less training iterations
* changed permissions of CIFAR10_Part1; typo corrections in Image_Pre-Processing_Pipeline
* Typo corrections in Multi-GPU Training tutorial
* sync Python_Op py_gen with the IPython notebook
* nit typo correction
* [auto] Update onnx to 5cb999d - Minor cleanups to shape inference (#653)
5cb999ddc1
* [auto] Update onnx to ecac1c1 - Merge Rel 1.1.0 branch into master (#657)
ecac1c1624
* Strip down onnx to only pb definitions in mobile build (#2426)
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
* Exported AtomicIterOp count
The thread pool called cpuinfo_get_processors_count() without initializing cpuinfo. Only by luck did this not make Caffe2 single-threaded: the threadpool is initialized after NNPACK, and NNPACK initializes cpuinfo itself.
This commit also updates cpuinfo to a version that aborts with a fatal error if it is used uninitialized.
* Deprecate ctx.saved_variables via python warning.
Advises replacing saved_variables with saved_tensors.
Also replaces all instances of ctx.saved_variables with ctx.saved_tensors in the
codebase.
Test by running:
```
import torch
from torch.autograd import Function
class MyFunction(Function):
@staticmethod
def forward(ctx, tensor1, tensor2):
ctx.save_for_backward(tensor1, tensor2)
return tensor1 + tensor2
@staticmethod
def backward(ctx, grad_output):
var1, var2 = ctx.saved_variables
return (grad_output, grad_output)
x = torch.randn((3, 3), requires_grad=True)
y = torch.randn((3, 3), requires_grad=True)
model = MyFunction()
model.apply(x, y).sum().backward()
```
and assert the warning shows up.
* Address comments
* Add deprecation test for saved_variables
* Changes in bilinear upsampling
* Add align_corners option to upsampling module & functional when using linearly interpolating modes
When align_corners=True, it uses the old original upsampling scheme, which gives visually better results,
but doesn't properly align input and output pixels, and thus causes the output to vary depending on the input size.
This PR adds the align_corners option and changes the default behavior to align_corners=False, with a
proper warning if this option is not specified when using nn.Upsample or nn.functional.upsample, to let
users be aware of this new change.
Adds tests in test_nn.py for spatial invariance when align_corners=False, and usual module tests for
align_corners=False.
* remove redundant checks and unnecessary variables; fix the cast
* fix negative indices
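A small sketch of the new flag, shown with F.interpolate, the current name of the nn.functional.upsample routine described above (shapes are illustrative):
```
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)
# passing align_corners explicitly silences the warning and pins the behavior
old = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)
new = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
```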
* Store perf numbers in S3
Previously the perf numbers are stored in https://github.com/yf225/perf-tests/tree/cpu, but we couldn't figure out a way to push the perf numbers only from master builds. This PR moves the perf number storage to S3, which allows us to have finer control over when to push the new numbers.
This is in replacement of #5844 - storing numbers in RDS has its own problems with schema migration and backward compatibility, and using a NoSQL database might be an overkill at this point.
* Fixed issues
This PR addresses issue #5024
* Expose Conv2dBackward in python
* Separate interface for exposing gardients of operators
* Revert old changes
* Add tests
* Add conv1d gradients. Refactor tests for grad convolutions
* Refactor names and change examples
* Remove Varibale from tests for conv backward
Added an ind_worker_queue parameter to data.DataLoader. It makes preprocessing deterministic.
DataLoader in multiprocessing mode may cause non-deterministic results. Even if the random seed is frozen, each subprocess may get tasks in an unstable order, caused by differing I/O times while data loads. If you use augmentation while loading data, this makes results unreproducible. See https://discuss.pytorch.org/t/deterministic-non-deterministic-results-with-pytorch/9087
To fix this issue I have added an individual queue for each worker. In this case each worker gets tasks in a stable order, so each subprocess produces stable results.
To reproduce the issue you may change ind_worker_queue to False and run the script several times.
Code to reproduce issue is in the corresponding PR.
* TestIndividualWorkerQueue added to DataLoader tests
* Review fixes
* "Simplify" code by removing itertools
* Rebase conflicts fix
* Review fixes
* Fixed shutdown behavior
* Removed ind_worker_queue flag.
* Rebase on master
* Disable tests that use DataLoader with multiple workers (#5322)
PR introduces AVX2 optimization for sigmoid on floats. Issue #4929. The internal benchmark shows ~10x speedup.
Added an AVX2-vectorized sigmoid using the 8-way vectorized exp (exp256_ps) in avx_mathfun.h.
Implemented vector dispatch for sigmoid. Since the sigmoid function is defined for floats and doubles only, for now a preprocessor #ifdef initializes the sigmoid dispatch only for float and double.
Vector functions in THVector.h were not being called for all of the basic float/double functions. Changed the LAB_IMPLEMENT_BASIC_FUNCTION define in THTensorMath.c to use the THVector_(NAME) implementations when the inputs are contiguous. Functions that do not have vectorized SIMD implementations will use the same default function from THMath.h.
* add AVX2 implementation for sigmoid function
* Fix bug in AVX2 code for sigmoid
* Add new macro for custom vectorized functions
* Implement torch.util.bottleneck
This is a tool that is intended to be used as initial exploratory
debugging of bottlenecks in user scripts. Run it with
python -m torch.utils.bottleneck /path/to/source/script.py
* Refactor and address comments
* Fix tests
* Allow passing of args to the profiled script
* Replace Variable
* Implement range for loop in script
* Fix handling of boolean constants
* Use WithInsertPoint
* Allow dynamic max trip count
* fix symbols
* Fix argument order
* fix test
* Add insert{Input,Output} APIs and use them
* Factor out condition stuff
* clang-format
* Address remaining comments
* Fix tests
* Implement script in AST frontend
* Support legacy empty tensor behavior in cat
Continuing from #5837:
Fixes #5332.
Currently, the following behavior happens with torch.cat:
```
import torch
x = torch.randn(4, 3, 32, 32)
empty = torch.Tensor([])
res1 = torch.cat([x, empty], dim=1)
res2 = torch.cat([empty, x], dim=1)
```
However, at some point in the past, res1 and res2 were equal. This PR
supports the legacy behavior of ignoring empty tensors when
concatenating a list of tensors, until we have empty tensors that can
have arbitrary shape, at which point we'll stop supporting this
behavior.
* Address comments
* Moved torch headers copy to build_deps
PR #5706 initially moved headers under build_ext to fix bdist_wheel and
build develop. This broke install and #5755 moved them back to install
which broke bdist_wheel and build develop. Looks like build_ext is called
from install after it already tried to copy the headers to the python install
dir and the headers were not installed correctly. Using build_deps works
correctly with setup.py install, bdist_wheel, and build develop.
* Comment about the auto-generated files
Added comment that the current solution will not include auto-generated
files which may be a problem if somebody needs to use them
- All of the scripts are based off of the idea that they should be as
simple as possible, and all the heavy lifting done in the construction
of the Docker file. The scripts are really simple now. A bigger
philosophical discussion can be found in .jenkins/README.md
- build-asan.sh is split out of build.sh, as ASAN builds are a bit
specialized and it's inappropriate to run many of the other builds
as part of them.
- We now build and run with mkl/mkl-include on the CPU only builds
- We now report sccache and ccache stats at the end of all builds.
- run_test.py flushes stdout/stderr before making a subprocess call,
which should solve our interleaving problems.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Fixes #5943
For the following code:
```
import torch
u = torch.zeros((3, 3), requires_grad=True)
v = u.permute(-1, -2) # (1, 0) here is fine
v.sum().backward()
```
during the backward pass, a std::vector is constructed
as an "inverse" of the permutation. To do this, all the dims
are indexed into the vector.
The problem with that is that the negative dims were being indexed
into the std::vector, causing undefined behavior. This PR wraps
those negative dims so they're handled correctly.
* Revert update on top_k_op
* Add axis to top_k_op
* Remove do { ... } while (false)
* Revert top_k op to upstream
* Add argmin and argmax ops
Add argmin and argmax ops
* Revert top_k_test to upstream
* Add argmin and argmax ops
Add argmin and argmax ops
* Have ScriptModule inherit from Module
This is accomplished by created replacement _parameters, _buffers,
and _modules which implement the OrderedDict APIs but which
actually get/set their members inside script::Module
* Merge TracedModule with ScriptModule
* Move logic of attribute handling into Python bindings rather than
make script::Module handle it. This was redundant with nn.Module,
which already handles attribute.
* Make TracedModule a subclass of ScriptModule
* Move handling of attribute kind logic into bindings.
* Allow ScriptModule to contain non-script module submodules.
* Revert "Use -DCMAKE_BUILD_TYPE=Release for local build by default"
This reverts commit 035c62081f6420405b9f1380cc5d21b4c6ae78f6.
* Revert "Export number of iterations of AtomicIterOp (#2338)"
This reverts commit 91b7a0cb48c6b079e2ca8fd5c26819a003937d76.
* add reduce=True arg to MarginRankingLoss
* make default margin arg match for legacy
* remove accidentally added test
* fix test
* fix native_functions.yaml alphabetical order
Fixes #5887.
Now it shows:
-- MKL library found
-- Found a library with BLAS API (mkl).
CMake Error at CMakeLists.txt:389 (MESSAGE):
MKL header files not found. If using conda, please run `conda install
mkl-include`. Otherwise, please make sure that CMake will search the
directory containing the header files, e.g., by setting CMAKE_INCLUDE_PATH.
-- Configuring incomplete, errors occurred!
See also "/home/ssnl/sftp/pytorch/torch/lib/build/aten/CMakeFiles/CMakeOutput.log".
See also "/home/ssnl/sftp/pytorch/torch/lib/build/aten/CMakeFiles/CMakeError.log".
* Fix integer overflow in remainder
* Fix remainder operator in CUDA
* Add tests for remainder integer overflow
* Add has_different_sign static function
1. Support calculating the average LpNorm in the LpNorm operator by adding one more boolean argument, i.e., LpNorm(average=true) = LpNorm(x) / size of x
2. Integrate the average option into the visualization framework
Changes:
=======
1. Added device inference functions for Concat and Split Ops.
2. Added a unit test to validate the change. See, test_device_inference_function in core_test.py
3. Fixed some formatting.
Instead of using hard-coded rules or relying on gpu_strategy to mark full-sync data parallel ops, we need some generic rules that are applicable to both the single-machine and distributed settings.
Make it easier to plug in intermediate steps between preprocessing & trainer by maintaining a stable schema.
I also fixed enqueue() so that we can pass in the same blob in multiple locations without causing data corruption.
The way `splits()` is currently used is so convoluted. It's impossible to compose ReaderBuilder. I'm working on a composite reader so this is a prerequisite for it.
The idea is that the ReaderBuilder should maintain the states it needs to create a reader. Any setup is done through the new `setup()` method. Currently, `setup()` should only be called once, but, if needed, it should be safe to call it multiple times.
Add one more input module, preproc everstore, for IN1k. It uses the same datasets as the sherlock everstore input reader, then uses the DataPreproc operator to distribute the image preprocessing to machines other than the trainer. This should relieve some of the compute burden on the trainers.
@override-unit-failures
(Note: this ignores all push blocking failures!)
LOG(INFO) can be stripped out at compile time or disabled at run time,
but there are hardly any use cases where we want to call TEST_Benchmark
but don't want to see the result. Additionally, on Android, LOG(INFO)
writes to logcat, which is OK for errors/warnings but inconvenient
for benchmarking results, as on new phones logcat spews logs like crazy.
Not sure if this is a backwards compatibility issue.
```
Python 2.7.9 (default, Apr 2 2015, 15:35:35)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests.get as urlopen
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named get
>>> from requests import get as urlopen
>>>
```
* Revert "Port ATen and JIT C++ tests to Catch2 (#5788)"
This reverts commit 6f80023c29e0fb55f46a32c4931bc5d4ba749846.
* Revert "Fix error message for cat-ing zero-dim tensors (#5819)"
This reverts commit cf2e1760490d369e93017b9425279b235c10772d.
* Revert "Softmax symbolic should account for negative dim (#5846)"
This reverts commit ba64724aeea8ad5d4b50cd1154fca5a011618333.
* Revert "[fft][1 of 3] build system and helpers to support cuFFT and MKL (#5855)"
This reverts commit 22ef8e5654c45d1f5404e3add6ad19678c0b80a9.
* Revert "Don't modify requires_grad when running DataParallel in no_grad mode (#5880)"
This reverts commit d11b7fbd1c49ed7bd84c89d286e2763e6ba55f51.
* Revert "fix some methods not showing up in doc (#5882)"
This reverts commit 24fca0efb289a069929639783d1c050b79e591c0.
* Revert "ReduceOps cleanup and set_num_threads (#5723)"
This reverts commit 84400d5531500e1a3fbcfe8a3f2865f982405861.
* Revert "introduce shape_as_tensor and reshape_from_variable_shape (#5824)"
This reverts commit f446b82e70ca0aa42fffa58469c28b6bce51d021.
* Revert "Enable resetting of batchnorm running moments and cumulative ("simple") moving average (#5766)"
This reverts commit 99b1f6cfad85a4856550cc1e787afd7ff9e6c6aa.
* Add CollectAndDistributeFpnRpnProposalsOp for FPN support
* Adds a C++ operator equivalent to the Python op in Detectron
* Once some additional GenerateProposalsOp changes are made this will
let us support Detectron FPN models with straight Caffe2 C++ ops
* RetinaNet and segmentation models require additional work
* Remove some uses of conservativeResize
* Add notes about training and inputs/outputs to operator documentation
This PR addresses #5648. In particular, following the discussion at #5648:
- it adds Catch as a submodule (https://github.com/catchorg/Catch2) in torch/aten/utils
- it ports all ATen tests to Catch
- it ports torch/csrc/jit/test_jit.cpp to Catch (libtorch only, Python build is unaffected)
This is the first of three PRs that #5537 will be split into.
This PR adds mkl headers to included files, and provides helper functions for MKL fft and cuFFT.
In particular, on POSIX, headers are using mkl-include from conda, and on Windows, it is from a new file @yf225 and I made and uploaded to s3.
* add mkl-include to required packages
* include MKL headers; add AT_MKL_ENABLED flag; add a method to query MKL availability
* Add MKL and CUFFT helpers
Previously, running DataParallel in no_grad mode would change the
requires_grad property of the network's parameters to False. The issue
is that Broadcast returns aliases of the inputs for the source device.
In no_grad mode, it would detach these inputs in-place.
Fixes #5851
* Changes without centos changes
* Changes for protobuf 3.5 and gcc 4.8
* Changing 3.4.1 back to 3.5.1
* Preventing installing two versions of setuptools
* Fixing setuptools bug
* support n-d inputs in bilinear and move to aten
* support n-d inputs in bilinear and move to aten
* add asserts to bilinear inputs
* address comments
* cast int64_t in asserts
* implement TripletMarginLoss as a native function
* implement TripletMarginLoss as native function
* fix compile error
* address comments
* address comments
* Add keepdim arg to pairwise distance
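A quick illustration of the functional forms touched here (shapes are illustrative):
```
import torch
import torch.nn.functional as F

anchor = torch.randn(4, 8)
positive = torch.randn(4, 8)
negative = torch.randn(4, 8)
loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
dist = F.pairwise_distance(anchor, positive, keepdim=True)   # keep the reduced dim
print(loss.item(), dist.shape)                               # scalar loss, torch.Size([4, 1])
```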
* Add torch.sparse_coo_tensor factory.
Notes:
1) I didn't add Tensor.new_sparse_coo_tensor; it didn't seem particularly useful, but it's easy to add
2) This doesn't do the type inference, i.e. torch.sparse_coo_tensor(indices=LongTensor, values=IntTensor)
will return a sparse tensor corresponding to the default type rather than a sparse IntTensor. We can add
type inference later when we add it to other factories.
* Fix merge.
* Use type_conversion function from python_variable_methods.
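A small sketch of the new factory (indices and values are illustrative):
```
import torch

i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])                 # 2 x nnz indices
v = torch.tensor([3.0, 4.0, 5.0])             # nnz values
s = torch.sparse_coo_tensor(i, v, (2, 3))     # COO sparse tensor of shape (2, 3)
print(s.to_dense())
```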
* Namespaced symbols
- Our interned strings now have structure, "ns::symname" rather than just
"symname" before. We support efficient namespace testing for uniques
by encoding the namespace in one byte in the Symbol internal representation.
See torch/csrc/jit/interned_strings.h for a more in-depth implementation
discussion.
- All uses of ksymbol are now attr::symbol (or some appropriate namespace).
The valid namespaces are prim, attr, onnx and aten.
- Symbol is bound in Python as a qualified string "attr::symbol", EXCEPT for the
attribute setting/getting API, whose symbols must always be attr
symbols; they get special cased to assume strings are passed.
There's a little bit of naughtiness in the implementation, maybe you know
how to solve it.
- However, the g.op() convenience function assumes that you're generating
ONNX operators, unless you explicitly qualify.
- All ATen operators and nodes have built-in interned strings generated
for them, so you should never have to write a string literal ever again.
The tracing code is adjusted to use it.
- ONNX exporter now properly tests to see that all operators are in
onnx namespace before accepting the export. This is way more
robust than the previous exporter, which would be willing to
export capitalized operators which were not actually ONNX operators.
- A slight organizational change for symbolic.py; this module now ONLY
contains aten operators. In particular, the exporter for Constant
has moved into utils.py (along with Undefined, from the C++ side),
since primitive ops get "special treatment."
- The un-inplacing logic in recording is more robust, so that we don't
delete a trailing underscore from __and__. This never affected us
before because we didn't have any tests for it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* 1. Add logdet and slogdet in ATen side
2. Previously, det could return a result with an incorrect sign for symmetric
matrices. This was caused by a wrong assumption I had about SVD (that U=V^T when
the input is symmetric). This fixes it.
3. Moreover, after fixing 2 now QR is always needed for det forward. So I moved
SVD to backward call. Since this is a specific variant of SVD, it is named as
_svd_with_positive_UV_det, with derivative.yaml entry being svd_backward.
4. Updated/added backward functions for det, logdet and slogdet, which uses
_svd_with_positive_UV_det and svd_backward inside.
5. Optimized svd_backward:
a. Avoid unnecessary kernels when only sigma has gradient (this is the usual
case, and also true with *det backward functions).
b. Fix SVD double backward by avoiding a nan.
* 1. Add/update grad checks for det, logdet, and slogdet.
2. Fix an incorrect check for dim_args_idx in test_autograd.py
3. Add option to only test a subset of output values, specified by
test_output_indices, for cases like slogdet where only the
second output is differentiable.
4. Add better doc for the test generating list.
* Add/improve output tests for det, logdet and slogdet
Add a scaling to random matrices so closeness checks are more robust
* Remove unnecessary Variable wrappers in some test files
* Add logdet slogdet docs
* Improve an err msg in THTensorLapack.c
* add inverse-based backward for invertible matrices
use svd only for non-invertible case, so don't need the special variant anymore
* use LU rather than QR
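A quick illustration of the resulting functions (the input matrix is illustrative):
```
import torch

a = torch.randn(3, 3)
a = a @ a.t() + 3 * torch.eye(3)        # make the matrix well conditioned
print(torch.logdet(a))                  # log of the determinant
sign, logabsdet = torch.slogdet(a)      # sign and log|det|, stable for very large/small det
print(sign, logabsdet)
```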
#5481 was reverted due to a strange test bug. This PR attempts to fix that.
This diff adds vectorization to ATen. It uses Intel intrinsics to build a general vec256 class that represents types of 256-bit width. These can then be treated like regular variables. Using those, it implements torch.sum() for the contiguous case. It uses Intel TBB for multithreading, which allows work stealing, and chunks the reduction operations based on an experimentally chosen value (_THRESHOLD). It uses cpuinfo to pick the right code depending on the host's capabilities.
The kernels are implemented under native/cpu. Each .cpp file is compiled with -avx, -avx2 and no additional flags. A macro is used to append AVX, AVX2 or NONE to the function name. The header then needs to define the functions three times, one for each capability. This could be improved by either changing the cmake file a bit or possibly generating source code using a Python script etc.
For the non-contiguous case this defaults to the current implementation within TH. For CUDA it entirely defaults to the implementation within THC.
There probably needs to be a bit of a debate around the design decisions here, the additional dependencies, parallelization strategy, clarity, etc. The numerical results also diverge from numpy with larger tensors, which is expected since we're summing, for example, 8 numbers and then adding the result to the running sum, instead of each number one by one. But there might be something to be said about accumulating into a double for floats or the degree of divergence, the behavior with respect to CUDA, etc.
I wrote a [small Python script]( https://github.com/cpuhrsch/benchmark/blob/sumall/benchmarks/sum_bench.py) to compare the results with numpy numerically as well as on timing. I ran this script to create timings both on master and this branch.
Here is the command for 1 core
`OMP_NUM_THREAD=1 taskset -c 0 python sum_bench.py --enable_numpy 200`
Here is the command for all cores
`python sum_bench.py --enable_numpy 200`
Here are the results of each:
[Master, 1 core](https://paste.fedoraproject.org/paste/Nho9JzHpPVK9av8a6mByjQ)
[This branch, 1 core](https://paste.fedoraproject.org/paste/6xLHkYvcVJx9z~5MoHxN4w)
[Master, all cores](https://paste.fedoraproject.org/paste/5l3V1d5zGqvJcMXIUteMRw)
[This branch, all cores](https://paste.fedoraproject.org/paste/J4RuDU-0Drz0aZwtphQwEA)
To test, the command is
`python sum_bench.py --test 200`
[This branch, test results](https://paste.fedoraproject.org/paste/kTEoUC~oWgXA6XWMAfNfNw)
For this test we look at the average absolute value of the differences. This does not take into account the relative magnitude of the numbers. The numbers are sampled from a standard normal distribution.
In terms of performance this diff should bring PyTorch on par with Numpy and usually exceed it by 1.5 to 2x.
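To make the numerical-divergence point above concrete, here is a small NumPy-only sketch (not part of the diff) showing how chunked accumulation reorders floating-point additions relative to a strictly sequential sum:
```
import numpy as np

# Illustrative only: summing in 8-wide chunks (as a vectorized kernel does)
# reorders the floating-point additions, so the result can differ slightly
# from a strictly sequential sum.
x = np.random.randn(2 ** 16).astype(np.float32)

seq = np.float32(0.0)
for v in x:
    seq = np.float32(seq + v)           # one element at a time

partial = x.reshape(-1, 8).sum(axis=0)  # 8 running partial sums
chunked = partial.sum()                 # combine the partial sums at the end

print(abs(float(seq) - float(chunked)))  # small, generally nonzero difference
```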
Fixes #5611.
THCTensor_(baddbmm) assumes that newContiguous will always return a new tensor (this is a bad assumption). At the end of the function, tensors are freed if tensor_new != tensor_old. As a result, some tensors aren't freed if they were initially contiguous and newContiguous is called on them.
Test Plan
code reading
run the following (from the #5611 bug report) and assert that the memory doesn't leak anymore
import subprocess
import torch
from torch.autograd import Variable

# This is from https://discuss.pytorch.org/t/access-gpu-memory-usage-in-pytorch/3192/4
def get_gpu_memory_map():
    """Get the current gpu usage.

    Returns
    -------
    usage: dict
        Keys are device ids as integers.
        Values are memory usage as integers in MB.
    """
    result = subprocess.check_output(
        [
            'nvidia-smi', '--query-gpu=memory.used',
            '--format=csv,nounits,noheader'
        ], encoding='utf-8')
    # Convert lines into a dictionary
    gpu_memory = [int(x) for x in result.strip().split('\n')]
    gpu_memory_map = dict(zip(range(len(gpu_memory)), gpu_memory))
    return gpu_memory_map

l, m, n = 1, 9, 1
w = torch.nn.Parameter(torch.Tensor(1024, 2, l, m).cuda())
for i in range(10000):
    a = Variable(torch.Tensor(1024, 2, m, n).cuda())
    torch.matmul(w, a).permute(0, 3, 1, 2).mean().backward()
    if i % 100 == 0:
        gpu_mem = get_gpu_memory_map()
        print("GPU: {:.2f} MB".format(gpu_mem[0]))  # nvidia-smi reports usage in MiB
* Simplify run_test.py and don't use shell=True
* Fix non-shell output for check_output and always print to stderr
* Use shlex.split instead of str.split
* s/log/print_to_stderr
* with_init -> with_init_file
* Remove bufsize argument
* Fixing conda
* Adding hypothesis and onnx to conda builds
* Updates but still not working
* Adding required changes to conda_full
* Updates
* Moving to more general build_anaconda script
* Adding check for gcc version
* Adding general ways to add/remove packages from meta.yaml?
* Changes for specific packages to build on gcc 5.4
* Fix with glog spec
* Requiring numpy >1.12 for Python 3 to satisfy the opencv dependency
* Adding pydot to required testing packages
* Adding script to read conda versions for gcc ABI
* Trying to fix segfault by installing in env instead
* conda activate -> source activate
* Trying adding back leveldb
* Setting locale for ONNX + conda-search changed its format
* read_conda_versions handles libprotobuf
* Conda script updates
* Adding a protobuf-working test
* Removing changes to proto defs b/c they will require internal changes in a separate diff
* Fix useless opset_import in onnx
* Set the default ir version in make_model
* Use the target_opset_version in Caffe2Frontend
* remove make_model from helper in caffe2.python.onnx
Notes:
1) I didn't add Tensor.new_sparse_coo_tensor; it didn't seem particularly useful, but it's easy to add
2) This doesn't do the type inference, i.e. torch.sparse_coo_tensor(indices=LongTensor, values=IntTensor)
will return a sparse tensor corresponding to the default type rather than a sparse IntTensor. We can add
type inference later when we add it to other factories.
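For illustration, a minimal usage sketch of the factory named in these notes (using default-typed values, per note 2):
```
import torch

# Minimal sketch: a 2x3 sparse COO tensor with three nonzero entries.
indices = torch.LongTensor([[0, 1, 1],
                            [2, 0, 2]])
values = torch.FloatTensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(indices, values, torch.Size([2, 3]))
print(s.to_dense())
```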
I need this because run_test is going to need to read other
options than just verbose when I implement JUnit XML dumping.
(JUnit XML dumping cannot be implemented solely by frobbing
--python because the XML file to dump to must vary based on the
test name.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Revert "ATen ReduceOps (#5481)"
This reverts commit 310c3735b9eb97f30cee743b773e5bb054989edc.
* Revert "Check that new cpuinfo and tbb submodules exist (#5714)"
This reverts commit 1a23c9901dbfee295bf5b3dad36e4d3ee7e86366.
The save_mean and save_std are undefined if training is false.
Previously, we unpacked them even though we did not use them in the
computation.
We also don't need to re-pack the mean/variance variables.
* Reduce Sum and Reduce Mean
* Handle reductions with empty 'axes'
* Merge codebase and simplify tensor reduction logic
* Restructure code and add comments.
* Fix parameter to scale
* Fix parameter to scale
* Fix some minor errors in existing docs.
* Fix Convolution and Pooling docs in torch.nn.functional
* Cleaned up torch.nn.functional docs
* Address @SsnL 's comments
* Add multiplication sign missing in docs
* Fix more typos, and clear some warnings
* Change infinity symbol in LPPool2d
* Revert some changes in torch.nn.functional
* Few more minor changes
Previously, methods like int() and long() would fail tracing because they eventually dispatch down to toType, which takes a Type as a parameter. We don't (currently) support tracing ops with Type inputs[0], so this PR adds specializations for the ATen scalar types and dispatches to those directly. These specialized ops can be traced into the IR without needing a Type argument.
A more long-term solution would be to add support for Types in the IR.
* Traceable dispatch for Variable cast methods
* Add ONNX symbolics
* Fix test
* Fix cross-backend copy issue
* Prepend underscores to cast identifiers
* Metaprogram symbolics
* clang-format
* stupid lint
* Add comments for all code fragments
* Implement torch.reshape and Tensor.reshape
This implements reshape which has similar semantics to numpy.reshape. It
will return a view of the source tensor if possible. Otherwise, it
returns a copy.
* Remove in-place reshape_ that was an alias for resize_
* Update documentation
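A minimal sketch of the view-or-copy semantics of torch.reshape described above (illustrative, not from the diff):
```
import torch

x = torch.arange(6)

# When the requested shape is compatible with the input's strides,
# reshape returns a view that shares storage with the input.
y = x.reshape(2, 3)
y[0, 0] = 100
print(x[0])        # reflects the write made through the view

# A non-contiguous input (here, a transpose) cannot be viewed, so a copy is returned.
z = torch.arange(6).reshape(2, 3).t().reshape(-1)
```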
This includes various fixes required to export the NMT decoder to ONNX
* Add missing ONNX symbolics and fix fusible expand logic
* Update comments and use of at::optional
* Use _unimplemented
* [GanH]: two_task_discriminator
as titled
and adding label smooth
* [Dper2] Simplified UI options needed for blob magnitude visualization
* [GanH]: fix tags
as titled
* Added type and shape inference for GatherRange operator
This helps with type / shape inference when using this operator in layers.
Also just a nice to have in general.
* Demonstrate Caffe2 exception handling with StoreHandlerTimeoutError in Python
We'd like to catch and recover from certain Caffe2 net exceptions. Use this diff to demonstrate a pattern of registering a pybind exception mapping and catching in Python using caffe2::StoreHandlerTimeoutException.
* Bind Gloo IoException to IoError in Python
Allow peer failure handling and recovery using an exception based mechanism. This diff registers gloo::IoException with pybind.
* [GanH]: add label smoothing to softmax with loss
as titled
* [C2] Enable LARS in Adagrad and hook it to DPER
* [DPER] Don't pass LayerModelHelper in create_trainer_nodes
Since we're planning to get rid of it eventually and I want to get access to
NetDef only interface ASAP - I'm looking towards removing all references to
LMH, where we don't really need them.
* fix bugs in LambdaRankNdcgOp
The loss and gradient in LambdaRankNdcgOp were incorrect: the loss should be the negative log of the probabilities instead of the log.
* Restrict thread pool on iOS to only big cores
Historically, iPhones exposed only one type of cores, and Caffe2 thread pool used all of them.
However, iPhone 8/iPhone X exposes 2 big + 4 LITTLE cores. As our thread pool doesn't support work stealing or other forms of load balancing, fast cores end up waiting for the slow ones, and it may be better to restrict execution to only 2 fast cores, like we do on Android.
* Remove SparseLength Sum/WeightedSum/Mean operators with fp16 engine
Remove SparseLength Sum/WeightedSum/Mean operators with fp16 engine
* make clang happy and get fewer warnings
make clang happy and get fewer warnings
* [Personalization] Support add_output_schema() in layer_model_helper
Problem:
Currently the output_schema of sparse_nn can only be set once. https://fburl.com/efth5zer.
Solution:
For flexibility, we want to add fields to output_schema incrementally.
Plan:
Wrap the change of `model._output_schema` into a new function `add_output_schema()` for adding additional output_schema.
Callsite:
The add_output_schema() should be called instead at https://fburl.com/efth5zer
Reference:
The newly added `add_output_schema()` will be similar to `add_loss()` in https://fburl.com/t2ii8njh
This diff adds vectorization to ATen. It uses Intel intrinsics to build a general Vec256 class that represents types of 256-bit width. These can then be treated like regular variables. Using those, it implements torch.sum() for the contiguous case. It uses Intel TBB for multithreading, which allows work stealing, and chunks the reduction operations based on an experimentally chosen value (_THRESHOLD). It uses cpuinfo to pick the right code depending on the host's capabilities.
The kernels are implemented under native/cpu. Each .cpp file is compiled with -avx, -avx2 and no additional flags. A macro is used to append AVX, AVX2 or NONE to the function name. The header then needs to define the functions three times, one for each capability. This could be improved by either changing the cmake file a bit or possibly generating source code using a Python script, etc.
For the non-contiguous case this defaults to the current implementation within TH. For CUDA it entirely defaults to the implementation within THC.
There probably needs to be a bit of a debate around the design decisions here, the additional dependencies, parallelization strategy, clarity, etc. The numerical results also diverge from numpy with larger tensors, which is expected since we're summing, for example, 8 numbers and then adding the result to the running sum, instead of each number one by one. But there might be something to be said about accumulating into a double for floats or the degree of divergence, the behavior with respect to CUDA, etc.
I wrote a [small Python script]( https://github.com/cpuhrsch/benchmark/blob/sumall/benchmarks/sum_bench.py) to compare the results with numpy numerically as well as on timing. I ran this script to create timings both on master and this branch.
Here is the command for 1 core
`OMP_NUM_THREAD=1 taskset -c 0 python sum_bench.py --enable_numpy 200`
Here is the command for all cores
`python sum_bench.py --enable_numpy 200`
Here are the results of each:
[Master, 1 core](https://paste.fedoraproject.org/paste/Nho9JzHpPVK9av8a6mByjQ)
[This branch, 1 core](https://paste.fedoraproject.org/paste/6xLHkYvcVJx9z~5MoHxN4w)
[Master, all cores](https://paste.fedoraproject.org/paste/5l3V1d5zGqvJcMXIUteMRw)
[This branch, all cores](https://paste.fedoraproject.org/paste/J4RuDU-0Drz0aZwtphQwEA)
To test, the command is
`python sum_bench.py --test 200`
[This branch, test results](https://paste.fedoraproject.org/paste/kTEoUC~oWgXA6XWMAfNfNw)
For this test we look at the average absolute value of the differences. This does not take into account the relative magnitude of the numbers. The numbers are sampled from a standard normal distribution.
In terms of performance this diff should bring PyTorch on par with Numpy and usually exceed it by 1.5 to 2x.
* Delete ""_sym literal form.
Two reasons:
1. It's unnecessary now; all of the uses of the literal form would
be better directly referring to the interned string (esp. since
now we are autogenerating symbols.)
2. When I add namespacing, there will be no convenient way to specify
the desired namespace with just _sym. If we add it back, we would
need distinct suffixes for each different type. Easiest to delete
it while we don't need it.
Add script::Module C++ class to represent script modules
switch AST -> IR conversion to work on Modules/Methods rather than raw graphs
function-only AST -> IR conversion is just a simplified case where there is
only one module with a single method and no parameters.
introduce SugaredValue in compiler.h to represent values in scope in a script
function that are not first-class and that get desugared. This is used to
represent the module's self parameter, as well as python function calls,
and method calls on tensor
provide a Python ScriptModule that provides a nice API on top of script::Module
allowing for the definition of script modules with methods, parameters,
and submodules
Not in this PR but intended for the future:
ScriptModule actually subclasses nn.Module, with most methods implemented
Unification of tracedmodule and script module functionality into one container class.
Detailed changelog:
* Switch compiler over to using Module, but don't
use them yet.
* Remove intermediate attribute encoding in compiler
* Create SugaredValue object to handle resolution
of compiled module.
* switch to_ir to modules, implement Select
* hacky python wrappers
* Private ScriptModule
* Add `define` to script module
* Attributes use TK_LIST_LITERAL
this anticipates adding a real list literal expression to the language.
* Add a metaclass to make sure script stubs are registered
* Add a test
* Doc createResolutionCallback
* Docs and minor editing
* Address PR comments
* Document
* Fix unicode issue
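As a rough usage sketch of the Python-side ScriptModule described above; the exact class/decorator names (torch.jit.ScriptModule, torch.jit.script_method) follow the later public API and are assumptions relative to this changelog:
```
import torch

# Hypothetical sketch: a script module with a parameter and a scripted method.
class MyModule(torch.jit.ScriptModule):
    def __init__(self):
        super(MyModule, self).__init__()
        self.weight = torch.nn.Parameter(torch.rand(3, 3))

    @torch.jit.script_method
    def forward(self, x):
        return torch.mm(x, self.weight)

m = MyModule()
print(m(torch.rand(2, 3)))
```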
The header files needed for the C++ extensions are copied to
torch/lib/include only by `install`. In the case of `bdist_wheel` or `build develop`,
for example, the files are not copied and the cpp_extensions test fails:
```
Running test_cpp_extensions.py ...
running install
running build
running build_ext
/home/moni/src/ibm/AI/pytorch/torch/utils/cpp_extension.py:79: UserWarning:
Your compiler (g++) may be ABI-incompatible with PyTorch.
Please use a compiler that is ABI-compatible with GCC 4.9 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
building 'torch_test_cpp_extension' extension
creating build
creating build/temp.linux-x86_64-3.6
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/moni/src/ibm/AI/pytorch/torch/lib/include -I/home/moni/src/ibm/AI/pytorch/torch/lib/include/TH -I/home/moni/src/ibm/AI/pytorch/torch/lib/include/THC -I/home/moni/miniconda3/envs/pytorch/include/python3.6m -c extension.cpp -o build/temp.linux-x86_64-3.6/extension.o -g -DTORCH_EXTENSION_NAME=torch_test_cpp_extension -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
extension.cpp:1:25: fatal error: torch/torch.h: No such file or directory
#include <torch/torch.h>
^
compilation terminated.
error: command 'gcc' failed with exit status 1
```
* Make use of new BUILD_ENVIRONMENT variable when possible.
Eliminate CI provided environment variables. At the moment, our build scripts depend on a few environment variables which are specified by the CI system and passed down to the build. Based on the build scripts, these environment variables are JOB_NAME, PYTHON_VERSION and GCC_VERSION; variables that depend solely on the image being built and the invoked script.
a. Proposal: A recent rewrite of the pytorch-dockerfiles has embedded a new environment variable, BUILD_ENVIRONMENT, which is automatically set when you run the Docker image. This environment variable subsumes JOB_NAME (this variable doesn't specify if you are “building” or “testing”, but this can easily be inferred from the script that is being invoked.) Make use of this environment variable to compute the other variables.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* syntaxfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* bugfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add efficient isVariable test to ATen.
This is done as a field on Type so that we can define a
non-virtual, inlinable function. The added ASSERTs probably
affect runtime performance; we may need to toggle them off
in non-DEBUG builds.
Fixes #4814.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Rebase and brush up
* is_variable -> is_variable_or_undefined
* Fix arange floating point error
* fix test
* add type cast when calculating arange size
* fix nit
* update test
* use doubles instead of floats to calculate size
* requested changes
* PyObject* <--> at::Tensor no longer unwraps variables; instead we expect end users to always work with variable types, and we will only unwrap the variables when we optimize.
* Add torch::CPU, torch::CUDA and torch::getType
* at::CPU -> torch::CPU in extensions
* Update jenkins build script using the same flag as used in benchmarking
* Add a recently added flag
* Remove BUILD_OBSERVERS flag since it is no longer used
* Add torch.empty, torch.full and new_* size-based Tensor factory methods.
This adds torch.full, torch.empty equivalents of np.full, np.empty.
In addition, this adds size-based Tensor factory methods new_empty, new_ones, new_full, new_zeros,
which is meant to complete the separation of the legacy "new" method into data-based and size-based
functions.
This also fixes an issue in sparse zeros_like when the dtype didn't match the argument dtype.
* Get rid of unnecessary zero in sparse tensor zeros_like.
* Fix test if only 1 cuda device.
* Support native namespace functions with type dispatch.
Use 'ones' as an example. Note this is a "halfway" solution; i.e. the call chain is:
at::ones(shape, dtype) -> dtype.ones(shape, dtype) -> CPUFloatType.ones(shape, dtype) -> at::native::ones(shape, dtype)
The "nicer" solution would probably be something like:
at::ones(shape, dtype) -> dtype.ones(shape) -> CPUFloatType.ones(shape) -> at::native::ones(shape, this)
* Fix type inference.
* Fix test install.
* Fix extensions.
* Put dtype argument at the beginning.
* Fix extension.cpp.
* Fix rnn.
* Move zeros in the same manner.
* Fix cuda.
* Change randn.
* Change rand.
* Change randperm.
* Fix aten contrib.
* Resize in randperm_out.
* Implement eye.
* Fix sparse zeros.
* linspace, logspace.
* arange.
* range.
* Remove type dispatch from gen_python_functions.
* Properly generate maybe_init_cuda for type dispatch functions not named type.
* Don't duplicate dtype, this parameters for native type dispatched functions.
* Call VariableType factory methods from the base type so it gets version number 0.
* Address review comments.
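For illustration, a minimal usage sketch of the torch.empty/torch.full and new_* factories added above:
```
import torch

a = torch.empty(2, 3)               # allocated but uninitialized values
b = torch.full((2, 3), 7.0)         # every element set to 7.0

x = torch.ones(2, 2)
c = x.new_zeros(4, 5)               # same dtype/device as x, new shape
d = x.new_full((1, 3), 0.5)
e = x.new_empty(3)
```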
* fix comment on the location of scale and bias (offset) in each fused rowwise 8bit
* Update fused_rowwise_8bit_conversion_ops.cc
* Update lengths_reducer_fused_8bit_rowwise_ops.cc
* Update lengths_reducer_fused_8bit_rowwise_ops.cc
* CPU int-types pow()
* CUDA int-type pow()
* Cleanup + fix deleted line
* Tests for integer-types pow
* Fix build
* Fix windows tests
* Make _test_int_pow static
This improves backwards compatibility with 0.3. It adds support for
the out kwarg for the deprecated overloads that have optional
positional alpha/beta/scale arguments.
The addcmul(self, value, tensor1, tensor2, out=self) syntax is used by
gpytorch.
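An illustrative sketch of the deprecated overload this keeps working (a positional scalar value plus out=, as used by gpytorch):
```
import torch

x = torch.randn(4)
t1 = torch.randn(4)
t2 = torch.randn(4)

# Deprecated 0.3-style overload: positional `value` plus out=self,
# i.e. x <- x + 0.5 * t1 * t2 written back into x.
torch.addcmul(x, 0.5, t1, t2, out=x)
```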
previously, it was being implicitly imported via the import of
torch.onnx
this is no longer the case, and is a hacky thing to depend on anyway,
so import it explicitly
ExportProxy was a mechanism to reuse the code that supported exporting
autograd Functions to support overriding arbitrary python
functions. However, it had some serious downsides
- only works on some functions (all args must be Variable)
- complicated
- bad error messages in some cases
Instead, just expose enough functionality to python to perform the
necessary logic explicitly.
* add end to end test for DistributedDataParallel
* address comments
* skip subgroup tests when less than 3 processes
* set process number based on available gpus
* add single gpu;cleanup WORLD_SIZE
* fix comments
* implement CosineEmbeddingLoss as a native function and add reduce=True arg to it
* fix flake8
* address comments
* add reference function to tests
* fix flake8
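A small illustrative sketch of the new reduce flag on the native function described above:
```
import torch
import torch.nn.functional as F

x1 = torch.randn(5, 10)
x2 = torch.randn(5, 10)
y = torch.ones(5)        # +1 for similar pairs, -1 for dissimilar pairs

loss = F.cosine_embedding_loss(x1, x2, y)                      # reduced (default)
per_sample = F.cosine_embedding_loss(x1, x2, y, reduce=False)  # one loss per pair
```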
This PR adds the possibility to build the C++ parts of autograd and jit, with no dependency on Python.
The goal is to allow taking a PyTorch IR representation (a tree s-expr) and running it with provided inputs.
Prerequisite: build PyTorch so that codegen runs once.
Instructions:
cd tools/cpp_build
bash build_all.sh
This will build libtorchjit and torchjit_test in tools/cpp_build/build/torchjit-build. The latter basically runs the code in test_jit.cpp for now.
While writing the PR, it turned out that a few of Python.h includes were redundant. They were removed here (PyTorch tests still pass on my machine, we'll see CI).
* Introduce Python-free builds of autograd and jit
* Remove NO_PYTHON ifdef in functions/special
* Improve documentation
1. Add formula for erf, erfinv
2. Make exp, expm1 similar to log, log1p
3. Symbol change in ge, le, ne, isnan
* Fix minor nit in the docstring
* More doc improvements
1. Added some formulae
2. Complete scanning till "Other Operations" in Tensor docs
* Add more changes
1. Modify all torch.Tensor wherever required
* Fix Conv docs
1. Fix minor nits in the references for LAPACK routines
* Improve Pooling docs
1. Fix lint error
* Improve docs for RNN, Normalization and Padding
1. Fix flake8 error for pooling
* Final fixes for torch.nn.* docs.
1. Improve Loss Function documentation
2. Improve Vision Layers documentation
* Fix lint error
* Improve docstrings in torch.nn.init
* Fix lint error
* Fix minor error in torch.nn.init.sparse
* Fix Activation and Utils Docs
1. Fix Math Errors
2. Add explicit clean to Makefile in docs to prevent running graph generation script
while cleaning
3. Fix utils docs
* Make PYCMD a Makefile argument, clear up prints in the build_activation_images.py
* Fix batch norm doc error
* [C2] Don't crash kernel in case of invalid shapes for ConcatOp
Enforce correctness of the shapes for input tensors so we won't access invalid index.
* [Caffe2] Add analytical performance counters to Dynolog
Initial diff for counting analytical flops and memory writes for C2 operators.
* BBoxTransform op: Handle RoIs from multiple images per batch
BBoxTransform op used during typical Faster-RCNN inference operates only on
RoIs from a single image (no batching). Adding support to handle that with an
optional output blob containing the batch splits (i.e., the number of RoIs
belonging to each item in the batch). The code is perfectly backward compatible
and shouldn't break any existing models.
* [mkl] Make MKL-DNN cooperate with memongered nets
C2's MKL-DNN implementation caches input dims and reuses intermediate and
output buffers across net runs, which prevents memonger from being used. This
may not always be useful since input dims may vary widely in many cases and
we'll end up reallocating anyway. Added an option to force reallocation when
memonger is used.
* [oncall] fix batch gather ops for empty input
still need to bisect for the breaking change, but this shall fix the case for empty input.
The error logging looks like: https://interncache-ftw.fbcdn.net/t49.3276-7/23938497_293562711176943_6500112636590424064_n.txt?_nc_log=1
@[557759185:raychen] can you help subscribe the oncall from the ads side? This may affect the Sigrid online trainer.
* optimize BatchOneHotOp
We want to iterate in row-major as opposed to column-major for better
locality.
* Supported exporting model with int blobs.
Supported exporting model with int blobs. Needed by condensenet.
* BoxWithNMSLimit op: Handle boxes from multiple images per batch
Similar to D7135360. Added support for multiple images per batch in the op.
Takes an optional additional input "batch_splits" as output by BBoxTransform
op, and returns new batch_splits after applying NMS and filtering. Otherwise,
backward compatibility is maintained.
Questions/possible future works:
How to template-ize to extend support beyond LongTensor?
How to check if autograd works (and if not, how to add explicit gradient)?
CUDA support?
Testing command:
`DEBUG=1 NO_CUDA=1 MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py build && DEBUG=1 NO_CUDA=1 MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py develop && python3 test/test_torch.py`
Partially fixes #2031
* Initial commit for unique op
* Working unique with test
* Make inverse indices shape conform to input
* flake8 whitespace removal
* address review comment nits
* Expose fn and add docs. Explicitly declare no gradients
* Trial generic dispatch implementation
* Add tests for generics
* flake8 whitespace
* Add basic CUDA error throwing and templateize set
* Explicit contiguous and AT_DISPATCH_ALL_TYPES return
* Remove extraneous numpy conversion
* Refactor out .data calls
* Refactored to variable return length API with wrapper fn as opposed to returning a 0-length tensor, per off-line reviewer comments
* Remove A
* Don't use hidden torch._unique() in test
* Fix documentation
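For illustration, a brief usage sketch of the exposed op; the return_inverse flag follows the public signature and is an assumption beyond what the bullets above state:
```
import torch

x = torch.LongTensor([1, 3, 2, 3, 1])

values = torch.unique(x)                                 # unique values
values, inverse = torch.unique(x, return_inverse=True)
# `inverse` has the same shape as `x`; inverse[i] is the index of x[i] in `values`.
print(values, inverse)
```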
Summary:
Executing loop's body in a separate workspace, using WorkspaceStack to
support saving and reusing of workspaces
Test Plan:
python caffe2/python/operator_test/onnx_while_test.py
Reviewers: caffe2-review, jamesreed
We'll want to reuse this logic for Int8 Reshape, but currently the code assumes
Input(0) and Output(0) are TensorCPUs, which may not be the case for a
subclass.
CMake 3.2 is required to properly track dependencies in projects imported as ExternalProject_Add (BUILD_BYPRODUCTS parameter).
Users on Ubuntu 14.04 LTS would need to install and use cmake3 package for configurations. Users of other popular distributions generally have a recent enough CMake package.
This op is used for gradient clipping to take care of exploding / vanishing gradients.
If original_norm is larger than the threshold,
then each element of the tensor is scaled by threshold / original_norm.
Adding NUMA awareness through numa_node_id in DeviceOption. Blobs of operators
with numa_node_id are allocated on corr. memory banks, using CPU pools with
NUMA affinity set to run operators.
With python3, np.int defaults to int64. This diff should fix it. I don't know if a test already exists for this function; however, the following ASR test was breaking when I switched to py3:
```
buck test caffe2/caffe2/fb/speech/asr_training/:tensor_parser_test
```
After D6953547 some of the blobs were no longer impacted by uint8 quantization,
but they would still generate operators expecting uint8 inputs and thus fail.
This diff adds a temporary hack to avoid doing this quantization when the layer
is not quantized.
Will fix it properly by switching to Net rewriting instead.
There is a bug in ConvOp. The SetDeviceTensor function only copies data to the tensor when the sizes of the two are different. In the 3d convolution case for video models, img_shape_device_ (NCTWH) is modified only for the first processed example; for the following examples it won't get updated, because img_shape_device_.size() == img_shape.size(). However, it should get updated for each example, because T changes across videos. The same applies to col_buffer_shape_device_.
In this diff, if any dimension of img_shape_device_ differs from img_shape, img_shape_device_ gets updated.
- Remove some uses of mega-header THP.h
- Use HANDLE_TH_ERRORS in functions that may throw
- Move NumPy includes to common header
- Delete unused allocator
* WIP: Fix Out of Memory failure in test TensorTest.Tensor64BitDimension
* WIP: update warning message and wrap resize inside TensorTest.Tensor64BitDimension
* WIP: only catch exception which is related to out of memory
* WIP: add return in the out of memory exception
Hopefully this fixes the following assertion failure:
/var/lib/jenkins/workspace/aten/src/ATen/test/native_test.cpp:102: test:
Assertion `d5.matmul(d1).allclose(d5.view({24, 2, 3}).bmm(d1.view({1, 3,
1}).expand({24, 3, 1})).view({3, 2, 4, 2}))` failed.
(this error seems to only occur on ASAN tests...)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Use pre-computed offset increments to avoid int division inside kernels.
- OffsetInfo and OffsetIterator pre-computes the necessary coordinate
change along each dimension, so that each successive offset can be
computed using only addition/subtraction/comparisons.
- Added IntDivider which supports "magic division" for uint32_t, thus
eliminating integer divisions altogether for offset calculation, as
long as indices fit in 32 bits.
- In code paths with statically determined dimensions (Dims=1 or 2),
kernel arguments now contain only the necessary data (instead of
MAX_CUTORCH_DIMS of everything).
- Fixed index overflow errors: for tensors with >= 2G elements, we used
to have incorrect results or an infinite loop inside the kernel.
TODO: The following pattern is broken for tensors with >= 2G elements.
It will result in overflow, even if IndexType is uint64_t. Need
to search and replace them.
> for (IndexType linearIndex = blockIdx.x * blockDim.x + threadIdx.x;
> linearIndex < totalElements;
> linearIndex += gridDim.x * blockDim.x) {
* Update CMakeLists.txt
* Removed OffsetIterator, and kept only the fast integer division logic.
- Also changed canUse32BitIndexMath so that the max index for 32-bit
math is INT32_MAX, instead of UINT32_MAX. It also simplifies the
division operation.
* Merged OffsetInfo into THCTensorInfo.cuh.
* Scope MultiRNN blobs with name as well as layers
Also don't double scope MultiRNN in case of multiple layers.
* Scope input projection of first layer with name
We don't scope it with layers because the projection is done
outside of the layer.
* Avoid scoping input blob in MemongerTest.test_rnn
* Rectify input_blob in prepare_input
Revert change in memonger_test because rectifying input will solve the problem.
Summary: Fix documentation for WeightedSumReducerDef to be more general since it applies to both Sparse and Dense ops
* First attempt on sqrt op
* Adding the Sqrt op along with the test cases
* Made changes per @Yangqing's questions re: tensor format and used hypothesis to generate input tensor
* Check if node output matches in shape propagation
* Fix list attributes and view shape propagation
* fix inferred shapes for view
* Fix shape inference for integrally typed tensors
* Fixes for concat in control flow
* Fix print
* Fix a bug in gen_jit_dispatch.py
The `fromLast` function is confusing to understand since `fromLast(stack, 0)`
was actually invalid whereas `fromLast(stack, 1)` was the last element.
This created off-by-one bugs in gen_jit_dispatch for some operators.
This changes it to `peek(stack, i, N)` which treats the last `N`
elements of the stack as a list, and extracts element `i` of that list.
This usage reflects how `fromLast` was actually being used in the code.
`peekSlice(stack, i, len, N)` similarly treats the last N elements
as a list but extracts a slice. This enables us to get rid of
drop calls and simplify the dispatch logic.
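To make the indexing convention concrete, here is a small Python analogue of the described helpers (the real implementations are C++; this is just a sketch):
```
# peek(stack, i, N): treat the last N entries as a list and return element i of it.
def peek(stack, i, N):
    return stack[len(stack) - N + i]

# peekSlice(stack, i, len, N): same view of the last N entries, but return a slice.
def peek_slice(stack, i, length, N):
    start = len(stack) - N + i
    return stack[start:start + length]

stack = ['a', 'b', 'c', 'd']
assert peek(stack, 0, 2) == 'c'                  # first of the last two entries
assert peek(stack, 1, 2) == 'd'                  # the very last entry
assert peek_slice(stack, 0, 2, 3) == ['b', 'c']  # slice of the last three entries
```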
* Add TARGETS for ATenOp (hackily)
This is the best way I could figure out to hook up custom_rule. See https://fb.prod.facebook.com/groups/fbcode/permalink/1810939952287945/ for more details on why it's tricky.
As for the fix with SparseTensor - it seems to be a bug in ATen declarations introduced recently.
* cmake fixes
* Port cuDNN RNN dropout state initialization to ATen and make Python code use it.
Fixes #5138.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Variable/Tensor bugfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This also starts generating dispatch code for __and__ and similar
variants. I was too lazy to see if we have committed the '__and__ is
not inplace' mistake in other places.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The nn.* counterpart of #5443. Mostly removed Variable wrappers. Also added doc for nn.RReLU.
Notice that torch.randn(*, requires_grad=True) isn't documented until #5462 is done.
* Add dtype to torch.Tensor, torch.FloatTensor, etc.
* Support passing dtypes to set_default_tensor_type.
* Check dtype exception.
* Correctly handle new type initialization order.
* Move handling of torch.Storage alias to C++.
* Delete function that erroneously reappeared.
This PR enables the following tests on Windows again:
CUDA HalfTensor tests in test_torch.py and test_nn.py
test_Conv2d_deterministic_cudnn in test_nn.py
test_*Tensor_qr_big in test_cuda.py
The issues are no longer reproducible, possibly because of an upgrade to the display driver.
* Reenable CUDA HalfTensor tests on Windows
* Reenable test_Conv2d_deterministic_cudnn on Windows
* Reenable test_*Tensor_qr_big on Windows
* Adding openmpi to all conda builds
* Typo and turning off quiet
* Removing openmpi from non_cuda conda build
* Actually openmpi is already in the images
Simplifies type dispatch to a consistent use of macros (not macros here and functions there), and
adds the dispatch header to ATen/ATen.h so that users (e.g. writing extensions) can dispatch too.
* Refactor and simplify ATen dispatch
* cuda/Dispatch.h -> cuda/Dispatch.cuh
* Change dispatch strategy for half
* Use __VA_ARGS__ and get rid of parantheses
* Remove rogue UnderlyingType.h
* Fix TensorCompare.cu and add comment
* Include CUDATensorMethods in TensorCompare.cu
* to_cuda_type -> cuda::type and move AccumulateType out of native
* Add Python function calls to script
* Script compiler gains a `Resolver` object that runs when it does not understand a function call. This decouples the python resolution from the conversion to IR.
* Add source information to IR nodes
SourceRange information from the script is now propagated to IR nodes.
This information is only used in two places now: the interpreter
wraps errors that occur when an instruction executes, and shape
propagation now reports errors on the line where it fails:
Traceback (most recent call last):
File "test/test_jit.py", line 1655, in test_script_error
bar(Variable(torch.rand(10), requires_grad=True), Variable(torch.rand(9), requires_grad=True))
RuntimeError:
The size of tensor a (10) must match the size of tensor b (9) at non-singleton dimension 0:
@torch.jit.script
def bar(c, b):
return c / b
~~~~~ <--- HERE
In the future, shape propagation should really not report any size
errors and instead just not propagate shapes and let the actual
execution fail. However, this is hard to accomplish while we still
depend on running the op to do shape propagation.
In pytorch, after pad_packed_sequence, the "extra" elements (after the
ends of the sequences) are reset. In the equivalent Caffe2 graph
exported via ONNX, they contained some leftover values, which caused
tests to fail. Probably no one depends on these values, but just in
case, set them to zero to mimic pytorch semantics.
* Revert "Fix wrong argument name (#5366)"
This reverts commit cc9d3b265d7e688865fde055ee3a2f9b77b5714a.
* Fix wrong argument naming
* Revert "Wrap torch::cuda::lazy_init with WITH_CUDA flag"
This reverts commit a8fa37f8fac5aef09eb7fe54d84de6126618c262.
* Revert "Solves the linking error related to lazy_init for MSVC"
This reverts commit 63913a102f274865a76e7c40ffdf6b40c277d5ff.
* better solution for the linking error related to lazy_init for MSVC
* Naming changes
* Namespace changes and further comment
* Rebasing onto current master
* Remove code that is useless
* Fix linting
* Remove rebasing bugs
* Handle legacy pad in Caffe2==>ONNX converter, also remove fake initializer
* Address the comments, 1) have filtering fake initializer before ssa rewrite, 2) polish the legacy padding handling logic
* Add test cases to cover the code just added
* Nit
* Add support for device python arguments with constructors.
* Fix flake8.
* Simplify device handling.
* Dont use torch._C._VariableFunctions.
* Handle default values for functions that have tensor args (e.g. ones_like).
* Support dtypes in legacy new constructors.
* Add comment about why we don't have dtype for sparse (indices, values).
* separate legacy tensor ctor vs new (new includes dtypes).
* Use TypeError.
* Check if CXX compiler supports all the needed functions
This commit improves the code for PR #5230 according to
@ezyang's comments. Instead of checking ubuntu/gcc versions, it
checks support for the needed functions in the C++ compiler
using CHECK_CXX_SOURCE_COMPILES.
Fixes: #5229
* cmake target - work in progress
* wip cmake public targets
* Add missing INTERFACE keyword
* Add cuda public dependencies
* Add dependency for test targets
- Remove USE_ARM64 option because it doesn't do what is expected
- Disable ARM ComputeLibrary for non-ARM/ARM64 builds
- Remove analysis of CMake options from scripts/build_android.sh
- Add user-specified CMake options at the end of command line to allow overriding defaults
- Update README for ARM ComputeLibrary integration and do not require to disable NNPACK for ARM64 build with ARM ComputeLibrary
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.
This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.
Support shape propagation with control-flow
* This allows us to enable optimization in the GraphExecutor for most
script tests.
* Changes Type to always be present (non-null) on a Value, removing `hasType()`
and `typeOption()`. A new type kind 'DynamicType' now represents when
a specific type has not been determined.
* If/Loop nodes propagate shapes/types in the simple cases where types of
outputs do not change depending on where control flows. In other
cases, we propagate DynamicType to indicate we do not know what
the shape will be.
* Remove the `cond` input to the body of Loop to simplify handling in
interpreter and shape propagation.
* Bugfix for zero-dim contiguousStridesOf
* torch.jit.trace annotation now creates a GraphExecutor
The other torch.jit.trace, which was used for testing purposes and for onnx to get the trace graph, is now called torch.jit.get_trace_graph.
* @script annotation, and compilation unit for strings
Added functionality to GatherRangesToDenseOp such that it supports an optional input KEY, and will sort DATA according to KEY for each example per feature.
* Update doc of batch size requirements for DP
Fix #5039
* Delete the recommendation for batch size
There's no significant speed difference between divisible and indivisible batch size.
* [C2] Implement Layer-wise Adaptive Rate Scaling (LARS)
* [C2] Implement Layer-wise Adaptive Rate Scaling (LARS)
* add unit test for Lars
* set default value for lars to be None
* remove lars for subclasses of SgdOptimizer
* [cmake] Move nccl to modern cmake, and avoid using EXTERNAL_DEPENDENCIES
* [cmake] Move nnpack to modern cmake and avoid using EXTERNAL_DEPENDENCIES.
* [cmake] Move ATen to modern cmake and avoid using EXTERNAL_DEPENDENCIES.
* Move cpufeatures to modern cmake, and avoid using EXTERNAL_DEPENDENCIES
* Finally remove EXTERNAL_DEPENDENCIES.
* Maratyszcza's comments
* Pin libnccl2 to version 2.1.2
Version 2.1.4 exports C++ symbols that it shouldn't, which causes a
mismatch between raised exceptions and expected exceptions.
Pin this to 2.1.2 until this is solved and NVIDIA releases a new version.
* Fix for 9.1
* Actually pin 2.1.4 for 9.1
Additionally:
- add support for calling functions that are not methods in the Python frontend
- add an end-to-end test for the Python frontend
- add a capture_stdout helper for checking that `print` actually works
* [WIP] moving conda scripts to separate build+test
* [WIP] Splitting conda-builds into build and test phases
* Migrating build_local to call build_anaconda
* Tidying up a regex
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.
To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.
There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:
https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
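A short sketch of what the change means for user code:
```
import torch

# torch.randn now returns a Variable directly; no explicit wrapping is needed
# before using autograd.
x = torch.randn(3, requires_grad=True)
y = (x * x).sum()
y.backward()
print(x.grad)
```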
* Add python typing module as build dependency
* Change output_declarations to be a NamedTuple
* Add mypy configuration files
mypy-files.txt includes a list of all files that should be type checked
with mypy. Run mypy with `mypy @mypy-files.txt`.
mypy.ini includes mypy options. Unfortunately this can't be merged with
mypy-files.txt.
Update .travis.yml so that one doesn't have to specify what files to
type check inside it.
* Add RuntimeError on missing `typing` module
Alerts users to the new build dependency.
* Handle copying empty sparse tensors to/from CPU, GPU.
This is likely not a robust fix because it special cases the case where both the indices and values are empty
rather than handling each one separately. But this is currently blocking a change introducing devices to constructors.
* Guard sizes being NULL.
* Revert "Fix wrong argument name (#5366)"
This reverts commit cc9d3b265d7e688865fde055ee3a2f9b77b5714a.
* Solves the linking error related to lazy_init for MSVC
* Fix wrong argument naming
* Wrap torch::cuda::lazy_init with WITH_CUDA flag
* Also pass torch includes to nvcc build
* Export ATen/cuda headers with install
* Refactor flags common to C++ and CUDA
* Improve tests for C++/CUDA extensions
* Export .cuh files under THC
* Refactor and clean cpp_extension.py slightly
* Include ATen in cuda extension test
* Clarifying comment in cuda_extension.cu
* Replace cuda_extension.cu with cuda_extension_kernel.cu in setup.py
* Copy compile args in C++ extension and add second kernel
* Conditionally add -std=c++11 to cuda_flags
* Also export cuDNN headers
* Add comment about deepcopy
* Use stacks in the interpreter/aten_dispatch
Rather than have separate input/output lists,
the interpreter now works using a single stack.
Operators in the interpreter push/pop from the stack.
This allows ownership of tensors to transfer directly to an operator,
and an operator can drop the reference to a tensor as soon as it is
no longer needed. This is important for the GraphExecutor op,
which recursively runs the interpreter.
Once autograd is updated to pass variables to Function by value,
we will be able to ensure that we release ownership as soon as possible.
This commit also switches the interpreter to use a fake
tensor 'ContainerTensor' rather than at::Retainable to hold non-tensor
data in the interpreter. This allows us to use std::vector<at::Tensor>
for all registers, which is significantly less confusing than the
OwnedRetainables struct it was replacing.
* Add If and Loop to interpreter
* Preprocess loop to calculate where references to tensor should be dropped
* Add control instructions JumpZ/JumpNZ/Jump
* Switch from explicitly having stage structs to having a single list
of instructions with Store/Load instructions to take values off the
initial stack
* Make the interpreter tests executable rather than use expect files
* add a flag to interpreter code so that constants are variables
if the interpreter is running on variables.
* Add tensor_as to its own file
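A toy Python analogue of the single-stack calling convention described above (the real interpreter is C++; this only illustrates the push/pop discipline):
```
# Each op pops its inputs off the shared stack and pushes its outputs,
# so the stack's references are dropped as soon as the inputs are consumed.
def run(instructions, stack):
    for op, num_inputs in instructions:
        inputs = stack[-num_inputs:]
        del stack[-num_inputs:]          # remove the consumed entries from the stack
        stack.extend(op(*inputs))
    return stack

add = lambda a, b: [a + b]
neg = lambda a: [-a]
print(run([(add, 2), (neg, 1)], [3, 4]))   # [-7]
```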
This will make it easier to bring online new CI configurations
without temporarily breaking the CI, since you can mark it
as disabled in PyTorch HEAD first and then bring the job online.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* at::maybe_data_ptr and Check.h => TensorUtils.h
* THNN support for optional BN running_*
* ATen support for optional BN running_*
* Python nn.* support for optional BN running_*; Improve IN and BN doc
* Add tests for IN and BN new option
* Layer Norm
* Fix LRN doc
* functional interface for LN and IN
* Layer norm tests
* fix BN double backward returning undefined tensors
* fix jit test using wrong dim inputs for BN
* add/improve BN, IN and LN GPU tests with half type
* Update docs to be consistent with Conv notation
Fix onnx
Clarified onnx symbolic wrapper
* fix typo
* Address comments
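For illustration, a short usage sketch; the module and flag names (nn.LayerNorm, track_running_stats) follow the released nn API and are assumptions relative to this changelog:
```
import torch
import torch.nn as nn

# Layer Norm as a module, and Batch Norm configured not to track running statistics.
ln = nn.LayerNorm(10)
bn = nn.BatchNorm1d(10, track_running_stats=False)

x = torch.randn(4, 10)
print(ln(x).shape, bn(x).shape)
```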
The old pow operator has been deleted in math_ops.cc, math_ops.cu and math_ops.h, while the new operator supporting scalar and tensor exponents has been added in pow_op.cc, pow_op.h and elementwise_op.cu.
* Various dtype improvements.
1) Add dtypes to the new data-based constructors: Variable.new_tensor and torch.autograd.variable.
2) In the python signatures, use Type instead of Dtype to match the C++ signatures; the error messages still print as dtype.
3) Handle / add a better error message when a dtype is used when ATen was not compiled with that type (e.g. cuda types).
4) Move cuda_lazy_init to its own file.
A later commit will add support to the legacy constructors as well.
* Move implementation of lazy_init to cpp.
* Fix parsed_arg size.
* Improve Function interface
* Undo tracer changes
* Fix bug in VariableType.set_history
* Rename function_counter and sequence_number to sequence_nr
* Clarify Function documentation
* Replace swap_next_edges with next_edges() getter
* Bring back set_gradient_edge
* Simplify special.cpp
* add_gradient_edge -> create_gradient_edge
* Add mutable getters for pre/post hooks
* Use make_variable with Edge
* Remove remove_gradient_edge in favor of detach_
* Fix documentation and remove create_gradient_edge friend method
* Canonicalize some includes
* Reduce dataset size for word_language_model; increase NUM_RUNS for all GPU tests
* Test check_cpu_governor option
* Update perf test numbers for CPU and GPU
* Fix public protobuf interface - wip
* Try turn on custom protobuf in mac jenkins.
* Adding back auto-fallback protobuf option
* Address typos pointed out by reviewers
* Remove OpenGL code from benchmark
* Make it possible to print a plot in the ipython notebook
* Create the blob if the blob is not specified in the init net
* Do not use gf library for MKL. Even after I install the entire MKL library it is still not found. After removing it, the MKL code can still run
* Support more backends in Caffe2 Benchmark
* Revert "Do not use gf library for MKL. Even after I install the entire MKL library it is still not found. After removing it, the MKL code can still run"
This reverts commit 981b6693a94cbf63ad78d51bd806c7a0d7a5a2d3.
* Build caffe2_benchmark using shared or static library depending on the flag
* Document env vars and properly propagate MAX_JOBS down.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Apply CFLAGS and LDFLAGS environment variables to cmake builds.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Test that running the built program works; fixes #5151.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CMake CR.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This commit updates python-peachpy submodule to bring in the fix.
In #1543 @samarjeet reported that importing caffe2 from Python fails on his system with the error "CRITICAL:root:Cannot load caffe2.python. Error: libcaffe2.so: cannot enable executable stack as shared object requires: Invalid argument". I investigated and found that this is caused by libcaffe2.so being marked as requiring an executable stack, which itself was caused by assembly (PeachPy) files in NNPACK not specifying whether they need an executable stack (by default, the linker assumes execstack is needed). I patched PeachPy to add a ".note.GNU-stack" section to generated ELF files, which makes the linker mark libcaffe2.so as NOT needing an executable stack. See Maratyszcza/PeachPy#89 for details.
Adds another package to Anaconda.org with a "-full" suffix which includes more libraries by default. This also installs NCCL 2.1 onto the CI Ubuntu docker images to accomplish this.
* Add numpy-style dtypes to Variable factories.
1) Add numpy-style dtypes corresponding to torch tensor types. These are:
torch.float16, torch.float32, torch.float64, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64
as well as torch.cuda, torch.sparse, and torch.cuda.sparse equivalents.
2) Adds "legacy" names for the above dtypes that correspond more closely to existing tensor names. These are:
torch.half, torch.float, torch.double, torch.short, torch.int, torch.long.
torch.byte and torch.char don't exist because they either don't match numpy semantics or differ on different architectures.
3) Adds a "dtype" parameter to Variable factories (e.g. zeros, ones) that allows the user to specify the type without changing the default tensor type.
4) Adds a "dtype" getter to Variables that return the canonical dtype from 1)
This PR is missing the following useful features that should be added in the future:
A) We only add the "dtype" parameter to auto-generated factories; hand-written factories like in tensor_new.cpp don't support this yet.
B) We don't allow type conversions to use dtypes; that should be added to type(param) or a new function.
C) We don't yet have a "device" parameter for these factories; right now, they will only create Variables on the default device.
* backend_to_string can be private.
* Define python binding argument indexes in a more simple way.
* add all_declared_types, still need to hook it up to THPDType.
* Fix all_declared_types for missing types (it's Sparse + Half).
* Ensure cuda dtypes are created even if compiled with NO_CUDA=1.
* Fix case where dtype is provided but dispatch is via namespace.
This happens in ones_like, empty_like, randn_like.
There is some question if we should do:
1) at::ones_like(tensor).toType(dtype)
2) at::ones_like(tensor.toType(dtype))
I did the former because this matches with the numpy documentation, i.e.:
"Overrides the data type of the result." and it's easier to implement.
Note that the above causes an extra copy, either of the input or output.
Here's a better implementation:
1) Make zeros_like, ones_like native functions that take an optional type (named dtype?).
2) Match the type argument with the dtype, so we don't have two different parameters.
3) Call at::zeros_like(input, type) -> at::native::zeros_like(input, type) -> type.zeros(input.sizes())
* Don't return from maybe_initialize_cuda.
* Don't leak DType name.
* Address cpp review comments.
* Share code between sparse and non-sparse test_dtypes.
* Rewrite _like functions as native function with explicit type parameter.
* Use type 'Type' instead of 'dtype' for consistency.
* Address review comments.
* Handle arg_idx when there is requires_grad but no dtype in python_binding_arguments.
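A brief usage sketch of the dtype objects and the dtype= factory argument described above:
```
import torch

x = torch.zeros(2, 3, dtype=torch.float64)   # numpy-style name
print(x.dtype)                               # torch.float64

y = torch.ones(4, dtype=torch.int16)         # legacy alias: torch.short
print(torch.int16 is torch.short)            # True
```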
This adds at::_unsafe_view and uses it in matmul. The _unsafe_view
function is identical to view except that the output is not treated
like a view by the automatic differentiation code. This avoids in-place
modifications triggering the more expensive CopySlices/AsStridedBackward
behavior.
The _unsafe_view function is only safe to use on temporaries that will
be immediately discarded and that do not alias other tensors. Otherwise,
in-place modifications may trigger incorrect gradients. The function is
not exposed to Python.
See #5169
* Fix asan buffer overflow in autograd saved_variable.cpp
* Fix asan global buffer overflow in any_variable_requires_grad
* Revert change in any_variable_requires_grad
* Fixes UB when using legacy python functions and mark_non_differentiable
If an output of a python Function is marked as non_differentiable,
autograd won't save a gradfn for that output. During the backward
pass, this translates to an undefined tensor being passed to the
backward of the Function. The legacy python Function path checks
if *any* of the inputs to backward requires_grad.
This requires_grad check uses Variable::get(), which casts the
undefined tensor to a VariableImpl and then accesses the _requires_grad
member. This is UB because the undefined tensor is NOT a VariableImpl.
The fix here is to add a check for if the variable/tensor is defined
in the legacy python Function code path.
* s/and/&&/
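For illustration, a sketch of the pattern involved; it is written in the modern static-method Function style (the bug above was in the legacy Function path) and the class name is hypothetical:
```
import torch
from torch.autograd import Function

# One output is marked non-differentiable, so its gradient arrives as
# None/undefined in backward and must simply be ignored there.
class DoubleWithFlag(Function):
    @staticmethod
    def forward(ctx, x):
        flag = (x > 0).long()
        ctx.mark_non_differentiable(flag)
        return x * 2, flag

    @staticmethod
    def backward(ctx, grad_out, grad_flag):
        # grad_flag is not a usable gradient; only grad_out matters here.
        return grad_out * 2

x = torch.randn(3, requires_grad=True)
y, flag = DoubleWithFlag.apply(x)
y.sum().backward()
print(x.grad)    # all twos
```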
Summary: as title. This is similar to the Python pprint utility for nested json data structures. It can be useful for checking schemas during debugging.
Reviewed By: kittipatv
Differential Revision: D6710767
fbshipit-source-id: e450aa5477fa1ad4f93c4573f8108a2f49956da8
Summary: We are going to enable the `-Werror=unused-parameter` flag, and I need to manually fix some files so the rest of this process can be automated with a tool called clang-tidy.
Reviewed By: yfeldblum
Differential Revision: D7012203
fbshipit-source-id: 585e9e89d916dca8894308438d0c985cb1e1b07a
Summary: The original implementation averaged the momentum across the embedding dimensions, which doesn't make any sense. This meant all the embedding dimensions received the same update, becoming a very memory-expensive one-dimensional embedding.
Differential Revision: D7003135
fbshipit-source-id: ed54e3427bc13895a4e949e96b4b17f6ebfb6d53
Summary:
Fixes an annoying warning when building for Android with tests enabled.
Closes https://github.com/caffe2/caffe2/pull/1970
Reviewed By: pietern
Differential Revision: D7011817
Pulled By: Maratyszcza
fbshipit-source-id: 06162d5c5b12ed939581ce9a8498fbed3eb2c47b
Summary: Fix logic in operator's event synchronization: Record might be called after async CPU op calls SetFinished
Reviewed By: azzolini
Differential Revision: D7003277
fbshipit-source-id: 4d77d6619c6403e71ba45fbaaf78e939982452b6
Summary:
In some cases we were doing quantization even when we should not. This diff
prevents this from happening.
Reviewed By: rayleichen
Differential Revision: D6953547
fbshipit-source-id: 7c65baaf969e5e1bddb68ca8182f4f3b43f2431d
* Add a FAQ, for now just 'out of memory' advice.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Updates based on comments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* minor copyedit
Summary: EQ op should work on bool type.
Reviewed By: ender-wieczorek
Differential Revision: D6992905
fbshipit-source-id: 9a08c8b840963c9817405c7602a7f67dc6a6caab
The torch.max() and torch.min() CUDA kernels should be initialized with (min_value, 0) and
(max_value, 0), respectively, where the second number is a default index
value. However, they were being initialized with (max, 1) and (min, 1)
instead, probably a remnant from the Lua torch days.
This caused bugs in torch.max() and torch.min() when the input is at the
extreme values, and the max value (or min value) occurs at index 0. For example,
import torch
x = torch.ByteTensor([[0]])
x.cuda().max(dim=0) # returns (0, 1) but the expected result is (0, 0)
Summary: We are going to enable the `-Werror=unused-parameter` flag, and I need to manually fix some files so the rest of this process can be automated with a tool called clang-tidy.
Reviewed By: yfeldblum
Differential Revision: D7001946
fbshipit-source-id: 680d812c98703ec57a9eb952a69c6316e7415be8
Summary:
There is a typo in setup.py which causes an incomplete install. This fixes it.
Closes https://github.com/caffe2/caffe2/pull/1968
Reviewed By: bddppq
Differential Revision: D7000517
Pulled By: yinghai
fbshipit-source-id: c89e32bc5a4a77571f6ab6569297a6b6a1d1f2fc
* Fix LaTex rendering in CosineAnnealingLR
Backslashes were interpreted by Python as escapes in the string, so \frac
turned into frac, which is not a valid LaTex command.
This could be fixed with double backslashes, but the easiest solution is to
just use a raw (r) docstring.
* Fix sphinx warnings for LRN doc headings
* Move LRN docstring from __init__ to class level
The docstring was not rendered by sphinx at
http://pytorch.org/docs/master/nn.html#torch.nn.LocalResponseNorm
because it was in the constructor.
* Remove superfluous backticks from LRN formula
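A minimal sketch of the raw-docstring fix described above (placeholder class; only the r prefix and the math directive matter):
```
class CosineAnnealingLR(object):
    r"""Raw string (note the ``r`` prefix) so ``\eta`` and ``\frac`` reach Sphinx intact:

    .. math::
        \eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})
                 \left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)
    """
```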
Summary:
Without this enforce it's too easy to export a model overriding its params in
the predictor.
Reviewed By: rayleichen
Differential Revision: D6984506
fbshipit-source-id: 9bbf375758686c6ad12ad071723f255363e98ae6
Summary: We are going to enable the `-Werror=unused-parameter` flag, and I need to manually fix some files so the rest of this process can be automated with a tool called clang-tidy.
Reviewed By: yfeldblum
Differential Revision: D6928263
fbshipit-source-id: 38ce3597b9968a2c0dba3ab21be5ee1c84a13e41
Summary:
Our CMake files have some issues when using Ninja as the generator to build with CUDA.
Closes https://github.com/caffe2/caffe2/pull/1962
Differential Revision: D6992456
Pulled By: bddppq
fbshipit-source-id: 7aa328b16e7edfddfee33495352bfcf8cd8ce9f3
* Check GCC version on Ubuntu
GCC 5 in Ubuntu 17.10 and newer doesn't define the macro _GLIBCXX_USE_C99,
which causes std::to_string, std::isnan, std::isinf (and more) functions
not to be defined either. This fix checks whether GCC 5 is used on Ubuntu 17.10
or later and shows an error message describing the problem.
Fixes #5229
Summary:
After we removed the android-cmake submodule and switched to android.cmake.toolchain from the Android NDK, the code that builds the cpufeatures dependency is no longer valid. This commit fixes it.
Closes https://github.com/caffe2/caffe2/pull/1957
Differential Revision: D6990082
Pulled By: Maratyszcza
fbshipit-source-id: ccbe8190e30e097474a2876ed4c0b263bcb117ef
Summary:
This reverts commit 30f614beea6f859fee25ce4f85573142885dde45
bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
cause_a_sev_many_files
Differential Revision: D6893040
Original commit changeset: 30f614beea6f
fbshipit-source-id: 5e98a24699088283f864efe31234874bdacbe3c3
* Allow zero-dim tensors to be bound to at::Scalar
This relaxes THPUtils_unpackLong and THPUtils_unpackDouble to allow
values convertible to PyLong and PyFloat objects. This includes NumPy
scalars and zero-dim tensors (Variables).
This is important to maintain backwards compatibility in the Tensor
constructors once scalars are enabled and Variable and Tensor are
merged (a rough sketch follows this list).
* Add comment and unpack PyInt as int64_t
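A rough sketch of the kind of call this relaxes (the specific call site below is illustrative, not taken from the test suite):

```python
import numpy as np
import torch

n = np.int64(2)        # a NumPy scalar, not a Python int
x = torch.randn(5)

# Arguments unpacked via THPUtils_unpackLong now accept anything convertible
# to a Python int, e.g. NumPy scalars (and, once Variable and Tensor are
# merged, zero-dim tensors).
y = x.narrow(0, 0, n)
```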
Summary:
onnx-caffe2 requires some more Python packages in order to run its tests.
Closes https://github.com/caffe2/caffe2/pull/1956
Reviewed By: bddppq
Differential Revision: D6985654
Pulled By: yinghai
fbshipit-source-id: 06d4ec95729b09cdd1bc7e096ecf6680124070cd
* hard exit when test output contains warning or error
* update perf test links
* update base machine description
* update z value range
* update cpu perf test numbers
* store perf test numbers in S3 instead, for easier updating
* update mini_sequence_labeler perf test link
* fix lint
* store perf test numbers in repo
* update link to mini_sequence_labeler test
Summary: The old pow operator has been deleted in math_ops.cc, math_ops.cu and math_ops.h, while the new operator supporting scalar and tensor exponents has been added in pow_op.cc, pow_op.h and elementwise_op.cu.
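As a rough illustration of the two exponent forms the new operator supports (a sketch only; the blob names and values are made up, and the scalar argument is assumed to be named `exponent`):

```python
from caffe2.python import core, workspace
import numpy as np

workspace.FeedBlob("X", np.array([1.0, 2.0, 3.0], dtype=np.float32))

# Scalar exponent passed as an operator argument.
workspace.RunOperatorOnce(
    core.CreateOperator("Pow", ["X"], ["Y_scalar"], exponent=2.0))

# Tensor exponent passed as a second input.
workspace.FeedBlob("E", np.array([0.0, 1.0, 2.0], dtype=np.float32))
workspace.RunOperatorOnce(
    core.CreateOperator("Pow", ["X", "E"], ["Y_tensor"]))

print(workspace.FetchBlob("Y_scalar"), workspace.FetchBlob("Y_tensor"))
```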
Reviewed By: houseroad
Differential Revision: D6893040
fbshipit-source-id: 30f614beea6f859fee25ce4f85573142885dde45
Summary:
Add a function that returns true if the model contains a loss and returns
false if it doesn't.
Reviewed By: kittipatv
Differential Revision: D6982444
fbshipit-source-id: 1f63b7a1eaa3077841a0ad5d8d854b471d0aa84c
This PR adds support for convenient CUDA integration in our C++ extension mechanism. This mainly involved figuring out how to get setuptools to use nvcc for CUDA files and the regular C++ compiler for C++ files. I've added a mixed C++/CUDA test case which works great.
I've also added CUDAExtension and CppExtension functions that construct a setuptools.Extension with "usually the right" arguments, which reduces the boilerplate required to write an extension even more. This is especially useful for CUDA, where library_dir (CUDA_HOME/lib64) and libraries (cudart) have to be specified as well.
The next step is to enable this with our "JIT" mechanism.
NOTE: I've had to write a small find_cuda_home function to find the CUDA install directory. This logic is kind of a duplicate of tools/setup_helpers/cuda.py, but that's not available in the shipped PyTorch distribution. The function is also fairly short. Let me know if it's fine to duplicate this logic.
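Roughly what the reduced boilerplate looks like with the new helpers (a sketch; the package, module, and source file names are hypothetical):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension, CUDAExtension

setup(
    name='my_extension',  # hypothetical package name
    ext_modules=[
        # Plain C++ extension: built with the regular C++ compiler.
        CppExtension('my_cpp_ops', ['my_cpp_ops.cpp']),
        # Mixed C++/CUDA extension: .cu files go through nvcc, the rest through
        # the C++ compiler, and CUDA_HOME/lib64 and cudart are wired up.
        CUDAExtension('my_cuda_ops', ['my_cuda_ops.cpp', 'my_cuda_kernels.cu']),
    ],
    cmdclass={'build_ext': BuildExtension},
)
```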
* CUDA support for C++ extensions with setuptools
* Remove printf in CUDA test kernel
* Remove -arch flag in test/cpp_extensions/setup.py
* Put wrap_compile into BuildExtension
* Add guesses for CUDA_HOME directory
* export PATH to CUDA location in test.sh
* On Python2, sys.platform has the linux version number
* Fix mul with dense + sparse
* Add missing hspmm and smm
Also make repeat only a function (not a method) to match Tensor
behavior.
These were discovered by running test_torch.py and test_sparse.py after
merging Variable and Tensor
Summary: Sometimes we need to add some extra schema later
Reviewed By: sunnieshang
Differential Revision: D6951849
fbshipit-source-id: 564eb88f9250eae24869fd10ba3426e00a18af33
Summary:
We don't care about a particular system Python when building Anaconda images.
Rebasing later to remove the sccache change once it is merged (#1952).
Closes https://github.com/caffe2/caffe2/pull/1953
Differential Revision: D6978409
Pulled By: pietern
fbshipit-source-id: 39762602cdd35eefd485a014011b53e3ee2e830d
Summary:
Work in progress to start using sccache
Closes https://github.com/caffe2/caffe2/pull/1949
Differential Revision: D6978772
Pulled By: pietern
fbshipit-source-id: 721462d8e3470736472263337c628b287cd1a901
Summary:
Modify detect_components to take a list of valid node_name prefixes instead of values. Users can set node_name to e.g. `'sparse_component:0'`, `'sparse_component:1'`, etc.
and pass `'sparse_component:'` as a valid prefix. Also add `Tags.SPARSE_COMPONENT` in addition to `Tags.SPARSE_SHARDED` and `Tags.SPARSE_DONT_SHARD` and update all calls to
`detect_device_components`.
Reviewed By: azzolini
Differential Revision: D6952599
fbshipit-source-id: e1b1e6b146a6bd053b295690016044fd5990c893
- Create a new common.sh to put common bash stanzas in
- Create a new enabled-configs.txt file, which you can use
to selectively disable tests when running CI
- Specify exited user land via trap, which means early successful
exit will correctly print the end sigil.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Previous behavior may fail to resolve the correct library name. A rework of https://github.com/caffe2/caffe2/pull/1935 as it was messed up in the rebase...
Closes https://github.com/caffe2/caffe2/pull/1950
Reviewed By: bddppq
Differential Revision: D6974530
Pulled By: yinghai
fbshipit-source-id: 924b653e8ac0b68c46341edfd3eb05d9cc0155f2
* Improve Variable interface
* Address comments from @apaszke and @colesbury
* string ::operator= is not noexcept
* Remove ir.h from tracer_state.h to improve build times
* Make Variable a struct and pack SavedVariable fields
* Implement as_variable_ref
* grad_fn_ptr() -> grad_fn_unsafe()
* Reduce hackiness of set_type hack
* Include variable.h and edge.h in tracer_state.h because it uses them
* class Variable -> struct Variable because Windows can't even
* Make Variable::output_nr uint32_t instead of int
* Add comment about tracing state
* Replaced more static_cast<Variable&> and improve docs
* Remove SavedVariable destructor and construct members in init list
* Clarify docs for Variable
* Variable::set_version -> set_version_counter
Summary:
Change log
- Support rectangular cropping, where the height and width of the clip crop can be set separately. This is useful when most video resolutions are non-square, such as 240p, 360p and 480p, where width is significantly larger than height.
- Comparisons of training on ucf101 between using 112x112 croppings and using 112x144 cropping.
- https://fburl.com/i0rw6y1k
- Support 14 multi-cropping per video clip at testing stage to improve classification accuracy. Take left-top, central-top, right-top, left-bottom, central-bottom, right-bottom and central-central croppings as well as their mirrorings. In total, 14 croppings.
- Comparisons on the same model trained on UCF-101. Use 1 clip per video
- RGB. f41014306, w/o Vs f41014868, w/ multi-cropping: `0.64099 Vs 0.65796`
- OF. f41014889, w/o Vs f41014913, w/ multi-cropping: `0.65796 Vs 0.67624`
- Support color jittering and color lighting on RGB data for training data augmentation.
- Comparisons of training on ucf101 from scratch with and without color jittering and lighting:
- https://fburl.com/k69zatul
Reviewed By: HengCV
Differential Revision: D6962620
fbshipit-source-id: 9b43478945874142727fea351ee04417218e6606
Summary:
In Caffe2 Benchmark, if a blob is not specified in the init net but only in the predict net (e.g. an input), the blob cannot be retrieved from the workspace. In some cases this results in errors.
Create the blob before using it if it doesn't exist.
Closes https://github.com/caffe2/caffe2/pull/1948
Reviewed By: orionr
Differential Revision: D6970316
Pulled By: sf-wind
fbshipit-source-id: 3e317403de0b5cf7568c7bda69a0ebe9d59d4a1f
This better maintains backwards compatibility when Tensors and Variables
are merged. For example:
>>> loss = var.sum().data[0]
Currently, `var.sum().data` is 1-dim, so indexing with [0] works. Once scalars are
enabled and Variable and Tensor are merged it will be zero-dim. This
change allows that expression to continue working (with a warning). In
the future, the canonical way to compute that expression will be:
>>> loss = float(var.sum())
Or an equivalent alternative:
>>> loss = var.sum().item()
Also fixes a few error cases.
Prior to this change, test_autograd.py used type checks that
differentiate between Tensor and Variable to determine if an argument
needs requires_grad=True. This logic breaks when Tensor and Variable are
merged.
This changes the logic for method_tests so that (a minimal sketch follows the list):
- non_differentiable(..) marks an argument as not requiring grad
- floating point tensors have requires_grad=True
- integral tensors have requires_grad=False
- Variables are disallowed (unless they're wrapped in
non_differentiable)
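A minimal sketch (not the actual test-suite code, and assuming the pre-merge distinction between Tensor and Variable) of how an argument could be prepared under these rules:

```python
import torch
from torch.autograd import Variable

class non_differentiable(object):
    """Marker for arguments that must not require grad."""
    def __init__(self, tensor):
        self.tensor = tensor

FLOAT_TENSOR_TYPES = (torch.FloatTensor, torch.DoubleTensor, torch.HalfTensor)

def prepare_arg(arg):
    # Wrapped arguments never require grad.
    if isinstance(arg, non_differentiable):
        return Variable(arg.tensor, requires_grad=False)
    # Floating point tensors require grad; integral tensors do not.
    if torch.is_tensor(arg):
        return Variable(arg, requires_grad=isinstance(arg, FLOAT_TENSOR_TYPES))
    # Bare Variables are disallowed so the rules above stay unambiguous.
    if isinstance(arg, Variable):
        raise TypeError("pass Tensors, or wrap Variables in non_differentiable")
    return arg
```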
Summary:
Because networkx 2.1 moved bellman_ford, and scikit-image will install the most recent networkx by default.
Closes https://github.com/caffe2/caffe2/pull/1944
Reviewed By: pietern
Differential Revision: D6966299
Pulled By: pjh5
fbshipit-source-id: 71ad387cb4a2b22cde3b87e6665977da6b4c428e
Summary: Copying model_id from metaNetDef_->modelInfo in PredictorContainer for dper models. Since these model_id's are strings of <model_id>_<snapshot_id>, changed them to strings in net_observer
Reviewed By: salexspb
Differential Revision: D6752448
fbshipit-source-id: 93c91950b44c012e57240aaf909bc961449cfd7c
Summary: This fixes issues around building on a devserver.
Reviewed By: pjh5
Differential Revision: D6953242
fbshipit-source-id: 59b4d3f846971a8b5eb9c1d802a8bacef3fad696
Summary: Step 1 of 3 in adding support for multidevice batch normalization on GPUs. Implements ChannelStatsOp for the GPU. Next steps are to port the backprop stats op and tie things together in DPM.
Reviewed By: rbgirshick
Differential Revision: D6953411
fbshipit-source-id: cd50e53d66ea84fe66021c08b978b28290d9f347
Summary: MKLMemory is not really a tensor, but we can make shape info collection work.
Reviewed By: stephenyan1231
Differential Revision: D6947770
fbshipit-source-id: 04303ea309a8a9c1ac4c5401c43934d1abb6a7c4
Summary: The interface is not used anywhere AFAICT; cleaning up to make it less confusing.
Reviewed By: kuttas
Differential Revision: D6867040
fbshipit-source-id: 3e8a77df76ef09c6864c308561825777b326f76c
Summary:
enum34 dependency of PeachPy conflicts with built-in enum package on Python >= 3.6
This commit brings in NNPACK change to avoid using enum34 on Python >= 3.4
Closes https://github.com/caffe2/caffe2/pull/1925
Differential Revision: D6951906
Pulled By: Maratyszcza
fbshipit-source-id: a698d8bbbc7b7b0c1b0b532c2c9d74fe0d2ae266
* add reduce=True arg to HingeEmbeddingLoss (a usage sketch follows this list)
* pass arg to super constructor in HingeEmbeddingLoss
* make HingeEmbeddingLoss reference fn work on legacy
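A small usage sketch of the new argument (the inputs are made up):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

x = Variable(torch.randn(4, 5))
y = Variable(torch.Tensor(4, 5).random_(0, 2) * 2 - 1)   # targets in {-1, 1}

loss_mean = nn.HingeEmbeddingLoss()(x, y)              # reduce=True (default): averaged loss
loss_full = nn.HingeEmbeddingLoss(reduce=False)(x, y)  # per-element losses, same shape as x
```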
* Fix test_distributions when WITH_SCALARS.
* Use SCALAR_SHAPE in test, use self.scale in AffineTransform.
* Handle device correctly for scalars.
* Fix one hot categorical.
* Fix relaxed categorical.
* Add a new_tensor instance method to Variable that takes only data.
This is to work around the legacy problems of new, where e.g.
new(5) will give you an unfilled tensor rather than a scalar.
* Fix cuda scalar code path.
* Remove double return.
* Work around lack of WITH_SCALARS.
* Use tensor_new.
* Add a new_tensor instance method to Variable that takes only data.
This is to work around the legacy problems of new, where e.g.
new(5) will give you an unfilled tensor rather than a scalar.
* Remove double return.
* Fix cuda scalar code path.
* Work around lack of WITH_SCALARS.
Summary:
To build with tests and benchmarks
`./scripts/build_android.sh -G Ninja -DBUILD_TEST=ON -DUSE_NNAPI=ON`
To run unit test
`adb push build_android/bin/nnapi_test data/local/tmp`
`adb shell "cd data/local/tmp &&./nnapi_test`
To run benchmark
`adb push build_android/bin/nnapi_benchmark data/local/tmp`
`adb shell "cd data/local/tmp &&./nnapi_benchmark`
Tested on Google PIxel 2 XL with android 8.1
Closes https://github.com/caffe2/caffe2/pull/1918
Reviewed By: Maratyszcza
Differential Revision: D6944604
Pulled By: hlu1
fbshipit-source-id: 462f010117ae4628b23bef506c41397de3817ad4
Summary:
Include six, enum34, and PeachPy as Caffe2 submodules, and use the versions from submodules instead of downloading them during configuration time
Closes https://github.com/caffe2/caffe2/pull/1917
Reviewed By: orionr
Differential Revision: D6938735
Pulled By: Maratyszcza
fbshipit-source-id: 841a6c47a1cd003a19f48f6c256aa4d9eb2cc6e4
Summary: CompleteInTimeOrDie was added to detect deadlocks and proactively exit. In addition, call os.abort() to generate a core dump so that the error is actionable.
Reviewed By: bmaurer
Differential Revision: D6938343
fbshipit-source-id: 8bd36da4f4bb1195bd3398f25d133a6ebf1c66ad
Summary:
It appears that my initial implementation was not really working when one
starts doing nesting. This diff fixes that by replacing itertools with
something that is easy to reason about.
Reviewed By: idning
Differential Revision: D6933763
fbshipit-source-id: f7a1de996d878a41bac2b2acd9d87a7c4b416778
Follow up to #4744
This is another code-path in which storages may be null, which is not
allowed in PyTorch. The Python tensor bindings handle this in pynew, but
the ATen bindings do not.
This is caught by test_torch.py when Tensor and Variable are merged.
Summary:
Original commit changeset: d0c1c7681605
Reverting because this commit broke the OSS build.
Reviewed By: bddppq
Differential Revision: D6935666
fbshipit-source-id: 955cfeb6d5a4ed265b2e099094cfb5bfe960ff95
C++ argument evaluation order is undefined and leads to different
results on different platforms. This commit fixes build_lstm_body to
do the calculation slightly differently.
Fixes #5055
Summary:
Include six, enum34, and PeachPy as Caffe2 submodules, and use the versions from submodules instead of downloading them during configuration time
Closes https://github.com/caffe2/caffe2/pull/1901
Differential Revision: D6930731
Pulled By: Maratyszcza
fbshipit-source-id: d0c1c7681605d957de6f51bd24fbb25afc0f282f
Summary:
There is a long-standing scoping problem which was introduced in the original Python wrappers early in H1. Basically, each RNNCell implementation has to manually scope the outputs of each of its operators. If somebody forgets, there can be weird bugs with layers etc.
The approach is the following: the user has to explicitly specify the current scope when using apply_over_sequence and similar functions if the function is going to be called several times (e.g. for stacking layers). This way we use Caffe2's native scoping approach instead of inventing an extra API people have to use (i.e. passing the scope name as an argument to the RNNCell constructor).
Closes https://github.com/caffe2/caffe2/pull/1681
Differential Revision: D6777536
Pulled By: salexspb
fbshipit-source-id: 73d860b8d4857589e04bdea5a6fcd3080d68427c
Summary: Integrate the Android NNAPI into Caffe2. Supported ops include averagepool, maxpool, conv, relu, and softmax.
Reviewed By: Maratyszcza
Differential Revision: D6560366
fbshipit-source-id: 2879a99c01acb050e711d9d7d5bde022ef95888d
This was accidentally lost while addressing review comments on
https://github.com/pytorch/pytorch/pull/4695
pack_padded_sequence may be called either with a list or with a
Variable. If called with a list we convert to Variable internally.
I added to test_nn to test the new codepath. The bug was also caught
by the onnx-fb-universe tests (which rely on passing in Variable).
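A minimal illustration of the two call paths described above (shapes and lengths are made up):

```python
import torch
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence

x = Variable(torch.randn(3, 2, 4))   # (seq_len, batch, feature)

# Lengths as a plain Python list: converted to a Variable internally.
packed_from_list = pack_padded_sequence(x, [3, 2])

# Lengths passed in directly as a Variable.
packed_from_var = pack_padded_sequence(x, Variable(torch.LongTensor([3, 2])))
```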
torch.mm(sparse, dense) -> dense works for tensors. This PR makes it work for variables as well.
I renamed mm to _mm in Declarations.cwrap and wrote a native mm function that wraps _mm for the dense case and addmm for the sparse case.
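For example (a small sketch with made-up values), this now works when both inputs are Variables:

```python
import torch
from torch.autograd import Variable

i = torch.LongTensor([[0, 1, 1],
                      [2, 0, 2]])
v = torch.FloatTensor([3, 4, 5])
sparse = Variable(torch.sparse.FloatTensor(i, v, torch.Size([2, 3])))
dense = Variable(torch.randn(3, 4))

out = torch.mm(sparse, dense)   # dense 2x4 result
```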
The test_cuda.py setup purports to test half tensors, but actually just
re-tests FloatTensors because the keys in type_map were str instead of
type. Testing HalfTensors is more complicated, requiring changes to
precision and excluding some unimplemented methods.
We should fully test half CUDA tensors. This change just deletes the
duplicate tests of FloatTensor.
Summary: Set of RL improvements: Fix error in quantile computation. Handle missing values in sparse_to_dense. Replace page_size with minibatch size.
Differential Revision: D6888977
fbshipit-source-id: bb84477866c64da5ff57d6c25df1c8d3b799e437
* PackedSequence: store batch_sizes as tensor
rather than converting to a list of python integers. This maintains
the invariant that module's inputs/outputs are collections of
Variables.
In particular, this causes the JIT to no longer choke when flattening
and unflattening arguments.
* Handle sequence lengths correctly when exporting RNNs to ONNX
- when uniform sequence lengths are provided, correctly omit the
argument when constructing the ONNX graph, so as to not fix the
graph to the batch size.
- handle PackedSequences by floating them through the graph and
eliminating them in an optimization pass. ONNX does not have packed
sequences, but operates on a representation equivalent to
PaddedSequence, so we hide the representation-switching from ONNX
- as a preliminary step towards handling PackedSequences, not directly
tied to ONNX export, change batch_sizes from being an argument to
the RNN operators into being an argument to the forward() function
of those RNN operators. This more closely models the reality that
batch_sizes are effectively part of the input sequences.
Summary:
This was forgotten in #1854.
cc Yangqing
Closes https://github.com/caffe2/caffe2/pull/1880
Differential Revision: D6919916
Pulled By: Yangqing
fbshipit-source-id: 1a8dbae604677bc3c3d23b4e55bd09bb87c24cfd
* Add criterion scalar tests.
This exposed an issue in MarginRankingLoss with scalars, but the cleanest way to fix is to wait
until forward runs on Variables (so we don't have to wait for the backward to check if something
is a scalar).
* Fix flake8.
* Add error message for margin_ranking_loss with scalars.
We perform this check in the generic/SparseTensor.cpp (the Python binding),
but the ATen bindings don't use that code path
Fixes test_broadcast_coalesced with sparse tensors
Summary: We should not be trying to instantiate this op on GPU at this point
Reviewed By: pietern
Differential Revision: D6915576
fbshipit-source-id: 6bdbc93ad12fc67e3001fce1b506fe2895d7b0ba
The Tensor and Variable classes are being merged.
autograd.Function.forward is now called on Variables, but with "no-grad"
mode (torch.no_grad()) enabled.
One benefit is that we no longer have to explicitly track shared
storages.
Variable.item() converts one-element tensors to standard Python numbers.
This operates like float(var) or int(var) depending on
the data type of the Variable.
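For example (a trivial sketch):

```python
import torch
from torch.autograd import Variable

f = Variable(torch.FloatTensor([1.5]))
n = Variable(torch.LongTensor([7]))

f.item()   # 1.5, a Python float (same as float(f))
n.item()   # 7, a Python int (same as int(n))
```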
Because nvcc does not know that in/out pointers do not alias each other,
if we assign a value to *out and then use *in again, the kernel has to
emit a write to *out and then another read from *in.
(Affected kernels become marginally faster after the fix.)
* test_nn working.
* Fix some incorrect scalar assumptions.
* Don't use Variables when we don't have to.
* Use Variable Mixin.
* Fix NLLLoss reference function when WITH_SCALARS not enabled.
* Allow device to be optional in cuda().
* Fix multilabelmarginloss_reference.
* parallelize vol2col and col2vol of Conv3D with CPU backend
* parallelize vol2col and col2vol of Conv3D with CPU backend
* interface test of conv3d
* replace long with int64_t
* correct pragmatic error of comments
Summary: The previous refactor of these four Ops changed their input semantics, which makes them backward-incompatible with old models. This diff fixes the problem by checking the input and defining the follow-up behavior case by case, so that old models can be accommodated.
Reviewed By: dzhulgakov
Differential Revision: D6905840
fbshipit-source-id: fc37baec407fd5eae64fc9c2b61aba3c492a90f3
Summary:
Special While loop operator that follows the semantics of While in ONNX: https://github.com/jamesr66a/onnx/blob/controlflow/docs/Operators.md#experimental-loop
Stuff that's missing:
- Lexical scoping enforced via child workspaces
- Double-buffering on forward
Further possible enhancements:
- Full parallelism when there are no loop-carried dependencies
- Diagonal execution
- More optimized scan_outputs shaping via static shape inference provided in ONNX (coming sometime)
- GPU support (probably just some tensor value management stuff)
- Gradient support (likely low-pri right now)
Closes https://github.com/caffe2/caffe2/pull/1848
Reviewed By: dzhulgakov
Differential Revision: D6907524
Pulled By: jamesr66a
fbshipit-source-id: 4938108733e168b8c027035091104712a18c992a
Summary:
Addresses issue #1676
Now when `make install` is run, the `caffe2` (and `caffe`) python modules will be installed into the correct site-packages directory (relative to the prefix) instead of directly in the prefix.
Closes https://github.com/caffe2/caffe2/pull/1677
Reviewed By: pietern
Differential Revision: D6710247
Pulled By: bddppq
fbshipit-source-id: b49167d48fd94d87f7b7c1ebf0f187ec6a203470
Summary:
This brings an option to disable inline assembly in FXdiv via CMake configuration option `-DFXDIV_USE_INLINE_ASSEMBLY=OFF`
Inline assembly in FXdiv apparently triggers a bug in some gcc versions
Closes https://github.com/caffe2/caffe2/pull/1892
Differential Revision: D6904507
Pulled By: Maratyszcza
fbshipit-source-id: 2ef24b277cbaa2634c69e2d53cef21415b05195f
Summary:
Fix typeid_test when running android C2 tests
Previously it says:
Build failed: Command failed with exit code 1.
stderr: caffe2/caffe2/core/typeid_test.cc: In member function 'virtual void caffe2::{anonymous}::TypeMetaTest_Names_Test::TestBody()':
caffe2/caffe2/core/typeid_test.cc:49:12: error: variable 'string_meta' set but not used [-Werror=unused-but-set-variable]
TypeMeta string_meta = TypeMeta::Make<string>();
Reviewed By: Yangqing
Differential Revision: D6869192
fbshipit-source-id: ccbc30d53d04a8ece98de0a99598c176e6aaf4dc
* Add a small paragraph for pathwise estimator
* Add differentiability as well
* Add small snippet and clear some grammatical errors
* Update documentation to reflect has_rsample (a small sketch follows this list)
* Add a fix for ExponentialFamily docs
* Update __init__.py
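A small sketch of what the pathwise (reparameterized) estimator looks like for a distribution that reports has_rsample (parameter values are illustrative):

```python
import torch
from torch.autograd import Variable
from torch.distributions import Normal

loc = Variable(torch.zeros(3), requires_grad=True)
scale = Variable(torch.ones(3), requires_grad=True)
dist = Normal(loc, scale)

if dist.has_rsample:
    sample = dist.rsample()    # differentiable w.r.t. loc and scale
    sample.sum().backward()    # gradients flow back through the sample
else:
    sample = dist.sample()     # not differentiable; needs a score-function estimator
```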
* Add transpose() to TensorGeometry.
This code is dead; I briefly used it in my RNN patchset but
eventually rewrote it to not be necessary. However, it seemed
like a useful gadget so I kept it. In general, it seems that it
would be useful for TensorGeometry to support all operations that
Tensor does, but it only computes the changes to sizes/strides
instead of actually doing the computation.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Turn on wrap_dim behavior for TensorGeometry
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support for hard-coded differentiable outputs.
Some outputs of functions are nondifferentiable, and should always
be returned with requires_grad=False. Traditionally, we have used
the presence of 'grad' to signal that only the first output is
differentiable, and the rest are not, but cudnn_rnn (to be
implemented) breaks this pattern; its first three outputs are differentiable,
but its last output is a buffer that is just consumed by backwards.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* TensorGeometry constructor from just sizes
The sizes are assumed to form a contiguous tensor, and we compute
the strides we would get in that case.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support saving TensorList for backwards.
There is some back story here. Saved TensorList in backwards will
be used by cudnn_rnn, and it is worth asking, why is it necessary to
save a list of tensors? Indeed, *technically* speaking a list of
tensors is not necessary, we only need to save the sizes of each
of the weight tensors. (We need the sizes because cuDNN is only
going to blast the derivative of weights into a flat buffer, but
we need to match the sizes of the views into the buffer when we
eventually return the derivatives.)
However, it was surprisingly awful trying to implement passing just
sizes, because as non-Tensor arguments, the JIT interpreter generation
code is expected to handle all non-Tensor arguments as attributes in the
trace, and our attributes struct doesn't actually know how to do
arrays of arrays. Saved TensorList code was much easier to get working,
so that's what this patch does.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* MatrixRef - an ArrayRef with a stride, making it a 2D ArrayRef.
Like ArrayRef, this class does not own the underlying data, it is expected
to be used in situations where the data resides in some other buffer.
This is intended to be trivially copyable, so it should be passed by
value.
For now, 2D only (so the copies are actually cheap, without having
to write a SmallVector class) and contiguous only (so we can
return non-strided ArrayRef on index).
The intended use-case (not in this commit) is to make it easier to
work with RNN weights, which are num_weights x num_layers matrix of
parameters.
P.S. dimension 0 indexes rows, dimension 1 indexes columns
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Generalize getDataType in Descriptors.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Change copy_range to take Tensor, and change cat_tensors_backward accordingly
Should a backward function return a Variable or a Tensor? For the most
part, all of our backward functions return Tensor, except cat_tensors_backward,
which returns a variable_list (which is really the only thing that matters,
because Tensor and Variable are interconvertible). But this is kind of weird,
because it means that you can't implement a backwards in ATen that returns
a std::vector<Tensor>, and then hook it up transparently with the derivatives
code. So I switched it over.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support 5-ary return Tensor tuple.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support code generation with mixed Tensor/TensorList in output.
I don't think I ended up using this in cudnn_rnn, but this seems
it might be useful for someone else later.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support 4-ary boolean array
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add support for retain_variables in tools/autograd/derivatives.yaml
'retain_variables', a bool which is true if a user has specified
that saved variables should be retained in case the backwards is
run again later. This allows an optimization where we can
destroy saved buffers if we know variables are not going to be retained,
e.g., it is (will be) used by _cudnn_rnn
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Lazily initialize cuDNN descriptors
Previously, cuDNN descriptors were eagerly allocated as soon
as a FooDescriptor object was created. However, in some uses
of TensorDescriptor, this is problematic: some tensors are optional
and cuDNN's API expects to be given a nullptr TensorDescriptor
in this case, not an uninitialized (but allocated) descriptor.
Lazily initializing the descriptors makes it less likely for
us to use uninitialized memory and matches the usual semantics of
unique_ptr. It's good sense!
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Port cuDNN RNNs to ATen.
This brings three new functions:
- _cudnn_rnn_flatten_weight: flatten a matrix of weight tensors into
a single contiguous weight buffer as required by cuDNN
- _cudnn_rnn: run RNN forwards
- _cudnn_rnn_backward: run RNN backwards
RNNs have a lot of parameters, so we restructured what was previously
a single 'fn' object that recorded all the parameters into three
objects: RNNDescriptorParams, TensorDescriptorListParams and
DropoutDescriptorParams.
We make use of MatrixRef to organize the weight tensors (which are
weight/bias x number of layers), but I did not teach the codegen
how to pass these as arguments/return values natively, so instead
a MatrixRef is passed as its constituent ArrayRef and int64_t stride0.
cudnn_rnn has three differentiable outputs and one nondifferentiable
one, so it makes use of the support for hard-coded differentiable outputs.
I haven't deleted all of the descriptor code from Python, because dropout
initialization still goes through this codepath, that should be fixed soon
but I don't see it as essential for this PR.
This commit also removes the last use of NestedIOFunction from PyTorch.
There are some shenanigans with cuDNN dropout descriptor initialization,
see below:
Note [cuDNN dropout descriptor initialization]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In most cases, setting descriptors in cuDNN is cheap (e.g.,
cudnnSetTensorNdDescriptor). However, this is not the case for
cudnnSetDropoutDescriptor: in cuDNN 6/7 (and possibly others) it does an
expensive precomputation to initialize the random number generator states. In
cuDNN 6, this is the ONLY official mechanism to initialize a dropout descriptor,
which means that law-abiding clients were expected to generate a dropout
descriptor once and cache it. However, our ATen interface is (1) stateless (so
we can't cache the descriptors) and (2) does not accept arbitrary user types in
its interface (so we can't pass the descriptor in). This puts us in a pickle.
In cuDNN 7, a new function, cudnnRestoreDropoutDescriptor was added, which
forgoes the expensive initialization process, and can initialize the
descriptor with a pre-initialized state CUDA tensor. This is great, because
it means we can simply pass in the state tensor and then initialize the
descriptor internally. Unfortunately, this function is not available in
cuDNN 6.
To work around this, we break the cuDNN abstraction barrier and rely on
the struct layout of the underlying dropout descriptor. With this struct,
we can reimplement cudnnRestoreDropoutDescriptor from scratch. Great!
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix cuDNN 7 behavior.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete some unused, controversial methods from MatrixRef.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add missing filter_dim_a slice
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Replace nested for-loop with itertools.chain.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CR comment on mut_desc()
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Refactor DropoutDescriptor API.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Use cached CurrentDeviceProperties from Context.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Document _cudnn_rnn outputs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Improve fmap docs, convert some functions to use it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Move IndexRange to autograd/function.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Elaborate on CUDNN_STATUS_INVALID_VALUE return some more.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add an all-in-one setter for RNNDescriptorParams.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Print what the unrecognized RNN mode was
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* RNN TensorDescriptor improvements
- Have an explicit size/stride overload for set TensorDescriptor,
so you don't have to create a goofy view to feed in.
- Change the padding to 3D rather than 5D, which is all you actually
need (it's just 2D that is not supported by cuDNN API.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix implementation of cudnnRestoreDropoutDescriptor, plus test.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Better comments about input layout.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add comment about no-DropoutDescriptor argument RNNDescriptor function.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Rename vocab_size back to input_size.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Don't use backslash in comment.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Bugfix for contiguous TensorGeometry calculation.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Don't allocate a dummy tensor when setting TensorDescriptor for flatten_weight.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Make contiguity errors more user-friendly.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* s/fn.dropout.train/fn_train/
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* s/_cudnn_rnn_backward_grad/_cudnn_rnn_backward_input/
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Make dcx properly undefined when not required.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Remove old TODO.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add state size check in cudnnRestoreDropoutDescriptor
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Explicitly narrow int64_t to size_t
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Restore copyParams comment.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Update benchmark numbers, and slight engineering improvements.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Typofix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
* We now allow subdirectories as well as numbers in the name.
* Also fixed an error case.
Closes https://github.com/caffe2/caffe2/pull/1875
Reviewed By: pjh5
Differential Revision: D6894401
Pulled By: orionr
fbshipit-source-id: 6a9938bc7d2ba6b8f094ed7b8a02664120a10626
Once Variable and Tensor are merged the existing Variable test would
cause an infinite recursion. Instead, modify the Variables directly
inside a `no_grad()` block.
* Remove addValues and use WithInsertPoint
* Use blocks to simplify differentiate
Using @ezyang's suggestion, this change uses a block rather than
staging annotations to represent the reverse pass. This allows us
to reuse the machinery to copy graphs/blocks to extract the
reverse pass concisely.
This also changes the input order of the Gradient's df to:
[output vjps][temporary vjps][captures]
In addition to being simpler to generate in this order, it also
will allow ExecutionPlan to append the captures onto the already-
existing input list of vjps that are given by the autograd,
rather than have to prepend them, which should be slightly cheaper.
* Enforce that input captures are before outputs
This changes the Gradient struct to enforce that input
captures appear before output captures in the capture list,
which makes it easier to use in ExecutionPlan.
In some cases, when there are two different versions of cuDNN installed,
one under /usr/local/cuda and the other under a virtual env such as conda or
under the main system path /usr/include, the compiler would pick up the
cudnn.h from the virtual env/system path first. This is because cmake
generates the C_INCLUDES and CXX_INCLUDES flags with the system include path
first. All this may lead to linking problems as described in Issue #4869.
Fixes #4869
In lieu of a more complicated builder object, this commit adds
an 'insert point' to Graph and a method 'insertNode' which inserts
nodes at that insert point. setInsertPoint can be used to change
the insert point on the graph to the end of a block or to any point
inside a current block. The resource guard `WithInsertPoint`
can be used to temporarily change it to, for example, insert
into the "then" branch of an If statement.
This commit also updates the resource guard for scopes. It previously
relied on return value optimization to work correctly which is
not guaranteed to be applied until C++17.
This commit is getting the IR ready for representing ONNX control flow.
It adds nested blocks to the IR.
* Each node now has blocks(), addBlock(), and eraseBlock() similar to a node's
output list.
* Blocks are a property of every node rather than an attribute because
this makes it easier to manage the lifetime of the containing nodes and because
the behavior of cloning Blocks will likely be different from the way we clone other
attributes.
* A block itself has a list of nodes, as well as inputs and outputs.
The meaning of the nested input/output nodes are specific to the particular
node kind containing the block. It is safe to assume inputs to a block will be
in scope in the block.
* Each Block has an owningNode() and each node has an owningBlock().
The owningNode of the top-most block is null.
* Values are lexically scoped: nested blocks can use values from outer blocks
that have been defined in previous nodes. Lint has been updated with these
new scoping rules.
* This change preserves almost all of the pre-Block API. No attempt has been made
to make optimizations aware of Blocks. This will need to be done on a case-by-case
basis as we make optimizations capable of handling Blocks.
This adds the initial implementation of graph executor for the new JIT design. It includes a few python tests ensuring that nograd, backward, and double-backward cases work for simple examples and some corner cases. More work needs to be done to performance optimize as there are many extra copies and places where we hold onto variables longer than we should. These are noted in the comments.
Summary:
Future-clang is stricter about some things. We need to address deletes on non-virtual destructors.
For reference, the compiler error in question can be identified by: "delete called on 'ClassName' that is abstract but has non-virtual destructor [-Werror,-Wdelete-non-virtual-dtor]" for a given ClassName.
Reviewed By: smeenai
Differential Revision: D6853479
fbshipit-source-id: a40c8e83da7c1b44da48e887cc029e98e40d6737
Summary:
* Likely need to test this so bad formatting can't be added in the future, but cleaning all operators so we at least have good examples.
* Formatting between our internal Facebook operator catalog and external caffe2.ai catalog are still slightly different. We'll work on this.
Closes https://github.com/caffe2/caffe2/pull/1846
Reviewed By: pjh5
Differential Revision: D6848570
Pulled By: orionr
fbshipit-source-id: b9bc0bfccb243d0440bd7b2406858cad8dc37e92
* fix output_nr not incremented correctly
* update test_conv_double_backward to cover this case; call accGradParameters if any param (not just weight) requires grad in parse_nn.py
* update Spatial/VolumetricFull(Dilated)Convolution to support accGradParameters with only bias requiring grad
* Spatial/VolumetricConvolutionMM
* Spatial/VolumetricDilatedConvolution
* address @fmassa 's comments
* Add some more builder scripts from ossci-job-dsl
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Relax precision requirement on test_Upsample_trilinear_scale_3d_cuda
Partially addresses #5006.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Replace async with non_blocking for Python 3.7 upgrade
* Remove trailing whitespace
* Give _cuda and _type kwargs and accept async for compatibility
* Rename async to non_blocking in all C++ code
* Add entries for async in python_variable_methods
* Friendlier backward compatibility for cuda and type
Summary: It seems that the integral overload of std::signbit is not well supported on Windows. Bypassing it.
Reviewed By: xianjiec
Differential Revision: D6869924
fbshipit-source-id: b98a3431c4d26dcffd08e26259037083afd41114
Summary:
- Fix path to FXdiv and FP16 dependencies
- Link cpuinfo library
- Pull NNPACK fix for PYTHONPATH handling when launching PeachPy
- Pull cpuinfo fix for cross-compiling on Linux for Android
- Pull cpuinfo fix for CPUINFO_LIBRARY_TYPE support
- Pull cpuinfo fix for iOS builds
Closes https://github.com/caffe2/caffe2/pull/1869
Differential Revision: D6881428
Pulled By: Maratyszcza
fbshipit-source-id: 7b4115daa090096dbd97303503792e7b144fbb43
Summary:
iOS also depends on USE_MOBILE_OPENGL, so I think we should only disable it for Android.
Closes https://github.com/caffe2/caffe2/pull/1835
Differential Revision: D6880522
Pulled By: Maratyszcza
fbshipit-source-id: b2c2fa052ad5948bc52fa49eb22c86eb08f59a39
* Rewrite ATen native docs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Formatting fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Some of the CR comments
* More CR comments [ci skip]
* One last CR comment
Don't check the ScalarType and Backend of arguments in VariableType.
Instead, only check that arguments are Variables of any type. The
precise type checks are handled by the base type.
Many of our functions take heterogeneous types. There isn't enough
information in Declarations.yaml to ensure the precise types of
arguments in VariableType, which makes it difficult to add new methods.
This is #4943 with a fix to the memset call
* Add scalar autograd tests for functions requiring 'special' Variables on LHS.
* Add index_* tests.
* Fix flake8.
* Use normal for clamp rather than uniform.
* Add tests for gather, scatter, scatter_add.
* Make sure masked_select doesn't get all zeros.
* Properly fill in make_non_contiguous data for sizes that can't be made contiguous (#4951)
* Properly fill in make_non_contiguous data for sizes that can't be made contiguous.
* Use clone instead of copy.
* Fix and test backward for mv, ger with scalars.
* Fix addmv.
* Use grad.type() instead of type(grad).
* Fix addr.
There are a couple of hacks here:
1) We need to squeeze the backward result because of implicit broadcast of the arguments to match behavior of ger.
2) The broadcast_dims code doesn't work for scalars; I added support for adding '.scalar' onto the end of the broadcast
specification, but really this should just be a native function with _out support.
* Don't allow scalars in torch.dot for Variables.
There is no dot_out, so the lack of _out isn't an issue.
* Revert "Don't allow scalars in torch.dot for Variables."
This reverts commit 76c521eba8c1fb533e164f121075230209d52927.
* Revert "Fix addr."
This reverts commit afe04a0078394f94645e10cec53626f582cbc55c.
* Revert "Fix addmv."
This reverts commit 550c7ac71b3b832a3b74a809fec9ce5f5e554909.
* Revert "Use grad.type() instead of type(grad)."
This reverts commit ddcb5a424ed004fa2ee238a50177573e6d4a1b89.
* Revert "Fix and test backward for mv, ger with scalars."
This reverts commit 10b0ecad48d987774c41184ffaf11742322926ab.
Summary: hypothesis_test was introduced in D4508879; add a plain test which is more straightforward.
Reviewed By: kennyhorror
Differential Revision: D6835334
fbshipit-source-id: d05a2cd199b2de56ac0cc0319f19fcd7978647d5
Summary:
Added forward-only mode to CTCOp to compute only the costs without the grads.
Also, num_threads was set to 1, which ends up stomping over
--caffe2_omp_num_threads mid-execution (https://fburl.com/uq65xfty). Fixing
that to use the already configured num OMP threads.
Reviewed By: ajtulloch
Differential Revision: D6867829
fbshipit-source-id: 9ab1fec9857e00d277a9e82c4bd64caa6f4b2a62
Summary: enable ModOp to control the output sign to follow dividend or divisor.
Reviewed By: xianjiec
Differential Revision: D6852457
fbshipit-source-id: 62dbb66cacecb8e0a0f81f63f2b7b378efbd6ee2
Summary:
On windows when using a prebuilt version of protobuf (such as provided by vcpkg) we need to set the PROTOBUF_LIBRARIES and PROTOBUF_INCLUDE_DIRS manually.
The CAFFE2_API decoration should only be defined to dllexport when building shared libs.
Closes https://github.com/caffe2/caffe2/pull/1854
Differential Revision: D6867345
Pulled By: Yangqing
fbshipit-source-id: d4d48f709d313af9dde103fc8dfbfc217261715b
Summary:
These changes are required to use glog on Windows.
Yangqing Please consider merging them as they were removed when PR #1793 was reverted.
Closes https://github.com/caffe2/caffe2/pull/1853
Differential Revision: D6863567
Pulled By: Yangqing
fbshipit-source-id: f6ce3a1c5855e2b39000ce989d62dc2b34cd4817
Uses TypeError from torch/csrc/Exceptions.h in python_arg_parser.cpp so
that the exception is interpreted as a Python TypeError instead of
RuntimeError.
Summary:
When RTTI is not enabled, we previously could only print an
"(RTTI not enabled ...)" type error message. This is annoying when developing
in a mobile environment. Add gRegistry with #T to have a basic string for the type,
for easy type inference.
Reviewed By: Yangqing
Differential Revision: D6849614
fbshipit-source-id: d41417d72fdcfb7b8c9ddc4ded604ea598572b73
* Revert "Clarify grad_input_mask documentation in derivatives.yaml (#4963)"
This reverts commit 6f3266b4a195db6ade4651431595f9f22bd9e656.
* Revert "fix triu and tril for zero-strided inputs on gpu (#4962)"
This reverts commit 6c197c2f15090ab7368d183439229b768ece5efc.
* Revert "Add mutex for CPU RNG and move TH to C++ (#4041)"
This reverts commit 96239dd50e89bc2d1fd5d91cc5ee8fca95b07f90.
* Revert "Support multivariate TransformedDistributions (#4937)"
This reverts commit ca5071d0721767fcfeb226b5c695dfd5d0671072.
* Revert "Only check that arguments are Variables in VariableType (#4943)"
This reverts commit d44437968f2b136a3399dc62af66adfd3eaa249e.
* Revert "torch.set_num_threads sets MKL option too (#4949)"
This reverts commit 2aaeec0db0be0e9e9effd277c268cd224ff66ef9.
* Add mutex for CPU RNG
* move more things to cpp to make cuda build work
* fix mutex bug on OS X
* try to fix cuda9 half .x bug
* try to fix windows error
* create THGeneratorState as separate field
* fix mutex issues
Don't check the ScalarType and Backend of arguments in VariableType.
Instead, only check that arguments are Variables of any type. The
precise type checks are handled by the base type.
Many of our functions take heterogeneous types. There isn't enough
information in Declarations.yaml to ensure the precise types of
arguments in VariableType, which makes it difficult to add new methods.
Summary: Currently, MultiNodeCheckpointManager returns None in this case, yet in JobRunner we assume this function returns a valid task group, i.e. we call session.run(self.checkpoint_manager.init(...)) directly. This fails in the case where we use LocalHostScheduler and reuse a MultiNodeCheckpointManager.
Reviewed By: azzolini
Differential Revision: D6843450
fbshipit-source-id: a7ec942cfe692f19e8751b0078ae6a6108f29e54
Summary: To match the ONNX semantics, change the default value of alpha in LeakyRelu to 0.01.
Reviewed By: dzhulgakov
Differential Revision: D6840975
fbshipit-source-id: 08543f80fd86cbe96a0eee8d725ef137a5bf4ab8
These are auto-generated tests derived from existing tests with the following constraints:
1) Forward function passes with scalar self and size (1,) self
2) No Variable/Tensor arguments (besides self)
Summary:
Simplify async_scheduling to use global thread pool instead of per network
polling threads
Reviewed By: romain-intel
Differential Revision: D6814274
fbshipit-source-id: f91ac3e99d9b8cf15578a751ed7929be84840408
* Fix some scalar issues with autograd.
1) Better error messages in functions that don't support scalars
2) Don't access size(dim) in the backward of a function taking a scalar because the wrap fails.
* Fix CUDA build.
Summary:
Commonly, net observers attach operator observers at construction. This diff separates the logic into a base class to inherit from.
Closes https://github.com/caffe2/caffe2/pull/1806
Reviewed By: salexspb
Differential Revision: D6808623
Pulled By: mdschatz
fbshipit-source-id: 75ef0eea913ef30943541c829c0a976965f42736
Putting these scripts here has a few benefits:
1. PyTorch developers can easily update the scripts without
having to ask for permissions to ossci-job-dsl
2. You can test changes in the scripts by opening a PR to
PyTorch (functionality is ossci-job-dsl is not easily testable.)
3. If you get one of our stock Docker images, you can run these scripts
to trigger a build identical to what would occur in Jenkins (not
entirely true yet, but we can make it so.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Last fix was uncommitted due to a bug in internal build (CAFFE2_API causing error). This one re-applies it as well as a few more, especially enabling gtest.
Earlier commit message: Basically, this should make windows {static_lib, shared_lib} * {static_runtime, shared_runtime} * {cpu, gpu} work other than gpu shared_lib, which willyd kindly pointed out a symbol limit problem. A few highlights:
(1) Updated newest protobuf.
(2) use protoc dllexport command to ensure proper symbol export for windows.
(3) various code updates to make sure that C2 symbols are properly shown
(4) cmake file changes to make build proper
(5) option to choose static runtime and shared runtime similar to protobuf
(6) revert to visual studio 2015 as current cuda and msvc 2017 do not play well together.
(7) enabled gtest and fixed testing bugs.
Earlier PR is #1793
Closes https://github.com/caffe2/caffe2/pull/1827
Differential Revision: D6832086
Pulled By: Yangqing
fbshipit-source-id: 85f86e9a992ee5c53c70b484b761c9d6aed721df
Summary:
This was removed in an earlier version. Anyway, I suspect this will make jenkins a bit unhappy (do we use gpu instances for building as well?) so firing a PR to test.
Closes https://github.com/caffe2/caffe2/pull/1833
Differential Revision: D6834889
Pulled By: Yangqing
fbshipit-source-id: bc501cdb9d83a32ad38d24e972c2bfec5242d767
Summary:
Now we use **clang** to build Caffe2 for Android with the arm64-v8a ABI, but clang doesn't support the "-s" compilation flag. If we append this flag to clang, it reports a warning:
> clang++: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
This commit checks whether gcc or clang is used to build Caffe2 for Android.
Closes https://github.com/caffe2/caffe2/pull/1834
Differential Revision: D6833011
Pulled By: Yangqing
fbshipit-source-id: e4655d126fb3586e7af605a31a6b1c1ed66b9bcb
Summary:
* Putting up to test on Jenkins since I can't test locally on my Mac.
Might fix https://github.com/caffe2/caffe2/issues/1796 but I haven't touched these files before, so it's a guess. :)
Closes https://github.com/caffe2/caffe2/pull/1826
Reviewed By: Yangqing
Differential Revision: D6832918
Pulled By: orionr
fbshipit-source-id: 22bdeafa031dbe6457d81cb105b41a451ca3a25d
This data-structure will be used as the key in GraphExecutor's
code cache. It supports fast creation, hashing, and equality checking
because it will run on all inputs to GraphExecutors in the hot path.
Summary:
Historically, for interface dependent libraries (glog, gflags and protobuf), exposing them in Caffe2Config.cmake is usually difficult.
New versions of glog and gflags ship with new-style cmake targets, so one does not need to use variables. New-style targets also make it easier for people to depend on them in installed config files.
This diff modernizes the gflags library, and still provides a fallback path if the installed gflags does not have cmake config files coming with it.
It does change one behavior of the build process though - when one specifies -DUSE_GFLAGS=ON but gflags cannot be found, the old script automatically turns it off but the new script crashes, forcing the user to specify USE_GFLAGS=OFF.
Closes https://github.com/caffe2/caffe2/pull/1819
Differential Revision: D6826604
Pulled By: Yangqing
fbshipit-source-id: 210f3926f291c8bfeb24eb9671e5adfcbf8cf7fe
* Fix visibility of AT_CUDA_ENABLED
* link ATen with verify_api_visibility so ATen headers get generated in time
* Move CUDAHalf.* to ATen/cuda
* ATen/cuda/CUDAHalf.cpp -> ATen/cuda/CUDAHalf.cu
* Remove inline attributes from HalfFix
* Also test for AT_CUDNN_ENABLED and add clarifying comment
* Remove unnecessary static inline from HalfFix template
* Move Half::operator double() into header for windows
* Mark Half::operator() as inline
When generating autograd::Function wrappers for ATen functions, we need
to take derivative expressions in derivatives.yaml (identified by name)
and correlate them with the correct index they should take in
grad_inputs (identified positionally only). Previously, this
computation was done *statically* in load_derivatives.py (set_up_derivatives)
and then we hard-coded indices in the generated Functions.cpp.
This is sufficient for supporting ATen operations which consist solely
of Tensor arguments, or a single TensorList argument. However, this
strategy will not work for mixed Tensor/TensorList arguments, as the
index of any Tensor after a TensorList is not known at codegen time,
since it will vary depending on the length of the TensorList, e.g.,
foo({x1, x2}, y) ==> y is index 2
foo({x1, x2, x3}, y) ==> y is index 3
This commit introduces a new strategy for generating these indices which
pushes index computation to *runtime* (though any decent C++ optimizer
can re-optimize the index computation back into constants; this was
verified in Godbolt.) Instead of hard-coding constants, a small
IndexRangeGenerator object is created and used to generate the correct
index ranges (std::pair<size_t, size_t>) for each argument.
Here is an example of mm rewritten in the new codegen format:
variable_list MmBackward::apply(const variable_list& grads) {
  IndexRangeGenerator gen;
  auto self_ix = gen.range(1);
  auto mat2_ix = gen.range(1);
  variable_list grad_inputs(gen.size());
  auto& grad = grads[0];
  auto self = self_.unpack();
  auto mat2 = mat2_.unpack();
  if (should_compute_output({ mat2_ix })) {
    auto grad_result = mm_mat2_backward(grad, self, mat2_sizes, mat2.strides(), 1);
    copy_range(grad_inputs, mat2_ix, grad_result);
  }
  if (should_compute_output({ self_ix })) {
    auto grad_result = mm_mat1_backward(grad, mat2, self_sizes, self.strides(), 1);
    copy_range(grad_inputs, self_ix, grad_result);
  }
  return grad_inputs;
}
Unlike before, where self_ix and mat2_ix were hardcoded as 0 and 1,
we derive them by invoking IndexRangeGenerator (which internally
is just a little counter which bumps up each invocation of 'range').
Each _ix variable actually represents a range, as can be seen here.
variable_list CatBackward::apply(const variable_list& grads) {
  IndexRangeGenerator gen;
  auto tensors_ix = gen.range(tensors_size_);
  variable_list grad_inputs(gen.size());
  auto& grad = grads[0];
  if (should_compute_output({ tensors_ix })) {
    auto grad_result = cat_tensors_backward(grad, tensors_sizes_dim, dim);
    copy_range(grad_inputs, tensors_ix, grad_result);
  }
  return grad_inputs;
}
The invocation of 'copy_range' reads a TensorList returned by the
backward function into the correct entries in grad_inputs.
tensors_size_ is a new member of CatBackward which is filled with
the size of the forward input tensor when cat is originally invoked.
With this new code generation strategy, we can completely eliminate
the special cases for Tensor and TensorList in index selection, and
we can smoothly support mixed Tensor/TensorList by making multiple
invocations of gen.range() with non-one arguments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
In this case, each sequence is treated as having a length equal to the
first dimension of the input tensor. This matches the semantics of
ONNX when the sequence length input is left out.
Closes https://github.com/caffe2/caffe2/pull/1764
Reviewed By: dzhulgakov
Differential Revision: D6751219
Pulled By: anderspapitto
fbshipit-source-id: 89e0efd12339157627494e2b8c83e952bdd8a9f8
Summary:
OpenGL is no longer built by default. Even after setting flag -DUSE_MOBILE_OPENGL, the build fails. Remove it in the benchmark code so that the benchmark can still be built.
Closes https://github.com/caffe2/caffe2/pull/1822
Reviewed By: Maratyszcza
Differential Revision: D6824777
Pulled By: sf-wind
fbshipit-source-id: 5af8b669a36adcd6a98b0a11237b9e03c146bb9d
Summary:
Previously in SafeDequeueOp, in.dims()[0] would fail if in.ndim() is 0.
However, the error message is not informative. I added a Caffe_Enforce
which prints out the input and output blob names. This is very helpful for
future debugging as well.
Differential Revision: D6821421
fbshipit-source-id: b07e5829a2c580aaaac88b0d9ff8d05f6da11713
Suppose you are given a list of arguments, each of which may be Tensor or
TensorList. How can you write a function that can treat these arguments
uniformly as a list of tensors? This patch solves the problem using
variadic templates.
Why variadic templates? Use of variadic templates means anyone working
with this code has to understand universal references, perfect
forwarding, parameter packs and some idioms of C++ template design.
However, I argue that variadic templates are the *right* tool for
supporting the implementation of functions which must take an
arbitrarily heterogeneous set of inputs. We were able to limp by
in old code because, for the most part, tensor inputs were homogeneous,
but this is no longer the case for some non-primitively differentiable
functions; and with the upcoming cuDNN RNN in ATen PR, it will no longer be
the case for primitively differentiable functions either.
There are two parts to the PR.
First, we add torch/csrc/utils/variadic.h, which defines a mix-in
IterArgs that takes any class which supports operator(), and augments it
with a new variadic function apply() that calls operator() on each
argument passed to it. In an original draft of the patch, I wrote the
recursion for each parameter pack from scratch for each function;
however, it turns out there are no fewer than seven instances where we
need this idiom, and the mix-in reduces the lines of code, and also
helps centralize the most important (and easy to forget) boilerplate
for perfect forwarding.
To verify that IterArgs is compiled away into an unrolled form per
call site, I inspected the assembly on some synthetic examples.
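As a rough illustration (a simplified sketch of the idea, not the exact code in variadic.h), the mix-in boils down to a CRTP recursion like this:
```cpp
#include <cstddef>
#include <utility>

// Simplified sketch: F provides operator(), and apply() perfectly forwards
// each argument of the parameter pack to it, one at a time.
template <typename F>
struct IterArgs {
  void apply() {}  // base case: no arguments left

  template <typename T, typename... Args>
  void apply(T&& arg, Args&&... args) {
    self()(std::forward<T>(arg));               // visit one argument
    self().apply(std::forward<Args>(args)...);  // recurse on the rest
  }

 private:
  F& self() { return *static_cast<F*>(this); }
};

// Example visitor: counts how many arguments it was applied to.
struct CountArgs : IterArgs<CountArgs> {
  size_t count = 0;
  template <typename T>
  void operator()(T&&) { ++count; }
};

// Usage sketch: CountArgs c; c.apply(1, 2.0, "three"); then c.count == 3.
```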
Next, we modify the following functions to make use of IterArgs:
- compute_requires_grad
- Function::flags (Variable and Tensor variants)
- flatten
- isTracing
- count_tensors / count_variables
Finally, the tuple packer is rewritten to be variadic, although we
cannot make use of IterArgs (since we are given a tuple). It might
make sense to refactor the code into a generic piece which invokes
a function with the arguments specified by a tuple, and then an
appropriate IterArgs, but we leave this for future work.
One thing to note: we cannot write a function with overloads for both
Tensor and Variable, because both ArrayRef<Variable> and Tensor have
implicit conversions from Variable, making such an overload ambiguous.
It may be interesting to remove the implicit conversion from ArrayRef.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
More changes to be added later. I need to make a PR so that I can point jenkins to this
Closes https://github.com/caffe2/caffe2/pull/1767
Reviewed By: orionr
Differential Revision: D6817174
Pulled By: pjh5
fbshipit-source-id: 0fc73ed7d781b5972e0234f8c9864c5e57180591
The primary benefit is now we have working move constructors
et al without having to write all the boilerplate. Furthermore,
the size of the code is substantially reduced.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Main changes:
1. Move reader creation to Brew in order to be consistent and avoid a wild use of param_init_net
2. Use optimizers for training function, avoid manual optimizer construction
3. Add MLP mode (a default)
4. Fix a bunch of too verbose comments and add a bit of new explanations
Closes https://github.com/caffe2/caffe2/pull/1760
Differential Revision: D6749059
Pulled By: salexspb
fbshipit-source-id: 9dfbbb2d9772a74a0300c2e404a92e791f7cc593
Summary:
This reverts commit d286264fccc72bf90a2fcd7da533ecca23ce557e
bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
cause_a_sev_many_files
Differential Revision: D6817719
fbshipit-source-id: 8fe0ad7aba75caaa4c3cac5e0a804ab957a1b836
Summary:
Basically, this should make Windows {static_lib, shared_lib} * {static_runtime, shared_runtime} * {cpu, gpu} work. A few highlights:
(1) Updated to the newest protobuf.
(2) Use the protoc dllexport command to ensure proper symbol export.
(3) Various code updates to make sure that C2 symbols are properly shown.
(4) CMake file changes to make the build proper.
(5) Option to choose static or shared runtime, similar to protobuf.
(6) Revert to Visual Studio 2015, as the current CUDA and MSVC 2017 do not play well together.
Closes https://github.com/caffe2/caffe2/pull/1793
Reviewed By: dzhulgakov
Differential Revision: D6817719
Pulled By: Yangqing
fbshipit-source-id: d286264fccc72bf90a2fcd7da533ecca23ce557e
Summary: Updates `sparse_lookup.py` for the new fused 8-bit rowwise quantization. Mostly just changing the same files as the original diffs (D5753626 and D5761202). I know very little about this code here so please let me know if this is safe, also in terms of migration away from the non-fused storage.
Reviewed By: kennyhorror
Differential Revision: D6710784
fbshipit-source-id: 185f147af52a094a937ba631b0351225e660d205
Summary:
* This way we won't have issues across Linux and Mac.
* Also eliminates some weirdness where files with both capitalizations existed.
Closes https://github.com/caffe2/caffe2/pull/1813
Reviewed By: pjh5
Differential Revision: D6812141
Pulled By: orionr
fbshipit-source-id: 27f52089e2db623196349d7036aa8882e93c32fd
Summary:
PR Description
-----------------
This commit informs the developers why they have to use packages of third_party
folder instead of packages in their Linux distribution.
By default, Caffe2 finds installed packages in the Linux distribution. If a package
cannot be found, Caffe2 falls back to the version bundled in the third_party folder.
**Changes proposed in this PR:**
1. Added difference between Linux distro packages and third_party packages
**Self assessment:**
Checked.
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Closes https://github.com/caffe2/caffe2/pull/1724
Reviewed By: pjh5
Differential Revision: D6728185
Pulled By: orionr
fbshipit-source-id: 0c596cf56faaccf947caefc49ea3c6f0a473e9bf
1) Have 0-dim byte tensors behave like Py_TRUE, Py_FALSE
2) Py_TRUE now properly returns a copy from getitem
3) setitem now properly shapes the LHS consistent with the RHS (this doesn't really matter outside of error messages having the proper shape)
4) setitem supports numpy-style copy_to broadcasting (cuts off prefix 1s from src), so e.g. you can setitem (1,1,2,3) to (2,3) even though
that doesn't follow the normal inplace broadcasting rules.
Summary:
as titled
After converting categorical to Ngram keys, use this op to extract eids
Differential Revision: D6794020
fbshipit-source-id: 4f9251a22d7a129da30b92845e312876e6510e7e
Summary: Adds cuda support for LC Op
Reviewed By: QueryConnectionException
Differential Revision: D6803659
fbshipit-source-id: 538bbf6fd202c79154132fda0e90e175eb09d025
Summary: Weighted sampling reader dequeue randomly chooses a hive reader to read a mini-batch. This diff allows dequeue to output the index of the randomly chosen table to a specific blob.
Reviewed By: kennyhorror
Differential Revision: D6621070
fbshipit-source-id: 754b981fc2bcfdb0146d2a0a5b677e7cfe74211b
Summary: Fix the flaky test for ngram from categorical test
Reviewed By: dragonxlwang
Differential Revision: D6801152
fbshipit-source-id: dcbae17b1d3737a41fb2f5c794c1146a02c542bb
Summary:
Every call to the checkpoint_metadata_handler write() API requires us to pass all params like db_prefix, db_type etc.
Introducing an init API in the checkpoint_metadata_handler so that such params can be saved and need not be passed in every API call
Reviewed By: mraway, anshulverma
Differential Revision: D6792651
fbshipit-source-id: 059fa4309e8fce1ee5ab009af3e0570573c24245
Summary:
Updated bbox_transform op to match detectron training code better.
- Set apply_scale=False and correct_transform_coords=True to match detectron training/inference code.
Reviewed By: wat3rBro
Differential Revision: D6782894
fbshipit-source-id: 053d9847bf2b3c62a535499017a8413d78871ee0
Summary:
When the system has the protobuf package but not protoc, cmake will succeed:
> -- ******** Summary ********
-- General:
-- CMake version : 3.5.1
-- CMake command : /usr/bin/cmake
-- Git version : v0.8.1-967-g27d12d8-dirty
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- C++ compiler version : 5.4.0
-- Protobuf compiler : PROTOBUF_PROTOC_EXECUTABLE-NOTFOUND
-- Protobuf include path : /usr/include
-- Protobuf libraries : optimized;/usr/lib/x86_64-linux-gnu/libprotobuf.so;debug;/usr/lib/x86_64-linux-gnu/libprotobuf.so;-lpthread
...
Then make will fail.
This change makes cmake check for the protobuf package only when protoc has been found.
This pull request is a clone of [1781](https://github.com/caffe2/caffe2/pull/1781), that pull request closed by mistake.
Closes https://github.com/caffe2/caffe2/pull/1792
Differential Revision: D6800513
Pulled By: pietern
fbshipit-source-id: 79a77a139f342ae0aaa2c37fc1d9a74e28a08422
Summary: Diff 2 in stack of diffs for multi-device batch normalization. Allows plugging of intermediate stats into SpatialBN and SpatialBNGradient to enable multi-device batch normalization. Depends on D6697336.
Reviewed By: rbgirshick
Differential Revision: D6699258
fbshipit-source-id: 1bae0b9a33d257f8de9525f8b2511bec2ec9d51e
Summary: This is the first in a series of diffs to enable batch normalization across multiple devices on the same node with data parallel model. The diff contains the ops for computing the per-channel statistics required to obtain the mean and variance across multiple devices on the same node on the forward pass, and the gradient of the bias and scale during backpropagation. The actual modifications to SpatialBN and SpatialBNGradient to make use of these results will be in a separate diff.
Reviewed By: rbgirshick
Differential Revision: D6697336
fbshipit-source-id: 0de2750fe7e851795f238d9f625aeb4d74023dc2
This pass splits differentiable subgraphs into their own Node,
similar to a fusion group.
This initial implementation does not create optimal subgraphs, but
it works well in the case where most things are differentiable,
and has the building blocks (`mergeNodes`) to extend to the
better implementation.
* Remove setting coalesce to 0 in sparse transpose_
* Remove setting coalesced to 0 in THCSTensor transpose_
* Add test for transpose's coalesce invariant
* Fix #4480 by tracing inputs before running the function.
The DCE trick says that if I have y = f(x), and f is internally implemented as
g, it's OK to trace both g and f. Recall the tracing algorithm is:
enter f(x)
compute its result y
trace y = f(x)
return from f
So when you run the example above, you'll do this:
# suppose x is mapped to %1
enter f(x)
enter g(x)
result of g is y
trace y = g(x a.k.a. %1) (mapping y to %2)
return from g
result of f is y
trace y = f(x a.k.a. %1) (remapping y to %3)
return from f
and end up with a trace like this:
%2 = g(%1)
%3 = f(%1)
... only %3 is live, because %2 was killed from the mapping... Subsequent DCE
will eliminate the invocation of g and you'll only see f in the final trace.
However, if f and g are inplace functions, the machinery breaks:
# suppose x is mapped to %1
enter f(x)
enter g(x)
result of g is x
trace x = g(x a.k.a. %1) (remapping x to %2)
return from g
result of f is x
trace x = f(x a.k.a. %2) (remapping x to %3)
return from f
resulting in:
%2 = g(%1)
%3 = f(%2) # OOPS
This commit changes the strategy so we instead do this:
enter f(x)
trace f(x)
compute its result y
trace y = f(x) (computed above)
return from f
Now we get the correct Value before it is overwritten.
Here is what the new trace code looks like:
jit::tracer::PreTraceInfo trace_info;
if (jit::tracer::isTracing( self, index )) {
  trace_info = jit::tracer::preRecordTrace( "index_fill", { self, index } );
  setattr(trace_info.n, jit::Symbol("dim"), dim);
  setattr(trace_info.n, jit::Symbol("value"), value);
}
baseType->index_fill_(self_, dim, index_, value);
increment_version(self);
rebase_history(self, grad_fn);
if (trace_info.state != nullptr) {
  jit::tracer::postRecordTrace( trace_info, { self } );
}
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Revert "Hot patch ONNX _run_symbolic_function"
This reverts commit d1c973fee1a20da86d60d526e253ce89f5840baf.
* lintfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add missing expect file
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: When using sample weights to do weighted sampling in everstore loader, the proto size is increased by one. Update image_input_op to support this new use case
Reviewed By: chenlifei
Differential Revision: D6776709
fbshipit-source-id: 6148908881ad019b6b621413f452ea1814573a00
* Enable scalars if compiled with WITH_SCALAR environment variable.
We are pretty close to enabling scalars (0-dimensional arrays); this allows turning them on
for development purposes and to be able to write code that works both with and without scalars enabled.
WITH_SCALARS is currently broken with distributions, but should work for test_torch, test_autograd, test_nn.
* Fix unsqueeze.
* Fix wrap dim, wrapping with Scalar.
Summary:
The android.cmake.toolchain file we use from a submodule is unmaintained and has not been updated since 2015.
It causes numerous problems in the Caffe2 build:
- Caffe2 can't be built for Android ARM64, because the gcc toolchain for ARM64 doesn't support NEON-FP16 intrinsics, and the android.cmake.toolchain we use doesn't allow us to specify clang-5.0 from NDK r15c
- Caffe2 can't be built with Android NDK r16 (the most recent NDK version)
- Caffe2 can't be built for Android with Ninja generator
This change updates the build script to use $ANDROID/build/cmake/android.cmake.toolchain instead, which is maintained by Android team, and synchronized with Android NDK version.
As this toolchain file doesn't support "armeabi-v7a with NEON FP16" ABI, I had to disable mobile OpenGL backend, which requires NEON-FP16 extension to build. With some work, it can be re-enabled in the future.
Closes https://github.com/caffe2/caffe2/pull/1740
Differential Revision: D6707099
Pulled By: Maratyszcza
fbshipit-source-id: 8488594c4225deed0323c1e54c8d71c804b328df
Summary:
MKLSumOp assumes that all inputs will have the same layout, but this needn't be
the case as different inputs are typically created by different primitives and
some of them might have a custom layout. Create a View() before executing
dnnSumCreate().
Differential Revision: D6753233
fbshipit-source-id: 62420b972898066157c9c841275ccc917b3dec59
Summary:
This is a first attempt at completing bootcamp task T24449916. This diff contains 3 major changes:
1) Change LayerModelHelper to allow for exposing the output and parameters of any layer to metrics
2) Added a runner that allows metrics to draw arbitrary plots to a matplotlib axes object
3) Implement a metric that aggregates distributions of values in a blob over the training, and try this out in a notebook
Reviewed By: kennyhorror
Differential Revision: D6671273
fbshipit-source-id: b8961837395e89c957edbf5c7c862bdb845ccf4b
* Favor Variables over Tensors for scalar constructors in torch.distributions.
Current behavior:
1) distribution constructors containing only python number elements will have their python numbers upcasted to Tensors.
2) Python number arguments of distribution constructors that also contain tensors and variables will be upcasted
to the first tensor/variable type.
This PR changes the above to favor Variables as follows:
1) The python numbers will now be upcasted to Variables
2) An error will be raised if the first tensor/variable type is not a Variable.
This is done in preparation for the introduction of Scalars (0-dimensional tensors), which are only available on the Variable API.
Note that we are (separately) merging Variable and Tensor, so this PR should have no real long-term effect.
Also note that the above means we don't change the behavior of constructors without python number arguments.
* Fix tests that require numpy.
Summary: add Test for SparseLookup with PositionWeighted.
Reviewed By: kennyhorror
Differential Revision: D6771612
fbshipit-source-id: b4b3bfd514f366f579b4192643330ae73843d4f9
Summary:
SqueezeOp support to drop dims of size 1. MKLMemory now supports Reshape()
if the buffer is in plain layout, in which case just the dims and layouts are
modified similar to caffe2::Tensor. SqueezeOp takes care of converting the
input to plain layout if needed via an intermediate buffer before calling
Reshape().
Differential Revision: D6735656
fbshipit-source-id: 953309498370e1b8986e8c593bc6963f38036255
Currently, index operation kernels work in "source/destination index-major
order". (E.g., if thread count equals slice size, each thread will process
slice #0 in lockstep, and then slice #1, and so on.)
However, when elements inside each "slice" are separated by large strides (e.g.,
selecting columns of a matrix), it is better to switch to "elementInSlice-major
order". For example, each thread can process element #0 of every slice, and
then element #1 of every slice, and so on.
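As a rough sketch of the difference (plain C++ pseudocode standing in for the CUDA kernels, with hypothetical names; only the traversal order matters here):
```cpp
// Stand-in for the per-element work done by the index kernel.
inline void touch(float& x) { x += 1.0f; }

// "Index-major": finish all of slice #0, then all of slice #1, and so on.
// When elem_stride is large (e.g. selecting columns of a row-major matrix),
// consecutive accesses are far apart in memory.
void index_major(float* data, long num_slices, long slice_size, long elem_stride) {
  for (long slice = 0; slice < num_slices; ++slice)
    for (long e = 0; e < slice_size; ++e)
      touch(data[slice + e * elem_stride]);
}

// "elementInSlice-major": element #0 of every slice, then element #1, etc.
// Now the inner loop walks adjacent addresses, which is much friendlier
// when the stride inside a slice is large.
void element_in_slice_major(float* data, long num_slices, long slice_size, long elem_stride) {
  for (long e = 0; e < slice_size; ++e)
    for (long slice = 0; slice < num_slices; ++slice)
      touch(data[slice + e * elem_stride]);
}
```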
* Add kwarg-only 'requires_grad' parameter to Variable factories.
Functions that create variables, e.g. torch.ones_like currently always return Variables with requires_grad=False;
this is less convenient than the existing Variable constructor that has a requires_grad parameter. This commit
adds the parameter at the python binding level.
* Fix flake8.
* Address review comments.
* Match set_requires_grad implementation with tensor_new version.
* Implement a (data-only) Variable factory.
Implements a function, torch.autograd.variable, that is modeled after np.array. The main difference between it and new() and
the tensor constructors is that it interprets a python number as data, i.e. as a 0-dimensional tensor (we currently don't expose
that at the pytorch level, so it will temporarily end up as a 1-dimensional tensor), rather than a size.
The main difference currently between torch.autograd.variable and np.array is that torch.autograd.variable is stricter, e.g.
passing a PyFloat when an integral type is the default tensor type will result in an error; np.array basically lets anything
through (floating-point / integral mismatch, overflow, etc). This is to keep it consistent with Variable.new when called with
a sequence, although we can loosen the checks later.
This will be renamed to torch.tensor once we merge Variable and tensor.
* Address review comments.
Summary:
This reverts commit 417f1bab18b1721db5edc7ac8abaf883c1f7d3ee.
No longer needed since we'll add this within the Jenkins job itself.
Closes https://github.com/caffe2/caffe2/pull/1777
Reviewed By: pietern
Differential Revision: D6778185
Pulled By: orionr
fbshipit-source-id: d66befa76e84f83cf41eea50e54bc610db03ddd0
Summary:
At the end of distributed training, trainer needs to download the parameters back from parameter servers for saving the model. Currently, this parameter downloading happens at the end of job's epoch task group, which creates several problems when checkpointing is enabled for distributed training:
1. When checkpointing is enabled, we run multiple training epochs. At the end of each epoch, the model download tasks will run to collect parameters, but we won't save the model until the true end of training, so there is a big waste of resources.
2. After trainer0 downloads the parameters, these parameters take a lot of memory, so trainer0 can easily run out of memory in the next epoch of training.
Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group.
Reviewed By: azzolini
Differential Revision: D6765393
fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49
Summary:
Adds 2 features:
(1) In cmake, allow the use of -march=native
(2) During initialization, check if Caffe2 is built with matching cpu
features of the current machine.
This helps us guard performance claims in case the Caffe2 baseline is
built with limited computation capability.
Currently only added avx, avx2 and fma which are common.
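A minimal sketch of the kind of runtime check described in (2), assuming GCC/Clang x86 builtins (the real Caffe2 check may look different):
```cpp
#include <cstdio>

// Warn if the machine supports CPU features this binary was not compiled with,
// so performance numbers from such a build are not mistaken for the best case.
void warn_if_built_without_cpu_features() {
  __builtin_cpu_init();
#if !defined(__AVX__)
  if (__builtin_cpu_supports("avx"))
    std::printf("CPU supports AVX, but this build does not use it.\n");
#endif
#if !defined(__AVX2__)
  if (__builtin_cpu_supports("avx2"))
    std::printf("CPU supports AVX2, but this build does not use it.\n");
#endif
#if !defined(__FMA__)
  if (__builtin_cpu_supports("fma"))
    std::printf("CPU supports FMA, but this build does not use it.\n");
#endif
}
```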
Closes https://github.com/caffe2/caffe2/pull/1775
Reviewed By: ezyang
Differential Revision: D6772059
Pulled By: Yangqing
fbshipit-source-id: 884a3d7c7a71ed9631b7c6269ae95d842a09e1bd
* Use ATen infer_size implementation rather than TH.
The only substantive difference between the two implementations is in how empty sizes are handled;
in ATen these are treated as scalars (i.e., can be expanded to anything), whereas in TH they are treated
as a special case of empty tensors (i.e., can't be expanded to anything). Therefore, this change is
necessary to support scalars (0-dimensional tensors). We could also take a bool parameter for determining
how we treat empty tensors but this seems unnecessary: if one tries to expand an empty tensor (as a result
of an infer_size calculation), the expansion will fail.
* Make changes for review.
* Attempt to fix windows build.
* long -> int.
Summary:
This should translate to a 1% error margin. The gradient checker uses a 0.5% threshold.
Closes https://github.com/caffe2/caffe2/pull/1766
Differential Revision: D6774077
Pulled By: pietern
fbshipit-source-id: f97c7ffb2ef34fdd71d69320a7fdcf4a6a457715
Summary:
Just redirects to MKLSumOp. Doesn't support broadcast though since dnnSumCreate
expects identical dims.
Differential Revision: D6729788
fbshipit-source-id: 3e189465ad9d026bec4954648562ffe4e67fc393
Summary:
The idea is the following. We are going to automatically generate .py files using a jupyter post-save hook. Also, there is a script to generate these for all the tutorials. The script is also used from Jenkins test.sh. So if you don't run the sync anyhow, test will complain.
In this diff I include the framework itself + .py files generated for all tutorials. They live under a separate folder.
Closes https://github.com/caffe2/caffe2/pull/1762
Differential Revision: D6749358
Pulled By: salexspb
fbshipit-source-id: d6ad28e863a0670af2d1e5af86e16909dc0dcf2c
Summary:
As in name. LATTE translation team moving some code from Python 2 to 3 uncovered a case where comparison between unicode and str types leads NameScope('') to prepend a separator to the beginning of blob names. This fixes it.
Thank you so much to dzhulgakov for tracking down the cause of this so quickly!
Reviewed By: dzhulgakov
Differential Revision: D6766866
fbshipit-source-id: fbe46cff581f425ba10e8668400915ea40baab94
Summary: Make test less computationally expensive
Reviewed By: Yangqing, dzhulgakov
Differential Revision: D6766236
fbshipit-source-id: 59e51faa1331d804b11da9f7237ee9ce0cb27df8
Currently, a Variable can only be compared with a Variable, but a Tensor
can be compared with Tensors or numbers. Relax this constraint so Variables
behave identically to Tensors.
Summary:
Reason for this change:
(1) Setting/Getting default gpu id doesn't seem to be used at all.
(2) It actually is confusing compared to the CUDA_VISIBLE_DEVICES options etc.
(3) When setting cuda_gpu_id=-1 in the CUDAContext arg, it used to use the
default gpu id but probably we should use the current gpu - so that the caller
will be able to control the device placement.
One use case is for TensorRT - if we have a custom callback layer, then it would
be easier for TRT or whatever caller to set the running device.
Reviewed By: dzhulgakov
Differential Revision: D6740357
fbshipit-source-id: 2ea710e434b10220d5a198e31c93847304636863
Summary:
- Moved mask-rcnn inference operators to open source caffe2.
- Registered GeneratedProposalsOp as GenerateProposals in addition to GenerateProposalsCPP.
Reviewed By: rbgirshick
Differential Revision: D6747190
fbshipit-source-id: be98d6b56b5b53b13af46e839f5ceaf27f7fddc3
Summary: Building on D6710785 (float <-> fused_8bit_rowwise conversions) and D6710843 (`FusedEmbeddingLookup`), this diff implements the new reduction operations for the fused 8-bit rowwise storage. I mostly followed the [old 8-bit quantized code](diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/lengths_reducer_rowwise_8bit_ops.h) and [full-precision code](diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/lengths_reducer_ops.h).
Reviewed By: kennyhorror
Differential Revision: D6710844
fbshipit-source-id: b9e85db7437bd32dd44d01733c3749f35c00b06e
Summary:
Updates the perfkernel codebase to implement embedding lookup for our new fused storage format, where each row in the data matrix stores the quantized values *and* the scale and bias.
msmelyan see this as my best-effort attempt at updating the perfkernel stuff for the fused storage. Let me know if any of this is grossly wrong. I also don't know if we need to update any of the prefetching operations or something like that.
Note that we have to keep the old code around for a bit until we get rid of the old operations with separate `scale_bias` storage.
Reviewed By: kennyhorror
Differential Revision: D6710843
fbshipit-source-id: b485ef2389f526c5db1260cac9d4be3fc8df0979
Summary: This first diff adds the conversion operators that go from float to our fused 8bit rowwise quantized storage and back again. For now I've put the scale and bias in front of each row because it makes the pointer arithmetic nicer here and in the EmbeddingLookup perfkernel. If benchmarks or other reasons point out that this is a bad idea we can change it easily.
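For context, the affine rowwise scheme this presumably implements stores, for each row x, a uint8 row q plus a per-row scale and bias (the exact formula is an assumption, but this is the standard construction):
\[
\mathrm{scale} = \frac{\max_i x_i - \min_i x_i}{255}, \qquad
\mathrm{bias} = \min_i x_i, \qquad
q_i = \mathrm{round}\!\left(\frac{x_i - \mathrm{bias}}{\mathrm{scale}}\right), \qquad
x_i \approx \mathrm{scale}\cdot q_i + \mathrm{bias},
\]
with the scale and bias stored, per this diff, right before the quantized bytes of their row.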
Reviewed By: kennyhorror
Differential Revision: D6710785
fbshipit-source-id: 086ab91c12d3b472564a06eff6329be6cb9e680e
Summary: Changed #undef C to #undef E after the definition of Macro E in cpuid.h
Reviewed By: ot, luciang
Differential Revision: D6763664
fbshipit-source-id: beb221f0c690b5450c39577dd0a843613d802e9c
Summary:
* This will let us generate documentation on the Jenkins workers.
Closes https://github.com/caffe2/caffe2/pull/1772
Reviewed By: ezyang
Differential Revision: D6762731
Pulled By: orionr
fbshipit-source-id: 2e170d13055429971fc2cce66512480825030572
Summary:
This updates the video input op in Caffe2 so that it is up to date.
It adds additional support for:
1. optical flow and early fusion
2. different ways of sampling clips from video
3. different ways of resizing the input video
Reviewed By: dutran
Differential Revision: D6752788
fbshipit-source-id: 0cbd4d4bbbe97b0ada4cba7a55adc91a7af60d5f
The function record_stream is currently only defined on Tensor in
TensorCuda.cwrap. It would be best to implement this in ATen and
automatically bind it to Python, but we're missing ATen types to
represent CUDA streams.
The legacy NN bindings currently operate only on Tensors. We are slowly
replacing all uses of Tensor with Variable in Python code so that there
will only be one user-visible class. This changes the NN bindings
accessed through type2backend to accept either Tensors or Variables.
This does not affect the NN bindings that go through ATen.
* Various testing and utility improvements including torch.testing module.
1) Remove method definition for randn_like since ones_like, zeros_like do not have methods.
2) Add an empty_like native function for creating a tensor with uninitialized values.
3) Add an is_floating_point() native function, similar to is_signed().
4) Add a torch.testing module loosely modeled after numpy.testing; currently it contains
make_non_contiguous (moved from test_autograd) and randn_like (wrapper around the VariableFunction).
5) Remove code from test_autograd and test_nn that is responsible for generating grad_outputs to use
with gradgradcheck. These now use gradgradcheck's own generating code. This fixes
test_nn.py with scalars because gradgradcheck does the right thing here already.
* Rename parameter.
* Fix parameter usages.
Summary:
Fixes a beautiful bug spotted by mschatz: MetaStr was super slow for TensorCUDA because it was defined for CPU tensors only. And thus C++ friendly was invoking the casting constructor which copied the entire buffer to CPU!
I think both copy constructor and cast constructor should be explicit for Tensor given that it's an expensive op. There might be more spots to fix in the code.
Original revision with MetaStr bug is 2d026cfe9c :)
Reviewed By: Yangqing
Differential Revision: D6758540
fbshipit-source-id: 7d2dffadd84c043908e16927fe02e6ffb01f750c
Summary:
This updates https://github.com/caffe2/caffe2/pull/1096/ to build doxygen docs with cmake and fixes operator catalog generation. See the new README.md for details, but you can run
```
mkdir build && cd build
cmake -DBUILD_DOCS=ON .. && make
```
and
```
python caffe2/python/docs/github.py ~/c2docs/_docs/operators-catalogue.md
```
to generate docs.
There was one weird issue in `generator.py` where we sometimes receive tuples and sometimes objects. I handled this just by testing `isinstance`, but we might want to be more principled in the future.
Closes https://github.com/caffe2/caffe2/pull/1758
Reviewed By: pietern
Differential Revision: D6752127
Pulled By: orionr
fbshipit-source-id: 9ba9ad8efc920b27a57327f8a7d3050f3650d4ce
Summary:
Lots of unwanted stuff here that shouldn't be in this branch. I just need to make a PR so I can test it
Closes https://github.com/caffe2/caffe2/pull/1765
Reviewed By: orionr
Differential Revision: D6752610
Pulled By: pjh5
fbshipit-source-id: cc93290773640a9eb029f350b17f520ac5f2504e
The Tensor and Variable classes are being merged in Python. This means
that all interfaces to C++ must accept Variables where they previously
accepted Tensors.
* adds reduce arg to BCEWithLogitsLoss interface
Adds the missing 'reduce' argument for the BCEWithLogitsLoss module
so that it matches the functional interface.
* fix indentation and add additional test
fixes the indentation used to update the BCEWithLogitsLoss module
and adds a unittest to sanity check its usage with `reduce=False`
Previously the side-effect free grad calculation was performed
using callbacks that could also override the decision to run a
function. However this had a few problems e.g. it forced us to iterate
over pretty much all functions in the graph and drop their buffers.
This patch improves the mechanism, by adding explicit support for this
kind of evaluation in execute(). It's safer, and the algorithm used to
decide which nodes have to be evaluated was replaced with a faster one.
Previously, Symbol was just a uint32_t and we converted with symbolToString and
stringToSymbol. Now Symbol is a struct with a toString method, and
constructors from either BuiltinSymbols enums (e.g. kParam) or strings.
Symbol is convertible to a uint32_t to ensure it can still be used in
switch statement BuiltinSymbol case branches.
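Roughly, the new shape is something like the sketch below (details assumed from this description, not the exact header; the string interning is faked with a hash just to keep the example self-contained):
```cpp
#include <cstdint>
#include <functional>
#include <string>

enum BuiltinSymbol : uint32_t { kParam = 1, kReturn = 2 /*, ... */ };

struct Symbol {
  /*implicit*/ Symbol(BuiltinSymbol s) : value(s) {}
  explicit Symbol(const std::string& name)
      : value(static_cast<uint32_t>(std::hash<std::string>{}(name))) {}
  // toString() omitted in this sketch; it replaces the old symbolToString() free function.
  operator uint32_t() const { return value; }  // keeps switch statements on builtins working
 private:
  uint32_t value;
};

void dispatch(Symbol sym) {
  switch (sym) {  // implicit conversion to uint32_t
    case kParam: /* ... */ break;
    case kReturn: /* ... */ break;
    default: break;
  }
}
```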
* Fix display of test failure number in test_distributions.
Previously, if e.g. the last example of 3 failed, it would say example 2/3.
* Fix other instances of enumerate pattern.
This adds overrides in VariableType for the xxx_out ATen functions and
implements Python bindings. There is no support for automatic
differentiation. If any of the inputs (or outputs) requires grad, then the
function will throw an exception unless it's running in "no-grad" mode.
The bindings for calling torch.xxx functions on Variables are moved to a
different object. Previously, they were static method on VariableBase.
This change prevents users from accidentally calling static methods as if
they were instance methods.
This moves the implementation of repeat to _utils so that the autograd
function can call it directly instead of relying on forward being called
on tensors.
This also removes _range, which was previously necessary because we
shadowed the built-in range() function.
* Add proper scalar checks to functions bound by nn.yaml.
By default, the forward functions use the default ATen scalar checks and the backward functions
use x_->isScalar() for grad_x (with grad_input mapping to self).
These can also be overridden by specifying a dict of arg_name -> scalar_check.
If the argument is not overridden and the default mapping cannot work (because x for grad_x is not
passed to the backward), an error is raised and the scalar_check must be explicitly specified.
* Fix scalar checks for loss functions with a reduce parameter.
Implement MM fusion (MM with add reduction tree)
A tree where leaves are matrix multiplies and inner
vertices are adds can be computed as a single mm.
Such subgraph often appear in backward if a single weight
is reused multiple times (e.g. in RNNs).
NOTE: this seems to be slightly slower on the GPU than the
naive implementation, but it's a huge win on the CPU
(think 100x lower overhead)
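Presumably the identity behind the rewrite is just block-matrix multiplication (for conforming shapes):
\[
A_1 B_1 + A_2 B_2 + \cdots + A_k B_k =
\begin{bmatrix} A_1 & A_2 & \cdots & A_k \end{bmatrix}
\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_k \end{bmatrix},
\]
so an add-tree over mm leaves collapses into a concatenation on each side followed by a single mm.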
* Test_autograd support for 0-dim input/outputs.
This uses the 'fake' _scalar_sum function to test scalar (0-dimensional) inputs and output in test_autograd.
Main changes:
1) Introduces a randn_like function (this is really just for convenience but it comes up often in testing).
2) Because the Tensor and Variable API are different wrt sizes, we take care to not exit the Variable API when
constructing Variables based on other Variables. This is pretty straightforward, but there is sometimes an extra
line of code for setting requires_grad. Should we have the 'like' functions maintain requires_grad? Or bind all
factory functions with an additional 'requires_grad' parameter?
* Fix flake8.
* Get rid of _scalar_sum tests.
* Use zeros_like instead of more complicated constructs.
Also remove _scalar_sum native function / derivative definitions.
Summary: Added the RowWise functionality for SparseAdam, which saves roughly 2/3 memory usage by only keeping one first and second moment term for each row of the parameter tensor, rather than one for each individual parameter.
Differential Revision: D6679342
fbshipit-source-id: ce6fb27e35ce41a890c66f6089cd2748d10e7a44
cuDNN batch norm uses mixed half/float precision in batch norm. This
changes the overload to only check that the arguments are of
VariableType and does not check their concrete type (scalar/backend).
Summary:
This is needed for #1740.
Verified that `./build.sh py2-android-ubuntu16.04` builds an Android base image with CMake 3.6.3.
Closes https://github.com/caffe2/caffe2/pull/1747
Differential Revision: D6729823
Pulled By: pietern
fbshipit-source-id: f7c888b4fba14ff6ea703cc269175b327b49f6b8
Summary:
We may not want to run the operator in a prefetching manner if we don't need any prefetching.
The option allows running it in a normal fashion without modification to any operator.
Differential Revision: D6717720
fbshipit-source-id: 10114d68edd95258b823603d8532360120421649
Summary:
PR Description
----------------
This commit updates how to install Caffe2 on the Ubuntu distribution.
The existing instructions are written as an installation guide for generic Ubuntu
distributions. Let's update the existing manual in more detail.
**Changes proposed in this PR:**
1. Added Ubuntu 14.04 section with existing contents.
2. Added Ubuntu 16.04 section
**Self evaluation:**
Tested (compilation in Ubuntu 16.04 x64 LTS)
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Closes https://github.com/caffe2/caffe2/pull/1723
Reviewed By: pietern
Differential Revision: D6692998
Pulled By: orionr
fbshipit-source-id: 8da9250ff27dbeb41f12364cdd531b2fb416c31f
* Use restat to reduce ninja rebuilding when running codegen.
Usually, you're only working on one codegen file at a time, but
in our old behavior, editing one would induce a rebuild of everything
that depended on ANY generated file. We fix this in two steps:
- Don't write the file (updating the timestamp) when the contents
are unchanged. (I had to update three separate places; shared
Python library for build tools when?!)
- Use the 'restat' ninja feature to avoid rebuilding when the timestamp
doesn't change.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* lintfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* lintfix2
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Previously, it printed [Variable]; now it prints [Variable CPUDoubleTensor].
I'm not altogether sure why toString on Variable returns the uninformative
thing, but that might be worth fixing too.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This commit fixes double-backwards on batch norm. There were two
bugs:
- Returned buffers from batchnorm backwards were being marked as differentiable
when they shouldn't be. The fix for this is "easy": use 'grad' instead of
'grads[0]' in cudnn_batch_norm's backward definition. (More on this below.)
- I was using toTensor on a Scalar, which gives me a Tensor of the wrong
type when I'm in CUDA world. Using the Scalar add() overload directly
solves the problem.
The differentiability of returned buffers was annoyingly subtle and I nearly
went off and implemented a big pile of infrastructure to "tell" the codegen how
to distinguish between differentiable and non-differentiable outputs before
realizing that there must be a way we do this legitimately, because it works for
THNN. I documented this in derivatives.yaml, and also added tests for the
problem in load_derivatives.py to catch the various ways you could "get it
wrong". Hope this helps someone else.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
The Gloo test was waiting only 10 seconds for processes
to terminate, causing tests to be flaky.
Reviewed By: pietern
Differential Revision: D6672990
fbshipit-source-id: c58ba512396a0e45fa6ea4d14534ab0ccd54f2a9
Summary:
[x] Have to rebase
[x] Have to ensure this works on macOS + Anaconda
Closes https://github.com/caffe2/caffe2/pull/1741
Differential Revision: D6714172
Pulled By: pietern
fbshipit-source-id: 43a16d99a6ddf821a35b512c780cdfa35a721219
Summary:
- fixed the false newline at the initialization of the crop layer translation which caused the exceptions described in issue #1215
Closes https://github.com/caffe2/caffe2/pull/1746
Differential Revision: D6716228
Pulled By: Yangqing
fbshipit-source-id: dd93b06b3b903f96505d6e6f8e67caeb6981fe66
Summary:
the fc needs to be in the output_gate_t scope so it can find its input
weights correctly
Closes https://github.com/caffe2/caffe2/pull/1739
Reviewed By: dzhulgakov
Differential Revision: D6705443
Pulled By: anderspapitto
fbshipit-source-id: 139e83ac77589a203ffe404fedab98eea5b1a51c
1) Zero-dim tensors to the fill functions that weren't bound (they couldn't be called successfully
because we haven't enabled scalars), and needed derivatives for their value arguments.
2) ne_ was missing a Scalar overload.
* ONNX: export sum, prod, sqrt improve log_softmax and fix a typo in doc.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Add new exported op to doc.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Double quotes.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Update trace log of log_softmax.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Improve export when dim is None and axes_i should be a list of ints.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Fix prod when no dim given.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Update line ends in test expected file.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
Summary: This diff enables setting the model initialization seed, instead of a random seed, when reproducible results are desired.
Reviewed By: xianjiec
Differential Revision: D6642971
fbshipit-source-id: 387b1ee2ecef4f8f66570c882498fb97d7007e17
* Distinguish between scalar tests and pyscalar tests.
* Distinguish between scalars and no arguments.
* Add NoArgsClass so NO_ARGS is iterable.
* Fix iterator specification in python3.
* Now fix for python 2.
* Fix flake8.
In `THPTensor_(_convertToTensorIndexers)`, a `vector<THPIndexTensor>` is
created by constructing `THPTensor`s from sequences/tensors/etc. Each
`THPIndexTensor` is then freed with the following:
```
for (auto& idx : indexers) {
THIndexTensor_(free)(LIBRARY_STATE idx->cdata);
Py_DECREF(idx);
}
```
This is a problem because `Py_DECREF(idx)` will turn `idx->ob_refcnt` to 0 since this function
created the relevant `THPIndexTensor`s and owns them, causing `THPTensor_(dealloc)` to be
called. `THPTensor_(dealloc)` already has a line that calls
`THIndexTensor_(free)(LIBRARY_STATE idx->cdata)`.
So `THIndexTensor_(free)(LIBRARY_STATE idx->cdata)` gets called twice on the same
`cdata`. After the first call frees `cdata`, the second attempts to access flags/members of `cdata` to
determine if it should free it.
Summary:
This should fix Protobuf version problems on all Anaconda builds by putting include directories under Anaconda before all other include directories.
Closes https://github.com/caffe2/caffe2/pull/1728
Reviewed By: orionr
Differential Revision: D6698435
Pulled By: pjh5
fbshipit-source-id: f73f4a5ebb4ca91db14770a88a704ace69d37ba4
Summary:
[Folly] Cut the `ScopeGuard` alias now that we have `auto`.
This form works because of hidden lifetime extension:
```lang=c++
folly::ScopeGuard guard = folly::makeGuard([] { /*...*/ });
// ...
// guard falls out of scope
```
But this form would not work correctly:
```lang=c++
folly::ScopeGuard guard = folly::makeGuard([] { /*...*/ });
std::async(std::launch::async, [guard = std::move(guard)] {});
```
Because `folly::ScopeGuard` is an rvalue-reference-to-base.
We have `auto`, so just remove `folly::ScopeGuard`. This form works correctly:
```lang=c++
auto guard = folly::makeGuard([] { /*...*/ });
std::async(std::launch::async, [guard = std::move(guard)] {});
```
Reviewed By: igorsugak
Differential Revision: D6690070
fbshipit-source-id: 54e32b300d36fce4eb95a59f1828819afe312ec0
Summary:
[Folly] Move `ScopeGuardImpl` and `ScopeGuardImplBase` into the `detail` namespace.
Let them be marked as private implementation details.
Reviewed By: andrewjcg
Differential Revision: D6665317
fbshipit-source-id: 03e8fee6a16338395ec92c582613b053bd9f74ec
Summary:
This is in principle similar to #1612 and is tested on Windows 2017. CMake passes, although there are still bugs in the MSVC compiler that prevent CUDA from compiling properly.
The difference between this and #1612 is that this diff explicitly puts the CMake files into a separate folder and uses a MiscCheck.cmake chunk of code to test whether we need to include them. See README.txt for more details.
Closes https://github.com/caffe2/caffe2/pull/1727
Reviewed By: pietern
Differential Revision: D6693656
Pulled By: Yangqing
fbshipit-source-id: a74b0a1fde436d7bb2002a56affbc7bbb41ec621
Summary:
Instead of constructing db_name as a member of checkpoint_manager, generalize
this function
Reviewed By: anshulverma
Differential Revision: D6671088
fbshipit-source-id: c528538def66933619f2fdf67820bca5d13571ea
Summary:
we are going to deprecate NNPACK bindings in caffe2/contrib/nnpack.
The first step is to move modern NNPACK bindings from caffe2/mobile/contrib/ios/ to
caffe2/share/contrib/nnpack/, and is implemented in this diff.
Reviewed By: sf-wind
Differential Revision: D6687454
fbshipit-source-id: 458614bade92ab5ba5d2ab7f0691071043198b57
Summary: Tests in Jenkins fail because test_global_pooling_3d filtered too many tests. We made use of the inferred value of global_pooling (pad and stride will be constant) to reduce the number of test samples generated.
Reviewed By: pietern
Differential Revision: D6686840
fbshipit-source-id: d316c0e9f9070b12770170ab9f36e33de68a9ab9
Summary:
* Also remove build status, since it isn't relevant here.
I'm tempted to just reference https://caffe2.ai/docs/getting-started.html and remove all of this, but seemed like it might be worth having a standalone installation.md doc.
Closes https://github.com/caffe2/caffe2/pull/1706
Reviewed By: Yangqing
Differential Revision: D6666561
Pulled By: orionr
fbshipit-source-id: 640f8100a5e4f8d6b2eee2266dd634bd25d0e58e
* Fix the inconsistency of `polygamma` on Tensor and Variable.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Regression test for #4466, polygamma works on variables.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Add macro IMPLEMENT_STATELESS_SWAP to dispatch stateless methods on Variables correctly.
When call stateless methods with more than one arguments and the `self` comes second,
the `self` argument needs to be swapped to the first position before dispatching.
The macro `IMPLEMENT_STATELESS_ADDXX` is still reserved for deprecated `add**`
methods.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
Summary:
In D5681122 - when routing to global maxpool and average pool, the condition is not correct.
see T24876217 for discussion
Reviewed By: Yangqing
Differential Revision: D6665466
fbshipit-source-id: dcb5b4686249e6ee8e1e976ab66b003ef09b32fd
ATen dispatch in the JIT interpreter needs to switch the current gpu,
but it is not handled in ATen itself, and no higher-level pathway
ensures the device is set correctly.
This also improves debugging information for cross-device issues.
This follows the behavior of numpy in that you can wrap dimensions over a scalar (0-dimensional
tensor) in the range [-1, 0]. I.e. scalarTensor.prod(0) and scalarTensor.prod(-1) works, but
scalarTensor.prod(2) does not.
The only current exception to this is with size(dim) and stride(dim);
there are no numpy equivalents of these (they are attributes), so it seems cleaner to just have
these as (dimensional wrapping) sugar for sizes()[dim] and strides()[dim]; otherwise there are
subtle differences in semantics, e.g. you have to use size(dim) when you want it to directly
apply to scalars, if the default value (1?) makes sense in that case. Simpler to just not have
that difference.
Note that this change can cause problems if code assumed that maybe_wrap_dim would throw an
exception in this case and then called sizes()[dim] or size(dim) without checking; I went
through the code and only found this case in squeeze/squeeze_.
cuModuleLoad is only valid for a single device so we need to
compile for the particular device that the fusion group will run on.
CompiledFunction already specializes different traces for tensors,
so we just need to have fusion_compiler produce the cuFunction on
the right device.
The gen_variable_type.py script now is only responsible for generating
VariableType.h/cpp. The parent script, "gen_autograd.py", delegates to
gen_autograd_functions.py, gen_variable_type.py, and
gen_python_functions.py.
I've removed "fallthrough" functions. It's replaced by
DONT_RECORD_TRACE, DONT_PROFILE, and DONT_REQUIRE_DERIVATIVE.
In preparation for binding the _out variants, I changed some static
types to Tensor (from Variable) and we now unpack and name tuple return
values.
1) Separates ASSERT_THROWS and ASSERT_THROWSM for checking messages vs not.
2) Adds TRY_CATCH_ELSE for python-style error checking
3) Uses ASSERT_THROWS and TRY_CATCH_ELSE more generally
The previous more ad-hoc constructions were often wrong, i.e. an assert could
pass if the logical 'else' threw an exception that then satisfied the assert in the catch.
Three stage plan to no more stupidly weird "why isn't cuDNN enabled"
bugs:
- Add torch.backends.cudnn.disable_global_flags(), which as its name suggests,
disables global flag setting in cuDNN, so that you are not allowed to
make changes to this state. However, the flags() context
manager continues to work (since they are non-global changes).
- Call disable_global_flags() in test/common.py
- Switch all of the manual flag setting/unsetting in test/test_nn.py
to use the context manager.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix tracking of tracing scopes during ONNX pass
* Use ResourceGuard to manage setting a temporary current scope in Graph
* Add tests for ONNX pass scopes
* Remove unused num_classes argument
Previously, we only tested CPU double-backwards, which is bad!
This would have caught #4422 (still not fixed, so those tests
are manually disabled) and also uncovered #4500 (not yet diagnosed.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Weight can be non-contiguous due to double backwards, where
we transpose the weight. I'm not very happy with this fix
but it seems to make the tests pass.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Out of bounds grads[2] access (thnn_conv_depthwise2d_backward
doesn't compute bias gradient)
- Groups was not set appropriately for depthwise convolution
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: D6636282 caused a regression test failure of the NMT model used in prod; see 24949620 for bisect history.
Reviewed By: pietern
Differential Revision: D6671602
fbshipit-source-id: d863013964666727cf488a6ac5b01f5216f149d9
Summary:
Added Caffe2 operator binding for Gloo Allgather algorithm.
Added new test to verify the binding. Binding is supported only for
CPU device with these changes.
Reviewed By: pietern
Differential Revision: D6610074
fbshipit-source-id: b21df9b5e71befbdb6841d6b146727bb4c83d753
Summary: GPU (CUDA) implementation of the Swish activation function in Caffe2.
Reviewed By: Yangqing, xianjiec
Differential Revision: D6656907
fbshipit-source-id: f5f2c667055abf679728d2b5d43998895ddec708
This mismatched paren causes a syntax error in generated code. I'm guessing the parentheses are necessary, since there was one in there before, but I don't actually know whether the compiler can produce things like a - (b - c) that would make them required.
Summary: Adds transpose CPU version to prepare for LC layer.
Reviewed By: Yangqing
Differential Revision: D6641358
fbshipit-source-id: 1825b4c270dea2c0049ba334303abcbf50b22ee7
Summary:
Some installations of numba seem to be incompatible with ASAN, so we
will disable its import.
Reviewed By: dzhulgakov
Differential Revision: D6664055
fbshipit-source-id: 311774667e54bdbf328ef280ab2a52ecba1361f2
Summary:
In this PR I do the following:
1. split lstm_test_main into several tests for LSTM, MiLSTM and various Norm based versions
2. instead of looping over various gradient / optimization parameters now they are random inputs through hypothesis.
3. These changes make the tests faster and we can avoid limiting the number of examples
4. Fix a minor bug with the gradient checker in the RNN unroll test running twice
5. Generate a seed for numpy in hypothesis. This makes hypothesis avoid flaky tests
Also note that Norm tests sometimes fail. I haven't looked into it much, it could be just precision issues. New test split should help identify these issues.
Closes https://github.com/caffe2/caffe2/pull/1678
Reviewed By: pietern
Differential Revision: D6657076
Pulled By: salexspb
fbshipit-source-id: 9f59c71ccd2c818156e9d2424c3423d450b8c8e2
BCELoss's outputs and gradInput computations are accurate to around 1e-6 on float types (as a relative value, not absolute), which is reasonable. However, the tests use absolute thresholds: the accumulation of 5 gradInputs has to have error less than 0.0002.
The worst case for BCELoss's gradInput for each element may be described as 1 / ( (1-x) * x ). Previously, the input to the test was restricted to [0.02, 1 - 0.02], resulting in a worst-case largest gradInput of about 50, resulting in a total accumulated grad of 50*5 = 250, resulting in an error of 250 * 1e-6 = 0.00025, which was too big.
By restricting x to [0.028, 1 - 0.028] we get a worst case of 36.74, resulting in a total accumulated grad of 184, which is less than the 200 needed to have error less than 0.0002.
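Written out with the same numbers:
\[
\max_{x \in [0.028,\,0.972]} \frac{1}{x(1-x)} = \frac{1}{0.028 \cdot 0.972} \approx 36.74,
\qquad 5 \cdot 36.74 \approx 184, \qquad 184 \cdot 10^{-6} \approx 1.84 \times 10^{-4} < 2 \times 10^{-4}.
\]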
* Add test for empty Variable cat (forward only).
* Test for empty cat (no grad/gradgrad checks)
* Support gradcheck on empty inputs, check it for cat with an empty Variable.
* Fix lint.
Summary:
* The request has finished. We might do others in the future, but removing for now.
Closes https://github.com/caffe2/caffe2/pull/1700
Reviewed By: Yangqing
Differential Revision: D6659664
Pulled By: orionr
fbshipit-source-id: cd49d41bdde3c07b5acbcd4724aaa359f69e4752
* Delete obsolete basic ops.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* More deletion.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete some unused utilities.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete dead apply_fn
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete CppFunction symbolic support.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete ForwardFunction
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Batchnorm is 'working'
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* make derivative changes and change destination --> result
* fix typo
* add changes for addcdiv also
* modify rsqrt derivative
* revert the derivative for addcdiv
* revert the derivative for div
* fix typo, sorry
Emits a warning if slices have the same size but different shapes. (It
shouldn't be allowed, but it was, so some code might be unknowingly depending on
the behavior.)
Also refactored argument checking code, including index_fill_.
Summary:
During debugging I found that our recently added automatic engine preference actually makes debugging a bit harder - it implicitly routes computation to e.g. CUDNN when we actually want to test out the default GPU implementations.
This diff adds a commandline flag that disables it.
Closes https://github.com/caffe2/caffe2/pull/1696
Reviewed By: pietern
Differential Revision: D6658765
Pulled By: Yangqing
fbshipit-source-id: ef56a16e778eeea6ecdd4dc6002421236e15371a
Summary:
This was introduced in D5681122 - it causes a pretty serious numerical issue
that broke the pooling test.
Specifically, if threadIdx.x > sz, max is initialized with an out-of-bounds index
and the max is incorrectly computed.
Reviewed By: pietern
Differential Revision: D6658945
fbshipit-source-id: 487222d26050921ff9c7764fe46076e31a99bb86
Summary:
GCC version check is currently being skipped when using the
newly released CUDA 9.1.
This will also handle other CUDA 9.x minor releases if any,
reducing our work if there are such releases like 9.2. This
assumes that the next major CUDA version will be 10.0,
needing adjustment only after such major version is
released.
Closes https://github.com/caffe2/caffe2/pull/1658
Differential Revision: D6659000
Pulled By: pietern
fbshipit-source-id: 79291b5da9d4e8b4f2c7ac82fe2b1e7939438bc9
This modifies NN binding in ATen so that the xxx_forward functions now
return buffers instead of taking them as inputs. The NN functions with
no suffix are implemented in Type.cpp. They call the xxx_forward
variants and discard any returned buffers.
This simplifies derivatives for NN functions. The derivatives are now
defined on the xxx_forward functions and buffers are treated as any
other input.
Summary:
There were no dimensionality constraints on the generated indices
array, causing many examples to be generated and filtered out. Instead,
we should ensure the probability of unique indices is high.
There is a better fix for this by using the `unique` keyword argument
to `hypothesis.extra.numpy.arrays`, but this is available only in
hypothesis version 3.28.0 and later.
This is related to #1536 and #1599.
Once this change has proven to be OK, we can modify the other tests
that now have health check suppression enabled as well.
Closes https://github.com/caffe2/caffe2/pull/1686
Reviewed By: Yangqing
Differential Revision: D6651789
Pulled By: pietern
fbshipit-source-id: d80886c9ccf0a7a842a7580a279f33a2d6cca97c
Summary: The current Load op can only load blobs from one file. We need to make the Load op support loading blobs from a list of dbs.
Reviewed By: boryiingsu
Differential Revision: D6596034
fbshipit-source-id: 906fa48b0ad61c83e247d497b6b079c04fed499f
Summary: TSIA - it used to cause build errors.
Reviewed By: pietern
Differential Revision: D6652354
fbshipit-source-id: fd291f662e3793b6d11a7e02e1acc741c027a1fd
Summary:
`contrib/prof` provides functionality for profiling (eg. `prof_dag`) but no CMake.
Hence, provide CMake support for building it.
Reviewed By: Yangqing
Differential Revision: D6640488
fbshipit-source-id: 9ed8095b10d7c0337db061206daf2a66f41f4713
Summary: change all use cases of BatchLRloss to the numerically stable version. This includes the uses of function build_loss defined in fbcode/caffe2/caffe2/fb/dper/layer_models/loss.py and class BatchLRLoss defined in fbcode/caffe2/caffe2/python/layers/batch_lr_loss.py.
Reviewed By: xianjiec
Differential Revision: D6643074
fbshipit-source-id: b5678556b03cbdd380cab8a875974a87c33d7f12
Implements nn.Embedding (lookup table) in ATen.
Breaking change: new optional argument padding_idx in F.embedding to
match nn.Embedding.
Note that there are a few bugs in Embedding that are inherited from the
previous code:
- CUDA renorm has race conditions if index contains duplicate entries
- sparse gradient doesn't work with scale_grad_by_freq
Summary: ReaderWithTimeLimit() class to stop after a certain amount of time
Reviewed By: boryiingsu
Differential Revision: D6477623
fbshipit-source-id: 165874c9344b0c9c7e0b33e12e72e24c46669cb2
1. master NNPACK now uses cpuinfo library, so we detect it and
add it to the list of libraries.
2. If a user builds nnpack with --inference-only, there won't
actually be enough symbols to successfully link against NNPACK.
This won't manifest until quite late in the build process.
So we now explicitly test that the gradient functions are
available in the library.
Upstream bug: https://github.com/Maratyszcza/NNPACK/issues/123
Fixes #4336
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Ran into a scenario where if the CPU op in MKLFallbackOp outputs an empty
tensor, attempting to copy the output to MKLMemory (https://fburl.com/www2mtt4)
crashes. Modify MKLMemory to gracefully handle this. This is done at the
MKLMemory level because we want to make sure that its members such as dims and
layout are Reset() correctly.
Interestingly, MKL calls fail at different points for dims {0} and dims {0,N} despite
the buffer size being empty for both - the former in dnnAllocateBuffer and
the latter in dnnConversionExecute (likely due to some difference in
layout?).
Also fixed CopyTo in addition to CopyFrom and tested all scenarios.
Reviewed By: ajtulloch
Differential Revision: D6646320
fbshipit-source-id: 61df585f610a949f312f05308baf310241dc9cb2
Summary: Extract some operators from utility_ops and normalize_op to reduce build size impact of depending on these files.
Reviewed By: Maratyszcza
Differential Revision: D6616741
fbshipit-source-id: 1757b6b8a3ce4e2a248deee61322344e5095e940
Summary:
Imported and modified from https://github.com/ARM-software/vulkan-sdk
I changed libvulkan-stub.cpp to libvulkan-stub.c
Reviewed By: Maratyszcza
Differential Revision: D6641092
fbshipit-source-id: 1a7fbf745d58b6111a06a983910c583912365357
This is a step towards removing the special casing of NN functions in gen_variable_type.py. It fixes the signature of in-place NN functions so that they return Tensor & instead of Tensor.
* Support ATen GPU pointwise apply and torch.where.
Like the CPU version, this implements an apply template that is almost identical to the
apply template already in THC, but using the ATen API. Much of this involves stripping out
the TensorUtils code (which is basically templated ATen-style), although a couple of functions
remain that are apply specific (and thus don't seem worth porting to ATen), namely
overlappingIndices, canUse32BitIndexMath, and getTensorInfo. We can make those generally
available if there's a need.
* Use int64_t instead of ptrdiff_t.
* Use snake case for _copyIgnoringOverlaps_.
Adds a missing bias term to the __repr__ functions of the
Linear and Bilinear modules. Fixes the spacing in the Conv2d
__repr__ to make it consistent with other modules.
* Improve matmul native test tolerance.
Because we don't directly use bmm in one case of matmul, a comparison to bmm doesn't make sense;
instead, we compare to the double result.
* Fix spelling.
Previously, we assumed that __main__ was the test file
being run, which is not true if you are using pytest. New
algorithm uses __module__ of the test class, which is a bit
more robust.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Still WIP, but works for the universal encoder. The other ones are currently broken.
Differential Revision: D6492786
fbshipit-source-id: 232e0058eb3a0c036de3adf0295db5efd624cca7
Summary:
Make operator QuantDecompZstd buildable in open source. The operator is not built by default; you need to specify -DBUILD_SHARE_DIR=ON -DUSE_ZSTD=ON to build it.
Test plan: Built Android Caffe2 with the change without issue. Ran a model with the operator successfully.
Closes https://github.com/caffe2/caffe2/pull/1613
Reviewed By: Yangqing
Differential Revision: D6556723
Pulled By: sf-wind
fbshipit-source-id: 453a7d787a55928f2dea1ed2b99f2df011aa8d26
Summary: Adding support for DLPack tensors to Python op
Reviewed By: Yangqing
Differential Revision: D6577702
fbshipit-source-id: e14ef213fcdb2930ffe164667971a92aa8db503c
Variable.new() should default to the device of "self" if no device is
specified. Previously, we were using the current device. This now
matches Tensor.new().
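A small sketch of the intended behavior (assumes at least two CUDA devices are present):
```
import torch

x = torch.randn(2, 3).cuda(1)   # "self" lives on device 1
y = x.new(4, 4)                 # also allocated on device 1, not on the current
                                # device, matching Tensor.new()
```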
Summary:
Thanks to feldim2425 we know that GCC 5 in Ubuntu 17.04 and later
doesn't define the macro _GLIBCXX_USE_C99 and by extension the
std::to_string, std::stoi, and std::stod functions (and probably
more). Instead of avoiding using these functions, we simply recommend
people to use GCC 6 or higher on the newer Ubuntu versions where GCC 5
doesn't work.
As a side note, CUDA 8.0 is compatible with GCC up to version 5. This
means that compiling Caffe2 with CUDA on Ubuntu >= 17.10 requires
CUDA >= 9.0. If you need to compile with CUDA 8.0 and are on
Ubuntu, you are stuck on version 16.04 or lower.
I verified this fix by running cmake on Ubuntu 17.10 with
-DCMAKE_CXX_COMPILER=/usr/bin/g++5 and observing the fatal error.
This closes #1633.
Closes https://github.com/caffe2/caffe2/pull/1645
Differential Revision: D6620812
Pulled By: pietern
fbshipit-source-id: 29af88cad9bede4fd952084c404c85db05baa9c4
Summary:
If we encounter failures while writing a checkpoint, ensure that the job does
not fail.
A job can make progress even if writing a checkpoint fails
Reviewed By: anshulverma, boryiingsu
Differential Revision: D6615163
fbshipit-source-id: 01f790422e1a81bab1fe73f86750eaf75a72bb77
- Rename THNN convolution to have thnn_ prefix.
- Propagate CuDNN benchmark and deterministic to at::Context
- Add 'convolution', 'convNd' and 'conv_transposeNd' native wrappers, with defaults
The conv_transposeNd wrappers are updated to have the same argument
order as Python.
- torch.nn.functional directly dispatches to the native wrappers
- Make it possible to turn off tracing for some native wrappers, so I don't
have to write symbolics for all the functions above
- Spectral ops can now make use of CuDNN convolution if possible
- Better commentary on cudnn_batch_norm
- Turn on DCE for all JIT tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
This means warnings and errors fire sooner rather than later.
This requires a fix for an issue where CMAKE_REQUIRED_FLAGS propagates
to some unrelated check, which then fails, because the Android
compiler doesn't support -mavx2.
Closes https://github.com/caffe2/caffe2/pull/1646
Differential Revision: D6620129
Pulled By: pietern
fbshipit-source-id: 4d1185406ebee3a523d39811bca6783bee82c898
* Batchnorm in ATen
This commit moves BatchNorm derivatives into ATen, eliminating
torch/csrc/autograd/functions/batch_normalization.cpp
Some refactoring along the way:
- Functions got renamed to remove _forward from their names
- CuDNN batchnorm forward was modified to return save_mean/save_std instead of
take it as parameters. To avoid returning undefined Variables, these return
(small) uninitialized tensors when they are not used.
- THNN batch normalization takes care of resizing save_mean and save_std on
forward.
- There are some shenanigans re batchnorm backwards in eval mode. I'm tracking
that in #4284
- I decided not to introduce buffers as a proper concept in ATen, which means
that tensors like running_mean/running_var are variables in ATen. This meant
there needed to be some adjustments to how we *trace* such variables; the
new strategy is if we can't find a Value for a variable, we look and see
if we have a Value for the buffer pointed to by the variable, before
finally falling back on constant.
- This PR finally reliably triggered OOM on Travis builds; I fixed this by reducing
the number of parallel jobs.
- Stop using std::string when it's not necessary.
- Remove training parameter from cudnn_batch_norm_backward, because it
doesn't make sense; cuDNN doesn't implement the math for evaluation mode
batchnorm backwards.
- batchnorm_double_backward is now in an anonymous namespace, as it
no longer needs to be called from torch/csrc
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Add Min and MinGradient Op
Reviewed By: jamesr66a
Differential Revision: D6608668
fbshipit-source-id: 7e1f8fa7a42a94f26152da0109d597e5deeb21c0
* Convolution derivatives in ATen
This PR introduces ATen implementation of convolution, which dispatches to
THNN/CuDNN/nnpack based on input parameters. The general strategy is to compose
this function out of the various forward-backward pairs of specific
implementations, rather than write a monolithic function with backwards (which
is what we did before because the boilerplate of doing it otherwise would have
been very high.) The new API provides the following functions:
- _convolution, which is a fully generic, native convolution implementation
that dispatches to various other convolution implementations depending on
input characteristics. This is prefixed with an underscore because it
explicitly takes benchmark, deterministic and cudnn_enabled which are
implementation details for CuDNN. The intent is to eventually provide a
convolution that reads these parameters out of the context using #4104.
- _convolution_nogroup is a convolution implementation for non-CuDNN
algorithms which don't support group convolution natively.
- _convolution_double_backward is the generic double-backwards implementation
for convolution.
In more detail:
- Most functionality from torch/csrc/autograd/functions/convolution.cpp has been
moved into aten/src/ATen/native/Convolution.cpp
- We continue to make use of ConvParams, but we now construct the parameters
upon entry to a function from the function signature (which does not use
ConvParams; having convolution take ConvParams directly would require teaching
the code generator how to accept these as parameters, complicating ATen's API
model) and destruct them when making subprocedure calls.
- I introduce a new idiom, input_r, which represents a const Tensor& reference,
which will subsequently be assigned to a local Tensor input. This is helpful
because a lot of the existing algorithms relied on being able to assign to
locals, which is not permitted with a const reference.
- The native argument parser now supports std::array<bool,2> inputs (NB: there
MUST NOT be a space; this is the same hack as is applied to derivatives.yaml)
- Native parser now supports Tensor? arguments, which indicates a nullable
tensor. Previously this function was only used by NN methods.
- Documentation updates on THNN library
- I added an extra fgradInput argument to VolumetricConvolutionMM_updateOutput
and VolumetricConvolutionMM_accGradParameters so that its buffer list lines up
with the backward argument list. This makes it possible to write derivative
for conv3d which previously was not supported (commented out in
derivatives.yaml)
- Extra double_backward declarations for all convolution backwards functions was
added.
- You can now use the syntax Tensor? in native_functions.yaml to indicate that a
tensor argument is nullable. There are adjustments to propagate this to the
Python argument parser.
- NNPACK was ported to ATen, and ATen now builds and links against NNPACK if
possible. New AT_NNPACK_ENABLED macro. The nnpack functions are
nnpack_spatial_convolution.
- Some modest CuDNN convolution refactoring to remove _forward from names.
- There's a new cudnn_convolution_backward function to deal with the fact that
CuDNN convolution double backward requires you to have computed all gradients
in one go.
- Variable set_flags now checks if the tensor is undefined, fixing a silent memory
corruption.
- checkSameType updated to not raise an exception if called with Variable arguments
- "no ATen declaration found for" error message is improved to say what available declarations are
- make_variable now accepts undefined tensors, and returns an undefined tensor in this case.
This is a part of making sparse tensors work with dataloader (#3898)
This exposes `_values()` and `_indices()` for sparse variables in python (and sparse tensors in Aten).
To do this, I added THDenseTensor* and THDenseIndexTensor* return value functionality to Declarations.cwrap. These should always mean "the dense equivalent of THTensor*" and "the dense equivalent of THIndexTensor*" respectively.
cc @zdevito for the THDenseTensor in cwrap addition
### Test Plan
Run the following:
```
import torch
from torch.autograd import Variable
v = torch.FloatTensor([3, 4, 5])
i = torch.LongTensor([[0, 1, 1], [2, 0, 2]])
x = Variable(torch.sparse.FloatTensor(i, v, torch.Size([2,3])))
x._indices()
x.data._indices()
x._values()
x.data._values()
```
* Further relax VariableFlags
* Allow a requires_grad=True trace to be used for a requires_grad=False
input by computing the gradient but then not connecting it to the
input.
* Enable CSE to de-duplicate WLM backwards pass code which calls sum twice.
* Fix a bug in the interpreter that frees a register too early when
it appears twice in a use list.
* [fuser] Follow all outputs to check if fusion is safe
This bug was introduced when we allowed fusion groups
to fuse together. Previously producers were forced to have a single
output, but now producers that are fusion groups can have multiple outputs.
So now we check the uses of all the outputs of a producer.
* [JIT] Fix handling of undefined inputs
It is not legal to call .data() on variable objects whose tensors
are undefined.
Summary:
hill: the learning rate changes according to the following 3 stages
1) linear warmup (increasing) at first num_iter steps from start_multiplier
2) inverse shrink (decreasing) afterwards (gamma, power)
3) lower bounded by end_multiplier
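A hypothetical Python sketch of these three stages, assuming the warmup is anchored at iteration 0 and the inverse decay at iteration num_iter (the actual operator may differ in such details):
```
def hill_multiplier(it, num_iter, start_multiplier, gamma, power, end_multiplier):
    if it < num_iter:
        # 1) linear warmup from start_multiplier up to 1.0
        m = start_multiplier + (1.0 - start_multiplier) * it / float(num_iter)
    else:
        # 2) inverse shrink after warmup, controlled by (gamma, power)
        m = 1.0 / (1.0 + gamma * (it - num_iter)) ** power
    # 3) never drop below end_multiplier
    return max(m, end_multiplier)
```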
Differential Revision: D6565379
fbshipit-source-id: 9c0e51fc825ba6a7765803a1f09479497057a9d9
Summary:
Implemented syntactic sugar for the following constructs:
- `x.Gather(y)` can now be written as `x[y]`
- `x.Slice(start, end)` can now be written as `x[start:end]`
For slicing, `start` and/or `end` can be omitted iff `x` is one-dimensional (i.e. a vector). That is, `vector[start:]`, `vector[:end]` and `vector[:]` will work. Doesn't work for higher-dimensional tensors because to emit the start/end indices we need to know the rank of the tensor (since `Slice` requires one entry per dimension of the tensor).
Also added a `getProto()` function so that I could test that the generated code is as expected (i.e. that the syntactic sugar does not affect the structure of the output).
Reviewed By: zdevito
Differential Revision: D6605864
fbshipit-source-id: 786359713a13314c24be2fc07e01486c507404ef
Summary: Simple fallback implementation to support LengthsRangeFill, we can have native CUDA implementation later
Reviewed By: pietern
Differential Revision: D6594031
fbshipit-source-id: b705234a591a61e8d1ee5f7524aceec3f4581f9c
Summary:
In layer model helper, add a method `maybe_add_global_constant` to ensure
that when two global constants are added with the same name, we check if they
are actually the same (by initializer) and only add it once.
Reviewed By: kennyhorror
Differential Revision: D6537532
fbshipit-source-id: 37aa3860a2e40d81161ccdea0c50a316248be2e2
Summary: Adds support for backprop to While op, fixes gradient computation for Pow
Reviewed By: azzolini
Differential Revision: D6456875
fbshipit-source-id: 9f660317ad6f3898ff7d8ce43098f85c3426409b
Summary:
Yangqing pietern
With https://github.com/caffe2/caffe2/pull/1627, Caffe2 can be statically built with USE_ATEN=ON and USE_CUDA=OFF. But the function deleterFor defined in aten_op_template.h causes duplicated symbols in libcaffe2.a and libcaffe2_gpu.a.
I checked that we call this function in only one place, so I manually inlined it into the caller. Later, when we use it in other places, we can just extract it again and put the implementation in aten_op.cc.
Closes https://github.com/caffe2/caffe2/pull/1632
Reviewed By: pietern
Differential Revision: D6594063
Pulled By: houseroad
fbshipit-source-id: 2328e2b2dce819378a9f18411c449830917e0d6a
* Refactor cudnn code layout / make build more robust.
When I previously moved cuDNN into ATen, I wasn't too familiar with the
ATen native function directory layout, and so I did a number of
suboptimal things. This commit fixes those problems.
- If NO_CUDA was set but cuDNN is installed on your system, we'd incorrectly
assume that CUDNN was enabled, to hilarious effect.
- We now distinguish between cudnn implementation files and cudnn
native function files. The native files now live in ATen/native/cudnn,
and are *unconditionally compiled*, even when we are not building with cuDNN.
This means that we can unconditionally declare cudnn functions in yaml
and they are always available, even if they are broken. The cuDNN specific
files live in 'cudnn', they are *never* installed, and they are used
purely for implementation purposes. I had to add stub implementations of
all ATen functions to achieve this.
- I had written headers for at::native functions manually, but codegen
will generate them for me automatically. So I deleted the headers.
That lets me get rid of some header install logic as well.
- There's a new note about ATen preprocessor philosophy.
* add exponential distribution
* add exponential tests
* fix default val of sample_shape
* lambd->rate
* updates per review
* remove notes, keep failure_rate same in exponential test
Summary: hoangmit reported an ASAN test failure on D6389022. Upon further investigation, it appeared there was a logic error in calculating shapes when either the A or B matrix is being broadcast. This patch fixes that error.
Reviewed By: dzhulgakov
Differential Revision: D6580307
fbshipit-source-id: 2bcf9b76f668c42a463f2f0fdc82f544af3ae721
This removes volatile from Variable. The functionality is mostly
replaced by a global (thread-local) flag, which is controlled by
torch.set_grad_enabled() and the context manager torch.no_grad().
In C++, the flag is exposed through GradMode::is_enabled() and GradMode::set_enabled()
Fixes #3627
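A minimal sketch of how the replacement flag is meant to be used:
```
import torch
from torch.autograd import Variable

x = Variable(torch.randn(3), requires_grad=True)
with torch.no_grad():
    y = x * 2                     # nothing is recorded inside the context manager
assert not y.requires_grad

torch.set_grad_enabled(False)     # the same thread-local flag, toggled globally
z = x * 2
assert not z.requires_grad
torch.set_grad_enabled(True)      # restore the default
```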
* Support CPU Apply directly in ATen and implement standard_gamma using it.
Main changes in this PR:
1) Added a TH_APPLY-style templatized function for CPU apply calls (currently only 2 and 3 tensor argument
versions are supported, but more are easy to add). In fact, this is basically identical to TH_APPLY, except
it uses ATen functions and the API is a template instead of a macro. The template takes an operation that
is performed on the data (and an indicator to signal early termination); i.e. you don't need to know that
x_data is a pointer to the current data location of x.
2) Refactors the ATen dispatch code to easily generate dispatch code for different subsets of the scalar types.
This is in preference to the template_scalar path, which requires valid specialization of each scalar type. Valid
specializations are particularly annoying with CUDA because you most likely can't put the specializations
in a header so need to write some sort of for-all-scalar-type macro to get the correct specializations.
Currently, we only generate dispatch_all (all scalar types, the equivalent existed already), and
dispatch_cpu_floating_types (which is used by standard_gamma).
3) Implements standard_gamma using the above changes (this is an arbitrary choice, it was the latest
apply macro to be committed). The forward is bound via Declarations.yaml,
the backward via the Apply template, and then they are hooked together in derivatives.yaml. This eliminates
needing to change TH at all going forward, which means one can write idiomatic C++ instead of the TH-style macros
(e.g. TH_MATH_NAME).
* Generate Dispatch code with nicer spacing.
* Small cleanups.
* Fix typo.
* Add TODOs for changing macros, remove dead code.
* Use a lambda function.
* Get rid of early exit.
* Rename Scalar,ScalarType template parameters to CScalar.
* Reorder _standard_gamma_grad parameters.
* Add comments explaining calling convention.
* Don't generate Dispatch.h anymore.
* Get rid of backend specific checks in dispatch.
* Fix empty/scalar check.
* add reduce arg to PoissonNLLLoss
* fixed comments except reference function
* fixed unit test
* small indentation fix
* fixing last comments by richard
* lint check
* another linting issue
* Add default PyTorch seeding and worker_init_fn to DataLoader
* generate seed using current RNG each time
* worker_seed <- main_proc_RNG_generated_seed + worker_id
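A hedged example of how the per-worker seed can be consumed from a worker_init_fn (the dataset and the numpy seeding here are illustrative only):
```
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # torch.initial_seed() inside a worker already reflects the
    # main-process-generated seed plus the worker id; reuse it for numpy.
    np.random.seed(torch.initial_seed() % 2**32)

dataset = TensorDataset(torch.arange(100).float())
loader = DataLoader(dataset, batch_size=10, num_workers=2,
                    worker_init_fn=worker_init_fn)
```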
* Fix catArray in THTensor
Asserts that the inputs have the same size except in the
cat dimension or are empty (or a mix of both).
* Fix catArray for THCTensor
* Document torch.cat shape checks
* Fix types
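An illustrative example of the documented shape rule:
```
import torch

a = torch.randn(2, 3)
b = torch.randn(4, 3)
c = torch.cat([a, b], dim=0)   # ok: sizes agree in every dimension except dim 0
print(c.shape)                 # torch.Size([6, 3])
```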
* Implement pin_memory() as a NativeFunction
This adds allocators as a concept in ATen that extends deleters. An
allocator is a subclass of at::Allocator that implements the virtual
methods:
virtual void* allocate(size_t n);
virutal void deallocate(void* ptr);
A tensor created with a custom allocator can be resized, unlike a tensor
with a custom deleter.
* Rename AllocatorContext to AllocatorRetainable
* Implement Variable.cuda using ATen
This adds an optional async flag to Tensor::copy_, which attempts to do
a non-blocking copy if the one of the tensors is in pinned memory and
the other is a CUDA tensor.
* Perform cross-device copy in CopyBackwards
Also call torch.cuda._lazy_init() from Variable.cuda()
* Implement Variable.type via ATen
* Changes from review:
- remove copy_out
- remove unnecessary include
- fix default device for .cuda()
* Combine if statements in dispatch_type
* Re-initialize autograd engine in child processes
The autograd engine uses threads for backwards. These don't exist after
forks and they were not being re-initialized because the
Engine::start_threads_flag was already set. This re-initializes the
engine in child processes, which will cause it to re-create threads when
backwards() is called in the child process.
Note that we only attempt to handle the common case where fork() is
called while the backwards threads are idle.
Fixes #3966
* Avoid non-async-signal-safe functions in fork handler
* Rearrange dimensions for pointwise operations for better performance.
In existing code, pointwise operations on transposed tensors process data
"column by column", resulting in poor performance. The worse case happens when
all operands are transposed tensors.
This change tries to "un-transpose" tensors in such a case, so that memory
access patterns are as sequential as possible.
* More explanation on what rearrangeDims() does.
* Fixed a very important (and stupid) typo.
sys.path is searched from first to last, which means that if there is already
a 'tools' directory in the existing python path, we will fail to find the root
directory of PyTorch. Better to put it first.
This method prints a bunch of useful debug information including
the traces that have been record, their shapes, and the traced
graphs associated with them.
Summary: Use MPSCNNDepthwiseConv when groups == input_channels
Reviewed By: ajtulloch
Differential Revision: D6541561
fbshipit-source-id: 7164f26b8f3a101c0ab5c3e6c02ed855397d2750
Summary: Ran into some issues where these values seemed to be initialized to 0 and caused some trouble. Initializing to 1 is safe and well defined.
Reviewed By: hlu1
Differential Revision: D6582774
fbshipit-source-id: 088ec4e782d9680a1d9b4d2d42523d06cbc7dd72
* Trace ATen non-primitive functions as themselves, not their implementations.
Previously, if I invoked an ATen non-primitive function foo, which in turn
called subfoo, I would always see 'subfoo' in the trace (e.g., tracing
'inlines' all of these operations.) Such inlining is bad for ONNX
(and can be bad for optimization) as it prevents high-level
optimizations from taking advantage of the structure. It might
be right to inline, but give the optimizer a chance to work before
inlining happens!
The implementation here is surprisingly simple, because it uses
the "DCE trick". Essentially, it doesn't matter if the constituent
calls perform tracing, because you can always trace it again, and
override the trace nodes associated with the returned variables.
The original trace becomes dead and can be DCE'd.
While implementing this, I also refactored how 'isTracing' and
'trace_outputs' works:
- isTracing was previously a single function with overloads for
both Tensor and Variable arguments. Unfortunately, such overloads
are not safe, because of how C++ implicit conversions work. You
would think that C++ should never confuse an overload for
Variable with ArrayRef<Tensor>, but this is exactly what can
happen: Tensor is convertible to both Variable and ArrayRef<Tensor>,
thus it's ambiguous and C++ doesn't like it. The last time I ran
into this problem, I applied initializer lists to everything and
called it a day. A more robust fix is to separate out the
Variable and Tensor overloads, which I have done in this patch.
- trace_outputs was fed as an initializer list, which doesn't work
when you have heterogenous inputs. So instead we first feed
everything through 'flatten', which has overloads for each of the
argument patterns in ATen, which then goes on to the recordTrace
(which takes an ArrayRef). This is *no less efficient*, because
we were allocating a vector anyway (to do the conversion from
vector of Tensor to vector of Variable).
These fixes mean that 'index' can properly be traced... although the
JIT still does not support it. A failing test case has been added to
this effect.
Some knock-on effects:
- The fuser now knows about chunk as well as split. They're pretty
similar so there is no problem.
- There is a new 'canonicalize' pass in the JIT which renumbers a graph
so that all structurally equivalent graphs render the same.
- We run DCE before the fuser tests, to make sure dead nodes don't
block fusion.
- There are new ONNX exports for the newly introduced higher level ATen
operations. This includes type_as (no-op case only), chunk, select.
Zach didn't like the extra use of 'native' in the new codegen, so
we've introduced a new concept, 'abstract'. An abstract function
is one that is implemented in derived types (e.g., CPUDoubleType),
whereas a concrete one is implemented in the base type (Type).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix another leak in pybind11 code.
This time caused by an upstream pybind11 bug:
https://github.com/pybind/pybind11/pull/1216
This change causes the code to go down a non-buggy pathway.
* Relax verify of VariableFlags
If we trace with a defined tensor, but see a run with undefined
tensors, we now allow that run to happen, replacing the tensor with
zeros.
This also fixes a bug where stage 0 tensors were not
checked against their verify flags.
This change does _not_ handle all bad situations that can happen.
For instance, if the first thing traced has an undefined tensor but
a later run has that tensor defined, then it will fail because the graph itself
does not contain the trace for the derivative of the tensor.
However it is possible to work around this later case by
dry-running the function:
z = Variable(...,requires_grad=True)
x,y = f(z)
(x.sum() + y.sum()).backward()
Summary:
This assumed that the expect statement would run within 1us, whereas
we only care that it runs in less than 100ms to check that it got reset.
Closes https://github.com/caffe2/caffe2/pull/1606
Reviewed By: Yangqing
Differential Revision: D6572951
Pulled By: pietern
fbshipit-source-id: fd0c2854bc6459c8bf0e17fa75035eb0a4e522cd
Summary: Currently these operators are implemented in a complex meta-programming fashion. I removed the definitions and put modified CPU/CUDA implementations into reduction_front_back_ops.{cc,cu}. This will help future extension of these ops to support lengths input.
Reviewed By: asaadaldien
Differential Revision: D6506568
fbshipit-source-id: 7323baf7c8e0eca37912f3ae28c02e37ad2e1103
Because it is hard to know whether -fopenmp will work on a user's machine,
we just try it, and then disable it if it doesn't work.
Fused kernels are now competitive with the stuff in TH when the kernel
is flops bound, and faster when the original kernel was memory bound.
Summary:
Commit 479e4ce5 didn't end up solving the health checks firing and
they are likely still caused by the remaining `assume` calls.
Closes https://github.com/caffe2/caffe2/pull/1625
Differential Revision: D6573036
Pulled By: pietern
fbshipit-source-id: eeb21bdd61dca0a632eb1ba9e529177ac2569bfd
Summary:
The install prefix we use in our builds is /usr/local/caffe2. This is
not standard, so in order to load caffe2 from Python, the Python
interpreter must know where to find it. In a post-build section in the
Jenkins build script we now add a symlink to Python's dist-packages
directory and instruct the loader to look in /usr/local/caffe2/lib.
Together, these tricks make it usable out of the box.
Closes https://github.com/caffe2/caffe2/pull/1617
Differential Revision: D6572322
Pulled By: pietern
fbshipit-source-id: c37b789a0d0babbb1110f991318c6b75fe351c0e
Summary:
As titled.
This will fail with the message: File "/mnt/xarfuse/uid-30088/f8742a88-seed-a26ddfbc-49aa-4c5f-9e08-91909f4775da-ns-4026532692/caffe2/python/layers/concat.py", line 52, in __init__
"Concat expects that limited dimensions of the input tensor"
This is because the output scalar of the pairwise_dot_product layer won't contain shape information if output_dim is 1.
https://fburl.com/1m9r3ayp
This diff fixes it.
Reviewed By: xianjiec
Differential Revision: D6565930
fbshipit-source-id: 181181232065ef3fdfc825aa25d2714affbe6b8d
Summary:
There is a lot of business logic around various events in
the base net class. SimpleNet doesn't have to handle those (checked
with ilia-cher). Normally there should be no events registered for
simple nets, but we can have issues where they get added, so
it's less error-prone to just keep SimpleNet::Run pure. And then we
also avoid extra virtual calls / empty vector iterations.
Reviewed By: ilia-cher
Differential Revision: D6551440
fbshipit-source-id: c97a732a00bb36eed49d35e727156ce94225a08b
Summary: A version of MILSTMCell which uses layer normalization (see https://arxiv.org/pdf/1607.06450.pdf). There's a lot of copypasta because we don't want to make the existing RNNCell classes harder to approach / understand by adding new options.
Differential Revision: D6564208
fbshipit-source-id: 0bc43e12b6c08ebdf5ea6af2c631f785c302bdb4
Summary: Observer passed to RNN step net cloned with RecurrentOperator as subject instead of internal Operator. This diff adds the internal operator as the subject.
Reviewed By: enosair
Differential Revision: D6560996
fbshipit-source-id: 7af4fb0ff8c19795b5c994c5fc6876f3d2ba7bf4
This is not currently used by anything, but eventually ATen
will need to make decisions about whether or not to use
CuDNN functions or not, which means we need to propagate
this variable to ATen.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Better error messages for blas ops with cuda.LongTensor
Fixes#4157
Test plan
Try matrix multiplying with cuda.LongTensors
>>> import torch
>>> x = torch.randn(4, 4).long().cuda()
>>> y = torch.randn(4, 4).long().cuda()
>>> x.mm(y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: addmm for CUDA tensors only supports floating-point types. Try converting the tensors with .float() at /private/home/rzou/pytorch/pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:381
Summary:
We see a non trivial overhead because of this debugging
code. I talked with Romain and looks like we can comment this out for
now. We will think about better way to integrate this kind of
functionality in Caffe2 going forward
Reviewed By: romain-intel, pietern
Differential Revision: D6551108
fbshipit-source-id: efa3e643b953d33dc5f3d11f88cafdf2730bc4e4
Derivatives for NN functions now have to be specified in tools/autograd/derivatives.yaml. Leaving a function out will result in that function not being available in autograd.
Note that _backward declarations used in derivatives.yaml are auto-generated by aten/src/ATen/nn_parse.py so the content of tools/autograd/derivatives.yaml has to reflect the generated declarations.
This is an inconvenience, although it's smaller than it looks: future kernels will be implemented directly as ATen native functions.
As a help to the user, we could eventually save declarations generated in nn_parse.py to a file.
* Avoid automatic generation of NN derivatives
* Add inplace functions
* Refactor nn preprocessing function
* Use output instead of self in inplace derivatives
* Include grid_sampler in derivatives
* Finish fixing grid_sampler and affine_grid_generator
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Factor out setting up derivatives, use the same logic for NN and non-NN codepaths
* Implement remaining random methods through ATen
* Change test_bernoulli on Tensor to avoid broadcasting
The new ATen-dispatched bernoulli_ supports broadcasting. The old
Tensor.bernoulli_ bindings instead require the tensors to have the same
number of elements. I haven't change the old code because it will be
deleted soon.
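A short sketch of the broadcasting now supported by the ATen-dispatched bernoulli_ (shapes are illustrative):
```
import torch

probs = torch.rand(1, 4)                    # broadcastable probabilities
out = torch.empty(3, 4).bernoulli_(probs)   # the old binding required matching numel
```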
Summary: The "assume" statement in adagrad_test leads to a health check failure. Here we remove it by checking dc == hu.gpu_do
Reviewed By: pietern
Differential Revision: D6513314
fbshipit-source-id: 4caf2d938e5f5935a95cca8abd99185182223d63
Summary:
This enables two learning rates for the Generator and Discriminator in a GAN. For each iteration i, it will decide
whether to enable training on G (or D) based on the desired active_period and inactive_period for G (or D).
Reviewed By: dragonxlwang
Differential Revision: D6379325
fbshipit-source-id: 926f1041e25f48791b2ac1fc1a8eaa08db9639b8
Summary:
Adds modules:
a = Module() # create a module
a.b = 3 # set tensors in module
a.c = 4
b = my_func(a) # pass a module to a function as an argument
c = b.what + 1 # and receive a module as a return
global foo
foo.a.b # translates to Caffe2 name foo/a/b
This should help clean up beam search where many external nets are grouped
into modules.
Reviewed By: jamesr66a
Differential Revision: D6543292
fbshipit-source-id: 349eae0b1609efab4557f94650938e1fa543579d
Summary:
This also removes the `bin/{build.sh,test.sh}` scripts that are now
located in `.jenkins/{build.sh,test.sh}`. The rationale for this is
that these scripts don't care about Docker specifically and are also
run for, for example, macOS builds.
Closes https://github.com/caffe2/caffe2/pull/1610
Differential Revision: D6546204
Pulled By: pietern
fbshipit-source-id: 643bfb0c342b1719c0fb51e4e0987b2674e6424f
Summary:
Builds can then execute rendezvous where a shared file system is not available.
Closes https://github.com/caffe2/caffe2/pull/1530
Differential Revision: D6543267
Pulled By: pietern
fbshipit-source-id: a924e2d8c26e0e30e95673ca17c7e1f40f43b3dc
Summary: Remove scoping assertion because it is not useful and is causing errors
Reviewed By: salexspb
Differential Revision: D6538219
fbshipit-source-id: e587e294d4beec1370e6895af9354f0818a4cdd8
Summary:
Part of a 2-step process to move the Jenkins entry point scripts from
`docker/jenkins/bin` to `.jenkins`.
Closes https://github.com/caffe2/caffe2/pull/1605
Differential Revision: D6537959
Pulled By: pietern
fbshipit-source-id: 716b2e6bd50bbfe56b0bb844dd6b0c666a52527c
Summary:
Change the directory name for the IPython notebook.
Change the executable name from ipython to jupyter.
Pass arguments given to the script on to the notebook, instead of hard-coding --ip='*'. In some setups, --ip='*' causes the Jupyter notebook not to be displayed.
Closes https://github.com/caffe2/caffe2/pull/1546
Reviewed By: pietern
Differential Revision: D6460324
Pulled By: sf-wind
fbshipit-source-id: f73d7be96525e2ab97f3d0e7fcb4b1557934f873
Summary: Updated SingleThreadAsyncNet to use new interface
Reviewed By: ajtulloch
Differential Revision: D6526515
fbshipit-source-id: 6aa24678ba7350a5e448e9c2ab29ccd07a1fcb0b
* Ensure RNNCell variants don't broadcast
* Fix lint
* Add test for hidden_size=1 in RNNCell no broadcasting test
* Prevent broadcasting for hidden_size and input_size
* Isolate input checking from hidden size checking
Summary:
PR #1536 suppressed test_sparse_adagrad but test_row_wise_sparse_adagrad also filters too many examples. Suppress health checks for this test as well.
Closes https://github.com/caffe2/caffe2/pull/1599
Differential Revision: D6530850
Pulled By: pietern
fbshipit-source-id: c73f30d2e104565421e3e381b1cf66185edc833e
Summary:
Flops in conv were underestimated when pad is not zero.
The difference is especially big when image is small.
Reviewed By: salexspb
Differential Revision: D6394190
fbshipit-source-id: b9f057fceae77f745c5daa668cb2100f993d21a7
Summary:
This fixes the in-tree protoc build on CentOS 7 (that ships with super old protobuf version).
Closes https://github.com/caffe2/caffe2/pull/1595
Differential Revision: D6529307
Pulled By: pietern
fbshipit-source-id: ac81c7cd884846854b4ffd4909377e87d93bddc3
Summary:
Also add int as a datatype and correctly check error codes on group
start, end
Closes https://github.com/caffe2/caffe2/pull/1590
Differential Revision: D6524086
Pulled By: pietern
fbshipit-source-id: 385aab6fe1bbf6b5c06fa905066bc576a733c856
We'll need these functions when we merge Variable and Tensor. They throw
an exception if called on a Variable that requires grad. As of now,
every Variable that has a grad_fn also requires grad.
Summary:
Uses caffe2 operator schema to check # of inputs/outputs.
Falls back to actual schema->Verify so that schema errors get
reported associated with a SourceRange.
Reviewed By: jamesr66a
Differential Revision: D6517136
fbshipit-source-id: 9be89165ea5e717c4cec1d25bbd967df86200d6c
Summary:
Adds the ability for a script function to call another and adds the extern function to register an external Caffe2 Net that can be called by the script.
Closes https://github.com/caffe2/caffe2/pull/1591
Reviewed By: jamesr66a
Differential Revision: D6515877
Pulled By: zdevito
fbshipit-source-id: b893d9e4bacd7389b550ac8a37ad7974b95de749
* Bind cauchy_, exponential_, normal_, uniform_ functions to THPVariable.
Also changes the error messages around Generator parser; previously, you'd get an error
like: torch._C.Generator is not a torch.Generator; now the check is proper but returns
that only None is supported.
* Support passing Generators to ATen Variable-bound methods.
This involves changing THPGenerator to have an at::Generator rather than a THGenerator.
TH getRNGState, setRNGState are still called directly because they are not bound from ATen yet;
they should probably be on the Generators and return (opaque) GenerateState objects.
* Fix default values.
* Properly use THRandom_initialSeed.
* update standard gamma to use new default generator.
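A minimal sketch of passing a Generator to the newly bound in-place random methods:
```
import torch

g = torch.Generator()
g.manual_seed(0)

x = torch.empty(5)
x.normal_(generator=g)          # in-place random methods accept a torch.Generator
x.uniform_(0, 1, generator=g)
```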
The C/C++ unary negation operator is well defined for unsigned types. We
should use that behavior. This also implements neg for CharTensor. That
behavior currently depends on whether char is signed or unsigned.
Fixes #4066, #3225
Summary: word_rewards data type is mixed; ConstantFill assigns long but the blob is later filled with float32. This causes issues when running the net from the outputted protobuf. This change makes the data type float32 for the lifetime of the blob.
Reviewed By: jhcross
Differential Revision: D6486723
fbshipit-source-id: c4ce5185a0a6d71b08b1819f2355e9354823b701
Summary:
This can be used for testing and debugging. zdevito and I will primarily use this for our caffe2 script project
Closes https://github.com/caffe2/caffe2/pull/1585
Reviewed By: zdevito
Differential Revision: D6501209
Pulled By: jamesr66a
fbshipit-source-id: fdd65e422c44b74bb6926320af506dcae13327f3
Summary:
* condition if
* True/False literals
* and, or, not
* 0-output expressions, like print
* _ is given a fresh name
* x.foo(...) is desugared to foo(x,...)
* +=, *=
Closes https://github.com/caffe2/caffe2/pull/1581
Reviewed By: jamesr66a
Differential Revision: D6495256
Pulled By: zdevito
fbshipit-source-id: b601d3f9e08fa544881a0c946b4feac24cb7e116
Summary: Turns out that similar to RoIWarp, col2im in custom ConvTranspose implementation is also missing a bound check for image.
Reviewed By: ajtulloch
Differential Revision: D6494061
fbshipit-source-id: 1fadbdd05f360b20343df49b70d2be65eab128ac
Implements from_numpy using ATen tensors. Variable.from_numpy is a
convenient placeholder for the variant that returns Variables until we
merge Tensor and Variable.
The behavior is slightly changed:
- from_numpy() on an empty array now returns an empty tensor instead of
throwing an exception. The shape may not be preserved.
- CharTensor(ndarray) used to throw an exception. It now copies the
ndarray. Copying is implemented via ATen toType.
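A short illustration of the behavioral changes described above:
```
import numpy as np
import torch

empty = np.zeros((0, 3), dtype=np.float32)
t = torch.from_numpy(empty)        # now returns an empty tensor instead of raising

arr = np.arange(4, dtype=np.int8)
c = torch.CharTensor(arr)          # now copies the ndarray instead of raising
```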
Summary: Fix MPSCNNRoIWarp and made it more general to channels
Reviewed By: ajtulloch
Differential Revision: D6493869
fbshipit-source-id: 77cfa2e2f3bd80efc6e69a0774793e0162d9942a
Summary:
lines such as
output_scores = best_scores_per_hypo + scores_t_squeezed
hypo_t_int64 = best_indices / 6LL
will emit the respective binary operator (e.g. `Add`, `Div`) with the `broadcast` flag set to 1
Closes https://github.com/caffe2/caffe2/pull/1577
Reviewed By: zdevito
Differential Revision: D6489991
Pulled By: jamesr66a
fbshipit-source-id: 3bef2bd43dfa18659a299cc62affd74f9a763491
Summary:
1 is an int32
1LL is an int64
1f is a float
Still need:
Parsing out numbers such as 1.0 as integer. 1.0f should work, though
Closes https://github.com/caffe2/caffe2/pull/1576
Reviewed By: zdevito
Differential Revision: D6489944
Pulled By: jamesr66a
fbshipit-source-id: 46aab9483a18a31d883c8c7e3086d3074fa5efac
Summary:
Previously, the GetProfDagStats operator collected the per-op-type cost of a given prof_dag net.
With this diff, the operator GetProfDagStats has a new option "per_op". When it is false (the default), the operator still calculates per-op-type cost.
Otherwise, it returns per-op cost, and the cost of multiple instances of the same op type is calculated separately.
Reviewed By: heslami
Differential Revision: D6478547
fbshipit-source-id: 82f00f5fb262cd60b81d2accdd8e3598ddf2eefe
Summary: Replace the fallback implementation by native CUDA code. Minor edits of PackSegmentsOp: let all computation use one buffer tensor.
Reviewed By: asaadaldien
Differential Revision: D6455236
fbshipit-source-id: 71f146c470009d1cecf3f2e2f5c381b1751c061c
Summary:
Adding if and while control ops to brew, also adding unit tests
Note: unlike net_builder, where we can figure out which blobs are external and which ones are local to subnets, here in brew we need to use the external_blobs param explicitly to point at external blobs
Reviewed By: harouwu
Differential Revision: D6440508
fbshipit-source-id: c920f0af84b77ccb2d8462ffc7567bb1908c844a
Summary:
* Fix typo in negative constant handling "Negate" -> "Negative"
* Fix unpacking constant in parsing elements for a list attribute
* Parse negative signs in constants
* Switch list syntax to use square brackets in attributes
Closes https://github.com/caffe2/caffe2/pull/1572
Reviewed By: zdevito
Differential Revision: D6483286
Pulled By: jamesr66a
fbshipit-source-id: 949e8fd6a96b12efde756bac9da987da0010e153
* avoid writing `x + 1.0000*y` which causes a promotion to double from float
* refactor tests to make writing graphs easier (while not strictly necessary,
I have some benchmarking code that I am using to make the fuser faster
that is easier to write in this form)
* option to dump the disassembly of the CPU fused code for perf debugging.
Summary:
This is in order for Android to pass - Android support for string related functions is quite limited.
Closes https://github.com/caffe2/caffe2/pull/1571
Reviewed By: pietern
Differential Revision: D6486079
Pulled By: Yangqing
fbshipit-source-id: f0961e2dde6202bd6506f4fb8a3aea4af1670cb5
Summary: A while ago, we had to change some blob names in `optimizer.py` (more specifically, names of `iteration_mutex` and `optimizer_iteration`) to handle corner cases when preparing a net for parallel execution.
Reviewed By: azzolini
Differential Revision: D6480819
fbshipit-source-id: a03a7aa9fad322a50e7785914b0eb0f8654e6d90
Summary: The RunWithType() function of CUDA version shares a lot of code with the CPU version of the op. Merge them by pulling out the different parts of RunWithType() and putting them into a separate CPU/CUDA functions.
Reviewed By: asaadaldien
Differential Revision: D6467962
fbshipit-source-id: 83b45e697a094e959f66e898f46f06b0e2c329bc
Summary:
Reduced the array sizes used in pack_ops_test to prevent timeouts
during Travis CI builds.
Reviewed By: enosair
Differential Revision: D6476703
fbshipit-source-id: 20ab871ae40349ca27186447a84135bbc5c351b1
Summary:
This includes a build script for Docker containers to run builds and tests in as well as a build and test script that is run to build and test Caffe2 itself. These scripts are directly used by Jenkins.
Closes https://github.com/caffe2/caffe2/pull/1552
Reviewed By: pjh5
Differential Revision: D6476377
Pulled By: pietern
fbshipit-source-id: c9268873c03d0878bea0e8516a72c27813284427
CMake does not correctly add generated header file dependencies
for CUDA compilation units (cpp works fine.). This introduces an
explicit dependency to force the aten generator to run first.
This adds a simple fusion backend for the CPU.
* Refactors CompiledFusionFunction to have two subclasses that handle
the compilation details of each backend.
* emit-compile-link-run cycle for the CPU
* simple single core loop to run the operation
* lift CUDA-only restrictions in the fuser, checks that fusion groups
are only on a single backend.
Adds streams and comms as optional arguments to the NCCL calls in
torch.cuda.nccl. Also exposes ncclUniqueId and ncclCommInitRank for
multi-process mode.
Moves Py_RETURN_NONE statements after the GIL is re-acquired.
Summary:
Adds a new `LSTMCell` subclass to the `rnn_cell` module that performs layer normalization on the fused input matrix. Moves around some code in `rnn_cell.py` to avoid copy-pasta. Adds relevant test cases to `rnn_cell_test.py`.
Had to fix `brew.layer_norm` first. See T24013870.
Reviewed By: jhcross
Differential Revision: D6454883
fbshipit-source-id: 0f4ea7a778cc5be6a7274f7b28c793f5dd7c6095
Summary:
Regardless of device checker/gradient checker we cannot run a
backwards pass with cuDNN when NHWC is used.
Closes https://github.com/caffe2/caffe2/pull/1566
Differential Revision: D6474181
Pulled By: pietern
fbshipit-source-id: 727d7b4f2a1431a4d6675ffb76c5b60d3d7fa712
Summary: Moving tensorboard from fb specific and untying all dependencies on fb code
Reviewed By: dzhulgakov
Differential Revision: D6313818
fbshipit-source-id: 19302c372540400fa60d34015ef9e944ab203d2e
Summary:
This is supplementary to commit ce8267d425444f60ae650389fb41838847a44a5e. It allows specifying a device to prepare_prediction_net() so the prediction extractor can work with GPU.
Closes https://github.com/caffe2/caffe2/pull/1035
Differential Revision: D6467420
Pulled By: salexspb
fbshipit-source-id: b5b9a1536fb516e90b5e4b615403086943cfbe93
Summary: Oops, I left an unused variable here. Let's get rid of that!
Reviewed By: enosair
Differential Revision: D6468223
fbshipit-source-id: 27cc0900b330f056c5b5585a136fb46f5830cf81
Summary: Quick fix for unit test broken by D6454290. This is my fault for approving while the tests covering the single callsite were broken.
Reviewed By: goldsborough
Differential Revision: D6466566
fbshipit-source-id: 2683be3d6bb184286e64fbde3e572946e39030c7
Summary: There are two components that deal with workspace ids: 1) comm framework, 2) injection of GLOBAL_WORKSPACE_ID. The type of workspace id should be consistent for these components. 32-bit integers should be sufficient for such ids.
Reviewed By: akyrola
Differential Revision: D6443675
fbshipit-source-id: 7b0e8a3b005683350706fa5c330abf0a9d4881dd
Summary:
While working on layer normalization for LSTMs I encountered an issue where the layer norm parameters (which are the scale/gain and bias/shift from the paper) were not registered in the model for `brew.layer_norm`. salexspb explained that this is because it was using the `init_net_param` API instead of `create_param`. This diff fixes this.
While fixing this I noticed that `brew.layer_norm` actually had a bug where it was multiplying by the bias instead of adding it. Another issue was that the function was giving the scale and bias a shape of `[1]`; however, the paper (https://arxiv.org/pdf/1607.06450.pdf) specifies that, like for batch norm, there is one scale and bias parameter per neuron, i.e. the shape should be `[1, axis_dimension]`. The API now takes an explicit `dim_in` parameter (also more consistent with other normalization functions in that module) so that this can be specified. See the tests for how this now looks.
Reviewed By: jhcross
Differential Revision: D6454290
fbshipit-source-id: fc00ca614de3190c40ab743e8984bec9e85fb58c
Summary:
Adding a check to pack_segments to make sure the lengths passed in add up as expected.
Additionally started to address https://fb.facebook.com/groups/1405155842844877/permalink/1977332432293879/; this might not fix that issue, but it is still useful even if it does not.
Reviewed By: salexspb
Differential Revision: D6443490
fbshipit-source-id: 680dc763a788a550d321d97a556c5b46e3402dd1
* Comprehensive rewrite of Torch CuDNN bindings / a bit of ATen infra
The executive summary is that this moves the torch/csrc/cudnn
library into ATen, adding a number of new cudnn_ methods to ATen
for batchnorm, convolution, affine grid generator and grid sampler.
ATen infra changes:
- TensorGeometry was moved to ATen
- TensorGeometry was modified to make its interface resemble that of
Tensor; in particular, sizes is no longer a field, it's a method.
- AT_CUDA_ENABLED macro is set via ATen/Config.h header which is
generated at cmake configure time.
Fixes https://github.com/zdevito/ATen/issues/168
- Change AT_CUDA_ENABLED macro to be a function macro, so that we
error if it is not defined
- Introduce a new TensorArg class, which is a Tensor plus a little
metadata. This helps us give good error messages when checking
dimensions/shapes of tensors.
Fixes https://github.com/zdevito/ATen/issues/169
- Also introduce a TensorGeometryArg class, for when you don't
need the actual tensor data (which is most of the time.)
- Add ATen/Check.h, which contains a number of utility functions
for testing shapes, types and devices of input tensors. This
will be particulary useful for native methods, which don't get
code generated input testing code. These functions take a
'CheckedFrom' argument, at the moment just a string, which
specifies some extra information about what function was
doing the actual checking; this greatly improves error messages.
- Many check functions take initializer lists, which let you
test that all tensors have some property. This API is
peculiar, in that we IGNORE undefined tensors in this case.
This is handled by filterDefined.
- Add AT_CUDNN_ENABLED macro
- CuDNN linking from ATen was improved; for example, we now actually
add the CuDNN headers to our include path.
- Add some missing override specifiers to some methods
- We now actually build tests with CUDA functionality accessible
(previously, AT_CUDA_ENABLED was not defined, meaning that
the headers were missing all CUDA-only functionality.)
- Native functions now support giving explicit names to return
outputs in yaml. This makes it possible to hook into the NN
autogenerated derivatives codepath using native functions.
CuDNN rewrite changes:
- torch/csrc/cudnn now uses ATen (rather than passing around
THVoidTensor) and lives in ATen. This lets us remove tensorPointer
shenanigans. The functions are exposed to ATen as native functions
described in aten/src/ATen/cudnn/cuDNN.yaml
- ATen now builds and links against CuDNN when enabled. The cmake
package script was taken from Caffe2.
- Some header reorganization was done to help reduce dependencies
on headers (this reorg is no longer used but I've kept it)
- Rename CHECK to CUDNN_CHECK
- Rip out old shape/type testing code in favor of modern ATen/Check.h
interface using TensorArg. In many cases, increase the robustness of
the checking code.
- Change the inputs of the public facing functions, so that they can
be bound by ATen
- Delete THCState*; this is retrieved from the global ATen context
- Delete cudnnHandle_t, this is retrieved from the global Handles.h
- Delete cudnnDataType_t, this is retrieved from the Tensor type
- Delete Convolution class, instead its constituent arguments are
passed individually
- Change functions to return tensors, rather than take an appropriately
sized output tensor as an input.
- Redo how transposed convolution / backward convolution is implemented
(knock on effect of returning tensors). Previously it was assumed
that you would always pass an appropriately sized output tensor, but
we don't want to do this anymore. For backwards, we instead give
the desired output tensor (input, really) size, because that is
readily available. For *transposed* convolution, however, we take
output_padding, and otherwise do the shape calculation.
- Redo how legacy group convolution is implemented (knock on effect from
porting cudnn to ATen.) Previously, group convolution was implemented
by manually constructing sizes and strides and then outputting
appropriate, with macros switching between individual groups and
all-at-once based on CuDNN version. Now, the code looks exactly what
you'd expect: there's a top-level wrapping function that supports
group convolution no matter the version of CuDNN, and a low-level
wrapper which supports only what CuDNN supports. The top-level
function conditions on CuDNN version, and invokes the low-level
interface 1 or n times.
- There is now a debugging printer for tensor descriptors.
- Convolution struct is replaced with ConvolutionArgs, which is not
part of the public API but is used internally to conveniently
pass around all of the arguments needed for Convolution.
- Add some constexprs for well-known dimensions, reduce amount of
magic numbers in code.
- Put 'deterministic' into ConvParams. Fixes #3659
- Lots more comments.
- Some pessimizations, in the name of code clarity:
- The descriptors are initialized on every invocation of convolution
forward/backward. Previously, the descriptors were cached, so that
you didn't have to initialize them again on backwards. This is
difficult to support in the ATen interface so I didn't support it.
- Legacy group convolution initializes its workspace for *every* group
it performs. I did not feel motivated to fix this because the
legacy codepath is already quite slow.
- Affine grid generator and grid sampler automatically call contiguous
on their arguments as necessary.
- Batchnorm input checking is greatly beefed up, it now checks for
the following input characteristics:
- Definedness
- GPU location
- Type
- Contiguity
- Size
PyTorch binding code changes
- batchnorm now uses consistent var/data naming
- batchnorm and convolution make use of new ATen bindings
- Affine grid generator and grid sampler make use of ATen CuDNN
bindings via derivatives.yaml. This means I had to restructure
the code a little, since the THNN bindings still go through
a legacy Python class.
- I fixed some warnings:
- s/friend class/friend struct/ on InterpreterStateImpl
- Removed pessimizing move 'detached' in torch/csrc/autograd/variable.cpp
- Removed unused pack_list on Scalar
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
GCC 4.8 buildfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Add TensorGeometry to ATen.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
CUDNN_CHECK
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Update TODO comment
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Delete return in cudnn_grid_sampler
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
s/cudnnSetStreamToCurrent/setCuDNNStreamToCurrent/g
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Don't allocate a new vector when filtering defined.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Remove Check overloads, convert to pass references.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Some more microbenchmarking.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Replaced sigmoid + xent loss with SigmoidCrossEntropyWithLogits. The sigmoid layer computes the multinomial logistic loss of the sigmoid of its inputs. It's conceptually identical to a sigmoid layer followed by a multinomial logistic loss layer, but provides a more numerical stable gradient.
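For reference, a numpy sketch of the standard numerically stable formulation of sigmoid cross entropy on logits (the function name and layout here are illustrative, not the operator's actual code):
```
import numpy as np

def sigmoid_xent_with_logits(logits, labels):
    # max(x, 0) - x*z + log(1 + exp(-|x|)) avoids overflowing exp() for large |x|
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
```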
Reviewed By: xianjiec
Differential Revision: D6305455
fbshipit-source-id: 444c9f651fbdf13c3c52be5142769f8f98ed8770
This commit adds code to setup.py to use ninja to manage
C++ and code generator dependencies rather than use raw setuptools.
This is based on similar code added to ONNX.
Enabled optionally when ninja is installed.
On my computer speed for a do-nothing build drops from 10s to 1.5 seconds.
Speed of other compilation steps is significantly improved as well.
Dependencies are tracked correctly so the need for ccache is reduced.
Summary:
Get higher order interaction of embeddings, similar to cross net but applied in the embedding level.
Formula:
e_(l+1,i) = element_wise_mul[e_(0,i), \sum_i(e_(l,i) * w_(l,i))] + e_(l,i) + b
where l means the l-th layer of this higher order net, i means the i-th embedding in the list.
Finally, concat all the embeddings in the last layer, or concat the sum of each embedding, and attach to the output blob of dot processor.
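A numpy sketch of the update rule above; the names e0, el, w, and b are assumptions for illustration, and the per-embedding weights are taken to be scalars:
```
import numpy as np

def higher_order_step(e0, el, w, b):
    # e0, el: lists of equally sized embeddings at layer 0 and layer l;
    # w: one scalar weight per embedding at layer l; b: a shared bias.
    s = sum(e * w_i for e, w_i in zip(el, w))            # \sum_i e_(l,i) * w_(l,i)
    return [e0_i * s + el_i + b for e0_i, el_i in zip(e0, el)]
```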
Differential Revision: D6244001
fbshipit-source-id: 96292914158347b79fc1299694d65605999b55e8
Summary:
Problem:
when we initialize a model from an existing model, currently we load information for each layer parameter independently (in utils.py), including shape information. We have to load the whole model from the db_path every time we initialize one parameter (in layers.py). For example, in f31078253, the model needs to be initialized twice (not sure why). Each time there are 152 layer parameters to load, and loading a model takes 10-50 min depending on resource status.
Restriction:
1. _infer_shape_from_initializer in layers.py is called from multiple other places besides the if branch of ModelInitDefinition.INIT_MODEL_PATH in load_parameters_from_model_init_options in utils.py, which is the root cause of f31078253. So we still need to support the load operator in _infer_shape_from_initializer, and we need to batch the shape blob loading outside of LayerParameter.
2. In the if branch of ModelInitDefinition.PARAMS in load_parameters_from_model_init_options in utils.py, the db_path can be different for different parameters, so it is hard to batch them.
Solution:
Batch the shape blob loading in the if branch of ModelInitDefinition.INIT_MODEL_PATH in load_parameters_from_model_init_options in utils.py. We load the model and generate shape blobs of the layer parameters in the workspace, so that _infer_shape_from_initializer in layers.py can directly return the shape blobs cached in the workspace without reloading the model. At the same time, _infer_shape_from_initializer can still support a separate load operator if shape blobs are not pre-loaded into the workspace (this logic can be used for ways to initialize a model other than from an existing model).
Right now we are using 500 layer parameters per batch, and it worked fine. So for 152 layer parameters, one model loading is enough.
Reviewed By: xianjiec
Differential Revision: D6397607
fbshipit-source-id: 54f6f61d6d8b70c82b74c2d72ac56cd010a710da
Summary:
(Work in progress.) This diff will allow shifting activations to other GPUs in case the model does not fit into memory. To see the API, check the code in data_parallel_model_test, which tests shifting two activations from gpus 0 and 1 to gpu 4, and from gpus 2 and 3 to gpu 5.
I will need to test further on ResNets, and probably add copy operations to handle device change points.
Reviewed By: asaadaldien
Differential Revision: D5591674
fbshipit-source-id: eb12d23651a56d64fa4db91090c6474218705270
* Implement matmul as a native function; use it for Variable impl.
This also includes an (inefficient) version of allclose, which was necessary for testing.
A more efficient version would use some apply logic to fuse the ops and exit early (coming in future PR).
On small tensors [(2, 5, 5) @ (5,5)], this yields ~2.5x speedup over the python implementation.
* Make maybeSqueeze static.
Summary:
This is a CUDA implementation of the RemovePadding operator, modeled on akyrola's implementation for AddPadding.
There's also an incidental spelling correction: GetAddPadingGradient -> GetAddPaddingGradient.
Reviewed By: akyrola
Differential Revision: D6439594
fbshipit-source-id: b29cd0c252021c58e150b901bbaad28a3bd3cc4a
Summary: Experimental code that allows you to write C2 NetDefs directly using python-like syntax. This includes the ability to write native control-flow (if, while) and have it turn into IfOp and WhileOp
Reviewed By: jamesr66a, dzhulgakov
Differential Revision: D6123298
fbshipit-source-id: 25fc078b5769be61ac7fb3aa9a7c95bd88dccc30
Summary: Support regression with output transform in MTML for feed.
Differential Revision: D6403523
fbshipit-source-id: faa0aab1227a27286b617e8e25adfbab3a349d2c
SavedVariable.unpack() may throw std::runtime_error, which may lead to
program termination with SIGABRT without the exception being handled
in Python.
Fixes #3860
Summary:
This fixes the issue but I haven't figured out yet why is it
happening.
Reviewed By: bwasti
Differential Revision: D6437378
fbshipit-source-id: bf983c9b6f57647423423ec6b22e0f9d2b170e74
* Implemented NCCL Distributed Backend for PyTorch with new dist APIs
* Let FindNCCL determine the NCCL version
* Let the NCCL2 backend use ATen instead of the deprecated THPP
* Let distributed parallel model use a single reduction thread for NCCL backend
* Caching the sockets, bug fix, refactoring, and addressed Adam's comments
* Make BcastNcclID take a single param and bug fix for all_gather
* Removed barrier function, added warning for users, and not exposing experimental func to users
* Use the simplest single-bucket working solution for distributed data parallel model with rebase
* Cleanup, fixes and further addressed Adam's comments
* Used PySequence_Fast in distributed csrc
* Removed the limitation that each group is only bound to a given device sequence
* Used THPObjectPtr for PySequence_Fast
Summary:
With some test seeds this warning starts firing.
Should be addressed in a better way, not generating as many invalid examples.
Closes https://github.com/caffe2/caffe2/pull/1536
Reviewed By: bddppq
Differential Revision: D6437138
Pulled By: pietern
fbshipit-source-id: c619d928a585e3d887f686db5d98f841af10c56b
Summary: The case when sampling_ratio = 0 was skipped before; this diff enables that setting.
Reviewed By: ajtulloch
Differential Revision: D6366669
fbshipit-source-id: 4f3b9eaf47eb9dc20823935428d3d886ea32a5fc
* Add interpreter support for Handles/PythonOp/CppOp
This treats Handles as a first-class type in the interpreter
since this turned out to be conceptually simpler than treating
them as a separate concept, which requires a second channel for
register allocating and moving data from one op to the next.
Notes:
* The refcounting nature of tensors is factored into its own base type
so that it can be shared with other refcounted types such as handle.
* Some methods redundant with TensorBase have been deleted from Tensor
* The interpreter uses raw refcounted handles. In addition to being
able to treat Tensors and Handles as the same base object, it removes
a lot of redundant refcounting as objects moved from tensors to input/
output lists.
* aten_dispatch has been updated to work directly on the raw refcounted
lists to avoid refcounting and duplicate lists.
* Removed jit_closure.cpp; the interpreter can now handle all pathways.
* Functions like `unsafeToTensorShare` describe how
ownership transfers in the interpreter. The `Steal` variants
take rvalue references as arguments, and invalidate those
arguments to prevent potential problems.
* TensorTemporary is deliberately not a subtype of Tensor, because that relationship makes it too easy to
do something horribly unsafe:
```
void foo(at::Tensor bar) {
  // bar's destructor calls release on a temporary!
}
foo(TensorTemporary(retainable)); // structure slicing!
```
Summary:
Remove `const` modifier on value-type return types, since it has no effect.
This fixes a clang 5 warning.
Reviewed By: Maratyszcza
Differential Revision: D6399474
fbshipit-source-id: b40af161be5ae67a944518f9b4043c194511267d
Summary: `ThreadPool` is a class, but it is forward-declared as a struct, which produces an error when compiled with clang 5.
Reviewed By: Maratyszcza
Differential Revision: D6399594
fbshipit-source-id: e8e81006f484b38e60389c659e9500ec9cfab731
Summary: Double braces are required in C++11 when constructing an `std::array<,>` using aggregate initialization.
Reviewed By: Maratyszcza
Differential Revision: D6399752
fbshipit-source-id: 7b12c7a8193ba4904bb71b764a344bfd06ad7a7a
Summary:
TSIA. This is found in
https://github.com/caffe2/caffe2/pull/1530
Reviewed By: dzhulgakov
Differential Revision: D6434417
fbshipit-source-id: 2285c2f6252eb7f24e83357eb4887851b3adf690
Summary:
Updating the reader Limiter to identify an epoch end either based on
batches_per_epoch or epoch_duration_len.
I am basically addressing the review comment of D6299602 where I was asked to
break that diff into 2 smaller diffs.
This is Part 1 of the diff D6299602 i.e. making the multi-reader capable of identifying
epoch end either based on batches_per_epoch or based on epoch_duration_minutes
Reviewed By: azzolini
Differential Revision: D6379955
fbshipit-source-id: b8f8e396f515c898ad2f9ee900ec8fad055306b0
Summary:
Async executor based on async_polling (D5985110):
- Tasks scheduling other tasks, using polling only when necessary (e.g.
CUDA->CPU case)
- Fully async, i.e. RunAsync immediately returns
Reviewed By: azzolini
Differential Revision: D6281681
fbshipit-source-id: 06e3723e1424ffab652c38ca7b279cf76e43fa44
* Optimizer: Optimize transposes in variety of circumstances
- No-op transposes
- Consecutive transposes (fuse them)
- Transposes into Gemm (fuse them into transA/transB parameter)
* touch up out of date comment
* Have localScalar work with all 1 element tensors, not just scalars.
Also have toCFloat, etc. call localScalar so 1 element tensors work as well.
* Implement python number conversions.
* Implement __bool__, __nonzero__ as ATen functions.
* Remove merge artifacts.
* Simplify by dispatching to toCDouble.
This adds heavier sanity checking when we run to_dense(); in particular,
we make sure that if the tensor claims to be coalesced, it truly is coalesced, and if
it is not, that the coalesced version produces the same to_dense() result.
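A rough sketch of part of that check, assuming the usual sparse-tensor methods coalesce()/to_dense() (an illustration, not the actual test code):
```
#include <ATen/ATen.h>
#include <stdexcept>

// Sketch: whatever the input claims about being coalesced, densifying the
// coalesced version must produce the same values as densifying the original.
void check_to_dense(const at::Tensor& sparse) {
  at::Tensor dense = sparse.to_dense();
  if (!dense.equal(sparse.coalesce().to_dense())) {
    throw std::runtime_error("to_dense() disagrees with coalesce().to_dense()");
  }
}
```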
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CUDA mode profiler fixes
* Enable multi-gpu CUDA tracing
We need to record per-device start events because event timing
comparison only works for events on the same device.
* Coarse-grained CPU-CUDA syncing of timelines
Record a __cuda_start event used to synchronize cuda/gpu timings.
This requires running some warm-up event records to ensure the
call to event record for the __cuda_start event doesn't take
longer than normal.
fix syncing
* fix cuda build and lint
Summary: RecurrentNetworkExecutor is quite complex and was lacking documentation and had some stray comments. Cleaned up and added documentation. Also did some renaming and reformatting.
Reviewed By: ilia-cher
Differential Revision: D6421087
fbshipit-source-id: c3a57f60042ae4425a59123af5f54acb19e860e7
Summary:
enosair caught a bug where the operator returned too early if the lengths output was not provided. Fixed and added testing.
+ noticed the op does not support the case when no lengths input is provided. Added a temporary CAFFE_THROW for this case; will fix later
Reviewed By: enosair
Differential Revision: D6405585
fbshipit-source-id: a81717e1b39afde6e900ddd9049b820943aea9f1
Summary:
Our cmake build used to link against libpython.so with its absolute path (instead of -LSOME_LIB_PATH -lpython), so at runtime the loader thinks it needs the libpython.so at that specific path and loads in an additional libpython.so. This made a Python binding built with one Python installation unusable by another (on the same machine, or sometimes not even on the same machine). The solution is simple: we don't link against libpython and leave all Python-related symbols unresolved at build time; they are resolved at runtime when the module is imported into Python.
Closes https://github.com/caffe2/caffe2/pull/1514
Reviewed By: dzhulgakov
Differential Revision: D6412405
Pulled By: bddppq
fbshipit-source-id: 9ff5b752ae3806bfac94085942f82d89c304c887
* Add a bit of notation explanation
For a first-time user of Conv1d, it is not clear from the documentation what N, C, and L mean exactly. This should clarify that. Same for Conv2d.
Some tests, such as test_autograd.py, include random generation at the
top-level. It's going to be tough to police these files to ensure that
all randomness only happens within a test, so just set the seed as soon
as args are parsed (as well as before each test).
torch.manual_seed_all is no longer needed since torch.manual_seed also
seeds the CUDA random number generator.
Summary:
Set a default input type so that users do not need to always specify one.
Test Plans: run caffe2_benchmark without the input_type argument, the default one is used.
Closes https://github.com/caffe2/caffe2/pull/1513
Reviewed By: hlu1
Differential Revision: D6401820
Pulled By: sf-wind
fbshipit-source-id: bc8406ca000b3f65fb9aeb1c9c80eb766d625758
Summary: CUDA version of the AddPadding op. It first executes a prefix-sum using Cub to compute the cumulative lengths array. Then it launches a kernel that uses this information to fill the output tensor with the start and end padding and the actual contents.
Reviewed By: asaadaldien
Differential Revision: D6391413
fbshipit-source-id: 45b431e5976674729e53cb4752c7753c1d8a69e8
Summary:
so that users can use the 'WeightedSum' pooling method when there is a mix of id list and id score list features.
- it's still intuitive to have "WeightedSum" for id lists, and we do not need to introduce a new "UnWeightedSum" etc.
Reviewed By: chocjy
Differential Revision: D6369270
fbshipit-source-id: 722fa08d1a7986bc6ecf4c7cb02bbae0825bcab4
* Avoid casting integer params and buffers to float(), double() and half()
* Add test for immune integer buffers
* Fix documentation for float(), double() and half()
* Fix test
* Fix CharType min and max
CharType is int8_t and this is not equal to char. CHAR_MIN and
CHAR_MAX cannot be used reliably to specify min and max values.
* Use SCHAR_* instead of hardcoded min/max values for CharType
Summary: This is a reapplication of the earlier PR due to xplat move. Original author is Christoph Conrads <christoph.conrads@fluent.ai> christoph-conrads .
Reviewed By: houseroad
Differential Revision: D6379736
fbshipit-source-id: b7482ecf3b9487a528c15e92976e915791210002
Summary: Small changes as I was reading through the dper code base. All of them are nits, but they somewhat helped me understand things.
Reviewed By: xianjiec
Differential Revision: D6389380
fbshipit-source-id: 3412052e4fcba199c6ffc84c6f7ae11bf8ff6ee9
Summary:
The plural version is not defined in the CentOS CMake module.
Verified EIGEN3_INCLUDE_DIR is defined in the Ubuntu CMake module.
This fixes the build on CentOS when using system Eigen3.
Closes https://github.com/caffe2/caffe2/pull/1505
Differential Revision: D6390712
Pulled By: pietern
fbshipit-source-id: b8abb14a62e0ff9fa9c920866504da0e75786c0d
Summary:
Disabled when configuring Jenkins to get a run where tests pass.
Closes https://github.com/caffe2/caffe2/pull/1449
Differential Revision: D6390647
Pulled By: pietern
fbshipit-source-id: c16edc0c4d21ad60f101cf860e5dec183a1ea71a
Remove unnecessary messages and make certain functions in-place.
This commit weakens error checking, but I think it's fine to make
it UB for now, and implement a better asynchronous mechanism later.
This is much needed for achieving high performance.
This also adds support for CUDA-aware MPI implementations.
Summary:
A Caffe2 user was confused when model.TensorProtosDBInput([reader]) did not work. This is because of this outdated model helper function, which ignored the input blobs.
Added an assertion to enforce correct usage. I did not want to make this work with reader input as well, since this helper probably should not be used anyway.
Reviewed By: amanrajdce
Differential Revision: D6380326
fbshipit-source-id: 6a50c2861f7f58c06cbfe3e86bde0f17a2b443cb
Implements basic and advanced indexing using ATen tensors/variables.
Basic indexing is translated at the Python-binding level
(python_variable_indexing.cpp) to slice/squeeze/unsqueeze/select calls.
Advanced indexing is implemented in ATen in terms of take() and put()
calls.
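As a rough sketch of the basic-indexing translation (illustrative only, with made-up values; the real translation happens in python_variable_indexing.cpp), a Python expression like x[1:3, 2] becomes a slice along dim 0 followed by a select along dim 1:
```
#include <ATen/ATen.h>

int main() {
  at::Tensor x = at::CPU(at::kFloat).rand({4, 5});
  // Roughly what x[1:3, 2] lowers to: slice dim 0 over [1, 3), then select column 2.
  at::Tensor y = x.slice(0, 1, 3, 1).select(1, 2);
  (void)y;  // 1-dimensional tensor of size 2
  return 0;
}
```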
FindMAGMA.cmake looks for the MAGMA library under the hardcoded
/usr/local/magma by default. This commit adds the MAGMA_HOME env variable
as an alternative way to provide the MAGMA home directory. This is
very useful (and the only way) when the user has restricted rights
and cannot install the MAGMA libraries under /usr/local/magma. It is
also helpful when having multiple versions of the library, to be able
to select the one to use.
Summary: Unlanding D6327460 because it seems to be causing instability.
Differential Revision: D6377117
fbshipit-source-id: 4e1241fe65cd4c7a127fa6fa724f60b75965a096
Summary:
This should also be ported to Gloo since its Cuda.cmake was
synchronized to Caffe2 in #1256.
Verified that running CMake with `-DCUDA_ARCH_NAME=Manual` and
`-DCUDA_ARCH_BIN=70` ends up running nvcc with `-gencode
arch=compute_70,code=sm_70`.
Closes #1460.
Closes https://github.com/caffe2/caffe2/pull/1487
Reviewed By: bwasti
Differential Revision: D6376222
Pulled By: pietern
fbshipit-source-id: 563a2947567a2af8a0e64475b346a19d76545ed3
The slice function is very similar to narrow, except that it takes an
optional "step" argument. Unlike narrow, the arguments use the same
conventions as Python indexing: negative values wrap around and start
and stop are clamped to the size of the Tensor.
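A rough sketch of these semantics (hedged: it reuses the old ATen factory style seen elsewhere in this log, and the values are illustrative only):
```
#include <ATen/ATen.h>

int main() {
  at::Tensor t = at::CPU(at::kFloat).rand({10});
  // Equivalent of Python t[1:8:2]: dim 0, start 1, stop 8, step 2.
  at::Tensor a = t.slice(0, 1, 8, 2);
  // Negative indices wrap around and the stop is clamped to the size,
  // so this behaves like Python's t[-3:].
  at::Tensor b = t.slice(0, -3, 100, 1);
  (void)a; (void)b;
  return 0;
}
```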
* Move Variable conversion methods to ATen.
* Add a test to ensure type conversions work through backwards.
* Fix VariableType copy for type conversions.
* Add comment about needing to handle device movement.
* Move back to opposite order for copy function params -- inplace views depend on it.
* Use is_available() rather than is_available.
Summary: Today when PythonOp throws an exception, we log the error and fail the op. Later we assert that the op/net/plan succeeds and throw with a generic message. The user must tail the logs to find the real error. Instead, align with exception handling from other ops - throw directly. This will include the full context of the exception in the error message.
Reviewed By: Yangqing, akyrola
Differential Revision: D6359684
fbshipit-source-id: 85133ba6562759607a3971449120647cbacce946
If a virtual Python environment is in use (e.g. conda) and
mpiexec was compiled with the --enable-mpirun-prefix-by-default option,
it will fail by default because the path is updated to the prefix and
a different python (in most cases /usr/bin/python) will be used.
Summary: change the interface so BMUF can run on cpus
Reviewed By: asaadaldien
Differential Revision: D6356026
fbshipit-source-id: f58a4da9f800d969145a1a376e118b0f3581f8c1
Summary:
build_local.sh was changed in a8bb05d to no longer take the CMAKE_ARGS environment variable as args to the cmake command
Closes https://github.com/caffe2/caffe2/pull/1488
Differential Revision: D6364057
Pulled By: bddppq
fbshipit-source-id: a96787f3d3f1367ada4819420906e549f0945c8f
* Use aten version of is_signed.
* Define is_cuda native function and use it for variable.
* Use ATen dim for Variable dim/ndimension.
* Get rid of dim, ndimension fallthroughs in variable.py.
* Move size/stride Variable methods to use ATen.
* Implement shape property on Variable via ATen.
* Remove the _getattr__ function from Variable.
* Get rid of dispatch functions and avoid cast.
* Add THPUtils_packInt64Array.
* Throw python errors.
* Use fallthrough and fix fallthrough generation for native functions.
* is_cuda is a property, not a method.
Summary:
There were several regressions over time. It looks like the main
one is a recent change that introduced a map which we iterate over for each
operator call. I made some other little optimizations to our Facebook
observer. Overall this seems to cut about 1000ns from an operator. At a
rate of 36B operators per second this should be about 750 type VI
hosts.
Reviewed By: bwasti
Differential Revision: D6327460
fbshipit-source-id: 119623addbbd575486906959d65603eea8d4f5e6
Occasionally Travis builds would fail on these two tests.
It's not entirely clear where this nondeterminism is coming
from.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Add (fully opt-in) functionality to support setting pretty names for
nodes in the graph. In particular
- Variable now has a `name` parameter in the constructor
- export now has `input_names` and `export_names` parameters
Nodes that are not named via this mechanism continue to be named
internally with unique integers.
Names have a few rules.
- They must all be unique in the graph.
- They may not be integers (because of potential conflicts with
internally generated names).
Summary: Reported by Simon Layton from NVIDIA: we had a couple of py3-incompatible expressions in data_parallel_model
Reviewed By: azzolini
Differential Revision: D6349447
fbshipit-source-id: a09feb69396be43296400591a3bfed5b8c370b0d
* Add cudaEvent support to the profiler
This adds the ability to record cuda timings using cudaEventRecord
in the profiler. Since it doesn't require nvprof it is easier
to run than the nvprof path.
This also records a thread id for each event, which will make
tracing results easier to understand
* Add flow arrows from cpu to cuda event
* Fix no cuda build
* Review comments
* Move CUDA checks to one place
Summary: Ensure the clone() function didn't return a nullptr before attaching to an RNN operator
Reviewed By: salexspb
Differential Revision: D6341735
fbshipit-source-id: acf89c32f8dae2fd9bc8cb1029bc00df5dbe9dbd
Summary: The CUDA Cast op can now deal with an empty batch.
Reviewed By: azzolini
Differential Revision: D6350138
fbshipit-source-id: 2f3d19f4d42ff34806aa9597690e66f6b4de1a6b
Summary:
Two ops: BatchSparseToDenseOp and DenseToBatchSparseOp, inverse operations of each other.
Details are described in the op docs.
These ops are used along with flexible topK, where the output is lengths, indices, and values.
We want to do softmax on the values, but the dimension of each batch is different, so these ops convert the sparse representation to dense and vice versa. The two ops are also the gradient ops for each other.
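A small illustration of the sparse-to-dense direction (a hypothetical standalone helper, not the Caffe2 op implementation): each row i owns lengths[i] consecutive (index, value) pairs, which are scattered into a dense row.
```
#include <vector>

std::vector<std::vector<float>> batch_sparse_to_dense(
    const std::vector<int>& lengths, const std::vector<int>& indices,
    const std::vector<float>& values, int dense_dim) {
  std::vector<std::vector<float>> dense(lengths.size(),
                                        std::vector<float>(dense_dim, 0.0f));
  std::size_t pos = 0;
  for (std::size_t row = 0; row < lengths.size(); ++row) {
    for (int j = 0; j < lengths[row]; ++j, ++pos) {
      dense[row][indices[pos]] = values[pos];  // scatter value into its column
    }
  }
  return dense;
}
```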
Reviewed By: chocjy
Differential Revision: D6288338
fbshipit-source-id: 0ba9e611058b39e46e7414dcc5f39cab29915fa3
Summary:
This is part one: It adds lambdaNDCG loss which can be used to heuristically
optimize the NDCG metric.
Differential Revision: D5830650
fbshipit-source-id: 1eb696337c9a77727ad40219c68f6468e2e097a5
Summary: Came across this bug in the doc when I was figuring out NetBuilder from the code.
Reviewed By: volkhin
Differential Revision: D6341821
fbshipit-source-id: 8818f3d92681366bfe7b90d9d4da9f68ef6e4672
Summary: Implement LinearWarmup and ConstantWarmup learning rate policies. LinearWarmup warms up the learning rate from (starting_multiplier * learning_rate) to the specified learning rate over the first 'num_iter' steps. ConstantWarmup scales the learning rate by 'multiplier' for the first 'num_iter' steps.
Differential Revision: D6316038
fbshipit-source-id: 1649c3ecd78bcdfec93b6cf195d86328393a7cb4
Summary: Move quant_decomp_zstd.* to share/contrib so that they're automatically synced to fbcode
Reviewed By: Yangqing
Differential Revision: D6336968
fbshipit-source-id: 1bf48ce97a017ddea8cc82865428a498653d5872
Previously, an in-place operation that saves its output (such as
relu/threshold) would create a reference cycle when applied to a
view. There were two cycles created:
1) The cycle base.grad_fn.fn.input_.base
base.grad_fn is a CopySlices
base.grad_fn.fn is ThresholdBackward
base.grad_fn.fn.input_ is a SavedVariable with base pointing to base
2) The cycle base.grad_fn.fn.input_.grad_fn.next_functions[0]
base.grad_fn.fn.input_.grad_fn is AsStridedBackward
and next_functions[0] points to base.grad_fn
Generally, we avoid cycles because the AD graph is mostly immutable. Two
notable exceptions are:
a) Variable.grad_fn can change to point to a new grad_fn
b) SavedVariables in a function can be set after the function is created
The first case is not a problem if grad_fns do not hold strong references
to Variables. Removing "base" from SavedVariable removes the strong ref.
For the second case, we need to avoid saving the grad_fn of outputs. We
were incorrectly saving the grad_fns of outputs when they were the
result of in-place ops on views.
This commit adds a Value type similar to the one @ezyang suggested a while
ago for handling multi-return nodes.
Previously if we had a graph like:
a = op1(b)
c, d = op2(a)
Then its in-memory format would look like:
%0 = op1(b)
%1 = op2(%0)
%2 = select(%1, 0)
%3 = select(%1, 1)
Select nodes were used only to handle the multi-output case. In the
single-output case ops referred directly to their uses.
This required special handling for the single- and multi- output cases,
and was confusing when used with ONNX which distinguishes values (the
inputs/outputs of a node) from the nodes themselves (e.g. a Conv).
This commit adds the Node/Value distinction to the IR. In the example
above, `a`, `b`, `c`, and `d` are now Value objects, while `op1` and
`op2` are now Node objects. Inputs/Outputs to the graph are values.
* Nodes now always have multiple outputs, accessible through their `output()`
method.
* Methods exist for adding/removing outputs from a node.
* Nodes own their output Values, destroying a node destroys its outputs and it
is only valid to destroy a node when no uses of its outputs remain.
* Unlike select, Values do not appear in the nodes list.
* The method `node()` on `Value` retrieves its defining node. Calling it
is always valid. For inputs, its kind is "Param". Like "Return" there is a single Param
node representing all inputs.
* For single-output Nodes, the method `output()` retrieves the single
output Value, asserting that the node is in-fact single output.
* Functions are the same, but some functions like `type()` have moved to
Value.
* `replaceAllUsesWith` is now sanely defined for both Values and Nodes.
In the case of Nodes, it replaces all outputs of the node with the outputs
of the replacement node.
* stage is defined both on Node/Value. This is because Inputs require a stage.
* Apart from changing data types from Node->Value most passes remain the same.
Things that previously assumed single-output nodes now have to call output()
to get the node.
* This removes the uses = [...] field in the outputs because it was
getting confusing even before this commit when uses would refer to nodes,
but we print the names of Values. The lint pass validates the use list,
so printing it out seems less necessary.
* Support [output] in native_parse.
* allow specifying [output] in NativeFunctions.
Limitation: doesn't work for method, functions; can only do one or the other.
* Sample native function with output.
* spatial roi pooling forward skeleton (note, build is broken after this commit)
* Support multiple variants in native functions with outputs.
* add roi pooling forward cpu
* Add support for tuple return in NativeFunctions.
* native functions cuda
* fix bug in roi pool cpu forward
* finish forward kernel minus invocation
* add option for getting current stream
* Support backend-specific native function dispatch.
* Move cuda stuff to native.
* Move native related files to /native.
* Get rid of NativeFunctionsCuda.h.
* launch forward kernel
* roipool backward kernel
* Rebase expand error message changes.
* Fix up header files.
* add backward kernel launch, write as native function
* Default to base dispatch.
* Re-arrange native_parse.py.
* Get rid of tabs.
* Get rid of at:: in C++ code in native function decl.
* Parse name.
* Parse name and return.
* Parse arguments.
* Don't specify variants.
* Get rid of /NativeFunction.
* Infer dispatch level.
* Infer dispatch.
* Improve argument parser.
* Comment, simplify parsing.
* Allow single line comments.
* Parse 'const Tensor &foo' correctly.
* Add comment to native_get_return_types.
* Fix python2 build by removing kwarg to rsplit.
* tabs --> spaces in roi forward cpu
* rename to RoiPooling2d
* add _cpu to roi pooling functions on cpu
* fix name handling in native functions
* Fix lint.
* Simplify default handling.
* Get rid of dispatch_level; infer it from dispatch.
* Simplify multiple return type native parsing.
* Move naming of outputs to gen.py from gen_variable_type.
* Get rid of m_ for type methods; keep only method_prefix_derived for s_ functions.
* add derivatives.yaml entry for roi pool
* Native functions parsed from yaml.
* Add comment explaining native_functions.yaml.
* Fix runtime_error string format.
* Fix wrong CUDA generators and allow for new ones
* Fix CUDA detection for other generators
* Simplify the changed code
* Remove useless flags for MSVC
Summary:
So we can do things like pass -DCMAKE_BUILD_TYPE=DEBUG
Closes https://github.com/caffe2/caffe2/pull/1474
Differential Revision: D6334701
Pulled By: pietern
fbshipit-source-id: 08e6e48ba453ffca50ad0949ee7b0bf7251a542f
Summary: Current beam search generates successor states to EOS which are considered for inclusion in the beam even though they do not represent valid sequence prefixes. This diff introduces a penalty to ensure that such states are not included in the beam.
Reviewed By: xliilx
Differential Revision: D6325511
fbshipit-source-id: b17f10b0d00f3bc5fcc5a826a8a57a0f2cb360a6
Summary: Split into cpu and gpu parts, update chaining test
Reviewed By: Yangqing
Differential Revision: D6331513
fbshipit-source-id: b9e8ec9afc110b0284550c4818bde15ae108fa2f
Summary:
Fixed unit test failures for GRU cell first implemented in D5778202
- GRUCell implementation added to rnn_cell.py
- GRU with recurrent attention test added to seq2seq_model_caffe2.py
- seq2seq_rnn.py
- Added specific behavior for 'gru' cell type
- in LSTMWithAttentionDecoder, output_indices fix for GRU cells
- in build_initial_rnn_decoder_states, don't process cell state for GRU cells
Reviewed By: salexspb
Differential Revision: D6316441
fbshipit-source-id: 18668f3db62245c5cdaf3bfa473a40e0feba0473
Summary: Pass the list of observers to rnnExecutor_ and attach them to operators
Reviewed By: akyrola
Differential Revision: D6279655
fbshipit-source-id: 086dde1bf6edbfb36082d6b4de33ec41f0bbefab
Summary:
Also bumped third_party/protobuf to v3.4.1 similar to #1462 . cc pietern
Closes https://github.com/caffe2/caffe2/pull/1466
Reviewed By: pietern
Differential Revision: D6322210
Pulled By: Yangqing
fbshipit-source-id: 00f72472b71d1903a2705daf56652e4fb3fc021e
Previously, an in-place operation on a view that caused the view to be
volatile would not propagate up to the base. This often happens in
backward passes involving CopySlices which would increase memory usage
by making grad non-volatile.
For example, this splits threshold into threshold(), which is now
never in-place, and threshold_() which is always in-place.
This simplifies the in-place vs. non-in-place logic in
gen_variable_type.py, which was bug-prone.
Summary:
Data types were being handled badly in the reference check, causing sporadic failures in CI. All batched mat-mul with fp16 data is performed as pseudo-fp16, with all math in fp32. Adjusted the reference implementation to reflect this.
Adjusted the gradient check threshold to the best I could get to consistently pass.
Closes https://github.com/caffe2/caffe2/pull/1406
Differential Revision: D6324431
Pulled By: pietern
fbshipit-source-id: 83ff2584438a11f7a6db4599a4fb0e75e9e15a3d
Summary:
TSIA. Verified on local machine with VS 2017.
Closes https://github.com/caffe2/caffe2/pull/1455
Differential Revision: D6310658
Pulled By: Yangqing
fbshipit-source-id: 88f4519e8e9a4178719a5627365267f627dcb939
Summary:
This is in order for us to share compression ops to oss.
Closes https://github.com/caffe2/caffe2/pull/1463
Reviewed By: hlu1
Differential Revision: D6319101
Pulled By: Yangqing
fbshipit-source-id: 16c94e71fc3efe256054a648170aaf7702e5bcfe
* Add a JIT interpreter
The separate interpreter is used to run graphs with lower overhead than
converting them to autograd graphs. Some notes:
* does not support Handles/PythonOp/CppOp, these will be in a future commit
* jit_closure.cpp still exists and we fall back to it for now when we
cannot handle something because of PythonOp/CppOp
* In order to support retain_graph=True, the interpreter can be cloned,
creating a copy that can be run with different arguments. This is
assumed to be the non-standard case so cloning is not particularly optimized.
No tensor _data_ is copied, but the at::Tensor list in the interpreter is.
If we hit problems, there is a lot we could do (such as register allocation)
to minimize the stuff that needs to be copied.
* Uses a pImpl pattern to keep implementation details out of its header file.
* Modifies the way getTensorOp works so that it reads/writes to already-existing
vectors, this prevents needing to realloc these buffers each time.
* Timings are here: https://gist.github.com/zdevito/5a20ac29fb1b9e449e693b67dc478127
This reduces overhead to about the same as running it in python.
It is about 10us faster to run the same thing using ATen directly.
* Code Mod
Interpreter -> InterpreterState
Function -> Code
Add other requested comments.
* RegList -> ListHandle<T>
Change the RegList functions to be safer by identifying the type of
each argument list, and checking that list insert does not try
to add to two different lists at once.
* Use exactly equal for interp tests
* Fix elu double-backwards when applied in-place
Removed unused "input" argument to elu_backwards. Also removed 'inplace'
argument from backwards functions, since we don't ever want to use it.
* Fix up additional calls to ELU_updateGradInput
Summary: Update ATen operator to new version of aten library. This adds support for many neural network functions that previously were not exposed. This also supports operators that take a list of tensor inputs or produce a list of outputs by appending them to the end of the input/output lists.
Reviewed By: jamesr66a
Differential Revision: D6267327
fbshipit-source-id: 0df6af18369241afa8600fd51923811749900c2e
Summary: add NegateGradientOp: in forward pass, this op simply copies the input to output. In backward pass, it flips the sign of gradients.
Reviewed By: dragonxlwang
Differential Revision: D6314456
fbshipit-source-id: 56afd8b131eff9f7e120ab7e4e87461df49649d4
Summary: This new field is not needed anymore, so this diff removes it
Reviewed By: kennyhorror
Differential Revision: D6316744
fbshipit-source-id: f8afc1c42a0592fd03c7939f8e6f78afc8510ec9
Summary:
c777be07d9 changed the type signature for the Set function, this fixes it for the ATenOp
Closes https://github.com/caffe2/caffe2/pull/1464
Reviewed By: zdevito
Differential Revision: D6317561
Pulled By: jamesr66a
fbshipit-source-id: e54d553f44ccf0d5fc695e14dc671dde77004b54
Summary: Currently, the device_option equality is done in a specialized private function. Ideally, we should be able to test the equality from other places in the code and have a more detailed check for the equality.
Reviewed By: akyrola
Differential Revision: D6316608
fbshipit-source-id: c3fd085583e535d7936d05e4c8b15d2eff91c744
Summary: There is no need to use two functions to report net and operators. One function is sufficient.
Reviewed By: Maratyszcza
Differential Revision: D6228730
fbshipit-source-id: c599527254f4a15a3e440d37055cc95fbb3436bb
Summary:
This correctly adds handling of CUDA 8.0 and 9.0 by cmake.
**Discussion:**
CUDA 9.0 is currently not handled by cmake. When trying to build
with it and gcc6, the following cmake error is shown:
-- CUDA detected: 9.0
...
CMake Error at cmake/Dependencies.cmake:332 (message):
CUDA 8.0 is not compatible with GCC version >= 6. Use the following option
to use another version (for example):
-DCUDA_HOST_COMPILER=/usr/bin/gcc-5
Closes https://github.com/caffe2/caffe2/pull/1392
Differential Revision: D6317033
Pulled By: pietern
fbshipit-source-id: 08b89f21b994af52533d5afaaa62f26e2e94aee8
Summary: Dynamic memory management in Data Parallel Model was broken for distributed computation because the parameter gradients were also freed after being used. That is a problem with Gloo because it expects the tensors to have the same address over multiple calls. It is not a huge loss to remove parameter gradients from recycling, as they are relatively small for typical convnets.
Reviewed By: asaadaldien
Differential Revision: D6314095
fbshipit-source-id: 949161d8c592927ae2fa82b3262b5f9ee47bed6f
Summary:
Support the default, nnpack, and opengl backend engines. There is no need to change the model; the file converts the model to the appropriate backend.
Closes https://github.com/caffe2/caffe2/pull/1436
Reviewed By: hlu1
Differential Revision: D6275975
Pulled By: sf-wind
fbshipit-source-id: fbd864e18f00372b4c03de294c22383c405a9210
Summary: Currently, in single-machine execution, a misleading message is printed to the log saying that the 'NODE_ID' blob is not found. This diff ensures that this message is no longer printed while maintaining the semantics.
Reviewed By: Maratyszcza
Differential Revision: D6302728
fbshipit-source-id: 0f45245aedf6d4f664368595f7894e0f695e5323
Summary:
The STDDEV calculation code assumes that `compare_exchange` returns the value of the atomic, while per the C++ spec it actually returns a `bool`.
Also, the diff adds enough guards to avoid math errors on the Python side -- although this should not happen, the guards are just to avoid problems with floating point calculation offsets.
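For reference, a minimal standalone sketch (not the Caffe2 code) of the standard-library behavior being relied on: compare_exchange_strong returns a bool and, on failure, writes the currently stored value back into the expected argument.
```
#include <atomic>
#include <cstdio>

int main() {
  std::atomic<int> value{10};
  int expected = 5;                                     // stale expectation
  bool ok = value.compare_exchange_strong(expected, 6); // fails: returns false
  std::printf("ok=%d expected=%d\n", ok, expected);     // prints ok=0 expected=10
  ok = value.compare_exchange_strong(expected, 6);      // now succeeds
  std::printf("ok=%d value=%d\n", ok, value.load());    // prints ok=1 value=6
  return 0;
}
```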
Differential Revision: D6307930
fbshipit-source-id: d1754afb631f937aca7a88a82b5be2dd0c704aec
* Update comments and size logic
* Record stack traces during JIT tracing
* Use string helper functions and AutoGIL
* Use SourceLocation object instead of storing in debugName
* Address zdevito comments
* Address comments
* Fix CUDA builds for Windows
1. CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS has a limitation: the maximum number of exported functions cannot exceed 65535, so it can't be used.
2. Specify static on an inline function to prevent linking errors.
* cancel CMAKE version limitation
* Allow torch.load to take pathlib.Path
pathlib has been in the Python standard library for filesystem paths since Python 3.4,
but `torch.load` currently cannot take a `pathlib.Path` as the filename of a state dictionary.
I changed `torch.load` and `_with_file_like` so that they accept a `pathlib.Path`-typed filepath.
* Fix flake8: too long line & indentation
Summary:
The windows compiler has a bug with chained templates. This diff avoids using such pattern in `plan_executor.cc`.
Closes https://github.com/caffe2/caffe2/pull/1442
Reviewed By: Yangqing
Differential Revision: D6300046
Pulled By: heslami
fbshipit-source-id: 1dc74441d6e2f0586c636e799eb5e88ced289063
* Enable EXPORT_ALL_SYMBOLS for CMAKE
If we turn on CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS flag, we don't need to add most decorators by hand.
* Add quotation marks to pass the string args
* added endif
* Update CMakeLists.txt
Summary:
The use case is that sometimes we need a Tensor of custom type instead of POD
or string. This diff allows one to delegate to BlobSerializerBase to further
serialize the contents inside the Tensor.
Design choices:
(1) Each element is serialized as a BlobProto string, and stored in the
repeated string field.
(2) UNDEFINED is used as the enum value for the tensor data type, and the exact
type string is stored in the additional field.
(3) BlobSerializer is called on each item to obtain the serialized string.
(4) This requires the custom type to have copy constructor - otherwise it
will simply not be possible to copy over the deserialized content without
explicit type.
See blob_test.cc for an example.
Reviewed By: sunnieshang
Differential Revision: D6300196
fbshipit-source-id: 18bf94a22a07337e0fa83d3f1004b3651e38cf27
Summary:
This should fix the Travis build failures on Mac
Closes https://github.com/caffe2/caffe2/pull/1443
Reviewed By: bddppq
Differential Revision: D6295041
Pulled By: Maratyszcza
fbshipit-source-id: c143220e1ec17e49fe8e84f586f9fb82daba321a
Summary: The topk GPU test was taking too much time, but there are still a variety of codepaths to test (k <= 1024, k > 1024, k == 1, k == n). Reduce the batch sizes and n to reduce time taken by the in-python CPU code equivalent.
Reviewed By: pietern
Differential Revision: D6272628
fbshipit-source-id: b8b8f3601f28bf64f144c73d7c9e915f40c84d70
* added sys/types.h include to fix unknown ssize_t in aten/src/TH/THMemoryFile.c
* now including <sys/types.h> only if _WIN32 is not #defined
* now including sys/types.h in aten/src/TH/THDiskFile.c (if _WIN32 is not defined) to fix undefined off_t
Summary: The number of elements in the caffe2 blob can be larger than int32. Use size_t to prevent overflow.
Reviewed By: ajtulloch
Differential Revision: D6278363
fbshipit-source-id: 356e294c667a53360d8a65b56a63a39d5ce3384e
I messed this up and TestNN.test_MaxPool2d_indices caught me out
on it. This patch assumes that IndexTensor outputs are not
differentiable.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This operator is a warmup I was doing before tackling convolution, as it
has many properties that make it a "first" for implementing things. In
particular, it is the first operator whose backwards have multiple
returns; this means its double backwards is the first backwards for a
function with multiple differentiable outputs. This exercises new code
for output_mask and set_flags.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
We are splitting on ', ', but that causes problems when you
have a nested comma. Quick and dirty fix is to NOT have the
space.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
They don't actually do anything and they're not accurate (many functions
have defaults which we didn't specify here).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Whenever I used to read Declarations.yaml, it would drive me batty that
'name' was always embedded somewhere in the middle of the record.
Now it is at the top, as it should be!
What it looks like now:
- name: storage_offset
method_prefix: m_
arguments:
- dynamic_type: Tensor
name: self
type: const Tensor &
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Pretransposing FCs seems to offset the losses we get from low
batch sizes in AdIndexer. First I confirmed this on local benchmarks (see
previous diff). Then in https://fburl.com/yuo49onj I showed how this
change saves 19% of FC time on AdIndexer, which is already $0.4M in
cap. exp. and over 3 years gives 5x more ROI.
We can also reuse this code for later, more efficient gemm
implementations. I.e. msmelyan is working on a new fp16 gemm which
would cut bandwidth usage 2x. We can reuse the code in this diff for
the repacking required by a new gemm.
In this diff I had to take care of memory usage. Here are several
possible approaches to the transformation:
1. Perform on the fly, copy the memory. This is what is done in
skinny gemm (FC with engine SKINNY)
Cons: slow first execution, memory is replicated for each thread
2. Perform copy of weights in operator constructor. On the fly in dbg
mode verify that hash on original weight is the same
Cons: memory is still replicated for each thread
3. Perform copy weights in Predictor constructor
Cons: if we have 2 predictors sharing the same weight blob (via
PredictorContainer), we still get 3x more memory. I.e. original
weights and two copies for each of the predictors in a container
4. Replace weights in Predictor constructor, take care of mapping to
support weight sharing within a Predictor container
This is the approach taken in this diff; it solves the issues above and
doesn't create any memory overhead.
Cons: the logic became complex and requires a mutex at initialization time
Reviewed By: akyrola
Differential Revision: D6214593
fbshipit-source-id: 25da6ba7bfd39fc8f4b578094d3f334c7957490d
Summary:
- so that it can also summarize blobs with size larger than int
- the calculation of the mean and std may overflow/underflow; change to use double for intermediate calculations
Differential Revision: D6278275
fbshipit-source-id: f0bb72a5279212d429fa6d09b5487cad1baacdbe
Summary:
Will probably rename to adaptive topK to be aligned with the layer name.
The main difference from the top_k op is that K is not fixed as a layer parameter;
instead this op takes in a blob that contains the K information for each row of the input data (batch mode).
Reviewed By: chocjy
Differential Revision: D6221209
fbshipit-source-id: f7fd575ff8f515d886d93278ad94fd17e8bd6fa5
Previously, we checked that Variables were at least one dimensional in
the Python binding (wrap_outputs.h) and in the backwards functions. This
was necessary because some Tensor functions returned Scalar types, which
must be zero dimensional. This moves the wrapping logic into
VariableType.
Summary:
Do not try to link against `libcblas.so` when using the OpenBLAS
back-end. This fixes #763.
I briefly checked the OpenBLAS repository and, as far as I can tell, the OpenBLAS build never created a library called _cblas_.
Closes https://github.com/caffe2/caffe2/pull/1420
Differential Revision: D6283019
Pulled By: pietern
fbshipit-source-id: 53cd4455bdc63ee9f31d5bca9822844548350ae3
Summary:
A few people complained in the NNPACK repo about the broken build on PPC64, as it specifically whitelists supported architectures in its CMakeLists.txt and refuses to build on unsupported platforms. This commit explicitly disables the NNPACK build (as part of the Caffe2 build) on unsupported architectures.
Closes https://github.com/caffe2/caffe2/pull/1439
Differential Revision: D6288999
Pulled By: Maratyszcza
fbshipit-source-id: 76c40e9ce882356944b63968df8fd853f21ecd35
Summary: In this diff I am making sure that the checkpoint metadata is written out to the db for every epoch. This will allow us to automatically resume from an epoch if a workflow fails.
Reviewed By: aartibasant
Differential Revision: D6234832
fbshipit-source-id: f09a4de118f2eac25f663556476ac6313925fdf3
Summary: Print the full operator definition when gradient creation fails. This helps debugging cases where same op type is used in many places.
Differential Revision: D6282832
fbshipit-source-id: 4b9dab2602c7c53f795da93a3085cf5c8ca741c1
Summary:
Add `RmsPropOptimizer` to `optimizer.py` so RMSProp can be used as an optimizer.
`RmsPropOptimizer` uses `RmsPropOp` to update the gradient and `MomentumSGDUpdateOp` to update the model parameters.
Differential Revision: D6118279
fbshipit-source-id: e38b8380ff74c1d1bb1e87fc300b6b55e32cd2e0
Summary:
- This is meant as a set of examples on how parallelize_net works.
- Currently, only one example is provided. More to be added.
Reviewed By: mraway, xianjiec
Differential Revision: D6240160
fbshipit-source-id: 6f6f2d77445825883e050498cb6e06fb74508bbf
Summary:
Let's see if we can make this work...
Closes https://github.com/caffe2/caffe2/pull/1417
Differential Revision: D6276601
Pulled By: pietern
fbshipit-source-id: 4d51a66b693a1c5cff1e0c03373cd42bb273c885
Previously, sizes/strides() would give you the ATen view of the shape, while size(dim), stride(dim) would give you the TH view.
This was unnecessarily confusing and there was no automatic way to get dim wrapping on the ATen view.
Summary:
The source files are not exposed to the parent directory in mobile. Expose them now so that the files are built in OSS.
Closes https://github.com/caffe2/caffe2/pull/1435
Reviewed By: akyrola
Differential Revision: D6274056
Pulled By: sf-wind
fbshipit-source-id: 6b54645bc9a42b4329d8aa20051abeb5fc6b1c37
* Add direct C-type scalar conversions from Tensor, e.g. toCFloat() as an alias for Scalar(x).toFloat()
* Provide tensor overloads for fill_, masked_fill_, index_fill_.
* Everythign up to scalar overload.
* Fix pytorch build for aten scalar return type changes.
* Use valid expression instead of dangling else.
* Simplify code generation.
* Fix test_jit (why didn't this compile locally?)
Summary:
Ability to use average length of sparse feature to initialize weights. Based on experiments, it turns out that this allows a model to converge faster.
More results of the experiment -- https://fb.quip.com/VfraAXNFWhSg
Reviewed By: xianjiec
Differential Revision: D6092437
fbshipit-source-id: d979be7d755719ff297b999f73cba0671e267853
The curand_uniform function returns the range (0, 1]. Most RNG APIs have
the opposite bounds. Fixup the values in uniform_() so that they fall in
the more common bounds.
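A minimal sketch of this kind of fixup (a hypothetical CUDA C++ helper, not the patched kernel; the state type is just an example):
```
#include <curand_kernel.h>

// curand_uniform returns a float in (0, 1]; subtracting from 1
// maps it to the more common [0, 1) convention.
__device__ float uniform_half_open(curandState_t* state) {
  float x = curand_uniform(state);  // x in (0, 1]
  return 1.0f - x;                  // result in [0, 1)
}
```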
From https://software.intel.com/en-us/mkl-developer-reference-fortran-gemm:
lda: "When transa = 'N' or 'n', then lda must be at least max(1, m),
otherwise lda must be at least max(1, k)."
ldb: "When transb = 'N' or 'n', then ldb must be at least max(1, k),
otherwise ldb must be at least max(1, n)."
Partly addresses #3525
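A small sketch of that rule for a column-major gemm computing an m x n result from op(A) (m x k) and op(B) (k x n) (a hypothetical helper, not the patched code):
```
#include <algorithm>
#include <cstdint>

void leading_dims(char transa, char transb, int64_t m, int64_t n, int64_t k,
                  int64_t* lda, int64_t* ldb) {
  // lda: max(1, m) when A is not transposed, otherwise max(1, k).
  *lda = std::max<int64_t>(1, (transa == 'N' || transa == 'n') ? m : k);
  // ldb: max(1, k) when B is not transposed, otherwise max(1, n).
  *ldb = std::max<int64_t>(1, (transb == 'N' || transb == 'n') ? k : n);
}
```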
Summary:
The output shape info is incorrect, e.g. if we have 4 embeddings with dim size 32, the actual shape is (4, 32),
but the previous implementation in concat layer will give us (128, 1). This bug doesn't affect the dot products
calculation because the actual shape of the blob is still (4, 32) in concat_split_op
Differential Revision: D6264793
fbshipit-source-id: 82995e83a8c859cbd15617ff7850a35b30b453b6
* Prevent segfaults from undefined aten tensors.
This introduces a singleton UndefinedTensor TensorImpl with UndefinedType that is the starting state of a Tensor with no constructor arguments. In this way we avoid null pImpls and avoid segfaults
without having to if-check each pImpl dereference.
* If either Backend or Scalar type is Undefined in registry, return
the UndefinedType to avoid errors like CPUUndefinedType is not enabled.
* Address review comments.
* Avoid refcounting UndefinedTensors.
* Use reference_wrapper to avoid copy in check_defined.
* Declare UndefinedTensor singleton as class-static.
* Separate checked_cast into storage and tensor versions.
* Include <functional>
* Handle nullptr TensorImpls coming from NN.
* Fix nullptr check in batch_normalization backward with defined check.
Summary:
RNN executor uses its own set of events (https://fburl.com/37mows6l) and may
call RunAsync multiple times on the same op. Disable internal op event for this use case.
Reviewed By: akyrola
Differential Revision: D6258471
fbshipit-source-id: 228f9ca9882cfbac5bc8fba55ddf80bd2b542072
Generate random uniform floats in the range [0, 1) by generating random
uniform uint32 in the range [0, 2^24-1] and dividing by 2^24. This
ensures that the largest value is representable as a float32 less than
one.
This also changes the uniform double generation to use more bits of
randomness.
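A minimal sketch of the float scheme described above (a hypothetical helper, not the actual generator code):
```
#include <cstdint>

// Keep 24 random bits and divide by 2^24, so the largest possible result,
// 16777215 / 16777216, is still a float32 strictly less than one.
float uniform_from_uint32(uint32_t r) {
  uint32_t bits24 = r >> 8;                            // value in [0, 2^24 - 1]
  return static_cast<float>(bits24) * (1.0f / 16777216.0f);  // value in [0, 1)
}
```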
THTensor_(newContiguous) always increments the refcount. It may return
the same pointer if the tensor is already contiguous. Since we added the
check for zero strides, it may be called when the tensor is already
contiguous. We need to make sure that THTensor_(free) is always called
in this case.
Fixes #3498
* add -fexceptions to aten build function for C and CXX builds
* add -fexceptions to aten build function for C and CXX builds
* add -fexceptions to aten build function for C and CXX builds
* Fix test_torch.py test for Power see issue #3277
* Regenerate ONNX nanopb from latest version.
But don't bump the IR version, we don't handle discriminators
yet.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add discriminator to AttributeProto.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add back ONNX definition for permute
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Merge vestigial Local.cwrap into Declarations.cwrap
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Remove dead standalone ATen build logic.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Allow in-place operations on views
Adds VariableViewImpl, a subclass of VariableImpl which has a pointer to
the base Variable on which it is a view. In-place operations on views
change the grad_fn of the base.
Note that in-place operations only work on views that are the first output of the function that created them. All C++/ATen-implemented functions have this behavior, but it's possible to write Python-implemented autograd functions that do not. In-place operations on these views will raise an exception.
Fixes #3313
* THS build change
* merge THCS into ATen build
* THCUNN build change over
* update THNN build
* move THC build to ATen, as well as some of the accumulated top level config from other TH* libraries
* TH library build merged into ATen, and warnings fixes.
* fix magma support checking
* check cuda early
* fall back to GCC atomics if C11 atomics have issues.
* fix install name
* disable openmp in files that also include stdatomic.h
* make sure LAPACK is visible to TH build file.
* Use Welford's algorithm when reducing along inner dimension for THCTensor's variance fn (see the sketch after this list)
* Use accreals in THCTensor's varInnermostDim
* Skip cuda tests if no cuda
* Variance testing
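For reference, a minimal single-threaded sketch of Welford's update (the CUDA kernel additionally has to merge per-thread partial results, which is not shown here):
```
struct WelfordAccumulator {
  double mean = 0.0;   // running mean
  double m2 = 0.0;     // running sum of squared deviations from the mean
  long long n = 0;     // number of samples seen

  void update(double x) {
    ++n;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);  // uses the updated mean
  }

  double variance(bool unbiased) const {
    if (n < 2) return 0.0;
    return m2 / (unbiased ? n - 1 : n);
  }
};
```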
Summary:
The current version of the code does not support type and shape inference, which is
going to make all places that rely on it fail miserably.
I'm still leaving the option of doing init the old way in case some places
are already failing this inference logic.
Reviewed By: ffjiang
Differential Revision: D6241270
fbshipit-source-id: e9080ffe93d610b5ada58ebe66579acfa57c6b3c
Summary:
replaces FB-internal NNPACK fork with open-source version.
Important FB features are already upstreamed to the GitHub repo.
Reviewed By: ajtulloch
Differential Revision: D6224054
fbshipit-source-id: 4dbe02b4da97648a663586414550c2d4e23c7221
Summary: Add support for SparseMomentumSGDUpdate and tests for momentum SGD in both dense and sparse cases
Reviewed By: akyrola
Differential Revision: D6234834
fbshipit-source-id: 9848c29ea06794ef35f1ebaff0f5e81eac4f4db9
Summary:
This seems to be faster in a bunch of cases. Prefer to keep it as a
separate op instead of MatMul + Add so its easy to compare perf on per
op basis between this one and the baseline (normal FC)
Reviewed By: akyrola
Differential Revision: D6169187
fbshipit-source-id: 09b96325d44bd181896f396aec88b27314c435b0
Summary:
The resnet50 trainer will save the 'optimizer_iteration' blob in checkpoints, but loads it in GPU context. This fails because AtomicIter/Iter expect the blob to be in CPU context. So manually reset the optimizer_iteration in CPU context.
I am thinking of making the iter-operators automatically do this switch, but in the meantime this unbreaks the trainer.
Reviewed By: sf-wind
Differential Revision: D6232626
fbshipit-source-id: da7c183a87803e008f94c86b6574b879c3b76438
Summary:
Implementation of polling async net executor.
Notes:
- New net executor async_polling - schedules CPU and GPU ops asynchronously, uses single polling thread
- Events: update to Caffe2 events to support async CPU events, adding new methods:
Query() - non-blocking checking of event states: INITIALIZED -> RECORDED -> SUCCESS/FAILED
ErrorMessage() - when operation runs asynchronously and fails calling this on event will give error message
- Tasks: using existing DAGNet's algorithm to compute CPU and GPU chains, a separate task for each chain
- Polling: using single thread to query state of events - for CPU tasks atomically queries task state, for GPU task - uses cudaEventQuery; using Event
- Scheduling of CPU ops: using global thread pools
- Scheduling of GPU ops: using GPU thread pool per GPU device
Reviewed By: dzhulgakov
Differential Revision: D5985110
fbshipit-source-id: a9de7fcbb71d046a3aa1b573072b89a65dfeee8c
Summary: 8 bytes is 64 bits. Fixes out of range access caught by ASAN
Reviewed By: Yangqing
Differential Revision: D6219576
fbshipit-source-id: f7c418b12fa211890abcb5aef800bd456390b73a
Summary: Before the boundary checking was happening after the first access for 8bit ops.
Reviewed By: Yangqing
Differential Revision: D6206753
fbshipit-source-id: 07ab240cae8c67b3048f03aa79af0b6399b9940b
Summary: Still assumes a complete subgraph, but slightly more generic.
Reviewed By: Yangqing
Differential Revision: D6103228
fbshipit-source-id: bfa0d46067e05baa0478a4c37a67ccf8f81f34ec
Reduction functions that take a dimension now properly reduce
down to scalars if passed a 1-dimensional tensor.
Squeeze now properly reduces down to scalars as well (and is implemented
as a native function).
Unsqueeze now handles scalar inputs correctly (so unsqueezing a scalar
returns a dim 1 tensor, rather than a dim 2 tensor).
This gets rid of kUndefinedDimensions and has nice properties like:
- the dimensionality always matches the length of the sizes and strides.
- the number of elements is always the product of the sizes (starting at the identity)
- the shape you pass to factory functions (e.g. randn) matches the shape that is returned
etc.
In addition to the empty tensor change, this makes some related changes:
1) expand is now a native function, because it needs to operate on the ATen view of the size/strides.
2) adds tests for a number of functions operating on empty, scalar, non-scalar tensors.
This uncovered a number of scalar_check bugs; some of these are fixed in the generated code,
some that need to be manually specified can be specified by a 'scalar_check' argument in the cwrap.
3) fixes the formatting of empty tensors
4) changes the THLongStorageView API; the public API was getting overly complicated, so now you call
'makeFromSize', 'makeFromStride', 'makeFromLength' and it just handles the correct mapping for that type.
Permute transposes multiple dimensions at once. The as_strided function
changes the sizes and strides of a tensor without changing the Storage.
It's a subset of Tensor::set_.
This allows VariableType override them to return instances of
VariableType. Combined with the change to Formatting.cpp, this lets us
print Variables to std::cout.
For one thing, we will want a different implementation from TH because
we need to differentiate between scalars and 1-dim tensors.
Also, we don't really want to expose the THS/THCS function; in addition to
checking the shapes are the same, it checks that the dimensions which
are sparse are the same (because various THS/THCS operators only work if this
is true; it should really be called "is_congruent" or similar).
This adds the ability to specify 'native' functions in NativeFunctions.h and specifies
'split' and 'chunk' in this manner. The function arguments, returns, variants, etc. are
specified as if they were processed via other parsing mechanisms (e.g. cwrap_parse) with
the following additional parameters:
type_method_definition_level: this allows one to specify that the type method should
be defined at the 'base' type level; this is because in the case of 'split' and 'chunk'
(and probably most/all other native functions that don't directly dispatch to TH/THC)
we don't need type-specific implementations. Currently it is enforced that 'base' is
specified for native functions, but this is easy to remove later.
type_method_definition_dispatch: this defines the function to dispatch to. For split,
this is at::native::split; this is just to avoid having a magic namespace and allowing
one to dispatch to a function with a different name.
Currently, the toXXX functions on Scalar check that the conversions are
exact. This will cause an exception in code like:
auto t = CPU(kFloat).ones({1});
t *= M_PI;
Or the equivalent in Python:
t = torch.ones(1)
t *= math.pi
This changes the checks to only throw an exception in the case of
overflow (positive or negative).
a.copy_(b) will now broadcast b to the shape of a. Note that this means
that copies between tensors of the same number of elements but
incompatible shapes are not allowed. For example, the following will
throw an exception:
Tensor a = type.rand({4, 3});
Tensor e = type.rand({3, 4});
a.copy_(e);
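The same behavior from Python, as a minimal sketch with today's torch API (the names mirror the C++ example above):
import torch
a = torch.zeros(4, 3)
b = torch.ones(3)
a.copy_(b)            # b is broadcast to a's shape (4, 3)
e = torch.ones(3, 4)
# a.copy_(e) would raise: same number of elements, but the shapes do not broadcast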
The methods were separate because PyTorch supports multiple output types
for comparison methods. For example, for FloatTensors 'a' and 'b' both
calls are valid:
torch.lt(a, b, out=<ByteTensor>)
torch.lt(a, b, out=<FloatTensor>)
ATen only supports ByteTensor outputs because the overloads have the
same static signature and would conflict. It would be nice to fix this
in the future like with the bernoulli function.
In the meantime, the separate function and method definitions with
different argument names make implementing VariableType more difficult.
This generates NN bindings with a similar interface to PyTorch's
torch.nn.functional package. The file nn.yaml specifies function
signatures and THNN implementations.
Each NN operation generates three functions. For example:
- conv2d
- conv2d_forward
- conv2d_backward
The conv2d and conv2d_forward functions differ in how they handle
buffers that need to be passed to the backward function. conv2d_forward
takes the buffers as parameters. conv2d creates the buffers internally
and discards them.
* Improve Declarations.yaml:
- translate defaults to C++ values
- include names of returned values
- mark keyword-only arguments
* Add comment to translate_default
This respects all the broadcast cwrap specifications except for 'fallback';
i.e. pointwise functions operating on tensors where the number of elements
match but the sizes are different and not broadcastable. This behavior is
currently deprecated in PyTorch. Note that this is a breaking change in ATen,
because ATen just passes through to TH/THC, where the fallback behavior is
actually implemented.
This also changes expand semantics wrt Scalars (as tensors). Previously,
one could 'expand' a 1-dimensional tensor with size 1 to a 'scalar' (i.e.
empty size initializer list).
Replace None grad_inputs with zero tensors in some cases
In Python-implemented autograd functions, we sometimes return None as
the grad_input if the output is marked "non-differentiable". This
replaces those None values with zero-filled Variables if the
corresponding input has requires_grad=True.
C++ implemented autograd functions expect the input (grad_outputs) to
be defined if they're executed. They always return non-null grad_inputs
if should_compute_output(i) is true. This could lead to segfaults if a
subsequent Python-implemented function returned None.
See #3412, #3241
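A minimal sketch of the Python-side pattern, written with the present-day static-method autograd.Function API (the class and names are illustrative, not code from this change):
import torch
class AddNonDiffShift(torch.autograd.Function):
    # y = x + shift, but we deliberately declare no gradient for shift
    @staticmethod
    def forward(ctx, x, shift):
        return x + shift
    @staticmethod
    def backward(ctx, grad_out):
        # None for shift; when such a None would feed a C++-implemented function
        # whose input has requires_grad=True, the engine now substitutes zeros
        return grad_out, None
x = torch.randn(3, requires_grad=True)
shift = torch.randn(3, requires_grad=True)
AddNonDiffShift.apply(x, shift).sum().backward()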
Summary:
\cc akyrola
Fixes a few issues:
1. Performance issue related to regeneration of rng states every time the input size changed - this was unnecessary, now states should be initialized once only.
2. States were being overwritten between the fprop and bprop operators, causing silently wrong results. This required use of the new `cudnnRestoreDropoutDescriptor` API, which in turn required gating behind cuDNN v7
3. Random seed was not being inherited from the `operator_def.device_option()`
Closes https://github.com/caffe2/caffe2/pull/1418
Differential Revision: D6222081
Pulled By: akyrola
fbshipit-source-id: 021067b95bcf0a16db8f4a73d3ed70e21b54bc9f
We don't currently generate _out functions for ATen native functions and may not
(they don't work with Variables currently). Also, the existing code was wrong
as the argument orders were swapped in the two squeeze variants.
Summary:
Implements send/receive calls in C++. This includes both a C2 independent
library in async/comm as well as the C2 operations in the c2 sub-directory
There are still several items to be addressed in future diffs:
- multiple channels per pair to alleviate the issue with small message latency
- re-add statistics per comm-client and per-op
- continue adding test cases as usage patterns diversify
Reviewed By: akyrola
Differential Revision: D6095219
fbshipit-source-id: 6d72770dbac693d2b7035f03ce8c6df5ce03706e
Summary:
There were cases where the direct copy succeeded, but the
dimensions didn't match. Now, we check dimensions and reset if they
don't match before issuing the copy.
Reviewed By: salexspb
Differential Revision: D6103325
fbshipit-source-id: 602605d8b119cae74e006c792bc42f355a5a9b4e
Summary:
See comments for where this can be useful (disabling the
OperatorDef::DeviceOption(...) so we can control the scope at the
NetDef::DeviceOption(...) level).
Reviewed By: viswanathgs
Differential Revision: D6103412
fbshipit-source-id: 75a9be54275760132f6d1e71acbe9190e7099289
Summary: Updated brew SpatialBN to use initializers similar to other brew ops such as conv and fc instead of initializing all of its parameters itself within the brew call.
Reviewed By: asaadaldien
Differential Revision: D5840359
fbshipit-source-id: 9f3d688d4957605eaf7ecd2488bc26bfb1da3f78
Summary:
With the update of the sample rate API, caffe2_benchmark needs to be changed as well.
Tested building the caffe2_benchmark and running the program on an android phone. See the delay metrics reported in adb.
Closes https://github.com/caffe2/caffe2/pull/1419
Reviewed By: Maratyszcza
Differential Revision: D6221101
Pulled By: sf-wind
fbshipit-source-id: 77a06ecce55b54cff8b9fa0aef857bc542a5f371
Summary: Adds the ability to create a local blob in the workspace even if the blob exists in the parent workspace. This is to support cases where a user wants to create a local copy of the blob and hide the blob from the parent workspace.
Reviewed By: akyrola
Differential Revision: D6194386
fbshipit-source-id: 92c064159ac635ee76c211abc013b72bd8752447
Summary:
We'd like to sparsely sample the net execution, but after the net is sampled for the first time, we'd like to densely sample the following few iterations so that we can have some meaningful data for a short period of time.
Change the observer sample rate to the following:
skipIter: skip the first few iterations.
netInitSampleRate: the sample rate for the first iteration after the skipIter or immediately after reset.
netFollowupSampleRate: the sample rate after the netInitSampleRate is hit.
netFollowupSampleCount: the number of iterations that use the netFollowupSampleRate. After this count is reached, go back to netInitSampleRate (reset)
operatorNetSampleRatio: whenever the net is sampled, if the random number also hits operatorNetSampleRatio, collect operator metrics instead.
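A rough Python sketch of how these knobs are meant to interact (illustrative only, not the Caffe2 implementation; the class and attribute names are made up):
import random
class NetSampler:
    def __init__(self, skip_iters, init_rate, followup_rate, followup_count):
        self.skip_iters = skip_iters          # skipIter
        self.init_rate = init_rate            # netInitSampleRate, as 1-in-N
        self.followup_rate = followup_rate    # netFollowupSampleRate, as 1-in-N
        self.followup_count = followup_count  # iterations of dense follow-up sampling
        self.iteration = 0
        self.dense_left = 0
    def should_sample(self):
        self.iteration += 1
        if self.iteration <= self.skip_iters:
            return False
        if self.dense_left > 0:
            self.dense_left -= 1
            return random.randrange(self.followup_rate) == 0
        if random.randrange(self.init_rate) == 0:
            # once the net is sampled, densely sample the next few iterations
            self.dense_left = self.followup_count
            return True
        return False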
Reviewed By: Maratyszcza
Differential Revision: D6205657
fbshipit-source-id: da0c048f77fc4dc64f3fb71b6072429a57e9d2f0
Summary:
If this variable is set to a ccache symlink then the NCCL build will
also use the cache. The NCCL build is the slowest component of a cached
build without this change
Closes https://github.com/caffe2/caffe2/pull/1416
Reviewed By: Yangqing
Differential Revision: D6214008
Pulled By: pietern
fbshipit-source-id: e0a90e27de9b1c5a1fdc0e5bad5fb61f9fa924c3
Summary: CAFFE2_ENFORCE accesses a global variable in a separate compilation unit.
Reviewed By: romain-intel
Differential Revision: D6200236
fbshipit-source-id: a501b05bd23afec2ef4a23dd482a4dc4cfc196f1
Summary:
My commit bab5bc broke things with fp16 compute, as I had tested it only with the null input, which actually produced fp32 data (even when dtype was given as float16). Also, I had confused the concepts of "float16 compute" and fp16 data. Issue #1408.
This fixes those issues, tested with both Volta and M40 GPUs. Basically restored much of the previous code and fixed the null input to do FloatToHalf.
Reviewed By: pietern
Differential Revision: D6211849
fbshipit-source-id: 5b41cffdd605f61a438a4c34c56972ede9eee28e
* enable size from ATen type
* temp commit aten thd
* port copy, math
* port random
* changes after rebase
* lapack bind
* thd and csrc compile
* fix min/max reductions in DataChannelTCP
* clean up changes
* re-enable tensor constructors
* port MPI to at::Tensor
* fix storage methods to not cast to thpp storage ptrs
Some knock on effects:
- at() is not supported on ArrayRef. I fixed this by adding a new
overload for input() to access a specific input. I also filed
https://github.com/zdevito/ATen/pull/152
- Need new overloads for fmap/filter, because template deduction won't
attempt an implicit constructor in an attempt to match the argument.
- New overload in ir.cpp for printing ArrayRef.
- When we pybind11 an ArrayRef, we convert it into an iterator.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This breaks a lot of the onnx-pytorch tests because the abstraction
barriers are not respected. I'll spin up a patch for that separately.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This started off as a minor fix based on Adam's question, "why is printing
a graph not const" and snowballed into a giant yak shaving exercise.
- The Graph and Node APIs now uniformly enforce deep constness; e.g., if you
get a const Node* or const Graph*, it is not possible to get a non-const
Node*/Graph* somewhere else in the graph (even though the member variables
of these are non-const. Hooray for private access specifier.)
- A big pile of functions got const versions, most notably the printing
functions, and functions for accessing inputs().
- REALLY IMPORTANT, BC-BREAKING CHANGE: inputs() now returns a COPY of the
inputs, rather than a reference to the underlying. I was forced to do this
because there is no way to portably turn a std::vector<Node*> into a
std::vector<const Node*>, which is necessary to provide a const-correct
version of inputs() that enforces deep const-correctness. I then justified
this choice to myself with the observation that outputs() returned a
copy (by necessity), so this makes the API more uniform.
But making this change uncovered two very subtle bugs:
1. If you change functions from returning a reference to returning a copy,
the idiom node->inputs().begin() is no longer valid, because the memory
the iterator points to immediately becomes invalid. THIS SUCKS.
Honestly, we should add a lint rule rejecting calling begin()/end() on
temporaries because this is very dangerous. To excise this pattern from
the codebase, I added begin() and end() methods to Graph, so that we got
rid of the graph->nodes().begin() idiom, which happens to be sound,
despite not returning a reference, because graph_node_list is a
non-owning reference.
2. pybind11 doesn't handle std::vector<Node*> cast out of the box.
Fortunately, I found a simple fix in the GitHub issues tracker
that involved adding an extra type converter. And yes, this
does mean that outputs() in Python never worked correctly.
- New const_graph_node_list, which is a graph_node_list that gives you const
Node*
There are some more miscellaneous improvements:
- Applied CR comment fixes on export.cpp; using replaceInput, and renaming
variables for clarity.
- assertValidInput helper method added, and applied to replaceInput
- Use an explicit function to print THPObjectPtr, otherwise we get
the wrong overload.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Prevent numerical issues with poisson_nll_loss when log_input=False
Evaluation of the logarithm of the input variable in the Poisson negative log likelihood leads to a NaN loss if the variable being evaluated is zero. A small epsilon is added to prevent this. See the equivalent Keras epsilon here: https://github.com/fchollet/keras/blob/master/keras/losses.py#L68
* PEP8 fix
* Add epsilon support to PoissonNLLLoss in nn.modules.loss
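A small usage sketch with today's functional API (the explicit eps value is illustrative; only the log_input=False path uses it):
import torch
import torch.nn.functional as F
rate = torch.zeros(3, requires_grad=True)   # predicted rate, exactly zero here
target = torch.tensor([0., 1., 2.])
# with log_input=False the loss evaluates log(rate + eps), so a zero rate no longer yields NaN
loss = F.poisson_nll_loss(rate, target, log_input=False, eps=1e-8)
loss.backward()
assert torch.isfinite(loss)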
* Add torch.take and Tensor.put_
These are similar to numpy.take and numpy.put. The take function allows
you to linearly index into a tensor without viewing it as a 1D tensor
first. The output has the same shape as the indices. The put function
copies value into a tensor also using linear indices.
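For example, a sketch against today's torch API:
import torch
t = torch.arange(12).view(3, 4)
idx = torch.tensor([0, 5, 11])
torch.take(t, idx)                                       # tensor([0, 5, 11]); output shape == idx shape
t.put_(torch.tensor([0, 1]), torch.tensor([100, 200]))   # writes through linear indices, in place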
Summary: This cleans up the _hack_get_slice_end() using the Conditional operator.
Reviewed By: jmp84
Differential Revision: D6177797
fbshipit-source-id: 5ce0b76b8472123415bba39488aa2c69aad96111
Summary:
Caffe2 fails to build with some old CMake versions because it doesn't figure out that the build implicitly depends on NNPACK build.
This commit adds this dependency explicitly.
Closes https://github.com/caffe2/caffe2/pull/1414
Differential Revision: D6203486
Pulled By: Maratyszcza
fbshipit-source-id: 86f6d9d88976656820f44e3416c57ddf22350362
Summary: Updating the documentation to clarify the behavior of negative end indices.
Reviewed By: jamesr66a
Differential Revision: D6169058
fbshipit-source-id: f14f7cb8b30c26b1cccce104eba8c957a444657f
* update fuser to match ATen-formatted JIT ops
* fix concat optimizations and add test
* allow onnx export to work with single-export functions
* fix onnx handling of multi-return nodes.
* nits, format, vision test update
* fix add constant
* fix driver init issues
* Add missing Neg symbolic.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
I (actually by mistake) included some premature optimization in D6155510 for the threaded RNN executor. Unfortunately, there was a subtle race condition when some ops were run out of order, because I had made the countdown only count down in the last timestep. Hard to explain.
Out of caution, revert D6155510's changes to recurrent_network_executor.cc, excluding one assertion and the setting of the debug flag.
Differential Revision: D6195544
fbshipit-source-id: 24a275e185e5a80835401a8cdcb162dbc2411789
Summary: Added a simple function to synchronize a blob across machines (but not across devices), i.e. blobs that are not synced over devices.
Reviewed By: yqwangustc
Differential Revision: D6192922
fbshipit-source-id: a4d653c9fb09f06b0c42330bdae07b42f5e6346c
Summary:
Implemented new CUDA class for operator SparseAdagrad. The param and moment inputs now can be float or float16.
The functions for mixed-precision add/mult/store are defined in a separate header file ("caffe2/core/float16_util.h") for reuse purposes.
Reviewed By: azzolini
Differential Revision: D5880200
fbshipit-source-id: dca227f38629a03a9d771f42efe2c0b673075c4d
Summary: Allow the GEMMs in the FC/FCGradient Op to do FP16 compute instead of FP32 if the appropriate op flag is set.
Reviewed By: asaadaldien
Differential Revision: D5839777
fbshipit-source-id: 8051daedadf72bf56c298c1cf830b019b7019f43
Summary: CAFFE2_ENFORCE(a == b) and CAFFE2_ENFORCE_EQ() are functionally equivalent, though the latter provides a more detailed failure message.
Reviewed By: salexspb
Differential Revision: D5991775
fbshipit-source-id: 52e4d6d559c933de5b33d791b20223effe9d4f66
Summary:
RNN executor had a disadvantage compared to plain nets when running in forward-only mode: for plain nets, we only create two workspaces and two nets and alternate between them. With the RNN executor, we had only four workspaces (4 > 2 because it was faster in some cases), but the nets (or rather the ops) were created for each of the timesteps. This has significant overhead. This diff changes this so that if the executor is in forward-only mode (i.e. has a limited parallelism setting), it will reuse the same operators as the t - 4'th net, excluding the ops that require the timestep blob. The latter exception is required because the RNN executor needs a different timestep blob for each timestep, since it cannot modify the value of the timestep blob as it would when running nets in a loop.
Also removed redundancy in the dependency computation and added a debug flag to the executor that outputs the description of the rnn contents.
Reviewed By: salexspb
Differential Revision: D6155510
fbshipit-source-id: c47f727d2128649b081270d15020a08d41e5748d
- Deleted Addmm/Concat Function class, as this is now native ATen operator
- Resurrected ONNX operator for Concat (now called 'cat')
- Add a "fake" Expand ONNX operator, which we now do the optimization on;
this helps prevent us from emitting a warning that 'expand' is not supported.
We still fail if any of these Expand operators make it to the final model,
until we actually formalize Expand in ONNX. This also simplifies the
fuseBroadcast code, because single-return ONNX nodes don't get select nodes.
- New error reporting strategy. If we fail to export an operator because of
something, we emit a warning, but otherwise keep going. At the very end,
in export.cpp, we now check if there are any ATen operators left over. If
there are, we bug out. This assumes that ATen is lower case and ONNX is upper
case. You're now supposed to 'return _unimplemented(msg)' in these cases.
- New toString() method on Graph, for getting the string graph (useful for
slapping it into error messages.)
- Some of the legacy symbolics (still in Python symbolic method of Function
subclass) have been cleaned up for clarity.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The pieces:
- I improved the lint / asserts to catch some bugs which I
committed while working on my export. There are two new
properties which the linter checks now:
(1) "Anticipated uses". If a node says that is used by
M, M better appear later in the topsort. Previously,
we only checked if it was in all_nodes.
(2) If you are a select node, you better be a multi-type node;
if you're not a select node, you better not be! And you
should never have an input that is multi-type.
- There is a new peephole optimization pass, for simple, local
transformations to graphs. Right now, it implements a simple
optimization: remove 'expand' invocations that are no-ops
(the size before matches the size after), but we can add other
things to it later. I needed this for ONNX because no-op expands
show up in the left-hand argument, which we don't support.
- There is now a broadcast fuser, which fuses ATen expand ops
into broadcastable ONNX ops (Add, Div, Mul, Pow, Sub, Gemm.)
It only fuses when the original size is a suffix of the new
size, as per the ONNX spec.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
This is introduced in 8539a1e78b - vector<float> should not be used in Tensor shape inference.
Closes https://github.com/caffe2/caffe2/pull/1393
Reviewed By: akyrola
Differential Revision: D6181075
Pulled By: Yangqing
fbshipit-source-id: 002144a137148b5b16118d0c123132890e8d325a
.gitignore should have uninteresting files listed, so it acts as a good
.dockerignore. Reduces the build context sent to the docker daemon from
2.927GB (after building locally) to 66.66MB (:O).
Summary:
Just noticed while reading the code.
We can wait only on the tails of the DAG, not on every execution chain node.
Reviewed By: akyrola
Differential Revision: D5861078
fbshipit-source-id: f4f6296fed1ccc96b1ab99b4272b82c8bf764ca9
Summary:
Add CUDAContext::cudnn_handle() for easier integration of single
cudnn routines into operators without requiring the weight
of CuDNNWrapper or similar, or needing to spin out a separate CuDNN*Op
version of an operator.
It was necessary to split out the cuDNN wrapper code from the base cuDNN helpers in order to resolve a circular dependency between context_gpu.h and common_cudnn.h when handles and cuDNN `#define` were added.
Closes https://github.com/caffe2/caffe2/pull/1376
Reviewed By: pietern
Differential Revision: D6162034
Pulled By: akyrola
fbshipit-source-id: 95687e55b3e1e921e1f5e0f016f43b586f5f3350
Summary: Added an initializer which sets up the ParameterInfo object in the opposite format from the pFP16Initializer. This is needed for when the op requires the initialized blob to be FP32 but an FP16 copy of the weights is needed.
Reviewed By: wesolwsk
Differential Revision: D5840832
fbshipit-source-id: 439e87f41a1dbc58bf63a5c0e7f7fc4cb00b4d65
Summary: Given an additional tensor containing the values corresponding to the weighted samples, add tensor output that contains the values selected by the sampled indexes.
Reviewed By: akyrola
Differential Revision: D6050094
fbshipit-source-id: 1eccc641b99e30d36ae83d49f630b018a53e4147
Summary: Sigmoid + CrossEntropy has a numerical stability issue. The gradient of sigmoid is `dx = dy * y * (1-y)`. When `label=0` and `x` is large, `1-y` can be rounded to (near) 0 and we lose `dx`. Switching to `SigmoidCrossEntropyWithLogits` solves the issue because the gradient does not depend on `y`.
Reviewed By: chocjy
Differential Revision: D6086950
fbshipit-source-id: f990ae726802aa5c56fa62cf5e23f2e61ee047fa
Summary:
We need to use Cluster to isolate the definition of the nodes.
Otherwise, the contexts are polluted and the run becomes
stateful.
Reviewed By: Yangqing
Differential Revision: D6140404
fbshipit-source-id: 09d1c86ef12bb01eaa16b1dade4d2e1e93be287a
Summary:
This will help releasing models that are using Caffe2 but have their own operator implementations and extensions. More detailed docs to arrive later. Let's see what contbuild says.
Closes https://github.com/caffe2/caffe2/pull/1378
Differential Revision: D6155045
Pulled By: Yangqing
fbshipit-source-id: 657a4c8de2f8e095bad5ed5db5b3e476b2a877e1
Summary:
For some reason, having SHOULD_NOT_DO_GRADIENT in a .cu file (this is for a CUDA-only operator) causes a double-free error detected by ASAN. This is why the innocent-looking D5837837 caused automatic ASAN tests to fail (at least on Xray).
Removing these entries makes the error go away, and is ok because we don't really need these tags. But it would be nice to understand what causes the double-free. I don't have time to investigate myself now.
Reviewed By: Maratyszcza, salexspb
Differential Revision: D6161559
fbshipit-source-id: a52cb2a9cc62f2ec54ed866846f2bd1ccb0ae90f
* API changes
* Implement reduce for THNN ClassNLLCriterion
* Implement reduce keyword for THCUNN ClassNLLCriterion
* Implement reduce for THNN SpatialClassNLLCriterion
* Implement reduce for THCUNN SpatialClassNLLCriterion
* Make legacy NLLLoss work
* Docs for NLLLoss reduce
* reduce keyword for double backwards NLLLoss
* reduce=False tests
* Addressed comments
* Fix trailing whitespace
* Fix test failures in legacy nn
* Rebase: add reduce keyword to aten declarations of NLLLoss
* Add reference functions for all NLLLoss and NLLLoss2d test cases
* Replaced slow get/set fns. Don't use int64_t in kernels.
* Use TH_INDEX_BASE in NLLLoss for consistency
* Fix legacy ClassNLLCriterion tests
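A usage sketch of the new behavior, written with today's reduction= spelling (which later superseded the reduce= keyword added here):
import torch
import torch.nn.functional as F
log_probs = F.log_softmax(torch.randn(4, 5), dim=1)
target = torch.tensor([0, 1, 2, 3])
per_sample = F.nll_loss(log_probs, target, reduction='none')  # reduce=False: one loss per example
mean_loss = F.nll_loss(log_probs, target)                     # default: reduced to a single scalar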
Summary:
CMake scripts in NNPACK use enum34 polyfill for PeachPy to support pre-3.4 Python interpreters, which do not have built-in enum module. This polyfill was found to be conflicting with built-in enum module on Python 3.6, and I updated NNPACK CMake scripts to only use polyfill for Python < 3.4. This commit propagates this change to Caffe2, so Caffe2+NNPACK can be built on systems with Python 3.6.
Closes https://github.com/caffe2/caffe2/pull/1389
Reviewed By: bddppq
Differential Revision: D6161663
Pulled By: Maratyszcza
fbshipit-source-id: c8aa07def6abe252a0a2ab927f6c49ccd846ab93
Summary:
seq2seq/translate.py was running much slower on RNNExecutor. This was because RNNExecutor has significant init overhead (I have another diff to reduce it, though not completely eliminate it), and translate was calling the decoder with RunNetOnce -- thus always recreating the net and the ops. Changing this to RunNet() makes translate run faster than without the executor. RunNet uses the net name and reuses the already created net, while RunNetOnce passes the whole protobuf.
Noticed a similar bug in the seq2seq ensemble beam model, which also calls CreateNet() but uses RunNetOnce() instead of RunNet().
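A minimal sketch of the difference in the Caffe2 Python API (the net contents are placeholders):
from caffe2.python import core, workspace
net = core.Net("decoder_step")
net.ConstantFill([], "x", shape=[1], value=1.0)
workspace.RunNetOnce(net)               # ships the whole NetDef and recreates the ops on every call
workspace.CreateNet(net)                # create the net (and its ops) once...
for _ in range(10):
    workspace.RunNet(net.Proto().name)  # ...then run it by name, reusing the already created ops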
Reviewed By: jhcross
Differential Revision: D6156566
fbshipit-source-id: a933453e36a0d8fd163d0584186fda427a680687
Summary:
NNPACK now supports building with CMake, and its build scripts have advantages over the ones in Caffe2:
- They automatically download all dependencies, no need to keep them in submodules anymore
- They automatically download and setup PeachPy for x86-64 build
- The same scripts are used for server/desktop (Linux, macOS) and mobile (Android/iOS)
- They unblock Caffe2 build with Ninja
Closes https://github.com/caffe2/caffe2/pull/1382
Reviewed By: Yangqing
Differential Revision: D6150723
Pulled By: Maratyszcza
fbshipit-source-id: 7c3e4e3406f60d4cc059e1c8112cb10aa3d75ece
Summary:
In order to reproduce the StarSpace model using the architecture of the Two Tower model, we need to implement the ranking loss that is used in StarSpace as well as the Filament model. In both the StarSpace and Filament models, all negative samples come from random negative sampling, so the number of negative samples per positive record is fixed (say 64). To calculate the total loss, for each positive record, the hinge distance between the positive score and the negative scores (the 64 scores in the example) is calculated. This diff implements this loss in the Dper framework.
The main idea is to add an option so that negative_sampling.py can output random negative samples as an independent field rather than merged with the original input_record. In this way, we can calculate the positive score and negative score separately, which will eventually be used when calculating the ranking loss.
(Note: this ignores all push blocking failures!)
Reviewed By: kittipatv
Differential Revision: D5854486
fbshipit-source-id: f8a5b77be744a6cc8a2b86433282b3b5c7e1ab4a
This includes some changes to the dispatch code for torch.xxx functions:
- Since Variable.addmm is an instance-method, the self argument has to
come first. The dispatch code swaps the first two arguments if
necessary to support the deprecated signatures where 'alpha' or 'beta'
comes before the 'self' tensor.
- Delete IMPLEMENT_STATELESS_REVERSED. These functions require output
arguments to be passed in using the keyword 'out'. They were meant to
handle torch.gt(out, a, b), but we haven't allowed that for a while.
Summary: Made the assertion message clearer to let people know that rowwise is not supported for dense adagrad.
Differential Revision: D6135363
fbshipit-source-id: d706135a335305627310c69a2a6d7721b0a47f0e
* made it explicit in the docstring of Module.register_forward_hook() that the hook(s) will be called AFTER calling forward().
* added "every time" in docstring of Module.register_forward_pre_hook()
* Unify CUDA kernels for SoftMax and LogSoftMax
* Improve SoftMax and LogSoftMax kernels performance
Added a new instantiation of the spatial kernel for
low inner_size and larger dim_size.
* tensor: Ensure that the tensor is contiguous before pinning (#3266)
pin_memory() was producing out-of-order tensor when the given
tensor was transposed, i.e. in column-major order.
This commit fixes this by calling contiguous() before pinning.
* test: add contiguous test for pin_memory (#3266)
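A small sketch of the fixed behavior (assumes a CUDA-capable machine, since pinning requires one):
import torch
x = torch.randn(4, 5).t()   # transposed, hence non-contiguous (column-major order)
y = x.pin_memory()          # internally calls contiguous() first after this fix
assert torch.equal(x, y)    # values come out in the right order
assert y.is_pinned()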
Summary:
RNN executor has significant overhead from creating the timestep nets the first time, and this is especially bad with beam search, which is complex.
So disable RNN executor for now until perf regression is fixed (I have pending diff on it).
Reviewed By: salexspb
Differential Revision: D6138878
fbshipit-source-id: ce63ab9ce9cc1c0f67097aea1e370494ca98c680
* tensor.numpy() checks that no arguments are passed
* tensor.numpy() checks that no arguments are passed
* Improve .numpy() argument checking performance
Summary:
Added two new ops, FP16MomentumSGDUpdate and FP32MomentumSGDUpdate, which perform both the momentum sgd and weight decay updates to a given parameter in a single op -- thus being more efficient.
Also updated the standard momentum sgd test to test if nesterov momentum works.
Reviewed By: asaadaldien
Differential Revision: D5837837
fbshipit-source-id: 5ad487b9c59434491d3a4fcfdeed820db6083f57
Summary:
Added FP16SgdOptimizer to optimizers. The optimizer updates the params using the FP16MomentumSGDUpdate and FP32MomentumSGDUpdate ops. To determine which update op to call, the optimizer expects either the fp32_update flag to be set, or that the blobs are in a recognized format created by initializers.py.
These requirements can be loosened if the blob DataType can be queried in python, though I am unsure of how to do this.
It also forces FP32 updates for SpatialBN, as CuDNN only supports FP32 params for SpatialBN.
Reviewed By: asaadaldien
Differential Revision: D5840806
fbshipit-source-id: 84ab8dc11a6e91a198ed72c00287f4809607079d
* Fix clang-802.0.42 tuple overload bug, fixes #3234.
Originally, my plan for emit_record_trace was to keep it as
simple as possible, if at the expense of some somewhat ugly
overloads. So this meant we had a 'recordTrace' function
with overloads like this:
recordTrace(..., const Variable& out)
recordTrace(..., const std::tuple<Variable, Variable>& out)
Unfortunately, this triggers a bug in clang-802.0.42
(widely used in macOS Sierra 10.12.6) wherein a Variable is
implicitly convertible into a std::tuple<Variable, Variable>;
a minimal repro can be seen below here:
#include <tuple>
struct T {};
void f(const std::tuple<T, T>&) {}
void g(T& x) { f(x); }
To work around this bug, the code generator is a bit more
complicated, and is taught how to handle this situation.
Previously the generated code looked like:
jit::tracer::recordTrace( "min", { self }, ret );
Now it looks like:
jit::tracer::recordTrace( "min", { self }, { std::get<0>(ret), std::get<1>(ret) } );
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CR comments
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Adding "dtype" parameter for the GivenTensorOp. Also, providing backwards compatibility for the existing code, byt supporting the templating if "dtype" is not provided.
Reviewed By: bddppq
Differential Revision: D6090049
fbshipit-source-id: f5deaa57b49f2280289975f4583aba5bc064a2bc
Summary:
Bumps the pybind version from v1.8.1 to v2.2.1, resolving all compile & runtime issues that arose.
The API upgrades used https://github.com/pybind/pybind11/blob/master/docs/upgrade.rst as the point of reference.
This also solves a long-standing bug we had, where a type would spontaneously and intermittently change in the C++ -> Python boundary.
\cc Yangqing
Closes https://github.com/caffe2/caffe2/pull/1308
Differential Revision: D6125152
Pulled By: pietern
fbshipit-source-id: 67839a9654c655d143820c6686c311beba64eff2
Py_InitModule returns a borrowed reference. PyModule_AddObject steals
the reference, so we need to incref the `_nn` object.
(The Python 3 function PyModule_Create returns a new reference.)
Don't create grad_fn if requires_grad=False
- Check that arguments without derivative definitions have
requires_grad=False
- Pass all tensor arguments to the tracer, including ones without
derivative definitions
Summary: CUDA version of weighted sampling operator; minor changes for CPU version
Reviewed By: asaadaldien
Differential Revision: D6106668
fbshipit-source-id: 42d7607bd845a4a39cf5b89d7476904cb5928431
Summary:
While waiting for the single threaded version to complete I noticed it
was doing an awful lot of waiting, so decided to make it multi
threaded. Creating a 150GB DB is now ~4x faster on an AWS EBS volume.
Closes https://github.com/caffe2/caffe2/pull/1334
Reviewed By: romain-intel
Differential Revision: D6045259
Pulled By: pietern
fbshipit-source-id: 43f9392a0a383355660a3ead217ab38939dd2bc2
Summary: Previously, the CPU version of the RowWiseSparseAdagrad operator was implemented. Here, the GPU version of the operator is implemented and tested.
Reviewed By: azzolini
Differential Revision: D6082828
fbshipit-source-id: 74befd495666c357d5ab425a698c5880cd8f927c
The general strategy is there is a new module, torch.onnx.symbolic, which
contains a function for every ATen method name with the ONNX translation.
While implementing this, I took the opportunity to expunge all references
of 'g' from the public API; instead, it is managed by a global variable in
torch.onnx which tracks the "current graph".
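A minimal sketch of the shape such entries take (illustrative; the two functions below stand in for the real ones in torch.onnx.symbolic):
def neg(g, self):
    # each symbolic is named after the ATen op, receives the current graph `g`
    # plus the op's inputs, and returns the ONNX translation built via g.op(...)
    return g.op("Neg", self)
def add(g, self, other):
    return g.op("Add", self, other)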
Other changes:
- If you pass a Tensor to op as an argument, it will now automatically be
converted into a Constant ONNX node. This lets us remove needing to
implement ONNX
- Rename value to other, wherever there is both a Scalar and Tensor overload.
This way, keyword dispatch can work uniformly in both cases.
- Deleted any autograd Function classes that both had a symbolic and were ported
to the new C++ autograd implementation. There may still be some straggling
classes that didn't have symbolic.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The generated tracing code looks like this:
if (jit::tracer::isTracing({ self })) {
jit::Node *n = jit::tracer::recordTrace( "mean", { self }, ret );
n->rawSet(jit::stringToSymbol("dim"), dim);
n->rawSet(jit::stringToSymbol("keepdim"), keepdim);
}
A few design decisions I made:
- Instead of making the assignment of 'n' conditional on whether or not
attributes are present, I just add (void)n if it would not be used
otherwise. This modestly simplifies code generation.
- Tracing of operations that involve Generator or Storage is not supported.
This is fine because such ops don't take any Variable arguments anyway,
so they couldn't trigger tracing.
- Unfortunately, at::ArrayRef is not covariant, so there is some faffing about
to support conversions from at::ArrayRef<Tensor> (aka TensorList) to
at::ArrayRef<Variable>. In the case of 'recordTrace' (slow path), I just
allocated an intermediate std::vector to get the types correct; in the case
of isTracing (fast path) there's three overloads to avoid refcount bumping
when possible.
- Tracing is all in one place, rather than spattered between the beginning
and end of an ATen function, as Sam suggested.
- This commit doesn't actually enable ATen definitions.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
1) softmax, log_softmax backwards now have int64_t dim argument
2) chunk/split in autograd/functions/tensor.cpp conflict with new
ATen implementations, just delete them and use the ATen ones.
3) div/mul with Scalar now use "other" parameter rather than "value"
Summary:
(1) use the cmake files of the corresponding libs
(2) allow static linkage of gtest and gbenchmark.
(3) Helps removing the temp solution in #1112
We are yet to disable the installation of the benchmark library, and I have an open pull request at https://github.com/google/benchmark/pull/463 - once it is merged I will do submodule update.
cc lukeyeager pietern who had this issue before - hopefully this makes the solution cleaner.
Closes https://github.com/caffe2/caffe2/pull/1358
Differential Revision: D6111404
Pulled By: Yangqing
fbshipit-source-id: 17468d32cef27f96e9445d119eb869c9c7913118
* with the size=1 case, impossible to do single point check, replace with isContiguousRange
* fix stride in desc; fix undef scope
* add test for this case for cudnn
* assertTrue
In many "non-Python" headers, we include Python.h because we need
to declare a pointer to PyObject, and solely because of that. It
would be a lot better if we had a simpler version of Python.h that
just declared PyObject available for pointers, without anything
else. This is what torch/csrc/utils/python_stub.h does.
The good thing about not including Python.h is that it is easy to
be warning-less; no more ugly insertions of Python.h on headers
where it has no good reason to be.
This makes PyTorch warning clean again.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Before we fix it properly with 'type' argument.
Reviewed By: bddppq
Differential Revision: D6103973
fbshipit-source-id: 8c00a93c373dd0ad0bbfe59944495f6574223ab6
Summary:
a parameter can be initialized multiple times in init_net if parameter sharing is enabled. With the original implementation, only the first parameter init would be replaced by pre-trained parameters and the rest were left unchanged. This overwrites the initialization with pre-trained parameters.
This diff fixes this issue and also supports model init for the ads-intent project
Reviewed By: dragonxlwang
Differential Revision: D5991291
fbshipit-source-id: 36173f6239c56bd0d604a77bd94e36072f32faa7
Summary: include memory and map from observer.h
Reviewed By: ajtulloch
Differential Revision: D6094338
fbshipit-source-id: f39b27cb76dae3b06816bb9ae37c2c1f96eaa8ba
I've also made the version counter and the "live" reference count
atomics.
Note that it's not safe to set the version counter (operator=) from
multiple threads, because shared_ptr assignment isn't thread safe.
Currently, the only call sites to these functions are on newly created
variables before they can be accessed from other threads.
See #3111
Summary:
Currently, the type inference infers FLOAT as the type for all GivenTensor*Fill operators. However, the inferred type should match the actual operators.
Also, for the `Slice` operator, there is a corner case where type inference fails.
Reviewed By: azzolini
Differential Revision: D6096813
fbshipit-source-id: d65b7c0f42436138cbc49d8a5a62374fa5e927e1
This removes the StochasticFunctions for bernoulli, multinomial, and
normal and replaces them with classes in the torch.distributions
package. Each distribution supports the differentiable log_prob function
that returns the log of the pdf/pmf of the samples.
The current StochasticFunction implementation has a few problems: it can
be painful to use when there are multiple stochastic outputs which need
to be back-propagated through. It also requires that we store grad_fns
on Variables that have requires_grad=False in order to find stochastic
nodes.
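A small sketch of the replacement pattern, using today's torch.distributions API (the reward values are placeholders):
import torch
from torch.distributions import Bernoulli
probs = torch.tensor([0.3, 0.7], requires_grad=True)
dist = Bernoulli(probs)
action = dist.sample()                           # sampling itself is not differentiated
reward = torch.tensor([1.0, -1.0])
loss = -(dist.log_prob(action) * reward).sum()   # REINFORCE-style surrogate loss
loss.backward()                                  # gradients flow through log_prob into probs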
- Cleaned up THNN and THCUNN code and kernels
- Improved THCUNN kernel performance 5x, making it match cuDNN performance
- Added support for computing softmax over arbitrary dims
NOTE: The default dim for 3D inputs is now 1 (used to be 0)
- Both functions now accept inputs with arbitrarily many dimensions
- Autograd functions no longer save the input (it's unnecessary)
- Added cuDNN bindings for softmax, but they are unused as THCUNN
matches or even exceeds cuDNN performance
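For example, a sketch with today's functional API, passing dim explicitly rather than relying on the changed default:
import torch
import torch.nn.functional as F
x = torch.randn(2, 3, 4)
p = F.softmax(x, dim=1)          # normalize over dim 1 of a 3-d input
assert torch.allclose(p.sum(dim=1), torch.ones(2, 4))
logp = F.log_softmax(x, dim=-1)  # any dim is accepted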
Summary:
This introduces a few things:
- It enables us to create Caffe2Config.cmake that can be used down the road for building dependent libraries, so they do not need to explicitly write FindCaffe2.cmake.
- The config file will automatically figure out transitive dependency of Caffe2 as well as compiler flags.
- This diff also disables the RPATH setting since it is kind of a mess right now. In principle, we should figure out a clearer rpath setting following the typical rpath setting choices (https://cmake.org/Wiki/CMake_RPATH_handling) - I can send a follow up PR to clean this up.
- Minor: removed old gflags and glog files.
Closes https://github.com/caffe2/caffe2/pull/1354
Reviewed By: dzhulgakov
Differential Revision: D6098014
Pulled By: Yangqing
fbshipit-source-id: cb06c41a7ef60fddb78b24887b6b3e82684b7c6b
Summary: Model with rowwise RMSProp does not work in net-rewriting pipeline (fbl 29841194). This diff solves the issue by changing the way Slice op is used in the model and adds a rule to `parallelize.py` to cover for needed cases.
Reviewed By: azzolini
Differential Revision: D6096022
fbshipit-source-id: c4f615b2ba99da9f77a1d49c9fb898e0e59401f8
Summary: Allow the application of sequence-length masking to be replicated along one or more minor axes. See task for details.
Reviewed By: jamesr66a
Differential Revision: D6090835
fbshipit-source-id: 9064232aa9b93246c582b6e0bae73be5dbe09e98
* Fix docs for nn.Embedding and F.embedding.
- add description of 'sparse' argument (#3104)
- fix F.embedding example (resulted in RuntimeError)
* Make EmbeddingBag a New Style Function.
* Add a functional interface for EmbeddingBag
* Fix failing tests: add max_norm and norm_type to context,
and fix typo in backend call.
* Docfix: remove torch.manual_seed from example code.
* Add a note about using sparse keyword in Embedding function.
Apparently, the algorithm only guarantees the output is coalesced if
the inputs are coalesced.
I'm planning to do another PR that does much more stringent correctness
testing for the 'coalesced' bit shortly, but y'all should merge
this one first.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: By default, do not log anything to reduce the runtime overhead
Reviewed By: Maratyszcza
Differential Revision: D6082490
fbshipit-source-id: 35fd09ea439925139d66b4623211e01af46e18f2
* THCUNN Skeleton for Depthwise Convolution port
* implement Depthwise Convolution CUDA Kernels (handles weight parameter only, not bias)
* working kernels and bindings for forward + backward for base conv, and integration
* add support for padding
* strides for weight kernel
* dilation for weight gradient, enable for others
* add support for depthwise multiplier
* remove old depthwise conv
* rename to SpatialDepthwiseConvolution
* clean up depthwise code, add shape asserts, more constrained thread count for accgradparams
* add bias for forward for depthwise conv
* add grad_bias, move bias for forward to CUDA
* fix eligibility test to guard against transposed, properly identify depth multiplier
* add basic unit test; make depthwise conv take priority over cudnn when appropriate
* add tests for depthwise permutations
* make cuda kernels calculate positions using mul instead of div
* remove unnecessary samegpu requirement
* use accreal, test for double type
* use THAssert instead of assert
* rename to is_depthwise
* half prec support for depthwise
* make certain computation more pythonic
* flake8
Previously, we created the Variable.data PyObject* in THPVariable_Wrap. For many
Variables, we don't access their data directly. Instead, they are passed
from one Variable computation to another.
This reduces the overhead of ATen-implemented Variable methods by
~200ns.
Summary:
Somehow we're observing mysterious test failures for some nnpack-related tests with gcc5 only on Travis: https://travis-ci.org/caffe2/caffe2/jobs/288804879
Marat suggested that maybe the machine doesn't have avx2 support.
Right now gating is happening for FB-internal only. I think it makes sense to make gating generic. Calling `nnp_initialize` seems like the right way to do so. It returns failure if the hardware is not supported and is a noop after the first call.
Reviewed By: Maratyszcza
Differential Revision: D6073808
fbshipit-source-id: e684668628b5c635368351114b6c502d2cc81fe4
Summary:
Op for computing SigmoidCrossEntropyWithLogits with per-label, per-sample weights. Can be used for addressing class or label imbalance.
Doc:
Given three matrices: logits, targets, weights, all of the same shape,
(batch_size, num_classes), computes the weighted sigmoid cross entropy between
logits and targets. Specifically, at each position r,c, this computes
weights[r, c] * crossentropy(sigmoid(logits[r, c]), targets[r, c]), and then
averages over each row.
Returns a tensor of shape (batch_size,) of losses for each example.
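An equivalent computation sketched in PyTorch for reference (the op itself is a Caffe2 operator; the tensor values are placeholders):
import torch
import torch.nn.functional as F
logits = torch.randn(8, 5)
targets = torch.rand(8, 5)
weights = torch.rand(8, 5)
per_elem = weights * F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
per_example = per_elem.mean(dim=1)   # shape (batch_size,), one loss per example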
Reviewed By: stephenyan1231
Differential Revision: D5997723
fbshipit-source-id: f3172325f1c98b6f26e1700131ef897b743a72fc
* Support MNIST in ONNX
* Add train mode check in FeatureDropout symbolic, add todo mark in logsoftmax_symbolic
* export FeatureDropout as a simple identity op
* turn x = x or y to if-checks.
Summary:
For distributed offline training, downloading parameters from trainer_0 is part of the epoch plan. However, for distributed realtime training, we publish the model at a specific time interval, so we need to run multiple iterations of the epoch plan before publishing the model.
In this diff, I split downloading parameters out of the epoch plan into a separate plan, so we can explicitly execute it before model publishing for distributed online training.
Reviewed By: boryiingsu
Differential Revision: D5995122
fbshipit-source-id: 47d61d7b8c57cfae156e79b7ec32068ef579d7c3
Summary: observer framework can now be used in python + a small writeup of how to use it. this is D6035393 with a fix for ct-scan
Reviewed By: salexspb
Differential Revision: D6066380
fbshipit-source-id: 896c4c580d4387240b81ac2dbbc43db51d4bfeb9
Summary: that's what made tests fail :)
Reviewed By: xianjiec
Differential Revision: D6067037
fbshipit-source-id: 0194f082feed87b0502170683c6773e07db3ff44
ATen has its own default CPU RNG. Use this as the default in PyTorch so
that random functions called through ATen have the same behavior as
random functions called through TensorMethods
Summary: Until we have an internal build test for this directory, we should not have it enabled by default in open source
Reviewed By: salexspb
Differential Revision: D6060577
fbshipit-source-id: 25f5c2d30adf274620cd8ec2e2db9565b98cfa7c
Summary:
makes the necessary changes to support Caffe2 OpenGL ES backend on NVIDIA Tegra devices
- Remove no_bounds global because Tegra GLES driver doesn't recognize it as a constant. Define BOUNDS_CHECK_MODE macro instead.
- Recognize "NVIDIA Tegra" as a supported GL_RENDERER
Reviewed By: hlu1
Differential Revision: D6030760
fbshipit-source-id: e3655467612469d69c70b3fee35edb2d6774a793
Summary: observer framework can now be used in python + a small writeup of how to use it
Reviewed By: sf-wind
Differential Revision: D6035393
fbshipit-source-id: 4563cf0203095fa979bb2160621cd16dd22ff830
It's pretty easy to accidentally fail to actually compile
a JITed region, which means that we have accidentally failed
to have test coverage for a number of features. This adds
a secret _assert_compiled kwarg, which will raise an error
if we don't actually hit the compiled codepath.
This is not intended to be user visible; we have some other
ideas for handling this case.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
We weren't doing gradient checks on these functions because the tests
were in-place only. We also incorrectly classified __magic__ functions
as inplace.
Summary: Turns out CuDNN's tensor transform only supports floats. Previous implementation pretended it would work with ints by casting to floats and indeed passed tests for some reason. But rgirdhar found a case where it returned nonsensical results. So rewire int-transposes to use non-cudnn version. Had to refactor a bit for that. Also added a test for the case.
Reviewed By: asaadaldien
Differential Revision: D6043284
fbshipit-source-id: cc3b14f9fbbdeff421b01da453a1d3c7c5ffd4ac
Summary:
input dimensions up to "axis" will be flattened to the outer dim of output and the remaining input dims will be the inner dim
Closes https://github.com/caffe2/caffe2/pull/1330
Reviewed By: dzhulgakov
Differential Revision: D6039560
Pulled By: bddppq
fbshipit-source-id: e92c30b49a9288feeefc4a639522406e97e149e1
Summary:
- hasattr is misbehaving in python 3
- python2: `This is implemented by calling getattr(object, name) and seeing whether it raises an exception or not`
- python3: `This is implemented by calling getattr(object, name) and seeing whether it raises an AttributeError or not.`
Reviewed By: azzolini
Differential Revision: D5973797
fbshipit-source-id: 0b6a413e6ebacd9bdd197c46feab256ab383ace2
Summary: memonger.cc's support for RNNs was broken in D5994548, because it changed a .n argument to .s argument. That made data_parallel_model_test fail (but tests were not run for the blame diff, so this was not noticed).
Reviewed By: kennyhorror
Differential Revision: D6043948
fbshipit-source-id: d29abd6927c519227a28b41c1ef70fb1756904bf
Summary:
I broke dpm.GetLearningRateBlobNames() when adding a new nodename param in optimizer.
Fixing it.
Reviewed By: asaadaldien
Differential Revision: D6043828
fbshipit-source-id: b3a79dd0dfae144187bcb359e2374eab6b32c485
Summary: Adding ability to reuse workspace in Do op and unit tests
Reviewed By: akyrola
Differential Revision: D6037992
fbshipit-source-id: 73d6a14001f667f7ca5e1e02ff39911dc65e4cd1
Summary:
The scripts/build_local.sh script would always build protoc from the
third_party protobuf tree and override the PROTOBUF_PROTOC_EXECUTABLE
CMake variable. This variable is used by the protobuf CMake files, so
it doesn't let us detect whether the protoc was specified by the user
or by the protobuf CMake files (e.g. an existing installation). This
in turn led to a problem where system installed headers would be
picked up while using protoc built from third_party. This only works
if the system installed version matches the version included in the
Caffe2 tree. Therefore, this commit changes the variable to specify a
custom protoc executable to CAFFE2_CUSTOM_PROTOC_EXECUTABLE, and
forces the use of the bundled libprotobuf when it is specified.
The result is that we now EITHER specify a custom protoc (as required
for cross-compilation where protoc must be compiled for the host and
libprotobuf for the target architecture) and use libprotobuf from the
Caffe2 tree, OR use system protobuf.
If system protobuf cannot be found, we fall back to building protoc
and libprotobuf in tree and packaging it as part of the Caffe2 build
artifacts.
Closes https://github.com/caffe2/caffe2/pull/1328
Differential Revision: D6032836
Pulled By: pietern
fbshipit-source-id: b75f8dd88412f02c947dc81ca43f7b2788da51e5
Summary:
Optionally return a blob of shape [batch size, max length] that is
false only in locations where the output tensor was padded.
One can separately convert lengths to segment ids and cast, but
this is more convenient, and possibly more efficient.
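For reference, the mask being described can be sketched in a few lines of numpy (illustrative, not the operator's implementation):
import numpy as np
lengths = np.array([3, 1, 2])
max_len = lengths.max()
mask = np.arange(max_len)[None, :] < lengths[:, None]
# mask[i, j] is True for real entries of row i and False where that row was padded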
Differential Revision: D6006073
fbshipit-source-id: af6c4ea31972566e7d059dcd3fdd8afba97a88e9
Summary:
I had a 30 sec timeout in the RNN executor to catch deadlock bugs, but it looks like people are occasionally bumping into it in the course of normal business -- perhaps when the CPU is heavily used, the threads don't get enough time and hit the timeout.
Removed the timeout but retained the warning logging.
Reviewed By: salexspb
Differential Revision: D6001960
fbshipit-source-id: 5b2293359ee68c1c24f0d9e0406d88391e531280
Summary:
Im2colNd GPU version was not correctly implemented due to 1) the lack of unit test 2) it is actually NOT used by any use case.
A little more background: We are working implementing a conv-deconv 3D operator, which takes 3D volume data (e.g. video) as input, do conv in spatial domain to reduce resolution and do deconv (a.k.a conv transpose) in temporal domain. We first implement a conv transpose 3D op in D6035108, and spot the buggy gpu implementation.
Reviewed By: asaadaldien
Differential Revision: D6035081
fbshipit-source-id: b76dea2e44bcb73d202441bb246249c4481973e1
* Fix the broadcast in Addmm's symbolic
* fix the non-matching dimension cases
* Add exception for non-supported case, remove onnx test cases (moved to onnx-pytorch repo)
* remove the test_onnx.py in run_test.sh
* lint the code
Summary:
This way, we can choose to include a file and the containing reporter is registered in the ObserverConfig. We can have different targets with different reporters without exposing the dependency to all clients.
Closes https://github.com/caffe2/caffe2/pull/1320
Reviewed By: bwasti
Differential Revision: D6024096
Pulled By: sf-wind
fbshipit-source-id: c6eabd7f9ca51b88ea4b268612355ca60809c0a2
Summary:
Since this is only a duplicate of CMAKE_CXX_FLAGS we should simplify the set of options.
Closes https://github.com/caffe2/caffe2/pull/1327
Differential Revision: D6031544
Pulled By: Yangqing
fbshipit-source-id: 5c610a70118089b4d96be30ab028ef1d5efdb019
Summary: Before this diff RNNOp was using TextFormat for representing steps. This diff is changing RNNOp to prefer NetDef argument instead. To be backward compatible it supports TextFormat for existing models, though we can compile RNNs without TextFormat as well.
Reviewed By: salexspb
Differential Revision: D5949330
fbshipit-source-id: 9336a8f5ccf30ad8d8e3a7067b9437e1704b1c9f
Summary: We have to use copy constructor in Concat when copying non-primitive types
Reviewed By: Yangqing
Differential Revision: D6002883
fbshipit-source-id: 0aebc955079975bb6423291589ed09ce0660acf3
Summary: Use only MLP model and re-enable test
Reviewed By: bddppq, Yangqing
Differential Revision: D6013471
fbshipit-source-id: 0cb4a9346c62a739ee6259832181f71e60eef311
Summary:
In the past we called our libraries libCaffe2_CPU.so and libCaffe2_GPU.so, which don't really match the usual Linux .so library naming conventions. This diff changes them to libcaffe2.so (old Caffe2_CPU) and libcaffe2_gpu.so (old Caffe2_GPU).
This might affect existing build scripts that explicitly use Caffe2_CPU and Caffe2_GPU: what do you guys think? pietern bwasti slayton58
Closes https://github.com/caffe2/caffe2/pull/1300
Differential Revision: D6025973
Pulled By: Yangqing
fbshipit-source-id: 6243de4e7af8924f737bb74f3936015f4c91fa26
Summary:
TSIA - this would allow us to auto-sync the up to date version with intel's repo.
Closes https://github.com/caffe2/caffe2/pull/1319
Reviewed By: pietern
Differential Revision: D6023739
Pulled By: Yangqing
fbshipit-source-id: 79bd91aa3a193c266acccdeb682519a49e028bae
Summary: observer framework can now be used in python + a small writeup of how to use it
Reviewed By: salexspb
Differential Revision: D5905002
fbshipit-source-id: e40ec24a55e08fb73beea9b4f3b68e71fc66ffb1
Summary:
parallel_workers supports calling a custom function "init_fun", passed in as an argument to init_workers, when WorkerCoordinators are started.
Adding an analogous argument "shutdown_fun" which gets passed in to init_workers, and gets called when a WorkerCoordinator is stopped.
This allows users of the parallel_workers to add custom cleanup logic before the workers are stopped.
Reviewed By: akyrola
Differential Revision: D6020788
fbshipit-source-id: 1e1d8536a304a35fc9553407727da36446c668a3
A few notes about the implementation:
- Need to plumb 'devices' through to the 'fork_rng' calls. You definitely
want these; it makes verify run A LOT faster
- New keyword argument for compiled model execution, '_force_trace', which
forces us to retrace a model.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: it failed for the case when the `prod_prediction` is used as teacher label, which is double, instead of float.
Reviewed By: kittipatv
Differential Revision: D6018163
fbshipit-source-id: cd93fd46996e07c7f762eedbeb67331a4665d4c4
Summary:
Also fixes a dependency bug in the cmake file for the ATen Op.
Closes https://github.com/caffe2/caffe2/pull/1309
Differential Revision: D6017166
Pulled By: zdevito
fbshipit-source-id: 3f4d18772f9179367927d4e7a52e51a4580342e9
Summary: The layer should also apply to evaluation as it's needed for feature importance run.
Reviewed By: xianjiec
Differential Revision: D6016125
fbshipit-source-id: e1db1a2eb3d45515e3cdc71b4badaaf738a4afd8
Summary: A single negative index can crash the job today. We want to skip a few of them but not a lot. If we skip too many then we will force the job to crash.
Reviewed By: kennyhorror
Differential Revision: D6003461
fbshipit-source-id: 7881ed6c2cfa78c7bda90c7aa01e81ca00fd08a6
Summary: This prints the inner net of 'Do' op, for example.
Reviewed By: akyrola
Differential Revision: D6007278
fbshipit-source-id: 459583fe13191b0449982efb7be733c9c01ecf76
Summary:
RNNOp has been using TextFormat for representing nets. This has already caused
some incompatibilities and also pulls huge dependencies into RNN on mobile. This
diff adds support for using a NetDef arg instead and adds support for
compiling only this version.
Reviewed By: salexspb
Differential Revision: D5994548
fbshipit-source-id: 6c4ded97b80d7a57ad5a013b79ae917aac777c7d
Summary: 1. iteration and LR must be node-name specific in optimizer
Reviewed By: azzolini
Differential Revision: D6001124
fbshipit-source-id: 0fa53fb3347e89401f62125865166356ac56796b
Summary:
The Caffe2 benchmarking framework can now compare the output of a model with some golden output. In order to do that, and to reduce the dependency of the benchmarking framework on Caffe2, the output is dumped in text format without any schema.
The output is read in by the benchmarking framework, which performs the comparison.
Closes https://github.com/caffe2/caffe2/pull/1301
Reviewed By: bwasti
Differential Revision: D5992836
Pulled By: sf-wind
fbshipit-source-id: f6b403103949f4b9880c8372bbdc36966475a387
Summary: Added TensorMap input for run function in predictor.cc
Reviewed By: bwasti
Differential Revision: D5847103
fbshipit-source-id: cd9755a0491b50adc35177164ffe7a50e73ff80f
Summary:
Input is a matrix tensor. Its first dimension is the batch
size. For each column, bucketize it based on the boundary values and then do
one hot encoding. The `lengths` specifies the number of boundary values for each
column. The final number of buckets is this number plus 1. This would also be
the expanded feature size. `boundaries` specifies all the boundary values.
Note that each bucket is right-inclusive. That is, given boundary values
[b1, b2, b3], the buckets are defined as (-inf, b1], (b1, b2], (b2, b3], (b3, inf).
For example
If data = [[2, 3], [4, 1], [2, 5]], lengths = [2, 3],
and boundaries = [0.1, 2.5, 1, 3.1, 4.5], then
output = [[0, 1, 0, 0, 1, 0, 0], [0, 0, 1, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1]]
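A small NumPy sketch that reproduces the example above; it follows the described semantics (right-inclusive buckets), not the actual operator code.
```python
import numpy as np

def bucketize_one_hot(data, lengths, boundaries):
    data = np.asarray(data, dtype=float)
    out_cols = sum(l + 1 for l in lengths)              # each column expands to len + 1 buckets
    out = np.zeros((data.shape[0], out_cols), dtype=int)
    col_start, b_start = 0, 0
    for col, length in enumerate(lengths):
        bounds = boundaries[b_start:b_start + length]
        for row, v in enumerate(data[:, col]):
            bucket = sum(v > b for b in bounds)          # right-inclusive buckets
            out[row, col_start + bucket] = 1
        col_start += length + 1
        b_start += length
    return out

print(bucketize_one_hot([[2, 3], [4, 1], [2, 5]], [2, 3], [0.1, 2.5, 1, 3.1, 4.5]))
# [[0 1 0 0 1 0 0]
#  [0 0 1 1 0 0 0]
#  [0 1 0 0 0 0 1]]
```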
Reviewed By: xianjiec
Differential Revision: D5976030
fbshipit-source-id: fd746c20b19bcdf5f769451d804c219ad6463f28
* Improve Declarations.yaml:
- translate defaults to C++ values
- include names of returned values
- mark keyword-only arguments
* Add comment to translate_default
Summary:
This is a brief introduction to what this op is doing. In the multi-label case,
i.e., each example has more than one label, we want to find out which examples
have values for each label. That is, given a sparse representation in
len = (2,3), ind = (1, 2, 0, 1, 2), val = (10, 20, 5, 8, 15), we want to return
example_id_0 = [1], example_id_1 = [0,1], example_id_2 = [0,1],
value_0 = [5], value_1 = [10,8], value_2 = [20,15].
There are two special things here. 1. The size of each output tensor is unknown until runtime;
2. The ordering in each output tensor should be preserved, e.g., example_id_1 = [0,1] instead of [1,0].
What I am doing now is to get the output size and an offset map (see code) on CPU and then
launch a kernel to take care of the rest. This requires an O(N) copy, which is really not ideal.
Previously I had an implementation that computes the output size on GPU, but when filling values into
the output tensors it is hard to make sure the ordering is preserved unless I sort afterwards.
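A plain-Python sketch of the described transposition from per-example sparse (lengths, indices, values) to per-label (example_ids, values); illustrative only, not the CUDA implementation discussed above.
```python
def transpose_sparse(lengths, indices, values, num_labels):
    example_ids = [[] for _ in range(num_labels)]
    label_values = [[] for _ in range(num_labels)]
    offset = 0
    for example, length in enumerate(lengths):
        for k in range(offset, offset + length):
            example_ids[indices[k]].append(example)    # ordering by example is preserved
            label_values[indices[k]].append(values[k])
        offset += length
    return example_ids, label_values

ids, vals = transpose_sparse((2, 3), (1, 2, 0, 1, 2), (10, 20, 5, 8, 15), 3)
print(ids)   # [[1], [0, 1], [0, 1]]
print(vals)  # [[5], [10, 8], [20, 15]]
```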
Reviewed By: azzolini
Differential Revision: D5825104
fbshipit-source-id: 4d987cef0247746998ec1d2acc47fc5ed2302722
Summary:
The build_local.sh script is currently single-threaded, which is really slow. Use the same mechanism as in build_android.sh to parallelize the build.
Closes https://github.com/caffe2/caffe2/pull/1282
Differential Revision: D5992231
Pulled By: sf-wind
fbshipit-source-id: 01ba06b6efcb0f535f974a2dfffbae9ba385d27d
* Add reduce keyword to MSECriterion API
* Move gradOutput usage from py to backend
* Implement reduce keyword for THNN MSECriterion
* Implement reduce keyword for THCUNN MSECriterion
* Implement reduce keyword for MSE double backwards
* Tests for MSECriterion with reduce keyword
* Documentation for reduce for MSELoss
* Make legacy nn work with reduce keyword by ignoring it
* Apply linter suggestions
* Address comments (small changes)
* Revert "Tests for MSECriterion with reduce keyword"
This reverts commit 1c0be0defa49d336d023d7d9795db4037c92b6fe.
* Undo changes to legacy nn tests
* Reuse module test for MSELoss by creating a wrapper class for MSELoss
* Address comments: refactor MSECriterion.cu to be nicer
* Fix lint & build errors
Summary:
Separate class definition into header file
Remove uniform buffer initialization in the constructor because it's not necessary
Separate tiling and batching code
Reviewed By: jerryzh168
Differential Revision: D5960502
fbshipit-source-id: 5e3bce5192ce6dc69868be1722f490f690d87076
Summary:
Added an exported statistic that helps in computing
standard deviation. It uses an offset-based mode of computation
to avoid a common numerical pitfall.
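A short sketch of the offset-based computation, assuming the pitfall in question is the catastrophic cancellation of the naive E[x^2] - E[x]^2 formula (an assumption, not stated in the summary).
```python
import math

def std_with_offset(xs):
    k = xs[0]                                    # any value near the data works as the offset
    n = len(xs)
    s = sum(x - k for x in xs)
    s2 = sum((x - k) ** 2 for x in xs)
    return math.sqrt((s2 - s * s / n) / n)

print(std_with_offset([1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]))   # ~4.74
```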
Reviewed By: azzolini
Differential Revision: D5977811
fbshipit-source-id: e9f3b99a952e10fb3e3eb18a29b5bdca92f82f4c
Summary:
Latest version of Gloo takes care of MPI_Init/MPI_Finalize for us, so
this commit removes handling that from caffe2/contrib/gloo. It also
imports CMake NCCL module changes from Gloo to stay consistent and
allow setting NCCL_INCLUDE_DIR and NCCL_LIB_DIR separately.
Closes https://github.com/caffe2/caffe2/pull/1295
Reviewed By: dzhulgakov
Differential Revision: D5979364
Pulled By: pietern
fbshipit-source-id: 794b00b0a445317c30a13cc8f0f4dc38e590cc77
There is a bit of nuance to this function. If one blindly charges in
and initializes all GPUs, it is going to take a long time. 20sec for
8 GPUs on my dev machine. But to a user, it is non-obvious that fork_rng
is going to hit all the GPUs by default (which it does by default for
safety reasons.) So there is a nice warning when we notice we're
hitting more than one GPU. There is a bit of extra generality
which is going to be used by torch.jit in a subsequent commit.
The motivation is that I wanted to add some more general purpose
utility random functions, but not gunk up torch/__init__.py.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
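An illustrative sketch of the warn-on-many-devices behaviour described above; the names and signature below are assumptions, not the actual utility added here.
```python
import contextlib
import warnings

@contextlib.contextmanager
def fork_rng_sketch(devices, get_state, set_state):
    # Warn when we are about to touch many devices, since saving/restoring
    # the RNG state may force each of them to initialize.
    if len(devices) > 1:
        warnings.warn("forking RNG state on %d devices; pass an explicit devices "
                      "list to restrict this if it is slow" % len(devices))
    saved = {d: get_state(d) for d in devices}
    try:
        yield
    finally:
        for d, s in saved.items():
            set_state(d, s)

states = {0: 1, 1: 2}                    # toy stand-in for per-device RNG state
with fork_rng_sketch([0, 1], states.get, states.__setitem__):
    states[0] = 99                       # perturb state inside the block
print(states)                            # {0: 1, 1: 2} -- restored on exit
```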
Summary:
This would allow one to debug with asan. Known problems:
- only works with the new -fsanitize=address option.
- not tested on clang.
It's turned off in default so existing builds won't be affected.
Closes https://github.com/caffe2/caffe2/pull/1299
Differential Revision: D5987034
Pulled By: Yangqing
fbshipit-source-id: de29cd3b84edaed5db73e33f8f759c5c3271b5b7
Summary: Given a pair (init_net, train_net) where ops in sparse layers are tagged, this diff detects the components and rename the `node_name` (e.g. tag) to reflect the component name.
Reviewed By: azzolini
Differential Revision: D5948222
fbshipit-source-id: aeda9cfc88bb64922bf7a9942b969e3c5066718a
Summary:
Implement a framework to benchmark the Caffe2 inferencing time. It only contains the observer collecting the delay information for running the net and the operator. The driver of the benchmark is in a separate repository.
It does not interfere with the rest of the Caffe2.
Closes https://github.com/caffe2/caffe2/pull/1263
Reviewed By: bwasti
Differential Revision: D5956861
Pulled By: sf-wind
fbshipit-source-id: ba4f0226066f55d333b27d472e09137d7272d449
Summary:
In the instance norm implementation, the lambda function was causing a heap overflow,
so we move it explicitly into the function body itself.
accept2ship
Reviewed By: pietern
Differential Revision: D5981662
fbshipit-source-id: 6901c9cd738de048e3d0308a0a4c52f9c37e524a
Summary:
This is the first step on DPER side to use net transformation step (`parallelize_net`).
So far, it tags the sparse parameters (in init_net and train_net) once distributed trainer nets are built.
Next step is to merge the part that creates distributed trainer nets (`create_distributed_trainer_nets`) into the part that creates single-trainer, multi-reader nets (`create_distributed_reader_nets`). This step should get rid of parts of `MixtureStrategyModelBuilder`.
Reviewed By: azzolini
Differential Revision: D5902733
fbshipit-source-id: 85fbddbb6c2704badd82b237f1dd2c7c5790e43a
Summary: The cudnn version of the DropoutOp was taking a significant (and unwarranted) amount of time in our RNN training. Further investigation showed that setting the cudnn dropout descriptors was an extremely expensive operation (https://pxl.cl/99nT), much more so than the dropout operation itself. This diff adds to the DropoutCell the option to disable cudnn. The non-cudnn version uses a raw curand call that elides all of the expensive descriptor setting.
Reviewed By: jmp84, akyrola
Differential Revision: D5972022
fbshipit-source-id: 6325ec5d6569f8b94d776cbb2554cc8ddb28f699
Summary: Move common operation out of loop.
Reviewed By: dzhulgakov
Differential Revision: D5962894
fbshipit-source-id: e4f8a5406c870958215cbc1fd366fa87bc381471
Summary: adding an operator with behavior similar to fused GatherRanges and Split.
Reviewed By: kennyhorror
Differential Revision: D5961761
fbshipit-source-id: 616d4668b8901256418004def90d91a0b2041620
Summary:
Added support for batching to SequenceMaskOp.
Let b be the batch dim and k be the axis dim. (We enforce that b < k.) Write the dimensions of the input tensor as [a_1, ..., a_b, ..., a_k, ...]. We first collapse our tensor down to 3D, with dimensions [P, Q, D], where P = a_1 * ... * a_b, Q = a_{b+1} * ... * a_{k-1}, and D = a_k * a_{k+1} * ... * a_n. Then we mask each slice [i, :, :] of this 3D tensor (note that each slice is a Q x D tensor of rank 2).
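A NumPy sketch of the collapse step described above (an assumed helper for illustration, not the SequenceMask operator itself), using the same 1-indexed b and k convention.
```python
import numpy as np

def collapse_for_masking(x, b, k):
    # Collapse x to [P, Q, D]; b and k are 1-indexed dims with b < k.
    shape = x.shape
    P = int(np.prod(shape[:b]))         # a_1 * ... * a_b
    Q = int(np.prod(shape[b:k - 1]))    # a_{b+1} * ... * a_{k-1}
    D = int(np.prod(shape[k - 1:]))     # a_k * ... * a_n
    return x.reshape(P, Q, D)

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
y = collapse_for_masking(x, b=1, k=3)
print(y.shape)   # (2, 3, 20); each slice y[i, :, :] is then masked independently
```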
Reviewed By: jamesr66a
Differential Revision: D5733382
fbshipit-source-id: e7a314d9fe6e6691a75112edbee8ba6e8ea8e396
* skeleton commit for building and linking nnpack library in PyTorch
* first stab at conv forward binding + integration
* bind NNPACK gradient kernels
* move nnpack forward, input gradient calls deeper
* nnpack conv api mimics nn
* fix symbol error; use memory across calls
* clean up warnings, add shape checking, thread safety, configurable thread specification
* add batch size threshold, also bind for single-element batch for the future
3D modules apply padding on all three sides. "Both" doesn't make sense here.
I used the wording of the AvgPool3d docstring, where it was already correct.
Summary:
Useful for figuring out which version people built with. We can just ask for the --caffe2_version gflag or get core.build_options from Python.
Also adds CMAKE_INSTALL_RPATH_USE_LINK_PATH - without it, the build wasn't working on my Mac. How should it be tested?
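A quick usage note based on the summary (the module path is taken from the description above):
```python
# Query the build information from Python, as described above.
from caffe2.python import core

print(core.build_options)
# From the command line, the same information is exposed via the --caffe2_version gflag.
```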
Closes https://github.com/caffe2/caffe2/pull/1271
Reviewed By: bddppq
Differential Revision: D5940750
Pulled By: dzhulgakov
fbshipit-source-id: 45b4c94f67e79346a10a65b34f40fd258295dad1
Summary: This is the continuation of T20872698 Implement the gradient operator for element-wise Logit
Reviewed By: asaadaldien
Differential Revision: D5969487
fbshipit-source-id: c9bb4222529f9fd9085aa9048b90eb70a63f41f4
Summary:
Only works for len(offset) == 1 for now.
Also, Slice Op only supports slicing in one dimension,
can we extend it to support slicing multiple dimensions?
Reviewed By: bwasti
Differential Revision: D5967476
fbshipit-source-id: 6cf9ff510e752ddb3bc9673d47f6a577ae9ccc79
Summary: Clean up the metal remnants in BUCK now that the metal code has been removed
Reviewed By: bwasti
Differential Revision: D5966095
fbshipit-source-id: 6b022624fe91a6728549d93d2954328c6b4e059e
* Generate torch.cat autograd via ATen.
Most of the change is around supporting generation of:
1) TensorList arguments
2) Arguments to "size", "sizes", i.e. "sizes(dim)"
The alpha/beta naming in addmm was flipped; this commit fixes that
problem. It also fixes the ONNX export of alpha/beta parameters.
Finally, it supports executing matmul in the JIT.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: This diff refactors the parameter initialization logic from model manipulation to layers
Reviewed By: azzolini
Differential Revision: D5920225
fbshipit-source-id: 50d230e406bc9ce0b00bdd164802c504cf32ea46
Summary: Include information of the engine for Caffe2 operators.
Reviewed By: salexspb
Differential Revision: D5876323
fbshipit-source-id: 3b1837ccff098109bdfb0865a4fa3f509496ffdb
Summary: only changes needing review are in proto_utils.cc and caffe2.proto
Reviewed By: jerryzh168
Differential Revision: D5956743
fbshipit-source-id: e03fffaf5bc8413f2320c20a89a421f1a69b2870
* commit '9f4accd5bb99900dfda9ffab110aeb7a4534d629':
Make all dim arguments int64_t
Converting dlpack tensor to aten tensor
adding a simple class for converting atensor to dlTensor
Test stub for dlconvertor
adding dlpack header
Fix build failure in MSVC
Mark all (non-static) Type methods as const.
Summary:
Executor benchmarks to measure QPS for different models (sparse nn hogwild and
dataparallel, resnet50 dataparallel)
Reviewed By: dzhulgakov
Differential Revision: D5950770
fbshipit-source-id: 9aa8e0480468a55a6a97b10589d785c682fae01e
Summary: Adjust test thresholds and number of examples
Reviewed By: salexspb
Differential Revision: D5945588
fbshipit-source-id: 7aecb8c642d8775f51dd3c296a28f1faf7ae0c81
* Fix detection of nccl.h when libnccl.so is in /usr/lib/x86_64-linux-gnu and similar paths
* full support for independent NCCL_LIB_DIR and NCCL_INCLUDE_DIR
* lint fix
* add back CUDA_HOME
Summary:
Executor test that checks on different models that model params are the same
when using a given executor and simple net
Reviewed By: akyrola
Differential Revision: D5908769
fbshipit-source-id: b6f5a2cf89c5c67b68e8b9be3264f38d5740d897
Summary:
Problem:
Without -DBLAS=MKL, conda-build won't include MKL library into Caffe2 build. And the BLAS performance is bad on CPU.
Solution:
Explicitly add the flag. Add mkl and mkl-include as dependencies.
ezyang Yangqing
Closes https://github.com/caffe2/caffe2/pull/1264
Reviewed By: bddppq
Differential Revision: D5919192
Pulled By: houseroad
fbshipit-source-id: bb51e4fc4015212694404180a610e06ec8ddb424
torch.jit now contains two user-facing functions: compile and trace
(corresponding to what was previously trace/traced and record_trace).
The non-curried versions of these functions have been eliminated, so
that there is only one function in the API (we *must* have the
curried versions, since these enable their use as decorators). There is
detailed usage documentation in the docblocks for these methods.
This comes with a complete rewrite of the internals of torch.jit, in the process
fixing a number of bugs. Key points of the new implementation:
- compile and trace both always return a Module representing the underlying
function/module wrapped with compilation/tracing. This makes handling
of the function/module cases more uniform, as we can think of the function
case as creating an on-the-fly module with the parameters explicitly
specified by the user. For technical reasons, we now *require* any parameters
in the function case to be honest-to-goodness Parameters (gory details:
you can't register a Variable as a Parameter to a Module, but you also can't
create a Parameter from a Variable while sharing the same underlying
identity.)
- Flattening and unflattening is done a lot more uniformly. We now have
a _flatten and _unflatten function which are inverses of each other:
_flatten always returns both the flat, tuple of Variables, *as well as*
the "proto" (now referred in the code as the "struct") from which we
can unflatten the variables. Low level functions like 'raw_trace'
always work with the flattened inputs/outputs, which keeps their logic
simple. (A toy sketch of the flatten/unflatten pairing appears after this note.)
- JIT trace keying now also includes the "struct" of the input arguments.
This is a step towards accepting non-Variable arguments in functions,
although flatten/unflatten don't currently support it.
- TraceForKey (previously TraceInfo) has had its API reworked to have
less degrees of freedom when you are interacting with it.
TODO: Verify, timing, and trace dumping have been temporarily excised. I
plan on adding them back.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
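Re the _flatten/_unflatten point above: a simplified toy sketch (hypothetical helpers, not the actual torch.jit internals) of a flatten function returning both the flat leaves and a "struct" that its inverse uses to rebuild the nesting.
```python
def _flatten(obj):
    if isinstance(obj, (list, tuple)):
        leaves, struct = [], []
        for item in obj:
            sub_leaves, sub_struct = _flatten(item)
            leaves.extend(sub_leaves)
            struct.append(sub_struct)
        return tuple(leaves), (type(obj), struct)
    return (obj,), None                      # a leaf (e.g. a Variable)

def _unflatten(leaves, struct):
    def build(struct, it):
        if struct is None:
            return next(it)
        typ, children = struct
        return typ(build(c, it) for c in children)
    return build(struct, iter(leaves))

leaves, struct = _flatten((1, [2, 3], (4,)))
print(leaves)                                # (1, 2, 3, 4)
print(_unflatten(leaves, struct))            # (1, [2, 3], (4,))
```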
Summary:
This is a prototype for a joint intents + slots modeling workflow; it has the following:
1- New data readers and data processors to process joint labels in parallel
2- New JointNN model
3- New Fblearner workflow (jointnn) for joint modeling experimentations
This is still work in progress, sending the diff to start the discussion about the interface and what we need to support in our joint modeling efforts.
P.S. The number of lines in this diff is multiplied by 3 since caffe2 is mirrored in both fbandroid and fbobjc. I will highlight the most important parts so that people are not confused.
Differential Revision: D5725243
fbshipit-source-id: ecc5322f937ad0fddaf200a9e090b3573a69f994
Summary: Fixed Caffe2Enforce in size_to_dim() so that it works even if k is same as the number of dimensions in the tensor.
Reviewed By: salexspb
Differential Revision: D5893264
fbshipit-source-id: 525ea263f5e21e197c7010e1c66501355b8027c8
Summary:
This diff implements deformable convolution operator. The idea behind it is that instead of using a fixed NxM kernel, we associate a set of learnable offsets (dx, dy) with each element of the kernel, and use bilinear interpolation to estimate weights in between the integer indices. For background see paper https://arxiv.org/abs/1703.06211 and mxnet implementation https://github.com/msracver/Deformable-ConvNets/tree/master/rfcn/operator_cxx
To simplify code review of the new files, the feature is stacked into 2 diffs. First diff duplicates the core convolution operator into a separate set of files prefixed with deform_. It also provides documentation on the operator but nothing else. Second diff contains the actual changes that make deformable convolution possible. Therefore, I recommend focusing your code review on changes between diffs 1 and 2.
Current limitations of the operator:
1. Only CUDA is supported. CPU version is not implemented.
2. Only NCHW layout is supported.
3. Only 2d convolution is supported.
CUDA code is ported from mxnet implementation with minimal changes.
See also inline comments in code for tricky parts.
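A tiny NumPy sketch of the bilinear sampling used for fractional offsets, for intuition only; the actual operator is the CUDA port described above.
```python
import numpy as np

def bilinear(img, y, x):
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0
    def at(i, j):
        if 0 <= i < img.shape[0] and 0 <= j < img.shape[1]:
            return img[i, j]
        return 0.0                         # out-of-bounds samples contribute zero
    return ((1 - wy) * (1 - wx) * at(y0, x0) + (1 - wy) * wx * at(y0, x1)
            + wy * (1 - wx) * at(y1, x0) + wy * wx * at(y1, x1))

img = np.arange(9, dtype=float).reshape(3, 3)
print(bilinear(img, 1.25, 0.5))            # samples between integer positions -> 4.25
```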
Reviewed By: akyrola
Differential Revision: D5702983
fbshipit-source-id: 4d1bf2c6c73135e6a70dbe87037b38915f4453f9
Summary:
D5772847 is breaking real time style transfer on android and conv unit tests on iPhone 7 upgraded to iOS 11.
The temporary fix in D5908415 only fixes android. iPhone 7 is still crashing.
I think these two diffs should be backed out before D5772847 is fully debugged
Reviewed By: fricc33
Differential Revision: D5913834
fbshipit-source-id: b8072c59c83adfed8a0b0ab0f42c39bc4398c7a0
Summary: Implementation of ReduceFront/Back/Max/Gradient for CPU and CUDA.
Reviewed By: asaadaldien
Differential Revision: D5905402
fbshipit-source-id: 6967ce41aa95ee5ea7a90065430892e81a6da477
Summary: Implemented logit gradient with eps as arg. Add the unit test for it and explored the optimal parameter to run the test.
Reviewed By: asaadaldien
Differential Revision: D5910655
fbshipit-source-id: 44898b784a57c7ad45519b202b1eaf95c1c4d460
This adds some generated autograd functions implemented in C++, which
are generated from derivatives.yaml. It also generates Python bindings
for the Variable methods. The generated files are:
Functions.cpp/h: subclasses of torch::autograd::Function
VariableType.cpp/h: The at::Type for autograd Variables
python_variable_methods.cpp: Python bindings to torch::autograd::Variable
python_variable_methods_dispatch.h: wrapper which releases GIL and sets the
CUDA device
python_functions.cpp/h: exposes generated autograd functions as Python
objects
The generated functions are mostly shadowed by the definitions in
variable.py. We'll remove the Python implementations in favor of the
generated C++ implementations in a subsequent commit.
Summary: Implemented a version of SparseAdagrad that only keeps track of an average sum of squared gradients term for each row of the parameter tensor, rather than a sum of squared gradients term for each individual parameter.
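A toy NumPy sketch of the row-wise idea (one accumulator value per embedding row, the mean of squared gradients across the row); illustrative only, not the operator implementation.
```python
import numpy as np

def rowwise_sparse_adagrad(param, h_row, indices, grads, lr=0.1, eps=1e-8):
    # h_row holds a single scalar per parameter row instead of one value per element.
    for idx, g in zip(indices, grads):
        h_row[idx] += np.mean(g * g)
        param[idx] -= lr * g / (np.sqrt(h_row[idx]) + eps)
    return param, h_row

param = np.zeros((4, 3))
h_row = np.zeros(4)
param, h_row = rowwise_sparse_adagrad(param, h_row, indices=[1, 3], grads=np.ones((2, 3)))
print(param)
print(h_row)
```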
Differential Revision: D5881918
fbshipit-source-id: bd96ccf25554b457baaaca9309fc8048adbb37f7
Summary: Equivalent to numpy.sign for CPU and CUDA.
Reviewed By: dzhulgakov
Differential Revision: D5906446
fbshipit-source-id: 389f994bccbb87a62df2c4aaacc327f9a6223cbd
Summary:
This brings proper versioning in Caffe2: instead of manual version macros, this puts the version information in CMake (replacing the TODO bwasti line) and uses macros.h.in to then generate the version in the C++ header.
A few misc updates:
- Removed the mac os rpath, verified on local macbook that it is no longer needed.
- Misc updates for caffe2 ready:
- Mapped cmake/Cuda.cmake with gloo's setting.
- upstreamed third_party/nccl so it builds with cuda 9.
- Separated the Caffe2 cpu dependencies and cuda dependencies
- now libCaffe2_CPU.so do not depend on any cuda libs.
- caffe2 python extensions now depend on cpu and gpu separately too.
- Reduced the number of unused functions in Utils.cmake
Closes https://github.com/caffe2/caffe2/pull/1256
Reviewed By: dzhulgakov
Differential Revision: D5899210
Pulled By: Yangqing
fbshipit-source-id: 36366e47366c3258374d646cf410b5f49f95767b
Summary:
The problem:
Building caffe2 fails because the installed directory contains "anaconda".
The cause:
Compiling Gloo will generate a new config.h file in the binary folder.
If we put the original config.h in front, the compiler will complain "Expected GLOO_USE_CUDA to be defined".
~~~Switching the positions of the include folders can solve the problem.~~~
Function caffe2_include_directories in cmake/Utils.cmake is a little bit hacky. If the directory contains "anaconda", it will append the new include directory after the existing include path. Otherwise it will insert the directory before the path. So in the first case, the directories are inserted in order, and in the latter one, they are inserted in reverse.
The solution:
See the commit.
pietern #1121
Closes https://github.com/caffe2/caffe2/pull/1258
Reviewed By: Yangqing
Differential Revision: D5907167
Pulled By: houseroad
fbshipit-source-id: 2cb3916e7e0313ebc3be3d1666bfa14bbf479607
Summary:
This operator allows the use of Torch's underlying TH libraries (TH, THC, THNN, and THCUNN)
through the ATen tensor library. Use of the operator is described in the README.
The operator itself is generated from ATen's Declarations.yaml file which describes its public API.
Closes https://github.com/caffe2/caffe2/pull/1235
Reviewed By: dzhulgakov
Differential Revision: D5876944
Pulled By: zdevito
fbshipit-source-id: b558e8563a5e82a0e6278705a4a359bd7df4e70a
Summary: Can be used to gather outputs of a sharded "Gather", or for the SparseLengthsSumGradient when we need the gradient on values.
Reviewed By: akyrola
Differential Revision: D5800901
fbshipit-source-id: 90835755d6d15be13fb0f538cfade980cf4a1cd2
Summary: If a blob is copy from device A to device B in the init_net, and then is used as an external_input in the train_net, we want the train_net to correctly use the blob already on device B instead of copying it over and over again.
Reviewed By: akyrola
Differential Revision: D5800870
fbshipit-source-id: d93f44bba80e4ed70eb03183d552496b54a966b5
Summary:
Exposed by UBSAN:
```lang=bash
caffe2/caffe2/core/qtensor.h:61:40: runtime error: load of value 190, which is not a valid value for type 'bool'
#0 0x7fb4fc09c289 in caffe2::QTensor<caffe2::CPUContext>::Resize(std::vector<int, std::allocator<int> >) caffe2/caffe2/core/qtensor.h:61
#1 0x7fb4fc090403 in caffe2::QuantizedFullyConnectedOp<float, caffe2::CPUContext, caffe2::DefaultEngine>::RunOnDevice() caffe2/caffe2/fb/operators/quantized_fully_connected_op.h:93
#2 0x7fb4fc08d5ee in caffe2::Operator<caffe2::CPUContext>::Run(int) caffe2/caffe2/core/operator.h:306
#3 0x426d8a in caffe2::QFCTest(float, float, float, int, int, int, int) caffe2/caffe2/fb/operators/quantized_fully_connected_op_test.cc:78
#4 0x4295f6 in caffe2::QuantizedFullyConnectedTest_Test_Test::TestBody() caffe2/caffe2/fb/operators/quantized_fully_connected_op_test.cc:110
#5 0x7fb4eee3b6a1 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/engshare/third-party2/googletest/master/src/googletest/googletest/src/gtest.cc:2458
#6 0x7fb4eee2cbe1 in testing::Test::Run() /home/engshare/third-party2/googletest/master/src/googletest/googletest/src/gtest.cc:2475
#7 0x7fb4eee2cd27 in testing::TestInfo::Run() /home/engshare/third-party2/googletest/master/src/googletest/googletest/src/gtest.cc:2656
#8 0x7fb4eee2ce34 in testing::TestCase::Run() /home/engshare/third-party2/googletest/master/src/googletest/googletest/src/gtest.cc:2774
#9 0x7fb4eee2eb8b in testing::internal::UnitTestImpl::RunAllTests() /home/engshare/third-party2/googletest/master/src/googletest/googletest/src/gtest.cc:4649
#10 0x7fb4eee2ef3c in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/engshare/third-party2/googletest/master/src/googletest/googletest/src/gtest.cc:2458
#11 0x7fb4eee2ef3c in testing::UnitTest::Run() /home/engshare/third-party2/googletest/master/src/googletest/googletest/src/gtest.cc:4257
#12 0x7fb4fbee2ed0 in RUN_ALL_TESTS() third-party-buck/gcc-5-glibc-2.23/build/googletest/include/gtest/gtest.h:2233
#13 0x7fb4fbee2d60 in main common/gtest/LightMain.cpp:12
#14 0x7fb4e0ef7857 in __libc_start_main /home/engshare/third-party2/glibc/2.23/src/glibc-2.23/csu/../csu/libc-start.c:289
#15 0x424e08 in _start /home/engshare/third-party2/glibc/2.23/src/glibc-2.23/csu/../sysdeps/x86_64/start.S:118
UndefinedBehaviorSanitizer: invalid-bool-load caffe2/caffe2/core/qtensor.h:61:40
```
Reviewed By: yfeldblum
Differential Revision: D5898877
fbshipit-source-id: e32b1732a1946fdafaec67b3fbc072dc93bcd917
Summary:
T22119644 showed that there is a potential illegal memory access in beam search with attention. Upon further inspection, we can see that there are multiple ops that write to the same old shape blob:
{"output0": "model0/attention_decoder/attention_weighted_encoder_context_reshaped", "output1": "state_old_shape_before_choosing_per_hypo", "input0": "model0/attention_decoder/attention_weighted_encoder_context" }},
{"output0": "model0/attention_decoder/hidden_t_external_reshaped", "output1": "state_old_shape_before_choosing_per_hypo", "input0": "model0/attention_decoder/hidden_t_external" }},
{"output0": "model0/decoder/layer0/cell_t_reshaped", "output1": "state_old_shape_before_choosing_per_hypo", "input0": "model0/decoder/layer0/cell_t" }},
This diff de-dupes these outputs
Reviewed By: akyrola
Differential Revision: D5899103
fbshipit-source-id: 8b6f3f113e764dfeb9262f6c442e1124559cd2d8
Summary:
Gloo was incorrectly updated in #1188 to the non-master version, so this brings back gloo to master.
Closes https://github.com/caffe2/caffe2/pull/1253
Differential Revision: D5899017
Pulled By: Yangqing
fbshipit-source-id: bdf6dbbc4402814e5bcf346cb8a610a448c53cef
Summary: We were keeping the offset in an int :(
Reviewed By: kennyhorror
Differential Revision: D5811955
fbshipit-source-id: 7d00833fa0d5847beed44b73ea74fcb5a8e24090
Summary: Previously, the RecurrentNetwork op used for our beam search did not have any of the input blobs listed as data dependencies. This was fine when we were using SimpleNet, since the ops were run in the order in which we added them to the graph, and thus the RecurrentNetwork op was run after all the other ops. However, when switching to DAG, the ops that produce input data for the beam search were being run in parallel with the RecurrentNetwork beam search op, which caused non-deterministic failures based on thread scheduling. This fixes that
Reviewed By: jmp84, jhcross
Differential Revision: D5879622
fbshipit-source-id: b622de1f6a24b2636b191096db92990e0535890c
Summary:
When using reshape, the speed_benchmark always reports an error.
When using resize, the speed_benchmark can run without any issue.
Reviewed By: salexspb
Differential Revision: D5847999
fbshipit-source-id: 1b9899534d514c779d1710008e239124fe3d2377
Summary: Make LastNWindowCollector optionally thread-safe. The main benefit is that the mutex can then be used to lock the buffer later, avoiding the need to copy the data.
Reviewed By: chocjy
Differential Revision: D5858335
fbshipit-source-id: 209b4374544661936af597f741726510355f7d8e
Summary: CheckpointManager already accepts a path_prefix override for init() and load(), but it assumes the same db_type passed in __init__(). This change adds an optional path_type for each call.
Reviewed By: boryiingsu
Differential Revision: D5888152
fbshipit-source-id: 21cd31a62a0188fe0e0b19b43c3b232c2342d0a8
Instead of initializing CUDA immediately and executing these calls,
we wait until CUDA is actually initialized before executing them.
To keep things debuggable, we also keep track of the original
backtrace when these functions are called, so we can inform
users where they actually called the seeding/state functions
(as opposed to the first time they actually initialized the
RNG).
Fixes #2517
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
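A minimal sketch of the "queue the call and remember where it came from" idea (hypothetical helpers, not the actual torch.cuda implementation).
```python
import traceback

_initialized = False
_queued_calls = []                       # (callable, captured backtrace) pairs

def lazy_call(fn):
    # Run fn now if "CUDA" is initialized, otherwise queue it with its call-site trace.
    if _initialized:
        fn()
    else:
        _queued_calls.append((fn, traceback.format_stack()))

def lazy_init():
    global _initialized
    _initialized = True                  # stand-in for the real CUDA initialization
    for fn, stack in _queued_calls:
        try:
            fn()
        except Exception:
            print("queued call failed; it was originally queued at:\n" + "".join(stack))
            raise

lazy_call(lambda: print("seeding RNG"))  # queued, nothing happens yet
lazy_init()                              # "CUDA" initializes; the queued seeding runs now
```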
Summary:
After this, windows should be all green.
Closes https://github.com/caffe2/caffe2/pull/1228
Reviewed By: bwasti
Differential Revision: D5888328
Pulled By: Yangqing
fbshipit-source-id: 98fd39a4424237f2910df69c8609455d7af3ca34
Summary: When num_elements is less than num_samples, a workflow should fail during net construction time. Currently, it fails at run time.
Reviewed By: kittipatv
Differential Revision: D5858085
fbshipit-source-id: e2ab3e59848bca58806eff00adefe7c30e9ad891
Summary:
Basically:
- more generator vs list changes.
- difference in the return type of bellman_ford(), see _get_path. 2.x returns list.
- nx 2 removed nbunch in topological_order, so we will need to manually use lexicographical_topological_sort with an explicit key derived from the source node order.
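To illustrate the last bullet, a small networkx 2.x example; the explicit source-node ordering used as the key here is an assumed stand-in, not the actual key used in the PR.
```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("a", "c"), ("b", "c"), ("c", "d")])
order = {n: i for i, n in enumerate(["b", "a", "c", "d"])}   # desired tie-break order
print(list(nx.lexicographical_topological_sort(g, key=lambda n: order[n])))
# ['b', 'a', 'c', 'd']
```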
Closes https://github.com/caffe2/caffe2/pull/1243
Reviewed By: ajtulloch
Differential Revision: D5883195
Pulled By: Yangqing
fbshipit-source-id: 215d01fdd026d3af1a11ff866bf835e104370e4c
Summary: This is a quick fix for image input op
Reviewed By: bddppq
Differential Revision: D5857147
fbshipit-source-id: 4b5102616fe295c7c21d394391af8030b79de992
Summary:
Here's what's happening:
C++ only guarantees that static initialization is thread safe there: https://fburl.com/40wdmf1q
So TypeNameRegisterer<bool> cannot be called concurrently with TypeNameRegisterer<bool> from another invocation.
But there are no guarantees about different template specializations, as
they declare separate variables. Thus TypeNameRegisterer<int> might
race with TypeNameRegisterer<bool>. And TypeNameRegisterer accesses
the global variable here: https://fburl.com/gv2mhi08
Thanks dzhulgakov for the investigation!
Reviewed By: Yangqing
Differential Revision: D5882913
fbshipit-source-id: 4db1080b11e6351ce8136373e2dfc52980642fbb
Summary:
If kernel sizes were specified via "kernel_w" and "kernel_h", tensor size
inference was incorrect in InferShapesAndTypes(): it was checking for
"helper_w" instead of "kernel_w".
Reviewed By: akyrola
Differential Revision: D5884280
fbshipit-source-id: 430cbedcedadbe3570384e706198a4ddc499504e
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting
Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results are below for old code, sparse_lengths_sum_benchmark.old.par, that uses
code in lengths_reducer_rowwise_8bit_ops.h, and our new code, optimized via code generator,
sparse_lengths_sum_benchmark.new.par. Block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than original code in lengths_reducer_rowwise_8bit_ops.h
as shown below.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to apply the scale and bias factors
2. In addition to embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire cache line of
64 bytes, while only using 8 bytes. A 128-wide uint8 input block only occupies 2 cache lines, and hence
reading a nearly entire extra cache line of useless data adds to bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterised in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insights into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16 and uint8,
into a microbenchmark, where we varied block size while keeping table size constant (256MB):
block_size time(uint8) time(float16) time(float32)
64 0.19 0.09 0.17
128 0.12 0.09 0.17
256 0.70 0.09 0.14
1024 0.50 0.06 0.10
The pattern for block size of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
time to perform embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving
speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher
running past the end of the block.
Reviewed By: kennyhorror
Differential Revision: D5870907
fbshipit-source-id: 445321b96f1b5801ef91f296f6063c35673ee11b
Plus a test for Eval nodes in the IR, since we hadn't actually
covered this case now that some nodes are transparently traceable.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
This fixes the apparent discrepancy (list vs iterator). After this, there are still 3 failures regarding topological sort but that seems a bit involved. Someone shall look deeper.
Closes https://github.com/caffe2/caffe2/pull/1242
Reviewed By: akyrola
Differential Revision: D5881806
Pulled By: Yangqing
fbshipit-source-id: 5a200010724befde2fa8ce1b61a9c1ba42cad46a
- If you operate with TracingState, you MUST check if it is live.
Otherwise you will segfault if it is expired; it is VALID for
tracing states to become expired.
- Tracing states can expire if they request backward tracing
(which the tracer does by default). We don't want this to
happen for exports, which only look at forwards. So make
sure we set the correct num_derivatives.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Print some diagnostic information when accepting new test output.
- If it's the first time you ran an expect test, print out
the output you got so it's easier to decide if you want
to accept it.
- Add infrastructure for expect-testing against exceptions
(I'm going to use this in a later patch).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- If a user accidentally attempts to export a model that is in training mode, the
tracer may perturb the parameters (since modules like batchnorm will update
their parameters.) To prevent this from happening, we temporarily turn
off training mode during the export. Temporariness is
important, since model export should not actually affect the model. (See the sketch at the end of this note.)
- If you have a buggy model which is changing the parameters,
it is much better for us to export the state_dict() *prior*
to executing the model, because that is what we actually
used as the inputs to the trace. The state_dict() afterwards
could be anything.
- kwargs support never worked, so it's been excised.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
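A sketch of the "temporarily leave training mode" idea from the first bullet, written as a generic context manager with a stand-in model class; not the actual exporter code.
```python
import contextlib

class TinyModel:
    def __init__(self):
        self.training = True
    def train(self, mode=True):
        self.training = mode

@contextlib.contextmanager
def training_disabled(model):
    was_training = model.training
    model.train(False)                 # modules like batchnorm stop updating stats
    try:
        yield model
    finally:
        model.train(was_training)      # restore: export must not change the model

m = TinyModel()
with training_disabled(m):
    print(m.training)                  # False while tracing/exporting
print(m.training)                      # True again afterwards
```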
Summary:
Implementation of MPSCNNMul that only supports multiplying a tensor with a scalar value for now.
Benchmark runtime for CPU, OpenGL and MPSCNN:
```
I0919 21:15:17.942468 3068398464 net_simple.cc:103] Main run finished. Milliseconds per iter: 527.795. Iters per second: 1.89467
I0919 21:15:21.043023 3068398464 opengl_test.cc:2293] Main run finished. Milliseconds per iter: 249.766. Iters per second: 4.00374
I0919 21:15:23.182369 3068398464 net_simple.cc:103] Main run finished. Milliseconds per iter: 175.548. Iters per second: 5.69644
```
Reviewed By: hlu1
Differential Revision: D5870100
fbshipit-source-id: 2aadd5d134f3b8b40a41f638040cbef35a0086df
Summary: When parameter sharing is used, the model may not own the parameters. Emptying out initializer ensures that the shared model doesn't overwrite initialization.
Reviewed By: chocjy
Differential Revision: D5870362
fbshipit-source-id: f8587b84c3a13f331a3251973e8206563939606a
Summary: This is not a very generic constant
Reviewed By: volkhin
Differential Revision: D5870378
fbshipit-source-id: 59509bb48cecb52ba4a3f26b290855374547fe7e
Summary:
Two implementations of max pool reducers had different semantics in case of equal indices. It matters less in real cases, but breaks tests. Choosing the behavior of LengthMax over SortedSegmentRangeMax as the former is more widely used.
Also some minor tweaks for the test code.
Reviewed By: Yangqing
Differential Revision: D5870386
fbshipit-source-id: 6488cbd5cacaf595ffc07c44084730dd44b3f9dd
To be honest, this was the whole point of this refactor set.
I noticed that in a lot of code, we were repeatedly copying lots of metadata
from old nodes to new nodes. This was quite concerning because I wanted to
add some more metadata (alias information) and I didn't want to have to
get it right in all cases. Plus, in a lot of cases we were forgetting
to set more optional properties like debug names when we "copied".
To solve this, I first made cloneFrom() copy all of this metadata. Then,
I searched for all occurrences of setType() (a proxy for "I'm cloning this
node), looked for cases where we really were morally doing a copy, and rewrote
the code to use cloneFrom() instead, allowing us to drop explicit setType()
(and getting more metadata preservation in the process.)
Finally, I refactored tryToMoveChunk. The code is modestly longer,
but the new version has the nice property that the initialization of
selects for input_chunk are next to the creation of the node (as opposed
to delayed for later.) I also added a lot more comments for invariants
I noticed when I was working on the code.
One minor extra change: TensorType grew a new constructor and a withSizesStride
"immutable setter" which returns a new copy of TensorType with different info.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Previously, there was a hidden, unchecked invariant that you were not allowed to
call create(kParam) or create(kReturn). Now that the logic for them is embedded
in create(), the create(kParam) case is valid, and the create(kReturn) case
will raise dynamically if you try it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Since this code has been stable for a while, I think it's
a good opportunity to make it const correct. There is only
a slight increase in code size, which I hope will appease @zdevito.
- consts were added to all methods which are logically const. Most notably,
lint() is now declared const.
- I made extra const versions of Node::iterator(), Node::reverseIterator(),
Graph::nodes(), Attribute::find(), linked_list::begin(), linked_list::end(),
linked_list::rbegin(), linked_list::rend(); in all cases these were one-liners
except for find() (I spent a little time trying to make find() a one-liner
but didn't think of a way to do it.).
- graph_node_list got factored out into a new, templated type linked_list<T>
(perhaps we should call it intrusive_list<T>). I had to template the iterator
to define constant and non-constant iterators without duplicating code,
and once I was there, I decided to templatize everything else. The code
nicely factors out, although I wouldn't recommend using it for anything
else without more refactoring.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
These functions accept a scaling parameter like THTensor_(cadd)/(csub),
which will make it easier to have the same signature for tensor and
scalar addition in PyTorch and ATen. For example:
tensor.add(other, alpha=2)
Will work if other is a scalar or a tensor value.
See #2739
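For illustration, the unified signature in action; the Tensor/Variable distinctions of the time are glossed over here, and the values are arbitrary.
```python
import torch

t = torch.ones(3)
print(t.add(torch.ones(3), alpha=2))   # t + 2 * other (tensor)
print(t.add(5, alpha=2))               # t + 2 * 5     (scalar)
```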
This adds a concatenated Declarations.cwrap which is the result of
running ATen/extract_cwrap.py on TensorMethods.cwrap. This will let ATen
and the Variable bindings temporarily diverge from Tensor before the new
Variable class subsumes Tensor.
See #2739 and #2633
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting
Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.
Performance Results
===================
Performance results are below for old code, sparse_lengths_sum_benchmark.old.par, that uses
code in lengths_reducer_rowwise_8bit_ops.h, and our new code, optimized via code generator,
sparse_lengths_sum_benchmark.new.par. Block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than original code in lengths_reducer_rowwise_8bit_ops.h
as shown below.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to apply the scale and bias factors
2. In addition to embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire cache line of
64 bytes, while only using 8 bytes. A 128-wide uint8 input block only occupies 2 cache lines, and hence
reading a nearly entire extra cache line of useless data adds to bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterised in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insights into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16 and uint8,
into a microbenchmark, where we varied block size while keeping table size constant (256MB):
block_size time(uint8) time(float16) time(float32)
64 0.19 0.09 0.17
128 0.12 0.09 0.17
256 0.70 0.09 0.14
1024 0.50 0.06 0.10
The pattern for block size of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
time to perform embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving
speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher
running past the end of the block.
Reviewed By: dzhulgakov
Differential Revision: D5824641
fbshipit-source-id: 3a5c020294d84874da78c6943e596423393473d6
Summary:
All other NCCL ops expect paired src, dst pointers for each
GPU. Reduce doesn't, and the old logic would always set dst for
rank = 0 regardless of whether that was the root or not.
This change takes into account that Reduce only has one output, and it
should assign dst only for the root rank. Also changes the schema to
allow inplace for any input and Output(0).
Closes https://github.com/caffe2/caffe2/pull/1214
Differential Revision: D5843177
Pulled By: pietern
fbshipit-source-id: 1e775e6a1ca052e29691b89c1429db03a0e6378b
Summary:
I hit a strange bug and found that the reason is that the macro uses a
temp variable named 'r'. This will cause a conflict when the macro's own
argument also expands to 'r' or something related (in my case, it expands to
'r.size()', where r is a tensor).
Reviewed By: pietern
Differential Revision: D5822833
fbshipit-source-id: 64a6c6b0fc5a1f8359d459d70644bb232ef40606
Summary:
Comments say experimental: don't use it. But these functions are used in the critical path from pipeline.py, so better to remove the comment?
Also changed if-else to first check for None. Although python does not crash with getattr(None, "x"), it is confusing.
Some lint issues.
Reviewed By: azzolini
Differential Revision: D5853639
fbshipit-source-id: 977de5ba0ea3ae26343ae5fcacac883faf892b0e
Summary:
Adding backward pass support for If operator:
- Implemented necessary changes to Do operator and generation of gradient Do operator to properly forward gradient blobs in and out of subnet
- Using WorkspaceManager to keep track of workspaces used by Do, in case we need to have access to local blobs to compute gradients (also important for loop's backprop)
- Update to Workspace to handle blob binding from multiple parent workspaces
- Implemented generation of gradient If operator
- Unit test to build and train a net with If control op
Reviewed By: azzolini
Differential Revision: D5745096
fbshipit-source-id: 1023c90a2113716254424d1e50b9e560fe9083e5
Summary:
For future reference - seems that at some point cub had a force push. If any already checked out branch has issues, try deleting the cub submodule and redo git submodule update --init.
Closes https://github.com/caffe2/caffe2/pull/1227
Differential Revision: D5856030
Pulled By: Yangqing
fbshipit-source-id: c192974246c27ce6bd739295c31c25fd75766a35
* Specifying the value used for padding
The "pad_packed_sequence" function fills padded elements with zeros, but sometimes it is not useful. For example, some previous papers on NLP, including my recent paper [1], use a max-pooling technique for RNN-based sentence representations. More specifically, the max-pooling technique selects the maximum value from all time steps (i.e., hidden states) for each dimension. In such a case, we do not want the padded zeros to be selected. To overcome this situation, we can simply use a very small value instead of zero.
An LSTM example is shown below:
input = embedding(Variable(batchInput))
packedInput = nn.utils.rnn.pack_padded_sequence(input, lengths, batch_first = True)
h, (hn, cn) = self.encoder(packedInput, (h0, c0))
h, _ = nn.utils.rnn.pad_packed_sequence(h, -1024.0, batch_first = True)
sentenceRep, _ = torch.max(h, 1, keepdim = True)
[1] A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. The 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017).
https://arxiv.org/abs/1611.01587 (Equation (4))
* Modified the order of the arguments
Following the suggestion, I modified the order of the arguments.
Summary: In some cases (e.g. CI), showing a progress bar will mess up the log.
Reviewed By: jerryzh168
Differential Revision: D5850918
fbshipit-source-id: 2da9d020832264cef977391dc2fd8d1e2677d159
Summary:
It is interesting that under Facebook fbcode this was not an issue -
but it definitely causes issues on OSS.
Closes https://github.com/caffe2/caffe2/pull/1225
Reviewed By: dzhulgakov
Differential Revision: D5851360
Pulled By: Yangqing
fbshipit-source-id: f8a8f15184092a888bdc909ba2323229d4485902
Summary:
This fixed a minor bug in D5690181.
Failing test observed in https://travis-ci.org/caffe2/caffe2/jobs/275603846
Reviewed By: jerryzh168
Differential Revision: D5850985
fbshipit-source-id: 02aefb8902878d6adf7686a94153823b92c0e7b7
* Win64 support for lib/THS
* Fix VS warnings(for lib/THS)
* Revert changes that prevent successful build
* use the type descriptors for int64_t
* Fix warnings in THS for MSVC
Summary: Introduced weights for labels in the multi-label setting. An extra weight blob is introduced and read in the operator in case the label setting is weighted sparse.
Reviewed By: kevinwilfong
Differential Revision: D5812467
fbshipit-source-id: efb209092e1e9effc915b0a753fa0c67b47a4fb6
Summary:
Now that Buck supports a way to opt-out external C/C++ libs from omnibus linking,
this diff removes the hack we previously relied on (and which got copy-pasta-d everywhere).
Reviewed By: pixelb
Differential Revision: D5832450
fbshipit-source-id: cc3d12488f8498be6fb12bce1fedb3ad1accb518
Summary: On CPU, no need to replicate parameters. So try using only one copy (cpu_0) for parameters. Made resnet50_trainer use shared model in cpu mode.
Reviewed By: wesolwsk
Differential Revision: D5812181
fbshipit-source-id: 93254733edbc4a62bd74a629a68f5fa23f7e96ea
Summary: following optimization in sparse lengths sum, translate it into weightedsum
Reviewed By: azzolini
Differential Revision: D5732859
fbshipit-source-id: 430ee077a1063f3c55806f6dbb5ea46f0fd5c486
Summary:
following wickedfoo's previous diff, I made SparseLengthsSum kernel a little
faster. I did:
- `__restrict__` note for ptrs
- `ExactBlock` optimization for kernels where post < Maxthreads. This is a general case
===Check Test Area Please, Are we looking at another 57% speed up here???===
Reviewed By: azzolini
Differential Revision: D5676351
fbshipit-source-id: 963f4712106b324fda488ec5c63b7e010b915814
Summary: This caused gradient generation problems. Output was made in-place in PR-1185, by mistake, I believe.
Differential Revision: D5844825
fbshipit-source-id: 4ad84d0fb468aafde9f78463b9acf89316e633ca
Summary: Ported existing adhoc test code to use python unittests. Small tweak to caffe2.python.hypothesis_test_util
Reviewed By: kmatzen
Differential Revision: D5837295
fbshipit-source-id: daa2360db3c18c7d4bda7785e7a0b9175f5858af
Summary:
This is useful for pure throughput tests where
we don't care about training a real model.
Reviewed By: akyrola
Differential Revision: D5834293
fbshipit-source-id: dab528c9269fb713e6f6b42457966219c06e0a35
Summary: When trained on billions of examples, the adagrad gradient square sum can become very big and create an issue of adding small numbers to big numbers. This diff allows decaying the adagrad gradient square sum.
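A toy sketch of the decayed accumulator; the decay value below is illustrative, not the default added by this diff.
```python
import numpy as np

def decayed_adagrad_step(w, g, h, lr=0.01, eps=1e-8, decay=0.999):
    h = decay * h + g * g          # decayed sum of squared gradients
    w = w - lr * g / (np.sqrt(h) + eps)
    return w, h

w, h = np.zeros(3), np.zeros(3)
for _ in range(5):
    w, h = decayed_adagrad_step(w, np.ones(3), h)
print(w, h)
```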
Reviewed By: queqichao
Differential Revision: D5825932
fbshipit-source-id: 570224483b77d42ae53410fa2f767af86de167eb
Summary: Added new counter to prof_dag which counts the number of times a particular op_type executed during an iteration, and prints the count per iter in the output.
Reviewed By: akyrola
Differential Revision: D5837444
fbshipit-source-id: 0f2571c6f85410dac21d4b627fe455ef7c1ab908
Summary: PR 1175 caused a build error because gemmBatched was only under a specific #ifdef. Now put it outside the #ifdef, and things work.
Reviewed By: asaadaldien
Differential Revision: D5834868
fbshipit-source-id: 072a64c8f4b259ff7504104121766115b46b8aa0
Summary: Remove the caffe2 namespace {} because all the code inside opengl_test.cc is wrapped inside the caffe2 namespace
Reviewed By: Maratyszcza
Differential Revision: D5829458
fbshipit-source-id: e68dde08a1c3dc4c41260f5f028ca7efe8d34fbd
Summary:
- All NCCL ops that were triggering a reallocation were deadlocking because I think cudaMalloc or something wants the lock that is being held by ncclRun, so I split the parts where potential allocation happens to a separate lambda. Thanks a lot akyrola and asaadaldien for the after-hours help on debugging this.
- Added support for NCCLReduceScatter.
- NCCLReduce is still deadlocking, but it happens somewhere else. We can debug it separately.
Reviewed By: akyrola
Differential Revision: D5800861
fbshipit-source-id: c963f93942a3ee3bb706fac52047b18c3f37831a
Summary: Otherwise weights, biases are not created and test creation fails
Reviewed By: gsethi523
Differential Revision: D5836438
fbshipit-source-id: 32a75313b6b9ebecbfaa43ebd39f19c8eaba8cd1
Summary: get and getheader are the same in Python 2
Reviewed By: akyrola
Differential Revision: D5836486
fbshipit-source-id: 3bacfccc872c44741d7f26c68ba967093fce45c2
Summary: RunAsync() called DagNetBase::Run(), which called ProfDag::RunAsync().
Reviewed By: Yangqing
Differential Revision: D5835852
fbshipit-source-id: 30618d517c7ee235143de6efaa2f40df3f1d372f
Summary:
* For forward: allow either 1 or 2 output.
* For gradient generator: always return a gradient operator that does not use scale.
* For cudnn gradient op: nothing to do, already like this
* For default CPU and CUDA gradient ops: put scale as a member variable, and always recompute scale.
Reviewed By: bddppq
Differential Revision: D5690181
fbshipit-source-id: a6353202dcaf7359298bc8f032ac0c651352e2bc
Also squash a warning about an implicit conversion that will never
occur (because the type being converted to is a superclass).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
To speed up deprecating legacy_pad, we added the option
to remove legacy pad in the caffe_translator
Reviewed By: bddppq
Differential Revision: D5724079
fbshipit-source-id: 25465d26f35bd009aa71667c7c523047de42e802
Summary:
This exhibits the problem in NMT training where some data seems to
have silently been written out of bounds, causing random segfaults elsewhere in the
code. This itself does not solve the problem, but will prompt us to then fix the out
of bound issues.
Differential Revision: D5832646
fbshipit-source-id: 5eb259e4584e5341ef3f19362f98f0a9554e9aec
Summary:
UBSan report:
```
UndefinedBehaviorSanitizer: dynamic-type-mismatch caffe2/caffe2/core/tensor.h:786:22 in
caffe2/caffe2/core/tensor.h:787:19: runtime error: member call on address 0x60c01f610440 which does not point to an object of type 'caffe2::Tensor<caffe2::Tensor<caffe2::CPUContext> >'
*** Aborted at 1505298367 (Unix time, try 'date -d 1505298367') ***
*** Signal 6 (SIGABRT) (0xf2) received by PID 242 (pthread TID 0x7fb376f06700) (linux TID 33215) (maybe from PID 242, UID 0), stack trace: ***
0x60c01f610440: note: object is of type 'N6caffe26TensorINS_10CPUContextEEE'
07 5e 81 60 c8 47 13 35 00 00 00 00 90 f3 73 80 20 60 00 00 98 f3 73 80 20 60 00 00 a0 f3 73 80
^~~~~~~~~~~~~~~~~~~~~~~
vptr for 'N6caffe26TensorINS_10CPUContextEEE'
#0 0x1f0d1c22 in std::vector<long, std::allocator<long> > caffe2::GetTensorInfo<caffe2::Tensor<caffe2::CPUContext> >(void const*, bool*, unsigned long*, caffe2::DeviceOption*) caffe2/caffe2/core/tensor.h:787:19
#1 0x9a5e0a1 in caffe2::FacebookOperatorObserver::log() caffe2/caffe2/fb/init/net_observer.cpp:300:15
#2 0x9a5b49d in caffe2::FacebookOperatorObserver::Stop() caffe2/caffe2/fb/init/net_observer.cpp:229:11
#3 0x447d046 in caffe2::Operator<caffe2::CPUContext>::Run(int) caffe2/caffe2/core/operator.h:308:20
#4 0x1ecedb2f in caffe2::SimpleNet::Run() caffe2/caffe2/core/net_simple.cc:51:14
#5 0x1f1ba169 in caffe2::Workspace::RunNet(std::basic_fbstring<char, std::char_traits<char>, std::allocator<char>, std::fbstring_core<char> > const&) caffe2/caffe2/core/workspace.cc:211:26
...
```
The bug is that `GetTensorType` and `GetTensorInfo` take the context as the template argument, not the tensor itself.
Reviewed By: bddppq
Differential Revision: D5826781
fbshipit-source-id: 9cfd2ca1aaef6f8ee8a556ce7b553c0a4f43a100
Summary: Fix comment on core.Net.RunAllOnMKL (the comment was actually for core.Net.RunAllOnGPU)
Reviewed By: zem7
Differential Revision: D5734309
fbshipit-source-id: 2cc40a99a2c0083c73ec1e4c8279f55f296a003c
Summary:
This enables opsnoop to work with simple net as opposed
to just dag net
Reviewed By: pietern
Differential Revision: D5721732
fbshipit-source-id: c38d0b51d3b0469ecb2883e7075eeee7acf81d75
Summary: If blob type switches between fp32, fp16 - for example - we should share the tensor buffer. This kind of switching can happen with memonger and in-place conversions.
Reviewed By: bddppq
Differential Revision: D5812333
fbshipit-source-id: 44d54bfe52cbda734db8c7f20d6970e4b51ee1e1
Summary:
Choose the number of cores for the thread pool as the number of fast cores.
Didn't do any benchmarks, so it's mostly an FYI diff.
Reviewed By: ajtulloch
Differential Revision: D5579797
fbshipit-source-id: 5ada001116c731780f38a62e9c0b500bd64a4bfe
Summary:
Also add the ability to mark an argument as required.
Added a string constant `OpSchema::Arg_IsTest` for `is_test` arg.
If users define the `is_test` argument with `ArgIsTest(...)`, it automatically becomes a required argument; meanwhile, users can still use `Arg("is_test", ...)` to define an optional `is_test` argument.
Reviewed By: akyrola
Differential Revision: D5812391
fbshipit-source-id: eaaba50d027813a8012389edc6c459de23c3c728
Summary: For data parallel training we need the batch size to be a multiple of the number of replicas. With this diff we do that via Dataset(rec).trim(multiple_of=num_replicas).
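A tiny pure-Python sketch of the trimming arithmetic (the real work is done by Dataset.trim; this only illustrates the rule):
```
# Illustrative only: the actual trimming is done by Dataset(rec).trim(...).
def trimmed_size(num_examples, num_replicas):
    # Drop the remainder so every replica sees the same number of examples.
    return (num_examples // num_replicas) * num_replicas

assert trimmed_size(103, 4) == 100
assert trimmed_size(100, 4) == 100
```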
Reviewed By: dzhulgakov, harouwu
Differential Revision: D5753861
fbshipit-source-id: c5d728b925707dbd3d1f500a93e67e185c223569
Summary: CuDNNDropout used to append the CUDNN states structure on top of the mask blob. This is a bit controversial, and also caused problems when the mask blob was released by dynamic memory management. This diff makes that states blob a separate blob managed outside the inputs/outputs (so that we don't need different signatures for the CUDNN and non-CUDNN ops). Since the gradient op needs to access the same states, it grabs the states blob based on the mask blob name. Perhaps not the cleanest way to pass information, but at least better than the previous model. Also could remove a fair amount of code.
Reviewed By: bddppq
Differential Revision: D5787039
fbshipit-source-id: d95f0ffafb5fb2a6a7ce46f4a855e9c1b9a47f52
Summary:
I would expect that tests marked "expected failure" mean that there is a known issue in the code which will be fixed later. Both of these tests are simply verifying proper error-checking - nothing needs fixing.
Before (looks like something is wrong):
```
======================================= 2 xfailed in 0.27 seconds =======================================
```
After:
```
======================================= 2 passed in 0.28 seconds ========================================
```
/cc akyrola gsethi523
Closes https://github.com/caffe2/caffe2/pull/1209
Differential Revision: D5825373
Pulled By: akyrola
fbshipit-source-id: 1b98f503e4e406f69567d02425532f43bd16a465
Summary:
Right now, each net implements 2 functions: Run() and RunAsync(). The (loose) abstraction is:
* Run(): run the network in a synchronous way. The call is synchronous.
* RunAsync(): run the network *still synchronously*, but potentially use asynchronous scheduling of the underlying operators.
As one can see, this is highly confusing: RunAsync() is actually a sync call, and the semantics it tries to implement should actually be done by a different net type. For example, DAGNet and AsyncDAGNet both implement the Run() function, and under the hood one uses sync scheduling and one uses async scheduling. Currently, the only user of the RunAsync() function is in SimpleNet::RunAsync(). The only call site is in recurrent_net_op.
Instead, the operator implements the two Run() and RunAsync() functions as follows:
* Run(): run the operator in a synchronous way. aka doing FinishDeviceComputation().
* RunAsync(): run the operator in an asynchronous way if possible (i.e. still sync in CPU, but async in cuda), records the action in the event_, and return immediately.
Semantically, Run() is equal to RunAsync() followed by event().Finish().
As a result, we propose in diff D5812854 to change the network interface to be similar to the operator interface, and to explicitly promote RunAsync() to a first-class citizen of the net interface. Specifically:
* Adding a SupportsAsync() function that determines if a net supports async execution or not.
* Run(): run the net in a synchronous way.
* RunAsync(): if SupportsAsync() is false, same as Run(). If SupportsAsync() is true, run the net in an asynchronous way, with the scheduling algorithm determined by the implementation itself. Then record all outstanding events in the events_ field and return immediately.
Semantically, Run() is equal to RunAsync() followed by event.Finish() for all the events. This is actually the implementation: Run() is no longer a virtual function, RunAsync() is, and all subclasses of NetBase shall now implement SupportsAsync() and RunAsync().
**Why SupportsAsync()?**
This is a design idea that probably needs iterating. Basically, the idea is that RunAsync() is the main entry for the net execution, and it's actually like RunAsyncIfTheNetSupportsIt().
In theory, Run() is basically a wrapper on top of RunAsync() to reduce code duplication: if a net type does not support RunAsync(), its RunAsync() implementation simply is sync (see e.g. SimpleNet) and the Run() to RunAsync() lowering is a no-op (with the only overhead being a nested function call).
I exposed the SupportsAsync() function just in case some caller wants to explicitly check whether an instantiated net supports async call or not - for example, a caller may want to make sure that it is actually running a net asynchronously, in which case SupportsAsync() is the place to query.
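For illustration, a minimal Python paraphrase of the proposed semantics (the real interface is the C++ NetBase class; the method spellings here only mirror the description above):
```
class NetBase(object):
    def SupportsAsync(self):
        # Subclasses say whether they can actually schedule work asynchronously.
        return False

    def RunAsync(self):
        # Subclasses implement execution here; if async is supported they
        # record outstanding events in self.events and return immediately.
        raise NotImplementedError

    def Run(self):
        # Run() is RunAsync() plus waiting on every recorded event.
        if not self.RunAsync():
            return False
        for event in getattr(self, "events", []):
            event.Finish()
        return True
```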
Reviewed By: dzhulgakov
Differential Revision: D5812854
fbshipit-source-id: 916b38fded0eb14439f340ab254a034ac5a9a465
Summary: Kernel data and other shader parameters are now cached directly into uniform buffer blocks, and the blocks are dynamically attached at run time.
Reviewed By: hlu1
Differential Revision: D5772847
fbshipit-source-id: 746448c2d5db12e38fb883874ede3acfccb9f6ef
Summary: The default value for timeout in CreateOrCloneCommonWorld does not work properly: if the value of dpm._DEFAULT_TIMEOUT is changed, the default still stays at the old 30s. Changed to use None as the default instead.
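The underlying Python pitfall, sketched with hypothetical names (not the actual data_parallel_model code): a default argument is evaluated once at function definition time, so a captured constant never reflects later changes, while a None sentinel resolved at call time does.
```
_DEFAULT_TIMEOUT = 30

def create_common_world_stale(timeout=_DEFAULT_TIMEOUT):
    return timeout  # default was captured at def time

def create_common_world_fresh(timeout=None):
    if timeout is None:
        timeout = _DEFAULT_TIMEOUT  # resolved at call time
    return timeout

_DEFAULT_TIMEOUT = 60
assert create_common_world_stale() == 30   # still the old 30s
assert create_common_world_fresh() == 60   # picks up the new value
```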
Reviewed By: pietern
Differential Revision: D5813228
fbshipit-source-id: f617ceec40a03893c27d3e13c426e1ca6b2114e2
Summary:
Computes a fixed grid of RMAC region coordinates for a given 4D feature tensor
(NCHW) as described in https://arxiv.org/abs/1511.05879. The output is the
`roi` format expected by RoIPoolOp. To compute the actual RMAC itself, the
output of this op should be passed to RoIPoolOp.
Reviewed By: wickedfoo
Differential Revision: D5594994
fbshipit-source-id: 5edac98a18137b53555f9a16354419b424679c99
Summary: Explicit function to sync blobs. Notice that this must be called before CreateNet(), and syncs the blobs every run.
Reviewed By: asaadaldien, jay-mahadeokar
Differential Revision: D5805891
fbshipit-source-id: 58a1bb47805d75d5cbead136e2e0e9fe663ea954
Variable is now a subclass of at::Tensor backed by a VariableImpl* pImpl. The implementation of the ATen functions is defined in the auto-generated VariableType.h/cpp file.
Currently, only functions which fall through to the base type, such as sizes() and isCuda() are implemented. Differentiable ops like add() and mul() will be added in a subsequent PR.
When you call repr() on a long in Python 2, it prints a long suffix.
This is annoying for tests which assert on the exact output. Use str()
instead.
But then there is a problem with Python 2's default tuple str() implementation,
where it calls repr() on its arguments rather than str(). This means that
if you have a tuple of longs, it will render as "(1L, 2L)" in Python 2.
To solve this problem, we just reimplement tuple printing in C++.
This is not a very robust fix (nested tuples, dictionaries, all these situations
will fail) but in practice it hits the cases that matter.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
In Python 2, the non-generator map will always perform the indexing
even when it is not used in the end. Using the generator can let
us avoid indexing when it is not used.
As an added bonus, it makes the ordering of operations deterministic
between Python 2 and Python 3 in LSTM.
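A small illustration of the laziness point, with a hypothetical list and accessor (in Python 3, map() is already lazy, so the distinction only matters on Python 2):
```
hiddens = ["h0", "h1", "h2"]

def get_hidden(i):
    # In Python 2, map(get_hidden, range(3)) would call this for every index
    # up front; the generator below only touches what is actually consumed.
    return hiddens[i]

lazy = (get_hidden(i) for i in range(len(hiddens)))  # nothing indexed yet
first = next(lazy)                                    # only index 0 touched
```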
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: This is will allow the same decoder to handle different go tokens.
Differential Revision: D5801811
fbshipit-source-id: ddd309963c97e32c728b15d2ccd4ba0c4ad5ebbe
Summary: The android segmentation net was failing with MPSCNN because some fused MPSCNNConvRelu ops become in-place after fusion.
Reviewed By: fricc33
Differential Revision: D5803245
fbshipit-source-id: 6808e9c3504389c113c7a16504d6554e83bdcc3e
Summary:
If the Gloo InfiniBand transport is used, the Gloo algorithms can use
GPUDirect to DMA directly from/to GPU memory. This is done through the
CudaDeviceWorkspace. This change adds a "gpu_direct" option to the
Allreduce operator that makes it use GPUDirect if the transport
supports it.
Closes https://github.com/caffe2/caffe2/pull/1203
Reviewed By: wesolwsk
Differential Revision: D5806366
Pulled By: pietern
fbshipit-source-id: 9e9a78f059f2b5c6e4fbf6574b7db4776a94696c
Summary:
Implement atomic add operation for zeus kv store.
All nodes now use zeus as the KVStore instead of relying on the master hosting a KVServer.
Code cleanup.
Reviewed By: andrewwdye
Differential Revision: D5581697
fbshipit-source-id: ba7d99215fb478a30942ff593f13dad65aa48d36
Summary: A bit safer, and also suppresses compiler warning.
Reviewed By: bddppq
Differential Revision: D5803080
fbshipit-source-id: d8c782c936a8fdaded4ae209b212378e78606ffb
Summary:
During the team meeting today Dima and Alex mentioned that the current lambda
function causes slowdown in performance when a large number of alloc and
dealloc happen. My observation is that most of the Delete are actually direct
Delete() function pointers, so I gave it a shot to see if we can reduce
the overhead.
RawAllocDealloc is much faster already, and we observe another 5ns reduction
(12.5%). For TensorAllocDealloc of 32x32 tensors, we are observing 57ns saving
(26%). This is measured on Xeon(R) CPU E5-2660.
Also cleaned up the function interfaces of ShareExternalPointer so we have 2
functions only.
Reviewed By: salexspb, dzhulgakov
Differential Revision: D5801013
fbshipit-source-id: 7068207a43400fa3902bbb3689b3c729e839456c
* Added support for nInputDim parameter in Padding class
* moved nInputDim to the end so as to not break backwards compatibility
* hasattr to check if nInputDim is actually set
* check if nInputDim is positive before checking against input dim
Summary:
Support int64 data type in protobuffer tensor in image input op.
This is useful when fbid, which is usually of data type BIGINT, is stored in tensor proto.
Reviewed By: panshen1
Differential Revision: D5792697
fbshipit-source-id: 0bc3da4fd31120b0582fb32dd7c2d09fe591a6de
Summary: CPU gradient is correct. CUDA gradient was wrong.
Reviewed By: asaadaldien
Differential Revision: D5801595
fbshipit-source-id: 7e529ed751b92137e49a0517120ddfae7a30ec28
Summary: Stress tests for recurrent_net_executor_test failed sporadically when the executor got stuck in forward-only mode. In forward-only mode we apply a limit to the number of parallel timesteps (because we recycle workspaces cyclically). There was a race condition where the finished_timesteps_ variable was set to 0 after jobs had already been executed by threads. So set the variable to 0 before putting any jobs into the queue.
Reviewed By: azzolini, Yangqing
Differential Revision: D5801599
fbshipit-source-id: 8443c67f4ae8af3ae08c6f0cd4575ef729ffa3af
Summary: RNN executor previously relied on getting the mapping from x to x_prev (and gradients) from recurrent.py, but we can just infer them from links. This makes all models compatible with rnn executor, given enable_rnn_executor=1 argument.
Reviewed By: jamesr66a
Differential Revision: D5801436
fbshipit-source-id: 14d0e26dfbad6347f645d907da493187c98e9b17
Summary:
Before this change there were two ways for machines to rendezvous for a
distributed run: shared file system or Redis. If you're using an MPI
cluster it is much more convenient to simply execute mpirun and expect
the "right thing (tm)" to happen. This change adds the "mpi_rendezvous"
option to the CreateCommonWorld operator. If this is set, the common
world size and rank will be pulled from the MPI context and Gloo
rendezvous takes place using MPI. Note that this does NOT mean the MPI
BTL is used; MPI is only used for rendezvous.
Closes https://github.com/caffe2/caffe2/pull/1190
Reviewed By: akyrola
Differential Revision: D5796060
Pulled By: pietern
fbshipit-source-id: f8276908d3f3afef2ac88594ad377e38c17d0226
Summary: As title. Made the configurations op-specific since many models run multiple RNNs.
Reviewed By: jamesr66a
Differential Revision: D5796208
fbshipit-source-id: 88173879dfff9f3f7bf583ccc4f4c6385cca5aca
Summary: Allow context to be passed into piper function
Reviewed By: volkhin
Differential Revision: D5684716
fbshipit-source-id: 693f0464fe28f8692d75901705a85a0a413a7bed
Summary: The convolution should not run with input texture slices > 1 with tiling
Differential Revision: D5774187
fbshipit-source-id: 5e94f82cd65e0d4425a7a0090a61a33bef2a14fc
Summary:
`ModifierContext` is the base class for `OptimizerContext` and `RegularizationContext`.
`UseModifierBase` is the base class for `UseRegularizer` and `UseOptimizer`.
Most of the code in `OptimizerContext`, `RegularizationContext`, and other potential Context classes in the future could be shared. We thus implemented a new base class, `ModifierContext`, to support this.
The same happens to be true for `UseRegularizer` and `UseOptimizer`, so we implemented a new base class called `UseModifierBase`.
In this way, users only need to provide the API for the **get** and **has** operations, and to specify the **context class**.
**Note**
Mirrored code in fbandroid and fbobj will be added when this is finally checked in.
Reviewed By: kittipatv, xianjiec
Differential Revision: D5724613
fbshipit-source-id: de19bb822dcd41ec5c459d65065603a0abe2fd20
Summary:
Regularization added for caffe2 and dper.
This regularization is intended for dense features only. Sparse features are handled via individual optimizers; see D5618405 and D5534579 for details.
The implementation of dense regularization is similar to the one in the optimizer. We now support `l1 norm` and `l2 norm` in the regularizer. In dper, we call different regularizations based on the regularization type defined in model_definition.thrift.
Reviewed By: xianjiec
Differential Revision: D5724851
fbshipit-source-id: 0fbee698cfeff1ac477fc9d07785406069f8d9c8
Summary:
These arguments control which Gloo transport (TCP or IB) and which
network interface is used for the common world. If not specified, it
defaults to using TCP and the network interface for the IP that the
machine's hostname resolves to.
The valid values for the transport argument are "tcp" and "ibverbs".
For ibverbs to work, Gloo must have been compiled with ibverbs
support. If Gloo is built as part of Caffe2 (sourced from the
third_party directory), then you can pass -DUSE_IBVERBS=ON to CMake to
enable ibverbs support in Gloo.
Closes https://github.com/caffe2/caffe2/pull/1177
Reviewed By: akyrola
Differential Revision: D5789729
Pulled By: pietern
fbshipit-source-id: 0dea1a115c729e54c5c1f9fdd5fb29c14a834a82
Summary:
The predictor export functions allowed a way to specify a net type, but no way to specify num_workers for when you use net type 'dag'. This adds that option to the PredictorExportMeta named tuple and populates the field in the exported protobuf. Also added parameters to callsites in NMT ensemble model class and model repackager to populate net_type and num_workers.
Using DAGNet for our base predictor net (not the recurrent stepnets) speeds up our inference by 1.15x, since we can now run the encoder forward and backward RecurrentNets for each model in the ensemble in parallel.
Reviewed By: salexspb
Differential Revision: D5792203
fbshipit-source-id: cb9a8237a0cbe1a09645d4de051dfbb23f06dcfa
Summary: RNN executor did not consider the race-condition type of dependency where an op A reads blob X and a following op writes blob X. This happened in beam search with an in-place Reshape following an FC op.
Reviewed By: jamesr66a
Differential Revision: D5792018
fbshipit-source-id: a5590d80e1b7b127abcdf2b1c2854ea56018e12f
Summary: This dot_product layer was added before functional layer was added. Now we have functional layer, this dot_product layer is no longer needed. This diff removes dot_product layer.
Reviewed By: kittipatv
Differential Revision: D5783303
fbshipit-source-id: 5d13f729918148ee57836fb47c48e6f24773654b
Summary: The shape inference of distance_op has issues (it only works when inputs are 1D tensors). This diff fixes the shape inference and the unit test.
Reviewed By: kittipatv
Differential Revision: D5788744
fbshipit-source-id: cb1b7facf7b9ccd64b54edca156325eceef50f33
Summary: We could be a bit more helpful.
Reviewed By: jamesr66a
Differential Revision: D5778789
fbshipit-source-id: 570095196b07d593cfed8318477b296e47c5d43d
When the size given is incorrect for the number of elements, the current error message is:
`size '[1 x 1 x 5]' is invalid for input of with 1 elements at /pytorch/torch/lib/TH/THStorage.c:41`
This replaces it by
`size '[1 x 1 x 5]' is invalid for input with 1 elements at /pytorch/torch/lib/TH/THStorage.c:41`
which is grammatically better
Proper broadcasting in ATen uncovered a bug in our fusion
compiler where it outputs the wrong shaped tensor. We're
tracking the issue in https://github.com/ezyang/pytorch/issues/206
but for now, rewrite the code so it does an "old style" comparison,
which works fine.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Data parallel model failed with device numbers 10, 11, ... because it used string sorting of the blob names. Changed to make sorting happen based on device number and then blob name. Also added reduction for 16 devices.
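A sketch of the fixed sort key, using hypothetical blob names: sort by (device number, blob name) rather than by the raw string.
```
import re

blobs = ["gpu_10/fc_w", "gpu_2/fc_w", "gpu_11/fc_b", "gpu_2/fc_b"]

def device_then_name(blob):
    device, name = blob.split("/", 1)
    return (int(re.sub(r"\D", "", device)), name)

# A plain string sort would order gpu_10 and gpu_11 before gpu_2.
ordered = sorted(blobs, key=device_then_name)
# -> ['gpu_2/fc_b', 'gpu_2/fc_w', 'gpu_10/fc_w', 'gpu_11/fc_b']
```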
Reviewed By: wesolwsk
Differential Revision: D5781521
fbshipit-source-id: 16be0984ecb55340604c82893be366c0528e822c
* Variables now hold a list of ValueTracingStates and can participate
in multiple traces.
* Refactored Traceable to maintain a list of traces, and only stop
tracing once it records all stages
Summary: Filling in the gap in tensor inference
Reviewed By: sunnieshang, akyrola
Differential Revision: D5779550
fbshipit-source-id: 9ec68c9dad566183d7d0fc2819829c2b91430dda
Summary:
If Caffe2 used the packaged NCCL version then the Gloo build will try
to use it as well. To make sure the NCCL build has completed we need
to add an explicit dependency between the two.
Another subtle change here is that we add the PROJECT_BINARY_DIR to
the include path, since that is where the generated <gloo/config.h>
resides. Without this path Caffe2 includes the empty config.h from the
source tree.
Closes https://github.com/caffe2/caffe2/pull/1170
Differential Revision: D5779002
Pulled By: pietern
fbshipit-source-id: 9bc0d41f01a9b0f023d71bc4dee128a77eec1712
Summary: As title. I wonder why this had not been encountered before. It only affects cases where the states are copied over, though.
Reviewed By: Yangqing
Differential Revision: D5777314
fbshipit-source-id: 8aef435c832e4ead5bb3d3e35bb065c734a2af5f
Summary: According to GitHub issue #1168, the agreement between the Caffe2 and NumPy YellowFin models in the tests is not good enough in some environments. Results were very close on my machine. GitHub's Travis failed on some tests, which I later disabled. Therefore the difference doesn't come from logical differences but from loss of precision on some machines. It is safe to disable the equivalency test if equivalency was already tested once.
Reviewed By: akyrola
Differential Revision: D5777049
fbshipit-source-id: c249a205d94b52c3928c37481f15227d500aafd0
Summary:
Add type inference for EnsureDense operator so that the output tensor
has the same data_type and shape of the input tensor
Reviewed By: kittipatv
Differential Revision: D5763117
fbshipit-source-id: e507e8d928c1515bd01063e2af595eb0daf1e768
Summary:
Special executor for RNNs which can exploit parallelism over timesteps. For CPU we use multi-threading, achieving a 3x or so improvement on 4-layer LSTMs.
With CUDA, perf improvements are more modest, but the structure allows for optimizing it further. For CUDA, we use multiple streams and events if there is parallelism
over timesteps. In my experiments, it was not good to use more than 2 streams, though.
Flag --caffe2_rnn_executor can be used to switch the executor off.
Reviewed By: salexspb
Differential Revision: D5749304
fbshipit-source-id: d6f76b3e16598be5b4e8188aff031671ebafaa4c
- kernels -> kernel_shape
- Use the new hybrid dict/tuple result object from Toffee
- Write g and t as singulars, not plural
- nanopb generated files update
- Bugfix for msg() micropb helper
- Start recording producer_version/producer_tag
- Use ir_version from proto description
- Value -> value (Constant)
- Remove special-casing for transposed convolution; we now rely
on the Caffe2 Toffee backend to do something reasonable
- Batchnorm order is no more
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Conv no longer supports bias, so we create an explicit broadcasted
addition afterwards. There is one minor problem, however, which is that
ConvTranspose in Caffe2 has mandatory bias. So there's a hack.
See Note [Caffe2ConvTranspose] for the details.
- Squeeze: dims -> axes
- Transpose: axes -> perm
- Reshape lost its extra output (yay!)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This was a doozy!
- 'namespace' is a C++ reserved keyword, so if you have a field named
this, nanopb will blithely export some malformed C++. I submitted
a PR for this: https://github.com/ProjectToffee/ToffeeIR/pull/88
- Zach added support for singular tensor and graph. While attempting
to add support for these, I realized that it was actually impossible
to support them under the default protobuf translation. The gory
details are in Note [Callback for nested messages]. The singular
callbacks needed a new helper which I dubbed msg; it's just
the singular version of list.
- While I was working on the API, I braino'd with the tensor()
method. It turns out this is totally not the right way to think
about it; it's more string_from_tensor(). So I renamed it.
I also renamed add_tensor to set_raw_data; add_tensor is a misnomer
since it implies you can add multiple tensors, which is not true.
- version turned into producer_version. Actually, this is a bit
questionable and might change soon.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This is a case of two wrongs making a right. There were a pair of
related bugs:
- We incorrectly translated Transpose as if it were a Permute;
but Torch transpose actually is a *swap* between dimensions.
- Why didn't we ever notice it? In all of our tests, a transpose
was *solely* done to get a weight matrix into the correct form.
But Caffe2's FC operator *implicitly* does a transpose on
the weight matrix.
This commit fixes both of these problems.
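For reference, the distinction at play, in modern PyTorch spelling (illustrative only):
```
import torch

x = torch.zeros(2, 3, 4)
swapped = x.transpose(0, 2)    # shape (4, 3, 2): only dims 0 and 2 are swapped
permuted = x.permute(2, 0, 1)  # shape (4, 2, 3): a full reordering of all dims
```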
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This adds the PyTorch API user documentation for Toffee.
To make the example work, I also converted all "inplace"
ops to export out-of-place in Toffee.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- BC BREAKING: export now also takes a mandatory file-ish argument, specifying
the file to export the protobuf to. I rewrote the tests to use BytesIO to
get out the string so they could parse it again.
- BC BREAKING: export no longer returns the tensors that were computed. To
get these, use the internal _export function.
- Multiple inputs to models are now supported by passing a tuple to input.
(Old API of a single Variable still works.)
- Keyword arguments to models are now supported via kwargs keyword arg.
- Renamed embed_params to export_params, and it now defaults to True.
- Toffee tests now live in their own test_toffee.py file. I had to
rename a pile of expect files for this.
- Removed defunct torch.toffee imports from autograd to solve module import
cycle.
- Helper function _with_file_like to abstract over opening file-ish arguments,
taken from torch.save()
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Rather than reuse input as output names in ToffeeIR, mark places where
inputs are consumed. In C2 conversion these annotations will be used
to create the corresponding graph.
Toffee submodule update.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Reduce setup.py diff.
- Expunge WITH_TOFFEE from codebase.
- Elaborate on a comment.
- Move gen_toffee.sh to tools
- Delete densenet test.
- Use 'using' to inherit a constructor.
- Delete outdated comment.
- Comment about why primspecs can return fewer outputs.
- Remove dead, commented out includes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Along the way I added converters for Variable and TracingInput. Variable should
probably be moved to a more widely known spot.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Instead of dynamically allocating a float for each element of the tensor
(lol!) save the tensor itself, and directly read out the data.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
"Unused" nodes are mapped to nullptr, and we distinguish
on lookup nodes which were never mapped versus nodes that
were mapped but supposed to be unused. This case
should never happen, but a little extra safety never hurt.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
I realized we weren't running the linter after ToToffeeIR, so
I added a lint call. It thus emerged that the current implementation
was using "Unused" nodes that were not added to the graph,
which was tripping the lint. I fixed this a few ways:
- BatchNorm and Conv primspecs were returning dead "unused" nodes
for their (implicit) handle parameters. I removed them because
setOutputs handles this already, and a dead unused node which
is not attached to the graph violates the "no dead nodes"
invariant.
- OK, but MaxPool actually needs to return an unused node for
the output which is supported by PyTorch but not Toffee; we need
to error if this output is subsequently used in the trace.
The new strategy is to have MaxPool's primspec return a None
at the unused position, and then immediately *check* if there
are any uses of that output. If there are, that's an error!
- I needed to adjust the Select invariant in the exporter loop:
only if a Select node has *uses* is it mandatory for it to be
defined in env.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Basic idea:
- Pass buffers (marked as non-Variable tensors) as input variables to
the trace. Every buffer gets represented as an input variable
to the trace, and we remember a correspondence of the underlying
TH pointer and an input variable in the trace.
- When we initially trace a function, we DO NOT record the buffers
as edges. This is so autograd doesn't have to know anything about buffers.
If we ever turn buffers into requires_grad=False parameters, then
this problem goes away.
- When we primspec the buffer, NOW we reach into the cached buffers
(now appropriately named) and gin up the buffer information we need.
Other things:
- CppOp execution is now supported (but lightly tested) using
SimpleEval (thanks @apaszke!)
Todo:
- E2E tests need to have their hacks removed.
- Figure out what is going on with backwards
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
If it's not set, CMAKE_DEBUG_POSTFIX sets it to 'd' which means the
static library gets named something different when built in debug mode.
This is annoying because it means if you build in debug mode, the
library is in a different place. Rather than teach the build system
to find the correct name, just set this POSTFIX so names don't change.
Also, update setup.py to look for the non-debug archive.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
General strategy:
- nanopb is statically linked into PyTorch. It must be built
with -fPIC.
- Generated nanopb files for toffee.proto are checked into
our repo.
- Because nanopb generated protobufs are C only, we wrote a
wrapper around it to give a Google C++ style interface.
More on this shortly.
How does the wrapper work?
- It's called "micropb" becaues it is less small than nanopb :)
- nanopb requires all variable-length fields to be written out
using a "callbacks" mechanism.
- We wrote pre-canned callbacks for all of the types ToffeeIR
writes out and lists; these are micropb_callback and
micropb_callback_list. These operate simply by dynamically
allocating and storing the data to be written out in
data (this defeats the purpose of the callback mechanism,
but it's easy to implement)
- Finally some boilerplate to actually implement the wrapper
classes and have owning pointers to the actual data.
Testing strategy:
- Take the serialized protobuf from nanopb, parse it again
with ToffeeIR and print it. Worked with all of test_jit.py!
These tests don't run without 'toffee' being installed.
TODO:
- Update CI to install ToffeeIR, so we can run the Toffee tests
in CI
- Update E2E with Caffe2 tests so that they work with new stuff.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
previously:
PythonOp/CppOp Graph -> ToffeeIR, primspecs worked with protobufs
now:
PythonOp/CppOp --ToToffeeIR--> jit::Graph of in-memory ToffeeIR -> protobufs of ToffeeIR
This commit lets primspec functions work directly with JIT IR nodes,
which makes it possible to do a lot more stuff in those functions.
Let's say I write alpha=2 in my PyTorch code. Is alpha a float
or an int? This problem is resolved when we actually pass
it to the underlying kernel, which knows what type it expects
it as.
When serializing to Toffee IR, the Toffee NodeProto also needs
to dictate the correct type; otherwise, we may guess wrong.
We get this information from the OpSchema in the ToffeeIR library.
With this, we can avoid explicitly casting in dropout.py and
auto_primspec.py
WARNING: You will need to update torch/lib/ToffeeIR when you pull
this patch, as attribute schemas were added recently to ToffeeIR.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This addresses when bias is disabled, which occurs in torchvision's
alexnet and densenet.
The general strategy is this:
- When we encounter a null variable, we turn this into a Constant
node with an undefined at::Tensor
- Toffee exports for BatchNorm and Conv have special cases for bias,
checking if they are provided by a Constant node with undefined
value, and just omit the input if so.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The general strategy:
- We put all the toffee files in torch/csrc/toffee; they will only be
added when toffee is enabled
- Toffee is enabled if torch/lib/ToffeeIR is present (since we
don't have a submodule/subtree thing going on)
- The most prevalent place you will need to use WITH_TOFFEE is for
primspec definitions on C++ autograd functions. There is a
macro HAS_PRIMSPEC to ameliorate optionally defining primspec()
virtual overrides on Function classes. HasPrimspec is always
available but will be a zero field class when Toffee is disabled.
NB: We might revert this commit in the future if we figure out a way
to unconditionally enable Toffee that everyone likes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
We want all the conversion code to live in one place. Away it goes!
This means that alexnet protobuf no longer works. It will start working
again when we port changes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This commit adds a new exporter pass which takes a graph and returns
a string of the human-readable protobuf representation of a model.
We have two strategies for how conversions are implemented:
- If a Python autograd function has a primspec static method, we invoke
it to get the Toffee conversion. Use torch.toffee.op to generate the
format expected to be returned. The particular data representation is opaque
and subject to change in the future.
- Otherwise, there's a giant if statement in the exporter, which manually
uses the JIT IR C++ API and Toffee IR C++ protobuf API to convert.
You must check out a copy of the ToffeeIR repo
https://github.com/ProjectToffee/ToffeeIR at torch/lib; at the moment
we don't have a subtree/submodule set up.
Technical debt in this commit:
- To get protobuf headers in scope, we unconditionally add $CONDA_PREFIX/include
to the include path. This needs to be replaced with a more robust mechanism.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The API works on either functions or models, taking an extra parameter argument
so that functions can pass in additional variables to trace.
Other behavior is folded into boolean options:
time - collect stats for our own perf debugging
verify - run the original code, and check it is within threshold
optimize - run optimization (currently off until fusiongroups pr is accepted).
enabled - flag to turn off tracing so you can check timing of stuff that cannot be traced.
Fixes #48.
I had to shave some yaks:
- I needed switch on Type, so I wrote a new macro set TYPE_IF,
and abstracted the IR_IF into a GENERIC_IF. The parametrization
is on const-ness and the type kind; also there is a minor annoyance
where type kinds (ugh, hate the name; it means the wrong thing
in Haskell land) don't match the class names, so there needs some
suffix munging. There's still some extra funny business, see
https://github.com/ezyang/pytorch/issues/51
- A lot of functions on types weren't declared const when they could
have been. I added const qualifiers as necessary.
- setType now takes an honest to goodness Type* rather than TypeKind.
- init_pass now preserves types when it does transformations.
There are still some places we're losing types, most notably fusion.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The approach is based on THC's pointwiseApply{1,2,3} family of kernels,
but doesn't have any dependencies on that code.
Adjacent contiguous dimensions of input tensors are compressed to reduce the complexity of indexing math.
For the completely contiguous case, the indexing logic simplifies to just the linear index.
In simple tests, this code matched or beat the equivalent from THC.
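A small Python sketch of the dimension-compression idea (not the CUDA kernel): adjacent dimensions that are mutually contiguous are merged, so a fully contiguous tensor collapses to a single flat dimension indexed by the linear index.
```
def collapse_contiguous(sizes, strides):
    out_sizes, out_strides = [sizes[-1]], [strides[-1]]
    for size, stride in zip(reversed(sizes[:-1]), reversed(strides[:-1])):
        if stride == out_sizes[0] * out_strides[0]:
            out_sizes[0] *= size  # merge with the dimension to its right
        else:
            out_sizes.insert(0, size)
            out_strides.insert(0, stride)
    return out_sizes, out_strides

# Fully contiguous 2x3x4 tensor: indexing reduces to the linear index.
assert collapse_contiguous([2, 3, 4], [12, 4, 1]) == ([24], [1])
# A strided innermost dimension blocks merging into it; outer dims still merge.
assert collapse_contiguous([2, 3, 4], [12, 4, 2]) == ([6, 4], [4, 2])
```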
- To test whether or not a multiline string matches some expected
value, you can use assertExpected. This tests that the string
matches the content stored at a file based on the name of the
test (and an optional subname parameter you can pass if you
want to assertExpected multiple times).
- Suppose you make a change that modifies the output in a big way.
Instead of manually going through and updating each test, you instead
run python test/test_jit.py --accept. This updates all of the expected
outputs. You can now review them one-by-one and make sure your
changes make sense.
We can add more features later (e.g., munging the output to make it
more stable, more sanity checking) but this is just to get us started
testing. One thing to watch out for is that accept tests on intermediate
representation can be a bit wobbly: it is *extremely* important that
people be able to read the IR. It may be worth introducing niceties
to the printer in order to ensure this is the case.
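A minimal sketch of the mechanism (an assumed helper, not the actual test harness): compare against a stored expect file, or rewrite the file when --accept is passed.
```
import os
import sys

def assert_expected(actual, expect_file):
    if "--accept" in sys.argv or not os.path.exists(expect_file):
        with open(expect_file, "w") as f:
            f.write(actual)  # record (or re-record) the expected output
    else:
        with open(expect_file) as f:
            expected = f.read()
        assert expected == actual, "output differs from %s" % expect_file
```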
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Now it gets initialized during the constructor. This results
in more boilerplate but is conceptually more correct, and solves
an assert failure.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
It is not an /expression/ we trace, but it is a /graph/: that is,
a closed expression which knows its parameters. Knowing the list
of parameters is useful and removes a hack when interpreting.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This prevents nested lets, which are not allowed in ANF. We
basically have SSA now.
There's some niftiness with the visitor returning a lambda which
then gets fed the actual argument. I like it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Although ANF style developments traditionally stratify syntactic
classes into atomic (Arg) and complex (Expr) expressions, where
atomic expressions could be variables, constants or lambdas, Zach has
successfully convinced me that we should do away with the variant here and
always require arguments to be variables. There are a few reasons for
this:
1) Tensor constants, not currently supported, could be modeled using a
"Constant" instruction, removing the need for them to be representable
directly inline. An inline constant is marginally more convenient
for peephole optimizations, but since we have gone full ANF, we are going
to need to be able to see across def-uses in any case, and it is not
too much worse to need to handle constants this way. By the way,
Swift Intermediate Language also made a similar choice, see
the slide on "Literal Instructions" in
http://llvm.org/devmtg/2015-10/slides/GroffLattner-SILHighLevelIR.pdf
2) Scalar constants, which are quite important for passing non-tensor
arguments to Python operators, are now stored out-of-band as NON
first-class values. This more closely matches the ToffeeIR design,
and makes it clear what parameters are "first class" (tensors only)
and which ones are not. However, we need to be able to unswizzle
the separate scalar/tensor lists into a unified list in the correct
format; this is what PyFunctionCConv is for.
Also, Locals got renamed into Tuple.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Previously, our AST was a DAG, where shared Nodes indicated a computation
should be reused. This commit rewrites the IR into a new functional
representation which represents sharing explicitly using variable
bindings.
We offer a few justifications for this new style:
1. The new representation is not all that different from the
old one; it is about as easy to construct, and the lack of an
explicit graph doesn't negatively impact our ability to interpret
the graph, since we've chosen, as a matter of design, to NOT have
the IR participate in the actual execution of a graph.
2. The new let-binding representation has an implicit ordering,
which we can use to conveniently keep track of the original order
the trace showed up as. This automatically gives us a topsort,
and gives us an easier to read textual representation of our
IR:
%14 = Embedding %11, %0, -1, None, 2, False, False
%15 = Dropout %14, 0.2, True, False
%16 = Index %12, 0
%17 = Index %12, 1
%18 = Index %13, 0
%19 = Index %13, 1
%20 = Index %15, 0
%21 = Linear %20, %1, %3
%22 = Linear %16, %2, %4
3. It moves us closer to a Futhark style language
(http://futhark-lang.org/publications/pldi17.pdf).
Major aspects of the diff
- Node is replaced with Expr and Arg, a pair of mutually recursive
structures which represent our new language. In BNF, the language
looks like this:
a ::= c | %i
e ::= %i, ... = e
| PyOp e, ...
| Ret %i, ...
Technically, Ret is not actually a return (no control flow is involved),
it just tuples up a series of tensors (identified by variables).
One important invariant is that locals are always tensors; they
are never constants (this is asymmetric with Args.)
- Arguments support Python constants. This is an important piece because
many operators take extra Python literals like integers and tuples in
order to specify extra parameters about how an operator operates. Adding
this was essential to getting word_language_model to work.
- As both Expr and Arg have multiple variants, there is new infrastructure
for doing case on the variants using ExprVisitor and ArgVisitor. The
strategy here is adapted from WebAssembly's visitors, although we have
generalized to permit arbitrary argument forwarding, which is necessary
to support tail-recursive visitor calls. TCO is important because our
interpreter may recurse arbitrarily deep into a stack of nested lets.
If users wish, they can also manually case on the type tag.
- Tracing is now turned on and off using _tracer_enter/_tracer_exit in
torch._C. _tracer_enter accepts a list of variables which are to be
treated as arguments; _tracer_exit accepts the list of traced variables
which should be returned when you reexecute the trace, and returns
the trace expression which can be reexecuted. GlobalTracingState
is a global variable which tracks whether or not we are tracing or not.
- You use run_forward to execute a trace on some set of parameters.
- When under tracing, variables keep track, via trace_local, what the
name of their variables in the IR are.
Here is a simple runner which leaks memory but can be used to JIT models:
import torch.autograd.function as F
import torch._C
def jit(model):
import types
real_forward = model.forward
def forward(self, *args):
def flatten(x):
return tuple(F._iter_variables(x))
if not hasattr(self, "saved_trace"):
torch._C._tracer_enter(tuple(self.parameters()) + flatten(args))
out = real_forward(*args)
self.saved_trace = torch._C._tracer_exit(flatten(out))
self.saved_outs = out
return out
else:
flat_out = Variable._execution_engine.run_forward(self.saved_trace, tuple(self.parameters()) + flatten(args))
return F._unflatten(flat_out, self.saved_outs)
Major problems:
- Sanity checking is spotty at best, especially when users pass in variables.
- The interpreter leaks tensor memory from the store. When we add back def-use
we should be able to deallocate tensors as soon as we know they are no longer
necessary.
- The interpreter needs to reach feature parity with the old execution engine.
From there, we need to see if backwards can be subsumed as well.
- I still have no confidence in having memory managed everything correctly.
This requires a close look.
- Rather than return an *open* expression as a trace, we should return a
*lambda* instead, which knows about how many formal parameters it
requires.
- The IR is not introspectable from Python at the moment, but this is simply a
matter of implementing all the binding code.
- The tracer is NOT reentrant (you can't trace while you're inside a trace.)
Furthermore, no sanity checking is done if you try to incorrectly reuse
things from one trace in another.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Simple test:
import torch
from torch.autograd import Variable
import torch._C as _C
x = Variable(torch.Tensor([4]), requires_grad=True)
y = Variable(torch.Tensor([7]), requires_grad=True)
z = x * y
z.sum().backward()
print(x.grad)
print(y.grad)
x.data[0] = 2
y.data[0] = 3
(z,) = z._execution_engine.run_forward((x, y), (z,))
z.sum().backward()
print(x.grad)
print(y.grad)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Reader checkpointing was disabled due to a bug captured in T21143272.
Now that we have resolved that issue, we are re-enabling reader checkpointing.
Reviewed By: boryiingsu, rayleichen
Differential Revision: D5730545
fbshipit-source-id: 7fae48b03e07eaf530bfc9e8e8b6683d8ed4e206
Summary:
release_blobs_when_used() analyzes when a blob is output for the last time, and inserts a Free op after that, unless the blob was aliased.
memonger.estimate_memory_usage() does a static memory analysis based on shape inference. See experimental/akyrola/test.py for example use.
Reviewed By: asaadaldien
Differential Revision: D5729199
fbshipit-source-id: 527a5152dbd4ef3bbe28b776c29163fff25f700a
Summary:
As described in task T21337239, NormalizeOp currently normalizes over only the last dimension.
In this commit, the following changes have been made:
(1) Added an axis-parameter to NormalizeOp in both the CPU and CUDA context.
(2) Added the same axis parameter to NormalizeGradient in both the CPU and CUDA context
(3) Removed the limit that the original NormalizeOp operator requires the input dimension to be 2
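A small NumPy sketch of normalization along a chosen axis, mirroring the new axis argument (illustrative only, not the operator implementation):
```
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

x = np.arange(6, dtype=np.float32).reshape(2, 3)
rows = l2_normalize(x, axis=1)  # each row has unit L2 norm
cols = l2_normalize(x, axis=0)  # each column has unit L2 norm
```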
Reviewed By: akyrola
Differential Revision: D5745162
fbshipit-source-id: 69e04f59ac4d954b0062c3b2a53c8ca465a1027b
Summary: Add tiling support to GLAdd, GLPool, and GLResizeNearest
Differential Revision: D5733208
fbshipit-source-id: b73113326b96d421787d4695ccf7d2d919ee2ed8
Summary:
It looks like this operator is missing some enforces that it should have (since
it's working on the user inputs). This diff is added enforces to ids to be in a
valid range.
Reviewed By: dzhulgakov
Differential Revision: D5488336
fbshipit-source-id: e045c3b71b92e443edd23c95aa75d144877f1334
Summary:
Here is the buggy behavior which this change fixes:
* On the first configure with CMake, a system-wide benchmark installation is not found, so we use the version in `third_party/` ([see here](https://github.com/caffe2/caffe2/blob/v0.8.1/cmake/Dependencies.cmake#L98-L100))
* On installation, the benchmark sub-project installs its headers to `CMAKE_INSTALL_PREFIX` ([see here](https://github.com/google/benchmark/blob/4bf28e611b/src/CMakeLists.txt#L41-L44))
* On a rebuild, CMake searches the system again for a benchmark installation (see https://github.com/caffe2/caffe2/issues/916 for details on why the first search is not cached)
* CMake includes `CMAKE_INSTALL_PREFIX` when searching the system ([docs](https://cmake.org/cmake/help/v3.0/variable/CMAKE_SYSTEM_PREFIX_PATH.html))
* Voila, a "system" installation of benchmark is found at `CMAKE_INSTALL_PREFIX`
* On a rebuild, `-isystem $CMAKE_INSTALL_PREFIX/include` is added to every build target ([see here](https://github.com/caffe2/caffe2/blob/v0.8.1/cmake/Dependencies.cmake#L97)). e.g:
cd /caffe2/build/caffe2/binaries && ccache /usr/bin/c++ -I/caffe2/build -isystem /caffe2/third_party/googletest/googletest/include -isystem /caffe2/install/include -isystem /usr/include/opencv -isystem /caffe2/third_party/eigen -isystem /usr/include/python2.7 -isystem /usr/lib/python2.7/dist-packages/numpy/core/include -isystem /caffe2/third_party/pybind11/include -isystem /usr/local/cuda/include -isystem /caffe2/third_party/cub -I/caffe2 -I/caffe2/build_host_protoc/include -fopenmp -std=c++11 -O2 -fPIC -Wno-narrowing -O3 -DNDEBUG -o CMakeFiles/split_db.dir/split_db.cc.o -c /caffe2/caffe2/binaries/split_db.cc
This causes two issues:
1. Since the headers and libraries at `CMAKE_INSTALL_PREFIX` have a later timestamp than the built files, an unnecessary rebuild is triggered
2. Out-dated headers from the install directory are used during compilation, which can lead to strange build errors (which can usually be fixed by `rm -rf`'ing the install directory)
Possible solutions:
* Stop searching the system for an install of benchmark, and always use the version in `third_party/`
* Cache the initial result of the system-wide search for benchmark, so we don't accidentally pick up the installed version later
* Hack CMake to stop looking for headers and libraries in the installation directory
This PR is an implementation of the first solution. Feel free to close this and fix the issue in another way if you like.
Closes https://github.com/caffe2/caffe2/pull/1112
Differential Revision: D5761750
Pulled By: Yangqing
fbshipit-source-id: 2240088994ffafdb6eedb3626d898b505a4ba564
Summary:
**Description**
Provide DeepText model with the functionality to load a secondary index (pre-trained char-ngram embedding, e.g. FastText) during training/test. Embeddings of out-of-vocabulary words will be computed on-the-fly during training/test by averaging the char-ngram embeddings.
**Approach**
This diff provides two custom operators to accomplish this task – ConditionalOp and IndexCharNgramGetOp. We first use IndexCharNgramGetOp to perform char-ngram index lookup and return a sparse tensor segmented by lengths for each token. The sparse tensor is then used to compute the average embedding provided by the char-ngram index. Finally, we use a ConditionalOp to replace those whose embeddings were not found in the original index during the feature apply stage. Please refer to documentations of the code for more details.
Reviewed By: jamesr66a
Differential Revision: D5666924
fbshipit-source-id: f76605d093154a014d5b9ebf9510de9d79874eee
Summary:
CuDNNWrapper's inline_cudnn_handle() should set the stream every time, since it can change. This caused problems in RNN scenarios. Also, this bug rendered singlethread_async_net incorrect / slow!
I found out the problem by using nvprof --print-gpu-trace and noticing that some kernels were run in a different stream than I expected.
Reviewed By: ajtulloch, Yangqing
Differential Revision: D5758426
fbshipit-source-id: 651c62fe28eaf09e1675d4adf3f1fac8b4c8e75b
This respects all the broadcast cwrap specifications except for 'fallback';
i.e. pointwise functions operating on tensors where the number of elements
match but the sizes are different and not broadcastable. This behavior is
currently deprecated in PyTorch. Note that this is a breaking change in ATen,
because ATen just passes through to TH/THC, where the fallback behavior is
actually implemented.
This also changes expand semantics wrt Scalars (as tensors). Previously,
one could 'expand' a 1-dimensional tensor with size 1 to a 'scalar' (i.e.
empty size initializer list).
Summary:
Replaced std::copysign(x) with (x > 0 ? 1 : -1).
std::copysign is not available on some Android platforms which was detected in GitHub's Travis tests:
"/home/travis/build/caffe2/caffe2/caffe2/sgd/yellowfin_op.cc:57:23: error: 'copysign' is not a member of 'std'"
Reviewed By: akyrola
Differential Revision: D5756384
fbshipit-source-id: 56bc220d2c6216ff45b9cc47ed02aebf6ad439a5
Summary: Disabling a YellowFin test that does not pass in Travis. The difference comes from numerical reasons; the test passes on my CPU / math libraries. Decide whether to merge it.
Reviewed By: Yangqing
Differential Revision: D5754144
fbshipit-source-id: b6ed6628f962d6904a8d522f0cf4080d7878acad
Summary: Make a CUDA version of SparseToDense, and register EnsureDense (which is trivial) on CUDA. Need to use atomics because indices can be duplicated. We can later add an option to indicate whether the indices are unique, and use a faster path then.
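An illustrative NumPy sketch of why duplicate indices force atomic accumulation (not the operator code):
```
import numpy as np

indices = np.array([0, 2, 0])                      # index 0 appears twice
values = np.array([[1., 1.], [2., 2.], [3., 3.]])
dense = np.zeros((4, 2))
np.add.at(dense, indices, values)                  # rows with equal indices accumulate
# dense[0] == [4., 4.], dense[2] == [2., 2.]
```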
Reviewed By: jhcross
Differential Revision: D5750893
fbshipit-source-id: 005d1675b127a571aac8474fca62d9633f0c7bff
Summary:
Implementation of a new variant of attention module, which contains a recurrent decoder state with vectors corresponding to each source-side word and strictly increasing values, thus enabling it to model the degree to which source words have been translated.
The approach is a variant of the approaches described in https://arxiv.org/pdf/1601.04811.pdf. We simply include the sum of all previous attention weights for encoder words as a new recurrent state (coverage_t). A new linear transform on encoder_outputs is used to produce coverage_weights, which has the same dimensionality as encoder_outputs, and implicitly models the fertility of source-side words (and putting this extra information strain on the encoder network).
Thus the encoder output, the decoder state, and the coverage weights have the same dimensionality for a given source word, and attention logits are calculated as v * tanh(coverage * coverage_weights + encoder_output + decoder_state).
Note: the entire coverage state for each translation instance is of shape (encoder_length, coverage_units), but the states for the RecurrentNetwork operator, used to train the decoder, must be flat in the data dimension. This state is therefore initialized with shape (encoder_length * coverage_units) [not shown in the open-source library] and reshaped appropriately within the apply_soft_coverage_attention() function.
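A shape-level NumPy sketch of the logit computation described above (dimension names and sizes are assumptions, not the training code):
```
import numpy as np

src_len, dim = 5, 8
encoder_output = np.random.randn(src_len, dim)
coverage_weights = np.random.randn(src_len, dim)  # implicit per-word fertility
decoder_state = np.random.randn(dim)
v = np.random.randn(dim)
coverage = np.zeros((src_len, 1))                 # sum of past attention weights

logits = np.tanh(coverage * coverage_weights + encoder_output + decoder_state) @ v
attention = np.exp(logits) / np.exp(logits).sum()
coverage += attention[:, None]                    # strictly increasing per source word
```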
Differential Revision: D5593617
fbshipit-source-id: 7d0522b5eb0b26f22e8429e4461a459f2f16ed46
Summary: basic little op benchmark generator -- outputs init_net.pb and predict_net.pb for use with speed_benchmark or mobile_speed_benchmark
Reviewed By: Maratyszcza
Differential Revision: D5728534
fbshipit-source-id: 3e912fa63548497ca65ab34c8bb967694c46815b
Summary: Turns out NCCL can deadlock with cudnnSetDropoutDescriptor, so we need a lock.
Reviewed By: pietern
Differential Revision: D5748325
fbshipit-source-id: b3828c50f6acfc4b5323008ec04f571f6d0d5586
Summary: Added super rough conv cost inference that takes into account very few params
Reviewed By: Maratyszcza
Differential Revision: D5412611
fbshipit-source-id: f662822fd5a532eacb525fbc361e8a62f32430a8
Summary: TEST_benchmark will print out gflops if it can infer them
Reviewed By: Maratyszcza
Differential Revision: D5412644
fbshipit-source-id: 3af7bb42cda4684e30db6d8ae5484d441898479c
Summary:
It looks like one of the rebases that I have been doing on this op has
completely messed up my code and I accidentally removed the
TensorInferenceFunction for SliceOp. This diff adds it back.
Reviewed By: akyrola
Differential Revision: D5745305
fbshipit-source-id: 5266c9e14c7d55be5a9cc96688e128db79547b1a
Summary: Adding support to use kernels, strides, pads etc. as arguments.
Reviewed By: houseroad
Differential Revision: D5710699
fbshipit-source-id: 8b63af4c4a76cd06b637a376aeb29a34c659be2e
Summary: This will allow doing data reading in small batches and concatenating the batches later on.
Reviewed By: kennyhorror
Differential Revision: D5739129
fbshipit-source-id: 66a8087e5f9d10d654e367c6111ac90cbf54224e
Summary: Check for nullptr before closing a common world.
Reviewed By: pietern
Differential Revision: D5746256
fbshipit-source-id: d395bf60d3b7f2c2629761d2b6fd46085683390c
Summary: Both D5695197 & D5691262 implement the tensor inference function for Gather. Keeping only one.
Reviewed By: akyrola
Differential Revision: D5742331
fbshipit-source-id: 1c31427fbfbc87bfec84b8c04851275f45154fcf
Summary:
Added YellowFin optimizer to Caffe2.
This implementation is different from the original: It has separate alpha and mu for each parameter and it uses a different version of Momentum SGD.
Tests / benchmarks for the optimizer are to be done. Some refactor of the code is to be done before pushing. This is still a working version.
Reviewed By: akyrola
Differential Revision: D5652689
fbshipit-source-id: c10dc0424f47c3051b454aede1d121902cb759a8
Summary:
1) Adds monitoring of CPU utilization in trainers and PS's, and reports the utilization to global statistics
2) Adds the plan execution time to global stats
3) Uses the CPU utilization and network utilization observed from the performance estimation job to calculate the optimal number of parameter servers needed for the actual job. The optimal number of parameter servers is the minimum number of servers needed such that the parameter servers are not the bottleneck in execution.
Note: The calculation assumes that parameter shards are assigned to PS's in a uniform way and accesses to the shards follow a uniform access pattern. In reality, shard access patterns may be skewed. As a next step, we should monitor shard access patterns in the performance estimation job and distribute the shards in the optimal way.
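A back-of-envelope sketch of the sizing rule; the numbers and the exact formula below are assumptions for illustration only.
```
import math

trainer_demand_mbps = 8 * 1200   # e.g. 8 trainers pushing ~1200 Mbps each (assumed)
ps_capacity_mbps = 9000          # usable network bandwidth per parameter server (assumed)

# Smallest PS count for which the parameter servers are not the bottleneck.
num_ps = max(1, math.ceil(trainer_demand_mbps / ps_capacity_mbps))
assert num_ps == 2
```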
Reviewed By: sf-wind
Differential Revision: D5674398
fbshipit-source-id: 67a07cb9ed4e4d61ff5e81a0ecfe519b8feb2352
Summary: .h and .c files with YellowFinOp. .cu and test files will be included in next commits.
Reviewed By: akyrola
Differential Revision: D5724198
fbshipit-source-id: b05b9c047af25f9081641a0fe0cdba2ee74cb04b
Summary:
Currently the loss ops are still not on GPU even though ALL strategy is selected.
This diff is to enable it.
Reviewed By: xianjiec
Differential Revision: D5671255
fbshipit-source-id: 033863f171e1f89c8d75430d3af6a1e6d0d2eff2
Summary: Layer by layer comparison between CPU and GPU verified within 1% scale precision
Differential Revision: D5714594
fbshipit-source-id: f4ddee60c317aeeae4c7f3f9ac299fddf9057761
Summary:
Use HINTS instead of PATHS for find_library so that you can specify
-DNCCL_ROOT_DIR and it will use this NCCL installation regardless of
what else is installed on your system. Also add a path hint to include
the default base path for NCCL 2 libraries.
Closes https://github.com/caffe2/caffe2/pull/1152
Reviewed By: Yangqing
Differential Revision: D5740053
Pulled By: pietern
fbshipit-source-id: 43f0908a63e8a9b90320dece0bbb558827433b48
Summary: The GPU op was broken. Copy over the scalar data so that it can be used to construct the output tensor.
Reviewed By: akyrola
Differential Revision: D5733170
fbshipit-source-id: dfc800b9a408eaeb7f9abefbb640e10074204add
Summary:
This was a tricky one to debug. After pulling from master, my build
was complaining that certain identifiers in updated source files were
undefined. After building with VERBOSE=1, extracting the compilation
commands, and adding -M, I saw that CMake had included the Caffe2
installation directory as include path. Worse yet, this path had
precedence over the path to the actual source code. The compiler
included older headers when compiling newer source files.
This change forces the path to the Caffe2 source code to take
precedence over all other include paths. The only path that takes
precedence over *that* path is PROJECT_BINARY_DIR, which holds the
headers that are generated at compile time.
Closes https://github.com/caffe2/caffe2/pull/1140
Reviewed By: Yangqing
Differential Revision: D5727133
Pulled By: pietern
fbshipit-source-id: c60c89e82e8b1ab1cfca0907d31b84417788d79b
Summary: arxiv link to batch-norm paper was broken because dot(.) was included at the end
Reviewed By: zem7
Differential Revision: D5734405
fbshipit-source-id: e037c14091e7f9e415c2f7a3008cbf2bf066e699
Summary: Adding support for integer textures and thus the Galaxy S6 among other devices
Differential Revision: D5695151
fbshipit-source-id: 46514e5aa931f98f8c7c82ec923e7803bcaa9bc0
Summary: The default CUB settings led to very slow execution in practice when using "dynamic" memory allocation with C2 (i.e. freeing blobs after their use). After some tinkering, I arrived at these numbers, which work with resnet-50 and an NVIDIA M40 GPU much better than the original defaults. Also made the maximum allocated memory configurable.
Reviewed By: Yangqing
Differential Revision: D5732930
fbshipit-source-id: 9ff34f49d5a3eb138bc6f44c82918731a35325a6
Summary:
Reshape op's gradient op will have the original shape stored in a blob. Shape inference won't work directly because the shape inference function does not have access to blob contents.
In this case, I think making a special exception in the shape inference system is justified: we store the output of reshape in a reshape-cache, and pass that in a backward pass.
Also include my experimental test script that I used for NeuralMT CNN model.
Reviewed By: asaadaldien
Differential Revision: D5721502
fbshipit-source-id: fdc8ab901d3bee2c4621ee5140a5435e49f4471d
Summary:
This diff adds control flow operators in Caffe2 (starting with If, While):
- Added If operator that executes then/else subnet
- Branch subnet is executed in a separate isolated workspace, with some of the blobs transparently forwarded from the outer workspace
- Adding a new NetBuilder subclass to construct nets using new operator
- NetBuilder also keeps track of outer blob names and automatically sets blob bindings between outer and inner workspace, implementing generic convention on handling local/global variables in blocks
Reviewed By: volkhin
Differential Revision: D5720644
fbshipit-source-id: a674cde0c789f6a6ffdcd9d80159d1e42e49133f
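For readers unfamiliar with the new control flow operators, here is a rough Python sketch of what using the If operator could look like. The then_net/else_net argument names and the exact blob forwarding are assumptions based on the description above, not a verbatim API reference.
from caffe2.python import core, workspace
import numpy as np

# Tiny then/else subnets; each one writes the blob "y".
then_net = core.Net("then_branch")
then_net.ConstantFill([], "y", shape=[1], value=1.0)
else_net = core.Net("else_branch")
else_net.ConstantFill([], "y", shape=[1], value=-1.0)

workspace.FeedBlob("cond", np.array([True]))
# The condition blob lives in the outer workspace; the chosen branch runs in an
# isolated child workspace with selected blobs forwarded, as described above.
if_op = core.CreateOperator(
    "If", ["cond"], ["y"],
    then_net=then_net.Proto(),  # assumed argument names
    else_net=else_net.Proto(),
)
workspace.RunOperatorOnce(if_op)
print(workspace.FetchBlob("y"))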
Summary: As titled. Direct adaptation of the operator code.
Reviewed By: azzolini
Differential Revision: D5721174
fbshipit-source-id: cc9d4c916d7d79d202a344f29ef384ddc68f4988
Summary:
Add tiled vs. batched comparison for models
Add more logging to GLPadImage
Differential Revision: D5718546
fbshipit-source-id: fdd4f0aabc41cb3b86b6f0ccf8e618a15170ceae
Summary: While there is currently support for scaling the base learning rate when loading the model, there is no support for scaling the base learning rate during training. This is needed for LATTE's seq2seq translation models, as the learning schedule is not predefined and is modified at runtime.
Reviewed By: jhcross
Differential Revision: D5701391
fbshipit-source-id: ae3bec45f238db1a2be7af9c04d720067e9095d5
Summary:
These are wrapper functions so that if we run in a Caffe2-only mode, we can
turn the flag on and get some small speedup on cuda device switches.
The purpose of the diff is to allow us to quickly assess the overhead of cuda
device switch functions. Ideally, the caching behavior shall live in the cuda
driver, which is the only safe place to ensure correctness.
If other code is running alongside Caffe2 and does not properly do device guard,
this functionality will fail as separate cudaSetDevice() calls will not update
Caffe2's thread local device id. As a result, the functionality is only enabled
when/if one explicitly sets the flag.
This might not be safe, so use with caution.
- cudaGetDevice can go from 90ns to 2ns
- when setting the same device, we can go from 100ns to 2 ns
- when setting a different device, things are the same (1ns overhead on top of 143ns)
Reviewed By: azzolini
Differential Revision: D5709398
fbshipit-source-id: 6255f17a3d41f59a30327436383f306a2287896e
Summary: When we ported to memonger to C++ in D5544219, we forgot to include the special handling of RecurrentNetwork ops. This fixes that and adds a test.
Reviewed By: asaadaldien
Differential Revision: D5692407
fbshipit-source-id: 4e739b5dd6c7298303eee9bfa1aa4d19359eb7b5
Summary:
Before this diff, we were not respecting in-place blobs. E.g. if we had:
with DeviceOption(CPU):
    blob = net.MyOpA([])
with DeviceOption(CUDA):
    net.MyOpB([blob], [blob])
After the InjectCrossDevicesCopies we would have:
blob = net.MyOpA([], device=CPU)
blob_cuda0 = net.Copy([blob], [blob_cuda0], device=CUDA)
net.MyOpB([blob_cuda0], [blob], device=CUDA)
As the example shows, we were not respecting in-place blobs. After this diff, we keep the in-place blob.
Reviewed By: harouwu
Differential Revision: D5671867
fbshipit-source-id: 6ad68c612dae19d7e1f45f4988d929644100b4d5
Summary:
Turns out that due to the cmake improvement by lukeyeager, we no longer rely on compiler flags but on the macros.h file to obtain CAFFE2_USE_MKL. This requires some minor changes in the MKL implementation to properly capture the macro before testing it.
Closes https://github.com/caffe2/caffe2/pull/1124
Reviewed By: jerryzh168
Differential Revision: D5705134
Pulled By: Yangqing
fbshipit-source-id: 6f6ad820cdd826818c12cf5aa344533a9324dbe2
Summary: Add an op to explicitly close common world connections, thus helping propagate closures when errors happen. Requires D5661477.
Reviewed By: pietern
Differential Revision: D5660476
fbshipit-source-id: 85791686691305abd96b082a6f68e4427ba14fbb
Summary:
This diff adds control flow operators in Caffe2 (starting with If, While):
- Added If operator that executes then/else subnet
- Branch subnet is executed in a separate isolated workspace, with some of the
blobs transparently forwarded from the outer workspace
- Adding a new NetBuilder subclass to construct nets using new operator
- NetBuilder also keeps track of outer blob names and automatically sets
blob bindings between outer and inner workspace, implementing generic
convention on handling local/global variables in blocks
Reviewed By: azzolini
Differential Revision: D5641588
fbshipit-source-id: f9e04429961c3da7da4ebca3e8163bfcc2a09ec9
Summary:
_LSTM helper is a legacy piece we had before all the RNNCell awesomeness landed. Now we need to pull it apart and create separate building blocks that people can use for any RNNs.
Please note changes to a test with double scoping. That should go away once we change RNNCell scoping logic in such a way that each cell adds its own name to the scope for all of its outputs (see another diff: D5613139).
Reviewed By: jhcross
Differential Revision: D5632276
fbshipit-source-id: 1cb568ab995c4c0b3dd1b4bad2d028e34bded9c1
Summary:
This includes the commit that adds `close()` to gloo::transport::Pair.
Closes https://github.com/caffe2/caffe2/pull/1127
Reviewed By: akyrola
Differential Revision: D5708513
Pulled By: pietern
fbshipit-source-id: 8ef505d48b3bfa1576c068c4e4a29c9a8ed5efc7
Summary: These were missing and required for some seq2seq models. Unit tested. The previous implementation of ReduceBackMean shape inference was incorrect, so removed it.
Reviewed By: asaadaldien
Differential Revision: D5691262
fbshipit-source-id: 76f868b298440f988635966a410f0232301ca6c4
Summary:
I ran into an issue where a subset of packages were found in the
Anaconda path. This path also contained includes for other packages
and the Anaconda path inadvertently took precedence over the intended
include path. The new `caffe2_include_directories` helper is a hacky
attempt to "fix" this by deprioritizing Anaconda paths in the hope
that intended include paths are searched before Anaconda.
Closes https://github.com/caffe2/caffe2/pull/1121
Reviewed By: Yangqing
Differential Revision: D5701819
Pulled By: pietern
fbshipit-source-id: 908284cd4ea6c8167774e4e3fcc4dc0ca8a23110
* Support double backwards for AdaptiveAvgPool1d and AdaptiveAvgPool2d.
* Support double backwards for ReplicationPad2d, ReplicationPad3d, and ReflectionPad2d.
* Support double backwards for FractionalMaxPool2d.
* Support double backwards for MaxUnpool1d and MaxUnpool2d.
* Circular recursive imports not supported in python 2.
* Address review comments.
* Add examples in functional.py
Added examples for F.cross_entropy, F.binary_cross_entropy and F.binary_cross_entropy_with_logits.
* Add ` for PyTorch docs
Added ` for PyTorch docs.
* Add examples in loss.py
Added examples for nn.BCELoss and nn.BCEWithLogitsLoss.
Summary:
Split the first dimension of a tensor into 2, the first of which is fixed and given in the argument.
This is then used to split a batch into smaller batches and distribute them across workers.
Reviewed By: harouwu
Differential Revision: D5702175
fbshipit-source-id: 02bb93e49bf9db411b516e149c8e647301dd2ca5
Summary:
CNMEM was deprecated by commit c59f291 and is not used anymore by
Caffe2. It was superseded by CUB.
The git submodule can now be removed.
Closes https://github.com/caffe2/caffe2/pull/1118
Reviewed By: Yangqing
Differential Revision: D5699492
Pulled By: pietern
fbshipit-source-id: 44627ed038f37c12312889bb27691db426ad122f
Summary:
The PATHS suggestion to find_library is searched after everything
else. By using HINTS, it searches CUDNN_ROOT_DIR much earlier, avoiding
potential conflicts with other paths that have the CuDNN header.
Closes https://github.com/caffe2/caffe2/pull/1122
Reviewed By: Yangqing
Differential Revision: D5701822
Pulled By: pietern
fbshipit-source-id: 3f15757701aff167e7ae2a3e8a4ccf5d96763a0c
Summary: This test was failing on non-GPU builds because it refers to operator CopyGPUToCPU. Thanks pietern for catching this.
Reviewed By: asaadaldien
Differential Revision: D5698763
fbshipit-source-id: 0bde0f3e99c58647dba2ea6da4d51938e763d10c
Summary: Moved code for global norm-based gradient clipping from fb specific workflows (seq2seq) to the open-source caffe2 optimizer library
Reviewed By: jhcross
Differential Revision: D5637453
fbshipit-source-id: 7e73c9a1c97c28a152c188467b27a6449f79242e
Summary: I was assuming left padding == right padding and top padding == bottom padding, but actually they could be different, which results in different output size.
Differential Revision: D5693719
fbshipit-source-id: 32595652231da0cf1ec269dc34fa87df23732328
Summary: Currently, it's not easy to track down which tensor is missing type and shape info. Print it out for easier debugging.
Reviewed By: volkhin, xianjiec
Differential Revision: D5695223
fbshipit-source-id: 7f0be0be777a35bb5a71b3799b29b91f0763c159
Summary: Make Gather more convenient to use in layer model
Reviewed By: xianjiec
Differential Revision: D5695197
fbshipit-source-id: aa0406ea39af5b6980ee6fd3bb11250732caac00
Summary:
Today, the PS's weirdly store the entire embedding and not just their
subsection of it. This was simply an oversight on the part of the original
author and this diff fixes that.
1. The sparse params are sharded to the PS's and the PS's just store their section
of the embedding. The trainer requests the id's as is from the PS. But the PS
divides the id by the num_of_shards before looking it up in the embedding table
blob. This happens on the backward and the forward pass. However, during the
model download part, the PS multiplies the embeddings with the num_of_shards
before returning them to the trainer. The upshot is that the trainer does not
know anything about how the embeddings are scaled on the PS. The PS adds extra
divide and multiply steps to achieve that.
2. During estimation time, we allocate just one PS for estimation. So in order
to make all of the embeddings fit on the single PS: We simply additionally
scale the hash table sizes (proportionally and equally for all the sparse
params) such that it fits. This scaling is handled analogously to (1).
Reviewed By: boryiingsu
Differential Revision: D5664093
fbshipit-source-id: 92f501f61566f939c41ce0b614a1b499669f978a
Summary: The operators were lacking some float16 stuff: Extend ScatterAssign for float16. In addition, introduce a constant fill for float16. This needs to be a separate operator instead of ConstantFill, since the latter is in OSS and hence cannot use the Float16 stuff that is fb specific.
Reviewed By: azzolini
Differential Revision: D5664071
fbshipit-source-id: 5b84f625693b6ddddd8b7a35f1541ae40df49fbe
Summary:
This adds a fast path for global max pooling with NCHW. Compared to equivalent ReduceBackMean, this is about 3.5x faster.
Based on D5533059.
Reviewed By: akyrola
Differential Revision: D5681122
fbshipit-source-id: 7a4df934044c7dd01888f095f7dd46654aaf4eae
Summary: Also enforce the "from_type" argument is supplied when getting gradient
Reviewed By: Yangqing
Differential Revision: D5684399
fbshipit-source-id: bee955d44a04c44142b2212cff548cea6e08b22f
When working on PyTorch dependencies we often want to rebuild only that
dependency and the Python extension. You can now do that by running:
python setup.py build_thc
to only re-build THC
Summary: extend pairwise dot product for different number of embeddings on x & y dimensions
Differential Revision: D5663553
fbshipit-source-id: 1743a2c101cb8c0fc1f0f3d89c19530802400ec6
Summary:
The original diff was unlanded because the fbcode-target-determinator tests were not run; recreating a new diff with the same change to trigger the tests.
CUDNN should be almost always faster than the default implementation
Reviewed By: salexspb
Differential Revision: D5637156
fbshipit-source-id: 413a08acba7a83502be6199fcb524ab46f1fd4ce
Summary:
Better isolation for workspaces to allow forwarding selected blobs
from parent to child workspace, possibly under new names. Used for proper
isolation of subnets (loops, then/else branches, etc.) from the outer workspace.
Reviewed By: azzolini
Differential Revision: D5681667
fbshipit-source-id: e61a2c7c98ee2abf1f0761905f4bfae47c201c32
Summary: With these changes, Conv, ConvTranspose, PRelu, and Relu work with tiling now. The default is still batching.
Differential Revision: D5623321
fbshipit-source-id: 07aa378d24165ec19e751cd79c70dea995003be9
Summary: Making it more convenient to wrap code in a context
Reviewed By: boryiingsu
Differential Revision: D5680991
fbshipit-source-id: 07b7e4d5aa657184039a7d18192b68fe11c1a570
Summary:
Using file(WRITE) caused the file to be rewritten for every CMake
reconfigure, which was causing unnecessary full rebuilds of the project
even when no source files changed.
The new strategy has the added benefit of enforcing that the macros.h file
is always generated correctly. When the main project relies on this
header for macro definitions (instead of relying on add_definitions()),
we can be more confident that the project will build correctly when used
as a library (which is the whole point of the macros.h file).
Upsides:
* No more unnecessary rebuilds
* Higher confidence that the project will compile properly as a third-party library
Downsides:
* Developers need to add an entry to `macros.h.in` whenever they would have added a new definition with `add_definitions()`
Closes https://github.com/caffe2/caffe2/pull/1103
Differential Revision: D5680367
Pulled By: Yangqing
fbshipit-source-id: 4db29c28589efda1b6a3f5f88752e3984260a0f2
Summary: In case the whole function should be wrapped in a certain context, this makes it less ugly.
Reviewed By: xianjiec
Differential Revision: D5665253
fbshipit-source-id: ecdc6b1a08e91bae6a4352341f97ee37f3aa677a
Summary:
I discovered this while investigating more build-caching issues like https://github.com/caffe2/caffe2/pull/1103.
> If a relative path is given it is interpreted relative to the value of the CMAKE_INSTALL_PREFIX variable.
https://cmake.org/cmake/help/v3.0/command/install.html
This is a non-functional change - it just makes the code a bit easier to read. I verified locally that the resulting install directories are identical.
Closes https://github.com/caffe2/caffe2/pull/1111
Differential Revision: D5677328
Pulled By: Yangqing
fbshipit-source-id: 9bb1bfe85fc0bc54a9b7ce33cc31e45ea061d21e
Summary:
Optimizations for SinusoidPositionEncodingOp to make sinusoid position embeddings
more competitive against table-based embeddings.
- Removed most calls to std::pow
- Replaced division with multiplication by the reciprocal
- Reused computation across examples within a batch
Current speedup with batch size of 16, sequence length of 128 and embedding
size of 512 is about 270x (17k embeddings per second -> 4.7M embeddings per
second). The speedup is very dependent on the batch size; at a batch size of 4
this only gets 1.7M embeddings per second.
Profile: https://pxl.cl/8zf0
Annotated DoRunWithType: P57925031
Reviewed By: jamesr66a
Differential Revision: D5634766
fbshipit-source-id: 0f35bb176164ea547c91de242a0205c5d7adf7cf
Summary: Not sure it is correct in general, but it works as long as we have one blob per GPU.
Reviewed By: harouwu
Differential Revision: D5671891
fbshipit-source-id: 739475101e9b509bc521e268c5b308faa36800e7
Summary:
This adds Event as a new member object to OperatorBase, hence allowing us to do
async computation more easily. Will send a fix for proper RunAsync() for
SimpleNet.
In principle this should have no functionality change yet - the only difference
is that async_dag net now delegates to the operators for holding the event
objects.
Reviewed By: harouwu
Differential Revision: D5668627
fbshipit-source-id: 55f994074be6b85d6c66f09795dcbe2b93aba300
Summary:
https://arxiv.org/abs/1704.04374 is a simple, stateless library that
implements a high performance tensor transposition abstraction - it's
substantially faster than what we have. I think instead of going through an
engine specialization on the CPU side, we can just add this path, since there's
no value (in terms of state management, etc) for having it separate?
We could cache the plan, but it's so cheap to create in these tests.
Reviewed By: jonmorton
Differential Revision: D5534519
fbshipit-source-id: de2fd64fee11be259656b0f02f42a62b7035e3d3
Summary: Disable mpscnn for 10.0.2 temporarily since I can't reproduce the crash
Reviewed By: ajtulloch
Differential Revision: D5665269
fbshipit-source-id: 2f95ba591099078a0347f7ea7bfa82dc37005228
Summary: This is a patch for the recent change for Events. ajtulloch caught this one.
Reviewed By: harouwu
Differential Revision: D5663317
fbshipit-source-id: 471a24f594583669bcd5bbf2fabaeb5664bd0bb7
Summary:
Add more data augmentation to ImageInputOp
1) Inception-style random sized cropping
2) color jittering
3) color lighting
Reviewed By: panshen1
Differential Revision: D5637726
fbshipit-source-id: 45d9cc69eec9f4d48c1607d80ccd89e325961b1a
Summary:
1. Uses the upload_builder in the offline training.
2. Adds the checkpoint taskgroups to the online trainer.
3. Changes the naming rules so that the model checkpoint has the format of
<directory>/<entity_id>_<snapshot_id>.<node_name>.<snapshot_id>
Reviewed By: rayleichen
Differential Revision: D5665068
fbshipit-source-id: a8103aed2ca195a506174d2a1d50611d2f1d9c35
Summary:
A new transform, which combines common subexpressions (where an "expression" is one operator), reducing repeated work.
This version is shippable, but one problem:
This transform will also combine operators which write to external_output, which will make behavior incorrect.
Reviewed By: bwasti
Differential Revision: D5629886
fbshipit-source-id: 2bf9f459e2ca633fddc57de85c9fc75845783099
Summary:
There are ad-hoc efforts on avoiding excessive device synchronizations, such as
async_dag, singlethread_async, etc. This diff aims to provide an early design
for a general Event class, that can achieve the following:
(1) It is device agnostic, essentially using a vtable to do cross device record,
wait and synchronization.
(2) Created new functions WaitEvent and Record in the Context class for
interacting with Events.
(3) Exposed the corresponding WaitEvent and Record functions in the OperatorBase
class as well.
An example use case is that, after potential future refactoring, one can achieve
a real async execution per operator by running
op.WaitEvent(previous_event);
op.RunAsync();
op.RecordEvent(this_op_event);
and the next op can do
next_op.WaitEvent(this_op_event);
Right now, I changed async_dag net implementation so that it uses the general
event design. The old Event class is assimilated to the general Event class and
the old Stream class is now essentially taken over by the Context class itself.
Reviewed By: harouwu
Differential Revision: D5648463
fbshipit-source-id: 58bd84d06e4a9977b0b835110ddb2f18be3b7cbc
Summary:
Adding a range operator in the spirit of np.arange. It is an imporant building block for a lot of manipulation functions.
This accepts parameters with the same meaning in the same order as python's range or np.arange (e.g. `(stop)`, `(start, stop)` or `(start, stop, step)`)
Differential Revision: D5616861
fbshipit-source-id: 02622b8bd85ebca125cc881c06fae5b54b7c602a
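As a rough illustration, a call mirroring np.arange(start, stop, step) could look like the sketch below; the operator name Range and the use of scalar input blobs for the bounds are assumptions based on the summary above.
from caffe2.python import core, workspace
import numpy as np

# Scalar bound blobs; with three inputs this mirrors np.arange(start, stop, step).
workspace.FeedBlob("start", np.array(2, dtype=np.int32))
workspace.FeedBlob("stop", np.array(10, dtype=np.int32))
workspace.FeedBlob("step", np.array(2, dtype=np.int32))
op = core.CreateOperator("Range", ["start", "stop", "step"], ["out"])
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("out"))  # expected [2 4 6 8], as with np.arange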
Summary: The new test ensures the 'add_axis' and 'split' arguments work as intended for tensors of various dimensions. Hypothesis should check various edge cases like zeroes in 'split_info' and 1D input with axis=0, add_axis=1.
Reviewed By: hoangmit
Differential Revision: D5645778
fbshipit-source-id: 061f9511a082da54e5c1bbe53a0e7096af4b8d1b
Summary:
Add ability to specify a range for randomly scaling to a new shortest side. For example, for Resnet50 training, one would set `random_scale=[256,480]` in the `ImageInput` operator to resize to a random shortest side in the range [256, 480]
Closes https://github.com/caffe2/caffe2/pull/1106
Differential Revision: D5653336
Pulled By: harouwu
fbshipit-source-id: 9c353fbe2bf2207e01bc51d14487de323c68af7b
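A schematic sketch of how the new argument would be used follows; the DB path and the other arguments are illustrative placeholders, and a real ImageInput call may need additional required arguments from the op schema.
from caffe2.python import model_helper

model = model_helper.ModelHelper(name="resnet50_input")
reader = model.CreateDB("train_reader", db="/path/to/train_lmdb", db_type="lmdb")
data, label = model.net.ImageInput(
    [reader], ["data", "label"],
    batch_size=32,
    crop=224,
    random_scale=[256, 480],  # resize the shortest side to a random value in [256, 480]
)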
Summary:
Tests shouldn't rely on operators defined in other tests, because there is no guarantee that they will build together.
transform_test and graph_test did this, and this fixes it.
Reviewed By: jerryzh168
Differential Revision: D5657635
fbshipit-source-id: e628fe1791a64bb124cdd8c59e80c0d915bfb281
Summary:
Use cub DeviceReduce, improving the speed from 23k to 26k, but still far
from the 100k reached without dedup.
The bottleneck is the UniqueOp.
Reviewed By: harouwu
Differential Revision: D5633828
fbshipit-source-id: e96b8f7317d01c5388c072e7dcfe987abcb01b67
Summary: So far we format the epoch name with 6 digits, but this is constraining. In order to have consistent naming, we can simply append the epoch to the suffix. Then we will have consistent naming rules for small and for large epoch numbers.
Reviewed By: azzolini
Differential Revision: D5653871
fbshipit-source-id: acdf26a14b731347bb85fe2f33c1b89e2ba83bdd
Summary:
This does not change any existing code behavior - as part of the event
abstractions, this is a cautious step to reduce the interfaces exposed
from contexts. Nothing else is changing.
Reviewed By: harouwu
Differential Revision: D5656597
fbshipit-source-id: 53c5caf278613e610daf6ad3ca4bb6da73367cfc
Summary: In forward-only mode, we need only 2 workspaces. Erroneously, we resized the workspace vector to length 2 if it was different from 2. But if it was longer (because the step workspaces were shared by a non-forward-only op), we ended up deleting the workspaces. With the RNN executor, this is a problem, because it held a reference to the deleted workspaces. Without the RNN executor, we just ended up recreating the nets.
Reviewed By: jhcross
Differential Revision: D5654534
fbshipit-source-id: 1e6276e63453831747fee6a85c5057f01b89fde5
Summary:
Travis CI is complaining about test_load_model_from_checkpoints in recent PRs.
E: AssertionError: 'trainer:1/task/GivenTensorInt64Fill:0, a C++ native class of type nullptr (uninitialized).' != array([103])
See for example https://travis-ci.org/caffe2/caffe2/jobs/265665119
Reason unknown yet. First disable this, then try to fix it.
Reviewed By: Yangqing
Differential Revision: D5655068
fbshipit-source-id: 10949339ec92b0a4c2f0e59246040f1b0510be12
Summary: Add a small fix so that the divisor won't be 0.
Reviewed By: kittipatv
Differential Revision: D5650240
fbshipit-source-id: fe17bdf0595c4ff113428d2bc18bf7c455e85302
Summary:
Before this fix, a functional layer name can appear several times in a
blob and cause confusion. This diff fixes this issue.
Reviewed By: kittipatv
Differential Revision: D5641354
fbshipit-source-id: d19349b313aab927e6cb82c5504f89dbab60c2f2
Summary:
Implemented ApplyTransformIfFaster
Determine if a transform is faster, then return whichever net is better.
Reviewed By: bwasti
Differential Revision: D5534535
fbshipit-source-id: 509943205b0c454bf30fb01343ac4e88d1441c39
Summary:
The cuda_fp16.h header in CUDA 9 RC triggers this diagnostic.
It is included by cusparse.h as well, so guarding the
inclusion of only cuda_fp16.h is not enough.
Reviewed By: Yangqing
Differential Revision: D5651995
fbshipit-source-id: 4778a8a793761e7a1dbebf3792b85b33a3e26219
Summary: These layers were not codemoded
Reviewed By: chocjy
Differential Revision: D5645982
fbshipit-source-id: 4325f77a0f8152dfe6dfdeee59697b25ecb1de35
Summary:
Enforce that blobs don't mix between operators on different GPUs or CPU/GPU. Add test.
+ Fix memonger when no namescope is provided.
Reviewed By: asaadaldien
Differential Revision: D5644708
fbshipit-source-id: 0cb361efd6361b6e2138462584bab6b4de039b5d
Summary: when adding a new axis to concatenate along, allow it to be the last axis. For example, concated 1D columns into a 2D matrix with axis=1, add_axis=1.
Reviewed By: hoangmit
Differential Revision: D5622495
fbshipit-source-id: 8d7c8650c198450ccd4f9e1c98e4ea9f40162be0
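A small sketch of the example mentioned above (stacking 1D columns into a 2D matrix), assuming the standard Concat interface with its two outputs:
from caffe2.python import core, workspace
import numpy as np

workspace.FeedBlob("col0", np.array([1., 2., 3.], dtype=np.float32))
workspace.FeedBlob("col1", np.array([4., 5., 6.], dtype=np.float32))
op = core.CreateOperator(
    "Concat", ["col0", "col1"], ["matrix", "split_info"],
    axis=1, add_axis=1,  # add the new (last) axis and concatenate along it
)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("matrix").shape)  # expected (3, 2)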
Summary: Implement a brew wrapper for the LayerNorm op. This adds the scalar weight and bias terms to the op.
Reviewed By: jmp84
Differential Revision: D5595836
fbshipit-source-id: 467b2e1158b0c454a149d4b26c47719826e98752
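A minimal sketch of what calling the new brew wrapper could look like; the helper name layer_norm and its arguments are assumptions based on the summary above.
from caffe2.python import brew, model_helper

model = model_helper.ModelHelper(name="ln_example")
x = model.net.AddExternalInput("x")
# The wrapper is expected to emit the LayerNorm op plus the learnable
# scale and bias blobs mentioned above.
h = brew.layer_norm(model, x, "x_ln", dim_in=64)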
Summary:
Forward-only mode had broken at some point. Two things: RNNCell did not pass the parameter to recurrent.py and also recurrent.py was broken if forward_only=True after python3 codemod.
Added test to rnn_cell_test to actually check the forward only parameter is passed to prevent future breakage.
Reviewed By: jmp84
Differential Revision: D5639306
fbshipit-source-id: b1bbc39d59c3f3734b2f40a1c2f3740c733e0bd4
Summary:
As an alternative to sharing embeddings, we want to explore merging the ID_LISTs in the net.
This commit adds an operator to merge many ID_LIST features into a single one.
Differential Revision: D5481523
fbshipit-source-id: 446121122a32de5682d5d75a165370bc8d776d03
Summary:
The current scripts live at `.travis/`. These files at `caffe2/.travis/` were apparently added by accident in fbe2393cc2.
Closes https://github.com/caffe2/caffe2/pull/1102
Differential Revision: D5648563
Pulled By: Yangqing
fbshipit-source-id: 8a071f78f466a1c0bbe62b720b50bacc425287bc
Summary:
As part of the cuda 9 move we have decided to deprecate the cnmem path
as it seems to be superseded by cub if one needs a memory pool.
Closes https://github.com/caffe2/caffe2/pull/1104
Differential Revision: D5647672
Pulled By: Yangqing
fbshipit-source-id: 988af5bf63e24efa1b631fd91ddb58e798ffc5c6
Summary: This can be used for local attention to mask elements outside of a window
Reviewed By: jamesr66a
Differential Revision: D5643677
fbshipit-source-id: 92b33866258ccc7307d5bcf08234610aa3fb152d
Summary: This diff replaces the main of the memonger for dag algorithm _compute_blob_recycling_for_dag with a c++ implementation.
Reviewed By: akyrola
Differential Revision: D5544219
fbshipit-source-id: 9f868880c8d0eb997ad3dd39433f9d0b9216d303
Summary:
Seems to be required for CUDA 9 compilation
Closes https://github.com/caffe2/caffe2/pull/1100
Differential Revision: D5642986
Pulled By: harouwu
fbshipit-source-id: 5f934d580152d3d66f7baa71695fb8847ee2c029
Summary: the old gpu single benchmark mode is lost in recent changes. We still need this mode to benchmark some operators. I also removed some unused ancient code
Reviewed By: azzolini
Differential Revision: D5628501
fbshipit-source-id: c5d2c6c99af18c41bead5d86c46a42f05821e2ff
* Add ability to specify init_method for test_distributed.
* Move init_method specification to test run line.
* Run for gloo tests as well.
* Better status message for gloo test.
Summary:
Since we temporarily disable checkpointing the readers, we need to
rename all the node names in the test to make it pass.
Reviewed By: azzolini
Differential Revision: D5640930
fbshipit-source-id: 1e61be31ddf9b6e28efd2eb8e6e91e63dcd83154
Summary:
Convert from PlanDef ProtoBuf into python Plan object by recursively creating
Nets and ExecutionSteps.
Also support running Plan object directly in Session.
Reviewed By: azzolini
Differential Revision: D5608393
fbshipit-source-id: c0ae3b6da743a759af6db3b614a5a3935fe0b34c
Summary:
This diff adds dependency-aware concurrent/parallel execution of operators in stepnets. For CPU, we use multi-threaded execution. For CUDA, we use multiple streams and cuda events for parallelism and dependency tracking.
Much of the diff is about computing dependency graph, which was quite tricky because we need to also avoid write-races of multiple operators running in multiple timesteps in parallel. Also, recurrent blobs "change name" when passing over timestep ("_prev"), so that needs to be handled as well.
This diff also restores the link-ops that I unlanded earlier.
The performance gain of this diff is very good for CPU (same perf as with static_dag, even better on forward-only). On CUDA, the gains are modest, at least with the sizes i was testing with.
Reviewed By: salexspb
Differential Revision: D5001637
fbshipit-source-id: 3d0a71593d73a9ff22f4c1a5c9abf2a4a0c633c8
Summary:
The hive reader checkpoints are broken because of D5582328.
This breaks our offline simulator test as well.
This is a temporary fix that disables the checkpoints for readers.
Reviewed By: azzolini
Differential Revision: D5637719
fbshipit-source-id: 4f31ae534cb7e981fcacbb721cbb2420249fad91
Summary:
After this, we should have test going back to all green.
Closes https://github.com/caffe2/caffe2/pull/1058
Reviewed By: harouwu
Differential Revision: D5637495
Pulled By: Yangqing
fbshipit-source-id: ac3ab5a27bc56e3bb08fa81aa8ed186cb7e8832b
Summary:
Pattern match currently only supports one type of pattern matching: connected components.
It will be useful to sometimes use different algorithms to pattern match, either a subset of the operators in order, or general non-connected subgraphs. While generalized pattern matching can match for all types, it is inefficient to use it when sorted order or connected component suffice.
You can set the PatternMatchType to be one of the three options (it is connected by default), and Transform will use the associated algorithm.
We will need this for common subexpression elimination - specifically, sorted order matching.
Reviewed By: bwasti
Differential Revision: D5629321
fbshipit-source-id: 2104f2d4384fe4aba06a386881a08ca324f290a6
Summary: CUDNN should be almost always faster than the default implementation
Reviewed By: Yangqing
Differential Revision: D5633240
fbshipit-source-id: 99c45c04bf6a3c19f3f7eb27be1bb89344bc03d4
Summary:
Adds a benchmark comparing two methods used to generate positional embeddings,
table-based and sinusoid (as in the Transformer paper).
Reviewed By: jamesr66a
Differential Revision: D5625633
fbshipit-source-id: faee2d20ea0c3d9c41479c5114fa010ac49fab24
Summary:
Here is my example:
For a static RNN, the timestep blob is created as part of param_init_net. Before, DPM assumed it was a CUDA blob by default and it participated in broadcasting, causing the Copy on line 798 to fail. No device mapping is correct for this blob.
Reviewed By: akyrola
Differential Revision: D5631716
fbshipit-source-id: 28c3eb17ecc3080c95c41d69a60bf7262d3907d4
Basically, it's easy to confuse the dimensions of the index tensor.
This adds some more text which should hopefully clarify the situation.
Fixes #2416.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Memonger had a subtle bug which caused it to recycle "splitinfo" outputs of Concat/Split. That is bad since they are on the CPU device, and it would cause them to be reallocated. This caused a big slowdown with Kaiming's trainer.
The bug was that we checked for gradients as containing "_grad" in the name, although we should only allow it as a suffix. Admittedly, doing string checking is not elegant anyway, but that is how Caffe2 works now.
Reviewed By: asaadaldien
Differential Revision: D5627251
fbshipit-source-id: c12be2323109bf81c3725d8884c7ef024e010bd5
Summary:
Enable the new convolution group functionality in cuDNN v7
Closes https://github.com/caffe2/caffe2/pull/1079
Differential Revision: D5625074
Pulled By: Yangqing
fbshipit-source-id: 00be025b50161a3bae7e7f09712e4b1adeaffd9f
Summary:
This was updated in 707aed36e89ab9e2041de25166a4930fc4e24ee7 but a
force push into https://github.com/NVlabs/cub made the commit Caffe2
was pointing to unreachable.
cc slayton58 lukeyeager
Closes https://github.com/caffe2/caffe2/pull/1089
Differential Revision: D5621958
Pulled By: pietern
fbshipit-source-id: b1242dc6303a38d3ac9adb37e190084a40a66aa2
Summary: Use the new SequenceMask op to mask out invalid positions in the attention mechanism rather than using PackSegments and UnpackSegments. This should help us on several fronts, including elision of host<>device copies and using fewer intermediate blobs
Differential Revision: D5619156
fbshipit-source-id: e59c644236cee02f853d8743f9a938fb10adc73b
Summary:
Implement forward pass for a SequenceMaskOp to replace https://github.com/caffe2/caffe2/blob/master/caffe2/python/attention.py#L54-L72.
This implements two modes: a sequence-length based mode and a matrix triangle mode.
Reviewed By: akyrola
Differential Revision: D5615493
fbshipit-source-id: a2ce4a8e655d9b720049010a7856be052c5567eb
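A rough sketch of the sequence-length mode follows; the mode and fill_val argument names are assumptions based on the two modes described above.
from caffe2.python import core, workspace
import numpy as np

scores = np.random.randn(2, 4).astype(np.float32)
lengths = np.array([2, 3], dtype=np.int32)
workspace.FeedBlob("scores", scores)
workspace.FeedBlob("lengths", lengths)
# Positions beyond each row's length get a large negative fill value, so a
# following softmax effectively ignores them.
op = core.CreateOperator(
    "SequenceMask", ["scores", "lengths"], ["masked"],
    mode="sequence", fill_val=-1e9,
)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("masked"))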
Summary:
The LocalSession does not work with the multi-node definitions.
The test becomes flaky because of that. The fix is to create
a different LocalSession for each Node(), and run each node
sequentially.
Differential Revision: D5617857
fbshipit-source-id: a8079a90291b4c8b5aa6b471c33c06d18e59976c
Summary:
1. Adds one more step in the JobRunner class to upload checkpoints.
2. Adds one function to return the name of the checkpoint given
the name of the node.
Reviewed By: andrewwdye
Differential Revision: D5597130
fbshipit-source-id: 570a55785e6227859e1115326d6cab077f0e7f72
Summary: Added Nesterov momentum as an option for BMUF and corresponding tests
Reviewed By: asaadaldien
Differential Revision: D5599888
fbshipit-source-id: 30819c9e689347c8b75daddc7444bea9f54193ae
Summary: ##select()##, used previously by the ELU implementation, is not vectorized for vector maps in Eigen. This change switches the ELU cpu implementation to use ##cwiseMin## and ##cwiseMax##, which increases the perf by about 4x.
Reviewed By: Maratyszcza
Differential Revision: D5609370
fbshipit-source-id: 99560a25e0ea2cd35e34aa50c65e53788a6be6b0
Summary:
Add support for TensorCore convolution and gemm on Volta hardware.
Currently built on top of #1055
Closes https://github.com/caffe2/caffe2/pull/1056
Differential Revision: D5604068
Pulled By: Yangqing
fbshipit-source-id: 100f67e26ed5fabb1dbb31dcd77f7ecb84de4ee7
Summary: Guarding reservoir sampling with mutex & fix the bug in counting number of new entries.
Reviewed By: chocjy
Differential Revision: D5503300
fbshipit-source-id: fd6b0bacb71fbab99d6d5df2c72da523fba02847
Summary: Adding the option to dedup by object ID so that more frequent objects are not present more than once in the reservoir
Reviewed By: chocjy
Differential Revision: D5503109
fbshipit-source-id: e36c3ad8eea134d6c10a4c875fceadc0f843c976
Summary: Make the candidate pool less localized
Reviewed By: chocjy
Differential Revision: D5453289
fbshipit-source-id: 848cb7551d7112f6f47f2cf647bb0daca6eff341
Summary: Instead of printing the exception using print() use traceback.print_exc() This way you get a stack trace
Reviewed By: jay-mahadeokar
Differential Revision: D5604642
fbshipit-source-id: f8cb67e554305cd2fbed384a4a2040fa2b16e7c0
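For illustration, the pattern simply replaces print(e) with traceback.print_exc() inside the except block; risky_operation below is a placeholder.
import traceback

try:
    risky_operation()  # placeholder for the code that may raise
except Exception:
    traceback.print_exc()  # prints the full stack trace, not just the message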
Summary: Avoid labelling objects similar to true positive (according to raw ID features) as negative.
Reviewed By: chocjy
Differential Revision: D5336506
fbshipit-source-id: 05f68f5d0af2a6eb907963d38702f0d6e9b2f99b
Summary: Make the command-line arguments pertaining to model architecture the same between train.py and translate.py. Also use the s() scoping function for all intermediate blobs in attention.py (this is for compatibility with multi-headed attention).
Differential Revision: D5594312
fbshipit-source-id: cadf51d854b5a9174ec913f32c655be2abf111e5
Summary: In order to control the absolute scale/magnitude of the output of this op, added a tuning parameter: amplitude
Reviewed By: jamesr66a
Differential Revision: D5596574
fbshipit-source-id: 3b7e316de55cce6fd686da70aa5658ec3e99b070
Summary: Turned a number of uniform shader variables into constants
Differential Revision: D5596760
fbshipit-source-id: 68004c081c6b9ba2e55f7f74e48a673489c927b1
Summary:
Bringing over selected dockerfiles from the documentation branch and updating the GPU Dockerfiles to use some of the docker configurations provided by lukeyeager. The latest docker with CUDA 8.0 and cuDNN 6 can be pulled via `docker pull caffe2ai/caffe2` or built with `ubuntu-16.04-cuda8-cudnn6-all-options/Dockerfile`.
**You must use nvidia-docker instead of docker to run the GPU-enabled dockers.** Tutorial files can be overlaid by building `ubuntu-16.04-gpu-tutorial/Dockerfile`. Supersedes #911. Closes #876. Closes #923.
Closes https://github.com/caffe2/caffe2/pull/949
Reviewed By: Yangqing
Differential Revision: D5510872
Pulled By: aaronmarkham
fbshipit-source-id: 390f5eea1d9ec1a3edda828470b12386ab8a1775
Summary: GRU differs from LSTM in that it only has hidden states but no cell states. So in this case, reusing the code of _LSTM is problematic, as we need to delete the part that creates the cell state and change many other places that use a hard-coded 4 (hidden_all, hidden, cell_all, cell) into 2 (hidden_all, hidden). Otherwise GRU will break during the backward pass, when the optimizer tries to apply gradients to each of the parameters, because the cell state is never used, so it does not have gradients for the corresponding parameters (i.e., cell_state_w, cell_state_b).
Differential Revision: D5589309
fbshipit-source-id: f5af67dfe0842acd68223f6da3e96a81639e8049
Summary:
Model downloader was broken after the move on s3 to the vanity url, download.caffe2.ai. Using this as the url base hits a redirect, and will result in the script throwing a 403 error. Rather than upgrading to urllib2 or putting in a bunch of code to handle a redirect on urllib, we can just use the non-vanity base url.
Closes https://github.com/caffe2/caffe2/pull/1020
Reviewed By: Yangqing
Differential Revision: D5568686
Pulled By: aaronmarkham
fbshipit-source-id: d88a6b3e1b7955835fc03b036dc54dec48316e7f
Summary:
Basic NCCL 2 API support - the same as applied to gloo [here](49586d9556)
/cc Yangqing pietern
Closes https://github.com/caffe2/caffe2/pull/1055
Reviewed By: Yangqing
Differential Revision: D5583234
Pulled By: bwasti
fbshipit-source-id: 3a9ce302649fdab9ce897613b94788c1843262e2
Summary:
This is needed for metal build.
Note that for older xcode (7.3), right now ios build fails due to not having metal headers. We will require xcode 8.0 onwards now.
Closes https://github.com/caffe2/caffe2/pull/1062
Differential Revision: D5591536
Pulled By: Yangqing
fbshipit-source-id: 57fbb9e052629ce6ecc16f1ea5179e3303a10907
Summary:
After sudo make install, it is quite cumbersome to remove the installed files manually. This change allows the user to simply type sudo make uninstall to remove all installed files.
Closes https://github.com/caffe2/caffe2/pull/748
Differential Revision: D5590971
Pulled By: Yangqing
fbshipit-source-id: b354640056c88b9975dd0cf195a6a4d8cad8d0ab
Summary:
This PR replaces PR #464. It requires C++11 support using the
new CMake variables (`CMAKE_CXX_STANDARD`, `CMAKE_CXX_STANDARD_REQUIRED`,
etc.) when CMake is version 3.1 or above. Otherwise, if CMake is older
(e.g. Ubuntu 14.04) it falls back to using the -std=c++11 flag and
issues a warning.
This PR is based on the comment from Yangqing:
https://github.com/caffe2/caffe2/pull/464#issuecomment-305376923
The corresponding line in cmake/MiscCheck.cmake is removed in order to
reduce redundancy. Another option would be to move the C++11 logic to MiscCheck.cmake.
Closes https://github.com/caffe2/caffe2/pull/1027
Differential Revision: D5590646
Pulled By: Yangqing
fbshipit-source-id: 11ac63fbeaab7a1da02115549e214f9c529f1873
Summary: as promised, a separate diff for dpm changes I made in experimental code
Reviewed By: pietern
Differential Revision: D5551304
fbshipit-source-id: 9013aeab6c388b1c415ffb2e36fb8dd6b8cf90b0
Summary: This diff implements CUDA version of OneHot operator.
Reviewed By: bddppq
Differential Revision: D5578543
fbshipit-source-id: 55b70e8ec6ee34b647b9140fecbba31b6968f403
Summary: Add CUDA version of GRU operator
Reviewed By: jamesr66a
Differential Revision: D5571043
fbshipit-source-id: 332aa64fc8a9116cc33382f2b2907080e58c13b3
Summary:
While I was trying to make a quick oss cmakefile, I found that some of the
ios source files are out of sync with the most code changes. This diff should
fix the issues.
I manually ran cmake on the oss side with scripts/build_ios.sh to make sure
things pass.
Reviewed By: ajtulloch
Differential Revision: D5582265
fbshipit-source-id: 2636d353d32fcd8fb7087385b9bbed8476e33e74
Summary:
Fix multilayer inference in Caffe2 example seq2seq code. (Rely on LSTMWithAttentionDecoder.apply rather than fixed state indices to determine stepwise decoder output.)
Also assorted updates to bring code in line with changes elsewhere in the codebase, and added unit tests which ensure that training and inference networks generate the same loss, which should make these problems much easier to identify in future.
Reviewed By: jamesr66a
Differential Revision: D5579803
fbshipit-source-id: 6e0f27340d981990ab8d0da58e63793222e7be87
Summary: Users are reporting CUDA illegal access errors happening on some configurations after D5528436 introduced lazy peer connections. Will debug later, but this diff is to revert that change.
Reviewed By: pietern
Differential Revision: D5581673
fbshipit-source-id: ef8e367160a38fc62434d6f5905892db274d9f06
Summary:
It was reverted previously because of a missing schema for the gradient op. Added it back and resent.
Differences between this diff and the previously reverted diff:
1. added schema for the gradient operator
2. changed line 95 in kmax_pooling_op.h from CAFFE_ENFORCE to CAFFE_ENFORCE_GE
Reviewed By: xianjiec
Differential Revision: D5568867
fbshipit-source-id: 39813b389a5da803967a561249793afdfce00c58
Summary:
When creating a common world, we would attempt to create
one using an existing common world to save on setup cost. This could cause
unexpected behavior when the backing common world had a shorter
timeout than the world being created. This patch improves this
logic by limiting the usage of a backing world to only ones that
have a long enough timeout.
Reviewed By: andrewwdye
Differential Revision: D5570904
fbshipit-source-id: d3b5073a64381ed068a30dcc461a6ec9ce15ad9c
Summary:
(1) BlobsQueue is causing a gcc error (a google search suggested it was a
bug, but we'll put the implementation in a separate cc file).
(2) Preparing for cuda 9: update cub.
(3) Prepare for cudnn 7: update cudnn rnn op.
(4) Fix an MSVC issue
Reviewed By: sf-wind, jerryzh168
Differential Revision: D5574352
fbshipit-source-id: 230820ce3ceaa32bee8323bdc509de352c93fcf2
Summary:
The mpscnn-fb folder was intended for our earlier sharing of the MPSCNN code.
Now that we have fully migrated the code, one should check contrib/ios instead.
accept2ship
Reviewed By: ajtulloch
Differential Revision: D5577227
fbshipit-source-id: df3706a272f022ea6e529f38d960bce374f79baa
Summary:
In Python 3.x, dictionary values aren't a list and can't be concatenated to a list;
this diff should fix that.
Reviewed By: andrewwdye
Differential Revision: D5576724
fbshipit-source-id: c60441857ceceb9c4a71122d2db5e9abad6d3fc2
Summary:
The L1Distance operator used to return a single value denoting the L1 of the entire input, instead of a vector for each input value.
This fixes that.
Reviewed By: Yangqing
Differential Revision: D5570385
fbshipit-source-id: fbab0e0c9262ccbdb3af27262b8baacdeb2d0fc9
Summary: New hybrid randomized sparse nn, which allows layers of sparse NN model to be randomized, semi-random, or learnable
Reviewed By: chocjy
Differential Revision: D5416489
fbshipit-source-id: eb8640ddf463865097ba054b9f8d63da7403024d
Summary:
To train an image model, we can also use a label embedding vector as supervision, as opposed to using SoftmaxLoss/SigmoidCrossEntropyLoss.
In such cases, the label is a dense vector. This diff enables such use cases.
Reviewed By: panshen1
Differential Revision: D5556203
fbshipit-source-id: 52c61495e02fab457dc2d43e3345d7dbd5580ab7
Summary:
data_workers.py provides a really nice, easy way to run background threads for data input. Unfortunately, it's restrictive: the output of the fetcher function has to be a numpy array.
I pulled out that core nice thread management into parallel_workers, and updated the classes data_workers to extend those classes. The main change was refactoring out most of the queue handling logic into QueueManager.
This way parallel_workers can be used to manage background threads without having to use the queue for output.
Reviewed By: akyrola
Differential Revision: D5538626
fbshipit-source-id: f382cc43f800ff90840582a378dc9b86ac05b613
Summary:
There does not exist appropriate build script for Tizen software platform.
This commit is to fix #847.
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Closes https://github.com/caffe2/caffe2/pull/877
Differential Revision: D5571335
Pulled By: Yangqing
fbshipit-source-id: 12759a3c0cb274ef93d7127b8185341e087f2bfa
Summary:
Adds support for the CUDA 9 toolkit.
Includes new fp16 data type fixes, and changes to warp-synchronous programming. Also updates CUB third-party repo for CUDA 9 support.
Closes https://github.com/caffe2/caffe2/pull/853
Differential Revision: D5548507
Pulled By: Yangqing
fbshipit-source-id: c7fd2edb623f2aa8c67b9a1000efc8f71e6832ab
Summary:
Implement dot attention as described in https://arxiv.org/abs/1508.04025
This saves the computation of weighted encoder outputs in `rnn_cell.py`
When the encoder and decoder dimensions are different, we apply an FC, which corresponds to the general case below Figure 2.
Refactored unit tests.
Reviewed By: jhcross
Differential Revision: D5486976
fbshipit-source-id: f9e9aea675b3b072fbe631bc004199b90a9d95cb
Summary:
Caffe2: add a DB that's wrapped around a BlobsQueue as an adapter for data from non-DB interface.
This is useful for bridging the gap between DB interface data processing ops (TensorProtosDBInput, ImageInputOp etc.) and data that's coming from arbitrary Python or the pretty intricate Hive reader.
Reviewed By: akyrola
Differential Revision: D5554560
fbshipit-source-id: 01bb0056410f9ade205367d5fefc721f91f5b629
Summary:
Now Caffe2 is replicated in three code bases. Some directories
are only for mobile or only for server. Need to strip the
unnecessary files in checkout.
Run this command to strip the files checked out in mobile:
hg sparse --enable-profile fbandroid/xplat/caffe2/.hgsparse-caffe2-xplat
Run this command to strip the files checked out in server:
hg sparse --enable-profile fbcode/caffe2/.hgsparse-caffe2-dev
Reviewed By: mzlee
Differential Revision: D5557190
fbshipit-source-id: e41c8edab09d3fafcb0c8e40ebe1c6809388dc02
This is so that we can do per-commit sync between codebases, removing
the current tech debt of manual syncing.
The code is contributed by various folks: @tulloch for ios, @bwasti for
snpe, @fricc33 and @hlu for opengl, among many others.
@feisun (sf-wind) made the original sync.
Summary:
The current implementation for s=0 doesn't support backward pass.
Switching to using pow op instead as a temporary solution.
Reviewed By: jackielxu
Differential Revision: D5551742
fbshipit-source-id: 33db18325b3166d60933284ca1c4e2f88675c3d3
Summary:
1. switch the protoc building system from msbuild to cmake
2. set default CMAKE_GENERATE to VS2015
3. set default CMAKE_BUILD_TYPE to Release
4. improve error handling
5. add the generated protobuf include path
6. exclude many optional dependencies from build_windows.bat
Closes https://github.com/caffe2/caffe2/pull/1014
Differential Revision: D5559402
Pulled By: Yangqing
fbshipit-source-id: 019e3a6c3c909154027fa932ce1d6549476b23bb
Summary:
Caffe2: Where operator allows users to specify three inputs:
input1: TensorTypes<bool>
input2: TensorTypes<float, double, int, long, std::string>
input3: TensorTypes<float, double, int, long, std::string>,
which allows users to do the operation: output = input1 ? input2 : input3.
We found that there is a need to add the boolean type to input2 and input3 of the Caffe2 Where operator, for customers who want to use boolean tensors for logic.
Reviewed By: ender-wieczorek
Differential Revision: D5541815
fbshipit-source-id: 55171b242821f5f2c83235f5229a85f8cbe580de
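With the boolean type enabled for input2 and input3, a call like the following sketch becomes possible (element-wise output = input1 ? input2 : input3):
from caffe2.python import core, workspace
import numpy as np

workspace.FeedBlob("cond", np.array([True, False, True]))
workspace.FeedBlob("a", np.array([True, True, False]))
workspace.FeedBlob("b", np.array([False, False, True]))
op = core.CreateOperator("Where", ["cond", "a", "b"], ["out"])
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("out"))  # picks from "a" where cond is True, else from "b"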
Summary:
This brings it up to par with how the RedisStoreHandler
works. The store handler configuration does not have to change and
only the run ID parameter changes across runs.
This was inconsistent and came up in https://github.com/caffe2/caffe2/issues/984.
Reviewed By: Yangqing
Differential Revision: D5539299
fbshipit-source-id: 3b5f31c6549b46c24bbd70ebc0bec150eac8b76c
Summary:
Currently Caffe2 enables peer access between all 8 gpus, even if only 1 gpu would be used. This adds several seconds to the startup time, but also takes a lot of memory (110 MB per GPU).
This diff makes the peer access initialization "lazy". When GPU X is first used, pairwise peer access is set between GPUs 0 to X-1 with X. A lookup table is used to ensure no double peer access initialization.
Reviewed By: pietern
Differential Revision: D5528436
fbshipit-source-id: 8f3c2c8154291a7d3a99ee2882e4834ef5e38b66
Summary:
This diff makes SparseLengthsSum(Gradient) Async. It goes through these logics:
1. Adding INDICES to Gradient op input so that we can make it async without device host copies.
2. Registering new 3 input op as gradient for CPU/GPU version of SLS
3. In order not to break old nets (they are mostly on CPU), I still register the old 2-input op, so the op schema will not complain when it encounters old nets that have the SLSGradient op in them.
wickedfoo Sorry this diff might bring you extra work of migrating your optimization effort to this new async gradient op. But we think it is worth it. :(
Reviewed By: dzhulgakov
Differential Revision: D5423188
fbshipit-source-id: 62494a6c52a507c4a4688d5a9e1a2bc720d5370d
Summary: Added caffe2 operator to calculate the sinusoidal position encoding for word embeddings, as described on page 6 in https://arxiv.org/abs/1706.03762.
Reviewed By: jamesr66a
Differential Revision: D5533024
fbshipit-source-id: 1afb35cd7f9d8c71f2635b853e56b2c840f0bc1f
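As a reference for the formula from the paper, here is a plain numpy sketch of PE(pos, 2i) = sin(pos / 10000^(2i/dim)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dim)); it is not the operator's exact implementation.
import numpy as np

def sinusoid_encoding(num_positions, dim):
    # Rows are positions, columns alternate sin/cos as in the Transformer paper.
    pos = np.arange(num_positions, dtype=np.float64)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / float(dim))
    enc = np.zeros((num_positions, dim))
    enc[:, 0::2] = np.sin(angle[:, 0::2])
    enc[:, 1::2] = np.cos(angle[:, 1::2])
    return enc.astype(np.float32)

print(sinusoid_encoding(128, 512).shape)  # (sequence length, embedding size)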
Summary:
To achieve this, I modified the blob name scheme defined in a layer.
Before it was scope/fc_w and scope/fc_w_auto_0 (if there is another fc
within the same scope).
Now I change it to scope/fc/w and scope/fc_auto_0/w.
That is, we rely on the uniqueness of the scoped layer name to define
names for blobs.
I also overwrote the create_param method in LayerModelHelper to let it
use the resolved name for blobs given the parameter-sharing context.
There are some details such as making the initializer more structured
that I need to finalize.
Reviewed By: kennyhorror
Differential Revision: D5435132
fbshipit-source-id: a0525f5ea0977e255dd5ea765b38913f5951d455
Summary: Added functionality that allows users to store huge blobs of any type, not only Tensors. The blob has to be divided into chunks in the same way as a Tensor blob.
Reviewed By: kennyhorror
Differential Revision: D5432762
fbshipit-source-id: c171faacd99d209bfae6f9707ebde7c4e23ba3b9
Summary: Implement the LpNorm operator, which calculates the Lp norm of a tensor for regularization (p=1 or 2). Currently, there is only the L1Distance operator, which calculates the L1 distance of two same-shape tensors. We want to make it take only one input and output the L1 loss, and we would do the same for the L2 loss. We also plan to implement the l_{p,q} loss, but have not decided which p and q to take.
Reviewed By: xianjiec
Differential Revision: D5460051
fbshipit-source-id: d67a38fbc94afa52de26d4a53e4d2b7df3c50b6a
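A minimal sketch of calling the new operator; the argument name p is an assumption, and whether the result is the raw sum of |x|^p or its root is defined by the operator schema.
from caffe2.python import core, workspace
import numpy as np

workspace.FeedBlob("x", np.array([1.0, -2.0, 3.0], dtype=np.float32))
op = core.CreateOperator("LpNorm", ["x"], ["norm"], p=2)  # p=1 or p=2 for regularization
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("norm"))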
Summary:
KaimingHe debugged a slow model and found out that global average pooling was hideously slow, even with CUDNN. It turns out the CUDNN pooling op (especially the backward pass) is not optimized for global pooling.
This adds a fast path for global average pooling with NCHW. It is about 30x faster than CUDNN with 56 x 56 pooling, and compared to the equivalent ReduceBackSum, it is about 3x faster.
I will bootcamp the max pooling.
Reviewed By: asaadaldien
Differential Revision: D5533059
fbshipit-source-id: 2d590693d737fa92184603663031d96f6145f304
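For context, this is roughly what a global average pooling call looks like from Python; whether the new fast path is taken depends on the device, the NCHW order and the build, so the sketch is illustrative only.
from caffe2.python import core, workspace
import numpy as np

x = np.random.randn(8, 256, 56, 56).astype(np.float32)  # NCHW, as in the 56 x 56 case above
workspace.FeedBlob("x", x)
op = core.CreateOperator(
    "AveragePool", ["x"], ["pooled"],
    global_pooling=1, order="NCHW",  # pool over the full spatial extent
)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("pooled").shape)  # (8, 256, 1, 1)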
Summary: This will throw away a few examples. It is desirable to keep the batch size constant for fully synchronous data parallelism.
Reviewed By: dzhulgakov
Differential Revision: D5531788
fbshipit-source-id: e19385401155e731cfc5b25e8e9ea7c16c19d478
Summary:
Currently, for `from_column_list` if the input col_names=[], it throws
errors. To solve this issue, we fix the get_field function so that it creates
an empty Struct when empty col_names is given.
Reviewed By: kittipatv
Differential Revision: D5543865
fbshipit-source-id: f6dfa25326e355f8ec24e5542761851a276beeb9
Summary:
StringJoin operator converts input array/matrix elements to string then join them to make vector of strings
Changes:
* Support string tensor input
* Support join on 1-axis
* Add unit tests
Differential Revision: D5513705
fbshipit-source-id: 25f96ed3586065c15f845a968c9f8864ca8f5bdf
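A small sketch of joining matrix rows into strings; the delimiter and axis argument names are assumptions based on the change list above.
from caffe2.python import core, workspace
import numpy as np

workspace.FeedBlob("rows", np.array([[1.5, 2.0], [3.0, 4.5]], dtype=np.float32))
op = core.CreateOperator("StringJoin", ["rows"], ["joined"], delimiter=",", axis=1)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("joined"))  # one comma-joined string per row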
Summary: Allow the use of apply_transform() in the python API
Reviewed By: bwasti
Differential Revision: D5530483
fbshipit-source-id: 61a6d36fe125c89629fdeea040a717c453d84417
Summary: This allows users to add an arbitrary number of additional outputs to ImageInputOp. These are populated by reading additional TensorProto values from the TensorProtos from the DBReader and converting them into Tensors. Similar to labels, only ints and floats are supported, and multiple values are supported.
Reviewed By: panshen1
Differential Revision: D5502019
fbshipit-source-id: 5a8b61b3a8549272a112e8e02cd613d8f9a271ba
Summary: Caffe2: allow nets that don't use all input in net.ClonePartial
Differential Revision: D5535564
fbshipit-source-id: 0ec8fb3ade4d7d6cd4a702c9c265d9c77f27a627
Summary: Change DCHECK to CAFFE_ENFORCE (so that the problems also occur in mode/opt) and use the EQ enforce
Reviewed By: asaadaldien, Yangqing
Differential Revision: D5517647
fbshipit-source-id: 4da6eae54abf71114957133df088ae3623d8beaa
Summary:
In order to pybind, we need transform in core.
It's a basically finished product, with a big test suite. It's safe.
We can begin hooking up observers after this, and I have a diff coming up that pybinds some apply_transform function.
Reviewed By: bwasti
Differential Revision: D5522200
fbshipit-source-id: dea6aa606fc689af84e2533569d1ef348cb5f3f2
Summary:
Allows Operators to match their string properties using * and |, to allow an operator to match multiple types.
Also allows device option, engine, and argument matching.
Reviewed By: bwasti
Differential Revision: D5419697
fbshipit-source-id: fe09c7f83a5a2fefe61d79e09ee1d5b755045313
Summary:
running ##xplat/caffe2/fb_sync.sh##.
Also add two new core sources to the BUCK file, and add ##createSharedBuffer## to NNPACKConvOp.
Reviewed By: ajtulloch
Differential Revision: D5373061
fbshipit-source-id: c030b2629d2715e1d2776c98715f57e2650922c9
Summary:
Fix the error during compilation on Win10+CUDA, not sure if it affects Linux and MacOS.
caffe2/operators/top_k_radix_selection.cuh(359): error : a value of type "caffe2::TIndex *" cannot be used to initialize an entity of type "long *"
Closes https://github.com/caffe2/caffe2/pull/992
Differential Revision: D5532399
Pulled By: Yangqing
fbshipit-source-id: 6958ee4f21053f73a0628cf98936931099211749
Summary:
1. allow PrintOp to print every N
2. add a util function to accumulate hist and print.
Reviewed By: dzhulgakov
Differential Revision: D5437008
fbshipit-source-id: 7dd8e51b20f9daaec6c0a4e69ff6e082fca671e6
Summary: Add tensor inference function for squeeze, refactor a bit
Reviewed By: asaadaldien
Differential Revision: D5518880
fbshipit-source-id: 5b8cb9154f5f777d4be3612a96d7ed76a9068c0c
Summary:
Feed team uses distributed training and wants to also use transfer learning.
Currently, transfer learning is implemented by overwriting the layer parameter
initializer. Therefore, the PS builder can't correctly infer the parameter shape.
To fix this, add a field 'shape' in `layer_parameter` and set the shape if we
overwrite its initializer.
We also enforce the check of parameter shape between the original initializer
and the loaded blob. (this adds extra cost)
Differential Revision: D5520541
fbshipit-source-id: 80547dbd328b3f6cbfcea0b2daaf4004703dfe81
Summary: Several refinements to seq2seq example code, including support for multilayer LSTM.
Reviewed By: jamesr66a
Differential Revision: D5460372
fbshipit-source-id: d2eabf6aa9a5b5df7bbc341fd99c4e7d8322e717
Summary:
We shouldn't LOG(FATAL) in Caffe2 code under any conditions as it's a library.
The case where it failed was a bug in SparseAdaGrad that failed on empty input trying to launch 0-sized CUDA kernel.
Also, the trend for C2 core is in moving from bool to exceptions, so I just moved CAFFE_ENFORCE directly into FinishDeviceComputation. Most of the use cases were already doing that or ignoring the output (bad!).
Reviewed By: akyrola
Differential Revision: D5495913
fbshipit-source-id: 66f382369417a262da69d54470f720e7d04a5cdf
Summary: Memonger did not properly track the number of times a blob output has to be produced before an operator can be visited. Actually I remember fixing this before, but well. This bug was manifested in Priya's model, so thanks prigoyal, and benz's model verifier nicely caught the wrong output.
Reviewed By: asaadaldien
Differential Revision: D5524912
fbshipit-source-id: 10f4d7056b84aba0274a918af508ea043e6026f9
Summary:
Based on discussion with Misha we're going to go for code-generation for all possible variants:
AVX2/AVX512 (eventually)
embedding type: float16, float32
index type: int32, int64
reducer: sum, weighted sum, mean (with scaling by lengths)
block size: 32, 64, 128
From some simple testing full-loop fusion with prefetching (as opposed to TypedAxpy) gives at least 1.5x performance win, so it is justified.
This just adds scaffolding for perfkernels for the embedding lookup subfunction.
I haven't actually moved the current implementation, because it's more work to refactor the current macros/templates; it's easier and more extensible to do codegen.
Scaffolding is a bit ugly because we don't want to pass templates across translation units and thus it requires explicit names of types in function names. Better suggestions are welcomed.
msmelyan - you'd pretty much need to generate appropriate embedding_lookup_avx2.cc
Reviewed By: Yangqing
Differential Revision: D5505887
fbshipit-source-id: ece489d4fd36e7ddbe71efb890f48ab38acaeaec
* Add ATen overload to AutoGPU.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Use new AutoGPU overload.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: This method runs a train net multiple times therefore enables testing layers with iteration-dependent behavior.
Differential Revision: D5493750
fbshipit-source-id: a7fb967a66f799aaf82acfadc4ecf66e0744da20
Summary: One of my workflows was stuck because the everstore/hive data input was experiencing networking issues (No route to host, etc.), but it was hard to tell this was happening because the errors were logged to stdout. Anyway, added simple logging to warn if the data workers' enqueue thread has not received new data for over 10 secs.
Reviewed By: panshen1
Differential Revision: D5522816
fbshipit-source-id: a036c4afdfbbafea130a4251c1ca02c138d19a83
Summary: The diff adds support for rank_loss operator to support computing loss for multiple sessions (batch).
Reviewed By: kittipatv
Differential Revision: D5515465
fbshipit-source-id: 55a01cd5ad21eaeae82875ad136c392fed0dbb26
Summary: There is no need to disable inexpensive assertions in mode/opt, and disabling them makes it incredibly difficult to debug model problems. So changed a bunch of them to CAFFE_ENFORCEs.
Reviewed By: Yangqing
Differential Revision: D5517902
fbshipit-source-id: 9154d0114db159e8136a482fb6508e92084af97a
Summary:
Optimised SparseLengthsSum (fp32) for now
1) Specialized reducer
2) created fast routine with prefetches, loop unrolling, block specialization and register tiling
3) added more variety of block sizes to segment_ops_test.py
Reviewed By: Yangqing
Differential Revision: D5392472
fbshipit-source-id: 8ed9baf1b12ec05bd391cabb390024e6bc60a6f6
Summary: to support an operation needed by D5507205
Reviewed By: xianjiec
Differential Revision: D5512522
fbshipit-source-id: a9b3a668c28eff71d1e106dbbb572184df4a7638
Summary:
The renames were only being applied to the main net; if a step_net has an
external input that is part of the renames, running the model would fail with a 'blob
not found in workspace' error.
Differential Revision: D5511953
fbshipit-source-id: ba262a094c3263978dfe173f2cab00301131b57f
Summary:
Updated the semi-random layer model for multi-layer models using semi-random layers.
Notable changes:
- Inputs and outputs for the semi-random layer are now a Struct with "full" and "random" components
- A flag was added to choose whether to initialize the output schema in Arc Cosine or not (in case output schema initialization happens in the Semi-Random layer)
Reviewed By: chocjy
Differential Revision: D5496034
fbshipit-source-id: 5245e287a5b1cbffd5e8d2e3da31477c65b41e04
Summary: ASAN caught invalid memory problems in 3 of the tests in PatternNetTransformTests. The cause was pushing elements into a vector: although the vector itself stays in scope, its storage can be relocated when it is resized, thus invalidating the iterator/pointer.
Reviewed By: bwasti
Differential Revision: D5510112
fbshipit-source-id: affb11dbd221c826e108136789ef11c96c5d9843
Summary: It is a common mistake to create a test/validation model with init_params=True. When its param_init_net is run, it will overwrite the training model's params, and with DPM those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is OK to have init_params=True if you never run the param_init_net...).
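A minimal sketch of the recommended pattern, assuming the standard model_helper API: only the training model initializes parameters, and test/validation models reuse them.
```python
from caffe2.python import model_helper

# Training model owns parameter initialization.
train_model = model_helper.ModelHelper(name="train", init_params=True)

# Test/validation models must NOT re-initialize params, or running their
# param_init_net would overwrite the trained values (and with DPM the
# overwritten values would not be re-synchronized to all GPUs).
test_model = model_helper.ModelHelper(name="test", init_params=False)
```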
Reviewed By: asaadaldien
Differential Revision: D5509963
fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f
test_FloatTensor_qr_big test is still a bit flaky on K80. Increasing tolerance to improve reliability as tests are moved around and results change for this test.
Summary:
As per rushabhmshah99's request: he wants to append a pre-trained model (without training it) to the model.
So added data_parallel_model.ConvertNetForDevice() to enable that. The unit test shows an example of how to use this with
AppendNet, and I also added a blurb to the function.
Differential Revision: D5503335
fbshipit-source-id: b2a5db5c1739dc97f46dd0d7606ed555d99255b8
Summary: To prevent an assertion from protobuf when accessing the dims.
Reviewed By: asaadaldien
Differential Revision: D5504362
fbshipit-source-id: d9b55fab3126e2760a3e790615ed30a1af2ddc32
Summary: A weakness in gloo_test led to an embarrassing diff review (D5494956): my test "succeeded", although each of the workers failed hard in an assertion. This was not handled because there was no exception to be caught and put into the result queue. So change the logic to put a success token into the queue, signaling successful completion.
Reviewed By: pietern
Differential Revision: D5503760
fbshipit-source-id: f2415bcc55638595cefa5d64dea811d86e77f24d
Summary: Use romain-intel's ContextFactory to create common worlds from existing common worlds, thus bypassing the KV store completely. Changed data_parallel_model to automatically detect whether there is already a CW we can work with. CreateCommonWorldOp takes an optional second parameter, which is the existing CW.
Reviewed By: andrewwdye
Differential Revision: D5494956
fbshipit-source-id: 5f7a840bcd5fe4ea756fafeacc746bc2cf5078b0
Summary: Split this into its own file for ease of reviewing. This is a simple interface for someone to create a Transform - by simply providing their own Pattern and Replace NetDefs.
Reviewed By: akyrola
Differential Revision: D5440426
fbshipit-source-id: dc643226f40ffe4ec5c86d56cfea374bd6a4e0e5
Summary:
Nothing gets changed - this would allow us to more easily deal with build
systems. Also now everything that is MKL related lives under mkl/.
Reviewed By: dzhulgakov
Differential Revision: D5505157
fbshipit-source-id: ddb2e6ac290a146a7cb495da23bb0e5b5594bd2a
Summary:
A bug reported in MTML group: https://fburl.com/lumicchc
The reason is that in MTML, the `task_shared_embedding` was not correctly
initialized in Python
Reviewed By: xianjiec
Differential Revision: D5502875
fbshipit-source-id: 3538d917392568ecd37c39059dc86f866bce9543
Summary:
Use a smaller step size for GradientChecks and pass a seed to help reproduce the
test from logged inputs.
Reviewed By: Yangqing
Differential Revision: D5505698
fbshipit-source-id: fc308efe72d535695ba628944aee1913ba16b2f1
Summary: Some old compilers (e.g. gcc 4.8) do not like lambdas.
Reviewed By: ajtulloch
Differential Revision: D5500500
fbshipit-source-id: fe6bcc7277fd7e9607f54a83be1f0ec146411440
* Implement BatchNorm double backwards as a python function called directly from C++.
This will be converted to C++ code once ATen is integrated with autograd.
* Some performance improvements via inplace ops and reusing calculations.
Summary:
The original issue was that the initialized parameters for randomized layers (Arc Cosine and Semi-Random) were not fixed across distributed runs of the layers. Moreover, as the weights are initialized as (constant) parameters, when the layer is added to the preprocessing part, these weights won't be saved after training since they don't exist on the trainer.
I fixed the issue here by building an option to add the randomized parameters to the model global constants so that the same parameter values can be accessed. Also, the parameters can be saved when the training is finished.
In this diff, I've:
- Updated randomized parameters to be added as a global constant across distributed runs of Arc Cosine Feature Map and Semi Random Feature layers
- Updated unit tests
- Ran an end-to-end test, enabling multiple readers to test the fixed issue
Reviewed By: chocjy
Differential Revision: D5483372
fbshipit-source-id: b4617f9ffc1c414d5a381dbded723a31a8be3ccd
There were two implementations of THPUtils_checkLong/THPUtils_unpackLong; one
that was a macro and one that was not, which is hella bad if you accidentally
include the macro before the real definition. Now we always use the inline
function.
A reasonable follow-up task would be to un-macro-ify the rest of these functions.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Moved the distance op tests from hypothesis_test to distance_op_test and
refactored them
Reviewed By: akyrola, asaadaldien
Differential Revision: D5495104
fbshipit-source-id: 4a90c75eabeb380ae9d150d6258e9b5b0fbfc5ca
Summary:
- Adds GetCTCGradient for CTC training, so we can use AddGradientOperators() on "costs". The function just calls CopyOp.
- Modified test to verify inputs_gradient is created in workspace.
Reviewed By: yqwangustc
Differential Revision: D5499271
fbshipit-source-id: 5a6985f90f309303aadaceb7c966d822ad3576b2
Summary:
Data type conversion between numpy arrays and Caffe2 tensors currently only supports 3 types: FLOAT, DOUBLE and INT32. Supporting 8-bit and 16-bit data types will help reduce the model size in some circumstances. I used this to reduce the size of a data set from 8GB to 1GB by using INT8.
Closes https://github.com/caffe2/caffe2/pull/930
Reviewed By: Yangqing
Differential Revision: D5440929
Pulled By: akyrola
fbshipit-source-id: 3762da1d845e62a13ba384d1c144328b19dd663b
Summary:
This is causing OSS android build failures such as
https://travis-ci.org/caffe2/caffe2/jobs/257609575
Reviewed By: akyrola
Differential Revision: D5497495
fbshipit-source-id: b3ba0cca135a4a632461851c9b9212f3d75abd5d
Summary:
__attribute__((unused)) is not supported on Windows, so we actually need to
substitute it with a macro.
Also changed UNUSED_VARIABLE to CAFFE2_UNUSED because we also use it to mark
functions now.
Reviewed By: ajtulloch
Differential Revision: D5497063
fbshipit-source-id: bcda026e626c41f71c21c36f029a3f871eaea7d4
Summary:
I successfully built caffe2 using MSVC 2015 and the Ninja Generator. I use vcpkg to build gflags, glog, lmdb and protobuf. Here is my build procedure:
1. Install vcpkg and set it up according to vcpkg docs
2. Install dependencies
```
$> vcpkg install gflags glog lmdb protobuf eigen3 --triplet x64-windows-static
```
3. Run CMake with this batch file
```Batch
setlocal
if NOT DEFINED VCPKG_DIR ( echo "Please defined VCPKG_DIR" && exit /b 1 )
if NOT DEFINED CMAKE_BUILD_TYPE set CMAKE_BUILD_TYPE=Release
if NOT DEFINED BUILD_DIR set BUILD_DIR=build_%CMAKE_BUILD_TYPE%
if NOT DEFINED USE_CUDA set USE_CUDA=OFF
call "%VS140COMNTOOLS%\..\..\VC\vcvarsall.bat" amd64
if NOT EXIST %BUILD_DIR% (mkdir %BUILD_DIR%)
pushd %BUILD_DIR%
set CMAKE_GENERATOR=Ninja
set ZLIB_LIBRARY=%VCPKG_DIR%\installed\x64-windows-static\lib\zlib.lib
cmake -G"%CMAKE_GENERATOR%" ^
-DBUILD_SHARED_LIBS=OFF ^
-DCMAKE_VERBOSE_MAKEFILE=1 ^
-DBUILD_TEST=OFF ^
-DBUILD_SHARED_LIBS=OFF ^
-DCMAKE_BUILD_TYPE=%CMAKE_BUILD_TYPE% ^
-DUSE_CUDA=%USE_CUDA% ^
-DZLIB_LIBRARY:FILEPATH="%ZLIB_LIBRARY%" ^
-DVCPKG_TARGET_TRIPLET=x64-windows-static ^
-DVCPKG_APPLOCAL_DEPS:BOOL=OFF ^
-DCMAKE_TOOLCHAIN_FILE:FILEPATH=%VCPKG_DIR%\scripts\buildsystems\vcpkg.cmake ^
-DPROTOBUF_PROTOC_EXECUTABLE:FILEPATH=%VCPKG_DIR%\installed\x64-windows-static\tools\protoc.exe ^
..\
ninja
popd
endlocal
```
Closes https://github.com/caffe2/caffe2/pull/880
Differential Revision: D5497384
Pulled By: Yangqing
fbshipit-source-id: e0d81d3dbd3286ab925eddef0e6fbf99eb6375a5
Summary:
libpthreadpool is needed during the linking stage and is missing when the user chooses to use an external nnpack installation (from system libraries).
Fixes GitHub issue #459.
Detailed discussion on [this comment](https://github.com/caffe2/caffe2/issues/459#issuecomment-308831547).
Closes https://github.com/caffe2/caffe2/pull/808
Differential Revision: D5430318
Pulled By: Yangqing
fbshipit-source-id: 5e10332fb01e54d8360bb929c1a82b0eef580bbb
Summary: Implemented the registry pattern: now all transforms are instantiated by a string. I then made a simple transform which, given a graph, will change the engine of all Conv operators to be NNPACK, to demonstrate.
Reviewed By: bwasti
Differential Revision: D5447007
fbshipit-source-id: 48065a88fa648ad0e11f7f8ee93b8e732cd515d7
Summary:
This adds an example for vectorized typed axpy implementation under
perfkernels.
Reviewed By: dzhulgakov
Differential Revision: D5479258
fbshipit-source-id: 469e6c8aaf2c12cdf0025bc867eb9d4cab84184f
Summary:
(1) Wrote up length reducer operators from the original dispatcher
implementation under segment_reduction_op.cc. Note that this does not
change the fp16 version now.
(2) created subfolder perfkernels for potential different backends, with
scaffolding done.
(3) provided the vanilla fp16 implementation, so that currently the default
implementation will support fp16 (very slow) right now. This sets up the
fp16 benchmarking capability after D5477844.
Next step is actually to implement the faster versions. The goal of this diff
is mainly so that Misha can plug in his custom implementations more easily.
Reviewed By: dzhulgakov
Differential Revision: D5479056
fbshipit-source-id: bba30dc0d892b8e2cdfc825034fdfb7bd22a1726
Summary: If the last group has length=0, then ##start == end == len_indices##. Implementation is correct, just the assert is not
Reviewed By: wickedfoo
Differential Revision: D5488858
fbshipit-source-id: fcc4ef8162f1390534a7c556de2ae7d2b82eddc9
* add SharedFunctionMaker to create Function shared in the graph
* Clean shared_ptr usage for only function that will be used in the graph
* make Function binding match the Variable one
* remove unnecessary changes
* fix comments
* proper weakref implementation
* add call to clear in dealloc
Summary:
Based on discussion on the post in the Caffe2 users group. Changing DCHECK, which works only in debug mode, to CAFFE_ENFORCE, which throws an exception and is a better option.
Update: Also corrected the check for label_data >= 0; it did not check all elements previously. Moved it to the inner loop.
Reviewed By: akyrola
Differential Revision: D5483788
fbshipit-source-id: ccbff09e19e05e7036db772498f71795063c1fed
Summary: When creating parameters for modelhelper, we should use create_param instead of using param_init_net and model.params directly. The diff rewrites some of these cases in rnn_cell.py in order to make model._parameter_info and model.params consistent.
Reviewed By: kittipatv
Differential Revision: D5477724
fbshipit-source-id: 28c4aaf8f98d9d89125af6a42ad328008f0079e1
* Add examples in CrossEntropyLoss
1. Added examples in CrossEntropyLoss
2. Make consistent style of example for PyTorch docs
3. Delete unnecessary character '
* Change comments in distance.py
1. Delete x1, x2 from arguments and add eps in PairwiseDistance
2. For the shape, added input1 and input2 for readability (PairwiseDistance and CosineSimilarity).
* Add examples
Added the word 'examples' for PyTorch docs
Summary:
Need it for some reference comparison for c2isl.
Also there's an argument that it might be faster on GPU with int32. Doesn't seem to be the case now, but haven't tested with Jeff's changes yet.
Reviewed By: kennyhorror
Differential Revision: D5405482
fbshipit-source-id: dc1a983dce5f06f1111c5634ec475647c94848cc
Summary: Add check that every time we register a caffe operator to CPU or GPU that documentation is added for the particular operator.
Reviewed By: dzhulgakov
Differential Revision: D5443110
fbshipit-source-id: 3793c3d29bea1228078cb30bdf8243ac0ab90664
Summary:
In order to get dimensions right, correctly identify gradients, etc., DropoutCell should call the _prepare_output and _prepare_output_sequence methods of its internal cell for its own such methods.
This bug was identified by NVIDIA intern Syed Tousif Ahmed.
Reviewed By: akyrola
Differential Revision: D5483082
fbshipit-source-id: f6df5b4a0502ed0771056638aab219fb5cc7d964
Summary: TSIA - this makes it a bit easier to benchmark sparse lengths sum.
Reviewed By: dzhulgakov
Differential Revision: D5477844
fbshipit-source-id: 89e25c5e0dbf3538877ba1a9abc75a10abfa2757
Summary: DBExists function was factored out of the DBExistsOp.
Reviewed By: azzolini
Differential Revision: D5472587
fbshipit-source-id: 2a53375ffcccfb88e8f0af2ab55dad4c6a9586e3
Summary: I have hated the "gradient of X is either not provided or sparse" message. It is better to say which one is the problem.
Reviewed By: dzhulgakov
Differential Revision: D5468923
fbshipit-source-id: b63cde293fe252e5136d225ce4c762b4981f6fc8
Summary:
This is needed for us to do more fine grained dispatch based on CPU arch, so
I figured we should just add it. Can help Dima and Misha doing optimization
I think?
Reviewed By: dzhulgakov
Differential Revision: D5477444
fbshipit-source-id: 48aaf8bd799e9755493cd51c793ceec080a8846c
Summary: SimpleNet and DAGNetBase are the only two direct subclasses of NetBase. This feature has already been applied to SimpleNet before, with this diff all nets should be covered.
Reviewed By: dzhulgakov
Differential Revision: D5475498
fbshipit-source-id: 339edac31d008ec1e4630d93d2e27d0f518f4ebb
Summary:
For RNN attention, we should not include the invalid parts of the encoder output (based on encoder_lengths) in the computation. This diff accomplishes that by forcing logits for those positions to be negative infinity.
Note that this step can be bypassed by passing encoder_lengths=None, which is what we do for beam search, thus incurring no extra overhead for inference.
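A small numpy sketch of the masking idea described above (illustration only, not the actual attention implementation): positions at or beyond each sequence's encoder length get a logit of -inf, so softmax gives them zero weight.
```python
import numpy as np

def mask_attention_logits(logits, encoder_lengths):
    # logits: (batch, max_encoder_len); encoder_lengths: (batch,)
    batch, max_len = logits.shape
    positions = np.arange(max_len)[None, :]                      # (1, max_len)
    invalid = positions >= np.asarray(encoder_lengths)[:, None]  # broadcast per row
    return np.where(invalid, -np.inf, logits)

logits = np.zeros((2, 4), dtype=np.float32)
print(mask_attention_logits(logits, [2, 4]))
```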
Reviewed By: jamesr66a
Differential Revision: D5402547
fbshipit-source-id: 1863d6050b5129e4df829c6357f0aa9ded0715dc
Summary: Fixing the case where the init net initializes the same blob twice. I made an exception to allow in-place blobs among ops if the blob stays on the same device. This should fix the problem in a generalized way, as most of our training is only on CPU now.
Reviewed By: dzhulgakov
Differential Revision: D5450564
fbshipit-source-id: 525c4c9a2e5216a70dbd1229da2d9f8a58b89e47
Summary: Save 2 nets during offline training and load the correct net that the user wants. keep_device=false will help us load GPU blobs into CPU memory.
Reviewed By: dzhulgakov
Differential Revision: D5396689
fbshipit-source-id: ff26bf3759856b07f3a1bbefac4a1e613a8a02e1
Summary:
===Update log 7/10===
We are currently blocked by a connection problem. Will post if this problem is not fixed in 2hrs.
===Update 7/6===
Luke is experimenting on the convergence of this diff. Hopefully he can present results next week.
Right now this is not affecting our original CPU training pipeline, because the loading op is still correct in the CPU case.
I will need a final test to make sure, but that is currently blocked by the log device issue t19952135.
I will handle saving CPU/GPU nets in a separate diff.
====Update before 7.4====
It's actually working! Local run screenshot included:
{F67959016}
dogscience
Reviewed By: dzhulgakov
Differential Revision: D5307058
fbshipit-source-id: cad5d9324c239419530f4b120392ec2ccbb72280
Summary: This reduces runtime from 1.54757 ms/iter to 0.273687 ms/iter for 100 parallel reductions, each of size 100000.
Reviewed By: akyrola
Differential Revision: D5471324
fbshipit-source-id: 626cabb8249fb4655275648fae2738cb739e1a72
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary: The Implementation of Graph Transformations, with the PatternMatch and ReplaceMatch rules.
Reviewed By: akyrola
Differential Revision: D5404144
fbshipit-source-id: 2bab68e6bff2e841ea9fb64df5d92ea945e704af
Summary: CopyGPUToGPU does not exist. Copy seems to do the trick. Didn't go into details of how copy works, not sure if it ends up triggering UVA.
Reviewed By: akyrola
Differential Revision: D5471014
fbshipit-source-id: d8bc1aed9b19070c92f3ffc76f5617bdd0054563
* added tests + removed explicit expand of weight in bce with logits
* add auto broadcasting of weight to BCELoss
* remove the need for _BCELoss
* formatting of warning
* remove TODO
* move across assert from _functions/thnn/loss.py
* flake8 fixes
Summary: Constructor should extract everything needed from NetDef instead of keeping it for usage after construction.
Reviewed By: akyrola
Differential Revision: D5469095
fbshipit-source-id: 288ea3243d85061ba9c018d2aef3b4d97485dd00
Summary: In situations where both sin & cos need to be computed, the joint SinCos function is faster than computing them individually. Both MKL and CUDA support this function, so exposing it here.
Reviewed By: kmatzen
Differential Revision: D5465588
fbshipit-source-id: 7686498e4f2d4b5862d83a1ecf14fcc88ea53640
Summary: A quite common confusion is how to use StopGradient, and a typical bug is to forget to specify input=output. This adds a sanity check to the gradient builder that checks whether some StopGradient outputs are orphaned.
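A minimal sketch of the usage the new check is meant to enforce (blob/op names here are just for illustration): StopGradient should be applied in place, with input == output.
```python
from caffe2.python import core

net = core.Net("stop_gradient_example")
x = net.ConstantFill([], "x", shape=[1], value=1.0)

net.StopGradient(x, x)            # correct: input == output
# net.StopGradient(x, "x_stop")   # typical bug: output blob is orphaned,
#                                 # which the new sanity check would flag
```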
Reviewed By: dzhulgakov
Differential Revision: D5458341
fbshipit-source-id: 056fef4f0ee53eb10e66e9be0ecb55b55f9cc3d7
Summary:
This will fix the test by querying how many instances of the optimizer have already been created,
because OSS tests don't run in isolation, causing the number of created optimizer instances to be >= 0.
Reviewed By: akyrola
Differential Revision:
D5462433
Tags: easy
fbshipit-source-id: 7a9ab4fe5345f5d5138abb461ba7a990d9ace840
Summary:
In this revision, I mainly implemented the DRelu activation. See https://arxiv.org/pdf/1706.06978v1.pdf for details.
To sum up, unlike standard ReLU and PReLU, which divide the input into two parts with the boundary at zero, DRelu calculates another value p to divide the activation into two parts. p is the softmax value of the output of Batch Normalization. For the f(x)=x part of ReLU, the analogous pattern is f(x)=px, and for the f(x)=0 part, the analogous pattern is f(x)=a(1-p)x, where a is a parameter to tune. The DRelu activation result is the sum of these two parts: f(x) = a(1-p)x + px.
To implement DRelu, I take BatchNormalization as the super class and then use the above formula for computation. In order to allow users to choose activation methods, which usually happens when calling the add_mlp function in processor_util.py, I pass the parameter transfer in model_option from the UI down to the implementation, just as dropout does. Currently I place it in extra_option, but can modify it if the AML team needs to redesign the UI.
I also added unit tests for DRelu. We check the shape of the output and also do numeric unit tests.
For the unit tests, I first check the numeric value of BatchNormalization, since there was no similar test before. I then compute the value of the DRelu outputs and compare the results with the current DRelu layer.
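A numpy sketch of the formula above, assuming a sigmoid gate as a stand-in for the softmax of the batch-normalized output; this is only to make the algebra concrete, not the layer implementation:
```python
import numpy as np

def drelu_reference(x_bn, a=0.1):
    # x_bn: batch-normalized input; p in (0, 1) gates the two branches.
    p = 1.0 / (1.0 + np.exp(-x_bn))         # stand-in for the softmax gate
    return p * x_bn + a * (1.0 - p) * x_bn  # f(x) = p*x + a*(1-p)*x
```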
Reviewed By: chocjy
Differential Revision: D5341464
fbshipit-source-id: 896b4dcc49cfd5493d97a8b448401b19e9c80630
Summary: The Graph Interface and Implementation, for the Graph Transformation Framework. The last diff was too long and unapproachable - let's try this instead :)
Reviewed By: akyrola
Differential Revision: D5403985
fbshipit-source-id: 89f9361841088db8ebf45a9a4f8d2357eae3fb76
* add dropout2d and dropout3d to functional
added some loss functions to functional
added tests
using dropout from backend
added docs
fixes
* edited loss modules to call functional
Summary: Net construct bench was using old version of data_parallel_model API.
Reviewed By: bddppq
Differential Revision:
D5453281
Tags: easy
fbshipit-source-id: 93e1ba58511c7b25235ee50d9862fd0614b344c9
Summary: When performing reductions on fp16 buffers, gloo assumed that both buffers were either aligned to 32 bytes or misaligned by the same offset. This may not hold in intermediate steps of halving-doubling allreduce, when the reduction is performed on some offset within the receive buffer. The fix is to use intrinsic instructions that work with unaligned pointers.
Reviewed By: akyrola
Differential Revision: D5450103
fbshipit-source-id: 9a1c8f8c34d2e62223f6d5c21573ea1cfad6537f
The function iterates over columns and sets a "sparsity" fraction of entries in each column to 0. The number of zeros in a column (num_zeros) is then ceil(rows*sparsity).
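A numpy sketch of the described behavior, with hypothetical names (the actual helper is not shown in this note):
```python
import numpy as np

def sparsify_columns(w, sparsity, rng=np.random):
    # Zero out a "sparsity" fraction of entries in each column;
    # num_zeros = ceil(rows * sparsity), as described above.
    rows, cols = w.shape
    num_zeros = int(np.ceil(rows * sparsity))
    out = w.copy()
    for c in range(cols):
        idx = rng.choice(rows, size=num_zeros, replace=False)
        out[idx, c] = 0.0
    return out
```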
Summary: Rather chunky sync of changes made exclusively to mobile codebases back to fbcode.
Reviewed By: ajtulloch
Differential Revision: D5314405
fbshipit-source-id: c4d0a7244468f953eb63288306bc9bc78eb9e1be
Summary: Adding pooling option as None, and SparseLookup will gather the embedding for each id.
Reviewed By: kittipatv
Differential Revision: D5421667
fbshipit-source-id: 1e8e2b550893ff3869dab12f8eb1fe24a063c3d5
Summary: Allowing CPU device scope instead of enforcing no device scope in data_parallel_model and data_parallel_rendevous.
Reviewed By: akyrola
Differential Revision: D5440492
fbshipit-source-id: bcd4344d64c710ea50ec8a65e3e9d102e35c66ea
Summary: - Minor fix for error message in layer model helper file
Reviewed By: chocjy
Differential Revision: D5440768
fbshipit-source-id: df47bfe68a0caa750f0d3c8def28a5585e465ee0
Summary: The diff added TensorInferenceFunction for ExpandDims operator, so that ExpandDims layer is no longer needed (it can be handled by functional layer)
Reviewed By: kittipatv
Differential Revision: D5430889
fbshipit-source-id: 4f895f2751663c45db4cc4f87e5114c63cda9fbb
Summary: When compiling with -Werror=shadow-compatible-local, a variable name cannot be reused. This passed our tests, but some people compile with stricter settings.
Differential Revision: D5440805
fbshipit-source-id: a246af748717fb7e0e7a321e1ac4ddfef68ae524
Summary:
If strip_prefix_ is not found in the blob name, strip_prefix_.size() characters of the blob name will be stripped.
Closes https://github.com/caffe2/caffe2/pull/924
Differential Revision: D5440941
Pulled By: akyrola
fbshipit-source-id: 1db772fac4c74f2ce05105eec4bc7742a9067ebc
Summary: Remove this compilation warning: P57645594. Been there a while.
Reviewed By: harouwu
Differential Revision: D5436753
fbshipit-source-id: 630be22f097fdcae7fe0372eed49f20c065146ba
Summary: To reduce round trips with store handlers, it is better to store all addresses in one key instead of one address per pair. This is what this implements.
Reviewed By: andrewwdye
Differential Revision: D5435893
fbshipit-source-id: 2d3ea3a2822c3b934ff2578d44a262e7bfbde6d0
Summary: added support of passing remap_funcs to clone_and_bind_net, so that it can pass it to clone method. Added other utils to ensure RecurrentNetwork operator is correctly cloned based on the remap_blob. The reason that RecurrentNetwork operator needs special treatment is that its arguments contain proto and blobs.
Reviewed By: kittipatv
Differential Revision: D5421532
fbshipit-source-id: 5de68365ce97df2de483f02ad260d78c8d35eead
Summary:
This removes/comments out/silences one or more unused parameters in the files.
We are going to enable `-Wunused-parameter` in fbcode and this fixes a case that automated tooling can't handle.
This diff is automatically generated.
Reviewers are added heuristically.
Reviewed By: dzhulgakov
Differential Revision: D5436791
fbshipit-source-id: 164b080c1bc0f6aad146087ddeded255fe9a3d22
Summary:
This removes/comments out/silences one or more unused parameters in the files.
We are going to enable `-Wunused-parameter` in fbcode and this fixes a case that automated tooling can't handle.
This diff is automatically generated.
Reviewers are added heuristically.
Reviewed By: dzhulgakov
Differential Revision: D5437217
fbshipit-source-id: c2fc5ed30e7ee47b8c40248f89a9f4304ce7c098
Summary:
This is in preparation for adding huge pages. There we want to remember for the pointer how we got it - via mmap() or alloc(). One option is to store gigantic map of void* -> destructor, but luckily usages of Context::New are all inside Tensor which already uses shared_ptr with custom deleter.
This diff could have used unique_ptr as the return type, but then it's easy to accidentally call release() and lose the deleter. Thus going with std::pair<void*, MemoryDeleter> to be explicit.
Also, now CPUAllocator can be effectively changed to std::function. Haven't done it yet, but can do if necessary.
Let me know whether it's a bad idea to proceed like this.
Reviewed By: Yangqing
Differential Revision: D5429830
fbshipit-source-id: 8382ab7b81592d51272056c05c122894bb203827
Summary: Add some comments to dag-memonger to help asaadaldien with his C++ port.
Reviewed By: asaadaldien
Differential Revision: D5435459
fbshipit-source-id: dd5d482efb017418d22f42ee79fbd4668bd31bdd
Summary:
recurrent_network_blob_fetcher_op_gpu.cc was failing when compiled with clang
(Note: this ignores all push blocking failures!)
Reviewed By: wesolwsk
Differential Revision: D5436161
fbshipit-source-id: f4ea31066fe5abc108c6d6c15ee92bf828a2ff96
Summary:
Added operator RecurrentNetworkBlobFetcherOp that takes as input a scratch workspace name and prefix, and copies over all blobs in the scratch workspace into the global workspace. This essentially extracts all intermediate recurrent network computation for each timestep.
Added a wrapper in recurrent.py - retrieve_step_blobs(net, prefix='rnn') - which, when called after an rnn is run, will return a list of all blobs extracted from the net.
Reviewed By: akyrola
Differential Revision: D5421926
fbshipit-source-id: 0f35b466d77d3c719fb0e32de7dbcafc6c0d5225
Summary: Add lint rule to check that every time we register a caffe operator to CPU or GPU that documentation is added for the particular operator.
Reviewed By: dzhulgakov
Differential Revision: D5348078
fbshipit-source-id: c3fa22fc7ca8066d5fc8fa780b23d7867fd3380e
Summary:
Implements TEST_benchmark style of tracking for all nets created in the workspace.
I had to do some tricks to invoke stuff in destructors in a non-intrusive way. Let me know if it's too hacky.
There are 2 levels of reporting:
- `--caffe2_logging_print_net_summary=1` - prints per-type aggregated stats
- `--caffe2_logging_print_net_summary=2` - prints also individual operator breakdown (might be spammy)
Reviewed By: salexspb
Differential Revision: D5414708
fbshipit-source-id: 40bac2cdf7e3809ab0086150433c376bb5fc7e64
Summary: Currently the dataset cursor blob is using a fixed name. When we read from multi input tables, the dataset cursor of each table is using the same blob. This messed up the split queue and crashed the reader pipelines (see the errors and failures in https://fb.quip.com/uzbIA7K0PgVe)
Reviewed By: dragonxlwang, rayleichen
Differential Revision: D5419863
fbshipit-source-id: 5983a3d8d2e286dc47c2ec38ed1dbbe30c7c9b49
Summary: Use the CreateCommonWorld timeout for the storehandler as well, not just the device connect.
Reviewed By: andrewwdye
Differential Revision: D5425923
fbshipit-source-id: 936d2129e2db3bfed8759ca097b75843d3931d5f
Summary: This would allow us to inspect the binary size of the builds more easily.
Reviewed By: jonmorton
Differential Revision: D4553515
fbshipit-source-id: 95371bf67e66490a8653b874e1ff79cc987805e6
Summary:
MKL on windows works with this change. Tested with MKL 2017 Update 3 (https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2017-release-notes).
Should fix #544.
With MKL 2017 Update 3, #514 should not happen either.
Note: I used Anaconda, which ships with its own MKL, so I had to make sure that the MKL 2017 Update 3 version was loaded by replacing the .dll in the `%AnacondaPrefix%\Library\bin` folder. Otherwise, numpy would load its own version and I would get all sorts of missing-procedure errors. Now that the same version is available through `conda`, this is easily fixed with `conda install mkl==2017.0.3`.
Closes https://github.com/caffe2/caffe2/pull/929
Differential Revision: D5429664
Pulled By: Yangqing
fbshipit-source-id: eaa150bab563ee4ce8348faee1624ac4af477513
Summary: Add the API model.add_loss(), which allows adding losses, e.g. for optimization and regularization. See the change in sparse_nn.py, in which 'model.loss = loss' is changed to 'model.add_loss(loss)'.
Reviewed By: xianjiec
Differential Revision: D5399056
fbshipit-source-id: 13b2ced4b75d129a5ee4a9b0e989606c04d2ca8b
Summary:
1. it was easy to pass grad_reference which was just ignored due to missing output_to_grad
2. threshold was not passed to the gradient checking logic
Reviewed By: dzhulgakov
Differential Revision: D5425226
fbshipit-source-id: 2eb41f2601d5e356f7872e57724d08ab2e742329
Summary:
- (Split diff from Arc Cosine)
- Implemented [[ https://arxiv.org/pdf/1702.08882.pdf | Semi-Random Features ]] Layer
- Created a buck unit test for SRF Layer
Reviewed By: chocjy
Differential Revision: D5374803
fbshipit-source-id: 0293fd91ed5bc19614d418c2fce9c1cfdd1128ae
Summary: As title. This helps with the (quite common) cases where data input is stuck for one reason or another, and the net execution never proceeds and is stuck forever.
Reviewed By: andrewwdye
Differential Revision: D5409885
fbshipit-source-id: 840261fd5964408f788fc0f50ece0d74193694ac
Summary: The number-of-inputs dimension for NHWC should be the last dimension, C. Since the batch size is omitted, it should be 2 instead of 3.
Reviewed By: chocjy
Differential Revision: D5418538
fbshipit-source-id: a6939a863817b7566198ea2a665a1d236a2cf63d
Summary:
Fix case when optimizer isn't called within a device scope context.
Fix OptimizerContext lr blob names
Reviewed By: volkhin
Differential Revision: D5421046
fbshipit-source-id: 186a0d05f40d4442c5ba5736084626da73a0c0f1
Summary:
This fixes a super annoying problem with QPS reporting in sparse_nn_benchmarks where QPS "warms up" gradually. The problem is that we create the metrics in init_net and start counting from there, whereas there can be a big delay before real processing begins.
Thus I propose to just start counting from the first example seen. It's slightly imprecise too, as we miss the first batch, but who cares :)
Reviewed By: harouwu
Differential Revision: D5414672
fbshipit-source-id: 94fcf2e486416f186fed563002864f73c5f1c908
Summary: This manually fixes a few violations of `-Wunused-parameter` where automated tooling couldn't help.
Reviewed By: meyering
Differential Revision: D5416336
fbshipit-source-id: c089f02dfdf33351406ebad2f52ad9f8c676360b
Summary: Added function _RunComparison to data_parallel_model that checks if all shards in a given rendevous have the same value for a given blob_name
Reviewed By: wesolwsk
Differential Revision: D5394164
fbshipit-source-id: c2b07d0f8d5846fa9887d53b0be091a8c057f106
Summary: Fix a bug reported by dzhulgakov that occurs when an input blob is used twice in the same op --> it was released to the recycled blobs pool twice.
Reviewed By: dzhulgakov, volkhin
Differential Revision: D5414023
fbshipit-source-id: 861bb46fe901023cb9a496401736e6ecb77d5fae
* add support for groups in double backward
* add tests for group in double backward
* fix lint
* separate some tests to reduce number of test cases
* remove redundant testing for different number of output channels
Summary:
We want it to be able to register children of layers that
are not direct children of ModelLayer.
This requires us to find subclasses of ModelLayer recursively.
Reviewed By: kittipatv, kennyhorror
Differential Revision: D5397120
fbshipit-source-id: cb1e03d72e3bedb960b1b865877a76e413218a71
Summary: Instead of decoding all frames for X-ray video training, decode only sampled frames
Differential Revision: D5365079
fbshipit-source-id: e00dceadaacd9cdd42d83cf0d0e38338dc1f76ef
Summary: As Part 1 in reducing the size of operator objects, this removes the outside access to def() and moves debug-uses under a new debug_def() function. Next phase will be by jbai to remove all access from subclasses to def().
Reviewed By: Yangqing
Differential Revision: D5393893
fbshipit-source-id: 7301cff4138dce620b49f6c4db315df85fee7266
Summary: This diff makes functional layer return scalar if only one output. This diff also corrects all other corresponding implementations.
Reviewed By: kittipatv
Differential Revision: D5386853
fbshipit-source-id: 1f00582f6ec23384b2a6db94e19952836755ef42
Summary: These are useful constructs for operators dealing with sparse representation.
Reviewed By: sunnieshang
Differential Revision: D5332077
fbshipit-source-id: 16aa8c4516e6d80f3c44ff348848f0a4a8061f22
Summary:
Added device scope checks to data_parallel_model and data_parallel_rendevous
Added test to check that checks are working correctly to data_parallel_model_test
Fixed device_scope error in test_synchronization_barrier
Reviewed By: akyrola
Differential Revision: D5403936
fbshipit-source-id: 849c1cd7452692efbc5ef74d2d60ede090c9c017
Summary: The init method should also make _parameters_info shared between self and param_model, since params is shared. Otherwise it can cause an inconsistency between _parameters_info and params. Examples of using param_model can be found in rnn_cell.py.
Reviewed By: kennyhorror
Differential Revision: D5405327
fbshipit-source-id: ca8079058e898f529906452163cda234cb30a7df
Summary: this diff adds optimizer into param_info, and the associated implementations for modelhelper and brew to set optimizer for each individual parameter.
Reviewed By: kennyhorror
Differential Revision: D5385432
fbshipit-source-id: 5d682f9d1ab077e04a5d76a24d71470f4e64fc92
Summary:
akirillov again presented me with a memonger bug: his model, which has kind of a 'back-and-forth' structure where blobs are passed left and right in a ladder-like fashion, revealed a bug in memonger: I should pass the set of free blobs as a reference, not a copy, so that the recyclings are properly accounted for. Hard to explain.
Since we have the graph verifier, we can be more confident with these changes.
I also added some helpful debug to the graph verifier.
Differential Revision: D5396925
fbshipit-source-id: 0bffb3a0bf8532afcd6b5bc9331c779768a8c5c5
Summary: Currently the DBReader always creates the DB instance itself when Open is called. Add an Open method that takes in a DB pointer and takes ownership of it, so the DB can be initialized outside the DBReader.
Reviewed By: panshen1
Differential Revision: D5392458
fbshipit-source-id: d8660ab41d349f32030e4934b47bd17256a440df
Summary: When Sum was called with a type other than float or int, it just returned false without any helpful error.
Reviewed By: asaadaldien
Differential Revision: D5394070
fbshipit-source-id: 0f3c543a39f89163bccb9f55ea394e1d53561b62
Summary: Implemented python logic and tests to create an RNNCell for GRU. Uses the preexisting GRU Unit Op code.
Reviewed By: salexspb
Differential Revision: D5364893
fbshipit-source-id: 2451d7ec8c2eacb8d8c9b7c893bfd21b65fb9d18
Summary:
Just an implementation of the forward pass of the GRU Unit Op, not the full RNNCell.
Functions were created to mimic LSTM implementation as closely as possible.
Backwards pass implementations are defined in GRU_unit_op.{h, cc}
assertGradientChecks call added to gru_cell_test.py
Reviewed By: salexspb
Differential Revision: D5364856
fbshipit-source-id: 09cff4478091827763b40cc331e4e0abf0ec258f
Summary:
Just an implementation of the forward pass of the GRU Unit Op, not the full RNNCell.
Functions were created to mimic LSTM implementation as closely as possible.
Implementation defined in GRU_unit_op.{h, cc}
tests put in gru_cell_test.py, which import rnn_cell_test_util.py for sigmoid, tanh, and _prepare_rnn functions.
Reviewed By: jamesr66a
Differential Revision: D5363697
fbshipit-source-id: f9ba9fe0be01ffc868dd22027be8be4975b84998
Summary:
Moved sigmoid, tanh, and _prepare_lstm (renamed) to a util file.
Also renamed _prepare_lstm to _prepare_rnn since it is being used for setting up both LSTM and GRU models.
The reason for this commit is to allow the creation of the GRU Op and testing code without copying and pasting code for sigmoid, tanh, and setting up an rnn unit op model.
Reviewed By: jamesr66a
Differential Revision: D5363675
fbshipit-source-id: 352bd70378031f1d81606c9267e625c6728b18fd
Summary: Our existing serialization routines take a significant amount of time for large numpy arrays in order to verify the type of each element in the array as well as converting each element to a canonical type. For large floating-point tensors, such as model parameters, this checking and converting takes a significant amount of time. Adding a fast track path for just float32 arrays as this is the most common use case to worry about.
Reviewed By: akyrola
Differential Revision: D5389953
fbshipit-source-id: 26f44cb2426ea3efb849e7707b27d5485f69956c
Summary:
numpy.random.rand generates samples from [0, 1) and therefore, the leaky relu test cases weren't testing negative inputs. Tests still pass after change.
Leaky relu can be used in-place, but gradient took X rather than Y. Technically, the result is no different as it's just used for a sign test in the gradient, but updated it to take Y to reduce confusion.
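A tiny numpy illustration of the fix (hypothetical shapes): samples from [0, 1) never hit the negative branch of leaky relu, so the test inputs need to be shifted to cover both branches.
```python
import numpy as np

np.random.seed(0)
X_old = np.random.rand(3, 4).astype(np.float32)           # all >= 0, negative branch untested
X_new = (np.random.rand(3, 4) - 0.5).astype(np.float32)   # mixed signs

alpha = 0.01
Y = np.where(X_new >= 0, X_new, alpha * X_new)             # leaky relu reference
```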
Differential Revision: D5390126
fbshipit-source-id: d0c428abbb2797eb33902a7d2a2f59d5e85daaa6
Summary: GetComputedParams tests namescopes with equality while GetParams tests with a prefix. Switching GetComputedParams to also use a prefix so that both functions have similar usages.
Reviewed By: akyrola
Differential Revision: D5389816
fbshipit-source-id: 0e43e4b491fccbad3b855b6b735dc2b91d7626c9
Summary: When we use the int32_data field for float16 tensor serialization, it's possible to end up with a representation up to 50% larger than what can be achieved using byte_data. The reason is varints (https://developers.google.com/protocol-buffers/docs/encoding#varints). In the worst case (when the highest bit is set) a varint uses three 8-bit blocks, i.e. 24 bits for each number. Saving in the byte_data field removes this overhead.
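A short sketch of the size argument, assuming standard protobuf varint encoding (7 payload bits per byte):
```python
def varint_size(value):
    # Number of bytes protobuf needs to encode an unsigned value as a varint.
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

# A float16 bit pattern with the high bit set needs 3 varint bytes (24 bits)
# in int32_data, versus a flat 2 bytes per element in byte_data: up to 50% larger.
print(varint_size(0xFFFF))  # 3
print(varint_size(0x00FF))  # 2
```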
Reviewed By: Yangqing
Differential Revision: D5375267
fbshipit-source-id: 0068daed25cd0157ea80a768b6e3899ea2bd8caf
Summary:
dilated convolution semantics were added after the nnpack op, so the feature
check macro was not there originally.
accept2ship
Reviewed By: ajtulloch
Differential Revision: D5387287
fbshipit-source-id: 139ca8c6ad4211ceec8f24982f1f060144524401
Summary: Moving the Sum operator into its own file (elementwise_sum_op.cc)
Reviewed By: oyvindkinsey
Differential Revision: D5379274
fbshipit-source-id: c504d91c9fb5e95b369f2aa7e7b5be31fd8e0d4b
Summary: Added a CUDA implementation of the PiecewiseLinearTransformOp.
Differential Revision: D5378537
fbshipit-source-id: 38857f59f5cc52e16e1ecc97983a0b0b82a46c74
Summary:
# Added the gradients of the operation for both CPU and CUDA kernels.
# Unified variable names across all ops.
# Added reference implementation in numpy.
# The gradient check needs a larger stepsize to succeed, is that normal?
Reviewed By: akyrola
Differential Revision: D5313682
fbshipit-source-id: aceb92649e01c5caeba8774e678f9095502d396c
Summary: replace params with sp, otherwise it will report an empty list
Reviewed By: akyrola
Differential Revision: D5382716
fbshipit-source-id: 34d8e6ee00cbe1718702e3d1f23ea12f8d65063e
Summary:
- Integrated RFF into the preprocessing workflow for dense features
- Developed Flow interface to input RFF parameters
- Created unit test for using RFF with sparseNN
Reviewed By: chocjy
Differential Revision: D5367534
fbshipit-source-id: 07307259c501a614d9ee68a731f0cc8ecd17db68
Summary:
To be used with predictor "online": C++ version of memonger for simple nets. Very simple greedy algorithm. Works well at least on Resnet-50 inference graph: only 3 shared blobs are used.
Next I will integrate this with predictor and run canary (separate diff).
Reviewed By: asaadaldien
Differential Revision: D5375392
fbshipit-source-id: d36e419e39a32e568e105657c27fb00c85a2535d
Summary:
As the title says.
Closes https://github.com/caffe2/caffe2/pull/879
Differential Revision: D5372787
Pulled By: akyrola
fbshipit-source-id: 0ff469c0d227f1b2252c1a0c4f6f8bebaac5580f
Summary: Add synchronization barrier API with configurable timeout. Users can call Synchronize() to join variable length execution before resuming multi-machine communication steps, i.e., resuming distributed training iterations after validation on a single machine.
Reviewed By: akyrola
Differential Revision: D5348387
fbshipit-source-id: 5826da10e6a60c50394c36c7cf47624f10191d11
Summary:
I noticed this when experimenting with the compute-bound convolutions
for the ULP HWGQ binary conv/gemm.
It's an ugly heuristic that Maratyszcza and co. are improving this half, but I think
this will be a net win for C2 especially if segmentation/mask r-cnn are
critical.
Differential Revision: D5375976
fbshipit-source-id: 863f76d434f133bf5a00e7ced1cfadfcf92e3c84
Summary: Memonger had a bug that it crashes if an input blob was input to multiple ops. This fixes that and adds a test.
Reviewed By: asaadaldien
Differential Revision: D5374860
fbshipit-source-id: 1d5044001eacdbe6db43f69727da9297558f5c5c
Summary: Huge improvement in my tests, and it does not really hurt either.
Reviewed By: wesolwsk
Differential Revision: D5374925
fbshipit-source-id: c96a4ed2ca653120a82233c0037cbfded8a2d2a1
Summary:
b33894e95d removed this line:
```py
unittest.skipIf(workspace.NumCudaDevices() < 2, "Need at least 2 GPUs.")
```
but forgot to add it back later.
```
_________________________________ DataParallelModelTest.test_equiv __________________________________
...
if p2p_access_pattern is not None and not p2p_access_pattern[
> devices[0], peer
]:
E IndexError: index 1 is out of bounds for axis 1 with size 1
...
WARNING:data_parallel_model:** Only 1 GPUs available, GPUs [0, 1] requested
```
/cc akyrola
Closes https://github.com/caffe2/caffe2/pull/888
Reviewed By: akyrola
Differential Revision: D5341310
Pulled By: harouwu
fbshipit-source-id: 8d7f06913c7b5a42009a4033dbb6a48a8e812822
Summary:
- allow initializer lists directly with `vector<string>{}`, partly thanks to default initialization
- reduce the number of instances
Reviewed By: nicolasvasilache
Differential Revision: D5370056
fbshipit-source-id: b8fae3b12144257644e098b284df7369d5bdb377
Summary: Based on benchmark script located at `caffe2/experiments/python/device_reduce_sum_bench.py`, device reduce sum is slower for N <= 10000, so we only switch to use device reduce for large N in SumElements. This diff applies the same schema for SumSqrElements.
Reviewed By: jamesr66a
Differential Revision: D5369868
fbshipit-source-id: ae13a611aff9d3464d1c4950ee155c740a2da339
Summary:
- Created the random fourier features layer
- Generated a unit test to test the random fourier features layer is built correctly
- Inspired by the paper [[ https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf | Random Features for Large-Scale Kernel Machines]]
Reviewed By: chocjy
Differential Revision: D5318105
fbshipit-source-id: c3885cb5ad1358853d4fc13c780fec3141609176
This is needed because of possible races in SpatialConvolutionMM (and others that use gemm)
if the BLAS library is not thread-safe.
In terms of performance, there's not much benefit to run two gemms in parallel, because the
BLAS libraries have their own all-occupying gemms anyways.
Summary:
Otherwise it was always added to the main net instead of param_init_net when
desired (i.e. initial param sync)
Closes https://github.com/caffe2/caffe2/pull/894
Differential Revision: D5367451
Pulled By: akyrola
fbshipit-source-id: 3d82be6da687c736bd15f4852dbd272266eb4811
* Improve non-contiguous testing in TestAutograd:
1) Test gradcheck and gradgradcheck with non-contiguous inputs
2) Test gradgradcheck with non-contiguous gradoutputs (gradcheck would take more work)
3) Fix discovered issue in Prod backwards.
* Simplify non-contiguous setting wrt View.
Previously, there were 2 issues with test_autograd randomness:
1) Many random operations (e.g. random selection in prod_zeros) happened
before the torch random seed was set (because it was set in run_tests
at the end of the file).
2) The random seed was not set consistently: run_tests would set it to the
proper value, but each call to setUp would set it to 0 (because SEED wasn't
global in run_tests), which made setting the seed mostly worthless.
Previously, these tests added 5e-2 to the denominator tensor (the same as the div
tests), which only avoids divide by 0, but not issues with computing the numerical
jacobian due to non-linearity of fmod/remainder, when input / divisor is close to an
integer. These tests now add 1.5 to the denominator, which is the same as the non-tensor
version of the tests; Note that we can still hit the above condition but it will be much
less likely.
Summary: Allows to override the input/output record as long as the field blobs are the same.
Reviewed By: yangyangyyy
Differential Revision: D5362132
fbshipit-source-id: 3ac2ac22802902b7eed5c226b00a7e1971ad264c
Summary:
A quite common, hard-to-debug performance bug in multi-GPU training has been operators being passed tensors that reside on a different GPU than the one the op runs on. Since we have peer access enabled, this works, but it is just much slower. With data parallel model this problem rarely arises, as it does static analysis of the operators, but if someone bypasses DPM or uses FeedBlob with incorrect device options, this problem can happen.
To make debugging easier, I added a device field to the tensor that stores which device allocated the memory. In addition, I added a function that goes through operator inputs and outputs and compares each tensor's device to the operator's device. This check is run after the first iteration, with prof_dag only.
Also renamed ShapeCall to TensorInfoFun, as it now returns much more info than just the shape.
I think this is a pretty safe diff, but do you find it problematic to add a new field to tensor?
Reviewed By: dzhulgakov
Differential Revision: D5335505
fbshipit-source-id: 511b6c122dff9a205f43951984868ffd40f7ac30
Summary:
It is a quite common situation that users get some variant of "blob has version 2 but gradient expects version 1" in their backward pass. The error message is completely unhelpful.
To remedy this, I added proper debug information which tells the user how the version number of a blob was incremented over time, i.e. which ops caused the version to go up. This should help
them understand the issue.
Reviewed By: dzhulgakov
Differential Revision: D5358227
fbshipit-source-id: bc09d048ac33200c35d56460e44e86c2f2888f3f
Summary: Port SumElements and softmax_ops.cu to use device reduce sum
Reviewed By: akyrola
Differential Revision: D5351881
fbshipit-source-id: ca9604186c261ffcb1480da2a17baab8a4809372
This takes advantage of the broadcasting behavior of torch.matmul to
support inputs with more than two dimensions. The extra dimensions are
treated like part of the batch dimension, much like nn.Bottle in Lua
Torch.
There are a few related small performance changes:
* Addmm computes the gradient in column-major for inputs in
column-major format
* Variable.mm calls Addmm in-place with the desired output buffer
* Add weight normalization implementation
This adds forward "pre-hooks" which get called before the module's
forward() method. Weight norm is implemented as a hook which calculates
the weight variable from the weight_g and weight_v every iteration.
Based on @rtqichen implementation.
* Specify return type
* Fix unused linker argument warnings.
This patch began when I noticed the following clang warning:
clang: warning: -Wl,-rpath,$ORIGIN: 'linker' input unused
clang: warning: argument unused during compilation:
'-L/home/ezyang/local/pytorch/torch/lib/tmp_install/lib'
The warning is minor, but I was a bit worried our rpath wasn't
setup correctly. Actually, it was, and there wasn't a problem,
but I had to spend some time figuring out exactly what as going
on, and by the end of it, I might as well fix the warning. In the end, I ended
up filing two upstream tickets for ccache and cmake:
- https://github.com/ccache/ccache/issues/189
- https://gitlab.kitware.com/cmake/cmake/issues/17025
We can remove the warning by using CMAKE_EXE_LINKER_FLAGS and
CMAKE_SHARED_LINKER_FLAGS, which have sane macro expansion rules
(although still slightly insane: the first level of escaping gets removed.)
To ensure that the rpath was being set correctly, I ran
objdump -x torch/lib/build/TH/libTH.so | grep RPATH and verified that ORIGIN
was setup correctly.
I also considered using CMAKE_INSTALL_RPATH, but the rpath here doesn't
seem to get set until you actually install, which is a change in behavior,
and I wasn't sure if anyone was relying on rpaths being setup in the build
directory.
There is a SLIGHT behavior change, in that if we happened to need these
LDFLAGS passed to the static linker, they won't get passed. I don't
think we ever build static libraries today so this shouldn't be a problem.
P.S. Because of the ccache bug, you may continue to see these warnings
after this patch. If you apply https://github.com/ccache/ccache/pull/190
and clear your cache, it will solve the problem.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Remove unnecessary -Qunused-arguments
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Added two operators that can be used to transfer data into the input format of an RNN and back.
Reviewed By: kittipatv
Differential Revision: D5329886
fbshipit-source-id: 07eac29416427b08c49989d4eeed50a6f18493a1
Summary: This was broken in a previous diff, fixing it to use model device type.
Reviewed By: asaadaldien
Differential Revision: D5356005
fbshipit-source-id: a4fcc932bae772076b57625a5fcc0d38eb702cc9
Summary: Add an optional timeout parameter to CreateCommonWorldOp, to be honored on dependent collective operations.
Reviewed By: akyrola, romain-intel
Differential Revision: D5348099
fbshipit-source-id: cf5131450c389c7e40b1dabf8334c486e02e0011
Summary:
This works as a standalone python script because args are
global. When used from Flow for monitoring purposes it doesn't
work. This diff fixes that.
Reviewed By: zem7
Differential Revision: D5349996
fbshipit-source-id: f73842901d975b783e09e9db0565eb81880bbea1
Summary:
A couple of fixes for broken reporting of lstm_benchmark:
- last_time must be recorded after warm up
- entry count was incorrectly removed
Reviewed By: salexspb
Differential Revision: D5349890
fbshipit-source-id: 5dd5bdf46594c520b61bc3b57b153f90a6a17903
Summary:
Eliminates failures on overloaded machines caused by only
running a few examples before being timed out.
Reviewed By: tomdz
Differential Revision: D5349555
fbshipit-source-id: 89d1db063f58c72656b37157225a586c9e3f24bc
Summary:
Shared im2col buffer needs a mutex only to protect it from ops within a
workspace (since the shared buffer is created per workspace). The current
implementation has a global mutex which affects perf when running multiple nets
in parallel.
I don't feel great about adding a mutex for this in workspace, let me know if
anyone has better suggestions.
Reviewed By: akyrola
Differential Revision: D5341476
fbshipit-source-id: 1c9a92ef488ffb0c0013a7656bcb3d530bc7208b
Summary: This is splitting out one change from D5273337. This makes it so that we only notify the DAGNet condition variable if the condition it's signalling is actually true, namely remaining_ops_==0 || !success_.
Reviewed By: akyrola
Differential Revision: D5341962
fbshipit-source-id: a4d76cc95aebac27dc18da2bf8dc1837db69e6ae
Summary: Let's try this again. Verify graphs every time memonger is run. Will definitely check the runtime cost, though.
Reviewed By: akyrola
Differential Revision: D5308188
fbshipit-source-id: 512a76c759b670d31c49d1d492dd8ee1eaf3bafd
If the left tensor is 3D+ and the right tensor is at most 2D, we can
fold the batch into the matrix dimension and use torch.mm instead of
torch.bmm. In practice, this is faster especially if the right tensor is
column major.
Summary:
As title. Not sure how the unit test bug went through -
we should have a push-blocking test guarding it. Looks like sandcastle
thought that it was already broken
Reviewed By: jamesr66a
Differential Revision: D5340741
fbshipit-source-id: 76b2287fc2f746d85dd732b669ff89808bcbd497
Summary:
This adds a CollectivesConcurrencyControl class to manage creating common contexts and cyclic controls to execute Gloo collectives,
and refactors AllReduce and _AddDistributedParameterSync to use it
Reviewed By: akyrola
Differential Revision: D5335795
fbshipit-source-id: 5084e0a65cdb989cd949be3868b77a680561022d
Summary:
This is for the ease of removing the common fields of a struct from another.
For example,
s1 = Struct(
('a', Scalar()),
('b', Scalar()),
)
s2 = Struct(('a', Scalar()))
s1 - s2 == Struct(('b', Scalar()))
More examples are provided in the code comments.
Differential Revision: D5299277
fbshipit-source-id: 7008586ffdc8e24e1eccc8757da70330c4d90370
Summary:
In some cases we don't want to compute the full FC during eval.
These layers allow us to compute dot product between
X and W[idx,:] where idx is an input, e.g., label.
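A numpy sketch of the computation these layers perform, as I read the description (shapes and names are illustrative): instead of the full FC product, only the dot product between each row of X and the weight row selected by idx is computed.
```
import numpy as np

X = np.random.randn(5, 8).astype(np.float32)     # batch of 5 examples
W = np.random.randn(100, 8).astype(np.float32)   # weights for 100 classes
idx = np.array([3, 17, 42, 0, 99])               # one row per example, e.g. the label

out = np.einsum('ij,ij->i', X, W[idx])           # one score per example
full = (X @ W.T)[np.arange(5), idx]              # same result via the full FC
assert np.allclose(out, full, atol=1e-5)
```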
Reviewed By: kittipatv
Differential Revision: D5305364
fbshipit-source-id: 0b6a1b61cc8fcb26c8def8bcd037a4a35d223078
Summary:
Similar to sparse_nn all-GPU, this is our first step towards an offline full-GPU experiment.
**Compare Run**
cat(128, 32)512-512 :
GPU 21138598 https://fburl.com/jpeod1pi
CPU 21138787 https://fburl.com/vma7225l
Reviewed By: dzhulgakov
Differential Revision: D5308789
fbshipit-source-id: 413819bf9c5fff125d6967ed48faa5c7b3d6fa85
Summary: Combine _AddDistributedParameterSync() and _SyncParams() into a single function to broadcast across distributed machines and all local GPUs simultaneously. This is similar to how calls to Allreduce have already been optimized using the functionality of Gloo. All the refactoring work is contained in data_parallel_model.py.
Reviewed By: akyrola, andrewwdye
Differential Revision: D5329277
fbshipit-source-id: 4407b88980cf396f2e0f994d796294fa79fd39ed
Summary:
This bug in the test was exposed by https://github.com/caffe2/caffe2/pull/861 (previously, the test was always using the cuDNN engine, regardless of the value of `engine`). This bug is now blocking https://github.com/caffe2/caffe2/pull/817.
```
____________________ TestConvolution.test_convolution_sync _____________________
...
if use_cudnn and requested_engine != 'CUDNN':
raise ValueError(
> 'When use_cudnn=True, the only engine you can specify is '
E ValueError: When use_cudnn=True, the only engine you can specify is "CUDNN"
```
https://travis-ci.org/caffe2/caffe2/jobs/247605579
Closes https://github.com/caffe2/caffe2/pull/881
Differential Revision: D5332619
Pulled By: akyrola
fbshipit-source-id: 63737768a155359ddbbef1da424fcbb94f86bd4e
Summary: This should make it so we no longer have super hacky DAG chains just to generate vectors of indices that could be specified at model creation time
Reviewed By: akyrola
Differential Revision: D5316707
fbshipit-source-id: 97bb3868b69e0c5a7f465c95f2e16ae0485dcc56
Summary:
It was always allocating a TensorCPU, causing the mutex in PinnedCPUAllocator to be acquired.
Not much impact as everyone should use the CUDNN transpose, but good to fix anyway.
Reviewed By: jamesr66a
Differential Revision: D5332858
fbshipit-source-id: 287643df623b7cd59ab1028ed8b2ed1d3c1da44e
Summary: Implement the gradient for the Slice op on GPU
Reviewed By: akyrola
Differential Revision: D5313442
fbshipit-source-id: 722ad0bdf65e014d3236e17d15c83d40d7c975d2
Summary:
Fixes a memonger bug where it could recycle a blob that was released by the same op being processed.
Added a verification step to ensure in-place assignments are not changed.
Reviewed By: asaadaldien
Differential Revision: D5331495
fbshipit-source-id: 20b08f6de5b973e8c9868aa048c142cac1eb6c58
Summary:
The previous attempt involved terminating the program, which is
not good. Here I am using the [[noreturn]] trick instead.
Reviewed By: jamesr66a
Differential Revision: D5313159
fbshipit-source-id: 8889efcf793d44d472502309992e6f5b0a31f0e6
Summary: Implement slice gradient for CPU. Will soon port this over to GPU so NMT can use it
Reviewed By: akyrola
Differential Revision: D5309305
fbshipit-source-id: 8fb5f4e665f236ecce9227c5c0c302f5076b01ad
Summary:
Made them faster.
This should be equivalent to the algorithm akyrola suggested, just with a list (of parents) as an intermediate representation instead of a string.
Reviewed By: akyrola
Differential Revision: D5308133
fbshipit-source-id: c976a513d10e79c157ea803afb99b147e9ea3357
Summary: The data workers test times out randomly (very seldom), and it looks like the reason is that we call FeedBlob in a thread (the enqueue thread); the first time that is called, it calls workspace.CreateBlob() -- which is not thread safe. Fix this by initializing the scratch blobs explicitly.
Reviewed By: panshen1
Differential Revision: D5292426
fbshipit-source-id: d7dad68f3ccc636c60bd82b2527f00f20da298b5
Summary:
Last time I used a uuid filled into OperatorDef, and operator_tracebacks was populated using traceback.extract_stack. There were several issues with this approach:
1. A random field in OperatorDef breaks workflows relying on memoization, i.e. when computation is skipped based on an already computed result.
2. Adding one more field revealed that RNNs are not forward compatible with respect to new fields there. The prototxt format seems to not allow forward compatibility (thanks jamesr66a for the investigation!). For RNNs we need to switch to a more resilient approach. azzolini's proposed change to OperatorDef / NetDef would allow that by nesting NetDef directly inside OperatorDef without the need for extra serialization.
3. traceback.extract_stack is very slow when the executable is on a remote filesystem. It does one or more os.stat calls for each frame on the stack. In some cases this added up to 15 extra minutes of model construction.
In this diff I use a different approach which should fix all of the problems above.
1. and 2. are solved by not adding a new field at all. Instead I report the operator idx with respect to the net it runs in. Thanks akyrola and dzhulgakov for the idea. The downside is that operator list manipulation breaks the logic, and separately created ops are not covered at all.
3. I solved this by operating on raw frames without using the traceback and inspect modules, which end up doing a lot of filesystem calls. See the function extract_stacktrace in core.py for additional comments.
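A sketch of the idea in point 3, assuming only that raw frame objects are walked directly (the actual helper in core.py may differ): frame attributes live in memory, so collecting them triggers no os.stat calls, unlike traceback.extract_stack.
```
import sys

def _extract_stacktrace_sketch():
    # Walk raw frame objects and read only in-memory attributes, so no
    # filesystem access is triggered (traceback/inspect would stat files).
    result = []
    frame = sys._getframe(1)
    while frame is not None:
        code = frame.f_code
        result.append((code.co_filename, frame.f_lineno, code.co_name))
        frame = frame.f_back
    return result
```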
Reviewed By: dzhulgakov
Differential Revision: D5286285
fbshipit-source-id: 626dd0f5f6b8b1d86bd6bf519078b122f43ddcaa
Summary: Adding a test to check computational integrity of networks constructed with AttentionCell using UnrolledCell.
Reviewed By: salexspb
Differential Revision: D5306915
fbshipit-source-id: 02acfd1011f7d3ee5fac21cc2778c4a486190c43
Summary: - One line fix for loading saved checkpoint when using Parallelize_GPU_BMUF
Reviewed By: asaadaldien
Differential Revision: D5315254
fbshipit-source-id: a20ba6438c8e6b2ef44b65270c1d3f9ab645ded0
Summary:
Apparently, `brew install` fails if the package is already installed?
```
Error: automake 1.15 is already installed
```
https://travis-ci.org/caffe2/caffe2/jobs/245226634
Maybe TravisCI made some unannounced updates to their OSX images at around the same time [they updated their trusty images](https://blog.travis-ci.com/2017-06-21-trusty-updates-2017-Q2-launch). Something changed on their side two days ago, and the OSX builds have been failing ever since.
Closes https://github.com/caffe2/caffe2/pull/858
Differential Revision: D5313447
Pulled By: aaronmarkham
fbshipit-source-id: 7085640704c60c0119a1a75ea69dacd64b5a4da8
Summary:
Adds basic CUDA 9 support, including adding Volta arch, and making appropriate modifications for half precision datatype changes
Closes https://github.com/facebookincubator/gloo/pull/49
Differential Revision: D5315336
Pulled By: pietern
fbshipit-source-id: 6468b0f357206d604bdcfec69ba82509a2c91407
Summary: Remove cases of constructing a NetDef from String, instead of just creating a NetDef.
Reviewed By: salexspb
Differential Revision: D5309645
fbshipit-source-id: 06ec8617733d9dc5385668485f3b091bb37b3f73
Summary:
This diff fixes gradient computation of residual connections for a training network constructed with MultiRNNCell.
It addresses a logic bug in _prepare_output() and _prepare_output_sequence() by keeping track internally of which layers have consecutive residual connections before the output, and then reconstructing the final residual output by (re-)preparing the output of each of those layers and then combining them with a Sum operation. This also involves keeping track of which states contribute toward the reconstruction of the final sequence output so that outputs_with_grads can be correctly passed to apply_over_sequence().
Differential Revision: D5300520
fbshipit-source-id: f37d800c909e631175de7045abe192351cc11c41
Summary: We had a latent cudnn operator instantiation failure that we didn't know about until I looked at the nvvp profile. This makes it so that those failures (i.e. OPERATOR_NEEDS_FEATURE failures) print to LOG(WARNING) instead of VLOG(1)
Reviewed By: salexspb
Differential Revision: D5303012
fbshipit-source-id: bda54682d9932f907e44aa1c81a04521d864ae99
Summary: This is needed so that we can create blobs that are not numpy arrays, e.g., creating mutex with `CreateMutex` op.
Reviewed By: chocjy
Differential Revision: D5303742
fbshipit-source-id: f83cbf67c658a234c1e4a9a114ad943a4e360598
Summary: softmax_ops_test occasionally fails with gradient checks. Stabilize by setting the numpy random seed. Also reduce some dimensions for the large input test to make it run faster.
Reviewed By: harouwu
Differential Revision: D5292106
fbshipit-source-id: a21eec89e18d30ac7c5609dacf5d413e841841a6
Summary:
Refactor data_parallel_model all_reduce and broadcast methods to work for
a given parameter set, not only gradients, and reuse them for the BMUF distributed
implementation.
Add a distributed test (multiprocessing) to BMUF.
Reviewed By: akyrola
Differential Revision: D5267083
fbshipit-source-id: 8dcc7527d0a755b903d693d8071585f0b54d3403
Summary:
As described in T19378176 by kittipatv, in this diff, we fix the issue of __getitem__() of schema.List.
For example, given Map(int32, float) (Map is a special List), field_names() will return "lengths", "values:keys", & "values:values". "values:keys" and "values:values" are not accessible via __getitem__(). __getitem__() bypasses the values prefix and directly accesses the fields in the map. Other APIs (e.g., _SchemaNode & dataset_ops) expect "values:keys" and "values:values" as it simplifies traversal logic. Therefore, we should keep field_names() as is and fix __getitem__().
Reviewed By: kittipatv
Differential Revision: D5251657
fbshipit-source-id: 1acfb8d6e53e286eb866cf5ddab01d2dce97e1d2
Summary:
compute_interference_graph() was not able to handle the case when a blob is reused twice for operators supporting in-place parameters. For example, for the following network with operators Mul and Sub
(blob) -> [Mul] -> (blob) -> [Sub] -> (blob)
an incorrect edge will be added from [Sub] to [Mul], which causes nx.is_directed_acyclic_graph() to fail.
Reviewed By: ajtulloch
Differential Revision: D5271604
fbshipit-source-id: f6095b6f8e1dba556ba223a82c8170be7f744529
Summary: Make verify_graph_equality get called by share_grad_blobs and optimize_inference_for_dag
Reviewed By: akyrola
Differential Revision: D5288993
fbshipit-source-id: b9f105ce00148b2673eed2dd390ab74f82f990ad
Summary:
kmatzen why did you set the stepsize in ff84e7dea6?
The test is flaky before this change. Solid afterwards.
Closes https://github.com/caffe2/caffe2/pull/841
Differential Revision: D5292112
Pulled By: akyrola
fbshipit-source-id: c84715261194ff047606d4ec659b7f89dac3cbb1
Summary:
/cc akyrola is it possible this test has been broken ever since 5614816fce?
More generally, why do we still have `hypothesis_test.py` at all? In the case of this test, surely one of these files does more than this one old test:
* `operator_test/cudnn_recurrent_test.py`
* `operator_test/recurrent_network_test.py`
* `operator_test/rnn_cell_test.py`
Closes https://github.com/caffe2/caffe2/pull/843
Differential Revision: D5292109
Pulled By: akyrola
fbshipit-source-id: 6df5df6353a9741d1ae1b796adaab98382857527
Summary:
Funnily, the biggest issue when trying to increase the number of trainers from 5 to 20 is not model convergence (it is worse but still converges without tuning); it is the initialization time: it took around 30 min to generate the job.
After this diff, job creation time for the standard 5-7 setup goes from 125s to 8s. (15x speedup).
Another improvement is that ##net_printer.to_string(job)## becomes less complex.
This makes the startup for 20 trainers go to 32s, which is still not ideal.
Next step will be to allow passing num_instances to Node as well. This way we'll be able to create only one reader and one trainer prototype and let the framework take care of the scheduling. For this one we will need to move some DataStream and PS initialization code to C++ first. (c.c. aartibasant)
Reviewed By: dzhulgakov
Differential Revision: D5100788
fbshipit-source-id: 7b76bce108f527a96b2bfe7ed43a22ea8679b682
Summary:
CPU version of data parallel model. The great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs).
Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and being a bit lazy). Can improve later.
Reviewed By: wesolwsk
Differential Revision: D5277350
fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210
Summary:
Advantages of cloning the tasks/execution_steps at runtime:
- Less complexity on the python side: no need to clone nets and add prefixes to blob names
- Faster start-up: we had cases of complex plans that took up to 30min to be created.
- Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs.
- Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime.
Reviewed By: dzhulgakov
Differential Revision: D5100730
fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc
Summary: Fixed a lot of issues that salexspb brought up, and templated on NetBase, which basically adds compatibility for DAGNetBase. This will be useful for Fei's future work.
Reviewed By: salexspb
Differential Revision: D5272352
fbshipit-source-id: b5ffe1d6fb0566dc1bfad9041c129a3ab7f6d93a
Summary:
- Incorporated dropout layer to the sparseNN training and testing pipeline
- Integrated an advanced model options feature on Flow UI for users to specify dropout rate
- Created an end-to-end unit test to build and run a model with dropout
Reviewed By: chocjy
Differential Revision: D5273478
fbshipit-source-id: f7ae7bf4de1172b6e320f5933eaaebca3fd8749e
Summary:
Given the parameter init_params=False, the weight blob (*_w) and bias blob (*_b) should be suppressed in model.param_init_net. Without this fix, init_params=False doesn't take effect in brew.conv as it does in brew.fc or other ops. This issue is the root cause of #790 [https://github.com/caffe2/caffe2/pull/790].
Closes https://github.com/caffe2/caffe2/pull/824
Reviewed By: harouwu
Differential Revision: D5276676
Pulled By: akyrola
fbshipit-source-id: 8f7088a8e1976658f67e027223e555375b3a2392
Summary:
Since D5193393 introduced a "token" system for memonger that prevents sharing of blobs across parallel branches, we can be more aggressive in blob sharing. Thus, this removes the tracking of 'unused free blobs' and just relies on the token system.
For forward-only resnet50, this reduces the number of shared blobs to 5 (optimal according to akirillov's calculation).
This requires careful testing, so I will not land it soon.
Reviewed By: asaadaldien
Differential Revision: D5208985
fbshipit-source-id: 2e520c4ea2351a2ec327b6c5f2e3af24234d1c9a
Summary:
Adds a separate set of CUDA collectives that run on device as an
alternative to NCCL. Use these collectives as default on-device
collectives instead of NCCL.
Whenever multiple processes on the same machine use Gloo with NCCL and
end up doing concurrent CUDA memory allocations and algorithm
execution, we risk deadlock. A follow up change will enable opt-in
usage of NCCL (e.g. through environment variable).
Benchmark output below with varying number of elements. It shows a
minor improvement over using NCCL for local reduction and broadcast.
Number of elements equal to on-device threshold (256K):
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 262144 2685 2907 3035 3215 562
(after) 262144 2682 2874 3013 3395 577
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring_chunked
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 262144 2045 2133 2325 2643 725
(after) 262144 1533 1673 1834 2048 800
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_halving_doubling
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 262144 1580 1640 1718 2069 893
(after) 262144 1371 1446 1539 1748 1125
```
Larger number of elements (4M):
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 4194304 55543 58058 60103 62659 32
(after) 4194304 54490 57923 60893 66058 33
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring_chunked
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 4194304 18049 22820 24997 26634 105
(after) 4194304 18356 20463 21695 22589 99
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_halving_doubling
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 4194304 18584 24345 27809 29722 95
(after) 4194304 19541 22718 25408 26688 88
```
Reviewed By: akyrola
Differential Revision: D5278192
fbshipit-source-id: 53f09e404663ddc8bb46d06ac87afd8ee3ffc3a2
Summary: As title. Pretty straightforward. Could actually run each kernel in parallel, but we can optimize later if needed.
Reviewed By: Yangqing
Differential Revision: D5278415
fbshipit-source-id: 29f59afe28f37fc4152ec7eb7cd6c1ab65f2cb8c
Summary: end_frm must be less than or equal to sampledFrames.size()
Reviewed By: dutran
Differential Revision: D5279265
fbshipit-source-id: 6bae714db6e07ff10ac01c95e6bead786d4941d2
Summary:
Code in tcp/transport tries to find the network interface a socket was
bound to when creating a TCP device context. Per getifaddrs(3), it is
possible for the ifa_addr field to be NULL (supposedly when an
interface doesn't have an address). Ignore such entries.
Thanks to slayton58 for reporting this.
Reviewed By: wesolwsk
Differential Revision: D5279376
fbshipit-source-id: 039380b95ba4d6d94942c30581e0b230a060870c
Summary:
a few issues:
1. Randomization hurts memoization.
2. Even if we make it non-random, we can get key collisions when loading it back.
3. RNNs use prototxt for the step net, and apparently it's not forward compatible like normal protobuf is.
I am thinking of a better less invasive solution now.
Reviewed By: jamesr66a
Differential Revision: D5272118
fbshipit-source-id: ab577fad04fbfc632e1fceffa923377a0d3da1be
Summary:
Previously, `gloo/math.h` inlined methods which use AVX builtins,
which required propagating the `-mavx` flag.
This diff moves these definitions out of the header and into a source
file to avoid this.
Reviewed By: pixelb
Differential Revision: D5271043
fbshipit-source-id: dde4dc560dfb557b46d1a582a8b38e7cb8eb0c37
Summary: Ran into it while working on a dper benchmark. Apparently it works harmlessly even with empty tensors.
Reviewed By: akyrola
Differential Revision: D5273672
fbshipit-source-id: a968ae03a659d6c1a215f12cc35f7ba68448e833
Summary:
For our CNN training runs I noticed an excessive number of futex() syscalls. Using strace I narrowed this down to excessive calls to std::condition_variable member functions.
1) I added a PushBulk member function to SimpleQueue that pushes all items in a vector onto the queue and issues a single std::condition_variable::notify_all() call, rather than a separate notify_one() call per item.
2) In DAGNet::WorkerFunction, we were calling std::condition_variable::notify_one() after every single op chain was completed, even though it should have only been called when the number of remaining operators dropped to 0 or the execution failed. I added a conditional check around this call to further cut down on unnecessary syscalls.
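The change itself is in the C++ SimpleQueue / DAGNet code, but the batched-push idea can be sketched in Python with threading.Condition (names here are illustrative): push the whole batch under one lock and wake waiters once.
```
import collections
import threading

class BulkQueue:
    """Illustrative sketch: one notify_all() per batch, not one notify per item."""
    def __init__(self):
        self._cv = threading.Condition()
        self._items = collections.deque()

    def push_bulk(self, items):
        with self._cv:
            self._items.extend(items)
            self._cv.notify_all()       # single wake-up for the whole batch

    def pop(self):
        with self._cv:
            while not self._items:
                self._cv.wait()
            return self._items.popleft()
```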
Reviewed By: pietern
Differential Revision: D5273337
fbshipit-source-id: 67d50f9d838e9a9ef3682d9a3b5ba59c7d33350d
Summary:
Working towards https://github.com/caffe2/caffe2/pull/817.
`E InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(4, 2, 5, 1, 3, 5, 5, 1), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 24576000.`
https://travis-ci.org/caffe2/caffe2/jobs/243867951
Closes https://github.com/caffe2/caffe2/pull/828
Differential Revision: D5276723
Pulled By: akyrola
fbshipit-source-id: f7d0e2dd8ef8b6a2354bd4ff7c7446c377c954b4
Summary:
This changes prepares for having a separate set of collectives that
use native CUDA calls instead of NCCL. This is needed to workaround
the issue where NCCL deadlocks when it is interleaved with CUDA memory
management operations in other processes on the same machine.
Includes a modification to the host reduction functions to bring them
up to parity with the NCCL reduction functions (they now incorporate
offset/counter arguments).
Reviewed By: wesolwsk
Differential Revision: D5276291
fbshipit-source-id: 8844731760d2c48577d207c026ce0cd641f2fc6d
Summary:
Working towards https://github.com/caffe2/caffe2/pull/817.
`E InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(20, 12, 22), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 43253760.`
https://travis-ci.org/caffe2/caffe2/jobs/243867951
/cc kittipatv
Closes https://github.com/caffe2/caffe2/pull/830
Differential Revision: D5276639
Pulled By: akyrola
fbshipit-source-id: 0c21be25ecd931837dc8b0c2cc17048f531350d1
Fixing error on line 661:
warnings.warn("masked_copy_ is deprecated and renamed to masked_scatter_, and will be removed in v0.3")
NameError: name 'warnings' is not defined
Summary:
We want to make sure that a graph optimized by memonger doesn't have any possibility of two threads writing into the same output blob at the same time, when blobs are renamed.
Creates a graph where edges are built such that a parent node's output blob is a child node's input blob, and there is no node in between the parent and child node that writes to the same blob. If two nets generate the same such graph, then the "path" of data is the same.
Reviewed By: akyrola
Differential Revision: D5210385
fbshipit-source-id: 6317fc4e16289339b50c2dcd86ec8b32d2d544a5
Summary:
This is a real implementation (not GPUFallbackOp) of the TopKOp for GPU.
There are two algorithm implementations:
- for k <= 512, it maps to a warp-wide min-heap implementation, which requires only a single scan of the input data.
- for k > 512, it maps to a multi-pass radix selection algorithm that I originally wrote in cutorch. I took the recent cutorch code and removed some cutorch-specific things as it made sense.
Also added several utility files that one or the other implementations use, some from the Faiss library and some from the cutorch library.
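For reference, a numpy sketch of what the op computes (the k largest values per row and their indices); the min-heap and radix-selection kernels above only change how this is computed on the GPU.
```
import numpy as np

def topk_ref(X, k):
    # Reference semantics: the k largest values per row, in descending
    # order, together with their indices.
    idx = np.argsort(-X, axis=-1)[..., :k]
    return np.take_along_axis(X, idx, axis=-1), idx

X = np.random.randn(4, 1000).astype(np.float32)
values, indices = topk_ref(X, k=5)   # shapes (4, 5) and (4, 5)
```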
Reviewed By: jamesr66a
Differential Revision: D5248206
fbshipit-source-id: ae5fa3451473264293516c2838f1f40688781cf3
Summary: Since shape tensor was allocated every time, the global allocation mutex was acquired, possibly leading to slowdown.
Reviewed By: salexspb
Differential Revision: D5263899
fbshipit-source-id: b44ff0b01342f116154ec2a9c65f91b5c0e51452
Summary: The old version used one block with 128 threads. Throughput was too low for the NMT use case (calculating squared gradient norms for every parameter), so this increases the throughput. Shaves 7% off CNN model training time per step
Reviewed By: wickedfoo
Differential Revision: D5263748
fbshipit-source-id: adc3bacd11e49ea00c60381d613d993050e899be
Summary:
\cc pietern
Minimal changes to allow gloo to compile and run with NCCL 2.0
Closes https://github.com/facebookincubator/gloo/pull/46
Differential Revision: D5268074
Pulled By: pietern
fbshipit-source-id: 58d625d57b31cfc932f3dbbdd7a4b83d9a2e60a8
Summary:
While this is not intended to be the most performant and
general solution, we can see from the test plan that in some cases static DAG RNN can
perform better than our own implementation. Hopefully we will get
dynamic RNN DAG execution at least as fast as this one. Then we will
not need this one in production, only for testing.
Still putting it into our benchmark for comparison purposes
Reviewed By: akyrola
Differential Revision: D5210038
fbshipit-source-id: fa44baf51c455872abd6ec5f5d151cf06e15b1fa
Summary: I accidentally noticed that we were calling the non-CUDNN version of Transpose with attention, and it is super slow. This broke when rnn_cell was changed to use ModelHelper instead of CNNModelHelper in D5062963, but calls to transpose were not "brewed".
Reviewed By: jamesr66a
Differential Revision: D5264248
fbshipit-source-id: b61494ae210f34597245f1195d20547f5b5cd8b5
Summary: Don't want to assert since it can be useful to sometimes create models that are not run (for example, unit tests).
Reviewed By: pietern
Differential Revision: D5258905
fbshipit-source-id: f1beee0605bfef235ed0f23f7e78259109720254
Summary: In https://github.com/caffe2/caffe2/pull/802, slayton58 fixed an issue in ImageInputOp where the std and mean blobs were allocated on the wrong GPU (0). This fails when there is no P2P memory access. The fundamental reason was that ImageInputOp's constructor did not call SwitchToDevice. Operator's constructor does, but ImageInputOp inherits PrefetchOp -> OperatorBase, neither of which does the switch. So I made PrefetchOperator do the switch (OperatorBase does not have a context, so it cannot).
Reviewed By: asaadaldien
Differential Revision: D5258729
fbshipit-source-id: c615c60eb2047ad26249c5bcba57ab0ef21d00e4
Summary:
This can be used to serialize allocations and NCCL kernel calls
for example. Multiple such mutexes can be created per process.
Reviewed By: Yangqing, pietern
Differential Revision: D5073609
fbshipit-source-id: 28cc4293632f20e9623ee6531365b881d0f3d9ef
Summary: This makes it easier to gather top-K by group of rows. This is useful in the situation where we want to pick the top-K from a batch of fixed-length sessions. Let `N` be the number of sessions, and `M` be the number of examples in a session. We would have a batch of `N * M` rows. We can reshape the score blob to `N x M`, and use it as input to `TopK` to select the top scores for each session. However, without the new output, it would be inconvenient to gather the rows corresponding to the top scores. The indices are in `[0, K-1)` range. The new output can be used directly as input to `Gather`.
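A small numpy sketch of the workflow described above (session counts and names are made up): reshape the N*M scores to N x M, take top-K per session, and turn the per-row indices into flat batch indices that a Gather-style lookup can consume directly.
```
import numpy as np

N, M, K = 3, 5, 2                                   # sessions, examples per session, top-K
scores = np.random.randn(N * M).astype(np.float32)  # one score per batch row

per_session = scores.reshape(N, M)
topk_idx = np.argsort(-per_session, axis=1)[:, :K]  # indices within each session
flat_idx = topk_idx + np.arange(N)[:, None] * M     # indices into the full batch
top_rows = scores[flat_idx.ravel()]                 # what a Gather on flat_idx returns
```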
Reviewed By: chocjy
Differential Revision: D5171459
fbshipit-source-id: 69f7b41456c3f9670650ae07afc8fef8328485e9
Summary:
The global StatRegistry doesn't get reset when the workspace is reset.
```
> self.assertTrue(len(workspace.FetchBlob('k3')) == 2)
E AssertionError: False is not true
```
https://travis-ci.org/lukeyeager/caffe2/jobs/240162665
/cc azzolini
NOTE: this error doesn't show up if you just run `stats_ops_test.py` directly. It shows up when you run other tests in the same session before this test:
```
pytest -v caffe2/python/
```
Closes https://github.com/caffe2/caffe2/pull/788
Differential Revision: D5259232
Pulled By: salexspb
fbshipit-source-id: 3c72633af6bb61c4fda62195298b1e9574b4cbef
Summary:
The existing per-branch TravisCI badges don't work, and will be out-dated when https://github.com/caffe2/caffe2/pull/735 is merged.
I also added an Appveyor badge.
Closes https://github.com/caffe2/caffe2/pull/786
Differential Revision: D5253408
Pulled By: aaronmarkham
fbshipit-source-id: b274b30fcef9df3d2ff7cda1046f8462ad56c83b
Summary: Upgrades this file to use brew instead of CNNModelHelper
Reviewed By: harouwu
Differential Revision: D5252089
fbshipit-source-id: 6df4350717c1d42bc4bcc63d255cd422f085ee05
Summary: Implementation of the SliceOp for CUDA
Reviewed By: akyrola
Differential Revision: D5254287
fbshipit-source-id: 0a1660e1aa161fd088a2d8f886e019c05a1919a2
Summary:
This brings back DAGNet up to parity with SimpleNet, where
execution stops as expected after an operator fails. For the DAGNet
it's more involved, since we have to deal with all worker threads
stopping execution. Because the job queue may still hold an arbitrary
number of chains to execute, this diff explicitly closes it down,
waits for all workers to terminate, and resets the job queue, upon
seeing a failure.
Reviewed By: akyrola
Differential Revision: D5232955
fbshipit-source-id: 4dac3c3ed6e5c2ebd07473b0f8be2b02c28978e9
Summary:
```
File "/data/caffe2/install/caffe2/python/hypothesis_test.py", line 1911, in test_batch_to_space
(w + 2 * pad) / block_size).astype(np.float32)
File "mtrand.pyx", line 1404, in mtrand.RandomState.randn (numpy/random/mtrand/mtrand.c:19843)
File "mtrand.pyx", line 1534, in mtrand.RandomState.standard_normal (numpy/random/mtrand/mtrand.c:20368)
File "mtrand.pyx", line 167, in mtrand.cont0_array (numpy/random/mtrand/mtrand.c:6127)
TypeError: 'float' object cannot be interpreted as an index
```
```
File "/data/caffe2/install/caffe2/python/operator_test/tile_op_test.py", line 101, in tile_ref
tiled_data = np.tile(X, tuple(dims))
File "/data/caffe2/venv/local/lib/python2.7/site-packages/numpy/lib/shape_base.py", line 881, in tile
return c.reshape(shape_out)
TypeError: only integer scalar arrays can be converted to a scalar index
```
I also tested to make sure this still works with 0.11.
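A sketch of the kind of change involved (values are made up): newer numpy rejects float shape arguments, so the dimensions passed to np.random.randn and the repetition counts passed to np.tile have to be explicit ints.
```
import numpy as np

w, pad, block_size = 4, 1, 2

dim = int((w + 2 * pad) // block_size)         # was a float under true division
x = np.random.randn(dim).astype(np.float32)    # ok: integer dimension

tiled = np.tile(x, (2, 1))                     # repetition counts must be ints too
```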
Closes https://github.com/caffe2/caffe2/pull/787
Differential Revision: D5248087
Pulled By: salexspb
fbshipit-source-id: eff69482a8eabb8ace330003fa326c832b53865f
Summary: Deprecate CNNModelHelper in python/workspace_test.py, using ModelHelper instead of CNNModelHelper
Reviewed By: harouwu
Differential Revision: D5251778
fbshipit-source-id: d634f1c76e41a95b0247ebf5d5a48aef6f8e232e
Summary:
This diff deprecates `CNNModelHelper` in the `AlexNet()` function. More diffs will be coming to deprecate the helper in other functions.
Depends on D5241738
Reviewed By: harouwu
Differential Revision: D5247004
fbshipit-source-id: eec5c5ef916a85de8289cb92d2174a6a4b8075bf
Summary:
Occurred when running with multiple GPUs, not all of which
are connected via P2P.
Essentially when mean_gpu_ and std_gpu_ are allocated and
populated in the constructor of ImageInputOp, it does not seem to
be guaranteed that the active context is the same as the final context
on which the Op will be run. This causes the image data and the
mean/std to be on different devices. With P2P we don't mind this, but
without P2P this causes OOB memory accesses in the GPU transform
kernel.
Closes https://github.com/caffe2/caffe2/pull/802
Differential Revision: D5258528
Pulled By: akyrola
fbshipit-source-id: 778e55b5f8bb39fc52644b68573c747210ebf3bb
Summary: Hard-to-debug problems arise when a gradient creator fails because the forward op is itself incorrect. Add checking of the schema before calling the creator. Also clarify the error messages.
Reviewed By: Yangqing
Differential Revision: D5256016
fbshipit-source-id: 78550f7e2ce5b88e26b69fdae4be0eece52edfea
Summary:
The current version of schema.py has a Metadata class with three fields. The default for it is set to
four Nones. This is just changing that to three Nones so that the number of default values matches the number
of actual fields.
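Illustrative of the bug class being fixed (field names here are placeholders, not the actual schema fields): when defaults are attached to a namedtuple this way, the defaults tuple has to line up with the field count.
```
from collections import namedtuple

Metadata = namedtuple('Metadata', ['field_a', 'field_b', 'field_c'])
Metadata.__new__.__defaults__ = (None, None, None)   # three fields, three Nones

m = Metadata()   # every field now correctly defaults to None
```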
Reviewed By: kennyhorror
Differential Revision: D5250463
fbshipit-source-id: 42e5650d270f5f63662614d8445b4819ed370dec
Summary: Also fixed a small bug in ModelHelper constructor
Reviewed By: harouwu
Differential Revision: D5246799
fbshipit-source-id: 3719ca078f0e2b5e463fc93da9c8215f5583bd9a
Summary:
We need to support RNNs explicitly in ExtractPredictorNet, because they store sub-nets as strings in special arguments. When netdef arguments arrive, we can generalize this a bit.
Added a test under rnn_cell_test to test that extracting an LSTM predictor net works correctly and sets the device option properly for the step net ops.
Reviewed By: yqwangustc
Differential Revision: D5236334
fbshipit-source-id: cd653427f8c440a14d94195a532d18276f94749a
Summary: A quite common problem is that it is hard to load blobs with pe.load_from_db to a specific device. One must set the device options of the returned init_net and predict_init_net, which is quite magical. So I made load_from_db() able to set these device options automatically, based on the device scope or a device_option parameter. Added a unit test.
Reviewed By: asaadaldien
Differential Revision: D5249202
fbshipit-source-id: 7b9d91476cb8d1b0ec0d9772e50b9148b8b184fa
Summary:
salexspb This fixes a major perf issue (40% boost on alexnet end-to-end perf) in the multi-precision SGD optimizer - it was causing repeated cudaMalloc / cudaFree calls during training iterations due to the changing size of the `grad` blob as it moved from fp16 <-> fp32.
Closes https://github.com/caffe2/caffe2/pull/797
Differential Revision: D5246978
Pulled By: salexspb
fbshipit-source-id: ec3d7ef18445e19eaf5aac908d0a7bcd5957eb60
* Add torch.matmul function.
Includes test_torch, test_autograd and docs changes.
* Add __all__ to functional so imports aren't accidentally imported.
* Include unbind in __all__.
* Add matmul case for when one argument is 1-dimensional and the other
at least 3-dimensional.
* Add squeeze_ to Variable.
* Use squeeze_ instead of squeeze for matmul.
Summary: This was only needed in order to initialize stateful PythonOps. Now PythonOp has support for initialization at Op creation time, so this is not used anymore.
Reviewed By: dzhulgakov
Differential Revision: D5242908
fbshipit-source-id: dbaa249466dd0f37f25d204d387b1f99c6dd4fed
Summary: This is going to show a Python Caffe2 user where a failed operator was created. The motivation for not putting this information right in the protobuf is to avoid making it too verbose and to keep the ability to read protobufs of a net after a simple print() call.
Reviewed By: jamesr66a
Differential Revision: D5226047
fbshipit-source-id: 7edfe850e05a2ec209577142aa3368664a57a108
Primary things I had to fix:
- Suppress _XOPEN_SOURCE warnings by ensuring that Python.h is included
first, because it always unconditionally defines this macro.
- Turn off strict aliasing, because Python 2 doesn't work with strict
aliasing.
- Workaround setuptools bug, where it's incorrectly passing
-Wstrict-prototypes to C++ compilers (where this doesn't make
any sense)
To compile csrc with -Werror, run `CFLAGS="-Werror" python setup.py build_ext`
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
This allows constructing a python op by passing a pickled "builder function call" as an argument to the op.
The builder function is called at PythonOp construction time and returns a function that will be called when the op is run.
This way we can drop the dependency on 'tokens', which didn't work properly for protobufs that get distributed to other processes. Now, the PythonOp definition is self-contained: as long as the build dependencies are right, sharding the protobuf is enough to execute the net remotely.
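A plain-Python sketch of the builder pattern described above (how the builder is registered with PythonOp is elided; names are illustrative): the builder runs once at op-construction time and returns the callable the op invokes on every run.
```
def scale_builder(factor):
    # Runs once, at op-construction time (the "builder function call").
    scale = float(factor)
    def run(x):
        # Runs every time the op executes.
        return x * scale
    return run

op_fn = scale_builder(2.0)
assert op_fn(3.0) == 6.0
```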
Reviewed By: dzhulgakov
Differential Revision: D5080833
fbshipit-source-id: a5deaca5d3143024cdb121519689224e9dbec5ce
Fixes #1783.
There is an undocumented invariant in PyTorch that we should
try to avoid having storage == NULL as much as possible (even
though Torch supports it.) This commit properly documents the
invariant, and fixes a bug in sparse where the invariant was
not respected. This now means that sparse tensors now correctly
remember what GPU they are associated with.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Fixes #1782.
The default operation should be cheap: user can always choose to
explicitly make a copy on the way in. Note that this is a
BACKWARDS COMPATIBILITY BREAKING change. However, we DO create
a new tensor wrapper (so we are not affected by subsequent
size changes, etc.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Truncate the id list using the max length computed in compute meta, so that it has a determined length,
which is useful for the position-weighted pooling method.
Reviewed By: sunwael
Differential Revision: D5233739
fbshipit-source-id: f73deec1bb50144ba14c4f8cfa545e1ced5071ce
Summary: Replace call to function that is only supported in CUDA 8.0 with one that has been supported in previous releases.
Reviewed By: pietern
Differential Revision: D5231755
fbshipit-source-id: d72aec2a4a1c511064a65142887f8a05b51dad55
Summary: Recently people found that this test is too strict because of proto string matching. Thus, I changed it to compare fields so that this test will not complain even if the protobuf changes in the future.
Reviewed By: dzhulgakov
Differential Revision: D5229855
fbshipit-source-id: 54efcd7a0f9e5dbba1ddeb480801abcb859e07bd
Summary: added an operator that converts key/value blobs into a blob containing a map pointer, unittest passed.
Differential Revision: D5224449
fbshipit-source-id: 2f60754ed3ba6ed16039c09019117ae3c3646ab2
Summary:
Diff D5224410 initializes the should_stop_blob explicitly. With that, we will
have one more blob when executing the job. Adjusts the check accordingly.
Reviewed By: azzolini
Differential Revision: D5228398
fbshipit-source-id: 439b186c30b0b1d0e41e513babbcccd85e7a1b4a
Summary:
We waste extra memory by creating two autosplit gradient blobs and then accumulating them into the main one. Sometimes, when Sum / Sub ops are involved, we can avoid wasting extra memory at all.
Ideally we would not waste any memory and make ops add to the same blob rather than calculating separate results and then merging them. But that would require a substantial change to the framework and rewriting a lot of operators.
Reviewed By: dzhulgakov
Differential Revision: D5157667
fbshipit-source-id: 8293824d6cdd971d8853ae90aee68e4a6d1e132b
Summary:
It's very useful for simple cases like benchmarking nets where we want to encode input/output record in the net and don't want to go through the hurdles of storing input/output record in MetaNetDef.
For those cases I propose remapping the input/output record before saving to 'input_record/{field_name}'. Then we can recover input/output record back just based on the names of the blobs.
Differential Revision: D5170473
fbshipit-source-id: ac5daa60051605ed93022aec1377a49f08f15663
1) Line up trailing dimensions in broadcast docs.
2) remove unnecessary expand_as in common_nn test.
3) use view in tensor_str instead of resize_.
4) newExpand remove raiseErrors change.
5) clarify expandedSizes/expandedStrides parameters in inferExpandGeometry.
6) simplify inferSize2/inferSizeN implementations.
7) use new-style classes for warning.
Setting torch.utils.backcompat.broadcast.warning.enabled=True
will cause Python warnings in the case where broadcast occurs
but previously 1-d view-style pointwise ops occurred.
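A small usage sketch of the flag mentioned above (the flag only exists in releases from this era): with it enabled, a case that used to be treated as equal-nElem 1-d views but now broadcasts will emit a warning.
```
import torch
import torch.utils.backcompat

torch.utils.backcompat.broadcast.warning.enabled = True

a = torch.ones(4, 1)
b = torch.ones(4)
c = a + b   # now broadcasts to (4, 4); previously handled as same-size 1-d views
```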
Summary: This diff fixes an issue with running the same reader in the same workspace multiple times. In order to achieve correct behavior of execution step we have to explicitly initialize should_stop_blob with False.
Reviewed By: kennyhorror
Differential Revision: D5224410
fbshipit-source-id: 4ad2740e187b62b0a1f5612ea3eef223dcc8a799
1) Rename calculateExpandGeometry to inferExpandGeometry for consistency
2) Simplify inferExpandGeometry implementation by using a single pass
through dimensions
3) Implement a two operand expansion, expand2.
4) Implement versions that return error code to use for fallback to
equal nElem support.
* Add SELU activation function (a small functional sketch follows after this list)
* Remove unnecessary case
* Add Function for SELU + tests and fix RReLU inplace
* Fix extra line in doc
* Fix tests
Remove in-place tests for RReLU. For some reason they fail on legacy nn, but pass on nn
* SELU in new-style Function
It also supports double backprop, verified with gradgradcheck
* Fix flake8
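A minimal functional sketch of the SELU activation added above (constants from the "Self-Normalizing Neural Networks" paper); the actual Function in this change also implements backward and double-backward.
```
import torch

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    # scale * (x for x > 0, alpha * (exp(x) - 1) otherwise)
    return SELU_SCALE * torch.where(x > 0, x, SELU_ALPHA * (torch.exp(x) - 1))

y = selu(torch.randn(5))
```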
Summary: added an operator that converts key/value blobs into a blob containing a map pointer, unittest passed.
Differential Revision: D5166513
fbshipit-source-id: 748527c423a163fe55f914c08fff3adfc74a540c
Summary:
The SparseToDense layer is essentially calling the SparseToDenseMask op.
This makes it impossible to call the functional layer with the true SparseToDense op.
This diff is to rename the layer.
Please let me know if I missed anything or you have a better name suggestion.
Differential Revision: D5169353
fbshipit-source-id: 724d3c6dba81448a6db054f044176ffc7f708bdb
Summary:
Static RNN allows unrolling an RNN into a Caffe2 graph using all existing cell abstractions. In this diff I introduce several new tests that already caught a few bugs in our RecurrentNetworkOp gradient accumulation logic by comparing it to an unrolled version.
Another use case is perf - potentially we can run an unrolled net faster because DAGNet will have access to the whole graph. Same about memonger. But this work is not part of this diff
Reviewed By: akyrola
Differential Revision: D5200943
fbshipit-source-id: 20f16fc1b2ca500d06ccc60c4cec6e81839149dc
Summary:
In some cases you have an optimized network and a normal one, and you would like to make sure they produce the same results. If the math under the hood is the same, you can do this with a very high precision compared to a traditional numerical gradient check. One application is RNNs: there we can unroll the RNN into a Caffe2 graph and make sure the result is the same as in the optimized version using RecurrentNetworkOp.
Another possible application is graph transformations: we can verify that afterwards the nets produce the same gradients (cc akyrola on memonger, bwasti on other transformation ideas).
Reviewed By: bwasti
Differential Revision: D5200855
fbshipit-source-id: 0196af187f0c2feb33de4778ea08d0d288fe1017
Summary:
When building a multi-layer static RNN, the last timestep of the first layer (and other layers except the last one) doesn't get a gradient for the cell state, as normally the user uses results only from the last layer and the cell state doesn't go up either.
ZeroGradient provides a general solution for injecting 0-gradient blobs. It is in some way similar to the StopGradient operator, which is also special-cased.
Reviewed By: bwasti
Differential Revision: D5198375
fbshipit-source-id: a21d0cfb3676a77fac72e5897a200d0bd25fc6de
Summary: Support grouped convolutions using the `group` arg in the nnpack convolution implementation.
Reviewed By: Maratyszcza
Differential Revision: D5204743
fbshipit-source-id: 81116213f7a4f6afa793e4bdf1c5bdd9a55e124f
Summary:
`brew_test.py` is just plain broken. `core_test.py` doesn't work with pytest. `apmeter_test.py` and `top_k_test.py` don't work for CUDA builds.
Closes https://github.com/caffe2/caffe2/pull/765
Differential Revision: D5211817
Pulled By: Yangqing
fbshipit-source-id: 78ec5af35a3fa870978e4c9590210ade9e3bc5ac
Summary:
Neither dependency is required by the core Python modules.
OpenCV, in particular, is a pain to install (no pip package). Conditionally skipping this test will make TravisCI integration easier.
Closes https://github.com/caffe2/caffe2/pull/739
Differential Revision: D5211799
Pulled By: Yangqing
fbshipit-source-id: c6bdc8a17977f64f34e968fd9ab8c65161d2624d
Summary:
I closed https://github.com/caffe2/caffe2/pull/736 because one of these variables should be used after all.
Here's how C1 uses this variable: https://github.com/BVLC/caffe/blob/rc5/cmake/Targets.cmake#L116
Without this fix, there is a race condition in the parallel build leading to this error:
```
make[2]: *** No rule to make target `../third_party/NNPACK/lib/libnnpack.a', needed by `caffe2/libCaffe2_CPU.so'.
```
Closes https://github.com/caffe2/caffe2/pull/737
Differential Revision: D5211794
Pulled By: Yangqing
fbshipit-source-id: 9e368f09b01edaf86252727adc6f6cc40d244e29
Summary:
The random number generators could be used in a thread-unsafe manner.
This patch fixes this by adding a way for tasks to get the thread ID they are
running on.
Reviewed By: panshen1
Differential Revision: D5051334
fbshipit-source-id: 9a9f9e2e7b7a86ff456f37b40422af4fa100b5d9
Summary:
This diff fixes various issues with memonger, and works at least with rbgirshick's failure case, Resnet-50, and a new harder unit test. I will still create a proper resnet50 test.
1) Introduce the concept of "tokens". These are passed down the dependency chains, and a blob can be used for recycling only if it owns all the tokens that are currently in possession. Tokens are added when branching, and tokens are redeemed after all inputs are satisfied. A bit hard to explain; a small sketch of the idea follows after this list.
2) There were various bugs due to bad code: the free_blobs data structure is of a different type when we have blob sizes and when we don't. I plan to rewrite this soon, but there were some bugs.
3) Added a harder unit test that failed before.
4) Added test for resnet50 + memonger
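A highly simplified sketch of the token idea from point 1 (see the note there); the data structures and names are illustrative, not the actual memonger code.
```
def can_recycle(tokens_owned_by_free_blob, tokens_held_by_op):
    # A freed blob may only be recycled by an op that holds every token
    # the blob picked up along its dependency chain (tokens are added at
    # branch points and redeemed once all inputs are satisfied).
    return tokens_owned_by_free_blob.issubset(tokens_held_by_op)

assert can_recycle({"branch_a"}, {"branch_a", "branch_b"})
assert not can_recycle({"branch_a", "branch_b"}, {"branch_a"})
```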
Reviewed By: asaadaldien
Differential Revision: D5193393
fbshipit-source-id: bc2a714877aa1201c32a5ba8ade862865e455711
Summary: I broke resnet50 when switching to use the optimizer, which uses an LR per parameter. This only happens after each epoch, and I did not test patiently enough. As a stop-gap, while asaadaldien works on a better solution, just fetch the lr of the conv1_w param.
Reviewed By: asaadaldien
Differential Revision: D5207552
fbshipit-source-id: f3474cd5eb0e291a59880e2834375491883fddfc
Summary:
This diff plans to attack the problem where we want to just annotate the device option for operators and let Caffe2 inject cross-device copy functions for us. This feature would be useful for mixed-device training and multi-device training with several nets, where previously we did the heavy lifting of adding copy functions ourselves.
Ideally, this feature will happen like this:
//construct your nets first
core.InjectDeviceCopyAmongNets([train_init, train_net, ...])
My ideas are written in comments. I will update them here as well later.
Reviewed By: dzhulgakov
Differential Revision: D5134103
fbshipit-source-id: 173f7da9d1773d1c50ccdc27f1b5cd3067b04af5
Summary: Catch exceptions when fetching uninitialized blobs while collecting blob sizes in the workspace. Some of the output blobs (like the mask output of DropOut when is_test=1) may be nullptr, and FetchBlob will fail.
Differential Revision: D5198641
fbshipit-source-id: 45ee26c4cb1c25cc48904e9f7d7c007224c97418
Summary: Implements an APMeter operator (APMeterOp) to calculate AP for multiclass classification given prediction scores and labels. The Op takes a score tensor [nsamples x nclasses] and a label tensor [nsamples x nclasses], and outputs a float tensor of size nclasses as the AP for each class.
Reviewed By: akyrola
Differential Revision: D5082565
fbshipit-source-id: ae7304bc8fc999c361245b9aec38eb9a5f5eef4b
Summary:
Add a helper function for parametric op ElementwiseLinear
The typical syntax is model.ElementwiseLinear(input, output, dimension)
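For reference, a numpy sketch of what the op computes along the chosen feature axis, as I read it (shapes are illustrative): Y[n, d] = X[n, d] * w[d] + b[d].
```
import numpy as np

X = np.random.randn(4, 6).astype(np.float32)   # N x D input
w = np.random.randn(6).astype(np.float32)      # per-feature scale
b = np.random.randn(6).astype(np.float32)      # per-feature bias

Y = X * w + b   # broadcasting applies w and b along the feature dimension
```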
Reviewed By: harouwu, akyrola
Differential Revision: D5114152
fbshipit-source-id: 8e8c691f824f518ae510a72ab0c12de1b018f3b5
Summary:
There is an edge case where internal gradient blobs of the backward step net should not be considered internally calculated if the only "internal" calculation is in-place.
In the case of the failing attention unit tests, the offending blob was attention_weighted_encoder_context_grad, which was incorrectly considered internal because it was the output (as well as input) of a Reshape on the step net's edge. The caveat here is that the results may be unpredictable if a non-pass-through in-place operation is applied to a blob within a step net which is also consumed both internally and is a recurrent state/output. (This is an extreme edge case, and difficult to explicitly enforce, but it's worth noting.)
Reviewed By: salexspb
Differential Revision: D5198328
fbshipit-source-id: 0cfa8f903fd767fc50e727f238ac3d8cdca03fe0
Otherwise, on many machines, the size of the OpenMP thread pool will
change between MKL and our OpenMP enabled functions. The constant thread
creation and destruction results in worse performance and leaks memory
on GCC 5.4
Summary:
While debugging #43 I found common/common.h missing some headers as well.
Fixes#43.
Closes https://github.com/facebookincubator/gloo/pull/44
Differential Revision: D5194970
Pulled By: pietern
fbshipit-source-id: 4861cd04c56931d4759f5bc050816788252003ee
Summary:
The goal of this diff is:
1) Enable checkpointing to honor batches_per_epoch
2) Resume hive_readers mid-split
Reviewed By: azzolini
Differential Revision: D5004212
fbshipit-source-id: 2ff5df30ba946eefadd109d80056cde67398a080
Summary:
Input of topK op: X (dense)
Output of topK op: Value and Indices (sparse representation)
Value will have a gradient in some cases;
we backprop (copy) the gradient from sparse (d Value) to dense (d X).
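A numpy sketch of the gradient copy described above (shapes made up): d X is zero everywhere except at the forward top-K positions, which receive the corresponding entries of d Value.
```
import numpy as np

X = np.random.randn(3, 7).astype(np.float32)
K = 2
indices = np.argsort(-X, axis=1)[:, :K]              # forward TopK indices
d_values = np.random.randn(3, K).astype(np.float32)  # gradient w.r.t. Values

d_X = np.zeros_like(X)
np.put_along_axis(d_X, indices, d_values, axis=1)    # scatter back to dense
```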
Differential Revision: D5133461
fbshipit-source-id: 7bad55b60e8a22dfe0e51357ce2099d7f752c133
Summary: replace hand made sgd with build_sgd
Reviewed By: salexspb
Differential Revision: D5186331
fbshipit-source-id: 3c7b4b370e29a1344b95819766463bae3812c9a6
Summary: The booleanmask supports another output with sorted indices
Differential Revision: D4984255
fbshipit-source-id: becb10d7fe989bb2f6488c901766a45369613eb7
Summary: Contains the ObserverBase class and some unittests.
Reviewed By: bwasti, pietern
Differential Revision: D5099367
fbshipit-source-id: fabde126d3281729dfc772d63dbf363e5d649319
Summary: Previous implementation relied on the order of fields for some reason.
Reviewed By: azzolini
Differential Revision: D5164478
fbshipit-source-id: 12717310860584e18ce4ca67d0bd5048354cdc0a
Summary: Infer input and output devices from OperatorDef through OperatorSchema. This is inspired by shape inference. With this feature, we can easily analyze device information for all blobs in the net in a generic way. It is really helpful for automatic cross-device execution.
Reviewed By: akyrola, dzhulgakov
Differential Revision: D5161065
fbshipit-source-id: ee656123112171a4ca00f2fb3f6940f32ddf3135
Summary: update the new sigmoid calling process
Reviewed By: dzhulgakov
Differential Revision: D5187589
fbshipit-source-id: cf29e7e80776ac1c4cf5718c5d6043d44f62d4de
Summary:
This diff fixes fetching of the parameters in the global namescope. The earlier
diff that switched to '' introduced this bug.
Reviewed By: dzhulgakov
Differential Revision: D5189667
fbshipit-source-id: 4818e99e2c2c90788e70e0b8b6204ec6f471d37d
When I use named_parameters to modify the lr and weight decay, I hit a bug, because named_parameters returns torch.nn.parameter.Parameter values, not a generator over the Parameters.
Summary: ExpandDims is a trivial utility op which should not be triggering a warning when used by ModelHelper.
Reviewed By: akyrola
Differential Revision: D5117985
fbshipit-source-id: 5589f46f58458f5019924b48602db088563f2fee
Summary:
Make it easier for users by returning from ExtractPredictorNet the list of blobs that must be saved/exported to run a predictor net. Added a test for ExtractPredictorNet
Codemod.
Reviewed By: asaadaldien
Differential Revision: D5176097
fbshipit-source-id: b1af42132459487b8d94fcdde0e4c514da608243
Summary:
The swap for accumulated gradients causes problems with distributed training, as Gloo ops expect the buffers (pointers) to remain the same. Also, it is quite a hack. So after talking with salexspb, this diff changes the parameter gradient handling by "transposing" it:
- gradient ops are rewritten to write into a blob with name grad + "_tmpstep"
- then that blob is accumulated directly to the actual gradient blob, not a temporary "_acc" blob.
Reviewed By: salexspb
Differential Revision: D5184839
fbshipit-source-id: c7ca445d4077ff90413c358bb0f7199d123a5553
Summary:
*Fix #417 again (#551 was insufficient)*
Even after a reallocation, the data address can still be the same if malloc returns the same newly freed address.
* Be very explicit and careful about how we set these flags so they don't interfere with other tests
* Disable the failing check
This somewhat takes the teeth out of this test, since it no longer verifies that the reallocation actually occurs.
Test with:
```
blob_test --gtest_filter=TensorCPUTest*Shrink* \
--gtest_shuffle --gtest_repeat=100 --gtest_throw_on_failure
```
/cc sunwael
Closes https://github.com/caffe2/caffe2/pull/723
Differential Revision: D5174953
Pulled By: akyrola
fbshipit-source-id: 3d875a52c8139e73db85550817dea3c837eb7eae
Summary: Machines may not create their Gloo pairs at the same time, due to earlier variable time work. Increase the timeout used to establish the initial tcp connection to accommodate without sacrificing the shorter default timeout for outstanding reads/writes. No related change required for ibverbs as there is no communication on init.
Reviewed By: akyrola
Differential Revision: D5184518
fbshipit-source-id: 0e6c9704a2d2f1406b3927f75887f0a42199450b
Summary:
I'm using Python ops in a project and need corresponding Python gradient ops. For my use case, only a subset of the forward op outputs have gradients and only a subset of forward op inputs have gradients. However the current implementation of `GetPythonGradient` forces all grad inputs and outputs to exist. This diff allows one to specify that only a subset of grad inputs / outputs are used when constructing the Python op.
I'm not sure if this is up to caffe2 standards, so please push back on style and content as needed.
Reviewed By: dzhulgakov
Differential Revision: D4897004
fbshipit-source-id: 96fffe8634c51a49b6bce7339a46c6235f7d4bbd
Summary:
fixing missing future package issue.
Recently we found that some of our users do not have the future module available, so we need a try/except wrapper around all `past` imports.
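A minimal sketch of the kind of guarded import meant here (the exact symbols wrapped in the diff are not listed in this summary):
```
# Tolerate environments without the "future"/"past" compatibility packages.
try:
    from past.builtins import basestring  # provided by the "future" package
except ImportError:
    basestring = str  # fall back when future/past is unavailable
```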
Reviewed By: Yangqing
Differential Revision: D5183547
fbshipit-source-id: 262fdf2940ee1be4454bf0b0abb9e6a0f1a0ee82
Summary:
This diff introduces abstractions for parameter sharing for all the
parameters that are created through the new create_param syntax.
Possible use cases of this parameter sharing:
1. Share params within RNN interface.
2. Some complicated models that might share some of the branches.
3. TODO (next diff): Cross-model parameter sharing.
Reviewed By: salexspb
Differential Revision: D5160935
fbshipit-source-id: c6d40a5ed7ead240cd7db0eb69de6dc5f505b05a
Summary:
This is a little excessive:
```
CMake Warning at cmake/Dependencies.cmake:201 (find_package):
By not providing "FindEigen3.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "Eigen3", but
CMake did not find one.
Could not find a package configuration file provided by "Eigen3" with any
of the following names:
Eigen3Config.cmake
eigen3-config.cmake
Add the installation prefix of "Eigen3" to CMAKE_PREFIX_PATH or set
"Eigen3_DIR" to a directory containing one of the above files. If "Eigen3"
provides a separate development package or SDK, be sure it has been
installed.
Call Stack (most recent call first):
CMakeLists.txt:72 (include)
```
Closes https://github.com/caffe2/caffe2/pull/729
Differential Revision: D5183059
Pulled By: Yangqing
fbshipit-source-id: d17d5d06a50abb50f9978d022ddc4918e991079d
The correct device must be set when getting the base allocation and when
calling cudaIpcCloseMemHandle. Store the device in the allocator's
context, which was previously always NULL.
Fixes #1707
* Modify torchvision documentation following https://github.com/pytorch/vision/pull/179
* Add new datasets to docs
* Fix wording in torch.datasets
* Small clarification
Summary:
KaimingHe noticed a curious performance problem with ConvTranspose (actually ConvTransposeGradient): it got slower when more GPUs were used! This did not make sense.
After some strenuous debugging, I noticed that tensor Y = Output(0) was being reallocated every time: this causes the slowdown because we grab a mutex for each allocation.
Turns out this Y variable is copy-paste code and actually not intended to be part of the gradient op. This caused reallocation because the computed size of Y was larger than dfilter's (also Output(0)), but we never set the capacity of Y/dfilter to match the capacity of the larger size. Thus, Tensor.Resize() always ended up resetting the tensor --> allocation. This did not affect the correctness of the code, but made it super slow.
Before on KaimingHe's code ConvTransposeGradient took total of 3800 ms, now about 200ms.
Reviewed By: ajtulloch
Differential Revision: D5180280
fbshipit-source-id: d72f23038f0c51d82bcde7aed55089d657bda03e
Summary: simply allows accessing the third protos only when the temporal jittering option is off
Differential Revision: D5178943
fbshipit-source-id: 027234abee5c5c9fcf624dcbd55eb10ae8c9314f
Summary:
This diff is creating new type of Initializer - ExternalInitializer. This
initializer is supposed to be used in cases when the parameter blob is already
expected to exist in the workspace.
Reviewed By: dzhulgakov
Differential Revision: D5171322
fbshipit-source-id: d27861f0f80afdea93c235d49f63da19adccc92c
* Fix gc_refs assertion failure
Ensure that each THPVariable -> THPFunction reference contributes one
ref count to the THPFunction by creating a new shared_ptr for each ref.
Because multiple shared_ptrs can again manage a single THPFunction, it's
not safe to use std::weak_ptr where it may point to a PyFunction. It's
still safe to use weak_ptr for grad_accumulator since these are never
PyFunctions.
Fixes #1626
* Remove stale comment
Summary:
This diff is the first step in the effort of refactoring all parameters. As a first step, I'm merging the concepts of params and computed_params, which are going
to be based on tags instead (in the first version it still uses the old data structs to store all the BlobReferences).
Renaming computed_params to non-trainable/non-backprop params should be done in some other diff.
Reviewed By: salexspb
Differential Revision: D5171159
fbshipit-source-id: 68031ca779f053fb266a7c4a2e5b482a3bd9c832
Before the change, processes were not waiting for the master even when they got
'connection refused' (the master is not listening yet, so we should wait).
This happened because we were closing the socket twice: first by
the resource guard, and second manually in the exception handler.
That caused errno to be set to a different value (9 - bad file descriptor),
so the `if` that checked whether the connection was refused failed.
* Add sanity checks
* Refactor InitMethodFile and TCPInitMethod to more logical functions
* Update few error messages
* Add passing parameters by **kwargs, so now order of parameters is not relevant
* Review comments
Summary:
Add add_weight_decay to optimizer + test.
In D5142973 I accidentally removed weight decay from resnet50 trainer, so this restores it.
Reviewed By: asaadaldien
Differential Revision: D5173594
fbshipit-source-id: c736d8955eddff151632ae6be11afde0883f7531
Summary: noticed a few lint errors in image_input_op so cleaned them up
Reviewed By: akyrola
Differential Revision: D5152171
fbshipit-source-id: f84f476ddace6b4164607a01a9780a2e57e2133f
Summary: old diff had some changes to formatter.py and generator.py, but now everything is in github.py
Reviewed By: bwasti
Differential Revision: D5165061
fbshipit-source-id: 5fe5ff70ff2c5525c7aacf20854916c86d272749
* A pile of misc doc fixes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Handle @apaszke review comments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Initial csrc documentation.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Use new blob as residual sum output, and add scoping to prevent any name conflicts.
Reviewed By: urikz
Differential Revision: D5167145
fbshipit-source-id: a01c87ed2278205e95e8395314b166afb1dca1b3
Summary:
Split the Caffe2 memory based model into to parts
- Dimension reduction MLP
- DNN with concatenation of memory and obj feature
Currently only implement simple mean
Differential Revision: D4866825
fbshipit-source-id: d2f6813402513ec9af30dbe29a50593e2d3cdb3b
Summary:
also contains previous edits on statuses which should be in here....
Closes https://github.com/caffe2/caffe2/pull/657
Differential Revision: D5158733
Pulled By: aaronmarkham
fbshipit-source-id: faba2ab8e2dab206e09f57021b973b3e7d01af95
Summary:
A recent diff introduced a duplicate parameter to the model, which would hurt performance and also affect correctness (duplicate momentum updates, for example). We unfortunately had no checks for duplicate params outside of data_parallel_model, which fortunately brought this to our attention.
But it is better to have a Validate() function in model_helper, and call that before adding gradient ops and querying for parameters. Added to brew_test calls as well.
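A rough illustration of the duplicate-parameter situation such a Validate() call is meant to catch (hypothetical layer names; not the model from the diff):
```
from caffe2.python import brew, model_helper

# Build a tiny model, then accidentally register the same param twice.
model = model_helper.ModelHelper(name="dup_param_example")
brew.fc(model, "data", "fc1", dim_in=16, dim_out=16)
model.params.append(model.params[-1])   # duplicate registration

names = [str(p) for p in model.params]
assert len(names) != len(set(names))    # duplicate is now detectable
```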
Reviewed By: kennyhorror
Differential Revision: D5163458
fbshipit-source-id: 35692e8bfcc359d4e8bc73e6f2358659f6e45ceb
Summary:
This diff is the first step in the effort of refactoring all parameters. As a
first step, I'm merging the concepts of params and computed_params, which are going
to be based on tags instead (in the first version it still uses the old data
structs to store all the BlobReferences).
Renaming computed_params to non-trainable/non-backprop params should be done in
some other diff.
Reviewed By: salexspb
Differential Revision: D5119830
fbshipit-source-id: 2001090a37346eb12abbb234e13e727c288eb8a7
Summary:
use a user-defined Android NDK path instead of a hard-coded one.
Closes https://github.com/caffe2/caffe2/pull/506
Differential Revision: D5162646
Pulled By: Yangqing
fbshipit-source-id: 5093888e15607b3bf6682e05eb91aa94c6206b01
Summary:
It's causing problems inside docker containers:
`InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(5, 9, 10, 5), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 18432000.`
Closes https://github.com/caffe2/caffe2/pull/707
Differential Revision: D5162621
Pulled By: Yangqing
fbshipit-source-id: 55544210961cbc80828dca2cbeba6a5ace8cf8d1
Summary:
This warning becomes an error with https://github.com/numpy/numpy/pull/6271 (`>=0.12.0`).
```
caffe2/python/operator_test/tile_op_test.py::TestTile::test_tilewinput
/opt/caffe2/caffe2/python/operator_test/tile_op_test.py:100: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
dims[axis] = tiles
/usr/lib/python2.7/dist-packages/numpy/lib/shape_base.py:873: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
return c.reshape(shape_out)
```
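For reference, a tiny illustration of the deprecated pattern and the usual fix (made-up values; the real fix in the diff may differ):
```
import numpy as np

# A 1-element array (ndim > 0) used as an index triggers the warning.
tiles = np.array([3])
dims = [1, 1]
# dims[0] = tiles          # VisibleDeprecationWarning (an error in newer numpy)
dims[0] = int(tiles[0])    # extract a scalar explicitly instead
```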
Closes https://github.com/caffe2/caffe2/pull/710
Differential Revision: D5160776
Pulled By: Yangqing
fbshipit-source-id: b264e0e389de5817a289db878c15e655f9fa2f09
Summary:
Adds support for generating and training pfp16 models. Added SGD optimizer for multi-precision trainers and a new callback to data_parallel_model in order to help multi-precision models keep their different copies of parameters in sync during training.
Closes https://github.com/caffe2/caffe2/pull/697
Differential Revision: D5159712
Pulled By: salexspb
fbshipit-source-id: 60a889494d2e2f4df1d720331e19f638c5eb95cc
Summary:
(this is due to an earlier blind vim find-replace error)
Closes https://github.com/caffe2/caffe2/pull/709
Differential Revision: D5159055
Pulled By: Yangqing
fbshipit-source-id: f188b7bebf79a45825568ba96a71b535fe4e3aad
Summary:
Currently we can get into broken situations when some nodes finish detectChanges() faster than others, so only some of the nodes start the next iteration of training. This is an inconsistent state. To prevent this from happening, each node now sets a "re-rendezvous flag" that is allreduced after each iteration. Once all nodes agree, re-rendezvous will be done.
Also noticed that min_shards=1 does not work because data parallel model assumed num_shards>1 when rendezvous is not None. Fixed that.
Reviewed By: andrewwdye
Differential Revision: D5156282
fbshipit-source-id: f2ccbd8ad13ed37f7813ff8ad1080d963d0d17e3
Summary:
Once the build is cached, QUICKTEST takes less than 3 minutes to install+build+test (first build is ~13 minutes).
Future TravisCI improvements:
* Refactor other build targets so they're fast enough to build in under 45 mins
* Run tests for other build targets
* Run Python tests
Closes https://github.com/caffe2/caffe2/pull/550
Differential Revision: D5157407
Pulled By: Yangqing
fbshipit-source-id: b2b2d9c2c85423cc78f314951da54b64c247c0af
Summary:
This PR adds a CLI flag '--caffe2_print_stacktraces' that takes a bool and, when set, will print stack traces when a fatal signal occurs. As a side effect a few new APIs are introduced, `caffe2::setPrintStackTracesOnFatalSignal` and `caffe2::printStackTracesOnFatalSignal` - however these are mostly exposed for testing infrastructure purposes.
Also it appears at some point fatal signal handlers were strictly disabled for android - this PR re-enables them.
Closes https://github.com/caffe2/caffe2/pull/698
Reviewed By: Yangqing
Differential Revision: D5150001
Pulled By: danzimm
fbshipit-source-id: abb4aada4ddae8bcfbf1a85f3d101ed63692f221
Summary: Extended the time-out option from just working on TCP to also working with ibverbs
Reviewed By: pietern
Differential Revision: D5090258
fbshipit-source-id: fee685850d761d0c2130852f513c64ceb19f4e9e
Summary: Add information about the offending param when assertion fires.
Reviewed By: kennyhorror
Differential Revision: D5153625
fbshipit-source-id: 9f5a02bf64ccbdef9d93d346f79e589dfe3ec5be
Summary:
Add timing of the phase between the last gradient op and the final sync. This gives an approximate measure of the latency of the distributed computation and helps detect stragglers. Not intended as a real measure, but just for relative comparison.
This could be improved by making nodes share their timings and make decisions based on it. But for first step, we can just look at the numbers ourselves.
Reviewed By: andrewwdye
Differential Revision: D5149273
fbshipit-source-id: c4c346291c0feb6e9c6ceced64e7be667d17dcad
Summary: Fix an issue where the parameter is not created in param_init_net, or net, and then we secondarily look at which device op outputs the gradient. This did not work if the gradient was a GradientSlice.
Reviewed By: harouwu
Differential Revision: D5153102
fbshipit-source-id: 20eae660ea32e5a9ea484bf93c04c8f8c71a51ed
Summary: If ConstantFill (or another fill op) is used in CUDAContext with input_as_shape, the code crashes, as it expects the shape to be in CUDAContext but accesses the array in host code... We could fix this by copying the values from the CUDA tensor, but it is probably best to enforce that the shape param is in CPU context. This is what this diff does.
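A minimal CPU-side sketch of the constraint described above (illustrative blob names):
```
import numpy as np
from caffe2.python import core, workspace

# With input_as_shape the shape tensor is read on the host, so it has to live
# in CPU context even if the fill itself runs on GPU.
workspace.FeedBlob("shape", np.array([2, 3], dtype=np.int64))  # CPU blob
workspace.RunOperatorOnce(
    core.CreateOperator("ConstantFill", ["shape"], ["out"],
                        value=1.0, input_as_shape=1))
print(workspace.FetchBlob("out").shape)  # (2, 3)
```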
Differential Revision: D5152766
fbshipit-source-id: 0629a189bd1d800c0b7c9dbc324b78d279efac0b
Summary:
Bug repro is in a test. Generally speaking accumulation was
not happening if len(ys) >= 2 (list of blobs we compute gradients
from) and for some blob in the net it was both in ys list and also got
a gradient propagated from another element in ys.
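A hand-written sketch of the failure scenario (toy ops, not the repro test): gradients are requested from both y1 and y2, and y1 must accumulate its own seed gradient with the gradient flowing back through y2.
```
from caffe2.python import core

# y1 appears both in ys and as an input to y2, so its gradient needs
# accumulation of the seed with the gradient coming back from y2.
net = core.Net("grad_accum_example")
y1 = net.Exp("x", "y1")
y2 = net.Exp(y1, "y2")
grad_map = net.AddGradientOperators([y1, y2])
print(net.Proto())
```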
Reviewed By: akyrola
Differential Revision: D5121695
fbshipit-source-id: 282d88f2f4f6e27dadae311964f40246a2739130
Summary:
For some long running benchmarks, the iteration count could be 0
which would lead to a segfault when printing results
Reviewed By: pietern
Differential Revision: D5149034
fbshipit-source-id: 7b56e8961c302d1ff11ffcd74ca8e909ea046231
Summary: It looks like it's a bit too restrictive requirement. Let's remove it.
Reviewed By: volkhin
Differential Revision: D5150968
fbshipit-source-id: 9e38574edc6542c5ce3c7f25a01afe8f5ff9b507
Summary:
Fixes some performance issues when `broadcast_computed_params=True` is passed to Parallelize_GPU. Enabled via the same `use_nccl` flag as AllReduce
Closes https://github.com/caffe2/caffe2/pull/630
Differential Revision: D5149828
Pulled By: akyrola
fbshipit-source-id: 12c9714c7fa078811f1cde61c8523dca8f7f968f
Summary: These return views in Python 3, which would not do anything in a lot of the usages currently present in Caffe2. This diff simply removes (almost) all usages of these two in Caffe2 and subprojects in favor of comprehensions, which are also easier to read/understand.
Reviewed By: akyrola
Differential Revision: D5142049
fbshipit-source-id: e800631d2df7d0823fed698cae46c486038007dc
Summary:
Looking at one segfault at exit (https://our.intern.facebook.com/intern/chronos/jobinstance/?jobinstanceid=911625597&smc=chronos_gp_admin_client&log_type=stderr&offset=0&pretty_logs=false) and its coredump, the only thing I can see is that a FreeBlob() operator is called concurrently while a cudaMemcpyAsync (on thread 1) is crashing. FreeBlobOp is only called at data_workers _stop() (via utils.ResetBlobs()), and the only code that could run a cudaMemcpyAsync at that time is the fetcher thread of data_workers that is enqueuing blobs.
Here are the stacks: P57455299
This is clearly a bug since we should only clear the scratch blobs after all threads are terminated, which happens at wait_for_finish().
I am not 100% sure this fixes all the segfaults, but at least this one was most likely caused by this.
Reviewed By: andrewwdye
Differential Revision: D5146278
fbshipit-source-id: ae00796706bfc4fee6823caf6529b62ab20c1cd3
Summary: Ring-chunked performance on 8 nodes was substantially worse than halving-doubling in some cases. We can just use halving-doubling in all cases.
Reviewed By: prigoyal
Differential Revision: D5148755
fbshipit-source-id: 1332065615be6b9faf873effac87056011e0e804
Summary:
This diff does two things:
- add support for an optimizer to data_parallel_model. The user can supply optimizer_builder_fun instead of param_update_builder_fun. The latter is called for each GPU separately with proper namescope and devicescope, while the optimizer builder is called only once and adds optimizers to the whole model.
- use MomentumSGDUpdate instead of MomentumSGD + WeightedSum. This bring major perf benefits.
Changes resnet50 trainer to use optimizer.
This relies on D5133652
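A hedged sketch of the new hook (names other than optimizer_builder_fun itself are illustrative, and the Parallelize_GPU call is shown only in outline):
```
from caffe2.python import optimizer

def add_optimizer(model):
    # Called once for the whole model (not per GPU); per the summary,
    # MomentumSGDUpdate is used for the momentum update.
    optimizer.build_sgd(model, base_learning_rate=0.1,
                        momentum=0.9, policy="fixed")

# data_parallel_model.Parallelize_GPU(
#     model,
#     input_builder_fun=add_inputs,
#     forward_pass_builder_fun=create_model,
#     optimizer_builder_fun=add_optimizer,
#     devices=range(4),
# )
```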
Reviewed By: dzhulgakov
Differential Revision: D5142973
fbshipit-source-id: 98e1114f5fae6c657314b3296841ae2dad0dc0e2
Summary:
I'll let y'all decide how you want to fix this (probably need a persistent curand buffer). Here's a test to verify the fix.
Closes https://github.com/caffe2/caffe2/pull/495
Differential Revision: D5148815
Pulled By: akyrola
fbshipit-source-id: e80dabe65230ddd32340f2d872cd8786ac960bf8
Summary:
hankun is using the optimizer but has a mixed set of GPU and CPU operators. Currently this won't work with the optimizer, since it adds optimizers for all parameters in the current device scope. But we can actually infer the device that a param belongs to by looking at the device option in the param_init_net.
Added a test as well.
Reviewed By: salexspb
Differential Revision: D5133652
fbshipit-source-id: ad8689d75ac1f5c78981bae1b6978fe91e40ef0f
Summary:
See discussion at https://github.com/caffe2/caffe2/pull/633#issuecomment-303536902
Tested with a TitanX (Pascal) and a TitanZ (Kepler) with this access pattern.
```
Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU1) : No
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU2) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> GeForce GTX TITAN Z (GPU2) : Yes
> Peer access from GeForce GTX TITAN Z (GPU2) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU2) -> GeForce GTX TITAN Z (GPU1) : Yes
```
All combinations pass:
* `0,1`
* `0,2`
* `1,2`
* `0,1,2`
Closes https://github.com/caffe2/caffe2/pull/659
Differential Revision: D5148779
Pulled By: akyrola
fbshipit-source-id: 6263edfe8b36623983f1946b5c3f4a3fef415a45
Summary:
Allow user to force cuDNN convolution algorithms from python - useful if you're using a standard network and don't want to pay the cost of exhaustive search.
Defined as an array in the order of [fwd, wgrad, dgrad].
Also refactors cudnn_conv_op slightly to split the code to do wgrad and dgrad a little more.
Closes https://github.com/caffe2/caffe2/pull/570
Reviewed By: akyrola
Differential Revision: D5125731
Pulled By: asaadaldien
fbshipit-source-id: cc5c64d3ccd2546f8e744d818f587bbbd24f055b
Summary:
Failure mode:
```
- 7 passing examples, 0 failing examples, 0 invalid examples
- Typical runtimes: 12-14987 ms
- Stopped because settings.timeout=60
```
After this change:
```
- 5 passing examples, 0 failing examples, 0 invalid examples
- Typical runtimes: 12-15475 ms
- Stopped because settings.max_examples=5
```
Obviously, the `DYNAMIC_PROGRAMMING` tests are the troublemakers. An alternate solution would be to make separate tests for the two assignment algorithms (one fast, one slow).
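A hedged sketch of what capping the example count looks like with hypothesis (the test body and names are illustrative, not the actual assignment test):
```
from hypothesis import given, settings, strategies as st

# With max_examples=5, slow cases stop on the example cap instead of
# tripping settings.timeout.
@settings(max_examples=5)
@given(st.integers(min_value=1, max_value=64))
def test_assignment_num_examples(n):
    assert n >= 1
```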
Closes https://github.com/caffe2/caffe2/pull/676
Differential Revision: D5147363
Pulled By: akyrola
fbshipit-source-id: 85d9f8198e53c10de2a8d6645e2b0eb7953c96e0
Summary: This diff is one step towards enabling the Python 3 build by making it more diligent in its handling of strings.
Reviewed By: salexspb
Differential Revision: D4893083
fbshipit-source-id: 28b8adf3280e8d1f0a7dc9b0fee5ad53f2fada57
Summary: Refactored SoftmaxWithLoss by removing the code for spatial=1 mode and created a new op SpatialSoftmaxWithLoss that has the spatial mode implemented.
Reviewed By: viswanathgs
Differential Revision: D5104120
fbshipit-source-id: 8ab999e32c916b2a39a670a7b2a3365401535f24
Summary:
This should build on all linux systems now (unwind.h appears to be a gcc extension that clang supports as well) on every platform - even android. I'm not sure how to look at what platforms support which libc extensions, so I'm unsure how to proactively ensure this PR will work on all platforms.
Closes https://github.com/caffe2/caffe2/pull/656
Reviewed By: pietern
Differential Revision: D5134097
Pulled By: danzimm
fbshipit-source-id: 093a49239c6d9d43ca64c52e8aaab569970b2cf9
Summary: andrewwdye caught a sigsegv that happened at Gloo failure signaling function. Turns out workspace->CreateBlob() is not thread safe, and since we are running multiple threads it is likely that many gloo ops fail at once and thus we get a race. Caffe2 ops should actually be created in constructor, so that's what this diff does.
Reviewed By: andrewwdye
Differential Revision: D5139269
fbshipit-source-id: 7eaab3084e4e39543632c628c5e0710225e73b65
Summary:
Makes benchmark a bit hacky, but it's a benchmark after all :)
Specifically ports functionality of proper BenchmarkNet run from the ads_benchmarks so that we can see training net perf.
Also adds --report_interval parameter to print stats more often when running in hogwild mode
kdub0 - hopefully if you have time you can integrate it properly with the Flow's workflow
harouwu -shouldn't conflict too much with your current diff
Reviewed By: rayleichen
Differential Revision: D5125183
fbshipit-source-id: 9c6f1663bc85e26d6609f0f2f23aa280731939db
Summary:
To make optimizer for sparse gradients work with CUDA, we need UnsortedSegmentSum and Mean implemented for CUDA. Unique was already implemented by harouwu.
Pretty straightforward implementations, should be fast enough -- and I don't know a faster way anyway.
Added some tests as well.
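For reference, a tiny CPU illustration of UnsortedSegmentSum semantics (the diff adds the CUDA kernels; data values are made up):
```
import numpy as np
from caffe2.python import core, workspace

# Each row of DATA is summed into the bucket given by its segment id.
data = np.array([[1., 2.], [3., 4.], [5., 6.]], dtype=np.float32)
segment_ids = np.array([0, 1, 0], dtype=np.int32)
workspace.FeedBlob("data", data)
workspace.FeedBlob("ids", segment_ids)
workspace.RunOperatorOnce(
    core.CreateOperator("UnsortedSegmentSum", ["data", "ids"], ["out"]))
print(workspace.FetchBlob("out"))  # [[6. 8.] [3. 4.]]
```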
Reviewed By: asaadaldien
Differential Revision: D5124548
fbshipit-source-id: 63ae72f45fc2f07470603f7b2de12f34635dbb3d
Summary:
This is going to unblock Nvidia in their work on adding fp16
support to Caffe2. I discussed this with kennyhorror before to make
sure this fits into his work on parameter sharing.
Reviewed By: kennyhorror
Differential Revision: D5127797
fbshipit-source-id: 4db155d320b1862570c23b77c4252bdacbf2296f
Summary:
If there are two SparseToDense layers densifying the same IdList feature, we
might end up exporting invalid input for the prediction in input specs. This
diff changes the behavior to use Alias to a new blob instead of passing things
directly.
Reviewed By: dzhulgakov
Differential Revision: D5093754
fbshipit-source-id: ef4fa4ac3722331d6e72716bd0c6363b3a629cf7
Summary: Currently using two tower models with cosine distance results in bad calibration. Adding bias to the output of cosine term solves the problem.
Reviewed By: xianjiec
Differential Revision: D5132606
fbshipit-source-id: eb4fa75acf908db89954eeee67627b4a00572f61
Summary: Memory leak happens when new BlobReference is constantly added to the set _scratch_blobs
Reviewed By: panshen1
Differential Revision: D5134945
fbshipit-source-id: 3ce4d482153bb89de065f20cd91411178085caad
Summary: Changed test file name to signify that if testing with ASAN you should disable ASAN signal handling.
Reviewed By: pietern
Differential Revision: D5122977
fbshipit-source-id: f73de44df943516f3353cf408697869c43c45032
Summary:
This was hardcoded at 4 before but should be made
configurable. Can be kept low for big MLPs and higher for convnets.
Reviewed By: akyrola
Differential Revision: D5126138
fbshipit-source-id: 713ee8bbeb243b7de1479808fd6398d397e0b49a
Summary:
Fix number of indices and block_size in SparseAdam to support gradients of any dimension.
Closes https://github.com/caffe2/caffe2/pull/249
Reviewed By: asaadaldien
Differential Revision: D5125714
Pulled By: akyrola
fbshipit-source-id: 84134049cb9a77e58562272ea351222befe27fca
Summary:
Only adding `include_directories` doesn't propagate to the including
targets. Also use `target_include_directories` to do so.
Closes https://github.com/facebookincubator/gloo/pull/39
Differential Revision: D5131001
Pulled By: pietern
fbshipit-source-id: 6c58c4b76ae7fa008e4fb26d1bca7900165884d0
Summary:
Implement SizeOp that returns the number of elements in the input
tensor.
Output is 1D tensor that contains the number of elements
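A minimal illustration of the Size op described above (illustrative blob names):
```
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob("x", np.zeros((2, 3, 4), dtype=np.float32))
workspace.RunOperatorOnce(core.CreateOperator("Size", ["x"], ["n"]))
print(workspace.FetchBlob("n"))  # total element count: 24
```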
Reviewed By: akyrola
Differential Revision: D5101061
fbshipit-source-id: d1c56053b6f3b41c65ac574dd748482775d1ea0d
Summary: CuDNN conv op's type error was not very descriptive.
Reviewed By: Yangqing
Differential Revision: D5124638
fbshipit-source-id: 7d3f0afad36573cdb97d1f8ec3c60a9c6d87f926
Summary:
The CMake variable CMAKE_BINARY_DIR points to the top level build
directory. For standalone Gloo builds this path lets files include the
generated file "gloo/config.h". When Gloo is included as project, this
variable points to a different path and "gloo/config.h" cannot be
resolved. Fix is to build a path from CMAKE_CURRENT_BINARY_DIR.
Closes https://github.com/facebookincubator/gloo/pull/38
Differential Revision: D5129385
Pulled By: pietern
fbshipit-source-id: 722cebf4892b34f869fe43320153efbb181555b6
Summary: In some cases (for example, when include_tags option is used) output_schema contains blobs that aren't produced by the generated net. In this case we want to filter them from output_schema as well.
Differential Revision: D5120115
fbshipit-source-id: f98ea3f747589390b039d1e1987becec3980634c
Summary:
D5116828 changed how in-place ops were handled in memonger and fixed a crash in NeuralMT. However, it still produced an incorrect memongerization, because an op with one in-place input-output but another non-in-place output would still be handled incorrectly, as the other output's branch would not be followed properly.
This is fixed by actually removing the whole in-place op special handling. This actually is not needed anymore, it was leftover from an older version of memonger that used topological sort of the ops.
Reviewed By: asaadaldien
Differential Revision: D5128142
fbshipit-source-id: b551b0faebdde410e6bd7516958c63cf610cc065
Summary: When two or more blobs are gathered by the same indices blob in a data parallel model, we used to concatenate multiple times and re-write to the same indices blob. This leads to illegal memory access at times because the gradientslice indices blob is longer than its corresponding gradientslice values blob. This diff adds a check in order to avoid this.
Reviewed By: akyrola
Differential Revision: D5116817
fbshipit-source-id: 1c086d092eb6d48926d600f9408f578f5ddc41c7
Summary: Using Misha's vectorized AVX code to greatly improve performance of reductions on float16 values. Float16 reductions are now 2x faster than float.
Reviewed By: pietern
Differential Revision: D5123331
fbshipit-source-id: 03d4e76886d538b7e24eedaf32a92231a80b1e43
Summary: Gradient test for tile op was flaky because I had made the dimensions too large. This caused push-blocking errors. Also I noticed my test_grad_tile was incorrect.
Reviewed By: asaadaldien
Differential Revision: D5126476
fbshipit-source-id: ae9ce5d9041648d7a4535fc88d4013e669bd6f02
Summary: Modify BroadcastOp and AllreduceOp to allow initializing algorithms on buffers of float16 values. Previously the Allreduce algorithm definitions were hardcoded to take float.
Reviewed By: pietern
Differential Revision: D5042015
fbshipit-source-id: c5c3ea5566f9f23969847dcc0735f5f4b075f56f
Summary:
The broadcast algorithms use the buffers they were given directly.
There is no inbox/outbox pattern. This means that we can race if the
algorithm is run repeatedly within a short time frame. This hasn't
been an issue so far since we've only used it in combination with
other process wide barriers.
Since this adds a round trip the latency of these ops from the root
rank perspective increases. The variance between the before and after
runs is pretty high since there is no back and forth interaction on
the root. It simply waits for recipients to be ready and then sends
its data.
Before:
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: broadcast_one_to_all
Options: processes=4, inputs=1
elements min (us) p50 (us) p99 (us) max (us) samples
100 1 16 29 50 426075
200 2 17 32 50 179953
500 2 11 31 59 140291
1000 2 12 29 59 177619
2000 3 12 29 62 117882
5000 5 16 31 64 127113
10000 9 21 38 88 60328
20000 19 36 65 130 30427
50000 48 68 221 556 11180
100000 92 136 426 871 7314
200000 193 251 829 2965 4092
500000 492 638 2098 4133 1677
1000000 1195 2024 3513 11646 628
2000000 3446 4216 5007 17100 282
5000000 12956 13919 14941 37751 71
```
After:
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: broadcast_one_to_all
Options: processes=4, inputs=1
elements min (us) p50 (us) p99 (us) max (us) samples
100 15 37 52 107 27332
200 14 40 63 199 28620
500 17 37 52 118 18299
1000 9 39 57 120 33375
2000 20 57 78 180 24779
5000 31 61 84 190 18039
10000 39 70 90 225 8908
20000 57 108 130 940 8313
50000 94 163 217 1933 5326
100000 132 231 331 3501 3681
200000 256 426 560 6509 2272
500000 774 1092 1698 10039 985
1000000 1132 2106 3878 18218 484
2000000 3509 4252 6832 20228 226
5000000 11326 15447 27129 52694 77
```
Reviewed By: wesolwsk
Differential Revision: D5123341
fbshipit-source-id: f3bab4f75ef7c38817f74f00b382f18fe43d85d5
Summary:
use a typedef for the type of the pseudo-random number engine; this makes it more flexible to change
Closes https://github.com/caffe2/caffe2/pull/615
Differential Revision: D5121539
Pulled By: Yangqing
fbshipit-source-id: 988e57f8d119cb6f3bfe692fdb303aba2ecacbeb
Summary: Vector out-of-range error was being triggered in some tests due to trying to get the address of an element past the end of vector.
Reviewed By: pietern
Differential Revision: D5123044
fbshipit-source-id: 004f72ebaa27c609290959c12a3d99b16289bfa8
Summary:
This PR changes the cmake of Caffe2 to look for system dependencies before resorting to the submodules in `third-party`. Only googletest should logically be in third-party, the other libraries should ideally be installed as system dependencies by the user. This PR adds system dependency checks for Gloo, CUB, pybind11, Eigen and benchmark, as these were missing from the cmake files.
In addition it removes the execution of `git submodule update --init` in cmake. This seems like bad behavior to me, it should be up to the user to download submodules and manage the git repository.
Closes https://github.com/caffe2/caffe2/pull/382
Differential Revision: D5124123
Pulled By: Yangqing
fbshipit-source-id: cc34dda58ffec447874a89d01058721c02a52476
* Fix segfault in autograd:
1) Every "output" variable must have a grad_fn or grad_accumulator
2) compute_partial_exec_callbacks uses Python errors
* assertRaisesRegexp was renamed assertRaisesRegex in 3.2
* Use HANDLE_TH_ERRORS macro
Summary:
After the change we will be able to simply define targets and find dependencies.
Closes https://github.com/caffe2/caffe2/pull/640
Differential Revision: D5121700
Pulled By: Yangqing
fbshipit-source-id: 2d21e1afbccb09614054feccdd1bef55cbe3b035
Summary: Adds timers to collect the average runtime for each pipe stage.
Reviewed By: azzolini
Differential Revision: D5083958
fbshipit-source-id: 42536bd70c80c2453d98d872286525388f6164c3
Summary:
predictor_exporter copies the original predict_net's op, external_input and
external_output fields, but ignores the type field. This is reasonable as the
train net would generally have 'dag' type and copying that for inference may
not be applicable. It's good to have a way to specify the net type nevertheless
to run DAGNet for inference. This diff adds a field in predictor_exporter to do
that.
Reviewed By: akyrola
Differential Revision: D5122354
fbshipit-source-id: 0e3cc417128db903c71515135c9e3b87620ae21e
Summary: Added a new RNNCell, DropoutCell, which wraps an existing RNNCell and applies dropout to its primary output (as defined by get_output_state_index()).
Reviewed By: salexspb
Differential Revision: D5084871
fbshipit-source-id: 60474af84e5757a12e7fdc3814840dc9ba8e32a1
Summary: Basically takes in a live net and creates an init_net and predict_net which can be written to file and run in Predictor
Reviewed By: salexspb
Differential Revision: D4989425
fbshipit-source-id: 8052065da9ed763d48bd9e1e19f7697ef60a2829
Summary: This RunPlan is getting complex and confusing. The first step to clean it up is to move it out of workspace.cc to better mark separation of concerns.
Reviewed By: kennyhorror
Differential Revision: D5100721
fbshipit-source-id: 4be0559eba1abb8bb1ddc3818698763c2e014ef2
Summary: As noted by salexspb, MultiRNNCell had unreliable gradient computation. The problem was that recurrent gradient and gradient computed wihtin the backward step net were not being accumulated during the backward pass, but rather writing to the same blob, thus overwriting each other. This diff fixes that by artificially introducing an extra blob for the internal output, and then accumulating it into the gradient coming from the recurrent connection.
Reviewed By: salexspb
Differential Revision: D5110059
fbshipit-source-id: 16add50989fe8866361bbc21afce5f214c5292fd
Summary:
- caffe2 compiles now with gflags 2.2.0 (compiled from source), see issue https://github.com/caffe2/caffe2/issues/491
- fixed an error in image_input_op.h (did not compile in vs2015)
Closes https://github.com/caffe2/caffe2/pull/559
Differential Revision: D5121555
Pulled By: Yangqing
fbshipit-source-id: 9d2bedadd13d1872bb930a95d67ed20263988d13
Summary:
Fixed a bug in CMakeLists.txt: the option command should not be used for setting the initial value (empty string) of CAFFE2_CPU_FLAGS and CAFFE2_WHITELIST, because option can only be used for boolean (ON/OFF) variables; use the set command instead. The bug can cause compilation errors if CAFFE2_CPU_FLAGS is set to ON, since an invalid 'ON' flag will be added to CXX_FLAGS. (2) Add build_* to .gitignore to allow multiple build directories in the repo
Closes https://github.com/caffe2/caffe2/pull/611
Differential Revision: D5121545
Pulled By: Yangqing
fbshipit-source-id: 1f57042075356b6bf7138f65565b327be2a6d272
Summary:
Added python-pip and python-numpy into build_raspbian.sh script
because they are not installed in ubuntu/debian minimal image.
Closes https://github.com/caffe2/caffe2/pull/609
Differential Revision: D5121550
Pulled By: Yangqing
fbshipit-source-id: 14dd1450275fcc2aa9d2a06f0982f460528a1930
Summary: Memonger ignores ops with input and output in-place, but did not work correctly if there were also non-in-place inputs, as with Mul. Simple fix to also look at in-placeness during the traversal.
Reviewed By: jhcross
Differential Revision: D5116828
fbshipit-source-id: 52817f1221597986cc09cc65d094417c1923d965
Summary:
In a previous commit where the slot numbering was expanded, I changed
the memory region send/recv path to use a map for the outgoing memory
regions (since they may complete out of order). Before, this was a
fixed size array, which was mutated by both the user thread and device
thread without holding a lock. The map, however, can't be mutated
without a lock. This change adds that lock and a few assertions to
check for this type of problem.
Reviewed By: andrewwdye
Differential Revision: D5108194
fbshipit-source-id: 1908c988112469ecdec6cb6eb9849068d896c409
Summary:
It looks like AddOperator was never really used (searched across the whole
code-base). In addition to this all model_helper functionality is getting
replaced with Brew, so there I'd prefer to remove this method to reduce the
amount of code touching model.params.
Reviewed By: rayleichen
Differential Revision: D5110425
fbshipit-source-id: f2a88e4c1ce5149d27e809e03da9a86c0867bc4d
Summary:
I had "optimized" the number of threads / block, but cub::BlockReduce has a static template parameter for the number of threads, and this must match. Probably tests still passed because typically the initial numbers are zeros.
Also added a stronger test.
Thanks ves for the report.
Differential Revision: D5110901
fbshipit-source-id: c1169b1286e204c202b0727448ddb51b4965eacb
Summary:
This file can then be used by downstream code to figure out what Gloo
features it can support (e.g. ibverbs transport or not).
Closes https://github.com/facebookincubator/gloo/pull/36
Differential Revision: D5110769
Pulled By: pietern
fbshipit-source-id: 2c0c07537258048737ae764a4978f2f7fdbd992d
Summary:
This is another example where our unsolicited writes may interfere
across calls to the collective function. In this case, it was possible
for a second call to overwrite a pair's address before it had been
used to connect the pair in the previous iteration.
Thinking out loud, we could avoid this from happening by supporting
this pattern natively in the Buffer classes. For example, we can add a
notification mechanism (opt in) to the Buffer class such that the
receiver may call `ackRecv()` to acknowledge receipt and handling of
the data in the buffer. Then the sender will block on new sends until
acknowledgement from the previous send has been received. Until then,
we have to keep an extra eye out.
Reviewed By: wesolwsk, romain-intel
Differential Revision: D5095430
fbshipit-source-id: 4c100433108fccea7457bba4dc00f651f722e6c9
Summary:
I'm assuming the repo should be caffe2/caffe2.git and not bwasti/caffe2.git. Changed it accordingly.
Closes https://github.com/caffe2/caffe2/pull/572
Differential Revision: D5105328
Pulled By: aaronmarkham
fbshipit-source-id: 4bd3babbd93c79831be79c6d40b81d873fcc3f4c
Summary: ves and jamesr66a had noticed that TileOp for CUDA was very slow, as it started kernels inside double loops. It was my fault not to notice this in the code review. This diff uses 1 kernel for forward and backward passes and is probably much faster. I did not test though, maybe ves or jamesr66a can help?
Reviewed By: jamesr66a
Differential Revision: D5101968
fbshipit-source-id: 64b6ac933785e3710b3c1d8c692a4c48650bca96
Summary:
When a fatal signal is fired to a task that links against caffe2 this PR adds stacktraces from every thread that's currently running. Only linux is supported currently. The signals that are currently supported are SIGABRT, SIGINT, SIGILL, SIGFPE, SIGBUS and SIGSEGV (more signals can easily be added, but for now this seemed like the major signals that might be fired - see signal_handler.cc:138 for the table of signals).
I've added tests that verify that each of those signals indeed output the expected number of stacktraces.
We need to add linking against libdl since on linux apparently it's not implicitly always linked in (I'm coming from macOS where I believe it is).
Example output can be found [here](https://gist.github.com/danzimm/814faa1229d9c54f359d23ba038344a6) - note that the signal name changes depending on the signal that was sent (as well as the number in parenthesis that corresponds to the specified signal).
Closes https://github.com/caffe2/caffe2/pull/596
Reviewed By: akyrola
Differential Revision: D5087526
Pulled By: pietern
fbshipit-source-id: ba8d058c9ca1cf06b41667205193f8699f8d6964
Summary:
Correct schema generation was previously broken leading to invalid gradient op creation.
Also exhibited in model_device_helper, where invalid schema were being created on the CPU when kwargs['engine'] == 'CUDNN'
Closes https://github.com/caffe2/caffe2/pull/617
Reviewed By: asaadaldien
Differential Revision: D5097062
Pulled By: akyrola
fbshipit-source-id: e22181f857deccb7b4395e87271e2cbf1226eb64
Summary:
This allows producing nice comparisons against
CuDNN. Currently on 1 layer I see about 28% slow down on
average across setups specified.
Reviewed By: akyrola
Differential Revision: D4986218
fbshipit-source-id: efb12081f13dbfb92428fd4a85f12fd566eb9522
Summary:
Address KaimingHe's comments in D5093689 about the same blob being initialized twice causing the internal consistency check to fail. Also I noticed that my new test for test_checkpoint_params was completely botched due to an indentation issue (it did not actually execute any test). So this fixes that as well.
Modified the test to add a duplicate param initializer, so that this bug is tested for.
Reviewed By: KaimingHe
Differential Revision: D5101304
fbshipit-source-id: 72f343035c1b4953e7bb9a1a1c171cf05d3ead26
Summary: Based on jay-mahadeokar's code, add a test for input order consistency to data workers.
Reviewed By: jay-mahadeokar
Differential Revision: D5096887
fbshipit-source-id: efd226343f81e9a0157ec89d4588f1eee8a78549
Summary:
If Predictor Exporter save_to_db is called in CUDAContext, a failure occurs since the following FeedBlob() tries to store a string (meta data), but for CUDA blobs we assume they are tensors.
+ fix a typo in data_parallel_model that I bumped on.
Reviewed By: asaadaldien
Differential Revision: D5099837
fbshipit-source-id: 69d01b35a9a1816bf083f13d8a6ce88e1f5aecb7
Summary: Rename some type of AVPixelFormat
Reviewed By: aaronmarkham
Differential Revision: D5097337
fbshipit-source-id: 8ee9b0fc7284752e56f74c7ada241b3bd421efd1
Summary: Define StoreHandlerTimeoutException() for timeouts in StoreHandler::wait(). Update all StoreHandler implementations. Catch new exception in CreateCommonWorldOp and store failure blob.
Reviewed By: akyrola
Differential Revision: D5095625
fbshipit-source-id: dc6f8351cc129cd1fac72bd4b2c8e6b684b21f31
Summary:
Major improvements. Before we only synced "params" and "computed params" of model after initialization and after loading a checkpoint. But actually we want to sync all blobs that are generated in the param_init_net. For example the _momentum blobs were missed by the previous implementation and had to be manually included in checkpoint finalization.
I also added GetCheckpointParams() to data_parallel_model because it is now fully general. Also added a unit test.
Reviewed By: andrewwdye
Differential Revision: D5093689
fbshipit-source-id: 8154ded0c73cd6a0f54ee024dc5f2c6826ed7e42
Summary: Mutex is only supported on CPU. We need to make sure the mutex and the following AtomicIter are both on CPU. This is critical for GPU SparseNN training.
Differential Revision: D5093184
fbshipit-source-id: 021e6ba699a3208449fa4761cad6b0ec4544957e
Summary:
deprecate CNNModelHelper in the python/operator_test dir
BTW I found that there are 2 mkl_speed_tests. I am confused...
Reviewed By: salexspb
Differential Revision: D5094122
fbshipit-source-id: f6526f4de334f2245eb4c1f204a8ec9f23750d78
Summary: We will start our API migration process. Before that, I want to make sure people don't add new CNNModelHelper instances to our open source code, so I put a deprecation warning here in advance.
Reviewed By: salexspb
Differential Revision: D5093556
fbshipit-source-id: 74bf4a7782c2d882f72f202d48c72255d152b68a
* Check cuDNN version at runtime
This checks that the version from cudnn.h matches the version from
libcudnn.so.
Fixes #1476
* Only check major and minor version numbers
Summary:
The pair was still hardcoding limits on the slot numbers. In this
change those limits are lifted.
This also adds back assertions on work completion status in
handleCompletion.
Reviewed By: wesolwsk
Differential Revision: D5090457
fbshipit-source-id: 7bf884e1f31e48e8f1cdfb179a225999e28171b2
Summary: Add support for collectives over vectors of half-precision floating point values.
Reviewed By: pietern
Differential Revision: D5062938
fbshipit-source-id: 0b39fa53370393fec1edf2d852ff7f1d862b9022
Summary:
The halving/doubling algorithm had two instances where a receive
buffer was registered with a number of elements instead of a number of
bytes. This change adds the assertion that should have caught this in
the first place.
Reviewed By: wesolwsk
Differential Revision: D5089483
fbshipit-source-id: fd0f0724ef04300236c9297ee88b27e61fb1e5a0
Summary:
The original implementation created temporary buffers on the backing
context. This also meant an ordering problem when using the ibverbs
transport, as a call to send will block until the remote side has
created its receive side buffer. Since all buffers are now created
prior to using them, this is no longer an issue.
Reviewed By: romain-intel
Differential Revision: D5082352
fbshipit-source-id: 4c260f06e8f461c0336e7eec7ca891e07ff41cd3
Summary:
CUDNN dilated convolution was added to V6. This version of CUDNN does not support NHWC for dilated convolution.
Fix conv_test.py so that it does not test CUDNN for dilated convolution in NHWC format.
Closes https://github.com/caffe2/caffe2/pull/598
Reviewed By: akyrola
Differential Revision: D5084835
Pulled By: asaadaldien
fbshipit-source-id: 3c0c5ed02c5d9232fca567e387ab6260d71e5aaf
Summary: In response to https://github.com/caffe2/caffe2/issues/581 feedback, add textual "less than", "greater than" etc. to comparison operator docs, instead of just <, <=... which are hard to search on browser.
Reviewed By: asaadaldien
Differential Revision: D5085907
fbshipit-source-id: f129d94f03aff1cc919f8da843aa461f157eb144
Summary: I noticed that Sigmoid was taking an inordinate amount of time in our NMT benchmark, so I looked at the implementation and it didn't seem optimal. I replaced the implementation with an Eigen version so that when the Eigen update goes through, we will get proper AVX(2) vectorization.
Differential Revision: D5082464
fbshipit-source-id: aa951f7d730fc05198f7dd04076ec58d471b74c8
Summary: Added L1Distance Operator for CUDA, as well as tests.
Reviewed By: bwasti
Differential Revision: D5071966
fbshipit-source-id: 4c3d862605e9123d955bf091efa67d0731bd816a
Summary: Fixing a bug in the multiple algorithm test where threads were spawned repeatedly, causing collisions during rendezvous.
Reviewed By: pietern
Differential Revision: D5082945
fbshipit-source-id: 4adbbc963b1ff652f73a44cd9fd75dcd3325f182
Summary: When converting from half to float, the bytes to be returned were represented as an unsigned int. When returning, this had the effect of converting the unsigned int into a float. This is incorrect, as we want to instead take the raw data and return it as float.
Reviewed By: pietern, asaadaldien
Differential Revision: D5080335
fbshipit-source-id: 7208efc5799daccf92e1628ee326f7470b867261
Summary:
TSIA
This matches the approach in the TCP transport where all send/recv
logic is contained in the pair code.
Reviewed By: wesolwsk
Differential Revision: D5082503
fbshipit-source-id: b70886ed9aaeb381cdb45fba00704118cff62a23
Summary:
This is necessary to avoid the next iteration of the algorithm
overwriting data in recvBuf_ before it has been consumed by the
receiver of that data. If this does happen, the result of the previous
iteration for the receiving end is corrupted. This can only happen in
async mode on the TCP transport (so all incoming data is unsolicited)
when spinning on the run function.
Reviewed By: wesolwsk
Differential Revision: D5074789
fbshipit-source-id: 66668fbd885888f26266d812e78d61c6d65c2461
* Fix clang warnings
* Raise errors when unsupported ConvNd configurations are used
* Properly handle Variable indexing with LongTensors
* Support both tensors and variables in Variable.type_as
Summary:
Incorporate arbitrary dropout for encoder and decoder layers for Caffe2 NMT models using current configuration. This involves separate output processing (_prepare_output() and _prepare_output_sequence()) for the final layer in a MultiRNNCell.
Switching to using the newly introduced forward_only switch for RNN cells revealed an unrelated bug in our NetGradientChecker test, which urikz is investigating.
Reviewed By: salexspb
Differential Revision: D5031964
fbshipit-source-id: 19b49607d551aa3e2140041ef4e585f128c8f178
Summary: Add a RandomFailureOp and handling to elastic data parallel model of the status code
Reviewed By: andrewwdye
Differential Revision: D5065936
fbshipit-source-id: 24224f9ea414ee535c9e90cc28add5189354b0ef
Summary:
Migrate the experiments folder to the fb/sparse folder. Keep FunHashOp and SparseFunHashOp because they are now assumed as a default Op in depr. What I did:
1. Migrate FunHashOp and SparseFunHashOp and their unit tests to core-caffe2; make sure tests pass.
2. Migrate other Ops in the experiments folder to the fb/sparse folder. Write new TARGETS files for them. Make sure tests pass.
3. Make sure all related tests pass.
4. Fix the MKL definition btw. Make sure that FC_Sparse is not compiled when there is no MKL support.
Reviewed By: salexspb
Differential Revision: D4952993
fbshipit-source-id: 86c03676ab4e47f04d2d0dd438a4a1c849bbbff0
Summary:
Residual connections for multilayer RNN encoder/decoder for Caffe2 NMT model. Only supporting 'add' connections (the standard approach, which ves's TF experiments concluded was at least as good as other approaches), and also only implementing for residual_level >= 1 (which also fits our use case).
It is the responsibility of the config to ensure dimension compatibility: each level at and beyond residual_level (in both the encoder and decoder) should have the same number of units, with the exception that a bidirectional initial encoder layer should have half the number of units of the succeeding layer if that next layer is a residual layer.
Differential Revision: D5023160
fbshipit-source-id: f38c1b140638fee78cf3ef7d6b4602dd462484ee
Summary:
Update rnn_cell.py and char_rnn.py example with new `brew` model.
- Deprecated CNNModelHelper
- replace all helper functions with brew helper functions
- Use `model.net.<SingleOp>` format to create bare bone Operator for better clarity.
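A hedged sketch of this brew-style construction (toy dimensions and blob names, not the actual char_rnn configuration):
```
from caffe2.python import brew, model_helper

# Helper functions go through brew; bare-bones operators go straight
# through model.net.<Op> for clarity.
model = model_helper.ModelHelper(name="char_rnn_example")
fc1 = brew.fc(model, "data", "fc1", dim_in=128, dim_out=256)
relu1 = brew.relu(model, fc1, "relu1")
pred = model.net.Softmax(relu1, "pred")
```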
Reviewed By: salexspb
Differential Revision: D5062963
fbshipit-source-id: 254f7b9059a29621027d2b09e932f3f81db2e0ce
Summary:
the FC ModelLayer needs an optimizer; it also seems the catch-all
that sets a default for missing optimizers had a bug
Reviewed By: xianjiec
Differential Revision: D5048302
fbshipit-source-id: cbbf641fb9ee4f4f89c5dbb132f7837ecdbe37a5
Summary: new resnet building with brew
Reviewed By: akyrola
Differential Revision: D4945418
fbshipit-source-id: d90463834cbba2c35d625053ba8812e192df0adf
Summary:
A Single machine multi-GPU version of BMUF algorithm. BMUF is a modification to
model averaging where updates to global model is implemented as a filter:
param_t = param_(t-1) + delta_t
delta_t = beta * delta_(t-1) + alpha * (average(param_t) - param_(t-1))
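A toy numeric sketch of this update filter (hyperparameter values and parameters are made up, purely for illustration):
```
import numpy as np

alpha, beta = 1.0, 0.875
param_prev = np.array([1.0, 2.0])      # global model at t-1
avg_param = np.array([1.2, 1.8])       # average of per-GPU models at step t
delta_prev = np.zeros_like(param_prev)

delta = beta * delta_prev + alpha * (avg_param - param_prev)
param = param_prev + delta             # new global model at t
```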
Reviewed By: akyrola
Differential Revision: D4995057
fbshipit-source-id: 48176ba66d67eaf3fa4dee16d50d9589825ddba4
Summary: We need to use remapped name for param_grads to enable memonger.
Differential Revision: D5064198
fbshipit-source-id: ae54407c3362044e9bc2bff929e12da68cd6a332
* fix issue #1549, expose bitwise and
* expose C bitwise or of Tensor
* expose C bitwise xor of Tensor
* use built-in method for inplace and, or, xor
* expose C bitwise lshift(ilshift) and rshift(irshift) of Tensor
Summary: based on our discussion, we want an arg_map in ModelHelper and create arg_scope for that model within brew. Now it is realized
Reviewed By: salexspb
Differential Revision: D5042983
fbshipit-source-id: ddd2c7e9bca1be2f08a32f7252b44d3b60a57996
A module that returns a non-standard data structure currently breaks
due to checks for backwards hooks. This refactors the code slightly so
this will only break in the event of backwards hooks.
Summary:
The most recent diff from Andrey had a tiny bug that triggered an error in Android.
Closes https://github.com/caffe2/caffe2/pull/543
Differential Revision: D5040516
Pulled By: Yangqing
fbshipit-source-id: d7b11b509a20b8b5e33db74dd383b55f43608c8f
Summary:
Generalize SpatialBatchNorm CPU Op to compute Spatial batch normalization for
1D, 2D & 3D input tensors.
Reviewed By: dutran
Differential Revision: D5043563
fbshipit-source-id: 7fcb933a628dd47f13aa622f63601a87382f09cd
Summary: After a long and painful debugging session of nondeterministic behavior in the Machine Translation team's attention model, I found that in certain cases SumReduceLike will use cub::DeviceReduce, and it lacked the stream param.
Reviewed By: jamesr66a, asaadaldien
Differential Revision: D5043347
fbshipit-source-id: bb91aacfc6786cc2b85ebc4e432c67e5f876e235
Summary:
Added several features to the ImageInputOp:
- bounding box (per image as well as default for the operator). For per-image, it
only works in Caffe2 format and is passed as the third tensor in the form
(ymin, xmin, height, width). For the operator, pass bounding_xmin, bounding_ymin,
bounding_width and bounding_height as parameters.
- per-channel mean/std. You can use the usual mean/std to pass a single
value to be used for all channels or also pass mean_per_channel and std_per_channel
to specify different values per channel. Order of channels is BGR.
- A minimum size parameter that can be specified instead of the scale parameter.
The minsize parameter will only scale the image if it is smaller than required.
This differs from scale which will scale up as well as down. You can only specify
one of scale or minsize.
Added a test case to test some of the features
Differential Revision: D4874988
fbshipit-source-id: 437191052a46e9916defe8b100d7cc7864373f61
Summary: We need to also add links in ops, so that they don't require a sharp timestep boundary. This implements that.
Reviewed By: salexspb
Differential Revision: D5027046
fbshipit-source-id: e6dd59ee843fe1507fc87377b0e1e23218dbc384
Summary:
In Dper utility, add a function `load_parameters_from_model_init_options` to
allow init parameters from pretrained models
Reviewed By: xianjiec
Differential Revision: D4926075
fbshipit-source-id: 5ab563140b5b072c9ed076bbba1aca43e71c6ac5
Summary: As part of opsifying the RNN execution, we cannot do the workspace switching anymore, as it happens at the timestep boundary. But we can get the same effect by just explicitly creating the blobs in the shared workspace.
Reviewed By: salexspb
Differential Revision: D5025667
fbshipit-source-id: 921c97cb2f7941f9f9235913a60e34667badc303
Summary: Instead of explicitly accumulating the gradients in a loop, add corresponding Sum ops to the net. This will allow for better parallelism with multithreaded nets.
Reviewed By: salexspb
Differential Revision: D5011177
fbshipit-source-id: 14e2fa2a6905703322d5701c1362054c17c4e796
Summary:
`Append` & `UnPackRecords` don't handle empty tensors well. `Append` would erase the shape of an empty tensor, which breaks the invariants of the dataset.
`UnPackRecords` leaves output tensors in an undefined state: if the output tensors were initialized, they would not be cleared out; if they were not initialized, they would remain uninitialized. This diff disables unpacking an empty record if prototype tensors are not provided (since output shapes may be indeterminable if they were not initialized). The interface remains the same if empty record tensors are not used.
Reviewed By: azzolini
Differential Revision: D4956012
fbshipit-source-id: ad80527d78eb7421cd90968edb82322c289cd417
Summary: Relax requirement on token uniqueness since a few use cases broke after the uniqueness requirement was added in a previous diff.
Reviewed By: kittipatv
Differential Revision: D5034132
fbshipit-source-id: 327eb065923e6ea152a360324316f81b7fb9564b
Summary: We can avoid this extra Reshape.
Reviewed By: jamesr66a
Differential Revision: D5032874
fbshipit-source-id: 92bd568bc6bec53d7f81a64cfa96d2c610823f8c
Summary:
this is still printed in tests a lot. Let's use 1 instead of
0, as most of our RNN code does
Reviewed By: jamesr66a
Differential Revision: D5031460
fbshipit-source-id: bc07990b66c89dfbd97133493cca11929d3138e5
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).
The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
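A minimal illustration of the keepdim behaviour described above (toy tensor, illustrative only):
```
import torch

probs = torch.rand(4, 3)
s = probs.sum(1)                      # shape (4,): the reduced dim is dropped by default
s_keep = probs.sum(1, keepdim=True)   # shape (4, 1): old behaviour, restored explicitly
normalized = probs / s_keep.expand_as(probs)
```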
Summary:
In transfer learning, parameter initialized from pretrained model might require
a different learning rate than otherwise initialized. To this end, here we
implement a python solution where `base_learning_rate` is scaled by `scale`,
which is in turn set by `scale_learning_rate`; Alternatively, we can achieve
same effect by rewriting the LearningRate operator in C++
Reviewed By: kennyhorror
Differential Revision: D4992827
fbshipit-source-id: 8d7e87a61c95b3eb8ef733ec436f4060e865c0ac
Summary:
Adds a parameter cost estimation step before the actual training starts. The costs are later used in order to better shard the parameters across instances of the parameter server.
Things I needed to modify:
- A few changes to make ModelLayerHelper picklable
- Add support for stopping a distributed job after a number of stats reporting steps.
- Refactored run_dist_job to support collocating the reader with the trainer even when PS are present.
- Option to disable dense updates (when num_dense_servers=0).
Currently there's a huge overhead posed by having to launch a child workflow. I'll try to address that in a subsequent diff.
This is WIP because the other workflows need to be migrated as well.
I can break this down into smaller diffs if reviewers would prefer it.
Reviewed By: kennyhorror
Differential Revision: D4974752
fbshipit-source-id: 04c336acb2945f8f11324a221ffc6967818c0672
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).
The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
Summary: For distributed jobs, we were relying on the order the PythonOps were registered, which was very fragile.
Reviewed By: dzhulgakov
Differential Revision: D5016847
fbshipit-source-id: f5601467c5b0569d5e8a0efdd76abad0d703c5f5
Summary:
It is quite normal to rerun cmake to pick up new files through GLOB commands. The external command `touch CMakeLists.txt`, however, forces cmake to run every time you run make, so every make invocation takes longer than necessary. This PR removes that external command, leaving the decision of when to rerun cmake up to the user and speeding up builds when rerunning cmake is not required.
Closes https://github.com/caffe2/caffe2/pull/453
Reviewed By: Yangqing
Differential Revision: D4978919
Pulled By: bwasti
fbshipit-source-id: 0da4495b276a04f6ce46e1c8ceca0474b7573aa0
Summary:
cuDNN versions of dropout and LRN (for native fp16 support), port of Caffe's max pooling algo that uses an explicit mask to store locations (also supports fp16 storage)
Closes https://github.com/caffe2/caffe2/pull/396
Reviewed By: akyrola
Differential Revision: D4990880
Pulled By: asaadaldien
fbshipit-source-id: a716acffb656843e9b31e3e6808bd2d8aa959d03
Summary:
Added a context factory that allows you to use an existing context to
create other fully connected contexts much more cheaply (without having
to rely on a store).
Limitations:
- The backing context needs to be fully connected
Reviewed By: andrewwdye, pietern
Differential Revision: D4985121
fbshipit-source-id: 31ceabccbb679cedb18ec9927b6c166bef5989bb
Summary:
Incorporating a definition of a cell's output and illustrating its usage by adding dropout to all types of cells.
I think that we should try to get rid of aliases in RecurrentNetwork, so the output of applied_over_sequence is also always (state_1_all, state_2_all, ...). This way we can merge get_output_from_single_step, get_output_from_sequence and get_outputs_with_grads into a single method.
Let me know what you think!
Reviewed By: jhcross
Differential Revision: D4992913
fbshipit-source-id: 737939be336ad145f84e8733cd255d4f7188ef70
Summary: decoder_hidden_encoder_outputs_sum_tmp is tiny after D5010109, no need to recompute it.
Reviewed By: akyrola
Differential Revision: D5014335
fbshipit-source-id: cc9e8f91372889d10bd99c79366018cb3943a435
Summary:
At the moment serialization can take up to 3x the memory of the largest blob: the original blob, the BlobProto, and the SerializeAsString version of the blob. As a result, in certain cases serialization takes more memory than it should, and it hurts utilization / max model size per machine.
This diff adds an IOBound ThreadPool that should put a fairly strict limit on the extra memory overhead per blob.
Reviewed By: dzhulgakov
Differential Revision: D5012887
fbshipit-source-id: 12dbb9d3efab136411ddeffd519b602cf606661e
Summary:
Segment-based ops require increasing segment ids without gaps. Lengths-based ops do not have this requirement.
Other pooling methods, e.g. LogExpMean, do not have lengths-based ops available yet.
Differential Revision: D5019165
fbshipit-source-id: ab01a220e10d4ed9fa2162939579d346607f905e
Summary:
Specialized implementation of ResizeNearest for width_scale=2 and height_scale=2. This implementation doesn't use divides or calls to std::min, and is unrolled 2x over the width dimension. Also add a correctness test.
About 6x faster.
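As a rough reference for what the specialized path computes, here is an unoptimized nearest-neighbor 2x upsample in numpy (NCHW layout and shapes assumed; this is only an illustration, not the kernel from this diff):
```
import numpy as np

def resize_nearest_2x(x):
    # Each input pixel is replicated into a 2x2 block of the output.
    return x.repeat(2, axis=2).repeat(2, axis=3)

x = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
y = resize_nearest_2x(x)
assert y.shape == (2, 3, 8, 10)
```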
Reviewed By: ajtulloch
Differential Revision: D4928579
fbshipit-source-id: 5cc92a52bd688690fee907b4333d9c84b666f9c9
Summary: For perf, it is better to check weight0 inside the kernel and avoid the host synchronization of copying it to a stack variable. Also improved the style a bit (GitHub does not run our lint, so contributed code may not conform to our style).
Differential Revision: D5011668
fbshipit-source-id: 1eb85912f6f499acd3190cfcb59e7e39c2220d89
Summary: Since this function is declared in a header file, and is not templated and not part of a class, it will produce an ODR error if it is included in more than one file. Adding the `inline` keyword fixes this.
Reviewed By: jhcross, jamesr66a, m3rlin45
Differential Revision: D5011770
fbshipit-source-id: 50266a530da31ebfda59fcca2048355a00fe7758
Summary: External inputs must be computed before updating the _ops_output structure, otherwise if the net to be appended outputs the external input, it is not added correctly
Differential Revision: D5013496
fbshipit-source-id: 6a83d0a6f1c63ef8ae7bec4d862c0ac2a690d47b
Summary: Adding a simple video data layer which allows reading video data from frames or videos and outputs a 5D tensor. It also allows multiple labels. The current implementation is based on ffmpeg.
Differential Revision: D4801798
fbshipit-source-id: 46448e9c65fb055c2d71855447383a33ade0e444
Summary: The Split doc failed to mention important features like specifying the 'split' argument. Two questions the same day in Caffe2 Users were about how to do this.
Reviewed By: azzolini
Differential Revision: D5009503
fbshipit-source-id: 883549be891705a5c83778302d967481419f4dde
Summary:
This diff creates a generalized AttentionCell class, which will allow us to construct attention decoders out of arbitrary RNNCell components (with a particular view to using stacked, multi-layer RNNs).
In order to do this, we introduce a new optional input for RNNCell._apply which allows us to provide an additional input that is not processed by prepare_input(). Note that this is an argument only to _apply, not apply, since it is only meant to be used for additional recurrent connections to "embedded" cells, not for standalone RNNs.
Reviewed By: urikz
Differential Revision: D4998465
fbshipit-source-id: 473009ea4917e86e365f9d23aa2f11a46a94fd65
Summary: It is good practice to provide __dir__ whenever __getattr__ is defined so that tooling will work intelligently. In particular, it is hard to explore the available methods in iPython without tab completion.
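A generic illustration of the pattern (class and attribute names here are made up, not Caffe2 code): without __dir__, dynamically resolved attributes are invisible to dir() and therefore to tab completion.
```
class OpNamespace(object):
    _ops = {"Relu", "Sum", "FC"}

    def __getattr__(self, name):
        # Attributes resolved dynamically are invisible to dir() by default.
        if name in self._ops:
            return lambda *args, **kwargs: (name, args, kwargs)
        raise AttributeError(name)

    def __dir__(self):
        # Expose the dynamic attributes so tooling can discover them.
        return sorted(set(dir(type(self))) | self._ops)

ns = OpNamespace()
print("Relu" in dir(ns))  # True, so iPython tab completion can see it
```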
Reviewed By: dzhulgakov
Differential Revision: D5006545
fbshipit-source-id: 1a150d91d54637d80b292764513943ff70d971b4
Summary:
Script caffe2/caffe2/python/examples/resnet50_trainer.py can be used to train a ResNet-50 model with Imagenet data (or similar).
However, currently the script does not actually save the model, so it is kind of useless.
Task 1: After each epoch, save the model in a file "<filename>_X.mdl" where X is the epoch number and <filename> is given as a command line parameter. By default, use "resnet50_model" as the filename.
Task 2: Add a functionality to restore the model from a previous file:
- add a command line parameter "load_model", which user can use to specify a filename.
- if this parameter is set, load the model parameters from the previous file
Reviewed By: prigoyal
Differential Revision: D4984340
fbshipit-source-id: 333e92679ba52a7effe9917fdfc2d55d652b868f
Summary:
Part of the project to turn all the gradient accumulation business into ops in RecurrentNetworkGradientOp; this diff makes the accumulateInputGradients ops.
Also added a way to mark operators private so they don't appear in the docs.
Reviewed By: salexspb
Differential Revision: D5006698
fbshipit-source-id: 226d7afb473290c8d0f936d2cc87640be3e06615
Summary:
Added the possibility to provide 'tiles' and 'axis' as inputs, as opposed to arguments, for the Tile operator. If provided, the input values will override the argument values. Now with proper CUDA code.
Differential Revision: D4930347
fbshipit-source-id: b44b032b327c7d7bddfce63abf4e3289d7e74bfb
Summary: Layer for LastNWindowCollector op. We need this since it's an in-place operator.
Reviewed By: chocjy
Differential Revision: D4981772
fbshipit-source-id: ec85dbf247d0944db422ad396771fa9308650883
Summary:
Use the rnn_cell's multi-cell for the LSTM benchmark. While doing this, I had not changed the initial_states and got an inconsistent result from rnn_cell, so I added an assertion to check that the initial states length is 2 * num_layers.
+ fix division by zero error
Reviewed By: salexspb
Differential Revision: D5003177
fbshipit-source-id: a8250b825394c352428a0f067098dfcd7516ab2a
Summary: Use `CopyItems` so that it accepts any type of tensor. Also, move the cursor to input blob so that it's checkpoint friendly. Output is now also part of input so that inference can work correctly.
Reviewed By: xianjiec
Differential Revision: D4920987
fbshipit-source-id: da532736225ec27f409ff763ff69a0629235151c
Summary:
TSIA
This caused a compilation problem on gcc-6, see
https://github.com/caffe2/caffe2/issues/456.
Differential Revision: D5002823
fbshipit-source-id: 764aae1eaf78ee9918455b95a12e982597b85fdc
Summary: Set deviceId_ to -1 when CudaDevicePointer and CudaStream do not have valid data
Reviewed By: andrewwdye
Differential Revision: D4881374
fbshipit-source-id: e973a70e2e6e4519f5fdc2ad4e76f232d9593751
Summary:
Gloo added support for non-power-of-2 number of nodes in the recursive
halving/doubling allreduce algorithm by implementing the binary blocks
extension. This means we no longer have to fall back to using the ring
algorithm when the number of nodes is not a power of 2.
Reviewed By: prigoyal
Differential Revision: D4992536
fbshipit-source-id: f231aecbb46296ae3441ab818e058eb7ad6d8d64
Summary:
Otherwise compilation fails pretty far into the build, which is inconvenient.
The error reported when trying to compile with GCC 6:
CUDA 8.0 is not compatible with GCC version >= 6. Use the following
options to configure GCC version 5:
-DCMAKE_CXX_COMPILER=/usr/bin/g++-5
-DCMAKE_C_COMPILER=/usr/bin/gcc-5
-DCUDA_HOST_COMPILER:FILEPATH=/usr/bin/gcc-5
Closes https://github.com/caffe2/caffe2/pull/504
Reviewed By: akyrola
Differential Revision: D5004299
Pulled By: pietern
fbshipit-source-id: 185cd2f846f291a48e1d41ce0d87ca69e7f2c593
Summary: Allow RecurrentNetwork to accept dag as a step-net
Differential Revision: D4985747
fbshipit-source-id: ff39e0386c8f3a7364801a3011558f322d8ea669
Summary: When I added the CAFFE_ENFORCE_WITH_CALLER typedef to tag the tensor-pointer into enforce-exceptions, I only changed the most common callsites. This changes all enforces in tensor.h.
Reviewed By: salexspb
Differential Revision: D4995773
fbshipit-source-id: 90f2d277aeeb1354e72f92b2b9a75601fcbea609
Summary: Add the above operators to fbobjc and fbandroid by splitting them out to separate files and including these on the build. We are using these on mobile as part of Scout (Messenger).
Reviewed By: bwasti
Differential Revision: D4958660
fbshipit-source-id: f5cb105b4d7186a7eef705023382ec1383b6ec21
* Make sparseMask error if mask is uncoalesced.
Fixes #1447.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add test for sparse adagrad.
Previously, the sparse codepath was not exercised at all; this commit
adds a very simple test case "sparse Rosenbrock"; the idea is to do
Rosenbrock but then knock out one of the dimensions so that the
tensor is sparse.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Add a parameter dont_rebatch to data_workers. This disables re-batching of the fetcher's input into equal-sized chunks, which is not desired with RNNs, where with longer sequence lengths we might want smaller batches, etc.
For some reason the graceful-shutdown test interfered with other tests, so I removed it.
Reviewed By: jay-mahadeokar
Differential Revision: D4988549
fbshipit-source-id: cbab46d77c948f2e293e79e6eb538dde17d800ee
Summary:
We weren't handling an edge case where write(2) would return EINTR
when in sync mode. The Pair::write function would return false
indicating it didn't complete the write whereas the send function
expects it to complete when in sync mode. With this change we now
advance the cursor and retry the write when fewer than expected bytes
were written.
Also see https://github.com/facebookincubator/gloo/issues/34
Reviewed By: andrewwdye
Differential Revision: D4996949
fbshipit-source-id: 3bad4fa3d0a01517f20b64904aa71410641fa60f
Summary:
- Adding ScatterWeightedSumOp for CUDA.
- This version does not support input weight (weight0). In other words, the input weight has to be 1.0, otherwise the op exits.
- To check the value of weight0, we copy its value from device to host at: https://github.com/caffe2/caffe2/pull/443/files#diff-2a77f80797072e8443f4867cb709fb40R244
Closes https://github.com/caffe2/caffe2/pull/443
Reviewed By: akyrola
Differential Revision: D4971910
Pulled By: asaadaldien
fbshipit-source-id: 2282e968f95364f0b3b8126502b053fe7a32ba20
Fixes #1449.
For future reference, we should have a doc explaining our ref-counting
conventions; it looks like this bug slipped by because we assumed that
newTensor was taking ownership of the pointers it was passed in.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Add Python support for arbitrary (unidirectional) recurrent networks with MultiRNNCell abstraction. Since the combined step net for all layers is created at one time (in method _apply), this may be optimizable as-is. LSTM() function is extended to accept a list of numbers of units for the dim_out argument, producing a multi-layer LSTM in that case.
Reviewed By: salexspb
Differential Revision: D4965001
fbshipit-source-id: 39c069468d5b40bf803503cf62046a479ca83cbb
As discussed in #1441.
I also added some docs giving clear guidance about how to do coalescing in sparse tensors.
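For reference, a small example of what coalescing does (written against today's torch API names, which may differ from the ones at the time of this commit):
```
import torch

i = torch.tensor([[0, 0, 1],
                  [1, 1, 2]])              # note the duplicate index (0, 1)
v = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(i, v, (2, 3))
print(s.is_coalesced())                    # False: duplicates not merged yet
c = s.coalesce()
print(c.values())                          # tensor([7., 5.]): duplicates summed
```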
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Simplify _gen_sparse
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Randomly generate an uncoalesced tensor and test with it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Simpler implementation of cpu_only suggested by @apaszke
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Better implementation of randn, suggested by @soumith
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Lint fix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix CUDA type error.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: The code snippet in the added unit test is invalid, but it may or may not cause an exception. Disable the syntax so people don't accidentally use it.
Reviewed By: dzhulgakov
Differential Revision: D4985030
fbshipit-source-id: ffa2b26f7b29128b196aba1b1001a97c87e381cf
Summary:
We need a warm-up stage because otherwise the first iteration spends too much time doing all the allocations.
Reviewed By: akyrola
Differential Revision: D4986201
fbshipit-source-id: f60a75520988ff3f1540bb157cdc69634f307db4
Summary:
Layer to allow a model to follow different paths for each instantiation context and join later. Together with tagging system cleanup (this is a separate issue), this should reduce the need to write a layer to differentiate between contexts.
Re: tagging system cleanup, we should make exclusion more explicit: EXCLUDE_FROM_<CONTEXT>. This would simplify instantiation code. TRAIN_ONLY should become a set of all EXCLUDE_FROM_*, except EXCLUDE_FROM_TRAIN.
Reviewed By: kennyhorror
Differential Revision: D4964949
fbshipit-source-id: ba6453b0deb92d1989404efb9d86e1ed25297202
Summary: Previous slot offset was not added to the calculated value for the slot to be used in halving-doubling algorithms. If multiple instances were running, slot values could collide.
Reviewed By: pietern
Differential Revision: D4986618
fbshipit-source-id: 56b9220c91f31cc016d37e82907221460de70657
Summary: Make NCCL optional in data_parallel_model due to continuing reliability (deadlock) issues.
Reviewed By: pietern
Differential Revision: D4988950
fbshipit-source-id: 8a2192f01b5f3c0e847137cd37aefc69e553a56f
Summary:
RFC. This is a naive implementation of a Rebatching Queue for the MultiTask effort. Full disclaimer: I'm very new to Caffe/machine learning and I'm doing dodgy science here (under Dmytro's supervision), so please be extra tough on this review so I can learn best practices :)
Differential Revision: D4871970
fbshipit-source-id: 924820ef0fce45b5e2bdabeec9885cbafa23a880
1) Fix "kth" attr specification -- I can't get sphinx to generate `k`th,
but `k` th works with a space, unlike now where the highlighting continues
until the next attr.
2) Specify the size of the return tensors.
3) Add an example of the return tensor sizes with more than 1 dimension.
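For reference, a hedged illustration of the return-size behavior for this kind of reduction with more than one dimension (values and sizes are made up):
```
import torch

x = torch.rand(3, 5)
values, indices = torch.kthvalue(x, 2, dim=1)
print(values.size(), indices.size())   # torch.Size([3]) torch.Size([3])

values, indices = torch.kthvalue(x, 2, dim=1, keepdim=True)
print(values.size(), indices.size())   # torch.Size([3, 1]) torch.Size([3, 1])
```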
Summary: I ran into this earlier and the debug messages were not helpful enough.
Reviewed By: kennyhorror
Differential Revision: D4985754
fbshipit-source-id: b3d12b5e2cfa1b54fca9126768c84c902664ef28
Summary:
When appending net A to net B, an external input of net A should not be added as
an external input of net B if net B is outputting that blob.
Reviewed By: dzhulgakov
Differential Revision: D4975921
fbshipit-source-id: a5c0ada7b96d851e57d345244d322dd93c7be8e4
Summary:
This helps guard against programming errors where waitSend is called
before send is called. It uses a std::atomic to keep overhead low.
Reviewed By: andrewwdye
Differential Revision: D4984604
fbshipit-source-id: 04a63b1ba088e3bcba0abff40771af666deb15e5
Summary:
This returns EFAULT when passing a GPU memory pointer (for GPUDirect)
and the ibverbs driver can't map the GPUs memory. Since the error is
pretty cryptic, crash with a more useful message.
```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at gloo/transport/ibverbs/buffer.cc:46] mr_ !=
nullptr. ibv_reg_mr: Bad address (kernel module 'nv_peer_mem' not
loaded; did you specify a GPU pointer?)
```
Reviewed By: andrewwdye
Differential Revision: D4982966
fbshipit-source-id: 72c220fe22a3bc59396cfff992ad5f0f9c5bf83a
Summary: In certain situations, like in D4907916 where we insert an additional step in the middle of a model, it's necessary to keep the blob names constant across model helpers so that the communication schema doesn't break.
Reviewed By: kennyhorror
Differential Revision: D4981527
fbshipit-source-id: 6b8d6d240279dd48f801cfacbaa1d320ba54d694
Summary: Integration of the CRF layer in DeepText word models + implementing the Viterbi decode operator in C++ instead of Python so that the CRF models can be deployed in production.
Differential Revision: D4912196
fbshipit-source-id: 64f499a1bd47e811e7a96dde839904dcd05cacb3
Summary: Calling `set()` or `set_value()` on Scalar is dangerous, as something might be holding a reference to it. This is especially true with `LayerModel`, where instantiation is delayed. The code may still run but it will produce unexpected results, i.e., values may be written to the wrong blob.
Reviewed By: kennyhorror
Differential Revision: D4955366
fbshipit-source-id: f5e8694a9a411ee319ca9f39a0fed632d180b8a5
Summary:
This is a preamble for the "diagonal executor". Instead of creating a Net for each timestep, we have a single executor for the RecurrentNetworkOp that manages ops per timestep.
This will be used if net_type='rnn', so one can still use the old way by using a net type of 'simple' or 'dag' (so there is an effective kill-switch if there are issues with this).
Did this only for the forward model. The gradient op will follow later on; it is basically similar, just in reverse order.
Reviewed By: salexspb
Differential Revision: D4979933
fbshipit-source-id: bda77918ec518cb6b29d7021ee036d59eb2dd303
* Refactor test_sparse to reduce boilerplate.
Instead of manually creating a helper function, threading an is_cuda
parameter around, and creating a test method for CUDA and non-CUDA
variants, we take a different approach:
- There is now some new member variables initialized in setUp which
control the aspects of how we carry out the test; at the moment,
it's just whether or not we are using CUDA or not. This means
you don't have to pass is_cuda around, or do a conditional to
get the triplet of constructors you need.
I'll note that I am not a big fan of member variables in test
objects, but these are (intended to be) immutable so I think
it should be OK.
- Instead of manually defining test_foo and test_foo_cuda, we now
have a new TestCudaSparse class which overrides setUp (from above)
to swap in the CUDA implementation. Way less boilerplate, and NO
metaprogramming needed.
If you need to opt out of CUDA testing, there is a new cpu_only
decorator you can use.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
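A condensed sketch of the structure described above (the class and helper names here are illustrative, not the actual test file):
```
import unittest

def cpu_only(fn):
    # Skip the test when the suite is configured for the CUDA variant.
    def wrapper(self, *args, **kwargs):
        if self.is_cuda:
            raise unittest.SkipTest("CPU-only test")
        return fn(self, *args, **kwargs)
    return wrapper

class TestSparse(unittest.TestCase):
    def setUp(self):
        # Immutable per-suite configuration instead of threading an
        # is_cuda flag through every helper.
        self.is_cuda = False

    @cpu_only
    def test_cpu_only_path(self):
        self.assertFalse(self.is_cuda)

    def test_shared_behavior(self):
        # Helpers can branch on self.is_cuda to pick constructors.
        backend = "cuda" if self.is_cuda else "cpu"
        self.assertIn(backend, ("cpu", "cuda"))

class TestCudaSparse(TestSparse):
    def setUp(self):
        self.is_cuda = True

if __name__ == "__main__":
    unittest.main()
```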
Summary: A generalized version of halving-doubling that supports non-power-of-two number of processes by breaking up execution into blocks that are powers of two and communicating interblock after the intrablock reduce-scatter. Non-power-of-two cases will have some degree of load imbalance compared to power-of-two, but cases with few large blocks (e.g. 8 + 4 or 16 + 8) should still perform relatively well.
Reviewed By: pietern
Differential Revision: D4955947
fbshipit-source-id: af4f218fedb6adf475530c38386978b81f4f2b74
Summary:
It turned out that we cannot run PackedFC on a machine that does not have AVX2 right now, as there is a known issue with MKL 2017.0.098 that produces wrong results on non-AVX2 machines.
I just moved this test out of here because this is not the purpose of this test.
Reviewed By: salexspb
Differential Revision: D4974021
fbshipit-source-id: c5b82a41021defc9946a8219f59b28abb13d3beb
Because of this, Variables can no longer appear in the graph.
Every usage of a leaf Variable will leave an AccumulateGrad
function that has no outputs, but modifies var.grad as a side
effect.
Summary:
After running the test suite many times we end up with a zillion
connections in TIME_WAIT state. Setting SO_REUSEADDR seems like it
should help binding to ports regardless of the TIME_WAIT state.
Reviewed By: andrewwdye
Differential Revision: D4979606
fbshipit-source-id: b611f9c9e11aba858dc192f6bca3d64e10100b52
Summary:
It can happen that a pair is destructed while in CONNECTING
state when some unrelated code throws an exception after the connect
function has been called. The most likely place for this to happen is
when connecting pair A is in progress while connecting pair B throws
an exception. The exception will force destruction of all references
to pair A, even if it is in the CONNECTING state.
Also see https://github.com/facebookincubator/gloo/issues/33
Reviewed By: andrewwdye
Differential Revision: D4979557
fbshipit-source-id: 0cddddd3f478106f1694603fe7f2efe15a2d9aa1
Summary: Previously, the code below would go out of bound.
Reviewed By: xianjiec
Differential Revision: D4968037
fbshipit-source-id: 3760e2cddc919c45d85ac644ac3fabf72dbaf666
Summary:
build_ios.sh now has `-fembed-bitcode` flags for cmake and passes these flags to build_host_protoc.sh (which now accepts the optional argument `--other-flags`). That allows using the output libs (libCaffe2_CPU.a, libCAFFE2_NNPACK.a, libCAFFE2_PTHREADPOOL.a and libprotobuf-lite.a, libprotobuf.a respectively) in Xcode projects with bitcode enabled.
Bitcode has been enabled by default in all projects since Xcode 7, is crucial for slicing and is mandatory for watchOS targets. Enabling bitcode for a target requires bitcode to be enabled for all dependencies as well, so a Caffe2 built without bitcode forces developers to switch off bitcode for the whole app.
Closes https://github.com/caffe2/caffe2/pull/457
Reviewed By: bwasti
Differential Revision: D4978644
Pulled By: Yangqing
fbshipit-source-id: 5165abb507fb91bc8c38f7348d6836bccf8fcc22
Summary:
Implement NormalizeOP for GPU using CUDA, and rewrite the gradient to be a function of the output so it is more efficient, especially for the CUDA implementation.
Reviewed By: akyrola
Differential Revision: D4971300
fbshipit-source-id: e0ab66462000988aaf1f26010ea550533d107167
Previously, when using the same data channel in a multi-threaded environment, there was no guarantee that there wouldn't be any deadlocks or even errors.
Summary: As in the title + added scuba logging of the results.
Reviewed By: andrewwdye
Differential Revision: D4974261
fbshipit-source-id: 3e05b97133be95ffe37c8bcafd8a5a6bf3e7da93
Summary: Only CPU impl is available at the moment. Wrote simple cuda kernels.
Reviewed By: akyrola
Differential Revision: D4577736
fbshipit-source-id: c2540aa9d332fcdeac46cc7f89aab164d107d7a8
Summary: Both SquaredL2Distance and SquaredL2DistanceGradient had bad CUDA implementations. Use proper reductions and batched kernels.
Reviewed By: asaadaldien
Differential Revision: D4968527
fbshipit-source-id: f7cf82072d38bc127c757c5751863a9439aca8b5
Summary: Implement CPU and GPU gradient for Leaky ReLU op.
Differential Revision: D4943905
fbshipit-source-id: 541f13cd5f274a18b69ecf1362722b1bc0105ad9
Summary:
Instance norm failed grad check in some cases that needed a smaller step size. Decreased step size, but also increased threshold slightly.
Related diff: D4627379
Reviewed By: kennyhorror
Differential Revision: D4941827
fbshipit-source-id: d6f565340da92af40bfee90627960a3356c69412
Summary:
This is a naive layering approach until we have a better one. It could be C++ based and support diagonal execution. Not integrating it into the main LSTM API yet as this might be revised a bit. I would like to land it so we can compare against the current implementation in the benchmark and also use this as an example of how LSTMs could be combined (as some folks are doing similar things with some variations).
Later we can make LSTM() support the API of layered_LSTM() and also change it under the hood so it stacks cells into a bigger cell instead. This way, if we make the RNN op use a kind of DAG net, the RNN op can provide more parallelism in stacked cells.
Reviewed By: urikz
Differential Revision: D4936015
fbshipit-source-id: b1e25f12d985dda582f0c67d9a02508027e5497f
Summary: Use a priority queue instead of std::partial_sort to identify the top k elements. This reduces memory usage and improves performance.
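The same idea expressed in Python (heapq is a binary min-heap; this is only an illustration of the algorithm, not the C++ code in this diff):
```
import heapq

def top_k(values, k):
    # Keep a min-heap of size k; its root is the smallest of the current
    # top-k, so each new element needs only one comparison against it.
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

print(top_k([5, 1, 9, 3, 7, 2], 3))  # [9, 7, 5]
```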
Differential Revision: D4963931
fbshipit-source-id: 02e75b17ffaf24a4f63c7136626bf0991ee47495
Summary:
This is useful when data has standalone sequences which are
not connected to each other by any meaningful context
Reviewed By: yqwangustc
Differential Revision: D4835164
fbshipit-source-id: f95626acc26acc3eba3bca7efb08ed1dbdb36c83
Summary: Ran into illegal memory access errors when running MSRAFill on an odd-sized tensor. curand only supports even-sized fills. To work around this limitation, we fill the last entry of the tensor manually and use curand for what remains. In this line, the intent is to get the (n-1)-th element of the tensor. r is already a T*, so we should not be multiplying by sizeof(T) to get the (n-1)-th element.
Differential Revision: D4961306
fbshipit-source-id: 587f2945abf025e28f573482a4828c09e6ae771b
Summary:
A new argument `blob_name_overrides` is added, which specifies the destination names of loaded blobs (so that they can have different names than the ones in the saved file/db).
This will be used for parameter initialization from a pretrained model in Dper 2. When loading a blob, we need to avoid name collisions by assigning the loaded blob a new (temp) name.
Reviewed By: xianjiec
Differential Revision: D4952485
fbshipit-source-id: 4ce79bf40223314bb94981c22cbe537ae3f3d27c
Summary: No need to assert on connection errors.
Reviewed By: andrewwdye
Differential Revision: D4957698
fbshipit-source-id: b47f6f0f098dbf7d212701c5cb68e34b2c1c9522
Summary:
Free scratch blobs at data worker exit. Also add a utility function that you can use to reset gradient blobs easily:
from caffe2.python import utils
grad_blobs = [b for b in workspace.Blobs() if b.endswith("_grad") or b.endswith("_shared")]
utils.ResetBlobs(grad_blobs)
Reviewed By: rpenggithub
Differential Revision: D4955531
fbshipit-source-id: d33b2bb2b5247dd2c4cff51c82b1257c871a4179
Summary: Current eval nets contain loss operators (see example: https://fburl.com/6otbe0n7), which is unnecessary. This diff removes them from the eval net.
Differential Revision: D4934589
fbshipit-source-id: 1ba96c20a3a7ef720414acb4124002fb54cabfc7
Summary: Now you can call coordinator.stop_coordinator("train") to stop the train model's data input and release its memory.
Reviewed By: rpenggithub
Differential Revision: D4955014
fbshipit-source-id: c1bc3ec67337b94aff8ea9b306c3b4158eeef42c
Summary:
The _param_init_net does not exist. All the other places reference
param_init_net instead. So far no one has encountered any problem
because all the passed params are BlobReferences. This diff makes
this assumption explicit.
Reviewed By: azzolini
Differential Revision: D4922930
fbshipit-source-id: e6dbd7a29ea640b7e62fcfec7ced3cc7d149f872
Summary: Yet another diff to improve softmax CUDA kernels. 1) Use CUB for reduction ProbCrossEntropyKernel (was sequential loop); 2) remove unnecessary inner for-loops for two other kernels.
Reviewed By: wickedfoo
Differential Revision: D4953099
fbshipit-source-id: 4a5806d450021eff84e3d7fb0e7020cb5013fd69
Summary:
My first CUDA kernel ever!
The general strategy:
1. Create a block per row, up to CAFFE_MAXIMUM_NUM_BLOCKS.
2. Create CAFFE_CUDA_NUM_THREADS threads to sum in parallel.
3. Sequentially compute the max of all inputs for a thread
4. Use CUB parallel reduce to compute the overall max.
The new version of the code is way faster than the old kernel (20x). This is
actually quite suspicious; with the assistance of ntv, we discovered that
RowMaxKernelLargeD was performing slowly on lstm because it was only ever being
parallelized over a single block (see Test Plan below for a sample trace).
It will be good to investigate this further.
Differential Revision: D4948557
fbshipit-source-id: 7f8d5c04667b948881468adb37f8ebc5c903c8da
Summary:
This PR makes cmake install the gloo CUDA headers if USE_CUDA is enabled.
Closes https://github.com/facebookincubator/gloo/pull/29
Differential Revision: D4946856
Pulled By: pietern
fbshipit-source-id: a688c3794c4a5e34b664e7bdeb4e1148f6504419
Summary:
ScaleGradient is a helper operator that does no actual numerical computation; in the gradient computation phase it scales the gradient being computed through it.
Differential Revision: D4920719
fbshipit-source-id: 0e1e0888f79594be874fdbdda5ccef7389064c50
Summary:
The issue is that AliasOp doesn't work well with the swaps that we do for
param.grad and param.accGrad. Tensors become the same if there is no
reallocation of the gradient tensor inside the backward cell net's
local workspace.
bug explanation from akyrola:
```
gpu_0/decoder/decoder_hidden_encoder_outputs_sum_grad: tensor A
on each timestap back to 0, we Alias
gpu_0/decoder/weighted_encoder_outputs_grad,
so then also
gpu_0/decoder/weighted_encoder_outputs_grad: tensor A
It's acc is:
gpu_0/decoder/weighted_encoder_outputs_grad_acc: tensor B
Now after timesteps, we swap (line 626) with _acc to get
gpu_0/decoder/weighted_encoder_outputs_grad: tensor B
gpu_0/decoder/weighted_encoder_outputs_grad_acc: tensor A
OPTION A -- batch size is same as before or smaller:
Then on next iteration, we do again the Alias to
gpu_0/decoder/decoder_hidden_encoder_outputs_sum_grad, so now
gpu_0/decoder/weighted_encoder_outputs_grad: tensor A
and also
gpu_0/decoder/weighted_encoder_outputs_grad_acc: tensor A
swapping them does nothing and they are the same
OPTION B -- batch size increases
gpu_0/decoder/decoder_hidden_encoder_outputs_sum_grad is reallocated,
becomes tensor C
gpu_0/decoder/weighted_encoder_outputs_grad becomes tensor C with
Alias
gpu_0/decoder/weighted_encoder_outputs_grad_acc: is tensor A
```
Reviewed By: urikz
Differential Revision: D4946730
Tags: rnn, caffe2
fbshipit-source-id: b52d63cb238b81d2ad40e05e70deb32a81336f47
Summary:
The new memonger (D4393909) has an option to use shape inference. When trying this on some models, I encountered a couple of issues, fixed here:
- the elementwise ops Add, Div, Mul did not have shape inference, leading to errors
- if a shape inference function throws an error, it crashes the whole thing. It is better to catch the error, log it, and keep going. Shape inference is not required, just an optimization.
- additional checks in the conv/pool shape inference function, which was segfaulting in certain cases.
Reviewed By: asaadaldien
Differential Revision: D4949994
fbshipit-source-id: d4c571e1bb20f8feeade95c49412771bb3e7bed0
Summary: Thanks to ezyang, now I know that if a CUB tempstorage is reused, a thread sync is needed. So added this to the elementwise linear gradient kernel.
Reviewed By: wickedfoo, ezyang
Differential Revision: D4949250
fbshipit-source-id: fbbbd336a962a51be43784207105cadd391a8ef2
Summary: A layer that takes raw ids as inputs and outputs the indices which can be used as labels. The mapping will be stored with the model.
Reviewed By: kittipatv
Differential Revision: D4902556
fbshipit-source-id: 647db47b0362142cdba997effa2ef7a5294c84ee
Summary:
Adding add_weight_decay and image_input to the brew module & removing `getWeights` and `getBias` from CNNModelHelper.
An fbgs search for `useWeights` shows that no one but add_weight_decay is using this function. I checked with the Oculus people; their getWeights is a different function.
kennyhorror Please notice whether this is going to affect you :)
Reviewed By: salexspb
Differential Revision: D4945392
fbshipit-source-id: 4ef350fd81dd40a91847e9f3ebc5421eb564df32
Summary: Printing the resnet training loss and accuracy for each batch so that people will have a better idea of what is going on.
Reviewed By: pietern
Differential Revision: D4945390
fbshipit-source-id: 0fcd60f4735e81641355aba6e6cbf0e57e886e38
Summary:
lengthTile goes from 1 row to multiple rows; the gradient op is simply the reverse, adding the fanned-out rows of gradients back together into 1.
Reviewed By: kittipatv
Differential Revision: D4943375
fbshipit-source-id: deae9984e849974a0d484a10b94efdb1d30941cc
Summary:
Added optional support for using activation blobs for sharing as well. Doing this change revealed a non-optimal implementation in the blob sharing: we need to prefer reusing free blobs by preferring those blobs that are already shared by many other blobs; otherwise the memory usage can increase when the pool of 'free blobs' grows.
Also, my first version only passed "free blobs" (i.e. blobs in the recycling pool) down the first branch when operators forked. Now we pass the blobs that were not used by the first branch down the second branch, and so on.
Also added support for blob size information in the heuristic. This uses the shape inference mechanism.
I had to also do some small tweaks:
- use Sum() operator as a way to match shapes of blobs that had otherwise unknown shapes. This is related to the Sum() operator that is added to combine multiple incoming gradient inputs (with _autosplit gradients).
- a couple of random shape inference fixes
This reduces the Resnet-50 memory usage on 64 batch from 9.45 Gig to 8.5 Gig.
For a 32 batch, the memory usage is 4330 MiB, down from 4800 MB, compared to Torch's 6856MiB (thanks prigoyal for checking this for me).
This is unfortunately quite a bunch to review...
Reviewed By: asaadaldien
Differential Revision: D4393909
fbshipit-source-id: 9c7c94125f96512bea80463ebcb63c215ef95ff9
Summary:
This diff contains the following changes:
- implementing __repr__ on Field types; this makes it a little easier to see what broke in the unit tests
- preserve the shape of ndarray input to schema; previously, empty and scalar arrays lose their shape, while others keep theirs.
- type-checking ndarray input; this ensures basic integrity of the schema
Reviewed By: xianjiec
Differential Revision: D4913030
fbshipit-source-id: bd0f6b8722d95bfe800edf98ba05029c5b99d2af
Summary:
It should be up to the program including Gloo to ignore SIGPIPE.
We have seen a case where the EPIPE errno is not properly handled in
an unrelated piece of code. Having SIGPIPE fire means we can get a
core and debug this further.
Reviewed By: andrewwdye
Differential Revision: D4896727
fbshipit-source-id: f6fe2d3f8dc68a9e6c2c457639b45f8aee2d7b20
* move TopK to generic
* partial genericization of kernel code
* introduce TopKTypeConfig, specialize radix type and conversion for floats
* implement topk for byte tensor
* implement for char tensor
* implement for int tensor, extend test to check indices as well
* works for longs too
* make bitfield set/get a struct, add support for 64-bit types
* extend to double tensor
* implement for half tensor
* asserts; test fix
Summary:
This PR is based on commit "977c6b3" as this version allows MKL to use all the cores available.
All MKL related files are added here after incorporating review comments, major changes include
1. usage of Clang-format(Linter) with --style = Google
2. usage of macros for checking input and filter dimension in the mkl operators
3. merged Max and Average pooling functions
4. created a new folder for mkl related python scripts in Python folder and moved them there
5. there is no mkl_alexnet_test.py as that was redundant while convnet_benchmark.py does the same thing
Closes https://github.com/caffe2/caffe2/pull/270
Differential Revision: D4905219
Pulled By: Yangqing
fbshipit-source-id: e5f5b189714a835b93b9ebda24c52e09572dfca7
Summary:
If an exception is thrown inside the namescope, the scope won't be reset to its previous value. This diff changes this behavior to the expected one.
Reviewed By: kittipatv
Differential Revision: D4928621
fbshipit-source-id: 1d3579f2093ca60901b0d37ae3f2108deb2333ea
Summary:
In its current form, common_rtc.h can only be included in a file where
```
using namespace std;
```
comes before the include
Closes https://github.com/caffe2/caffe2/pull/398
Differential Revision: D4943125
Pulled By: Yangqing
fbshipit-source-id: 3ef15c9353e6dd7326fc5f60322049c9f594ee6c
Summary:
Mac does not support thread_local, and Caffe supports mac, so we will have to
temporarily disable this on mac.
(Note: this ignores all push blocking failures!)
Reviewed By: marksantaniello
Differential Revision: D4945019
fbshipit-source-id: 6d1d828a96459a85e1ae4fb5394eabdd9e610723
Summary: Use a proper reduction in the gradient kernel. This gives about 25% speedup with the n, D I tried (see P57333872), but with larger N, the improvement can be much more sizeable.
Reviewed By: stephenyan1231
Differential Revision: D4941218
fbshipit-source-id: 627eaf26fc20a81f1ef449f39eda0d2191b8c746
Summary: Instead of requiring gradient updates on GPU, this change allows the case where loss computation happens on GPU while all grad updates happen on CPU.
Reviewed By: jhcross
Differential Revision: D4943996
fbshipit-source-id: 1f2144c4277dfdb865877e0d0216ca1ac7dd7309
Summary:
Add a pointwise `IsMemberOf` operator to Caffe2.
The original idea was `In` but I think this is not so clear.
I used `UnaryElementwiseWithArgsOp` at some point, but it was making the code a bit more difficult to read without bringing any benefit.
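A hedged usage sketch: the operator name comes from this diff, but the argument name ("value") and the exact input/output dtypes are assumptions on my part.
```
from caffe2.python import core, workspace
import numpy as np

workspace.FeedBlob("ids", np.array([1, 5, 7, 2], dtype=np.int64))
# "value" holds the membership set; treat this argument name as hypothetical.
op = core.CreateOperator("IsMemberOf", ["ids"], ["mask"], value=[2, 5])
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("mask"))  # expected: [False  True False  True]
```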
Reviewed By: ender-wieczorek
Differential Revision: D4912655
fbshipit-source-id: 716b66bb51468dd59db5f76f23d78cda85961b58
Summary:
Cannot guarantee Gloo will build on 32-bit systems as we don't run continuous build/test for this.
Verified this works by changing 8 to 7 and observing USE_GLOO defaulting to OFF.
Closes https://github.com/caffe2/caffe2/pull/401
Differential Revision: D4943135
Pulled By: pietern
fbshipit-source-id: 1972658afe819951e24ffbec76eb615c36ab0cc2
Summary:
When trying to build caffe2 with the python provided by homebrew, I found out there are some errors in the build scripts. The "get_python_cmake_flags.py" script is supposed to find the correct python library and header file locations. However, due to these errors, this script does not function correctly. After building, caffe2 is linked against the default python library provided by Apple, which causes a crash when trying to validate whether or not the installation was successful:
```shell
python -c 'from caffe2.python import core' 2>/dev/null && echo "Success" || echo "Failure"
```
The fix is as simple as follows:
- Add "shell" so that command substitution could work under Makefile.
- Add blank spaces between -D options so that they are treated as options not makefile targets.
- Print the "flags" variable without the newline character so that they could be utilized by command substitution correctly.
Closes https://github.com/caffe2/caffe2/pull/391
Differential Revision: D4943212
Pulled By: Yangqing
fbshipit-source-id: 04d3595fa2d89fe57aed5b6a7a91a95114a82a1b
Summary:
Two new operators to pack and unpack a dataset. This is so that we can
re-use other operators that do not understand the schema format. The immediate
use-case is to use it with a partition operator.
Packing works by splitting the input into separate tensors, putting them in a
vector and wrapping in a shared_ptr (as opposed to a unique_ptr, so we can
copy).
Unpack takes the packed input and concatenates it back to the original.
I also had a hard time understanding the iteration, so I created a TreeWalker that hides the complexity of operating on all the arrays and provides short, purpose-specific functions that, at least for me, are easier to understand.
Reviewed By: dzhulgakov
Differential Revision: D4918002
fbshipit-source-id: ecbf9196ed25e886a94383961176b8c84dde2d2f
Summary:
Added a forward_only option to recurrent_net and the RNNCells. If this is set, the backward_step_net is not passed to the operator.
When backward_step_net is not available, the operator knows it is in forward_only mode and does not create a workspace for each step, but cycles through only one private workspace.
Note: we could avoid doing a lot of work in the recurrent.py:recurrent_network call when the backward step is not needed, but doing that nicely requires more refactoring than I wanted to do now. Thus, we still create the backward step nets etc., but just don't pass them to the op.
This can be used to create more efficient inference models. You can also sanitize existing inference nets and remove the backward_step_net argument to
get the benefits.
Reviewed By: salexspb
Differential Revision: D4916482
fbshipit-source-id: c99b93c9cb897c32b0f449253f7f6d6a942618ad
Summary:
This is needed to have a stateful PythonOp (such as the PyTorch op in the following diff) where computing f will produce a state (not tensors) that's consumed by grad_f.
python_func_type is a type that is constructed as python_func_type(f) and provides forward and backward methods (delegated to f and f_grad). We construct this object at op registration time to have it as thread local.
Differential Revision: D4900963
fbshipit-source-id: 00a6a55fa372e2244048921914e22e710d11f7ce
Summary:
As per request moving elsewhere and using the Dispatcher. The reason
why I didn't put it into tensor.h is because the dispatcher lives in operator.h
and operator.h includes tensor.h. I also didn't want to do any codemods. If
this turns out to be useful it can be changed. Also the name is not super great
but the TensorPrinter is already taken so that's what first came to mind.
Reviewed By: dzhulgakov
Differential Revision: D4893325
fbshipit-source-id: 7d4e56c4e57164c3cd3748f4a705a4ffe6b932d9
Summary:
rename model_helpers to brew. This is a big diff now. I did these things:
1. replace model_helpers with brew:
find . -type f -exec sed -i 's/model_helpers/brew/g' {} +
2. rename model_helpers.py and model_helpers_test.py
3. rename ModelHelpersTest to BrewTest
4. lowercase all the helper functions to distinguish them from single op
5. run my unittests
6. run converge tests
Reviewed By: salexspb
Differential Revision: D4930465
fbshipit-source-id: f420a1b03238df1cbe9f4426e0b9c43a12119661
Summary:
rename ModelHelperBase to ModelHelper.
This is the result of running:
find . -type f -exec sed -i 's/ModelHelperBase/ModelHelper/g' {} +
fbgs found 19 results for ModelHelperBase. There are 20 instances here because I added 1 test in model_helpers_test.py.
Reviewed By: salexspb
Differential Revision: D4928337
fbshipit-source-id: bc4c12b60b90c167e717de50ea9fe17521e142e3
Summary: Instead of calling math::Axpby in a loop, we can do it in one kernel much more efficiently.
Reviewed By: asaadaldien, jamesr66a
Differential Revision: D4935893
fbshipit-source-id: 33497784604d1779723d578ea5400e87803851f0
Summary: jamesr66a noticed that the ScaleKernelAlphaDevice kernel was showing up in a profiler a lot. This was because it is called in a loop in ReduceFrontSumGradientOp. This was easy to replace by one kernel that scales in a "striped" manner.
Reviewed By: asaadaldien, jamesr66a
Differential Revision: D4935888
fbshipit-source-id: bc7bfd8c94988074ace6fbf3fdfb85905027f272
Summary:
This is getting too messy again, so cleaning it up even more. One thing I added here: not calling random to generate the input sequence. Ideally we do this for all other inputs too; this was reported to be an issue when hypothesis finds bad examples, as it can make the run very long.
Also I tuned the ranges a bit so the test finishes faster. On my devgpu the whole test took 600 seconds before and now takes 39 seconds.
One more important thing: we want to test all combinations of the things that are in the for loop, while the things provided by hypothesis are just random tensor inputs.
Differential Revision: D4902956
fbshipit-source-id: ceb02d6761406b3192101d3b255abe90b2866770
Summary:
CUDA version of PRelu and its gradient. The forward pass is straightforward; the backward pass requires a reduction over the weights.
tsaizhenling, please patch this and test.
Differential Revision: D4931630
fbshipit-source-id: 1238e7d536e41480713865ced91aaef88f4feef5
Summary: To expose operator execution statistics in Python, the profiling measurements collected in the ProfDAGNet class are leveraged. In the current implementation, a new operator is defined that outputs the statistics in a protobuf message. On the frontend, OperatorStatsContainer works as a wrapper to print ProfDAGNet statistics.
Differential Revision: D4923009
fbshipit-source-id: 18a6d76a405ef277a3fca7a312609051cf943207
Summary:
When installing on systems such as Arch Linux, where the default python version is 3, the build will fail. To fix this, instead of changing the python link in the shell it is more efficient to set the default python version allowed by cmake.
Closes https://github.com/caffe2/caffe2/pull/361
Differential Revision: D4932214
Pulled By: Yangqing
fbshipit-source-id: 06997d2df68b8e4037d72fd49813f6f74ca7591b
Summary:
Simple FindOp for CPU and GPU which searches a list of unordered needles from an unordered index. CPU version might be faster if first sorting the index / needles, but we can get back to that later.
CUDA op is also kind of brutish, but pretty parallel. Since the index and the queries are smallish at least in the use case currently in mind (Machine Translation's team word candidate search), I think this is a sufficient start.
Note that this is much simpler than the Index-class of ops which allow modifying the index etc. Since CUDA ops are more complex to implement for the full Index functionality, I decided to make a separate op with this very simple functionality.
Differential Revision: D4910131
fbshipit-source-id: 6df35c9e3c71d5392a500d5b98fd708ab0c8e587
Summary:
arg_scope module for model_helpers.
Some coding examples with it:
with model_helpers.arg_scope([model_helpers.FC], kwargs):
    model_helpers.FC(model, "x", "out_1", n, n)
with model_helpers.arg_scope([myhelper], n=-3):
    with model_helpers.arg_scope([myhelper], n=-2):
        with model_helpers.arg_scope([myhelper], n=n):
            res = model_helpers.myhelper(None)
with model_helpers.arg_scope([myhelper], n=-3), \
     model_helpers.arg_scope([myhelper], n=-2), \
     model_helpers.arg_scope([myhelper], n=n):
    res = model_helpers.myhelper(None)
Reviewed By: salexspb
Differential Revision: D4837180
fbshipit-source-id: 2cbd81681779d6cd1e61ee189edcc1cf3bb07d15
Summary: Insufferable Apple fanboys have burned this into my brain.
Reviewed By: Yangqing
Differential Revision: D4913772
fbshipit-source-id: 486c20e9c921
Summary: This file was left over after a recent refactoring but is not used.
Reviewed By: andrewwdye
Differential Revision: D4940265
fbshipit-source-id: 01f8c5fbc73dd0ca0a92306dbfef22ff28133750
Summary:
While it is theoretically possible to make Gloo work on 32-bit systems, it's unlikely anybody would ever use it on 32-bit systems. This removes the expectation that it should work...
Fixes #28
Closes https://github.com/facebookincubator/gloo/pull/31
Differential Revision: D4939073
Pulled By: pietern
fbshipit-source-id: 8c60804f7ae5cf835332871a424aefa2c498e8a4
Fixes #1267
This fixes a number of issues when PyTorch was compiled with CUDA
support but run on a machine without any GPUs. Now, we treat all errors
from cudaGetDeviceCount() as if the machine has no devices.
This saves an extra memory copy, which speeds up data loading a bit
(5-10% with accimage).
As part of this change:
* torch.cat accepts keyword argument out
* specifying out=None is treated like not specifying out (see the short example below)
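A short example of the new keyword form (tensor sizes are made up):
```
import torch

a, b = torch.rand(2, 3), torch.rand(4, 3)
out = torch.empty(6, 3)
torch.cat([a, b], dim=0, out=out)   # fills the preallocated `out` tensor
torch.cat([a, b], dim=0, out=None)  # same as not passing `out` at all
```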
Summary:
Changed _Android_ to _iOS_ in the comments in scripts/build_ios.sh.
Closes https://github.com/caffe2/caffe2/pull/364
Differential Revision: D4930101
Pulled By: Yangqing
fbshipit-source-id: 8f0a6aa1b43fd57c2f71f1c667c61d1f69b1e061
Summary: Work in progress for improving the performance of the TransposeOp on CPU. This is used extensively for inference in several neural MT systems, so optimizing this function is worthwhile and will reduce request latency.
Differential Revision: D4913075
fbshipit-source-id: fa2742829291d91f3eba00fdfe7d6c0dae83e206
Summary: Better to use the standard library tanh(), because otherwise there can be numerical differences relative to other systems.
Reviewed By: urikz
Differential Revision: D4910421
fbshipit-source-id: 3a1e63cd20a6b8e3720a1deafea227652b38205e
Summary: CuDNN LSTM weights were incorrectly sized for layers > 0: there was an assumption that the input size to the middle layers is the same as for the first layer, but actually a middle layer gets its input from the layer below, which has dimension equal to the output dimension (hidden dimension). This worked fine when input_dim and hidden_dim were equal, as they are in the default params for lstm_benchmark.
Reviewed By: salexspb
Differential Revision: D4922824
fbshipit-source-id: 3ed05529dcb0a4e66ad440084a55df1c5932fd33
Summary:
downloaded_size needs to be incremented by the length of the returned data_chunk.
When the last block's size is less than the chunk size, the percentage could otherwise exceed 100%.
Closes https://github.com/caffe2/caffe2/pull/329
Differential Revision: D4922227
Pulled By: Yangqing
fbshipit-source-id: 7d05d9bbf2dad0a9d330be96b60e658908185a46
Summary: Fixes unit test test_seq2seq_caffe2_model_cnn_one_stack_encoder, broken by D4905003. (Also some commas.)
Differential Revision: D4920699
fbshipit-source-id: 2fe501095e3e26a475d666afcae8e48c953f2eef
Summary: This would allow us to pin the size of lengths tensor to the batch size. I'll use this in a follow up diff.
Reviewed By: kennyhorror
Differential Revision: D4906634
fbshipit-source-id: 8d3d151f33fd99547d9940e7c663779810283eb6
Summary: Set pooling mode to exclude padding values and match the CPU & CUDA implementations.
Differential Revision: D4920476
fbshipit-source-id: 26ce656cc792061f706e2acb37e72cec46ac77c8
Summary: salexspb recognized that my diff fixing num_layers > 1 cudnn lstm made it run much slower. It turns out this was caused by adding the dropout states to the gradient op (which it was missing; that was a bug). But since we use dropout=1.0, we don't need to initialize the dropout states, and it turns out this improves the perf of CuDNN LSTM very significantly, at least when hidden_dim is small (2.5x increase with hidden_dim=40). With large hidden_dim, the improvement is more modest.
Reviewed By: salexspb
Differential Revision: D4920543
fbshipit-source-id: 860c9d4c61793252f658dc5e3390bab571476be5
Summary:
Top-level makefile had `make` hardcoded, resulting in slow build and the following message when following installation instructions:
warning: jobserver unavailable: using -j1. Add `+' to parent make rule.
Replacing this recursive make command with the variable MAKE fixes the issue.
Closes https://github.com/caffe2/caffe2/pull/324
Differential Revision: D4920978
Pulled By: Yangqing
fbshipit-source-id: 1e75ab41786e52d1b7abcc2c46ad1088880d8c1d
Summary: `not field` calls `__len__()`, causing the field to appear to be missing even when it's not
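A generic illustration of the bug class (the Field class here is a stand-in, not the schema code):
```
class Field(object):
    def __init__(self, items):
        self._items = items

    def __len__(self):
        return len(self._items)

f = Field([])        # a present field that happens to be empty
print(not f)         # True: `not` falls back to __len__(), so the field
                     # looks missing even though it exists
print(f is None)     # False: an explicit None check tells the cases apart
```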
Differential Revision: D4910587
fbshipit-source-id: bc2b2fadab96571ae43c4af97b30e50c084437af
Summary: We had to disable the keep_on_shrink flag for inference and some training workloads; this change limits the memory allowed to be kept around when we are allocating a smaller blob after a bigger one.
Differential Revision: D4889366
fbshipit-source-id: 87412cc1c0bf2c43ea1f3f19e31afc178bc1b9db
Summary: PrefixStore::wait() uses a default timeout if unspecified. This is incompatible when using PrefixStore to wrap a Store implementation that does not support timeout. Instead the base Store::wait(keys, timeout) implementation is called, throwing an exception. This change modifies the base implementation to ignore the timeout.
Differential Revision: D4916517
fbshipit-source-id: 3cdd83bd209bf938b58442d82f3fc245e68019ad
Summary:
1. add net gradient check to dper2 model unittest framework
2. add net gradient check to mtml model
3. refactor the code setting defaults to namedtuple.
Reviewed By: kittipatv
Differential Revision: D4897169
fbshipit-source-id: 4f17dd06ee169aa1158f12f5156614d45d7d97c1
Summary: This is needed for the completeness of random negative sampling. When the pool size is 0, we want to generate empty indices tensor.
Reviewed By: xianjiec
Differential Revision: D4906866
fbshipit-source-id: 75d66a92d15d60bb37bcd1075d324f28069c4fa0
Summary: This diff resolved some issues in the reverted PR 246.
Differential Revision: D4911821
fbshipit-source-id: 0a6fa47f4c2405475697e40fb926758c534f8ef7
Summary: Fixes for corner cases with small element counts. Fixed problems include (1) calling range on out of bounds pointers, (2) failing to allocate send or receive buffers in cases where they correspond to out of bounds indices for reduce-scatter, but are needed in the allgather, (3) not allocating enough receive buffer space (more than count_ bytes may be needed in some cases)
Reviewed By: pietern
Differential Revision: D4912656
fbshipit-source-id: 0409d01894ff9c93ef1a1fdf8021c9ecf62f9b57
Summary:
Similar to SafeDequeueBlobsOp, but adds weight-based sampling for reading from multiple input BlobsQueues.
WeightedSampleDequeueBlobsOp takes a vector of weights (each weight is mapped to one input blob queue).
Based on these probabilities, we choose which BlobsQueue to dequeue from.
WeightedSampleDequeueBlobsOp stops when any of the input BlobsQueues is empty.
Reviewed By: dzhulgakov
Differential Revision: D4905160
fbshipit-source-id: 5b1551e2250569f933a6c01ed04442843c5e0cb6
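As a rough Python illustration of the sampling idea only (the function and queue names below are made up, not the operator's code):
```
import numpy as np

def pick_queue(queues, weights, rng=np.random):
    """Choose one input queue with probability proportional to its weight."""
    probs = np.asarray(weights, dtype=np.float64)
    probs = probs / probs.sum()
    return queues[rng.choice(len(queues), p=probs)]

# Example: two queues, the first dequeued from ~75% of the time.
print(pick_queue(["queue_a", "queue_b"], [3.0, 1.0]))
```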
* updated ubuntu instructions
* updated ubuntu notes and troubleshooting
* updated tutorials using local files
* added doxygen python blocks for docs generation
* doxygen related files for generating docs
* removing Mac and Windows build status while those are in beta
* inference lookup is local now
* launch updates
* moved to docs folder, updating paths
* updated ubuntu instructions
* updated ubuntu notes and troubleshooting
* updated tutorials using local files
* added doxygen python blocks for docs generation
* doxygen related files for generating docs
* removing Mac and Windows build status while those are in beta
* inference lookup is local now
* launch updates
Summary:
When compiling Caffe2 on a Jetson TX2 using JetPack 3.0, the compilation with the Tegra X1 build script runs through perfectly fine. However, when running
from caffe2.python import workspace
the following error shows up:
> ImportError: No module named six
After installing `six` manually using
sudo pip install six
this works fine. I thus added the `six` module to the install script.
I assume this will also be required for the `build_raspbian.sh` script; however, as I could not test this, I didn't add it (yet).
Closes https://github.com/caffe2/caffe2/pull/293
Differential Revision: D4914121
Pulled By: Yangqing
fbshipit-source-id: 75947e8c295e1f5ad3f480a025fe8518dd91a957
Summary:
This tiny patch fixes missing ```CUDA_NVCC_FLAGS``` & ```CUDA_HOST_ARCH``` from ```caffe_detect_installed_gpus()```.
-----------------
People may want to define custom flags or compilers that are more CUDA compatible. The automatic GPU arch detection ignores these flags and fails. Example of such custom flags:
```
cmake . \
-DCUDA_ARCH_NAME="Auto" \
-DCUDA_HOST_COMPILER="/usr/bin/gcc5"
```
* The autodetection part fails regardless of whether proper compiler flags are passed, because the system gcc 7.0 doesn't work with CUDA, so all arches get enabled:
```
-- The C compiler identification is GNU 7.0.1
-- The CXX compiler identification is GNU 7.0.1
...//\\...
-- CUDA detected: 8.0
...//\\...
-- Automatic GPU detection failed. Building for all known architectures.
-- Added CUDA NVCC flags for: sm_20 sm_21 sm_30 sm_35 sm_50 sm_60 sm_61
```
* The patch fixes the autodetection, as expected:
```
$ cmake ../ -DCUDA_NVCC_FLAGS="-Xcompiler=-std=c++03 -I/usr/include/cuda/"
-- The C compiler identification is
Closes https://github.com/caffe2/caffe2/pull/288
Differential Revision: D4914215
Pulled By: Yangqing
fbshipit-source-id: c407a750e03cb163f9d57f9f6403042704046014
Summary:
If the command line flag caffe2_gpu_memory_tracking is enabled, CUDAContext will keep track of the total memory allocated on each GPU. This requires keeping track of the sizes of the pointers, which might add some overhead, so it is optional. In practice the overhead is minimal, since we usually don't do allocations after the first iterations.
Added an op GetGPUMemoryUsage() to fetch this data programmatically, and a python function utils.GetGPUMemoryUsageStats() to call this op and package the results. Modified the LSTM benchmark to report these stats.
This tracking is GPU-only for now. CPU allocations are less organized.
Reviewed By: asaadaldien
Differential Revision: D4877451
fbshipit-source-id: 857798fe499d8c78cc590783052cbb2d4db56ea0
Summary:
memcpy comes from cstring
See https://github.com/caffe2/caffe2/issues/286
Reviewed By: Yangqing
Differential Revision: D4914228
fbshipit-source-id: de60c2a98feb4228546a8f1fe237a090101f50e4
Summary:
Due to the massive dependencies I did not update the version number - under
the same big version number (2017) the API is compatible so no need to
rebuild all the dependencies.
This will unblock the Caffe2 Intel pull request on MKLDNN.
Differential Revision: D4906463
fbshipit-source-id: 0f74436ac3a05605e35b8b649c3e8b5c1c69b500
Summary: Add a default 60s timeout to RedisStore::wait() to avoid blocking indefinitely when peer machines are unavailable.
Reviewed By: pietern
Differential Revision: D4908699
fbshipit-source-id: 39de9066633e8b0c8d1ee198b6bf3f70d3961196
Summary:
as desc.
small fix in the feature_proc layer for the case when we only have one preproc type
Reviewed By: chocjy
Differential Revision: D4908933
fbshipit-source-id: 1338048fc395f85c3724721a9996ad1ee51f0f20
Summary: added a new context to layers.py
Reviewed By: kennyhorror
Differential Revision: D4817124
fbshipit-source-id: 36f08964b86092e81df24c1b9d4b167293a7ffb8
Summary: unit test using hypothesis for unmask operator
Reviewed By: ender-wieczorek
Differential Revision: D4904075
fbshipit-source-id: 874d3756ec703ab2cc82f24f7160b4254bf791f1
Summary:
Found while browsing the code. Cool stuff in here!
Closes https://github.com/caffe2/caffe2/pull/276
Differential Revision: D4911421
Pulled By: Yangqing
fbshipit-source-id: 3bef10a4001a6b4d4527c054519d69131799a0e2
Summary:
It's possible the pair is in the listening state when it is
destructed. The fd will not have been cleaned up in that case, so we
shouldn't assert that being the case.
Reviewed By: andrewwdye
Differential Revision: D4909964
fbshipit-source-id: 7103d74910e3bcf5de9f4658d8f1f682b6c8a70c
Summary: Make it convenient to test a model where we don't care about the backward pass, e.g., when the backward pass won't be run anyway.
Reviewed By: xianjiec
Differential Revision: D4906890
fbshipit-source-id: 9da51a9de4422474ce780e180b1ca95d6bc8c46d
Summary:
Currently, the functional layer infers the output types and shapes by running the operator once.
But in cases where special input data are needed to run the operator, the inference may fail.
This diff allows the caller to manually specify the output types and shapes when the automatic inference would fail.
Reviewed By: kennyhorror
Differential Revision: D4864003
fbshipit-source-id: ba242586ea384f76d745b29a450497135717bdcc
Summary: This will be used to generate random indices input to `Gather`
Reviewed By: xianjiec
Differential Revision: D4904591
fbshipit-source-id: 8d858631e3d640be2cec12f1566cbf195e6aad4b
Summary:
Two new operators to pack and unpack a dataset. This is so that we can
re-use other operators that do not understand the schema format. The immediate
use-case is to use it with a partition operator.
Packing works by splitting the input into separate tensors, putting them in a
vector and wrapping in a shared_ptr (as opposed to a unique_ptr, so we can
copy).
Unpack takes the packed input and concatenates it back to the original.
I also had a hard time understanding the iteration, so I created a TreeWalker
that hides the complexity of operating with all the arrays and provides
short, purpose-specific functions that at least for me are easier to
understand.
Reviewed By: dzhulgakov
Differential Revision: D4870606
fbshipit-source-id: dc29428de5c96cc3039af2885d9e4b026d9f482d
Summary: Gather should work when both DATA and INDICES are empty
Reviewed By: xianjiec
Differential Revision: D4906878
fbshipit-source-id: 23585afbe618656d7f5831c56d360a03e3cb2584
Summary: The scale_ tensor was resized incorrectly in the CUDA version of SoftmaxOp. For some reason this has not triggered more crashes. I was using rowmax_ in-place with scale_, which was then also incorrectly sized. Usually D>N, so this was not an issue, but perhaps there were cases with attention where that did not hold. The problem is also order-sensitive: if we once had an input with large D, the buffer ended up with the correct size.
Reviewed By: jamesr66a
Differential Revision: D4904989
fbshipit-source-id: 244b6d308d1fc08be885c641440cbacad3b0dbce
Summary: Add AllgatherRing and CudaBroadcastOneToAll to benchmark. Add host info and algorithm sweep to chronos script.
Reviewed By: pietern
Differential Revision: D4901111
fbshipit-source-id: 1421025d39b914b14e857f21c43eac30c9c9dd2f
Summary: CuDNN RecurrentNet GradientOp did not pass the DROPOUT information to the initializer, causing an incorrect scratch space size to be estimated. We have an assertion enforcing that the scratch space is the same for forward and backward ops, so this failed that assertion. We currently hard-code dropout to 1.0, so this has had no effect on correctness in our tests. For some reason there wasn't an issue with num_layers=1, but with num_layers>=2 the scratch space size was different.
Reviewed By: salexspb
Differential Revision: D4904715
fbshipit-source-id: 780266c5ecf1f7a32387edcb6fc498a13ac782ac
Summary: This is the nice way to re-use RNN layers for training and for inference.
Reviewed By: salexspb
Differential Revision: D4825894
fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
Summary: This may help tell different allreduce operations apart during debugging/tracing.
Reviewed By: prigoyal
Differential Revision: D4897921
fbshipit-source-id: bbb2ce02a3e1f467ad54f8a3aed6a4e2b26a9fe4
Summary:
The common worlds can be reused without performance impact as long as
there is a guarantee that no two algorithm instances are using it at
any given time. Since we know the ordering and the maximum
parallelism, we can cycle through common worlds, and reuse them
accordingly.
Differential Revision: D4896779
fbshipit-source-id: 164e1727692eab904fa6879a9f91a3e8332a2e30
Summary:
This is from discussion with dzhulgakov : as a step towards revisiting the
core.Net autonaming, we will first guard against accidental overwrites of
existing networks in the workspace.
ajtulloch since we are doing Predictors in mobile, this should be safe right?
azzolini - I assume this would be safe, but would love to get your approval.
akyrola - would this hurt xray?
Reviewed By: dzhulgakov
Differential Revision: D4897725
fbshipit-source-id: aa41271927ad6671f07a53b9505283623f8c49e5
Summary: Having to pack the input to schema doesn't make much sense since the structure is not recognized by operators anyway.
Differential Revision: D4895686
fbshipit-source-id: df78884ed331f7bd0c69db4f86c682c52829ec76
Summary:
The MT team, with urikz, found out that their convergence discrepancy against another version of the model was caused by numerical stability issues in softmax. Our implementation was missing the optimization that avoids computing exp(log(x)) for softmax-crossentropy. This diff fixes that.
This does not require any changes to the current models, since the output of SoftmaxWithLoss is still the exponentiated items.
I also did a little bit of cleanup on the code; for some reason we were passing tensors to SoftmaxCPU() instead of pointers.
Reviewed By: urikz
Differential Revision: D4901888
fbshipit-source-id: 62e785ecdd87e33742292b191e91b4f43912e4c0
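A small numpy sketch of the stability idea (an illustration of the shift-and-stay-in-log-space trick, not the actual kernel): compute the loss directly from the shifted logits instead of exponentiating and then taking the log again.
```
import numpy as np

def softmax_xent_naive(logits, label):
    # exp followed by log of the selected probability: an exp(log(x)) round trip
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])

def softmax_xent_stable(logits, label):
    # subtract the row max and keep everything in log space
    shifted = logits - logits.max()
    log_z = np.log(np.exp(shifted).sum())
    return log_z - shifted[label]

logits = np.array([1000.0, 1001.0, 999.0])
print(softmax_xent_naive(logits, 1))   # overflows: nan
print(softmax_xent_stable(logits, 1))  # ~0.408, finite and correct
```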
Summary:
Added the possibility to pass 'tiles' and 'axis' as inputs,
as opposed to arguments, to the Tile operator. If provided, the input
values will override the argument values.
Differential Revision: D4794432
fbshipit-source-id: a7e38f4f925a4cedf530924bd426c3bb08b5aad8
Summary:
Add conv helpers. The migration of these functions assumes that people do not do
cnn_model = CNNModelHelper(use_cudnn=True)
cnn_model.Conv(..., use_cudnn=False, ...)
Reviewed By: salexspb
Differential Revision: D4884974
fbshipit-source-id: 12af6e2a5863eba789232cd4a4771f95d05f9227
Summary:
A workspace may add a suffix such as "_1" to the net name if other nets
have been added to the workspace with the same name. This is true even
if the previous nets have been removed or if the workspace has been
reset.
Closes https://github.com/caffe2/caffe2/pull/213
Differential Revision: D4899877
Pulled By: Yangqing
fbshipit-source-id: b89b196df815dceff49a3ec76d7f658cdc4b0a38
Summary:
Implement a new op ElementwiseLinear.
Given inputs X of size (N x D), a of size D and b of size D,
the op computes Y of size (N x D) where Y_{nd} = X_{nd} * a_d + b_d.
Typically, this op is followed by SigmoidCrossEntropyWithLogits op for multi-label classification problem.
Differential Revision: D4892220
fbshipit-source-id: 77bffc5fbe03d48b3d83ab785f7c24a71c952aec
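A numpy sketch of the computation for reference (not the op's implementation):
```
import numpy as np

N, D = 4, 3
X = np.random.randn(N, D)
a = np.random.randn(D)
b = np.random.randn(D)

# Y_{nd} = X_{nd} * a_d + b_d, via broadcasting over the first axis.
Y = X * a + b
assert Y.shape == (N, D)
```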
Summary: This diff allows to export a model partially, filtering layers by tags.
Reviewed By: kittipatv
Differential Revision: D4885610
fbshipit-source-id: 65394c5c9119d57a4d0703aa67ad8e79e4370e3b
Summary: Output peer address on network failures. This change will help in root causing network failures.
Differential Revision: D4899129
fbshipit-source-id: 60a762c6551a726081d5335ab478da8dd7f6dad7
* Fix group-convolution w/o biases on CPU.
Not having this guard will cause a crash further down in the `cat`
function when it uses the first element in the passed list to create a
new tensor. (And even after that, cat doesn't handle nulls well.)
* Added test for groupconv w/o bias on CPU.
Summary:
Based on a discussion with Yangqing, optionally disables the calculation of dX for a convolution op (e.g. conv1 in AlexNet), where the data gradient is not needed.
Closes https://github.com/caffe2/caffe2/pull/242
Differential Revision: D4844013
Pulled By: bwasti
fbshipit-source-id: 202d2410ed6c66671e83e8e49a1383883c6ab29e
Summary:
1. Adds a function to return auxiliary parameters for each optimizer. This function can be used to serialize the optimizers so that they can be recovered.
2. Fixes the bug that the iteration blob is not incremented by one in each iteration. Suppose there are k parameters using the Adam learning rate optimizer; with the original implementation, the iteration blob was incremented by k per iteration.
Reviewed By: azzolini
Differential Revision: D4872397
fbshipit-source-id: d86711feedda2ba83af5f2a18141b06a6a473733
Summary:
A CPU implementation of the unmask operator in caffe2.
There's also a small bug in the mask operator; fix it as well.
Reviewed By: ender-wieczorek
Differential Revision: D4896351
fbshipit-source-id: 887d1beb66fe93ea2da1c4e165fce2e026907726
* updated ubuntu instructions
* updated ubuntu notes and troubleshooting
* updated tutorials using local files
* added doxygen python blocks for docs generation
* doxygen related files for generating docs
* removing Mac and Windows build status while those are in beta
* inference lookup is local now
Summary:
The halving/doubling algorithm is faster than both ring and chunked
ring up to 5M elements, but it only works with a power-of-two number of
contexts right now. So use it whenever the context size is a power of two.
Differential Revision: D4890065
fbshipit-source-id: 09ff82b375cbd3d0626e0255dcf9b9f4873fff54
Summary:
Newly trained models pass kernels=2*[kernel]; old inference code
will not work with them because the (kernels) argument isn't supported there and
we are no longer passing kernel.
Reviewed By: salexspb
Differential Revision: D4888795
fbshipit-source-id: 1649b073c4e1da1d59da9cea581b4dcab6dbaf5c
Summary:
This is the hardware limit set by NVidia. Basically, on Amazon P2 machines that
have 16 gpus, the previous setting will trigger an error. This fixes the issue
but is pending verification from Amazon.
Differential Revision: D4888402
fbshipit-source-id: 8d26a24d6e0476f895b9afdb979144eb8e6b9321
Summary: Memonger's inference optimization is very efficient, but does not work if a multi-threaded DAG net is used. So I added this alternative that shares code with the gradient memonger and does the blob recycling by traversing the DAG and ensuring that blobs do not pass parallel branches.
Reviewed By: viswanathgs
Differential Revision: D4884303
fbshipit-source-id: dfd0a6ecdb91f4edbb0b743729c92f4cd015602e
Summary:
This allows us to do in-place relu and also corrects the previous error of
inconsistency between the cudnn impl and the non-cudnn impl.
This implementation butchers the cudnn interface, in the sense that we pass
in the output instead of the input for the gradient pass. We do have a
gradient checker to guard this situation, so we should be safe.
Reviewed By: asaadaldien
Differential Revision: D4889426
fbshipit-source-id: 081f8fe06de78413b5786086bfd5ae6c8128cd6e
Summary: Add an option to bias the forget gate one way or another by adding in some float value before the sigmoid is applied.
Differential Revision: D4880712
fbshipit-source-id: 1306a97c29fb31630838b2f96597a46e952d940a
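In equation form, this is the usual LSTM forget gate with an extra constant added before the sigmoid (the symbol beta below is mine, not from the diff):
```
f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f + \beta\right)
```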
Summary: This allows to check what's the real cost of each PS request for each parameter, and hopefully will allow to improve the sharding logic.
Reviewed By: dzhulgakov
Differential Revision: D4799210
fbshipit-source-id: d18effc671f3f7a611e535e09bde360ef0102a33
Summary: This diff enables sparse gradient synchronization between GPUs. The test case is now a bit too convoluted, but once D4871680 is landed, we can simplify it a bit.
Reviewed By: dzhulgakov
Differential Revision: D4877087
fbshipit-source-id: 37bbb07051cbaf3a6e3c54b0eead97f3e02337d5
Summary:
CopyCPUToGPU and CopyGPUToCPU need to handle gradients that arrive in sparse form. Added a unit test and fixed the gradient makers to create copies for both values and indices.
This becomes less important once the GPU sparse parameter update ops land, but it is nevertheless good to fix.
Reviewed By: dzhulgakov
Differential Revision: D4882327
fbshipit-source-id: aafd2df46b3e1bcb30b52b1edf40fad8271f1f88
Summary: Device reduce is more efficient for large buffer sizes. For smaller buffers, host reduce may be more efficient in some cases and frees up the GPU for other work.
Reviewed By: andrewwdye
Differential Revision: D4885855
fbshipit-source-id: 7dc522e8c93e1a94427730aca6af03b7e93e660d
Summary: Perform gather on the whole record. This will be used for negative random sampling.
Reviewed By: kennyhorror
Differential Revision: D4882430
fbshipit-source-id: 19e20f7307064755dc4140afb5ba47a699260289
Summary:
These GPU paths are probably even buggier than the CPU paths for sparse gradients with duplicate indices. Both paths cause multiple momentum updates in a single iteration, but only the GPU path is non-deterministic. Depending on how we decide to address the issues on the CPU path, pooyadavoodi has a good idea for how to match dense behavior with the sparse GPU ops.
Closes https://github.com/caffe2/caffe2/pull/254
Reviewed By: bwasti
Differential Revision: D4871680
Pulled By: dzhulgakov
fbshipit-source-id: 220be57a0f699a22ea85ed4f7022d92d362d06b3
Summary:
Instantiate nccl type templates for gloo (minus half).
half requires at a minimum ifdefing CUDA_HAS_HALF and likely requires
more work given that operators aren't defined on it, so skipping it
for now.
Reviewed By: pietern
Differential Revision: D4876217
fbshipit-source-id: 833d2aec12789cbaf9e0a201b979a420fbe6732f
Summary: Added a field caller_ to caffe2::EnforceNotMet and modified the operator Run() exception handler to add the input/output name of the blob being accessed to the error message. Note that this cannot distinguish the case when a blob occurs in both input and output, but I believe this is still helpful.
Reviewed By: salexspb
Differential Revision: D4863982
fbshipit-source-id: f6a872fb07f8957dc2d3366d9f106fa81bffbd72
Summary: making the name a bit clearer
Reviewed By: xianjiec
Differential Revision: D4866940
fbshipit-source-id: 3e0f7067a9d3ba89cb038d85c1991e541f1e439c
Summary: Added a pipelined version of cuda halving/doubling algorithm. Half the buffer is reduced prior to first send and the other half prior to reducing the result from first receive. Broadcasts are started asynchronously as soon as each new message is received. New code was added as a new algorithm, as pipelining makes performance worse for small buffer sizes.
Reviewed By: pietern
Differential Revision: D4847109
fbshipit-source-id: 5aa55de95f8c94069380af7396f2b5b6297dcbea
Summary:
A few fixes in this commit: the epoch size is now rounded
down to the closest integer multiple of the global batch size (batch
per GPU * GPUs per hosts * hosts per run). The num_shards and shard_id
parameters are now passed to CreateDB so multiple processes actually
train on different subsets of data. The LR step size is scaled by the
number of hosts in the run. The test accuracy is only determined after
each epoch instead of after every so many iterations.
Differential Revision: D4871505
fbshipit-source-id: d2703dc7cf1e0f76710d9d7c09cd362a42fe0598
Summary:
Length-aware gather operator. This will be used for random negative sampling. See the task for details.
This should be equivalent to:
LengthsToRange + Gather + Reshape + GatherRanges
That's pretty complicated.
Differential Revision: D4846023
fbshipit-source-id: 8d9b7ff3eddc75a7ab147cd1c2a12f377652df93
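My guess at the semantics in numpy form, based only on the description above (the function name and argument layout are assumptions, not the operator's contract):
```
import numpy as np

def lengths_gather(items, lengths, indices):
    """Gather whole segments of `items`; segment i spans lengths[i] elements."""
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    picked = [items[offsets[i]:offsets[i] + lengths[i]] for i in indices]
    return np.concatenate(picked), np.asarray([lengths[i] for i in indices])

items = np.array([1, 2, 3, 4, 5, 6])
lengths = np.array([2, 3, 1])                   # segments: [1,2], [3,4,5], [6]
print(lengths_gather(items, lengths, [2, 0]))   # ([6, 1, 2], [1, 2])
```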
Summary:
prof_dag in step net is not supported
(Note: this ignores all push blocking failures!)
Differential Revision: D4876551
fbshipit-source-id: 4003e60908e51ef052f8656bf527b326676c298c
Summary: To help dgponinath, and people in general: check that params don't have duplicate entries.
Differential Revision: D4872132
fbshipit-source-id: 1cca1237fda771eb270227f452ecae0f912d7a33
Summary: Add Algebra and train helpers and proxy them to CNNMH
Reviewed By: salexspb
Differential Revision: D4855040
fbshipit-source-id: d948ea913f674a6e47c4b72629a2d33253cb3130
Summary:
Fix an issue that amyzhang encountered. She was using ConstantFill to create a blob of the same size as another blob. This interrupted the gradient computation flow through the ConstantFill, since the gradient for the input blob was set to None (although it already had another gradient set). The correct solution is to avoid overwriting gradient assignments with None if they already have a gradient, UNLESS that blob is an output of the same op, as with the StopGradient op. (Note that Amy's problem was fixed by instead using a fixed-shape ConstantFill and Add with broadcast=1, which is a better solution anyway.)
Not sure if I explained this well, but see the new unit tests. Before this change, the testAddAndDynamicConstant failed but the testAddAndStaticConstant succeeded.
Reviewed By: dzhulgakov
Differential Revision: D4861176
fbshipit-source-id: 3b53621bfaba2e36786a5e4664145038995f6616
Summary:
To evaluate on checkpoints, we often need to load from multiple checkpoints.
However, it is inconvenient if we always need to check the existence of
a checkpoint manually. Adds interfaces to check the existence of a DB
so that we can find available checkpoints automatically.
Reviewed By: azzolini
Differential Revision: D4823876
fbshipit-source-id: e5a65b736ac2addd0447c4add81dbd0986f422e7
Summary:
This diff adds an option to recurrent_net to define some cell blobs to be recomputed on the backward step, so that they don't need to be stored in the step workspace. This is done by modifying the backward step to automatically include all operators that are needed to produce the output that is to be recomputed, and by storing those blobs in a shared workspace. To enable the shared workspace, I had to modify the stepworkspaces blob to also store a forward shared workspace. Making it a class field won't work since the lifecycle of the blob does not match the lifecycle of the operator.
For basic LSTM, the performance hit is quite modest (about 15% with one setting, but your mileage may vary). For attention models, I am sure this is beneficial, as computing the attention blobs is not expensive.
For basic LSTM, the memory saving is wonderful: each forward workspace only has 4 bytes (for timestep).
I also modified the neural_mt LSTM Cells, but there is no test available, so I am not 100% sure I did it correctly. Please have a look.
Added options to LSTM, MILSTM and LSTMAttention to enable memory mode.
Reviewed By: urikz
Differential Revision: D4853890
fbshipit-source-id: d8d0e0e75a5330d174fbfa39b96d8e4e8c446baa
Summary:
The basic idea of bucket-based calibration:
1. given a model and a calibration data set
2. apply the model to the calibration data set and sort the prediction scores
3. bucketize the prediction scores
4. for the samples in each bucket, compute the proportion of positive samples
5. build a set of piecewise linear functions that map from the bucket range to the proportion
6. append a piecewise linear transform operator to the prediction net, which is supposed to calibrate the raw predictions
7. to support calibration in realtime training, we create a new type of Net -- bucket calibration net. This needs a new Context to add_calibration_ops(), to export and load the new Net.
This includes a series of diffs.
This diff implements a layer that adds different operators for train/cali/eval for bucket based calibration.
Reviewed By: dragonxlwang
Differential Revision: D4817119
fbshipit-source-id: 44f8fcad2a94f40f7439cc1ad47e7bae5e17397d
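A tiny numpy illustration of steps 2-4 above (sort the scores, bucketize them, and compute the positive proportion per bucket); it sketches the idea only and is not the layer's code:
```
import numpy as np

def bucket_positive_rates(scores, labels, num_buckets=4):
    """Split sorted predictions into equal-sized buckets and return
    (bucket upper bound, fraction of positives) per bucket."""
    order = np.argsort(scores)
    score_chunks = np.array_split(scores[order], num_buckets)
    label_chunks = np.array_split(labels[order], num_buckets)
    return [(s.max(), lab.mean()) for s, lab in zip(score_chunks, label_chunks)]

scores = np.random.rand(1000)
labels = (np.random.rand(1000) < scores).astype(float)  # synthetic, roughly calibrated data
for upper, pos_rate in bucket_positive_rates(scores, labels):
    print("bucket <= %.2f -> positive rate %.2f" % (upper, pos_rate))
```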
Summary:
Adds support for fp16 to main cuDNN ops (conv, relu, pool, BatchNorm).
Done via runtime dispatch, not using DispatchHelper at this point, to allow for more complex dispatch logic in the future if necessary. Using a separate template parameter for each input / output type is for the same reason: it's easier to add the functionality now and never use it than to need to add it later.
Closes https://github.com/caffe2/caffe2/pull/241
Differential Revision: D4831264
Pulled By: asaadaldien
fbshipit-source-id: ad2ffdb13c031d8eb20552ffbf83c05c278252f7
Summary:
add necessary ops for feature processing
* logit op
* replace nan
* batch one hot op
Reviewed By: kittipatv
Differential Revision: D4840869
fbshipit-source-id: 197123ea5608d54f0b5ac7899973a077a6a86775
Summary: Having the utils directory broke the open source build :(. Removing its contents, as this utility is not really needed.
Differential Revision: D4866228
fbshipit-source-id: 1eae4580ebac5b60e52e2e8553e0ffd919152228
Summary:
Added SumSqrElements, since with it we can avoid the large temporary blob that is needed when doing Sqr + SumElements.
Also moved it to reduction_ops, because utility_ops has grown too big.
Reviewed By: jamesr66a
Differential Revision: D4844172
fbshipit-source-id: 032eec45e24d6724f0d5fb83f4ec1c771d1146e5
Summary:
The code already asserted, but only on the reply type, so it didn't
include the actual error message. This makes debugging much
easier when people have problems running the benchmark suite.
Differential Revision: D4860022
fbshipit-source-id: 659bc461a724603375bff18eac90eca658492b05
Summary: This is cheaper than doing getaddrinfo for every pair.
Reviewed By: andrewwdye
Differential Revision: D4850102
fbshipit-source-id: e77f468f099f63860b52fdd0dcc57a8a7a91a448
Summary:
Part of this change is to perform a getaddrinfo in the TCP device
class so we can figure out the interface and subsequently PCI bus ID
of the NIC used for its traffic. This information can be used in a
later diff to avoid doing getaddrinfo calls in the TCP pairs and have
them reuse the information that is resolved by the device.
The PCI bus ID can be used to compute distance between NICs and GPUs
and make informed decisions on where to allocate scratch buffers.
Reviewed By: andrewwdye
Differential Revision: D4850035
fbshipit-source-id: 575e401a9273300bc720c814fef8971846ec748c
* Add IndexLinear
* Fixes to IndexLinear
- Fix IndexLinear test
- make it better for multithreaded case
- fix a glitch in the C code
- improve the reset() method
- fix the weight allocation.
- remove "fakeBatch" possibility as it's not used
- clamp normalized values at evaluation time instead of just dividing by max.
- add assert on the keys/values dimensions in IndexLinear.
- invert order of weightDecay in the case of output dim > 1.
* Changes required to support IndexLinear in CUDA
* Adding support for flattened inputs for IndexLinear
* Doc for IndexLinear + fix for when the input format changes from one batch to another.
* Cleaning up IndexLinear documentation
* Changes required to build with latest torch
* Adding benchmark script for IndexLinear
* Bugfixes and cleanup of IndexLinear.lua
- Fixed bug that occurs when performing multiple accGradParams +
updateParams
- All the data required for the updates is put in a single table
- Added :parameters method
Summary:
The PiecewiseLinearTransformOp passes the transform parameters (bounds, slopes, intercepts) via operator arg. This diff supports to pass these parameters through input blobs.
The purpose is to allow us to create a model calibration net that can be exported when saving model.
Reviewed By: dragonxlwang
Differential Revision: D4777086
fbshipit-source-id: 0d157154860f61ec6ecfab95aea80beed54aa5c6
Summary: This is like LengthsToSegmentIds + Gather w/o the intermediate segment IDs blob. I only realized that after I wrote the whole thing. That combination is not obvious, so just check this in?
Reviewed By: xianjiec
Differential Revision: D4847591
fbshipit-source-id: a1c480f16b317763866af13c83b3aaaeb6a60751
Summary: As said in the title. This should save a lot of memory if using both train and test workflows.
Reviewed By: jhcross
Differential Revision: D4855436
fbshipit-source-id: 9eeca548eee118e07bd587c46f40e7beb138318e
Summary: Instead of reporting the total number of elements of a tensor, report the number of bytes. Report the capacity of the tensor, though, not the current number of bytes.
Reviewed By: jamesr66a, salexspb
Differential Revision: D4851633
fbshipit-source-id: 464d552f41f1b5f25753b0e7001d299b6dac1966
* added dataset downloader from s3 func; leveldb creator func; refactored to use both of these
* working version for squeezenet only
* using fb.me link for mnist dataset
* ubuntu installation instructions for v0.6.0
* removing non-functional tutorials
* updated model download info
* model download updates
* new tutorial
* bump version to v0.6.1
* tutorial helper functions
Summary:
1. CPU/GPU implementation of SumReduceLikeOp.
[SRLOp](matrix A, matrix B) -> C
where C has the same shape as B, and its values are the reduce-sums of the corresponding elements of A.
2. Make SumReduceLikeOp (part of) the gradient of Add/Mul/Sub and provide unittests
===Update for Translation Team===
3. Passed Tests:
$ buck test caffe2/caffe2/python/operator_test:recurrent_network_test
$ buck test fblearner/flow/tests/langtech/translation/neural_mt:seq2seq_model_caffe2
$ buck test fblearner/flow/tests/langtech/translation/neural_mt:seq2seq_ensemble_beam_model_caffe2
Reviewed By: Yangqing
Differential Revision: D4711302
fbshipit-source-id: 0865abde871b3046b367599731593dae03f0775a
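For one simple broadcast shape, the reduce-sum semantics in numpy (illustration only):
```
import numpy as np

A = np.random.randn(5, 3)   # e.g. the gradient of a broadcast Add output
B = np.random.randn(3)      # the operand the gradient must be reduced back to

# C has B's shape; each entry is the sum of the A entries that broadcast against it.
C = A.sum(axis=0)
assert C.shape == B.shape
```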
Summary: Put the size of the input tensor vector into the output blob
Reviewed By: xianjiec
Differential Revision: D4849556
fbshipit-source-id: 0929319e1705b027874d41a90a9159b335d93545
Summary: The check for the old model style seems wrong. It fails with a model I tried to run.
Differential Revision: D4847970
fbshipit-source-id: f28c5bb635c5e8b4dcfcc5c52a434d91a89217e8
Summary:
This fixes some bugs in the downloader. TODO: fix the URL
Closes https://github.com/caffe2/caffe2/pull/255
Reviewed By: Yangqing
Differential Revision: D4851555
Pulled By: bwasti
fbshipit-source-id: 56d01617ccaddcd40b0fb8e4be137cb4c7a52e91
Summary:
Added a DP + recursion algorithm for finding blob assignments based on blob sizes. This algorithm gives optimal assignments. See comments for details.
The algorithm is not used by default; set algo=memonger.AssignmentAlgorithm.DYNAMIC_PROGRAMMING and provide blob_sizes in optimize_interference() to use it. The blob sizes can be retrieved by running the net once and then calling blob_sizes = memonger.collect_blob_sizes(net). All blob sizes are assumed to be 1 if blob_sizes is not provided; in that case, using algo=memonger.AssignmentAlgorithm.GREEDY may be better.
Testing on the segmentation model, the memory usage is reduced by 19% (14.96MB to 12.08MB) compared to using the greedy algorithm (without considering the conv share buffer). The algorithm runs in 15s for the model with 55 sharable blobs.
Reviewed By: ajtulloch
Differential Revision: D4818476
fbshipit-source-id: 606936f4cf2715408d60b9a5cf3bcaf1985a0fec
Summary:
Added the Caffe2 cmd line option --caffe2_print_blob_sizes_at_exit=1, which, when enabled, will print all tensor sizes in the workspace destructor. Handy especially when using sub-workspaces like with RNNs. Note that the sizes are numbers of elements, not bytes. Output is designed to be easily excel-copypasteable.
TODO: add sorting
Reviewed By: jamesr66a
Differential Revision: D4844628
fbshipit-source-id: 11608a1710ae5c89bbd741edb506d25496606185
Summary:
This is not a super-elegant but a working solution to fix the Newsfeed team's problem of extracting a predictor net from a net that has a "side chain" they want to cut from the middle.
This adds an argument to ExtractPredictorModel that allows one to define "disabled inputs". These are inputs that we want to switch off, so that all operators depending on such an input are removed from the model.
Differential Revision: D4839953
fbshipit-source-id: 5d16df6f0ec4aac6670e6917efc77abde5d75c95
Summary:
Forgot to include these in a previous commit.
Closes https://github.com/facebookincubator/gloo/pull/23
Differential Revision: D4847072
Pulled By: pietern
fbshipit-source-id: 08aa9e8fa47377eb8c7747bd577eec7e615789f1
Summary: Add CAFFE_ENFORCE to make sure the protobuf parsing is successful.
Reviewed By: salexspb
Differential Revision: D4843662
fbshipit-source-id: 20cab7180e6b0e5afb5e29ff3333591659e41f7a
Summary:
With this we can compute the best GPU device to reduce on. It is not
always the one CUDA indicates as GPU 0.
Reviewed By: andrewwdye
Differential Revision: D4845581
fbshipit-source-id: 13e0500f54fd507899646f781a97c09abcd3b056
Summary: When only_loss=True is enabled, the softmax output buffer is shared with the gradient buffer (which is of same size). Added tests for this. Only for GPU version for now.
Reviewed By: salexspb
Differential Revision: D4843991
fbshipit-source-id: 834d2a1b357d784e4d64efe484f893442201ad6a
Summary: Used blob sizes for finding assignments in a greedy way.
Reviewed By: ajtulloch
Differential Revision: D4818159
fbshipit-source-id: 89180a6117ba5be058e1d2f9488b06d618e91917
Summary:
Added an ordering function (topological_sort_traversal_longest_path()) to reduce the live spans of computed blobs. The idea is to sort the ops based on the length of the execution path, so that ops on longer paths are executed first.
Tested on the segmentation model with on-the-fly decoder and reduced memory usage from 21.7MB to 14MB (original size is 33MB with compressed parameters and without considering the conv buffer), compared to using topological_sort_traversal() as the ordering function.
It is a general ordering function so I put it in memonger.py directly.
Reviewed By: ajtulloch
Differential Revision: D4790135
fbshipit-source-id: e661b45c1640de44ce1a9fdd009a4fba38f8e042
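A rough Python sketch of the longest-path idea (not the memonger implementation): order ops by the length of the chain still hanging off them, which is also a valid topological order because a producer's longest path is strictly longer than any of its consumers'.
```
from collections import defaultdict

def longest_path_order(num_ops, edges):
    """Order a DAG of ops so that ops on longer downstream paths run first.
    `edges` is a list of (producer, consumer) pairs."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    depth = {}
    def longest_from(op):
        if op not in depth:
            depth[op] = 1 + max((longest_from(c) for c in children[op]), default=0)
        return depth[op]

    return sorted(range(num_ops), key=longest_from, reverse=True)

# A fork: op 0 feeds a long chain (1 -> 2 -> 3) and a short branch (4).
print(longest_path_order(5, [(0, 1), (1, 2), (2, 3), (0, 4)]))
# [0, 1, 2, 3, 4]: the long branch is scheduled before the short one.
```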
Summary:
This makes it easier to capture, compare, contrast results with
different parameters.
Reviewed By: andrewwdye
Differential Revision: D4843715
fbshipit-source-id: ba6916dcd5f8bcc615d6edce1a54657241357c31
Summary:
Instead of having every CudaDevicePointer "own" a stream, this change
moves to using CudaStream as first class object. It was pretty clunky
to use the copy{To,From}* functions on the CUDA pointer classes to
copy stuff around. For example it was not clear whether the stream
belonging to the source or destination was used to execute the copy
on. There is no longer such ambiguity after this change.
To make this work the CudaBroadcastOneToAll algorithm was changed to
include the workspace template argument, but only has the
CudaHostWorkspace implementation. The CudaDeviceWorkspace
implementation is left to be done for another change (that's not the
purpose of this change).
Reviewed By: andrewwdye
Differential Revision: D4841615
fbshipit-source-id: d0c1b9ba948ff6167832515afa7bdd2b32b48064
Summary: This is moving predictor exporter's code to open-source.
Differential Revision: D4815409
fbshipit-source-id: ce1508a2b6b973c91b0420928d2b4c3953f26e6c
Summary: Make timeout a device attribute. Now the pair will configure timeout when connecting based on device timeout settings, instead of needing to be set explicitly on each pair. Set default tcp timeout to 30 sec.
Reviewed By: pietern
Differential Revision: D4838918
fbshipit-source-id: e6e6ee36c662eb5e7ba5354c904e50f9dcac258f
Summary: cuda_allreduce_halving_doubling was not properly handling the case where buffers are allocated in GPU memory, trying to reduce and copy from them as if they were in system memory.
Reviewed By: pietern
Differential Revision: D4840259
fbshipit-source-id: 2615360cd2f1d9c7a37fb0bcdf33ff35528b2c75
Summary:
Removes the need for all the Copy calls; in one of our apps this reduced time from ~40ms to < 200us.
Closes https://github.com/caffe2/caffe2/pull/250
Differential Revision: D4828825
Pulled By: pietern
fbshipit-source-id: 656bd0edc4ffbaa3f89ccbe045e28a7aae49ceab
Summary: Softmax was not in the model helper, so added it there so we can set the CUDNN engine, as it is the preferred version.
Reviewed By: asaadaldien
Differential Revision: D4835624
fbshipit-source-id: 7f0c84b7a73653119901795782709a6a617345c5
Summary:
Quite large diff to make cuDNN LSTM and our LSTM produce same results and provide python API for the cuDNN LSTM.
* Added operators RecurrentParamGet and RecurrentParamSet to access weights and biases for the different gates, input/recurrent.
* Removed RecurrentInit as not needed
* recurrent.cudnn_LSTM() returns a special net and mapping that can be used to retrieve the parameters from the LSTM
* recurrent.cudnn_LSTM() can be passed blobs that have the parameters for the individual gate weights and biases
* recurrent.InitFromLSTMParams() can be used to initialize our own LSTM from CUDNN params. This way we can test if cuDNN and our own implementation produce the same result.
recurrent_test.py tests for the equivalency
Reviewed By: salexspb
Differential Revision: D4654988
fbshipit-source-id: 6c1547d873cadcf33e03b0e0110248f0a7ab8cb0
Summary: Added the support of axis for cudnn version of softmax + added cudnn tests to the softmax_ops_test
Reviewed By: urikz
Differential Revision: D4835409
fbshipit-source-id: 9150b969237e38daebff961fee3c36759f834ac4
Summary: NanCheck is an in-place operator for GPU that checks the input for any NaN or inf values. The operator fails and prints diagnostic information (input tensor dims and values) if it detects these erroneous values. This should help us to narrow down our numerical instability issues in the NMT models, and it might help others as well.
Differential Revision: D4818141
fbshipit-source-id: e5aa9762089c58ce160270446007c7a91a7a85e5
Summary:
Clarify that Redis Cluster is not supported. Also see #21.
Closes https://github.com/facebookincubator/gloo/pull/22
Differential Revision: D4837375
Pulled By: pietern
fbshipit-source-id: 6e3575b3b8dae6ca62beb765da15d8506da4abdb
Summary: Basic port of the CPU halving/doubling algorithm. No pipelining is done between reduce/broadcast and communication.
Reviewed By: pietern
Differential Revision: D4823693
fbshipit-source-id: b18045d64edf90361bf7713f4ccb2e074757780f
Summary:
Following jamesr66a's brilliant observation, this diff fixes the non-CUDNN versions of Softmax. The op did not take into account that blocks can run in parallel, and thus could overwrite each other's values, particularly the "row max" that is important for numerical stability.
So in this diff:
1) SoftmaxOp now shares all the code with SoftmaxWithLoss, which had a better implementation.
+ Strengthened the test case and renamed the file.
Reviewed By: jamesr66a
Differential Revision: D4832929
fbshipit-source-id: 4a1bfa2106ceb65ec75f5b868323ee1e7a3457fb
Summary:
This diff enables support of recurrent networks for memonger:
1. Memonger descends into the step-nets and renames the blobs accordingly
2. Memonger tells the gradient op about the renamed blobs by adding a parameter "paramname.renamed=<new name>"
3. RecurrentNetworkGradientOp applies remapping to links and gradient blobs.
I first thought of refactoring the whole gradient blob management of the recurrent network, but that looks to be very hard without a major revise of the code.
Note, I did not enable memonger for neural_mt, since I think the team should do more testing before enabling this.
Reviewed By: salexspb
Differential Revision: D4812823
fbshipit-source-id: 1ffdf3cfb4fcd00eec5bb0ece3bf416aa6d3e26b
Summary:
Description.
We kinda have our hands tied here: we can't reference context_gpu since it needs to run under the _gpu TARGET to pick up the correct headers, and we can't change the interface of deserialize blob to return the size since not all blobs are tensors.
If this works then let's ship it.
Reviewed By: urikz
Differential Revision: D4826034
fbshipit-source-id: 631ba56386ccb91d9b19d780a3e012d0ceea2422
Summary:
Required for D4821763
Based on targets from https://fb.facebook.com/groups/fbcode/permalink/1304073246296178/ (I also excluded those targets which do not depend on folly:singleton).
Reviewed By: meyering
Differential Revision: D4832492
fbshipit-source-id: fcb4ce42e9e5359d4752769f77d7271e550201fe
Summary: The caffe2 implementation of bare Softmax() has a race condition that wipes out the numerical stability trick. Use the CUDNN implementation instead
Reviewed By: urikz
Differential Revision: D4831298
fbshipit-source-id: d11b1de700e3954629e7ed43225a2416c27b3252
Summary:
Two new features for RecurrentNetwork:
1. Ability to specify longer (for a few steps) initial state
2. Ability to link more than one step of external blob to internal one.
Some motivation for these changes is provided in the unit test
Reviewed By: salexspb
Differential Revision: D4816230
fbshipit-source-id: 5ae6fed53b3b08a6ce4547ff1d0cb773dab42af0
Summary: Refactor AllgatherRing algorithm to remove all memcpy in the communication rounds by using outPtrs as send/receive buffer + remote buffer offset.
Reviewed By: pietern
Differential Revision: D4793186
fbshipit-source-id: 645d0758d246fd0b493e3fe312a8441d86f6d169
Summary: To make the predictor open source, move the constants that are generated from Thrift to Protobuf.
Reviewed By: salexspb
Differential Revision: D4656884
fbshipit-source-id: d4dbb3416e8396185e0981fcd9a090fbb054a18a
Summary:
Actually adds stuff on duplicated indices. I didn't use UnorderedSegmentSum because it'd need more modifications for figuring out the first dimension, and I don't want to make that function more complex than it already is :)
We theoretically can have a version that does CopyItems and fails on duplicate indices as a fallback. But I haven't implemented it yet as it wouldn't be that useful for now.
Also fixes the hypothesis test - doing rand() inside the test body is not cool, as it makes hypothesis run forever.
Differential Revision: D4814574
fbshipit-source-id: 1851ec5f5df8fc4bf4844585076b8af23a06b0b2
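A generic numpy illustration of why duplicated indices need accumulation rather than plain assignment (not the diff's code):
```
import numpy as np

param = np.zeros(4)
indices = np.array([1, 1, 3])          # index 1 appears twice
updates = np.array([1.0, 2.0, 5.0])

wrong = param.copy()
wrong[indices] = updates               # the second write to slot 1 silently wins
print(wrong)                           # [0. 2. 0. 5.]

right = param.copy()
np.add.at(right, indices, updates)     # accumulate the duplicates instead
print(right)                           # [0. 3. 0. 5.]
```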
Summary:
Combines the top level common.h with algorithm.h. With algorithm.h in
the common package, CUDA algorithms only need a dependency on that
package. CudaBroadcastOneToAll still depended on broadcast.h so this
change also removes that dependency and has it subclass the Algorithm
class.
Reviewed By: andrewwdye
Differential Revision: D4826885
fbshipit-source-id: 930037e39f7a2c941868e53f0bbc54e3f2e0b184
Summary:
GPUDirect support for CudaAllreduceRingChunked by adding a workspace
template parameter and adding workspace specific init functions.
To support this change the CUDA LocalOp classes had to be changed a
bit to take an extra destination/source pointer. This allows reduction
of 1-N pointers into a target pointer, where the target may live on
device or live on host. If it lives on the host, the NCCL operation
that executes the reduction is followed by a D-to-H memory copy. If
there is only a single input pointer, no reduction needs to happen and
the class just executes the D-to-H memory copy. The net result is that
we can interchangeably use device or host pointers as the target for
reduction or the source for broadcast, and these LocalOps do what you would
expect them to do.
Reviewed By: andrewwdye
Differential Revision: D4825236
fbshipit-source-id: 048ec6cbc5a0500bafbe1b3f6abe1e2e5f3a2675
Summary: Fixes for handling errors and timeouts in blocking and polling sync paths. Add test coverage for errors and timeouts.
Reviewed By: pietern
Differential Revision: D4823498
fbshipit-source-id: 93721947a6404ca9cea6a4869f4156f8d270a981
Summary:
Any number of elements below this always fits in a single packet
and will yield ~identical results.
Differential Revision: D4825190
fbshipit-source-id: 71ac77456049e991da5059d5a029c5e9d2a67ed7
Summary: The PadImage op supports cropping along the H/W dimensions by using negative pads; but currently passing negative values for pad attributes throws an error in ConvPoolOpBase, which PadImage inherits from. Modify ConvPoolOpBase to accept negative pad values for non-conv, non-pool ops. Also add a python operator test for cropping
Reviewed By: ajtulloch
Differential Revision: D4817118
fbshipit-source-id: 5ea5203e8072cc34fe14938e534b157d0ad55f6b
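A numpy sketch of the intended semantics, assuming NCHW layout: a negative pad on the H/W dimensions acts as a crop.
```
import numpy as np

x = np.arange(2 * 1 * 5 * 5, dtype=np.float32).reshape(2, 1, 5, 5)  # NCHW

# A pad of -1 on every side of H and W behaves like a center crop.
pad = 1
cropped = x[:, :, pad:-pad, pad:-pad]
print(cropped.shape)  # (2, 1, 3, 3)
```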
Summary:
The existing CudaAllreduceRing with a CudaDeviceWorkspace
template parameter now has the same effect.
Reviewed By: andrewwdye
Differential Revision: D4823393
fbshipit-source-id: 88fe497a983b26a281a3a74fe3bdc02c0c87c523
Summary: Somehow, feed-non-ranking training data usually have this type of column. Add option to support it.
Reviewed By: xianjiec, kennyhorror
Differential Revision: D4773960
fbshipit-source-id: 5a7ef4618a070e04f3cd8ddfcbf2b7441c00d92d
Summary:
Implement a file store for multi-process transport failure testing. Add test cases to spawn multi-process tcp communication, and verify that all processes throw the expected IoException.
A future diff will add coverage for connectivity failures, sync modes, and ibverbs.
Reviewed By: pietern
Differential Revision: D4807794
fbshipit-source-id: 35212719d46e6d875eacb341fae25681f39053bc
Summary:
Allreduce using recursive halving and doubling algorithm. Algorithm is described in http://www.mcs.anl.gov/~thakur/papers/ijhpca-coll.pdf (see top diagram on page 12). Algorithm consists of 2 lg P stages, the first log P performing a reduce-scatter and the second log P the allgather. Message size is variable across steps. The early stages of the reduce-scatter and the late stages of allgather send the largest messages. The communication is structured such that the largest messages are sent between nearby ranks, which could be useful if elements are ranked in locality-aware fashion.
So far this supports only a power-of-two number of processing elements.
I have attempted to minimize the amount of synchronization/hand-shaking. Messages are received at different offsets of the output buffer for each communication step. Send offsets in the reduce-scatter steps become receive offsets in the allgather and vice versa. The reuse of buffers across reduce-scatter and allgather steps requires synchronization. Right now the algorithm is inefficient in terms of memory use, requiring 3x memory. This can be reduced, but would require additional synchronization.
Reviewed By: pietern
Differential Revision: D4795878
fbshipit-source-id: fcc6597ef6a99cd102fce2b8e4562d93088d39dc
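A toy Python sketch of the communication schedule for a power-of-two number of ranks (peer selection only; buffers, offsets, and synchronization are omitted, and this is not the actual gloo code):
```
import math

def halving_doubling_peers(rank, world_size):
    """Peers for the reduce-scatter steps; the allgather replays them in reverse."""
    steps = int(math.log2(world_size))
    return [rank ^ (1 << step) for step in range(steps)]

# With 8 ranks, rank 3 exchanges with ranks 2, 1, and then 7.
print(halving_doubling_peers(3, 8))   # [2, 1, 7]
```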
Summary:
Uses the cudnnTransformTensor function. It works by shuffling the strides according to the transpose axes. Significant speedup over the current GPU version.
+ moves the transpose test under utility_ops, because hypothesis_test is too big
Reviewed By: jamesr66a
Differential Revision: D4810993
fbshipit-source-id: 82577c4ced1389e70bd5992820ae4d8297a3817f
Summary:
This is just by analogy with GetSingleArgument, which already
has default_value support.
Reviewed By: Yangqing
Differential Revision: D4819789
fbshipit-source-id: cf271d9f345f14f3e373186365726c738c1c26f3
Summary:
Didn't provide enough value now that ReductionFunction and
CudaReductionFunction are no longer related.
Reviewed By: andrewwdye
Differential Revision: D4819295
fbshipit-source-id: e6479769af7f78d486bee7d9c31f049430cdc775
Summary:
To bring the GPUDirect and non-GPUDirect implementations of CUDA aware
algorithms closer together this change introduces CUDA workspaces.
There's an implementation for a host side workspace and a device side
workspace. The former is used for transports that don't support
GPUDirect and the latter for ones that do. CUDA algorithms will take
an extra template parameter for this workspace and this will determine
whether they can be used for GPUDirect or not.
The workspaces only define their respective pointer types right now
but may contain local operation construction functions at a later
point in time.
Reviewed By: andrewwdye
Differential Revision: D4802826
fbshipit-source-id: cb1d71a224ce0165afd07fb9092ad54d3e07c8cf
Summary:
This was found necessary on some CentOS. aaronmarkham
Closes https://github.com/caffe2/caffe2/pull/240
Differential Revision: D4819591
Pulled By: Yangqing
fbshipit-source-id: 40161cd484a2c8d43f26077919ad2762440dde13
Summary:
multiple places broken, blocking the push :(
- fix the weighted training for ads and feeds
- fix the publishing if no exporter model is selected
- fix the feeds retrieval evaluation
- added the default config for retrieval workflows. plan to use for flow test (in next diff)
- clean up not used code
- smaller hash size for faster canary test
Reviewed By: chocjy
Differential Revision: D4817829
fbshipit-source-id: e3d407314268b6487c22b1ee91f158532dda8807
Summary:
This diff does the followings:
1. Add optimization options to model options in the UI for all workflows.
2. Allow different parameters to use different optimizers (or same optimizer with different settings, eg, learning rate).
3. Remove the default values for the `sparseDedupAggregator` field in the thrift file as the default value for that should just be `None` instead of 'sum'.
4. `fb/dper/layer_models/mlp_sparse.py` is deprecated.
5. Add calibration to two tower workflows.
Reviewed By: kittipatv
Differential Revision: D4767004
fbshipit-source-id: de92ea63fb0ff33f8581b1693479b723a68cd2d1
Summary:
- Fixed loading params into ensemble model
- Small fix for beam decoder
Differential Revision: D4807595
fbshipit-source-id: 0187fda7eb469401f1acd8e6108de54ab67ae922
Summary:
The cublasSgemmStridedBatched is only supported by cuda 8+. Luckily we can
always fall back.
https://devblogs.nvidia.com/parallelforall/cublas-strided-batched-matrix-multiply/
aaronmarkham found this in the centos build on the oss side.
Differential Revision: D4808822
fbshipit-source-id: 1657c139b57158e633074e06787c48302e0df142
Summary:
This is an initial (read: unoptimized) implementation of GatherOp on GPU.
Closes https://github.com/caffe2/caffe2/pull/209
Differential Revision: D4809676
Pulled By: Yangqing
fbshipit-source-id: bc36fa02e9964370ca845e9cc13344e5f3dbf176
Summary: Minor fix to the C1 model translator.
Reviewed By: Yangqing
Differential Revision: D4807165
fbshipit-source-id: 0149e2655d2901b23a37e92f61d9dd678cf6ee69
Summary:
This makes ConvTransposeMobileOp inline with other implementations,
allows us to account for these buffers in the workspace, and is generally a good
thing to do.
Differential Revision: D4767431
fbshipit-source-id: b14a96a089136e305ab42680772272f4e5f16f53
Summary:
The initialization phase of each checkpoint object simply loads the names of
the blobs in the checkpoints. When we load from the checkpoints, the names of
the blobs are given. We can skip this init step.
Reviewed By: azzolini
Differential Revision: D4808114
fbshipit-source-id: 4c740049c1014f3e93b4b87f43e3937afdefa25a
Summary:
The weighted LabelCrossEntropyGradientKernel had a clowny loop over D. Since the operation is completely linear, we can just do it all in one parallel loop. Massive speedup: in my benchmark, from 4s to 20ms.
+ added weights to the lstm_benchmark
Reviewed By: jamesr66a
Differential Revision: D4800889
fbshipit-source-id: f9850bcc56ce34d5d7a613419cd172256633a894
Summary:
Add distributed training to dper2 and keep the dper1 working.
* Created a ModelDelegator to wrap ModelHelper and LayerModelHelper to mitigate the difference.
* To get the average length for sparse feature, I extracted some information in feature_processor. There should be some better way to do it after we have new compute_meta.
* metric right now only runs on the first trainer.
* The model is saved correctly for evaluation. But I'm still not sure how to handle the weights for adagrad.
Reviewed By: kennyhorror
Differential Revision: D4767745
fbshipit-source-id: 0559d264827a7fd9327071e8367d1e84a936bea9
Summary:
We did not parallelize over D, which can be very large, especially in RNN models. This speeds up significantly, with my quick test in lstm_benchmark and nvprof, the time of RowMaxKernel dropped from 1.2s total to 0.28s total.
+ addded softmaxwithloss to the lstm_benchmark
Reviewed By: jamesr66a
Differential Revision: D4800629
fbshipit-source-id: 3400ea1064b1eb2793bc403df2c1b68801d545e5
Summary:
The CUDA algorithms all had their own version of local reduction and
broadcast. This commit consolidates them and allows all CUDA
algorithms to work with CudaDevicePointer instances.
Reviewed By: andrewwdye
Differential Revision: D4797968
fbshipit-source-id: cccef39fce01905a2cd757ccbcffd29803411409
Summary:
(Also, exposed the macros that we use during build time via the macros.h header file)
Closes https://github.com/caffe2/caffe2/pull/233
Differential Revision: D4803311
Pulled By: Yangqing
fbshipit-source-id: 9f8ce57692f81f7a8994344846d3c90aa2c7070a
Summary: Verification was sometimes failing for allreduce halving-doubling. Pieter noticed that it is due to the verification step racing with the regular iterations.
Reviewed By: pietern
Differential Revision: D4804558
fbshipit-source-id: f645cb2e332e449a993a634c5bdb42c2dcb8613b
Summary: Instead of issuing batch-size many math::Add calls, added a new function that does a batch of additions. For CPU there is no difference, but for CUDA we do everything in one kernel. I don't think this has a huge performance impact, but at least it makes the CUDA profiling look better with fewer kernel launches.
Reviewed By: jamesr66a
Differential Revision: D4798411
fbshipit-source-id: 44ac65b2da5a615971219809b9298b4e122085cd
Summary: Added SparseMomentumSGDUpdate to NMT training pipeline. Also surfaced and fixed out-of-bounds error in operator stemming from the implicit assumption that gradient slice input would be 2D. Now it is compatible with any dimensions, with indices indexing into the first dimension of param. Added internal checks to ensure that indices are valid.
Differential Revision: D4799697
fbshipit-source-id: 91ea23a6e743cc5337b46fae2821e773067d911e
Summary:
This is a copy of CudaAllreduceRing that doesn't stage the locally
reduced buffer in host memory but uses the GPU side buffers directly.
Eventually I would like this to be absorbed back into
CudaAllreduceRing, but for now it's a good place to compare the two
implementations and abstract the parts that make sense, until they are
identical again.
Reviewed By: andrewwdye
Differential Revision: D4791629
fbshipit-source-id: 5ad065cb94adb968aeee2379327be313638f2161
Summary:
Somehow the stress-runs flag does not work as I expected.
Now the test finally passes.
Reviewed By: azzolini
Differential Revision: D4797559
fbshipit-source-id: 1e46844e9ae55c331c2e265a59dc550983274213
Summary:
Adding support for multilabel in multiclass workflow. `input_feature_schema` and `trainer_extra_schema` are now a function taking in the preprocessor option and output the schema. This allows dynamic schema definition based on the option.
Changing default value will be in the next diff.
Reviewed By: xianjiec
Differential Revision: D4750064
fbshipit-source-id: 896143f432e963bc1723c0153749efeb39a83bec
Summary:
The main idea is that on the backward pass we don't need to store all the backward outputs in memory. This diff addresses only the ones used internally in each private workspace, by creating a shared workspace that shares them all within the backward pass.
Another thing we can do - get rid of state_grad blobs, but this would be a different effort.
See comments for more detailed description.
Reviewed By: urikz
Differential Revision: D4784900
fbshipit-source-id: 2dd8fe1b1215217ce92c09d918582d76c3051630
Summary: This layer will be used to sample negative labels for sampled softmax.
Differential Revision: D4773444
fbshipit-source-id: 605a979c09d07531293dd9472da9d2fa7439c619
Summary:
All of these tests fail with some variant of `Cannot create operator of type 'X' on the device 'CUDA'` (see commit messages).
Closes https://github.com/caffe2/caffe2/pull/227
Differential Revision: D4797060
Pulled By: Yangqing
fbshipit-source-id: 5feaa8e949098bfc1254d4c7449a2744e552f925
Summary:
Blob fits the semantics of a noexcept movable object well, since its semantics are equivalent to a unique_ptr's.
This allows, for example, having a std::vector<Blob>.
Reviewed By: pietern
Differential Revision: D4760079
fbshipit-source-id: d652d89be91a90c70651936ff694e0449a2c406b
Summary: 1) allow nets other than the simple net for recurrent net steps;
Reviewed By: urikz
Differential Revision: D4789889
fbshipit-source-id: f30f0e7268a353134db0fe21fc5c456f21fce969
Summary: To prevent others from making the same mistake as I did, check that no op has an is_test=0 argument when ExtractPredictorNet is called.
Reviewed By: viswanathgs
Differential Revision: D4796425
fbshipit-source-id: 38c14df6bcc767ec2e6a6e35ee79596a5dab531c
Summary: Add a setTimeout() API to the Pair interface. Implement in the tcp transport for connect, read, and write, and across blocking, polling, and async configurations. Ibverbs implementation to come later.
Reviewed By: pietern
Differential Revision: D4787932
fbshipit-source-id: 6072dc0c0add1700f84a72b83e4388b29b044ec1
Summary:
@public
This has no functionality changes yet; it only cleans up the sequence_op file
so that the header is context-independent, and I will implement the GPU parts
separately.
Reviewed By: pietern
Differential Revision: D4777140
fbshipit-source-id: 9b4aea6c36f06a64a53e235a125cd3477d54a045
Summary:
This diff is adding eval nets to layer model helper. It should be useful for
the cases when train/eval nets need some extra input (usually some supervision)
for train/eval. For example various sampled layers, etc.
Differential Revision: D4769453
fbshipit-source-id: 7a8ec7024051eab73b8869ec21e20b5f10fd9acb
Summary:
We should resize the workspace-vector only when it increases. Otherwise we end up destroying and recreating workspaces constantly if sequence length varies.
Modified the lstm_benchmark test to randomize sequence length.
This provides big perf improvement to machine translation pipeline. Look at the recurrent network op runtimes.
WITH:
I0328 12:17:54.073976 492094 prof_dag_net.cc:156] 136.271 ms/iter ( 120.987 ms/iter) RecurrentNetwork
I0328 12:17:54.073982 492094 prof_dag_net.cc:156] 190.074 ms/iter ( 156.828 ms/iter) RecurrentNetworkGradient
WITHOUT:
I0328 12:25:17.658206 518884 prof_dag_net.cc:156] 375.369 ms/iter ( 249.268 ms/iter) RecurrentNetwork
I0328 12:25:17.658211 518884 prof_dag_net.cc:156] 278.892 ms/iter ( 227.29 ms/iter) RecurrentNetworkGradient
With the LSTM benchmark, we get about a 2x speedup.
Reviewed By: jamesr66a
Differential Revision: D4789354
fbshipit-source-id: ad72f61974e35b0474abcacdc466ae9c6b4eb0ff
Summary: PadImage has no kernel parameters, resulting in the pads_ parameters not being set (0). I added a test case too.
Differential Revision: D4785230
fbshipit-source-id: fd475e7c41208e07fa7a363def9a45c6f82cddfe
Summary: this is useful to test rnn cells
Reviewed By: dzhulgakov
Differential Revision: D4720641
fbshipit-source-id: baa7df43357ed8af72ede64be3e0a642a40472df
Summary:
Instead of doing gemms in a for-loop (which is not parallelized), it is much better to do the batched matmuls using CUDA 8's new batched, strided version of gemm.
With the MT team's test, we get a 5-10% improvement in overall walltime, so it is a significant improvement (an illustrative sketch follows the numbers below):
----
Without batched gemm:
I0328 10:46:48.118605 58068 prof_dag_net.cc:136] 424.757 ms/iter ( 283.878 ms/iter) RecurrentNetwork
I0328 10:46:48.118609 58068 prof_dag_net.cc:136] 352.603 ms/iter ( 265.85 ms/iter) RecurrentNetworkGradient
With batched gemm:
I0328 10:53:48.169996 85617 prof_dag_net.cc:136] 407.438 ms/iter ( 269.564 ms/iter) RecurrentNetwork
I0328 10:53:48.169999 85617 prof_dag_net.cc:136] 322.393 ms/iter ( 287.625 ms/iter) RecurrentNetworkGradient
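As a rough illustration of the batched-gemm idea above (a NumPy sketch with made-up shapes, not the Caffe2/cuBLAS code): a loop of per-timestep gemms against a shared weight is equivalent to a single batched matmul over the leading dimension.
```python
import numpy as np

# Hypothetical shapes: T timesteps, each multiplying an (M, K) slice by a shared (K, N) weight.
T, M, K, N = 4, 8, 16, 32
x = np.random.randn(T, M, K)
w = np.random.randn(K, N)

# For-loop of individual gemms (one kernel launch per step on GPU).
looped = np.stack([x[t] @ w for t in range(T)])

# Single batched matmul over the leading dimension (one strided-batched gemm on GPU).
batched = x @ w

assert np.allclose(looped, batched)
```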
Reviewed By: jamesr66a
Differential Revision: D4788272
fbshipit-source-id: 210e8b94c1e036b6ef0f039ce000d455258651f4
Summary:
The header already contained an analysis of required completion queue
depth but the queue pair was still initialized with a maximum queue
depth of kMaxBuffers. This change fixes that and updates the analysis
to talk separately about receive and send completion queues.
Reviewed By: andrewwdye
Differential Revision: D4785786
fbshipit-source-id: 4dc302d523a3b7162dc261d14cfcc755681febf8
Summary:
This is pretty tricky to explain, but we can just use
backward_links. This way the whole cell would use a blob from the
states_grad tensor instead of having its own blob. This also should
save on memory a bit
Differential Revision: D4770798
fbshipit-source-id: 673f85b2c2fdf42c47feeaa24d1e2bf086f012f9
Summary: Creates SparseMomentumSGDUpdate, a sparse version of MomentumSGDUpdate, to make that optimization method (via in-place updating operator) compatible with GradientSlices.
Differential Revision: D4784973
fbshipit-source-id: e6330f471a4d5f53589a6ac245e38f256ca7f354
Summary: These are system headers and so should be included via `<>`.
Reviewed By: yfeldblum
Differential Revision: D4783480
fbshipit-source-id: 979670b594859b45560cead34f615442dfcc9f8b
Summary:
`SamplingTrain` layer is a wrapper around another layer subclassing `SamplingTrainableMixin`. When instantiated in the training context, `SamplingTrain` produces the sparse output of the wrapped layer. The output can be paired with `indices` to create a Map schema. When instantiated in the prediction context, the full output of the wrapped layer is produced.
This is like the SampledFC function in the model helper, https://fburl.com/gi9g1awh, with the ability to be instantiated in both training and prediction contexts.
I'd like to get consensus on whether we should introduce the `SamplingTrain` layer and the accompanying mixin. This can probably be accomplished in some other way, but I think this is not too bad.
Reviewed By: xianjiec
Differential Revision: D4689887
fbshipit-source-id: 7be8a52d82f3a09a053378146262df1047ab26a8
Summary: We actually copy items inside, so no need to limit this to POD types.
Reviewed By: dzhulgakov
Differential Revision: D4768652
fbshipit-source-id: 98f71b78a7c1dd4a2a2e1bff096d6bf63a0c8f50
Summary:
Use data_parallel_model for seq2seq multi-gpu training. The main reason for complexity here is that GatherOp hasn't yet been implemented on GPU.
This diff also adds a better clipping procedure - clip by global norm rather than by absolute value (a reference sketch follows).
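For reference, clipping by global norm rescales every gradient by one shared factor computed from the norm over all gradients, instead of thresholding each value; a minimal NumPy sketch with made-up names:
```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # One scale factor derived from the norm over all gradient blobs.
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)  # <= 1, leaves small gradients untouched
    return [g * scale for g in grads], global_norm
```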
Differential Revision: D4778691
fbshipit-source-id: bff184dae02ecc227413fef51f48a4726e5d3825
Summary: This allows tensors to borrow external buffers and return them once the tensor data is reallocated or freed. This is similar to folly::IOBuf's takeOwnership and ZMQ's message constructor taking a deleter as an argument.
Reviewed By: dzhulgakov
Differential Revision: D4760188
fbshipit-source-id: 6989678ad66af2e58487173174d5327bd5ae0515
Summary:
Predefining the reduction functions makes it easy to provide a set of
fast implementations. Eigen is used to implement them if it is found.
Reviewed By: andrewwdye
Differential Revision: D4780868
fbshipit-source-id: e825cf2e5cfe8ec27d587c5aff4002534b1c670d
Summary: This makes it possible to write to any offset in a remote buffer.
Reviewed By: andrewwdye
Differential Revision: D4779776
fbshipit-source-id: f5a44cc705df5141bd720ff4e3fec8697f707a70
Summary:
To evaluate from checkpoints, we need to load a model from the checkpoints.
However, the checkpoints store way more blobs than the blobs needed by the
model. This function enables the model builder to load only the blobs
associated with the model to the workspace. After that, the model builder
can evaluate the model from the populated workspace.
Reviewed By: azzolini
Differential Revision: D4751414
fbshipit-source-id: a7a420228d681fc2dcfd8573cf69a97b1abc2ef3
Summary:
All operations supported by NCCL are now available through the Gloo
wrappers. Algorithm wrappers for them are forthcoming so that they
can be used interchangeably with other implementations.
Since not all of them require same-sized source and destination
pointers, I moved assertions on number of elements to the op
constructors.
Reviewed By: andrewwdye
Differential Revision: D4771292
fbshipit-source-id: 2f34629507b5e1cb9ae8d6d2f02de0a7f641a341
Summary: Currently, we cannot have layer constants because layer params are required to have a gradient and an optimizer. Global constants don't cut it here because they can only be added once; therefore, a layer that adds any global constant can only be used once.
Differential Revision: D4773212
fbshipit-source-id: 5b60d31f3c1602afb04b61f6d30b8e3e06ed2de3
Summary:
D4690225 added support for nested field name lookup in nested
`schema.Struct`s. It would throw a KeyError when trying to access a nested
`List`'s field. Writing the lookup recursively avoids the need to enumerate
all complex field types in the lookup.
Differential Revision: D4719755
fbshipit-source-id: 37c87a32d730f0f45f72fb20894da3e32f820999
Summary: Creating PackSegments and UnpackSegments GPU operators using GPUFallbackOp for now. The op does mainly copying of blobs and this is a reasonable solution until we have a CUDA op.
Reviewed By: pietern
Differential Revision: D4761589
fbshipit-source-id: dd483b9e34ecb6b53925405e5b4c24859c549606
Summary: Allow drilling down on data throughput overall and per field.
Reviewed By: dzhulgakov
Differential Revision: D4622168
fbshipit-source-id: 1462bb2fac05824fda0c02f4f5f0b8713893e650
Summary:
- Allow capturing averageable stats such as bytes and time per request
- Allow capturing time elapsed.
Reviewed By: pietern
Differential Revision: D4622101
fbshipit-source-id: f08e422ecdfda83b13a4ed8ab9c6d5c2a5d5718d
Summary:
Use AddNet and AddBlobs to add net and blobs to meta_net_def.
This is a codemod and does not change the functionality.
It is for preparation of the protobuf change.
Depends on: D4770648
Reviewed By: salexspb
Differential Revision: D4771110
fbshipit-source-id: 00cecb2105f2c332bd50c3c51b9a10e1004fa90f
Summary:
This was a nasty one to track down. This was the error message:
```
E0323 14:47:46.138900 2870 context_gpu.h:126] Encountered CUDA error: an illegal memory access was encountered
F0323 14:47:46.139143 2870 operator.h:176] Computation on device returned error in operator
input: "x_gpu_2" output: "loss" name: "" type: "AveragedLoss" device_option { device_type: 1 cuda_gpu_id: 1 }
```
Closes https://github.com/caffe2/caffe2/pull/220
Differential Revision: D4771086
Pulled By: Yangqing
fbshipit-source-id: f2d0f39f1647c84d97d9745f8a0305a389bfbc41
Summary:
Codemod to use a separate function, for the protobuf change later on.
It does not change the functionality.
Reviewed By: salexspb
Differential Revision: D4770648
fbshipit-source-id: d8090f45d31ffa5ca1dca47297fb7c196f34d8a6
Summary:
Changed the windows python extension name to ".pyd" and did a manual copy from the {Debug,Release} folder to the main folder for easier automatic build.
Closes https://github.com/caffe2/caffe2/pull/222
Differential Revision: D4771065
Pulled By: Yangqing
fbshipit-source-id: 4a89d409fa66f0979cf4ecf502189b2f9cc11504
Summary: Allgather ring CPU implementation. It does |buffers| x |contextSize| passes.
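Conceptually (a toy single-process simulation with one chunk per rank, not the Gloo code), a ring allgather forwards each rank's chunk around the ring until every rank holds every chunk:
```python
def ring_allgather(chunks):
    # chunks[i] is rank i's local data; returns the full list as seen by each rank.
    n = len(chunks)
    gathered = [[None] * n for _ in range(n)]
    for rank in range(n):
        gathered[rank][rank] = chunks[rank]
    # In each of the n-1 steps, every rank forwards the chunk it received in the
    # previous step to its right-hand neighbor.
    for step in range(n - 1):
        for rank in range(n):
            src = (rank - step) % n
            gathered[(rank + 1) % n][src] = gathered[rank][src]
    return gathered

print(ring_allgather(["a", "b", "c", "d"]))  # every rank ends up with ['a', 'b', 'c', 'd']
```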
Reviewed By: pietern
Differential Revision: D4723809
fbshipit-source-id: ffd8366ac7e1746555474e173143d33cee497822
Summary:
This also requires a change to cmake/External/nccl.cmake to use the
static NCCL binary instead of the shared object. When the Caffe2/Gloo
build uses the bundled NCCL version it should be packaged up in the
resulting libraries and not cause another runtime dependency on a
library that has to be installed separately.
Closes https://github.com/caffe2/caffe2/pull/218
Differential Revision: D4769926
Pulled By: pietern
fbshipit-source-id: 5c85559992c200d874f4218724823815ffb5adb5
Summary: We accumulate the values of this blob (param_grad) in another special internal blob anyway.
Differential Revision: D4768643
fbshipit-source-id: a9d08b7eafd25f278a8db722f9cdb1d0064b852a
Currently in-place and out-of-place updateGradOutput will produce different results for input=max_val or input=min_val: in-place won't backprop the gradient where input=max_val or input=min_val, while out-of-place will backprop the gradient in this case.
Summary: Apart from copying gradient blobs for inputs with initial_cell_input, we needed to perform a similar operation for external parameters used by the step net
Reviewed By: salexspb
Differential Revision: D4752259
fbshipit-source-id: 13ee48cf583ed86221a4cc1cc9f57f5c3a7d2450
Summary:
Currently the output schema and blobs are named "field_i", which is
bad for debugging. This diff allows us to specify output names.
Reviewed By: kennyhorror
Differential Revision: D4744949
fbshipit-source-id: 8ac4d3c75cacbb4c9b5f55793ac969fe1cf20467
Summary:
This makes it possible to embed Gloo in a project without CMake
installing Gloo headers and/or libraries, or having a runtime
dependency (and statically link to it).
Also:
* Install benchmark tools
* Statically link to NCCL if the bundled version is used
Closes https://github.com/facebookincubator/gloo/pull/19
Differential Revision: D4762432
Pulled By: pietern
fbshipit-source-id: cf38903e6c51f2480fba4ff18cbdc0c9080df0c4
Summary: This allows to gather stats on how much raw and compressed data is being transferred across queues and network.
Reviewed By: dzhulgakov
Differential Revision: D4622049
fbshipit-source-id: 27c0c0df9e5a705f91256b20a29c7f8f988085da
Summary:
Add a ConvNd interface for Nd convolution and keep Conv for 2d convolution.
I added _BaseConv to share code between ConvNd and Conv.
Reviewed By: Yangqing
Differential Revision: D4660822
fbshipit-source-id: 8339421351ce9a36ce5a165f7fa455cfcc61733d
Summary:
This completes the fix that viswanathgs started in an earlier diff but did not
cover the full Caffe convention. It should have proper guards for all the stuff
that Caffe implies, either supporting it or throwing an explicit exception.
Reviewed By: viswanathgs
Differential Revision: D4751751
fbshipit-source-id: 474e921c33840cff333a631b7b19f881b39ebccd
Summary:
This may be the case when the Gloo CMake files are sourced from a
parent project that has already imported CMake CUDA support. If these
checks are not performed then CUDA_NVCC_FLAGS might contain
conflicting options.
Verified this works while working on Gloo for Caffe2.
Closes https://github.com/facebookincubator/gloo/pull/18
Differential Revision: D4756179
Pulled By: pietern
fbshipit-source-id: 32fc39ec2322cce5899a2398ebbf8395d3917502
Summary:
These new ops allow you to initialize, start, and stop the CUDA
profiler. This makes it possible to profile CUDA code without running
the application through nvprof.
Reviewed By: jamesr66a
Differential Revision: D4747863
fbshipit-source-id: b439e8f28d1d62db19524fee0458523414cb79e3
Summary:
Some small MPI-related changes:
1) Instead of making an object copy of the MPI_Comm, call MPI_Comm_dup;
because the (passed-in) communicator is used later via the call to
connectFullMesh this guarantees that the communicator will not have been
freed by user before connectFullMesh is called.
2) Allreduce for maxLength is done on an unsigned long type; use the
corresponding MPI type.
Closes https://github.com/facebookincubator/gloo/pull/17
Differential Revision: D4754195
Pulled By: pietern
fbshipit-source-id: 863fd33c726f88120f8f5ee61964c3525babbf97
Summary:
This change solidifies IO error handling between threads and successive transport API calls. When an IO exception occurs, signal all buffers of the error, propagating the exception from the device thread or single user thread onto all user threads. Store the exception in the pair and check on future API calls or device events. Swallow all IO exceptions in the device loop.
Right now IO exceptions during portions of the listen/connect phase will result in an indefinite wait in the peer. I will address this with a configurable timeout (t16205269).
Reviewed By: pietern
Differential Revision: D4749248
fbshipit-source-id: c75ee3b20875d561bf84631e5384e28015dabad3
Summary: This didn't work for a reason specified in comments. Also some cleanup in the unit tests, now inference uses a custom workspace to run cell net on
Reviewed By: urikz
Differential Revision: D4742670
fbshipit-source-id: 04165c029fddec5ae31b20b207faf06d2fa20816
Summary: This popped up during the debugging with intel folks.
Reviewed By: salexspb
Differential Revision: D4745176
fbshipit-source-id: 88ce91e565b45253d60588ab35ed4b8e5b8d4947
Summary:
So you can just run `BUILD_CUDA=ON .travis/install.sh` on a 16.04 machine and have it install the right packages.
Closes https://github.com/caffe2/caffe2/pull/212
Differential Revision: D4748670
Pulled By: Yangqing
fbshipit-source-id: 2015613e4d5ca6bcd1c9320c6c4cba071463c120
Summary: Seems like a lot of confusion in the group lately has been about missing CUDA operators. Let's make it clearer in the error message.
Reviewed By: azzolini
Differential Revision: D4737037
fbshipit-source-id: 56c7819df909bf954510296703bff5f221fa8ae7
Summary:
aaronmarkham this solves your Windows build issue. Basically:
(1) VS 2017 does not have CUDA support yet, and we will be waiting on NVidia to do so.
(2) VS 2015 and 2017 need different cmake generator strings.
This PR shows how to determine those and also updates appveyor to do contbuild guard for the following 3 settings:
- VS2015 without cuda
- VS2017 without cuda
- VS2015 with cuda
Closes https://github.com/caffe2/caffe2/pull/210
Differential Revision: D4745007
Pulled By: Yangqing
fbshipit-source-id: 50952552843abd0eb6f4145d9f132daeee3a6794
Summary: Created `BatchDistillLRLoss` layer and added support for it in DPer2.
Differential Revision: D4718333
fbshipit-source-id: b873954ea704daafed94ac65fef47a20d56858e2
Summary:
Bubble up gloo configuration and network errors as exceptions. The caller may be able to recover. Other unexpected failures continue to be handled as fatal with GLOO_ENFORCE
Modify ibverb API validation to check for != 0 instead of -1 to conform with API definition.
Still need to convert some errors in the rendezvous code and add documentation.
Will pass device loop errors onto the calling thread in a future diff
Reviewed By: pietern
Differential Revision: D4730362
fbshipit-source-id: c801adb353013e7f541ab01ac16a0cc71c1c36b2
Summary: D4734505 part 2. Remove more instances of the batch_size parameter
Reviewed By: urikz
Differential Revision: D4736906
fbshipit-source-id: fc9d374e9308017d61c427890364c5ab9cec2edf
Summary: Reshape based on tensor shapes in the graph rather than based on a passed-in batch_size parameter
Reviewed By: urikz
Differential Revision: D4734505
fbshipit-source-id: d9c23d85be84f61124106e752ef2b4f6945e2a07
Summary: This is a somewhat simpler version of what Aapo did before, as that one had some weird crashes in some of the training pipelines.
Reviewed By: urikz
Differential Revision: D4734934
fbshipit-source-id: f9ecff2a0d68a8cbc0858658f38be34d616fa100
Summary: We don't use this one any more except in a few tests.
Reviewed By: urikz
Differential Revision: D4731401
fbshipit-source-id: c5c28b7594e3251f501fc28455dfc9bd2093a836
- Add additional timeouts to test_multiprocessing to reduce chances of
hanging indefinitely on failure
- Add missing header guards
- Fix typo
- Check that torch_shm_manager exists in torch/__init__.py
Summary: This has been subsumed by gloo.
Reviewed By: andrewwdye
Differential Revision: D4729216
fbshipit-source-id: aa4f0637ee70dd03e85a6a0e7ffda68e5e9505be
Summary:
This can happen when the tensors are changed/resized. The cached
algorithm instance won't be valid in that case. I think for now it's
best to fail hard and require the net to be reinitialized if this
happens. If instead we were to always reinitialize when this condition is
detected, then frequent resets could lead to poor performance and go
undetected.
I spoke about the generality of this problem with YQ. The pattern used
here of updating a representation of the op's parameters is far from
ideal. Instead, it would be much better to have the core framework use
some kind of versioning on tensors/blobs (can be as simple as a single
integer) to make it much easier to detect a change in inputs/outputs.
If there are more places that would benefit from such a facility, we
should consider adding it. As right now Gloo is the only place where
we need it, it doesn't make sense to immediately add it to core.
Reviewed By: Yangqing
Differential Revision: D4728121
fbshipit-source-id: 69a8a620aecc961a3f7a27e8c53e22945d9a258e
Summary: Adding synchronous optimization on GPUs to the translation training pipeline, via data_parallel_model.Parallelize_GPU, which needs to be updated so there is some way of performing sparse parameter updates (e.g., on embedding tables), whether on GPU or CPU.
Reviewed By: urikz
Differential Revision: D4631914
fbshipit-source-id: 9cdd655f7dbda3f9b2733d459228b3e097892441
Summary: This adds a nearest neighbor interpolation resizing operator to caffe2. CPU only, NCHW only, no gradients. Also adds torch2caffe support. This is probably not optimal in terms of performance, but it works.
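As a rough reference for what nearest-neighbor resizing does (a naive NumPy sketch for NCHW input; the function name and the exact rounding convention are assumptions, not the operator's):
```python
import numpy as np

def nn_resize_nchw(x, out_h, out_w):
    # x: (N, C, H, W); pick the nearest source pixel for every output location.
    n, c, h, w = x.shape
    rows = np.minimum((np.arange(out_h) * h / out_h).astype(int), h - 1)
    cols = np.minimum((np.arange(out_w) * w / out_w).astype(int), w - 1)
    return x[:, :, rows[:, None], cols[None, :]]  # shape (N, C, out_h, out_w)
```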
Reviewed By: ajtulloch
Differential Revision: D4724244
fbshipit-source-id: b8295061141fb513da84acf91fdfd67264119059
Summary:
1. migrate the basic mtml model to dper 2
2. test dper 2 mtml model
3. test all optimizers
Reviewed By: kittipatv
Differential Revision: D4680215
fbshipit-source-id: 7aac5c59bdac22fcad8ed869b98e9e62dca1d337
Summary: Layer that takes a (label, prediction) pair and outputs the L2 loss.
Reviewed By: kittipatv
Differential Revision: D4702111
fbshipit-source-id: 09f2ede44d1b548e61096de741f1b2aa0b66bbcb
Summary:
Setting up a caffe2 versioning number per popular request.
The plan is to periodically update the version, with the current plan being
every other week. As a result I am setting the initial number to minor version
5 (since this is the 11th week of the year).
Reviewed By: salexspb
Differential Revision: D4725945
fbshipit-source-id: 9ff4c7e4a6341e22a5f1d4e25740705988cae84b
Summary:
Currently if all samples in a batch miss labels, the task customized layers have no data.
In that case, the EnsureDense op does not compute the gradient correctly. To avoid that, we switch
back to letting Gather generate dense gradients.
Why does the EnsureDense op not compute the gradient correctly?
Because when EnsureDense computes gradients, it does not know the actual data batch size, so its output gradients may have the wrong batch size.
Reviewed By: xianjiec
Differential Revision: D4712463
fbshipit-source-id: 736f63273e7fbc4348f37fa3a5a696f855b7c3ad
Summary: Useful for restoring after a conditional block where we want to disable threading.
Reviewed By: jamorton
Differential Revision: D4638648
fbshipit-source-id: 29695284f7c427caa6b80a9bca0cbd1406543a44
Summary:
it was broken in trunk and I fixed it locally, then had a
wrong merge in D4672026. This is just a revert of those changes.
Reviewed By: ajtulloch
Differential Revision: D4723138
fbshipit-source-id: 14757d9c8ae5135bd7c084003a64e25efc74b54f
This ensures that we use the same library at the C++ level and with
Python ctypes. It moves the searching for the correct library from
run-time to compile-time.
Summary: Reshape based on tensor shapes in the graph rather than based on a passed-in batch_size parameter
Reviewed By: urikz
Differential Revision: D4702086
fbshipit-source-id: c4c1d8425cd36c1e86695918eaba2667c27e9601
Summary:
/cc akyrola
I basically just copied all the `ShapeCall` stuff as `TypeCall`. Is there a better way?
Closes https://github.com/caffe2/caffe2/pull/187
Differential Revision: D4699312
Pulled By: Yangqing
fbshipit-source-id: 92f736ffe4127b00b5821acb1eb359771975fdd7
- make each test in test_autograd have a unique name ignoring case
- assemble all tests when test_legacy_nn is imported
- import Python.h in PtrWrapper.h
Summary: For some embedding tasks, we don't want to include a bias term in the embedding computation.
Reviewed By: xianjiec
Differential Revision: D4689620
fbshipit-source-id: 4168584681d30c0eaa1d17ceaf68edda11924644
Summary: Initializing ncclComm_t is expensive. Allocate a set of ncclComm_t for each unique device set and cache for reuse. With this change the CudaAllreduceChunked tests runtime improved from ~170 sec -> ~10 sec on my machine. There is no improvement in the benchmark numbers because the algorithm instance is only allocated once.
Reviewed By: pietern
Differential Revision: D4708943
fbshipit-source-id: 85b85070586d6683a762b8282df593ca831e7bc7
Summary:
This change includes CMake changes to compile the MPI assets when the USE_MPI flag is enabled. If so, the benchmark tool can now be launched through mpirun.
Includes the changes done in #11.
Closes https://github.com/facebookincubator/gloo/pull/12
Reviewed By: Yangqing
Differential Revision: D4712060
Pulled By: pietern
fbshipit-source-id: 0d0e93882f5822583f59304d4256dbdf5dea7483
Summary:
Make it use Gloo and optionally use Redis for rendezvous (where a
shared filesystem is not available).
Differential Revision: D4709943
fbshipit-source-id: 59cc7a14316c7b634417ea5161a75fab3c19f2fa
Summary:
We have more and more nested Struct schemas. There is an increasing need to get/add a field by nested name, e.g., for the following nested Struct schema:
st = Struct(
    ('a', Scalar()),
    ('b', Struct(
        ('c', Scalar()),
    )),
)
We may want to get the field "b:c" and/or insert a new field "b:x". The immediate need is for dper2 metrics.
This diff is to achieve this.
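For illustration only (the exact accessor spelling is an assumption based on the "b:c" naming above), nested lookup by colon-separated name might be exercised like this:
```python
from caffe2.python import schema

st = schema.Struct(
    ('a', schema.Scalar()),
    ('b', schema.Struct(
        ('c', schema.Scalar()),
    )),
)
# Nested lookup by colon-separated field name, per the "b:c" convention above.
nested_field = st['b:c']
```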
Reviewed By: kittipatv
Differential Revision: D4690225
fbshipit-source-id: 71d4a74b36bd1228a2fefd901db2f200602152b7
Summary: For example, test and train nets could have shared workspaces, leading to a race condition. This adds an assertion and adds a running counter to the workspace-blob name.
Reviewed By: jhcross
Differential Revision: D4712152
fbshipit-source-id: 808d7069095bac24ebfe0c9d31ebd134f4cf0956
Summary:
This should fix mkl contbuild per the most recent bugfix from Intel.
Closes https://github.com/caffe2/caffe2/pull/189
Differential Revision: D4711448
Pulled By: Yangqing
fbshipit-source-id: 70d1b35fa4fe6cc9b4d36ec0fcfbd6d33f313182
Summary:
No longer need GPU to CPU copies. The allreduce operator no longer
uses 'local allreduce - global allreduce - local broadcast' sequence
when Gloo is used, but passes all input blobs directly.
Depends on D4708860.
Differential Revision: D4709897
fbshipit-source-id: 4d745d5d8bac9c2fcca081dd5d812c902808c3b6
Summary:
This is going to allow experimenting with various training-from-scratch / fine-tuning techniques. The code itself for the new model is not intended to be used as is. Instead, one could train a full precision model first, then add quantization for the last layer, then for the next one, and so on.
In my experiments I tried taking a pretrained model and then quantizing all inception layers with 4 bits. This restored the original accuracy after several dozen iterations.
Also in this diff I added a common prefix to the model checkpoint and added this prefix to git / hg ignore.
And also some extra logs which are useful for quickly seeing how things changed right after enabling quantization.
Differential Revision: D4672026
fbshipit-source-id: b022c8ccf11dd8a2af1a7b2e92673483bc741a11
Summary: D4704547 caused stuff to crash with various memory corruption errors. The problem appears to be in calling sharedWorkspaces->resize(), although I don't completely understand why. Something to do with moving the shared_ptrs around? Anyway, first clearing and then resizing (only needed when seqLen is bigger than what we have allocated) fixes the issue.
Reviewed By: jhcross, Yangqing
Differential Revision: D4711675
fbshipit-source-id: 35c70e8258555fcb6d403df35e0d391aebe96485
Summary: NCCLOp::runNCCL is mistakenly recording an event in the source pointer after the NCCL op. This results in NCCLOp::wait() returning without synchronizing with the output buffer. The synchronous tests using NCCL fail.
Reviewed By: pietern
Differential Revision: D4708860
fbshipit-source-id: 0c36511e260b587d410e5c9604552ceedd06d988
Summary:
Necessary if CXX isn't set when cmake is called. The CXX variable will then be
empty which prevents make from using its own default.
Closes https://github.com/caffe2/caffe2/pull/202
Differential Revision: D4711113
Pulled By: Yangqing
fbshipit-source-id: 895c07044b263ba9b5440453978248506d7ac225
Summary:
These are all essentially no-op changes which allow for nose-style (or pytest-style) test discovery.
With this patch, you can use any of these methods to discover and run tests under `caffe2/python`:
```
python -m unittest discover -p '*test*.py' caffe2/python/
python -m nose caffe2/python/
python -m pytest caffe2/python/
```
Future work:
* Get all of the tests to pass
* Some seem to be testing operations which don't have GPU implementations
* I get a segfault unless I set `CUDA_VISIBLE_DEVICES=0`
* Some tests are flaky
* Allow test discovery throughout the whole project (e.g. the `experiments/` dir)
Closes https://github.com/caffe2/caffe2/pull/199
Reviewed By: pietern
Differential Revision: D4704504
Pulled By: Yangqing
fbshipit-source-id: 8f5687ec9c8aa873dfaff30dbf44272bc38a206b
Summary:
RecurrentNetOp created workspaces at every run, which is very wasteful, as it had to also recreate the stepnets (forward and backward!).
Reviewed By: salexspb
Differential Revision: D4704547
fbshipit-source-id: 460028d912d6a735448c445cb83c0c4d03286351
Summary:
First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made:
- cell net/step net external inputs must be namespace scoped
- prevent double-namescoping of cellnet inputs
- make data parallel model understand recurrentnets so the device-mapping works
Reviewed By: salexspb
Differential Revision: D4708840
fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4
Summary: Some operators, e.g., SoftmaxWithLoss, return a scalar-typed tensor. This would allow us to use those ops without having to write a layer manually.
Reviewed By: xianjiec, kennyhorror
Differential Revision: D4703982
fbshipit-source-id: f33969971c57fc037c9b44adb37af1caba4084b6
Summary: When cloning a recurrent net op, we do a remapping of the lengths-blobs. But if they don't exist (like with CRF), we should not do that.
Differential Revision: D4702123
fbshipit-source-id: 37a22d11e709011b8b98b2cc3d9f08eb9fda06c4
Summary:
Central cropping during test phase, similar to Caffe's behavior
Closes https://github.com/caffe2/caffe2/pull/195
Differential Revision: D4704506
Pulled By: Yangqing
fbshipit-source-id: cf7d457dc2acfe8ff5a225ebfd5f8cd0f9d92a07
Summary:
Yield better throughput since full ring allreduce is cheaper for
smaller blobs (fewer communication steps).
Reviewed By: andrewwdye
Differential Revision: D4704850
fbshipit-source-id: 338addd919f454c94412ea145e1280492f765c72
Summary:
TSIA
For the broadcast op the first input tensor on the process with the
specified rank is broadcast to all other processes and outputs.
For the allreduce op all inputs are considered for the reduction.
Reviewed By: andrewwdye
Differential Revision: D4704540
fbshipit-source-id: e6879ca0a9adffe0bc61bf74a333c4052bc8bd92
Summary: These Python helpers are going to provide sufficient bookkeeping when adding quantization for conv layers.
Reviewed By: Yangqing
Differential Revision: D4671478
fbshipit-source-id: 292e2f633dd30969c0afbe7a8075b340ce9a6d12
Summary: UNK needs to be indexed in the vocabulary for validation to work. Default args now result in training loss decreasing.
Reviewed By: urikz
Differential Revision: D4703393
fbshipit-source-id: e4d6ad100daf8392f8ba1e502f9ecf39bb8ce24a
Summary:
Context:
https://fb.facebook.com/groups/1405155842844877/permalink/1677762748917517/.
DropoutOp and DropoutGradientOp already handle input of size 0 gracefully. The
CHECK isn't needed. I think this should fix the crash in xray detection models
where num region proposals are zero.
Differential Revision: D4697254
fbshipit-source-id: afd06975f2ad4b2e59f15d12b0aa332f6eb3f1af
Summary:
Allows `nose` or `pytest` to collect all tests.
```sh
$ cd build
$ nosetests --collect-only
..............................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 222 tests in 0.430s
OK
```
Closes https://github.com/caffe2/caffe2/pull/198
Differential Revision: D4700783
Pulled By: Yangqing
fbshipit-source-id: 97504f6b14537669aa150f6a71283e851829ac5e
Our extension library links against cudart and pulls in the symbols. Use
LoadLibrary(None) to use the same symbols as the _C extension.
This fixes the PyTorch wheel when you don't have system CUDA installed.
Summary:
It has been a pain to save predictor-compatible models from Caffe2. This diff adds a function, ExtractPredictorNet, that takes a training model and outputs a predictor model by removing all operators that are not relevant for prediction, such as the backward pass and dequeue ops for input loading (since in the predictor, the input data is an external input).
We can also consider including this directly in the predictor exporter for FB usage.
Reviewed By: rpenggithub
Differential Revision: D4693264
fbshipit-source-id: e81abbbec0bd4d717159cf36488d0baaf0130090
Summary:
Implement ReduceBackSum & ReduceBackMean with gradients for CPU & GPU contexts.
The reduction happens over the last dimensions; for example, if the input is an
M x N matrix, ReduceBackSum will produce a vector of dim M x 1 containing the
row-wise sums.
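A NumPy reference for the M x N case described above (a sketch of the math, not the operator code):
```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)  # M=3, N=4
reduce_back_sum = x.sum(axis=-1)    # row-wise sums, shape (3,)
reduce_back_mean = x.mean(axis=-1)  # row-wise means, shape (3,)
```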
Differential Revision: D4689768
fbshipit-source-id: 5b0482d4341867ecf23526dc6c4d544420e7d8f7
Summary: Add shape inference for reshape. Because it cannot do shape inference for reshaped tensor with runtime tensor data, set `out[0].set_unknown_shape(true)` if no `shape` argument is used.
Differential Revision: D4671125
fbshipit-source-id: 685a9198f9b08e3336014c792f20051b381d8619
Summary: We should be using the vocabulary built on the training data, and corpus_eval as data for the evaluation phase.
Reviewed By: urikz
Differential Revision: D4700382
fbshipit-source-id: ca1dd043a28f9bb585faad050c82fb12c1cdf6cc
Summary:
This is the minimum required CMake version (also the version that is available on Ubuntu Trusty (14.04)).
Closes https://github.com/facebookincubator/gloo/pull/9
Reviewed By: Yangqing
Differential Revision: D4698659
Pulled By: pietern
fbshipit-source-id: bf01541fe485c03e7c665f175c2887feaf9516a3
Summary: Fixed a bug (AttributeError: ModelTrainerLog instance has no attribute 'external_loggers', at File "caffe2/python/experiment_util.py", line 101) when no external_loggers is passed to ModelTrainerLog().
Differential Revision: D4697197
fbshipit-source-id: 1c770c366d87ea474bcf40ab289b67c76648d48b
Summary:
Allocate a set of per-device streams used to serialize NCCL op scheduling. These ensure concurrent NCCL ops are not interleaved across devices (e.g., through priority scheduling), which could result in deadlock.
Synchronize source and destination streams with NCCL streams.
Reviewed By: pietern
Differential Revision: D4685360
fbshipit-source-id: 3c228b195b0a0d9d7cccc720163898d344a5ed4c
Summary:
Otherwise the blob will be in a different namescope, e.g., `_nested`: https://fburl.com/ntlsaezv.
This makes TensorBoard ugly.
Reviewed By: dzhulgakov
Differential Revision: D4696946
fbshipit-source-id: 73627feccd7c4896964e6c549b7241bcce4f49a7
Summary:
TSIA
This change also fixes an undefined attribute error after running 20
iterations of the resnet50 example trainer.
Differential Revision: D4692794
fbshipit-source-id: b98efdfeb078c5ba89d2a86837f3c672e1eade5f
Samples elements from `[0,..,len(weights)-1]` with given probabilities (weights). So far there is no way to introduce sample weights either in loss functions or while sampling from a dataset. This is an attempt to add the functionality for the latter issue.
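Conceptually (a NumPy sketch with made-up names, not the submitted sampler), drawing indices in proportion to a weight vector looks like this:
```python
import numpy as np

def weighted_indices(weights, num_samples, replacement=True):
    w = np.asarray(weights, dtype=np.float64)
    p = w / w.sum()  # normalize weights into probabilities
    return np.random.choice(len(w), size=num_samples, replace=replacement, p=p)

print(weighted_indices([0.1, 0.1, 0.8], num_samples=5))  # index 2 dominates
```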
Summary: A lot of people get confused if the file can't be loaded.
Reviewed By: rpenggithub
Differential Revision: D4686572
fbshipit-source-id: 519ff68a3d4f04cf8ce893f255f7814e043383b6
Summary: We did the InferToDeviceMapping too early; we should have done it also after running the parameter update function, since that can create new blobs like the momentum blobs. This fix is maybe not optimal, but it works and is fast enough.
Differential Revision: D4693450
fbshipit-source-id: 4c4cc2396dad371b3fbcd1d8da51133ea09a57e0
Summary:
Before we didn't propagate the 'out-of-data' signal if splits_per_epoch wasn't specified.
Right now it's a hacky fix (just reuse ReaderWithLimit). azzolini - any suggestions for a more elegant solution? I can create an extra reader that just exports an "is empty" signal.
Overall, I guess we need to turn global_queue into a more sustainable unittest that verifies all possible combinations - I'm still not sure it's correct :-\
Reviewed By: xianjiec
Differential Revision: D4665677
fbshipit-source-id: fe44d10ee82c3383145635e67dea1d9b666e061f
Summary: When debugging using LayerModelHelper, adding Print to the model will trigger this assert.
Reviewed By: xianjiec
Differential Revision: D4687859
fbshipit-source-id: 6932e38f8dd17ba0b80da18a20943ecdb2e8af0a
Summary: Thanks to shenpan for detecting this bug. The problem is that FinalizeAfterCheckpoint() can be passed a list of strings, not blob references, and that fails in stripParam() after the assertion I added in D4649208. It is OK to pass strings to that function as well.
Reviewed By: jhcross
Differential Revision: D4691028
fbshipit-source-id: 0bca80d44a5ab641438cc5b26482bca0b1527d69
Summary: Chatted with pietern today, figured it is an easy change.
Reviewed By: pietern
Differential Revision: D4688275
fbshipit-source-id: a2751f1ff9f192ba6f2bd961be6ad1c693c8b5c6
Summary: Following krp's suggestion, check if the shape parameter is empty.
Reviewed By: dzhulgakov
Differential Revision: D4686698
fbshipit-source-id: 3f9fb1e3215dd2a4a726442531201eeb18224bc6
Summary:
This makes it easy to use Gloo transports and algorithms in existing
MPI environments.
Reviewed By: andrewwdye
Differential Revision: D4685999
fbshipit-source-id: cfc7d0e445893512b4e4ed2abe1bb280d83b9c70
Summary:
How pairs are setup and connected to one another is specific to
whatever underlying rendezvous mechanism is used. This change moves
the `connectFullMesh` function into a subclass in the `rendezvous`
directory. This prepares for a separate MPI context that can setup
pairs between processes using an existing MPI communicator.
Reviewed By: andrewwdye
Differential Revision: D4684755
fbshipit-source-id: 9eb643b8ba545b3e6f9a36b65642b3b04a5f0077
Summary:
Created a new function with specifics related to MI LSTM implementation in caffe2
See https://arxiv.org/pdf/1606.06630.pdf for details.
See D4478877 for the implementation of the same in tensorflow
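For context, the multiplicative-integration idea from the linked paper replaces the additive gate pre-activation Wx + Uh + b with a gated multiplicative term; a conceptual NumPy sketch (symbol names are mine, this is not the Caffe2 code):
```python
import numpy as np

def mi_preactivation(Wx, Uh, alpha, beta1, beta2, b):
    # Vanilla gate pre-activation: Wx + Uh + b.
    # Multiplicative integration (Wu et al., arXiv:1606.06630), all element-wise:
    return alpha * Wx * Uh + beta1 * Uh + beta2 * Wx + b
```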
Reviewed By: jhcross
Differential Revision: D4669882
fbshipit-source-id: 095bbcf187dbdac2cd79558ff0c8f9f67d8af639
Summary:
OSS implementation of the seq2seq model in Caffe2. The script uses the Seq2SeqModelCaffe2 class to build and run the model. It takes in training data in the form of a text file with one sentence on each line, builds a vocabulary, generates batches based on batch size, and runs the net for a configurable number of epochs. It prints the total scalar loss at the end of each epoch.
All FBLearner and neural_mt type system dependencies have been removed. Unimplemented and unnecessary methods have been removed to make the script simpler.
fblearner/flow/projects/langtech/translation/neural_mt/model_util_caffe2.py has been moved to caffe2/caffe2/python/examples/seq2seq_util.py and remains unchanged
Potential TODOs:
- Get the model running in GPU. Only GatherOp does not have a corresponding GPU implementation. Try adding CopyGPUToCPU before and CopyCPUToGPU after Gather, and use CUDA DeviceOption.
- Add evaluation on test data with suitable metric (perplexity? bleu?)
Reviewed By: urikz
Differential Revision: D4653333
fbshipit-source-id: 1c7d970ebc86afe23fad4d48854296bf54eb0f77
Summary: ReversePackedSegs operator for CUDA. Input "lengths" (static integers) required to be in CPU memory.
Differential Revision: D4661281
fbshipit-source-id: c800c316c34015ba8e732dcbcaa8c4edaffdfeab
Summary:
Data parallel model did not support sparse operations, nor gradients computed on CPU ops.
Currently sparse operations are done on the CPU, so there is no point in "data parallelizing" them. I had to make a few changes to data_parallel_model to support this:
1. Model can have params that are added prior to adding the data parallel part. For example, a lookup table of word vectors would be a parameter that is non-parallel.
2. Thus, when data parallel model is called, it will separate the non-parallel params and avoid working on them. Note: when we add distributed version, we need to explicitly handle them with AllGather!
This works nicely since Caffe2 automatically adds the backward concat-operator when multiple ops gather from the same blob.
I also added support for data parallel CPU ops, which might be necessary in cases when we don't have a GPU implementation of some ops.
The test in data_parallel_model_test validates the correctness of the code by running the same trainer on different numbers of GPUs and checking that the end result is the same.
Reviewed By: jhcross
Differential Revision: D4649208
fbshipit-source-id: e3b7ae701ead468dc94c52a976eafec5c9831097
Summary: CudaDevicePointer has the information we need for a NCCL op. Refactor NCCLElement as a composition of src and dst CudaDevicePointers. This allows for separate streams for src and dst, and will simplify a future change to use a static set of streams for all NCCL ops.
Reviewed By: pietern
Differential Revision: D4679483
fbshipit-source-id: 75656cc2fa5b5e2a6c096d914d2111769a47291b
* add momentum and centered options
Add two options :
- Momentum (like SGD's momentum)
- Centered RMSprop, as in Graves 2013 ( https://arxiv.org/abs/1308.0850 ): the gradient is normalized by a running estimate of its variance
* some PEP8
* bug in default
* bug2
* sign mistake
* alloc of momentum & centered only if needed
* add link to docstring
* some pep8 on docstring
* implement __setstate__() for backward compatibility
* correct grammar mistake
* multiply by lr when adding delta to params
* rename momentum variables
* change __init__ params order
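For reference, the momentum and centered variants listed above boil down to the following update (a NumPy sketch of the math with my own variable names, not the torch.optim code):
```python
import numpy as np

def rmsprop_step(param, grad, state, lr=1e-2, alpha=0.99, eps=1e-8,
                 momentum=0.0, centered=False):
    sq = state['square_avg'] = (alpha * state.get('square_avg', 0.0)
                                + (1 - alpha) * grad ** 2)
    if centered:
        # Centered variant (Graves 2013): subtract the squared running mean of the
        # gradient, i.e. normalize by an estimate of the gradient's variance.
        g_avg = state['grad_avg'] = (alpha * state.get('grad_avg', 0.0)
                                     + (1 - alpha) * grad)
        denom = np.sqrt(sq - g_avg ** 2) + eps
    else:
        denom = np.sqrt(sq) + eps
    if momentum > 0:
        buf = state['momentum_buf'] = (momentum * state.get('momentum_buf', 0.0)
                                       + grad / denom)
        return param - lr * buf  # lr multiplies the momentum buffer, per the list above
    return param - lr * grad / denom
```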
Summary: This diff is getting rid of old metrics interface in realtime training.
Reviewed By: xianjiec
Differential Revision: D4649734
fbshipit-source-id: de4af85eb5476df9790ebd3915625bf8beee65af
Summary:
When the execution step is representing things like:
for loop
execution_step
net1
execution_step
net2
net3
the preparation cost for the execution step is too high.
This diff moves most of the shared information into the CompiledExecutionStep to save time.
After the change the benchmark result for parameter server handler is as following: (be aware that the first two have some variance)
INFO:__main__:==Summary==
INFO:__main__:Time <function case_if at 0x7f7160c32938> 0.0752924203873
INFO:__main__:Time <function case_loop at 0x7f7160c329b0> 0.0677666187286
INFO:__main__:Time <function case_simple_net at 0x7f7160c32a28> 0.0605396509171
INFO:__main__:Time <function case_one_loop at 0x7f7160c32aa0> 0.0611681699753
Before the change:
INFO:main:==Summary==
INFO:main:Time <function case_if at 0x7f19d079f848> 0.100815701485
INFO:main:Time <function case_loop at 0x7f19d079f8c0> 0.0864136457443
INFO:main:Time <function case_simple_net at 0x7f19d079f938> 0.0614696979523
INFO:main:Time <function case_one_loop at 0x7f19d079f9b0> 0.0598972082138
Reviewed By: azzolini
Differential Revision: D4643926
fbshipit-source-id: 5a4b97230ba778e0ff5cbafc8a216335a191068a
Summary: The sum processor and sqrt pooling are to mimic the DoubleHelix model.
Differential Revision: D4678413
fbshipit-source-id: fc1ccfe3c92c540ce5914dfd8ff1a040805c48db
Summary:
For the MSC compiler, the binary flag needs to be specified.
Closes https://github.com/caffe2/caffe2/pull/191
Differential Revision: D4677511
Pulled By: Yangqing
fbshipit-source-id: 4f80f09bd4bf9b6b6eff352cc67a62163255334f
Summary: AccumulateHistogramOp, for computing the histogram of all values in input tensors
Differential Revision: D4654417
fbshipit-source-id: dea92346004c772af16e1eb41306287d81dc5a02
This is an important clarification to make, as otherwise users are misled as to where they may need to add dropout, and to clarify the situation they would need to delve into the backend implementation.
4647f753bc/torch/nn/_functions/rnn.py (L73)
Summary: Take user inputs for the introspection visualization: convolutions output layer activations, filters using containing phrases, and number of samples
Reviewed By: Mortimerp9
Differential Revision: D4603797
fbshipit-source-id: dc972dcb8ad36e30defab266d710e047b11cff73
Summary:
modified load_save_op to work with my training script
- SaveOp now correctly strips the specified prefix of the form 'gpu_0/' when saving model blob names to the DB
- when translating DB blob names to model blob names, LoadOp can now optionally add a prefix of the same form
Reviewed By: Yangqing
Differential Revision: D4664134
fbshipit-source-id: a2512e79f0c5172c5111af3e9b6fd161f268f4df
Summary: Super rough implementation of recurrent attention. Planning to factor out the common code between the two functions as well as train and eval. I want to get this out and get eyes on it sooner rather than later
Differential Revision: D4647837
fbshipit-source-id: 54bc4e8ed0df6f04c86c425926decbe89f73b068
Summary: In the case of a distributed task, load_from_db() loads to the wrong workspace (when used inside a Python op). Passing which workspace to use explicitly so that it loads to the one the Python op is being run in.
Reviewed By: kennyhorror
Differential Revision: D4653692
fbshipit-source-id: 94585c012b05ee38b9ce5e8ef0efdd50aa41dd2b
Summary:
Add a nextSlot() function to the context that increments and
returns a slot number. This enables multiple algorithms sharing the
pairs part of a context. The slot numbers were hardcoded before this
change, which prevented reuse.
After this change, some of the tests can be changed to run multiple
times (or do a parameter sweep) without respawning a new threadpool or
allocating new fixtures.
Also change some internally used variable names for more consistency.
Reviewed By: andrewwdye
Differential Revision: D4668268
fbshipit-source-id: 65cbc8f2666f0b7d2f1c72574b86d913f5855d62
Summary: The evaluation part of the two tower workflow is missing. This diff completes it. Some of the newly added functions can be used for other workflows, e.g., feed. As the eval workflow overlaps across different workflows, a generic eval workflow will be added in a separate diff.
Reviewed By: kennyhorror
Differential Revision: D4646880
fbshipit-source-id: 4d6eb35df10f6f613533d442f2a04dc0332386f8
Summary: Add gradient support for Caffe2 operator SumElements (for use in Translation RNN training pipeline).
Differential Revision: D4669036
fbshipit-source-id: 502760a2a624b20b3241e83a2f208f450b6ff36f
Summary:
The current optimizer code in c2/python has the following issues:
(1) the optimizers in sgd.py cannot configure a per-param-blob optimizer;
(2) sgd.py is a bad file name; optimizer.py is a better name;
(3) layer_model_helper.py has another set of optimizer code (which supports per-param-blob optimizers).
This diff did the following:
(1) create optimizer objects so that we can configure a per-param-blob optimizer and that are also compatible with the existing optimizer code
(2) the new optimizer code is much more modularized
(3) move the optimizer code to a file with a better name (optimizer.py)
(4) replace the optimizer imports in the existing code
Will do in next diffs:
(1) optimizers with structured parameters for dper2
(2) get rid of the optimizer code in layer_model_helper.py
Reviewed By: salexspb
Differential Revision: D4609013
fbshipit-source-id: 2e2d6dfa8685d10498f89069157453d9feca3f27
Summary:
Fun on the plane. This basically reveals the per-platform build status on the README.md file.
Closes https://github.com/caffe2/caffe2/pull/188
Differential Revision: D4668460
Pulled By: Yangqing
fbshipit-source-id: 242b916cca0a46f8d797c6430c1875d6ffaae7ce
Summary:
1. Allow the EnsureDense op to do either an in-place pass or a copy
2. In MTML, add an EnsureDense op before gather
3. Change the unittest values (adding another operator changes the random seed,
which causes the model initialization to also change)
Reviewed By: xianjiec
Differential Revision: D4625219
fbshipit-source-id: b3c748c3651d1dedd75420912a9698b7e46187c5
Summary: This diff is migrating existing DPER workflows to use new metric abstractions in training.
Reviewed By: xianjiec
Differential Revision: D4656576
fbshipit-source-id: 1b3b16b390fc0757480e41df1c4214c11cd76e8a
Summary:
(Note: previous revert was due to a race condition between D4657831 and
D4659953 that I failed to catch.)
After this, we should have contbuild guarding the Windows build both with
and without CUDA.
This includes a series of changes that are needed to make Windows build,
specifically:
(1) Various flags that are needed in the cmake system, specially dealing
with /MD, /MT, cuda, cudnn, whole static linking, etc.
(2) Contbuild scripts based on appveyo.
(3) For Windows build, note that one will need to use "cmake --build" to
build stuff so that the build type is consistent between configuration and
actual build. see scripts\build_windows.bat for details.
(4) In logging.h, ERROR is already defined by Windows. I don't have a good
solution now, and as a result, LOG(ERROR) on windows is going to be
LOG(INFO).
(5) variable length array is not supported by MSVC (and it is not part of
C++ standard). As a result I replaced them with vectors.
(6) sched.h is not available on Windows, so akyrola 's awesome simple
async net might encounter some slowdown due to no affinity setting on
Windows.
(7) MSVC has a bug that does not work very well with template calls inside
a templated function call, which is a known issue that should be fixed in
MSVC 2017. However for now this means changes to conv_op_impl.h and
recurrent_net_op.h. No actual functionalities are changed.
(8) std host function calls are not supported in CUDA8+MSVC, so I changed
lp_pool (and maybe a few others) to use cuda device functions.
(9) The current Scale and Axpy has heavy templating that does not work
well with MSVC. As a result I reverted azzolini 's changes to the Scale
and Axpy interface, moved the fixed-length version to ScaleFixedSize and
AxpyFixedSize.
(10) CUDA + MSVC does not deal with Eigen well, so I guarded all Eigen
parts to only the non-CUDA part.
(11) In conclusion, it is fun but painful to deal with visual c++.
Differential Revision: D4666745
fbshipit-source-id: 3c9035083067bdb19a16d9c345c1ce66b6a86600
Summary: Renamed ElementwisePower to Pow for better discoverability. Added CUDA version and Gradient + tests.
Reviewed By: kennyhorror
Differential Revision: D4665550
fbshipit-source-id: dd33d8ad3917d71504e363ab397af50d38a63b1f
Summary: Add a simple op to sum the elements, with optional averaging. This is basically a copy of AverageLossOp, which we should alias to this. And maybe develop this towards a generic norm op.
Reviewed By: jhcross
Differential Revision: D4664591
fbshipit-source-id: 0e0c0efe9e415e2ad2feecfa42b03db2c83bee70
Summary: Due to popular demand, added an op to compute element-wise square + gradient for it (just for the fun of it).
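For reference (a trivial NumPy sketch of the math, not the operator code), the gradient of the element-wise square is 2·x scaled by the upstream gradient:
```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
dy = np.ones_like(x)   # upstream gradient
y = x ** 2             # forward: element-wise square
dx = 2.0 * x * dy      # backward: d(x^2)/dx = 2x
```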
Reviewed By: Yangqing
Differential Revision: D4664797
fbshipit-source-id: 0a29c7c249fdc72f51412bebd6ae352a7801cf05
Summary:
Taking ownership of a std::unique_ptr is a bit awkward. It's actually
useful to reuse the underlying store and create multiple prefix stores
against it.
Reviewed By: andrewwdye
Differential Revision: D4662354
fbshipit-source-id: eaf62f7d5a97d6ee848252ff3124c28da349f6f2
Summary:
This changes the constructor prototype of the broadcast algorithms.
They now take the rank of the root process and the rank of the root
pointer. The root process now also broadcasts locally, among the
specified pointers, in addition to broadcasting to its peer processes.
The broadcast tests are made more robust to use a different value at
every index for every buffer, like the allreduce tests. To accomodate
multiple input buffers for CPU side algorithms, I added a Fixture
helper, and renamed the existing Fixture class to CudaFixture.
The broadcast tests contain a few TODOs since they don't vary the root
process or root pointer yet. I anecdotally verified this does work,
but didn't want to include the necessary changes to do so in this
commit (it requires some changes in rendezvous and NCCL code). A fix
for this is forthcoming.
Reviewed By: andrewwdye
Differential Revision: D4661635
fbshipit-source-id: c069e0d4e8f676a63efd74b15ea1156adcc09477
Summary:
After this, we should have contbuild guarding the Windows build both with
and without CUDA.
This includes a series of changes that are needed to make Windows build,
specifically:
(1) Various flags that are needed in the cmake system, specially dealing
with /MD, /MT, cuda, cudnn, whole static linking, etc.
(2) Contbuild scripts based on appveyo.
(3) For Windows build, note that one will need to use "cmake --build" to
build stuff so that the build type is consistent between configuration and
actual build. see scripts\build_windows.bat for details.
(4) In logging.h, ERROR is already defined by Windows. I don't have a good
solution now, and as a result, LOG(ERROR) on windows is going to be
LOG(INFO).
(5) variable length array is not supported by MSVC (and it is not part of
C++ standard). As a result I replaced them with vectors.
(6) sched.h is not available on Windows, so akyrola 's awesome simple
async net might encounter some slowdown due to no affinity setting on
Windows.
(7) MSVC has a
Closes https://github.com/caffe2/caffe2/pull/183
Reviewed By: ajtulloch
Differential Revision: D4657831
Pulled By: Yangqing
fbshipit-source-id: 070ded372ed78a7e3e3919fdffa1d337640f146e
Summary: Simple elementwise Max implementation for CUDA. Given N inputs, it will do N-1 pairwise maxes. I am not sure if it would be much better to iterate through all the inputs in the kernel, since this has better locality. We can also optimize later.
Reviewed By: Yangqing
Differential Revision: D4659953
fbshipit-source-id: 3a23b7fb3dbdf1d43bf3134ece03af4a791844dd
Summary:
This diff modifies the way we specify metrics: from a reporter that should know in advance all the blobs it should access, to a reporter that is connected through schema.
This diff also reports an arbitrary number of learning curves to Flow and provides a really flexible way to specify all the metrics we care about.
TODO: Modify model helper to allow providing intermediate results for reporting
TODO: Add evaluation net (instead of prediction net).
TODO: Move all other places in DPER 2.0 to use that abstractions instead.
TODO: Get rid of LogScoreEstimator in favor of metric that is going to be really suiting our needs.
Reviewed By: azzolini, dzhulgakov, kittipatv
Differential Revision: D4577548
fbshipit-source-id: 3515bd41e0f92263ff90ce2f7207abf65d01b1f7
Summary: so that the utils can be used by a wider audience.
Reviewed By: xianjiec
Differential Revision: D4637462
fbshipit-source-id: f0695f430902aef26360efa511069b3755eaf52a
We were keying hooks by RemovableHandle id. However, we don't hold onto
handles, and ids of dead objects can be reused. This replaces id(handle)
with a global counter.
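A minimal sketch of the idea (names are made up, this is not the actual torch code): key hooks by a monotonically increasing counter instead of id(handle), since Python can reuse the id of a garbage-collected handle.
```python
import itertools

_hook_id_counter = itertools.count()

class ToyRemovableHandle(object):
    """Each handle gets a never-reused key into the hooks dict."""
    def __init__(self, hooks_dict):
        self.hooks_dict = hooks_dict
        self.id = next(_hook_id_counter)  # unique for the lifetime of the process

    def remove(self):
        self.hooks_dict.pop(self.id, None)
```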
This is similar to THCCachingHostAllocator_recordEvent() but on CUDA
allocations. It's useful for overlapping copies with computation. The
workflow is approximately:
0. allocate dst tensor on copy stream
1. copy from CPU to GPU on copy stream
2. synchronize the main stream with the copy stream via
cudaStreamWaitEvent
3. THCCachingAllocator_recordStream(dst, main_stream)
The recordStream() call is necessary to prevent the dst tensor from
begin reused on the copy stream before the main stream finishes work.
Previously, you would need to insert a second cudaStreamWaitEvent before
dst is freed to force the copy stream to wait on the main stream.
Summary:
To avoid the NumPy warning: "using a non-integer number instead of an integer will result in an error in the future".
Closes https://github.com/caffe2/caffe2/pull/64
Differential Revision: D4658348
Pulled By: Yangqing
fbshipit-source-id: 3a1b33cbb27849bc167b08147d078e8d487567f4
Summary: Added validation for load op when doing load_all by refactoring validation logic for loading specific blobs.
Reviewed By: kennyhorror
Differential Revision: D4641986
fbshipit-source-id: e0075a12188ca09d7628add72c143b40d5d9f382
Summary:
In the past we have moved most of the CHECKs to CAFFE_ENFORCE (exceptions).
However, we kept the name "*_CHECK" for cuda calls, and that caused some
confusion especially in the destructor calls: while our destructors are not
written to handle exceptions, these CUDA_CHECKs could actually throw some
exceptions.
As a result, this diff
(1) Renames all cuda related "*_CHECK" to "*_ENFORCE"
(2) Explicitly marked the destructor of core Caffe2 classes as noexcept
(3) Added proper, really-CHECK cuda check macros, and used those in the
corresponding destructors.
This should not change any of existing functionality.
Reviewed By: dzhulgakov
Differential Revision: D4656368
fbshipit-source-id: 32e3056e66c0400156c5ca0187b6151cf3d52404
Summary:
On Windows, it is necessary to use `_aligned_free` instead of `free` when `_aligned_malloc` was used before.
Closes https://github.com/caffe2/caffe2/pull/184
Differential Revision: D4657929
Pulled By: Yangqing
fbshipit-source-id: 476a9b702a1ee37d5e16483087be2ccdc7bf4259
Summary:
Our internal update of gflags in b0e325ce69 called for this change.
Closes https://github.com/caffe2/caffe2/pull/185
Differential Revision: D4657928
Pulled By: Yangqing
fbshipit-source-id: bdf9fdc63a16dafc28b690598463ec72e3c50f40
Summary:
- Replaces the strip_regex implementation in SaveOp. It deletes the prefix of blob names up to a given substring.
- Adds the same functionality to LoadOp. This is needed for loading checkpoints that were stored using the strip_prefix feature.
Closes https://github.com/caffe2/caffe2/pull/129
Differential Revision: D4512234
Pulled By: Yangqing
fbshipit-source-id: d926c1c5adcc7a711365cede11f21421bb7d4138
Summary:
This allows one to report the CPU memory allocation over a Caffe2 run.
To enable it, pass --caffe2_report_cpu_memory_usage on the command line.
This has to happen before any Caffe2 allocation has taken place.
Reviewed By: salexspb
Differential Revision: D4641353
fbshipit-source-id: 13a4315f63154edad9e925bb5c276cad4fe78c70
Summary:
I have seen a stress run crash with unexpected state. Adding these
assertions will give more information when it happens again.
```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at gloo/transport/tcp/pair.cc:407] false. Unexpected state: 5
```
Reviewed By: andrewwdye
Differential Revision: D4652216
fbshipit-source-id: e787f4097f5ab32367dd9fa5a336d0389b97e955
Summary: We are converting MetaNetDef from Thrift to protobuf. The protobuf uses binary encoding, and since bytes is a superset of string, we change the field to bytes so that no warning is generated when compiling caffe2.
Reviewed By: Yangqing
Differential Revision: D4635581
fbshipit-source-id: 916b799e1fb9466658e1dd198bfb5c6928f22488
* Use TH_INDEX_BASE when verifying dimension for cat
* Adding tests for cat when no dimension is specified.
- Also renamed ldimension to cat_dimension to be more specific.
Summary: fix a check for whether the net is a net_dict
Reviewed By: kennyhorror
Differential Revision: D4647493
fbshipit-source-id: e0a62fc5847c99c85857c5635b4e39d59c66d5ce
Summary:
the existing code uses vector<T> to store the given tensor and then copy to output.
If T=bool, vector<bool> stores the data as bits and then copy does not work.
we use TensorCPU to store it instead.
Also add unittest.
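The vector<bool> pitfall can be reproduced in a small standalone snippet (illustrative only, not the operator code):
```
#include <cstring>
#include <iostream>
#include <vector>

int main() {
  std::vector<float> vals = {1.f, 2.f, 3.f};
  float out_vals[3];
  std::memcpy(out_vals, vals.data(), sizeof(out_vals));  // fine: contiguous floats

  std::vector<bool> flags = {true, false, true};
  bool out_flags[3];
  // std::memcpy(out_flags, flags.data(), sizeof(out_flags));
  // ^ does not even compile: vector<bool> is a bit-packed specialization with
  //   no data() member, so a bulk copy into a tensor buffer is not possible.
  for (size_t i = 0; i < flags.size(); ++i) out_flags[i] = flags[i];

  std::cout << out_vals[0] << " " << std::boolalpha << out_flags[0] << "\n";
  return 0;
}
```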
Reviewed By: kennyhorror
Differential Revision: D4622325
fbshipit-source-id: 95c27b5d1cfbc836d2419d01cacde5a3172f4d7e
Summary:
Verify shape and type inference in op unit tests via assertReferenceChecks(). For now, catch exceptions from InferShapeAndTypes() and log a warning.
TBD: Determine whether there are existing inference/output mismatches, and if so, change test asserts to warnings until they are resolved.
Differential Revision: D4639343
fbshipit-source-id: 605e72f53198e1a100fe7ba18b72c34c9ddbb727
Summary:
The fields are public so their names should not end with an
underscore.
Reviewed By: andrewwdye
Differential Revision: D4645038
fbshipit-source-id: c12b47affbe511383a4722717a06abb61918473b
- Code was using the specified dimension, which could be negative
- Changed the cat_dimension variable to be more explicit
- Fixed code to use the cat_dimension variable
Summary:
- Do not set a default for cudnn_ws. The default set by the cuDNN ops will be used.
- Do not use cudnn_ws for MLP.
- Do not run the benchmark if the required args are not set. Previously it tried to run and errored out.
Closes https://github.com/caffe2/caffe2/pull/177
Differential Revision: D4633143
Pulled By: Yangqing
fbshipit-source-id: e89a7d01984e599d92a330d0ee4ba106feba65b8
Summary:
The NCCL code used in CUDA-aware allreduce does local reduction of N
buffers prior to putting anything on the wire. Supporting this in the
benchmark tool to measure the impact under various configurations.
Other minor tweaks in this change:
* Specify sub-second iteration time
* Templatize allreduce benchmarks (the algorithms share a constructor
prototype)
Reviewed By: andrewwdye
Differential Revision: D4639517
fbshipit-source-id: f7417d3e9f79278a3b1eca48d779f48b77e5260c
Summary: Cuda algorithms take an optional set of device streams to sequence operations. If streams are provided, the algorithms should enqueue final output buffer operations on the associated stream and return asynchronously. Destructors that allocate streams/events should synchronize before tearing down.
Reviewed By: pietern
Differential Revision: D4636447
fbshipit-source-id: 32ec2adc214c83b0b4bc0fff8993ab196459117b
Summary:
With this change, every buffer gets assigned a different
value at every index. This means reordering of segments (e.g. in the
chunked algorithm) would surface as test errors.
Reviewed By: andrewwdye
Differential Revision: D4636368
fbshipit-source-id: 464eb1515d1590e12481961d427a92e2ebb3be82
Summary: CUDA documentation detailing high-level support for CUDA in gloo algorithms, usage of streams, and synchronizing memory management.
Reviewed By: pietern
Differential Revision: D4633120
fbshipit-source-id: d88e230c8dc82fe48cda0f401b61758fa4f07f2e
Summary:
Synchronous mode means using the calling thread instead of the device
thread for completion handling. Since this saves a context switch in
the critical path, this is very beneficial for low latency algorithms.
For example: the p99 of a 4-way barrier drops from 17us to 4us.
Reviewed By: andrewwdye
Differential Revision: D4626948
fbshipit-source-id: 013b1680497589fe5ad0bca38600bce6a410200b
Summary:
All pairs created by a device would use the same completion queue.
Supporting sync mode that way is difficult, as there is no way to
filter completions for a particular pair. This change refactors this
to use a single completion queue per pair so that this is no longer an
issue. This change is a preparation for supporting synchronous mode
(where the calling thread itself will poll the ibv library for
completions instead of the device thread).
This change also includes a refactoring of the way transient memory
regions are handled so that they are properly deregistered and
deallocated when no longer needed.
Reviewed By: andrewwdye
Differential Revision: D4625146
fbshipit-source-id: 21bf5ab321534fbd5c03f12049c10fc67da68944
Summary: std::atomic was not defined for cuda.cu.
Reviewed By: andrewwdye
Differential Revision: D4624611
fbshipit-source-id: 973bba10026e065667d6a576055d00505ee02d62
Summary: Allow gloo consumers to assign a mutex to synchronize CUDA malloc/free and NCCL operations.
Reviewed By: pietern
Differential Revision: D4622135
fbshipit-source-id: 60acd7c01a677a0df5415fe38e6ef5a2e7c8606a
Summary:
Update the cuDNN RNN interface (mostly fixing the ordering of arguments). Set a seed so that the test passes consistently.
Closes https://github.com/caffe2/caffe2/pull/62
Reviewed By: Yangqing
Differential Revision: D4348966
fbshipit-source-id: f9b56be37739e5bffabec130e3407492b2aef656
Summary: The shape inference did not check for spatial mode.
Reviewed By: andrewwdye
Differential Revision: D4638218
fbshipit-source-id: f15419738587013dea39e04a3da086890938c4e2
Summary:
MSVC 2015 has known bugs with template functions, so these changes aim to fix them - no functional differences introduced.
Closes https://github.com/caffe2/caffe2/pull/179
Reviewed By: ajtulloch
Differential Revision: D4635241
Pulled By: Yangqing
fbshipit-source-id: a282a96e1e626e9440c1e3f3cb15b5b1fa710887
Summary:
At the moment LocalSession creates a new workspace if none is provided. As a
result, anything that has been executed in a local session is not
available to the external caller, i.e. everything that uses SingleRunner can
only observe side effects and not actually access intermediate blobs.
This diff modifies LocalSession to run in the current workspace instead (unless
this has some really weird effects because we rely on the privateness of the
workspace, it should work).
Differential Revision: D4634743
fbshipit-source-id: 975bed154c7ca215dc3fc0d60f05a7c092711482
Summary: vigneshr has been randomly experiencing that the process does not exit in the end. We don't know what causes this, so this will help in two ways: (1) by putting timeout_guard.EuthanizeIfNecessary(600) at the end of the operator, you ensure that the process is killed in 10 minutes, allowing for a retry; (2) this killing will cause Python stack traces to be dumped, helping debug the real issue.
Differential Revision: D4635781
fbshipit-source-id: b558418c80671c00effdd514e4ddc01e935c95df
Summary: Add SparseNN workflow for feed. I haven't fully thought about the change needed for ads, as I added a property called 'preproc_output_schema' for LayerModelHelper.
Reviewed By: xianjiec
Differential Revision: D4585796
fbshipit-source-id: 060d08f4beb928e7e7863f2e563f612c358951fb
Summary: See http://bugs.python.org/issue6721. Since everstore loaders use ProcessPoolExecutor, which is based on forks, and there was perhaps an update of the numpy library or some unrelated library, we started getting subprocesses stuck at np.random.randint(). Also changed logging to prints, since logging is known to have issues with multiprocessing. See https://www.prod.facebook.com/groups/fbpython/permalink/1438647216176641/
Differential Revision: D4633725
fbshipit-source-id: ae948a1827c71a3a2119d6a3248706728984df31
Summary:
A bit too much stuff in one diff, sorry:
1. Add inference for gradient types by using the fact that x_grad is the gradient of x and must be of the same shape. Using string matching for this is kind of awkward, but in addition I rely on the operator actually being a gradient op.
2. dzhulgakov was right, the scalar shape is () and not (1). Sorry, my earlier claim was #fakenews.
3. Added inference functions for MakeTwoClass, MomentumSGDUpdate and cross-entropy ops.
Reviewed By: dzhulgakov
Differential Revision: D4569758
fbshipit-source-id: 0db13f33819777fdddefe21d4b1ebf906fcaf98c
Summary: Just generate some random data and put it through an LSTM (Caffe2 RNN based), using its own output as the gradient value for benchmark purposes. With default parameters it fits my dev GPU memory. With the default parameters provided in this diff I got 300k entries per second processed. These entries are split into blocks of seq_length * block_size. Each entry is of size hidden_dim; the LSTM takes hidden_dim-sized input and produces output of the same size.
Reviewed By: salexspb
Differential Revision: D4605815
fbshipit-source-id: dd529302a0a93e8711784c67e4c777c8d6a8cdf4
Summary:
Add cudnn v6 support, including testing support for dilated convolution.
Add a check to ensure that the versions of cuDNN used to compile Caffe2 and run it are compatible
Closes https://github.com/caffe2/caffe2/pull/85
Reviewed By: bwasti
Differential Revision: D4387690
Pulled By: Yangqing
fbshipit-source-id: 312960134398dd4afe6ee0c01cdc160046c904e8
Separates out non-Python part of AutoGPU. This also compiles without
CUDA which is useful for generic tensor code.
Also fixes a bug where THCPAutoGPU may not always switch the device:
THCPAutoGPU guard(-1);
guard.setDevice(0);
guard.setDevice(1);
guard.setDevice(0); // would not switch back to 0
Summary:
previously the fp16 type was supported in the SparseLengthsSum operator; now it
works in all other segment operators as well.
Reviewed By: dzhulgakov
Differential Revision: D4624312
fbshipit-source-id: c9d72110e3762167270bb088405eaf9c56e88493
Summary:
(1) Since cub seems to be a better memory pool I made cnmem optional.
(2) Added MKL testing since Intel now provides an apt source, but that doesn't seem to work right now.
(3) Added cmake file for nervana gpu.
Closes https://github.com/caffe2/caffe2/pull/175
Differential Revision: D4627056
Pulled By: Yangqing
fbshipit-source-id: 9676fa32fce2a29574c0bf7e9d31660b5535cb51
Summary: Remove TODOs where vectorization with Eigen is not needed, based on D4565679 feedback.
Reviewed By: ajtulloch
Differential Revision: D4623239
fbshipit-source-id: c949ee9bc295e87a87c333d68d958f0abfa71fd4
Summary:
This diff tries to address one of the concerns that Xianjie has had - the requirement to create a layer for every operator and to pass shapes and other info around.
The basic idea of the diff:
1. Try to create a layer with a given name, but if it's not available, fall back to an operator with that name (which is expected to have no parameters).
2. For all operators that we're adding through this functional style of creation, try to use the C2 shape/type inference logic to get the output type. If we fail, just return an untyped record and expect the user to annotate it when it's really needed.
Reviewed By: xianjiec
Differential Revision: D4408771
fbshipit-source-id: aced7487571940d726424269970df0eb62670c39
Summary:
If init_params is False, the parameters should not be initialized.
This is particularly important when testing a model that provides values for these BN parameters.
Closes https://github.com/caffe2/caffe2/pull/174
Differential Revision: D4621791
Pulled By: Yangqing
fbshipit-source-id: 518443925990a12c1d5729b0971ebe19ba5d8998
Summary: It is better for the workers to share the Python-side queue, since I saw a case where workers assigned to one GPU were lagging behind the others. Also, reduced logging as requested by rpenggithub.
Differential Revision: D4620487
fbshipit-source-id: 73353f9570b07788c8cd71c9fec9308cd93a44dd
Summary: Replace for loop with Eigen operations in method rmsprop_update
Reviewed By: ajtulloch
Differential Revision: D4620691
fbshipit-source-id: 89cd570ecdf56a1255be4a0959ee711addc9696b
NCCL can deadlock if cudaFree() is called while it's launching kernels.
This exposes a mutex that can be held to prevent cudaFree() calls in the
caching allocator.
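A hypothetical sketch of the guard; the accessor name and call sites below are illustrative, not the actual THC API:
```
#include <mutex>
#include <cuda_runtime.h>

// One mutex shared by the caching allocator and the NCCL launch sites, so a
// cudaFree() can never interleave with an NCCL kernel launch.
std::mutex& cuda_free_mutex() {
  static std::mutex m;
  return m;
}

void cached_free(void* ptr) {
  std::lock_guard<std::mutex> lock(cuda_free_mutex());
  cudaFree(ptr);
}

void launch_nccl_collective(/* ncclComm_t comm, buffers, stream, ... */) {
  std::lock_guard<std::mutex> lock(cuda_free_mutex());
  // ncclAllReduce(...);  // protected from concurrent cudaFree() calls
}
```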
Summary: Inference function for the Im2ColOp: caffe2/caffe2/operators/im2col_op.cc.
Differential Revision: D4608663
fbshipit-source-id: d26ffb403c2acb7a5ead5f58f044ee3340c8311a
Summary: Replace for loop with Eigen operations in method ElementWiseDivide
Reviewed By: Yangqing
Differential Revision: D4602516
fbshipit-source-id: 6b19de8190d5e29ffe52359d0cd0c27cf03c52e2
Summary:
The memory pool implementation was written back in the days when I only had
one GPU, and as a result I overlooked the fact that:
(1) CNMEM needs to have the same current device for the allocation and
deallocation to take place correctly.
(2) cub needs the device id of the pointer passed in for proper deallocation.
As a result, since C2 right now switches contexts very frequently, I added a
global map to keep record of the pointer affiliations, and use that for
deallocation when we are at another context.
I have not tested the speed but assuming that std::unordered_map is not too bad
this should be fairly fast.
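Roughly, the bookkeeping could look like the following sketch (names and structure are illustrative, not the actual memory pool code):
```
#include <cuda_runtime.h>
#include <mutex>
#include <unordered_map>

// Global map from pointer to the device it was allocated on, so that
// deallocation can switch to the right device even if the current context
// has changed in the meantime.
static std::unordered_map<void*, int> g_ptr_device;
static std::mutex g_ptr_device_mutex;

void* PoolAlloc(size_t nbytes) {
  int device = 0;
  cudaGetDevice(&device);
  void* ptr = nullptr;
  if (cudaMalloc(&ptr, nbytes) != cudaSuccess) return nullptr;
  std::lock_guard<std::mutex> lock(g_ptr_device_mutex);
  g_ptr_device[ptr] = device;
  return ptr;
}

void PoolFree(void* ptr) {
  int device = 0;
  {
    std::lock_guard<std::mutex> lock(g_ptr_device_mutex);
    device = g_ptr_device.at(ptr);
    g_ptr_device.erase(ptr);
  }
  int previous = 0;
  cudaGetDevice(&previous);
  cudaSetDevice(device);   // free on the device the pointer belongs to
  cudaFree(ptr);
  cudaSetDevice(previous);
}
```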
Differential Revision: D4617300
fbshipit-source-id: e8bb366616cd93504e7d68b7f999011cd49caba5
Summary:
Mysterious deadlocks after epoch has finished have occured randomly but quite frequently recently for myself, vigneshr and others. Looking at a stack trace of vigneshr's job (P57129798), I noticed a couple of threads were calling BlobsQueue.blockingWrite (or something like that). That call stucks when the caffe2/c++ side queue is at capacity (we use capacity of 4 with data workers). So in cases when this call was just being made while the script was to be terminated, the thread did not close and the whole process did not close either (not completely sure why that is since thread is a daemon thread, but this might be a flow-related issue since we run inside a flow container).
This is quite easy to fix: just call CloseBlobsQueue() when terminating the process. I modified coordinator.stop() and wait_for_finish() to return a status code based on whether threads that were joined actually closed within the 1.0sec timeout. This allowed creating an unit test to test for this issue. Before my change, the unit test failed.
Reviewed By: pietern
Differential Revision: D4619638
fbshipit-source-id: d96314ca783977517274fc7aadf8db4ee5636bdf
Summary: The AllReduceChunked algorithm currently performs the local reduce/broadcast of local device buffers in host memory. This diff updates the algorithm to execute the local reduce/broadcast steps using NCCL operations before copying a single device buffer to/from host memory.
Reviewed By: pietern
Differential Revision: D4587441
fbshipit-source-id: 4de689f59a6cf898b8eecd3c3b9f57f77124c0e3
* Add more detail to CUDA documentation
Also adds better cross-linking to the pages that discuss relevant topics.
* Adds recommendation to torch.save docs
* Make the version numbers for the docs dynamic
Might need tweaks for beta, 1.0, etc.
Summary:
It looks like for most of the types there is no way we can get them (except as
the result of an operation on top of some other tensor), which was pretty
unfortunate for cases where we want to do partial type inference (I was trying
to do so in D4408771).
This diff adds more possible types for ConstantFillOp. Please let me know
if I'm missing anything. The only part that worries me a bit is a possible
GetArgument with types that support only a subset of the range (but it looks like
that can happen even now for i32 vs i64).
Reviewed By: dzhulgakov
Differential Revision: D4611482
fbshipit-source-id: 77917fd5e1d18a1b860e022ede4518143d0f3f26
Summary:
Reduce the test input size for the instance norm gradient check. The larger size is currently timing out on stress tests.
e.g. failed: Timeout: Ran out of time before finding a satisfying example for test_instance_norm_gradients. Only found 2 examples in 125.39s.
Reviewed By: Yangqing
Differential Revision: D4608828
fbshipit-source-id: ce17a3ad28752d808efcbf79f1ea4238e63fb005
Backend is SpatialDilatedMaxPooling, so change 3D input (N*C*L)
to 4D size (N*C*1*L). Then output indices will range from 0 to L.
This range will not cause UnMaxPool1D error.
Signed-off-by: Zhou Chang <achang.zhou@gmail.com>
Summary:
(Stacked with D4553941). Using the new net type increases QPS to 470K, close to Torch numbers (there are other optimizations that need to be done, particularly the log-estimator). Previously, QPS was close to 250K. This was when having reuseData=true.
Includes a small bug-fix to the new net type.
Differential Revision: D4594704
fbshipit-source-id: 21e7b0ca4173b036f45d3ba95c218792b31e7398
Summary:
For code in the layer model helper and layers, it is intentional not to have a NameScope by default.
This looks like another place that may need a default NameScope.
https://fburl.com/wdwtxp0m
Reviewed By: kennyhorror
Differential Revision: D4606971
fbshipit-source-id: b560bf59d3242e3f9443cd5aeda5c7e2e4e89079
Summary: D4348953 added support for accuracy for top_k>1, which is only supported on CPU, requiring data to be copied to CUDA. But that diff did not take into account that we have top_k=1 version of AccuracyOp for CUDA. This diff ensures we use the CUDA version for top_k=1.
Differential Revision: D4607767
fbshipit-source-id: 8becda23890343043eb79ad04e4c6196e9010f0c
Summary: as title. Add num of examples limit for group collect. Add option for enabling sum loss in BatchLRLoss
Reviewed By: xianjiec
Differential Revision: D4602311
fbshipit-source-id: 5b2a244f1f0e9f1ab0f4590e94828fd18d018d8d
Summary: curandGenerateNormal can only generate arrays whose length is a multiple of 2. The MSRAFill and GaussianFill operators use the RandGaussian utility method, which in turn uses curandGenerateNormal. This is a test which runs the operators on both devices to generate odd-sized random arrays.
Differential Revision: D4602819
fbshipit-source-id: e65f5c731e925886cfa14afff482f7053bd020a0
Summary:
This at least partly fixes a recurring problem when using everstore data input (or any other data input with multiprocessing): if the main process dies violently, the child processes are not killed. One cause for this was the TimeoutGuard(), as it called os._exit(1), which prevents any cleanup from happening. I changed it to send a SIGINT signal to the PID and, if the process is still alive after 10 seconds, to call os._exit(1). In my tests, this works well.
Did some other cleanup:
- improved logging of inputs/sec in data_workers
- removed redundant atexit() handling, as the multiprocessing pool does it itself
Differential Revision: D4602550
fbshipit-source-id: 64d4526a2a3625d163d23f078286e719d56998f4
Summary:
Add two arguments to the DotProductOp operator: `force_same_dim` (1 if we want
DotProductOp to only accept two tensors of equal dimension, 0 otherwise) and
`pad_value` (only useful when force_same_dim = 0; pads the tensor with the smaller
size to the same size as the other one).
Differential Revision: D4502619
fbshipit-source-id: 46f7da710c6f6365f76a7af6234c34c7f656be62
Summary:
Implementation of ##LSTMWithAttention##
Still TBD:
1. There are problems with backpropagation, because the gradient is not implemented for ops with broadcasting
2. I need to make initial_recurrent_state be of shape [dim] rather than [1, batch_size, dim], so one doesn't need to provide batch_size to LSTMWithAttention
Differential Revision: D4298735
fbshipit-source-id: 8903fcff4d6a66647ee6d45a6ef28803fc3091e5
Summary:
The context here is that we want fblearner predictor to handle float features (D4601334).
Since predictor processes a single example at a time, it makes sense to specify a single
float feature as a float scalar tensor.
But if the Caffe2 net has a SigridTransforms operator, it expects everything to have an
additional dimension so it can be called with multiple examples.
Being able to Reshape a scalar into a 1-d tensor will enable us to mix SigridTransforms
with other native Caffe2 operators.
Reviewed By: ender-wieczorek
Differential Revision: D4602675
fbshipit-source-id: 8b33876bf47bc341385fd7ac19cd1fd7f67a7ccf
Summary:
It could be that only the first item
in the batch was really used, in case the rest of the memory was 0. Or if the
memory there held a big positive integer, then the whole sequence was used. So whether we used the rest of the batch depended on our luck :)
Reviewed By: Yangqing
Differential Revision: D4599569
fbshipit-source-id: ae89cee796bbcbc232e4abcab71dee360b0d8bc6
Summary:
In-place is ~30% speedup, but needs a change to torch2caffe
or a graph rewrite on the client.
Differential Revision: D4577582
fbshipit-source-id: c31bf8ba97f4fa4cedf355cf2475eb7bab48b304
Summary:
The cudnn_ws arg was already there. This PR only uses that arg when the model is created.
Closes https://github.com/caffe2/caffe2/pull/164
Differential Revision: D4598443
Pulled By: Yangqing
fbshipit-source-id: c2e83f73059360ecf2fedf2c62be7cacbb4034ca
Summary: we may not need dense feature inputs in some models (e.g., double helix).
Reviewed By: dzhulgakov
Differential Revision: D4568755
fbshipit-source-id: 6850508f86fafb53f81783b2a2a38776be5455d7
Summary: Another part of making DPER compatible with half-floats. This diff adds support for fp16 to the segment reduction operators used in DPER.
Reviewed By: dzhulgakov
Differential Revision: D4587560
fbshipit-source-id: 0ae10648a7286a820bffaee802464dd9464584bc
Summary:
First part of adding half-floats support to DPER 2.0. Let's add an option use_half_floats to enable converting some weights of the model from fp32 to fp16 before saving it to predictor models parts. For now it's for SparseLookup layer's embeddings. All conversion is done after training is finished and saved models are ready to be used on remote predictors as-is (they will be stored compacted in memory). New fp16 blobs are saved to the model instead of original ones, under the same names, so we don't modify MetaNetDef at all.
Next steps:
1) support on delivery side -- operators working with these blobs should support both float and float16 input types
2) benchmark performance to make sure there is no regression
a) of serialization
b) of delivery
3) support realtime training (I'm thinking about adding new pre-publishing net which will be executed each time the realtime trainer stops to publish a new snapshot)
Depends on D4567304
Reviewed By: kennyhorror
Differential Revision: D4571710
fbshipit-source-id: 19967a17d3bd84878d66e8c0ed8c5342bf38d979
Summary:
This operator always outputs dense gradients regardless of
the input gradients. For the forward pass, it passes inputs to outputs in place.
Reviewed By: xianjiec
Differential Revision: D4582511
fbshipit-source-id: 7eb2c5d2142aa05d373f06cab1e7f89d8b747d34
Summary: Set up a server node that periodically gathers the values of all nodes' perf counters, allowing them to be published at once.
Reviewed By: dzhulgakov
Differential Revision: D4555116
fbshipit-source-id: 8e49ac8353b52b2be82aedf305762478e7fa687a
Summary:
This diff introduces a new net type, 'singlethread_async', which is based on my investigation of DPER/hogwild MLP bottlenecks.
It uses only one CPU thread per GPU, but multiple CUDA streams on each GPU. This is implemented by having each Net submit its list of operators to
a central GPU-specific executor queue and a thread that executes them asynchronously. This executor takes all tasks in the queue, executes them on separate CUDA streams, and then waits on them at the end. This solution can achieve >95% GPU utilization on 8 GPUs when a sufficient number of workers is used.
FYI: I also tried fancier solutions such as using cudaStreamCallbacks(), but they did not have as good performance.
Improved the dper bench by adding the MomentumSGDUpdate operations and adding speed test capabilities. During my testing I also noticed that the startup costs for initializing CUDA streams and contexts are high, so it is important to do a warm-up.
Reviewed By: Yangqing
Differential Revision: D4553941
fbshipit-source-id: bb00524bef653d75de026dd64097b8d9b7a0acb3
Summary:
We were running into a problem where a Job could not be pickled. It needs to be pickled in order for the master flow operator to execute it using the session.
This creates the concept of a "compiled" Job that pretty much only stores protobufs with the Jobs to be executed, avoiding any issue with pickling.
Reviewed By: dzhulgakov
Differential Revision: D4554799
fbshipit-source-id: 2ee9877ca49a796d51925e5ec917436e3d930984
Summary:
Previously we had several limitations for a reporter net:
- it needed to be a net, not an execution step
- only one was allowed per execution step, with a single interval
Now, "reporter nets" become reporter steps, and multiple of them can be specified with different timeouts.
Reviewed By: dzhulgakov
Differential Revision: D4583686
fbshipit-source-id: ad7266e16f96e7829fd24dcc1f165f39e9db573d
Summary:
This script will attempt to determine files that will be useful for building with the correct python version. Currently on macOS with various python installations CMake fails to determine the correct location of python libraries.
Closes https://github.com/caffe2/caffe2/pull/163
Reviewed By: Yangqing
Differential Revision: D4594954
Pulled By: bwasti
fbshipit-source-id: c2b750ee9608a02fad4ce2f2293f5fa54dc7011c
Summary: this fixes the bug in the Eigen implementation when calculating cross-entropy
Reviewed By: salexspb
Differential Revision: D4582078
fbshipit-source-id: 4c92047e9dbbe219fcbef618a45c584c2fbfaad5
Summary: Removed Model API because no one {seems to,should} be using it
Reviewed By: Yangqing
Differential Revision: D4575126
fbshipit-source-id: 174d39e9aa46750f1fae8295f7e1e5452559af33
Summary:
- Key-value store for counters.
- Counters are updated via macros that also export USTD probes.
- Counter values can be exported using caffe2 operators.
- Snapshot mechanism for tracking time-window counter values.
Reviewed By: dzhulgakov, pietern
Differential Revision: D4553761
fbshipit-source-id: 25a1a91a3168dcff2159c6fba7b357d3fd3aa9bf
Summary:
Work may be queued on CUDA streams for asynchronous execution. The
memory backed by pointers passed to any algorithm can therefore be
mutated after constructing an algorithm instance. By also passing in
the streams these mutations happen on, the algorithms can synchronize
with these mutations to ensure no invalid data is used.
By passing in these streams, any work done by these algorithms will
*also* be queued, which effectively removes a single synchronization
step from any algorithm run.
Differential Revision: D4589394
fbshipit-source-id: 0c8cd6ba9c9018f33d6f4c55a037083fc4164acb
Summary: I was mistakenly calling the non-chunked algorithm for the chunked test.
Reviewed By: pietern
Differential Revision: D4580160
fbshipit-source-id: 9d62a68e9e86cc6e596d90ff8854c585a0e8855c
Summary: Fix hard coded CPUContext and add CUDA support for shape function
Differential Revision: D4577053
fbshipit-source-id: b515e52c39c02aa1600ccb1c3e559c9a5a0b718c
Summary:
This diff adds the ability to train a multiclass classifier on a sampled subset of classes. This basically implements what is described in https://arxiv.org/abs/1412.2007 without the sampling probability correction. Since this implements uniform sampling, the sampling probabilities cancel out in the softmax anyway.
The trick to make this work is to have 2 different nets for prediction and training, both sharing parameters. The model is built normally until the last layer. If sampling is needed, then we do the following:
The class sampling works as following:
Reviewed By: xianjiec
Differential Revision: D4512859
fbshipit-source-id: ab537bcac81d5e5877a8795045e8682c8064da68
Summary:
First pass at a CUDA-aware allreduce chunked implementation. For now the algorithm runs on the CPU and is mostly copy/paste from allreduce_ring.h. A subsequent pass will offload to the GPU.
Serialize cuda test to avoid intermittent failures due to memory contention.
Reviewed By: pietern
Differential Revision: D4576959
fbshipit-source-id: e1f292a05b88ff24c33e549d4a52e770a21f85d2
Summary: Ideally we would want the driver to busy-poll for us. In the absence of driver support, spinning with the MSG_DONTWAIT flag seems to be helping a lot too. Of course, we pay the price of burning one core for polling. Sigh.
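In plain socket terms, the spin looks roughly like this (an illustrative sketch, not the gloo transport code):
```
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>

// Busy-poll receive: instead of a blocking recv() (which parks the thread and
// adds wakeup latency), spin on a non-blocking recv() until data arrives,
// at the cost of keeping one core busy.
ssize_t spin_recv(int fd, void* buf, size_t len) {
  for (;;) {
    ssize_t n = ::recv(fd, buf, len, MSG_DONTWAIT);
    if (n >= 0) return n;                                   // data or orderly shutdown
    if (errno != EAGAIN && errno != EWOULDBLOCK) return -1; // real error
    // nothing available yet: keep spinning
  }
}
```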
Reviewed By: pietern
Differential Revision: D4576242
fbshipit-source-id: 85d9e1b786fbb6053864fba80f3e5ecc80fe221d
Summary:
Latency optimization is going well and I've seen the odd case of <10us
measurements. This option makes the benchmark tool display nanos
instead.
Differential Revision: D4575925
fbshipit-source-id: 98dbd3b39e31cbcdd4c146613f6630e721187e1e
Summary: Do I understand correctly? It must be of size 1 for sigrid
Reviewed By: kennyhorror
Differential Revision: D4576541
fbshipit-source-id: 92fa8dc62e36ff095e14cceeb80b03c0028f5695
Summary:
Move the open source version of build_ftrl to the open source directory.
build_ftrl can use several engines, and the SIMD engine is fb-specific,
so we keep that build_ftrl in the fb/optimizers/sgd.py file.
If the caller only uses the open source engine, it can import the
open source build_ftrl. If the caller may use the SIMD engine, it needs
to import the fb-specific build_ftrl.
Also move the tests to the python directory.
Reviewed By: salexspb
Differential Revision: D4560384
fbshipit-source-id: 84fc915d3bbe42fd19503ef132d3277088f6fab3
Summary:
Remove the use of `NextName` in the layer model helper, so that the same function returns a `model_helper` that constructs an identical `Net` when under the same NameScope.
`NextScopedBlob` should only take effect when there is a real name conflict; otherwise it returns a ScopedBlobReference.
This is critical for parameter blobs. In the long run, we need to be able to specify parameter blobs more explicitly (kennyhorror is working on this). This solution works in the short term for e.g. two-tower sparse nn models.
Reviewed By: kennyhorror
Differential Revision: D4555423
fbshipit-source-id: 2c4b99a61392e5d51aa878f7346466a8f14be187
Summary:
Pass through the h-value recurrent output unchanged at each LSTM step beyond the valid part of a sequence (computed based on seqLengths, allowing batching of sequences of different length). This enables using the final-step output of each sequence as the output when one vector is desired for the entire sequence. Gradient also passed back unchanged.
Also made some cosmetic changes to recurrent_network_test.py (seq_lengths offset corrected, should be in [1, T] rather than [0, T-1]).
Reviewed By: urikz
Differential Revision: D4540307
fbshipit-source-id: 73a9f6326069d713dcb0cdc8d17869317c6dbe96
Summary:
The CudaDevicePointer optionally takes an existing stream on
which it runs any operation associated with the pointer (for now just
memcpy's, but this will likely include kernel execution in the
future).
Differential Revision: D4574035
fbshipit-source-id: ddd7972a3874012059f1fde1b341fd6edd69102d
Summary: We don't use these ops on mobile, so this saves ~150kb.
Reviewed By: Yangqing
Differential Revision: D4569599
fbshipit-source-id: c6f9d702773c64a395e87afa4cfb5b2992dba230
Summary:
In the current implementation of SaveOp we always use names for blobs from the
current workspace. But there is a use case for replacing names in a saved model:
for example, to use half-floats in the prediction model but keep full floats for
the training model, we might want to save a blob "w_fp16" as "w".
Differential Revision: D4567304
fbshipit-source-id: 87bc84fa6a45d8bfa33edb55ac1fb1cff542dbe3
Summary: This diff adds shape inference for the SoftmaxWithLoss Operator
Differential Revision: D4565835
fbshipit-source-id: 1c2db398524c765977ec4d8a22c9b986bf9faf82
Summary: Every time data is put into the logger, it checks if a second has passed. If so, it displays how many inputs were put in the last second.
Differential Revision: D4527148
fbshipit-source-id: f197eb975ed81111449705e0719d1e56f385fd8d
Summary: Might be useful for the EXC_RESOURCE / CPU issues.
Reviewed By: salexspb
Differential Revision: D4565494
fbshipit-source-id: 74ac9edeba6334a46ee6799a93ca96eb68216439
Summary:
One can find the reason why I need a gradient for CopyOp in this post - https://fb.facebook.com/groups/1405155842844877/permalink/1639683782725414/
The gradient for CopyOp is trivial when the device is the same (cpu, or the same gpu), but gets a little harder when the copy was made across two different gpus.
I introduce a new operator, CopyOnDeviceLike, which takes an additional second input. The op copies the first input to the same device as the second one. The default implementation is exactly the same as CopyOp, but I specialize it for CUDAContext.
Please let me know if I'm doing anything wrong here! This is my first caffe2 diff related to operator definitions.
Reviewed By: Yangqing
Differential Revision: D4557258
fbshipit-source-id: 9494be589cc1e5696bbbfe25b7622aaa4c9efe4a
Summary:
- updated image pre-processing to avoid detectable differences in re-sizing for different angles
- refactored utility functions into dbreader and image_input
- fixed an issue in image_input where the crop assert was firing because it was testing the pre-resized image
Reviewed By: seansnyder
Differential Revision: D4550365
fbshipit-source-id: 6461e24a26367c8f6af5e2682beb2b3acd67842b
Summary:
In synchronous mode, it is not the device thread that is responsible
for handling I/O, but the user thread itself. Calling waitRecv on a
buffer will trigger the read function on the pair to be called. This
eliminates the context switch necessary if the device thread is
handling all I/O. For benchmarks with small numbers of elements this
reduces latency by as much as 20%.
Reviewed By: plapukhov
Differential Revision: D4549998
fbshipit-source-id: ab718ba090c06d7c7aa4065cc9f92bd96b9e4a35
Summary:
Refactors some of the vectorization and accumulation.
Parallelization is a TODO, I'm not sure how Android goes and it's just an
incremental ~10% or so.
Reviewed By: Yangqing
Differential Revision: D4568850
fbshipit-source-id: aa9db5a364bb738f492085772dc82b94885eb4d6
Summary:
This clears up a bunch of windows build errors, but there are still 12 errors mostly relating to
- template keywords
- initializer list
- pthreadpool
that are not readily available on windows. Also, cuda build is being disabled right now.
Current error can be found here: https://ci.appveyor.com/project/Yangqing/caffe2-w2ucm
Closes https://github.com/caffe2/caffe2/pull/151
Reviewed By: bwasti
Differential Revision: D4564591
Pulled By: Yangqing
fbshipit-source-id: adacad5fa2d6d52d586700947972e3674e3b6e60
Summary: As in headline. I had missed these originally.
Reviewed By: kennyhorror
Differential Revision: D4560255
fbshipit-source-id: e69458e8a2574b981e40e915d87c8e16dadee7d6
Summary:
(Caffe2) Modified RecurrentNetworkGradient operator so that training is possible with any of the output blob(s) receiving gradient during the backward pass. This is realized through a new argument for the RecurrentNetwork op, outputs_with_grads, which takes a list of the indices of the output blobs which will receive gradient. The default case (only receiving gradient from the first output blob) remains the default.
New unit test covers the case where outputs_with_grads = [1, 2] using Python LSTM wrapper.
Reviewed By: urikz
Differential Revision: D4518516
fbshipit-source-id: 5c531582b20f3cf727d1aa91239b4d5a2b8a7c1f
Summary:
The existing op transforms the input in a general way. It needs M transform mappings to transform an NxM input tensor.
But for binary predictions X (Nx2 tensor), we know that X[:, 0] = 1 - X[:, 1].
So we just need one mapping for X[:, 1]. After being transformed, we can compute X[:, 0].
This diff is to handle this.
Differential Revision: D4550441
fbshipit-source-id: 42d8c6e88d830c97628ee930b543740a32acf904
Summary: This is like `UniformIntFill` but guarantee to return unique elements in the output, excluding the optional avoiding elements.
Reviewed By: xianjiec
Differential Revision: D4511814
fbshipit-source-id: 5dc98ee580616e60e46ee74ebb3f5ddd29a09965
Summary: Updates the function revise_recurrent_network_op(), which supports cloning recurrent networks by adding a blob-name prefix to string arguments to maintain correspondence. It previously relied on many hard-coded indices referring to the positions of arguments and inputs of RecurrentNetworkOp and its corresponding gradient operator, and therefore broke when the implementation changed. This fix should make it more general and robust.
Differential Revision: D4559768
fbshipit-source-id: fb85b0b1ffb1393dc84760d6ae5dc473e8b764b0
Summary: D4438796 (https://github.com/caffe2/caffe2/pull/95) introduced locks to avoid concurrent cudaFrees and NCCL calls. Unfortunately, the locks were not put into PinnedCPUAllocator, causing deadlocks in certain cases (like using the Hive reader).
Reviewed By: Yangqing
Differential Revision: D4563752
fbshipit-source-id: 0f95051621282e742f03feb76ebc30662285fb8e
Summary:
Created a simple benchmark to test model saving speed, plus a few possible
optimizations on top of it.
Since we don't ever want to end up with a partial LogFileDB, it makes sense to
commit the transactions only after we've finished serialization.
As a result, serialization time in my dummy test drops from
480 seconds to:
Serialization time: 52.5134651661
Deserialization time: 60.5741639137
One more really scary thing that I've found:
it looks like load_op with load_all might actually load corrupted DBs (if they are truncated), so we really need to fix that (save all blobs we have in the DB, or even better, a checksum).
Reviewed By: dzhulgakov
Differential Revision: D4558216
fbshipit-source-id: 4145c07f29b9dda527a2e57842f3abd8023d71a3
Summary: to verify that a model only used a subset of the parameters of another model (e.g., the model doing training).
Differential Revision: D4557787
fbshipit-source-id: bd8ac96f5e78e05f6f56086db6e6ddcda36c1d37
Summary: Removed Def().arg() in the backward computation since they have already been included in the forward.
Differential Revision: D4563600
fbshipit-source-id: bb6ee25e7c8da99977b82963670267392893fcde
Summary: Generates a fair amount of documentation from the operators. Also provides a framework for later documentation generation and custom syntax.
Reviewed By: dzhulgakov
Differential Revision: D4168311
fbshipit-source-id: 89ae9d023ad883623cdc1879c11e10b202b68793
Used .c file changes from 7318e2de13 as a starting point. All changes to .c files (except for whitespace details) are present here.
However, the required .h files were not present in that PR.
Summary:
Implement CUDA BroadcastOneToAll algorithm for GPU addresses. Refactor cuda.h into cuda_private.h to allow inclusion of <cuda.h> in public headers without polluting the namespace.
Port broadcast tests to GPU variants.
* this revision is based on Peter's revision D4546932
Differential Revision: D4547382
fbshipit-source-id: 3d294ad8862b04fb783ba22e5c925b8d7cbc8a8d
Summary:
build_sgd, build_adagrad, and build_adam are in open source python directory
now.
Move the tests to the same directory.
Extract TestBase to test_util.py so that TestFtrl can still refer it.
Depends on D4552227
Reviewed By: salexspb
Differential Revision: D4554549
fbshipit-source-id: 35aed05b82c78530808ef623a25bb7532b2abbae
Summary: There's a bug here as well (should be X[:axis] + N instead of [M, N]), but that can wait.
Differential Revision: D4555244
fbshipit-source-id: cf07ffe925bd592b4e2159750b6ebd859cfe0e5e
Summary:
The change migrates build_adam function to the open source python directory.
Depends on D4551871
Reviewed By: salexspb
Differential Revision: D4552227
fbshipit-source-id: 2b6bef183ecfd645d0f26215a784846d8841b845
Summary:
hasattr(x, ops) should always work, regardless of whether you're inside or outside a NetBuilder context.
There's no ideal solution here. I think this is sensible enough.
Reviewed By: kennyhorror
Differential Revision: D4557228
fbshipit-source-id: 4b1c1db5c8b11e4ccbf977b3f82c63b2c3e6e7db
Summary: These operators update the state of the instance and therefore should have the instance in the output list.
Reviewed By: xianjiec
Differential Revision: D4554773
fbshipit-source-id: 556d484fcf58878308aa6b0f7cd7ea2446d3f29e
Summary:
The change migrates build_adagrad function to the open source python directory.
Depends on D4547016.
Reviewed By: salexspb
Differential Revision: D4551871
fbshipit-source-id: cb68d9b2a723b0f069c8a24cfa3062f1e676c016
Summary:
Matt uyt reported (1) a very infrequent assertion failure in the net.cc worker function. This was caused by an operator that was not part of a chain being scheduled in the job queue. This could happen since our DAG net operator graph is a graph of operators, not of chains. The dependency pruning that I introduced last week exposed this problem since it removed some "middle-to-chain" dependencies when computing the chains. (It is a bit hard to explain.)
This diff attempts to fix the problem by only allowing scheduling of chains. In addition, I added an extra check to confirm that all parents of all nodes were indeed executed before starting the next round. This adds additional safety and a breakpoint to see if there are still problems.
I also fixed a bug in the operator graph pruning that made pruning less effective.
(1) Matt's report:
https://www.prod.facebook.com/groups/1405155842844877/permalink/1639428779417581/
Reviewed By: dzhulgakov
Differential Revision: D4531424
fbshipit-source-id: 80fa7def6e8aff6910ebf0d9d5fef15ff20e0aec
Summary:
In the tutorial, I found the call to Model() was not correct. After this change, it works.
Closes https://github.com/caffe2/caffe2/pull/148
Reviewed By: bwasti
Differential Revision: D4556894
Pulled By: Yangqing
fbshipit-source-id: 949a8d0496861f19869436908ffe1ef1a0f853b1
Summary:
This is essentially https://github.com/caffe2/caffe2/pull/146/ but shipit
failed to trigger task determinator.
Reviewed By: bwasti
Differential Revision: D4557698
fbshipit-source-id: b0e6777957e76df4e23671371098c2c6fe83b55c
Summary: For top-k accuracy, if the correct prediction does not make it into the k-sized priority queue, it is not going to be in the top-k, so we can short-circuit.
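A sketch of the idea in isolation (illustrative, not the AccuracyOp code; ties are counted in favor of the correct class here):
```
#include <functional>
#include <queue>
#include <vector>

// Keep a min-heap of the k best scores seen so far. As soon as k scores
// strictly better than the correct class's score have been seen, the correct
// class can no longer be in the top-k and we can stop early.
bool InTopK(const std::vector<float>& scores, size_t correct, size_t k) {
  const float correct_score = scores[correct];
  std::priority_queue<float, std::vector<float>, std::greater<float>> heap;
  for (float s : scores) {
    if (heap.size() < k) {
      heap.push(s);
    } else if (s > heap.top()) {
      heap.pop();
      heap.push(s);
    }
    if (heap.size() == k && heap.top() > correct_score) {
      return false;  // short circuit: the correct score already fell out
    }
  }
  return true;  // ties with the k-th best score count as correct here
}
```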
Reviewed By: Yangqing
Differential Revision: D4555637
fbshipit-source-id: 7f07787f853f1c6b4024e279dcc6920d28bdde3d
Summary:
Separate benchmark build target for CUDA-aware algorithms.
This is needed to keep CUDA an optional dependency.
Differential Revision: D4546932
fbshipit-source-id: b73176ae9067233f883d51ba3ab4efbb13a6f86f
Summary:
This CUDA-aware ring allreduce is based on the regular ring allreduce.
It runs the reduction algorithm on the CPU and is therefore most
suited for smaller buffers.
Both the device-to-host memcpy's at the start of the algorithm and the
host-to-device memcpy's at the end of the algorithm are kicked off
asynchronously in an attempt to parallelize as much as possible.
Reviewed By: Yangqing
Differential Revision: D4542816
fbshipit-source-id: 101dfad276ca79703e37ff93fb1b6d467295f66b
Summary:
The CUDA benchmark suite will be a separate build target, so the
runner should be reused.
Reviewed By: Yangqing
Differential Revision: D4545092
fbshipit-source-id: 6ccf2d30f5d35c74fc59851b25416bfe6863d62c
Summary: ContextManager was thread local. This caused issues because the context registration needs to be global. What needs to be thread local is the current context.
Reviewed By: jhcross
Differential Revision: D4556050
fbshipit-source-id: 5de1c0d9fd0a778c4cb1eadef01f9a1ab488f603
Summary: gcc didn't like not returning a value
Reviewed By: Yangqing
Differential Revision: D4553052
fbshipit-source-id: 68ec2df35cf097be2d9338fcd8901a5fac6292c3
The core autograd Variable, Function, and Engine no longer depend on the
Python API. This lets us implement functions in C++. In the future, we
can also multithread the engine and release the GIL for most of the
non-Python backwards.
Summary:
Currently build_sgd is in a Facebook-specific directory. We need to move it to the python directory so that
the open source world can use it.
Reviewed By: salexspb
Differential Revision: D4547016
fbshipit-source-id: d699b7b1ab8051afdeadedb4d247ec2a04a7a3e7
Summary: There are still a lot to clean up, but this is a start change.
Reviewed By: bwasti
Differential Revision: D4543980
fbshipit-source-id: 757fc49db230b56996f02d5de9b69030ebbf3b77
Summary: Unneeded for mobile, should go from 90kb to ~30kb or so.
Differential Revision: D4545466
fbshipit-source-id: 47945493895a8f72d17de684b0429c2c7b5564ed
Summary:
We don't need all the ~dozen filler ops - should reduce from
~60kb to 20kb.
Reviewed By: Yangqing
Differential Revision: D4545452
fbshipit-source-id: 7ed1a6ba5a2c180f37c3163bfb40844160882749
Summary:
We only need Add right now, so split things up.
Can take it from ~260kb to ~20kb.
Reviewed By: salexspb
Differential Revision: D4545441
fbshipit-source-id: 96e58fb4d8b2a4f120ae7d34e86cefca146ec14e
Summary:
Fix warnings exposed by gcc-4.9.x's -Wshadow-compatible-local
I plan to enable this for all of fbcode, soon.
See t13698406 for justification.
Rename inner "err" to "err2".
This avoids the following errors:
caffe2/caffe2/contrib/torch/torch_op.h:263:47: error: declaration of 'err' shadows a previous local [-Werror=shadow-compatible-local]
caffe2/caffe2/contrib/torch/torch_op.h:263:11: error: declaration of 'err' shadows a previous local [-Werror=shadow-compatible-local]
Reviewed By: Yangqing
Differential Revision: D4544812
fbshipit-source-id: b15467ba9af7ec7f391db59f706b0442cdb664c4
Summary:
Fix warnings exposed by gcc-4.9.x's -Wshadow-compatible-local
I plan to enable this for all of fbcode, soon.
See t13698406 for justification.
Rename inner "i" to "j", twice.
This avoids the following errors:
caffe2/caffe2/operators/text_file_reader_utils_test.cc:56:14: error: declaration of 'i' shadows a previous local [-Werror=shadow-compatible-local]
caffe2/caffe2/operators/text_file_reader_utils_test.cc:47:14: error: declaration of 'i' shadows a previous local [-Werror=shadow-compatible-local]
caffe2/caffe2/operators/text_file_reader_utils_test.cc:41:12: error: shadowed declaration is here [-Werror=shadow-compatible-local]
Reviewed By: Yangqing
Differential Revision: D4544810
fbshipit-source-id: 089d73466f48a7a28b2a516117a12389c3ad54d2
Summary:
Fix warnings exposed by gcc-4.9.x's -Wshadow-compatible-local
I plan to enable this for all of fbcode, soon.
See t13698406 for justification.
Remove declaration of unused outer "stream".
This avoids the following errors:
caffe2/caffe2/binaries/core_overhead_benchmark.cc:28:27: error: declaration of 'stream' shadows a previous local [-Werror=shadow-compatible-local]
caffe2/caffe2/binaries/core_overhead_benchmark.cc:26:25: error: shadowed declaration is here [-Werror=shadow-compatible-local]
Reviewed By: Yangqing
Differential Revision: D4544811
fbshipit-source-id: c94e8a6e6d59705c86bc654f05d4de1ae4213eac
Summary:
Fix warnings exposed by gcc-4.9.x's -Wshadow-compatible-local
I plan to enable this for all of fbcode, soon.
See t13698406 for justification.
Rename outer "rank,size" to "rank0,size0" (to avoid shadowing another "rank" and "size" just below).
This avoids the following errors:
caffe2/caffe2/mpi/mpi_test.cc:124:9: error: declaration of 'rank' shadows a previous local [-Werror=shadow-compatible-local]
caffe2/caffe2/mpi/mpi_test.cc:112:7: error: shadowed declaration is here [-Werror=shadow-compatible-local]
caffe2/caffe2/mpi/mpi_test.cc:126:9: error: declaration of 'size' shadows a previous local [-Werror=shadow-compatible-local]
caffe2/caffe2/mpi/mpi_test.cc:115:7: error: shadowed declaration is here [-Werror=shadow-compatible-local]
Reviewed By: Yangqing
Differential Revision: D4544808
fbshipit-source-id: fdc53ab8763eb342302b94d82d1ac046f2af7d33
Summary:
Fix warnings exposed by gcc-4.9.x's -Wshadow-compatible-local
I plan to enable this for all of fbcode, soon.
See t13698406 for justification.
Rename outer "rank" to "rank0" (to avoid shadowing another "rank" just below).
Also rename outer "size" to "size0" for the same reason.
This avoids the following errors:
caffe2/caffe2/mpi/mpi_gpu_test.cc:132:9: error: declaration of 'rank' shadows a previous local [-Werror=shadow-compatible-local]
caffe2/caffe2/mpi/mpi_gpu_test.cc:120:7: error: shadowed declaration is here [-Werror=shadow-compatible-local]
caffe2/caffe2/mpi/mpi_gpu_test.cc:134:9: error: declaration of 'size' shadows a previous local [-Werror=shadow-compatible-local]
caffe2/caffe2/mpi/mpi_gpu_test.cc:123:7: error: shadowed declaration is here [-Werror=shadow-compatible-local]
Reviewed By: Yangqing
Differential Revision: D4544806
fbshipit-source-id: 4cfa412dd672919174d487e60aa503a32125da03
Summary:
Fix warnings exposed by gcc-4.9.x's -Wshadow-compatible-local
I plan to enable this for all of fbcode, soon.
See t13698406 for justification.
Rename inner "new_insta_comm" to "comm".
This avoids the following errors:
caffe2/caffe2/mpi/mpi_common.cc:167:16: error: declaration of 'new_intra_comm' shadows a previous local [-Werror=shadow-compatible-local]
caffe2/caffe2/mpi/mpi_common.cc:162:14: error: shadowed declaration is here [-Werror=shadow-compatible-local]
Reviewed By: pietern
Differential Revision: D4544805
fbshipit-source-id: c703c3f35c71f08b4daae8491ea2518572fc8013
Summary:
Inputs have to be arranged in such a way that the j-th example of
batch i goes right before the j-th example of batch i+1 in the text.
Reviewed By: urikz
Differential Revision: D4519553
fbshipit-source-id: 9dd80658e0c4d9ff0f97a7904cbb164f267fe39f
Summary: With a batch size of 32 and other default parameters I get 70 iterations per second vs. 40 on CPU. Batching still doesn't produce a good loss; I am going to work on this in a separate diff.
Reviewed By: urikz
Differential Revision: D4516566
fbshipit-source-id: d0611534747beb2cd935a8607a283369378e4a6c
Summary:
Outline of changes:
- add single-operator support to Caffe2-Flow integration (based on Alisson's suggestions)
- because of above support we can move graph construction to the main workflow body and pass the job to the Flow operator doing running, similarly to the distributed case
- after that it's easy to unify code even more
- there's some trickery required to make sure model exporting doesn't pollute Cluster info (as TaskGroup.to_task() creates new tasks)
Important: this diff changes train_local behavior by introducing a queue between preprocessing and the trainer (before, we did everything on the trainer thread). It doesn't seem to impact perf much (even slightly positive), so I guess it's fine. It also allows for better unification.
I'll follow up with a separate diff that moves max_examples gating to multi_reader (including train_local) and then we can enable checkpointing.
Reviewed By: xianjiec
Differential Revision: D4526079
fbshipit-source-id: 8c44044f45e7738e9b13e5b3acfbb994bc5a3d72
Summary:
- NetBuilder now honors its name
- When Nets are created in the context of a NetBuilder, they take NetBuilder's name as prefix
- When a NetBuilder is created in the context of a Task, it takes the Tasks's name.
- pipe() now tries to find a good name based on its processor's, output or input queue's name.
- RPC tries to find a name from its handler's name.
- Better names in DataStream
- net_printer prints the name of Tasks and Steps
- net_printer optionally factors out common prefixes form blob names.
Differential Revision: D4527578
fbshipit-source-id: 5d3d1237c186e9576313c5aa01cc8800a9051217
Summary:
1. The existing Gather op outputs gradients in sparse format. We add GatherDense that does the same thing
as Gather but outputs gradients in dense format. This relies on the SparseToDenseOp.
2. SparseToDenseOp converts sparse representation (indices, values) into a dense format (missing values are
filled with zeros). There is an existing SparseToDenseMaskOp. It is mainly for converting sparse features
into dense format. Modifying it to achieve our purpose is too complicated and messy. Better to create a new one.
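For intuition, the dense conversion amounts to a scatter into a zero-filled buffer; a minimal sketch, assuming duplicate indices are accumulated (the real op's handling of duplicates may differ):
```
#include <cstdint>
#include <vector>

// Scatter (index, value) pairs into a zero-initialized dense vector; entries
// that never appear among the indices stay zero.
std::vector<float> SparseToDense(const std::vector<int64_t>& indices,
                                 const std::vector<float>& values,
                                 int64_t dense_size) {
  std::vector<float> dense(static_cast<size_t>(dense_size), 0.0f);
  for (size_t i = 0; i < indices.size(); ++i) {
    dense[static_cast<size_t>(indices[i])] += values[i];  // assumed: accumulate duplicates
  }
  return dense;
}
```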
Reviewed By: dzhulgakov
Differential Revision: D4508879
fbshipit-source-id: f4a50efa1c08586d94040f93195661c41cd414da
Summary:
In the GitHub repository this directory will be mirrored similar to
folly, such that the repository has a single top level directory
called "gloo". This allows for versioning or renaming of the
project root, without having to mangle the include paths; they will
always use the "gloo" prefix.
fbshipit-source-id: 24502e4185fc7cbe19b5249f83609e2b8118e9d7
Summary: This should not be needed any more since we use pybind. It will help python3 migration.
Reviewed By: salexspb
Differential Revision: D4535490
fbshipit-source-id: a47615f73b5c35b940d21bb2d5d55060fa0850be
Summary: Per the task request, replace the original partial_sort solution with a heap.
Differential Revision: D4529118
fbshipit-source-id: 3dc01fc3a552ad020a0370f8d26cbc8be58bca6b
Summary:
Shape inference allows Caffe2 to compute shapes of blobs without running a model. Update InferShapesAndTypes() to accept an optional blob:dimensions map so that external input blobs do not need to be part of the workspace.
InferShapesAndTypes() in workspace.py conditionally calls the ...from_workspace or ...from_map bindings. Note I favored a small amount of code duplication here for the sake of readability. InferShapesAndTypes() in operator.cc has been refactored into mirrored entry points, invoking a common helper.
Other minor changes to address linter warnings.
Reviewed By: dzhulgakov
Differential Revision: D4524873
fbshipit-source-id: 56f863b759c016d7f23523f06fda3aa5bba22357
In cases where copyAsync is a large percentage of the work,
processing events in recordEvent can cause a large bottleneck.
Here, we relax the constraint that we reclaim blocks as fast as possible
(i.e. in copyAsync); instead, we only check that a block can be re-allocated
in malloc and free.
Summary:
updated training for the breaking change of loss_scale.
Noticed that for large downscale factors OpenCV's INTER_AREA did a better job avoiding aliasing, so changed to this filter.
Reviewed By: seansnyder
Differential Revision: D4528909
fbshipit-source-id: 692894812701854dd5eb8da932505f465fed3590
These methods are useful from C because they don't require constructing
THLongStorages to wrap the sizes and strides, which can lead to leaked
memory in case of an error. Instead the sizes and strides can be
represented on the stack using standard C long arrays.
Summary: One trainer passed (10,) as the max_buffer_size parameter, causing the internal queue to grow out of bounds since qsize == (10,) was never true. This adds an assertion on the type of the parameter.
Reviewed By: prigoyal
Differential Revision: D4527649
fbshipit-source-id: 492a824700b8fc69c484b80773b1f1f5aee39071
Summary:
This enables a real RTT measurement, since it's not possible
for peers to 'pre-fill' the notification buffers as is the case for
the all-to-all barrier.
Differential Revision: D4523543
fbshipit-source-id: 3f6467cdc66b1062ada92deed581e9360003d629
Summary:
Running RunNet() in Python in a loop can be a performance issue if the Python code is doing a lot of other processing, such as data input, because Python's Global Interpreter Lock (GIL) will prevent RunNet() from being called. This can easily be fixed by making RunNet() run multiple iterations in C++ land. (Another way to accomplish the same thing is to use Caffe2's "execution plans", but that requires more setup.)
+ fixed timing reporting in my OC workflow
+ improved one error log in data_workers.py
Sorry for piggybacking those small changes, but landing diffs is currently slow...
Reviewed By: rpenggithub
Differential Revision: D4523575
fbshipit-source-id: 039a647576efad5dd9afda74df478ac22b43c103
Summary:
- Do not lock LMDB.
- This avoids failure when multiple readers try to read the same LMDB.
- This can also cause a race if a process tries to write into an LMDB that is being read by another process, because this commit removes the locking mechanism.
- Note that we already use MDB_RDONLY when reading LMDB.
- It seems that LMDB does not provide any method of locking the database to avoid writes while allowing reads.
Closes https://github.com/caffe2/caffe2/pull/130
Differential Revision: D4512220
Pulled By: Yangqing
fbshipit-source-id: 45df849efa339601291aea6d0ed5ac74e097273b
Summary:
This first version just displays the forward part of the training net. I want to refactor local/distributed code to share graph initialization and then visualize all nets individually.
Graphs don't look pretty because of the large number of DotProducts; we need to refactor that.
Reviewed By: xianjiec
Differential Revision: D4514479
fbshipit-source-id: 156bb07c62118b15022c87f197b5e378a7ef3b9f
Summary: Implemented the shape inference function for AccumulateOp. The output shape and type should be the same as the input's.
Differential Revision: D4518812
fbshipit-source-id: 11fc7ec4fad1fe3049c5a35d13c371627f9e3d11
Summary:
Update data parallel model to default to using fbcollective.
Update broadcast op to correctly handle Tensor<long>.
Differential Revision: D4508029
fbshipit-source-id: 7b8d17223e25b3e1098ee3f2a08af61af140729e
Summary:
This should help in debugging test failures on continuous
integration hosts.
Part of this change is to make the address family to use configurable,
so the user can force the library to use either IPv4 or IPv6, instead
of picking whatever we see first.
Differential Revision: D4515802
fbshipit-source-id: 8834cece2ff819c8acad81fa2d76c3ed94f06158
Summary:
I recently encountered out-of-memory errors in my OC workflow. This was because the internal queue for buffering image patches was too large. Total memory use was:
image size = 227 x 227 x 3 x 4 bytes
total mem = image size x queue size (500) x num gpus x everstore-worker batch (128) > 300 GB
Reducing the batch size to 100 should fix this. It can also now be specified as a parameter.
Reviewed By: rpenggithub
Differential Revision: D4519956
fbshipit-source-id: 781697e620431ce7053534e683047bb6e7257b22
Summary:
If num_shards = 1 and distributed training is on, then ring reduce fails when it looks for the left pair to exchange information with.
I also used the opportunity to do a small fix in my data loader benchmark.
Differential Revision: D4513545
fbshipit-source-id: 7d3115b871a39b8ce7b55553394b607d16e08b74
Summary:
Making drawing a bit easier
Also adds a Flow example to check that PNG images are nicely rendered in lists.
Reviewed By: kennyhorror
Differential Revision: D4514470
fbshipit-source-id: 35189c4543c31a351c1dbfe804ce25ae14a3a98b
Summary:
Introduces 2 utilities:
- ##print_obj##: Prints the whole Job in a nice way -- each op call takes one single line and nets are inlined for much better readability. Loops and parallel steps are easy to read.
- ##analyse_obj##: Goes through a Job and checks 2 things:
- that there will be no undefined blob errors at execution.
- no blob of same name will be created by parallel execution steps
Reviewed By: dzhulgakov
Differential Revision: D4142381
fbshipit-source-id: 61bf3398c22e9947493e99145ce2bfc2646830a6
Summary:
We want to train models with user sequence data for mobile side ranking.
The operators are for preprocessing the sequence-based data. They read in a sequence with a batch and convert the examples with different methods.
I also add a new loader for connecting the operators to existing trainers.
Differential Revision: D4485411
fbshipit-source-id: 0cf17206704995f2ce079e1594607bea70b1ed0c
Summary: This makes sure dper_example is compatible with the new way of defining checkpoint epochs. See D4499320.
Reviewed By: xianjiec
Differential Revision: D4511618
fbshipit-source-id: f5188010cdefe3739f87f6049d1ea6aee765c514
Summary:
Per the task's request, added a top_k == 1 branch to specially handle the top-1 accuracy case.
In addition, I made a slight code refinement: moved the declaration of the vector Xdata_pairs out of the for loop to avoid repeated constructor costs.
Differential Revision: D4505983
fbshipit-source-id: 5671eaca4aac3900c69dfb54d664c2d617960b4b
Summary: This allows having a task-local report net before the Task is created. To be used in the global counter (diff soon).
Reviewed By: dzhulgakov
Differential Revision: D4497771
fbshipit-source-id: 24ec7c8e95466abbd83fbea79b58717d81201857
Summary:
It was possible for a set and a get to race such that the get
would return an empty string, if the file for the key was created but
not yet written to. This change updates the FileStoreHandler to first
write to a temporary file and then atomically rename(2) the file to
its final path. This removes the described race condition.
This change also replaces the poor filename generation routine with
using the 128-bit MurmurHash of a key.
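The pattern is the standard write-to-temp-then-rename trick; a minimal Python sketch of the idea (the actual handler is C++), where rename(2) atomically publishes the fully written file so a concurrent get never observes a partially written key:
```
import os
import tempfile

def set_key(store_dir, key_file, value):
    # Write to a temp file in the same directory, then atomically rename it
    # onto the final path; readers see either the old content or the new,
    # never a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=store_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(value)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp_path, os.path.join(store_dir, key_file))
    except BaseException:
        os.unlink(tmp_path)
        raise
```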
Differential Revision: D4502154
fbshipit-source-id: f2abc78b8bad68c06ad2f18a078935826e431f7a
Summary:
As per discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling LR is not the same as scaling Loss, since LR scaling will affect the weight decay (which is implemented by modifying the gradient, which thus is not yet correctly 'averaged'). Actually prigoyal tried to convince me earlier that loss scaling is the way to go, but I was not convinced then :/.
So this diff removes the LR scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function. Unfortunately, this will break all existing code that uses the data parallel model. But that is not only a bad thing, since it will bring awareness to this change. I will inform in the FB groups about this.
In this diff I modified all my models to work correctly.
Reviewed By: Yangqing
Differential Revision: D4507002
fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90
Summary:
We have noticed that the number of chains computed is usually much larger than necessary, when there is a backward pass. For example having a network of 5 FCs with gradient operators (but no parameter updates) should yield only one chain, but instead over 20 were created. After adding parameter updates, the forward pass still should remain one chain, while the backward pass will be splintered.
Analysis showed that the problem was the dependencies from forward ops to the gradient computation. But these are redundant, since the gradient op already depends on the op via the full path over ops. Example:
fc1 ----> fc2 ----> fc3 ----> loss
 |         |         |          |
 v         v         v          v
fc1grad <- fc2grad <- fc3grad <-+
Here fc1 and fc1grad have a direct dependency, but the indirect dependency via fc2 -> fc3 -> [...] -> fc1grad already covers it.
To fix this, I added a pruning step prior to the chain computation. The chain computation is done on the pruned tree, but I do not modify the runtime chains for safety.
Pruning is based on following logic:
- if one of my direct parents is also an ancestor via another path, I can remove the direct dependency
Pruning is extremely fast, linear in the number of dependencies.
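For intuition, here is a naive sketch of the pruning rule stated above (the actual implementation is linear in the number of dependencies; this illustrative version is quadratic and is not the Caffe2 code):
```
def prune_redundant_edges(parents):
    """parents: op id -> set of direct parent op ids."""
    children = {}
    for op, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(op)

    def reachable(src, dst, skip_edge):
        # DFS from src to dst, ignoring one specific direct edge.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(nxt for nxt in children.get(node, ())
                         if (node, nxt) != skip_edge)
        return False

    pruned = {op: set(ps) for op, ps in parents.items()}
    for op, ps in parents.items():
        for p in ps:
            # If p still reaches op without the direct edge, the edge is redundant.
            if reachable(p, op, skip_edge=(p, op)):
                pruned[op].discard(p)
    return pruned
```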
Reviewed By: dzhulgakov
Differential Revision: D4500293
fbshipit-source-id: 0994ae6775c53378ea1e0074365cef041764a1b4
Summary:
This is a fairly large diff, sorry about that. It includes basic shape and type inference functionality, based on YQ's Schema scaffolding. I added some helper functions to make it easier to write simple translations.
Bigger refactoring was needed for ConvPoolBase so that we could use the shape inference already there in the schema.
I annotated enough operators to be able to infer forward-pass of shapes for basic convnet, and added test for that. I intend to bootcamp some annotations and annotate enough to handle Resnets fully. Need to think about gradients, if they could be annotated in an easier way.
Only shapes are now exposed to Python, types will follow later. Also the inference is not called yet anywhere but unit test.
Also I am not sure if everything is in the best location in the code, but shouldn't be hard to move stuff around.
Reviewed By: dzhulgakov
Differential Revision: D4436818
fbshipit-source-id: eebee5937ccc9ac09c245465302388a1fae6933c
Summary:
Required by feed ranking: https://fb.quip.com/N4IuAIgda8Pe
Each task might have multi-subtasks. Each subtask has dedicated mlp layers.
Reviewed By: xianjiec
Differential Revision: D4451609
fbshipit-source-id: 3dad48e6a7cce1bb103d93ec205ff6d2333659ea
Summary: If the PATH doesn't include cmake (such as when android studio wipes all the environment variables), this will still work.
Reviewed By: Yangqing
Differential Revision: D4504653
fbshipit-source-id: 56a8854e3daf6ee1f5b1cbeb83ca175a007dad12
Summary:
This learns Shakespeare and then generates samples one character at a time. We want this to be an example of using our LSTM and RNNs in general.
Now it takes 4ms to run the training net on the current parameters (with batch size = 1). I don't have data on how much each operator takes yet. But the overall Python loop doesn't seem to have much influence - with 1000 fake iterations in run_net it took 4s for each iteration, as expected.
Future work:
* fixing convergence for batching
* profiling on operator level
* trying it out with GPUs
* benchmarking against existing char-rnn implementations
* stacking lstms (one lstm is different from two, one needs to take care of scoping)
Reviewed By: urikz
Differential Revision: D4430612
fbshipit-source-id: b36644fed9844683f670717d57f8527c25ad285c
Summary: stop_if() was not being honored in ProcessingReader.
Reviewed By: dzhulgakov
Differential Revision: D4497784
fbshipit-source-id: 1c967c6252f832149800796e2c26aadf10b74850
Summary: This allows saving the previous value of the counter and sending it upstream without losing counts.
Reviewed By: kennyhorror
Differential Revision: D4497854
fbshipit-source-id: 28a7ad0ff1020bde26f78b1f59614b094d1e1881
Summary: The net was being added to the task body by mistake. Also, adds local_init and local_exit functionality.
Reviewed By: dzhulgakov
Differential Revision: D4497794
fbshipit-source-id: 4d9dfb48a277ccfa204f1e74886abba5d44c61f8
Summary: For customers like Ads, Feeds, MarketPlace, their training data size is super large. It is unnecessary and costly to go over all the data to compute meta information. In this diff, numSample option is added in preCompute, so users have control over how many samples they want to use when computing meta information.
Differential Revision: D4492399
fbshipit-source-id: 7199381d226ee6300a959fc5e116d39984d199fc
Summary:
The unit tests using the tcp transport should bind to
localhost instead of hostname(2).
Differential Revision: D4501851
fbshipit-source-id: 43db860c9b96d5d64801d1c6af2bf25e6759b4af
Summary: One model was passing -1s in the label blob, causing illegal memory access when computing the label-cross entropy. Improving the assertion causes it to fail properly.
Reviewed By: prigoyal
Differential Revision: D4491848
fbshipit-source-id: 5c48e43b0a8928cac70e939d69d23c94c07511b9
Summary:
Currently CUDAContext only supports one cuda stream per gpu per thread. But as per my investigation, it is much better to use one CPU thread to control all streams for one GPU. To make this possible, this ground work is necessary: this diff defines a stream id for cuda context that is used to index to streams for that gpu for that thread (the streams are handled by a thread-local class).
This diff also changes the initialization: before we created cuda streams for all gpus and for all threads, even if they would be never used. Now streams are created only when needed.
This adds a small overhead to context.cuda_stream(), but I doubt it has any significance. Instead, this diff will slightly reduce memory usage on the GPU side.
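An illustrative Python sketch of the bookkeeping (the actual change lives in the C++ CUDAContext; make_stream below is a hypothetical factory): streams are created lazily, per thread and per (gpu, stream id), instead of eagerly for every GPU and thread.
```
import threading

class ThreadLocalStreams:
    def __init__(self, make_stream):
        self._local = threading.local()   # one pool per CPU thread
        self._make_stream = make_stream   # hypothetical stream factory

    def get(self, gpu_id, stream_id=0):
        pool = getattr(self._local, "pool", None)
        if pool is None:
            pool = self._local.pool = {}
        key = (gpu_id, stream_id)
        if key not in pool:
            # Created only on first use by this thread for this gpu/stream id.
            pool[key] = self._make_stream(gpu_id)
        return pool[key]
```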
Reviewed By: Yangqing
Differential Revision: D4492380
fbshipit-source-id: 498555e58d75217d43891e1bcad6d86051d376ce
Summary:
The color_ flag to image_input_op now indicates the desired number of output channels. If the source
DB has a different number of channels, then color-to-grayscale conversion (or vice versa) is done.
Reviewed By: Yangqing
Differential Revision: D4498455
fbshipit-source-id: da8c39eccd06b9158f320a05663658e502905ae5
Summary: The initial implementation wasn't working quite right (no const fill of an empty external input)
Reviewed By: viswanathgs
Differential Revision: D4490569
fbshipit-source-id: 1b2a4f612efb3b2685edfe6c683571dd9d01aa4f
Summary: Add support for "safe" versions of enqueue and dequeue. I'm not sure if using `math::Set<bool, Context>` is the best context independent approach for setting the status.
Differential Revision: D4398633
fbshipit-source-id: 7c88c8e11acfe36fd3d94f17dbf68ce558eb6df1
Summary:
Takes a 2D tensor of floats, and converts each row into a comma delimited
string. vigneshr ran into a limitation where logging features to hive wasn't
possible without this since our APIs only allow logging strings.
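A small NumPy stand-in for the behaviour described above (the real code is a Caffe2 operator; this is just a sketch of the row-to-string mapping):
```
import numpy as np

def rows_to_strings(x):
    # x: 2D float tensor; each row becomes one comma-delimited string.
    assert x.ndim == 2, "expects a 2D tensor of floats"
    return [",".join(str(v) for v in row) for row in x]

print(rows_to_strings(np.array([[1.0, 2.5], [3.0, 4.0]])))
# -> ['1.0,2.5', '3.0,4.0']
```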
Differential Revision: D4486151
fbshipit-source-id: 2d229290819e2e7ca3dc6f93846433da8b02a41d
Summary: add an option to use a resnet network instead of alexnet. Modified the resnet.create_resnet50 function slightly to allow specifying different kernel/stride parameters so we can adapt resnet to our image size.
Differential Revision: D4472535
fbshipit-source-id: ed06acf52f6425a1e04d047548eb3c70388d74aa
Summary:
I have forgotten to remove this one. The rest of indexing
instead of string names is coming after D4446813 lands, as scratches
aren't inputs or outputs and thus can't be indexed.
Reviewed By: urikz
Differential Revision: D4465748
fbshipit-source-id: 2ccbedfb35541ef4a2231d1480eef59025bd5290
Summary: Remove the dependency on sys/time.h, and use c++11 feature chrono library, which is more portable.
Reviewed By: Yangqing
Differential Revision: D4486569
fbshipit-source-id: 86be58c6e9bc410e726a4799bc4d2be86fdd1dd4
Summary:
Grayscale images were not being handled correctly by the image input op in the CPU path. There was
a coercion of the grayscale image to color that strided through the grayscale image 3 pixels at a time.
Reviewed By: Yangqing
Differential Revision: D4486356
fbshipit-source-id: 482fbfe211ecdc107e55692a4cf0329e174c8e4a
Summary: On some inputs TestWarden was failing
Reviewed By: Yangqing
Differential Revision: D4487293
fbshipit-source-id: 3da4b310a619c2b57f033b2dd7727f71403bfd68
Summary: Looks like we don't do a good job with initial recurrent input gradients yet. Here is a partial fix, but the gradient check doesn't pass yet. The shape is correct now, though.
Reviewed By: salexspb
Differential Revision: D4475447
fbshipit-source-id: 280f1f59f19e487fd0dce0d440609c50ddce294a
Moves THPObjectPtr into a separate header, so that it can be included
independently. Currently, utils.h requires all of THP.h. Also adds RAII
structs for acquiring and releasing the GIL.
Due to bad rank mapping, broadcast and reduce were connecting the
wrong processes, which resulted in errors or tensors not being received/sent.
* Introduced a new mapping method to solve this problem.
* Added and improved tests for these cases.
Summary: See distributed.py for example of usage
Reviewed By: xianjiec
Differential Revision: D4467723
fbshipit-source-id: c74f71bebaa1751098379838d3da55945aac62bd
Summary:
Turns out that building caffe2 on raspbian is a piece of cake - cmake is awesome.
Closes https://github.com/caffe2/caffe2/pull/112
Differential Revision: D4480985
Pulled By: Yangqing
fbshipit-source-id: 5dbe5e1e71d8680dea7a5ec8a9ce7fbe6aa5270a
Summary:
This solves most include warnings as seen in Phabricator (no header files, no "packing" system headers, new default mode where more user headers are removed).
We cowardly skip files containing #if for now.
Generated by
```
rm -f /tmp/ffmr-diff/* &&
cd fbcode &&
(foundation/scripts/ls-cpp-dirs | grep -v '^\(\.\.\|external/\|.*/external\|folly/|watchman/\)' |
xargs ffmr -o /tmp/ffmr-diff codegraph/scripts/ffmr/analyze_includes_no_headers_no_packing_skipping_if.sh) &&
(cat /tmp/ffmr-diff/*.diff | patch -p2) &&
hg commit -m foo &&
cd .. &&
arc amend --yes --revision D4414676 && arc diff --nolint --nounit --excuse refactoring --prepare --big-diff -m 'something'
```
folly and watchman are in separate diffs.
Reviewed By: meyering
Differential Revision: D4414676
fbshipit-source-id: 75e2e11f4fac8a5f8071a1bafcc4ddc355fd6f4e
Here's the command I used to invoke autopep8 (in parallel!):
git ls-files | grep '\.py$' | xargs -n1 -P`nproc` autopep8 -i
Several rules are ignored in setup.cfg. The goal is to let autopep8
handle everything which it can handle safely, and to disable any rules
which are tricky or controversial to address. We may want to come back
and re-enable some of these rules later, but I'm trying to make this
patch as safe as possible.
Also configures flake8 to match pep8's behavior.
Also configures TravisCI to check the whole project for lint.
Summary:
Xray is being converted to c2 and ROIPool (needed for detection models) is
missing in c2 trunk. Ported rbgirshick's implementation from experimental with a few
changes:
Also added code for translation in caffe_translate.py
Differential Revision: D4453331
fbshipit-source-id: 7a05a88edec1bd6e806e52dc1e6c55bc75c3149f
Summary: This diff uses stack workspaces in RecurrentNetwork, which allows simplifying the implementation and getting rid of scratches.
Reviewed By: salexspb
Differential Revision: D4446813
fbshipit-source-id: 514eec7e4300bdf492a9cb192b40cf4f89acf656
Summary:
Using multiple readers for model evaluation. Since it is built on the new framework, only NativeLoader is supported.
With 5 readers, the evaluation speed is 124k. The speed for a single evaluator is 32k. There is still room for improvement since the evaluator machine is under-utilized.
(Hive is the bottleneck. Adding more loading threads helps to improve the speed to 240k. More readers can improve it further.)
Reviewed By: azzolini
Differential Revision: D4469393
fbshipit-source-id: b55af5f798faca4c150b2c0663fe5db0f154cb70
Summary: Replace ParseFromString with ParseProtobufFromLargeString to get around the limitation of the 64MB limit.
Reviewed By: Yangqing
Differential Revision: D4466226
fbshipit-source-id: b68a6efc76955db294ddb0d23bbaf03b69e4952a
Summary: Might be useful to have a command line version of this. Thoughts?
Reviewed By: Yangqing
Differential Revision: D4456221
fbshipit-source-id: 42dd464c5734c0cfbd4c2b1cb348aef9b269b4c2
Summary:
Added cmake for android script under scripts, and set up the travis contbuild target.
Closes https://github.com/caffe2/caffe2/pull/109
Reviewed By: bwasti
Differential Revision: D4468767
Pulled By: Yangqing
fbshipit-source-id: 709f3eb6be24727b0a989d0901dbf377871b122a
Summary: This fixes build that include caffe2 and change the value of CMAKE_BINARY_DIR to their own binary dir. Allows the generation of protobuf headers/files in particular.
Reviewed By: Yangqing
Differential Revision: D4466126
fbshipit-source-id: eba264094dd2bff07a7f050b95fd2d5525462b09
Summary: Makes it much nicer to spot errors, especially in iPython notebook.
Reviewed By: kennyhorror
Differential Revision: D4465726
fbshipit-source-id: c0adaf5168248a70987ff9d5dfce54a622ff2219
Summary:
We get flaky LSTM tests on a numerical gradient check. I
would like to improve the accuracy of the latter, but first I need an
example. After landing this, TestWarden would find a bad input for me.
Reviewed By: urikz
Differential Revision: D4467223
fbshipit-source-id: 68d4bf22af11190f39fa28332c6d99efbb192132
Summary: Android Studio automatically enables -Werror in debug mode and throws an error on non-string literals in the 3rd argument of android_log_print.
Reviewed By: Yangqing
Differential Revision: D4465263
fbshipit-source-id: af6dc436b7c98a29aa89bb241c452e6da5c8ad1f
Summary:
- Writing a Caffe2 computation graph to json for visualization in Flow
- Example use in the Text models workflow: it replaces the existing draw function which produces PNG file
- Visualization: https://our.intern.facebook.com/intern/fblearner/c2graphvis/13215753/
- The visualization uses FBLearnerDAG. Plan to add many visualization-related features.
Reviewed By: Mortimerp9
Differential Revision: D4415299
fbshipit-source-id: 2d641d60177566ed2837fb3750394420690f28de
Summary: Fixes segfaults that occur in Eigen and im2col/sgemm backends.
Reviewed By: Yangqing
Differential Revision: D4451772
fbshipit-source-id: 3cf21e5afb2fe300db4228933a82063db5f7091f
Summary:
1. Use opencv for data augmentation after benchmarking various image libraries in python
2. Use cuda no bias conv
3. Use cuda fastest conv (exhaustive search)
4. data_parallel_model had a few changes. Syncing them
5. propagate the errors in threads to make debugging easy
Reviewed By: rbgirshick
Differential Revision: D4341422
fbshipit-source-id: aa4471a2f49dd6d7ca13879999b3c7ceaf818c1e
Summary:
It's a similar trick to dyndeps. The idea is that it is better to just replicate global state to gang workers, as otherwise it causes a lot of confusion.
In particular it's useful if one wants to enable detailed logging (--v)
For other operators user still needs to call GlobalInit explicitly. We should consider doing it for all Flow operators, but I'll leave it for future considerations.
Reviewed By: kennyhorror
Differential Revision: D4460686
fbshipit-source-id: 5836737dd3195f9ad12589fd899a3ff63f173e05
Summary:
Fixes the problem surfaced by D4446583.
Our serialization interface is designed for chunking, but recipients in distributed training didn't expect that.
For now I just fixed the naming of the tensor and since our blobs are small it should work.
I believe it's still wrong however for big tensors as we just concatenate the serialized proto strings of chunks here: https://fburl.com/6wayxglz and here: https://fburl.com/7k4nhjja . Deserialization path though just tries to deserialize it as a single proto.
I'll make Blob::Serialize(name) version use non-chunking version in a separate diff. Just sending it to unblock for now.
Side note - oujin - why do we have two versions of operator setting the blob? :) Is one of them added by Pieter? Maybe we should unify them a bit.
Reviewed By: kennyhorror
Differential Revision: D4460974
fbshipit-source-id: 485b4de7c8af8cd9eac44c06a1246deaf0b4d502
Summary: The previous implementation was just concatenating strings, which I believe is wrong. Instead, let's turn off chunking when we don't ask for it.
Reviewed By: kennyhorror
Differential Revision: D4461311
fbshipit-source-id: 8b9a3325a40a1cd0a8ffeeb20a17bf9f57b7b0a9
Summary:
It's broken because it relies on add sparse bias.
It's not easy to add_sparse_bias after the switch to loader_param.
DPA would like to try it out :)
Differential Revision: D4447275
fbshipit-source-id: 631cb4995f35383070e44387dc86692ba64b91eb
Summary: Remove usage of recurrent_sizes, so recurrent states' sizes can depend on input (in case of attention matrix for beam decoder). I removed recurrent_sizes from forward and backward steps.
Reviewed By: salexspb
Differential Revision: D4427688
fbshipit-source-id: 580420a294d309c86ec5cb4e677058623b7228e1
Summary:
It seems that a simple string("") conversion instead of "" is enough.
Closes https://github.com/caffe2/caffe2/pull/105
Differential Revision: D4458626
Pulled By: Yangqing
fbshipit-source-id: 5072499516332ad1067779526523a3f10aade6ef
Summary: Speeds up inference in the FCIS model from 2900ms/iter for SoftmaxWithLoss layer to 230ms/iter
Differential Revision: D4456494
fbshipit-source-id: dd520d91fbe950511d198de45f34ac4cd4a676b0
Summary:
In this diff I stop passing parameters by name and also remove hardcoded output ids which were there specifically for LSTM to work. It also allows avoiding the use of recurrent_sizes in the backward pass (for forward this is done in D4427688).
Using a similar technique, it should be simple enough to eliminate blob name passing entirely. Then we can fix scoping. These can be done in a follow-up diff.
Reviewed By: urikz
Differential Revision: D4444614
fbshipit-source-id: 3580a76365502b9f2f09e3d8b7e78084ca739f00
Summary:
Let's have a test for this so we don't break existing use cases
while iterating on RecurrentOp's code.
Reviewed By: urikz
Differential Revision: D4456404
fbshipit-source-id: 79f2b88c1eed16106adf5b793b4c74441c7146c6
Summary:
It is annoying to print tensors from C++ (while it is easy
from Python when you have a net), so I just took the logic out of PrintOp
into a separate class.
Reviewed By: urikz
Differential Revision: D4452793
fbshipit-source-id: d512559fe07bc468423c9ce38da0c44eaad4fdec
Summary: I can't live without it and we don't have folly here.
Reviewed By: urikz
Differential Revision: D4444511
fbshipit-source-id: 3a85f1a13bd3032be89b3150d40a701dce192004
Summary: added functions to "de scope" the saved model files
Reviewed By: Yangqing
Differential Revision: D4444966
fbshipit-source-id: f447c15754f8e0648459148fcc7fba410dc06f68
Summary:
A new operator is added for model calibration. Given a piecewise linear function and a raw prediction as input, it generates the mapping as output.
Details can be found in the operator doc.
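As a rough sketch of the mapping (the operator's actual input format may differ; here the piecewise linear function is assumed to be given as sorted breakpoints with corresponding output values):
```
import numpy as np

def piecewise_linear_calibrate(raw_pred, bounds, values):
    # Linear interpolation between breakpoints; predictions outside the
    # bounds are clipped to the first/last output value.
    return np.interp(raw_pred, bounds, values)

bounds = np.array([0.0, 0.25, 0.5, 1.0])
values = np.array([0.0, 0.1, 0.6, 1.0])
print(piecewise_linear_calibrate(np.array([0.3, 0.9]), bounds, values))
```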
Differential Revision: D4418640
fbshipit-source-id: f8ff3ea786b0fe233a4ddcb709e5dbf0861ca484
Summary: We don't need this enforce since we already allow raw_mutable_data to return nullptr; we should be able to share meta for tensors even without data.
Reviewed By: Yangqing, kennyhorror
Differential Revision: D4439138
fbshipit-source-id: 0e81bef3054fe2f9720efd5002418eac7a2b6c08
Summary:
Relies on NHWC implementation of group conv which doesn't exist right
now
Closes https://github.com/caffe2/caffe2/pull/103
Differential Revision: D4451635
Pulled By: Yangqing
fbshipit-source-id: 31d99b37abf7563a26389f47affcc759ce6bc5e1
Summary: Some DBs don't support duplicate keys. Nvidia had problems with LMDB where we potentially can set up duplicate keys. But this won't be possible in some other cases. So instead let's just store different chunks with different keys in the DB. And then when reading back we will remove the special suffix.
Reviewed By: dzhulgakov
Differential Revision: D4446583
fbshipit-source-id: 6b345e342840c5fd476029166db131d343467d48
Summary:
Perf bug report: https://www.facebook.com/groups/1405155842844877/permalink/1617904561570003/
Diagnosis:
I've done some digging into this and here's what I've found:
(1) In this use case, the call is disallowed_op_ids = get_op_ids_in_path(ssa, blob_versions, [], inputs) where inputs = ['res4_22_sum'] is the last blob produced by the res4 stage of a ResNet101 model.
(2) get_op_ids_in_path has exponential running time in the number of blocks in the res4 stage of ResNet. This is based on empirical running times. This call should complete in 4.5 days on my devgpu.
(3) I haven't familiarized myself enough with the IR and SSA code in core.py to understand the algorithmic fix yet, but surely there's a more efficient algorithm to compute the same thing.
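For reference, the usual fix for this kind of blowup is to track visited nodes so shared sub-paths are expanded only once; a hedged sketch (not the actual core.py code, whose SSA representation differs):
```
def ops_producing(blob_to_producer, op_inputs, targets):
    """blob_to_producer: blob -> op id that produces it;
    op_inputs: op id -> list of input blobs;
    targets: blobs to trace back from."""
    visited_blobs, op_ids, stack = set(), set(), list(targets)
    while stack:
        blob = stack.pop()
        if blob in visited_blobs:
            continue               # shared sub-path: expand only once
        visited_blobs.add(blob)
        op = blob_to_producer.get(blob)
        if op is None:
            continue               # external input, nothing to trace
        op_ids.add(op)
        stack.extend(op_inputs[op])
    return op_ids
```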
Reviewed By: Yangqing
Differential Revision: D4446278
fbshipit-source-id: 8bd147f92d62b865dc355d5802a53e92d64b6e21
Summary:
Now it takes two lines to get a drop-in debugger: import it and
then decorate your function. Also got rid of the enable / disable logic as
it doesn't seem useful.
We can also try to enable this by default for our tests when running
locally as a next step.
Reviewed By: bwasti
Differential Revision: D4444299
fbshipit-source-id: 6e2006945d8ad640685b1017ca1bd63054728908
Summary:
DPer example has been creating multiple copies of the transform config in the net
definition until now, which meant that I hit the ProtoBuf limit (64MB) for
certain Task requests (especially visible because of the
ValidationPipeline that I was adding).
After this diff we're going to store SigridTransforms in one instance per
machine for training (or 1 instance per reading).
Difference in sizes of the plans for a simple SparseNN model is ~30 MB (even accounting for the fact that the second model has a validation plan as well).
TODO: Do similar logic for NNPreProc as well (it's also pretty large).
Reviewed By: dzhulgakov
Differential Revision: D4441441
fbshipit-source-id: 4452dd86a4dc49b2c7f5b7642f443aed5720b047
Summary:
This will help issues like #99
Closes https://github.com/caffe2/caffe2/pull/101
Differential Revision: D4448397
Pulled By: Yangqing
fbshipit-source-id: ede3fafc1b1314886583e8ea38948bb31e69347b
Summary:
One way of simplifying the fp16 / multi-precision operators -- remove the explicit OpName / OpNameFP16 divide, dispatch the correct calls at runtime based on the contents of the input tensor(s).
Closes https://github.com/caffe2/caffe2/pull/93
Differential Revision: D4444417
Pulled By: Yangqing
fbshipit-source-id: 296dcff1e1e24ba534caca9b82f16e6634da2287
Summary:
From a new model trained by Zhen. We never exercised this codepath before since we've never had models with this choice before.
I'm auditing all our ARM_NEON codepaths to see if there are other cases like this.
Reviewed By: Yangqing
Differential Revision: D4444694
fbshipit-source-id: e0436db4e8b655551fedb21df160b7cae7e79737
Summary:
Spatial Softmax allows specifying locations that are not counted for the loss. If none of the locations are counted, this resulted in NaNs and headaches. This diff fixes that by explicitly handling these cases.
+ assertion for label blob dimension(0)
Created a new test as well.
Differential Revision: D4442939
fbshipit-source-id: 8641bfad2a994e517ca3eda39345380a6ca1ba50
Summary:
When testing the code, a couple of issues arose:
- we need to have different name for last layer than the preprocessed model, otherwise a shape assertion is created
- preprocess_noaugmentation still needs to do a crop for images larger than 227x227, otherwise things fail.
Reviewed By: viswanathgs
Differential Revision: D4442700
fbshipit-source-id: 05f54e7f17c266280f5ba5bb57af1721fe30df12
Summary:
It helps to develop scripts locally (when working outside of Flow). One doesn't have to rerun the script in order to catch an exception in the debugger / add a print statement. (Flow does this kind of thing automatically)
Usage example:
```
from caffe2.python import workspace

if __name__ == '__main__':
    workspace.GlobalInit(['caffe2', '--caffe2_log_level=2'])
    from caffe2.python.utils import DebugMode
    DebugMode.enable()
    DebugMode.run(main)  # main: the user's own entry-point function
```
Reviewed By: Yangqing
Differential Revision: D4424096
fbshipit-source-id: 73f418c80f581820e70139df7e166981e4d8c55f
Summary:
Some tweaks, hopefully getting us to 0.98 MAP
- no cropping for test dataset (as per patrick)
- spatialBN momentum 0.1 (default is 0.9)
Also added some additional logging and reduced frequency of running of test net and logging.
Reviewed By: viswanathgs
Differential Revision: D4439790
fbshipit-source-id: 700705b811a5fc8c7139a265de96db646605ca5a
Summary:
In this diff:
[1] Change the output from generating all paths from root to labels to TreeProto.
TreeProto itself is required by inference and we can use hsm_util to get the
paths from TreeProto.
[2] Fix hsm_util index assignment.
Differential Revision: D4416731
fbshipit-source-id: 657d8b9b4df6fa30c9f92d391cf7e07b5c5db1f8
Summary:
CudnnSpatialBNOp was generating a runtime warning when testing (epsilon_ < CUDNN_BN_MIN_EPSILON) even though epsilon is set equal to CUDNN_BN_MIN_EPSILON by default.
Tweaked the comparison here to allow for a small epsilon. I implemented the softer comparison by introducing FLT_EPSILON from <float.h> - let me know if there is a
preferable set of constants to use here.
Reviewed By: Yangqing
Differential Revision: D4431766
fbshipit-source-id: 5e67690a5ed258d460d95e9582b6fdf2050b42f9
Summary: Change label indices to be in the range [0, num_classes).
Differential Revision: D4416685
fbshipit-source-id: b16ca8539fd538ad62bf1298dbad3f1553956241
Summary:
Countless hours were spent debugging why ImageInputOp failed with a cryptic exception P56967302. Turns out, the assertion happened in the PrefetchOp destructor, which was triggered when an assertion failed in the ImageInputOp constructor. Because of this, the underlying problem was shadowed. I fixed this by not asserting on finalize_ if there is no prefetch thread running, and now the error is clean:
[enforce fail at image_input_op.h:105] scale_ > 0. -1 vs 0. Must provide the scaling factor.
Reviewed By: Yangqing
Differential Revision: D4435105
fbshipit-source-id: 52f85a9fd30eea396c9faca54b6d946fa847b7ff
Summary:
Minor bug in D4426513 - bias was always added
as an input blob. Running it on xray throws "RuntimeError: [enforce fail at operator.cc:25] blob
!= nullptr. op Conv: Encountered a non-existing input blob:
caffe.SpatialConvolution_0_b"
Reviewed By: Yangqing
Differential Revision: D4429231
fbshipit-source-id: 0d3905ea6e87128ec1aa9d0f0a2f43126b1069b1
Summary:
Turns out xray models have some independent Scale layers (with bias) besides
the Conv-Scale pairs. We could still fuse it with previous layers with some
work, but for simplicity, including Add op followed by Mul for bias if needed.
We could revisit layer fusion optimizations in the future once we have
something working for xray.
Reviewed By: Yangqing
Differential Revision: D4427266
fbshipit-source-id: ef7d8677ccd7d10dbd20759eeed378d9bc4522d1
Summary: Now that we directly support group convolution, this will no longer be needed. I also took the chance to add dilated convolution and also optional bias.
Reviewed By: prigoyal
Differential Revision: D4426513
fbshipit-source-id: eb2bb0aa619f8ff5f732512570f736bc59cd57dd
Summary:
This is a handy tool for amortizing expensive operators (e.g.
distributed communication, some heavier kernel launches, etc) over a
lot of small blobs (e.g. all the biases in a network). We can just
coalesce these small blobs in-place into a single blob, act on them in
operators, etc as if they are non-coalsed (passing them as inputs to
operators, etc), and then finally for heavier operators, just work on
the coalesced blob that contains each of these units.
I named it UnsafeCoalesce since it introduces blob aliasing, which
needs care for work like memory management, graph rewriting as in
memonger, etc.
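A NumPy stand-in for the idea (not the Caffe2 op itself): the small blobs are packed into one flat buffer and each original name is rebound to a view of it, so lightweight per-blob work keeps its interface while heavy operators can act once on the coalesced buffer.
```
import numpy as np

def coalesce(blobs):
    """blobs: dict name -> 1D float32 array. Returns (flat_buffer, views)."""
    total = sum(b.size for b in blobs.values())
    flat = np.empty(total, dtype=np.float32)
    views, offset = {}, 0
    for name, b in blobs.items():
        flat[offset:offset + b.size] = b
        views[name] = flat[offset:offset + b.size]  # aliases flat's memory
        offset += b.size
    return flat, views

flat, views = coalesce({"bias1": np.zeros(4, np.float32),
                        "bias2": np.ones(3, np.float32)})
flat += 1.0             # one heavy op over the coalesced buffer ...
print(views["bias1"])   # ... is visible through every aliased view
```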
Reviewed By: Yangqing
Differential Revision: D3557149
fbshipit-source-id: 09cff4459b84270fe9e1da3b4a168fd66d01f795
This is because the current version of luaffifb fails to pass
custom structs (i.e. half) as arguments or accept them as return
values.
The accreal parameters are immediately converted to real internally.
This is done to ensure none of the internal code needs to be changed.
This change also removes transform_reals_to_half which is no longer
necessary.
Change-Id: I978151d001de5492576fb0eddfa0608cd4e99149
Summary: Failing fast instead of swallowing the bias term.
Differential Revision: D4419130
fbshipit-source-id: 98ce0af9a20adecfb027ffe8293ff69910873abc
Summary:
Simple tool similar to caffe_translator_test.py for conversion from caffe to
caffe2. The differences are:
There are a couple of issues that need to be fixed as mentioned in
https://our.intern.facebook.com/intern/tasks?t=15424761, especially related to
the 'legacy_pad' field in conv op.
Differential Revision: D4407146
fbshipit-source-id: ec641f6d7e0cf6cdf2eca21f058b4451635d4a56
Summary: Data parallel model has a sanity check that ensures that operator inputs/outputs do not cross device boundaries. This failed when the operator was a CPU-only operator (such as the new AccuracyOp version). This fixes that.
Reviewed By: prigoyal
Differential Revision: D4417841
fbshipit-source-id: 9bc4e7a2074a544ca4db69ecf24183bbd41f84ca
Summary:
First step in doing multi GPU training - modification of training code to use ImageInputOp. A few changes to accomplish that:
+ modified script that generates our lmdb to store byte image data instead of float
+ we have a float 'label' for our regression problem so added support for float labels in ImageInputOp
+ updated train_network.py to use ImageInputOp, but it is still single GPU
Reviewed By: seansnyder
Differential Revision: D4407728
fbshipit-source-id: a59a1b91b69a9d5f0486383d4fb0a993478393c9
Summary: Github import didn't work and the manual import lost some files.
Reviewed By: Yangqing
Differential Revision: D4408509
fbshipit-source-id: ec8edb8c02876410f0ef212bde6847a7ba327fe4
Summary:
It looks like markdown is not happy with lines starting with =. This diff
simply fixes 2 cases where that was the case.
Reviewed By: dzhulgakov
Differential Revision: D4409033
fbshipit-source-id: f2ba3ce5e3936a1e0d57984c12234209993550be
Summary:
It ended up much messier than originally expected. Maybe we should have just hardcoded it, but I've tried to be "generic" so far at the expense of code readability.
The main issue is that for the weight computation we need access to the original embedding matrix, and in the sparse case we need to re-look up the embeddings to do the dot product with the output grads.
Thus I'm making weight grad computation optional, controlled by a flag and it triggers invocation of a different backward op that produces both grads at the same time.
So far it's implemented only for 'Lengths' version. It'd be straightforward to implement (Un)SortedSegment versions but I haven't done that yet.
Reviewed By: kennyhorror
Differential Revision: D4388215
fbshipit-source-id: 23132ab7daa1f5eec49233f802af1fe75b469c2b
Summary: Just to make further work a bit easier.
Reviewed By: kennyhorror
Differential Revision: D4388071
fbshipit-source-id: 71b99ef1c2dc680afe4e9ef2f7a370e43116ce99
Summary:
It looks like for the types that are created directly through the type(...)
function call, we don't store strong references anywhere. As a result,
a GC call in Python might or might not clean up these classes depending on the
phase of the moon and other random things. This means that in some
cases simple layers such as a Relu might disappear.
cat_shame
Reviewed By: xianjiec
Differential Revision: D4396289
fbshipit-source-id: ba4e9b7ef54ee43349853b0acc3d3f40c74e4d73
Summary:
(Ignore the convolution-op related changes, they will be later patched separately)
This diff includes work from the last few weeks:
- some refactoring of the flow ops
- no_bias setting
- MAP computation (instead of accuracy) for OC
- adaptive learning rate for Xray concepts
- various small bug fixes
Reviewed By: viswanathgs
Differential Revision: D4329500
fbshipit-source-id: 000d4fd22ec408af5290480c788eb86546bff52e
Summary: DivOp missed a gradient for CUDA, so implemented it. Also added operator test.
Differential Revision: D4396638
fbshipit-source-id: 9949e47aa3735bb418a0db003e2b2f4896056a71
Summary:
This diff brings us roughly to par with Torch on ResNet memory usage. On batch size 32, Resnet-50 took 7497 MiB; after this, 5010 MiB. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs.
In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed since there are allocations on the first iteration's backward pass due to gradient sharing, and this will cause NCCL to deadlock.
The sharing of gradient buffers requires inferring which gradients can share memory (i.e that they are not used concurrently). Previous memonger code uses topological sort, but rbgirshick showed that it does not work with tree-like models. Thus, I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs / GPU on resnet-50, so is clearly fast enough.
Module data_parallel_model supports this feature natively.
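As a toy illustration of the general idea of liveness-based sharing (this is not the diff's DFS algorithm, just a greedy sketch over ops in execution order): a blob's buffer becomes reusable once the last op that reads it has run.
```
def assign_shared_buffers(op_inputs, op_outputs):
    """op_inputs/op_outputs: per-op lists of blob names, in execution order."""
    last_use = {}
    for i, ins in enumerate(op_inputs):
        for b in ins:
            last_use[b] = i

    free, assignment, next_buf = [], {}, 0
    for i, outs in enumerate(op_outputs):
        for b in outs:
            if free:
                assignment[b] = free.pop()      # reuse a dead blob's buffer
            else:
                assignment[b] = next_buf
                next_buf += 1
        for blob, last in last_use.items():
            if last == i and blob in assignment:
                free.append(assignment[blob])   # buffer is now reusable
    return assignment
```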
Reviewed By: prigoyal
Differential Revision: D4363209
fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1
Summary: Currently Gather doesn't check if the provided indices are in the correct range. Adding a check makes issues easier to debug
Reviewed By: dzhulgakov
Differential Revision: D4277170
fbshipit-source-id: dc744b6a229aaf72af8336a417f0f79c97dbdc77
If USE_GLOG and USE_GFLAGS are set to off, or if the system does not
have glog and gflags installed, caffe2 will fall back to a non-glog
and non-gflags installation. This would be helpful for e.g. mobile
builds.
Summary: TSIA - this is needed when users choose to build without glog.
Reviewed By: bwasti
Differential Revision: D4380186
fbshipit-source-id: 1803d451e296f3af5258e0d67d4afdec5f5e5623
Summary: This is needed to properly compile when gflags is not present.
Reviewed By: bwasti
Differential Revision: D4379796
fbshipit-source-id: 3344fa304d85feabbdba81449f663405ed731797
Summary:
Adds a thread pool for image decode, and optional GPU-based data conversion, mean subtraction and std division
Closes https://github.com/caffe2/caffe2/pull/56
Reviewed By: Yangqing
Differential Revision: D4341326
Pulled By: bwasti
fbshipit-source-id: 6485616ea7d212c7701274a40fae912db30dff4a
Summary:
While debugging resnets on imagenet, Ross pointed out that MSRAFill is not done correctly. Fixing that:
1. use fan_out not fan_in
2. Normal distribution rather than uniform
Reviewed By: Yangqing
Differential Revision: D4372380
fbshipit-source-id: 8f03bd75f543caa60c20e841edbdbb918d1c8775
Summary:
This is needed so that we stick with C++11 instead of 14, which is not well
supported on a few platforms.
Reviewed By: bwasti
Differential Revision: D4377534
fbshipit-source-id: d65d7caaa935a8f16e3b44c838104a576c8f78e4
Summary: Same as D4312617 but this time not excluding source files with `#define`.
Reviewed By: soumith
Differential Revision: D4344811
fbshipit-source-id: 5a314960c319f029c6737c8c8ac8224ec2f20218
Summary:
This diff adds a couple of options to `htrace_to_chrome.py` so that users can specify start and end timestamps for displaying spans.
For example, the arguments `--start_time x --end_time y` indicate that spans that finish before `y` or start after `x` will not be included in the final chrome tracing json file.
This also adds timestamp information to the spans which can serve as hints to the command line argument values.
Differential Revision: D4372220
fbshipit-source-id: a2b0af3be6861448874d804b30426df1b67a676e
Summary: provide an easy way to benchmark different dper models.
Differential Revision: D4367258
fbshipit-source-id: 4821645c58ad183becf0c82daae991375d5c6ef4
Summary:
This is a quick bugfix on `htrace_to_chrome.py`, which produces outputs with wrong file names if command line arguments are given in a specific way.
fbcode $ python caffe2/caffe2/contrib/prof/htrace_to_chrome.py --display operator /tmp/htrace_alexnet_span_log_20161224_055901
Writing chrome json file to --display.json
Now import --display.json in chrome://tracing
Differential Revision: D4369445
fbshipit-source-id: 628f4dbd88fb86814a0d92cd4c8407ba12a401d0
Summary:
this normalizes the sparse gradient, so that the "effective learning rate" of each sparse parameter will NOT be affected by the number of examples in a batch that "use" this sparse parameter.
Experiments show it helps convergence (about 0.1% better train NE): https://fburl.com/1230747813683956. It's not conclusive yet, and we still need to do more experiments. But this diff adds it as an option, and does not change the default behavior, so we can get this in first.
Differential Revision: D4367283
fbshipit-source-id: 49ea80dfa9ea776ff4160e220cf6c86593521607
Summary: This diff adds a gflag for specifying the path for htrace span log files. This flag is used by the net types `HTraceDAGNet` and `HTraceAsyncDAGNet`.
Differential Revision: D4366849
fbshipit-source-id: 56038d3d64a3fd5ab363feda86a19a6f2496971c
Summary:
Rewrite D3993337 based on new stack.
Compared to the old one, we need more readers to achieve the same speed. But so far the speed is the same and the new bottleneck is the write bandwidth of the trainer. Model quality is the same as the base.
Reviewed By: azzolini
Differential Revision: D4310803
fbshipit-source-id: 6d04ae8040c1ee7caa9aea5287f054e73fbe325a
Summary: As title. We want to have request_only net which runs on user_only sparse features. Submitting to get early feedback.
Reviewed By: dzhulgakov
Differential Revision: D4282783
fbshipit-source-id: 71241bf5444550075884c788c2da4783659bc1e0
Summary: Recently a PR landed that removed the assert on trying to feed float64 to FeedBlob for GPUs and changed it to a warning. Thus the test checking that the assertion was raised started to fail. Removing it.
Reviewed By: Yangqing
Differential Revision: D4363780
fbshipit-source-id: d9e222c309302243138d4ff3c223c711a4d2052d
Summary:
I was testing the perf difference between naive group conv and cudnn group conv. I am doing no_bias conv and added support for that in the naive implementation.
Although it's deprecated, I thought it would be nice to have working things in our code.
Differential Revision: D4363168
fbshipit-source-id: 29719013d79b449fd359884709c7a1195be51ae3
Summary: This diff adds an option to the htrace_to_chrome.py format conversion script so that users can decide to display less traces by hiding kernel/operator/worker spans. For example, passing the arguments `--display worker` will make the script process spans up to worker spans and not go further (deeper).
Differential Revision: D4360404
fbshipit-source-id: aa5af7e499b94aeb3de06823bdeeedfbc3b1c02b
Summary: As per discussion in D4355529
Reviewed By: prigoyal
Differential Revision: D4362162
fbshipit-source-id: 795fcf1507235a7dc3c7a10b0453037936d057aa
Summary:
Essentially, when the number of pairs is around 1000, only the positive samples in the list get a massive boost from all the negative examples. This diff normalizes the gradient and the loss with the number of pairs.
This diff also adds protection against NaN and more logging to help debug.
Reviewed By: kdub0
Differential Revision: D4359782
fbshipit-source-id: 7240344ddb1f2f670d1eec1b03e7f6e413f3dfcc
Summary:
It used to be that only the cudnn engine supported it, and now it should be
fully supported by any conv engine.
To ignore bias, simply use a convolution op that has two inputs instead of
3. The gradient operator will automatically figure out that it does not
compute the bias gradient.
Reviewed By: prigoyal
Differential Revision: D4354183
fbshipit-source-id: cf71b6289a254d15a6a663a85df63fbbaec3702b
Summary:
Ievgen ran into this bug with his dper work - we didn't preserve metadata on the lengths field.
Also, we didn't take keep_blobs into account for List's main field. Now fixed.
Also reformatted the file to be nice.
Differential Revision: D4357859
fbshipit-source-id: 1c26c533a10d38afab13b46ccbcb541f5fa9074a
Summary: As discussed, this improves performance a lot and is not a memory hog anymore. Anyway anyone can also turn it off.
Differential Revision: D4338798
fbshipit-source-id: bf0fdb594427ebe90e1e94b2effdc63196096b3f
Summary: att. Part of the effort to unify loader configuration.
Differential Revision: D4342147
fbshipit-source-id: bb021112f61d4838b0ccc7a5a8bcaf272cb35cd8
Summary:
This is a first step in improving our RNN story. It provides a wrapper around the current RecurrentNetworkOp implementation which infers most of the redundant parameters and makes the API much simpler.
Also in order to support general step nets I added an extra argument to the RecurrentNetworkOp.
Future work:
1. Inferring step net output and internal blobs (scratches) sizes and type
2. Avoid accessing blobs by names in c++ part
3. Remove requirement for inputs / output 1:1 correspondence in the step net
4. Make the python API support networks with operators like Sum being on the border of the Cell net (currently there is an issue with such networks where gradient blobs which are on the side are not explicitly created).
Differential Revision: D4268503
fbshipit-source-id: f8a66491c2b55daa730caeed7e9f2b3921541b49
Summary:
This is an ongoing work - currently the forward pass is implemented, but backward
is yet to be done. We might want a CPU counterpart as well.
I will wait for D4341288 to land and then make bias optional.
Reviewed By: prigoyal
Differential Revision: D4342210
fbshipit-source-id: 51bb0e98d917970bdc040d076b535beb8e994d9a
Summary:
This diff adds HTraceAsyncDAGNet, which is basically the async_dag version of HTraceDAGNet. Similar to HTraceDAGNet, we can use HTraceAsyncDAGNet by setting the net type to `htrace_async_dag`.
For now, we only track iteration spans and do not go deeper (operators, gpu kernels, etc.) because due to the implementation of AsyncDAGNet, applying HTrace is much more intrusive compared to HTraceDAGNet. Creating spans for operators for HTraceAsyncDAGNet is a future task.
This diff also adds a minor change in the TARGETS file so that `htrace_dag`, `htrace_async_dag`, and `prof_dag` are all accessible via one rule.
Differential Revision: D4351587
fbshipit-source-id: 1a4075a9a5efdfafb828a81b663cc731858f7307
Summary:
Fix warnings exposed by gcc-4.9.x's -Wshadow-compatible-local
(and/or the stricter -Wshadow-local) options. Note that these
are both less onerous than -Wshadow.
I plan to enable one of them for all of fbcode, soon.
Rename inner "convert" to "convert2".
Reviewed By: Yangqing
Differential Revision: D4347297
fbshipit-source-id: 7494aedbaeeb2e5356db0612f5f32077f7ffd30b
Summary: This diff adds an option to use rank loss instead of cross entropy loss during training. This assumes that the data is loaded in batches which corresponds to sessions, which is something that was implemented for RNN training
Differential Revision: D4261923
fbshipit-source-id: e92a60cc9f53acc1585ac35d1fdb430c2ebbfa33
Summary:
With __name__ == "__main__" defined, MPI4Py was no longer being set up as intended, leading to test failures on syntax errors (_has_mpi, COMM, RANK and SIZE were no longer defined in a global scope). This is fixed via explicit use of global variables and factoring out the MPI setup into a new method.
Closes https://github.com/caffe2/caffe2/pull/59
Reviewed By: Yangqing
Differential Revision: D4348956
Pulled By: bwasti
fbshipit-source-id: ee741a0fff1df00eade1b6d5e1c281afcb38da6a
Summary:
Only tests for SparseFunHash for now
Closes https://github.com/caffe2/caffe2/pull/60
Reviewed By: Yangqing
Differential Revision: D4348961
Pulled By: bwasti
fbshipit-source-id: cd05d73ccc711b42a7d33e7a6b65a9d1a9bfa7e6
Summary:
Yangqing This seems to work for me, not sure if it's implemented in the right way for you to accept :)
Allows user to specify "no_bias" as an option for convolution layers (only cuDNN at this point), so that the bias associated with that operator is not allocated or computed. This is useful in particular for conv + BatchNorm combinations (such as ResNets), as the bias term can be handled by both conv and Batch Norm, wasting memory and computation.
Closes https://github.com/caffe2/caffe2/pull/50
Reviewed By: Yangqing
Differential Revision: D4341288
Pulled By: bwasti
fbshipit-source-id: e6138d0024c83ed876dff2f83ffbebe7de502fd8
Summary: As part of a PR from GitHub, "logging.basicConfig()" was added to workspace, causing havoc with existing logger configurations. It should not be here. Thanks rbgirshick for reporting.
Reviewed By: kdub0
Differential Revision: D4346077
fbshipit-source-id: 084ddcbfe6354bdaf5c97a42086c0bd36ec4629c
Summary: Found some comment typos while working on T14849353.
Reviewed By: Yangqing
Differential Revision: D4334469
fbshipit-source-id: f880e2a3e9a4e1152b315c6d3c8b68ad298d6334
Summary: Builds caffe2 and dependencies for macOS. Not included in the MSQRD engine or elsewhere yet.
Reviewed By: Yangqing
Differential Revision: D4334013
fbshipit-source-id: 31cacf07e2b07f379e1894e51dde5103c56b8815
Summary:
I don't know why I introduced this embarrassing bug that swapped the order of
ldb and beta in the gemm interface. This fixes that.
Differential Revision: D4014493
fbshipit-source-id: 1aec950b6e9d57e947654d4044e50930f2db1344
Summary: The Torch docs about Resnets, and soumith's comment, mention significant memory savings with in-place ReLU. prigoyal already had this in her code, but I did not. This saves a lot of memory: 9851 MiB -> 7497 MiB.
Reviewed By: prigoyal
Differential Revision: D4346100
fbshipit-source-id: e9c5d5e93787f47487fade668b65b9619bfc9741
Summary:
We create a Sum operator to sum up the gradients. Currently we use strings for its input/output blobs.
So the code will fail if AddAllGradients() runs within a NameScope.
To avoid this, just use BlobReference instead of string for the blobs.
Reviewed By: xianjiec
Differential Revision: D4343701
fbshipit-source-id: 2d008916e192d75c6e20f97921331ac4c7b73363
Summary:
This is a first diff to remove the "easiest" unused includes in fbcode.
* For safety, we only touch .cpp files without #if and #define,
* We do not try to remove redundant systems headers (aka. "packing").
The diff was generated as follows:
```
foundation/scripts/ls-cpp-dirs | grep -v '^\(\.\.\|external/\|.*/external\)' | xargs ffmr -o /tmp/ffmr-diff-1 codegraph/scripts/ffmr/analyze_includes_no_headers_no_packing_skipping_ifdefs.sh
cat /tmp/ffmr-diff-1/*.diff | patch -p2
hg commit -m something
arc diff --prepare --nolint --nounit --less-context --excuse refactoring
```
Note: `grep -v` is just an optimization. The actual configuration is in these two files:
diffusion/FBS/browse/master/fbcode/codegraph/analysis/config.py
diffusion/FBS/browse/master/fbcode/codegraph/scripts/ffmr/analyze_includes_no_headers_no_packing_skipping_ifdefs.sh
See the task for more context, and the recent "safety" improvements on the tool.
depends on D4317825 for very few cases where `nolint` had to be manually added.
Reviewed By: igorsugak
Differential Revision: D4312617
fbshipit-source-id: ecc1f0addfd0651fa4770fcc43cd1314661a311a
Summary: Avoid printing message repeatedly each time the conv_transpose_op (with cudnn) is called
Reviewed By: Yangqing
Differential Revision: D4337242
fbshipit-source-id: 27b048bad8c54604d91174acd4928a1496f2f5c7
Summary:
The exception in FeedBlob causes many tests to fail.
Instead of an exception, we log a warning message and move on.
Feeding a float64 blob should not cause any issue.
Closes https://github.com/caffe2/caffe2/pull/57
Reviewed By: bwasti
Differential Revision: D4343135
Pulled By: Yangqing
fbshipit-source-id: cd1144b94c9883fcbd8bdcd78f9f93a67debc0a6
Summary:
An operator that reads labels, computes their counts, and generates a Huffman tree
hierarchy. It generates all paths from the root node to leaf labels as a serialized
HierarchyProto to be used as an input to the HSoftmax operator.
The tree is constructed in a bottom-up greedy way, keeping indices to parent
nodes in order to generate the code and the path from root to leaf in
a bottom-up traversal.
Note:
HSoftmax handles computing a generic hierarchy, which means for the binary case
we can save one matrix x vector operation per node by representing every node as
a logistic function, and also reduce the paths proto size by producing only
one integer list to represent the path / indices and a bytes list for the code
per label.
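A hedged Python sketch of the bottom-up greedy construction described above (the real operator emits a serialized HierarchyProto; this just builds the tree with a heap and returns per-label root-to-leaf codes):
```
import heapq
import itertools
from collections import Counter

def build_huffman_paths(labels):
    counts = Counter(labels)
    tie = itertools.count()   # tie-breaker so equal counts never compare nodes
    # A node is (label, left, right); leaves have left == right == None.
    heap = [(c, next(tie), (label, None, None)) for label, c in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, n1 = heapq.heappop(heap)   # two least frequent nodes ...
        c2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tie), (None, n1, n2)))  # ... merged

    paths = {}
    def walk(node, code):
        label, left, right = node
        if left is None:
            paths[label] = code          # leaf: record root-to-leaf code
        else:
            walk(left, code + [0])
            walk(right, code + [1])

    walk(heap[0][2], [])
    return paths

print(build_huffman_paths([0, 0, 0, 1, 1, 2]))
```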
Differential Revision: D4303294
fbshipit-source-id: c7f0d3c204536234c26bb2a4228cb3a1892db395
Summary: this was introduced due to rm and riv params in SpatialBN layer and the likes. We should be saving these params as well but it is not required to broadcast these params to all gpus after every epoch.
Differential Revision: D4338749
fbshipit-source-id: d3bbc92cf0cd7d220a51d76aea8bffcfd6e520b7
Summary: For some reason I had been disabling the exhaustive search heuristic for cudnn for xray/resnet trainers. On BigBasin, this gives 10% perf boost. On BigSur maybe 5%.
Reviewed By: prigoyal
Differential Revision: D4338654
fbshipit-source-id: 3974dd612f5d4f4dc8b2febccb59664d3f276c3e
Summary: I accidentally landed the control_input disable for NCCL in D4327024. This empirically increases the likelihood of deadlocks, although it gives a nice perf boost. But better to disable it until NVIDIA fixes their stuff.
Reviewed By: Yangqing
Differential Revision: D4338537
fbshipit-source-id: d43efb45965a88bcfe38e5f1dc16c04463e2e038
Summary:
A couple of more misc changes:
- allow starting the coordinator multiple times -- this makes data parallel programming easier
- make the fetcher id a global sequence; before, each gpu had the same ids for workers
- my flow jobs got stuck when joining the fetcher threads. I think there is actually a memory fencing problem with the is_active boolean. But I am too tired to add proper condition variables there. Instead just add timeout to join(). It is needed anyway since some i/o thread could get blocked.
Differential Revision: D4333381
fbshipit-source-id: 88226c8a9c9a5e05d771360a502a2ba21a6b9d76
Summary:
This adds Caffe2 support for MKL operators directly with MKLMemory. Included a
Relu layer that shows how to use it.
Reviewed By: salexspb
Differential Revision: D4322144
fbshipit-source-id: 8b3392c4fd024ab1a7ba7135c349ebd3e1976799
Summary: This diff moves all tracing code under fb/htrace and fb/prof to contrib/prof.
Differential Revision: D4333032
fbshipit-source-id: 1d1ae14c3d376a89f9199561cada53b2ca62e81a
Summary:
As requested by Yangqing, added Inception model (copied from convnet_benchmarks) and a dummy data feed option to the xray trainer, that we use for scalability benchmarking.
+ a couple of minichanges to the data input framework
Reviewed By: Yangqing
Differential Revision: D4327024
fbshipit-source-id: 86911468456fc13a32d5f437a43347380ec66a68
Summary:
This is just a stub for now. I need to add a report metric as well before I can produce a complete flow.
Possible extensions:
Implement list-wise loss, allow for more than one session in a batch and create a framework for arbitrary loss functions to be applied
The data loader will be the same as for RNN
Reviewed By: xianjiec
Differential Revision: D4245176
fbshipit-source-id: 546683b6551654a37c410dc1606e556a7bf83a2a
Summary:
We often use the same net for training and testing, but we must distinguish their data. Yesterday's diff forgot to include that distinction (it was in the xray sampler before), and this diff adds it. Basically one provides a name for the input source for data_workers, and all the queues and scratch spaces are suffixed with that name to separate them.
Also set the caffe2 queue's size to 4, which was empirically found to be sufficient. It was erroneously defined to be a function of batch size, which does not make sense since each *element* in the queue is a batch, and led to out-of-memory issues on the xray trainer.
Differential Revision: D4329449
fbshipit-source-id: c994da1c8b0935b8eda2402c118d49b76caa7da8
Summary:
adding imagenet dataset as well
data augmentation and model has been added, just need to add db read
Differential Revision: D4289150
fbshipit-source-id: b531d3f09e3d0efac5cda5bb75d8146e1bb693e4
Summary:
float64 test breaks things on the cuda side. I am deleting it for now and if
we add it back, let's make sure we run the test on a GPU machine first :)
Reviewed By: azzolini
Differential Revision: D4324427
fbshipit-source-id: 0246fe9dd28a286422ca94c90f5b0fc33a162e74
Summary:
Xray sampler (originally by ajtulloch) and prigoyal's resnet trainer use variants of the threaded data input where worker threads put stuff into a python queue that is drained by an enqueuer thread that dumps those batches to a Caffe2 queue, that is then drained by the net's DequeueBlobs operator.
There is a lot of boilerplate, which is also quite complicated.
This diff is an attempt to generalize that common machinery under a new module, "data_workers" (name could be improved). Basically you pass it a function that is able to return chunks of data (usually data + labels).
I also created a module 'everstore_data_input' which generalizes everstore-origin data input with a preprocessing function (image augmentation, for example). See how I refactored sampler.py for the usage.
Next we could create fetcher function for Laser data.
Differential Revision: D4297667
fbshipit-source-id: 8d8a863b177784ae13940730a27dc76cd1dd3dac
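A hedged sketch of the fetcher-function contract described in the summary above. The function and argument names here are hypothetical and not taken from the actual data_workers API; the real module wires such a function to worker threads, a Python queue, an enqueuer thread and a Caffe2 queue.

```python
import numpy as np

def dummy_fetcher(worker_id, batch_size):
    # Return one chunk of data plus labels as numpy arrays, as the module expects.
    data = np.random.rand(batch_size, 3, 224, 224).astype(np.float32)
    labels = np.random.randint(0, 1000, size=batch_size).astype(np.int32)
    return [data, labels]
```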
Summary:
This renames the "Snapshot" op name to "Checkpoint" as we discussed earlier.
The old Snapshot name is still available, but we should move to the new name and
eventually deprecate the old one.
The Python SnapshotManager should also be changed, cc azzolini
Reviewed By: dzhulgakov
Differential Revision: D4272021
fbshipit-source-id: 4b8e029354416530dfbf0d538bfc91a0f61e0296
Summary:
TSIA
We also return a reference for Input and a pointer for Output, just to be consistent
with the rest of the framework.
Reviewed By: bwasti
Differential Revision: D4318148
fbshipit-source-id: 857fd72bf929dac04a890f8f787a6fad84bd4287
Summary:
I have noticed that constructing the Xray model takes quite a while. To measure this, I wrote a benchmark script that creates a resnet-50 model on 8 gpus. This takes about 95 secs -- which is kind of annoying when you want to quickly debug stuff.
Profiling (using Python's cProfile), I was able to see that most of the time is spent in net.BlobIsDefined(), which does a linear search over external inputs and operator outputs. Thus it gets slower and slower with large nets. This can be fully optimized by keeping a separate lookup table of operator inputs and outputs (and external inputs and outputs). It is a bit annoying to keep this separate data structure, but I set up the unit tests to ensure things are done correctly over Clones.
After the optimization, the net construction drops from 95 secs to 8.2 secs!
Reviewed By: azzolini
Differential Revision: D4288307
fbshipit-source-id: 0bb82c8bde9d86a2702b298f4aa706cba509346e
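A minimal sketch of the lookup-table idea described in the summary above (not the actual Net class; names are illustrative): instead of scanning all operator inputs/outputs on every membership check, maintain a set that is updated whenever an op is added.

```python
class NetSketch:
    def __init__(self):
        self._ops = []
        self._defined_blobs = set()   # hypothetical cache of op outputs / external inputs

    def add_op(self, op_inputs, op_outputs):
        self._ops.append((op_inputs, op_outputs))
        self._defined_blobs.update(op_outputs)

    def blob_is_defined(self, blob):
        # O(1) membership test instead of a linear scan over all ops
        return blob in self._defined_blobs
```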
Summary: Allows collecting samples over multiple batches. The method uses a circular array, so there is no guarantee about the order of the samples. The goal is to get a view of the data across multiple batches.
Reviewed By: salexspb
Differential Revision: D4216181
fbshipit-source-id: bb9e1fa84ac7e04006dcddb53c9347a42ec83dc8
Summary: Added gradients for the Copy operators. They are simply the reverse operation. Also added a unit test to test things actually work and added the operator schema and registration to model_helper's known operators.
Differential Revision: D4306516
fbshipit-source-id: dd0633fa7f2ed01991990e56e63669794df037d9
Summary:
Fix RecurrentNetworkGradient with batch size > 1.
The main issue was that we always set the gradient output to (1, 1, recurrent_size), which mismatches the input (1, batch_size, recurrent_size).
Further gradient ops do Squeeze and split assuming that the output gradient blob is the same size as the input, so they fail.
The fix is simply resizing the output like the input (1, batch_size, recurrent_size); I had to move the resize to RunOnDevice since batch_size is computed from Input(0), which is not available until we actually run the op.
Differential Revision: D4301487
fbshipit-source-id: e5c7426d6e770d985ce72a3737381a2b4af333ba
Summary:
We want to implement a request-only net, and to do this we decided to split the work into two parts. The first part will propagate the required metadata and the second part will cut the nets properly.
This diff is to propagate request_only metadata across the layers.
A few notes about implementation:
- Each layer contains a field request_only which can be set based on the input_record. If all the scalars from the input_record are marked request_only we mark a layer as request_only;
- Sparse-To-Dense layer sets request_only metadata;
- SigridTransformation and SparseLookup layers propagate request_only status;
- As for now we join request_only and other sparse features together in input_record, but ideally we may want to separate this, because request_only should be served separately;
Reviewed By: xianjiec
Differential Revision: D4259505
fbshipit-source-id: db8a30ef92cba84f1a843981b9dde3a8b9633608
Summary: The doc for sequence ops says "pad_width" instead of "padding_width". This diff fixes it.
Differential Revision: D4277186
fbshipit-source-id: 63af6cce2fe0af0d395f78c6a6a1f41518039cf8
Summary:
It gives a significant perf boost to do the parameter update inside MomentumSGD, instead of with a separate WeightedSum op.
To ensure backwards compatibility, I made it a separate op.
Also added a unit test.
Reviewed By: prigoyal
Differential Revision: D4262446
fbshipit-source-id: 38e7ee6d7677b398658ac7fe9b7a59b569e033f4
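For reference, a hedged numpy sketch of what a fused momentum-SGD update step does; details such as Nesterov handling are omitted and this is not the operator's exact code.

```python
import numpy as np

def momentum_sgd_update(grad, moment, param, lr, momentum=0.9):
    # Fold the momentum accumulation and the parameter update into one step,
    # instead of a separate WeightedSum pass over the parameters.
    adjusted = lr * grad + momentum * moment
    return adjusted, param - adjusted   # new moment, updated parameter
```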
Summary: some operations don't handle the case where the output tensor is empty, and cause segfaults or unexpected behavior (uninitialized output tensor). This diff ensures that BatchMatMul, filler operations, PackSegments/UnpackSegments and ReadNextBatch don't fail and properly initialize their output with the correct type. Those seem like fairly straightforward changes, let me know if you'd rather break it up into separate diffs.
Reviewed By: Yangqing
Differential Revision: D4277149
fbshipit-source-id: c5a30b67bb3b451b117d6aa83827d40b71240c2b
Summary: I couldn't find a way to fill a tensor with a shape provided at runtime, so I added an input_as_shape option to the filler ops. When input_as_shape is true, the input can be used to directly provide the shape of the output (this is different from the default behavior, where the output is reshaped like the input). For example if the input contains [2, 3], the output will have shape [2, 3]. Let me know if you see a simpler way :)
Reviewed By: Yangqing
Differential Revision: D4276872
fbshipit-source-id: 095e995d8bf302152765bd51c405185ef9952212
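A hedged example of the behavior described in the summary above, assuming the filler ops expose an `input_as_shape` argument as stated there (details of the argument name and accepted dtypes may differ):

```python
import numpy as np
from caffe2.python import core, workspace

# The input blob holds the desired output shape rather than a tensor to mimic.
workspace.FeedBlob("shape", np.array([2, 3], dtype=np.int64))
op = core.CreateOperator("ConstantFill", ["shape"], ["out"],
                         input_as_shape=1, value=1.0)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("out").shape)   # expected: (2, 3)
```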
Summary:
I've been noticing when running caffe2 experiments that calling Exp with many values whose result is close to 0 causes MKL's underflow error handler to be called repeatedly, causing significant overhead even though the result is correct (e.g. exp(x) = 0). I suggest setting the error mode to VML_ERRMODE_IGNORE for those functions, unless there are good reasons not to.
with the current function (see mkl_vml_kernel_sError and vsexp_cout_rare):
{F65147147}
with VML_ERRMODE_IGNORE:
{F65147148}
Let me know if you see a better workaround
Reviewed By: Yangqing
Differential Revision: D4277240
fbshipit-source-id: d44168da32caee4a3f88227ffb70cdc3d5314722
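A small illustration (numpy, not MKL) of why the underflow reported above is benign: for very negative inputs, single-precision exp simply flushes to 0, which is the mathematically expected answer, so the summary argues the error handler's overhead buys nothing.

```python
import numpy as np

x = np.float32(-200.0)
print(np.exp(x))   # 0.0 -- harmless underflow to zero
```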
Summary:
prigoyal sharply noticed a bug in the Resnet models: we have not been checkpointing, nor synchronizing between gpus, the moving average and variance computed by the SpatialBN ops. Particularly the first problem is serious, since models starting from a checkpoint would have started from a null state for SpatialBN. Not synchronizing with the data parallel model is less tragic, since each GPU should see very similar data.
Thus I propose keeping track of "computed params", i.e. params that are computed from data but not optimized. I don't know if there are other examples, but SpatialBN's moving avg and var definitely are one.
- I modified the checkpointing for the xray model to store those blobs + also ensure the synchronization of those blobs
- I modified data parallel model to broadcast those params from gpu0. I first tried averaging, but hit some NCCL deadlocks ... :(
Differential Revision: D4281265
fbshipit-source-id: 933311afeec4b7e9344a13cf2d38aa939c50ac31
Summary: with the current code, Concat accepts inputs of different types and concatenates them as raw data. This causes bugs that can be hard to find: for example, when concatenating a tensor of int with a tensor of long, the long integers get split in two, and the output tensor contains garbage. This adds the necessary checks to make sure the input types are all the same.
Reviewed By: Yangqing
Differential Revision: D4277109
fbshipit-source-id: c1568f74bb66f0d9146a54441c0ee664d5516b77
Summary: I ran into a bug when working with very big tensors in caffe2 (> 2GB). When extending beyond a certain size, the size computation was using int32 instead of int64 and would overflow. This fixes the issue.
Differential Revision: D4276487
fbshipit-source-id: 1704a69c4363c7a5b2f7db748d7d570a9593f2b1
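An illustration of the class of bug described above (numbers are made up): computing a byte count for a > 2 GB tensor in 32-bit arithmetic wraps around, while 64-bit arithmetic is fine.

```python
import numpy as np

num_elements = np.array([600_000_000], dtype=np.int32)   # e.g. 600M floats ~ 2.4 GB
nbytes_32 = num_elements * np.int32(4)                    # wraps around in int32
nbytes_64 = num_elements.astype(np.int64) * 4             # correct 64-bit result
print(int(nbytes_32[0]), int(nbytes_64[0]))               # -1894967296 vs. 2400000000
```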
Summary: Position weighted embedding is a bit slow due to the hacky implementation of Mul with broadcast. This diff speeds up the Mul with RowMul.
Reviewed By: xianjiec
Differential Revision: D4271193
fbshipit-source-id: e5c35e18920aeef3de3a7304a8f5727d0c980613
Summary:
Since hashing is different.
This should be ready to commit now. Running ads nn canaries.
Differential Revision: D4264009
fbshipit-source-id: 3aa16b0c47c61f9a442b0375524c5f1580af5892
Summary: Make the xray net_type config a command line argument
Differential Revision: D4262076
fbshipit-source-id: e2ecb9cd5bee5d6aaebe0ea8d2d4d9b378058cba
Summary: This allows us to serialize things between MKLMemory and a TensorProto.
Reviewed By: dzhulgakov
Differential Revision: D4218044
fbshipit-source-id: 934181493b482cb259c17ff4b17008eac52fd885
Summary:
This example writes an LMDB database of image data and labels (random). Then it reads them using Caffe2's TensorProtosDBInput and validates that the checksums match. This example shows how to coerce image data into TensorProtos and be happy.
Before, there was no clear example of how to create databases for Caffe2.
Differential Revision: D4263614
fbshipit-source-id: 21e08066899095b4efcc2d23dbc3ede81e75914a
Summary: Switching to Pieter-MPI changed the way we set up the network between operators. For synchronizing parameters after a checkpoint load, we run a checkpoint_net that contained operators for creating the common world and broadcast operators. Unfortunately this fails when the checkpoint sync is done a second time, because we would have created a duplicate common world. The solution is to separate the common world op and broadcast op into an init net and the actual broadcasting net, and run the init net only once. This problem did not arise in the Flow version since I did only one checkpoint load per operator (process).
Differential Revision: D4251754
fbshipit-source-id: ba030579e651e529e29bbf2d27920075078d8ff9
Summary:
Disclaimer: this is really hacky
Continues a fix from D4218902. The root problem is that DPER builds the net incrementally and input_record doesn't support that properly. For now I just manipulate the input record directly. Alisson wants to fix it properly later by allowing set_input_record to accept a superset of the current record.
But it should unblock our experimentation.
I'm curious how it's going to look in dper_example world.
Reviewed By: azzolini
Differential Revision: D4255285
fbshipit-source-id: ff65b6f943d705a9b3399035597e2e8ded2e1ff3
Summary:
This adds support for automatic aggregation of sparse gradients. We simply concatenate indices and values (no attempt to deduplicate, since this is already done before feeding into the optimizer). This should support various cases (indices and/or values can be generated by one or more gradient ops, or gradient outputs can be directly passed from inputs).
I tried to minimize the code footprint, but I introduced SparseGradGenMeta because GradGenMeta didn't lend itself very well to being used with sparse gradients.
Reviewed By: dzhulgakov
Differential Revision: D4219788
fbshipit-source-id: 1d074664cffd82a8764e4b1473ada6bc46e6c51a
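A small numpy sketch of the concatenation strategy described above: two sparse gradient contributions for the same parameter are combined by concatenating their indices and values, leaving deduplication to the optimizer.

```python
import numpy as np

idx_a, val_a = np.array([0, 4]), np.array([[0.1, 0.2], [0.3, 0.4]])
idx_b, val_b = np.array([4, 7]), np.array([[0.5, 0.6], [0.7, 0.8]])
indices = np.concatenate([idx_a, idx_b])          # [0, 4, 4, 7] -- duplicates are fine
values = np.concatenate([val_a, val_b], axis=0)   # matching rows of gradient values
```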
Summary: adding more methods to the layer representation. The corresponding implementation in DPER is: https://fburl.com/563869364
Differential Revision: D4256583
fbshipit-source-id: 91326b7bb9e960a5bc70b5a13812fce90054eceb
Summary:
When refactoring the data parallel model, the division of the LR by the number of devices was dropped, and thus we ended up effectively multiplying gradients by the number of devices. We therefore need to scale the LR by 1/numgpus.
Created a test to confirm that data_parallel_model produces exactly the same results for different numbers of gpus, given the same total batch size.
Reviewed By: prigoyal
Differential Revision: D4248907
fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be
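A one-line sketch of the scaling rule described above: gradients are effectively summed over devices, so dividing the learning rate by the device count keeps the update equivalent to the single-GPU case at the same total batch size.

```python
base_lr = 0.1
num_gpus = 8
effective_lr = base_lr / num_gpus   # 0.0125
```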
Summary: Basic ops to set/get/check/wait against a StoreHandler.
Differential Revision: D4248059
fbshipit-source-id: cc53061fcc13823d4b9eed6b7c1c346b9e8ec991
Summary:
Add store handler implementation backed by a Redis server.
This allows for easy rendezvous when participating machines have no
access to a shared filesystem.
Differential Revision: D4241715
fbshipit-source-id: 4ce881df3a96af24f7efbb02d1050b3b2b9bc3c0
Summary:
DPER has very strange python ops that play with the Workspace - they are somewhat similar to LoadOp/SaveOp, so I guess the semantics is fine.
Thus it makes sense to allow python operators to receive a workspace pointer similarly to regular operators.
I didn't figure out a better way to implement the optional argument than just checking the number of args the function receives on the python side.
Reviewed By: ajtulloch
Differential Revision: D4242943
fbshipit-source-id: d97d4227815b741c8f884cfe254b06d2b56b5a41
Summary:
One more small batch of CHECKs left in the C2 codebase. Most of the leftovers
should be in tests/GPU-only code.
Reviewed By: Yangqing
Differential Revision: D4243782
fbshipit-source-id: a4a03c116ea8ba16facd2efc135746d5921f19d5
Summary: This diff adds a header file for net_gpu.cc so that the AsyncDAGNet class can be used to create other derived classes.
Reviewed By: ajtulloch
Differential Revision: D4230046
fbshipit-source-id: 379c3ff7ebb7aeeb4294f39e6f5d1ecad48b92f0
Summary:
This makes sure that we have a useful CUDA error message in asan mode. Also
made an fb-specific task pass by explicitly marking it not asan-able.
Reviewed By: dzhulgakov
Differential Revision: D4243471
fbshipit-source-id: 2ce303b97b3b4728c05575a8e7e21eb5960ecbc7
Summary:
Faster implementation of UniqueOp using google::dense_hash_map, as suggested by dzhulgakov. I haven't benchmarked it precisely but early measurements with my workflow show a significant speed bump (this operation went from using 20% of overall CPU time down to 7%).
I gated the implementation using the "engine" feature, to avoid adding sparsehash as a dependency to caffe2.
Reviewed By: dzhulgakov
Differential Revision: D4219768
fbshipit-source-id: 2f142981e772105b42fffa24afb199ef816f8e0c
Summary: I want to collect tensors over multiple batches, so this operation could be helpful to allocate enough memory from the beginning.
Reviewed By: dzhulgakov
Differential Revision: D4216198
fbshipit-source-id: e6b67cc7d80d71455487878da9b6b7a225035085
Summary: Used in the NNPreProc layers. It fails online training when there is an empty batch.
Reviewed By: dzhulgakov
Differential Revision: D4235498
fbshipit-source-id: bde00a011831762e44a3f9bf2190d4b241a06ccc
Summary: FlattenToVec was missing a gradient. It can use the same gradient implementation as FlattenOp, i.e. ResizeLike.
Reviewed By: kdub0
Differential Revision: D4241207
fbshipit-source-id: 6b1a60681fdce3c6f3139d0cd43b17798de2cbc9
Summary: This is mainly for the OSS side checking.
Reviewed By: dzhulgakov
Differential Revision: D4238349
fbshipit-source-id: 061da3f721341c4a1249e1cc6c8c842fc505860f
Summary:
With a parameter server, sparse features are updated on the parameter server, and
local updates for sparse features are disabled. But that logic was removed in
D4144922. This diff adds it back in a slightly different way.
Previously, in trainer_example, I did that in a hacky way by just avoiding adding
the sparse weight to model.params. It still generates a grad, but does not add
optimization operators. At the same time, it is always registered directly in
the sparse_mapping, so the parameter server is aware of this parameter.
But with the new change for ParameterInfo, I cannot do it that way anymore,
because the param registry and params are bound together in ParameterInfo.
For dper, there is an option in the dper model helper to disable all of the sparse
parameter optimizers.
To combine these two, I directly changed ModelHelperBase in this diff. It is not
quite ideal; it would be better to do it in Layer. But to fix the old one, this
seems to be the more reasonable place to cover both cases.
With this diff, there is no spike anymore, so this is probably the root cause
of the convergence issue we saw in D4144922. It explains why the model can
recover: adagrad decays the local learning rate, so local updates cause less change.
Reviewed By: dzhulgakov
Differential Revision: D4229684
fbshipit-source-id: da1241d43d7c52cbf13560f9bb83e09897d8d56f
Summary:
This diff introduces a simplified Imagenet trainer that uses data_parallel_model to parallelize training over GPUs and nodes in a synchronous manner. Flow's gang scheduling is used to launch the nodes, and data_parallel_model handles the synchronization among the gang members.
This example also uses the operator-per-epoch model where each epoch produces a checkpoint consumed by the followup epoch.
Reviewed By: salexspb
Differential Revision: D4223384
fbshipit-source-id: 8c2c73f4f6b2fdadb98511075ebbd8426c91eadb
Summary:
This consists of a series of diffs for implementing Multi-task learning.
This diff is to
1. save model;
2. support MT learning in evaluator
3. add unittest.
model after merging (saved model): https://our.intern.facebook.com/intern/graphviz/?paste=56793140
Reviewed By: xianjiec
Differential Revision: D4123316
fbshipit-source-id: 225bf8616962ec08f4f1ef85729c1e94ba7c373a
Summary: Debugging nets can be tiresome, so it is good if we can do some sanity checks. This adds a sanity check that all non-NCCL and non-Copy operators do not reference blobs that have a different device scope than the operator. This check is only added to the data_parallel_model, so it should be safe. This check would have caught a subtle bug in prigoyal's training pipeline.
Reviewed By: dzhulgakov
Differential Revision: D4230444
fbshipit-source-id: 3d4a843162134a7a504053d95ff97a552e6b8a6d
Summary:
Previously DPER was quite broken - we couldn't change loaders on the fly because the serialized model had blob names hard-coded, e.g. "nn_loader/dense". In fact, the tests worked only by accident, as both the trainer and evaluator used the same loader type.
This diff does the following:
1) when writing out the model, remap input blobs to be 'inputs/<field_name>'
2) when loading the eval model, remap them back to the current loader
This diff uses Net.input_schema() for convenience; in particular the schema format is implicitly serialized in the input blob names. From our discussion with Andrey, this type of hardcoding is actually acceptable since the schema of HiveReader on the python side is inferred via the same string-parsing procedure.
It also modifies model saving a bit so that we don't pollute the global namespace with the shape_provider net.
Overall the code in mlp.py is pretty terrible. But I'd leave refactoring to xianjiec as part of the Layers migration.
Reviewed By: xianjiec
Differential Revision: D4218902
fbshipit-source-id: 6cd19f0343ec1be6ddaa3581512e61879957749e
Summary:
- It's a first prototype that includes a simple unary test.
- will probably need to iterate on it to include more arches for which we see promising offline results
Differential Revision: D4208336
fbshipit-source-id: 5b2d2a5a0274a9dcad0fb169e43e78aa9d9a704d
Summary:
If we go to prod, some of the sparse features might be empty, or for some reason the
batch might be empty. It's a good idea to be sure that we can run empty
batches.
Reviewed By: dzhulgakov
Differential Revision: D4197297
fbshipit-source-id: 1a154ebf625d1a39fd15354a154cf100f525ae9a
Summary:
The old heuristic functioned badly on octa-core phones (e.g., the S6). Limiting the number of threads to 4 in the 8-core case seemed to give optimum performance. For 4 cores, 3 threads still seems to yield the best performance, as does 2 threads for 2 cores in the iOS phones, though those cores are very different from the typical ARM cores in Android phones.
I figure at the limit, we should limit ourselves to half the cores available, especially since in a big.LITTLE configuration, only half the cores are likely to be big.
I need to get my hands on a deca-core phone or tablet to try out this heuristic, but I certainly figure that this will function better than what we had before (which would be 9 threads on a 10-core device).
Reviewed By: ajtulloch
Differential Revision: D4220341
fbshipit-source-id: 06fa7677789fcdbec03d98bb85a565f1d22099e1
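A hypothetical reconstruction of the heuristic as described above (not the actual C++ implementation): 2 threads for 2 cores, 3 for 4 cores, and at most half the cores beyond that.

```python
def suggested_thread_count(num_cores):
    if num_cores <= 2:
        return 2
    if num_cores <= 4:
        return 3
    return num_cores // 2   # 8 cores -> 4 threads, 10 cores -> 5 threads
```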
Summary:
Needed by oss.
This is done by running the following line:
find . -name "*_test.py" -exec sed -i '$ a \\nif __name__ == "__main__":\n import unittest\n unittest.main()' {} \;
Reviewed By: ajtulloch
Differential Revision: D4223848
fbshipit-source-id: ef4696e9701d45962134841165c53e76a2e19233
Summary:
It looks like there's some locking going on here, and so if
the Cursor outlives the DB (or vice-versa), we'll either deadlock or
unlock an unlocked mutex.
Reviewed By: dzhulgakov
Differential Revision: D4224727
fbshipit-source-id: 886401a9f2824f3168fb0b2fd4df6046369e5590
Summary:
A recurrent developer issue is that people pass numpy arrays with FeedBlob but forget that a python float is actually a double. Cuda ops in caffe2 don't allow doubles.
Thus, I think we should reject incorrect types already at FeedBlob() when the device option is CUDA.
Added a test.
Is this too strong?
Reviewed By: ajtulloch
Differential Revision: D4208153
fbshipit-source-id: 364b057a2a37b5d4b95de4e59faebdab724bb0ed
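An example of the pitfall described above, assuming a CUDA-enabled build: np.random.rand returns float64, which CUDA ops reject, so the array should be cast to float32 before being fed to a CUDA blob.

```python
import numpy as np
from caffe2.python import core, workspace
from caffe2.proto import caffe2_pb2

gpu_opt = core.DeviceOption(caffe2_pb2.CUDA, 0)
data = np.random.rand(4, 3)                  # dtype float64 -- would be rejected
workspace.FeedBlob("data", data.astype(np.float32), device_option=gpu_opt)
```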
Summary: Just noticed that I had duplicate code in the example imagenet trainer. Removed the function.
Differential Revision: D4223070
fbshipit-source-id: 443a9401bf7e425f7a3a13a44c9d0f7e21e72303
Summary:
Remove MPI and use fb.distributed rendezvous and Pieter's new Ops.
One can now pass a 'rendezvous' struct to data_parallel_model to initiate distributed SyncSGD. The provided rendezvous implementation uses the kv-store handler of fb.distributed to disseminate information about other hosts. We can easily add other rendezvous, such as file-based, but that is the topic of another diff.
Removing MPI also allowed simplifying the Xray startup scripts, which are included in this diff.
When accepted, I will work on simple example code so others can use this stuff as well. Also, the Flow implementation will be the topic of next week.
Differential Revision: D4180012
fbshipit-source-id: 9e74f1fb43eaf7d4bb3e5ac6718d76bef2dfd731
Summary:
The FileStoreHandler subclasses the abstract StoreHandler
class.
Operators expecting to work with a StoreHandler can now use the
filesystem as their backing store.
Reviewed By: Yangqing
Differential Revision: D4217711
fbshipit-source-id: fce60c99c4c505201dfee33ca0a4e8a35db00338
Summary: since the LogScoreEstimator prints the # of examples after considering negative downsampling.
Reviewed By: kdub0
Differential Revision: D4218040
fbshipit-source-id: 30f54353042dcd85c945c2c911ba0b6d9c0b1540
Summary:
Fix warnings exposed by gcc-4.9.x's -Wshadow-compatible-local
(and/or the stricter -Wshadow-local) options. Note that these
are both less onerous than -Wshadow.
I plan to enable one of them for all of fbcode, soon.
Rename inner "idx" to "k".
Differential Revision: D4216556
fbshipit-source-id: 5ee48751efd07838db24f56390730718ea031772
Summary:
It was not enough to just register ReshapeOp for CUDA, since it does memory copies to/from tensors. This happened in two places: when assigning the shape from a shape blob and when outputting a shape tensor.
Also changed the reshape op test to use CUDA when available (this test was written before hypothesis tests, so I had to do this manually).
Differential Revision: D4217342
fbshipit-source-id: 61761bac015f3731cf480ccef2563e9c80e0f4aa
Summary:
I got a weird error about NoneType not being iterable which made me think
it was some error in the C2 core, whereas it was an error in my code.
Reviewed By: Yangqing
Differential Revision: D4192799
fbshipit-source-id: 0122f13e205c1c6a0766545f0ad6296228d3a3d9
Summary:
This fixes a race condition in text_file_reader.py.
For example in `fbcode/caffe2/caffe2/fb/text/stats.py`, in `compute_meta`, we build an execution step `read` such as:
```
.
└── step_read
├── net_reader
│ ├── op_TextFileReaderRead
│ └── op_IsEmpty
└── net_consume:n
└── op_Tokenize
```
Note that in `workspace.cc`, we check should_stop between each _step_ and each _net_, not between _ops_
Let's say we have 2 workers, here is a faulty interleaving of threads:
- 1 executes TextFileReaderRead
- 2 executes TextFileReaderRead
- 1 executes IsEmpty and sets should_stop to False
- 2 executes IsEmpty and sets should_stop to True
- 1 checks should_stop before running net_consume:n
- 1 stops
- 2 checks should_stop before running net_consume:n
- 2 stops
That's an issue, because 1 did read data from the file but did not run the processing step (consume:n) for this data.
Reviewed By: dzhulgakov
Differential Revision: D4203729
fbshipit-source-id: eabd94ea995527ec52fa137a8b63c277f7e4dd96
Summary:
This is #2 of a series of changes. It did the following:
(1) a few refactor of the MKL memory interface
(2) an initial MKLContext to deal with MKL specific computations
(3) Provide MKLMemory access in Python with the blob feeder/fetcher registration.
Reviewed By: dzhulgakov
Differential Revision: D4210123
fbshipit-source-id: adea1f1ffbd0b9ffdd55092676468c16bec08992
Summary: Each sparse feature is an ID list, and usually the position of an id in the list is meaningful: the earlier the id appears in the list, the more important it is. In this diff, we multiply each embedding by a weight, where the weight corresponds to the position. With this change, the same ID appearing at different positions will have a different norm/length/importance after aggregation. The firstX transformation in sigrid is a special case of this model where the weights before n are 1, and 0 after n, where n is the argument of firstX.
Reviewed By: xianjiec
Differential Revision: D4181251
fbshipit-source-id: 2a6f8b7240af445b6bd2052fd24c2d99f39ee7ff
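A numpy sketch of the position weighting described above (weights are illustrative, not learned values): each embedding in an ID list is scaled by a weight indexed by its position before pooling, so the same ID contributes differently depending on where it appears.

```python
import numpy as np

embeddings = np.random.rand(5, 8)            # one 8-dim embedding per id in the list
position_weights = np.array([1.0, 0.8, 0.6, 0.4, 0.2])
pooled = (position_weights[:, None] * embeddings).sum(axis=0)
```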
Summary:
Another recurrent problem is that some blob is in CPU scope while the operator expects CUDA scope (or the other way round).
The exception is only partially helpful, as it tells the operator but not the offending blob name. This diff adds the blob name
to the exception message, helping with debugging.
Reviewed By: prigoyal
Differential Revision: D4208584
fbshipit-source-id: 5aeac5c3efeed8d6c995bea166ed534855007945
Summary: This is so they don't generate spurious warning messages in the logs
Reviewed By: dzhulgakov
Differential Revision: D4205610
fbshipit-source-id: f764b51565430f4057898ab929372bc7943e0495
(1) nccl submodule, cnmem submodule
(2) mpi ops fallback test
(3) a bit more blob interface
(4) fixed tests
(5) caffe2.python.io -> caffe2.python.dataio to avoid name conflicts
(6) In the build system autogen __init__.py instead of having manual
rules just to copy over an empty __init__.py.
So I tried to make things compilable in python3 but a lot of the actual
functionalities are yet to be verified. Since I am not using py3 for a
short while and protobuf 2.6.1 does not work with py3 (among a bunch of
others), I'll put this as a future todo item.
Eigen for the whole numerical computation (for example, on a platform
where there is no optimized BLAS libraries present, or Eigen is already
the fastest numerical library existing).
The paths I have tested are Eigen and atlas. Have not tested MKL yet.
(1) Registry now uses std::function for more flexible use cases.
(2) dropout adds an "is_test" keyword.
(3) Making all gradient registered via C++. Python still provides gradient wrapper.
TODO item is to make the autograd SSA in C++ if possible. Problem is if we want to dynamically
register python gradients we will be sort of screwed because in c++ things are registered
via static variables.
(1) Loss: do not coerce a gradient output. Although it may be numerically more efficient to do so, it makes the definition of a loss kind of funny if one does not really want to run backward pass.
(2) Autodifferentiation: allow more explicit in-place check, in-place is now opt-in, and implemented a simple SSA/IR gradient generation scheme. Also added some core gradient tests.
Misc bugfixes as well.
(1) cudnn for conv
(2) cublas: after going through the work I feel it's better to use HOST pointer mode, so changed it.
(3) storage order: despite the fact that googlenet and multibox use NHWC, it seems better to still use
NCHW as the default to be consistent with caffe and cudnn; moved to NCHW as default.
(1) various bugfixes.
(2) Tensor is now a class independent from its data type. This allows us
to write easier type-independent operators.
(3) code convention changes a bit: dtype -> T, Tensor<*Context> -> Tensor* alias.
(4) ParallelNet -> DAGNet to be more consistent with what it does.
(5) Caffe's own flags library instead of gflags.
(6) Caffe's own logging library instead of glog, but glog can be chosen with
compile-time definition -DCAFFE2_USE_GOOGLE_GLOG. As a result, glog macros
like CHECK, DCHECK now have prefix CAFFE_, and LOG(*) now becomes
CAFFE_LOG_*.
(7) an optional protobuf inclusion, which can be chosen with USE_SYSTEM_PROTOBUF
in build_env.py.
(2) blob serialization comments
(3) cudnn: putting it under a separate device name
so we can explicitly choose cudnn instead of
having CUDA device prioritizing it.
(4) note that mint is not available with ipython
due to zeromq conflict
(5) db_throughput utility
(6) added gprofiler
(1) added blob serialization.
(2) registry can now use key types other than string.
(3) changed load_save_op so they interface with a db.
(4) change sgd iter op: it does increments so we can resume an iter.
(5) mnist linear classifier tests snapshot functionality.
(6) added protodb which is a small wrapper over TensorProtos.
# Let the test pass if baseline number doesn't exist
mean = sys.maxsize
sigma = 0.001
print("population mean: ", mean)
print("population sigma: ", sigma)
sample_stats_data = json.loads(args.sample_stats)
sample_mean = sample_stats_data['mean']
sample_sigma = sample_stats_data['sigma']
print("sample mean: ", sample_mean)
print("sample sigma: ", sample_sigma)
z_value = (sample_mean - mean) / sigma
print("z-value: ", z_value)
if z_value >= 3:
    raise Exception('''\n
z-value >= 3, there is high chance of perf regression.\n
To reproduce this regression, run `cd .jenkins/pytorch/perf_test/ && bash ''' + test_name + '''.sh` on your local machine and compare the runtime before/after your code change.
''')
else:
    print("z-value < 3, no perf regression detected.")
if args.update:
    print("We will use these numbers as new baseline.")
if ! python perf-tests/modules/test_cpu_torch.py ${ARGS}; then
    echo "To reproduce this regression, run \`cd .jenkins/pytorch/perf_test/ && bash "${FUNCNAME[0]}".sh\` on your local machine and compare the runtime before/after your code change."
if ! python perf-tests/modules/test_cpu_torch_tensor.py ${ARGS}; then
    echo "To reproduce this regression, run \`cd .jenkins/pytorch/perf_test/ && bash "${FUNCNAME[0]}".sh\` on your local machine and compare the runtime before/after your code change."
echo "NOTE: To run \`import torch\`, please make sure to activate the conda environment by running \`call C:\\Jenkins\\Miniconda3\\Scripts\\activate.bat C:\\Jenkins\\Miniconda3\` in Command Prompt before running Git Bash."
) else (
7z a %IMAGE_COMMIT_TAG%.7z C:\\Jenkins\\Miniconda3\\Lib\\site-packages\\torch && python ci_scripts\\upload_image.py %IMAGE_COMMIT_TAG%.7z
author={Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam},
PyTorch is a python package that provides two high-level features:
- Tensor computation (like numpy) with strong GPU acceleration
- Deep Neural Networks built on a tape-based autograd system
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system
You can reuse your favorite python packages such as numpy, scipy and Cython to extend PyTorch when needed.
You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed.
We are in an early-release Beta. Expect some adventures and rough edges.
We are in an early-release beta. Expect some adventures and rough edges.
- [More About PyTorch](#more-about-pytorch)
- [More about PyTorch](#more-about-pytorch)
- [Installation](#installation)
- [Binaries](#binaries)
- [From source](#from-source)
- [Docker image](#docker-image)
- [From Source](#from-source)
- [Docker Image](#docker-image)
- [Building the Documentation](#building-the-documentation)
- [Previous Versions](#previous-versions)
- [Getting Started](#getting-started)
- [Communication](#communication)
- [Releases and Contributing](#releases-and-contributing)
- [The Team](#the-team)
| System | Python | Status |
| System | 2.7 | 3.5 |
| --- | --- | --- |
| Linux CPU | 2.7.8, 2.7, 3.5, nightly | [](https://travis-ci.org/pytorch/pytorch) |
| Linux GPU | 2.7 | [](https://build.pytorch.org/job/pytorch-master-py2) |
| Linux GPU | 3.5 | [](https://build.pytorch.org/job/pytorch-master-py3) |
| Linux CPU | [](https://ci.pytorch.org/jenkins/job/pytorch-master/) | [](https://ci.pytorch.org/jenkins/job/pytorch-master/) |
| Linux GPU | [](https://ci.pytorch.org/jenkins/job/pytorch-master/) | [](https://ci.pytorch.org/jenkins/job/pytorch-master/) |
| Windows GPU | <center>—</center> | [](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-win-ws2016-cuda9-cudnn7-py3-trigger/)
See also the [ci.pytorch.org HUD](https://ezyang.github.io/pytorch-ci-hud/build/pytorch-master).
## More about PyTorch
At a granular level, PyTorch is a library that consists of the following components:
| \_ | \_ |
| ------------------------ | --- |
| torch | a Tensor library like NumPy, with strong GPU support |
| torch.autograd | a tapebased automatic differentiation library that supports all differentiable Tensor operations in torch |
| torch.nn | a neural networks library deeply integrated with autograd designed for maximum flexibility |
| torch.optim | an optimization package to be used with torch.nn with standard optimization methods such as SGD, RMSProp, LBFGS, Adam etc. |
| torch.multiprocessing | python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and hogwild training. |
| torch.utils | DataLoader, Trainer and other utility functions for convenience |
| torch.legacy(.nn/.optim) | legacy code that has been ported over from torch for backward compatibility reasons |
| Component | Description |
| ---- | --- |
| **torch** | a Tensor library like NumPy, with strong GPU support |
| **torch.autograd** | a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch |
| **torch.nn** | a neural networks library deeply integrated with autograd designed for maximum flexibility |
| **torch.multiprocessing** | Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training |
| **torch.utils** | DataLoader, Trainer and other utility functions for convenience |
| **torch.legacy(.nn/.optim)** | legacy code that has been ported over from torch for backward compatibility reasons |
Usually one uses PyTorch either as:
- A replacement for numpy to use the power of GPUs.
- a replacement for NumPy to use the power of GPUs.
- a deep learning research platform that provides maximum flexibility and speed
Elaborating further:
### A GPU-ready Tensor library
### A GPU-Ready Tensor Library
If you use numpy, then you have used Tensors (a.k.a ndarray).
If you use NumPy, then you have used Tensors (a.k.a ndarray).
PyTorch is not a Python binding into a monolothic C++ framework.
PyTorch is not a Python binding into a monolithic C++ framework.
It is built to be deeply integrated into Python.
You can use it naturally like you would use numpy / scipy / scikit-learn etc.
You can use it naturally like you would use NumPy / SciPy / scikit-learn etc.
You can write your new neural network layers in Python itself, using your favorite libraries
and use packages such as Cython and Numba.
Our goal is to not reinvent the wheel where appropriate.
### Imperative experiences
### Imperative Experiences
PyTorch is designed to be intuitive, linear in thought and easy to use.
When you execute a line of code, it gets executed. There isn't an asynchronous view of the world.
When you drop into a debugger, or receive error messages and stack traces, understanding them is straight-forward.
The stack-trace points to exactly where your code was defined.
When you drop into a debugger, or receive error messages and stack traces, understanding them is straightforward.
The stacktrace points to exactly where your code was defined.
We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines.
### Fast and Lean
PyTorch has minimal framework overhead. We integrate acceleration libraries
such as Intel MKL and NVIDIA (CuDNN, NCCL) to maximize speed.
At the core, it's CPU and GPU Tensor and Neural Network backends
(TH, THC, THNN, THCUNN) are written as independent libraries with a C99 API.
They are mature and have been tested for years.
PyTorch has minimal framework overhead. We integrate acceleration libraries
such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed.
At the core, its CPU and GPU Tensor and neural network backends
(TH, THC, THNN, THCUNN) are mature and have been tested for years.
Hence, PyTorch is quite fast -- whether you run small or large neural networks.
Hence, PyTorch is quite fast – whether you run small or large neural networks.
The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives.
We've written custom memory allocators for the GPU to make sure that
your deep learning models are maximally memory efficient.
This enables you to train bigger deep learning models than before.
### Extensions without pain
### Extensions without Pain
Writing new neural network modules, or interfacing with PyTorch's Tensor API was designed to be straight-forward
Writing new neural network modules, or interfacing with PyTorch's Tensor API was designed to be straightforward
and with minimal abstractions.
You can write new neural network layers in Python using the torch API
[or your favorite numpy based libraries such as SciPy](https://github.com/pytorch/tutorials/blob/master/Creating%20extensions%20using%20numpy%20and%20scipy.ipynb).
[or your favorite NumPy-based libraries such as SciPy](http://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html).
If you want to write your layers in C/C++, we provide an extension API based on
[cffi](http://cffi.readthedocs.io/en/latest/) that is efficient and with minimal boilerplate.
There is no wrapper code that needs to be written. [You can see an example here](https://github.com/pytorch/extension-ffi).
If you want to write your layers in C/C++, we provide a convenient extension API that is efficient and with minimal boilerplate.
There is no wrapper code that needs to be written. You can see [a tutorial here](http://pytorch.org/tutorials/advanced/cpp_extension.html) and [an example here](https://github.com/pytorch/extension-cpp).
## Installation
### Binaries
- Anaconda
```bash
conda install pytorch torchvision -c soumith
```
Commands to install from binaries via Conda or pip wheels are on our website:
### From source
[http://pytorch.org](http://pytorch.org)
### From Source
If you are installing from source, we highly recommend installing an [Anaconda](https://www.continuum.io/downloads) environment.
You will get a high-quality BLAS library (MKL) and you get a controlled compiler version regardless of your Linux distro.
Once you have [anaconda](https://www.continuum.io/downloads) installed, here are the instructions.
Once you have [Anaconda](https://www.continuum.io/downloads) installed, here are the instructions.
If you want to compile with CUDA support, install
- [NVIDIA CUDA](https://developer.nvidia.com/cuda-downloads) 7.5 or above
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v6.x or above
If you want to disable CUDA support, export environment variable `NO_CUDA=1`.
Other potentially useful environment variables may be found in `setup.py`.
If you want to build on Windows, Visual Studio 2017 14.11 toolset and NVTX are also needed.
Especially, for CUDA 8 build on Windows, there will be an additional requirement for VS 2015 Update 3 and a patch for it.
The details of the patch can be found out [here](https://support.microsoft.com/en-gb/help/4020481/fix-link-exe-crashes-with-a-fatal-lnk1000-error-when-you-use-wholearch).
Dockerfiles are supplied to build images with cuda support and cudnn v5 and cudnn v6 RC. Build them as usual
Dockerfile is supplied to build images with cuda support and cudnn v7. You can pass -e PYTHON_VERSION=x.y flag to specificy which python to be used by Miniconda, or leave it unset to use the default. Build as usual
* Slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . Our slack channel is invite-only to promote a healthy balance between power-users and beginners. If you need a slack invite, ping us at slack@pytorch.org
* newsletter: no-noise, one-way email newsletter with important announcements about pytorch. You can sign-up here: http://eepurl.com/cbG0rv
## Releases and Contributing
PyTorch has a 90 day release cycle (major releases).
It's current state is Beta (v0.1.6), we expect no obvious bugs. Please let us know if you encounter a bug by [filing an issue](https://github.com/pytorch/pytorch/issues).
PyTorch has a 90 day release cycle (major releases).
Its current state is Beta, we expect no obvious bugs. Please let us know if you encounter a bug by [filing an issue](https://github.com/pytorch/pytorch/issues).
We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.
If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without discussion might end up resulting in a rejected PR, because we might be taking the core in a different direction than you might be aware of.
**For the next release cycle, these are the 3 big features we are planning to add:**
1. [Distributed PyTorch](https://github.com/pytorch/pytorch/issues/241) (a draft implementation is present in this [branch](https://github.com/apaszke/pytorch-dist) )
2. Backward of Backward - Backpropagating through the optimization process itself. Some past and recent papers such as
[Double Backprop](http://yann.lecun.com/exdb/publis/pdf/drucker-lecun-91.pdf) and [Unrolled GANs](https://arxiv.org/abs/1611.02163) need this.
3. Lazy Execution Engine for autograd - This will enable us to optionally introduce caching and JIT compilers to optimize autograd code.
## The Team
PyTorch is a community driven project with several skillful engineers and researchers contributing to it.
PyTorch is currently maintained by [Adam Paszke](https://apaszke.github.io/), [Sam Gross](https://github.com/colesbury) and [Soumith Chintala](http://soumith.ch) with major contributions coming from 10s of talented individuals in various forms and means. A non-exhaustive but growing list needs to mention: Sergey Zagoruyko, Adam Lerer, Francisco Massa, Andreas Kopf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein.
PyTorch is currently maintained by [Adam Paszke](https://apaszke.github.io/), [Sam Gross](https://github.com/colesbury), [Soumith Chintala](http://soumith.ch) and [Gregory Chanan](https://github.com/gchanan) with major contributions coming from 10s of talented individuals in various forms and means.
A non-exhaustive but growing list needs to mention: Trevor Killeen, Sasank Chilamkurthy, Sergey Zagoruyko, Adam Lerer, Francisco Massa, Alykhan Tejani, Luca Antiga, Alban Desmaison, Andreas Kopf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein, Christian Sarofeen, Martin Raison, Edward Yang, Zachary Devito.
Note: this project is unrelated to [hughperkins/pytorch](https://github.com/hughperkins/pytorch) with the same name. Hugh is a valuable contributor in the Torch community and has helped with many things Torch and PyTorch.
ATen is a simple tensor library that exposes the Tensor operations in Torch
and PyTorch directly in C++11. The wrapper respects the semantics of operators
in PyTorch, except minor details due to differences between C++ and Python in
the way default arguments are handled. See the [documentation for tensors](http://pytorch.org/docs/tensors.html) in PyTorch for what these operations do.
ATen's API is auto-generated from the same declarations PyTorch uses so the
two APIs will track each other over time.
Tensor types are resolved dynamically, such that the API is generic and
does not include templates. That is, there is one `Tensor` type. It can hold a
CPU or CUDA Tensor, and the tensor may have Doubles, Float, Ints, etc. This design
makes it easy to write generic code without templating everything.
See https://pytorch.org/cppdocs for the provided API. Excerpt:
When using Tensor-wide operations, the relative cost of dynamic dispatch is very small.
However, there are cases, especially in your own kernels, where efficient element-wise access is needed,
and the cost of dynamic dispatch inside the element-wise loop is very high.
ATen provides _accessors_ that are created with a single dynamic check that a Tensor is of the expected type and number of
dimensions. Accessors then expose an API for accessing the Tensor elements efficiently:
```c++
Tensor foo = CPU(kFloat).rand({12,12});
// assert foo is 2-dimensional and holds floats.
auto foo_a = foo.accessor<float,2>();
float trace = 0;
for(int i = 0; i < foo_a.size(0); i++) {
// use the accessor foo_a to get tensor data.
trace += foo_a[i][i];
}
```
Accessors are temporary views of a Tensor. They are only valid for the lifetime of the tensor that they
view and hence should only be used locally in a function, like iterators.
### Using externally created data
If you already have your tensor data allocated in memory (CPU or CUDA),
you can view that memory as a Tensor in ATen:
```c++
float data[] = { 1, 2, 3,
4, 5, 6};
auto f = CPU(kFloat).tensorFromBlob(data, {2,3});
cout << f << endl;
```
These tensors cannot be resized because ATen does not own the memory, but otherwise
behave as normal tensors.
### Scalars and zero-dimensional tensors
In addition to the `Tensor` objects, ATen also includes `Scalar`s that represent a single number.
Like a Tensor, Scalars are dynamically typed and can hold any one of ATen's number types.
Scalars can be implicitly constructed from C++ number types. Scalars are needed because some functions like `addmm` take numbers along with Tensors and expect these
numbers to be the same dynamic type as the tensor. They are also used in the API to indicate places where
a function will _always_ return a Scalar value, like `sum`.
```c++
Tensor addmm(Scalar beta, const Tensor & self,
Scalar alpha, const Tensor & mat1,
const Tensor & mat2);
Scalar sum(const Tensor & self);
//usage
Tensor a = ...
Tensor b = ...
Tensor c = ...
Tensor r = addmm(1.0, a, .5, b, c);
```
In addition to Scalars, ATen also allows Tensor objects to be zero-dimensional. These Tensors hold
a single value and they can be references to a single element in a larger Tensor. They can be used anywhere a Tensor is expected. They are normally created by operators like `select` which reduce the dimensions of
a Tensor.
```c++
Tensor two = CPU(kFloat).rand({10,20});
two[1][2] = 4;
//~~~~~~~ zero-dimensional Tensor
```
It is possible to convert between Scalar and zero-dim Tensors:
```c++
Tensor zero_dim = CPU(kFloat).scalarTensor(4);
Scalar from_tensor = Scalar(zero_dim); //only valid when zero_dim.dim() == 0;
```
### Avoiding unnecessary CUDA synchronization in your kernels when using Scalars
Moving a single number from the GPU to the CPU introduces a synchronization point
that can add latency to your program. In certain cases the result of a GPU operator like `sum` which
returns a Scalar may be plugged into another GPU operator as an argument. If Scalars were always copied
to the CPU, this would result in 2 copies. To avoid these synchronizations, Scalar objects can be
optionally backed by a zero-dim Tensor, and are only copied to the CPU when requested.
```c++
auto a = CUDA(kFloat).rand({3,4});
Scalar on_gpu = Scalar(a[1][1]); //backed by zero-dim Tensor
assert(on_gpu.isBackedByTensor());
double value = on_gpu.toDouble(); // copied to CPU, if it was backed by GPU Tensor.
Scalar svalue = on_gpu.local(); // force the Scalar to become local to CPU.
// get the scalar as a zero-dim tensor. If it was already backed
// by a zero-dim Tensor then this op has no synchronization.
// if the Scalar was local on CPU, it performs the copy
```
// we have a degree of freedom here to select the dimension size; follow NumPy semantics
// and just bail.
AT_CHECK(newsize != 0, "cannot reshape tensor of 0 elements into shape ", shape);
res[*infer_dim] = numel / newsize;
}
return res;
}
std::ostringstream ss;
ss << "shape '" << shape << "' is invalid for input of size " << numel;
throw std::runtime_error(ss.str());
}
}