* Optimizer: optimize transposes in a variety of circumstances (#3509)
* Optimizer: Optimize transposes in a variety of circumstances
- No-op transposes
- Consecutive transposes (fuse them)
- Transposes into Gemm (fuse them into transA/transB parameter)
* touch up out-of-date comment
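For illustration, a minimal sketch of how the consecutive-transpose fusion composes permutations (pure Python; the helper name is ours, not the pass's):
def fuse_perms(p1, p2):
    # Transpose(p2) applied after Transpose(p1) equals Transpose(p) with
    # p[i] = p1[p2[i]]; if p is the identity, both nodes can be dropped.
    return [p1[i] for i in p2]
fuse_perms([1, 0, 2], [1, 0, 2])  # -> [0, 1, 2]: a no-op, remove both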
* Backporting optimizer changes
Gradients were becoming non-volatile because at::zeros_like returned a
Variable with volatile always set to false. The non-volatile gradients
accumulated history in the model, which resulted in continuously
increasing memory usage.
See #3983, #3835, #3824
In v0.4 this will be more robustly solved by #3970
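A minimal sketch of the failure mode, using the v0.3-era Variable API (illustrative, not the exact repro):
import torch
from torch.autograd import Variable
v = Variable(torch.randn(3), volatile=True)
g = torch.zeros_like(v)
# Before the backport, g.volatile was False even though v is volatile, so
# gradients built from g accumulated graph history every iteration.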
* Remove dilations for pooling in onnx export and other small fixes (#3698)
* fix optimization pass issues
* remove pool dilations
* Fix export for recent changes in ONNX (#3708)
* Fix symbolic for Embedding and Upsampling and improve error messages
* Record stack traces during JIT tracing (#3607)
* Record stack traces during JIT tracing
* Use string helper functions and AutoGIL
* Use SourceLocation object instead of storing in debugName
* Address zdevito comments
* Address comments
* Allow 1->N broadcasts at the beginning and end to be fused (#3616)
* Allow 1->N broadcasts at the beginning and end to be fused
* Update comments and size logic
* Implement bmm symbolic (#3681)
* Buildfix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Now actually fix padding (the tests are added in onnx-pytorch) (#3893)
* Now actually fix padding (the tests are added in onnx-pytorch)
* fix test
* Fix exporting HalfTensor
* Fix padding according to https://github.com/onnx/onnx/issues/261
* Update ONNX IR we emit to version 0.0.2 (attribute discriminators) / fix Permute export (#3484)
* Regenerate ONNX nanopb from latest version.
But don't bump the IR version; we don't handle discriminators
yet.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add discriminator to AttributeProto.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add back ONNX definition for permute
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* PyTorch now uses operator versioning.
Also move some of the exporter info out of the ModelProto constructor.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix elu double-backwards when applied in-place
Removed unused "input" argument to elu_backwards. Also removed 'inplace'
argument from backwards functions, since we don't ever want to use it.
* Fix up additional calls to ELU_updateGradInput
* added sys/types.h include to fix unknown ssize_t in aten/src/TH/THMemoryFile.c
* now including <sys/types.h> only if _WIN32 is not #defined
* now including sys/types.h in aten/src/TH/THDiskFile.c (if _WIN32 is not defined) to fix undefined off_t
* Allow torch.load to take pathlib.Path
pathlib has been part of the Python standard library for filesystem paths since Python 3.4,
but `torch.load` currently cannot take `pathlib.Path` as the filename of a state dictionary.
I changed `torch.load` and `_with_file_like` so that they accept a `pathlib.Path`-typed filepath.
* Fix flake8: too long line & indentation
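Usage after the change (a small sketch; the checkpoint contents here are made up):
from pathlib import Path
import torch
path = Path("model.pt")
torch.save({"weights": torch.randn(3)}, path)  # _with_file_like accepts Path
state = torch.load(path)                       # previously required str/file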
From https://software.intel.com/en-us/mkl-developer-reference-fortran-gemm:
lda: "When transa = 'N' or 'n', then lda must be at least max(1, m),
otherwise lda must be at least max(1, k)."
ldb: "When transb = 'N' or 'n', then ldb must be at least max(1, k),
otherwise ldb must be at least max(1, n)."
Partly addresses #3525
The curand_uniform function returns values in the range (0, 1]. Most RNG APIs have
the opposite bounds. Fix up the values in uniform_() so that they fall in
the more common bounds.
Generate random uniform floats in the range [0, 1) by generating random
uniform uint32 in the range [0, 2^24-1] and dividing by 2^24. This
ensures that the largest value is representable as a float32 less than
one.
This also changes the uniform double generation to use more bits of
randomness.
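A NumPy sketch of the same technique (not the CUDA kernel itself):
import numpy as np
def uniform_f32(rng, n):
    # Draw 24 random bits and scale by 2**-24: every result is an exact
    # float32 in [0, 1), and 1.0 can never be produced.
    bits = rng.integers(0, 1 << 24, size=n, dtype=np.uint32)
    return bits.astype(np.float32) * np.float32(2.0 ** -24)
uniform_f32(np.random.default_rng(0), 4)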
* [v0.3] Don't expose 0-dim tensors to Variable API.
* [v0.3] Ensure grad_inputs are not ATen scalars and address review comments.
* Remove extra parentheses
THTensor_(newContiguous) always increments the refcount. It may return
the same pointer if the tensor is already contiguous. Since we added the
check for zero strides, it may be called when the tensor is already
contiguous. We need to make sure that THTensor_(free) is always called
in this case.
See #3498
* Use Welford's algorithm when reducing along inner dimension for THCTensor's variance fn
* Use accreals in THCTensor's varInnermostDim
* Skip cuda tests if no cuda
* Variance testing
Replace None grad_inputs with zero tensors in some cases
In Python-implemented autograd functions, we sometimes return None as
the grad_input if the output is marked "non-differentiable". This
replaces those None values with zero-filled Variables if the
corresponding input has requires_grad=True.
C++ implemented autograd functions expect the input (grad_outputs) to
be defined if they're executed. They always return non-null grad_inputs
if should_compute_output(i) is true. This could lead to segfaults if a
subsequent Python-implemented function returned None.
See #3412, #3241
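A hypothetical new-style Function illustrating the plumbing:
import torch
from torch.autograd import Function
class Scale(Function):
    @staticmethod
    def forward(ctx, x, k):
        ctx.k = k
        return x * k
    @staticmethod
    def backward(ctx, grad_out):
        # Returning None for k's gradient is the usual idiom; the engine now
        # replaces that None with a zero-filled Variable when the
        # corresponding input has requires_grad=True, so downstream
        # C++-implemented functions never see a null grad.
        return grad_out * ctx.k, None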
We don't currently generate _out functions for ATen native functions and may not
(they don't work with Variables currently). Also, the existing code was wrong
as the argument orders were swapped in the two squeeze variants.
* enable size from ATen type
* temp commit aten thd
* port copy, math
* port random
* changes after rebase
* lapack bind
* thd and csrc compile
* fix min/max reductions in DataChannelTCP
* clean up changes
* re-enable tensor constructors
* port MPI to at::Tensor
* fix storage methods to not cast to thpp storage ptrs
Some knock-on effects:
- at() is not supported on ArrayRef. I fixed this by adding a new
overload for input() to access a specific input. I also filed
https://github.com/zdevito/ATen/pull/152
- Need new overloads for fmap/filter, because template deduction won't
invoke an implicit constructor in an attempt to match the argument.
- New overload in ir.cpp for printing ArrayRef.
- When we pybind11 an ArrayRef, we convert it into an iterator.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This breaks a lot of the onnx-pytorch tests because the abstraction
barriers are not respected. I'll spin up a patch for that separately.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This started off as a minor fix based on Adam's question, "why is printing
a graph not const" and snowballed into a giant yak shaving exercise.
- The Graph and Node APIs now uniformly enforce deep constness; e.g., if you
get a const Node* or const Graph*, it is not possible to get a non-const
Node*/Graph* somewhere else in the graph (even though the member variables
of these are non-const; hooray for the private access specifier).
- A big pile of functions got const versions, most notably the printing
functions, and functions for accessing inputs().
- REALLY IMPORTANT, BC-BREAKING CHANGE: inputs() now returns a COPY of the
inputs, rather than a reference to the underlying. I was forced to do this
because there is no way to portably turn a std::vector<Node*> into a
std::vector<const Node*>, which is necessary to provide a const-correct
version of inputs() that enforces deep const-correctness. I then justified
this choice to myself with the observation that outputs() returned a
copy (by necessity), so this makes the API more uniform.
But making this change uncovered two very subtle bugs:
1. If you change functions from returning a reference to returning a copy,
the idiom node->inputs().begin() is no longer valid, because the memory
the iterator points to immediately becomes invalid. THIS SUCKS.
Honestly, we should add a lint rule rejecting calling begin()/end() on
temporaries because this is very dangerous. To excise this pattern from
the codebase, I added begin() and end() methods to Graph, so that we got
rid of the graph->nodes().begin() idiom, which happens to be sound,
despite not returning a reference, because graph_node_list is a
non-owning reference.
2. pybind11 doesn't handle std::vector<Node*> cast out of the box.
Fortunately, I found a simple fix in the GitHub issues tracker
that involved adding an extra type converter. And yes, this
does mean that outputs() in Python never worked correctly.
- New const_graph_node_list, which is a graph_node_list that gives you const
Node*
There are some more miscellaneous improvements:
- Applied CR comment fixes on export.cpp; using replaceInput, and renaming
variables for clarity.
- assertValidInput helper method added, and applied to replaceInput
- Use an explicit function to print THPObjectPtr, otherwise we get
the wrong overload.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Prevent numerical issues with poisson_nll_loss when log_input=False
Evaluating the logarithm of the input variable in the Poisson negative log likelihood produces a NaN loss if the variable being evaluated is zero. A small epsilon is added to prevent this. See the equivalent Keras epsilon here: https://github.com/fchollet/keras/blob/master/keras/losses.py#L68
* PEP8 fix
* Add epsilon support to PoissonNLLLoss in nn.modules.loss
* Add torch.take and Tensor.put_
These are similar to numpy.take and numpy.put. The take function allows
you to linearly index into a tensor without viewing it as a 1D tensor
first. The output has the same shape as the indices. The put function
copies values into a tensor, also using linear indices.
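For example:
import torch
t = torch.Tensor([[1, 2, 3], [4, 5, 6]])
idx = torch.LongTensor([0, 4])
torch.take(t, idx)                   # -> [1, 5]; output has idx's shape
t.put_(idx, torch.Tensor([10, 50]))  # writes at linear indices 0 and 4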
* update fuser to match ATen-formatted JIT ops
* fix concat optimizations and add test
* allow onnx export to work with single-export functions
* fix onnx handling of multi-return nodes.
* nits, format, vision test update
* fix add constant
* fix driver init issues
* Add missing Neg symbolic.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Deleted the Addmm/Concat Function classes, as these are now native ATen operators
- Resurrected ONNX operator for Concat (now called 'cat')
- Add a "fake" Expand ONNX operator, which we now do the optimization on;
this helps prevent us from emitting a warning that 'expand' is not supported.
We still fail if any of these Expand operators make it to the final model,
until we actually formalize Expand in ONNX. This also simplifies the
fuseBroadcast code, because single-return ONNX nodes don't get select nodes.
- New error reporting strategy. If we fail to export an operator because of
something, we emit a warning, but otherwise keep going. At the very end,
in export.cpp, we now check if there are any ATen operators left over. If
there are, we bail out. This relies on ATen operator names being lower case
and ONNX names upper case. You're now supposed to 'return _unimplemented(msg)' in these cases.
- New toString() method on Graph, for getting the string graph (useful for
slapping it into error messages.)
- Some of the legacy symbolics (still in the Python symbolic method of Function
subclasses) have been cleaned up for clarity.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The pieces:
- I improved the lint / asserts to catch some bugs which I
committed while working on my export. There are two new
properties which the linter checks now:
(1) "Anticipated uses". If a node says that is used by
M, M better appear later in the topsort. Previously,
we only checked if it was in all_nodes.
(2) If you are a select node, you better be a multi-type node;
if you're not a select node, you better not be! And you
should never have an input that is multi-type.
- There is a new peephole optimization pass, for simple, local
transformations to graphs. Right now, it implements a simple
optimization: remove 'expand' invocations that are no-ops
(the size before matches the size after), but we can add other
things to it later. I needed this for ONNX because no-op expands
show up in the left-hand argument, which we don't support.
- There is now a broadcast fuser, which fuses ATen expand ops
into broadcastable ONNX ops (Add, Div, Mul, Pow, Sub, Gemm.)
It only fuses when the original size is a suffix of the new
size, as per the ONNX spec.
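A sketch of the suffix condition (ours, not the fuser's actual code):
def is_onnx_broadcastable(small, big):
    # Fuse the expand only if the pre-expand size is a suffix of the
    # post-expand size, matching ONNX broadcast semantics.
    n = len(small)
    return n <= len(big) and list(big[len(big) - n:]) == list(small)
is_onnx_broadcastable((3,), (2, 3))    # True: fold expand into Add/Mul/...
is_onnx_broadcastable((2, 1), (2, 3))  # False: leave the Expand in place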
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
.gitignore should have uninteresting files listed, so it acts as a good
.dockerignore. Reduces the build context sent to the docker daemon from
2.927GB (after building locally) to 66.66MB (:O).
* API changes
* Implement reduce for THNN ClassNLLCriterion
* Implement reduce keyword for THCUNN ClassNLLCriterion
* Implement reduce for THNN SpatialClassNLLCriterion
* Implement reduce for THCUNN SpatialClassNLLCriterion
* Make legacy NLLLoss work
* Docs for NLLLoss reduce
* reduce keyword for double backwards NLLLoss
* reduce=False tests
* Addressed comments
* Fix trailing whitespace
* Fix test failures in legacy nn
* Rebase: add reduce keyword to aten declarations of NLLLoss
* Add reference functions for all NLLLoss and NLLLoss2d test cases
* Replaced slow get/set fns. Don't use int64_t in kernels.
* Use TH_INDEX_BASE in NLLLoss for consistency
* Fix legacy ClassNLLCriterion tests
Permute transposes multiple dimensions at once. The as_strided function
changes the sizes and strides of a tensor without changing the Storage.
It's a subset of Tensor::set_.
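For example, through the Python binding of the same name (a sketch):
import torch
x = torch.arange(0, 9)             # storage holds 0..8
m = x.as_strided((3, 3), (3, 1))   # view the storage as a 3x3 matrix
mt = x.as_strided((3, 3), (1, 3))  # same storage, transposed view; no copy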
This allows VariableType to override them to return instances of
VariableType. Combined with the change to Formatting.cpp, this lets us
print Variables to std::cout.
For one thing, we will want a different implementation from TH because
we need to differentiate between scalars and 1-dim tensors.
Also, we don't really want to expose the THS/THCS function; in addition to
checking that the shapes are the same, it checks that the dimensions which
are sparse are the same (because various THS/THCS operators only work if this
is true; it should really be called "is_congruent" or similar).
This includes some changes to the dispatch code for torch.xxx functions:
- Since Variable.addmm is an instance method, the self argument has to
come first. The dispatch code swaps the first two arguments if
necessary to support the deprecated signatures where 'alpha' or 'beta'
comes before the 'self' tensor.
- Delete IMPLEMENT_STATELESS_REVERSED. These functions require output
arguments to be passed in using the keyword 'out'. They were meant to
handle torch.gt(out, a, b), but we haven't allowed that for a while.
* made it explicit in the docstring of Module.register_forward_hook() that the hook(s) will be called AFTER calling forward().
* added "every time" in docstring of Module.register_forward_pre_hook()
* Unify CUDA kernels for SoftMax and LogSoftMax
* Improve SoftMax and LogSoftMax kernels performance
Added a new instantiation of the spatial kernel for
low inner_size and larger dim_size.
* tensor: Ensure that the tensor is contiguous before pinning (#3266)
pin_memory() was producing an out-of-order tensor when the given
tensor was transposed, i.e. in column-major order.
This commit fixes this by calling contiguous() before pinning.
* test: add contiguous test for pin_memory (#3266)
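A small sketch of the fixed behavior (requires a CUDA build):
import torch
t = torch.randn(4, 5).t()  # transposed, i.e. column-major
p = t.pin_memory()         # contiguous() is now called before pinning
assert p.is_contiguous()   # the pinned copy is in row-major order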
* tensor.numpy() checks that no arguments are passed
* tensor.numpy() checks that no arguments are passed
* Improve .numpy() argument checking performance
* Fix clang-802.0.42 tuple overload bug, fixes #3234.
Originally, my plan for emit_record_trace was to keep it as
simple as possible, even at the expense of some somewhat ugly
overloads. So this meant we had a 'recordTrace' function
with overloads like this:
recordTrace(..., const Variable& out)
recordTrace(..., const std::tuple<Variable, Variable>& out)
Unfortunately, this triggers a bug in clang-802.0.42
(widely used in macOS Sierra 10.12.6) wherein a Variable is
implicitly convertible into a std::tuple<Variable, Variable>;
a minimal repro can be seen below here:
#include <tuple>
struct T {};
void f(const std::tuple<T, T>&) {}
void g(T& x) { f(x); }
To work around this bug, the code generator is a bit more
complicated, and is taught how to handle this situation.
Previously the generated code looked like:
jit::tracer::recordTrace( "min", { self }, ret );
Now it looks like:
jit::tracer::recordTrace( "min", { self }, { std::get<0>(ret), std::get<1>(ret) } );
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CR comments
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Py_InitModule returns a borrowed reference. PyModule_AddObject steals
the reference, so we need to incref the `_nn` object.
(The Python 3 function PyModule_Create returns a new reference.)
Don't create grad_fn if requires_grad=False
- Check that arguments without derivative definitions have
requires_grad=False
- Pass all tensor arguments to the tracer, including ones without
derivative definitions
The general strategy is there is a new module, torch.onnx.symbolic, which
contains a function for every ATen method name with the ONNX translation.
While implementing this, I took the opportunity to expunge all references
of 'g' from the public API; instead, it is managed by a global variable in
torch.onnx which tracks the "current graph".
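A hypothetical entry in torch.onnx.symbolic; the `op` helper and its import path are assumptions for illustration, not the committed API:
from torch.onnx import op  # assumed helper; emits onto the current graph
def tanh(input):
    # one function per ATen method name, returning the ONNX translation
    return op("Tanh", input)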
Other changes:
- If you pass a Tensor to op as an argument, it will now automatically be
converted into a Constant ONNX node. This lets us remove needing to
implement ONNX
- Rename value to other, wherever there is both a Scalar and Tensor overload.
This way, keyword dispatch can work uniformly in both cases.
- Deleted any autograd Function classes that both had a symbolic and were ported
to the new C++ autograd implementation. There may still be some straggling
classes that didn't have symbolic.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The generated tracing code looks like this:
if (jit::tracer::isTracing({ self })) {
jit::Node *n = jit::tracer::recordTrace( "mean", { self }, ret );
n->rawSet(jit::stringToSymbol("dim"), dim);
n->rawSet(jit::stringToSymbol("keepdim"), keepdim);
}
A few design decisions I made:
- Instead of making the assignment of 'n' conditional on whether or not
attributes are present, I just add (void)n if it would not be used
otherwise. This modestly simplifies code generation.
- Tracing of operations that involve Generator or Storage are not supported.
This is fine because such ops don't take any Variable arguments anyway,
so they couldn't trigger tracing.
- Unfortunately, at::ArrayRef is not covariant, so there is some faffing about
to support conversions from at::ArrayRef<Tensor> (aka TensorList) to
at::ArrayRef<Variable>. In the case of 'recordTrace' (slow path), I just
allocated an intermediate std::vector to get the types correct; in the case
of isTracing (fast path) there's three overloads to avoid refcount bumping
when possible.
- Tracing is all in one place, rather than spattered between the beginning
and end of an ATen function, as Sam suggested.
- This commit doesn't actually enable ATen definitions.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
1) softmax, log_softmax backwards now have int64_t dim argument
2) chunk/split in autograd/functions/tensor.cpp conflict with new
ATen implementations, just delete them and use the ATen ones.
3) div/mul with Scalar now use the "other" parameter rather than "value".
This adds the ability to specify 'native' functions in NativeFunctions.h and specifies
'split' and 'chunk' in this manner. The function arguments, returns, variants, etc. are
specified as if they were processed via other parsing mechanisms (e.g. cwrap_parse) with
the following additional parameters:
type_method_definition_level: this allows one to specify that the type method should
be defined at the 'base' type level; this is because in the case of 'split' and 'chunk'
(and probably most/all other native functions that don't directly dispatch to TH/THC)
we don't need type-specific implementations. Currently it is enforced that 'base' is
specified for native functions, but this is easy to remove later.
type_method_definition_dispatch: this defines the function to dispatch to. For split,
this is at::native::split; this is just to avoid having a magic namespace and allowing
one to dispatch to a function with a different name.
* with the size=1 case it is impossible to do a single-point check, so replace it with isContiguousRange
* fix stride in desc; fix undef scope
* add test for this case for cudnn
* assertTrue
In many "non-Python" headers, we include Python.h because we need
to declare a pointer to PyObject, and solely because of that. It
would be a lot better if we had a simpler version of Python.h that
just declared PyObject available for pointers, without anything
else. This is what torch/csrc/utils/python_stub.h does.
The good thing about not including Python.h is that it is easy to
be warning-less; no more ugly insertions of Python.h on headers
where it has no good reason to be.
This makes PyTorch warning-clean again.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
I've also made the version counter and the "live" reference count
atomics.
Note that it's not safe to set the version counter (operator=) from
multiple threads, because shared_ptr assignment isn't thread safe.
Currently, the only call sites to these functions are on newly created
variables before they can be accessed from other threads.
See #3111
This removes the StochasticFunctions for bernoulli, multinomial, and
normal and replaces them with classes in the torch.distributions
package. Each distribution supports the differentiable log_prob function
that returns the log of the pdf/pmf of the samples.
The current StochasticFunction implementation has a few problems: it can
be painful to use when there are multiple stochastic outputs which need
to be back-propagated through. It also requires that we store grad_fns
on Variables that have requires_grad=False in order to find stochastic
nodes.
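A REINFORCE-style sketch of the replacement pattern (the reward is assumed to be supplied by the environment):
import torch
from torch.autograd import Variable
from torch.distributions import Bernoulli
probs = Variable(torch.Tensor([0.25]), requires_grad=True)
dist = Bernoulli(probs)
action = dist.sample()                  # non-differentiable draw
reward = 1.0                            # assumed given
loss = -dist.log_prob(action) * reward
loss.backward()                         # gradient flows through log_prob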
- Cleaned up THNN and THCUNN code and kernels
- Improved THCUNN kernel performance 5x, making it match cuDNN performance
- Added support for computing softmax over arbitrary dims
NOTE: The default dim for 3D inputs is now 1 (used to be 0)
- Both functions now accept inputs with arbitrarily many dimensions
- Autograd functions no longer save the input (it's unnecessary)
- Added cuDNN bindings for softmax, but they are unused as THCUNN
matches or even exceeds cuDNN performance
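For example (note the changed 3D default):
import torch
import torch.nn.functional as F
from torch.autograd import Variable
x = Variable(torch.randn(2, 3, 4))
y = F.softmax(x, dim=1)  # softmax over any dim; the 3D default is now dim=1
y.sum(dim=1)             # all ones along the reduced dim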
* Fix docs for nn.Embedding and F.embedding.
- add description of 'sparse' argument (#3104)
- fix F.embedding example (resulted in RuntimeError)
* Make EmbeddingBag a New Style Function.
* Add a functional interface for EmbeddingBag
* Fix failing tests: add max_norm and norm_type to context,
and fix typo in backend call.
* Docfix: remove torch.manual_seed from example code.
* Add a note about using sparse keyword in Embedding function.
Apparently, the algorithm only guarantees the output is coalesced if
the inputs are coalesced.
I'm planning to do another PR that does much more stringent correctness
testing for the 'coalesced' bit shortly, but y'all should merge
this one first.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Currently, the toXXX functions on Scalar check that the conversions are
exact. This will cause an exception in code like:
auto t = CPU(kFloat).ones({1});
t *= M_PI;
Or the equivalent in Python:
t = torch.ones(1)
t *= math.pi
This changes the checks to only throw an exception in the case of
overflow (positive or negative).
* THCUNN Skeleton for Depthwise Convolution port
* implement Depthwise Convolution CUDA Kernels (handles weight parameter only, not bias)
* working kernels and bindings for forward + backward for base conv, and integration
* add support for padding
* strides for weight kernel
* dilation for weight gradient, enable for others
* add support for depthwise multiplier
* remove old depthwise conv
* rename to SpatialDepthwiseConvolution
* clean up depthwise code, add shape asserts, more constrained thread count for accgradparams
* add bias for forward for depthwise conv
* add grad_bias, move bias for forward to CUDA
* fix eligibility test to guard against transposed, properly identify depth multiplier
* add basic unit test; make depthwise conv take priority over cudnn when appropriate
* add tests for depthwise permutations
* make cuda kernels calculate positions using mul instead of div
* remove unnecessary samegpu requirement
* use accreal, test for double type
* use THAssert instead of assert
* rename to is_depthwise
* half prec support for depthwise
* make certain computation more pythonic
* flake8
Previously, we created the Variable.data PyObject* in THPVariable_Wrap. For many
Variables, we don't access their data directly. Instead, they are passed
from one Variable computation to another.
This reduces the overhead of ATen-implemented Variable methods by
~200ns.
* Support MNIST in ONNX
* Add train mode check in FeatureDropout symbolic, add todo mark in logsoftmax_symbolic
* export FeatureDropout as a simple identity op
* turn `x = x or y` into if-checks.
ATen has its own default CPU RNG. Use this as the default in PyTorch so
that random functions called through ATen have the same behavior as
random functions called through TensorMethods.
It's pretty easy to accidentally fail to actually compile
a JITed region, which means that we have accidentally failed
to have test coverage for a number of features. This adds
a secret _assert_compiled kwarg, which will raise an error
if we don't actually hit the compiled codepath.
This is not intended to be user visible; we have some other
ideas for handling this case.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
We weren't doing gradient checks on these functions because the tests
were in-place only. We also incorrectly classified __magic__ functions
as inplace.
a.copy_(b) will now broadcast b to the shape of a. Note that this means
that copies between tensors of the same number of elements but
incompatible shapes are not allowed. For example, the following will
throw an exception:
Tensor a = type.rand({4, 3});
Tensor e = type.rand({3, 4});
a.copy_(e);
The methods were separate because PyTorch supports multiple output types
for comparison methods. For example, for FloatTensors 'a' and 'b' both
calls are valid:
torch.lt(a, b, out=<ByteTensor>)
torch.lt(a, b, out=<FloatTensor>)
ATen only supports ByteTensor outputs because the overloads have the
same static signature and would conflict. It would be nice to fix this
in the future like with the bernoulli function.
In the meantime, the separate function and method definitions with
different argument names make implementing VariableType more difficult.
* Fix the broadcast in Addmm's symbolic
* fix the non-matching dimension cases
* Add exception for non-supported case, remove onnx test cases (moved to onnx-pytorch repo)
* remove the test_onnx.py in run_test.sh
* lint the code
This generates NN bindings with a similar interface to PyTorch's
torch.nn.functional package. The file nn.yaml specifies function
signatures and THNN implementations.
Each NN operation generates three functions. For example:
- conv2d
- conv2d_forward
- conv2d_backward
The conv2d and conv2d_forward functions differ in how they handle
buffers that need to be passed to the backward function. conv2d_forward
takes the buffers as parameters. conv2d creates the buffers internally
and discards them.
A few notes about the implementation:
- Need to plumb 'devices' through to the 'fork_rng' calls. You definitely
want these; it makes verify run A LOT faster
- New keyword argument for compiled model execution, '_force_trace', which
forces us to retrace a model.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Improve Declarations.yaml:
- translate defaults to C++ values
- include names of returned values
- mark keyword-only arguments
* Add comment to translate_default
* Add reduce keyword to MSECriterion API
* Move gradOutput usage from py to backend
* Implement reduce keyword for THNN MSECriterion
* Implement reduce keyword for THCUNN MSECriterion
* Implement reduce keyword for MSE double backwards
* Tests for MSECriterion with reduce keyword
* Documentation for reduce for MSELoss
* Make legacy nn work with reduce keyword by ignoring it
* Apply linter suggestions
* Address comments (small changes)
* Revert "Tests for MSECriterion with reduce keyword"
This reverts commit 1c0be0defa49d336d023d7d9795db4037c92b6fe.
* Undo changes to legacy nn tests
* Reuse module test for MSELoss by creating a wrapper class for MSELoss
* Address comments: refactor MSECriterion.cu to be nicer
* Fix lint & build errors
There is a bit of nuance to this function. If one blindly charges in
and initializes all GPUs, it is going to take a long time. 20sec for
8 GPUs on my dev machine. But to a user, it is non-obvious that fork_rng
is going to hit all the GPUs by default (which it does for
safety reasons). So there is a nice warning when we notice we're
hitting more than one GPU. There is a bit of extra generality
which is going to be used by torch.jit in a subsequent commit.
The motivation is that I wanted to add some more general purpose
utility random functions, but not gunk up torch/__init__.py.
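A usage sketch (devices=[0] assumes one visible GPU; pass [] to fork only the CPU RNG):
import torch
with torch.random.fork_rng(devices=[0]):
    torch.manual_seed(0)   # seed freely inside the fork...
    noise = torch.randn(3)
# ...the RNG state outside the with-block is left untouched.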
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* skeleton commit for building and linking nnpack library in PyTorch
* first stab at conv forward binding + integration
* bind NNPACK gradient kernels
* move nnpack forward, input gradient calls deeper
* nnpack conv api mimics nn
* fix symbol error; use memory across calls
* clean up warnings, add shape checking, thread safety, configurable thread specification
* add batch size threshold, also bind for single-element batch for the future
3D modules apply padding on all three sides. "Both" doesn't make sense here.
I used the wording of the AvgPool3d docstring, where it was already correct.
* Generate torch.cat autograd via ATen.
Most of the change is around supporting generation of:
1) TensorList arguments
2) Arguments to "size", "sizes", i.e. "sizes(dim)"
The alpha/beta naming in addmm was flipped; this commit fixes that
problem. It also fixes the ONNX export of alpha/beta parameters.
Finally, it supports executing matmul in the JIT.
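A quick sketch of the corrected convention:
import torch
M = torch.randn(2, 4)
a = torch.randn(2, 3)
b = torch.randn(3, 4)
# out = beta * M + alpha * (a @ b); the generated derivative and the ONNX
# Gemm export now agree on which scalar is which.
out = torch.addmm(M, a, b, beta=0.5, alpha=2.0)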
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* commit '9f4accd5bb99900dfda9ffab110aeb7a4534d629':
Make all dim arguments int64_t
Converting dlpack tensor to aten tensor
adding a simple class for converting atensor to dlTensor
Test stub for dlconvertor
adding dlpack header
Fix build failure in MSVC
Mark all (non-static) Type methods as const.
* Fix detection of nccl.h when libnccl.so is in /usr/lib/x86_64-linux-gnu and similar paths
* full support for independent NCCL_LIB_DIR and NCCL_INCLUDE_DIR
* lint fix
* add back CUDA_HOME
torch.jit now contains two user-facing functions: compile and trace
(corresponding to what was previously trace/traced and record_trace).
The non-curried versions of these functions have been eliminated, so
that there is only one form of each in the API (we *must* keep the
curried versions, since these enable their use as decorators). There is
detailed usage documentation in the docblocks for these methods.
This comes with a complete rewrite of the internals of torch.jit, in the process
fixing a number of bugs. Key points of the new implementation:
- compile and trace both always return a Module representing the wrapped
with compilation/tracing underlying function/module. This makes handling
of the function/module cases more uniform, as we can think of the function
case as creating an on-the-fly module with the parameters explicitly
specified by the user. For technical reasons, we now *require* any parameters
in the function case to be honest-to-goodness Parameters (gory details:
you can't register a Variable as a Parameter to a Module, and you can't
create a Parameter from a Variable while sharing the same underlying
identity).
- Flattening and unflattening is done a lot more uniformly. We now have
a _flatten and _unflatten function which are inverses of each other:
_flatten always returns both the flat, tuple of Variables, *as well as*
the "proto" (now referred in the code as the "struct") from which we
can unflatten the variables. Low level functions like 'raw_trace'
always work with the flattened inputs/outputs, which keeps their logic
simple.
- JIT trace keying now also includes the "struct" of the input arguments.
This is a step towards accepting non-Variable arguments in functions,
although flatten/unflatten don't currently support it.
- TraceForKey (previously TraceInfo) has had its API reworked to have
less degrees of freedom when you are interacting with it.
TODO: Verify, timing, and trace dumping have been temporarily excised. I
plan on adding them back.
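A minimal sketch of the new curried API (v0.3-era; the nderivs keyword is assumed from docs of the period):
import torch
from torch.autograd import Variable
@torch.jit.compile(nderivs=1)  # curried, so it works as a decorator
def f(x, y):
    return x * y + x
out = f(Variable(torch.randn(3)), Variable(torch.randn(3)))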
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This adds some generated autograd functions implemented in C++, which
are generated from derivatives.yaml. It also generates Python bindings
for the Variable methods. The generated files are:
Functions.cpp/h: subclasses of torch::autograd::Function
VariableType.cpp/h: The at::Type for autograd Variables
python_variable_methods.cpp: Python bindings to torch::autograd::Variable
python_variable_methods_dispatch.h: wrapper which releases GIL and sets the
CUDA device
python_functions.cpp/h: exposes generated autograd functions as Python
objects
The generated functions are mostly shadowed by the definitions in
variable.py. We'll remove the Python implementations in favor of the
generated C++ implementations in a subsequent commit.
Instead of initializing CUDA immediately and executing the seeding/state
calls, we wait until CUDA is actually initialized before executing them.
To keep things debuggable, we also keep track of the original
backtrace when these functions are called, so we can inform
users where they actually called the seeding/state functions
(as opposed to the first time they actually initialized the
RNG).
Fixes #2517
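Observable behavior after the change (sketch; requires CUDA):
import torch
torch.cuda.manual_seed_all(0)  # queued; no longer initializes CUDA
x = torch.cuda.FloatTensor(3)  # first real CUDA use initializes lazily,
x.normal_()                    # applying the queued seed beforehand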
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Plus a test for Eval nodes in the IR, since we hadn't actually
covered this case now that some nodes are transparently traceable.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- If you operate with TracingState, you MUST check if it is live.
Otherwise you will segfault if it is expired; it is VALID for
tracing states to become expired.
- Tracing states can expire if they request backward tracing
(which the tracer does by default). We don't want this to
happen for exports, which only look at forwards. So make
sure we set the correct num_derivatives.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Print some diagnostic information when accepting new test output.
- If it's the first time you ran an expect test, print out
the output you got so it's easier to decide if you want
to accept it.
- Add infrastructure for expect-testing against exceptions
(I'm going to use this in a later patch).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- If a user accidentally attempts to export a model that is in training mode, the
tracer may perturb the parameters (since modules like batchnorm will update
their parameters.) To prevent this from happening, we temporarily turn
off training mode to make sure this doesn't happen. Temporary is
important, since model export should not actually affect the model
- If you have a buggy model which is changing the parameters,
it is much better for us to export the state_dict() *prior*
to executing the model, because that is what we actually
used as the inputs to the trace. The state_dict() afterwards
could be anything.
- kwargs support never worked, so it's been excised.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
To be honest, this was the whole point of this refactor set.
I noticed that in a lot of code, we were repeatedly copying lots of metadata
from old nodes to new nodes. This was quite concerning because I wanted to
add some more metadata (alias information) and I didn't want to have to
get it right in all cases. Plus, in a lot of cases we were forgetting
to set more optional properties like debug names when we "copied".
To solve this, I first made cloneFrom() copy all of this metadata. Then,
I searched for all occurrences of setType() (a proxy for "I'm cloning this
node"), looked for cases where we really were morally doing a copy, and rewrote
the code to use cloneFrom() instead, allowing us to drop explicit setType()
(and getting more metadata preservation in the process.)
Finally, I refactored tryToMoveChunk. The code is modestly longer,
but the new version has the nice property that the initialization of
selects for input_chunk are next to the creation of the node (as opposed
to delayed for later.) I also added a lot more comments for invariants
I noticed when I was working on the code.
One minor extra change: TensorType grew a new constructor and a withSizesStride
"immutable setter" which returns a new copy of TensorType with different info.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Previously, there was a hidden, unchecked invariant that you were not allowed to
call create(kParam) or create(kReturn). Now that the logic for them is embedded
in create(), the create(kParam) case is valid, and the create(kReturn) case
will raise dynamically if you try it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Since this code has been stable for a while, I think it's
a good opportunity to make it const correct. There is only
a slight increase in code size, which I hope will appease @zdevito.
- consts were added to all methods which are logically const. Most notably,
lint() is now declared const.
- I made extra const versions of Node::iterator(), Node::reverseIterator(),
Graph::nodes(), Attribute::find(), linked_list::begin(), linked_list::end(),
linked_list::rbegin(), linked_list::rend(); in all cases these were one-liners
except for find() (I spent a little time trying to make find() a one-liner
but didn't think of a way to do it).
- graph_node_list got factored out into a new, templated type linked_list<T>
(perhaps we should call it intrusive_list<T>). I had to template the iterator
to define constant and non-constant iterators without duplicating code,
and once I was there, I decided to templatize everything else. The code
nicely factors out, although I wouldn't recommend using it for anything
else without more refactoring.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
These functions accept a scaling parameter like THTensor_(cadd)/(csub),
which will make it easier to have the same signature for tensor and
scalar addition in PyTorch and ATen. For example:
tensor.add(other, alpha=2)
Will work if other is a scalar or a tensor value.
See #2739
This adds a concatenated Declarations.cwrap which is the result of
running ATen/extract_cwrap.py on TensorMethods.cwrap. This will let ATen
and the Variable bindings temporarily diverge from Tensor before the new
Variable class subsumes Tensor.
See #2739 and #2633
* Specifying the value used for padding
The "pad_packed_sequence" function fills padded elements with zeros, but sometimes it is not useful. For example, some previous papers on NLP, including my recent paper [1], use a max-pooling technique for RNN-based sentence representations. More specifically, the max-pooling technique selects the maximum value from all time steps (i.e., hidden states) for each dimension. In such a case, we do not want the padded zeros to be selected. To overcome this situation, we can simply use a very small value instead of zero.
An LSTM example is shown below:
input = embedding(Variable(batchInput))
packedInput = nn.utils.rnn.pack_padded_sequence(input, lengths, batch_first=True)
h, (hn, cn) = self.encoder(packedInput, (h0, c0))
h, _ = nn.utils.rnn.pad_packed_sequence(h, padding_value=-1024.0, batch_first=True)
sentenceRep, _ = torch.max(h, 1, keepdim=True)
[1] A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. The 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017).
https://arxiv.org/abs/1611.01587 (Equation (4))
* Modified the order of the arguments
Following the suggestion, I modified the order of the arguments.
* Win64 support for lib/THS
* Fix VS warnings(for lib/THS)
* Revert changes that prevent successful build
* use the type descriptors for int64_t
* Fix warnings in THS for MSVC
Also squash a warning about an implicit conversion that will never
occur (because the type being converted to is a superclass).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Variable is now a subclass of at::Tensor backed by a VariableImpl* pImpl. The implementation of the ATen functions is defined in the auto-generated VariableType.h/cpp file.
Currently, only functions which fall through to the base type, such as sizes() and isCuda() are implemented. Differentiable ops like add() and mul() will be added in a subsequent PR.
When you call repr() on a long in Python 2, it prints an 'L' suffix.
This is annoying for tests which assert on the exact output. Use str()
instead.
But then there is a problem with Python 2's default tuple str() implementation,
where it calls repr() on its arguments rather than str(). This means that
if you have a tuple of longs, it will render as "(1L, 2L)" in Python 2.
To solve this problem, we just reimplement tuple printing in C++.
This is not a very robust fix (nested tuples, dictionaries, all these situations
will fail) but in practice it hits the cases that matter.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
In Python 2, the non-generator map will always perform the indexing
even when it is not used in the end. Using the generator can let
us avoid indexing when it is not used.
As an added bonus, it makes the ordering of operations deterministic
between Python 2 and Python 3 in LSTM.
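The pattern, illustratively (names are ours):
weights = [10, 20, 30]
order = [2, 0, 1]
eager = list(map(lambda i: weights[i], order))  # Python 2's map indexes everything up front
lazy = (weights[i] for i in order)              # generator: indexes only what is consumed
next(lazy)                                      # -> 30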
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Added support for nInputDim parameter in Padding class
* moved nInputDim to the end so as to not break backwards compatibility
* hasattr to check if nInputDim is actually set
* check if nInputDim is positive before checking against input dim
When the size given is incorrect for the number of elements, the current error message is:
`size '[1 x 1 x 5]' is invalid for input of with 1 elements at /pytorch/torch/lib/TH/THStorage.c:41`
This replaces it by
`size '[1 x 1 x 5]' is invalid for input with 1 elements at /pytorch/torch/lib/TH/THStorage.c:41`
which is grammatically better
Proper broadcasting in ATen uncovered a bug in our fusion
compiler where it outputs the wrong shaped tensor. We're
tracking the issue in https://github.com/ezyang/pytorch/issues/206
but for now, rewrite the code so it does an "old style" comparison,
which works fine.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Variables now hold a list of ValueTracingStates and can participate
in multiple traces.
* Refactored Traceable to maintain a list of traces, and only stop
tracing once it records all stages
- kernels -> kernel_shape
- Use the new hybrid dict/tuple result object from Toffee
- Write g and t as singulars, not plural
- nanopb generated files update
- Bugfix for msg() micropb helper
- Start recording producer_version/producer_tag
- Use ir_version from proto description
- Value -> value (Constant)
- Remove special-casing for transposed convolution; we now rely
on the Caffe2 Toffee backend to do something reasonable
- Batchnorm order is no more
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Conv no longer supports bias, so we create an explicit broadcasted
addition afterwards. There is one minor problem, however, which is that
ConvTranspose in Caffe2 has mandatory bias. So there's a hack.
See Note [Caffe2ConvTranspose] for the details.
- Squeeze: dims -> axes
- Transpose: axes -> perm
- Reshape lost its extra output (yay!)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This was a doozy!
- 'namespace' is a C++ reserved keyword, so if you have a field named
this, nanopb will blithely export some malformed C++. I submitted
a PR for this: https://github.com/ProjectToffee/ToffeeIR/pull/88
- Zach added support for singular tensor and graph. While attempting
to add support for these, I realized that it was actually impossible
to support them under the default protobuf translation. The gory
details are in Note [Callback for nested messages]. The singular
callbacks needed a new helper which I dubbed msg; it's just
the singular version of list.
- While I was working on the API, I braino'd with the tensor()
method. It turns out this is totally not the right way to think
about it; it's more string_from_tensor(). So I renamed it.
I also renamed add_tensor to set_raw_data; add_tensor is a misnomer
since it implies you can add multiple tensors, which is not true.
- version turned into producer_version. Actually, this is a bit
questionable and might change soon.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This is a case of two wrongs making a right. There were a pair of
related bugs:
- We incorrectly translated Transpose as if it were a Permute;
but Torch transpose actually is a *swap* between dimensions.
- Why didn't we ever notice it? In all of our tests, a transpose
was *solely* done to get a weight matrix into the correct form.
But Caffe2's FC operator *implicitly* does a transpose on
the weight matrix.
This commit fixes both of these problems.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This adds the PyTorch API user documentation for Toffee.
To make the example work, I also converted all "inplace"
ops to export out-of-place in Toffee.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- BC BREAKING: export now also takes a mandatory file-ish argument, specifying
the file to export the protobuf to. I rewrote the tests to use BytesIO to
get out the string so they could parse it again.
- BC BREAKING: export no longer returns the tensors that were computed. To
get these, use the internal _export function.
- Multiple inputs to models are now supported by passing a tuple to input.
(Old API of a single Variable still works.)
- Keyword arguments to models are now supported via kwargs keyword arg.
- Renamed embed_params to export_params, and it now defaults to True.
- Toffee tests now live in their own test_toffee.py file. I had to
rename a pile of expect files for this.
- Removed defunct torch.toffee imports from autograd to solve module import
cycle.
- Helper function _with_file_like to abstract over opening file-ish arguments,
taken from torch.save()
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Rather than reuse input as output names in ToffeeIR, mark places where
inputs are consumed. In C2 conversion these annotations will be used
to create the corresponding graph.
Toffee submodule update.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Reduce setup.py diff.
- Expunge WITH_TOFFEE from codebase.
- Elaborate on a comment.
- Move gen_toffee.sh to tools
- Delete densenet test.
- Use 'using' to inherit a constructor.
- Delete outdated comment.
- Comment about why primspecs can return fewer outputs.
- Remove dead, commented out includes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Along the way I added converters for Variable and TracingInput. Variable should
probably be moved to a more widely known spot.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Instead of dynamically allocating a float for each element of the tensor
(lol!) save the tensor itself, and directly read out the data.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
"Unused" nodes are mapped to nullptr, and we distinguish
on lookup nodes which were never mapped versus nodes that
were mapped but supposed to be unused. This case
should never happen, but a little extra safety never hurt.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
I realized we weren't running the linter after ToToffeeIR, so
I added a lint call. It thus emerged that the current implementation
was using "Unused" nodes that were not added to the graph,
which was tripping the lint. I fixed this a few ways:
- BatchNorm and Conv primspecs were returning dead "unused" nodes
for their (implicit) handle parameters. I removed them because
setOutputs handles this already, and a dead unused node which
is not attached to the graph violates the "no dead nodes"
invariant.
- OK, but MaxPool actually needs to return an unused node for
an output which is supported by PyTorch but not Toffee; we need
to error if subsequently in the trace this output is used.
The new strategy is to have MaxPool's primspec return a None
at the unused position, and then immediately *check* if there
are any uses of that output. If there are, that's an error!
- I needed to adjust the Select invariant in the exporter loop:
only if a Select node has *uses* is it mandatory for it to be
defined in env.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Basic idea:
- Pass buffers (marked as non-Variable tensors) as input variables to
the trace. Every buffer gets represented as an input variable
to the trace, and we remember a correspondence of the underlying
TH pointer and an input variable in the trace.
- When we initially trace a function, we DO NOT record the buffers
as edges. This is so autograd doesn't have to know anything about buffers.
If we ever turn buffers into requires_grad=False parameters, then
this problem goes away.
- When we primspec the buffer, NOW we reach into the cached buffers
(now appropriately named) and gin up the buffer information we need.
Other things:
- CppOp execution is now supported (but lightly tested) using
SimpleEval (thanks @apaszke!)
Todo:
- E2E tests need to have their hacks removed.
- Figure out what is going on with backwards
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
If we don't set it, CMAKE_DEBUG_POSTFIX gets set to 'd', which means the
static library gets named something different when built in debug mode.
This is annoying because it means if you build in debug mode, the
library is in a different place. Rather than teach the build system
to find the correct name, just set this POSTFIX so names don't change.
Also, update setup.py to look for the non-debug archive.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
General strategy:
- nanopb is statically linked into PyTorch. It must be built
with -fPIC.
- Generated nanopb files for toffee.proto are checked into
our repo.
- Because nanopb generated protobufs are C only, we wrote a
wrapper around it to give a Google C++ style interface.
More on this shortly.
How does the wrapper work?
- It's called "micropb" because it is less small than nanopb :)
- nanopb requires all variable-length fields to be written out
using a "callbacks" mechanism.
- We wrote pre-canned callbacks for all of the types ToffeeIR
writes out, and for lists of them; these are micropb_callback and
micropb_callback_list. These operate simply by dynamically
allocating and storing the data to be written out in
data (this defeats the purpose of the callback mechanism,
but it's easy to implement)
- Finally some boilerplate to actually implement the wrapper
classes and have owning pointers to the actual data.
Testing strategy:
- Take the serialized protobuf from nanopb, parse it again
with ToffeeIR and print it. Worked with all of test_jit.py!
These tests don't run without 'toffee' being installed.
TODO:
- Update CI to install ToffeeIR, so we can run the Toffee tests
in CI
- Update E2E with Caffe2 tests so that they work with new stuff.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
previously:
PythonOp/CppOp Graph -> ToffeeIR, primspecs worked with protobufs
now:
PythonOp/CppOp --ToToffeeIR--> jit::Graph of in-memory ToffeeIR -> protobufs of ToffeeIR
This commit lets primspec functions work directly with JIT IR nodes,
which makes it possible to do a lot more stuff in those functions.
Let's say I write alpha=2 in my PyTorch code. Is alpha a float
or an int? This problem is resolved when we actually pass
it to the underlying kernel, which knows what type it expects
it as.
When serializing to Toffee IR, the Toffee NodeProto also needs
to dictate the correct type; otherwise, we may guess wrong.
We get this information from the OpSchema in the ToffeeIR library.
With this, we can avoid explicitly casting in dropout.py and
auto_primspec.py
WARNING: You will need to update torch/lib/ToffeeIR when you pull
this patch, as attribute schemas were added recently to ToffeeIR.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This addresses when bias is disabled, which occurs in torchvision's
alexnet and densenet.
The general strategy is this:
- When we encounter a null variable, we turn this into a Constant
node with an undefined at::Tensor
- Toffee exports for BatchNorm and Conv have special cases for bias,
checking if they are provided by a Constant node with undefined
value, and just omit the input if so.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The general strategy:
- We put all the toffee files in torch/csrc/toffee; they will only be
added when toffee is enabled
- Toffee is enabled if torch/lib/ToffeeIR is present (since we
don't have a submodule/subtree thing going on)
- The most prevalent place you will need to use WITH_TOFFEE is for
primspec definitions on C++ autograd functions. There is a
macro HAS_PRIMSPEC to ameliorate optionally defining primspec()
virtual overrides on Function classes. HasPrimspec is always
available but will be a zero-field class when Toffee is disabled.
NB: We might revert this commit in the future if we figure out a way
to unconditionally enable Toffee that everyone likes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
We want all the conversion code to live in one place. Away it goes!
This means that alexnet protobuf no longer works. It will start working
again when we port changes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This commit adds a new exporter pass which takes a graph and returns
a string of the human-readable protobuf representation of a model.
We have two strategies for how conversions are implemented:
- If a Python autograd function has a primspec static method, we invoke
it to get the Toffee conversion. Use torch.toffee.op to generate the
format expected to be returned. The particular data representation is opaque
and subject to change in the future.
- Otherwise, there's a giant if statement in the exporter, which manually
uses the JIT IR C++ API and Toffee IR C++ protobuf API to convert.
You must check out a copy of the ToffeeIR repo
https://github.com/ProjectToffee/ToffeeIR at torch/lib; at the moment
we don't have a subtree/submodule set up.
Technical debt in this commit:
- To get protobuf headers in scope, we unconditionally add $CONDA_PREFIX/include
to the include path. This needs to be replaced with a more robust mechanism.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The API works on either functions or models, taking an extra parameter argument
so that functions can pass in additional variables to trace.
Other behavior is folded into boolean options:
time - collect stats for our own perf debugging
verify - run the original code, and check it is within threshold
optimize - run optimization (currently off until the fusiongroups PR is accepted).
enabled - flag to turn off tracing so you can check timing of stuff that cannot be traced.
Fixes #48.
I had to shave some yaks:
- I needed to switch on Type, so I wrote a new macro set TYPE_IF,
and abstracted the IR_IF into a GENERIC_IF. The parametrization
is on const-ness and the type kind; also there is a minor annoyance
where type kinds (ugh, hate the name; it means the wrong thing
in Haskell land) don't match the class names, so some suffix
munging is needed. There's still some extra funny business, see
https://github.com/ezyang/pytorch/issues/51
- A lot of functions on types weren't declared const when they could
have been. I added const qualifiers as necessary.
- setType now takes an honest to goodness Type* rather than TypeKind.
- init_pass now preserves types when it does transformations.
There are still some places we're losing types, most notably fusion.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The approach is based on THC's pointwiseApply{1,2,3} family of kernels,
but doesn't have any dependencies on that code.
Adjacent contiguous dimensions of input tensors are compressed to reduce the complexity of indexing math.
For the completely contiguous case, the indexing logic simplifies to just the linear index.
In simple tests, this code matched or beat the equivalent from THC.
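A Python sketch of the dimension-compression idea (not the kernel code):
def collapse_dims(sizes, strides):
    # Fold adjacent dims whose layout is contiguous into one, so the
    # indexing math has fewer dimensions to walk.
    out = [(sizes[-1], strides[-1])]
    for s, st in zip(reversed(sizes[:-1]), reversed(strides[:-1])):
        last_s, last_st = out[-1]
        if st == last_s * last_st:  # contiguous with the dim below it
            out[-1] = (s * last_s, last_st)
        else:
            out.append((s, st))
    return list(reversed(out))
collapse_dims([2, 3, 4], [12, 4, 1])  # -> [(24, 1)]: one linear index
collapse_dims([2, 3, 4], [24, 4, 1])  # -> [(2, 24), (12, 1)]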
- To test whether or not a multiline string matches some expected
value, you can use assertExpected. This tests that the string
matches the content stored at a file based on the name of the
test (and an optional subname parameter you can pass if you
want to call assertExpected multiple times).
- Suppose you make a change that modifies the output in a big way.
Instead of manually going through and updating each test, you instead
run python test/test_jit.py --accept. This updates all of the expected
outputs. You can now review them one-by-one and make sure your
changes make sense.
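A sketch of a test using the helpers described above (the TestCase import path is the repo-local test utility and is assumed):
from common import TestCase  # repo-local helper defining assertExpected
class TestJit(TestCase):
    def test_print_graph(self):
        self.assertExpected("graph output here")             # default name
        self.assertExpected("more output", subname="pass2")  # optional subname
# After an intentional output change, regenerate expect files with:
#   python test/test_jit.py --accept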
We can add more features later (e.g., munging the output to make it
more stable, more sanity checking) but this is just to get us started
testing. One thing to watch out for is that accept tests on intermediate
representation can be a bit wobbly: it is *extremely* important that
people be able to read the IR. It may be worth introducing niceties
to the printer in order to ensure this is the case.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Now it gets initialized during the constructor. This results
in more boilerplate but is conceptually more correct, and solves
an assert failure.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
It is not an /expression/ we trace, but it is a /graph/: that is,
a closed expression which knows its parameters. Knowing the list
of parameters is helpful and removes a hack when interpreting.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This prevents nested lets, which are not allowed in ANF. We
basically have SSA now.
There's some niftiness with the visitor returning a lambda which
then gets fed the actual argument. I like it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Although ANF style developments traditionally stratify syntactic
classes into atomic (Arg) and complex (Expr) expressions, where
atomic expressions could be variables, constants or lambdas, Zach has
successfully convinced me that we should do away with the variant here and
always require arguments to be variables. There are a few reasons for
this:
1) Tensor constants, not currently supported, could be modeled using a
"Constant" instruction, removing the need for them to be representable
directly inline. An inline constant is marginally more convenient
for peephole optimizations, but since we have gone full ANF, we are going
to need to be able to see across def-uses in any case, and it is not
too much worse to need to handle constants this way. By the way,
Swift Intermediate Language also made a similar choice, see
the slide on "Literal Instructions" in
http://llvm.org/devmtg/2015-10/slides/GroffLattner-SILHighLevelIR.pdf
2) Scalar constants, which are quite important for passing non-tensor
arguments to Python operators, are now stored out-of-band as NON
first-class values. This more closely matches the ToffeeIR design,
and makes it clear what parameters are "first class" (tensors only)
and which ones are not. However, we need to be able to unswizzle
the separate scalar/tensor lists into a unified list in the correct
format; this is what PyFunctionCConv is for.
Also, Locals got renamed into Tuple.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Previously, our AST was a DAG, where shared Nodes indicated a computation
should be reused. This commit rewrites the IR into a new functional
representation which represents sharing explicitly using variable
bindings.
We offer a few justifications for this new style:
1. The new representation is not all that different from the
old one; it is about as easy to construct, and the lack of an
explicit graph doesn't negatively impact our ability to interpret
the graph, since we've chosen, as a matter of design, to NOT have
the IR participate in the actual execution of a graph.
2. The new let-binding representation has an implicit ordering,
which we can use to conveniently keep track of the original order
the trace showed up as. This automatically gives us a topsort,
and gives us an easier to read textual representation of our
IR:
%14 = Embedding %11, %0, -1, None, 2, False, False
%15 = Dropout %14, 0.2, True, False
%16 = Index %12, 0
%17 = Index %12, 1
%18 = Index %13, 0
%19 = Index %13, 1
%20 = Index %15, 0
%21 = Linear %20, %1, %3
%22 = Linear %16, %2, %4
3. It moves us closer to a Futhark style language
(http://futhark-lang.org/publications/pldi17.pdf).
Major aspects of the diff
- Node is replaced with Expr and Arg, a pair of mutually recursive
structures which represent our new language. In BNF, the language
looks like this:
a ::= c | %i
e ::= %i, ... = e
    | PyOp e, ...
    | Ret %i, ...
Technically, Ret is not actually a return (no control flow is involved),
it just tuples up a series of tensors (identified by variables).
One important invariant is that locals are always tensors; they
are never constants (this is asymmetric with Args.)
- Arguments support Python constants. This is an important piece because
many operators take extra Python literals like integers and tuples in
order to specify extra parameters about how an operator operates. Adding
this was essential to getting word_language_model to work.
- As both Expr and Arg have multiple variants, there is new infrastructure
for doing case on the variants using ExprVisitor and ArgVisitor. The
strategy here is adapted from WebAssembly's visitors, although we have
generalized to permit arbitrary argument forwarding, which is necessary
to support tail-recursive visitor calls. TCO is important because our
interpreter may recurse arbitrarily deep into a stack of nested lets.
If users wish, they can also manually case on the type tag.
- Tracing is now turned on and off using _tracer_enter/_tracer_exit in
torch._C. _tracer_enter accepts a list of variables which are to be
treated as arguments; _tracer_exit accepts the list of traced variables
which should be returned when you reexecute the trace, and returns
the trace expression which can be reexecuted. GlobalTracingState
is a global variable which tracks whether or not we are tracing.
- You use run_forward to execute a trace on some set of parameters.
- When tracing, variables keep track, via trace_local, of the names
of their corresponding variables in the IR.
Here is a simple runner which leaks memory but can be used to JIT models:
import types
import torch._C
import torch.autograd.function as F
from torch.autograd import Variable

def jit(model):
    real_forward = model.forward
    def forward(self, *args):
        def flatten(x):
            return tuple(F._iter_variables(x))
        if not hasattr(self, "saved_trace"):
            torch._C._tracer_enter(tuple(self.parameters()) + flatten(args))
            out = real_forward(*args)
            self.saved_trace = torch._C._tracer_exit(flatten(out))
            self.saved_outs = out
            return out
        else:
            flat_out = Variable._execution_engine.run_forward(self.saved_trace, tuple(self.parameters()) + flatten(args))
            return F._unflatten(flat_out, self.saved_outs)
    # install the traced wrapper in place of the original forward
    model.forward = types.MethodType(forward, model)
    return model
Major problems:
- Sanity checking is spotty at best, especially when users pass in variables.
- The interpreter leaks tensor memory from the store. When we add back def-use
we should be able to deallocate tensors as soon as we know they are no longer
necessary.
- The interpreter needs to reach feature parity with the old execution engine.
From there, we need to see if backwards can be subsumed as well.
- I still have no confidence that memory is managed correctly everywhere.
This requires a close look.
- Rather than return an *open* expression as a trace, we should return a
*lambda* instead, which knows about how many formal parameters it
requires.
- The IR is not introspectable from Python at the moment, but this is simply a
matter of implementing all the binding code.
- The tracer is NOT reentrant (you can't trace while you're inside a trace.)
Furthermore, no sanity checking is done if you try to incorrectly reuse
things from one trace in another.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Simple test:
import torch
from torch.autograd import Variable
import torch._C as _C
x = Variable(torch.Tensor([4]), requires_grad=True)
y = Variable(torch.Tensor([7]), requires_grad=True)
z = x * y
z.sum().backward()
print(x.grad)
print(y.grad)
x.data[0] = 2
y.data[0] = 3
(z,) = z._execution_engine.run_forward((x, y), (z,))
z.sum().backward()
print(x.grad)
print(y.grad)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This respects all the broadcast cwrap specifications except for 'fallback';
i.e. pointwise functions operating on tensors where the number of elements
match but the sizes are different and not broadcastable. This behavior is
currently deprecated in PyTorch. Note that this is a breaking change in ATen,
because ATen just passes through to TH/THC, where the fallback behavior is
actually implemented.
This also changes expand semantics wrt Scalars (as tensors). Previously,
one could 'expand' a 1-dimensional tensor with size 1 to a 'scalar' (i.e.
empty size initializer list).
* Support double backwards for AdaptiveAvgPool1d and AdaptiveAvgPool2d.
* Support double backwards for ReplicationPad2d, ReplicationPad3d, and ReflectionPad2d.
* Support double backwards for FractionalMaxPool2d.
* Support double backwards for MaxUnpool1d and MaxUnpool2d.
* Circular recursive imports are not supported in Python 2.
* Address review comments.
* Add examples in functional.py
Added examples for F.cross_entropy, F.binary_cross_entropy and F.binary_cross_entropy_with_logits.
* Add backticks (`) for PyTorch docs
Added backticks (`) for PyTorch docs.
* Add examples in loss.py
Added examples for nn.BCELoss and nn.BCEWithLogitsLoss.
When working on PyTorch dependencies we often want to rebuild only that
dependency and the Python extension. You can now do that by running:
python setup.py build_thc
to only re-build THC
* Add ability to specify init_method for test_distributed.
* Move init_method specification to test run line.
* Run for gloo tests as well.
* Better status message for gloo test.
Basically, it's easy to confuse the dimensions of the index tensor.
This adds some more text which should hopefully clarify the situation.
Fixes #2416.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add ATen overload to AutoGPU.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Use new AutoGPU overload.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The test_FloatTensor_qr_big test is still a bit flaky on K80. Increase the tolerance to improve reliability, since results for this test change as tests are moved around.
* Implement BatchNorm double backwards as a python function called directly from C++.
This will be converted to C++ code once ATen is integrated with autograd.
* Some performance improvements via inplace ops and reusing calculations.
There were two implementations of THPUtils_checkLong/THPUtils_unpackLong; one
that was a macro and one that was not, which is hella bad if you accidentally
include the macro before the real definition. Now we always use the inline
function.
A reasonable follow-up task would be to un-macro-ify the rest of these functions.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* add SharedFunctionMaker to create Function shared in the graph
* Clean shared_ptr usage for only function that will be used in the graph
* make Function binding match the Variable one
* remove unnecessary changes
* fix comments
* proper weakref implementation
* add call to clear in dealloc
* Add examples in CrossEntropyLoss
1. Added examples in CrossEntropyLoss
2. Make consistent style of example for PyTorch docs
3. Delete unnecessary character '
* Change comments in distance.py
1. Delete x1, x2 from arguments and add eps in PairwiseDistance
2. For the shape, added input1 and input2 for readability (PairwiseDistance and CosineSimilarity).
* Add examples
Added the word 'examples' for PyTorch docs
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
* added tests + removed explicit expand of weight in bce with logits
* add auto broadcasting of weight to BCELoss
* remove the need for _BCELoss
* formatting of warning
* remove TODO
* move across assert from _functions/thnn/loss.py
* flake8 fixes
* add dropout2d and dropout3d to functional
added some loss functions to functional
added tests
using dropout from backend
added docs
fixes
* edited loss modules to call functional
Summary: When performing reductions on fp16 buffers, gloo assumed that both buffers were either aligned to 32 bytes or misaligned by the same offset. This may not hold in intermediate steps of halving-doubling allreduce, when the reduction is performed on some offset within the receive buffer. The fix is to use intrinsic instructions that work with unaligned pointers.
Reviewed By: akyrola
Differential Revision: D5450103
fbshipit-source-id: 9a1c8f8c34d2e62223f6d5c21573ea1cfad6537f
The function iterates over columns and sets a "sparsity" fraction of entries in each column to 0. The number of zeros in a column (num_zeros) is then ceil(rows*sparsity).
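A minimal sketch of that behavior (illustrative; not the library implementation, and the helper name is hypothetical):

```python
import math
import torch

def sparse_init_(tensor, sparsity):
    rows, cols = tensor.size()
    num_zeros = int(math.ceil(rows * sparsity))
    for col in range(cols):
        # zero out a random `sparsity` fraction of each column
        row_idx = torch.randperm(rows)[:num_zeros]
        tensor[row_idx, col] = 0
    return tensor
```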
Summary: When compiled with -Werror=shadow-compatible-local, you cannot reuse a variable name. This passed our tests, but some people use stronger settings to compile.
Differential Revision: D5440805
fbshipit-source-id: a246af748717fb7e0e7a321e1ac4ddfef68ae524
Summary: To reduce round trips with store handlers, it is better to store all addresses in one key instead of one address per pair. This is what this implements.
Reviewed By: andrewwdye
Differential Revision: D5435893
fbshipit-source-id: 2d3ea3a2822c3b934ff2578d44a262e7bfbde6d0
Summary: Use the CreateCommonWorld timeout for the storehandler as well, not just the device connect.
Reviewed By: andrewwdye
Differential Revision: D5425923
fbshipit-source-id: 936d2129e2db3bfed8759ca097b75843d3931d5f
* add support for groups in double backward
* add tests for group in double backward
* fix lint
* separate some tests to reduce number of test cases
* remove redundant testing for different number of output channels
This is needed because of possible races in SpatialConvolutionMM (and others that use gemm)
if the BLAS library is not thread-safe.
In terms of performance, there's not much benefit to running two gemms in parallel, because the
BLAS libraries have their own all-occupying gemms anyway.
* Improve non-contiguous testing in TestAutograd:
1) Test gradcheck and gradgradcheck with non-contiguous inputs
2) Test gradgradcheck with non-contiguous gradoutputs (gradcheck would take more work)
3) Fix discovered issue in Prod backwards.
* Simplify non-contiguous setting wrt View.
Previously, there were 2 issues with test_autograd randomness:
1) Many random operations (e.g. random selection in prod_zeros) happened
before the torch random seed was set (because it was set in run_tests
at the end of the file).
2) The random seed was not set consistently: run_tests would set it to the
proper value, but each call to setUp would set it to 0 (because SEED wasn't
global in run_tests), which made setting the seed mostly worthless.
Previously, these tests added 5e-2 to the denominator tensor (the same as the div
tests), which only avoids divide by 0, but not issues with computing the numerical
jacobian due to non-linearity of fmod/remainder, when input / divisor is close to an
integer. These tests now add 1.5 to the denominator, which is the same as the non-tensor
version of the tests; Note that we can still hit the above condition but it will be much
less likely.
This takes advantage of the broadcasting behavior of torch.matmul to
support inputs with more than two dimensions. The extra dimensions are
treated like part of the batch dimension, much like nn.Bottle in Lua
Torch.
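A hedged illustration of the behavior (assuming an nn.Linear-style layer):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 5)
x = torch.randn(10, 3, 4)   # extra leading dimensions act as batch dims
y = layer(x)                # matmul broadcasts over them
print(y.shape)              # torch.Size([10, 3, 5])
```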
There are a few related small performance changes:
* Addmm computes the gradient in column-major for inputs in
column-major format
* Variable.mm calls Addmm in-place with the desired output buffer
* Add weight normalization implementation
This adds forward "pre-hooks" which get called before the module's
forward() method. Weight norm is implemented as a hook which calculates
the weight variable from the weight_g and weight_v every iteration.
Based on @rtqichen's implementation.
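A minimal sketch of that pre-hook mechanism, simplified to a single scalar magnitude g rather than the per-output norms of the real implementation, and skipping removal/serialization concerns (the helper name is hypothetical):

```python
import torch
import torch.nn as nn

def apply_weight_norm(module, name='weight'):
    weight = getattr(module, name)
    # Reparametrize weight as magnitude (g) and direction (v) parameters.
    del module._parameters[name]
    module.register_parameter(name + '_g', nn.Parameter(weight.data.norm()))
    module.register_parameter(name + '_v', nn.Parameter(weight.data))
    def hook(module, inputs):
        g = getattr(module, name + '_g')
        v = getattr(module, name + '_v')
        # Recompute the weight from g and v before every forward().
        setattr(module, name, v * (g / v.norm()))
    module.register_forward_pre_hook(hook)
    return module

m = apply_weight_norm(nn.Linear(4, 5))
y = m(torch.randn(2, 4))   # weight is rebuilt by the hook on each call
```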
* Specify return type
* Fix unused linker argument warnings.
This patch began when I noticed the following clang warning:
clang: warning: -Wl,-rpath,RIGIN: 'linker' input unused
clang: warning: argument unused during compilation:
'-L/home/ezyang/local/pytorch/torch/lib/tmp_install/lib'
The warning is minor, but I was a bit worried our rpath wasn't
setup correctly. Actually, it was, and there wasn't a problem,
but I had to spend some time figuring out exactly what was going
on, and by the end of it, I might as well fix the warning. I ended
up filing two upstream tickets for ccache and cmake:
- https://github.com/ccache/ccache/issues/189
- https://gitlab.kitware.com/cmake/cmake/issues/17025
We can remove the warning by using CMAKE_EXE_LINKER_FLAGS and
CMAKE_SHARED_LINKER_FLAGS, which have sane macro expansion rules
(although still slightly insane: the first level of escaping gets removed.)
To ensure that the rpath was being set correctly, I ran
objdump -x torch/lib/build/TH/libTH.so | grep RPATH and verified that ORIGIN
was setup correctly.
I also considered using CMAKE_INSTALL_RPATH, but the rpath here doesn't
seem to get set until you actually install, which is a change in behavior,
and I wasn't sure if anyone was relying on rpaths being setup in the build
directory.
There is a SLIGHT behavior change, in that if we happened to need these
LDFLAGS passed to the static linker, they won't get passed. I don't
think we ever build static libraries today so this shouldn't be a problem.
P.S. Because of the ccache bug, you may continue to see these warnings
after this patch. If you apply https://github.com/ccache/ccache/pull/190
and clear your cache, it will solve the problem.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Remove unnecessary -Qunused-arguments
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
If the left tensor is 3D+ and the right tensor is at most 2D, we can
fold the batch into the matrix dimension and use torch.mm instead of
torch.bmm. In practice, this is faster especially if the right tensor is
column major.
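A hedged illustration of the fold:

```python
import torch

a = torch.randn(10, 3, 4)    # 3D left operand
b = torch.randn(4, 5)        # 2D right operand
# Collapse the batch into the row dimension and issue a single mm.
out = a.reshape(10 * 3, 4).mm(b).reshape(10, 3, 5)
assert torch.allclose(out, torch.matmul(a, b))
```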
Summary:
Adds basic CUDA 9 support, including adding Volta arch, and making appropriate modifications for half precision datatype changes
Closes https://github.com/facebookincubator/gloo/pull/49
Differential Revision: D5315336
Pulled By: pietern
fbshipit-source-id: 6468b0f357206d604bdcfec69ba82509a2c91407
Summary:
Adds a separate set of CUDA collectives that run on device as an
alternative to NCCL. Use these collectives as default on-device
collectives instead of NCCL.
Whenever multiple processes on the same machine use Gloo with NCCL and
end up doing concurrent CUDA memory allocations and algorithm
execution, we risk deadlock. A follow up change will enable opt-in
usage of NCCL (e.g. through environment variable).
Benchmark output below with varying number of elements. It shows a
minor improvement over using NCCL for local reduction and broadcast.
Number of elements equal to on-device threshold (256K):
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 262144 2685 2907 3035 3215 562
(after) 262144 2682 2874 3013 3395 577
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring_chunked
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 262144 2045 2133 2325 2643 725
(after) 262144 1533 1673 1834 2048 800
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_halving_doubling
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 262144 1580 1640 1718 2069 893
(after) 262144 1371 1446 1539 1748 1125
```
Larger number of elements (4M):
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 4194304 55543 58058 60103 62659 32
(after) 4194304 54490 57923 60893 66058 33
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_ring_chunked
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 4194304 18049 22820 24997 26634 105
(after) 4194304 18356 20463 21695 22589 99
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: cuda_allreduce_halving_doubling
Options: processes=2, inputs=8, gpudirect=no
elements min (us) p50 (us) p99 (us) max (us) samples
(before) 4194304 18584 24345 27809 29722 95
(after) 4194304 19541 22718 25408 26688 88
```
Reviewed By: akyrola
Differential Revision: D5278192
fbshipit-source-id: 53f09e404663ddc8bb46d06ac87afd8ee3ffc3a2
Summary:
Code in tcp/transport tries to find the network interface a socket was
bound to when creating a TCP device context. Per getifaddrs(3), it is
possible for the ifa_addr field to be NULL (supposedly when an
interface doesn't have an address). Ignore such entries.
Thanks to slayton58 for reporting this.
Reviewed By: wesolwsk
Differential Revision: D5279376
fbshipit-source-id: 039380b95ba4d6d94942c30581e0b230a060870c
Summary:
Previously, `gloo/math.h` inlined methods which use AVX builtins,
which required propagating the `-mavx` flag.
This diff moves these definitions out of the header and into a source
file to prevent this.
Reviewed By: pixelb
Differential Revision: D5271043
fbshipit-source-id: dde4dc560dfb557b46d1a582a8b38e7cb8eb0c37
Summary:
This change prepares for having a separate set of collectives that
use native CUDA calls instead of NCCL. This is needed to work around
the issue where NCCL deadlocks when it is interleaved with CUDA memory
management operations in other processes on the same machine.
Includes a modification to the host reduction functions to bring them
up to parity with the NCCL reduction functions (they now incorporate
offset/counter arguments).
Reviewed By: wesolwsk
Differential Revision: D5276291
fbshipit-source-id: 8844731760d2c48577d207c026ce0cd641f2fc6d
Fixing error on line 661:
warnings.warn("masked_copy_ is deprecated and renamed to masked_scatter_, and will be removed in v0.3")
NameError: name 'warnings' is not defined
Summary:
\cc pietern
Minimal changes to allow gloo to compile and run with NCCL 2.0
Closes https://github.com/facebookincubator/gloo/pull/46
Differential Revision: D5268074
Pulled By: pietern
fbshipit-source-id: 58d625d57b31cfc932f3dbbdd7a4b83d9a2e60a8
* Add torch.matmul function.
Includes test_torch, test_autograd and docs changes.
* Add __all__ to functional so imports are not accidentally imported.
* Include unbind in __all__.
* Add matmul case for when one argument is 1-dimensional and the other
at least 3-dimensional.
* Add squeeze_ to Variable.
* Use squeeze_ instead of squeeze for matmul.
Primary things I had to fix:
- Suppress _XOPEN_SOURCE warnings by ensuring that Python.h is included
first, because it unconditionally defines this macro.
- Turn off strict aliasing, because Python 2 doesn't work with strict
aliasing.
- Work around a setuptools bug, where it's incorrectly passing
-Wstrict-prototypes to C++ compilers (where this doesn't make
any sense)
To compile csrc with -Werror, run `CFLAGS="-Werror" python setup.py build_ext`
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Fixes #1783.
There is an undocumented invariant in PyTorch that we should
try to avoid having storage == NULL as much as possible (even
though Torch supports it.) This commit properly documents the
invariant, and fixes a bug in sparse where the invariant was
not respected. This means that sparse tensors now correctly
remember what GPU they are associated with.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Fixes #1782.
The default operation should be cheap: the user can always choose to
explicitly make a copy on the way in. Note that this is a
BACKWARDS COMPATIBILITY BREAKING change. However, we DO create
a new tensor wrapper (so we are not affected by subsequent
size changes, etc.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Replace call to function that is only supported in CUDA 8.0 with one that has been supported in previous releases.
Reviewed By: pietern
Differential Revision: D5231755
fbshipit-source-id: d72aec2a4a1c511064a65142887f8a05b51dad55
1) Line up trailing dimensions in broadcast docs.
2) remove unnecessary expand_as in common_nn test.
3) use view in tensor_str instead of resize_.
4) newExpand remove raiseErrors change.
5) clarify expandedSizes/expandedStrides parameters in inferExpandGeometry.
6) simplify inferSize2/inferSizeN implementations.
7) use new-style classes for warning.
Setting torch.utils.backcompat.broadcast.warning.enabled=True
will cause Python warnings in the case where broadcast occurs
but previously 1-d view style pointwise ops occurred.
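A hedged usage sketch of the flag:

```python
import torch.utils.backcompat

torch.utils.backcompat.broadcast.warning.enabled = True
# Pointwise ops that broadcast where the old 1-d view-style semantics
# previously applied will now emit a Python warning.
```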
1) Rename calculateExpandGeometry to inferExpandGeometry for consistency
2) Simplify inferExpandGeometry implementation by using a single pass
through dimensions
3) Implement a two operand expansion, expand2.
4) Implement versions that return error code to use for fallback to
equal nElem support.
* Add SELU activation function
* Remove unnecessary case
* Add Function for SELU + tests and fix RReLU inplace
* Fix extra line in doc
* Fix tests
Remove in-place tests for RReLU. For some reason they fail on legacy nn, but pass on nn
* SELU in new-style Function
It also supports double backprop, verified with gradgradcheck
* Fix flake8
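For reference, a sketch of the SELU formulation itself (constants from the self-normalizing networks paper; not the exact kernel code):

```python
import torch

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # scale * (x if x > 0 else alpha * (exp(x) - 1))
    return scale * torch.where(x > 0, x, alpha * (torch.exp(x) - 1))
```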
Otherwise, on many machines, the size of the OpenMP thread pool will
change between MKL and our OpenMP enabled functions. The constant thread
creation and destruction results in worse performance and leaks memory
on GCC 5.4
Summary:
While debugging #43 I found common/common.h missing some headers as well.
Fixes #43.
Closes https://github.com/facebookincubator/gloo/pull/44
Differential Revision: D5194970
Pulled By: pietern
fbshipit-source-id: 4861cd04c56931d4759f5bc050816788252003ee
When I use named_parameters to modify the lr and weight decay, I hit a bug, because the value that named_parameters returns is a torch.nn.parameter.Parameter, not a generator of Parameters.
Summary: Machines may not create their Gloo pairs at the same time, due to earlier variable time work. Increase the timeout used to establish the initial tcp connection to accommodate without sacrificing the shorter default timeout for outstanding reads/writes. No related change required for ibverbs as there is no communication on init.
Reviewed By: akyrola
Differential Revision: D5184518
fbshipit-source-id: 0e6c9704a2d2f1406b3927f75887f0a42199450b
The correct device must be set when getting the base allocation and when
calling cudaIpcCloseMemHandle. Store the device in the allocator's
context, which was previously always NULL.
Fixes #1707
* Modify torchvision documentation following https://github.com/pytorch/vision/pull/179
* Add new datasets to docs
* Fix wording in torch.datasets
* Small clarification
* Fix gc_refs assertion failure
Ensure that each THPVariable -> THPFunction reference contributes one
ref count to the THPFunction by creating a new shared_ptr for each ref.
Because multiple shared_ptrs can again manage a single THPFunction, it's
not safe to use std::weak_ptr where it may point to a PyFunction. It's
still safe to use weak_ptr for grad_accumulator since these are never
PyFunctions.
Fixes #1626
* Remove stale comment
Before the change, processes were not waiting for the master even when they got
'connection refused' (the master is not listening yet, so we should wait).
This was because we were closing the socket twice: first by
the resource guard, and second manually in the exception handler.
That caused errno to be set to a different value (9 - bad file descriptor),
and as a result the `if` that checked whether the connection was refused was failing.
* Add sanity checks
* Refactor InitMethodFile and TCPInitMethod to more logical functions
* Update few error messages
* Add passing parameters by **kwargs, so the order of parameters is no longer relevant
* Review comments
* A pile of misc doc fixes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Handle @apaszke review comments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Initial csrc documentation.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Extended the time-out option from just working on TCP to also working with ibverbs
Reviewed By: pietern
Differential Revision: D5090258
fbshipit-source-id: fee685850d761d0c2130852f513c64ceb19f4e9e
Summary:
For some long running benchmarks, the iteration count could be 0
which would lead to a segfault when printing results
Reviewed By: pietern
Differential Revision: D5149034
fbshipit-source-id: 7b56e8961c302d1ff11ffcd74ca8e909ea046231
Summary:
Only adding `include_directories` doesn't propagate to the including
targets. Also use `target_include_directories` to do so.
Closes https://github.com/facebookincubator/gloo/pull/39
Differential Revision: D5131001
Pulled By: pietern
fbshipit-source-id: 6c58c4b76ae7fa008e4fb26d1bca7900165884d0
Summary:
The CMake variable CMAKE_BINARY_DIR points to the top level build
directory. For standalone Gloo builds this path lets files include the
generated file "gloo/config.h". When Gloo is included as a project, this
variable points to a different path and "gloo/config.h" cannot be
resolved. Fix is to build a path from CMAKE_CURRENT_BINARY_DIR.
Closes https://github.com/facebookincubator/gloo/pull/38
Differential Revision: D5129385
Pulled By: pietern
fbshipit-source-id: 722cebf4892b34f869fe43320153efbb181555b6
Summary: Using Misha's vectorized AVX code to greatly improve performance of reductions on float16 values. Float16 reductions are now 2x faster than float.
Reviewed By: pietern
Differential Revision: D5123331
fbshipit-source-id: 03d4e76886d538b7e24eedaf32a92231a80b1e43
Summary:
The broadcast algorithms use the buffers they were given directly.
There is no inbox/outbox pattern. This means that we can race if the
algorithm is run repeatedly within a short time frame. This hasn't
been an issue so far since we've only used it in combination with
other process wide barriers.
Since this adds a round trip, the latency of these ops from the root
rank's perspective increases. The variance between the before and after
runs is pretty high since there is no back and forth interaction on
the root. It simply waits for recipients to be ready and then sends
its data.
Before:
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: broadcast_one_to_all
Options: processes=4, inputs=1
elements min (us) p50 (us) p99 (us) max (us) samples
100 1 16 29 50 426075
200 2 17 32 50 179953
500 2 11 31 59 140291
1000 2 12 29 59 177619
2000 3 12 29 62 117882
5000 5 16 31 64 127113
10000 9 21 38 88 60328
20000 19 36 65 130 30427
50000 48 68 221 556 11180
100000 92 136 426 871 7314
200000 193 251 829 2965 4092
500000 492 638 2098 4133 1677
1000000 1195 2024 3513 11646 628
2000000 3446 4216 5007 17100 282
5000000 12956 13919 14941 37751 71
```
After:
```
Device: tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm: broadcast_one_to_all
Options: processes=4, inputs=1
elements min (us) p50 (us) p99 (us) max (us) samples
100 15 37 52 107 27332
200 14 40 63 199 28620
500 17 37 52 118 18299
1000 9 39 57 120 33375
2000 20 57 78 180 24779
5000 31 61 84 190 18039
10000 39 70 90 225 8908
20000 57 108 130 940 8313
50000 94 163 217 1933 5326
100000 132 231 331 3501 3681
200000 256 426 560 6509 2272
500000 774 1092 1698 10039 985
1000000 1132 2106 3878 18218 484
2000000 3509 4252 6832 20228 226
5000000 11326 15447 27129 52694 77
```
Reviewed By: wesolwsk
Differential Revision: D5123341
fbshipit-source-id: f3bab4f75ef7c38817f74f00b382f18fe43d85d5
Summary: A vector out-of-range error was being triggered in some tests due to trying to get the address of an element past the end of a vector.
Reviewed By: pietern
Differential Revision: D5123044
fbshipit-source-id: 004f72ebaa27c609290959c12a3d99b16289bfa8
* Fix segfault in autograd:
1) Every "output" variable must have a grad_fn or grad_accumulator
2) compute_partial_exec_callbacks uses Python errors
* assertRaisesRegexp was renamed assertRaisesRegex in Python 3.2
* Use HANDLE_TH_ERRORS macro
Summary:
In a previous commit where the slot numbering was expanded, I changed
the memory region send/recv path to use a map for the outgoing memory
regions (since they may complete out of order). Before, this was a
fixed size array, which was mutated by both the user thread and device
thread without holding a lock. The map, however, can't be mutated
without a lock. This change adds that lock and a few assertions to
check for this type of problem.
Reviewed By: andrewwdye
Differential Revision: D5108194
fbshipit-source-id: 1908c988112469ecdec6cb6eb9849068d896c409
Summary:
This file can then be used by downstream code to figure out what Gloo
features it can support (e.g. ibverbs transport or not).
Closes https://github.com/facebookincubator/gloo/pull/36
Differential Revision: D5110769
Pulled By: pietern
fbshipit-source-id: 2c0c07537258048737ae764a4978f2f7fdbd992d
Summary:
This is another example where our unsolicited writes may interfere
across calls to the collective function. In this case, it was possible
for a second call to overwrite a pair's address before it had been
used to connect the pair in the previous iteration.
Thinking out loud, we could prevent this from happening by supporting
this pattern natively in the Buffer classes. For example, we can add a
notification mechanism (opt in) to the Buffer class such that the
receiver may call `ackRecv()` to acknowledge receipt and handling of
the data in the buffer. Then the sender will block on new sends until
acknowledgement from the previous send has been received. Until then,
we have to keep an extra eye out.
Reviewed By: wesolwsk, romain-intel
Differential Revision: D5095430
fbshipit-source-id: 4c100433108fccea7457bba4dc00f651f722e6c9
* Check cuDNN version at runtime
This checks that the version from cudnn.h matches the version from
libcudnn.so.
Fixes #1476
* Only check major and minor version numbers
Summary:
The pair was still hardcoding limits on the slot numbers. In this
change those limits are lifted.
This also adds back assertions on work completion status in
handleCompletion.
Reviewed By: wesolwsk
Differential Revision: D5090457
fbshipit-source-id: 7bf884e1f31e48e8f1cdfb179a225999e28171b2
Summary: Add support for collectives over vectors of half-precision floating point values.
Reviewed By: pietern
Differential Revision: D5062938
fbshipit-source-id: 0b39fa53370393fec1edf2d852ff7f1d862b9022
Summary:
The halving/doubling algorithm had two instances where a receive
buffer was registered with a number of elements instead of a number of
bytes. This change adds the assertion that should have caught this in
the first place.
Reviewed By: wesolwsk
Differential Revision: D5089483
fbshipit-source-id: fd0f0724ef04300236c9297ee88b27e61fb1e5a0
Summary:
The original implementation created temporary buffers on the backing
context. This also meant an ordering problem when using the ibverbs
transport, as a call to send will block until the remote side has
created its receive side buffer. Since all buffers are now created
prior to using them, this is no longer an issue.
Reviewed By: romain-intel
Differential Revision: D5082352
fbshipit-source-id: 4c260f06e8f461c0336e7eec7ca891e07ff41cd3
Summary: Fixing a bug in the multiple algorithm test where threads were spawned repeatedly, causing collisions during rendezvous.
Reviewed By: pietern
Differential Revision: D5082945
fbshipit-source-id: 4adbbc963b1ff652f73a44cd9fd75dcd3325f182
Summary:
TSIA
This matches the approach in the TCP transport where all send/recv
logic is contained in the pair code.
Reviewed By: wesolwsk
Differential Revision: D5082503
fbshipit-source-id: b70886ed9aaeb381cdb45fba00704118cff62a23
Summary:
This is necessary to avoid the next iteration of the algorithm
overwriting data in recvBuf_ before it has been consumed by the
receiver of that data. If this does happen, the result of the previous
iteration for the receiving end is corrupted. This can only happen in
async mode on the TCP transport (so all incoming data is unsolicited)
when spinning on the run function.
Reviewed By: wesolwsk
Differential Revision: D5074789
fbshipit-source-id: 66668fbd885888f26266d812e78d61c6d65c2461
* Fix clang warnings
* Raise errors when unsupported ConvNd configurations are used
* Properly handle Variable indexing with LongTensors
* Support both tensors and variables in Variable.type_as
* fix issue #1549, expose bitwise and
* expose C bitwise or of Tensor
* expose C bitwise xor of Tensor
* use built-in method for inplace and, or, xor
* expose C bitwise lshift(ilshift) and rshift(irshift) of Tensor
A module that returns a non-standard data structure currently breaks
due to checks for backwards hooks. This refactors the code slightly so
it will only break in the event of backwards hooks.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).
The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
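Concretely (a hedged illustration of the resulting shapes):

```python
import torch

probs = torch.rand(4, 3)
print(probs.sum(1).shape)                    # torch.Size([4])   -- keepdim=False
print(probs.sum(1, keepdim=True).shape)      # torch.Size([4, 1])
probs.sum(1, keepdim=True).expand_as(probs)  # works as before
```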
Summary:
Added a context factory that allows you to use an existing context to
create other fully connected contexts much more cheaply (without having
to rely on a store).
Limitations:
- The backing context needs to be fully connected
Reviewed By: andrewwdye, pietern
Differential Revision: D4985121
fbshipit-source-id: 31ceabccbb679cedb18ec9927b6c166bef5989bb
Summary: Set deviceId_ to -1 when CudaDevicePointer and CudaStream do not have valid data
Reviewed By: andrewwdye
Differential Revision: D4881374
fbshipit-source-id: e973a70e2e6e4519f5fdc2ad4e76f232d9593751
* Make sparseMask error if mask is uncoalesced.
Fixes #1447.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add test for sparse adagrad.
Previously, the sparse codepath was not exercised at all; this commit
adds a very simple test case "sparse Rosenbrock"; the idea is to do
Rosenbrock but then knock out one of the dimensions so that the
tensor is sparse.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
We weren't handling an edge case where write(2) would return EINTR
when in sync mode. The Pair::write function would return false
indicating it didn't complete the write whereas the send function
expects it to complete when in sync mode. With this change we now
advance the cursor and retry the write when fewer than expected bytes
were written.
Also see https://github.com/facebookincubator/gloo/issues/34
Reviewed By: andrewwdye
Differential Revision: D4996949
fbshipit-source-id: 3bad4fa3d0a01517f20b64904aa71410641fa60f
Fixes #1449.
For future reference, we should have a doc explaining our ref-counting
conventions; it looks like this bug slipped by because we assumed that
newTensor was taking ownership of the pointers it was passed in.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
As discussed in #1441.
I also added some docs giving clear guidance about how coalescing
works in sparse tensors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Simplify _gen_sparse
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Randomly generate an uncoalesced tensor and test with it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Simpler implementation of cpu_only suggested by @apaszke
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Better implementation of randn, suggested by @soumith
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Lint fix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix CUDA type error.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: Previous slot offset was not added to the calculated value for the slot to be used in halving-doubling algorithms. If multiple instances were running, slot values could collide.
Reviewed By: pietern
Differential Revision: D4986618
fbshipit-source-id: 56b9220c91f31cc016d37e82907221460de70657
1) Fix "kth" attr specification -- I can't get sphinx to generate `k`th,
but `k` th works with a space, unlike now where the highlighting continues
until the next attr.
2) Specify the size of the return tensors.
3) Add an example of the return tensor sizes with more than 1 dimension.
Summary:
This helps guard against programming errors where waitSend is called
before send is called. It uses a std::atomic to keep overhead low.
Reviewed By: andrewwdye
Differential Revision: D4984604
fbshipit-source-id: 04a63b1ba088e3bcba0abff40771af666deb15e5
Summary:
This returns EFAULT when passing a GPU memory pointer (for GPUDirect)
and the ibverbs driver can't map the GPU's memory. Since the error is
pretty cryptic, crash with a more useful message.
```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at gloo/transport/ibverbs/buffer.cc:46] mr_ !=
nullptr. ibv_reg_mr: Bad address (kernel module 'nv_peer_mem' not
loaded; did you specify a GPU pointer?)
```
Reviewed By: andrewwdye
Differential Revision: D4982966
fbshipit-source-id: 72c220fe22a3bc59396cfff992ad5f0f9c5bf83a
* Refactor test_sparse to reduce boilerplate.
Instead of manually creating a helper function, threading an is_cuda
parameter around, and creating a test method for CUDA and non-CUDA
variants, we take a different approach:
- There are now some new member variables initialized in setUp which
control the aspects of how we carry out the test; at the moment,
it's just whether or not we are using CUDA. This means
you don't have to pass is_cuda around, or do a conditional to
get the triplet of constructors you need.
I'll note that I am not a big fan of member variables in test
objects, but these are (intended to be) immutable so I think
it should be OK.
- Instead of manually defining test_foo and test_foo_cuda, we now
have a new TestCudaSparse class which overrides setUp (from above)
to swap in the CUDA implementation. Way less boilerplate, and NO
metaprogramming needed.
If you need to opt out of CUDA testing, there is a new cpu_only
decorator you can use.
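A hedged sketch of the shape of this (the member names and import path are illustrative, not necessarily the actual ones):

```python
import torch
from common import TestCase, run_tests  # test/common.py helpers, assumed

class TestSparse(TestCase):
    def setUp(self):
        # Immutable knobs that parametrize how tests build their tensors.
        self.is_cuda = False
        self.IndexTensor = torch.LongTensor
        self.ValueTensor = torch.DoubleTensor

class TestCudaSparse(TestSparse):
    def setUp(self):
        super(TestCudaSparse, self).setUp()
        self.is_cuda = True
        self.IndexTensor = torch.cuda.LongTensor
        self.ValueTensor = torch.cuda.DoubleTensor

if __name__ == '__main__':
    run_tests()
```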
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary: A generalized version of halving-doubling that supports non-power-of-two number of processes by breaking up execution into blocks that are powers of two and communicating interblock after the intrablock reduce-scatter. Non-power-of-two cases will have some degree of load imbalance compared to power-of-two, but cases with few large blocks (e.g. 8 + 4 or 16 + 8) should still perform relatively well.
Reviewed By: pietern
Differential Revision: D4955947
fbshipit-source-id: af4f218fedb6adf475530c38386978b81f4f2b74
Because of this, Variables can no longer appear in the graph.
Every usage of a leaf Variable will leave an AccumulateGrad
function that has no outputs, but modifies var.grad as a side
effect.
Summary:
After running the test suite many times we end up with a zillion
connections in TIME_WAIT state. Setting SO_REUSEADDR seems like it
should help binding to ports regardless of the TIME_WAIT state.
Reviewed By: andrewwdye
Differential Revision: D4979606
fbshipit-source-id: b611f9c9e11aba858dc192f6bca3d64e10100b52
Summary:
It can happen that a pair is destructed while in CONNECTING
state when some unrelated code throws an exception after the connect
function has been called. The most likely place for this to happen is
when connecting pair A is in progress while connecting pair B throws
an exception. The exception will force destruction of all references
to pair A, even if it is in the CONNECTING state.
Also see https://github.com/facebookincubator/gloo/issues/33
Reviewed By: andrewwdye
Differential Revision: D4979557
fbshipit-source-id: 0cddddd3f478106f1694603fe7f2efe15a2d9aa1
Previously, when using the same data channel in a multi-threaded
environment, there was no guarantee against deadlocks
or even errors.
Summary: No need to assert on connection errors.
Reviewed By: andrewwdye
Differential Revision: D4957698
fbshipit-source-id: b47f6f0f098dbf7d212701c5cb68e34b2c1c9522
Summary:
This PR makes cmake install the gloo CUDA headers if USE_CUDA is enabled.
Closes https://github.com/facebookincubator/gloo/pull/29
Differential Revision: D4946856
Pulled By: pietern
fbshipit-source-id: a688c3794c4a5e34b664e7bdeb4e1148f6504419
Summary:
It should be up to the program including Gloo to ignore SIGPIPE.
We have seen a case where the EPIPE errno is not properly handled in
an unrelated piece of code. Having SIGPIPE fire means we can get a
core and debug this further.
Reviewed By: andrewwdye
Differential Revision: D4896727
fbshipit-source-id: f6fe2d3f8dc68a9e6c2c457639b45f8aee2d7b20
* move TopK to generic
* partial genericization of kernel code
* introduce TopKTypeConfig, specialize radix type and conversion for floats
* implement topk for byte tensor
* implement for char tensor
* implement for int tensor, extend test to check indices as well
* works for longs too
* make bitfield set/get a struct, add support for 64-bit types
* extend to double tensor
* implement for half tensor
* asserts; test fix
Summary: This file was left over after a recent refactoring but is not used.
Reviewed By: andrewwdye
Differential Revision: D4940265
fbshipit-source-id: 01f8c5fbc73dd0ca0a92306dbfef22ff28133750
Summary:
While it is theoretically possible to make Gloo work on 32-bit systems, it's unlikely anybody would ever use it on 32-bit systems. This removes the expectation that it should work...
Fixes #28
Closes https://github.com/facebookincubator/gloo/pull/31
Differential Revision: D4939073
Pulled By: pietern
fbshipit-source-id: 8c60804f7ae5cf835332871a424aefa2c498e8a4
Fixes #1267
This fixes a number of issues when PyTorch was compiled with CUDA
support but run on a machine without any GPUs. Now, we treat all errors
from cudaGetDeviceCount() as if the machine has no devices.
This saves an extra memory copy, which speeds up data loading a bit
(5-10% with accimage).
As part of this change:
* torch.cat accepts keyword argument out
* specifying out=None is treated like not specifying out
Summary: PrefixStore::wait() uses a default timeout if unspecified. This is incompatible when using PrefixStore to wrap a Store implementation that does not support timeout. Instead the base Store::wait(keys, timeout) implementation is called, throwing an exception. This change modifies the base implementation to ignore the timeout.
Differential Revision: D4916517
fbshipit-source-id: 3cdd83bd209bf938b58442d82f3fc245e68019ad
Summary: Fixes for corner cases with small element counts. Fixed problems include (1) calling range on out of bounds pointers, (2) failing to allocate send or receive buffers in cases where they correspond to out of bounds indices for reduce-scatter, but are needed in the allgather, (3) not allocating enough receive buffer space (more than count_ bytes may be needed in some cases)
Reviewed By: pietern
Differential Revision: D4912656
fbshipit-source-id: 0409d01894ff9c93ef1a1fdf8021c9ecf62f9b57
Summary:
memcpy comes from cstring
See https://github.com/caffe2/caffe2/issues/286
Reviewed By: Yangqing
Differential Revision: D4914228
fbshipit-source-id: de60c2a98feb4228546a8f1fe237a090101f50e4
Summary: Add a default 60s timeout to RedisStore::wait() to avoid blocking indefinitely when peer machines are unavailable.
Reviewed By: pietern
Differential Revision: D4908699
fbshipit-source-id: 39de9066633e8b0c8d1ee198b6bf3f70d3961196
Summary:
It's possible the pair is in the listening state when it is
destructed. The fd will not have been cleaned up in that case, so we
shouldn't assert that being the case.
Reviewed By: andrewwdye
Differential Revision: D4909964
fbshipit-source-id: 7103d74910e3bcf5de9f4658d8f1f682b6c8a70c
Summary: Add AllgatherRing and CudaBroadcastOneToAll to benchmark. Add host info and algorithm sweep to chronos script.
Reviewed By: pietern
Differential Revision: D4901111
fbshipit-source-id: 1421025d39b914b14e857f21c43eac30c9c9dd2f
Summary: Output peer address on network failures. This change will help in root causing network failures.
Differential Revision: D4899129
fbshipit-source-id: 60a762c6551a726081d5335ab478da8dd7f6dad7
* Fix group-convolution w/o biases on CPU.
Not having this guard will cause a crash further down in the `cat`
function when it uses the first element in the passed list to create a
new tensor. (And even after that, cat doesn't handle nulls well.)
* Added test for groupconv w/o bias on CPU.
Summary: Device reduce is more efficient for large buffer sizes. For smaller buffers, host reduce may be more efficient in some cases and frees up the GPU for other work.
Reviewed By: andrewwdye
Differential Revision: D4885855
fbshipit-source-id: 7dc522e8c93e1a94427730aca6af03b7e93e660d
Summary:
Instantiate nccl type templates for gloo (minus half).
half requires at a minimum ifdefing CUDA_HAS_HALF and likely requires
more work given that operators aren't defined on it, so skipping it
for now.
Reviewed By: pietern
Differential Revision: D4876217
fbshipit-source-id: 833d2aec12789cbaf9e0a201b979a420fbe6732f
Summary: Added a pipelined version of cuda halving/doubling algorithm. Half the buffer is reduced prior to first send and the other half prior to reducing the result from first receive. Broadcasts are started asynchronously as soon as each new message is received. New code was added as a new algorithm, as pipelining makes performance worse for small buffer sizes.
Reviewed By: pietern
Differential Revision: D4847109
fbshipit-source-id: 5aa55de95f8c94069380af7396f2b5b6297dcbea
Summary:
The code already asserted, but only on the reply type, so it didn't
include the actual error message. This makes debugging problems much
easier when people have problems running the benchmark suite.
Differential Revision: D4860022
fbshipit-source-id: 659bc461a724603375bff18eac90eca658492b05
Summary: This is cheaper than doing getaddrinfo for every pair.
Reviewed By: andrewwdye
Differential Revision: D4850102
fbshipit-source-id: e77f468f099f63860b52fdd0dcc57a8a7a91a448
Summary:
Part of this change is to perform a getaddrinfo in the TCP device
class so we can figure out the interface and subsequently PCI bus ID
of the NIC used for its traffic. This information can be used in a
later diff to avoid doing getaddrinfo calls in the TCP pairs and have
them reuse the information that is resolved by the device.
The PCI bus ID can be used to compute distance between NICs and GPUs
and make informed decisions on where to allocate scratch buffers.
Reviewed By: andrewwdye
Differential Revision: D4850035
fbshipit-source-id: 575e401a9273300bc720c814fef8971846ec748c
* Add IndexLinear
* Fixes to IndexLinear
- Fix IndexLinear test
- make it better for multithreaded case
- fix a glitch in the C code
- improve the reset() method
- fix the weight allocation.
- remove "fakeBatch" possibility as it's not used
- clamp normalized values at evaluation time instead of just dividing by max.
- add assert on the keys/values dimensions in IndexLinear.
- invert order of weightDecay in the case of output dim > 1.
* Changes required to support IndexLinear in CUDA
* Adding support for flattened inputs for IndexLinear
* Doc for IndexLinear + fix for when the input format changes from one batch to another.
* Cleaning up IndexLinear documentation
* Changes required to build with latest torch
* Adding benchmark script for IndexLinear
* Bugfixes and cleanup of IndexLinear.lua
- Fixed bug that occurs when performing multiple accGradParams +
updateParams
- All the data required for the updates is put in a single table
- Added :parameters method
Summary:
Forgot to include these in a previous commit.
Closes https://github.com/facebookincubator/gloo/pull/23
Differential Revision: D4847072
Pulled By: pietern
fbshipit-source-id: 08aa9e8fa47377eb8c7747bd577eec7e615789f1
Summary:
With this we can compute the best GPU device to reduce on. It is not
always the one CUDA indicates as GPU 0.
Reviewed By: andrewwdye
Differential Revision: D4845581
fbshipit-source-id: 13e0500f54fd507899646f781a97c09abcd3b056
Summary:
This makes it easier to capture, compare, contrast results with
different parameters.
Reviewed By: andrewwdye
Differential Revision: D4843715
fbshipit-source-id: ba6916dcd5f8bcc615d6edce1a54657241357c31
Summary:
Instead of having every CudaDevicePointer "own" a stream, this change
moves to using CudaStream as first class object. It was pretty clunky
to use the copy{To,From}* functions on the CUDA pointer classes to
copy stuff around. For example it was not clear whether the stream
belonging to the source or destination was used to execute the copy
on. There is no longer such ambiguity after this change.
To make this work the CudaBroadcastOneToAll algorithm was changed to
include the workspace template argument, but only has the
CudaHostWorkspace implementation. The CudaDeviceWorkspace
implementation is left to be done for another change (that's not the
purpose of this change).
Reviewed By: andrewwdye
Differential Revision: D4841615
fbshipit-source-id: d0c1b9ba948ff6167832515afa7bdd2b32b48064
Summary: Make timeout a device attribute. Now the pair will configure its timeout when connecting based on device timeout settings, instead of the timeout needing to be set explicitly on each pair. Set default tcp timeout to 30 sec.
Reviewed By: pietern
Differential Revision: D4838918
fbshipit-source-id: e6e6ee36c662eb5e7ba5354c904e50f9dcac258f
Summary: cuda_allreduce_halving_doubling was not properly handling the case where buffers are allocated in GPU memory, trying to reduce and copy from them as if they were in system memory.
Reviewed By: pietern
Differential Revision: D4840259
fbshipit-source-id: 2615360cd2f1d9c7a37fb0bcdf33ff35528b2c75
Summary:
Clarify that Redis Cluster is not supported. Also see #21.
Closes https://github.com/facebookincubator/gloo/pull/22
Differential Revision: D4837375
Pulled By: pietern
fbshipit-source-id: 6e3575b3b8dae6ca62beb765da15d8506da4abdb
Summary: Basic port of the CPU halving/doubling algorithm. No pipelining is done between reduce/broadcast and communication.
Reviewed By: pietern
Differential Revision: D4823693
fbshipit-source-id: b18045d64edf90361bf7713f4ccb2e074757780f
Summary:
Required for D4821763
Based on targets from https://fb.facebook.com/groups/fbcode/permalink/1304073246296178/ (I also excluded those targets which do not depend on folly:singleton).
Reviewed By: meyering
Differential Revision: D4832492
fbshipit-source-id: fcb4ce42e9e5359d4752769f77d7271e550201fe
Summary: Refactor AllgatherRing algorithm to remove all memcpy in the communication rounds by using outPtrs as send/receive buffer + remote buffer offset.
Reviewed By: pietern
Differential Revision: D4793186
fbshipit-source-id: 645d0758d246fd0b493e3fe312a8441d86f6d169
Summary:
Combines the top level common.h with algorithm.h. With algorithm.h in
the common package, CUDA algorithms only need a dependency on that
package. CudaBroadcastOneToAll still depended on broadcast.h so this
change also removes that dependency and has it subclass the Algorithm
class.
Reviewed By: andrewwdye
Differential Revision: D4826885
fbshipit-source-id: 930037e39f7a2c941868e53f0bbc54e3f2e0b184
Summary:
GPUDirect support for CudaAllreduceRingChunked by adding a workspace
template parameter and adding workspace specific init functions.
To support this change the CUDA LocalOp classes had to be changed a
bit to take an extra destination/source pointer. This allows reduction
of 1-N pointers into a target pointer, where the target may live on
device or live on host. If it lives on the host, the NCCL operation
that executes the reduction is followed by a D-to-H memory copy. If
there is only a single input pointer, no reduction needs to happen and
the class just executes the D-to-H memory copy. The net result is that
we can interchangeably use device or host pointers as the target for
reduction or the source for broadcast, and these LocalOp classes do what
you would expect them to do.
Reviewed By: andrewwdye
Differential Revision: D4825236
fbshipit-source-id: 048ec6cbc5a0500bafbe1b3f6abe1e2e5f3a2675
Summary: Fixes for handling errors and timeouts in blocking and polling sync paths. Add test coverage for errors and timeouts.
Reviewed By: pietern
Differential Revision: D4823498
fbshipit-source-id: 93721947a6404ca9cea6a4869f4156f8d270a981
Summary:
Any number of elements below this always fits in a single packet
and will yield ~identical results.
Differential Revision: D4825190
fbshipit-source-id: 71ac77456049e991da5059d5a029c5e9d2a67ed7
Summary:
The existing CudaAllreduceRing with a CudaDeviceWorkspace
template parameter now has the same effect.
Reviewed By: andrewwdye
Differential Revision: D4823393
fbshipit-source-id: 88fe497a983b26a281a3a74fe3bdc02c0c87c523
Summary:
Implement a file store for multi-process transport failure testing. Add test cases to spawn multi-process tcp communication, and verify that all processes throw the expected IoException.
A future diff will add coverage for connectivity failures, sync modes, and ibverbs.
Reviewed By: pietern
Differential Revision: D4807794
fbshipit-source-id: 35212719d46e6d875eacb341fae25681f39053bc
Summary:
Allreduce using the recursive halving and doubling algorithm, described in http://www.mcs.anl.gov/~thakur/papers/ijhpca-coll.pdf (see top diagram on page 12). The algorithm consists of 2 log P stages: the first log P steps perform a reduce-scatter and the second log P steps an allgather. Message size varies across steps; the early stages of the reduce-scatter and the late stages of the allgather send the largest messages. The communication is structured such that the largest messages are sent between nearby ranks, which can be useful if elements are ranked in a locality-aware fashion.
So far this supports only a power-of-two number of processing elements.
I have attempted to minimize the amount of synchronization/hand-shaking. Messages are received at different offsets of the output buffer for each communication step. Send offsets in the reduce-scatter steps become receive offsets in the allgather and vice versa. The reuse of buffers across reduce-scatter and allgather steps requires synchronization. Right now the algorithm is inefficient in terms of memory use, requiring 3x memory; this can be reduced, but would require additional synchronization.
Reviewed By: pietern
Differential Revision: D4795878
fbshipit-source-id: fcc6597ef6a99cd102fce2b8e4562d93088d39dc
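For intuition, here is a minimal, in-process Python sketch of the halving/doubling schedule for a power-of-two number of ranks. NumPy arrays stand in for the per-rank buffers; it only illustrates the partners and offsets described above, and is not the Gloo implementation.
```python
# Illustrative sketch only: simulates recursive halving/doubling allreduce
# across p ranks inside one process. Real ranks exchange these slices over pairs.
import numpy as np

def allreduce_halving_doubling(bufs):
    p, n = len(bufs), bufs[0].size
    assert p & (p - 1) == 0 and n % p == 0, "power-of-two ranks assumed"
    lo, hi = [0] * p, [n] * p          # slice each rank is still reducing

    # Reduce-scatter: log2(p) steps; message size halves, distance doubles.
    dist = 1
    while dist < p:
        snap = [b.copy() for b in bufs]
        for r in range(p):
            mid = (lo[r] + hi[r]) // 2
            if r & dist:               # keep (and reduce) the upper half
                lo[r] = mid
            else:                      # keep (and reduce) the lower half
                hi[r] = mid
            seg = slice(lo[r], hi[r])
            bufs[r][seg] += snap[r ^ dist][seg]   # partner "sends" this half
        dist *= 2

    # Allgather: the same exchanges in reverse; receive offsets mirror the
    # send offsets of the reduce-scatter.
    dist = p // 2
    while dist >= 1:
        snap = [b.copy() for b in bufs]
        slo, shi = lo[:], hi[:]
        for r in range(p):
            q = r ^ dist
            bufs[r][slo[q]:shi[q]] = snap[q][slo[q]:shi[q]]
            lo[r], hi[r] = min(lo[r], slo[q]), max(hi[r], shi[q])
        dist //= 2
    return bufs

bufs = [np.ones(8) * (r + 1) for r in range(4)]   # ranks hold 1s, 2s, 3s, 4s
allreduce_halving_doubling(bufs)
assert all((b == 10).all() for b in bufs)          # 1+2+3+4 everywhere
```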
Summary:
Didn't provide enough value now that ReductionFunction and
CudaReductionFunction are no longer related.
Reviewed By: andrewwdye
Differential Revision: D4819295
fbshipit-source-id: e6479769af7f78d486bee7d9c31f049430cdc775
Summary:
To bring the GPUDirect and non-GPUDirect implementations of CUDA aware
algorithms closer together this change introduces CUDA workspaces.
There's an implementation for a host side workspace and a device side
workspace. The former is used for transports that don't support
GPUDirect and the latter for ones that do. CUDA algorithms will take
an extra template parameter for this workspace and this will determine
whether they can be used for GPUDirect or not.
The workspaces only define their respective pointer types right now
but may contain local operation construction functions at a later
point in time.
Reviewed By: andrewwdye
Differential Revision: D4802826
fbshipit-source-id: cb1d71a224ce0165afd07fb9092ad54d3e07c8cf
Summary:
The CUDA algorithms all had their own version of local reduction and
broadcast. This commit consolidates them and allows all CUDA
algorithms to work with CudaDevicePointer instances.
Reviewed By: andrewwdye
Differential Revision: D4797968
fbshipit-source-id: cccef39fce01905a2cd757ccbcffd29803411409
Summary: Verification was sometimes failing for allreduce halving-doubling. Pieter noticed that it is due to verification step racing with the regular iterations.
Reviewed By: pietern
Differential Revision: D4804558
fbshipit-source-id: f645cb2e332e449a993a634c5bdb42c2dcb8613b
Summary:
This is a copy of CudaAllreduceRing that doesn't stage the locally
reduced buffer in host memory but uses the GPU side buffers directly.
Eventually I would like this to be absorbed back into
CudaAllreduceRing, but for now it's a good place to compare the two
implementations and abstract the parts that make sense, until they are
identical again.
Reviewed By: andrewwdye
Differential Revision: D4791629
fbshipit-source-id: 5ad065cb94adb968aeee2379327be313638f2161
Summary: Add a setTimeout() API to the Pair interface. Implement in the tcp transport for connect, read, and write, and across blocking, polling, and async configurations. Ibverbs implementation to come later.
Reviewed By: pietern
Differential Revision: D4787932
fbshipit-source-id: 6072dc0c0add1700f84a72b83e4388b29b044ec1
Summary:
The header already contained an analysis of required completion queue
depth but the queue pair was still initialized with a maximum queue
depth of kMaxBuffers. This change fixes that and updates the analysis
to talk separately about receive and send completion queues.
Reviewed By: andrewwdye
Differential Revision: D4785786
fbshipit-source-id: 4dc302d523a3b7162dc261d14cfcc755681febf8
Summary:
Predefining the reduction functions makes it easy to provide a set of
fast implementations. Eigen is used to implement them if it is found.
Reviewed By: andrewwdye
Differential Revision: D4780868
fbshipit-source-id: e825cf2e5cfe8ec27d587c5aff4002534b1c670d
Summary: This makes it possible to write to any offset in a remote buffer.
Reviewed By: andrewwdye
Differential Revision: D4779776
fbshipit-source-id: f5a44cc705df5141bd720ff4e3fec8697f707a70
Summary:
All operations supported by NCCL are now available through the Gloo
wrappers. Algorithm wrappers for them are forthcoming so that they
can be used interchangeably with other implementations.
Since not all of them require same-sized source and destination
pointers, I moved assertions on number of elements to the op
constructors.
Reviewed By: andrewwdye
Differential Revision: D4771292
fbshipit-source-id: 2f34629507b5e1cb9ae8d6d2f02de0a7f641a341
Summary: Allgather ring CPU implementation. It does |buffers| x |contextSize| passes.
Reviewed By: pietern
Differential Revision: D4723809
fbshipit-source-id: ffd8366ac7e1746555474e173143d33cee497822
Currently, in-place and out-of-place updateGradOutput produce different results when input=max_val or input=min_val: in-place won't backprop the gradient where input=max_val or input=min_val, while out-of-place will backprop the gradient in this case.
Summary:
This makes it possible to embed Gloo in a project without CMake
installing Gloo headers and/or libraries, or having a runtime
dependency (and statically link to it).
Also:
* Install benchmark tools
* Statically link to NCCL if the bundled version is used
Closes https://github.com/facebookincubator/gloo/pull/19
Differential Revision: D4762432
Pulled By: pietern
fbshipit-source-id: cf38903e6c51f2480fba4ff18cbdc0c9080df0c4
Summary:
This may be the case when the Gloo CMake files are sourced from a
parent project that has already imported CMake CUDA support. If these
checks are not performed then CUDA_NVCC_FLAGS might contain
conflicting options.
Verified this works while working on Gloo for Caffe2.
Closes https://github.com/facebookincubator/gloo/pull/18
Differential Revision: D4756179
Pulled By: pietern
fbshipit-source-id: 32fc39ec2322cce5899a2398ebbf8395d3917502
Summary:
Some small MPI-related changes:
1) Instead of making an object copy of the MPI_Comm, call MPI_Comm_dup;
because the (passed-in) communicator is used later via the call to
connectFullMesh this guarantees that the communicator will not have been
freed by user before connectFullMesh is called.
2) Allreduce for maxLength is done on an unsigned long type; use the
corresponding MPI type.
Closes https://github.com/facebookincubator/gloo/pull/17
Differential Revision: D4754195
Pulled By: pietern
fbshipit-source-id: 863fd33c726f88120f8f5ee61964c3525babbf97
Summary:
This change solidifies IO error handling between threads and successive transport API calls. When an IO exception occurs, signal all buffers of the error, propagating the exception from the device thread or single user thread onto all user threads. Store the exception in the pair and check on future API calls or device events. Swallow all IO exceptions in the device loop.
Right now IO exceptions during portions of the listen/connect phase will result in an indefinite wait in the peer. I will address this with a configurable timeout (t16205269).
Reviewed By: pietern
Differential Revision: D4749248
fbshipit-source-id: c75ee3b20875d561bf84631e5384e28015dabad3
Summary:
Bubble up gloo configuration and network errors as exceptions. The caller may be able to recover. Other unexpected failures continue to be handled as fatal with GLOO_ENFORCE.
Modify ibverbs API validation to check for != 0 instead of -1 to conform with the API definition.
Still need to convert some errors in the rendezvous code and add documentation.
Will pass device loop errors onto the calling thread in a future diff.
Reviewed By: pietern
Differential Revision: D4730362
fbshipit-source-id: c801adb353013e7f541ab01ac16a0cc71c1c36b2
- Add additional timeouts to test_multiprocessing to reduce chances of
hanging indefinitely on failure
- Add missing header guards
- Fix typo
- Check that torch_shm_manager exists in torch/__init__.py
This ensures that we use the same library at the C++ level and with
Python ctypes. It moves the searching for the correct library from
run-time to compile-time.
- make each test in test_autograd have a unique name ignoring case
- assemble all tests when test_legacy_nn is imported
- import Python.h in PtrWrapper.h
Summary: Initializing ncclComm_t is expensive. Allocate a set of ncclComm_t for each unique device set and cache for reuse. With this change the CudaAllreduceChunked tests runtime improved from ~170 sec -> ~10 sec on my machine. There is no improvement in the benchmark numbers because the algorithm instance is only allocated once.
Reviewed By: pietern
Differential Revision: D4708943
fbshipit-source-id: 85b85070586d6683a762b8282df593ca831e7bc7
Summary:
This change includes CMake changes to compile the MPI assets when the USE_MPI flag is enabled. If so, the benchmark tool can now be launched through mpirun.
Includes the changes done in #11.
Closes https://github.com/facebookincubator/gloo/pull/12
Reviewed By: Yangqing
Differential Revision: D4712060
Pulled By: pietern
fbshipit-source-id: 0d0e93882f5822583f59304d4256dbdf5dea7483
Summary: NCCLOp::runNCCL is mistakenly recording an event in the source pointer after the NCCL op. This results in NCCLOp::wait() returning without synchronizing with the output buffer. The synchronous tests using NCCL fail.
Reviewed By: pietern
Differential Revision: D4708860
fbshipit-source-id: 0c36511e260b587d410e5c9604552ceedd06d988
Our extension library links against cudart and pulls in the symbols. Use
LoadLibrary(None) to use the same symbols as the _C extension.
This fixes the PyTorch wheel when you don't have system CUDA installed.
Summary:
This is the minimum required CMake version (also the version that is available on Ubuntu Trusty (14.04)).
Closes https://github.com/facebookincubator/gloo/pull/9
Reviewed By: Yangqing
Differential Revision: D4698659
Pulled By: pietern
fbshipit-source-id: bf01541fe485c03e7c665f175c2887feaf9516a3
Summary:
Allocate a set of per-device streams used to serialize NCCL op scheduling. These ensure concurrent NCCL ops are not interleaved across devices (i.e., through priority scheduling), resulting in deadlock.
Synchronize source and destination streams with NCCL streams.
Reviewed By: pietern
Differential Revision: D4685360
fbshipit-source-id: 3c228b195b0a0d9d7cccc720163898d344a5ed4c
Samples elements from `[0,..,len(weights)-1]` with given probabilities (weights). So far there is no means either to introduce sample weights in loss functions or to weight samples while sampling from a dataset. This is an attempt to add the functionality for the latter issue.
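A small usage sketch of the resulting sampler (names as in torch.utils.data.sampler; the exact signature here is illustrative):
```python
import torch
from torch.utils.data.sampler import WeightedRandomSampler

# Three dataset elements; upweight the last one so it is drawn more often.
weights = [0.2, 0.2, 0.6]
sampler = WeightedRandomSampler(weights, num_samples=100, replacement=True)
indices = list(sampler)   # indices drawn from [0, len(weights) - 1]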
Summary:
This makes it easy to use Gloo transports and algorithms in existing
MPI environments.
Reviewed By: andrewwdye
Differential Revision: D4685999
fbshipit-source-id: cfc7d0e445893512b4e4ed2abe1bb280d83b9c70
Summary:
How pairs are setup and connected to one another is specific to
whatever underlying rendezvous mechanism is used. This change moves
the `connectFullMesh` function into a subclass in the `rendezvous`
directory. This prepares for a separate MPI context that can setup
pairs between processes using an existing MPI communicator.
Reviewed By: andrewwdye
Differential Revision: D4684755
fbshipit-source-id: 9eb643b8ba545b3e6f9a36b65642b3b04a5f0077
Summary: CudaDevicePointer has the information we need for a NCCL op. Refactor NCCLElement as a composition of src and dst CudaDevicePointers. This allows for separate streams for src and dst, and will simplify a future change to use a static set of streams for all NCCL ops.
Reviewed By: pietern
Differential Revision: D4679483
fbshipit-source-id: 75656cc2fa5b5e2a6c096d914d2111769a47291b
* add momentum and centered options
Add two options:
- Momentum (like SGD's momentum)
- Centered RMSprop, as in Graves 2013 ( https://arxiv.org/abs/1308.0850 ): the gradient is normalized by a running estimate of its variance
* some PEP8
* bug in default
* bug2
* sign mistake
* alloc of momentum & centered only if needed
* add link to docstring
* some pep8 on docstring
* implement __setstate__() for backward compatibility
* correct grammar mistake
* multiply by lr when adding delta to params
* rename momentum variables
* change __init__ params order
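Both options surface as constructor arguments; a minimal usage sketch (modern-style call, parameter names as in torch.optim.RMSprop):
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
# momentum adds an SGD-style momentum term; centered normalizes the
# gradient by a running estimate of its variance (Graves 2013).
optimizer = optim.RMSprop(model.parameters(), lr=1e-2,
                          momentum=0.9, centered=True)
optimizer.zero_grad()
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
```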
This is an important clarification to make, as otherwise users are misled as to where they may need to add dropout, and to clarify the situation they would need to delve into the backend implementation.
4647f753bc/torch/nn/_functions/rnn.py (L73)
Summary:
Add a nextSlot() function to the context that increments and
returns a slot number. This enables multiple algorithms sharing the
pairs part of a context. The slot numbers were hardcoded before this
change, which prevented reuse.
After this change, some of the tests can be changed to run multiple
times (or do a parameter sweep) without respawning a new threadpool or
allocating new fixtures.
Also change some internally used variable names for more consistency.
Reviewed By: andrewwdye
Differential Revision: D4668268
fbshipit-source-id: 65cbc8f2666f0b7d2f1c72574b86d913f5855d62
Summary:
Taking ownership of a std::unique_ptr is a bit awkward. It's actually
useful to reuse the underlying store and create multiple prefix stores
against it.
Reviewed By: andrewwdye
Differential Revision: D4662354
fbshipit-source-id: eaf62f7d5a97d6ee848252ff3124c28da349f6f2
Summary:
This changes the constructor prototype of the broadcast algorithms.
They now take the rank of the root process and the rank of the root
pointer. The root process now also broadcasts locally, among the
specified pointers, in addition to broadcasting to its peer processes.
The broadcast tests are made more robust to use a different value at
every index for every buffer, like the allreduce tests. To accommodate
multiple input buffers for CPU side algorithms, I added a Fixture
helper, and renamed the existing Fixture class to CudaFixture.
The broadcast tests contain a few TODOs since they don't vary the root
process or root pointer yet. I anecdotally verified this does work,
but didn't want to include the necessary changes to do so in this
commit (it requires some changes in rendezvous and NCCL code). A fix
for this is forthcoming.
Reviewed By: andrewwdye
Differential Revision: D4661635
fbshipit-source-id: c069e0d4e8f676a63efd74b15ea1156adcc09477
We were keying hooks by RemovableHandle id. However, we don't hold onto
handles, and ids of dead objects can be reused. This replaces id(handle)
with a global counter.
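A minimal sketch of the idea (the class name follows torch.utils.hooks.RemovableHandle; the body is simplified):
```python
import itertools

class RemovableHandle(object):
    """Keys a hook by a process-global counter instead of id(self):
    a counter value is never reused, while the id of a dead handle can be."""
    _next_id = itertools.count()

    def __init__(self, hooks_dict):
        self.hooks_dict = hooks_dict
        self.id = next(RemovableHandle._next_id)
        # caller stores the hook as hooks_dict[handle.id] = fn

    def remove(self):
        self.hooks_dict.pop(self.id, None)
```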
This is similar to THCCachingHostAllocator_recordEvent() but on CUDA
allocations. It's useful for overlapping copies with computation. The
workflow is approximately:
0. allocate dst tensor on copy stream
1. copy from CPU to GPU on copy stream
2. synchronize the main stream with the copy stream via
cudaStreamWaitEvent
3. THCCachingAllocator_recordStream(dst, main_stream)
The recordStream() call is necessary to prevent the dst tensor from
being reused on the copy stream before the main stream finishes work.
Previously, you would need to insert a second cudaStreamWaitEvent before
dst is freed to force the copy stream to wait on the main stream.
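In today's Python API the equivalent workflow looks roughly like the sketch below (Tensor.record_stream is the Python-level counterpart of the C++ call; treat this as an illustration of the four steps above, not the original call sites):
```python
import torch

main = torch.cuda.current_stream()
copy_stream = torch.cuda.Stream()

src = torch.randn(1024).pin_memory()
with torch.cuda.stream(copy_stream):
    # 0/1: allocate dst and copy host-to-device on the copy stream
    dst = src.to('cuda', non_blocking=True)
# 2: main stream waits for the copy before consuming dst
main.wait_stream(copy_stream)
# 3: tell the caching allocator that dst is also used on the main stream,
# so its block is not handed back to the copy stream too early
dst.record_stream(main)
out = dst * 2   # runs on the main stream
```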
Summary:
I have seen a stress run crash with unexpected state. Adding these
assertions will give more information when it happens again.
```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at gloo/transport/tcp/pair.cc:407] false. Unexpected state: 5
```
Reviewed By: andrewwdye
Differential Revision: D4652216
fbshipit-source-id: e787f4097f5ab32367dd9fa5a336d0389b97e955
* Use TH_INDEX_BASE when verifying dimension for cat
* Adding tests for cat when no dimension is specified.
- Also renamed ldimension to cat_dimension to be more specific.
Summary:
The fields are public so their names should not end with an
underscore.
Reviewed By: andrewwdye
Differential Revision: D4645038
fbshipit-source-id: c12b47affbe511383a4722717a06abb61918473b
- Code was using the dimension as specified, which could be negative
- Changed the cat_dimension variable to be more explicit
- Fixed code to use the cat_dimension variable
Summary:
The NCCL code used in CUDA-aware allreduce does local reduction of N
buffers prior to putting anything on the wire. Supporting this in the
benchmark tool to measure the impact under various configurations.
Other minor tweaks in this change:
* Specify sub-second iteration time
* Templatize allreduce benchmarks (the algorithms share a constructor
prototype)
Reviewed By: andrewwdye
Differential Revision: D4639517
fbshipit-source-id: f7417d3e9f79278a3b1eca48d779f48b77e5260c
Summary: Cuda algorithms take an optional set of device streams to sequence operations. If streams are provided, the algorithms should enqueue final output buffer operations on the associated stream and return asynchronously. Destructors that allocate streams/events should synchronize before tearing down.
Reviewed By: pietern
Differential Revision: D4636447
fbshipit-source-id: 32ec2adc214c83b0b4bc0fff8993ab196459117b
Summary:
With this change, every buffer gets assigned a different
value at every index. This means reordering of segments (e.g. in the
chunked algorithm) would surface as test errors.
Reviewed By: andrewwdye
Differential Revision: D4636368
fbshipit-source-id: 464eb1515d1590e12481961d427a92e2ebb3be82
Summary: CUDA documentation detailing high-level support for CUDA in gloo algorithms, usage of streams, and synchronizing memory management.
Reviewed By: pietern
Differential Revision: D4633120
fbshipit-source-id: d88e230c8dc82fe48cda0f401b61758fa4f07f2e
Summary:
Synchronous mode means using the calling thread instead of the device
thread for completion handling. Since this saves a context switch in
the critical path, this is very beneficial for low latency algorithms.
For example: the p99 of a 4-way barrier drops from 17us to 4us.
Reviewed By: andrewwdye
Differential Revision: D4626948
fbshipit-source-id: 013b1680497589fe5ad0bca38600bce6a410200b
Summary:
All pairs created by a device would use the same completion queue.
Supporting sync mode that way is difficult, as there is no way to
filter completions for a particular pair. This change refactors this
to use a single completion queue per pair so that this is no longer an
issue. This change is a preparation for supporting synchronous mode
(where the calling thread itself will poll the ibv library for
completions instead of the device thread).
This change also includes a refactoring of the way transient memory
regions are handled so that they are properly deregistered and
deallocated when no longer needed.
Reviewed By: andrewwdye
Differential Revision: D4625146
fbshipit-source-id: 21bf5ab321534fbd5c03f12049c10fc67da68944
Summary: std::atomic was not defined for cuda.cu.
Reviewed By: andrewwdye
Differential Revision: D4624611
fbshipit-source-id: 973bba10026e065667d6a576055d00505ee02d62
Summary: Allow gloo consumers to assign a mutex to synchronize CUDA malloc/free and NCCL operations.
Reviewed By: pietern
Differential Revision: D4622135
fbshipit-source-id: 60acd7c01a677a0df5415fe38e6ef5a2e7c8606a
Separates out non-Python part of AutoGPU. This also compiles without
CUDA which is useful for generic tensor code.
Also fixes a bug where THCPAutoGPU may not always switch the device:
THCPAutoGPU guard(-1);
guard.setDevice(0);
guard.setDevice(1);
guard.setDevice(0); // would not switch back to 0
NCCL can deadlock if cudaFree() is called while it's launching kernels.
This exposes a mutex that can be held to prevent cudaFree() calls in the
caching allocator.
Summary: The AllReduceChunked algorithm currently performs the local reduce/broadcast of local device buffers in host memory. This diff updates the algorithm to execute the local reduce/broadcast steps using NCCL operations before copying a single device buffer to/from host memory.
Reviewed By: pietern
Differential Revision: D4587441
fbshipit-source-id: 4de689f59a6cf898b8eecd3c3b9f57f77124c0e3
* Add more detail to CUDA documentation
Also adds better cross-linking to the pages that discuss relevant topics.
* Adds recommendation to torch.save docs
* Make the version numbers for the docs dynamic
Might need tweaks for beta, 1.0, etc.
Backend is SpatialDilatedMaxPooling, so change 3D input (N*C*L)
to 4D size (N*C*1*L). Then output indices will range from 0 to L.
This range will not cause UnMaxPool1D error.
Signed-off-by: Zhou Chang <achang.zhou@gmail.com>
Summary:
Work may be queued on CUDA streams for asynchronous execution. The
memory backed by pointers passed to any algorithm can therefore be
mutated after constructing an algorithm instance. By also passing in
the streams these mutations happen on, the algorithms can synchronize
with these mutations to ensure no invalid data is used.
By passing in these streams, any work done by these algorithms will
*also* be queued, which effectively removes a single synchronization
step from any algorithm run.
Differential Revision: D4589394
fbshipit-source-id: 0c8cd6ba9c9018f33d6f4c55a037083fc4164acb
Summary: I was mistakenly calling the non-chunked algorithm for the chunked test.
Reviewed By: pietern
Differential Revision: D4580160
fbshipit-source-id: 9d62a68e9e86cc6e596d90ff8854c585a0e8855c
Summary:
First pass at a CUDA-aware allreduce chunked implementation. For now the algorithm runs on the CPU and is mostly copy/paste from allreduce_ring.h. A subsequent pass will offload to the GPU.
Serialize cuda test to avoid intermittent failures due to memory contention.
Reviewed By: pietern
Differential Revision: D4576959
fbshipit-source-id: e1f292a05b88ff24c33e549d4a52e770a21f85d2
Summary: Ideally we would want the driver to busy-poll for us. In absence of driver support, spinning with MSG_DONTWAIT flag seems to be helping a lot too. Of course, we pay the price of burning one core for polling. Sigh.
Reviewed By: pietern
Differential Revision: D4576242
fbshipit-source-id: 85d9e1b786fbb6053864fba80f3e5ecc80fe221d
Summary:
Latency optimization is going well and I've seen the odd case of <10us
measurements. This option makes the benchmark tool display nanos
instead.
Differential Revision: D4575925
fbshipit-source-id: 98dbd3b39e31cbcdd4c146613f6630e721187e1e
Summary:
The CudaDevicePointer optionally takes an existing stream on
which it runs any operation associated with the pointer (for now just
memcpy's, but this will likely include kernel execution in the
future).
Differential Revision: D4574035
fbshipit-source-id: ddd7972a3874012059f1fde1b341fd6edd69102d
Summary:
In synchronous mode, it is not the device thread that is responsible
for handling I/O, but the user thread itself. Calling waitRecv on a
buffer will trigger the read function on the pair to be called. This
eliminates the context switch necessary if the device thread is
handling all I/O. For benchmarks with small numbers of elements this
reduces latency by as much as 20%.
Reviewed By: plapukhov
Differential Revision: D4549998
fbshipit-source-id: ab718ba090c06d7c7aa4065cc9f92bd96b9e4a35
Used .c file changes from 7318e2de13 as a starting point. All changes to .c files (except for whitespace details) are present here.
However, the required .h files were not present in that PR.
Summary:
Implement CUDA BroadcastOneToAll algorithm for GPU addresses. Refactor cuda.h into cuda_private.h to allow inclusion of <cuda.h> in public headers without polluting the namespace.
Port broadcast tests to GPU variants.
* this revision is based on Peter's revision D4546932
Differential Revision: D4547382
fbshipit-source-id: 3d294ad8862b04fb783ba22e5c925b8d7cbc8a8d
Summary:
Separate benchmark build target for CUDA-aware algorithms.
This is needed to keep CUDA an optional dependency.
Differential Revision: D4546932
fbshipit-source-id: b73176ae9067233f883d51ba3ab4efbb13a6f86f
Summary:
This CUDA-aware ring allreduce is based on the regular ring allreduce.
It runs the reduction algorithm on the CPU and is therefore most
suited for smaller buffers.
Both the device-to-host memcpy's at the start of the algorithm and the
host-to-device memcpy's at the end of the algorithm are kicked off
asynchronously in an attempt to parallelize as much as possible.
Reviewed By: Yangqing
Differential Revision: D4542816
fbshipit-source-id: 101dfad276ca79703e37ff93fb1b6d467295f66b
Summary:
The CUDA benchmark suite will be a separate build target, so the
runner should be reused.
Reviewed By: Yangqing
Differential Revision: D4545092
fbshipit-source-id: 6ccf2d30f5d35c74fc59851b25416bfe6863d62c
The core autograd Variable, Function, and Engine no longer depend on the
Python API. This lets us implement functions in C++. In the future, we
can also multithread the engine and release the GIL for most of the
non-Python backwards passes.
Summary:
In the GitHub repository this directory will be mirrored similar to
folly, such that the repository has a single top level directory
called "gloo". This allows for versioning or renaming of the
project root, without having to mangle the include paths; they will
always use the "gloo" prefix.
fbshipit-source-id: 24502e4185fc7cbe19b5249f83609e2b8118e9d7
In cases where copyAsync is a large percentage of the work,
processing events in recordEvent can cause a large bottleneck.
Here, we relax the constraint that we reclaim blocks as fast as possible
(i.e. in copyAsync); instead, we only check that a block can be re-allocated
in malloc and free.
These methods are useful from C because they don't require constructing
THLongStorages to wrap the sizes and strides, which can lead to leaked
memory in case of an error. Instead the sizes and strides can be
represented on the stack using standard C long arrays.
Moves THPObjectPtr into a separate header, so that it can be included
independently. Currently, utils.h requires all of THP.h. Also adds RAII
structs for acquiring and releasing the GIL.
Due to a bad rank mapping, broadcast and reduce were connecting the
wrong processes, which resulted in errors or tensors never being sent/received.
* Introduced a new mapping method to solve this problem.
* Added and improved tests for these cases.
Here's the command I used to invoke autopep8 (in parallel!):
git ls-files | grep '\.py$' | xargs -n1 -P`nproc` autopep8 -i
Several rules are ignored in setup.cfg. The goal is to let autopep8
handle everything which it can handle safely, and to disable any rules
which are tricky or controversial to address. We may want to come back
and re-enable some of these rules later, but I'm trying to make this
patch as safe as possible.
Also configures flake8 to match pep8's behavior.
Also configures TravisCI to check the whole project for lint.
* Fix error in ELU backward
* Add --seed flag for tests
* Add test for BatchNorm eval
* Fix autograd.backward docs
* Support cc flags in cuDNN search
* Fix IndexSelect backward formula
Scales `delta` before it is applied to the parameters in order to control the learning rate of the optimizer (inspired by the climin optim lib for Theano).
Also changed the link to the Adadelta paper to point to the right location.
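Illustrative usage (in torch.optim.Adadelta, `lr` multiplies the computed delta and defaults to 1.0):
```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
# lr multiplies each computed delta; lr=1.0 reproduces the paper's update
optimizer = optim.Adadelta(model.parameters(), lr=1.0, rho=0.9, eps=1e-6)
```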
* Always compile .numpy() for all types
* Add torch.nn.functional docs and hidden headers
* Use sphinx to generate torchvision docs
* Remove unused import in ffi utils
Transposed convolutions are often (but incorrectly) referred to as Deconvolutional operations. Made mention of this in the docstring to make it easier for people to search for this operation in the documentation.
Depending on how PyTorch is compiled, the source code for DataLoader
might not be fully available which can cause a spurious error in
test_dataloader.py
Improve the error messages raised when functions receive an invalid combination of arguments.
For example:
>>> torch.randn(5, 5).geqrf('invalid arg')
TypeError: geqrf received an invalid combination of arguments - got (str), but expected ()
This is because the current version of luaffifb fails to pass
custom structs (i.e. half) as arguments or accept them as return
values.
The accreal parameters are immediately converted to real internally.
This is done to ensure none of the internal code needs to be changed.
This change also removes transform_reals_to_half which is no longer
necessary.
Change-Id: I978151d001de5492576fb0eddfa0608cd4e99149
The load_state_dict() function now raises an error if the argument
state_dict has extra keys or is missing keys.
Previously, load_state_dict() ignored extra and missing keys, which made
it hard to notice when you load an invalid state_dict. This could
happen, for example, if you save the state_dict for a DataParallel, but
load it into a single model.
The state_dict() function now only includes the Tensor data from the
parameters, which reduces checkpoint size by not saving gradients.
The register hook calls now return an object that can be used to remove
the hook. For example,
>>> h = module.register_forward_hook(callback)
>>> h.remove() # removes hook
Or as a context manager:
>>> with module.register_forward_hook(callback):
... pass
This makes it easier for libraries to use hooks without worrying about
name collisions.
- Non-differentiable outputs could prevent a gradient computation (see
test_dep_nograd)
- Crash in backward on a variable which doesn't require grad (issue
#438)
- Stochastic functions could be backpropped through multiple times
- don't use cuDNN for half inputs because weight, bias, running_mean,
etc. are required to be of different type than for THCUNN
- accept 3D inputs (N,C,L) in BatchNorm1d
- remove accidental 'use_cudnn=False'
* Add support for torch.HalfTensor.
* Improvements/Simplifications for torch.HalfTensor.
Improvements/Simplifications:
1) Defines half type as TH_Half, so as to not conflict with cutorch
version. Previously, these were defined as the same "half" type and
required proper ordering of includes to ensure type was only defined
once, which would have affected all downstream projects.
2) No longer generates math functions that are not actually defined
on torch.HalfTensor, e.g. maskedFill, map, etc.
3) Adds tests for all available torch.HalfTensor functions
4) Allows compiling without TH_GENERIC_USE_HALF (so if there's a
problem you can just unset that in CMakeLists rather than backing out)
5) Some simplifications: removes a new copy optimization and
some TH_HALF literal definitions
Limitations:
Because math functions are not defined, some "non-math" operators
on torch.HalfTensor give an error message, e.g. __index__/__newindex__
with a ByteTensor apply a mask, but masks aren't implemented. These
limitations aren't always obvious (e.g. for documentation purposes),
but they should always give an error message.
* Rename TH_HALF to THHalf.
This hooks into the (internal) ForkingPickler class in multiprocessing
to reduce tensors, storages, and CUDA events instead of our queue from
joblib. This makes it easier to use the standard multiprocessing classes
in later versions of Python.
This also exposes:
- Tensor/Storage.share_memory_()
- Module.share_memory()
These methods move the CPU tensors and storages to shared memory. If
you're using the "fork" method of multiprocessing, these objects can be
directly inherited instead of serialized through a queue.
Added support for the fill, diff, scale, mul and add functions using
PPC CPU vector instructions. These are used in place of the versions
of these functions written for x86, when compiled on PPC.
This fixes a compile failure on PPC
Occasionally, my PyTorch checkout gets into a bad state where libnccl.so
does not exist, but the NCCL makefile doesn't build it because
libnccl.so.1 exists. Switch to copying libnccl.so.1 to work around this.
Fix a bug in cat when catting with an empty tensor along the first dim (it added an extra dim).
Fix the ambiguous 'catting along the last dimension' sentence in the doc and change the behavior to pick the maximum last dimension over all input tensors.
Now empty tensors are allowed.
CUDA IPC only works with Python 3 using the "spawn" start method. You
can select the start method using the get_context method:
import torch.multiprocessing as mp
ctx = mp.get_context('spawn')
queue = ctx.Queue()
event = ctx.Event()
Uses the assignment syntax to get deterministic ordering of parameters.
The ordering of parameters using the constructor syntax is
non-deterministic because kwargs use dict() in Python 3.5 and earlier.
Without this, the cuda_events could continuously grow from calls to
cudaMemcpyAsync, but would never be processed if there were no new
pinned memory allocations.
For example:
t1 = cutorch.createCudaHostTensor(10)
t2 = torch.CudaTensor(10)
while true do t2:copyAsync(t1) end
Adds a caching allocator for CUDA pinned (page-locked) memory. This
avoids synchronization due to cudaFreeHost or cudaHostUnregister at the
expense of potentially higher host memory usage.
Correctness is preserved by recording CUDA events after each
cudaMemcpyAsync involving the pinned memory. The pinned memory
allocations are not reused until all events associated with it have
completed.
Exceptions are:
1) SparseLinear
requires additional parameters to be passed in (e.g. nbatches),
so it's not clear it's worth moving to C since it won't really simplify the binding
code logic.
2) BatchNormalization
requires "makeBatch", which isn't a trivial translation to C.
3) LookupTable
requires "view" in C, which is already a TODO
4) SpatialUpSamplingBilinear
requires "view" in C, which is already TODO
DataLoader now supports the constructor argument 'pin_memory'. When set
to true, tensors in the sample are copied to pinned memory. This happens
in a background thread when num_workers > 1.
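A small usage sketch (the dataset here is a stand-in; `pin_memory` is the DataLoader flag this commit adds):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
# pin_memory=True copies each batch into page-locked memory in the
# background, so GPU transfers with non_blocking=True can overlap compute
loader = DataLoader(dataset, batch_size=10, num_workers=2, pin_memory=True)
```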
Previously, cutorch would initialize every CUDA device and enable P2P
access between all pairs. This slows down start-up, especially with 8
devices. Now, THCudaInit does not initialize any devices and P2P access
is enabled lazily. Setting the random number generator seed also does
not initialize the device until random numbers are actually used.
Only references to their data and version counters are stored.
Also, it is now possible to have None arguments in save_for_backward
and return too many values from backward (as long as the excessive
results are None).
Without the PyObject_GC_UnTrack call, the tp_dealloc handler could get
called twice if a referred-to object triggers a garbage collection from
its destructor.
See http://bugs.python.org/issue28737
On ARMv8, NEON is inherent and is instead listed as 'asimd' in /proc/cpuinfo
Replace assembly with C
Original authors:
- @dusty-nv: FindARM-patch.txt, CMakeLists-patch.txt
- @rtarquini: NEON.c
| System | Python 2 | Python 3 |
| --- | --- | --- |
| Linux CPU | [Build Status](https://travis-ci.org/pytorch/pytorch) | [Build Status](https://travis-ci.org/pytorch/pytorch) |
| Linux GPU | [Build Status](https://build.pytorch.org/job/pytorch-master-py2-linux) | [Build Status](https://build.pytorch.org/job/pytorch-master-py3-linux) |
| macOS CPU | [Build Status](https://build.pytorch.org/job/pytorch-master-py2-osx-cpu) | [Build Status](https://build.pytorch.org/job/pytorch-master-py3-osx-cpu) |
The project is still under active development and is likely to drastically change in short periods of time.
We will be announcing API changes and important developments via a newsletter and GitHub issues, and we will post links to the issues on Slack.
Please remember that at this stage, this is an invite-only closed alpha, and please don't distribute code further.
This is done so that we can control development tightly and rapidly during the initial phases with feedback from you.
## More about PyTorch
At a granular level, PyTorch is a library that consists of the following components:
| Component | Description |
| ------------------------ | --- |
| torch | a Tensor library like NumPy, with strong GPU support |
| torch.autograd | a tape based automatic differentiation library that supports all differentiable Tensor operations in torch |
| torch.nn | a neural networks library deeply integrated with autograd designed for maximum flexibility |
| torch.optim | an optimization package to be used with torch.nn with standard optimization methods such as SGD, RMSProp, LBFGS, Adam etc. |
| torch.multiprocessing | python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and hogwild training. |
| torch.utils | DataLoader, Trainer and other utility functions for convenience |
| torch.legacy(.nn/.optim) | legacy code that has been ported over from torch for backward compatibility reasons |
Usually one uses PyTorch either as:
- a replacement for NumPy to use the power of GPUs
- a deep learning research platform that provides maximum flexibility and speed
Elaborating further:
### A GPU-Ready Tensor Library
If you use NumPy, then you have used Tensors (a.k.a. ndarray).
### Python First
PyTorch is not a Python binding into a monolithic C++ framework.
It is built to be deeply integrated into Python.
You can use it naturally like you would use NumPy / SciPy / scikit-learn etc.
You can write your new neural network layers in Python itself, using your favorite libraries
and use packages such as Cython and Numba.
Our goal is to not reinvent the wheel where appropriate.
### Imperative Experiences
PyTorch is designed to be intuitive, linear in thought and easy to use.
When you execute a line of code, it gets executed. There isn't an asynchronous view of the world.
When you drop into a debugger, or receive error messages and stack traces, understanding them is straightforward.
The stack trace points to exactly where your code was defined.
We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines.
### Fast and Lean
PyTorch has minimal framework overhead. We integrate acceleration libraries
such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed.
At the core, its CPU and GPU Tensor and neural network backends
(TH, THC, THNN, THCUNN) are written as independent libraries with a C99 API.
They are mature and have been tested for years.
Hence, PyTorch is quite fast – whether you run small or large neural networks.
The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives.
We've written custom memory allocators for the GPU to make sure that
your deep learning models are maximally memory efficient.
This enables you to train bigger deep learning models than before.
### Multi-GPU ready
PyTorch is fully powered to efficiently use multiple GPUs for accelerated deep learning.
We integrate efficient multi-GPU collectives such as NVIDIA NCCL to make sure that you get maximal multi-GPU performance.
### Extensions without Pain
Writing new neural network modules, or interfacing with PyTorch's Tensor API, was designed to be straightforward
and with minimal abstractions.
You can write new neural network layers in Python using the torch API
[or your favorite NumPy-based libraries such as SciPy](http://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html).
If you want to write your layers in C/C++, we provide an extension API based on
[cffi](http://cffi.readthedocs.io/en/latest/) that is efficient and with minimal boilerplate.
There is no wrapper code that needs to be written. You can see [a tutorial here](http://pytorch.org/tutorials/advanced/c_extension.html) and [an example here](https://github.com/pytorch/extension-ffi).
* Slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . Our slack channel is invite-only to promote a healthy balance between power-users and beginners. If you need a slack invite, ping us at soumith@pytorch.org
* newsletter: no-noise, one-way email newsletter with important announcements about pytorch. You can sign-up here: http://eepurl.com/cbG0rv
## Timeline
We will run the alpha releases weekly for 6 weeks.
After that, we will reevaluate progress, and if we are ready, we will hit beta-0. If not, we will do another two weeks of alpha.
* ~~alpha-0: Working versions of torch, cutorch, nn, cunn, optim fully unit tested with seamless numpy conversions~~
* ~~alpha-1: Serialization to/from disk with sharing intact. initial release of the new neuralnets package based on a Chainer-like design~~
* ~~alpha-2: sharing tensors across processes for hogwild training or data-loading processes. a rewritten optim package for this new nn.~~
* alpha-5: a ton of examples across vision, nlp, speech, RL -- this phase might make us rethink parts of the APIs, and hence want to do this in alpha than beta
* alpha-6: Putting a simple and efficient story around multi-machine training. Probably simplistic like torch-distlearn. Building the website, release scripts, more documentation, etc.
* beta-0: First public release
The beta phases will be leaning more towards working with all of you, covering your use-cases, active development on non-core aspects.
## Releases and Contributing
PyTorch has a 90 day release cycle (major releases).
Its current state is Beta; we expect no obvious bugs. Please let us know if you encounter a bug by [filing an issue](https://github.com/pytorch/pytorch/issues).
We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.
If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without discussion might end up resulting in a rejected PR, because we might be taking the core in a different direction than you might be aware of.
## pytorch vs torch: important changes
We've decided that it's time to rewrite/update parts of the old torch API, even if it means losing some backward compatibility.
**[This tutorial](https://github.com/pytorch/tutorials/blob/master/Introduction%20to%20PyTorch%20for%20former%20Torchies.ipynb) takes you through the biggest changes**
and walks you through PyTorch.
For brevity, here is a summary of the biggest changes:
#### Tensors:
- clear separation of in-place and out-of-place operations
- zero-indexing
- no camel casing for Tensor functions
- an efficient Numpy bridge (with zero memory copy)
- CUDA tensors have clear and intuitive semantics
#### New neural network module (Combines nn, nngraph, autograd):
1. Design inspired from Chainer
2. Modules no longer hold state. State is held in the graph
   1. Access state via hooks
3. Execution engine
   1. imperative execution engine (default)
   2. lazy execution engine
      1. allows graph optimizations and automatic in-place / fusing operations
4. Model structure is defined by its code
   1. You can use loops and arbitrarily complicated conditional statements
**To reiterate, we recommend that you go through [This tutorial](https://github.com/pytorch/tutorials/blob/master/Introduction%20to%20PyTorch%20for%20former%20Torchies.ipynb)**
### Serialization
Pickling tensors is supported, but requires making a temporary copy of all data in memory and breaks sharing.
For this reason we're providing `torch.load` and `torch.save`, that are free of these problems.
They have the same interfaces as `pickle.load` (file object) and `pickle.dump` (serialized object, file object) respectively.
For now the only requirement is that the file should have a `fileno` method, which returns a file descriptor number (this is already implemented by objects returned by `open`).
Objects are serialized in a tar archive consisting of four files:
- `sys_info` - protocol version, byte order, long size, etc.
- `pickle` - pickled object
- `tensors` - tensor metadata
- `storages` - serialized data
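For example, mirroring the `pickle`-style interfaces described above:
```python
import torch

x = torch.rand(3, 3)
with open('tensor.pt', 'wb') as f:
    torch.save(x, f)      # (object, file), like pickle.dump
with open('tensor.pt', 'rb') as f:
    y = torch.load(f)     # (file), like pickle.load
```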
### Multiprocessing with Tensor sharing
We made PyTorch seamlessly integrate with Python multiprocessing.
What we've added specially in torch.multiprocessing is the ability to efficiently share and send
tensors from one process to another. ([technical details of implementation](http://github.com/pytorch/pytorch/wiki/Multiprocessing-Technical-Notes))
This is very useful for example in:
- Writing parallelized data loaders
- Training models "hogwild", where several models are trained in parallel, sharing the same set of parameters.
Here are a couple of examples for torch.multiprocessing
```python
# loaders.py
# Functions from this file run in the workers

def fill(queue):
    while True:
        tensor = queue.get()
        tensor.fill_(10)
        queue.put(tensor)

def fill_pool(tensor):
    tensor.fill_(10)
```
```python
# Example 1: Using multiple persistent processes and a Queue
# process.py

import torch
import torch.multiprocessing as multiprocessing
from loaders import fill

# torch.multiprocessing.Queue automatically moves Tensor data to shared memory
```
As shown above, the structure of the networks is fully defined by the control flow embedded in the code. There are no rigid containers known from Lua. You can put an `if` in the middle of your model and freely branch depending on any condition you can come up with. All operations are registered in the computational graph history.
There are two main objects that make this possible - variables and functions. They will be denoted as squares and circles respectively.

Variables are the objects that hold a reference to a tensor (and optionally to gradient w.r.t. that tensor), and to the function in the computational graph that created it. Variables created explicitly by the user (`Variable(tensor)`) have a Leaf function node associated with them.

Functions are simple classes that define a function from a tuple of inputs to a tuple of outputs, and a formula for computing the gradient w.r.t. its inputs. Function objects are instantiated to hold references to other functions, and these references make it possible to reconstruct the history of a computation. An example graph for a linear layer (`Wx + b`) is shown below.
Please note that function objects never hold references to Variable objects, except for when they're necessary in the backward pass. This makes it possible to free all the unnecessary intermediate values. A good example of this is addition when computing e.g. (`y = Wx + My`):
Matrix multiplication keeps references to its inputs because it will need them, but addition doesn't need `Wx` and `My` after it computes the result, so as soon as they go out of scope they are freed. To access intermediate values in the forward pass you can either copy them when you still have a reference, or you can use a system of hooks that can be attached to any function. Hooks also allow you to access and inspect gradients inside the graph.
Another nice thing about this is that a single layer doesn't hold any state other than its parameters (all intermediate values are alive as long as the graph references them), so it can be used multiple times before calling backward. This is especially convenient when training RNNs: you can use the same network for all timesteps and the gradients will sum up automatically.
To compute the backward pass you can call `.backward()` on a variable if it's a scalar (a 1-element Variable), or you can provide a gradient tensor of matching shape if it's not. This creates an execution engine object that manages the whole backward pass. It was introduced so that the code for analyzing the graph and scheduling node processing order is decoupled from other parts, and can be easily replaced. Right now it simply processes the nodes in topological order, without any prioritization, but in the future we can implement algorithms and heuristics for scheduling independent nodes on different GPU streams, deciding which branches to compute first, etc.
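As a concrete illustration of the `Wx + b` graph and the backward call described above (a sketch in the Variable-era API):
```python
import torch
from torch.autograd import Variable

W = Variable(torch.randn(3, 3), requires_grad=True)
x = Variable(torch.randn(3, 1), requires_grad=True)
b = Variable(torch.randn(3, 1), requires_grad=True)

y = W.mm(x) + b              # builds the graph: MatMul -> Add
# y is not a scalar, so supply a gradient tensor of matching shape
y.backward(torch.ones(3, 1))
print(W.grad)                 # gradient accumulated on the leaf Variable
```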
## The Team
PyTorch is a community driven project with several skillful engineers and researchers contributing to it.
PyTorch is currently maintained by [Adam Paszke](https://apaszke.github.io/), [Sam Gross](https://github.com/colesbury), [Soumith Chintala](http://soumith.ch) and [Gregory Chanan](https://github.com/gchanan) with major contributions coming from tens of talented individuals in various forms and means.
A non-exhaustive but growing list needs to mention: Trevor Killeen, Sasank Chilamkurthy, Sergey Zagoruyko, Adam Lerer, Francisco Massa, Alykhan Tejani, Luca Antiga, Alban Desmaison, Andreas Kopf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein, Christian Sarofeen, Martin Raison, Edward Yang, Zachary Devito.
Note: this project is unrelated to [hughperkins/pytorch](https://github.com/hughperkins/pytorch) with the same name. Hugh is a valuable contributor in the Torch community and has helped with many things Torch and PyTorch.
If provided, the optional argument `weights` should be a 1D Tensor assigning
weight to each of the classes.
This is particularly useful when you have an unbalanced training set.
The input given through a forward call is expected to contain log-probabilities
of each class: input has to be a 2D Tensor of size minibatch x n
Obtaining log-probabilities in a neural network is easily achieved by
adding a `LogSoftmax` layer in the last layer.
You may use `CrossEntropyLoss` instead, if you prefer not to
add an extra layer.
The target that this loss expects is a class index (1 to the number of classes)
The loss can be described as:
loss(x, class) = -x[class]
or in the case of the weights argument it is specified as follows:
loss(x, class) = -weights[class] * x[class]
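For example (assuming the same `torch`, `nn` and `autograd` imports as the other snippets on this page):
```python
m = nn.LogSoftmax()
loss = nn.NLLLoss()
# input is of size nBatch x nClasses = 3 x 5
input = autograd.Variable(torch.randn(3, 5), requires_grad=True)
# each element in target is a class index
target = autograd.Variable(torch.LongTensor([1, 0, 4]))
output = loss(m(input), target)
output.backward()
```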
#### Constructor Arguments
Parameter | Default | Description
--------- | ------- | -----------
weight | None | a manual rescaling weight given to each class. If given, has to be a Tensor of size "nclasses".
size_average | True | By default, the losses are averaged over observations for each minibatch. However, if the field size_average is set to False, the losses are instead summed for each minibatch.
Target Shape: [ * ] : Targets of size [minibatch], each value has to be 1 <= targets[i] <= nClasses
#### Members
Parameter | Description
--------- | -----------
weight | the class-weights given as input to the constructor
### NLLLoss2d
This is the negative log likelihood loss, but for image inputs. It computes NLL loss per-pixel.
```python
m = nn.Conv2d(16, 32, (3, 3)).float()
loss = nn.NLLLoss2d()
# input is of size nBatch x nClasses x height x width
input = autograd.Variable(torch.randn(3, 16, 10, 10))
# each element in target has to have 0 <= value < nclasses
```
#### Constructor Arguments
Parameter | Default | Description
--------- | ------- | -----------
size_average | True | By default, the losses are averaged over observations for each minibatch. However, if the field size_average is set to False, the losses are instead summed for each minibatch.
Target Shape: [ * , * , * ] : Targets of size minibatch x height x width, each value has to be 1 <= targets[i] <= nClasses
### KLDivLoss
The [Kullback-Leibler divergence](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) Loss
KL divergence is a useful distance measure for continuous distributions
and is often useful when performing direct regression over the space of
(discretely sampled) continuous output distributions.
Fractional MaxPooling is described in detail in the paper ["Fractional Max-Pooling" by Ben Graham](http://arxiv.org/abs/1412.6071)
The max-pooling operation is applied in kHxkW regions by a stochastic
step size determined by the target output size.
The number of output features is equal to the number of input planes.
#### Constructor Arguments
Parameter | Default | Description
--------- | ------- | -----------
kernel_size | | the size of the window to take a max over. Can be a single number k (for a square kernel of k x k) or a tuple (kh x kw)
output_size | | the target output size of the image of the form oH x oW. Can be a tuple (oH, oW) or a single number oH for a square image oH x oH
output_ratio | | If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1)
return_indices | False | if True, will return the indices along with the outputs. Useful to pass to nn.MaxUnpool2d .
#### Expected Shape
Parameter | Shape | Description
--------- | ----- | -----------
input | [ * , * , * , * ] | Input is minibatch x channels x iH x iW
output | [ * , * , * , * ] | Output shape = minibatch x channels x oH x oW (the target output size)
### LPPool2d
Applies a 2D power-average pooling over an input signal composed of several input planes.
On each window, the function computed is: f(X) = pow(sum(pow(X, p)), 1/p)
At p = infinity, one gets Max Pooling
At p = 1, one gets Sum Pooling (which is proportional to Average Pooling)
```python
# power-2 pool of square window of size=3, stride=2
m = nn.LPPool2d(2, 3, stride=2)
# pool of non-square window of power 1.2
m = nn.LPPool2d(1.2, (3, 2), stride=(2, 1))
input = autograd.Variable(torch.randn(20, 16, 50, 32))
output = m(input)
```
#### Constructor Arguments
Parameter | Default | Description
--------- | ------- | -----------
kernel_size | | the size of the window. Can be a single number k (for a square kernel of k x k) or a tuple (kh x kw)
stride | kernel_size | the stride of the window. Can be a single number s or a tuple (sh x sw).
ceil_mode | | when True, will use "ceil" instead of "floor" to compute the output shape
#### Expected Shape
Parameter | Shape | Description
--------- | ----- | -----------
input | [ * , * , * , * ] | Input is minibatch x channels x iH x iW
output | [ * , * , * , * ] | Output shape = minibatch x channels x floor((iH - kH) / sH + 1) x floor((iW - kW) / sW + 1)
When you create a `torch.cuda.*Tensor`, it is allocated on the current GPU.
However, you could allocate it on another GPU as well, using the `with torch.cuda.device(id)` context.
All allocations within this context will be placed on the GPU `id`.
Once `Tensor`s are allocated, you can do operations on them from any GPU context, and the results will be placed on the same device as where the source `Tensor` is located.
For example, if Tensors `a` and `b` are both on GPU-2 while GPU-1 is the current device, then `c = a + b` will place `c` on GPU-2, regardless of what the current device is.
Cross-GPU operations are not allowed; the only cross-GPU operation allowed is `copy`.
So if `a` is on GPU-1 and `b` is on GPU-2, then `c = a + b` will result in an error.
See the example for more clarity on these semantics.
```python
# Tensors are allocated on GPU 1 by default
x = torch.cuda.FloatTensor(1)
# x.get_device() == 0
y = torch.FloatTensor(1).cuda()
# y.get_device() == 0
with torch.cuda.device(1):
    # allocates a tensor on GPU 2
    a = torch.cuda.FloatTensor(1)
    # transfers a tensor from CPU to GPU-2
    b = torch.FloatTensor(1).cuda()
    # a.get_device() == b.get_device() == 1
    z = x + y
    # z.get_device() == 0 (the result stays on the device of the source Tensors)
    # even within a context, you can give a GPU id to the .cuda call
    c = torch.randn(2).cuda(2)
    # c.get_device() == 2
```
`torch` is the main package where data structures for multi-dimensional
tensors and mathematical operations over these are defined.
Additionally, it provides utilities for efficient serialization of Tensors
and arbitrary types, and other useful tools.
It has a CUDA counterpart that enables you to run your tensor computations
on an NVIDIA GPU with compute capability >= 2.0.
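For example (a minimal sketch; the tensor sizes are arbitrary):
```python
import torch

a = torch.rand(2, 3)      # a CPU tensor
if torch.cuda.is_available():
    b = a.cuda()          # the same data, on the GPU from here on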
## Multi-core
### torch.get_num_threads()
Gets the number of OpenMP threads that will be used for parallelizing CPU operations
### torch.set_num_threads(n)
Sets the number of OpenMP threads to use for parallelizing CPU operations
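For example (a minimal sketch that halves the thread count; the right setting is workload-dependent):
```python
n = torch.get_num_threads()
torch.set_num_threads(max(1, n // 2))
```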
## Serialization
### torch.save(object, file)
This function pickles a Python object to `file`, which is either a filename or a file handle.
`object` can be any picklable Python object, including torch `Tensor`s, autograd `Variable`s, nn `Module`s, etc.
When a group of torch `Tensor`s is saved together and some of them share the same storage, that sharing is preserved during saving and loading.
### torch.load(file)
This function unpickles objects that have been pickled with `torch.save`.
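For example (a minimal sketch; the file name `tensor.pt` is arbitrary):
```python
x = torch.rand(4, 4)
torch.save(x, 'tensor.pt')
y = torch.load('tensor.pt')
# x and y now have equal contents
```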
## Random Numbers
### torch.get_rng_state()
Gets the current state of the torch Random Number Generator.
This can be passed in the future to `torch.set_rng_state` to restore the current RNG state.
### torch.set_rng_state(state)
Sets the current state of the torch Random Number Generator to the given `state`.
### torch.manual_seed(number)
Sets the initial seed of the random number generator to a given number.
### torch.initial_seed()
Returns the initial seed of the Random Number Generator.
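For example (a minimal sketch showing that restoring a saved state reproduces the same draws):
```python
torch.manual_seed(42)
state = torch.get_rng_state()
a = torch.rand(3)
torch.set_rng_state(state)  # rewind the generator
b = torch.rand(3)           # b is identical to a
```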
## CUDA
### torch.cuda.is_available()
Returns `True` if CUDA is available and usable. Returns `False` otherwise.
### torch.cuda.device_count()
Returns the number of CUDA devices on the system.
### torch.cuda.current_device()
Returns the device index of the current default CUDA device.
### torch.cuda.synchronize()
This function issues a `cudaDeviceSynchronize` on the current device, and hence waits for all in-flight CUDA computation to finish.
### torch.cuda.current_stream()
Returns the handle to the current stream of the CUDA context.
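For example (a minimal sketch combining these calls; the GPU path only runs when CUDA is available):
```python
if torch.cuda.is_available():
    print(torch.cuda.device_count())    # number of visible GPUs
    print(torch.cuda.current_device())  # index of the current default device
    torch.cuda.synchronize()            # block until pending kernels finish
```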