Summary:
#19975 was split into 2 PRs.
This one:
Introduces a MemoryFormat argument to the `x.is_contiguous(memory_format=torch.channels_last)` and `y = x.contiguous(memory_format=torch.channels_last)` functions.
At this moment both functions only operate on strides and do not store any tensor state.
(Original RFC #19092)
-----
Expands the functionality of two tensor functions, `.is_contiguous` and `.contiguous` (both the Python and C++ APIs).
Note: We had several complaints about `.to(memory_format)` function, and decided not to support it.
1. `.contiguous` now supports an optional keyword-only argument, `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- Using `torch.contiguous_format` preserves the existing `.contiguous()` behavior.
- Calling `x.contiguous(memory_format=torch.channels_last)` returns a new tensor which maintains the same semantic layout (NCHW) but has a different memory allocation pattern.
`x.contiguous(memory_format=torch.channels_last)` expects the input tensor to be 3d, 4d, or 5d, and fails otherwise.
2. `.is_contiguous` now supports an optional keyword-only argument, `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- `x.is_contiguous(memory_format=torch.contiguous_format)` preserves the same functionality as `x.is_contiguous()` and remains unchanged.
- `x.is_contiguous(memory_format=torch.channels_last)` returns true if A) the input tensor is contiguous in memory AND B) it is allocated in memory in NHWC (or the analogous format for 3d/5d).
Note: until the end of phase one, `x.is_contiguous(memory_format=torch.channels_last)` will compute the state of the Tensor on every call. This functionality will be updated later.
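A minimal usage sketch of the new arguments (my own illustration, not code from the PR), assuming a 4d NCHW input:
```
import torch

# A 4d NCHW tensor in the default (contiguous) layout
x = torch.randn(2, 3, 4, 4)
assert x.is_contiguous()
assert x.is_contiguous(memory_format=torch.contiguous_format)
assert not x.is_contiguous(memory_format=torch.channels_last)

# Re-allocate the same logical NCHW tensor with channels_last strides
y = x.contiguous(memory_format=torch.channels_last)
assert y.shape == x.shape                                  # semantic layout unchanged
assert y.is_contiguous(memory_format=torch.channels_last)  # memory pattern changed
```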
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20455
Differential Revision: D15341577
Pulled By: VitalyFedyunin
fbshipit-source-id: bbb6b4159a8a49149110ad321109a3742383185d
Summary:
Fixes: #19253
Fixing pow(Tensor, float) is straightforward.
The breakage for pow(float, Tensor) is a bit more subtle to trigger, and the fix needs `torch.log` (`math.log` didn't work) from the newly merged #19115 (thanks ngimel for pointing out that this has landed).
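For reference, a small sketch (mine, not from the PR) of the two call patterns involved; the derivative note in the comment is standard calculus, not a quote from the fix:
```
import torch

t = torch.rand(4, requires_grad=True)

# pow(Tensor, float): the straightforward case
y1 = t.pow(2.5)

# pow(float, Tensor): the subtler case; its derivative d/dx a**x = a**x * ln(a)
# is where torch.log comes in
y2 = torch.pow(2.5, t)

(y1.sum() + y2.sum()).backward()
```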
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19324
Differential Revision: D15003531
Pulled By: ailzhang
fbshipit-source-id: 8b22138fa27a43806b82886fb3a7b557bbb5a865
Summary:
This PR propagates our use of first-class module objects into the compiler. This creates a transitionary state where:
* compiler.cpp creates Graphs where `self` is a Module class and attributes/parameters/buffers/submodules are looked up with `prim::GetAttr`
* GraphExecutor still runs "lowered graphs" where the self object has been removed by a compiler pass `lower_first_class_method`.
* Tracing still creates "lowered graphs", and a pass "lift_lowered_method" creates a first-class method graph for things.
* This PR separates out Method and Function. A script::Function is a pure Graph with no `self` bound. Similar to Python, a script::Method is just a bound `self` and its underlying `script::Function`.
* This PR also separates CompilationUnit from Module. A CompilationUnit is just a list of named script::Functions. Classes have a CompilationUnit holding the class methods, and Modules also have a CompilationUnit holding their Methods. This avoids the weird circular case Module --has a-> Class -> has a -> Module ...
Details:
* In this transitionary state, we maintain two copies of a Graph: first-class module and lowered. The first-class one has a self argument that is the module's class type. The lowered one is the lowered graph that uses the initial_ivalues inputs.
* When defining lowered methods using `_defined_lowered` we immediately create the first-class equivalent. The reverse is done lazily, creating lowered_methods on demand from the class.
* The two-way conversions will be deleted in a future PR when the executor itself runs first-class objects. However, this requires more changes to (1) the traces, (2) the python bindings, and (3) the onnx export pass, and would make this PR way too large.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19167
Differential Revision: D14891966
Pulled By: zdevito
fbshipit-source-id: 0b5f03118aa65448a15c7a7818e64089ec93d7ea
Summary:
It is done by flattening all tensor lists that are inputs/outputs to the
graph into the inputs/outputs list in the autograd graph.
This is less desirable than simply allowing IValues to exist in the
inputs/outputs of autograd::Function but it is substantially less
intrusive.
`CaptureList` is a single class describing the variables captured for backward.
`UnpackInstructs` describes how the flattened inputs to backward are re-packed into lists.
ailzhang
This PR is also part 2 of covering maskrcnn & bert AD formulas, following #16689.
Ops added in this PR:
```
cat
index
meshgrid
reshape
split
split_with_sizes
stack
unbind
```
I will also add a few perf numbers here.
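As an illustration (a sketch of my own, not from the PR; the function name and shapes are arbitrary), here is one of the listed ops exercised through TorchScript so the new formula can kick in:
```
import torch

@torch.jit.script
def cat_fn(x, y):
    # cat is one of the tensor-list ops gaining an AD formula in this PR
    return torch.cat([x, y], dim=0)

a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 3, requires_grad=True)
cat_fn(a, b).sum().backward()
```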
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16784
Differential Revision: D14104063
Pulled By: ailzhang
fbshipit-source-id: 5ceadadfd67ccaac60c5fd6740786c5354e252b9
Summary:
Move these 2 ops back to autodiff to unblock xla CI.
I will leave them for my next PR to clean up symbolic_variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18969
Differential Revision: D14816811
Pulled By: ailzhang
fbshipit-source-id: dd8a7e133dcad29560d3d1d25691883960117299
Summary:
As discussed with gchanan we should deduplicate symbolic_variable and symbolic_script to prepare for the future merge with derivatives.yaml.
This PR moves most easy formulas to symbolic_script.
TODO: run benchmarks to make sure there is no perf regression
cc: apaszke zdevito wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17986
Differential Revision: D14766412
Pulled By: ailzhang
fbshipit-source-id: d95a3f876e256c0f505779a71587c985571d3b8f
Summary:
Fix the layernorm formula when weight and bias are passed in.
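For context, a minimal sketch (mine, not from the PR) of the affected call pattern, i.e. layer norm with an explicit weight and bias:
```
import torch
import torch.nn.functional as F

x = torch.randn(4, 8, requires_grad=True)
weight = torch.randn(8, requires_grad=True)
bias = torch.randn(8, requires_grad=True)

# The case the fix targets: layer_norm called with weight and bias
out = F.layer_norm(x, (8,), weight, bias)
out.sum().backward()
```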
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18233
Differential Revision: D14760375
Pulled By: wanchaol
fbshipit-source-id: d6bd3b137bc04c391aa5c24d021d1f811ba2a877
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18378
ghimport-source-id: 55c29bb436a2153d29ff2f4488d99d8863c187b1
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18379 Enforce single parent for script submodules
* **#18378 Unify namespace of script::Module**
* #18314 Add ability to specialize class types to ArgumentSpec
* #18226 Add Slot type to abstract the raw pointers being used for slots.
This removes individual OrderedDicts in favor of a single unified
namespace for all things in a script::Module. This removes a whole
class of bugs where both a method and a parameter could get the
same name, for instance.
Since we no longer have to expose OrderedDict::Item objects, a lot of
downstream code can be simplified.
We no longer double-store names (both in the key of the dictionary,
and in the object itself).
Differential Revision: D14603723
fbshipit-source-id: b5f7551b3074679623edd6ea70269830353b4d4c
Summary:
If a test triggers autodiff, it must have a `DifferentiableGraph` in its differentiated forward graph, and this subgraph must have either the original aten node, or the corresponding nodes used in AD formula.
Typically a forward differentiable graph looks like this:
```
graph(%i0 : Float(),
%i1 : Float()):
%3 : Float() = prim::DifferentiableGraph_0(%i0, %i1)
return (%3)
with prim::DifferentiableGraph_0 = graph(%0 : Float(),
%1 : Float()):
%2 : Float() = aten::max(%0, %1)
return (%2)
```
which tells us `aten::max(Tensor self, Tensor other) -> Tensor` is symbolically differentiable.
Update: there are a lot of cases (fusions/ConstantChunk/python implementations) that break it, so I decided to make the check optionally take node names if they differ from the function name.
~~[OLD]Theoretically I could also check if `aten::max` is in the differentiable block or not to be more precise, but there're also cases like `chunk` where in a differentiable block it's replaced with a prim node (ConstantChunk) and we will have to special case them. Any suggestions here (to be more precise or no) is very welcome!~~
We used to have a list of the nn tests that should be run against AD; I moved it to a field set when constructing our tests (either torch or nn). I think it's cleaner this way, and it matches the fact that for the same op we may support one schema but not all; this way we can just turn on the corresponding test which triggers the supported schema.
cc: apaszke zdevito wanchaol ngimel for a review
[Done]:
- Going through a manual second pass of all tests to check whether they should enable the AD test or not.
- Add a readme about how to add AD for an op and how to add/enable its test in test_jit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18509
Differential Revision: D14696811
Pulled By: ailzhang
fbshipit-source-id: c5e693277baac585cd3aed5ab2c0e7faa5e6f29f
Summary:
Dropout is now eligible for fusion, and the generated fused kernels are just as fast as dropout in ATen. Change its lowering in symbolic script so that it can actually be fused. Still special-cased for CUDA, because without fusion this lowering is less efficient than the current one (bernoulli_ * input). Testing is covered by the test case that ailzhang added (test_dropout_cuda).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18375
Differential Revision: D14611938
Pulled By: soumith
fbshipit-source-id: 11b18f4784e6c9265e382a8f8deca7add8df3b37
Summary:
This PR did two things:
1. Enable scalar->float specialization in symbolic script, so AD formulas that contain a scalar in the schema should write `float` instead.
2. Add addcmul and lerp to AD and the fuser.
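A small sketch (not from the PR; names are my own) exercising the two newly covered ops through TorchScript:
```
import torch

@torch.jit.script
def addcmul_lerp(x, y, z):
    # addcmul and lerp gain AD formulas (and become fusable) in this PR
    return torch.addcmul(x, y, z, value=0.5).lerp(z, 0.3)

a, b, c = (torch.randn(4, 4, requires_grad=True) for _ in range(3))
addcmul_lerp(a, b, c).sum().backward()
```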
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18081
Differential Revision: D14490493
Pulled By: wanchaol
fbshipit-source-id: b3b86d960d5f051b30733bc908b19786111cdaa4
Summary:
In ATen we have a `_fused_dropout` implementation for the CUDA case. As ngimel pointed out, discarding it in JIT AD hurts performance.
It doesn't seem ideal to include a backend-specific implementation in AD, but it helps prevent a performance regression for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17756
Differential Revision: D14368999
Pulled By: ailzhang
fbshipit-source-id: 9a371c5020f630e8f6e496849ec9772b6f196169
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18037
The FunctionSchema can now store an overload name and the parser knows how to parse it. Specify like this:
my_func.overload1(arg1: Tensor) -> Tensor
my_func.overload2(arg1: Tensor, arg2: Tensor) -> Tensor
Reviewed By: zdevito
Differential Revision: D14467497
fbshipit-source-id: 8832b32f07351bb61090357b17b77a6a2fed3650
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17585
Create a sugared value that represents a class during initialization. This is
so that assignments to attributes correctly define attributes in __init__ but
raise an error elsewhere.
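A hypothetical sketch (not code from the PR; class and method names are mine) of the behavior this enables for TorchScript classes:
```
import torch

@torch.jit.script
class Accumulator(object):
    def __init__(self):
        # Assignments here define new attributes: allowed inside __init__
        self.total = 0

    def add(self, x: int):
        # Re-assigning an existing attribute is fine, but defining a
        # brand-new attribute outside __init__ would raise an error, e.g.:
        # self.other = 1
        self.total = self.total + x
```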
Reviewed By: shannonzhu
Differential Revision: D14263403
fbshipit-source-id: 09b2feeb272302f00a79c2a0302fbdf5483aed6a
Summary:
Temporarily disable them for perf reasons. Will figure out a way to do `torch.zeros(sizes, grad.options())` in torchscript before enabling these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17470
Differential Revision: D14210313
Pulled By: ailzhang
fbshipit-source-id: efaf44df1192ae42f4fe75998ff0073234bb4204
Summary:
This PR removes a few `self` sizes that were passed from the forward pass to the backward pass when `self` is already required in the backward pass. This could be the reason for the potential slowdown in #16689. I will attach a few perf numbers (still a bit volatile across runs, though) in a comment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17187
Differential Revision: D14179512
Pulled By: ailzhang
fbshipit-source-id: 5f3b1f6f26a3fef6dec15623b940380cc13656fa
Summary:
This provides the minimum necessary to allow derivative formulas for things that have a kwarg-only specifier in their schema. Support for non-parser frontend default arguments for kwargs is not yet complete.
Fixes #16921
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17339
Differential Revision: D14160923
Pulled By: zdevito
fbshipit-source-id: 822e964c5a3fe2806509cf24d9f51c6dc01711c3
Summary:
Add the missing `std` introduced by #16689. Investigating why this wasn't caught in CI (or in my local dev environment).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17263
Reviewed By: ezyang
Differential Revision: D14134556
Pulled By: ailzhang
fbshipit-source-id: 6f0753fa858d3997e654924779646228d6d49838
Summary:
Trying to land again: make prim::None into a case of prim::Constant. The previous landing was reverted because it broke an important onnx export test.
https://github.com/pytorch/pytorch/pull/16160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17186
Differential Revision: D14115304
Pulled By: eellison
fbshipit-source-id: 161435fc30460b4e116cdd62c7b2e5b94581dcb7
Summary:
This change simplifies the analysis done on constants, since prim::None no longer needs to be handled separately. To check whether a constant node is None, use `node->isNone()`.
Next step will be to remove prim::Undefined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16160
Differential Revision: D14109636
Pulled By: eellison
fbshipit-source-id: d26fd383976163a2ddd4c24984bd672a541cc876
Summary:
The main problem with differentiating batch norm statically
is that we make a lot of complex run-time decisions about which backend
we choose. The autograd derivatives are then implemented for every
backend separately, which makes sense, because they might be saving
buffers containing different values. To resolve the issue, the forward
op returns the index of the chosen backend, and the backward function
takes it as an argument, so that it knows how to interpret the buffers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15403
Differential Revision: D14098815
Pulled By: ailzhang
fbshipit-source-id: 7fcd3e6e0566433e81fe8286fb441c1ecaf198ad
Summary:
When adaptive pooling has to produce a single-pixel feature map, it is faster to do so by calling .mean(). Backward calls a pretty inefficient cuda kernel with atomics, which becomes ridiculously slow for half precision. For half, this PR provides an approx 30x speed-up for adaptive average pooling, which results in a 30% end-to-end speed-up on senet. Improvements are smaller for float, but still significant (approx 5x).
Also this PR unifies handling of 3d (no batch dimension) and 4d tensors, using negative dimension indices.
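The observation behind the optimization, sketched in eager mode (my own illustration, not the PR's actual code path):
```
import torch
import torch.nn.functional as F

x = torch.randn(8, 64, 7, 7, requires_grad=True)

# A 1x1 output of adaptive average pooling is just a mean over H and W,
# so it can be dispatched to .mean(), whose backward is far cheaper.
pooled = F.adaptive_avg_pool2d(x, 1)
via_mean = x.mean(-1, keepdim=True).mean(-2, keepdim=True)

assert torch.allclose(pooled, via_mean)
```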
cc ezyang for review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17011
Reviewed By: ailzhang
Differential Revision: D14078747
Pulled By: soumith
fbshipit-source-id: 0eb9255da2351190a6bcaf68c30e2ae2402a2dd9
Summary:
- Moved a few functions from the `autograd` namespace to the `aten` namespace so they are visible from the JIT nativeResolver.
- Added a hack to look up keyword-only arguments. Will add proper support for kwarg-only later.
- Simulate function overloading in aten using `_<number>` as a function name suffix.
- Even when `forward` returns multiple outputs, as in `kthvalue`, there is at most one output requiring grad that we currently support.
- Removed the `TensorList`-related ops here since partial `TensorList` support is prone to bugs. Our symbolic diff for `cat` was never tested with autodiff, and it seems broken. We need to find a proper way to support these ops (either by properly supporting `TensorList` or something like `prim::ConstantChunk`) and leave them for the next PR.
Ops supported in this PR:
```
erf
expand_as
index
kthvalue
mean
permute
pow
rsub
select
sqrt
squeeze
t
to
topk
transpose
view
var
embedding
logsumexp
// grad is None
_dim_arange
contiguous
nonzero
ones_like
```
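A small sketch (not from the PR; the function name is mine) touching a few of the listed ops inside TorchScript:
```
import torch

@torch.jit.script
def ops_example(x):
    # permute, sqrt, and mean are among the ops gaining AD formulas here
    return x.permute([1, 0]).sqrt().mean()

inp = torch.rand(3, 4, requires_grad=True)
ops_example(inp).backward()
```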
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16689
Differential Revision: D14020806
Pulled By: ailzhang
fbshipit-source-id: a5e2c144a7be5a0d39d7ac5f93cb402ec12503a5
Summary:
Here is a fresh attempt at getting some fusion back in autodiff-generated graphs in the presence of SumToSize.
- The sum to size operator is now `aten::_grad_sum_to_size` to allow symbolic script differentiation (and that in turn would need to use this in place of sum_to_size to signal that it strictly operates on gradients). This is also used in the autodiff code, replacing `prim::SumToSize`.
- `_grad_sum_to_size` is now fusable; `cat`s - which are fused afterwards thanks to Adam's simplification of the code - are only fused if there is no `_grad_sum_to_size` in the fusion group.
- I push the `_grad_sum_to_size` out of the fusion group when compiling and record the desired summations in the KernelSpec. The reasoning is the following:
- As the autodiff is a repeated application of the chain rule, we always have the pattern `grad_in = mm(A, grad_out)`, with A often diagonal for cases interesting to the fuser, whence it is `grad_in = a * grad_out` (a pointwise multiplication). We know that only `grad_out` may have AutodiffGradSumToSize applied, so we can commute AutodiffGradSumToSize with the `mul` (and `div` and `neg` are of similar origin).
- For `type_as` the gradient might be giving the type, so just skip SumToSize,
- `add` (which was inserted as `prim::AutogradAdd`) adding gradients when the forward used the same value in several places. This is non-broadcasting, so we know that the two arguments would have the same sizes as inputs - which is good so we don't have to do bookkeeping of the two parts.
Details:
- During fusion, the Tensor arguments are always kept as the first parameters of the fusion group to accommodate indexing assumptions in the fuser.
- The rewriting of the fusion group to record the necessary output transformation and eliminate `_grad_sum_to_size` from the fusion group is now in the fuser compile step.
- In the execution step, the arguments are split into Tensor / Non-Tensor and the non-tensor args are mostly forgotten about except for doing `sum_to_size` at the end. This would want to be improved if/when we fuse nonconstant scalar arguments.
- In a number of places in the fuser, the non-Tensor arguments to the fusion group needed to be ignored.
Thank you, apaszke for the insightful discussion. All bad ideas and errors are my own.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14957
Differential Revision: D13888173
Pulled By: zou3519
fbshipit-source-id: 071992c876e8b845f2b3e6329ae03a835d39a0ea
Summary:
The PR clang-formats everything in `torch/csrc/jit/` and adds it to the pre-commit hook.
Here is a list of non-mechanical changes:
- I went over each file and fixed up whenever I could tell that clang-format was clobbering comment formatting.
- Made the macros in register_prim_ops a little more clang-format friendly by omitting trailing commas
- Refactored autodiff.cpp to use a helper class with explicit state rather than a bunch of capturing lambdas
- Small improvements to the precommit hook clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15524
Differential Revision: D13547989
Pulled By: suo
fbshipit-source-id: 3ff1541bb06433ccfe6de6e33f29227a2b5bb493
Summary:
This adds AD support for adaptive_avg_pool2d, which is necessary for resnet50 in pytorch/vision:master. cc: soumith asuhan dlibenzi
apaszke I saw the autodiff bug you fixed in #15403; since it doesn't prevent this PR from passing, I'll leave it for your PR to fix. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15459
Differential Revision: D13534732
Pulled By: ailzhang
fbshipit-source-id: 4e48b93e35d5ecfe7bd64b6a132a55b07843f206
Summary:
This PR adds enough of the infra for supporting closures (inner script functions) to allow us to express symbolic gradients using them. We do not actually ever run graphs that contain these closures. The symbolic_script infrastructure just extracts them out of the original forward graph and turns them into discrete forward/backward pairs. This cuts down on the type annotations necessary to write forward/backward pairs and aligns closely with the "differentiator" function approach to expressing reverse-mode AD.
Example:
This code:
```
import torch
r = torch.jit.CompilationUnit(
'''
def mul_forward(self, other):
def backward(grad_output):
grad_self = (grad_output * other).sum_to_size(self.size())
grad_other = (grad_output * self).sum_to_size(other.size())
return grad_self, grad_other
return self * other, backward
''')
print(r.module.code)
```
Will produce this graph (pretty printed for clarity):
```
def mul_forward(self,
self: Tensor,
other: Tensor) -> Tuple[Tensor, Tuple[None, Tuple[Tensor, Tensor]]]:
backward = (self.__lambda, (other, self))
return (torch.mul(self, other), backward)
def __lambda(self,
context: Tuple[Tensor, Tensor],
grad_output: Tensor) -> Tuple[Tensor, Tensor]:
other, self, = context
grad_self = torch.sum_to_size(torch.mul(grad_output, other), torch.size(self))
grad_other = torch.sum_to_size(torch.mul(grad_output, self), torch.size(other))
return (grad_self, grad_other)
```
symbolic_script will then do some modifications to remove the unsupported prim::Function node, yielding:
```
def mul_forward(self,
self: Tensor,
other: Tensor) -> Tuple[Tensor, Tuple[None, Tuple[Tensor, Tensor]]]:
return (torch.mul(self, other), (other, self))
def backward(self,
context: Tuple[Tensor, Tensor],
grad_output: Tensor) -> Tuple[Tensor, Tensor]:
other, self, = context
grad_self = torch.sum_to_size(torch.mul(grad_output, other), torch.size(self))
grad_other = torch.sum_to_size(torch.mul(grad_output, self), torch.size(other))
return (grad_self, grad_other)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15411
Differential Revision: D13523340
Pulled By: zdevito
fbshipit-source-id: 4d4a269460e595b16802c00ec55ae00e3e682d49
Summary:
This PR enables autodiff to use the forward/backward graphs compiled from Python code, instead of using symbolic gradients (modifying the original graph directly).
We put the map in a separate .h file for now to wait for the native_functions.yaml and derivatives.yaml merge. This should ideally go into native_functions.yaml eventually.
This PR should be enough to unblock us for now, we can start writing gradients for aten functions in python.
Differential Revision: D13494635
Pulled By: ailzhang
fbshipit-source-id: f8d51a15243ac46afd09d930c573ccdfcd9fdaaf