Summary:
There is a `2to3` fixer called `future` which you can target specifically to remove these redundant `from __future__` imports; the `caffe2` directory has the most of them:
```
2to3 -f future -w caffe2
```
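For context, a minimal before/after illustration of what the `future` fixer removes (the module below is made up for illustration):
```
# Before running the fixer, a typical module still carries a Python 2
# compatibility import that is a no-op on Python 3:
from __future__ import absolute_import, division, print_function, unicode_literals

def half(x):
    return x / 2  # true division is already the default on Python 3

# After `2to3 -f future -w <file>`, the `from __future__ import ...` line
# is simply deleted; the rest of the module is unchanged.
```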
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33977
Removing python2 from operator_test so we can retire python2 support for PyTorch.
Test Plan: waitforsandcastle
Reviewed By: seemethere
Differential Revision: D20129500
fbshipit-source-id: d4c82e4acfc795be9bec6a162c713e37ffb9f5ff
Summary:
The goal of this PR is to unify the CUDA and HIP device types in the Caffe2 Python front end.
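A rough sketch of the device-agnostic call sites this unification enables; `workspace.GpuDeviceType` and `workspace.has_hip_support` are assumed to be the aliases introduced around this change, so treat the exact names as assumptions:
```
# Sketch: pick whichever GPU device type the build provides (CUDA or HIP),
# falling back to CPU; assumes a GPU device 0 exists when GPU support is built.
from caffe2.python import core, workspace
from caffe2.proto import caffe2_pb2

device_type = caffe2_pb2.CPU
if workspace.has_gpu_support or workspace.has_hip_support:
    device_type = workspace.GpuDeviceType  # assumed alias: CUDA or HIP

with core.DeviceScope(core.DeviceOption(device_type, 0)):
    net = core.Net("unified_device_example")
    net.ConstantFill([], "x", shape=[2, 2], value=1.0)

workspace.RunNetOnce(net)
print(workspace.FetchBlob("x"))
```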
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14221
Differential Revision: D13148564
Pulled By: bddppq
fbshipit-source-id: ef9bd2c7d238200165f217097ac5727e686d887b
Summary:
The pytorch.org site redirects all of the http:// requests to the https:// site anyway, so the comments and error messages might as well refer directly to the https:// site. The GitHub project description should also be updated to point to https://pytorch.org
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12636
Differential Revision: D10377099
Pulled By: soumith
fbshipit-source-id: f47eaba1dd3eecc5dbe62afaf7022573dc3fd039
* Scope MultiRNN blobs with name as well as layers
Also don't double-scope MultiRNN in the case of multiple layers.
* Scope input projection of first layer with name
We don't scope it with layers because the projection is done
outside of the layer.
* Avoid scoping input blob in MemongerTest.test_rnn
* Rectify input_blob in prepare_input
Revert the change in memonger_test because rectifying the input solves the problem.
Summary:
There is a long-standing scoping problem which was introduced in the original Python wrappers early in H1: each RNNCell implementation has to manually scope the outputs of each of its operators. If somebody forgets, there can be subtle bugs with stacked layers, etc.
The approach is the following: the user has to explicitly specify the current scope when using apply_over_sequence and similar functions if the function is going to be called several times (e.g. when stacking layers). This way we use Caffe2's native scoping approach instead of inventing one extra API people have to use (i.e. passing a scope name as an argument to the RNNCell constructor).
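A hedged sketch of the resulting usage pattern; the LSTMCell and apply_over_sequence argument names are approximated from rnn_cell.py of this era and may not match exactly:
```
# The caller wraps each layer in a Caffe2 NameScope instead of passing a
# scope name into the RNNCell constructor.
import numpy as np
from caffe2.python import core, model_helper, rnn_cell, workspace

T, N, D, H = 4, 2, 8, 8
model = model_helper.ModelHelper(name="stacked_lstm")
workspace.FeedBlob("inputs", np.random.randn(T, N, D).astype(np.float32))
workspace.FeedBlob("seq_lengths", np.full((N,), T, dtype=np.int32))

layer_input, dim_in = "inputs", D
for layer in range(2):
    with core.NameScope("layer_{}".format(layer)):
        cell = rnn_cell.LSTMCell(
            input_size=dim_in, hidden_size=H,
            forget_bias=0.0, memory_optimization=False, name="lstm",
        )
        init_h, init_c = model.net.AddExternalInputs("init_h", "init_c")
        workspace.FeedBlob(str(init_h), np.zeros((1, N, H), dtype=np.float32))
        workspace.FeedBlob(str(init_c), np.zeros((1, N, H), dtype=np.float32))
        # The surrounding NameScope keeps each layer's blobs from colliding.
        outputs = cell.apply_over_sequence(
            model=model, inputs=layer_input,
            seq_lengths="seq_lengths", initial_states=[init_h, init_c],
        )
        layer_input, dim_in = outputs[0], H  # hidden_all feeds the next layer
```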
Closes https://github.com/caffe2/caffe2/pull/1681
Differential Revision: D6777536
Pulled By: salexspb
fbshipit-source-id: 73d860b8d4857589e04bdea5a6fcd3080d68427c
Summary:
When the sequence length input is left out, each sequence is treated as having
a length equal to the first dimension of the input tensor. This matches the
semantics of ONNX for that case.
Closes https://github.com/caffe2/caffe2/pull/1764
Reviewed By: dzhulgakov
Differential Revision: D6751219
Pulled By: anderspapitto
fbshipit-source-id: 89e0efd12339157627494e2b8c83e952bdd8a9f8
Summary: A version of MILSTMCell which uses layer normalization (see https://arxiv.org/pdf/1607.06450.pdf). There's a lot of copypasta because we don't want to make the existing RNNCell classes harder to approach / understand by adding new options.
Differential Revision: D6564208
fbshipit-source-id: 0bc43e12b6c08ebdf5ea6af2c631f785c302bdb4
Summary:
Adds a new `LSTMCell` subclass to the `rnn_cell` module that performs layer normalization on the fused input matrix. Moves around some code in `rnn_cell.py` to avoid copy-pasta. Adds relevant test cases to `rnn_cell_test.py`.
Had to fix `brew.layer_norm` first. See T24013870.
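For reference, a small numpy sketch of layer normalization applied to the fused gate pre-activations (illustrative only; the real cell uses Caffe2's LayerNorm op):
```
# Normalize each row of the fused [i, f, o, g] gate pre-activations to zero
# mean / unit variance, then apply a learned scale and shift.
import numpy as np

def layer_norm(x, gain, bias, eps=1e-4):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

N, H = 2, 3
fused = np.random.randn(N, 4 * H).astype(np.float32)  # W x_t + U h_{t-1} + b
gain = np.ones(4 * H, dtype=np.float32)
bias = np.zeros(4 * H, dtype=np.float32)
print(layer_norm(fused, gain, bias).shape)  # (N, 4*H)
```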
Reviewed By: jhcross
Differential Revision: D6454883
fbshipit-source-id: 0f4ea7a778cc5be6a7274f7b28c793f5dd7c6095
Summary: The cudnn version of the DropoutOp was taking a significant (and unwarranted) amount of time in our RNN training. Further investigation showed that setting the cudnn dropout descriptors was an extremely expensive operation (https://pxl.cl/99nT), much more so than the dropout operation itself. This diff adds an option to DropoutCell to disable cudnn. The non-cudnn version uses a raw curand call that elides all of the expensive descriptor setting.
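A hedged construction sketch; the use_cudnn flag is the option this diff describes, and the other constructor arguments are assumptions based on rnn_cell.py:
```
# Wrap an LSTMCell in a DropoutCell and opt out of the cuDNN dropout path,
# so the cheaper curand-based dropout is used instead.
from caffe2.python import rnn_cell

lstm = rnn_cell.LSTMCell(
    input_size=256, hidden_size=256,
    forget_bias=0.0, memory_optimization=False, name="lstm",
)
cell = rnn_cell.DropoutCell(
    internal_cell=lstm, dropout_ratio=0.2,
    use_cudnn=False, name="lstm_dropout",
)
```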
Reviewed By: jmp84, akyrola
Differential Revision: D5972022
fbshipit-source-id: 6325ec5d6569f8b94d776cbb2554cc8ddb28f699
Summary: This caused gradient generation problems. Output was made in-place in PR-1185, by mistake, I believe.
Differential Revision: D5844825
fbshipit-source-id: 4ad84d0fb468aafde9f78463b9acf89316e633ca
Summary: As titled. I wonder why this had not been encountered before. It only affects cases where the states are copied over, though.
Reviewed By: Yangqing
Differential Revision: D5777314
fbshipit-source-id: 8aef435c832e4ead5bb3d3e35bb065c734a2af5f
Summary:
Implementation of a new variant of the attention module, which contains a recurrent decoder state with vectors corresponding to each source-side word and strictly increasing values, thus enabling it to model the degree to which source words have been translated.
The approach is a variant of the approaches described in https://arxiv.org/pdf/1601.04811.pdf. We simply include the sum of all previous attention weights for encoder words as a new recurrent state (coverage_t). A new linear transform on encoder_outputs is used to produce coverage_weights, which has the same dimensionality as encoder_outputs and implicitly models the fertility of source-side words (placing this extra informational strain on the encoder network).
Thus the encoder output, the decoder state, and the coverage weights have the same dimensionality for a given source word, and attention logits are calculated as v * tanh(coverage * coverage_weights + encoder_output + decoder_state).
Note: the entire coverage state for each translation instance is of shape (encoder_length, coverage_units), but the states for the RecurrentNetwork operator, used to train the decoder, must be flat in the data dimension. This state is therefore initialized with shape (encoder_length * coverage_units) [not shown in the open-source library] and reshaped appropriately within the apply_soft_coverage_attention() function.
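A small numpy sketch of the logit computation described above; shapes, names, and the coverage update are illustrative:
```
# attention_logit_t[i] = v . tanh(coverage_t[i] * coverage_weights[i]
#                                 + encoder_output[i] + decoder_state)
import numpy as np

enc_len, units = 5, 8
encoder_output = np.random.randn(enc_len, units)
coverage_weights = np.random.randn(enc_len, units)  # linear transform of encoder_outputs
decoder_state = np.random.randn(units)
coverage = np.zeros((enc_len, units))                # running sum of attention weights
v = np.random.randn(units)

logits = np.tanh(coverage * coverage_weights + encoder_output + decoder_state) @ v
attention = np.exp(logits) / np.exp(logits).sum()    # softmax over source positions
coverage += attention[:, None]                       # recurrent coverage update
```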
Differential Revision: D5593617
fbshipit-source-id: 7d0522b5eb0b26f22e8429e4461a459f2f16ed46
Summary:
The _LSTM helper is a legacy piece we had before all the RNNCell awesomeness landed. Now we need to pull it apart and create separate building blocks that people can use for any RNNs.
Please note the changes to a test with double scoping. That should go away once we change the RNNCell scoping logic so that each cell adds its own name to the scope for all of its outputs (see another diff: D5613139).
Reviewed By: jhcross
Differential Revision: D5632276
fbshipit-source-id: 1cb568ab995c4c0b3dd1b4bad2d028e34bded9c1
Summary:
Forward-only mode had broken at some point. Two things: RNNCell did not pass the parameter to recurrent.py, and recurrent.py itself was broken when forward_only=True after the Python 3 codemod.
Added a test to rnn_cell_test that actually checks the forward_only parameter is passed, to prevent future breakage.
Reviewed By: jmp84
Differential Revision: D5639306
fbshipit-source-id: b1bbc39d59c3f3734b2f40a1c2f3740c733e0bd4
Summary: GRU differs from LSTM in that it only has hidden states but no cell states. So reusing the code of _LSTM is problematic: we need to delete the part that creates the cell state, and change many other places that use a hard-coded 4 states (hidden_all, hidden, cell_all, cell) into 2 (hidden_all, hidden). Otherwise GRU will break during the backward pass, when the optimizer tries to apply gradients to each of the parameters, because the cell state is never used and so there are no gradients for the corresponding parameters (i.e., cell_state_w, cell_state_b).
Differential Revision: D5589309
fbshipit-source-id: f5af67dfe0842acd68223f6da3e96a81639e8049
Summary:
Implement dot attention as described in https://arxiv.org/abs/1508.04025
This saves the computation of weighted encoder outputs in `rnn_cell.py`.
When the encoder and decoder dimensions are different, we apply an FC, which corresponds to the "general" case below Figure 2 of the paper.
Refactored unit tests.
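A numpy sketch of the dot-attention scoring; the FC for mismatched dimensions is shown as a plain matrix multiply, and all names are illustrative:
```
# Luong-style dot attention: logits are direct dot products between the
# decoder state and each encoder output, so no separate "weighted encoder
# outputs" blob needs to be precomputed.
import numpy as np

enc_len, enc_dim, dec_dim = 6, 8, 4
encoder_outputs = np.random.randn(enc_len, enc_dim)
decoder_state = np.random.randn(dec_dim)

# "general" case: project the decoder state when dimensions differ.
W = np.random.randn(dec_dim, enc_dim)
projected = decoder_state @ W if enc_dim != dec_dim else decoder_state

logits = encoder_outputs @ projected              # (enc_len,)
weights = np.exp(logits) / np.exp(logits).sum()   # softmax over source positions
context = weights @ encoder_outputs               # attention-weighted context
```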
Reviewed By: jhcross
Differential Revision: D5486976
fbshipit-source-id: f9e9aea675b3b072fbe631bc004199b90a9d95cb
Summary: When creating parameters for a ModelHelper, we should use create_param instead of using param_init_net and model.params directly. This diff rewrites some of these cases in rnn_cell.py in order to keep model._parameter_info and model.params consistent.
Reviewed By: kittipatv
Differential Revision: D5477724
fbshipit-source-id: 28c4aaf8f98d9d89125af6a42ad328008f0079e1
Summary:
In order to get dimensions right, correctly identify gradients, etc., DropoutCell should delegate its own _prepare_output and _prepare_output_sequence methods to those of its internal cell.
This bug was identified by NVIDIA intern Syed Tousif Ahmed.
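The fix amounts to a delegation pattern along these lines (shown as a standalone subclass for illustration; the method signatures and the internal_cell attribute are assumptions, not the exact diff):
```
from caffe2.python import rnn_cell

class DelegatingDropoutCell(rnn_cell.DropoutCell):
    # Forward output preparation to the wrapped cell, so shapes and gradient
    # bookkeeping come from the internal cell rather than from defaults.
    def _prepare_output(self, model, states):
        return self.internal_cell._prepare_output(model, states)

    def _prepare_output_sequence(self, model, state_outputs):
        return self.internal_cell._prepare_output_sequence(model, state_outputs)
```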
Reviewed By: akyrola
Differential Revision: D5483082
fbshipit-source-id: f6df5b4a0502ed0771056638aab219fb5cc7d964
Summary:
For RNN attention, we should not include the invalid parts of the encoder output (based on encoder_lengths) in the computation. This diff accomplishes that by forcing the logits for those positions to negative infinity.
Note that this step can be bypassed by passing encoder_lengths=None, which is what we do for beam search, thus incurring no extra overhead for inference.
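A numpy sketch of the masking arithmetic; the actual implementation builds this out of Caffe2 ops inside the attention net:
```
# Positions past each sequence's encoder length get -inf logits, so they
# receive exactly zero attention weight after the softmax.
import numpy as np

max_len, batch = 5, 2
logits = np.random.randn(batch, max_len)
encoder_lengths = np.array([5, 3])

mask = np.arange(max_len)[None, :] >= encoder_lengths[:, None]
logits = np.where(mask, -np.inf, logits)
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
# rows of `weights` now sum to 1, with zeros at the padded positions
```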
Reviewed By: jamesr66a
Differential Revision: D5402547
fbshipit-source-id: 1863d6050b5129e4df829c6357f0aa9ded0715dc
Summary: Adding a test to check computational integrity of networks constructed with AttentionCell using UnrolledCell.
Reviewed By: salexspb
Differential Revision: D5306915
fbshipit-source-id: 02acfd1011f7d3ee5fac21cc2778c4a486190c43
Summary:
This diff fixes the gradient computation of residual connections for a training network constructed with MultiRNNCell.
It addresses a logic bug in _prepare_output() and _prepare_output_sequence() by keeping track internally of which layers have consecutive residual connections before the output, then reconstructing the final residual output by (re-)preparing the output of each of those layers and combining them with a Sum operation. This also involves keeping track of which states contribute to the reconstruction of the final sequence output, so that outputs_with_grads can be passed correctly to apply_over_sequence().
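Conceptually, the final output is rebuilt as a Sum over the prepared outputs of the residual layers, roughly like this numpy sketch:
```
import numpy as np

# Prepared outputs of the last layers that are chained by residual ('add')
# connections; the final output is their elementwise sum (a Sum op in Caffe2),
# so the gradient has to flow into every one of them.
prepared = [np.random.randn(4, 8) for _ in range(3)]
final_output = np.sum(prepared, axis=0)
```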
Differential Revision: D5300520
fbshipit-source-id: f37d800c909e631175de7045abe192351cc11c41
Summary:
While this is not intended to be the most performant or general solution, the
test plan shows that in some cases a static DAG RNN can perform better than
our own implementation. Hopefully we will get dynamic RNN DAG execution to be
at least as fast as this one; then we will not need this one in production,
only for testing.
Still putting it into our benchmark for comparison purposes.
Reviewed By: akyrola
Differential Revision: D5210038
fbshipit-source-id: fa44baf51c455872abd6ec5f5d151cf06e15b1fa
Summary: I accidentally noticed that we were calling the non-CUDNN version of Transpose with attention, and it is super slow. This broke when rnn_cell was changed to use ModelHelper instead of CNNModelHelper in D5062963, but the calls to Transpose were not "brewed".
Reviewed By: jamesr66a
Differential Revision: D5264248
fbshipit-source-id: b61494ae210f34597245f1195d20547f5b5cd8b5
Summary:
Static RNN allows unrolling an RNN into a Caffe2 graph using all the existing cell abstractions. In this diff I introduce several new tests that have already caught a few bugs in our RecurrentNetworkOp gradient accumulation logic by comparing it to an unrolled version.
Another use case is perf: potentially we can run an unrolled net faster because DAGNet will have access to the whole graph. The same goes for memonger. But that work is not part of this diff.
Reviewed By: akyrola
Differential Revision: D5200943
fbshipit-source-id: 20f16fc1b2ca500d06ccc60c4cec6e81839149dc
Summary: Use new blob as residual sum output, and add scoping to prevent any name conflicts.
Reviewed By: urikz
Differential Revision: D5167145
fbshipit-source-id: a01c87ed2278205e95e8395314b166afb1dca1b3
Summary: Added a new RNNCell, DropoutCell, which wraps an existing RNNCell and applies dropout to its primary output (as defined by get_output_state_index()).
Reviewed By: salexspb
Differential Revision: D5084871
fbshipit-source-id: 60474af84e5757a12e7fdc3814840dc9ba8e32a1
Summary: As noted by salexspb, MultiRNNCell had unreliable gradient computation. The problem was that the recurrent gradient and the gradient computed within the backward step net were not being accumulated during the backward pass, but rather written to the same blob, thus overwriting each other. This diff fixes that by artificially introducing an extra blob for the internal output and then accumulating it into the gradient coming from the recurrent connection.
Reviewed By: salexspb
Differential Revision: D5110059
fbshipit-source-id: 16add50989fe8866361bbc21afce5f214c5292fd
Summary:
Incorporate arbitrary dropout for encoder and decoder layers for Caffe2 NMT models using current configuration. This involves separate output processing (_prepare_output() and _prepare_output_sequence()) for the final layer in a MultiRNNCell.
Switching to using the newly introduced forward_only switch for RNN cells revealed an unrelated bug in our NetGradientChecker test, which urikz is investigating.
Reviewed By: salexspb
Differential Revision: D5031964
fbshipit-source-id: 19b49607d551aa3e2140041ef4e585f128c8f178
Summary:
Residual connections for multilayer RNN encoder/decoder for Caffe2 NMT model. Only supporting 'add' connections (the standard approach, which ves's TF experiments concluded was at least as good as other approaches), and also only implementing for residual_level >= 1 (which also fits our use case).
It is the responsibility of the config to ensure dimension compatibility: each level at and beyond residual_level (in both the encoder and decoder) should have the same number of units, with the exception that a bidirectional initial encoder layer should have half the number of units of the succeeding layer if that next layer is a residual layer.
Differential Revision: D5023160
fbshipit-source-id: f38c1b140638fee78cf3ef7d6b4602dd462484ee
Summary:
Update rnn_cell.py and the char_rnn.py example to use the new `brew` model.
- Deprecate CNNModelHelper
- Replace all helper functions with brew helper functions
- Use the `model.net.<SingleOp>` format to create bare-bones operators for better clarity (see the sketch below).
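A small sketch of the two patterns this update uses; the layer sizes and blob names are arbitrary:
```
# brew helpers for layers that create parameters; bare model.net.<SingleOp>
# calls for parameter-free operators.
import numpy as np
from caffe2.python import brew, model_helper, workspace

model = model_helper.ModelHelper(name="brew_example")
workspace.FeedBlob("data", np.random.randn(1, 64).astype(np.float32))

fc1 = brew.fc(model, "data", "fc1", dim_in=64, dim_out=32)  # creates W and b
relu1 = model.net.Relu(fc1, "relu1")                        # bare-bones operator

workspace.RunNetOnce(model.param_init_net)
workspace.RunNetOnce(model.net)
print(workspace.FetchBlob("relu1").shape)  # (1, 32)
```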
Reviewed By: salexspb
Differential Revision: D5062963
fbshipit-source-id: 254f7b9059a29621027d2b09e932f3f81db2e0ce