Compare commits

...

3935 Commits

SHA1 Message Date
0b92e5c9ed fix static linkage and make THD statically linked 2017-08-28 10:41:55 -04:00
df44c571c6 increase test subprocess timeout 2017-08-25 09:00:36 -07:00
ed03f74043 fix leaking symbols in THNN 2017-08-25 09:00:35 -07:00
c8d8803b90 Remove unnecessary moves in convolution autograd. 2017-08-25 09:00:35 -07:00
750245f990 Remove unnecessary moves, avoid IncRef/DecRef of PyBools. 2017-08-25 09:00:35 -07:00
fbf573a6b8 Properly pass saved_for in BatchNorm/Conv as the relevant Backward function.
Previously, these Functions passed themselves, i.e. the saved_for from
ConvForward would be ConvForward.
2017-08-25 09:00:35 -07:00
9bda6dee8f Add AutoGPU guard and properly reference Python args from BatchNormBackwardBackward. 2017-08-25 09:00:35 -07:00
e02f7bf8a3 Update autograd notes (#2295) 2017-08-04 20:28:04 -04:00
7ea48eaf7a cuda 7.5 fix for gloo 2017-08-04 02:25:16 -04:00
d278a14141 Fix ZeroPad2d backwards with negative pads. 2017-08-03 21:16:37 -04:00
6b1ca4b4b6 fix variable shape error in LSTMCell, GRUCell (#2289) 2017-08-03 21:16:06 -04:00
65ddaf13a9 Improve cuDNN weight layout test 2017-08-03 02:06:27 -04:00
96156013c3 Make sure deserialized RNN modules have _data_ptrs too 2017-08-03 02:06:21 -04:00
a997cdbb25 Fix BatchNorm double backwards when training=False.
Changes for v.0.2.0 around using shared_ptrs rather than at::Tensors.
2017-08-03 10:47:34 +05:30
8db9df94b6 Merge commit '74e5328b03634e163df65d6c6877c6f03387b536' 2017-08-02 22:51:17 -04:00
6c9e3334b1 Merge commit '70c95dbe52102d70facf7fc5d31cb8bd9ae860d9' 2017-08-02 22:50:52 -04:00
b33f232678 disable cudnn when output_padding >= stride or dilation 2017-08-02 22:48:03 -04:00
058f50aa50 fix shape and correctness bugs in autograd/convolution BackwardBackward 2017-08-02 22:48:03 -04:00
8b06efea7a remove dead code for python ConvNd (moved to C already) 2017-08-02 22:48:03 -04:00
52b7a49b37 enable cudnn transposed dilated 2017-08-02 22:48:03 -04:00
47f4d549e0 refactoring the THNN calls in autograd/convolution.cpp to be more compact 2017-08-02 22:48:03 -04:00
5b6d1837c7 enable dilated transpose and gradgrad tests 2017-08-02 22:48:02 -04:00
69642d4423 add THNN bindings for DilatedConvTranspose in autograd/convolution 2017-08-02 22:48:02 -04:00
74e5328b03 remove limitations on output_padding in Conv* routines 2017-08-02 22:46:24 -04:00
a565b77791 add 2d and 3d dilated full Convolution 2017-08-02 22:44:59 -04:00
daf5b20cd7 Add tests that gradcheck grad sizes match input size and fix advanced indexing
case that fails check.
2017-08-02 07:13:01 +05:30
515efdab5d add reentrancy checking for gradcheck. 2017-08-02 07:13:01 +05:30
f9f98daf11 Remove save_mean/save_var from BatchNorm double backwards, as it's not needed.
These could cause a problem with double backwards because they were std::move'd in
Backward.
2017-08-02 07:13:01 +05:30
2ac1003228 Implement LogSoftmax (v.0.2.0) (#2265) 2017-08-01 14:32:05 +05:30
141224ad7c Implement SoftMax and NLLLoss double backwards. (#2233)
* Implement SoftMax and NLLLoss double backwards.

* Update legacy ClassNLLCriterion to add ignore_index.

* Fix serialization of legacy ClassNLLCriterion with ignore_index.
2017-07-30 09:02:04 +05:30
ac76ab5fca Increase tol. for float tensor qr big test.
test_FloatTensor_qr_big test is still a bit flaky on K80. Increasing tolerance to improve reliability as tests are moved around and results change for this test.
2017-07-27 14:23:06 -04:00
04f31aa034 Improve Variable.retain_grad 2017-07-27 20:36:14 +05:30
ae59e008cd add retain_grad method to Variable, so gradient gets stored during backprop on non-user variables 2017-07-27 20:36:14 +05:30
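For context, a minimal usage sketch of retain_grad (written with today's tensor syntax rather than the 2017 Variable API):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2              # non-user (non-leaf) variable: .grad normally stays None
y.retain_grad()        # ask autograd to keep y's gradient during backward
y.sum().backward()
print(y.grad)          # tensor([1., 1., 1.])
```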
925208af72 Implement BatchNorm double backwards (#2207)
* Implement BatchNorm double backwards as a python function called directly from C++.

This will be converted to C++ code once ATen is integrated with autograd.

* Some performance improvements via inplace ops and reusing calculations.
2017-07-27 06:00:31 +05:30
643f8d12ff [bugfix] in bce_with_logits logsumexp calculation (#2221)
* fix bug in bce_with_logits logsumexp calculation

* flake8 fix
2017-07-27 05:58:56 +05:30
fb8f9de498 fix for ATen API Change 2017-07-26 18:55:56 -04:00
cb9ad7a892 Opt into Trusty builds. (#2214)
* Opt into Trusty builds.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Bump to 2.7.9.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-27 04:04:57 +05:30
f7de7bab6e Merge commit 'fd97d92479e32e550866adfd1f0465e4cfa5e581' 2017-07-26 18:11:16 -04:00
fd97d92479 allow retain to be specified for unsafeTensorFromTH 2017-07-26 14:58:32 -07:00
f3aa97f169 Deduplicate THPUtils_checkLong/THPUtils_unpackLong (#2218)
There were two implementations of THPUtils_checkLong/THPUtils_unpackLong; one
that was a macro and one that was not, which is hella bad if you accidentally
include the macro before the real definition.  Now we always use the inline
function.

A reasonable follow-up task would be to un-macro-ify the rest of these functions.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-27 03:12:12 +05:30
b0648fc3fc Merge commit 'be9ef9283f297997afd3bf8e21147ec6bf09ebbf' 2017-07-26 17:25:39 -04:00
be9ef9283f Merge pull request #35 from ezyang/pr/undefined-dim-doc
Note [Undefined-dim versus 0-dim]
2017-07-26 12:42:33 -07:00
9c0d52a32f fix osx build errors related to long/int64_t 2017-07-26 12:36:25 -07:00
54545c2154 Note [Undefined-dim versus 0-dim]
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-26 12:34:13 -07:00
9ec7051442 Remove __func__ hack in auto nn. 2017-07-26 15:28:25 -04:00
2676c6357f Enable Conv groups gradgradchecks. (#2216) 2017-07-27 00:24:12 +05:30
ef3b09fb5f fix a bug where some scalars were getting truncated to integers incorrectly. 2017-07-25 14:27:16 -07:00
f194ac1e09 Merge pull request #477 from wickedfoo/feature_lp_pooling
GPU implementation of L_p feature pooling
2017-07-26 02:31:59 +05:30
e548580f31 Add missing models to torch vision documentation (#2204) 2017-07-26 01:58:18 +05:30
421607a935 DataParallel device_ids slicing fixes (#2200) 2017-07-26 01:54:38 +05:30
7be545292d Update cudnn.py 2017-07-25 09:35:44 -04:00
a0e83280ef Update cudnn.py 2017-07-25 09:35:44 -04:00
aa35be2032 search for cudnn in conda 2017-07-25 09:35:44 -04:00
626840aef3 C function wrapper uniqueness (#1912)
* add SharedFunctionMaker to create Function shared in the graph

* Clean shared_ptr usage for only function that will be used in the graph

* make Function binding match Variable one

* remove unnecessary changes

* fix comments

* proper weakref implementation

* add call to clear in dealloc
2017-07-25 13:12:54 +05:30
bcea678e7b Update rebased functions to call apply. 2017-07-25 07:37:25 +05:30
1a52ca02ef Always return indices from MaxPool autograd functions to simplify implementation;
the callers (in functional.py) will filter out the return instead.
2017-07-25 07:37:25 +05:30
84314859af Implement double backwards for MaxPool2d. 2017-07-25 07:37:25 +05:30
9c2beb33c5 Implement double backwards for MaxPool1d. 2017-07-25 07:37:25 +05:30
7deba74969 Implement MaxPool{1d,2d,3d}Backwards (non-differentiable) functions. 2017-07-25 07:37:25 +05:30
48bb07a4db Implement double backwards for AvgPool3d. 2017-07-25 07:37:25 +05:30
bb86ed7b97 Implement double backward for AvgPool1d, AvgPool2d, LPPool2d. 2017-07-25 07:37:25 +05:30
291369ff1b Convert pooling functions to new-style, once_differentiable functions. 2017-07-25 07:37:25 +05:30
2118400e18 Fix lint. 2017-07-25 07:37:25 +05:30
39934da8b3 Address review comments. 2017-07-25 07:37:25 +05:30
c12b494329 Implement double backwards for ELU. 2017-07-25 07:37:25 +05:30
506d52dc33 Add check_gradgrad=False for new NLLLoss2d test. 2017-07-25 07:37:25 +05:30
7687c2677a Fix double backwards advanced indexing derivative wrt grad_output.
Also fix a small legacy nn test issue and an unrelated syntax issue.
2017-07-25 07:37:25 +05:30
97d21e243b Implement L1Cost double backwards. 2017-07-25 07:37:25 +05:30
0bda56956e Implement double backwards for auto-generated HardTanh. 2017-07-25 07:37:25 +05:30
40af93bb57 Optimize PReLU double backwards via a PReLUBackwards autograd function. 2017-07-25 07:37:25 +05:30
9608e37969 Implement double backwards for PReLU. 2017-07-25 07:37:25 +05:30
ec7c510557 Implement Softsign double backwards. 2017-07-25 07:37:25 +05:30
8636be3880 Ensure gradients wrt grad_outputs are checked in gradgradcheck. 2017-07-25 07:37:25 +05:30
fb2284f3a0 Add gradgrad checks for NN module and criterion tests. 2017-07-25 07:37:25 +05:30
9ec9dee27d Implement NN Criterion functions as potentially double backwards functions. 2017-07-25 07:37:25 +05:30
7b6aab9079 Unify implementation of _Loss and _WeightedLoss autograd functions. 2017-07-25 07:37:25 +05:30
852dd5f011 Convert _WeightedLoss functions to new style autograd functions. 2017-07-25 07:37:25 +05:30
085abee444 Rebase kl_div changes. 2017-07-25 07:37:25 +05:30
48b85fe012 Implement THNN non-criterion Functions as new style with backward/backward. 2017-07-25 07:37:25 +05:30
45ce4df74c Convert auto nn Functions (non-criterion) to new style. 2017-07-25 07:37:25 +05:30
5695cbf986 Add comments in loss.py and distance.py (#2189)
* Add examples in CrossEntropyLoss

1. Added examples in CrossEntropyLoss
2. Make consistent style of example for PyTorch docs
3. Delete unnecessary character '

* Change comments in distance.py

1. Delete x1, x2 from arguments and add eps in PairwiseDistance
2. For the shape, added input1 and input2 for readability (PairwiseDistance and CosineSimilarity).

* Add examples

Added the word 'examples' for PyTorch docs
2017-07-25 07:36:28 +05:30
03df5debe3 Gloo fixes for Linux + old cmake (2.8.0) + old glibc (CentOS6) 2017-07-24 21:59:58 -04:00
2ebdef0154 Add 'torch/lib/gloo/' from commit '1978bba3e421eceab6181bcbc838553091cedecc'
git-subtree-dir: torch/lib/gloo
git-subtree-mainline: ceb4f84d12304d03a6a46693e54390869c0c208e
git-subtree-split: 1978bba3e421eceab6181bcbc838553091cedecc
2017-07-24 21:59:49 -04:00
ceb4f84d12 Improve memory usage of cuDNN RNN modules (#2179) 2017-07-25 04:00:17 +05:30
112728cbe9 reformulate bce_with_logits to not use abs (#2195)
* reformulate bce_with_logits to not use abs

* flake8 fixes
2017-07-25 03:46:27 +05:30
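The widely used stable formulation of this loss goes through |x|; below is a sketch of an abs-free equivalent using a clamp-based log-sum-exp (illustrative of the technique, not necessarily the exact expression this commit landed):

```python
import torch
import torch.nn.functional as F

def bce_with_logits(x, z):
    # loss = (1 - z) * x + softplus(-x); softplus is evaluated as a
    # log-sum-exp shifted by m = max(-x, 0) so exp() never overflows
    m = (-x).clamp(min=0)
    return (1 - z) * x + m + ((-m).exp() + (-x - m).exp()).log()

x, z = torch.randn(5) * 20, torch.rand(5).round()
print(torch.allclose(bce_with_logits(x, z),
                     F.binary_cross_entropy_with_logits(x, z, reduction='none')))
```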
dc17fb68e4 Fix minor bug in parallel_apply (#2193) 2017-07-25 03:45:00 +05:30
4a4d8841e6 Delete unused import 2017-07-23 12:48:11 -04:00
3c275fe7a0 Increase flaky test tolerance (#2185) 2017-07-22 11:37:34 -04:00
1978bba3e4 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:12 -07:00
35757af6f7 Add broadcasting of weights to bce/bce_with_logits (#2161)
* added tests + removed explicit expand of weight in bce with logits

* add auto broadcasting of weight to BCELoss

* remove the need for _BCELoss

* formatting of warning

* remove TODO

* move across assert from _functions/thnn/loss.py

* flake8 fixes
2017-07-21 16:02:07 -04:00
8ab3d214d5 Fixes for DistributedDataParallel (#2168) 2017-07-21 16:00:46 -04:00
ec2def803b Merge commit '2efac3ed83a29f57f914e9044fdddd2ce7ecd6b7' 2017-07-21 15:58:23 -04:00
71ce3448d9 Fix torch.inverse when magma is not available
Fixes #2156
2017-07-21 15:57:43 -04:00
2efac3ed83 Fix torch.inverse when magma is not available
Fixes #2156
2017-07-21 15:57:25 -04:00
66bbe5d75a .creator -> .grad_fn in the code example (#2171) 2017-07-21 14:43:16 -04:00
ea607afd06 Add comments in nn.Upsample (#2175) 2017-07-21 14:34:58 -04:00
4f035f14de Add a support matrix for distributed backends 2017-07-21 14:19:46 -04:00
72e9e7abf7 Warning squash.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-21 14:13:11 -04:00
eed323c344 avoid warning 2017-07-20 10:59:56 -07:00
ea6f9a26b8 fix version number 2017-07-20 13:30:53 -04:00
3719b4247a return a sentinel value when THTensor has undefined dimensions. 2017-07-20 10:25:30 -07:00
bf1fc250d1 get conda root dir automatically, trick from Dockerfile 2017-07-20 11:02:30 -04:00
47942307b5 Comment that data of THStorage may be NULL.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-20 10:55:35 -04:00
6b69723d4f Document how Numpy memory management works.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-20 10:55:35 -04:00
5254846bb2 fix typo of error msg of cmul in THSTensorMath (#2158) 2017-07-20 02:58:54 -04:00
f3f478960e Convert Embedding to new style. (#1916)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-20 02:35:21 -04:00
e537023147 add functional embedding (#1987) 2017-07-20 01:53:37 -04:00
09abaa2189 make keepdim backcompat warnings emit in autograd as well (#2157) 2017-07-20 01:48:05 -04:00
575a4a98e0 Remove assertions with side effects 2017-07-20 01:45:57 -04:00
02e23f4f6b Unify argument names in tensor and Variable methods 2017-07-20 01:45:57 -04:00
8946502348 Accept all kinds of arguments in Variable.expand 2017-07-20 01:45:57 -04:00
e708de37cc Allow keyword args in long_arg options 2017-07-20 01:45:57 -04:00
4af40e3471 Let parallel_apply accept arbitrary inputs 2017-07-20 01:45:57 -04:00
f417cb062b Fix repeat backward to handle unsqueezed dims 2017-07-20 01:45:57 -04:00
11f3ccf98f Add missing Modules to nn.functional (#1801)
* add dropout2d and dropout3d to functional

added some loss functions to functional

added tests

using dropout from backend

added docs

fixes

* edited loss modules to call functional
2017-07-19 15:55:21 -04:00
31894cafdd add support for advanced indexing with less than ndim indexers, ellipsis (#2144) 2017-07-19 15:51:03 -04:00
95ccbf8b0b better error message in load_state_dict when there are inconsistent tensor sizes (#2151) 2017-07-19 15:50:29 -04:00
a5422d14c8 Merge commit 'bd6263c338c717de880cddfed660b5aa06ee108b' 2017-07-19 15:48:54 -04:00
82143487b3 Add CUDA support for arange
Also enables CUDA for range
2017-07-19 15:48:20 -04:00
bd6263c338 Add CUDA support for arange
Also enables CUDA for range
2017-07-19 15:43:00 -04:00
f4a565ded9 Merge commit '1c6a08c1c2a50a7048ae9e6e11290740d24a8374' 2017-07-19 15:42:20 -04:00
1c6a08c1c2 fix lint 2017-07-19 12:41:17 -07:00
a5c2546c0f version bump 2017-07-19 12:34:43 -07:00
13e84e460b Use unaligned store intrinsic to enable vectorized reductions on unaligned buffers
Summary: When performing reductions on fp16 buffers, gloo assumed that both buffers were either aligned to 32 bytes or misaligned by the same offset. This may not hold in intermediate steps of halving-doubling allreduce, when the reduction is performed on some offset within the receive buffer. The fix is to use intrinsic instructions that work with unaligned pointers.

Reviewed By: akyrola

Differential Revision: D5450103

fbshipit-source-id: 9a1c8f8c34d2e62223f6d5c21573ea1cfad6537f
2017-07-19 11:06:32 -07:00
4d5d9de541 Merge commit '768b7c0dee34b614ab1cd8f89c69ec7d86c19c88' 2017-07-19 12:22:36 -04:00
9da882e396 Merge commit 'ae3a8d5d2eaa1b15d825b86ce706b046e68733b8' 2017-07-19 12:21:52 -04:00
15bece50d1 Merge commit 'cfcf2af95f91a88ec61cbcac8b30a718e7332aa5' 2017-07-19 12:20:54 -04:00
8144f7c95d Merge commit '58334a0c4b3c386931293f7fbee3d2cf066221a5' 2017-07-19 12:20:20 -04:00
b660303a16 Static linking against libstdc++ in Binary Build mode 2017-07-19 12:19:36 -04:00
768b7c0dee Static linking against libstdc++ in Binary Build mode 2017-07-19 11:23:31 -04:00
ae3a8d5d2e Static linking against libstdc++ in Binary Build mode 2017-07-19 11:23:21 -04:00
58334a0c4b static MKL detection and linkage fixes 2017-07-19 11:22:46 -04:00
cfcf2af95f add explicit BLAS linkage to THC when linked against magma (in binary build) 2017-07-19 11:22:23 -04:00
f3df24269d Merge commit '975550512200cfa1ae18e21400e7efa3924a3d46' 2017-07-19 11:05:51 -04:00
c4120f34bf move to model with cuda indexing tensors for cuda tensor adv indexing 2017-07-19 11:05:10 -04:00
9755505122 move to model with cuda indexing tensors for cuda tensor adv indexing 2017-07-19 11:04:49 -04:00
8b42308f71 Bug in line 381 (sparse) (#2130)
The function iterates over columns and sets "sparsity" fraction of entries in each column to 0. The number of zeros in a column (num_zeros) is then ceil(rows*sparsity)
2017-07-18 22:55:06 -04:00
685ae4813e Squash "macro expansion producing 'defined' has undefined behavior" warnings.
Fixes #2141.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-18 22:24:55 -04:00
a0fef9dd22 Merge commit '703429d49eb397102ba20e6d4c0dd7714be001a5' 2017-07-18 20:17:26 -04:00
703429d49e Make clang shut up about class/struct mismatch.
Makes us -Werror clean again, I think.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-18 20:16:20 -04:00
567d95fa09 Merge pull request #25 from killeent/nullable-tensors
add support for Null Tensors to functions
2017-07-18 17:35:02 -04:00
7914d67ce3 Merge pull request #20 from killeent/type-equality
operator== for type
2017-07-18 14:32:45 -07:00
8451468d8b still generate multiple versions 2017-07-18 14:31:35 -07:00
138b216686 add support for Null Tensors to functions 2017-07-18 07:51:51 -07:00
6f6d70ffed Merge commit 'dc5854477951765f5edbac34b0c228449de1b56b' 2017-07-18 01:34:54 -04:00
dc58544779 fix baddbmm for expanded tensors 2017-07-18 01:33:59 -04:00
e13704c467 fix shadowed variable name
Summary: When compiled with -Werror=shadow-compatible-local, a variable name cannot be reused. This passed our tests, but some people compile with stronger settings.

Differential Revision: D5440805

fbshipit-source-id: a246af748717fb7e0e7a321e1ac4ddfef68ae524
2017-07-17 19:10:30 -07:00
e9dd8e0e3b Use one key for all pairs per node
Summary: To reduce round trips with store handlers, it is better to store all addresses in one key instead of one address per pair. This is what this implements.

Reviewed By: andrewwdye

Differential Revision: D5435893

fbshipit-source-id: 2d3ea3a2822c3b934ff2578d44a262e7bfbde6d0
2017-07-17 17:35:19 -07:00
a3c9054245 Add comments in loss.py (#2128) 2017-07-17 13:56:19 -04:00
c7b624651e CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera
Summary:
CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera.

The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures.

Reviewed By: Orvid

Differential Revision: D5432398

Tags: codemod, codemod-opensource

fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914
2017-07-16 21:24:13 -07:00
ba544aa0ad Add comments in nn.ELU (#2111) 2017-07-16 23:04:11 -04:00
849fb1f7e3 Fix when running with python -O (#2120) 2017-07-16 13:51:14 -04:00
16dd997239 Spelling tweaks for documentation (#2114) 2017-07-15 13:16:32 -07:00
1c0135b6f2 CreateCommonWorld: pass timeout for storehandler
Summary: Use the CreateCommonWorld timeout for the storehandler as well, not just the device connect.

Reviewed By: andrewwdye

Differential Revision: D5425923

fbshipit-source-id: 936d2129e2db3bfed8759ca097b75843d3931d5f
2017-07-14 19:20:11 -07:00
a7d82b935f Merge commit '9851ef4979bad0c8618e586e711c1bfd8648fd52' 2017-07-14 17:31:21 -04:00
af7aea9f17 Merge commit 'f805a8388be8dc55af0e3aa165b13cd0fce484d3' 2017-07-14 17:29:50 -04:00
366299f9f3 Wrap unbiased flag in var, std, varall, stdall 2017-07-14 17:29:06 -04:00
9851ef4979 Wrap unbiased flag in var, std, varall, stdall 2017-07-14 17:28:14 -04:00
f805a8388b Wrap unbiased flag in var, std, varall, stdall 2017-07-14 17:25:25 -04:00
2f7b6db429 Merge commit 'd2874c560ebd197297ef737a084b6f7ee3f03dc6' 2017-07-14 17:21:16 -04:00
16203f3325 fix test 2017-07-14 17:04:21 -04:00
80d067e70f retain_variables -> retain_graph (#2107)
Closes #1928
2017-07-14 16:45:25 -04:00
d2874c560e lint fixes 2017-07-14 16:32:15 -04:00
83596bdcb1 produce a Declarations.yaml file that describes Functions/Type/Tensor methods that framework produced. 2017-07-14 12:34:03 -07:00
f3f8ce44bd Merge pull request #18 from soumith/master
Fix handling of if_true/if_false in ATen
2017-07-14 15:16:07 -04:00
33ac9cdc10 add ATen tensor support to pytorch tuple_parser (#2102) 2017-07-14 13:56:02 -04:00
38ba935547 operator== for type 2017-07-14 10:39:40 -07:00
128e02d792 allow type inference to work on TensorList 2017-07-14 10:27:05 -07:00
7ee7542fc8 Fix handling of if_true/if_false in ATen 2017-07-14 11:58:03 -04:00
52a9367fa7 Fix minor typo (#2100)
Fixed minor typo in Autograd mechanics docs.
2017-07-14 10:20:13 -04:00
08bb3b7cc8 Merge commit '7e498d2219c8dbeb801fc4cefa36b147bbf76ff4' 2017-07-14 02:55:55 -04:00
43eaa28b9f fix empty Tensor mmap 2017-07-14 02:55:05 -04:00
7e498d2219 fix empty Tensor mmap 2017-07-14 02:54:39 -04:00
d6bc2642e7 Add ignore_index to NLLLoss2d 2017-07-13 23:22:48 -04:00
7d3511f5f2 Half fixes for ATen and CUDA 9.0 2017-07-13 22:52:39 -04:00
a5a8ab10b0 fix Hardtanh argument names to be consistent between functional and Module 2017-07-13 22:46:51 -04:00
25b591eb05 lint fixes 2017-07-13 22:41:01 -04:00
06f94a7d59 better error message when thread_local is not supported (#2092) 2017-07-13 22:32:10 -04:00
027264cd64 Merge commit '9e720f15477d2d7a388c5b5ec7d397fa5706d64f' 2017-07-13 19:59:07 -04:00
7c14c377df Merge commit 'd8fee1ebe675b9d31894ac79145f2b2629e322e4' 2017-07-13 19:25:56 -04:00
c674923bcc Merge commit 'ed6f5d7038f0e3873c2ed6add2ede7c9ab38e1ea' 2017-07-13 19:24:22 -04:00
d8fee1ebe6 add launch_bounds to greedy kernels 2017-07-13 19:23:29 -04:00
ed6f5d7038 add launch_bounds to greedy kernels 2017-07-13 19:23:24 -04:00
9e720f1547 fix bug in method declarations 2017-07-13 16:22:52 -07:00
ab26fa01e6 install vision in devel dockerfile, minor fixes to dockerfile (#2090) 2017-07-13 19:06:41 -04:00
f4ae64a6c7 add isCUDA() on Type 2017-07-13 15:13:20 -07:00
07fcd977bb add cudnn data type processing for ATen tensor (#2087) 2017-07-13 16:37:53 -04:00
54cabb8bf3 Correct negative dim behavior in torch.stack (#2084)
Fixes #1950
2017-07-13 16:29:31 -04:00
42485d87c2 Set the current device in each engine's thread (#2081)
Fixes #2017
2017-07-13 16:24:38 -04:00
007d6ad816 write generated_cpp. to a file rather than as output to make error reporting clearer. 2017-07-13 11:04:52 -07:00
abd433fa07 Merge commit '6db960fbcff7ae194c6827c73113c222391f2c3e' 2017-07-13 13:49:26 -04:00
6db960fbcf dont clobber gen.py error, fix for old versions of python 2017-07-13 10:45:14 -07:00
384f03f1be Merge commit '48b797a785c1fc6ea34398985c49b2c7c55d28ae' 2017-07-13 10:40:58 -04:00
c011d4f3d6 resolves #1991 (#2073) 2017-07-13 09:57:33 -04:00
f98c384973 Raise error when calling from_numpy on 0-dim array (#2075)
* Raise error when calling from_numpy on 0-dim array

Fixes: #2055

* reword error message
2017-07-13 09:56:12 -04:00
48b797a785 fix lint 2017-07-13 03:22:31 -04:00
8983bf13f4 fix max and min docs 2017-07-13 03:03:27 -04:00
20ce45b0c3 fix EmbeddingSum offsets initialization 2017-07-13 02:57:25 -04:00
1e98155711 long -> size_t 2017-07-13 02:40:44 -04:00
1c14178c65 fix osx compilation 2017-07-13 02:38:56 -04:00
37183e91de add normalize docs to sphinx 2017-07-13 02:31:57 -04:00
14337693d0 Merge commit 'b900a49308cb0363d00add7e123b824fda3eab37' 2017-07-13 01:01:38 -04:00
58e4caf80f add missing docs 2017-07-13 01:01:04 -04:00
b900a49308 Merge pull request #11 from soumith/master
Fix ATen build for debug python
2017-07-12 21:51:36 -07:00
c888857461 Conv double backward groups (#1993)
* add support for groups in double backward

* add tests for group in double backward

* fix lint

* separate some tests to reduce number of test cases

* remove redundant testing for different number of output channels
2017-07-13 00:41:14 -04:00
7053b84c0e Merge commit '41abcd4b41308b3453cce6731d896d094b23c62a' 2017-07-13 00:39:35 -04:00
8304dc4d68 Merge commit '703ccbb8cbe1c4ce3eeb62548ce51f71181883d6' 2017-07-13 00:39:03 -04:00
c48d50a2e2 Advanced Indexing: Calculate linear offsets directly on the GPU when working with CUDA Tensors 2017-07-13 00:38:23 -04:00
41abcd4b41 Advanced Indexing: Calculate linear offsets directly on the GPU when working with CUDA Tensors 2017-07-13 00:37:20 -04:00
703ccbb8cb Advanced Indexing: Calculate linear offsets directly on the GPU when working with CUDA Tensors 2017-07-13 00:37:13 -04:00
27da4eafc2 Remove more advanced indexing duplicate tests (#2071) 2017-07-13 00:30:52 -04:00
459cb697b5 Merge commit 'ce96b84ccbdfbbee7f744942b1bb9fdc5924e442' 2017-07-13 00:26:06 -04:00
ce96b84ccb Check for shared_mem size in multinomial single-sample implementation
Handle limited shared memory in torch.multinomial

Update THCTensorRandom.cu
2017-07-13 00:25:13 -04:00
feddb03d58 LP pooling kernels 2017-07-12 19:31:06 -07:00
fe3802d724 match PyTorch syntax 2017-07-12 16:58:57 -07:00
b8d0c7fc0d checked cast does it all 2017-07-12 14:41:04 -07:00
ea563c1df1 Make weight norm pickleable (#2066) 2017-07-12 17:21:22 -04:00
841173c530 Use NamedTemporaryFile to avoid filename collisions (#2069) 2017-07-12 17:14:42 -04:00
f4c502e8a8 basic cat implementation in ATen 2017-07-12 12:04:24 -07:00
593c5e12e1 Merge commit 'be18499e852d8b292491e27d87dadebe68931fc3' 2017-07-12 14:55:21 -04:00
dc2ed7fd33 Fix ATen build for debug python 2017-07-12 14:52:03 -04:00
81fd2bf2d0 fix some language / typos 2017-07-12 14:47:36 -04:00
8915e2710c Refactor scatter/gather and add distributed docs 2017-07-12 14:47:36 -04:00
ebd5c085dc Fix a memory leak in DataChannelTCP 2017-07-12 14:47:36 -04:00
a9759ef401 Fix undefined symbol errors in THD 2017-07-12 14:47:36 -04:00
f899eafe85 Merge commit '5894864a1c5c9596da0ae88b477ee421e3a5065b' 2017-07-12 14:33:47 -04:00
169ca67a4e Adding Spatial Transformers w/CuDNN support 2017-07-12 14:32:06 -04:00
41c8fee3e7 Merge commit '7c10f1b932fbebdf0e9105f2848229ea22109747' 2017-07-12 12:57:52 -04:00
bb891758bf Merge commit 'a20729244b43f7072797cc5e93898df795455e5b' 2017-07-12 12:57:12 -04:00
7c10f1b932 Avoid two unnecessary copies in addmm backward
The `r_` and `t` tensors become different objects, even though they
point to the same data. Avoid the copy whenever beta=0.
2017-07-12 12:56:17 -04:00
a20729244b Avoid two unnecessary copies in addmm backward
The `r_` and `t` tensors become different objects, even though they
point to the same data. Avoid the copy whenever beta=0.
2017-07-12 12:56:08 -04:00
a74fb22b9a fix inplace division for python3 (#2063) 2017-07-12 11:37:55 -04:00
0d91048639 add dummy tensor.data property, to provide interpretable error message to users (#2058) 2017-07-12 10:22:08 -04:00
10e23943b3 Fix missing _forward_pre_hooks in serialized modules (#2057) 2017-07-11 18:23:35 -04:00
be18499e85 Fix a few C++ warnings
1) Type needs a virtual dtor
2) Tensor move ctor should be noexcept
3) Make constructors from Context* and Type* explicit
2017-07-11 15:18:15 -07:00
1037f30e41 add some documentation to Tensor 2017-07-11 11:00:45 -07:00
78ecc2d3b1 Alias multinomial sampling in Cuda (#784)
* Support Multinomial Alias sampling in cuda

Moving benchmark file

* Review changes
2017-07-11 13:23:35 -04:00
f483679425 Implementation of Alias Multinomial for faster Multinomial sampling (#1046) 2017-07-11 13:22:36 -04:00
dfd5d8d0fe Avoid two unnecessary copies in addmm backward (#1971)
The `r_` and `t` tensors become different objects, even though they
point to the same data. Avoid the copy whenever beta=0.
2017-07-11 11:55:22 -04:00
158c7e86dd add basic gitignore, thpp -> at doc fix 2017-07-11 08:32:58 -07:00
73128f7b08 fix minor typos (#2051)
* Update extending.rst

fix typo

* Update cuda.rst

fix typo
2017-07-11 11:01:41 -04:00
f536c662bf fix op in docs (#2048) 2017-07-11 10:36:19 -04:00
2ecb18881c add DynamicType variants for ATen functions. 2017-07-11 10:35:03 -04:00
9d8cff9bc1 initialize aten and pytorch to share the same THCState 2017-07-11 10:35:03 -04:00
ab3d85c410 add build commands for ATen 2017-07-11 10:35:03 -04:00
e58e27cf16 Add 'torch/lib/ATen/' from commit '9d0c674cb7bcfae989d69f988363c1688c22fa89'
git-subtree-dir: torch/lib/ATen
git-subtree-mainline: 3314d51dcc1535dc2d00d357be889807d1bb8c57
git-subtree-split: 9d0c674cb7bcfae989d69f988363c1688c22fa89
2017-07-11 10:33:24 -04:00
3314d51dcc Add __repr__ to Avgpool and maxunpool layers (#2047) 2017-07-11 10:13:22 -04:00
1ef1dd9cad Add comments for readability (#2005) 2017-07-10 23:02:56 -07:00
98206c326e Fix ref counting in wrapped tuple functions (#2042)
Fixes #1963
2017-07-10 18:46:06 -04:00
9d0c674cb7 always use a custom default float 2017-07-10 15:37:18 -07:00
bff762c3ff python style fixes 2017-07-10 15:37:07 -07:00
10a8ccf27f only test gets for advanced indexing with duplicates (#2041) 2017-07-10 16:05:55 -04:00
0a9e8a23ef add atan2 function to autograd (#2040) 2017-07-10 16:04:35 -04:00
8b003565ec remove inaccessible median variant (#2015)
With the addition of medianall() this variant can no longer be accessed, because both it and  medianall take no arguments.
2017-07-10 10:42:45 -04:00
53ac2d46c6 Fix typos in docstrings. (#2034) 2017-07-10 10:35:46 -04:00
318ea29a86 Merge commit 'ab3a9e177ee5eb7d39de2d385ba1e141858e8329' 2017-07-10 10:30:24 -04:00
ab3a9e177e Fix sdot_ bug for runtime F2C symbol conflicts by using cblas where available 2017-07-10 10:29:26 -04:00
46a868dab7 [Ready] Limit docs line length (#1900)
* some docs are ready

* docs

* docs

* fix some more

* fix some more
2017-07-10 10:24:54 -04:00
581921f696 support unsafe functions for getting/constructing tensors from TH objects for backward compat. 2017-07-09 21:25:38 -07:00
0025e1c776 Fix typos in the docstrings of Conv3d, AvgPool3d and MaxPool3d (#2030)
* Fix a typo of the docstring of Conv3d

* Fix typos in docstrings of 3D operations.
2017-07-09 23:20:07 -04:00
9cba97a833 Pairwise-exchange benchmark with bandwidth measurement
Summary: A simple benchmark to determine network bandwidth for pairwise communication.

Reviewed By: plapukhov

Differential Revision: D5159607

fbshipit-source-id: d16c3ed3a0c2ae182138df91bdae821f5508c6ac
2017-07-09 15:55:20 -07:00
c6d7e1e6bf added input size checks to batchnorm (#2020) 2017-07-09 15:31:24 -04:00
49f679d0e9 Acknowledge the existence of cpu HalfTensor (#2018) 2017-07-08 10:03:36 -04:00
f0788afb0c lazily initialize cuda so that we behave similar to PyTorch 2017-07-07 22:21:31 -07:00
a4dc7dcd04 osx build issues and clang warnings 2017-07-07 11:50:02 -07:00
5dd05ed8ee remove Sparse from dispatch for now, will add dispatch variants later 2017-07-07 11:40:08 -07:00
0a34f05d5b Always include THNN in the build, don't check for CUDA twice
As a result, the project builds on MacOS with gcc-6 (without CUDA).
2017-07-07 14:14:02 -04:00
4fda678a85 fix build issue when cuda does not exist 2017-07-07 10:54:17 -07:00
ebdec9a837 Skip distributed tests if not supported (#2004) 2017-07-07 11:06:56 -04:00
c3c7845572 added asserts that grad_output + input are contiguous (#2000) 2017-07-07 09:14:02 -04:00
90d0762d14 Use torch.arange instead of torch.range in test_torch.py (#1996) 2017-07-07 00:06:31 -04:00
73fead9f8f add shape alias (#1983) 2017-07-05 19:12:37 -04:00
3748b6d3eb Data parallel fix for https://github.com/pytorch/pytorch/issues/1857 (#1880)
* Data parallel fix for https://github.com/pytorch/pytorch/issues/1857
searches recursively for a Variable in the input

* parallel_apply.py lint
2017-07-05 11:46:00 -04:00
b3589b04fd Fix exceptions not being caught (#1948)
Adding -fexceptions to both torch and pytorch C/C++ builds fixes tests
that were previously not passing.

Closes #1297
2017-07-05 00:25:39 -04:00
5964394a4c return empty iter when tensor is empty 2017-07-04 17:29:27 -04:00
1aaa24d99b add medianall prototype to docs 2017-07-04 16:52:36 -04:00
295ed7e264 Merge commit 'ab7d4e2bcea5cae8f05873fb0bbb31985cc58d47' 2017-07-04 16:47:48 -04:00
ab7d4e2bce add missing definition 2017-07-04 16:46:04 -04:00
ae65236490 Fix typo 2017-07-04 15:19:05 -04:00
c2069a15e0 Merge commit '56df97ce939985a30dcfefb1136bf45faf64413c' 2017-07-04 15:18:14 -04:00
56df97ce93 remove unnecessary contiguous assertion 2017-07-04 15:17:15 -04:00
89c682dfb9 Merge commit '0dbf871d9ec424f1a7897af77bf93219d3be23bf' 2017-07-04 14:56:53 -04:00
ae839f4b2e Merge commit 'f425c5216b7fe35dd03e0161a3440ec968c63636' 2017-07-04 14:56:22 -04:00
05c2bafc9d Have median reduce over all dims and return just the value when dim is not provided 2017-07-04 14:55:37 -04:00
0dbf871d9e Have median reduce over all dims and return just the value when dim is not provided 2017-07-04 14:55:30 -04:00
f425c5216b Have median reduce over all dims and return just the value when dim is not provided 2017-07-04 14:55:19 -04:00
635bb5ec9d corrects typo 2017-07-04 11:09:40 -04:00
a7f6b0ab4f Merge commit 'e5bac2dd2d69772938482c1431db1fc1efb64c6f' 2017-07-03 20:41:28 -04:00
e5bac2dd2d Add critical section to BLAS gemm.
This is needed because of possible races in SpatialConvolutionMM (and others that use gemm)
if the BLAS library is not thread-safe.

In terms of performance, there's not much benefit to running two gemms in parallel, because the
BLAS libraries have their own all-occupying gemms anyway.
2017-07-03 20:40:21 -04:00
ec8da55a7d bind THS THCS, leaving all operators unimplemented. This is required because THPP can represent Sparse tensors even though the wrapper doesn't implement any operators. 2017-07-03 16:52:41 -07:00
b4414c0dc3 Handle None in modules list.
It's often useful to add None to an nn.ModuleList so that the indexing
of the module list matches some other property.
2017-07-03 18:53:21 -04:00
39edc378fb Fix lint. 2017-07-03 18:51:22 -04:00
f6578c1b24 Implement double backwards for Dropout and FeatureDropout. 2017-07-03 18:51:22 -04:00
daa84e7663 Implement bilinear double backward. 2017-07-03 18:51:22 -04:00
1aa145dbac Implement ConstantPad2d double backwards. 2017-07-03 18:51:22 -04:00
d4b8834131 Improve non-contiguous testing in TestAutograd: (#1933)
* Improve non-contiguous testing in TestAutograd:
1) Test gradcheck and gradgradcheck with non-contiguous inputs
2) Test gradgradcheck with non-contiguous gradoutputs (gradcheck would take more work)
3) Fix discovered issue in Prod backwards.

* Simplify non-contiguous setting wrt View.
2017-07-03 18:49:52 -04:00
699d1ec7fb Address flaky Norm test issues:
1) Add a correction for 1.5 norms to ensure input can't be zero.
2) Increase test tolerance.
2017-07-03 18:48:22 -04:00
05062a1439 Better handle random seeds in tests.
Previously, there were 2 issues with test_autograd randomness:
1) Many random operations (e.g. random selection in prod_zeros) happened
   before the torch random seed was set (because it was set in run_tests
   at the end of the file).
2) The random seed was not set consistently: run_tests would set it to the
   proper value, but each call to setUp would set it to 0 (because SEED wasn't
   global in run_tests), which made setting the seed mostly worthless.
2017-07-03 18:48:22 -04:00
e187ba7a9f Decrease likelihood that Fmod/Remainder tests fail due to numerical jacobian check.
Previously, these tests added 5e-2 to the denominator tensor (the same as the div
tests), which only avoids divide by 0, but not issues with computing the numerical
jacobian due to non-linearity of fmod/remainder, when input / divisor is close to an
integer.  These tests now add 1.5 to the denominator, which is the same as the non-tensor
version of the tests; Note that we can still hit the above condition but it will be much
less likely.
2017-07-03 18:48:22 -04:00
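A quick numerical illustration of the failure mode described above: when input/divisor sits near an integer, the central difference straddles the jump in fmod, so the estimated derivative is wildly wrong even though the true local slope is 1.

```python
eps = 1e-3
f = lambda x: x % 1.0            # same discontinuities as fmod(x, 1) for x > 0
x = 0.9995                       # x / divisor is almost an integer
num_grad = (f(x + eps) - f(x - eps)) / (2 * eps)
print(num_grad)                  # ~ -499.0, while the true slope is 1.0
```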
35ed224d04 Merge commit '8a24f2b4d8646de10b497c2eca2f1edc525a1e09' 2017-07-03 00:49:59 -04:00
72b292d45c Merge commit '733a7c6d9a22dfc9be1b11d47384991208658bfb' 2017-07-03 00:49:52 -04:00
5b4cd9bb49 Merge commit 'c691fc6dc711814a06107d4a9b763f34bff5afca' 2017-07-03 00:49:34 -04:00
c691fc6dc7 Add a nonContigDim reduction kernel to improve latency for small tensors. (#768) 2017-07-03 00:39:40 -04:00
42cf68b402 Make reduction functors accept only constant arguments (#753)
(similar to MaxValuePair and MinValuePair above).
2017-07-03 00:35:39 -04:00
8a65ef1098 cc 2.0 -> 3.0 in docs. 2017-07-02 22:08:42 -04:00
406040f6a9 fix torch.is_tensor not recognizing HalfTensor (#1934) 2017-07-02 10:13:44 -04:00
e26139b7f7 fixed shapes in GRU and LSTM docs. 2017-07-01 23:15:10 -04:00
457587088a Fix broadcasting issues in binary_cross_entropy_with_logits (#1944)
* don't re-seed cuda device if in bad fork

* avoid broadcasting in binary_cross_entropy_with_logits

* assert input sizes for BCEWithLogitsLoss

* added check that BCEWithLogitsLoss == Sigmoid + BCELoss

* fix flake8 issues

* rename test_bce_with_logits_gives_same_result_as_bce_and_sigmoid -> test_bce_with_logits_gives_same_result_as_sigmooid_and_bce_loss

* add warning in BCELoss about input shapes

* fix lint
2017-07-01 23:06:36 -04:00
da0fad8a7a Use torch.matmul in nn.Linear (#1935)
This takes advantage of the broadcasting behavior of torch.matmul to
support inputs with more than two dimensions. The extra dimensions are
treated like part of the batch dimension, much like nn.Bottle in Lua
Torch.

There are a few related small performance changes:

 * Addmm computes the gradient in column-major for inputs in
   column-major format
 * Variable.mm calls Addmm in-place with the desired output buffer
2017-06-30 16:53:26 -04:00
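A sketch of the resulting behavior (current PyTorch syntax): every dimension before the last is treated as batch, so no explicit flattening is needed.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
x = torch.randn(5, 4, 3)      # two leading "batch-like" dimensions
print(layer(x).shape)         # torch.Size([5, 4, 2])
```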
2c038f2074 Add weight normalization implementation (#1945)
* Add weight normalization implementation

This adds forward "pre-hooks" which get called before the module's
forward() method. Weight norm is implemented as a hook which calculates
the weight variable from the weight_g and weight_v every iteration.

Based on @rtqichen implementation.

* Specify return type
2017-06-30 15:41:40 -04:00
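A stripped-down sketch of the pre-hook mechanism (the shipped implementation is torch.nn.utils.weight_norm; the manual parameter surgery below is purely illustrative):

```python
import torch
import torch.nn as nn

def recompute_weight(module, inputs):
    # forward pre-hook: rebuild weight from magnitude (g) and direction (v)
    v, g = module.weight_v, module.weight_g
    norm = v.view(v.size(0), -1).norm(2, 1).view(-1, 1)
    module.weight = g * v / norm

lin = nn.Linear(3, 2)
w = lin.weight.data
lin.weight_g = nn.Parameter(w.view(w.size(0), -1).norm(2, 1).view(-1, 1))
lin.weight_v = nn.Parameter(w.clone())
del lin._parameters['weight']          # weight is now derived, not learned
lin.register_forward_pre_hook(recompute_weight)
print(lin(torch.randn(4, 3)).shape)    # torch.Size([4, 2])
```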
b3e500c522 fix docs generation warnings 2017-06-30 14:39:21 -04:00
b3f6ff1b3d Fix unused linker argument warnings. (#1958)
* Fix unused linker argument warnings.

This patch began when I noticed the following clang warning:

clang: warning: -Wl,-rpath,RIGIN: 'linker' input unused
clang: warning: argument unused during compilation:
'-L/home/ezyang/local/pytorch/torch/lib/tmp_install/lib'

The warning is minor, but I was a bit worried our rpath wasn't
setup correctly.  Actually, it was, and there wasn't a problem,
but I had to spend some time figuring out exactly what was going
on, and by the end of it, I might as well fix the warning.  In the end, I ended
up filing two upstream tickets for ccache and cmake:

- https://github.com/ccache/ccache/issues/189
- https://gitlab.kitware.com/cmake/cmake/issues/17025

We can remove the warning by using CMAKE_EXE_LINKER_FLAGS and
CMAKE_SHARED_LINKER_FLAGS, which have sane macro expansion rules
(although still slightly insane: the first level of escaping gets removed.)
To ensure that the rpath was being set correctly, I ran
objdump -x torch/lib/build/TH/libTH.so | grep RPATH and verified that ORIGIN
was setup correctly.

I also considered using CMAKE_INSTALL_RPATH, but the rpath here doesn't
seem to get set until you actually install, which is a change in behavior,
and I wasn't sure if anyone was relying on rpaths being setup in the build
directory.

There is a SLIGHT behavior change, in that if we happened to need these
LDFLAGS passed to the static linker, they won't get passed. I don't
think we ever build static libraries today so this shouldn't be a problem.

P.S. Because of the ccache bug, you may continue to see these warnings
after this patch.  If you apply https://github.com/ccache/ccache/pull/190
and clear your cache, it will solve the problem.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Remove unnecessary -Qunused-arguments

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-30 14:15:31 -04:00
6df23b418d mark tools as excluded in find_packages (#1915) 2017-06-29 13:49:56 -04:00
e5b5154768 Make cudnn warnings clean. (#1940)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-29 10:58:04 -04:00
bfaddc0a19 Warp intrinsic fixes (#785) 2017-06-29 00:14:07 -04:00
4d5075add2 Add ignore_index to nnl_loss and cross_entropy (#1937) 2017-06-29 00:10:13 -04:00
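Usage sketch (modern tensor syntax): targets equal to ignore_index contribute neither to the loss nor to the gradient.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)
target = torch.tensor([1, 2, -100, 4])   # third sample is masked out
loss = F.cross_entropy(logits, target, ignore_index=-100)
```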
0a95613cef Improve error message when accessing attributes that don't exist (#1936)
New:
   >>> torch.autograd.Variable(torch.randn(3, 3)).foobar
   AttributeError: 'Variable' object has no attribute 'foobar'

Old:
   >>> torch.autograd.Variable(torch.randn(3, 3)).foobar
   AttributeError: foobar
2017-06-28 20:13:15 -04:00
8a4eb50ed1 Speed up torch.matmul for 3D+ x 2D/1D tensors (#1931)
If the left tensor is 3D+ and the right tensor is at most 2D, we can
fold the batch into the matrix dimension and use torch.mm instead of
torch.bmm. In practice, this is faster especially if the right tensor is
column major.
2017-06-28 17:43:21 -04:00
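The trick in miniature: collapse the batch dimensions of the left operand, run one large mm, and reshape the result back.

```python
import torch

a = torch.randn(10, 5, 4, 3)                    # 4D x 2D case
b = torch.randn(3, 2)
folded = a.contiguous().view(-1, 3).mm(b)       # one big mm instead of bmm
out = folded.view(10, 5, 4, 2)
print(torch.allclose(out, torch.matmul(a, b)))  # True
```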
b5e1df046e fixed typo in formula of GRU in doc (#1921) 2017-06-28 11:02:06 -04:00
08648061f7 Advanced Indexing 2A - Colons + Adjacent Adv Indexers (#1890) 2017-06-28 10:01:45 -04:00
4c35c630ec Enable norm gradgradchecks by lowering precision requirements. 2017-06-27 18:44:14 -04:00
3744efeaf8 Fix double backwards for prod. 2017-06-27 18:44:14 -04:00
bc032be13e Implement negative dimensions and double backwards cumprod. 2017-06-27 18:44:14 -04:00
f814a892cf don't re-seed cuda device if in bad fork (#1923) 2017-06-27 13:24:52 -04:00
d592e188f7 port of ConcatDataset (#1902) 2017-06-27 12:31:56 -04:00
ae61f3ff42 adds poisson NLL loss (#1779) 2017-06-27 10:04:54 -04:00
1f391a42f7 fix warnings for docs generation 2017-06-27 00:18:32 -04:00
b933423495 support more than 8 gpus (#774) 2017-06-26 16:49:14 -04:00
ee1b7b50b3 fix docs for broadcast warning 2017-06-26 14:50:57 -04:00
7cdd018db4 Fix assertEquals for lists and tuples (#1913)
zip finishes once the first iterator is exhausted, so we were erroneously allowing things like assertEquals([1, 2], [1]) to pass.
2017-06-26 14:13:21 -04:00
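The underlying Python gotcha, and the shape of the fix (assert_seq_equal is an illustrative stand-in for the corrected assertEquals):

```python
def assert_seq_equal(a, b):
    # zip() stops at the shortest input, so compare lengths explicitly first
    assert len(a) == len(b), (len(a), len(b))
    for x, y in zip(a, b):
        assert x == y, (x, y)

assert_seq_equal([1, 2], [1, 2])   # passes
# assert_seq_equal([1, 2], [1])    # now fails instead of silently passing
```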
7806a09f03 Fp16 fixes for CUDA 9 (#783) 2017-06-26 11:38:18 -04:00
7523c49f03 add missing INCREF 2017-06-26 11:33:16 -04:00
32e666551a Fix lint. 2017-06-24 09:45:21 -04:00
ab0c321f80 Fix index_copy gradgrad test by ensuring indices cannot be repeated. 2017-06-24 09:45:21 -04:00
9db14936eb Ensure masked_select tests don't have masks of all zeros, which yield
0-dimensional tensors.
2017-06-24 09:45:21 -04:00
e5857c5f1c Implement Gather double backwards. 2017-06-24 09:45:21 -04:00
7da77c4255 Add ScatterAdd autograd function. 2017-06-24 09:45:21 -04:00
656cb1c31a Implement and test double backwards for IndexCopy. 2017-06-24 09:45:21 -04:00
4ab4938cf0 Fix and test single backwards IndexCopy. 2017-06-24 09:45:21 -04:00
1324c4b081 Implement double backwards for masked_scatter. 2017-06-24 09:45:21 -04:00
bb3779efe8 Add broadcasting to masked_select. 2017-06-24 09:45:21 -04:00
7c24a3d5cf fix arguments for cudnnFindEx for transposed wgrad 2017-06-23 23:18:32 -04:00
194bc404b5 CUDA 9
Summary:
Adds basic CUDA 9 support, including adding Volta arch, and making appropriate modifications for half precision datatype changes
Closes https://github.com/facebookincubator/gloo/pull/49

Differential Revision: D5315336

Pulled By: pietern

fbshipit-source-id: 6468b0f357206d604bdcfec69ba82509a2c91407
2017-06-23 16:41:27 -07:00
a9ea975977 enable warnings in build and fix warnings 2017-06-23 11:49:09 -07:00
b1a84e3c70 update readme and add assign_(Scalar) variant 2017-06-23 11:27:55 -07:00
8a24f2b4d8 Fix segfault in SpatialDepthWiseConvolution w/o bias 2017-06-23 11:14:00 +02:00
66d93b60b3 fix a bug with scalar handling by simplifying the maybeScalar check. 2017-06-22 23:07:56 -07:00
2af6ba3b2a handle select and operator[] style operations 2017-06-22 22:57:43 -07:00
b59b44fac7 add checks for scalars on output 2017-06-22 21:46:04 -07:00
a10a1c92b1 start adding rules to propagate scalar to results 2017-06-22 20:51:02 -07:00
bb6908e163 Scalar objects can now be backed by 0-dim Tensors. 2017-06-22 18:57:09 -07:00
c555cd8253 missing fixed allocator files 2017-06-22 18:32:10 -07:00
5e078bb7cc scalar flags added, and used to dispatch when there is a scalar variant of a function. broadcast annotations are used to figure out when a scalar s + A should also be converted. 2017-06-22 17:22:16 -07:00
ee10e7457f Corrected erroneous docstring for MultiLabelSoftMarginLoss 2017-06-22 17:42:18 -04:00
7cd6cc17af Merge commit '93e05eb458ad4c939e905668c1792692315880b0' 2017-06-22 17:23:02 -04:00
8bfef60b07 Merge commit '32fd4a3d6081a13c18ce4f8dcb37260a830a911f' 2017-06-22 17:22:31 -04:00
a45ad7cfba Advanced Indexing Part 1 -- Purely Integer Array Indexing 2017-06-22 17:21:50 -04:00
93e05eb458 Advanced Indexing Part 1 -- Purely Integer Array Indexing 2017-06-22 17:21:30 -04:00
32fd4a3d60 Advanced Indexing Part 1 -- Purely Integer Array Indexing 2017-06-22 17:21:19 -04:00
f09027bc29 Add batch sampler to DataLoader (#1867) 2017-06-22 20:18:31 +02:00
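A usage sketch with the current names: BatchSampler wraps another sampler and yields lists of indices, which DataLoader consumes through its batch_sampler argument.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import BatchSampler, SequentialSampler

ds = TensorDataset(torch.arange(10.).view(10, 1), torch.zeros(10))
batches = BatchSampler(SequentialSampler(ds), batch_size=4, drop_last=False)
loader = DataLoader(ds, batch_sampler=batches)
for x, y in loader:
    print(x.shape)   # batches of 4, 4, then 2 samples
```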
9a196829e2 Merge commit '43dec0a210103c4421bc73c7e742f0f746b7e39e' 2017-06-22 13:55:54 -04:00
43dec0a210 Remove THCTensor_(expand2) and THCTensor_(expand3).
They are no longer needed and the corresponding TH versions have been removed.
2017-06-22 13:55:08 -04:00
064ef8b81b Merge commit '104234a6a8937f09208061975ce90190a7be4159' 2017-06-22 13:21:59 -04:00
662faf7c41 Merge commit 'a940d4ff8bf5debc76d909a778e2e47d24148ee1' 2017-06-22 13:21:38 -04:00
104234a6a8 add asserts to BCECriterion 2017-06-22 13:20:25 -04:00
c16a268f47 Merge commit 'fb32164a72004e63ebfe1f9ca8366ff12f8fbec2' 2017-06-22 12:56:36 -04:00
cb4eaa9c5d TensorLib/Aten --> changes required in pytorch 2017-06-22 12:55:55 -04:00
fb32164a72 TensorLib/Aten --> changes required in pytorch 2017-06-22 12:55:17 -04:00
b5854a11c4 Merge commit 'eccc759c36a4023357c87fde79732e4c916676d2' 2017-06-22 12:49:50 -04:00
ddbd4ef4ac Support out-of-place broadcast type definitions. 2017-06-22 12:49:06 -04:00
eccc759c36 Support out-of-place broadcast type definitions. 2017-06-22 12:48:43 -04:00
fecd05ba2f Merge commit '81e14ad2dee356b2c2274eb302bc2438c9a6161a' 2017-06-22 12:46:37 -04:00
a7d1cd75ec Merge commit '93a7c9de29900f166486373744a0e90c7046a56a' 2017-06-22 12:46:02 -04:00
497db732fc btrifact: Make pivoting optional. 2017-06-22 12:45:14 -04:00
81e14ad2de btrifact: Make pivoting optional. 2017-06-22 12:45:01 -04:00
93a7c9de29 btrifact: Make pivoting optional. 2017-06-22 12:44:51 -04:00
96febbb762 Merge commit '62cfc94f445bfaeaccc3dcc1fc69ea5b75039823' 2017-06-22 12:40:40 -04:00
62cfc94f44 improving TH error messages in Apply macros 2017-06-22 12:38:10 -04:00
3f6cda8696 fix bug in threshold activation 2017-06-22 12:23:35 -04:00
a836f8f56f Use and document saved_variables for double backwards. 2017-06-22 11:46:24 -04:00
278cbbae49 set TH_INDEX_BASE to 0 2017-06-21 16:43:16 -07:00
68cbb857f2 allow tensors to be constructed from views of external data. Support creating new tensors that already have a size/stride 2017-06-21 15:35:08 -07:00
a1c557bc45 improve error reporting for undefined tensors passed as arguments. 2017-06-21 12:24:59 -07:00
4c5b7d41ba tensor.data<> also has toLongData() variants. Scalar now also has .to<T>() variants 2017-06-21 11:57:37 -07:00
13e7648fd1 document accessors 2017-06-21 11:23:03 -07:00
1572173ca7 Implement double backwards for Sort, Topk. 2017-06-21 00:24:13 -04:00
e16ceef76a Implement Scatter double backwards. 2017-06-21 00:24:13 -04:00
b79ff11aca Implement IndexAdd, IndexFill, IndexSelect, MaskedSelect double backwards. 2017-06-21 00:24:13 -04:00
50c0912a75 Implemented masked_fill double backwards. 2017-06-21 00:24:13 -04:00
c3ad55f746 add readme and generated files for Type/Tensor/Functions to a doc folder to make it possible to view headers without building the library 2017-06-20 20:33:26 -07:00
4b93f32234 rename TensorLib -> ATen 2017-06-20 16:49:13 -07:00
03f41c8120 fix capitalization of Python, make it consistent 2017-06-21 00:09:37 +02:00
e0b70d0f64 Fix Fmod/Remainder gradgradcheck by ensuring inputs requires_grad. 2017-06-20 11:59:21 -04:00
0b2b7d0594 Kth value function passes gradgradcheck. 2017-06-20 11:59:21 -04:00
6d97ac0c0f Missing includes in cuda_collective_device.h
Summary: Closes https://github.com/facebookincubator/gloo/pull/47

Differential Revision: D5283752

Pulled By: pietern

fbshipit-source-id: 8ad3353b3455c5416e31e75b46755e2f7fcaad52
2017-06-20 08:54:16 -07:00
a405efa756 CUDA collectives as alternative to NCCL
Summary:
Adds a separate set of CUDA collectives that run on device as an
alternative to NCCL. Use these collectives as default on-device
collectives instead of NCCL.

Whenever multiple processes on the same machine use Gloo with NCCL and
end up doing concurrent CUDA memory allocations and algorithm
execution, we risk deadlock. A follow-up change will enable opt-in
usage of NCCL (e.g. through environment variable).

Benchmark output below with varying number of elements. It shows a
minor improvement over using NCCL for local reduction and broadcast.

Number of elements equal to on-device threshold (256K):

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before)  262144       2685       2907       3035       3215        562
(after)   262144       2682       2874       3013       3395        577

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring_chunked
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before)  262144       2045       2133       2325       2643        725
(after)   262144       1533       1673       1834       2048        800

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_halving_doubling
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before)  262144       1580       1640       1718       2069        893
(after)   262144       1371       1446       1539       1748       1125
```

Larger number of elements (4M):

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before) 4194304      55543      58058      60103      62659         32
(after)  4194304      54490      57923      60893      66058         33

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring_chunked
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before) 4194304      18049      22820      24997      26634        105
(after)  4194304      18356      20463      21695      22589         99

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_halving_doubling
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before) 4194304      18584      24345      27809      29722         95
(after)  4194304      19541      22718      25408      26688         88
```

Reviewed By: akyrola

Differential Revision: D5278192

fbshipit-source-id: 53f09e404663ddc8bb46d06ac87afd8ee3ffc3a2
2017-06-20 00:23:43 -07:00
67968cb60b Add numerically stable BCELoss which takes logits as input (#1792) 2017-06-19 22:05:51 -04:00
a6c5e3f2e2 Fix case where interface doesn't have an address
Summary:
Code in tcp/transport tries to find the network interface a socket was
bound to when creating a TCP device context. Per getifaddrs(3), it is
possible for the ifa_addr field to be NULL (supposedly when an
interface doesn't have an address). Ignore such entries.

Thanks to slayton58 for reporting this.

Reviewed By: wesolwsk

Differential Revision: D5279376

fbshipit-source-id: 039380b95ba4d6d94942c30581e0b230a060870c
2017-06-19 18:05:32 -07:00
6ee6b4980b multiple docs 2017-06-19 20:06:27 -04:00
ceb13c8cc3 Don't propagate -mavx flag to dependents
Summary:
Previously, `gloo/math.h` inlined methods which use AVX builtins,
which required propagating the `-mavx` flag.
This diff moves these definitions out of the header and into a source
file to avoid this.

Reviewed By: pixelb

Differential Revision: D5271043

fbshipit-source-id: dde4dc560dfb557b46d1a582a8b38e7cb8eb0c37
2017-06-19 16:46:43 -07:00
82ef292f00 Add gradgradchecks for various autograd Functions and support Unfold double backwards. 2017-06-19 18:19:16 -04:00
76ee014d10 Add documentation to SELU and AlphaDropout 2017-06-19 18:18:01 -04:00
f619ac6ac9 Quickfix for AlphaDropout on CUDA 2017-06-19 18:18:01 -04:00
32e6372538 Split cuda_collectives.h into two files
Summary:
This change prepares for having a separate set of collectives that
use native CUDA calls instead of NCCL. This is needed to work around
the issue where NCCL deadlocks when it is interleaved with CUDA memory
management operations in other processes on the same machine.

Includes a modification to the host reduction functions to bring them
up to parity with the NCCL reduction functions (they now incorporate
offset/counter arguments).

Reviewed By: wesolwsk

Differential Revision: D5276291

fbshipit-source-id: 8844731760d2c48577d207c026ce0cd641f2fc6d
2017-06-19 12:57:53 -07:00
172a356668 forgotten import in variables.py
Fixing error on line 661: 
warnings.warn("masked_copy_ is deprecated and renamed to masked_scatter_, and will be removed in v0.3")
NameError: name 'warnings' is not defined
2017-06-19 14:23:48 +02:00
329a2f7d27 Prevent divide by zero in dropout with p=1 2017-06-17 11:38:02 -04:00
69e38ee821 clean test code, no functional change 2017-06-17 11:11:48 -04:00
38e6b9c7e7 fix bug in wrap_outputs miscounting the number of inputs 2017-06-17 11:11:48 -04:00
7775e9e777 add newNarrow to thpp THCTensor 2017-06-17 11:11:48 -04:00
293262b8f1 fix cuda tests 2017-06-17 11:11:48 -04:00
e66e01a2a0 remove extra computations for input usage check 2017-06-17 11:11:48 -04:00
0a93903e8e move tests to test_nn 2017-06-17 11:11:48 -04:00
bcac55dd2f force 1 stride for 1-sized dim for cudnn, fix lint, remove extra unpacking 2017-06-17 11:11:48 -04:00
6cdcd9c603 Add Narrow function
clean error message and support non-perfectly-sized inputs
2017-06-17 11:11:48 -04:00
075030d974 add cuda tests that use only cunn for finite difference computations 2017-06-17 11:11:48 -04:00
23dec70614 comment on working values for epsilon 2017-06-17 11:11:48 -04:00
fc0ab229ad remove extra cloning and add contiguous calls 2017-06-17 11:11:48 -04:00
ce3bc5a4a5 force cloning of weights 2017-06-17 11:11:48 -04:00
3dbece7eb5 clean tests 2017-06-17 11:11:48 -04:00
bd94718c87 cleaner AccumulateGrad 2017-06-17 11:11:48 -04:00
2f8d21a7f2 add contiguous function 2017-06-17 11:11:48 -04:00
4f4fc9091a add support for newTranspose in thpp::THCTensor 2017-06-17 11:11:48 -04:00
7ee095cf7f add newExpand and newView to thpp::Tensor 2017-06-17 11:11:48 -04:00
462ab8a644 add Transpose View Expand C functions 2017-06-17 11:11:48 -04:00
dd5c7c473f Add ConvBackwardBackward class 2017-06-17 11:11:48 -04:00
6dca309017 make AccumulateGrad support no input gradient 2017-06-17 11:11:48 -04:00
f945fbc3dd add gradgradcheck and conv double backward tests 2017-06-17 11:11:48 -04:00
db70d4d223 1) Simplify CompareOp autograd backward
2) Use better approach for avoiding divide-by-0 in autograd tests.
2017-06-17 09:38:28 -04:00
7714b5a088 Fix autograd shape tracking for 1-d reduction ops. 2017-06-17 09:38:28 -04:00
860f51e67f Avoid nans in fmod/remainder tensor tests.
Also clean up CompareOp autograd backwards impl.
2017-06-17 09:38:28 -04:00
2c04ce63a5 Fix masked_scatter autograd broadcasting. 2017-06-17 09:38:28 -04:00
83bfa5e1ab Fix masked_scatter pointwise autograd backward behavior. 2017-06-17 09:38:28 -04:00
618f20fb38 Fix autograd broadcasting for masked_fill. 2017-06-17 09:38:28 -04:00
9711223c12 Add broadcast autograd tests for dist. 2017-06-17 09:38:28 -04:00
7d0f1c51bb Fix autograd broadcast for min, max. 2017-06-17 09:38:28 -04:00
7560474fbb Fix autograd pointwise fallback for max,min. 2017-06-17 09:38:28 -04:00
e69fe5bdb0 Automatically detect when to skip inplace tests and fix lint. 2017-06-17 09:38:28 -04:00
f3ae90e329 Fix broadcast and pointwise compare ops with autograd. 2017-06-17 09:38:28 -04:00
bfdd1f2199 Fix fmod/remainder autograd broadcasting. 2017-06-17 09:38:28 -04:00
b164efb8b0 Fix lerp broadcast autograd. 2017-06-17 09:38:28 -04:00
94c7260087 Fix pointwise fallback for lerp. 2017-06-17 09:38:28 -04:00
aac459431b Fix pow autograd broadcast. 2017-06-17 09:38:28 -04:00
a04d1af0a4 Fix addr, addmm, baddbmm, addmv, addbmm broadcasting with autograd.
Fix autograd broadcast for addmm, baddbmm, and others.
2017-06-17 09:38:28 -04:00
a54a7c1312 Fix addcmul, addcdiv autograd broadcasting. 2017-06-17 09:38:28 -04:00
9ba799c26b Fix pointwise fallback for addcdiv, addcmul. 2017-06-17 09:38:28 -04:00
5cfb1329b5 Make implementation of Variable.mul_ and Variable.div_ consistent. 2017-06-17 09:38:28 -04:00
af2dd0d3e9 Fix autograd for broadcasting with add, sub, mul, div. 2017-06-17 09:38:28 -04:00
79a343bbd4 Remove unnecessary squeezing in Expand backwards.
Also add size checks to test_autograd to try to catch such issues.
2017-06-17 09:38:28 -04:00
88e4bec8fa resize bug fix 2017-06-17 11:07:22 +02:00
faa7c2cc2c fix cuda breakage 2017-06-16 20:13:46 -04:00
3cecdf84f1 Storage from_file method (#1821) 2017-06-17 00:34:20 +02:00
49586d9556 Add basic API support for NCCL 2.0
Summary:
\cc pietern
Minimal changes to allow gloo to compile and run with NCCL 2.0
Closes https://github.com/facebookincubator/gloo/pull/46

Differential Revision: D5268074

Pulled By: pietern

fbshipit-source-id: 58d625d57b31cfc932f3dbbdd7a4b83d9a2e60a8
2017-06-16 15:22:14 -07:00
8d33603901 make t() of Variable consistent with Tensor (#1823) 2017-06-16 16:08:53 +02:00
a64560c22e Remove flattening for torch.dot (#1781) 2017-06-16 02:15:33 +02:00
97f50edf46 Add documentation for Cholesky lapack functions (#1816) 2017-06-16 02:10:56 +02:00
86a96cd759 Merge commit 'd605afe8b51bf1522d3caf4efef4b3c85def499b' 2017-06-15 12:33:45 -04:00
f61ec2495e nn.EmbeddingBag to compute a bag of word embeddings (Embedding + Sum/Mean) 2017-06-15 12:32:47 -04:00
d605afe8b5 nn.EmbeddingBag to compute a bag of word embeddings (Embedding + Sum/Mean) 2017-06-15 12:32:28 -04:00
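A quick sketch of the intended usage: indices for all bags are packed into one flat tensor, with offsets marking where each bag starts.

```
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(10, 3, mode='mean')      # 10-word vocab, 3-dim vectors

indices = torch.LongTensor([1, 2, 4, 4, 3])    # bags [1, 2, 4] and [4, 3]
offsets = torch.LongTensor([0, 3])             # start position of each bag

out = bag(indices, offsets)                    # shape (2, 3): one mean per bag
```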
909f31764f Add nn.padding to docs fixes #1127 (#1808)
* exposed nn.padding modules

* using functional
2017-06-15 07:41:38 -04:00
ea5819045e a few comments in build_all.sh (#1807) 2017-06-14 17:58:56 -04:00
9c53c6dcb9 Fix errors and warnings when building docs (#1806) 2017-06-14 13:50:14 -04:00
9d916e561c batch norm docfix (#1804)
fixes the formula for batch normalization (moves the epsilon inside
the square root)
2017-06-14 11:57:46 -04:00
4e356528b4 Add torch.matmul function. (#1780)
* Add torch.matmul function.

Includes test_torch, test_autograd and docs changes.

* Add __all__ to functional so imports aren't accidentally imported.

* Include unbind in __all__.

* Add matmul case for when one argument is 1-dimensional and the other
at least 3-dimensional.

* Add squeeze_ to Variable.

* Use squeeze_ instead of squeeze for matmul.
2017-06-14 08:14:53 -04:00
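The dispatch rules are easiest to see by shape (illustrative sketch):

```
import torch

a = torch.randn(3)             # 1-d: treated as a vector
b = torch.randn(3, 4)          # 2-d: plain matrix multiply
c = torch.randn(5, 4, 2)       # 3-d: batched matrix multiply

torch.matmul(a, b).size()      # (4,)       vector @ matrix
torch.matmul(b, c).size()      # (5, 3, 2)  2-d arg broadcast over the batch
```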
9fd354e643 More accurate build instructions based on @apaszke's comments. (#1800) 2017-06-14 12:04:45 +02:00
c8e9bc493b Merge commit '244af06adc77674e7e1134d67d4a56ae7641f7b9' 2017-06-13 20:49:37 -04:00
6de5ce6bac Merge commit '1cf105d517c4308912eee85eff8f50f31c9e31f1' 2017-06-13 20:49:13 -04:00
38b9598685 Added GLU (gated linear unit)
From https://arxiv.org/abs/1612.08083
2017-06-13 20:48:19 -04:00
244af06adc Added GLU (gated linear unit)
From https://arxiv.org/abs/1612.08083
2017-06-13 20:48:03 -04:00
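GLU halves the input along the chosen dimension and gates one half with the sigmoid of the other; a minimal sketch using the functional form:

```
import torch
import torch.nn.functional as F

x = torch.randn(4, 10)
y = F.glu(x, dim=1)                  # shape (4, 5)

a, b = x.chunk(2, dim=1)             # same computation, spelled out
assert torch.allclose(y, a * torch.sigmoid(b))
```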
3ada9da808 Make csrc -Werror clean. (#1795)
Primary things I had to fix:

- Suppress _XOPEN_SOURCE warnings by ensuring that Python.h is included
  first, because it always unconditionally defines this macro.

- Turn off strict aliasing, because Python 2 doesn't work with strict
  aliasing.

- Workaround setuptools bug, where it's incorrectly passing
  -Wstrict-prototypes to C++ compilers (where this doesn't make
  any sense)

To compile csrc with -Werror, run `CFLAGS="-Werror" python setup.py build_ext`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 20:18:09 -04:00
5a63a6d47f Better document how to rebuild only parts of the project. (#1796)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 17:23:39 -04:00
38a48729f0 Merge commit '1a6995b28ca42df41270d4fd914adfb9c8c59674' 2017-06-13 16:31:48 -04:00
deb0aef30c Merge commit '122dd9e8ec4627ccdd895a7dc88a1ec6f13ad6d2' 2017-06-13 16:31:13 -04:00
3977ee3520 Support device on sparse tensor constructor, assert values/indices on same device.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:35 -04:00
c0e7bda3f1 Enforce storage is not NULL invariant for sparse tensors.
Fixes #1783.

There is an undocumented invariant in PyTorch that we should
try to avoid having storage == NULL as much as possible (even
though Torch supports it.)  This commit properly documents the
invariant, and fixes a bug in sparse where the invariant was
not respected.  This means that sparse tensors now correctly
remember what GPU they are associated with.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:35 -04:00
df412051fd Add comment stating nDenseTensors != nTensors in checkGPU.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:35 -04:00
7bee03fe1e Do NOT clone indices/values passed to sparse tensor by default.
Fixes #1782.

The default operation should be cheap: user can always choose to
explicitly make a copy on the way in.  Note that this is a
BACKWARDS COMPATIBILITY BREAKING change.  However, we DO create
a new tensor wrapper (so we are not affected by subsequent
size changes, etc.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:34 -04:00
865beada0e Add comment about new implementation being CPU-only.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:34 -04:00
6a46863c83 Abort on known bug (#1521) for spcadd on non-coalesced.
It's better to error than to silently give wrong results.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
d763db59a9 More efficient nnz test in spcadd.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
5d6e593c67 Test clone preserves uncoalescedness if it wasn't coalesced.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
bac408b693 Add some docs about storage->Size.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
2f967a204c Sparse tensor clone() preserves coalescedness.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
1a6995b28c Short-circuit copy if src and dest are equal.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:20:04 -04:00
122dd9e8ec Short-circuit copy if src and dest are equal.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:19:35 -04:00
7c024e93c6 Implement Cumprod function for autograd (#1439) 2017-06-13 17:48:15 +02:00
b4698d6d1d add init to __init__.py of torch.nn (#1789) 2017-06-13 09:02:30 -04:00
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
714351ff39 Officially enable process-group mode 2017-06-12 22:02:11 -04:00
6f51b4ce2d Fix deadlock in GlooCache 2017-06-12 22:00:22 -04:00
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00
23ab9d481a Add Module._all_buffers 2017-06-12 21:58:38 -04:00
8db8716c7c Support non-default streams in NCCL reduce 2017-06-12 21:58:38 -04:00
b37f18be53 Free GIL when entering THD functions 2017-06-12 21:58:38 -04:00
5a0d5ec058 Add more checks in torch.distributed 2017-06-12 21:58:38 -04:00
095ddc7d08 THD updates and bug fixes
* Add keepdim
* Fix DataChannel signature
* Fix incorrect locking
* Use current stream in DataChannelGloo
2017-06-12 21:58:38 -04:00
86a065e45b Add end callbacks to the engine 2017-06-12 21:58:38 -04:00
59d438de2e change function to remove dependence on CUDA 8.0
Summary: Replace call to function that is only supported in CUDA 8.0 with one that has been supported in previous releases.

Reviewed By: pietern

Differential Revision: D5231755

fbshipit-source-id: d72aec2a4a1c511064a65142887f8a05b51dad55
2017-06-12 15:53:59 -07:00
6626881e7a Add Alpha Dropout (#1775) 2017-06-13 00:39:49 +02:00
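Alpha Dropout is designed to preserve the self-normalizing property of SELU activations (it keeps mean and variance intact rather than zeroing units to 0), so a typical pairing looks like this sketch:

```
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.SELU(),
    nn.AlphaDropout(p=0.05),   # drops to the SELU saturation value, not 0
    nn.Linear(32, 2),
)
```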
49ec984c40 Ensure warnings are repeated in python2 for tests. 2017-06-11 05:37:59 -04:00
afaad94fed Rename autograd keepdim tests that now default to True. 2017-06-11 05:37:59 -04:00
4f602a52b5 Use THPUtils_assert rather than THError in torch/csrc/Module. 2017-06-11 05:37:59 -04:00
3abc8be42c Clarify use of warn vs raise in expand_utils and don't catch exception in Broadcast plugin when fallback = false. 2017-06-11 05:37:59 -04:00
f4ce99fd87 Add dist, atan2, lerp to fallback functions.
They weren't documented as having those semantics, but tests on
master show they do.
2017-06-11 05:37:59 -04:00
d5a0f97ea7 Renamed masked_copy to masked_scatter in test, fix use of break/continue. 2017-06-11 05:37:59 -04:00
e8ec4110f6 Fix Prod backward for broadcasting. 2017-06-11 05:37:59 -04:00
ffd808768e Remove raiseErrors from THTensor functions, have THStorage functions take an error_buffer to return a proper error message while being able to handle memory management correctly from calling function. 2017-06-11 05:37:59 -04:00
5b81746767 Simplify python warning settings and cleanup tests. 2017-06-11 05:37:59 -04:00
d49b73bbe6 Rename check_fallback to check_backincompat_expand_warn for clarity. 2017-06-11 05:37:59 -04:00
7040b82ede Change async/broadcast copy arguments to be parsed as ints. 2017-06-11 05:37:59 -04:00
723819014e Move expand_utils-inl.h to generic/ and generate via macros. 2017-06-11 05:37:59 -04:00
1ef4cc1591 Incorporate review comments:
1) Line up trailing dimensions in broadcast docs.
2) remove unnecessary expand_as in common_nn test.
3) use view in tensor_str instead of resize_.
4) newExpand remove raiseErrors change.
5) clarify expandedSizes/expandedStrides parameters in inferExpandGeometry.
6) simplify inferSize2/inferSizeN implementations.
7) use new-style classes for warning.
2017-06-11 05:37:59 -04:00
deec86cc05 Clarify a number of comments. 2017-06-11 05:37:59 -04:00
7da46097fe Fix lint errors. 2017-06-11 05:37:59 -04:00
21d9b0c9dd Ensure warnings are repeated in test, necessary in python2. 2017-06-11 05:37:59 -04:00
69287250d1 Add a broadcast parameter to copy_, use it in the library in cases where there is non-broadcasting calls exposed by the tests. 2017-06-11 05:37:59 -04:00
74a23c5aba Fix test_broadcast for cuda tensors, since map_, map2_ not implemented. 2017-06-11 05:37:59 -04:00
177785eecf explicit Ptr constructors, fast transposed copy. 2017-06-11 05:37:59 -04:00
ad9604f45a Add documentation for copy_. 2017-06-11 05:37:59 -04:00
65b23f146e Add broadcasting support for copy_, simplify code generation by moving a lot of currently generated code to expand_utils. 2017-06-11 05:37:59 -04:00
c54e532954 Add broadcasting support for map_, map2_. 2017-06-11 05:37:59 -04:00
ec120fac0c Add broadcasting support for masked_copy, masked_fill. 2017-06-11 05:37:59 -04:00
e06523482a Use THSize_isSameSizeAs, instead of THTensor_(isSameSizeAs) in order to compare sizes of tensors with different data types. 2017-06-11 05:37:59 -04:00
d6fb92fec9 Improve in-place broadcasting back compat warning message and fix an issue where the deprecated warning would not be printed. 2017-06-11 05:37:59 -04:00
5e1a714386 Add backwards incompatibility docs. 2017-06-11 05:37:59 -04:00
be65f46c76 Add optional warning for backwards incompatible keepdim. Setting torch.utils.backcompat.keepdim.warning.enabled=True will cause Python warnings in the case where the default value of keepdim is used for 1-d reductions.
Also specify keepdim via kwargs in library so these warnings have less
noise.
2017-06-11 05:37:59 -04:00
3556d1b8a3 Add optional warning for backwards incompatible broadcast.
Setting torch.utils.backcompat.broadcast.warning.enabled=True
will cause Python warnings in the case where broadcast occurs
but previously 1-d view style pointwise ops occurred.
2017-06-11 05:37:59 -04:00
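Both opt-in switches are plain module attributes; flipping them surfaces every call site whose result changed under the new semantics (sketch):

```
import torch
import torch.utils.backcompat

# Warn wherever broadcasting now applies but the old 1-d pointwise behavior
# would have been used, and wherever the keepdim default changed a result.
torch.utils.backcompat.broadcast.warning.enabled = True
torch.utils.backcompat.keepdim.warning.enabled = True
```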
5af46cb352 Add broadcasting support for matmul. 2017-06-11 05:37:59 -04:00
a36f95fe26 Add broadcast support for fused-matmul broadcasting. Functions are: addmm, addbmm, addr, addmv, baddbmm. 2017-06-11 05:37:59 -04:00
cd35091d9b Include simple broadcasting example and demonstrate lining up trailing dimensions. 2017-06-11 05:37:59 -04:00
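The alignment rule: compare sizes from the last dimension backwards; each pair must match or one of them must be 1 (missing leading dims count as 1). For example:

```
import torch

x = torch.randn(5, 3, 4, 1)
y = torch.randn(   3, 1, 1)

(x + y).size()   # (5, 3, 4, 1): y is expanded along dims 0, 2, and 3
```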
3c586d196a Document Broadcast Plugin. 2017-06-11 05:37:59 -04:00
8e2f347951 Proof that broadcasting 3 args (expand3) is equivalent to breaking up operation. 2017-06-11 05:37:59 -04:00
d279c6e099 Docs for addcdiv, addcmul 2017-06-11 05:37:59 -04:00
014372e707 Support "fused" ops: addcmul/addcdiv. 2017-06-11 05:37:59 -04:00
92fde6cf06 Breakup in place broadcast to better handle multiple arguments. 2017-06-11 05:37:59 -04:00
b44ea57ba8 Change order of Broadcast specification.
Since fused ops require broadcasting self over multiple other arguments,
it is simpler to specify broadcast on self rather than the other
way around.
2017-06-11 05:37:59 -04:00
e96f854ce2 Implement/test broadcasting semantics for comparison ops. 2017-06-11 05:37:59 -04:00
edf2969bd8 Backwards compatible Spatial Normalizations / CrossMapLRN. 2017-06-11 05:37:59 -04:00
e653fe2857 Test fixes for keepdim=False, suppress warnings on backwards-compatible behavior. 2017-06-11 05:37:59 -04:00
70c33777a6 pow, fmod, remainder should also fall back.
This behavior isn't listed in the docs, but the tests depend on it.
2017-06-11 05:37:59 -04:00
471dfe9791 Add documentation including links to numpy broadcasting semantics. 2017-06-11 05:37:59 -04:00
85d838a028 Testing over the following: 1) CPU tensor out-of-place functions 2) CPU tensor in-place functions 3) GPU tensor out-of-place functions 4) GPU tensor in-place functions 5) torch. functions 6) Fallback semantics (use pointwise nElem matching rather than broadcasting) 2017-06-11 05:37:59 -04:00
6a40acb4f0 Add Broadcast plugin. 2017-06-11 05:37:59 -04:00
9087624634 Revert "Restore examples with keepdim=True default."
This reverts commit 6fab62173e842bbf550de1c68cfae507ca35b800.
2017-06-11 05:37:58 -04:00
e772a440cb Revert "Change keepdim default to False."
This reverts commit e124790cb2b6675a4b6edf64620a7eb7f7228b29.

Note the original commit message is incorrect; this changes keepdim
back to false.
2017-06-11 05:37:58 -04:00
efd8b54be2 Merge commit 'e45c1046feba46aef2ffac1b1d978a3e76936bab' 2017-06-11 05:37:51 -04:00
54c3441e9c Merge commit '7d1b042cb2198d2bdb5871b08c6c0fb2ccc8e6b1' 2017-06-11 05:37:18 -04:00
7d1b042cb2 fix type 2017-06-11 04:42:34 -04:00
e45c1046fe Remove raiseErrors from THTensor functions, have THStorage functions take an error_buffer to return a proper error message while being able to handle memory management correctly from calling function. 2017-06-11 04:33:54 -04:00
a563ce1105 Incorporate review comments:
1) Line up trailing dimensions in broadcast docs.
2) remove unnecessary expand_as in common_nn test.
3) use view in tensor_str instead of resize_.
4) newExpand remove raiseErrors change.
5) clarify expandedSizes/expandedStrides parameters in inferExpandGeometry.
6) simplify inferSize2/inferSizeN implementations.
7) use new-style classes for warning.
2017-06-11 04:33:54 -04:00
92d52bf395 Add broadcasting support for copy_, simplify code generation by moving a lot of currently generated code to expand_utils. 2017-06-11 04:33:54 -04:00
0463ddf16b Support "fused" ops: addcmul/addcdiv. 2017-06-11 04:33:54 -04:00
9060e6be7f Remove raiseErrors from THTensor functions, have THStorage functions take an error_buffer to return a proper error message while being able to handle memory management correctly from calling function. 2017-06-11 04:32:08 -04:00
f0b8c4821b Incorporate review comments:
1) Line up trailing dimensions in broadcast docs.
2) remove unnecessary expand_as in common_nn test.
3) use view in tensor_str instead of resize_.
4) newExpand remove raiseErrors change.
5) clarify expandedSizes/expandedStrides parameters in inferExpandGeometry.
6) simplify inferSize2/inferSizeN implementations.
7) use new-style classes for warning.
2017-06-11 04:32:08 -04:00
0f79bf1a69 Clarify a number of comments. 2017-06-11 04:32:08 -04:00
503002eda7 Add broadcasting support for copy_, simplify code generation by moving a lot of currently generated code to expand_utils. 2017-06-11 04:32:08 -04:00
cf55e1e48a Add broadcasting support for masked_copy, masked_fill. 2017-06-11 04:32:08 -04:00
8d35d4215b Use THSize_isSameSizeAs, instead of THTensor_(isSameSizeAs) in order to compare sizes of tensors with different data types. 2017-06-11 04:32:08 -04:00
9356640453 Properly clean up expand error cases. 2017-06-11 04:32:08 -04:00
ae6b8d0112 Include simple broadcasting example and demonstrate lining up trailing dimensions. 2017-06-11 04:32:08 -04:00
ec2f6a81fd Support "fused" ops: addcmul/addcdiv. 2017-06-11 04:32:08 -04:00
1f9a365fdc Add Infer Size N, for expansion of fused operations. 2017-06-11 04:32:08 -04:00
d38a87217f Expand improvements
1) Rename calculateExpandGeometry to inferExpandGeometry for consistency
2) Simplify inferExpandGeometry implementation by using a single pass
   through dimensions
3) Implement a two operand expansion, expand2.
4) Implement versions that return error code to use for fallback to
equal nElem support.
2017-06-11 04:20:04 -04:00
baa4ba973b Expand improvements
1) Rename calculateExpandGeometry to inferExpandGeometry for consistency
2) Simplify inferExpandGeometry implementation by using a single pass
   through dimensions
3) Implement a two operand expansion, expand2.
4) Implement versions that return error code to use for fallback to
equal nElem support.
2017-06-11 04:19:37 -04:00
a24db91a38 Add SELU activation function (#1769)
* Add SELU activation function

* Remove unnecessary case

* Add Function for SELU + tests and fix RReLU inplace

* Fix extra line in doc

* Fix tests

Remove in-place tests for RReLU. For some reason they fail on legacy nn, but pass on nn

* SELU in new-style Function

It also supports double backprop, verified with gradgradcheck

* Fix flake8
2017-06-11 10:07:48 +03:00
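For reference, SELU is an ELU with fixed constants chosen so that activations self-normalize (alpha ~ 1.6733, scale ~ 1.0507); a usage sketch:

```
import torch
import torch.nn.functional as F

# selu(x) = scale * (max(0, x) + min(0, alpha * (exp(x) - 1)))
x = torch.randn(5)
y = F.selu(x)
```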
e3d5826b92 Add Cumsum double backwards support. (#1758) 2017-06-10 18:27:44 +02:00
ba690d5607 Add support for NVTX functions. (#1748) 2017-06-10 18:26:58 +02:00
5f1a16a018 Torch manual seed to seed cuda devices (#1762) 2017-06-10 12:37:21 +02:00
dcf07a2d7f Fix typo in ParameterList documentation 2017-06-10 02:16:52 +02:00
fab5bef9f6 Merge pull request #45 from slayton58/nccl_cmake_fix
Fix NCCL directory typo
2017-06-08 11:28:25 -07:00
21a5c8ea5e Fix use of nccl_INCLUDE_DIRS in nccl.cmake 2017-06-07 20:13:11 -04:00
5300aafc1f Fix NCCL directory typo 2017-06-07 17:01:13 -04:00
a9bd1de9e9 fixed README to reflect docker image name (#1751) 2017-06-07 15:49:39 -04:00
e57eef4bcb Merge commit '62835fc3f5346968b4dca392c77efdeb75a6b172' 2017-06-07 14:54:47 -04:00
d81da41650 Make sure the number of MKL and OpenMP threads match
Otherwise, on many machines, the size of the OpenMP thread pool will
change between MKL and our OpenMP enabled functions. The constant thread
creation and destruction results in worse performance and leaks memory
on GCC 5.4
2017-06-07 14:53:29 -04:00
62835fc3f5 Make sure the number of MKL and OpenMP threads match
Otherwise, on many machines, the size of the OpenMP thread pool will
change between MKL and our OpenMP enabled functions. The constant thread
creation and destruction results in worse performance and leaks memory
on GCC 5.4
2017-06-07 14:53:14 -04:00
da7957c660 Fix masked_copy call to masked_scatter. (#1749) 2017-06-07 12:58:47 -04:00
2a49353d5e minor fix for docs of Upsample 2017-06-07 11:42:52 -04:00
b05c23de44 Merge commit 'da45b4c6b3b0b7cd8f0dc612b9afa6a3a07b8305' 2017-06-07 11:31:38 -04:00
019e967113 Merge commit '47bf87b9220c10edaafec98c6bd20bdb1436c8e4' 2017-06-07 11:30:35 -04:00
b9ab26765e Add 3D upsampling (nearest and trilinear) with tests 2017-06-07 11:29:27 -04:00
da45b4c6b3 Add 3D upsampling (nearest and trilinear) with tests 2017-06-07 11:24:41 -04:00
edd41d8d80 BatchNorm fallback to THNN when eps < CUDNN_BN_MIN_EPSILON (#1742) 2017-06-07 09:56:28 -04:00
352f8b2fa6 Merge commit 'ced01f6c919c4b7109512ce797a2a0185c8f8112' 2017-06-07 09:22:14 -04:00
ced01f6c91 fix GRUFused signature 2017-06-07 09:21:20 -04:00
d351239c10 fix legacy ClassNLLCriterion for upstream change 2017-06-07 00:38:00 -04:00
1b1579c89d Merge commit 'b96f76e470b25454b6b14c7ace888686295405e9' 2017-06-07 00:19:42 -04:00
df7c47142d fix for THNN NLLLoss signature change 2017-06-07 00:18:11 -04:00
b96f76e470 standalone macros 2017-06-07 00:17:05 -04:00
7e62971c86 Merge commit '71ccedbc6c4e460d38c794737bba780e7673e888' 2017-06-06 23:38:52 -04:00
a7d987544d Merge commit '4e49aed5eaa5a4abaf0a51bb87a49b44394ea3c3' 2017-06-06 23:35:42 -04:00
71ccedbc6c Merge pull request #470 from qqning/master
Fix the mix-up of height and width on depth-wise convolution
2017-06-06 23:31:54 -04:00
c3cda260b6 Merge commit '64faf120acb97866dfd90bf428b385deee4ee912' 2017-06-06 23:27:45 -04:00
22949350b6 More performant fix for fused rnn kernels (#1532) and bugfix (#1721) 2017-06-06 23:25:31 -04:00
3f7b48ccda Remove clone in fused rnn 2017-06-06 23:20:14 -04:00
d7db75c10f added CosineSimilarity to nn.distance and updated docs (#1672)
* added CosineSimilarity to nn.distance and updated docs
2017-06-06 22:53:21 -04:00
e50d599240 Fix header inclusion in math.h
Summary:
While debugging #43 I found common/common.h missing some headers as well.

Fixes #43.
Closes https://github.com/facebookincubator/gloo/pull/44

Differential Revision: D5194970

Pulled By: pietern

fbshipit-source-id: 4861cd04c56931d4759f5bc050816788252003ee
2017-06-06 15:21:08 -07:00
c6a6391c38 added checks to cudnn Convolution for stride, dilation, kernel size and num input planes (#1723)
* added checks to cudnn Convolution for stride, dilation, kernel size and num input planes
2017-06-06 15:42:00 -04:00
d50ad408fa fix incorrect grad_weight in Bilinear 2017-06-06 15:07:09 -04:00
73ccdb3920 Fixing the issue with incorrect normalized values in IndexLinear 2017-06-06 11:44:11 -07:00
b6c75c43c8 add tests for checking the type of .data and .grad.data is the same 2017-06-06 01:06:14 -04:00
a53cde09b5 Rename masked_copy_ to masked_scatter_ 2017-06-06 01:06:14 -04:00
98afdcf409 Accept None values returned from grad hooks 2017-06-06 01:06:14 -04:00
ef32e96447 Fix grad type of compare functions 2017-06-06 01:06:14 -04:00
b032b88f34 Fix Prod backward and autograd tests 2017-06-06 01:06:14 -04:00
a76098ac15 fix optimizer when given single parameters (instead of an iterable)
When using named_parameters to modify the lr and weight decay, a bug occurs because the value named_parameters returns is a torch.nn.parameter.Parameter, not a generator of Parameters.
2017-06-05 23:47:56 -04:00
2ce5875a4d Modify the sample code of extending autograd (#1720)
The original input cannot be used as input to Linear(), because forward() takes at least 3 arguments (2 given)
2017-06-05 23:36:58 -04:00
511cb20e7d Add Gesv to autograd (#1733)
* Add Gesv to autograd

* Add TODO for backprop through LU
2017-06-05 21:38:49 -04:00
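A sketch of the call being made differentiable (0.2-era API; later releases replace torch.gesv with torch.linalg.solve):

```
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 1)

# Solves A @ x = b; also returns the LU factorization used internally.
x, lu = torch.gesv(b, A)
```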
e3305eb9dc Runtime dockerfile (#1732)
* reduce the size of Docker image

* add runtime dockerfile
2017-06-05 17:40:06 -04:00
e9bf702c5e LSTM bias_hh, fix docs
Rename W_hi ... to b_hi ...
2017-06-05 22:55:09 +02:00
9a2d11dd36 Use a longer timeout when establishing initial tcp connection
Summary: Machines may not create their Gloo pairs at the same time, due to earlier variable-time work. Increase the timeout used to establish the initial tcp connection to accommodate this, without sacrificing the shorter default timeout for outstanding reads/writes. No related change required for ibverbs as there is no communication on init.

Reviewed By: akyrola

Differential Revision: D5184518

fbshipit-source-id: 0e6c9704a2d2f1406b3927f75887f0a42199450b
2017-06-05 13:40:22 -07:00
3716286e6b reduce the size of Docker image (#1729) 2017-06-05 14:03:11 -04:00
c357ebd590 Merge commit '6422ea3d9f065683bb899b88ae0baec79e6d73ca' 2017-06-05 13:01:25 -04:00
85a95d8a23 Fix sharing of CUDA tensors on non-current devices
The correct device must be set when getting the base allocation and when
calling cudaIpcCloseMemHandle. Store the device in the allocators
context, which was previously always NULL.

Fixes #1707
2017-06-05 13:01:19 -04:00
6422ea3d9f Fix sharing of CUDA tensors on non-current devices 2017-06-05 12:58:34 -04:00
ddf6328990 Document type function returns type with no args (#1719) 2017-06-05 11:54:55 -04:00
174c3cc399 Add support for double backward of LeakyReLU (#1714) 2017-06-05 11:53:27 -04:00
24aecaa2c8 Cleanup torch vision docs (#1699)
* Modify torchvision documentation following https://github.com/pytorch/vision/pull/179

* Add new datasets to docs

* Fix wording in torch.datasets

* Small clarification
2017-06-05 11:52:41 -04:00
4853cc0194 convert linalg.py to new-style functions (#1638) 2017-06-04 09:27:01 -04:00
ac1c674723 Fix a couple of selection reduce function autograd bugs (#1702)
* Fix Median/Mode autograd functions.

* Fix kthvalue autograd function.

* Double backward for selection reduce functions.
2017-06-03 02:12:15 -04:00
eba3dc8561 Fix gc_refs assertion failure (#1705)
* Fix gc_refs assertion failure

Ensure that each THPVariable -> THPFunction reference contributes one
ref count to the THPFunction by creating a new shared_ptr for each ref.

Because multiple shared_ptrs can again manage a single THPFunction, it's
not safe to use std::weak_ptr where it may point to a PyFunction. It's
still safe to use weak_ptr for grad_accumulator since these are never
PyFunctions.

Fixes #1626

* Remove stale comment
2017-06-02 21:08:50 -04:00
ee9d4d58e2 Fix connect bug
Before the change, processes were not waiting for the master even when they got
'connection refused' (the master is not listening yet, so we should wait).
This was because we were closing the socket twice: first via
the resource guard, then manually in the exception handler.
That caused errno to be set to a different value (9 - bad file descriptor),
so the `if` that checked whether the connection was refused failed.
2017-06-02 23:42:11 +02:00
b7c4900d19 Fix minor bug in InitMethodFile 2017-06-02 23:42:11 +02:00
e22f9036de Add tcp init method for non-multicast addresses 2017-06-02 23:42:11 +02:00
c01ff1f3dc Make world_size mandatory for Master and Worker; Minor refactor 2017-06-02 23:42:11 +02:00
eeb8e5c31b Linux fixes 2017-06-02 23:42:11 +02:00
c6c9e61169 Implement THD tensor copies 2017-06-02 23:42:11 +02:00
34804e9600 Refactor file and tcp init methods
* Add sanity checks
 * Refactor InitMethodFile and TCPInitMethod to more logical functions
 * Update few error messages
 * Add passing parameters by **kwargs, so the order of parameters is no longer relevant
 * Review comments
2017-06-02 23:42:11 +02:00
c41555fb0a Add rank parameter; Fix MW mode initialization 2017-06-02 23:42:11 +02:00
96cc1e1ac7 Review comments 2017-06-02 23:42:11 +02:00
cfdd49f76a Simplify and refactor init code 2017-06-02 23:42:11 +02:00
447d9287bf Refactor multicast and change env init method 2017-06-02 23:42:11 +02:00
832eaf900b Fix bugs and improve init methods 2017-06-02 23:42:11 +02:00
e685277299 Add address discovery; Bug fixes; 2017-06-02 23:42:11 +02:00
8ea7c87c29 Improve init methods 2017-06-02 23:42:11 +02:00
09c0d9c51c Add multiple initialization methods for DataChannels 2017-06-02 23:42:11 +02:00
240384605c Make copy functions thread safe (#82) 2017-06-02 23:42:11 +02:00
9f9a3d596f Use lock_guard and don't use unique_ptr 2017-06-02 23:42:11 +02:00
a8c26c1040 Add mutexes to MasterCommandChannel::sendMessage 2017-06-02 23:42:11 +02:00
6cdfe0d7b9 Remove MASTER_ADDR and _PORT from MPI benchmarking 2017-06-02 23:42:11 +02:00
1b66b50064 Benchmarks: Don't export WORLD_SIZE when using MPI
I just realized we don't need it (any longer?).
2017-06-02 23:42:11 +02:00
cf42c1a044 Improve error messages of DataChannel::newChannel 2017-06-02 23:42:11 +02:00
f717f29d7e Change function names; Change thpp::Tensor to THDTensorDescriptor 2017-06-02 23:42:11 +02:00
181d2f41bd Add initial Python wrappers for THDTensors 2017-06-02 23:42:11 +02:00
2059ece284 Exit workers gracefully in master-worker mode 2017-06-02 23:42:11 +02:00
b3e100b40e Add copy (TH <-> THD) functions to MW mode 2017-06-02 23:42:11 +02:00
ec2de16776 Improve README copyediting 2017-06-02 21:02:14 +02:00
ea05d6aec3 Fix compilation with cuDNN 5 (#1703) 2017-06-02 14:03:02 -04:00
5a93d6b903 Fix CUDA_HOME detection (#1675) 2017-06-02 19:26:00 +02:00
75e0df271a Add Inverse to autograd (#1670)
* Add Inverse to autograd

* Add SkipTest to autograd tests
2017-06-02 12:00:13 -04:00
565bf7116b A pile of misc doc fixes. (#1682)
* A pile of misc doc fixes.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Handle @apaszke  review comments.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Initial csrc documentation.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-02 11:59:03 -04:00
f1c57ace1b added input dim checks to convxD and conv_transposedxd (#1695)
* add input dim check for conv2d

* add None check to conv2d

* added input dim checks to convxD and conv_transposedxd

* flake8 fixes
2017-06-02 11:58:19 -04:00
460b8715a8 display version number in docs 2017-06-02 11:56:48 -04:00
6da111c53d Merge commit '00843c57c936720b3d17f4c0afaab08dcb52a7cc' 2017-06-02 11:52:19 -04:00
568c5c91ee substitute cudnnFind* functions with cudnnFind*Ex 2017-06-02 11:52:12 -04:00
00843c57c9 substitute cudnnFind* functions with cudnnFind*Ex 2017-06-02 11:50:50 -04:00
501467db17 added param name to tuple_parser for better error messages 2017-06-02 16:16:21 +02:00
d51cd61e2e add checks for input, weight and bias types when using cudnn conv2d (#1689) 2017-06-01 10:06:30 -04:00
447fe953e5 Modify the sample code of volatile (#1694)
The original two inputs (torch.randn(5,5)) cannot be used as input to resnet, which expects (batch, channels, height, width)
2017-06-01 09:46:04 -04:00
7b5af7d1b7 Expand ibverbs read timeout messages
Summary: TSIA

Reviewed By: romain-intel

Differential Revision: D5158642

fbshipit-source-id: 6e55a69a140c1f5f6e4ce6262afaf5014c412414
2017-05-31 19:50:21 -07:00
afc26ac675 Added time-out to ibverbs transport
Summary: Extended the time-out option from just working on TCP to also working with ibverbs

Reviewed By: pietern

Differential Revision: D5090258

fbshipit-source-id: fee685850d761d0c2130852f513c64ceb19f4e9e
2017-05-31 11:20:40 -07:00
6f791e74f1 Add a minimum iteration count of 1 for benchmarks
Summary:
For some long running benchmarks, the iteration count could be 0
which would lead to a segfault when printing results

Reviewed By: pietern

Differential Revision: D5149034

fbshipit-source-id: 7b56e8961c302d1ff11ffcd74ca8e909ea046231
2017-05-30 18:12:39 -07:00
3106423713 Synchronize with H2D copyAsync before signalling the broadcast sender
Summary: Closes https://github.com/facebookincubator/gloo/pull/41

Differential Revision: D5149996

Pulled By: pietern

fbshipit-source-id: 15d61fab9babfeb1e4178b84ecf5f6e32ad3bfb3
2017-05-30 14:20:29 -07:00
4eb448a051 Fix simple typo
Dimension a bit wrong
2017-05-28 18:53:04 +02:00
065c59860a Fix docs: masked_fill_ takes a value, not a tensor. (#1663) 2017-05-26 14:41:03 -04:00
45f665d05c Fix decodeUInt64BE
Fixes #1658
2017-05-26 11:21:31 -07:00
64faf120ac Adding support for ADD_TORCH_LIBRARY macro 2017-05-25 15:41:52 -07:00
0b74f0d796 lua 5.3 changes and gcc constants 2017-05-25 15:41:52 -07:00
8074180081 Faulty error message for InstanceNorm1d (#1609) 2017-05-25 17:13:01 -04:00
5ce4a4adbf Merge commit '3f1f3f97343d2ab7eb522cac7330f6b7478bd4da' 2017-05-25 16:51:57 -04:00
3e9caed731 Merge commit 'bd705d38ce11a0ca1547f709f29f80a02b3dd894' 2017-05-25 16:51:09 -04:00
7b578dd68e Add scatterAdd 2017-05-25 16:49:48 -04:00
3f1f3f9734 Add scatterAdd 2017-05-25 16:49:32 -04:00
bd705d38ce Add scatterAdd 2017-05-25 16:49:22 -04:00
630af4d7d8 add learning rate schedulers (#1370) 2017-05-25 16:21:43 -04:00
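A minimal sketch of the new scheduler API (StepLR shown; the training step is a hypothetical placeholder):

```
import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)  # lr *= 0.1 every 30 epochs

for epoch in range(90):
    # train_one_epoch(model, optimizer)  # hypothetical training step
    scheduler.step()
```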
0409b42a02 Merge commit '3abe5c80d2073f0e72f79b88f11b2a9d320fb116' 2017-05-25 15:40:27 -04:00
c39d48ea7d Fast transposed copy 2017-05-25 15:39:21 -04:00
3abe5c80d2 Fast transposed copy 2017-05-25 15:39:07 -04:00
05bc877a05 make THPPointer have explicit constructors (#1636) 2017-05-25 15:35:54 -04:00
7ea9d9af4e Fix build when included by another project; take 2
Summary:
Only adding `include_directories` doesn't propagate to the including
targets. Also use `target_include_directories` to do so.
Closes https://github.com/facebookincubator/gloo/pull/39

Differential Revision: D5131001

Pulled By: pietern

fbshipit-source-id: 6c58c4b76ae7fa008e4fb26d1bca7900165884d0
2017-05-25 11:50:23 -07:00
6a7c56499c How to manage multiple build trees of PyTorch. (#1654)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-25 11:21:52 -04:00
46ee1e4687 Clarify definition of gather function in docs. (#1652) 2017-05-25 11:06:28 -04:00
e63b49d9ab Fix build when included by another project
Summary:
The CMake variable CMAKE_BINARY_DIR points to the top level build
directory. For standalone Gloo builds this path lets files include the
generated file "gloo/config.h". When Gloo is included as project, this
variable points to a different path and "gloo/config.h" cannot be
resolved. Fix is to build a path from CMAKE_CURRENT_BINARY_DIR.
Closes https://github.com/facebookincubator/gloo/pull/38

Differential Revision: D5129385

Pulled By: pietern

fbshipit-source-id: 722cebf4892b34f869fe43320153efbb181555b6
2017-05-25 07:50:53 -07:00
036c3f93af Check for released variables in SavedVariable::unpack() (#1648)
Fixes #1288
2017-05-25 00:35:19 -04:00
4f261f5730 Add support for fast float16 reductions using AVX
Summary: Using Misha's vectorized AVX code to greatly improve performance of reductions on float16 values. Float16 reductions are now 2x faster than float.

Reviewed By: pietern

Differential Revision: D5123331

fbshipit-source-id: 03d4e76886d538b7e24eedaf32a92231a80b1e43
2017-05-24 21:20:06 -07:00
98581b9f7e Fix conv1d segfault when weight doesn't require grad (#1646)
Fixes #1600
2017-05-24 20:46:32 -04:00
9a497f824b Add size/dimensionality documentation for torch.gather. (#1645) 2017-05-24 20:42:18 -04:00
1e63a04a18 Use clear-to-send notification for broadcast algorithms
Summary:
The broadcast algorithms use the buffers they were given directly.
There is no inbox/outbox pattern. This means that we can race if the
algorithm is run repeatedly within a short time frame. This hasn't
been an issue so far since we've only used it in combination with
other process wide barriers.

Since this adds a round trip the latency of these ops from the root
rank perspective increases. The variance between the before and after
runs is pretty high since there is no back and forth interaction on
the root. It simply waits for recipients to be ready and then sends
its data.

Before:

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   broadcast_one_to_all
Options:     processes=4, inputs=1

   elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
        100          1         16         29         50     426075
        200          2         17         32         50     179953
        500          2         11         31         59     140291
       1000          2         12         29         59     177619
       2000          3         12         29         62     117882
       5000          5         16         31         64     127113
      10000          9         21         38         88      60328
      20000         19         36         65        130      30427
      50000         48         68        221        556      11180
     100000         92        136        426        871       7314
     200000        193        251        829       2965       4092
     500000        492        638       2098       4133       1677
    1000000       1195       2024       3513      11646        628
    2000000       3446       4216       5007      17100        282
    5000000      12956      13919      14941      37751         71

```

After:

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   broadcast_one_to_all
Options:     processes=4, inputs=1

   elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
        100         15         37         52        107      27332
        200         14         40         63        199      28620
        500         17         37         52        118      18299
       1000          9         39         57        120      33375
       2000         20         57         78        180      24779
       5000         31         61         84        190      18039
      10000         39         70         90        225       8908
      20000         57        108        130        940       8313
      50000         94        163        217       1933       5326
     100000        132        231        331       3501       3681
     200000        256        426        560       6509       2272
     500000        774       1092       1698      10039        985
    1000000       1132       2106       3878      18218        484
    2000000       3509       4252       6832      20228        226
    5000000      11326      15447      27129      52694         77
```

Reviewed By: wesolwsk

Differential Revision: D5123341

fbshipit-source-id: f3bab4f75ef7c38817f74f00b382f18fe43d85d5
2017-05-24 15:36:36 -07:00
e54112758c Fix potential vector out of range issue in ContextFactory::makeContext
Summary: Vector out-of-range error was being triggered in some tests due to trying to get the address of an element past the end of vector.

Reviewed By: pietern

Differential Revision: D5123044

fbshipit-source-id: 004f72ebaa27c609290959c12a3d99b16289bfa8
2017-05-24 14:50:09 -07:00
e1d257bc6d Fix segfault in autograd: (#1644)
* Fix segfault in autograd:

1) Every "output" variable must have a grad_fn or grad_accumulator
2) compute_partial_exec_callbacks uses Python errors

* assertRaisesRegexp was renamed assertRaisesRegex in 3.2

* Use HANDLE_TH_ERRORS macro
2017-05-24 17:13:08 -04:00
3d38e4f126 Acquire GIL before THPVariable_wrap (#1625)
* Acquire GIL before THPVariable_wrap.

* mutex not required when GIL is held.

* Remove unused mutex.
2017-05-24 15:19:34 -04:00
fa93653d09 Improve handling of graph roots in autograd engine (#1635) 2017-05-24 14:50:07 -04:00
ff047fdeef Fix the mix-up of height and width on depth-wise convolution 2017-05-24 21:05:08 +08:00
2486a6bbd0 Add missing header file types.h in CMakeLists.txt
Summary: A recently added header file was missing in CMakeLists.txt

Reviewed By: pietern

Differential Revision: D5116962

fbshipit-source-id: 6c3fbd4b49c913f20308c1b057a7e09806e0c2b0
2017-05-23 16:50:41 -07:00
640846b864 Fix race in ibverbs transport
Summary:
In a previous commit where the slot numbering was expanded, I changed
the memory region send/recv path to use a map for the outgoing memory
regions (since they may complete out of order). Before, this was a
fixed size array, which was mutated by both the user thread and device
thread without holding a lock. The map, however, can't be mutated
without a lock. This change adds that lock and a few assertions to
check for this type of problem.

Reviewed By: andrewwdye

Differential Revision: D5108194

fbshipit-source-id: 1908c988112469ecdec6cb6eb9849068d896c409
2017-05-23 15:38:48 -07:00
ba56de1150 add coding UTF-8 declaration 2017-05-23 16:02:34 -04:00
6e3e453ad2 Tidy up convs docs (#1602) 2017-05-23 18:32:33 +02:00
f5d919a685 Generate config.h file with compilation options
Summary:
This file can then be used by downstream code to figure out what Gloo
features it can support (e.g. ibverbs transport or not).
Closes https://github.com/facebookincubator/gloo/pull/36

Differential Revision: D5110769

Pulled By: pietern

fbshipit-source-id: 2c0c07537258048737ae764a4978f2f7fdbd992d
2017-05-23 09:26:03 -07:00
02e4ca9cab fix wrapper 2017-05-23 08:43:13 -07:00
70a774898e Remove superfluous forward declaration
Summary: ContextFactory is no longer mentioned in gloo/context.h.

Reviewed By: romain-intel

Differential Revision: D5110328

fbshipit-source-id: 48dd020dc39d71d0d5f72deebfa5d80122b70c0d
2017-05-23 08:20:55 -07:00
49befe3fcd Remove commPairs_ member variable from halving/doubling
Summary: TSIA

Reviewed By: wesolwsk

Differential Revision: D5110348

fbshipit-source-id: d3346e2af1a9f13410dc93336c53040a29e22e66
2017-05-22 21:21:42 -07:00
7eac2073b8 Add notification mechanism to ContextFactory
Summary:
This is another example where our unsolicited writes may interfere
across calls to the collective function. In this case, it was possible
for a second call to overwrite a pair's address before it had been
used to connect the pair in the previous iteration.

Thinking out loud, we could avoid this from happening by supporting
this pattern natively in the Buffer classes. For example, we can add a
notification mechanism (opt in) to the Buffer class such that the
receiver may call `ackRecv()` to acknowledge receipt and handling of
the data in the buffer. Then the sender will block on new sends until
acknowledgement from the previous send has been received. Until then,
we have to keep an extra eye out.

Reviewed By: wesolwsk, romain-intel

Differential Revision: D5095430

fbshipit-source-id: 4c100433108fccea7457bba4dc00f651f722e6c9
2017-05-22 19:50:18 -07:00
45524ec33c Fix indices bug in MM.py (#1613) (#1617) 2017-05-22 16:47:51 -04:00
f072c74dfd make transferring a tensor from other devices to device 0 take effect (#1610) 2017-05-22 11:06:57 -04:00
2acfb2376a fixes eval mode in InstanceNorm (#1604)
fixes https://github.com/pytorch/pytorch/issues/1541
2017-05-21 13:27:48 -04:00
0c5598c668 Update build status matrix 2017-05-21 12:20:50 +02:00
feaee29bfe Add argmax and argmin to docs 2017-05-20 18:56:20 +02:00
7f6cd7c7ea Fix error message in CUDA forked subprocess (#1585)
We need to re-call _lazy_init in _CudaBase.__new__ in the subprocess.
2017-05-19 12:36:08 -04:00
625850c2c2 Check cuDNN version at runtime (#1586)
* Check cuDNN version at runtime

This checks that the version from cudnn.h matches the version from
libcudnn.so.

Fixes #1476

* Only check major and minor version numbers
2017-05-19 01:55:09 -04:00
9b3447761a Check for required non-None arguments in C++ autograd functions (#1589) 2017-05-19 01:47:35 -04:00
ed679fc43c disabling fd leakchecker test (#1593) 2017-05-19 01:20:50 -04:00
e6c9509a41 Fix call to Tensor.set_ in rnn.py (#1592) 2017-05-18 20:28:49 -04:00
c57f0530e7 set long_args to False for the "size" param of set_ (#1568)
* fix #1524, set long_args to False for the "size" param of set_
2017-05-18 19:31:36 -04:00
8021bb938c Remove slot number limitation from ibverbs transport
Summary:
The pair was still hardcoding limits on the slot numbers. In this
change those limits are lifted.

This also adds back assertions on work completion status in
handleCompletion.

Reviewed By: wesolwsk

Differential Revision: D5090457

fbshipit-source-id: 7bf884e1f31e48e8f1cdfb179a225999e28171b2
2017-05-18 16:20:40 -07:00
1f4317be3f Add support for half-precision floating point operations
Summary: Add support for collectives over vectors of half-precision floating point values.

Reviewed By: pietern

Differential Revision: D5062938

fbshipit-source-id: 0b39fa53370393fec1edf2d852ff7f1d862b9022
2017-05-18 15:09:06 -07:00
cba46a4869 Assert that we don't do out of bound writes on recv
Summary:
The halving/doubling algorithm had two instances where a receive
buffer was registered with a number of elements instead of a number of
bytes. This change adds the assertion that should have caught this in
the first place.

Reviewed By: wesolwsk

Differential Revision: D5089483

fbshipit-source-id: fd0f0724ef04300236c9297ee88b27e61fb1e5a0
2017-05-18 14:34:39 -07:00
b391f53681 Cache send/recv buffers in ContextFactory
Summary:
The original implementation created temporary buffers on the backing
context. This also meant an ordering problem when using the ibverbs
transport, as a call to send will block until the remote side has
created its receive side buffer. Since all buffers are now created
prior to using them, this is no longer an issue.

Reviewed By: romain-intel

Differential Revision: D5082352

fbshipit-source-id: 4c260f06e8f461c0336e7eec7ca891e07ff41cd3
2017-05-18 10:20:42 -07:00
85732b52ec fix cuda multiple algorithm test
Summary: Fixing a bug in the multiple algorithm test where threads were spawned repeatedly, causing collisions during rendezvous.

Reviewed By: pietern

Differential Revision: D5082945

fbshipit-source-id: 4adbbc963b1ff652f73a44cd9fd75dcd3325f182
2017-05-17 16:35:25 -07:00
156fe28666 dataloader can now handle growing datasets (#1575) 2017-05-17 19:23:15 -04:00
2f4bf4ab39 Rewrite 'How autograd encodes the history' to accurately describe current setup. (#1580)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-17 19:21:20 -04:00
1f3ff5ced2 Miscellaneous documentation around autograd. (#1577)
* Miscellaneous documentation around autograd.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-17 19:19:24 -04:00
b8b7f879c2 .gitignore updated with editor temporaries (#1574) 2017-05-17 19:16:02 -04:00
7b10b16496 Move ibverbs buffer send logic to pair.cc
Summary:
TSIA

This matches the approach in the TCP transport where all send/recv
logic is contained in the pair code.

Reviewed By: wesolwsk

Differential Revision: D5082503

fbshipit-source-id: b70886ed9aaeb381cdb45fba00704118cff62a23
2017-05-17 15:54:34 -07:00
da86633c7c Additional synchronization in halving/doubling
Summary:
This is necessary to avoid the next iteration of the algorithm
overwriting data in recvBuf_ before it has been consumed by the
receiver of that data. If this does happen, the result of the previous
iteration for the receiving end is corrupted. This can only happen in
async mode on the TCP transport (so all incoming data is unsolicited)
when spinning on the run function.

Reviewed By: wesolwsk

Differential Revision: D5074789

fbshipit-source-id: 66668fbd885888f26266d812e78d61c6d65c2461
2017-05-17 15:21:09 -07:00
c573d53939 Bug fixes (#1573)
* Fix clang warnings
* Raise errors when unsupported ConvNd configurations are used
* Properly handle Variable indexing with LongTensors
* Support both tensors and variables in Variable.type_as
2017-05-17 15:28:16 -04:00
cb79c24d0b Added powerpc64le support (#1572) 2017-05-16 08:30:06 -06:00
caa1cdf0ce ClassNLLCriterion ignoreIndex 2017-05-15 22:27:00 -04:00
368ecb47f9 Fix flaky test_sparse_adagrad (#1562) 2017-05-16 01:03:08 +02:00
6107d15d14 Twice differentiability of pointwise functions (#1531) 2017-05-15 12:00:59 -06:00
ba885a1a51 expose bitwise operators from C/CUDA (#1556)
* fix issue #1549, expose bitwise and

* expose C bitwise or of Tensor

* expose C bitwise xor of Tensor

* use built-in method for inplace and, or, xor

* expose C bitwise lshift(ilshift) and rshift(irshift) of Tensor
2017-05-15 11:36:15 -06:00
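The newly exposed operators map onto the usual Python dunder methods (sketch on integer tensors):

```
import torch

a = torch.LongTensor([0b1100, 0b1010])
b = torch.LongTensor([0b1010, 0b0110])

a & b     # elementwise and  -> [0b1000, 0b0010]
a | b     # elementwise or   -> [0b1110, 0b1110]
a ^ b     # elementwise xor  -> [0b0110, 0b1100]
a << 1    # elementwise left shift
a &= b    # in-place variants are exposed as well
```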
ce1a0eb6c9 Merge commit '7afd78d77ffad503357c35f495ae6d4d2b008862' 2017-05-15 11:20:27 -06:00
7afd78d77f Cuda reduce in a consistent direction 2017-05-15 11:18:20 -06:00
6b84dc26f0 Add F.cosine_similarity (#1502) 2017-05-15 11:12:54 -06:00
0f458ee3c4 Fix memory leak in THCSTensor_spcadd. (#1519)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-15 11:11:03 -06:00
8aa011f52a minor typo and style changes to _torch_docs.py (#1559) 2017-05-15 15:32:56 +02:00
2a610c9d13 Revert "Update to ignore zero targets" 2017-05-14 18:15:30 -07:00
0ba20435ce Add high order grad support for Some operator (#1507) 2017-05-14 23:02:04 +02:00
6fc9130052 Adapt documentation to reflect new supported argument (#1548)
Reflect the changes of #1323
2017-05-13 21:09:34 -06:00
28f4f6db2c fix typo in torch.addr example (#1547)
fix the typo in the example for torch.addr
2017-05-13 08:53:05 -07:00
9b2de027be SpatialDepthWiseConvolution.cu added 2017-05-12 16:02:14 -04:00
78abf0134d Merge pull request #458 from jnhwkim/master
Update to ignore zero targets
2017-05-12 10:38:18 -04:00
9db7787316 Updating __getitem__ and __len__ for containers (#1544) 2017-05-12 16:17:06 +02:00
efa913b1c2 fix uninitialized variable in cmake FindSSE (#1023) 2017-05-11 18:57:34 -07:00
d1a4467682 fix a bug when calling modules
a module that returns a non-standard data structure currently breaks
due to checks for backwards hooks. This refactors the code slightly so
this will only break in the event of backwards hooks.
2017-05-11 23:00:45 +02:00
507ddc4cde Temporary fix for multiple backwards with fused pointwise RNN (#1540) 2017-05-11 11:18:56 -07:00
aba05ce9db Ensuring float tensors call float versions of math functions 2017-05-11 10:39:35 -07:00
be843eb26b Add unfold to autograd (#1523) 2017-05-11 17:53:16 +02:00
5bb13485b8 Fix Linear function 2017-05-10 16:43:14 +02:00
a86adf43a1 Fix comparison functions 2017-05-10 16:43:14 +02:00
1c304a9ef6 Expose variable attribute of AccumulateGrad 2017-05-10 16:43:14 +02:00
feef54ec34 Don't modify non-volatile grads in zero_grad 2017-05-10 16:43:14 +02:00
5026209d0c Minor fix in Prod backward 2017-05-10 16:43:14 +02:00
e7220380bc Add new flags to Variable.backward 2017-05-10 16:43:14 +02:00
9fa0e403d6 Replace retain_variables with retain_graph 2017-05-10 16:43:14 +02:00
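The renamed flag keeps the graph alive across multiple backward calls (sketch; modern tensor API used in place of Variable):

```
import torch

x = torch.ones(2, 2, requires_grad=True)
y = (x * x).sum()

y.backward(retain_graph=True)   # graph survives this call
y.backward()                    # second pass; would raise without the flag above
```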
35cf380ed1 Improve output wrapping logic in autograd 2017-05-10 16:43:14 +02:00
3a7e068439 Remove spurious memo argument in Module.parameters() (#1527) 2017-05-10 13:55:15 +02:00
862105ec8b Merge commit 'd5e821044aa20d67122f4570a3f1cb7e6e9c2617' 2017-05-09 17:06:25 -07:00
d5e821044a Make torch.cat not synchronize the host and device 2017-05-09 17:05:23 -07:00
bfc8a3ebba Reference counting documentation. (#1520)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-09 17:02:28 -07:00
6fab62173e Restore examples with keepdim=True default. 2017-05-09 14:49:55 -07:00
c4742fd128 Explicitly pass keepdim=False for tests that require it.
If we change the default to False, reverting this commit is optional.
2017-05-09 14:49:44 -07:00
e124790cb2 Change keepdim default to False. 2017-05-09 14:49:21 -07:00
171638a451 Fix test_normalize NN test. 2017-05-09 14:25:06 -07:00
d95f711501 Add a keepdim test to torch_test. 2017-05-09 14:25:01 -07:00
b9e00dfbb8 Make (non-legacy) nn backwards compatible.
The keepdim change only seems to leak in one place:
when the grad_bias is returned in linear.py.
2017-05-09 14:24:53 -07:00
f6a00fac13 Add autograd tests for keepdim 2017-05-09 14:24:45 -07:00
be5191a00b Add documentation for keepdim. 2017-05-09 14:16:42 -07:00
c9d8e0a43a Change all legacy/nn modules to use keepdim=True (even if tests don't fail).
We shouldn't be introducing changes in legacy modules if we can avoid it.
2017-05-09 14:16:31 -07:00
ae2b2cbbec Make keepdim work with autograd. 2017-05-09 14:15:59 -07:00
f4cf1d6d18 Merge commit 'af790f86f329364dacef1301fc9b5b292629075c' 2017-05-09 14:04:08 -07:00
c34cff7035 Merge commit '906c550e1079e9762194db59440a202ffca90dca' 2017-05-09 14:03:28 -07:00
194d7408bb Merge commit '5f308b50fb558a620253443ef45f7cf3a91be410' 2017-05-09 14:02:25 -07:00
0d538246fb Merge commit '98dbdc464b0f53ecc89af58cc994c7e8d7617e4e' 2017-05-09 14:01:13 -07:00
7c3cb24485 Add a keepdim parameter for reduction functions over a single dimension.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).

The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
2017-05-09 14:01:03 -07:00
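The normalization pattern from the commit message, under both defaults (sketch):

```
import torch

probs = torch.randn(4, 5).abs()

s = probs.sum(1)                        # new default keepdim=False: shape (4,)
k = probs.sum(1, keepdim=True)          # shape (4, 1); still expandable
normalized = probs / k.expand_as(probs)
```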
af790f86f3 Add a keepdim parameter for reduction functions over a single dimension.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).

The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
2017-05-09 11:55:42 -07:00
5f308b50fb Add a keepdim parameter for reduction functions over a single dimension.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).

The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
2017-05-09 11:55:20 -07:00
98dbdc464b Add a keepdim parameter for reduction functions over a single dimension.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).

The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
2017-05-09 11:54:58 -07:00
e70164316c Merge commit '91a118c116d15d280a99a39666d298be15c6d592' 2017-05-08 16:58:56 -07:00
33b3968660 add larger tests for qr 2017-05-08 16:58:54 -07:00
91a118c116 Fix bug in magma qr decomposition and add tests for larger matrices 2017-05-08 16:44:15 -07:00
0764589ed1 Merge commit '008a8c9720183d7bf8b00bf64d8d21c62270089f' 2017-05-08 16:24:14 -07:00
27671c800d Merge commit '105df5844dca21f964d180a918c808489862941f' 2017-05-08 16:23:12 -07:00
d0504aa41d Implement lgamma function. 2017-05-08 16:21:26 -07:00
008a8c9720 Implement lgamma function. 2017-05-08 16:20:52 -07:00
105df5844d Implement lgamma function. 2017-05-08 16:20:39 -07:00
50bf7d5cbc Merge commit '066fbcd014fa4092152b2cd04ad1d92fc8d7bd59' 2017-05-08 16:13:57 -07:00
066fbcd014 use current stream in cat array kernel launch 2017-05-08 16:12:10 -07:00
ecf29f10ad Merge commit '22bbd7ac33ba51469cc913cb01fcd3b70a42e528' 2017-05-08 16:10:00 -07:00
22bbd7ac33 s/IndexType/long 2017-05-08 16:09:02 -07:00
2075abbe30 Gloo: Added a way to create connected contexts from another context
Summary:
Added a context factory that allows you to use an existing context to
create other fully connected contexts much more cheaply (without having
to rely on a store).

Limitations:
  - The backing context needs to be fully connected

Reviewed By: andrewwdye, pietern

Differential Revision: D4985121

fbshipit-source-id: 31ceabccbb679cedb18ec9927b6c166bef5989bb
2017-05-08 16:02:04 -07:00
e694db0eeb Raise error when Variable is converted to bool. Fixes #1482. (#1491) 2017-05-08 23:14:11 +02:00
c5ae79fe4e Make clamp twice differentiable (#1514) 2017-05-08 23:12:42 +02:00
4ad2e155bc Make nn.Sequential more pythonic (#1510)
A minor fix which uses `enumerate` during iteration.
2017-05-08 07:32:07 -07:00
6d693fe413 Add F.normalize (#1467) 2017-05-07 13:54:16 +02:00
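F.normalize rescales along one dimension to unit p-norm (sketch):

```
import torch
import torch.nn.functional as F

x = torch.randn(8, 128)
y = F.normalize(x, p=2, dim=1)   # every row of y has L2 norm 1
```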
23b556ef77 Expose custom attributes from C++ functions (#1430) 2017-05-07 13:49:55 +02:00
e3f41a4962 Add high order gradient support for Sigmoid (#1496) 2017-05-07 13:00:20 +02:00
90e9f8a476 Avoid segfault when calling join_with with self as arg (#1493) 2017-05-07 00:35:11 +02:00
5f15a9e0cb Add a note about THPFunction_asFunction 2017-05-06 14:28:32 -07:00
ff0ff33a11 Fix docs for InstanceNorm (#1477) 2017-05-04 18:11:15 -04:00
eb2c6ea874 set deviceId_ to -1 when CudaDevicePointer and CudaStream do not have valid data
Summary: Set deviceId_ to -1 when CudaDevicePointer and CudaStream do not have valid data

Reviewed By: andrewwdye

Differential Revision: D4881374

fbshipit-source-id: e973a70e2e6e4519f5fdc2ad4e76f232d9593751
2017-05-04 15:05:27 -07:00
e64b2e1cd7 add documentation for cwrap plugins (#1474) 2017-05-04 17:50:58 -04:00
7d40140bfb Document squeeze behavior on 1-dimensional tensors of size 1. (#1470) 2017-05-04 16:54:22 +02:00
e50c7daaf9 Use Qr factorization to get orthogonal matrix in orthogonal init (#1453) 2017-05-04 07:11:59 -04:00
600f366a13 Merge commit 'a6876a4783ce3d1bb3c6ba69f54c31983097ed17' 2017-05-04 06:51:10 -04:00
4e18d89791 added twice differentiation for a bunch of ops (#1426) 2017-05-04 06:47:14 -04:00
de9845588d Merge commit 'c061ed5bda238e1276601593343c10428d01eaae' 2017-05-03 23:14:26 -04:00
c061ed5bda handle beta=0 for gemv with transpose 2017-05-03 23:05:41 -04:00
e9d648c5e7 Fix memory leak introduced by 72e8190 (#1464) 2017-05-03 18:38:56 -04:00
80c0a8776b Fix #1447: sparse_mask doesn't make sense with uncoalesced tensors (#1458)
* Make sparseMask error if mask is uncoalesced.

Fixes #1447.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Add test for sparse adagrad.

Previously, the sparse codepath was not exercised at all; this commit
adds a very simple test case "sparse Rosenbrock"; the idea is to do
Rosenbrock but then knock out one of the dimensions so that the
tensor is sparse.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 17:53:45 -04:00
4ec0435b39 Report overall size of sparse tensors. (#1461)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 17:51:56 -04:00
f8be3a20d3 Fix scatter_ documentation typo. (#1463) 2017-05-03 17:31:04 -04:00
7b21b0b6d7 Retry on write EINTR in sync mode
Summary:
We weren't handling an edge case where write(2) would return EINTR
when in sync mode. The Pair::write function would return false
indicating it didn't complete the write whereas the send function
expects it to complete when in sync mode. With this change we now
advance the cursor and retry the write when fewer than expected bytes
were written.

Also see https://github.com/facebookincubator/gloo/issues/34

Reviewed By: andrewwdye

Differential Revision: D4996949

fbshipit-source-id: 3bad4fa3d0a01517f20b64904aa71410641fa60f
2017-05-03 14:26:26 -07:00
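A minimal Python sketch of the retry pattern this fix implements (the real change lives in gloo's tcp Pair::write; note Python 3.5+ already retries EINTR internally per PEP 475, so the explicit handling below is shown for illustration):

```
# Write all of `data` to `fd`, retrying on EINTR and advancing the
# cursor past partial (short) writes, as a sync-mode sender must.
import errno
import os

def write_all(fd, data):
    view = memoryview(data)
    written = 0
    while written < len(view):
        try:
            n = os.write(fd, view[written:])
        except InterruptedError:        # EINTR
            continue
        except OSError as e:
            if e.errno == errno.EINTR:  # pre-3.5 style EINTR reporting
                continue
            raise
        written += n                    # advance cursor, retry remainder
    return written
```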
0910e0ac90 Fix memory leak in coalesce. (#1460)
Fixes #1449.

For future reference, we should have a doc explaining our ref-counting
conventions; it looks like this bug slipped by because we assumed that
newTensor was taking ownership of the pointers it was passed in.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 13:29:39 -04:00
93094294ba function backward attempted to multiply tuple by variables (#1459)
One-line fix--changed it to multiply the grad_variables by the
len(variables) when grad_variables is None.
2017-05-03 13:12:21 -04:00
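A hypothetical sketch of the corrected default handling (names modeled on torch.autograd.backward; not the exact source):

```
# Hypothetical sketch: when grad_variables is None, build one None
# placeholder per output variable instead of multiplying a tuple by
# the variables themselves.
def backward(variables, grad_variables=None):
    if grad_variables is None:
        grad_variables = [None] * len(variables)
    assert len(grad_variables) == len(variables)
    # ... hand off (variables, grad_variables) to the autograd engine ...
```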
743e4894d2 Prefix values/indices/sparse_mask/nnz with underscore (#1457)
As discussed in #1441.

I also added some docs giving clear guidance about how to coalescing
in sparse tensors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 11:14:10 -04:00
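An illustrative example of the coalescing guidance, using the era's sparse constructor (newer releases spell it torch.sparse_coo_tensor; the commit above also made the raw accessors private, e.g. _values()/_indices()):

```
import torch

i = torch.LongTensor([[0, 0, 1]])                    # index 0 appears twice
v = torch.FloatTensor([1.0, 2.0, 3.0])
s = torch.sparse.FloatTensor(i, v, torch.Size([2]))  # uncoalesced
c = s.coalesce()                                     # duplicates summed: entry 0 == 3.0
```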
f273377d19 add device asserts in scatter/gather kernels 2017-05-03 11:12:26 -04:00
836332e0a1 Merge commit 'f1591fade5c8df5272b79ab1bd8b0b261bb5606a' 2017-05-03 11:11:43 -04:00
f1591fade5 add device asserts in scatter/gather kernels 2017-05-03 11:10:31 -04:00
2e7635b929 Add flexible bilinear upsampling aspect ratio redux (#1317) 2017-05-03 08:46:28 -04:00
e9953c4595 A number of post-merge fixes for test_sparse (#1444)
* Simplify _gen_sparse

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Randomly generate an uncoalesced tensor and test with it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Simpler implementation of cpu_only suggested by @apaszke

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Better implementation of randn, suggested by @soumith

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Lint fix.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Fix CUDA type error.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 08:43:03 -04:00
72e8190994 Use at most one shared_ptr block at a time to manage THPFunctions (#1454)
* Fix failing ln in build_all.sh

* Use at most one shared_ptr block at a time to manage THPFunctions
2017-05-03 08:15:36 -04:00
e1278d4ee2 Fix typo in autograd docs 2017-05-03 03:11:55 -07:00
66bd200de0 bug fix - add previous slot offset to calculated slot value in halving-doubling algorithms
Summary: Previous slot offset was not added to the calculated value for the slot to be used in halving-doubling algorithms. If multiple instances were running, slot values could collide.

Reviewed By: pietern

Differential Revision: D4986618

fbshipit-source-id: 56b9220c91f31cc016d37e82907221460de70657
2017-05-02 16:19:55 -07:00
574cfe3cf3 Improve kthvalue documentation. (#1448)
1) Fix "kth" attr specification -- I can't get sphinx to generate `k`th,
but `k` th works with a space, unlike now where the highlighting continues
until the next attr.
2) Specify the size of the return tensors.
3) Add an example of the return tensor sizes with more than 1 dimension.
2017-05-02 17:22:02 -04:00
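An example of the documented return sizes (in this era the reduced dimension is kept with size 1; later releases default to squeezing it via keepdim=False):

```
import torch

x = torch.randn(3, 5)
values, indices = torch.kthvalue(x, 2, 1)   # 2nd smallest along dim 1
print(values.size(), indices.size())        # both 3x1 in this era
```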
699755e04f Convert contiguous() call in adagrad to out-of-place coalesce. (#1446)
We missed this one in f2903332c7dce1fbb7d7d9f18dcfba8e853581df!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-02 16:51:54 -04:00
fb07914c0c Recommendations for workflow when modifying C files. (#1443)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-02 15:46:45 -04:00
aa2ee86375 pytorch/thpp ~= facebook/thpp (#1445) 2017-05-02 15:46:10 -04:00
ecd51f8510 docs fixes 2017-05-02 15:42:33 -04:00
5aa1f769d3 Fix torch.dist documentation: function returns a float. (#1440) 2017-05-02 14:38:48 -04:00
eecc807a75 Keep track of number of in-flight send operations
Summary:
This helps guard against programming errors where waitSend is called
before send is called. It uses a std::atomic to keep overhead low.

Reviewed By: andrewwdye

Differential Revision: D4984604

fbshipit-source-id: 04a63b1ba088e3bcba0abff40771af666deb15e5
2017-05-02 09:35:46 -07:00
5386012164 Check return value of ibv_reg_mr for error
Summary:
This returns EFAULT when passing a GPU memory pointer (for GPUDirect)
and the ibverbs driver can't map the GPU's memory. Since the error is
pretty cryptic, crash with a more useful message.


```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what(): [enforce fail at gloo/transport/ibverbs/buffer.cc:46] mr_ !=
  nullptr. ibv_reg_mr: Bad address (kernel module 'nv_peer_mem' not
  loaded; did you specify a GPU pointer?)
```

Reviewed By: andrewwdye

Differential Revision: D4982966

fbshipit-source-id: 72c220fe22a3bc59396cfff992ad5f0f9c5bf83a
2017-05-02 09:11:15 -07:00
4bf813e068 Document cdata non-NULL invariant, and consequence Python side. (#1435)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-02 11:17:20 -04:00
3b4bc721ef fix osx build and suppress clang warnings (#1432) 2017-05-02 09:33:24 -04:00
dca208b525 Refactor test_sparse to reduce boilerplate. (#1421)
* Refactor test_sparse to reduce boilerplate.

Instead of manually creating a helper function, threading an is_cuda
parameter around, and creating a test method for CUDA and non-CUDA
variants, we take a different approach:

- There is now some new member variables initialized in setUp which
  control the aspects of how we carry out the test; at the moment,
  it's just whether or not we are using CUDA or not.  This means
  you don't have to pass is_cuda around, or do a conditional to
  get the triplet of constructors you need.

  I'll note that I am not a big fan of member variables in test
  objects, but these are (intended to be) immutable so I think
  it should be OK.

- Instead of manually defining test_foo and test_foo_cuda, we now
  have a new TestCudaSparse class which overrides setUp (from above)
  to swap in the CUDA implementation.  Way less boilerplate, and NO
  metaprogramming needed.

  If you need to opt out of CUDA testing, there is a new cpu_only
  decorator you can use.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-01 21:52:58 -04:00
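A condensed sketch of the pattern the refactor describes (class and decorator names taken from the commit message; the details are illustrative):

```
import unittest

def cpu_only(fn):
    # Skip this test when the suite is configured for CUDA.
    def wrapper(self, *args, **kwargs):
        if self.is_cuda:
            raise unittest.SkipTest('CPU-only test')
        return fn(self, *args, **kwargs)
    return wrapper

class TestSparse(unittest.TestCase):
    def setUp(self):
        self.is_cuda = False          # immutable knob picked up by tests

    @cpu_only
    def test_cpu_specific(self):
        self.assertFalse(self.is_cuda)

class TestCudaSparse(TestSparse):
    def setUp(self):
        self.is_cuda = True           # same tests, CUDA constructors
```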
181cb15c72 Fix formatting error in docs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-01 21:47:22 -04:00
7df8fbb64f Generalize halving-doubling to support non-power-of-two cases using binary blocks algorithm
Summary: A generalized version of halving-doubling that supports a non-power-of-two number of processes by breaking up execution into blocks that are powers of two and communicating interblock after the intrablock reduce-scatter. Non-power-of-two cases will have some degree of load imbalance compared to power-of-two, but cases with few large blocks (e.g. 8 + 4 or 16 + 8) should still perform relatively well.

Reviewed By: pietern

Differential Revision: D4955947

fbshipit-source-id: af4f218fedb6adf475530c38386978b81f4f2b74
2017-05-01 16:05:22 -07:00
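An illustrative helper for the block decomposition described above (the real algorithm lives in gloo's C++; this only shows the 8 + 4 style split):

```
def binary_blocks(n):
    """Split n processes into power-of-two blocks, largest first."""
    blocks, bit = [], 1
    while n:
        if n & 1:
            blocks.append(bit)
        n >>= 1
        bit <<= 1
    return sorted(blocks, reverse=True)

assert binary_blocks(12) == [8, 4]
assert binary_blocks(24) == [16, 8]
```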
5c7453447f Fix bugs, rename differentiate to grad, make it more flexible 2017-05-01 16:44:56 -04:00
87164f554d Bug fixes 2017-05-01 16:44:56 -04:00
267e7c0431 Fix memory issues with Conv and BatchNorm 2017-05-01 16:44:56 -04:00
e5db8f98be Add torch.autograd.differentiate 2017-05-01 16:44:56 -04:00
20aa5b066f Convert some of the functions to new format
Also, fix a lot of issues that appeared after the previous commits.
2017-05-01 16:44:56 -04:00
de9998e198 Add support for the new Function format 2017-05-01 16:44:56 -04:00
702a2e3bc5 Make Variables not subclass Function anymore
Because of this Variables can no longer appear in the graph.
Every usage of a leaf Variable will leave an AccumulateGrad
function that has no outputs, but modifies var.grad as a side
effect.
2017-05-01 16:44:56 -04:00
2ca787fcf4 Refactor attribute names in autograd 2017-05-01 16:44:56 -04:00
2ec629bef9 Set SO_REUSEADDR to try and prevent bind errors
Summary:
After running the test suite many times we end up with a zillion
connections in TIME_WAIT state. Setting SO_REUSEADDR seems like it
should help binding to ports regardless of the TIME_WAIT state.

Reviewed By: andrewwdye

Differential Revision: D4979606

fbshipit-source-id: b611f9c9e11aba858dc192f6bca3d64e10100b52
2017-05-01 13:36:14 -07:00
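The same knob in Python, for reference (gloo sets it in its C++ tcp transport):

```
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow binding a port whose previous connections linger in TIME_WAIT.
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('', 0))   # with a fixed port, this is where EADDRINUSE would hit
s.listen(1)
```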
2197e4c766 version bump 2017-05-01 15:54:52 -04:00
2a28283680 Fix pair destructor if in CONNECTING state
Summary:
It can happen that a pair is destructed while in CONNECTING
state when some unrelated code throws an exception after the connect
function has been called. The most likely place for this to happen is
when connecting pair A is in progress while connecting pair B throws
an exception. The exception will force destruction of all references
to pair A, even if it is in the CONNECTING state.

Also see https://github.com/facebookincubator/gloo/issues/33

Reviewed By: andrewwdye

Differential Revision: D4979557

fbshipit-source-id: 0cddddd3f478106f1694603fe7f2efe15a2d9aa1
2017-05-01 12:41:07 -07:00
4624278b1d Make sparse documentation title consistent with others. (#1420)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-01 11:48:00 -04:00
79d4ac670c Add map_location to load_url (#1418) 2017-05-01 10:21:30 -04:00
4ebf3ff46d Add base for CUDA allReduce and broadcast in DataChannelGloo 2017-05-01 01:49:10 -07:00
ac3ba9a2ad Rebase fixes 2017-05-01 01:49:10 -07:00
14e1bfddbc Change warning message in MPI 2017-05-01 01:49:10 -07:00
c19fbd3364 Update comments; Add inline accessors for value_type tuple in GlooCache 2017-05-01 01:49:10 -07:00
a17d96d571 Add multiple thread support for DataChannels
Previously, when using the same data channel in a multi-threaded
environment, there was no guarantee against deadlocks or even errors.
2017-05-01 01:49:10 -07:00
b7dcc29430 Forward declare GlooCache key_type 2017-05-01 01:49:10 -07:00
18b4dcd28b Remove unused variable in macro 2017-05-01 01:49:10 -07:00
be81304d27 Moved GlooCache to new file; Functions renames; Minor fixes 2017-05-01 01:49:10 -07:00
f07f13c6e9 Change Store exception handling 2017-05-01 01:49:10 -07:00
310d08c37b Fix store and all operations 2017-05-01 01:49:10 -07:00
234df2138a Fix compilation errors 2017-05-01 01:49:10 -07:00
2b340e7d50 Add python tests; Remove broken prefix store creation 2017-05-01 01:49:09 -07:00
6888c61fa8 Fix DataChannelGloo compilation 2017-05-01 01:49:09 -07:00
ba3328b365 Add DataChannelGloo tests 2017-05-01 01:49:09 -07:00
3b4fe5dfc4 Add isend/irecv; Add all types generator for template functions; Minor refactor 2017-05-01 01:49:09 -07:00
ce42761628 Add groups 2017-05-01 01:49:09 -07:00
df4791d6c0 Implement DataChannelGloo 2017-05-01 01:49:09 -07:00
7e8830c3d5 Initial gloo bindings 2017-05-01 01:49:09 -07:00
b91cec7f66 Fix THD library build for CUDA 2017-05-01 01:49:09 -07:00
765aeb1a08 Fix nonzero bug 2017-05-01 01:49:09 -07:00
280e2a94e5 Worker init clarification; Inform on error thread notification failure 2017-05-01 01:49:09 -07:00
e7f453b5de Add barrier to test; Minor changes; 2017-05-01 01:49:09 -07:00
8030aa0f1b Refactor error thread 2017-05-01 01:49:09 -07:00
40ad2cde62 Remove unnecessary nonzeroElems function 2017-05-01 01:49:09 -07:00
af4a978c44 Move error thread to CommandChannel; Minor fixes; 2017-05-01 01:49:09 -07:00
fe5fc6723f Remove unnecessary code 2017-05-01 01:49:09 -07:00
6e6179633b Minor fixes in THDMasterWorkerInit 2017-05-01 01:49:09 -07:00
c97e60c45d Add actual error reporting in Master 2017-05-01 01:49:09 -07:00
2cdb368f97 Add error handling in MasterWorker mode 2017-05-01 01:49:09 -07:00
a5b2f3461a Review fixes 2017-05-01 01:49:09 -07:00
d3e60599d2 Add benchmark scripts (#66) 2017-05-01 01:49:09 -07:00
98d8e0b040 Lapack functions implementation #2 + fixes after review 2017-05-01 01:49:09 -07:00
fe2c360eda Lapack function implementation #1 2017-05-01 01:49:08 -07:00
59ae109bbb Implement functions from set 1 (except Lapack) 2017-05-01 01:49:08 -07:00
8623076654 Add convertToRank to do bound checking 2017-05-01 01:49:08 -07:00
a362b4f367 Add support for unsigned char aka byte to MPI 2017-05-01 01:49:08 -07:00
ef724e355c Change rank type: int -> std::uint32_t; Minor fixes 2017-05-01 01:49:08 -07:00
e863d27393 Tweaks, fixes, cleanup in DataChannelTCP 2017-05-01 01:49:08 -07:00
4c388f9398 Revert structure changes; Minor fixes 2017-05-01 01:49:08 -07:00
6740d1d904 Rewrite CommandChannel 2017-05-01 01:49:08 -07:00
f891d9b1bf Don't build tests by default 2017-05-01 01:49:08 -07:00
a81f330854 Rename construct -> new; Minor fixes 2017-05-01 01:49:08 -07:00
c02241edbd Minor code refactor 2017-05-01 01:49:08 -07:00
f30a92fa17 Fix invalid socket initialization 2017-05-01 01:49:08 -07:00
1391ff99f4 Use TCP_NODELAY for data sockets 2017-05-01 01:49:08 -07:00
43019bd88a Always loop over all possible addresses in worker 2017-05-01 01:49:08 -07:00
d6380910f5 Removed unnecessary code; Minor fixes 2017-05-01 01:49:08 -07:00
04491e84e4 Fix build with CUDA 2017-05-01 01:49:08 -07:00
e247249a5f Implement TH_API functions from the set 4 2017-05-01 01:49:08 -07:00
0160438eb9 added logical not operator for ByteTensor (#1403) 2017-04-30 08:47:24 -04:00
7dd8571bc6 fix avg_pool docs in nn.functional 2017-04-30 08:44:43 -04:00
48a7869b23 Doc fixes (#1409) 2017-04-30 08:28:19 -04:00
582fd3db7d fix osx build 2017-04-29 09:29:57 -04:00
9169f60a84 Parallelize TensorMethods.cpp builds (#1400) 2017-04-29 09:07:21 -04:00
457d78a7d9 Use THCUNN backward kernels for Tanh and Sigmoid in Autograd (#1399) 2017-04-29 09:07:03 -04:00
a071ccbea6 fix NCCL makefile for CUDA 7.5 (#1401) 2017-04-29 09:04:01 -04:00
db1eb66456 corrected docstring for Dropout (#1404) 2017-04-29 13:40:47 +02:00
45020a74cd remove inplace pow and fix contiguous -> coalesce (#1398) 2017-04-28 18:26:29 -04:00
9c01f5d6b2 Document hybrid sparse tensors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-28 23:53:01 +02:00
cbb9f08b71 Add new init methods gain, eye and dirac (#1172) 2017-04-28 17:16:40 -04:00
f75ab857b8 Add safeCoalesce() to tests 2017-04-28 17:11:05 -04:00
f2903332c7 Make coalesce() out of place 2017-04-28 17:11:05 -04:00
9643be76f9 speed up accumulation 2017-04-28 17:11:05 -04:00
4f09461d24 Rename sparse tensor contiguous() to coalesce() 2017-04-28 17:11:05 -04:00
bafb2e5cc2 Implement sparse pow. (#1387) 2017-04-28 23:06:09 +02:00
28a7fbbdf5 Documentation fix for torch.gather 2017-04-28 22:45:14 +02:00
4c1cdb6148 Refactor Python string utility function 2017-04-28 21:25:26 +02:00
775481ed56 re-enable dilated convolutions on Kepler (#1394) 2017-04-28 14:42:19 -04:00
5b2aac7c73 Merge commit '224f5eabf5cfb3a19abc1819f7dac230500b6bdb' 2017-04-28 13:48:06 -04:00
224f5eabf5 half<->float conversion cleanup (#680) 2017-04-28 19:46:42 +02:00
fd490c6490 Merge commit 'd6a31c68a0f39656257322a55c9e04dd579de828' 2017-04-28 13:42:23 -04:00
d6a31c68a0 Add option to disable ppc64le's VSX support
Set environment variable TH_NO_VSX=1 to disable VSX.
2017-04-28 13:41:03 -04:00
96a281dfab Add one more missing self.dilation parameter. (#1392)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-28 19:16:32 +02:00
94b147fd41 Allows dicts batches in dataloader. (#1354)
* Allow dicts in Dataloader

* use collections.Sequence instead of collections.Iterable in dataloader
2017-04-28 19:14:52 +02:00
c26f6877a0 guard topk for half (#759) 2017-04-28 11:57:15 -04:00
8908000262 function -> lambda in test 2017-04-28 10:31:40 -04:00
8b1d5727d8 fix minor docs 2017-04-28 10:13:52 -04:00
75f1989bec Add nn.Bilinear and tests 2017-04-28 10:11:30 -04:00
e221536ad8 Merge commit 'a44317fea88adddded91e068088415de1e66fd4b' 2017-04-28 08:04:39 -04:00
a44317fea8 Change magma_sgesvd to magma_sgesdd which is significantly faster 2017-04-28 08:03:39 -04:00
24e5a9057e Revert "Parallelize TensorMethods.cpp builds (#1364)" (#1390)
This reverts commit 060048bcd808893ba3113d09273a42642904078a.
2017-04-28 07:59:40 -04:00
060048bcd8 Parallelize TensorMethods.cpp builds (#1364) 2017-04-28 07:45:21 -04:00
77035d151e make topk test unique 2017-04-28 07:30:25 -04:00
50c9c23525 enable topk for all cuda 2017-04-28 07:14:21 -04:00
3f81803b09 Merge commit '69574a6dc4036b0113c512a1b2d74e23682c8a3b' 2017-04-28 07:08:43 -04:00
d421c473a9 Merge commit '928f6516c16ff91c0a789d0a653551041d1bafd0' 2017-04-28 07:07:24 -04:00
48f9e526ea implement expand/expandAs in CPU/GPU code 2017-04-28 07:06:25 -04:00
69574a6dc4 implement expand/expandAs in CPU/GPU code 2017-04-28 07:04:08 -04:00
928f6516c1 implement expand/expandAs in CPU/GPU code 2017-04-28 07:03:51 -04:00
b93b525a1c Enable specifying of margin in HingeEmbeddingLoss (#1378)
Previously it was not possible to set a value for the margin of the HingeEmbeddingLoss in the constructor. This patch fixes the issue and makes the loss behave as described in the docs.

A discussion of this issue can be viewed here:
https://discuss.pytorch.org/t/issue-with-setting-margin-for-hingeembeddingloss/2088
2017-04-28 06:58:48 -04:00
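A small usage example now that the constructor argument is honored (Variable wrapping as in 0.x releases):

```
import torch
import torch.nn as nn
from torch.autograd import Variable

loss_fn = nn.HingeEmbeddingLoss(margin=0.5)   # margin was ignored before
x = Variable(torch.randn(4))
y = Variable(torch.Tensor([1, -1, 1, -1]))    # targets are +1 / -1
loss = loss_fn(x, y)
```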
8db2cf6182 temp fix for transposed dilated convolution (#1388) 2017-04-28 02:53:37 +02:00
7e8ef0e22a Actually pass dilation to the underlying operators. (#1386)
No tests for now; we'll need some sort of shape DSL to concisely
represent them.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-27 23:38:01 +02:00
27990fee54 Use fully qualified name as tp_name for tensors and storages (#1379) 2017-04-27 16:26:44 -04:00
2ef7331007 Update sparse.py 2017-04-27 02:25:00 +02:00
c2cfa4cf5b Add THGenerate*Type.h for all types (#1014) 2017-04-27 01:11:56 +02:00
c915f8ddbf Signal error on connection error instead of asserting
Summary: No need to assert on connection errors.

Reviewed By: andrewwdye

Differential Revision: D4957698

fbshipit-source-id: b47f6f0f098dbf7d212701c5cb68e34b2c1c9522
2017-04-26 16:07:13 -07:00
b39a2f2cbb Documentation for sparse tensors. (#1366) 2017-04-26 21:43:05 +02:00
d9f01397b3 s/NOCUDA/NO_CUDA/
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-26 21:42:09 +02:00
8ca7bf2ab3 Check argument types in 'checkTypes' (#1363)
Fixes #1357
2017-04-26 15:00:41 -04:00
8950f41da3 Install CUDA headers.
Summary:
This PR makes cmake install the gloo CUDA headers if USE_CUDA is enabled.
Closes https://github.com/facebookincubator/gloo/pull/29

Differential Revision: D4946856

Pulled By: pietern

fbshipit-source-id: a688c3794c4a5e34b664e7bdeb4e1148f6504419
2017-04-25 22:42:12 -07:00
afd01164f8 Install missing headers.
Summary:
This PR installs missing include headers.
Closes https://github.com/facebookincubator/gloo/pull/30

Differential Revision: D4946478

Pulled By: pietern

fbshipit-source-id: da2d532afc43cf9e5e7fc764dc7821e2dfca6b37
2017-04-25 09:42:21 -07:00
a123247240 Move SIGPIPE initializer to test main
Summary:
It should be up to the program including Gloo to ignore SIGPIPE.
We have seen a case where the EPIPE errno is not properly handled in
an unrelated piece of code. Having SIGPIPE fire means we can get a
core and debug this further.

Reviewed By: andrewwdye

Differential Revision: D4896727

fbshipit-source-id: f6fe2d3f8dc68a9e6c2c457639b45f8aee2d7b20
2017-04-25 09:08:27 -07:00
41705ce7d5 Add zero padding module (#1326) 2017-04-25 16:58:51 +02:00
88fc1d39ff Generic TopK implementation (#744)
* move TopK to generic

* partial genericization of kernel code

* introduce TopKTypeConfig, specialize radix type and conversion for floats

* implement topk for byte tensor

* implement for char tensor

* implement for int tensor, extend test to check indices as well

* works for longs too

* make bitfield set/get a struct, add support for 64-bit types

* extend to double tensor

* implement for half tensor

* asserts; test fix
2017-04-25 16:39:20 +02:00
9899512401 Remove common.h from root
Summary: This file was left over after a recent refactoring but is not used.

Reviewed By: andrewwdye

Differential Revision: D4940265

fbshipit-source-id: 01f8c5fbc73dd0ca0a92306dbfef22ff28133750
2017-04-24 13:51:15 -07:00
d95feb3feb Only build on 64-bit systems
Summary:
While it is theoretically possible to make Gloo work on 32-bit systems, it's unlikely anybody would ever use it on 32-bit systems. This removes the expectation that it should work...

Fixes #28
Closes https://github.com/facebookincubator/gloo/pull/31

Differential Revision: D4939073

Pulled By: pietern

fbshipit-source-id: 8c60804f7ae5cf835332871a424aefa2c498e8a4
2017-04-24 10:38:45 -07:00
3ab074b3c5 Fix torch.stack() with Variable inputs (#1345) 2017-04-24 12:20:51 -04:00
6a69f7007b Revert "add keyword out for autograd function Concat to match torch.cat (#1336)" (#1340)
This reverts commit 71b9dea6ecc2278511ba6c2531437d27d9a2b8c8.
2017-04-23 19:19:27 +02:00
71b9dea6ec add keyword out for autograd function Concat to match torch.cat (#1336) 2017-04-23 15:36:24 +02:00
fa4f363b93 Instance norm (#1283)
* instance norm

* fix whitespaces

* whitespaces

* docs

* "C" letter was cyrillic in docs, fixed

* remove force_eval, fix non contiguous case
2017-04-23 14:49:15 +02:00
aab30d4ea2 Fix errors when no CUDA devices are available (#1334)
Fixes #1267

This fixes a number of issues when PyTorch was compiled with CUDA
support but run on a machine without any GPUs. Now, we treat all errors
from cudaGetDeviceCount() as if the machine has no devices.
2017-04-23 14:45:27 +02:00
2b56711c24 Indexing fix for fused GRU/LSTM kernels when all tensors are not contiguous. (#1325) 2017-04-22 04:22:32 -04:00
2fa3365f94 Merge commit '5224fc56b03b6468cb85ccf39034b8ab0d76d04e' 2017-04-22 01:14:34 -07:00
5224fc56b0 fix typo 2017-04-22 10:14:09 +02:00
4373580e6b Merge commit 'e80a3a7f7b8d0e179c1481e0744f08e9385b31f3' 2017-04-22 01:11:10 -07:00
d9406a8a1a Merge commit '10387a3f35573462e18219c321ff550757ce9b09' 2017-04-22 01:10:53 -07:00
e80a3a7f7b Indexing fix for fused GRU/LSTM kernels when all tensors are not contiguous. 2017-04-22 01:09:46 -07:00
5b83fe6781 add contiguous checks 2017-04-22 09:57:36 +02:00
24d92b5d9f Concatenate directly into shared memory when constructing batches (#1323)
This saves an extra memory copy, which speeds up data loading a bit
(5-10% with accimage).

As part of this change:

 * torch.cat accepts keyword argument out
 * specifying out=None is treated like not specifying out
2017-04-22 03:40:30 -04:00
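A sketch of the collate pattern this enables, modeled on the DataLoader's default collate (the private storage()._new_shared helper is an implementation detail, shown only for illustration):

```
import torch

batch = [torch.randn(3, 5) for _ in range(4)]
numel = sum(t.numel() for t in batch)
storage = batch[0].storage()._new_shared(numel)   # shared-memory backing
out = batch[0].new(storage)
result = torch.cat(batch, 0, out=out)             # writes straight into `out`
```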
1375694853 Document torchvision members 2017-04-21 12:50:36 -07:00
be5e399d46 Add a simple README for torch/lib. (#1322)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-21 15:06:12 -04:00
a782a6231f Merge commit 'e788ea40de0f7ef393f1b602098a6775a95d8976' 2017-04-20 19:00:45 -04:00
e788ea40de fix typo in TH_APPLY for _dimOffset 2017-04-20 18:59:12 -04:00
6089900011 grammar/typo: "There's 3" -> "There are three"
Summary: Closes https://github.com/facebookincubator/gloo/pull/27

Differential Revision: D4919746

Pulled By: pietern

fbshipit-source-id: 35733b75fc169d2ccff8b10df013eed8c279dfd5
2017-04-20 15:19:56 -07:00
81345306c8 Merge commit '8236d38e81396ac48697ac289c0476cff18a8e08' 2017-04-20 15:03:48 -07:00
f0a19e2617 Merge commit '331219c5506b26bf0906b7acdafb4823e07a924e' 2017-04-20 15:01:22 -07:00
8236d38e81 add cusparse link dependency 2017-04-20 14:31:30 -07:00
8adf8fe2ed create and expose handles for cusparse 2017-04-20 14:30:14 -07:00
d2472d1ab5 Disable cudnn dilated convolutions for kepler. (#1308) 2017-04-20 15:31:45 -04:00
331219c550 define abs for short too 2017-04-20 09:55:17 -07:00
7805ac9098 Base Store::wait() should ignore timeout for back compat
Summary: PrefixStore::wait() uses a default timeout if unspecified. This is incompatible when using PrefixStore to wrap a Store implementation that does not support timeout. Instead the base Store::wait(keys, timeout) implementation is called, throwing an exception. This change modifies the base implementation to ignore the timeout.

Differential Revision: D4916517

fbshipit-source-id: 3cdd83bd209bf938b58442d82f3fc245e68019ad
2017-04-19 16:49:44 -07:00
f9149b1f2e Fix halving-doubling corner cases
Summary: Fixes for corner cases with small element counts. Fixed problems include (1) calling range on out of bounds pointers, (2) failing to allocate send or receive buffers in cases where they correspond to out of bounds indices for reduce-scatter, but are needed in the allgather, (3) not allocating enough receive buffer space (more than count_ bytes may be needed in some cases)

Reviewed By: pietern

Differential Revision: D4912656

fbshipit-source-id: 0409d01894ff9c93ef1a1fdf8021c9ecf62f9b57
2017-04-19 12:20:28 -07:00
a8e6610e3d Fix argument typo in pad_packed_sequence docstring (#1300) 2017-04-19 13:50:59 -04:00
56cc1e219b Fix include in mpi/context.cc
Summary:
memcpy comes from cstring

See https://github.com/caffe2/caffe2/issues/286

Reviewed By: Yangqing

Differential Revision: D4914228

fbshipit-source-id: de60c2a98feb4228546a8f1fe237a090101f50e4
2017-04-19 10:19:55 -07:00
1607042bf4 Add timeout parameter and default to rendezvous Store::wait()
Summary: TSIA. Defaulting to 30s.

Reviewed By: pietern

Differential Revision: D4909202

fbshipit-source-id: 7f86f390077a19e559c90a1aa3aa768e273325d1
2017-04-19 10:11:56 -07:00
7d023cda6c Add timeout to RedisStore::wait()
Summary: Add a default 60s timeout to RedisStore::wait() to avoid blocking indefinitely when peer machines are unavailable.

Reviewed By: pietern

Differential Revision: D4908699

fbshipit-source-id: 39de9066633e8b0c8d1ee198b6bf3f70d3961196
2017-04-19 09:58:05 -07:00
9e8b4ef075 Include THCNumerics.cuh in THCAtomics.cuh. (#752) 2017-04-19 12:08:22 -04:00
a35f507532 Update functional.py (#1298) 2017-04-19 11:07:12 -04:00
6aa22beb86 Fix loss.py docs (#1296) 2017-04-19 11:03:15 -04:00
71bf8fb55b Clean up fd from destructor when in listening state
Summary:
It's possible the pair is in the listening state when it is
destructed. The fd will not have been cleaned up in that case, so we
shouldn't assert that it has been.

Reviewed By: andrewwdye

Differential Revision: D4909964

fbshipit-source-id: 7103d74910e3bcf5de9f4658d8f1f682b6c8a70c
2017-04-18 17:51:49 -07:00
c7d83a16f6 Update README.md 2017-04-18 19:05:18 -04:00
934816c01c Change the default algo for cuDNN conv forward to PRECOMP_GEMM (#1290) 2017-04-18 19:01:47 -04:00
5a0510934f Merge commit 'fcf4deac7d215f134ea25cd3def8b564b58b033c' 2017-04-18 15:21:20 -07:00
fc19473501 Corrections in legacy modules. (#1286) 2017-04-18 17:13:53 -04:00
34546f022a Expose dilated convolutions.
Fixes #1225.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-18 17:13:02 -04:00
ab77742f6e Add some missing documentation for arguments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-18 17:13:02 -04:00
701e63107f speed improvements, fix tests 2017-04-18 12:46:54 -07:00
655c22569e CPU hspmm + more efficient reorder 2017-04-18 12:46:54 -07:00
cd3bbc9dfd more operations and optimizations (hspmm, reorder, ...) 2017-04-18 12:46:54 -07:00
1018b238ac make gradients contiguous in adagrad 2017-04-18 12:46:54 -07:00
e27bd4ce7a faster cadd 2017-04-18 12:46:54 -07:00
b2acc33c73 contiguousValues method 2017-04-18 12:46:54 -07:00
40804830b8 mark_contiguous operation 2017-04-18 12:46:54 -07:00
01d84c5f9d revert sparse cuda index type change 2017-04-18 12:46:54 -07:00
88b42324e7 spcadd, sparseMask, cadd, csub, cmul + tests 2017-04-18 12:46:54 -07:00
ec260fe8e9 add test for dsmm 2017-04-18 12:46:54 -07:00
328b416068 THCS contiguous + to_dense 2017-04-18 12:46:54 -07:00
4bde9efbd7 Update CONTRIBUTING.md 2017-04-18 15:39:58 -04:00
ff781ed059 Update CONTRIBUTING.md 2017-04-18 15:39:26 -04:00
8f9a1af253 Merge commit 'fcf4deac7d215f134ea25cd3def8b564b58b033c' 2017-04-18 12:22:44 -07:00
31900b6bae Merge commit '1feb120d938d47c01900f656322f16bc41d08af3' 2017-04-18 12:22:27 -07:00
46cf6ff5fb fix batchnorm docs (#1284) 2017-04-18 15:12:38 -04:00
fcf4deac7d Fused RNN kernel remove explicit instantiation, isn't needed. 2017-04-18 11:07:58 -07:00
2ca071d730 Remove double precision math from LogSigmoid too 2017-04-18 10:28:13 -07:00
8a901c510d Update ops for Sigmoid and Tanh 2017-04-18 09:55:11 -07:00
ed60fe0ed6 Gloo benchmarking and script updates
Summary: Add AllgatherRing and CudaBroadcastOneToAll to benchmark. Add host info and algorithm sweep to chronos script.

Reviewed By: pietern

Differential Revision: D4901111

fbshipit-source-id: 1421025d39b914b14e857f21c43eac30c9c9dd2f
2017-04-18 09:06:34 -07:00
f67ab32d34 Output peer address on network failures
Summary: Output peer address on network failures. This change will help in root causing network failures.

Differential Revision: D4899129

fbshipit-source-id: 60a762c6551a726081d5335ab478da8dd7f6dad7
2017-04-17 13:50:24 -07:00
9150e33765 Add support for creating docsets. (#1276)
Docsets are an offline documentation format introduced by Dash.app and
supported by Zeal and some other open-source clones.
2017-04-17 16:35:02 -04:00
e4478804ce Fix patched_make_field for newer Sphinx versions. (#1275)
Not sure since which version that change is needed, but using v1.5.5 here.
2017-04-17 16:17:58 -04:00
a220f2c3aa Fix group-convolution w/o biases on CPU. (#1273)
* Fix group-convolution w/o biases on CPU.

Not having this guard will cause a crash further down in the `cat`
function when it uses the first element in the passed list to create a
new tensor. (And even after that, cat doesn't handle nulls well.)

* Added test for groupconv w/o bias on CPU.
2017-04-17 14:53:28 -04:00
15267ac009 fix typo 2017-04-15 13:08:58 -04:00
0cb60e7d5a Retrieve ethernet interface link speed
Summary: Retrieve ethernet interface link speed

Reviewed By: pietern

Differential Revision: D4880290

fbshipit-source-id: 91f1555d9bb35ff41dc731e082365a9002bb1661
2017-04-14 14:41:01 -07:00
b61174047f Add threshold to switch between host/device reduce and bcast depending on buffer size
Summary: Device reduce is more efficient for large buffer sizes. For smaller buffers, host reduce may be more efficient in some cases and frees up the GPU for other work.

Reviewed By: andrewwdye

Differential Revision: D4885855

fbshipit-source-id: 7dc522e8c93e1a94427730aca6af03b7e93e660d
2017-04-13 15:05:47 -07:00
8d93fcf13f Don't allow overwriting keys in HashStore
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4885102

fbshipit-source-id: c46c180fa8e6dd354921d562830b3515ba91c964
2017-04-13 12:35:32 -07:00
a559893c9f Instantiate nccl type templates for gloo (minus half)
Summary:
Instantiate nccl type templates for gloo (minus half).
half requires at a minimum ifdefing CUDA_HAS_HALF and likely requires
more work given that operators aren't defined on it, so skipping it
for now.

Reviewed By: pietern

Differential Revision: D4876217

fbshipit-source-id: 833d2aec12789cbaf9e0a201b979a420fbe6732f
2017-04-13 10:52:38 -07:00
50c2759afe Expose missing headers
Summary: Closes https://github.com/facebookincubator/gloo/pull/25

Differential Revision: D4883908

Pulled By: pietern

fbshipit-source-id: 662a8fdf83ad099295b11043194de25c747e8286
2017-04-13 10:08:06 -07:00
cb66e9cf78 torch.diag bug fix (#1251) 2017-04-12 20:59:12 -07:00
735f5af87e Add new variant of halving/doubling algorithm that pipelines local reduce/broadcast with communication steps
Summary: Added a pipelined version of the cuda halving/doubling algorithm. Half the buffer is reduced prior to the first send and the other half prior to reducing the result from the first receive. Broadcasts are started asynchronously as soon as each new message is received. New code was added as a new algorithm, as pipelining makes performance worse for small buffer sizes.

Reviewed By: pietern

Differential Revision: D4847109

fbshipit-source-id: 5aa55de95f8c94069380af7396f2b5b6297dcbea
2017-04-12 18:01:22 -07:00
c852883086 add named_parameters that yield name and value of parameters (#1242) 2017-04-12 16:32:36 -07:00
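Usage of the new iterator:

```
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
for name, param in model.named_parameters():
    print(name, tuple(param.size()))   # e.g. '0.weight' (8, 4)
```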
ab77e4c3d7 Merge commit '62c584ba7972dbba404766aa06d1a558282b4169' 2017-04-12 15:06:58 -07:00
2444278b8b Merge commit '4336e9ea6641b8ac2814eaef2adef64e4106459c' 2017-04-12 15:06:10 -07:00
62c584ba79 Fix abs with char and short cuda types. (#747) 2017-04-12 15:04:59 -07:00
fbd53d87bf block wide reduction with multiple values to reduce at once (#745) 2017-04-12 15:04:43 -07:00
71303b8af4 Autograd deadlock for recent glibc fix (#1243) 2017-04-12 22:24:31 +02:00
4336e9ea66 Revert "make it compile on Windows + use ilp64 MKL" (#1002) 2017-04-12 12:07:16 -07:00
d48afd41f9 Add print string for MaxPool3d, change for MaxPool2d (#1115) 2017-04-12 15:58:28 +02:00
e21e4bf3e8 add pyyaml to conda note here as well 2017-04-11 21:21:18 -07:00
8e36339911 Merge commit '0925c91e80cc1b3a86fcbc54570f5bb204c9cb77' 2017-04-11 18:00:44 -07:00
5391fe8953 addr zeroes output buffer when beta=0 2017-04-11 18:00:11 -07:00
0925c91e80 addr zeroes output buffer when beta=0 2017-04-11 17:59:42 -07:00
253c854da5 update Dockerfile not to use requirements.txt 2017-04-11 15:42:05 -07:00
7c59754d24 update source build instructions 2017-04-11 15:24:31 -07:00
2bf7dc643f Merge commit 'aec658f8708a6f4448329da006d14ff2e13dc821' 2017-04-11 15:02:36 -07:00
ce30c76823 Merge commit '2b37ecfccf810a8e21c2c9ac9a943ce2f7c01015' 2017-04-11 15:02:16 -07:00
a8d60ad3ac fix THNN headers 2017-04-11 15:00:30 -07:00
aec658f870 fix THNN headers 2017-04-11 14:57:11 -07:00
01a35dcace Fix coalesced CUDA collectives for nonhomogeneous lists 2017-04-11 14:48:54 -07:00
afeeb81e79 Add support for keyword arguments in torch.cat 2017-04-11 14:48:54 -07:00
6002f94232 Fix is_tensor and is_storage for old-style classes 2017-04-11 14:48:54 -07:00
a5c7d98611 Import TripletMarginLoss 2017-04-11 14:48:54 -07:00
605b3c86ce Retain the type of numpy scalars in collate_fn 2017-04-11 14:48:54 -07:00
2087b1157a Improve serialization error messages 2017-04-11 14:48:54 -07:00
81e972031d Handle all errors if Module's sources can't be retrieved 2017-04-11 14:48:54 -07:00
e9ff57176b Fused pointwise kernels for GRU/LSTM 2017-04-11 13:42:06 -07:00
a739960515 Merge commit 'cfa504691c2ce5e10010ffb6cd43001c59109aea' 2017-04-11 13:41:54 -07:00
f43320dbf2 Merge commit '0dc52abe9a673547caf79ac64c73e8e16fb37b33' 2017-04-11 13:41:42 -07:00
cfa504691c Fused pointwise kernels for GRU/LSTM 2017-04-11 13:36:38 -07:00
0b50f794e9 Use thnn version of Tanh/Sigmoid instead of autograd. (#1234) 2017-04-11 12:49:57 -07:00
2abbb5133c Fixing function signatures: long -> ptrdiff_t (#1232) 2017-04-11 11:37:21 -07:00
fcf8387779 Fix ibv_devices wrapper if device list is empty
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4866469

fbshipit-source-id: 6bbde8ec9d71ea89ccdab379d48d122b90237460
2017-04-11 11:04:54 -07:00
ade105fb7c update README to install pyyaml from conda (#1231) 2017-04-11 10:23:45 -07:00
4e693d12ab Merge commit '79c4cb96b16dac603247ffd88c473e84565915a9' 2017-04-10 14:35:54 -07:00
79c4cb96b1 fix memory leak in btrisolve and getri 2017-04-10 14:35:07 -07:00
97bd6aae37 Throw error if Redis replies with error
Summary:
The code already asserted, but only on the reply type, so it didn't
include the actual error message. This makes debugging much easier
when people have problems running the benchmark suite.

Differential Revision: D4860022

fbshipit-source-id: 659bc461a724603375bff18eac90eca658492b05
2017-04-10 10:49:59 -07:00
f618ea9f31 Update README.md
Summary:
Mention GPUDirect in README
Closes https://github.com/facebookincubator/gloo/pull/24

Differential Revision: D4860167

Pulled By: pietern

fbshipit-source-id: 80804c778cdc6a9bcd8febe7e05142145cc6c61b
2017-04-10 10:49:59 -07:00
f6fef3718e fix typo in autograd.rst (#1219) 2017-04-10 01:16:59 -04:00
3fcdd6a42b Reuse sockaddr information from device
Summary: This is cheaper than doing getaddrinfo for every pair.

Reviewed By: andrewwdye

Differential Revision: D4850102

fbshipit-source-id: e77f468f099f63860b52fdd0dcc57a8a7a91a448
2017-04-09 16:37:41 -07:00
707c1ca4cc Function to retrieve PCI bus ID from device
Summary:
Part of this change is to perform a getaddrinfo in the TCP device
class so we can figure out the interface and subsequently PCI bus ID
of the NIC used for its traffic. This information can be used in a
later diff to avoid doing getaddrinfo calls in the TCP pairs and have
them reuse the information that is resolved by the device.

The PCI bus ID can be used to compute distance between NICs and GPUs
and make informed decisions on where to allocate scratch buffers.

Reviewed By: andrewwdye

Differential Revision: D4850035

fbshipit-source-id: 575e401a9273300bc720c814fef8971846ec748c
2017-04-09 16:37:41 -07:00
bc0ed9298d remove incorrect version in readme 2017-04-09 14:44:44 -04:00
040cf42643 Merge pull request #455 from twitter-forks/indexlinear
Adding Indexlinear
2017-04-09 13:52:56 -04:00
64ee4056d7 updated docker image inside the docs (#1216) 2017-04-08 10:29:03 -04:00
55d69b5ade Merge commit '88bcfc15316e3c878237a8f95aeb6e72402c90ff' 2017-04-07 17:20:52 -07:00
0d7d6e1f0d Merge commit '662163bef68a9d64f3cb13a903638c870c0b4aa6' 2017-04-07 17:20:15 -07:00
b16a352a3b Fix remainder and cremainder for integer types 2017-04-07 17:17:44 -07:00
88bcfc1531 Fix remainder and cremainder for integer types 2017-04-07 17:16:59 -07:00
662163bef6 Fix remainder and cremainder for integer types 2017-04-07 17:16:31 -07:00
4026593240 check for beta=0 and avoid multiply in sparse mm (#1211)
* check for beta=0 and avoid multiply in sparse mm
2017-04-07 20:14:32 -04:00
a931064a52 Merge commit '441d75ce569f89bad3e2f1f2a2075e68ae3bc76b' 2017-04-07 16:57:05 -07:00
441d75ce56 Adapts basic operations to new THXVector interface 2017-04-07 16:56:12 -07:00
3de56785fa fix conv1d test and add for padding 2017-04-07 13:56:02 -07:00
5ee8536a02 Merge commit 'a89317a9d407241c97fe4486b3c88de8578445d7' 2017-04-07 13:49:18 -07:00
f00a5d2f54 Merge commit '66a20e5c328836c1eb720cf4e2eb916366aae487' 2017-04-07 13:47:25 -07:00
e48db02e10 remove unused python-level BatchNorm.py 2017-04-07 16:27:16 -04:00
7f2553bc6f dont use cudnn batchnorm for cudnn < 5.1.10 2017-04-07 16:27:16 -04:00
66a20e5c32 Support TORCH_NVCC_FLAGS environment variable
This is already supported in cutorch since August 2016, and is used in
pytorch integration (to reduce the binary size).
2017-04-07 18:23:22 +02:00
37d95687c4 Merge commit 'ae1c365dbdbf667ae24c57eec9f2e6b9debf16bd' 2017-04-06 16:37:31 -07:00
f0c7124420 Allow support for negative dimension argument for all functions 2017-04-06 16:37:00 -07:00
ae1c365dbd Add TH_INDEX_BASE to nDimension and stride functions 2017-04-06 16:30:11 -07:00
6fd9b53d93 Include common/linux.{h,cc} in CMake build
Summary:
Forgot to include these in a previous commit.
Closes https://github.com/facebookincubator/gloo/pull/23

Differential Revision: D4847072

Pulled By: pietern

fbshipit-source-id: 08aa9e8fa47377eb8c7747bd577eec7e615789f1
2017-04-06 15:20:59 -07:00
e692c38fcf Compute distance metric between PCI devices
Summary:
With this we can compute the best GPU device to reduce on. It is not
always the one CUDA indicates as GPU 0.

Reviewed By: andrewwdye

Differential Revision: D4845581

fbshipit-source-id: 13e0500f54fd507899646f781a97c09abcd3b056
2017-04-06 13:50:07 -07:00
5dfa73702f Display runtime information in benchmark output
Summary:
This makes it easier to capture, compare, contrast results with
different parameters.

Reviewed By: andrewwdye

Differential Revision: D4843715

fbshipit-source-id: ba6916dcd5f8bcc615d6edce1a54657241357c31
2017-04-06 11:06:23 -07:00
95140094cb Use CudaStream as first class object
Summary:
Instead of having every CudaDevicePointer "own" a stream, this change
moves to using CudaStream as a first-class object. It was pretty clunky
to use the copy{To,From}* functions on the CUDA pointer classes to
copy stuff around. For example, it was not clear whether the stream
belonging to the source or the destination was used to execute the
copy. There is no longer such ambiguity after this change.

To make this work the CudaBroadcastOneToAll algorithm was changed to
include the workspace template argument, but only has the
CudaHostWorkspace implementation. The CudaDeviceWorkspace
implementation is left to be done for another change (that's not the
purpose of this change).

Reviewed By: andrewwdye

Differential Revision: D4841615

fbshipit-source-id: d0c1b9ba948ff6167832515afa7bdd2b32b48064
2017-04-06 11:06:23 -07:00
ef95926103 Move setTimeout to Device and set default tcp timeout to 30 sec
Summary: Make timeout a device attribute. Now the pair will configure timeout when connecting based on device timeout settings, instead of needing to be set explicitly on each pair. Set default tcp timeout to 30 sec.

Reviewed By: pietern

Differential Revision: D4838918

fbshipit-source-id: e6e6ee36c662eb5e7ba5354c904e50f9dcac258f
2017-04-06 08:50:21 -07:00
e7f5220dfa device_ids can be None again in data_parallel (#1187) 2017-04-06 10:30:53 -04:00
a7ae04a657 fix precedence problem when building with debug python (#1201) 2017-04-06 10:30:16 -04:00
7f03182bfa sizeAverage -> size_average in docs 2017-04-06 01:31:02 -04:00
9f2a5d804d Add a flag to fix when dataset size is not divisible by batch size. (#1133) 2017-04-06 00:18:43 -04:00
aa506fa4d7 fix docs typo 2017-04-05 23:42:02 -04:00
955869a09a fix cuda_allreduce_halving_doubling to correctly copy between and reduce on GPU buffers
Summary: cuda_allreduce_halving_doubling was not properly handling the case where buffers are allocated in GPU memory, trying to reduce and copy from them as if they were in system memory.

Reviewed By: pietern

Differential Revision: D4840259

fbshipit-source-id: 2615360cd2f1d9c7a37fb0bcdf33ff35528b2c75
2017-04-05 19:56:20 -07:00
d82cad3019 implement nn.Module.__dir__ (#1142) 2017-04-05 22:18:34 -04:00
9504246c32 add triplet margin loss (#1165) 2017-04-05 22:17:58 -04:00
81cf3dbf79 Merge commit '6bd4ecd15390517c68d598d236ffb0929ade277c' 2017-04-05 19:07:01 -07:00
12f1b4f76c Merge commit '84bdbe5ab4b602b021ff494487c8ad57457052d3' 2017-04-05 19:06:14 -07:00
84bdbe5ab4 btrisolve: Add sz checks, correct B's ordering, support nrhs>1. 2017-04-05 19:05:20 -07:00
85954032d9 fix doc formatting 2017-04-05 22:02:29 -04:00
1a04b92226 add note regarding SGD momentum 2017-04-05 20:45:41 -04:00
8a822d48f5 Update README.md
Summary:
Clarify that Redis Cluster is not supported. Also see #21.
Closes https://github.com/facebookincubator/gloo/pull/22

Differential Revision: D4837375

Pulled By: pietern

fbshipit-source-id: 6e3575b3b8dae6ca62beb765da15d8506da4abdb
2017-04-05 13:06:48 -07:00
5511ad258b cuda version of recursive halving/doubling allreduce
Summary: Basic port of the CPU halving/doubling algorithm. No pipelining is done between reduce/broadcast and communication.

Reviewed By: pietern

Differential Revision: D4823693

fbshipit-source-id: b18045d64edf90361bf7713f4ccb2e074757780f
2017-04-05 12:39:16 -07:00
75a635630d Update to ignore zero targets
If the target is zero, the loss and the gradient of the input are set
to zero. This is useful for variable-length natural language generation
models.
2017-04-05 11:51:54 -07:00
8e6524938b Undo D4832492 for Gloo
Summary: No folly dependency in Gloo.

Reviewed By: andrewwdye

Differential Revision: D4835050

fbshipit-source-id: 97d0c14fb770fdde68206ca5a20a974bef156392
2017-04-05 09:51:05 -07:00
4e4cfd8b2b Fix main()s to call folly::init/initFacebook/registrationComplete (part 14)
Summary:
Required for D4821763
Based on targets from https://fb.facebook.com/groups/fbcode/permalink/1304073246296178/ (I also excluded those targets which do not depend on folly:singleton).

Reviewed By: meyering

Differential Revision: D4832492

fbshipit-source-id: fcb4ce42e9e5359d4752769f77d7271e550201fe
2017-04-04 20:50:47 -07:00
6bd4ecd153 Use thrust::inclusive_scan for 1D cumsum/cumprod (#742)
For large 1D tensors thrust::inclusive_scan is much faster than our
current implementation.
2017-04-04 21:05:10 -04:00
5c802c5ba9 Refactor AllgatherRing to use remote buffer offset
Summary: Refactor AllgatherRing algorithm to remove all memcpy in the communication rounds by using outPtrs as send/receive buffer + remote buffer offset.

Reviewed By: pietern

Differential Revision: D4793186

fbshipit-source-id: 645d0758d246fd0b493e3fe312a8441d86f6d169
2017-04-04 17:08:26 -07:00
04f5b5ea83 Merge commit '5b40e4245d573ae0a6c2da70a0b712528aab2bce' 2017-04-04 15:39:35 -07:00
5b40e4245d Fix typo and make btrisolve work for doubles on the CPU. 2017-04-04 18:29:30 -04:00
ae5865082c Move common algorithm stuff into algorithm.h
Summary:
Combines the top level common.h with algorithm.h. With algorithm.h in
the common package, CUDA algorithms only need a dependency on that
package. CudaBroadcastOneToAll still depended on broadcast.h so this
change also removes that dependency and has it subclass the Algorithm
class.

Reviewed By: andrewwdye

Differential Revision: D4826885

fbshipit-source-id: 930037e39f7a2c941868e53f0bbc54e3f2e0b184
2017-04-04 13:05:50 -07:00
f86beccc5b Use workspace pattern with CudaAllreduceRingChunked
Summary:
GPUDirect support for CudaAllreduceRingChunked by adding a workspace
template parameter and adding workspace specific init functions.

To support this change the CUDA LocalOp classes had to be changed a
bit to take an extra destination/source pointer. This allows reduction
of 1-N pointers into a target pointer, where the target may live on
device or live on host. If it lives on the host, the NCCL operation
that executes the reduction is followed by a D-to-H memory copy. If
there is only a single input pointer, no reduction needs to happen and
the class just executes the D-to-H memory copy. The net result is that
we can interchangeably use device or host pointers as the target for
reduction or the source for broadcast, and these LocalOps do what you
would expect them to do.

Reviewed By: andrewwdye

Differential Revision: D4825236

fbshipit-source-id: 048ec6cbc5a0500bafbe1b3f6abe1e2e5f3a2675
2017-04-04 13:05:50 -07:00
d122b4e4ec Update btrisolve docs to the newest interface. 2017-04-04 15:21:16 -04:00
ccfc4567dc Merge pull request #78 from ilya-biryukov/master
Fix compilation error when compiling with 'clang -x cuda'.
2017-04-04 09:47:52 -07:00
81008aa111 Handle errors in sync IO path.
Summary: Fixes for handling errors and timeouts in blocking and polling sync paths. Add test coverage for errors and timeouts.

Reviewed By: pietern

Differential Revision: D4823498

fbshipit-source-id: 93721947a6404ca9cea6a4869f4156f8d270a981
2017-04-04 09:37:33 -07:00
0cdf10478d Start benchmark element sweep at 100
Summary:
Any number of elements below this always fits in a single packet
and will yield ~identical results.

Differential Revision: D4825190

fbshipit-source-id: 71ac77456049e991da5059d5a029c5e9d2a67ed7
2017-04-03 23:50:38 -07:00
4de82cfa0f Use CudaAllreduceRing<CudaDeviceWorkspace> for GPUDirect
Summary:
The existing CudaAllreduceRing with a CudaDeviceWorkspace
template parameter now has the same effect.

Reviewed By: andrewwdye

Differential Revision: D4823393

fbshipit-source-id: 88fe497a983b26a281a3a74fe3bdc02c0c87c523
2017-04-03 20:05:25 -07:00
1ac8251373 Use gloo::make_unique to fix build for C++11
Summary: Closes https://github.com/facebookincubator/gloo/pull/20

Differential Revision: D4820325

Pulled By: pietern

fbshipit-source-id: 00a870f71e8e98ce6d06da261dcaed83b81ec81c
2017-04-03 17:07:04 -07:00
511ca3ea1b Add tests for tcp transport failures
Summary:
Implement a file store for multi-process transport failure testing. Add test cases to spawn multi-process tcp communication, and verify that all processes throw the expected IoException.

A future diff will add coverage for connectivity failures, sync modes, and ibverbs.

Reviewed By: pietern

Differential Revision: D4807794

fbshipit-source-id: 35212719d46e6d875eacb341fae25681f39053bc
2017-04-03 16:08:39 -07:00
8ce1382e99 make it compile on Windows + use ilp64 MKL (#981) 2017-04-03 18:02:15 -04:00
22cdef3ddc recursive halving/doubling allreduce
Summary:
Allreduce using the recursive halving and doubling algorithm. The algorithm is described in http://www.mcs.anl.gov/~thakur/papers/ijhpca-coll.pdf (see top diagram on page 12). It consists of 2 log P stages, the first log P performing a reduce-scatter and the second log P the allgather. Message size is variable across steps. The early stages of the reduce-scatter and the late stages of the allgather send the largest messages. The communication is structured such that the largest messages are sent between nearby ranks, which could be useful if elements are ranked in a locality-aware fashion.

So far this supports only a power-of-two number of processing elements.

I have attempted to minimize the amount of synchronization/hand-shaking. Messages are received at different offsets of the output buffer for each communication step. Send offsets in the reduce-scatter steps become receive offsets in the allgather and vice versa. The reuse of buffers across reduce-scatter and allgather steps requires synchronization. Right now the algorithm is inefficient in terms of memory use, requiring 3x memory; this can be reduced, but would require additional synchronization.

Reviewed By: pietern

Differential Revision: D4795878

fbshipit-source-id: fcc6597ef6a99cd102fce2b8e4562d93088d39dc
2017-04-03 14:05:44 -07:00
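A per-rank schedule sketch for the power-of-two case, as one reading of the description above (partner = rank XOR distance; not gloo's actual code):

```
def schedule(rank, nprocs, nelems):
    """Reduce-scatter steps, then a mirrored allgather, for one rank."""
    steps = []
    dist, size = 1, nelems // 2
    while dist < nprocs:                          # log2(P) halving steps
        steps.append(('reduce-scatter', rank ^ dist, size))
        dist, size = dist * 2, size // 2
    for _, partner, size in reversed(steps[:]):   # allgather mirrors it
        steps.append(('allgather', partner, size))
    return steps

for step in schedule(rank=0, nprocs=8, nelems=64):
    print(step)   # the largest messages travel between nearby ranks
```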
148b11847b Remove useless base class in allreduce.h
Summary:
Didn't provide enough value now that ReductionFunction and
CudaReductionFunction are no longer related.

Reviewed By: andrewwdye

Differential Revision: D4819295

fbshipit-source-id: e6479769af7f78d486bee7d9c31f049430cdc775
2017-04-03 11:09:50 -07:00
b3a2f30715 Extra workspace template parameter for CUDA algorithm
Summary:
To bring the GPUDirect and non-GPUDirect implementations of CUDA aware
algorithms closer together this change introduces CUDA workspaces.
There's an implementation for a host side workspace and a device side
workspace. The former is used for transports that don't support
GPUDirect and the latter for ones that do. CUDA algorithms will take
an extra template parameter for this workspace and this will determine
whether they can be used for GPUDirect or not.

The workspaces only define their respective pointer types right now
but may contain local operation construction functions at a later
point in time.

Reviewed By: andrewwdye

Differential Revision: D4802826

fbshipit-source-id: cb1d71a224ce0165afd07fb9092ad54d3e07c8cf
2017-04-03 11:09:50 -07:00
91c4ba7980 Add torch.arange and deprecate torch.range 2017-04-03 10:38:58 -04:00
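The behavioral difference motivating the deprecation:

```
import torch

torch.range(1, 4)    # 1, 2, 3, 4 -- upper bound included (deprecated)
torch.arange(1, 4)   # 1, 2, 3    -- half-open, like Python's range
```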
03f1cab801 Unify argument names in norm and renorm 2017-04-03 10:38:58 -04:00
fa2c566353 Add Variable.type_as 2017-04-03 10:38:58 -04:00
2d1122739c Raise AttributeError in Module.__getattr__ 2017-04-03 10:38:58 -04:00
7861f585fe Reshape grad in dot 2017-04-03 10:38:58 -04:00
3abf2ef225 Merge pull request #991 from BTNC/win
add /arch:AVX /arch:AVX2 explicitly for msvc so it compiles on windows
2017-04-02 13:32:57 -04:00
70c4b82eba add /arch:AVX /arch:AVX2 explicitly for msvc 2017-04-02 20:47:29 +08:00
274b5c9003 Allow unhashable inputs to parallel_apply 2017-04-01 20:11:20 +02:00
dfa2d26830 * make random_ range correct when both lower and upper are specified 2017-03-31 15:37:24 -04:00
559ae078b8 Fix Option constructor in invalid argument error printing code (#1160) 2017-03-31 15:35:35 -04:00
030ff4928a Merge commit 'a216e377b3844ac9c7882bd391a00f4e0ae718e7' 2017-03-31 11:45:37 -07:00
0829bffdec Merge commit '403cad46dc91a2bc2f6889754055decd6f3d53c7' 2017-03-31 11:45:24 -07:00
ffc7911bec Merge commit 'd8ae7893e056ebf4e7a5e96bab2c3b69f196ddfd' 2017-03-31 11:45:06 -07:00
ff1fde6151 Merge commit 'a3bfb9f376a57fb63e89ddf70f57353f19ed9d69' 2017-03-31 11:44:48 -07:00
a216e377b3 Merge pull request #456 from twitter-forks/addmm-fixes
Using temporary variables when performing transpose + addmm
2017-03-31 14:44:07 -04:00
b13b7010b9 check for nvidia driver's sufficiency before checking for number of CUDA devices (#1156) 2017-03-31 12:19:59 -04:00
a3bfb9f376 THVector_(add),(mul) -> (adds),(mul) for VSX.
This was previously completed for other architectures.
2017-03-31 08:50:23 -07:00
5c79046d39 Use persistent tensor to store exp_inf (part of optimizer's state) (#1152) 2017-03-31 10:30:31 -04:00
30fd222b80 implement autograd function cross (#1138) 2017-03-31 01:45:51 -04:00
3b7b23df66 Move CUDA collectives to cuda_collectives.h
Summary:
The CUDA algorithms all had their own version of local reduction and
broadcast. This commit consolidates them and allows all CUDA
algorithms to work with CudaDevicePointer instances.

Reviewed By: andrewwdye

Differential Revision: D4797968

fbshipit-source-id: cccef39fce01905a2cd757ccbcffd29803411409
2017-03-30 15:06:03 -07:00
d933287114 Add a barrier after verification iteration in benchmarks to prevent a race with regular iterations
Summary: Verification was sometimes failing for allreduce halving-doubling. Pieter noticed that it is due to the verification step racing with the regular iterations.

Reviewed By: pietern

Differential Revision: D4804558

fbshipit-source-id: f645cb2e332e449a993a634c5bdb42c2dcb8613b
2017-03-30 14:14:32 -07:00
761eef1f19 Minor typo fix in backward function in torch/autograd/variable.py (#1143) 2017-03-30 11:23:28 -04:00
d8ae7893e0 Get rid of warp-synchronous code (#739)
Time to get rid of warp-synchronous code. It will break!
2017-03-30 01:20:43 -04:00
90b872c670 Add GPUDirect capable version of CudaAllreduceRing
Summary:
This is a copy of CudaAllreduceRing that doesn't stage the locally
reduced buffer in host memory but uses the GPU side buffers directly.

Eventually I would like this to be absorbed back into
CudaAllreduceRing, but for now it's a good place to compare the two
implementations and abstract the parts that make sense, until they are
identical again.

Reviewed By: andrewwdye

Differential Revision: D4791629

fbshipit-source-id: 5ad065cb94adb968aeee2379327be313638f2161
2017-03-29 18:50:11 -07:00
a95ce9e98f Using temporary variables when performing transpose + addmm 2017-03-29 16:56:39 -07:00
b8ccf42c74 Constify algorithm constructors
Summary: TSIA

Reviewed By: gchanan

Differential Revision: D4795492

fbshipit-source-id: aaad7afd373e40fa4669129cf2c98594c4091153
2017-03-29 14:21:03 -07:00
8aa1cefed8 Fix deadlock in autograd (#1140) 2017-03-29 16:19:40 -04:00
4b147e2079 Settable timeout for tcp read/write
Summary: Add a setTimeout() API to the Pair interface. Implement in the tcp transport for connect, read, and write, and across blocking, polling, and async configurations. Ibverbs implementation to come later.

Reviewed By: pietern

Differential Revision: D4787932

fbshipit-source-id: 6072dc0c0add1700f84a72b83e4388b29b044ec1
2017-03-29 09:07:04 -07:00
0d908d813b Implements Cumsum function for autograd (#1122) 2017-03-29 17:45:57 +02:00
1c391f6f93 bump version 2017-03-29 10:08:34 -04:00
be146fd721 Add btriunpack and update the btrifact test. 2017-03-29 13:42:13 +02:00
2979f4b989 add more functions to docs 2017-03-29 01:29:17 -04:00
22b3600f19 add samplers to documentation 2017-03-29 00:33:07 -04:00
215813d7ac Change dockerfile to support for cudnn v6 (#1135) 2017-03-28 20:05:04 -04:00
80e88a88ed Fix ibverbs completion queue capacity
Summary:
The header already contained an analysis of required completion queue
depth but the queue pair was still initialized with a maximum queue
depth of kMaxBuffers. This change fixes that and updates the analysis
to talk separately about receive and send completion queues.

Reviewed By: andrewwdye

Differential Revision: D4785786

fbshipit-source-id: 4dc302d523a3b7162dc261d14cfcc755681febf8
2017-03-28 10:06:50 -07:00
dc7695a47a Update links for tutorials in README (#1123) 2017-03-28 14:21:40 +02:00
032a65edff modify pip uninstall command in CONTRIBUTING.md 2017-03-28 14:20:49 +02:00
55546359b6 Retry on EINTR for writev in tcp/pair.cc
Summary: TSIA

Differential Revision: D4783319

fbshipit-source-id: 610d1a65a54048e7c56610632ccfe271eac85b6c
2017-03-27 17:35:45 -07:00
fe3d5a63f2 Support multiple predefined reduction functions
Summary:
Predefining the reduction functions makes it easy to provide a set of
fast implementations. Eigen is used to implement them if it is found.

Reviewed By: andrewwdye

Differential Revision: D4780868

fbshipit-source-id: e825cf2e5cfe8ec27d587c5aff4002534b1c670d
2017-03-27 14:35:02 -07:00
e4b4e515cd add mode to cwrap 2017-03-27 13:29:14 -07:00
4b1f5f4bd6 Merge commit 'afd576ec0e389db3e47efe44652c488b1706f168' 2017-03-27 13:26:50 -07:00
37718e207d Add remote offset argument to buffer send
Summary: This makes it possible to write to any offset in a remote buffer.

Reviewed By: andrewwdye

Differential Revision: D4779776

fbshipit-source-id: f5a44cc705df5141bd720ff4e3fec8697f707a70
2017-03-27 13:07:17 -07:00
afd576ec0e Add mode kernel 2017-03-27 15:58:47 -04:00
95aa2af377 btrisolve: Make a Tensor method and update argument order
Also update docs for btrifact and btrisolve to the newest interface.
2017-03-27 15:46:49 -04:00
6774d39c96 Merge commit '5d274cd4991022d63b014cc8917e00c15441d3f4' 2017-03-27 11:54:08 -07:00
567faedc59 Merge commit '8051dec608368fed3569c7513292785083adc53c' 2017-03-27 11:53:41 -07:00
7c2c7e8e31 Move NCCL code to subdirectory and backfill ops
Summary:
All operations supported by NCCL are now available through the Gloo
wrappers. Algorithm wrappers for them are forthcoming so that they
can be used interchangeably with other implementations.

Since not all of them require same-sized source and destination
pointers, I moved assertions on number of elements to the op
constructors.

Reviewed By: andrewwdye

Differential Revision: D4771292

fbshipit-source-id: 2f34629507b5e1cb9ae8d6d2f02de0a7f641a341
2017-03-27 09:50:40 -07:00
3eab8a71e2 Added docstring to add_module (#1116) 2017-03-27 11:09:24 -04:00
2fd4d088ff add Adaptive pooling methods to docs 2017-03-26 22:43:46 -04:00
5d274cd499 Update btrisolve argument order. 2017-03-26 13:07:24 -04:00
8051dec608 Update btrisolve argument order. 2017-03-26 13:06:34 -04:00
f2c1071c33 Adaptive max and average pooling (1D & 2D) (#1084) 2017-03-26 17:09:28 +02:00
bb71117ecc Cwrap arg assign (#1102) 2017-03-26 13:53:28 +02:00
d25433a099 Fix docker build commands (#1103) 2017-03-25 16:18:33 -04:00
7dd45490f8 don't use inplace backward, remove unnecessary zero for grad_input (#1079) 2017-03-25 20:04:48 +01:00
bf632544e6 Pass NULL rinfo_ to btrifact by default (#1089) 2017-03-24 19:49:40 -04:00
282402d4f3 Revert "Add back zero fill for ger" (#1093)
This reverts commit 5a761dbe65d2221e9c200b3f8ea0590b5d9b923f.
2017-03-24 19:49:31 -04:00
1461709ea0 Improving the performance of IndexLinear:updateOutput
- Removes separate kernel for updateOutputTrain
2017-03-24 16:34:31 -07:00
cce03074f5 Merge commit '3acbbb30f2bdc6ccf4ffb6f7d568e7916d4e384d' 2017-03-24 16:19:44 -07:00
f2f63773d8 Merge commit '52911f9e47f679045a238eb9dfdc5db55bf98cc9' 2017-03-24 16:19:19 -07:00
84aa41824c Merge commit 'b4fe5ad641181f30bdcc4749c949206a3ebb04b4' 2017-03-24 16:19:05 -07:00
25c8a117af Merge commit 'e8196f990db4ba368010f0d950bebf1fb13c2888' 2017-03-24 16:18:52 -07:00
ae122707b5 Don't do extra resize in linear bias 2017-03-24 23:41:15 +01:00
b4fe5ad641 Use zero instead of mul when beta == 0 in addr 2017-03-24 13:09:00 -07:00
5a761dbe65 Add back zero fill for ger
Ger does not have beta argument, so has to be zero-filled.
2017-03-24 21:03:02 +01:00
dd893391d5 Add argument to children to yield the name of the modules (#941) 2017-03-24 20:02:05 +01:00
649f04d077 Added Pascal nvcc flags, bumped version 2017-03-24 11:58:14 -07:00
f45ef5fdb8 AllGather algorithm [CPU]
Summary: Allgather ring CPU implementation. It does |buffers| x |contextSize| passes.

Reviewed By: pietern

Differential Revision: D4723809

fbshipit-source-id: ffd8366ac7e1746555474e173143d33cee497822
2017-03-24 11:06:57 -07:00
e8196f990d Make rinfo_ argument optional in btrifact 2017-03-24 09:01:36 -07:00
269b77a1b2 Make rinfo_ optional in btrifact 2017-03-24 09:00:39 -07:00
476d85dd3f DataLoader: Fix batch data type for numpy array (#1074) 2017-03-24 11:34:24 -04:00
63f6c0d692 add Pairwise distance (#835) 2017-03-24 11:29:40 -04:00
b546fa3fcd add assertTrue to padding tests 2017-03-24 15:27:51 +01:00
1d656b6769 Ensure displayed progress in ProgressMonitor is between 0 and 100%.
Fixes #1086
2017-03-24 15:21:52 +01:00
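The intent of the fix, expressed as a small sketch in C++ for illustration (the actual ProgressMonitor code is elsewhere; only the clamping logic is shown):

```
#include <algorithm>

// Clamp displayed progress to the range [0, 100].
int displayedProgress(long long done, long long total) {
  if (total <= 0) {
    return 0; // avoid division by zero on empty workloads
  }
  long long pct = done * 100 / total;
  return static_cast<int>(std::min(100LL, std::max(0LL, pct)));
}
```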
3acbbb30f2 Fix inconsistent in-place and out-of-place for HardTanh
in-place and out-of-place updateGradOutput results are different where input=min_val or input=max_val
2017-03-23 17:27:29 -07:00
a65e0f488c Remove zero fill where not needed (#1077) 2017-03-23 19:44:00 -04:00
8dc5d2a22e export current_blas_handle 2017-03-23 23:32:45 +01:00
ed97f3f854 Adding support for flattened inputs for IndexLinear
- Adding relevant tests
2017-03-23 14:18:41 -07:00
a231fe8fc5 IndexLinear support for cunn 2017-03-23 14:18:01 -07:00
bb353ccc17 Add batch triangular factorization and solves, add IntegerTensor to cwrap (#903) 2017-03-23 15:06:00 -04:00
ced0054a9e Fix formula for stddevs grad in Normal function (#1076) 2017-03-23 14:32:34 -04:00
68ee5ede29 make inplace tests compare input grads 2017-03-23 18:54:00 +01:00
2966e3295d Make static/shared configurable and install optional
Summary:
This makes it possible to embed Gloo in a project without CMake
installing Gloo headers and/or libraries, or having a runtime
dependency (and statically link to it).

Also:
* Install benchmark tools
* Statically link to NCCL if the bundled version is used
Closes https://github.com/facebookincubator/gloo/pull/19

Differential Revision: D4762432

Pulled By: pietern

fbshipit-source-id: cf38903e6c51f2480fba4ff18cbdc0c9080df0c4
2017-03-23 09:06:37 -07:00
4df98e2927 Merge commit '3865606299b1fbcd0a94cef4a66c1bc007246da8' 2017-03-23 08:39:43 -07:00
6ccac5ce28 Merge commit 'd3334db6274d7a3cd07f20d583056e453dc8134d' 2017-03-23 08:39:30 -07:00
3865606299 adding batch triangular factorization and solves, add IntegerTensor to cwrap 2017-03-23 11:37:00 -04:00
d3334db627 adding batch triangular factorization and solves, add IntegerTensor to cwrap 2017-03-23 11:35:35 -04:00
50f5a4dd18 fix BCE loss formula visualization (#1072) 2017-03-23 11:27:21 -04:00
b60936b9ae fix NLLLoss2d documentation 2017-03-23 10:06:40 -04:00
2d750b9da5 fix typo 2017-03-23 09:40:06 -04:00
ca376d4584 implement autograd function trace 2017-03-23 10:37:52 +01:00
ef183a1d23 Merge commit '5cd313ed23a3b11ddd739bcfedaee6e310e4e438' 2017-03-22 19:25:46 -07:00
f4d8944973 fix OSX fread bug (#1068) 2017-03-22 22:06:14 -04:00
6b7aef63ac Added support for multidimensional tensors in PReLU; Channel number now in second dimension 2017-03-22 20:36:52 -04:00
b3ab4b1094 Check torch.backends.cudnn.enabled, padding, and output_padding (#996)
* Check torch.backends.cudnn.enabled
* Don't allow negative padding and output_padding values
2017-03-22 19:42:11 -04:00
1e8cb82a2d Break only after the update in L-BFGS 2017-03-22 18:58:42 -04:00
dd399a8d68 Return total param norm from clip_grad_norm 2017-03-22 18:58:42 -04:00
faac0f5c25 Fix torch.cat bugs
Always use the PySequence API and disallow catting along nonexistent
dimensions.
2017-03-22 18:58:42 -04:00
c36f47bd1e Make random_ exclusive and make generator kwarg only in all random
functions
2017-03-22 18:58:42 -04:00
3d1888cd95 Fix size mismatch in CosineEmbeddingLoss backward 2017-03-22 18:58:42 -04:00
97a82a3018 fix formatting in upsampling docs (#1067) 2017-03-22 18:06:31 -04:00
5cd313ed23 Fix TH_TENSOR_APPLYX_D in the case where the dimension of interest is the inner dimension 2017-03-22 13:15:01 -07:00
b414494035 Merge commit '714b2b8bf657afe41cc8503998b6d919339b8075' 2017-03-22 12:49:29 -07:00
c10efc646e Merge commit 'e17d84d38edf6094175deead555abbc96321b69f' 2017-03-22 12:49:11 -07:00
348531ad8d Merge commit '0056b0883426e38ffbd646c040b6c281d12673f2' 2017-03-22 12:48:57 -07:00
9d83121ef5 Don't add options to CUDA_NVCC_FLAGS if already set
Summary:
This may be the case when the Gloo CMake files are sources from a
parent project that has already imported CMake CUDA support. If these
checks are not performed then CUDA_NVCC_FLAGS might contain
conflicting options.

Verified this works while working on Gloo for Caffe2.
Closes https://github.com/facebookincubator/gloo/pull/18

Differential Revision: D4756179

Pulled By: pietern

fbshipit-source-id: 32fc39ec2322cce5899a2398ebbf8395d3917502
2017-03-22 12:35:04 -07:00
6d7cb31e53 MPI: Duplicate MPI_Comm and allreduce maxLength as MPI_ UNSIGNED_LONG.
Summary:
Some small MPI-related changes:
1) Instead of making an object copy of the MPI_Comm, call MPI_Comm_dup;
because the (passed-in) communicator is used later via the call to
connectFullMesh, this guarantees that the communicator will not have been
freed by the user before connectFullMesh is called.

2) Allreduce for maxLength is done on an unsigned long type; use the
corresponding MPI type.
Closes https://github.com/facebookincubator/gloo/pull/17

Differential Revision: D4754195

Pulled By: pietern

fbshipit-source-id: 863fd33c726f88120f8f5ee61964c3525babbf97
2017-03-22 09:26:00 -07:00
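A sketch of both changes using standard MPI calls (the function name is illustrative and connectFullMesh is only referenced in a comment; everything else is plain MPI):

```
#include <mpi.h>

void setupFromComm(MPI_Comm userComm, unsigned long localMaxLength) {
  // 1) Duplicate rather than copy the communicator object, so it stays
  //    valid even if the user frees theirs before connectFullMesh runs.
  MPI_Comm dup;
  MPI_Comm_dup(userComm, &dup);

  // 2) maxLength is an unsigned long, so use the matching MPI type.
  unsigned long maxLength = 0;
  MPI_Allreduce(&localMaxLength, &maxLength, 1,
                MPI_UNSIGNED_LONG, MPI_MAX, dup);

  // ... connectFullMesh(...) would use `dup` here ...

  MPI_Comm_free(&dup); // release the duplicate when done
}
```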
30a9cf7a46 Mark transport pair after IO error and propagate to calling threads
Summary:
This change solidifies IO error handling between threads and successive transport API calls. When an IO exception occurs, signal all buffers of the error, propagating the exception from the device thread or single user thread onto all user threads. Store the exception in the pair and check on future API calls or device events. Swallow all IO exceptions in the device loop.

Right now IO exceptions during portions of the listen/connect phase will result in an indefinite wait in the peer. I will address this with a configurable timeout (t16205269).

Reviewed By: pietern

Differential Revision: D4749248

fbshipit-source-id: c75ee3b20875d561bf84631e5384e28015dabad3
2017-03-22 09:06:24 -07:00
714b2b8bf6 Merge pull request #453 from apaszke/lookup_renorm
Cast accumulator in LookupTable renorm to accreal
2017-03-22 11:53:41 -04:00
fe4bd5066b Added support for multidimensional tensors in PReLU; Channel number now in second dimension 2017-03-22 11:45:02 -04:00
b9aef6bc03 Fixing default values for LR and Epsilon (#895)
It seems that the default values for LR and Epsilon (previously, 1E-2 and 1E-38 respectively) were different from the ones recommended by the authors (2E-3 and 1E-8, respectively). Other packages such as Keras (https://github.com/fchollet/keras/blob/master/keras/optimizers.py#L474) and Lasagne (https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py#L612) use the suggested values as well.
2017-03-22 11:34:39 -04:00
0056b08834 Narrow V when returning only some right singular vectors 2017-03-22 08:33:03 -07:00
bd0df61bb5 Cast accumulator in LookupTable renorm to accreal 2017-03-22 08:29:39 -07:00
d9678c2e34 Correct typo in batchnorm documentation 2017-03-22 13:55:45 +01:00
b3c0aa3b7d fix a typo in ffi doc (#1055) 2017-03-21 15:37:48 -05:00
8fc9c79287 Add nccl submodule 2017-03-21 17:53:58 +00:00
4fce1a389f Include CUDA support in CMake build
Summary:
* Pull in NCCL submodule
* Include (heavily modified) CUDA/NCCL build files from [Caffe2](https://github.com/caffe2/caffe2)
* Build CUDA enabled benchmark/test
* Enable CUDA build in Travis configuration
Closes https://github.com/facebookincubator/gloo/pull/16

Differential Revision: D4746784

Pulled By: pietern

fbshipit-source-id: b5c6cbcd8ac8b30c071851cdc7ae88c69c0ab4d6
2017-03-21 10:51:57 -07:00
8ce56c30d4 Convert runtime errors to gloo exceptions
Summary:
Bubble up gloo configuration and network errors as exceptions. The caller may be able to recover. Other unexpected failures continue to be handled as fatal with GLOO_ENFORCE

Modify ibverb API validation to check for != 0 instead of -1 to conform with API definition.

Still need to convert some errors in the rendezvous code and add documentation.

Will pass device loop errors onto the calling thread in a future diff

Reviewed By: pietern

Differential Revision: D4730362

fbshipit-source-id: c801adb353013e7f541ab01ac16a0cc71c1c36b2
2017-03-20 13:50:29 -07:00
4667f936e3 Add explicit dependency on pthreads
Summary:
Got linker errors on Ubuntu 16.04 (not on 14.04).
Adding the pthreads dependency explicitly fixes it.
Closes https://github.com/facebookincubator/gloo/pull/15

Differential Revision: D4739081

Pulled By: pietern

fbshipit-source-id: 6bae7d361d934e93560d28a76c3dca4a4236f113
2017-03-20 11:52:41 -07:00
4eaa30b634 Build tweaks
Summary:
* Mention submodules in README
* Remove fetch.sh from third-party directory
* Rename benchmark/test build targets
Closes https://github.com/facebookincubator/gloo/pull/14

Differential Revision: D4739077

Pulled By: pietern

fbshipit-source-id: 859c1cac0c0163870eae8f18e4e2f177a6bc8890
2017-03-20 11:35:19 -07:00
77fbc12f23 Fix some deadlocks when torch_shm_manager is not found (#1030)
- Add additional timeouts to test_multiprocessing to reduce chances of
   hanging indefinitely on failure
 - Add missing header guards
 - Fix typo
 - Check that torch_shm_manager exists in torch/__init__.py
2017-03-17 18:28:39 -04:00
7e46eb1613 Fixes for Prod and Expand functions (#1026)
Thanks to @ChangYong-Oh for the original implementation.
2017-03-17 18:24:44 -04:00
821656d2d8 add CONTRIBUTING document 2017-03-17 07:59:37 -04:00
86e40ed875 Fix a typo in docs about pinned memory buffers (#1023)
* remove misleading guide for BCELoss

* fix docs about pinned memory buffers
2017-03-17 05:08:03 -04:00
1d0699e147 Define exception hierarchy
Summary: Define an exception hierarchy for gloo runtime errors. Keep GLOO_ENFORCE macros for assertions.

Reviewed By: pietern

Differential Revision: D4724124

fbshipit-source-id: 22f0581b06524579e86fe335770bdb620d20e258
2017-03-16 15:08:01 -07:00
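A hedged sketch of what such a hierarchy typically looks like (class names are illustrative, not necessarily Gloo's exact ones):

```
#include <stdexcept>
#include <string>

// Base for recoverable gloo runtime errors.
struct Exception : std::runtime_error {
  explicit Exception(const std::string& msg) : std::runtime_error(msg) {}
};

// Configuration errors the caller may be able to correct.
struct InvalidOperationException : Exception {
  using Exception::Exception;
};

// Network/transport errors the caller may be able to retry.
struct IoException : Exception {
  using Exception::Exception;
};
```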
b9379cfab7 Use cuDNN and NCCL symbols from _C library (#1017)
This ensures that we use the same library at the C++ level and with
Python ctypes. It moves the searching for the correct library from
run-time to compile-time.
2017-03-16 16:10:17 -04:00
f0b75c4aa4 Merge pull request #729 from shenxiul/cuda_linspace
linspace and logspace for CUDA Tensors
2017-03-16 14:03:00 -04:00
7654b3f49e Add function to compute cross_entropy for 2D image (#802) 2017-03-16 17:34:04 +01:00
37ebbc2809 the length of any item in padded_sequence should be greater than 0 (#1013) 2017-03-16 17:32:43 +01:00
8241cd7b6e Fix compilation error when compiling with 'clang -x cuda'.
Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.
2017-03-16 12:01:11 +01:00
a7781fdebc Use default Redis port in RedisStore constructor
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4718573

fbshipit-source-id: c0b9aa78cf1f4db910526841c0172537b9243f7e
2017-03-15 22:19:51 -07:00
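The change amounts to defaulting the constructor's port argument to Redis's standard port, 6379; a minimal sketch (class name illustrative):

```
#include <string>

class RedisStoreSketch {
 public:
  // Port defaults to the standard Redis port.
  explicit RedisStoreSketch(std::string host, int port = 6379)
      : host_(std::move(host)), port_(port) {}

 private:
  std::string host_;
  int port_;
};
```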
29ddbc3e37 implement linspace, logspace and range in CUDA 2017-03-15 20:50:30 -07:00
16a133ed9a Fixes for testing on FB infra (#1009)
- make each test in test_autograd have a unique name ignoring case
 - assemble all tests when test_legacy_nn is imported
 - import Python.h in PtrWrapper.h
2017-03-15 18:37:11 -04:00
1aa665f6a8 Documentation
Summary:
* Add separate file for rendezvous docs
* Mention using MPI for rendezvous
* Fix algorithm docs formatting
Closes https://github.com/facebookincubator/gloo/pull/13

Differential Revision: D4715442

Pulled By: pietern

fbshipit-source-id: 0469ab8d16fd489a38c399ec2b25860d1225ce72
2017-03-15 14:58:51 -07:00
c4d1318662 Fix map_location in torch.load (#1006) 2017-03-15 16:54:19 -04:00
379ae6d865 Refactor out dispatchStateless (#1007)
Some of the error messages were incorrect due to erroneous
'tensor == THPDefaultTensorClass' checks
2017-03-15 16:24:55 -04:00
24376ff9d3 Merge pull request #723 from killeent/scan-primitive
add implementation of inclusive scan via upsweep-downsweep
2017-03-15 14:37:21 -04:00
6ac793dcbe Reuse ncclComm_t across algorithm instances
Summary: Initializing ncclComm_t is expensive. Allocate a set of ncclComm_t for each unique device set and cache them for reuse. With this change the CudaAllreduceChunked test runtime improved from ~170 sec to ~10 sec on my machine. There is no improvement in the benchmark numbers because the algorithm instance is only allocated once.

Reviewed By: pietern

Differential Revision: D4708943

fbshipit-source-id: 85b85070586d6683a762b8282df593ca831e7bc7
2017-03-15 09:51:43 -07:00
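A sketch of the caching idea (illustrative only; the real cache holds initialized NCCL communicators and handles synchronization): key the cache by the sorted device set so identical device sets share one entry.

```
#include <algorithm>
#include <map>
#include <memory>
#include <vector>

struct CommSet {}; // stands in for a set of initialized ncclComm_t

std::shared_ptr<CommSet> getCachedComms(std::vector<int> devices) {
  static std::map<std::vector<int>, std::shared_ptr<CommSet>> cache;
  std::sort(devices.begin(), devices.end()); // canonicalize the key
  auto it = cache.find(devices);
  if (it == cache.end()) {
    // First use of this device set: pay the initialization cost once.
    it = cache.emplace(devices, std::make_shared<CommSet>()).first;
  }
  return it->second;
}
```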
e00d9c1fd8 Execute benchmark through mpirun
Summary:
This change includes CMake changes to compile the MPI assets when the USE_MPI flag is enabled. If so, the benchmark tool can now be launched through mpirun.

Includes the changes done in #11.
Closes https://github.com/facebookincubator/gloo/pull/12

Reviewed By: Yangqing

Differential Revision: D4712060

Pulled By: pietern

fbshipit-source-id: 0d0e93882f5822583f59304d4256dbdf5dea7483
2017-03-15 08:21:12 -07:00
be6322e4b5 Update nn.init docstrings to correctly reference the module (#1001) 2017-03-15 11:17:59 -04:00
62063b2f62 Fix docs for pointwise ops (#845) (#985)
* add torch.nn.init docs to the source folder
2017-03-15 11:08:05 -04:00
13b1580613 add F.pad to docs 2017-03-15 00:09:14 -04:00
fe788f5003 Use correct event to synchronize destination buffer in NCCLElement
Summary: NCCLOp::runNCCL is mistakenly recording an event in the source pointer after the NCCL op. This results in NCCLOp::wait() returning without synchronizing with the output buffer. The synchronous tests using NCCL fail.

Reviewed By: pietern

Differential Revision: D4708860

fbshipit-source-id: 0c36511e260b587d410e5c9604552ceedd06d988
2017-03-14 19:20:59 -07:00
e50a1f19b3 Use streams in scatter to overlap copy with compute 2017-03-14 22:46:07 +01:00
e86db387ba Fix conv1d backward segfault (#999) 2017-03-14 16:15:53 -04:00
1bf61b8adc Add googletest submodule 2017-03-14 03:39:54 +00:00
704ee3ca68 Use cudart symbols from the main program.
Our extension library links against cudart and pulls in the symbols. Use
LoadLibrary(None) to use the same symbols as the _C extension.

This fixes the PyTorch wheel when you don't have system CUDA installed.
2017-03-13 19:45:34 -04:00
9004652c7b updated the documentation to remove the unnecessary copying of grads when using multiprocessing 2017-03-13 19:04:17 -04:00
aca6ce984c change lookup table sort 2017-03-13 13:55:16 -07:00
ed8773f7bd add legacy_serialized.pt to gitignore 2017-03-13 16:37:35 -04:00
0f7b7b27b1 Fix build for CMake 2.8.12
Summary:
This is the minimum required CMake version (also the version that is available on Ubuntu Trusty (14.04)).
Closes https://github.com/facebookincubator/gloo/pull/9

Reviewed By: Yangqing

Differential Revision: D4698659

Pulled By: pietern

fbshipit-source-id: bf01541fe485c03e7c665f175c2887feaf9516a3
2017-03-13 13:06:15 -07:00
48f48b6ff2 fix more flaky VolumetricMaxPooling tests 2017-03-13 14:38:27 -04:00
615b27eadf fix corner case in SetItem of Variable 2017-03-13 14:38:27 -04:00
86ede33035 CMake improvements for Gloo
Summary: Install headers and add .. to include directories

Reviewed By: pietern

Differential Revision: D4695500

fbshipit-source-id: f48a49f03e575408829793cb63bfdb16d8e3a309
2017-03-13 11:06:05 -07:00
bd09055207 Synchronize all NCCL ops with shared per-device streams
Summary:
Allocate a set of per-device streams used to serialize NCCL op scheduling. These ensure concurrent NCCL ops are not interleaved across devices (e.g., through priority scheduling), which would result in deadlock.

Synchronize source and destination streams with NCCL streams.

Reviewed By: pietern

Differential Revision: D4685360

fbshipit-source-id: 3c228b195b0a0d9d7cccc720163898d344a5ed4c
2017-03-13 09:20:05 -07:00
4bd220d91a Travis contbuild scripts and cmake fix.
Summary:
TSIA. Redoing #7 to kick travis.
Closes https://github.com/facebookincubator/gloo/pull/8

Reviewed By: Yangqing

Differential Revision: D4697132

Pulled By: pietern

fbshipit-source-id: d03148aeddb2cf927b4ef3689c97d9ba4f4cdc9d
2017-03-13 08:36:10 -07:00
170d790b66 fix doc of conv3d in conv.py (#989)
the second dimension should be height.
2017-03-13 11:30:13 -04:00
e216f557fd Fixes issue returning strings from a Dataloader with pin_memory=True (#908) 2017-03-13 10:11:07 +01:00
997312c233 Add WeightedRandomSampler (#980)
Samples elements from `[0,..,len(weights)-1]` with the given probabilities (weights). So far there is no means either to introduce sample weights in loss functions or to weight samples drawn from a dataset. This is an attempt to add the functionality for the latter.
2017-03-13 00:27:05 -04:00
d602b3a834 Allow submodules and parameters to shadow attrs on assignment 2017-03-12 13:31:32 -04:00
f531d98341 Fix memory leak in torch.from_numpy 2017-03-12 13:31:32 -04:00
6bdd5ecaf5 Remove some unnecessary AutoGPU calls 2017-03-12 13:31:32 -04:00
bfbde9d6eb Fix Embedding bug when max_norm was used 2017-03-12 13:31:32 -04:00
b9c816a796 Fix run_test.sh --coverage option. (#983) 2017-03-11 19:26:02 -05:00
2f5c215d34 Update setup.py (#981)
Adding `description` to `setup.py`
2017-03-11 12:14:07 -05:00
01650ac9de add torch.nn.init docs to the source folder (#979) 2017-03-11 10:11:30 -05:00
ce536aa355 fix example in docs for NLLLoss 2017-03-10 16:48:08 -05:00
fc0af33a18 key only block-wide bitonic sort 2017-03-10 11:50:43 -08:00
c7c4778af6 modify docs of broadcast to fix issuse #940 (#970) 2017-03-10 09:54:43 -05:00
d873077349 Create context from existing MPI communicator
Summary:
This makes it easy to use Gloo transports and algorithms in existing
MPI environments.

Reviewed By: andrewwdye

Differential Revision: D4685999

fbshipit-source-id: cfc7d0e445893512b4e4ed2abe1bb280d83b9c70
2017-03-09 23:06:18 -08:00
0c38827318 Split out rendezvous specifics from context
Summary:
How pairs are setup and connected to one another is specific to
whatever underlying rendezvous mechanism is used. This change moves
the `connectFullMesh` function into a subclass in the `rendezvous`
directory. This prepares for a separate MPI context that can setup
pairs between processes using an existing MPI communicator.

Reviewed By: andrewwdye

Differential Revision: D4684755

fbshipit-source-id: 9eb643b8ba545b3e6f9a36b65642b3b04a5f0077
2017-03-09 23:06:18 -08:00
fb766c00b3 Align async/wait pattern to use wait() naming
Summary: TSIA

Reviewed By: pietern

Differential Revision: D4686783

fbshipit-source-id: ccbdace0d53219bd4b881ea27f7f972b206215b6
2017-03-09 21:20:45 -08:00
e600c9830a Fix up NCCLElement construction in CudaBroadcastOneToAll
Summary: TSIA

Reviewed By: pietern

Differential Revision: D4686520

fbshipit-source-id: 657ca90aa1971be152b037563105a9f490137a69
2017-03-09 20:37:03 -08:00
73a65cd29f simple ordering fix to avoid gcc warning 2017-03-09 17:10:59 -08:00
b785ed0ac0 Fix Embedding and CosineEmbeddingLoss on non-float CUDA (#965) 2017-03-09 18:04:40 -05:00
b2d077d81d Update _tensor_docs.py (#966) 2017-03-09 18:04:19 -05:00
4814b0bc09 Recompose NCCLElement of src/dst CudaDevicePointers
Summary: CudaDevicePointer has the information we need for a NCCL op. Refactor NCCLElement as a composition of src and dst CudaDevicePointers. This allows for separate streams for src and dst, and will simplify a future change to use a static set of streams for all NCCL ops.

Reviewed By: pietern

Differential Revision: D4679483

fbshipit-source-id: 75656cc2fa5b5e2a6c096d914d2111769a47291b
2017-03-09 12:26:55 -08:00
b1c2714ad5 Add momentum and centered options to RMSProp (#810)
* add momentum and centered options

Add two options:
- Momentum (like SGD's momentum)
- Centered RMSprop, as in Graves 2013 ( https://arxiv.org/abs/1308.0850 ): the gradient is normalized by a running estimate of its variance

* some PEP8

* bug in default

* bug2

* sign mistake

* alloc of momentum & centered only if needed

* add link to docstring

* some pep8 on docstring

* implement __setstate__() for backward compatibility

* correct grammar mistake

* multiply by lr when adding delta to params

* rename momentum variables

* change __init__ params order
2017-03-09 10:04:32 +01:00
a462edd0f6 Docs(RNN|GRU|LSTM): Note dropout applies to all layers *except* the last layer (#961)
This is an important clarification to make: otherwise users are misled as to where they may need to add dropout, and to clarify the situation they would need to delve into the backend implementation.
4647f753bc/torch/nn/_functions/rnn.py (L73)
2017-03-08 18:09:11 -05:00
c2425fc9a1 Fix build warning for C file 2017-03-08 21:28:57 +01:00
fbcedf2da2 Merge commit '3d95e13b332e1b31d706b59c3b67f886958ece79' 2017-03-08 09:09:46 -08:00
3d95e13b33 Check event_count before merging blocks 2017-03-08 08:49:04 -08:00
228e1a8696 Add CUDA caching allocator accessor 2017-03-08 08:29:50 -08:00
be0e8c0009 Use sequential slot numbers from context
Summary:
Add a nextSlot() function to the context that increments and
returns a slot number. This enables multiple algorithms to share the
pairs of a context. The slot numbers were hardcoded before this
change, which prevented reuse.

After this change, some of the tests can be changed to run multiple
times (or do a parameter sweep) without respawning a new threadpool or
allocating new fixtures.

Also change some internally used variable names for more consistency.

Reviewed By: andrewwdye

Differential Revision: D4668268

fbshipit-source-id: 65cbc8f2666f0b7d2f1c72574b86d913f5855d62
2017-03-08 08:23:03 -08:00
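The core of the change is tiny; a sketch (whether the real counter is atomic is an assumption here):

```
#include <atomic>
#include <cstdint>

class ContextSketch {
 public:
  // Hand out a fresh slot per call so algorithms sharing the same
  // pairs never collide on hardcoded slot numbers.
  uint64_t nextSlot() { return slot_++; }

 private:
  std::atomic<uint64_t> slot_{0};
};
```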
3fa8a3ff46 add implementation of inclusive scan via upsweep-downsweep 2017-03-08 07:34:14 -08:00
4647f753bc Merge commit '0f872ed02fbaf5b326f235b3f18724171b061416' 2017-03-07 14:45:01 -08:00
7ba5e7cea1 fix VolumetricMaxPooling test instability (#952) 2017-03-07 10:55:46 -05:00
9b626a8047 Fix documentation - replace 'matrix' with 'vector' (#951) 2017-03-07 10:40:18 -05:00
bd0e9a73c7 Fix some simple build error on MacOS (#949)
Issue #948

Signed-off-by: Zhou Chang <achang.zhou@gmail.com>
2017-03-07 09:47:49 -05:00
7bddd586f7 Change PrefixStore to take a Store reference
Summary:
Taking ownership of a std::unique_ptr is a bit awkward. It's actually
useful to reuse the underlying store and create multiple prefix stores
against it.

Reviewed By: andrewwdye

Differential Revision: D4662354

fbshipit-source-id: eaf62f7d5a97d6ee848252ff3124c28da349f6f2
2017-03-06 22:19:49 -08:00
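A sketch of the ownership change with simplified types: the prefix store borrows the underlying Store by reference, so several prefix stores can share one backing store.

```
#include <string>
#include <utility>
#include <vector>

struct Store {
  virtual void set(const std::string& key,
                   const std::vector<char>& value) = 0;
  virtual ~Store() = default;
};

class PrefixStoreSketch {
 public:
  PrefixStoreSketch(std::string prefix, Store& store)
      : prefix_(std::move(prefix)), store_(store) {}

  void set(const std::string& key, const std::vector<char>& value) {
    store_.set(prefix_ + "/" + key, value); // namespace the key
  }

 private:
  std::string prefix_;
  Store& store_; // borrowed, not owned
};
```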
da10450535 Allow multiple input pointers to broadcast algorithms
Summary:
This changes the constructor prototype of the broadcast algorithms.
They now take the rank of the root process and the rank of the root
pointer. The root process now also broadcasts locally, among the
specified pointers, in addition to broadcasting to its peer processes.

The broadcast tests are made more robust to use a different value at
every index for every buffer, like the allreduce tests. To accommodate
multiple input buffers for CPU-side algorithms, I added a Fixture
helper, and renamed the existing Fixture class to CudaFixture.

The broadcast tests contain a few TODOs since they don't vary the root
process or root pointer yet. I anecdotally verified this does work,
but didn't want to include the necessary changes to do so in this
commit (it requires some changes in rendezvous and NCCL code). A fix
for this is forthcoming.

Reviewed By: andrewwdye

Differential Revision: D4661635

fbshipit-source-id: c069e0d4e8f676a63efd74b15ea1156adcc09477
2017-03-06 22:19:49 -08:00
2b1cd919ce Update extending.rst (#933) 2017-03-06 23:23:14 -05:00
8e46a15605 add docs for set_printoptions to sphinx (#945) 2017-03-06 21:52:37 -05:00
15a9fbdedb Merge pull request #881 from colesbury/parallelize_backwards
Parallelize autograd backwards
2017-03-06 16:57:19 -05:00
6336300880 Fix bug where adding a hook could replace an existing hook.
We were keying hooks by RemovableHandle id. However, we don't hold onto
handles, and ids of dead objects can be reused. This replaces id(handle)
with a global counter.
2017-03-06 12:47:53 -08:00
5073132837 Implement 'pre' and 'post' hooks at the C++ autograd level 2017-03-06 12:47:53 -08:00
65b66264d4 Improve broadcast/reduce performance by coalescing tensors 2017-03-06 12:47:53 -08:00
0f872ed02f Add THCCachingAllocator_recordStream()
This is similar to THCCachingHostAllocator_recordEvent() but on CUDA
allocations. It's useful for overlapping copies with computation. The
workflow is approximately:

  0. allocate dst tensor on copy stream
  1. copy from CPU to GPU on copy stream
  2. synchronize the main stream with the copy stream via
     cudaStreamWaitEvent
  3. THCCachingAllocator_recordStream(dst, main_stream)

The recordStream() call is necessary to prevent the dst tensor from
being reused on the copy stream before the main stream finishes work.

Previously, you would need to insert a second cudaStreamWaitEvent before
dst is freed to force the copy stream to wait on the main stream.
2017-03-06 10:50:19 -08:00
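The workflow above, sketched with plain CUDA runtime calls (step 3 is shown as a comment, since THCCachingAllocator_recordStream is internal to THC; dst is assumed already allocated per step 0):

```
#include <cstddef>
#include <cuda_runtime.h>

void overlappedHostToDevice(void* dst, const void* src, size_t bytes,
                            cudaStream_t copyStream,
                            cudaStream_t mainStream) {
  // 1. copy from CPU to GPU on the copy stream
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, copyStream);

  // 2. make the main stream wait for the copy via cudaStreamWaitEvent
  cudaEvent_t copyDone;
  cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);
  cudaEventRecord(copyDone, copyStream);
  cudaStreamWaitEvent(mainStream, copyDone, 0);
  cudaEventDestroy(copyDone);

  // 3. THCCachingAllocator_recordStream(dst, mainStream) goes here:
  //    it keeps the allocator from reusing dst on the copy stream
  //    until work queued on mainStream completes.
}
```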
761d6799be code syntax error in document (serialization.rst) (#937) 2017-03-06 10:06:04 -05:00
0d179aa8db Updated datasets.rst, combined all commits (#931)
Added MNIST in the docs

Updated incomplete cifar doc

Updated the datasets.rst to include all datasets
2017-03-05 17:38:28 -05:00
5b171ad7c2 remove misleading guide for BCELoss (#924) 2017-03-05 14:31:01 -05:00
ac9245aeb3 import numpy before setting dlopen flags (#928) 2017-03-05 14:30:13 -05:00
60736bdf99 fix corner case in kwargs for DataParallel (#930) 2017-03-05 14:27:52 -05:00
7d58765cee docs: Fixed example code bug in extending module doc. 2017-03-05 12:09:08 -05:00
76f7d749e4 bump version 2017-03-05 08:49:52 -08:00
0b7374eb44 add THCS to build_all flags 2017-03-05 11:32:43 -05:00
6fff764155 replace old select_compute_arch.cmake with new 2017-03-05 11:32:43 -05:00
8ced72ccb8 link THPP to THCS when CUDA available 2017-03-05 11:32:43 -05:00
b1ae7f90d5 Added functionality for data parallel table (#843) 2017-03-05 02:35:46 +01:00
8b61ee522e Merge commit 'aec182ae72d51dad0f46cdfe7ff9a41380d7da35' 2017-03-04 08:58:21 -08:00
76ca3eb191 Merge commit 'fea50a51ee2d9af15c42f785ab2232469357b557' 2017-03-04 08:58:02 -08:00
fea50a51ee reintroduce USE_AVX* for files which dont have -mavx* set 2017-03-04 08:55:43 -08:00
51e589ed73 fix critical bug in adds SSE implementation 2017-03-04 08:39:19 -08:00
2e87643761 remove fastmath for everything except simd/convolve 2017-03-04 08:16:47 -08:00
ba9a85f271 fix bug introduced in #952 2017-03-03 21:00:05 -08:00
a22fd7194e More assertions for state change in TCP transport
Summary:
I have seen a stress run crash with unexpected state. Adding these
assertions will give more information when it happens again.

```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at gloo/transport/tcp/pair.cc:407] false. Unexpected state: 5
```

Reviewed By: andrewwdye

Differential Revision: D4652216

fbshipit-source-id: e787f4097f5ab32367dd9fa5a336d0389b97e955
2017-03-03 14:20:07 -08:00
0714d7a3ca set AVX/AVX2 flags only for specific files 2017-03-03 12:17:14 -08:00
fb7bafdd0f Update README.md
Summary:
Fix styling in README
Closes https://github.com/facebookincubator/gloo/pull/4

Differential Revision: D4651501

Pulled By: pietern

fbshipit-source-id: e2d4384ac94972f6c4fc03467564460ea4ce5c85
2017-03-03 11:40:02 -08:00
34ce58c909 Parallelize backwards 2017-03-03 11:26:00 -08:00
c238ee3681 Fix issues with lazy grad initialization (#912) 2017-03-03 14:23:51 -05:00
e1d7eaf7d8 Latency optimization tips
Summary: Closes https://github.com/facebookincubator/gloo/pull/3

Differential Revision: D4651203

Pulled By: pietern

fbshipit-source-id: 202afcbe26ec77ea93e48e72fea0d36f18b1b026
2017-03-03 11:05:17 -08:00
f5338a1fb8 compile AVX and AVX2 intrinsic code in separate files. Cleanup use of USE_AVX and USE_AVX2 macros in favor of __AVX__ and __AVX2__ 2017-03-03 10:30:18 -08:00
d96ad41191 cleanup TH CMakeLists and THGeneral.h of unused flags 2017-03-03 09:48:26 -08:00
f17cfe4293 sparse tensor operations (#735) 2017-03-03 18:37:03 +01:00
aec182ae72 Support half precision in baddbmm 2017-03-03 16:15:39 +01:00
c93c884ee2 Add negative dimension to transpose and tests (#792) 2017-03-03 09:31:22 -05:00
c42a2d4d24 Fix dimension check for cat (#959)
* Use TH_INDEX_BASE when verifying dimension for cat

* Adding tests for cat when no dimension is specified.

- Also renamed ldimension to cat_dimension to be more specific.
2017-03-03 09:05:06 -05:00
f89252c336 Merge pull request #719 from twitter-forks/cat-fix
Fixes to cat
2017-03-03 09:04:06 -05:00
490c15fae9 Fix slicing with step (#905) 2017-03-03 09:00:14 -05:00
7e3b572ca7 Document algorithm semantics
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4647587

fbshipit-source-id: a804e7479e6e2f511bfa59712b4b4a88bdf657e3
2017-03-02 21:35:28 -08:00
5fbcd88102 Rename public member fields on gloo::Context
Summary:
The fields are public so their names should not end with an
underscore.

Reviewed By: andrewwdye

Differential Revision: D4645038

fbshipit-source-id: c12b47affbe511383a4722717a06abb61918473b
2017-03-02 19:49:45 -08:00
f2d72ba10f Revert "make handles to be thread-local"
This reverts commit 0720ba53b344809ce3d0bdfb1ea561afa5fe0646.
2017-03-02 17:48:24 -08:00
2108b42b92 Fix bug in cat when dimension is not specified.
- Code was using the specified dimension, which was negative
- Changed the cat_dimension variable to be more explicit
- Fixed code to use the cat_dimension variable
2017-03-02 16:14:09 -08:00
bae8df62d3 Add missing THCudaCheck around cudaMemcpy 2017-03-02 16:13:39 -08:00
a2b2880cc2 Remove underscores from public fields in NCCLContext
Summary: Remove underscores from public fields in NCCLContext

Reviewed By: pietern

Differential Revision: D4645857

fbshipit-source-id: 2c28a1c23d31097d685c0768dad9b99bbef7b171
2017-03-02 16:05:15 -08:00
70fc15c05c More documentation
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4644734

fbshipit-source-id: 50f5fadd2c5cd04e06a025f5538187ed852e669a
2017-03-02 15:50:37 -08:00
98775b6bb4 Merge pull request #718 from killeent/templatize-scan
genericize PrefixSum --> PrefixScan via binary operator template parameter
2017-03-02 17:50:56 -05:00
b7cc2a501f genericize PrefixSum --> prefixScan 2017-03-02 14:31:27 -08:00
0720ba53b3 make handles to be thread-local 2017-03-02 11:10:49 -08:00
ff5fa11129 make mkl link to threaded version with GCC (#958) 2017-03-02 13:37:25 -05:00
837023bb4f Change benchmarks to support multiple input buffers
Summary:
The NCCL code used in CUDA-aware allreduce does local reduction of N
buffers prior to putting anything on the wire. Supporting this in the
benchmark tool to measure the impact under various configurations.

Other minor tweaks in this change:
* Specify sub-second iteration time
* Templatize allreduce benchmarks (the algorithms share a constructor
  prototype)

Reviewed By: andrewwdye

Differential Revision: D4639517

fbshipit-source-id: f7417d3e9f79278a3b1eca48d779f48b77e5260c
2017-03-02 10:16:39 -08:00
e88d241757 Cuda algorithms should return asynchronously if device streams are passed in
Summary: Cuda algorithms take an optional set of device streams to sequence operations. If streams are provided, the algorithms should enqueue final output buffer operations on the associated stream and return asynchronously. Destructors that allocate streams/events should synchronize before tearing down.

Reviewed By: pietern

Differential Revision: D4636447

fbshipit-source-id: 32ec2adc214c83b0b4bc0fff8993ab196459117b
2017-03-02 10:16:38 -08:00
ecb37e4439 Update tests to cover potential reordering problems
Summary:
With this change, every buffer gets assigned a different
value at every index. This means reordering of segments (e.g. in the
chunked algorithm) would surface as test errors.

Reviewed By: andrewwdye

Differential Revision: D4636368

fbshipit-source-id: 464eb1515d1590e12481961d427a92e2ebb3be82
2017-03-02 10:16:38 -08:00
0c88194807 CUDA documentation
Summary: CUDA documentation detailing high-level support for CUDA in gloo algorithms, usage of streams, and synchronizing memory management.

Reviewed By: pietern

Differential Revision: D4633120

fbshipit-source-id: d88e230c8dc82fe48cda0f401b61758fa4f07f2e
2017-03-02 10:16:38 -08:00
50e73a8313 Support synchronous mode in ibverbs transport
Summary:
Synchronous mode means using the calling thread instead of the device
thread for completion handling. Since this saves a context switch in
the critical path, this is very beneficial for low latency algorithms.

For example: the p99 of a 4-way barrier drops from 17us to 4us.

Reviewed By: andrewwdye

Differential Revision: D4626948

fbshipit-source-id: 013b1680497589fe5ad0bca38600bce6a410200b
2017-03-02 10:16:38 -08:00
fc7f026980 Refactor ibverbs transport to prepare for sync mode
Summary:
All pairs created by a device would use the same completion queue.
Supporting sync mode that way is difficult, as there is no way to
filter completions for a particular pair. This change refactors this
to use a single completion queue per pair so that this is no longer an
issue. This change is a preparation for supporting synchronous mode
(where the calling thread itself will poll the ibv library for
completions instead of the device thread).

This change also includes a refactoring of the way transient memory
regions are handled so that they are properly deregistered and
deallocated when no longer needed.

Reviewed By: andrewwdye

Differential Revision: D4625146

fbshipit-source-id: 21bf5ab321534fbd5c03f12049c10fc67da68944
2017-03-02 10:16:38 -08:00
9f18f83375 Downcase setMutex
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4626965

fbshipit-source-id: 2d32b07182202f65e673795aefacc6cc991d3c7c
2017-03-02 10:16:38 -08:00
9c114e6f1c Fix compile error
Summary: std::atomic was not defined for cuda.cu.

Reviewed By: andrewwdye

Differential Revision: D4624611

fbshipit-source-id: 973bba10026e065667d6a576055d00505ee02d62
2017-03-02 10:16:38 -08:00
0e78a59610 add mutex getter/setter to synchronize CUDA and NCCL ops
Summary: Allow gloo consumers to assign a mutex to synchronize CUDA malloc/free and NCCL operations.

Reviewed By: pietern

Differential Revision: D4622135

fbshipit-source-id: 60acd7c01a677a0df5415fe38e6ef5a2e7c8606a
2017-03-02 10:16:38 -08:00
5e7f5db332 add subset samplers (#888) 2017-03-02 09:26:10 -05:00
b5f7592140 boolean mode in module.train 2017-03-02 09:18:05 -05:00
f366e5fc81 Support int16 numpy conversions
issue #891
2017-03-02 09:15:57 -05:00
48f087f6ce C99 cleanup broke MSVC (#952)
* __pragma for MSVC.
2017-03-02 08:57:28 -05:00
7fef264bfa Bumping version to 1.3.3 2017-03-01 16:44:27 -08:00
8996811936 Only enable peer access for ring neighbors.
This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
2017-03-01 16:42:38 -08:00
c219a183d0 Fix copy/paste typo in error message 2017-03-01 16:42:38 -08:00
8e1d6f9b60 Fix crash in Reduce when non-root ranks have invalid recvbuff 2017-03-01 16:42:38 -08:00
7ad948ffa9 fix tests to not sys.exit(), also fix fatal error on THC initialization 2017-03-01 17:37:04 -05:00
3277d83648 Add Nesterov Momentum (#887) 2017-03-01 20:49:59 +01:00
1487278fdf Allow backprop through cuDNN RNN in eval mode
Handling of dropout descriptors has been improved too.
2017-03-01 19:42:39 +01:00
977630bc15 Handle duplicate backward roots in autograd 2017-03-01 19:42:39 +01:00
12efd53dba ConstantPad2d and F.pad (#856) 2017-03-01 19:39:44 +01:00
37e05485d9 added initialization schemes in torch.nn.init (#833) 2017-03-01 19:34:13 +01:00
c76770f40e Merge commit 'dfca8dfdc5988813ed5673589ffa4fdd1c4f3d2d' 2017-03-01 09:29:51 -08:00
da725830c2 Add support for variable length sequences in RNNs (#873) 2017-03-01 17:36:32 +01:00
fc6fcf23f7 Lock the cudaFree mutex. (#880)
Prevents NCCL calls from overlapping with cudaFree() which can lead to
deadlocks.
2017-03-01 11:29:25 -05:00
b190f1b5bc Add another pinned memory test.
Checks that pinned memory freed on a different GPU from which it was
allocated isn't re-used too soon.
2017-03-01 12:22:31 +01:00
dfca8dfdc5 ensure valid index in multinomial 2017-02-28 14:48:48 -08:00
b46d5e0b04 Fix NN bindings 2017-02-28 14:35:38 -08:00
f19a11a306 Merge commit '8e8022b7351401911e10b94aeb5ae35d32907705' 2017-02-28 14:35:20 -08:00
cfcf69703f Merge commit '80429ad9f7c4775f7f88344a2cf037e499f060b8' 2017-02-28 14:35:00 -08:00
e22b8e0d17 Merge commit '3cc89afde68a831434f3abe9e3af2ac0b134215e' 2017-02-28 14:34:44 -08:00
fbfba6bdca Merge commit '6ff77503645da59eeca5be473a1902e523c4adb3' 2017-02-28 14:34:29 -08:00
3cc89afde6 Merge pull request #713 from killeent/multinomial-indexing-fix
fix indexing bug in sampleMultinomialOnce
2017-02-28 17:13:44 -05:00
1e4aee057c Merge pull request #712 from killeent/multinomial-fixes
Fix sampleMultinomialOnce to better handle large distribution values
2017-02-28 17:12:48 -05:00
8dfcf7e35a Merge pull request #709 from colesbury/pinned_memory
Fix bug where pinned memory event could be recorded on incorrect device
2017-02-28 16:56:21 -05:00
76de151ddd Fix bug where pinned memory event could be recorded on incorrect device 2017-02-28 13:48:56 -08:00
2676cc46c2 fix indexing bug in sampleMultinomialOnce 2017-02-28 13:40:15 -08:00
1bf7bc9768 refactor sampleMultinomialOnce to use <real, accreal>, assertion for sum overflow 2017-02-28 12:46:12 -08:00
3c41c9fe46 Add AutoGPU RAII that doesn't depend on Python API (#875)
Separates out non-Python part of AutoGPU. This also compiles without
CUDA which is useful for generic tensor code.

Also fixes a bug where THCPAutoGPU may not always switch the device:

  THCPAutoGPU guard(-1);
  guard.setDevice(0);
  guard.setDevice(1);
  guard.setDevice(0);  // would not switch back to 0
2017-02-28 14:39:20 -05:00
6ff7750364 add TH_TENSOR_APPLY variants for optimized redux (+refactor) 2017-02-28 10:30:31 -08:00
4d25c3d048 address comments and add tests 2017-02-28 10:23:36 -08:00
267b7ade50 Speed up reductions on non-contiguous dimensions 2017-02-28 10:23:36 -08:00
5ca6516ecb THVector_(add),(mul),(div) -> (adds),(muls),(divs) 2017-02-28 12:10:47 -05:00
67f94557ff Expose torch.HalfTensor 2017-02-27 19:35:47 -05:00
61bd5a0643 [Lint] Address F811 2017-02-27 19:33:00 -05:00
748d011c8b [Lint] Address F812 2017-02-27 19:33:00 -05:00
5d5cfe2e57 [Lint] Address E731 2017-02-27 19:33:00 -05:00
7cbe255296 [Lint] Use flake8 instead of pep8 2017-02-27 19:33:00 -05:00
4ef303698c Merge pull request #711 from gchanan/getDeviceAllocator
Add getter for cuda device allocator.
2017-02-27 19:29:39 -05:00
83e8b3f6c3 Add getter for cuda device allocator. 2017-02-27 15:44:44 -08:00
502ebed796 Fix one more reference cycle and ensure correct flag propagation (#868) 2017-02-27 18:38:29 -05:00
68ff58d771 Expose a mutex that is held around cudaFree() calls.
NCCL can deadlock if cudaFree() is called while it's launching kernels.
This exposes a mutex that can be held to prevent cudaFree() calls in the
caching allocator.
2017-02-27 15:08:30 -08:00
969c1602e6 Add Tensor::copy() to THPP
For now, this only supports copying from the same type. We can add
polymorphic copying in the future.
2017-02-27 21:33:40 +01:00
2d4d3b18dd Use NCCL operations in AllreduceChunked
Summary: The AllReduceChunked algorithm currently performs the local reduce/broadcast of local device buffers in host memory. This diff updates the algorithm to execute the local reduce/broadcast steps using NCCL operations before copying a single device buffer to/from host memory.

Reviewed By: pietern

Differential Revision: D4587441

fbshipit-source-id: 4de689f59a6cf898b8eecd3c3b9f57f77124c0e3
2017-02-27 09:59:29 -08:00
5e1d6a3691 Update functional.py (#862)
Fixed documentation error in conv3d
2017-02-27 10:42:02 -05:00
533cfc0381 Minor fix of docs of ModuleList and ParameterList (#861) 2017-02-27 10:09:54 +01:00
2b23712dc3 Improve autograd memory usage (#859) 2017-02-26 22:37:26 -05:00
88275da5e8 CUDA documentation tweaks (#858) 2017-02-26 20:37:43 +01:00
bd7a5ad6f0 Make Optimizer.load_state_dict use __setstate__ 2017-02-26 20:02:42 +01:00
1f6f82dbcf Fall back to indexing compatible with numpy 2017-02-26 20:02:42 +01:00
1f8939937a Allow using expand to broadcast tensors 2017-02-26 20:02:42 +01:00
b3d41a5f96 Add docs for ModuleList and ParameterList 2017-02-26 20:02:42 +01:00
fec2d493a9 Reshape grad_output in basic ops 2017-02-26 20:02:42 +01:00
86ee75f63f Fix for Long and Byte tensor indexing of Variables 2017-02-26 20:02:42 +01:00
31941918cf Prevent creation of reference cycles with leaf Variables that don't require grad
Also, raise an error immediately if a leaf that requires_grad is
modified in-place. Some comments were updated too.
2017-02-26 20:02:42 +01:00
19a65d2bea Expose stateless methods for torch.cuda.HalfTensor 2017-02-26 20:02:42 +01:00
819d4b2b83 Add finite differences gradcheck (#851) 2017-02-26 08:35:24 -05:00
b87c113cf4 CUDA documentation enhancement and docs versioning (#848)
* Add more detail to CUDA documentation

Also adds better cross-linking to the pages that discuss relevant topics.

* Adds recommendation to torch.save docs

* Make the version numbers for the docs dynamic

Might need tweaks for beta, 1.0, etc.
2017-02-26 08:33:26 -05:00
b25182971f readme change for getting clarity on binaries 2017-02-26 07:52:13 -05:00
1ee2c47e37 Correcting the description of LSTM attributes (#854) 2017-02-26 13:30:55 +01:00
2dc563f1f1 Fix indexing when passing only an Ellipsis 2017-02-25 23:34:09 +01:00
15ba71a275 Rebase fixes 2017-02-25 17:14:52 +01:00
e5b3fc49d6 Implementation of the 3rd set of tensor functions 2017-02-25 17:14:52 +01:00
ae1766951d Link TH and THPP to THD (#57)
* Fix THD library build

* THPP dependency added

* Minor cleanup; Fix build on OSX
2017-02-25 17:14:52 +01:00
02d08dafd9 Add support for IPv6 in Data Channel TCP (#53) 2017-02-25 17:14:52 +01:00
13a5090695 Added a size change in MaxPool1d module and improved tests (#771) (#832)
The backend is SpatialDilatedMaxPooling, so the 3D input (N*C*L) is
changed to 4D size (N*C*1*L). Output indices will then range from 0 to L,
a range that will not cause the UnMaxPool1D error.

Signed-off-by: Zhou Chang <achang.zhou@gmail.com>
2017-02-25 08:53:30 -05:00
8e32e4c04c make wrap_generic_function importable 2017-02-24 14:27:54 -08:00
cf991310c3 c++ virtual function fix 2017-02-24 13:22:44 -08:00
938706099e adding environment flags to disable SIMD codepaths 2017-02-24 07:35:11 -05:00
3330287dc7 Update dataloader.py (#837) 2017-02-23 14:38:41 -05:00
38c8520adf adding unsqueeze to docs 2017-02-23 12:13:25 -05:00
492e1746af Fix THFree in THTensorApply 2017-02-23 06:01:13 -05:00
91a8109cfd Use C99 for openmp cleanup 2017-02-23 06:01:13 -05:00
161490d34a Add memcpy copy 2017-02-23 06:01:13 -05:00
9c302852eb comments fix 2017-02-23 06:01:13 -05:00
8654fcfd60 THVectorDefault style fix 2017-02-23 06:01:13 -05:00
b3d527d9a0 Tab style fix 2017-02-23 06:01:13 -05:00
4d495218c9 THTensorApply3 contiguous optimizations 2017-02-23 06:01:13 -05:00
13a041284c THTensorApply2 copy optimization 2017-02-23 06:01:13 -05:00
c60c1a003d TH_TENSOR_APPLY2 contiguous optimization 2017-02-23 06:01:13 -05:00
97add1a5ea comment fix 2017-02-23 06:01:13 -05:00
ca02930e47 Fill bug fix 2017-02-23 06:01:13 -05:00
20d5e95077 THTensorApply3 compress counter 2017-02-23 06:01:13 -05:00
eb4a7dc11d THTensorApply change dims to sizes 2017-02-23 06:01:13 -05:00
f722498b72 THTensorApply2 counter compress 2017-02-23 06:01:13 -05:00
aadfb6fe83 THTensorApply reduce memory overhead 2017-02-23 06:01:13 -05:00
6c273594c9 THTensorApply Counter compress 2017-02-23 06:01:13 -05:00
e475c82fa1 Add isTransposed judge and enable multithread of fill functions 2017-02-23 06:01:09 -05:00
0c2e6665df Add AVX copy 2017-02-23 05:50:34 -05:00
6295e6e94b Rebase master 2017-02-23 05:50:34 -05:00
670a4aa708 Fix AVX2 bugs 2017-02-23 05:50:34 -05:00
1bdc2e64ed Add fma cadd 2017-02-23 05:50:34 -05:00
c587be1e50 Add THVector Fill 2017-02-23 05:50:34 -05:00
bd481596f5 optimize THVector add mul div 2017-02-23 05:50:34 -05:00
a504d56b43 Fix THVector cmul AVX bug 2017-02-23 05:50:30 -05:00
91c4dfccea Use THVector cadd AVX 2017-02-23 05:46:44 -05:00
27f618c44d Add THVector Fill AVX 2017-02-23 05:46:44 -05:00
a14482a1df Add THVector cadd AVX 2017-02-23 05:46:40 -05:00
aa50c5734b Add THVector AVX cmul 2017-02-23 05:46:07 -05:00
293001a4fe Add THVector SSE div cdiv 2017-02-23 05:46:07 -05:00
638cfdf150 Add SSE add 2017-02-23 05:46:07 -05:00
5f80a14525 Separate SSE and AVX 2017-02-23 05:46:07 -05:00
1342fd3975 Remove THTensorMathSIMD THTensorMathDispatch 2017-02-23 05:46:07 -05:00
8d4af38489 Add THVector div cdiv 2017-02-23 05:46:07 -05:00
575a064e66 Remove THVector diff 2017-02-23 05:46:07 -05:00
3ab21a3c4f Merge THVector mul AVX 2017-02-23 05:46:07 -05:00
2f592e6c7d Remove THVector scale 2017-02-23 05:46:07 -05:00
5661ffb766 Merge THVector mul 2017-02-23 05:46:03 -05:00
9b74503daa Merge THVector cmul 2017-02-23 05:40:33 -05:00
24848f1cd8 Change THVector mul to cmul 2017-02-23 05:40:33 -05:00
a31a07ede9 Merge THVector add 2017-02-23 05:40:33 -05:00
c8c4c9b23d Change THVector add to cadd and fix NEON 2017-02-23 05:40:33 -05:00
e1ed9303f0 Add multi-thread add 2017-02-23 05:40:33 -05:00
a43aab13c2 Fix THTensorMath.c style 2017-02-23 05:40:33 -05:00
c698b4a45e Add Dispaches for div and mul 2017-02-23 05:40:29 -05:00
c6a0ffab50 Add AVX single float and double float add 2017-02-23 05:40:24 -05:00
8ba7cc30d1 Add THTensorMathSIMD.c 2017-02-23 05:32:34 -05:00
61bf08ca24 Fix compilation for simd tensor add 2017-02-23 05:32:28 -05:00
6ada3c0c16 Fast floating point add kernel in intrinsics (11x speedup over default for 10k elements) 2017-02-23 05:11:44 -05:00
60061fbe79 Fixed up CPU dispatch and tested. Can begin implementing kernels 2017-02-23 05:11:44 -05:00
46e7042add SIMD helper header, modified add in THTensorMath to check dispatch 2017-02-23 05:11:44 -05:00
d0c182773b First commit for dynamic CPU dispatch: general framework in place (need to create dispatch tables and stubs for all functions and make impls have hidden linkage) 2017-02-23 05:11:44 -05:00
b6f60585b5 fix AVX2 detection bugs 2017-02-23 05:00:55 -05:00
4b0e3ee219 Merge pull request #699 from twitter-forks/bitops
Bitwise operations
2017-02-23 04:15:35 -05:00
838842d4b2 fix documentation error. [issue #790](https://github.com/pytorch/pytorch/issues/790) (#831) 2017-02-23 08:59:29 +01:00
e71cf20192 improved serialization (no tar copy) (#713) 2017-02-22 22:24:20 +01:00
adb4cb2b5b contiguous view backward (#816) 2017-02-21 19:09:36 -05:00
478d7446ef CMake fixes
Summary: Adds script to populate third-party directory.

Differential Revision: D4591509

fbshipit-source-id: 28934feb536a9f3a066d8c40988337f3dddffaed
2017-02-21 15:06:45 -08:00
df68230351 README and docs skeleton
Summary: TSIA

Differential Revision: D4591755

fbshipit-source-id: fa435f4ad6b97453c3c9516b4bfc9f8f0fb2e4f1
2017-02-21 10:52:04 -08:00
6073f9b46c update table in README.md
it removes the empty top row
2017-02-21 12:58:04 -05:00
8e8022b735 Merge pull request #418 from ruotianluo/adaptiveAverage
Add SpatialAdaptiveAveragePooling.
2017-02-21 09:15:12 -05:00
da82d2dd70 Merge pull request #434 from bottler/master
VolumetricFractionalMaxPooling like spatial
2017-02-21 09:13:59 -05:00
82176473a5 Merge pull request #442 from twitter-forks/half-fixes
Convert real to accreal in libTHCUNN
2017-02-21 09:12:56 -05:00
240372a991 Fixed topk documentation for largest=True 2017-02-21 04:38:24 -05:00
5b10411c8c Fixed some mistakes in examples
Fixed mistakes in LSTMCell and GRUCell examples.
2017-02-21 04:17:28 -05:00
4c474a9939 Improve prodall CUDA test 2017-02-20 23:28:31 -08:00
7ea6ae57c8 Support numpy arrays in default_collate 2017-02-20 23:28:31 -08:00
42633f8986 Fix misspelling and add support for weights in NLLLoss2d 2017-02-20 23:28:31 -08:00
84248690a9 Add support for indexing with None and slices with positive steps 2017-02-20 23:28:31 -08:00
53409ca0fb Fix a warning in THPP 2017-02-20 23:28:31 -08:00
c2c1710047 Add clip_grad_norm 2017-02-20 23:28:31 -08:00
876202503f Support multiple inputs in data parallel 2017-02-20 23:28:31 -08:00
946a7d9bc3 Make input contiguous only once in backward of cuDNN RNN 2017-02-20 23:28:31 -08:00
608bcd3b15 Return correct number of gradients from cuDNN RNN 2017-02-20 23:28:31 -08:00
632b02a477 Add checks for reward type and size in StochasticFunction 2017-02-20 23:28:31 -08:00
0db9c63300 Use library_dirs in setup.py 2017-02-20 23:28:31 -08:00
873ed4e6b6 Add better error message for conversion of CUDA tensors to numpy 2017-02-20 23:28:31 -08:00
01bd43037d add docs to torch/cuda/random 2017-02-20 20:43:47 -05:00
68c9e3f232 Fixed typo in GRUCell example 2017-02-21 01:37:04 +01:00
a25c8555eb Fixed paper references 2017-02-21 00:27:18 +01:00
d6ca3820aa Optionally specify stream for pointers in CUDA algorithms
Summary:
Work may be queued on CUDA streams for asynchronous execution. The
memory backed by pointers passed to any algorithm can therefore be
mutated after constructing an algorithm instance. By also passing in
the streams these mutations happen on, the algorithms can synchronize
with these mutations to ensure no invalid data is used.

By passing in these streams, any work done by these algorithms will
*also* be queued, which effectively removes a single synchronization
step from any algorithm run.

Differential Revision: D4589394

fbshipit-source-id: 0c8cd6ba9c9018f33d6f4c55a037083fc4164acb
2017-02-20 14:15:53 -08:00
dfd1dff383 Merge commit '4ca26fbc1b7be4e369f84e95df16431bb2f1dcb7' 2017-02-20 08:05:19 -08:00
8f391d4d51 Merge commit 'ee43cd7adca3b24a2071ce6c55dcd3a95a2b6ff6' 2017-02-20 07:55:46 -08:00
2a6b7685ae Merge commit 'f6c1bbfa483ad19c500dc94838baaa69f02d240b' 2017-02-20 07:55:19 -08:00
eb9573107d Merge commit '34b7fed802db1fda6322a70b648dcc4947858719' 2017-02-20 07:54:51 -08:00
ee43cd7adc Do SpatialClassNLLCriterion sizeAverage in a separate kernel 2017-02-20 06:54:23 -08:00
4ca26fbc1b Remove averaging from prodall 2017-02-20 11:37:53 +01:00
c165226325 Print a readable error message when arguments are on different GPUs 2017-02-20 11:35:50 +01:00
0722775ca3 AllreduceRingChunked/CudaAllReduceTest should use the chunked algorithm
Summary: I was mistakenly calling the non-chunked algorithm for the chunked test.

Reviewed By: pietern

Differential Revision: D4580160

fbshipit-source-id: 9d62a68e9e86cc6e596d90ff8854c585a0e8855c
2017-02-17 19:17:44 -08:00
49295ebe54 Add sequential to documentation 2017-02-18 08:42:43 +05:30
455038e470 Use a more stable formula for spatial LogSoftMax 2017-02-17 13:05:45 -08:00
ca7f02ea0c Add shape checks for SpatialClassNLLCriterion 2017-02-17 13:01:56 -08:00
04aba1caec Fix cuDNN dropout desc for multi-gpu (#772) 2017-02-17 19:16:12 +01:00
420488349f Implement CUDA-aware allreduce chunked
Summary:
First pass at a CUDA-aware allreduce chunked implementation. For now the algorithm runs on the CPU and is mostly copy/paste from allreduce_ring.h. A subsequent pass will offload to the GPU.

Serialize cuda test to avoid intermittent failures due to memory contention.

Reviewed By: pietern

Differential Revision: D4576959

fbshipit-source-id: e1f292a05b88ff24c33e549d4a52e770a21f85d2
2017-02-17 09:06:05 -08:00
1a5cae7340 Add busy-poll option in TCP transport
Summary: Ideally we would want the driver to busy-poll for us. In the absence of driver support, spinning with the MSG_DONTWAIT flag seems to help a lot too. Of course, we pay the price of burning one core for polling. Sigh.

Reviewed By: pietern

Differential Revision: D4576242

fbshipit-source-id: 85d9e1b786fbb6053864fba80f3e5ecc80fe221d
2017-02-17 07:31:32 -08:00
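The spinning receive described above, as a POSIX sketch (not the transport's actual code):

```
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

// Busy-poll a socket: trade one spinning core for lower wakeup latency.
ssize_t busyPollRecv(int fd, void* buf, size_t len) {
  for (;;) {
    ssize_t rv = recv(fd, buf, len, MSG_DONTWAIT);
    if (rv >= 0) {
      return rv; // data received (rv == 0 means orderly shutdown)
    }
    if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR) {
      return -1; // real error
    }
    // Nothing available yet; spin and retry immediately.
  }
}
```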
c26b9c0a5e Update rnn.py
Based on the https://github.com/pytorch/pytorch/blob/master/torch/backends/cudnn/rnn.py#L302 line, the output is returned in a (0,1)-transposed version if the batch_first argument is set to true.
2017-02-17 14:37:14 +01:00
aaf41c61a6 Fix Engine::compute_dependencies 2017-02-17 18:28:51 +05:30
dd844f741b Fix previous_functions when it contains Variables 2017-02-17 11:03:46 +05:30
4dd19988c3 Add benchmark option to display nanoseconds
Summary:
Latency optimization is going well and I've seen the odd case of <10us
measurements. This option makes the benchmark tool display nanos
instead.

Differential Revision: D4575925

fbshipit-source-id: 98dbd3b39e31cbcdd4c146613f6630e721187e1e
2017-02-16 21:16:26 -08:00
7117a9012e Fix flaky non-contig test 2017-02-17 10:40:08 +05:30
1bdc28161a Add torch.__version__ 2017-02-17 10:40:08 +05:30
5e150caf38 Fix a bug in Engine::compute_dependencies 2017-02-17 10:40:08 +05:30
c0c62d099a Make detach() actually remove the creator 2017-02-17 10:40:08 +05:30
b9ece39685 Make torch.Size methods return torch.Size, not tuple 2017-02-17 10:40:08 +05:30
b14d6318f8 Convert real to accreal in libTHCUNN
- This reverts commit 0d85922d116879448485ef88ae21e83a9255a0b0.
- Includes fixes for TemporalRowConvolution
2017-02-16 17:33:03 -08:00
93002720eb Extract CudaDevicePointer for reuse across CUDA-aware algorithms
Summary:
The CudaDevicePointer optionally takes an existing stream on
which it runs any operation associated with the pointer (for now just
memcpy's, but this likely will includes kernel execution in the
future).

Differential Revision: D4574035

fbshipit-source-id: ddd7972a3874012059f1fde1b341fd6edd69102d
2017-02-16 14:05:52 -08:00
7c44506441 allow DataParallel to have tuple inputs on a single GPU 2017-02-16 19:07:17 +01:00
937ba581d7 Improve nn.legacy compatibility with Torch7 (#738) 2017-02-16 21:17:12 +05:30
2ae54f1194 setup.cfg -> tox.ini (#761) 2017-02-16 21:13:13 +05:30
cb91078e01 Support synchronous mode for TCP transport
Summary:
In synchronous mode, it is not the device thread that is responsible
for handling I/O, but the user thread itself. Calling waitRecv on a
buffer will trigger the read function on the pair to be called. This
eliminates the context switch necessary if the device thread is
handling all I/O. For benchmarks with small numbers of elements this
reduces latency by as much as 20%.

Reviewed By: plapukhov

Differential Revision: D4549998

fbshipit-source-id: ab718ba090c06d7c7aa4065cc9f92bd96b9e4a35
2017-02-15 17:31:06 -08:00
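A rough Python sketch of the synchronous-mode idea, assuming a pair object that owns a socket; all names are illustrative, not the gloo API:

    class Pair:
        def __init__(self, sock, sync=False):
            self.sock = sock
            self.sync = sync  # sync=True: the user thread performs I/O itself

        def read_into(self, buf):
            view = memoryview(buf)
            while len(view):
                n = self.sock.recv_into(view)
                if n == 0:
                    raise ConnectionError("peer closed connection")
                view = view[n:]

    class Buffer:
        def __init__(self, pair, nbytes):
            self.pair = pair
            self.data = bytearray(nbytes)

        def wait_recv(self):
            if self.pair.sync:
                # No handoff to a device thread, hence no context switch:
                # the caller blocks inside its own read call.
                self.pair.read_into(self.data)
            else:
                raise NotImplementedError("device-thread path omitted from sketch")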
a217fefee1 Update rnn.py
Fixed a problem with outputting the RuntimeError if arguments are incorrect in cudnn/rnn.py
2017-02-15 21:49:42 +01:00
34b7fed802 Fix gcc 4.4.7 build. 2017-02-15 09:06:25 -08:00
5221745c21 add test for bias=False for 3d convolution 2017-02-15 04:26:44 -08:00
000ca44b16 Merge commit '797544c47a4e9bdff02137a127f883a6df9b3dfe' 2017-02-15 04:24:14 -08:00
8f3d44033b Merge commit '0426f2f3ec2b932cb83d64101081244c2a1451b1' 2017-02-15 04:23:50 -08:00
7cc14c595a Merge commit '07f5b21ef1bd29d1451c616062dcbfc3f8fd7c6a' 2017-02-15 04:23:18 -08:00
797544c47a implementation of bias=False for VolConv.cu 2017-02-15 04:18:17 -08:00
336eeee895 kernel_size as the default stride for avg_pool1d (#744)
Following the documentation, let stride default to kernel_size if stride is not provided.
2017-02-15 13:12:18 +05:30
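The rule is a simple default-argument fallback; a minimal sketch assuming the real torch.nn.functional.avg_pool1d underneath (the wrapper name mirrors it for illustration only):

    import torch.nn.functional as F

    def avg_pool1d(input, kernel_size, stride=None, padding=0):
        # Per the docs: when stride is not provided, it defaults to
        # kernel_size, i.e. non-overlapping windows.
        if stride is None:
            stride = kernel_size
        return F.avg_pool1d(input, kernel_size, stride, padding)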
593f867e3e Fixed a simple compile error on macOS #745. (#746)
Signed-off-by: Zhou Chang <achang.zhou@gmail.com>
2017-02-15 12:19:03 +05:30
385913be1c Fix class torch.nn.ConvTransposeNd documentation (#739)
There is no `dilation` parameter, and the `output_padding` doc was missing
2017-02-15 10:37:20 +05:30
6aaa14f5fe Fix LSTMCell Doc Typo (#743) 2017-02-15 08:29:17 +05:30
07f5b21ef1 Merge pull request #702 from gchanan/conservativeAllocator
Improve THCCachingHostAllocator performance by making it reclaim less aggressively
2017-02-15 08:26:48 +05:30
ee52f89772 Implement CUDA BroadcastOneToAll algorithm
Summary:
Implement CUDA BroadcastOneToAll algorithm for GPU addresses. Refactor cuda.h into cuda_private.h to allow inclusion of <cuda.h> in public headers without polluting the namespace.

Port broadcast tests to GPU variants.

* this revision is based on Peter's revision D4546932

Differential Revision: D4547382

fbshipit-source-id: 3d294ad8862b04fb783ba22e5c925b8d7cbc8a8d
2017-02-14 18:46:56 -08:00
e454870396 Free set of stored streams and handle NULL streams. 2017-02-14 15:41:47 -08:00
2822013437 Fix flaky tests 2017-02-14 21:28:50 +01:00
72c1982734 Add some more asserts to cuDNN RNN 2017-02-14 21:28:50 +01:00
0de2ea305a Support retain_variables in cuDNN RNN 2017-02-14 21:28:50 +01:00
d899385a3d Raise error when too small input is given to conv 2017-02-14 21:28:50 +01:00
c6d6cbe8a6 Check that all tensors are on the same GPU in cuDNN bindings 2017-02-14 21:28:50 +01:00
85e82e85d8 Fix bug in zero_grad, when some parameters didn't require grad 2017-02-14 21:28:50 +01:00
a1534cc37d Fix auto-gpu in cat 2017-02-14 21:28:50 +01:00
8c8dc791ef Load half and double THCUNN backends 2017-02-14 21:28:50 +01:00
63edca44f2 Add tests for non-contiguous inputs and gradients 2017-02-14 21:28:50 +01:00
6aa8c932fc Benchmark for CUDA-aware algorithms
Summary:
Separate benchmark build target for CUDA-aware algorithms.

This is needed to keep CUDA an optional dependency.

Differential Revision: D4546932

fbshipit-source-id: b73176ae9067233f883d51ba3ab4efbb13a6f86f
2017-02-13 21:32:58 -08:00
8821f4aba6 Fix race in benchmark tool
Summary: TSIA

Reviewed By: plapukhov

Differential Revision: D4549105

fbshipit-source-id: 61c8966e429e0701677f441aeaaf27fdc5e669e7
2017-02-13 21:32:58 -08:00
5e06634f7e Implement initial CUDA-aware allreduce
Summary:
This CUDA-aware ring allreduce is based on the regular ring allreduce.
It runs the reduction algorithm on the CPU and is therefore most
suited for smaller buffers.

Both the device-to-host memcpy's at the start of the algorithm and the
host-to-device memcpy's at the end of the algorithm are kicked off
asynchronously in an attempt to parallelize as much as possible.

Reviewed By: Yangqing

Differential Revision: D4542816

fbshipit-source-id: 101dfad276ca79703e37ff93fb1b6d467295f66b
2017-02-13 21:32:58 -08:00
b82c4b3d38 Split benchmark code into multiple files
Summary:
The CUDA benchmark suite will be a separate build target, so the
runner should be reused.

Reviewed By: Yangqing

Differential Revision: D4545092

fbshipit-source-id: 6ccf2d30f5d35c74fc59851b25416bfe6863d62c
2017-02-13 21:32:58 -08:00
8d90ab2d9b compile with cudart (#737) 2017-02-14 06:40:35 +05:30
bd5303010d Refactor autograd package to separate Python dependencies. (#662)
The core autograd Variable, Function, and Engine no longer depend on the
Python API. This let's us implement functions in C++. In the future, we
can also multithread engine and release the GIL for most of the
non-Python backwards.
2017-02-13 16:00:16 -08:00
16d2c3d7b3 make networks converted with loadcaffe loadable 2017-02-13 23:53:46 +01:00
407a92dc26 std::min() requires same type (#732)
* std::min() requires same type

* cast buffer instead

* declare buffer_size as int64_t
2017-02-13 18:06:05 +01:00
0a893abc7b fix serialization bug for large files 2017-02-12 19:13:02 +01:00
34fa5e0dc7 Update docstrings for testing object type
Add docstring for `is_storage()` and `is_tensor()`
2017-02-12 09:21:01 +05:30
712686ce91 Add cat, contiguous, squeeze, and unsqueeze to THPP
Use unsqueeze and view from TH/THC
2017-02-11 17:49:31 +01:00
518864a7e0 Fix bug in legacy NN updateGradParameters (#714) 2017-02-11 11:04:18 +05:30
72fd605b01 Fix std::accumulate
Summary:
Testing pull request again.
Closes https://github.com/facebookincubator/gloo/pull/2

Reviewed By: pietern

Differential Revision: D4542327

Pulled By: Yangqing

fbshipit-source-id: 5bd66c32c7249f1327225117815bef64b8708722
2017-02-10 10:12:37 -08:00
750fb5cc73 Fixes to support short and char tensors for bitwise operations 2017-02-09 18:52:59 -08:00
0f4749907a Adding bitwise operations
- lshift, rshift, bitand, bitor, bitxor
2017-02-09 18:11:58 -08:00
bd2dc63ef6 Adding bitand, bitor and bitxor 2017-02-09 17:06:04 -08:00
19a8795450 Changes to shift operations
- renaming lsh -> lshift, rsh -> rshift
- adding componentwise functions
2017-02-09 15:41:07 -08:00
d9dccfdd71 Fix for non-contiguous grad_output in cuDNN conv 2017-02-10 00:25:59 +01:00
7547a06c4f Avoiding duplicated unsigned as it causes error on gcc. 2017-02-09 13:29:05 -08:00
8929b75795 Added shift operations. 2017-02-09 13:28:36 -08:00
4d37ef878c Remove view on data and target tensors of dim 1 in TensorDataset (#609) 2017-02-09 22:06:39 +01:00
efd8998690 Import gloo
Summary:
In the GitHub repository this directory will be mirrored similar to
folly, such that the repository has a single top level directory
called "gloo". This allows for versioning or renaming of the
project root, without having to mangle the include paths; they will
always use the "gloo" prefix.

fbshipit-source-id: 24502e4185fc7cbe19b5249f83609e2b8118e9d7
2017-02-09 12:33:54 -08:00
126e77d5c6 Merge commit 'e9b05c71b4acf210fad719f4da8bb58a425dd00b' 2017-02-09 12:31:58 -08:00
53eec78bea Merge commit 'ac9312e9f8002227b267a82e224a5a99c7a7e734' 2017-02-09 12:31:40 -08:00
a4edaec81a Merge commit 'aeb7a72620be47c0e6a8928a9cb6df49c06902a0' 2017-02-09 12:31:16 -08:00
92481b59d3 Merge commit '73d232ee454ca25de5552d347a2b06820f30d193' 2017-02-09 12:30:39 -08:00
6c77fa9121 Changes in RNNBase and Embedding for compatibility with DataParallel (#660) 2017-02-09 22:36:26 +05:30
aeb7a72620 Merge pull request #693 from colesbury/view
Add code for 'view' to THC
2017-02-09 12:09:28 +05:30
73d232ee45 Merge pull request #926 from colesbury/view
Add code for 'view' to TH
2017-02-09 12:08:57 +05:30
c0c65bf915 Merge pull request #696 from colesbury/unsqueeze
Add unsqueeze to THC
2017-02-09 11:08:20 +05:30
f6cee952af Merge pull request #929 from colesbury/unsqueeze
Add unsqueeze1d to TH
2017-02-09 11:07:47 +05:30
e74184f679 Make THCCachingHostAllocator less aggressive.
In cases where copyAsync is a large percentage of the work,
processing events in recordEvent can cause a large bottleneck.

Here, we relax the constraint that we reclaim blocks as fast as possible
(i.e. in copyAsync); instead, we only check that a block can be re-allocated
in malloc and free.
2017-02-08 14:44:24 -08:00
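A hedged Python sketch of the relaxed reclamation scheme; names are illustrative (the real allocator is C++), and ev.query() stands in for cudaEventQuery:

    class Block:
        def __init__(self, size):
            self.size = size
            self.events = []  # events recorded after copies using this block

    class CachingHostAllocator:
        # Reclamation is polled lazily in malloc/free, keeping copyAsync's
        # hot path cheap.
        def __init__(self):
            self.free_blocks = []

        def malloc(self, size):
            for b in self.free_blocks:
                # Non-blocking "have all copies using this block finished?"
                if b.size >= size and all(ev.query() for ev in b.events):
                    self.free_blocks.remove(b)
                    b.events.clear()
                    return b
            return Block(size)

        def free(self, block):
            # The block may still have pending events; that's fine -- malloc
            # simply won't hand it out until they have completed.
            self.free_blocks.append(block)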
3884d36176 Add unsqueeze to THC 2017-02-08 13:49:32 -08:00
e7c6886a00 Add unsqueeze1d to TH
Unsqueeze inserts a singleton dimension. Unlike view, it doesn't require
the tensor to be contiguous.
2017-02-08 09:52:50 -08:00
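For reference, the Python-level behavior being bound here; unlike view, unsqueeze does not require a contiguous tensor:

    import torch

    x = torch.randn(3, 4).t()   # transpose makes x non-contiguous, shape (4, 3)
    y = x.unsqueeze(0)          # insert singleton dim -> shape (1, 4, 3)
    print(y.shape, x.is_contiguous())

    # view would require a contiguous tensor first:
    z = x.contiguous().view(1, 4, 3)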
024d1e2678 Merge pull request #69 from cwhipkey/master
Qualify nullptr_t with std::
2017-02-08 09:17:50 -08:00
ed8e92f63d Expose rawSet and rawResize as resizeNd and setStorageNd 2017-02-08 09:00:22 -08:00
fb97df5d65 Expose rawSet and rawResize as resizeNd and setStorageNd
These methods are useful from C because they don't require constructing
THLongStorages to wrap the sizes and strides, which can lead to leaked
memory in case of an error. Instead the sizes and strides can be
represented on the stack using standard C long arrays.
2017-02-08 08:56:04 -08:00
e9b05c71b4 Use THCTensor rather than THCudaTensor in THCUNN.h definition of
GatedLinearUnit.
2017-02-08 07:54:10 -08:00
5eab428294 Qualify nullptr_t with std::. 2017-02-08 07:06:31 -08:00
7926324385 Corrected parameter typo in Adam docstring (#697) 2017-02-07 19:00:10 +01:00
1527b37c26 Fixed typo and rendering of some equations (#693)
* Fixed typo and rendering of some equations

* Few more fixes to MSELoss docs

* Cleaning up whitespace to make pep8 happy
2017-02-07 18:59:27 +01:00
de4659659b The RNNCell's example can not run correctly 2017-02-07 18:58:19 +01:00
a96a8c8336 Static build support + Query CUDA driver, runtime versions (#695) 2017-02-07 08:34:20 +05:30
691aa19b88 Add code for 'view' to THC 2017-02-06 14:04:04 -08:00
6b07dc9e22 Add code for 'view' to TH 2017-02-06 14:00:48 -08:00
91a17b702b half<->float conversion cleanup (#901)
* half<->float conversion cleanup
2017-02-04 07:30:13 +05:30
c54597e0b2 std::move fixes 2017-02-03 21:31:03 +01:00
a9785bba44 cuda implementation of Gated Linear Unit, fixed issues with genericization 2017-02-02 21:38:25 -08:00
833b8cbc7a Remove unused code from module 2017-02-02 17:20:11 +01:00
75aeb16e05 Merge commit '72089c9c36c6b880c695baf732cd04329d72c098' 2017-02-01 22:00:42 -08:00
fc354a0d6e Revert "cuda implementation of Gated Linear Unit, fixed issues with genericization" 2017-02-02 10:50:47 +05:30
262611fcd3 Merge pull request #430 from huihuifan/newCudaGLU
cuda implementation of Gated Linear Unit, fixed issues with genericization
2017-02-02 08:16:35 +05:30
b8a34f3033 Small fixups:
1) Add return after THError for completeness.
2) Fix brace formatting
2017-02-01 15:46:19 -08:00
10bb6bb9b8 Fix function names in error messages 2017-02-01 15:21:57 -08:00
3c9ef69c37 Fix THCTensor::isSparse 2017-02-01 14:51:06 -08:00
dee987d6ee use pseudo-fp16 2017-02-01 23:48:09 +01:00
138f254ec1 Support sparse tensors in THPP (#667) 2017-02-01 17:34:50 -05:00
c7c8aaa7f0 Add ModuleList and ParameterList to nn 2017-02-01 23:26:31 +01:00
d0db624e02 Add W503 to PEP8 ignore list (#646) 2017-02-01 15:57:09 -05:00
e3e7b76310 Rename all normal and log_normal args to std 2017-02-01 21:48:11 +01:00
dad02bceb9 Remove duplicated line in cwrap 2017-02-01 21:48:11 +01:00
b195285879 Improve CUDA detection in THPP 2017-02-01 21:48:11 +01:00
8f3da5b51d set_index -> _set_index 2017-02-01 21:48:11 +01:00
825e919eb8 Add torch.unbind 2017-02-01 21:48:11 +01:00
acb0ce8885 Add LongTensor indexing support 2017-02-01 21:48:11 +01:00
72089c9c36 Update THHalf.c 2017-02-01 11:53:29 -08:00
cf2f158fec Remove erroneous proprietary license header
This change was approved by NVIDIA Legal, and I am authorized to make the change on behalf of the company.
2017-02-01 11:43:44 -08:00
e4886f6589 VolumetricFractionalMaxPooling like spatial 2017-02-01 11:52:49 +00:00
6470b5bd21 Add test for Embedding with sparse=True (#663) 2017-02-01 09:54:42 +05:30
tvn 44196955e2 ByteTensor should be unsigned (#664)
ByteTensor should be unsigned
2017-01-31 21:43:39 -05:00
f08ec1394d Fix bug with inplace TH(CU)NN
Also, remove unnecessary zero_() calls
2017-01-31 21:00:49 +01:00
f8fb25e0a2 Add generic bindings to THNN and THCUNN (#645)
Adds bindings using thpp::Tensor to THNN and THCUNN. This allows calling
into those APIs without knowing the concrete types of the tensor
arguments.
2017-01-31 13:23:02 -05:00
6a0c66752f Fix documentation and argument name for Tensor.normal_(mean, stddev) (#652) 2017-01-31 11:55:39 -05:00
a1bd4efb08 readme: add guidance on disabling CUDA (#655) 2017-01-31 14:05:51 +05:30
b43ce05268 Refactor parts of utils.h (#648)
Moves THPObjectPtr into a separate header, so that it can be included
independently. Currently, utils.h requires all of THP.h. Also adds RAII
structs for acquiring and releasing the GIL.
2017-01-30 21:16:28 -05:00
80e56cfda9 Merge commit 'dc9a5b7d2fbcf21268b524b9da5ae38a74214a59' 2017-01-30 17:58:05 -08:00
24701fc5a7 Merge commit '03dcf8a83bb009ecfdd8f27c4d9a6db40829b690' 2017-01-30 17:57:20 -08:00
f78a266d99 Merge commit '368cbe615d0a7bdaadddcb3bd390abcd4cc17b91' 2017-01-30 17:56:37 -08:00
f096fb6859 adding cudnn V6 support (#515) 2017-01-31 02:01:37 +01:00
a3e11d606b Fix linter errors 2017-01-31 01:58:09 +01:00
79232c24e2 Fixes after rebase 2017-01-31 01:58:09 +01:00
15d9d499ab Remove ZMQ dependency from compilation files 2017-01-31 01:58:09 +01:00
962084c8e8 Add Data Channel receive from any source (#52) 2017-01-31 01:58:09 +01:00
7518b1eefb Introduce Scalar for easier send/receive types through DataChannel 2017-01-31 01:58:09 +01:00
8215d7a4ba Implement TH_API functions from the set 2 (#49) 2017-01-31 01:58:09 +01:00
5aaa220d84 Thd functions v3 (#46) 2017-01-31 01:58:09 +01:00
12c16ab9bc Remaining storage functions implemented 2017-01-31 01:58:09 +01:00
76520512e7 DataChannel tests rewrite (#42); DataChannel isend and irecv implementation (#44) 2017-01-31 01:58:09 +01:00
66de965882 Replace ZeroMQ (#41) 2017-01-31 01:58:09 +01:00
10d32fb0b7 Fix DataChannel tests failure (#43)
Tests failed due to accessing a reference which could be invalid.
2017-01-31 01:58:09 +01:00
e72c9b6e4a Storage constructors implemented (#40) 2017-01-31 01:58:09 +01:00
ac1f68127a Add barrier, scatter, gather and allGather implementations + groups (#34) 2017-01-31 01:58:09 +01:00
60d1852c7b Major improvements to master-worker mode
* Fixed all undefined symbol errors
* Implemented storage interface and THStorage class
* RPC improvements
* Code refactor
2017-01-31 01:58:09 +01:00
d53eb521fc Add missing headers. 2017-01-31 01:58:09 +01:00
9808932f10 Refactor RPC and change TensorType to Type 2017-01-31 01:58:09 +01:00
ea876eb6d5 Add initial bindings for master-worker mode 2017-01-31 01:58:09 +01:00
0a45864866 Add THDStorage and improve master-worker mode implementation 2017-01-31 01:58:09 +01:00
2560b39796 Merge TensorTypeTraits.hpp with TensorTraits.hpp 2017-01-31 01:58:09 +01:00
21afa4c88b Worker handling for constructors + destructor 2017-01-31 01:58:09 +01:00
9fc3c5e4d2 THDTensor constructors implemented + some minor fixes 2017-01-31 01:58:09 +01:00
3e3501c98d Integration tests of the THD Python interface (#28) 2017-01-31 01:58:09 +01:00
5e6fcd02b5 Implement data channel groups (#25) 2017-01-31 01:58:09 +01:00
d46ebcfadf Fix broadcast and reduce implementations
Due to bad rank mapping, broadcast and reduce were connecting the
wrong processes, which resulted in errors or tensors not being received/sent.

 * Introduced a new mapping method to solve this problem.
 * Added and improved tests for these cases.
2017-01-31 01:58:09 +01:00
41480c8cf2 Data channel maintenance 2017-01-31 01:58:09 +01:00
236890d902 Fix transitive library dependencies in CMake 2017-01-31 01:58:09 +01:00
55632d81d2 Add Python wrappers for process group mode 2017-01-31 01:58:09 +01:00
0b276d622e Add reduce and allReduce implementations (#15) 2017-01-31 01:58:09 +01:00
c81491b37d Preserve directory structure when installing headers 2017-01-31 01:58:09 +01:00
42e189425f Detect ZMQ libs and headers in CMake 2017-01-31 01:58:09 +01:00
3cfa0d7199 Expose C API for process group mode 2017-01-31 01:58:09 +01:00
7c9e088661 Reorganize THD directory structure 2017-01-31 01:58:09 +01:00
e78aa4bb84 Implement CommandChannel with ZMQ. 2017-01-31 01:58:09 +01:00
f8e94d0d8b Implement DataChannel (MPI and TCP) (#8) 2017-01-31 01:58:09 +01:00
ebe6f40fce RPC message packing and unpacking implemented 2017-01-31 01:58:09 +01:00
5fb37efb46 Use #pragma once instead of defines 2017-01-31 01:58:09 +01:00
4f47855873 Style improvements 2017-01-31 01:58:09 +01:00
52ae6f682f Add initial version of tensor wrappers 2017-01-31 01:58:09 +01:00
c35f58f97b Template for THD implementation 2017-01-31 01:58:09 +01:00
659b2f3154 Add more autograd functions 2017-01-31 00:39:34 +01:00
5ea05cfb96 Return indices from Variable sort and topk 2017-01-31 00:39:34 +01:00
dc9a5b7d2f Fix memory leak in SpatialMaxUnpooling 2017-01-30 23:23:07 +01:00
f7ab5a128a Delete extra bracket in RNNCellBase.__repr__. (#637)
This extra bracket causes a ValueError when trying to print a Module that uses RNNCellBase or any of its subclasses.
2017-01-29 23:21:24 -05:00
368cbe615d Add Ubuntu 16.04 lib paths in CMake 2017-01-30 01:16:02 +01:00
d4c9a3782b billinear -> bilinear, docs for upsampling, improved docs for Unpooling, pep8 tests fix (#617)
* billinear -> bilinear, docs for upsampling, improved docs for Unpooling, pep8 tests fix
2017-01-30 05:08:48 +05:30
172dca5e8b Fix bug in cat (non-contiguous first input) 2017-01-29 21:25:53 +01:00
818bf0c408 Compile with asserts by default 2017-01-29 21:21:59 +01:00
03dcf8a83b Compile with asserts on by default 2017-01-29 21:18:54 +01:00
604f607fd1 Add asserts in index* functions 2017-01-29 21:18:43 +01:00
956d946c25 Default initial hidden states for recurrent layers (#605)
Fixes #434
2017-01-29 12:38:56 +01:00
970caaa621 Exclude sphinx_rtd_theme from pep8 2017-01-28 23:37:39 -05:00
00a5980cdf Improve RNN doc formatting 2017-01-28 23:37:39 -05:00
e24eee04f0 Link THC to THPP 2017-01-28 23:37:39 -05:00
f1b3af4ee2 Add more bernoulli options in cwrap 2017-01-28 23:37:39 -05:00
fb2d28f477 remove circular references in NestedIOFunction 2017-01-28 23:30:06 +01:00
3a704ff725 Fix legacy load_lua for SpatialConvolution (#608)
* fix legacy load_lua for conv2d

* fix pep8
2017-01-28 20:19:18 +01:00
0180e638e5 Remove unnecessary zero_() calls in cuDNN RNN 2017-01-28 14:36:57 +01:00
95c6ae04fb Fix non-contiguous grad handling in cuDNN RNN 2017-01-28 14:36:57 +01:00
27c4c6e0af Merge commit '6ee77b4edd1552d3a9a2e5389ffc351e513a8089' 2017-01-27 17:29:07 -08:00
da17414b3f Merge commit '343d65db91c2419843d36aed5467c2d1374108bc' 2017-01-27 17:16:08 -08:00
be2b27a747 Merge commit '4461ae809043390d5223905cb82b17035c7f9f31' 2017-01-27 17:15:21 -08:00
aec2c8f752 Merge commit 'c45ff2efe64d0face3889194ba6f885fe9cc4d48' 2017-01-27 17:12:13 -08:00
13e34b4679 Fix multiprocessing tests 2017-01-28 01:18:42 +01:00
57373c7c29 Fix docs 2017-01-28 01:16:04 +01:00
79f5bf84e5 [pep8] Potentially breaking docstring changes 2017-01-28 01:15:51 +01:00
3ed720079e [pep8] Fix most remaining lint manually 2017-01-28 01:15:51 +01:00
e7c1e6a8e3 [pep8] Fix most lint automatically with autopep8
Here's the command I used to invoke autopep8 (in parallel!):

    git ls-files | grep '\.py$' | xargs -n1 -P`nproc` autopep8 -i

Several rules are ignored in setup.cfg. The goal is to let autopep8
handle everything which it can handle safely, and to disable any rules
which are tricky or controversial to address. We may want to come back
and re-enable some of these rules later, but I'm trying to make this
patch as safe as possible.

Also configures flake8 to match pep8's behavior.

Also configures TravisCI to check the whole project for lint.
2017-01-28 01:15:51 +01:00
f1d0d73ed7 Fix flaky Sqrt test 2017-01-28 00:45:49 +01:00
9c411513bf Patch distutils crash when linking with ccache 2017-01-28 00:28:33 +01:00
ce78bc898b Fix travis builds and add ccache 2017-01-28 00:28:33 +01:00
887002e932 Add bindings to CUDA tensors and storages in THPP (#615) 2017-01-27 18:15:56 -05:00
31dea5ff23 Small typo in README (#613) 2017-01-27 20:18:36 +01:00
ec4602a973 Fix bad code alignment (#612)
forward *is* a method of the Linear class
2017-01-27 20:16:49 +01:00
a38749d15f Fix cuda notes
Target GPU *is* consistent with source GPU
2017-01-27 19:30:49 +01:00
6ee77b4edd Added cunn support for TemporalRowConvolutionMM (#415)
* Added cunn TemporalRowConvolutionMM support
2017-01-27 13:30:25 -05:00
6328981fcf cuda implementation of Gated Linear Unit, fixed issues with genericization 2017-01-26 22:56:33 -08:00
a90913105c add make-contiguous in batchnorm backward (#602) 2017-01-26 16:17:39 -05:00
9368596059 legacy.nn Attributes: Add '_gradOutput' to SpatialConvolution. (#600) 2017-01-26 15:00:41 -05:00
80ed795ff1 Minor ffi utils fix 2017-01-26 11:55:49 +01:00
a2938e3d11 add cc 3.0 to nccl (#594) 2017-01-25 22:47:23 -05:00
2ad967dbe4 Fix pep8 in setup.py with "autopep8 -i setup.py" 2017-01-25 22:23:22 -05:00
7415c090ac Check setup.py for pep8 lint on TravisCI 2017-01-25 22:23:22 -05:00
a1fa995044 Fixes and improvements (#593)
* Fix error in ELU backward

* Add --seed flag for tests

* Add test for BatchNorm eval

* Fix autograd.backward docs

* Support cc flags in cuDNN search

* Fix IndexSelect backward formula
2017-01-25 22:21:49 -05:00
3c2ecc6b15 add dockerfiles (#583)
* add dockerfiles
2017-01-25 17:30:29 -05:00
fa1516d319 Install THCUNN.h and generic/THCUNN.h
The THCApply.cuh include is moved to the .cu files so that THCUNN.h can be
compiled by a standard C compiler.
2017-01-25 14:13:17 -08:00
b5ebf68df1 Revert "Convert real to accreal in libTHCUNN" 2017-01-25 16:13:20 -05:00
aa46055274 Update CI links in README (#579) 2017-01-25 13:58:05 -05:00
2cad802b68 Revert "cuda implementation of Gated Linear Unit" 2017-01-25 13:15:22 -05:00
2d01f384f1 fallback to nn batchnorm on backward-evaluate (#589) 2017-01-25 12:38:57 -05:00
f8d4f980b3 Add upsampling modules and functions 2017-01-24 17:30:50 -05:00
4f5a6c366e Make Variables non-comparable 2017-01-24 17:30:50 -05:00
ecfcf39f30 Improve optimizer serialization
Also, add optimizer.load_state_dict
2017-01-24 17:30:50 -05:00
3975a2676e Fix invalid DECREF in torch.Size constructor 2017-01-24 17:30:50 -05:00
138ee75a3b Fix for target_link_libraries on CMake 2.8 (#581) 2017-01-24 17:26:24 -05:00
0048f228cb Add spatial test for LogSoftmax 2017-01-24 23:24:25 +01:00
2748b920ab make adam have the same lr as lua torch (#576) 2017-01-24 16:35:28 -05:00
a92a2312d4 Add missing fields to read_lua_file for BatchNorm and Linear layers. 2017-01-24 22:09:47 +01:00
945ce5cdb0 Fix math block of GRUCell in docs (#572)
Added a blank line at the beginning of the `.. math::` block; otherwise it is displayed as a code block.
2017-01-24 14:28:56 -05:00
b39de2cbbe Merge pull request #416 from pavanky/half-fixes
Convert real to accreal in libTHCUNN
2017-01-24 12:17:49 -05:00
ce13900148 update From Source instructions 2017-01-24 10:48:25 -05:00
4c77ad6ee4 step_rate -> lr in adadelta (#569) 2017-01-24 10:05:59 -05:00
0bc4246425 adding NLLLoss2d to docs 2017-01-24 09:22:51 -05:00
c45ff2efe6 Merge pull request #915 from pavanky/convert
Macros to convert between real and accreal
2017-01-24 09:14:33 -05:00
99b520cc5d Merge pull request #421 from huihuifan/cudaGLU
cuda implementation of Gated Linear Unit
2017-01-24 09:13:34 -05:00
e05607aee1 Add fall back to implicit GEMM and friends. (#558)
If we can't allocate the workspace for the desired algorithm, we fall
back to a default algorithm which does not require a workspace.
2017-01-24 09:10:39 -05:00
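A hedged Python sketch of the fallback logic; the helper and algorithm label are hypothetical, and the plain tensor allocation stands in for cuDNN's workspace allocation:

    import torch

    DEFAULT_ALGO = 'implicit_gemm'  # hypothetical label: needs no workspace

    def pick_algo_with_fallback(preferred_algo, workspace_bytes):
        # Try to grab the scratch workspace; on allocation failure
        # (e.g. CUDA OOM), fall back to an algorithm that needs none.
        try:
            workspace = torch.empty(workspace_bytes, dtype=torch.uint8, device='cuda')
            return preferred_algo, workspace
        except RuntimeError:
            return DEFAULT_ALGO, None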
a360ba1734 Add a hint about CUDNN_STATUS_NOT_SUPPORTED 2017-01-24 09:09:30 -05:00
c661b963b9 Add more contiguity checks to cuDNN 2017-01-24 09:09:30 -05:00
e374dc1696 add step rate to adadelta (#568)
Scales `delta` before it is applied to the parameters in order to control the learning rate of the optimizer (inspired by the climin optim lib for Theano).
Also changed the link to the Adadelta paper to point to the right location.
2017-01-24 08:48:19 -05:00
116e0c7f38 Merge commit '45596d52897fb187701943cb77456ff1e7249989' 2017-01-23 14:37:44 -08:00
45596d5289 Add contiguity checks to THCUNN 2017-01-23 14:17:51 -08:00
342e7b873d fixing THPP cmake for cmake < 3.1 (#559) 2017-01-23 14:47:06 -05:00
00410c4496 Fix broken THNN groups in conv functions 2017-01-22 18:32:51 -05:00
8b9276bbee Fix view bug in Conv1d 2017-01-22 18:32:51 -05:00
3238786ea1 Improve optimizer error messages 2017-01-22 18:32:51 -05:00
07ebbcbcb3 Add Parameter docs 2017-01-22 18:32:51 -05:00
ca555abcf9 fix comments 2017-01-22 18:02:40 -05:00
63893c3fa2 Fix auto-gpu semantics for indexing 2017-01-22 18:02:40 -05:00
f8ae34706e Port L-BFGS from Lua optim 2017-01-22 18:02:40 -05:00
f8e89fbe11 fix docs for torch.nn.functional.conv1d (#536) 2017-01-21 10:41:52 -05:00
30d208010c Fix segfault when a None gradient was given to a hook (#533) 2017-01-21 10:39:35 -05:00
Tom 017c7efb43 Fix typo in LSTMCell documentation 2017-01-21 15:35:48 +01:00
0c69fd559a Fix CUDA sharing across processes (#530) 2017-01-20 18:28:39 -05:00
c991258b93 fix formula for GRU cells 2017-01-20 17:28:57 -05:00
9f89692dcd adding documentation for some lapack functions (#528) 2017-01-20 16:56:37 -05:00
c28575a4eb Fix typo in documentation for autograd 2017-01-20 21:59:33 +01:00
c9db9c2317 Add C++ tensor library (from THD fork) (#526) 2017-01-20 15:23:34 -05:00
16a09304b4 fix documentation of LSTM cell (#525) 2017-01-20 12:01:50 -05:00
58a88d1ac0 Fix doc search and warnings 2017-01-20 11:36:41 +01:00
b740878697 Updated h0,c0 shape in documentation for RNN, LSTM, GRU (#519) 2017-01-20 10:12:44 +01:00
7179002bfb cuda implementation of Gated Linear Unit 2017-01-19 23:01:30 -08:00
173c81c2d2 import package at the beginning 2017-01-20 00:09:22 +01:00
ee4c77c59f Docs improvements (#512)
* Always compile .numpy() for all types

* Add torch.nn.functional docs and hidden headers

* Use sphinx to generate torchvision docs

* Remove unused import in ffi utils
2017-01-19 17:28:49 -05:00
30ec12fdd5 update readme for source installs to make magma dependency optional 2017-01-19 16:20:13 -05:00
269ec0566f fix typo 2017-01-19 14:26:50 -05:00
a0a95c95d4 Add Random Number Generator Docstrings (#506) 2017-01-19 11:10:01 -05:00
1335b7c1da Fix unpooling docs (#492) 2017-01-19 11:08:43 -05:00
6d14ef8083 Update batchnorm docstrings
Add missing full stops, and added blank line for increased clarity on rendered documentation.
2017-01-19 14:15:26 +01:00
26a492acf3 Update docstring for ConvTranspose functions
Transposed convolutions are often (but incorrectly) referred to as Deconvolutional operations. Made mention of this in the docstring to make it easier for people to search for this operation in the documentation.
2017-01-19 13:02:58 +01:00
f2741e8038 format fix (#490) 2017-01-18 21:41:10 -05:00
8d1a6975d2 Fix for non-contiguous from_numpy (#489) 2017-01-18 18:53:13 -05:00
c414bf0aaf Fix handling of unicode in torch._C._add_docstr (#487) 2017-01-18 17:22:30 -05:00
99f4864674 fixed RMSprop initialization (#485)
* fixed RMSprop initialization
2017-01-18 17:05:53 -05:00
784cbeff5b added a non-exhaustive list of contributors 2017-01-18 13:54:56 -05:00
9302f860ae Remove unused file TensorDocstrings.cpp (#481)
Tensor docstrings are created in _tensor_docs.py
2017-01-18 13:34:40 -05:00
ac8a5e7f0d Remove error message assertion (#480)
Depending on how PyTorch is compiled, the source code for DataLoader
might not be fully available, which can cause a spurious error in
test_dataloader.py
2017-01-18 13:16:38 -05:00
798fc16bbf add beta tag 2017-01-18 12:21:46 -05:00
0f65c9267d Fix typo 2017-01-18 08:46:04 -08:00
be45231ccb Improve ffi utils (#479)
* Improve ffi utils
2017-01-18 11:17:01 -05:00
279aea683b update conda install command 2017-01-18 10:52:49 -05:00
8aa8f791fc add more torch.* and Tensor docs (#476) 2017-01-18 08:39:33 -05:00
6464e69e21 Docs for torch.Storage (#475) 2017-01-18 03:22:30 -05:00
a93812e4e5 Fix PowConstant (#471) 2017-01-18 01:53:30 -05:00
225f942044 Disable IndexCopy test until #473 is fixed (#474) 2017-01-18 01:18:18 -05:00
d951d5b1cd Fix tensor.cuda(0) when on non-zero device. (#472) 2017-01-18 01:08:37 -05:00
2082ccbf59 More Tensor docs (#470) 2017-01-18 00:42:41 -05:00
473e795277 Fix invalidArguments for functions with tuple outputs, but no other (#468)
arguments.

For example:

   >>> torch.randn(5, 5).geqrf('invalid arg')
   TypeError: geqrf received an invalid combination of arguments - got (str), but expected ()
2017-01-17 23:14:40 -05:00
a09f653f52 Begin to document TensorBase methods (#466) 2017-01-17 21:44:12 -05:00
90fe6dd528 remove spurious pprint 2017-01-17 21:43:38 -05:00
57a2ccf777 PYTORCH_BUILD_VERSION to setup.py 2017-01-17 17:51:16 -08:00
205b9bc05f fix build_all.sh 2017-01-17 16:55:46 -08:00
14d5d52789 Add placeholder tensor documentation for methods that exist in torch. (#463) 2017-01-17 19:37:47 -05:00
9c218b419f kl_div and docs (#429) 2017-01-17 19:24:01 -05:00
a69d819901 Converting all instances of real to accreal in libTHCUNN
This is because the current version of luaffifb fails to pass
custom structs (i.e. half) as arguments or accept them as return
values.

The accreal parameters are immediately converted to real internally.
This is done to ensure none of the internal code needs to be changed.

This change also removes transform_reals_to_half which is no longer
necessary.

Change-Id: I978151d001de5492576fb0eddfa0608cd4e99149
2017-01-17 16:06:42 -08:00
517fb2f410 Remove free() and retain() from Tensor (#464) 2017-01-17 18:15:11 -05:00
fef2b1526d Adding macros to convert between real and accreal 2017-01-17 15:14:45 -08:00
3719994c96 Remove redundant code in THGenerateAllTypes.h 2017-01-17 15:12:43 -08:00
35c2821d71 Add documentation for methods defined in TensorBase (#462) 2017-01-17 17:40:54 -05:00
e4812b3903 add binary version to setup.py 2017-01-17 14:14:01 -08:00
4cc11066b2 Add torch.utils.data docs and improve notes (#460)
* Add torch.utils.data docs and improve notes
2017-01-17 14:51:05 -05:00
85b64d77b7 Merge pull request #461 from colesbury/visiondocs
Add torchvision reference to docs
2017-01-17 14:50:00 -05:00
db7948d7d5 Add torchvision reference to docs
Some documentation is just copied from the GitHub readme for now.
2017-01-17 11:40:33 -08:00
3d40c0562d improve build_all.sh 2017-01-17 09:49:48 -08:00
146bcc0e70 adding binary build copy option to build_all 2017-01-17 07:52:18 -08:00
8d9f6c2583 Minor fixes to docs 2017-01-17 10:19:14 -05:00
ac32d8b706 fix docs 2017-01-16 21:08:14 -05:00
15c1dad340 Minor fixes and torch.cuda docs 2017-01-16 20:38:14 -05:00
6d8baf7c30 Fix Sphinx warnings 2017-01-16 20:38:14 -05:00
7ced682ff5 Add notes 2017-01-16 20:38:14 -05:00
89cab4f5e6 fix readme language and links 2017-01-16 20:35:08 -05:00
a0afb79898 add pic to readme 2017-01-16 20:15:19 -05:00
d6fa3b3fd5 Deprecate nn.Container in favor of nn.Module 2017-01-16 19:07:37 -05:00
f91bb96071 Remove cmin, cmax and cinv 2017-01-16 19:07:37 -05:00
3b6644d195 Minor README fix 2017-01-17 00:15:06 +01:00
652b468ec2 Readme improvements 2017-01-16 18:05:26 -05:00
af110d37f2 remove old docs 2017-01-16 15:06:08 -05:00
38967568ca Make load_state_dict() more restrictive (#451)
The load_state_dict() function now raises an error if the argument
state_dict has extra keys or is missing keys.

Previously, load_state_dict() ignored extra and missing keys, which made
it hard to notice when you load an invalid state_dict. This could
happen, for example, if you save the state_dict for a DataParallel, but
load it into a single model.

The state_dict() function now only includes the Tensor data from the
parameters, which reduces checkpoint size by not saving gradients.
2017-01-16 13:06:00 -05:00
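A hedged sketch of the stricter check (the helper name is hypothetical, not the actual Module method):

    def load_state_dict_strict(module, state_dict):
        # Reject extra or missing keys instead of silently ignoring them.
        # A common trigger: DataParallel state_dicts prefix keys with "module.".
        own = module.state_dict()
        missing = set(own) - set(state_dict)
        unexpected = set(state_dict) - set(own)
        if missing or unexpected:
            raise KeyError("missing keys: {}, unexpected keys: {}".format(
                sorted(missing), sorted(unexpected)))
        for name, tensor in state_dict.items():
            own[name].copy_(tensor)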
df79631a72 Fix a mistake in autograd docs 2017-01-16 12:59:47 -05:00
95f0fa8a92 Change .grad attribute of Variables to be a Variable 2017-01-16 12:59:47 -05:00
1c6ff53b60 Make storages unresizable once exported to numpy 2017-01-16 12:59:47 -05:00
1dbf44c00d Add SmoothL1Loss to functional 2017-01-16 12:59:47 -05:00
1259a0648b Make nn containers copyable 2017-01-16 12:59:47 -05:00
b0055f6229 Improve argument checks for long arg options 2017-01-16 12:59:47 -05:00
90040afc44 Fix cwrap option filtering 2017-01-16 12:59:47 -05:00
59bc96bdc2 Check dropout probability 2017-01-16 12:59:47 -05:00
676ffee542 Check params type in optimizers 2017-01-16 12:59:47 -05:00
77136e4c13 Add anything in torch.legacy docs 2017-01-16 12:59:47 -05:00
604e13775f Add optim docs 2017-01-16 12:59:47 -05:00
02380a74e3 Add warnings to multiprocessing docs 2017-01-16 12:59:47 -05:00
4461ae8090 include cstddef for msvc 2017-01-15 23:45:48 +08:00
2b948c42cd Add SpatialAdaptiveAveragePooling. 2017-01-14 19:44:07 -06:00
133c1e927f fix readme, bump version 2017-01-14 17:47:35 -05:00
2290798a83 if nccl is available, do not compile it and load system version 2017-01-14 10:09:48 +01:00
fd600b11a6 Merge commit '2b88d85505d7317f980e69201e72694d6d5905a4' 2017-01-13 15:58:54 -08:00
b5c9f5c4c3 Merge commit 'ca74bb17b8823d74b83433e2743f23e572501c72' 2017-01-13 15:55:19 -08:00
b8a5b1ed8e Merge commit 'e67b525388a5ae11ed243e94bbc25b4934b03a66' 2017-01-13 15:54:49 -08:00
ca74bb17b8 Merge pull request #675 from pavanky/more-atomic-fix
Ensure atomicAdd(double) is visible to host side code
2017-01-13 17:21:39 -05:00
69d8331195 Use functools.partial 2017-01-13 23:10:45 +01:00
eab5c1975c Avoid strict aliasing warning in float/half conversions. 2017-01-13 14:08:25 -08:00
e67b525388 Merge pull request #911 from gchanan/convWarning
Avoid strict aliasing warning in float/half conversions.
2017-01-13 17:06:17 -05:00
5171e56b82 Ensure atomicAdd(double) is visible to host side code
Just replicating behavior of the cuda headers
2017-01-13 14:05:36 -08:00
f467848448 Avoid strict aliasing warning in float/half conversions.
Verified that at least for GCC 4.4.7 this generates identical code.
2017-01-13 13:58:03 -08:00
7e4ddcfe8a Remove names from register_hook calls (#446)
The register hook calls now return an object that can be used to remove
the hook. For example,

   >>> h = module.register_forward_hook(callback)
   >>> h.remove()  # removes hook

Or as a context manager:

   >>> with module.register_forward_hook(callback):
   ...     pass

This makes it easier for libraries to use hooks without worrying about
name collisions.
2017-01-13 15:57:03 -05:00
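A minimal sketch of such a handle (illustrative, not the exact torch implementation): each handle remembers its own id in the hook dict, so remove() is unambiguous and the handle doubles as a context manager.

    class RemovableHandle:
        _next_id = 0

        def __init__(self, hooks_dict):
            self.hooks_dict = hooks_dict
            self.id = RemovableHandle._next_id
            RemovableHandle._next_id += 1

        def remove(self):
            self.hooks_dict.pop(self.id, None)

        def __enter__(self):
            return self

        def __exit__(self, *args):
            self.remove()

    def register_forward_hook(hooks, callback):
        handle = RemovableHandle(hooks)
        hooks[handle.id] = callback
        return handle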
3152be5fb3 Add repr to RNNs and Embedding (#428) 2017-01-13 15:53:52 -05:00
b076944dc5 Fix for atomicAdd(double) for CUDA_VERSION < 8000 2017-01-13 12:43:15 -08:00
3a07228509 Add ConvTranspose1d module (#449) 2017-01-13 15:22:57 -05:00
24a2f2e3a0 Add MaxUnpool1d module (#447) 2017-01-13 14:36:25 -05:00
b32dd4a876 add cudnn deb package installation paths to cudnn discovery, add 5.1.10 to load options (#448) 2017-01-13 14:32:23 -05:00
4f4bd81228 Fixes to autograd: (#442)
- Non differentiable outputs could prevent a gradient computation (see
   test_dep_nograd)
 - Crash in backward on variable which doesn't requires_grad (issue
   #438)
 - Stochastic functions could be backproped through multiple times
2017-01-13 13:51:47 -05:00
59b23d79c6 fix cudnn rnn batch_first with tests (#445)
* fix cudnn rnn batch_first with tests
2017-01-13 13:40:27 -05:00
8c14630e35 Fix Tensor.apply_() (#444)
Fixes #411
2017-01-12 21:51:18 -08:00
cc32de8ef9 Fix typos etc. in docs
- replace "long" with the Python type "int"
 - remove "reshape" from torch.rst since torch.reshape is not
   implemented
2017-01-12 21:25:50 -08:00
44696c1375 Fix MaxPool2d on 3D CUDA inputs (#443)
Currently, MaxPool2d returns 4d indices for 3d CUDA inputs, but
correctly returns 3d indices for 3d CPU inputs.
2017-01-12 21:04:25 -08:00
82088a8110 parallelizing catArray to multiple tensors per kernel (#635) 2017-01-12 12:57:30 -08:00
d5e45b2278 Add AvgPool1d which just uses AvgPool2d implementation (#439) 2017-01-12 15:07:11 -05:00
bdfef2975c adding more docs for torch.* functions 2017-01-11 08:19:49 -08:00
b4bb4b64a1 simd.h: really fix the arm64 (i.e. Aarch64) build 2017-01-11 10:07:32 +00:00
3e91c5e1ad Merge pull request #668 from gchanan/thrustalloc
Add THCThrustAllocator.cuh to install files
2017-01-10 19:27:09 -05:00
2b88d85505 Re-route thrust memory allocation to THCudaMalloc / THCudaFree in cunn. 2017-01-10 10:42:29 -08:00
50651970b8 Merge pull request #666 from gchanan/thrustalloc
Re-route thrust memory allocation to THCudaMalloc / THCudaFree
2017-01-10 12:02:51 -05:00
4a8906dd8a Add THCThrustAllocator.cuh to install files so downstream projects can use it. 2017-01-10 09:02:28 -08:00
68e2769a13 Re-route thrust memory allocation to THCudaMalloc / THCudaFree
so it can use the caching allocator.
2017-01-10 08:35:41 -08:00
17c998e99a fixing arm64 build 2017-01-10 00:15:11 -05:00
35758f51f2 Get rid of a few unused imports. 2017-01-09 15:41:58 -08:00
e8102b0a9b fix compiler warning in THCS 2017-01-09 15:19:13 -08:00
04f2bc9aa7 Fix bug in squeeze backward (#425) 2017-01-09 16:29:37 -05:00
d070178dd3 Instantiate 128kb of scratch space in GPU memory per-device by default 2017-01-09 13:21:18 -08:00
c9ec7fad52 Add model_zoo utility torch torch.utils (#424)
This was originally part of a torchvision PR, but I think it will be
useful outside vision, such as for distributing word embeddings.
2017-01-09 13:16:58 -05:00
f0a6ca4d53 BatchNorm fixes (#423)
- don't use cuDNN for half inputs because weight, bias, running_mean,
   etc. are required to be of different type than for THCUNN
 - accept 3D inputs (N,C,L) in BatchNorm1d
 - remove accidental 'use_cudnn=False'
2017-01-09 13:16:51 -05:00
fd92470e23 Add cuDNN bindings for BatchNorm (#421) 2017-01-07 15:35:24 -05:00
8369664445 Minor doc fixes 2017-01-06 21:51:35 +01:00
35e1adfe82 documentation parity with torch7 for catArray impl 2017-01-06 11:55:57 -08:00
eb91fc5e5d Minor fixes to docs (#412) 2017-01-06 10:59:24 -05:00
d186fdb34c Fix THHalf issues with MSVC. 2017-01-05 08:09:09 -08:00
0f04f71b7e fix API reference link 2017-01-05 02:46:19 -05:00
87f1959be7 adding proper categories to torch.rst 2017-01-04 23:20:57 -05:00
a538055e81 fix invalid use of THPUtils_invalidArguments in sparse tensors 2017-01-04 21:47:48 -05:00
0e345aaf6d Fix invalidArguments to take kwargs and out into account (#397) 2017-01-04 19:49:11 -05:00
c976dd339d remove .zero() on grad_input conv and batch_norm 2017-01-05 01:48:50 +01:00
71cef62436 Fix condition for threadArgErrorHandler
Some error handlers may not have any data associated with them
2017-01-04 16:43:31 -08:00
3a29055044 Fix rnn sphynx docs (#405) 2017-01-04 19:17:10 -05:00
59d66e6963 Sparse Library (#333) 2017-01-05 00:43:41 +01:00
46bc43a80f fixing loss layer docs 2017-01-04 18:40:51 -05:00
7fa60b2e44 fixing docs of activations, pixelshuffle, sparse for rst 2017-01-04 18:40:51 -05:00
c78893f912 removing Image: references in nn activation docs 2017-01-04 13:51:56 -05:00
0d2a4e1a9e fix dropout docs for rst 2017-01-04 13:49:43 -05:00
088f14c697 fix batchnorm and linear docs for rst 2017-01-04 13:35:55 -05:00
4bf7be7bd5 fix RNN module docs for rst 2017-01-04 13:22:02 -05:00
b2ab6891c5 fix the rest of Pool module docs for rst 2017-01-04 12:51:55 -05:00
39ab5bcba8 fix MaxPool1d,2d,3d docs for rst 2017-01-04 03:11:48 -05:00
42f131c09f fixing nn.Conv* documentation for rst and adding nn docs to sphinx 2017-01-04 02:11:27 -05:00
89dca6ffdc Add a patch to stop Sphinx from cross-referencing ivar tags 2017-01-03 18:31:08 -05:00
b7f36f93d5 Expand autograd docs and add sections 2017-01-03 18:31:08 -05:00
58320d5082 Add multiprocessing docs 2017-01-03 18:31:08 -05:00
a461804a65 adding docs for more torch.* functions 2017-01-03 18:29:50 -05:00
817f6cc59d adding linspace, logspace, neg and range 2017-01-03 18:29:50 -05:00
108936169c implement more torch.* docs, remove zero, cauchy, log_normal from torch.* docs as they are not stateless 2017-01-03 18:29:50 -05:00
f60ae085e6 Float -> float, Long -> long 2017-01-03 18:29:50 -05:00
85dda09f95 fixed names and other cosmetics 2017-01-03 18:29:50 -05:00
4f479a98d4 fix indentation issue for all examples, add doc for add 2017-01-03 18:29:50 -05:00
35ba948dde add doc for *mm* functions, *mv* functions and addcmul, addcdiv 2017-01-03 18:29:50 -05:00
6b4ed52f10 adding docs for some torch.* functions, removing all, any stateless methods 2017-01-03 18:29:50 -05:00
dcf5f8671c Add __pow__ to Tensor and list additional undocumented functions (#398) 2017-01-03 13:38:44 -05:00
5340291add Update FindARM.cmake
Fix typos
2017-01-03 12:29:06 -05:00
1c6fe58574 Add gather and scatter to autograd 2017-01-02 13:42:59 -05:00
9f2111af73 Rename Variable.no_grad to Variable.detach 2017-01-02 13:42:59 -05:00
2ed6c6d479 Fix leaf Variable handling in autograd 2017-01-02 13:42:59 -05:00
01ac2d3791 Merge commit '1b97f088cb9e42717122795463a800bf3f503adf' 2017-01-02 09:39:45 -08:00
eac687df5a Merge commit '849cbf3a4774727eadb97c27af13bfbdc976a02a' 2017-01-02 09:39:20 -08:00
6a2785aef7 remove link_prefix from linker arguments (#395) 2017-01-02 12:37:52 -05:00
849cbf3a47 small cmake fix 2017-01-01 19:02:33 -05:00
a0c614ece3 unsqueeze instead of view in dataloader 2017-01-01 23:38:54 +01:00
1b97f088cb Merge pull request #651 from pavanky/cat
Adding support for empty tensors in cat, catArray
2017-01-01 12:47:19 -05:00
097399cdeb Merge branch 'master' into contiguous-cat-1d 2017-01-01 12:34:46 -05:00
7ee152881e Merge commit '3074f8eb8103ecdcbbcbb8d49332d9e7d6f3141c' 2017-01-01 01:13:17 -08:00
3074f8eb81 Removing TH_GENERIC_USE_HALF, TH_NATIVE_HALF, TH_GENERIC_NO_MATH (replaced where appropriate with TH_REAL_IS_HALF), removed half from THGenerateAllTypes, added an explicit THGenerateHalfType.h 2017-01-01 00:57:51 -08:00
748208775f Merge commit '5df17050bf82337d13dbd2108bd17922ac38956c' 2017-01-01 00:08:55 -08:00
5df17050bf Revert "TH_GENERIC_USE_HALF=1 by default, half enabled by default" 2017-01-01 01:06:18 -05:00
92df0eb2bf removing unneeded flags in build_all.sh 2016-12-31 20:16:50 -08:00
995195935b Merge commit 'be8376eb883d2f5a466994e024cde44e6adc6130' 2016-12-31 20:10:11 -08:00
be8376eb88 TH_GENERIC_USE_HALF=1 by default, half enabled by default 2016-12-31 20:07:18 -08:00
b650a45b9c fix botched merge in setup.py 2016-12-31 16:55:53 -05:00
8a20e22239 Add torch.stack 2016-12-31 16:25:39 -05:00
7c5014d803 Add torch.split, torch.chunk and change default dim of cat to 0 2016-12-31 16:25:39 -05:00
62ac1b4bdd Implement missing cases of __matmul__ 2016-12-31 16:25:39 -05:00
0633c08ec9 Add is_shared() method for storages and tensors 2016-12-31 16:25:39 -05:00
cf87cc9214 Check valid configurations of Variable flags 2016-12-31 16:25:39 -05:00
f908432eb3 Ensure that Variable's grad is shared between processes 2016-12-31 16:25:39 -05:00
1bd291c57c Fix multiprocessing tests on macOS 2016-12-31 16:25:39 -05:00
b277df6705 Doc css fixes for mobile and large screens (#389) 2016-12-31 12:01:01 -05:00
ec4d597c59 test fix 2016-12-31 11:08:34 -05:00
d2ef49384e Add custom docs stylesheet (#387) 2016-12-31 10:32:00 -05:00
b5dc36f278 explicitly linking against v1 libs to avoid lua-torch conflicts (#386) 2016-12-31 10:30:36 -05:00
41976e2b60 Merge commit '3dac1b9936a62225cf8516d6d7830fe6c83039ae' 2016-12-30 21:07:13 -08:00
3dac1b9936 cmake C flags fix 2016-12-31 00:06:26 -05:00
d2bb56647f Merge commit '224422eed6813c15b3c3b2c0dcd5e0187ec660a1' 2016-12-30 19:51:01 -08:00
224422eed6 cmake fix 2016-12-30 22:50:06 -05:00
3c26f7a205 Merge commit '10f78985e72fb6834b435ac3f8d0890fa6614365' 2016-12-30 19:24:00 -08:00
9ac9809f27 Merge commit 'd8f4d5f91e3680478a6843d49d7295c1165618f0' 2016-12-30 19:23:41 -08:00
7bf6e984ef Merge commit 'dc95f66a954ad18b80f3f649f8e2c8507c048b74' 2016-12-30 19:23:17 -08:00
10f78985e7 adding TH_LIBRARIES and THC_LIBRARIES var to THCUNN cmake 2016-12-30 22:20:29 -05:00
dc95f66a95 adding TH_LIBRARIES var to THC cmake 2016-12-30 22:10:18 -05:00
47f56f0230 Merge commit '43fbdd3b45d4351623a4aa9c8d5e6dba9eac259a' 2016-12-30 17:46:04 -08:00
b4018c4c30 Merge commit '803d0320771365754658ac74587cc082c2a61fa7' 2016-12-30 17:45:45 -08:00
43fbdd3b45 workaround for luarocks 12.04 bug 2016-12-30 20:44:35 -05:00
9d2d884313 Merge commit 'b5cf1d2fc71604f472a07d0181a05a7f09e276c2' 2016-12-30 16:50:25 -08:00
c0600e655a Merge commit 'c1ca9044bd6dccd293471c6caeeeea4ebd97d61b' 2016-12-30 16:49:56 -08:00
671ed89f2a Merge commit '52c2a92013c45afa5df61a68b16695663ee9fab5' 2016-12-30 16:49:29 -08:00
e0372643e1 Merge commit '541ab961d8f9a02bbbe1a06ba25027116ee93c20' 2016-12-30 16:49:05 -08:00
b5cf1d2fc7 adding THCUNN_SO_VERSION 2016-12-30 19:06:23 -05:00
52c2a92013 adding THC_SO_VERSION property 2016-12-30 19:02:50 -05:00
541ab961d8 adding TH_SO_VERSION option 2016-12-30 18:56:59 -05:00
849794cd2c Remove deprecated and unimplemented functions (#383) 2016-12-30 18:37:44 -05:00
f47fa2cb04 use __get_cpuid when available 2016-12-30 18:10:57 -05:00
7a162dd97a Fix outputs of torch.* comparison functions 2016-12-30 23:02:57 +01:00
b123bace1b Rename torch.autograd.functions to torch.autograd._functions 2016-12-30 23:02:57 +01:00
483490cc25 Move PixelShuffle implementation to functional 2016-12-30 23:02:57 +01:00
8d60e39fdc Rename torch.nn.functions to torch.nn._functions 2016-12-30 23:02:57 +01:00
e7dff91cf3 Fix for multinomial autograd function 2016-12-30 23:02:57 +01:00
ab5776449c Add documentation for some torch.xxx functions (#382) 2016-12-30 17:01:47 -05:00
a229582238 Merge pull request #875 from atkayu/add_histc2
Add a new function bhistc to calculate histogram of batch of images only once
2016-12-30 13:41:42 -05:00
a0df8fde62 Merge pull request #592 from joker512/master
fix: cunn can't find cutorch sources
2016-12-30 11:31:57 -05:00
e4a3aa9295 Change container doc to assign child modules via attributes 2016-12-30 15:51:09 +01:00
be98c5d12d Start documenting torch.Tensor (#377) 2016-12-30 01:21:34 -05:00
bc6a71b1f5 Add Function docs 2016-12-30 00:15:06 -05:00
26f1e2ca9c Add basic autograd docs 2016-12-30 00:15:06 -05:00
75d850cfd2 Fix optim docs 2016-12-30 00:15:06 -05:00
f4870ca5c6 Fix nn docs 2016-12-30 00:15:06 -05:00
235d5400e1 Merge pixelshuffle function into module (#375) 2016-12-29 21:38:37 -05:00
491d5ba4fd add new flags to build_all.sh 2016-12-29 18:16:59 -08:00
d42eadfeb9 Merge commit '2975f539ff8ac9b8e07fb2b610bd69a1596d4c3c' 2016-12-29 17:51:34 -08:00
9a40821069 Merge commit '1ac038ab243bb2718b37cbd81eadbfeb2a234252' 2016-12-29 17:51:13 -08:00
2975f539ff sort cuda 8.0+ fix 2016-12-29 17:47:30 -08:00
64ca584199 Fix group support in convolution modules (#374) 2016-12-29 20:01:39 -05:00
5263469e21 Fix handling of zero sizes in caching host allocator 2016-12-29 15:36:49 -08:00
c367e0b64e Support dilated 1d and 3d convolutions (#372)
Fixes #367
2016-12-29 18:20:32 -05:00
183b3aacd2 Hold CuDNN PRNG state between RNN iterations 2016-12-30 00:14:55 +01:00
101950ce92 fix repr in legacy.nn.linear 2016-12-29 17:30:46 -05:00
239ae94389 fix in conv repr 2016-12-29 17:30:46 -05:00
55e850d825 test if modules can be printed with fixes 2016-12-29 17:30:46 -05:00
62af45d99f Basic functional interface (#354) 2016-12-29 22:53:57 +01:00
1ac038ab24 Merge pull request #882 from amrobbins/ppcvectorinstxns
Add support for VSX vector instructions on PPC
2016-12-29 14:24:56 -05:00
77a925ab66 Add THHalfTensor support to cutorch (#655)
* Add THHalfTensor support to cutorch.
2016-12-29 14:23:45 -05:00
d0d33d3ae7 Add support for torch.HalfTensor (#874)
* Add support for torch.HalfTensor.

* Improvements/Simplifications for torch.HalfTensor.

Improvements/Simplifications:
1) Defines half type as TH_Half, so as to not conflict with cutorch
version.  Previously, these were defined as the same "half" type and
required proper ordering of includes to ensure type was only defined
once, which would have affected all downstream projects.
2) No longer generates math functions that are not actually defined
on torch.HalfTensor, e.g. maskedFill, map, etc.
3) Adds tests for all available torch.HalfTensor functions
4) Allows compiling without TH_GENERIC_USE_HALF (so if there's a
problem can just unset that in CMakeLists rather than backing out)
5) Some simplifications: removes a new copy optimization and
some TH_HALF literal definitions

Limitations:
Because math functions are not defined, some "non-math" operators
on torch.HalfTensor give an error message, e.g. __index__/__newindex__
with a ByteTensor apply a mask, but masks aren't implemented. These
limitations aren't always obvious (e.g. for documentation purposes),
but they should always give an error message.

* Rename TH_HALF to THHalf.
2016-12-29 14:23:26 -05:00
9b7eceddc8 Accept outputs in out argument 2016-12-29 12:25:59 +01:00
24af02154c Use ForkingPickler for sharing tensor/storages across processes (#344)
This hooks into the (internal) ForkingPickler class in multiprocessing
to reduce tensors, storages, and CUDA events instead of our queue from
joblib. This makes it easier to use the standard multiprocessing classes
in later versions of Python.

This also exposes:

 - Tensor/Storage.share_memory_()
 - Module.share_memory()

These methods move the CPU tensors and storages to shared memory. If
you're using the "fork" method of multiprocessing, these objects can be
directly inherited instead of serialized through a queue.
2016-12-28 20:34:23 -05:00
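A small usage example based on the methods listed above (with the "fork" start method the tensor is inherited directly; with "spawn" it is reduced via the ForkingPickler):

    import torch
    import torch.multiprocessing as mp

    def worker(t):
        t.add_(1)  # mutation is visible to the parent: storage is shared

    if __name__ == '__main__':
        t = torch.zeros(3)
        t.share_memory_()   # move the underlying storage to shared memory
        p = mp.Process(target=worker, args=(t,))
        p.start()
        p.join()
        print(t)            # tensor of ones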
86ec14e594 Add support for VSX vector instructions on PPC
Added support for the fill, diff, scale, mul and add functions using
PPC CPU vector instructions. These are used in place of the versions
of these functions written for x86, when compiled on PPC.

This fixes a compile failure on PPC
2016-12-28 16:58:09 -06:00
8a29338837 Use cuDNN for Conv3d and ConvTranspose3d (#359)
I've also updated test_nn.py to run marked tests twice: once with cuDNN
enabled and once with it disabled.
2016-12-28 16:14:47 -05:00
29918c6ca5 Copy libnccl.so.1 instead of libnccl.so
Occasionally, my PyTorch checkout gets into a bad state where libnccl.so
does not exist, but the NCCL makefile doesn't build it because
libnccl.so.1 exists. Switch to copying libnccl.so.1 to work around this.
2016-12-28 20:21:31 +01:00
80a44e84dc Change multinomial return type for CUDA 2016-12-28 18:15:17 +01:00
5497b1babb Use TypeError in invalidArguments 2016-12-28 18:15:17 +01:00
bef70aa377 Make type checking more strict and fix topk arguments 2016-12-28 18:15:17 +01:00
0d30f77889 Make variables picklable with protocols <2 2016-12-28 18:15:17 +01:00
e27bb3e993 Minor fixes 2016-12-28 18:15:17 +01:00
179d5efc81 Merge commit '310ec57fd7176e07137ab7bc717f3602b6f53aa5' 2016-12-28 07:33:37 -08:00
b55e38801d rename histc2 to bhistc 2016-12-28 16:26:09 +08:00
e704ec5c6f Merge commit '46f024846698cd8201d6c1804f21bffda15a2069' 2016-12-27 19:12:45 -08:00
6cda6bb34c Merge commit 'd2a93c310292c9427056e02ac7e0d5cca12a04a2' 2016-12-27 19:12:21 -08:00
46f0248466 Use bool for sizeAverage in SoftMarginCriterion 2016-12-28 00:36:11 +01:00
310ec57fd7 Fix typos in THCTensorRandom 2016-12-28 00:16:53 +01:00
cd82b2b869 Implement comparison and logical operators for tensors 2016-12-28 00:04:08 +01:00
126a1cc398 Add Sphinx docs 2016-12-28 00:03:39 +01:00
bf650f05b3 Merge pull request #652 from apaszke/multinomial
Make multinomial return a LongTensor (compatible with CPU version)
2016-12-27 17:54:54 -05:00
f2606a7502 Make multinomial return a LongTensor (compatible with CPU version) 2016-12-27 23:12:12 +01:00
b07fe52ee0 Adding support for empty tensors in cat, catArray 2016-12-27 13:37:42 -08:00
b07358b329 renaming test to avoid dot in test name 2016-12-27 13:34:09 -08:00
2aea8077f9 renaming test to avoid dot in test name 2016-12-27 13:17:04 -08:00
41f9c14297 Merge commit '135687f04a4e4e0722c14f096c9a1fc647c95f07' 2016-12-27 13:12:26 -08:00
135687f04a critical bugfix in storage copy 2016-12-27 13:11:32 -08:00
b140e70b58 Add autograd.backward (#341) 2016-12-26 19:10:35 -05:00
ec987b57f6 removing 3.3, 3.4 from README badges 2016-12-26 14:52:36 -05:00
596677232c Add a different code path for catting contiguous tensors along the first dimension, for speed reasons.
Fix a bug in cat when catting with an empty tensor along first dim (it added an extra dim).
Fix the ambiguous 'catting along last dimension' sentence in the doc and change the behavior to pick the maximum last dimension over all input tensors.
Now empty tensors are allowed.
2016-12-26 10:23:42 -05:00
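A small illustration of the "empty tensors are allowed" behavior, using an explicitly shaped zero-length tensor so the sketch stays valid:

    import torch

    a = torch.ones(2, 3)
    e = torch.empty(0, 3)                      # zero-length along the cat dim
    print(torch.cat([a, e, a], dim=0).shape)   # -> torch.Size([4, 3])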
9d74e139e5 removing 3.3 and 3.4 from travis build 2016-12-25 15:13:13 -05:00
bc475cad67 Move max pooling construction logic to functions (#343) 2016-12-25 10:28:11 -05:00
45d6212fd2 default args for conv functions 2016-12-25 01:55:00 -05:00
f45d75ed22 make the CUDA-aware tests backoff if CUDA no available 2016-12-24 15:36:00 -05:00
b03407289f Merge commit '55a794e6ec8d01fc8cceee14ce23ec501e517570' 2016-12-24 11:06:27 -08:00
93ed476e7d adding LAPACK double bindings, adding fmod and remainder 2016-12-22 17:36:47 -08:00
10faa303bc Merge commit '6fa371cb0db9f43e3d05746c7e90516975052589' 2016-12-22 17:35:13 -08:00
6fa371cb0d bugfix for qr skinny matrices 2016-12-22 16:29:53 -08:00
18a2691b4b Fix memory leak in THStorage_copyCudaXXX 2016-12-22 13:49:31 -08:00
f7bd3f7932 added pixel shuffle layer + tests
removed duplicate save_for_backward
2016-12-22 21:43:38 +01:00
f8dee4620a add a new function histc2 2016-12-22 10:11:58 +08:00
800e24616a Merge commit 'fa61159dd0bfd9bbb190e1dfbd90a68f4d3c30c8' 2016-12-21 12:40:41 -08:00
d63a435787 Merge commit 'f16a624b35dd28fbd4cdcd3bd08dfc2421c3e2b0' 2016-12-21 12:40:20 -08:00
a9c2809ce3 change the order of cudnn libs 2016-12-21 05:44:16 -08:00
fa61159dd0 cremainder, cfmod implementations (take 2) (#646) 2016-12-20 20:43:07 -05:00
a215e000e9 fix for out of place tests and for non standard I/O pipes 2016-12-20 16:13:24 -08:00
f16a624b35 correctness fixes for mod and remainder for integer type tensors. 2016-12-20 11:41:16 -08:00
61c2896cb8 Merge pull request #638 from pavanky/multinomial_fix
Bugfix for multinomial distribution
2016-12-20 14:08:59 -05:00
22ebc3f205 Revert "Add support for cremainder, cfmod" 2016-12-20 09:35:41 -05:00
8fa9f443ec Merge pull request #641 from killeent/cfuncs
Add support for cremainder, cfmod
2016-12-19 20:49:29 -05:00
bb72ccf1a5 Support CUDA IPC in Python 3 (#203)
CUDA IPC only works with Python 3 using the "spawn" start method. You
can select the start method using the get_context method:

 import torch.multiprocessing as mp
 ctx = mp.get_context('spawn')
 queue = ctx.Queue()
 event = ctx.Event()
2016-12-19 20:42:53 -05:00
2e73456f5c Fix compiler warnings in Tensor.cpp 2016-12-19 20:35:08 -05:00
3e49a2b4b7 Prevent deepcopy from changing Parameters into Variables 2016-12-19 20:35:08 -05:00
4694e4050b Fix printing bug when all values are NaN or inf 2016-12-19 20:35:08 -05:00
59b9eeff49 Expose gather and equals for CUDA tensors 2016-12-19 20:35:08 -05:00
1744fad8c2 Use 'void' for no-arg function 2016-12-19 12:23:17 -08:00
e46d942ca6 Fix double initialization of HalfStorage (#331) 2016-12-19 15:19:41 -05:00
93a6136863 Add support for cremainder, cfmod 2016-12-19 11:25:10 -08:00
230bde94e7 fix about section 2016-12-19 11:00:53 -05:00
20fffc8bb7 Fix torch.is_tensor for half tensors (#322)
Fixes #311
2016-12-19 15:27:47 +01:00
861a3f3a30 avoid shadowing warnings 2016-12-17 14:01:11 -08:00
ee52102943 small change from set to dict 2016-12-17 13:39:04 -08:00
26516f667e Fix multinomial bug and decrease precision of normal test (#325) 2016-12-17 21:40:13 +01:00
5586f48ad5 add cudnn 5.0.5 to supported versions (#321) 2016-12-17 07:57:20 -05:00
cc6e3c92d2 ensure that legacy linear has gradWeight and gradBias fields (#319) 2016-12-17 00:06:58 +01:00
a2ef5782d0 Revert "Bugfix of type in THCTensor macro." 2016-12-16 17:20:57 -05:00
0c1c0e21b8 Bugfix of type in THCTensor macro.
A fix for issue #632.
2016-12-16 15:37:06 -05:00
ffcc38cf05 Deterministic ordering of parameters and buffers. (#317)
Uses the assignment syntax to get deterministic ordering of parameters.
The ordering of parameters using the constructor syntax is
non-deterministic because kwargs use dict() in Python 3.5 and earlier.
2016-12-16 14:45:56 -05:00
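A hedged sketch of the two syntaxes; Net and the layer shapes are illustrative:

    from collections import OrderedDict
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            # Assignment order is recorded as attributes are set, so
            # parameters() and state_dict() enumerate deterministically.
            self.fc1 = nn.Linear(4, 8)
            self.fc2 = nn.Linear(8, 2)

    # The old constructor syntax received **kwargs as a plain dict, which is
    # unordered on Python <= 3.5; an OrderedDict makes the ordering explicit:
    model = nn.Sequential(OrderedDict([
        ('fc1', nn.Linear(4, 8)),
        ('fc2', nn.Linear(8, 2)),
    ]))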
cc24b68584 Merge commit 'f413ee087df1a4bbd8b5a9baba83d07ae0729ea0' 2016-12-16 05:29:16 -08:00
8a70067b92 Add support for stochastic functions in autograd (#294) 2016-12-16 13:14:37 +01:00
33b227c45b serialization bug fix (#314) 2016-12-16 12:05:36 +01:00
fb68be952d Bugfix for multinomial distribution
- Ensures the index of the first bin from the cdf is returned.
2016-12-15 16:01:37 -08:00
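A hedged sketch of inverse-CDF sampling with the first-bin rule described above (the helper name is hypothetical):

    import torch

    def multinomial_one(probs):
        # Return the *first* bin whose cumulative probability reaches u;
        # this matters when some bins have zero probability.
        cdf = torch.cumsum(probs, dim=0)
        u = torch.rand(1) * cdf[-1]
        return int((cdf >= u).nonzero()[0])

    print(multinomial_one(torch.tensor([0.1, 0.0, 0.9])))  # never returns 1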
f413ee087d Add missing free in LookupTable (#400) 2016-12-15 22:17:37 +01:00
6495f5dd30 fix bounds issue in snprintf 2016-12-14 17:11:26 -08:00
8e09f0590b Make sure that C extension was compiled with cuDNN before using it 2016-12-15 00:47:55 +01:00
08d346df9c Print libraries used for building the extension 2016-12-15 00:47:55 +01:00
12cf96e358 Don't change requires_grad of parameters in train() and eval() 2016-12-15 00:47:55 +01:00
765a720d1c Add support for tds.Vec and tds.Hash in load_lua 2016-12-15 00:47:55 +01:00
cace62f94c Fix a bug in narrow docs 2016-12-15 00:47:55 +01:00
767c96850d Return False from torch.cuda.is_available() when no devices are visible 2016-12-15 00:47:55 +01:00
b73e78edbb Check nDimension in t() and t_() 2016-12-15 00:47:55 +01:00
7914cc119d Fix bmm for Variables 2016-12-15 00:47:55 +01:00
2b13eb2a6c Fix naming of setup.py env toggles 2016-12-15 00:47:55 +01:00
8768e64e97 Allow returning changed gradients from the hooks 2016-12-15 00:47:55 +01:00
9212b9ca09 fix wrong export directive for THCCachingHostAllocator (#633)
fix wrong export directive for THCCachingHostAllocator
2016-12-15 00:36:03 +01:00
0d0f197682 Add note on Huber loss (#310) 2016-12-14 21:39:42 +01:00
281e34d1b7 fixes for changes in THNN API 2016-12-13 18:10:07 -08:00
287ba38905 Merge commit 'ed9dbff4e0295dbeb2e8de908cb8c1109c278a8a' 2016-12-13 17:23:56 -08:00
ed9dbff4e0 removing ifdef 2016-12-13 17:22:52 -08:00
6ba4e48521 Merge commit '3adcb2c157ed7df5aaff9b59d4526aa24ec770db' 2016-12-13 16:49:38 -08:00
b7269f2295 Merge commit '220183ed783101f19d88cb8fb3052fd4abc7234f' 2016-12-13 16:49:15 -08:00
5ab317d4a6 Merge commit '258c9ffb2c2d23a06b153aa9161a88ad930cfbbc' 2016-12-13 16:48:45 -08:00
431bcf7afa Merge commit '56245426ebcf239363867905ca2a4cea676dd45d' 2016-12-13 16:48:16 -08:00
41909e8c5b adding a couple more imports 2016-12-13 16:47:00 -08:00
56245426eb small fixes to allocator 2016-12-13 16:45:01 -08:00
3adcb2c157 Check that batch size matches the target size in ClassNLLCriterion (#399) 2016-12-14 00:25:05 +01:00
6d12185cc9 Fixed compilation on Raspberry PI without NEON 2016-12-13 17:30:54 -05:00
258c9ffb2c Implement bernoulli with element-wise probabilities for all types 2016-12-13 11:10:28 -08:00
dede431dd9 More state_dict fixes (#305)
In #304 I forgot even more...

I did a repo search and this time it should be all.
2016-12-13 13:59:06 +01:00
6312d29d80 Another documentation change, missed one in #303 (#304)
Apparently load_parameter_dict was also renamed to load_state_dict
2016-12-13 12:47:40 +01:00
ab5f26545b Correct documentation to be in line with #237 (#303)
.parameter_dict was renamed to .state_dict in #237

This documentation change reflects that.
2016-12-13 12:32:42 +01:00
6567c1342d small doc fixes 2016-12-12 23:51:54 +01:00
3d6c2e023c TensorInfo related code documentation 2016-12-12 10:06:13 -08:00
89d930335b fix tests for GPU-less setup (#298) 2016-12-12 10:56:57 +01:00
04393cd47d fix gcc-6 build on os x (#297) 2016-12-12 00:01:15 +01:00
28f0cf6cee Add docstring support to cwrap (#295) 2016-12-11 23:25:14 +01:00
1af9a9637f Refactor copy and release GIL during copy (#286) 2016-12-11 21:54:58 +01:00
1031d671fb legacy fixes (#287) 2016-12-11 20:13:48 +01:00
2a974f5ca2 Fix 1.3.2 compilation 2016-12-08 09:11:43 -08:00
ee91b22317 Merge pull request #394 from gchanan/volumShapeChecks
Improve Volumetric shape checking.
2016-12-07 02:07:22 +01:00
648e9fbb58 Adding missing file 2016-12-05 18:06:24 -08:00
9f7114a4a1 Improve shape checks for VolumetricDilatedConvolution, VolumetricConvolution,
VolumetricFullConvolution.

Also add some additional checks for SpatialFullConvolution.
2016-12-05 12:22:04 -08:00
7d03da0890 Improve shape checks for VolumetricAveragePooling,
VolumetricMaxUnpooling, VolumetricReplicationPadding.
2016-12-05 09:31:00 -08:00
4e0cecae7f Improve shape checks for VolumetricMaxPooling and VolumetricDilatedMaxPooling. 2016-12-05 08:20:19 -08:00
72dbb76a15 fix half type numerics issue in SpatialFractionalMaxPooling 2016-12-02 16:33:27 -08:00
0d7d29fa57 Enable caching allocator for CUDA pinned memory (#275)
Also add binding for CUDA "sleep" kernel
2016-12-02 01:33:56 -05:00
be3276fcdd Account for batch_size in DataLoader.__len__() (#277) 2016-12-02 01:21:36 -05:00
09c94a170c Merge commit 'f2a18004a77f146bb5b431715402f4afd3cacccd' 2016-12-01 22:16:58 -08:00
f2a18004a7 Process outstanding CUDA events in recordEvent
Without this, the cuda_events could continuously grow from calls to
cudaMemcpyAsync, but would never be processed if there were no new
pinned memory allocations.

For example:

 t1 = cutorch.createCudaHostTensor(10)
 t2 = torch.CudaTensor(10)
 while true do t2:copyAsync(t1) end
2016-12-01 19:09:47 -08:00
1a3ff1bd28 Remove unnecessary shape checks in Spatial Pooling modules.
Checks comparing input image sizes to kernel sizes are superseded
by output size checks.
2016-12-01 15:49:53 -08:00
a5d3c779c7 Add gradOutput shape checks in temporal modules. 2016-12-01 15:49:48 -08:00
9d32e60dc2 Fix spacing in SpatialDilatedMaxPooling. 2016-12-01 15:49:41 -08:00
34d27771c6 1.3.2 release
Broadcast tuning
Better checking of inputs
Copy/reduce code simplification
2016-12-01 15:17:50 -08:00
1093821c33 Replace min BW by average BW in tests 2016-12-01 15:16:35 -08:00
91f2946310 Import most common packages by default 2016-12-01 23:14:41 +01:00
2bd7a3c31d Don't raise an error when retrieval of container's source code fails 2016-12-01 23:14:41 +01:00
a681f6759b Raise correct error types when indexing tensors 2016-12-01 23:14:41 +01:00
cb849524f3 Improve cuDNN detection at build time 2016-12-01 23:14:41 +01:00
1f5951693a Change torch.randperm to return Long tensors 2016-12-01 23:14:41 +01:00
87748ffd4c Add .type() for torch.nn modules 2016-12-01 23:14:41 +01:00
0580f5a928 Add __len__ for tensors 2016-12-01 23:14:41 +01:00
88d9fdec2e Add torch.cuda.set_device 2016-12-01 23:14:41 +01:00
506a40ce44 Remove optim submodule attributes from torch.optim package 2016-12-01 23:14:41 +01:00
bf0e185bd6 Merge commit 'bb1019d1ec1503718b97d17366902f96f349f472' 2016-12-01 13:47:20 -08:00
5b3ccec10d Merge commit 'c2d32030a25e352eb2e2af26931163c0f4c96b36' 2016-12-01 13:46:35 -08:00
eb07581502 Merge commit 'bec6ab47b6782f60925e306b69e0f556274fb28e' 2016-12-01 13:46:03 -08:00
934a2b6878 Merge commit 'b27d4de850b5f43829bd4980f5e7f3b4b32ab7cf' 2016-12-01 13:45:05 -08:00
bec6ab47b6 Add caching allocator for pinned (host) memory
Adds a caching allocator for CUDA pinned (page-locked) memory. This
avoid synchronization due to cudaFreeHost or cudaHostUnregister at the
expense of potentially higher host memory usage.

Correctness is preserved by recording CUDA events after each
cudaMemcpyAsync involving the pinned memory. The pinned memory
allocations are not reused until all events associated with it have
completed.
2016-12-01 13:35:12 -08:00
49480f1548 Adds a CUDA "sleep" kernel
Adds a CUDA "sleep" kernel which spins for the given number of
iterations. This is useful for testing correct synchronization with
streams.
2016-12-01 12:45:07 -08:00
18a3c62d9b Allow NoneType for parameters in Module.load_state_dict 2016-12-01 20:12:15 +01:00
6322cf3234 Allow device=None in Tensor constructor
Setting device=None is the same as not specifying the device (use the
current active device).
2016-12-01 20:09:19 +01:00
4e2b154342 update install command from source 2016-12-01 10:55:04 +01:00
bb1019d1ec Add newContiguous calls that have been removed from lua. 2016-11-30 13:58:22 -08:00
162170fd7b Add optional weight decay to optim.SGD (#269) 2016-11-29 20:35:40 -05:00
ea728e7c5e Add DataParallel container (#268)
Adds a container version of the `data_parallel` function. This is a
drop-in replacement for the DataParallel class in the ImageNet example.
2016-11-29 16:36:01 -05:00
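A minimal sketch of the container form (sizes and the .cuda() placement are illustrative):

 import torch
 import torch.nn as nn
 from torch.autograd import Variable

 net = nn.DataParallel(nn.Linear(10, 5).cuda())   # replicates across visible GPUs
 out = net(Variable(torch.randn(8, 10).cuda()))   # input is scattered, outputs gathered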
aea6ba4bcd Support pinned memory in the DataLoader (#265)
DataLoader now supports the constructor argument 'pin_memory'. When set
to true, tensors in the sample are copied to pinned memory. This happens
in a background thread when num_workers > 1.
2016-11-29 12:35:03 -05:00
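Sketch of the new argument, assuming an existing dataset object:

 from torch.utils.data import DataLoader

 loader = DataLoader(dataset, batch_size=32, num_workers=2,
                     pin_memory=True)   # samples are copied to page-locked memory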
606aa43da0 Merge pull request #383 from gchanan/TemporalShapeCheck
Improve error messages/shape check in TemporalMaxPooling.
2016-11-28 13:50:59 -06:00
8bfa802665 Improve error messages/shape check in TemporalMaxPooling. 2016-11-28 11:46:26 -08:00
ddddfba1c0 Merge pull request #54 from peterhj/peterhj-staticlib
Add a static library target "staticlib" to the Makefile.
2016-11-28 09:15:39 -08:00
86c95014a4 use locally modified select_compute_arch.cmake for msvc 2016-11-28 14:02:21 +08:00
288c950c5e use locally modified select_compute_arch.cmake for msvc 2016-11-28 13:23:24 +08:00
b27d4de850 changes to compile with msvc 2016-11-28 10:27:36 +08:00
61063ebade Merge commit 'a7f24ccb7635447b133011d39e36279be140149e' 2016-11-26 09:13:12 -08:00
3e70e26278 Merge commit '08a1bc71c0712a4151de83d1487a55b218ae1a15' 2016-11-26 09:12:53 -08:00
66e7e42800 Merge commit '379860e457dbb72c0f18e0366e5b199452b302f5' 2016-11-26 09:12:24 -08:00
0fecec14b8 fixing bug in indexing when given float indices 2016-11-26 11:50:56 -05:00
a7f24ccb76 Fix shapeCheck in Spatial Pooling modules 2016-11-26 17:41:59 +01:00
04e896a4b4 adding coverage support for tests 2016-11-26 00:26:30 -05:00
5dcfb80b36 lua serializer registers CUDA classes only when CUDA is available 2016-11-26 00:26:30 -05:00
9da60c39ce Fix batch_first in AutogradRNN (#255) 2016-11-25 23:55:45 -05:00
379860e457 Lazily initialize CUDA devices
Previously, cutorch would initialize every CUDA device and enable P2P
access between all pairs. This slows down start-up, especially with 8
devices. Now, THCudaInit does not initialize any devices and P2P access
is enabled lazily. Setting the random number generator seed also does
not initialize the device until random numbers are actually used.
2016-11-25 15:22:16 -08:00
bcfa2d6c79 Add .t7 file reader 2016-11-25 00:41:55 +01:00
8b492bbc47 Return accreal as correct python types 2016-11-25 00:40:36 +01:00
a49b7b0f58 Fix bug when Variable constructor didn't set the error properly 2016-11-25 00:40:36 +01:00
c781ac414a Unify signatures of max, mean, etc. between variables and tensors 2016-11-25 00:40:36 +01:00
656dca6edb Implement in-place operators for variables 2016-11-25 00:40:36 +01:00
830adfd151 Allow passing torch.Size to expand 2016-11-25 00:40:36 +01:00
6f7c8e4ef8 Fix bug when passing 0 as dim to max, min, mode, median and kthvalue 2016-11-25 00:40:36 +01:00
5765d608cc Add a static library target "staticlib" to the Makefile.
Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.
2016-11-24 11:31:03 -08:00
2ba6678766 Revert "Lazily initialize CUDA devices" 2016-11-23 19:40:03 -05:00
71a47d1bed Merge pull request #610 from colesbury/lazy
Lazily initialize CUDA devices
2016-11-23 17:48:00 -05:00
51bf6321ea Implemented cudaMemGetInfo for caching allocator (#600)
* Implemented cudaMemGetInfo for caching allocator
2016-11-23 17:38:57 -05:00
aa8916e7c6 Don't unpack single element tuples returned by functions 2016-11-23 18:48:41 +01:00
2e24da2a0b Change parameter_dict to state_dict in torch.nn 2016-11-23 18:48:41 +01:00
c94ccafb61 Print error message when constructing a tensor from a numpy array with negative strides 2016-11-23 18:48:41 +01:00
80a827d3da Fix data_parallel bugs 2016-11-23 18:48:41 +01:00
c07105a796 fix cwrap for changed signatures 2016-11-22 14:27:41 -08:00
c40c061a9f Lazily initialize CUDA devices
Previously, cutorch would initialize every CUDA device and enable P2P
access between all pairs. This slows down start-up, especially with 8
devices. Now, THCudaInit does not initialize any devices and P2P access
is enabled lazily. Setting the random number generator seed also does
not initialize the device until random numbers are actually used.
2016-11-22 13:43:25 -08:00
a9bd27ce5c Merge commit '709255d9952783eed6c8f84e504693f9b436f852' 2016-11-22 13:26:09 -08:00
2e36c4ea2d Merge commit 'f3cb636294fbd0e15dd4b3bfdca16e73d1dca38b' 2016-11-22 13:25:53 -08:00
4e45385a8d Merge commit 'b27f576f29189ca78dd670cbd177bfa29b695c50' 2016-11-22 13:25:29 -08:00
cf5e925c10 Merge commit 'f6b94dd830c06692cd78addd41868a7a12c48755' 2016-11-22 13:25:00 -08:00
709255d995 added shape checks for SpatialAveragePooling 2016-11-22 13:23:16 -08:00
e3f440b1d0 Make torch.backends.cudnn work on OSX 2016-11-22 19:06:08 +01:00
f6b94dd830 Add some documentation for APPLY and DIM_APPLY macros 2016-11-21 14:02:33 -08:00
3911a1d395 Fix memory leak in LogSoftMax 2016-11-21 21:32:10 +01:00
ebd3648fd6 Call newContiguous rather than arg checking isContiguous. 2016-11-21 21:32:10 +01:00
f698f09cb7 Add contiguous checking / make tensors contiguous for
SpatialUpSamplingBilinear, PReLU, SpatialSubSampling, TemporalConvolution.
2016-11-21 21:32:10 +01:00
86aa5dae05 Move VolumetricConvolution contiguous code from lua to C. 2016-11-21 21:32:10 +01:00
179c82ffb4 Autograd functions no longer store references to saved_variables
Only references to their data and version counters are stored.
Also, it is now possible to have None arguments in save_for_backward
and return too many values from backward (as long as the excess
results are None).
2016-11-21 19:39:55 +01:00
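An old-style Function sketch of both relaxations; MulConst is an illustrative example, not from the commit:

 from torch.autograd import Function

 class MulConst(Function):
     def __init__(self, const):
         super(MulConst, self).__init__()
         self.const = const

     def forward(self, input):
         self.save_for_backward(input, None)   # None entries are now allowed
         return input * self.const

     def backward(self, grad_output):
         input, _ = self.saved_tensors
         # returning extra results is fine as long as the excess ones are None
         return grad_output * self.const, None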
233017f01f Add torch.multinomial for CUDA 2016-11-21 19:39:55 +01:00
c2c515516b Remove irrelevant output from ncclReduce Fortran tests 2016-11-21 10:18:04 -08:00
9c18468fe2 Add Copyright header to Fortran bindings source files 2016-11-21 10:17:58 -08:00
597bbfeacd SpatialConvolutionLocal uses baddbmm 2016-11-21 09:10:26 -08:00
0613ac90cd string.split and string.join removed for .split and .join 2016-11-18 16:23:34 -08:00
78871d829a Call PyObject_GC_UnTrack from tp_dealloc handler (#231)
Without the PyObject_GC_UnTrack call, the tp_dealloc handler could get
called twice if a referred to object triggers a garbage collection from
its destructor.

See http://bugs.python.org/issue28737
2016-11-18 14:06:35 -05:00
d40a7bf9eb Fix Scatter.backward() (#232) 2016-11-18 13:58:09 -05:00
b27f576f29 guard random functions for half 2016-11-18 09:32:32 -08:00
073dfd8b88 bump version 2016-11-18 12:26:12 -05:00
509dd57c2e tensor docs 2016-11-18 04:00:27 -05:00
7a837b7a14 fixing nn docs to be categorized, and optim docs 2016-11-18 03:18:48 -05:00
dee864116a optim docs 2016-11-17 21:09:17 -05:00
5f2b32e45b Add Fortran bindings 2016-11-17 15:33:34 -08:00
e51d0bef97 Add cuDNN bindings for 2D transposed convolution 2016-11-17 14:34:40 -08:00
2fd78112ab Add half copy/conversions 2016-11-17 14:34:33 -08:00
5c14bd2888 Merge pull request #605 from gchanan/halfAddrAddmv
Add half support for addmv and addr.
2016-11-17 14:33:31 -08:00
84b4665e02 Add half support for addmv and addr. 2016-11-17 14:27:56 -08:00
26d626a47c adding docs for loss functions, container, module and fix typos 2016-11-17 15:11:27 -05:00
6ff6299c65 fix memory leak in (equal) 2016-11-16 15:43:57 -08:00
071e68d99d fixing output size w / h order 2016-11-16 15:32:18 -08:00
78c1094d93 Don't override __call__ in modules 2016-11-16 15:32:18 -08:00
56fc639c9f Fix no bias mode of autogenerated THNN function 2016-11-16 15:32:18 -08:00
51084a9054 Merge pull request #603 from killeent/remainder
Implement fmod, remainder, equal in Cutorch
2016-11-16 15:20:57 -08:00
f8ae5c93e9 enables random functions for float and half types on cuda (#223) 2016-11-16 15:14:26 -08:00
ad286c0692 add support for equal in cutorch 2016-11-16 14:41:59 -08:00
a483b3903d Merge pull request #377 from gchanan/checkContiguous
Add contiguous checks / auto contiguous
2016-11-16 10:35:11 -08:00
6564d39777 Call newContiguous for tensors that are required to be contiguous.
Also add tests to verify that non-contiguous tensors are handled correctly.
2016-11-16 09:50:11 -08:00
8f1b7230fe add support for fmod in cutorch 2016-11-16 08:35:17 -08:00
c0b7608965 add support for remainder in cutorch 2016-11-16 08:12:44 -08:00
56dd4132c4 add MACOSX_DEPLOYMENT_TARGET to instructions 2016-11-16 10:45:56 -05:00
9057eade95 Handle contiguousness and improve shape checks
in SpatialAdaptiveMaxPooling, SpatialUpSamplingNearest, and TemporalConvolution.
2016-11-15 14:17:45 -08:00
a28317b263 SpatialSubSampling contiguous check. 2016-11-15 14:16:48 -08:00
25c3603266 VolumetricConvolution check contiguous. 2016-11-15 14:15:55 -08:00
ae6f2dd11c Adapt nn code to changes in THNN and THCUNN 2016-11-15 23:02:14 +01:00
3aaa1771d5 [cutorch mag2gen] more cleanup 2016-11-15 13:31:57 -08:00
2034396a3c [cutorch mag2gen] some cleanup 2016-11-15 13:31:57 -08:00
0cad668065 [cutorch mag2gen] move qr to generic 2016-11-15 13:31:57 -08:00
f644a11b82 [cutorch mag2gen] move potr* to generic 2016-11-15 13:31:32 -08:00
d7e3b2ef29 [cutorch mag2gen] move inverse to generic 2016-11-15 13:31:32 -08:00
fc5ec87478 [cutorch mag2gen] move svd to generic 2016-11-15 13:31:32 -08:00
ed4023127b [cutorch mag2gen] move eig to generic 2016-11-15 13:31:32 -08:00
2bd4e5f5f6 [cutorch mag2gen] move symeig to generic 2016-11-15 13:31:32 -08:00
d2dcbc26f8 [cutorch mag2gen] move gels to generic 2016-11-15 13:31:32 -08:00
2f05eefe9a [cutorch mag2gen] code refactor to support generics; move gesv to generic 2016-11-15 13:31:32 -08:00
7d1afa78b9 [cutorch mag2gen] generic MAGMA memory allocator function 2016-11-15 13:30:49 -08:00
dac9b020e0 [cutorch potr*] API parity for potr* functions in cutorch 2016-11-15 13:28:37 -08:00
eb77b79df9 Merge pull request #839 from Atcold/fix_ASIMD
Fix compilation for ASIMD, fix #766
2016-11-15 12:57:57 -08:00
456998f043 Merge commit 'aeed8a6ea4650d1092289a60e71d8d83875a0ba6' 2016-11-15 12:55:11 -08:00
c09f07edd9 Merge commit 'c82537462baa715b2c70726f7da8f734b2ad3a3f' 2016-11-15 12:53:29 -08:00
aeed8a6ea4 Remove duplicate entries and add optional marks in THCUNN.h 2016-11-15 21:22:14 +01:00
c82537462b [cutorch] remove syncing point from baddbmm
This change removes HtoD copies inside baddbmm. These copies
introduce a syncing point which causes slow downs in a multi
gpu training.

Test plan: Run unittests for baddbmm.
2016-11-15 11:55:36 -08:00
a8a02ff560 Fix compilation for ASIMD
On ARMv8, NEON is inherent and is instead listed as 'asimd' in /proc/cpuinfo.
Replace assembly with C

Original authors:
 - @dusty-nv
    FindARM-patch.txt
    CMakeLists-patch.txt
 - @rtarquini
    NEON.c
2016-11-15 14:38:32 -05:00
72a9df19c8 Merge pull request #598 from killeent/rr2
move random functions to generic (attempt 2)
2016-11-14 11:44:41 -05:00
5b9b9634f9 [cutorch rand2gen] various fixes 2016-11-14 08:13:30 -08:00
c279a91c03 Merge commit '64c8a1377335799b322ca41d323dee13118be0ab' 2016-11-13 21:54:27 -08:00
ef6a764509 Merge commit '1cee5a359c2828800db0c41ebe0108bd5eef9501' 2016-11-13 15:23:11 -08:00
4db5afdf7e Merge commit 'f2daa616d105d700b63f05c4d544befb6e65a036' 2016-11-13 15:20:03 -08:00
7867187451 Merge commit '4f8e6ec42abd5b9b5491a49bdfe1a637e6675207' 2016-11-13 15:19:10 -08:00
4f8e6ec42a [PATCH] Improve potrf error message. (#189) 2016-11-13 15:17:05 -08:00
64c8a13773 Remove comment. 2016-11-11 15:46:44 -08:00
395ab4a287 Fix SpatialDilatedMaxPooling shape check.
In nn, indices are 3d, but they are 4d in cunn.
2016-11-11 15:43:54 -08:00
15dc862056 more improvements on error messages and shape checks. 2016-11-11 15:43:49 -08:00
f2daa616d1 Revert "Move random functions to generic" 2016-11-11 18:15:01 -05:00
64a50f5ad3 Merge pull request #589 from killeent/random-refactor
Move random functions to generic
2016-11-11 17:56:39 -05:00
1d0f86144c [cutorch rand2gen] fix illegal memory access in multinomial code, update unit tests 2016-11-11 13:23:03 -08:00
89e93bba9d [cutorch rand2gen] test fixes, add floor to geometric distribution transform 2016-11-11 13:23:02 -08:00
3290d4c7d6 [cutorch rand2gen] extend functions to use _double methods 2016-11-11 13:23:02 -08:00
ca22befc93 [cutorch rand2gen] move randn to generic 2016-11-11 13:23:02 -08:00
b08df5b9c0 [cutorch rand2gen] partial move of logNormal to generic, needs further debugging 2016-11-11 13:23:01 -08:00
ebd3c3291c [cutorch rand2gen] move geometric to generic 2016-11-11 13:23:01 -08:00
16728d2f26 [cutorch rand2gen] move multinomial to generic 2016-11-11 13:23:00 -08:00
34dab66f44 [cutorch rand2gen] move cauchy to generic 2016-11-11 13:22:59 -08:00
3a111c7499 [cutorch rand2gen] move exponential to generic 2016-11-11 13:22:59 -08:00
3600c94ec5 [cutorch rand2gen] move normal to generic 2016-11-11 13:22:58 -08:00
e2f8b00e00 [cutorch rand2gen] move bernoulli to generic 2016-11-11 13:22:58 -08:00
65ed1eba48 [cutorch rand2gen] move uniform, rand to generic 2016-11-11 13:22:57 -08:00
7fff7977fe [cutorch rand2gen] make sampleMultinomialWithoutReplacement utility function generic 2016-11-11 13:22:57 -08:00
add5922aac [cutorch rand2gen] make sampleMultinomialWithReplacement utility function generic 2016-11-11 13:22:56 -08:00
a94b54a533 [cutorch rand2gen] make sampleMultinomialOnce utility function generic 2016-11-11 13:22:56 -08:00
bea82b9da6 [cutorch rand2gen] make renormRowsL1 utility function generic 2016-11-11 13:22:56 -08:00
2e7debe282 [cutorch rand2gen] introduce THCTensorRandom.cuh, move and templatize simple binary search function 2016-11-11 13:22:55 -08:00
d57e1a6756 change to compile with msvc && export THCDescBuff for cunn 2016-11-11 13:56:13 +08:00
c9172c5bc9 change to work on windows && ptrdiff_t replacement 2016-11-11 13:33:36 +08:00
5d5e877a05 Fix implementation of logNormal 2016-11-10 18:35:45 -08:00
1e794c87ae adding bidirectional doc 2016-11-10 17:38:47 -08:00
d9cb1b545a Fix build on 32bit platform like JETSON TK1 2016-11-11 00:22:06 +00:00
23f611f14d Rename assertSameGPU_generic to assertSameGPU.
Also remove old assertSameGPU since there is no
longer both generic and non-generic support.
2016-11-10 15:40:41 -08:00
42b28d0d69 Merge pull request #370 from gchanan/sizeCheckErrorMessages
Improving error messages in nn.
2016-11-10 18:35:22 -05:00
d0cf5f7b65 Improving error messages in nn.
Differences from nn equivalent:
1) No changes to VolumetricConvolutionMM, which doesn't exist in cunn.
2) No changes to HardShrink, which doesn't exist in cunn.
3) LookupTable doesn't verify that all inputs are within range.
2016-11-10 15:12:35 -08:00
4699c817e8 [cutorch rand2gen] fix illegal memory access in multinomial code, update unit tests 2016-11-10 15:10:12 -08:00
4f490c16e9 [cutorch rand2gen] test fixes, add floor to geometric distribution transform 2016-11-10 13:44:55 -08:00
bcdab7a632 Remove mul/div from THCHalfAutoNumerics as they've been moved to
THCNumerics.
2016-11-10 12:13:41 -08:00
7f51af7cbc adding dropout, bidirection, etc. to RNN (#214) 2016-11-10 13:25:14 -05:00
b4ae60cac8 Protect half operations with CUDA_HALF_TENSOR with generic modules. 2016-11-10 08:59:23 -08:00
4d03d96e8b fix: cunn can't find cutorch sources
https://github.com/torch/distro/issues/138#issuecomment-259133935
2016-11-10 14:44:46 +03:00
a39ffebc3a Add THCTensor_(sizeDesc) for better debug messages. 2016-11-09 12:09:18 -08:00
4bba6082ed [cutorch rand2gen] extend functions to use _double methods 2016-11-09 11:55:51 -08:00
b111632965 [cutorch rand2gen] move randn to generic 2016-11-09 11:09:30 -08:00
0a34b34bfe [cutorch rand2gen] partial move of logNormal to generic, needs further debugging 2016-11-09 10:55:54 -08:00
6b821ece22 fixing trainer tests (#213) 2016-11-08 21:50:17 -05:00
d3b2096bfd trainer fix for new optim API 2016-11-08 15:49:03 -08:00
e64fca4b04 Allow wider test tolerances for:
1) Size of half numbers
2) Convolution weight/bias
3) BatchNormalization
2016-11-08 13:47:01 -08:00
b941e73f4f ArgCheck that dilation parameters are > 0 and ensure tests
pick dilation parameters > 0.
2016-11-08 13:46:52 -08:00
c57873d3cb Add generic support for LookupTable.
In some cases, does not do accumulation as accreal.
2016-11-08 13:46:48 -08:00
f3bc3275ac Add generic support for TemporalConvolution.
Has increased tolerance for backward weight/bias like other
Convolution modules.
2016-11-08 13:46:45 -08:00
8df26e6c5c Add generic support for VolumetricFullConvolution, VolumetricDilatedConvolution.
Has increased tolerance for backward weight/bias like other
Convolution modules.
2016-11-08 13:46:33 -08:00
5c8ecb8150 Fix one more compatibility bug in Python 3.3 2016-11-08 16:13:25 -05:00
3928f7740a Implement functional interface for Variables (torch.*) 2016-11-08 16:13:25 -05:00
1767f73e6b Add generic support for VolumetricConvolution.
Uses the higher tolerances for weight/bias that are used for
SpatialConvolution modules.
2016-11-08 13:07:35 -08:00
9e7d5e93ab Add generic support for VolumetricReplicationPadding. 2016-11-08 13:07:35 -08:00
70c6ee93a2 Add generic support for VolumetricAveragePooling. 2016-11-08 13:07:35 -08:00
5cbf8504ef Add generic support for VolumetricMaxPooling, VolumetricMaxUnpooling,
VolumetricDilatedMaxPooling.
2016-11-08 13:07:35 -08:00
9a393b023d Add generic support for TemporalMaxPooling. 2016-11-08 13:07:35 -08:00
30bf464f73 Rebase BatchNormalization. 2016-11-08 13:06:52 -08:00
9fb1f8934b Add support for L1Cost.
Changes thrust::reduce to thrust::transform_reduce in order
to be able to do summation at accreal precision.
2016-11-08 13:01:06 -08:00
f3f02b23a0 Add generic support for SparseLinear.
We don't support SparseLinear with fp16 because of lack of cusparseHcsrmm
(or equivalent Ex function) until CUDA 8.0.
2016-11-08 13:01:06 -08:00
7668cdd32c Add generic support for DistKLDivCriterion. 2016-11-08 13:01:06 -08:00
f9dafdcf09 Add generic support for ClassNLLCriterion. 2016-11-08 13:01:06 -08:00
d284a419c1 Add generic support for BCECriterion.
Test skips comparing vs lua version for half type, because hdot is
not currently implemented in cutorch.
2016-11-08 13:01:06 -08:00
b45844e3d9 Add generic support for L1SmoothCriterion. 2016-11-08 13:01:06 -08:00
6caa7e0fff Add generic support for MultiLabelMarginCriterion. 2016-11-08 13:01:06 -08:00
1669fffb8d Add generic support for MultiMarginCriterion.
Accumulation is done at accreal precision and changes target tensor
indexing to THCIndexTensor.
2016-11-08 13:01:06 -08:00
18aa86eebd Add generic support for MSECriterion. 2016-11-08 13:01:06 -08:00
075e49d3f4 Add generic support for SoftMarginCriterion. 2016-11-08 13:01:06 -08:00
a6695b8365 Add generic support for MarginCriterion. 2016-11-08 13:01:06 -08:00
06ee48b391 Add generic support for AbsCriterion. 2016-11-08 13:01:06 -08:00
fcaeffbbd4 Fix spacing in SpatialDilatedMaxPooling. 2016-11-08 13:01:06 -08:00
6146a9a641 Generic support for SpatialFullConvolution and SpatialDilatedConvolution.
Uses matrix multiply for matrix-vector multiply for half (no matrix-vector
implementation exists).
2016-11-08 13:01:06 -08:00
83de8e40d5 Add generic support for SpatialFractionalMaxPooling. 2016-11-08 13:01:06 -08:00
30590c46a3 Generic support for SpatialConvolutionMM.
Still need Hgemv.
2016-11-08 13:01:06 -08:00
a3a5e56287 Add generic support for SpatialConvolutionLocal. 2016-11-08 13:01:06 -08:00
185c96d63a Add generic support for SpatialUpSamplingBilinear.
Math is done at accreal precision.  At real precision,
forward pass fails, but backward passes.  We do backward
pass at accreal precision for consistency.
2016-11-08 13:01:06 -08:00
be61ad6eb4 Add generic support for SpatialUpSamplingNearest.
Accumulates as AccType.
2016-11-08 13:01:06 -08:00
222dfd2259 Add generic support for SpatialReplicationPadding. 2016-11-08 13:01:06 -08:00
b06e1c7e1d Add generic support for SpatialReflectionPadding. 2016-11-08 13:01:06 -08:00
6876abba51 Add generic support for SpatialSubSampling.
Half types fail on backward, probably because we don't consistently
accumulate in accreal.  This is difficult because gradInput is
accumulated directly (either with atomicAdd or not) rather than
in another variable.
2016-11-08 13:01:06 -08:00
0798466a01 Generic support for SpatialCrossMapLRN
Removed the C-linkage for a couple of functions because they are now generic --
not sure if they were used by anyone outside.
2016-11-08 13:01:06 -08:00
2cda782273 Add generic support for SpatialAveragePooling. 2016-11-08 13:01:06 -08:00
7d1c9554b6 Add generic support for SpatialAdaptiveMaxPooling. 2016-11-08 13:01:06 -08:00
a29d16f1a8 Use THCIndexTensors more generally. 2016-11-08 13:01:06 -08:00
6d0c1c0f17 Use indices for SpatialAdaptiveMaxPooling indices. 2016-11-08 13:01:06 -08:00
5ed4b5c25b Add generic support for SpatialMaxUnpooling. 2016-11-08 13:01:05 -08:00
6fe89c5e44 Fix tests 2016-11-08 13:01:05 -08:00
fda8c37641 Add generic support for SpatialMaxPooling.
Also fix tests for SpatialDilatedMaxPooling.
2016-11-08 13:01:05 -08:00
6d5a0ff3a1 Get SpatialDilatedMaxPooling generic working with long tensors as index.
Does as much math as possible in accreal to try to suss out why CudaHalfTensor fails.
2016-11-08 13:01:05 -08:00
f8718dd355 Add generic support for SpatialDilatedMaxPooling. 2016-11-08 13:01:05 -08:00
85af686797 Add generic support for SpatialClassNLLCriterion. 2016-11-08 13:01:05 -08:00
0f6ec3f15f Remove fastExpIfAvail and benchmarking from functional tests.
Also fix broken IFNDEF and test whitespace.
2016-11-08 13:01:05 -08:00
44644c50ee Reorganize THCHalfAutoNumerics. 2016-11-08 13:01:05 -08:00
9749f7eacc Add generic support for RReLU. 2016-11-08 13:01:05 -08:00
d9a2bdb9df Add generic support for PReLU.
This is the first instance of functions that take a lua number but
are not reals in C.  So, instead of automatically converting lua
numbers in the half case, we parse the function definitions to
find the argument positions to convert.
2016-11-08 13:01:05 -08:00
57e678c94b fix logsoftmax 2016-11-08 13:01:05 -08:00
516f127cfd Add generic support for LogSoftMax. 2016-11-08 13:01:05 -08:00
e477add103 Add generic support for SoftMax.
Math is done at accreal precision (e.g. for half,
math is done at float precision).  Originally code
called __expf, which doesn't have a double equivalent;
we call exp instead of converting down.
2016-11-08 13:01:05 -08:00
ba3d577875 Add generic support for ELU. 2016-11-08 13:01:05 -08:00
917e4f47c4 Add generic support for SoftShrink. 2016-11-08 13:01:05 -08:00
0143dac247 Add generic support for Square.
Math is (arbitrarily?) done at double precision to
keep the intent of existing code.
2016-11-08 13:01:05 -08:00
d2390f3616 Add generic support for Sqrt. 2016-11-08 13:01:05 -08:00
949ea73402 Add generic support for LeakyReLU. 2016-11-08 13:01:05 -08:00
d1e2fe0efe Add generic support for Threshold. 2016-11-08 13:01:05 -08:00
584ada12bf Add generic support for LogSigmoid.
This has the same logic as Sigmoid; i.e.
math is done at double precision and then
stored back at desired precision.
2016-11-08 13:01:05 -08:00
3ead72f654 Add generic support for Sigmoid.
This maintains the existing logic of doing the math in
double precision and converting back to the intended
type (previously: just float).  We do the same for
half here, although perhaps we should do the math
at float in that case.

There is some question about what to do with conversions;
Sigmoid did math in double before converting back to float;
we keep this intent, although there is some question on whether
this was intentional and for half -- should we just go up to
float or up to double?
2016-11-08 13:01:05 -08:00
9ce96d3bd3 Add generic support for Abs. 2016-11-08 13:01:05 -08:00
5549c003d9 Add generic support for HardTanh. 2016-11-08 13:01:05 -08:00
46105bf90b Add generic support for Tanh. 2016-11-08 13:01:05 -08:00
73ce3b3702 Add generic support for SoftPlus.
Adds the ability to "genericize" cunn modules that can exist
simultaneously with non-generic modules (i.e. modules can
be genericized one at a time).  Allowing both generic and
non-generic modules simultaneously requires some extra code
that can be removed once every module is genericized.
Also genericizes SoftPlus in this way.
2016-11-08 13:01:05 -08:00
1c6225dc2f [cutorch rand2gen] move geometric to generic 2016-11-08 10:47:28 -08:00
44874542c8 fix printing in console (#208) 2016-11-08 13:42:26 -05:00
31f2846aff [cutorch rand2gen] move multinomial to generic 2016-11-08 09:34:19 -08:00
bc08011e72 Don't longjmp out of omp loops in unpooling modules 2016-11-08 18:12:56 +01:00
7cccc216d0 ArgCheck that dilation parameters are > 0. 2016-11-08 18:12:56 +01:00
09493603f6 Change optimizer API 2016-11-08 18:12:56 +01:00
e799bd0ba9 Restrict in-place autograd ops to disjoint variables 2016-11-08 18:12:56 +01:00
40247b0382 Fix torch tests in Python 3.3 and 3.4 2016-11-08 18:12:56 +01:00
cd2e9c5119 [cutorch rand2gen] move cauchy to generic 2016-11-08 08:11:39 -08:00
0b6f7b12b1 [cutorch rand2gen] move exponential to generic 2016-11-08 08:04:26 -08:00
86e42ba291 Adding truncated tensor printing (#202)
* Adding truncated tensor printing
2016-11-08 10:05:30 -05:00
8c2f77cab6 updated autogen docs 2016-11-07 17:19:00 -05:00
c1bd6ba1e1 Zero-initialize outputs for BLAS functions 2016-11-07 22:50:56 +01:00
df59b89fbb Add more optimizers 2016-11-07 22:50:56 +01:00
8fd9cc160c [cutorch rand2gen] move normal to generic 2016-11-07 13:26:59 -08:00
28e3f07b63 adding apply function 2016-11-07 16:17:49 -05:00
513d902df1 adding __repr__ for nn 2016-11-07 16:17:40 -05:00
fce14a9f51 [cutorch rand2gen] move bernoulli to generic 2016-11-07 13:16:10 -08:00
884107da01 [cutorch rand2gen] move uniform, rand to generic 2016-11-07 12:27:30 -08:00
caa79a354a [cutorch rand2gen] make sampleMultinomialWithoutReplacement utility function generic 2016-11-07 10:33:03 -08:00
5bb873a2fe [cutorch rand2gen] make sampleMultinomialWithReplacement utility function generic 2016-11-07 10:28:19 -08:00
bc0442d7df [cutorch rand2gen] make sampleMultinomialOnce utility function generic 2016-11-07 10:15:13 -08:00
cfcd33552b [cutorch rand2gen] make renormRowsL1 utility function generic 2016-11-07 10:02:21 -08:00
5f6b9fd5ba [cutorch rand2gen] introduce THCTensorRandom.cuh, move and templatize simple binary search function 2016-11-07 08:31:19 -08:00
469dce4a2d skip test_scatter_gpu on no CUDA 2016-11-05 20:10:07 -04:00
55d32de331 Fix bugs in torch.legacy.nn and add regression tests 2016-11-05 22:48:52 +01:00
4491d2d3cb Expose ger, mv, mm, bmm as tensor methods 2016-11-05 22:48:52 +01:00
f9669b9b9a Merge pull request #583 from nicolasvasilache/master
THC UVA Allocator
2016-11-05 11:50:07 -04:00
246d5f37c7 THC UVA Allocator 2016-11-05 02:40:44 +00:00
293bfb03dd Merge commit '4def4e696b9079f587d0dba3e86423df5ea429b8' 2016-11-03 14:12:22 -07:00
4def4e696b fix result type 2016-11-03 14:10:49 -07:00
b6e58c030a enable dot for CUDA_HALF 2016-11-03 13:50:50 -07:00
bf00308ab2 Merge commit 'fd677945741b4ee353079911993ada3770e07f5c' 2016-11-03 13:31:12 -07:00
e3e786e35e Move source code checks from __getstate__ to torch.load (#200)
The __getstate__ and __setstate__ functions are called from copy.copy as
well as pickling. The source code inspection currently slows down the
data parallel code because it makes a copy of the object every
iteration.
2016-11-03 16:29:14 -04:00
fd67794574 Merge pull request #581 from torch/dotfix
making dot have an accreal return type (consistent with CPU)
2016-11-03 12:51:27 -04:00
0676cad200 Merge commit 'e644f6ed2c1965b0de55cc9037d5c75245f63d54' 2016-11-03 08:36:42 -07:00
3b1d217310 Merge commit 'e32af0196e10ad11b3938ad73ec5ef49cac7c03e' 2016-11-03 08:36:04 -07:00
93bcb2e7ba making dot have an accreal return type (consistent with CPU) 2016-11-02 16:40:54 -07:00
ebc70f7919 Look for libcudart in default CUDA installation paths (#195) 2016-11-02 19:36:10 -04:00
e32af0196e Merge pull request #828 from apaszke/lapack
Add more size checks and improve some LAPACK error messages
2016-11-02 18:53:45 -04:00
3e5c121c56 Adding !!inc to cwrap and splitting up TensorMethods.cwrap (#197)
* Adding !!inc to cwrap and splitting up TensorMethods.cwrap
2016-11-02 18:50:56 -04:00
e644f6ed2c Add supporting code for CUDA IPC
This adds three small pieces to help with sharing THCStorages across
processes:

 1. THCIpcAllocator: a THCDeviceAllocator to close shared memory handles in the
    child process.
 2. THCCachingAllocator_getBaseAllocation which returns the pointer and
    size of the underlying cudaMalloc allocation. This is necessary
because cudaIpcGetMemHandle requires 'base' pointers.
 3. Support for TH_STORAGE_VIEW in THCStorage_(free). This is useful in
    child processes to represent THCCachingAllocator allocations split
    from a larger cudaMalloc call.
2016-11-02 14:53:28 -07:00
551a7c72f3 Fix multiprocess serialization with "spawn" or "forksever" (#198) 2016-11-02 17:44:36 -04:00
05b121841e Add more size checks and improve some LAPACK error messages 2016-11-02 21:51:51 +01:00
c29aea89ee Merge pull request #827 from howard0su/freebsd
Fix compile error on freebsd
2016-11-02 16:10:50 -04:00
103e70ccc5 adding cuda types for tensor methods (#194) 2016-11-02 10:25:58 -04:00
ec7ecbe2dd Fix compile error on freebsd 2016-11-02 20:27:05 +08:00
7a06dbb87e Merge commit '1234e434fa2b6ddd440194c8bccd352593902c69' 2016-11-01 21:33:41 -07:00
1234e434fa TH_INDEX_BASE for nonzero 2016-11-01 21:08:52 -07:00
2d374f982e Changes for ccache nvcc support 2016-11-01 15:54:33 -04:00
4e73630a95 Fix criterion backward, that was modifying grad_output shape 2016-11-01 19:31:53 +01:00
e867baa5f9 Accept file paths in torch.save and torch.load 2016-11-01 19:31:53 +01:00
04b750cb52 Improve Parameter's __repr__ 2016-11-01 19:31:53 +01:00
97c7b12542 Fix Variable __setstate__ refcounting bugs 2016-11-01 19:31:53 +01:00
0dfec752a3 Merge commit 'f16f68e103dfc22921f6106ec7136ddc7a0ab087' 2016-11-01 10:38:13 -07:00
f16f68e103 CMake: Install generic/THCTensorMathScan.h 2016-11-01 16:07:07 +01:00
4b7f8f9b77 adding notes for compiling from source 2016-11-01 01:27:28 -04:00
9969d50833 fix for CPU-only builds 2016-11-01 01:19:37 -04:00
7355c63845 adding multiple types for dist 2016-10-31 21:26:19 -07:00
16cac6442a adding multiple types for cumsum, cumprod 2016-10-31 21:26:19 -07:00
5009ae5548 adding multiple types for pow, trace, diag, tril, triu 2016-10-31 19:26:08 -07:00
32647e285e implement torch.nonzero 2016-10-31 18:22:49 -07:00
6df334ea68 Improve potrf error message. (#189) 2016-10-31 18:48:29 -04:00
f8501042c1 Make _requires_grad Variable attribute writeable 2016-10-31 22:47:09 +01:00
be085b8f6c Allow marking non-leaf variables as non-requiring grad 2016-10-31 22:47:09 +01:00
ef557761dd Allow to not use all function outputs in autograd 2016-10-31 22:47:09 +01:00
15377ac391 Copy Module._buffers in nn.parallel.replicate (#180) 2016-10-31 12:12:29 -04:00
ad5fdef6ac Make every user-visible Tensor have a Storage (#179) 2016-10-31 12:12:22 -04:00
0cb5943be8 Fix NCCL reduce_scatter in Python 2.7 (#183) 2016-10-30 17:58:02 -04:00
fb593d5f28 Fix bugs in variable __setitem__ and improve __getitem__ 2016-10-30 00:16:06 +02:00
645c913e4f Print GPU id for CUDA tensors 2016-10-30 00:16:06 +02:00
b4f4cca875 Rename training and evaluation methods 2016-10-30 00:16:06 +02:00
6027513574 Add support for indexing with numpy types 2016-10-30 00:16:06 +02:00
849188fdab Fix multiprocessing 2016-10-29 14:23:23 -07:00
a9c14a5306 Remove unused code 2016-10-28 15:28:22 -07:00
2da36a14d1 Clean up cuDNN code and fix chooseBackwardFilterAlgorithm 2016-10-28 13:05:53 -07:00
2ee451f5f7 Build in Release mode 2016-10-28 12:51:19 -07:00
f2d7e94948 Use torch.Size for Tensor sizes and tuple for strides
See issue #20

The torch.Size class is a tuple subclass which distinguishes sizes from
other tuples so that torch.Tensor(size) is interpreted as size instead
of data.
2016-10-28 19:37:09 +02:00
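The distinction in a nutshell (illustrative):

 import torch

 s = torch.Size([2, 3])
 torch.Tensor(s).size()    # treated as a shape: builds a 2x3 tensor
 torch.Tensor([2, 3])      # treated as data: holds the values 2 and 3
 isinstance(s, tuple)      # True -- Size is a tuple subclass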
2031dfc08a Add hdot support for CUDA 8.
If not compiled with CUDA 8+, an error is raised indicating that
CUDA 8.0+ is required.
2016-10-27 15:01:09 -07:00
34ede14877 Fix compile error due to THCStorage change 2016-10-27 14:27:10 -07:00
2af3098e5a Merge commit '42e835ebb81a3ecf8f76e15bb1866c1427f61d74' 2016-10-27 13:49:23 -07:00
2e44511b13 Merge commit 'bbe8627a3f0e6cbb8fd1952826f75df741e44b01' 2016-10-27 13:47:36 -07:00
7bc4aa7e72 Merge commit '2bd36604e298547cc66f175588c925271223b4e9' 2016-10-27 13:46:38 -07:00
e2458bce97 Add Parameter class to nn 2016-10-27 22:31:36 +02:00
ae9789fccc adding input / output / member sections to the docgen 2016-10-27 01:11:53 -04:00
45ef25ea27 fix rnn documentation typos and format 2016-10-27 01:11:53 -04:00
ad2d413c0b Add C++ bindings for cuDNN (#167)
The Python ctypes bindings overhead was high enough that it slowed down
multi-gpu training when using 4+ Maxwell GPUs.
2016-10-26 19:51:48 -04:00
30924ff1e0 Fix test_nonzero flakiness (#173) 2016-10-26 19:50:56 -04:00
383c48968f Add support for indexing with ellipsis (#172) 2016-10-26 19:50:44 -04:00
bbe8627a3f Use 'void' for no-arg functions 2016-10-26 12:44:34 -07:00
2bd36604e2 Fix no-arg function prototypes 2016-10-26 12:35:05 -07:00
9ed47ef531 fix bug in mmapping 2016-10-26 07:23:04 -07:00
139f98a872 pushing THCState back to the header 2016-10-25 18:23:53 -07:00
c825895190 Make KwargsPlugin output deterministic 2016-10-26 00:19:33 +02:00
42e835ebb8 Add sameGPU checks to BatchNormalization (#361) 2016-10-25 15:19:03 -04:00
5505e1de7d Store the device in THCStorage 2016-10-25 07:21:54 -07:00
6d329e418b allocator updates 2016-10-25 07:07:52 -07:00
3a11afb57f some bugfixes for THC 2016-10-24 17:16:17 -07:00
df86e02c9e update nn docs 2016-10-24 17:20:00 -04:00
deebc1383e Show exponent when printing vectors 2016-10-24 22:30:11 +02:00
19f2f1a9d3 Buffer values when constructing a CUDA tensor from a sequence 2016-10-24 22:30:11 +02:00
4dc13ecdd8 Make tests deterministic 2016-10-24 22:30:11 +02:00
b4b6e356ef Fix clang warnings 2016-10-24 22:30:11 +02:00
9000f40e61 Add torch.from_numpy 2016-10-24 22:30:11 +02:00
f137c0c05a Improve error messages of stateless functions 2016-10-24 22:29:43 +02:00
b43a02a9aa Make random 0-based 2016-10-24 22:29:43 +02:00
30be715900 Add training and evaluation to torch.nn 2016-10-24 22:29:43 +02:00
71cf8e14cb Fixes in torch.legacy.nn 2016-10-24 22:29:43 +02:00
ffd4863b23 Don't build nccl on macOS 2016-10-24 22:29:43 +02:00
4c17098bb8 Fix platform detection in torch.cuda 2016-10-24 22:29:43 +02:00
bcfdd18599 Fix python2.7 compatibility and check cffi version in ffi utils 2016-10-24 22:29:43 +02:00
067662d280 making .numpy return writeable arrays (#164) 2016-10-24 16:23:28 -04:00
93d02e4686 Merge pull request #129 from adamlerer/cudnn_rnn
CuDNN + PyTorch RNN library
2016-10-24 15:00:02 -04:00
12de115305 Fix Lua->Python logic in legacy.optim 2016-10-24 20:04:23 +02:00
b5d13296c6 addressing comments 2016-10-23 21:11:22 -07:00
86288265ad Adding rnn cell library 2016-10-23 20:23:48 -07:00
a559d94a44 docs and such 2016-10-23 20:23:48 -07:00
1eb6870853 add nobias option to rnn 2016-10-23 20:23:48 -07:00
f88c3e9c12 fix some missing features in pytorch needed for RNNs 2016-10-23 20:23:48 -07:00
942ca477a6 Copying weights for CUDNN 2016-10-23 20:23:48 -07:00
b0e33fb473 cudnn + THNN match with parameters 2016-10-23 20:23:48 -07:00
d58b627b98 CUDNN RNN bindings 2016-10-23 20:23:48 -07:00
b85fc35f9a Fix for versions compiled without CUDA support (#155)
* Fix pytorch when compiling without CUDA support
* Skip print test with CUDA types if CUDA is not available
2016-10-23 13:03:10 +02:00
bcb466fb76 fix bug with numpy conversion and storageOffset > 0 (#154) 2016-10-22 11:56:18 -04:00
6db721b5dd Make DataLoader preserve the ordering of the dataset (#135) 2016-10-21 23:54:16 -04:00
140c65e52b fixing python setup.py clean 2016-10-21 23:20:02 -04:00
29e8d77ce0 Merge pull request #558 from gchanan/genericDeviceTensorUtils
Add generic type support for toDeviceTensor.
2016-10-19 18:19:13 -04:00
4d0d775d16 Add generic type support for toDeviceTensor. 2016-10-19 14:36:03 -07:00
98f67e90d5 Fix super call in Container.modules and Container.parameters (#142) 2016-10-19 13:21:03 -04:00
fee67c2e1a Allow parameters and child modules to be assigned by attribute (#136)
For example:
  self.linear = nn.Linear(10, 20)
  self.weight = torch.autograd.Variable(torch.Tensor(10, 20))
2016-10-18 23:34:20 +02:00
c295f26a00 Support async argument to Variable.cuda (#137) 2016-10-18 23:27:11 +02:00
8a09c45f28 Fix typo 2016-10-18 09:29:19 -07:00
79ead42ade Add CUDA Stream and Event API (#133) 2016-10-18 12:15:57 -04:00
94e52e1d17 Fix Variable.cat 2016-10-17 15:36:08 -07:00
3931beee81 Use THSetNumThreads instead of omp_set_num_threads
Set OMP num threads to one in the data loader.

Fixes #81
Fixes #82
2016-10-17 15:15:00 -04:00
d293c17d21 Merge commit '1a3920e5dc546803ec8ada369ff1b0d56cf24e76' 2016-10-17 10:29:41 -07:00
1a3920e5dc Expose OpenMP num threads through TH lib
Expose omp_set_num_threads and similar APIs through the TH lib. This
means third-party libraries using TH don't need to be compiled with
OpenMP support just to control the number of TH OMP threads.
2016-10-17 10:09:10 -07:00
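Assuming the TH-level setter is surfaced in Python as torch.set_num_threads, third-party control looks like:

 import torch

 torch.set_num_threads(1)           # routed through TH, no OpenMP link required
 print(torch.get_num_threads())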
ffc3eb1a24 Exclude THNN Linear in favor of Python implementation 2016-10-17 09:53:20 -07:00
2f5d4a7318 gcc 5 + cuda < 8 workaround improved 2016-10-17 12:46:21 -04:00
70553f4253 gcc 5 + cuda < 8 workaround improved 2016-10-17 12:45:45 -04:00
8d39fb4094 Use new THC API for device allocator 2016-10-17 09:35:41 -07:00
7d10b2370f Merge commit 'ec7a2878013ec70a4d4a8bfb6f5e5503f87f9ea0' 2016-10-17 09:35:04 -07:00
31ec7650ac Merge commit '429f2d67652f4fcba0bbf65c7d3e109e136a9cdf' 2016-10-17 09:33:06 -07:00
c014920dc1 Merge commit 'b01c78580594c53e6afb02b3d2110577a4673308' 2016-10-17 09:32:01 -07:00
17e3d4e1ee Merge commit '38cb3d02270b9e558a891a9a2bef01a75d1bd9e1' 2016-10-17 09:31:38 -07:00
b01c785805 Fix cutorch.getStream()
state->numUserStreams does not include the NULL stream, which is stored
in res->streams[i]
2016-10-17 08:49:23 -07:00
0eea71f878 torch.cat for multiple cuda types 2016-10-17 01:56:33 -04:00
429f2d6765 fixes to upsampling bilinear API 2016-10-17 00:30:25 -04:00
a0c7e3cf04 Merge pull request #550 from colesbury/streams
Add stream API that is not based on indices
2016-10-16 19:08:03 -04:00
9cd68129da fixing typo 2016-10-16 19:07:09 -04:00
6fa9c87aa4 Merge pull request #548 from BTNC/win-msvc
make cunn compile with msvc && fix compilation failure for linux/mac os
2016-10-15 22:07:52 -04:00
ee14cf9438 Add support for pinned memory: (#127)
torch.Storage/Tensor.pin_memory()
 torch.Storage/Tensor.is_pinned()
2016-10-15 18:38:26 -04:00
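Usage per the commit message:

 import torch

 t = torch.FloatTensor(1024)
 p = t.pin_memory()     # copies into page-locked host memory
 p.is_pinned()          # True; the original tensor is left unpinned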
0391bbb376 Fix view_as and view for empty tensors (#128) 2016-10-15 18:33:05 -04:00
28ada0c634 update md docs 2016-10-14 18:56:24 -04:00
2c233d23ad Add stream API that is not based on indices
This implements the THC code so that we can expose streams as objects
instead of simply referring to them by indices. This is not exposed in
Lua yet.
2016-10-14 15:25:38 -07:00
59c628803a fixing padding_idx option 2016-10-14 15:05:21 -07:00
6b830bc77f Merge pull request #78 from colesbury/nccl
Use NCCL in comm.py if available
2016-10-14 17:44:11 -04:00
f30081a313 Use NCCL bcast and reduce functions in comm 2016-10-14 14:16:32 -07:00
c15648c6b5 Add NCCL build scripts 2016-10-14 14:16:32 -07:00
a02917f502 Fix typo 2016-10-14 14:07:29 -07:00
70d8bd04c0 Make cuDNN descriptors extend object
Fixes weird double __del__ issue
2016-10-14 13:58:20 -07:00
ad2cee0cae Fix caching allocator when used from multiple Lua threads
Use a single, global THCCachingAllocator instance.

Previously, each Lua thread had its own THCCachingAllocator instance.
However, threads can share storages, which means a segment could be
allocated on one THCCachingAllocator and freed on another, which
breaks.

Fixes #539
2016-10-14 10:08:56 -07:00
756a7122ad torchdoc 2016-10-14 04:18:10 -04:00
3d6ebde756 qr and ormqr tests and bugfix 2016-10-14 03:10:16 -04:00
daa30aa992 fix typo 2016-10-13 23:11:32 -07:00
39459eb238 make cunn compile with msvc && fix compilation failure for linux/mac os 2016-10-14 12:54:00 +08:00
0325e2f646 Major autograd refactor
Improves autograd performance by more than 2x and fixes a couple
of bugs. All core functions have been moved to C.
2016-10-13 17:17:49 -07:00
93b8b5631f Improve CUDA tensor constructor speed 2016-10-13 17:16:39 -07:00
60ab1ce0c1 Stop using contextlib for device and device_of 2016-10-13 17:16:39 -07:00
2f186df52d removing CUDA_HALF_INSTRUCTIONS and enabling hgemm only for P100 2016-10-13 16:52:40 -07:00
452e07d432 Revert "change to work on windows && replace long with ptrdiff_t" 2016-10-13 18:09:34 -04:00
05d1404b9c Revert "changes to make cunn compile on windows with msvc" 2016-10-13 18:08:56 -04:00
534b9a1697 Bump to 1.3.1 2016-10-13 10:33:05 -07:00
b2781d0501 Fix primitives function prototype 2016-10-13 10:32:42 -07:00
bf7d1514f7 NVML (libwrap) : import the needed definitions 2016-10-13 10:28:59 -07:00
2acee24332 Add keyword argument support to most tensor functions 2016-10-13 12:32:04 -04:00
e7639e55f8 change to work on windows && replace long with ptrdiff_t 2016-10-13 23:44:28 +08:00
eb3ac2b367 changes to make cunn compile on windows with msvc 2016-10-13 22:22:23 +08:00
968d386b36 Make atomicAdd functions static inline. 2016-10-12 15:18:30 -07:00
38cb3d0227 Fix build when NEON is supported 2016-10-12 12:51:22 +00:00
6f606dd5f9 updating nn docs 2016-10-11 14:41:25 -04:00
bab616cf11 Fix OOM error message in tensor constructor 2016-10-10 20:51:15 -07:00
966adc6291 Simplify torch.cat 2016-10-10 20:51:15 -07:00
518cb6ec7c Allow specifying output size in MaxUnpooling 2016-10-10 20:51:15 -07:00
34bcd4c237 Rename FullConv to ConvTranspose and allow specifying output size 2016-10-10 20:51:15 -07:00
a121127082 Merge remote-tracking branch 'upstream/master' into more-generic-functions 2016-10-10 10:09:43 -07:00
50326e94b1 try loading cudnn 5.1.5 then 5.1.3, in that order. This is needed because cudnn for cuda 7.5 ships with 5.1.3 and cudnn for cuda 8.0 ships with 5.1.5 2016-10-09 22:26:43 -04:00
160723b5b4 fix cudnn lib name 2016-10-09 21:19:50 -04:00
7991125293 Improve error messages 2016-10-08 20:37:40 -07:00
96f61bff30 Add LAPACK functions 2016-10-08 20:37:37 -07:00
a94488f584 replace long with ptrdiff_t for memory size/offset, element count 2016-10-08 21:39:16 +08:00
f2cf673d3a fix tensor printing when the tensor is a view into a giant storage 2016-10-07 17:53:37 -04:00
c4595a3dd6 [cutorch refactor] addcmul/addcdiv to generic 2016-10-07 13:09:05 -07:00
8bb06c94be Improved allreduce segmentation for small sizes 2016-10-07 12:42:23 -07:00
1620c56808 [cutorch refactor] cmin/cmax to generic 2016-10-07 11:50:28 -07:00
e88e0026b1 [cutorch refactor] make dist(...)'s op generic, add missing unit test 2016-10-07 11:50:28 -07:00
ace9b49e28 [cutorch refactor] move cross(...) to generic 2016-10-07 11:50:28 -07:00
da90751add [cutorch refactor] move lerp(...) to generic 2016-10-07 11:50:28 -07:00
8cc566f7b5 [cutorch refactor] move clamp(...) to generic 2016-10-07 11:50:28 -07:00
02ad199905 [cutorch refactor] make var(...) generic 2016-10-07 11:50:28 -07:00
c3e0811d86 [cutorch refactor] cleanup code in prep for review 2016-10-07 11:50:28 -07:00
499d1c5709 [cutorch refactor] fixes for norm, wrap/test 2016-10-07 11:50:28 -07:00
cf16ec45e1 [cutorch refactor] move stdall into generic, wrap test for std 2016-10-07 11:50:27 -07:00
daa15dcceb [cutorch refactor] move varall into generic 2016-10-07 11:50:27 -07:00
32556cbe5e [cutorch refactor] move normall to generic 2016-10-07 11:50:27 -07:00
74d9c674f5 Make _norm(...)'s ops generic 2016-10-07 11:50:27 -07:00
a4da558fa0 [cutorch refactor] move mean function into generic/ 2016-10-07 11:50:27 -07:00
dba6d1d57f Make _norm(...)'s ops generic 2016-10-07 11:50:27 -07:00
b01c4338c9 [cutorch refactor] move std function into generic 2016-10-07 11:50:27 -07:00
811d947da3 [cutorch refactor] move renorm function into generic 2016-10-07 11:50:27 -07:00
de7bf7efe6 [cutorch refactor] move std function into generic 2016-10-07 11:50:27 -07:00
5537df9927 [cutorch refactor] make _renorm(...)'s ops generic 2016-10-07 11:50:27 -07:00
81fea93741 [cutorch refactor] move std function into generic 2016-10-07 11:50:27 -07:00
df1065a2d8 Move _std dependencies into THCTensorMathReduce.cuh 2016-10-07 11:50:27 -07:00
c2e3bf2145 [cutorch refactor] move meanall function into generic/, update cwrap for lua mean 2016-10-07 11:49:33 -07:00
a4d849ef68 [cutorch refactor] move mean function into generic/ 2016-10-07 11:49:33 -07:00
957c9f3853 Move atomicAdd functions to THCAtomics.cuh in order to share
definitions with other projects, e.g. cunn.
2016-10-07 11:43:02 -07:00
3958b6b0e1 Merge pull request #338 from nitsky/spatial_logsoftmax
SpatialLogSoftMax
2016-10-07 10:36:40 -04:00
5d70feb573 bug fix for wrong usage of checkGPU && port to windows with msvc 2016-10-07 15:55:38 +08:00
a22af69335 Add versioning and shared storage handling to autograd (#105) 2016-10-06 17:12:58 -04:00
1213149a2f add bias option to linear; allow modules to return nested lists/tuples of tensors (#106)
* add bias option to linear; allow modules to return nested lists/tuples of tensors
2016-10-06 15:59:12 -04:00
398b6f75cd update nn.md 2016-10-05 14:56:41 -04:00
e46e05e7c5 fix container doc 2016-10-05 14:53:41 -04:00
166028836d Ignore graph parts not requiring gradient in engine 2016-10-05 08:46:34 -07:00
3cbe66ba8c Change requires_grad default to False 2016-10-05 08:46:34 -07:00
99de537a2e Remove CUDA sync points from losses and trainer 2016-10-05 08:46:31 -07:00
1d0afdf9f7 Make requires_grad read only (except for leaves) 2016-10-05 07:55:07 -07:00
4db6667923 Allow specifying per-parameter optimization parameters 2016-10-04 18:21:50 -07:00
80e16e44aa Check container source on load 2016-10-04 17:41:12 -07:00
58b134b793 Allow exporting optimizer state as a dict 2016-10-04 17:33:49 -07:00
6efefac2df Add parameter_dict and load_parameter_dict methods for modules 2016-10-04 14:47:56 -07:00
0c9670ddf0 Allow remapping storages at load time and serialize data in little endian order 2016-10-04 12:54:55 -07:00
53c65ddc6a Fix memory leak when constructing a tensor from numpy (#98) 2016-10-03 23:27:54 -04:00
33371c5164 ffi tests skip on cuda 2016-10-03 12:15:28 -07:00
64dd1419c5 Fix Variable indexing bugs (#96) 2016-10-03 14:49:21 -04:00
108068a417 python 2.7 fixes 2016-10-03 00:14:06 -07:00
6e8ed95ada ‘fix compilation error: 'orr' loop initial declarations are only allowed in C99 mode 2016-10-03 14:11:59 +08:00
39c9f9e9e8 replace long with ptrdiff_t for memory size/offset etc 2016-10-03 12:55:30 +08:00
b555588f5d Make THNN lazy init thread safe 2016-10-02 21:36:05 -07:00
47ef4bb0a0 Fix memory leak in torch.cat 2016-10-02 21:36:05 -07:00
b34654bf97 Merge commit 'ab0e86ae4b0a08b8d0a67f1494ff80e65a6932ad' 2016-10-02 20:58:29 -07:00
6068df3ab2 Merge commit '60a8a9e918e04fd5581d20e4e7527dd115c69cd8' 2016-10-02 20:56:33 -07:00
bb35999f51 Merge commit '25c51c49aa3bb9ac5f64560a46f1f2a905f4e3f7' 2016-10-02 20:55:38 -07:00
25c51c49aa adding stdc++ static linking on TH_BINARY_BUILD=1 always, because caching allocator uses c++ 2016-10-02 20:48:35 -07:00
833bedb46b cudnn relative check in binary builds 2016-10-02 11:45:46 -07:00
3d8eba7b42 updating readme with new info 2016-10-02 10:13:15 -07:00
ab0e86ae4b fix arm neon bug 2016-10-02 08:35:40 -07:00
94b35312d0 Compile fixes for picky compilers / stl versions (#518)
* Compile fixes for picky compilers/stl versions
2016-10-02 00:41:47 -04:00
f4ebc65a12 Add Module.modules() and Module.children() (#90)
modules(): returns an iterator over all modules in the network
 children(): returns an iterator over immediate children

Also fix __getitem__ in Sequential
2016-10-01 21:18:53 -04:00
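Illustrative usage:

 import torch.nn as nn

 net = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
 list(net.children())   # the three immediate submodules
 list(net.modules())    # the Sequential itself plus every descendant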
2bc9da4f5e Support "device" keyword argument (#79)
Adds the optional "device" keyword argument to Tensor and Storage
constructors and .new methods.
2016-10-01 19:32:55 -04:00
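A hypothetical sketch of the keyword; the device indices are illustrative:

 import torch

 x = torch.cuda.FloatTensor(10, device=1)   # allocate directly on GPU 1
 y = x.new(10, device=0)                    # .new methods accept it too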
e034f258e3 Fix ffi utils in Python 2.7 2016-10-01 15:37:05 -07:00
39adf6dbd2 Merge pull request #80 from colesbury/data
Fixes to trainer and data loading
2016-10-01 16:50:42 -04:00
112df5f664 Fixes to trainer and data loading
1. Wrap target in a Variable in trainer
2. Collate numbers into torch.Long/DoubleTensors
2016-10-01 13:21:16 -07:00
3564b77553 a couple of changes for win32 (#779)
* windows timer with milliseconds
2016-10-01 15:27:30 -04:00
c813e93d85 fixing python 3 compat 2016-09-30 16:44:00 -07:00
ff59385034 Add 'torch/lib/nccl/' from commit 'ca330b110ae76ace344182ab83a028911111cc36'
git-subtree-dir: torch/lib/nccl
git-subtree-mainline: ea4f812a123a99d3beda1fdf4a2197035981eccb
git-subtree-split: ca330b110ae76ace344182ab83a028911111cc36
2016-09-30 16:35:16 -07:00
ea4f812a12 Fix Container.parameters() 2016-09-30 16:31:36 -07:00
dbe540e49f Use the custom TH error handler in all threads by default 2016-09-30 14:59:50 -07:00
c1c0969834 Allow changing the default error handler for all threads
THSetErrorHandler still modifies per-thread pointers, but
THSetDefaultErrorHandler allows to set a handler that's
used by all threads that haven't specified any function.
2016-09-30 14:59:50 -07:00
b87f26ce26 windows high resolution timer with a few makefile changes (#776)
windows high resolution timer
2016-09-30 14:59:50 -07:00
67335e638c bug fix for read/writeLong in THMemoryFile 2016-09-30 14:59:50 -07:00
90916f34a7 fix cpuid ecx; change to compile with msvc 2016-09-30 14:59:50 -07:00
11b38a6895 Add more functions to autograd 2016-09-30 16:37:07 -04:00
a1f5fe6a8f Add multiprocess data loader + improvements to torch.utils.data 2016-09-30 16:23:43 -04:00
5cad164dee Merge pull request #73 from colesbury/THC
Update THC and THCUNN
2016-09-30 15:53:11 -04:00
7dd28b885d Allow changing the default error handler for all threads
THSetErrorHandler still modifies per-thread pointers, but
THSetDefaultErrorHandler allows to set a handler that's
used by all threads that haven't specified any function.
2016-09-30 12:37:58 -07:00
c20828478e Update Module.cpp for THC changes 2016-09-30 11:13:14 -07:00
3e1c88e3e0 Merge commit 'da1e3f084d237ba319a22987f95f70abb69d7745' 2016-09-30 11:07:46 -07:00
e98a4ea336 Merge commit '0b0a62420c52b6e4d4c80c36d067db4654d1ed8d' 2016-09-30 11:06:53 -07:00
e8a5f00866 Auto GPU for CUNN (#71) 2016-09-30 14:04:53 -04:00
d92b7da733 fix documentation to not use forward 2016-09-30 09:49:30 -07:00
7ff16baa7d Use breadth-first in ExecutionEngine (#72) 2016-09-29 23:57:37 -04:00
93e60715af Fix error message 2016-09-29 16:27:20 -07:00
14965cfce9 Run cuDNN operations on the correct device 2016-09-29 16:27:07 -07:00
da1e3f084d Fixes for https://github.com/torch/cutorch/pull/519 2016-09-29 16:19:41 -07:00
0b0a62420c Make some basic THC operations thread-safe
Switching the device, setting the stream, and switching BLAS handles is
now thread-safe. Some other operations, like reserveStreams, are still
not thread-safe.
2016-09-29 16:17:43 -07:00
c92c82aa1a Really fix utils tests... 2016-09-29 12:52:12 -07:00
4742c08c7c Improve error messages in autograd 2016-09-29 12:16:19 -07:00
9c6ced1c0a Disable ffi tests if cffi is not available 2016-09-29 12:16:19 -07:00
a33c9bd774 Improve argument matching in invalidArguments 2016-09-29 12:16:19 -07:00
c8a4734b97 Add RReLU to both nn packages 2016-09-29 11:33:34 -07:00
3f7ab95890 Finish implementation of prng related functions 2016-09-29 11:33:25 -07:00
2d8c2972ae Only allow leaf variables as module parameters 2016-09-29 11:31:26 -07:00
941cf4e63d Add ffi utils for user C extensions 2016-09-29 09:35:56 -07:00
57610a7471 Fix documentation for MaxUnpool2d (#68) 2016-09-29 10:02:34 -04:00
f5a6a3b0e9 Fix torch.nn.Module._apply with None types (#66) 2016-09-28 19:31:07 -04:00
bab7f89cdc Fix no_bias constructor for conv2d (#65) 2016-09-28 19:30:43 -04:00
cb5d4e836f Lazy load CUDA and THNN modules (#64) 2016-09-28 19:29:53 -04:00
3a5544f060 Add support for GenerateFloatTypes, for use with cunn. 2016-09-28 09:59:19 -07:00
412019dbe4 fixing CPU builds by making cuda imports optional 2016-09-28 11:56:18 -04:00
f9d9c92560 Fix type conversions in autograd 2016-09-27 15:45:52 -07:00
7f4ff0e615 Fix type conversions in nn 2016-09-27 15:45:49 -07:00
3eac7164f4 Add data parallel functions to nn 2016-09-27 15:45:45 -07:00
f9d25e8e72 Refactor nn (require specifying parameters explicitly) 2016-09-27 15:22:26 -07:00
52ed57352a Free GIL in C functions 2016-09-27 15:22:20 -07:00
1828e7c42f Add async CUDA copy 2016-09-27 15:12:48 -07:00
2c89ae4e8a Rename getDevice to get_device 2016-09-27 15:12:48 -07:00
779a460030 Add cuDNN support for convolutions (#36) 2016-09-27 17:55:04 -04:00
0312f939d6 Only set c++11 compiler flags on THCCachingAllocator.cpp 2016-09-27 13:13:59 -07:00
89666fc4fe Fix SpatialLogSoftMax memory leak and code cleanup 2016-09-27 08:16:31 -07:00
44527ab5be fix c++11 flags thing 2016-09-27 09:26:21 -04:00
a0cf6658c5 windows high resolution timer with a few makefile changes (#776)
windows high resolution timer
2016-09-27 08:59:27 -04:00
5107f23126 fix ClassNLLCriterion targets in tests and legacy nn 2016-09-26 18:56:12 -07:00
4a5557203b Merge commit 'c020a8502bd943aa37f897efe79a01fd61249ab4' 2016-09-26 17:54:05 -07:00
c020a8502b making ClassNLLCriterion targets consistent between cpu and cuda 2016-09-26 17:48:17 -07:00
44481354fc Add back support for child=None in Container constructor (#55)
It's often useful to have optional child modules, such as the
downsampling operation in ResNets. Add a test for this case:

  nn.Container(
    child=None,
  )
2016-09-26 17:18:02 -04:00
974fb1b09a Merge pull request #57 from colesbury/THC
Update THC and use CUDA caching allocator
2016-09-26 16:29:02 -04:00
4e9f0a8255 Use CUDA caching allocator 2016-09-26 13:12:39 -07:00
fa1f286cae Merge commit '85bd287b7ba481312fa58d7ffb32cba901c58829' 2016-09-26 13:08:32 -07:00
85bd287b7b Add THC_CACHING_ALLOCATOR=1 to README.md 2016-09-26 13:02:48 -07:00
0eff3897e3 Update SpatialLogSoftMax kernel to use cuda dimensions 2016-09-26 09:39:56 -07:00
e26e35a9ee bug fix for read/writeLong in THMemoryFile 2016-09-26 10:45:10 +08:00
980300b381 Combine autograd.Leaf and autograd.Variable (#52)
Prior to this change, there was a circular reference between Leaf and
Variable. This meant that the objects (and referenced Tensors) were not
collected as soon as they went out of scope, which led to higher memory
usage and out-of-memory errors.
2016-09-25 20:21:14 -04:00
1cf87e8a0b OSX + Python 2 build fixes 2016-09-25 19:26:13 -04:00
817d860af5 Add CUDA caching allocator
The allocator can be enabled by setting the environment variable
THC_CACHING_ALLOCATOR=1
2016-09-25 12:57:50 -07:00
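A minimal sketch of opting in from Python, assuming the variable must be set before the CUDA backend initializes:
```python
import os

# must be set before torch initializes its CUDA state
os.environ["THC_CACHING_ALLOCATOR"] = "1"

import torch  # imported after setting the env var on purpose
```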
0be5031a93 Pretty print type mismatches in error messages 2016-09-25 12:26:00 -07:00
1ed488da4f Make custom precision of CUDA tests work in inplace mode as well 2016-09-25 12:26:00 -07:00
ddf1598ef8 Add a method for catching exceptions thrown in ctypes 2016-09-25 12:25:54 -07:00
4a8a185aa4 Preserve storage view sharing in torch.save and torch.load 2016-09-25 12:24:10 -07:00
4cdeae3283 Return only unique variables from parameters() 2016-09-25 12:23:43 -07:00
5030d76acf Reduce precision of CUDA blas tests 2016-09-23 21:10:28 -07:00
c51e2c8b8c Rename CELoss to CrossEntropyLoss 2016-09-23 18:06:44 -07:00
eec0420eb3 Initialize nn modules' parameters with a default tensor type 2016-09-23 18:06:26 -07:00
e66ea56bb3 Improve THNN tensor type mismatch error messages 2016-09-23 18:06:26 -07:00
eefa0c7b40 Require torch.nn.cuda automatically when calling .cuda() 2016-09-23 18:06:26 -07:00
a489884da4 Reduce precision of addmm CUDA test 2016-09-23 17:52:08 -07:00
7a74d3fc9e Fix dl flag module in python>=3.6 2016-09-23 17:25:10 -07:00
e71204b52f Improve error messages in storage and tensor C functions 2016-09-23 17:17:35 -07:00
ca330b110a Add scan tests 2016-09-22 11:58:33 -07:00
6c77476cc1 Make tests check for deltas and report bandwidth 2016-09-22 11:58:28 -07:00
cabd6848e4 Heavy code refactoring to remove a lot of code in collectives (~1000 lines).
Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.
2016-09-22 11:57:56 -07:00
e3dbc6110e Add profiling API 2016-09-22 11:56:51 -07:00
1d6715fe20 Fix MPI test path 2016-09-22 11:56:20 -07:00
06ab3f962f Refactor _C extension to export some utilities 2016-09-21 08:36:54 -07:00
df77a8a81a Update LogSoftMax to work in spatial domain 2016-09-21 08:11:59 -07:00
94b7c32eb3 compiling double atomicAdd only if CUDA_ARCH < 6000, because it's now included in CUDA 2016-09-20 20:42:23 -04:00
8fdec15a55 Codemod to remove camel case method naming 2016-09-20 08:40:28 -07:00
e8b1217b28 Use bitwise operations for atomicAdd rather than byte_perm or pointer dereferences.
Also properly check that half is enabled.
2016-09-19 14:00:52 -07:00
f56f06d88d fix cpuid ecx; change to compile with msvc 2016-09-19 14:41:48 +08:00
0f7a1e27d0 updating auto-generated docs 2016-09-19 00:39:46 -04:00
5114d94ad9 docstrings for conv, dropout, linear, pooling and sparse functions 2016-09-19 00:31:22 -04:00
f74c42bf00 Slightly improve THNN error messages 2016-09-18 15:02:25 -04:00
a8e816f450 Fix maskedSelect test 2016-09-18 12:54:12 -04:00
a90c259eda Add myself to LICENSE file 2016-09-18 12:53:57 -04:00
e223564a55 Fix multiprocessing on OS X 2016-09-16 18:27:07 -04:00
7847d77405 Add more functions to autograd 2016-09-16 15:26:24 -07:00
089d223922 Add support for CUDA indexAdd
Adds indexAdd via atomicAdd for unsigned char, char, short, long,
half, double.  Integer types are templatized based on sizeof.
Floating point types are implemented via intrinsics.
2016-09-16 12:50:57 -07:00
930085ec9c fixing doc2md for code blocks 2016-09-16 13:34:12 -04:00
b5f7720ab9 docstrings for container and batchnorm 2016-09-16 05:31:36 -04:00
a0a2d9885a adding docstrings for activation functions 2016-09-16 03:40:24 -04:00
0557143801 adding markdown docs and docutils 2016-09-16 03:34:20 -04:00
64a15928d7 Fix tests on python 3.3 2016-09-15 20:29:08 -07:00
d1fda539b7 Fix nn serialization errors 2016-09-15 19:28:34 -07:00
95d545e75b Fix a multiprocessing GIL issue 2016-09-15 18:49:20 -07:00
491fbfdc8c Improve error messages of tensor methods 2016-09-15 18:49:20 -07:00
5d24432322 Fix errors when printing tensors with inf and nan values 2016-09-15 18:49:20 -07:00
da5bb373e6 Type conversions now use auto gpu 2016-09-15 18:48:27 -07:00
6584b35db2 Add getDevice for CUDA storages 2016-09-15 18:48:27 -07:00
e5874ea40d Add getDevice for CUDA storages 2016-09-15 13:54:39 -07:00
4bad029fd4 Add more functions to autograd 2016-09-15 13:01:24 -07:00
7920b9229b Update TH 2016-09-15 12:47:53 -07:00
9ee6189bf9 Merge pull request #41 from jia-kai/master
Some minor fixes for compile/usage
2016-09-15 09:45:52 -07:00
9b7d47935a Fix mode 2016-09-15 07:37:05 -07:00
19ec206bad reducing tolerance in cumprod unit test 2016-09-14 15:53:14 -07:00
fb39971464 Add more modules to nn 2016-09-14 11:05:56 -07:00
3ea1da3b2c Minor fix in CUDA module 2016-09-14 11:09:03 -04:00
a0fb1ab86e Reduce precision for addmm and rsqrt CUDA tests 2016-09-14 11:08:53 -04:00
60f4d285af Update THCUNN 2016-09-14 11:08:37 -04:00
aa092f03d8 Update THNN 2016-09-14 11:08:25 -04:00
f2cb3e0b7b Mark BCECriterion weights as optional in THCUNN.h 2016-09-14 10:36:59 -04:00
1d9b10d312 libshm needs static libstdc++ on binary build 2016-09-13 13:13:58 -07:00
ccf7a3043f fixing MaxPooling for changed THNN interface 2016-09-13 11:44:17 -07:00
26a614b4d1 Merge commit '788ff68c1f273185092710abb269fe550f0fe196' 2016-09-13 11:21:04 -07:00
297fa957f7 Merge commit 'a442b5f5cc0e250653688bab8b4be93bfd3934ed' 2016-09-13 11:17:04 -07:00
05fb544f23 Merge commit '73d15cf64320b4b77e7393efa1bf1e913404cfd6' 2016-09-13 11:16:09 -07:00
96cd92a0a9 fix OSX build 2016-09-13 10:34:13 -07:00
31c45e0a08 adding verbose run_test.sh 2016-09-13 10:34:13 -07:00
65d4055366 adding static linking on binary builds 2016-09-13 10:34:13 -07:00
1f2695e875 adding cuda driver check functions for runtime checking 2016-09-13 10:34:13 -07:00
59556d0943 modifying wrappers for new cuda math functions 2016-09-13 10:34:13 -07:00
c65e795435 Merge commit 'eb6419f02bea8bca3a8ff1791d0a9f2a2e733035' 2016-09-13 10:33:43 -07:00
9842be4b15 setting default dampening value to 0 2016-09-13 10:28:33 -07:00
eb6419f02b adding binary build options 2016-09-13 10:04:31 -07:00
a6d1f6aee4 making sure cusparse is linked when MAGMA is 2016-09-12 21:36:26 -07:00
73d15cf643 moving arch detection into THCUNN 2016-09-13 00:09:25 -04:00
a4dd9d0b86 spit out CUDA_NVCC_FLAGS 2016-09-12 22:11:36 -04:00
49cd6f99d8 only enable CUDA half instructions above 8.0 2016-09-10 16:52:20 -07:00
159cba815b fixing bug in sign for ByteTensor 2016-09-10 17:15:21 -04:00
cbb76eae04 add int and long tests for abs and fix recursive bug in abs 2016-09-10 16:40:27 -04:00
21a189fee6 fixing bug in frac. fixing pointwise unit tests to not just return range of 0, 1 2016-09-10 15:39:39 -04:00
4cff149c46 Merge branch 'from_buffer' of https://github.com/colesbury/pytorch 2016-09-10 13:24:16 -04:00
788ff68c1f prevent Unknown CMake command "check_function_exists". (#761)
* prevent Unknown CMake command "check_function_exists".
2016-09-10 13:02:53 -04:00
009667c26c adding generated files 2016-09-09 16:18:21 -07:00
d9f8f39a9a making more files to split between types 2016-09-09 16:00:13 -07:00
1fe27380f0 refactoring THCTensorMathCompareT to split types compilation 2016-09-09 14:20:56 -07:00
621d6a4475 splitting sort compilation into individual types 2016-09-09 14:20:56 -07:00
30700ede39 removing a lot of template instantiation in sort 2016-09-09 14:20:56 -07:00
8a76dc8b59 Add missing PyBuffer_Release calls 2016-09-09 11:23:18 -07:00
f646391f26 Bug fixes and test improvements
Fixed:
* tensor and storage printing
* legacy.nn module printing
* SpatialCrossMapLRN tests

Also, all fixed bugs have regression tests now.
2016-09-08 19:07:05 -07:00
1703f2abed Add utils tests 2016-09-08 19:07:00 -07:00
ee85fe1a9c Initial utils implementation 2016-09-08 18:49:48 -07:00
0703e0e897 BCECriterion THCUNN + Weights (#331)
BCE Criterion CUDA implementation
2016-09-08 16:33:53 -04:00
3d6b805652 Make travis use run_test.sh 2016-09-08 11:23:42 -07:00
58f507f9e3 Add file descriptor sharing mode to multiprocessing 2016-09-08 11:23:33 -07:00
24fe4b8af3 Initialize THVector dispatch tables 2016-09-08 10:04:43 -07:00
46fa7d987b Update TH 2016-09-08 09:54:21 -07:00
90a7a79a19 Don't check memory permissions with write on OS X
write returns errno 45 (Operation not supported) when used with
a file descriptor obtained via shm_open on OS X.
2016-09-08 08:59:16 -07:00
64938f75b3 Add incref/decref for THRefcountedMapAllocator 2016-09-08 08:53:25 -07:00
043be6f55c Fix shm_open and shm_unlink cmake test 2016-09-08 08:52:58 -07:00
cef9bf7f29 Fix segfault in THMapAllocator 2016-09-08 08:52:23 -07:00
dac5a0a07e adding generic/THVector.h to cmake 2016-09-08 10:55:30 -04:00
76ac35cdaa Added dynamic dispatch for x86 (#755)
Added dynamic dispatch for x86
2016-09-07 23:31:04 -04:00
c1f0e10a59 Merge pull request #15 from colesbury/conv2d
Use chainer-style constructor for Conv2d
2016-09-07 19:59:10 -04:00
1bd76a717d Merge pull request #9 from colesbury/develop
Add Storage.from_buffer
2016-09-07 19:03:42 -04:00
cd0929aa5e Use chainer-style constructor for Conv2d
* Conv2d, MaxPool2d, and AvgPool2d have one argument for each of ksize,
   stride, and pad. This argument can be either a single number or a
   tuple of (h, w)
2016-09-07 15:51:44 -07:00
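A sketch of the constructor shape described above; the ksize/stride/pad names come from the commit message, while the channel arguments and argument order are assumptions:
```python
import torch.nn as nn

# a single number applies to both spatial dimensions...
conv = nn.Conv2d(3, 16, ksize=3, stride=1, pad=1)

# ...or pass an (h, w) tuple to control each dimension separately
pool = nn.MaxPool2d(ksize=(2, 3), stride=(2, 1))
```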
1486d880b0 Add Storage.from_buffer
The from_buffer method is similar to numpy's frombuffer. It decodes a Python
buffer object into a Storage object. For byte and char storages, it
simply copies the bytes.
2016-09-07 15:32:33 -07:00
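For byte storages, a minimal sketch of the decoding described above:
```python
import torch

raw = b"\x01\x02\x03\x04"

# for byte/char storages the bytes are copied verbatim
s = torch.ByteStorage.from_buffer(raw)
print(list(s))  # [1, 2, 3, 4]
```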
b738b09606 Clean up Module forward and __call__ (#14)
* _forward is renamed forward since users should override it

 * some __call__ overrides are changed to forward

 * function which return a single variable are changed to return that
   variable instead of a one-element tuple
2016-09-07 15:41:39 -04:00
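A minimal sketch of the convention after this change: users override forward and invoke the module via its __call__ operator:
```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    def __init__(self, factor):
        super(Scale, self).__init__()
        self.factor = factor

    # override forward (formerly _forward)
    def forward(self, x):
        return x * self.factor  # a single value, not a one-element tuple

m = Scale(2.0)
y = m(torch.ones(3))  # calling the module dispatches to forward()
```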
7188f0851d add link to conda binaries 2016-09-07 12:03:08 -04:00
47700a4c20 fixing run_test.sh to switch to its own directory before running commands 2016-09-06 23:55:05 -04:00
3aeb0f5408 adding test file 2016-09-06 22:54:14 -04:00
07d1acd798 add torch license 2016-09-06 22:47:12 -04:00
4cffa2219a build fixes for OSX 2016-09-06 22:06:06 -04:00
11852e5c22 Add new flags for THMapAllocator
* TH_ALLOCATOR_MAPPED_FROMFD uses an existing file descriptor for
  mapping (and steals it)
* TH_ALLOCATOR_MAPPED_KEEPFD doesn't close the file descriptor
  until the mapping is freed
* TH_ALLOCATOR_MAPPED_UNLINK unlinks the file immediately after
  mapping it to memory

Also, now it's using fstat to check the file size (instead of lseek,
which alters the fd state).
2016-09-06 10:35:34 -07:00
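The C-level flags above have no direct Python surface, but the MAPPED_UNLINK pattern can be illustrated with plain mmap (an analogy, not this repo's API):
```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)
m = mmap.mmap(fd, 4096)

os.unlink(path)    # the file vanishes from the filesystem...
m[:5] = b"hello"   # ...but the mapping stays valid until unmapped
```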
eb22208a4e changing badge links 2016-09-06 02:23:23 -04:00
bf8b4779cc updating README 2016-09-06 01:47:14 -04:00
0e7ab61a83 Remove unnecessary files 2016-09-03 19:11:31 -07:00
3e4ddf25d0 Fix storage printing 2016-09-03 19:05:15 -07:00
1a3b585209 Merge https://github.com/apaszke/pytorch 2016-09-03 00:48:20 -04:00
a5504231b5 Making a build matrix 2016-09-03 00:42:35 -04:00
c63d042c57 change language in build readme 2016-09-03 00:27:58 -04:00
70c8d43831 fix link with http 2016-09-03 00:24:10 -04:00
5837d3480c adding CUDA build badge 2016-09-03 00:13:46 -04:00
ab1382b652 Merge pull request #10 from colesbury/develop
Make bias optional in Conv2d
2016-09-01 19:00:09 -04:00
9553e46ed7 Make bias optional in Conv2d 2016-09-01 12:38:34 -07:00
34b673844f bumping to alpha-2 2016-09-01 00:58:22 -04:00
a2a840df93 updated readme with multiprocessing changes 2016-09-01 00:55:12 -04:00
5b23b67092 Fix build status link 2016-08-31 21:03:42 -07:00
f9d186d33a Add initial version of multiprocessing module 2016-08-31 19:46:08 -07:00
93f959b03b Merge pull request #329 from gchanan/hardtanh
inplace is reversed in HardTanh:backward.
2016-08-31 22:36:26 -04:00
8f3f90a986 inplace is reversed in HardTanh:backward.
Fixes torch7 issue #734, "Inconsistent behavior of nn.Clamp in CPU and GPU modes".
Adds a simple test that gradOutput equals gradInput after backward when inplace is set.
It is possible to construct a test with inplace HardTanh where forward+backward yields
different results for nn vs cunn, but this appears to be due to the inclusive vs
exclusive bounds used for inplace vs non-inplace, respectively, so a more direct test
is preferred.
2016-08-31 19:31:41 -07:00
c7a66ddf74 Add new shared memory allocator to TH 2016-08-31 19:28:43 -07:00
b60d7e1476 fix addmm and addmv documentation when they are methods 2016-08-31 19:28:43 -07:00
4a1e099974 Add new shared memory allocator to TH 2016-08-31 14:36:42 -07:00
7280ff8f3d Merge pull request #9 from colesbury/develop
Simplify nn.Container and nn.Sequential
2016-08-31 14:40:10 -04:00
f45213a276 Simplify nn.Container and nn.Sequential
- nn.Container.modules is just a python list and used by nn.Sequential

 - Every module in nn.Sequential has a name. This fixes Module.type()

 - nn.Sequential constructor accepts either a list or an OrderedDict. With a
   list, the modules are named "0", "1", "2", ...
2016-08-31 11:15:31 -07:00
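A short sketch of the OrderedDict form (with a plain list, per the commit, the modules would instead be named "0", "1", ...):
```python
from collections import OrderedDict
import torch.nn as nn

seq = nn.Sequential(OrderedDict([
    ("fc", nn.Linear(10, 20)),   # accessible by name, which fixes Module.type()
    ("act", nn.ReLU()),
]))
```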
346194bd8e Merge pull request #8 from colesbury/develop
Make BatchNorm2d inherit from BatchNorm
2016-08-30 16:45:41 -04:00
959304bc0d Make BatchNorm2d inherit from BatchNorm 2016-08-30 13:25:01 -07:00
fffdd5170a Merge pull request #7 from colesbury/develop
Add torch.nn.AvgPool2d
2016-08-30 15:48:08 -04:00
ec22828169 Add torch.nn.AvgPool2d 2016-08-30 12:16:40 -07:00
ddc0c61692 VolumetricDilatedMaxPooling
modified:   lib/THCUNN/THCUNN.h
copied:     lib/THCUNN/VolumetricMaxPooling.cu -> lib/THCUNN/VolumetricDilatedMaxPooling.cu
modified:   lib/THCUNN/VolumetricMaxPooling.cu
modified:   test.lua
2016-08-28 19:01:12 +05:30
1fbcb44ae6 make abs and sign available for a few other types 2016-08-26 21:30:33 -07:00
7eaaf57d2b adding multiple types for log, log1p, exp, cos, cosh, acos, sin, sinh, asin, tan, atan, tanh, sqrt, rsqrt, sigmoid, cinv, ceil, floor, neg, abs, sign, round, trunc, frac 2016-08-26 21:30:33 -07:00
fc643f2407 append TORCH_NVCC_FLAGS optionally 2016-08-26 16:30:25 -07:00
939b0a4297 Merge pull request #45 from NVIDIA/cw-update-copyright-year
Update LICENSE.txt
2016-08-26 15:44:00 -07:00
234c8c9ef3 Update LICENSE.txt 2016-08-26 15:39:21 -07:00
75bad643bd Updated LICENSE.txt 2016-08-26 15:08:20 -07:00
229e3ec184 Consistent Max Pool API
renamed:    lib/THCUNN/SpatialMaxPooling.cu -> lib/THCUNN/SpatialDilatedMaxPooling.cu
modified:   lib/THCUNN/SpatialMaxPooling.cu
modified:   lib/THCUNN/THCUNN.h
2016-08-27 01:39:14 +05:30
774a6f1093 Add in-place operations to autograd and nn 2016-08-25 09:34:54 -07:00
814728a6aa Improve hook system 2016-08-25 09:23:39 -07:00
7654698a0e Fix ExecutionEngine bug and improve autograd tests 2016-08-25 09:23:39 -07:00
ff785e5f17 Make optimizers accept a closure 2016-08-25 09:23:39 -07:00
cc645de37b MaxPooling2d -> MaxPool2d 2016-08-24 10:11:00 -07:00
cc62ee229e Fix torch tests 2016-08-24 10:10:52 -07:00
24476090df Add volatile variables 2016-08-24 08:43:11 -07:00
2aee2d312f bumping to alpha-1 2016-08-24 10:52:32 -04:00
ee838422f7 Add an error when iterable loop is most likely infinite 2016-08-24 07:20:02 -07:00
ea93fb7ac0 Add more nn modules 2016-08-23 19:15:21 -07:00
7bcb2a4081 Initial optim version 2016-08-23 19:03:30 -07:00
2bf68e72d5 Add hook system to autograd and nn 2016-08-23 13:51:34 -07:00
75579fcabd Fix Log autograd test 2016-08-23 10:42:36 -07:00
686e8d32e2 Add torch.save and torch.load 2016-08-23 07:51:55 -07:00
8d933cbfc4 Fixes for OS X 2016-08-22 22:45:35 -04:00
5fef59beb4 fix critical bug in SpatialConvolution 2016-08-22 14:41:46 -04:00
d467a068c2 Add tests for new modules 2016-08-19 14:57:01 -07:00
e055ffbdc7 Add nn 2016-08-19 14:56:55 -07:00
53f00ae429 Add autograd 2016-08-19 14:24:07 -07:00
78a958ab61 Update and fix bugs in legacy nn 2016-08-19 14:23:59 -07:00
4c51a523c8 Add super basic CUDA autodetection 2016-08-19 14:23:53 -07:00
aadac369dc Update THCUNN 2016-08-19 14:19:27 -07:00
62dcb1aed0 Update THNN 2016-08-19 14:19:10 -07:00
47b0797fe1 pass devlist as const int* rather than int* in ncclCommInitAll 2016-08-19 19:00:14 +08:00
ed401cc29b link library with -lrt; otherwise there is undefined reference to shm_open 2016-08-19 18:58:56 +08:00
34ea3fd17e Fixing bug in regex not accepting 2.1(2.0) notation 2016-08-18 12:55:36 -07:00
cbeceb0bea Make Threshold THCUNN functions more consistent 2016-08-18 11:14:25 -07:00
cfc41a7cf2 Accept both 2D and 4D weights in SpatialConvolutionMM 2016-08-18 06:58:43 -07:00
d7e798c15c Add new nn description to README 2016-08-17 18:04:16 -07:00
e382d40709 Fix tests 2016-08-17 10:07:20 -07:00
ef081526fc Fix Tensor operators 2016-08-17 07:25:05 -07:00
b06c000478 Fix <3.5 compatibility and travis configuration 2016-08-16 21:11:10 -07:00
eaa24dc7c8 adding requirements.txt 2016-08-16 13:58:15 -04:00
34c669222d adding travis yaml 2016-08-16 13:24:19 -04:00
2fa6d70336 readme 2016-08-15 12:10:29 -04:00
1b693c3845 changes to README 2016-08-15 12:01:13 -04:00
e6953000e8 Add tests for copy and pickle + make CUDA optional in legacy nn tests 2016-08-15 06:37:57 -07:00
207d6ae60d Override build commands in setup.py 2016-08-14 20:47:27 -07:00
794928fadc Update README 2016-08-14 20:43:20 -07:00
288280aa8a Implement basic copy and pickle protocol for Tensor and Storage 2016-08-14 19:09:12 -07:00
1902bc0bfb Interface with numpy 2016-08-13 20:19:17 -07:00
68b400fbf3 fix addmm and addmv documentation when they are methods 2016-08-13 11:33:57 -04:00
9fff8e7392 Fixes for changes in libs 2016-08-12 22:02:57 -07:00
d8a5882890 Update THCUNN 2016-08-12 21:52:20 -07:00
5bf8aa2b49 Update THC 2016-08-12 21:52:01 -07:00
c08213a3fb Use TH_INDEX_BASE in THC 2016-08-12 19:24:40 -07:00
73c6312391 Update THNN 2016-08-12 18:33:49 -07:00
928ca7bfb0 Update TH 2016-08-12 18:31:03 -07:00
ef7364b80e Fix Python 2.7 compatibility 2016-08-12 18:26:10 -07:00
3d80b1371a adding optim tests 2016-08-12 16:46:18 -07:00
624dd3e10c fixing optim (tests pass) 2016-08-12 16:44:25 -07:00
1e905eb4d5 copy -> copy_ 2016-08-12 09:26:33 -07:00
781a6a9083 CUDA version of Spatial Dilated Max Pooling
modified:   lib/THCUNN/SpatialMaxPooling.cu
modified:   lib/THCUNN/THCUNN.h
modified:   test.lua
2016-08-12 11:00:02 -04:00
36f439b11c Merge pull request #738 from CDLuminate/cmake-add-soversion-for-TH
cmake: add soversion for libTH
2016-08-12 10:55:20 -04:00
12bed8dc0d Add CUDA device selection 2016-08-12 07:46:46 -07:00
a57ab07261 fix typo from #462 2016-08-12 00:14:27 -04:00
afe25eec80 Merge pull request #461 from torch/indextype
Index* and sort for new types
2016-08-11 23:47:26 -04:00
c3a39d0ef3 fixes for multiple cuda types 2016-08-11 18:57:28 -07:00
f621210eaa added multiple types for sort 2016-08-11 17:56:44 -07:00
fa6e5c5bff Update tests and fix CosineEmbeddingCriterion 2016-08-11 13:10:54 -07:00
df1c890a10 Fix DistKLDivCriterion gradInput formula 2016-08-11 12:37:11 -07:00
959b12f1b8 Use TH_INDEX_BASE in THCUNN 2016-08-11 12:31:31 -07:00
d64d3c3a30 Mark optional arguments in THCUNN.h 2016-08-11 12:31:14 -07:00
053c447a59 Fix sub and div for integer types 2016-08-11 09:46:51 -07:00
c59b3c8b9c fixing sort to use long indices 2016-08-11 09:46:25 -07:00
ff00cdd728 Add cunn tests 2016-08-11 08:56:30 -07:00
5a3997789c Fix a bug in printing 2016-08-11 06:44:01 -07:00
1a57979f41 Add cutorch tests 2016-08-11 06:43:41 -07:00
edb9ed2311 fix several spelling errors 2016-08-11 12:27:28 +00:00
30df3ea23a cmake: add soversion for libTH
Imported from Debian package.
2016-08-11 12:08:16 +00:00
8288a0a2e3 making changes to sort and TopK for the changed index* API 2016-08-10 15:26:11 -07:00
b9b4d390a6 Fix tril and logical operations when input == result 2016-08-10 15:18:25 -07:00
1543b36a4b adding indexing for types 2016-08-10 11:47:31 -07:00
e344fae6a1 Fix THCUNN.h formatting 2016-08-10 09:33:40 -07:00
e9f9fd3727 Major refactor 2016-08-10 09:24:53 -07:00
e2c3ee629d easy: change templated argument to capitalized 2016-08-07 23:59:35 -04:00
2e053f990c more tests and cmake fix 2016-08-07 10:34:35 -04:00
a2dcd220af Merge pull request #459 from torch/scattergatherish
scatter / gather to new types
2016-08-07 01:13:00 -04:00
9d2ec718df scatter / gather to new types 2016-08-07 01:09:14 -04:00
2bfcca23e2 Fix "invalid configuration" when using very large batch sizes in
evaluate mode.

Example:
```
 bn = nn.BatchNormalization(100):cuda()
 bn:evaluate()
 bn:forward(torch.CudaTensor(147000, 100):zero())
 cutorch.synchronize()
```

Fixes https://github.com/torch/nn/issues/907
2016-08-05 13:44:26 -07:00
0952370cf5 Move CMake files to THC
This allows to compile THC without cutorch
2016-08-04 14:17:05 -07:00
652a31b714 Add build scripts for libraries 2016-08-04 14:12:31 -07:00
7f635a2f1d Add 'cmake/FindCUDA/' from commit 'd086ce43e05cda26dc02d5a50fcd1827db464c5f'
git-subtree-dir: cmake/FindCUDA
git-subtree-mainline: 5dc6a67bdc9b96924d105cda3c399ba0d21b041d
git-subtree-split: d086ce43e05cda26dc02d5a50fcd1827db464c5f
2016-08-04 12:49:02 -07:00
5dc6a67bdc Add 'torch/lib/THCUNN/' from commit '1395c27906a1e93a9e5e8ba5bb0e85d35f6838b3'
git-subtree-dir: torch/lib/THCUNN
git-subtree-mainline: 286bd362c41f41427518f33a0dc0602d142e3c9d
git-subtree-split: 1395c27906a1e93a9e5e8ba5bb0e85d35f6838b3
2016-08-04 11:35:56 -07:00
286bd362c4 Add 'torch/lib/THC/' from commit 'b85e5e6df4a3a46d656d9b01e3245e44631326c7'
git-subtree-dir: torch/lib/THC
git-subtree-mainline: 035eb28e1858d667d6dfbb3b80ec942f64ec1a43
git-subtree-split: b85e5e6df4a3a46d656d9b01e3245e44631326c7
2016-08-04 11:35:43 -07:00
035eb28e18 Add 'torch/lib/THNN/' from commit '4fe7059a315d156ecd080ff7bd5b4fe3d3a9efad'
git-subtree-dir: torch/lib/THNN
git-subtree-mainline: c3f0c1e2e0381c0927b5fef8719960922b182581
git-subtree-split: 4fe7059a315d156ecd080ff7bd5b4fe3d3a9efad
2016-08-04 10:58:50 -07:00
c3f0c1e2e0 Add 'torch/lib/TH/' from commit '9f108b5d548dd286f114f7a4a02fc33f4cf40be5'
git-subtree-dir: torch/lib/TH
git-subtree-mainline: d7504b1f5268d85d7b978fe1d5b6aa549848793d
git-subtree-split: 9f108b5d548dd286f114f7a4a02fc33f4cf40be5
2016-08-04 10:58:45 -07:00
1395c27906 Volumetric Dilated Convolution 2016-08-03 21:29:30 -07:00
9f108b5d54 fix gemm bug when beta = 0 2016-08-02 23:58:06 -04:00
d7504b1f52 Fix type checks in cwrap 2016-08-02 09:45:22 -07:00
6df0ae5d35 Add cunn 2016-08-02 09:20:18 -07:00
92e983a489 Fixes for Linux and new cutorch 2016-08-02 09:20:18 -07:00
e9531529ad fix SpatialSoftMax bug and add unit tests 2016-08-02 08:48:48 -07:00
956625dbc5 Filter duplicate options in cwrap 2016-08-01 22:14:16 -04:00
2f342af22f Move optim to legacy 2016-08-01 12:01:46 -04:00
5c9bfe8c02 Fixes in nn 2016-08-01 11:58:54 -04:00
d086ce43e0 fixing small cmake bug 2016-07-29 19:41:00 -04:00
b85e5e6df4 new select_compute_arch.cmake file from @borisfm 2016-07-29 19:36:14 -04:00
d12a358435 Add converted nn modules 2016-07-29 16:54:28 -04:00
ede8216751 Fix printMatrix 2016-07-29 16:54:28 -04:00
9d1a0f3627 gemm -> Sgemm 2016-07-29 01:45:08 -04:00
7279f5b6d5 Merge pull request #456 from torch/more-cutorch-template-types
reduce and BLAS work
2016-07-29 01:14:52 -04:00
392de819f8 reduce and BLAS work 2016-07-28 23:55:26 -04:00
55041eaab6 python 2.7 compatibility 2016-07-28 10:43:36 -04:00
27bbaf633b New tests and container modules + bug fixes 2016-07-28 10:06:30 -04:00
b3a9e1333d Remove unneeded deb build script 2016-07-27 17:58:00 -07:00
2af7913d95 Adding SpatialUpSamplingBilinear 2016-07-27 11:12:05 +02:00
d593d4b3ef Merge pull request #723 from andreaskoepf/histc_fix
Fix torch.histc problem #718
2016-07-27 00:07:45 -04:00
a4f544ca14 Converted nn modules 2016-07-26 13:36:15 -04:00
6909d65613 Bug fixes 2016-07-26 13:34:49 -04:00
e967b75666 Fix histc float problem #718 2016-07-25 22:50:43 +02:00
1527ba7381 Adding trace and diag to cutorch 2016-07-25 11:38:37 -07:00
428ec5b2a3 Merge remote-tracking branch 'github/master' into public 2016-07-25 10:53:01 -07:00
55c42ad681 Fixed redundant contexts in multi-process apps
Change-Id: If787014450fd281304f0c7baf01d25963e40905d
2016-07-25 10:10:30 -07:00
96fc74c58f Merge pull request #308 from mys007/classnllbounds
NLL Criteria: weight bound checking
2016-07-25 10:00:56 -04:00
20f4e7e8a0 added bound checks for weights 2016-07-25 12:03:04 +02:00
ae40bcd58c Base for nn conversion 2016-07-22 22:21:29 -04:00
554a1d8336 Add optim 2016-07-21 16:42:06 -04:00
bc7bd7a8b3 Add unit tests and fix detected bugs 2016-07-21 13:46:59 -04:00
5c8292aaf4 Fix accepted k range in THTensor_kthvalue 2016-07-20 15:37:50 -04:00
c613668fef Make it possible to change index base in TH
Not every language has 1-based indexing...
2016-07-20 15:37:44 -04:00
1d763810ba Fix optional argument resolution in cwrap 2016-07-19 10:52:52 -04:00
c574295012 Various fixes 2016-07-19 10:45:59 -04:00
3a44259b32 Add support for CUDA 2016-07-19 10:45:59 -04:00
2885527b39 Merge pull request #5 from apaszke/various
Make builds parallel + Add rand and randn
2016-07-19 08:34:54 -04:00
93ed433de3 Add rand and randn 2016-07-18 23:59:27 -04:00
cf90bee8af Enable parallel builds 2016-07-18 23:56:50 -04:00
029478f160 Fix BatchNormalization warpSum for pre-Kepler cards
Fixes #298
2016-07-07 16:13:12 -07:00
7a1aa6b563 Improved Deb generation 2016-07-07 16:31:57 +02:00
c82e180487 fix for std::pow ambiguity 2016-07-04 18:27:05 -04:00
2b53cce79f Minor improvements in utils 2016-06-25 22:55:04 +02:00
c3072fd5d2 Refactor cwrap 2016-06-25 22:54:49 +02:00
3cec305524 Restructure python code 2016-06-23 22:55:05 +02:00
0ee188e7df nobias in spatial full conv 2016-06-23 15:22:39 +02:00
cbd4b62a9b Added FindCUDA bits from CMAke 3.6 2016-06-22 01:07:09 -07:00
523a6670f4 Improve cwrap error handling and fix memory leaks on error 2016-06-21 01:22:59 +02:00
8343eee7c0 Add tensor printing 2016-06-21 01:22:38 +02:00
8e79e00f95 Add arithmetic operators 2016-06-20 02:16:33 +02:00
d7016055fb Fix Tensor constructor 2016-06-20 02:16:24 +02:00
feba835d18 Add sub and expandAs 2016-06-20 00:11:21 +02:00
077bfbde03 Add all constructors for Tensor and Storage 2016-06-19 23:45:41 +02:00
486ea76b98 Add more Tensor methods 2016-06-19 00:24:18 +02:00
4f66ea42af Add random-related Tensor methods 2016-06-18 21:36:10 +02:00
9dec8a51f0 Merge pull request #300 from jonathantompson/volpad
Added VolumetricReplicationPadding.
2016-06-18 11:44:16 -04:00
5c667a2681 Added VolumetricReplicationPadding. 2016-06-17 14:46:42 -07:00
9ae84f5d6b Fix version number 2016-06-16 17:07:42 -07:00
e51e922924 Add a debug level to NCCL and CUDA versions at init 2016-06-16 17:04:41 -07:00
857c32bc21 Add all mm methods 2016-06-16 23:40:35 +02:00
d5422534ae inplace hardtanh, remove relu6 2016-06-16 17:08:32 +02:00
9fcc523485 Increased version to 1.2.3 2016-06-15 19:18:13 -07:00
67d1ab9106 Packaging : Generate shlibs.local 2016-06-15 19:03:08 -07:00
da6d2009e0 Move deb to build directory 2016-06-15 18:20:10 -07:00
155132d336 Fix make install to use BUILDDIR 2016-06-15 18:20:02 -07:00
08ddfe03d2 Rework debian packaging 2016-06-15 18:18:44 -07:00
0eb2b9e756 Add more Tensor and Storage methods 2016-06-15 23:03:47 +02:00
5d4716a8a3 Include link to blog post in README.md 2016-06-15 10:54:19 -07:00
6161049a31 Added ReLU6 implementation and test. 2016-06-14 11:50:48 -07:00
fdfe9d836e Add index* Tensor methods 2016-06-13 13:58:09 +02:00
a9282edf79 Add THPPointer and more Tensor methods 2016-06-13 13:26:00 +02:00
aa8f669a3d Updating for .deb rebuild 2016-06-13 02:01:49 -07:00
5c2b253f29 Merge pull request #294 from kmul00/volmaxunpool
Added CUDA version of VolumetricMaxUnpooling
2016-06-11 11:10:22 -04:00
32490710dc fixes for cutorch API changes 2016-06-11 07:59:09 -07:00
6a1e0c3127 Added CUDA version of VolumetricMaxUnpooling
modified:   THCUNN.h
new file:   VolumetricMaxUnpooling.cu
modified:   ../../test.lua

Freed input storage
modified:   VolumetricMaxUnpooling.cu
2016-06-11 15:43:56 +05:30
ad1863dd3b template work 2016-06-10 14:49:55 -07:00
60f9834ac6 Add more Tensor methods 2016-06-10 00:04:24 +02:00
2bc346ec5b Merge pull request #2 from apaszke/py2
python 2.xx support
2016-06-08 19:16:06 -04:00
5ee3358a92 python 2 support 2016-06-08 19:14:57 -04:00
d5e507fc7f Only call the CUDA runtime. That may fix #27. 2016-06-07 16:27:51 -07:00
6954783d9d Fix indexing 2016-06-07 23:51:50 +02:00
7c11cb9b68 Add Tensor indexing functions 2016-06-07 23:40:37 +02:00
2dddec5e14 adding some comments 2016-06-06 18:32:53 -04:00
620491a649 Merge remote-tracking branch 'github/master' into HEAD 2016-06-06 14:35:57 -07:00
7edfc57228 Make NCCL collectives work on communicators with only one rank 2016-06-06 14:35:00 -07:00
bd3cf73e6e Changed CURAND generator to work on a wider set of platforms. 2016-06-06 14:34:03 -07:00
177505b757 Gencodes changed to NV recommended 2016-06-06 00:06:18 -07:00
76a37461bf fixing Volumetric Average and Max Pooling for large inputs 2016-06-05 19:32:22 -04:00
23d41b927a making error checking consistent, always and simple 2016-06-05 02:14:16 -04:00
795c159e7e fix "macro redefined" warning (th_isnan) (#695)
See #694
2016-06-04 13:36:37 -05:00
bb658fe29b fixes for gcc 5.xx 2016-06-04 04:45:31 +00:00
095c9a6974 gcc 5.xx fixes (#421) 2016-06-03 23:41:44 -05:00
9d9d8cd59f Bump to 1.2.2 2016-06-03 17:21:53 -07:00
1657af1567 Better name for GENCODE 2016-06-03 10:25:37 -07:00
acb93d1aed Removing unneeded includes 2016-06-02 17:33:43 -07:00
889ad3d4e6 Makefile improvements
- Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests
2016-06-02 15:01:03 -07:00
220741ff8b Merge https://github.com/borisfom/cutorch 2016-06-02 00:25:12 -04:00
4e3db462ed Using new CMake FindCUDA elements (#415) 2016-06-01 15:39:43 -07:00
073afb31ec fix compilation issues under Windows (#694)
* fix compilation issues under Windows
2016-06-01 15:30:03 -05:00
455167db9a Revert "Using new CMake FindCUDA elements" (#417) 2016-05-27 21:40:28 -05:00
429c7a03cb Using new CMake FindCUDA elements (#415)
* Use FindCUDA module

* Removed extra FindCUDA wrappers

* adjusted for new CUDA_ macro name

* Added fp16 recognition for mobile CUDA / new CMake

* Porting to new CMake
2016-05-27 12:32:57 -05:00
dca7299b9a Use 3d thread blocks in SoftMax for inputs with w*h > 2^16-1 2016-05-20 23:57:45 +02:00
8229adc495 Fix SpatialSubSampling (was doing non-atomic writes in backprop).
Also some changes to test to make it less flaky:
- Decrease some output counts to reduce out-of-memory issues.
- Decrease some tolerances.
- Increase precision of random seed so that multiple test launches at
  the same time start with different seeds.
2016-05-20 10:37:11 +01:00
92c5a38090 Fix baddbmm for non-contiguous results tensor 2016-05-19 17:44:24 +02:00
a465b67c9b Fix addmm for non-contiguous results tensor 2016-05-19 17:01:43 +02:00
01b17a2dcc Limit registers when LRNFillScale runs on TK1.
Fix for torch@cunn#271, to make openface work on Jetson TK1.
2016-05-18 11:52:37 +08:00
0b61c3f233 Add more Tensor methods 2016-05-13 22:38:51 +02:00
88bc821d79 added THAtomicSetLong 2016-05-13 11:15:41 +02:00
56c98f7897 Add more Tensor methods 2016-05-13 00:01:54 +02:00
c3f7aac4f9 Add logical functions 2016-05-12 01:22:51 +02:00
47247be77c Add printing methods for Tensors 2016-05-10 23:50:15 +02:00
449ac4ca2a Add torch.* functions 2016-05-09 19:14:40 +02:00
d2738ca25e Add more Tensor methods 2016-05-07 23:13:55 +02:00
6db57b8a4d added correct inclusion for malloc_usable_size() 2016-05-07 17:33:54 +02:00
7567a0bb13 Add cwrap 2016-05-07 15:28:13 +02:00
9def21ac30 Fix memory leak in addmm 2016-05-07 01:04:45 +02:00
8fd390e354 Add mean, std, var 2016-05-06 15:40:09 +02:00
c3b3df9f22 Add utilities and clenup Tensor wrappers 2016-05-06 15:04:57 +02:00
afefedc0b0 Add free, retain, dim, numel for Tensor 2016-05-05 23:00:49 +02:00
4a5b66de9a Add copy and type methods to Tensors 2016-05-05 22:44:43 +02:00
842e1b6358 Add exception handling 2016-05-05 20:58:13 +02:00
82d48b4e6d Merge pull request #672 from fmassa/addmm_fix
Fix addmm for non-contiguous result tensor
2016-05-04 10:03:01 -04:00
cf3884c9ad Fix addmm for non-contiguous result tensor 2016-05-04 09:14:23 +02:00
fa994baada Fix Storage memory management and add Tensor methods 2016-05-04 01:01:07 +02:00
64dde206b6 Add storage and size to Tensor 2016-05-03 22:43:48 +02:00
f4b3554d9e Refactor generic/Tensor.c and add Short objects 2016-05-03 21:20:54 +02:00
d9ea7e6211 Add copy method for Storage 2016-05-03 20:24:25 +02:00
ac49260792 Fix segfaults in Storage 2016-05-03 19:52:34 +02:00
5de46b2a1c Add new methods to Storage 2016-05-03 17:42:00 +02:00
9c1f4777d1 Add slicing support to Storage 2016-05-03 15:58:51 +02:00
1fdf9f4555 Fix THPStorage initialization 2016-05-03 15:20:12 +02:00
690d470c71 Add Storage.py template 2016-05-03 15:13:12 +02:00
b0d90e3688 Add templated __init__ 2016-05-02 23:54:59 +02:00
9c18e7c990 Fix argument parsing in Storage 2016-05-02 23:39:05 +02:00
731041cb6a Initial commit 2016-05-02 23:19:57 +02:00
3a1433fb98 Add SpatialClassNLLCriterion (with pre-Kepler support) 2016-05-01 20:37:05 +02:00
b0087b634d Revert "Add SpatialClassNLLCriterion" 2016-05-01 12:41:28 -04:00
e1921929ad Fix - added check to avoid invalid memory access in indexSelect_long 2016-04-30 18:28:12 -04:00
146690547d Merge pull request #261 from torch/dilated
Adding SpatialDilatedConvolution
2016-04-28 14:18:43 -04:00
05d61eca0c Merge pull request #250 from apaszke/spatial_nll
Add SpatialClassNLLCriterion
2016-04-27 23:46:40 -04:00
1d2d45f331 Adding SpatialDilatedConvolution 2016-04-27 15:21:54 -07:00
b50305e7fe adding CUDA version of MultiLabelMarginCriterion + tests 2016-04-27 11:56:00 -07:00
ee3676b834 Add SpatialClassNLLCriterion 2016-04-26 21:04:05 +02:00
a97ae66a1d remove some warning when compiling TH 2016-04-25 11:21:50 +01:00
93538def65 Merge pull request #22 from borisfom/master
Fixed version in ChangeLog
2016-04-21 18:58:44 -07:00
e5067b6611 Fixed version in ChangeLog 2016-04-21 16:28:13 -07:00
0629fb62d7 Merge pull request #21 from borisfom/master
Fixed install location, new .deb version
2016-04-21 14:46:41 -07:00
0177cf3ea4 Fixed install location, new .deb version 2016-04-21 14:10:31 -07:00
658aca1469 Merge pull request #17 from Hopobcn/master
Enable compilation with specific g++
2016-04-21 13:25:18 -07:00
03df4c7759 Moved no-as-needed flag to link rule.
Avoids link errors for tests linked with nvcc.
2016-04-19 14:51:03 -07:00
74b1233917 allow THAllocator to have a NULL realloc 2016-04-19 17:50:22 -04:00
0d4f8f4e95 Merge pull request #18 from apaszke/master
Add --no-as-needed to make sure that cudart library gets linked
2016-04-19 11:11:39 -07:00
ddd3f2084d Fix readme to reflect the new test paths 2016-04-19 11:09:25 -07:00
dba3ec9428 Fix random deadlock during ncclCommInitRank. 2016-04-19 10:47:27 -07:00
9de361a1b9 Fix MPI test usage
Only display usage from rank 0 and exit instead of continuing (and seg faulting).
2016-04-19 10:43:38 -07:00
c772fb7976 few changes as required for cudnn fp16 support 2016-04-19 15:32:36 +02:00
45e0ac040a add noBias for nn.Linear and nn.SpatialConvolution (#252)
add tests

testing

done

add noBias for nn.Linear and nn.SpatialConvolution

add check in THCUNN_assertSameGPU

add noBias for nn.Linear and nn.SpatialConvolution

minor edits

add noBias for nn.Linear and nn.SpatialConvolution
2016-04-18 11:27:25 -04:00
730770304e Merge pull request #620 from liboyue/Add_fmod_remainder_and_remove_mod
Add fmod(), remainder(), remove mod() and fix tensor operator % behavior
2016-04-16 19:54:34 -04:00
2fc3c19a24 add backpropagation for batchnormalization in evaluation mode (#251)
add tests for batchnormalization in evaluation mode

fix bugs

BatchNormalization: compute save_mean and save_std in evaluation mode
2016-04-16 13:33:48 -04:00
a12cb07327 Fix _msize crash on windows xp
Calling _msize(NULL) crashes the application on Windows XP.
2016-04-16 23:33:42 +08:00
fecb7f794c description that sort orders the 'key' tensor numerically 2016-04-15 13:57:06 +02:00
7f5d484210 remove build warnings 2016-04-14 11:19:08 +02:00
201ffe4487 Merge pull request #556 from yf225/master
Remove junk values from right-singular vectors matrix in SVD
2016-04-13 11:19:24 -04:00
c0c959b1be Add --no-as-needed to make sure that cudart library gets linked 2016-04-13 10:04:38 -04:00
29d2a0f198 Merge pull request #386 from luketwitter/master
Fix uninitialized pointers in THCState during init.
2016-04-12 14:02:13 -04:00
3d6258365f Remove use of broken THInf macro, which caused UB
* [VolumetricMaxPooling] Remove use of broken THInf macro, which caused UB

* [SpatialFractionalMaxPooling] Remove use of broken THInf macro, which caused UB
2016-04-12 14:01:23 -04:00
5a08f49995 Fix uninitialized pointers in THCState during init.
An out-of-memory error during THCudaInit will cause
a jump to an uninitialized address.
2016-04-12 10:38:38 -07:00
e30bf95989 Enable compilation with old g++ when the default g++ is not supported (+5.0) 2016-04-12 12:49:13 +02:00
31584607a5 [LookupTable] Add Max-norm constraints to LookupTable (#240) 2016-04-11 16:40:44 -04:00
cf3b9cc859 [THCTensorMath] Remove use of broken THInf macro, which caused UB 2016-04-10 20:57:31 +01:00
7f803a88e8 Make THInf type-specific 2016-04-10 20:30:34 +01:00
bc181bc3ff Merge pull request #470 from fbesse/master
Added torch.equal function which performs a tensor equality check
2016-04-01 16:52:36 +02:00
15e60e9e98 Adding cross to cutorch 2016-03-30 04:00:07 -04:00
c557efba3b Remove mod() and cmod(). 2016-03-29 17:19:15 +08:00
62a61bf986 Add remainder() and cremainder(). 2016-03-29 16:40:21 +08:00
95ffcb67d3 Add fmod(), cfmod() 2016-03-29 15:46:15 +08:00
7b61444538 Merge pull request #355 from apaszke/fp16
Add FP16 support (CudaHalfStorage, CudaHalfTensor)
2016-03-28 02:53:39 +02:00
26ebc9680e Merge pull request #367 from wickedfoo/kernel-p2p
kernel p2p access and non-blocking streams
2016-03-28 02:52:57 +02:00
a37066776a [THCGeneral] Use labs instead of abs for long operand 2016-03-28 01:51:14 +01:00
d5e2b6bc3c Improved SVD memory usage based on feedback 2016-03-24 15:14:23 -07:00
8dfbe2bf56 Add DiskFile:noBuffer()
The noBuffer function disables read and write buffering via setvbuf.
2016-03-24 10:54:34 -07:00
b7b0f96cc7 Merge pull request #364 from bamos/potrs
Add potrs with MAGMA
2016-03-22 17:38:45 -04:00
3f4b46ac9e Add potrs with MAGMA 2016-03-22 16:38:59 -04:00
80e413e922 Improvements to torch.{linspace, logspace, range}.
Allow end < start in linspace and logspace (no mathematical reason to forbid
this, and it makes it match numpy behaviour).

Also don't raw-resize tensor unnecessarily. This makes it work with
non-contiguous tensors (e.g., can now do a range on a slice of a tensor).
2016-03-22 16:49:35 +00:00
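Shown with the Python API for concreteness (the commit itself targets the Lua frontend):
```python
import torch

# end < start is now allowed, matching numpy behaviour
print(torch.linspace(1, 0, 5))
# -> 1.00, 0.75, 0.50, 0.25, 0.00
```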
8047e43d46 make margin parameterizable 2016-03-22 10:08:14 +01:00
017a5f75a3 Merge pull request #591 from bamos/master
Remove unnecessary rank check in potrs call.
2016-03-21 15:53:33 -04:00
61fd2bb070 Adding SparseLinear for CUDA 2016-03-18 11:14:10 -07:00
b16cc5d197 Merge pull request #16 from borisfom/master
Removed Tegra, fixed version format.
2016-03-17 17:35:04 -07:00
e6f4a83da6 Removing Tegra 2016-03-17 17:25:27 -07:00
1a8bae5b2f fixed version format 2016-03-17 17:13:45 -07:00
e8eb285a59 Merge pull request #15 from borisfom/master
Fixing version number and compile param for 5.3
2016-03-17 16:03:05 -07:00
b508d28123 Version with . 7.5 2016-03-17 15:48:48 -07:00
62b551798f Use arch=5.3 as well 2016-03-16 23:09:36 -07:00
dfbebe395c Delete libnccl1_1.1.1+cuda75_amd64.deb 2016-03-16 21:44:13 -07:00
85280b5bf4 Delete libnccl-dev_1.1.1+cuda75_amd64.deb 2016-03-16 21:44:04 -07:00
fb53cfd9b0 Added files via upload 2016-03-16 21:42:47 -07:00
92d2123d8d Added compute 5.3 2016-03-16 19:24:48 -07:00
ec3de28ae5 Preparing for pbuild 2016-03-16 19:23:49 -07:00
86dc136fa9 Moved to pbuilder 2016-03-16 18:41:54 -07:00
172f316ac2 Moved release files to proper area
Bumping a version; building for 7.5
2016-03-16 18:30:53 -07:00
eb47d4fd95 Merge pull request #590 from Atcold/torchRangeFix
Fix bad size evaluation
2016-03-14 22:43:58 -04:00
ffe4085dbb fix large sorts 2016-03-14 14:38:43 -07:00
2682f77ba6 kernel p2p access and non-blocking streams 2016-03-14 13:41:04 -07:00
a81f87b9f1 Fix multinomial regression. 2016-03-14 16:30:30 -04:00
57133e6ae6 Add FP16 support (CudaHalfStorage, CudaHalfTensor) 2016-03-13 17:29:56 +01:00
2dbc0dcb50 Remove unnecessary rank check in potrs call. 2016-03-12 21:04:41 -05:00
300b0d3e6a Fix bad size evaluation
xmax, xmin and step are accreal, so computations are carried out in int
logic for some tensor types and double logic for others. The separate
division introduces a bug when int logic is used.
2016-03-12 18:23:15 -05:00
a3b4b6c23c In-place ELU 2016-03-10 12:01:29 +11:00
d0c71195dd THNN -> THCUNN in assertSameGPU 2016-03-09 10:15:17 +01:00
c939e5e302 Add an error message for checkGPU 2016-03-08 22:00:59 +01:00
61b895d9ae Merge pull request #568 from andreaskoepf/lerp_trunc_frac
Add math functions trunc, frac, rsqrt, lerp
2016-03-07 13:57:21 -05:00
bd38b9c44b Add math functions trunc, frac, rsqrt, lerp 2016-03-04 23:23:00 +01:00
019f52acee Add math functions trunc, frac, rsqrt, lerp 2016-03-04 23:17:09 +01:00
b2d1cc4090 Extend torch.bernoulli()
Now it also accepts a tensor of probabilities instead of a single Lua
number.

Added a test that checks that the output is binary.
2016-03-04 17:01:15 +00:00
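The same extension sketched with the Python API (the commit targets Lua torch):
```python
import torch

p = torch.tensor([0.1, 0.5, 0.9])  # per-element probabilities
sample = torch.bernoulli(p)        # elementwise Bernoulli draws
assert set(sample.tolist()) <= {0.0, 1.0}  # output is binary, as the added test checks
```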
59c25ebe03 Fix a bug in :isSetTo
The case of self->storage == NULL was not checked, so any two Tensors
with no storage would effectively appear to be shared with each other.
Added a test to prevent regressions.
2016-03-04 13:40:45 +00:00
8dff00a7b8 Fix csub #60 2016-03-03 22:13:12 +01:00
11d441a580 Merge pull request #342 from ebetica/addbmm
Adding addbmm in cuda
2016-02-29 19:11:33 -05:00
f9013a5e66 SoftMarginCriterion 2016-02-29 13:17:44 -08:00
941d9da08c Updated package version, added manpage 2016-02-29 12:10:34 -08:00
9d7f86a66f Adding addbmm in cuda
fixed nits
2016-02-29 03:25:27 -05:00
f5030b49e5 only check isnan for float types 2016-02-27 17:02:26 -08:00
f5dfc407b0 fix magma v2 compatibility 2016-02-27 15:53:31 -08:00
708bfa9e80 Add checks for convolution parameters 2016-02-27 12:11:38 +01:00
ef68938099 Merge pull request #224 from colesbury/bn
Add per-activation BatchNormalization implementation.
2016-02-25 23:06:52 -05:00
bfee4bcb1e Add per-activation BatchNormalization implementation.
The batch normalization code now works for 2D (per-activation), 4D
(spatial) and 5D (volumetric) tensors. The per-activation case is not
optimized.
2016-02-25 12:02:06 -08:00
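In today's module terms the three layouts map roughly as follows (an illustration, not this repo's Lua API):
```python
import torch
import torch.nn as nn

nn.BatchNorm1d(64)(torch.randn(8, 64))           # 2D: per-activation
nn.BatchNorm2d(64)(torch.randn(8, 64, 4, 4))     # 4D: spatial
nn.BatchNorm3d(64)(torch.randn(8, 64, 2, 4, 4))  # 5D: volumetric
```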
116e5a5e79 adding SpatialReflectionPadding and SpatialReplicationPadding 2016-02-24 17:23:10 -08:00
290b4b61c9 Merge pull request #222 from fbesse/volumetric_full_fix
Fixed VolumetricFullConvolution.
2016-02-23 14:05:57 -05:00
e3677059ae Remove junk value columns from right singular vectors matrix in SVD 2016-02-22 17:15:37 -08:00
1c45cb936a Merge pull request #514 from Moodstocks/quickselsort
torch.topk: use quickselect + quicksort
2016-02-22 10:44:44 -05:00
bf2487eda8 fix bug in indexFill introduced with screwing up upstream PR 2016-02-20 12:01:15 -05:00
81ab73f686 added weights to MultiMarginCriterion 2016-02-19 17:42:47 -08:00
2dc9f42a6e Merge pull request #217 from torch/padValue
adding padValue to LookupTable
2016-02-19 18:48:14 -05:00
bb2ec3f3ab adding padValue to LookupTable 2016-02-19 15:36:40 -08:00
e731f73946 index* fixes in Cutorch 2016-02-19 14:38:55 -08:00
f4bbdc7d30 fix merge typo 2016-02-19 16:03:40 -05:00
a8aada667d multmargsizeavgfix 2016-02-19 12:11:55 -08:00
5554a4c9f0 Fixed useRemoteRecv consistency issue.
Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <jwoolley@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-02-18 13:45:42 -08:00
e305fd8705 Reimplemented VolumetricFullConvolution in the same fashion as SpatialFullConvolution. 2016-02-18 15:33:00 +00:00
85c277e601 Added torch.equal function which performs a tensor equality check 2016-02-16 14:33:48 +00:00
d3b0c4bfe5 Merge pull request #330 from ASAPPinc/thcudacheck-immediate-print
Add printf to __THCudaCheck fail clause, in case stack unwind itself …
2016-02-15 12:39:35 -05:00
275dc50223 Add printf to __THCudaCheck fail clause, in case stack unwind itself crashes 2016-02-15 17:29:40 +00:00
d5867f075a change default index for min/max to be a valid index 2016-02-15 15:33:51 +00:00
9442285526 Fixed buffer overflow in ReduceOrCopy
Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-02-12 15:13:56 -08:00
ca5b52a012 Merge pull request #545 from tkoeppe/lapack
Fix memory leaks in THTensorLapack.c
2016-02-12 10:00:42 -05:00
d2e7eb1e03 fixing build breaks due to torch redefinition 2016-02-12 09:54:42 -05:00
b483065c17 Fix memory leaks in THTensorLapack.c 2016-02-12 12:52:07 +00:00
ff757e2d63 Fix memory leak in THTensorRandom.c 2016-02-12 10:20:31 +00:00
6d83f928b0 Fix memory leaks in THTensorMath.c 2016-02-11 18:57:06 +00:00
3d2ed21b34 THLapack.h.in: add generic cleanup mechanism for error macros that do not return 2016-02-11 18:57:06 +00:00
8e4e40557a THGeneral.h.in: add generic cleanup mechanism for error macros that do not return 2016-02-11 18:57:06 +00:00
be88627759 THNN conversion of Spatial* modules 2016-02-11 15:33:42 +01:00
6774055233 Moving Spatial* to lib/THCUNN 2016-02-11 08:51:53 +01:00
519c10573d moving Temporal* to THCUNN 2016-02-10 11:30:28 -05:00
564f695ea5 Merge pull request #534 from ivpopov/patch-1
Fix for issue #517: torch.all
2016-02-08 12:43:16 -05:00
da0634c124 Fix for issue #517: torch.all
torch.all only returns true if all numbers in the tensor are odd #517

Update THTensorMath.c

Respective test
2016-02-08 15:20:30 +00:00
59d9cf27eb Added torch.mod and torch.cmod functions 2016-02-08 10:46:18 +00:00
1f9a434881 Initialise longSize in PipeFile
The previously added longSize field was only initialised in DiskFile and
not in PipeFile. This meant uninitialized memory was used in PipeFile
construction
2016-02-03 11:30:04 +00:00
4100098440 Remove 'dimension aliases' from VolumetricConvolution 2016-02-03 01:57:40 +01:00
04cc804566 Add THCUNN conversion of Volumetric* modules 2016-02-03 00:38:28 +01:00
75bc8bd379 Move Volumetric*.cu -> lib/THCUNN 2016-02-03 00:38:28 +01:00
26e9ba84a8 Remove unused header. 2016-02-02 18:12:38 +00:00
fcdf3decda Revert "Adding argmax index as second return value to max/min calls without specified dimension" 2016-02-02 17:36:41 +00:00
10fa02c5d2 Merge pull request #198 from andreaskoepf/THCUNN_koepf_4
THCUNN functional module conversions batch 4
2016-02-02 09:16:03 -05:00
a0b433c8e1 Re-adding the argmin/argmax as second argument 2016-02-02 06:01:30 +00:00
a7d53e2326 minor bugfix in LeakyReLU 2016-02-01 20:41:10 -08:00
971bba57b3 Merge pull request #300 from egonina/cublas_inverse
Adding cuBLAS matrix inverse
2016-02-01 18:38:58 -05:00
f1c1a513ca Add THCUNN conversions of {RReLU, Sigmoid, SmoothL1Criterion, ...}
Converted modules:
RReLU
Sigmoid
SmoothL1Criterion
SoftMax
SoftPlus
SoftShrink
Sqrt
Square
Tanh
Threshold
2016-02-01 22:22:28 +01:00
0e63742d90 Move {RReLU.cu, Sigmoid.cu, SmoothL1Criterion.cu,..} to lib/THCUNN
Files moved:
RReLU.cu
Sigmoid.cu
SmoothL1Criterion.cu
SoftMax.cu
SoftPlus.cu
SoftShrink.cu
Sqrt.cu
Square.cu
Tanh.cu
Threshold.cu
2016-02-01 21:55:50 +01:00
177251677d Add THCUNN conversions of {MSE, Margin, MultiMargin}Criterion & PReLU 2016-02-01 21:55:50 +01:00
1a87d22109 Move {MarginCriterion, MSECriterion, MultiMarginCriterion, PReLU}.cu -> lib/THCUNN 2016-02-01 21:45:58 +01:00
57fb442eff minor cmake and code fixes for missing features in android, and cross-compilation 2016-01-31 19:37:30 -05:00
caa40b8dd3 Libwrap checks for LIB.so.1 if LIB.so not found
Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-29 12:36:42 -08:00
2758353380 Added NCCL error checking to tests.
Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-29 11:09:05 -08:00
fe1a956715 Enabled support for char type to be unsigned.
GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-28 13:38:18 -08:00
c05312f151 Moved tests to separate dir and improved MPI test
test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-28 12:56:36 -08:00
3a22385a9f Add THNN conversion of {Spatial(Adaptive,Average,Max)Pooling} and SpatialConvolutionMM 2016-01-27 23:02:53 +01:00
ec1d91b2af Move {SpatialConvolutionMM, Spatial(Adaptive,Average,Max)Pooling} to lib/TCUNN 2016-01-27 22:59:39 +01:00
3e467b1b21 torch.topk: use quickselect + quicksort
Follow-up of #496
2016-01-23 18:02:54 +01:00
1f50f37da3 added missing algorithm include 2016-01-22 02:09:28 -05:00
5966316771 Added support for more than 8 GPUs.
Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <jwoolley@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-21 13:00:21 -08:00
130ee246e2 Fixed deadlock in back-to-back reduce_scatters.
Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-21 10:36:03 -08:00
039ae46581 add :cinv(), which does 1/x, addresses https://github.com/torch/torch7/issues/168 2016-01-20 22:16:45 +01:00
c99997d6f1 add inv 2016-01-20 21:06:24 +00:00
85fc3b934b Merge pull request #296 from wickedfoo/topk-sort
top-k implementation + sort works for all cases now
2016-01-20 00:52:24 +05:30
eaf4b5920f Minor fix in generic/THCTensorCopy.c 2016-01-14 15:39:47 +00:00
77b1ba84d6 Adding cuBLAS matrix inverse 2016-01-12 14:31:07 -08:00
2d7deebf2a Add torch.sigmoid and sometensor:sigmoid() 2016-01-12 20:03:34 +00:00
127d0309f6 Add torch.sigmoid, and sometensor:sigmoid() 2016-01-12 18:38:39 +00:00
f309b9d881 Merge pull request #393 from hughperkins/static-optional
static libraries no longer build by default => builds faster
2016-01-11 18:19:50 +05:30
f432255759 hard error on unresizable storages being resized 2016-01-08 14:57:34 +00:00
641522a750 Add THCUNN conversion of ELU, LeakyReLU, LogSigmoid, LogSoftMax, LookupTable 2016-01-08 00:58:36 +01:00
d49cb6613e Move { ELU, LeakyReLU, LogSigmoid, LogSoftMax, LookupTable }.cu -> lib/THCUNN 2016-01-07 23:11:35 +01:00
90af7c73ef Merge pull request #6 from lukeyeager/deb
Deb packaging
2016-01-07 13:06:28 -08:00
f64c36366c top-k impl and sort fixes 2016-01-07 12:17:11 -08:00
3251681207 Merge branch 'yangky11-patch-1' 2016-01-06 16:48:29 -08:00
c9346cdb67 Add functional conversion of DistKLDivCriterion, HardTanh, L1Cost
Also fixes segfault in AbsCriterion.
2016-01-02 22:01:48 +01:00
1c6e1bdb72 Move DistKLDivCriterion.cu, HardTanh.cu, L1Cost.cu -> lib/THCUNN 2016-01-02 20:50:06 +01:00
26f41a5042 Add functional conversion of ClassNLLCriterion
Fixed indentation in THCUNN.lua to 3-spaces.
2015-12-31 00:43:17 +01:00
718f8b78ec Move ClassNLLCriterion.cu to lib/THCUNN 2015-12-30 23:58:04 +01:00
7eeeda5acd Remove special handing for OSX search path 2015-12-29 16:41:20 -05:00
fcb0c77c7a Install THCUNN into ${Torch_INSTALL_LUA_CPATH_SUBDIR} 2015-12-29 16:41:20 -05:00
25bcfbf59e Add THCUNN/ffi conversion of Abs and AbsCriterion 2015-12-29 16:41:20 -05:00
50c153926e Move Abs.cu & AbsCriterion to lib/THCUNN 2015-12-29 16:41:20 -05:00
d8739fc659 topk implementation 2015-12-28 16:12:08 -08:00
8ef2d4413f adding THCGenerateAllTypes.h to CMake for install 2015-12-27 14:35:07 -05:00
2a8cd1c37c Add generic CudaTensor types to cutorch 2015-12-25 14:01:04 -08:00
db83396210 Move files to generic folders (preparation) 2015-12-25 13:53:14 -08:00
d332c41e71 fix a typo in README.md 2015-12-24 00:01:02 +08:00
3a18496f5a THDiskFile: fix incompatible pointer types warning
See #476
2015-12-22 23:15:48 +01:00
81484e7801 Merge pull request #286 from jacobmenick/master
Add isSetTo to cutorch
2015-12-19 03:23:23 -05:00
6335106de9 adding setFlag and clearFlag for storages 2015-12-19 02:56:37 -05:00
c9da89254b Update deb packaging scripts 2015-12-18 14:23:34 -08:00
eb2d869f71 Merge pull request #5 from lukeyeager/tests-nvml
Don't link tests with NVML
2015-12-18 13:36:20 -08:00
f1e92fe2a3 Added Debian packaging files 2015-12-18 13:36:10 -08:00
b5400c54df Don't link tests with NVML 2015-12-18 13:27:55 -08:00
a4de6016f8 Merge pull request #4 from lukeyeager/build-sm50
Build SM 5.0 code
2015-12-18 13:23:48 -08:00
4807909e3f Merge pull request #3 from lukeyeager/semver
Use semantic versioning
2015-12-18 13:22:19 -08:00
dd0884b707 Build SM 5.0 code 2015-12-18 13:19:50 -08:00
e1634ca6cb Use semantic versioning 2015-12-18 12:02:17 -08:00
7b6339cdd2 Make nBufferSize signed for writing longs 2015-12-18 15:50:22 +00:00
794af9b7f7 Merged upstream/master 2015-12-14 10:50:26 +00:00
651a6edc5c Fixed bug in MPI initialization. 2015-12-10 17:54:41 -08:00
ada5edce88 Merge pull request #1 from slayton58/int64_uint64
Add int64 and uint64 types for all algorithms and tests
2015-12-10 17:22:50 -08:00
41ce4ca9fc Add int64 and uint64 types for all algorithms and tests 2015-12-04 13:28:36 -05:00
fe768d729d index changes 2015-12-02 12:25:38 -08:00
1850c54b8c New function longSize for files
This function allows binary-file compatibility between 32-bit and 64-bit
architectures
2015-11-26 21:29:58 +01:00
bc9c69f45e Added missing function (I missed this when merging from my working copy) 2015-11-26 15:37:44 +00:00
937ae9b578 Fixed invalid resize1d call 2015-11-26 15:28:17 +00:00
4eb451165f Implemented Cholesky decomposition of positive semidefinite matrices with complete pivoting 2015-11-26 14:58:27 +00:00
3df4750a37 Add isSetTo to cutorch 2015-11-25 15:17:58 +00:00
27d32ac5d9 Fixed a race condition in reduce and broadcast. 2015-11-19 11:11:52 -08:00
2a067fbe93 Merge pull request #467 from dm-jrae/master
Add isSetTo: simple check for shared storage.
2015-11-19 10:08:31 -05:00
b5f299cc17 Add isSetTo: simple check for shared storage.
Returns true iff Tensor is set to argument Tensor. Specifically, this is
true iff tensor shares same storage as argument tensor, with same
storage offset and identical sizes and strides.
2015-11-18 22:05:31 +00:00
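A minimal Lua sketch of the semantics described above (illustrative values; assumes a Torch build that includes this commit):

```
local a = torch.Tensor(4, 5)
local b = torch.Tensor():set(a)       -- b shares a's storage, offset, sizes and strides
print(b:isSetTo(a))                   -- true
print(a:narrow(1, 1, 2):isSetTo(a))   -- false: same storage, different geometry
```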
0673d5f44f Initial release. 2015-11-17 11:30:40 -08:00
56f18e35d4 Detecting LAPACK on the system with LAPACKless OpenBLAS 2015-11-16 23:01:45 -08:00
b5cb5f17fb Accept 64-bit offsets in THDiskFile_seek 2015-11-16 21:17:03 +01:00
1bd82bf4c6 cutorch copy/event changes 2015-11-13 10:41:14 -08:00
a8c49dfa7d Added support for neg(). 2015-11-11 21:48:26 -05:00
b1fa9d2b06 Fixed unsigned overflows in THFile 2015-11-11 19:19:43 +01:00
caad0b38ac Merge pull request #433 from anastasiuspernat/master
Fixed large data storing and restoring
2015-11-08 12:11:05 -05:00
b9b4ae6ec2 Add missing functions to CUDA tensor and storage. 2015-10-29 19:08:24 +00:00
f7fe6cf1a6 Use proper precision for accumulators in multinomial 2015-10-27 10:47:31 +01:00
15e2939f7a Fixing large data reading/writing/memory storing
When data sizes start to exceed around 4GB, variables of type long can no
longer hold the size, so the type needs to be replaced
2015-10-27 01:01:42 -07:00
fc23f65d4f Check that dimension to be sorted is contiguous when using Thrust. 2015-10-23 09:16:50 +01:00
550afb6ffb Merge pull request #392 from hughperkins/inplace-subtraction
add :neg() :csub(tensor) :csub(scalar), inplace subtraction operators
2015-10-21 11:41:24 -04:00
b3e094ec1d Add indexAdd 2015-10-19 16:58:32 -07:00
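A hedged Lua sketch of the accumulate-by-index behaviour indexAdd introduces (values are illustrative):

```
local t = torch.zeros(5)
t:indexAdd(1, torch.LongTensor{1, 3, 1}, torch.Tensor{10, 20, 30})
-- repeated indices accumulate: t is now {40, 0, 20, 0, 0}
```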
a736ae0b1a Merge pull request #408 from ajabri/master
Added indexAdd
2015-10-19 15:36:43 -04:00
59e30f1a84 Merge pull request #407 from kosklain/master
Add torch.mode and fix median/kthvalue docs
2015-10-19 14:36:24 -04:00
3bf1b9c6ca Merge pull request #255 from wickedfoo/faster-norm
faster norm/normall code
2015-10-19 16:19:11 +01:00
264102d245 Add torch.mode and fix median/kthvalue docs 2015-10-19 11:20:25 +01:00
8c94450c42 More informative error message for pipefile fail 2015-10-19 13:18:25 +08:00
d956c2e0ba change to indexAdd name, add documentation 2015-10-17 17:40:18 -07:00
7cf666f0fc add indexAccum 2015-10-16 08:30:15 -07:00
e0a9f16a10 heap tracking: fix relevant compilation warning 2015-10-14 16:24:54 -07:00
b9c1d6f354 Merge pull request #259 from Moodstocks/cuda_arch_detect
Add automatic CUDA architecture detection
2015-10-12 19:30:59 -04:00
0495d406d6 Fall back to Thrust sort if Torch kernel can't handle the input. 2015-10-12 11:55:28 +01:00
24f950dd5a Add automatic CUDA architecture detection 2015-10-07 12:23:27 +02:00
439bc9bddc faster norm 2015-10-05 14:33:44 -07:00
c5009c2601 Implementation of torch.cat 2015-10-05 18:24:37 +01:00
f9cc6ffcbf add :neg() :csub(tensor) :csub(scalar), inplace subtraction operators 2015-10-04 11:58:46 +07:00
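An illustrative Lua sketch of the new in-place operators (values hypothetical):

```
local x = torch.Tensor{1, -2, 3}
x:neg()                         -- in-place negation: {-1, 2, -3}
x:csub(1)                       -- in-place scalar subtraction: {-2, 1, -4}
x:csub(torch.Tensor{1, 1, 1})   -- in-place tensor subtraction: {-3, 0, -5}
```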
35c2754c88 static libraries no longer built by default 2015-10-04 11:42:18 +07:00
f4ca8de737 hard error on unresizable storages being resized 2015-10-03 23:05:09 -04:00
254acd8ff2 Exposing LAPACK function potri 2015-09-30 13:04:20 +01:00
75bc88d943 Use TensorInfo with NoCollapseDims for scatter & gather operations. 2015-09-28 17:02:23 +01:00
ee35599f93 Merge pull request #374 from j-wilson/master
Exposing the lapack function potrs and extending torch.potrf() to take optional uplo character argument
2015-09-25 11:40:47 -04:00
384668bbc8 adding potrs and uplo option to potrf
adding tests for torch.potrs and (modified) torch.potrf
2015-09-24 16:09:32 +01:00
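A minimal usage sketch of the API described in this commit (argument order as in the Torch docs of the period; treat as illustrative):

```
local a = torch.randn(4, 4)
a = a * a:t() + torch.eye(4) * 4   -- make a positive definite
local b = torch.randn(4, 2)
local u = torch.potrf(a, 'U')      -- factorize, using the new optional uplo argument
local x = torch.potrs(b, u, 'U')   -- solve a * x = b from the factorization
```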
5c53c1e599 Merge pull request #350 from dominikgrewe/catc
More flexible torch.cat
2015-09-23 17:12:35 -04:00
bc9c0c1598 Rewrite indexSelect kernel for correctness
The generic indexSelect kernel (for non-contiguous tensors) was broken;
totally rewrite it for correctness.

Closes #240.
2015-09-22 09:13:59 -07:00
c40f6b6ead LAPACK ormqr routine 2015-09-22 09:42:12 -04:00
d42a3fb3c7 Merge pull request #236 from adamlerer/gc_heapsize_batch
Batch updates to global heapSize
2015-09-09 17:08:53 -05:00
930102fe6e Merge pull request #223 from dominikgrewe/copy_async
Add copyAsync for asynchronous copies between host and device.
2015-09-09 13:39:24 -05:00
d60fe8d4ef Batch updates to global heapSize 2015-09-09 09:41:43 -07:00
3a9cd38784 Batch updates to global heapSize 2015-09-09 09:40:50 -07:00
476f78c300 Use _data & choose CudaTensor's stream 2015-09-09 14:34:32 +01:00
852b79283e Add copyAsync for asynchronous copies between host and device. 2015-09-08 16:25:37 +01:00
d006752946 Add missing size check to scatterFill.
Checks that the index tensor has the right shape.
2015-09-08 15:51:08 +01:00
55a917cd69 More flexible torch.cat
Adds a version of torch.cat that takes an array of tensors as inputs.

Adds TensorArray type to cwrap that reads an array of tensors from Lua
and creates a C array of tensor pointers that are passed to the C
function.
2015-09-02 19:48:36 +01:00
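A short Lua sketch of the new table form (illustrative):

```
local a, b, c = torch.ones(2), torch.ones(2):mul(2), torch.ones(2):mul(3)
local joined = torch.cat({a, b, c}, 1)   -- 6-element vector: 1 1 2 2 3 3
```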
07d1e453a4 Add the nonzero function for finding non-zero elements of a tensor 2015-08-28 14:09:27 +01:00
3d6f2e3b68 Merge pull request #337 from torch/lapack
clean-up and simplify cloning/collection logic
2015-08-22 02:56:15 +01:00
b7932df9cc Fix resize in addcdiv and addcmul. 2015-08-20 15:03:12 +01:00
540c9474fe cutorch gc 2015-08-19 11:04:56 -07:00
b31fe73fb8 Merge pull request #324 from adamlerer/gc_threads
Fix lua GC to support allocations across multiple threads
2015-08-19 15:44:08 +01:00
83cbb7f8de clean-up and simplify cloning/collection logic
clean-up and simplify cloning/collection logic

clean-up and simplify cloning/collection logic
remove extra empty line

remove now unused cloning functions

small name change

adapt trtrs to the new clean setup

remove unnecessary argcheck
2015-08-18 16:03:52 +01:00
9aba96a86e Merge pull request #330 from torch/avxfix
fixing avx checks
2015-08-17 23:34:03 -04:00
768bebff23 Fix Tensor:index for some tensors with the first dimension of size 1
Previously, the code assumed that for contiguous tensors, stride[0]
is the product of size[1] * size[2] * ... size[nDim-1]. This is not
necessarily the case if the size of the first dimension is 1.

For example, contiguous 3x2x7 and 4x2x7 tensors must have stride 14x7x1, but
a 1x2x7 tensor may have any stride in the first dimension (e.g. 99x7x1).

A correct way to determine the effective "row size" for Tensor:index
is to take the nElement / size[0].
2015-08-17 20:07:27 -04:00
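A hedged Lua illustration of the corner case described above (the explicit storage/stride constructor is used only to force an unusual stride in the first dimension):

```
local t = torch.Tensor(torch.Storage(200), 1,
                       torch.LongStorage{1, 2, 7},
                       torch.LongStorage{99, 7, 1})
-- the effective "row size" must be nElement/size[0], not stride[0]
print(t:nElement() / t:size(1))   -- 14, even though t:stride(1) is 99
```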
d874a07541 Merge pull request #218 from adamlerer/thc_device_tensor
Add THCDeviceTensor
2015-08-17 12:33:16 +01:00
babb37c1df Add THCDeviceTensor 2015-08-13 21:20:31 -07:00
924c96ce92 Fix lua GC to support allocations across multiple threads (e.g. threads sharedserialize) 2015-08-12 08:26:43 -07:00
33d1139101 fixing avx checks 2015-08-12 08:40:23 -04:00
4bc8b46d65 Merge facebook changes 2015-08-07 10:53:42 -07:00
e7f2a2aa0e Merge pull request #320 from yozw/trtrs
Exposing the lapack function trtrs which solves triangular systems of linear equations.
2015-08-04 09:45:24 -04:00
1b0f782c56 Exposing the lapack function trtrs which solves triangular systems of linear equations.
Fixed spaces in wrong place
2015-08-03 16:24:55 +01:00
1fc2038979 Added some more comments to THTensorLapack.c, and fixed two error messages 2015-08-03 15:59:47 +01:00
a8f82911e3 Merge pull request #214 from fmassa/gemm_zerostride
Fix for gemm with zero strides
2015-07-31 12:55:53 -04:00
6e21043842 Fix for gemm with zero strides 2015-07-31 18:02:25 +02:00
f3bea731ce fix two more cases of transpose check.
fixes for other cases of lapack function calls with result tensors

clean up

replace manual contiguous check
2015-07-30 18:42:20 +01:00
2b5372655c small fix for MSVC 2015-07-29 13:14:48 -07:00
fa78ede747 fixing build break from #307 2015-07-28 14:30:20 +02:00
16b312867c add shared memory file mapping with shm_open 2015-07-26 07:40:36 +01:00
17fb6c489e Fix SSE 4.1 bug 2015-07-24 11:13:54 +02:00
d2bb9d024e Work under windows 2015-07-23 17:35:51 -07:00
0c6938faa9 Work under windows 2015-07-23 17:32:13 -07:00
4dc3aeaeec Merge pull request #241 from zakattacktwitter/ztaylor/simd_5x5
SSE optimizations for 5x5 convolution.
2015-07-21 15:16:56 -04:00
bd8b926dd8 Merge pull request #260 from sergomezcol/master
Fix problem with NaNs in max and min
2015-07-21 15:14:12 -04:00
d35e7d4cf6 Merge pull request #280 from adamlerer/THAlloc2
THAlloc/THRealloc attempt lua GC on failure
2015-07-21 15:07:03 -04:00
45cd8b8abd Merge pull request #292 from fmassa/gemm_zerostride
Fix for gemm with zero strides plus add unit test for torch.mm
2015-07-21 14:16:13 -04:00
a19f80424e Merge pull request #206 from dominikgrewe/scattergather
CUDA implementations for scatter & gather.
2015-07-21 14:11:47 -04:00
9123d31176 CUDA implementations for scatter & gather. 2015-07-21 14:13:37 +01:00
e968294aba THAlloc trigger GC when heap size exceeds soft max 2015-07-20 12:03:02 -07:00
29b476f32b Element-wise min and max operations.
Add cmin & cmax operation for element-wise min and max between two
tensors and between a tensor and a scalar.
2015-07-20 09:58:25 +01:00
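A quick Lua illustration of both forms mentioned above (values hypothetical):

```
local a = torch.Tensor{1, 4, 2}
local b = torch.Tensor{3, 0, 5}
print(torch.cmin(a, b))   -- element-wise min against a tensor: 1, 0, 2
print(torch.cmax(a, 2))   -- element-wise max against a scalar: 2, 4, 2
```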
50df65bda7 Fix for gemm with zero strides plus add unit test for torch.mm 2015-07-13 16:01:26 +02:00
623e720e02 Merge pull request #286 from torch/lapackargcheck
adding argchecks for Lapack before calls to lapackClone
2015-07-12 19:21:39 -04:00
20562f06b6 Implementations of component-wise min and max. 2015-07-10 15:02:02 +01:00
46ab6c1a30 do not use _aligned_malloc in WIN32: it requires _aligned_realloc and _aligned_free 2015-07-10 16:57:56 +03:00
ba62d19b6d Add support for compilation on Windows using mingw32. 2015-07-09 17:15:06 -07:00
571c1d3b8c Merge pull request #192 from dominikgrewe/copy_sync
Always use events for synchronization when copying across devices.
2015-07-09 02:29:02 -05:00
b240768ce0 adding argchecks for Lapack before calls to lapackClone 2015-07-08 23:39:17 -05:00
eefc497b9e Do not set -Werror for MSVC
TH_API removed from function definitions to allow for static library compilation
2015-07-08 18:54:03 +03:00
6b1dfecf71 check if torch is found before find 2015-07-08 10:12:03 +02:00
0bc9dd6bec fixed possibly wrong pointers coming from a saved state 2015-07-07 03:26:59 +02:00
29e6fdfac4 Always use events for synchronization when copying across devices. 2015-07-03 11:18:48 +01:00
f8b4ada69f THAlloc/THRealloc attempt lua GC on failure 2015-07-01 14:25:38 -07:00
e9e849a017 Fix memory leaks in maskedCopy and indexSelect. 2015-06-29 12:08:48 +01:00
6fb91b253f Fix loop logic error in maskedCopy.
Prevents segfault when the src tensor is empty.
2015-06-26 15:49:28 +01:00
cf93abaf4e Install all THC headers 2015-06-24 11:42:32 -07:00
c7ed230961 Add MAGMA implementations of Torch LAPACK functions 2015-06-24 08:40:09 -07:00
b796e87d6b Merge pull request #185 from dominikgrewe/error_msg
Use standard Torch mechanism for file/line propagation in error messages.
2015-06-24 11:37:42 -04:00
209484e1b7 Stream support for BLAS Handles.
maskedCopy implemented
generic Reduce kernels
2015-06-24 08:36:03 -07:00
daa5285092 Use standard Torch mechanism for file/line propagation in error messages.
Avoids having two different file/line pairs per error message, which is
confusing.

Before:
.../cutorch/init.c(538) : cuda runtime error (10) : invalid device ordinal at .../cutorch/lib/THC/THCGeneral.c:241

Now:
cuda runtime error (10) : invalid device ordinal at .../cutorch/init.c:538
2015-06-24 12:49:41 +01:00
384e3cca77 Use UNIX line endings
Replaced CRLF with LF line endings. All other files use UNIX style line
endings.
2015-06-18 11:04:24 -07:00
4b7ee4492c Merge pull request #178 from colesbury/lookup
Optimized indexSelect kernel for contiguous inputs
2015-06-16 11:15:09 -04:00
56a6054dc4 Optimized indexSelect kernel for contiguous inputs
Adds an optimized indexSelect kernel for contiguous inputs indexed in
the first dimension. This will remove the need for a separate CUDA
LookupTable forward pass, since it can just use indexSelect.
2015-06-15 08:03:21 -07:00
1102e0618e Fix problem with NaNs in max and min
Now min and max always return NaN when the tensor contains a NaN.
2015-06-15 14:34:38 +01:00
910ac3563e Merge pull request #174 from szagoruyko/master
THC standalone compilation
2015-06-13 22:13:16 -04:00
ff1384d12d Use Thrust's sort for inputs larger than 2048
The blockwide sort kernel only supports inputs up to 2048 elements.
Before this change, attempting to sort inputs with size (2048,4096]
would fail.
2015-06-11 13:30:26 -07:00
f2e7d7ca99 Merge pull request #252 from d11/torch_range
Tweak torch.range to be more numerically robust.
2015-06-11 14:52:16 -04:00
ef4aa8e27a Allow CudaTensors as indices
Allow indexFill, indexCopy, and index to accept CudaTensors as indices
instead of LongTensors. We already just do a copy from LongTensor to
CudaTensor in the cutorch implementations, so this change doesn't
result in any additional loss of precision.
2015-06-11 08:06:09 -07:00
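A minimal sketch of the relaxed index typing (assumes cutorch is installed; illustrative only):

```
require 'cutorch'
local t = torch.CudaTensor(10):uniform()
local idx = torch.Tensor{2, 5, 7}:cuda()   -- CudaTensor indices now accepted
local picked = t:index(1, idx)             -- previously this required a LongTensor
```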
c3945c6c5a THC standalone compilation 2015-06-09 08:58:17 -04:00
3e849586a0 FindBLAS.cmake: fix missing stdlib.h include
This is to make sure BLAS F2C auto detection does not fail because of
compiler warnings:

    error: implicit declaration of function 'exit'
2015-06-08 12:02:37 +02:00
33cc71dc55 Tweak torch.range to be more numerically robust.
Otherwise it can produce mildly surprising behaviour in the
floating-point setting, due to rounding errors.
2015-06-03 12:10:22 +01:00
29a18c93ca avoid unnecessary transpose in lapackClone 2015-05-31 22:43:34 +01:00
70fcaa1c49 Fix storage view. 2015-05-27 18:46:58 +01:00
d3b32821c0 Adding support for a static build of torch. 2015-05-27 08:44:53 -07:00
ddb3f2dd12 Merge pull request #158 from zakattacktwitter/ztaylor/storage_view
Adding support for Storage views to cutorch.
2015-05-27 11:29:16 -04:00
a7db84a442 Adding support for Storage views to cutorch. 2015-05-27 08:21:03 -07:00
9772dc39d7 Merge pull request #240 from zakattacktwitter/ztaylor/storage_view
Adding support for Storage views and exposing Storage type.
2015-05-27 09:53:28 -04:00
7710d53bca Removing type from THStorage. 2015-05-26 11:54:56 -07:00
8cf718fd2b Added compile-time flag to disable GPU checks 2015-05-26 15:12:07 +01:00
2796f880f3 Allowing usage of posix_memalign to be disabled through a new flag, DISABLE_POSIX_MEMALIGN 2015-05-26 14:29:12 +01:00
e1ead323df Finally got compiling locally. 2015-05-22 13:25:46 -07:00
c9239436b8 Missed one diff. 2015-05-22 11:59:04 -07:00
2e108d5dea SSE optimizations for 5x5 convolution. 2015-05-22 11:46:15 -07:00
67f8eb429c Adding support for Storage views and exposing Storage type. 2015-05-22 11:37:10 -07:00
49fe800281 Use assignment in indexSelect for 1-dimensional contiguous input.
Gives another 2-3x over memcpy.
2015-05-22 17:24:22 +01:00
e014a57910 duplicate THAllocator.c in CMake 2015-05-22 11:27:53 -04:00
ebeeadfc43 adding THCAllocator* to cmake 2015-05-22 11:24:09 -04:00
7817b4ec2a Merge pull request #154 from dominikgrewe/cuda_host_alloc
Add CudaHostAllocator
2015-05-21 21:24:31 -04:00
c347df19f3 Remove swp file 2015-05-20 12:32:45 -07:00
e2df14b974 Use 64 byte alignment for allocations larger than 5KB
Aligning on 64 byte boundaries (the typical size of a cache line),
speeds up some CPU modules. For example, this speeds up the OpenMP
version of LookupTable by XXX for large batch sizes and 128 dimensions.
2015-05-20 11:08:18 -07:00
4795c660a0 Speed up Tensor:index for contiguous tensors 2015-05-20 11:08:18 -07:00
2df4b9b4ca Fix bug in metamethods 2015-05-20 08:18:35 -07:00
2b4358bec8 Annotate function in THAtomic.h with TH_API.
Needed to correctly link C and C++ code.
2015-05-20 15:07:13 +01:00
9022aa8a9c Add scatter and gather operations.
Functions to read to / write from tensors, selecting elements from
each slice along the specified dimension given an index tensor.
2015-05-19 15:32:59 +01:00
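A small Lua sketch of gather along dimension 2, per the description above (values hypothetical):

```
local src = torch.Tensor{{1, 2}, {3, 4}}
local idx = torch.LongTensor{{1, 1}, {2, 1}}
-- along dim 2: result[i][j] = src[i][idx[i][j]]
print(torch.gather(src, 2, idx))   -- {{1, 1}, {4, 3}}
```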
5aa0ab69b2 Add CudaHostAllocator
* A new allocator that uses cudaMallocHost.
* cutorch.createCudaHostTensor(...) to create FloatTensor allocated with
  CudaHostAllocator.
2015-05-19 11:40:55 +01:00
5c126afd71 Revert "Auto device: API changes, bug fixes, README.md"
This reverts commit d88ac24c712e3a40d4aaf3ac2d043bd79ba4280e.

Revert "Auto device mode, plus allocation helper functions."

This reverts commit 47a2f6de252c2254234edfc1c6115229b5383bac.
2015-05-12 20:31:39 -07:00
b954190cf3 Add LAPACK QR decomposition to Torch.
Three functions are added: torch.geqrf() and torch.orgqr(), which
call the corresponding LAPACK functions directly, and torch.qr(),
which is just a convenience function that calls them both correctly
and puts the results in a more intuitive form.
2015-05-12 15:18:42 +01:00
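An illustrative Lua sketch of the convenience form; the raw geqrf/orgqr route is summarized in a comment (exact return conventions per the docs of the period):

```
local a = torch.randn(5, 3)
local q, r = torch.qr(a)          -- convenience wrapper
-- roughly equivalent, per the message: torch.geqrf(a) then torch.orgqr(...)
print((q * r - a):abs():max())    -- ~0
```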
3ef5ad2e1c Include THAtomic.h 2015-05-07 18:14:58 +01:00
29d0b3a537 Use THAtomic operations for ref-counting. 2015-05-07 11:57:48 +01:00
1c69008a7a cleaning up rest of deviceReset code 2015-05-02 18:05:31 -04:00
fa0451afe0 Error message improvements.
* Add C file/line to THError and THArgCheck. Add tensor sizeStr() method.
* Add THAssert THAssertMsg functions.
* Add sizeDesc() and desc() functions, which return tensor descriptor
strings for use in error messages. To avoid memory leaks, we have to pass these
strings on the stack within structs, so it's a bit unwieldy.
* Improved a bunch of error messages.

e.g.

```
th> torch.mm(torch.Tensor(4,4), torch.Tensor(5,5))
size mismatch, m1: [4 x 4], m2: [5 x 5] at /home/alerer/git/torch7-2/lib/TH/generic/THTensorMath.c:511
stack traceback:
...
th> torch.Tensor(1):size(4)
[...]:1: bad argument #1 to 'size' (dimension 4 out of range of 1D tensor at /home/alerer/git/torch7-2/generic/Tensor.c:16)
stack traceback:
...
```
2015-04-30 13:05:46 -07:00
bccf898276 Auto device: API changes, bug fixes, README.md
- Change :cuda(device) overload to :cudaOn(device)
- Add :cloneOn(device)
- Fix bug in +,-,*,/ metamethods: checkGPU wasn't being called on these
  metamethods.
- Add description of auto-device mode to README.md
2015-04-30 10:43:06 -07:00
950c3f2b1a Auto device mode, plus allocation helper functions.
This diff introduces an alternative way of writing multi-GPU cutorch
code. In this mode, the location of each tensor is specified, and the
appropriate GPU for each kernel is determined automatically based on the
location of its argument tensors. It's backwards-compatible and interoperable
with the old-style multi-GPU API.
2015-04-29 07:47:13 -07:00
186df5a08a fixing torch.cat for sizes greater than 2^31 2015-04-27 23:51:53 -04:00
1dc8e0ba33 Merge pull request #221 from timharley/element-size
Add :elementSize() for Tensors.
2015-04-27 13:32:55 -04:00
c8dd97526c Add :elementSize() for Tensors. Minor tidy. 2015-04-27 17:40:48 +01:00
42828944c7 Merge pull request #213 from timharley/element-size
Add elementSize method to *Storage types.
2015-04-26 16:14:59 -04:00
a72fcdd4fe Merge pull request #214 from torch/maskedCopyFix2
fixed maskedCopy to accept src with size greater than the num of ones in mask
2015-04-22 16:31:06 -04:00
f98a859dde fixed maskedCopy to accept src with size greater than number of ones in mask 2015-04-22 16:29:25 -04:00
23b1528cce Add elementSize method to *Storage types. 2015-04-22 17:19:23 +01:00
a6870e4325 Fix argument checks 2015-04-17 16:57:50 +01:00
efc7773335 Merge pull request #206 from timharley/sizeIs
Add isSize method for Tensors.
2015-04-15 12:38:45 -04:00
6f5dddd9ae Add isSize method for Tensors. 2015-04-15 16:57:40 +01:00
fd67e04dca using atomic operations for refcounting THStorage and THTensor 2015-04-14 10:40:07 -07:00
53cf004739 added atomic operations to TH 2015-04-14 10:39:53 -07:00
28e69de773 Implementation of logical all and any. 2015-04-10 16:30:54 +01:00
4b131dfb02 adding cutorch streams 2015-04-07 09:19:20 -07:00
c8e01c19f2 Fixed serialization in ASCII, for MemoryFile. 2015-04-04 12:59:46 -04:00
a011e6f5c7 Merge pull request #182 from GeorgOstrovski/master
Added kthvalue and median operations, tests, doc
2015-04-03 18:23:33 -07:00
a624111c73 Merge pull request #184 from GeorgOstrovski/warnings
Minor refactoring of RNG code
2015-04-03 15:38:29 -07:00
8caabd07dd Revert "Add file and line number to THError." 2015-04-03 10:35:48 -04:00
921cef77db Merge pull request #192 from adamlerer/master
Add file and line number to THError.
2015-04-03 10:29:20 -04:00
fc81f4be42 Add file and line number to THError. 2015-04-03 02:21:03 -04:00
340f5f231e adding sort + test, fixing max/min and sum tests to not fail occasionally 2015-04-01 11:32:06 -07:00
aa732fc5ed revamps TensorMath to remove sync points at many places, adds maskedSelect and maskedFill operations (and tests).
Also adds generic Reduce and Apply kernels that can be reused.
2015-04-01 09:47:25 -07:00
396e0ee140 Recreate cuBLAS handle on deviceReset.
Only need to reset the cuBLAS handle for the current device, because
only resources associated with the current device will be reset by
cudaDeviceReset.
2015-03-27 17:50:46 +00:00
992798504c Fix FindSSE SSE4* check crash on MSVC. 2015-03-26 20:47:46 +08:00
442b8afa61 Minor refactoring of RNG code 2015-03-23 16:37:09 +00:00
521c3abb5a Added kthvalue and median operations, tests, doc 2015-03-22 23:37:17 +00:00
283c31006a Fix RNG issues.
Reset MTGP32 constants when setting the seed.
Always generate multiples of 256 numbers (discarding the ones not
needed).
2015-03-05 15:28:34 +00:00
64f6355abb Implementation of reduction without any limit in the number of dimensions.
Follows structure of the kernels to compute std & var.
2015-02-25 17:31:11 +00:00
5f832a2823 Merge pull request #148 from jucor/qsort_singlecommit
Qsort from reference article -- rebased
2015-02-21 16:56:34 +05:30
64a913454f Implement Sedgewick's quicksort
Benchmarks show it is 10-60% faster than the current implementation,
depending on input.

Factorize test for sort.
2015-02-19 16:49:46 +00:00
aae649e010 Fixed torch.gels for the underdetermined (m < n) case 2015-02-19 16:40:31 +00:00
e2c386bbda Call cudaGetLastError if peer access is already enabled.
If cudaDeviceEnablePeerAccess fails with cudaErrorPeerAccessAlreadyEnabled
any future call to cudaGetLastError would return an error. Call
cudaGetLastError explicitly to reset the error.
2015-02-18 14:33:56 +00:00
51dda0ac12 Checking for valid RNG state 2015-02-13 10:53:22 +00:00
c0c43aee9c Ignore cudaErrorPeerAccessAlreadyEnabled.
If we call cudaDeviceEnablePeerAccess for the same devices twice, it
will return cudaErrorPeerAccessAlreadyEnabled. THCudaCheck treats that
as an error and fails, whereas it really only indicates that peer access
has already been enabled. In which case we shouldn't fail.
2015-02-05 12:10:19 +00:00
91690a50f8 Make sure RNG is always seeded 2015-02-05 11:10:05 +00:00
a046a6b927 Make bmm use baddbmm.
Just like mm uses addmm, we can write bmm in terms of baddbmm. This way the
two behave the same in terms of when the output is resized.

Also gets rid of unnecessary zeroing out of output of mm and mv.
2015-02-04 17:13:34 +00:00
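A hedged Lua sketch of the equivalence the message describes:

```
local b1 = torch.randn(4, 2, 3)
local b2 = torch.randn(4, 3, 5)
local out  = torch.bmm(b1, b2)                            -- 4 batched 2x5 products
local out2 = torch.baddbmm(torch.zeros(4, 2, 5), b1, b2)  -- same result: zero accumulator
```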
77bd3a4134 Merge pull request #108 from dominikgrewe/cumsumprod
cumulative sum and product
2015-02-02 21:35:16 -05:00
45a3be0a76 Add missing header file
nvcc is complaining: THCTensorConv.cu(340): error: identifier "THCudaTensor_copy" is undefined
2015-01-29 11:53:39 -05:00
54b8e815af cumulative sum and product
Add cumsum and cumprod.
2015-01-29 15:14:05 +00:00
1dd01bd395 Fixed getRNGState in uninitialized case 2015-01-28 19:49:36 +00:00
f5e8fc9bd3 Merge pull request #105 from dominikgrewe/bmm
Add bmm and baddbmm.
2015-01-28 11:59:27 -05:00
ab1304fe5f Fix compiler errors/warnings.
Add missing headers and cast allocs.
2015-01-28 12:12:15 +00:00
f3854a8731 Add bmm and baddbmm.
Analogous to functions recently added to Torch.
2015-01-27 18:05:34 +00:00
44c73fc93e Merge pull request #103 from dominikgrewe/blas_state
Move Blas state into THCState.
2015-01-27 11:21:42 -05:00
7d78597b9a Move Blas state into THCState. 2015-01-27 16:20:58 +00:00
57e9ecb209 Add cpow and alternative pow.
As added to Torch by @GeorgOstrovski recently.
2015-01-27 14:11:26 +00:00
3db013b626 Initialize the random number generator from /dev/urandom on Unix platforms. 2015-01-24 01:16:50 -05:00
afc7563b24 Renamed baddmm -> addbmm; added baddbmm.
- addbmm is a batch MM + reduce add
- baddbmm is a batch MM + batch add
2015-01-22 09:24:20 -05:00
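An illustrative Lua sketch of the naming distinction above (shapes hypothetical):

```
local b1, b2 = torch.randn(4, 2, 3), torch.randn(4, 3, 5)
local r1 = torch.addbmm(torch.zeros(2, 5), b1, b2)      -- 2x5: batch MM + reduce add
local r2 = torch.baddbmm(torch.zeros(4, 2, 5), b1, b2)  -- 4x2x5: batch MM + batch add
```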
22cf833a34 Avoiding allocs in bmm and baddmm. 2015-01-21 12:34:13 -05:00
2e775b5262 Add batch matrix-matrix multiplication with accumulation.
- behavior is similar to torch.addmm, except that both input
  matrices have an additional dimension, as in torch.bmm
- tests cover all use cases
2015-01-21 12:07:41 -05:00
05364d4c23 Added alternative version of pow function 2015-01-21 10:31:30 +00:00
350d68054c Added element-wise power fct cpow 2015-01-21 01:38:26 +00:00
08e1562de0 Add batch matrix-matrix multiplication.
Add torch.bmm, which expects two 3D tensors as inputs. A 'b x m x n'
tensor is interpreted as a batch of b matrices of size 'm x n'.
The result is a batch of matrices computed by multiplying the
corresponding pair of matrices in the two input batches.
2015-01-14 14:25:55 +00:00
c5f210936b Pass a state to every THC function.
Every THC function gets a THCState pointer as the first argument.
Some generic files that were previously included have been instantiated
because TH functions currently don't get a state parameter.
2015-01-14 09:54:25 +00:00
9db2844e87 Single-dimension std and var.
Implementation of std and var across a single dimension. The computation
is the same as in Torch. The reduction code is similar to the one used
for sum, max etc., but it's not restricted to <= 4 dimensions.
2015-01-06 10:38:21 +00:00
a495b98694 more copy optimizations, and additions to unit tests 2014-12-30 19:21:20 -08:00
4ac604b136 very fast copy kernels for non-contiguous cases by Jeff Johnson, and a robust and completely randomized copy test. Also fixes #90 2014-12-29 21:20:54 -08:00
3c62501f0a Revert "Add parameter check to torch.addmm" 2014-12-26 18:29:51 -05:00
083ddf40a3 Check for equal storage of parameters to addmm 2014-12-25 15:09:28 +00:00
dde7cc75a0 Merge pull request #86 from dominikgrewe/missing_headers
Add missing headers.
2014-12-01 23:16:45 -05:00
0d0596d6b6 Add missing headers.
After adding these headers it's actually possible to compile the .cu
files separately rather than in one monolithic THC.cu. "luarocks make"
takes longer doing that, though (any ideas why?), so maybe it's not worth
doing.
2014-11-25 14:10:17 +00:00
a6abd3f1a7 Cast allocation to avoid compiler warnings. 2014-11-25 11:25:36 +00:00
cbd25963f9 indexCopy unit test and less allocations 2014-11-23 22:47:56 -05:00
cda2a6f415 Reset RNG state after device reset.
A device reset destroys the state of the RNG, so we have to re-initialize
it after each reset.
2014-11-19 17:07:52 +00:00
a0adac936c removing DIVUP from THC headers! (bad behavior to put it there, might conflict with other libraries) 2014-11-11 21:37:58 -08:00
67b4233ea1 removing bad code leftover from cublasv2 refactor 2014-11-11 21:32:57 -08:00
69cb0acc50 adding getDevice for tensor, manualSeedAll and seedAll 2014-11-11 21:15:19 -08:00
9d56e1bb61 Move global RNG state into cutorch table.
Create a new struct for cutorch's global state, THCudaState, currently
only containing the global RNG state, THCudaRNGState.
All RNG methods take the state as their first argument.
2014-11-11 18:33:32 +00:00
6d90a23057 Upgraded CuBLAS to API V2.
- more consistent error checking
- one cublas handle per GPU
2014-11-10 23:53:47 -05:00
2268cae9c7 atan2 implementation 2014-11-07 11:19:58 -08:00
3bd605b1f7 round function, test for cumprod 2014-11-07 10:54:37 -08:00
7e809a5db0 ones, zeros, numel, prod, reshape 2014-11-07 10:37:46 -08:00
0e699744a3 fixing small bugs in prod, cumsum, cumprod 2014-11-07 09:07:52 -08:00
dd1de36f6d fixing min/max to return indices as well (when given an arbitrary dimension) 2014-11-07 09:49:48 -05:00
8160e9283f fixing cadd API inconsistency 2014-11-06 09:50:19 -08:00
746a163c9d fixing inconsistent API in addcmul, addcdiv 2014-11-06 09:29:26 -08:00
ff84e38d23 adding operator overloading, made add/mul/div consistent with torch7/TH, made all the blas functions (addmv,addr,addmm) consistent with torch7/TH and added the missing ones (mv,mm,ger) 2014-11-03 15:14:31 -08:00
5781c6106a fix recent cmake warnings and update cmake way of handling rpath 2014-11-03 15:00:09 -08:00
17438ab7ee added assertion 2014-10-30 11:14:12 -04:00
50dda86053 Fix computation of number of thread blocks in random number generator.
Move definition of DIVUP from THCTensorCopy.cu to THCGeneral.h.
Use DIVUP to compute the number of thread blocks for random number
generation.
2014-10-29 14:12:25 +00:00
e5966647cd Merge pull request #57 from dominikgrewe/curand_dev
Use curand device API for random number generation to expose RNG state.
2014-10-28 18:13:03 -04:00
bb39e71734 added clamp CudaTensor method 2014-10-22 12:53:25 -04:00
0bf358c919 Merge branch 'master' of https://github.com/torch/torch7 into clamp 2014-10-22 12:21:19 -04:00
bfa4a88eda added a clamp method for tensors. 2014-10-22 12:13:14 -04:00
0b614eeb25 Added torch.round function 2014-10-22 16:50:33 +01:00
ce8ba6dc82 Added logical any and all (similar to numpy) 2014-10-17 09:44:03 -04:00
8ed29abe7b Use curand device API for random number generation to expose RNG state.
Change RNG code from using curand's host API to the device API. This
allows us to expose the state of the RNG to the user for saving and
restoring.
2014-10-17 13:37:08 +01:00
6b81436aae Merge pull request #56 from torch/THCudaBlas
added THCudaBlas, now handles a bunch of corner cases wrt stride = 1
2014-10-15 09:56:30 -04:00
ccb0fab35d added THCudaBlas, now handles a bunch of corner cases. Also fixes #55 and another bug in addr 2014-10-15 09:56:37 -04:00
4f9dc7c67e One random number generator per GPU.
Instead of having a single generator shared between GPUs, create a separate
generator for each GPU. A generator is only initialized when the corresponding
GPU is being used.
2014-10-07 14:09:39 +01:00
cb5a5072b5 fixes non-contiguous copy bug 2014-10-05 16:13:31 -04:00
348af09737 making the device-to-device copies async 2014-09-29 14:12:52 -04:00
b82b2a69a8 Initialize random seed during CUDA initialization.
Initialize the random seed for all GPUs during CUDA initialization rather than
when the device is set.
2014-09-25 10:30:59 +01:00
8cae227d6b peer to peer access 2014-09-24 21:45:38 +02:00
d987c847d3 Make element-wise unary operators more consistent with Torch.
Added two-argument versions of many unary operators to allow out-of-place
operations.
Made the 'sign' and 'pow' operators resize the target tensor if necessary.
Added tests for all operators.
2014-09-01 19:03:27 +01:00
7cff6e5437 fixing #36 indexCopy works for non-contiguous source tensors 2014-08-24 13:04:00 -04:00
1e156b6ba7 Tensor:cdiv with three arguments
Torch tensors support the cdiv operation with both two and three arguments.
cutorch only supports the two-argument (in-place) version. This patch extends it
to also support the three-argument version and adds a test for it. The
implementation is very similar to cmul which already supports both two and three
arguments.
2014-08-12 11:16:00 +01:00
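A minimal sketch of the two forms (assumes cutorch; values illustrative):

```
require 'cutorch'
local x = torch.CudaTensor(3):fill(6)
local y = torch.CudaTensor(3):fill(2)
local z = torch.CudaTensor(3)
z:cdiv(x, y)   -- new three-argument form: z = x ./ y, x untouched
x:cdiv(y)      -- existing two-argument, in-place form: x = x ./ y
```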
4ca1486f75 more verbose error for texture creation errors 2014-08-04 00:06:09 -04:00
3a3b4b96b0 replacing checkCudaErrors with more traditional error checking 2014-07-24 00:05:52 -04:00
b35787f7cd adding THCudaTensor_getTextureObject 2014-07-24 00:00:11 -04:00
6bd91b1c77 adding torch.isSameSizeAs 2014-07-23 23:06:02 -04:00
d4635b2267 resize should go first 2014-07-21 01:20:01 -04:00
3ad4f6e3f3 added back arg checks 2014-07-19 18:32:42 -04:00
d93a56dc36 torch.CudaTensor:logical[Value/Tensor] resizes result 2014-07-19 18:03:34 -04:00
8fa790e0ed torch.CudaTensor:add resizes result 2014-07-19 17:48:09 -04:00
5b06edc9b9 torch.CudaTensor:[addcmul/addcdiv] resizes result 2014-07-19 17:32:10 -04:00
abecc96179 torch.CudaTensor:cdiv resizes result 2014-07-19 17:27:27 -04:00
6600ec7b23 torch.CudaTensor:cmul resizes result 2014-07-19 17:24:50 -04:00
328c4d930c No need to search for gfortran 2014-07-09 22:20:22 -07:00
9844d1bb9c checking for zero-element tensors in Copy 2014-06-30 13:54:28 -04:00
ded9e7358e Make the THGenerator next field an offset. 2014-06-23 21:24:43 +01:00
ae376363cb Fix RNGState serialization bug + remove whitespace. 2014-06-23 18:59:14 +01:00
527ae5e0b7 Merge pull request #45 from torch/indexCopyChecking
stricter error checking for indexCopy
2014-06-19 14:13:07 -04:00
61d895508c fixed warning/bug for double random generator 2014-06-19 14:11:32 -04:00
56267edeaf added indexCopy 2014-06-18 20:34:48 -07:00
07850eefce stricter error checking for indexCopy 2014-06-18 19:07:52 -04:00
a207bb6bd0 fix minor bug in addmm to suport resulting matrix of dimensions Nx1 2014-06-18 12:39:47 +01:00
f5a7c8f633 Revert "Merge pull request #20 from hycis/master"
This reverts commit b6fa093ecf020147f697310d4ccbc485280dcbd9, reversing
changes made to 12eeb82762e9db0d950047f56cda4b950a52f55d.
2014-06-15 12:40:57 -04:00
644dd5bb0a optimized indexCopy 2014-06-15 00:32:55 -07:00
269e4cda9a resolved conflicts 2014-06-13 12:49:38 -04:00
38c5ba1349 optimized power of 1 and 2 2014-06-13 12:40:03 -04:00
9200b1977e added indexFill and simplify index 2014-06-12 23:35:42 -07:00
c33a8d2d5b indexSelect works 2014-06-12 11:12:00 -04:00
c1466a5ad8 indexSelect compiles again 2014-06-12 11:12:00 -04:00
8c4b009935 Merge branch 'indexSelect' of github.com:hycis/cutorch into indexSelect 2014-06-12 01:05:07 -04:00
79533164f7 indexSelect_kernel arguments 2014-06-12 01:04:47 -04:00
c68b50d4cf small bug 2014-06-11 22:01:41 -07:00
b26875768e harmonized THCudaTensor_indexSelect to cutorch semantics 2014-06-12 00:53:47 -04:00
b8cbe0700e optimized indexSelect 2014-06-11 21:33:46 -07:00
99f88f8833 fixed argchecks 2014-06-11 15:45:16 -04:00
0d1d7624e7 fixed argchecks 2014-06-11 15:44:47 -04:00
4b14bc3b67 renorm does not accept vectors 2014-06-11 15:25:15 -04:00
7f9e557b03 Merge branch 'master' of github.com:torch/cutorch into norm 2014-06-10 23:35:16 -04:00
13785dbbb8 unit tested renorm (works) 2014-06-10 23:31:57 -04:00
e09f019412 cwrapped renorm and fixes 2014-06-10 19:23:56 -04:00
49bc4dee56 renorm unit tests 2014-06-10 19:05:31 -04:00
9935b3b514 Merge branch 'master' of github.com:torch/torch7 2014-06-10 18:08:10 -04:00
a7892bd150 renorm works (debugged and unit tested) 2014-06-10 18:07:25 -04:00
b50dc9cd04 renorm only renormalizes rows with greater norms than maxnorm 2014-06-10 17:08:36 -04:00
b459ddd7cf added argcheck to renorm 2014-06-10 16:42:38 -04:00
872b67b636 added THTensor_(renorm) 2014-06-10 16:09:07 -04:00
5ba09ec2d4 added THTensor_(renorm) prototype 2014-06-10 14:27:17 -04:00
6ea8ccc928 added THCudaTensor_renorm prototype 2014-06-10 14:23:31 -04:00
6ac7bc692c THCudaTensor_renorm is ready for testing 2014-06-10 14:21:17 -04:00
60297a5c6d remove print comment 2014-06-06 12:08:37 +01:00
81c58afbde Support copy of large non contiguous tensors 2014-06-06 12:04:14 +01:00
7d419f104c Merge pull request #31 from timharley/rngstate_serialisation
Fix get/setRNGState for gaussian state.
2014-05-22 15:00:27 -04:00
38bce0a9ac Merge branch '3dconv' 2014-05-05 13:36:46 +01:00
62f6480088 Used long instead of int for the mmap file size. 2014-05-02 15:36:11 +01:00
2d3b5ca941 adding missing INCLUDE (CheckCSourceCompiles) in cmake file 2014-04-15 13:26:22 -04:00
344ec4ef81 Merge pull request #5 from ajtulloch/cleanup-kernel-logic-slightly
Minor refactoring of specialized kernel size logic for convolution/correlation kernels
2014-04-14 11:07:12 -04:00
fa04fa1e3d fixing precision using float.h limits, since FLT_DECIMAL_DIG is not available in C89, hardcoding the values, they don't change 2014-04-14 10:25:29 -04:00
fe110253ae scanf doesn't take precision width 2014-04-13 17:04:22 -04:00
fcb88a40e8 increasing ascii serialization precision for floats/doubles 2014-04-13 16:50:14 -04:00
6a3f930eee made TH standalone installation more flexible 2014-04-11 18:13:59 +02:00
fabbffc0e6 Fix get/setRNGState for gaussian state.
Fixes #8
2014-04-11 11:34:46 +01:00
8da18af6ce update to new storage mmap api 2014-04-07 14:54:09 +02:00
b751fa68a1 prepare TH to git subtree: merge recent work
Conflicts:
	lib/TH/generic/THStorage.c
	lib/TH/generic/THTensorRandom.c
	lib/TH/generic/THTensorRandom.h
2014-04-07 14:33:42 +02:00
040287fadd Merge commit 'da172cd8703163ec69f6bd2dfea9fe1357db67bf' as 'lib/TH' 2014-04-07 14:24:08 +02:00
4d9476ab25 prepare TH for git subtree: remove directory 2014-04-07 14:23:28 +02:00
5124e75a18 input/output slicing fix for 3d conv 2014-04-02 16:50:02 +01:00
60b5374440 Updated 2014-03-31 18:38:35 +01:00
3aa56a3a8f merged THGenerator with multinomial 2014-03-20 12:07:14 -04:00
6b34ca6121 Merge pull request #18 from timharley/thread_safety
Make luaT_classmodulename, THError & THArgCheck threadsafe
2014-03-20 12:58:05 +01:00
c190261131 Merge pull request #13 from timharley/thgenerator
Make random number generator state a TH type
2014-03-20 12:45:32 +01:00
2a556dfab5 Thread local callback and state for THError. Copied from github:koraykv/torch 2014-03-14 17:07:54 +00:00
465d659076 Unnecessarily static. 2014-03-14 17:07:54 +00:00
8269bf170d multinomial c89 compliant 2014-03-07 14:00:40 -05:00
f387667258 TH: optimization for cadd 2014-03-07 19:05:13 +01:00
8587f5c5cc TH: optimization for normall 2014-03-07 19:04:08 +01:00
6ec799058d Merge remote-tracking branch 'upstream/master' 2014-03-06 17:20:55 -05:00
8d9c451dfc removed debugging code. added unit tests 2014-03-06 16:52:40 -05:00
6aab01dbda multinomial without replacement, inline code, uses less memory 2014-03-06 14:24:16 -05:00
65e87052e7 TH: fix speed issue due to calls to THTensor_nElement() 2014-03-06 19:02:51 +01:00
1694870557 Whitespace. Duplicate #define. 2014-03-06 16:15:04 +00:00
0b184896e6 Reorder arguments to Tensor functions. 2014-03-06 16:04:03 +00:00
b8dbe0ee19 Indentation, whitespace, comments. 2014-03-06 11:55:23 +00:00
b5744ef1a3 multinomial with replacement 2014-03-05 16:31:46 -05:00
046ad6b37d Heinous bug. 2014-03-05 21:11:35 +00:00
2bf674bebd Implement method abs() for LongTensor and IntTensor 2014-03-04 17:17:37 +01:00
56bc19d278 Rename mersenne_state to THGenerator. 2014-02-28 15:23:42 +00:00
96f7d4141d qth runs, can call random methods if you explicitly pass a generator. 2014-02-28 15:23:42 +00:00
870fce8d50 Miraculously bulds. Very unhappy when you try to run it. 2014-02-28 15:23:42 +00:00
824a6d73f1 Starting point for wrapping the mersenne state in a Lua object. 2014-02-28 15:23:42 +00:00
1dc0c18121 Refactored TH to not use global state for random number generation.
* Style / naming needs work.
* Breaks other layers within torch
2014-02-28 15:23:42 +00:00
3f3972e0b1 The newWithMapping function should have a return value when HAVE_MMAP
is not defined, to satisfy the function signature.
2014-02-21 12:02:53 +00:00
8f1767c5e8 lib: relax compilation flags 2014-02-14 17:45:03 +01:00
26d7a431d5 relax compilation flags 2014-02-14 17:22:26 +01:00
18682525c2 added/modified cmake files to make cutorch a standalone package 2014-02-14 12:22:26 +01:00
98849c317f added rockspec 2014-02-14 11:53:08 +01:00
a2e40430e8 luaT: make sure it is C++ compatible 2014-02-14 11:48:01 +01:00
859a9f5698 fixed torch.packageLuaPath 2014-02-14 11:39:46 +01:00
67b7b06817 LUA_WIN -> _WIN32 2014-02-14 11:31:29 +01:00
84b55356a5 added/modified cmake files to make torch7 a standalone package 2014-02-14 11:27:11 +01:00
66a5927b7c added/modified cmake files to make torch7 a standalone package 2014-02-14 11:27:11 +01:00
fa0e47d77d fix dependency issues related to the new cwrap standalone package 2014-02-14 11:27:10 +01:00
0bef0d525f dok -> md 2014-02-14 11:27:05 +01:00
86f522d0c9 CUDA fixes
fix tests:
 - std and var do not have multi-dim implementations
 - do not use square matrices for tests, hides corner cases
 - randomize the size of the matrices used in tests
 - fix x:mean(dim) call to divide by the size of 'dim'.
2013-12-24 12:19:47 +00:00
10022f6414 fix mem leak on maskedCopy 2013-12-03 18:09:28 +00:00
1ac061beb7 fix mem leak on maskedCopy 2013-12-03 18:09:28 +00:00
ff32736d78 less strict about ansi libraries, less warnings 2013-11-26 22:03:45 -05:00
95006b300d less strict about ansi libraries, less warnings 2013-11-26 22:03:45 -05:00
302bf7d504 less strict about ansi libraries, less warnings 2013-11-26 22:03:45 -05:00
038460ed97 master compiles under win (msvc11) 2013-11-06 12:03:47 -05:00
845a6c2876 Merge pull request #185 from jucor/lazy_faster
Fix quicksort on constant arrays
2013-10-29 06:13:36 -07:00
ae0b8ba34c Merge pull request #185 from jucor/lazy_faster
Fix quicksort on constant arrays
2013-10-29 06:13:36 -07:00
415c1d5912 Merge pull request #184 from jucor/assert_error
Add new asserts to Tester
2013-10-29 06:13:11 -07:00
71745546be Test that Cholesky errs on rank-deficient input 2013-10-29 11:49:44 +00:00
6f68005537 Fix bug in potrf wrapping: return a triangular matrix
The cholesky decomposition is supposed to be upper triangular.
The wrapper for Lapack's `potrf`, instead, filled the resulting matrix to make
it symmetric. Since this behaviour was introduced in commit 971973d0, which
dealt with C89 and added such lines for several wrappers, I wonder if
this was not a copy/paste oversight.

This patch also adds a test for the cholesky decomposition.
2013-10-29 11:32:44 +00:00
b146a73b27 Fix bug in potrf wrapping: return a triangular matrix
The cholesky decomposition is supposed to be upper triangular.
The wrapper for Lapack's `potrf`, instead, filled the resulting matrix to make
it symmetric. Since this behaviour was introduced in commit 971973d0, which
dealt with C89 and added such lines for several wrappers, I wonder if
this was not a copy/paste oversight.

This patch also adds a test for the cholesky decomposition.
2013-10-29 11:32:44 +00:00
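A hedged Lua illustration of the corrected behaviour:

```
local a = torch.randn(5, 5)
a = a * a:t() + torch.eye(5) * 5     -- make a positive definite
local u = torch.potrf(a)             -- now strictly upper triangular
print((u:t() * u - a):abs():max())   -- ~0, since a = u' * u
```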
e17be1fa25 Add tests for sorting with equal keys 2013-10-24 18:11:17 +01:00
38cf12aa5e Add bibliographic reference 2013-10-24 17:43:37 +01:00
1de6249a3c Add bibliographic reference 2013-10-24 17:43:37 +01:00
f0b4170aa3 Fix quicksort for constant arrays
Change the comparison with the pivot to strict inequality, not or-equal.
This solves the terrible performance of sorting a constant array.
2013-10-24 17:25:33 +01:00
1d7b661ae1 Fix quicksort for constant arrays
Change the comparison with the pivot to strict inequality, not or-equal.
This solves the terrible performance of sorting a constant array.
2013-10-24 17:25:33 +01:00
e23732735a Plot time for constant array 2013-10-24 17:19:48 +01:00
5d8e324fb4 Fix typo in assertalmosteq 2013-10-22 15:46:35 +01:00
ea15bbce36 Remove my now-obsolete TODO in assertTableEq 2013-10-22 15:40:38 +01:00
660e98a404 Test for new asserts 2013-10-22 15:39:43 +01:00
6ea08e236b Display error returned if not the expected one 2013-10-22 15:39:30 +01:00
d075042a12 Add Tester:assertalmosteq() 2013-10-22 15:31:30 +01:00
400a37efa9 Make error message clearer 2013-10-22 15:13:47 +01:00
8ca54d06f9 Add assertErrorPattern 2013-10-22 14:59:31 +01:00
b1bb1b3f9f Extend assertError and fix global
Two new asserts: assertErrorMsg and assertErroObj,
to compare the actual error message or object returned.
2013-10-22 14:58:29 +01:00
457e037cdc openmp for add/mul/div is restricted to only large arrays (arbitrary threshold set to 100k). Also fixes some c89 commenting for the new quick sort algorithms 2013-10-17 12:26:35 -04:00
73f27d8a3a openmp for add/mul/div is restricted to only large arrays (arbitrary threshold set to 100k). Also fixes some c89 commenting for the new quick sort algorithms 2013-10-17 12:26:35 -04:00
38682cebf4 Merge pull request #176 from jucor/faster_qsort
Speed-up qsort for sorted arrays
2013-10-17 08:55:30 -07:00
fb2ad84ed2 Remove dependency on unused outside package 2013-10-17 08:44:32 +01:00
5af064db83 Missing a free (pow, CUDA) 2013-10-16 17:08:25 -04:00
ac0489e1a3 Speed up quicksort descending
torch.sort() descending: change choice of pivot to the middle element rather than the first. This avoids the worst-case being on pre-sorted input.
2013-10-16 19:08:37 +01:00
ae210a72e8 Speed up quicksort descending
torch.sort() descending: change choice of pivot to the middle element rather than the first. This avoids the worst-case being on pre-sorted input.
2013-10-16 19:08:37 +01:00
6fe32d305b Indent quicksort descending for readability 2013-10-16 19:08:37 +01:00
6fd928a1ba Indent quicksort descending for readability 2013-10-16 19:08:37 +01:00
39c93a46de Update benchmark and test for sort descending 2013-10-16 19:08:37 +01:00
a14c0f795c Speed up quicksort
torch.sort() ascending: change choice of pivot to the middle element rather than the first. This avoids the worst-case being on pre-sorted input.
2013-10-16 19:08:36 +01:00
7a41d6da21 Speed up quicksort
torch.sort() ascending: change choice of pivot to the middle element rather than the first. This avoids the worst-case being on pre-sorted input.
2013-10-16 19:08:36 +01:00
b70d9a95fa Indent quicksort for improved readability 2013-10-16 19:08:36 +01:00
675b472160 Test for correct sorting of values and indices 2013-10-16 19:08:36 +01:00
e541251cc0 Indent quicksort for improved readability 2013-10-16 19:08:36 +01:00
21f2b83471 Benchmark sort worst case on sorted case 2013-10-16 19:08:36 +01:00
6cf948cd0f Add more tests for torch.sort() 2013-10-16 19:08:36 +01:00
68faf508d0 avoid paths being compiled with C89 flags: some compilers fail. 2013-10-16 16:12:44 +02:00
554e137315 added strict c89 flags 2013-10-16 11:19:27 +02:00
2deee567c7 fixed a bug in CUDA norm 2013-10-14 20:16:07 -04:00
163b2acd07 pkg/torch: C89 2013-10-08 11:49:36 +02:00
2152f9d015 missing declarations 2013-10-08 11:46:37 +02:00
fbe183c3be missing declarations 2013-10-08 11:46:37 +02:00
02d5426956 windows missing functions 2013-10-08 11:27:26 +02:00
e2a5f77983 windows missing functions 2013-10-08 11:27:26 +02:00
aa84a62631 windows popen/pclose 2013-10-08 11:24:29 +02:00
ce1151874d windows popen/pclose 2013-10-08 11:24:29 +02:00
ae6f303ece TH: now C89 2013-10-08 11:19:45 +02:00
659e695afd TH: now C89 2013-10-08 11:19:45 +02:00
02618853d0 C89!! 2013-10-07 23:15:07 +02:00
df41d749fa C89!! 2013-10-07 23:15:07 +02:00
46899d89bc adding a couple of #includes for cuda 5.5 2013-09-20 14:06:39 -04:00
228e986680 bugfix: serialization should be binary only 2013-09-11 18:30:15 +02:00
6217711e57 readObject can be forced to reread from file using referenced flag. 2013-09-06 11:15:19 +01:00
7a75c908a4 properly declare T variants of comparisons 2013-08-22 10:00:04 -04:00
d50dbd0944 properly declare T variants of comparisons 2013-08-22 10:00:04 -04:00
d9b12c2f80 properly declare new lapack functions 2013-08-22 09:59:54 -04:00
e1012ccb6c properly declare new lapack functions 2013-08-22 09:59:54 -04:00
ae53260c58 Replaced #if by #ifdef, for clang. 2013-08-14 17:59:38 -04:00
d611752109 Replaced #if by #ifdef, for clang. 2013-08-14 17:59:38 -04:00
edb816bdb6 Added wrappers for Lapack functions: potri, potrs, potrf.
These are for positive symmetric matrices (inverse, solve and
factorize).
2013-08-14 17:53:43 -04:00
512c336eb7 Added wrappers for Lapack functions: potri, potrs, potrf.
These are for positive symmetric matrices (inverse, solve and
factorize).
2013-08-14 17:53:43 -04:00
adb6b347c4 introduce THC_API and work around log1p issue. 2013-08-10 18:04:29 -04:00
8a48ce6e48 introduce THC_API and work around log1p issue. 2013-08-10 18:04:29 -04:00
f246b87326 introduce THC_API and work around log1p issue. 2013-08-10 18:04:29 -04:00
d41bfe5da7 Fix sdot return and work with openblas 2013-08-10 17:37:18 -04:00
9188986f5c Fix sdot return and work with openblas 2013-08-10 17:37:18 -04:00
85c8d48d39 win: added missing TH_API 2013-08-10 17:37:18 -04:00
fb0059fed4 win: added missing TH_API 2013-08-10 17:37:18 -04:00
b416e79b8a ported package torch to win 2013-08-10 17:37:09 -04:00
962e1df4a7 luaopen_xxx functions need LUA_EXTERNC 2013-08-10 16:21:46 -04:00
07f582c35c Complete (and final) revamping of the CUDA rand engine.
New:
* based 100% on CuRand

* reproduced seed interface from Torch:
    cutorch.manualSeed(seed)     : set a seed for repeatable experiments
    seed = cutorch.initialSeed() : retrieve initial seed
    cutorch.seed()               : force seeding (this is done once on startup)

* I only implemented the distributions that were meaningful
  for floating point tensors (CUDA is float only for now):
    uniform(a,b)
    normal(mean,std)
    logNormal(mean,std)
    bernoulli(p)
    cauchy(median,sigma)
    exponential(lambda)
    geometric(p)
2013-07-24 16:44:54 -04:00
9327cf5c05 silence warnings about implicit cast 2013-07-23 09:42:17 +01:00
c90103da14 silence warnings about implicit cast 2013-07-23 09:42:17 +01:00
d805cb2dfa add getRNGState and setRNGState functions to get/set the state of random
number generator. These provide the ability to replay a sequence of
random numbers from a given arbitrary point in time.

This commit provides the same capability as pull request #129
(https://github.com/andresy/torch/pull/129)
2013-07-22 22:28:41 +01:00
d92d4a53ed add getRNGState and setRNGState functions to get/set the state of random
number generator. These provide the ability to replay a sequence of
random numbers from a given arbitrary point in time.

This commit provides the same capability as pull request #129
(https://github.com/andresy/torch/pull/129)
2013-07-22 22:28:41 +01:00
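A short usage sketch of the replay capability described above (illustrative):

```
local state = torch.getRNGState()
local a = torch.rand(3)
torch.setRNGState(state)
local b = torch.rand(3)   -- identical to a: the sequence replays from the saved point
```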
030e8feb12 Fixed CUDA random generation entirely.
The state machine is now called piece-wise, by blocks of size 2^24
at most. 2^24 turns out to be the max precision for single precision
floats (mantissa). Since the randomizer's state is encoded into the
vector itself (floats), it was the easiest fix.
2013-07-20 22:23:44 -04:00
730a70272d THC: revamped random engine.
The seed is now changed each time a new vector is sampled.
This was done in order to decouple our engine as much as possible from
the internals of Thrust.
2013-07-20 16:10:33 -04:00
cedd0588c2 Repeatable randomization (CUDA). 2013-07-20 14:36:54 -04:00
a52932599d Fix - continued. 2013-07-14 16:45:15 -04:00
0c3abcf6c7 Fix - continued. 2013-07-14 16:45:15 -04:00
eeaf48d01b Fixed major bug on new openmp operators. 2013-07-14 16:41:54 -04:00
0f502afae5 Fixed major bug on new openmp operators. 2013-07-14 16:41:54 -04:00
87c09ecca2 fix for OSX Accelerate sdot bug 2013-07-11 08:12:06 +01:00
2aae827e66 fix for OSX Accelerate sdot bug 2013-07-11 08:12:06 +01:00
808f6945ed Reverting Blas issue: not ok on Linux. 2013-07-11 00:50:12 -04:00
01c33d20f5 Reverting Blas issue: not ok on Linux. 2013-07-11 00:50:12 -04:00
0ee3d1caa6 Fixing wrong float binding for sdot.
The Fortran interface returns doubles, which were getting incorrectly
(silently) cast into floats.
2013-07-10 12:20:16 -04:00
3d28d2261e Fixing wrong float binding for sdot.
The Fortran interface returns doubles, which were getting incorrectly
(silently) cast into floats.
2013-07-10 12:20:16 -04:00
36bf39fe82 THC headers installed along with TH headers
... to enable development of external cuda-based packages.
2013-06-29 13:49:51 +01:00
4ce3de09e8 Unsafe omp //. 2013-06-24 19:35:08 +00:00
fab111163c Unsafe omp //. 2013-06-24 19:35:08 +00:00
cf4c8c8cd5 Reverting c99: this causes compilation problems for LuaJIT. 2013-06-24 18:48:25 +00:00
6b43bf216a Reverting c99: this causes compilation problems for LuaJIT. 2013-06-24 18:48:25 +00:00
12ec52c154 Parallelizing most common for loops (TensorMath).
Ideally this should be done within the apply/map macros,
but for some reason it's not trivial.

For now I've provided a macro called for_omp, which provides
a parallel for loop. The variable used as an iterator must
be declared within the for loop scope:
for_omp(long i=0; ...) {
2013-06-23 15:33:39 -04:00
1864e3933b Parallelizing most common for loops (TensorMath).
Ideally this should be done within the apply/map macros,
but for some reason it's not trivial.

For now I've provided a macro called for_omp, which provides
a parallel for loop. The variable used as an iterator must
be declared within the for loop scope:
for_omp(long i=0; ...) {
2013-06-23 15:33:39 -04:00
88d0c0f220 Added a new torch.match() function.
That function allows matching batches of vectors with batches of
kernels. It's general enough that I thought it would be a nice addition
to the base module.
2013-06-13 17:17:40 -04:00
79bfba27f5 Added a new torch.match() function.
That function allows matching batches of vectors with batches of
kernels. It's general enough that I thought it would be a nice addition
to the base module.
2013-06-13 17:17:40 -04:00
77d43b339c Fixed documentation for torch.var 2013-06-13 10:47:32 +01:00
5620490800 Added single-dimensional CudaTorch.mean
i.e. the form
	y = torch.mean(x, dim)
2013-06-13 10:47:06 +01:00
e851f93806 Bugfix: copy from zero-stride cuda tensor
Previously copying from a cuda tensor with an inner stride of zero would
fail with no data being copied. This affects e.g. copying from an
expanded tensor. This commit performs these copies correctly. Copying
*into* a tensor with an innermost stride of zero will still result in no
copying, but such a copy operation is in any case not well defined.

The tests for CudaTensor.expand and non-contiguous copying now pass.
2013-06-12 00:11:13 +01:00
227a21b10c Correct computation of L0-norm for CudaTensor.norm
Same scheme as in 9b1049fc2c8e0b087d3420a37ee32977f9b281cd
2013-05-28 17:45:14 +01:00
4f94e31468 Added reducing form of CudaTensor.norm
i.e.
	x = torch.CudaTensor():rand(5,5)
	x:norm(2,1) -- 2-norm along 1st dimension
2013-05-28 17:45:14 +01:00
e0262d6411 Changed parameter order of transform-reduce
... for consistency with thrust::transform_reduce.
2013-05-28 17:45:14 +01:00
c6c41229ab Correct computation of L0-norm
The existing norm computation correctly computes the Ln-norm for n >= 1.
For L0, however, the result is always inf due to the presence of the 0th
root. This commit adds a special case for L0, which is computed the
conventional way: by simply counting non-zero elements.
2013-05-28 17:40:07 +01:00
ef7db84d52 Correct computation of L0-norm
The existing norm computation correctly computes the Ln-norm for n >= 1.
For L0, however, the result is always inf due to the presence of the 0th
root. This commit adds a special case for L0, which is computed the
conventional way: by simply counting non-zero elements.
2013-05-28 17:40:07 +01:00
f9e252c022 Generalised reduceDim to transformReduceDim
This is needed by some kernels to be added later.
The existing reductions are implemented in terms of this (with an
identity transformation) at no cost.
2013-05-28 17:40:07 +01:00
550287fe8a Added missing documentation for torch.norm 2013-05-28 17:40:07 +01:00
6f0b29e658 Bugfix: incorrect CudaTensor.max 2013-05-22 12:13:52 +01:00
15fd911d46 Improved performance of single-dimensional CudaTensor reduction
This uses a parallel reduction to speed up the case where the reduction
is over the innermost dimension. For a 2k x 2k matrix the speedup is
around 3x.

This approach means that the reduction is done using single-precision
rather than double-precision, as before.
2013-05-19 03:36:28 +01:00
1877ef828a Single-dimensional CudaTensor reduction handles large tensors
Previously most dimensions were limited to 64k. Now there is no limit.
2013-05-19 01:03:50 +01:00
8dd3c36c1e Fixed operand order for CudaTensor logical operations 2013-05-12 00:10:20 +01:00
5f03403335 Replaced uint with unsigned
... as it otherwise does not build on all compilers.
2013-05-10 17:58:59 +01:00
ecff9feb1a Added logical operators that return non-ByteTensors
The existing equality operators always return ByteTensors. Often the
return values from an equality operation need to be used with a tensor
of the original type (e.g. point-wise multiplication with a mask), and
the ByteTensor therefore has to be cast back to this type. This casting
is unnecessary, and for CudaTensors would be costly, as it involves a
round-trip to the host.

This commit adds new forms of the equality operators that directly
write the result into a tensor of the appropriate type. For example

	x = torch.DoubleTensor(5)
	y = torch.rand(5)
	x:lt(y,0.5)

The existing ByteTensor-returning forms are still present.
2013-05-10 17:35:03 +01:00
96b11d241b Added logical operators that return non-ByteTensors
The existing equality operators always return ByteTensors. Often the
return values from an equality operation need to be used with a tensor
of the original type (e.g. point-wise multiplication with a mask), and
the ByteTensor therefore has to be cast back to this type. This casting
is unnecessary, and for CudaTensors would be costly, as it involves a
round-trip to the host.

This commit adds new forms of the equality operators that directly
write the result into a tensor of the appropriate type. For example

	x = torch.DoubleTensor(5)
	y = torch.rand(5)
	x:lt(y,0.5)

The existing ByteTensor-returning forms are still present.
2013-05-10 17:35:03 +01:00
9b9deb3f0c Added new logical operators for CudaTensor 2013-05-10 17:35:03 +01:00
a78c712121 Added copying form of CudaTensor.pow 2013-05-10 17:35:03 +01:00
e70eed28c3 Added single-dimensional min/max for CudaTensor 2013-05-10 17:35:02 +01:00
ca3968c373 Added single-dimensional sum for CudaTensor
i.e. summation of the form y:sum(x,1). This is supported for up to
4-dimensional tensors in a single kernel call. More dimensions could be
added if needed by looping over this kernel.

Internally two generic reduction kernels are used which reduce either
the innermost or one of the outer dimensions. In either case global
memory accesses are fully coalesced.
2013-05-10 17:35:02 +01:00
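The reducing form corresponds to the following usage; a minimal sketch in PyTorch terms (illustration; assumes a CUDA device is available):

	import torch

	x = torch.rand(4, 5).cuda()
	y = x.sum(0)   # reduce over one dimension in a single kernel call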
88b54ce379 Added copying form of CudaTensor.cmul
i.e.
	x:cmul(y,z)
Previously only
	x:cmul(y)
was implemented.
2013-05-10 17:35:02 +01:00
11dfa23c78 Added CudaTensor.sign 2013-05-10 17:35:02 +01:00
5b40e3c758 do not use pushudata on a tensor that is retrieved with checkudata.
make the dimension index int.
2013-04-15 22:33:38 +01:00
c1f2bc3e9a do not use pushudata on a tensor that is retrieved with checkudata.
make the dimension index int.
2013-04-15 22:33:38 +01:00
1f39c55d0b correct the heading style in the doc for index operations. 2013-04-14 16:17:34 +01:00
fcbd9c12dc make sure to handle 1D indexing properly.
do not forget to free the temporary size allocation.
add documentation
2013-04-14 16:05:34 +01:00
7474e5f7a3 make sure to handle 1D indexing properly.
do not forget to free the temporary size allocation.
add documentation
2013-04-14 16:05:34 +01:00
3686142e29 add indexing operators 2013-04-14 10:02:32 +01:00
7a47a0a5c5 add indexing operators 2013-04-14 10:02:32 +01:00
7c473e176f Merge pull request #120 from fidlej/topic_stride_dok
Clarified that elements in a row are contiguous in memory.
2013-03-31 15:38:47 -07:00
2d96c0e03e Mentioned that torch.cat() uses the last dimension by default. 2013-03-26 16:14:58 +00:00
9b60b2e36b Clarified that elements in a row are contiguous in memory. 2013-03-26 15:58:37 +00:00
6299f310c7 Experimental feature: creating Storages from raw pointers, in Lua. 2013-03-25 23:08:52 -04:00
e5342e8f86 Experimental feature: creating Storages from raw pointers, in Lua. 2013-03-25 23:08:52 -04:00
914ce58eff Merge branch 'prettyprint' 2013-02-13 20:54:56 -05:00
64047da4c9 Added a TTY detector. 2013-02-13 13:58:36 -05:00
c1ca8b73a0 add documentation for file:referenced, file:isReferenced, torch.expand, torch.expandAs, torch.repeatTensor 2013-02-13 10:48:26 +00:00
164c2e902f add repeatTensor function, similar to matlab repmat 2013-02-13 07:58:02 +00:00
5474bf3e74 add isReferenced function to query the referenced state 2013-02-12 14:16:04 +00:00
45326da58f reverse logic on referenced 2013-02-12 10:39:41 +00:00
f409840211 add referenced function to disable/enable indexing/referencing behaviour of the writeObject mechanism. 2013-02-12 10:37:23 +00:00
e098ae1382 make tensor.expand create a tensor of the correct type, not the default type. 2013-02-11 16:50:16 +00:00
6cbdbf4583 DiskFile: removed restrict keyword 2013-02-06 22:52:23 +01:00
a3289c4f51 DiskFile: removed restrict keyword 2013-02-06 22:52:23 +01:00
88b0078383 THDiskFile: workaround insane Mac OS X fread bug ID# 6434977. 2013-02-06 11:27:39 +01:00
f028399417 THDiskFile: workaround insane Mac OS X fread bug ID# 6434977. 2013-02-06 11:27:39 +01:00
47a9334ec1 Merge pull request #100 from jucor/LinspaceFix
Fix boundary cases in logspace() and linspace()
2013-01-09 03:54:46 -08:00
b9cb4174a5 Reported the full traceback on a test failure. 2013-01-09 11:32:22 +00:00
a2e455f7a2 Fix boundary cases in logspace() and linspace()
For any a < b:
torch.linspace(a, a, 1) returns a 1-dim tensor of a, instead of nan
torch.linspace(a, b, 1) throws an error, instead of returning nan
torch.logspace(a, a, 1) returns a 1-dim tensor of 10^a, instead of nan
torch.logspace(a, b, 1) throws an error, instead of returning nan
torch.test() checks the 4 asserts above
2013-01-08 17:48:26 +00:00
406afa3877 Fix boundary cases in logspace() and linspace()
For any a < b:
torch.linspace(a, a, 1) returns a 1-dim tensor of a, instead of nan
torch.linspace(a, b, 1) throws an error, instead of returning nan
torch.logspace(a, a, 1) returns a 1-dim tensor of 10^a, instead of nan
torch.logspace(a, b, 1) throws an error, instead of returning nan
torch.test() checks the 4 asserts above
2013-01-08 17:48:26 +00:00
756083d08a protect tester calls with pcall. 2013-01-03 10:31:00 +00:00
65f0ea7870 Merge pull request #97 from jucor/RangeEqualBounds
Allow for equal bounds in range
2012-12-24 07:49:34 -08:00
e192a14372 Merge pull request #94 from rosejn/force_write
Added an optional force option to writeObject.
2012-12-21 04:34:46 -08:00
1fcc8aa31a make areTablesEqual local. 2012-12-20 17:38:10 +00:00
c6348928d2 Allow for equal bounds in range 2012-12-12 14:35:55 +00:00
4de4a301c5 Allow for equal bounds in range 2012-12-12 14:35:55 +00:00
1aab740694 Allow negative step with range 2012-12-12 13:04:21 +00:00
526def20a0 Allow negative step with range 2012-12-12 13:04:21 +00:00
cc0fd59190 revert the bug fix. fix was buggy too... 2012-12-10 15:29:57 +00:00
f0203f5781 revert the bug fix. fix was buggy too... 2012-12-10 15:29:57 +00:00
8bc0ebdba5 Added an optional force option to writeObject.
With this patch you can force writeObject to disregard caching and always write
the given object.

    f:writeObject(sample, true)

If only there were javadoc style comments, I would add to the comment string as
well...
2012-12-10 15:07:18 +00:00
68ab6871eb correction for MACOSX bug mentioned here. https://discussions.apple.com/thread/3413139?start=15&tstart=0 2012-12-09 21:58:51 +00:00
60a232cfbf correction for MACOSX bug mentioned here. https://discussions.apple.com/thread/3413139?start=15&tstart=0 2012-12-09 21:58:51 +00:00
35e31bd331 Compare tables recursively 2012-12-09 18:30:42 +00:00
956b6643ea Fix comparison of flat tables 2012-12-09 18:27:13 +00:00
a62907419b Detect and unit-test a bug in AssertTableEq 2012-12-09 17:44:56 +00:00
ffba0dfbfa Unit test assertTensorEq and assertTensorNe 2012-12-09 17:44:56 +00:00
d9a46534aa Document the three new asserts(), and add unit test for TestAssertError() 2012-12-07 20:10:29 +00:00
f361ec16b6 Add three asserts to the Tester class
* assertError() expects an error to occur
* assertTensorNe() expects the content of two tensors to differ
* assertTableNe() expects the content of two tables to differ
2012-12-07 20:09:33 +00:00
87a5096293 Fix error message for TensorEQ and TableEQ 2012-12-07 20:09:10 +00:00
8e1e9943a4 Document the extended parameter range for Bernoulli 2012-12-03 14:25:26 +00:00
6bbcc99187 Allow p=0 and p=1 for Bernoulli distribution 2012-12-03 14:23:12 +00:00
8939a5744a Allow p=0 and p=1 for Bernoulli distribution 2012-12-03 14:23:12 +00:00
f49e22cff2 Merge remote-tracking branch 'upstream/master' 2012-11-29 10:34:41 +00:00
fe763ec715 count and report assertions 2012-11-27 17:41:32 +00:00
7bd8b98124 Fixed doc typo. 2012-11-19 19:03:15 +00:00
8ed97e6961 Using a local var. 2012-10-31 12:26:30 +00:00
ba02f818b5 Upgraded CUDA code to compute capability 2.0 (48kB of shared mem) 2012-10-27 15:53:50 -04:00
526df4b3c0 Merge pull request #71 from akfidjeland/cuda-maths
Added missing form of CudaTensor:add
2012-10-25 10:03:25 -07:00
f78dbc1da5 fix ATLAS detection 2012-10-22 22:28:19 +02:00
6bd1bbb04b fix ATLAS detection 2012-10-22 22:28:19 +02:00
781d35d251 max/min/sort: no need to increment optional IndexTensor 2012-10-22 21:55:57 +02:00
db57736639 Merge branch 'cudacopy' 2012-10-20 16:31:14 -04:00
cf5a8be51e Cleaning up CUDA code base. Got rid of useless device syncs 2012-10-20 16:12:35 -04:00
87ecea3154 Added missing form of CudaTensor:add
This adds the form that performs x:add(y,s,z), which was previously missing.
2012-10-03 14:54:02 +01:00
295b3c8da3 Added dok for Easy File functions. 2012-09-29 18:43:16 -04:00
ecaa6d74f8 removed one indirection when performing an arg check in TH 2012-09-28 14:18:43 +02:00
7a6a38ee78 removed one indirection when performing an arg check in TH 2012-09-28 14:18:43 +02:00
f964984b1c THStorage: added implementation for missing functions 2012-09-28 13:54:49 +02:00
027815798c THStorage: added implementation for missing functions 2012-09-28 13:54:49 +02:00
f388816f9a Merge branch 'master' into convburst 2012-09-28 12:06:18 +02:00
a9569c6fb1 avoid segfault if required class is not already loaded.
trying to load an object that contains a class that is not loaded yet causes the problem.

require 'nn'
m=nn.Linear(10,20)
torch.save('m.obj',m)
Really quit [y/N]? y

torch.load('m.obj')
2012-09-27 17:16:59 -04:00
c7132c0c1f Much faster copy. 2012-09-27 00:20:07 -04:00
c34c2e2de1 fixed some thread safety issues 2012-09-26 19:55:16 +02:00
0d534e64f5 fixed some thread safety issues 2012-09-26 19:55:16 +02:00
84a5bcb17a Allow running subset of tests 2012-09-23 15:16:59 -04:00
3dbc05d8ef Added standard Serialization functions (torch.[de]serialize()) 2012-09-22 23:50:37 -04:00
f32d123ed2 Fixing incorrect string comment for table assertion. 2012-09-19 16:06:49 +01:00
b7280dbd42 Adding assertTableEq function to test table equality.
* also added some minor spacing to help readability.
2012-09-19 15:33:07 +01:00
96b9a35756 fixed bug when calling operators + forgotten support for __call__ in new API 2012-09-13 10:51:13 +02:00
60194d73fc Merge branch 'master' into noid
Conflicts:
	extra/cuda/pkg/cutorch/TensorMath.lua
2012-09-12 16:36:29 +02:00
d13d33dd78 Correct untemplated conv code (CUDA) 2012-09-06 11:14:39 -04:00
1147218129 Added more cases to the generic stuff. 2012-09-06 11:06:13 -04:00
fe8e68c6ed Merge branch 'master' of https://github.com/2ndforks/torch into cudaconv 2012-09-05 15:21:08 -04:00
2289863e20 Working randomizer in CUDA. 2012-09-05 14:06:27 -04:00
de87f835bb Trying something simpler. 2012-09-05 00:19:47 -04:00
a8aef4e9ee removed unnecessary header. 2012-09-04 10:50:46 -04:00
6b878d7309 Dynamic linking. 2012-09-04 10:49:46 -04:00
f01c2932c7 Commiting a buggy version. 2012-09-03 20:50:23 -04:00
2b404d9a39 Random with Cuda. 2012-09-03 20:05:25 -04:00
d1e48591d7 Random with Cuda. 2012-09-03 20:05:25 -04:00
358203e655 Fixed doc for var/std. 2012-08-31 11:33:26 -04:00
8a43201996 fixed bug with non-equal strides as well 2012-08-24 11:27:45 -04:00
f179fe6d8a fixed the bugs in cuda SpatialConvolutionMap, works for equal strides 2012-08-24 11:13:11 -04:00
43fead371b added cuda convolutionMap 2012-08-23 19:03:40 -04:00
de2295f507 resize: do not resize tensors if they have the right size 2012-08-16 15:48:36 +02:00
7afb67b23e resize: do not resize tensors if they have the right size 2012-08-16 15:48:36 +02:00
4dadc516d9 Merge branch 'master' of github.com:andresy/torch 2012-08-16 14:10:16 +02:00
c846839ca8 minor dok corrections 2012-08-16 14:09:43 +02:00
0a79fd2282 dok correction 2012-08-13 15:47:27 +02:00
9cc08cb84e more luaT dok 2012-08-13 15:35:27 +02:00
4fbc1b4674 minor cosmetic changes 2012-08-13 15:35:15 +02:00
cd1f98db96 luaT_typerror: print lua name when luaT returns nothing 2012-08-13 14:40:59 +02:00
1f5bf96273 erased root-metatable tracks: rmt -> mt 2012-08-13 14:35:11 +02:00
05a6d882c4 torch now complies with the new luaT API 2012-08-13 14:25:58 +02:00
59f7533363 safer way to handling typename + cleanup 2012-08-13 12:28:06 +02:00
a8efaa7916 marked some functions as deprecated 2012-08-09 18:30:03 +02:00
d762812dfb Merge branch 'master', remote branch 'origin' into noid 2012-08-09 16:59:27 +02:00
97d8b4384a huge bug correction: MKL multithreading problem when confusing iomp5 & gomp 2012-08-09 16:56:37 +02:00
5836acc0d9 huge bug correction: MKL multithreading problem when confusing iomp5 & gomp 2012-08-09 16:56:37 +02:00
f03b1e130e luaT with no id: bug correction + put back constructor tables, which are in fact needed. 2012-08-09 16:13:31 +02:00
00518c4a3a removed id mechanism from luaT 2012-08-09 12:45:51 +02:00
0d50c1184a added global function include() and replaced torch.include calls accordingly 2012-07-31 11:49:52 +02:00
67ea3368f6 Removed reading from an undeclared var. 2012-07-30 17:57:53 +01:00
4c32b140aa Added expand/expandAs functions.
These functions facilitate expansions of dimensions, to automatically
combine tensors of different geometries. This is very much like what
Matlab's bsxfun function does, but without the crazy syntax.
2012-07-20 16:54:55 -04:00
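The same facility carried over into PyTorch; a minimal sketch (illustration only):

	import torch

	a = torch.rand(5, 1)
	b = torch.rand(5, 3)
	c = a.expand_as(b) * b   # broadcast a along dim 1 without copying its data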
e53cc4d17d sort dok correction 2012-07-17 16:10:26 +02:00
6a20546b1d Fixed dok for sort() 2012-07-06 21:19:02 -04:00
7f21dc83b7 added real() method to tensors 2012-06-13 17:11:47 +02:00
930061a990 removed unroll in convolutions [harming most normal size convolutions] 2012-06-05 11:31:43 +02:00
96adce80d6 removed unroll in convolutions [harming most normal size convolutions] 2012-06-05 11:31:43 +02:00
44ffeabbf1 Merge branch 'master' into mergeopenmp 2012-06-05 10:52:38 +02:00
6a5f07504d Merge branch 'master' into mergeopenmp 2012-06-05 10:52:38 +02:00
40698ef12d SDL is no good 2012-05-17 00:40:35 +02:00
949aa3e079 SDL is no good 2012-05-17 00:40:35 +02:00
a90bdd558c Merge branch 'master' into mergeopenmp
Conflicts:
	extra/openmp/pkg/openmp/generic/SpatialMaxPoolingOmp.c
	extra/openmp/pkg/openmp/generic/SpatialSubSamplingOmp.c
	extra/openmp/pkg/openmp/generic/SqrtOmp.c
	extra/openmp/pkg/openmp/generic/SquareOmp.c
2012-05-17 00:40:02 +02:00
f85e8e6e43 Merge branch 'master' into mergeopenmp
Conflicts:
	extra/openmp/pkg/openmp/generic/SpatialMaxPoolingOmp.c
	extra/openmp/pkg/openmp/generic/SpatialSubSamplingOmp.c
	extra/openmp/pkg/openmp/generic/SqrtOmp.c
	extra/openmp/pkg/openmp/generic/SquareOmp.c
2012-05-17 00:40:02 +02:00
105bbcefe4 added support for MKL SDL link 2012-05-16 15:02:54 +02:00
11b10c5041 added support for MKL SDL link 2012-05-16 15:02:54 +02:00
7a44eac02c corrected bug in abs computation of numbers using torch.abs. 2012-05-05 20:52:32 -04:00
c22e6aa43b add documentation for general eigen values and rename the functions
eig->symeig, since this is only for symmetric matrices, and
reig->eig, since this solves all cases.
2012-05-03 12:37:53 -04:00
05e2c97331 add documentation for general eigen values and rename the functions
eig->symeig, since this is only for symmetric matrices, and
reig->eig, since this solves all cases.
2012-05-03 12:37:53 -04:00
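The renamed pair maps directly onto the split PyTorch later kept; a minimal sketch (illustration; assumes the era's torch.symeig/torch.eig API):

	import torch

	a = torch.rand(4, 4)
	s = a + a.t()                  # symmetric matrix
	e, v = torch.symeig(s, True)   # symmetric-only solver (the old 'eig')
	e2, _ = torch.eig(a)           # general solver (the old 'reig')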
5e4f8e627d Add non-sym eigen value computation. 2012-05-03 09:17:40 -04:00
cda05dcaf7 Add non-sym eigen value computation. 2012-05-03 09:17:40 -04:00
1bde0d549e Added negative indices to slicing operators 2012-04-12 00:23:30 -04:00
9766daa1b2 Fix bad string.format usage in pkg/torch/File
* .. should be ,
* wrapped className with tostring, in case it is nil
2012-04-10 16:55:36 -04:00
ae19f5ba96 Performance improvement on convolution using SSE : Unrolled loop in add function in THVector.h 2012-04-08 10:27:30 -04:00
33549f847a Performance improvement on convolution using SSE : Unrolled loop in add function in THVector.h 2012-04-08 10:27:30 -04:00
5a0958d2f6 restoring CMAKE_REQUIRED_FLAGS after messing with it 2012-04-04 13:49:21 +02:00
22b4a62161 restoring CMAKE_REQUIRED_FLAGS after messing with it 2012-04-04 13:49:21 +02:00
4723cff696 corrected stuff related to print(...) overriding 2012-04-03 18:49:20 +02:00
dfb4c913a8 Add NEON assembly routine for ARM processor 2012-03-29 11:12:05 -05:00
9e099237d3 Add NEON assembly routine for ARM processor 2012-03-29 11:12:05 -05:00
c057503a1c Using _OPENMP flag. 2012-03-13 18:51:32 -04:00
1f71792b84 Trying to merge openmp into main libs. 2012-03-13 15:03:39 -04:00
53e0380480 Trying to merge openmp into main libs. 2012-03-13 15:03:39 -04:00
b75bd3a130 cmdline dok typo 2012-03-12 11:52:59 +01:00
46e016be71 bug correction: consider accreal in torch pkg 2012-03-07 17:18:11 +01:00
d044ed3955 normall/norm in TH + ported those to torch pkg 2012-03-07 16:51:39 +01:00
007afa9988 normall/norm in TH + ported those to torch pkg 2012-03-07 16:51:39 +01:00
3802564e09 histogram correctly calculated for tensors where all elements are the same. 2012-03-07 10:44:15 -05:00
d0bc38e0a5 histogram correctly calculated for tensors where all elements are the same. 2012-03-07 10:44:15 -05:00
d23559f3c9 added pthread detection for gotoblas2/openblas, which is necessary on some distributions 2012-03-06 11:19:10 +01:00
d60f220e18 added pthread detection for gotoblas2/openblas, which is necessary on some distributions 2012-03-06 11:19:10 +01:00
b3d740a7f7 avoid looking for blas/lapack twice 2012-03-05 11:33:45 +01:00
ed9de62be4 avoid looking for blas/lapack twice 2012-03-05 11:33:45 +01:00
8fe10f0346 better lapack detection 2012-03-05 11:27:05 +01:00
311b063378 better lapack detection 2012-03-05 11:27:05 +01:00
455f38f09e improved blas/lapack cmake scripts 2012-03-05 10:51:11 +01:00
c17cb214b3 improved blas/lapack cmake scripts 2012-03-05 10:51:11 +01:00
4c0e1f8907 cleaner inline detection 2012-03-03 14:34:15 +01:00
2ba221f9f3 cleaner inline detection 2012-03-03 14:34:15 +01:00
c61781881b better sse detection 2012-03-03 14:13:01 +01:00
9856a971b0 better sse detection 2012-03-03 14:13:01 +01:00
3989471e6d better inline support 2012-03-01 18:05:36 +01:00
76f67c409a better inline support 2012-03-01 18:05:36 +01:00
8fc1bff114 corrected a lot of warnings, mostly due to unused variables 2012-03-01 17:33:10 +01:00
f027156aa8 corrected a lot of warnings, mostly due to unused variables 2012-03-01 17:33:10 +01:00
2a93b82078 minor corrections (to avoid some warnings) 2012-02-29 16:20:09 +01:00
ddf8f4b7d3 minor corrections (to avoid some warnings) 2012-02-29 16:20:09 +01:00
173906da35 Merge branch 'make-tensor-accessors-const-correct' of https://github.com/pflaquerre/torch into pflaquerre-make-tensor-accessors-const-correct
Conflicts:
	lib/TH/generic/THTensor.c
2012-02-28 16:12:55 -05:00
f54b875cbe Merge branch 'make-tensor-accessors-const-correct' of https://github.com/pflaquerre/torch into pflaquerre-make-tensor-accessors-const-correct
Conflicts:
	lib/TH/generic/THTensor.c
2012-02-28 16:12:55 -05:00
4bce6207cd Merge branch 'make-tensor-accessors-const-correct' of https://github.com/pflaquerre/torch into pflaquerre-make-tensor-accessors-const-correct
Conflicts:
	lib/TH/generic/THTensor.c
2012-02-28 16:12:55 -05:00
fc989401db Merge branch 'master' of https://github.com/andresy/torch 2012-02-27 13:04:04 -05:00
d921604444 Added dok for slicing operator. 2012-02-27 13:03:47 -05:00
d70c7b57dd add atan2 2012-02-25 23:46:14 -05:00
f433f8d136 add atan2 2012-02-25 23:46:14 -05:00
d9e051ba70 Allowing fill assign when narrowing/selecting 2012-02-25 14:57:20 -05:00
c2070e9977 fixed little bug in THVector. 2012-02-25 14:31:42 -05:00
916afcc290 fixed little bug in THVector. 2012-02-25 14:31:42 -05:00
24749b3393 Fixed missing types in copy. 2012-02-24 20:09:15 -05:00
365a18b272 New Tensor indexing. a la matlab. 2012-02-24 19:48:18 -05:00
767d383f2c made lapack link more robust 2012-02-23 17:27:14 +01:00
175f6818bd made lapack link more robust 2012-02-23 17:27:14 +01:00
9da8beac56 luaT was not putting the dok in luat... 2012-02-21 17:39:33 +01:00
1a03adebb4 documentation for torch.inverse 2012-02-21 00:38:30 -05:00
4397f709be take back tensor conv stuff, it is not ready yet 2012-02-21 00:13:31 -05:00
2d29aeaf12 matrix inverse 2012-02-21 00:10:12 -05:00
1ae6aa1bef matrix inverse 2012-02-21 00:10:12 -05:00
e20f0746e6 Add reset function to module.lua and Sequential.lua. CmdLine accepts ignore=false for string function 2012-02-19 11:36:19 -05:00
db9c3e69d6 SVD now returns U,S,V not U,S,V^T (compatible with Matlab/Octave calling) 2012-02-16 11:09:13 -05:00
0591633876 SVD now returns U,S,V not U,S,V^T (compatible with Matlab/Octave calling) 2012-02-16 11:09:13 -05:00
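The new convention (returning V rather than V^T) is the one PyTorch kept; a minimal sketch of the round trip (illustration only):

	import torch

	a = torch.rand(4, 3)
	u, s, v = torch.svd(a)
	recon = torch.mm(torch.mm(u, torch.diag(s)), v.t())   # recon is approximately a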
17119a7324 added histograms back 2012-02-13 10:48:37 -05:00
aa0b85bb80 added histograms back 2012-02-13 10:48:37 -05:00
f23bc545d7 big pass over documentation to make titles consistent across all packages.. 2012-02-13 00:33:09 -05:00
9ca97814d3 corrected bug in method type-base dispatcher 2012-02-10 13:43:49 +01:00
bc15365ca5 moved local path setup into torch-env.lua 2012-02-10 11:03:32 +01:00
ebe47b0f95 put the local search path before default paths. 2012-02-09 23:52:41 -05:00
87163d971a first shot at local install 2012-02-09 22:44:04 -05:00
8eff95d047 inline help anchors 2012-02-09 01:40:38 -05:00
2784d42ea8 documentation for tensor 2012-02-09 01:38:15 -05:00
7109cbbdd2 cleanhistory -> clearhistory 2012-02-08 10:57:40 +01:00
068130bafc Merge branch 'newpack' of github.com:andresy/torch into newpack 2012-02-07 09:15:05 -05:00
4a906fbff2 corrections/additions to dok 2012-02-07 09:14:41 -05:00
ed4857dfcf added mv, mm and ger + better checking of addmv, addmm and addr 2012-02-07 14:49:13 +01:00
bbf3970981 tensor math documention : one pass... 2012-02-05 19:01:16 -05:00
f777086b59 added maskedSelect method to select elements using a mask vector
documentation changes for Tensor
2012-02-05 16:40:12 -05:00
426a2b1967 added maskedSelect method to select elements using a mask vector
documentation changes for Tensor
2012-02-05 16:40:12 -05:00
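In PyTorch terms the equivalent is masked_select; a minimal sketch (illustration only):

	import torch

	x = torch.rand(5)
	mask = x.gt(0.5)             # ByteTensor mask
	y = x.masked_select(mask)    # 1-D tensor holding the selected elements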
bc85cf8fb8 minor cleanup 2012-02-04 19:47:48 +01:00
7483217fe0 now methods & functions are both generated by wrap -- this fixes some bugs and makes things more clear 2012-02-04 19:30:53 +01:00
50bd93709f Merge branch 'master' into newpack 2012-02-03 11:09:42 +01:00
ba17882426 {min,max,sum,mean,var,std}all now without 'all' in lua 2012-02-03 11:04:01 +01:00
a3f872f29b Merge branch 'logical' into newpack
Conflicts:
	pkg/torch/TensorMathWrap.lua
2012-02-03 10:12:30 +01:00
fc16a68f48 Merge branch 'logical' into newpack
Conflicts:
	pkg/torch/TensorMathWrap.lua
2012-02-03 10:12:30 +01:00
477587f566 no need to do paths.require + require!! 2012-02-02 16:06:21 +01:00
cfd966d7c4 torch lua script is now a true executable 2012-02-02 13:00:53 +01:00
5cd0e5f10c minor cmake corrections 2012-02-01 15:36:57 +01:00
3c652197bc more cleanup and wrap support for external packages 2012-02-01 14:53:19 +01:00
f08a695702 more cmake cleanup 2012-02-01 14:33:10 +01:00
54bb1cf358 more cmake cleanup 2012-02-01 14:33:10 +01:00
014b440244 cmake files cleanup 2012-02-01 12:12:29 +01:00
79fd0a8b78 cmake files cleanup 2012-02-01 12:12:29 +01:00
4b1d3d79e2 cmake files cleanup 2012-02-01 12:12:29 +01:00
a3acd3e0fe torch7 now exports its own targets for external packages 2012-01-31 17:49:55 +01:00
0d4350ca62 torch7 now exports its own targets for external packages 2012-01-31 17:49:55 +01:00
ade0d0e1f2 First basic skeleton for package manager 2012-01-30 12:09:02 -05:00
d2e18be452 Merge branch 'master' into newpack
Conflicts:
	extra/nn/test/test.lua
2012-01-30 16:54:41 +01:00
c7d7de315b initial revamp of torch7 tree 2012-01-25 14:55:20 +01:00
a8f107756e initial revamp of torch7 tree 2012-01-25 14:55:20 +01:00
053065ba23 initial revamp of torch7 tree 2012-01-25 14:55:20 +01:00
1514 changed files with 216491 additions and 25 deletions

.gitignore vendored Normal file

@ -0,0 +1,51 @@
build/
dist/
torch.egg-info/
*/**/__pycache__
torch/version.py
torch/csrc/generic/TensorMethods.cpp
torch/lib/*.so*
torch/lib/*.a*
torch/lib/*.dylib*
torch/lib/*.h
torch/lib/build
torch/lib/tmp_install
torch/lib/include
torch/lib/torch_shm_manager
torch/csrc/cudnn/cuDNN.cpp
torch/csrc/nn/THNN.cwrap
torch/csrc/nn/THNN.cpp
torch/csrc/nn/THCUNN.cwrap
torch/csrc/nn/THCUNN.cpp
torch/csrc/nn/THNN_generic.cwrap
torch/csrc/nn/THNN_generic.cpp
torch/csrc/nn/THNN_generic.h
torch/csrc/generated
docs/src/**/*
test/data/legacy_modules.t7
test/data/gpu_tensors.pt
test/htmlcov
test/.coverage
*/*.pyc
*/**/*.pyc
*/**/**/*.pyc
*/**/**/**/*.pyc
*/**/**/**/**/*.pyc
*/*.so*
*/**/*.so*
*/**/*.dylib*
test/data/legacy_serialized.pt
test/data/linear.pt
# IPython notebook checkpoints
.ipynb_checkpoints
# Editor temporaries
*.swn
*.swo
*.swp
*~
# OSX dir files
.DS_Store

.travis.yml Normal file

@ -0,0 +1,49 @@
# https://travis-ci.org/pytorch/pytorch
language: python
dist: trusty
python:
- 2.7.9
- 2.7
- 3.5
- 3.6
- nightly
cache:
- ccache
- directories:
- $HOME/.ccache
install:
- unset CCACHE_DISABLE
- export CCACHE_DIR=$HOME/.ccache
- export CC="ccache gcc-4.8"
- export CXX="ccache g++-4.8"
- ccache --show-stats
- travis_retry pip install --upgrade pip setuptools wheel
- travis_retry pip install -r requirements.txt --only-binary=scipy
- python setup.py install
script:
- OMP_NUM_THREADS=2 ./test/run_test.sh
addons:
apt:
sources:
- ubuntu-toolchain-r-test
packages:
- gcc-4.8
- g++-4.8
# This reportedly works around an issue downloading packages from pypi on
# travis. Consider removing this after the underlying issue is fixed.
# https://github.com/travis-ci/travis-ci/issues/2389
sudo: false
matrix:
fast_finish: true
include:
env: LINT_CHECK
python: "2.7"
addons: true
install: pip install flake8
script: flake8

CONTRIBUTING.md Normal file

@ -0,0 +1,185 @@
## Contributing to PyTorch
If you are interested in contributing to PyTorch, your contributions will fall
into two categories:
1. You want to propose a new Feature and implement it
- post about your intended feature, and we shall discuss the design and
implementation. Once we agree that the plan looks good, go ahead and implement it.
2. You want to implement a feature or bug-fix for an outstanding issue
- Look at the outstanding issues here: https://github.com/pytorch/pytorch/issues
- Especially look at the Low Priority and Medium Priority issues
- Pick an issue and comment on the task that you want to work on this feature
- If you need more context on a particular issue, please ask and we shall provide.
Once you finish implementing a feature or bugfix, please send a Pull Request to
https://github.com/pytorch/pytorch
If you are not familiar with creating a Pull Request, here are some guides:
- http://stackoverflow.com/questions/14680711/how-to-do-a-github-pull-request
- https://help.github.com/articles/creating-a-pull-request/
## Developing locally with PyTorch
To locally develop with PyTorch, here are some tips:
1. Uninstall all existing pytorch installs
```
conda uninstall pytorch
pip uninstall torch
pip uninstall torch # run this command twice
```
2. Locally clone a copy of PyTorch from source:
```
git clone https://github.com/pytorch/pytorch
cd pytorch
```
3. Install PyTorch in `build develop` mode:
A full set of instructions on installing PyTorch from Source are here:
https://github.com/pytorch/pytorch#from-source
The change you have to make is to replace
```
python setup.py install
```
with
```
python setup.py build develop
```
This is especially useful if you are only changing Python files.
This mode will symlink the python files from the current local source tree into the
python install.
Hence, if you modify a python file, you do not need to reinstall pytorch again and again.
For example:
- Install local pytorch in `build develop` mode
- modify your python file `torch/__init__.py` (for example)
- test functionality
- modify your python file `torch/__init__.py`
- test functionality
- modify your python file `torch/__init__.py`
- test functionality
You do not need to repeatedly install after modifying python files.
## Writing documentation
PyTorch uses [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html)
for formatting docstrings. Line length inside docstring blocks must be limited to 80 characters so they
fit into Jupyter documentation popups.
## Managing multiple build trees
One downside to using `python setup.py develop` is that your development
version of pytorch will be installed globally on your account (e.g., if
you run `import torch` anywhere else, the development version will be
used).
If you want to manage multiple builds of PyTorch, you can make use of
[conda environments](https://conda.io/docs/using/envs.html) to maintain
separate Python package environments, each of which can be tied to a
specific build of PyTorch. To set one up:
```
conda create -n pytorch-myfeature
source activate pytorch-myfeature
# if you run python now, torch will NOT be installed
python setup.py build develop
```
## C++ Development tips
If you are working on the C++ code, there are a few important things that you
will want to keep in mind:
1. How to rebuild only the code you are working on, and
2. How to make rebuilds in the absence of changes go faster.
### Build only what you need.
`python setup.py build` will build everything, but since our build system is
not very optimized for incremental rebuilds, this will actually be very slow.
Far better is to only request rebuilds of the parts of the project you are
working on:
- Working on `torch/csrc`? Run `python setup.py develop` to rebuild
(NB: no `build` here!)
- Working on `torch/lib/TH`, did not make any cmake changes, and just want to
see if it compiles? Run `(cd torch/lib/build/TH && make install -j$(getconf _NPROCESSORS_ONLN))`. This
applies for any other subdirectory of `torch/lib`. **Warning: Changes you
make here will not be visible from Python.** See below.
- Working on `torch/lib` and want to run your changes / rerun cmake? Run
`python setup.py build_deps`. Note that this will rerun cmake for
every subdirectory in TH; if you are only working on one project,
consider editing `torch/lib/build_all.sh` and commenting out the
`build` lines of libraries you are not working on.
On the initial build, you can also speed things up with the environment
variables `DEBUG` and `NO_CUDA`.
- `DEBUG=1` will enable debug builds (-g -O0)
- `NO_CUDA=1` will disable compiling CUDA (in case you are developing on something not CUDA related), to save compile time.
For example:
```
NO_CUDA=1 DEBUG=1 python setup.py build develop
```
Make sure you continue to pass these flags on subsequent builds.
### Make no-op build fast.
Python `setuptools` is pretty dumb, and always rebuilds every C file in a
project. Using ccache in a situation like this is a real time-saver. However, by
default, ccache does not properly support CUDA stuff, so here are the
instructions for installing a custom `ccache` fork that has CUDA support:
```
# install and export ccache
if ! ls ~/ccache/bin/ccache
then
sudo apt-get update
sudo apt-get install -y automake autoconf
sudo apt-get install -y asciidoc
mkdir -p ~/ccache
pushd /tmp
rm -rf ccache
git clone https://github.com/colesbury/ccache -b ccbin
pushd ccache
./autogen.sh
./configure
make install prefix=~/ccache
popd
popd
mkdir -p ~/ccache/lib
mkdir -p ~/ccache/cuda
ln -s ~/ccache/bin/ccache ~/ccache/lib/cc
ln -s ~/ccache/bin/ccache ~/ccache/lib/c++
ln -s ~/ccache/bin/ccache ~/ccache/lib/gcc
ln -s ~/ccache/bin/ccache ~/ccache/lib/g++
ln -s ~/ccache/bin/ccache ~/ccache/cuda/nvcc
~/ccache/bin/ccache -M 25Gi
fi
export PATH=~/ccache/lib:$PATH
export CUDA_NVCC_EXECUTABLE=~/ccache/cuda/nvcc
```
Hope this helps, and thanks for considering contributing.

Dockerfile Normal file

@ -0,0 +1,36 @@
FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
RUN echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
git \
curl \
vim \
ca-certificates \
libjpeg-dev \
libpng-dev &&\
rm -rf /var/lib/apt/lists/*
RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install conda-build && \
/opt/conda/bin/conda create -y --name pytorch-py35 python=3.5.2 numpy pyyaml scipy ipython mkl && \
/opt/conda/bin/conda clean -ya
ENV PATH /opt/conda/envs/pytorch-py35/bin:$PATH
RUN conda install --name pytorch-py35 -c soumith magma-cuda80
# This must be done before pip so that requirements.txt is available
WORKDIR /opt/pytorch
COPY . .
RUN TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1+PTX" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
pip install -v .
RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v .
WORKDIR /workspace
RUN chmod -R a+w /workspace

LICENSE Normal file

@ -0,0 +1,38 @@
Copyright (c) 2016- Facebook, Inc (Adam Paszke)
Copyright (c) 2014- Facebook, Inc (Soumith Chintala)
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. Neither the names of Facebook, Deepmind Technologies, NYU, NEC Laboratories America
and IDIAP Research Institute nor the names of its contributors may be
used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

README.md

@ -1,32 +1,254 @@
# THNN
<p align="center"><img width="40%" src="docs/source/_static/img/pytorch-logo-dark.png" /></p>
THNN is a library that gathers nn's C implementations of neural network modules. It's entirely free of Lua dependency and therefore can be used in any application that has a C FFI. Please note that it only contains quite low level functions, and an object oriented C/C++ wrapper will be created soon as another library.
--------------------------------------------------------------------------------
There is also a CUDA counterpart of THNN (THCUNN) in the [cunn repository](https://github.com/torch/cunn/tree/master/lib/THCUNN).
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system
## Links
You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed.
* [API reference](doc/api_reference.md)
* [Style guidelines](doc/style_guidelines.md)
We are in an early-release beta. Expect some adventures and rough edges.
## Motivation
- [More about PyTorch](#more-about-pytorch)
- [Installation](#installation)
- [Binaries](#binaries)
- [From Source](#from-source)
- [Docker Image](#docker-image)
- [Getting Started](#getting-started)
- [Communication](#communication)
- [Releases and Contributing](#releases-and-contributing)
- [The Team](#the-team)
Torch's neural network package (nn) provided many optimized C implementations of modules, but the source files contained Lua specific code and headers so they couldn't be easily compiled and included anywhere else.
| System | 2.7 | 3.5 |
| --- | --- | --- |
| Linux CPU | [![Build Status](https://travis-ci.org/pytorch/pytorch.svg?branch=master)](https://travis-ci.org/pytorch/pytorch) | [![Build Status](https://travis-ci.org/pytorch/pytorch.svg?branch=master)](https://travis-ci.org/pytorch/pytorch) |
| Linux GPU | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py2-linux)](https://build.pytorch.org/job/pytorch-master-py2-linux) | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py3-linux)](https://build.pytorch.org/job/pytorch-master-py3-linux) |
| macOS CPU | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py2-osx-cpu)](https://build.pytorch.org/job/pytorch-master-py2-osx-cpu) | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py3-osx-cpu)](https://build.pytorch.org/job/pytorch-master-py3-osx-cpu) |
THNN is based on the same code, but is written in pure C, so it can be easily included in other code. **Future C implementations should be committed to THNN.**
## API
## More about PyTorch
THNN is a purely functional library. It provides 2-3 functions for each module that perform the most important operations:
At a granular level, PyTorch is a library that consists of the following components:
* **updateOutput** - applies the module to an input
* **updateGradInput** - accepts gradient w.r.t. output and previous module input, and computes a gradient w.r.t. that input
* **accGradParameters** - *(optional, only modules with parameters)* accepts gradient w.r.t. output and previous module input, and computes gradient w.r.t. the parameters
<table>
<tr>
<td><b> torch </b></td>
<td> a Tensor library like NumPy, with strong GPU support </td>
</tr>
<tr>
<td><b> torch.autograd </b></td>
<td> a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch </td>
</tr>
<tr>
<td><b> torch.nn </b></td>
<td> a neural networks library deeply integrated with autograd designed for maximum flexibility </td>
</tr>
<tr>
<td><b> torch.multiprocessing </b></td>
<td> Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training. </td>
</tr>
<tr>
<td><b> torch.utils </b></td>
<td> DataLoader, Trainer and other utility functions for convenience </td>
</tr>
<tr>
<td><b> torch.legacy(.nn/.optim) </b></td>
<td> legacy code that has been ported over from torch for backward compatibility reasons </td>
</tr>
</table>
For information on argument types please check the [API reference](doc/api_reference.md).
Usually one uses PyTorch either as:
## Developer docs
- a replacement for NumPy to use the power of GPUs.
- a deep learning research platform that provides maximum flexibility and speed
* [Style guidelines](doc/style_guidelines.md)
Elaborating further:
This section will be expanded when the FFI refactoring is finished.
### A GPU-Ready Tensor Library
If you use NumPy, then you have used Tensors (a.k.a. ndarray).
<p align=center><img width="30%" src="docs/source/_static/img/tensor_illustration.png" /></p>
PyTorch provides Tensors that can live either on the CPU or the GPU, and accelerate
computation by a huge amount.
We provide a wide variety of tensor routines to accelerate and fit your scientific computation needs
such as slicing, indexing, math operations, linear algebra, reductions.
And they are fast!
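For example, moving a tensor onto the GPU is a single call (a minimal sketch, assuming a CUDA-capable machine):

```python
import torch

x = torch.rand(5, 3)           # created on the CPU
if torch.cuda.is_available():
    x = x.cuda()               # same tensor, now on the GPU
y = x * 2 + 1                  # runs on whichever device x lives on
```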
### Dynamic Neural Networks: Tape-Based Autograd
PyTorch has a unique way of building neural networks: using and replaying a tape recorder.
Most frameworks such as TensorFlow, Theano, Caffe and CNTK have a static view of the world.
One has to build a neural network, and reuse the same structure again and again.
Changing the way the network behaves means that one has to start from scratch.
With PyTorch, we use a technique called reverse-mode auto-differentiation, which allows you to
change the way your network behaves arbitrarily with zero lag or overhead. Our inspiration comes
from several research papers on this topic, as well as current and past work such as
[autograd](https://github.com/twitter/torch-autograd),
[autograd](https://github.com/HIPS/autograd),
[Chainer](http://chainer.org), etc.
While this technique is not unique to PyTorch, it's one of the fastest implementations of it to date.
You get the best of speed and flexibility for your crazy research.
<p align=center><img width="80%" src="docs/source/_static/img/dynamic_graph.gif" /></p>
### Python First
PyTorch is not a Python binding into a monolithic C++ framework.
It is built to be deeply integrated into Python.
You can use it naturally like you would use NumPy / SciPy / scikit-learn etc.
You can write your new neural network layers in Python itself, using your favorite libraries
and use packages such as Cython and Numba.
Our goal is to not reinvent the wheel where appropriate.
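For instance, a new layer is just a Python class; a minimal sketch using `nn.Module` (the class name here is a made-up example):

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):   # hypothetical example module
    def __init__(self):
        super(TwoLayerNet, self).__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 2)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))
```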
### Imperative Experiences
PyTorch is designed to be intuitive, linear in thought and easy to use.
When you execute a line of code, it gets executed. There isn't an asynchronous view of the world.
When you drop into a debugger, or receive error messages and stack traces, understanding them is straightforward.
The stack trace points to exactly where your code was defined.
We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines.
### Fast and Lean
PyTorch has minimal framework overhead. We integrate acceleration libraries
such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed.
At the core, its CPU and GPU Tensor and neural network backends
(TH, THC, THNN, THCUNN) are written as independent libraries with a C99 API.
They are mature and have been tested for years.
Hence, PyTorch is quite fast whether you run small or large neural networks.
The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives.
We've written custom memory allocators for the GPU to make sure that
your deep learning models are maximally memory efficient.
This enables you to train bigger deep learning models than before.
### Extensions without Pain
Writing new neural network modules, or interfacing with PyTorch's Tensor API, was designed to be straightforward
and to involve minimal abstractions.
You can write new neural network layers in Python using the torch API
[or your favorite NumPy-based libraries such as SciPy](http://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html).
If you want to write your layers in C/C++, we provide an extension API based on
[cffi](http://cffi.readthedocs.io/en/latest/) that is efficient and has minimal boilerplate.
There is no wrapper code that needs to be written. You can see [a tutorial here](http://pytorch.org/tutorials/advanced/c_extension.html) and [an example here](https://github.com/pytorch/extension-ffi).
## Installation
### Binaries
Commands to install from binaries via Conda or pip wheels are on our website:
[http://pytorch.org](http://pytorch.org)
### From Source
If you are installing from source, we highly recommend installing an [Anaconda](https://www.continuum.io/downloads) environment.
You will get a high-quality BLAS library (MKL) and you get a controlled compiler version regardless of your Linux distro.
Once you have [Anaconda](https://www.continuum.io/downloads) installed, here are the instructions.
If you want to compile with CUDA support, install
- [NVIDIA CUDA](https://developer.nvidia.com/cuda-downloads) 7.5 or above
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v5.x or above
If you want to disable CUDA support, export environment variable `NO_CUDA=1`.
#### Install optional dependencies
On Linux
```bash
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]
# Install basic dependencies
conda install numpy pyyaml mkl setuptools cmake gcc cffi
# Add LAPACK support for the GPU
conda install -c soumith magma-cuda80 # or magma-cuda75 if CUDA 7.5
```
On OSX
```bash
export CMAKE_PREFIX_PATH=[anaconda root directory]
conda install numpy pyyaml setuptools cmake cffi
```
#### Install PyTorch
On Linux
```bash
python setup.py install
```
On OSX
```bash
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install
```
### Docker image
A Dockerfile is supplied to build images with CUDA support and cuDNN v6. Build as usual:
```
docker build -t pytorch .
```
Alternatively, if you want a runtime image, build with
```
docker build -t pytorch . -f tools/docker/Dockerfile_runtime
```
and run with nvidia-docker:
```
nvidia-docker run --rm -ti --ipc=host pytorch
```
Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g.
for multithreaded data loaders) the default shared memory segment size that the container runs with is not enough, and you
should increase shared memory size either with `--ipc=host` or `--shm-size` command line options to `nvidia-docker run`.
## Getting Started
Three pointers to get you started:
- [Tutorials: get you started with understanding and using PyTorch](http://pytorch.org/tutorials/)
- [Examples: easy to understand pytorch code across all domains](https://github.com/pytorch/examples)
- The API Reference: [http://pytorch.org/docs/](http://pytorch.org/docs/)
## Communication
* forums: discuss implementations, research, etc. http://discuss.pytorch.org
* GitHub issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.
* Slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . If you need a slack invite, ping us at soumith@pytorch.org
* newsletter: no-noise, one-way email newsletter with important announcements about pytorch. You can sign-up here: http://eepurl.com/cbG0rv
## Releases and Contributing
PyTorch has a 90 day release cycle (major releases).
Its current state is Beta; we expect no obvious bugs. Please let us know if you encounter a bug by [filing an issue](https://github.com/pytorch/pytorch/issues).
We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.
If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without discussion might end up resulting in a rejected PR, because we might be taking the core in a different direction than you might be aware of.
**For the next release cycle, these are the 3 big features we are planning to add:**
1. [Distributed PyTorch](https://github.com/pytorch/pytorch/issues/241) (a draft implementation is present in this [branch](https://github.com/apaszke/pytorch-dist) )
2. Backward of Backward - Backpropagating through the optimization process itself. Some past and recent papers such as
[Double Backprop](http://yann.lecun.com/exdb/publis/pdf/drucker-lecun-91.pdf) and [Unrolled GANs](https://arxiv.org/abs/1611.02163) need this.
3. Lazy Execution Engine for autograd - This will enable us to optionally introduce caching and JIT compilers to optimize autograd code.
## The Team
PyTorch is a community driven project with several skillful engineers and researchers contributing to it.
PyTorch is currently maintained by [Adam Paszke](https://apaszke.github.io/), [Sam Gross](https://github.com/colesbury) and [Soumith Chintala](http://soumith.ch) with major contributions coming from 10s of talented individuals in various forms and means. A non-exhaustive but growing list needs to mention: Sergey Zagoruyko, Adam Lerer, Francisco Massa, Andreas Kopf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein.
Note: this project is unrelated to [hughperkins/pytorch](https://github.com/hughperkins/pytorch) with the same name. Hugh is a valuable contributor in the Torch community and has helped with many things Torch and PyTorch.

File diff suppressed because it is too large.

make2cmake.cmake Normal file

@ -0,0 +1,106 @@
# James Bigler, NVIDIA Corp (nvidia.com - jbigler)
# Abe Stephens, SCI Institute -- http://www.sci.utah.edu/~abe/FindCuda.html
#
# Copyright (c) 2008 - 2009 NVIDIA Corporation. All rights reserved.
#
# Copyright (c) 2007-2009
# Scientific Computing and Imaging Institute, University of Utah
#
# This code is licensed under the MIT License. See the FindCUDA.cmake script
# for the text of the license.
# The MIT License
#
# License for the specific language governing rights and limitations under
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#
#######################################################################
# This converts a file written in makefile syntax into one that can be included
# by CMake.
# Input variables
#
# verbose:BOOL=<> OFF: Be as quiet as possible (default)
# ON : Extra output
#
# input_file:FILEPATH=<> Path to dependency file in makefile format
#
# output_file:FILEPATH=<> Path to file with dependencies in CMake readable variable
#
file(READ ${input_file} depend_text)
if (NOT "${depend_text}" STREQUAL "")
# message("FOUND DEPENDS")
string(REPLACE "\\ " " " depend_text ${depend_text})
# This works for the nvcc -M generated dependency files.
string(REGEX REPLACE "^.* : " "" depend_text ${depend_text})
string(REGEX REPLACE "[ \\\\]*\n" ";" depend_text ${depend_text})
set(dependency_list "")
foreach(file ${depend_text})
string(REGEX REPLACE "^ +" "" file ${file})
# OK, now if we had a UNC path, nvcc has a tendency to only output the first '/'
# instead of '//'. Here we will test to see if the file exists, if it doesn't then
# try to prepend another '/' to the path and test again. If it still fails remove the
# path.
if(NOT EXISTS "${file}")
if (EXISTS "/${file}")
set(file "/${file}")
else()
if(verbose)
message(WARNING " Removing non-existent dependency file: ${file}")
endif()
set(file "")
endif()
endif()
# Make sure we check to see if we have a file, before asking if it is not a directory.
# if(NOT IS_DIRECTORY "") will return TRUE.
if(file AND NOT IS_DIRECTORY "${file}")
# If softlinks start to matter, we should change this to REALPATH. For now we need
# to flatten paths, because nvcc can generate stuff like /bin/../include instead of
# just /include.
get_filename_component(file_absolute "${file}" ABSOLUTE)
list(APPEND dependency_list "${file_absolute}")
endif()
endforeach()
else()
# message("FOUND NO DEPENDS")
endif()
# Remove the duplicate entries and sort them.
list(REMOVE_DUPLICATES dependency_list)
list(SORT dependency_list)
foreach(file ${dependency_list})
set(cuda_nvcc_depend "${cuda_nvcc_depend} \"${file}\"\n")
endforeach()
file(WRITE ${output_file} "# Generated by: make2cmake.cmake\nSET(CUDA_NVCC_DEPEND\n ${cuda_nvcc_depend})\n\n")

parse_cubin.cmake Normal file

@ -0,0 +1,111 @@
# James Bigler, NVIDIA Corp (nvidia.com - jbigler)
# Abe Stephens, SCI Institute -- http://www.sci.utah.edu/~abe/FindCuda.html
#
# Copyright (c) 2008 - 2009 NVIDIA Corporation. All rights reserved.
#
# Copyright (c) 2007-2009
# Scientific Computing and Imaging Institute, University of Utah
#
# This code is licensed under the MIT License. See the FindCUDA.cmake script
# for the text of the license.
# The MIT License
#
# License for the specific language governing rights and limitations under
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#
#######################################################################
# Parses a .cubin file produced by nvcc and reports statistics about the file.
file(READ ${input_file} file_text)
if (NOT "${file_text}" STREQUAL "")
string(REPLACE ";" "\\;" file_text ${file_text})
string(REPLACE "\ncode" ";code" file_text ${file_text})
list(LENGTH file_text len)
foreach(line ${file_text})
# Only look at "code { }" blocks.
if(line MATCHES "^code")
# Break into individual lines.
string(REGEX REPLACE "\n" ";" line ${line})
foreach(entry ${line})
# Extract kernel names.
if (${entry} MATCHES "[^g]name = ([^ ]+)")
set(entry "${CMAKE_MATCH_1}")
# Check to see if the kernel name starts with "_"
set(skip FALSE)
# if (${entry} MATCHES "^_")
# Skip the rest of this block.
# message("Skipping ${entry}")
# set(skip TRUE)
# else ()
message("Kernel: ${entry}")
# endif ()
endif()
# Skip the rest of the block if necessary
if(NOT skip)
# Registers
if (${entry} MATCHES "reg([ ]+)=([ ]+)([^ ]+)")
set(entry "${CMAKE_MATCH_3}")
message("Registers: ${entry}")
endif()
# Local memory
if (${entry} MATCHES "lmem([ ]+)=([ ]+)([^ ]+)")
set(entry "${CMAKE_MATCH_3}")
message("Local: ${entry}")
endif()
# Shared memory
if (${entry} MATCHES "smem([ ]+)=([ ]+)([^ ]+)")
set(entry "${CMAKE_MATCH_3}")
message("Shared: ${entry}")
endif()
if (${entry} MATCHES "^}")
message("")
endif()
endif()
endforeach()
endif()
endforeach()
else()
# message("FOUND NO DEPENDS")
endif()

run_nvcc.cmake Normal file

@ -0,0 +1,291 @@
# James Bigler, NVIDIA Corp (nvidia.com - jbigler)
#
# Copyright (c) 2008 - 2009 NVIDIA Corporation. All rights reserved.
#
# This code is licensed under the MIT License. See the FindCUDA.cmake script
# for the text of the license.
# The MIT License
#
# License for the specific language governing rights and limitations under
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
##########################################################################
# This file runs the nvcc commands to produce the desired output file along with
# the dependency file needed by CMake to compute dependencies. In addition, the
# file checks the output of each command and, if the command fails, deletes the
# output files.
# Input variables
#
# verbose:BOOL=<> OFF: Be as quiet as possible (default)
# ON : Describe each step
#
# build_configuration:STRING=<> Typically one of Debug, MinSizeRel, Release, or
# RelWithDebInfo, but it should match one of the
# entries in CUDA_HOST_FLAGS. This is the build
# configuration used when compiling the code. If
# blank or unspecified Debug is assumed as this is
# what CMake does.
#
# generated_file:STRING=<> File to generate. This argument must be passed in.
#
# generated_cubin_file:STRING=<> File to generate. This argument must be passed
# in if build_cubin is true.
if(NOT generated_file)
message(FATAL_ERROR "You must specify generated_file on the command line")
endif()
# Set these up as variables to make reading the generated file easier
set(CMAKE_COMMAND "@CMAKE_COMMAND@") # path
set(source_file "@source_file@") # path
set(NVCC_generated_dependency_file "@NVCC_generated_dependency_file@") # path
set(cmake_dependency_file "@cmake_dependency_file@") # path
set(CUDA_make2cmake "@CUDA_make2cmake@") # path
set(CUDA_parse_cubin "@CUDA_parse_cubin@") # path
set(build_cubin @build_cubin@) # bool
set(CUDA_HOST_COMPILER "@CUDA_HOST_COMPILER@") # path
# We won't actually use these variables for now, but we need to set them in
# order to force this file to be run again if it changes.
set(generated_file_path "@generated_file_path@") # path
set(generated_file_internal "@generated_file@") # path
set(generated_cubin_file_internal "@generated_cubin_file@") # path
set(CUDA_NVCC_EXECUTABLE "@CUDA_NVCC_EXECUTABLE@") # path
set(CUDA_NVCC_FLAGS @CUDA_NVCC_FLAGS@ ;; @CUDA_WRAP_OPTION_NVCC_FLAGS@) # list
@CUDA_NVCC_FLAGS_CONFIG@
set(nvcc_flags @nvcc_flags@) # list
set(CUDA_NVCC_INCLUDE_ARGS "@CUDA_NVCC_INCLUDE_ARGS@") # list (needs to be in quotes to handle spaces properly).
set(format_flag "@format_flag@") # string
set(cuda_language_flag @cuda_language_flag@) # list
if(build_cubin AND NOT generated_cubin_file)
message(FATAL_ERROR "You must specify generated_cubin_file on the command line")
endif()
# This is the list of host compilation flags. Either C or CXX should already
# have been chosen by FindCUDA.cmake.
@CUDA_HOST_FLAGS@
# Take the compiler flags and package them up to be sent to the compiler via -Xcompiler
set(nvcc_host_compiler_flags "")
# If we weren't given a build_configuration, use Debug.
if(NOT build_configuration)
set(build_configuration Debug)
endif()
string(TOUPPER "${build_configuration}" build_configuration)
#message("CUDA_NVCC_HOST_COMPILER_FLAGS = ${CUDA_NVCC_HOST_COMPILER_FLAGS}")
foreach(flag ${CMAKE_HOST_FLAGS} ${CMAKE_HOST_FLAGS_${build_configuration}})
# Extra quotes are added around each flag to help nvcc parse out flags with spaces.
set(nvcc_host_compiler_flags "${nvcc_host_compiler_flags},\"${flag}\"")
endforeach()
if (nvcc_host_compiler_flags)
set(nvcc_host_compiler_flags "-Xcompiler" ${nvcc_host_compiler_flags})
endif()
#message("nvcc_host_compiler_flags = \"${nvcc_host_compiler_flags}\"")
# Add the build specific configuration flags
list(APPEND CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS_${build_configuration}})
# Any -ccbin existing in CUDA_NVCC_FLAGS gets highest priority
list( FIND CUDA_NVCC_FLAGS "-ccbin" ccbin_found0 )
list( FIND CUDA_NVCC_FLAGS "--compiler-bindir" ccbin_found1 )
if( ccbin_found0 LESS 0 AND ccbin_found1 LESS 0 AND CUDA_HOST_COMPILER )
if (CUDA_HOST_COMPILER STREQUAL "$(VCInstallDir)bin" AND DEFINED CCBIN)
set(CCBIN -ccbin "${CCBIN}")
else()
set(CCBIN -ccbin "${CUDA_HOST_COMPILER}")
endif()
endif()
# cuda_execute_process - Executes a command with optional command echo and status message.
#
# status - Status message to print if verbose is true
# command - COMMAND argument from the usual execute_process argument structure
# ARGN - Remaining arguments are the command with arguments
#
# CUDA_result - return value from running the command
#
# Make this a macro instead of a function, so that things like RESULT_VARIABLE
# and other return variables are present after executing the process.
macro(cuda_execute_process status command)
set(_command ${command})
if(NOT "x${_command}" STREQUAL "xCOMMAND")
message(FATAL_ERROR "Malformed call to cuda_execute_process. Missing COMMAND as second argument. (command = ${command})")
endif()
if(verbose)
execute_process(COMMAND "${CMAKE_COMMAND}" -E echo -- ${status})
# Now we need to build up our command string. We are accounting for quotes
# and spaces; anything else is left up to the user to fix if they want to
# copy and paste a runnable command line.
set(cuda_execute_process_string)
foreach(arg ${ARGN})
# If there are quotes, escape them so they come through.
string(REPLACE "\"" "\\\"" arg ${arg})
# Args with spaces need quotes around them to get them to be parsed as a single argument.
if(arg MATCHES " ")
list(APPEND cuda_execute_process_string "\"${arg}\"")
else()
list(APPEND cuda_execute_process_string ${arg})
endif()
endforeach()
# Echo the command
execute_process(COMMAND ${CMAKE_COMMAND} -E echo ${cuda_execute_process_string})
endif()
# Run the command
execute_process(COMMAND ${ARGN} RESULT_VARIABLE CUDA_result )
endmacro()
# Delete the target file
cuda_execute_process(
"Removing ${generated_file}"
COMMAND "${CMAKE_COMMAND}" -E remove "${generated_file}"
)
# For CUDA 2.3 and below, -G -M doesn't work, so remove the -G flag
# for dependency generation and hope for the best.
set(depends_CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS}")
set(CUDA_VERSION @CUDA_VERSION@)
if(CUDA_VERSION VERSION_LESS "3.0")
cmake_policy(PUSH)
# CMake policy 0007 NEW states that empty list elements are not
# ignored. I'm just setting it to avoid the warning that's printed.
cmake_policy(SET CMP0007 NEW)
# Note that this will remove all occurrences of -G.
list(REMOVE_ITEM depends_CUDA_NVCC_FLAGS "-G")
cmake_policy(POP)
endif()
# nvcc doesn't define __CUDACC__ for some reason when generating dependency files. This
# can cause incorrect dependencies when #including files based on this macro, which is
# defined in the generating passes of nvcc invocation. We will go ahead and manually
# define this for now until a future version fixes this bug.
set(CUDACC_DEFINE -D__CUDACC__)
# Generate the dependency file
cuda_execute_process(
"Generating dependency file: ${NVCC_generated_dependency_file}"
COMMAND "${CUDA_NVCC_EXECUTABLE}"
-M
${CUDACC_DEFINE}
"${source_file}"
-o "${NVCC_generated_dependency_file}"
${CCBIN}
${nvcc_flags}
${nvcc_host_compiler_flags}
${depends_CUDA_NVCC_FLAGS}
-DNVCC
${CUDA_NVCC_INCLUDE_ARGS}
)
if(CUDA_result)
message(FATAL_ERROR "Error generating ${generated_file}")
endif()
# Generate the cmake readable dependency file to a temp file. Don't put the
# quotes just around the filenames for the input_file and output_file variables.
# CMake will pass the quotes through and not be able to find the file.
cuda_execute_process(
"Generating temporary cmake readable file: ${cmake_dependency_file}.tmp"
COMMAND "${CMAKE_COMMAND}"
-D "input_file:FILEPATH=${NVCC_generated_dependency_file}"
-D "output_file:FILEPATH=${cmake_dependency_file}.tmp"
-D "verbose=${verbose}"
-P "${CUDA_make2cmake}"
)
if(CUDA_result)
message(FATAL_ERROR "Error generating ${generated_file}")
endif()
# Copy the file if it is different
cuda_execute_process(
"Copy if different ${cmake_dependency_file}.tmp to ${cmake_dependency_file}"
COMMAND "${CMAKE_COMMAND}" -E copy_if_different "${cmake_dependency_file}.tmp" "${cmake_dependency_file}"
)
if(CUDA_result)
message(FATAL_ERROR "Error generating ${generated_file}")
endif()
# Delete the temporary file
cuda_execute_process(
"Removing ${cmake_dependency_file}.tmp and ${NVCC_generated_dependency_file}"
COMMAND "${CMAKE_COMMAND}" -E remove "${cmake_dependency_file}.tmp" "${NVCC_generated_dependency_file}"
)
if(CUDA_result)
message(FATAL_ERROR "Error generating ${generated_file}")
endif()
# Generate the code
cuda_execute_process(
"Generating ${generated_file}"
COMMAND "${CUDA_NVCC_EXECUTABLE}"
"${source_file}"
${cuda_language_flag}
${format_flag} -o "${generated_file}"
${CCBIN}
${nvcc_flags}
${nvcc_host_compiler_flags}
${CUDA_NVCC_FLAGS}
-DNVCC
${CUDA_NVCC_INCLUDE_ARGS}
)
if(CUDA_result)
# Since nvcc can sometimes leave half-done files, make sure that we delete the output file.
cuda_execute_process(
"Removing ${generated_file}"
COMMAND "${CMAKE_COMMAND}" -E remove "${generated_file}"
)
message(FATAL_ERROR "Error generating file ${generated_file}")
else()
if(verbose)
message("Generated ${generated_file} successfully.")
endif()
endif()
# Cubin resource report commands.
if( build_cubin )
# Run with -cubin to produce resource usage report.
cuda_execute_process(
"Generating ${generated_cubin_file}"
COMMAND "${CUDA_NVCC_EXECUTABLE}"
"${source_file}"
${CUDA_NVCC_FLAGS}
${nvcc_flags}
${CCBIN}
${nvcc_host_compiler_flags}
-DNVCC
-cubin
-o "${generated_cubin_file}"
${CUDA_NVCC_INCLUDE_ARGS}
)
# Execute the parser script.
cuda_execute_process(
"Executing the parser script"
COMMAND "${CMAKE_COMMAND}"
-D "input_file:STRING=${generated_cubin_file}"
-P "${CUDA_parse_cubin}"
)
endif()

@@ -0,0 +1,200 @@
# Synopsis:
# CUDA_SELECT_NVCC_ARCH_FLAGS(out_variable [target_CUDA_architectures])
# -- Selects GPU arch flags for nvcc based on target_CUDA_architectures
# target_CUDA_architectures : Auto | Common | All | LIST(ARCH_AND_PTX ...)
# - "Auto" detects local machine GPU compute arch at runtime.
# - "Common" and "All" cover common and entire subsets of architectures
# ARCH_AND_PTX : NAME | NUM.NUM | NUM.NUM(NUM.NUM) | NUM.NUM+PTX
# NAME: Fermi Kepler Maxwell Kepler+Tegra Kepler+Tesla Maxwell+Tegra Pascal
# NUM: Any number. Only those pairs are currently accepted by NVCC though:
# 2.0 2.1 3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0 6.2
# Returns LIST of flags to be added to CUDA_NVCC_FLAGS in ${out_variable}
# Additionally, sets ${out_variable}_readable to the resulting numeric list
# Example:
# CUDA_SELECT_NVCC_ARCH_FLAGS(ARCH_FLAGS 3.0 3.5+PTX 5.2(5.0) Maxwell)
# LIST(APPEND CUDA_NVCC_FLAGS ${ARCH_FLAGS})
#
# More info on CUDA architectures: https://en.wikipedia.org/wiki/CUDA
#
# This list will be used for CUDA_ARCH_NAME = All option
set(CUDA_KNOWN_GPU_ARCHITECTURES "Fermi" "Kepler" "Maxwell")
# This list will be used for CUDA_ARCH_NAME = Common option (enabled by default)
set(CUDA_COMMON_GPU_ARCHITECTURES "3.0" "3.5" "5.0")
if (CUDA_VERSION VERSION_GREATER "6.5")
list(APPEND CUDA_KNOWN_GPU_ARCHITECTURES "Kepler+Tegra" "Kepler+Tesla" "Maxwell+Tegra")
list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "5.2")
endif ()
if (CUDA_VERSION VERSION_GREATER "7.5")
list(APPEND CUDA_KNOWN_GPU_ARCHITECTURES "Pascal")
list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "6.0" "6.1" "6.1+PTX")
else()
list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "5.2+PTX")
endif ()
################################################################################################
# A function for automatic detection of GPUs installed (if autodetection is enabled)
# Usage:
# CUDA_DETECT_INSTALLED_GPUS(OUT_VARIABLE)
#
function(CUDA_DETECT_INSTALLED_GPUS OUT_VARIABLE)
if(NOT CUDA_GPU_DETECT_OUTPUT)
set(cufile ${PROJECT_BINARY_DIR}/detect_cuda_archs.cu)
file(WRITE ${cufile} ""
"#include <cstdio>\n"
"int main()\n"
"{\n"
" int count = 0;\n"
" if (cudaSuccess != cudaGetDeviceCount(&count)) return -1;\n"
" if (count == 0) return -1;\n"
" for (int device = 0; device < count; ++device)\n"
" {\n"
" cudaDeviceProp prop;\n"
" if (cudaSuccess == cudaGetDeviceProperties(&prop, device))\n"
" std::printf(\"%d.%d \", prop.major, prop.minor);\n"
" }\n"
" return 0;\n"
"}\n")
execute_process(COMMAND "${CUDA_NVCC_EXECUTABLE}" "--run" "${cufile}"
"-ccbin" ${CMAKE_CXX_COMPILER}
WORKING_DIRECTORY "${PROJECT_BINARY_DIR}/CMakeFiles/"
RESULT_VARIABLE nvcc_res OUTPUT_VARIABLE nvcc_out
ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)
if(nvcc_res EQUAL 0)
# only keep the last line of nvcc_out
STRING(REGEX REPLACE ";" "\\\\;" nvcc_out "${nvcc_out}")
STRING(REGEX REPLACE "\n" ";" nvcc_out "${nvcc_out}")
list(GET nvcc_out -1 nvcc_out)
string(REPLACE "2.1" "2.1(2.0)" nvcc_out "${nvcc_out}")
set(CUDA_GPU_DETECT_OUTPUT ${nvcc_out} CACHE INTERNAL "Returned GPU architectures from detect_gpus tool" FORCE)
endif()
endif()
if(NOT CUDA_GPU_DETECT_OUTPUT)
message(STATUS "Automatic GPU detection failed. Building for common architectures.")
set(${OUT_VARIABLE} ${CUDA_COMMON_GPU_ARCHITECTURES} PARENT_SCOPE)
else()
set(${OUT_VARIABLE} ${CUDA_GPU_DETECT_OUTPUT} PARENT_SCOPE)
endif()
endfunction()
################################################################################################
# Function for selecting GPU arch flags for nvcc based on CUDA architectures from parameter list
# Usage:
# SELECT_NVCC_ARCH_FLAGS(out_variable [list of CUDA compute archs])
function(CUDA_SELECT_NVCC_ARCH_FLAGS out_variable)
set(CUDA_ARCH_LIST "${ARGN}")
if("X${CUDA_ARCH_LIST}" STREQUAL "X" )
set(CUDA_ARCH_LIST "Auto")
endif()
set(cuda_arch_bin)
set(cuda_arch_ptx)
if("${CUDA_ARCH_LIST}" STREQUAL "All")
set(CUDA_ARCH_LIST ${CUDA_KNOWN_GPU_ARCHITECTURES})
elseif("${CUDA_ARCH_LIST}" STREQUAL "Common")
set(CUDA_ARCH_LIST ${CUDA_COMMON_GPU_ARCHITECTURES})
elseif("${CUDA_ARCH_LIST}" STREQUAL "Auto")
CUDA_DETECT_INSTALLED_GPUS(CUDA_ARCH_LIST)
message(STATUS "Autodetected CUDA architecture(s): ${CUDA_ARCH_LIST}")
endif()
# Now process the list and look for names
string(REGEX REPLACE "[ \t]+" ";" CUDA_ARCH_LIST "${CUDA_ARCH_LIST}")
list(REMOVE_DUPLICATES CUDA_ARCH_LIST)
foreach(arch_name ${CUDA_ARCH_LIST})
set(arch_bin)
set(add_ptx FALSE)
# Check to see if we are compiling PTX
if(arch_name MATCHES "(.*)\\+PTX$")
set(add_ptx TRUE)
set(arch_name ${CMAKE_MATCH_1})
endif()
if(arch_name MATCHES "(^[0-9]\\.[0-9](\\([0-9]\\.[0-9]\\))?)$")
set(arch_bin ${CMAKE_MATCH_1})
set(arch_ptx ${arch_bin})
else()
# Look for it in our list of known architectures
if(${arch_name} STREQUAL "Fermi")
set(arch_bin "2.0 2.1(2.0)")
elseif(${arch_name} STREQUAL "Kepler+Tegra")
set(arch_bin 3.2)
elseif(${arch_name} STREQUAL "Kepler+Tesla")
set(arch_bin 3.7)
elseif(${arch_name} STREQUAL "Kepler")
set(arch_bin 3.0 3.5)
set(arch_ptx 3.5)
elseif(${arch_name} STREQUAL "Maxwell+Tegra")
set(arch_bin 5.3)
elseif(${arch_name} STREQUAL "Maxwell")
set(arch_bin 5.0 5.2)
set(arch_ptx 5.2)
elseif(${arch_name} STREQUAL "Pascal")
set(arch_bin 6.0 6.1)
set(arch_ptx 6.1)
else()
message(SEND_ERROR "Unknown CUDA Architecture Name ${arch_name} in CUDA_SELECT_NVCC_ARCH_FLAGS")
endif()
endif()
if(NOT arch_bin)
message(SEND_ERROR "arch_bin wasn't set for some reason")
endif()
list(APPEND cuda_arch_bin ${arch_bin})
if(add_ptx)
if (NOT arch_ptx)
set(arch_ptx ${arch_bin})
endif()
list(APPEND cuda_arch_ptx ${arch_ptx})
endif()
endforeach()
# remove dots and convert to lists
string(REGEX REPLACE "\\." "" cuda_arch_bin "${cuda_arch_bin}")
string(REGEX REPLACE "\\." "" cuda_arch_ptx "${cuda_arch_ptx}")
string(REGEX MATCHALL "[0-9()]+" cuda_arch_bin "${cuda_arch_bin}")
string(REGEX MATCHALL "[0-9]+" cuda_arch_ptx "${cuda_arch_ptx}")
if(cuda_arch_bin)
list(REMOVE_DUPLICATES cuda_arch_bin)
endif()
if(cuda_arch_ptx)
list(REMOVE_DUPLICATES cuda_arch_ptx)
endif()
set(nvcc_flags "")
set(nvcc_archs_readable "")
# Tell NVCC to add binaries for the specified GPUs
foreach(arch ${cuda_arch_bin})
if(arch MATCHES "([0-9]+)\\(([0-9]+)\\)")
# User explicitly specified ARCH for the concrete CODE
list(APPEND nvcc_flags -gencode arch=compute_${CMAKE_MATCH_2},code=sm_${CMAKE_MATCH_1})
list(APPEND nvcc_archs_readable sm_${CMAKE_MATCH_1})
else()
# User didn't explicitly specify ARCH for the concrete CODE, we assume ARCH=CODE
list(APPEND nvcc_flags -gencode arch=compute_${arch},code=sm_${arch})
list(APPEND nvcc_archs_readable sm_${arch})
endif()
endforeach()
# Tell NVCC to add PTX intermediate code for the specified architectures
foreach(arch ${cuda_arch_ptx})
list(APPEND nvcc_flags -gencode arch=compute_${arch},code=compute_${arch})
list(APPEND nvcc_archs_readable compute_${arch})
endforeach()
string(REPLACE ";" " " nvcc_archs_readable "${nvcc_archs_readable}")
set(${out_variable} ${nvcc_flags} PARENT_SCOPE)
set(${out_variable}_readable ${nvcc_archs_readable} PARENT_SCOPE)
endfunction()

27
docs/Makefile Normal file
@@ -0,0 +1,27 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = PyTorch
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
docset: html
doc2dash --name $(SPHINXPROJ) --icon $(SOURCEDIR)/_static/img/pytorch-logo-flame.png --enable-js --online-redirect-url http://pytorch.org/docs/ --force $(BUILDDIR)/html/
# Manually fix because Zeal doesn't deal well with `icon.png`-only at 2x resolution.
cp $(SPHINXPROJ).docset/icon.png $(SPHINXPROJ).docset/icon@2x.png
convert $(SPHINXPROJ).docset/icon@2x.png -resize 16x16 $(SPHINXPROJ).docset/icon.png
.PHONY: help Makefile docset
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

36
docs/make.bat Normal file
@@ -0,0 +1,36 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
set SPHINXPROJ=PyTorch
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
:end
popd

2
docs/requirements.txt Normal file
@@ -0,0 +1,2 @@
sphinx
-e git://github.com/snide/sphinx_rtd_theme.git#egg=sphinx_rtd_theme

@@ -0,0 +1,118 @@
body {
font-family: "Lato","proxima-nova","Helvetica Neue",Arial,sans-serif;
}
/* Default header fonts are ugly */
h1, h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend, p.caption {
font-family: "Lato","proxima-nova","Helvetica Neue",Arial,sans-serif;
}
/* Use white for docs background */
.wy-side-nav-search {
background-color: #fff;
}
.wy-nav-content-wrap, .wy-menu li.current > a {
background-color: #fff;
}
@media screen and (min-width: 1400px) {
.wy-nav-content-wrap {
background-color: rgba(0, 0, 0, 0.0470588);
}
.wy-nav-content {
background-color: #fff;
}
}
/* Fixes for mobile */
.wy-nav-top {
background-color: #fff;
background-image: url('../img/pytorch-logo-dark.svg');
background-repeat: no-repeat;
background-position: center;
padding: 0;
margin: 0.4045em 0.809em;
color: #333;
}
.wy-nav-top > a {
display: none;
}
@media screen and (max-width: 768px) {
.wy-side-nav-search>a img.logo {
height: 60px;
}
}
/* This is needed to ensure that logo above search scales properly */
.wy-side-nav-search a {
display: block;
}
/* This ensures that multiple constructors will remain in separate lines. */
.rst-content dl:not(.docutils) dt {
display: table;
}
/* Use our red for literals (it's very similar to the original color) */
.rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
color: #F05732;
}
.rst-content tt.xref, a .rst-content tt, .rst-content tt.xref,
.rst-content code.xref, a .rst-content tt, a .rst-content code {
color: #404040;
}
/* Change link colors (except for the menu) */
a {
color: #F05732;
}
a:hover {
color: #F05732;
}
a:visited {
color: #D44D2C;
}
.wy-menu a {
color: #b3b3b3;
}
.wy-menu a:hover {
color: #b3b3b3;
}
/* Default footer text is quite big */
footer {
font-size: 80%;
}
footer .rst-footer-buttons {
font-size: 125%; /* revert footer settings - 1/80% = 125% */
}
footer p {
font-size: 100%;
}
/* For hidden headers that appear in TOC tree */
/* see http://stackoverflow.com/a/32363545/3343043 */
.rst-content .hidden-section {
display: none;
}
nav .hidden-section {
display: inherit;
}
.wy-side-nav-search>div.version {
color: #000;
}

Binary file not shown (added; 258 KiB)

Binary file not shown (added; 27 KiB)

@@ -0,0 +1,24 @@
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 21.0.0, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
viewBox="0 0 199.7 40.2" style="enable-background:new 0 0 199.7 40.2;" xml:space="preserve">
<style type="text/css">
.st0{fill:#F05732;}
.st1{fill:#9E529F;}
.st2{fill:#333333;}
</style>
<path class="st0" d="M102.7,12.2c-1.3-1-1.8,3.9-4.4,3.9c-3,0-4-13-6.3-13c-0.7,0-0.8-0.4-7.9,21.3c-2.9,9,4.4,15.8,11.8,15.8
c4.6,0,12.3-3,12.3-12.6C108.2,20.5,104.7,13.7,102.7,12.2z M95.8,35.3c-3.7,0-6.7-3.1-6.7-7c0-3.9,3-7,6.7-7s6.7,3.1,6.7,7
C102.5,32.1,99.5,35.3,95.8,35.3z"/>
<path class="st1" d="M99.8,0c-0.5,0-1.8,2.5-1.8,3.6c0,1.5,1,2,1.8,2c0.8,0,1.8-0.5,1.8-2C101.5,2.5,100.2,0,99.8,0z"/>
<path class="st2" d="M0,39.5V14.9h11.5c5.3,0,8.3,3.6,8.3,7.9c0,4.3-3,7.9-8.3,7.9H5.2v8.8H0z M14.4,22.8c0-2.1-1.6-3.3-3.7-3.3H5.2
v6.6h5.5C12.8,26.1,14.4,24.8,14.4,22.8z"/>
<path class="st2" d="M35.2,39.5V29.4l-9.4-14.5h6l6.1,9.8l6.1-9.8h5.9l-9.4,14.5v10.1H35.2z"/>
<path class="st2" d="M63.3,39.5v-20h-7.2v-4.6h19.6v4.6h-7.2v20H63.3z"/>
<path class="st2" d="M131.4,39.5l-4.8-8.7h-3.8v8.7h-5.2V14.9H129c5.1,0,8.3,3.4,8.3,7.9c0,4.3-2.8,6.7-5.4,7.3l5.6,9.4H131.4z
M131.9,22.8c0-2-1.6-3.3-3.7-3.3h-5.5v6.6h5.5C130.3,26.1,131.9,24.9,131.9,22.8z"/>
<path class="st2" d="M145.6,27.2c0-7.6,5.7-12.7,13.1-12.7c5.4,0,8.5,2.9,10.3,6l-4.5,2.2c-1-2-3.2-3.6-5.8-3.6
c-4.5,0-7.7,3.4-7.7,8.1c0,4.6,3.2,8.1,7.7,8.1c2.5,0,4.7-1.6,5.8-3.6l4.5,2.2c-1.7,3.1-4.9,6-10.3,6
C151.3,39.9,145.6,34.7,145.6,27.2z"/>
<path class="st2" d="M194.5,39.5V29.1h-11.6v10.4h-5.2V14.9h5.2v9.7h11.6v-9.7h5.3v24.6H194.5z"/>
</svg>


Binary file not shown (added; 1010 B)

@@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
height="40.200001"
width="40.200001"
xml:space="preserve"
viewBox="0 0 40.200002 40.2"
y="0px"
x="0px"
id="Layer_1"
version="1.1"><metadata
id="metadata4717"><rdf:RDF><cc:Work
rdf:about=""><dc:format>image/svg+xml</dc:format><dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" /><dc:title></dc:title></cc:Work></rdf:RDF></metadata><defs
id="defs4715" /><style
id="style4694"
type="text/css">
.st0{fill:#F05732;}
.st1{fill:#9E529F;}
.st2{fill:#333333;}
</style><path
style="fill:#f05732"
id="path4696"
d="m 26.975479,12.199999 c -1.3,-1 -1.8,3.9 -4.4,3.9 -3,0 -4,-12.9999998 -6.3,-12.9999998 -0.7,0 -0.8,-0.4 -7.9000003,21.2999998 -2.9000001,9 4.4000003,15.8 11.8000003,15.8 4.6,0 12.3,-3 12.3,-12.6 0,-7.1 -3.5,-13.9 -5.5,-15.4 z m -6.9,23.1 c -3.7,0 -6.7,-3.1 -6.7,-7 0,-3.9 3,-7 6.7,-7 3.7,0 6.7,3.1 6.7,7 0,3.8 -3,7 -6.7,7 z"
class="st0" /><path
style="fill:#9e529f"
id="path4698"
d="m 24.075479,-7.6293945e-7 c -0.5,0 -1.8,2.49999996293945 -1.8,3.59999996293945 0,1.5 1,2 1.8,2 0.8,0 1.8,-0.5 1.8,-2 -0.1,-1.1 -1.4,-3.59999996293945 -1.8,-3.59999996293945 z"
class="st1" /></svg>


Binary file not shown (added; 18 KiB)

55
docs/source/autograd.rst Normal file
@@ -0,0 +1,55 @@
.. role:: hidden
:class: hidden-section
Automatic differentiation package - torch.autograd
==================================================
.. automodule:: torch.autograd
.. currentmodule:: torch.autograd
.. autofunction:: backward
.. autofunction:: grad
Variable
--------
API compatibility
^^^^^^^^^^^^^^^^^
The Variable API is nearly the same as the regular Tensor API (with the
exception of a couple of in-place methods that would overwrite inputs required
for gradient computation). In most cases Tensors can be safely replaced with
Variables and the code will continue to work just fine. Because of this,
we're not documenting all the operations on variables, and you should
refer to the :class:`torch.Tensor` docs for this purpose.
In-place operations on Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Supporting in-place operations in autograd is a hard matter, and we discourage
their use in most cases. Autograd's aggressive buffer freeing and reuse makes
it very efficient and there are very few occasions when in-place operations
actually lower memory usage by any significant amount. Unless you're operating
under heavy memory pressure, you might never need to use them.
In-place correctness checks
^^^^^^^^^^^^^^^^^^^^^^^^^^^
All :class:`Variable` s keep track of in-place operations applied to them, and
if the implementation detects that a variable was saved for backward in one of
the functions, but it was modified in-place afterwards, an error will be raised
once the backward pass is started. This ensures that if you're using in-place
functions and not seeing any errors, you can be sure that the computed
gradients are correct.
.. autoclass:: Variable
:members:
:hidden:`Function`
------------------
.. autoclass:: Function
:members:

249
docs/source/conf.py Normal file
@@ -0,0 +1,249 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# PyTorch documentation build configuration file, created by
# sphinx-quickstart on Fri Dec 23 13:31:47 2016.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import torch
try:
import torchvision
except ImportError:
import warnings
warnings.warn('unable to load "torchvision" package')
import sphinx_rtd_theme
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.doctest',
'sphinx.ext.intersphinx',
'sphinx.ext.todo',
'sphinx.ext.coverage',
'sphinx.ext.mathjax',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
]
napoleon_use_ivar = True
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = 'PyTorch'
copyright = '2017, Torch Contributors'
author = 'Torch Contributors'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
# TODO: change to [:2] at v1.0
version = 'master (' + torch.__version__ + ')'
# The full version, including alpha/beta/rc tags.
# TODO: verify this works as expected
release = 'master'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = []
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = True
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
html_theme_options = {
'collapse_navigation': False,
'display_version': True,
'logo_only': True,
}
html_logo = '_static/img/pytorch-logo-dark.svg'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# html_style_path = 'css/pytorch_theme.css'
html_context = {
'css_files': [
'https://fonts.googleapis.com/css?family=Lato',
'_static/css/pytorch_theme.css'
],
}
# -- Options for HTMLHelp output ------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'PyTorchdoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'pytorch.tex', 'PyTorch Documentation',
'Torch Contributors', 'manual'),
]
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'PyTorch', 'PyTorch Documentation',
[author], 1)
]
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'PyTorch', 'PyTorch Documentation',
author, 'PyTorch', 'One line description of project.',
'Miscellaneous'),
]
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {
'python': ('https://docs.python.org/', None),
'numpy': ('http://docs.scipy.org/doc/numpy/', None),
}
# -- A patch that prevents Sphinx from cross-referencing ivar tags -------
# See http://stackoverflow.com/a/41184353/3343043
from docutils import nodes
from sphinx.util.docfields import TypedField
from sphinx import addnodes
def patched_make_field(self, types, domain, items, **kw):
# `kw` catches `env=None` needed for newer sphinx while maintaining
# backwards compatibility when passed along further down!
# type: (List, unicode, Tuple) -> nodes.field
def handle_item(fieldarg, content):
par = nodes.paragraph()
par += addnodes.literal_strong('', fieldarg) # Patch: this line added
# par.extend(self.make_xrefs(self.rolename, domain, fieldarg,
# addnodes.literal_strong))
if fieldarg in types:
par += nodes.Text(' (')
# NOTE: using .pop() here to prevent a single type node to be
# inserted twice into the doctree, which leads to
# inconsistencies later when references are resolved
fieldtype = types.pop(fieldarg)
if len(fieldtype) == 1 and isinstance(fieldtype[0], nodes.Text):
typename = u''.join(n.astext() for n in fieldtype)
typename = typename.replace('int', 'python:int')
typename = typename.replace('long', 'python:long')
typename = typename.replace('float', 'python:float')
typename = typename.replace('type', 'python:type')
par.extend(self.make_xrefs(self.typerolename, domain, typename,
addnodes.literal_emphasis, **kw))
else:
par += fieldtype
par += nodes.Text(')')
par += nodes.Text(' -- ')
par += content
return par
fieldname = nodes.field_name('', self.label)
if len(items) == 1 and self.can_collapse:
fieldarg, content = items[0]
bodynode = handle_item(fieldarg, content)
else:
bodynode = self.list_type()
for fieldarg, content in items:
bodynode += nodes.list_item('', handle_item(fieldarg, content))
fieldbody = nodes.field_body('', bodynode)
return nodes.field('', fieldname, fieldbody)
TypedField.make_field = patched_make_field

34
docs/source/cuda.rst Normal file
@@ -0,0 +1,34 @@
torch.cuda
===================================
.. currentmodule:: torch.cuda
.. automodule:: torch.cuda
:members:
Communication collectives
-------------------------
.. autofunction:: torch.cuda.comm.broadcast
.. autofunction:: torch.cuda.comm.reduce_add
.. autofunction:: torch.cuda.comm.scatter
.. autofunction:: torch.cuda.comm.gather
Streams and events
------------------
.. autoclass:: Stream
:members:
.. autoclass:: Event
:members:
NVIDIA Tools Extension (NVTX)
-----------------------------
.. autofunction:: torch.cuda.nvtx.mark
.. autofunction:: torch.cuda.nvtx.range_push
.. autofunction:: torch.cuda.nvtx.range_pop

13
docs/source/data.rst Normal file
@@ -0,0 +1,13 @@
torch.utils.data
===================================
.. automodule:: torch.utils.data
.. autoclass:: Dataset
.. autoclass:: TensorDataset
.. autoclass:: DataLoader
.. autoclass:: torch.utils.data.sampler.Sampler
.. autoclass:: torch.utils.data.sampler.SequentialSampler
.. autoclass:: torch.utils.data.sampler.RandomSampler
.. autoclass:: torch.utils.data.sampler.SubsetRandomSampler
.. autoclass:: torch.utils.data.sampler.WeightedRandomSampler
.. autoclass:: torch.utils.data.distributed.DistributedSampler

165
docs/source/distributed.rst Normal file
@@ -0,0 +1,165 @@
.. role:: hidden
:class: hidden-section
Distributed communication package - torch.distributed
=====================================================
.. automodule:: torch.distributed
.. currentmodule:: torch.distributed
Currently torch.distributed supports three backends, each with
different capabilities. The table below shows which functions are available
for use with CPU / CUDA tensors.
MPI supports CUDA only if the implementation used to build PyTorch supports it.
+------------+-----------+-----------+-----------+
| Backend | ``tcp`` | ``gloo`` | ``mpi`` |
+------------+-----+-----+-----+-----+-----+-----+
| Device | CPU | GPU | CPU | GPU | CPU | GPU |
+============+=====+=====+=====+=====+=====+=====+
| send | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| recv | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| broadcast | ✓ | ✘ | ✓ | ✓ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| all_reduce | ✓ | ✘ | ✓ | ✓ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| reduce | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| all_gather | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| gather | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| scatter | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| barrier | ✓ | ✘ | ✓ | ✓ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
Initialization
--------------
The package needs to be initialized using the :func:`torch.distributed.init_process_group`
function before calling any other methods.
.. autofunction:: init_process_group
.. autofunction:: get_rank
.. autofunction:: get_world_size
--------------------------------------------------------------------------------
Currently three initialization methods are supported:
TCP initialization
^^^^^^^^^^^^^^^^^^
Initialization will utilize a network address reachable from all processes.
If the address belongs to one of the machines, initialization requires that all processes
have manually specified ranks.
Alternatively, the address has to be a valid IP multicast address, in which case,
ranks can be assigned automatically. Multicast initialization also supports
a ``group_name`` argument, which allows you to use the same address for multiple jobs,
as long as they use different group names.
::
import torch.distributed as dist
# Use address of one of the machines
dist.init_process_group(init_method='tcp://10.1.1.20:23456', rank=args.rank, world_size=4)
# or a multicast address - rank will be assigned automatically if unspecified
dist.init_process_group(init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b0a7]:23456',
world_size=4)
Shared file-system initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Another initialization method makes use of a file system shared and visible from
all machines in a group. The URL should start with ``file://`` and contain a path
to a non-existent file (in an existing directory) on a shared file system.
This initialization method also supports a ``group_name`` argument, which allows you to
use the same shared file path for multiple jobs, as long as they use different
group names.
.. warning::
This method assumes that the file system supports locking using ``fcntl`` - most
local systems and NFS support it.
::
import torch.distributed as dist
# Rank will be assigned automatically if unspecified
dist.init_process_group(init_method='file:///mnt/nfs/sharedfile', world_size=4,
group_name=args.group)
Environment variable initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method will read the configuration from environment variables, allowing
one to fully customize how the information is obtained. The variables to be set
are:
* ``MASTER_PORT`` - required; has to be a free port on the machine with rank 0
* ``MASTER_ADDR`` - required (except for rank 0); address of rank 0 node
* ``WORLD_SIZE`` - required; can be set either here, or in a call to init function
* ``RANK`` - required; can be set either here, or in a call to init function
The machine with rank 0 will be used to set up all connections.
This is the default method, meaning that ``init_method`` does not have to be specified (or
can be ``env://``).
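For example, a minimal sketch, assuming the four variables above have already
been exported in the environment:

::

    import torch.distributed as dist

    # reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment
    dist.init_process_group(init_method='env://')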
Groups
------
By default collectives operate on the default group (also called the world) and
require all processes to enter the distributed function call. However, some workloads can benefit
from more fine-grained communication. This is where distributed groups come
into play. The :func:`~torch.distributed.new_group` function can be
used to create new groups, with arbitrary subsets of all processes. It returns
an opaque group handle that can be given as a ``group`` argument to all collectives
(collectives are distributed functions to exchange information in certain well-known programming patterns).
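For example, a minimal sketch, assuming a world of at least two processes;
every process creates the group handle, but only its members enter the
collective:

::

    import torch
    import torch.distributed as dist

    # ranks 0 and 1 form a subgroup; the returned handle is opaque
    group = dist.new_group([0, 1])
    t = torch.ones(1)
    if dist.get_rank() in (0, 1):
        # sums t across the members of the group only
        dist.all_reduce(t, group=group)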
.. autofunction:: new_group
Point-to-point communication
----------------------------
.. autofunction:: send
.. autofunction:: recv
:func:`~torch.distributed.isend` and :func:`~torch.distributed.irecv`
return distributed request objects when used. In general, the type of this object is unspecified
as they should never be created manually, but they are guaranteed to support two methods:
* ``is_completed()`` - returns True if the operation has finished
* ``wait()`` - will block the process until the operation is finished.
``is_completed()`` is guaranteed to return True once ``wait()`` returns.
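For example, a minimal sketch, assuming a world of exactly two processes:

::

    import torch
    import torch.distributed as dist

    t = torch.zeros(1)
    if dist.get_rank() == 0:
        req = dist.isend(t, dst=1)   # returns a request object immediately
    else:
        req = dist.irecv(t, src=0)
    req.wait()                       # block until the transfer completes
    assert req.is_completed()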
.. autofunction:: isend
.. autofunction:: irecv
Collective functions
--------------------
.. autofunction:: broadcast
.. autofunction:: all_reduce
.. autofunction:: reduce
.. autofunction:: all_gather
.. autofunction:: gather
.. autofunction:: scatter
.. autofunction:: barrier

6
docs/source/ffi.rst Normal file
@@ -0,0 +1,6 @@
torch.utils.ffi
===============
.. currentmodule:: torch.utils.ffi
.. autofunction:: create_extension

56
docs/source/index.rst Normal file
@@ -0,0 +1,56 @@
.. PyTorch documentation master file, created by
sphinx-quickstart on Fri Dec 23 13:31:47 2016.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
:github_url: https://github.com/pytorch/pytorch
PyTorch documentation
===================================
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
.. toctree::
:glob:
:maxdepth: 1
:caption: Notes
notes/*
.. toctree::
:maxdepth: 1
:caption: Package Reference
torch
tensors
sparse
storage
nn
optim
torch.autograd <autograd>
torch.multiprocessing <multiprocessing>
torch.distributed <distributed>
torch.legacy <legacy>
cuda
ffi
data
model_zoo
.. toctree::
:glob:
:maxdepth: 1
:caption: torchvision Reference
torchvision/torchvision
torchvision/datasets
torchvision/models
torchvision/transforms
torchvision/utils
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`

4
docs/source/legacy.rst Normal file
@@ -0,0 +1,4 @@
Legacy package - torch.legacy
===================================
.. automodule:: torch.legacy

@@ -0,0 +1,5 @@
torch.utils.model_zoo
===================================
.. automodule:: torch.utils.model_zoo
.. autofunction:: load_url

@@ -0,0 +1,88 @@
Multiprocessing package - torch.multiprocessing
===============================================
.. automodule:: torch.multiprocessing
.. currentmodule:: torch.multiprocessing
.. warning::
If the main process exits abruptly (e.g. because of an incoming signal),
Python's ``multiprocessing`` sometimes fails to clean up its children.
It's a known caveat, so if you're seeing any resource leaks after
interrupting the interpreter, it probably means that this has just happened
to you.
Strategy management
-------------------
.. autofunction:: get_all_sharing_strategies
.. autofunction:: get_sharing_strategy
.. autofunction:: set_sharing_strategy
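For example, a minimal sketch of inspecting and switching the strategy:

::

    import torch.multiprocessing as mp

    print(mp.get_all_sharing_strategies())  # strategies supported on this OS
    mp.set_sharing_strategy('file_system')
    assert mp.get_sharing_strategy() == 'file_system'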
Sharing CUDA tensors
--------------------
Sharing CUDA tensors between processes is supported only in Python 3, using
the ``spawn`` or ``forkserver`` start methods. :mod:`python:multiprocessing` in
Python 2 can only create subprocesses using ``fork``, which is not supported
by the CUDA runtime.
.. warning::
The CUDA API requires that allocations exported to other processes remain
valid for as long as those processes use them. You should take care to ensure
that shared CUDA tensors don't go out of scope while they are still needed.
This shouldn't be a problem for sharing model parameters, but passing other
kinds of data should be done with care. Note that this restriction doesn't
apply to shared CPU memory.
Sharing strategies
------------------
This section provides a brief overview of how the different sharing strategies
work. Note that it applies only to CPU tensors - CUDA tensors will always use
the CUDA API, as that's the only way they can be shared.
File descriptor - ``file_descriptor``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
This is the default strategy (except for macOS, where it's not supported).
This strategy will use file descriptors as shared memory handles. Whenever a
storage is moved to shared memory, a file descriptor obtained from ``shm_open``
is cached with the object, and when it's going to be sent to another process,
the file descriptor will be transferred (e.g. via UNIX sockets) to it. The
receiver will also cache the file descriptor and ``mmap`` it, to obtain a shared
view onto the storage data.
Note that if many tensors are shared, this strategy will keep a large number
of file descriptors open most of the time. If your system has low limits on
the number of open file descriptors, and you can't raise them, you should use
the ``file_system`` strategy.
File system - ``file_system``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This strategy will use file names given to ``shm_open`` to identify the shared
memory regions. This has the benefit of not requiring the implementation to cache
the file descriptors obtained from it, but at the same time is prone to shared
memory leaks. The file can't be deleted right after its creation, because other
processes need to access it to open their views. If the processes fatally
crash, or are killed, and don't call the storage destructors, the files will
remain in the system. This is very serious, because the files keep using up
memory until the system is restarted or they're freed manually.
To counter the problem of shared memory file leaks, :mod:`torch.multiprocessing`
will spawn a daemon named ``torch_shm_manager`` that will isolate itself from
the current process group, and will keep track of all shared memory allocations.
Once all processes connected to it exit, it will wait a moment to ensure there
will be no new connections, and will iterate over all shared memory files
allocated by the group. If it finds that any of them still exist, they will be
deallocated. We've tested this method and it proved to be robust to various
failures. Still, if your system has high enough limits, and ``file_descriptor``
is a supported strategy, we do not recommend switching to this one.

1082
docs/source/nn.rst Normal file

File diff suppressed because it is too large

@@ -0,0 +1,151 @@
Autograd mechanics
==================
This note will present an overview of how autograd works and records the
operations. It's not strictly necessary to understand all this, but we recommend
getting familiar with it, as it will help you write more efficient, cleaner
programs, and can aid you in debugging.
.. _excluding-subgraphs:
Excluding subgraphs from backward
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Every Variable has two flags: :attr:`requires_grad` and :attr:`volatile`.
They both allow for fine grained exclusion of subgraphs from gradient
computation and can increase efficiency.
.. _excluding-requires_grad:
``requires_grad``
~~~~~~~~~~~~~~~~~
If any single input to an operation requires gradient, its output will also
require gradient. Conversely, the output won't require gradient only if no
input requires it. Backward computation is never performed in subgraphs
where no Variable required gradients.
.. code::
>>> x = Variable(torch.randn(5, 5))
>>> y = Variable(torch.randn(5, 5))
>>> z = Variable(torch.randn(5, 5), requires_grad=True)
>>> a = x + y
>>> a.requires_grad
False
>>> b = a + z
>>> b.requires_grad
True
This is especially useful when you want to freeze part of your model, or you
know in advance that you're not going to use gradients w.r.t. some parameters.
For example if you want to finetune a pretrained CNN, it's enough to switch the
:attr:`requires_grad` flags in the frozen base, and no intermediate buffers will
be saved, until the computation gets to the last layer, where the affine
transform will use weights that require gradient, and the output of the network
will also require them.
.. code::
model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
param.requires_grad = False
# Replace the last fully-connected layer
# Parameters of newly constructed modules have requires_grad=True by default
model.fc = nn.Linear(512, 100)
# Optimize only the classifier
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
``volatile``
~~~~~~~~~~~~
Volatile is recommended for pure inference mode, when you're sure you won't
even be calling ``.backward()``. It's more efficient than any other autograd
setting - it will use the absolute minimal amount of memory to evaluate the
model. ``volatile`` also implies that ``requires_grad`` is ``False``.
Volatile differs from :ref:`excluding-requires_grad` in how the flag propagates.
If there's even a single volatile input to an operation, its output is also
going to be volatile. Volatility spreads across the graph much more easily than
not requiring gradient - you only need a **single** volatile leaf to get a
volatile output, while you need **all** leaves to not require gradient to get
an output that doesn't require gradient. With the volatile flag, you don't
need to change any settings of your model parameters to use the model for
inference. It's enough to create a volatile input, and this will ensure that
no intermediate states are saved.
.. code::
>>> regular_input = Variable(torch.randn(1, 3, 227, 227))
>>> volatile_input = Variable(torch.randn(1, 3, 227, 227), volatile=True)
>>> model = torchvision.models.resnet18(pretrained=True)
>>> model(regular_input).requires_grad
True
>>> model(volatile_input).requires_grad
False
>>> model(volatile_input).volatile
True
>>> model(volatile_input).grad_fn is None
True
How autograd encodes the history
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Autograd is a reverse automatic differentiation system. Conceptually,
autograd records all of the operations that created the data as you execute
them, giving you a directed acyclic graph whose leaves are the input
variables and whose roots are the output variables. By tracing this graph
from roots to leaves, you can automatically compute the gradients using the
chain rule.
Internally, autograd represents this graph as a graph of
:class:`Function` objects (really expressions), which can be
:meth:`~torch.autograd.Function.apply` ed to compute the result of
evaluating the graph. When computing the forwards pass, autograd
simultaneously performs the requested computations and builds up a graph
representing the function that computes the gradient (the ``.grad_fn``
attribute of each :class:`Variable` is an entry point into this graph).
When the forwards pass is completed, we evaluate this graph in the
backwards pass to compute the gradients.
An important thing to note is that the graph is recreated from scratch at every
iteration, and this is exactly what allows for using arbitrary Python control
flow statements, which can change the overall shape and size of the graph at
every iteration. You don't have to encode all possible paths before you
launch the training - what you run is what you differentiate.
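For example, a small sketch of inspecting the recorded graph; the exact
``grad_fn`` object you see may vary between versions:

.. code::

    >>> x = Variable(torch.ones(2, 2), requires_grad=True)
    >>> y = (x + 2).sum()
    >>> y.grad_fn is not None  # an entry point into the graph built above
    True
    >>> y.backward()           # traverses the graph from root to leaves
    >>> x.grad                 # now a 2x2 Variable of ones (dy/dx)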
In-place operations on Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Supporting in-place operations in autograd is a hard matter, and we discourage
their use in most cases. Autograd's aggressive buffer freeing and reuse makes
it very efficient and there are very few occasions when in-place operations
actually lower memory usage by any significant amount. Unless you're operating
under heavy memory pressure, you might never need to use them.
There are two main reasons that limit the applicability of in-place operations:
1. Overwriting values required to compute gradients. This is why variables don't
support ``log_``. Its gradient formula requires the original input, and while
it is possible to recreate it by computing the inverse operation, it is
numerically unstable, and requires additional work that often defeats the
purpose of using these functions.
2. Every in-place operation actually requires the implementation to rewrite the
computational graph. Out-of-place versions simply allocate new objects and
keep references to the old graph, while in-place operations require
changing the creator of all inputs to the :class:`Function` representing
this operation. This can be tricky, especially if there are many Variables
that reference the same storage (e.g. created by indexing or transposing),
and in-place functions will actually raise an error if the storage of
modified inputs is referenced by any other :class:`Variable`.
In-place correctness checks
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Every variable keeps a version counter that is incremented every time it's
marked dirty in any operation. When a Function saves any tensors for backward,
the version counter of their containing Variable is saved as well. Once you
access ``self.saved_tensors``, the counter is checked, and if it's greater
than the saved value, an error is raised.
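For example, a small sketch of the check in action; ``exp`` saves its output
for backward, so modifying that output in-place trips the version check (the
exact error text may differ between versions):

.. code::

    >>> x = Variable(torch.randn(3), requires_grad=True)
    >>> y = x.exp()  # exp saves its output to compute the gradient
    >>> y.add_(1)    # marks y dirty and bumps its version counter
    >>> y.backward(torch.ones(3))  # raises a RuntimeError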

@@ -0,0 +1,113 @@
.. _broadcasting-semantics:
Broadcasting semantics
======================
Many PyTorch operations support :any:`NumPy Broadcasting Semantics <numpy.doc.broadcasting>`.
In short, if a PyTorch operation supports broadcast, then its Tensor arguments can be
automatically expanded to be of equal sizes (without making copies of the data).
General semantics
-----------------
Two tensors are "broadcastable" if the following rules hold:
- Each tensor has at least one dimension.
- When iterating over the dimension sizes, starting at the trailing dimension,
the dimension sizes must either be equal, one of them must be 1, or one of
them must not exist.
For Example::
>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)
>>> x=torch.FloatTensor()
>>> y=torch.FloatTensor(2,2)
# x and y are not broadcastable, because x does not have at least 1 dimension
# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist
# but:
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3
If two tensors :attr:`x`, :attr:`y` are "broadcastable", the resulting tensor size
is calculated as follows:
- If the number of dimensions of :attr:`x` and :attr:`y` are not equal, prepend 1
to the dimensions of the tensor with fewer dimensions to make them equal length.
- Then, for each dimension size, the resulting dimension size is the max of the sizes of
:attr:`x` and :attr:`y` along that dimension.
For Example::
# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor( 3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])
# but not necessary:
>>> x=torch.FloatTensor(1)
>>> y=torch.FloatTensor(3,1,7)
>>> (x+y).size()
torch.Size([3, 1, 7])
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
In-place semantics
------------------
One complication is that in-place operations do not allow the in-place tensor to change shape
as a result of the broadcast.
For Example::
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor(3,1,1)
>>> (x.add_(y)).size()
torch.Size([5, 3, 4, 1])
# but:
>>> x=torch.FloatTensor(1,3,1)
>>> y=torch.FloatTensor(3,1,7)
>>> (x.add_(y)).size()
RuntimeError: The expanded size of the tensor (1) must match the existing size (7) at non-singleton dimension 2.
Backwards compatibility
-----------------------
Prior versions of PyTorch allowed certain pointwise functions to execute on tensors with different shapes,
as long as the number of elements in each tensor was equal. The pointwise operation would then be carried
out by viewing each tensor as 1-dimensional. PyTorch now supports broadcasting and the "1-dimensional"
pointwise behavior is considered deprecated and will generate a Python warning in cases where tensors are
not broadcastable, but have the same number of elements.
Note that the introduction of broadcasting can cause backwards incompatible changes in the case where
two tensors do not have the same shape, but are broadcastable and have the same number of elements.
For Example::
>>> torch.add(torch.ones(4,1), torch.randn(4))
would previously produce a Tensor with size: torch.Size([4,1]), but now produces a Tensor with size: torch.Size([4,4]).
In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist,
you may set `torch.utils.backcompat.broadcast_warning.enabled` to `True`, which will generate a Python warning
in such cases.
For Example::
>>> torch.utils.backcompat.broadcast_warning.enabled=True
>>> torch.add(torch.ones(4,1), torch.ones(4))
__main__:1: UserWarning: self and other do not have the same shape, but are broadcastable, and have the same number of elements.
Changing behavior in a backwards incompatible manner to broadcasting rather than viewing as 1-dimensional.


@ -0,0 +1,83 @@
.. _cuda-semantics:
CUDA semantics
==============
:mod:`torch.cuda` keeps track of the currently selected GPU, and all CUDA tensors
you allocate will be created on it. The selected device can be changed with a
:any:`torch.cuda.device` context manager.
However, once a tensor is allocated, you can perform operations on it regardless
of the selected device, and the results will always be placed on the same
device as the tensor.
Cross-GPU operations are not allowed by default, with the only exception of
:meth:`~torch.Tensor.copy_`. Unless you enable peer-to-peer memory accesses,
any attempts to launch ops on tensors spread across different devices will
raise an error.
Below you can find a small example showcasing this::
x = torch.cuda.FloatTensor(1)
# x.get_device() == 0
y = torch.FloatTensor(1).cuda()
# y.get_device() == 0

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.cuda.FloatTensor(1)

    # transfers a tensor from CPU to GPU 1
    b = torch.FloatTensor(1).cuda()
    # a.get_device() == b.get_device() == 1

    c = a + b
    # c.get_device() == 1

    z = x + y
    # z.get_device() == 0

    # even within a context, you can give a GPU id to the .cuda call
    d = torch.randn(2).cuda(2)
    # d.get_device() == 2
Best practices
--------------
Use pinned memory buffers
^^^^^^^^^^^^^^^^^^^^^^^^^
.. warning::
This is an advanced tip. Overuse of pinned memory can cause serious
problems when you're running low on RAM, and you should be aware that
pinning is often an expensive operation.
Host to GPU copies are much faster when they originate from pinned (page-locked)
memory. CPU tensors and storages expose a :meth:`~torch.Tensor.pin_memory`
method that returns a copy of the object, with its data placed in a pinned region.
Also, once you pin a tensor or storage, you can use asynchronous GPU copies:
just pass an additional ``async=True`` argument to the :meth:`~torch.Tensor.cuda`
call. This can be used to overlap data transfers with computation.
You can make the :class:`~torch.utils.data.DataLoader` return batches placed in
pinned memory by passing ``pin_memory=True`` to its constructor.
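Putting these together, here is a minimal sketch of the pattern (the tensor
shapes are illustrative, and ``async=True`` is the asynchronous-copy flag
described above)::

import torch
from torch.utils.data import TensorDataset, DataLoader

# Pin a tensor explicitly, then copy it to the GPU asynchronously.
x = torch.randn(64, 128).pin_memory()
x_gpu = x.cuda(async=True)  # can overlap with subsequent CPU work

# Or let the DataLoader hand out batches already placed in pinned memory.
dataset = TensorDataset(torch.randn(1000, 128), torch.randn(1000, 1))
loader = DataLoader(dataset, batch_size=64, pin_memory=True)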
.. _cuda-nn-dataparallel-instead:
Use nn.DataParallel instead of multiprocessing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Most use cases involving batched input and multiple GPUs should default to using
:class:`~torch.nn.DataParallel` to utilize more than one GPU. Even with the GIL,
a single Python process can saturate multiple GPUs.
As of version 0.1.9, large numbers of GPUs (8+) might not be fully utilized.
However, this is a known issue that is under active development. As always,
test your use case.
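A typical, illustrative pattern is to wrap the model once and leave the rest of
the training loop unchanged; ``MyModel`` and ``input`` here are placeholders::

import torch
import torch.nn as nn

model = MyModel()
if torch.cuda.device_count() > 1:
    # splits each input batch across the available GPUs
    model = nn.DataParallel(model)
model = model.cuda()
output = model(input)  # inputs are scattered, outputs gathered on GPU 0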
There are significant caveats to using CUDA models with
:mod:`~torch.multiprocessing`; unless care is taken to meet the data handling
requirements exactly, it is likely that your program will have incorrect or
undefined behavior.


@ -0,0 +1,181 @@
Extending PyTorch
=================
In this note we'll cover ways of extending :mod:`torch.nn`,
:mod:`torch.autograd`, and writing custom C extensions utilizing our C
libraries.
Extending :mod:`torch.autograd`
-------------------------------
.. currentmodule:: torch.autograd
Adding operations to :mod:`~torch.autograd` requires implementing a new
:class:`Function` subclass for each operation. Recall that :class:`Function` s
are what :mod:`~torch.autograd` uses to compute the results and gradients, and
encode the operation history. Every new function requires you to implement two
methods:
- :meth:`~Function.forward` - the code that performs the operation. It can take
as many arguments as you want, with some of them being optional if you
specify default values. All kinds of Python objects are accepted here.
:class:`Variable` arguments will be converted to :class:`Tensor` s before the
call, and their use will be registered in the graph. Note that this logic won't
traverse lists/dicts/any other data structures and will only consider Variables
that are direct arguments to the call. You can return either a single
:class:`Tensor` output, or a :class:`tuple` of :class:`Tensor` s if there are
multiple outputs. Also, please refer to the docs of :class:`Function` to find
descriptions of useful methods that can be called only from :meth:`~Function.forward`.
- :meth:`~Function.backward` - gradient formula. It will be given
as many :class:`Variable` arguments as there were outputs, with each of them
representing gradient w.r.t. that output. It should return as many
:class:`Variable` s as there were inputs, with each of them containing the
gradient w.r.t. its corresponding input. If your inputs didn't require
gradient (see :attr:`~Variable.needs_input_grad`), or were non-:class:`Variable`
objects, you can return :class:`python:None`. Also, if you have optional
arguments to :meth:`~Function.forward` you can return more gradients than there
were inputs, as long as they're all :any:`python:None`.
Below you can find code for a ``Linear`` function from :mod:`torch.nn`, with
additional comments::
# Inherit from Function
class Linear(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_variables and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_variables
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and are there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0).squeeze(0)

        return grad_input, grad_weight, grad_bias
Now, to make it easier to use these custom ops, we recommend aliasing their
``apply`` method::
linear = Linear.apply
Here, we give an additional example of a function that is parametrized by
non-Variable arguments::
class MulConstant(Function):

    @staticmethod
    def forward(ctx, tensor, constant):
        # ctx is a context object that can be used to stash information
        # for backward computation
        ctx.constant = constant
        return tensor * constant

    @staticmethod
    def backward(ctx, grad_output):
        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None
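As a quick, illustrative sanity check of the op defined above (the input size
is arbitrary)::

from torch.autograd import Variable

x = Variable(torch.randn(3), requires_grad=True)
y = MulConstant.apply(x, 5.5)
y.backward(torch.ones(3))
# x.grad now holds 5.5 in every position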
You probably want to check whether the backward method you implemented actually
computes the derivatives of your function. You can verify it by comparing against
numerical approximations using small finite differences::
from torch.autograd import gradcheck

# gradcheck takes a tuple of tensors as input, checks if your gradient
# evaluated with these tensors is close enough to numerical
# approximations, and returns True if they all verify this condition.
input = (Variable(torch.randn(20, 20).double(), requires_grad=True),
         Variable(torch.randn(30, 20).double(), requires_grad=True))
test = gradcheck(Linear.apply, input, eps=1e-6, atol=1e-4)
print(test)
Extending :mod:`torch.nn`
-------------------------
.. currentmodule:: torch.nn
:mod:`~torch.nn` exports two kinds of interfaces - modules and their functional
versions. You can extend it in both ways, but we recommend using modules for
all kinds of layers that hold any parameters or buffers, and using
a functional form for parameter-less operations like activation functions,
pooling, etc.
Adding a functional version of an operation is already fully covered in the
section above.
Adding a :class:`Module`
^^^^^^^^^^^^^^^^^^^^^^^^
Since :mod:`~torch.nn` heavily utilizes :mod:`~torch.autograd`, adding a new
:class:`Module` requires implementing a :class:`~torch.autograd.Function`
that performs the operation and can compute the gradient. From now on let's
assume that we want to implement a ``Linear`` module, and that we have the function
implemented as in the listing above. There's very little code required to
add this. Now, there are two functions that need to be implemented:
- ``__init__`` (*optional*) - takes in arguments such as kernel sizes, numbers
of features, etc. and initializes parameters and buffers.
- :meth:`~Module.forward` - applies the :class:`~torch.autograd.Function`
defined above to perform the operation. It's very similar to the functional
wrapper shown above.
This is how a ``Linear`` module can be implemented::
class Linear(nn.Module):
    def __init__(self, input_features, output_features, bias=True):
        super(Linear, self).__init__()
        self.input_features = input_features
        self.output_features = output_features

        # nn.Parameter is a special kind of Variable, that will get
        # automatically registered as Module's parameter once it's assigned
        # as an attribute. Parameters and buffers need to be registered, or
        # they won't appear in .parameters() (doesn't apply to buffers), and
        # won't be converted when e.g. .cuda() is called. You can use
        # .register_buffer() to register buffers.
        # nn.Parameters can never be volatile and, unlike Variables,
        # they require gradients by default.
        self.weight = nn.Parameter(torch.Tensor(output_features, input_features))
        if bias:
            self.bias = nn.Parameter(torch.Tensor(output_features))
        else:
            # You should always register all possible parameters, but the
            # optional ones can be None if you want.
            self.register_parameter('bias', None)

        # Not a very smart way to initialize weights
        self.weight.data.uniform_(-0.1, 0.1)
        if self.bias is not None:
            self.bias.data.uniform_(-0.1, 0.1)

    def forward(self, input):
        # See the autograd section for explanation of what happens here.
        # ``linear`` is the ``Linear.apply`` alias defined earlier.
        return linear(input, self.weight, self.bias)
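A quick, illustrative smoke test of the module (the sizes are arbitrary)::

module = Linear(3, 5)
input = Variable(torch.randn(2, 3), requires_grad=True)
output = module(input)
print(output.size())  # torch.Size([2, 5])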
Writing custom C extensions
---------------------------
Coming soon. For now you can find an example at
`GitHub <https://github.com/pytorch/extension-ffi>`_.


@ -0,0 +1,124 @@
Multiprocessing best practices
==============================
:mod:`torch.multiprocessing` is a drop-in replacement for Python's
:mod:`python:multiprocessing` module. It supports the exact same operations,
but extends it, so that all tensors sent through a
:class:`python:multiprocessing.Queue` will have their data moved into shared
memory, and only a handle will be sent to the other process.
.. note::
When a :class:`~torch.autograd.Variable` is sent to another process, both
the :attr:`Variable.data` and :attr:`Variable.grad.data` are going to be
shared.
This makes it possible to implement various training methods, like Hogwild, A3C, or
any others that require asynchronous operation.
Sharing CUDA tensors
--------------------
Sharing CUDA tensors between processes is supported only in Python 3, using
the ``spawn`` or ``forkserver`` start methods. :mod:`python:multiprocessing` in
Python 2 can only create subprocesses using ``fork``, which is not supported
by the CUDA runtime.
.. warning::
The CUDA API requires that allocations exported to other processes remain
valid as long as they're used by them. You should take care to ensure that
CUDA tensors you share don't go out of scope for as long as they're needed.
This shouldn't be a problem for sharing model parameters, but passing other
kinds of data should be done with care. Note that this restriction doesn't
apply to shared CPU memory.
See also: :ref:`cuda-nn-dataparallel-instead`
Best practices and tips
-----------------------
Avoiding and fighting deadlocks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are a lot of things that can go wrong when a new process is spawned, with
the most common cause of deadlocks being background threads. If there's any
thread that holds a lock or imports a module, and ``fork`` is called, it's very
likely that the subprocess will be in a corrupted state and will deadlock or
fail in a different way. Note that even if you don't do this, Python's built-in
libraries do - no need to look further than :mod:`python:multiprocessing`.
:class:`python:multiprocessing.Queue` is actually a very complex class that
spawns multiple threads used to serialize, send and receive objects, and they
can cause the aforementioned problems too. If you find yourself in such a
situation, try using a :class:`~python:multiprocessing.queues.SimpleQueue`,
which doesn't use any additional threads.
We're trying our best to make it easy for you and to ensure these deadlocks don't
happen, but some things are out of our control. If you have any issues you can't
cope with for a while, try reaching out on the forums, and we'll see if it's an
issue we can fix.
Reuse buffers passed through a Queue
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Remember that each time you put a :class:`~torch.Tensor` into a
:class:`python:multiprocessing.Queue`, it has to be moved into shared memory.
If it's already shared, it is a no-op; otherwise it will incur an additional
memory copy that can slow down the whole process. Even if you have a pool of
processes sending data to a single one, make it send the buffers back - this
is nearly free and will let you avoid a copy when sending the next batch.
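A minimal, illustrative sketch of this round-trip pattern (the queue names and
buffer sizes are placeholders)::

import torch
import torch.multiprocessing as mp

def worker(inq, outq):
    while True:
        buf = inq.get()         # receives a handle to shared memory
        if buf is None:
            break
        buf.normal_()           # fill the buffer in place with new data
        outq.put(buf)           # the buffer is already shared: no copy

if __name__ == '__main__':
    inq, outq = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(inq, outq))
    p.start()
    buf = torch.zeros(64, 128)  # moved into shared memory on first put
    inq.put(buf)
    batch = outq.get()          # reuse the same buffer for the next batch
    inq.put(None)
    p.join()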
Asynchronous multiprocess training (e.g. Hogwild)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Using :mod:`torch.multiprocessing`, it is possible to train a model
asynchronously, with parameters either shared all the time, or being
periodically synchronized. In the former case, we recommend sending over the whole
model object, while in the latter, we advise sending only the
:meth:`~torch.nn.Module.state_dict`.
We recommend using :class:`python:multiprocessing.Queue` for passing all kinds
of PyTorch objects between processes. It is possible, e.g., to inherit tensors
and storages that are already in shared memory when using the ``fork`` start
method; however, this is very bug-prone and should be used with care, and only
by advanced users. Queues, even though they're sometimes a less elegant
solution, will work properly in all cases.
.. warning::
You should be careful about having global statements that are not guarded
with an ``if __name__ == '__main__'`` check. If a start method other than
``fork`` is used, they will be executed in all subprocesses.
Hogwild
~~~~~~~
A concrete Hogwild implementation can be found in the `examples repository`__,
but to showcase the overall structure of the code, there's also a minimal
example below::
import torch.multiprocessing as mp
from model import MyModel

def train(model):
    # Construct data_loader, optimizer, etc.
    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters

if __name__ == '__main__':
    num_processes = 4
    model = MyModel()
    # NOTE: this is required for the ``fork`` method to work
    model.share_memory()
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
.. __: https://github.com/pytorch/examples/tree/master/mnist_hogwild


@ -0,0 +1,34 @@
Serialization semantics
=======================
Best practices
--------------
.. _recommend-saving-models:
Recommended approach for saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are two main approaches for serializing and restoring a model.
The first (recommended) saves and loads only the model parameters::
torch.save(the_model.state_dict(), PATH)
Then later::
the_model = TheModelClass(*args, **kwargs)
the_model.load_state_dict(torch.load(PATH))
The second saves and loads the entire model::
torch.save(the_model, PATH)
Then later::
the_model = torch.load(PATH)
However, in this case, the serialized data is bound to the specific classes
and the exact directory structure used, so it can break in various ways when
used in other projects or after some serious refactors.

docs/source/optim.rst Normal file

@ -0,0 +1,134 @@
torch.optim
===================================
.. automodule:: torch.optim
How to use an optimizer
-----------------------
To use :mod:`torch.optim` you have to construct an optimizer object that will hold
the current state and will update the parameters based on the computed gradients.
Constructing it
^^^^^^^^^^^^^^^
To construct an :class:`Optimizer` you have to give it an iterable containing the
parameters (all should be :class:`~torch.autograd.Variable` s) to optimize. Then,
you can specify optimizer-specific options such as the learning rate, weight decay, etc.
Example::
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)
Per-parameter options
^^^^^^^^^^^^^^^^^^^^^
:class:`Optimizer` s also support specifying per-parameter options. To do this, instead
of passing an iterable of :class:`~torch.autograd.Variable` s, pass in an iterable of
:class:`dict` s. Each of them will define a separate parameter group, and should contain
a ``params`` key, containing a list of parameters belonging to it. Other keys
should match the keyword arguments accepted by the optimizers, and will be used
as optimization options for this group.
.. note::
You can still pass options as keyword arguments. They will be used as
defaults, in the groups that didn't override them. This is useful when you
only want to vary a single option, while keeping all others consistent
between parameter groups.
For example, this is very useful when one wants to specify per-layer learning rates::
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
This means that ``model.base``'s parameters will use the default learning rate of ``1e-2``,
``model.classifier``'s parameters will use a learning rate of ``1e-3``, and a momentum of
``0.9`` will be used for all parameters.
Taking an optimization step
^^^^^^^^^^^^^^^^^^^^^^^^^^^
All optimizers implement a :func:`~Optimizer.step` method that updates the
parameters. It can be used in two ways:
``optimizer.step()``
~~~~~~~~~~~~~~~~~~~~
This is a simplified version supported by most optimizers. The function can be
called once the gradients are computed using e.g.
:func:`~torch.autograd.Variable.backward`.
Example::
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
``optimizer.step(closure)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some optimization algorithms such as Conjugate Gradient and LBFGS need to
reevaluate the function multiple times, so you have to pass in a closure that
allows them to recompute your model. The closure should clear the gradients,
compute the loss, and return it.
Example::
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
Algorithms
----------
.. autoclass:: Optimizer
:members:
.. autoclass:: Adadelta
:members:
.. autoclass:: Adagrad
:members:
.. autoclass:: Adam
:members:
.. autoclass:: Adamax
:members:
.. autoclass:: ASGD
:members:
.. autoclass:: LBFGS
:members:
.. autoclass:: RMSprop
:members:
.. autoclass:: Rprop
:members:
.. autoclass:: SGD
:members:
How to adjust Learning Rate
---------------------------
:mod:`torch.optim.lr_scheduler` provides several methods to adjust the learning
rate based on the number of epochs. :class:`torch.optim.lr_scheduler.ReduceLROnPlateau`
allows dynamic learning rate reduction based on some validation measurements.
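An illustrative sketch of both styles, assuming an ``optimizer`` has already
been constructed as above (``validate`` is a hypothetical helper)::

from torch.optim import lr_scheduler

# Decay the learning rate by a factor of 0.1 every 30 epochs.
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(100):
    scheduler.step()
    # ... train for one epoch ...

# Or reduce the learning rate once a validation metric stops improving.
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=10)
for epoch in range(100):
    # ... train for one epoch ...
    val_loss = validate()  # hypothetical helper returning a float
    scheduler.step(val_loss)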
.. autoclass:: torch.optim.lr_scheduler.LambdaLR
:members:
.. autoclass:: torch.optim.lr_scheduler.StepLR
:members:
.. autoclass:: torch.optim.lr_scheduler.MultiStepLR
:members:
.. autoclass:: torch.optim.lr_scheduler.ExponentialLR
:members:
.. autoclass:: torch.optim.lr_scheduler.ReduceLROnPlateau
:members:

docs/source/sparse.rst Normal file

@ -0,0 +1,114 @@
.. currentmodule:: torch.sparse
torch.sparse
============
.. warning::
This API is currently experimental and may change in the near future.
Torch supports sparse tensors in COO(rdinate) format, which can
efficiently store and process tensors for which the majority of elements
are zeros.
A sparse tensor is represented as a pair of dense tensors: a tensor
of values and a tensor of indices. A sparse tensor can be constructed
by providing these two tensors, as well as the size of the sparse tensor
(which cannot be inferred from these tensors!)
>>> i = torch.LongTensor([[0, 1], [2, 0]])
>>> v = torch.FloatTensor([3, 4])
>>> torch.sparse.FloatTensor(i, v, torch.Size([2,3])).to_dense()
0 0 3
4 0 0
[torch.FloatTensor of size 2x3]
You can also construct hybrid sparse tensors, where only the first n
dimensions are sparse, and the rest of the dimensions are dense.
>>> i = torch.LongTensor([[2, 4]])
>>> v = torch.FloatTensor([[1, 3], [5, 7]])
>>> torch.sparse.FloatTensor(i, v).to_dense()
0 0
0 0
1 3
0 0
5 7
[torch.FloatTensor of size 5x2]
An empty sparse tensor can be constructed by specifying its size:
>>> torch.sparse.FloatTensor(2, 3)
SparseFloatTensor of size 2x3 with indices:
[torch.LongTensor with no dimension]
and values:
[torch.FloatTensor with no dimension]
.. note::
Our sparse tensor format permits *uncoalesced* sparse tensors, where
there may be duplicate coordinates in the indices; in this case,
the interpretation is that the value at that index is the sum of all
duplicate value entries. Uncoalesced tensors permit us to implement
certain operators more efficiently.
For the most part, you shouldn't have to care whether or not a
sparse tensor is coalesced, as most operations will work
identically given a coalesced or uncoalesced sparse tensor.
However, there are two cases in which you may need to care.
First, if you repeatedly perform an operation that can produce
duplicate entries (e.g., :func:`torch.sparse.FloatTensor.add`), you
should occasionally coalesce your sparse tensors to prevent
them from growing too large.
Second, some operators will produce different values depending on
whether or not they are coalesced (e.g.,
:func:`torch.sparse.FloatTensor._values` and
:func:`torch.sparse.FloatTensor._indices`, as well as
:func:`torch.Tensor._sparse_mask`). These operators are
prefixed by an underscore to indicate that they reveal internal
implementation details and should be used with care, since code
that works with coalesced sparse tensors may not work with
uncoalesced sparse tensors; generally speaking, it is safest
to explicitly coalesce before working with these operators.
For example, suppose that we wanted to implement an operator
by operating directly on :func:`torch.sparse.FloatTensor._values`.
Multiplication by a scalar can be implemented in the obvious way,
as multiplication distributes over addition; however, square root
cannot be implemented directly, since ``sqrt(a + b) != sqrt(a) +
sqrt(b)`` (which is what would be computed if you were given an
uncoalesced tensor).
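For example, here is an illustrative snippet showing duplicate entries at the
same coordinate being summed when the tensor is coalesced:
>>> i = torch.LongTensor([[0, 0], [2, 2]])
>>> v = torch.FloatTensor([3, 4])
>>> s = torch.sparse.FloatTensor(i, v, torch.Size([2, 3]))
>>> s.coalesce().to_dense()
 0  0  7
 0  0  0
[torch.FloatTensor of size 2x3]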
.. class:: FloatTensor()
.. method:: add
.. method:: add_
.. method:: clone
.. method:: dim
.. method:: div
.. method:: div_
.. method:: get_device
.. method:: hspmm
.. method:: mm
.. method:: mul
.. method:: mul_
.. method:: resizeAs_
.. method:: size
.. method:: spadd
.. method:: spmm
.. method:: sspaddmm
.. method:: sspmm
.. method:: sub
.. method:: sub_
.. method:: t_
.. method:: toDense
.. method:: transpose
.. method:: transpose_
.. method:: zero_
.. method:: coalesce
.. method:: is_coalesced
.. method:: _indices
.. method:: _values
.. method:: _nnz

docs/source/storage.rst Normal file

@ -0,0 +1,12 @@
torch.Storage
===================================
A :class:`torch.Storage` is a contiguous, one-dimensional array of a single
data type.
Every :class:`torch.Tensor` has a corresponding storage of the same data type.
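For example, an illustrative look at the storage behind a small tensor (the
exact print format may vary between versions):
>>> x = torch.FloatTensor([[1, 2], [3, 4]])
>>> x.storage()
 1.0
 2.0
 3.0
 4.0
[torch.FloatStorage of size 4]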
.. autoclass:: torch.FloatStorage
:members:
:undoc-members:
:inherited-members:

docs/source/tensors.rst Normal file

@ -0,0 +1,309 @@
.. currentmodule:: torch
torch.Tensor
===================================
A :class:`torch.Tensor` is a multi-dimensional matrix containing elements of
a single data type.
Torch defines eight CPU tensor types and eight GPU tensor types:
======================== =========================== ================================
Data type CPU tensor GPU tensor
======================== =========================== ================================
32-bit floating point :class:`torch.FloatTensor` :class:`torch.cuda.FloatTensor`
64-bit floating point :class:`torch.DoubleTensor` :class:`torch.cuda.DoubleTensor`
16-bit floating point :class:`torch.HalfTensor` :class:`torch.cuda.HalfTensor`
8-bit integer (unsigned) :class:`torch.ByteTensor` :class:`torch.cuda.ByteTensor`
8-bit integer (signed) :class:`torch.CharTensor` :class:`torch.cuda.CharTensor`
16-bit integer (signed) :class:`torch.ShortTensor` :class:`torch.cuda.ShortTensor`
32-bit integer (signed) :class:`torch.IntTensor` :class:`torch.cuda.IntTensor`
64-bit integer (signed) :class:`torch.LongTensor` :class:`torch.cuda.LongTensor`
======================== =========================== ================================
The :class:`torch.Tensor` constructor is an alias for the default tensor type
(:class:`torch.FloatTensor`).
A tensor can be constructed from a Python :class:`list` or sequence:
::
>>> torch.FloatTensor([[1, 2, 3], [4, 5, 6]])
1 2 3
4 5 6
[torch.FloatTensor of size 2x3]
An empty tensor can be constructed by specifying its size:
::
>>> torch.IntTensor(2, 4).zero_()
0 0 0 0
0 0 0 0
[torch.IntTensor of size 2x4]
The contents of a tensor can be accessed and modified using Python's indexing
and slicing notation:
::
>>> x = torch.FloatTensor([[1, 2, 3], [4, 5, 6]])
>>> print(x[1][2])
6.0
>>> x[0][1] = 8
>>> print(x)
1 8 3
4 5 6
[torch.FloatTensor of size 2x3]
Each tensor has an associated :class:`torch.Storage`, which holds its data.
The tensor class provides a multi-dimensional, `strided <https://en.wikipedia.org/wiki/Stride_of_an_array>`_
view of a storage and defines numeric operations on it.
.. note::
Methods which mutate a tensor are marked with an underscore suffix.
For example, :func:`torch.FloatTensor.abs_` computes the absolute value
in-place and returns the modified tensor, while :func:`torch.FloatTensor.abs`
computes the result in a new tensor.
.. class:: Tensor()
Tensor(*sizes)
Tensor(size)
Tensor(sequence)
Tensor(ndarray)
Tensor(tensor)
Tensor(storage)
Creates a new tensor from an optional size or data.
If no arguments are given, an empty zero-dimensional tensor is returned.
If a :class:`numpy.ndarray`, :class:`torch.Tensor`, or :class:`torch.Storage`
is given, a new tensor that shares the same data is returned. If a Python
sequence is given, a new tensor is created from a copy of the sequence.
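For example, relying on the data-sharing behavior described above (illustrative,
and requires NumPy):
>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype=np.float32)
>>> t = torch.FloatTensor(a)  # shares data with ``a``
>>> t[0] = 10
>>> a[0]
10.0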
.. automethod:: abs
.. automethod:: abs_
.. automethod:: acos
.. automethod:: acos_
.. automethod:: add
.. automethod:: add_
.. automethod:: addbmm
.. automethod:: addbmm_
.. automethod:: addcdiv
.. automethod:: addcdiv_
.. automethod:: addcmul
.. automethod:: addcmul_
.. automethod:: addmm
.. automethod:: addmm_
.. automethod:: addmv
.. automethod:: addmv_
.. automethod:: addr
.. automethod:: addr_
.. automethod:: apply_
.. automethod:: asin
.. automethod:: asin_
.. automethod:: atan
.. automethod:: atan2
.. automethod:: atan2_
.. automethod:: atan_
.. automethod:: baddbmm
.. automethod:: baddbmm_
.. automethod:: bernoulli
.. automethod:: bernoulli_
.. automethod:: bmm
.. automethod:: byte
.. automethod:: cauchy_
.. automethod:: ceil
.. automethod:: ceil_
.. automethod:: char
.. automethod:: chunk
.. automethod:: clamp
.. automethod:: clamp_
.. automethod:: clone
.. automethod:: contiguous
.. automethod:: copy_
.. automethod:: cos
.. automethod:: cos_
.. automethod:: cosh
.. automethod:: cosh_
.. automethod:: cpu
.. automethod:: cross
.. automethod:: cuda
.. automethod:: cumprod
.. automethod:: cumsum
.. automethod:: data_ptr
.. automethod:: diag
.. automethod:: dim
.. automethod:: dist
.. automethod:: div
.. automethod:: div_
.. automethod:: dot
.. automethod:: double
.. automethod:: eig
.. automethod:: element_size
.. automethod:: eq
.. automethod:: eq_
.. automethod:: equal
.. automethod:: exp
.. automethod:: exp_
.. automethod:: expand
.. automethod:: expand_as
.. automethod:: exponential_
.. automethod:: fill_
.. automethod:: float
.. automethod:: floor
.. automethod:: floor_
.. automethod:: fmod
.. automethod:: fmod_
.. automethod:: frac
.. automethod:: frac_
.. automethod:: gather
.. automethod:: ge
.. automethod:: ge_
.. automethod:: gels
.. automethod:: geometric_
.. automethod:: geqrf
.. automethod:: ger
.. automethod:: gesv
.. automethod:: gt
.. automethod:: gt_
.. automethod:: half
.. automethod:: histc
.. automethod:: index
.. automethod:: index_add_
.. automethod:: index_copy_
.. automethod:: index_fill_
.. automethod:: index_select
.. automethod:: int
.. automethod:: inverse
.. automethod:: is_contiguous
.. autoattribute:: is_cuda
:annotation:
.. automethod:: is_pinned
.. automethod:: is_set_to
.. automethod:: is_signed
.. automethod:: kthvalue
.. automethod:: le
.. automethod:: le_
.. automethod:: lerp
.. automethod:: lerp_
.. automethod:: log
.. automethod:: log1p
.. automethod:: log1p_
.. automethod:: log_
.. automethod:: log_normal_
.. automethod:: long
.. automethod:: lt
.. automethod:: lt_
.. automethod:: map_
.. automethod:: masked_scatter_
.. automethod:: masked_fill_
.. automethod:: masked_select
.. automethod:: matmul
.. automethod:: max
.. automethod:: mean
.. automethod:: median
.. automethod:: min
.. automethod:: mm
.. automethod:: mode
.. automethod:: mul
.. automethod:: mul_
.. automethod:: multinomial
.. automethod:: mv
.. automethod:: narrow
.. automethod:: ndimension
.. automethod:: ne
.. automethod:: ne_
.. automethod:: neg
.. automethod:: neg_
.. automethod:: nelement
.. automethod:: new
.. automethod:: nonzero
.. automethod:: norm
.. automethod:: normal_
.. automethod:: numel
.. automethod:: numpy
.. automethod:: orgqr
.. automethod:: ormqr
.. automethod:: permute
.. automethod:: pin_memory
.. automethod:: potrf
.. automethod:: potri
.. automethod:: potrs
.. automethod:: pow
.. automethod:: pow_
.. automethod:: prod
.. automethod:: pstrf
.. automethod:: qr
.. automethod:: random_
.. automethod:: reciprocal
.. automethod:: reciprocal_
.. automethod:: remainder
.. automethod:: remainder_
.. automethod:: renorm
.. automethod:: renorm_
.. automethod:: repeat
.. automethod:: resize_
.. automethod:: resize_as_
.. automethod:: round
.. automethod:: round_
.. automethod:: rsqrt
.. automethod:: rsqrt_
.. automethod:: scatter_
.. automethod:: select
.. automethod:: set_
.. automethod:: share_memory_
.. automethod:: short
.. automethod:: sigmoid
.. automethod:: sigmoid_
.. automethod:: sign
.. automethod:: sign_
.. automethod:: sin
.. automethod:: sin_
.. automethod:: sinh
.. automethod:: sinh_
.. automethod:: size
.. automethod:: sort
.. automethod:: split
.. automethod:: sqrt
.. automethod:: sqrt_
.. automethod:: squeeze
.. automethod:: squeeze_
.. automethod:: std
.. automethod:: storage
.. automethod:: storage_offset
.. automethod:: storage_type
.. automethod:: stride
.. automethod:: sub
.. automethod:: sub_
.. automethod:: sum
.. automethod:: svd
.. automethod:: symeig
.. automethod:: t
.. automethod:: t_
.. automethod:: tan
.. automethod:: tan_
.. automethod:: tanh
.. automethod:: tanh_
.. automethod:: tolist
.. automethod:: topk
.. automethod:: trace
.. automethod:: transpose
.. automethod:: transpose_
.. automethod:: tril
.. automethod:: tril_
.. automethod:: triu
.. automethod:: triu_
.. automethod:: trtrs
.. automethod:: trunc
.. automethod:: trunc_
.. automethod:: type
.. automethod:: type_as
.. automethod:: unfold
.. automethod:: uniform_
.. automethod:: unsqueeze
.. automethod:: unsqueeze_
.. automethod:: var
.. automethod:: view
.. automethod:: view_as
.. automethod:: zero_

docs/source/torch.rst Normal file

@ -0,0 +1,186 @@
torch
===================================
.. automodule:: torch
Tensors
----------------------------------
.. autofunction:: is_tensor
.. autofunction:: is_storage
.. autofunction:: set_default_tensor_type
.. autofunction:: numel
.. autofunction:: set_printoptions
Creation Ops
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: eye
.. autofunction:: from_numpy
.. autofunction:: linspace
.. autofunction:: logspace
.. autofunction:: ones
.. autofunction:: rand
.. autofunction:: randn
.. autofunction:: randperm
.. autofunction:: arange
.. autofunction:: range
.. autofunction:: zeros
Indexing, Slicing, Joining, Mutating Ops
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cat
.. autofunction:: chunk
.. autofunction:: gather
.. autofunction:: index_select
.. autofunction:: masked_select
.. autofunction:: nonzero
.. autofunction:: split
.. autofunction:: squeeze
.. autofunction:: stack
.. autofunction:: t
.. autofunction:: transpose
.. autofunction:: unbind
.. autofunction:: unsqueeze
Random sampling
----------------------------------
.. autofunction:: manual_seed
.. autofunction:: initial_seed
.. autofunction:: get_rng_state
.. autofunction:: set_rng_state
.. autodata:: default_generator
.. autofunction:: bernoulli
.. autofunction:: multinomial
.. autofunction:: normal
Serialization
----------------------------------
.. autofunction:: save
.. autofunction:: load
Parallelism
----------------------------------
.. autofunction:: get_num_threads
.. autofunction:: set_num_threads
Math operations
----------------------------------
Pointwise Ops
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: abs
.. autofunction:: acos
.. autofunction:: add
.. autofunction:: addcdiv
.. autofunction:: addcmul
.. autofunction:: asin
.. autofunction:: atan
.. autofunction:: atan2
.. autofunction:: ceil
.. autofunction:: clamp
.. autofunction:: cos
.. autofunction:: cosh
.. autofunction:: div
.. autofunction:: exp
.. autofunction:: floor
.. autofunction:: fmod
.. autofunction:: frac
.. autofunction:: lerp
.. autofunction:: log
.. autofunction:: log1p
.. autofunction:: mul
.. autofunction:: neg
.. autofunction:: pow
.. autofunction:: reciprocal
.. autofunction:: remainder
.. autofunction:: round
.. autofunction:: rsqrt
.. autofunction:: sigmoid
.. autofunction:: sign
.. autofunction:: sin
.. autofunction:: sinh
.. autofunction:: sqrt
.. autofunction:: tan
.. autofunction:: tanh
.. autofunction:: trunc
Reduction Ops
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cumprod
.. autofunction:: cumsum
.. autofunction:: dist
.. autofunction:: mean
.. autofunction:: median
.. autofunction:: mode
.. autofunction:: norm
.. autofunction:: prod
.. autofunction:: std
.. autofunction:: sum
.. autofunction:: var
Comparison Ops
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: eq
.. autofunction:: equal
.. autofunction:: ge
.. autofunction:: gt
.. autofunction:: kthvalue
.. autofunction:: le
.. autofunction:: lt
.. autofunction:: max
.. autofunction:: min
.. autofunction:: ne
.. autofunction:: sort
.. autofunction:: topk
Other Operations
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cross
.. autofunction:: diag
.. autofunction:: histc
.. autofunction:: renorm
.. autofunction:: trace
.. autofunction:: tril
.. autofunction:: triu
BLAS and LAPACK Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: addbmm
.. autofunction:: addmm
.. autofunction:: addmv
.. autofunction:: addr
.. autofunction:: baddbmm
.. autofunction:: bmm
.. autofunction:: btrifact
.. autofunction:: btrisolve
.. autofunction:: dot
.. autofunction:: eig
.. autofunction:: gels
.. autofunction:: geqrf
.. autofunction:: ger
.. autofunction:: gesv
.. autofunction:: inverse
.. autofunction:: matmul
.. autofunction:: mm
.. autofunction:: mv
.. autofunction:: orgqr
.. autofunction:: ormqr
.. autofunction:: potrf
.. autofunction:: potri
.. autofunction:: potrs
.. autofunction:: pstrf
.. autofunction:: qr
.. autofunction:: svd
.. autofunction:: symeig
.. autofunction:: trtrs


@ -0,0 +1,112 @@
torchvision.datasets
====================
All datasets are subclasses of :class:`torch.utils.data.Dataset`,
i.e., they have ``__getitem__`` and ``__len__`` methods implemented.
Hence, they can all be passed to a :class:`torch.utils.data.DataLoader`,
which can load multiple samples in parallel using ``torch.multiprocessing`` workers.
For example: ::
imagenet_data = torchvision.datasets.ImageFolder('path/to/imagenet_root/')
data_loader = torch.utils.data.DataLoader(imagenet_data,
                                          batch_size=4,
                                          shuffle=True,
                                          num_workers=args.nThreads)
The following datasets are available:
.. contents:: Datasets
:local:
All of the datasets have a similar API. They all have two common arguments:
``transform`` and ``target_transform``, to transform the input and target respectively.
.. currentmodule:: torchvision.datasets
MNIST
~~~~~
.. autoclass:: MNIST
COCO
~~~~
.. note ::
These require the `COCO API to be installed`_
.. _COCO API to be installed: https://github.com/pdollar/coco/tree/master/PythonAPI
Captions
^^^^^^^^
.. autoclass:: CocoCaptions
:members: __getitem__
:special-members:
Detection
^^^^^^^^^
.. autoclass:: CocoDetection
:members: __getitem__
:special-members:
LSUN
~~~~
.. autoclass:: LSUN
:members: __getitem__
:special-members:
ImageFolder
~~~~~~~~~~~
.. autoclass:: ImageFolder
:members: __getitem__
:special-members:
Imagenet-12
~~~~~~~~~~~
This should simply be implemented with an ``ImageFolder`` dataset.
The data is preprocessed `as described
here <https://github.com/facebook/fb.resnet.torch/blob/master/INSTALL.md#download-the-imagenet-dataset>`__.
`Here is an
example <https://github.com/pytorch/examples/blob/27e2a46c1d1505324032b1d94fc6ce24d5b67e97/imagenet/main.py#L48-L62>`__.
CIFAR
~~~~~
.. autoclass:: CIFAR10
:members: __getitem__
:special-members:
STL10
~~~~~
.. autoclass:: STL10
:members: __getitem__
:special-members:
SVHN
~~~~~
.. autoclass:: SVHN
:members: __getitem__
:special-members:
PhotoTour
~~~~~~~~~
.. autoclass:: PhotoTour
:members: __getitem__
:special-members:


@ -0,0 +1,12 @@
torchvision.models
===================
.. currentmodule:: torchvision.models
.. automodule:: torchvision.models
:members: alexnet, resnet18, resnet34, resnet50, resnet101, resnet152,
vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19,
vgg19_bn, inception_v3, squeezenet1_0, squeezenet1_1, densenet121,
densenet169, densenet201, densenet161
:undoc-members:


@ -0,0 +1,8 @@
torchvision
===================
The :mod:`torchvision` package consists of popular datasets, model
architectures, and common image transformations for computer vision.
.. automodule:: torchvision
:members:


@ -0,0 +1,48 @@
torchvision.transforms
======================
.. currentmodule:: torchvision.transforms
Transforms are common image transforms. They can be chained together using :class:`Compose`.
.. autoclass:: Compose
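For example, a typical (illustrative) preprocessing pipeline might chain
several of the transforms documented below::

import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Scale(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])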
Transforms on PIL.Image
-----------------------
.. autoclass:: Scale
.. autoclass:: CenterCrop
.. autoclass:: RandomCrop
.. autoclass:: RandomHorizontalFlip
.. autoclass:: RandomSizedCrop
.. autoclass:: Pad
Transforms on torch.\*Tensor
----------------------------
.. autoclass:: Normalize
:members: __call__
:special-members:
Conversion Transforms
---------------------
.. autoclass:: ToTensor
:members: __call__
:special-members:
.. autoclass:: ToPILImage
:members: __call__
:special-members:
Generic Transforms
------------------
.. autoclass:: Lambda


@ -0,0 +1,9 @@
torchvision.utils
===================
.. currentmodule:: torchvision.utils
.. autofunction:: make_grid
.. autofunction:: save_image

requirements.txt Normal file

@ -0,0 +1 @@
pyyaml

setup.py Normal file

@ -0,0 +1,499 @@
from setuptools import setup, Extension, distutils, Command, find_packages
import setuptools.command.build_ext
import setuptools.command.install
import setuptools.command.develop
import setuptools.command.build_py
import distutils.unixccompiler
import distutils.command.build
import distutils.command.clean
import platform
import subprocess
import shutil
import sys
import os
from tools.setup_helpers.env import check_env_flag
from tools.setup_helpers.cuda import WITH_CUDA, CUDA_HOME
from tools.setup_helpers.cudnn import WITH_CUDNN, CUDNN_LIB_DIR, CUDNN_INCLUDE_DIR
from tools.setup_helpers.split_types import split_types
DEBUG = check_env_flag('DEBUG')
WITH_DISTRIBUTED = not check_env_flag('NO_DISTRIBUTED')
WITH_DISTRIBUTED_MW = WITH_DISTRIBUTED and check_env_flag('WITH_DISTRIBUTED_MW')
WITH_NCCL = WITH_CUDA and platform.system() != 'Darwin'
SYSTEM_NCCL = False
################################################################################
# Workaround setuptools -Wstrict-prototypes warnings
# I lifted this code from https://stackoverflow.com/a/29634231/23845
################################################################################
import distutils.sysconfig
cfg_vars = distutils.sysconfig.get_config_vars()
for key, value in cfg_vars.items():
if type(value) == str:
cfg_vars[key] = value.replace("-Wstrict-prototypes", "")
################################################################################
# Monkey-patch setuptools to compile in parallel
################################################################################
original_link = distutils.unixccompiler.UnixCCompiler.link
def parallelCCompile(self, sources, output_dir=None, macros=None,
include_dirs=None, debug=0, extra_preargs=None,
extra_postargs=None, depends=None):
# those lines are copied from distutils.ccompiler.CCompiler directly
macros, objects, extra_postargs, pp_opts, build = self._setup_compile(
output_dir, macros, include_dirs, sources, depends, extra_postargs)
cc_args = self._get_cc_args(pp_opts, debug, extra_preargs)
# compile using a thread pool
import multiprocessing.pool
def _single_compile(obj):
src, ext = build[obj]
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
num_jobs = multiprocessing.cpu_count()
multiprocessing.pool.ThreadPool(num_jobs).map(_single_compile, objects)
return objects
def patched_link(self, *args, **kwargs):
_cxx = self.compiler_cxx
self.compiler_cxx = None
result = original_link(self, *args, **kwargs)
self.compiler_cxx = _cxx
return result
distutils.ccompiler.CCompiler.compile = parallelCCompile
distutils.unixccompiler.UnixCCompiler.link = patched_link
################################################################################
# Custom build commands
################################################################################
class build_deps(Command):
user_options = []
def initialize_options(self):
pass
def finalize_options(self):
pass
def run(self):
from tools.nnwrap import generate_wrappers as generate_nn_wrappers
build_all_cmd = ['bash', 'torch/lib/build_all.sh']
if WITH_CUDA:
build_all_cmd += ['--with-cuda']
if WITH_NCCL and not SYSTEM_NCCL:
build_all_cmd += ['--with-nccl']
if WITH_DISTRIBUTED:
build_all_cmd += ['--with-distributed']
if subprocess.call(build_all_cmd) != 0:
sys.exit(1)
generate_nn_wrappers()
class build_module(Command):
user_options = []
def initialize_options(self):
pass
def finalize_options(self):
pass
def run(self):
self.run_command('build_py')
self.run_command('build_ext')
class build_py(setuptools.command.build_py.build_py):
def run(self):
self.create_version_file()
setuptools.command.build_py.build_py.run(self)
@staticmethod
def create_version_file():
global version, cwd
print('-- Building version ' + version)
version_path = os.path.join(cwd, 'torch', 'version.py')
with open(version_path, 'w') as f:
f.write("__version__ = '{}'\n".format(version))
class develop(setuptools.command.develop.develop):
def run(self):
build_py.create_version_file()
setuptools.command.develop.develop.run(self)
class build_ext(setuptools.command.build_ext.build_ext):
def run(self):
# Print build options
if WITH_NUMPY:
print('-- Building with NumPy bindings')
else:
print('-- NumPy not found')
if WITH_CUDNN:
print('-- Detected cuDNN at ' + CUDNN_LIB_DIR + ', ' + CUDNN_INCLUDE_DIR)
else:
print('-- Not using cuDNN')
if WITH_CUDA:
print('-- Detected CUDA at ' + CUDA_HOME)
else:
print('-- Not using CUDA')
if WITH_NCCL and SYSTEM_NCCL:
print('-- Using system provided NCCL library')
elif WITH_NCCL:
print('-- Building NCCL library')
else:
print('-- Not using NCCL')
if WITH_DISTRIBUTED:
print('-- Building with distributed package ')
else:
print('-- Building without distributed package')
# cwrap depends on pyyaml, so we can't import it earlier
from tools.cwrap import cwrap
from tools.cwrap.plugins.THPPlugin import THPPlugin
from tools.cwrap.plugins.ArgcountSortPlugin import ArgcountSortPlugin
from tools.cwrap.plugins.AutoGPU import AutoGPU
from tools.cwrap.plugins.BoolOption import BoolOption
from tools.cwrap.plugins.KwargsPlugin import KwargsPlugin
from tools.cwrap.plugins.NullableArguments import NullableArguments
from tools.cwrap.plugins.CuDNNPlugin import CuDNNPlugin
from tools.cwrap.plugins.WrapDim import WrapDim
from tools.cwrap.plugins.AssertNDim import AssertNDim
from tools.cwrap.plugins.Broadcast import Broadcast
from tools.cwrap.plugins.ProcessorSpecificPlugin import ProcessorSpecificPlugin
thp_plugin = THPPlugin()
cwrap('torch/csrc/generic/TensorMethods.cwrap', plugins=[
ProcessorSpecificPlugin(), BoolOption(), thp_plugin,
AutoGPU(condition='IS_CUDA'), ArgcountSortPlugin(), KwargsPlugin(),
AssertNDim(), WrapDim(), Broadcast()
])
cwrap('torch/csrc/cudnn/cuDNN.cwrap', plugins=[
CuDNNPlugin(), NullableArguments()
])
# It's an old-style class in Python 2.7...
setuptools.command.build_ext.build_ext.run(self)
class build(distutils.command.build.build):
sub_commands = [
('build_deps', lambda self: True),
] + distutils.command.build.build.sub_commands
class install(setuptools.command.install.install):
def run(self):
if not self.skip_build:
self.run_command('build_deps')
setuptools.command.install.install.run(self)
class clean(distutils.command.clean.clean):
def run(self):
import glob
with open('.gitignore', 'r') as f:
ignores = f.read()
for wildcard in filter(bool, ignores.split('\n')):
for filename in glob.glob(wildcard):
try:
os.remove(filename)
except OSError:
shutil.rmtree(filename, ignore_errors=True)
# It's an old-style class in Python 2.7...
distutils.command.clean.clean.run(self)
################################################################################
# Configure compile flags
################################################################################
include_dirs = []
library_dirs = []
extra_link_args = []
extra_compile_args = ['-std=c++11', '-Wno-write-strings',
# Python 2.6 requires -fno-strict-aliasing, see
# http://legacy.python.org/dev/peps/pep-3123/
'-fno-strict-aliasing']
cwd = os.path.dirname(os.path.abspath(__file__))
lib_path = os.path.join(cwd, "torch", "lib")
tmp_install_path = lib_path + "/tmp_install"
include_dirs += [
cwd,
os.path.join(cwd, "torch", "csrc"),
tmp_install_path + "/include",
tmp_install_path + "/include/TH",
tmp_install_path + "/include/THPP",
tmp_install_path + "/include/THNN",
tmp_install_path + "/include/ATen",
]
library_dirs.append(lib_path)
# we specify exact lib names to avoid conflict with lua-torch installs
TH_LIB = os.path.join(lib_path, 'libTH.so.1')
THS_LIB = os.path.join(lib_path, 'libTHS.so.1')
THC_LIB = os.path.join(lib_path, 'libTHC.so.1')
THCS_LIB = os.path.join(lib_path, 'libTHCS.so.1')
THNN_LIB = os.path.join(lib_path, 'libTHNN.so.1')
THCUNN_LIB = os.path.join(lib_path, 'libTHCUNN.so.1')
THPP_LIB = os.path.join(lib_path, 'libTHPP.so.1')
ATEN_LIB = os.path.join(lib_path, 'libATen.so.1')
GLOO_LIB = os.path.join(lib_path, 'libgloo.a')
GLOO_CUDA_LIB = os.path.join(lib_path, 'libgloo_cuda.a')
THD_LIB = os.path.join(lib_path, 'libTHD.a')
NCCL_LIB = os.path.join(lib_path, 'libnccl.so.1')
if platform.system() == 'Darwin':
TH_LIB = os.path.join(lib_path, 'libTH.1.dylib')
THS_LIB = os.path.join(lib_path, 'libTHS.1.dylib')
THC_LIB = os.path.join(lib_path, 'libTHC.1.dylib')
THCS_LIB = os.path.join(lib_path, 'libTHCS.1.dylib')
THNN_LIB = os.path.join(lib_path, 'libTHNN.1.dylib')
THCUNN_LIB = os.path.join(lib_path, 'libTHCUNN.1.dylib')
THPP_LIB = os.path.join(lib_path, 'libTHPP.1.dylib')
ATEN_LIB = os.path.join(lib_path, 'libATen.1.dylib')
NCCL_LIB = os.path.join(lib_path, 'libnccl.1.dylib')
if WITH_NCCL and subprocess.call('ldconfig -p | grep libnccl >/dev/null', shell=True) == 0:
SYSTEM_NCCL = True
main_compile_args = ['-D_THP_CORE']
main_libraries = ['shm']
main_link_args = [TH_LIB, THS_LIB, THPP_LIB, THNN_LIB, ATEN_LIB]
main_sources = [
"torch/csrc/PtrWrapper.cpp",
"torch/csrc/Module.cpp",
"torch/csrc/Generator.cpp",
"torch/csrc/Size.cpp",
"torch/csrc/Exceptions.cpp",
"torch/csrc/Storage.cpp",
"torch/csrc/DynamicTypes.cpp",
"torch/csrc/byte_order.cpp",
"torch/csrc/utils.cpp",
"torch/csrc/expand_utils.cpp",
"torch/csrc/utils/object_ptr.cpp",
"torch/csrc/utils/tuple_parser.cpp",
"torch/csrc/allocators.cpp",
"torch/csrc/serialization.cpp",
"torch/csrc/autograd/init.cpp",
"torch/csrc/autograd/engine.cpp",
"torch/csrc/autograd/function.cpp",
"torch/csrc/autograd/variable.cpp",
"torch/csrc/autograd/input_buffer.cpp",
"torch/csrc/autograd/python_function.cpp",
"torch/csrc/autograd/python_cpp_function.cpp",
"torch/csrc/autograd/python_variable.cpp",
"torch/csrc/autograd/python_engine.cpp",
"torch/csrc/autograd/python_hook.cpp",
"torch/csrc/autograd/functions/batch_normalization.cpp",
"torch/csrc/autograd/functions/convolution.cpp",
"torch/csrc/autograd/functions/basic_ops.cpp",
"torch/csrc/autograd/functions/tensor.cpp",
"torch/csrc/autograd/functions/accumulate_grad.cpp",
"torch/csrc/autograd/functions/utils.cpp",
"torch/csrc/autograd/functions/init.cpp",
"torch/csrc/nn/THNN_generic.cpp",
]
main_sources += split_types("torch/csrc/Tensor.cpp")
try:
import numpy as np
include_dirs += [np.get_include()]
extra_compile_args += ['-DWITH_NUMPY']
WITH_NUMPY = True
except ImportError:
WITH_NUMPY = False
if WITH_DISTRIBUTED:
extra_compile_args += ['-DWITH_DISTRIBUTED']
main_sources += [
"torch/csrc/distributed/Module.cpp",
"torch/csrc/distributed/utils.cpp",
]
if WITH_DISTRIBUTED_MW:
main_sources += [
"torch/csrc/distributed/Tensor.cpp",
"torch/csrc/distributed/Storage.cpp",
]
extra_compile_args += ['-DWITH_DISTRIBUTED_MW']
include_dirs += [tmp_install_path + "/include/THD"]
main_link_args += [THD_LIB]
if platform.system() == 'Linux':
main_link_args += [GLOO_LIB]
if WITH_CUDA:
cuda_lib_dirs = ['lib64', 'lib']
cuda_include_path = os.path.join(CUDA_HOME, 'include')
for lib_dir in cuda_lib_dirs:
cuda_lib_path = os.path.join(CUDA_HOME, lib_dir)
if os.path.exists(cuda_lib_path):
break
include_dirs.append(cuda_include_path)
include_dirs.append(tmp_install_path + "/include/THCUNN")
library_dirs.append(cuda_lib_path)
extra_link_args.append('-Wl,-rpath,' + cuda_lib_path)
extra_compile_args += ['-DWITH_CUDA']
extra_compile_args += ['-DCUDA_LIB_PATH=' + cuda_lib_path]
main_libraries += ['cudart', 'nvToolsExt']
main_link_args += [THC_LIB, THCS_LIB, THCUNN_LIB]
if platform.system() == 'Linux':
main_link_args += [GLOO_CUDA_LIB]
main_sources += [
"torch/csrc/cuda/Module.cpp",
"torch/csrc/cuda/Storage.cpp",
"torch/csrc/cuda/Stream.cpp",
"torch/csrc/cuda/AutoGPU.cpp",
"torch/csrc/cuda/utils.cpp",
"torch/csrc/cuda/expand_utils.cpp",
"torch/csrc/cuda/serialization.cpp",
]
main_sources += split_types("torch/csrc/cuda/Tensor.cpp")
if WITH_NCCL:
if SYSTEM_NCCL:
main_libraries += ['nccl']
else:
main_link_args += [NCCL_LIB]
extra_compile_args += ['-DWITH_NCCL']
if WITH_CUDNN:
main_libraries += ['cudnn']
include_dirs.append(CUDNN_INCLUDE_DIR)
library_dirs.append(CUDNN_LIB_DIR)
main_sources += [
"torch/csrc/cudnn/BatchNorm.cpp",
"torch/csrc/cudnn/Conv.cpp",
"torch/csrc/cudnn/cuDNN.cpp",
"torch/csrc/cudnn/GridSampler.cpp",
"torch/csrc/cudnn/AffineGridGenerator.cpp",
"torch/csrc/cudnn/Types.cpp",
"torch/csrc/cudnn/Handles.cpp",
]
extra_compile_args += ['-DWITH_CUDNN']
if DEBUG:
extra_compile_args += ['-O0', '-g']
extra_link_args += ['-O0', '-g']
if os.getenv('PYTORCH_BINARY_BUILD') and platform.system() == 'Linux':
print('PYTORCH_BINARY_BUILD found. Static linking libstdc++ on Linux')
# get path of libstdc++ and link manually.
# for reasons unknown, -static-libstdc++ doesn't fully link some symbols
CXXNAME = os.getenv('CXX', 'g++')
STDCPP_LIB = subprocess.check_output([CXXNAME, '-print-file-name=libstdc++.a'])
STDCPP_LIB = STDCPP_LIB[:-1]
if type(STDCPP_LIB) != str: # python 3
STDCPP_LIB = STDCPP_LIB.decode(sys.stdout.encoding)
main_link_args += [STDCPP_LIB]
version_script = os.path.abspath("tools/pytorch.version")
extra_link_args += ['-Wl,--version-script=' + version_script]
def make_relative_rpath(path):
if platform.system() == 'Darwin':
return '-Wl,-rpath,@loader_path/' + path
else:
return '-Wl,-rpath,$ORIGIN/' + path
################################################################################
# Declare extensions and package
################################################################################
extensions = []
packages = find_packages(exclude=('tools', 'tools.*',))
C = Extension("torch._C",
libraries=main_libraries,
sources=main_sources,
language='c++',
extra_compile_args=main_compile_args + extra_compile_args,
include_dirs=include_dirs,
library_dirs=library_dirs,
extra_link_args=extra_link_args + main_link_args + [make_relative_rpath('lib')],
)
extensions.append(C)
DL = Extension("torch._dl",
sources=["torch/csrc/dl.c"],
language='c',
)
extensions.append(DL)
THNN = Extension("torch._thnn._THNN",
sources=['torch/csrc/nn/THNN.cpp'],
language='c++',
extra_compile_args=extra_compile_args,
include_dirs=include_dirs,
extra_link_args=extra_link_args + [
TH_LIB,
THNN_LIB,
make_relative_rpath('../lib'),
]
)
extensions.append(THNN)
if WITH_CUDA:
THCUNN = Extension("torch._thnn._THCUNN",
sources=['torch/csrc/nn/THCUNN.cpp'],
language='c++',
extra_compile_args=extra_compile_args,
include_dirs=include_dirs,
extra_link_args=extra_link_args + [
TH_LIB,
THC_LIB,
THCUNN_LIB,
make_relative_rpath('../lib'),
]
)
extensions.append(THCUNN)
version = '0.2.0'
if os.getenv('PYTORCH_BUILD_VERSION'):
assert os.getenv('PYTORCH_BUILD_NUMBER') is not None
version = os.getenv('PYTORCH_BUILD_VERSION') \
+ '_' + os.getenv('PYTORCH_BUILD_NUMBER')
else:
try:
sha = subprocess.check_output(['git', 'rev-parse', 'HEAD'], cwd=cwd).decode('ascii').strip()
version += '+' + sha[:7]
except subprocess.CalledProcessError:
pass
setup(name="torch", version=version,
description="Tensors and Dynamic neural networks in Python with strong GPU acceleration",
ext_modules=extensions,
cmdclass={
'build': build,
'build_py': build_py,
'build_ext': build_ext,
'build_deps': build_deps,
'build_module': build_module,
'develop': develop,
'install': install,
'clean': clean,
},
packages=packages,
package_data={'torch': [
'lib/*.so*', 'lib/*.dylib*',
'lib/torch_shm_manager',
'lib/*.h',
'lib/include/TH/*.h', 'lib/include/TH/generic/*.h',
'lib/include/THC/*.h', 'lib/include/THC/generic/*.h']},
install_requires=['pyyaml', 'numpy'],
)

test/common.py Normal file

@ -0,0 +1,291 @@
import sys
import os
import argparse
import unittest
import warnings
import contextlib
from functools import wraps
from itertools import product
from copy import deepcopy
import torch
import torch.cuda
from torch.autograd import Variable
torch.set_default_tensor_type('torch.DoubleTensor')
SEED = 0
SEED_SET = 0
def parse_set_seed_once():
global SEED
global SEED_SET
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('--seed', type=int, default=123)
args, remaining = parser.parse_known_args()
if SEED_SET == 0:
torch.manual_seed(args.seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(args.seed)
SEED = args.seed
SEED_SET = 1
remaining = [sys.argv[0]] + remaining
return remaining
def run_tests():
remaining = parse_set_seed_once()
unittest.main(argv=remaining)
TEST_NUMPY = True
try:
import numpy
except ImportError:
TEST_NUMPY = False
TEST_SCIPY = True
try:
import scipy
except ImportError:
TEST_SCIPY = False
def skipIfNoLapack(fn):
@wraps(fn)
def wrapper(*args, **kwargs):
try:
fn(*args, **kwargs)
except Exception as e:
if 'Lapack library not found' in e.args[0]:
raise unittest.SkipTest('Compiled without Lapack')
raise
return wrapper
def suppress_warnings(fn):
def wrapper(*args, **kwargs):
with warnings.catch_warnings():
warnings.simplefilter("ignore")
fn(*args, **kwargs)
return wrapper
def get_cpu_type(t):
assert t.__module__ == 'torch.cuda'
return getattr(torch, t.__class__.__name__)
def get_gpu_type(t):
assert t.__module__ == 'torch'
return getattr(torch.cuda, t.__name__)
def to_gpu(obj, type_map={}):
if torch.is_tensor(obj):
t = type_map.get(type(obj), get_gpu_type(type(obj)))
return obj.clone().type(t)
elif torch.is_storage(obj):
return obj.new().resize_(obj.size()).copy_(obj)
elif isinstance(obj, Variable):
assert obj.is_leaf
t = type_map.get(type(obj.data), get_gpu_type(type(obj.data)))
return Variable(obj.data.clone().type(t), requires_grad=obj.requires_grad)
elif isinstance(obj, list):
return [to_gpu(o, type_map) for o in obj]
elif isinstance(obj, tuple):
return tuple(to_gpu(o, type_map) for o in obj)
else:
return deepcopy(obj)
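# Snapshot the CPU (and, if available, CUDA) RNG state on entry and restore
# it on exit, so the wrapped code cannot perturb random number generation in
# subsequent tests.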
@contextlib.contextmanager
def freeze_rng_state():
rng_state = torch.get_rng_state()
if torch.cuda.is_available():
cuda_rng_state = torch.cuda.get_rng_state()
yield
if torch.cuda.is_available():
torch.cuda.set_rng_state(cuda_rng_state)
torch.set_rng_state(rng_state)
def iter_indices(tensor):
if tensor.dim() == 0:
return range(0)
if tensor.dim() == 1:
return range(tensor.size(0))
return product(*(range(s) for s in tensor.size()))
def is_iterable(obj):
try:
iter(obj)
return True
except Exception:
return False
class TestCase(unittest.TestCase):
precision = 1e-5
def setUp(self):
torch.manual_seed(SEED)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(SEED)
def assertTensorsSlowEqual(self, x, y, prec=None, message=''):
max_err = 0
self.assertEqual(x.size(), y.size())
for index in iter_indices(x):
max_err = max(max_err, abs(x[index] - y[index]))
self.assertLessEqual(max_err, prec, message)
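# Build a reference coalesced sparse tensor by summing values at duplicate
# indices in Python, and check that it agrees with t.coalesce().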
def safeCoalesce(self, t):
tc = t.coalesce()
value_map = {}
for idx, val in zip(t._indices().t(), t._values()):
idx_tup = tuple(idx)
if idx_tup in value_map:
value_map[idx_tup] += val
else:
value_map[idx_tup] = val.clone() if torch.is_tensor(val) else val
new_indices = sorted(list(value_map.keys()))
new_values = [value_map[idx] for idx in new_indices]
if t._values().ndimension() < 2:
new_values = t._values().new(new_values)
else:
new_values = torch.stack(new_values)
new_indices = t._indices().new(new_indices).t()
tg = t.new(new_indices, new_values, t.size())
self.assertEqual(tc._indices(), tg._indices())
self.assertEqual(tc._values(), tg._values())
return tg
def unwrapVariables(self, x, y):
if isinstance(x, Variable) and isinstance(y, Variable):
return x.data, y.data
elif isinstance(x, Variable) or isinstance(y, Variable):
raise AssertionError("cannot compare {} and {}".format(type(x), type(y)))
return x, y
def assertEqual(self, x, y, prec=None, message=''):
if prec is None:
prec = self.precision
x, y = self.unwrapVariables(x, y)
if torch.is_tensor(x) and torch.is_tensor(y):
def assertTensorsEqual(a, b):
super(TestCase, self).assertEqual(a.size(), b.size())
if a.numel() > 0:
b = b.type_as(a)
b = b.cuda(device=a.get_device()) if a.is_cuda else b.cpu()
# check that NaNs are in the same locations
nan_mask = a != a
self.assertTrue(torch.equal(nan_mask, b != b))
diff = a - b
diff[nan_mask] = 0
if diff.is_signed():
diff = diff.abs()
max_err = diff.max()
self.assertLessEqual(max_err, prec, message)
self.assertEqual(x.is_sparse, y.is_sparse, message)
if x.is_sparse:
x = self.safeCoalesce(x)
y = self.safeCoalesce(y)
assertTensorsEqual(x._indices(), y._indices())
assertTensorsEqual(x._values(), y._values())
else:
assertTensorsEqual(x, y)
elif type(x) == str and type(y) == str:
super(TestCase, self).assertEqual(x, y)
elif type(x) == set and type(y) == set:
super(TestCase, self).assertEqual(x, y)
elif is_iterable(x) and is_iterable(y):
super(TestCase, self).assertEqual(len(x), len(y))
for x_, y_ in zip(x, y):
self.assertEqual(x_, y_, prec, message)
else:
try:
self.assertLessEqual(abs(x - y), prec, message)
return
except Exception:
pass
super(TestCase, self).assertEqual(x, y, message)
def assertNotEqual(self, x, y, prec=None, message=''):
if prec is None:
prec = self.precision
x, y = self.unwrapVariables(x, y)
if torch.is_tensor(x) and torch.is_tensor(y):
if x.size() != y.size():
super(TestCase, self).assertNotEqual(x.size(), y.size())
self.assertGreater(x.numel(), 0)
y = y.type_as(x)
y = y.cuda(device=x.get_device()) if x.is_cuda else y.cpu()
nan_mask = x != x
if torch.equal(nan_mask, y != y):
diff = x - y
if diff.is_signed():
diff = diff.abs()
diff[nan_mask] = 0
max_err = diff.max()
self.assertGreaterEqual(max_err, prec, message)
elif type(x) == str and type(y) == str:
super(TestCase, self).assertNotEqual(x, y)
elif is_iterable(x) and is_iterable(y):
super(TestCase, self).assertNotEqual(x, y)
else:
try:
self.assertGreaterEqual(abs(x - y), prec, message)
return
except Exception:
pass
super(TestCase, self).assertNotEqual(x, y, message)
def assertObjectIn(self, obj, iterable):
for elem in iterable:
if id(obj) == id(elem):
return
raise AssertionError("object not found in iterable")
if sys.version_info < (3, 2):
# assertRaisesRegexp renamed assertRaisesRegex in 3.2
assertRaisesRegex = unittest.TestCase.assertRaisesRegexp
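# Download a test fixture into test/data, reusing it if already present; on
# network failure the test is skipped rather than failed. Handles both the
# Python 2 and Python 3 urllib APIs.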
def download_file(url, binary=True):
if sys.version_info < (3,):
from urlparse import urlsplit
import urllib2
request = urllib2
error = urllib2
else:
from urllib.parse import urlsplit
from urllib import request, error
filename = os.path.basename(urlsplit(url)[2])
data_dir = os.path.join(os.path.dirname(__file__), 'data')
path = os.path.join(data_dir, filename)
if os.path.exists(path):
return path
try:
data = request.urlopen(url, timeout=15).read()
with open(path, 'wb' if binary else 'w') as f:
f.write(data)
return path
except error.URLError:
msg = "could not download test file '{}'".format(url)
warnings.warn(msg, RuntimeWarning)
raise unittest.SkipTest(msg)

test/common_nn.py Normal file
@@ -0,0 +1,784 @@
import sys
import tempfile
import unittest
from copy import deepcopy
from itertools import product
import torch
import torch.cuda
from torch.autograd import Variable
from common import TestCase, to_gpu, freeze_rng_state
from torch.autograd.gradcheck import get_numerical_jacobian, iter_tensors, contiguous
import torch.backends.cudnn
# tarfile module tries to obtain a file object name in python 3.3
if sys.version_info[:2] == (3, 3):
TemporaryFile = tempfile.NamedTemporaryFile
else:
TemporaryFile = tempfile.TemporaryFile
TEST_CUDA = torch.cuda.is_available()
TEST_MULTIGPU = TEST_CUDA and torch.cuda.device_count() >= 2
TEST_CUDNN = TEST_CUDA and torch.backends.cudnn.is_acceptable(torch.cuda.FloatTensor(1))
TEST_CUDNN_VERSION = TEST_CUDNN and torch.backends.cudnn.version()
PRECISION = 1e-5
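# Each entry declares one auto-generated nn module test: the module class
# name, constructor args, an input (explicit tensor or a size to randomize),
# an optional pure-tensor reference implementation, and flags controlling
# in-place, CUDA, and double-backward coverage.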
module_tests = [
dict(
module_name='Linear',
constructor_args=(10, 8),
input_size=(4, 10),
reference_fn=lambda i, p: torch.mm(i, p[0].t()) + p[1].view(1, -1).expand(4, 8)
),
dict(
module_name='Linear',
constructor_args=(10, 8, False),
input_size=(4, 10),
desc='no_bias',
reference_fn=lambda i, p: torch.mm(i, p[0].t())
),
dict(
module_name='Threshold',
constructor_args=(2, 1),
input_size=(2, 3, 4, 5),
check_inplace=True,
desc='threshold_value'
),
dict(
module_name='Threshold',
constructor_args=(2, 10),
input_size=(2, 3, 4, 5),
desc='large_value'
),
dict(
module_name='ReLU',
input_size=(2, 3, 4, 5),
check_inplace=True,
),
dict(
module_name='ReLU6',
input_size=(2, 3, 4, 5),
check_inplace=True,
),
dict(
module_name='RReLU',
input_size=(1, 2, 2),
test_cuda=False,
check_gradgrad=False,
),
dict(
module_name='RReLU',
constructor_args=(0.1, 0.9),
input_size=(4, 4, 5),
desc='with_up_down',
test_cuda=False,
check_gradgrad=False,
),
dict(
module_name='Hardtanh',
input_size=(3, 2, 5),
reference_fn=lambda i, _: i.clamp(-1, 1),
),
dict(
module_name='Sigmoid',
input_size=(2, 3, 4, 5)
),
dict(
module_name='Tanh',
input_size=(2, 3, 4, 5)
),
dict(
module_name='Softmax',
input_size=(10, 20),
reference_fn=lambda i, _: torch.exp(i).div(torch.exp(i).sum(1, True).expand(10, 20)),
),
dict(
module_name='Softmax2d',
input_size=(1, 3, 10, 20),
reference_fn=lambda i, _: torch.exp(i).div(torch.exp(i).sum(1, False)),
),
dict(
module_name='LogSoftmax',
input_size=(10, 20),
reference_fn=lambda i, _: torch.exp(i).div_(torch.exp(i).sum(1, True).expand(10, 20)).log_(),
),
dict(
module_name='LogSoftmax',
input_size=(1, 3, 10, 20),
reference_fn=lambda i, _: torch.exp(i).div_(torch.exp(i).sum(1, False)).log_(),
desc='multiparam',
),
dict(
module_name='ELU',
constructor_args=(2.,),
input_size=(3, 2, 5),
),
# TODO: reference function
dict(
module_name='Hardshrink',
constructor_args=(2.,),
input_size=(4, 3, 2, 4),
check_gradgrad=False,
),
dict(
module_name='LeakyReLU',
input_size=(3, 2, 5),
check_inplace=True
),
dict(
module_name='LeakyReLU',
constructor_args=(0.5,),
input_size=(3, 2, 5),
check_inplace=True,
desc='with_negval'
),
dict(
module_name='LogSigmoid',
input_size=(2, 3, 4),
reference_fn=lambda i, _: i.sigmoid().log(),
check_gradgrad=False,
),
dict(
module_name='Softplus',
input_size=(10, 20),
reference_fn=lambda i, _: torch.log(1 + torch.exp(i)),
check_gradgrad=False,
),
dict(
module_name='Softplus',
constructor_args=(2,),
input_size=(10, 20),
reference_fn=lambda i, _: 1. / 2. * torch.log(1 + torch.exp(2 * i)),
desc='beta',
check_gradgrad=False,
),
dict(
module_name='Softshrink',
input_size=(3, 2, 5),
check_gradgrad=False,
),
dict(
module_name='Softshrink',
constructor_args=(1,),
input_size=(3, 2, 5),
desc='lambda',
check_gradgrad=False,
),
dict(
module_name='CrossMapLRN2d',
constructor_args=(5, 5e-3, 1e-3, 2),
input_size=(2, 3, 6, 6),
check_gradgrad=False,
),
dict(
module_name='PReLU',
input_size=(2, 3, 4),
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
desc='1d',
),
dict(
module_name='PReLU',
constructor_args=(3,),
input_size=(2, 3, 4),
desc='1d_multiparam',
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
),
dict(
module_name='PReLU',
input_size=(2, 3, 4, 5),
desc='2d',
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
),
dict(
module_name='PReLU',
constructor_args=(3,),
input_size=(2, 3, 4, 5),
desc='2d_multiparam',
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
),
dict(
module_name='PReLU',
input_size=(2, 3, 4, 5, 6),
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
desc='3d',
),
dict(
module_name='PReLU',
constructor_args=(3,),
input_size=(2, 3, 4, 5, 6),
desc='3d_multiparam',
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
),
dict(
module_name='Softsign',
input_size=(3, 2, 5),
reference_fn=lambda i, _: i.div(1 + torch.abs(i)),
),
dict(
module_name='Softmin',
input_size=(10, 20),
check_gradgrad=False,
),
dict(
module_name='Tanhshrink',
input_size=(2, 3, 4, 5)
),
]
criterion_tests = [
dict(module_name='L1Loss',
input_size=(2, 3, 4),
target=torch.randn(2, 3, 4),
reference_fn=lambda i, t, _: 1. / i.numel() *
sum((a - b).abs().sum() for a, b in zip(i, t)),
),
dict(
module_name='NLLLoss',
input=torch.rand(15, 10).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
),
dict(
module_name='NLLLoss',
constructor_args=(None, False),
input=torch.rand(15, 10).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='no_size_average'
),
dict(
module_name='NLLLoss',
constructor_args=(None, True, 2),
input=torch.rand(15, 10).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='ignore_index'
),
dict(
module_name='NLLLoss',
constructor_args=(torch.rand(10),),
input=torch.rand(15, 10).add(1e-2).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='weights',
),
dict(
module_name='NLLLoss',
constructor_args=(torch.rand(10), True, 2),
input=torch.rand(15, 10).add(1e-2).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='weights_ignore_index'
),
dict(
module_name='NLLLoss',
constructor_args=(torch.rand(10), True, -1),
input=torch.rand(15, 10).add(1e-2).log(),
target=torch.Tensor(15).uniform_().mul(10 + 1).floor().long() - 1,
desc='weights_ignore_index_neg'
),
dict(
module_name='KLDivLoss',
input=torch.rand(10, 10).log(),
target=torch.rand(10, 10),
check_gradgrad=False,
),
dict(
module_name='MSELoss',
input=torch.randn(2, 3, 4, 5),
target=torch.randn(2, 3, 4, 5),
reference_fn=lambda i, t, _: (i - t).abs().pow(2).sum() / i.numel(),
check_gradgrad=False,
),
dict(
module_name='BCELoss',
input=torch.rand(15, 10).clamp_(1e-2, 1 - 1e-2),
target=torch.randn(15, 10).gt(0).double(),
check_gradgrad=False,
),
dict(
module_name='BCELoss',
constructor_args=(torch.rand(10),),
input=torch.rand(15, 10).clamp_(1e-2, 1 - 1e-2),
target=torch.randn(15, 10).gt(0).double(),
desc='weights',
check_gradgrad=False,
),
dict(
module_name='CrossEntropyLoss',
input=torch.randn(15, 10),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
check_gradgrad=False,
),
dict(
module_name='CrossEntropyLoss',
constructor_args=(torch.rand(10),),
input=torch.randn(15, 10),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='weights',
check_gradgrad=False,
),
dict(
module_name='NLLLoss2d',
input_size=(2, 3, 5, 5),
target=torch.rand(2, 5, 5).mul(3).floor().long(),
),
dict(
module_name='NLLLoss2d',
constructor_args=(torch.rand(3),),
input_size=(2, 3, 5, 5),
target=torch.rand(2, 5, 5).mul(3).floor().long(),
desc='weights',
),
dict(
module_name='NLLLoss2d',
constructor_args=(None, True, 3),
input_size=(2, 3, 5, 5),
target=torch.rand(2, 5, 5).mul(4).floor().long(),
desc='ignore_index',
),
dict(
module_name='HingeEmbeddingLoss',
input=torch.rand(10),
target=torch.randn(10).gt(0).double().mul_(2).sub(1),
check_gradgrad=False,
),
dict(
module_name='HingeEmbeddingLoss',
constructor_args=(0.5,),
input=torch.rand(10),
target=torch.randn(10).gt(0).double().mul_(2).sub(1),
desc='margin',
check_gradgrad=False,
),
dict(
module_name='MultiLabelMarginLoss',
input_size=(5, 10),
target=torch.rand(5, 10).mul(10).floor().long(),
check_gradgrad=False,
),
dict(
module_name='MultiLabelSoftMarginLoss',
input_size=(5, 10),
target=torch.rand(5, 10).mul(2).floor(),
check_gradgrad=False,
),
dict(
module_name='MultiLabelSoftMarginLoss',
constructor_args=(torch.rand(10),),
input_size=(5, 10),
target=torch.rand(5, 10).mul(2).floor(),
desc='weights',
check_gradgrad=False,
),
dict(
module_name='MultiMarginLoss',
input_size=(5, 10),
target=torch.rand(5).mul(8).floor().long(),
check_gradgrad=False,
),
dict(
module_name='SmoothL1Loss',
input_size=(5, 10),
target=torch.randn(5, 10),
check_gradgrad=False,
),
dict(
module_name='SoftMarginLoss',
input_size=(5, 5),
target=torch.randn(5, 5).sign(),
check_gradgrad=False,
),
dict(
module_name='CosineEmbeddingLoss',
input=(torch.rand(15, 10), torch.rand(15, 10)),
target=torch.randn(15).sign(),
check_gradgrad=False,
),
dict(
module_name='CosineEmbeddingLoss',
constructor_args=(0.7,),
input=(torch.rand(15, 10), torch.rand(15, 10)),
target=torch.randn(15).sign(),
desc='margin',
check_gradgrad=False,
),
dict(
module_name='MarginRankingLoss',
input=(torch.randn(50).mul(10), torch.randn(50).mul(10)),
target=torch.randn(50).sign(),
check_gradgrad=False,
),
dict(
module_name='MarginRankingLoss',
constructor_args=(2,),
input=(torch.randn(50).mul(10), torch.randn(50).mul(10)),
target=torch.randn(50).sign(),
desc='margin',
check_gradgrad=False,
),
]
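# NNTestCase checks gradients by comparing the analytical Jacobian
# (assembled from backward passes) with a numerical Jacobian computed by
# central finite differences; the max elementwise difference must stay
# within PRECISION.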
class NNTestCase(TestCase):
def _jacobian(self, input, num_out):
if isinstance(input, tuple):
return tuple(self._jacobian(elem, num_out) for elem in input)
elif isinstance(input, list):
return [self._jacobian(elem, num_out) for elem in input]
else:
return torch.zeros(input.nelement(), num_out)
def _flatten_tensors(self, x):
if torch.is_tensor(x):
if x.is_sparse:
return x.to_dense().view(-1)
else:
return x.view(-1)
elif isinstance(x, Variable):
return self._flatten_tensors(x.data)
else:
return tuple(self._flatten_tensors(a) for a in x)
def _zero_grad_input(self, input):
if isinstance(input, Variable):
if input.requires_grad and input.grad is not None:
input.grad.data.zero_()
input.grad.detach_()
elif torch.is_tensor(input):
return
else:
for i in input:
self._zero_grad_input(i)
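# Assemble the Jacobian column by column: set one element of d_out to 1,
# backpropagate, and record the resulting input/parameter gradients as that
# column.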
def _analytical_jacobian(self, module, input, jacobian_input=True, jacobian_parameters=True):
output = self._forward(module, input)
output_t = output.data if isinstance(output, Variable) else output
d_out = output_t.new().resize_(output_t.size())
flat_d_out = d_out.view(-1)
if jacobian_input:
jacobian_inp = self._jacobian(input, d_out.nelement())
flat_jacobian_input = list(iter_tensors(jacobian_inp))
if jacobian_parameters:
param, d_param = self._get_parameters(module)
num_param = sum(p.numel() for p in param)
jacobian_param = torch.zeros(num_param, d_out.nelement())
for i in range(flat_d_out.nelement()):
d_out.zero_()
flat_d_out[i] = 1
if jacobian_parameters:
self._zero_grad_parameters(module)
# Variables will accumulate gradient from multiple steps
if jacobian_input:
self._zero_grad_input(input)
d_input = self._backward(module, input, output, d_out)
if jacobian_input:
for jacobian_x, d_x in zip(flat_jacobian_input, iter_tensors(d_input)):
jacobian_x[:, i] = d_x
if jacobian_parameters:
jacobian_param[:, i] = torch.cat(self._flatten_tensors(d_param), 0)
res = tuple()
if jacobian_input:
res += jacobian_inp,
if jacobian_parameters:
res += jacobian_param,
return res
def _numerical_jacobian(self, module, input, jacobian_input=True, jacobian_parameters=True):
output = self._forward(module, input)
output_size = output.nelement()
if jacobian_parameters:
param, d_param = self._get_parameters(module)
def fw(input):
out = self._forward(module, input)
if isinstance(out, Variable):
return out.data
return out
res = tuple()
input = contiguous(input)
if jacobian_input:
res += get_numerical_jacobian(fw, input, input, eps=1e-6),
if jacobian_parameters:
res += torch.cat(list(get_numerical_jacobian(fw, input, p, eps=1e-6) for p in param), 0),
return res
def check_jacobian(self, module, input, jacobian_input=True):
jacobian_parameters = bool(self._get_parameters(module)[0])
analytical = self._analytical_jacobian(module, input, jacobian_input, jacobian_parameters)
numerical = self._numerical_jacobian(module, input, jacobian_input, jacobian_parameters)
analytical_t = iter_tensors(analytical)
numerical_t = iter_tensors(numerical)
# TODO: compare structure
self.assertLessEqual(
max(a.add(-1, n).abs().max() for a, n in zip(analytical_t, numerical_t)),
PRECISION
)
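# Numerical criterion gradient via central differences:
# d_x[i] ~ (f(x + eps) - f(x - eps)) / (2 * eps)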
def check_criterion_jacobian(self, criterion, input, target):
eps = 1e-6
self._forward_criterion(criterion, input, target)
analytical_d_x = self._backward_criterion(criterion, input, target)
numerical_d_x = deepcopy(analytical_d_x)
input_t = iter_tensors(input)
numerical_t = iter_tensors(numerical_d_x)
for x, d_x in zip(input_t, numerical_t):
x = x.view(-1)
d_x = d_x.view(-1)
for i in range(x.nelement()):
original = x[i]
x[i] = original + eps
fx1 = self._forward_criterion(criterion, input, target)
x[i] = original - eps
fx2 = self._forward_criterion(criterion, input, target)
deriv = (fx1 - fx2) / (2. * eps)
d_x[i] = deriv
x[i] = original
# TODO: check structure
analytical_t = iter_tensors(analytical_d_x)
numerical_t = iter_tensors(numerical_d_x)
self.assertLessEqual(
max(a.add(-1, n).abs().max() for a, n in zip(analytical_t, numerical_t)),
PRECISION
)
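# TestBase bundles a module/criterion constructor, its arguments, and an
# input specification (explicit tensor or size), and derives the generated
# test's name from them.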
class TestBase(object):
def __init__(self, constructor, constructor_args=tuple(), input_size=None,
input=None, desc='', reference_fn=None, fullname=None, **kwargs):
if input_size is None and input is None:
raise RuntimeError("Specify either an input tensor, or it's size!")
self.constructor = constructor
self.constructor_args = constructor_args
self.input = input
self.input_size = input_size
self.desc = desc
self.fullname = fullname
self.reference_fn = reference_fn
def get_name(self):
if self.fullname is not None:
return 'test_' + self.fullname
test_name = 'test_' + self.constructor.__name__
if self.desc:
test_name += '_' + self.desc
return test_name
def _unpack_input(self, input):
if isinstance(input, Variable):
return input.data
elif torch.is_tensor(input):
return input
else:
return type(input)(self._unpack_input(i) for i in input)
def _get_input(self):
if self.input is not None:
return self.input
def map_input_sizes(sizes):
if isinstance(sizes, list):
return [map_input_sizes(s) for s in sizes]
elif torch.is_tensor(sizes):
return sizes.double()
else:
return torch.randn(*sizes)
assert self.input_size is not None
return map_input_sizes(self.input_size)
def __call__(self, test_case):
raise NotImplementedError
class ModuleTest(TestBase):
def __init__(self, *args, **kwargs):
super(ModuleTest, self).__init__(*args, **kwargs)
self.jacobian_input = kwargs.get('jacobian_input', True)
self.should_test_cuda = kwargs.get('test_cuda', True)
def __call__(self, test_case):
module = self.constructor(*self.constructor_args)
input = self._get_input()
if self.reference_fn is not None:
out = test_case._forward(module, input)
if isinstance(out, Variable):
out = out.data
ref_input = self._unpack_input(deepcopy(input))
expected_out = self.reference_fn(ref_input, test_case._get_parameters(module)[0])
test_case.assertEqual(out, expected_out)
self.test_noncontig(test_case, module, input)
# TODO: do this with in-memory files once torch.save supports them
with TemporaryFile() as f:
test_case._forward(module, input)
torch.save(module, f)
f.seek(0)
module_copy = torch.load(f)
test_case.assertEqual(test_case._forward(module, input), test_case._forward(module_copy, input))
self._do_test(test_case, module, input)
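# Return a tensor with the same values but a non-contiguous layout: stack
# against a zeroed clone along a new trailing dimension, then select the
# original slice back out.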
def noncontiguize(self, obj):
if isinstance(obj, list):
return [self.noncontiguize(o) for o in obj]
tensor = obj.data if isinstance(obj, Variable) else obj
ndim = tensor.dim()
noncontig = torch.stack([tensor.clone().zero_(), tensor], ndim).select(ndim, 1)
assert noncontig.numel() == 1 or not noncontig.is_contiguous()
if isinstance(obj, Variable):
return Variable(noncontig, requires_grad=obj.requires_grad)
return noncontig
def test_noncontig(self, test_case, module, input):
test_case._zero_grad_parameters(module)
test_case._zero_grad_input(input)
with freeze_rng_state():
output = test_case._forward(module, input)
grad_output = output
if isinstance(grad_output, Variable):
grad_output = grad_output.data.clone()
else:
grad_output = grad_output.clone()
output = output.clone()
grad_output.normal_()
d_input = deepcopy(test_case._backward(module, input, output, grad_output))
d_param = deepcopy(test_case._get_parameters(module)[1])
nc_input = self.noncontiguize(input)
nc_grad_output = self.noncontiguize(grad_output)
for contig_i, contig_g in product((True, False), repeat=2):
i = input if contig_i else nc_input
go = grad_output if contig_g else nc_grad_output
test_case._zero_grad_parameters(module)
test_case._zero_grad_input(i)
with freeze_rng_state():
try:
out = test_case._forward(module, i)
except Exception:
# Some modules will fail because of non-contiguous inputs and we're ok with that
continue
grad = test_case._backward(module, i, out, go)
test_case.assertEqual(out, output)
test_case.assertEqual(grad, d_input, 1e-4)
test_case.assertEqual(test_case._get_parameters(module)[1], d_param)
def test_cuda(self, test_case):
if not TEST_CUDA or not self.should_test_cuda:
raise unittest.SkipTest('Excluded from CUDA tests')
try:
cpu_input = self._get_input()
type_map = {torch.DoubleTensor: torch.cuda.FloatTensor}
gpu_input = to_gpu(cpu_input, type_map=type_map)
cpu_module = self.constructor(*self.constructor_args)
gpu_module = self.constructor(*self.constructor_args).float().cuda()
cpu_param = test_case._get_parameters(cpu_module)
gpu_param = test_case._get_parameters(gpu_module)
for cpu_p, gpu_p in zip(cpu_param[0], gpu_param[0]):
if isinstance(cpu_p, Variable):
cpu_p = cpu_p.data
if isinstance(gpu_p, Variable):
gpu_p = gpu_p.data
gpu_p.copy_(cpu_p)
test_case._zero_grad_input(cpu_input)
test_case._zero_grad_input(gpu_input)
test_case._zero_grad_parameters(cpu_module)
test_case._zero_grad_parameters(gpu_module)
cpu_output = test_case._forward(cpu_module, cpu_input)
gpu_output = test_case._forward(gpu_module, gpu_input)
test_case.assertEqual(cpu_output, gpu_output, 2e-4)
for i in range(5):
cpu_output_t = cpu_output.data if isinstance(cpu_output, Variable) else cpu_output
cpu_gradOutput = cpu_output_t.clone().bernoulli_()
gpu_gradOutput = cpu_gradOutput.type('torch.cuda.FloatTensor')
cpu_gradInput = test_case._backward(cpu_module, cpu_input, cpu_output, cpu_gradOutput)
gpu_gradInput = test_case._backward(gpu_module, gpu_input, gpu_output, gpu_gradOutput)
test_case.assertEqual(cpu_gradInput, gpu_gradInput, 2e-4)
for cpu_d_p, gpu_d_p in zip(cpu_param[1], gpu_param[1]):
test_case.assertEqual(cpu_d_p, gpu_d_p, 2e-4)
self.test_noncontig(test_case, gpu_module, gpu_input)
except NotImplementedError:
pass
# TODO: remove this after CUDA scatter_ is implemented
except AttributeError as e:
if len(e.args) == 1 and "'FloatTensor' object has no attribute 'scatter_'" in e.args[0]:
pass
else:
raise
class CriterionTest(TestBase):
def __init__(self, *args, **kwargs):
super(CriterionTest, self).__init__(*args, **kwargs)
self.target = self._get_target(kwargs['target'])
self.should_test_cuda = kwargs.get('test_cuda', True)
def _get_target(self, target):
return target
def __call__(self, test_case):
module = self.constructor(*self.constructor_args)
input = self._get_input()
# Check that these methods don't raise errors
module.__repr__()
str(module)
if self.reference_fn is not None:
out = test_case._forward_criterion(module, input, self.target)
target = self.target
if isinstance(target, Variable):
target = target.data
expected_out = self.reference_fn(deepcopy(self._unpack_input(input)),
deepcopy(target), module)
test_case.assertEqual(out, expected_out)
test_case.check_criterion_jacobian(module, input, self.target)
self._do_extra_tests(test_case, module, input, self.target)
def test_cuda(self, test_case):
if not TEST_CUDA or not self.should_test_cuda:
raise unittest.SkipTest('Excluded from CUDA tests')
try:
cpu_input = self._get_input()
type_map = {
torch.DoubleTensor: torch.cuda.FloatTensor,
}
gpu_input = to_gpu(cpu_input, type_map=type_map)
cpu_target = self.target
gpu_target = to_gpu(self.target, type_map=type_map)
cpu_module = self.constructor(*self.constructor_args)
gpu_module = self.constructor(*self.constructor_args).float().cuda()
cpu_output = test_case._forward_criterion(cpu_module, cpu_input, cpu_target)
gpu_output = test_case._forward_criterion(gpu_module, gpu_input, gpu_target)
test_case.assertEqual(cpu_output, gpu_output, 4e-4)
cpu_gradInput = test_case._backward_criterion(cpu_module, cpu_input, cpu_target)
gpu_gradInput = test_case._backward_criterion(gpu_module, gpu_input, gpu_target)
test_case.assertEqual(cpu_gradInput, gpu_gradInput, 4e-4)
except NotImplementedError:
pass
def _do_extra_tests(self, test_case, module, input, target):
pass

test/data/network1.py Normal file
@@ -0,0 +1,8 @@
import torch.nn as nn
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.linear = nn.Linear(10, 20)

test/data/network2.py Normal file
@@ -0,0 +1,9 @@
import torch.nn as nn
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.linear = nn.Linear(10, 20)
self.relu = nn.ReLU()

@@ -0,0 +1,71 @@
import torch
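# Call fn and assert that it raises; each required substring must appear in
# the exception message, which is also printed for manual inspection.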
def check_error(desc, fn, *required_substrings):
try:
fn()
except Exception as e:
error_message = e.args[0]
print('=' * 80)
print(desc)
print('-' * 80)
print(error_message)
print('')
for sub in required_substrings:
assert sub in error_message
return
assert False, "given function ({}) didn't raise an error".format(desc)
check_error(
'Wrong argument types',
lambda: torch.FloatStorage(object()),
'object')
check_error('Unknown keyword argument',
lambda: torch.FloatStorage(content=1234.),
'keyword')
check_error('Invalid types inside a sequence',
lambda: torch.FloatStorage(['a', 'b']),
'list', 'str')
check_error('Invalid size type',
lambda: torch.FloatStorage(1.5),
'float')
check_error('Invalid offset',
lambda: torch.FloatStorage(torch.FloatStorage(2), 4),
'2', '4')
check_error('Negative offset',
lambda: torch.FloatStorage(torch.FloatStorage(2), -1),
'2', '-1')
check_error('Invalid size',
lambda: torch.FloatStorage(torch.FloatStorage(3), 1, 5),
'2', '1', '5')
check_error('Negative size',
lambda: torch.FloatStorage(torch.FloatStorage(3), 1, -5),
'2', '1', '-5')
check_error('Invalid index type',
lambda: torch.FloatStorage(10)['first item'],
'str')
def assign():
torch.FloatStorage(10)[1:-1] = '1'
check_error('Invalid value type',
assign,
'str')
check_error('resize_ with invalid type',
lambda: torch.FloatStorage(10).resize_(1.5),
'float')
check_error('fill_ with invalid type',
lambda: torch.IntStorage(10).fill_('asdf'),
'str')
# TODO: frombuffer

test/ffi/src/cpu/lib.h Normal file
@@ -0,0 +1,6 @@
void good_func(THFloatTensor *tensor, int a, float b);
void bad_func(THFloatTensor *tensor, int a, float b);
THFloatTensor * new_tensor(int a);
float int_to_float(int a);

test/ffi/src/cpu/lib1.c Normal file
@@ -0,0 +1,19 @@
#include <TH/TH.h>
void good_func(THFloatTensor *tensor, int a, float b)
{
THFloatTensor_mul(tensor, tensor, a);
THFloatTensor_add(tensor, tensor, b);
}
THFloatTensor * new_tensor(int a)
{
THFloatTensor *t = THFloatTensor_newWithSize2d(a, a);
THFloatTensor_fill(t, a);
return t;
}
float int_to_float(int a)
{
return a;
}

test/ffi/src/cpu/lib2.c Normal file
@@ -0,0 +1,8 @@
#include <TH/TH.h>
void bad_func(THFloatTensor *tensor, int a, float b)
{
THFloatTensor_mul(tensor, tensor, a);
THFloatTensor_add(tensor, tensor, b);
THFloatTensor_addbmm(tensor, 1, tensor, 1, tensor, tensor);
}

@@ -0,0 +1,12 @@
#include <TH/TH.h>
#include <THC/THC.h>
extern THCState *state;
#include "../cpu/lib1.c"
void cuda_func(THCudaTensor *tensor, int a, float b)
{
THCudaTensor_mul(state, tensor, tensor, a);
THCudaTensor_add(state, tensor, tensor, b);
}

@@ -0,0 +1,5 @@
void good_func(THFloatTensor *tensor, int a, float b);
void cuda_func(THCudaTensor *tensor, int a, float b);
THFloatTensor * new_tensor(int a);
float int_to_float(int a);

test/ffi/src/lib.h Normal file
@@ -0,0 +1,5 @@
void my_func(THFloatTensor *tensor, int a, float b);
void my_cuda_func(THCudaTensor *tensor, int a, float b);
THFloatTensor * new_t(int a);
float new_int(int a);

test/optim/compare.sh Executable file
@@ -0,0 +1,14 @@
th test.lua > lua.out
python3 test.py > python.out
diff lua.out python.out >/dev/null 2>&1
RESULT=$?
if [[ $RESULT -eq 0 ]]; then
echo "PASS"
else
echo "FAIL"
echo "Press ENTER to open vimdiff"
read
vimdiff lua.out python.out
fi

test/optim/test.lua Normal file
@@ -0,0 +1,33 @@
local cjson = require 'cjson'
require 'optim'
function rosenbrock(t)
x, y = t[1], t[2]
return (1 - x) ^ 2 + 100 * (y - x^2)^2
end
function drosenbrock(t)
x, y = t[1], t[2]
return torch.DoubleTensor({-400 * x * (y - x^2) - 2 * (1 - x), 200 * x * (y - x^2)})
end
local fd = io.open('tests.json', 'r')
local tests = cjson.decode(fd:read('*a'))
fd:close()
for i, test in ipairs(tests) do
print(test.algorithm)
algorithm = optim[test.algorithm]
for i, config in ipairs(test.config) do
print('================================================================================')
params = torch.DoubleTensor({1.5, 1.5})
for i = 1, 100 do
function closure(x)
return rosenbrock(x), drosenbrock(x)
end
algorithm(closure, params, config)
print(string.format('%.8f\t%.8f', params[1], params[2]))
end
end
end

test/optim/test.py Normal file
@@ -0,0 +1,41 @@
import json
import torch
import torch.legacy.optim as optim
from pprint import pprint
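# Rosenbrock test function and the gradient expression shared with test.lua;
# both scripts must print identical trajectories for compare.sh to report PASS.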
def rosenbrock(tensor):
x, y = tensor
return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2
def drosenbrock(tensor):
x, y = tensor
return torch.DoubleTensor((-400 * x * (y - x ** 2) - 2 * (1 - x), 200 * x * (y - x ** 2)))
algorithms = {
'adadelta': optim.adadelta,
'adagrad': optim.adagrad,
'adam': optim.adam,
'adamax': optim.adamax,
'asgd': optim.asgd,
'cg': optim.cg,
'nag': optim.nag,
'rmsprop': optim.rmsprop,
'rprop': optim.rprop,
'sgd': optim.sgd,
'lbfgs': optim.lbfgs,
}
with open('tests.json', 'r') as f:
tests = json.loads(f.read())
for test in tests:
print(test['algorithm'] + '\t')
algorithm = algorithms[test['algorithm']]
for config in test['config']:
print('================================================================================\t')
params = torch.DoubleTensor((1.5, 1.5))
for i in range(100):
algorithm(lambda x: (rosenbrock(x), drosenbrock(x)), params, config)
print('{:.8f}\t{:.8f}\t'.format(params[0], params[1]))

test/optim/tests.json Normal file
@@ -0,0 +1,109 @@
[
{
"algorithm": "adadelta",
"config": [
{},
{"rho": 0.95},
{"rho": 0.95, "eps": 1e-3},
{"weightDecay": 0.2}
]
},
{
"algorithm": "adagrad",
"config": [
{}
]
},
{
"algorithm": "adam",
"config": [
{},
{"learningRate": 1e-4},
{"learningRate": 1e-4, "beta1": 0.92},
{"learningRate": 1e-4, "beta1": 0.92, "beta2": 0.96},
{"learningRate": 1e-4, "beta1": 0.92, "beta2": 0.96, "epsilon": 1e-3},
{"learningRate": 1e-4, "weightDecay": 0.1}
]
},
{
"algorithm": "adamax",
"config": [
{},
{"learningRate": 1e-4},
{"learningRate": 1e-4, "beta1": 0.92},
{"learningRate": 1e-4, "beta1": 0.92, "beta2": 0.96},
{"learningRate": 1e-4, "beta1": 0.92, "beta2": 0.96, "epsilon": 1e-3}
]
},
{
"algorithm": "asgd",
"config": [
{},
{"eta0": 1e-4},
{"eta0": 1e-4, "lambda": 1e-2},
{"eta0": 1e-4, "lambda": 1e-2, "alpha": 0.9},
{"eta0": 1e-4, "lambda": 1e-2, "alpha": 0.9, "t0": 10}
]
},
{
"algorithm": "cg",
"config": [
{},
{"rho": 0.02},
{"sig": 0.06},
{"int": 0.12},
{"ext": 3.2},
{"maxIter": 5},
{"ratio": 95}
]
},
{
"algorithm": "nag",
"config": [
{},
{"learningRate": 1e-4},
{"learningRate": 1e-4, "learningRateDecay": 0.1},
{"learningRate": 1e-4, "weightDecay": 0.3},
{"learningRate": 1e-4, "momentum": 0.95},
{"learningRate": 1e-4, "momentum": 0.95, "dampening": 0.8}
]
},
{
"algorithm": "rmsprop",
"config": [
{},
{"learningRate": 1e-4},
{"learningRate": 1e-4, "alpha": 0.95},
{"learningRate": 1e-4, "alpha": 0.95, "epsilon": 1e-3},
{"weightDecay": 0.2}
]
},
{
"algorithm": "rprop",
"config": [
{},
{"stepsize": 0.05},
{"stepsize": 0.05, "etaplus": 1.15},
{"stepsize": 0.05, "etaplus": 1.15, "etaminus": 0.6},
{"stepsize": 0.05, "etaplus": 1.15, "etaminus": 0.6, "stepsizemax": 1, "stepsizemin": 1e-3},
{"stepsize": 0.05, "etaplus": 1.15, "etaminus": 0.6, "niter": 10}
]
},
{
"algorithm": "sgd",
"config": [
{},
{"learningRate": 1e-4},
{"learningRate": 1e-4, "momentum": 0.95, "dampening": 0.9},
{"learningRate": 1e-4, "nesterov": true, "momentum": 0.95, "dampening": 0},
{"weightDecay": 0.2}
]
},
{
"algorithm": "lbfgs",
"config": [
{},
{"learningRate": 1e-1}
]
}
]

test/run_test.sh Executable file
@@ -0,0 +1,95 @@
#!/usr/bin/env bash
set -e
PYCMD=${PYCMD:="python"}
COVERAGE=0
while [[ "$#" -gt 0 ]]; do
case "$1" in
-p|--python) PYCMD=$2; shift 2 ;;
-c|--coverage) COVERAGE=1; shift 1;;
--) shift; break ;;
*) echo "Invalid argument: $1!" ; exit 1 ;;
esac
done
if [[ $COVERAGE -eq 1 ]]; then
coverage erase
PYCMD="coverage run --parallel-mode --source torch "
echo "coverage flag found. Setting python command to: \"$PYCMD\""
fi
pushd "$(dirname "$0")"
echo "Running torch tests"
$PYCMD test_torch.py $@
echo "Running autograd tests"
$PYCMD test_autograd.py $@
echo "Running sparse tests"
$PYCMD test_sparse.py $@
echo "Running nn tests"
$PYCMD test_nn.py $@
echo "Running legacy nn tests"
$PYCMD test_legacy_nn.py $@
echo "Running optim tests"
$PYCMD test_optim.py $@
echo "Running multiprocessing tests"
$PYCMD test_multiprocessing.py $@
MULTIPROCESSING_METHOD=spawn $PYCMD test_multiprocessing.py $@
MULTIPROCESSING_METHOD=forkserver $PYCMD test_multiprocessing.py $@
echo "Running util tests"
$PYCMD test_utils.py $@
echo "Running dataloader tests"
$PYCMD test_dataloader.py $@
echo "Running cuda tests"
$PYCMD test_cuda.py $@
echo "Running NCCL tests"
$PYCMD test_nccl.py $@
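# Distributed tests run once per backend (TCP, Gloo, and MPI when mpiexec is
# available), each with a fresh temp directory for barrier files.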
distributed_set_up() {
export TEMP_DIR="$(mktemp -d)"
rm -rf "$TEMP_DIR/"*
mkdir "$TEMP_DIR/barrier"
mkdir "$TEMP_DIR/test_dir"
}
distributed_tear_down() {
rm -rf "$TEMP_DIR"
}
trap distributed_tear_down EXIT SIGHUP SIGINT SIGTERM
echo "Running distributed tests for the TCP backend"
distributed_set_up
BACKEND=tcp WORLD_SIZE=3 $PYCMD ./test_distributed.py
distributed_tear_down
echo "Running distributed tests for the Gloo backend"
distributed_set_up
BACKEND=gloo WORLD_SIZE=3 $PYCMD ./test_distributed.py
distributed_tear_down
if [ -x "$(command -v mpiexec)" ]; then
echo "Running distributed tests for the MPI backend"
distributed_set_up
BACKEND=mpi mpiexec -n 3 $PYCMD ./test_distributed.py
distributed_tear_down
else
echo "Skipping MPI backend tests (MPI not found)"
fi
if [[ $COVERAGE -eq 1 ]]; then
coverage combine
coverage html
fi
popd

test/test_autograd.py Normal file
File diff suppressed because it is too large

test/test_cuda.py Normal file
@@ -0,0 +1,936 @@
import math
import tempfile
import unittest
from itertools import repeat
import torch
import torch.cuda
import torch.cuda.comm as comm
from test_torch import TestTorch
from common import TestCase, get_gpu_type, to_gpu, freeze_rng_state, run_tests
HAS_CUDA = True
if not torch.cuda.is_available():
print('CUDA not available, skipping tests')
TestCase = object # noqa: F811
HAS_CUDA = False
def is_floating(t):
return type(t) in [torch.FloatTensor, torch.DoubleTensor,
torch.cuda.FloatTensor, torch.cuda.DoubleTensor]
types = [
torch.FloatTensor,
torch.DoubleTensor,
torch.LongTensor,
torch.IntTensor,
torch.ShortTensor,
torch.CharTensor,
torch.ByteTensor,
]
float_types = [
torch.FloatTensor,
torch.DoubleTensor
] # TODO: add half...
def number(floating, integer, t):
name = type(t).__name__
if 'Double' in name or 'Float' in name or 'Half' in name:
return floating
else:
return integer
# TODO: check HalfTensor
S = 10
M = 50
def make_tensor(t, *sizes):
return t(*sizes).copy_(torch.randn(*sizes))
def small_2d(t):
return make_tensor(t, S, S)
def small_2d_scaled(t, scale=10):
return make_tensor(t, S, S).mul(scale)
def small_2d_oneish(t):
if is_floating(t):
return make_tensor(t, S, S).clamp(min=0.99, max=1.01)
else:
return t(S, S).fill_(1)
def small_3d(t):
return make_tensor(t, S, S, S)
def medium_1d(t):
return make_tensor(t, M)
def medium_2d(t):
return make_tensor(t, M, M)
def medium_2d_scaled(t, scale=10):
return make_tensor(t, M, M).mul(scale)
def small_3d_ones(t):
return t(S, S, S).copy_(torch.ones(S, S, S))
def small_3d_positive(t):
min_val = 1e-3 if is_floating(t) else 2
return make_tensor(t, S, S, S).clamp_(min_val, 120)
def small_3d_unique(t):
return t(S, S, S).copy_(torch.arange(1, S * S * S + 1).view(S, S, S))
def small_1d_lapack(t):
return t(1, 3).copy_(torch.arange(1, 4).view(3))
def small_2d_lapack(t):
return t(3, 3).copy_(torch.arange(1, 10).view(3, 3))
def small_2d_lapack_skinny(t):
return t(3, 4).copy_(torch.arange(1, 13).view(3, 4))
def small_2d_lapack_fat(t):
return t(4, 3).copy_(torch.arange(1, 13).view(4, 3))
def large_2d_lapack(t):
return t(1000, 1000).normal_()
def new_t(*sizes):
def tmp(t):
return t(*sizes).copy_(torch.randn(*sizes))
return tmp
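# Test tuples: (method name, tensor constructor, argument constructor,
# optional sub-test description, optional list of applicable tensor types).
# compare_cpu_gpu below expands each tuple into a CPU-vs-GPU test.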
tests = [
('add', small_3d, lambda t: [number(3.14, 3, t)]),
('add', small_3d, lambda t: [small_3d_positive(t)], 'tensor'),
('add', small_3d, lambda t: [number(0.2, 2, t), small_3d_positive(t)], 'scalar_tensor'),
('sub', small_3d, lambda t: [number(3.14, 3, t)],),
('sub', small_3d, lambda t: [small_3d_positive(t)], 'tensor'),
('mul', small_3d, lambda t: [number(3.14, 3, t)],),
('mul', small_3d, lambda t: [small_3d_positive(t)], 'tensor'),
('div', small_3d, lambda t: [number(3.14, 3, t)],),
('div', small_3d, lambda t: [small_3d_positive(t)], 'tensor'),
('pow', small_3d, lambda t: [number(3.14, 3, t)], None, float_types),
('pow', small_3d, lambda t: [small_3d(t).abs_()], 'tensor', float_types),
('addbmm', small_2d, lambda t: [small_3d(t), small_3d(t)], None, float_types),
('addbmm', small_2d, lambda t: [number(0.4, 2, t), small_3d(t), small_3d(t)], 'scalar'),
('addbmm', small_2d, lambda t: [number(0.5, 3, t), number(0.4, 2, t), small_3d(t), small_3d(t)], 'two_scalars'),
('baddbmm', small_3d, lambda t: [small_3d(t), small_3d(t)],),
('baddbmm', small_3d, lambda t: [number(0.4, 2, t), small_3d(t), small_3d(t)], 'scalar'),
('baddbmm', small_3d, lambda t: [number(0.5, 3, t), number(0.4, 2, t), small_3d(t), small_3d(t)], 'two_scalars'),
('addcdiv', small_2d_lapack, lambda t: [small_2d_lapack(t).mul(2), small_2d_lapack(t)],),
('addcdiv', small_2d_lapack, lambda t: [number(2.8, 1, t),
small_2d_lapack(t).mul(2), small_2d_lapack(t)], 'scalar'),
('addcmul', small_3d, lambda t: [small_3d(t), small_3d(t)],),
('addcmul', small_3d, lambda t: [number(0.4, 2, t), small_3d(t), small_3d(t)], 'scalar'),
('addmm', medium_2d, lambda t: [medium_2d(t), medium_2d(t)],),
('addmm', medium_2d, lambda t: [number(0.4, 2, t), medium_2d(t), medium_2d(t)], 'scalar'),
('addmm', medium_2d, lambda t: [number(0.5, 3, t), number(0.4, 2, t), medium_2d(t), medium_2d(t)], 'two_scalars'),
('addmv', medium_1d, lambda t: [medium_2d(t), medium_1d(t)],),
('addmv', medium_1d, lambda t: [number(0.4, 2, t), medium_2d(t), medium_1d(t)], 'scalar'),
('addmv', medium_1d, lambda t: [number(0.5, 3, t), number(0.4, 2, t), medium_2d(t), medium_1d(t)], 'two_scalars'),
('addr', medium_2d, lambda t: [medium_1d(t), medium_1d(t)],),
('addr', medium_2d, lambda t: [number(0.4, 2, t), medium_1d(t), medium_1d(t)], 'scalar'),
('addr', medium_2d, lambda t: [number(0.5, 3, t), number(0.4, 2, t), medium_1d(t), medium_1d(t)], 'two_scalars'),
('atan2', medium_2d, lambda t: [medium_2d(t)], None, float_types),
('fmod', small_3d, lambda t: [3], 'value'),
('fmod', small_3d, lambda t: [small_3d_positive(t)], 'tensor'),
('chunk', medium_2d, lambda t: [4],),
('chunk', medium_2d, lambda t: [4, 1], 'dim'),
('chunk', medium_2d, lambda t: [4, -2], 'neg_dim'),
('clamp', medium_2d_scaled, lambda t: [-1, 5],),
('clone', medium_2d, lambda t: [],),
('contiguous', medium_2d, lambda t: [],),
('cross', new_t(M, 3, M), lambda t: [new_t(M, 3, M)(t)],),
('cumprod', small_3d, lambda t: [1],),
('cumprod', small_3d, lambda t: [-1], 'neg_dim'),
('cumsum', small_3d, lambda t: [1],),
('cumsum', small_3d, lambda t: [-1], 'neg_dim'),
('dim', small_3d, lambda t: [],),
('dist', small_2d, lambda t: [small_2d(t)],),
('dist', small_2d, lambda t: [small_2d(t), 3], '3_norm'),
('dist', small_2d, lambda t: [small_2d(t), 2.5], '2_5_norm'),
('dot', medium_1d, lambda t: [medium_1d(t)],),
('element_size', medium_1d, lambda t: [],),
('eq', small_3d_ones, lambda t: [small_3d(t)],),
('eq', small_3d_ones, lambda t: [small_3d_ones(t)], 'equal'),
('ne', small_3d_ones, lambda t: [small_3d(t)],),
('ne', small_3d_ones, lambda t: [small_3d_ones(t)], 'equal'),
('equal', small_3d_ones, lambda t: [small_3d_ones(t)], 'equal'),
('equal', small_3d_ones, lambda t: [small_3d(t)],),
('expand', new_t(M, 1, M), lambda t: [M, 4, M],),
('expand_as', new_t(M, 1, M), lambda t: [new_t(M, 4, M)(t)],),
('fill', medium_2d, lambda t: [number(3.14, 3, t)],),
('ge', medium_2d, lambda t: [medium_2d(t)],),
('le', medium_2d, lambda t: [medium_2d(t)],),
('gt', medium_2d, lambda t: [medium_2d(t)],),
('lt', medium_2d, lambda t: [medium_2d(t)],),
('is_contiguous', medium_2d, lambda t: [],),
# TODO: can't check negative case - GPU copy will be contiguous
('is_same_size', medium_2d, lambda t: [small_3d(t)], 'negative'),
('is_same_size', medium_2d, lambda t: [medium_2d(t)], 'positive'),
('is_set_to', medium_2d, lambda t: [medium_2d(t)],),
# TODO: positive case
('kthvalue', small_3d_unique, lambda t: [3],),
('kthvalue', small_3d_unique, lambda t: [3, 1], 'dim'),
('kthvalue', small_3d_unique, lambda t: [3, -1], 'neg_dim'),
('lerp', small_3d, lambda t: [small_3d(t), 0.3],),
('max', small_3d_unique, lambda t: [],),
('max', small_3d_unique, lambda t: [1], 'dim'),
('max', small_3d_unique, lambda t: [-1], 'neg_dim'),
('max', medium_2d, lambda t: [medium_2d(t)], 'elementwise'),
('min', small_3d_unique, lambda t: [],),
('min', small_3d_unique, lambda t: [1], 'dim'),
('min', small_3d_unique, lambda t: [-1], 'neg_dim'),
('min', medium_2d, lambda t: [medium_2d(t)], 'elementwise'),
('mean', small_3d, lambda t: [],),
('mean', small_3d, lambda t: [-1], 'neg_dim'),
('mean', small_3d, lambda t: [1], 'dim'),
('mode', small_3d, lambda t: [],),
('mode', small_3d, lambda t: [1], 'dim'),
('mode', small_3d, lambda t: [-1], 'neg_dim'),
('remainder', small_3d, lambda t: [3], 'value'),
('remainder', small_3d, lambda t: [-3], 'negative_value'),
('remainder', small_3d, lambda t: [small_3d_positive(t)], 'tensor'),
('remainder', small_3d, lambda t: [0 - small_3d_positive(t)], 'negative_tensor'),
('std', small_3d, lambda t: [],),
('std', small_3d, lambda t: [1], 'dim'),
('std', small_3d, lambda t: [-1], 'neg_dim'),
('var', small_3d, lambda t: [],),
('var', small_3d, lambda t: [1], 'dim'),
('var', small_3d, lambda t: [-1], 'neg_dim'),
('ndimension', small_3d, lambda t: [],),
('nelement', small_3d, lambda t: [],),
('numel', small_3d, lambda t: [],),
('narrow', small_3d, lambda t: [1, 3, 2],),
('narrow', small_3d, lambda t: [-1, 3, 2], 'neg_dim'),
('nonzero', small_3d, lambda t: [],),
('norm', small_3d, lambda t: [],),
('norm', small_3d, lambda t: [3], '3_norm'),
('norm', small_3d, lambda t: [3, 0], '3_norm_dim'),
('norm', small_3d, lambda t: [3, -2], '3_norm_neg_dim'),
('ones', small_3d, lambda t: [1, 2, 3, 4, 5],),
('permute', new_t(1, 2, 3, 4), lambda t: [2, 1, 3, 0],),
('prod', small_2d_oneish, lambda t: [],),
('prod', small_3d, lambda t: [1], 'dim'),
('prod', small_3d, lambda t: [-1], 'neg_dim'),
('sum', small_2d, lambda t: [],),
('sum', small_3d, lambda t: [1], 'dim'),
('sum', small_3d, lambda t: [-1], 'neg_dim'),
('renorm', small_3d, lambda t: [2, 1, 1], '2_norm'),
('renorm', small_3d, lambda t: [2, -1, 1], '2_norm_neg_dim'),
('renorm', small_3d, lambda t: [1.5, 1, 1], '1_5_norm'),
('repeat', small_2d, lambda t: [2, 2, 2],),
('size', new_t(1, 2, 3, 4), lambda t: [],),
('size', new_t(1, 2, 3, 4), lambda t: [1], 'dim'),
('size', new_t(1, 2, 3, 4), lambda t: [-2], 'neg_dim'),
('sort', small_3d_unique, lambda t: [],),
('sort', small_3d_unique, lambda t: [1], 'dim'),
('sort', small_3d_unique, lambda t: [-1], 'neg_dim'),
('sort', small_3d_unique, lambda t: [1, True], 'dim_descending'),
('sort', small_3d_unique, lambda t: [-1, True], 'neg_dim_descending'),
('split', small_3d, lambda t: [2],),
('split', small_3d, lambda t: [2, 1], 'dim'),
('split', small_3d, lambda t: [2, -3], 'neg_dim'),
('squeeze', new_t(1, 2, 1, 4), lambda t: [],),
('squeeze', new_t(1, 2, 1, 4), lambda t: [2], 'dim'),
('squeeze', new_t(1, 2, 1, 4), lambda t: [-2], 'neg_dim'),
('t', new_t(1, 2), lambda t: [],),
('transpose', new_t(1, 2, 3, 4), lambda t: [1, 2],),
('transpose', new_t(1, 2, 3, 4), lambda t: [-1, -2], 'neg_dim'),
('to_list', small_3d, lambda t: [],),
('topk', small_3d_unique, lambda t: [2, 1, False, True], 'dim_sort'),
('topk', small_3d_unique, lambda t: [2, -1, False, True], 'neg_dim_sort'),
('topk', small_3d_unique, lambda t: [2, 1, True, True], 'dim_desc_sort'),
('trace', medium_2d, lambda t: [],),
('tril', medium_2d, lambda t: [],),
('tril', medium_2d, lambda t: [2], 'positive'),
('tril', medium_2d, lambda t: [-2], 'negative'),
('triu', medium_2d, lambda t: [],),
('triu', medium_2d, lambda t: [2], 'positive'),
('triu', medium_2d, lambda t: [-2], 'negative'),
('unsqueeze', new_t(2, 3, 4), lambda t: [2],),
('unsqueeze', new_t(2, 3, 4), lambda t: [-2], 'neg_dim'),
('view', small_3d, lambda t: [100, 10],),
('view_as', small_3d, lambda t: [t(100, 10)],),
('zero', small_3d, lambda t: [],),
('zeros', small_3d, lambda t: [1, 2, 3, 4],),
('rsqrt', lambda t: small_3d(t) + 1, lambda t: [], None, float_types),
('sinh', lambda t: small_3d(t).clamp(-1, 1), lambda t: [], None, float_types),
('tan', lambda t: small_3d(t).clamp(-1, 1), lambda t: [], None, float_types),
# lapack tests
('qr', small_2d_lapack, lambda t: [], 'square', float_types),
('qr', small_2d_lapack_skinny, lambda t: [], 'skinny', float_types),
('qr', small_2d_lapack_fat, lambda t: [], 'fat', float_types),
('qr', large_2d_lapack, lambda t: [], 'big', float_types),
('inverse', new_t(20, 20), lambda t: [], None, float_types),
]
# TODO: random functions, cat, gather, scatter, index*, masked*,
# resize, resizeAs, storage_offset, storage, stride, unfold
custom_precision = {
'addbmm': 1e-4,
'addmm': 1e-4,
'addmv': 1e-4,
'addr': 1e-4,
'baddbmm': 1e-4,
'rsqrt': 1e-4,
'cumprod': 1e-4,
'qr': 3e-4,
}
simple_pointwise = [
'abs',
'sign',
]
for fn in simple_pointwise:
tests.append((fn, small_3d, lambda t: []))
simple_pointwise_float = [
'log',
'log1p',
'sigmoid',
'sin',
'sqrt',
'tanh',
'acos',
'asin',
'atan',
'cos',
'cosh',
'exp',
'reciprocal',
'floor',
'frac',
'neg',
'round',
'trunc',
'ceil',
]
for fn in simple_pointwise_float:
tests.append((fn, small_3d, lambda t: [], None, float_types))
_cycles_per_ms = None
def get_cycles_per_ms():
"""Approximate number of cycles per millisecond for torch.cuda._sleep"""
global _cycles_per_ms
if _cycles_per_ms is None:
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
torch.cuda._sleep(1000000)
end.record()
end.synchronize()
_cycles_per_ms = 1000000 / start.elapsed_time(end)
return _cycles_per_ms
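# Build a test method that runs `fn` on a CPU tensor and on its GPU copy
# with matching arguments, then checks that inputs and results agree to
# within the given precision.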
def compare_cpu_gpu(tensor_constructor, arg_constructor, fn, t, precision=1e-5):
def tmp(self):
cpu_tensor = tensor_constructor(t)
gpu_tensor = to_gpu(cpu_tensor)
cpu_args = arg_constructor(t)
gpu_args = [to_gpu(arg) for arg in cpu_args]
cpu_result = getattr(cpu_tensor, fn)(*cpu_args)
try:
gpu_result = getattr(gpu_tensor, fn)(*gpu_args)
except RuntimeError as e:
reason = e.args[0]
if 'unimplemented data type' in reason:
raise unittest.SkipTest('unimplemented data type')
raise
except AttributeError as e:
reason = e.args[0]
if 'object has no attribute' in reason:
raise unittest.SkipTest('unimplemented data type')
raise
# If the op mutated its inputs in-place, the CPU and GPU copies must have changed in the same way
self.assertEqual(cpu_tensor, gpu_tensor, precision)
self.assertEqual(cpu_args, gpu_args, precision)
# Compare results
self.assertEqual(cpu_result, gpu_result, precision)
return tmp
class TestCuda(TestCase):
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_autogpu(self):
x = torch.randn(5, 5).cuda()
y = torch.randn(5, 5).cuda()
self.assertEqual(x.get_device(), 0)
self.assertEqual(x.get_device(), 0)
with torch.cuda.device(1):
z = torch.randn(5, 5).cuda()
self.assertEqual(z.get_device(), 1)
q = x.add(y)
self.assertEqual(q.get_device(), 0)
w = torch.randn(5, 5).cuda()
self.assertEqual(w.get_device(), 1)
z = z.cuda()
self.assertEqual(z.get_device(), 0)
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_copy_device(self):
x = torch.randn(5, 5).cuda()
with torch.cuda.device(1):
y = x.cuda()
self.assertEqual(y.get_device(), 1)
self.assertIs(y.cuda(), y)
z = y.cuda(0)
self.assertEqual(z.get_device(), 0)
self.assertIs(z.cuda(0), z)
x = torch.randn(5, 5)
with torch.cuda.device(1):
y = x.cuda()
self.assertEqual(y.get_device(), 1)
self.assertIs(y.cuda(), y)
z = y.cuda(0)
self.assertEqual(z.get_device(), 0)
self.assertIs(z.cuda(0), z)
def test_serialization_array_with_storage(self):
x = torch.randn(5, 5).cuda()
y = torch.IntTensor(2, 5).fill_(0).cuda()
q = [x, y, x, y.storage()]
with tempfile.NamedTemporaryFile() as f:
torch.save(q, f)
f.seek(0)
q_copy = torch.load(f)
self.assertEqual(q_copy, q, 0)
q_copy[0].fill_(5)
self.assertEqual(q_copy[0], q_copy[2], 0)
self.assertTrue(isinstance(q_copy[0], torch.cuda.DoubleTensor))
self.assertTrue(isinstance(q_copy[1], torch.cuda.IntTensor))
self.assertTrue(isinstance(q_copy[2], torch.cuda.DoubleTensor))
self.assertTrue(isinstance(q_copy[3], torch.cuda.IntStorage))
q_copy[1].fill_(10)
self.assertEqual(q_copy[3], torch.cuda.IntStorage(10).fill_(10))
def test_type_conversions(self):
x = torch.randn(5, 5)
self.assertIs(type(x.float()), torch.FloatTensor)
self.assertIs(type(x.cuda()), torch.cuda.DoubleTensor)
self.assertIs(type(x.cuda().float()), torch.cuda.FloatTensor)
self.assertIs(type(x.cuda().float().cpu()), torch.FloatTensor)
self.assertIs(type(x.cuda().float().cpu().int()), torch.IntTensor)
y = x.storage()
self.assertIs(type(y.float()), torch.FloatStorage)
self.assertIs(type(y.cuda()), torch.cuda.DoubleStorage)
self.assertIs(type(y.cuda().float()), torch.cuda.FloatStorage)
self.assertIs(type(y.cuda().float().cpu()), torch.FloatStorage)
self.assertIs(type(y.cuda().float().cpu().int()), torch.IntStorage)
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_type_conversions_same_gpu(self):
x = torch.randn(5, 5).cuda(1)
self.assertEqual(x.int().get_device(), 1)
def _test_broadcast(self, input):
if torch.cuda.device_count() < 2:
raise unittest.SkipTest("only one GPU detected")
result = comm.broadcast(input, (0, 1))
for i, t in enumerate(result):
self.assertEqual(t.get_device(), i)
self.assertEqual(t, input)
def test_broadcast_cpu(self):
self._test_broadcast(torch.randn(5, 5))
def test_broadcast_gpu(self):
self._test_broadcast(torch.randn(5, 5))
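# broadcast_coalesced flattens tensors of the same type into shared buffers
# (capped at buffer_size) before broadcasting; the result must match
# broadcasting each tensor individually.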
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_broadcast_coalesced(self):
numel = 5
num_bytes = numel * 8
tensors = [
torch.randn(numel).long().cuda(),
torch.randn(numel).cuda(),
torch.randn(numel).long().cuda(),
torch.randn(numel).long().cuda(),
torch.randn(numel * 2).int().cuda(), # int is 2x shorter
torch.randn(numel).cuda(),
]
b_tensors = [comm.broadcast(t, (0, 1)) for t in tensors]
for (_, bt), t in zip(b_tensors, tensors):
self.assertEqual(bt.get_device(), 1)
self.assertEqual(bt, t)
self.assertIsInstance(bt, type(t))
bc_tensors = comm.broadcast_coalesced(tensors, (0, 1), buffer_size=num_bytes * 5 // 2)
bc_tensors_t = list(zip(*bc_tensors))
self.assertEqual(b_tensors, bc_tensors_t)
for (_, bt), (_, bct) in zip(b_tensors, bc_tensors_t):
self.assertEqual(bt.get_device(), bct.get_device())
self.assertIsInstance(bct, type(bt))
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_reduce_add(self):
x = torch.randn(5, 5)
y = torch.randn(5, 5)
x_cuda = x.cuda(0)
y_cuda = y.cuda(1)
result = comm.reduce_add((x_cuda, y_cuda))
self.assertEqual(result.get_device(), 0)
self.assertEqual(result.cpu(), x + y)
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_reduce_add_coalesced(self):
numel = 5
num_bytes = numel * 8
tensors = [
torch.randn(numel).long().cuda(),
torch.randn(numel).cuda(),
torch.randn(numel).long().cuda(),
torch.randn(numel).long().cuda(),
torch.randn(numel * 2).int().cuda(), # int is 2x shorter
torch.randn(numel).cuda(),
]
dup_tensors = [tensors, list(map(lambda t: t.cuda(1), tensors))]
r_tensors = list(map(comm.reduce_add, zip(*dup_tensors)))
for r, t in zip(r_tensors, tensors):
self.assertEqual(r.get_device(), t.get_device())
self.assertEqual(r, t * 2)
self.assertIsInstance(r, type(t))
rc_tensors = comm.reduce_add_coalesced(dup_tensors, buffer_size=num_bytes * 5 // 2)
self.assertEqual(r_tensors, rc_tensors)
for r, rc in zip(r_tensors, rc_tensors):
self.assertEqual(rc.get_device(), r.get_device())
self.assertIsInstance(rc, type(r))
def _test_scatter(self, input, chunk_sizes=None, dim=0):
if torch.cuda.device_count() < 2:
raise unittest.SkipTest("only one GPU detected")
result = comm.scatter(input, (0, 1), chunk_sizes, dim)
self.assertEqual(len(result), 2)
if chunk_sizes is None:
chunk_sizes = tuple(repeat(input.size(dim) // 2, 2))
chunk_start = 0
for i, r in enumerate(result):
chunk_end = chunk_start + chunk_sizes[i]
index = [slice(None, None), slice(None, None)]
index[dim] = slice(chunk_start, chunk_end)
self.assertEqual(r, input[tuple(index)], 0)
chunk_start = chunk_end
def test_scatter_cpu(self):
self._test_scatter(torch.randn(4, 4), dim=0)
def test_scatter_cpu_dim(self):
self._test_scatter(torch.randn(4, 4), dim=1)
def test_scatter_cpu_neg_dim(self):
self._test_scatter(torch.randn(4, 4), dim=-2)
def test_scatter_cpu_sizes(self):
self._test_scatter(torch.randn(6, 4), chunk_sizes=(2, 4))
def test_scatter_gpu(self):
self._test_scatter(torch.randn(4, 4).cuda(), dim=0)
def test_scatter_gpu_dim(self):
self._test_scatter(torch.randn(4, 4).cuda(), dim=1)
def test_scatter_gpu_neg_dim(self):
self._test_scatter(torch.randn(4, 4).cuda(), dim=-2)
def test_scatter_gpu_sizes(self):
self._test_scatter(torch.randn(6, 4).cuda(), chunk_sizes=(2, 4))
def _test_gather(self, dim):
if torch.cuda.device_count() < 2:
raise unittest.SkipTest("only one GPU detected")
x = torch.randn(2, 5).cuda(0)
y = torch.randn(2, 5).cuda(1)
result = comm.gather((x, y), dim)
expected_size = list(x.size())
expected_size[dim] += y.size(dim)
expected_size = torch.Size(expected_size)
self.assertEqual(result.get_device(), 0)
self.assertEqual(result.size(), expected_size)
index = [slice(None, None), slice(None, None)]
index[dim] = slice(0, x.size(dim))
self.assertEqual(result[tuple(index)], x)
index[dim] = slice(x.size(dim), x.size(dim) + y.size(dim))
self.assertEqual(result[tuple(index)], y)
def test_gather(self):
self._test_gather(0)
def test_gather_dim(self):
self._test_gather(1)
def test_from_sequence(self):
seq = [list(range(i * 4, i * 4 + 4)) for i in range(5)]
reference = torch.arange(0, 20).resize_(5, 4)
for t in types:
cuda_type = get_gpu_type(t)
self.assertEqual(cuda_type(seq), reference)
def test_torch_manual_seed_seeds_cuda_devices(self):
with freeze_rng_state():
x = torch.zeros(4, 4).float().cuda()
torch.manual_seed(2)
self.assertEqual(torch.cuda.initial_seed(), 2)
x.uniform_()
torch.manual_seed(2)
y = x.clone().uniform_()
self.assertEqual(x, y)
self.assertEqual(torch.cuda.initial_seed(), 2)
def test_manual_seed(self):
with freeze_rng_state():
x = torch.zeros(4, 4).float().cuda()
torch.cuda.manual_seed(2)
self.assertEqual(torch.cuda.initial_seed(), 2)
x.uniform_()
torch.cuda.manual_seed(2)
y = x.clone().uniform_()
self.assertEqual(x, y)
self.assertEqual(torch.cuda.initial_seed(), 2)
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_cat_autogpu(self):
x = torch.randn(4, 4).cuda(1)
y = torch.randn(4, 4).cuda(1)
z = torch.cat([x, y], 0)
self.assertEqual(z.get_device(), x.get_device())
def test_serialization(self):
x = torch.randn(4, 4).cuda()
with tempfile.NamedTemporaryFile() as f:
torch.save(x, f)
f.seek(0)
x_copy = torch.load(f)
self.assertEqual(x_copy, x)
self.assertIs(type(x_copy), type(x))
self.assertEqual(x_copy.get_device(), x.get_device())
def test_serialization_array_with_empty(self):
x = [torch.randn(4, 4).cuda(), torch.cuda.FloatTensor()]
with tempfile.NamedTemporaryFile() as f:
torch.save(x, f)
f.seek(0)
x_copy = torch.load(f)
for original, copy in zip(x, x_copy):
self.assertEqual(copy, original)
self.assertIs(type(copy), type(original))
self.assertEqual(copy.get_device(), original.get_device())
@unittest.skipIf(torch.cuda.device_count() < 2, "detected only one GPU")
def test_multigpu_serialization(self):
x = [torch.randn(4, 4).cuda(0), torch.randn(4, 4).cuda(1)]
with tempfile.NamedTemporaryFile() as f:
torch.save(x, f)
f.seek(0)
x_copy = torch.load(f)
for original, copy in zip(x, x_copy):
self.assertEqual(copy, original)
self.assertIs(type(copy), type(original))
self.assertEqual(copy.get_device(), original.get_device())
@unittest.skipIf(torch.cuda.device_count() < 2, "detected only one GPU")
def test_multigpu_serialization_remap(self):
x = [torch.randn(4, 4).cuda(0), torch.randn(4, 4).cuda(1)]
def gpu_remap(storage, location):
if location == 'cuda:1':
return storage.cuda(0)
with tempfile.NamedTemporaryFile() as f:
torch.save(x, f)
f.seek(0)
x_copy = torch.load(f, map_location=gpu_remap)
for original, copy in zip(x, x_copy):
self.assertEqual(copy, original)
self.assertIs(type(copy), type(original))
self.assertEqual(copy.get_device(), 0)
@unittest.skipIf(torch.cuda.device_count() < 2, "detected only one GPU")
def test_multigpu_serialization_remap_dict(self):
x = [torch.randn(4, 4).cuda(0), torch.randn(4, 4).cuda(1)]
with tempfile.NamedTemporaryFile() as f:
torch.save(x, f)
f.seek(0)
x_copy = torch.load(f, map_location={'cuda:1': 'cuda:0'})
for original, copy in zip(x, x_copy):
self.assertEqual(copy, original)
self.assertIs(type(copy), type(original))
self.assertEqual(copy.get_device(), 0)
@unittest.skipIf(torch.cuda.device_count() < 2, "detected only one GPU")
def test_cuda_set_device(self):
x = torch.randn(5, 5)
with torch.cuda.device(1):
self.assertEqual(x.cuda().get_device(), 1)
torch.cuda.set_device(0)
self.assertEqual(x.cuda().get_device(), 0)
with torch.cuda.device(1):
self.assertEqual(x.cuda().get_device(), 1)
self.assertEqual(x.cuda().get_device(), 0)
torch.cuda.set_device(1)
self.assertEqual(x.cuda().get_device(), 0)
def test_is_tensor(self):
for t in types:
tensor = get_gpu_type(t)()
self.assertTrue(torch.is_tensor(tensor))
self.assertTrue(torch.is_tensor(torch.cuda.HalfTensor()))
def test_cuda_synchronize(self):
torch.cuda.synchronize()
def test_streams(self):
default_stream = torch.cuda.current_stream()
user_stream = torch.cuda.Stream()
self.assertEqual(torch.cuda.current_stream(), default_stream)
self.assertNotEqual(default_stream, user_stream)
self.assertEqual(default_stream.cuda_stream, 0)
self.assertNotEqual(user_stream.cuda_stream, 0)
with torch.cuda.stream(user_stream):
self.assertEqual(torch.cuda.current_stream(), user_stream)
self.assertTrue(user_stream.query())
# copy 10 MB tensor from CPU-GPU which should take some time
tensor1 = torch.ByteTensor(10000000).pin_memory()
tensor2 = tensor1.cuda(async=True)
self.assertFalse(default_stream.query())
default_stream.synchronize()
self.assertTrue(default_stream.query())
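
This is the canonical stream recipe: work issued inside a torch.cuda.stream(...) block is queued on that stream, query() polls for completion without blocking, and synchronize() blocks until the stream drains. A compact sketch:

import torch

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    out = torch.cuda.FloatTensor(1 << 20).normal_()  # queued on s, returns immediately
if not s.query():    # may still be running
    s.synchronize()  # block the host until s is empty
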
@unittest.skipIf(torch.cuda.device_count() < 2, "detected only one GPU")
def test_streams_multi_gpu(self):
default_stream = torch.cuda.current_stream()
self.assertEqual(default_stream.device, 0)
stream = torch.cuda.Stream(device=1)
self.assertEqual(stream.device, 1)
with torch.cuda.device(1):
self.assertEqual(torch.cuda.current_stream().device, 1)
self.assertNotEqual(torch.cuda.current_stream(), default_stream)
@unittest.skipIf(torch.cuda.device_count() < 2, "multi-GPU not supported")
def test_tensor_device(self):
self.assertEqual(torch.cuda.FloatTensor(1).get_device(), 0)
self.assertEqual(torch.cuda.FloatTensor(1, device=1).get_device(), 1)
with torch.cuda.device(1):
self.assertEqual(torch.cuda.FloatTensor(1).get_device(), 1)
self.assertEqual(torch.cuda.FloatTensor(1, device=0).get_device(), 0)
self.assertEqual(torch.cuda.FloatTensor(1, device=None).get_device(), 1)
def test_events(self):
stream = torch.cuda.current_stream()
event = torch.cuda.Event(enable_timing=True)
self.assertTrue(event.query())
start_event = torch.cuda.Event(enable_timing=True)
stream.record_event(start_event)
torch.cuda._sleep(int(50 * get_cycles_per_ms()))
stream.record_event(event)
self.assertFalse(event.query())
event.synchronize()
self.assertTrue(event.query())
self.assertGreater(start_event.elapsed_time(event), 0)
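
Events bracket a region of a stream for timing: record a start event, enqueue work, record an end event, synchronize, then read elapsed_time (in milliseconds). A hedged sketch:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()                              # records on the current stream
torch.cuda.FloatTensor(1 << 20).uniform_()  # some GPU work to time
end.record()
end.synchronize()                           # wait until `end` has occurred
print('%.3f ms' % start.elapsed_time(end))
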
def test_record_stream(self):
cycles_per_ms = get_cycles_per_ms()
t = torch.FloatTensor([1, 2, 3, 4]).pin_memory()
result = torch.cuda.FloatTensor(t.size())
stream = torch.cuda.Stream()
ptr = [None]
# Performs the CPU->GPU copy in a background stream
def perform_copy():
with torch.cuda.stream(stream):
tmp = t.cuda(async=True)
ptr[0] = tmp.data_ptr()
torch.cuda.current_stream().wait_stream(stream)
tmp.record_stream(torch.cuda.current_stream())
torch.cuda._sleep(int(50 * cycles_per_ms)) # delay the copy
result.copy_(tmp)
perform_copy()
with torch.cuda.stream(stream):
tmp2 = torch.cuda.FloatTensor(t.size())
tmp2.zero_()
        self.assertNotEqual(tmp2.data_ptr(), ptr[0], 'allocation re-used too soon')
self.assertEqual(result.tolist(), [1, 2, 3, 4])
# Check that the block will be re-used after the main stream finishes
torch.cuda.current_stream().synchronize()
with torch.cuda.stream(stream):
tmp3 = torch.cuda.FloatTensor(t.size())
self.assertEqual(tmp3.data_ptr(), ptr[0], 'allocation not re-used')
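
record_stream is the allocator-side half of the pattern above: when a tensor allocated on one stream is consumed on another, recording that stream makes the caching allocator hold the block until the recorded stream has passed the point of the call. A sketch of the rule:

import torch

s = torch.cuda.Stream()
x = torch.cuda.FloatTensor(100).zero_()  # allocated on the default stream
with torch.cuda.stream(s):
    y = x * 2                            # x is now also in use on stream s
x.record_stream(s)                       # defer reuse of x's block until s catches up
del x                                    # safe: the allocator waits for s
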
def test_caching_pinned_memory(self):
cycles_per_ms = get_cycles_per_ms()
# check that allocations are re-used after deletion
t = torch.FloatTensor([1]).pin_memory()
ptr = t.data_ptr()
del t
t = torch.FloatTensor([1]).pin_memory()
self.assertEqual(t.data_ptr(), ptr, 'allocation not reused')
# check that the allocation is not re-used if it's in-use by a copy
gpu_tensor = torch.cuda.FloatTensor([0])
torch.cuda._sleep(int(50 * cycles_per_ms)) # delay the copy
gpu_tensor.copy_(t, async=True)
del t
t = torch.FloatTensor([1]).pin_memory()
self.assertNotEqual(t.data_ptr(), ptr, 'allocation re-used too soon')
self.assertEqual(list(gpu_tensor), [1])
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_caching_pinned_memory_multi_gpu(self):
# checks that the events preventing pinned memory from being re-used
# too early are recorded on the correct GPU
cycles_per_ms = get_cycles_per_ms()
t = torch.FloatTensor([1]).pin_memory()
ptr = t.data_ptr()
gpu_tensor0 = torch.cuda.FloatTensor([0], device=0)
gpu_tensor1 = torch.cuda.FloatTensor([0], device=1)
with torch.cuda.device(1):
torch.cuda._sleep(int(50 * cycles_per_ms)) # delay the copy
gpu_tensor1.copy_(t, async=True)
del t
t = torch.FloatTensor([2]).pin_memory()
self.assertNotEqual(t.data_ptr(), ptr, 'allocation re-used too soon')
with torch.cuda.device(0):
gpu_tensor0.copy_(t, async=True)
self.assertEqual(gpu_tensor1[0], 1)
self.assertEqual(gpu_tensor0[0], 2)
@staticmethod
def _select_broadcastable_dims(dims_full=None):
return TestTorch._select_broadcastable_dims(dims_full)
def test_broadcast(self):
TestTorch._test_broadcast(self, lambda t: t.cuda())
def test_broadcast_fallback(self):
TestTorch._test_broadcast_fallback(self, lambda t: t.cuda())
def test_broadcast_fused_matmul(self):
TestTorch._test_broadcast_fused_matmul(self, lambda t: t.cuda())
def test_broadcast_batched_matmul(self):
TestTorch._test_broadcast_batched_matmul(self, lambda t: t.cuda())
def test_advancedindex(self):
TestTorch._test_advancedindex(self, lambda t: t.cuda())
def test_advancedindex_big(self):
TestTorch._test_advancedindex_big(self, lambda t: t.cuda())
def test_btrifact(self):
TestTorch._test_btrifact(self, lambda t: t.cuda())
def test_btrisolve(self):
TestTorch._test_btrisolve(self, lambda t: t.cuda())
def test_tensor_gather(self):
TestTorch._test_gather(self, lambda t: t.cuda(), False)
def test_tensor_scatter(self):
TestTorch._test_scatter_base(self, lambda t: t.cuda(), 'scatter_', test_bounds=False)
def test_tensor_scatterAdd(self):
TestTorch._test_scatter_base(self, lambda t: t.cuda(), 'scatter_add_', test_bounds=False)
def test_tensor_scatterFill(self):
TestTorch._test_scatter_base(self, lambda t: t.cuda(), 'scatter_', True, test_bounds=False)
def test_arange(self):
for t in ['IntTensor', 'LongTensor', 'FloatTensor', 'DoubleTensor']:
a = torch.cuda.__dict__[t]()
torch.arange(0, 10, out=a)
b = torch.__dict__[t]()
torch.arange(0, 10, out=b)
self.assertEqual(a, b.cuda())
def test_nvtx(self):
# Just making sure we can see the symbols
torch.cuda.nvtx.range_push("foo")
torch.cuda.nvtx.mark("bar")
torch.cuda.nvtx.range_pop()
if HAS_CUDA:
for decl in tests:
for t in types:
tensor = t()
gpu_tensor = get_gpu_type(t)()
if len(decl) == 3:
name, constr, arg_constr = decl
desc = ''
elif len(decl) == 4:
name, constr, arg_constr, desc = decl
elif len(decl) == 5:
name, constr, arg_constr, desc, type_subset = decl
if t not in type_subset:
continue
precision = custom_precision.get(name, TestCuda.precision)
for inplace in (True, False):
if inplace:
name_inner = name + '_'
else:
name_inner = name
if not hasattr(tensor, name_inner):
continue
if not hasattr(gpu_tensor, name_inner):
print("Ignoring {}, because it's not implemented by torch.cuda.{}".format(
name_inner, gpu_tensor.__class__.__name__))
continue
test_name = 'test_' + t.__name__ + '_' + name_inner
if desc:
test_name += '_' + desc
assert not hasattr(TestCuda, test_name), "Duplicated test name: " + test_name
setattr(TestCuda, test_name, compare_cpu_gpu(constr, arg_constr, name_inner, t, precision))
if __name__ == '__main__':
run_tests()

test/test_dataloader.py (new file, +337 lines)
@@ -0,0 +1,337 @@
import math
import sys
import torch
import traceback
import unittest
from torch.utils.data import Dataset, TensorDataset, DataLoader, ConcatDataset
from common import TestCase, run_tests, TEST_NUMPY
from common_nn import TEST_CUDA
class TestTensorDataset(TestCase):
def test_len(self):
source = TensorDataset(torch.randn(15, 10, 2, 3, 4, 5), torch.randperm(15))
self.assertEqual(len(source), 15)
def test_getitem(self):
t = torch.randn(15, 10, 2, 3, 4, 5)
l = torch.randn(15, 10)
source = TensorDataset(t, l)
for i in range(15):
self.assertEqual(t[i], source[i][0])
self.assertEqual(l[i], source[i][1])
def test_getitem_1d(self):
t = torch.randn(15)
l = torch.randn(15)
source = TensorDataset(t, l)
for i in range(15):
self.assertEqual(t[i], source[i][0])
self.assertEqual(l[i], source[i][1])
class TestConcatDataset(TestCase):
def test_concat_two_singletons(self):
result = ConcatDataset([[0], [1]])
self.assertEqual(2, len(result))
self.assertEqual(0, result[0])
self.assertEqual(1, result[1])
def test_concat_two_non_singletons(self):
result = ConcatDataset([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
self.assertEqual(10, len(result))
self.assertEqual(0, result[0])
self.assertEqual(5, result[5])
def test_concat_two_non_singletons_with_empty(self):
# Adding an empty dataset somewhere is correctly handled
result = ConcatDataset([[0, 1, 2, 3, 4],
[],
[5, 6, 7, 8, 9]])
self.assertEqual(10, len(result))
self.assertEqual(0, result[0])
self.assertEqual(5, result[5])
def test_concat_raises_index_error(self):
result = ConcatDataset([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
with self.assertRaises(IndexError):
# this one goes to 11
result[11]
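
ConcatDataset's behavior in these tests follows from cumulative-size arithmetic; a minimal sketch (not the library's exact code) of mapping a global index to a (dataset, local index) pair:

import bisect

sizes = [5, 0, 5]                  # per-dataset lengths; empties are fine
cumulative, total = [], 0
for n in sizes:
    total += n
    cumulative.append(total)       # [5, 5, 10]

def locate(idx):
    if idx >= cumulative[-1]:
        raise IndexError(idx)      # e.g. index 11 in the test above
    d = bisect.bisect_right(cumulative, idx)
    return d, idx - (cumulative[d - 1] if d else 0)

assert locate(5) == (2, 0)         # skips the empty middle dataset
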
class ErrorDataset(Dataset):
def __init__(self, size):
self.size = size
def __len__(self):
return self.size
class TestDataLoader(TestCase):
def setUp(self):
self.data = torch.randn(100, 2, 3, 5)
self.labels = torch.randperm(50).repeat(2)
self.dataset = TensorDataset(self.data, self.labels)
def _test_sequential(self, loader):
batch_size = loader.batch_size
for i, (sample, target) in enumerate(loader):
idx = i * batch_size
self.assertEqual(sample, self.data[idx:idx + batch_size])
self.assertEqual(target, self.labels[idx:idx + batch_size])
self.assertEqual(i, math.floor((len(self.dataset) - 1) / batch_size))
def _test_shuffle(self, loader):
found_data = {i: 0 for i in range(self.data.size(0))}
found_labels = {i: 0 for i in range(self.labels.size(0))}
batch_size = loader.batch_size
for i, (batch_samples, batch_targets) in enumerate(loader):
for sample, target in zip(batch_samples, batch_targets):
for data_point_idx, data_point in enumerate(self.data):
if data_point.eq(sample).all():
self.assertFalse(found_data[data_point_idx])
found_data[data_point_idx] += 1
break
self.assertEqual(target, self.labels[data_point_idx])
found_labels[data_point_idx] += 1
self.assertEqual(sum(found_data.values()), (i + 1) * batch_size)
self.assertEqual(sum(found_labels.values()), (i + 1) * batch_size)
self.assertEqual(i, math.floor((len(self.dataset) - 1) / batch_size))
def _test_error(self, loader):
it = iter(loader)
errors = 0
while True:
try:
next(it)
except NotImplementedError:
errors += 1
except StopIteration:
self.assertEqual(errors,
math.ceil(float(len(loader.dataset)) / loader.batch_size))
return
def test_sequential(self):
self._test_sequential(DataLoader(self.dataset))
def test_sequential_batch(self):
self._test_sequential(DataLoader(self.dataset, batch_size=2))
def test_growing_dataset(self):
dataset = [torch.ones(4) for _ in range(4)]
dataloader_seq = DataLoader(dataset, shuffle=False)
dataloader_shuffle = DataLoader(dataset, shuffle=True)
dataset.append(torch.ones(4))
self.assertEqual(len(dataloader_seq), 5)
self.assertEqual(len(dataloader_shuffle), 5)
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_sequential_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, pin_memory=True)
for input, target in loader:
self.assertTrue(input.is_pinned())
self.assertTrue(target.is_pinned())
def test_shuffle(self):
self._test_shuffle(DataLoader(self.dataset, shuffle=True))
def test_shuffle_batch(self):
self._test_shuffle(DataLoader(self.dataset, batch_size=2, shuffle=True))
def test_sequential_workers(self):
self._test_sequential(DataLoader(self.dataset, num_workers=4))
    def test_sequential_batch_workers(self):
self._test_sequential(DataLoader(self.dataset, batch_size=2, num_workers=4))
def test_shuffle_workers(self):
self._test_shuffle(DataLoader(self.dataset, shuffle=True, num_workers=4))
def test_shuffle_batch_workers(self):
self._test_shuffle(DataLoader(self.dataset, batch_size=2, shuffle=True, num_workers=4))
def _test_batch_sampler(self, **kwargs):
# [(0, 1), (2, 3, 4), (5, 6), (7, 8, 9), ...]
batches = []
for i in range(0, 100, 5):
batches.append(tuple(range(i, i + 2)))
batches.append(tuple(range(i + 2, i + 5)))
dl = DataLoader(self.dataset, batch_sampler=batches, **kwargs)
self.assertEqual(len(dl), 40)
for i, (input, _target) in enumerate(dl):
if i % 2 == 0:
offset = i * 5 // 2
self.assertEqual(len(input), 2)
self.assertEqual(input, self.data[offset:offset + 2])
else:
offset = i * 5 // 2
self.assertEqual(len(input), 3)
self.assertEqual(input, self.data[offset:offset + 3])
def test_batch_sampler(self):
self._test_batch_sampler()
self._test_batch_sampler(num_workers=4)
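
Any iterable of index batches can serve as a batch_sampler; the loader yields exactly those batches in order, and len(loader) is the number of batches. A hedged usage sketch:

import torch
from torch.utils.data import TensorDataset, DataLoader

ds = TensorDataset(torch.randn(10, 3), torch.randperm(10))
batches = [(0, 1), (2, 3, 4), (5, 6), (7, 8, 9)]   # alternating sizes 2 and 3
loader = DataLoader(ds, batch_sampler=batches)
assert len(loader) == 4
for xb, yb in loader:
    assert xb.size(0) in (2, 3)
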
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_shuffle_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, shuffle=True, num_workers=4, pin_memory=True)
for input, target in loader:
self.assertTrue(input.is_pinned())
self.assertTrue(target.is_pinned())
@unittest.skipIf(not TEST_NUMPY, "numpy unavailable")
def test_numpy(self):
import numpy as np
class TestDataset(torch.utils.data.Dataset):
def __getitem__(self, i):
return np.ones((2, 3, 4)) * i
def __len__(self):
return 1000
loader = DataLoader(TestDataset(), batch_size=12)
batch = next(iter(loader))
self.assertIsInstance(batch, torch.DoubleTensor)
self.assertEqual(batch.size(), torch.Size([12, 2, 3, 4]))
def test_error(self):
self._test_error(DataLoader(ErrorDataset(100), batch_size=2, shuffle=True))
def test_error_workers(self):
self._test_error(DataLoader(ErrorDataset(41), batch_size=2, shuffle=True, num_workers=4))
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_partial_workers(self):
"check that workers exit even if the iterator is not exhausted"
loader = iter(DataLoader(self.dataset, batch_size=2, num_workers=4, pin_memory=True))
workers = loader.workers
pin_thread = loader.pin_thread
for i, sample in enumerate(loader):
if i == 3:
break
del loader
for w in workers:
w.join(1.0) # timeout of one second
self.assertFalse(w.is_alive(), 'subprocess not terminated')
self.assertEqual(w.exitcode, 0)
pin_thread.join(1.0)
self.assertFalse(pin_thread.is_alive())
def test_len(self):
def check_len(dl, expected):
self.assertEqual(len(dl), expected)
n = 0
for sample in dl:
n += 1
self.assertEqual(n, expected)
check_len(self.dataset, 100)
check_len(DataLoader(self.dataset, batch_size=2), 50)
check_len(DataLoader(self.dataset, batch_size=3), 34)
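
The 34 is the ceiling rule at work: when the last partial batch is kept, len(DataLoader) is ceil(len(dataset) / batch_size):

import math
assert math.ceil(100 / 3.0) == 34   # 33 full batches of 3, plus one batch of 1
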
@unittest.skipIf(not TEST_NUMPY, "numpy unavailable")
def test_numpy_scalars(self):
import numpy as np
class ScalarDataset(torch.utils.data.Dataset):
def __init__(self, dtype):
self.dtype = dtype
def __getitem__(self, i):
return self.dtype()
def __len__(self):
return 4
dtypes = {
np.float64: torch.DoubleTensor,
np.float32: torch.FloatTensor,
np.float16: torch.HalfTensor,
np.int64: torch.LongTensor,
np.int32: torch.IntTensor,
np.int16: torch.ShortTensor,
np.int8: torch.CharTensor,
np.uint8: torch.ByteTensor,
}
for dt, tt in dtypes.items():
dset = ScalarDataset(dt)
loader = DataLoader(dset, batch_size=2)
batch = next(iter(loader))
self.assertIsInstance(batch, tt)
class StringDataset(Dataset):
def __init__(self):
self.s = '12345'
def __len__(self):
return len(self.s)
def __getitem__(self, ndx):
return (self.s[ndx], ndx)
class TestStringDataLoader(TestCase):
def setUp(self):
self.dataset = StringDataset()
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_shuffle_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, shuffle=True, num_workers=4, pin_memory=True)
for batch_ndx, (s, n) in enumerate(loader):
self.assertIsInstance(s[0], str)
self.assertTrue(n.is_pinned())
class DictDataset(Dataset):
def __len__(self):
return 4
def __getitem__(self, ndx):
return {
'a_tensor': torch.Tensor(4, 2).fill_(ndx),
'another_dict': {
'a_number': ndx,
},
}
class TestDictDataLoader(TestCase):
def setUp(self):
self.dataset = DictDataset()
def test_sequential_batch(self):
loader = DataLoader(self.dataset, batch_size=2, shuffle=False)
batch_size = loader.batch_size
for i, sample in enumerate(loader):
idx = i * batch_size
self.assertEqual(set(sample.keys()), {'a_tensor', 'another_dict'})
self.assertEqual(set(sample['another_dict'].keys()), {'a_number'})
t = sample['a_tensor']
self.assertEqual(t.size(), torch.Size([batch_size, 4, 2]))
self.assertTrue((t[0] == idx).all())
self.assertTrue((t[1] == idx + 1).all())
n = sample['another_dict']['a_number']
self.assertEqual(n.size(), torch.Size([batch_size]))
self.assertEqual(n[0], idx)
self.assertEqual(n[1], idx + 1)
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, pin_memory=True)
for batch_ndx, sample in enumerate(loader):
self.assertTrue(sample['a_tensor'].is_pinned())
self.assertTrue(sample['another_dict']['a_number'].is_pinned())
if __name__ == '__main__':
run_tests()

test/test_distributed.py (new file, +548 lines)
@@ -0,0 +1,548 @@
import fcntl
import multiprocessing
import os
import sys
import time
import unittest
from functools import wraps, reduce
from contextlib import contextmanager
import torch
import torch.distributed as dist
from common import TestCase
BACKEND = os.environ['BACKEND']
TEMP_DIR = os.environ['TEMP_DIR']
MASTER_PORT = '29500'
MASTER_ADDR = '127.0.0.1'
if not dist.is_available():
print('Distributed not available, skipping tests')
sys.exit(0)
@contextmanager
def _lock():
lockfile = os.path.join(TEMP_DIR, 'lockfile')
with open(lockfile, 'w') as lf:
try:
fcntl.flock(lf.fileno(), fcntl.LOCK_EX)
yield
finally:
fcntl.flock(lf.fileno(), fcntl.LOCK_UN)
lf.close()
def _build_tensor(size, value=None):
if value is None:
value = size
return torch.FloatTensor(size, size, size).fill_(value)
class Barrier(object):
barrier_id = 0
@classmethod
def init(cls):
cls.barrier_id = 0
barrier_dir = os.path.join(TEMP_DIR, 'barrier')
for f_name in os.listdir(barrier_dir):
os.unlink(os.path.join(barrier_dir, f_name))
@classmethod
def sync(cls, timeout=5):
cls.barrier_id += 1
barrier_dir = os.path.join(TEMP_DIR, 'barrier')
pid = str(os.getpid())
barrier_file = os.path.join(barrier_dir, pid)
with _lock():
with open(barrier_file, 'w') as f:
f.write(str(cls.barrier_id))
start_time = time.time()
while True:
arrived = 0
with _lock():
for f_name in os.listdir(barrier_dir):
with open(os.path.join(barrier_dir, f_name), 'r') as f:
data = f.read()
if int(data) >= cls.barrier_id:
arrived += 1
if arrived == dist.get_world_size():
break
if time.time() - start_time > timeout:
raise RuntimeError("barrier timeout")
time.sleep(0.1)
class _DistTestBase(object):
def _barrier(self, *args, **kwargs):
Barrier.sync(*args, **kwargs)
def _init_group_test(self):
group = [1, 2]
group_id = dist.new_group(group)
rank = dist.get_rank()
if rank not in group:
return ([], None, rank)
return (group, group_id, rank)
def _init_global_test(self):
group = [i for i in range(0, dist.get_world_size())]
group_id = dist.group.WORLD
rank = dist.get_rank()
return (group, group_id, rank)
# GET RANK
def test_get_rank(self):
test_dir = os.path.join(TEMP_DIR, 'test_dir')
pid = str(os.getpid())
num_processes = dist.get_world_size()
with open(os.path.join(test_dir, pid), 'w') as f:
f.write(str(dist.get_rank()))
self._barrier()
all_ranks = set()
for f_name in os.listdir(test_dir):
with open(os.path.join(test_dir, f_name), 'r') as f:
all_ranks.add(int(f.read()))
self.assertEqual(len(all_ranks), num_processes)
self._barrier()
if dist.get_rank() == 0:
for f_name in os.listdir(test_dir):
os.unlink(os.path.join(test_dir, f_name))
self._barrier()
# SEND RECV
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support send/recv")
def test_send_recv(self):
rank = dist.get_rank()
tensor = _build_tensor(rank + 1)
for dest in range(0, dist.get_world_size()):
if dest == rank:
continue
dist.send(tensor, dest)
for src in range(0, dist.get_world_size()):
if src == rank:
continue
tensor = _build_tensor(src + 1, value=-1)
expected_tensor = _build_tensor(src + 1)
dist.recv(tensor, src)
self.assertEqual(tensor, expected_tensor)
self._barrier()
# SEND RECV ANY SOURCE
@unittest.skipIf(BACKEND == 'gloo',
"Gloo does not support send/recv from any source")
def test_send_recv_any_source(self):
rank = dist.get_rank()
tensor = _build_tensor(10, rank)
for dest in range(0, dist.get_world_size()):
if dest == rank:
continue
dist.send(tensor, dest)
recv_ranks = set()
for src in range(0, dist.get_world_size()):
if src == rank:
continue
tensor = _build_tensor(10, value=-1)
dist.recv(tensor)
recv_ranks.add(tensor.resize_(1)[0])
self.assertEqual(len(recv_ranks), dist.get_world_size() - 1)
self._barrier()
# ISEND
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support isend")
def test_isend(self):
rank = dist.get_rank()
world_size = dist.get_world_size()
if rank == 0:
requests = [
dist.isend(_build_tensor(dest, 10), dest) for dest in range(1, world_size)
]
for request in requests:
request.wait()
self.assertTrue(request.is_completed())
else:
tensor = _build_tensor(rank, -1)
dist.recv(tensor, 0)
self.assertEqual(tensor, _build_tensor(rank, 10))
self._barrier()
# IRECV
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support irecv")
def test_irecv(self):
rank = dist.get_rank()
world_size = dist.get_world_size()
if rank == 0:
expected_tensors = [_build_tensor(src, -1) for src in range(1, world_size)]
requests = [
dist.irecv(expected_tensors[src - 1], src) for src in range(1, world_size)
]
for src in range(1, world_size):
requests[src - 1].wait()
self.assertTrue(requests[src - 1].is_completed())
self.assertEqual(expected_tensors[src - 1], _build_tensor(src, 10))
else:
tensor = _build_tensor(rank, 10)
dist.send(tensor, 0)
self._barrier()
# BROADCAST
def _test_broadcast_helper(self, group, group_id, rank, cuda=False):
for src in group:
expected_tensor = _build_tensor(src + 1)
if cuda:
expected_tensor = expected_tensor.cuda()
if rank == src:
dist.broadcast(expected_tensor, src, group_id)
else:
tensor = _build_tensor(src + 1, -1)
if cuda:
tensor = tensor.cuda()
dist.broadcast(tensor, src, group_id)
self.assertEqual(tensor, expected_tensor)
self._barrier()
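
Every rank drives the collective: the source passes the payload, every other rank passes a same-shaped placeholder that dist.broadcast overwrites in place. A sketch for a single source (rank 0), using the _build_tensor helper defined above:

payload = _build_tensor(4) if dist.get_rank() == 0 else _build_tensor(4, -1)
dist.broadcast(payload, 0)   # afterwards every rank holds _build_tensor(4)
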
def test_broadcast(self):
group, group_id, rank = self._init_global_test()
self._test_broadcast_helper(group, group_id, rank)
@unittest.skipIf(BACKEND != 'gloo', "Only Gloo backend supports CUDA allReduce")
def test_broadcast_cuda(self):
group, group_id, rank = self._init_global_test()
self._test_broadcast_helper(group, group_id, rank, True)
def test_broadcast_group(self):
group, group_id, rank = self._init_group_test()
self._test_broadcast_helper(group, group_id, rank)
# REDUCE
def _test_reduce_helper(self, group, group_id, rank, op, master_value, worker_value, expected_value):
for src in group:
if rank == src:
tensor = _build_tensor(src + 1).fill_(master_value)
dist.reduce(tensor, src, op, group_id)
self.assertEqual(tensor, _build_tensor(src + 1, expected_value))
else:
tensor = _build_tensor(src + 1).fill_(worker_value)
dist.reduce(tensor, src, op, group_id)
self._barrier()
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_sum(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_product(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.PRODUCT,
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_min(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_max(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MAX, -1, 10, 10
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_sum(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_product(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.PRODUCT,
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_min(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_max(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MAX, -1, 10, 10
)
# ALL REDUCE
def _test_all_reduce_helper(self, group, group_id, rank, op, master_value,
worker_value, expected_value, cuda=False):
for src in group:
if rank == src:
tensor = _build_tensor(src + 1).fill_(master_value)
if cuda:
tensor = tensor.cuda()
dist.all_reduce(tensor, op, group_id)
self.assertEqual(tensor, _build_tensor(src + 1, expected_value))
else:
tensor = _build_tensor(src + 1).fill_(worker_value)
if cuda:
tensor = tensor.cuda()
dist.all_reduce(tensor, op, group_id)
self.assertEqual(tensor, _build_tensor(src + 1, expected_value))
self._barrier()
def test_all_reduce_sum(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
@unittest.skipIf(BACKEND != 'gloo', "Only Gloo backend supports CUDA allReduce")
def test_all_reduce_sum_cuda(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1)), True
)
def test_all_reduce_product(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.PRODUCT,
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
def test_all_reduce_min(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
def test_all_reduce_max(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.MAX, -1, 10, 10
)
def test_all_reduce_group_sum(self):
group, group_id, rank = self._init_group_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
def test_all_reduce_group_product(self):
group, group_id, rank = self._init_group_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.PRODUCT,
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
def test_all_reduce_group_min(self):
group, group_id, rank = self._init_group_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
def test_all_reduce_group_max(self):
group, group_id, rank = self._init_group_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.MAX, -1, 10, 10
)
# SCATTER
def _test_scatter_helper(self, group, group_id, rank):
for dest in group:
tensor = _build_tensor(dest + 1, -1)
expected_tensor = _build_tensor(dest + 1, rank)
tensors = [_build_tensor(dest + 1, i) for i in group] if rank == dest else []
dist.scatter(tensor, src=dest, scatter_list=tensors, group=group_id)
self.assertEqual(tensor, expected_tensor)
self._barrier()
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support scatter")
def test_scatter(self):
group, group_id, rank = self._init_global_test()
self._test_scatter_helper(group, group_id, rank)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support scatter")
def test_scatter_group(self):
group, group_id, rank = self._init_group_test()
self._test_scatter_helper(group, group_id, rank)
# GATHER
def _test_gather_helper(self, group, group_id, rank):
for dest in group:
tensor = _build_tensor(dest + 1, rank)
tensors = [_build_tensor(dest + 1, -1) for i in group] if rank == dest else []
dist.gather(tensor, dst=dest, gather_list=tensors, group=group_id)
if rank == dest:
expected_tensors = [_build_tensor(dest + 1, i) for i in group]
for t1, t2 in zip(tensors, expected_tensors):
self.assertEqual(t1, t2)
self._barrier()
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support gather")
def test_gather(self):
group, group_id, rank = self._init_global_test()
self._test_gather_helper(group, group_id, rank)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support gather")
def test_gather_group(self):
group, group_id, rank = self._init_group_test()
self._test_gather_helper(group, group_id, rank)
# ALL GATHER
def _test_all_gather_helper(self, group, group_id, rank):
for dest in group:
tensor = _build_tensor(dest + 1, rank)
tensors = [_build_tensor(dest + 1, -1) for i in group]
dist.all_gather(tensors, tensor, group_id)
expected_tensors = [_build_tensor(dest + 1, i) for i in group]
for t1, t2 in zip(tensors, expected_tensors):
self.assertEqual(t1, t2)
self._barrier()
def test_all_gather(self):
group, group_id, rank = self._init_global_test()
self._test_all_gather_helper(group, group_id, rank)
def test_all_gather_group(self):
group, group_id, rank = self._init_group_test()
self._test_all_gather_helper(group, group_id, rank)
# BARRIER
def _test_barrier_helper(self, group, group_id, rank):
WAIT_TIME = 0.3 # seconds
for dest in group:
expected_time = torch.DoubleTensor(1).fill_(0.0)
if dest == rank:
expected_time.fill_(time.time() + WAIT_TIME)
dist.broadcast(expected_time, dest, group_id)
time.sleep(WAIT_TIME + 0.1) # sleep a little bit longer
dist.barrier(group_id)
else:
dist.broadcast(expected_time, dest, group_id)
dist.barrier(group_id)
self.assertGreaterEqual(time.time(), expected_time[0])
self._barrier()
def test_barrier(self):
group, group_id, rank = self._init_global_test()
self._test_barrier_helper(group, group_id, rank)
def test_barrier_group(self):
group, group_id, rank = self._init_group_test()
self._test_barrier_helper(group, group_id, rank)
if BACKEND == 'tcp' or BACKEND == 'gloo':
WORLD_SIZE = os.environ['WORLD_SIZE']
class TestTCPOrGloo(TestCase, _DistTestBase):
MANAGER_PROCESS_RANK = -1
JOIN_TIMEOUT = 10
@staticmethod
def manager_join(fn):
@wraps(fn)
def wrapper(self):
if self.rank == self.MANAGER_PROCESS_RANK:
self._join_and_reduce()
else:
fn(self)
return wrapper
@classmethod
def setUpClass(cls):
os.environ['MASTER_ADDR'] = MASTER_ADDR
os.environ['MASTER_PORT'] = MASTER_PORT
os.environ['WORLD_SIZE'] = WORLD_SIZE
for attr in dir(cls):
if attr.startswith('test'):
fn = getattr(cls, attr)
setattr(cls, attr, cls.manager_join(fn))
def setUp(self):
self.processes = []
self.rank = self.MANAGER_PROCESS_RANK
Barrier.init()
for rank in range(int(WORLD_SIZE)):
self.processes.append(self._spawn_process(rank))
def tearDown(self):
for p in self.processes:
p.terminate()
def _spawn_process(self, rank):
os.environ['RANK'] = str(rank)
name = 'process ' + str(rank)
process = multiprocessing.Process(target=self._run, name=name,
args=(rank,))
process.start()
return process
def _run(self, rank):
self.rank = rank
try:
dist.init_process_group(backend=BACKEND)
except RuntimeError as e:
if 'recompile' in e.args[0]:
sys.exit(0)
# self.id() == e.g. '__main__.TestDistributed.test_get_rank'
        # We're retrieving the corresponding test and executing it.
getattr(self, self.id().split(".")[2])()
sys.exit(0)
def _join_and_reduce(self):
for p in self.processes:
p.join(self.JOIN_TIMEOUT)
self.assertEqual(p.exitcode, 0)
elif BACKEND == 'mpi':
dist.init_process_group(backend='mpi')
class TestMPI(TestCase, _DistTestBase):
pass
if __name__ == '__main__':
unittest.main()

test/test_legacy_nn.py (new file, +1258 lines; diff suppressed because it is too large)

test/test_multiprocessing.py (new file, +421 lines)
@@ -0,0 +1,421 @@
import contextlib
import gc
import os
import sys
import time
import unittest
from sys import platform
import torch
import torch.cuda
import torch.multiprocessing as mp
from torch.autograd import Variable
from torch.nn import Parameter
from common import TestCase, run_tests
TEST_REPEATS = 30
HAS_SHM_FILES = os.path.isdir('/dev/shm')
TEST_CUDA_IPC = torch.cuda.is_available() and \
sys.version_info[0] == 3 and \
sys.platform != 'darwin'
TEST_MULTIGPU = TEST_CUDA_IPC and torch.cuda.device_count() > 1
def simple_fill(queue, event):
data = queue.get()
data[0][:] = 4
event.set()
def simple_pool_fill(tensor):
tensor.fill_(4)
return tensor.add(1)
def send_tensor(queue, event, tp):
t = torch.ones(5, 5).type(tp)
queue.put(t)
queue.put(t)
event.wait()
def sum_tensors(inq, outq):
with torch.cuda.device(1):
tensors = inq.get()
for tensor in tensors:
outq.put((tensor.sum(), tensor.get_device(),
tensor.numel(), tensor.storage().size()))
def queue_get_exception(inqueue, outqueue):
os.close(2) # hide expected error message
try:
torch.zeros(5, 5).cuda()
except Exception as e:
outqueue.put(e)
else:
outqueue.put('no exception')
# Multiply by two in a separate stream
def cuda_multiply_two(queue, ready, done):
ready.set()
with torch.cuda.stream(torch.cuda.Stream()):
cuda_event, tensor = queue.get()
cuda_event.wait()
tensor.mul_(2)
cuda_event.record()
done.set()
del cuda_event
def autograd_sharing(queue, ready, master_modified):
var = queue.get()
ready.set()
master_modified.wait()
expected_var = torch.arange(1, 26).view(5, 5)
expected_var[0, 0] = 1000
is_ok = var.data.equal(expected_var)
var.data[:] = torch.ones(5, 5)
is_ok &= var.grad is None
var._grad = Variable(torch.ones(5, 5), requires_grad=False)
queue.put(is_ok)
@contextlib.contextmanager
def fs_sharing():
prev_strategy = mp.get_sharing_strategy()
mp.set_sharing_strategy('file_system')
try:
yield
finally:
mp.set_sharing_strategy(prev_strategy)
class leak_checker(object):
def __init__(self, test_case):
self.checked_pids = [os.getpid()]
self.test_case = test_case
def __enter__(self):
self.next_fds = self._get_next_fds(10)
return self
def __exit__(self, *args):
if args[0] is None:
# Check that the 10th available file-descriptor at the end of the
# test is no more than 4 higher than the 10th available at the
# start. This attempts to catch file descriptor leaks, but allows
# one-off initialization that may use up a file descriptor
# TODO: Disabled because this check is too flaky
# available_fds = self._get_next_fds(10)
# self.test_case.assertLessEqual(
# available_fds[-1] - self.next_fds[-1], 5)
self.test_case.assertFalse(self.has_shm_files())
return False
def check_pid(self, pid):
self.checked_pids.append(pid)
def _get_next_fds(self, n=1):
# dup uses the lowest-numbered unused descriptor for the new descriptor
fds = [os.dup(0) for i in range(n)]
for fd in fds:
os.close(fd)
return fds
def has_shm_files(self, wait=True):
if not HAS_SHM_FILES:
return False
result = self._has_shm_files()
if result and mp.get_sharing_strategy() == 'file_system' and wait:
time.sleep(0.5)
return self._has_shm_files()
return result
def _has_shm_files(self):
gc.collect()
names = list('torch_' + str(pid) for pid in self.checked_pids)
for filename in os.listdir('/dev/shm'):
for name in names:
if filename.startswith(name):
return True
return False
class TestMultiprocessing(TestCase):
def _test_sharing(self, ctx=mp, type=torch.FloatTensor, repeat=1):
def test_fill():
x = torch.zeros(5, 5).type(type)
q = ctx.Queue()
e = ctx.Event()
data = [x, x[:, 1]]
q.put(data)
p = ctx.Process(target=simple_fill, args=(q, e))
p.daemon = True
lc.check_pid(p.pid)
p.start()
e.wait(10)
self.assertTrue(e.is_set())
self.assertTrue(data[0].eq(4).all())
self.assertTrue(data[1].eq(4).all())
p.join(1)
self.assertFalse(p.is_alive())
def test_receive():
q = ctx.Queue()
e = ctx.Event()
p = ctx.Process(target=send_tensor, args=(q, e, type))
p.daemon = True
lc.check_pid(p.pid)
p.start()
t1 = q.get()
t2 = q.get()
self.assertTrue(t1.eq(1).all())
self.assertTrue(id(t1.storage()) == id(t2.storage()))
e.set()
p.join(1)
self.assertFalse(p.is_alive())
with leak_checker(self) as lc:
for _ in range(repeat):
test_fill()
test_receive()
def _test_preserve_sharing(self, ctx=mp, repeat=1):
def do_test():
x = torch.randn(5, 5)
data = [x.storage(), x.storage()[1:4], x, x[2], x[:, 1]]
q = ctx.Queue()
q.put(data)
new_data = q.get(timeout=1)
self.assertEqual(new_data, data, 0)
storage_cdata = data[0]._cdata
self.assertEqual(new_data[0]._cdata, storage_cdata)
for t in new_data[2:]:
self.assertEqual(t.storage()._cdata, storage_cdata)
# TODO: enable after fixing #46
# new_data[0].fill_(10)
# self.assertEqual(new_data[1], new_data[0][1:4], 0)
with leak_checker(self):
for i in range(repeat):
do_test()
def _test_pool(self, ctx=mp, repeat=1):
def do_test():
p = ctx.Pool(2)
for proc in p._pool:
lc.check_pid(proc.pid)
buffers = [torch.zeros(2, 2) for i in range(4)]
results = p.map(simple_pool_fill, buffers, 1)
self.assertEqual(len(results), len(buffers))
for r in results:
self.assertEqual(r, torch.ones(2, 2) * 5, 0)
for b in buffers:
self.assertEqual(b, torch.ones(2, 2) * 4, 0)
p.close()
p.join()
with leak_checker(self) as lc:
for i in range(repeat):
do_test()
@unittest.skipIf(platform == 'darwin', "file descriptor strategy is not supported on OS X")
def test_fd_sharing(self):
self._test_sharing(repeat=TEST_REPEATS)
@unittest.skipIf(platform == 'darwin', "file descriptor strategy is not supported on OS X")
def test_fd_preserve_sharing(self):
self._test_preserve_sharing(repeat=TEST_REPEATS)
@unittest.skipIf(platform == 'darwin', "file descriptor strategy is not supported on OS X")
def test_fd_pool(self):
self._test_pool(repeat=TEST_REPEATS)
def test_fs_sharing(self):
with fs_sharing():
self._test_sharing(repeat=TEST_REPEATS)
def test_fs_preserve_sharing(self):
with fs_sharing():
self._test_preserve_sharing(repeat=TEST_REPEATS)
def test_fs_pool(self):
with fs_sharing():
self._test_pool(repeat=TEST_REPEATS)
    @unittest.skipIf(not HAS_SHM_FILES, "don't know how to check if shm files exist")
def test_fs(self):
def queue_put():
x = torch.DoubleStorage(4)
q = mp.Queue()
self.assertFalse(lc.has_shm_files())
q.put(x)
time.sleep(0.05) # queue serializes asynchronously
self.assertTrue(lc.has_shm_files(wait=False))
q.get()
with fs_sharing(), leak_checker(self) as lc:
for _ in range(TEST_REPEATS):
queue_put()
def test_inherit_tensor(self):
class SubProcess(mp.Process):
def __init__(self, tensor):
super(SubProcess, self).__init__()
self.tensor = tensor
self.daemon = True
def run(self):
self.tensor.add_(3)
t = torch.zeros(5, 5)
p = SubProcess(t.share_memory_())
p.start()
p.join(1)
self.assertEqual(t, torch.ones(5, 5) * 3, 0)
@unittest.skipIf(not TEST_CUDA_IPC, 'CUDA IPC not available')
def test_cuda(self):
torch.cuda.FloatTensor([1]) # initialize CUDA outside of leak checker
self._test_sharing(mp.get_context('spawn'), torch.cuda.FloatTensor)
@unittest.skipIf(not TEST_CUDA_IPC, 'CUDA IPC not available')
@unittest.skipIf(not TEST_MULTIGPU, 'found only 1 GPU')
def test_cuda_small_tensors(self):
# Check multiple small tensors which will likely use the same
# underlying cached allocation
ctx = mp.get_context('spawn')
tensors = []
for i in range(5):
device = i % 2
tensors += [torch.arange(i * 5, (i + 1) * 5).cuda(device)]
inq = ctx.Queue()
outq = ctx.Queue()
inq.put(tensors)
p = ctx.Process(target=sum_tensors, args=(inq, outq))
p.start()
results = []
for i in range(5):
results.append(outq.get())
p.join()
for i, tensor in enumerate(tensors):
v, device, tensor_size, storage_size = results[i]
self.assertEqual(v, torch.arange(i * 5, (i + 1) * 5).sum())
self.assertEqual(device, i % 2)
self.assertEqual(tensor_size, 5)
self.assertEqual(storage_size, 5)
@unittest.skipIf(not torch.cuda.is_available(), 'CUDA not available')
def test_cuda_bad_call(self):
# Initialize CUDA
t = torch.zeros(5, 5).cuda().cpu()
inq = mp.Queue()
outq = mp.Queue()
p = mp.Process(target=queue_get_exception, args=(inq, outq))
p.start()
inq.put(t)
p.join()
self.assertIsInstance(outq.get(), RuntimeError)
@unittest.skipIf(not TEST_CUDA_IPC, 'CUDA IPC not available')
def test_event(self):
ctx = mp.get_context('spawn')
queue = ctx.Queue()
ready = ctx.Event()
done = ctx.Event()
p = ctx.Process(target=cuda_multiply_two, args=(queue, ready, done))
p.start()
ready.wait()
with torch.cuda.stream(torch.cuda.Stream()):
tensor = torch.cuda.FloatTensor([1, 1, 1, 1])
# Use a sleep kernel to test events. Without the event, the
# multiply happens before the add.
event = torch.cuda.Event(interprocess=True)
torch.cuda._sleep(20000000) # about 30 ms
tensor.add_(1)
event.record()
queue.put((event, tensor))
done.wait() # must wait until subprocess records event
event.synchronize()
self.assertEqual(list(tensor), [4, 4, 4, 4])
p.join()
def _test_autograd_sharing(self, var):
ready = mp.Event()
master_modified = mp.Event()
queue = mp.Queue()
p = mp.Process(target=autograd_sharing, args=(queue, ready, master_modified))
p.daemon = True
p.start()
var._grad = Variable(torch.zeros(5, 5), requires_grad=False)
queue.put(var)
ready.wait()
var.data[0, 0] = 1000
var.grad.data[:] = torch.ones(5, 5) * 4
master_modified.set()
worker_ok = queue.get()
self.assertTrue(worker_ok)
self.assertEqual(var.data, torch.ones(5, 5))
self.assertEqual(var.grad.data, torch.ones(5, 5) * 4)
p.join(1)
self.assertFalse(p.is_alive())
def test_variable_sharing(self):
configs = [
(True, False),
(False, False),
(False, True),
]
for requires_grad, volatile in configs:
var = Variable(torch.arange(1, 26).view(5, 5),
requires_grad=requires_grad,
volatile=volatile)
self._test_autograd_sharing(var)
def test_parameter_sharing(self):
param = Parameter(torch.arange(1, 26).view(5, 5))
self._test_autograd_sharing(param)
def test_empty_shared(self):
t = torch.Tensor()
t.share_memory_()
def _test_is_shared(self):
t = torch.randn(5, 5)
self.assertFalse(t.is_shared())
t.share_memory_()
self.assertTrue(t.is_shared())
@unittest.skipIf(platform == 'darwin', "file descriptor strategy is not supported on OS X")
def test_is_shared(self):
self._test_is_shared()
def test_fs_is_shared(self):
with fs_sharing():
self._test_is_shared()
@unittest.skipIf(not torch.cuda.is_available(), 'CUDA not available')
def test_is_shared_cuda(self):
t = torch.randn(5, 5).cuda()
self.assertTrue(t.is_shared())
if __name__ == '__main__':
run_tests()

test/test_nccl.py (new file, +88 lines)
@@ -0,0 +1,88 @@
import unittest
import torch
import torch.cuda.nccl as nccl
import torch.cuda
from common import TestCase, run_tests
nGPUs = torch.cuda.device_count()
if nGPUs == 0:
print('CUDA not available, skipping tests')
TestCase = object # noqa: F811
class TestNCCL(TestCase):
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_broadcast(self):
expected = torch.FloatTensor(128).uniform_()
tensors = [expected.cuda()]
for device in range(1, torch.cuda.device_count()):
with torch.cuda.device(device):
tensors.append(torch.cuda.FloatTensor(128))
nccl.broadcast(tensors)
for i in range(torch.cuda.device_count()):
self.assertEqual(tensors[i], expected)
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_reduce(self):
tensors = [torch.FloatTensor(128).uniform_() for i in range(nGPUs)]
expected = torch.FloatTensor(128).zero_()
for t in tensors:
expected.add_(t)
tensors = [tensors[i].cuda(i) for i in range(nGPUs)]
nccl.reduce(tensors)
self.assertEqual(tensors[0], expected)
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_all_reduce(self):
tensors = [torch.FloatTensor(128).uniform_() for i in range(nGPUs)]
expected = torch.FloatTensor(128).zero_()
for t in tensors:
expected.add_(t)
tensors = [tensors[i].cuda(i) for i in range(nGPUs)]
nccl.all_reduce(tensors)
for tensor in tensors:
self.assertEqual(tensor, expected)
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_all_gather(self):
inputs = [torch.FloatTensor(128).uniform_() for i in range(nGPUs)]
expected = torch.cat(inputs, 0)
inputs = [inputs[i].cuda(i) for i in range(nGPUs)]
outputs = [torch.cuda.FloatTensor(128 * nGPUs, device=i)
for i in range(nGPUs)]
nccl.all_gather(inputs, outputs)
for tensor in outputs:
self.assertEqual(tensor, expected)
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_reduce_scatter(self):
in_size = 32 * nGPUs
out_size = 32
inputs = [torch.FloatTensor(in_size).uniform_() for i in range(nGPUs)]
expected = torch.FloatTensor(in_size).zero_()
for t in inputs:
expected.add_(t)
expected = expected.view(nGPUs, 32)
inputs = [inputs[i].cuda(i) for i in range(nGPUs)]
outputs = [torch.cuda.FloatTensor(out_size, device=i)
for i in range(nGPUs)]
nccl.reduce_scatter(inputs, outputs)
for i in range(nGPUs):
self.assertEqual(outputs[i], expected[i])
if __name__ == '__main__':
run_tests()

test/test_nn.py (new file, +3923 lines; diff suppressed because it is too large)

test/test_optim.py (new file, +558 lines)
@@ -0,0 +1,558 @@
import unittest
import functools
from copy import deepcopy
import torch
import torch.optim as optim
import torch.legacy.optim as old_optim
import torch.nn.functional as F
from torch.optim import SGD
from torch.autograd import Variable
from torch import sparse
from torch.optim.lr_scheduler import LambdaLR, StepLR, MultiStepLR, ExponentialLR, ReduceLROnPlateau
from common import TestCase, run_tests
def rosenbrock(tensor):
x, y = tensor
return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2
def drosenbrock(tensor):
x, y = tensor
return torch.DoubleTensor((-400 * x * (y - x ** 2) - 2 * (1 - x), 200 * (y - x ** 2)))
def wrap_old_fn(old_fn, **config):
def wrapper(closure, params, state):
return old_fn(closure, params, config, state)
return wrapper
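
drosenbrock is the closed-form gradient of rosenbrock: for f(x, y) = (1 - x)**2 + 100*(y - x**2)**2, df/dx = -400*x*(y - x**2) - 2*(1 - x) and df/dy = 200*(y - x**2). A quick finite-difference check of the x-component (illustration only):

x, y, eps = 1.5, 1.5, 1e-6
f = lambda a, b: (1 - a) ** 2 + 100 * (b - a ** 2) ** 2
dx = -400 * x * (y - x ** 2) - 2 * (1 - x)             # analytic: 451.0
assert abs((f(x + eps, y) - f(x, y)) / eps - dx) < 1e-2
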
class TestOptim(TestCase):
def _test_rosenbrock(self, constructor, old_fn):
params_t = torch.Tensor([1.5, 1.5])
state = {}
params = Variable(torch.Tensor([1.5, 1.5]), requires_grad=True)
optimizer = constructor([params])
solution = torch.Tensor([1, 1])
initial_dist = params.data.dist(solution)
def eval():
optimizer.zero_grad()
loss = rosenbrock(params)
loss.backward()
# loss.backward() will give **slightly** different
            # gradients than drosenbrock, because of a different ordering
# of floating point operations. In most cases it doesn't matter,
# but some optimizers are so sensitive that they can temporarily
# diverge up to 1e-4, just to converge again. This makes the
# comparison more stable.
params.grad.data.copy_(drosenbrock(params.data))
return loss
for i in range(2000):
optimizer.step(eval)
old_fn(lambda _: (rosenbrock(params_t), drosenbrock(params_t)),
params_t, state)
self.assertEqual(params.data, params_t)
self.assertLessEqual(params.data.dist(solution), initial_dist)
def _test_rosenbrock_sparse(self, constructor):
params_t = torch.Tensor([1.5, 1.5])
params = Variable(torch.Tensor([1.5, 1.5]), requires_grad=True)
params_c = Variable(torch.Tensor([1.5, 1.5]), requires_grad=True)
optimizer = constructor([params])
optimizer_c = constructor([params_c])
solution = torch.Tensor([1, 1])
initial_dist = params.data.dist(solution)
def eval(params, sparse_grad, w):
# Depending on w, provide only the x or y gradient
optimizer.zero_grad()
loss = rosenbrock(params)
loss.backward()
grad = drosenbrock(params.data)
# NB: We torture test the optimizer by returning an
# uncoalesced sparse tensor
if w:
i = torch.LongTensor([[0, 0]])
x = grad[0]
v = torch.DoubleTensor([x / 4., x - x / 4.])
else:
i = torch.LongTensor([[1, 1]])
y = grad[1]
v = torch.DoubleTensor([y - y / 4., y / 4.])
x = sparse.DoubleTensor(i, v, torch.Size([2]))
if sparse_grad:
params.grad.data = x
else:
params.grad.data = x.to_dense()
return loss
for i in range(2000):
# Do cyclic coordinate descent
w = i % 2
optimizer.step(functools.partial(eval, params, True, w))
optimizer_c.step(functools.partial(eval, params_c, False, w))
self.assertEqual(params.data, params_c.data)
self.assertLessEqual(params.data.dist(solution), initial_dist)
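
The "torture test" comment refers to uncoalesced sparse tensors: several values stored at the same index, which must be summed when densified or applied. A small sketch (assuming duplicate indices sum on to_dense, which the dense/sparse parity check above relies on):

import torch
from torch import sparse

i = torch.LongTensor([[0, 0]])                 # two entries at index 0
v = torch.DoubleTensor([1.0, 3.0])
g = sparse.DoubleTensor(i, v, torch.Size([2]))
assert g.to_dense().tolist() == [4.0, 0.0]     # duplicates are summed
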
def _test_basic_cases_template(self, weight, bias, input, constructor):
weight = Variable(weight, requires_grad=True)
bias = Variable(bias, requires_grad=True)
input = Variable(input)
optimizer = constructor(weight, bias)
def fn():
optimizer.zero_grad()
y = weight.mv(input)
if y.is_cuda and bias.is_cuda and y.get_device() != bias.get_device():
y = y.cuda(bias.get_device())
loss = (y + bias).pow(2).sum()
loss.backward()
return loss
initial_value = fn().data[0]
for i in range(200):
optimizer.step(fn)
self.assertLess(fn().data[0], initial_value)
def _test_state_dict(self, weight, bias, input, constructor):
weight = Variable(weight, requires_grad=True)
bias = Variable(bias, requires_grad=True)
input = Variable(input)
def fn_base(optimizer, weight, bias):
optimizer.zero_grad()
loss = (weight.mv(input) + bias).pow(2).sum()
loss.backward()
return loss
optimizer = constructor(weight, bias)
fn = functools.partial(fn_base, optimizer, weight, bias)
# Prime the optimizer
for i in range(20):
optimizer.step(fn)
# Clone the weights and construct new optimizer for them
weight_c = Variable(weight.data.clone(), requires_grad=True)
bias_c = Variable(bias.data.clone(), requires_grad=True)
optimizer_c = constructor(weight_c, bias_c)
fn_c = functools.partial(fn_base, optimizer_c, weight_c, bias_c)
# Load state dict
state_dict = deepcopy(optimizer.state_dict())
state_dict_c = deepcopy(optimizer.state_dict())
optimizer_c.load_state_dict(state_dict_c)
# Run both optimizations in parallel
for i in range(20):
optimizer.step(fn)
optimizer_c.step(fn_c)
self.assertEqual(weight, weight_c)
self.assertEqual(bias, bias_c)
# Make sure state dict wasn't modified
self.assertEqual(state_dict, state_dict_c)
def _test_basic_cases(self, constructor, ignore_multidevice=False):
self._test_state_dict(
torch.randn(10, 5),
torch.randn(10),
torch.randn(5),
constructor
)
self._test_basic_cases_template(
torch.randn(10, 5),
torch.randn(10),
torch.randn(5),
constructor
)
# non-contiguous parameters
self._test_basic_cases_template(
torch.randn(10, 5, 2)[..., 0],
torch.randn(10, 2)[..., 0],
torch.randn(5),
constructor
)
# CUDA
if not torch.cuda.is_available():
return
self._test_basic_cases_template(
torch.randn(10, 5).cuda(),
torch.randn(10).cuda(),
torch.randn(5).cuda(),
constructor
)
# Multi-GPU
if not torch.cuda.device_count() > 1 or ignore_multidevice:
return
self._test_basic_cases_template(
torch.randn(10, 5).cuda(0),
torch.randn(10).cuda(1),
torch.randn(5).cuda(0),
constructor
)
def _build_params_dict(self, weight, bias, **kwargs):
return [dict(params=[weight]), dict(params=[bias], **kwargs)]
def _build_params_dict_single(self, weight, bias, **kwargs):
return [dict(params=bias, **kwargs)]
def test_sgd(self):
self._test_rosenbrock(
lambda params: optim.SGD(params, lr=1e-3),
wrap_old_fn(old_optim.sgd, learningRate=1e-3)
)
self._test_rosenbrock(
lambda params: optim.SGD(params, lr=1e-3, momentum=0.9,
dampening=0, weight_decay=1e-4),
wrap_old_fn(old_optim.sgd, learningRate=1e-3, momentum=0.9,
dampening=0, weightDecay=1e-4)
)
self._test_basic_cases(
lambda weight, bias: optim.SGD([weight, bias], lr=1e-3)
)
self._test_basic_cases(
lambda weight, bias: optim.SGD(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-3)
)
self._test_basic_cases(
lambda weight, bias: optim.SGD(
self._build_params_dict_single(weight, bias, lr=1e-2),
lr=1e-3)
)
def test_adam(self):
self._test_rosenbrock(
lambda params: optim.Adam(params, lr=1e-2),
wrap_old_fn(old_optim.adam, learningRate=1e-2)
)
self._test_rosenbrock(
lambda params: optim.Adam(params, lr=1e-2, weight_decay=1e-2),
wrap_old_fn(old_optim.adam, learningRate=1e-2, weightDecay=1e-2)
)
self._test_basic_cases(
lambda weight, bias: optim.Adam([weight, bias], lr=1e-3)
)
self._test_basic_cases(
lambda weight, bias: optim.Adam(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-3)
)
def test_adadelta(self):
self._test_rosenbrock(
lambda params: optim.Adadelta(params),
wrap_old_fn(old_optim.adadelta)
)
self._test_rosenbrock(
lambda params: optim.Adadelta(params, rho=0.95),
wrap_old_fn(old_optim.adadelta, rho=0.95)
)
self._test_rosenbrock(
lambda params: optim.Adadelta(params, weight_decay=1e-2),
wrap_old_fn(old_optim.adadelta, weightDecay=1e-2)
)
self._test_basic_cases(
lambda weight, bias: optim.Adadelta([weight, bias])
)
self._test_basic_cases(
lambda weight, bias: optim.Adadelta(
self._build_params_dict(weight, bias, rho=0.95))
)
def test_adagrad(self):
self._test_rosenbrock(
lambda params: optim.Adagrad(params, lr=1e-1),
wrap_old_fn(old_optim.adagrad, learningRate=1e-1)
)
self._test_rosenbrock(
lambda params: optim.Adagrad(params, lr=1e-1, lr_decay=1e-3),
wrap_old_fn(old_optim.adagrad, learningRate=1e-1, learningRateDecay=1e-3)
)
self._test_rosenbrock(
lambda params: optim.Adagrad(params, lr=1e-1, weight_decay=1e-2),
wrap_old_fn(old_optim.adagrad, learningRate=1e-1, weightDecay=1e-2)
)
self._test_basic_cases(
lambda weight, bias: optim.Adagrad([weight, bias], lr=1e-1)
)
self._test_basic_cases(
lambda weight, bias: optim.Adagrad(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-1)
)
def test_adagrad_sparse(self):
self._test_rosenbrock_sparse(
lambda params: optim.Adagrad(params, lr=1e-1)
)
def test_adamax(self):
self._test_rosenbrock(
lambda params: optim.Adamax(params, lr=1e-1),
wrap_old_fn(old_optim.adamax, learningRate=1e-1)
)
self._test_rosenbrock(
lambda params: optim.Adamax(params, lr=1e-1, weight_decay=1e-2),
wrap_old_fn(old_optim.adamax, learningRate=1e-1, weightDecay=1e-2)
)
self._test_rosenbrock(
lambda params: optim.Adamax(params, lr=1e-1, betas=(0.95, 0.998)),
wrap_old_fn(old_optim.adamax, learningRate=1e-1, beta1=0.95, beta2=0.998)
)
        self._test_basic_cases(
            lambda weight, bias: optim.Adamax([weight, bias], lr=1e-1)
        )
        self._test_basic_cases(
            lambda weight, bias: optim.Adamax(
                self._build_params_dict(weight, bias, lr=1e-2),
                lr=1e-1)
        )
def test_rmsprop(self):
self._test_rosenbrock(
lambda params: optim.RMSprop(params, lr=1e-2),
wrap_old_fn(old_optim.rmsprop, learningRate=1e-2)
)
self._test_rosenbrock(
lambda params: optim.RMSprop(params, lr=1e-2, weight_decay=1e-2),
wrap_old_fn(old_optim.rmsprop, learningRate=1e-2, weightDecay=1e-2)
)
self._test_rosenbrock(
lambda params: optim.RMSprop(params, lr=1e-2, alpha=0.95),
wrap_old_fn(old_optim.rmsprop, learningRate=1e-2, alpha=0.95)
)
self._test_basic_cases(
lambda weight, bias: optim.RMSprop([weight, bias], lr=1e-2)
)
self._test_basic_cases(
lambda weight, bias: optim.RMSprop(
self._build_params_dict(weight, bias, lr=1e-3),
lr=1e-2)
)
def test_asgd(self):
self._test_rosenbrock(
lambda params: optim.ASGD(params, lr=1e-3),
wrap_old_fn(old_optim.asgd, eta0=1e-3)
)
self._test_rosenbrock(
lambda params: optim.ASGD(params, lr=1e-3, alpha=0.8),
wrap_old_fn(old_optim.asgd, eta0=1e-3, alpha=0.8)
)
self._test_rosenbrock(
lambda params: optim.ASGD(params, lr=1e-3, t0=1e3),
wrap_old_fn(old_optim.asgd, eta0=1e-3, t0=1e3)
)
self._test_basic_cases(
lambda weight, bias: optim.ASGD([weight, bias], lr=1e-3, t0=100)
)
self._test_basic_cases(
lambda weight, bias: optim.ASGD(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-3, t0=100)
)
def test_rprop(self):
self._test_rosenbrock(
lambda params: optim.Rprop(params, lr=1e-3),
wrap_old_fn(old_optim.rprop, stepsize=1e-3)
)
self._test_rosenbrock(
lambda params: optim.Rprop(params, lr=1e-3, etas=(0.6, 1.1)),
wrap_old_fn(old_optim.rprop, stepsize=1e-3, etaminus=0.6, etaplus=1.1)
)
self._test_rosenbrock(
lambda params: optim.Rprop(params, lr=1e-3, step_sizes=(1e-4, 3)),
wrap_old_fn(old_optim.rprop, stepsize=1e-3, stepsizemin=1e-4, stepsizemax=3)
)
self._test_basic_cases(
lambda weight, bias: optim.Rprop([weight, bias], lr=1e-3)
)
self._test_basic_cases(
lambda weight, bias: optim.Rprop(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-3)
)
def test_lbfgs(self):
self._test_rosenbrock(
lambda params: optim.LBFGS(params),
wrap_old_fn(old_optim.lbfgs)
)
self._test_rosenbrock(
lambda params: optim.LBFGS(params, lr=5e-2, max_iter=5),
wrap_old_fn(old_optim.lbfgs, learningRate=5e-2, maxIter=5)
)
self._test_basic_cases(
lambda weight, bias: optim.LBFGS([weight, bias]),
ignore_multidevice=True
)
def test_invalid_param_type(self):
with self.assertRaises(TypeError):
optim.SGD(Variable(torch.randn(5, 5)), lr=3)
class SchedulerTestNet(torch.nn.Module):
def __init__(self):
super(SchedulerTestNet, self).__init__()
self.conv1 = torch.nn.Conv2d(1, 1, 1)
self.conv2 = torch.nn.Conv2d(1, 1, 1)
def forward(self, x):
return self.conv2(F.relu(self.conv1(x)))
class TestLRScheduler(TestCase):
def setUp(self):
self.net = SchedulerTestNet()
self.opt = SGD(
[{'params': self.net.conv1.parameters()}, {'params': self.net.conv2.parameters(), 'lr': 0.5}],
lr=0.05)
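# conv1 inherits the optimizer-wide lr (0.05) while conv2 overrides it to 0.5;
# the step/multi-step/exponential tests therefore track two target schedules
# (the second 10x the first), while the plateau tests reset both groups' lr.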
def test_step_lr(self):
# lr = 0.05 if epoch < 3
# lr = 0.005 if 3 <= epoch < 6
# lr = 0.0005 if 6 <= epoch < 9
# lr = 0.00005 if epoch >= 9
single_targets = [0.05] * 3 + [0.005] * 3 + [0.0005] * 3 + [0.00005] * 3
targets = [single_targets, list(map(lambda x: x * 10, single_targets))]
scheduler = StepLR(self.opt, gamma=0.1, step_size=3)
epochs = 10
self._test(scheduler, targets, epochs)
def test_multi_step_lr(self):
# lr = 0.05 if epoch < 2
# lr = 0.005 if 2 <= epoch < 5
# lr = 0.0005 if 5 <= epoch < 9
# lr = 0.00005 if epoch >= 9
single_targets = [0.05] * 2 + [0.005] * 3 + [0.0005] * 4 + [0.00005] * 3
targets = [single_targets, list(map(lambda x: x * 10, single_targets))]
scheduler = MultiStepLR(self.opt, gamma=0.1, milestones=[2, 5, 9])
epochs = 10
self._test(scheduler, targets, epochs)
def test_exp_lr(self):
single_targets = [0.05 * (0.9 ** x) for x in range(10)]
targets = [single_targets, list(map(lambda x: x * 10, single_targets))]
scheduler = ExponentialLR(self.opt, gamma=0.9)
epochs = 10
self._test(scheduler, targets, epochs)
def test_reduce_lr_on_plateau1(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 20]
metrics = [10 - i * 0.0167 for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, threshold_mode='abs', mode='min',
threshold=0.01, patience=5, cooldown=5)
epochs = 10
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau2(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.05] * 7 + [0.005] * 7 + [0.0005] * 2]
metrics = [10 - i * 0.0165 for i in range(22)]
scheduler = ReduceLROnPlateau(self.opt, patience=5, cooldown=0, threshold_mode='abs',
mode='min', threshold=0.1)
epochs = 22
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau3(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * (2 + 6) + [0.05] * (5 + 6) + [0.005] * 4]
metrics = [-0.8] * 2 + [-0.234] * 20
scheduler = ReduceLROnPlateau(self.opt, mode='max', patience=5, cooldown=5,
threshold_mode='abs')
epochs = 22
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau4(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 20]
metrics = [1.5 * (1.025 ** i) for i in range(20)] # 1.025 > 1.1**0.25
scheduler = ReduceLROnPlateau(self.opt, mode='max', patience=3,
threshold_mode='rel', threshold=0.1)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau5(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.05] * (5 + 6) + [0.005] * 4]
metrics = [1.5 * (1.005 ** i) for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, mode='max', threshold_mode='rel',
threshold=0.1, patience=5, cooldown=5)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau6(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 20]
metrics = [1.5 * (0.85 ** i) for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, mode='min', threshold_mode='rel',
threshold=0.1)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau7(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.05] * (5 + 6) + [0.005] * 4]
metrics = [1] * 7 + [0.6] + [0.5] * 12
scheduler = ReduceLROnPlateau(self.opt, mode='min', threshold_mode='rel',
threshold=0.1, patience=5, cooldown=5)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau8(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.4] * 14, [0.5] * 6 + [0.3] * 14]
metrics = [1.5 * (1.005 ** i) for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, mode='max', threshold_mode='rel', min_lr=[0.4, 0.3],
threshold=0.1, patience=5, cooldown=5)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_lambda_lr(self):
self.opt.param_groups[0]['lr'] = 0.05
self.opt.param_groups[1]['lr'] = 0.4
targets = [[0.05 * (0.9 ** x) for x in range(10)], [0.4 * (0.8 ** x) for x in range(10)]]
scheduler = LambdaLR(self.opt,
lr_lambda=[lambda x1: 0.9 ** x1, lambda x2: 0.8 ** x2])
epochs = 10
self._test(scheduler, targets, epochs)
def _test(self, scheduler, targets, epochs=10):
for epoch in range(epochs):
scheduler.step(epoch)
for param_group, target in zip(self.opt.param_groups, targets):
self.assertAlmostEqual(target[epoch], param_group['lr'],
msg='LR is wrong in epoch {}: expected {}, got {}'.format(
epoch, target[epoch], param_group['lr']), delta=1e-5)
def _test_reduce_lr_on_plateau(self, scheduler, targets, metrics, epochs=10, verbose=False):
for epoch in range(epochs):
scheduler.step(metrics[epoch])
if verbose:
print('epoch{}:\tlr={}'.format(epoch, self.opt.param_groups[0]['lr']))
for param_group, target in zip(self.opt.param_groups, targets):
self.assertAlmostEqual(target[epoch], param_group['lr'],
msg='LR is wrong in epoch {}: expected {}, got {}'.format(
epoch, target[epoch], param_group['lr']), delta=1e-5)
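# Typical usage mirrored by these tests (illustrative sketch, not part of the
# test suite): wrap an existing optimizer and step once per epoch, e.g.
#     scheduler = StepLR(optimizer, step_size=3, gamma=0.1)
#     for epoch in range(10):
#         scheduler.step(epoch)
#         train_one_epoch()   # hypothetical training function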
if __name__ == '__main__':
run_tests()

625
test/test_sparse.py Normal file

@@ -0,0 +1,625 @@
import torch
from torch import sparse
import itertools
import random
import unittest
from common import TestCase, run_tests
from common_nn import TEST_CUDA
from numbers import Number
def cpu_only(inner):
def outer(self, *args, **kwargs):
if self.is_cuda:
raise unittest.SkipTest("Test is CPU-only")
inner(self, *args, **kwargs)
return outer
def cuda_only(inner):
def outer(self, *args, **kwargs):
if not self.is_cuda:
raise unittest.SkipTest("Test is GPU-only")
inner(self, *args, **kwargs)
return outer
class TestSparse(TestCase):
def setUp(self):
# These parameters control the various ways we can run the test.
# We will subclass and override this method to implement CUDA
# tests
self.is_cuda = False
self.is_uncoalesced = False
self.IndexTensor = torch.LongTensor
self.ValueTensor = torch.DoubleTensor
self.SparseTensor = torch.sparse.DoubleTensor
def _gen_sparse(self, d, nnz, with_size):
# TODO: Consider implementing this in the CUDA case by directly
# performing the operations on the GPU. You won't be able to
# use torch.rand/torch.randn in this case because they are
# CPU-only. If you do this, you can remove the is_cuda branch
# at the end.
#
# If you do this, be sure to update assert_uncoalesced too
if isinstance(with_size, Number):
with_size = [with_size] * d
if self.is_uncoalesced:
# We want to generate a tensor with a lot of uncoalesced
# entries to stress test whether or not we handle this
# (subtle) case correctly
v_size = [nnz * 2] + list(with_size[d:])
v = torch.randn(*v_size)
r = torch.rand(d, nnz)
# Repeat the indexes, so every position shows up twice; the float
# values are truncated to integer coordinates by the LongTensor cast below
i = torch.cat([r, r], dim=1) * \
torch.Tensor(with_size[:d]).repeat(nnz * 2, 1).transpose(0, 1)
i = i.type(torch.LongTensor)
x = torch.sparse.DoubleTensor(i, v, torch.Size(with_size))
self.assert_uncoalesced(x)
else:
# Generate a sparse tensor with d sparse dimensions; the
# rest of the dimensions with_size[d:] are dense.
v_size = [nnz] + list(with_size[d:])
v = torch.randn(*v_size)
i = torch.rand(d, nnz) * \
torch.Tensor(with_size[:d]).repeat(nnz, 1).transpose(0, 1)
i = i.type(torch.LongTensor)
x = torch.sparse.DoubleTensor(i, v, torch.Size(with_size))
if self.is_cuda:
return x.cuda(), i.cuda(), v.cuda()
else:
return x, i.clone(), v.clone()
def assert_uncoalesced(self, x):
"""
Test if a CPU tensor is uncoalesced. This is used to ensure
correctness of the uncoalesced tensor generation algorithm.
"""
assert not x.is_coalesced()
# Strategy: construct a new sparse tensor with the raw value
# field overwritten to a tensor of ones, coalesce it, and then
# check if any value entries are > 1 (which indicates that the
# original was uncoalesced.)
i = x._indices().clone()
v = x._values().clone().fill_(1)
y = torch.sparse.DoubleTensor(i, v, x.size())
z = self.safeCoalesce(y)
assert (z._values() > 1).sum() > 0
def randn(self, *args, **kwargs):
"""
Variant of torch.randn that also works in the TEST_CUDA case.
"""
# TODO: Put this in torch.cuda.randn
return self.ValueTensor(*args, **kwargs).normal_()
def test_basic(self):
x, i, v = self._gen_sparse(3, 10, 100)
self.assertEqual(i, x._indices())
self.assertEqual(v, x._values())
x, i, v = self._gen_sparse(3, 10, [100, 100, 100])
self.assertEqual(i, x._indices())
self.assertEqual(v, x._values())
self.assertEqual(x.ndimension(), 3)
self.assertEqual(x.coalesce()._nnz(), 10)
for i in range(3):
self.assertEqual(x.size(i), 100)
# Make sure we can access empty indices / values
x = self.SparseTensor()
self.assertEqual(x._indices().numel(), 0)
self.assertEqual(x._values().numel(), 0)
def test_to_dense(self):
i = self.IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
[0, 0, 1, 4],
])
v = self.ValueTensor([2, 1, 3, 4])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5]))
res = self.ValueTensor([
[[2, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[0, 3, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 4]],
])
x.to_dense() # Tests repeated to_dense calls for memory corruption
x.to_dense()
x.to_dense()
self.assertEqual(res, x.to_dense())
def test_shared(self):
i = self.IndexTensor([[2]])
v = self.ValueTensor([5])
x = self.SparseTensor(i, v, torch.Size([3]))
v[0] = 6
self.assertEqual(self.ValueTensor([0, 0, 6]), x.to_dense())
i[0][0] = 0
self.assertEqual(self.ValueTensor([6, 0, 0]), x.to_dense())
def test_to_dense_hybrid(self):
i = self.IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
])
v = self.ValueTensor([[2, 3], [1, 2], [3, 4], [4, 5]])
x = self.SparseTensor(i, v, torch.Size([3, 4, 2]))
res = self.ValueTensor([
[[2, 3],
[0, 0],
[0, 0],
[0, 0]],
[[1, 2],
[0, 0],
[0, 0],
[0, 0]],
[[3, 4],
[0, 0],
[0, 0],
[4, 5]],
])
x.to_dense() # Tests repeated to_dense calls for memory corruption
x.to_dense()
x.to_dense()
self.assertEqual(res, x.to_dense())
def test_contig(self):
i = self.IndexTensor([
[1, 0, 35, 14, 39, 6, 71, 66, 40, 27],
[92, 31, 62, 50, 22, 65, 89, 74, 56, 34],
])
v = self.ValueTensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
x = self.SparseTensor(i, v, torch.Size([100, 100]))
exp_i = self.IndexTensor([
[0, 1, 6, 14, 27, 35, 39, 40, 66, 71],
[31, 92, 65, 50, 34, 62, 22, 56, 74, 89],
])
exp_v = self.ValueTensor([2, 1, 6, 4, 10, 3, 5, 9, 8, 7])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
i = self.IndexTensor([
[2, 0, 2, 1],
[0, 0, 3, 0],
[1, 0, 4, 0],
])
v = self.ValueTensor([3, 2, 4, 1])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5]))
exp_i = self.IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
[0, 0, 1, 4],
])
exp_v = self.ValueTensor([2, 1, 3, 4])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
# Duplicate indices
i = self.IndexTensor([
[0, 0, 2, 0],
[0, 0, 3, 0],
[0, 0, 4, 0],
])
v = self.ValueTensor([3, 2, 4, 1])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5]))
exp_i = self.IndexTensor([
[0, 2],
[0, 3],
[0, 4],
])
exp_v = self.ValueTensor([6, 4])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
def test_contig_hybrid(self):
i = self.IndexTensor([
[1, 0, 35, 14, 39, 6, 71, 66, 40, 27],
[92, 31, 62, 50, 22, 65, 89, 74, 56, 34],
])
v = self.ValueTensor([
[1, 2], [2, 3], [3, 4], [4, 5], [5, 6],
[6, 7], [7, 8], [8, 9], [9, 10], [10, 11],
])
x = self.SparseTensor(i, v, torch.Size([100, 100, 2]))
exp_i = self.IndexTensor([
[0, 1, 6, 14, 27, 35, 39, 40, 66, 71],
[31, 92, 65, 50, 34, 62, 22, 56, 74, 89],
])
exp_v = self.ValueTensor([
[2, 3], [1, 2], [6, 7], [4, 5], [10, 11],
[3, 4], [5, 6], [9, 10], [8, 9], [7, 8],
])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
i = self.IndexTensor([
[2, 0, 2, 1],
[0, 0, 3, 0],
[1, 0, 4, 0],
])
v = self.ValueTensor([[3, 3, 3], [2, 2, 2], [4, 4, 4], [1, 1, 1]])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5, 3]))
exp_i = self.IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
[0, 0, 1, 4],
])
exp_v = self.ValueTensor([[2, 2, 2], [1, 1, 1], [3, 3, 3], [4, 4, 4]])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
# Duplicate indices
i = self.IndexTensor([
[0, 0, 2, 0],
[0, 0, 3, 0],
[0, 0, 4, 0],
])
v = self.ValueTensor([[3, 2, 3], [2, 1, 1], [4, 3, 4], [1, 1, 1]])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5, 3]))
exp_i = self.IndexTensor([
[0, 2],
[0, 3],
[0, 4],
])
exp_v = self.ValueTensor([[6, 4, 5], [4, 3, 4]])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
def test_clone(self):
x, _, _ = self._gen_sparse(4, 20, 5)
if self.is_uncoalesced:
self.assertFalse(x.is_coalesced())
y = x.clone()
self.assertFalse(y.is_coalesced())
x = x.coalesce()
self.assertTrue(x.is_coalesced())
y = x.clone()
self.assertTrue(y.is_coalesced())
def test_transpose(self):
x = self._gen_sparse(4, 20, 5)[0]
y = x.to_dense()
for i, j in itertools.combinations(range(4), 2):
x = x.transpose_(i, j)
y = y.transpose(i, j)
self.assertEqual(x.to_dense(), y)
x = x.transpose(i, j)
y = y.transpose(i, j)
self.assertEqual(x.to_dense(), y)
@cpu_only
def test_mm(self):
def test_shape(di, dj, dk):
x, _, _ = self._gen_sparse(2, 20, [di, dj])
t = torch.randn(di, dk)
y = torch.randn(dj, dk)
alpha = random.random()
beta = random.random()
res = torch.addmm(alpha, t, beta, x, y)
expected = torch.addmm(alpha, t, beta, x.to_dense(), y)
self.assertEqual(res, expected)
res = torch.addmm(t, x, y)
expected = torch.addmm(t, x.to_dense(), y)
self.assertEqual(res, expected)
res = torch.mm(x, y)
expected = torch.mm(x.to_dense(), y)
self.assertEqual(res, expected)
test_shape(10, 100, 100)
test_shape(100, 1000, 200)
test_shape(64, 10000, 300)
@cpu_only
def test_saddmm(self):
def test_shape(di, dj, dk):
x = self._gen_sparse(2, 20, [di, dj])[0]
t = self._gen_sparse(2, 20, [di, dk])[0]
y = torch.randn(dj, dk)
alpha = random.random()
beta = random.random()
res = torch.saddmm(alpha, t, beta, x, y)
expected = torch.addmm(alpha, t.to_dense(), beta, x.to_dense(), y)
self.assertEqual(res.to_dense(), expected)
res = torch.saddmm(t, x, y)
expected = torch.addmm(t.to_dense(), x.to_dense(), y)
self.assertEqual(res.to_dense(), expected)
res = torch.smm(x, y)
expected = torch.mm(x.to_dense(), y)
self.assertEqual(res.to_dense(), expected)
test_shape(7, 5, 3)
test_shape(1000, 100, 100)
test_shape(3000, 64, 300)
def test_dsmm(self):
def test_shape(di, dj, dk):
x = self._gen_sparse(2, 20, [di, dj])[0]
y = self.randn(dj, dk)
res = torch.dsmm(x, y)
expected = torch.mm(x.to_dense(), y)
self.assertEqual(res, expected)
test_shape(7, 5, 3)
test_shape(1000, 100, 100)
test_shape(3000, 64, 300)
def test_hsmm(self):
def test_shape(di, dj, dk):
x = self._gen_sparse(2, 20, [di, dj])[0]
y = self.randn(dj, dk)
res = torch.hsmm(x, y)
expected = torch.mm(x.to_dense(), y)
self.assertEqual(res.to_dense(), expected)
test_shape(7, 5, 3)
test_shape(1000, 100, 100)
test_shape(3000, 64, 300)
def _test_spadd_shape(self, shape_i, shape_v=None):
shape = shape_i + (shape_v or [])
x, _, _ = self._gen_sparse(len(shape_i), 10, shape)
y = self.randn(*shape)
r = random.random()
res = torch.add(y, r, x)
expected = y + r * x.to_dense()
self.assertEqual(res, expected)
# Non contiguous dense tensor
s = list(shape)
s[0] = shape[-1]
s[-1] = shape[0]
y = self.randn(*s)
y.transpose_(0, len(s) - 1)
r = random.random()
res = torch.add(y, r, x)
expected = y + r * x.to_dense()
self.assertEqual(res, expected)
def test_spadd(self):
self._test_spadd_shape([5, 6])
self._test_spadd_shape([10, 10, 10])
self._test_spadd_shape([50, 30, 20])
self._test_spadd_shape([5, 5, 5, 5, 5, 5])
def test_spadd_hybrid(self):
self._test_spadd_shape([5, 6], [2, 3])
self._test_spadd_shape([10, 10, 10], [3])
self._test_spadd_shape([50, 30, 20], [2])
self._test_spadd_shape([5, 5, 5, 5, 5, 5], [2])
def _test_basic_ops_shape(self, shape_i, shape_v=None):
shape = shape_i + (shape_v or [])
x1, _, _ = self._gen_sparse(len(shape_i), 9, shape)
x2, _, _ = self._gen_sparse(len(shape_i), 12, shape)
y1 = x1 + x2
y2 = x1.clone()
y2.add_(x2)
expected = x1.to_dense() + x2.to_dense()
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y1 = x1 - x2
y2 = x1.clone()
y2.sub_(x2)
expected = x1.to_dense() - x2.to_dense()
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y1 = x1 * x2
y2 = x1.clone()
y2.mul_(x2)
expected = x1.to_dense() * x2.to_dense()
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y1 = x1 * 37.5
y2 = x1.clone()
y2.mul_(37.5)
expected = x1.to_dense() * 37.5
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y1 = x1 / 37.5
y2 = x1.clone()
y2.div_(37.5)
expected = x1.to_dense() / 37.5
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
# TODO: add back inplace support
y1 = x1 ** 2
y2 = x1.clone()
y2 = y2.pow(2)
expected = x1.to_dense() ** 2
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y = x1.clone()
y.zero_()
expected = torch.zeros(x1.size())
self.assertEqual(y.to_dense(), expected)
self.assertFalse(x1.is_coalesced())
y = x1.coalesce()
z = x1.coalesce()
self.assertFalse(x1.is_coalesced())
self.assertTrue(y.is_coalesced())
self.assertEqual(x1, y)
# check that coalesce is out of place
y._values().add_(1)
self.assertEqual(z._values() + 1, y._values())
def test_basic_ops(self):
self._test_basic_ops_shape([5, 6])
self._test_basic_ops_shape([10, 10, 10])
self._test_basic_ops_shape([50, 30, 20])
self._test_basic_ops_shape([5, 5, 5, 5, 5, 5])
def test_basic_ops_hybrid(self):
self._test_basic_ops_shape([5, 6], [2, 3])
self._test_basic_ops_shape([10, 10, 10], [3])
self._test_basic_ops_shape([50, 30, 20], [2])
self._test_basic_ops_shape([5, 5, 5, 5, 5, 5], [2])
def _test_sparse_mask_shape(self, shape_i, shape_v=None):
shape = shape_i + (shape_v or [])
x1, _, _ = self._gen_sparse(len(shape_i), 9, shape)
x2, _, _ = self._gen_sparse(len(shape_i), 12, shape)
y1 = x1 + x2
y2 = x1.clone()
y2.add_(x2)
expected = x1.to_dense() + x2.to_dense()
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
def _test_sparse_mask_fixed(self):
i = self.IndexTensor([
[1, 3, 0, 4],
[2, 1, 2, 3],
])
v = self.ValueTensor([1, 2, 3, 4])
x = self.SparseTensor(i, v, torch.Size([5, 4])).coalesce()
dense = self.ValueTensor([
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20],
])
exp_v = self.ValueTensor([7, 14, 3, 20])
res = dense._sparse_mask(x)
expected = self.SparseTensor(i, exp_v, torch.Size([5, 4]))
self.assertEqual(res, expected)
def test_sparse_mask(self):
self._test_sparse_mask_fixed()
self._test_sparse_mask_shape([5, 6])
self._test_sparse_mask_shape([10, 10, 10])
self._test_sparse_mask_shape([50, 30, 20])
self._test_sparse_mask_shape([5, 5, 5, 5, 5, 5])
def _test_sparse_mask_hybrid_fixed(self):
i = self.IndexTensor([
[1, 3, 0, 4],
[2, 1, 2, 3],
])
v = self.ValueTensor([[1, 2], [2, 3], [3, 4], [4, 5]])
# TODO: This is also testing that, if coalesce is a no-op,
# the indices don't get permuted. I don't know if we actually
# want to guarantee this invariant.
x = self.SparseTensor(i, v, torch.Size([5, 4, 2])).coalesce()
dense = self.ValueTensor([
[[1, 3], [2, 2], [3, 3], [4, 2]],
[[5, 7], [6, 7], [7, 9], [8, 9]],
[[9, 2], [10, 4], [11, 1], [12, 3]],
[[13, 5], [14, 1], [15, 1], [16, 6]],
[[17, 7], [18, 2], [19, 7], [20, 1]],
])
res = dense._sparse_mask(x)
exp_v = self.ValueTensor([[7, 9], [14, 1], [3, 3], [20, 1]])
expected = self.SparseTensor(i, exp_v, torch.Size([5, 4, 2]))
self.assertEqual(res, expected)
def test_sparse_mask_hybrid(self):
self._test_sparse_mask_hybrid_fixed()
self._test_sparse_mask_shape([5, 6], [2, 3])
self._test_sparse_mask_shape([10, 10, 10], [3])
self._test_sparse_mask_shape([50, 30, 20], [2])
self._test_sparse_mask_shape([5, 5, 5, 5, 5, 5], [2])
@cuda_only
def test_storage_not_null(self):
x = torch.cuda.sparse.FloatTensor(2)
self.assertNotEqual(x.get_device(), -1)
@cuda_only
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_same_gpu(self):
i = self.IndexTensor([[2]]).cuda(1)
v = self.ValueTensor([5]).cuda(1)
x = self.SparseTensor(i, v, torch.Size([3]), device=1)
self.assertEqual(x.get_device(), 1)
self.assertEqual(x._values().get_device(), 1)
self.assertEqual(x._indices().get_device(), 1)
x = self.SparseTensor(3, device=1)
self.assertEqual(x.get_device(), 1)
self.assertEqual(x._values().get_device(), 1)
self.assertEqual(x._indices().get_device(), 1)
v = self.ValueTensor([5]).cuda(0)
self.assertRaises(RuntimeError, lambda: self.SparseTensor(i, v, torch.Size([3])))
class TestUncoalescedSparse(TestSparse):
def setUp(self):
super(TestUncoalescedSparse, self).setUp()
self.is_uncoalesced = True
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
class TestCudaSparse(TestSparse):
def setUp(self):
super(TestCudaSparse, self).setUp()
self.is_cuda = True
self.IndexTensor = torch.cuda.LongTensor
self.ValueTensor = torch.cuda.DoubleTensor
self.SparseTensor = torch.cuda.sparse.DoubleTensor
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
class TestCudaUncoalescedSparse(TestCudaSparse):
def setUp(self):
super(TestCudaUncoalescedSparse, self).setUp()
self.is_uncoalesced = True
if __name__ == '__main__':
run_tests()

4229
test/test_torch.py Normal file

File diff suppressed because it is too large

384
test/test_utils.py Normal file

@@ -0,0 +1,384 @@
from __future__ import print_function
import sys
import os
import math
import shutil
import random
import tempfile
import unittest
import traceback
import torch
import torch.utils.data
import torch.cuda
import warnings
from torch.autograd import Variable
from torch.utils.trainer import Trainer
from torch.utils.trainer.plugins import *
from torch.utils.trainer.plugins.plugin import Plugin
from torch.utils.serialization import load_lua
HAS_CUDA = torch.cuda.is_available()
from common import TestCase, run_tests, download_file
try:
import cffi
from torch.utils.ffi import compile_extension
HAS_CFFI = True
except ImportError:
HAS_CFFI = False
class SimplePlugin(Plugin):
def __init__(self, interval):
super(SimplePlugin, self).__init__(interval)
self.trainer = None
self.num_iteration = 0
self.num_epoch = 0
self.num_batch = 0
self.num_update = 0
def register(self, trainer):
self.trainer = trainer
def iteration(self, *args):
self.iteration_args = args
self.num_iteration += 1
def epoch(self, *args):
self.epoch_args = args
self.num_epoch += 1
def batch(self, *args):
self.batch_args = args
self.num_batch += 1
def update(self, *args):
self.update_args = args
self.num_update += 1
class ModelMock(object):
def __init__(self):
self.num_calls = 0
self.output = Variable(torch.ones(1, 1), requires_grad=True)
def __call__(self, i):
self.num_calls += 1
return self.output * 2
class CriterionMock(object):
def __init__(self):
self.num_calls = 0
def __call__(self, out, target):
self.num_calls += 1
return out
class OptimizerMock(object):
max_evals = 5
min_evals = 1
def __init__(self):
self.num_steps = 0
self.num_evals = 0
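# step() calls the closure a random number of times per invocation, mimicking
# optimizers such as L-BFGS that may re-evaluate the loss several times within
# a single step; num_evals tracks the total for the gradient assertions below.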
def step(self, closure):
for i in range(random.randint(self.min_evals, self.max_evals)):
loss = closure()
self.num_evals += 1
self.num_steps += 1
def zero_grad(self):
pass
class DatasetMock(object):
def __iter__(self):
for i in range(10):
yield torch.randn(2, 10), torch.randperm(10)[:2]
def __len__(self):
return 10
class TestDataLoader(TestCase):
def setUp(self):
self.dataset = torch.randn(5, 3, 3, 2)
self.batch_size = 3
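# With a 5-sample dataset and batch_size=3, drop_last=False should yield two
# batches (3 + 2) while drop_last=True yields only the single full batch.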
def test_single_keep(self):
dataloader = torch.utils.data.DataLoader(self.dataset,
batch_size=self.batch_size,
num_workers=0,
drop_last=False)
dataiter = iter(dataloader)
self.assertEqual(len(list(dataiter)), 2)
def test_single_drop(self):
dataloader = torch.utils.data.DataLoader(self.dataset,
batch_size=self.batch_size,
num_workers=0,
drop_last=True)
dataiter = iter(dataloader)
self.assertEqual(len(list(dataiter)), 1)
def test_multi_keep(self):
dataloader = torch.utils.data.DataLoader(self.dataset,
batch_size=self.batch_size,
num_workers=2,
drop_last=False)
dataiter = iter(dataloader)
self.assertEqual(len(list(dataiter)), 2)
def test_multi_drop(self):
dataloader = torch.utils.data.DataLoader(self.dataset,
batch_size=self.batch_size,
num_workers=2,
drop_last=True)
dataiter = iter(dataloader)
self.assertEqual(len(list(dataiter)), 1)
class TestTrainer(TestCase):
intervals = [
[(1, 'iteration')],
[(1, 'epoch')],
[(1, 'batch')],
[(1, 'update')],
[(5, 'iteration')],
[(5, 'epoch')],
[(5, 'batch')],
[(5, 'update')],
[(1, 'iteration'), (1, 'epoch')],
[(5, 'update'), (1, 'iteration')],
[(2, 'epoch'), (1, 'batch')],
]
def setUp(self):
self.optimizer = OptimizerMock()
self.trainer = Trainer(ModelMock(), CriterionMock(),
self.optimizer, DatasetMock())
self.num_epochs = 3
self.dataset_size = len(self.trainer.dataset)
self.num_iters = self.num_epochs * self.dataset_size
def test_register_plugin(self):
for interval in self.intervals:
simple_plugin = SimplePlugin(interval)
self.trainer.register_plugin(simple_plugin)
self.assertEqual(simple_plugin.trainer, self.trainer)
def test_optimizer_step(self):
self.trainer.run(epochs=1)
self.assertEqual(self.trainer.optimizer.num_steps, 10)
def test_plugin_interval(self):
for interval in self.intervals:
self.setUp()
simple_plugin = SimplePlugin(interval)
self.trainer.register_plugin(simple_plugin)
self.trainer.run(epochs=self.num_epochs)
units = {
('iteration', self.num_iters),
('epoch', self.num_epochs),
('batch', self.num_iters),
('update', self.num_iters)
}
for unit, num_triggers in units:
call_every = None
for i, i_unit in interval:
if i_unit == unit:
call_every = i
break
if call_every:
expected_num_calls = math.floor(num_triggers / call_every)
else:
expected_num_calls = 0
num_calls = getattr(simple_plugin, 'num_' + unit)
self.assertEqual(num_calls, expected_num_calls, 0)
def test_model_called(self):
self.trainer.run(epochs=self.num_epochs)
num_model_calls = self.trainer.model.num_calls
num_crit_calls = self.trainer.criterion.num_calls
self.assertEqual(num_model_calls, num_crit_calls)
for num_calls in [num_model_calls, num_crit_calls]:
lower_bound = OptimizerMock.min_evals * self.num_iters
upper_bound = OptimizerMock.max_evals * self.num_iters
self.assertEqual(num_calls, self.trainer.optimizer.num_evals)
self.assertLessEqual(lower_bound, num_calls)
self.assertLessEqual(num_calls, upper_bound)
def test_model_gradient(self):
self.trainer.run(epochs=self.num_epochs)
output_var = self.trainer.model.output
expected_grad = torch.ones(1, 1) * 2 * self.optimizer.num_evals
self.assertEqual(output_var.grad.data, expected_grad)
test_dir = os.path.abspath(os.path.dirname(str(__file__)))
class TestFFI(TestCase):
def setUp(self):
self.tmpdir = tempfile.mkdtemp()
os.chdir(self.tmpdir)
sys.path.append(self.tmpdir)
def tearDown(self):
shutil.rmtree(self.tmpdir)
@unittest.skipIf(not HAS_CFFI, "ffi tests require cffi package")
def test_cpu(self):
compile_extension(
name='test_extensions.cpulib',
header=test_dir + '/ffi/src/cpu/lib.h',
sources=[
test_dir + '/ffi/src/cpu/lib1.c',
test_dir + '/ffi/src/cpu/lib2.c',
],
verbose=False,
)
from test_extensions import cpulib
tensor = torch.ones(2, 2).float()
cpulib.good_func(tensor, 2, 1.5)
self.assertEqual(tensor, torch.ones(2, 2) * 2 + 1.5)
new_tensor = cpulib.new_tensor(4)
self.assertEqual(new_tensor, torch.ones(4, 4) * 4)
f = cpulib.int_to_float(5)
self.assertIs(type(f), float)
self.assertRaises(TypeError,
lambda: cpulib.good_func(tensor.double(), 2, 1.5))
self.assertRaises(torch.FatalError,
lambda: cpulib.bad_func(tensor, 2, 1.5))
@unittest.skipIf(not HAS_CFFI or not HAS_CUDA, "ffi tests require cffi package and CUDA")
def test_gpu(self):
compile_extension(
name='gpulib',
header=test_dir + '/ffi/src/cuda/cudalib.h',
sources=[
test_dir + '/ffi/src/cuda/cudalib.c',
],
with_cuda=True,
verbose=False,
)
import gpulib
tensor = torch.ones(2, 2).float()
gpulib.good_func(tensor, 2, 1.5)
self.assertEqual(tensor, torch.ones(2, 2) * 2 + 1.5)
ctensor = tensor.cuda().fill_(1)
gpulib.cuda_func(ctensor, 2, 1.5)
self.assertEqual(ctensor, torch.ones(2, 2) * 2 + 1.5)
self.assertRaises(TypeError,
lambda: gpulib.cuda_func(tensor, 2, 1.5))
self.assertRaises(TypeError,
lambda: gpulib.cuda_func(ctensor.storage(), 2, 1.5))
class TestLuaReader(TestCase):
@staticmethod
def _module_test(name, test):
def do_test(self):
module = test['module']
input = test['input']
grad_output = test['grad_output']
if hasattr(self, '_transform_' + name):
input = getattr(self, '_transform_' + name)(input)
output = module.forward(input)
module.zeroGradParameters()
grad_input = module.backward(input, grad_output)
self.assertEqual(output, test['output'])
self.assertEqual(grad_input, test['grad_input'])
if module.parameters() is not None:
params, d_params = module.parameters()
self.assertEqual(params, test['params'])
self.assertEqual(d_params, test['d_params'])
else:
self.assertFalse('params' in test and test['params'])
self.assertFalse('params' in test and test['d_params'])
return do_test
@staticmethod
def _criterion_test(name, test):
def do_test(self):
module = test['module']
input = test['input']
if name == 'L1Cost':
target = None
else:
target = test['target']
if hasattr(self, '_transform_' + name):
input, target = getattr(self, '_transform_' + name)(input, target)
output = module.forward(input, target)
grad_input = module.backward(input, target)
self.assertEqual(output, test['loss'])
self.assertEqual(grad_input, test['grad_input'])
return do_test
@classmethod
def init(cls):
try:
path = download_file('https://download.pytorch.org/test_data/legacy_modules.t7')
except unittest.SkipTest:
return
tests = load_lua(path)
for name, test in tests['modules'].items():
test_name = 'test_' + name.replace('nn.', '')
setattr(cls, test_name, cls._module_test(name, test))
for name, test in tests['criterions'].items():
test_name = 'test_' + name.replace('nn.', '')
setattr(cls, test_name, cls._criterion_test(name, test))
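# The _transform_* hooks below shift Lua's 1-based indices and class targets
# down by one to match PyTorch's 0-based convention before running the tests.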
def _transform_Index(self, input):
return [input[0], input[1].sub(1)]
def _transform_LookupTable(self, input):
return input.sub(1)
def _transform_MultiLabelMarginCriterion(self, input, target):
return input, target.sub(1)
def _transform_ClassNLLCriterion(self, input, target):
return input, target.sub(1)
def _transform_SpatialClassNLLCriterion(self, input, target):
return input, target.sub(1)
def _transform_ClassSimplexCriterion(self, input, target):
return input, target.sub(1)
def _transform_CrossEntropyCriterion(self, input, target):
return input, target.sub(1)
def _transform_ParallelCriterion(self, input, target):
return input, [target[0].sub(1), target[1]]
def _transform_MultiCriterion(self, input, target):
return input, target.sub(1)
def _transform_MultiMarginCriterion(self, input, target):
return input, target.sub(1)
TestLuaReader.init()
if __name__ == '__main__':
run_tests()

0
tools/__init__.py Normal file

52
tools/convert.vim Normal file

@@ -0,0 +1,52 @@
"Slightly adjust indentation
%s/^ / /g
" # -> len
%s/#\(\S*\) /len(\1)/g
" for loops
%s/for\( \)\{-\}\(\S*\)\( \)\{-\}=\( \)\{-\}\(\S*\),\( \)\{-\}\(\S*\)\( \)\{-\}do/for \2 in range(\5, \7+1)/g
" Change comments
%s/--\[\[/"""/g
%s/]]/"""/g
%s/--/#/g
" Add spacing between commas
%s/\(\S\),\(\S\)/\1, \2/g
%s/local //g
%s/ then/:/g
%s/ do/:/g
%s/end//g
%s/elseif/elif/g
%s/else/else:/g
%s/true/True/g
%s/false/False/g
%s/\~=/!=/g
%s/math\.min/min/g
%s/math\.max/max/g
%s/math\.abs/abs/g
%s/__init/__init__/g
" Rewrite function declarations
%s/function \w*:\(\w*\)/ def \1/g
%s/def \(.*\)$/def \1:/g
" class declaration
%s/\(\w*\), parent = torch\.class.*$/import torch\rfrom torch.legacy import nn\r\rclass \1(nn.Module):/g
%s/input\.THNN/self._backend/g
%s/\(self\.backend\w*$\)/\1\r self._backend.library_state,/g
%s/def \(\w*\)(/def \1(self, /g
%s/__init__(self)/__init__()/g
%s/:\(\S\)/.\1/g
%s/\.cdata()//g
%s/THNN\.optionalTensor(\(.*\))/\1/g
%s/parent\./super(##, self)./g
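" Illustrative example (not part of the original script): applied in order,
" the substitutions above turn a Lua line such as
"   for i=1,#input do local x = input[i] end
" into roughly
"   for i in range(1, len(input)+1) x = input[i]
" after which colons and layout still need manual touch-up.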

1
tools/cwrap/__init__.py Normal file

@@ -0,0 +1 @@
from .cwrap import cwrap

277
tools/cwrap/cwrap.py Normal file

@@ -0,0 +1,277 @@
import os
import yaml
from string import Template
from copy import deepcopy
from .plugins import ArgcountChecker, OptionalArguments, ArgumentReferences, \
BeforeAfterCall, ConstantArguments, ReturnArguments, GILRelease
from ..shared import cwrap_common
class cwrap(object):
BASE_INDENT_SIZE = 6
RETURN_WRAPPERS = {
'void': Template('Py_RETURN_NONE;'),
'long': Template('return PyLong_FromLong($result);'),
'bool': Template('return PyBool_FromLong($result);'),
'void*': Template('return PyLong_FromVoidPtr($result);'),
}
OPTION_TEMPLATE = Template("""
${els}if ($arg_check) {
$pre_arg_assign
$arg_assign
$code
""")
ARG_ASSIGN_TEMPLATE = Template("""${type} ${name} = ${unpack};""")
OPTION_CODE_TEMPLATE = [
'$call',
'$return_result',
]
FUNCTION_CALL_TEMPLATE = Template("$capture_result$cname($call_arg);")
DEFAULT_PLUGIN_CLASSES = [ArgcountChecker, ConstantArguments, OptionalArguments,
ArgumentReferences, BeforeAfterCall, ReturnArguments, GILRelease]
def __init__(self, source, destination=None, plugins=None, default_plugins=True):
if destination is None:
destination = source.replace('.cwrap', '.cpp')
self.plugins = [] if plugins is None else plugins
if default_plugins:
defaults = [cls() for cls in self.DEFAULT_PLUGIN_CLASSES]
self.plugins = defaults + self.plugins
for plugin in self.plugins:
plugin.initialize(self)
self.base_path = os.path.dirname(os.path.abspath(source))
with open(source, 'r') as f:
declarations = f.read()
# wrap all the declarations in the source .cwrap file
wrapper = self.wrap_declarations(declarations)
# let each plugin do any post-processing of the wrapped file
for plugin in self.plugins:
wrapper = plugin.process_full_file(wrapper)
with open(destination, 'w') as f:
f.write(wrapper)
def wrap_declarations(self, declarations):
lines = declarations.split('\n')
declaration_lines = []
output = []
in_declaration = False
i = 0
while i < len(lines):
line = lines[i]
if line == '[[':
declaration_lines = []
in_declaration = True
elif line == ']]':
in_declaration = False
declaration = yaml.load('\n'.join(declaration_lines))
cwrap_common.set_declaration_defaults(declaration)
# Pass declaration in a list - maybe some plugins want to add
# multiple wrappers
declarations = [declaration]
for plugin in self.plugins:
declarations = plugin.process_declarations(declarations)
# Generate wrappers for all declarations and append them to
# the output
for declaration in declarations:
wrapper = self.generate_wrapper(declaration)
for plugin in self.plugins:
wrapper = plugin.process_wrapper(wrapper, declaration)
output.append(wrapper)
elif in_declaration:
declaration_lines.append(line)
elif '!!inc ' == line[:6]:
fname = os.path.join(self.base_path, line[6:].strip())
with open(fname, 'r') as f:
included = f.read().split('\n')
# insert it into lines at position i+1
lines[i + 1:i + 1] = included
else:
output.append(line)
i += 1
return '\n'.join(output)
def parse_arguments(self, args):
new_args = []
for arg in args:
# Simple arg declaration of form "<type> <name>"
if isinstance(arg, str):
t, _, name = arg.partition(' ')
new_args.append({'type': t, 'name': name})
elif isinstance(arg, dict):
if 'arg' in arg:
arg['type'], _, arg['name'] = arg['arg'].partition(' ')
del arg['arg']
new_args.append(arg)
else:
assert False
return new_args
def search_plugins(self, fnname, args, fallback):
"""Search plugins for the given function to call with args.
If not found, call fallback with args.
"""
for plugin in self.plugins:
wrapper = getattr(plugin, fnname)(*args)
if wrapper is not None:
return wrapper
return fallback(*args)
def get_type_check(self, arg, option):
return self.search_plugins('get_type_check', (arg, option), lambda arg, _: None)
def get_type_unpack(self, arg, option):
return self.search_plugins('get_type_unpack', (arg, option), lambda arg, _: None)
def get_return_wrapper(self, option):
return self.search_plugins('get_return_wrapper', (option,), lambda _: self.RETURN_WRAPPERS[option['return']])
def get_wrapper_template(self, declaration):
return self.search_plugins('get_wrapper_template', (declaration,), lambda _: None)
def get_assign_args(self, arguments):
return self.search_plugins('get_assign_args', (arguments,), lambda _: arguments)
def get_arg_accessor(self, arg, option):
def wrap_accessor(arg, _):
if arg.get('idx') is None:
raise RuntimeError("Missing accessor for '{} {}'".format(
arg['type'], arg['name']))
return 'PyTuple_GET_ITEM(args, {})'.format(arg['idx'])
return self.search_plugins('get_arg_accessor', (arg, option), wrap_accessor)
def generate_wrapper(self, declaration):
wrapper = ''
for i, option in enumerate(declaration['options']):
option_wrapper = self.generate_option(option, is_first=(i == 0))
for plugin in self.plugins:
option_wrapper = plugin.process_option_code(option_wrapper, option)
wrapper += option_wrapper
return self.get_wrapper_template(declaration).substitute(name=declaration['name'], options=wrapper)
def map_selected_arguments(self, base_fn_name, plugin_fn_name, option, arguments):
result = []
for arg in arguments:
accessor = self.get_arg_accessor(arg, option)
tmpl = getattr(self, base_fn_name)(arg, option)
if tmpl is None:
fn = 'check' if base_fn_name == 'get_type_check' else 'unpack'
raise RuntimeError("Missing type {} for '{} {}'".format(
fn, arg['type'], arg['name']))
res = tmpl.substitute(arg=accessor, idx=arg.get('idx'))
for plugin in self.plugins:
res = getattr(plugin, plugin_fn_name)(res, arg, accessor)
result.append(res)
return result
def build_option_args(self, arguments, arg_unpack):
assignment = []
call_arg = []
# If types or names need to be changed
arguments = self.get_assign_args(arguments)
for arg, unpack in zip(arguments, arg_unpack):
if arg['type'] == 'CONSTANT':
call_arg.append(unpack)
else:
var_name = "arg_" + str(arg.get('assign_name', arg['name']))
res = self.ARG_ASSIGN_TEMPLATE.substitute(
type=arg['type'],
name=var_name,
unpack=unpack)
if var_name not in call_arg:
assignment.append(res)
call_arg.append(var_name)
return assignment, call_arg
def indent_code(self, code):
if code == '':
return code
code_lines = map(lambda s: s.strip(), code.split('\n'))
code = '\n'
depth = self.BASE_INDENT_SIZE
for line in code_lines:
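# Dedent before emitting a line that closes a brace block, then update the
# running depth for the braces and parentheses this line opens or closes.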
depth -= line.count('}') * 2
code += ' ' * depth + line + '\n'
depth += line.count('{') * 2
depth += line.count('(') * 4
depth -= line.count(')') * 4
return code[:-1]
def generate_option(self, option, is_first):
checked_args = list(filter(
lambda arg: 'ignore_check' not in arg or not arg['ignore_check'],
option['arguments']))
option['num_checked_args'] = len(checked_args)
idx_args = list(filter(
lambda arg: not arg.get('ignore_check') and not arg.get('no_idx'),
option['arguments']))
for i, arg in enumerate(idx_args):
arg['idx'] = i
# Generate checks
arg_checks = self.map_selected_arguments('get_type_check',
'process_single_check', option, checked_args)
arg_checks = ' &&\n '.join(arg_checks)
for plugin in self.plugins:
arg_checks = plugin.process_all_checks(arg_checks, option)
# Generate pre_arg assign
pre_arg_assign = []
for plugin in self.plugins:
pre_arg_assign = plugin.process_pre_arg_assign(pre_arg_assign, option)
# Generate arg assignment and call arguments
arg_unpack = self.map_selected_arguments('get_type_unpack',
'process_single_unpack', option, option['arguments'])
arg_assign, call_arg = self.build_option_args(option['arguments'], arg_unpack)
call_arg = ', '.join(call_arg)
for plugin in self.plugins:
call_arg = plugin.process_all_call_arg(call_arg, option)
# Generate call
try:
return_result = self.get_return_wrapper(option).substitute()
call = self.FUNCTION_CALL_TEMPLATE.substitute(capture_result='',
cname=option['cname'], call_arg=call_arg)
except KeyError:
return_result = self.get_return_wrapper(option).substitute(result='__result')
call = self.FUNCTION_CALL_TEMPLATE.substitute(capture_result=(option['return'] + ' __result = '),
cname=option['cname'], call_arg=call_arg)
code_template = deepcopy(self.OPTION_CODE_TEMPLATE)
for plugin in self.plugins:
code_template = plugin.process_option_code_template(code_template,
option)
code_template = Template('\n'.join(code_template))
code = code_template.substitute(call=call, return_result=return_result)
code = self.indent_code(code)
pre_arg_assign = self.indent_code('\n'.join(pre_arg_assign))
arg_assign = self.indent_code('\n'.join(arg_assign))
# Put everything together
return self.OPTION_TEMPLATE.substitute(
els=('} else ' if not is_first else ''),
arg_check=arg_checks,
pre_arg_assign=pre_arg_assign,
arg_assign=arg_assign,
code=code,
)


@@ -0,0 +1,13 @@
from . import CWrapPlugin
class ArgcountChecker(CWrapPlugin):
def process_all_checks(self, checks, option):
if not checks:
checks = '__argcount == 0'
else:
indent = '\n '
argcount = option['num_checked_args'] + option.get('argcount_offset', 0)
checks = '__argcount == {} &&'.format(str(argcount)) + indent + checks
return checks


@@ -0,0 +1,15 @@
import os
from . import CWrapPlugin
from ...shared import cwrap_common
class ArgcountSortPlugin(CWrapPlugin):
def __init__(self, descending=True):
self.descending = descending
def process_declarations(self, declarations):
for declaration in declarations:
cwrap_common.sort_by_number_of_options(declaration,
self.descending)
return declarations


@@ -0,0 +1,29 @@
from . import CWrapPlugin
from string import Template
class ArgumentReferences(CWrapPlugin):
def initialize(self, cwrap):
self.cwrap = cwrap
def process_declarations(self, declarations):
for declaration in declarations:
for option in declaration['options']:
for arg in option['arguments']:
if arg['type'] == 'argument':
arg['ignore_check'] = True
arg['is_reference'] = True
# Copy type from referenced argument
idx = int(arg['name'])
arg['type'] = option['arguments'][idx]['type']
return declarations
def _get_true_idx(self, idx, option):
return sum(not arg.get('ignore_check', False) for arg in option['arguments'][:idx])
def get_arg_accessor(self, arg, option):
if arg.get('is_reference', False):
idx = int(arg['name'])
referenced = option['arguments'][idx]
return self.cwrap.get_arg_accessor(referenced, option)


@@ -0,0 +1,29 @@
from . import CWrapPlugin
from string import Template
class AssertNDim(CWrapPlugin):
PRE_CODE_TEMPLATE = Template(
"""if(THTensor_(nDimension)(LIBRARY_STATE ${arg_op}) != ${dim_value}) {
THError("Expected argument %s to have %d dimension(s), but has %d",
"${op}", ${dim_value}, THTensor_(nDimension)(LIBRARY_STATE ${arg_op}));
}
""")
def process_option_code_template(self, template, option):
new_code_pre = []
for _, arg in enumerate(option['arguments']):
if 'assert_ndim' not in arg:
continue
dim_value = arg.get('assert_ndim')
op = arg.get('assign_name', arg['name'])
arg_op = "arg_" + op
new_code_pre.append(self.PRE_CODE_TEMPLATE.substitute(op=op,
arg_op=arg_op,
dim_value=dim_value))
template = new_code_pre + template
return template


@@ -0,0 +1,30 @@
from . import CWrapPlugin
class AutoGPU(CWrapPlugin):
def __init__(self, has_self=True, condition=None):
self.has_self = has_self
self.condition = condition
DEFINES = """
#ifdef THC_GENERIC_FILE
#define THCP_AUTO_GPU 1
#else
#define THCP_AUTO_GPU 0
#endif
"""
def process_pre_arg_assign(self, template, option):
if not option.get('auto_gpu', True):
return template
call = 'THCPAutoGPU __autogpu_guard = THCPAutoGPU(args{});'.format(
', (PyObject*)self' if self.has_self else '')
if self.condition is not None:
call = "#if {0}\n {1}\n#endif\n".format(self.condition, call)
return [call] + template
def process_full_file(self, code):
return self.DEFINES + code


@@ -0,0 +1,33 @@
from . import CWrapPlugin
from string import Template
class BeforeAfterCall(CWrapPlugin):
def initialize(self, cwrap):
self.cwrap = cwrap
def insert_snippet(self, template, option, offset, name):
prepend_str = option.get(name)
if prepend_str is None:
return
if '$' in prepend_str:
before_call_template = Template(option[name])
args = {'arg' + str(i): self.cwrap.get_arg_accessor(arg, option) for i, arg
in enumerate(option['arguments'])}
prepend_str = before_call_template.substitute(args)
template.insert(offset, prepend_str)
def process_pre_arg_assign(self, template, option):
if option.get('before_arg_assign'):
self.insert_snippet(template, option, 0, 'before_arg_assign')
return template
def process_option_code_template(self, template, option):
if option.get('before_call') or option.get('after_call'):
call_idx = template.index('$call')
self.insert_snippet(template, option, call_idx, 'before_call')
# call position might have changed
call_idx = template.index('$call')
self.insert_snippet(template, option, call_idx + 1, 'after_call')
return template


@@ -0,0 +1,35 @@
from . import CWrapPlugin
from string import Template
import sys
if sys.version_info[0] == 3:
string_type = str
else:
string_type = basestring
class BoolOption(CWrapPlugin):
UNPACK_TEMPLATE = Template('$arg == Py_True ? $if_true : $if_false')
def is_bool_option(self, arg):
return arg['type'] == 'bool' and 'if_true' in arg and 'if_false' in arg
def process_declarations(self, declarations):
for declaration in declarations:
for option in declaration['options']:
for arg in option['arguments']:
if self.is_bool_option(arg):
arg['is_bool_option'] = True
if isinstance(arg['if_true'], string_type):
arg['type'] = 'const char*'
return declarations
def get_type_check(self, arg, option):
if arg.get('is_bool_option', False):
return Template('PyBool_Check($arg)')
def get_type_unpack(self, arg, option):
if arg.get('is_bool_option', False):
return Template(self.UNPACK_TEMPLATE.safe_substitute(
if_true=arg['if_true'], if_false=arg['if_false']))


@@ -0,0 +1,318 @@
from . import CWrapPlugin
from string import Template
# Arguments to the Broadcast Plugin:
# broadcast: args_to_broadcast_against [inplace] [fallback]
# [args_to_broadcast_against]: either a single argument (e.g. "arg1") or a comma-separated
# list of two arguments (e.g. "tensor1,tensor2") indicating
# arguments to broadcast specified argument (usually "self") against
# [inplace] will generate code for in-place function, which doesn't allow the in-place
# argument to be broadcast
# [fallback] if tensors aren't broadcastable, preserves "element number" pointwise behavior,
# where only the number of elements needs to match, and tensors are viewed as 1-dimensional.
# [dims] specifies that the tensor shouldn't be broadcast against a specific tensor or tensors, but to a combination
# of individual dimension sizes of a set of tensors. For example: addbmm(C,A,B) a.k.a. [C + A @ B]
# broadcasts C to the first dimension of A and the second dimension of B. Each dimension is specified as
# [arg].dim[#] and dimensions are comma-separated. So, to specify that the tensor should be
# broadcast to 3-dimensions with sizes:
# tensor0->size[0] x tensor1->size[1] x tensor2->size[2]
# you would write:
# dims:tensor0.dim0,tensor1.dim1,tensor2.dim2
# [types] if the tensors should be of different types than THTensor, specify as X where
# the actual type to use is THXTensor (i.e. Byte for THByteTensor). If the type
# should be THTensor, use 'Real'
# For out of place:
# Two args: expand the two args together
# Three args (fused kernels): (e.g. addcmul) expand all three args together
# Sketch of proof that this is the same:
# consider addcmul, under expansion we want: a + (b * c) = (a + b * c) [all expanded together]
# Let e(i, j) be the expansion of i with j, e(i, j, k) be the expansion of i with j,k
#
# Then a + (b * c) = e(a, e(b,c) * e(c,b)) + e(e(b,c) * e(c,b), a)
# = e(a, e(b,c)) + e(e(b,c) * e(c,b), a) (only size matters for second param)
# = e(a,b,c) + e(e(b,c) * e(c,b), a) (by associativity of max in expand)
# = e(a,b,c) + e(b,c,a) * e(c,b,a) (see L1)
# which is a + b * c all expanded together
#
# L1: Show e(i * j, a) = e(i,a) * e(j,a) where i,j have same size
# Consider any index _{ s_0, ..., s_n}
# e(i * j, a) = (i*j)_{f(s_0), ...,f(s_n)} where f is the expansion of that dimension with a
# = i_{f(s_0), ..., f(s_n)} * j_{f(s_0), ..., f(s_n)} by definition of pointwise operator
# = e(i,a) * e(j,a)
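# Concrete instance of L1 (illustrative): take i = [[1],[2]], j = [[3],[4]]
# (both 2x1) and a of size 2x2. Then e(i*j, a) expands [[3],[8]] to
# [[3,3],[8,8]], which matches e(i,a) * e(j,a) = [[1,1],[2,2]] * [[3,3],[4,4]]
# elementwise.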
class Broadcast(CWrapPlugin):
# Save and restore passed-in arguments in case later plugins use them
POST_TEMPLATE = Template(
"""${arg_op_other} = ${arg_op_other}_save;\n""")
def getPreArgStringTemplate(self, type=None):
if type is None:
ret = """THTensor *${arg_op_other}_save = ${arg_op_other};
THTensorPtr ${arg_op_other}_guard(THTensor_(new)(LIBRARY_STATE_NOARGS));\n"""
else:
cpu_t = "TH" + type + "Tensor"
gpu_t = "THCuda" + type + "Tensor"
ret = ("#if !IS_CUDA\n" +
cpu_t + " *${arg_op_other}_save = ${arg_op_other};\n" +
cpu_t + "Ptr ${arg_op_other}_guard(" + cpu_t + "_new(LIBRARY_STATE_NOARGS));\n" +
"#else\n" +
gpu_t + " *${arg_op_other}_save = ${arg_op_other};\n" +
"THPPointer<" + gpu_t + "> ${arg_op_other}_guard(\n" + gpu_t + "_new(LIBRARY_STATE_NOARGS));\n" +
"#endif\n")
return Template(ret)
def getExpandTemplate(self, expand_call, success_code, raise_errors):
if not raise_errors:
return Template(
"bool expand_success = false;\n" +
"try {\n" +
expand_call +
"\nexpand_success = true;\n" +
"}\n"
"catch (std::exception &e) {}\n" +
"if(expand_success) {\n" +
success_code +
"\n}\n")
else:
return Template(
expand_call + "\n" +
success_code + "\n")
def getOutPlacePreExpand2Template(self, raise_errors):
expand_code = """expand_outplace2(LIBRARY_STATE ${arg_op_a}_guard.get(), ${arg_op_other}_guard.get(),
${arg_op_a}, ${arg_op_other},
\"${op_a}\", \"${op_other}\", !${raise_errors});"""
success_code = """${arg_op_a} = ${arg_op_a}_guard.get();
${arg_op_other} = ${arg_op_other}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
def getOutPlacePreExpand3Template(self, raise_errors):
expand_code = """expand_outplace3(LIBRARY_STATE ${arg_op_a}_guard.get(),
${arg_op_other1}_guard.get(), ${arg_op_other2}_guard.get(),
${arg_op_a}, ${arg_op_other1}, ${arg_op_other2},
\"${op_a}\", \"${op_other1}\", \"${op_other2}\", !${raise_errors});"""
success_code = """${arg_op_a} = ${arg_op_a}_guard.get();
${arg_op_other1} = ${arg_op_other1}_guard.get();
${arg_op_other2} = ${arg_op_other2}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
OUT_PLACE_PRE_EXPAND_PRE_DIM_TEMPLATE = Template(
"""if(THTensor_(nDimension)(LIBRARY_STATE ${arg_op_dim}) <= ${arg_op_dim_value}) {
THError("Argument %s requires at least %d dimensions, but only has %d",
"${op_dim}", ${arg_op_dim_value} + 1, THTensor_(nDimension)(LIBRARY_STATE ${arg_op_dim}));
}
long ${arg_op_a}_dim${idx}_size = THTensor_(size)(LIBRARY_STATE ${arg_op_dim}, ${arg_op_dim_value});\n""")
OUT_PLACE_PRE_EXPAND1_DIM_TEMPLATE = Template(
"""THLongStoragePtr ${arg_op_a}_storage(THLongStorage_newWithSize1(${arg_op_a}_dim0_size));\n""")
OUT_PLACE_PRE_EXPAND2_DIM_TEMPLATE = Template(
"""THLongStoragePtr ${arg_op_a}_storage(
THLongStorage_newWithSize2(${arg_op_a}_dim0_size, ${arg_op_a}_dim1_size));\n""")
OUT_PLACE_PRE_EXPAND3_DIM_TEMPLATE = Template(
"""THLongStoragePtr ${arg_op_a}_storage(
THLongStorage_newWithSize3(${arg_op_a}_dim0_size, ${arg_op_a}_dim1_size, ${arg_op_a}_dim2_size));\n""")
def getOutPlacePreExpandPostDimTemplate(self, raise_errors):
expand_code = """expand(LIBRARY_STATE ${arg_op_a}_guard.get(), ${arg_op_a}, ${arg_op_a}_storage);"""
success_code = """${arg_op_a} = ${arg_op_a}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
OUT_PLACE_PRE_TEMPLATE = Template(
"""${code_arg_op_a}${code_arg_op_other1}${code_arg_op_other2}
${expand_code}""")
def getInPlacePreExpand1Template(self, raise_errors):
expand_code = """expand_inplace1(LIBRARY_STATE ${arg_op_other}_guard.get(), ${arg_op_other}, ${arg_op_a},
\"${op_other}\", \"${op_a}\", !${raise_errors});"""
success_code = """${arg_op_other} = ${arg_op_other}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
def getInPlacePreExpand2Template(self, raise_errors):
expand_code = """expand_inplace2(LIBRARY_STATE ${arg_op_other1}_guard.get(), ${arg_op_other2}_guard.get(),
${arg_op_other1}, ${arg_op_other2}, ${arg_op_a},
\"${op_other1}\", \"${op_other2}\", \"${op_a}\", !${raise_errors});"""
success_code = """${arg_op_other1} = ${arg_op_other1}_guard.get();
${arg_op_other2} = ${arg_op_other2}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
IN_PLACE_PRE_TEMPLATE = Template(
"""${code_arg_op_other1}${code_arg_op_other2}
${expand_code}""")
def initialize(self, cwrap):
self.cwrap = cwrap
# Arguments:
# [0]: name of tensor to broadcast with (possibly two, comma-separated)
# [1] inplace (optional). In-place operations only broadcast on the second tensor argument
# [2] fallback (optional). Will fall back to applying to tensors of equal nElem if broadcast fails
def process_option_code_template(self, template, option):
new_code_pre = []
new_code_post = []
for _, arg in enumerate(option['arguments']):
if 'broadcast' not in arg:
continue
params = arg.get('broadcast').split(" ")
op_a = arg.get('assign_name', arg['name'])
in_place = "inplace" in params
raise_errors = "false" if "fallback" in params else "true"
param_others = params[0].split(",")
if len(param_others) > 2:
raise ValueError('Broadcast only supports up to 2 secondary parameters')
op_b = param_others[0]
op_c = param_others[1] if len(param_others) == 2 else None
arg_op_b = "arg_" + op_b
arg_op_a = "arg_" + op_a
arg_op_c = ("arg_" + op_c) if op_c else None
dims_kvs = []
for p in params:
if p.startswith("dims:"):
assert(raise_errors == "true")
if len(dims_kvs) != 0:
raise ValueError("multiple specifications of dims")
dims = p[len("dims:"):].split(",")
for dim in dims:
batchdim = dim.split(".")
assert len(batchdim) == 2
assert batchdim[1].startswith("dim")
dim_val = batchdim[1][len("dim"):]
dims_kvs.append({"op": batchdim[0], "arg_op": "arg_" + batchdim[0], "val": dim_val})
assert len(dims_kvs) <= 3
for p in params[1:]:
if p != "inplace" and p != "fallback" and not p.startswith("dims:") and not p.startswith("types:"):
raise ValueError("invalid parameter {}".format(p))
type_op_b = None
type_op_c = None
for p in params:
if p.startswith("types:"):
if not in_place and len(dims_kvs) > 0:
raise ValueError("type specification not supported yet for out-of-place functions "
"that specify explicit dimensions")
types = p[len("types:"):].split(",")
assert(len(types) == (2 if op_c else 1))
type_op_b = None if types[0] == "Real" else types[0]
if op_c:
type_op_c = None if types[1] == "Real" else types[1]
op_b_mapping = {
"op_a": op_a,
"op_other": op_b,
"arg_op_a": arg_op_a,
"arg_op_other": arg_op_b,
"raise_errors": raise_errors
}
op_c_mapping = {
"op_a": op_a,
"op_other": op_c,
"arg_op_a": arg_op_a,
"arg_op_other": arg_op_c,
"raise_errors": raise_errors
}
if in_place:
code_arg_op_other1 = self.getPreArgStringTemplate(type=type_op_b).substitute(op_b_mapping)
code_arg_op_other2 = (
self.getPreArgStringTemplate(type=type_op_c).substitute(op_c_mapping) if op_c else "")
if op_c:
expand_code = self.getInPlacePreExpand2Template(raise_errors == "true").substitute(
op_b_mapping,
op_other1=op_b,
op_other2=op_c,
arg_op_other1=arg_op_b,
arg_op_other2=arg_op_c)
else:
expand_code = self.getInPlacePreExpand1Template(raise_errors == "true").substitute(op_b_mapping)
new_code_pre.append(self.IN_PLACE_PRE_TEMPLATE.substitute(
arg_op_a=arg_op_a,
code_arg_op_other1=code_arg_op_other1,
code_arg_op_other2=code_arg_op_other2,
expand_code=expand_code,
raise_errors=raise_errors))
new_code_pre.append("")
post_code = self.POST_TEMPLATE.substitute(op_b_mapping)
if op_c:
post_code += self.POST_TEMPLATE.substitute(op_c_mapping)
new_code_post.append(post_code)
new_code_post.append("")
else:
if len(dims_kvs) != 0:
code_arg_op_a = self.getPreArgStringTemplate().substitute(arg_op_other=arg_op_a)
code_arg_op_other1 = ""
code_arg_op_other2 = ""
expand_code = ""
for idx, kv in enumerate(dims_kvs):
expand_code += self.OUT_PLACE_PRE_EXPAND_PRE_DIM_TEMPLATE.substitute(
arg_op_a=arg_op_a,
op_dim=kv["op"],
arg_op_dim=kv["arg_op"],
arg_op_dim_value=kv["val"],
idx=idx)
if len(dims_kvs) == 1:
expand_code += self.OUT_PLACE_PRE_EXPAND1_DIM_TEMPLATE.substitute(
arg_op_a=arg_op_a,
arg_op_dim0=dims_kvs[0]["arg_op"])
elif len(dims_kvs) == 2:
expand_code += self.OUT_PLACE_PRE_EXPAND2_DIM_TEMPLATE.substitute(
arg_op_a=arg_op_a,
arg_op_dim0=dims_kvs[0]["arg_op"],
arg_op_dim1=dims_kvs[1]["arg_op"])
else:
expand_code += self.OUT_PLACE_PRE_EXPAND3_DIM_TEMPLATE.substitute(
arg_op_a=arg_op_a,
arg_op_dim0=dims_kvs[0]["arg_op"],
arg_op_dim1=dims_kvs[1]["arg_op"],
arg_op_dim2=dims_kvs[2]["arg_op"])
expand_code += self.getOutPlacePreExpandPostDimTemplate(raise_errors == "true").substitute(
arg_op_a=arg_op_a,
raise_errors=raise_errors)
post_code = self.POST_TEMPLATE.substitute(arg_op_other=arg_op_a)
else:
code_arg_op_a = self.getPreArgStringTemplate().substitute(arg_op_other=arg_op_a)
code_arg_op_other1 = self.getPreArgStringTemplate(type=type_op_b).substitute(op_b_mapping)
code_arg_op_other2 = (self.getPreArgStringTemplate(type=type_op_c).substitute(op_c_mapping)
if op_c else "")
if op_c:
expand_code = self.getOutPlacePreExpand3Template(raise_errors == "true").substitute(
op_b_mapping,
op_other1=op_b,
op_other2=op_c,
arg_op_other1=arg_op_b,
arg_op_other2=arg_op_c)
else:
expand_code = self.getOutPlacePreExpand2Template(
raise_errors == "true").substitute(op_b_mapping)
post_code = self.POST_TEMPLATE.substitute(arg_op_other=arg_op_a)
post_code += self.POST_TEMPLATE.substitute(op_b_mapping)
post_code += self.POST_TEMPLATE.substitute(op_c_mapping) if op_c else ""
new_code_pre.append(self.OUT_PLACE_PRE_TEMPLATE.substitute(
code_arg_op_a=code_arg_op_a,
code_arg_op_other1=code_arg_op_other1,
code_arg_op_other2=code_arg_op_other2,
expand_code=expand_code))
new_code_pre.append("")
new_code_post.append(post_code)
new_code_post.append("")
template = new_code_pre + template + new_code_post
return template


@@ -0,0 +1,21 @@
from . import CWrapPlugin
from string import Template
class ConstantArguments(CWrapPlugin):
def process_declarations(self, declarations):
for declaration in declarations:
for option in declaration['options']:
for arg in option['arguments']:
if arg['type'] == 'CONSTANT':
arg['ignore_check'] = True
return declarations
def get_type_unpack(self, arg, option):
if arg['type'] == 'CONSTANT':
return Template('$arg')
def get_arg_accessor(self, arg, option):
if arg['type'] == 'CONSTANT':
return arg['name']
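# Sketch (illustrative, not part of the plugin): a CONSTANT argument is never
# type-checked and its accessor is the constant itself, so unpacking is the
# identity substitution returned by get_type_unpack above.
from string import Template
unpack = Template('$arg')            # what get_type_unpack returns for CONSTANT
print(unpack.substitute(arg='2'))    # the constant is pasted verbatim -> 2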


@@ -0,0 +1,179 @@
from string import Template
import copy
from . import CWrapPlugin
class CuDNNPlugin(CWrapPlugin):
TYPE_UNPACK = {
'THTensor*': Template('((THPVoidTensor*)$arg)->cdata'),
'int': Template('THPUtils_unpackLong($arg)'),
'std::vector<int>': Template('THPUtils_unpackIntTuple($arg)'),
'cudnnDataType_t': Template('$arg'),
'cudnnHandle_t': Template('$arg'),
'Convolution*': Template('(Convolution*)THPWrapper_get($arg)'),
'bool': Template('$arg == Py_True'),
'double': Template('THPDoubleUtils_unpackReal($arg)'),
}
INPUT_ARGUMENT_MAP = {
'THTensor*': 'THVoidTensor*',
}
TYPE_CHECK = {
'Convolution*': Template('THPWrapper_check($arg)'),
'THTensor*': Template('(PyObject*)Py_TYPE($arg) == tensorClass'),
'int': Template('THPUtils_checkLong($arg)'),
'std::vector<int>': Template('THPUtils_checkIntTuple($arg)'),
'bool': Template('PyBool_Check($arg)'),
'double': Template('THPDoubleUtils_checkReal($arg)'),
}
RETURN_WRAPPER = {
'Convolution*': Template('return THPWrapper_New($result, [](void* arg) { delete (Convolution*)arg; });'),
}
METHODS_DECLARATION = Template("""
static PyMethodDef _THCUDNN_methods[] = {
$methods
{NULL}
};
PyMethodDef* THCUDNN_methods()
{
return _THCUDNN_methods;
}
""")
WRAPPER_TEMPLATE = Template("""\
static PyObject * $name(PyObject *self, PyObject *args, PyObject *kwargs)
{
HANDLE_TH_ERRORS
int __tuplecount = args ? PyTuple_Size(args) : 0;
int __dictcount = kwargs ? PyDict_Size(kwargs) : 0;
int __argcount = __tuplecount + __dictcount;
PyObject* tensorClass = getTensorClass(args);
THCPAutoGPU __autogpu_guard = THCPAutoGPU(args);
$options
}
THPUtils_invalidArguments(args, kwargs, "$readable_name", $num_options, $expected_args);
return NULL;
END_HANDLE_TH_ERRORS
}
""")
RELEASE_ARG = Template("_${name}_guard.release();")
TYPE_NAMES = {
'THTensor*': '" THPTensorStr "',
'long': 'int',
'bool': 'bool',
'int': 'int',
}
def __init__(self):
self.declarations = []
def get_type_unpack(self, arg, option):
return self.TYPE_UNPACK.get(arg['type'], None)
def get_type_check(self, arg, option):
return self.TYPE_CHECK.get(arg['type'], None)
def get_assign_args(self, arguments):
assign_args = []
for arg in arguments:
arg = copy.copy(arg)
new_type = self.INPUT_ARGUMENT_MAP.get(arg['type'])
if new_type is not None:
arg['type'] = new_type
assign_args.append(arg)
return assign_args
def get_wrapper_template(self, declaration):
arg_desc = []
for option in declaration['options']:
option_desc = [self.TYPE_NAMES.get(arg['type'], arg['type']) + ' ' + arg['name']
for arg in option['arguments']
if not arg.get('ignore_check', False)]
# TODO: this should probably go to THPLongArgsPlugin
if option_desc:
arg_desc.append('({})'.format(', '.join(option_desc)))
else:
arg_desc.append('no arguments')
arg_desc.sort(key=len)
arg_desc = ['"' + desc + '"' for desc in arg_desc]
arg_str = ', '.join(arg_desc)
readable_name = declaration['python_name']
return Template(self.WRAPPER_TEMPLATE.safe_substitute(
readable_name=readable_name, num_options=len(arg_desc),
expected_args=arg_str))
def get_return_wrapper(self, option):
return self.RETURN_WRAPPER.get(option['return'], None)
def get_arg_accessor(self, arg, option):
name = arg['name']
if name == 'self':
return 'self'
elif name == 'dataType':
return 'getCudnnDataType(tensorClass)'
elif name == 'handle':
return 'getCudnnHandle()'
def process_declarations(self, declarations):
for declaration in declarations:
declaration.setdefault('python_name', '_{}'.format(declaration['name']))
declaration['name'] = 'THCUDNN_{}'.format(declaration['name'])
self.declarations.append(declaration)
for option in declaration['options']:
for arg in option['arguments']:
if arg['name'] in ['self', 'state', 'dataType', 'handle']:
arg['ignore_check'] = True
declaration['options'] = self.filter_unique_options(declaration['options'])
return [d for d in declarations if not d.get('only_register', False)]
def filter_unique_options(self, options):
def signature(option):
return '#'.join(arg['type'] for arg in option['arguments']
if 'ignore_check' not in arg or not arg['ignore_check'])
seen_signatures = set()
unique = []
for option in options:
sig = signature(option)
if sig not in seen_signatures:
unique.append(option)
seen_signatures.add(sig)
return unique
def preprocessor_guard(self, code, condition):
return '#if ' + condition + '\n' + code + '#endif\n'
def process_wrapper(self, code, declaration):
if 'defined_if' in declaration:
return self.preprocessor_guard(code, declaration['defined_if'])
return code
def process_all_call_arg(self, code, option):
return 'state, ' + code
def declare_methods(self):
methods = ''
for declaration in self.declarations:
extra_flags = ' | ' + declaration.get('method_flags') if 'method_flags' in declaration else ''
if not declaration.get('only_register'):
extra_flags += ' | METH_KEYWORDS'
entry = Template(' {"$python_name", (PyCFunction)$name, METH_VARARGS$extra_flags, NULL},\n').substitute(
python_name=declaration['python_name'], name=declaration['name'], extra_flags=extra_flags
)
if 'defined_if' in declaration:
entry = self.preprocessor_guard(entry, declaration['defined_if'])
methods += entry
return self.METHODS_DECLARATION.substitute(methods=methods)
def process_full_file(self, code):
return code + self.declare_methods()
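# Sketch (illustrative option dicts, not real declarations): overloads are
# deduplicated by the type signature of their checked arguments.
opts = [
    {'arguments': [{'type': 'THTensor*'}, {'type': 'int'}]},
    {'arguments': [{'type': 'THTensor*'}, {'type': 'int'}]},
    {'arguments': [{'type': 'THTensor*'}, {'type': 'bool'}]},
]
print(len(CuDNNPlugin().filter_unique_options(opts)))  # -> 2 (duplicate dropped)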


@@ -0,0 +1,31 @@
from . import CWrapPlugin
from string import Template
class GILRelease(CWrapPlugin):
OPTION_START = [
'PyThreadState *_save = NULL;',
'try {',
]
BEFORE_CALL = 'Py_UNBLOCK_THREADS;'
AFTER_CALL = 'Py_BLOCK_THREADS;'
OPTION_END = [
'} catch (...) {',
'if (_save) {',
'Py_BLOCK_THREADS;',
'}',
'throw;',
'}',
]
def process_option_code_template(self, template, option):
if option.get('with_gil', False):
return template
call_idx = template.index('$call')
template.insert(call_idx, self.BEFORE_CALL)
template.insert(call_idx + 2, self.AFTER_CALL)
return self.OPTION_START + template + self.OPTION_END
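# Worked example (toy one-line template): the $call line is bracketed by the
# GIL macros and the whole option body is wrapped in the try/catch that
# re-acquires the GIL before rethrowing.
print(GILRelease().process_option_code_template(['$call'], {}))
# -> ['PyThreadState *_save = NULL;', 'try {',
#     'Py_UNBLOCK_THREADS;', '$call', 'Py_BLOCK_THREADS;',
#     '} catch (...) {', 'if (_save) {', 'Py_BLOCK_THREADS;', '}', 'throw;', '}']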


@@ -0,0 +1,223 @@
import copy
from string import Template
from . import CWrapPlugin
class GenericNN(CWrapPlugin):
INPUT_TYPE_CHECK = Template("checkTypes(is_cuda, $type, $tensor_args);")
HEADER_TEMPLATE = Template("void $name($args);")
WRAPPER_TEMPLATE = Template("""\
void $name($args)
{
bool is_cuda = $input->isCuda();
auto type = $input->type();
$type_check
$options
} else {
throw std::runtime_error("invalid arguments");
}
}
""")
THNN_TEMPLATE = Template("""\
if (type == thpp::Type::FLOAT) {
THNN_Float$name(
NULL,
$float_args);
} else if (type == thpp::Type::DOUBLE) {
THNN_Double$name(
NULL,
$double_args);
} else {
throw std::runtime_error("unsupported tensor type");
}""")
THCUNN_TEMPLATE = Template("""\
#ifdef WITH_CUDA
if (type == thpp::Type::FLOAT) {
THNN_Cuda$name(
state,
$float_args);
} else if (type == thpp::Type::DOUBLE) {
THNN_CudaDouble$name(
state,
$double_args);
} else if (type == thpp::Type::HALF) {
THNN_CudaHalf$name(
state,
$half_args);
} else {
throw std::runtime_error("unsupported tensor type");
}
#endif
""")
INDEX_TENSOR_TYPES = {'THIndexTensor*', 'THCIndexTensor*'}
REAL_TENSOR_TYPES = {'THTensor*', 'THCTensor*'}
INPUT_ARGUMENT_MAP = {
'THNNState*': 'void*',
'THCState*': 'void*',
'THTensor*': 'thpp::Tensor*',
'THCTensor*': 'thpp::Tensor*',
'THIndexTensor*': 'thpp::Tensor*',
'THCIndexTensor*': 'thpp::Tensor*',
'THIndex_t': 'long',
'accreal': 'double',
}
def __init__(self, header=False):
self.header = header
self.declarations = []
def process_full_file(self, base_wrapper):
if self.header:
wrapper = '#pragma once\n\n'
wrapper += '#include <THPP/Tensor.hpp>\n\n'
else:
wrapper = '#include "THNN_generic.h"\n'
wrapper = '#include "THNN_generic.inc.h"\n\n'
wrapper += 'namespace torch { namespace nn {\n\n'
wrapper += base_wrapper
wrapper += '}} // namespace torch::nn\n'
return wrapper
def process_declarations(self, declarations):
for declaration in declarations:
base_args = declaration['options'][0]['arguments']
for option in declaration['options']:
for idx, arg in enumerate(option['arguments']):
arg['assign_name'] = base_args[idx]['name']
arg['assign_type'] = base_args[idx]['type']
if idx != 1:
arg['ignore_check'] = True
return declarations
def get_arg_accessor(self, arg, option):
return self.get_type_unpack(arg, option)
def process_pre_arg_assign(self, pre_arg_assign, option):
if option['backend'] == 'cunn':
# Enclose arg_assign with CUDA guard
pre_arg_assign.append('#ifdef WITH_CUDA')
return pre_arg_assign
def process_option_code_template(self, template, option):
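        # The incoming template is discarded: this plugin generates the
        # entire per-backend dispatch body itself.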
template = []
if option['backend'] == 'cunn':
template.append('#endif')
def base_cast(arg, CReal, real):
name = 'arg_' + arg['assign_name']
type = arg['type']
if type in self.REAL_TENSOR_TYPES:
return ('(TH{CReal}Tensor*){name}->cdata()'
.format(CReal=CReal, name=name))
elif type in self.INDEX_TENSOR_TYPES:
return '({type}){name}->cdata()'.format(type=type, name=name)
elif type == 'THCState*':
return '({}){}'.format(type, name)
elif type == 'real':
if real == 'half':
return 'THC_float2half({})'.format(name)
return '({real}){name}'.format(real=real, name=name)
return name
def cast(arg, CReal, real):
expr = base_cast(arg, CReal, real)
if arg.get('optional', False):
name = 'arg_' + arg['assign_name']
return '{name} ? {expr} : NULL'.format(name=name, expr=expr)
return expr
if option['backend'] == 'nn':
float_args = []
double_args = []
            for arg in option['arguments']:
float_args.append(cast(arg, 'Float', 'float'))
double_args.append(cast(arg, 'Double', 'double'))
code = self.THNN_TEMPLATE.substitute(
name=option['cname'],
float_args=',\n'.join(float_args),
double_args=',\n'.join(double_args))
template.append(code)
elif option['backend'] == 'cunn':
float_args = []
double_args = []
half_args = []
            for arg in option['arguments']:
float_args.append(cast(arg, 'Cuda', 'float'))
double_args.append(cast(arg, 'CudaDouble', 'double'))
half_args.append(cast(arg, 'CudaHalf', 'half'))
code = self.THCUNN_TEMPLATE.substitute(
name=option['cname'],
float_args=',\n'.join(float_args),
double_args=',\n'.join(double_args),
half_args=',\n'.join(half_args))
template.append(code)
template.append('')
return template
def get_type_unpack(self, arg, option):
return Template(arg.get('assign_name', arg['name']))
def get_type_check(self, arg, option):
if option['backend'] == 'cunn':
return Template('is_cuda')
else:
return Template('!is_cuda')
def get_assign_args(self, arguments):
assign_args = []
for arg in arguments:
arg = copy.copy(arg)
new_type = self.INPUT_ARGUMENT_MAP.get(arg['type'])
if new_type is not None:
arg['type'] = new_type
assign_args.append(arg)
return assign_args
def get_wrapper_template(self, declaration):
# get assign arguments string
base_arguments = declaration['options'][0]['arguments']
args = self.get_assign_args(base_arguments)
arg_str = ', '.join([arg['type'] + ' ' + arg['name'] for arg in args])
if self.header:
return Template(self.HEADER_TEMPLATE.safe_substitute(args=arg_str))
def get_checked_args(tensor_types):
checked_args = []
for arg in base_arguments:
if arg['type'] in tensor_types:
name = arg.get('assign_name', arg['name'])
name_str = name
if arg.get('optional', False):
name_str = '?' + name_str
checked_args += ['"' + name_str + '"', name]
checked_args += ['NULL']
return checked_args
real_args = get_checked_args(self.REAL_TENSOR_TYPES)
long_args = get_checked_args(self.INDEX_TENSOR_TYPES)
# check input types
types_checks = []
if len(real_args) > 1:
types_checks.append(self.INPUT_TYPE_CHECK.substitute(
type='type', tensor_args=', '.join(real_args)))
if len(long_args) > 1:
types_checks.append(self.INPUT_TYPE_CHECK.substitute(
type='thpp::Type::LONG', tensor_args=', '.join(long_args)))
return Template(self.WRAPPER_TEMPLATE.safe_substitute(
input=args[0]['name'],
args=arg_str,
type_check='\n '.join(types_checks)))
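# Sketch (illustrative argument names): the strings base_cast renders for two
# common cases, re-derived from the format strings above.
print('(TH{CReal}Tensor*){name}->cdata()'.format(CReal='Cuda', name='arg_input'))
# -> (THCudaTensor*)arg_input->cdata()   (THTensor* arg, CUDA float case)
print('THC_float2half({})'.format('arg_scale'))
# -> THC_float2half(arg_scale)           (`real` scalar, half case)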


@@ -0,0 +1,69 @@
from . import CWrapPlugin
from string import Template
class KwargsPlugin(CWrapPlugin):
ACCESSOR_TEMPLATE = Template('(__tuplecount > $idx ? PyTuple_GET_ITEM(args, $idx) : __kw_$name)')
KWARG_ONLY_ACCESSOR_TEMPLATE = Template('__kw_$name')
CHECK_TEMPLATE = Template('(__tuplecount > $idx || __kw_$name) && $code')
KWARG_ONLY_CHECK_TEMPLATE = Template('__kw_$name && $code')
WRAPPER_TEMPLATE = Template("""
$declarations
if (kwargs) {
$lookups
}
""")
def process_declarations(self, declarations):
# We don't have access to declaration or options in get_arg_accessor
# and process_single_check, so we have to push the flag down to
# the args.
for declaration in declarations:
if declaration.get('no_kwargs'):
for option in declaration['options']:
for arg in option['arguments']:
arg['no_kwargs'] = True
        # kwarg_only args can't be addressed positionally in *args, so mark
        # them so that no positional index is assigned to them
        for declaration in declarations:
            for option in declaration['options']:
                for arg in option['arguments']:
                    if arg.get('kwarg_only'):
                        arg['no_idx'] = True
return declarations
def get_arg_accessor(self, arg, option):
if arg.get('no_kwargs'):
return
if arg.get('kwarg_only'):
return self.KWARG_ONLY_ACCESSOR_TEMPLATE.substitute(name=arg['name'])
return self.ACCESSOR_TEMPLATE.substitute(idx=arg['idx'], name=arg['name'])
def process_single_check(self, code, arg, arg_accessor):
if arg.get('no_kwargs'):
return code
if arg.get('kwarg_only'):
return self.KWARG_ONLY_CHECK_TEMPLATE.substitute(name=arg['name'], code=code)
return self.CHECK_TEMPLATE.substitute(idx=arg['idx'], name=arg['name'], code=code)
def process_wrapper(self, code, declaration):
if declaration.get('no_kwargs'):
return code
seen_args = set()
args = []
for option in declaration['options']:
for arg in option['arguments']:
name = arg['name']
if (not arg.get('ignore_check') and
not arg.get('no_kwargs') and
name not in seen_args):
seen_args.add(name)
args.append(name)
declarations = '\n '.join(['PyObject *__kw_{} = NULL;'.format(a) for a in args])
lookups = '\n '.join(
['__kw_{name} = PyDict_GetItemString(kwargs, "{name}");'.format(name=a) for a in args])
start_idx = code.find('{') + 1
new_code = self.WRAPPER_TEMPLATE.substitute(declarations=declarations, lookups=lookups)
return code[:start_idx] + new_code + code[start_idx:]
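# Sketch (illustrative idx/name): each accessor tries the positional tuple
# first, then the keyword slot fetched in the injected prologue.
from string import Template
ACCESSOR = Template('(__tuplecount > $idx ? PyTuple_GET_ITEM(args, $idx) : __kw_$name)')
print(ACCESSOR.substitute(idx=1, name='dim'))
# -> (__tuplecount > 1 ? PyTuple_GET_ITEM(args, 1) : __kw_dim)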


@@ -0,0 +1,14 @@
from . import CWrapPlugin
class NullableArguments(CWrapPlugin):
def process_single_check(self, code, arg, arg_accessor):
if 'nullable' in arg and arg['nullable']:
return '({} || {} == Py_None)'.format(code, arg_accessor)
return code
def process_single_unpack(self, code, arg, arg_accessor):
if 'nullable' in arg and arg['nullable']:
return '({} == Py_None ? NULL : {})'.format(arg_accessor, code)
return code
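# Sketch (toy check string and accessor): both hooks wrap the generated C
# snippets so Py_None passes the check and unpacks to NULL.
plugin = NullableArguments()
arg = {'nullable': True}
print(plugin.process_single_check('THPUtils_checkLong(obj)', arg, 'obj'))
# -> (THPUtils_checkLong(obj) || obj == Py_None)
print(plugin.process_single_unpack('THPUtils_unpackLong(obj)', arg, 'obj'))
# -> (obj == Py_None ? NULL : THPUtils_unpackLong(obj))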


@@ -0,0 +1,18 @@
from . import CWrapPlugin
from ...shared import cwrap_common
class OptionalArguments(CWrapPlugin):
def process_declarations(self, declarations):
for declaration in declarations:
cwrap_common.enumerate_options_due_to_default(
declaration,
allow_kwarg=True,
type_to_signature={},
remove_self=False)
return declarations


@@ -0,0 +1,90 @@
from copy import deepcopy
from . import CWrapPlugin
class ProcessorSpecificPlugin(CWrapPlugin):
def process_declarations(self, declarations):
# In order to move Torch's random functions into the same cwrap
# declaration, we need to be able to handle the fact that on the CPU
# these functions take a generator argument, while on the GPU, they
# do not. As such, we would like to split those declarations at cwrap
# runtime into two separate declarations, one for the CPU (unchanged),
# and one for the GPU (with the generator argument removed).
#
# For example, the declaration arguments:
# arguments:
# - THTensor* self
# - arg: THGenerator* generator
# default: THPDefaultGenerator->cdata
# kwarg_only: True
#
# Would have the generator argument removed when generating for the GPU
# backend.
        def arg_contains_generator(arg):
            return (arg['type'] == 'THGenerator*' or
                    'THPDefaultGenerator' in str(arg.get('default', '')))
def split_candidate(declaration):
# First, check and see if it is a declaration for both CPU/GPU
if all([proc in declaration['backends'] for
proc in ['CPU', 'CUDA']]):
for option in declaration['options']:
for argument in option['arguments']:
if arg_contains_generator(argument):
return True
return False
def can_we_handle_the_split(declaration):
# hook into here if the split cannot happen for some reason
return True
def generator_split(declaration):
# the split must make two changes: 1. remove the generator argument
# for the GPU, and 2. assign the correct backends/types to the
# split declaration
dec_cpu = declaration
dec_gpu = deepcopy(declaration)
# Remove GPU backend and types from dec_cpu
dec_cpu['backends'].remove('CUDA')
if dec_cpu.get('backend_type_pairs', False):
dec_cpu['backend_type_pairs'] = (
[pair for pair in dec_cpu['backend_type_pairs'] if
pair[1] == 'CPU'])
# also need to reach into options
for option in dec_cpu['options']:
option['backends'].remove('CUDA')
# Remove CPU backend and types from dec_gpu
dec_gpu['backends'].remove('CPU')
if dec_gpu.get('backend_type_pairs', False):
dec_gpu['backend_type_pairs'] = (
[pair for pair in dec_gpu['backend_type_pairs'] if
pair[1] == 'CUDA'])
# also need to reach into options
for option in dec_gpu['options']:
option['backends'].remove('CPU')
# Remove generator arguments from dec_gpu options
for option in dec_gpu['options']:
option['arguments'] = (
[arg for arg in option['arguments'] if
not arg_contains_generator(arg)])
return [dec_cpu, dec_gpu]
decs = []
for declaration in declarations:
if split_candidate(declaration):
assert(can_we_handle_the_split(declaration))
newdecs = generator_split(declaration)
decs.extend(newdecs)
else:
decs.append(declaration)
return decs
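# Sketch (minimal declaration dict; real declarations carry many more keys):
dec = {'backends': ['CPU', 'CUDA'],
       'options': [{'backends': ['CPU', 'CUDA'],
                    'arguments': [{'type': 'THTensor*'},
                                  {'type': 'THGenerator*'}]}]}
cpu, gpu = ProcessorSpecificPlugin().process_declarations([dec])
print(cpu['backends'], len(cpu['options'][0]['arguments']))  # ['CPU'] 2
print(gpu['backends'], len(gpu['options'][0]['arguments']))  # ['CUDA'] 1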


@@ -0,0 +1,21 @@
from . import CWrapPlugin
from string import Template
class ReturnArguments(CWrapPlugin):
ARGUMENT_RETURN_TEMPLATE = Template("Py_INCREF($arg);\nreturn (PyObject*)($arg);")
TUPLE_RETURN_TEMPLATE = Template("return PyTuple_Pack($num_args, $args);")
def initialize(self, cwrap):
self.cwrap = cwrap
def get_return_wrapper(self, option):
if option['return'].startswith('argument '):
indices = list(map(int, option['return'][len('argument '):].split(',')))
args = [option['arguments'][idx] for idx in indices]
accessors = [self.cwrap.get_arg_accessor(arg, option) for arg in args]
if len(args) == 1:
return Template(self.ARGUMENT_RETURN_TEMPLATE.safe_substitute(arg=accessors[0]))
else:
return Template(self.TUPLE_RETURN_TEMPLATE.safe_substitute(num_args=len(args),
args=', '.join(accessors)))
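# Sketch (illustrative spec string): 'return: argument i[,j...]' hands the
# listed arguments back as the Python result; one index is Py_INCREF'd and
# returned, several are packed into a tuple. The index parsing on its own:
ret = 'argument 0,2'
print(list(map(int, ret[len('argument '):].split(','))))  # -> [0, 2]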


@@ -0,0 +1,151 @@
import os
from string import Template
from . import CWrapPlugin
MODULE_HEAD = """
#include <Python.h>
#include <exception>
#include "THP_API.h"
"""
with open(os.path.join(os.path.dirname(__file__), 'templates', 'module_tail.cpp'), 'r') as f:
MODULE_TAIL = Template(f.read())
REGISTER_METHOD_TEMPLATE = Template(' {"$name", (PyCFunction)$name, METH_VARARGS, NULL},\n')
MODULE_METHODS_TEMPLATE = Template("""
static PyMethodDef module_methods[] = {
$METHODS
{NULL, NULL, 0, NULL}
};
""")
class StandaloneExtension(CWrapPlugin):
TYPE_UNPACK = {
'THFloatTensor*': Template('THPFloatTensor_CData((THPFloatTensor*)$arg)'),
'THDoubleTensor*': Template('THPDoubleTensor_CData((THPDoubleTensor*)$arg)'),
'THLongTensor*': Template('THPLongTensor_CData((THPLongTensor*)$arg)'),
'THIntTensor*': Template('THPIntTensor_CData((THPIntTensor*)$arg)'),
'THCudaHalfTensor*': Template('THCPHalfTensor_CData((THCPHalfTensor*)$arg)'),
'THCudaTensor*': Template('THCPFloatTensor_CData((THCPFloatTensor*)$arg)'),
'THCudaDoubleTensor*': Template('THCPDoubleTensor_CData((THCPDoubleTensor*)$arg)'),
'THCudaLongTensor*': Template('THCPLongTensor_CData((THCPLongTensor*)$arg)'),
'half': Template('THPHalfUtils_unpackReal($arg)'),
'float': Template('THPFloatUtils_unpackReal($arg)'),
'double': Template('THPDoubleUtils_unpackReal($arg)'),
'bool': Template('($arg == Py_True ? true : false)'),
'int': Template('THPUtils_unpackLong($arg)'),
'long': Template('THPUtils_unpackLong($arg)'),
'void*': Template('(void*)THPUtils_unpackLong($arg)'),
'THGenerator*': Template('THPGenerator_CData((THPGenerator*)$arg)'),
}
TYPE_CHECK = {
'THDoubleTensor*': Template('(PyObject*)Py_TYPE($arg) == THPDoubleTensorClass'),
'THFloatTensor*': Template('(PyObject*)Py_TYPE($arg) == THPFloatTensorClass'),
'THLongTensor*': Template('(PyObject*)Py_TYPE($arg) == THPLongTensorClass'),
'THIntTensor*': Template('(PyObject*)Py_TYPE($arg) == THPIntTensorClass'),
'THCudaHalfTensor*': Template('THCPHalfTensor_Check($arg)'),
'THCudaTensor*': Template('(PyObject*)Py_TYPE($arg) == THCPFloatTensorClass'),
'THCudaDoubleTensor*': Template('THCPDoubleTensor_Check($arg)'),
'THCudaLongTensor*': Template('(PyObject*)Py_TYPE($arg) == THCPLongTensorClass'),
'half': Template('THPHalfUtils_checkReal($arg)'),
'float': Template('THPFloatUtils_checkReal($arg)'),
'double': Template('THPDoubleUtils_checkReal($arg)'),
'bool': Template('PyBool_Check($arg)'),
'int': Template('THPUtils_checkLong($arg)'),
'long': Template('THPUtils_checkLong($arg)'),
'void*': Template('THPUtils_checkLong($arg)'),
'THGenerator*': Template('(PyObject*)Py_TYPE($arg) == THPGeneratorClass'),
}
WRAPPER_TEMPLATE = Template("""
PyObject * $name(PyObject *_unused, PyObject *args)
{
HANDLE_TH_ERRORS
int __argcount = args ? PyTuple_Size(args) : 0;
$options
} else {
THPUtils_invalidArguments(args, NULL, "$name", 1, $expected_args);
return NULL;
}
END_HANDLE_TH_ERRORS
}
""")
TYPE_NAMES = {
'THGenerator*': 'Generator',
'THCudaHalfTensor*': 'torch.cuda.HalfTensor',
'THCudaTensor*': 'torch.cuda.FloatTensor',
'THCudaDoubleTensor*': 'torch.cuda.DoubleTensor',
'THCudaLongTensor*': 'torch.cuda.LongTensor',
'THDoubleTensor*': 'torch.DoubleTensor',
'THFloatTensor*': 'torch.FloatTensor',
'THBoolTensor*': 'torch.ByteTensor',
'THLongTensor*': 'torch.LongTensor',
'THIndexTensor*': 'torch.LongTensor',
'THIntTensor*': 'torch.IntTensor',
'THLongStorage*': 'torch.LongStorage',
'long': 'int',
'int': 'int',
'real': 'float',
'half': 'float',
'double': 'float',
'float': 'float',
'accreal': 'float',
'bool': 'bool',
'void*': 'int',
}
def __init__(self, module_name):
self.module_name = module_name
self.declarations = []
def process_full_file(self, code):
short_name = self.module_name.split('.')[-1]
new_code = MODULE_HEAD
new_code += code
new_code += self.declare_module_methods()
new_code += MODULE_TAIL.substitute(full_name=self.module_name, short_name=short_name)
return new_code
def process_wrapper(self, code, declaration):
self.declarations.append(declaration)
return code
def declare_module_methods(self):
module_methods = ''
for declaration in self.declarations:
module_methods += REGISTER_METHOD_TEMPLATE.substitute(name=declaration['name'])
return MODULE_METHODS_TEMPLATE.substitute(METHODS=module_methods)
def get_type_unpack(self, arg, option):
return self.TYPE_UNPACK.get(arg['type'], None)
def get_type_check(self, arg, option):
return self.TYPE_CHECK.get(arg['type'], None)
def get_wrapper_template(self, declaration):
arg_desc = []
def describe_arg(arg):
desc = self.TYPE_NAMES[arg['type']] + ' ' + arg['name']
if arg.get('nullable'):
return '[{} or None]'.format(desc)
return desc
for option in declaration['options']:
option_desc = [describe_arg(arg)
for arg in option['arguments']
if not arg.get('ignore_check', False)]
if option_desc:
arg_desc.append('({})'.format(', '.join(option_desc)))
else:
arg_desc.append('no arguments')
arg_desc.sort(key=len)
arg_desc = ['"' + desc + '"' for desc in arg_desc]
arg_str = ', '.join(arg_desc)
return Template(self.WRAPPER_TEMPLATE.safe_substitute(expected_args=arg_str))
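# Sketch (illustrative accessor expression): each TYPE_UNPACK entry maps an
# argument accessor to a C expression yielding the raw TH value.
from string import Template
unpack = Template('THPFloatTensor_CData((THPFloatTensor*)$arg)')
print(unpack.substitute(arg='PyTuple_GET_ITEM(args, 0)'))
# -> THPFloatTensor_CData((THPFloatTensor*)PyTuple_GET_ITEM(args, 0))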

Some files were not shown because too many files have changed in this diff.