Compare commits

...

4298 Commits

Author SHA1 Message Date
2b4748011b Scopes 0.3.1 backport (#5153)
* Introduce scopes during tracing (#3016)

* Fix segfault during ONNX export

* Further fix to tracing scope (#4558)

* Set missing temporary scope in callPySymbolicMethod

* Use expected traces in all scope tests

* Fix tracking of tracing scopes during ONNX pass (#4524)

* Fix tracking of tracing scopes during ONNX pass

* Use ResourceGuard to manage setting a temporary current scope in Graph

* Add tests for ONNX pass scopes

* Remove unused num_classes argument

* Expose node scopeName to python (#4200)

* Inherit JIT scopes when cloning only when it's correct

It's correct only when the new graph owns the same scope tree
as the original one. We can end up with dangling pointers otherwise.

* Fixes after cherry-picking, still one test to go

* Fix for last failing test after scope cherry-pick

* Fix linting issue
2018-02-09 12:07:43 -05:00
902d57be9f Cherry pick dataloader issue fix to 0.3.1 (#5140)
* cherry pick Fix multiprocessing and dataloader tests on Windows (#4453)

* cherry pick Dataloader issues #4643

* fix common IS_WINDOWS
2018-02-09 11:44:58 -05:00
db9a700cb7 skip flaky test 2018-02-08 09:46:22 -08:00
94ba828f7b warn that CUDA capabilities 3.0 and 5.0 are no longer supported 2018-02-07 15:35:51 -08:00
3d6242daba fix code after cherry-picking 2018-02-07 13:58:04 -08:00
e90699b862 Fixed double memory accesses of several pointwise operations. (#5068)
Because nvcc does not know that in/out pointers do not alias each other,
if we assign a value to *out and then use *in again, the kernel has to
emit a write to *out and then another read from *in.

(Affected kernels become marginally faster after the fix.)
2018-02-07 09:21:41 -08:00
d515806e84 Broadcast output requires_grad only if the corresponding input requires_grad (#5061) 2018-02-07 09:21:28 -08:00
17f94d20aa Fix topk work size computation (#5053)
* fix grid computation for topk kernel

* backslash alignment, no change in code
2018-02-07 09:20:13 -08:00
ac0b41e3ba Fix maxpool3d / avgpool3d crashes (#5052)
* Replace downcastOuter with newFoldBatchDim

* Fix double free

* Address comments
2018-02-07 09:20:00 -08:00
29f897ea2f fix #5047 (#5048) 2018-02-07 09:18:47 -08:00
bbdaf66534 make torch.set_num_threads also set MKL threads (take 2) (#5002)
* torch.set_num_threads sets MKL option too

* fix to use C prototype instead of fortran
2018-02-07 09:18:34 -08:00
404510e3d5 Fix reduction functions not respecting the strides of output when output is correct size (#4995) 2018-02-07 09:18:20 -08:00
da3c4cb84c Fix refcycles in DataParallel scatter and gather (#4988)
* Eliminate reference cycles in scatter_gather

* Test for refcycles

* Better fix

* Add comments
2018-02-07 09:17:34 -08:00
30bd9e6462 Improve CUDA softmax performance 2018-02-07 09:15:56 -08:00
4e549e94b7 fix triu and tril for zero-strided inputs on gpu (#4962) 2018-02-07 09:15:46 -08:00
00f9da74b3 make torch.cuda.empty_cache() a no-op when cuda is not initialized (#4936) 2018-02-07 09:15:29 -08:00
a5fec26622 Lazy init in set device, also should not be called in getDevCount (#4918) 2018-02-07 09:13:47 -08:00
a33a75385c Add missing _lazy_init in cuda python functions (#4907) 2018-02-07 09:13:19 -08:00
86fdc898ec Don't throw exceptions inside OpenMP parallel blocks (#4857)
Fixes undefined behavior: exceptions are not allowed to be thrown across
OpenMP constructs.
2018-02-07 09:11:33 -08:00
548596b5f7 Fix typo (#4846) 2018-02-07 09:11:23 -08:00
7972c7e290 Initialize cuda before setting cuda tensor types as default 2018-02-07 09:08:34 -08:00
395d5f9295 More documentation for CUDA stream functions. (#4756) 2018-02-07 09:08:20 -08:00
596133697c Legacy Padding: correct output size with nInputDim 2018-02-07 09:08:10 -08:00
2bcc44a33d [ASAN] fix more load_real deletes (#4694) 2018-02-07 09:01:43 -08:00
281b6ce41d updated documentation for Embedding layer. Fixes #4682 (#4684) 2018-02-07 09:00:13 -08:00
e757aaf07d Fix cast direction in THCBlas (#4670) 2018-02-07 08:58:57 -08:00
d9001ce861 Fix wrong learning rate evaluation in CosineAnnealingLR in Python 2 (#4656) 2018-02-07 08:58:33 -08:00
b68861d7bb Add Cosine Annealing LR Scheduler (#3311)
* Add Cosine Annealing LR Scheduler

* Update eta_min in tests to prevent numerical mistakes

* Use non-zero min_eta in test_cos_anneal_lr
2018-02-07 08:58:27 -08:00
f9f113adf5 current code works with dim = 3, so I added it to dim checks 2018-02-07 08:57:15 -08:00
9e3bcf4bce More strict shape check on Conv operators. (#4637)
* More strict shape check on Conv operators.

Signed-off-by: HE, Tao <sighingnow@gmail.com>

* Test case for conv's shape check.

Signed-off-by: HE, Tao <sighingnow@gmail.com>

* Fix lint.

Signed-off-by: HE, Tao <sighingnow@gmail.com>
2018-02-06 21:45:01 -08:00
b4862f67db Clean up error checking in THPTensor_(_convertToTensorIndexers) 2018-02-06 21:44:47 -08:00
f98c795b71 Fix use after free (#4559)
In `THPTensor_(_convertToTensorIndexers)`, a `vector<THPIndexTensor>` is
created by constructing `THPTensor`s from sequences/tensors/etc. Each
`THPIndexTensor` is then freed with the following:

```
for (auto& idx : indexers) {
  THIndexTensor_(free)(LIBRARY_STATE idx->cdata);
  Py_DECREF(idx);
}
```

This is a problem because `Py_DECREF(idx)` will turn `idx->ob_refcnt` to 0 since this function
created the relevant `THPIndexTensor`s and owns them, causing `THPTensor_(dealloc)` to be
called. `THPTensor_(dealloc)` already has a line that calls
`THIndexTensor_(free)(LIBRARY_STATE idx->cdata)`.

So `THIndexTensor_(free)(LIBRARY_STATE idx->cdata)` gets called twice on the same
`cdata`. After the first call frees `cdata`, the second attempts to access flags/members of `cdata` to
determine if it should free it.
2018-02-06 21:44:32 -08:00
24a4881cb2 small fix on MaxPool2d __repr__ (#4591) 2018-02-06 21:43:31 -08:00
7427a88660 Extract the finish check for profiler (#4519)
* Extract the finish check for profiler

Delete unused import and rearrange the import order.

* Add imports for win support
2018-02-06 21:43:23 -08:00
26f038a557 Improve memory access patterns for index operations. (#4493)
Currently, index operation kernels work in "source/destination index-major
order".  (E.g., if thread count equals slice size, each thread will process
slice #0 in lockstep, and then slice #1, and so on.)

However, when elements inside each "slice" are separated by large strides (e.g.,
selecting columns of a matrix), it is better to switch to "elementInSlice-major
order".  For example, each thread can process element #0 of every slice, and
then element #1 of every slice, and so on.
2018-02-06 21:41:15 -08:00
3321cdce84 Fix StepLR docs (#4478) 2018-02-06 21:41:03 -08:00
f31ac990f7 Improve float precision stability of linspace op, fix 4419. (#4470)
Signed-off-by: HE, Tao <sighingnow@gmail.com>
2018-02-06 21:40:51 -08:00
f69c6e4f2f instance norm fix running stats settings (#4444) 2018-02-06 21:40:13 -08:00
4cec94d8ba Fix python gc race condition with THPVariable_traverse (#4437) 2018-02-06 21:38:59 -08:00
d721743b03 Add random_split to torch.utils.data.dataset (#4435) 2018-02-06 21:38:45 -08:00
a52e9dd352 More detailed documentation. (#4428)
* More detailed documentation.

* More detailed documentation.

* Fixed W291

* minor bug fixes
2018-02-06 21:38:26 -08:00
26751c5675 fixes #4403 (#4407) 2018-02-06 21:38:13 -08:00
45b06257dc Fix undefined FileNotFoundError (#4384) 2018-02-06 21:38:02 -08:00
819e76fa11 add bias term to linear __repr__ functions, fix spacing
Adds a missing bias term to the __repr__ functions of the
Linear and Bilinear modules. Fixes the spacing in the Conv2d
__repr__ to make it consistent with other modules.
2018-02-06 21:37:02 -08:00
3f2b57e8e8 Improved documentation of several index operations. 2018-02-06 21:36:54 -08:00
6ff04fbcd4 Add check for slice shape match in index_copy_ and index_add_. (#4342)
Emits a warning if slices have the same size but different shapes.  (It
shouldn't be allowed, but it was, so some code might be unknowingly depending on
the behavior.)

Also refactored argument checking code, including index_fill_.
2018-02-06 21:35:50 -08:00
ab5b03e02b fix MaxPool2d __repr__ missing ceil_mode summary (#4335) 2018-02-06 21:35:22 -08:00
24cd54d064 fix an out of bounds hypothetical (#4240) 2018-02-06 21:34:58 -08:00
afd1ce006b fix typo (#4206) 2018-02-06 21:32:13 -08:00
1e86e9106e Allow map_location in torch.load to be a string 2018-02-06 21:32:05 -08:00
4630ce8999 Fix distributed dataloader so it pins memory to current GPU not GPU 0. 2018-02-06 21:26:34 -08:00
315662554d Add default PyTorch seeding and worker_init_fn to DataLoader (#4018)
* Add default PyTorch seeding and worker_init_fn to DataLoader

* generate seed using current RNG each time

* worker_seed <- main_proc_RNG_generated_seed + worker_id
2018-02-06 21:26:24 -08:00
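A minimal sketch of the seeding rule described above (worker seed = a seed drawn from the main-process RNG plus the worker id), using a custom `worker_init_fn`; the dataset and seed range below are illustrative, not taken from the commit:

```python
import random
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3), torch.zeros(8))

# A seed drawn from the main-process RNG; each worker offsets it by its id,
# mirroring "worker_seed <- main_proc_RNG_generated_seed + worker_id".
base_seed = random.randint(0, 2 ** 31 - 1)

def worker_init_fn(worker_id):
    torch.manual_seed(base_seed + worker_id)

loader = DataLoader(dataset, batch_size=2, num_workers=2,
                    worker_init_fn=worker_init_fn)
for data, target in loader:
    pass  # each worker's RNG was seeded deterministically before loading
```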
c0e270a142 Signal handling in DataLoader workers; Timeout option (#3474) 2018-02-06 21:26:10 -08:00
9260184592 Add function to explicitly initialize PyTorch CUDA state. (#4180)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2018-02-06 21:22:07 -08:00
120e9014d3 fix typo (#4175) 2018-02-06 21:21:56 -08:00
343c15c2b1 Rearrange dimensions for pointwise operations for better performance. (#4174)
* Rearrange dimensions for pointwise operations for better performance.

In existing code, pointwise operations on transposed tensors process data
"column by column", resulting in poor performance.  The worst case happens when
all operands are transposed tensors.

This change tries to "un-transpose" tensors in such a case, so that memory
access patterns are as sequential as possible.

* More explanation on what rearrangeDims() does.

* Fixed a very important (and stupid) typo.
2018-02-06 21:21:46 -08:00
85ea548cff Update instancenorm.py (#4171) 2018-02-06 21:21:30 -08:00
8a9f570eb1 Better error messages for blas ops with cuda.LongTensor (#4160)
* Better error messages for blas ops with cuda.LongTensor

Fixes #4157

Test plan

Try matrix multiplying with cuda.LongTensors

>>> import torch
>>> x = torch.randn(4, 4).long().cuda()
>>> y = torch.randn(4, 4).long().cuda()
>>> x.mm(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: addmm for CUDA tensors only supports floating-point types. Try converting the tensors with .float() at /private/home/rzou/pytorch/pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:381
2018-02-06 21:21:22 -08:00
53d1b6318b improve svd doc (#4155) 2018-02-06 21:21:12 -08:00
9e18db974b Add cublas batched gemm support. (#4151)
* Add cublas batched gemm.

* Comment cleanup batched gemm.

* Fix cuda versioning batched gemm.
2018-02-06 21:20:58 -08:00
84da898124 Added explicit tuple element-count to doc for Conv1d. (#4136)
* Added explicit tuple element-count to doc for Conv1d.
2018-02-06 21:20:43 -08:00
63d6afdc62 improve performance of maxpooling backwards (#4106) 2018-02-06 21:20:30 -08:00
8c3e1b713a Add proper shape checking to torch.cat (#4087)
* Fix catArray in THTensor

Asserts that the inputs have the same size except in the
cat dimension or are empty (or a mix of both).

* Fix catArray for THCTensor

* Document torch.cat shape checks

* Fix types
2018-02-06 21:19:50 -08:00
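For reference, a usage sketch of the shape rule this commit enforces: inputs to `torch.cat` must match in every dimension except the concatenation dimension (tensor sizes below are illustrative):

```python
import torch

a = torch.zeros(2, 3)
b = torch.zeros(4, 3)

# OK: sizes agree in every dimension except dim 0, the concatenation dim.
print(torch.cat([a, b], dim=0).size())   # torch.Size([6, 3])

# After this fix, mismatched non-cat dimensions raise an error instead of
# silently misbehaving, e.g.:
#   torch.cat([torch.zeros(2, 3), torch.zeros(2, 4)], dim=0)
```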
0185d5aac6 Fix repeat non owning (#4084) 2018-02-06 21:19:40 -08:00
faea900161 assert (#4056) 2018-02-06 21:13:37 -08:00
19b1ad8c30 slightly simplified indexing (#4040) 2018-02-06 20:54:02 -08:00
bd8b9986ec Implement NLLLossNd (#4035)
* Implement NLLLossNd

* Fix tests and typos

* Fix tests
2018-02-06 20:53:52 -08:00
3d3bddb953 Use enabled in torch.autograd.profiler.emit_nvtx (#4032)
Or else it's always enabled.
2018-02-06 20:53:20 -08:00
d7bd3b9acf allow cudnn for fp16 batch norm (#4021) 2018-02-06 20:53:04 -08:00
7763c6f871 Raise errors when num_workers == 0 in DataLoader (#4019) 2018-02-06 20:51:28 -08:00
98879d58e3 Fix CUDA Multinomial checks (#4009) 2018-02-06 20:51:14 -08:00
e98af60a7d Accept longs in default_collate for dataloader in python 2 (#4001) 2018-02-06 20:51:04 -08:00
6338da9c19 Improve docs for torch and torch.Tensor (#3969)
* doc overhaul

* update split doc
2018-02-06 20:50:41 -08:00
b09e7a6788 update Tensor.new doc 2018-02-06 20:49:02 -08:00
a9a76c6e75 fix (#3953) 2018-02-06 20:48:51 -08:00
1a5a28dc34 rnn.py: Note zero defaults for hidden state/cell
* Add a note on zero defaults for hidden states/cells of
  RNNs/LSTMs/GRUs.

* Should fix the note in #434

Signed-off-by: mr.Shu <mr@shu.io>
2018-02-06 20:48:42 -08:00
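A small sketch of the documented default: omitting the initial hidden state/cell makes the RNN start from zeros (written against the 0.3-era Variable API; sizes are illustrative):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

rnn = nn.LSTM(input_size=4, hidden_size=8, num_layers=1)
x = Variable(torch.randn(5, 3, 4))        # (seq_len, batch, input_size)

# Equivalent calls: when (h_0, c_0) is omitted it defaults to zero tensors.
out_default, _ = rnn(x)
h0 = Variable(torch.zeros(1, 3, 8))       # (num_layers, batch, hidden_size)
c0 = Variable(torch.zeros(1, 3, 8))
out_explicit, _ = rnn(x, (h0, c0))
```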
9fdc8644f1 improve Tensor.scatter doc 2018-02-06 20:48:28 -08:00
a120a008f8 Add rnn args check (#3925)
* Add rnn args check

* Check both hidden sizes for LSTM

* RNN args check test
2018-02-06 20:48:20 -08:00
ab8b632d8c Allow target.requires_grad in l1_loss and mse_loss (#3876) 2018-02-06 20:48:07 -08:00
e5920a1083 More docs for Conv1d Conv2d (#3870)
* Add a bit of notation explanation

For a first-time user of Conv1d, it is not clear from the documentation what N, C, and L mean exactly. This note should clarify that. The same applies to Conv2d.
2018-02-06 20:46:59 -08:00
6d1bccceec fix padding_idx for sparse=True (#3842) 2018-02-06 20:46:48 -08:00
82e39d1231 Fix MultiLabelMarginLoss docs (#3836) 2018-02-06 20:46:32 -08:00
07f0364304 Have __sizeof__ account for size of stored elements (#3821)
* Have __sizeof__ account for size of stored elements

* Conform to sizeof specification
2018-02-06 20:46:21 -08:00
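The idea here, reporting the Python object overhead plus the bytes of the stored elements so that `sys.getsizeof` is meaningful, can be sketched with a toy container; this illustrates the pattern only and is not the tensor implementation:

```python
import sys

class Buffer(object):
    """Toy container whose __sizeof__ also counts its element storage."""
    def __init__(self, n, element_size=4):
        self.n = n
        self.element_size = element_size

    def __sizeof__(self):
        # Object overhead plus the bytes occupied by the stored elements,
        # which is what sys.getsizeof() then reports.
        return object.__sizeof__(self) + self.n * self.element_size

print(sys.getsizeof(Buffer(1000)))  # includes the 4000 bytes of element data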
24e2ccfc07 Fix cosine_similarity's output shape (#3811) 2018-02-06 20:46:06 -08:00
4797f98158 add reduce arg to PoissonNLLLoss (#3770)
* add reduce arg to PoissonNLLLoss

* fixed comments except reference function

* fixed unit test

* small indentation fix

* fixing last comments by richard

* lint check

* another linting issue
2018-02-06 20:45:54 -08:00
840760c29f Fix DataParallel scattering for empty lists / dicts / tuples (#3769)
* Fix DataParallel scattering for empty lists and dicts

* Fix DataParallel scattering for empty tuples
2018-02-06 20:45:40 -08:00
ee24a054fe change doc for Adaptive Pooling 2018-02-06 20:45:25 -08:00
f3519fd5f7 Add missing trtrs, orgqr, ormqr docs (#3720)
* trtrs docs

* orgqr and ormqr docs
2018-02-06 20:45:16 -08:00
f816029a72 Remove hard file offset reset in load() (#3695)
* improved file offset logic

* load offset test

* whitespace

* needless exception handling

* test integer in binary
2018-02-06 20:44:32 -08:00
d27c3ce79c Fix cuBLAS arguments for fp16 dot (#3660)
* Fix cuBLAS arguments for fp16 dot

* Enable FloatTensor <-> CUDA HalfTensor checks in test_cuda.py
2018-02-06 20:44:07 -08:00
280bf0979d fixed a typo in ConcatDataset.cumulative_sizes attribute name 2018-02-06 20:43:37 -08:00
d880a52091 bump minor version 2018-02-06 10:56:31 -08:00
aae0ce4f05 updating gloo to latest master (#4608) 2018-01-25 10:17:00 -08:00
47d35d2dea add compress flags to NCCL 2018-01-25 09:53:05 -08:00
f8b5ce1541 remove old sass entries from nccl makefile 2018-01-25 09:33:01 -08:00
7f42c74f0f backport dlpack aten changes to v3 (#4823) 2018-01-23 23:52:00 -05:00
af3964a872 Backport transposes optimization to v0.3.0 (#3994)
* Optimizer: optimize transposes in variety of circumstances (#3509)

* Optimizer: Optimize transposes in variety of circumstances

- No-op transposes
- Consecutive transposes (fuse them)
- Transposes into Gemm (fuse them into transA/transB parameter)

* touch up out of date comment

* Backporting optimizer changes
2017-12-04 00:00:43 -08:00
1645546aa9 Propagate volatile in zeros_like (#3984)
Gradients were becoming non-volatile because at::zeros_like returned a
Variable with volatile always set to false. The non-volatile gradients
accumulated history in the model, which resulted in continuously
increasing memory usage.

See #3983, #3835, #3824

In v0.4 this will be more robustly solved by #3970
2017-12-04 00:00:43 -08:00
350fad8a22 fix softmax dim on 1D input 2017-12-01 16:17:49 -08:00
565d183042 Documentation updates for ONNX.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-12-01 00:16:31 -05:00
2ebda372f6 More ONNX support (#3928)
* Remove dilations for pooling in onnx export and other small fixes (#3698)

* fix optimization pass issues

* remove pool dilations

* Fix export for recent changes in ONNX (#3708)

* Fix symbolic for Embedding and Upsampling and improve error messages

* Record stack traces during JIT tracing (#3607)

* Update comments and size logic

* Record stack traces during JIT tracing

* Use string helper functions and AutoGIL

* Use SourceLocation object instead of storing in debugName

* Address zdevito comments

* Address comments

* Allow 1->N broadcasts at the beginning and end to be fused (#3616)

* Allow 1->N broadcasts at the beginning and end to be fused

* Update comments and size logic

* Implement bmm symbolic (#3681)

* Buildfix.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Now actually fix padding (the tests are added in onnx-pytorch) (#3893)

* Now actually fix padding (the tests are added in onnx-pytorch)

* fix test

* Fix exporting HalfTensor

* Fix padding according to https://github.com/onnx/onnx/issues/261

* Update ONNX IR we emit to version 0.0.2 (attribute discriminators) / fix Permute export (#3484)

* Regenerate ONNX nanopb from latest version.

But don't bump the IR version, we don't handle discriminators
yet.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Add discriminator to AttributeProto.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Add back ONNX definition for permute

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* PyTorch now uses operator versioning.

Also move some of the exporter info out of the ModelProto constructor.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-12-01 00:05:04 -05:00
28b846c486 Corrected formatting in "docker image" section 2017-11-29 19:05:09 +01:00
9622eaa6fa Fix void* wrapping in autograd codegen
Also, add assertions here and there to make sure bad things
never happen again.
2017-11-24 13:33:30 +01:00
db8154df32 flake8 fix 2017-11-20 16:32:00 -08:00
b6eeea343d Always define outputs of ConvBackwardBackward (#3800) 2017-11-20 19:05:08 -05:00
1fe9991554 fix exception handling when c++filt not found on host 2017-11-19 20:31:09 -08:00
00118024f3 add PEP440 compatible versioning 2017-11-18 13:22:35 -08:00
87edf5a349 change versioning scheme to be PEP440 compatible 2017-11-18 13:18:58 -08:00
20972878cc Rename pyro.distributions.Multinomial -> .Categorical (#3766)
* Rename distributions.Multinomial -> distributions.Categorical

* Rename Multinomial -> Categorical

* Update docs

* Update variable.py

* Update distributions.py

* Update variable.py
2017-11-18 13:11:44 -08:00
0d1128d25c fix cuDNN RNN weight tying test (#3774) 2017-11-18 11:43:03 -08:00
81dc60493d Detect aliasing in cuDNN RNN flatten_parameters (#3752)
* Detect aliasing in cuDNN RNN flatten_parameters

* add test
2017-11-17 19:32:59 -08:00
b18df1cedf add cuda9 options to nccl 2017-11-17 19:30:19 -08:00
3976d77509 add error checking for FusionCompiler on old CUDA 2017-11-16 19:52:02 -08:00
09c83673bf move static-libstdc++ to extra_link_args 2017-11-15 20:04:42 -08:00
5b9a8f918e update gloo submodule 2017-11-15 14:21:53 -08:00
f20fb2c1a1 fix half uniform for cuda 7.5 2017-11-14 11:36:49 -08:00
4e00120117 Support negative dimensions in softmax and log_softmax
Fixes #3677
2017-11-14 09:37:20 -08:00
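A quick usage sketch of the backported behavior: negative `dim` values count from the last dimension (the tensor shape is illustrative):

```python
import torch
import torch.nn.functional as F
from torch.autograd import Variable

x = Variable(torch.randn(2, 3, 5))

# dim=-1 now means "the last dimension", equivalent to dim=2 here.
y1 = F.softmax(x, dim=-1)
y2 = F.softmax(x, dim=2)
print((y1 - y2).abs().max())   # ~0
```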
2b3f35daea Fix elu double-backwards when applied in-place (#3687)
* Fix elu double-backwards when applied in-place

Removed unused "input" argument to elu_backwards. Also removed 'inplace'
argument from backwards functions, since we don't ever want to use it.

* Fix up additional calls to ELU_updateGradInput
2017-11-14 09:37:06 -08:00
c580437342 add linker version script 2017-11-13 20:07:50 -08:00
455e788fe6 add linker version script 2017-11-13 17:20:42 -08:00
c980fb359b [v0.3] Prevent segfaults from undefined aten tensors (#3675)
* [v0.3] Prevent segfaults from undefined aten tensors

* Move Undefined aten related files to proper place.
2017-11-13 17:49:04 -05:00
bae45bb106 add depthwise convolution terminology as a note 2017-11-12 20:28:30 -08:00
34557d80f4 Add border-padding for grid_sampler (#3599)
* adds border padding to spatial grid sampler

* fixes flake8

* adds docs
2017-11-12 15:49:54 -08:00
1e77879b2a fix build after cherry-pick 2017-11-12 14:51:44 -08:00
ff52d424b2 fix uninitialized warnings in THCUNN. (#3575) 2017-11-12 12:28:04 -08:00
4b7aa13b30 CPU all/any should work with empty tensors. (#3581) 2017-11-12 12:27:46 -08:00
e1f2d0916e Add missing documentation for replacement in WeightedRandomSampler (#3579)
* Update sampler.py

* fix lint
2017-11-12 12:27:21 -08:00
4b5b7e53f6 doc: Normalize all true/false in docstrings to `True|False` (#3593)
* doc: Normalize all true/false in docstrings to ``True|False``

This makes them more apparent in the documentation.

* doc: fix flake8
2017-11-12 12:26:43 -08:00
db66fa9436 docs: clarify the difference between net() and net.forward() (#3596) 2017-11-12 12:26:28 -08:00
392c89ab6a fix for unknown ssize_t in aten/src/TH/THMemoryFile.c (#3612)
* added sys/types.h include to fix unknown ssize_t in aten/src/TH/THMemoryFile.c

* now including <sys/types.h> only if _WIN32 is not #defined

* now including sys/types.h in aten/src/TH/THDiskFile.c (if _WIN32 is not defined) to fix undefined off_t
2017-11-12 12:25:22 -08:00
cddf501fc5 Expand autograd profiler docs (#3621) 2017-11-12 12:25:07 -08:00
d0907d2c34 added #define __STDC_FORMAT_MACROS to tensor and storage code templates to avoid problems with gcc 4.8.5 (#3629) 2017-11-12 12:24:36 -08:00
448a85a8e0 Fix module load_state_dict error information. 2017-11-12 12:24:20 -08:00
ea3138fd09 Remove redundant dimension check that produced maybe-uninitializd warnings 2017-11-12 12:23:55 -08:00
b89c96fe58 Fix for cuDNN half precision RNN for pre-volta archs (#3613)
* Fix for cuDNN half RNN on pre-volta archs

* Fix cuDNN versioning in rnn.

* lint fix
2017-11-12 12:23:31 -08:00
088f47bb89 fix selecting deterministic conv algo (#3631)
Conflicts:
	torch/csrc/cudnn/Conv.cpp
2017-11-12 12:22:59 -08:00
ddb3804f87 Allow torch.load and torch.save to take pathlib.Path (#3589)
* Allow torch.load to take pathlib.Path

pathlib has been part of the Python standard library for filesystem paths since Python 3.4,
but `torch.load` currently cannot take a `pathlib.Path` as the filename of a state dictionary.
I changed `torch.load` and `_with_file_like` so that they accept a `pathlib.Path`-typed file path.

* Fix flake8: too long line & indentation
2017-11-12 11:28:39 -08:00
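A usage sketch of what the change enables, passing a `pathlib.Path` instead of a `str` (Python 3.4+; the file name and payload are illustrative):

```python
import pathlib
import torch

path = pathlib.Path("checkpoint.pth")

state = {"step": 1, "weights": torch.randn(3, 3)}
torch.save(state, path)          # Path accepted directly
restored = torch.load(path)      # no need for str(path) any more
print(restored["step"])
```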
a896311d06 add warnings if device capability is less than ideal (#3601)
Conflicts:
	torch/csrc/cuda/Module.cpp
2017-11-12 11:27:37 -08:00
937b634b5d Fix cuda symeig (#3566)
* Fix cuda symeig

* Add symeig test

* Better check for magma
2017-11-12 11:26:16 -08:00
004dfdc7cc Fix ld* conditions for gemv ger gemm (#3604) 2017-11-12 11:24:49 -08:00
f8aa5e2ed7 Fix stride checks in gemm dispatch (#3548)
From https://software.intel.com/en-us/mkl-developer-reference-fortran-gemm:

 lda: "When transa = 'N' or 'n', then lda must be at least max(1, m),
       otherwise lda must be at least max(1, k)."

 ldb: "When transb = 'N' or 'n', then ldb must be at least max(1, k),
       otherwise ldb must be at least max(1, n)."

Partly addresses #3525
2017-11-12 11:24:30 -08:00
8a49309f81 Fix error when default_collate is passed a collection of numpy.str_ (#3404)
* Fix error when default_collate is passed a collection of numpy.str_

* Error if default_collate input is nested nparray containing non-numbers
2017-11-12 11:22:26 -08:00
14de24d89c fix linking order of nvrtc to force no-as-needed (#3583)
Conflicts:
	setup.py
2017-11-12 11:21:43 -08:00
c7cccc250e Fix uniform on CUDA tensor to return in range [0, 1) (#3547)
The curand_uniform function returns the range (0, 1]. Most RNG APIs have
the opposite bounds. Fix up the values in uniform_() so that they fall in
the more common bounds.
2017-11-12 11:19:23 -08:00
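One standard way to flip the half-open interval from (0, 1] to [0, 1) is the remapping u -> 1 - u; the snippet below only illustrates the interval flip and may differ from the actual kernel change:

```python
# Illustration only: mapping a draw u in (0, 1] to [0, 1).
def remap(u):
    # If u == 1.0 the result is exactly 0.0; u can never be 0, so the
    # output never reaches 1.0 -- the half-open end of the interval flips.
    return 1.0 - u

print(remap(1.0), remap(0.25), remap(1e-7))
```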
1f694e9a6e Bump version in v0.3.0 branch 2017-11-09 09:37:19 -08:00
1108bced80 Raise exception when Variable.reinforce is called (#3555)
Fixes #3554
2017-11-09 09:30:50 -08:00
c36d452224 add warnings if device capability is less than ideal for compiled cuda version 2017-11-09 07:58:52 -08:00
11955b86d2 THTensor_varOuterDim numeric stability (#3533) 2017-11-08 11:27:34 -05:00
9a6788202b Exposing emptyCache from allocator (#3518)
* Add empty_cache binding

* cuda.empty_cache document

* update docs
2017-11-07 15:44:52 -08:00
d58bad4073 avoid unnecessary multiplies in derivatives (#3545) 2017-11-07 15:44:45 -08:00
f95e252984 Document weights argument format for BCELoss (#3535) 2017-11-07 15:44:39 -08:00
b49f0f8154 Make distributions docstring raw (#3539) 2017-11-07 15:44:33 -08:00
269c25267b Add reduce keyword for KLDivLoss (#3330) 2017-11-07 15:44:26 -08:00
fde471ee2a add doc for sparse_adam (#3519) 2017-11-07 15:44:19 -08:00
eb24d2ff6e -1 indexing fix in THCApply for pre CUDA9 (#3457)
* THCApply fixes

* THCApply add undef
2017-11-07 15:44:12 -08:00
f768068c3b Fix float uniform generation in TH (#3541)
Generate random uniform floats in the range [0, 1) by generating random
uniform uint32 in the range [0, 2^24-1] and dividing by 2^24. This
ensures that the largest value is representable as a float32 less than
one.

This also changes the uniform double generation to use more bits of
randomness.
2017-11-07 13:28:05 -08:00
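The representability argument behind the fix can be checked directly: a 24-bit integer divided by 2^24 always stays strictly below 1.0 in float32, while the 32-bit analogue rounds up to exactly 1.0 (numpy is used here purely for the float32 arithmetic):

```python
import numpy as np

# Largest 24-bit value scaled down stays strictly below 1.0 in float32...
print(np.float32(2 ** 24 - 1) / np.float32(2 ** 24))   # 0.99999994
# ...while the 32-bit analogue rounds up to exactly 1.0, which is why the
# commit uses 24 bits of randomness for single-precision uniforms.
print(np.float32(2 ** 32 - 1) / np.float32(2 ** 32))   # 1.0
```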
c456451915 [v.0.3] Don't expose 0-dim tensors to Variable API (#3523)
* [v0.3] Don't expose 0-dim tensors to Variable API.

* [v.0.3] Ensure grad_inputs are not ATen scalars and address review comments.

* Remove extra parentheses
2017-11-07 15:15:23 -05:00
f282d1dc7c Fix memory leak in THTensor_(addmm) (#3524)
THTensor_(newContiguous) always increments the refcount. It may return
the same pointer if the tensor is already contiguous. Since we added the
check for zero strides, it may be called when the tensor is already
contiguous. We need to make sure that THTensor_(free) is always called
in this case.

See #3498
2017-11-07 07:10:10 -05:00
2a3cae0f3e index_select does not return a view 2017-11-06 17:23:01 -08:00
3d9630abc2 Fix and speed-up norm_backwards (#3481)
Fixes #3264
2017-11-06 14:51:51 -08:00
da7a5147db Make THCTensor_varInnermostDim numerically stable using Welford's algorithm (#3425)
* Use Welford's algorithm when reducing along inner dimension for THCTensor's variance fn

* Use accreals in THCTensor's varInnermostDim

* Skip cuda tests if no cuda

* Variance testing
2017-11-06 14:51:42 -08:00
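For reference, a textbook Python rendition of Welford's single-pass update, the numerically stable recurrence this commit applies inside the CUDA variance reduction (not the kernel code itself):

```python
def welford_variance(xs):
    """One-pass, numerically stable mean/variance (Welford's algorithm)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)      # uses the *updated* mean
    # Unbiased sample variance, matching torch.var's default.
    var = m2 / (count - 1) if count > 1 else float('nan')
    return mean, var

print(welford_variance([1e8 + i for i in (4.0, 7.0, 13.0, 16.0)]))
# mean = 1e8 + 10, variance = 30 -- stable despite the large offset
```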
5df8e582cd Sparse Adam optimizer for sparse gradients (#3137)
* sparse adam

* Favor dense addition over sparse_mask
2017-11-06 14:48:39 -08:00
5dff261598 Fix error message for type mismatches with sparse tensors (#3504)
* Fix error messages

* Better fix for error checking
2017-11-06 14:48:21 -08:00
aa0c8920af Add single argument version of torch.arange (#3494) 2017-11-06 14:46:41 -08:00
a3b658bf3b Tidy up CUDA notes 2017-11-06 14:46:29 -08:00
94e89f3911 Add REINFORCE rule to distributions doc 2017-11-06 14:46:23 -08:00
f0956ad9ec Don't assume construction succeeded in __del__.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-11-06 14:46:11 -08:00
452ea78f43 Fix fill derivative (#3483) 2017-11-06 14:45:51 -08:00
3d5d66868e Add ONNX symbolic for Elu 2017-11-06 14:45:14 -08:00
cf373e25e2 fix copy-paste error in #3263 (#3476)
I have no idea how it worked on cuda 8, but apparently this fixes failures on cuda 9. cc @colesbury
2017-11-06 14:44:46 -08:00
91d764c781 Fix overflow when using magma (#3470)
* Fix types

* Make types better instead of casting to size_t
2017-11-06 14:44:33 -08:00
524235bb71 Install magma in cuda 9 docker (#3469) 2017-11-06 14:44:19 -08:00
e035fa028b Add assertion that 'pos' is in-bounds (#3466) 2017-11-06 14:44:09 -08:00
58a928c3b9 Fix warning in jit/ir.cpp 2017-11-06 14:43:57 -08:00
4f1eefa8ad Better error messages for Aten tensor types (#3449)
* Better error messages for Aten tensor types

* Address comments, add unit test
2017-11-06 14:43:46 -08:00
4251c151e3 Add gradient checks for take and put_ (#3460)
* Add gradient checks for take and put_

Fix the gradient formula for put_

* Make grad_output optional in gradgradcheck
2017-11-06 14:43:28 -08:00
c0931a3a4d Make grad_output optional in gradgradcheck (#3459) 2017-11-06 14:43:18 -08:00
3003ebe67a Replace None grad_inputs with zero tensors in some cases (#3433)
Replace None grad_inputs with zero tensors in some cases

In Python-implemented autograd functions, we sometimes return None as
the grad_input if the output is marked "non-differentiable". This
replaces those None values with zero-filled Variables if the
corresponding input has requires_grad=True.

C++ implemented autograd functions expect the input (grad_outputs) to
be defined if they're executed. They always return non-null grad_inputs
if should_compute_output(i) is true. This could lead to segfaults if a
subsequent Python-implemented function returned None.

See #3412, #3241
2017-11-02 17:23:25 -04:00
8b1b06d723 add CUDA_DEBUG build flag (#3419) 2017-11-02 15:35:18 -04:00
48fe5d4622 Move select and permute to ATen/C++ (#3421)
Move select and permute to ATen/C++
2017-11-02 15:17:36 -04:00
066db5dea3 Don't rely on squeeze_out in THD. (#3446)
We don't currently generate _out functions for ATen native functions and may not
(they don't work with Variables currently).  Also, the existing code was wrong
as the argument orders were swapped in the two squeeze variants.
2017-11-02 15:12:39 -04:00
dfaccc96b7 add dockerfile with cuda9 volta support (#3445) 2017-11-02 15:08:39 -04:00
afdf50cafe Move jit/assert.h to csrc/assertions.h (#3442)
I've kept JIT_ASSERT as an alias to TORCH_ASSERT, which we can use throughout the C++ code.
2017-11-02 13:26:51 -04:00
bed30c1582 long* -> int64_t* 2017-11-02 13:13:10 -04:00
9b2117ed87 Fix MSELoss docs (#3443) 2017-11-02 13:08:36 -04:00
fc7a68d147 fix lint 2017-11-02 07:36:58 -04:00
4108feb27d fix OSX cuda build 2017-11-02 07:15:24 -04:00
9ca8b321f5 Skip cpp tests if CUDA not available.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-11-02 02:21:10 -04:00
53b01527f4 Improve NYI error message to point to VariableType 2017-11-01 23:18:17 +01:00
6e17e73701 Register VariableType methods in autograd profiler 2017-11-01 23:18:17 +01:00
66d24c5067 Update the ONNX doc 2017-11-01 15:43:08 -04:00
0e38d3bbb3 remove thpp library (#3405) 2017-11-01 11:57:09 -04:00
df0bf06385 move type enum into THD (#3403) 2017-11-01 10:41:55 -04:00
b544882335 ATen in THD (Part I) (#2288)
* enable size from ATen type

* temp commit aten thd

* port copy, math

* port random

* changes after rebase

* lapack bind

* thd and csrc compile

* fix min/max reductions in DataChannelTCP

* clean up changes

* re-enable tensor constructors

* port MPI to at::Tensor

* fix storage methods to not cast to thpp storage ptrs
2017-11-01 09:59:02 -04:00
b7f5bc506e Make inputs/outputs return an ArrayRef.
Some knock on effects:

- at() is not supported on ArrayRef.  I fixed this by adding a new
  overload for input() to access a specific input.  I also filed
  https://github.com/zdevito/ATen/pull/152

- Need new overloads for fmap/filter, because template deduction won't
  attempt an implicit constructor in attempt to match the argument.

- New overload in ir.cpp for printing ArrayRef.

- When we pybind11 an ArrayRef, we convert it into an iterator.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-11-01 09:49:53 -04:00
d4abaa4b9e Move ONNX broadcast fusion into separate ONNX pass, fixes verbose printing.
This breaks a lot of the onnx-pytorch tests because the abstraction
barriers are not respected.  I'll spin up a patch for that separately.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-11-01 09:49:53 -04:00
247d50e2ad Improve const-correctness of JIT.
This started off as a minor fix based on Adam's question, "why is printing
a graph not const" and snowballed into a giant yak shaving exercise.

- The Graph and Node APIs now uniformly enforce deep constness; e.g., if you
  get a const Node* or const Graph*, it is not possible to get a non-const
  Node*/Graph* somewhere else in the graph (even though the member variables
  of these are non-const.  Hooray for private access specifier.)

- A big pile of functions got const versions, most notably the printing
  functions, and functions for accessing inputs().

- REALLY IMPORTANT, BC-BREAKING CHANGE: inputs() now returns a COPY of the
  inputs, rather than a reference to the underlying.  I was forced to do this
  because there is no way to portably turn a std::vector<Node*> into a
  std::vector<const Node*>, which is necessary to provide a const-correct
  version of inputs() that enforces deep const-correctness.  I then justified
  this choice to myself with the observation that outputs() returned a
  copy (by necessity), so this makes the API more uniform.

  But making this change uncovered two very subtle bugs:

    1. If you change functions from returning a reference to returning a copy,
       the idiom node->inputs().begin() is no longer valid, because the memory
       the iterator points to immediately becomes invalid.  THIS SUCKS.
       Honestly, we should add a lint rule rejecting calling begin()/end() on
       temporaries because this is very dangerous.  To excise this pattern from
       the codebase, I added begin() and end() methods to Graph, so that we got
       rid of the graph->nodes().begin() idiom, which happens to be sound,
       despite not returning a reference, because graph_node_list is a
       non-owning reference.

    2. pybind11 doesn't handle std::vector<Node*> cast out of the box.
       Fortunately, I found a simple fix in the GitHub issues tracker
       that involved adding an extra type converter.  And yes, this
       does mean that outputs() in Python never worked correctly.

- New const_graph_node_list, which is a graph_node_list that gives you const
  Node*

There are some more miscellaneous improvements:

- Applied CR comment fixes on export.cpp; using replaceInput, and renaming
  variables for clarity.

- assertValidInput helper method added, and applied to replaceInput

- Use an explicit function to print THPObjectPtr, otherwise we get
  the wrong overload.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-11-01 09:49:53 -04:00
b043a74919 fix softmax doc (#3337) 2017-11-01 08:47:51 -04:00
638f0b5d78 Prevent numerical issues with poisson_nll_loss when log_input=False (#3336)
* Prevent numerical issues with poisson_nll_loss when log_input=False

Evaluating the logarithm of the input variable in the Poisson negative log likelihood leads to a NaN loss if the variable being evaluated is zero. A small epsilon is added to prevent this. See the equivalent Keras epsilon here: https://github.com/fchollet/keras/blob/master/keras/losses.py#L68

* PEP8 fix

* Add epsilon support to PoissonNLLLoss in nn.modules.loss
2017-11-01 08:47:19 -04:00
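A sketch of the guarded term described above for the log_input=False case, with the Stirling term omitted and an illustrative epsilon; this mirrors the formula, not the actual nn module code:

```python
import torch
from torch.autograd import Variable

def poisson_nll(input, target, eps=1e-8):
    # Poisson NLL with log_input=False: input is the rate lambda.
    # Adding eps inside the log keeps lambda == 0 from yielding NaN.
    return (input - target * torch.log(input + eps)).mean()

rate = Variable(torch.Tensor([0.0, 1.0, 2.0]))
target = Variable(torch.Tensor([0.0, 1.0, 3.0]))
print(poisson_nll(rate, target))   # finite, no NaN from the zero rate
```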
91af122d43 add no-as-needed for THRTC 2017-11-01 04:25:42 -07:00
ae48a394b7 Count hits/misses, add statistics printing. (#3369)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-11-01 06:35:48 -04:00
6214487fa7 Add reduce keyword to L1Loss (#3366)
* Add reduce keyword to L1Loss

* Fix legacy test for abscriterion

* Address comments
2017-11-01 06:33:18 -04:00
bf4c269bee Implement reduce keyword for SmoothL1Loss (#3382)
* Implement reduce keyword for SmoothL1Loss
2017-11-01 06:29:34 -04:00
3cb34744db adaptive pooling supports specifying the output size in only certain dimensions (#3127)
* adaptive pooling supports specifying the output size in only certain dimensions
2017-11-01 06:11:30 -04:00
d77b94495d Pass -DOMPI_SKIP_MPICXX=1 when building C code (#3378) 2017-11-01 06:09:03 -04:00
88d9ebc850 lazy-load nvrtc and libcuda (#3408) 2017-11-01 06:07:03 -04:00
fa5efab669 comments and case where not all sparse (#3370) 2017-11-01 06:05:17 -04:00
7c0b16c140 Add torch.take and Tensor.put_ (#3263)
* Add torch.take and Tensor.put_

These are similar to numpy.take and numpy.put. The take function allows
you to linearly index into a tensor without viewing it as a 1D tensor
first. The output has the same shape as the indices. The put function
copies values into a tensor, also using linear indices.
2017-11-01 06:04:44 -04:00
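A brief usage sketch of the two new ops: linear indices address the flattened tensor, and the output of `take` has the shape of the index tensor (the values below are illustrative):

```python
import torch

src = torch.Tensor([[0, 1, 2], [3, 4, 5]])
idx = torch.LongTensor([[0, 4], [2, 5]])

# take: gather by linear index; the result has the shape of idx.
print(torch.take(src, idx))                # [[0, 4], [2, 5]]

# put_: scatter values into src at linear indices, in place.
src.put_(torch.LongTensor([0, 5]), torch.Tensor([10, 50]))
print(src)                                 # [[10, 1, 2], [3, 4, 50]]
```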
d905a90f0b Clear out eigenvector tensor when eigenvector=F for symeig (#3411) 2017-11-01 05:51:42 -04:00
cf256ee268 Added tensor op check for cudnn rnns (#3409) 2017-11-01 05:51:23 -04:00
81b995514e Make THTensor_(var) and THTensor_(std) more numerically stable (#3410) 2017-10-31 18:36:26 -04:00
3c00c0169d Make mm grads column major when the input is column major. (#3406) 2017-10-31 17:55:38 -04:00
6fef6f6dee fix upsample1d (#3407) 2017-10-31 17:49:24 -04:00
d4a0ec62dc Typo fix in torch.median (#3399) 2017-10-31 17:19:40 -04:00
8cc30e4895 Fix the Fusion Pass (#3362)
* update fuser to match ATen-formatted JIT ops

* fix concat optimizations and add test

* allow onnx export to work with single-export functions

* fix onnx handling of multi-return nodes.

* nits, format, vision test update

* fix add constant

* fix driver init issues

* Add missing Neg symbolic.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-31 13:44:13 -04:00
690256c18c Remove MSELoss test module in favor of wrap_functional 2017-10-31 17:56:40 +01:00
c10898f8ab Revert "ATen expand symbolic"
This reverts commit 47f999b814515a6fac03be9d242d8f24917d3a80.
2017-10-31 08:27:53 -07:00
7b5ac333ad Update README.md (#3392)
Getting Started hyperlinks match up now.
2017-10-31 10:48:33 -04:00
3f6fccd1a8 fixes for torch.nn.Hardtanh (examples and CPU implementation) (#3391) 2017-10-31 14:29:42 +01:00
dce525ab6b adds sample_n function (#3249)
* adds sample_n function

* fixes style issues

* uses more efficient api calls

* fix bug where transpose applied to 1 dimension
2017-10-31 09:04:05 -04:00
bd8bf4a86e enable remaining tests and fix a subtle issue in ConvBackwardBackward around sizes being a reference 2017-10-31 09:02:16 -04:00
b46ee946d9 add double backward for ConvTranspose 2017-10-31 09:02:16 -04:00
429d66549e If available try to use requests instead of urllib for model_zoo.load_url (#3280)
* try to use requests instead of urllib for load_url if available

* add noqa to silence flake warnings

* remove urllib import into the except
2017-10-31 08:53:58 -04:00
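The fallback pattern described above, preferring `requests` but degrading to the stdlib when it is not installed, looks roughly like this; a sketch, not the actual `model_zoo` code:

```python
def download(url):
    """Fetch url, preferring requests when available, else stdlib urllib."""
    try:
        import requests
        return requests.get(url).content
    except ImportError:
        try:
            from urllib.request import urlopen   # Python 3
        except ImportError:
            from urllib2 import urlopen          # Python 2
        return urlopen(url).read()
```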
e4a3747cd8 Add unit tests for casting onto scalars 2017-10-31 08:51:55 -04:00
c2e8b7aafe Allow casting Variables onto Python scalars 2017-10-31 08:51:55 -04:00
54bfa88eec Allow casting one-element Tensors onto Python scalars 2017-10-31 08:51:55 -04:00
47f999b814 ATen expand symbolic 2017-10-30 20:24:46 -04:00
3e6e81da46 Dispatch trivial variable operators to C++ aten functions. (#3372)
Implement __comparison_ops__ by calling the VariableBase methods.
2017-10-30 19:46:05 -04:00
8cd0df020c make sparse (new) functions conform that storage is not NULL (#3381) 2017-10-30 18:55:26 -04:00
eac0942f6d Add more nn docs (#3374) 2017-10-30 18:37:36 -04:00
a5dbc254f8 if git is not installed at all, no subprocess exception will be raised (#3379) 2017-10-30 18:37:12 -04:00
d38fccc586 Debian/Ubuntu comes with GCC 4.9.2 and it does require -D_FORCE_INLINES (#3380) 2017-10-30 18:36:35 -04:00
2be8bd1880 Add docs for ByteTensor any()/all() 2017-10-30 16:00:48 -04:00
1ae10a4831 add test to check zero_strided tensors in blas level 2 and 3 functions 2017-10-30 16:00:21 -04:00
d04574b1fc ensure BLAS/MKL is not used if stride values are not supported 2017-10-30 16:00:21 -04:00
acb73c729b Space is missing in __repr__ of conv (#3229)
* - Remove spaces in `__repr__` of layers
- Replace `size` by `kernel_size` in `__repr__` of a pooling layer

* Fix flake8 errors
2017-10-30 13:45:37 -04:00
28f3d50f9d doc: Replace nclasses with C 2017-10-30 12:06:20 -04:00
71d731fb57 Fix documentation inconsistencies for some loss classes
- The actual parameter is weight not weights
- Unify all mentions about batch_size -> N
- Unify all mentions about n_classes -> C
2017-10-30 12:06:20 -04:00
8fbe003d4e Miscellaneous ONNX fixes and behavior changes.
- Deleted Addmm/Concat Function class, as this is now native ATen operator

- Resurrected ONNX operator for Concat (now called 'cat')

- Add a "fake" Expand ONNX operator, which we now do the optimization on;
  this helps prevent us from emitting a warning that 'expand' is not supported.
  We still fail if any of these Expand operators make it to the final model,
  until we actually formalize Expand in ONNX.  This also simplifies the
  fuseBroadcast code, because single-return ONNX nodes don't get select nodes.

- New error reporting strategy.  If we fail to export an operator because of
  something, we emit a warning, but otherwise keep going.  At the very end,
  in export.cpp, we now check if there are any ATen operators left over.  If
  there are, we bug out.  This assumes that ATen is lower case and ONNX is upper
  case.  You're now supposed to 'return _unimplemented(msg)' in these cases.

- New toString() method on Graph, for getting the string graph (useful for
  slapping it into error messages.)

- Some of the legacy symbolics (still in Python symbolic method of Function
subclass) have been cleaned up for clarity.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-29 23:50:34 -04:00
40f7f6e095 Improve handling of 'expand' (broadcasting) in JIT and ONNX
The pieces:

- I improved the lint / asserts to catch some bugs which I
  committed while working on my export.  There are two new
  properties which the linter checks now:

    (1) "Anticipated uses".  If a node says that it is used by
    M, M had better appear later in the topsort.  Previously,
    we only checked if it was in all_nodes.

    (2) If you are a select node, you better be a multi-type node;
    if you're not a select node, you better not be!  And you
    should never have an input that is multi-type.

- There is a new peephole optimization pass, for simple, local
  transformations to graphs.  Right now, it implements a simple
  optimization: remove 'expand' invocations that are no-ops
  (the size before matches the size after), but we can add other
  things to it later.  I needed this for ONNX because no-op expands
  show up in the left-hand argument, which we don't support.

- There is now a broadcast fuser, which fuses ATen expand ops
  into broadcastable ONNX ops (Add, Div, Mul, Pow, Sub, Gemm.)
  It only fuses when the original size is a suffix of the new
  size, as per the ONNX spec.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-29 23:50:34 -04:00
2e42272cc1 Make DataParallel a no-op when CUDA not available (#3318) 2017-10-29 13:47:36 +01:00
4f33b136d8 add tests for the previously failing coalesce case 2017-10-28 18:52:35 -04:00
0b89a68111 fix sparse tensor coalesce 2017-10-28 18:52:35 -04:00
bb7b630953 fix pynew gpu_guards 2017-10-28 18:52:35 -04:00
91a8d3325e test sparse dp, broadcast_coalesced, reduce_add_coalesced 2017-10-28 18:52:35 -04:00
01be4d6b20 sparse broadcast_coalesce and reduce_add_coalesced 2017-10-28 18:52:35 -04:00
3a0aee71f3 fix sparse tensor .cpu() 2017-10-28 18:52:35 -04:00
618026e999 implements operator + for Dataset class (#3180)
* implements operator + for Dataset class

* check for exact equivalent
2017-10-29 01:19:59 +05:30
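A usage sketch of the new `+` operator, which concatenates datasets (the tensors below are illustrative):

```python
import torch
from torch.utils.data import TensorDataset

a = TensorDataset(torch.zeros(3, 2), torch.zeros(3))
b = TensorDataset(torch.ones(5, 2), torch.ones(5))

combined = a + b            # concatenation of the two datasets
print(len(combined))        # 8
print(combined[3])          # first sample contributed by b
```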
a0ce84e476 fix triplet margin loss documentation (#3339) 2017-10-28 17:15:58 +02:00
6d2e39559a Replace Variable constructor with a static_cast in tracer 2017-10-28 17:10:38 +02:00
a381fa10a5 Add a hack for RNN export to ONNX 2017-10-28 17:10:38 +02:00
9107110d3a Add sparseTensor.new wrapper bindings (#3329) 2017-10-28 16:34:08 +02:00
820ac0df2b fix mathjax notation on softmax/softmin (#3338) 2017-10-28 18:10:35 +05:30
7b00adf5d3 Add CUDNN_LIB_DIR in rpath (#3255)
* Add CUDNN_LIB_DIR in link -rpath

* insert CUDNN_LIB_PATH in front of rpath
2017-10-28 00:13:53 -04:00
2d0667233a Add .dockerignore. (#3333)
.gitignore should have uninteresting files listed, so it acts as a good
.dockerignore. Reduces the build context sent to the docker daemon from
2.927GB (after building locally) to 66.66MB (:O).
2017-10-28 00:11:11 -04:00
204044a522 Symbolic representation for unfold using ATen (#3334) 2017-10-28 00:08:45 -04:00
ac8f56656d Adapt ONNX Slice op changes (#3316) 2017-10-28 00:03:29 -04:00
dc6c9e8df8 Fix compilation without numpy.
Fix this and related errors:

    Tensor.cpp:309:47: error: ‘PyArray_Check’ was not declared in this scope
2017-10-28 00:50:57 +02:00
de1f4e69dd raw text (#3327) 2017-10-28 01:24:02 +05:30
d8f3c601e4 Add reduce keyword to CrossEntropyLoss 2017-10-27 19:19:52 +02:00
9735ddd899 check_env_flag now ignores case (#3317) 2017-10-27 15:15:02 +05:30
d56713680d Fix const modifiers on VariableImpl 2017-10-26 14:31:29 -07:00
a762fe0b0d Merge commit 'cbe7b8b636ea840fa9e02608011572936fb5a2b3' 2017-10-26 14:31:22 -07:00
86d0c24b6a Dynamically find min log scale #3289
* dynamically find min scale

* compute it only once per _number_format call
2017-10-27 02:42:16 +05:30
fa0f3cf98a Re-enable and fix most JIT tests 2017-10-27 02:40:09 +05:30
61afb0d519 Autogenerate ATen dispatch for JIT nodes 2017-10-27 02:40:09 +05:30
869bdeb936 Symbolic implementation of Index supporting tuple of slices. (#3294) 2017-10-27 02:39:38 +05:30
3853d5da97 Add reduce keyword to NLLLoss and NLLLoss2d (#3080)
* API changes

* Implement reduce for THNN ClassNLLCriterion

* Implement reduce keyword for THCUNN ClassNLLCriterion

* Implement reduce for THNN SpatialClassNLLCriterion

* Implement reduce for THCUNN SpatialClassNLLCriterion

* Make legacy NLLLoss work

* Docs for NLLLoss reduce

* reduce keyword for double backwards NLLLoss

* reduce=False tests

* Addressed comments

* Fix trailing whitespace

* Fix test failures in legacy nn

* Rebase: add reduce keyword to aten declarations of NLLLoss

* Add reference functions for all NLLLoss and NLLLoss2d test cases

* Replaced slow get/set fns. Don't use int64_t in kernels.

* Use TH_INDEX_BASE in NLLLoss for consistency

* Fix legacy ClassNLLCriterion tests
2017-10-26 13:54:19 -04:00
bdeee47d33 Add zero, zeros_like, _dimI and _dimV for sparse tensors (#3271) 2017-10-26 18:28:04 +02:00
5760b036fb Fix pack_padded_sequence to accept inputs of arbitrary sizes 2017-10-26 17:40:03 +02:00
cbe7b8b636 Adds permute and as_strided to ATen (#137)
Permute transposes multiple dimensions at once. The as_strided function
changes the sizes and strides of a tensor without changing the Storage.
It's a subset of Tensor::set_.
2017-10-26 11:35:29 -04:00
21ff182809 improve padding code 2017-10-26 12:03:17 +02:00
a99506f2fc fixed error: namespace "std" has no member "min" 2017-10-26 09:24:13 +02:00
b3b7203b40 Symbolic representation for mm (#3290)
* Symbolic representation for mm

* Fix whitespace issues
2017-10-26 00:29:57 -04:00
8afbdd8dcf Make toScalarType and toBackend virtual
This allows VariableType override them to return instances of
VariableType. Combined with the change to Formatting.cpp, this lets us
print Variables to std::cout.
2017-10-25 20:46:34 -07:00
2bcca48a62 Make size, strides, dim functions const. 2017-10-25 20:45:07 -07:00
c03799e8eb Change is_same_size to a native function.
For one thing, we will want a different implementation from TH because
we need to differentiate between scalars and 1-dim tensors.

Also, we don't really want to expose the THS/THCS function; in addition to
checking the shapes are the same, it checks that the dimensions which
are sparse are the same (because various THS/THCS operators only work if this
is true); it should really be called "is_congruent" or similar.
2017-10-25 20:44:29 -07:00
715ca3a2c8 Add unsqueeze of scalar to wrapdim_test. 2017-10-25 20:44:06 -07:00
fcdd394f66 bind newWithTensor in ATen (#129) 2017-10-25 20:43:44 -07:00
4819197a40 fix merge problems 2017-10-25 17:54:32 -07:00
3b26b48d90 missing code from pytorch 2017-10-25 17:32:17 -07:00
699e47d380 missing entry 2017-10-25 17:27:32 -07:00
a7c5be1d45 Document CUDA best practices (#3227) 2017-10-25 22:38:17 +02:00
837f933cac remove 'path' from key_averages header
path appears to be unused
2017-10-25 21:34:59 +02:00
a65db4e956 Use ATen for torch.cat, torch.addmm, and friends on Variables. (#3286)
This includes some changes to the dispatch code for torch.xxx functions:

 - Since Variable.addmm is an instance-method, the self argument has to
   come first. The dispatch code swaps the first two arguments if
   necessary to suppor the deprecated signatures where 'alpha' or 'beta'
   comes before the 'self' tensor.
 - Delete IMPLEMENT_STATELESS_REVERSED. These functions require output
   arguments to be passed in using the keyword 'out'. They were meant to
   handle torch.gt(out, a, b), but we haven't allowed that for a while.
2017-10-25 14:27:45 -04:00
43f0c74461 Merge commit '48911e116d43ab2b887fb714e30de09676a765a3' 2017-10-25 09:34:29 -07:00
b46ced4aab clarification in docstring of Module.register_forward_hook() (#3279)
* made it explicit in the docstring of Module.register_forward_hook() that the hook(s) will be called AFTER calling forward().

* added "every time" in docstring of Module.register_forward_pre_hook()
2017-10-25 15:36:00 +02:00
b3642b3e65 Softmax/LogSoftMax refactor (wrapped up) (#3245)
* Unify CUDA kernels for SoftMax and LogSoftMax

* Improve SoftMax and LogSoftMax kernels performance

Added a new instantiation of the spatial kernel for
low inner_size and larger dim_size.
2017-10-25 14:47:56 +02:00
e43a63a968 tensor: Ensure that the tensor is contiguous before pinning (#3266) (#3273)
* tensor: Ensure that the tensor is contiguous before pinning (#3266)

pin_memory() was producing an out-of-order tensor when the given
tensor was transposed, i.e. in column-major order.
This commit fixes this by calling contiguous() before pinning.

* test: add contiguous test for pin_memory (#3266)
2017-10-25 13:17:54 +02:00
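A small sketch of the failing pattern described above, pinning a transposed (non-contiguous) tensor, and the property the fix restores; pinning requires a CUDA-enabled build, so the check is guarded:

```python
import torch

if torch.cuda.is_available():           # pinning needs a CUDA-enabled build
    x = torch.randn(3, 4).t()           # transposed view: non-contiguous
    assert not x.is_contiguous()

    pinned = x.pin_memory()             # after the fix, copies via contiguous()
    # The pinned copy must preserve logical element order, not raw storage order.
    print((pinned == x).all())
```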
b5170c8bf1 improves pack padded sequence operation runtime #1788 (#3278)
* improves pack padded sequence operation runtime #1788

* error message
2017-10-25 13:16:32 +02:00
9989bb1a43 Export index constants as long, not int (onnx-caffe2 needs it.) (#3274)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-25 09:50:33 +02:00
48911e116d The at::cat should default to dim=0 2017-10-24 18:30:31 -07:00
e760e63244 Handle remainder=0 case correctly
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-24 19:33:37 -04:00
df71f2aef5 ONNX export for split.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-24 19:33:37 -04:00
4b1e85d266 Remove split/chunk python autograd. 2017-10-24 19:33:37 -04:00
fe0ac0f7d0 Support native functions in C++ autograd automatically. 2017-10-24 19:33:37 -04:00
59e0472af1 Merge commit '9f6a41d63d37d5108f62d8499de4e691adac09e6' 2017-10-24 15:30:01 -07:00
5fc122bf39 Fix to #2236 - tensor.numpy() checks that no positional arguments are passed. (#3224)
* tensor.numpy() checks that no arguments are passed

* tensor.numpy() checks that no arguments are passed

* Improve .numpy() argument checking performance
2017-10-24 23:54:28 +02:00
3de3ac31cc Merge commit 'd131219742d6efd31b4986342e22c0184a6d4340' 2017-10-24 16:06:56 -04:00
9f6a41d63d Support default parameters for native functions. 2017-10-24 12:16:05 -07:00
f9d002d9f7 perf improvements for depthwise convolutions (#3265)
The biggest performance improvements are due to templating kernels.  See PR for some numbers.
2017-10-24 14:57:47 -04:00
0b8b9cf928 update 2017-10-24 20:30:49 +02:00
d131219742 smarter backend option 2017-10-24 11:16:42 -07:00
5691b0b8d2 Fix the Slice changes in ONNX (#3216) 2017-10-24 14:12:54 -04:00
cc5a948e62 Fix clang-802.0.42 tuple overload bug, fixes #3234. (#3252)
* Fix clang-802.0.42 tuple overload bug, fixes #3234.

Originally, my plan for emit_record_trace was to keep it as
simple as possible, if at the expense of some somewhat ugly
overloads.  So this meant we had a 'recordTrace' function
with overloads like this:

  recordTrace(..., const Variable& out)
  recordTrace(..., const std::tuple<Variable, Variable>& out)

Unfortunately, this triggers a bug in clang-802.0.42
(widely used in macOS Sierra 10.12.6) wherein a Variable is
implicitly convertible into a std::tuple<Variable, Variable>;
a minimal repro can be seen below here:

  #include <tuple>
  struct T {};
  void f(const std::tuple<T, T>&) {}
  void g(T& x) { f(x); }

To work around this bug, the code generator is a bit more
complicated, and is taught how to handle this situation.

Previously the generated code looked like:

  jit::tracer::recordTrace( "min", { self }, ret );

Now it looks like:

  jit::tracer::recordTrace( "min", { self }, { std::get<0>(ret), std::get<1>(ret) } );

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* CR comments

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-24 13:13:38 -04:00
1a0e4e1b00 sparse cuda and get device (#122) 2017-10-24 17:31:01 +02:00
0748ea56eb Change size to kernel_size in __repr__
Probably, __repr__ should return `MaxPool2d (size=(3, 3), stride=(2, 2), dilation=(1, 1)))` -> `MaxPool2d (kernel_size=(3, 3), stride=(2, 2), dilation=(1, 1)))`
2017-10-24 12:46:19 +02:00
25ed9aba03 Remove C exports and rename AT_API 2017-10-23 20:30:55 -07:00
c6671f4379 Fix missing <functional> and export decorations in lib/ATen 2017-10-23 20:30:55 -07:00
8e58135a26 Fix E722 ('do not use bare except') (#3239)
The new version of flake8 includes a check for not using bare except. We
should avoid this since it catches things like KeyboardInterrupt.
2017-10-23 23:03:37 -04:00
9cca84a96f Remove dead code 2017-10-23 19:31:53 -07:00
d02ca80613 Revert the enum changes as discussed 2017-10-23 17:55:27 -07:00
5795b173de Fix LogSoftMax (#3244) 2017-10-23 22:40:42 +02:00
5afc166769 Fix lint build (#3237)
The flake8 package was upgraded to include new errors which cause the
build to break.
2017-10-23 14:04:56 -04:00
b92e06e50e Fix reference counting bug in python_nn_functions.cpp (#3236)
Py_InitModule returns a borrowed reference. PyModule_AddObject steals
the reference, so we need to incref the `_nn` object.

(The Python 3 function PyModule_Create returns a new reference.)
2017-10-23 12:35:59 -04:00
a806d1ad69 make softmax test name unique case-insensitive 2017-10-23 06:29:58 -07:00
dc6510f7ed fix copy elision warnings / get rid of an std::move 2017-10-23 01:18:34 -07:00
d5604aea0b Don't create grad_fn if requires_grad=False (#3212)
Don't create grad_fn if requires_grad=False

 - Check that arguments without derivative definitions have
   requires_grad=False
 - Pass all tensor arguments to the tracer, including ones without
   derivative definitions
2017-10-22 18:41:04 -04:00
0989889251 Fixing lib/THNN build for Windows (#3217) 2017-10-22 12:19:00 +02:00
67839ce7bc Delete unused Softmax code (#3220)
Softmax and LogSoftmax are automatically bound and dispatched through
VariableType.
2017-10-21 20:51:27 +02:00
129336cb06 [dlpack] Memory management for dlpack 2017-10-21 20:19:51 +02:00
d89d9d74bd Fix Python 3 portability problem. (#3209)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-21 00:44:29 +02:00
ed9c43774c Don't resize output in cpu torch.gels (#3204)
* Don't resize output in cpu torch.gels when m > n
2017-10-21 00:43:42 +02:00
53fe804322 Make ONNX work with new C++ autograd world.
The general strategy is there is a new module, torch.onnx.symbolic, which
contains a function for every ATen method name with the ONNX translation.
While implementing this, I took the opportunity to expunge all references
of 'g' from the public API; instead, it is managed by a global variable in
torch.onnx which tracks the "current graph".

Other changes:

- If you pass a Tensor to op as an argument, it will now automatically be
  converted into a Constant ONNX node.  This lets us remove needing to
  implement ONNX

- Rename value to other, wherever there is both a Scalar and Tensor overload.
  This way, keyword dispatch can work uniformly in both cases.

- Deleted any autograd Function classes that both had a symbolic and were ported
  to the new C++ autograd implementation.  There may still be some straggling
  classes that didn't have symbolic.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-20 15:38:01 -04:00
e64f40ae5b Add tracing to the new ATen style API.
The generated tracing code looks like this:

    if (jit::tracer::isTracing({ self })) {
      jit::Node *n = jit::tracer::recordTrace( "mean", { self }, ret );
      n->rawSet(jit::stringToSymbol("dim"), dim);
      n->rawSet(jit::stringToSymbol("keepdim"), keepdim);
    }

A few design decisions I made:

  - Instead of making the assignment of 'n' conditional on whether or not
    attributes are present, I just add (void)n if it would not be used
    otherwise.  This modestly simplifies code generation.

  - Tracing of operations that involve Generator or Storage are not supported.
    This is fine because such ops don't take any Variable arguments anyway,
    so they couldn't trigger tracing.

  - Unfortunately, at::ArrayRef is not covariant, so there is some faffing about
    to support conversions from at::ArrayRef<Tensor> (aka TensorList) to
    at::ArrayRef<Variable>.  In the case of 'recordTrace' (slow path), I just
    allocated an intermediate std::vector to get the types correct; in the case
    of isTracing (fast path) there's three overloads to avoid refcount bumping
    when possible.

  - Tracing is all in one place, rather than spattered between the beginning
    and end of an ATen function, as Sam suggested.

  - This commit doesn't actually enable ATen definitions.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-20 15:38:01 -04:00
0589dfab81 nested_dict documenting comment.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-20 15:38:01 -04:00
5989b05ecc Enable ATen implementation of some NN functions and Variable methods 2017-10-20 15:38:01 -04:00
a385979677 Guard against executing the Hardshrink on CUDA
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-20 15:38:01 -04:00
507319ca39 Revert "Speed up norm_backward"
This reverts commit 17a817190c72e8ee48919c9c326e74b385764f5a.
2017-10-20 15:38:01 -04:00
5a0ded4dad PyTorch fixes for latest ATen:
1) softmax, log_softmax backwards now have int64_t dim argument
2) chunk/split in autograd/functions/tensor.cpp conflict with new
   ATen implementations, just delete them and use the ATen ones.
3) div/mul with Scalar now use "other" parameter rather than "value"
2017-10-20 11:05:11 -07:00
96cb3f7c80 Merge commit '10df3496cbe392fa06648600aff3682a490e43c5' 2017-10-20 10:57:32 -07:00
10df3496cb Fix typos in orgqr and orgmqr
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-20 10:56:00 -07:00
9560540084 Add missing string include, fixes https://github.com/pytorch/pytorch/issues/3192
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-20 10:56:00 -07:00
16095b5737 Rename value to other, wherever there is both a Scalar and Tensor overload. (#115)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-20 13:54:30 -04:00
246701df81 Separate out native processing into process_native; remove (TH)Type specific logic. 2017-10-20 10:12:26 -07:00
90e396f6bb Support 'native' ATen functions with Tensor, (base) Type, NS impls.
This adds the ability to specify 'native' functions in NativeFunctions.h and specifies
'split' and 'chunk' in this manner.  The function arguments, returns, variants, etc. are
specified as if they were processed via other parsing mechanisms (e.g. cwrap_parse) with
the following additional parameters:

type_method_definition_level: this allows one to specify that the type method should
be defined at the 'base' type level; this is because in the case of 'split' and 'chunk'
(and probably most/all other native functions that don't directly dispatch to TH/THC)
we don't need type-specific implementations.  Currently it is enforced that 'base' is
specified for native functions, but this is easy to remove later.

type_method_definition_dispatch: this defines the function to dispatch to.  For split,
this is at::native::split; this is just to avoid having a magic namespace and allowing
one to dispatch to a function with a different name.
2017-10-20 10:12:26 -07:00
fea60da92e Update DLPack tensors enum to avoid binary issues and expose one function 2017-10-20 10:08:54 -07:00
0b0f24a71b disable test_cudnn_weight_format when CuDNN not available (#3200) 2017-10-20 19:06:53 +02:00
76abc06b1f Fix nvprof mode in autograd profiler 2017-10-20 10:22:54 -04:00
17a817190c Speed up norm_backward 2017-10-20 10:21:28 -04:00
634c8315a4 isContiguous problems (#3148)
* with the size=1 case it is impossible to do a single-point check; replace it with isContiguousRange

* fix stride in desc; fix undef scope

* add test for this case for cudnn

* assertTrue
2017-10-20 10:20:33 -04:00
2797c8005b Update THDTensor.cpp 2017-10-20 10:17:27 -04:00
50de9160aa Update THDTensor.cpp
Add `__STDC_FORMAT_MACROS` to fix gcc issues
2017-10-20 10:17:27 -04:00
d8ad5de560 Fix intermittent segfault on Python 2.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 23:04:19 -04:00
6ebfa20ab9 Include math.h for M_PI.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 23:04:19 -04:00
147287a33c Fix the build on clang.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 23:04:19 -04:00
7d95127a4f Squash ATen warning
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 23:04:19 -04:00
67612cba09 Add -Wno-missing-braces
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 23:04:19 -04:00
8faffef321 Make flags overloads compile.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 23:04:19 -04:00
3696300fcf Include Python.h less using a new stub header.
In many "non-Python" headers, we include Python.h because we need
to declare a pointer to PyObject, and solely because of that.  It
would be a lot better if we had a simpler version of Python.h that
just declared PyObject available for pointers, without anything
else.  This is what torch/csrc/utils/python_stub.h does.

The good thing about not including Python.h is that it is easy to
be warning-less; no more ugly insertions of Python.h on headers
where it has no good reason to be.

This makes PyTorch warning clean again.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 23:04:19 -04:00
8b3acd7d7b Check that type_str is in the type_map (#3191) 2017-10-19 22:54:25 -04:00
0da15f913c Change softmax and log_softmax to take int64_t dim rather than int.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 17:50:17 -07:00
357f9b6f01 Squash ATen warning
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-19 17:40:22 -07:00
e970d35091 Make VariableVersion refcounting thread-safe (#3184)
I've also made the version counter and the "live" reference count
atomics.

Note that it's not safe to set the version counter (operator=) from
multiple threads, because shared_ptr assignment isn't thread safe.
Currently, the only call sites to these functions are on newly created
variables before they can be accessed from other threads.

See #3111
2017-10-19 17:22:01 -04:00
d9b89a352c Replace StochasticFunctions v2 (#3165)
This removes the StochasticFunctions for bernoulli, multinomial, and
normal and replaces them with classes in the torch.distributions
package. Each distribution supports the differentiable log_prob function
that returns the log of the pdf/pmf of the samples.

The current StochasticFunction implementation has a few problems: it can
be painful to use when there are multiple stochastic outputs which need
to be back-propagated through. It also requires that we store grad_fns
on Variables that have requires_grad=False in order to find stochastic
nodes.
2017-10-19 15:05:07 -04:00
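As a rough sketch of the torch.distributions interface this commit describes (log_prob is named in the message above; the Variable wrapper reflects the autograd API of this period):

    import torch
    from torch.autograd import Variable
    from torch.distributions import Bernoulli

    probs = Variable(torch.Tensor([0.3, 0.7]), requires_grad=True)
    dist = Bernoulli(probs)
    sample = dist.sample()            # non-differentiable draw
    log_prob = dist.log_prob(sample)  # differentiable log pmf of the sample
    log_prob.sum().backward()         # gradients flow back to probs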
f1f64c8d07 Generate autograd functions for NN / more refactors (#3136)
Generate autograd functions for NN and implement more derivatives in derivatives.yaml

A big refactor of gen_variable_type.py
2017-10-19 15:03:26 -04:00
98e67448fa Large Softmax and LogSoftmax refactor
- Cleaned up THNN and THCUNN code and kernels
- Improved THCUNN kernel performance 5x, making it match cuDNN performance
- Added support for computing softmax over arbitrary dims
  NOTE: The default dim for 3D inputs is now 1 (used to be 0)
- Both functions now accept inputs with arbitrarily many dimensions
- Autograd functions no longer save the input (it's unnecessary)
- Added cuDNN bindings for softmax, but they are unused as THCUNN
  matches or even exceeds cuDNN performance
2017-10-19 19:51:10 +02:00
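A minimal sketch of the dim argument described above, assuming the torch.nn.functional entry points:

    import torch
    import torch.nn.functional as F
    from torch.autograd import Variable

    x = Variable(torch.randn(2, 3, 4))
    y = F.softmax(x, dim=1)      # for 3D inputs the default dim is now 1 (was 0)
    z = F.log_softmax(x, dim=2)  # softmax can be taken over any dimension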
3a4ca7a269 Add support for saving the output in autogenerated functions 2017-10-19 19:51:10 +02:00
f9ee52efa9 Update DLPack bindings 2017-10-19 10:06:53 -04:00
76071cfbac Merge commit '99cbf24b8b5a9d769b9794e447e0b740bcdd99c8' 2017-10-19 10:06:41 -04:00
99cbf24b8b Update log_softmax and softmax signatures to include dim (#106) 2017-10-19 12:19:20 +02:00
7fd6fd6d80 Output more useful error message when exporting FeatureDropout in train mode (#3156)
* Output more useful error message when exporting FeatureDropout in train mode

* Update the comment
2017-10-18 20:27:18 -04:00
3631cd71b1 nit: move ATenDLMTensor to cpp file since it doesn't need to be in the header 2017-10-18 16:29:14 -07:00
424390bc96 [dlpack] Memory management for dlpack 2017-10-18 16:25:30 -07:00
9eb9615a6b fix build error when nnpack is enabled (#3167) 2017-10-18 17:50:31 -04:00
57ffe64cbe Embedding related fixes (#3128)
* Fix docs for nn.Embedding and F.embedding.
  - add description of 'sparse' argument (#3104)
  - fix F.embedding example (resulted in RuntimeError)
* Make EmbeddingBag a New Style Function.
* Add a functional interface for EmbeddingBag
* Fix failing tests: add max_norm and norm_type to context,
and fix typo in backend call.
* Docfix: remove torch.manual_seed from example code.
* Add a note about using sparse keyword in Embedding function.
2017-10-18 23:38:07 +02:00
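A short sketch of the sparse keyword mentioned above, using the nn.Embedding module (illustrative only):

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    emb = nn.Embedding(10, 3, sparse=True)                  # sparse gradients for the weight
    indices = Variable(torch.LongTensor([[1, 2], [4, 5]]))
    emb(indices).sum().backward()                           # emb.weight.grad is a sparse tensor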
9ec9acc0cd Fix bug with 'coalesced' calculation in 'cadd'. (#3162)
Apparently, the algorithm only guarantees the output is coalesced if
the inputs are coalesced.

I'm planning to do another PR that does much more stringent correctness
testing for the 'coalesced' bit shortly, but y'all should merge
this one first.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-18 23:20:56 +02:00
f8cb5d6437 Trying to construct a Tensor from a Variable fails more appropriately (#3163) 2017-10-18 23:13:22 +02:00
51c2075f16 Relax Scalar::toXXX conversions to only check for overflow
Currently, the toXXX functions on Scalar check that the conversions are
exact. This will cause an exception in code like:

  auto t = CPU(kFloat).ones({1});
  t *= M_PI;

Or the equivalent in Python:

  t = torch.ones(1)
  t *= math.pi

This changes the checks to only throw an exception in the case of
overflow (positive or negative).
2017-10-18 13:53:09 -07:00
dcb457fdd9 add support for using nnpack when installed via conda (#3155)
* add support for using nnpack when installed via conda

* unify nnpack discovery between conda and user
2017-10-18 20:11:13 +02:00
7680601659 Spatial Depthwise Convolution on the GPU (#3057)
* THCUNN Skeleton for Depthwise Convolution port

* implement Depthwise Convolution CUDA Kernels (handles weight parameter only, not bias)

* working kernels and bindings for forward + backward for base conv, and integration

* add support for padding

* strides for weight kernel

* dilation for weight gradient, enable for others

* add support for depthwise multiplier

* remove old depthwise conv

* rename to SpatialDepthwiseConvolution

* clean up depthwise code, add shape asserts, more constrained thread count for accgradparams

* add bias for forward for depthwise conv

* add grad_bias, move bias for forward to CUDA

* fix eligibility test to guard against transposed, properly identify depth multiplier

* add basic unit test; make depthwise conv take priority over cudnn when appropriate

* add tests for depthwise permutations

* make cuda kernels calculate positions using mul instead of div

* remove unnecessary samegpu requirement

* use accreal, test for double type

* use THAssert instead of assert

* rename to is_depthwise

* half prec support for depthwise

* make certain computation more pythonic

* flake8
2017-10-18 14:16:02 +02:00
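For illustration, a depthwise convolution is expressed through the existing groups parameter (groups == in_channels); layers like this are what can dispatch to the new kernels on the GPU:

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    # groups == in_channels makes this a depthwise convolution
    conv = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, padding=1, groups=32)
    x = Variable(torch.randn(8, 32, 28, 28))
    y = conv(x)   # on CUDA tensors this can use the SpatialDepthwiseConvolution kernels added here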
95556f4075 add ignored_keys param to load_state_dict (#3159)
* add ignored_keys param to load_state_dict

* remove ignored_keys in favour of a strict param

* raise KeyError only if strict is enabled
2017-10-18 14:14:19 +02:00
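A minimal sketch of the strict parameter described above:

    import torch.nn as nn

    model = nn.Linear(4, 2)
    partial_state = {'weight': model.weight.data}        # intentionally missing 'bias'
    model.load_state_dict(partial_state, strict=False)   # missing keys are tolerated
    # model.load_state_dict(partial_state)               # strict (the default) would raise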
23a3f78988 Reverse the order of checks in torch.gather (#3130)
* Reverse the order of checks in torch.gather

* Remove unnecessary comment

* Add missing check for indexing dimension
2017-10-18 12:30:05 +02:00
6647475bc2 Lazily create Variable.data PyObject* (#3149)
Previously, we created the Variable.data PyObject* in THPVariable_Wrap. For many
Variables, we don't access their data directly. Instead, they are passed
from one Variable computation to another.

This reduces the overhead of ATen-implemented Variable methods by
~200ns.
2017-10-17 11:54:55 -04:00
75bb50be0a Remove THHeapUpdate (#3143) 2017-10-17 11:07:40 +02:00
3109e4ad6a add common terminology to BatchNorm docs 2017-10-17 11:03:31 +02:00
f176c864f0 minor autograd reference change in readme (#3144) 2017-10-17 10:16:06 +02:00
9d4d0640f2 Support MNIST in ONNX (#3100)
* Support MNIST in ONNX

* Add train mode check in FeatureDropout symbolic, add todo mark in logsoftmax_symbolic

* export FeatureDropout as a simple identity op

* turn x = x or y to if-checks.
2017-10-16 19:51:40 -04:00
fce3ed19e5 Change device_id to device in python land (#3133)
* change device_id to device in python land

* cuda/random.py
2017-10-17 00:54:26 +02:00
ba05dc5549 dense buffer (#3139) 2017-10-17 00:51:37 +02:00
17d68f824d Fix typo. (#3140) 2017-10-17 00:50:33 +02:00
3261e1337a Use 0D (1-element) tensor instead of 1D tensor 2017-10-16 17:47:36 -04:00
00996006d1 Remove type inference from value 2017-10-16 17:47:36 -04:00
93e1749c85 Add ONNX support for AddConstant and SubConstant 2017-10-16 17:47:36 -04:00
da7aa3a12f Add helper function _constant in onnx.py 2017-10-16 17:47:36 -04:00
e92246fffa Visit hooks in C++ implemented autograd functions (#3138)
Once mul uses ATen, this is necessary for TestAutograd.test_hooks_cycle
to pass.
2017-10-16 17:30:09 -04:00
36895e2dd2 update the comments, move the expect check logic into the helper function 2017-10-16 16:57:16 -04:00
a1deb2d47f Move the exception logic to the helper function 2017-10-16 16:57:16 -04:00
cad9438bb9 Add unit tests for onnx helper functions 2017-10-16 16:57:16 -04:00
0f4ae13f05 Better cudnn version checking (#3132) 2017-10-16 20:59:18 +02:00
47beb64b5c Use ATen generator as default CPU generator (#3135)
ATen has its own default CPU RNG. Use this as the default in PyTorch so
that random functions called through ATen have the same behavior as
random functions called through TensorMethods.
2017-10-16 14:22:58 -04:00
28ed514bfe Add additional resizes to ClassNLLCriterion (#3134) 2017-10-16 12:30:45 -04:00
a0ac72e84e Use template instead of sphinx-contrib for google analytics 2017-10-15 18:40:05 +02:00
490d5c2f13 improve torch.load documentation (#3118) 2017-10-14 18:54:53 +02:00
75665ca6db Suggest NO_CUDNN=1 as alternative when CuDNN is too old.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-14 12:04:40 -04:00
f709199c49 Make test_jit more robust about compilation.
It's pretty easy to accidentally fail to actually compile
a JITed region, which means that we have accidentally failed
to have test coverage for a number of features.  This adds
a secret _assert_compiled kwarg, which will raise an error
if we don't actually hit the compiled codepath.

This is not intended to be user visible; we have some other
ideas for handling this case.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-14 12:04:40 -04:00
6dc67aef17 doc (#3110) 2017-10-14 10:44:35 +02:00
38f87cc9c4 Limit print scale by sys.float_info (#3113)
* limit print scale by sys.float_info

* test print tiny/huge values in test_print

* fix lint
2017-10-14 08:52:01 +02:00
f11fb319bd Fixes for new ATen
 - names of convolution and batch normalization functions changed
 - at::Type copy now uses broadcasting
 - at::Type storageFromBlob takes a deleter
2017-10-13 22:04:21 -07:00
5a195d9dc6 Merge commit '88b5bf8ec08f19c817019fa229eef0b1c6c92431' 2017-10-13 22:04:17 -07:00
864bd934b0 Add a helper function to check broadcasting (#3115) 2017-10-13 23:22:16 -04:00
4f81eff2eb Perform gradient checks on masked_scatter and masked_fill
We weren't doing gradient checks on these functions because the tests
were in-place only. We also incorrectly classified __magic__ functions
as inplace.
2017-10-14 00:02:22 +02:00
8666be05f5 Raise runtime error in setup.py if cudnn version is not supported 2017-10-13 23:58:25 +02:00
1322f9a272 Add cudnn version to torch.version 2017-10-13 23:58:25 +02:00
88b5bf8ec0 Every argument controlled by the output_mask may be null 2017-10-13 14:01:05 -07:00
8cfb23529b Add additional erf, erfinv, and additional nn functions 2017-10-13 10:39:26 -07:00
9b5371df1c Add bindings to additional NN functions 2017-10-13 10:39:26 -07:00
f444bd72b2 Don't free interned Python strings held in global variables (#3107)
This pulls in @gchanan's fix for some crashes on exit in PyArgParser
from #2997
2017-10-13 18:31:03 +02:00
5a96037810 skip ncclCommDestroy if CUDA driver is already unloaded 2017-10-13 08:50:00 -07:00
8f26d6aabc More shape checking for ConvNd (#3052)
* check conv weight & bias dims

* address comments
2017-10-13 16:56:19 +02:00
4831e478e1 Expose cmake version as env variable and scipy test 2017-10-13 16:54:35 +02:00
4c6c4c513a fix grad_bias calculation for nnpack 2017-10-13 16:16:31 +02:00
dd494091b2 remove std::move in profiler 2017-10-13 07:14:07 -07:00
cb011410b8 fix warning in THD 2017-10-13 04:03:11 -07:00
4b12d9d1b2 Expose is_nullable in Declarations.yaml
Some parameters can be null but do not have default values.
2017-10-12 19:53:29 -07:00
3366654fd4 Support broadcasting in copy
a.copy_(b) will now broadcast b to the shape of a. Note that this means
that copies between tensors of the same number of elements but
incompatible shapes are not allowed. For example, the following will
throw an exception:

  Tensor a = type.rand({4, 3});
  Tensor e = type.rand({3, 4});
  a.copy_(e);
2017-10-12 19:52:43 -07:00
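The C++ example above, translated to the Python tensor API as a quick sketch:

    import torch

    a = torch.zeros(4, 3)
    b = torch.ones(3)
    a.copy_(b)      # b is broadcast to a's shape (4, 3)

    e = torch.ones(3, 4)
    # a.copy_(e)    # now raises: same number of elements, but incompatible shapes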
de7e1a9a82 Use pointer equality to compare types 2017-10-12 19:50:24 -07:00
3bc94bf02c Combine comparison methods and functions
The methods were separate because PyTorch supports multiple output types
for comparison methods. For example, for FloatTensors 'a' and 'b' both
calls are valid:

   torch.lt(a, b, out=<ByteTensor>)
   torch.lt(a, b, out=<FloatTensor>)

ATen only supports ByteTensor outputs because the overloads have the
same static signature and would conflict. It would be nice to fix this
in the future like with the bernoulli function.

In the meantime, the separate function and method definitions with
different argument names make implementing VariableType more difficult.
2017-10-12 19:49:54 -07:00
92c9848c04 Support wrap_dim in nn.yaml 2017-10-12 19:45:15 -07:00
5d689989ec Expose the THGenerator* via unsafeGetTH on at::Generator 2017-10-12 19:43:17 -07:00
2675ff73fd Resize output argument total_weight in ClassNLLCriterion 2017-10-13 01:32:07 +02:00
61bb0d2954 Remove unused parameter 'input' from Tanh 2017-10-13 01:31:48 +02:00
66bb3d6dec Remove incorrect comment that join_with is symmetric.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-13 01:31:22 +02:00
191224b6e6 Suggest key_averages by default, it's more useful.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-13 01:31:22 +02:00
94c1fdd254 Typofix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-13 01:31:22 +02:00
86c1842701 More detailed docs for Graph.op
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-13 01:31:22 +02:00
b9cd45adcf Add note about inplace status in ONNX and JIT.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-13 01:31:22 +02:00
7cf4529a82 add a deleter callback to tensorFromBlob 2017-10-12 11:06:40 -07:00
f964105b56 Update generated ffi wrapper to consider other variable types (#3087) 2017-10-12 18:54:31 +02:00
9ef39a50ee Fix the broadcast in Addmm's symbolic (#3063)
* Fix the broadcast in Addmm's symbolic

* fix the non-matching dimension cases

* Add exception for non-supported case, remove onnx test cases (moved to onnx-pytorch repo)

* remove the test_onnx.py in run_test.sh

* lint the code
2017-10-11 22:23:11 -04:00
790941d6a0 Add additional comments 2017-10-11 18:28:07 -07:00
8d19116508 Generate PyTorch-style NN bindings
This generates NN bindings with a similar interface to PyTorch's
torch.nn.functional package. The file nn.yaml specifies function
signatures and THNN implementations.

Each NN operation generates three functions. For example:

  - conv2d
  - conv2d_forward
  - conv2d_backward

The conv2d and conv2d_forward functions differ in how they handle
buffers that need to be passed to the backward function. conv2d_forward
takes the buffers as parameters. conv2d creates the buffers internally
and discards them.
2017-10-11 18:28:07 -07:00
7bc154f8ea Remove unused argument 'input' to Sigmoid_updateGradInput (#3079) 2017-10-11 23:52:50 +02:00
23c4152b41 Resize outputs in criterions (#3074)
Most NN functions size their outputs appropriately. This makes the
criterions used in PyTorch consistent with the other NN functions.
2017-10-11 23:52:31 +02:00
2000ba0b26 Add random_ for cuda, fix random_ for cpu (#3042) 2017-10-11 23:45:17 +02:00
cc3058bdac Fix macOS build (with CUDA) (#3071) 2017-10-11 19:04:15 +02:00
bd9b4df6e9 Add support for exporting MulConstant, DivConstant and Softmax to ONNX (#2923)
* Add support for exporting MulConstant and Softmax

* Add support for MulConstant in autograd execution

* Also add support for DivConstant
2017-10-11 13:03:33 -04:00
9260f0e5ee Fix a typo in optim.rst (#3069) 2017-10-11 16:47:14 +02:00
72f6b5a03b Make DtoH and HtoD transfers respect the current stream (#3067) 2017-10-11 09:49:14 -04:00
169ed0cd4b remove torchvision docs from pytorch repo. Moved to vision repo (#3024) 2017-10-10 23:59:55 -04:00
828048f578 Add document on how Module.cuda() and optims should work together (#3056) 2017-10-10 22:55:23 -04:00
f2809a5259 Fix Python lint. (#3061)
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
2017-10-10 22:44:33 -04:00
246a382610 Simplify PReLU binding (#3055)
* Simplify PReLU binding

 - Remove internal buffers from function signature
 - Compute nOutputPlane internally

* Fix legacy PReLU
2017-10-10 17:50:13 -04:00
f74665f0c4 remove gcc install suggestion 2017-10-10 14:45:19 -07:00
d66549d27c remove files from botched merge 2017-10-10 14:42:41 -07:00
8d8a99c244 Add ONNX Pad reflect and edge mode support (#3048) 2017-10-10 17:02:08 -04:00
9437644f66 Replace softmin and softsign with simple differentiable expressions 2017-10-10 16:57:47 -04:00
c23ae308f3 Fix build without numpy (#3049) 2017-10-10 18:47:19 +02:00
f7f37306e4 New torch.jit.verify function to verify once-backward.
A few notes about the implementation:
- Need to plumb 'devices' through to the 'fork_rng' calls.  You definitely
  want these; it makes verify run A LOT faster
- New keyword argument for compiled model execution, '_force_trace', which
  forces us to retrace a model.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-10 11:46:40 -04:00
6de2929967 fix TH warnings after explicit types changes 2017-10-10 08:34:55 -07:00
2443fcac0b Deterministic cudnn algorithms 2017-10-10 10:53:34 -04:00
403a533827 Forgot to modify a kernel call 2017-10-10 10:18:50 -04:00
8cc258153d Make VolumetricAveragePooling cuda stream-aware 2017-10-10 10:18:50 -04:00
a47948784d add kwargs_only defaults for sorted and largest 2017-10-10 10:18:28 -04:00
9dd872053f Add possibility to fallback to retrieving MAJOR.MINOR 2017-10-10 10:16:14 -04:00
139aaf65d6 Bugfix plus remove other option that depends on the version.txt file
Restrict search for cudart instead of libcudart
2017-10-10 10:16:14 -04:00
f093545919 Add compiled CUDA version in torch.version.cuda 2017-10-10 10:16:14 -04:00
5e01bc7122 add 'at' helper method 2017-10-10 10:14:29 -04:00
b56098b540 Make parameter names consistent
Use the same name for parameters computed in updateOutput which are used
in updateGradInput or accGradParameters
2017-10-10 10:13:54 -04:00
efe91fb9c1 delete redundant python nccl code 2017-10-09 22:24:18 -04:00
e9dccb3156 implement all_reduce, broadcast, all_gather, reduce_scatter 2017-10-09 22:24:18 -04:00
4d62933529 add initial NCCL C bindings 2017-10-09 22:24:18 -04:00
b7e258f81e link specific versioned System NCCL, rather than generic file 2017-10-09 22:24:18 -04:00
803afd58a0 Make MultiLabelMarginCriterion respect the cuda current stream 2017-10-09 14:06:22 -04:00
a0831219cf SqueezeNet ceil_mode not yet supported.
Fixes #2898.

Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
2017-10-09 11:07:11 -04:00
6743d59513 Add missing import. Add return to __getstate__ 2017-10-08 11:07:10 -04:00
c62490bf59 Use PyInt in Python 2.7 with small values 2017-10-07 00:41:29 -04:00
f29bcab67e Use Declarations.yaml to generate python bindings 2017-10-07 00:41:29 -04:00
558d26a69e Fix argument indices 2017-10-07 00:41:29 -04:00
dcb8d0f088 Refactor out python binding generation from gen_variable_type.py
- Also includes some prep work for binding NN functions
2017-10-07 00:41:29 -04:00
dc1b4ff74e Fix isContiguousDim (#3011) 2017-10-07 00:40:51 -04:00
c52b3d7524 qr memory leak fix (#3017) 2017-10-07 00:35:05 -04:00
69fb6bee58 Remove the extra fake output in ONNX Concat (#3014) 2017-10-06 22:43:22 -04:00
aaa74b4929 Fix flaky erfinv autograd test (#3015) 2017-10-06 20:05:47 -04:00
0eec332e14 assert reflection padding in range (#3008) 2017-10-06 17:59:01 -04:00
d39e519ce2 Merge commit '18eb4bbdf9563c0620bbc93daa045c2258b63bde' 2017-10-06 12:36:34 -07:00
18eb4bbdf9 Improve Declarations.yaml: (#81)
* Improve Declarations.yaml:

 - translate defaults to C++ values
 - include names of returned values
 - mark keyword-only arguments

* Add comment to translate_default
2017-10-06 15:30:25 -04:00
39a82f3e3f Fix triu/tril (#3007) 2017-10-06 15:28:57 -04:00
3ae961f062 Release saved variables in generated functions (#3004) 2017-10-06 12:17:07 -04:00
10b42f5d6c Add ONNX support for ConstantPadNd (#2962)
* Add ONNX support for ConstantPadNd

* add comments to explain the order of paddings and pad is guaranteed to have even elements
2017-10-06 11:03:48 -04:00
898c732293 Introduce a reduce keyword argument for MSELoss (#2878)
* Add reduce keyword to MSECriterion API

* Move gradOutput usage from py to backend

* Implement reduce keyword for THNN MSECriterion

* Implement reduce keyword for THCUNN MSECriterion

* Implement reduce keyword for MSE double backwards

* Tests for MSECriterion with reduce keyword

* Documentation for reduce for MSELoss

* Make legacy nn work with reduce keyword by ignoring it

* Apply linter suggestions

* Address comments (small changes)

* Revert "Tests for MSECriterion with reduce keyword"

This reverts commit 1c0be0defa49d336d023d7d9795db4037c92b6fe.

* Undo changes to legacy nn tests

* Reuse module test for MSELoss by creating a wrapper class for MSELoss

* Address comments: refactor MSECriterion.cu to be nicer

* Fix lint & build errors
2017-10-06 10:57:22 -04:00
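A minimal sketch of the reduce keyword introduced here (newer releases spell this reduction='none'; the keyword below is the one this PR adds):

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    input = Variable(torch.randn(3, 5), requires_grad=True)
    target = Variable(torch.randn(3, 5))

    per_element = nn.MSELoss(reduce=False)(input, target)  # shape (3, 5), no reduction
    averaged = nn.MSELoss()(input, target)                  # default: a single averaged value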
6a91f556d0 fix a bug in exporter, we forgot to copy type to the new node for index op 2017-10-06 10:40:14 -04:00
7dd74b6a71 Address the introduced types in ONNX PR 57 2017-10-06 10:40:14 -04:00
268fce1073 change encodeType to encodeTypeProtoTensorType 2017-10-06 10:40:14 -04:00
10537ce4ed Support the new proto introduced in onnx/onnx PR 51 2017-10-06 10:40:14 -04:00
b2f5ccf366 lint 2017-10-06 10:39:33 -04:00
0c2957512f Fix two legacy modules clearing input tensor in clearState 2017-10-06 10:39:33 -04:00
ecdb86e733 Update all existing nn tests to new args format; Move all randomness inside tests 2017-10-06 10:39:33 -04:00
b6e1dd2674 Remove top-level seed setting 2017-10-06 10:39:33 -04:00
c76e2900a8 Change TestCase args to accept value, size or fn for constructor_args, input and target 2017-10-06 10:39:33 -04:00
5f8bab47c8 bugfix for issue 2428 (#3000) 2017-10-06 09:20:12 -04:00
4af66c4304 Cleanup: remove useCurrentStream function (#2990) 2017-10-05 23:04:59 -04:00
9425a2bf19 Fix cudnn grid_sample backward for implicit gradOutput (#2993) 2017-10-05 17:57:35 -04:00
2e4de82514 Support more ONNX ops in autograd execution
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-05 15:27:49 -04:00
2861638e8a Add torch.random.fork_rng, which forks the RNG temporarily.
There is a bit of nuance to this function.  If one blindly charges in
and initializes all GPUs, it is going to take a long time.  20sec for
8 GPUs on my dev machine.  But to a user, it is non-obvious that fork_rng
is going to hit all the GPUs by default (which it does for safety
reasons).  So there is a nice warning when we notice we're
hitting more than one GPU.  There is a bit of extra generality
which is going to be used by torch.jit in a subsequent commit.
2017-10-05 15:27:49 -04:00
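A brief sketch of the behaviour described above; the devices keyword limits which GPU generators are forked (an empty list touches none of them):

    import torch

    with torch.random.fork_rng(devices=[]):
        torch.manual_seed(0)
        noise = torch.randn(3)   # consumes random numbers inside the fork
    # the forked RNG state is restored here, leaving the outer state unaffected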
539ae451d2 Move random initialization functions from torch to torch.random.
The motivation is that I wanted to add some more general purpose
utility random functions, but not gunk up torch/__init__.py.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-05 15:27:49 -04:00
b08219b51a Correctly mark a method as override.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-05 15:27:49 -04:00
bfd77e9942 Delete obsolete comment.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-05 15:27:49 -04:00
0ae56ab247 Squash Python.h warning.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-05 15:27:49 -04:00
f9e9c5326b Support for Tanh and Sigmoid in the executor.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-05 15:27:49 -04:00
be04d5a347 Print small tensors in IR.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-05 15:27:49 -04:00
c9f7b1efcc Fix additional deprecated function signatures 2017-10-05 11:47:33 -07:00
2631ee749a Generate ATen from torch/lib/ATen/Declarations.cwrap
- Update calls to use new ATen ordering
2017-10-05 11:29:53 -07:00
92fdc55aaf Merge commit 'ba1f94b6f59e5cba251d4a4266701f1e72015bc2' 2017-10-05 11:04:52 -07:00
ba1f94b6f5 Refactor out TensorBase from Tensor
Use TensorBase in Scalar class
2017-10-05 14:02:34 -04:00
ef3b7597b7 Fix copy and move constructors 2017-10-05 14:02:34 -04:00
fa812c4511 Remove has_full_argument_list 2017-10-05 14:02:34 -04:00
a18e81ddb8 Fix lint 2017-10-05 14:02:34 -04:00
5e564d6c12 Add check that tensor is defined in Scalar constructor 2017-10-05 14:02:34 -04:00
4a12f70ba1 Move default arguments to function declaration
* Make alpha, beta in addmm kwarg_only
 * Move kwarg_only arguments to the end
 * _out variants now have output arguments at the beginning
2017-10-05 14:02:34 -04:00
cbdbe518e9 If cudnnSetStream is not successful, give error instead of warning (#2988)
* If cudnnSetStream is not successful, give error instead of warning

* Use built-in error reporting
2017-10-05 13:12:29 -04:00
c74f7d8ade Support varags style IntLists in derivatives.yaml and implement view. (#2963) 2017-10-05 11:46:23 -04:00
137b139551 Make cuDNN use the current stream (#2984) 2017-10-05 09:27:04 -04:00
b029582655 Merge commit '03d856977ecbaac87e598c0c4bafca96761b9ac7' 2017-10-04 21:57:36 -04:00
ba766ef39a Fix BN size check in eval mode (#2977) 2017-10-04 16:03:20 -04:00
7a809ea6fd Fix build for MSVC 2017-10-04 16:01:30 -04:00
f783a65a5a Merge commit 'bace20a7d446c4e130d49ad47c3370ae00f82c05' 2017-10-04 16:00:15 -04:00
bace20a7d4 Fix build for MSVC 2017-10-04 15:56:57 -04:00
029252fb3b NNPACK bindings for Convolution (#2826)
* skeleton commit for building and linking nnpack library in PyTorch

* first stab at conv forward binding + integration

* bind NNPACK gradient kernels

* move nnpack forward, input gradient calls deeper

* nnpack conv api mimics nn

* fix symbol error; use memory across calls

* clean up warnings, add shape checking, thread safety, configurable thread specification

* add batch size threshold, also bind for single-element batch for the future
2017-10-04 13:48:14 -04:00
42712c677d More user-friendly error messages for indexing with multi-dimensional LongTensors (#2974) 2017-10-04 10:55:55 -04:00
f608208a80 Fix scatter size check (#2960)
* scatter size check

* add comment for size_check macro
2017-10-04 10:21:29 -04:00
b3bcba60c7 Correct padding docs of 3D modules (#2970)
3D modules apply padding on all three sides. "Both" doesn't make sense here.
I used the wording of the AvgPool3d docstring, where it was already correct.
2017-10-04 09:52:37 -04:00
756ab3f24f Adding conversion from python tensor to dlpack tensor (#2933) 2017-10-04 08:35:42 -04:00
5f864ca4d2 Support TensorList arguments, torch.cat, and narrow in derivatives.yaml (#2936)
* Generate torch.cat autograd via ATen.

Most of the change is around supporting generation of:
1) TensorList arguments
2) Arguments to "size", "sizes", i.e. "sizes(dim)"
2017-10-03 18:21:10 -04:00
c489445c46 Add ONNX support for Mean (#2956) 2017-10-03 18:16:45 -04:00
faa6fdfa18 Raise error when each channel only has 1 value in batch norm (#2961)
* add error when each channel only has 1 value
2017-10-03 17:56:15 -04:00
6fbdf40284 Translate addmm into Gemm operator / fix alpha-beta mixup / execute in JIT.
The alpha/beta naming in addmm was flipped; this commit fixes that
problem.  It also fixes the ONNX export of alpha/beta parameters.
Finally, it supports executing matmul in the JIT.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-10-03 17:23:43 -04:00
76a282d228 Fix resizing of gradInput in BatchNormalization (#2959)
 * In the C code there was a race condition when gradInput was resized within the
   parallel for loop
 * CUDA was missing the resize for gradInput
2017-10-03 15:38:34 -04:00
9088a940d7 Completed Stride() documentation (#2948) 2017-10-03 15:36:10 -04:00
9b5f70780c Merge commit '9a471b015fc6db86f67833d29114b7307e3b1727' 2017-10-03 10:27:43 -07:00
1512562613 Fix lint 2017-10-03 09:20:55 -07:00
1ff34a0535 generates non-equal random tensor for max pool 2017-10-03 11:56:59 -04:00
fa8044d92f Add tests for array interface 2017-10-03 10:27:56 -04:00
c488a9e9bf Add Numpy array interface to tensors 2017-10-03 10:27:56 -04:00
b6b41c829a Add inplace checks in JIT 2017-10-03 10:20:58 -04:00
82bc97e6be Fix THC exponential to not sample infinity 2017-10-03 10:06:47 -04:00
437d3af7bf Add CUDNN_INCLUDE_DIR before CUDA directories in setup.py 2017-10-03 10:06:47 -04:00
bf82ecd776 Hotpatch THPP compile error 2017-10-03 10:06:47 -04:00
6fbbb1bc4e Limit number of demangler invocations in autograd profiler 2017-10-03 09:55:37 -04:00
9a471b015f Implement _unnarrow (backwards of narrow) in ATen.
Note this is currently prefixed with an underscore because it may go away
(can be implemented via index).
2017-10-02 21:25:59 -04:00
d381efcf3c Enable wrap_dim in Local.cwrap.
This includes torch.cat, which is a TensorList argument, which wasn't supported before.
2017-10-02 21:25:59 -04:00
312e0ce3ba fix nn.HingeEmbeddingLoss doc 2017-10-02 18:14:40 -04:00
2c26f4728a fix typo in document of nn.AdaptiveMaxPool1d 2017-10-02 17:54:42 -04:00
e4701e63f6 Fix exporting Reshape with single torch.Size argument 2017-10-02 23:29:49 +02:00
4d605259b9 Fixes after merging ATen:
* Mark all (non-static) Type methods as const.
2017-10-02 13:14:35 -07:00
e99aec9e9e Merge commit '9f4accd5bb99900dfda9ffab110aeb7a4534d629'
* commit '9f4accd5bb99900dfda9ffab110aeb7a4534d629':
  Make all dim arguments int64_t
  Converting dlpack tensor to aten tensor
  adding a simple class for converting atensor to dlTensor
  Test stub for dlconvertor
  adding dlpack header
  Fix build failure in MSVC
  Mark all (non-static) Type methods as const.
2017-10-02 13:01:18 -07:00
4c61cf2a1f Updated functions for benchmark test 2017-10-02 15:01:51 -04:00
00b62db723 Fix scope error
error: ‘getInitConfig’ was not declared in this scope
2017-10-02 14:43:46 -04:00
621603169c initialize new tensor 2017-10-02 09:53:21 -04:00
6ef417ce89 Fix typos 2017-10-02 09:32:25 -04:00
ca644ca204 Add inplace zero to variable (#2212) 2017-10-02 14:02:24 +02:00
3ce6f0a457 turn ModelProto.graph into callback type 2017-10-01 23:09:13 -04:00
9fc86782d7 Fix the breaking changes in ONNX PR #58 2017-10-01 23:09:13 -04:00
a64daf2c59 support dictionary return types in nn.Module's __call__ (#2037) 2017-10-01 20:33:03 -04:00
5d9de014bd Fix typos 2017-10-01 03:09:25 -04:00
21f8ad44e1 put limits on CuDNN BatchNorm codepath 2017-09-30 19:00:44 -04:00
d5a7e304fa added volumetric adaptive max pooling 2017-09-30 16:57:51 -04:00
7ff9e0eb6c fixed test_AdaptiveMaxPool*d_indices testing the non-adaptive classes 2017-09-30 16:57:51 -04:00
9415f84982 spatial CUDA kernel int64_t stride inputs, removed unused parameter 2017-09-30 16:57:51 -04:00
855b7e28ee START_IND & END_IND macros, removed unnecessary computation in updateGradInput 2017-09-30 16:57:51 -04:00
b9c942a7d4 reorder spatial variables BDHW 2017-09-30 16:57:51 -04:00
0685c063bf rename spatial version variables 2017-09-30 16:57:51 -04:00
67b2923a9d Set all GPU state, not just the first one.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-30 16:21:04 -04:00
a8bf73be50 Mention random_ not available on CUDA.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-30 16:21:04 -04:00
2dcaa40425 Add get_rng_state_all and set_rng_state_all.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-30 16:21:04 -04:00
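A short sketch of the new all-GPU state helpers named in this commit:

    import torch

    if torch.cuda.is_available():
        states = torch.cuda.get_rng_state_all()   # one RNG state per visible GPU
        # ... run code that draws CUDA random numbers ...
        torch.cuda.set_rng_state_all(states)      # restore every GPU's generator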
db298618e4 Minor typofix.
Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
2017-09-30 16:18:03 -04:00
60bff0a5f3 fix nccl version 2017-09-30 16:17:20 -04:00
5cc3aff9ba use nccl deb in Dockerfile, easier to change python version 2017-09-30 16:17:20 -04:00
9b9704e701 Simplify getApplyGrid in THC (#2900) 2017-09-30 16:16:36 -04:00
b3bc5fe302 refactor THCP method defs into cuda/Module.cpp 2017-09-30 13:14:35 -07:00
7190979ab3 fix the script to generate the nanopb files (#2907) 2017-09-30 10:21:06 -04:00
181b2481d3 add error checking to grid sampler (#2902) 2017-09-29 15:18:31 -04:00
d7ee3e0bd0 Fix the memory leak for multiple workers (#2897) 2017-09-29 11:58:28 -04:00
e67c2bc567 Fix detection of NCCL_INCLUDE_DIR (#2877)
* Fix detection of nccl.h when libnccl.so is in /usr/lib/x86_64-linux-gnu and similar paths

* full support for independent NCCL_LIB_DIR and NCCL_INCLUDE_DIR

* lint fix

* add back CUDA_HOME
2017-09-29 10:42:10 -04:00
6a800be748 import lr_scheduler in __init__.py
Fix https://github.com/pytorch/pytorch/issues/2809
2017-09-28 23:38:23 -04:00
21707065d2 latest gloo 2017-09-28 15:34:42 -07:00
a92fce1871 fix precision of grid_sample test 2017-09-28 15:11:50 -07:00
b9747af242 Use make_variable instead of VariableImpl.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 17:37:22 -04:00
7d40cce267 Simplify glu symbolic 2017-09-28 16:42:52 -04:00
c72ee3981b Add support for exporting GLU to ONNX 2017-09-28 16:42:52 -04:00
002288c118 Add launch bounds to spatial grid sampler
Needed for CUDA9
2017-09-28 16:33:34 -04:00
954e9e370c Uncurry trace.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
bff81a3cbd s/extra/unmatched/
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
91827edd1c Fix initializers off-by-one.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
cdcf09405e Use parent setUp, which also seeds CUDA if necessary.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
600fcf2f04 Delete params.
We have decided we are not going to support it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
fecca48a2c Time how long compilation takes.
Also, still give time even if we throw an error midway.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
0ad6c2d59c Lintfix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
cfa176b9bd Dump the final trace (redundantly), for ease of use.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
db3349faa3 Support class decorator syntax; remove instance compilation.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
1cf24b8d55 Restore enabled/time debug parameters.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
c430501ee5 Timing works again, controlled by PYTORCH_JIT_TIME.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
b1ba6c3ddd Add back trace dumping, fix some syntax errors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
7bace0a1d9 apaszke review comments
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
0c40305ddd Rewrite torch.jit interface.
torch.jit now contains two user-facing functions: compile and trace
(corresponding to what was previously trace/traced and record_trace).
The non-curried versions of these functions have been eliminated, so
that there is only one function in the API (we *must* have the
curried versions, since these enable their use as decorators).  There is
detailed usage documentation in the docblocks for these methods.

This comes with a complete rewrite of the internals of torch.jit, in the process
fixing a number of bugs.  Key points of the new implementation:

- compile and trace both always return a Module representing the wrapped
  with compilation/tracing underlying function/module.  This makes handling
  of the function/module cases more uniform, as we can think of the function
  case as creating an on-the-fly module with the parameters explicitly
  specified by the user.  For technical reasons, we now *require* any parameters
  in the function case to be honest-to-goodness Parameters (gory details:
  you can't register a Variable as a Parameter to a Module, but you can't
  create a Parameter from a Variable while sharing the same underlying
  identity.)

- Flattening and unflattening is done a lot more uniformly.  We now have
  a _flatten and _unflatten function which are inverses of each other:
  _flatten always returns both the flat, tuple of Variables, *as well as*
  the "proto" (now referred in the code as the "struct") from which we
  can unflatten the variables.  Low level functions like 'raw_trace'
  always work with the flattened inputs/outputs, which keeps their logic
  simple.

- JIT trace keying now also includes the "struct" of the input arguments.
  This is a step towards accepting non-Variable arguments in functions,
  although flatten/unflatten don't currently support it.

- TraceForKey (previously TraceInfo) has had its API reworked to have
  less degrees of freedom when you are interacting with it.

TODO: Verify, timing, and trace dumping have been temporarily excised.  I
plan on adding them back.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-28 12:34:35 -04:00
f2037970cb Cleanup for 'prob_dist' in multinomial function (fixes #1584) 2017-09-28 12:07:06 -04:00
c27aaf67cd Improve Function docs 2017-09-27 22:41:45 -04:00
095805036c re-enable out-of-place bernoulli for cuda tensors 2017-09-27 21:32:36 -04:00
9f4accd5bb Make all dim arguments int64_t 2017-09-27 15:10:34 -07:00
e9fe0d8e6c Fix for clang 9 build issues 2017-09-27 17:53:06 -04:00
0fb9db1606 Converting dlpack tensor to aten tensor 2017-09-27 09:58:52 -07:00
b4e02e8e0f adding a simple class for converting atensor to dlTensor 2017-09-27 09:58:52 -07:00
4a58e0ca42 Test stub for dlconvertor 2017-09-27 09:58:52 -07:00
c6a2175d27 adding dlpack header 2017-09-27 09:58:52 -07:00
c8f824cd1b Improve import failure messages 2017-09-27 10:37:54 -04:00
1a8fb81f22 define M_PI for TH 2017-09-27 00:06:01 -04:00
dcee596a8b change Variable.cuda to be consistent with Tensor.cuda 2017-09-26 23:48:40 -04:00
de757805fc Implement some autograd functions using ATen (#2805)
This adds some generated autograd functions implemented in C++, which
are generated from derivatives.yaml. It also generates Python bindings
for the Variable methods. The generated files are:

 Functions.cpp/h: subclasses of torch::autograd::Function
 VariableType.cpp/h: The at::Type for autograd Variables
 python_variable_methods.cpp: Python bindings to torch::autograd::Variable
 python_variable_methods_dispatch.h: wrapper which releases GIL and sets the
     CUDA device
 python_functions.cpp/h: exposes generated autograd functions as Python
     objects

The generated functions are mostly shadowed by the definitions in
variable.py. We'll remove the Python implementations in favor of the
generated C++ implementations in a subsequent commit.
2017-09-26 17:08:00 -04:00
9be8d0a9d2 Add a docstring for functional.linear.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-26 12:29:07 -04:00
2d6a880952 Fix jit attributes tests 2017-09-26 10:51:58 -04:00
ded3a3b317 fix small bug in nccl setup helper 2017-09-25 21:21:36 -07:00
7caceea6e8 better error messages for Conv*d input shape checking 2017-09-25 23:53:59 -04:00
833bedc77d Add CUDA profiler bindings 2017-09-25 23:21:30 -04:00
b7849662b5 Always regenerate nn wrappers after rebuilding THNN and THCUNN 2017-09-25 23:21:30 -04:00
411e1469e0 Add tools for autograd profiling 2017-09-25 23:21:30 -04:00
f4eca7c94d make CUDA_HOME take precedence over all other CUDA detection methods (#2863) 2017-09-25 18:17:40 -04:00
4e23658d47 Fix warnings in TH_ErfInv (#2861) 2017-09-25 18:05:32 -04:00
9defb8e653 fix Dockerfile for submodules 2017-09-25 18:04:34 -04:00
6a4ec4f9a8 VolumetricAdaptiveAveragePool 2017-09-25 15:12:44 -04:00
7254104cfc Spatial CUDA kernel: removed unused sizeD parameter; changed stride types to int64_t to be consistent with caller function 2017-09-25 15:12:44 -04:00
dd891c4923 reorder spatial version variables so that B (batch) before D (feature) before H (height) before W (width); change some code to be more concise 2017-09-25 15:12:44 -04:00
8ffe8eca6c rename spatial version 2017-09-25 15:12:44 -04:00
3128218397 Allow specifying unused inputs to torch.autograd.grad (#2859) 2017-09-25 14:42:33 -04:00
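A hedged sketch of specifying unused inputs; the allow_unused keyword name matches the current torch.autograd.grad API and is assumed to be what this PR introduces:

    import torch
    from torch.autograd import Variable, grad

    x = Variable(torch.randn(3), requires_grad=True)
    y = Variable(torch.randn(3), requires_grad=True)   # never used below
    out = (x * 2).sum()
    gx, gy = grad(out, (x, y), allow_unused=True)      # keyword name assumed; gy is None instead of an error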
605beb2565 Parallelize CUDA LookupTable_renorm (#2803) 2017-09-25 14:04:07 -04:00
c08395e290 Give a better error message when we hit a legacy function.
We now include the type name of the legacy function implementing
class.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-25 12:26:07 -04:00
2a8603c5e1 Make distributed recv return sender rank 2017-09-25 12:11:52 -04:00
5be06230f9 cleanup external NCCL detection, add NCCL_ROOT_DIR / NCCL_LIB_DIR mechanism 2017-09-25 11:28:59 -04:00
289dc2a870 fix argument passing bug in build_libs and allow external NCCL_ROOT_DIR via environment variable 2017-09-25 11:28:59 -04:00
30ceac28e4 also check LD_LIBRARY_PATH for cudnn 2017-09-25 11:28:59 -04:00
c580352aee Adding 1d upsampling (#2846) 2017-09-24 16:50:24 -04:00
eff5b8b09c parameters to vector and vector to parameters (#2795) 2017-09-23 13:06:40 -04:00
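A quick sketch, assuming the helpers live under torch.nn.utils as they do in later releases:

    import torch.nn as nn
    from torch.nn.utils import parameters_to_vector, vector_to_parameters

    model = nn.Linear(4, 2)
    vec = parameters_to_vector(model.parameters())       # all parameters flattened into one vector
    vector_to_parameters(vec * 0.5, model.parameters())  # write a modified vector back in place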
287f434900 Add support for exporting Addmm with alpha != 1 or beta != 1 2017-09-23 11:17:27 -04:00
767f704b84 Let Gloo check if it supports GPU Direct at run-time 2017-09-23 11:07:53 -04:00
bf9ab91779 Indicate if the last invocation of setup.py was debug or not.
How to use:

    import torch.version
    print(torch.version.debug)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-22 18:33:47 -04:00
c630c34d41 remove Undefined node from black list, remove const_cast, add some comments 2017-09-22 18:22:32 -04:00
f256f686b5 Remove device comparison TODO mark, change the white list to a black list on node kind checking 2017-09-22 17:06:27 -04:00
0566a4c026 Fix some bugs, and assume graph is always visited in topological order. 2017-09-22 17:06:27 -04:00
18a1d272bf Add attributes comparison, fixed several issues, more interesting test case. 2017-09-22 17:06:27 -04:00
972d048cf8 Typofix [ci skip] 2017-09-22 17:06:27 -04:00
0a1ac8bfe5 create a cse pass, with very naive support. 2017-09-22 17:06:27 -04:00
999607460a Add a verbose option for gradcheck. (#2780)
When verbose is True, a more detailed message on why gradcheck failed
will be printed to stderr.
2017-09-22 15:59:14 -04:00
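A small usage sketch; per this PR, passing verbose=True makes a failing check print a detailed message to stderr (the flag may differ in later releases):

    import torch
    from torch.autograd import Variable, gradcheck

    x = Variable(torch.randn(3, 4).double(), requires_grad=True)  # double precision for tight tolerances
    assert gradcheck(torch.sigmoid, (x,))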
e4d6ee114f typo fix 2017-09-22 12:37:59 -04:00
450379256c Don't call is_available() in manual_seed, it initializes CUDA.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-22 12:37:06 -04:00
b17dfa07ba Make CUDA seeding/RNG state functions even lazier
Instead of initializing CUDA immediately and executing them,
we wait until we actually initialize CUDA before executing.

To keep things debuggable, we also keep track of the original
backtrace when these functions are called, so we can inform
users where they actually called the seeding/state functions
(as opposed to the first time they actually initialized the
RNG).

Fixes #2517

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-22 12:37:06 -04:00
06d7a0b1bc Write docs for RNG seeding on GPU more carefully.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-22 12:37:06 -04:00
805ad16924 Support "expanding" an empty tensor to an empty tensor. (#2824)
This doesn't currently support expanding the sizes to (0,), but
we can handle that eventually at the ATen level.
2017-09-22 11:58:03 -04:00
34a1d414a5 [Distributed/Gloo] 3X performance improvement of Gloo AllReduce By Enabling CUDA Direct (#2827) 2017-09-22 09:32:56 -04:00
cf7e28de8e add CUDA RNG docs 2017-09-21 19:36:41 -04:00
8d19319fa7 Documentation for FusionGroup and Eval requested by @houseroad (#2808)
Plus a test for Eval nodes in the IR, since we hadn't actually
covered this case now that some nodes are transparently traceable.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 17:14:56 -04:00
892940b45a fix memory leak in min function 2017-09-21 12:10:28 -04:00
723214e9ac Resolve mismatch between ATen master and pytorch subtree. 2017-09-21 12:10:09 -04:00
f6d3c17fd7 Directly check if the state_dict() has changed, so we fail earlier.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
e1add8fdff [FIXUP] Give a slightly different error if tracing state is expired.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
6125ea7c83 Create a FuncModule for conveniently module-izing functions.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
ea2e7a1f4e [FIXUP] Deduplicate accept_output logic.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
a01de93fad Give better error message when symbolic() arguments don't line up.
Now we actually tell the user what operator was being translated
when there was a failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
c083d3ac2e Fix minor bug when --accept'ing commits.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
b805f3f676 Also fix AvgPool2d to follow new convention.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
08148a462c Print name of operator whose symbolic gave wrong number of inputs.
TODO: Robustify this to apply to everything.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
bfed2dce25 AvgPool2d was returning too many outputs, fix it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
871e3b41e3 Ask for the correct number of derivatives when tracing.
- If you operate with TracingState, you MUST check if it is live.
  Otherwise you will segfault if it is expired; it is VALID for
  tracing states to become expired.

- Tracing states can expire if they request backward tracing
  (which the tracer does by default).  We don't want this to
  happen for exports, which only look at forwards.  So make
  sure we set the correct num_derivatives.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
10ef82f13e Make assertExpected work with Unicode strings in Python 2.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
460a03751b More expect test improvements.
- Print some diagnostic information when accepting new test output.

- If it's the first time you ran an expect test, print out
  the output you got so it's easier to decide if you want
  to accept it.

- Add infrastructure for expect-testing against exceptions
  (I'm going to use this in a later patch).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
f3ae642162 Tighten up the ONNX export interface
- If a user accidentally attempts to export a model that is in training mode, the
  tracer may perturb the parameters (since modules like batchnorm will update
  their parameters.)  To prevent this from happening, we temporarily turn
  off training mode to make sure this doesn't happen.  Temporary is
  important, since model export should not actually affect the model

- If you have a buggy model which is changing the parameters,
  it is much better for us to export the state_dict() *prior*
  to executing the model, because that is what we actually
  used as the inputs to the trace.  The state_dict() afterwards
  could be anything.

- kwargs support never worked, so it's been excised.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-21 12:03:22 -04:00
d2d7a0f514 Fix build failure in MSVC 2017-09-21 00:02:04 -04:00
2b9765ad02 Erf and erfinv (#2799) 2017-09-20 21:23:45 -04:00
1a83c372ec address issue #1488 by using defaultdict in load_state_dict 2017-09-20 14:56:21 -04:00
ad414908d7 Advanced Indexing with variables for autograd (#2590) 2017-09-20 14:50:07 -04:00
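A minimal sketch of indexing a Variable with a LongTensor index and backpropagating through it, as this PR enables:

    import torch
    from torch.autograd import Variable

    x = Variable(torch.randn(5, 3), requires_grad=True)
    idx = torch.LongTensor([0, 2, 4])
    y = x[idx]              # advanced indexing now participates in autograd
    y.sum().backward()      # gradients flow back into the selected rows of x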
2996aad68c remove dead code, add insertAt helper 2017-09-20 12:24:27 -04:00
6e495f5f85 Make output_ a const field in Graph.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-20 12:24:27 -04:00
0821856ac9 Add missing is-Param assert
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-20 12:24:27 -04:00
6efd797376 Document unchecked invariant.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-20 12:24:27 -04:00
25c2b7d8b2 Some minor extra comments on python_function
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-20 12:24:27 -04:00
794e52bb1c Make cloneFrom() copy all metadata; use createClone() as much as possible.
To be honest, this was the whole point of this refactor set.

I noticed that in a lot of code, we were repeatedly copying lots of metadata
from old nodes to new nodes.  This was quite concerning because I wanted to
add some more metadata (alias information) and I didn't want to have to
get it right in all cases.  Plus, in a lot of cases we were forgetting
to set more optional properties like debug names when we "copied".

To solve this, I first made cloneFrom() copy all of this metadata.  Then,
I searched for all occurrences of setType() (a proxy for "I'm cloning this
node), looked for cases where we really were morally doing a copy, and rewrote
the code to use cloneFrom() instead, allowing us to drop explicit setType()
(and getting more metadata preservation in the process.)

Finally, I refactored tryToMoveChunk.  The code is modestly longer,
but the new version has the nice property that the initialization of
selects for input_chunk are next to the creation of the node (as opposed
to delayed for later.)  I also added a lot more comments for invariants
I noticed when I was working on the code.

One minor extra change: TensorType grew a new constructor and a withSizesStride
"immutable setter" which returns a new copy of TensorType with different info.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-20 12:24:27 -04:00
0b421e590c Move some logic into create().
Previously, there was a hidden, unchecked invariant that you were not allowed to
call create(kParam) or create(kReturn).  Now that the logic for them is embedded
in create(), the create(kParam) case is valid, and the create(kReturn) case
will raise dynamically if you try it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-20 12:24:27 -04:00
ba95ffed97 Const correctness in IR and Attribute / linked list excision
Since this code has been stable for a while, I think it's
a good opportunity to make it const correct.  There is only
a slight increase in code size, which I hope will appease @zdevito.

- consts were added to all methods which are logically const.  Most notably,
  lint() is now declared const.

- I made extra const versions of Node::iterator(), Node::reverseIterator(),
  Graph::nodes(), Attribute::find(), linked_list::begin(), linked_list::end(),
  linked_list::rbegin(), linked_list::rend(); in all cases these were one-liners
  except for find() (I spent a little time trying to make find() a one-liner
  but didn't think of a way to do it.).

- graph_node_list got factored out into a new, templated type linked_list<T>
  (perhaps we should call it intrusive_list<T>).  I had to template the iterator
  to define constant and non-constant iterators without duplicating code,
  and once I was there, I decided to templatize everything else.  The code
  nicely factors out, although I wouldn't recommend using it for anything
  else without more refactoring.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-20 12:24:27 -04:00
670ec4bc59 Split Type into its own header file.
No other substantive changes.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-20 12:24:27 -04:00
06903c3525 bugfix for word language model 2017-09-20 12:24:27 -04:00
5949bb27b5 move write_vis into contrib 2017-09-20 12:24:27 -04:00
a194e66186 allow Concat operators to be the final operator in a fusion group, and update the fusion compiler to support code that includes final concats 2017-09-20 12:24:27 -04:00
27bae83a3a make graph layout more readable 2017-09-20 12:24:27 -04:00
3fb39add23 debugging code to understand fuser 2017-09-20 12:24:27 -04:00
c8993a3e2c Add add_scaled and sub_scaled to TH and THC (#2789)
These functions accept a scaling parameter like THTensor_(cadd)/(csub),
which will make it easier to have the same signature for tensor and
scalar addition in PyTorch and ATen. For example:

  tensor.add(other, alpha=2)

Will work if other is a scalar or a tensor value.

See #2739
2017-09-20 11:39:40 -04:00
16a3de081a Minor rebase fixes 2017-09-20 11:22:57 -04:00
3be774ccb7 Use TH_TENSOR_APPLYx_CONTIG for contiguous tensor to increase the speed. 2017-09-20 11:22:57 -04:00
06fdce04ca Generate ATen from torch/csrc/Declarations.cwrap (#2791)
This adds a concatenated Declarations.cwrap which is the result of
running ATen/extract_cwrap.py on TensorMethods.cwrap. This will let ATen
and the Variable bindings temporarily diverge from Tensor before the new
Variable class subsumes Tensor.

See #2739 and #2633
2017-09-20 09:44:01 -04:00
f4169260f8 Fix crash when calling backwards on leaf variable which does not require grad (#2788) 2017-09-20 09:43:20 -04:00
39434ee2e4 Added LPPool1d. (#2783) 2017-09-20 09:19:29 -04:00
871530afdf Mark all (non-static) Type methods as const. 2017-09-19 18:21:42 -07:00
5deacb5bce Enhance comments
* Explain why null edge pruning interferes with SimpleEval
* Explicitly refer to notes using Note sigil
* Copyedit comment for clarity
2017-09-19 10:53:32 -04:00
c536da7064 Remove TensorMeta 2017-09-19 10:53:32 -04:00
a7c4152302 Prune null edges in Eval nodes 2017-09-19 10:53:32 -04:00
b66d90c84f Add a pass to remove all non-standard ONNX nodes before export (#225) 2017-09-19 10:53:32 -04:00
6855d24ff1 Move pybind11 type_caster to different pybind.h in the corresponding folders. (#222) 2017-09-19 10:53:32 -04:00
b7e89d7248 Add support for some ONNX nodes in JIT closure 2017-09-19 10:53:32 -04:00
fe5c644f81 Handle AddConstant in fusion compiler 2017-09-19 10:53:32 -04:00
e05cfb2064 Make sure passes don't mess up stages of nodes and graphs 2017-09-19 10:53:32 -04:00
8a605ce766 Minor refactor of fusion compiler 2017-09-19 10:53:32 -04:00
75497d624e Add JIT_EXPECT (#220)
Add JIT_EXPECT(M) and turn some JIT_ASSERT(M) to JIT_EXPECT(M)
2017-09-19 10:53:32 -04:00
d4fda0bbf8 More updates for Variable ATen
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-19 10:53:32 -04:00
ba6e652c02 Add simple mode to Eval 2017-09-19 10:53:32 -04:00
1f80dd03bd Track change of Variable from shared_ptr to ATen style tensor
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-19 10:53:32 -04:00
aa1a94058b Add AddConstant node to the JIT 2017-09-19 10:53:32 -04:00
7506a3bcb7 Add pybind converters for Symbol and AttributeKind 2017-09-19 10:53:32 -04:00
28828e033f Make certain functions traceable 2017-09-19 10:53:32 -04:00
4d1ed4ec42 Assign traces before saving Variables 2017-09-19 10:53:32 -04:00
af688905e4 Fix a bug in CppOp (missing cloneFrom) 2017-09-19 10:53:32 -04:00
214eef5e5d Record device information in TensorType and check it in the fuser 2017-09-19 10:53:32 -04:00
ab375e19aa size test 2017-09-19 10:53:32 -04:00
83e38d687b Add a comment about what is going on here 2017-09-19 10:53:32 -04:00
dd85947542 fix the fusion test WAR 2017-09-19 10:53:32 -04:00
2ae7d8e5f9 Fix Chunk heuristic in graph fuser 2017-09-19 10:53:32 -04:00
b708b6de8d Add ONNX pass (JIT trace initialization) 2017-09-19 10:53:32 -04:00
0e53fe3a41 Put ONNX files where they belong 2017-09-19 10:53:32 -04:00
8dae433de8 Move JIT passes to a separate directory 2017-09-19 10:53:32 -04:00
2a7b4f5095 Allow TensorMeta to be undefined 2017-09-19 10:53:32 -04:00
6b60f31081 Fix bugs in AutogradClosure 2017-09-19 10:53:32 -04:00
964b731af3 Try to handle NULL Variables in the tracer 2017-09-19 10:53:32 -04:00
aafa35e0b5 Fix bugs in Traceable
Previous refactor introduced a few problems like not saving the
output proto, and it didn't use the flattened inputs when querying
the key.
2017-09-19 10:53:32 -04:00
9c39e8cecb Parity with NumPy newaxis placement in indexing (#2779) 2017-09-19 10:38:18 -04:00
561fc8d96a remove rotted TODOs 2017-09-18 18:17:20 -04:00
25aea46739 add missing AutoGPU guards 2017-09-18 18:03:03 -04:00
8536079142 missing include 2017-09-18 14:51:23 -07:00
30af9d793d Add broadcasting to bitwise operators. (#2776) 2017-09-18 17:30:02 -04:00
5229a79bf5 Implement THCUNN code for GridSampler (#2737) 2017-09-18 17:29:26 -04:00
c6ea6ed8ff Add Nd Padding, Pad1d functions and ConstantPad3d (#2657) 2017-09-18 14:48:49 -04:00
ea8b09365c Specifying the value used for padding (#2751)
* Specifying the value used for padding

The "pad_packed_sequence" function fills padded elements with zeros, but sometimes it is not useful. For example, some previous papers on NLP, including my recent paper [1], use a max-pooling technique for RNN-based sentence representations. More specifically, the max-pooling technique selects the maximum value from all time steps (i.e., hidden states) for each dimension. In such a case, we do not want the padded zeros to be selected. To overcome this situation, we can simply use a very small value instead of zero.

An LSTM example is shown below:

input = embedding(Variable(batchInput))
packedInput = nn.utils.rnn.pack_padded_sequence(input, lengths, batch_first = True)
h, (hn, cn) = self.encoder(packedInput, (h0, c0))
h, _ = nn.utils.rnn.pad_packed_sequence(h, -1024.0, batch_first = True)
sentenceRep, _ = torch.max(h, 1, keepdim = True)

[1] A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. The 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017).
https://arxiv.org/abs/1611.01587 (Equation (4))

* Modified the order of the arguments

Following the suggestion, I modified the order of the arguments.
2017-09-18 14:48:10 -04:00
2763bfc49e Norm subgradient at 0 (#2775) 2017-09-18 12:26:36 -04:00
16ddc863f4 revert more THC Atomics bits from Windows changes 2017-09-18 07:09:01 -07:00
59b139dabd Fixed compilation on OSX (#2761) 2017-09-17 10:17:59 -04:00
1fc85cde1f serialization fix to preserve backward compatibility and contbuild (#2763) 2017-09-17 10:16:21 -04:00
e397439611 fix THPP CUDA after windows changes 2017-09-16 23:04:54 -07:00
ddd417faf0 Fix non-CUDA builds after Windows PRs (#2760) 2017-09-17 02:02:52 -04:00
2bc1e07b62 THC/THCUNN reverts of incorrect changes after Windows fixes 2017-09-16 22:19:56 -07:00
db7c76128f Merge commit '6643f1b9caefd466441f7c0d18ba06a2a810b7f5' 2017-09-17 00:37:45 -04:00
6643f1b9ca Win64 support for lib/ATen 2017-09-16 21:36:30 -07:00
7951c4a68d Merge commit '5b5218ea9574f93887498a81038352af47fd7fd8' 2017-09-17 00:31:21 -04:00
c7d5ddd23b Improve Windows Compatibility(for lib/THCS) (#2442)
* Win64 support for lib/THCS

* Kill some warnings for MSVC
2017-09-17 00:02:44 -04:00
4ead38f96a Improve Windows Compatibility(for lib/THS) (#2449)
* Win64 support for lib/THS

* Fix VS warnings(for lib/THS)

* Revert changes that prevent successful build

* use the type descriptors for int64_t

* Fix warnings in THS for MSVC
2017-09-17 00:02:09 -04:00
5befdd45bd Win64 support for lib/THD (#2444) 2017-09-17 00:01:40 -04:00
268a1f1b96 Improve Windows Compatibility(for lib/THPP) (#2447) 2017-09-17 00:00:08 -04:00
caecbffe62 Improve Windows Compatibility(for lib/THCUNN) (#2443) 2017-09-16 23:58:22 -04:00
0e691f8998 Improve Windows Compatibility(for lib/THNN) (#2446) 2017-09-16 23:55:15 -04:00
1c51c185a1 Improve Windows Compatibility(for lib/THC) (#2440) 2017-09-16 23:50:15 -04:00
61813cfd97 Improve Windows Compatibility(for lib/TH) (#2439)
* Win64 support for lib/TH

* Edit codes to clear warnings(for TH)

* fix format string

* revert modulo changes

* change formats for snprintf
2017-09-16 23:40:58 -04:00
dd27997aeb DOC: adding note about distributed MPI backend (#2750) 2017-09-15 13:47:35 -04:00
3a3d27130d Fix symbolic for max pool in all dimensions (#2742) 2017-09-15 10:10:38 -04:00
7752fe5d4e remove zero padding in orthogonal initialization 2017-09-14 23:13:43 -04:00
3821fca0c6 DOC: i{send, recv} message order with MPI backend 2017-09-14 20:38:11 -04:00
b14c5bf016 Save output_nr in SavedVariable 2017-09-14 20:31:30 -04:00
08b4770adf minor spelling, intialize->initialize 2017-09-14 15:13:01 -04:00
06c44e2283 Replace Variable(new VariableImpl(...), false) with make_variable.
Also squash a warning about an implicit conversion that will never
occur (because the type being converted to is a superclass).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-14 14:33:08 -04:00
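An illustrative sketch of the factory-function change above; VariableSketch, VariableImplSketch, and the int payload are placeholders, not the real autograd types. Centralizing construction in one helper means call sites no longer spell out new VariableImpl(...) plus the extra boolean.

  #include <memory>

  struct VariableImplSketch {
    VariableImplSketch(int data, bool requires_grad)
        : data(data), requires_grad(requires_grad) {}
    int data;
    bool requires_grad;
  };

  struct VariableSketch {
    std::shared_ptr<VariableImplSketch> pImpl;
  };

  // The one place that knows how a Variable is wired up; call sites go from
  // Variable(new VariableImpl(data), false) to make_variable_sketch(data).
  inline VariableSketch make_variable_sketch(int data, bool requires_grad = false) {
    return VariableSketch{std::make_shared<VariableImplSketch>(data, requires_grad)};
  }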
bcad604ea6 Move imap to six.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-14 14:33:08 -04:00
5b5218ea95 Micro optimizations in ATen
* Compare typeid instead of using dynamic_cast
* Mark derived TensorImpl classes as final
* Use tensor->nDimension instead of THTensor_(nDimension)
2017-09-14 11:14:45 -07:00
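A small sketch of the typeid-versus-dynamic_cast point from the commit above, with made-up TensorImplSketch types: when a derived class is marked final, an exact-type comparison answers "is this that concrete type?" without the generality (and cost) of a dynamic_cast.

  #include <typeinfo>

  struct TensorImplSketch { virtual ~TensorImplSketch() = default; };
  struct CPUFloatImplSketch final : TensorImplSketch {};

  bool is_cpu_float(const TensorImplSketch& impl) {
    // Exact dynamic-type check; sufficient because CPUFloatImplSketch is final,
    // so nothing can derive from it and dynamic_cast adds no extra power here.
    return typeid(impl) == typeid(CPUFloatImplSketch);
  }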
253d48c815 add in-place random sampling ops 2017-09-14 10:03:17 -04:00
ce4932f8a4 add softmax2d docs 2017-09-14 09:41:04 -04:00
80d229b0e7 Refactor THPUtils_invalidArguments into separate file 2017-09-13 19:18:02 -04:00
e4c0af8b56 revert #2708 modify orthogonal init for rows<cols case 2017-09-13 18:23:43 -04:00
2b5835ba5c fix lint 2017-09-13 18:18:34 -04:00
0a9f93e43c add env var for python executable 2017-09-13 17:49:08 -04:00
7eafd6cd6f Merge commit '23e5a8be8ea42118c4d93632affb00a0802a7770' 2017-09-13 17:38:11 -04:00
23e5a8be8e add support for custom python 2017-09-13 14:06:56 -07:00
d01adcbe0e modify orthogonal init 2017-09-13 16:54:37 -04:00
462f95ed6d fix bug in autograd type() for non-default GPU input 2017-09-13 15:33:37 -04:00
2356ee41b7 Fix segfault in backward 2017-09-13 14:47:26 -04:00
d910a94b2b Support AdaptiveMaxPool1d/2d double backwards. 2017-09-13 12:28:43 -04:00
2cad108269 Make AdaptiveMaxPool1d/2d indices format the same as MaxPool1d/2d format. 2017-09-13 12:28:43 -04:00
4b5a6c07ac Make 's_' functions on Type public 2017-09-13 00:19:40 -07:00
cd9b27231b Add comment about scope-defining trick.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-12 21:08:23 -04:00
713756d115 Remove function test code, cleanup. 2017-09-12 21:07:48 -04:00
36b13f4776 Implement Concat Function tests as individual test methods since there
is no cat method on Tensors/Variables.
2017-09-12 21:07:48 -04:00
3da453f25a Unify function and method tests. 2017-09-12 21:07:48 -04:00
08eb88f3de Duplicate what is tested in function tests in the method tests.
Also make some function-vs-method tests uniform and change method
tests so they will pass gradchecks (i.e. avoid nans)
2017-09-12 21:07:48 -04:00
19cfda761c write THD link libraries to text file and read it in setup.py to link dependencies correctly (#2711) 2017-09-12 20:56:36 -04:00
1290e586fb Use at::Tensor based autograd Variable (#2676)
Variable is now a subclass of at::Tensor backed by a VariableImpl* pImpl. The implementation of the ATen functions is defined in the auto-generated VariableType.h/cpp file.

Currently, only functions which fall through to the base type, such as sizes() and isCuda() are implemented. Differentiable ops like add() and mul() will be added in a subsequent PR.
2017-09-12 11:36:01 -04:00
820143f4af Drop L specifier; reimplement tuple printing in C++
When you call repr() on a long in Python 2, it prints an 'L' suffix.
This is annoying for tests which assert on the exact output.  Use str()
instead.

But then there is a problem with Python 2's default tuple str() implementation,
where it calls repr() on its arguments rather than str().  This means that
if you have a tuple of longs, it will render as "(1L, 2L)" in Python 2.

To solve this problem, we just reimplement tuple printing in C++.
This is not a very robust fix (nested tuples, dictionaries, all these situations
will fail) but in practice it hits the cases that matter.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-12 11:03:03 -04:00
d1346c75ec Always use generator version of map for Variable iteration.
In Python 2, the non-generator map will always perform the indexing
even when it is not used in the end.  Using the generator can let
us avoid indexing when it is not used.

As an added bonus, it makes the ordering of operations deterministic
between Python 2 and Python 3 in LSTM.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-12 11:03:03 -04:00
39d495b267 Generate expect files in same directory as top-level test script.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-12 11:03:03 -04:00
4fec5f658b add Bilinear to docs, fix reference 2017-09-11 20:12:27 -04:00
1794e76800 add missing bilinear docs entry 2017-09-11 20:06:44 -04:00
103977cc8c fix warnings (#2693) 2017-09-11 18:34:27 -04:00
ebe6f8b631 Merge commit '0161ea2ca911acce1cfebab3e9238992dc5ce963' 2017-09-11 17:58:23 -04:00
3ebf4b6173 Merge commit 'bc66c9da86c5652dd271c9711659ccd689253786' 2017-09-11 17:55:45 -04:00
bc66c9da86 fix alignment warning 2017-09-11 17:54:47 -04:00
0161ea2ca9 Mark unsafeGetTH as const 2017-09-11 11:17:16 -07:00
ace1426d50 Move wrap_dim code to Utils function to minimize generated code. 2017-09-11 11:16:52 -07:00
183c2071f9 Generate wrap_dim code on derived type rather than base type.
Either should work, but code feels more natural this way.
2017-09-11 11:16:52 -07:00
39b5031517 Support wrap_dim specifications from cwrap. 2017-09-11 11:16:52 -07:00
4a71ca6c60 Use cast instead of literal as a temporary fix 2017-09-11 10:44:36 -07:00
1cf58bddd6 Fix default constructor argument 2017-09-11 10:44:36 -07:00
d7f79e7d98 Merge commit '92abd54dfdf03c7ad6f9426c91ad55dc49d95d02' 2017-09-11 13:32:05 -04:00
92abd54dfd simplify the code 2017-09-11 13:31:04 -04:00
cf2c7ca998 add THPP linkage when building THD (#2687) 2017-09-11 08:53:38 -04:00
4998a14144 Merge commit 'e8dec6e395faf6c4726df145e85ff7f77618668a' 2017-09-10 13:52:36 -04:00
a77aa12759 Merge commit '0df2f1cbd62ab2a7d507bc68d8d43509ca268a0e' 2017-09-10 13:51:53 -04:00
1da87118cc Optimize pow for different exponents and add tests 2017-09-10 13:51:05 -04:00
e8dec6e395 Optimize pow for different exponents and add tests 2017-09-10 13:50:57 -04:00
0df2f1cbd6 Optimize pow for different exponents and add tests 2017-09-10 13:50:50 -04:00
141f8921ac MultiLabelMarginLoss doc fix (#2683) 2017-09-10 13:48:33 -04:00
b31cf0ebd4 Added support for nInputDim parameter in legacy Padding class (#2645)
* Added support for nInputDim parameter in Padding class

* moved nInputDim to the end so as to not break backwards compatibility

* hasattr to check if nInputDim is actually set

* check if nInputDim is positive before checking against input dim
2017-09-10 13:47:34 -04:00
977b1f988c Fix EmbeddingBag doc (#2679) 2017-09-09 00:05:12 -04:00
d81d71f24c fix docs for variable.backward (#2678) 2017-09-08 20:23:34 -04:00
3f899a15ce force NO_CUDA to be specified to disable cuda. add pytorch's FindCUDA so that it is possible to get ccache to work for nvcc. make excluded notification more concise. 2017-09-08 10:39:08 -07:00
d43185612a Specify CWRAP_FILES_BASE for ATen 2017-09-08 09:43:49 -07:00
046e9ae5c8 Use arg['default'] as constant value 2017-09-08 09:41:26 -07:00
e6fdbd5807 Merge commit '591e3efb6b51ed38e81b3f24bd4a529e21d60f0a' 2017-09-08 09:18:55 -07:00
591e3efb6b Merge pull request #54 from colesbury/default_args
Handle default arguments in base Type class
2017-09-08 11:15:29 -04:00
b2f0ee5d46 Handle scalars that are not backed by tensors 2017-09-07 12:40:31 -07:00
f75cf375da Add accessor to underlying Tensor 2017-09-07 12:40:31 -07:00
32635f1292 zero_dim_to_one and empty_to_null can't both be specified 2017-09-07 12:29:44 -07:00
70f7cfedea Rename 'canonical' to 'has_full_argument_list' 2017-09-07 12:11:42 -07:00
81066a5e30 Include non-canonical functions in Declarations.yaml 2017-09-07 11:34:17 -07:00
7bbfa1dd76 Make Scalar default constructible 2017-09-07 11:06:31 -07:00
e341bc3bea Merge pull request #55 from colesbury/cwrap_files_base
Use CWRAP_FILES_BASE if defined
2017-09-07 13:34:58 -04:00
459cc5a346 Check for nanopb and pybind11 submodules as well. (#2660)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-07 13:24:31 -04:00
b2534a4f60 Merge commit '8176a558277aeec831f2f1a846cb4856a58fb941' 2017-09-07 13:23:22 -04:00
8176a55827 Adjust error message for View
When the size given is incorrect for the number of elements, the current error message is:
`size '[1 x 1 x 5]' is invalid for input of with 1 elements at /pytorch/torch/lib/TH/THStorage.c:41`

This replaces it by
`size '[1 x 1 x 5]' is invalid for input with 1 elements at /pytorch/torch/lib/TH/THStorage.c:41`
which is grammatically better
2017-09-07 13:22:20 -04:00
8e4a889c8f Add onnx to the documentation index. 2017-09-07 09:43:37 -07:00
e8e1c61409 Merge pull request #51 from colesbury/const
Add missing const qualifiers
2017-09-07 12:10:00 -04:00
84095f9512 add linux guard 2017-09-07 11:57:49 -04:00
ab3e95315d Merge commit '3024ff5705faccc2908660582c895371fd133603' 2017-09-07 11:56:38 -04:00
608327b156 Merge commit '5ef96aadd9287ef1f0c10d0469097fd9439efcd7' 2017-09-07 11:55:26 -04:00
eea54cc065 Merge commit 'b6648fe311889cef29f34734d92caee7f5d54db2' 2017-09-07 11:55:00 -04:00
894c05fd22 fix static linkage and make THD statically linked 2017-09-07 11:54:18 -04:00
a3ae136c25 Temporarily suppress buggy test case with relaxed test. (#2663)
Proper broadcasting in ATen uncovered a bug in our fusion
compiler where it outputs the wrong shaped tensor.  We're
tracking the issue in https://github.com/ezyang/pytorch/issues/206
but for now, rewrite the code so it does an "old style" comparison,
which works fine.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-07 11:50:17 -04:00
9cdef6c33b Update for latest ToffeeIR changes. (#2662)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-07 11:47:54 -04:00
4a952e7112 Python 3 fix: OrderedDict values is not a list. (#2661)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-07 11:47:39 -04:00
7838840084 Detailed install instructions for ONNX. (#2654)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-07 08:48:37 -04:00
3024ff5705 fix static linkage and make THD statically linked 2017-09-06 23:42:03 -07:00
5ef96aadd9 fix static linkage and make THD statically linked 2017-09-06 23:41:45 -07:00
b6648fe311 fix static linkage and make THD statically linked 2017-09-06 23:41:16 -07:00
8190096fec Handle default arguments in base Type class 2017-09-06 20:22:57 -07:00
4e7f171ed5 Use CWRAP_FILES_BASE if defined 2017-09-06 20:18:18 -07:00
fbb8f13499 Docs now finally run with ToffeeIR master.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-06 21:35:50 -04:00
a2e5224847 Fix autograd tests 2017-09-06 21:35:50 -04:00
5e144a8938 Volatile input keys should also consider non-Variable arguments
Additionally, check Variable argument sizes
2017-09-06 21:35:50 -04:00
a897f5a6ee Expose requires_grad for cpp functions 2017-09-06 21:35:50 -04:00
d90cd88fb7 Improve next_functions handling in tracer and JIT closure
Added extra logic that records edges of previous stages and allows
JIT closures to copy next_functions for next stages.
2017-09-06 21:35:50 -04:00
3b1dfcb51c Add trace flag checking in backward passes too 2017-09-06 21:35:50 -04:00
ea888c1905 Check input flags in Traceable 2017-09-06 21:35:50 -04:00
230721e198 Support calling traced functions multiple times in forward
* Variables now hold a list of ValueTracingStates and can participate
in multiple traces.

* Refactored Traceable to maintain a list of traces, and only stop
tracing once it records all stages
2017-09-06 21:35:50 -04:00
fdbef1cfb0 Traces can now expire 2017-09-06 21:35:50 -04:00
eb11cab272 Misc doc improvements.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-06 21:35:50 -04:00
7ea9de051e Code review comments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-06 21:35:50 -04:00
a4a44a7cf3 Add missing const qualifiers 2017-09-06 14:53:02 -07:00
9da95d9b07 bump to renamed onnx repo 2017-09-06 13:45:39 -04:00
3c61b59fd4 codemod primspec -> symbol, PrimSpec -> Symbolic 2017-09-06 13:45:39 -04:00
af649c19a2 ONNXIR -> to ONNX 2017-09-06 13:45:39 -04:00
bafe55bce4 use toffee import until ToffeeIR repo is renamed 2017-09-06 13:45:39 -04:00
6d8d5bab4c Codemod Toffee -> ONNX, toffee -> onnx. Change file names to match 2017-09-06 13:45:39 -04:00
c42ca96714 Stop returning tensors from torch.onnx.export()
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-06 13:45:39 -04:00
fc5137cf73 Merge pull request #49 from gchanan/broadcast
Support broadcast specifications from cwrap.
2017-09-06 13:42:22 -04:00
e4718430e8 Fix typo. 2017-09-06 09:12:45 -07:00
e3d6c2a942 Add proper error message for specifying dimension on a tensor with no dimensions. 2017-09-06 12:09:16 -04:00
22ea8d44e2 Remove unnecessary early conversion to IntList and make expand functions inline. 2017-09-06 08:33:38 -07:00
4fc54af010 Code review comments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
f1e4de9a63 Add primspec for Sub, Index, Chunk, and Embedding 2017-09-05 17:48:55 -04:00
29b4ebbf47 test_toffee updates.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
c2f19f5d72 ToffeeIR update.
- kernels -> kernel_shape
- Use the new hybrid dict/tuple result object from Toffee
- Write g and t as singulars, not plural
- nanopb generated files update
- Bugfix for msg() micropb helper
- Start recording producer_version/producer_tag
- Use ir_version from proto description
- Value -> value (Constant)
- Remove special-casing for transposed convolution; we now rely
  on the Caffe2 Toffee backend to do something reasonable
- Batchnorm order is no more

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
a63d88c95b print a more detailed error message when trying to export an unsupported operator 2017-09-05 17:48:55 -04:00
331521cdfd Step 1: Trace and proto collected for SRResNet model (#183) 2017-09-05 17:48:55 -04:00
1b792d3e57 Doc updates.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
cb5fbe1944 Expunge %2.0 syntax.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
394ff072eb Update to latest ToffeeIR operator schema.
- Conv no longer supports bias, so we create an explicit broadcasted
  addition afterwards.  There is one minor problem, however, which is that
  ConvTranspose in Caffe2 has mandatory bias.  So there's a hack.
  See Note [Caffe2ConvTranspose] for the details.
- Squeeze: dims -> axes
- Transpose: axes -> perm
- Reshape lost its extra output (yay!)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
9a05b8dd62 Update to latest ToffeeIR protobuf.
This was a doozy!

- 'namespace' is a C++ reserved keyword, so if you have a field named
  this, nanopb will blithely export some malformed C++.  I submitted
  a PR for this: https://github.com/ProjectToffee/ToffeeIR/pull/88

- Zach added support for singular tensor and graph.  While attempting
  to add support for these, I realized that it was actually impossible
  to support them under the default protobuf translation.  The gory
  details are in Note [Callback for nested messages].  The singular
  callbacks needed a new helper which I dubbed msg; it's just
  the singular version of list.

- While I was working on the API, I braino'd with the tensor()
  method.  It turns out this is totally not the right way to think
  about it; it's more string_from_tensor().  So I renamed it.
  I also renamed add_tensor to set_raw_data; add_tensor is a misnomer
  since it implies you can add multiple tensors, which is not true.

- version turned into producer_version.  Actually, this is a bit
  questionable and might change soon.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
99d6b9b923 make API debuggable 2017-09-05 17:48:55 -04:00
52e693022a helper methods appendNewNode and NewNode for python Graph API
uses suffixes to disambiguate attribute types
2017-09-05 17:48:55 -04:00
5c82aefa24 Fix bug in Transpose export.
This is a case of two wrongs making a right.  There was a pair of
related bugs:

- We incorrectly translated Transpose as if it were a Permute;
  but Torch transpose actually is a *swap* between dimensions.

- Why didn't we ever notice it?  In all of our tests, a transpose
  was *solely* done to get a weight matrix into the correct form.
  But Caffe2's FC operator *implicitly* does a transpose on
  the weight matrix.

This commit fixes both of these problems.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
b5833551f3 Documentation, and inplace support.
This adds the PyTorch API user documentation for Toffee.
To make the example work, I also converted all "inplace"
ops to export out-of-place in Toffee.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
e1b3321f92 remove singluar kernel, stride, pad. they are being removed from ToffeeIR 2017-09-05 17:48:55 -04:00
434317b155 PR comments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
57eb8bd288 Frontend refactor, and some documentation.
- BC BREAKING: export now also takes a mandatory file-ish argument, specifying
  the file to export the protobuf to.  I rewrote the tests to use BytesIO to
  get out the string so they could parse it again.

- BC BREAKING: export no longer returns the tensors that were computed.  To
  get these, use the internal _export function.

- Multiple inputs to models are now supported by passing a tuple to input.
  (Old API of a single Variable still works.)

- Keyword arguments to models are now supported via kwargs keyword arg.

- Renamed embed_params to export_params, and it now defaults to True.

- Toffee tests now live in their own test_toffee.py file.  I had to
  rename a pile of expect files for this.

- Removed defunct torch.toffee imports from autograd to solve module import
  cycle.

- Helper function _with_file_like to abstract over opening file-ish arguments,
  taken from torch.save()

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
6ae77b32b9 Delete dead torch.toffee.op
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
61a922e183 data_other_types now has correct type.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
1f77d482d5 Don't insert Transpose if it is no-op.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
e29655f46d Run JIT tests earlier
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
215b980f06 More torch.jit docs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
4174112b49 Add lint pass for handle invariant.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
cd8d41c0f9 regen toffee.proto for nanopb, enum of types has dropped double 2017-09-05 17:48:55 -04:00
3ef2ec6153 Actually correctly handle non-float exports.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
72843d5186 ATen hotfix: elementSizeInBytes for types
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
805c35a519 Model updates.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
daa3f7324c Track ToffeeIR inplace changes.
Rather than reuse input as output names in ToffeeIR, mark places where
inputs are consumed. In C2 conversion these annotations will be used
to create the corresponding graph.

Toffee submodule update.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
c537aebf5a Always run DCE in Traceable 2017-09-05 17:48:55 -04:00
9f0c4c9f9a Make autograd engine reentrant without creating new threads 2017-09-05 17:48:55 -04:00
e05979c4ea adding dummy bias for the conv transpose 2017-09-05 17:48:55 -04:00
ff77906e44 Refactor the user facing e2e test API - hide trace 2017-09-05 17:48:55 -04:00
4f6a7f4e2e support more types in export 2017-09-05 17:48:55 -04:00
31eda1230c support exporting constants 2017-09-05 17:48:55 -04:00
161e21f68d Missing batchnorm fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
292ec9d75b Remove NDEBUG macro.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
b2e7438ead Move disallow_copy into utils.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
d59714e3b1 Code review comment changes.
- Reduce setup.py diff.
- Expunge WITH_TOFFEE from codebase.
- Elaborate on a comment.
- Move gen_toffee.sh to tools
- Delete densenet test.
- Use 'using' to inherit a constructor.
- Delete outdated comment.
- Comment about why primspecs can return fewer outputs.
- Remove dead, commented out includes.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
7ac6d67a4e Add nanopb to list of dep_libs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
77ede8fc1c .travis.yml cleanup
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
1e0171f436 Super resolution network (#148) 2017-09-05 17:48:55 -04:00
2e266837f5 Port TracingState to pybind11, new export() method.
Along the way I added converters for Variable and TracingInput.  Variable should
probably be moved to a more widely known spot.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
8f1168d355 Test updates for new version
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
2bc3881fe2 Put version in protobuf we produce.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
50b5f4d219 Minor comment.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
3b478c17a0 JIT backward closure comments / Render stage changes in inputs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
f83c4fad7b Fix exception propagation from recursive Engine calls 2017-09-05 17:48:55 -04:00
d8e2ab632e Add support for Constant nodes in AutogradClosureFactory 2017-09-05 17:48:55 -04:00
594f98ce16 Support multi-stage AutogradClosures 2017-09-05 17:48:55 -04:00
43be0a679c fmap now doesn't require template arguments 2017-09-05 17:48:55 -04:00
b33f64b2e7 Fix nanopb build 2017-09-05 17:48:55 -04:00
25287a129b Test updates for plural attributes #145
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
3b5a5a6f9c Use plural attributes in MaxPool1d, MaxPool2d and AvgPool2d 2017-09-05 17:48:55 -04:00
225e8c8acf switch to using raw_data in PB 2017-09-05 17:48:55 -04:00
6264996169 ToffeeIR CI hotfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
55ac596ea9 Faster tensor serialization.
Instead of dynamically allocating a float for each element of the tensor
(lol!) save the tensor itself, and directly read out the data.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
2d4da3657f Maintain invariant in env that all nodes are mapped.
"Unused" nodes are mapped to nullptr, and we distinguish
on lookup nodes which were never mapped versus nodes that
were mapped but supposed to be unused.  This case
should never happen, but a little extra safety never hurt.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
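A hedged sketch of the env invariant described in the commit above, with stand-in NodeSketch/EnvSketch types: every source node gets an entry, nullptr marks "mapped but intentionally unused", and lookup can therefore tell a forgotten mapping apart from an unused one.

  #include <stdexcept>
  #include <unordered_map>

  struct NodeSketch {};

  struct EnvSketch {
    std::unordered_map<NodeSketch*, NodeSketch*> map;

    void mark_unused(NodeSketch* n) { map[n] = nullptr; }

    NodeSketch* lookup(NodeSketch* n) const {
      auto it = map.find(n);
      if (it == map.end())
        throw std::runtime_error("node was never mapped (exporter bug)");
      if (it->second == nullptr)
        throw std::runtime_error(
            "node was mapped but marked unused, yet something looked it up");
      return it->second;
    }
  };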
e2a84e1e65 PR comments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
8c5eba3f3c Add an Undefined node for null arguments to tensors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
b2e305e390 Lint after ToffeeIR, and subsequent fallout.
I realized we weren't running the linter after ToToffeeIR, so
I added a lint call.  It thus emerged that the current implementation
was using "Unused" nodes that were not added to the graph,
which was tripping the lint.  I fixed this a few ways:

- BatchNorm and Conv primspecs were returning dead "unused" nodes
  for their (implicit) handle parameters.  I removed them because
  setOutputs handles this already, and a dead unused node which
  is not attached to the graph violates the "no dead nodes"
  invariant.

- OK, but MaxPool actually needs to return an unused node for
  the output which is supported by PyTorch but not Toffee; we need
  to error if this output is subsequently used in the trace.
  The new strategy is to have MaxPool's primspec return a None
  at the unused position, and then immediately *check* if there
  are any uses of that output.  If there are, that's an error!

- I needed to adjust the Select invariant in the exporter loop:
  only if a Select node has *uses* is it mandatory for it to be
  defined in env.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
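A minimal sketch of the "check the dropped output has no uses" rule from the commit above; ValueSketch and has_uses() are hypothetical stand-ins for the real JIT node/use structures.

  #include <stdexcept>
  #include <vector>

  struct UseSketch {};

  struct ValueSketch {
    std::vector<UseSketch> uses;
    bool has_uses() const { return !uses.empty(); }
  };

  // Called after an exporter returns fewer outputs than the JIT node has:
  // the dropped output must be genuinely dead, otherwise exporting is an error.
  void check_dropped_output(const ValueSketch& output) {
    if (output.has_uses())
      throw std::runtime_error(
          "output is not representable in the export format but is still used");
  }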
685d7b83ba Batchnorm's bias is mandatory.
Unlike convolution, bias in SpatialBn is mandatory; see
https://github.com/caffe2/caffe2/blob/master/caffe2/operators/spatial_batch_norm_op.cc

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
84f8c88c24 Batchnorm fixup
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
82efbe349b Handle batchnorm properly.
Basic idea:
- Pass buffers (marked as non-Variable tensors) as input variables to
  the trace.   Every buffer gets represented as an input variable
  to the trace, and we remember a correspondence of the underlying
  TH pointer and an input variable in the trace.
- When we initially trace a function, we DO NOT record the buffers
  as edges.  This is so autograd doesn't have to know anything about buffers.
  If we ever turn buffers into requires_grad=False parameters, then
  this problem goes away.
- When we primspec the buffer, NOW we reach into the cached buffers
  (now appropriately named) and gin up the buffer information we need.

Other things:
- CppOp execution is now supported (but lightly tested) using
  SimpleEval (thanks @apaszke!)

Todo:
- E2E tests need to have their hacks removed.
- Figure out what is going on with backwards

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
218058b94a Make CppOp autograd execution work (temporary)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
63c835bbe7 Add keep_vars parameter to state_dict.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
96ae6a5e48 Don't DCE Params.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
2db7c5621f merging dcgan changes that needed to be refactored from older primspec approach 2017-09-05 17:48:55 -04:00
dc6378d891 merge fixes for Squeeze and ConvTranspose 2017-09-05 17:48:55 -04:00
a1bb403326 Ignore nanopb for lint.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
605ef38831 Explicitly override CMAKE_DEBUG_POSTFIX for nanopb build.
If it's not set, CMAKE_DEBUG_POSTFIX defaults to 'd', which means the
static library gets named something different when built in debug mode.
This is annoying because it means if you build in debug mode, the
library is in a different place.  Rather than teach the build system
to find the correct name, just set this POSTFIX so names don't change.

Also, update setup.py to look for the non-debug archive.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
0bc498ee94 Apparently, lib64 isn't created all the time.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
de6ef65be5 Port to nanopb.
General strategy:
- nanopb is statically linked into PyTorch.  It must be built
  with -fPIC.
- Generated nanopb files for toffee.proto are checked into
  our repo.
- Because nanopb generated protobufs are C only, we wrote a
  wrapper around it to give a Google C++ style interface.
  More on this shortly.

How does the wrapper work?
- It's called "micropb" because it is less small than nanopb :)
- nanopb requires all variable-length fields to be written out
  using a "callbacks" mechanism.
- We wrote pre-canned callbacks for all of the types ToffeeIR
  writes out and lists; these are micropb_callback and
  micropb_callback_list.  These operate simply by dynamically
  allocating and storing the data to be written out in
  data (this defeats the purpose of the callback mechanism,
  but it's easy to implement)
- Finally some boilerplate to actually implement the wrapper
  classes and have owning pointers to the actual data.

Testing strategy:
- Take the serialized protobuf from nanopb, parse it again
  with ToffeeIR and print it.  Worked with all of test_jit.py!
  These tests don't run without 'toffee' being installed.

TODO:
- Update CI to install ToffeeIR, so we can run the Toffee tests
  in CI
- Update E2E with Caffe2 tests so that they work with new stuff.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
ac8d3372b0 Add nanopb submodule.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
35bddb6b7e pr feedback 2017-09-05 17:48:55 -04:00
c9f7f2eff4 Change pipeline for exporting to toffeeIR
previously:
  PythonOp/CppOp Graph -> ToffeeIR, primspecs worked with protobufs
now:
  PythonOp/CppOp --ToToffeeIR--> jit::Graph of in-memory ToffeeIR -> protobufs of ToffeeIR

This commit lets primspec functions work directly with JIT IR nodes,
which makes it possible to do a lot more stuff in those functions.
2017-09-05 17:48:55 -04:00
3afb4d8728 giant expect commit 2017-09-05 17:48:55 -04:00
bad5717e15 add ability to specify initial values for inputs 2017-09-05 17:48:55 -04:00
81342910d7 fix the op Tanh spelling: tests
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
d12cf7dd45 fix the op Tanh spelling 2017-09-05 17:48:55 -04:00
8c2663a685 Put every input on a new line: TestJit test updates
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
d5d65080e3 Put every input on a new line.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
384efe482a Use Toffee IR schema to disambiguate types of attributes.
Let say I write alpha=2 in my PyTorch code.  Is alpha a float
or an int?  This problem is resolved when we actually pass
it to the underlying kernel, which knows what type it expects
it as.

When serializing to Toffee IR, the Toffee NodeProto also needs
to dictate the correct type; otherwise, we may guess wrong.
We get this information from the OpSchema in the ToffeeIR library.
With this, we can avoid explicitly casting in dropout.py and
auto_primspec.py

WARNING: You will need to update torch/lib/ToffeeIR when you pull
this patch, as attribute schemas were added recently to ToffeeIR.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
f062e06c91 Make null Variables on convolution and batchnorm work.
This addresses the case where bias is disabled, which occurs in torchvision's
alexnet and densenet.

The general strategy is this:

- When we encounter a null variable, we turn this into a Constant
  node with an undefined at::Tensor

- Toffee exports for BatchNorm and Conv have special cases for bias,
  checking if they are provided by a Constant node with undefined
  value, and just omit the input if so.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
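A hedged sketch of the undefined-tensor convention described in the commit above; is_missing_bias is a hypothetical helper, but at::Tensor::defined() is the real check: a default-constructed at::Tensor is "undefined", and an export can treat a Constant node carrying such a tensor as "no bias supplied".

  #include <ATen/ATen.h>

  // Hypothetical helper: an undefined at::Tensor is the marker for an
  // absent bias input, so the exporter simply omits that input.
  bool is_missing_bias(const at::Tensor& bias) {
    return !bias.defined();
  }

  // Usage sketch:
  //   at::Tensor no_bias;                 // default-constructed -> undefined
  //   is_missing_bias(no_bias);           // true -> omit the bias input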
6039f007c4 Make assertExpected Python 2 friendly.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
6701cc7c8e flake8 excludes update 2017-09-05 17:48:55 -04:00
a60d9bd022 Bind Attributes in python ir, and add test for python ir binding 2017-09-05 17:48:55 -04:00
a3fdb281d1 Python wrapper for Node IR using pybind11
Supports almost all of the IR API.
2017-09-05 17:48:55 -04:00
6d0364f13d Add pybind11 as a submodule. 2017-09-05 17:48:55 -04:00
0a83f86348 Add Eval Handles: JIT test update
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
5823cc419a Ignore Handle when exporting to ToffeeIR.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
e14b766a81 Add a comment about Handle 2017-09-05 17:48:55 -04:00
965a349bbd Record context edges in the JIT 2017-09-05 17:48:55 -04:00
9f97291408 Make tracer thread-safe 2017-09-05 17:48:55 -04:00
8dab0237e2 Maintain Select-node invariant in DCE 2017-09-05 17:48:55 -04:00
ec9761789a Enforce deterministic ordering on Eval inputs/placeholders 2017-09-05 17:48:55 -04:00
fa308b3183 Improve backward tracing 2017-09-05 17:48:55 -04:00
91dcf2938a Miscellaneous fixes needed to make caffe2 E2E 2017-09-05 17:48:55 -04:00
6297144e51 Build hotfix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
1517ef687e use constants for directions 2017-09-05 17:48:55 -04:00
b0ba9a81d2 remove std::list, restore custom node list implementation. 2017-09-05 17:48:55 -04:00
222e8c0591 PR fixes 2017-09-05 17:48:55 -04:00
b606106c4d thread safe interned_strings 2017-09-05 17:48:55 -04:00
14f9316d2b renaming IR_IF family 2017-09-05 17:48:55 -04:00
55cd9f37d1 remove Select, and NodeWithKind 2017-09-05 17:48:55 -04:00
4a4739e048 remove most node subtypes 2017-09-05 17:48:55 -04:00
c369a44bf1 remove chunk subclass 2017-09-05 17:48:55 -04:00
9f8a35c0b9 remove Primitive nodes. 2017-09-05 17:48:55 -04:00
24cdb897d6 starting removing nodes by removing Return 2017-09-05 17:48:55 -04:00
b037efa92c prep for removing node subtypes 2017-09-05 17:48:55 -04:00
57b7370aab switch NodeKind over to Symbol type. 2017-09-05 17:48:55 -04:00
1fa5b19ba4 Attributes object that mirrors Toffee, and interned string table, used by attributes for keys. 2017-09-05 17:48:55 -04:00
3c5dced6ce Make batch-norm work end-to-end with caffe2
2017-09-05 17:48:55 -04:00
3c6fbcabea encode size in name... 2017-09-05 17:48:55 -04:00
d596bad1b9 remove attribute in expect 2017-09-05 17:48:55 -04:00
d7d74428a3 batchnorm hacking 2017-09-05 17:48:55 -04:00
150fd2848d batch norm primspec stub 2017-09-05 17:48:55 -04:00
af90a780d1 primspec for avgpool + squeeze (#80) 2017-09-05 17:48:55 -04:00
0ca3ca302e test for primspec for concat (#77) 2017-09-05 17:48:55 -04:00
52e0816bed primspec for concat 2017-09-05 17:48:55 -04:00
72a7530023 premspec for leaky_relu (#70) 2017-09-05 17:48:55 -04:00
0e5320e073 Lint
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
6405391065 Small comment.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
db79be82ab Move Toffee for C++ functions back to autograd.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
f265ff1dca Bugfix where it was always input 0
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
bee0e45355 Don't create empty attributes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
c0d0a99977 Alexnet back online.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
ee2ba279f2 Working Reshape op
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
35f1cb462d Invert negation.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
e1b345d81b More alexnet things as primspec.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
1f4bebe27a Build fixes when Toffee is enabled.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
6f6fe177f1 Make Toffee optional. Unbreaks CI.
The general strategy:

- We put all the toffee files in torch/csrc/toffee; they will only be
  added when toffee is enabled

- Toffee is enabled if torch/lib/ToffeeIR is present (since we
  don't have a submodule/subtree thing going on)

- The most prevalent place you will need to use WITH_TOFFEE is for
  primspec definitions on C++ autograd functions.  There is a
  macro HAS_PRIMSPEC to ameliorate optionally defining primspec()
  virtual overrides on Function classes.  HasPrimspec is always
  available but will be a zero-field class when Toffee is disabled.

NB: We might revert this commit in the future if we figure out a way
to unconditionally enable Toffee that everyone likes.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
05a6d4c137 Create a C++ primspec virtual method.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
4b1f182199 Disable C++ Python conversion code.
We want all the conversion code to live in one place. Away it goes!

This means that alexnet protobuf no longer works.  It will start working
again when we port changes.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
dd58b145c3 Toffee graph exporting for PyTorch.
This commit adds a new exporter pass which takes a graph and returns
a string of the human-readable protobuf representation of a model.

We have two strategies for how conversions are implemented:

- If a Python autograd function has a primspec static method, we invoke
  it to get the Toffee conversion.  Use torch.toffee.op to generate the
  format expected to be returned.  The particular data representation is opaque
  and subject to change in the future.

- Otherwise, there's a giant if statement in the exporter, which manually
  uses the JIT IR C++ API and Toffee IR C++ protobuf API to convert.

You must check out a copy of the ToffeeIR repo
https://github.com/ProjectToffee/ToffeeIR at torch/lib; at the moment
we don't have a subtree/submodule set up.

Technical debt in this commit:

- To get protobuf headers in scope, we unconditionally add $CONDA_PREFIX/include
  to the include path.  This needs to be replaced with a more robust mechanism.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
890c2071f0 PR comments 2017-09-05 17:48:55 -04:00
35f1ca1293 Make autograd engine reentrant (#37) 2017-09-05 17:48:55 -04:00
c8b303e853 guard dump, guard cuda 2017-09-05 17:48:55 -04:00
f4b7178b59 track scalar type 2017-09-05 17:48:55 -04:00
b6175eb54d enable fusion group execution in autograd closure. implement chunk. propagate type information through fusion optimization. 2017-09-05 17:48:55 -04:00
62efac4ba5 make Type into an immutable object and share Types rather than cloning them.
allow nodes to have undefined types, which reflects reality right now
where some TensorType nodes are just not filled in.
2017-09-05 17:48:55 -04:00
bcf5c11e10 cuda guards 2017-09-05 17:48:55 -04:00
e91966a0b4 Unify our tracing API into a single interface for functions/models.
The API works on either functions or models, taking an extra parameter argument
so that functions can pass in additional variables to trace.

Other behavior is folded into boolean options:

time - collect stats for our own perf debugging
verify - run the original code, and check it is within threshold
optimize - run optimization (currently off until fusiongroups pr is accepted).
enabled - flag to turn off tracing so you can check timing of stuff that cannot be traced.
2017-09-05 17:48:55 -04:00
510529ecd0 missing expect 2017-09-05 17:48:55 -04:00
9431742d5a Build error fix 2017-09-05 17:48:55 -04:00
7f60a18293 Add initial support for backward tracing 2017-09-05 17:48:55 -04:00
5b6bcf1ce4 Warning squishing.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
29ddcbfe17 Rename TypeKinds to suffix Type, matching class names.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
accd52feef Print types, and improvements to type APIs.
Fixes #48.

I had to shave some yaks:

- I needed switch on Type, so I wrote a new macro set TYPE_IF,
  and abstracted the IR_IF into a GENERIC_IF.  The parametrization
  is on const-ness and the type kind; also there is a minor annoyance
  where type kinds (ugh, hate the name; it means the wrong thing
  in Haskell land) don't match the class names, so there needs some
  suffix munging.  There's still some extra funny business, see
  https://github.com/ezyang/pytorch/issues/51

- A lot of functions on types weren't declared const when they could
  have been.  I added const qualifiers as necessary.

- setType now takes an honest to goodness Type* rather than TypeKind.

- init_pass now preserves types when it does transformations.

There are still some places we're losing types, most notably fusion.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
eb730f8321 Inplace test.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
4a1bbc01ac Fix #41.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
765b0bf137 Make in-place work again.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
32c5be4c31 Lint record_trace.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
b6a8eaa6ed Give ConvForward an explicit name.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
21c0ad9702 Test case that we fail legacy traces
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
453b0fac03 Always print diffs, no matter how large.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
624e451d6b Add comments 2017-09-05 17:48:55 -04:00
1c4538e017 Trace C functions 2017-09-05 17:48:55 -04:00
bdcbbeaf68 Remove GlobalTracingState 2017-09-05 17:48:55 -04:00
ba2b2bcdc1 Change calling convention for C++ autograd functions 2017-09-05 17:48:55 -04:00
82ed7c0232 POC: add Handles to represent opaque state passed between Nodes 2017-09-05 17:48:55 -04:00
09b35506f4 rename init_pass.cpp 2017-09-05 17:48:55 -04:00
a136c30309 add comments, rename function 2017-09-05 17:48:55 -04:00
9fd06b2051 add a rule to distribute chunk operators when it stops fusions. 2017-09-05 17:48:55 -04:00
a096959ab8 make multi-output uses/defs easier to read in pretty print. 2017-09-05 17:48:55 -04:00
0d3421ac01 Handle Constant lint.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
b158aaf6b4 Make linter an optimization pass.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
cf46ef05db Finish the rest of the lint pass.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
3016f459d2 Partial lint pass.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
76c7788e81 Remove THPP imports 2017-09-05 17:48:55 -04:00
dad625b54a Comment for WrapConstant/ConstantFactory, remove thpp import 2017-09-05 17:48:55 -04:00
f0902027ce Typofix 2017-09-05 17:48:55 -04:00
2dc3ef73ae Lint 2017-09-05 17:48:55 -04:00
c931feaad0 Elaborate on NB a little 2017-09-05 17:48:55 -04:00
3e0f1608fe Capture Variables that are not inputs as constants 2017-09-05 17:48:55 -04:00
af21c6b018 Add Node type to JIT IR
Rewrite Type as a class hierarchy

PR comments + rebase fixes
2017-09-05 17:48:55 -04:00
348950dc74 cleanup jit_test 2017-09-05 17:48:55 -04:00
1f900861b6 remove _NOCAST, use fully-qualified name in macros 2017-09-05 17:48:55 -04:00
233a66dcbe Remove SimpleMap from JIT IR 2017-09-05 17:48:55 -04:00
f5e414862a cuda guards for fusion compiler 2017-09-05 17:48:55 -04:00
ea4aaa6b0b Document TemplateEnv & PR fixes 2017-09-05 17:48:55 -04:00
50e51eaa7f Fusion of simple map operations using nvrtc.
The approach is based on THC's pointwiseApply{1,2,3} family of kernels,
but doesn't have any dependencies on that code.

Adjacent contiguous dimensions of input tensors are compressed to reduce the complexity of indexing math.
For the completely contiguous case, the indexing logic simplifies to just the linear index.

In simple tests, this code matched or beat the equivalent from THC.
2017-09-05 17:48:55 -04:00
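An illustrative sketch (hypothetical ShapeSketch type, not the real fusion compiler) of the dimension compression mentioned in the commit above: adjacent dimensions that are contiguous with respect to each other get folded into one, so a fully contiguous tensor collapses to a single dimension and indexing reduces to the linear index.

  #include <cstdint>
  #include <vector>

  struct ShapeSketch {
    std::vector<int64_t> sizes;    // outermost dimension first
    std::vector<int64_t> strides;  // assumed to have the same length as sizes
  };

  ShapeSketch compress_contiguous_dims(const ShapeSketch& in) {
    ShapeSketch out;
    for (size_t i = 0; i < in.sizes.size(); ++i) {
      // Dim i can be folded into the previous group when they are contiguous
      // relative to each other: stride[prev] == stride[i] * size[i].
      if (!out.sizes.empty() &&
          out.strides.back() == in.strides[i] * in.sizes[i]) {
        out.sizes.back() *= in.sizes[i];
        out.strides.back() = in.strides[i];
      } else {
        out.sizes.push_back(in.sizes[i]);
        out.strides.push_back(in.strides[i]);
      }
    }
    return out;  // e.g. sizes {2,3,4}, strides {12,4,1} -> sizes {24}, strides {1}
  }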
51a1618683 Remove Return node from nodes() 2017-09-05 17:48:55 -04:00
a4086508c6 Enable tests 2017-09-05 17:48:55 -04:00
f270973937 Add JIT IR -> Autograd IR converter 2017-09-05 17:48:55 -04:00
e186d16e6b Apply JIT optimizations form Python 2017-09-05 17:48:55 -04:00
72659bcdef Minor code cleanup 2017-09-05 17:48:55 -04:00
57d65a99bb Add LSTM fusion test.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
8f12bc5a4c Temporarily print Return nodes, pending printer fix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
8f3a01932b Swap order of assertMultiLineEqual.
This makes the diff look more intuitive.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
a5b87de139 Squash warnings.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
9662cffd26 Use std::list in JIT IR 2017-09-05 17:48:55 -04:00
e238f3cada Very simple accept/golden test framework for JIT trees.
- To test whether or not a multiline string matches some expected
  value, you can use assertExpected.  This tests that the string
  matches the content stored at a file based on the name of the
  test (and an optional subname parameter you can pass if you
  want to assertExpected multiple times).

- Suppose you make a change that modifies the output in a big way.
  Instead of manually going through and updating each test, you instead
  run python test/test_jit.py --accept.  This updates all of the expected
  outputs.  You can now review them one-by-one and make sure your
  changes make sense.

We can add more features later (e.g., munging the output to make it
more stable, more sanity checking) but this is just to get us started
testing.  One thing to watch out for is that accept tests on intermediate
representation can be a bit wobbly: it is *extremely* important that
people be able to read the IR.  It may be worth introducing niceties
to the printer in order to ensure this is the case.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
cb53882c5e Make warnings clean.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
ac5dd887dc python clone, more asserts, better names. 2017-09-05 17:48:55 -04:00
da6122bd35 Document all public graph manipulation functions 2017-09-05 17:48:55 -04:00
e477e56519 Add some preconditions to the comments I added. 2017-09-05 17:48:55 -04:00
3182d732ee Some documentation for mutator methods. 2017-09-05 17:48:55 -04:00
a89c49d723 Minor fixes to comments 2017-09-05 17:48:55 -04:00
d959bf43c3 add comments explaining IR and fuser 2017-09-05 17:48:55 -04:00
fde064088f Add logic for fusion. Add clone mechanism to IR, with init() methods to setup nodes. 2017-09-05 17:48:55 -04:00
538cc89dbc print uses in output 2017-09-05 17:48:55 -04:00
48945a435d IR modifications to make mutation possible. Nodes are in an intrusive doubly-linked list. Methods added to manipulate inputs etc. 2017-09-05 17:48:55 -04:00
a2c140f985 Refactor owning Graph pointer initialization.
Now it gets initialized during the constructor.  This results
in more boilerplate but is conceptually more correct, and solves
an assert failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
49bb223786 Break when an assert fails.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
8215860d2f Add an assert wrapper for easy porting.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
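A hypothetical sketch of an assert wrapper of the kind the commit above adds (macro name and exception type are illustrative only): failures carry the condition text plus file and line, and call sites can be ported from bare assert() one for one.

  #include <sstream>
  #include <stdexcept>

  #define JIT_ASSERT_SKETCH(cond)                                           \
    do {                                                                    \
      if (!(cond)) {                                                        \
        std::ostringstream ss;                                              \
        ss << "assertion failed: " << #cond << " at "                       \
           << __FILE__ << ":" << __LINE__;                                  \
        throw std::runtime_error(ss.str());                                 \
      }                                                                     \
    } while (0)

  // Usage sketch:
  //   JIT_ASSERT_SKETCH(node != nullptr);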
3dcbba1f35 Keep Variable mapping as part of TracingState 2017-09-05 17:48:55 -04:00
55c9e0258e Make the linter happy 2017-09-05 17:48:55 -04:00
6be47ec907 Minor fixes and improvements 2017-09-05 17:48:55 -04:00
2ced918063 Add a very simple visual (non-automated) test.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
ea05ac8f41 Move JIT-related files to jit dir. Remove IR interpreter 2017-09-05 17:48:55 -04:00
1325fa511c JIT IR including use-def chains and updated comments. 2017-09-05 17:48:55 -04:00
7c083b00f8 refcounting for Node/Value 2017-09-05 17:48:55 -04:00
f369f8e80d simplify IR 2017-09-05 17:48:55 -04:00
4979359800 Add graphs, trace them.
It is not an /expression/ we trace, but it is a /graph/: that is,
a closed expression which knows its parameters.  Knowing the list
of parameters is helpful and helps remove a hack when interpreting.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
a2cc7a00e6 Fix Python 3 build problem
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
c1dec0663f New stratification: add Operator/Instruction
This prevents nested lets, which are not allowed in ANF.  We
basically have SSA now.

There's some niftiness with the visitor returning a lambda which
then gets fed the actual argument. I like it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
60751cd889 Add verify_model to torch.jit, for sanity checking.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
7bd4c5a27c Minor sanity check.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
3055b69f63 Refactor Arg class away.
Although ANF style developments traditionally stratify syntactic
classes into atomic (Arg) and complex (Expr) expressions, where
atomic expressions could be variables, constants or lambdas, Zach has
successfully convinced me that we should do away with the variant here and
always require arguments to be variables.  There are a few reasons for
this:

1) Tensor constants, not currently supported, could be modeled using a
"Constant" instruction, removing the need for them to be representable
directly inline.  An inline constant is marginally more convenient
for peephole optimizations, but since we have gone full ANF, we are going
to need to be able to see across def-uses in any case, and it is not
too much worse to need to handle constants this way.  By the way,
Swift Intermediate Language also made a similar choice, see
the slide on "Literal Instructions" in
http://llvm.org/devmtg/2015-10/slides/GroffLattner-SILHighLevelIR.pdf

2) Scalar constants, which are quite important for passing non-tensor
arguments to Python operators, are now stored out-of-band as NON
first-class values.  This more closely matches the ToffeeIR design,
and makes it clear what parameters are "first class" (tensors only)
and which ones are not.  However, we need to be able to unswizzle
the separate scalar/tensor lists into a unified list in the correct
format; this is what PyFunctionCConv is for.

Also, Locals got renamed into Tuple.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
13663c1ee7 Fix clang build error, struct/class agreement.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
25bf7639f4 Remove incorrect clear from THPExpr/Arg_dealloc
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
8ab905b769 Remove unused output_list.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
c466b2c1f6 Make an error message better
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
c4ccae6a89 Document move semantics on PyObject with THPObjectPtr&& constructor.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
0fc17adf71 Add simple JIT frontend.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
f9458a3720 Add comments from discussion with Zach.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
d35ae86f26 Don't use misleading Ret nomenclature.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
a797ab9343 Rewrite AST to a new, more functional representation.
Previously, our AST was a DAG, where shared Nodes indicated a computation
should be reused.  This commit rewrites the IR into a new functional
representation which represents sharing explicitly using variable
bindings.

We offer a few justifications for this new style:

1. The new representation is not all that different from the
old one; it is about as easy to construct, and the lack of an
explicit graph doesn't negatively impact our ability to interpret
the graph, since we've chosen, as a matter of design, to NOT have
the IR participate in the actual execution of a graph.

2. The new let-binding representation has an implicit ordering,
which we can use to conveniently keep track of the original order
the trace showed up as.  This automatically gives us a topsort,
and gives us an easier to read textual representation of our
IR:

  %14 = Embedding %11, %0, -1, None, 2, False, False
  %15 = Dropout %14, 0.2, True, False
  %16 = Index %12, 0
  %17 = Index %12, 1
  %18 = Index %13, 0
  %19 = Index %13, 1
  %20 = Index %15, 0
  %21 = Linear %20, %1, %3
  %22 = Linear %16, %2, %4

3. It moves us closer to a Futhark style language
(http://futhark-lang.org/publications/pldi17.pdf).

Major aspects of the diff

- Node is replaced with Expr and Arg, a pair of mutually recursive
  structures which represent our new language.  In BNF, the language
  looks like this:

    a ::= c | %i
    e ::= %i, ... = e
        | PyOp e, ...
        | Ret %i, ...

  Technically, Ret is not actually a return (no control flow is involved),
  it just tuples up a series of tensors (identified by variables).

  One important invariant is that locals are always tensors; they
  are never constants (this is asymmetric with Args.)

- Arguments support Python constants.  This is an important piece because
  many operators take extra Python literals like integers and tuples in
  order to specify extra parameters about how an operator operates.  Adding
  this was essential to getting word_language_model to work.

- As both Expr and Arg have multiple variants, there is new infrastructure
  for doing case on the variants using ExprVisitor and ArgVisitor.  The
  strategy here is adapted from WebAssembly's visitors, although we have
  generalized to permit arbitrary argument forwarding, which is necessary
  to support tail-recursive visitor calls.  TCO is important because our
  interpreter may recurse arbitrarily deep into a stack of nested lets.
  If users wish, they can also manually case on the type tag.

- Tracing is now turned on and off using _tracer_enter/_tracer_exit in
  torch._C.  _tracer_enter accepts a list of variables which are to be
  treated as arguments; _tracer_exit accepts the list of traced variables
  which should be returned when you reexecute the trace, and returns
  the trace expression which can be reexecuted.  GlobalTracingState
  is a global variable which tracks whether or not we are tracing.

- You use run_forward to execute a trace on some set of parameters.

- When under tracing, variables keep track, via trace_local, of the
  names of their corresponding variables in the IR.

Here is a simple runner which leaks memory but can be used to JIT models:

  import torch.autograd.function as F
  import torch._C
  from torch.autograd import Variable

  def jit(model):
      import types
      real_forward = model.forward
      def forward(self, *args):
          def flatten(x):
              return tuple(F._iter_variables(x))
          if not hasattr(self, "saved_trace"):
              # First call: trace the real forward pass and remember the trace.
              torch._C._tracer_enter(tuple(self.parameters()) + flatten(args))
              out = real_forward(*args)
              self.saved_trace = torch._C._tracer_exit(flatten(out))
              self.saved_outs = out
              return out
          else:
              # Later calls: re-execute the recorded trace instead of the Python code.
              flat_out = Variable._execution_engine.run_forward(self.saved_trace, tuple(self.parameters()) + flatten(args))
              return F._unflatten(flat_out, self.saved_outs)
      # Bind the tracing wrapper in place of the module's original forward.
      model.forward = types.MethodType(forward, model)
      return model

Major problems:

- Sanity checking is spotty at best, especially when users pass in variables.

- The interpreter leaks tensor memory from the store.  When we add back def-use
  we should be able to deallocate tensors as soon as we know they are no longer
  necessary.

- The interpreter needs to reach feature parity with the old execution engine.
  From there, we need to see if backwards can be subsumed as well.

- I still have no confidence that memory is being managed correctly everywhere.
  This requires a close look.

- Rather than return an *open* expression as a trace, we should return a
  *lambda* instead, which knows about how many formal parameters it
  requires.

- The IR is not introspectable from Python at the moment, but this is simply a
  matter of implementing all the binding code.

- The tracer is NOT reentrant (you can't trace while you're inside a trace.)
  Furthermore, no sanity checking is done if you try to incorrectly reuse
  things from one trace in another.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
6f9774d7db Minor bugfix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
11107190ca Handle legacy correctly.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
1e8bf12b3a Add an inefficient but working evaluator for forward traces.
Simple test:

  import torch
  from torch.autograd import Variable
  import torch._C as _C

  x = Variable(torch.Tensor([4]), requires_grad=True)
  y = Variable(torch.Tensor([7]), requires_grad=True)
  z = x * y
  z.sum().backward()

  print(x.grad)
  print(y.grad)

  x.data[0] = 2
  y.data[0] = 3

  (z,) = z._execution_engine.run_forward((x, y), (z,))
  z.sum().backward()

  print(x.grad)
  print(y.grad)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
50b375d9bf Add input nodes to the IR representation.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
e1b7872fc2 Make it possible to access IR from Python.
Also, add a new trace_fn field to attach forward IR to Variables.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
c5faaf69d8 Initial IR representation for forward trace.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
b1b20e4097 Remove dead field from UnpackedInput
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-05 17:48:55 -04:00
3d7459ff6c fix indices for data_parallel and add parameter gradient tests (#2632) 2017-09-05 17:29:27 -04:00
f2c9aea75f Merge commit 'a3ddf9e18003f13a1094c7c5d62905f4db102da3' 2017-09-05 14:46:23 -04:00
a3ddf9e180 fix pointer arithmetic for large input/output sizes 2017-09-05 11:44:16 -07:00
90a4d91469 Remove scalar expansion tests. 2017-09-05 10:59:40 -07:00
25dd9ba799 Address review comments. 2017-09-05 10:34:04 -07:00
42448cf07f Fix to make the sample code executable as-is in "Extending PyTorch" (#2621) 2017-09-05 10:19:49 -04:00
1b013c0b52 fixed issue #2613 in torch/legacy/nn (#2624) 2017-09-05 10:13:56 -04:00
bfbd1bbb50 Update torch.triu/torch.tril doc (#2619) 2017-09-05 00:05:44 -04:00
8430cf6e86 Merge commit 'db78f3cf468549b206c3c8bdc9fb42df86ded2a7' 2017-09-04 11:13:59 -04:00
db78f3cf46 fix bug for THTensor data access 2017-09-04 11:12:52 -04:00
40ca356d36 make logsoftmax documentation readable (#2606) 2017-09-04 00:23:26 -04:00
7fa7a101af Fix embedding doc formatting (#2605) 2017-09-03 11:27:11 -04:00
bf013f4c99 fix Python 2 gloo install (#2597) 2017-09-02 20:05:37 -04:00
f0f7b39650 fix example in docs for nn.init.calculate_gain (#2600) 2017-09-02 19:23:25 -04:00
1f1aca6e09 Support broadcast specifications from cwrap.
This respects all the broadcast cwrap specifications except for 'fallback';
i.e. pointwise functions operating on tensors where the number of elements
matches but the sizes are different and not broadcastable.  This behavior is
currently deprecated in PyTorch.  Note that this is a breaking change in ATen,
because ATen just passes through to TH/THC, where the fallback behavior is
actually implemented.

This also changes expand semantics wrt Scalars (as tensors).  Previously,
one could 'expand' a 1-dimensional tensor with size 1 to a 'scalar' (i.e.
empty size initializer list).
2017-09-01 12:11:04 -07:00
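For illustration, a minimal sketch of the broadcasting semantics this commit wires up through cwrap (standard NumPy-style broadcasting; the deprecated "fallback" path it skips is the matching-element-count case):

  import torch

  a = torch.randn(3, 1)
  b = torch.randn(1, 4)
  c = a + b                    # broadcastable: result has shape (3, 4)

  d = torch.randn(2, 3)
  e = torch.randn(3, 2)
  # d + e has matching element counts but non-broadcastable sizes; that
  # "fallback" behavior is deprecated and deliberately not handled here.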
579fc7e959 unify bernoulli yaml declarations across backends (#2578) 2017-09-01 14:28:42 -04:00
8820d467d6 handle useless ellipsis in advanced indexing (#2589) 2017-09-01 14:27:47 -04:00
c5a8a59116 raise KeyError if registering buffer/param when attr exists (#2108) 2017-09-01 14:08:49 -04:00
9f685e4aa3 Ensure GIL is held in ObjectPtrAllocators (#2581) 2017-09-01 00:30:09 -04:00
26cdfcd9cf allow single non-tuple sequence to trigger advanced indexing (#2323) 2017-09-01 00:28:45 -04:00
d84dbcfb9e add a "clone the source" section 2017-08-31 11:55:23 -04:00
71a87f0645 elementSizeInBytes for types 2017-08-30 15:55:56 -07:00
a03e5cb409 Remind users to submodule update.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-08-30 16:14:38 -04:00
466f0a823a Use external nccl, fixes #2553
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-08-30 16:14:38 -04:00
8f6fa78271 disable cudnn when output_padding >= stride or dilation 2017-08-30 15:46:35 -04:00
4a211314d0 fix shape and correctness bugs in autograd/convolution BackwardBackward 2017-08-30 15:46:35 -04:00
58b7d1c764 remove python convnd function 2017-08-30 15:46:35 -04:00
7ca196c11d enable cudnn transposed dilated 2017-08-30 15:46:35 -04:00
0cf2c37505 refactor nn calls in autograd convolution 2017-08-30 15:46:35 -04:00
e950c44c80 enable dilated transpose and gradgrad nn tests 2017-08-30 15:46:35 -04:00
d13d95c09c dilated/transposed conv in autograd 2017-08-30 15:46:35 -04:00
6e03c5dc1f Ignore gloo when linting.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-08-30 11:54:04 -04:00
7310ebb66f Add gloo submodule.
We make gloo a submodule because it contains submodules itself, and
Git cannot handle subtrees with nested submodules.

Fixes https://github.com/pytorch/pytorch/issues/2426
2017-08-30 11:54:04 -04:00
aaef3a3ed3 Remove gloo subtree, in preparation for gloo submodule addition. 2017-08-30 11:54:04 -04:00
e69063405e Allow param groups to be added to Optimizer dynamically (#2374) 2017-08-30 11:20:58 -04:00
c1b09cd5ab Fix typo in docstring example (#2562) 2017-08-29 11:48:44 -04:00
bc228b2409 auto_gpu:True for ones_like and zeros_like (#2559) 2017-08-29 09:51:36 -04:00
cdae579c22 Fix typos in "Extending PyTorch" (#2558) 2017-08-29 09:39:29 -04:00
440f1abbdf Merge commit '3b7b923de86675a7f4c37b04db8788ccc4b5b682' 2017-08-28 21:42:43 -04:00
3b7b923de8 Fix grid size for batch cat tensor now that getApplyGrid has been changed. 2017-08-28 21:41:46 -04:00
a71330d13f More efficient broadcasting backward by bailing in all cases if sizes match. (#2556) 2017-08-28 21:38:33 -04:00
5585c265a2 Merge commit 'bd27f0b5a7183bbb42b024f88bd9058842c10f95' 2017-08-27 21:32:08 -04:00
bd27f0b5a7 remove repetition of libquadmath in TH CMakeLists 2017-08-27 21:30:58 -04:00
674e1f2ba1 increase test subprocess timeout 2017-08-27 21:11:08 -04:00
4cca286d9e add google analytics to docs 2017-08-27 20:58:33 -04:00
5a0163cdee Merge commit '80caca4edbc415abb2f0695fb2565e6b46c410a8' 2017-08-26 22:25:01 -04:00
80caca4edb Allowing larger grids for THCApply shows improved performance. 2017-08-26 22:23:50 -04:00
18d6579ee3 Merge commit '72a257584efa7fb63b14f09d19efc96caa5d6e4d' 2017-08-26 22:21:19 -04:00
b42911dd48 Merge commit '429699bb20596e1c8bc87ab37e4597b700eff8f6' 2017-08-26 22:20:55 -04:00
72a257584e Add numerically stable logsigmoid 2017-08-26 22:19:48 -04:00
429699bb20 Add numerically stable logsigmoid 2017-08-26 22:19:39 -04:00
94b5990201 Add torch.cuda.get_device_name function (#2540) 2017-08-26 15:06:37 -04:00
d9f9047e39 Merge commit '16008661bf63da2fbe7fc8412a8ab4dd70deeace' 2017-08-26 14:46:26 -04:00
a9b089c3c8 Merge commit 'b5949d8e9d9e43737979bcc089de2a2d2f783e1d' 2017-08-26 14:46:05 -04:00
5294017d9f Adding implicit padding for 3d average pooling 2017-08-26 14:45:19 -04:00
16008661bf Adding implicit padding for 3d average pooling 2017-08-26 14:44:59 -04:00
b5949d8e9d Adding implicit padding for 3d average pooling 2017-08-26 14:44:49 -04:00
150dc7a8e3 Improve Windows Compatibility (for libshm) (#2455) 2017-08-26 07:20:45 -04:00
327a0793b4 Add missing parameters to tensor docs (#2541) 2017-08-25 17:40:55 -04:00
50129befb6 Merge commit '7d42fd8423213a50a2ac66c08100eb540c531ea0' 2017-08-25 14:30:52 -04:00
61fae72e5f Merge commit 'e4d15223dcba0fb55e58ae9822f1b97a2f9d97d7' 2017-08-25 14:28:51 -04:00
d72118cfcd Merge commit 'e31ec51ee5333bec15b5ae10d646c21c422ff9fe' 2017-08-25 14:28:09 -04:00
2c07f88ea3 Fix typos. 2017-08-25 14:27:07 -04:00
7d42fd8423 Fix typos. 2017-08-25 14:25:58 -04:00
e4d15223dc Fix typos. 2017-08-25 14:25:28 -04:00
e31ec51ee5 Fix typos. 2017-08-25 14:25:17 -04:00
61e4723132 Fix typos (#2472) 2017-08-25 14:13:38 -04:00
0b95b4c7d1 Merge commit '8a2e69177b91b16f17c898fe6c71b4a3c1f3d6cb' 2017-08-25 14:12:39 -04:00
d281fea9ac Merge commit '7b71abc52ad3ef1bb179e26f22e038a84707c270' 2017-08-25 14:12:01 -04:00
eb58740651 add ones_like and zeros_like 2017-08-25 14:11:04 -04:00
8a2e69177b add ones_like and zeros_like 2017-08-25 14:10:52 -04:00
7b71abc52a add ones_like and zeros_like 2017-08-25 14:10:42 -04:00
c86f8fa746 Merge commit '4bef5f5ff97c0b02b9125caf3e68008573c25dd7' 2017-08-25 14:04:29 -04:00
3b155fa305 Not changing dimension size for expand when target size is -1 2017-08-25 14:04:23 -04:00
4bef5f5ff9 Not changing dimension size for expand when target size is -1 2017-08-25 14:01:53 -04:00
a655e6313e update README with new major contributors and remove redundant sections 2017-08-25 13:47:17 -04:00
15e16f6963 More double backwards support for pooling, unpooling, padding (#2516)
* Support double backwards for AdaptiveAvgPool1d and AdaptiveAvgPool2d.

* Support double backwards for ReplicationPad2d, ReplicationPad3d, and ReflectionPad2d.

* Support double backwards for FractionalMaxPool2d.

* Support double backwards for MaxUnpool1d and MaxUnpool2d.

* Circular recursive imports not supported in python 2.

* Address review comments.
2017-08-25 12:28:06 -04:00
9c948c22b5 Fix check_no_size_average tests. (#2532)
* Fix check_no_size_average tests.

* size_average / sizeAverage for non-legacy vs legacy.

* Fix lint.
2017-08-25 12:27:26 -04:00
98a5c99b46 remove debug code 2017-08-25 11:26:02 -04:00
14038fe559 Remove unnecessary if in maybe_view. (#2538) 2017-08-25 11:21:50 -04:00
f250815fa4 Fix bugs caused by flatten_parameters() (#2537) 2017-08-25 11:08:54 -04:00
153c9b0714 Add examples in functional.py and loss.py (#2371)
* Add examples in functional.py

Added examples for F.cross_entropy, F.binary_cross_entropy and F.binary_cross_entropy_with_logits.

* Add ` for PyTorch docs

Added ` for PyTorch docs.

* Add examples in loss.py

Added examples for nn.BCELoss and nn.BCEWithLogitLoss.
2017-08-25 09:44:36 -04:00
0d7d79ad75 Merge commit 'd112cbd7f675a8ffde3a8995ac37c69a4c84e5df' 2017-08-25 07:39:02 -04:00
ecc7579f44 Merge commit 'e4c05c2b5f3dbc121c0cf4bb78d15540412dcd3c' 2017-08-25 07:37:19 -04:00
e4c05c2b5f fix leaking symbols from THNN 2017-08-25 07:36:27 -04:00
b3d2a3574e Merge commit '01adebea1c0cb9aa704e50a9d14507b0fab5939f' 2017-08-25 07:36:00 -04:00
802ddd997d Disable persistent BN for cudnn < 7.0.3 2017-08-25 07:33:24 -04:00
51b60354a5 cudnn 7 grouped convolutions 2017-08-25 07:33:03 -04:00
ec86d0b2ba Updates for CUDA 9 2017-08-25 07:32:05 -04:00
01adebea1c cuda 9 hgemm fix 2017-08-25 07:31:32 -04:00
d112cbd7f6 Updates for CUDA 9 2017-08-25 07:27:25 -04:00
bc93d79967 Updates for CUDA 9 2017-08-25 07:27:16 -04:00
b079469af0 self -> ctx in Extending note 2017-08-25 07:19:20 -04:00
14d8c03424 adding backward capability for potrf (Cholesky) (#2386) 2017-08-24 17:18:11 -04:00
7e21e760e6 More cogent error messages during indexing (#2503) 2017-08-24 17:13:03 -04:00
b7a6e823a9 Fix TypeError of prod when BP to GPU tensor (#2353) 2017-08-24 17:09:25 -04:00
7aa6bc516f add "Basics" section to distributed docs (#2433) 2017-08-24 17:07:20 -04:00
6bcbecfb97 fix doc of lr_scheduler (#2280)
* resolves #1991

* fix typo
2017-08-24 17:04:53 -04:00
5c6d543b7a Allow kwarg-only inputs to DataParallel 2017-08-24 17:01:04 -04:00
de903ad208 Implement double backwards for nn.Upsample. 2017-08-24 11:13:39 -04:00
5d09fcd028 Make DistributedDataParallel threads Daemon threads to allow clean process exit (#2524) 2017-08-24 06:32:29 -04:00
4c69697d2a Distributed bug fixes. (#2434) 2017-08-23 14:46:52 -04:00
bbf2c6a084 Fix ConcatDataset docs (#2355)
* Fix ConcatDataset docs

so that sphinx-napoleon parses it right.

* Fix WeightedRandomSampler docs
2017-08-23 09:47:57 -04:00
5e54d9330f hiding statically linked libstdc++ symbols (#2471)
This is a solution for the problem described in this comment:
1d9b10d312 (commitcomment-23678756)

It is also a solution for issue #2462.
2017-08-23 07:18:21 -04:00
966fdbd93a Add commands to re-build individual libraries. (#2506)
When working on PyTorch dependencies we often want to rebuild only that
dependency and the Python extension. You can now do that by running:

  python setup.py build_thc

to only re-build THC
2017-08-23 07:16:05 -04:00
27bd3df71b Patching EmbeddingBag to accept 2D input (#2429)
* Patching EmbeddingBag to accept 2D input

* fix for CUDA inputs

* fix lint
2017-08-23 07:12:21 -04:00
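A hedged sketch of the 2-D input path this patch adds (constructor arguments are the standard num_embeddings/embedding_dim/mode ones; the exact output is as described in the PR, not re-verified here):

  import torch
  from torch.autograd import Variable

  bag = torch.nn.EmbeddingBag(10, 3, mode='sum')
  inp = Variable(torch.LongTensor([[1, 2, 4], [4, 3, 9]]))  # one bag per row
  out = bag(inp)                                            # shape (2, 3); no offsets tensor needed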
008a62b18a DOC fixed Tensor.expand docstring (#2495) 2017-08-23 06:38:55 -04:00
e37847af92 Test CrossEntropyLoss double backwards. 2017-08-22 11:12:03 -04:00
0390e80a7e Support MarginRankingLoss double backwards. 2017-08-22 11:12:03 -04:00
e27127391d Support double backwards for SoftMarginLoss. 2017-08-22 11:12:03 -04:00
fb7e9583bd Generate no_size_average criterion tests by specifying check_no_size_average=True 2017-08-22 11:12:03 -04:00
22ec5f37ca Support double backwards with parallel nn autograd functions. (#2508) 2017-08-22 03:57:45 -04:00
a32e98b700 Add documentation for std/var unbiased argument (#2509) 2017-08-22 03:45:54 -04:00
de24bb4b66 Update readme with docker cmd (#2501)
* update docker command in readme to use pre-built images

* correct spelling of Docker Hub

* Update README.md
2017-08-21 08:52:26 -04:00
d2b8d3f8f7 add slack clarification 2017-08-21 06:12:46 -04:00
5c43fcda8d Support params that don’t require grad in DistributedDataParallel (#2464) 2017-08-19 11:22:20 -04:00
c5a9aa027b fix wrong path to ReduceLROnPlateau in docstring 2017-08-19 10:27:58 -04:00
11a14fd0fd Clarifications on setting up torch.distributed (#2475) 2017-08-18 09:21:04 -04:00
5b8e2ad2a6 test_distributed cuda tests don't skip if cuda not available. (#2476)
2017-08-17 17:45:32 -04:00
661beb3345 Speed-up weight_norm over the right-most dim (#2431)
When weight-normalizing over the right-most dimension, combine all
dimensions to the left into a single dim. This avoids two extra
transposes.
2017-08-16 18:04:18 -04:00
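The optimization amounts to a reshape rather than a transpose; a rough sketch of the idea (not the actual weight_norm implementation):

  import torch

  w = torch.randn(4, 5, 6)                    # norm kept along the right-most dim
  w2 = w.contiguous().view(-1, w.size(-1))    # collapse the leading dims: (20, 6)
  norms = w2.norm(2, 0)                       # one norm per column; no transposes needed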
bbcc7d37ca Have Tensor.sort accept descending as only argument (#2329) 2017-08-16 18:01:30 -04:00
30baba7d15 fix typo in docstring 2017-08-16 17:55:39 -04:00
51385b3887 Merge commit '73e0b3f4014b9f5b716eb1216d11f13347207f27' 2017-08-16 17:53:22 -04:00
2579b6b53f Merge commit '98ac4542e0e097cd1b26c62d0ffe7fb37230347c' 2017-08-16 17:52:44 -04:00
0d34a6451a fixing the bug with squeezing a singleton dimension in torch.min and torch.max 2017-08-16 17:51:48 -04:00
73e0b3f401 fixing the bug with squeezing a singleton dimension in torch.min and torch.max 2017-08-16 17:51:41 -04:00
98ac4542e0 fixing the bug with squeezing a singleton dimension in torch.min and torch.max 2017-08-16 17:51:24 -04:00
21d8465d8b Add test for Tensor creation from NumPy on CPU and CUDA 2017-08-16 17:44:58 -04:00
7409e0822b Cuda fixes 2017-08-16 17:44:58 -04:00
f269d3f0b5 Add cuda tensor initialization with array 2017-08-16 17:44:58 -04:00
727942be55 Use proper type for counter 2017-08-16 17:44:58 -04:00
610d9d04e7 Support constructing tensors from arrays of non-matching types 2017-08-16 17:44:58 -04:00
6e1d72998f Merge commit 'ec2863024434b54f339801266a0e8d2d63a418ce' 2017-08-16 17:26:43 -04:00
b797ee04fc Add CUDA version of eye 2017-08-16 17:25:52 -04:00
ec28630244 Add CUDA version of eye 2017-08-16 17:23:28 -04:00
0985eaf373 Add ability to specify init_method for test_distributed. (#2465)
* Add ability to specify init_method for test_distributed.

* Move init_method specification to test run line.

* Run for gloo tests as well.

* Better status message for gloo test.
2017-08-16 17:04:21 -04:00
3a8feb7fb7 Address integer division to make it compatible with py2 2017-08-15 21:12:21 -04:00
b09d7c890e Copy-edit sparse constructor docs for clarity.
Basically, it's easy to confuse the dimensions of the index tensor.
This adds some more text which should hopefully clarify the situation.

Fixes #2416.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-08-15 13:36:30 -04:00
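The point the copy-edit clarifies: the index tensor is laid out as (ndim x nnz), one column per non-zero entry. A small illustration using the sparse constructor of this era:

  import torch

  i = torch.LongTensor([[0, 1, 1],
                        [2, 0, 2]])                       # 2 dims x 3 non-zeros
  v = torch.FloatTensor([3, 4, 5])
  s = torch.sparse.FloatTensor(i, v, torch.Size([2, 3]))
  print(s.to_dense())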
606699ef97 add calls to contiguous for cudnn affine_grid 2017-08-15 13:35:35 -04:00
a6ba3581a9 Merge commit '1a68c961ea88fe70cb4fca419e9c4186b75846b3' 2017-08-15 03:01:37 -04:00
1a68c961ea accumulate in accType for reductions over dimensions 2017-08-15 03:00:44 -04:00
08c82f8b9d Merge commit '54a111564147f7621785b0f284f77a7afd22f337' 2017-08-15 02:59:17 -04:00
c55cc743fb move clamped random functions out of cwrap and into TH 2017-08-15 02:58:28 -04:00
54a1115641 move clamped random functions out of cwrap and into TH 2017-08-15 02:58:14 -04:00
763fb5d708 Update documentation to reflect Linear with 2+D inputs (#2410) 2017-08-15 02:55:01 -04:00
fb5e40face Merge commit '3f25232aaba44aa4377c7e5ed670587a72f5886e' 2017-08-15 02:52:54 -04:00
469969e324 Merge commit '4bca77816e8402539917d61ecce239810d7f3d5e' 2017-08-15 02:52:16 -04:00
b3db52fe36 Support __neg__, .neg(), and neg_() for Long, Int, Short tensor types. 2017-08-15 02:51:25 -04:00
3f25232aab Support __neg__, .neg(), and neg_() for Long, Int, Short tensor types. 2017-08-15 02:51:11 -04:00
4bca77816e Support __neg__, .neg(), and neg_() for Long, Int, Short tensor types. 2017-08-15 02:50:55 -04:00
d19ee9c182 Add comments for default value (#2282)
Added comments for default value in conv.py
2017-08-15 02:49:22 -04:00
f9d02903b7 Always copy indices in Embedding._renorm (#2414)
LookupTable_renorm sorts and de-dupes the passed in indices tensor
in-place.

Fixes #2413
2017-08-15 02:46:25 -04:00
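A hedged illustration of the symptom being fixed (max_norm triggers the renorm path; before this change the user's index tensor could come back sorted and de-duped):

  import torch
  from torch.autograd import Variable

  emb = torch.nn.Embedding(10, 3, max_norm=1.0)
  idx = Variable(torch.LongTensor([4, 2, 4]))
  out = emb(idx)       # renorm now operates on a copy of the indices
  print(idx.data)      # stays [4, 2, 4] instead of being modified in place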
5e088da5ba Add DistributedDataParallel to docs
DataParallel was included twice.
2017-08-15 10:01:36 +05:30
b0da5bf0fb Use masked_fill rather than casting masks when appropriate. 2017-08-14 16:19:10 -04:00
0b0d2a06f7 Update legacy SoftPlus to add threshold constructor arg. 2017-08-14 16:19:10 -04:00
c92f229aa2 CosineEmbeddingLoss as a new style function. 2017-08-14 16:19:10 -04:00
9bcb9658d5 MarginRankingLoss as new style function. 2017-08-14 16:19:10 -04:00
7aeb837895 Implement HingeEmbeddingLoss double backwards. 2017-08-14 16:19:10 -04:00
1efe38768d Implement KLDivLoss double backwards. 2017-08-14 16:19:10 -04:00
5106ce67bb Implment SmoothL1Loss double backwards. 2017-08-14 16:19:10 -04:00
19d4c37ced Implement MSELoss double backward. 2017-08-14 16:19:10 -04:00
7875c02217 Implement GLU double backwards. 2017-08-14 16:19:10 -04:00
9a243abe5c Implement Softmin double backwards. 2017-08-14 16:19:10 -04:00
988b0d58e6 Implement LogSigmoid double backwards. 2017-08-14 16:19:10 -04:00
0c3a01fe44 Implement SoftShrink double backwards. 2017-08-14 16:19:10 -04:00
8d38c0ee52 Implement Softplus double backwards. 2017-08-14 16:19:10 -04:00
ea9a7823b4 Implement Hardshrink double backwards. 2017-08-14 16:19:10 -04:00
a6cccc8701 Implement RReLU double backwards. 2017-08-14 16:19:10 -04:00
33383f3912 Reduce overhead of broadcasting when broadcasting isn't required. (#2364)
* Reduce overhead of broadcasting when broadcasting isn't required.

* Fix typo.
2017-08-14 15:00:38 -04:00
cd5275e79f Convert upsampling Functions to new style (#2372) 2017-08-11 21:03:58 -04:00
641e582f31 Fix typo (#2378) 2017-08-11 20:57:26 -04:00
3285dc12c9 Avoid reorder warnings with -Wreorder 2017-08-11 18:41:54 -04:00
dd5618aa49 Remove unnecessary moves in convolution autograd. 2017-08-11 10:47:26 -04:00
d79662088c Remove unnecessary moves, avoid IncRef/DecRef of PyBools. 2017-08-10 14:04:53 -04:00
062673db88 Properly pass saved_for in BatchNorm/Conv as the relevant Backward function.
Previously, these Functions passed themselves, i.e. the saved_for from
ConvForward would be ConvForward.
2017-08-10 14:04:53 -04:00
2f624dfd90 Add AutoGPU guard and properly reference Python args from BatchNormBackwardBackward. 2017-08-10 14:04:53 -04:00
50c208a50b Revert "Fix typos."
This reverts commit 4622b3395276b37e10141fab43ffea33941ca0c2.
2017-08-10 13:57:00 -04:00
7f097f4b82 call gemmStridedBatched for cuda >=8 to avoid calling kernels to set up pointers (#794) 2017-08-10 01:37:10 -04:00
1199e3d496 changed a small mistake in cross entropy doc (#2292) 2017-08-08 22:04:19 -04:00
c000d15058 Properly use Py_RETURN_True, Py_RETURN_False in back compatibility warnings. (#2345) 2017-08-08 21:54:20 -04:00
9199c954f1 Fix typo in DistributedDataParallel (#2320) 2017-08-08 21:53:42 -04:00
1ac98b1bce Add documentation for apply (#2327) 2017-08-08 21:53:26 -04:00
9357b8fafc new_criterion_tests is redefined so BCELogitsWithLoss tests don't execute. (#2347) 2017-08-08 21:53:15 -04:00
35bbb7bfba THD: add a missing header to fix build failure 2017-08-08 11:08:07 -04:00
4622b33952 Fix typos. 2017-08-08 11:05:38 -04:00
751198f3b1 move cpp flags earlier (#2325) 2017-08-08 07:22:33 -04:00
e51fec3be0 Update sparse.py (#2336) 2017-08-08 07:16:52 -04:00
5caa42b538 Add ConcatDataset to docs (#2337) 2017-08-08 07:16:04 -04:00
1449c2c821 long -> int64_t in convolution autograd 2017-08-07 18:16:01 -04:00
1654bc9335 add shape to pass-throughs 2017-08-06 10:54:02 -04:00
d2b61e4db9 Merge commit '95f357ffcfe431c544b5fcfa8df402b1507baca3' 2017-08-04 19:51:48 -04:00
2490f3c955 Merge commit '24c496bdda9e5feae813868d901de67d516cf8e8' 2017-08-04 19:51:12 -04:00
87451fd643 move normal variants to TH/THC 2017-08-04 19:50:23 -04:00
95f357ffcf move normal variants to TH/THC 2017-08-04 19:49:49 -04:00
24c496bdda move normal variants to TH/THC 2017-08-04 19:49:30 -04:00
4599c0c7df Update autograd notes (#2295) 2017-08-05 05:18:05 +05:30
8ce4401f09 documentation nit fix for torch.Tensor.random_ (#2297) 2017-08-05 04:31:15 +05:30
03d856977e Update README to link to NCCL2 2017-08-04 09:44:37 -07:00
4a33f66e27 Update README to link to NCCL2 part 3 2017-08-04 09:44:09 -07:00
d66fb63679 Update README to link to NCCL2 #2 2017-08-04 09:43:29 -07:00
80ae43b443 Update README to link to NCCL2 2017-08-04 09:42:25 -07:00
1baae004bf cuda 7.5 fix for gloo 2017-08-04 06:01:54 -04:00
6648677acf [doc] variable shape error of LSTMCell, GRUCell (#2289) 2017-08-04 06:18:51 +05:30
977f9644c0 Fix ZeroPad2d backwards with negative pads. 2017-08-04 05:40:31 +05:30
38b42e0421 Improve cuDNN weight layout test 2017-08-03 08:22:55 +05:30
d1ab37a65b Make sure deserialized RNN modules have _data_ptrs too 2017-08-03 08:22:55 +05:30
70c95dbe52 fix Conv3d non-contiguous weight bug 2017-08-02 22:47:09 -04:00
74e5328b03 remove limitations on output_padding in Conv* routines 2017-08-02 22:46:24 -04:00
814b65df4f remove limitations on output_padding in Conv* routines 2017-08-02 22:46:04 -04:00
a565b77791 add 2d and 3d dilated full Convolution 2017-08-02 22:44:59 -04:00
6e6dca001c add 2d and 3d dilated full Convolution 2017-08-02 22:44:44 -04:00
60e7966c1f Fix BatchNorm double backwards when training=False. (#2277) 2017-08-03 05:34:12 +05:30
7c04f11d88 search for ldconfig in /sbin for nccl detection (#2276) 2017-08-03 05:32:21 +05:30
f6585e80d7 if RNN's hx is None, requires_grad=False (#2274)
When the initial hidden states of the RNN are None, we don't need to compute their gradients.
2017-08-03 05:29:50 +05:30
0b000952c1 Split batchnorm eval test into cpu and cuda functions. (#2273) 2017-08-03 05:25:05 +05:30
42328b70f7 fix another is_same_size call 2017-08-02 19:53:39 -04:00
ca98c659df Add tests that gradcheck grad sizes match input size and fix advanced indexing
case that fails check.
2017-08-02 17:49:02 -04:00
2a8379847b add reentrancy checking for gradcheck. 2017-08-02 17:49:02 -04:00
eb1ac73184 Remove save_mean/save_var from BatchNorm double backwards, as it's not needed.
These could cause a problem with double backwards because they were std::move'd in
Backward.
2017-08-02 17:49:02 -04:00
b3ca3da4b6 fix type mismatch 2017-08-02 10:18:03 -04:00
f484a5fee8 Implement LogSoftmax double backwards (#2270) 2017-08-02 07:17:09 +05:30
aebec91301 Fix serialization of legacy ClassNLLCriterion with ignore_index. 2017-08-01 14:29:33 +05:30
9c1e9d8a9b Update legacy ClassNLLCriterion to add ignore_index. 2017-08-01 14:29:33 +05:30
61c873cc7d Implement SoftMax and NLLLoss double backwards. 2017-08-01 14:29:33 +05:30
e1ca722988 Add comments for default value (#2248)
Added comments for default value in nn.functional
2017-08-01 14:27:46 +05:30
410f464dd1 provide more information in Declarations.cwrap 2017-07-31 14:52:20 -07:00
8262920b72 Add ATen overload to AutoGPU. (#2234)
* Add ATen overload to AutoGPU.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Use new AutoGPU overload.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-30 09:01:24 +05:30
0cd149f06f Add comments for default value (#2242) 2017-07-29 14:14:14 +05:30
43c944acbd Remove dead THPP code that has been replaced with ATen objects. (#2235)
THPP usage is now isolated in THD.
2017-07-29 08:07:41 +05:30
bf26a51f91 fix a bug where an uninitialized at::Tensor was passed to createPyObject (#2239) 2017-07-29 06:28:18 +05:30
80192d3e8d Add rudimentary support for calling a few sparse tensor functions. 2017-07-28 12:38:23 -07:00
c304d04fc6 Replace thpp::Tensor with ATen Tensor in autograd csrc (#2170) 2017-07-28 10:18:37 -04:00
f1fd4ac7ed Added aarch64 support (#2226) 2017-07-28 11:24:19 +05:30
be7dcccdd9 fix issues where scale gets reported as 0.0000 in output 2017-07-27 11:24:12 -07:00
ac76ab5fca Increase tol. for float tensor qr big test.
test_FloatTensor_qr_big test is still a bit flaky on K80. Increasing tolerance to improve reliability as tests are moved around and results change for this test.
2017-07-27 14:23:06 -04:00
04f31aa034 Improve Variable.retain_grad 2017-07-27 20:36:14 +05:30
ae59e008cd add retain_grad method to Variable, so the gradient gets stored during backprop on non-user variables 2017-07-27 20:36:14 +05:30
e25b3d7bc5 replace long long types with size_t (#1267)
Works around a bug in the MSVC compiler in win32 mode
2017-07-27 19:13:56 +05:30
925208af72 Implement BatchNorm double backwards (#2207)
* Implement BatchNorm double backwards as a python function called directly from C++.

This will be converted to C++ code once ATen is integrated with autograd.

* Some performance improvements via inplace ops and reusing calculations.
2017-07-27 06:00:31 +05:30
643f8d12ff [bugfix] in bce_with_logits logsumexp calculation (#2221)
* fix bug in bce_with_logits logsumexp calculation

* flake8 fix
2017-07-27 05:58:56 +05:30
d4bd6c4314 add some asserts to basic.cpp 2017-07-26 16:51:41 -07:00
3b6d01301f add valgrind to CI 2017-07-26 16:11:26 -07:00
fb8f9de498 fix for ATen API Change 2017-07-26 18:55:56 -04:00
cb9ad7a892 Opt into Trusty builds. (#2214)
* Opt into Trusty builds.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Bump to 2.7.9.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-27 04:04:57 +05:30
f7de7bab6e Merge commit 'fd97d92479e32e550866adfd1f0465e4cfa5e581' 2017-07-26 18:11:16 -04:00
fd97d92479 allow retain to be specified for unsafeTensorFromTH 2017-07-26 14:58:32 -07:00
f3aa97f169 Deduplicate THPUtils_checkLong/THPUtils_unpackLong (#2218)
There were two implementations of THPUtils_checkLong/THPUtils_unpackLong; one
that was a macro and one that was not, which is hella bad if you accidentally
include the macro before the real definition.  Now we always use the inline
function.

A reasonable follow-up task would be to un-macro-ify the rest of these functions.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-27 03:12:12 +05:30
b0648fc3fc Merge commit 'be9ef9283f297997afd3bf8e21147ec6bf09ebbf' 2017-07-26 17:25:39 -04:00
be9ef9283f Merge pull request #35 from ezyang/pr/undefined-dim-doc
Note [Undefined-dim versus 0-dim]
2017-07-26 12:42:33 -07:00
9c0d52a32f fix osx build errors related to long/int64_t 2017-07-26 12:36:25 -07:00
54545c2154 Note [Undefined-dim versus 0-dim]
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-26 12:34:13 -07:00
9ec7051442 Remove __func__ hack in auto nn. 2017-07-26 15:28:25 -04:00
2676c6357f Enable Conv groups gradgradchecks. (#2216) 2017-07-27 00:24:12 +05:30
ef3b09fb5f fix a bug where some scalars were getting truncated to integers incorrectly. 2017-07-25 14:27:16 -07:00
f194ac1e09 Merge pull request #477 from wickedfoo/feature_lp_pooling
GPU implementation of L_p feature pooling
2017-07-26 02:31:59 +05:30
26a0b9aa43 Merge pull request #1259 from wickedfoo/feature_lp_pooling
CPU implementation of L_p feature pooling
2017-07-26 02:31:50 +05:30
e548580f31 Add missing models to torch vision documentation (#2204) 2017-07-26 01:58:18 +05:30
421607a935 DataParallel device_ids slicing fixes (#2200) 2017-07-26 01:54:38 +05:30
7be545292d Update cudnn.py 2017-07-25 09:35:44 -04:00
a0e83280ef Update cudnn.py 2017-07-25 09:35:44 -04:00
aa35be2032 search for cudnn in conda 2017-07-25 09:35:44 -04:00
626840aef3 C function wrapper uniqueness (#1912)
* add SharedFunctionMaker to create Function shared in the graph

* Clean shared_ptr usage for only function that will be used in the graph

* make Function binding match the Variable one

* remove unnecessary changes

* fix comments

* proper weakref implementation

* add call to clear in dealloc
2017-07-25 13:12:54 +05:30
bcea678e7b Update rebased functions to call apply. 2017-07-25 07:37:25 +05:30
1a52ca02ef Always return indices from MaxPool autograd functions to simplify implementation;
The callers (in functional.py) will filter out the return instead.
2017-07-25 07:37:25 +05:30
84314859af Implement double backwards for MaxPool2d. 2017-07-25 07:37:25 +05:30
9c2beb33c5 Implement double backwards for MaxPool1d. 2017-07-25 07:37:25 +05:30
7deba74969 Implement MaxPool{1d,2d,3d}Backwards (non-differentiable) functions. 2017-07-25 07:37:25 +05:30
48bb07a4db Implement double backwards for AvgPool3d. 2017-07-25 07:37:25 +05:30
bb86ed7b97 Implement double backward for AvgPool1d, AvgPool2d, LPPool2d. 2017-07-25 07:37:25 +05:30
291369ff1b Convert pooling functions to new-style, once_differentiable functions. 2017-07-25 07:37:25 +05:30
2118400e18 Fix lint. 2017-07-25 07:37:25 +05:30
39934da8b3 Address review comments. 2017-07-25 07:37:25 +05:30
c12b494329 Implement double backwards for ELU. 2017-07-25 07:37:25 +05:30
506d52dc33 Add check_gradgrad=False for new NLLLoss2d test. 2017-07-25 07:37:25 +05:30
7687c2677a Fix double backwards advanced indexing derivative wrt grad_output.
Also small legacy nn test issue and unrelated syntax issue.
2017-07-25 07:37:25 +05:30
97d21e243b Implement L1Cost double backwards. 2017-07-25 07:37:25 +05:30
0bda56956e Implement double backwards for auto-generated HardTanh. 2017-07-25 07:37:25 +05:30
40af93bb57 Optimize PReLU double backwards via a PReLUBackwards autograd function. 2017-07-25 07:37:25 +05:30
9608e37969 Implement double backwards for PReLU. 2017-07-25 07:37:25 +05:30
ec7c510557 Implement Softsign double backwards. 2017-07-25 07:37:25 +05:30
8636be3880 Ensure gradients wrt grad_outputs are checked in gradgradcheck. 2017-07-25 07:37:25 +05:30
fb2284f3a0 Add gradgrad checks for NN module and criterion tests. 2017-07-25 07:37:25 +05:30
9ec9dee27d Implement NN Criterion functions as potentially double backwards functions. 2017-07-25 07:37:25 +05:30
7b6aab9079 Unify implementation of _Loss and _WeightedLoss autograd functions. 2017-07-25 07:37:25 +05:30
852dd5f011 Convert _WeightedLoss functions to new style autograd functions. 2017-07-25 07:37:25 +05:30
085abee444 Rebase kl_div changes. 2017-07-25 07:37:25 +05:30
48b85fe012 Implement THNN non-criterion Functions as new style with backward/backward. 2017-07-25 07:37:25 +05:30
45ce4df74c Convert auto nn Functions (non-criterion) to new style. 2017-07-25 07:37:25 +05:30
5695cbf986 Add comments in loss.py and distance.py (#2189)
* Add examples in CrossEntropyLoss

1. Added examples in CrossEntropyLoss
2. Make consistent style of example for PyTorch docs
3. Delete unnecessary character '

* Change comments in distance.py

1. Delete x1, x2 from arguments and add eps in PairwiseDistance
2. For the shape, added input1 and input2 for readability (PairwiseDistance and CosineSimilarity).

* Add examples

Added the word 'examples' for PyTorch docs
2017-07-25 07:36:28 +05:30
03df5debe3 Gloo fixes for Linux + old cmake (2.8.0) + old glibc (CentOS6) 2017-07-24 21:59:58 -04:00
2ebdef0154 Add 'torch/lib/gloo/' from commit '1978bba3e421eceab6181bcbc838553091cedecc'
git-subtree-dir: torch/lib/gloo
git-subtree-mainline: ceb4f84d12304d03a6a46693e54390869c0c208e
git-subtree-split: 1978bba3e421eceab6181bcbc838553091cedecc
2017-07-24 21:59:49 -04:00
ceb4f84d12 Improve memory usage of cuDNN RNN modules (#2179) 2017-07-25 04:00:17 +05:30
112728cbe9 reformulate bce_with_logits to not use abs (#2195)
* reformulate bce_with_logits to not use abs

* flake8 fixes
2017-07-25 03:46:27 +05:30
dc17fb68e4 Fix minor bug in parallel_apply (#2193) 2017-07-25 03:45:00 +05:30
4a4d8841e6 Delete unused import 2017-07-23 12:48:11 -04:00
3c275fe7a0 Increase flaky test tolerance (#2185) 2017-07-22 11:37:34 -04:00
1978bba3e4 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:12 -07:00
35757af6f7 Add broadcasting of weights to bce/bce_with_logits (#2161)
* added tests + removed explicit expand of weight in bce with logits

* add auto broadcasting of weight to BCELoss

* remove the need for _BCELoss

* formatting of warning

* remove TODO

* move across assert from _functions/thnn/loss.py

* flake8 fixes
2017-07-21 16:02:07 -04:00
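A minimal usage sketch of the broadcasting this PR adds, assuming the standard functional signature (a weight of shape (C,) broadcast against an (N, C) input):

  import torch
  from torch.autograd import Variable
  import torch.nn.functional as F

  input = Variable(torch.randn(3, 5))
  target = Variable(torch.rand(3, 5))
  weight = Variable(torch.rand(5))      # broadcast across the batch dimension

  loss = F.binary_cross_entropy_with_logits(input, target, weight=weight)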
8ab3d214d5 Fixes for DistributedDataParallel (#2168) 2017-07-21 16:00:46 -04:00
ec2def803b Merge commit '2efac3ed83a29f57f914e9044fdddd2ce7ecd6b7' 2017-07-21 15:58:23 -04:00
71ce3448d9 Fix torch.inverse when magma is not available
Fixes #2156
2017-07-21 15:57:43 -04:00
2efac3ed83 Fix torch.inverse when magma is not available
Fixes #2156
2017-07-21 15:57:25 -04:00
66bbe5d75a .creator -> .grad_fn in the code example (#2171) 2017-07-21 14:43:16 -04:00
ea607afd06 Add comments in nn.Upsample (#2175) 2017-07-21 14:34:58 -04:00
4f035f14de Add a support matrix for distributed backends 2017-07-21 14:19:46 -04:00
72e9e7abf7 Warning squash.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-21 14:13:11 -04:00
4d45ce7d11 Added UpSampling module and associated tests. 2017-07-21 12:25:50 +01:00
eed323c344 avoid warning 2017-07-20 10:59:56 -07:00
ea6f9a26b8 fix version number 2017-07-20 13:30:53 -04:00
3719b4247a return a sentinel value when THTensor has undefined dimensions. 2017-07-20 10:25:30 -07:00
bf1fc250d1 get conda root dir automatically, trick from Dockerfile 2017-07-20 11:02:30 -04:00
47942307b5 Comment that data of THStorage may be NULL.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-20 10:55:35 -04:00
6b69723d4f Document how Numpy memory management works.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-20 10:55:35 -04:00
5254846bb2 fix typo of error msg of cmul in THSTensorMath (#2158) 2017-07-20 02:58:54 -04:00
f3f478960e Convert Embedding to new style. (#1916)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-20 02:35:21 -04:00
e537023147 add functional embedding (#1987) 2017-07-20 01:53:37 -04:00
09abaa2189 make keepdim backcompat warnings emit in autograd as well (#2157) 2017-07-20 01:48:05 -04:00
575a4a98e0 Remove assertions with side effects 2017-07-20 01:45:57 -04:00
02e23f4f6b Unify argument names in tensor and Variable methods 2017-07-20 01:45:57 -04:00
8946502348 Accept all kinds of arguments in Variable.expand 2017-07-20 01:45:57 -04:00
e708de37cc Allow keyword args in long_arg options 2017-07-20 01:45:57 -04:00
4af40e3471 Let parallel_apply accept arbitrary inputs 2017-07-20 01:45:57 -04:00
f417cb062b Fix repeat backward to handle unsqueezed dims 2017-07-20 01:45:57 -04:00
11f3ccf98f Add missing Modules to nn.functional (#1801)
* add dropout2d and dropout3d to functional

added some loss functions to functional

added tests

using dropout from backend

added docs

fixes

* edited loss modules to call functional
2017-07-19 15:55:21 -04:00
31894cafdd add support for advanced indexing with less than ndim indexers, ellipsis (#2144) 2017-07-19 15:51:03 -04:00
95ccbf8b0b better error message in load_state_dict when there are inconsistent tensor sizes (#2151) 2017-07-19 15:50:29 -04:00
a5422d14c8 Merge commit 'bd6263c338c717de880cddfed660b5aa06ee108b' 2017-07-19 15:48:54 -04:00
82143487b3 Add CUDA support for arange
Also enables CUDA for range
2017-07-19 15:48:20 -04:00
bd6263c338 Add CUDA support for arange
Also enables CUDA for range
2017-07-19 15:43:00 -04:00
f4a565ded9 Merge commit '1c6a08c1c2a50a7048ae9e6e11290740d24a8374' 2017-07-19 15:42:20 -04:00
1c6a08c1c2 fix lint 2017-07-19 12:41:17 -07:00
a5c2546c0f version bump 2017-07-19 12:34:43 -07:00
13e84e460b Use unaligned store intrinsic to enable vectorized reductions on unaligned buffers
Summary: When performing reductions on fp16 buffers, gloo assumed that both buffers were either aligned to 32 bytes or misaligned by the same offset. This may not hold in intermediate steps of halving-doubling allreduce, when the reduction is performed on some offset within the receive buffer. The fix is to use intrinsic instructions that work with unaligned pointers.

Reviewed By: akyrola

Differential Revision: D5450103

fbshipit-source-id: 9a1c8f8c34d2e62223f6d5c21573ea1cfad6537f
2017-07-19 11:06:32 -07:00
4d5d9de541 Merge commit '768b7c0dee34b614ab1cd8f89c69ec7d86c19c88' 2017-07-19 12:22:36 -04:00
9da882e396 Merge commit 'ae3a8d5d2eaa1b15d825b86ce706b046e68733b8' 2017-07-19 12:21:52 -04:00
15bece50d1 Merge commit 'cfcf2af95f91a88ec61cbcac8b30a718e7332aa5' 2017-07-19 12:20:54 -04:00
8144f7c95d Merge commit '58334a0c4b3c386931293f7fbee3d2cf066221a5' 2017-07-19 12:20:20 -04:00
b660303a16 Static linking against libstdc++ in Binary Build mode 2017-07-19 12:19:36 -04:00
768b7c0dee Static linking against libstdc++ in Binary Build mode 2017-07-19 11:23:31 -04:00
ae3a8d5d2e Static linking against libstdc++ in Binary Build mode 2017-07-19 11:23:21 -04:00
58334a0c4b static MKL detection and linkage fixes 2017-07-19 11:22:46 -04:00
cfcf2af95f add explicit BLAS linkage to THC when linked against magma (in binary build) 2017-07-19 11:22:23 -04:00
f3df24269d Merge commit '975550512200cfa1ae18e21400e7efa3924a3d46' 2017-07-19 11:05:51 -04:00
c4120f34bf move to model with cuda indexing tensors for cuda tensor adv indexing 2017-07-19 11:05:10 -04:00
9755505122 move to model with cuda indexing tensors for cuda tensor adv indexing 2017-07-19 11:04:49 -04:00
8b42308f71 Bug in line 381 (sparse) (#2130)
The function iterates over columns and sets a "sparsity" fraction of entries in each column to 0. The number of zeros in a column (num_zeros) is then ceil(rows*sparsity).
2017-07-18 22:55:06 -04:00
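The arithmetic in question, spelled out (a sketch of the formula the report describes, not the library code itself):

  import math

  rows, sparsity = 10, 0.25
  num_zeros = int(math.ceil(rows * sparsity))   # 3 entries zeroed per column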
685ae4813e Squash "macro expansion producing 'defined' has undefined behavior" warnings.
Fixes #2141.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-18 22:24:55 -04:00
a0fef9dd22 Merge commit '703429d49eb397102ba20e6d4c0dd7714be001a5' 2017-07-18 20:17:26 -04:00
703429d49e Make clang shut up about class/struct mismatch.
Makes us -Werror clean again, I think.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-07-18 20:16:20 -04:00
567d95fa09 Merge pull request #25 from killeent/nullable-tensors
add support for Null Tensors to functions
2017-07-18 17:35:02 -04:00
7914d67ce3 Merge pull request #20 from killeent/type-equality
operator== for type
2017-07-18 14:32:45 -07:00
8451468d8b still generate multiple versions 2017-07-18 14:31:35 -07:00
138b216686 add support for Null Tensors to functions 2017-07-18 07:51:51 -07:00
6f6d70ffed Merge commit 'dc5854477951765f5edbac34b0c228449de1b56b' 2017-07-18 01:34:54 -04:00
dc58544779 fix baddbmm for expanded tensors 2017-07-18 01:33:59 -04:00
e13704c467 fix shadowed variable name
Summary: When compiling with -Werror=shadow-compatible-local, a variable name cannot be reused. This passed our tests, but some people compile with stricter settings.

Differential Revision: D5440805

fbshipit-source-id: a246af748717fb7e0e7a321e1ac4ddfef68ae524
2017-07-17 19:10:30 -07:00
e9dd8e0e3b Use one key for all pairs per node
Summary: To reduce round trips with store handlers, it is better to store all addresses in one key instead of one address per pair. This is what this implements.

Reviewed By: andrewwdye

Differential Revision: D5435893

fbshipit-source-id: 2d3ea3a2822c3b934ff2578d44a262e7bfbde6d0
2017-07-17 17:35:19 -07:00
a3c9054245 Add comments in loss.py (#2128) 2017-07-17 13:56:19 -04:00
c7b624651e CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera
Summary:
CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera.

The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures.

Reviewed By: Orvid

Differential Revision: D5432398

Tags: codemod, codemod-opensource

fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914
2017-07-16 21:24:13 -07:00
ba544aa0ad Add comments in nn.ELU (#2111) 2017-07-16 23:04:11 -04:00
849fb1f7e3 Fix when running with python -O (#2120) 2017-07-16 13:51:14 -04:00
16dd997239 Spelling tweaks for documentation (#2114) 2017-07-15 13:16:32 -07:00
1c0135b6f2 CreateCommonWorld: pass timeout for storehandler
Summary: Use the CreateCommonWorld timeout for the storehandler as well, not just the device connect.

Reviewed By: andrewwdye

Differential Revision: D5425923

fbshipit-source-id: 936d2129e2db3bfed8759ca097b75843d3931d5f
2017-07-14 19:20:11 -07:00
a7d82b935f Merge commit '9851ef4979bad0c8618e586e711c1bfd8648fd52' 2017-07-14 17:31:21 -04:00
af7aea9f17 Merge commit 'f805a8388be8dc55af0e3aa165b13cd0fce484d3' 2017-07-14 17:29:50 -04:00
366299f9f3 Wrap unbiased flag in var, std, varall, stdall 2017-07-14 17:29:06 -04:00
9851ef4979 Wrap unbiased flag in var, std, varall, stdall 2017-07-14 17:28:14 -04:00
f805a8388b Wrap unbiased flag in var, std, varall, stdall 2017-07-14 17:25:25 -04:00
2f7b6db429 Merge commit 'd2874c560ebd197297ef737a084b6f7ee3f03dc6' 2017-07-14 17:21:16 -04:00
16203f3325 fix test 2017-07-14 17:04:21 -04:00
80d067e70f retain_variables -> retain_graph (#2107)
Closes #1928
2017-07-14 16:45:25 -04:00
d2874c560e lint fixes 2017-07-14 16:32:15 -04:00
83596bdcb1 produce a Declarations.yaml file that describes Functions/Type/Tensor methods that framework produced. 2017-07-14 12:34:03 -07:00
f3f8ce44bd Merge pull request #18 from soumith/master
Fix handling of if_true/if_false in ATen
2017-07-14 15:16:07 -04:00
33ac9cdc10 add ATen tensor support to pytorch tuple_parser (#2102) 2017-07-14 13:56:02 -04:00
38ba935547 operator== for type 2017-07-14 10:39:40 -07:00
128e02d792 allow type inference to work on TensorList 2017-07-14 10:27:05 -07:00
7ee7542fc8 Fix handling of if_true/if_false in ATen 2017-07-14 11:58:03 -04:00
52a9367fa7 Fix minor typo (#2100)
Fixed minor typo in Autograd mechanics docs.
2017-07-14 10:20:13 -04:00
08bb3b7cc8 Merge commit '7e498d2219c8dbeb801fc4cefa36b147bbf76ff4' 2017-07-14 02:55:55 -04:00
43eaa28b9f fix empty Tensor mmap 2017-07-14 02:55:05 -04:00
7e498d2219 fix empty Tensor mmap 2017-07-14 02:54:39 -04:00
d6bc2642e7 Add ignore_index to NLLLoss2d 2017-07-13 23:22:48 -04:00
7d3511f5f2 Half fixes for ATen and CUDA 9.0 2017-07-13 22:52:39 -04:00
a5a8ab10b0 fix Hardtanh argument names to be consistent between functional and Module 2017-07-13 22:46:51 -04:00
25b591eb05 lint fixes 2017-07-13 22:41:01 -04:00
06f94a7d59 better error message when thread_local is not supported (#2092) 2017-07-13 22:32:10 -04:00
027264cd64 Merge commit '9e720f15477d2d7a388c5b5ec7d397fa5706d64f' 2017-07-13 19:59:07 -04:00
7c14c377df Merge commit 'd8fee1ebe675b9d31894ac79145f2b2629e322e4' 2017-07-13 19:25:56 -04:00
c674923bcc Merge commit 'ed6f5d7038f0e3873c2ed6add2ede7c9ab38e1ea' 2017-07-13 19:24:22 -04:00
d8fee1ebe6 add launch_bounds to greedy kernels 2017-07-13 19:23:29 -04:00
ed6f5d7038 add launch_bounds to greedy kernels 2017-07-13 19:23:24 -04:00
9e720f1547 fix bug in method declarations 2017-07-13 16:22:52 -07:00
ab26fa01e6 install vision in devel dockerfile, minor fixes to dockerfile (#2090) 2017-07-13 19:06:41 -04:00
f4ae64a6c7 add isCUDA() on Type 2017-07-13 15:13:20 -07:00
07fcd977bb add cudnn data type processing for ATen tensor (#2087) 2017-07-13 16:37:53 -04:00
54cabb8bf3 Correct negative dim behavior in torch.stack (#2084)
Fixes #1950
2017-07-13 16:29:31 -04:00
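A hedged sketch of the post-fix behavior (a negative dim is taken relative to the result's dimensionality):

  import torch

  a = torch.randn(2, 3)
  b = torch.randn(2, 3)
  torch.stack([a, b], -1).size()   # torch.Size([2, 3, 2])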
42485d87c2 Set the current device in each engine's thread (#2081)
Fixes #2017
2017-07-13 16:24:38 -04:00
007d6ad816 write generated_cpp. to a file rather than as output to make error reporting clearer. 2017-07-13 11:04:52 -07:00
abd433fa07 Merge commit '6db960fbcff7ae194c6827c73113c222391f2c3e' 2017-07-13 13:49:26 -04:00
6db960fbcf dont clobber gen.py error, fix for old versions of python 2017-07-13 10:45:14 -07:00
384f03f1be Merge commit '48b797a785c1fc6ea34398985c49b2c7c55d28ae' 2017-07-13 10:40:58 -04:00
c011d4f3d6 resolves #1991 (#2073) 2017-07-13 09:57:33 -04:00
f98c384973 Raise an error when calling from_numpy on a 0-dim array (#2075)
* Raise an error when calling from_numpy on a 0-dim array

Fixes: #2055

* reword error message
2017-07-13 09:56:12 -04:00
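A small illustration of the behavior as of this change (0-dim arrays are rejected outright rather than producing a broken tensor):

  import numpy as np
  import torch

  torch.from_numpy(np.array([1.0, 2.0]))   # fine: 1-D array
  torch.from_numpy(np.array(3.0))          # 0-dim array: raises an error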
48b797a785 fix lint 2017-07-13 03:22:31 -04:00
8983bf13f4 fix max and min docs 2017-07-13 03:03:27 -04:00
20ce45b0c3 fix EmbeddingSum offsets initialization 2017-07-13 02:57:25 -04:00
1e98155711 long -> size_t 2017-07-13 02:40:44 -04:00
1c14178c65 fix osx compilation 2017-07-13 02:38:56 -04:00
37183e91de add normalize docs to sphinx 2017-07-13 02:31:57 -04:00
14337693d0 Merge commit 'b900a49308cb0363d00add7e123b824fda3eab37' 2017-07-13 01:01:38 -04:00
58e4caf80f add missing docs 2017-07-13 01:01:04 -04:00
b900a49308 Merge pull request #11 from soumith/master
Fix ATen build for debug python
2017-07-12 21:51:36 -07:00
c888857461 Conv double backward groups (#1993)
* add support for groups in double backward

* add tests for group in double backward

* fix lint

* separate some tests to reduce number of test cases

* remove redundant testing for different number of output channels
2017-07-13 00:41:14 -04:00
7053b84c0e Merge commit '41abcd4b41308b3453cce6731d896d094b23c62a' 2017-07-13 00:39:35 -04:00
8304dc4d68 Merge commit '703ccbb8cbe1c4ce3eeb62548ce51f71181883d6' 2017-07-13 00:39:03 -04:00
c48d50a2e2 Advanced Indexing: Calculate linear offsets directly on the GPU when working with CUDA Tensors 2017-07-13 00:38:23 -04:00
41abcd4b41 Advanced Indexing: Calculate linear offsets directly on the GPU when working with CUDA Tensors 2017-07-13 00:37:20 -04:00
703ccbb8cb Advanced Indexing: Calculate linear offsets directly on the GPU when working with CUDA Tensors 2017-07-13 00:37:13 -04:00
27da4eafc2 Remove more advanced indexing duplicate tests (#2071) 2017-07-13 00:30:52 -04:00
459cb697b5 Merge commit 'ce96b84ccbdfbbee7f744942b1bb9fdc5924e442' 2017-07-13 00:26:06 -04:00
ce96b84ccb Check for shared_mem size in multinomial single-sample implementation
Handle limited shared memory on function torch.multinomial

Update THCTensorRandom.cu
2017-07-13 00:25:13 -04:00
feddb03d58 LP pooling kernels 2017-07-12 19:31:06 -07:00
fe3802d724 match PyTorch syntax 2017-07-12 16:58:57 -07:00
b8d0c7fc0d checked cast does it all 2017-07-12 14:41:04 -07:00
ea563c1df1 Make weight norm pickleable (#2066) 2017-07-12 17:21:22 -04:00
2520459617 cpu lp pooling 2017-07-12 14:21:17 -07:00
841173c530 Use NamedTemporaryFile to avoid filename collisions (#2069) 2017-07-12 17:14:42 -04:00
f4c502e8a8 basic cat implementation in ATen 2017-07-12 12:04:24 -07:00
593c5e12e1 Merge commit 'be18499e852d8b292491e27d87dadebe68931fc3' 2017-07-12 14:55:21 -04:00
dc2ed7fd33 Fix ATen build for debug python 2017-07-12 14:52:03 -04:00
81fd2bf2d0 fix some language / typos 2017-07-12 14:47:36 -04:00
8915e2710c Refactor scatter/gather and add distributed docs 2017-07-12 14:47:36 -04:00
ebd5c085dc Fix a memory leak in DataChannelTCP 2017-07-12 14:47:36 -04:00
a9759ef401 Fix undefined symbol errors in THD 2017-07-12 14:47:36 -04:00
f899eafe85 Merge commit '5894864a1c5c9596da0ae88b477ee421e3a5065b' 2017-07-12 14:33:47 -04:00
169ca67a4e Adding Spatial Transformers w/CuDNN support 2017-07-12 14:32:06 -04:00
5894864a1c Adding Spatial Transformers w/CuDNN support 2017-07-12 14:31:14 -04:00
41c8fee3e7 Merge commit '7c10f1b932fbebdf0e9105f2848229ea22109747' 2017-07-12 12:57:52 -04:00
bb891758bf Merge commit 'a20729244b43f7072797cc5e93898df795455e5b' 2017-07-12 12:57:12 -04:00
7c10f1b932 Avoid two unnecessary copies in addmm backward
The `r_` and `t` tensors become different objects, even though they
point to the same data. Avoid the copy whenever beta=0.
2017-07-12 12:56:17 -04:00
a20729244b Avoid two unnecessary copies in addmm backward
The `r_` and `t` tensors become different objects, even though they
point to the same data. Avoid the copy whenever beta=0.
2017-07-12 12:56:08 -04:00
a74fb22b9a fix inplace division for python3 (#2063) 2017-07-12 11:37:55 -04:00
0d91048639 add dummy tensor.data property, to provide interpretable error message to users (#2058) 2017-07-12 10:22:08 -04:00
10e23943b3 Fix missing _forward_pre_hooks in serialized modules (#2057) 2017-07-11 18:23:35 -04:00
be18499e85 Fix a few C++ warnings
1) Type needs a virtual dtor
2) Tensor move ctor should be noexcept
3) Make constructors from Context* and Type* explicit
2017-07-11 15:18:15 -07:00
1037f30e41 add some documentation to Tensor 2017-07-11 11:00:45 -07:00
78ecc2d3b1 Alias multinomial sampling in Cuda (#784)
* Support Multinomial Alias sampling in cuda

Moving benchmark file

* Review changes
2017-07-11 13:23:35 -04:00
f483679425 Implementation of Alias Multinomial for faster Multinomial sampling (#1046) 2017-07-11 13:22:36 -04:00
dfd5d8d0fe Avoid two unnecessary copies in addmm backward (#1971)
The `r_` and `t` tensors become different objects, even though they
point to the same data. Avoid the copy whenever beta=0.
2017-07-11 11:55:22 -04:00
158c7e86dd add basic gitignore, thpp -> at doc fix 2017-07-11 08:32:58 -07:00
73128f7b08 fix minor typos (#2051)
* Update extending.rst

fix typo

* Update cuda.rst

fix typo
2017-07-11 11:01:41 -04:00
f536c662bf fix op in docs (#2048) 2017-07-11 10:36:19 -04:00
2ecb18881c add DynamicType variants for ATen functions. 2017-07-11 10:35:03 -04:00
9d8cff9bc1 initialize aten and pytorch to share the same THCState 2017-07-11 10:35:03 -04:00
ab3d85c410 add build commands for ATen 2017-07-11 10:35:03 -04:00
e58e27cf16 Add 'torch/lib/ATen/' from commit '9d0c674cb7bcfae989d69f988363c1688c22fa89'
git-subtree-dir: torch/lib/ATen
git-subtree-mainline: 3314d51dcc1535dc2d00d357be889807d1bb8c57
git-subtree-split: 9d0c674cb7bcfae989d69f988363c1688c22fa89
2017-07-11 10:33:24 -04:00
3314d51dcc Add __repr__ to Avgpool and maxunpool layers (#2047) 2017-07-11 10:13:22 -04:00
1ef1dd9cad Add comments for readability (#2005) 2017-07-10 23:02:56 -07:00
98206c326e Fix ref counting in wrapped tuple functions (#2042)
Fixes #1963
2017-07-10 18:46:06 -04:00
9d0c674cb7 always use a custom default float 2017-07-10 15:37:18 -07:00
bff762c3ff python style fixes 2017-07-10 15:37:07 -07:00
10a8ccf27f only test gets for advanced indexing with duplicates (#2041) 2017-07-10 16:05:55 -04:00
0a9e8a23ef add atan2 function to autograd (#2040) 2017-07-10 16:04:35 -04:00
8b003565ec remove inaccessible median variant (#2015)
With the addition of medianall() this variant can no longer be accessed, because both it and medianall take no arguments.
2017-07-10 10:42:45 -04:00
53ac2d46c6 Fix typos in docstrings. (#2034) 2017-07-10 10:35:46 -04:00
318ea29a86 Merge commit 'ab3a9e177ee5eb7d39de2d385ba1e141858e8329' 2017-07-10 10:30:24 -04:00
ab3a9e177e Fix sdot_ bug for runtime F2C symbol conflicts by using cblas where available 2017-07-10 10:29:26 -04:00
46a868dab7 [Ready] Limit docs line length (#1900)
* some docs are ready

* docs

* docs

* fix some more

* fix some more
2017-07-10 10:24:54 -04:00
581921f696 support unsafe functions for getting/constructing tensors from TH objects for backward compat. 2017-07-09 21:25:38 -07:00
0025e1c776 Fix typos in the docstrings of Conv3d, AvgPool3d and MaxPool3d (#2030)
* Fix a typo of the docstring of Conv3d

* Fix typos in docstrings of 3D operations.
2017-07-09 23:20:07 -04:00
9cba97a833 Pairwise-exchange benchmark with bandwidth measurement
Summary: A simple benchmark to determine network bandwidth for pairwise communication.

Reviewed By: plapukhov

Differential Revision: D5159607

fbshipit-source-id: d16c3ed3a0c2ae182138df91bdae821f5508c6ac
2017-07-09 15:55:20 -07:00
c6d7e1e6bf added input size checks to batchnorm (#2020) 2017-07-09 15:31:24 -04:00
49f679d0e9 Acknowledge the existence of cpu HalfTensor (#2018) 2017-07-08 10:03:36 -04:00
f0788afb0c lazily initialize cuda so that we behave similar to PyTorch 2017-07-07 22:21:31 -07:00
a4dc7dcd04 osx build issues and clang warnings 2017-07-07 11:50:02 -07:00
5dd05ed8ee remove Sparse from dispatch for now, will add dispatch variants later 2017-07-07 11:40:08 -07:00
0a34f05d5b Always include THNN in the build, don't check for CUDA twice
As a result, the project builds on MacOS with gcc-6 (without CUDA).
2017-07-07 14:14:02 -04:00
4fda678a85 fix build issue when cuda does not exist 2017-07-07 10:54:17 -07:00
ebdec9a837 Skip distributed tests if not supported (#2004) 2017-07-07 11:06:56 -04:00
c3c7845572 added asserts that grad_output + input are contiguous (#2000) 2017-07-07 09:14:02 -04:00
90d0762d14 Use torch.arange instead of torch.range in test_torch.py (#1996) 2017-07-07 00:06:31 -04:00
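For context on why the test switched, a minimal sketch of the endpoint difference (assuming current torch semantics; this is not the test code itself): `torch.range` included its endpoint, while `torch.arange` follows Python's `range` and excludes it.

```
import torch

# torch.range(0, 4) used to return 5 values (endpoint included) and is deprecated;
# torch.arange(0, 4) follows Python's range convention and excludes the endpoint.
print(torch.arange(0, 4))          # tensor([0, 1, 2, 3])
print(torch.arange(0, 4).numel())  # 4
```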
73fead9f8f add shape alias (#1983) 2017-07-05 19:12:37 -04:00
3748b6d3eb Data parallel fix for https://github.com/pytorch/pytorch/issues/1857 (#1880)
* Data parallel fix for https://github.com/pytorch/pytorch/issues/1857
searches recursively for variable in input

* parallel_apply.py lint
2017-07-05 11:46:00 -04:00
b3589b04fd Fix exceptions not being caught (#1948)
Adding -fexceptions to both torch and pytorch C/C++ builds fixes tests
not passing.

Closes #1297
2017-07-05 00:25:39 -04:00
5964394a4c return empty iter when tensor is empty 2017-07-04 17:29:27 -04:00
1aaa24d99b add medianall prototype to docs 2017-07-04 16:52:36 -04:00
295ed7e264 Merge commit 'ab7d4e2bcea5cae8f05873fb0bbb31985cc58d47' 2017-07-04 16:47:48 -04:00
ab7d4e2bce add missing definition 2017-07-04 16:46:04 -04:00
ae65236490 Fix typo 2017-07-04 15:19:05 -04:00
c2069a15e0 Merge commit '56df97ce939985a30dcfefb1136bf45faf64413c' 2017-07-04 15:18:14 -04:00
56df97ce93 remove unnecessary contiguous assertion 2017-07-04 15:17:15 -04:00
89c682dfb9 Merge commit '0dbf871d9ec424f1a7897af77bf93219d3be23bf' 2017-07-04 14:56:53 -04:00
ae839f4b2e Merge commit 'f425c5216b7fe35dd03e0161a3440ec968c63636' 2017-07-04 14:56:22 -04:00
05c2bafc9d Have median reduce over all dims and return just the value when dim is not provided 2017-07-04 14:55:37 -04:00
0dbf871d9e Have median reduce over all dims and return just the value when dim is not provided 2017-07-04 14:55:30 -04:00
f425c5216b Have median reduce over all dims and return just the value when dim is not provided 2017-07-04 14:55:19 -04:00
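A minimal sketch of the behaviour described above (assuming the current torch.median API, not the code from these commits): with no `dim`, the reduction runs over all elements and returns a single value; with `dim`, it returns per-slice values and indices.

```
import torch

x = torch.tensor([[1.0, 5.0, 3.0],
                  [2.0, 4.0, 6.0]])

print(torch.median(x))                    # one value, reduced over all elements
values, indices = torch.median(x, dim=1)  # per-row medians plus their positions
print(values, indices)
```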
635bb5ec9d corrects typo 2017-07-04 11:09:40 -04:00
a7f6b0ab4f Merge commit 'e5bac2dd2d69772938482c1431db1fc1efb64c6f' 2017-07-03 20:41:28 -04:00
e5bac2dd2d Add critical section to BLAS gemm.
This is needed because of possible races in SpatialConvolutionMM (and others that use gemm)
if the BLAS library is not thread-safe.

In terms of performance, there's not much benefit to running two gemms in parallel, because the
BLAS libraries have their own all-occupying gemms anyway.
2017-07-03 20:40:21 -04:00
ec8da55a7d bind THS THCS, leaving all operators unimplemented. This is required because THPP can represent Sparse tensors even though the wrapper doesn't implement any operators. 2017-07-03 16:52:41 -07:00
b4414c0dc3 Handle None in modules list.
It's often useful to add None to an nn.ModuleList to keep the indexing
of the module list in line with some other property.
2017-07-03 18:53:21 -04:00
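A minimal sketch of the pattern this commit supports (assuming current torch.nn behaviour; illustrative only): storing `None` in an `nn.ModuleList` so list indices stay aligned with, say, block indices.

```
import torch.nn as nn

# Keep layer indices aligned with block indices even when some blocks
# carry no extra module.
layers = nn.ModuleList([nn.Linear(8, 8), None, nn.Linear(8, 8)])

for i, layer in enumerate(layers):
    if layer is not None:
        print(i, layer)
```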
39edc378fb Fix lint. 2017-07-03 18:51:22 -04:00
f6578c1b24 Implement double backwards for Dropout and FeatureDropout. 2017-07-03 18:51:22 -04:00
daa84e7663 Implement bilinear double backward. 2017-07-03 18:51:22 -04:00
1aa145dbac Implement ConstantPad2d double backwards. 2017-07-03 18:51:22 -04:00
d4b8834131 Improve non-contiguous testing in TestAutograd: (#1933)
* Improve non-contiguous testing in TestAutograd:
1) Test gradcheck and gradgradcheck with non-contiguous inputs
2) Test gradgradcheck with non-contiguous gradoutputs (gradcheck would take more work)
3) Fix discovered issue in Prod backwards.

* Simplify non-contiguous setting wrt View.
2017-07-03 18:49:52 -04:00
699d1ec7fb Address flaky Norm test issues:
1) Add a correction for 1.5 norms to ensure input can't be zero.
2) Increase test tolerance.
2017-07-03 18:48:22 -04:00
05062a1439 Better handle random seeds in tests.
Previously, there were 2 issues with test_autograd randomness:
1) Many random operations (e.g. random selection in prod_zeros) happened
   before the torch random seed was set (because it was set in run_tests
   at the end of the file).
2) The random seed was not set consistently: run_tests would set it to the
   proper value, but each call to setUp would set it to 0 (because SEED wasn't
   global in run_tests), which made setting the seed mostly worthless.
2017-07-03 18:48:22 -04:00
e187ba7a9f Decrease likelihood that Fmod/Remainder tests fail due to numerical jacobian check.
Previously, these tests added 5e-2 to the denominator tensor (the same as the div
tests), which only avoids divide by 0, but not issues with computing the numerical
jacobian due to non-linearity of fmod/remainder, when input / divisor is close to an
integer.  These tests now add 1.5 to the denominator, which is the same as the non-tensor
version of the tests; Note that we can still hit the above condition but it will be much
less likely.
2017-07-03 18:48:22 -04:00
35ed224d04 Merge commit '8a24f2b4d8646de10b497c2eca2f1edc525a1e09' 2017-07-03 00:49:59 -04:00
72b292d45c Merge commit '733a7c6d9a22dfc9be1b11d47384991208658bfb' 2017-07-03 00:49:52 -04:00
5b4cd9bb49 Merge commit 'c691fc6dc711814a06107d4a9b763f34bff5afca' 2017-07-03 00:49:34 -04:00
c691fc6dc7 Add a nonContigDim reduction kernel to improve latency for small tensors. (#768) 2017-07-03 00:39:40 -04:00
42cf68b402 Make reduction functors accept only constant arguments (#753)
(similar to MaxValuePair and MinValuePair above).
2017-07-03 00:35:39 -04:00
8a65ef1098 cc 2.0 -> 3.0 in docs. 2017-07-02 22:08:42 -04:00
406040f6a9 fix torch.is_tensor not recognizing HalfTensor (#1934) 2017-07-02 10:13:44 -04:00
e26139b7f7 fixed shapes in GRU and LSTM docs. 2017-07-01 23:15:10 -04:00
457587088a Fix broadcasting issues in binary_cross_entropy_with_logits (#1944)
* don't re-seed cuda device if in bad fork

* avoid broadcasting in binary_cross_entropy_with_logits

* assert input sizes for BCEWithLogitLoss

* added check that BCEWithLogitsLoss == Sigmoid + BCELoss

* fix flake8 issues

* rename test_bce_with_logits_gives_same_result_as_bce_and_sigmoid -> test_bce_with_logits_gives_same_result_as_sigmooid_and_bce_loss

* add warning in BCELoss about input shapes

* fix lint
2017-07-01 23:06:36 -04:00
da0fad8a7a Use torch.matmul in nn.Linear (#1935)
This takes advantage of the broadcasting behavior of torch.matmul to
support inputs with more than two dimensions. The extra dimensions are
treated like part of the batch dimension, much like nn.Bottle in Lua
Torch.

There are a few related small performance changes:

 * Addmm computes the gradient in column-major for inputs in
   column-major format
 * Variable.mm calls Addmm in-place with the desired output buffer
2017-06-30 16:53:26 -04:00
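A minimal sketch of the user-visible effect (assuming the current torch API; not the patch itself): extra leading dimensions of the input are treated as part of the batch.

```
import torch
import torch.nn as nn

linear = nn.Linear(16, 4)
x = torch.randn(10, 20, 16)   # (batch, seq, features)
y = linear(x)                 # leading dims are folded into the batch
print(y.shape)                # torch.Size([10, 20, 4])
```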
2c038f2074 Add weight normalization implementation (#1945)
* Add weight normalization implementation

This adds forward "pre-hooks" which get called before the module's
forward() method. Weight norm is implemented as a hook which calculates
the weight variable from the weight_g and weight_v every iteration.

Based on @rtqichen implementation.

* Specify return type
2017-06-30 15:41:40 -04:00
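A minimal usage sketch (assuming the torch.nn.utils.weight_norm wrapper as named in the current API; illustrative only): the wrapped module exposes `weight_g` and `weight_v`, and the pre-hook rebuilds `weight` from them before each forward.

```
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

m = weight_norm(nn.Linear(20, 40), name='weight')
print(hasattr(m, 'weight_g'), hasattr(m, 'weight_v'))  # True True

x = torch.randn(5, 20)
y = m(x)   # the forward pre-hook recomputes m.weight from weight_g and weight_v
```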
b3e500c522 fix docs generation warnings 2017-06-30 14:39:21 -04:00
b3f6ff1b3d Fix unused linker argument warnings. (#1958)
* Fix unused linker argument warnings.

This patch began when I noticed the following clang warning:

clang: warning: -Wl,-rpath,RIGIN: 'linker' input unused
clang: warning: argument unused during compilation:
'-L/home/ezyang/local/pytorch/torch/lib/tmp_install/lib'

The warning is minor, but I was a bit worried our rpath wasn't
setup correctly.  Actually, it was, and there wasn't a problem,
but I had to spend some time figuring out exactly what was going
on, and by the end of it, I might as well fix the warning.  In the end, I ended
up filing two upstream tickets for ccache and cmake:

- https://github.com/ccache/ccache/issues/189
- https://gitlab.kitware.com/cmake/cmake/issues/17025

We can remove the warning by using CMAKE_EXE_LINKER_FLAGS and
CMAKE_SHARED_LINKER_FLAGS, which have sane macro expansion rules
(although still slightly insane: the first level of escaping gets removed.)
To ensure that the rpath was being set correctly, I ran
objdump -x torch/lib/build/TH/libTH.so | grep RPATH and verified that ORIGIN
was setup correctly.

I also considered using CMAKE_INSTALL_RPATH, but the rpath here doesn't
seem to get set until you actually install, which is a change in behavior,
and I wasn't sure if anyone was relying on rpaths being setup in the build
directory.

There is a SLIGHT behavior change, in that if we happened to need these
LDFLAGS passed to the static linker, they won't get passed. I don't
think we ever build static libraries today so this shouldn't be a problem.

P.S. Because of the ccache bug, you may continue to see these warnings
after this patch.  If you apply https://github.com/ccache/ccache/pull/190
and clear your cache, it will solve the problem.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Remove unnecessary -Qunused-arguments

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-30 14:15:31 -04:00
6df23b418d mark tools as excluded in find_packages (#1915) 2017-06-29 13:49:56 -04:00
e5b5154768 Make cudnn warnings clean. (#1940)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-29 10:58:04 -04:00
bfaddc0a19 Warp intrinsic fixes (#785) 2017-06-29 00:14:07 -04:00
4d5075add2 Add ignore_index to nll_loss and cross_entropy (#1937) 2017-06-29 00:10:13 -04:00
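A minimal sketch of the new argument (assuming the current functional API; illustrative only): targets equal to `ignore_index` contribute nothing to the loss or its gradient.

```
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)
target = torch.tensor([1, 3, -100, 7])   # -100 marks a padding position

# positions whose target equals ignore_index are skipped entirely
loss = F.cross_entropy(logits, target, ignore_index=-100)
print(loss.item())
```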
0a95613cef Improve error message when accessing attributes that don't exist (#1936)
New:
   >>> torch.autograd.Variable(torch.randn(3, 3)).foobar
   AttributeError: 'Variable' object has no attribute 'foobar'

Old:
   >>> torch.autograd.Variable(torch.randn(3, 3)).foobar
   AttributeError: foobar
2017-06-28 20:13:15 -04:00
8a4eb50ed1 Speed up torch.matmul for 3D+ x 2D/1D tensors (#1931)
If the left tensor is 3D+ and the right tensor is at most 2D, we can
fold the batch into the matrix dimension and use torch.mm instead of
torch.bmm. In practice, this is faster especially if the right tensor is
column major.
2017-06-28 17:43:21 -04:00
b5e1df046e fixed typo in formula of GRU in doc (#1921) 2017-06-28 11:02:06 -04:00
08648061f7 Advanced Indexing 2A - Colons + Adjacent Adv Indexers (#1890) 2017-06-28 10:01:45 -04:00
4c35c630ec Enable norm gradgradchecks by lowering precision requirements. 2017-06-27 18:44:14 -04:00
3744efeaf8 Fix double backwards for prod. 2017-06-27 18:44:14 -04:00
bc032be13e Implement negative dimensions and double backwards cumprod. 2017-06-27 18:44:14 -04:00
f814a892cf don't re-seed cuda device if in bad fork (#1923) 2017-06-27 13:24:52 -04:00
d592e188f7 port of ConcatDataset (#1902) 2017-06-27 12:31:56 -04:00
ae61f3ff42 adds poisson NLL loss (#1779) 2017-06-27 10:04:54 -04:00
1f391a42f7 fix warnings for docs generation 2017-06-27 00:18:32 -04:00
b933423495 support more than 8 gpus (#774) 2017-06-26 16:49:14 -04:00
ee1b7b50b3 fix docs for broadcast warning 2017-06-26 14:50:57 -04:00
7cdd018db4 Fix assertEquals for lists and tuples (#1913)
zip finishes once the first iterator is exhausted, so we were erroneously allowing things like assertEquals([1, 2], [1]) to pass.
2017-06-26 14:13:21 -04:00
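A minimal sketch of the bug class being fixed (an illustrative helper, not the test-suite code): `zip` stops at the shorter iterable, so a length check is needed first.

```
def lists_equal(xs, ys):
    # zip alone would silently ignore trailing elements of the longer list
    if len(xs) != len(ys):
        return False
    return all(x == y for x, y in zip(xs, ys))

print(lists_equal([1, 2], [1]))     # False (a bare zip-based check would say True)
print(lists_equal([1, 2], [1, 2]))  # True
```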
7806a09f03 Fp16 fixes for CUDA 9 (#783) 2017-06-26 11:38:18 -04:00
7523c49f03 add missing INCREF 2017-06-26 11:33:16 -04:00
733a7c6d9a Fix segfault in SpatialDepthWiseConvolution w/o bias 2017-06-26 16:33:45 +02:00
32e666551a Fix lint. 2017-06-24 09:45:21 -04:00
ab0c321f80 Fix index_copy gradgrad test by ensuring indices cannot be repeated. 2017-06-24 09:45:21 -04:00
9db14936eb Ensure masked_select tests don't have masks of all zeros which yields
0-dimensional tensors.
2017-06-24 09:45:21 -04:00
e5857c5f1c Implement Gather double backwards. 2017-06-24 09:45:21 -04:00
7da77c4255 Add ScatterAdd autograd function. 2017-06-24 09:45:21 -04:00
656cb1c31a Implement and test double backwards for IndexCopy. 2017-06-24 09:45:21 -04:00
4ab4938cf0 Fix and test single backwards IndexCopy. 2017-06-24 09:45:21 -04:00
1324c4b081 Implement double backwards for masked_scatter. 2017-06-24 09:45:21 -04:00
bb3779efe8 Add broadcasting to masked_select. 2017-06-24 09:45:21 -04:00
7c24a3d5cf fix arguments for cudnnFindEx for transposed wgrad 2017-06-23 23:18:32 -04:00
194bc404b5 CUDA 9
Summary:
Adds basic CUDA 9 support, including adding Volta arch, and making appropriate modifications for half precision datatype changes
Closes https://github.com/facebookincubator/gloo/pull/49

Differential Revision: D5315336

Pulled By: pietern

fbshipit-source-id: 6468b0f357206d604bdcfec69ba82509a2c91407
2017-06-23 16:41:27 -07:00
a9ea975977 enable warnings in build and fix warnings 2017-06-23 11:49:09 -07:00
b1a84e3c70 update readme and add assign_(Scalar) variant 2017-06-23 11:27:55 -07:00
8a24f2b4d8 Fix segfault in SpatialDepthWiseConvolution w/o bias 2017-06-23 11:14:00 +02:00
66d93b60b3 fix a bug with scalar handling by simplifiying the maybeScalar check. 2017-06-22 23:07:56 -07:00
2af6ba3b2a handle select and operator[] style operations 2017-06-22 22:57:43 -07:00
b59b44fac7 add checks for scalars on output 2017-06-22 21:46:04 -07:00
a10a1c92b1 start adding rules to propagate scalar to results 2017-06-22 20:51:02 -07:00
bb6908e163 Scalar objects can now be backed by 0-dim Tensors. 2017-06-22 18:57:09 -07:00
c555cd8253 missing fixed allocator files 2017-06-22 18:32:10 -07:00
5e078bb7cc scalar flags added, and used to dispatch when there is a scalar variant of a function. broadcast annotations are used to figure out when a scalar s + A should also be converted. 2017-06-22 17:22:16 -07:00
ee10e7457f Corrected erroneous docstring for MultiLabelSoftMarginLoss 2017-06-22 17:42:18 -04:00
7cd6cc17af Merge commit '93e05eb458ad4c939e905668c1792692315880b0' 2017-06-22 17:23:02 -04:00
8bfef60b07 Merge commit '32fd4a3d6081a13c18ce4f8dcb37260a830a911f' 2017-06-22 17:22:31 -04:00
a45ad7cfba Advanced Indexing Part 1 -- Purely Integer Array Indexing 2017-06-22 17:21:50 -04:00
93e05eb458 Advanced Indexing Part 1 -- Purely Integer Array Indexing 2017-06-22 17:21:30 -04:00
32fd4a3d60 Advanced Indexing Part 1 -- Purely Integer Array Indexing 2017-06-22 17:21:19 -04:00
f09027bc29 Add batch sampler to DataLoader (#1867) 2017-06-22 20:18:31 +02:00
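A minimal usage sketch (assuming the current torch.utils.data API; illustrative only): a batch sampler yields lists of indices, and the loader then skips its own batching logic.

```
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset

dataset = TensorDataset(torch.arange(10).float(), torch.arange(10))
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
loader = DataLoader(dataset, batch_sampler=batch_sampler)

for xb, yb in loader:
    print(xb.shape)   # batches of 4, 4, 2
```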
9a196829e2 Merge commit '43dec0a210103c4421bc73c7e742f0f746b7e39e' 2017-06-22 13:55:54 -04:00
43dec0a210 Remove THCTensor_(expand2) and THCTensor_(expand3).
They are no longer needed and the corresponding TH versions have been removed.
2017-06-22 13:55:08 -04:00
064ef8b81b Merge commit '104234a6a8937f09208061975ce90190a7be4159' 2017-06-22 13:21:59 -04:00
662faf7c41 Merge commit 'a940d4ff8bf5debc76d909a778e2e47d24148ee1' 2017-06-22 13:21:38 -04:00
104234a6a8 add asserts to BCECriterion 2017-06-22 13:20:25 -04:00
a940d4ff8b add asserts to BCECriterion 2017-06-22 13:20:07 -04:00
c16a268f47 Merge commit 'fb32164a72004e63ebfe1f9ca8366ff12f8fbec2' 2017-06-22 12:56:36 -04:00
cb4eaa9c5d TensorLib/Aten --> changes required in pytorch 2017-06-22 12:55:55 -04:00
fb32164a72 TensorLib/Aten --> changes required in pytorch 2017-06-22 12:55:17 -04:00
b5854a11c4 Merge commit 'eccc759c36a4023357c87fde79732e4c916676d2' 2017-06-22 12:49:50 -04:00
ddbd4ef4ac Support out-of-place broadcast type definitions. 2017-06-22 12:49:06 -04:00
eccc759c36 Support out-of-place broadcast type definitions. 2017-06-22 12:48:43 -04:00
fecd05ba2f Merge commit '81e14ad2dee356b2c2274eb302bc2438c9a6161a' 2017-06-22 12:46:37 -04:00
a7d1cd75ec Merge commit '93a7c9de29900f166486373744a0e90c7046a56a' 2017-06-22 12:46:02 -04:00
497db732fc btrifact: Make pivoting optional. 2017-06-22 12:45:14 -04:00
81e14ad2de btrifact: Make pivoting optional. 2017-06-22 12:45:01 -04:00
93a7c9de29 btrifact: Make pivoting optional. 2017-06-22 12:44:51 -04:00
96febbb762 Merge commit '62cfc94f445bfaeaccc3dcc1fc69ea5b75039823' 2017-06-22 12:40:40 -04:00
62cfc94f44 improving TH error messages in Apply macros 2017-06-22 12:38:10 -04:00
3f6cda8696 fix bug of threshold activation 2017-06-22 12:23:35 -04:00
a836f8f56f Use and document saved_variables for double backwards. 2017-06-22 11:46:24 -04:00
278cbbae49 set TH_INDEX_BASE to 0 2017-06-21 16:43:16 -07:00
68cbb857f2 allow tensors to be constructed from views of external data. Support creating new tensors that already have a size/stride 2017-06-21 15:35:08 -07:00
a1c557bc45 improve error reporting for undefined tensors passed as arguments. 2017-06-21 12:24:59 -07:00
4c5b7d41ba tensor.data<> also has toLongData() variants. Scalar now also has .to<T>() variants 2017-06-21 11:57:37 -07:00
13e7648fd1 document accessors 2017-06-21 11:23:03 -07:00
1572173ca7 Implement double backwards for Sort, Topk. 2017-06-21 00:24:13 -04:00
e16ceef76a Implement Scatter double backwards. 2017-06-21 00:24:13 -04:00
b79ff11aca Implement IndexAdd, IndexFill, IndexSelect, MaskedSelect double backwards. 2017-06-21 00:24:13 -04:00
50c0912a75 Implemented masked_fill double backwards. 2017-06-21 00:24:13 -04:00
c3ad55f746 add readme and generated files for Type/Tensor/Functions to a doc folder to make it possible to view headers without building the library 2017-06-20 20:33:26 -07:00
4b93f32234 rename TensorLib -> ATen 2017-06-20 16:49:13 -07:00
03f41c8120 fix capitalization of Python, make it consistent 2017-06-21 00:09:37 +02:00
e0b70d0f64 Fix Fmod/Remainder gradgradcheck by ensuring inputs requires_grad. 2017-06-20 11:59:21 -04:00
0b2b7d0594 Kth value function passes gradgradcheck. 2017-06-20 11:59:21 -04:00
6d97ac0c0f Missing includes in cuda_collective_device.h
Summary: Closes https://github.com/facebookincubator/gloo/pull/47

Differential Revision: D5283752

Pulled By: pietern

fbshipit-source-id: 8ad3353b3455c5416e31e75b46755e2f7fcaad52
2017-06-20 08:54:16 -07:00
a405efa756 CUDA collectives as alternative to NCCL
Summary:
Adds a separate set of CUDA collectives that run on device as an
alternative to NCCL. Use these collectives as default on-device
collectives instead of NCCL.

Whenever multiple processes on the same machine use Gloo with NCCL and
end up doing concurrent CUDA memory allocations and algorithm
execution, we risk deadlock. A follow up change will enable opt-in
usage of NCCL (e.g. through environment variable).

Benchmark output below with varying number of elements. It shows a
minor improvement over using NCCL for local reduction and broadcast.

Number of elements equal to on-device threshold (256K):

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before)  262144       2685       2907       3035       3215        562
(after)   262144       2682       2874       3013       3395        577

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring_chunked
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before)  262144       2045       2133       2325       2643        725
(after)   262144       1533       1673       1834       2048        800

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_halving_doubling
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before)  262144       1580       1640       1718       2069        893
(after)   262144       1371       1446       1539       1748       1125
```

Larger number of elements (4M):

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before) 4194304      55543      58058      60103      62659         32
(after)  4194304      54490      57923      60893      66058         33

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_ring_chunked
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before) 4194304      18049      22820      24997      26634        105
(after)  4194304      18356      20463      21695      22589         99

Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   cuda_allreduce_halving_doubling
Options:     processes=2, inputs=8, gpudirect=no

        elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
(before) 4194304      18584      24345      27809      29722         95
(after)  4194304      19541      22718      25408      26688         88
```

Reviewed By: akyrola

Differential Revision: D5278192

fbshipit-source-id: 53f09e404663ddc8bb46d06ac87afd8ee3ffc3a2
2017-06-20 00:23:43 -07:00
67968cb60b Add numerically stable BCELoss which takes logits as input (#1792) 2017-06-19 22:05:51 -04:00
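A minimal sketch of why a logits-based loss helps (assuming the current functional API; illustrative, not the commit's implementation): composing `sigmoid` with `binary_cross_entropy` loses precision once the sigmoid saturates, whereas the logits version works in log space.

```
import torch
import torch.nn.functional as F

logits = torch.randn(8) * 50                  # magnitudes where sigmoid saturates
targets = torch.randint(0, 2, (8,)).float()

composed = F.binary_cross_entropy(torch.sigmoid(logits), targets)  # precision lost at saturation
stable = F.binary_cross_entropy_with_logits(logits, targets)       # works directly on logits
print(composed.item(), stable.item())
```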
a6c5e3f2e2 Fix case where interface doesn't have an address
Summary:
Code in tcp/transport tries to find the network interface a socket was
bound to when create a TCP device context. Per getifaddrs(3), it is
possible for the ifa_addr field to be NULL (supposedly when an
interface doesn't have an address). Ignore such entries.

Thanks to slayton58 for reporting this.

Reviewed By: wesolwsk

Differential Revision: D5279376

fbshipit-source-id: 039380b95ba4d6d94942c30581e0b230a060870c
2017-06-19 18:05:32 -07:00
6ee6b4980b multiple docs 2017-06-19 20:06:27 -04:00
ceb13c8cc3 Don't propagate -mavx flag to dependents
Summary:
Previously, `gloo/math.h` inlined methods which use AVX builtins,
which required propagating the `-mavx` flag.
This diff moves these definitions out of the header and into a source
file to avoid this.

Reviewed By: pixelb

Differential Revision: D5271043

fbshipit-source-id: dde4dc560dfb557b46d1a582a8b38e7cb8eb0c37
2017-06-19 16:46:43 -07:00
82ef292f00 Add gradgradchecks for various autograd Functions and support Unfold double backwards. 2017-06-19 18:19:16 -04:00
76ee014d10 Add documentation to SELU and AlphaDropout 2017-06-19 18:18:01 -04:00
f619ac6ac9 Quickfix for AlphaDropout on CUDA 2017-06-19 18:18:01 -04:00
32e6372538 Split cuda_collectives.h into two files
Summary:
This change prepares for having a separate set of collectives that
use native CUDA calls instead of NCCL. This is needed to workaround
the issue where NCCL deadlocks when it is interleaved with CUDA memory
management operations in other processes on the same machine.

Includes a modification to the host reduction functions to bring them
up to parity with the NCCL reduction functions (they now incorporate
offset/counter arguments).

Reviewed By: wesolwsk

Differential Revision: D5276291

fbshipit-source-id: 8844731760d2c48577d207c026ce0cd641f2fc6d
2017-06-19 12:57:53 -07:00
172a356668 forgotten import in variables.py
Fixing error on line 661: 
warnings.warn("masked_copy_ is deprecated and renamed to masked_scatter_, and will be removed in v0.3")
NameError: name 'warnings' is not defined
2017-06-19 14:23:48 +02:00
329a2f7d27 Prevent divide by zero in dropout with p=1 2017-06-17 11:38:02 -04:00
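A minimal sketch of the corner case (assuming current torch behaviour; illustrative only): with p=1 the usual 1/(1-p) rescaling would divide by zero, so the output should simply be all zeros.

```
import torch
import torch.nn.functional as F

x = torch.randn(4, 4)
y = F.dropout(x, p=1.0, training=True)   # everything dropped; no NaN/inf from 1/(1-p)
print(y)
```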
69e38ee821 clean test code, no functional change 2017-06-17 11:11:48 -04:00
38e6b9c7e7 fix bug in wrap_outputs miscounting the number of inputs 2017-06-17 11:11:48 -04:00
7775e9e777 add newNarrow to thpp THCTensor 2017-06-17 11:11:48 -04:00
293262b8f1 fix cuda tests 2017-06-17 11:11:48 -04:00
e66e01a2a0 remove extra computations for input usage check 2017-06-17 11:11:48 -04:00
0a93903e8e move tests to test_nn 2017-06-17 11:11:48 -04:00
bcac55dd2f force 1 stride for 1-sized dim for cudnn, fix lint, remove extra unpacking 2017-06-17 11:11:48 -04:00
6cdcd9c603 Add Narrow function
clean error message and support non perfectly sized inputs
2017-06-17 11:11:48 -04:00
075030d974 add cuda tests that use only cunn for finite difference computations 2017-06-17 11:11:48 -04:00
23dec70614 comment on working values for epsilon 2017-06-17 11:11:48 -04:00
fc0ab229ad remove extra cloning and add contiguous calls 2017-06-17 11:11:48 -04:00
ce3bc5a4a5 force cloning of weights 2017-06-17 11:11:48 -04:00
3dbece7eb5 clean tests 2017-06-17 11:11:48 -04:00
bd94718c87 cleaner AccumulateGrad 2017-06-17 11:11:48 -04:00
2f8d21a7f2 add contiguous function 2017-06-17 11:11:48 -04:00
4f4fc9091a add support for newTranspose in thpp::THCTensor 2017-06-17 11:11:48 -04:00
7ee095cf7f add newExpand and newView to thpp::Tensor 2017-06-17 11:11:48 -04:00
462ab8a644 add Transpose View Expand C functions 2017-06-17 11:11:48 -04:00
dd5c7c473f Add ConvBackwardBackward class 2017-06-17 11:11:48 -04:00
6dca309017 make AccumulateGrad support no input gradient 2017-06-17 11:11:48 -04:00
f945fbc3dd add gradgradcheck and conv double backward tests 2017-06-17 11:11:48 -04:00
db70d4d223 1) Simplify CompareOp autograd backward
2) Use better approach for avoiding divide-by-0 in autograd tests.
2017-06-17 09:38:28 -04:00
7714b5a088 Fix autograd shape tracking for 1-d reduction ops. 2017-06-17 09:38:28 -04:00
860f51e67f Avoid nans in fmod/remainder tensor tests.
Also clean up CompareOp autograd backwards impl.
2017-06-17 09:38:28 -04:00
2c04ce63a5 Fix masked_scatter autograd broadcasting. 2017-06-17 09:38:28 -04:00
83bfa5e1ab Fix masked_scatter pointwise autograd backward behavior. 2017-06-17 09:38:28 -04:00
618f20fb38 Fix autograd broadcasting for masked_fill. 2017-06-17 09:38:28 -04:00
9711223c12 Add broadcast autograd tests for dist. 2017-06-17 09:38:28 -04:00
7d0f1c51bb Fix autograd broadcast for min, max. 2017-06-17 09:38:28 -04:00
7560474fbb Fix autograd pointwise fallback for max,min. 2017-06-17 09:38:28 -04:00
e69fe5bdb0 Automatically detect when to skip inplace tests and fix lint. 2017-06-17 09:38:28 -04:00
f3ae90e329 Fix broadcast and pointwise compare ops with autograd. 2017-06-17 09:38:28 -04:00
bfdd1f2199 Fix fmod/remainder autograd broadcasting. 2017-06-17 09:38:28 -04:00
b164efb8b0 Fix lerp broadcast autograd. 2017-06-17 09:38:28 -04:00
94c7260087 Fix pointwise fallback for lerp. 2017-06-17 09:38:28 -04:00
aac459431b Fix pow autograd broadcast. 2017-06-17 09:38:28 -04:00
a04d1af0a4 Fix addr, addmm, baddmm, addmvm, addbmm broadcasting with autograd.
Fix autograd broadcast for addmm, baddmm, others.
2017-06-17 09:38:28 -04:00
a54a7c1312 Fix addcmul, addcdiv autograd broadcasting. 2017-06-17 09:38:28 -04:00
9ba799c26b Fix pointwise fallback for addcdiv, addcmul. 2017-06-17 09:38:28 -04:00
5cfb1329b5 Make implementation of Variable.mul_ and Variable.div_ consistent. 2017-06-17 09:38:28 -04:00
af2dd0d3e9 Fix autograd for broadcasting with add, sub, mul, div. 2017-06-17 09:38:28 -04:00
79a343bbd4 Remove unnecesssary squeezing in Expand backwards.
Also add size checks to test_autograd to try to catch such issues.
2017-06-17 09:38:28 -04:00
88e4bec8fa resize bug fix 2017-06-17 11:07:22 +02:00
faa7c2cc2c fix cuda breakage 2017-06-16 20:13:46 -04:00
3cecdf84f1 Storage from_file method (#1821) 2017-06-17 00:34:20 +02:00
49586d9556 Add basic API support for NCCL 2.0
Summary:
\cc pietern
Minimal changes to allow gloo to compile and run with NCCL 2.0
Closes https://github.com/facebookincubator/gloo/pull/46

Differential Revision: D5268074

Pulled By: pietern

fbshipit-source-id: 58d625d57b31cfc932f3dbbdd7a4b83d9a2e60a8
2017-06-16 15:22:14 -07:00
8d33603901 make t() of Variable consistent with Tensor (#1823) 2017-06-16 16:08:53 +02:00
a64560c22e Remove flattening for torch.dot (#1781) 2017-06-16 02:15:33 +02:00
97f50edf46 Add documentation for Cholesky lapack functions (#1816) 2017-06-16 02:10:56 +02:00
86a96cd759 Merge commit 'd605afe8b51bf1522d3caf4efef4b3c85def499b' 2017-06-15 12:33:45 -04:00
f61ec2495e nn.EmbeddingBag to compute a bag of word embeddings (Embedding + Sum/Mean) 2017-06-15 12:32:47 -04:00
d605afe8b5 nn.EmbeddingBag to compute a bag of word embeddings (Embedding + Sum/Mean) 2017-06-15 12:32:28 -04:00
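A minimal usage sketch (assuming the current nn.EmbeddingBag API; illustrative only): bags are packed into one flat index tensor, with `offsets` marking where each bag starts.

```
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='mean')

indices = torch.tensor([1, 2, 4, 4, 3])   # two bags: [1, 2, 4] and [4, 3]
offsets = torch.tensor([0, 3])            # start position of each bag

out = bag(indices, offsets)               # Embedding followed by a per-bag mean
print(out.shape)                          # torch.Size([2, 3])
```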
909f31764f Add nn.padding to docs fixes #1127 (#1808)
* exposed nn.padding modules

* using functional
2017-06-15 07:41:38 -04:00
ea5819045e a few comments in build_all.sh (#1807) 2017-06-14 17:58:56 -04:00
29a1a916dc Add support for CUDA9 half semantics 2017-06-14 11:20:24 -07:00
9c53c6dcb9 Fix errors and warnings when building docs (#1806) 2017-06-14 13:50:14 -04:00
9d916e561c batch norm docfix (#1804)
fixes the formula for batch normalization (moves the epsilon inside
the square root)
2017-06-14 11:57:46 -04:00
4e356528b4 Add torch.matmul function. (#1780)
* Add torch.matmul function.

Includes test_torch, test_autograd and docs changes.

* Add __all__ to functional so names aren't accidentally imported.

* Include unbind in __all__.

* Add matmul case for when one argument is 1-dimensional and the other
at least 3-dimensional.

* Add squeeze_ to Variable.

* Use squeeze_ instead of squeeze for matmul.
2017-06-14 08:14:53 -04:00
9fd354e643 More accurate build instructions based on @apaszke's comments. (#1800) 2017-06-14 12:04:45 +02:00
c8e9bc493b Merge commit '244af06adc77674e7e1134d67d4a56ae7641f7b9' 2017-06-13 20:49:37 -04:00
6de5ce6bac Merge commit '1cf105d517c4308912eee85eff8f50f31c9e31f1' 2017-06-13 20:49:13 -04:00
38b9598685 Added GLU (gated linear unit)
From https://arxiv.org/abs/1612.08083
2017-06-13 20:48:19 -04:00
244af06adc Added GLU (gated linear unit)
From https://arxiv.org/abs/1612.08083
2017-06-13 20:48:03 -04:00
1cf105d517 Added GLU (gated linear unit)
From https://arxiv.org/abs/1612.08083
2017-06-13 20:47:55 -04:00
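A minimal sketch of the unit (per the linked paper; assuming the current functional API for the comparison): the input is split in half along a dimension and one half gates the other through a sigmoid.

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 8)
a, b = x.chunk(2, dim=-1)

manual = a * torch.sigmoid(b)
print(torch.allclose(manual, F.glu(x, dim=-1)))   # True
```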
3ada9da808 Make csrc -Werror clean. (#1795)
Primary things I had to fix:

- Suppress _XOPEN_SOURCE warnings by ensuring that Python.h is included
  first, because it always unconditionally defines this macro.

- Turn off strict aliasing, because Python 2 doesn't work with strict
  aliasing.

- Workaround setuptools bug, where it's incorrectly passing
  -Wstrict-prototypes to C++ compilers (where this doesn't make
  any sense)

To compile csrc with -Werror, run `CFLAGS="-Werror" python setup.py build_ext`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 20:18:09 -04:00
5a63a6d47f Better document how to rebuild only parts of the project. (#1796)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 17:23:39 -04:00
38a48729f0 Merge commit '1a6995b28ca42df41270d4fd914adfb9c8c59674' 2017-06-13 16:31:48 -04:00
deb0aef30c Merge commit '122dd9e8ec4627ccdd895a7dc88a1ec6f13ad6d2' 2017-06-13 16:31:13 -04:00
3977ee3520 Support device on sparse tensor constructor, assert values/indices on same device.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:35 -04:00
c0e7bda3f1 Enforce storage is not NULL invariant for sparse tensors.
Fixes #1783.

There is an undocumented invariant in PyTorch that we should
try to avoid having storage == NULL as much as possible (even
though Torch supports it.)  This commit properly documents the
invariant, and fixes a bug in sparse where the invariant was
not respected.  This now means that sparse tensors now correctly
remember what GPU they are associated with.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:35 -04:00
df412051fd Add comment stating nDenseTensors != nTensors in checkGPU.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:35 -04:00
7bee03fe1e Do NOT clone indices/values passed to sparse tensor by default.
Fixes #1782.

The default operation should be cheap: user can always choose to
explicitly make a copy on the way in.  Note that this is a
BACKWARDS COMPATIBILITY BREAKING change.  However, we DO create
a new tensor wrapper (so we are not affected by subsequent
size changes, etc.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:34 -04:00
865beada0e Add comment about new implementation being CPU-only.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:34 -04:00
6a46863c83 Abort on known bug (#1521) for spcadd on non-coalesced.
It's better to error than to silently give wrong results.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
d763db59a9 More efficient nnz test in spcadd.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
5d6e593c67 Test clone preserves uncoalescedness if it wasn't coalesced.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
bac408b693 Add some docs about storage->Size.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
2f967a204c Sparse tensor clone() preserves coalescedness.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:30:19 -04:00
1a6995b28c Short-circuit copy if src and dest are equal.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:20:04 -04:00
122dd9e8ec Short-circuit copy if src and dest are equal.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-13 16:19:35 -04:00
7c024e93c6 Implement Cumprod function for autograd (#1439) 2017-06-13 17:48:15 +02:00
b4698d6d1d add init to __init__.py of torch.nn (#1789) 2017-06-13 09:02:30 -04:00
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
714351ff39 Officially enable process-group mode 2017-06-12 22:02:11 -04:00
6f51b4ce2d Fix deadlock in GlooCache 2017-06-12 22:00:22 -04:00
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00
23ab9d481a Add Module._all_buffers 2017-06-12 21:58:38 -04:00
8db8716c7c Support non-default streams in NCCL reduce 2017-06-12 21:58:38 -04:00
b37f18be53 Free GIL when entering THD functions 2017-06-12 21:58:38 -04:00
5a0d5ec058 Add more checks in torch.distributed 2017-06-12 21:58:38 -04:00
095ddc7d08 THD updates and bug fixes
* Add keepdim
* Fix DataChannel signature
* Fix incorrect locking
* Use current stream in DataChannelGloo
2017-06-12 21:58:38 -04:00
86a065e45b Add end callbacks to the engine 2017-06-12 21:58:38 -04:00
59d438de2e change function to remove dependence on CUDA 8.0
Summary: Replace call to function that is only supported in CUDA 8.0 with one that has been supported in previous releases.

Reviewed By: pietern

Differential Revision: D5231755

fbshipit-source-id: d72aec2a4a1c511064a65142887f8a05b51dad55
2017-06-12 15:53:59 -07:00
6626881e7a Add Alpha Dropout (#1775) 2017-06-13 00:39:49 +02:00
49ec984c40 Ensure warnings are repeated in python2 for tests. 2017-06-11 05:37:59 -04:00
afaad94fed Rename autograd keepdim tests that now default to True. 2017-06-11 05:37:59 -04:00
4f602a52b5 Use THPUtils_assert rather than THError in torch/csrc/Module. 2017-06-11 05:37:59 -04:00
3abc8be42c Clarify use of warn vs raise in expand_utils and don't catch exception in Broadcast plugin when fallback = false. 2017-06-11 05:37:59 -04:00
f4ce99fd87 Add dist, atan2, lerp to fallback functions.
They weren't documented as having those semantics, but tests on
master show they do.
2017-06-11 05:37:59 -04:00
d5a0f97ea7 Renamed masked_copy to masked_scatter in test, fix use of break/continue. 2017-06-11 05:37:59 -04:00
e8ec4110f6 Fix Prod backward for broadcasting. 2017-06-11 05:37:59 -04:00
ffd808768e Remove raiseErrors from THTensor functions, have THStorage functions take an error_buffer to return a proper error message while being able to handle memory management correctly from calling function. 2017-06-11 05:37:59 -04:00
5b81746767 Simplify python warning settings and cleanup tests. 2017-06-11 05:37:59 -04:00
d49b73bbe6 Rename check_fallback to check_backincompat_expand_warn for clarity. 2017-06-11 05:37:59 -04:00
7040b82ede Change async/broadcast copy arguments to be parsed as ints. 2017-06-11 05:37:59 -04:00
723819014e Move expand_utils-inl.h to generic/ and generate via macros. 2017-06-11 05:37:59 -04:00
1ef4cc1591 Incorporate review comments:
1) Line up trailing dimensions in broadcast docs.
2) remove unnecessary expand_as in common_nn test.
3) use view in tensor_str instead of resize_.
4) newExpand remove raiseErrors change.
5) clarify expandedSizes/expandedStrides parameters in inferExpandGeometry.
6) simplify inferSize2/inferSizeN implementations.
7) use new-style classes for warning.
2017-06-11 05:37:59 -04:00
deec86cc05 Clarify a number of comments. 2017-06-11 05:37:59 -04:00
7da46097fe Fix lint errors. 2017-06-11 05:37:59 -04:00
21d9b0c9dd Ensure warnings are repeated in test, necessary in python2. 2017-06-11 05:37:59 -04:00
69287250d1 Add a broadcast parameter to copy_, use it in the library in cases where there is non-broadcasting calls exposed by the tests. 2017-06-11 05:37:59 -04:00
74a23c5aba Fix test_broadcast for cuda tensors, since map_, map2_ not implemented. 2017-06-11 05:37:59 -04:00
177785eecf explicit Ptr constructors, fast transposed copy. 2017-06-11 05:37:59 -04:00
ad9604f45a Add documentation for copy_. 2017-06-11 05:37:59 -04:00
65b23f146e Add broadcasting support for copy_, simplify code generation by moving a lot of currently generated code to expand_utils. 2017-06-11 05:37:59 -04:00
c54e532954 Add broadcasting support for map_, map2_. 2017-06-11 05:37:59 -04:00
ec120fac0c Add broadcasting support for masked_copy, masked_fill. 2017-06-11 05:37:59 -04:00
e06523482a Use THSize_isSameSizeAs, instead of THTensor_(isSameSizeAs) in order to compare sizes of tensors with different data types. 2017-06-11 05:37:59 -04:00
d6fb92fec9 Improve in-place broadcasting back compat warning message and fix an issue where the deprecated warning would not be printed. 2017-06-11 05:37:59 -04:00
5e1a714386 Add backwards incompatibility docs. 2017-06-11 05:37:59 -04:00
be65f46c76 Add optional warning for backwards incompatible keepdim. Setting torch.utils.backcompat.keepdim.warning.enabled=True will cause Python warnings in the case where the default value of keepdim is used for 1-d reductions.
Also specify keepdim via kwargs in library so these warnings have less
noise.
2017-06-11 05:37:59 -04:00
3556d1b8a3 Add optional warning for backwards incompatible broadcast.
Setting torch.utils.backcompat.broadcast.warning.enabled=True
will cause Python warnings in the case where broadcast occurs
but previously 1-d view style pointwise ops occurred.
2017-06-11 05:37:59 -04:00
5af46cb352 Add broadcasting support for matmul. 2017-06-11 05:37:59 -04:00
a36f95fe26 Add broadcast support for fused-matmul broadcasting. Functions are: addmm, addbmm, addr, addmv, baddbmm. 2017-06-11 05:37:59 -04:00
cd35091d9b Include simple broadcasting example and demonstrate lining up trailing dimensions. 2017-06-11 05:37:59 -04:00
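A minimal sketch of lining up trailing dimensions (assuming the numpy-style broadcasting adopted here; illustrative sizes): shapes are compared from the last dimension backwards, and size-1 or missing dimensions are expanded.

```
import torch

a = torch.randn(5, 1, 4, 1)
b = torch.randn(   3, 1, 7)   # lined up against a's trailing dimensions

print((a + b).shape)          # torch.Size([5, 3, 4, 7])
```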
3c586d196a Document Broadcast Plugin. 2017-06-11 05:37:59 -04:00
8e2f347951 Proof that broadcasting 3 args (expand3) is equivalent to breaking up operation. 2017-06-11 05:37:59 -04:00
d279c6e099 Docs for addcdiv, addcmul 2017-06-11 05:37:59 -04:00
014372e707 Support "fused" ops: addcmul/addcdiv. 2017-06-11 05:37:59 -04:00
92fde6cf06 Breakup in place broadcast to better handle multiple arguments. 2017-06-11 05:37:59 -04:00
b44ea57ba8 Change order of Broadcast specification.
Since fused ops require broadcasting self over multiple other arguments,
it is simpler to specify broadcast on self rather than the other
way around.
2017-06-11 05:37:59 -04:00
e96f854ce2 Implement/test broadcasting semantics for comparison ops. 2017-06-11 05:37:59 -04:00
edf2969bd8 Backwards compatible Spatial Normalizations / CrossMapLRN. 2017-06-11 05:37:59 -04:00
e653fe2857 Test fixes for keepdim=False, suppress warnings on backwards-compatible behavior. 2017-06-11 05:37:59 -04:00
70c33777a6 pow, fmod, remainder also should fallback.
This behavior isn't listed in the docs, but the tests depend on it.
2017-06-11 05:37:59 -04:00
471dfe9791 Add documentation including links to numpy broadcasting semantics. 2017-06-11 05:37:59 -04:00
85d838a028 Testing over the following: 1) CPU tensor out-of-place functions 2) CPU tensor in-place functions 3) GPU tensor out-of-place functions 4) GPU tensor in-place functions 5) torch. functions 6) Fallback semantics (use pointwise nElem matching rather than broadcasting) 2017-06-11 05:37:59 -04:00
6a40acb4f0 Add Broadcast plugin. 2017-06-11 05:37:59 -04:00
9087624634 Revert "Restore examples with keepdim=True default."
This reverts commit 6fab62173e842bbf550de1c68cfae507ca35b800.
2017-06-11 05:37:58 -04:00
e772a440cb Revert "Change keepdim default to False."
This reverts commit e124790cb2b6675a4b6edf64620a7eb7f7228b29.

Note the original commit message is incorrect; this changes keepdim
back to false.
2017-06-11 05:37:58 -04:00
efd8b54be2 Merge commit 'e45c1046feba46aef2ffac1b1d978a3e76936bab' 2017-06-11 05:37:51 -04:00
54c3441e9c Merge commit '7d1b042cb2198d2bdb5871b08c6c0fb2ccc8e6b1' 2017-06-11 05:37:18 -04:00
7d1b042cb2 fix type 2017-06-11 04:42:34 -04:00
e45c1046fe Remove raiseErrors from THTensor functions, have THStorage functions take an error_buffer to return a proper error message while being able to handle memory management correctly from calling function. 2017-06-11 04:33:54 -04:00
a563ce1105 Incorporate review comments:
1) Line up trailing dimensions in broadcast docs.
2) remove unnecessary expand_as in common_nn test.
3) use view in tensor_str instead of resize_.
4) newExpand remove raiseErrors change.
5) clarify expandedSizes/expandedStrides parameters in inferExpandGeometry.
6) simplify inferSize2/inferSizeN implementations.
7) use new-style classes for warning.
2017-06-11 04:33:54 -04:00
92d52bf395 Add broadcasting support for copy_, simplify code generation by moving a lot of currently generated code to expand_utils. 2017-06-11 04:33:54 -04:00
0463ddf16b Support "fused" ops: addcmul/addcdiv. 2017-06-11 04:33:54 -04:00
9060e6be7f Remove raiseErrors from THTensor functions, have THStorage functions take an error_buffer to return a proper error message while being able to handle memory management correctly from calling function. 2017-06-11 04:32:08 -04:00
f0b8c4821b Incorporate review comments:
1) Line up trailing dimensions in broadcast docs.
2) remove unnecessary expand_as in common_nn test.
3) use view in tensor_str instead of resize_.
4) newExpand remove raiseErrors change.
5) clarify expandedSizes/expandedStrides parameters in inferExpandGeometry.
6) simplify inferSize2/inferSizeN implementations.
7) use new-style classes for warning.
2017-06-11 04:32:08 -04:00
0f79bf1a69 Clarify a number of comments. 2017-06-11 04:32:08 -04:00
503002eda7 Add broadcasting support for copy_, simplify code generation by moving a lot of currently generated code to expand_utils. 2017-06-11 04:32:08 -04:00
cf55e1e48a Add broadcasting support for masked_copy, masked_fill. 2017-06-11 04:32:08 -04:00
8d35d4215b Use THSize_isSameSizeAs, instead of THTensor_(isSameSizeAs) in order to compare sizes of tensors with different data types. 2017-06-11 04:32:08 -04:00
9356640453 Properly clean up expand error cases. 2017-06-11 04:32:08 -04:00
ae6b8d0112 Include simple broadcasting example and demonstrate lining up trailing dimensions. 2017-06-11 04:32:08 -04:00
ec2f6a81fd Support "fused" ops: addcmul/addcdiv. 2017-06-11 04:32:08 -04:00
1f9a365fdc Add Infer Size N, for expansion of fused operations. 2017-06-11 04:32:08 -04:00
d38a87217f Expand improvements
1) Rename calculateExpandGeometry to inferExpandGeometry for consistency
2) Simplify inferExpandGeometry implementation by using a single pass
   through dimensions
3) Implement a two operand expansion, expand2.
4) Implement versions that return error code to use for fallback to
equal nElem support.
2017-06-11 04:20:04 -04:00
baa4ba973b Expand improvements
1) Rename calculateExpandGeometry to inferExpandGeometry for consistency
2) Simplify inferExpandGeometry implementation by using a single pass
   through dimensions
3) Implement a two operand expansion, expand2.
4) Implement versions that return error code to use for fallback to
equal nElem support.
2017-06-11 04:19:37 -04:00
a24db91a38 Add SELU activation function (#1769)
* Add SELU activation function

* Remove unnecessary case

* Add Function for SELU + tests and fix RReLU inplace

* Fix extra line in doc

* Fix tests

Remove in-place tests for RReLU. For some reason they fail on legacy nn, but pass on nn

* SELU in new-style Function

It also supports double backprop, verifyed with gradgradcheck

* Fix flake8
2017-06-11 10:07:48 +03:00
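A minimal sketch of the activation (constants as given in the SELU paper; assuming the current functional API for the comparison):

```
import torch
import torch.nn.functional as F

alpha = 1.6732632423543772
scale = 1.0507009873554805

x = torch.randn(6)
manual = scale * torch.where(x > 0, x, alpha * (torch.exp(x) - 1))
print(torch.allclose(manual, F.selu(x)))   # True
```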
e3d5826b92 Add Cumsum double backwards support. (#1758) 2017-06-10 18:27:44 +02:00
ba690d5607 Add support for NVTX functions. (#1748) 2017-06-10 18:26:58 +02:00
5f1a16a018 Torch manual seed to seed cuda devices (#1762) 2017-06-10 12:37:21 +02:00
dcf07a2d7f Fix typo in ParameterList documentation 2017-06-10 02:16:52 +02:00
fab5bef9f6 Merge pull request #45 from slayton58/nccl_cmake_fix
Fix NCCL directory typo
2017-06-08 11:28:25 -07:00
21a5c8ea5e Fix use of nccl_INCLUDE_DIRS in nccl.cmake 2017-06-07 20:13:11 -04:00
5300aafc1f Fix NCCL directory typo 2017-06-07 17:01:13 -04:00
a9bd1de9e9 fixed README to reflect docker image name (#1751) 2017-06-07 15:49:39 -04:00
e57eef4bcb Merge commit '62835fc3f5346968b4dca392c77efdeb75a6b172' 2017-06-07 14:54:47 -04:00
d81da41650 Make sure the number of MKL and OpenMP threads match
Otherwise, on many machines, the size of the OpenMP thread pool will
change between MKL and our OpenMP enabled functions. The constant thread
creation and destruction results in worse performance and leaks memory
on GCC 5.4
2017-06-07 14:53:29 -04:00
62835fc3f5 Make sure the number of MKL and OpenMP threads match
Otherwise, on many machines, the size of the OpenMP thread pool will
change between MKL and our OpenMP enabled functions. The constant thread
creation and destruction results in worse performance and leaks memory
on GCC 5.4
2017-06-07 14:53:14 -04:00
da7957c660 Fix masked_copy call to masked_scatter. (#1749) 2017-06-07 12:58:47 -04:00
2a49353d5e minor fix for docs of Upsample 2017-06-07 11:42:52 -04:00
b05c23de44 Merge commit 'da45b4c6b3b0b7cd8f0dc612b9afa6a3a07b8305' 2017-06-07 11:31:38 -04:00
019e967113 Merge commit '47bf87b9220c10edaafec98c6bd20bdb1436c8e4' 2017-06-07 11:30:35 -04:00
b9ab26765e Add 3D upsampling (nearest and trilinear) with tests 2017-06-07 11:29:27 -04:00
da45b4c6b3 Add 3D upsampling (nearest and trilinear) with tests 2017-06-07 11:24:41 -04:00
47bf87b922 Add 3D upsampling (nearest and trilinear) with tests 2017-06-07 11:24:05 -04:00
edd41d8d80 BatchNorm fallback to THNN when eps < CUDNN_BN_MIN_EPSILON (#1742) 2017-06-07 09:56:28 -04:00
352f8b2fa6 Merge commit 'ced01f6c919c4b7109512ce797a2a0185c8f8112' 2017-06-07 09:22:14 -04:00
ced01f6c91 fix GRUFused signature 2017-06-07 09:21:20 -04:00
d351239c10 fix legacy ClassNLLCriterion for upstream change 2017-06-07 00:38:00 -04:00
1b1579c89d Merge commit 'b96f76e470b25454b6b14c7ace888686295405e9' 2017-06-07 00:19:42 -04:00
df7c47142d fix for THNN NLLLoss signature change 2017-06-07 00:18:11 -04:00
b96f76e470 standalone macros 2017-06-07 00:17:05 -04:00
7e62971c86 Merge commit '71ccedbc6c4e460d38c794737bba780e7673e888' 2017-06-06 23:38:52 -04:00
a7d987544d Merge commit '4e49aed5eaa5a4abaf0a51bb87a49b44394ea3c3' 2017-06-06 23:35:42 -04:00
4e49aed5ea fix outputHeight <-> outputWidth 2017-06-06 23:33:51 -04:00
71ccedbc6c Merge pull request #470 from qqning/master
Fix the mix-up of height and width on depth-wise convolution
2017-06-06 23:31:54 -04:00
c3cda260b6 Merge commit '64faf120acb97866dfd90bf428b385deee4ee912' 2017-06-06 23:27:45 -04:00
22949350b6 More performant fix for fused rnn kernels (#1532) and bugfix (#1721) 2017-06-06 23:25:31 -04:00
3f7b48ccda Remove clone in fused rnn 2017-06-06 23:20:14 -04:00
db620304b2 More performant fix for fused rnn kernels (#1532) and bugfix for #1721 2017-06-06 23:13:07 -04:00
d7db75c10f added CosineSimilarity to nn.distance and updated docs (#1672)
* added CosineSimilarity to nn.distance and updated docs
2017-06-06 22:53:21 -04:00
e50d599240 Fix header inclusion in math.h
Summary:
While debugging #43 I found common/common.h missing some headers as well.

Fixes #43.
Closes https://github.com/facebookincubator/gloo/pull/44

Differential Revision: D5194970

Pulled By: pietern

fbshipit-source-id: 4861cd04c56931d4759f5bc050816788252003ee
2017-06-06 15:21:08 -07:00
c6a6391c38 added checks to cudnn Convolution for stride, dilation, kernel size and num input planes (#1723)
* added checks to cudnn Convolution for stride, dilation, kernel size and num input planes
2017-06-06 15:42:00 -04:00
d50ad408fa fix incorrect grad_weight in Bilinear 2017-06-06 15:07:09 -04:00
73ccdb3920 Fixing the issue with incorrect normalized values in IndexLinear 2017-06-06 11:44:11 -07:00
b6c75c43c8 add tests for checking the type of .data and .grad.data is the same 2017-06-06 01:06:14 -04:00
a53cde09b5 Rename masked_copy_ to masked_scatter_ 2017-06-06 01:06:14 -04:00
98afdcf409 Accept None values returned from grad hooks 2017-06-06 01:06:14 -04:00
ef32e96447 Fix grad type of compare functions 2017-06-06 01:06:14 -04:00
b032b88f34 Fix Prod backward and autograd tests 2017-06-06 01:06:14 -04:00
a76098ac15 fix optimizer when given a single parameter (instead of an iterable)
When I use named_parameters to modify the lr and weight decay, I hit a bug, because what ends up being passed to the optimizer is a single torch.nn.parameter.Parameter, not a generator of Parameters.
2017-06-05 23:47:56 -04:00
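A minimal sketch of the intended pattern (assuming the current optim API; illustrative only): `named_parameters()` yields `(name, Parameter)` pairs, and the optimizer wants an iterable of Parameters or of parameter-group dicts, not a bare Parameter.

```
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

groups = [
    {'params': [p for n, p in model.named_parameters() if n.endswith('bias')],
     'weight_decay': 0.0},
    {'params': [p for n, p in model.named_parameters() if not n.endswith('bias')],
     'weight_decay': 1e-4},
]
opt = torch.optim.SGD(groups, lr=0.1)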
2ce5875a4d Modify the sample code of extending autograd (#1720)
The original input cannot be used as input to Linear(), because forward() takes at least 3 arguments (2 given)
2017-06-05 23:36:58 -04:00
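A minimal, self-contained example of extending autograd in the modern static-method style (not the exact snippet from the docs this commit touches):

```
import torch

class MulConstant(torch.autograd.Function):
    """y = x * constant, with a hand-written backward."""

    @staticmethod
    def forward(ctx, x, constant):
        ctx.constant = constant
        return x * constant

    @staticmethod
    def backward(ctx, grad_output):
        # one gradient per forward input; the non-tensor constant gets None
        return grad_output * ctx.constant, None

x = torch.randn(3, requires_grad=True)
MulConstant.apply(x, 2.0).sum().backward()
print(x.grad)   # all elements equal 2.0
```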
511cb20e7d Add Gesv to autograd (#1733)
* Add Gesv to autograd

* Add TODO for backprop through LU
2017-06-05 21:38:49 -04:00
e3305eb9dc Runtime dockerfile (#1732)
* reduce the size of Docker image

* add runtime dockerfile
2017-06-05 17:40:06 -04:00
e9bf702c5e LSTM bias_hh, fix docs
Rename W_hi ... to b_hi ...
2017-06-05 22:55:09 +02:00
9a2d11dd36 Use a longer timeout when establishing initial tcp connection
Summary: Machines may not create their Gloo pairs at the same time, due to earlier variable time work. Increase the timeout used to establish the initial tcp connection to accommodate without sacrificing the shorter default timeout for outstanding reads/writes. No related change required for ibverbs as there is no communication on init.

Reviewed By: akyrola

Differential Revision: D5184518

fbshipit-source-id: 0e6c9704a2d2f1406b3927f75887f0a42199450b
2017-06-05 13:40:22 -07:00
3716286e6b reduce the size of Docker image (#1729) 2017-06-05 14:03:11 -04:00
c357ebd590 Merge commit '6422ea3d9f065683bb899b88ae0baec79e6d73ca' 2017-06-05 13:01:25 -04:00
85a95d8a23 Fix sharing of CUDA tensors on non-current devices
The correct device must be set when getting the base allocation and when
calling cudaIpcCloseMemHandle. Store the device in the allocators
context, which was previously always NULL.

Fixes #1707
2017-06-05 13:01:19 -04:00
6422ea3d9f Fix sharing of CUDA tensors on non-current devices 2017-06-05 12:58:34 -04:00
ddf6328990 Document type function returns type with no args (#1719) 2017-06-05 11:54:55 -04:00
174c3cc399 Add support for double backward of LeakyReLU (#1714) 2017-06-05 11:53:27 -04:00
24aecaa2c8 Cleanup torch vision docs (#1699)
* Modify torchvision documentation following https://github.com/pytorch/vision/pull/179

* Add new datasets to docs

* Fix wording in torch.datasets

* Small clarification
2017-06-05 11:52:41 -04:00
4853cc0194 convert linalg.py to new-style functions (#1638) 2017-06-04 09:27:01 -04:00
ac1c674723 Fix a couple of selection reduce function autograd bugs (#1702)
* Fix Median/Mode autograd functions.

* Fix kthvalue autograd function.

* Double backward for selection reduce functions.
2017-06-03 02:12:15 -04:00
eba3dc8561 Fix gc_refs assertion failure (#1705)
* Fix gc_refs assertion failure

Ensure that each THPVariable -> THPFunction reference contributes one
ref count to the THPFunction by creating a new shared_ptr for each ref.

Because multiple shared_ptrs can again manage a single THPFunction, it's
not safe to use std::weak_ptr where it may point to a PyFunction. It's
still safe to use weak_ptr for grad_accumulator since these are never
PyFunctions.

Fixes #1626

* Remove stale comment
2017-06-02 21:08:50 -04:00
ee9d4d58e2 Fix connect bug
Before the change, processes were not waiting for the master even when they got
'connection refused' (the master is not listening yet, so we should wait).
This happened because we were closing the socket twice: first via
the resource guard, and then manually in the exception handler.
That caused errno to be set to a different value (9 - bad file descriptor),
so the `if` that checked whether the connection was refused stopped matching.
2017-06-02 23:42:11 +02:00
b7c4900d19 Fix minor bug in InitMethodFile 2017-06-02 23:42:11 +02:00
e22f9036de Add tcp init method for non-multicast addresses 2017-06-02 23:42:11 +02:00
c01ff1f3dc Make world_size mandatory for Master and Worker; Minor refactor 2017-06-02 23:42:11 +02:00
eeb8e5c31b Linux fixes 2017-06-02 23:42:11 +02:00
c6c9e61169 Implement THD tensor copies 2017-06-02 23:42:11 +02:00
34804e9600 Refactor file and tcp init methods
* Add sanity checks
 * Refactor InitMethodFile and TCPInitMethod to more logical functions
 * Update few error messages
 * Add passing parameters by **kwargs, so the order of parameters is no longer relevant
 * Review comments
2017-06-02 23:42:11 +02:00
c41555fb0a Add rank parameter; Fix MW mode initalization 2017-06-02 23:42:11 +02:00
96cc1e1ac7 Review comments 2017-06-02 23:42:11 +02:00
cfdd49f76a Simplify and refactor init code 2017-06-02 23:42:11 +02:00
447d9287bf Refactor multicast and change env init method 2017-06-02 23:42:11 +02:00
832eaf900b Fix bugs and improve init methods 2017-06-02 23:42:11 +02:00
e685277299 Add address discovery; Bug fixes; 2017-06-02 23:42:11 +02:00
8ea7c87c29 Improve init methods 2017-06-02 23:42:11 +02:00
09c0d9c51c Add multiple initalization methods for DataChannels 2017-06-02 23:42:11 +02:00
240384605c Make copy functions thread safe (#82) 2017-06-02 23:42:11 +02:00
9f9a3d596f Use lock_guard and don't use unique_ptr 2017-06-02 23:42:11 +02:00
a8c26c1040 Add mutexes to MasterCommandChannel::sendMessage 2017-06-02 23:42:11 +02:00
6cdfe0d7b9 Remove MASTER_ADDR and _PORT from MPI benchmarking 2017-06-02 23:42:11 +02:00
1b66b50064 Benchmarks: Don't export WORLD_SIZE when using MPI
I just realized we don't need it (any longer?).
2017-06-02 23:42:11 +02:00
cf42c1a044 Improve error messages of DataChannel::newChannel 2017-06-02 23:42:11 +02:00
f717f29d7e Change function names; Change thpp::Tensor to THDTensorDescriptor 2017-06-02 23:42:11 +02:00
181d2f41bd Add initial Python wrappers for THDTensors 2017-06-02 23:42:11 +02:00
2059ece284 Exit workers gracefully in master-worker mode 2017-06-02 23:42:11 +02:00
b3e100b40e Add copy (TH <-> THD) functions to MW mode 2017-06-02 23:42:11 +02:00
ec2de16776 Improve README copyediting 2017-06-02 21:02:14 +02:00
ea05d6aec3 Fix compilation with cuDNN 5 (#1703) 2017-06-02 14:03:02 -04:00
5a93d6b903 Fix CUDA_HOME detection (#1675) 2017-06-02 19:26:00 +02:00
75e0df271a Add Inverse to autograd (#1670)
* Add Inverse to autograd

* Add SkipTest to autograd tests
2017-06-02 12:00:13 -04:00
565bf7116b A pile of misc doc fixes. (#1682)
* A pile of misc doc fixes.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Handle @apaszke  review comments.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Initial csrc documentation.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-06-02 11:59:03 -04:00
f1c57ace1b added input dim checks to convxD and conv_transposedxd (#1695)
* add input dim check for conv2d

* add None check to conv2d

* added input dim checks to convxD and conv_transposedxd

* flake8 fixes
2017-06-02 11:58:19 -04:00
460b8715a8 display version number in docs 2017-06-02 11:56:48 -04:00
6da111c53d Merge commit '00843c57c936720b3d17f4c0afaab08dcb52a7cc' 2017-06-02 11:52:19 -04:00
568c5c91ee substitute cudnnFind* functions with cudnnFind*Ex 2017-06-02 11:52:12 -04:00
00843c57c9 substitute cudnnFind* functions with cudnnFind*Ex 2017-06-02 11:50:50 -04:00
501467db17 added param name to tuple_parser for better error messages 2017-06-02 16:16:21 +02:00
d51cd61e2e add checks for input, weight and bias types when using cudnn conv2d (#1689) 2017-06-01 10:06:30 -04:00
447fe953e5 Modify the sample code of volatile (#1694)
The original inputs (torch.randn(5, 5)) cannot be used as input to a ResNet, which expects shape (batch, channels, height, width)
2017-06-01 09:46:04 -04:00
7b5af7d1b7 Expand ibverbs read timeout messages
Summary: TSIA

Reviewed By: romain-intel

Differential Revision: D5158642

fbshipit-source-id: 6e55a69a140c1f5f6e4ce6262afaf5014c412414
2017-05-31 19:50:21 -07:00
afc26ac675 Added time-out to ibverbs transport
Summary: Extended the time-out option from just working on TCP to also working with ibverbs

Reviewed By: pietern

Differential Revision: D5090258

fbshipit-source-id: fee685850d761d0c2130852f513c64ceb19f4e9e
2017-05-31 11:20:40 -07:00
6f791e74f1 Add a minimum iteration count of 1 for benchmarks
Summary:
For some long-running benchmarks, the iteration count could be 0,
which would lead to a segfault when printing results

Reviewed By: pietern

Differential Revision: D5149034

fbshipit-source-id: 7b56e8961c302d1ff11ffcd74ca8e909ea046231
2017-05-30 18:12:39 -07:00
3106423713 Synchronize with H2D copyAsync before signalling the broadcast sender
Summary: Closes https://github.com/facebookincubator/gloo/pull/41

Differential Revision: D5149996

Pulled By: pietern

fbshipit-source-id: 15d61fab9babfeb1e4178b84ecf5f6e32ad3bfb3
2017-05-30 14:20:29 -07:00
4eb448a051 Fix simple typo
The dimension was slightly wrong.
2017-05-28 18:53:04 +02:00
065c59860a Fix docs: masked_fill_ takes a value, not a tensor. (#1663) 2017-05-26 14:41:03 -04:00
45f665d05c Fix decodeUInt64BE
Fixes #1658
2017-05-26 11:21:31 -07:00
64faf120ac Adding support for ADD_TORCH_LIBRARY macro 2017-05-25 15:41:52 -07:00
0b74f0d796 lua 5.3 changes and gcc constants 2017-05-25 15:41:52 -07:00
8074180081 Faulty error message for InstanceNorm1d (#1609) 2017-05-25 17:13:01 -04:00
5ce4a4adbf Merge commit '3f1f3f97343d2ab7eb522cac7330f6b7478bd4da' 2017-05-25 16:51:57 -04:00
3e9caed731 Merge commit 'bd705d38ce11a0ca1547f709f29f80a02b3dd894' 2017-05-25 16:51:09 -04:00
7b578dd68e Add scatterAdd 2017-05-25 16:49:48 -04:00
3f1f3f9734 Add scatterAdd 2017-05-25 16:49:32 -04:00
bd705d38ce Add scatterAdd 2017-05-25 16:49:22 -04:00
630af4d7d8 add learning rate schedulers (#1370) 2017-05-25 16:21:43 -04:00
0409b42a02 Merge commit '3abe5c80d2073f0e72f79b88f11b2a9d320fb116' 2017-05-25 15:40:27 -04:00
c39d48ea7d Fast transposed copy 2017-05-25 15:39:21 -04:00
3abe5c80d2 Fast transposed copy 2017-05-25 15:39:07 -04:00
05bc877a05 make THPPointer have explicit constructors (#1636) 2017-05-25 15:35:54 -04:00
7ea9d9af4e Fix build when included by another project; take 2
Summary:
Only adding `include_directories` doesn't propagate to the including
targets. Also use `target_include_directories` to do so.
Closes https://github.com/facebookincubator/gloo/pull/39

Differential Revision: D5131001

Pulled By: pietern

fbshipit-source-id: 6c58c4b76ae7fa008e4fb26d1bca7900165884d0
2017-05-25 11:50:23 -07:00
6a7c56499c How to manage multiple build trees of PyTorch. (#1654)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-25 11:21:52 -04:00
46ee1e4687 Clarify definition of gather function in docs. (#1652) 2017-05-25 11:06:28 -04:00
e63b49d9ab Fix build when included by another project
Summary:
The CMake variable CMAKE_BINARY_DIR points to the top level build
directory. For standalone Gloo builds this path lets files include the
generated file "gloo/config.h". When Gloo is included as project, this
variable points to a different path and "gloo/config.h" cannot be
resolved. Fix is to build a path from CMAKE_CURRENT_BINARY_DIR.
Closes https://github.com/facebookincubator/gloo/pull/38

Differential Revision: D5129385

Pulled By: pietern

fbshipit-source-id: 722cebf4892b34f869fe43320153efbb181555b6
2017-05-25 07:50:53 -07:00
036c3f93af Check for released variables in SavedVariable::unpack() (#1648)
Fixes #1288
2017-05-25 00:35:19 -04:00
4f261f5730 Add support for fast float16 reductions using AVX
Summary: Using Misha's vectorized AVX code to greatly improve performance of reductions on float16 values. Float16 reductions are now 2x faster than float.

Reviewed By: pietern

Differential Revision: D5123331

fbshipit-source-id: 03d4e76886d538b7e24eedaf32a92231a80b1e43
2017-05-24 21:20:06 -07:00
98581b9f7e Fix conv1d segfault when weight doesn't require grad (#1646)
Fixes #1600
2017-05-24 20:46:32 -04:00
9a497f824b Add size/dimensionality documentation for torch.gather. (#1645) 2017-05-24 20:42:18 -04:00
1e63a04a18 Use clear-to-send notification for broadcast algorithms
Summary:
The broadcast algorithms use the buffers they were given directly.
There is no inbox/outbox pattern. This means that we can race if the
algorithm is run repeatedly within a short time frame. This hasn't
been an issue so far since we've only used it in combination with
other process wide barriers.

Since this adds a round trip the latency of these ops from the root
rank perspective increases. The variance between the before and after
runs is pretty high since there is no back and forth interaction on
the root. It simply waits for recipients to be ready and then sends
its data.

Before:

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   broadcast_one_to_all
Options:     processes=4, inputs=1

   elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
        100          1         16         29         50     426075
        200          2         17         32         50     179953
        500          2         11         31         59     140291
       1000          2         12         29         59     177619
       2000          3         12         29         62     117882
       5000          5         16         31         64     127113
      10000          9         21         38         88      60328
      20000         19         36         65        130      30427
      50000         48         68        221        556      11180
     100000         92        136        426        871       7314
     200000        193        251        829       2965       4092
     500000        492        638       2098       4133       1677
    1000000       1195       2024       3513      11646        628
    2000000       3446       4216       5007      17100        282
    5000000      12956      13919      14941      37751         71

```

After:

```
Device:      tcp, pci=0000:25:00.0, iface=eth0, speed=50000
Algorithm:   broadcast_one_to_all
Options:     processes=4, inputs=1

   elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
        100         15         37         52        107      27332
        200         14         40         63        199      28620
        500         17         37         52        118      18299
       1000          9         39         57        120      33375
       2000         20         57         78        180      24779
       5000         31         61         84        190      18039
      10000         39         70         90        225       8908
      20000         57        108        130        940       8313
      50000         94        163        217       1933       5326
     100000        132        231        331       3501       3681
     200000        256        426        560       6509       2272
     500000        774       1092       1698      10039        985
    1000000       1132       2106       3878      18218        484
    2000000       3509       4252       6832      20228        226
    5000000      11326      15447      27129      52694         77
```

Reviewed By: wesolwsk

Differential Revision: D5123341

fbshipit-source-id: f3bab4f75ef7c38817f74f00b382f18fe43d85d5
2017-05-24 15:36:36 -07:00
e54112758c Fix potential vector out of range issue in ContextFactory::makeContext
Summary: A vector out-of-range error was being triggered in some tests due to trying to get the address of an element past the end of the vector.

Reviewed By: pietern

Differential Revision: D5123044

fbshipit-source-id: 004f72ebaa27c609290959c12a3d99b16289bfa8
2017-05-24 14:50:09 -07:00
e1d257bc6d Fix segfault in autograd: (#1644)
* Fix segfault in autograd:

1) Every "output" variable must have a grad_fn or grad_accumulator
2) compute_partial_exec_callbacks uses Python errors

* assertRaisesRegexp was renamed assertRaisesRegex in 3.2

* Use HANDLE_TH_ERRORS macro
2017-05-24 17:13:08 -04:00
3d38e4f126 Acquire GIL before THPVariable_wrap (#1625)
* Acquire GIL before THPVariable_wrap.

* mutex not required when GIL is held.

* Remove unused mutex.
2017-05-24 15:19:34 -04:00
fa93653d09 Improve handling of graph roots in autograd engine (#1635) 2017-05-24 14:50:07 -04:00
ff047fdeef Fix the mix-up of height and width on depth-wise convolution 2017-05-24 21:05:08 +08:00
2486a6bbd0 Add missing header file types.h in CMakeLists.txt
Summary: A recently added header file was missing in CMakeLists.txt

Reviewed By: pietern

Differential Revision: D5116962

fbshipit-source-id: 6c3fbd4b49c913f20308c1b057a7e09806e0c2b0
2017-05-23 16:50:41 -07:00
640846b864 Fix race in ibverbs transport
Summary:
In a previous commit where the slot numbering was expanded, I changed
the memory region send/recv path to use a map for the outgoing memory
regions (since they may complete out of order). Before, this was a
fixed size array, which was mutated by both the user thread and device
thread without holding a lock. The map, however, can't be mutated
without a lock. This change adds that lock and a few assertions to
check for this type of problem.

Reviewed By: andrewwdye

Differential Revision: D5108194

fbshipit-source-id: 1908c988112469ecdec6cb6eb9849068d896c409
2017-05-23 15:38:48 -07:00
ba56de1150 add coding UTF-8 declaration 2017-05-23 16:02:34 -04:00
6e3e453ad2 Tidy up convs docs (#1602) 2017-05-23 18:32:33 +02:00
f5d919a685 Generate config.h file with compilation options
Summary:
This file can then be used by downstream code to figure out what Gloo
features it can support (e.g. ibverbs transport or not).
Closes https://github.com/facebookincubator/gloo/pull/36

Differential Revision: D5110769

Pulled By: pietern

fbshipit-source-id: 2c0c07537258048737ae764a4978f2f7fdbd992d
2017-05-23 09:26:03 -07:00
02e4ca9cab fix wrapper 2017-05-23 08:43:13 -07:00
70a774898e Remove superfluous forward declaration
Summary: ContextFactory is no longer mentioned in gloo/context.h.

Reviewed By: romain-intel

Differential Revision: D5110328

fbshipit-source-id: 48dd020dc39d71d0d5f72deebfa5d80122b70c0d
2017-05-23 08:20:55 -07:00
49befe3fcd Remove commPairs_ member variable from halving/doubling
Summary: TSIA

Reviewed By: wesolwsk

Differential Revision: D5110348

fbshipit-source-id: d3346e2af1a9f13410dc93336c53040a29e22e66
2017-05-22 21:21:42 -07:00
7eac2073b8 Add notification mechanism to ContextFactory
Summary:
This is another example where our unsolicited writes may interfere
across calls to the collective function. In this case, it was possible
for a second call to overwrite a pair's address before it had been
used to connect the pair in the previous iteration.

Thinking out loud, we could prevent this from happening by supporting
this pattern natively in the Buffer classes. For example, we can add a
notification mechanism (opt in) to the Buffer class such that the
receiver may call `ackRecv()` to acknowledge receipt and handling of
the data in the buffer. Then the sender will block on new sends until
acknowledgement from the previous send has been received. Until then,
we have to keep an extra eye out.

Reviewed By: wesolwsk, romain-intel

Differential Revision: D5095430

fbshipit-source-id: 4c100433108fccea7457bba4dc00f651f722e6c9
2017-05-22 19:50:18 -07:00
45524ec33c Fix indices bug in MM.py (#1613) (#1617) 2017-05-22 16:47:51 -04:00
f072c74dfd make transferring a tensor from other devices to device 0 take effect (#1610) 2017-05-22 11:06:57 -04:00
107a0fe9ac Revert "Revert "ClassNLLCriterion supports missing targets"" 2017-05-21 13:48:19 -04:00
2acfb2376a fixes eval mode in InstanceNorm (#1604)
fixes https://github.com/pytorch/pytorch/issues/1541
2017-05-21 13:27:48 -04:00
0c5598c668 Update build status matrix 2017-05-21 12:20:50 +02:00
feaee29bfe Add argmax and argmin to docs 2017-05-20 18:56:20 +02:00
7f6cd7c7ea Fix error message in CUDA forked subprocess (#1585)
We need to re-call _lazy_init in _CudaBase.__new__ in the subprocess.
2017-05-19 12:36:08 -04:00
625850c2c2 Check cuDNN version at runtime (#1586)
* Check cuDNN version at runtime

This checks that the version from cudnn.h matches the version from
libcudnn.so.

Fixes #1476

* Only check major and minor version numbers
2017-05-19 01:55:09 -04:00
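cuDNN encodes its version as MAJOR*1000 + MINOR*100 + PATCHLEVEL, so "only check major and minor" amounts to comparing the two leading components of the header version and the library version. A small sketch of that comparison (the version numbers below are illustrative):

```python
def cudnn_major_minor(version):
    """Decode cuDNN's MAJOR*1000 + MINOR*100 + PATCH encoding, e.g. 6021 -> (6, 0)."""
    return version // 1000, (version % 1000) // 100

def versions_compatible(header_version, runtime_version):
    """Patch-level differences between cudnn.h and libcudnn.so are tolerated."""
    return cudnn_major_minor(header_version) == cudnn_major_minor(runtime_version)

assert versions_compatible(6021, 6005)       # 6.0.x vs 6.0.y: OK
assert not versions_compatible(6021, 5110)   # 6.0 vs 5.1: mismatch
```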
9b3447761a Check for required non-None arguments in C++ autograd functions (#1589) 2017-05-19 01:47:35 -04:00
ed679fc43c disabling fd leakchecker test (#1593) 2017-05-19 01:20:50 -04:00
e6c9509a41 Fix call to Tensor.set_ in rnn.py (#1592) 2017-05-18 20:28:49 -04:00
c57f0530e7 set long_args to False for param "size" of set_ (#1568)
* fix #1524: set long_args to False for the "size" param of set_
2017-05-18 19:31:36 -04:00
8021bb938c Remove slot number limitation from ibverbs transport
Summary:
The pair was still hardcoding limits on the slot numbers. In this
change those limits are lifted.

This also adds back assertions on work completion status in
handleCompletion.

Reviewed By: wesolwsk

Differential Revision: D5090457

fbshipit-source-id: 7bf884e1f31e48e8f1cdfb179a225999e28171b2
2017-05-18 16:20:40 -07:00
1f4317be3f Add support for half-precision floating point operations
Summary: Add support for collectives over vectors of half-precision floating point values.

Reviewed By: pietern

Differential Revision: D5062938

fbshipit-source-id: 0b39fa53370393fec1edf2d852ff7f1d862b9022
2017-05-18 15:09:06 -07:00
cba46a4869 Assert that we don't do out of bound writes on recv
Summary:
The halving/doubling algorithm had two instances where a receive
buffer was registered with a number of elements instead of a number of
bytes. This change adds the assertion that should have caught this in
the first place.

Reviewed By: wesolwsk

Differential Revision: D5089483

fbshipit-source-id: fd0f0724ef04300236c9297ee88b27e61fb1e5a0
2017-05-18 14:34:39 -07:00
b391f53681 Cache send/recv buffers in ContextFactory
Summary:
The original implementation created temporary buffers on the backing
context. This also meant an ordering problem when using the ibverbs
transport, as a call to send will block until the remote side has
created its receive side buffer. Since all buffers are now created
prior to using them, this is no longer an issue.

Reviewed By: romain-intel

Differential Revision: D5082352

fbshipit-source-id: 4c260f06e8f461c0336e7eec7ca891e07ff41cd3
2017-05-18 10:20:42 -07:00
85732b52ec fix cuda multiple algorithm test
Summary: Fixing a bug in the multiple algorithm test where threads were spawned repeatedly, causing collisions during rendezvous.

Reviewed By: pietern

Differential Revision: D5082945

fbshipit-source-id: 4adbbc963b1ff652f73a44cd9fd75dcd3325f182
2017-05-17 16:35:25 -07:00
156fe28666 dataloader can now handle growing datasets (#1575) 2017-05-17 19:23:15 -04:00
2f4bf4ab39 Rewrite 'How autograd encodes the history' to accurately describe current setup. (#1580)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-17 19:21:20 -04:00
1f3ff5ced2 Miscellaneous documentation around autograd. (#1577)
* Miscellaneous documentation around autograd.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-17 19:19:24 -04:00
b8b7f879c2 .gitignore updated with editor temporaries (#1574) 2017-05-17 19:16:02 -04:00
7b10b16496 Move ibverbs buffer send logic to pair.cc
Summary:
TSIA

This matches the approach in the TCP transport where all send/recv
logic is contained in the pair code.

Reviewed By: wesolwsk

Differential Revision: D5082503

fbshipit-source-id: b70886ed9aaeb381cdb45fba00704118cff62a23
2017-05-17 15:54:34 -07:00
da86633c7c Additional synchronization in halving/doubling
Summary:
This is necessary to avoid the next iteration of the algorithm
overwriting data in recvBuf_ before it has been consumed by the
receiver of that data. If this does happen, the result of the previous
iteration for the receiving end is corrupted. This can only happen in
async mode on the TCP transport (so all incoming data is unsolicited)
when spinning on the run function.

Reviewed By: wesolwsk

Differential Revision: D5074789

fbshipit-source-id: 66668fbd885888f26266d812e78d61c6d65c2461
2017-05-17 15:21:09 -07:00
c573d53939 Bug fixes (#1573)
* Fix clang warnings
* Raise errors when unsupported ConvNd configurations are used
* Properly handle Variable indexing with LongTensors
* Support both tensors and variables in Variable.type_as
2017-05-17 15:28:16 -04:00
cb79c24d0b Added powerpc64le support (#1572) 2017-05-16 08:30:06 -06:00
caa1cdf0ce ClassNLLCriterion ignoreIndex 2017-05-15 22:27:00 -04:00
368ecb47f9 Fix flaky test_sparse_adagrad (#1562) 2017-05-16 01:03:08 +02:00
6107d15d14 Twice differentiability of pointwise functions (#1531) 2017-05-15 12:00:59 -06:00
ba885a1a51 expose bitwise operators from C/CUDA (#1556)
* fix issue #1549, expose bitwise and

* expose C bitwise or of Tensor

* expose C bitwise xor of Tensor

* use built-in method for inplace and, or, xor

* expose C bitwise lshift(ilshift) and rshift(irshift) of Tensor
2017-05-15 11:36:15 -06:00
ce1a0eb6c9 Merge commit '7afd78d77ffad503357c35f495ae6d4d2b008862' 2017-05-15 11:20:27 -06:00
7afd78d77f Cuda reduce in a consistent direction 2017-05-15 11:18:20 -06:00
6b84dc26f0 Add F.cosine_similarity (#1502) 2017-05-15 11:12:54 -06:00
0f458ee3c4 Fix memory leak in THCSTensor_spcadd. (#1519)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-15 11:11:03 -06:00
8aa011f52a minor typo and style changes to _torch_docs.py (#1559) 2017-05-15 15:32:56 +02:00
2a610c9d13 Revert "Update to ignore zero targets" 2017-05-14 18:15:30 -07:00
ac8b2c0fa3 Revert "ClassNLLCriterion supports missing targets" 2017-05-14 18:14:36 -07:00
0ba20435ce Add higher-order grad support for some operators (#1507) 2017-05-14 23:02:04 +02:00
6fc9130052 Adapt documentation to reflect new supported argument (#1548)
Reflect the changes of #1323
2017-05-13 21:09:34 -06:00
28f4f6db2c Fix typo in torch.addr example (#1547)
fix the typo in the example for torch.addr
2017-05-13 08:53:05 -07:00
9b2de027be SpatialDepthWiseConvolution.cu added 2017-05-12 16:02:14 -04:00
bf4345e2ef ClassNLLCriterion supports missing targets 2017-05-12 15:15:39 -04:00
029290c5b1 SpatialDepthWiseConvolution 2017-05-12 11:34:27 -04:00
78abf0134d Merge pull request #458 from jnhwkim/master
Update to ignore zero targets
2017-05-12 10:38:18 -04:00
9db7787316 Updating __getitem__ and __len__ for containers (#1544) 2017-05-12 16:17:06 +02:00
efa913b1c2 fix uninitialized variable in cmake FindSSE (#1023) 2017-05-11 18:57:34 -07:00
d1a4467682 fix a bug when calling modules
A module that returns a non-standard data structure currently breaks
due to checks for backward hooks. This refactors the code slightly so
that it only breaks when backward hooks are actually registered.
2017-05-11 23:00:45 +02:00
507ddc4cde Temporary fix for multiple backwards with fused pointwise RNN (#1540) 2017-05-11 11:18:56 -07:00
aba05ce9db Ensuring float tensors call float versions of math functions 2017-05-11 10:39:35 -07:00
be843eb26b Add unfold to autograd (#1523) 2017-05-11 17:53:16 +02:00
5bb13485b8 Fix Linear function 2017-05-10 16:43:14 +02:00
a86adf43a1 Fix comparison functions 2017-05-10 16:43:14 +02:00
1c304a9ef6 Expose variable attribute of AccumulateGrad 2017-05-10 16:43:14 +02:00
feef54ec34 Don't modify non-volatile grads in zero_grad 2017-05-10 16:43:14 +02:00
5026209d0c Minor fix in Prod backward 2017-05-10 16:43:14 +02:00
e7220380bc Add new flags to Variable.backward 2017-05-10 16:43:14 +02:00
9fa0e403d6 Replace retain_variables with retain_graph 2017-05-10 16:43:14 +02:00
35cf380ed1 Improve output wrapping logic in autograd 2017-05-10 16:43:14 +02:00
3a7e068439 Remove spurious memo argument in Module.parameters() (#1527) 2017-05-10 13:55:15 +02:00
862105ec8b Merge commit 'd5e821044aa20d67122f4570a3f1cb7e6e9c2617' 2017-05-09 17:06:25 -07:00
d5e821044a Make torch.cat not synchronize the host and device 2017-05-09 17:05:23 -07:00
bfc8a3ebba Reference counting documentation. (#1520)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-09 17:02:28 -07:00
6fab62173e Restore examples with keepdim=True default. 2017-05-09 14:49:55 -07:00
c4742fd128 Explicitly pass keepdim=False for tests that require it.
If we change the default to False, reverting this commit is optional.
2017-05-09 14:49:44 -07:00
e124790cb2 Change keepdim default to False. 2017-05-09 14:49:21 -07:00
171638a451 Fix test_normalize NN test. 2017-05-09 14:25:06 -07:00
d95f711501 Add a keepdim test to torch_test. 2017-05-09 14:25:01 -07:00
b9e00dfbb8 Make (non-legacy) nn backwards compatible.
The keepdim change only seems to leak in one place:
when the grad_bias is returned in linear.py.
2017-05-09 14:24:53 -07:00
f6a00fac13 Add autograd tests for keepdim 2017-05-09 14:24:45 -07:00
be5191a00b Add documentation for keepdim. 2017-05-09 14:16:42 -07:00
c9d8e0a43a Change all legacy/nn modules to use keepdim=True (even if tests don't fail).
We shouldn't be introducing changes in legacy modules if we can avoid it.
2017-05-09 14:16:31 -07:00
ae2b2cbbec Make keepdim work with autograd. 2017-05-09 14:15:59 -07:00
f4cf1d6d18 Merge commit 'af790f86f329364dacef1301fc9b5b292629075c' 2017-05-09 14:04:08 -07:00
c34cff7035 Merge commit '906c550e1079e9762194db59440a202ffca90dca' 2017-05-09 14:03:28 -07:00
194d7408bb Merge commit '5f308b50fb558a620253443ef45f7cf3a91be410' 2017-05-09 14:02:25 -07:00
0d538246fb Merge commit '98dbdc464b0f53ecc89af58cc994c7e8d7617e4e' 2017-05-09 14:01:13 -07:00
7c3cb24485 Add a keepdim parameter for reduction functions over a single dimension.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).

The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
2017-05-09 14:01:03 -07:00
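A short example of the migration described above, in current PyTorch syntax: with keepdim defaulting to False, the reduced dimension is squeezed away, so normalization-style code must pass keepdim=True explicitly.

```python
import torch

probs = torch.rand(4, 10)

# Old default: sum(1) kept a size-1 dimension, so expand_as() worked as-is.
# New default squeezes that dimension away; pass keepdim=True to restore it.
normalized = probs / probs.sum(1, keepdim=True).expand_as(probs)
assert normalized.shape == probs.shape
```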
af790f86f3 Add a keepdim parameter for reduction functions over a single dimension.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).

The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
2017-05-09 11:55:42 -07:00
906c550e10 Add a keepdim parameter for reduction functions over a single dimension.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).

The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
2017-05-09 11:55:29 -07:00
5f308b50fb Add a keepdim parameter for reduction functions over a single dimension.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).

The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
2017-05-09 11:55:20 -07:00
98dbdc464b Add a keepdim parameter for reduction functions over a single dimension.
By default, this parameter is False -- a backwards incompatible change, but
one that follows numpy semantics, e.g. numpy.sum (numpy names the parameter
"keepdims" since you can pass multiple dims to reduction functions).

The old behavior seems desired for normalization type operations
where the tensor will immediately be expanded out again, e.g.:
probs.sum(1).expand_as(probs)
which no longer works because the dimension to expand is missing.
This can be fixed by simply passing True as "keepdim" argument
to the reduction operation, e.g:
probs.sum(1, keepdim=True).expand_as(probs)
2017-05-09 11:54:58 -07:00
e70164316c Merge commit '91a118c116d15d280a99a39666d298be15c6d592' 2017-05-08 16:58:56 -07:00
33b3968660 add larger tests for qr 2017-05-08 16:58:54 -07:00
91a118c116 Fix bug in magma qr decomposition and add tests for larger matrices 2017-05-08 16:44:15 -07:00
0764589ed1 Merge commit '008a8c9720183d7bf8b00bf64d8d21c62270089f' 2017-05-08 16:24:14 -07:00
27671c800d Merge commit '105df5844dca21f964d180a918c808489862941f' 2017-05-08 16:23:12 -07:00
d0504aa41d Implement lgamma function. 2017-05-08 16:21:26 -07:00
008a8c9720 Implement lgamma function. 2017-05-08 16:20:52 -07:00
105df5844d Implement lgamma function. 2017-05-08 16:20:39 -07:00
50bf7d5cbc Merge commit '066fbcd014fa4092152b2cd04ad1d92fc8d7bd59' 2017-05-08 16:13:57 -07:00
066fbcd014 use current stream in cat array kernel launch 2017-05-08 16:12:10 -07:00
ecf29f10ad Merge commit '22bbd7ac33ba51469cc913cb01fcd3b70a42e528' 2017-05-08 16:10:00 -07:00
22bbd7ac33 s/IndexType/long 2017-05-08 16:09:02 -07:00
2075abbe30 Gloo: Added a way to create connected contexts from another context
Summary:
Added a context factory that allows you to use an existing context to
create other fully connected contexts much more cheaply (without having
to rely on a store).

Limitations:
  - The backing context needs to be fully connected

Reviewed By: andrewwdye, pietern

Differential Revision: D4985121

fbshipit-source-id: 31ceabccbb679cedb18ec9927b6c166bef5989bb
2017-05-08 16:02:04 -07:00
e694db0eeb Raise error when Variable is converted to bool. Fixes #1482. (#1491) 2017-05-08 23:14:11 +02:00
c5ae79fe4e Make clamp twice differentiable (#1514) 2017-05-08 23:12:42 +02:00
4ad2e155bc Make nn.Sequential more pythonic (#1510)
A minor fix which uses `enumerate` during iteration.
2017-05-08 07:32:07 -07:00
6d693fe413 Add F.normalize (#1467) 2017-05-07 13:54:16 +02:00
23b556ef77 Expose custom attributes from C++ functions (#1430) 2017-05-07 13:49:55 +02:00
e3f41a4962 Add high order gradient support for Sigmoid (#1496) 2017-05-07 13:00:20 +02:00
90e9f8a476 Avoid segfault when calling join_with with self as arg (#1493) 2017-05-07 00:35:11 +02:00
5f15a9e0cb Add a note about THPFunction_asFunction 2017-05-06 14:28:32 -07:00
ff0ff33a11 Fix docs for InstanceNorm (#1477) 2017-05-04 18:11:15 -04:00
eb2c6ea874 set deviceId_ to -1 when CudaDevicePointer and CudaStream do not have valid data
Summary: Set deviceId_ to -1 when CudaDevicePointer and CudaStream do not have valid data

Reviewed By: andrewwdye

Differential Revision: D4881374

fbshipit-source-id: e973a70e2e6e4519f5fdc2ad4e76f232d9593751
2017-05-04 15:05:27 -07:00
e64b2e1cd7 add documentation for cwrap plugins (#1474) 2017-05-04 17:50:58 -04:00
7d40140bfb Document squeeze behavior on 1-dimensional tensors of size 1. (#1470) 2017-05-04 16:54:22 +02:00
e50c7daaf9 Use Qr factorization to get orthogonal matrix in orthogonal init (#1453) 2017-05-04 07:11:59 -04:00
600f366a13 Merge commit 'a6876a4783ce3d1bb3c6ba69f54c31983097ed17' 2017-05-04 06:51:10 -04:00
a6876a4783 fix corner-case in MaxPooling 2017-05-04 06:50:15 -04:00
4e18d89791 added twice differentiation for a bunch of ops (#1426) 2017-05-04 06:47:14 -04:00
de9845588d Merge commit 'c061ed5bda238e1276601593343c10428d01eaae' 2017-05-03 23:14:26 -04:00
c061ed5bda handle beta=0 for gemv with transpose 2017-05-03 23:05:41 -04:00
e9d648c5e7 Fix memory leak introduced by 72e8190 (#1464) 2017-05-03 18:38:56 -04:00
80c0a8776b Fix #1447: sparse_mask doesn't make sense with uncoalesced tensors (#1458)
* Make sparseMask error if mask is uncoalesced.

Fixes #1447.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Add test for sparse adagrad.

Previously, the sparse codepath was not exercised at all; this commit
adds a very simple test case "sparse Rosenbrock"; the idea is to do
Rosenbrock but then knock out one of the dimensions so that the
tensor is sparse.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 17:53:45 -04:00
4ec0435b39 Report overall size of sparse tensors. (#1461)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 17:51:56 -04:00
f8be3a20d3 Fix scatter_ documentation typo. (#1463) 2017-05-03 17:31:04 -04:00
7b21b0b6d7 Retry on write EINTR in sync mode
Summary:
We weren't handling an edge case where write(2) could fail with EINTR
when in sync mode. The Pair::write function would return false,
indicating it didn't complete the write, whereas the send function
expects it to complete when in sync mode. With this change we now
advance the cursor and retry the write when fewer bytes than expected
were written.

Also see https://github.com/facebookincubator/gloo/issues/34

Reviewed By: andrewwdye

Differential Revision: D4996949

fbshipit-source-id: 3bad4fa3d0a01517f20b64904aa71410641fa60f
2017-05-03 14:26:26 -07:00
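The retry logic amounts to: on EINTR leave the cursor alone and try again; on a short write advance the cursor and keep going. A Python rendering of that loop (illustrative only; modern Python already retries EINTR internally, and the actual fix is in the C++ Pair code):

```python
import errno
import os

def write_full(fd, data):
    """Write all of `data` to `fd`, retrying on EINTR and short writes."""
    view = memoryview(data)
    while view:
        try:
            n = os.write(fd, view)
        except InterruptedError:
            continue  # EINTR: cursor unchanged, just retry
        except OSError as e:
            if e.errno == errno.EINTR:
                continue
            raise
        view = view[n:]  # advance past the bytes that were written
```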
0910e0ac90 Fix memory leak in coalesce. (#1460)
Fixes #1449.

For future reference, we should have a doc explaining our ref-counting
conventions; it looks like this bug slipped by because we assumed that
newTensor was taking ownership of the pointers it was passed in.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 13:29:39 -04:00
93094294ba function backward attempted to multiply tuple by variables (#1459)
One-line fix: changed it to multiply the grad_variables list by
len(variables) when grad_variables is None.
2017-05-03 13:12:21 -04:00
743e4894d2 Prefix values/indices/sparse_mask/nnz with underscore (#1457)
As discussed in #1441.

I also added some docs giving clear guidance about coalescing
in sparse tensors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 11:14:10 -04:00
f273377d19 add device asserts in scatter/gather kernels 2017-05-03 11:12:26 -04:00
836332e0a1 Merge commit 'f1591fade5c8df5272b79ab1bd8b0b261bb5606a' 2017-05-03 11:11:43 -04:00
f1591fade5 add device asserts in scatter/gather kernels 2017-05-03 11:10:31 -04:00
2e7635b929 Add flexible bilinear upsampling aspect ratio redux (#1317) 2017-05-03 08:46:28 -04:00
e9953c4595 A number of post-merge fixes for test_sparse (#1444)
* Simplify _gen_sparse

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Randomly generate an uncoalesced tensor and test with it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Simpler implementation of cpu_only suggested by @apaszke

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Better implementation of randn, suggested by @soumith

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Lint fix.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Fix CUDA type error.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-03 08:43:03 -04:00
72e8190994 Use at most one shared_ptr block at a time to manage THPFunctions (#1454)
* Fix failing ln in build_all.sh

* Use at most one shared_ptr block at a time to manage THPFunctions
2017-05-03 08:15:36 -04:00
e1278d4ee2 Fix typo in autograd docs 2017-05-03 03:11:55 -07:00
66bd200de0 bug fix - add previous slot offset to calculated slot value in halving-doubling algorithms
Summary: Previous slot offset was not added to the calculated value for the slot to be used in halving-doubling algorithms. If multiple instances were running, slot values could collide.

Reviewed By: pietern

Differential Revision: D4986618

fbshipit-source-id: 56b9220c91f31cc016d37e82907221460de70657
2017-05-02 16:19:55 -07:00
574cfe3cf3 Improve kthvalue documentation. (#1448)
1) Fix "kth" attr specification -- I can't get sphinx to generate `k`th,
but `k` th works with a space, unlike now where the highlighting continues
until the next attr.
2) Specify the size of the return tensors.
3) Add an example of the return tensor sizes with more than 1 dimension.
2017-05-02 17:22:02 -04:00
699755e04f Convert contiguous() call in adagrad to out-of-place coalesce. (#1446)
We missed this one in f2903332c7dce1fbb7d7d9f18dcfba8e853581df!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-02 16:51:54 -04:00
fb07914c0c Recommendations for workflow when modifying C files. (#1443)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-02 15:46:45 -04:00
aa2ee86375 pytorch/thpp ~= facebook/thpp (#1445) 2017-05-02 15:46:10 -04:00
ecd51f8510 docs fixes 2017-05-02 15:42:33 -04:00
5aa1f769d3 Fix torch.dist documentation: function returns a float. (#1440) 2017-05-02 14:38:48 -04:00
eecc807a75 Keep track of number of in-flight send operations
Summary:
This helps guard against programming errors where waitSend is called
before send is called. It uses a std::atomic to keep overhead low.

Reviewed By: andrewwdye

Differential Revision: D4984604

fbshipit-source-id: 04a63b1ba088e3bcba0abff40771af666deb15e5
2017-05-02 09:35:46 -07:00
5386012164 Check return value of ibv_reg_mr for error
Summary:
This returns EFAULT when passing a GPU memory pointer (for GPUDirect)
and the ibverbs driver can't map the GPUs memory. Since the error is
pretty cryptic, crash with a more useful message.

```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what(): [enforce fail at gloo/transport/ibverbs/buffer.cc:46] mr_ !=
  nullptr. ibv_reg_mr: Bad address (kernel module 'nv_peer_mem' not
  loaded; did you specify a GPU pointer?)
```

Reviewed By: andrewwdye

Differential Revision: D4982966

fbshipit-source-id: 72c220fe22a3bc59396cfff992ad5f0f9c5bf83a
2017-05-02 09:11:15 -07:00
4bf813e068 Document cdata non-NULL invariant, and consequence Python side. (#1435)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-02 11:17:20 -04:00
3b4bc721ef fix osx build and suppress clang warnings (#1432) 2017-05-02 09:33:24 -04:00
dca208b525 Refactor test_sparse to reduce boilerplate. (#1421)
* Refactor test_sparse to reduce boilerplate.

Instead of manually creating a helper function, threading an is_cuda
parameter around, and creating a test method for CUDA and non-CUDA
variants, we take a different approach:

- There are now new member variables initialized in setUp which
  control how we carry out the test; at the moment,
  it's just whether or not we are using CUDA.  This means
  you don't have to pass is_cuda around, or do a conditional to
  get the triplet of constructors you need.

  I'll note that I am not a big fan of member variables in test
  objects, but these are (intended to be) immutable so I think
  it should be OK.

- Instead of manually defining test_foo and test_foo_cuda, we now
  have a new TestCudaSparse class which overrides setUp (from above)
  to swap in the CUDA implementation.  Way less boilerplate, and NO
  metaprogramming needed.

  If you need to opt out of CUDA testing, there is a new cpu_only
  decorator you can use.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-01 21:52:58 -04:00
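A condensed sketch of the pattern the commit describes, reusing the names from the message (TestCudaSparse, cpu_only, is_cuda); the real test file carries more configuration than shown here:

```python
import functools
import unittest
import torch

def cpu_only(fn):
    """Skip a test when the suite runs in CUDA mode."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if self.is_cuda:
            raise unittest.SkipTest("CPU-only test")
        return fn(self, *args, **kwargs)
    return wrapper

class TestSparse(unittest.TestCase):
    def setUp(self):
        # Immutable per-run configuration instead of threading is_cuda around.
        self.is_cuda = False
        self.value_tensor = torch.DoubleTensor

    def test_zero_sum(self):
        t = self.value_tensor(3).zero_()
        self.assertEqual(t.sum().item(), 0)

    @cpu_only
    def test_cpu_specific(self):
        pass

class TestCudaSparse(TestSparse):
    def setUp(self):
        super().setUp()
        self.is_cuda = True
        self.value_tensor = torch.cuda.DoubleTensor
```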
181cb15c72 Fix formatting error in docs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-01 21:47:22 -04:00
7df8fbb64f Generalize halving-doubling to support non-power-of-two cases using binary blocks algorithm
Summary: A generalized version of halving-doubling that supports non-power-of-two number of processes by breaking up execution into blocks that are powers of two and communicating interblock after the intrablock reduce-scatter. Non-power-of-two cases will have some degree of load imbalance compared to power-of-two, but cases with few large blocks (e.g. 8 + 4 or 16 + 8) should still perform relatively well.

Reviewed By: pietern

Differential Revision: D4955947

fbshipit-source-id: af4f218fedb6adf475530c38386978b81f4f2b74
2017-05-01 16:05:22 -07:00
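To make the "blocks that are powers of two" idea concrete, here is a small sketch (not Gloo code) of how a process count decomposes into such blocks; reduce-scatter runs inside each block and results are then combined across blocks:

```python
def binary_blocks(n):
    """Decompose n into power-of-two blocks, largest first: 12 -> [8, 4]."""
    blocks, bit = [], 1
    while n:
        if n & 1:
            blocks.append(bit)
        n >>= 1
        bit <<= 1
    return blocks[::-1]

print(binary_blocks(12))  # [8, 4]
print(binary_blocks(24))  # [16, 8]
```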
5c7453447f Fix bugs, rename differentiate to grad, make it more flexible 2017-05-01 16:44:56 -04:00
87164f554d Bug fixes 2017-05-01 16:44:56 -04:00
267e7c0431 Fix memory issues with Conv and BatchNorm 2017-05-01 16:44:56 -04:00
e5db8f98be Add torch.autograd.differentiate 2017-05-01 16:44:56 -04:00
20aa5b066f Convert some of the functions to new format
Also, fix a lot of issues that appeared after the previous commits.
2017-05-01 16:44:56 -04:00
de9998e198 Add support for the new Function format 2017-05-01 16:44:56 -04:00
702a2e3bc5 Make Variables not subclass Function anymore
Because of this Variables can no longer appear in the graph.
Every usage of a leaf Variable will leave an AccumulateGrad
function that has no outputs, but modifies var.grad as a side
effect.
2017-05-01 16:44:56 -04:00
2ca787fcf4 Refactor attribute names in autograd 2017-05-01 16:44:56 -04:00
2ec629bef9 Set SO_REUSEADDR to try and prevent bind errors
Summary:
After running the test suite many times we end up with a zillion
connections in TIME_WAIT state. Setting SO_REUSEADDR seems like it
should help binding to ports regardless of the TIME_WAIT state.

Reviewed By: andrewwdye

Differential Revision: D4979606

fbshipit-source-id: b611f9c9e11aba858dc192f6bca3d64e10100b52
2017-05-01 13:36:14 -07:00
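For reference, the equivalent setup in Python's socket module (the port number is arbitrary); SO_REUSEADDR must be set before bind():

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow rebinding a port that a previous test run left in TIME_WAIT.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("0.0.0.0", 29500))
sock.listen(16)
```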
2197e4c766 version bump 2017-05-01 15:54:52 -04:00
2a28283680 Fix pair destructor if in CONNECTING state
Summary:
It can happen that a pair is destructed while in CONNECTING
state when some unrelated code throws an exception after the connect
function has been called. The most likely place for this to happen is
when connecting pair A is in progress while connecting pair B throws
an exception. The exception will force destruction of all references
to pair A, even if it is in the CONNECTING state.

Also see https://github.com/facebookincubator/gloo/issues/33

Reviewed By: andrewwdye

Differential Revision: D4979557

fbshipit-source-id: 0cddddd3f478106f1694603fe7f2efe15a2d9aa1
2017-05-01 12:41:07 -07:00
4624278b1d Make sparse documentation title consistent with others. (#1420)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-05-01 11:48:00 -04:00
79d4ac670c Add map_location to load_url (#1418) 2017-05-01 10:21:30 -04:00
4ebf3ff46d Add base for CUDA allReduce and broadcast in DataChannelGloo 2017-05-01 01:49:10 -07:00
ac3ba9a2ad Rebase fixes 2017-05-01 01:49:10 -07:00
14e1bfddbc Change warning message in MPI 2017-05-01 01:49:10 -07:00
c19fbd3364 Update comments; Add inline accessors for value_type tuple in GlooCache 2017-05-01 01:49:10 -07:00
a17d96d571 Add multiple thread support for DataChannels
Previously, when using the same data channel from multiple threads,
there was no guarantee against deadlocks or even errors.
2017-05-01 01:49:10 -07:00
b7dcc29430 Forward declare GlooCache key_type 2017-05-01 01:49:10 -07:00
18b4dcd28b Remove unused variable in macro 2017-05-01 01:49:10 -07:00
be81304d27 Moved GlooCache to new file; Functions renames; Minor fixes 2017-05-01 01:49:10 -07:00
f07f13c6e9 Change Store exception handling 2017-05-01 01:49:10 -07:00
310d08c37b Fix store and all operations 2017-05-01 01:49:10 -07:00
234df2138a Fix compilation errors 2017-05-01 01:49:10 -07:00
2b340e7d50 Add python tests; Remove broken prefix store creation 2017-05-01 01:49:09 -07:00
6888c61fa8 Fix DataChannelGloo compilation 2017-05-01 01:49:09 -07:00
ba3328b365 Add DataChannelGloo tests 2017-05-01 01:49:09 -07:00
3b4fe5dfc4 Add isend/irecv; Add all types generator for template functions; Minor refactor 2017-05-01 01:49:09 -07:00
ce42761628 Add groups 2017-05-01 01:49:09 -07:00
df4791d6c0 Implement DataChannelGloo 2017-05-01 01:49:09 -07:00
7e8830c3d5 Initial gloo bindings 2017-05-01 01:49:09 -07:00
b91cec7f66 Fix THD library build for CUDA 2017-05-01 01:49:09 -07:00
765aeb1a08 Fix nonzero bug 2017-05-01 01:49:09 -07:00
280e2a94e5 Worker init clarification; Inform on error thread notification failure 2017-05-01 01:49:09 -07:00
e7f453b5de Add barrier to test; Minor changes; 2017-05-01 01:49:09 -07:00
8030aa0f1b Refactor error thread 2017-05-01 01:49:09 -07:00
40ad2cde62 Remove unnecessary nonzeroElems function 2017-05-01 01:49:09 -07:00
af4a978c44 Move error thread to CommandChannel; Minor fixes; 2017-05-01 01:49:09 -07:00
fe5fc6723f Remove unnecessary code 2017-05-01 01:49:09 -07:00
6e6179633b Minor fixes in THDMasterWorkerInit 2017-05-01 01:49:09 -07:00
c97e60c45d Add actual error reporting in Master 2017-05-01 01:49:09 -07:00
2cdb368f97 Add error handling in MasterWorker mode 2017-05-01 01:49:09 -07:00
a5b2f3461a Review fixes 2017-05-01 01:49:09 -07:00
d3e60599d2 Add benchmark scripts (#66) 2017-05-01 01:49:09 -07:00
98d8e0b040 Lapack functions implementation #2 + fixes after review 2017-05-01 01:49:09 -07:00
fe2c360eda Lapack function implementation #1 2017-05-01 01:49:08 -07:00
59ae109bbb Implement functions from set 1 (except Lapack) 2017-05-01 01:49:08 -07:00
8623076654 Add convertToRank to do bound checking 2017-05-01 01:49:08 -07:00
a362b4f367 Add support for unsigned char aka byte to MPI 2017-05-01 01:49:08 -07:00
ef724e355c Change rank type: int -> std::uint32_t; Minor fixes 2017-05-01 01:49:08 -07:00
e863d27393 Tweaks, fixes, cleanup in DataChannelTCP 2017-05-01 01:49:08 -07:00
4c388f9398 Revert structure changes; Minor fixes 2017-05-01 01:49:08 -07:00
6740d1d904 Rewrite CommandChannel 2017-05-01 01:49:08 -07:00
f891d9b1bf Don't build tests by default 2017-05-01 01:49:08 -07:00
a81f330854 Rename construct -> new; Minor fixes 2017-05-01 01:49:08 -07:00
c02241edbd Minor code refactor 2017-05-01 01:49:08 -07:00
f30a92fa17 Fix invalid socket initialization 2017-05-01 01:49:08 -07:00
1391ff99f4 Use TCP_NODELAY for data sockets 2017-05-01 01:49:08 -07:00
43019bd88a Always loop over all possible addresses in worker 2017-05-01 01:49:08 -07:00
d6380910f5 Removed unnecessary code; Minor fixes 2017-05-01 01:49:08 -07:00
04491e84e4 Fix build with CUDA 2017-05-01 01:49:08 -07:00
e247249a5f Implement TH_API functions from the set 4 2017-05-01 01:49:08 -07:00
0160438eb9 added logical not operator for ByteTensor (#1403) 2017-04-30 08:47:24 -04:00
7dd8571bc6 fix avg_pool docs in nn.functional 2017-04-30 08:44:43 -04:00
48a7869b23 Doc fixes (#1409) 2017-04-30 08:28:19 -04:00
582fd3db7d fix osx build 2017-04-29 09:29:57 -04:00
9169f60a84 Parallelize TensorMethods.cpp builds (#1400) 2017-04-29 09:07:21 -04:00
457d78a7d9 Use THCUNN backward kernels for Tanh and Sigmoid in Autograd (#1399) 2017-04-29 09:07:03 -04:00
a071ccbea6 fix NCCL makefile for CUDA 7.5 (#1401) 2017-04-29 09:04:01 -04:00
db1eb66456 corrected docstring for Dropout (#1404) 2017-04-29 13:40:47 +02:00
45020a74cd remove inplace pow and fix contiguous -> coalesce (#1398) 2017-04-28 18:26:29 -04:00
9c01f5d6b2 Document hybrid sparse tensors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-28 23:53:01 +02:00
cbb9f08b71 Add new init methods gain, eye and dirac (#1172) 2017-04-28 17:16:40 -04:00
f75ab857b8 Add safeCoalesce() to tests 2017-04-28 17:11:05 -04:00
f2903332c7 Make coalesce() out of place 2017-04-28 17:11:05 -04:00
9643be76f9 speed up accumulation 2017-04-28 17:11:05 -04:00
4f09461d24 Rename sparse tensor contiguous() to coalesce() 2017-04-28 17:11:05 -04:00
bafb2e5cc2 Implement sparse pow. (#1387) 2017-04-28 23:06:09 +02:00
28a7fbbdf5 Documentation fix for torch.gather 2017-04-28 22:45:14 +02:00
4c1cdb6148 Refactor Python string utility function 2017-04-28 21:25:26 +02:00
775481ed56 re-enable dilated convolutions on Kepler (#1394) 2017-04-28 14:42:19 -04:00
5b2aac7c73 Merge commit '224f5eabf5cfb3a19abc1819f7dac230500b6bdb' 2017-04-28 13:48:06 -04:00
224f5eabf5 half<->float conversion cleanup (#680) 2017-04-28 19:46:42 +02:00
fd490c6490 Merge commit 'd6a31c68a0f39656257322a55c9e04dd579de828' 2017-04-28 13:42:23 -04:00
d6a31c68a0 Add option to disable ppc64le's VSX support
Set environment variable TH_NO_VSX=1 to disable VSX.
2017-04-28 13:41:03 -04:00
96a281dfab Add one more missing self.dilation parameter. (#1392)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-28 19:16:32 +02:00
94b147fd41 Allows dicts batches in dataloader. (#1354)
* Allow dicts in Dataloader

* use collections.Sequence instead of collections.Iterable in dataloader
2017-04-28 19:14:52 +02:00
c26f6877a0 guard topk for half (#759) 2017-04-28 11:57:15 -04:00
8908000262 function -> lambda in test 2017-04-28 10:31:40 -04:00
8b1d5727d8 fix minor docs 2017-04-28 10:13:52 -04:00
75f1989bec Add nn.Bilinear and tests 2017-04-28 10:11:30 -04:00
e221536ad8 Merge commit 'a44317fea88adddded91e068088415de1e66fd4b' 2017-04-28 08:04:39 -04:00
a44317fea8 Change magma_sgesvd to magma_sgesdd which is significantly faster 2017-04-28 08:03:39 -04:00
24e5a9057e Revert "Parallelize TensorMethods.cpp builds (#1364)" (#1390)
This reverts commit 060048bcd808893ba3113d09273a42642904078a.
2017-04-28 07:59:40 -04:00
060048bcd8 Parallelize TensorMethods.cpp builds (#1364) 2017-04-28 07:45:21 -04:00
77035d151e make topk test unique 2017-04-28 07:30:25 -04:00
50c9c23525 enable topk for all cuda 2017-04-28 07:14:21 -04:00
3f81803b09 Merge commit '69574a6dc4036b0113c512a1b2d74e23682c8a3b' 2017-04-28 07:08:43 -04:00
d421c473a9 Merge commit '928f6516c16ff91c0a789d0a653551041d1bafd0' 2017-04-28 07:07:24 -04:00
48f9e526ea implement expand/expandAs in CPU/GPU code 2017-04-28 07:06:25 -04:00
69574a6dc4 implement expand/expandAs in CPU/GPU code 2017-04-28 07:04:08 -04:00
928f6516c1 implement expand/expandAs in CPU/GPU code 2017-04-28 07:03:51 -04:00
b93b525a1c Enable specifying of margin in HingeEmbeddingLoss (#1378)
Previously it was not possible to set the margin for HingeEmbeddingLoss via the constructor. This patch fixes the issue and makes the loss behave as described in the docs.

A discussion of this issue can be viewed here:
https://discuss.pytorch.org/t/issue-with-setting-margin-for-hingeembeddingloss/2088
2017-04-28 06:58:48 -04:00
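A minimal usage example after the fix (shapes and margin value are arbitrary):

```python
import torch
import torch.nn as nn

loss_fn = nn.HingeEmbeddingLoss(margin=0.5)  # margin is now honored
x = torch.randn(8)
y = torch.ones(8)        # targets must be +1 or -1
loss = loss_fn(x, y)
```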
8db2cf6182 temp fix for transposed dilated convolution (#1388) 2017-04-28 02:53:37 +02:00
7e8ef0e22a Actually pass dilation to the underlying operators. (#1386)
No tests for now; we'll need some sort of shape DSL to concisely
represent them.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-27 23:38:01 +02:00
27990fee54 Use fully qualified name as tp_name for tensors and storages (#1379) 2017-04-27 16:26:44 -04:00
2ef7331007 Update sparse.py 2017-04-27 02:25:00 +02:00
c2cfa4cf5b Add THGenerate*Type.h for all types (#1014) 2017-04-27 01:11:56 +02:00
c915f8ddbf Signal error on connection error instead of asserting
Summary: No need to assert on connection errors.

Reviewed By: andrewwdye

Differential Revision: D4957698

fbshipit-source-id: b47f6f0f098dbf7d212701c5cb68e34b2c1c9522
2017-04-26 16:07:13 -07:00
b39a2f2cbb Documentation for sparse tensors. (#1366) 2017-04-26 21:43:05 +02:00
d9f01397b3 s/NOCUDA/NO_CUDA/
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-26 21:42:09 +02:00
8ca7bf2ab3 Check argument types in 'checkTypes' (#1363)
Fixes #1357
2017-04-26 15:00:41 -04:00
8950f41da3 Install CUDA headers.
Summary:
This PR makes cmake install the gloo CUDA headers if USE_CUDA is enabled.
Closes https://github.com/facebookincubator/gloo/pull/29

Differential Revision: D4946856

Pulled By: pietern

fbshipit-source-id: a688c3794c4a5e34b664e7bdeb4e1148f6504419
2017-04-25 22:42:12 -07:00
afd01164f8 Install missing headers.
Summary:
This PR installs missing include headers.
Closes https://github.com/facebookincubator/gloo/pull/30

Differential Revision: D4946478

Pulled By: pietern

fbshipit-source-id: da2d532afc43cf9e5e7fc764dc7821e2dfca6b37
2017-04-25 09:42:21 -07:00
a123247240 Move SIGPIPE initializer to test main
Summary:
It should be up to the program including Gloo to ignore SIGPIPE.
We have seen a case where the EPIPE errno is not properly handled in
an unrelated piece of code. Having SIGPIPE fire means we can get a
core and debug this further.

Reviewed By: andrewwdye

Differential Revision: D4896727

fbshipit-source-id: f6fe2d3f8dc68a9e6c2c457639b45f8aee2d7b20
2017-04-25 09:08:27 -07:00
41705ce7d5 Add zero padding module (#1326) 2017-04-25 16:58:51 +02:00
88fc1d39ff Generic TopK implementation (#744)
* move TopK to generic

* partial genericization of kernel code

* introduce TopKTypeConfig, specialize radix type and conversion for floats

* implement topk for byte tensor

* implement for char tensor

* implement for int tensor, extend test to check indices as well

* works for longs too

* make bitfield set/get a struct, add support for 64-bit types

* extend to double tensor

* implement for half tensor

* asserts; test fix
2017-04-25 16:39:20 +02:00
9899512401 Remove common.h from root
Summary: This file was left over after a recent refactoring but is not used.

Reviewed By: andrewwdye

Differential Revision: D4940265

fbshipit-source-id: 01f8c5fbc73dd0ca0a92306dbfef22ff28133750
2017-04-24 13:51:15 -07:00
d95feb3feb Only build on 64-bit systems
Summary:
While it is theoretically possible to make Gloo work on 32-bit systems, it's unlikely anybody would ever use it on 32-bit systems. This removes the expectation that it should work...

Fixes #28
Closes https://github.com/facebookincubator/gloo/pull/31

Differential Revision: D4939073

Pulled By: pietern

fbshipit-source-id: 8c60804f7ae5cf835332871a424aefa2c498e8a4
2017-04-24 10:38:45 -07:00
3ab074b3c5 Fix torch.stack() with Variable inputs (#1345) 2017-04-24 12:20:51 -04:00
6a69f7007b Revert "add keyword out for autograd function Concat to match torch.cat (#1336)" (#1340)
This reverts commit 71b9dea6ecc2278511ba6c2531437d27d9a2b8c8.
2017-04-23 19:19:27 +02:00
71b9dea6ec add keyword out for autograd function Concat to match torch.cat (#1336) 2017-04-23 15:36:24 +02:00
fa4f363b93 Instance norm (#1283)
* instance norm

* fix whitespaces

* whitespaces

* docs

* "C" letter was cyrillic in docs, fixed

* remove force_eval, fix non contiguous case
2017-04-23 14:49:15 +02:00
aab30d4ea2 Fix errors when no CUDA devices are available (#1334)
Fixes #1267

This fixes a number of issues when PyTorch was compiled with CUDA
support but run on a machine without any GPUs. Now, we treat all errors
from cudaGetDeviceCount() as if the machine has no devices.
2017-04-23 14:45:27 +02:00
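With the fix, code such as the following degrades gracefully on a GPU-less machine instead of raising from cudaGetDeviceCount() (written with today's torch.device API for brevity):

```python
import torch

# device_count() reports 0 on machines without GPUs rather than erroring out.
if torch.cuda.is_available() and torch.cuda.device_count() > 0:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
```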
2b56711c24 Indexing fix for fused GRU/LSTM kernels when all tensors are not contiguous. (#1325) 2017-04-22 04:22:32 -04:00
2fa3365f94 Merge commit '5224fc56b03b6468cb85ccf39034b8ab0d76d04e' 2017-04-22 01:14:34 -07:00
5224fc56b0 fix typo 2017-04-22 10:14:09 +02:00
4373580e6b Merge commit 'e80a3a7f7b8d0e179c1481e0744f08e9385b31f3' 2017-04-22 01:11:10 -07:00
d9406a8a1a Merge commit '10387a3f35573462e18219c321ff550757ce9b09' 2017-04-22 01:10:53 -07:00
e80a3a7f7b Indexing fix for fused GRU/LSTM kernels when all tensors are not contiguous. 2017-04-22 01:09:46 -07:00
5b83fe6781 add contiguous checks 2017-04-22 09:57:36 +02:00
24d92b5d9f Concatenate directly into shared memory when constructing batches (#1323)
This saves an extra memory copy, which speeds up data loading a bit
(5-10% with accimage).

As part of this change:

 * torch.cat accepts keyword argument out
 * specifying out=None is treated like not specifying out
2017-04-22 03:40:30 -04:00
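A sketch of the collate pattern the commit enables: concatenate samples directly into a preallocated shared-memory tensor via the new out= keyword, avoiding an extra copy (shapes are illustrative):

```python
import torch

samples = [torch.randn(3, 32, 32) for _ in range(8)]
out = torch.empty(len(samples), 3, 32, 32).share_memory_()
torch.cat([s.unsqueeze(0) for s in samples], dim=0, out=out)
```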
1375694853 Document torchvision members 2017-04-21 12:50:36 -07:00
be5e399d46 Add a simple README for torch/lib. (#1322)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-21 15:06:12 -04:00
10387a3f35 fix gradBias checks 2017-04-20 19:21:50 -04:00
a782a6231f Merge commit 'e788ea40de0f7ef393f1b602098a6775a95d8976' 2017-04-20 19:00:45 -04:00
e788ea40de fix typo in TH_APPLY for _dimOffset 2017-04-20 18:59:12 -04:00
6089900011 grammar/typo: "There's 3" -> "There are three"
Summary: Closes https://github.com/facebookincubator/gloo/pull/27

Differential Revision: D4919746

Pulled By: pietern

fbshipit-source-id: 35733b75fc169d2ccff8b10df013eed8c279dfd5
2017-04-20 15:19:56 -07:00
81345306c8 Merge commit '8236d38e81396ac48697ac289c0476cff18a8e08' 2017-04-20 15:03:48 -07:00
f0a19e2617 Merge commit '331219c5506b26bf0906b7acdafb4823e07a924e' 2017-04-20 15:01:22 -07:00
8236d38e81 add cusparse link dependency 2017-04-20 14:31:30 -07:00
8adf8fe2ed create and expose handles for cusparse 2017-04-20 14:30:14 -07:00
d2472d1ab5 Disable cudnn dilated convolutions for kepler. (#1308) 2017-04-20 15:31:45 -04:00
331219c550 define abs for short too 2017-04-20 09:55:17 -07:00
7805ac9098 Base Store::wait() should ignore timeout for back compat
Summary: PrefixStore::wait() uses a default timeout if unspecified. This is incompatible when using PrefixStore to wrap a Store implementation that does not support timeout. Instead the base Store::wait(keys, timeout) implementation is called, throwing an exception. This change modifies the base implementation to ignore the timeout.

Differential Revision: D4916517

fbshipit-source-id: 3cdd83bd209bf938b58442d82f3fc245e68019ad
2017-04-19 16:49:44 -07:00
5f65ee9ca0 Add more newContiguous calls and checks 2017-04-19 14:01:31 -07:00
f9149b1f2e Fix halving-doubling corner cases
Summary: Fixes for corner cases with small element counts. Fixed problems include (1) calling range on out of bounds pointers, (2) failing to allocate send or receive buffers in cases where they correspond to out of bounds indices for reduce-scatter, but are needed in the allgather, (3) not allocating enough receive buffer space (more than count_ bytes may be needed in some cases)

Reviewed By: pietern

Differential Revision: D4912656

fbshipit-source-id: 0409d01894ff9c93ef1a1fdf8021c9ecf62f9b57
2017-04-19 12:20:28 -07:00
a8e6610e3d Fix argument typo in pad_packed_sequence docstring (#1300) 2017-04-19 13:50:59 -04:00
56cc1e219b Fix include in mpi/context.cc
Summary:
memcpy comes from cstring

See https://github.com/caffe2/caffe2/issues/286

Reviewed By: Yangqing

Differential Revision: D4914228

fbshipit-source-id: de60c2a98feb4228546a8f1fe237a090101f50e4
2017-04-19 10:19:55 -07:00
1607042bf4 Add timeout parameter and default to rendezvous Store::wait()
Summary: TSIA. Defaulting to 30s.

Reviewed By: pietern

Differential Revision: D4909202

fbshipit-source-id: 7f86f390077a19e559c90a1aa3aa768e273325d1
2017-04-19 10:11:56 -07:00
7d023cda6c Add timeout to RedisStore::wait()
Summary: Add a default 60s timeout to RedisStore::wait() to avoid blocking indefinitely when peer machines are unavailable.

Reviewed By: pietern

Differential Revision: D4908699

fbshipit-source-id: 39de9066633e8b0c8d1ee198b6bf3f70d3961196
2017-04-19 09:58:05 -07:00
9e8b4ef075 Include THCNumerics.cuh in THCAtomics.cuh. (#752) 2017-04-19 12:08:22 -04:00
a35f507532 Update functional.py (#1298) 2017-04-19 11:07:12 -04:00
6aa22beb86 Fix loss.py docs (#1296) 2017-04-19 11:03:15 -04:00
71bf8fb55b Clean up fd from destructor when in listening state
Summary:
It's possible the pair is in the listening state when it is
destructed. The fd will not have been cleaned up in that case, so we
shouldn't assert that being the case.

Reviewed By: andrewwdye

Differential Revision: D4909964

fbshipit-source-id: 7103d74910e3bcf5de9f4658d8f1f682b6c8a70c
2017-04-18 17:51:49 -07:00
c7d83a16f6 Update README.md 2017-04-18 19:05:18 -04:00
934816c01c Change the default algo for cuDNN conv forward to PRECOMP_GEMM (#1290) 2017-04-18 19:01:47 -04:00
5a0510934f Merge commit 'fcf4deac7d215f134ea25cd3def8b564b58b033c' 2017-04-18 15:21:20 -07:00
fc19473501 Corrections in legacy modules. (#1286) 2017-04-18 17:13:53 -04:00
34546f022a Expose dilated convolutions.
Fixes #1225.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-18 17:13:02 -04:00
ab77742f6e Add some missing documentation for arguments.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-04-18 17:13:02 -04:00
701e63107f speed improvements, fix tests 2017-04-18 12:46:54 -07:00
655c22569e CPU hspmm + more efficient reorder 2017-04-18 12:46:54 -07:00
cd3bbc9dfd more operations and optimizations (hspmm, reorder, ...) 2017-04-18 12:46:54 -07:00
1018b238ac make gradients contiguous in adagrad 2017-04-18 12:46:54 -07:00
e27bd4ce7a faster cadd 2017-04-18 12:46:54 -07:00
b2acc33c73 contiguousValues method 2017-04-18 12:46:54 -07:00
40804830b8 mark_contiguous operation 2017-04-18 12:46:54 -07:00
01d84c5f9d revert sparse cuda index type change 2017-04-18 12:46:54 -07:00
88b42324e7 spcadd, sparseMask, cadd, csub, cmul + tests 2017-04-18 12:46:54 -07:00
ec260fe8e9 add test for dsmm 2017-04-18 12:46:54 -07:00
328b416068 THCS contiguous + to_dense 2017-04-18 12:46:54 -07:00
4bde9efbd7 Update CONTRIBUTING.md 2017-04-18 15:39:58 -04:00
ff781ed059 Update CONTRIBUTING.md 2017-04-18 15:39:26 -04:00
8f9a1af253 Merge commit 'fcf4deac7d215f134ea25cd3def8b564b58b033c' 2017-04-18 12:22:44 -07:00
31900b6bae Merge commit '1feb120d938d47c01900f656322f16bc41d08af3' 2017-04-18 12:22:27 -07:00
46cf6ff5fb fix batchnorm docs (#1284) 2017-04-18 15:12:38 -04:00
fcf4deac7d Fused RNN kernel remove explicit instantiation, isn't needed. 2017-04-18 11:07:58 -07:00
1feb120d93 Mark input as optional for gradInput in Tanh and Sigmoid 2017-04-18 10:33:33 -07:00
2ca071d730 Remove double precision math from LogSigmoid too 2017-04-18 10:28:13 -07:00
8a901c510d Update ops for Sigmoid and Tanh 2017-04-18 09:55:11 -07:00
ed60fe0ed6 Gloo benchmarking and script updates
Summary: Add AllgatherRing and CudaBroadcastOneToAll to benchmark. Add host info and algorithm sweep to chronos script.

Reviewed By: pietern

Differential Revision: D4901111

fbshipit-source-id: 1421025d39b914b14e857f21c43eac30c9c9dd2f
2017-04-18 09:06:34 -07:00
f67ab32d34 Output peer address on network failures
Summary: Output peer address on network failures. This change will help in root causing network failures.

Differential Revision: D4899129

fbshipit-source-id: 60a762c6551a726081d5335ab478da8dd7f6dad7
2017-04-17 13:50:24 -07:00
9150e33765 Add support for creating docsets. (#1276)
Docsets are an offline documentation format introduced by Dash.app and
supported by Zeal and some other open-source clones.
2017-04-17 16:35:02 -04:00
e4478804ce Fix patched_make_field for newer Sphinx versions. (#1275)
Not sure since which version that change is needed, but using v1.5.5 here.
2017-04-17 16:17:58 -04:00
a220f2c3aa Fix group-convolution w/o biases on CPU. (#1273)
* Fix group-convolution w/o biases on CPU.

Not having this guard will cause a crash further down in the `cat`
function when it uses the first element in the passed list to create a
new tensor. (And even after that, cat doesn't handle nulls well.)

* Added test for groupconv w/o bias on CPU.
2017-04-17 14:53:28 -04:00
15267ac009 fix typo 2017-04-15 13:08:58 -04:00
0cb60e7d5a Retrieve ethernet interface link speed
Summary: Retrieve ethernet interface link speed

Reviewed By: pietern

Differential Revision: D4880290

fbshipit-source-id: 91f1555d9bb35ff41dc731e082365a9002bb1661
2017-04-14 14:41:01 -07:00
b61174047f Add threshold to switch between host/device reduce and bcast depending on buffer size
Summary: Device reduce is more efficient for large buffer sizes. For smaller buffers, host reduce may be more efficient in some cases and frees up the GPU for other work.

Reviewed By: andrewwdye

Differential Revision: D4885855

fbshipit-source-id: 7dc522e8c93e1a94427730aca6af03b7e93e660d
2017-04-13 15:05:47 -07:00
8d93fcf13f Don't allow overwriting keys in HashStore
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4885102

fbshipit-source-id: c46c180fa8e6dd354921d562830b3515ba91c964
2017-04-13 12:35:32 -07:00
a559893c9f Instantiate nccl type templates for gloo (minus half)
Summary:
Instantiate nccl type templates for gloo (minus half).
half requires at a minimum ifdefing CUDA_HAS_HALF and likely requires
more work given that operators aren't defined on it, so skipping it
for now.

Reviewed By: pietern

Differential Revision: D4876217

fbshipit-source-id: 833d2aec12789cbaf9e0a201b979a420fbe6732f
2017-04-13 10:52:38 -07:00
50c2759afe Expose missing headers
Summary: Closes https://github.com/facebookincubator/gloo/pull/25

Differential Revision: D4883908

Pulled By: pietern

fbshipit-source-id: 662a8fdf83ad099295b11043194de25c747e8286
2017-04-13 10:08:06 -07:00
cb66e9cf78 torch.diag bug fix (#1251) 2017-04-12 20:59:12 -07:00
735f5af87e Add new variant of halving/doubling algorithm that pipelines local reduce/broadcast with communication steps
Summary: Added a pipelined version of the CUDA halving/doubling algorithm. Half the buffer is reduced prior to the first send and the other half prior to reducing the result from the first receive. Broadcasts are started asynchronously as soon as each new message is received. The pipelined code was added as a separate algorithm, as pipelining makes performance worse for small buffer sizes.

Reviewed By: pietern

Differential Revision: D4847109

fbshipit-source-id: 5aa55de95f8c94069380af7396f2b5b6297dcbea
2017-04-12 18:01:22 -07:00
c852883086 add named_parameters that yield name and value of parameters (#1242) 2017-04-12 16:32:36 -07:00
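A minimal usage sketch of the named_parameters() API added in this commit; the model below is a made-up placeholder, not code from the patch.

```python
import torch.nn as nn

# named_parameters() yields (name, parameter) pairs for a module tree.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
for name, param in model.named_parameters():
    print(name, tuple(param.size()))   # e.g. "0.weight (8, 4)"
```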
ab77e4c3d7 Merge commit '62c584ba7972dbba404766aa06d1a558282b4169' 2017-04-12 15:06:58 -07:00
2444278b8b Merge commit '4336e9ea6641b8ac2814eaef2adef64e4106459c' 2017-04-12 15:06:10 -07:00
62c584ba79 Fix abs with char and short cuda types. (#747) 2017-04-12 15:04:59 -07:00
fbd53d87bf block wide reduction with multiple values to reduce at once (#745) 2017-04-12 15:04:43 -07:00
71303b8af4 Autograd deadlock for recent glibc fix (#1243) 2017-04-12 22:24:31 +02:00
4336e9ea66 Revert "make it compile on Windows + use ilp64 MKL" (#1002) 2017-04-12 12:07:16 -07:00
d48afd41f9 Add print string for MaxPool3d, change for MaxPool2d (#1115) 2017-04-12 15:58:28 +02:00
e21e4bf3e8 add pyyaml to conda note here as well 2017-04-11 21:21:18 -07:00
8e36339911 Merge commit '0925c91e80cc1b3a86fcbc54570f5bb204c9cb77' 2017-04-11 18:00:44 -07:00
5391fe8953 addr zeroes output buffer when beta=0 2017-04-11 18:00:11 -07:00
0925c91e80 addr zeroes output buffer when beta=0 2017-04-11 17:59:42 -07:00
253c854da5 update Dockerfile not to use requirements.txt 2017-04-11 15:42:05 -07:00
7c59754d24 update source build instructions 2017-04-11 15:24:31 -07:00
2bf7dc643f Merge commit 'aec658f8708a6f4448329da006d14ff2e13dc821' 2017-04-11 15:02:36 -07:00
ce30c76823 Merge commit '2b37ecfccf810a8e21c2c9ac9a943ce2f7c01015' 2017-04-11 15:02:16 -07:00
a8d60ad3ac fix THNN headers 2017-04-11 15:00:30 -07:00
aec658f870 fix THNN headers 2017-04-11 14:57:11 -07:00
2b37ecfccf fix THNN headers 2017-04-11 14:56:53 -07:00
01a35dcace Fix coalesced CUDA collectives for nonhomogeneous lists 2017-04-11 14:48:54 -07:00
afeeb81e79 Add support for keyword arguments in torch.cat 2017-04-11 14:48:54 -07:00
6002f94232 Fix is_tensor and is_storage for old-style classes 2017-04-11 14:48:54 -07:00
a5c7d98611 Import TripletMarginLoss 2017-04-11 14:48:54 -07:00
605b3c86ce Retain the type of numpy scalars in collate_fn 2017-04-11 14:48:54 -07:00
2087b1157a Improve serialization error messages 2017-04-11 14:48:54 -07:00
81e972031d Handle all errors if Module's sources can't be retrieved 2017-04-11 14:48:54 -07:00
e9ff57176b Fused pointwise kernels for GRU/LSTM 2017-04-11 13:42:06 -07:00
a739960515 Merge commit 'cfa504691c2ce5e10010ffb6cd43001c59109aea' 2017-04-11 13:41:54 -07:00
f43320dbf2 Merge commit '0dc52abe9a673547caf79ac64c73e8e16fb37b33' 2017-04-11 13:41:42 -07:00
cfa504691c Fused pointwise kernels for GRU/LSTM 2017-04-11 13:36:38 -07:00
0dc52abe9a Fused pointwise kernels for GRU/LSTM 2017-04-11 13:36:02 -07:00
0b50f794e9 Use thnn version of Tanh/Sigmoid instead of autograd. (#1234) 2017-04-11 12:49:57 -07:00
2abbb5133c Fixing function signatures: long -> ptrdiff_t (#1232) 2017-04-11 11:37:21 -07:00
fcf8387779 Fix ibv_devices wrapper if device list is empty
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4866469

fbshipit-source-id: 6bbde8ec9d71ea89ccdab379d48d122b90237460
2017-04-11 11:04:54 -07:00
ade105fb7c update README to install pyyaml from conda (#1231) 2017-04-11 10:23:45 -07:00
4e693d12ab Merge commit '79c4cb96b16dac603247ffd88c473e84565915a9' 2017-04-10 14:35:54 -07:00
79c4cb96b1 fix memory leak in btrisolve and getri 2017-04-10 14:35:07 -07:00
97bd6aae37 Throw error if Redis replies with error
Summary:
The code already asserted, but only on the reply type, so it didn't
include the actual error message. This makes debugging problems much
easier when people have problems running the benchmark suite.

Differential Revision: D4860022

fbshipit-source-id: 659bc461a724603375bff18eac90eca658492b05
2017-04-10 10:49:59 -07:00
f618ea9f31 Update README.md
Summary:
Mention GPUDirect in README
Closes https://github.com/facebookincubator/gloo/pull/24

Differential Revision: D4860167

Pulled By: pietern

fbshipit-source-id: 80804c778cdc6a9bcd8febe7e05142145cc6c61b
2017-04-10 10:49:59 -07:00
f6fef3718e fix typo in autograd.rst (#1219) 2017-04-10 01:16:59 -04:00
3fcdd6a42b Reuse sockaddr information from device
Summary: This is cheaper than doing getaddrinfo for every pair.

Reviewed By: andrewwdye

Differential Revision: D4850102

fbshipit-source-id: e77f468f099f63860b52fdd0dcc57a8a7a91a448
2017-04-09 16:37:41 -07:00
707c1ca4cc Function to retrieve PCI bus ID from device
Summary:
Part of this change is to perform a getaddrinfo in the TCP device
class so we can figure out the interface and subsequently PCI bus ID
of the NIC used for its traffic. This information can be used in a
later diff to avoid doing getaddrinfo calls in the TCP pairs and have
them reuse the information that is resolved by the device.

The PCI bus ID can be used to compute distance between NICs and GPUs
and make informed decisions on where to allocate scratch buffers.

Reviewed By: andrewwdye

Differential Revision: D4850035

fbshipit-source-id: 575e401a9273300bc720c814fef8971846ec748c
2017-04-09 16:37:41 -07:00
bc0ed9298d remove incorrect version in readme 2017-04-09 14:44:44 -04:00
040cf42643 Merge pull request #455 from twitter-forks/indexlinear
Adding Indexlinear
2017-04-09 13:52:56 -04:00
6d9ad1d66a Adding IndexLinear (#1181)
* Add IndexLinear

* Fixes to IndexLinear

- Fix IndexLinear test
- make it better for multithreaded case
- fix a glitch in the C code
- improve the reset() method
- fix the weight allocation.
- remove "fakeBatch" possibility as it's not used
- clamp normalized values at evaluation time instead of just dividing by max.
- add assert on the keys/values dimensions in IndexLinear.
- invert order of weightDecay in the case of output dim > 1.

* Changes required to support IndexLinear in CUDA

* Adding support for flattened inputs for IndexLinear

* Doc for IndexLinear + fix for when the input format changes from one batch to another.

* Cleaning up IndexLinear documentation

* Changes required to build with latest torch

* Adding benchmark script for IndexLinear

* Bugfixes and cleanup of IndexLinear.lua

- Fixed bug that occurs when performing multiple accGradParams +
  updateParams

- All the data required for the updates is put in a single table

- Added :parameters method
2017-04-09 13:51:45 -04:00
64ee4056d7 updated docker image inside the docs (#1216) 2017-04-08 10:29:03 -04:00
55d69b5ade Merge commit '88bcfc15316e3c878237a8f95aeb6e72402c90ff' 2017-04-07 17:20:52 -07:00
0d7d6e1f0d Merge commit '662163bef68a9d64f3cb13a903638c870c0b4aa6' 2017-04-07 17:20:15 -07:00
b16a352a3b Fix remainder and cremainder for integer types 2017-04-07 17:17:44 -07:00
88bcfc1531 Fix remainder and cremainder for integer types 2017-04-07 17:16:59 -07:00
662163bef6 Fix remainder and cremainder for integer types 2017-04-07 17:16:31 -07:00
4026593240 check for beta=0 and avoid multiply in sparse mm (#1211)
* check for beta=0 and avoid multiply in sparse mm
2017-04-07 20:14:32 -04:00
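A schematic Python sketch of the beta == 0 fast path this commit describes, not the actual sparse mm kernel: when beta is zero the output may hold uninitialized memory, so it is overwritten rather than scaled and accumulated (which also avoids propagating garbage or NaNs). The helper name and dense stand-in below are made up for illustration.

```python
import torch

def addmm_like(beta, out, alpha, a, b):
    # Hypothetical dense stand-in for the update out = beta*out + alpha*(a @ b).
    prod = alpha * (a @ b)
    if beta == 0:
        out.copy_(prod)              # skip the multiply; ignore old contents
    else:
        out.mul_(beta).add_(prod)
    return out

out = torch.empty(3, 3)              # possibly uninitialized memory
addmm_like(0, out, 1.0, torch.randn(3, 4), torch.randn(4, 3))
```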
a931064a52 Merge commit '441d75ce569f89bad3e2f1f2a2075e68ae3bc76b' 2017-04-07 16:57:05 -07:00
441d75ce56 Adapts basic operations to new THXVector interface 2017-04-07 16:56:12 -07:00
3de56785fa fix conv1d test and add for padding 2017-04-07 13:56:02 -07:00
5ee8536a02 Merge commit 'a89317a9d407241c97fe4486b3c88de8578445d7' 2017-04-07 13:49:18 -07:00
f00a5d2f54 Merge commit '66a20e5c328836c1eb720cf4e2eb916366aae487' 2017-04-07 13:47:25 -07:00
a89317a9d4 fix types in unfold.c 2017-04-07 13:32:04 -07:00
e48db02e10 remove unused python-level BatchNorm.py 2017-04-07 16:27:16 -04:00
7f2553bc6f dont use cudnn batchnorm for cudnn < 5.1.10 2017-04-07 16:27:16 -04:00
66a20e5c32 Support TORCH_NVCC_FLAGS environment variable
This is already supported in cutorch since august 2016, and is used in
pytorch integration (to reduce the binary size).
2017-04-07 18:23:22 +02:00
37d95687c4 Merge commit 'ae1c365dbdbf667ae24c57eec9f2e6b9debf16bd' 2017-04-06 16:37:31 -07:00
f0c7124420 Allow support for negative dimension argument for all functions 2017-04-06 16:37:00 -07:00
ae1c365dbd Add TH_INDEX_BASE to nDimension and stride functions 2017-04-06 16:30:11 -07:00
6fd9b53d93 Include common/linux.{h,cc} in CMake build
Summary:
Forgot to include these in a previous commit.
Closes https://github.com/facebookincubator/gloo/pull/23

Differential Revision: D4847072

Pulled By: pietern

fbshipit-source-id: 08aa9e8fa47377eb8c7747bd577eec7e615789f1
2017-04-06 15:20:59 -07:00
e692c38fcf Compute distance metric between PCI devices
Summary:
With this we can compute the best GPU device to reduce on. It is not
always the one CUDA indicates as GPU 0.

Reviewed By: andrewwdye

Differential Revision: D4845581

fbshipit-source-id: 13e0500f54fd507899646f781a97c09abcd3b056
2017-04-06 13:50:07 -07:00
5dfa73702f Display runtime information in benchmark output
Summary:
This makes it easier to capture, compare, contrast results with
different parameters.

Reviewed By: andrewwdye

Differential Revision: D4843715

fbshipit-source-id: ba6916dcd5f8bcc615d6edce1a54657241357c31
2017-04-06 11:06:23 -07:00
95140094cb Use CudaStream as first class object
Summary:
Instead of having every CudaDevicePointer "own" a stream, this change
moves to using CudaStream as first class object. It was pretty clunky
to use the copy{To,From}* functions on the CUDA pointer classes to
copy stuff around. For example it was not clear whether the stream
belonging to the source or destination was used to execute the copy
on. There is no longer such ambiguity after this change.

To make this work the CudaBroadcastOneToAll algorithm was changed to
include the workspace template argument, but only has the
CudaHostWorkspace implementation. The CudaDeviceWorkspace
implementation is left to be done for another change (that's not the
purpose of this change).

Reviewed By: andrewwdye

Differential Revision: D4841615

fbshipit-source-id: d0c1b9ba948ff6167832515afa7bdd2b32b48064
2017-04-06 11:06:23 -07:00
ef95926103 Move setTimeout to Device and set default tcp timeout to 30 sec
Summary: Make timeout a device attribute. Now the pair will configure timeout when connecting based on device timeout settings, instead of needing to be set explicitly on each pair. Set default tcp timeout to 30 sec.

Reviewed By: pietern

Differential Revision: D4838918

fbshipit-source-id: e6e6ee36c662eb5e7ba5354c904e50f9dcac258f
2017-04-06 08:50:21 -07:00
e7f5220dfa device_ids can be None again in data_parallel (#1187) 2017-04-06 10:30:53 -04:00
a7ae04a657 fix precedence problem when building with debug python (#1201) 2017-04-06 10:30:16 -04:00
7f03182bfa sizeAverage -> size_average in docs 2017-04-06 01:31:02 -04:00
9f2a5d804d Add a flag to fix when dataset size is not divisible by batch size. (#1133) 2017-04-06 00:18:43 -04:00
aa506fa4d7 fix docs typo 2017-04-05 23:42:02 -04:00
955869a09a fix cuda_allreduce_halving_doubling to correctly copy between and reduce on GPU buffers
Summary: cuda_allreduce_halving_doubling was not properly handling the case where buffers are allocated in GPU memory, trying to reduce and copy from them as if they were in system memory.

Reviewed By: pietern

Differential Revision: D4840259

fbshipit-source-id: 2615360cd2f1d9c7a37fb0bcdf33ff35528b2c75
2017-04-05 19:56:20 -07:00
d82cad3019 implement nn.Module.__dir__ (#1142) 2017-04-05 22:18:34 -04:00
9504246c32 add triplet margin loss (#1165) 2017-04-05 22:17:58 -04:00
81cf3dbf79 Merge commit '6bd4ecd15390517c68d598d236ffb0929ade277c' 2017-04-05 19:07:01 -07:00
12f1b4f76c Merge commit '84bdbe5ab4b602b021ff494487c8ad57457052d3' 2017-04-05 19:06:14 -07:00
84bdbe5ab4 btrisolve: Add sz checks, correct B's ordering, support nrhs>1. 2017-04-05 19:05:20 -07:00
85954032d9 fix doc formatting 2017-04-05 22:02:29 -04:00
1a04b92226 add note regarding SGD momentum 2017-04-05 20:45:41 -04:00
8a822d48f5 Update README.md
Summary:
Clarify that Redis Cluster is not supported. Also see #21.
Closes https://github.com/facebookincubator/gloo/pull/22

Differential Revision: D4837375

Pulled By: pietern

fbshipit-source-id: 6e3575b3b8dae6ca62beb765da15d8506da4abdb
2017-04-05 13:06:48 -07:00
5511ad258b cuda version of recursive halving/doubling allreduce
Summary: Basic port of the CPU halving/doubling algorithm. No pipelining is done between reduce/broadcast and communication.

Reviewed By: pietern

Differential Revision: D4823693

fbshipit-source-id: b18045d64edf90361bf7713f4ccb2e074757780f
2017-04-05 12:39:16 -07:00
75a635630d Update to ignore zero targets
If the target is zero, the loss and the gradient of the input are set to zero. This
is useful for variable-length natural language generation models.
2017-04-05 11:51:54 -07:00
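The later Python-level API expresses the same idea through an ignore_index argument; the sketch below is only an illustration of the behavior described here (padding targets contribute neither loss nor gradient), not the exact criterion touched by this commit.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5, requires_grad=True)   # 3 tokens, 5 classes
targets = torch.tensor([2, 0, 4])                # 0 marks padding here
loss = F.nll_loss(F.log_softmax(logits, dim=1), targets, ignore_index=0)
loss.backward()
# The gradient row for the padded position is all zeros.
assert torch.all(logits.grad[1] == 0)
```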
8e6524938b Undo D4832492 for Gloo
Summary: No folly dependency in Gloo.

Reviewed By: andrewwdye

Differential Revision: D4835050

fbshipit-source-id: 97d0c14fb770fdde68206ca5a20a974bef156392
2017-04-05 09:51:05 -07:00
4e4cfd8b2b Fix main()s to call folly::init/initFacebook/registrationComplete (part 14)
Summary:
Required for D4821763
Based on targets from https://fb.facebook.com/groups/fbcode/permalink/1304073246296178/ (I also excluded those targets which do not depend on folly:singleton).

Reviewed By: meyering

Differential Revision: D4832492

fbshipit-source-id: fcb4ce42e9e5359d4752769f77d7271e550201fe
2017-04-04 20:50:47 -07:00
6bd4ecd153 Use thrust::inclusive_scan for 1D cumsum/cumprod (#742)
For large 1D tensors thrust::inclusive_scan is much faster than our
current implementation.
2017-04-04 21:05:10 -04:00
5c802c5ba9 Refactor AllgatherRing to use remote buffer offset
Summary: Refactor AllgatherRing algorithm to remove all memcpy in the communication rounds by using outPtrs as send/receive buffer + remote buffer offset.

Reviewed By: pietern

Differential Revision: D4793186

fbshipit-source-id: 645d0758d246fd0b493e3fe312a8441d86f6d169
2017-04-04 17:08:26 -07:00
04f5b5ea83 Merge commit '5b40e4245d573ae0a6c2da70a0b712528aab2bce' 2017-04-04 15:39:35 -07:00
5b40e4245d Fix typo and make btrisolve work for doubles on the CPU. 2017-04-04 18:29:30 -04:00
ae5865082c Move common algorithm stuff into algorithm.h
Summary:
Combines the top level common.h with algorithm.h. With algorithm.h in
the common package, CUDA algorithms only need a dependency on that
package. CudaBroadcastOneToAll still depended on broadcast.h so this
change also removes that dependency and has it subclass the Algorithm
class.

Reviewed By: andrewwdye

Differential Revision: D4826885

fbshipit-source-id: 930037e39f7a2c941868e53f0bbc54e3f2e0b184
2017-04-04 13:05:50 -07:00
f86beccc5b Use workspace pattern with CudaAllreduceRingChunked
Summary:
GPUDirect support for CudaAllreduceRingChunked by adding a workspace
template parameter and adding workspace specific init functions.

To support this change the CUDA LocalOp classes had to be changed a
bit to take an extra destination/source pointer. This allows reduction
of 1-N pointers into a target pointer, where the target may live on
device or live on host. If it lives on the host, the NCCL operation
that executes the reduction is followed by a D-to-H memory copy. If
there is only a single input pointer, no reduction needs to happen and
the class just executes the D-to-H memory copy. The net result is that
we can interchangeably use device or host pointers as target for
reduction or source for broadcast and these LocalOp what you would
expect them to do.

Reviewed By: andrewwdye

Differential Revision: D4825236

fbshipit-source-id: 048ec6cbc5a0500bafbe1b3f6abe1e2e5f3a2675
2017-04-04 13:05:50 -07:00
d122b4e4ec Update btrisolve docs to the newest interface. 2017-04-04 15:21:16 -04:00
ccfc4567dc Merge pull request #78 from ilya-biryukov/master
Fix compilation error when compiling with 'clang -x cuda'.
2017-04-04 09:47:52 -07:00
81008aa111 Handle errors in sync IO path.
Summary: Fixes for handling errors and timeouts in blocking and polling sync paths. Add test coverage for errors and timeouts.

Reviewed By: pietern

Differential Revision: D4823498

fbshipit-source-id: 93721947a6404ca9cea6a4869f4156f8d270a981
2017-04-04 09:37:33 -07:00
0cdf10478d Start benchmark element sweep at 100
Summary:
Any number of elements below this always fits in a single packet
and will yield ~identical results.

Differential Revision: D4825190

fbshipit-source-id: 71ac77456049e991da5059d5a029c5e9d2a67ed7
2017-04-03 23:50:38 -07:00
4de82cfa0f Use CudaAllreduceRing<CudaDeviceWorkspace> for GPUDirect
Summary:
The existing CudaAllreduceRing with a CudaDeviceWorkspace
template parameter now has the same effect.

Reviewed By: andrewwdye

Differential Revision: D4823393

fbshipit-source-id: 88fe497a983b26a281a3a74fe3bdc02c0c87c523
2017-04-03 20:05:25 -07:00
1ac8251373 Use gloo::make_unique to fix build for C++11
Summary: Closes https://github.com/facebookincubator/gloo/pull/20

Differential Revision: D4820325

Pulled By: pietern

fbshipit-source-id: 00a870f71e8e98ce6d06da261dcaed83b81ec81c
2017-04-03 17:07:04 -07:00
511ca3ea1b Add tests for tcp transport failures
Summary:
Implement a file store for multi-process transport failure testing. Add test cases to spawn multi-process tcp communication, and verify that all processes throw the expected IoException.

A future diff will add coverage for connectivity failures, sync modes, and ibverbs.

Reviewed By: pietern

Differential Revision: D4807794

fbshipit-source-id: 35212719d46e6d875eacb341fae25681f39053bc
2017-04-03 16:08:39 -07:00
8ce1382e99 make it compile on Windows + use ilp64 MKL (#981) 2017-04-03 18:02:15 -04:00
22cdef3ddc recursive halving/doubling allreduce
Summary:
Allreduce using the recursive halving and doubling algorithm, described in http://www.mcs.anl.gov/~thakur/papers/ijhpca-coll.pdf (see the top diagram on page 12). The algorithm consists of 2 log P stages: the first log P stages perform a reduce-scatter and the second log P stages perform an allgather. Message size varies across steps; the early stages of the reduce-scatter and the late stages of the allgather send the largest messages. The communication is structured so that the largest messages are sent between nearby ranks, which can be useful if elements are ranked in a locality-aware fashion.

So far this supports only a power-of-two number of processing elements.

I have attempted to minimize the amount of synchronization/hand-shaking. Messages are received at different offsets of the output buffer for each communication step. Send offsets in the reduce-scatter steps become receive offsets in the allgather and vice versa. The reuse of buffers across reduce-scatter and allgather steps requires synchronization. The algorithm is currently inefficient in terms of memory use, requiring 3x memory. This can be reduced, but would require additional synchronization.

Reviewed By: pietern

Differential Revision: D4795878

fbshipit-source-id: fcc6597ef6a99cd102fce2b8e4562d93088d39dc
2017-04-03 14:05:44 -07:00
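To make the schedule concrete, here is a minimal single-process Python simulation of the recursive halving (reduce-scatter) and recursive doubling (allgather) phases for a power-of-two number of ranks. It is only a sketch of the communication pattern, not the Gloo C++ implementation; the buffer setup and the check at the end are made up for illustration.

```python
import numpy as np

def halving_doubling_allreduce(bufs):
    # Simulate the allreduce on a list of per-rank vectors (power-of-two rank count).
    p, n = len(bufs), bufs[0].size
    assert p & (p - 1) == 0 and n % p == 0
    seg = [(0, n) for _ in range(p)]          # slice each rank is responsible for

    # Reduce-scatter: distances p/2, p/4, ..., 1.
    d = p // 2
    while d >= 1:
        snapshot = [b.copy() for b in bufs]
        for r in range(p):
            partner = r ^ d
            lo, hi = seg[r]
            mid = (lo + hi) // 2
            keep = (lo, mid) if r < partner else (mid, hi)
            # Add the partner's copy of the half we keep into our buffer.
            bufs[r][keep[0]:keep[1]] += snapshot[partner][keep[0]:keep[1]]
            seg[r] = keep
        d //= 2

    # Allgather: distances 1, 2, ..., p/2; owned segments double each step.
    d = 1
    while d < p:
        snapshot = [b.copy() for b in bufs]
        new_seg = list(seg)
        for r in range(p):
            partner = r ^ d
            plo, phi = seg[partner]
            bufs[r][plo:phi] = snapshot[partner][plo:phi]
            new_seg[r] = (min(seg[r][0], plo), max(seg[r][1], phi))
        seg = new_seg
        d *= 2
    return bufs

ranks = [np.arange(8, dtype=np.float64) * (r + 1) for r in range(4)]
expected = sum(b.copy() for b in ranks)
for out in halving_doubling_allreduce(ranks):
    assert np.allclose(out, expected)
```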
148b11847b Remove useless base class in allreduce.h
Summary:
Didn't provide enough value now that ReductionFunction and
CudaReductionFunction are no longer related.

Reviewed By: andrewwdye

Differential Revision: D4819295

fbshipit-source-id: e6479769af7f78d486bee7d9c31f049430cdc775
2017-04-03 11:09:50 -07:00
b3a2f30715 Extra workspace template parameter for CUDA algorithm
Summary:
To bring the GPUDirect and non-GPUDirect implementations of CUDA aware
algorithms closer together this change introduces CUDA workspaces.
There's an implementation for a host side workspace and a device side
workspace. The former is used for transports that don't support
GPUDirect and the latter for ones that do. CUDA algorithms will take
an extra template parameter for this workspace and this will determine
whether they can be used for GPUDirect or not.

The workspaces only define their respective pointer types right now
but may contain local operation construction functions at a later
point in time.

Reviewed By: andrewwdye

Differential Revision: D4802826

fbshipit-source-id: cb1d71a224ce0165afd07fb9092ad54d3e07c8cf
2017-04-03 11:09:50 -07:00
91c4ba7980 Add torch.arange and deprecate torch.range 2017-04-03 10:38:58 -04:00
03f1cab801 Unify argument names in norm and renorm 2017-04-03 10:38:58 -04:00
fa2c566353 Add Variable.type_as 2017-04-03 10:38:58 -04:00
2d1122739c Raise AttributeError in Module.__getattr__ 2017-04-03 10:38:58 -04:00
7861f585fe Reshape grad in dot 2017-04-03 10:38:58 -04:00
3abf2ef225 Merge pull request #991 from BTNC/win
add /arch:AVX /arch:AVX2 explicitly for msvc so it compiles on windows
2017-04-02 13:32:57 -04:00
70c4b82eba add /arch:AVX /arch:AVX2 explicitly for msvc 2017-04-02 20:47:29 +08:00
274b5c9003 Allow unhashable inputs to parallel_apply 2017-04-01 20:11:20 +02:00
dfa2d26830 * make random_ range correct when both lower and upper are specified 2017-03-31 15:37:24 -04:00
559ae078b8 Fix Option constructor in invalid argument error printing code (#1160) 2017-03-31 15:35:35 -04:00
030ff4928a Merge commit 'a216e377b3844ac9c7882bd391a00f4e0ae718e7' 2017-03-31 11:45:37 -07:00
0829bffdec Merge commit '403cad46dc91a2bc2f6889754055decd6f3d53c7' 2017-03-31 11:45:24 -07:00
ffc7911bec Merge commit 'd8ae7893e056ebf4e7a5e96bab2c3b69f196ddfd' 2017-03-31 11:45:06 -07:00
ff1fde6151 Merge commit 'a3bfb9f376a57fb63e89ddf70f57353f19ed9d69' 2017-03-31 11:44:48 -07:00
a216e377b3 Merge pull request #456 from twitter-forks/addmm-fixes
Using temporary variables when performing transpose + addmm
2017-03-31 14:44:07 -04:00
b13b7010b9 check for nvidia driver's sufficiency before checking for number of CUDA devices (#1156) 2017-03-31 12:19:59 -04:00
a3bfb9f376 THVector_(add),(mul) -> (adds),(mul) for VSX.
This was previously completed for other architectures.
2017-03-31 08:50:23 -07:00
5c79046d39 Use persistent tensor to store exp_inf (part of optimizer's state) (#1152) 2017-03-31 10:30:31 -04:00
30fd222b80 implement autograd function cross (#1138) 2017-03-31 01:45:51 -04:00
3b7b23df66 Move CUDA collectives to cuda_collectives.h
Summary:
The CUDA algorithms all had their own version of local reduction and
broadcast. This commit consolidates them and allows all CUDA
algorithms to work with CudaDevicePointer instances.

Reviewed By: andrewwdye

Differential Revision: D4797968

fbshipit-source-id: cccef39fce01905a2cd757ccbcffd29803411409
2017-03-30 15:06:03 -07:00
d933287114 Add a barrier after verification iteration in benchmarks to prevent a race with regular iterations
Summary: Verification was sometimes failing for allreduce halving-doubling. Pieter noticed that it was due to the verification step racing with the regular iterations.

Reviewed By: pietern

Differential Revision: D4804558

fbshipit-source-id: f645cb2e332e449a993a634c5bdb42c2dcb8613b
2017-03-30 14:14:32 -07:00
761eef1f19 Minor typo fix in backward function in torch/autograd/variable.py (#1143) 2017-03-30 11:23:28 -04:00
d8ae7893e0 Get rid of warp-synchronous code (#739)
Time to get rid of warp-synchronous code. It will break!
2017-03-30 01:20:43 -04:00
90b872c670 Add GPUDirect capable version of CudaAllreduceRing
Summary:
This is a copy of CudaAllreduceRing that doesn't stage the locally
reduced buffer in host memory but uses the GPU side buffers directly.

Eventually I would like this to be absorbed back into
CudaAllreduceRing, but for now it's a good place to compare the two
implementations and abstract the parts that make sense, until they are
identical again.

Reviewed By: andrewwdye

Differential Revision: D4791629

fbshipit-source-id: 5ad065cb94adb968aeee2379327be313638f2161
2017-03-29 18:50:11 -07:00
a95ce9e98f Using temporary variables when performing transpose + addmm 2017-03-29 16:56:39 -07:00
403cad46dc Using temporary variables when performing transpose + addmm 2017-03-29 16:14:13 -07:00
b8ccf42c74 Constify algorithm constructors
Summary: TSIA

Reviewed By: gchanan

Differential Revision: D4795492

fbshipit-source-id: aaad7afd373e40fa4669129cf2c98594c4091153
2017-03-29 14:21:03 -07:00
8aa1cefed8 Fix deadlock in autograd (#1140) 2017-03-29 16:19:40 -04:00
4b147e2079 Settable timeout for tcp read/write
Summary: Add a setTimeout() API to the Pair interface. Implement in the tcp transport for connect, read, and write, and across blocking, polling, and async configurations. Ibverbs implementation to come later.

Reviewed By: pietern

Differential Revision: D4787932

fbshipit-source-id: 6072dc0c0add1700f84a72b83e4388b29b044ec1
2017-03-29 09:07:04 -07:00
0d908d813b Implements Cumsum function for autograd (#1122) 2017-03-29 17:45:57 +02:00
1c391f6f93 bump version 2017-03-29 10:08:34 -04:00
be146fd721 Add btriunpack and update the btrifact test. 2017-03-29 13:42:13 +02:00
2979f4b989 add more functions to docs 2017-03-29 01:29:17 -04:00
22b3600f19 add samplers to documentation 2017-03-29 00:33:07 -04:00
215813d7ac Change dockerfile to support for cudnn v6 (#1135) 2017-03-28 20:05:04 -04:00
80e88a88ed Fix ibverbs completion queue capacity
Summary:
The header already contained an analysis of required completion queue
depth but the queue pair was still initialized with a maximum queue
depth of kMaxBuffers. This change fixes that and updates the analysis
to talk separately about receive and send completion queues.

Reviewed By: andrewwdye

Differential Revision: D4785786

fbshipit-source-id: 4dc302d523a3b7162dc261d14cfcc755681febf8
2017-03-28 10:06:50 -07:00
dc7695a47a Update links for tutorials in README (#1123) 2017-03-28 14:21:40 +02:00
032a65edff modify pip uninstall command in CONTRIBUTING.md 2017-03-28 14:20:49 +02:00
55546359b6 Retry on EINTR for writev in tcp/pair.cc
Summary: TSIA

Differential Revision: D4783319

fbshipit-source-id: 610d1a65a54048e7c56610632ccfe271eac85b6c
2017-03-27 17:35:45 -07:00
fe3d5a63f2 Support multiple predefined reduction functions
Summary:
Predefining the reduction functions makes it easy to provide a set of
fast implementations. Eigen is used to implement them if it is found.

Reviewed By: andrewwdye

Differential Revision: D4780868

fbshipit-source-id: e825cf2e5cfe8ec27d587c5aff4002534b1c670d
2017-03-27 14:35:02 -07:00
e4b4e515cd add mode to cwrap 2017-03-27 13:29:14 -07:00
4b1f5f4bd6 Merge commit 'afd576ec0e389db3e47efe44652c488b1706f168' 2017-03-27 13:26:50 -07:00
37718e207d Add remote offset argument to buffer send
Summary: This makes it possible to write to any offset in a remote buffer.

Reviewed By: andrewwdye

Differential Revision: D4779776

fbshipit-source-id: f5a44cc705df5141bd720ff4e3fec8697f707a70
2017-03-27 13:07:17 -07:00
afd576ec0e Add mode kernel 2017-03-27 15:58:47 -04:00
95aa2af377 btrisolve: Make a Tensor method and update argument order
Also update docs for btrifact and btrisolve to the newest interface.
2017-03-27 15:46:49 -04:00
6774d39c96 Merge commit '5d274cd4991022d63b014cc8917e00c15441d3f4' 2017-03-27 11:54:08 -07:00
567faedc59 Merge commit '8051dec608368fed3569c7513292785083adc53c' 2017-03-27 11:53:41 -07:00
7c2c7e8e31 Move NCCL code to subdirectory and backfill ops
Summary:
All operations supported by NCCL are now available through the Gloo
wrappers. Algorithm wrappers for them are forthcoming so that they
can be used interchangeably with other implementations.

Since not all of them require same-sized source and destination
pointers, I moved assertions on number of elements to the op
constructors.

Reviewed By: andrewwdye

Differential Revision: D4771292

fbshipit-source-id: 2f34629507b5e1cb9ae8d6d2f02de0a7f641a341
2017-03-27 09:50:40 -07:00
3eab8a71e2 Added docstring to add_module (#1116) 2017-03-27 11:09:24 -04:00
2fd4d088ff add Adaptive pooling methods to docs 2017-03-26 22:43:46 -04:00
5d274cd499 Update btrisolve argument order. 2017-03-26 13:07:24 -04:00
8051dec608 Update btrisolve argument order. 2017-03-26 13:06:34 -04:00
f2c1071c33 Adaptive max and average pooling (1D & 2D) (#1084) 2017-03-26 17:09:28 +02:00
bb71117ecc Cwrap arg assign (#1102) 2017-03-26 13:53:28 +02:00
d25433a099 Fix docker build commands (#1103) 2017-03-25 16:18:33 -04:00
7dd45490f8 don't use inplace backward, remove unnecessary zero for grad_input (#1079) 2017-03-25 20:04:48 +01:00
bf632544e6 Pass NULL rinfo_ to btrifact by default (#1089) 2017-03-24 19:49:40 -04:00
282402d4f3 Revert "Add back zero fill for ger" (#1093)
This reverts commit 5a761dbe65d2221e9c200b3f8ea0590b5d9b923f.
2017-03-24 19:49:31 -04:00
1461709ea0 Improving the performance of IndexLinear:updateOutput
- Removes separate kernel for updateOutputTrain
2017-03-24 16:34:31 -07:00
cce03074f5 Merge commit '3acbbb30f2bdc6ccf4ffb6f7d568e7916d4e384d' 2017-03-24 16:19:44 -07:00
f2f63773d8 Merge commit '52911f9e47f679045a238eb9dfdc5db55bf98cc9' 2017-03-24 16:19:19 -07:00
84aa41824c Merge commit 'b4fe5ad641181f30bdcc4749c949206a3ebb04b4' 2017-03-24 16:19:05 -07:00
25c8a117af Merge commit 'e8196f990db4ba368010f0d950bebf1fb13c2888' 2017-03-24 16:18:52 -07:00
ae122707b5 Don't do extra resize in linear bias 2017-03-24 23:41:15 +01:00
b4fe5ad641 Use zero instead of mul when beta == 0 in addr 2017-03-24 13:09:00 -07:00
5a761dbe65 Add back zero fill for ger
Ger does not have beta argument, so has to be zero-filled.
2017-03-24 21:03:02 +01:00
dd893391d5 Add argument to children to yield the name of the modules (#941) 2017-03-24 20:02:05 +01:00
649f04d077 Added Pascal nvcc flags, bumped version 2017-03-24 11:58:14 -07:00
f45ef5fdb8 AllGather algorithm [CPU]
Summary: Allgather ring CPU implementation. It does |buffers| x |contextSize| passes.

Reviewed By: pietern

Differential Revision: D4723809

fbshipit-source-id: ffd8366ac7e1746555474e173143d33cee497822
2017-03-24 11:06:57 -07:00
e8196f990d Make rinfo_ argument optional in btrifact 2017-03-24 09:01:36 -07:00
269b77a1b2 Make rinfo_ optional in btrifact 2017-03-24 09:00:39 -07:00
476d85dd3f DataLoader: Fix batch data type for numpy array (#1074) 2017-03-24 11:34:24 -04:00
63f6c0d692 add Pairwise distance (#835) 2017-03-24 11:29:40 -04:00
b546fa3fcd add assertTrue to padding tests 2017-03-24 15:27:51 +01:00
1d656b6769 Ensure displayed progress in ProgressMonitor is between 0 and 100%.
Fixes #1086
2017-03-24 15:21:52 +01:00
3acbbb30f2 Fix inconsistent in-place and out-of-place for HardTanh
in-place and out-of-place updateGradOutput results are different where input=min_val or input=max_val
2017-03-23 17:27:29 -07:00
52911f9e47 Fix inconsistent in-place and out-of-place implementations
Currently, in-place and out-of-place updateGradOutput produce different results for input=max_val or input=min_val: in-place won't backprop the gradient where input=max_val or input=min_val, while out-of-place will.
2017-03-23 17:22:55 -07:00
a65e0f488c Remove zero fill where not needed (#1077) 2017-03-23 19:44:00 -04:00
8dc5d2a22e export current_blas_handle 2017-03-23 23:32:45 +01:00
ed97f3f854 Adding support for flattened inputs for IndexLinear
- Adding relevant tests
2017-03-23 14:18:41 -07:00
a231fe8fc5 IndexLinear support for cunn 2017-03-23 14:18:01 -07:00
bb353ccc17 Add batch triangular factorization and solves, add IntegerTensor to cwrap (#903) 2017-03-23 15:06:00 -04:00
ced0054a9e Fix formula for stddevs grad in Normal function (#1076) 2017-03-23 14:32:34 -04:00
68ee5ede29 make inplace tests compare input grads 2017-03-23 18:54:00 +01:00
2966e3295d Make static/shared configurable and install optional
Summary:
This makes it possible to embed Gloo in a project without CMake
installing Gloo headers and/or libraries, or having a runtime
dependency (and statically link to it).

Also:
* Install benchmark tools
* Statically link to NCCL if the bundled version is used
Closes https://github.com/facebookincubator/gloo/pull/19

Differential Revision: D4762432

Pulled By: pietern

fbshipit-source-id: cf38903e6c51f2480fba4ff18cbdc0c9080df0c4
2017-03-23 09:06:37 -07:00
4df98e2927 Merge commit '3865606299b1fbcd0a94cef4a66c1bc007246da8' 2017-03-23 08:39:43 -07:00
6ccac5ce28 Merge commit 'd3334db6274d7a3cd07f20d583056e453dc8134d' 2017-03-23 08:39:30 -07:00
3865606299 adding batch triangular factorization and solves, add IntegerTensor to cwrap 2017-03-23 11:37:00 -04:00
d3334db627 adding batch triangular factorization and solves, add IntegerTensor to cwrap 2017-03-23 11:35:35 -04:00
50f5a4dd18 fix BCE loss formula visualization (#1072) 2017-03-23 11:27:21 -04:00
b60936b9ae fix NLLLoss2d documentation 2017-03-23 10:06:40 -04:00
2d750b9da5 fix typo 2017-03-23 09:40:06 -04:00
ca376d4584 implement autograd function trace 2017-03-23 10:37:52 +01:00
ef183a1d23 Merge commit '5cd313ed23a3b11ddd739bcfedaee6e310e4e438' 2017-03-22 19:25:46 -07:00
f4d8944973 fix OSX fread bug (#1068) 2017-03-22 22:06:14 -04:00
6b7aef63ac Added support for multidimensional tensors in PReLU; Channel number now in second dimension 2017-03-22 20:36:52 -04:00
b3ab4b1094 Check torch.backends.cudnn.enabled, padding, and output_padding (#996)
* Check torch.backends.cudnn.enabled
* Don't allow negative padding and output_padding values
2017-03-22 19:42:11 -04:00
1e8cb82a2d Break only after the update in L-BFGS 2017-03-22 18:58:42 -04:00
dd399a8d68 Return total param norm from clip_grad_norm 2017-03-22 18:58:42 -04:00
faac0f5c25 Fix torch.cat bugs
Always use the PySequence API and disallow catting along nonexistent
dimensions.
2017-03-22 18:58:42 -04:00
c36f47bd1e Make random_ exclusive and make generator kwarg only in all random
functions
2017-03-22 18:58:42 -04:00
3d1888cd95 Fix size mismatch in CosineEmbeddingLoss backward 2017-03-22 18:58:42 -04:00
97a82a3018 fix formatting in upsampling docs (#1067) 2017-03-22 18:06:31 -04:00
5cd313ed23 Fix TH_TENSOR_APPLYX_D in the case where the dimension of interest is the inner dimension 2017-03-22 13:15:01 -07:00
b414494035 Merge commit '714b2b8bf657afe41cc8503998b6d919339b8075' 2017-03-22 12:49:29 -07:00
c10efc646e Merge commit 'e17d84d38edf6094175deead555abbc96321b69f' 2017-03-22 12:49:11 -07:00
348531ad8d Merge commit '0056b0883426e38ffbd646c040b6c281d12673f2' 2017-03-22 12:48:57 -07:00
9d83121ef5 Don't add options to CUDA_NVCC_FLAGS if already set
Summary:
This may be the case when the Gloo CMake files are sources from a
parent project that has already imported CMake CUDA support. If these
checks are not performed then CUDA_NVCC_FLAGS might contain
conflicting options.

Verified this works while working on Gloo for Caffe2.
Closes https://github.com/facebookincubator/gloo/pull/18

Differential Revision: D4756179

Pulled By: pietern

fbshipit-source-id: 32fc39ec2322cce5899a2398ebbf8395d3917502
2017-03-22 12:35:04 -07:00
6d7cb31e53 MPI: Duplicate MPI_Comm and allreduce maxLength as MPI_UNSIGNED_LONG.
Summary:
Some small MPI-related changes:
1) Instead of making an object copy of the MPI_Comm, call MPI_Comm_dup;
because the (passed-in) communicator is used later via the call to
connectFullMesh this guarantees that the communicator will not have been
freed by the user before connectFullMesh is called.

2) Allreduce for maxLength is done on an unsigned long type; use the
corresponding MPI type.
Closes https://github.com/facebookincubator/gloo/pull/17

Differential Revision: D4754195

Pulled By: pietern

fbshipit-source-id: 863fd33c726f88120f8f5ee61964c3525babbf97
2017-03-22 09:26:00 -07:00
30a9cf7a46 Mark transport pair after IO error and propagate to calling threads
Summary:
This change solidifies IO error handling between threads and successive transport API calls. When an IO exception occurs, signal all buffers of the error, propagating the exception from the device thread or single user thread onto all user threads. Store the exception in the pair and check on future API calls or device events. Swallow all IO exceptions in the device loop.

Right now IO exceptions during portions of the listen/connect phase will result in an indefinite wait in the peer. I will address this with a configurable timeout (t16205269).

Reviewed By: pietern

Differential Revision: D4749248

fbshipit-source-id: c75ee3b20875d561bf84631e5384e28015dabad3
2017-03-22 09:06:24 -07:00
714b2b8bf6 Merge pull request #453 from apaszke/lookup_renorm
Cast accumulator in LookupTable renorm to accreal
2017-03-22 11:53:41 -04:00
fe4bd5066b Added support for multidimensional tensors in PReLU; Channel number now in second dimension 2017-03-22 11:45:02 -04:00
e17d84d38e Added support for multidimensional tensors in PReLU; Channel number now in second dimension 2017-03-22 11:44:28 -04:00
b9aef6bc03 Fixing default values for LR and Epsilon (#895)
It seems that the default values for LR and Epsilon (previously, 1E-2 and 1E-38 respectively) were different from the ones recommended by the authors (2E-3 and 1E-8, respectively). Other packages such as Keras (https://github.com/fchollet/keras/blob/master/keras/optimizers.py#L474) and Lasagne (https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py#L612) use the suggested values as well.
2017-03-22 11:34:39 -04:00
0056b08834 Narrow V when returning only some right singular vectors 2017-03-22 08:33:03 -07:00
bd0df61bb5 Cast accumulator in LookupTable renorm to accreal 2017-03-22 08:29:39 -07:00
d9678c2e34 Correct typo in batchnorm documentation 2017-03-22 13:55:45 +01:00
b3c0aa3b7d fix a typo in ffi doc (#1055) 2017-03-21 15:37:48 -05:00
8fc9c79287 Add nccl submodule 2017-03-21 17:53:58 +00:00
4fce1a389f Include CUDA support in CMake build
Summary:
* Pull in NCCL submodule
* Include (heavily modified) CUDA/NCCL build files from [Caffe2](https://github.com/caffe2/caffe2)
* Build CUDA enabled benchmark/test
* Enable CUDA build in Travis configuration
Closes https://github.com/facebookincubator/gloo/pull/16

Differential Revision: D4746784

Pulled By: pietern

fbshipit-source-id: b5c6cbcd8ac8b30c071851cdc7ae88c69c0ab4d6
2017-03-21 10:51:57 -07:00
8ce56c30d4 Convert runtime errors to gloo exceptions
Summary:
Bubble up gloo configuration and network errors as exceptions. The caller may be able to recover. Other unexpected failures continue to be handled as fatal with GLOO_ENFORCE.

Modify ibverb API validation to check for != 0 instead of -1 to conform with API definition.

Still need to convert some errors in the rendezvous code and add documentation.

Will pass device loop errors onto the calling thread in a future diff

Reviewed By: pietern

Differential Revision: D4730362

fbshipit-source-id: c801adb353013e7f541ab01ac16a0cc71c1c36b2
2017-03-20 13:50:29 -07:00
4667f936e3 Add explicit dependency on pthreads
Summary:
Got linker errors on Ubuntu 16.04 (not on 14.04).
Adding the pthreads dependency explicitly fixes it.
Closes https://github.com/facebookincubator/gloo/pull/15

Differential Revision: D4739081

Pulled By: pietern

fbshipit-source-id: 6bae7d361d934e93560d28a76c3dca4a4236f113
2017-03-20 11:52:41 -07:00
4eaa30b634 Build tweaks
Summary:
* Mention submodules in README
* Remove fetch.sh from third-party directory
* Rename benchmark/test build targets
Closes https://github.com/facebookincubator/gloo/pull/14

Differential Revision: D4739077

Pulled By: pietern

fbshipit-source-id: 859c1cac0c0163870eae8f18e4e2f177a6bc8890
2017-03-20 11:35:19 -07:00
77fbc12f23 Fix some deadlocks when torch_shm_manager is not found (#1030)
- Add additional timeouts to test_multiprocessing to reduce chances of
   hanging indefinitely on failure
 - Add missing header guards
 - Fix typo
 - Check that torch_shm_manager exists in torch/__init__.py
2017-03-17 18:28:39 -04:00
7e46eb1613 Fixes for Prod and Expand functions (#1026)
Thanks to @ChangYong-Oh for the original implementation.
2017-03-17 18:24:44 -04:00
821656d2d8 add CONTRIBUTING document 2017-03-17 07:59:37 -04:00
86e40ed875 Fix a typo in docs about pinned memory buffers (#1023)
* remove misleading guide for BCELoss

* fix docs about pinned memory buffers
2017-03-17 05:08:03 -04:00
1d0699e147 Define exception hierarchy
Summary: Define an exception hierarchy for gloo runtime errors. Keep GLOO_ENFORCE macros for assertions.

Reviewed By: pietern

Differential Revision: D4724124

fbshipit-source-id: 22f0581b06524579e86fe335770bdb620d20e258
2017-03-16 15:08:01 -07:00
b9379cfab7 Use cuDNN and NCCL symbols from _C library (#1017)
This ensures that we use the same library at the C++ level and with
Python ctypes. It moves the searching for the correct library from
run-time to compile-time.
2017-03-16 16:10:17 -04:00
f0b75c4aa4 Merge pull request #729 from shenxiul/cuda_linspace
linspace and logspace for CUDA Tensors
2017-03-16 14:03:00 -04:00
7654b3f49e Add function to compute cross_entropy for 2D image (#802) 2017-03-16 17:34:04 +01:00
37ebbc2809 the length of any item in padded_sequence should be greater than 0 (#1013) 2017-03-16 17:32:43 +01:00
8241cd7b6e Fix compilation error when compiling with 'clang -x cuda'.
Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.
2017-03-16 12:01:11 +01:00
a7781fdebc Use default Redis port in RedisStore constructor
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4718573

fbshipit-source-id: c0b9aa78cf1f4db910526841c0172537b9243f7e
2017-03-15 22:19:51 -07:00
29ddbc3e37 implement linspace, logspace and range in CUDA 2017-03-15 20:50:30 -07:00
16a133ed9a Fixes for testing on FB infra (#1009)
- make each test in test_autograd have a unique name ignoring case
 - assemble all tests when test_legacy_nn is imported
 - import Python.h in PtrWrapper.h
2017-03-15 18:37:11 -04:00
1aa665f6a8 Documentation
Summary:
* Add separate file for rendezvous docs
* Mention using MPI for rendezvous
* Fix algorithm docs formatting
Closes https://github.com/facebookincubator/gloo/pull/13

Differential Revision: D4715442

Pulled By: pietern

fbshipit-source-id: 0469ab8d16fd489a38c399ec2b25860d1225ce72
2017-03-15 14:58:51 -07:00
c4d1318662 Fix map_location in torch.load (#1006) 2017-03-15 16:54:19 -04:00
379ae6d865 Refactor out dispatchStateless (#1007)
Some of the error messages were incorrect due to erroneous
'tensor == THPDefaultTensorClass' checks
2017-03-15 16:24:55 -04:00
24376ff9d3 Merge pull request #723 from killeent/scan-primitive
add implementation of inclusive scan via upsweep-downsweep
2017-03-15 14:37:21 -04:00
6ac793dcbe Reuse ncclComm_t across algorithm instances
Summary: Initializing ncclComm_t is expensive. Allocate a set of ncclComm_t for each unique device set and cache them for reuse. With this change the runtime of the CudaAllreduceChunked tests improved from ~170 sec to ~10 sec on my machine. There is no improvement in the benchmark numbers because the algorithm instance is only allocated once.

Reviewed By: pietern

Differential Revision: D4708943

fbshipit-source-id: 85b85070586d6683a762b8282df593ca831e7bc7
2017-03-15 09:51:43 -07:00
e00d9c1fd8 Execute benchmark through mpirun
Summary:
This change includes CMake changes to compile the MPI assets when the USE_MPI flag is enabled. If so, the benchmark tool can now be launched through mpirun.

Includes the changes done in #11.
Closes https://github.com/facebookincubator/gloo/pull/12

Reviewed By: Yangqing

Differential Revision: D4712060

Pulled By: pietern

fbshipit-source-id: 0d0e93882f5822583f59304d4256dbdf5dea7483
2017-03-15 08:21:12 -07:00
be6322e4b5 Update nn.init docstrings to correctly reference the module (#1001) 2017-03-15 11:17:59 -04:00
62063b2f62 Fix docs for pointwise ops (#845) (#985)
* add torch.nn.init docs to the source folder
2017-03-15 11:08:05 -04:00
13b1580613 add F.pad to docs 2017-03-15 00:09:14 -04:00
fe788f5003 Use correct event to synchronize destination buffer in NCCLElement
Summary: NCCLOp::runNCCL is mistakenly recording an event in the source pointer after the NCCL op. This results in NCCLOp::wait() returning without synchronizing with the output buffer. The synchronous tests using NCCL fail.

Reviewed By: pietern

Differential Revision: D4708860

fbshipit-source-id: 0c36511e260b587d410e5c9604552ceedd06d988
2017-03-14 19:20:59 -07:00
e50a1f19b3 Use streams in scatter to overlap copy with compute 2017-03-14 22:46:07 +01:00
e86db387ba Fix conv1d backward segfault (#999) 2017-03-14 16:15:53 -04:00
1bf61b8adc Add googletest submodule 2017-03-14 03:39:54 +00:00
704ee3ca68 Use cudart symbols from the main program.
Our extension library links against cudart and pulls in the symbols. Use
LoadLibrary(None) to use the same symbols as the _C extension.

This fixes the PyTorch wheel when you don't have system CUDA installed.
2017-03-13 19:45:34 -04:00
9004652c7b updated the documentation to remove the unnecessary copy grads when using multiprocessing 2017-03-13 19:04:17 -04:00
aca6ce984c change lookup table sort 2017-03-13 13:55:16 -07:00
ed8773f7bd add legacy_serialized.pt to gitignore 2017-03-13 16:37:35 -04:00
0f7b7b27b1 Fix build for CMake 2.8.12
Summary:
This is the minimum required CMake version (also the version that is available on Ubuntu Trusty (14.04)).
Closes https://github.com/facebookincubator/gloo/pull/9

Reviewed By: Yangqing

Differential Revision: D4698659

Pulled By: pietern

fbshipit-source-id: bf01541fe485c03e7c665f175c2887feaf9516a3
2017-03-13 13:06:15 -07:00
48f48b6ff2 fix more flaky VolumetricMaxPooling tests 2017-03-13 14:38:27 -04:00
615b27eadf fix corner case in SetItem of Variable 2017-03-13 14:38:27 -04:00
86ede33035 CMake improvements for Gloo
Summary: Install headers and add .. to include directories

Reviewed By: pietern

Differential Revision: D4695500

fbshipit-source-id: f48a49f03e575408829793cb63bfdb16d8e3a309
2017-03-13 11:06:05 -07:00
bd09055207 Synchronize all NCCL ops with shared per-device streams
Summary:
Allocate a set of per-device streams used to serialize NCCL op scheduling. These ensure concurrent NCCL ops are not interleaved across devices (e.g., through priority scheduling), which would otherwise result in deadlock.

Synchronize source and destination streams with NCCL streams.

Reviewed By: pietern

Differential Revision: D4685360

fbshipit-source-id: 3c228b195b0a0d9d7cccc720163898d344a5ed4c
2017-03-13 09:20:05 -07:00
4bd220d91a Travis contbuild scripts and cmake fix.
Summary:
TSIA. Redoing #7 to kick travis.
Closes https://github.com/facebookincubator/gloo/pull/8

Reviewed By: Yangqing

Differential Revision: D4697132

Pulled By: pietern

fbshipit-source-id: d03148aeddb2cf927b4ef3689c97d9ba4f4cdc9d
2017-03-13 08:36:10 -07:00
170d790b66 fix doc of conv3d in conv.py (#989)
the second dimension should be height.
2017-03-13 11:30:13 -04:00
e216f557fd Fixes issue returning strings from a Dataloader with pin_memory=True (#908) 2017-03-13 10:11:07 +01:00
997312c233 Add WeightedRandomSampler (#980)
Samples elements from `[0,..,len(weights)-1]` with the given probabilities (weights). So far there is no means to introduce sample weights either in loss functions or while sampling from a dataset. This is an attempt to add the functionality for the latter.
2017-03-13 00:27:05 -04:00
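A minimal usage sketch of the sampler added here; the toy dataset and weights below are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

data = TensorDataset(torch.arange(4).float())
weights = [0.1, 0.1, 0.1, 0.7]            # example 3 is drawn ~70% of the time
sampler = WeightedRandomSampler(weights, num_samples=100, replacement=True)
loader = DataLoader(data, batch_size=10, sampler=sampler)
for (batch,) in loader:
    pass                                   # batches are biased toward the heavily weighted example
```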
d602b3a834 Allow submodules and parameters to shadow attrs on assignment 2017-03-12 13:31:32 -04:00
f531d98341 Fix memory leak in torch.from_numpy 2017-03-12 13:31:32 -04:00
6bdd5ecaf5 Remove some unnecessary AutoGPU calls 2017-03-12 13:31:32 -04:00
bfbde9d6eb Fix Embedding bug when max_norm was used 2017-03-12 13:31:32 -04:00
b9c816a796 Fix run_test.sh --coverage option. (#983) 2017-03-11 19:26:02 -05:00
2f5c215d34 Update setup.py (#981)
Adding `description` to `setup.py`
2017-03-11 12:14:07 -05:00
01650ac9de add torch.nn.init docs to the source folder (#979) 2017-03-11 10:11:30 -05:00
ce536aa355 fix example in docs for NLLLoss 2017-03-10 16:48:08 -05:00
fc0af33a18 key only block-wide bitonic sort 2017-03-10 11:50:43 -08:00
c7c4778af6 modify docs of broadcast to fix issuse #940 (#970) 2017-03-10 09:54:43 -05:00
d873077349 Create context from existing MPI communicator
Summary:
This makes it easy to use Gloo transports and algorithms in existing
MPI environments.

Reviewed By: andrewwdye

Differential Revision: D4685999

fbshipit-source-id: cfc7d0e445893512b4e4ed2abe1bb280d83b9c70
2017-03-09 23:06:18 -08:00
0c38827318 Split out rendezvous specifics from context
Summary:
How pairs are setup and connected to one another is specific to
whatever underlying rendezvous mechanism is used. This change moves
the `connectFullMesh` function into a subclass in the `rendezvous`
directory. This prepares for a separate MPI context that can setup
pairs between processes using an existing MPI communicator.

Reviewed By: andrewwdye

Differential Revision: D4684755

fbshipit-source-id: 9eb643b8ba545b3e6f9a36b65642b3b04a5f0077
2017-03-09 23:06:18 -08:00
fb766c00b3 Align async\wait pattern to use wait() naming
Summary: TSIA

Reviewed By: pietern

Differential Revision: D4686783

fbshipit-source-id: ccbdace0d53219bd4b881ea27f7f972b206215b6
2017-03-09 21:20:45 -08:00
e600c9830a Fix up NCCLElement construction in CudaBroadcastOneToAll
Summary: TSIA

Reviewed By: pietern

Differential Revision: D4686520

fbshipit-source-id: 657ca90aa1971be152b037563105a9f490137a69
2017-03-09 20:37:03 -08:00
73a65cd29f simple ordering fix to avoid gcc warning 2017-03-09 17:10:59 -08:00
b785ed0ac0 Fix Embedding and CosineEmbeddingLoss on non-float CUDA (#965) 2017-03-09 18:04:40 -05:00
b2d077d81d Update _tensor_docs.py (#966) 2017-03-09 18:04:19 -05:00
4814b0bc09 Recompose NCCLElement of src/dst CudaDevicePointers
Summary: CudaDevicePointer has the information we need for a NCCL op. Refactor NCCLElement as a composition of src and dst CudaDevicePointers. This allows for separate streams for src and dst, and will simplify a future change to use a static set of streams for all NCCL ops.

Reviewed By: pietern

Differential Revision: D4679483

fbshipit-source-id: 75656cc2fa5b5e2a6c096d914d2111769a47291b
2017-03-09 12:26:55 -08:00
b1c2714ad5 Add momentum and centered options to RMSProp (#810)
* add momentum and centered options

Add two options:
 - Momentum (like SGD's momentum)
 - Centered RMSprop, as in Graves 2013 (https://arxiv.org/abs/1308.0850): the grad is normalized by a running estimate of its variance

* somme PEP8

* bug in default

* bug2

* sign mistake

* alloc of momentum & centered only if needed

* add link to docstring

* some pep8 on docstring

* implement __setstate__() for backward compatibility

* correct grammar mistake

* multiply by lr when adding delta to params

* rename momentum variables

* change __init__ params order
2017-03-09 10:04:32 +01:00
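A minimal usage sketch of the two new options, written against the current torch.optim API; the model and data below are placeholders, not part of the patch.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.RMSprop(model.parameters(), lr=1e-3,
                          momentum=0.9, centered=True)

x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(5):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```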
a462edd0f6 Docs(RNN|GRU|LSTM): Note dropout applies to all layers *except* the last layer (#961)
This is an important clarification to make, as otherwise users are misled as to where they may need to add dropout, and to clarify the situation they would need to delve into the backend implementation.
4647f753bc/torch/nn/_functions/rnn.py (L73)
2017-03-08 18:09:11 -05:00
c2425fc9a1 Fix build warning for C file 2017-03-08 21:28:57 +01:00
fbcedf2da2 Merge commit '3d95e13b332e1b31d706b59c3b67f886958ece79' 2017-03-08 09:09:46 -08:00
3d95e13b33 Check event_count before merging blocks 2017-03-08 08:49:04 -08:00
228e1a8696 Add CUDA caching allocator accessor 2017-03-08 08:29:50 -08:00
be0e8c0009 Use sequential slot numbers from context
Summary:
Add a nextSlot() function to the context that increments and
returns a slot number. This enables multiple algorithms sharing the
pairs part of a context. The slot numbers were hardcoded before this
change, which prevented reuse.

After this change, some of the tests can be changed to run multiple
times (or do a parameter sweep) without respawning a new threadpool or
allocating new fixtures.

Also change some internally used variable names for more consistency.

Reviewed By: andrewwdye

Differential Revision: D4668268

fbshipit-source-id: 65cbc8f2666f0b7d2f1c72574b86d913f5855d62
2017-03-08 08:23:03 -08:00
3fa8a3ff46 add implementation of inclusive scan via upsweep-downsweep 2017-03-08 07:34:14 -08:00
4647f753bc Merge commit '0f872ed02fbaf5b326f235b3f18724171b061416' 2017-03-07 14:45:01 -08:00
7ba5e7cea1 fix VolumetricMaxPooling test instability (#952) 2017-03-07 10:55:46 -05:00
9b626a8047 Fix documentation - replace 'matrix' with 'vector' (#951) 2017-03-07 10:40:18 -05:00
bd0e9a73c7 Fix some simple build error on MacOS (#949)
Issue #948

Signed-off-by: Zhou Chang <achang.zhou@gmail.com>
2017-03-07 09:47:49 -05:00
7bddd586f7 Change PrefixStore to take a Store reference
Summary:
Taking ownership of a std::unique_ptr is a bit awkward. It's actually
useful to reuse the underlying store and create multiple prefix stores
against it.

Reviewed By: andrewwdye

Differential Revision: D4662354

fbshipit-source-id: eaf62f7d5a97d6ee848252ff3124c28da349f6f2
2017-03-06 22:19:49 -08:00
da10450535 Allow multiple input pointers to broadcast algorithms
Summary:
This changes the constructor prototype of the broadcast algorithms.
They now take the rank of the root process and the rank of the root
pointer. The root process now also broadcasts locally, among the
specified pointers, in addition to broadcasting to its peer processes.

The broadcast tests are made more robust to use a different value at
every index for every buffer, like the allreduce tests. To accommodate
multiple input buffers for CPU side algorithms, I added a Fixture
helper, and renamed the existing Fixture class to CudaFixture.

The broadcast tests contain a few TODOs since they don't vary the root
process or root pointer yet. I anecdotally verified this does work,
but didn't want to include the necessary changes to do so in this
commit (it requires some changes in rendezvous and NCCL code). A fix
for this is forthcoming.

Reviewed By: andrewwdye

Differential Revision: D4661635

fbshipit-source-id: c069e0d4e8f676a63efd74b15ea1156adcc09477
2017-03-06 22:19:49 -08:00
2b1cd919ce Update extending.rst (#933) 2017-03-06 23:23:14 -05:00
8e46a15605 add docs for set_printoptions to sphinx (#945) 2017-03-06 21:52:37 -05:00
15a9fbdedb Merge pull request #881 from colesbury/parallelize_backwards
Parallelize autograd backwards
2017-03-06 16:57:19 -05:00
6336300880 Fix bug where adding a hook could replace an existing hook.
We were keying hooks by RemovableHandle id. However, we don't hold onto
handles, and the ids of dead objects can be reused. This replaces id(handle)
with a global counter.
2017-03-06 12:47:53 -08:00
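A hypothetical sketch (names are illustrative, not the actual torch.utils.hooks code) of why a global counter is safer than id(handle): CPython may reuse the id of a garbage-collected handle, while a counter key is never reused.

```
import itertools
from collections import OrderedDict

_hook_id = itertools.count()   # global, monotonically increasing

class RemovableHandle:
    def __init__(self, hooks_dict):
        self.hooks_dict = hooks_dict
        self.id = next(_hook_id)        # unique forever, unlike id(self)

    def remove(self):
        self.hooks_dict.pop(self.id, None)

hooks = OrderedDict()

def register_hook(fn):
    handle = RemovableHandle(hooks)
    hooks[handle.id] = fn               # a dead handle's key can never collide with a new one
    return handle
```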
5073132837 Implement 'pre' and 'post' hooks at the C++ autograd level 2017-03-06 12:47:53 -08:00
65b66264d4 Improve broadcast/reduce performance by coalescing tensors 2017-03-06 12:47:53 -08:00
0f872ed02f Add THCCachingAllocator_recordStream()
This is similar to THCCachingHostAllocator_recordEvent() but on CUDA
allocations. It's useful for overlapping copies with computation. The
workflow is approximately:

  0. allocate dst tensor on copy stream
  1. copy from CPU to GPU on copy stream
  2. synchronize the main stream with the copy stream via
     cudaStreamWaitEvent
  3. THCCachingAllocator_recordStream(dst, main_stream)

The recordStream() call is necessary to prevent the dst tensor from
being reused on the copy stream before the main stream finishes work.

Previously, you would need to insert a second cudaStreamWaitEvent before
dst is freed to force the copy stream to wait on the main stream.
2017-03-06 10:50:19 -08:00
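The same workflow expressed with today's Python-level stream API (a sketch for illustration; stream and tensor names are assumptions, and Tensor.record_stream is the Python counterpart of the C call above):

```
import torch

main = torch.cuda.current_stream()
copy = torch.cuda.Stream()

cpu_src = torch.randn(1024, pin_memory=True)

with torch.cuda.stream(copy):
    dst = torch.empty(1024, device="cuda")   # 0. allocate dst on the copy stream
    dst.copy_(cpu_src, non_blocking=True)    # 1. CPU -> GPU copy on the copy stream

main.wait_stream(copy)                        # 2. main stream waits for the copy stream
dst.record_stream(main)                       # 3. mark dst as in use on the main stream
```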
761d6799be code syntax error in document (serialization.rst) (#937) 2017-03-06 10:06:04 -05:00
0d179aa8db Updated datasets.rst, combined all commits (#931)
Added MNIST in the docs

Updated incomplete cifar doc

Updated the datasets.rst to include all datasets
2017-03-05 17:38:28 -05:00
5b171ad7c2 remove misleading guide for BCELoss (#924) 2017-03-05 14:31:01 -05:00
ac9245aeb3 import numpy before setting dlopen flags (#928) 2017-03-05 14:30:13 -05:00
60736bdf99 fix corner case in kwargs for DataParallel (#930) 2017-03-05 14:27:52 -05:00
7d58765cee docs: Fixed example code bug in extending module doc. 2017-03-05 12:09:08 -05:00
76f7d749e4 bump version 2017-03-05 08:49:52 -08:00
0b7374eb44 add THCS to build_all flags 2017-03-05 11:32:43 -05:00
6fff764155 replace old select_compute_arch.cmake with new 2017-03-05 11:32:43 -05:00
8ced72ccb8 link THPP to THCS when CUDA available 2017-03-05 11:32:43 -05:00
b1ae7f90d5 Added functionality for data parallel table (#843) 2017-03-05 02:35:46 +01:00
8b61ee522e Merge commit 'aec182ae72d51dad0f46cdfe7ff9a41380d7da35' 2017-03-04 08:58:21 -08:00
76ca3eb191 Merge commit 'fea50a51ee2d9af15c42f785ab2232469357b557' 2017-03-04 08:58:02 -08:00
fea50a51ee reintroduce USE_AVX* for files which dont have -mavx* set 2017-03-04 08:55:43 -08:00
51e589ed73 fix critical bug in adds SSE implementation 2017-03-04 08:39:19 -08:00
2e87643761 remove fastmath for everything except simd/convolve 2017-03-04 08:16:47 -08:00
ba9a85f271 fix bug introduced in #952 2017-03-03 21:00:05 -08:00
a22fd7194e More assertions for state change in TCP transport
Summary:
I have seen a stress run crash with unexpected state. Adding these
assertions will give more information when it happens again.

```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at gloo/transport/tcp/pair.cc:407] false. Unexpected state: 5
```

Reviewed By: andrewwdye

Differential Revision: D4652216

fbshipit-source-id: e787f4097f5ab32367dd9fa5a336d0389b97e955
2017-03-03 14:20:07 -08:00
0714d7a3ca set AVX/AVX2 flags only for specific files 2017-03-03 12:17:14 -08:00
fb7bafdd0f Update README.md
Summary:
Fix styling in README
Closes https://github.com/facebookincubator/gloo/pull/4

Differential Revision: D4651501

Pulled By: pietern

fbshipit-source-id: e2d4384ac94972f6c4fc03467564460ea4ce5c85
2017-03-03 11:40:02 -08:00
34ce58c909 Parallelize backwards 2017-03-03 11:26:00 -08:00
c238ee3681 Fix issues with lazy grad initialization (#912) 2017-03-03 14:23:51 -05:00
e1d7eaf7d8 Latency optimization tips
Summary: Closes https://github.com/facebookincubator/gloo/pull/3

Differential Revision: D4651203

Pulled By: pietern

fbshipit-source-id: 202afcbe26ec77ea93e48e72fea0d36f18b1b026
2017-03-03 11:05:17 -08:00
f5338a1fb8 compile AVX and AVX2 intrinsic code in separate files. Cleanup use of USE_AVX and USE_AVX2 macros in favor of __AVX__ and __AVX2__ 2017-03-03 10:30:18 -08:00
d96ad41191 cleanup TH CMakeLists and THGeneral.h of unused flags 2017-03-03 09:48:26 -08:00
f17cfe4293 sparse tensor operations (#735) 2017-03-03 18:37:03 +01:00
aec182ae72 Support half precision in baddbmm 2017-03-03 16:15:39 +01:00
c93c884ee2 Add negative dimension to transpose and tests (#792) 2017-03-03 09:31:22 -05:00
c42a2d4d24 Fix dimension check for cat (#959)
* Use TH_INDEX_BASE when verifying dimension for cat

* Adding tests for cat when no dimension is specified.

- Also renamed ldimension to cat_dimension to be more specific.
2017-03-03 09:05:06 -05:00
f89252c336 Merge pull request #719 from twitter-forks/cat-fix
Fixes to cat
2017-03-03 09:04:06 -05:00
490c15fae9 Fix slicing with step (#905) 2017-03-03 09:00:14 -05:00
7e3b572ca7 Document algorithm semantics
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4647587

fbshipit-source-id: a804e7479e6e2f511bfa59712b4b4a88bdf657e3
2017-03-02 21:35:28 -08:00
5fbcd88102 Rename public member fields on gloo::Context
Summary:
The fields are public so their names should not end with an
underscore.

Reviewed By: andrewwdye

Differential Revision: D4645038

fbshipit-source-id: c12b47affbe511383a4722717a06abb61918473b
2017-03-02 19:49:45 -08:00
f2d72ba10f Revert "make handles to be thread-local"
This reverts commit 0720ba53b344809ce3d0bdfb1ea561afa5fe0646.
2017-03-02 17:48:24 -08:00
2108b42b92 Fix bug in cat when dimension is not specified.
- Code was using the specified dimension, which was negative
- Changed the cat_dimension variable to be more explicit
- Fixed code to use the cat_dimension variable
2017-03-02 16:14:09 -08:00
bae8df62d3 Add missing THCudaCheck around cudaMemcpy 2017-03-02 16:13:39 -08:00
a2b2880cc2 Remove underscores from public fields in NCCLContext
Summary: Remove underscores from public fields in NCCLContext

Reviewed By: pietern

Differential Revision: D4645857

fbshipit-source-id: 2c28a1c23d31097d685c0768dad9b99bbef7b171
2017-03-02 16:05:15 -08:00
70fc15c05c More documentation
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4644734

fbshipit-source-id: 50f5fadd2c5cd04e06a025f5538187ed852e669a
2017-03-02 15:50:37 -08:00
98775b6bb4 Merge pull request #718 from killeent/templatize-scan
genericize PrefixSum --> PrefixScan via binary operator template parameter
2017-03-02 17:50:56 -05:00
b7cc2a501f genericize PrefixSum --> prefixScan 2017-03-02 14:31:27 -08:00
0720ba53b3 make handles to be thread-local 2017-03-02 11:10:49 -08:00
ff5fa11129 make mkl link to threaded version with GCC (#958) 2017-03-02 13:37:25 -05:00
837023bb4f Change benchmarks to support multiple input buffers
Summary:
The NCCL code used in CUDA-aware allreduce does local reduction of N
buffers prior to putting anything on the wire. Supporting this in the
benchmark tool to measure the impact under various configurations.

Other minor tweaks in this change:
* Specify sub-second iteration time
* Templatize allreduce benchmarks (the algorithms share a constructor
  prototype)

Reviewed By: andrewwdye

Differential Revision: D4639517

fbshipit-source-id: f7417d3e9f79278a3b1eca48d779f48b77e5260c
2017-03-02 10:16:39 -08:00
e88d241757 Cuda algorithms should return asynchronously if device streams are passed in
Summary: Cuda algorithms take an optional set of device streams to sequence operations. If streams are provided, the algorithms should enqueue final output buffer operations on the associated stream and return asynchronously. Destructors that allocate streams/events should synchronize before tearing down.

Reviewed By: pietern

Differential Revision: D4636447

fbshipit-source-id: 32ec2adc214c83b0b4bc0fff8993ab196459117b
2017-03-02 10:16:38 -08:00
ecb37e4439 Update tests to cover potential reordering problems
Summary:
With this change, every buffer gets assigned a different
value at every index. This means reordering of segments (e.g. in the
chunked algorithm) would surface as test errors.

Reviewed By: andrewwdye

Differential Revision: D4636368

fbshipit-source-id: 464eb1515d1590e12481961d427a92e2ebb3be82
2017-03-02 10:16:38 -08:00
0c88194807 CUDA documentation
Summary: CUDA documentation detailing high-level support for CUDA in gloo algorithms, usage of streams, and synchronizing memory management.

Reviewed By: pietern

Differential Revision: D4633120

fbshipit-source-id: d88e230c8dc82fe48cda0f401b61758fa4f07f2e
2017-03-02 10:16:38 -08:00
50e73a8313 Support synchronous mode in ibverbs transport
Summary:
Synchronous mode means using the calling thread instead of the device
thread for completion handling. Since this saves a context switch in
the critical path, this is very beneficial for low latency algorithms.

For example: the p99 of a 4-way barrier drops from 17us to 4us.

Reviewed By: andrewwdye

Differential Revision: D4626948

fbshipit-source-id: 013b1680497589fe5ad0bca38600bce6a410200b
2017-03-02 10:16:38 -08:00
fc7f026980 Refactor ibverbs transport to prepare for sync mode
Summary:
All pairs created by a device would use the same completion queue.
Supporting sync mode that way is difficult, as there is no way to
filter completions for a particular pair. This change refactors this
to use a single completion queue per pair so that this is no longer an
issue. This change is a preparation for supporting synchronous mode
(where the calling thread itself will poll the ibv library for
completions instead of the device thread).

This change also includes a refactoring of the way transient memory
regions are handled so that they are properly deregistered and
deallocated when no longer needed.

Reviewed By: andrewwdye

Differential Revision: D4625146

fbshipit-source-id: 21bf5ab321534fbd5c03f12049c10fc67da68944
2017-03-02 10:16:38 -08:00
9f18f83375 Downcase setMutex
Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4626965

fbshipit-source-id: 2d32b07182202f65e673795aefacc6cc991d3c7c
2017-03-02 10:16:38 -08:00
9c114e6f1c Fix compile error
Summary: std::atomic was not defined for cuda.cu.

Reviewed By: andrewwdye

Differential Revision: D4624611

fbshipit-source-id: 973bba10026e065667d6a576055d00505ee02d62
2017-03-02 10:16:38 -08:00
0e78a59610 add mutex getter/setter to synchronize CUDA and NCCL ops
Summary: Allow gloo consumers to assign a mutex to synchronize CUDA malloc/free and NCCL operations.

Reviewed By: pietern

Differential Revision: D4622135

fbshipit-source-id: 60acd7c01a677a0df5415fe38e6ef5a2e7c8606a
2017-03-02 10:16:38 -08:00
5e7f5db332 add subset samplers (#888) 2017-03-02 09:26:10 -05:00
b5f7592140 boolean mode in module.train 2017-03-02 09:18:05 -05:00
f366e5fc81 Support int16 numpy conversions
issue #891
2017-03-02 09:15:57 -05:00
48f087f6ce C99 cleanup broke MSVC (#952)
* __pragma for MSVC.
2017-03-02 08:57:28 -05:00
7fef264bfa Bumping version to 1.3.3 2017-03-01 16:44:27 -08:00
8996811936 Only enable peer access for ring neighbors.
This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
2017-03-01 16:42:38 -08:00
c219a183d0 Fix copy/paste typo in error message 2017-03-01 16:42:38 -08:00
8e1d6f9b60 Fix crash in Reduce when non-root ranks have invalid recvbuff 2017-03-01 16:42:38 -08:00
7ad948ffa9 fix tests to not sys.exit(), also fix fatal error on THC initialization 2017-03-01 17:37:04 -05:00
3277d83648 Add Nesterov Momentum (#887) 2017-03-01 20:49:59 +01:00
1487278fdf Allow backprop through cuDNN RNN in eval mode
Handling of dropout descriptors has been improved too.
2017-03-01 19:42:39 +01:00
977630bc15 Handle duplicate backward roots in autograd 2017-03-01 19:42:39 +01:00
12efd53dba ConstantPad2d and F.pad (#856) 2017-03-01 19:39:44 +01:00
37e05485d9 added initialization schemes in torch.nn.init (#833) 2017-03-01 19:34:13 +01:00
c76770f40e Merge commit 'dfca8dfdc5988813ed5673589ffa4fdd1c4f3d2d' 2017-03-01 09:29:51 -08:00
da725830c2 Add support for variable length sequences in RNNs (#873) 2017-03-01 17:36:32 +01:00
fc6fcf23f7 Lock the cudaFree mutex. (#880)
Prevents NCCL calls from overlapping with cudaFree() which can lead to
deadlocks.
2017-03-01 11:29:25 -05:00
b190f1b5bc Add another pinned memory test.
Checks that pinned memory freed on a different GPU from which it was
allocated isn't re-used too soon.
2017-03-01 12:22:31 +01:00
dfca8dfdc5 ensure valid index in multinomial 2017-02-28 14:48:48 -08:00
b46d5e0b04 Fix NN bindings 2017-02-28 14:35:38 -08:00
f19a11a306 Merge commit '8e8022b7351401911e10b94aeb5ae35d32907705' 2017-02-28 14:35:20 -08:00
cfcf69703f Merge commit '80429ad9f7c4775f7f88344a2cf037e499f060b8' 2017-02-28 14:35:00 -08:00
e22b8e0d17 Merge commit '3cc89afde68a831434f3abe9e3af2ac0b134215e' 2017-02-28 14:34:44 -08:00
fbfba6bdca Merge commit '6ff77503645da59eeca5be473a1902e523c4adb3' 2017-02-28 14:34:29 -08:00
3cc89afde6 Merge pull request #713 from killeent/multinomial-indexing-fix
fix indexing bug in sampleMultinomialOnce
2017-02-28 17:13:44 -05:00
1e4aee057c Merge pull request #712 from killeent/multinomial-fixes
Fix sampleMultinomialOnce to better handle large distribution values
2017-02-28 17:12:48 -05:00
8dfcf7e35a Merge pull request #709 from colesbury/pinned_memory
Fix bug where pinned memory event could be recorded on incorrect device
2017-02-28 16:56:21 -05:00
76de151ddd Fix bug where pinned memory event could be recorded on incorrect device 2017-02-28 13:48:56 -08:00
2676cc46c2 fix indexing bug in sampleMultinomialOnce 2017-02-28 13:40:15 -08:00
1bf7bc9768 refactor sampleMultinomialOnce to use <real, accreal>, assertion for sum overflow 2017-02-28 12:46:12 -08:00
3c41c9fe46 Add AutoGPU RAII that doesn't depend on Python API (#875)
Separates out non-Python part of AutoGPU. This also compiles without
CUDA which is useful for generic tensor code.

Also fixes a bug where THCPAutoGPU may not always switch the device:

  THCPAutoGPU guard(-1);
  guard.setDevice(0);
  guard.setDevice(1);
  guard.setDevice(0);  // would not switch back to 0
2017-02-28 14:39:20 -05:00
6ff7750364 add TH_TENSOR_APPLY variants for optimized redux (+refactor) 2017-02-28 10:30:31 -08:00
4d25c3d048 address comments and add tests 2017-02-28 10:23:36 -08:00
267b7ade50 Speed up reductions on non-contiguous dimensions 2017-02-28 10:23:36 -08:00
80429ad9f7 THVector_(add) -> THVector_(adds) 2017-02-28 12:20:44 -05:00
5ca6516ecb THVector_(add),(mul),(div) -> (adds),(muls),(divs) 2017-02-28 12:10:47 -05:00
67f94557ff Expose torch.HalfTensor 2017-02-27 19:35:47 -05:00
61bd5a0643 [Lint] Address F811 2017-02-27 19:33:00 -05:00
748d011c8b [Lint] Address F812 2017-02-27 19:33:00 -05:00
5d5cfe2e57 [Lint] Address E731 2017-02-27 19:33:00 -05:00
7cbe255296 [Lint] Use flake8 instead of pep8 2017-02-27 19:33:00 -05:00
4ef303698c Merge pull request #711 from gchanan/getDeviceAllocator
Add getter for cuda device allocator.
2017-02-27 19:29:39 -05:00
83e8b3f6c3 Add getter for cuda device allocator. 2017-02-27 15:44:44 -08:00
502ebed796 Fix one more reference cycle and ensure correct flag propagation (#868) 2017-02-27 18:38:29 -05:00
68ff58d771 Expose a mutex that is held around cudaFree() calls.
NCCL can deadlock if cudaFree() is called while it's launching kernels.
This exposes a mutex that can be held to prevent cudaFree() calls in the
caching allocator.
2017-02-27 15:08:30 -08:00
969c1602e6 Add Tensor::copy() to THPP
For now, this only supports copying from the same type. We can add
polymorphic copying in the future.
2017-02-27 21:33:40 +01:00
2d4d3b18dd Use NCCL operations in AllreduceChunked
Summary: The AllReduceChunked algorithm currently performs the local reduce/broadcast of local device buffers in host memory. This diff updates the algorithm to execute the local reduce/broadcast steps using NCCL operations before copying a single device buffer to/from host memory.

Reviewed By: pietern

Differential Revision: D4587441

fbshipit-source-id: 4de689f59a6cf898b8eecd3c3b9f57f77124c0e3
2017-02-27 09:59:29 -08:00
5e1d6a3691 Update functional.py (#862)
Fixed documentation error in conv3d
2017-02-27 10:42:02 -05:00
533cfc0381 Minor fix of docs of ModuleList and ParameterList (#861) 2017-02-27 10:09:54 +01:00
2b23712dc3 Improve autograd memory usage (#859) 2017-02-26 22:37:26 -05:00
88275da5e8 CUDA documentation tweaks (#858) 2017-02-26 20:37:43 +01:00
bd7a5ad6f0 Make Optimizer.load_state_dict use __setstate__ 2017-02-26 20:02:42 +01:00
1f6f82dbcf Fall back to indexing compatible with numpy 2017-02-26 20:02:42 +01:00
1f8939937a Allow using expand to broadcast tensors 2017-02-26 20:02:42 +01:00
b3d41a5f96 Add docs for ModuleList and ParameterList 2017-02-26 20:02:42 +01:00
fec2d493a9 Reshape grad_output in basic ops 2017-02-26 20:02:42 +01:00
86ee75f63f Fix for Long and Byte tensor indexing of Variables 2017-02-26 20:02:42 +01:00
31941918cf Prevent creation of reference cycles with leaf Variables that don't require grad
Also, raise an error immediately if a leaf that requires_grad is
modified in-place. Some comments were updated too.
2017-02-26 20:02:42 +01:00
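With today's tensor API, the immediate error looks like this (a minimal sketch, not taken from the commit):

```
import torch

leaf = torch.ones(3, requires_grad=True)

try:
    leaf.add_(1)                 # in-place update of a leaf that requires grad
except RuntimeError as e:
    print(e)                     # raised immediately instead of failing later in backward
```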
19a65d2bea Expose stateless methods for torch.cuda.HalfTensor 2017-02-26 20:02:42 +01:00
819d4b2b83 Add finite differences gradcheck (#851) 2017-02-26 08:35:24 -05:00
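Typical usage of the finite-difference check (a sketch with assumed shapes): gradcheck compares numerical and analytical gradients, and double-precision inputs keep the finite-difference error small.

```
import torch
from torch.autograd import gradcheck

x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(3, 5, dtype=torch.double, requires_grad=True)

assert gradcheck(lambda a, b: (a @ b).sin().sum(), (x, w), eps=1e-6, atol=1e-4)
```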
b87c113cf4 CUDA documentation enhancement and docs versioning (#848)
* Add more detail to CUDA documentation

Also adds better cross-linking to the pages that discuss relevant topics.

* Adds recommendation to torch.save docs

* Make the version numbers for the docs dynamic

Might need tweaks for beta, 1.0, etc.
2017-02-26 08:33:26 -05:00
b25182971f readme change for getting clarity on binaries 2017-02-26 07:52:13 -05:00
1ee2c47e37 Correcting the description of LSTM attributes (#854) 2017-02-26 13:30:55 +01:00
2dc563f1f1 Fix indexing when passing only an Ellipsis 2017-02-25 23:34:09 +01:00
15ba71a275 Rebase fixes 2017-02-25 17:14:52 +01:00
e5b3fc49d6 Implementation of the 3rd set of tensor functions 2017-02-25 17:14:52 +01:00
ae1766951d Link TH and THPP to THD (#57)
* Fix THD library build

* THPP dependency added

* Minor cleanup; Fix build on OSX
2017-02-25 17:14:52 +01:00
02d08dafd9 Add support for IPv6 in Data Channel TCP (#53) 2017-02-25 17:14:52 +01:00
13a5090695 Added a size change in MaxPool1d module and improved tests (#771) (#832)
Backend is SpatialDilatedMaxPooling, so change 3D input (N*C*L)
to 4D size (N*C*1*L). Then output indices will range from 0 to L.
This range will not cause UnMaxPool1D error.

Signed-off-by: Zhou Chang <achang.zhou@gmail.com>
2017-02-25 08:53:30 -05:00
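The reshape described above can be checked from Python (a sketch for illustration; sizes are assumptions): pooling the (N, C, 1, L) view with a (1, k) kernel matches 1D pooling of the (N, C, L) input.

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 16)                       # (N, C, L)

out1d = F.max_pool1d(x, kernel_size=2)
out2d = F.max_pool2d(x.unsqueeze(2),            # (N, C, 1, L)
                     kernel_size=(1, 2)).squeeze(2)

assert torch.allclose(out1d, out2d)
```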
8e32e4c04c make wrap_generic_function importable 2017-02-24 14:27:54 -08:00
cf991310c3 c++ virtual function fix 2017-02-24 13:22:44 -08:00
938706099e adding environment flags to disable SIMD codepaths 2017-02-24 07:35:11 -05:00
3330287dc7 Update dataloader.py (#837) 2017-02-23 14:38:41 -05:00
38c8520adf adding unsqueeze to docs 2017-02-23 12:13:25 -05:00
492e1746af Fix THFree in THTensorApply 2017-02-23 06:01:13 -05:00
91a8109cfd Use C99 for openmp cleanup 2017-02-23 06:01:13 -05:00
161490d34a Add memcpy copy 2017-02-23 06:01:13 -05:00
9c302852eb comments fix 2017-02-23 06:01:13 -05:00
8654fcfd60 THVectorDefault style fix 2017-02-23 06:01:13 -05:00
b3d527d9a0 Tab style fix 2017-02-23 06:01:13 -05:00
4d495218c9 THTensorApply3 contiguous optimizations 2017-02-23 06:01:13 -05:00
13a041284c THTensorApply2 copy optimization 2017-02-23 06:01:13 -05:00
c60c1a003d TH_TENSOR_APPLY2 contiguous optimization 2017-02-23 06:01:13 -05:00
97add1a5ea comment fix 2017-02-23 06:01:13 -05:00
ca02930e47 Fill bug fix 2017-02-23 06:01:13 -05:00
20d5e95077 THTensorApply3 compress counter 2017-02-23 06:01:13 -05:00
eb4a7dc11d THTensorApply change dims to sizes 2017-02-23 06:01:13 -05:00
f722498b72 THTensorApply2 counter compress 2017-02-23 06:01:13 -05:00
aadfb6fe83 THTensorApply reduce memory overhead 2017-02-23 06:01:13 -05:00
6c273594c9 THTensorApply Counter compress 2017-02-23 06:01:13 -05:00
e475c82fa1 Add isTransposed judge and enable multithread of fill functions 2017-02-23 06:01:09 -05:00
0c2e6665df Add AVX copy 2017-02-23 05:50:34 -05:00
6295e6e94b Rebase master 2017-02-23 05:50:34 -05:00
670a4aa708 Fix AVX2 bugs 2017-02-23 05:50:34 -05:00
1bdc2e64ed Add fma cadd 2017-02-23 05:50:34 -05:00
c587be1e50 Add THVector Fill 2017-02-23 05:50:34 -05:00
bd481596f5 optimize THVector add mul div 2017-02-23 05:50:34 -05:00
a504d56b43 Fix THVector cmul AVX bug 2017-02-23 05:50:30 -05:00
91c4dfccea Use THVector cadd AVX 2017-02-23 05:46:44 -05:00
27f618c44d Add THVector Fill AVX 2017-02-23 05:46:44 -05:00
a14482a1df Add THVector cadd AVX 2017-02-23 05:46:40 -05:00
aa50c5734b Add THVector AVX cmul 2017-02-23 05:46:07 -05:00
293001a4fe Add THVector SSE div cdiv 2017-02-23 05:46:07 -05:00
638cfdf150 Add SSE add 2017-02-23 05:46:07 -05:00
5f80a14525 Separate SSE and AVX 2017-02-23 05:46:07 -05:00
1342fd3975 Remove THTensorMathSIMD THTensorMathDispatch 2017-02-23 05:46:07 -05:00
8d4af38489 Add THVector div cdiv 2017-02-23 05:46:07 -05:00
575a064e66 Remove THVector diff 2017-02-23 05:46:07 -05:00
3ab21a3c4f Merge THVector mul AVX 2017-02-23 05:46:07 -05:00
2f592e6c7d Remove THVector scale 2017-02-23 05:46:07 -05:00
5661ffb766 Merge THVector mul 2017-02-23 05:46:03 -05:00
9b74503daa Merge THVector cmul 2017-02-23 05:40:33 -05:00
24848f1cd8 Change THVector mul to cmul 2017-02-23 05:40:33 -05:00
a31a07ede9 Merge THVector add 2017-02-23 05:40:33 -05:00
c8c4c9b23d Change THVector add to cadd and fix NEON 2017-02-23 05:40:33 -05:00
e1ed9303f0 Add multi-thread add 2017-02-23 05:40:33 -05:00
a43aab13c2 Fix THTensorMath.c style 2017-02-23 05:40:33 -05:00
c698b4a45e Add Dispatches for div and mul 2017-02-23 05:40:29 -05:00
c6a0ffab50 Add AVX single float and double float add 2017-02-23 05:40:24 -05:00
8ba7cc30d1 Add THTensorMathSIMD.c 2017-02-23 05:32:34 -05:00
61bf08ca24 Fix compilation for simd tensor add 2017-02-23 05:32:28 -05:00
6ada3c0c16 Fast floating point add kernel in intrinsics (11x speedup over default for 10k elements) 2017-02-23 05:11:44 -05:00
60061fbe79 Fixed up CPU dispatch and tested. Can begin implementing kernels 2017-02-23 05:11:44 -05:00
46e7042add SIMD helper header, modified add in THTensorMath to check dispatch 2017-02-23 05:11:44 -05:00
d0c182773b First commit for dynamic CPU dispatch: general framework in place (need to create dispatch tables and stubs for all functions and make impls have hidden linkage) 2017-02-23 05:11:44 -05:00
b6f60585b5 fix AVX2 detection bugs 2017-02-23 05:00:55 -05:00
4b0e3ee219 Merge pull request #699 from twitter-forks/bitops
Bitwise operations
2017-02-23 04:15:35 -05:00
838842d4b2 fix documentation error. [issue #790](https://github.com/pytorch/pytorch/issues/790) (#831) 2017-02-23 08:59:29 +01:00
e71cf20192 improved serialization (no tar copy) (#713) 2017-02-22 22:24:20 +01:00
adb4cb2b5b contiguous view backward (#816) 2017-02-21 19:09:36 -05:00
478d7446ef CMake fixes
Summary: Adds script to populate third-party directory.

Differential Revision: D4591509

fbshipit-source-id: 28934feb536a9f3a066d8c40988337f3dddffaed
2017-02-21 15:06:45 -08:00
df68230351 README and docs skeleton
Summary: TSIA

Differential Revision: D4591755

fbshipit-source-id: fa435f4ad6b97453c3c9516b4bfc9f8f0fb2e4f1
2017-02-21 10:52:04 -08:00
6073f9b46c update table in README.md
it removes the empty top row
2017-02-21 12:58:04 -05:00
8e8022b735 Merge pull request #418 from ruotianluo/adaptiveAverage
Add SpatialAdaptiveAveragePooling.
2017-02-21 09:15:12 -05:00
da82d2dd70 Merge pull request #434 from bottler/master
VolumetricFractionalMaxPooling like spatial
2017-02-21 09:13:59 -05:00
82176473a5 Merge pull request #442 from twitter-forks/half-fixes
Convert real to accreal in libTHCUNN
2017-02-21 09:12:56 -05:00
2d269a9a72 Merge pull request #1137 from twitter-forks/half-fixes
Using accreal instead of real in the API
2017-02-21 09:12:32 -05:00
240372a991 Fixed topk documentation for largest=True 2017-02-21 04:38:24 -05:00
5b10411c8c Fixed some mistakes in examples
Fixed mistakes in LSTMCell and GRUCell examples.
2017-02-21 04:17:28 -05:00
4c474a9939 Improve prodall CUDA test 2017-02-20 23:28:31 -08:00
7ea6ae57c8 Support numpy arrays in default_collate 2017-02-20 23:28:31 -08:00
42633f8986 Fix misspelling and add support for weights in NLLLoss2d 2017-02-20 23:28:31 -08:00
84248690a9 Add support for indexing with None and slices with positive steps 2017-02-20 23:28:31 -08:00
53409ca0fb Fix a warning in THPP 2017-02-20 23:28:31 -08:00
c2c1710047 Add clip_grad_norm 2017-02-20 23:28:31 -08:00
876202503f Support multiple inputs in data parallel 2017-02-20 23:28:31 -08:00
946a7d9bc3 Make input contiguous only once in backward of cuDNN RNN 2017-02-20 23:28:31 -08:00
608bcd3b15 Return correct number of gradients from cuDNN RNN 2017-02-20 23:28:31 -08:00
632b02a477 Add checks for reward type and size in StochasticFunction 2017-02-20 23:28:31 -08:00
0db9c63300 Use library_dirs in setup.py 2017-02-20 23:28:31 -08:00
873ed4e6b6 Add better error message for conversion of CUDA tensors to numpy 2017-02-20 23:28:31 -08:00
01bd43037d add docs to torch/cuda/random 2017-02-20 20:43:47 -05:00
68c9e3f232 Fixed typo in GRUCell example 2017-02-21 01:37:04 +01:00
a25c8555eb Fixed paper references 2017-02-21 00:27:18 +01:00
d6ca3820aa Optionally specify stream for pointers in CUDA algorithms
Summary:
Work may be queued on CUDA streams for asynchronous execution. The
memory backed by pointers passed to any algorithm can therefore be
mutated after constructing an algorithm instance. By also passing in
the streams these mutations happen on, the algorithms can synchronize
with these mutations to ensure no invalid data is used.

By passing in these streams, any work done by these algorithms will
*also* be queued, which effectively removes a single synchronization
step from any algorithm run.

Differential Revision: D4589394

fbshipit-source-id: 0c8cd6ba9c9018f33d6f4c55a037083fc4164acb
2017-02-20 14:15:53 -08:00
dfd1dff383 Merge commit '4ca26fbc1b7be4e369f84e95df16431bb2f1dcb7' 2017-02-20 08:05:19 -08:00
8f391d4d51 Merge commit 'ee43cd7adca3b24a2071ce6c55dcd3a95a2b6ff6' 2017-02-20 07:55:46 -08:00
2a6b7685ae Merge commit 'f6c1bbfa483ad19c500dc94838baaa69f02d240b' 2017-02-20 07:55:19 -08:00
eb9573107d Merge commit '34b7fed802db1fda6322a70b648dcc4947858719' 2017-02-20 07:54:51 -08:00
ee43cd7adc Do SpatialClassNLLCriterion sizeAverage in a separate kernel 2017-02-20 06:54:23 -08:00
4ca26fbc1b Remove averaging from prodall 2017-02-20 11:37:53 +01:00
c165226325 Print a readable error message when arguments are on different GPUs 2017-02-20 11:35:50 +01:00
0722775ca3 AllreduceRingChunked/CudaAllReduceTest should use the chunked algorithm
Summary: I was mistakenly calling the non-chunked algorithm for the chunked test.

Reviewed By: pietern

Differential Revision: D4580160

fbshipit-source-id: 9d62a68e9e86cc6e596d90ff8854c585a0e8855c
2017-02-17 19:17:44 -08:00
49295ebe54 Add sequential to documentation 2017-02-18 08:42:43 +05:30
455038e470 Use a more stable formula for spatial LogSoftMax 2017-02-17 13:05:45 -08:00
ca7f02ea0c Add shape checks for SpatialClassNLLCriterion 2017-02-17 13:01:56 -08:00
04aba1caec Fix cuDNN dropout desc for multi-gpu (#772) 2017-02-17 19:16:12 +01:00
420488349f Implement CUDA-aware allreduce chunked
Summary:
First pass at a CUDA-aware allreduce chunked implementation. For now the algorithm runs on the CPU and is mostly copy/paste from allreduce_ring.h. A subsequent pass will offload to the GPU.

Serialize cuda test to avoid intermittent failures due to memory contention.

Reviewed By: pietern

Differential Revision: D4576959

fbshipit-source-id: e1f292a05b88ff24c33e549d4a52e770a21f85d2
2017-02-17 09:06:05 -08:00
f6c1bbfa48 Merge pull request #1105 from ruotianluo/adaptiveAvg
Add SpatialAdaptiveAveragePooling
2017-02-17 10:52:33 -05:00
4e2c8c6db5 Merge pull request #1123 from bottler/master
VolumetricFractionalMaxPooling like Spatial...
2017-02-17 10:42:21 -05:00
1a5cae7340 Add busy-poll option in TCP transport
Summary: Ideally we would want the driver to busy-poll for us. In the absence of driver support, spinning with the MSG_DONTWAIT flag seems to help a lot too. Of course, we pay the price of burning one core for polling. Sigh.

Reviewed By: pietern

Differential Revision: D4576242

fbshipit-source-id: 85d9e1b786fbb6053864fba80f3e5ecc80fe221d
2017-02-17 07:31:32 -08:00
c26b9c0a5e Update rnn.py
Based on https://github.com/pytorch/pytorch/blob/master/torch/backends/cudnn/rnn.py#L302, the output is returned in a (0, 1)-transposed version if the batch_first argument is set to true.
2017-02-17 14:37:14 +01:00
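A quick shape check of that behaviour with the current API (a sketch; sizes are arbitrary). Note that only the input/output tensors follow batch_first; the hidden state keeps its (num_layers, batch, hidden) layout.

```
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(5, 12, 8)        # (batch, seq_len, input_size)
output, h_n = rnn(x)

print(output.shape)              # torch.Size([5, 12, 16])  (batch, seq_len, hidden)
print(h_n.shape)                 # torch.Size([1, 5, 16])   (num_layers, batch, hidden)
```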
aaf41c61a6 Fix Engine::compute_dependencies 2017-02-17 18:28:51 +05:30
dd844f741b Fix previous_functions when it contains Variables 2017-02-17 11:03:46 +05:30
4dd19988c3 Add benchmark option to display nanoseconds
Summary:
Latency optimization is going well and I've seen the odd case of <10us
measurements. This option makes the benchmark tool display nanos
instead.

Differential Revision: D4575925

fbshipit-source-id: 98dbd3b39e31cbcdd4c146613f6630e721187e1e
2017-02-16 21:16:26 -08:00
7117a9012e Fix flaky non-contig test 2017-02-17 10:40:08 +05:30
1bdc28161a Add torch.__version__ 2017-02-17 10:40:08 +05:30
5e150caf38 Fix a bug in Engine::compute_dependencies 2017-02-17 10:40:08 +05:30
c0c62d099a Make detach() actually remove the creator 2017-02-17 10:40:08 +05:30
b9ece39685 Make torch.Size methods return torch.Size, not tuple 2017-02-17 10:40:08 +05:30
15ef008877 Using accreal instead of real in the API
- This reverts commit 7a07afe545b4deae5919d9dc268bfac3d37398c7.
- Includes fixes for TemporalRowConvolution
2017-02-16 17:34:11 -08:00
b14d6318f8 Convert real to accreal in libTHCUNN
- This reverts commit 0d85922d116879448485ef88ae21e83a9255a0b0.
- Includes fixes for TemporalRowConvolution
2017-02-16 17:33:03 -08:00
93002720eb Extract CudaDevicePointer for reuse across CUDA-aware algorithms
Summary:
The CudaDevicePointer optionally takes an existing stream on
which it runs any operation associated with the pointer (for now just
memcpy's, but this likely will includes kernel execution in the
future).

Differential Revision: D4574035

fbshipit-source-id: ddd7972a3874012059f1fde1b341fd6edd69102d
2017-02-16 14:05:52 -08:00
7c44506441 allow DataParallel to have tuple inputs on a single GPU 2017-02-16 19:07:17 +01:00
937ba581d7 Improve nn.legacy compatibility with Torch7 (#738) 2017-02-16 21:17:12 +05:30
2ae54f1194 setup.cfg -> tox.ini (#761) 2017-02-16 21:13:13 +05:30
cb91078e01 Support synchronous mode for TCP transport
Summary:
In synchronous mode, it is not the device thread that is responsible
for handling I/O, but the user thread itself. Calling waitRecv on a
buffer will trigger the read function on the pair to be called. This
eliminates the context switch necessary if the device thread is
handling all I/O. For benchmarks with small numbers of elements this
reduces latency by as much as 20%.

Reviewed By: plapukhov

Differential Revision: D4549998

fbshipit-source-id: ab718ba090c06d7c7aa4065cc9f92bd96b9e4a35
2017-02-15 17:31:06 -08:00
a217fefee1 Update rnn.py
Fixed a problem with outputting the RuntimeError if arguments are incorrect in cudnn/rnn.py
2017-02-15 21:49:42 +01:00
34b7fed802 Fix gcc 4.4.7 build. 2017-02-15 09:06:25 -08:00
5221745c21 add test for bias=False for 3d convolution 2017-02-15 04:26:44 -08:00
000ca44b16 Merge commit '797544c47a4e9bdff02137a127f883a6df9b3dfe' 2017-02-15 04:24:14 -08:00
8f3d44033b Merge commit '0426f2f3ec2b932cb83d64101081244c2a1451b1' 2017-02-15 04:23:50 -08:00
7cc14c595a Merge commit '07f5b21ef1bd29d1451c616062dcbfc3f8fd7c6a' 2017-02-15 04:23:18 -08:00
797544c47a implementation of bias=False for VolConv.cu 2017-02-15 04:18:17 -08:00
0426f2f3ec implementation of bias=False for VolConv.c
Used .c file changes from 7318e2de13 as a starting point. All changes to .c files (except for whitespace details) are present here.
However, the required .h files were not present in that PR.
2017-02-15 04:16:09 -08:00
336eeee895 kernel_size as the default stride for avg_pool1d (#744)
Following the documentation, let stride default to kernel_size if stride is not provided.
2017-02-15 13:12:18 +05:30
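In other words (a sketch, not from the commit): omitting stride now gives non-overlapping windows, the same as passing stride=kernel_size explicitly.

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 12)

a = F.avg_pool1d(x, kernel_size=4)              # stride defaults to kernel_size
b = F.avg_pool1d(x, kernel_size=4, stride=4)

assert torch.equal(a, b)                        # both have shape (1, 3, 3)
```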
593f867e3e Fixed a simple compiling error in mac OS #745. (#746)
Signed-off-by: Zhou Chang <achang.zhou@gmail.com>
2017-02-15 12:19:03 +05:30
385913be1c Fix class torch.nn.ConvTransposeNd documentation (#739)
There is no `dilation` parameter
The `output_padding` doc was missing
2017-02-15 10:37:20 +05:30
6aaa14f5fe Fix LSTMCell Doc Typo (#743) 2017-02-15 08:29:17 +05:30
07f5b21ef1 Merge pull request #702 from gchanan/conservativeAllocator
Improve THCCachingHostAllocator performance by making it reclaim less aggressively
2017-02-15 08:26:48 +05:30
ee52f89772 Implement CUDA BroadcastOneToAll algorithm
Summary:
Implement CUDA BroadcastOneToAll algorithm for GPU addresses. Refactor cuda.h into cuda_private.h to allow inclusion of <cuda.h> in public headers without polluting the namespace.

Port broadcast tests to GPU variants.

* this revision is based on Peter's revision D4546932

Differential Revision: D4547382

fbshipit-source-id: 3d294ad8862b04fb783ba22e5c925b8d7cbc8a8d
2017-02-14 18:46:56 -08:00
e454870396 Free set of stored streams and handle NULL streams. 2017-02-14 15:41:47 -08:00
2822013437 Fix flaky tests 2017-02-14 21:28:50 +01:00
72c1982734 Add some more asserts to cuDNN RNN 2017-02-14 21:28:50 +01:00
0de2ea305a Support retain_variables in cuDNN RNN 2017-02-14 21:28:50 +01:00
d899385a3d Raise error when too small input is given to conv 2017-02-14 21:28:50 +01:00
c6d6cbe8a6 Check that all tensors are on the same GPU in cuDNN bindings 2017-02-14 21:28:50 +01:00
85e82e85d8 Fix bug in zero_grad, when some parameters didn't require grad 2017-02-14 21:28:50 +01:00
a1534cc37d Fix auto-gpu in cat 2017-02-14 21:28:50 +01:00
8c8dc791ef Load half and double THCUNN backends 2017-02-14 21:28:50 +01:00
63edca44f2 Add tests for non-contiguous inputs and gradients 2017-02-14 21:28:50 +01:00
6aa8c932fc Benchmark for CUDA-aware algorithms
Summary:
Separate benchmark build target for CUDA-aware algorithms.

This is needed to keep CUDA an optional dependency.

Differential Revision: D4546932

fbshipit-source-id: b73176ae9067233f883d51ba3ab4efbb13a6f86f
2017-02-13 21:32:58 -08:00
8821f4aba6 Fix race in benchmark tool
Summary: TSIA

Reviewed By: plapukhov

Differential Revision: D4549105

fbshipit-source-id: 61c8966e429e0701677f441aeaaf27fdc5e669e7
2017-02-13 21:32:58 -08:00
5e06634f7e Implement initial CUDA-aware allreduce
Summary:
This CUDA-aware ring allreduce is based on the regular ring allreduce.
It runs the reduction algorithm on the CPU and is therefore most
suited for smaller buffers.

Both the device-to-host memcpy's at the start of the algorithm and the
host-to-device memcpy's at the end of the algorithm are kicked off
asynchronously in an attempt to parallelize as much as possible.

Reviewed By: Yangqing

Differential Revision: D4542816

fbshipit-source-id: 101dfad276ca79703e37ff93fb1b6d467295f66b
2017-02-13 21:32:58 -08:00
b82c4b3d38 Split benchmark code into multiple files
Summary:
The CUDA benchmark suite will be a separate build target, so the
runner should be reused.

Reviewed By: Yangqing

Differential Revision: D4545092

fbshipit-source-id: 6ccf2d30f5d35c74fc59851b25416bfe6863d62c
2017-02-13 21:32:58 -08:00
8d90ab2d9b compile with cudart (#737) 2017-02-14 06:40:35 +05:30
bd5303010d Refactor autograd package to separate Python dependencies. (#662)
The core autograd Variable, Function, and Engine no longer depend on the
Python API. This lets us implement functions in C++. In the future, we
can also multithread the engine and release the GIL for most of the
non-Python backwards.
2017-02-13 16:00:16 -08:00
16d2c3d7b3 make networks converted with loadcaffe loadable 2017-02-13 23:53:46 +01:00
407a92dc26 std::min() requires same type (#732)
* std::min() requires same type

* cast buffer instead

* declare buffer_size as int64_t
2017-02-13 18:06:05 +01:00
0a893abc7b fix serialization bug for large files 2017-02-12 19:13:02 +01:00
34fa5e0dc7 Update docstrings for testing object type
Add docstring for `is_storage()` and `is_tensor()`
2017-02-12 09:21:01 +05:30
712686ce91 Add cat, contiguous, squeeze, and unsqueeze to THPP
Use unsqueeze and view from TH/THC
2017-02-11 17:49:31 +01:00
518864a7e0 Fix bug in legacy NN updateGradParameters (#714) 2017-02-11 11:04:18 +05:30
72fd605b01 Fix std::accumulate
Summary:
Testing pull request again.
Closes https://github.com/facebookincubator/gloo/pull/2

Reviewed By: pietern

Differential Revision: D4542327

Pulled By: Yangqing

fbshipit-source-id: 5bd66c32c7249f1327225117815bef64b8708722
2017-02-10 10:12:37 -08:00
750fb5cc73 Fixes to support short and char tensors for bitwise operations 2017-02-09 18:52:59 -08:00
0f4749907a Adding bitwise operations
- lshift, rshift, bitand, bitor, bitxor
2017-02-09 18:11:58 -08:00
bd2dc63ef6 Adding bitand, bitor and bitxor 2017-02-09 17:06:04 -08:00
19a8795450 Changes to shift operations
- renaming lsh -> lshift, rsh -> rshift
- adding componentwise functions
2017-02-09 15:41:07 -08:00
d9dccfdd71 Fix for non-contiguous grad_output in cuDNN conv 2017-02-10 00:25:59 +01:00
7547a06c4f Avoiding duplicated unsigned as it causes error on gcc. 2017-02-09 13:29:05 -08:00
8929b75795 Added shift operations. 2017-02-09 13:28:36 -08:00
4d37ef878c Remove view on data and target tensors of dim 1 in TensorDataset (#609) 2017-02-09 22:06:39 +01:00
efd8998690 Import gloo
Summary:
In the GitHub repository this directory will be mirrored similar to
folly, such that the repository has a single top level directory
called "gloo". This allows for versioning or renaming of the
project root, without having to mangle the include paths; they will
always use the "gloo" prefix.

fbshipit-source-id: 24502e4185fc7cbe19b5249f83609e2b8118e9d7
2017-02-09 12:33:54 -08:00
126e77d5c6 Merge commit 'e9b05c71b4acf210fad719f4da8bb58a425dd00b' 2017-02-09 12:31:58 -08:00
53eec78bea Merge commit 'ac9312e9f8002227b267a82e224a5a99c7a7e734' 2017-02-09 12:31:40 -08:00
a4edaec81a Merge commit 'aeb7a72620be47c0e6a8928a9cb6df49c06902a0' 2017-02-09 12:31:16 -08:00
92481b59d3 Merge commit '73d232ee454ca25de5552d347a2b06820f30d193' 2017-02-09 12:30:39 -08:00
6c77fa9121 Changes in RNNBase and Embedding for compatibility with DataParallel (#660) 2017-02-09 22:36:26 +05:30
aeb7a72620 Merge pull request #693 from colesbury/view
Add code for 'view' to THC
2017-02-09 12:09:28 +05:30
73d232ee45 Merge pull request #926 from colesbury/view
Add code for 'view' to TH
2017-02-09 12:08:57 +05:30
c0c65bf915 Merge pull request #696 from colesbury/unsqueeze
Add unsqueeze to THC
2017-02-09 11:08:20 +05:30
f6cee952af Merge pull request #929 from colesbury/unsqueeze
Add unsqueeze1d to TH
2017-02-09 11:07:47 +05:30
e74184f679 Make THCCachingHostAllocator less aggressive.
In cases where copyAsync is a large percentage of the work,
processing events in recordEvent can cause a large bottleneck.

Here, we relax the constraint that we reclaim blocks as fast as possible
(i.e. in copyAsync); instead, we only check that a block can be re-allocated
in malloc and free.
2017-02-08 14:44:24 -08:00
3884d36176 Add unsqueeze to THC 2017-02-08 13:49:32 -08:00
e7c6886a00 Add unsqueeze1d to TH
Unsqueeze inserts a singleton dimension. Unlike view, it doesn't require
the tensor to be contiguous.
2017-02-08 09:52:50 -08:00
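A small check of that difference with the current API (a sketch; the transposed tensor is just an easy way to get non-contiguous strides): unsqueeze returns a view of the same storage even when the input is non-contiguous, while flattening with view fails unless the strides are compatible.

```
import torch

t = torch.randn(4, 5).t()                       # transposed -> non-contiguous
print(t.is_contiguous())                        # False

u = t.unsqueeze(0)                              # fine: shape (1, 5, 4), no copy
print(u.data_ptr() == t.data_ptr())             # True, same storage

try:
    t.view(-1)                                  # flattening needs compatible strides
except RuntimeError as e:
    print("view failed:", e)
```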
024d1e2678 Merge pull request #69 from cwhipkey/master
Qualify nullptr_t with std::
2017-02-08 09:17:50 -08:00
ed8e92f63d Expose rawSet and rawResize as resizeNd and setStorageNd 2017-02-08 09:00:22 -08:00
fb97df5d65 Expose rawSet and rawResize as resizeNd and setStorageNd
These methods are useful from C because they don't require constructing
THLongStorages to wrap the sizes and strides, which can lead to leaked
memory in case of an error. Instead the sizes and strides can be
represented on the stack using standard C long arrays.
2017-02-08 08:56:04 -08:00
e9b05c71b4 Use THCTensor rather than THCudaTensor in THCUNN.h definition of
GatedLinearUnit.
2017-02-08 07:54:10 -08:00
5eab428294 Qualify nullptr_t with std::. 2017-02-08 07:06:31 -08:00
7926324385 Corrected parameter typo in Adam docstring (#697) 2017-02-07 19:00:10 +01:00
1527b37c26 Fixed typo and rendering of some equations (#693)
* Fixed typo and rendering of some equations

* Few more fixes to MSELoss docs

* Cleaning up whitespace to make pep8 happy
2017-02-07 18:59:27 +01:00
de4659659b The RNNCell's example can not run correctly 2017-02-07 18:58:19 +01:00
a96a8c8336 Static build support + Query CUDA driver, runtime versions (#695) 2017-02-07 08:34:20 +05:30
691aa19b88 Add code for 'view' to THC 2017-02-06 14:04:04 -08:00
6b07dc9e22 Add code for 'view' to TH 2017-02-06 14:00:48 -08:00
8aa259b52b review comments from gchanan 2017-02-06 11:08:23 +00:00
ac9312e9f8 Bugfix/rowconv (#1126) 2017-02-04 20:37:45 +05:30
91a17b702b half<->float conversion cleanup (#901)
* half<->float conversion cleanup
2017-02-04 07:30:13 +05:30
c54597e0b2 std::move fixes 2017-02-03 21:31:03 +01:00
a9785bba44 cuda implementation of Gated Linear Unit, fixed issues with genericization 2017-02-02 21:38:25 -08:00
833b8cbc7a Remove unused code from module 2017-02-02 17:20:11 +01:00
75aeb16e05 Merge commit '72089c9c36c6b880c695baf732cd04329d72c098' 2017-02-01 22:00:42 -08:00
fc354a0d6e Revert "cuda implementation of Gated Linear Unit, fixed issues with genericization" 2017-02-02 10:50:47 +05:30
262611fcd3 Merge pull request #430 from huihuifan/newCudaGLU
cuda implementation of Gated Linear Unit, fixed issues with genericization
2017-02-02 08:16:35 +05:30
b8a34f3033 Small fixups:
1) Add return after THError for completeness.
2) Fix brace formatting
2017-02-01 15:46:19 -08:00
10bb6bb9b8 Fix function names in error messages 2017-02-01 15:21:57 -08:00
3c9ef69c37 Fix THCTensor::isSparse 2017-02-01 14:51:06 -08:00
dee987d6ee use pseudo-fp16 2017-02-01 23:48:09 +01:00
138f254ec1 Support sparse tensors in THPP (#667) 2017-02-01 17:34:50 -05:00
c7c8aaa7f0 Add ModuleList and ParameterList to nn 2017-02-01 23:26:31 +01:00
d0db624e02 Add W503 to PEP8 ignore list (#646) 2017-02-01 15:57:09 -05:00
e3e7b76310 Rename all normal and log_normal args to std 2017-02-01 21:48:11 +01:00
dad02bceb9 Remove duplicated line in cwrap 2017-02-01 21:48:11 +01:00
b195285879 Improve CUDA detection in THPP 2017-02-01 21:48:11 +01:00
8f3da5b51d set_index -> _set_index 2017-02-01 21:48:11 +01:00
825e919eb8 Add torch.unbind 2017-02-01 21:48:11 +01:00
acb0ce8885 Add LongTensor indexing support 2017-02-01 21:48:11 +01:00
72089c9c36 Update THHalf.c 2017-02-01 11:53:29 -08:00
cf2f158fec Remove erroneous proprietary license header
This change was approved by NVIDIA Legal, and I am authorized to make the change on behalf of the company.
2017-02-01 11:43:44 -08:00
41ddc2a786 VolumetricFractionalMaxPooling like Spatial... 2017-02-01 12:01:09 +00:00
e4886f6589 VolumetricFractionalMaxPooling like spatial 2017-02-01 11:52:49 +00:00
6470b5bd21 Add test for Embedding with sparse=True (#663) 2017-02-01 09:54:42 +05:30
tvn
44196955e2 ByteTensor should be unsigned (#664)
ByteTensor should be unsigned
2017-01-31 21:43:39 -05:00
f08ec1394d Fix bug with inplace TH(CU)NN
Also, remove unnecessary zero_() calls
2017-01-31 21:00:49 +01:00
f8fb25e0a2 Add generic bindings to THNN and THCUNN (#645)
Adds bindings using thpp::Tensor to THNN and THCUNN. This allows calling
into those APIs without knowing the concrete types of the tensor
arguments.
2017-01-31 13:23:02 -05:00
6a0c66752f Fix documentation and argument name for Tensor.normal_(mean, stddev) (#652) 2017-01-31 11:55:39 -05:00
a1bd4efb08 readme: add guidance on disabling CUDA (#655) 2017-01-31 14:05:51 +05:30
b43ce05268 Refactor parts of utils.h (#648)
Moves THPObjectPtr into a separate header, so that it can be included
independently. Currently, utils.h requires all of THP.h. Also adds RAII
structs for acquiring and releasing the GIL.
2017-01-30 21:16:28 -05:00
80e56cfda9 Merge commit 'dc9a5b7d2fbcf21268b524b9da5ae38a74214a59' 2017-01-30 17:58:05 -08:00
24701fc5a7 Merge commit '03dcf8a83bb009ecfdd8f27c4d9a6db40829b690' 2017-01-30 17:57:20 -08:00
f78a266d99 Merge commit '368cbe615d0a7bdaadddcb3bd390abcd4cc17b91' 2017-01-30 17:56:37 -08:00
f096fb6859 adding cudnn V6 support (#515) 2017-01-31 02:01:37 +01:00
a3e11d606b Fix linter errors 2017-01-31 01:58:09 +01:00
79232c24e2 Fixes after rebase 2017-01-31 01:58:09 +01:00
15d9d499ab Remove ZMQ dependency from compilation files 2017-01-31 01:58:09 +01:00
962084c8e8 Add Data Channel receive from any source (#52) 2017-01-31 01:58:09 +01:00
7518b1eefb Introduce Scalar for easier send/receive types through DataChannel 2017-01-31 01:58:09 +01:00
8215d7a4ba Implement TH_API functions from the set 2 (#49) 2017-01-31 01:58:09 +01:00
5aaa220d84 Thd functions v3 (#46) 2017-01-31 01:58:09 +01:00
12c16ab9bc Remaining storage functions implemented 2017-01-31 01:58:09 +01:00
76520512e7 DataChannel tests rewrite (#42); DataChannel isend and irecv implementation (#44) 2017-01-31 01:58:09 +01:00
66de965882 Replace ZeroMQ (#41) 2017-01-31 01:58:09 +01:00
10d32fb0b7 Fix DataChannel tests failure (#43)
Tests failed due to accessing a reference which could be invalid.
2017-01-31 01:58:09 +01:00
e72c9b6e4a Storage constructors implemented (#40) 2017-01-31 01:58:09 +01:00
ac1f68127a Add barrier, scatter, gather and allGather implementations + groups (#34) 2017-01-31 01:58:09 +01:00
60d1852c7b Major improvements to master-worker mode
* Fixed all undefined symbol errors
* Implemented storage interface and THStorage class
* RPC improvements
* Code refactor
2017-01-31 01:58:09 +01:00
d53eb521fc Add missing headers. 2017-01-31 01:58:09 +01:00
9808932f10 Refactor RPC and change TensorType to Type 2017-01-31 01:58:09 +01:00
ea876eb6d5 Add initial bindings for master-worker mode 2017-01-31 01:58:09 +01:00
0a45864866 Add THDStorage and improve master-worker mode implementation 2017-01-31 01:58:09 +01:00
2560b39796 Merge TensorTypeTraits.hpp with TensorTraits.hpp 2017-01-31 01:58:09 +01:00
21afa4c88b Worker handling for constructors + destructor 2017-01-31 01:58:09 +01:00
9fc3c5e4d2 THDTensor constructors implemented + some minor fixes 2017-01-31 01:58:09 +01:00
3e3501c98d Integration tests of the THD Python interface (#28) 2017-01-31 01:58:09 +01:00
5e6fcd02b5 Implement data channel groups (#25) 2017-01-31 01:58:09 +01:00
d46ebcfadf Fix broadcast and reduce implementations
Due to bad rank mapping, broadcast and reduce were connecting the
wrong processes, which resulted in errors or in tensors not being sent/received.

 * Introduced a new mapping method to solve this problem.
 * Added and improved tests for these cases.
2017-01-31 01:58:09 +01:00
41480c8cf2 Data channel maintenance 2017-01-31 01:58:09 +01:00
236890d902 Fix transitive library dependencies in CMake 2017-01-31 01:58:09 +01:00
55632d81d2 Add Python wrappers for process group mode 2017-01-31 01:58:09 +01:00
0b276d622e Add reduce and allReduce implementations (#15) 2017-01-31 01:58:09 +01:00
c81491b37d Preserve directory structure when installing headers 2017-01-31 01:58:09 +01:00
42e189425f Detect ZMQ libs and headers in CMake 2017-01-31 01:58:09 +01:00
3cfa0d7199 Expose C API for process group mode 2017-01-31 01:58:09 +01:00
7c9e088661 Reorganize THD directory structure 2017-01-31 01:58:09 +01:00
e78aa4bb84 Implement CommandChannel with ZMQ. 2017-01-31 01:58:09 +01:00
f8e94d0d8b Implement DataChannel (MPI and TCP) (#8) 2017-01-31 01:58:09 +01:00
ebe6f40fce RPC message packing and unpacking implemented 2017-01-31 01:58:09 +01:00
5fb37efb46 Use #pragma once instead of defines 2017-01-31 01:58:09 +01:00
4f47855873 Style improvements 2017-01-31 01:58:09 +01:00
52ae6f682f Add initial version of tensor wrappers 2017-01-31 01:58:09 +01:00
c35f58f97b Template for THD implementation 2017-01-31 01:58:09 +01:00
659b2f3154 Add more autograd functions 2017-01-31 00:39:34 +01:00
5ea05cfb96 Return indices from Variable sort and topk 2017-01-31 00:39:34 +01:00
dc9a5b7d2f Fix memory leak in SpatialMaxUnpooling 2017-01-30 23:23:07 +01:00
f7ab5a128a Delete extra bracket in RNNCellBase.__repr__. (#637)
This extra bracket causes a ValueError when trying to print a Module that uses RNNCellBase or any of its subclasses.
2017-01-29 23:21:24 -05:00
368cbe615d Add Ubuntu 16.04 lib paths in CMake 2017-01-30 01:16:02 +01:00
d4c9a3782b billinear -> bilinear, docs for upsampling, improved docs for Unpooling, pep8 tests fix (#617)
* billinear -> bilinear, docs for upsampling, improved docs for Unpooling, pep8 tests fix
2017-01-30 05:08:48 +05:30
172dca5e8b Fix bug in cat (non-contiguous first input) 2017-01-29 21:25:53 +01:00
818bf0c408 Compile with asserts by default 2017-01-29 21:21:59 +01:00
03dcf8a83b Compile with asserts on by default 2017-01-29 21:18:54 +01:00
604f607fd1 Add asserts in index* functions 2017-01-29 21:18:43 +01:00
956d946c25 Default initial hidden states for recurrent layers (#605)
Fixes #434
2017-01-29 12:38:56 +01:00
970caaa621 Exclude sphinx_rtd_theme from pep8 2017-01-28 23:37:39 -05:00
00a5980cdf Improve RNN doc formatting 2017-01-28 23:37:39 -05:00
e24eee04f0 Link THC to THPP 2017-01-28 23:37:39 -05:00
f1b3af4ee2 Add more bernoulli options in cwrap 2017-01-28 23:37:39 -05:00
fb2d28f477 remove circular references in NestedIOFunction 2017-01-28 23:30:06 +01:00
3a704ff725 Fix legacy load_lua for SpatialConvolution (#608)
* fix legacy load_lua for conv2d

* fix pep8
2017-01-28 20:19:18 +01:00
0180e638e5 Remove unnecessary zero_() calls in cuDNN RNN 2017-01-28 14:36:57 +01:00
95c6ae04fb Fix non-contiguous grad handling in cuDNN RNN 2017-01-28 14:36:57 +01:00
27c4c6e0af Merge commit '6ee77b4edd1552d3a9a2e5389ffc351e513a8089' 2017-01-27 17:29:07 -08:00
da17414b3f Merge commit '343d65db91c2419843d36aed5467c2d1374108bc' 2017-01-27 17:16:08 -08:00
be2b27a747 Merge commit '4461ae809043390d5223905cb82b17035c7f9f31' 2017-01-27 17:15:21 -08:00
aec2c8f752 Merge commit 'c45ff2efe64d0face3889194ba6f885fe9cc4d48' 2017-01-27 17:12:13 -08:00
13e34b4679 Fix multiprocessing tests 2017-01-28 01:18:42 +01:00
57373c7c29 Fix docs 2017-01-28 01:16:04 +01:00
79f5bf84e5 [pep8] Potentially breaking docstring changes 2017-01-28 01:15:51 +01:00
3ed720079e [pep8] Fix most remaining lint manually 2017-01-28 01:15:51 +01:00
e7c1e6a8e3 [pep8] Fix most lint automatically with autopep8
Here's the command I used to invoke autopep8 (in parallel!):

    git ls-files | grep '\.py$' | xargs -n1 -P`nproc` autopep8 -i

Several rules are ignored in setup.cfg. The goal is to let autopep8
handle everything which it can handle safely, and to disable any rules
which are tricky or controversial to address. We may want to come back
and re-enable some of these rules later, but I'm trying to make this
patch as safe as possible.

Also configures flake8 to match pep8's behavior.

Also configures TravisCI to check the whole project for lint.
2017-01-28 01:15:51 +01:00
f1d0d73ed7 Fix flaky Sqrt test 2017-01-28 00:45:49 +01:00
9c411513bf Patch distutils crash when linking with ccache 2017-01-28 00:28:33 +01:00
ce78bc898b Fix travis builds and add ccache 2017-01-28 00:28:33 +01:00
887002e932 Add bindings to CUDA tensors and storages in THPP (#615) 2017-01-27 18:15:56 -05:00
31dea5ff23 Small typo in README (#613) 2017-01-27 20:18:36 +01:00
ec4602a973 Fix bad code alignment (#612)
forward *is* a method of the Linear class
2017-01-27 20:16:49 +01:00
a38749d15f Fix cuda notes
Target GPU *is* consistent with source GPU
2017-01-27 19:30:49 +01:00
6ee77b4edd Added cunn support for TemporalRowConvolutionMM (#415)
* Added cunn TemporalRowConvolutionMM support
2017-01-27 13:30:25 -05:00
343d65db91 Rowconv repull (#1120)
* Added TemporalRowConvolutionMM layer, tests, and documentation
2017-01-27 13:29:05 -05:00
6328981fcf cuda implementation of Gated Linear Unit, fixed issues with genericization 2017-01-26 22:56:33 -08:00
a90913105c add make-contiguous in batchnorm backward (#602) 2017-01-26 16:17:39 -05:00
9368596059 legacy.nn Attributes: Add '_gradOutput' to SpatialConvolution. (#600) 2017-01-26 15:00:41 -05:00
80ed795ff1 Minor ffi utils fix 2017-01-26 11:55:49 +01:00
a2938e3d11 add cc 3.0 to nccl (#594) 2017-01-25 22:47:23 -05:00
2ad967dbe4 Fix pep8 in setup.py with "autopep8 -i setup.py" 2017-01-25 22:23:22 -05:00
7415c090ac Check setup.py for pep8 lint on TravisCI 2017-01-25 22:23:22 -05:00
a1fa995044 Fixes and improvements (#593)
* Fix error in ELU backward

* Add --seed flag for tests

* Add test for BatchNorm eval

* Fix autograd.backward docs

* Support cc flags in cuDNN search

* Fix IndexSelect backward formula
2017-01-25 22:21:49 -05:00
3c2ecc6b15 add dockerfiles (#583)
* add dockerfiles
2017-01-25 17:30:29 -05:00
fa1516d319 Install THCUNN.h and generic/THCUNN.h
The THCApply.cuh is moved to the .cu files so that THCUNN.h can be
compiled by a standard C compiler.
2017-01-25 14:13:17 -08:00
5e26f49db4 Install THNN.h and generic/THNN.h 2017-01-25 14:09:09 -08:00
7694f65120 Revert "Using accreal instead of real in the API" 2017-01-25 16:26:42 -05:00
b5ebf68df1 Revert "Convert real to accreal in libTHCUNN" 2017-01-25 16:13:20 -05:00
aa46055274 Update CI links in README (#579) 2017-01-25 13:58:05 -05:00
2cad802b68 Revert "cuda implementation of Gated Linear Unit" 2017-01-25 13:15:22 -05:00
2d01f384f1 fallback to nn batchnorm on backward-evaluate (#589) 2017-01-25 12:38:57 -05:00
f8d4f980b3 Add upsampling modules and functions 2017-01-24 17:30:50 -05:00
4f5a6c366e Make Variables non-comparable 2017-01-24 17:30:50 -05:00
ecfcf39f30 Improve optimizer serialization
Also, add optimizer.load_state_dict
2017-01-24 17:30:50 -05:00
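A typical round-trip with the resulting API (a sketch; the file name and hyperparameters are arbitrary): the optimizer is rebuilt the same way and its state (e.g. momentum buffers) is restored from the checkpoint.

```
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# ... train for a while, then checkpoint the optimizer state ...
torch.save(opt.state_dict(), "opt.pt")

# Later: construct the optimizer the same way and restore its state.
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
opt.load_state_dict(torch.load("opt.pt"))
```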
3975a2676e Fix invalid DECREF in torch.Size constructor 2017-01-24 17:30:50 -05:00
138ee75a3b Fix for target_link_libraries on CMake 2.8 (#581) 2017-01-24 17:26:24 -05:00
0048f228cb Add spatial test for LogSoftmax 2017-01-24 23:24:25 +01:00
2748b920ab make adam have the same lr as lua torch (#576) 2017-01-24 16:35:28 -05:00
a92a2312d4 Add missing fields to read_lua_file for BatchNorm and Linear layers. 2017-01-24 22:09:47 +01:00
945ce5cdb0 Fix math block of GRUCell in docs (#572)
Added a blank space between the beginning of the `.. math::` block, otherwise it is displayed as a code block.
2017-01-24 14:28:56 -05:00
b39de2cbbe Merge pull request #416 from pavanky/half-fixes
Convert real to accreal in libTHCUNN
2017-01-24 12:17:49 -05:00
49a555e0f5 Merge pull request #1109 from pavanky/api
Using accreal instead of real in the API
2017-01-24 12:17:17 -05:00
ce13900148 update From Source instructions 2017-01-24 10:48:25 -05:00
4c77ad6ee4 step_rate -> lr in adadelta (#569) 2017-01-24 10:05:59 -05:00
0bc4246425 adding NLLLoss2d to docs 2017-01-24 09:22:51 -05:00
c45ff2efe6 Merge pull request #915 from pavanky/convert
Macros to convert between real and accreal
2017-01-24 09:14:33 -05:00
99b520cc5d Merge pull request #421 from huihuifan/cudaGLU
cuda implementation of Gated Linear Unit
2017-01-24 09:13:34 -05:00
e05607aee1 Add fall back to implicit GEMM and friends. (#558)
If we can't allocate the workspace for the desired algorithm, we fall
back to a default algorithm which does not require a workspace.
2017-01-24 09:10:39 -05:00
a360ba1734 Add a hint about CUDNN_STATUS_NOT_SUPPORTED 2017-01-24 09:09:30 -05:00
c661b963b9 Add more contiguity checks to cuDNN 2017-01-24 09:09:30 -05:00
e374dc1696 add step rate to adadelta (#568)
Scales `delta` before it is applied to the parameters in order to control the learning rate of the optimizer (inspired by the climin optim library for Theano).
Also changed the link to the Adadelta paper to point to the right location.
2017-01-24 08:48:19 -05:00
116e0c7f38 Merge commit '45596d52897fb187701943cb77456ff1e7249989' 2017-01-23 14:37:44 -08:00
45596d5289 Add contiguity checks to THCUNN 2017-01-23 14:17:51 -08:00
342e7b873d fixing THPP cmake for cmake < 3.1 (#559) 2017-01-23 14:47:06 -05:00
00410c4496 Fix broken THNN groups in conv functions 2017-01-22 18:32:51 -05:00
8b9276bbee Fix view bug in Conv1d 2017-01-22 18:32:51 -05:00
3238786ea1 Improve optimizer error messages 2017-01-22 18:32:51 -05:00
07ebbcbcb3 Add Parameter docs 2017-01-22 18:32:51 -05:00
ca555abcf9 fix comments 2017-01-22 18:02:40 -05:00
63893c3fa2 Fix auto-gpu semantics for indexing 2017-01-22 18:02:40 -05:00
f8ae34706e Port L-BFGS from Lua optim 2017-01-22 18:02:40 -05:00
f8e89fbe11 fix docs for torch.nn.functional.conv1d (#536) 2017-01-21 10:41:52 -05:00
30d208010c Fix segfault when a None gradient was given to a hook (#533) 2017-01-21 10:39:35 -05:00
017c7efb43 Fix typo in LSTMCell documentation 2017-01-21 15:35:48 +01:00
0c69fd559a Fix CUDA sharing across processes (#530) 2017-01-20 18:28:39 -05:00
c991258b93 fix formula for GRU cells 2017-01-20 17:28:57 -05:00
9f89692dcd adding documentation for some lapack functions (#528) 2017-01-20 16:56:37 -05:00
c28575a4eb Fix typo in documentation for autograd 2017-01-20 21:59:33 +01:00
c9db9c2317 Add C++ tensor library (from THD fork) (#526) 2017-01-20 15:23:34 -05:00
16a09304b4 fix documentation of LSTM cell (#525) 2017-01-20 12:01:50 -05:00
58a88d1ac0 Fix doc search and warnings 2017-01-20 11:36:41 +01:00
b740878697 Updated h0,c0 shape in documentation for RNN, LSTM, GRU (#519) 2017-01-20 10:12:44 +01:00
7179002bfb cuda implementation of Gated Linear Unit 2017-01-19 23:01:30 -08:00
43b5be1d78 added c implementation of GatedLinearUnit 2017-01-19 22:18:08 -08:00
173c81c2d2 import package at the beginning 2017-01-20 00:09:22 +01:00
ee4c77c59f Docs improvements (#512)
* Always compile .numpy() for all types

* Add torch.nn.functional docs and hidden headers

* Use sphinx to generate torchvision docs

* Remove unused import in ffi utils
2017-01-19 17:28:49 -05:00
30ec12fdd5 update readme for source installs to make magma dependency optional 2017-01-19 16:20:13 -05:00
269ec0566f fix typo 2017-01-19 14:26:50 -05:00
a0a95c95d4 Add Random Number Generator Docstrings (#506) 2017-01-19 11:10:01 -05:00
1335b7c1da Fix unpooling docs (#492) 2017-01-19 11:08:43 -05:00
6d14ef8083 Update batchnorm docstrings
Add missing full stops and a blank line for increased clarity in the rendered documentation.
2017-01-19 14:15:26 +01:00
26a492acf3 Update docstring for ConvTranspose functions
Transposed convolutions are often (but incorrectly) referred to as Deconvolutional operations. Made mention of this in the docstring to make it easier for people to search for this operation in the documentation.
2017-01-19 13:02:58 +01:00
f2741e8038 format fix (#490) 2017-01-18 21:41:10 -05:00
8d1a6975d2 Fix for non-contiguous from_numpy (#489) 2017-01-18 18:53:13 -05:00
c414bf0aaf Fix handling of unicode in torch._C._add_docstr (#487) 2017-01-18 17:22:30 -05:00
99f4864674 fixed RMSprop initialization (#485)
* fixed RMSprop initialization
2017-01-18 17:05:53 -05:00
784cbeff5b added a non-exhaustive list of contributors 2017-01-18 13:54:56 -05:00
9302f860ae Remove unused file TensorDocstrings.cpp (#481)
Tensor docstrings are created in _tensor_docs.py
2017-01-18 13:34:40 -05:00
ac8a5e7f0d Remove error message assertion (#480)
Depending on how PyTorch is compiled, the source code for DataLoader
might not be fully available, which can cause a spurious error in
test_dataloader.py
2017-01-18 13:16:38 -05:00
798fc16bbf add beta tag 2017-01-18 12:21:46 -05:00
0f65c9267d Fix typo 2017-01-18 08:46:04 -08:00
be45231ccb Improve ffi utils (#479)
* Improve ffi utils
2017-01-18 11:17:01 -05:00
279aea683b update conda install command 2017-01-18 10:52:49 -05:00
8aa8f791fc add more torch.* and Tensor docs (#476) 2017-01-18 08:39:33 -05:00
6464e69e21 Docs for torch.Storage (#475) 2017-01-18 03:22:30 -05:00
a93812e4e5 Fix PowConstant (#471) 2017-01-18 01:53:30 -05:00
225f942044 Disable IndexCopy test until #473 is fixed (#474) 2017-01-18 01:18:18 -05:00
d951d5b1cd Fix tensor.cuda(0) when on non-zero device. (#472) 2017-01-18 01:08:37 -05:00
2082ccbf59 More Tensor docs (#470) 2017-01-18 00:42:41 -05:00
473e795277 Fix invalidArguments for functions with tuple outputs, but no other (#468)
arguments.

For example:

   >>> torch.randn(5, 5).geqrf('invalid arg')
   TypeError: geqrf received an invalid combination of arguments - got (str), but expected ()
2017-01-17 23:14:40 -05:00
a09f653f52 Begin to document TensorBase methods (#466) 2017-01-17 21:44:12 -05:00
90fe6dd528 remove spurious pprint 2017-01-17 21:43:38 -05:00
57a2ccf777 PYTORCH_BUILD_VERSION to setup.py 2017-01-17 17:51:16 -08:00
b5f6fdb814 Using accreal instead of real in the API
This is done to be consistent with the changes made to cunn
2017-01-17 16:58:19 -08:00
205b9bc05f fix build_all.sh 2017-01-17 16:55:46 -08:00
14d5d52789 Add placeholder tensor documentation for methods that exist in torch. (#463) 2017-01-17 19:37:47 -05:00
9c218b419f kl_div and docs (#429) 2017-01-17 19:24:01 -05:00
a69d819901 Converting all instances of real to accreal in libTHCUNN
This is because the current version of luaffifb fails to pass
custom structs (i.e. half) as arguments or accept them as return
values.

The accreal parameters are immediately converted to real internally.
This is done to ensure none of the internal code needs to be changed.

This change also removes transform_reals_to_half which is no longer
necessary.

Change-Id: I978151d001de5492576fb0eddfa0608cd4e99149
2017-01-17 16:06:42 -08:00
517fb2f410 Remove free() and retain() from Tensor (#464) 2017-01-17 18:15:11 -05:00
fef2b1526d Adding macros to convert between real and accreal 2017-01-17 15:14:45 -08:00
3719994c96 Remove redundant code in THGenerateAllTypes.h 2017-01-17 15:12:43 -08:00
35c2821d71 Add documentation for methods defined in TensorBase (#462) 2017-01-17 17:40:54 -05:00
e4812b3903 add binary version to setup.py 2017-01-17 14:14:01 -08:00
4cc11066b2 Add torch.utils.data docs and improve notes (#460)
* Add torch.utils.data docs and improve notes
2017-01-17 14:51:05 -05:00
85b64d77b7 Merge pull request #461 from colesbury/visiondocs
Add torchvision reference to docs
2017-01-17 14:50:00 -05:00
db7948d7d5 Add torchvision reference to docs
Some documentation is just copied from the GitHub readme for now.
2017-01-17 11:40:33 -08:00
3d40c0562d improve build_all.sh 2017-01-17 09:49:48 -08:00
146bcc0e70 adding binary build copy option to build_all 2017-01-17 07:52:18 -08:00
8d9f6c2583 Minor fixes to docs 2017-01-17 10:19:14 -05:00
ac32d8b706 fix docs 2017-01-16 21:08:14 -05:00
15c1dad340 Minor fixes and torch.cuda docs 2017-01-16 20:38:14 -05:00
6d8baf7c30 Fix Sphinx warnings 2017-01-16 20:38:14 -05:00
7ced682ff5 Add notes 2017-01-16 20:38:14 -05:00
89cab4f5e6 fix readme language and links 2017-01-16 20:35:08 -05:00
a0afb79898 add pic to readme 2017-01-16 20:15:19 -05:00
d6fa3b3fd5 Deprecate nn.Container in favor of nn.Module 2017-01-16 19:07:37 -05:00
f91bb96071 Remove cmin, cmax and cinv 2017-01-16 19:07:37 -05:00
3b6644d195 Minor README fix 2017-01-17 00:15:06 +01:00
652b468ec2 Readme improvements 2017-01-16 18:05:26 -05:00
af110d37f2 remove old docs 2017-01-16 15:06:08 -05:00
38967568ca Make load_state_dict() more restrictive (#451)
The load_state_dict() function now raises an error if the argument
state_dict has extra keys or is missing keys.

Previously, load_state_dict() ignored extra and missing keys, which made
it hard to notice when you load an invalid state_dict. This could
happen, for example, if you save the state_dict for a DataParallel, but
load it into a single model.

The state_dict() function now only includes the Tensor data from the
parameters, which reduces checkpoint size by not saving gradients.
2017-01-16 13:06:00 -05:00
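A minimal sketch of the stricter behaviour (the spurious key below is purely illustrative, and the exact exception type is an assumption):

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)
    state = model.state_dict()
    state['spurious_key'] = torch.zeros(1)   # simulate a mismatched checkpoint entry

    try:
        model.load_state_dict(state)         # now raises instead of silently ignoring the key
    except (KeyError, RuntimeError) as e:
        print('rejected invalid state_dict:', e)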
df79631a72 Fix a mistake in autograd docs 2017-01-16 12:59:47 -05:00
95f0fa8a92 Change .grad attribute of Variables to be a Variable 2017-01-16 12:59:47 -05:00
1c6ff53b60 Make storages unresizable once exported to numpy 2017-01-16 12:59:47 -05:00
1dbf44c00d Add SmoothL1Loss to functional 2017-01-16 12:59:47 -05:00
1259a0648b Make nn containers copyable 2017-01-16 12:59:47 -05:00
b0055f6229 Improve argument checks for long arg options 2017-01-16 12:59:47 -05:00
90040afc44 Fix cwrap option filtering 2017-01-16 12:59:47 -05:00
59bc96bdc2 Check dropout probability 2017-01-16 12:59:47 -05:00
676ffee542 Check params type in optimizers 2017-01-16 12:59:47 -05:00
77136e4c13 Add anything in torch.legacy docs 2017-01-16 12:59:47 -05:00
604e13775f Add optim docs 2017-01-16 12:59:47 -05:00
02380a74e3 Add warnings to multiprocessing docs 2017-01-16 12:59:47 -05:00
4461ae8090 include cstddef for msvc 2017-01-15 23:45:48 +08:00
2b948c42cd Add SpatialAdaptiveAveragePooling. 2017-01-14 19:44:07 -06:00
133c1e927f fix readme, bump version 2017-01-14 17:47:35 -05:00
b2ae054410 Add SpatialAdaptiveAveragePooling. 2017-01-14 15:27:52 -06:00
2290798a83 if nccl is available, do not compile it and load system version 2017-01-14 10:09:48 +01:00
fd600b11a6 Merge commit '2b88d85505d7317f980e69201e72694d6d5905a4' 2017-01-13 15:58:54 -08:00
b5c9f5c4c3 Merge commit 'ca74bb17b8823d74b83433e2743f23e572501c72' 2017-01-13 15:55:19 -08:00
b8a5b1ed8e Merge commit 'e67b525388a5ae11ed243e94bbc25b4934b03a66' 2017-01-13 15:54:49 -08:00
ca74bb17b8 Merge pull request #675 from pavanky/more-atomic-fix
Ensure atomicAdd(double) is visible to host side code
2017-01-13 17:21:39 -05:00
69d8331195 Use functools.partial 2017-01-13 23:10:45 +01:00
eab5c1975c Avoid strict aliasing warning in float/half conversions. 2017-01-13 14:08:25 -08:00
e67b525388 Merge pull request #911 from gchanan/convWarning
Avoid strict aliasing warning in float/half conversions.
2017-01-13 17:06:17 -05:00
5171e56b82 Ensure atomicAdd(double) is visible to host side code
Just replicating behavior of the cuda headers
2017-01-13 14:05:36 -08:00
f467848448 Avoid strict aliasing warning in float/half conversions.
Verified that at least for GCC 4.4.7 this generates identical code. 2017-01-13 13:58:03 -08:00
2017-01-13 13:58:03 -08:00
7e4ddcfe8a Remove names from register_hook calls (#446)
The register hook calls now return an object that can be used to remove
the hook. For example,

   >>> h = module.register_forward_hook(callback)
   >>> h.remove()  # removes hook

Or as a context manager:

   >>> with module.register_forward_hook(callback):
   ...     pass

This makes it easier for libraries to use hooks without worrying about
name collisions.
2017-01-13 15:57:03 -05:00
3152be5fb3 Add repr to RNNs and Embedding (#428) 2017-01-13 15:53:52 -05:00
b076944dc5 Fix for atomicAdd(double) for CUDA_VERSION < 8000 2017-01-13 12:43:15 -08:00
3a07228509 Add ConvTranspose1d module (#449) 2017-01-13 15:22:57 -05:00
24a2f2e3a0 Add MaxUnpool1d module (#447) 2017-01-13 14:36:25 -05:00
b32dd4a876 add cudnn deb package installation paths to cudnn discovery, add 5.1.10 to load options (#448) 2017-01-13 14:32:23 -05:00
4f4bd81228 Fixes to autograd: (#442)
- Non differentiable outputs could prevent a gradient computation (see
   test_dep_nograd)
 - Crash in backward on variable which doesn't requires_grad (issue
   #438)
 - Stochastic functions could be backproped through multiple times
2017-01-13 13:51:47 -05:00
59b23d79c6 fix cudnn rnn batch_first with tests (#445)
* fix cudnn rnn batch_first with tests
2017-01-13 13:40:27 -05:00
8c14630e35 Fix Tensor.apply_() (#444)
Fixes #411
2017-01-12 21:51:18 -08:00
cc32de8ef9 Fix typos etc. in docs
- replace "long" with the Python type "int"
 - remove "reshape" from torch.rst since torch.reshape is not
   implemented
2017-01-12 21:25:50 -08:00
44696c1375 Fix MaxPool2d on 3D CUDA inputs (#443)
Currently, MaxPool2d returns 4d indices for 3d CUDA inputs, but
correctly returns 3d indices for 3d CPU inputs.
2017-01-12 21:04:25 -08:00
82088a8110 parallelizing catArray to multiple tensors per kernel (#635) 2017-01-12 12:57:30 -08:00
d5e45b2278 Add AvgPool1d which just uses AvgPool2d implementation (#439) 2017-01-12 15:07:11 -05:00
bdfef2975c adding more docs for torch.* functions 2017-01-11 08:19:49 -08:00
b4bb4b64a1 simd.h: really fix the arm64 (i.e. Aarch64) build 2017-01-11 10:07:32 +00:00
3e91c5e1ad Merge pull request #668 from gchanan/thrustalloc
Add THCThrustAllocator.cuh to install files
2017-01-10 19:27:09 -05:00
2b88d85505 Re-route thrust memory allocation to THCudaMalloc / THCudaFree in cunn. 2017-01-10 10:42:29 -08:00
50651970b8 Merge pull request #666 from gchanan/thrustalloc
Re-route thrust memory allocation to THCudaMalloc / THCudaFree
2017-01-10 12:02:51 -05:00
4a8906dd8a Add THCThrustAllocator.cuh to install files so downstream projects can use it. 2017-01-10 09:02:28 -08:00
68e2769a13 Re-route thrust memory allocation to THCudaMalloc / THCudaFree
so it can use the caching allocator.
2017-01-10 08:35:41 -08:00
17c998e99a fixing arm64 build 2017-01-10 00:15:11 -05:00
35758f51f2 Get rid of a few unused imports. 2017-01-09 15:41:58 -08:00
e8102b0a9b fix compiler warning in THCS 2017-01-09 15:19:13 -08:00
04f2bc9aa7 Fix bug in squeeze backward (#425) 2017-01-09 16:29:37 -05:00
d070178dd3 Instantiate 128kb of scratch space in GPU memory per-device by default 2017-01-09 13:21:18 -08:00
c9ec7fad52 Add model_zoo utility to torch.utils (#424)
This was originally part of a torchvision PR, but I think it will be
useful outside vision, such as for distributing word embeddings.
2017-01-09 13:16:58 -05:00
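A hedged usage sketch (the URL is a placeholder, and load_url is assumed to cache the download locally and return the deserialized object):

    import torch.utils.model_zoo as model_zoo

    # Hypothetical URL; replace with a real checkpoint location.
    weights = model_zoo.load_url('https://example.com/word_embeddings.pth')
    print(sorted(weights.keys()))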
f0a6ca4d53 BatchNorm fixes (#423)
- don't use cuDNN for half inputs because weight, bias, running_mean,
   etc. are required to be of different type than for THCUNN
 - accept 3D inputs (N,C,L) in BatchNorm1d
 - remove accidental 'use_cudnn=False'
2017-01-09 13:16:51 -05:00
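A minimal sketch of the (N, C, L) input form now accepted by BatchNorm1d (shapes are illustrative):

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    bn = nn.BatchNorm1d(16)               # 16 channels
    x = Variable(torch.randn(8, 16, 50))  # batch of 8, 16 channels, sequence length 50
    y = bn(x)
    print(y.size())                       # torch.Size([8, 16, 50])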
fd92470e23 Add cuDNN bindings for BatchNorm (#421) 2017-01-07 15:35:24 -05:00
8369664445 Minor doc fixes 2017-01-06 21:51:35 +01:00
35e1adfe82 documentation parity with torch7 for catArray impl 2017-01-06 11:55:57 -08:00
eb91fc5e5d Minor fixes to docs (#412) 2017-01-06 10:59:24 -05:00
d186fdb34c Fix THHalf issues with MSVC. 2017-01-05 08:09:09 -08:00
0f04f71b7e fix API reference link 2017-01-05 02:46:19 -05:00
87f1959be7 adding proper categories to torch.rst 2017-01-04 23:20:57 -05:00
a538055e81 fix invalid use of THPUtils_invalidArguments in sparse tensors 2017-01-04 21:47:48 -05:00
0e345aaf6d Fix invalidArguments to take kwargs and out into account (#397) 2017-01-04 19:49:11 -05:00
c976dd339d remove .zero() on grad_input conv and batch_norm 2017-01-05 01:48:50 +01:00
71cef62436 Fix condition for threadArgErrorHandler
Some error handlers may not have any data associated with them
2017-01-04 16:43:31 -08:00
3a29055044 Fix rnn sphynx docs (#405) 2017-01-04 19:17:10 -05:00
59d66e6963 Sparse Library (#333) 2017-01-05 00:43:41 +01:00
46bc43a80f fixing loss layer docs 2017-01-04 18:40:51 -05:00
7fa60b2e44 fixing docs of activations, pixelshuffle, sparse for rst 2017-01-04 18:40:51 -05:00
c78893f912 removing Image: references in nn activation docs 2017-01-04 13:51:56 -05:00
0d2a4e1a9e fix dropout docs for rst 2017-01-04 13:49:43 -05:00
088f14c697 fix batchnorm and linear docs for rst 2017-01-04 13:35:55 -05:00
4bf7be7bd5 fix RNN module docs for rst 2017-01-04 13:22:02 -05:00
b2ab6891c5 fix the rest of Pool module docs for rst 2017-01-04 12:51:55 -05:00
39ab5bcba8 fix MaxPool1d,2d,3d docs for rst 2017-01-04 03:11:48 -05:00
42f131c09f fixing nn.Conv* documentation for rst and adding nn docs to sphinx 2017-01-04 02:11:27 -05:00
89dca6ffdc Add a patch to stop Sphinx from cross-referencing ivar tags 2017-01-03 18:31:08 -05:00
b7f36f93d5 Expand autograd docs and add sections 2017-01-03 18:31:08 -05:00
58320d5082 Add multiprocessing docs 2017-01-03 18:31:08 -05:00
a461804a65 adding docs for more torch.* functions 2017-01-03 18:29:50 -05:00
817f6cc59d adding linspace, logspace, neg and range 2017-01-03 18:29:50 -05:00
108936169c implement more torch.* docs, remove zero, cauchy, log_normal from torch.* docs as they are not stateless 2017-01-03 18:29:50 -05:00
f60ae085e6 Float -> float, Long -> long 2017-01-03 18:29:50 -05:00
85dda09f95 fixed names and other cosmetics 2017-01-03 18:29:50 -05:00
4f479a98d4 fix indentation issue for all examples, add doc for add 2017-01-03 18:29:50 -05:00
35ba948dde add doc for *mm* functions, *mv* functions and addcmul, addcdiv 2017-01-03 18:29:50 -05:00
6b4ed52f10 adding docs for some torch.* functions, removing all, any stateless methods 2017-01-03 18:29:50 -05:00
dcf5f8671c Add __pow__ to Tensor and list additional undocumented functions (#398) 2017-01-03 13:38:44 -05:00
5340291add Update FindARM.cmake
Fix typos
2017-01-03 12:29:06 -05:00
1c6fe58574 Add gather and scatter to autograd 2017-01-02 13:42:59 -05:00
9f2111af73 Rename Variable.no_grad to Variable.detach 2017-01-02 13:42:59 -05:00
2ed6c6d479 Fix leaf Variable handling in autograd 2017-01-02 13:42:59 -05:00
01ac2d3791 Merge commit '1b97f088cb9e42717122795463a800bf3f503adf' 2017-01-02 09:39:45 -08:00
eac687df5a Merge commit '849cbf3a4774727eadb97c27af13bfbdc976a02a' 2017-01-02 09:39:20 -08:00
6a2785aef7 remove link_prefix from linker arguments (#395) 2017-01-02 12:37:52 -05:00
849cbf3a47 small cmake fix 2017-01-01 19:02:33 -05:00
a0c614ece3 unsqueeze instead of view in dataloader 2017-01-01 23:38:54 +01:00
1b97f088cb Merge pull request #651 from pavanky/cat
Adding support for empty tensors in cat, catArray
2017-01-01 12:47:19 -05:00
097399cdeb Merge branch 'master' into contiguous-cat-1d 2017-01-01 12:34:46 -05:00
7ee152881e Merge commit '3074f8eb8103ecdcbbcbb8d49332d9e7d6f3141c' 2017-01-01 01:13:17 -08:00
3074f8eb81 Removing TH_GENERIC_USE_HALF, TH_NATIVE_HALF, TH_GENERIC_NO_MATH (replaced where appropriate with TH_REAL_IS_HALF), removed half from THGenerateAllTypes, added an explicit THGenerateHalfType.h 2017-01-01 00:57:51 -08:00
748208775f Merge commit '5df17050bf82337d13dbd2108bd17922ac38956c' 2017-01-01 00:08:55 -08:00
5df17050bf Revert "TH_GENERIC_USE_HALF=1 by default, half enabled by default" 2017-01-01 01:06:18 -05:00
92df0eb2bf removing unneeded flags in build_all.sh 2016-12-31 20:16:50 -08:00
995195935b Merge commit 'be8376eb883d2f5a466994e024cde44e6adc6130' 2016-12-31 20:10:11 -08:00
be8376eb88 TH_GENERIC_USE_HALF=1 by default, half enabled by default 2016-12-31 20:07:18 -08:00
b650a45b9c fix botched merge in setup.py 2016-12-31 16:55:53 -05:00
8a20e22239 Add torch.stack 2016-12-31 16:25:39 -05:00
7c5014d803 Add torch.split, torch.chunk and change default dim of cat to 0 2016-12-31 16:25:39 -05:00
62ac1b4bdd Implement missing cases of __matmul__ 2016-12-31 16:25:39 -05:00
0633c08ec9 Add is_shared() method for storages and tensors 2016-12-31 16:25:39 -05:00
cf87cc9214 Check valid configurations of Variable flags 2016-12-31 16:25:39 -05:00
f908432eb3 Ensure that Variable's grad is shared between processes 2016-12-31 16:25:39 -05:00
1bd291c57c Fix multiprocessing tests on macOS 2016-12-31 16:25:39 -05:00
b277df6705 Doc css fixes for mobile and large screens (#389) 2016-12-31 12:01:01 -05:00
ec4d597c59 test fix 2016-12-31 11:08:34 -05:00
d2ef49384e Add custom docs stylesheet (#387) 2016-12-31 10:32:00 -05:00
b5dc36f278 explicitly linking against v1 libs to avoid lua-torch conflicts (#386) 2016-12-31 10:30:36 -05:00
41976e2b60 Merge commit '3dac1b9936a62225cf8516d6d7830fe6c83039ae' 2016-12-30 21:07:13 -08:00
3dac1b9936 cmake C flags fix 2016-12-31 00:06:26 -05:00
d2bb56647f Merge commit '224422eed6813c15b3c3b2c0dcd5e0187ec660a1' 2016-12-30 19:51:01 -08:00
224422eed6 cmake fix 2016-12-30 22:50:06 -05:00
3c26f7a205 Merge commit '10f78985e72fb6834b435ac3f8d0890fa6614365' 2016-12-30 19:24:00 -08:00
9ac9809f27 Merge commit 'd8f4d5f91e3680478a6843d49d7295c1165618f0' 2016-12-30 19:23:41 -08:00
7bf6e984ef Merge commit 'dc95f66a954ad18b80f3f649f8e2c8507c048b74' 2016-12-30 19:23:17 -08:00
10f78985e7 adding TH_LIBRARIES and THC_LIBRARIES var to THCUNN cmake 2016-12-30 22:20:29 -05:00
dc95f66a95 adding TH_LIBRARIES var to THC cmake 2016-12-30 22:10:18 -05:00
d8f4d5f91e adding TH_LIBRARIES var to THNN cmake 2016-12-30 22:08:09 -05:00
47f56f0230 Merge commit '43fbdd3b45d4351623a4aa9c8d5e6dba9eac259a' 2016-12-30 17:46:04 -08:00
b4018c4c30 Merge commit '803d0320771365754658ac74587cc082c2a61fa7' 2016-12-30 17:45:45 -08:00
43fbdd3b45 workaround for luarocks 12.04 bug 2016-12-30 20:44:35 -05:00
803d032077 workaround for luarocks 12.04 bug 2016-12-30 20:44:21 -05:00
9d2d884313 Merge commit 'b5cf1d2fc71604f472a07d0181a05a7f09e276c2' 2016-12-30 16:50:25 -08:00
c0600e655a Merge commit 'c1ca9044bd6dccd293471c6caeeeea4ebd97d61b' 2016-12-30 16:49:56 -08:00
671ed89f2a Merge commit '52c2a92013c45afa5df61a68b16695663ee9fab5' 2016-12-30 16:49:29 -08:00
e0372643e1 Merge commit '541ab961d8f9a02bbbe1a06ba25027116ee93c20' 2016-12-30 16:49:05 -08:00
b5cf1d2fc7 adding THCUNN_SO_VERSION 2016-12-30 19:06:23 -05:00
c1ca9044bd add THNN_SO_VERSION 2016-12-30 19:04:31 -05:00
52c2a92013 adding THC_SO_VERSION property 2016-12-30 19:02:50 -05:00
541ab961d8 adding TH_SO_VERSION option 2016-12-30 18:56:59 -05:00
849794cd2c Remove deprecated and unimplemented functions (#383) 2016-12-30 18:37:44 -05:00
f47fa2cb04 use __get_cpuid when available 2016-12-30 18:10:57 -05:00
7a162dd97a Fix outputs of torch.* comparison functions 2016-12-30 23:02:57 +01:00
b123bace1b Rename torch.autograd.functions to torch.autograd._functions 2016-12-30 23:02:57 +01:00
483490cc25 Move PixelShuffle implementation to functional 2016-12-30 23:02:57 +01:00
8d60e39fdc Rename torch.nn.functions to torch.nn._functions 2016-12-30 23:02:57 +01:00
e7dff91cf3 Fix for multinomial autograd function 2016-12-30 23:02:57 +01:00
ab5776449c Add documentation for some torch.xxx functions (#382) 2016-12-30 17:01:47 -05:00
a229582238 Merge pull request #875 from atkayu/add_histc2
Add a new function bhistc to calculate histogram of batch of images only once
2016-12-30 13:41:42 -05:00
a0df8fde62 Merge pull request #592 from joker512/master
fix: cunn can't find cutorch sources
2016-12-30 11:31:57 -05:00
e4a3aa9295 Change container doc to assign child modules via attributes 2016-12-30 15:51:09 +01:00
be98c5d12d Start documenting torch.Tensor (#377) 2016-12-30 01:21:34 -05:00
bc6a71b1f5 Add Function docs 2016-12-30 00:15:06 -05:00
26f1e2ca9c Add basic autograd docs 2016-12-30 00:15:06 -05:00
75d850cfd2 Fix optim docs 2016-12-30 00:15:06 -05:00
f4870ca5c6 Fix nn docs 2016-12-30 00:15:06 -05:00
235d5400e1 Merge pixelshuffle function into module (#375) 2016-12-29 21:38:37 -05:00
491d5ba4fd add new flags to build_all.sh 2016-12-29 18:16:59 -08:00
d42eadfeb9 Merge commit '2975f539ff8ac9b8e07fb2b610bd69a1596d4c3c' 2016-12-29 17:51:34 -08:00
9a40821069 Merge commit '1ac038ab243bb2718b37cbd81eadbfeb2a234252' 2016-12-29 17:51:13 -08:00
2975f539ff sort cuda 8.0+ fix 2016-12-29 17:47:30 -08:00
64ca584199 Fix group support in convolution modules (#374) 2016-12-29 20:01:39 -05:00
5263469e21 Fix handling of zero sizes in caching host allocator 2016-12-29 15:36:49 -08:00
c367e0b64e Support dilated 1d and 3d convolutions (#372)
Fixes #367
2016-12-29 18:20:32 -05:00
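A hedged sketch of a dilated 1d convolution, assuming the dilation keyword is exposed on the nn.Conv1d module as described:

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    conv = nn.Conv1d(4, 8, kernel_size=3, dilation=2)  # effective receptive field of 5
    x = Variable(torch.randn(1, 4, 20))
    print(conv(x).size())                              # torch.Size([1, 8, 16])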
183b3aacd2 Hold CuDNN PRNG state between RNN iterations 2016-12-30 00:14:55 +01:00
101950ce92 fix repr in legacy.nn.linear 2016-12-29 17:30:46 -05:00
239ae94389 fix in conv repr 2016-12-29 17:30:46 -05:00
55e850d825 test if modules can be printed with fixes 2016-12-29 17:30:46 -05:00
62af45d99f Basic functional interface (#354) 2016-12-29 22:53:57 +01:00
1ac038ab24 Merge pull request #882 from amrobbins/ppcvectorinstxns
Add support for VSX vector instructions on PPC
2016-12-29 14:24:56 -05:00
77a925ab66 Add THHalfTensor support to cutorch (#655)
* Add THHalfTensor support to cutorch.
2016-12-29 14:23:45 -05:00
d0d33d3ae7 Add support for torch.HalfTensor (#874)
* Add support for torch.HalfTensor.

* Improvements/Simplifications for torch.HalfTensor.

Improvements/Simplifications:
1) Defines half type as TH_Half, so as to not conflict with cutorch
version.  Previously, these were defined as the same "half" type and
required proper ordering of includes to ensure type was only defined
once, which would have affected all downstream projects.
2) No longer generates math functions that are not actually defined
on torch.HalfTensor, e.g. maskedFill, map, etc.
3) Adds tests for all available torch.HalfTensor functions
4) Allows compiling without TH_GENERIC_USE_HALF (so if there's a
problem can just unset that in CMakeLists rather than backing out)
5) Some simplifications: removes a new copy optimization and
some TH_HALF literal definitions

Limitations:
Because math functions are not defined, some "non-math" operators
on torch.HalfTensor give an error message, e.g. __index__/__newindex__
with a ByteTensor apply a mask, but masks aren't implemented.  These
limitations aren't always obvious (e.g. for documentation purposes),
but they should always give an error message.

* Rename TH_HALF to THHalf.
2016-12-29 14:23:26 -05:00
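A minimal sketch of the conversion path (the half() conversion method is assumed to be generated alongside the new type):

    import torch

    f = torch.randn(4)        # FloatTensor
    h = f.half()              # torch.HalfTensor: storage and copies work...
    print(type(h))
    print(h.float().sum())    # ...but math is done after converting back to float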
9b7eceddc8 Accept outputs in out argument 2016-12-29 12:25:59 +01:00
24af02154c Use ForkingPickler for sharing tensor/storages across processes (#344)
This hooks into the (internal) ForkingPickler class in multiprocessing
to reduce tensors, storages, and CUDA events instead of our queue from
joblib. This makes it easier to use the standard multiprocessing classes
in later versions of Python.

This also exposes:

 - Tensor/Storage.share_memory_()
 - Module.share_memory()

These methods move the CPU tensors and storages to shared memory. If
you're using the "fork" method of multiprocessing, these objects can be
directly inherited instead of serialized through a queue.
2016-12-28 20:34:23 -05:00
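A minimal sketch using the "fork" start method (Linux/macOS), where the shared tensor is inherited directly by the child process:

    import torch
    import torch.multiprocessing as mp

    def add_one(t):
        t += 1                     # in-place update lands in the shared storage

    if __name__ == '__main__':
        x = torch.zeros(4)
        x.share_memory_()          # move the underlying storage to shared memory
        p = mp.Process(target=add_one, args=(x,))
        p.start()
        p.join()
        print(x)                   # reflects the child's update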
86ec14e594 Add support for VSX vector instructions on PPC
Added support for the fill, diff, scale, mul and add functions using
PPC CPU vector instructions. These are used in place of the versions
of these functions written for x86, when compiled on PPC.

This fixes a compile failure on PPC
2016-12-28 16:58:09 -06:00
8a29338837 Use cuDNN for Conv3d and ConvTranspose3d (#359)
I've also updated test_nn.py to run marked tests twice: once with cuDNN
enabled and once with it disabled.
2016-12-28 16:14:47 -05:00
29918c6ca5 Copy libnccl.so.1 instead of libnccl.so
Occasionally, my PyTorch checkout gets into a bad state where libnccl.so
does not exist, but the NCCL makefile doesn't build it because
libnccl.so.1 exists. Switch to copying libnccl.so.1 to work around this.
2016-12-28 20:21:31 +01:00
80a44e84dc Change multinomial return type for CUDA 2016-12-28 18:15:17 +01:00
5497b1babb Use TypeError in invalidArguments 2016-12-28 18:15:17 +01:00
bef70aa377 Make type checking more strict and fix topk arguments 2016-12-28 18:15:17 +01:00
0d30f77889 Make variables picklable with protocols <2 2016-12-28 18:15:17 +01:00
e27bb3e993 Minor fixes 2016-12-28 18:15:17 +01:00
179d5efc81 Merge commit '310ec57fd7176e07137ab7bc717f3602b6f53aa5' 2016-12-28 07:33:37 -08:00
b55e38801d rename histc2 to bhistc 2016-12-28 16:26:09 +08:00
e704ec5c6f Merge commit '46f024846698cd8201d6c1804f21bffda15a2069' 2016-12-27 19:12:45 -08:00
6cda6bb34c Merge commit 'd2a93c310292c9427056e02ac7e0d5cca12a04a2' 2016-12-27 19:12:21 -08:00
46f0248466 Use bool for sizeAverage in SoftMarginCriterion 2016-12-28 00:36:11 +01:00
310ec57fd7 Fix typos in THCTensorRandom 2016-12-28 00:16:53 +01:00
cd82b2b869 Implement comparison and logical operators for tensors 2016-12-28 00:04:08 +01:00
126a1cc398 Add Sphinx docs 2016-12-28 00:03:39 +01:00
bf650f05b3 Merge pull request #652 from apaszke/multinomial
Make multinomial return a LongTensor (compatible with CPU version)
2016-12-27 17:54:54 -05:00
f2606a7502 Make multinomial return a LongTensor (compatible with CPU version) 2016-12-27 23:12:12 +01:00
b07fe52ee0 Adding support for empty tensors in cat, catArray 2016-12-27 13:37:42 -08:00
b07358b329 renaming test to avoid dot in test name 2016-12-27 13:34:09 -08:00
2aea8077f9 renaming test to avoid dot in test name 2016-12-27 13:17:04 -08:00
41f9c14297 Merge commit '135687f04a4e4e0722c14f096c9a1fc647c95f07' 2016-12-27 13:12:26 -08:00
135687f04a critical bugfix in storage copy 2016-12-27 13:11:32 -08:00
b140e70b58 Add autograd.backward (#341) 2016-12-26 19:10:35 -05:00
ec987b57f6 removing 3.3, 3.4 from README badges 2016-12-26 14:52:36 -05:00
596677232c Add a different code path for catting contiguous tensors along the first dimension, for speed reasons.
Fix a bug in cat when catting with an empty tensor along first dim (it added an extra dim).
Fix the ambiguous 'catting along last dimension' sentence in the doc and change the behavior to pick the maximum last dimension over all input tensors.
Now empty tensors are allowed.
2016-12-26 10:23:42 -05:00
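A minimal sketch of the first-dimension behaviour described above, assuming the Python torch.cat binding goes through the same catArray path:

    import torch

    a = torch.randn(2, 3)
    b = torch.Tensor()              # empty tensor, contributes nothing
    c = torch.randn(4, 3)

    out = torch.cat([a, b, c], 0)   # contiguous fast path along the first dimension
    print(out.size())               # torch.Size([6, 3])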
9d74e139e5 removing 3.3 and 3.4 from travis build 2016-12-25 15:13:13 -05:00
d2a93c3102 remove unused buffer in avg pooling 2016-12-25 20:00:10 +01:00
bc475cad67 Move max pooling construction logic to functions (#343) 2016-12-25 10:28:11 -05:00
45d6212fd2 default args for conv functions 2016-12-25 01:55:00 -05:00
f45d75ed22 make the CUDA-aware tests back off if CUDA is not available 2016-12-24 15:36:00 -05:00
b03407289f Merge commit '55a794e6ec8d01fc8cceee14ce23ec501e517570' 2016-12-24 11:06:27 -08:00
55a794e6ec fixing OpenMP longjmp bugs in *MaxUnpooling 2016-12-24 13:54:43 -05:00
93ed476e7d adding LAPACK double bindings, adding fmod and remainder 2016-12-22 17:36:47 -08:00
10faa303bc Merge commit '6fa371cb0db9f43e3d05746c7e90516975052589' 2016-12-22 17:35:13 -08:00
6fa371cb0d bugfix for qr skinny matrices 2016-12-22 16:29:53 -08:00
18a2691b4b Fix memory leak in THStorage_copyCudaXXX 2016-12-22 13:49:31 -08:00
f7bd3f7932 added pixel shuffle layer + tests
removed duplicate save_for_backward
2016-12-22 21:43:38 +01:00
f8dee4620a add a new function histc2 2016-12-22 10:11:58 +08:00
800e24616a Merge commit 'fa61159dd0bfd9bbb190e1dfbd90a68f4d3c30c8' 2016-12-21 12:40:41 -08:00
d63a435787 Merge commit 'f16a624b35dd28fbd4cdcd3bd08dfc2421c3e2b0' 2016-12-21 12:40:20 -08:00
a9c2809ce3 change the order of cudnn libs 2016-12-21 05:44:16 -08:00
fa61159dd0 cremainder, cfmod implementations (take 2) (#646) 2016-12-20 20:43:07 -05:00
a215e000e9 fix for out of place tests and for non standard I/O pipes 2016-12-20 16:13:24 -08:00
f16a624b35 correctness fixes for mod and remainder for integer type tensors. 2016-12-20 11:41:16 -08:00
61c2896cb8 Merge pull request #638 from pavanky/multinomial_fix
Bugfix for multinomial distribution
2016-12-20 14:08:59 -05:00
22ebc3f205 Revert "Add support for cremainder, cfmod" 2016-12-20 09:35:41 -05:00
8fa9f443ec Merge pull request #641 from killeent/cfuncs
Add support for cremainder, cfmod
2016-12-19 20:49:29 -05:00
bb72ccf1a5 Support CUDA IPC in Python 3 (#203)
CUDA IPC only works with Python 3 using the "spawn" start method. You
can select the start method using the get_context method:

 import torch.multiprocessing as mp
 ctx = mp.get_context('spawn')
 queue = ctx.Queue()
 event = ctx.Event()
2016-12-19 20:42:53 -05:00
2e73456f5c Fix compiler warnings in Tensor.cpp 2016-12-19 20:35:08 -05:00
3e49a2b4b7 Prevent deepcopy from changing Parameters into Variables 2016-12-19 20:35:08 -05:00
4694e4050b Fix printing bug when all values are NaN or inf 2016-12-19 20:35:08 -05:00
59b9eeff49 Expose gather and equals for CUDA tensors 2016-12-19 20:35:08 -05:00
1744fad8c2 Use 'void' for no-arg function 2016-12-19 12:23:17 -08:00
e46d942ca6 Fix double initialization of HalfStorage (#331) 2016-12-19 15:19:41 -05:00
93a6136863 Add support for cremainder, cfmod 2016-12-19 11:25:10 -08:00
230bde94e7 fix about section 2016-12-19 11:00:53 -05:00
20fffc8bb7 Fix torch.is_tensor for half tensors (#322)
Fixes #311
2016-12-19 15:27:47 +01:00
861a3f3a30 avoid shadowing warnings 2016-12-17 14:01:11 -08:00
ee52102943 small change from set to dict 2016-12-17 13:39:04 -08:00
26516f667e Fix multinomial bug and decrease precision of normal test (#325) 2016-12-17 21:40:13 +01:00
5586f48ad5 add cudnn 5.0.5 to supported versions (#321) 2016-12-17 07:57:20 -05:00
cc6e3c92d2 ensure that legacy linear has gradWeight and gradBias fields (#319) 2016-12-17 00:06:58 +01:00
a2ef5782d0 Revert "Bugfix of type in THCTensor macro." 2016-12-16 17:20:57 -05:00
0c1c0e21b8 Bugfix of type in THCTensor macro.
A fix for issue #632.
2016-12-16 15:37:06 -05:00
ffcc38cf05 Deterministic ordering of parameters and buffers. (#317)
Uses the assignment syntax to get deterministic ordering of parameters.
The ordering of parameters using the constructor syntax is
non-deterministic because kwargs use dict() in Python 3.5 and earlier.
2016-12-16 14:45:56 -05:00
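A minimal sketch of the assignment syntax that yields the deterministic ordering (the kwargs constructor form, e.g. passing fc1=..., fc2=... to a container, depended on dict ordering in Python 3.5 and earlier):

    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.fc1 = nn.Linear(10, 10)   # registered first
            self.fc2 = nn.Linear(10, 2)    # registered second

    net = Net()
    print(list(net.state_dict().keys()))   # ['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias']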
cc24b68584 Merge commit 'f413ee087df1a4bbd8b5a9baba83d07ae0729ea0' 2016-12-16 05:29:16 -08:00
8a70067b92 Add support for stochastic functions in autograd (#294) 2016-12-16 13:14:37 +01:00
33b227c45b serialization bug fix (#314) 2016-12-16 12:05:36 +01:00
fb68be952d Bugfix for multinomial distribution
- Ensures the index of the first bin from the cdf is returned.
2016-12-15 16:01:37 -08:00
f413ee087d Add missing free in LookupTable (#400) 2016-12-15 22:17:37 +01:00
6495f5dd30 fix bounds issue in snprintf 2016-12-14 17:11:26 -08:00
8e09f0590b Make sure that C extension was compiled with cuDNN before using it 2016-12-15 00:47:55 +01:00
08d346df9c Print libraries used for building the extension 2016-12-15 00:47:55 +01:00
12cf96e358 Don't change requires_grad of parameters in train() and eval() 2016-12-15 00:47:55 +01:00
765a720d1c Add support for tds.Vec and tds.Hash in load_lua 2016-12-15 00:47:55 +01:00
cace62f94c Fix a bug in narrow docs 2016-12-15 00:47:55 +01:00
767c96850d Return False from torch.cuda.is_available() when no devices are visible 2016-12-15 00:47:55 +01:00
b73e78edbb Check nDimension in t() and t_() 2016-12-15 00:47:55 +01:00
7914cc119d Fix bmm for Variables 2016-12-15 00:47:55 +01:00
2b13eb2a6c Fix naming of setup.py env toggles 2016-12-15 00:47:55 +01:00
8768e64e97 Allow returning changed gradients from the hooks 2016-12-15 00:47:55 +01:00
9212b9ca09 fix wrong export directive for THCCachingHostAllocator (#633)
2016-12-15 00:36:03 +01:00
0d0f197682 Add note on Huber loss (#310) 2016-12-14 21:39:42 +01:00
281e34d1b7 fixes for changes in THNN API 2016-12-13 18:10:07 -08:00
287ba38905 Merge commit 'ed9dbff4e0295dbeb2e8de908cb8c1109c278a8a' 2016-12-13 17:23:56 -08:00
ed9dbff4e0 removing ifdef 2016-12-13 17:22:52 -08:00
6ba4e48521 Merge commit '3adcb2c157ed7df5aaff9b59d4526aa24ec770db' 2016-12-13 16:49:38 -08:00
b7269f2295 Merge commit '220183ed783101f19d88cb8fb3052fd4abc7234f' 2016-12-13 16:49:15 -08:00
5ab317d4a6 Merge commit '258c9ffb2c2d23a06b153aa9161a88ad930cfbbc' 2016-12-13 16:48:45 -08:00
431bcf7afa Merge commit '56245426ebcf239363867905ca2a4cea676dd45d' 2016-12-13 16:48:16 -08:00
41909e8c5b adding a couple more imports 2016-12-13 16:47:00 -08:00
56245426eb small fixes to allocator 2016-12-13 16:45:01 -08:00
3adcb2c157 Check that batch size matches the target size in ClassNLLCriterion (#399) 2016-12-14 00:25:05 +01:00
6d12185cc9 Fixed compilation on Raspberry PI without NEON 2016-12-13 17:30:54 -05:00
258c9ffb2c Implement bernoulli with element-wise probabilities for all types 2016-12-13 11:10:28 -08:00
dede431dd9 More state_dict fixes (#305)
In #304 I forgot even more...

I did a repo search and this time it should be all.
2016-12-13 13:59:06 +01:00
6312d29d80 Another documentation change, missed one in #303 (#304)
Apparently load_parameter_dict was also renamed to load_state_dict
2016-12-13 12:47:40 +01:00
ab5f26545b Correct documentation to be in line with #237 (#303)
.parameter_dict was renamed to .state_dict in #237

This documentation change reflects that.
2016-12-13 12:32:42 +01:00
6567c1342d small doc fixes 2016-12-12 23:51:54 +01:00
3d6c2e023c TensorInfo related code documentation 2016-12-12 10:06:13 -08:00
89d930335b fix tests for GPU-less setup (#298) 2016-12-12 10:56:57 +01:00
04393cd47d fix gcc-6 build on os x (#297) 2016-12-12 00:01:15 +01:00
28f0cf6cee Add docstring support to cwrap (#295) 2016-12-11 23:25:14 +01:00
1af9a9637f Refactor copy and release GIL during copy (#286) 2016-12-11 21:54:58 +01:00
1031d671fb legacy fixes (#287) 2016-12-11 20:13:48 +01:00
2a974f5ca2 Fix 1.3.2 compilation 2016-12-08 09:11:43 -08:00
ee91b22317 Merge pull request #394 from gchanan/volumShapeChecks
Improve Volumetric shape checking.
2016-12-07 02:07:22 +01:00
220183ed78 Improve gradOutput checks for VolumetricReplicationPadding. 2016-12-06 09:09:38 -08:00
504d2ca171 Improve gradOutput check for VolumetricMaxUnpooling. 2016-12-06 09:09:27 -08:00
d535aa94a1 Improve shape checks for VolumetricDilatedConvolution, VolumetricConvolutionMM,
VolumetricFullConvolution.

Also add some additional checks for SpatialFullConvolution.
2016-12-06 09:06:07 -08:00
0376a1909b Improve shape checks for VolumetricAveragePooling, VolumetricDilatedMaxPooling,
VolumetricMaxUnpooling, VolumetricReplicationPadding.
2016-12-06 09:06:03 -08:00
f757077780 Improve shape checks for VolumetricMaxPooling and VolumetricDilatedMaxPooling. 2016-12-06 09:05:59 -08:00
648e9fbb58 Adding missing file 2016-12-05 18:06:24 -08:00
9f7114a4a1 Improve shape checks for VolumetricDilatedConvolution, VolumetricConvolution,
VolumetricFullConvolution.

Also add some additional checks for SpatialFullConvolution.
2016-12-05 12:22:04 -08:00
7d03da0890 Improve shape checks for VolumetricAveragePooling,
VolumetricMaxUnpooling, VolumetricReplicationPadding.
2016-12-05 09:31:00 -08:00
4e0cecae7f Improve shape checks for VolumetricMaxPooling and VolumetricDilatedMaxPooling. 2016-12-05 08:20:19 -08:00
72dbb76a15 fix half type numerics issue in SpatialFractionalMaxPooling 2016-12-02 16:33:27 -08:00
cceb926af3 Remove extra size check in SpatialAveragePooling. 2016-12-02 15:36:11 -08:00
0d7d29fa57 Enable caching allocator for CUDA pinned memory (#275)
Also add binding for CUDA "sleep" kernel
2016-12-02 01:33:56 -05:00
be3276fcdd Account for batch_size in DataLoader.__len__() (#277) 2016-12-02 01:21:36 -05:00
09c94a170c Merge commit 'f2a18004a77f146bb5b431715402f4afd3cacccd' 2016-12-01 22:16:58 -08:00
f2a18004a7 Process outstanding CUDA events in recordEvent
Without this, the cuda_events could continuously grow from calls to
cudaMemcpyAsync, but would never be processed if there were no new
pinned memory allocations.

For example:

 t1 = cutorch.createCudaHostTensor(10)
 t2 = torch.CudaTensor(10)
 while true do t2:copyAsync(t1) end
2016-12-01 19:09:47 -08:00
1a3ff1bd28 Remove unnecessary shape checks in Spatial Pooling modules.
Checks comparing input image sizes to kernel sizes are superseded
by output size checks.
2016-12-01 15:49:53 -08:00
a5d3c779c7 Add gradOutput shape checks in temporal modules. 2016-12-01 15:49:48 -08:00
9d32e60dc2 Fix spacing in SpatialDilatedMaxPooling. 2016-12-01 15:49:41 -08:00
f6913f56ea Remove unnecessary shape checks in Spatial Pooling modules.
Checks comparing input image sizes to kernel sizes are superseded
by output size checks.
2016-12-01 15:38:51 -08:00
801fe8408f Add gradOutput shape checks in Temporal modules. 2016-12-01 15:37:59 -08:00
cf4a979836 Improve shape checking for Temporal Convolution. 2016-12-01 15:37:49 -08:00
34d27771c6 1.3.2 release
Broadcast tuning
Better checking of inputs
Copy/reduce code simplification
2016-12-01 15:17:50 -08:00
1093821c33 Replace min BW by average BW in tests 2016-12-01 15:16:35 -08:00
91f2946310 Import most common packages by default 2016-12-01 23:14:41 +01:00
2bd7a3c31d Don't raise an error when retrieval of container's source code fails 2016-12-01 23:14:41 +01:00
a681f6759b Raise correct error types when indexing tensors 2016-12-01 23:14:41 +01:00
cb849524f3 Improve cuDNN detection at build time 2016-12-01 23:14:41 +01:00
1f5951693a Change torch.randperm to return Long tensors 2016-12-01 23:14:41 +01:00
87748ffd4c Add .type() for torch.nn modules 2016-12-01 23:14:41 +01:00
0580f5a928 Add __len__ for tensors 2016-12-01 23:14:41 +01:00
88d9fdec2e Add torch.cuda.set_device 2016-12-01 23:14:41 +01:00
506a40ce44 Remove optim submodule attributes from torch.optim package 2016-12-01 23:14:41 +01:00
bf0e185bd6 Merge commit 'bb1019d1ec1503718b97d17366902f96f349f472' 2016-12-01 13:47:20 -08:00
5b3ccec10d Merge commit 'c2d32030a25e352eb2e2af26931163c0f4c96b36' 2016-12-01 13:46:35 -08:00
eb07581502 Merge commit 'bec6ab47b6782f60925e306b69e0f556274fb28e' 2016-12-01 13:46:03 -08:00
934a2b6878 Merge commit 'b27d4de850b5f43829bd4980f5e7f3b4b32ab7cf' 2016-12-01 13:45:05 -08:00
bec6ab47b6 Add caching allocator for pinned (host) memory
Adds a caching allocator for CUDA pinned (page-locked) memory. This
avoid synchronization due to cudaFreeHost or cudaHostUnregister at the
expense of potentially higher host memory usage.

Correctness is preserved by recording CUDA events after each
cudaMemcpyAsync involving the pinned memory. The pinned memory
allocations are not reused until all events associated with it have
completed.
2016-12-01 13:35:12 -08:00
49480f1548 Adds a CUDA "sleep" kernel
Adds a CUDA "sleep" kernel which spins for the given number of
iterations. This is useful for testing correct synchronization with
streams.
2016-12-01 12:45:07 -08:00
18a3c62d9b Allow NoneType for parameters in Module.load_state_dict 2016-12-01 20:12:15 +01:00
6322cf3234 Allow device=None in Tensor constructor
Setting device=None is the same as not specifying the device (use the
current active device).
2016-12-01 20:09:19 +01:00
4e2b154342 update install command from source 2016-12-01 10:55:04 +01:00
bb1019d1ec Add newContiguous calls that have been removed from lua. 2016-11-30 13:58:22 -08:00
c2d32030a2 Move make contiguous code from lua to C.
Exceptions are:
1) SparseLinear
requires additional parameters to be passed in (e.g. nbatches),
so it's not clear it's worth moving to C since it won't really simplify the binding
code logic.

2) BatchNormalization
requires "makeBatch", which isn't a trivial translation to C.

3) LookupTable
requires "view" in C, which is already a TODO

4) SpatialUpSamplingBilinear
requires "view" in C, which is already TODO
2016-11-30 13:45:16 -08:00
162170fd7b Add optional weight decay to optim.SGD (#269) 2016-11-29 20:35:40 -05:00
ea728e7c5e Add DataParallel container (#268)
Adds a container version of the `data_parallel` function. This is a
drop-in replacement for the DataParallel class in the ImageNet example.
2016-11-29 16:36:01 -05:00
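A minimal sketch of the container form (assumes at least one visible GPU; without CUDA the bare module can be used unchanged):

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    model = nn.Linear(10, 5)
    if torch.cuda.is_available():
        model = nn.DataParallel(model).cuda()   # scatter the batch across GPUs, gather the outputs
        x = Variable(torch.randn(32, 10).cuda())
        print(model(x).size())                  # torch.Size([32, 5])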
aea6ba4bcd Support pinned memory in the DataLoader (#265)
DataLoader now supports the constructor argument 'pin_memory'. When set
to true, tensors in the sample are copied to pinned memory. This happens
in a background thread when num_workers > 1.
2016-11-29 12:35:03 -05:00
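A minimal sketch of the new constructor argument (dataset contents are illustrative):

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    if __name__ == '__main__':
        dataset = TensorDataset(torch.randn(100, 3), torch.randn(100, 1))
        loader = DataLoader(dataset, batch_size=10, num_workers=2, pin_memory=True)
        for inputs, targets in loader:            # batches arrive in page-locked memory,
            print(inputs.size(), targets.size())  # which speeds up asynchronous .cuda() copies
            break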
ab357c14fc Merge pull request #1051 from gchanan/temporalShapeCheck
Improve error messages/shape checks for temporal modules.
2016-11-28 13:51:16 -06:00
606aa43da0 Merge pull request #383 from gchanan/TemporalShapeCheck
Improve error messages/shape check in TemporalMaxPooling.
2016-11-28 13:50:59 -06:00
8bfa802665 Improve error messages/shape check in TemporalMaxPooling. 2016-11-28 11:46:26 -08:00
ff5b73c0b3 Improve error messages/shape checks for temporal modules. 2016-11-28 11:19:00 -08:00
ddddfba1c0 Merge pull request #54 from peterhj/peterhj-staticlib
Add a static library target "staticlib" to the Makefile.
2016-11-28 09:15:39 -08:00
86c95014a4 use local modified select_compute_arch.cmake for msvc 2016-11-28 14:02:21 +08:00
288c950c5e use local modified select_compute_arch.cmake for msvc 2016-11-28 13:23:24 +08:00
b27d4de850 changes to compile with msvc 2016-11-28 10:27:36 +08:00
61063ebade Merge commit 'a7f24ccb7635447b133011d39e36279be140149e' 2016-11-26 09:13:12 -08:00
3e70e26278 Merge commit '08a1bc71c0712a4151de83d1487a55b218ae1a15' 2016-11-26 09:12:53 -08:00
66e7e42800 Merge commit '379860e457dbb72c0f18e0366e5b199452b302f5' 2016-11-26 09:12:24 -08:00
0fecec14b8 fixing bug in indexing when given float indices 2016-11-26 11:50:56 -05:00
a7f24ccb76 Fix shapeCheck in Spatial Pooling modules 2016-11-26 17:41:59 +01:00
08a1bc71c0 Fix shapeCheck in Spatial Pooling modules 2016-11-26 15:00:32 +01:00
04e896a4b4 adding coverage support for tests 2016-11-26 00:26:30 -05:00
5dcfb80b36 lua serializer registers CUDA classes only when CUDA is available 2016-11-26 00:26:30 -05:00
9da60c39ce Fix batch_first in AutogradRNN (#255) 2016-11-25 23:55:45 -05:00
379860e457 Lazily initialize CUDA devices
Previously, cutorch would initialize every CUDA device and enable P2P
access between all pairs. This slows down start-up, especially with 8
devices. Now, THCudaInit does not initialize any devices and P2P access
is enabled lazily. Setting the random number generator seed also does
not initialize the device until random numbers are actually used.
2016-11-25 15:22:16 -08:00
bcfa2d6c79 Add .t7 file reader 2016-11-25 00:41:55 +01:00
8b492bbc47 Return accreal as correct python types 2016-11-25 00:40:36 +01:00
a49b7b0f58 Fix bug when Variable constructor didn't set the error properly 2016-11-25 00:40:36 +01:00
c781ac414a Unify signatures of max, mean, etc. between variables and tensors 2016-11-25 00:40:36 +01:00
656dca6edb Implement in-place operators for variables 2016-11-25 00:40:36 +01:00
830adfd151 Allow passing torch.Size to expand 2016-11-25 00:40:36 +01:00
6f7c8e4ef8 Fix bug when passing 0 as dim to max, min, mode, median and kthvalue 2016-11-25 00:40:36 +01:00
5765d608cc Add a static library target "staticlib" to the Makefile.
Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.
2016-11-24 11:31:03 -08:00
2ba6678766 Revert "Lazily initialize CUDA devices" 2016-11-23 19:40:03 -05:00
71a47d1bed Merge pull request #610 from colesbury/lazy
Lazily initialize CUDA devices
2016-11-23 17:48:00 -05:00
51bf6321ea Implemented cudaMemGetInfo for caching allocator (#600)
* Implemented cudaMemGetInfo for caching allocator
2016-11-23 17:38:57 -05:00
aa8916e7c6 Don't unpack single element tuples returned by functions 2016-11-23 18:48:41 +01:00
2e24da2a0b Change parameter_dict to state_dict in torch.nn 2016-11-23 18:48:41 +01:00
c94ccafb61 Print error message when constructing a tensor from a numpy array with negative strides 2016-11-23 18:48:41 +01:00
80a827d3da Fix data_parallel bugs 2016-11-23 18:48:41 +01:00
6909c8da48 Use TH_INDEX_BASE for range asserts in MultiLabelMarginCriterion 2016-11-23 13:26:16 +01:00
c07105a796 fix cwrap for changed signatures 2016-11-22 14:27:41 -08:00
c40c061a9f Lazily initialize CUDA devices
Previously, cutorch would initialize every CUDA device and enable P2P
access between all pairs. This slows down start-up, especially with 8
devices. Now, THCudaInit does not initialize any devices and P2P access
is enabled lazily. Setting the random number generator seed also does
not initialize the device until random numbers are actually used.
2016-11-22 13:43:25 -08:00
a9bd27ce5c Merge commit '709255d9952783eed6c8f84e504693f9b436f852' 2016-11-22 13:26:09 -08:00
2e36c4ea2d Merge commit 'f3cb636294fbd0e15dd4b3bfdca16e73d1dca38b' 2016-11-22 13:25:53 -08:00
4e45385a8d Merge commit 'b27f576f29189ca78dd670cbd177bfa29b695c50' 2016-11-22 13:25:29 -08:00
cf5e925c10 Merge commit 'f6b94dd830c06692cd78addd41868a7a12c48755' 2016-11-22 13:25:00 -08:00
709255d995 added shape checks for SpatialAveragePooling 2016-11-22 13:23:16 -08:00
f3cb636294 refactoring and adding additional shape checks for SpatialAveragePooling 2016-11-22 13:08:58 -08:00
e3f440b1d0 Make torch.backends.cudnn work on OSX 2016-11-22 19:06:08 +01:00
f6b94dd830 Add some documentation for APPLY and DIM_APPLY macros 2016-11-21 14:02:33 -08:00
3911a1d395 Fix memory leak in LogSoftMax 2016-11-21 21:32:10 +01:00
ebd3648fd6 Call newContiguous rather than arg checking isContiguous. 2016-11-21 21:32:10 +01:00
f698f09cb7 Add contiguous checking / make tensors contiguous for
SpatialUpSamplingBilinear, PReLU, SpatialSubSampling, TemporalConvolution.
2016-11-21 21:32:10 +01:00
86aa5dae05 Move VolumetricConvolution contiguous code from lua to C. 2016-11-21 21:32:10 +01:00
179c82ffb4 Autograd functions no longer store references to saved_variables
Only references to their data and version counters are stored.
Also, it is now possible to have None arguments in save_for_backward
and return too many values from backward (as long as the excessive
results are None).
2016-11-21 19:39:55 +01:00
233017f01f Add torch.multinomial for CUDA 2016-11-21 19:39:55 +01:00
c2c515516b Remove irrelevant output from ncclReduce Fortran tests 2016-11-21 10:18:04 -08:00
9c18468fe2 Add Copyright header to Fortran bindings source files 2016-11-21 10:17:58 -08:00
597bbfeacd SpatialConvolutionLocal uses baddbmm 2016-11-21 09:10:26 -08:00
99a169c17e Fix memory leak in LogSoftMax 2016-11-19 23:44:31 +01:00
0613ac90cd string.split and string.join removed for .split and .join 2016-11-18 16:23:34 -08:00
78871d829a Call PyObject_GC_UnTrack from tp_dealloc handler (#231)
Without the PyObject_GC_UnTrack call, the tp_dealloc handler could get
called twice if a referred to object triggers a garbage collection from
its destructor.

See http://bugs.python.org/issue28737
2016-11-18 14:06:35 -05:00
d40a7bf9eb Fix Scatter.backward() (#232) 2016-11-18 13:58:09 -05:00
b27f576f29 guard random functions for half 2016-11-18 09:32:32 -08:00
073dfd8b88 bump version 2016-11-18 12:26:12 -05:00
509dd57c2e tensor docs 2016-11-18 04:00:27 -05:00
7a837b7a14 fixing nn docs to be categorized, and optim docs 2016-11-18 03:18:48 -05:00
dee864116a optim docs 2016-11-17 21:09:17 -05:00
5f2b32e45b Add Fortran bindings 2016-11-17 15:33:34 -08:00
e51d0bef97 Add cuDNN bindings for 2D transposed convolution 2016-11-17 14:34:40 -08:00
2fd78112ab Add half copy/conversions 2016-11-17 14:34:33 -08:00
5c14bd2888 Merge pull request #605 from gchanan/halfAddrAddmv
Add half support for addmv and addr.
2016-11-17 14:33:31 -08:00
84b4665e02 Add half support for addmv and addr. 2016-11-17 14:27:56 -08:00
26d626a47c adding docs for loss functions, container, module and fix typos 2016-11-17 15:11:27 -05:00
6ff6299c65 fix memory leak in (equal) 2016-11-16 15:43:57 -08:00
071e68d99d fixing output size w / h order 2016-11-16 15:32:18 -08:00
78c1094d93 Don't override __call__ in modules 2016-11-16 15:32:18 -08:00
56fc639c9f Fix no bias mode of autogenerated THNN function 2016-11-16 15:32:18 -08:00
51084a9054 Merge pull request #603 from killeent/remainder
Implement fmod, remainder, equal in Cutorch
2016-11-16 15:20:57 -08:00
f8ae5c93e9 enables random functions for float and half types on cuda (#223) 2016-11-16 15:14:26 -08:00
ad286c0692 add support for equal in cutorch 2016-11-16 14:41:59 -08:00
a483b3903d Merge pull request #377 from gchanan/checkContiguous
Add contiguous checks / auto contiguous
2016-11-16 10:35:11 -08:00
6564d39777 Call newContiguous for tensors that are required to be contiguous.
Also add tests to verify that non-contiguous tensors are handled correctly.
2016-11-16 09:50:11 -08:00
8f1b7230fe add support for fmod in cutorch 2016-11-16 08:35:17 -08:00
c0b7608965 add support for remainder in cutorch 2016-11-16 08:12:44 -08:00
56dd4132c4 add MACOSX_DEPLOYMENT_TARGET to instructions 2016-11-16 10:45:56 -05:00
91494cb496 Call newContiguous rather than arg checking isContiguous. 2016-11-15 16:16:08 -08:00
9057eade95 Handle contiguousness and improve shape checks
in SpatialAdaptiveMaxPooling, SpatialUpSamplingNearest, and TemporalConvolution.
2016-11-15 14:17:45 -08:00
a28317b263 SpatialSubSampling contiguous check. 2016-11-15 14:16:48 -08:00
25c3603266 VolumetricConvolution check contiguous. 2016-11-15 14:15:55 -08:00
ae6f2dd11c Adapt nn code to changes in THNN and THCUNN 2016-11-15 23:02:14 +01:00
3aaa1771d5 [cutorch mag2gen] more cleanup 2016-11-15 13:31:57 -08:00
2034396a3c [cutorch mag2gen] some cleanup 2016-11-15 13:31:57 -08:00
0cad668065 [cutorch mag2gen] move qr to generic 2016-11-15 13:31:57 -08:00
f644a11b82 [cutorch mag2gen] move potr* to generic 2016-11-15 13:31:32 -08:00
d7e3b2ef29 [cutorch mag2gen] move inverse to generic 2016-11-15 13:31:32 -08:00
fc5ec87478 [cutorch mag2gen] move svd to generic 2016-11-15 13:31:32 -08:00
ed4023127b [cutorch mag2gen] move eig to generic 2016-11-15 13:31:32 -08:00
2bd4e5f5f6 [cutorch mag2gen] move symeig to generic 2016-11-15 13:31:32 -08:00
d2dcbc26f8 [cutorch mag2gen] move gels to generic 2016-11-15 13:31:32 -08:00
2f05eefe9a [cutorch mag2gen] code refactor to support generics; move gesv to generic 2016-11-15 13:31:32 -08:00
7d1afa78b9 [cutorch mag2gen] generic MAGMA memory allocator function 2016-11-15 13:30:49 -08:00
dac9b020e0 [cutorch potr*] API parity for potr* functions in cutorch 2016-11-15 13:28:37 -08:00
eb77b79df9 Merge pull request #839 from Atcold/fix_ASIMD
Fix compilation for ASIMD, fix #766
2016-11-15 12:57:57 -08:00
456998f043 Merge commit 'aeed8a6ea4650d1092289a60e71d8d83875a0ba6' 2016-11-15 12:55:11 -08:00
c09f07edd9 Merge commit 'c82537462baa715b2c70726f7da8f734b2ad3a3f' 2016-11-15 12:53:29 -08:00
66320c498c Add contiguous checking / make tensors contiguous for
SpatialUpSamplingBilinear, PReLU, SpatialSubSampling, TemporalConvolution.
2016-11-15 12:50:08 -08:00
8cb8a0a146 Move VolumetricConvolution contiguous code from lua to C. 2016-11-15 12:23:09 -08:00
aeed8a6ea4 Remove duplicate entries and add optional marks in THCUNN.h 2016-11-15 21:22:14 +01:00
c82537462b [cutorch] remove syncing point from baddbmm
This change removes HtoD copies inside baddbmm. These copies
introduce a syncing point which causes slow downs in a multi
gpu training.

Test plan: Run unittests for baddbmm.
2016-11-15 11:55:36 -08:00
a8a02ff560 Fix compilation for ASIMD
On ARMv8, NEON is inherent and is instead listed as 'asimd' in /proc/cpuinfo
Replace assembly with C

Original authors:
 - @dusty-nv
    FindARM-patch.txt
    CMakeLists-patch.txt
 - @rtarquini
    NEON.c
2016-11-15 14:38:32 -05:00
72a9df19c8 Merge pull request #598 from killeent/rr2
move random functions to generic (attempt 2)
2016-11-14 11:44:41 -05:00
5b9b9634f9 [cutorch rand2gen] various fixes 2016-11-14 08:13:30 -08:00
c279a91c03 Merge commit '64c8a1377335799b322ca41d323dee13118be0ab' 2016-11-13 21:54:27 -08:00
ef6a764509 Merge commit '1cee5a359c2828800db0c41ebe0108bd5eef9501' 2016-11-13 15:23:11 -08:00
4db5afdf7e Merge commit 'f2daa616d105d700b63f05c4d544befb6e65a036' 2016-11-13 15:20:03 -08:00
7867187451 Merge commit '4f8e6ec42abd5b9b5491a49bdfe1a637e6675207' 2016-11-13 15:19:10 -08:00
4f8e6ec42a [PATCH] Improve potrf error message. (#189) 2016-11-13 15:17:05 -08:00
64c8a13773 Remove comment. 2016-11-11 15:46:44 -08:00
395ab4a287 Fix SpatialDilatedMaxPooling shape check.
In nn, indices are 3d, but they are 4d in cunn.
2016-11-11 15:43:54 -08:00
15dc862056 more improvements on error messages and shape checks. 2016-11-11 15:43:49 -08:00
f2daa616d1 Revert "Move random functions to generic" 2016-11-11 18:15:01 -05:00
64a50f5ad3 Merge pull request #589 from killeent/random-refactor
Move random functions to generic
2016-11-11 17:56:39 -05:00
1d0f86144c [cutorch rand2gen] fix illegal memory access in multinomial code, update unit tests 2016-11-11 13:23:03 -08:00
89e93bba9d [cutorch rand2gen] test fixes, add floor to geometric distribution transform 2016-11-11 13:23:02 -08:00
3290d4c7d6 [cutorch rand2gen] extend functions to use _double methods 2016-11-11 13:23:02 -08:00
ca22befc93 [cutorch rand2gen] move randn to generic 2016-11-11 13:23:02 -08:00
b08df5b9c0 [cutorch rand2gen] partial move of logNormal to generic, needs further debugging 2016-11-11 13:23:01 -08:00
ebd3c3291c [cutorch rand2gen] move geometric to generic 2016-11-11 13:23:01 -08:00
16728d2f26 [cutorch rand2gen] move multinomial to generic 2016-11-11 13:23:00 -08:00
34dab66f44 [cutorch rand2gen] move cauchy to generic 2016-11-11 13:22:59 -08:00
3a111c7499 [cutorch rand2gen] move exponential to generic 2016-11-11 13:22:59 -08:00
3600c94ec5 [cutorch rand2gen] move normal to generic 2016-11-11 13:22:58 -08:00
e2f8b00e00 [cutorch rand2gen] move bernoulli to generic 2016-11-11 13:22:58 -08:00
65ed1eba48 [cutorch rand2gen] move uniform, rand to generic 2016-11-11 13:22:57 -08:00
7fff7977fe [cutorch rand2gen] make sampleMultinomialWithRoutReplacement utility function generic 2016-11-11 13:22:57 -08:00
add5922aac [cutorch rand2gen] make sampleMultinomialWithReplacement utility function generic 2016-11-11 13:22:56 -08:00
a94b54a533 [cutorch rand2gen] make sampleMultinomialOnce utility function generic 2016-11-11 13:22:56 -08:00
bea82b9da6 [cutorch rand2gen] make renormRowsL1 utility function generic 2016-11-11 13:22:56 -08:00
2e7debe282 [cutorch rand2gen] introduce THCTensorRandom.cuh, move and templatize simple binary search function 2016-11-11 13:22:55 -08:00
1cee5a359c Fix checking and spacing of dilation parameters in SpatialDilatedConvolution
and SpatialDilatedMaxPooling.
2016-11-11 10:25:44 -08:00
b08862405e Remove extraneous shape check from SpatialDilatedConvolution. (#1029) 2016-11-11 12:53:48 -05:00
d57e1a6756 change to compile with msvc && export THCDescBuff for cunn 2016-11-11 13:56:13 +08:00
c9172c5bc9 change to work on windows && ptrdiff_t replacement 2016-11-11 13:33:36 +08:00
5d5e877a05 Fix implementation of logNormal 2016-11-10 18:35:45 -08:00
1e794c87ae adding bidirectional doc 2016-11-10 17:38:47 -08:00
d9cb1b545a Fix build on 32bit platform like JETSON TK1 2016-11-11 00:22:06 +00:00
23f611f14d Rename assertSameGPU_generic to assertSameGPU.
Also remove old assertSameGPU since there is no
longer both generic and non-generic support.
2016-11-10 15:40:41 -08:00
42b28d0d69 Merge pull request #370 from gchanan/sizeCheckErrorMessages
Improving error messages in nn.
2016-11-10 18:35:22 -05:00
d0cf5f7b65 Improving error messages in nn.
Differences from nn equivalent:
1) No changes to VolumetricConvolutionMM, which doesn't exist in cunn.
2) No changes to HardShrink, which doesn't  exist in cunn.
3) LookupTable doesn't verify that all inputs are within range.
2016-11-10 15:12:35 -08:00
4699c817e8 [cutorch rand2gen] fix illegal memory access in multinomial code, update unit tests 2016-11-10 15:10:12 -08:00
4f490c16e9 [cutorch rand2gen] test fixes, add floor to geometric distribution transform 2016-11-10 13:44:55 -08:00
bcdab7a632 Remove mul/div from THCHalfAutoNumerics as they've been moved to
THCNumerics.
2016-11-10 12:13:41 -08:00
7f51af7cbc adding dropout, bidirection, etc. to RNN (#214) 2016-11-10 13:25:14 -05:00
b4ae60cac8 Protect half operations with CUDA_HALF_TENSOR with generic modules. 2016-11-10 08:59:23 -08:00
4d03d96e8b fix: cunn can't find cutorch sources
https://github.com/torch/distro/issues/138#issuecomment-259133935
2016-11-10 14:44:46 +03:00
a39ffebc3a Add THCTensor_(sizeDesc) for better debug messages. 2016-11-09 12:09:18 -08:00
4bba6082ed [cutorch rand2gen] extend functions to use _double methods 2016-11-09 11:55:51 -08:00
b111632965 [cutorch rand2gen] move randn to generic 2016-11-09 11:09:30 -08:00
0a34b34bfe [cutorch rand2gen] partial move of logNormal to generic, needs further debugging 2016-11-09 10:55:54 -08:00
6b821ece22 fixing trainer tests (#213) 2016-11-08 21:50:17 -05:00
d3b2096bfd trainer fix for new optim API 2016-11-08 15:49:03 -08:00
9f1b12bf06 Merge pull request #1009 from gchanan/spatialNNGeneric
Support generic type Spatial modules
2016-11-08 18:17:58 -05:00
e64fca4b04 Allow wider test tolerances for:
1) Size of half numbers
2) Convolution weight/bias
3) BatchNormalization
2016-11-08 13:47:01 -08:00
b941e73f4f ArgCheck that dilation parameters are > 0 and ensure tests
pick dilation parameters > 0.
2016-11-08 13:46:52 -08:00
c57873d3cb Add generic support for LookupTable.
In some cases, does not do accumulation as accreal.
2016-11-08 13:46:48 -08:00
f3bc3275ac Add generic support for TemporalConvolution.
Has increased tolerance for backward weight/bias like other
Convolution modules.
2016-11-08 13:46:45 -08:00
8df26e6c5c Add generic support for VolumetricFullConvolution, VolumetricDilatedConvolution.
Has increased tolerance for backward weight/bias like other
Convolution modules.
2016-11-08 13:46:33 -08:00
5c8ecb8150 Fix one more compatibility bug in Python 3.3 2016-11-08 16:13:25 -05:00
3928f7740a Implement functional interface for Variables (torch.*) 2016-11-08 16:13:25 -05:00
1767f73e6b Add generic support for VolumetricConvolution.
Uses the higher tolerances for weight/bias that are used for
SpatialConvolution modules.
2016-11-08 13:07:35 -08:00
9e7d5e93ab Add generic support for VolumetricReplicationPadding. 2016-11-08 13:07:35 -08:00
70c6ee93a2 Add generic support for VolumetricAveragePooling. 2016-11-08 13:07:35 -08:00
5cbf8504ef Add generic support for VolumetricMaxPooling, VolumetricMaxUnpooling,
VolumetricDilatedMaxPooling.
2016-11-08 13:07:35 -08:00
9a393b023d Add generic support for TemporalMaxPooling. 2016-11-08 13:07:35 -08:00
30bf464f73 Rebase BatchNormalization. 2016-11-08 13:06:52 -08:00
9fb1f8934b Add support for L1Cost.
Changes thrust::reduce to thrust::transform_reduce in order
to be able to do summation at accreal precision.
2016-11-08 13:01:06 -08:00
f3f02b23a0 Add generic support for SparseLinear.
We don't support SparseLinear with fp16 because of the lack of cusparseHcsrmm
(or equivalent Ex function) until CUDA 8.0.
2016-11-08 13:01:06 -08:00
7668cdd32c Add generic support for DistKLDivCriterion. 2016-11-08 13:01:06 -08:00
f9dafdcf09 Add generic support for ClassNLLCriterion. 2016-11-08 13:01:06 -08:00
d284a419c1 Add generic support for BCECriterion.
Test skips comparing vs lua version for half type, because hdot is
not currently implemented in cutorch.
2016-11-08 13:01:06 -08:00
b45844e3d9 Add generic support for L1SmoothCriterion. 2016-11-08 13:01:06 -08:00
6caa7e0fff Add generic support for MultiLabelMarginCriterion. 2016-11-08 13:01:06 -08:00
1669fffb8d Add generic support for MultiMarginCriterion.
Accumulation is done at accreal precision and changes target tensor
indexing to THCIndexTensor.
2016-11-08 13:01:06 -08:00
18aa86eebd Add generic support for MSECriterion. 2016-11-08 13:01:06 -08:00
075e49d3f4 Add generic support for SoftMarginCriterion. 2016-11-08 13:01:06 -08:00
a6695b8365 Add generic support for MarginCriterion. 2016-11-08 13:01:06 -08:00
06ee48b391 Add generic support for AbsCriterion. 2016-11-08 13:01:06 -08:00
fcaeffbbd4 Fix spacing in SpatialDilatedMaxPooling. 2016-11-08 13:01:06 -08:00
6146a9a641 Generic support for SpatialFullConvolution and SpatialDilatedConvolution.
Uses matrix multiple for matrix vector multiply for half (no matrix vector
implementation exists).
2016-11-08 13:01:06 -08:00
83de8e40d5 Add generic support for SpatialFractionalMaxPooling. 2016-11-08 13:01:06 -08:00
30590c46a3 Generic support for SpatialConvolutionMM.
Still need Hgemv.
2016-11-08 13:01:06 -08:00
a3a5e56287 Add generic support for SpatialConvolutionLocal. 2016-11-08 13:01:06 -08:00
185c96d63a Add generic support for SpatialUpSamplingBilinear.
Math is done at accreal precision.  At real precision, the
forward pass fails but the backward pass succeeds; we do the
backward pass at accreal precision for consistency.
2016-11-08 13:01:06 -08:00
be61ad6eb4 Add generic support for SpatialUpSamplingNearest.
Accumulates as AccType.
2016-11-08 13:01:06 -08:00
222dfd2259 Add generic support for SpatialReplicationPadding. 2016-11-08 13:01:06 -08:00
b06e1c7e1d Add generic support for SpatialReflectionPooling. 2016-11-08 13:01:06 -08:00
6876abba51 Add generic support for SpatialSubSampling.
Half types fail on backward, probably because we don't consistently
accumulate in accreal.  This is difficult because gradInput is
accumulated directly (either with atomicAdd or not) rather than
in another variable.
2016-11-08 13:01:06 -08:00
0798466a01 Generic support for SpatialCrossMapLRN
Removed the C-linkage for a couple of functions because they are now generic --
not sure if they were used by anyone outside.
2016-11-08 13:01:06 -08:00
2cda782273 Add generic support for SpatialAveragePooling. 2016-11-08 13:01:06 -08:00
7d1c9554b6 Add generic support for SpatialAdaptiveMaxPooling. 2016-11-08 13:01:06 -08:00
a29d16f1a8 Use THCIndexTensors more generally. 2016-11-08 13:01:06 -08:00
6d0c1c0f17 Use indices for SpatialAdaptiveMaxPooling indices. 2016-11-08 13:01:06 -08:00
5ed4b5c25b Add generic support for SpatialMaxUnpooling. 2016-11-08 13:01:05 -08:00
6fe89c5e44 Fix tests 2016-11-08 13:01:05 -08:00
fda8c37641 Add generic support for SpatialMaxPooling.
Also fix tests for SpatialDilatedMaxPooling.
2016-11-08 13:01:05 -08:00
6d5a0ff3a1 Get SpatialDilatedMaxPooling generic working with long tensors as index.
Does as much math as possible in accreal to try to suss out why CudaHalfTensor fails.
2016-11-08 13:01:05 -08:00
f8718dd355 Add generic support for SpatialDilatedMaxPooling. 2016-11-08 13:01:05 -08:00
85af686797 Add generic support for SpatialClassNLLCriterion. 2016-11-08 13:01:05 -08:00
0f6ec3f15f Remove fastExpIfAvail and benchmarking from functional tests.
Also fix broken IFNDEF and test whitespace.
2016-11-08 13:01:05 -08:00
44644c50ee Reorganize THCHalfAutoNumerics. 2016-11-08 13:01:05 -08:00
9749f7eacc Add generic support for RReLU. 2016-11-08 13:01:05 -08:00
d9a2bdb9df Add generic support for PReLU.
This is the first instance of functions that take a lua number but
are not reals in C.  So, instead of automatically converting lua
numbers in the half case, we parse the function definitions to
find the argument positions to convert.
2016-11-08 13:01:05 -08:00
57e678c94b fix logsoftmax 2016-11-08 13:01:05 -08:00
516f127cfd Add generic support for LogSoftMax. 2016-11-08 13:01:05 -08:00
e477add103 Add generic support for SoftMax.
Math is done at accreal precision (e.g. for half,
math is done at float precision).  Originally code
called __expf, which doesn't have a double equivalent;
we call exp instead of converting down.
2016-11-08 13:01:05 -08:00
ba3d577875 Add generic support for ELU. 2016-11-08 13:01:05 -08:00
917e4f47c4 Add generic support for SoftShrink. 2016-11-08 13:01:05 -08:00
0143dac247 Add generic support for Square.
Math is (arbitrarily?) done at double precision to
keep the intent of existing code.
2016-11-08 13:01:05 -08:00
d2390f3616 Add generic support for Sqrt. 2016-11-08 13:01:05 -08:00
949ea73402 Add generic support for LeakyReLU. 2016-11-08 13:01:05 -08:00
d1e2fe0efe Add generic support for Threshold. 2016-11-08 13:01:05 -08:00
584ada12bf Add generic support for LogSigmoid.
This has the same logic as Sigmoid; i.e.
math is done at double precision and then
stored back at desired precision.
2016-11-08 13:01:05 -08:00
3ead72f654 Add generic support for Sigmoid.
This maintains the existing logic of doing the math in
double precision and converting back to the intended
type (previously: just float).  We do the same for
half here, although perhaps we should do the math
at float in that case.

It is unclear what the right conversion policy is: Sigmoid
previously did its math in double before converting back to
float, and we keep that behaviour even though it may not have
been intentional.  For half in particular, should we promote
just to float or all the way to double?
2016-11-08 13:01:05 -08:00
9ce96d3bd3 Add generic support for Abs. 2016-11-08 13:01:05 -08:00
5549c003d9 Add generic support for HardTanh. 2016-11-08 13:01:05 -08:00
46105bf90b Add generic support for Tanh. 2016-11-08 13:01:05 -08:00
73ce3b3702 Add generic support for SoftPlus.
Adds the ability to "genericize" cunn modules that can exist
simultaneously with non-generic modules (i.e. modules can
be genericized one at a time).  Allowing both generic and
non-generic modules simultaneously requires some extra code
that can be removed once every module is genericized.
Also genericizes SoftPlus in this way.
2016-11-08 13:01:05 -08:00
1c6225dc2f [cutorch rand2gen] move geometric to generic 2016-11-08 10:47:28 -08:00
44874542c8 fix printing in console (#208) 2016-11-08 13:42:26 -05:00
31f2846aff [cutorch rand2gen] move multinomial to generic 2016-11-08 09:34:19 -08:00
bc08011e72 Don't longjmp out of omp loops in unpooling modules 2016-11-08 18:12:56 +01:00
7cccc216d0 ArgCheck that dilation parameters are > 0. 2016-11-08 18:12:56 +01:00
09493603f6 Change optimizer API 2016-11-08 18:12:56 +01:00
e799bd0ba9 Restrict in-place autograd ops to disjoint variables 2016-11-08 18:12:56 +01:00
40247b0382 Fix torch tests in Python 3.3 and 3.4 2016-11-08 18:12:56 +01:00
cd2e9c5119 [cutorch rand2gen] move cauchy to generic 2016-11-08 08:11:39 -08:00
0b6f7b12b1 [cutorch rand2gen] move exponential to generic 2016-11-08 08:04:26 -08:00
86e42ba291 Adding truncated tensor printing (#202)
* Adding truncated tensor printing
2016-11-08 10:05:30 -05:00
e0a18cafd3 Don't longjmp out of omp loops in unpooling modules 2016-11-08 13:23:43 +01:00
8c2f77cab6 updated autogen docs 2016-11-07 17:19:00 -05:00
c1bd6ba1e1 Zero-initialize outputs for BLAS functions 2016-11-07 22:50:56 +01:00
df59b89fbb Add more optimizers 2016-11-07 22:50:56 +01:00
8fd9cc160c [cutorch rand2gen] move normal to generic 2016-11-07 13:26:59 -08:00
28e3f07b63 adding apply function 2016-11-07 16:17:49 -05:00
513d902df1 adding __repr__ for nn 2016-11-07 16:17:40 -05:00
fce14a9f51 [cutorch rand2gen] move bernoulli to generic 2016-11-07 13:16:10 -08:00
884107da01 [cutorch rand2gen] move uniform, rand to generic 2016-11-07 12:27:30 -08:00
caa79a354a [cutorch rand2gen] make sampleMultinomialWithRoutReplacement utility function generic 2016-11-07 10:33:03 -08:00
5bb873a2fe [cutorch rand2gen] make sampleMultinomialWithReplacement utility function generic 2016-11-07 10:28:19 -08:00
bc0442d7df [cutorch rand2gen] make sampleMultinomialOnce utility function generic 2016-11-07 10:15:13 -08:00
cfcd33552b [cutorch rand2gen] make renormRowsL1 utility function generic 2016-11-07 10:02:21 -08:00
5f6b9fd5ba [cutorch rand2gen] introduce THCTensorRandom.cuh, move and templatize simple binary search function 2016-11-07 08:31:19 -08:00
469dce4a2d skip test_scatter_gpu on no CUDA 2016-11-05 20:10:07 -04:00
55d32de331 Fix bugs in torch.legacy.nn and add regression tests 2016-11-05 22:48:52 +01:00
4491d2d3cb Expose ger, mv, mm, bmm as tensor methods 2016-11-05 22:48:52 +01:00
f9669b9b9a Merge pull request #583 from nicolasvasilache/master
THC UVA Allocator
2016-11-05 11:50:07 -04:00
246d5f37c7 THC UVA Allocator 2016-11-05 02:40:44 +00:00
293bfb03dd Merge commit '4def4e696b9079f587d0dba3e86423df5ea429b8' 2016-11-03 14:12:22 -07:00
4def4e696b fix result type 2016-11-03 14:10:49 -07:00
b6e58c030a enable dot for CUDA_HALF 2016-11-03 13:50:50 -07:00
bf00308ab2 Merge commit 'fd677945741b4ee353079911993ada3770e07f5c' 2016-11-03 13:31:12 -07:00
e3e786e35e Move source code checks from __getstate__ to torch.load (#200)
The __getstate__ and __setstate__ functions are called from copy.copy as
well as pickling. The source code inspection currently slows down the
data parallel code because it makes a copy of the object every
iteration.
2016-11-03 16:29:14 -04:00
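Small illustration (not from the commit) of why per-call work in __getstate__ is costly: copy.copy goes through the same __getstate__/__setstate__ hooks as pickling, so the source inspection ran on every copy.

    import copy

    class Tracked(object):
        def __getstate__(self):
            print("__getstate__ called")   # any work here runs on every copy
            return self.__dict__

        def __setstate__(self, state):
            self.__dict__.update(state)

    t = Tracked()
    copy.copy(t)   # prints "__getstate__ called" -- same hook pickling uses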
fd67794574 Merge pull request #581 from torch/dotfix
making dot have an accreal return type (consistent with CPU)
2016-11-03 12:51:27 -04:00
104b502919 ArgCheck that dilation parameters are > 0. 2016-11-03 09:02:22 -07:00
a18cd3ba92 ArgCheck that dilation parameters are > 0. 2016-11-03 09:01:43 -07:00
0676cad200 Merge commit 'e644f6ed2c1965b0de55cc9037d5c75245f63d54' 2016-11-03 08:36:42 -07:00
3b1d217310 Merge commit 'e32af0196e10ad11b3938ad73ec5ef49cac7c03e' 2016-11-03 08:36:04 -07:00
93bcb2e7ba making dot have an accreal return type (consistent with CPU) 2016-11-02 16:40:54 -07:00
ebc70f7919 Look for libcudart in default CUDA installation paths (#195) 2016-11-02 19:36:10 -04:00
e32af0196e Merge pull request #828 from apaszke/lapack
Add more size checks and improve some LAPACK error messages
2016-11-02 18:53:45 -04:00
3e5c121c56 Adding !!inc to cwrap and splitting up TensorMethods.cwrap (#197)
* Adding !!inc to cwrap and splitting up TensorMethods.cwrap
2016-11-02 18:50:56 -04:00
e644f6ed2c Add supporting code for CUDA IPC
This adds three small pieces to help with sharing THCStorages across
processes:

 1. THCIpcAllocator: a THCDeviceAllocator to close shared memory handles in the
    child process.
 2. THCCachingAllocator_getBaseAllocation which returns the pointer and
    size of the underlying cudaMalloc allocation. This is necessary
    because cudaIpcGetMemHandle requires 'base' pointers
 3. Support for TH_STORAGE_VIEW in THCStorage_(free). This is useful in
    child processes to represent THCCachingAllocator allocations split
    from a larger cudaMalloc call.
2016-11-02 14:53:28 -07:00
551a7c72f3 Fix multiprocess serialization with "spawn" or "forkserver" (#198) 2016-11-02 17:44:36 -04:00
05b121841e Add more size checks and improve some LAPACK error messages 2016-11-02 21:51:51 +01:00
c29aea89ee Merge pull request #827 from howard0su/freebsd
Fix compile error on freebsd
2016-11-02 16:10:50 -04:00
103e70ccc5 adding cuda types for tensor methods (#194) 2016-11-02 10:25:58 -04:00
ec7ecbe2dd Fix compile error on freebsd 2016-11-02 20:27:05 +08:00
7a06dbb87e Merge commit '1234e434fa2b6ddd440194c8bccd352593902c69' 2016-11-01 21:33:41 -07:00
1234e434fa TH_INDEX_BASE for nonzero 2016-11-01 21:08:52 -07:00
2d374f982e Changes for ccache nvcc support 2016-11-01 15:54:33 -04:00
4e73630a95 Fix criterion backward that was modifying grad_output shape 2016-11-01 19:31:53 +01:00
e867baa5f9 Accept file paths in torch.save and torch.load 2016-11-01 19:31:53 +01:00
04b750cb52 Improve Parameter's __repr__ 2016-11-01 19:31:53 +01:00
97c7b12542 Fix Variable __setstate__ refcounting bugs 2016-11-01 19:31:53 +01:00
0dfec752a3 Merge commit 'f16f68e103dfc22921f6106ec7136ddc7a0ab087' 2016-11-01 10:38:13 -07:00
f16f68e103 CMake: Install generic/THCTensorMathScan.h 2016-11-01 16:07:07 +01:00
4b7f8f9b77 adding notes for compiling from source 2016-11-01 01:27:28 -04:00
9969d50833 fix for CPU-only builds 2016-11-01 01:19:37 -04:00
7355c63845 adding multiple types for dist 2016-10-31 21:26:19 -07:00
16cac6442a adding multiple types for cumsum, cumprod 2016-10-31 21:26:19 -07:00
5009ae5548 adding multiple types for pow, trace, diag, tril, triu 2016-10-31 19:26:08 -07:00
32647e285e implement torch.nonzero 2016-10-31 18:22:49 -07:00
6df334ea68 Improve potrf error message. (#189) 2016-10-31 18:48:29 -04:00
f8501042c1 Make _requires_grad Variable attribute writeable 2016-10-31 22:47:09 +01:00
be085b8f6c Allow marking non-leaf variables as non-requiring grad 2016-10-31 22:47:09 +01:00
ef557761dd Allow to not use all function outputs in autograd 2016-10-31 22:47:09 +01:00
15377ac391 Copy Module._buffers in nn.parallel.replicate (#180) 2016-10-31 12:12:29 -04:00
ad5fdef6ac Make every user-visible Tensor have a Storage (#179) 2016-10-31 12:12:22 -04:00
0cb5943be8 Fix NCCL reduce_scatter in Python 2.7 (#183) 2016-10-30 17:58:02 -04:00
fb593d5f28 Fix bugs in variable __setitem__ and improve __getitem__ 2016-10-30 00:16:06 +02:00
645c913e4f Print GPU id for CUDA tensors 2016-10-30 00:16:06 +02:00
b4f4cca875 Rename training and evaluation methods 2016-10-30 00:16:06 +02:00
6027513574 Add support for indexing with numpy types 2016-10-30 00:16:06 +02:00
849188fdab Fix multiprocessing 2016-10-29 14:23:23 -07:00
a9c14a5306 Remove unused code 2016-10-28 15:28:22 -07:00
2da36a14d1 Clean up cuDNN code and fix chooseBackwardFilterAlgorithm 2016-10-28 13:05:53 -07:00
2ee451f5f7 Build in Release mode 2016-10-28 12:51:19 -07:00
f2d7e94948 Use torch.Size for Tensor sizes and tuple for strides
See issue #20

The torch.Size class is a tuple subclass which distinguishes sizes from
other tuples so that torch.Tensor(size) is interpreted as size instead
of data.
2016-10-28 19:37:09 +02:00
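For reference, an illustrative sketch (not part of the commit) of how torch.Size behaves as a tuple subclass and disambiguates shapes from data:

    import torch

    t = torch.randn(2, 3)
    size = t.size()
    print(size)                     # torch.Size([2, 3])
    print(isinstance(size, tuple))  # True -- torch.Size subclasses tuple
    u = torch.Tensor(size)          # interpreted as a shape, not as data
    print(u.size())                 # torch.Size([2, 3])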
2031dfc08a Add hdot support for CUDA 8.
If not compiled with CUDA 8+, an error is raised indicating that
CUDA 8.0+ is required.
2016-10-27 15:01:09 -07:00
34ede14877 Fix compile error due to THCStorage change 2016-10-27 14:27:10 -07:00
2af3098e5a Merge commit '42e835ebb81a3ecf8f76e15bb1866c1427f61d74' 2016-10-27 13:49:23 -07:00
2e44511b13 Merge commit 'bbe8627a3f0e6cbb8fd1952826f75df741e44b01' 2016-10-27 13:47:36 -07:00
7bc4aa7e72 Merge commit '2bd36604e298547cc66f175588c925271223b4e9' 2016-10-27 13:46:38 -07:00
e2458bce97 Add Parameter class to nn 2016-10-27 22:31:36 +02:00
ae9789fccc adding input / output / member sections to the docgen 2016-10-27 01:11:53 -04:00
45ef25ea27 fix rnn documentation typos and format 2016-10-27 01:11:53 -04:00
ad2d413c0b Add C++ bindings for cuDNN (#167)
The Python ctypes bindings overhead was high enough that it slowed down
multi-gpu training when using 4+ Maxwell GPUs.
2016-10-26 19:51:48 -04:00
30924ff1e0 Fix test_nonzero flakiness (#173) 2016-10-26 19:50:56 -04:00
383c48968f Add support for indexing with ellipsis (#172) 2016-10-26 19:50:44 -04:00
bbe8627a3f Use 'void' for no-arg functions 2016-10-26 12:44:34 -07:00
2bd36604e2 Fix no-arg function prototypes 2016-10-26 12:35:05 -07:00
9ed47ef531 fix bug in mmaping 2016-10-26 07:23:04 -07:00
139f98a872 pushing THCState back to the header 2016-10-25 18:23:53 -07:00
c825895190 Make KwargsPlugin output deterministic 2016-10-26 00:19:33 +02:00
42e835ebb8 Add sameGPU checks to BatchNormalization (#361) 2016-10-25 15:19:03 -04:00
a7d5fdf54e Add integer indexing for MultiLabelMarginCriterion. 2016-10-25 11:42:56 -07:00
3b4e41f6ec Add integer indexing for MultiMarginCriterion. 2016-10-25 10:19:53 -07:00
5505e1de7d Store the device in THCStorage 2016-10-25 07:21:54 -07:00
6d329e418b allocator updates 2016-10-25 07:07:52 -07:00
3a11afb57f some bugfixes for THC 2016-10-24 17:16:17 -07:00
df86e02c9e update nn docs 2016-10-24 17:20:00 -04:00
deebc1383e Show exponent when printing vectors 2016-10-24 22:30:11 +02:00
19f2f1a9d3 Buffer values when constructing a CUDA tensor from a sequence 2016-10-24 22:30:11 +02:00
4dc13ecdd8 Make tests deterministic 2016-10-24 22:30:11 +02:00
b4b6e356ef Fix clang warnings 2016-10-24 22:30:11 +02:00
9000f40e61 Add torch.from_numpy 2016-10-24 22:30:11 +02:00
f137c0c05a Improve error messages of stateless functions 2016-10-24 22:29:43 +02:00
b43a02a9aa Make random 0-based 2016-10-24 22:29:43 +02:00
30be715900 Add training and evaluation to torch.nn 2016-10-24 22:29:43 +02:00
71cf8e14cb Fixes in torch.legacy.nn 2016-10-24 22:29:43 +02:00
ffd4863b23 Don't build nccl on macOS 2016-10-24 22:29:43 +02:00
4c17098bb8 Fix platform detection in torch.cuda 2016-10-24 22:29:43 +02:00
bcfdd18599 Fix python2.7 compatibility and check cffi version in ffi utils 2016-10-24 22:29:43 +02:00
067662d280 making .numpy return writeable arrays (#164) 2016-10-24 16:23:28 -04:00
93d02e4686 Merge pull request #129 from adamlerer/cudnn_rnn
CuDNN + PyTorch RNN library
2016-10-24 15:00:02 -04:00
12de115305 Fix Lua->Python logic in legacy.optim 2016-10-24 20:04:23 +02:00
b5d13296c6 addressing comments 2016-10-23 21:11:22 -07:00
86288265ad Adding rnn cell library 2016-10-23 20:23:48 -07:00
a559d94a44 docs and such 2016-10-23 20:23:48 -07:00
1eb6870853 add nobias option to rnn 2016-10-23 20:23:48 -07:00
f88c3e9c12 fix some missing features in pytorch needed for RNNs 2016-10-23 20:23:48 -07:00
942ca477a6 Copying weights for CUDNN 2016-10-23 20:23:48 -07:00
b0e33fb473 cudnn + THNN match with parameters 2016-10-23 20:23:48 -07:00
d58b627b98 CUDNN RNN bindings 2016-10-23 20:23:48 -07:00
b85fc35f9a Fix for versions compiled without CUDA support (#155)
* Fix pytorch when compiling without CUDA support
* Skip print test with CUDA types if CUDA is not available
2016-10-23 13:03:10 +02:00
bcb466fb76 fix bug with numpy conversion and storageOffset > 0 (#154) 2016-10-22 11:56:18 -04:00
6db721b5dd Make DataLoader preserve the ordering of the dataset (#135) 2016-10-21 23:54:16 -04:00
140c65e52b fixing python setup.py clean 2016-10-21 23:20:02 -04:00
29e8d77ce0 Merge pull request #558 from gchanan/genericDeviceTensorUtils
Add generic type support for toDeviceTensor.
2016-10-19 18:19:13 -04:00
b66a4ea919 Add THNN_CHECK_DIM_SIZE_INDICES to avoid pointer conversion warnings. 2016-10-19 15:01:49 -07:00
d3d59e5024 Indices for nn. 2016-10-19 14:53:19 -07:00
5285da0418 Use index types for SpatialAdaptiveMaxPooling indices. 2016-10-19 14:53:10 -07:00
a76e69d709 Use index types for Max Pooling / Unpooling indices. 2016-10-19 14:52:58 -07:00
4d0d775d16 Add generic type support for toDeviceTensor. 2016-10-19 14:36:03 -07:00
98f67e90d5 Fix super call in Container.modules and Container.parameters (#142) 2016-10-19 13:21:03 -04:00
fee67c2e1a Allow parameters and child modules to be assigned by attribute (#136)
For example:
  self.linear = nn.Linear(10, 20)
  self.weight = torch.autograd.Variable(torch.Tensor(10, 20))
2016-10-18 23:34:20 +02:00
c295f26a00 Support async argument to Variable.cuda (#137) 2016-10-18 23:27:11 +02:00
8a09c45f28 Fix typo 2016-10-18 09:29:19 -07:00
79ead42ade Add CUDA Stream and Event API (#133) 2016-10-18 12:15:57 -04:00
94e52e1d17 Fix Variable.cat 2016-10-17 15:36:08 -07:00
3931beee81 Use THSetNumThreads instead of omp_set_num_threads
Set OMP num threads to one in the data loader.

Fixes #81
Fixes #82
2016-10-17 15:15:00 -04:00
d293c17d21 Merge commit '1a3920e5dc546803ec8ada369ff1b0d56cf24e76' 2016-10-17 10:29:41 -07:00
1a3920e5dc Expose OpenMP num threads through TH lib
Expose omp_set_num_threads and similar APIs through the TH lib. This
means third-party libraries using TH don't need to be compiled with
OpenMP support just to control the number of TH OMP threads.
2016-10-17 10:09:10 -07:00
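Minimal usage sketch (not from the commit) of the thread-count control as exposed to Python:

    import torch

    torch.set_num_threads(1)        # limits TH's OpenMP thread pool
    print(torch.get_num_threads())  # 1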
ffc3eb1a24 Exclude THNN Linear in favor of Python implementation 2016-10-17 09:53:20 -07:00
2f5d4a7318 gcc 5 + cuda < 8 workaround improved 2016-10-17 12:46:21 -04:00
70553f4253 gcc 5 + cuda < 8 workaround improved 2016-10-17 12:45:45 -04:00
8d39fb4094 Use new THC API for device allocator 2016-10-17 09:35:41 -07:00
7d10b2370f Merge commit 'ec7a2878013ec70a4d4a8bfb6f5e5503f87f9ea0' 2016-10-17 09:35:04 -07:00
31ec7650ac Merge commit '429f2d67652f4fcba0bbf65c7d3e109e136a9cdf' 2016-10-17 09:33:06 -07:00
c014920dc1 Merge commit 'b01c78580594c53e6afb02b3d2110577a4673308' 2016-10-17 09:32:01 -07:00
17e3d4e1ee Merge commit '38cb3d02270b9e558a891a9a2bef01a75d1bd9e1' 2016-10-17 09:31:38 -07:00
b01c785805 Fix cutorch.getStream()
state->numUserStreams does not include the NULL stream, which is stored
in res->streams[i]
2016-10-17 08:49:23 -07:00
0eea71f878 torch.cat for multiple cuda types 2016-10-17 01:56:33 -04:00
ec7a287801 Merge pull request #1006 from torch/errorsimprovements
more improvements on error messages and shape checks
2016-10-17 00:46:21 -04:00
4bc585a2fe more improvements on error messages and shape checks 2016-10-17 00:37:50 -04:00
429f2d6765 fixes to upsampling bilinear API 2016-10-17 00:30:25 -04:00
a0c7e3cf04 Merge pull request #550 from colesbury/streams
Add stream API that is not based on indices
2016-10-16 19:08:03 -04:00
9cd68129da fixing typo 2016-10-16 19:07:09 -04:00
aa6f6117b7 Ported Linear module to THNN 2016-10-16 17:49:47 +02:00
6fa9c87aa4 Merge pull request #548 from BTNC/win-msvc
make cunn compile with msvc && fix compilation failure for linux/mac os
2016-10-15 22:07:52 -04:00
ee14cf9438 Add support for pinned memory: (#127)
torch.Storage/Tensor.pin_memory()
 torch.Storage/Tensor.is_pinned()
2016-10-15 18:38:26 -04:00
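Illustrative sketch of the pinned-memory API named above, assuming a CUDA build:

    import torch

    x = torch.FloatTensor(1024).pin_memory()   # copy into page-locked host memory
    print(x.is_pinned())                       # True
    # Pinned tensors enable asynchronous host-to-device copies.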
0391bbb376 Fix view_as and view for empty tensors (#128) 2016-10-15 18:33:05 -04:00
28ada0c634 update md docs 2016-10-14 18:56:24 -04:00
2c233d23ad Add stream API that is not based on indices
This implements the THC code so that we can expose streams as objects
instead of simply referring to them by indices. This is not exposed in
Lua yet.
2016-10-14 15:25:38 -07:00
59c628803a fixing padding_idx option 2016-10-14 15:05:21 -07:00
6b830bc77f Merge pull request #78 from colesbury/nccl
Use NCCL in comm.py if available
2016-10-14 17:44:11 -04:00
f30081a313 Use NCCL bcast and reduce functions in comm 2016-10-14 14:16:32 -07:00
c15648c6b5 Add NCCL build scripts 2016-10-14 14:16:32 -07:00
a02917f502 Fix typo 2016-10-14 14:07:29 -07:00
70d8bd04c0 Make cuDNN descriptors extend object
Fixes weird double __del__ issue
2016-10-14 13:58:20 -07:00
ad2cee0cae Fix caching allocator when used from multiple Lua threads
Use a single, global THCCachingAllocator instance.

Previously, each Lua thread had its own THCCachingAllocator instance.
However, threads can share storages, which means a segment could be
allocated on one THCCachingAllocator and freed on another, which
breaks.

Fixes #539
2016-10-14 10:08:56 -07:00
756a7122ad torchdoc 2016-10-14 04:18:10 -04:00
3d6ebde756 qr and ormqr tests and bugfix 2016-10-14 03:10:16 -04:00
daa30aa992 fix typo 2016-10-13 23:11:32 -07:00
39459eb238 make cunn compile with msvc && fix compilation failure for linux/mac os 2016-10-14 12:54:00 +08:00
0325e2f646 Major autograd refactor
Improves autograd performance by more than 2x and fixes a couple
of bugs. All core functions have been moved to C.
2016-10-13 17:17:49 -07:00
93b8b5631f Improve CUDA tensor constructor speed 2016-10-13 17:16:39 -07:00
60ab1ce0c1 Stop using contextlib for device and device_of 2016-10-13 17:16:39 -07:00
2f186df52d removing CUDA_HALF_INSTRUCTIONS and enabling hgemm only for P100 2016-10-13 16:52:40 -07:00
452e07d432 Revert "change to work on windows && replace long with ptrdiff_t" 2016-10-13 18:09:34 -04:00
05d1404b9c Revert "changes to make cunn compile on windows with msvc" 2016-10-13 18:08:56 -04:00
534b9a1697 Bump to 1.3.1 2016-10-13 10:33:05 -07:00
b2781d0501 Fix primitives function prototype 2016-10-13 10:32:42 -07:00
bf7d1514f7 NVML (libwrap) : import the needed definitions 2016-10-13 10:28:59 -07:00
2acee24332 Add keyword argument support to most tensor functions 2016-10-13 12:32:04 -04:00
e7639e55f8 change to work on windows && replace long with ptrdiff_t 2016-10-13 23:44:28 +08:00
f978eca477 change to work on windows && replace long with ptrdiff_t 2016-10-13 22:55:58 +08:00
eb3ac2b367 changes to make cunn compile on windows with msvc 2016-10-13 22:22:23 +08:00
968d386b36 Make atomicAdd functions static inline. 2016-10-12 15:18:30 -07:00
38cb3d0227 Fix build when NEON is supported 2016-10-12 12:51:22 +00:00
6f606dd5f9 updating nn docs 2016-10-11 14:41:25 -04:00
bab616cf11 Fix OOM error message in tensor constructor 2016-10-10 20:51:15 -07:00
966adc6291 Simplify torch.cat 2016-10-10 20:51:15 -07:00
518cb6ec7c Allow specifying output size in MaxUnpooling 2016-10-10 20:51:15 -07:00
34bcd4c237 Rename FullConv to ConvTranspose and allow specifying output size 2016-10-10 20:51:15 -07:00
a121127082 Merge remote-tracking branch 'upstream/master' into more-generic-functions 2016-10-10 10:09:43 -07:00
50326e94b1 try cudnn 5.1.5 and 5.1.3 in that order to load them up. This is needed because cudnn for cuda 7.5 ships with 5.1.3 and cudnn for cuda 8.0 ships with 5.1.5 2016-10-09 22:26:43 -04:00
160723b5b4 fix cudnn lib name 2016-10-09 21:19:50 -04:00
7991125293 Improve error messages 2016-10-08 20:37:40 -07:00
96f61bff30 Add LAPACK functions 2016-10-08 20:37:37 -07:00
a94488f584 replace long with ptrdiff_t for memory size/offset, element count 2016-10-08 21:39:16 +08:00
f2cf673d3a fix tensor printing when the tensor is a view into a giant storage 2016-10-07 17:53:37 -04:00
c4595a3dd6 [cutorch refactor] addcmul/addcdiv to generic 2016-10-07 13:09:05 -07:00
5db118e64b Update LogSoftMax to work in spatial domain 2016-10-07 16:08:39 -04:00
8bb06c94be Improved allreduce segmentation for small sizes 2016-10-07 12:42:23 -07:00
1620c56808 [cutorch refactor] cmin/cmax to generic 2016-10-07 11:50:28 -07:00
e88e0026b1 [cutorch refactor] make dist(...)'s op generic, add missing unit test 2016-10-07 11:50:28 -07:00
ace9b49e28 [cutorch refactor] move cross(...) to generic 2016-10-07 11:50:28 -07:00
da90751add [cutorch refactor] move lerp(...) to generic 2016-10-07 11:50:28 -07:00
8cc566f7b5 [cutorch refactor] move clamp(...) to generic 2016-10-07 11:50:28 -07:00
02ad199905 [cutorch refactor] make var(...) generic 2016-10-07 11:50:28 -07:00
c3e0811d86 [cutorch refactor] cleanup code in prep for review 2016-10-07 11:50:28 -07:00
499d1c5709 [cutorch refactor] fixes for norm, wrap/test 2016-10-07 11:50:28 -07:00
cf16ec45e1 [cutorch refactor] move stdall into generic, wrap test for std 2016-10-07 11:50:27 -07:00
daa15dcceb [cutorch refactor] move varall into generic 2016-10-07 11:50:27 -07:00
32556cbe5e [cutorch refactor] move normall to generic 2016-10-07 11:50:27 -07:00
74d9c674f5 Make _norm(...)'s ops generic 2016-10-07 11:50:27 -07:00
a4da558fa0 [cutorch refactor] move mean function into generic/ 2016-10-07 11:50:27 -07:00
dba6d1d57f Make _norm(...)'s ops generic 2016-10-07 11:50:27 -07:00
b01c4338c9 [cutorch refactor] move std function into generic 2016-10-07 11:50:27 -07:00
811d947da3 [cutorch refactor] move renorm function into generic 2016-10-07 11:50:27 -07:00
de7bf7efe6 [cutorch refactor] move std function into generic 2016-10-07 11:50:27 -07:00
5537df9927 [cutorch refactor] make _renorm(...)'s ops generic 2016-10-07 11:50:27 -07:00
81fea93741 [cutorch refactor] move std function into generic 2016-10-07 11:50:27 -07:00
df1065a2d8 Move _std dependencies into THCTensorMathReduce.cuh 2016-10-07 11:50:27 -07:00
c2e3bf2145 [cutorch refactor] move meanall function into generic/, update cwrap for lua mean 2016-10-07 11:49:33 -07:00
a4d849ef68 [cutorch refactor] move mean function into generic/ 2016-10-07 11:49:33 -07:00
957c9f3853 Move atomicAdd functions to THCAtomics.cuh in order to share
definitions with other projects, e.g. cunn.
2016-10-07 11:43:02 -07:00
3958b6b0e1 Merge pull request #338 from nitsky/spatial_logsoftmax
SpatialLogSoftMax
2016-10-07 10:36:40 -04:00
5d70feb573 bug fix for wrong usage of checkGPU && port to windows with msvc 2016-10-07 15:55:38 +08:00
a22af69335 Add versioning and shared storage handling to autograd (#105) 2016-10-06 17:12:58 -04:00
1213149a2f add bias option to linear; allow modules to return nested lists/tuples of tensors (#106)
* add bias option to linear; allow modules to return nested lists/tuples of tensors
2016-10-06 15:59:12 -04:00
398b6f75cd update nn.md 2016-10-05 14:56:41 -04:00
e46e05e7c5 fix container doc 2016-10-05 14:53:41 -04:00
166028836d Ignore graph parts not requiring gradient in engine 2016-10-05 08:46:34 -07:00
3cbe66ba8c Change requires_grad default to False 2016-10-05 08:46:34 -07:00
99de537a2e Remove CUDA sync points from losses and trainer 2016-10-05 08:46:31 -07:00
1d0afdf9f7 Make requires_grad read only (except for leaves) 2016-10-05 07:55:07 -07:00
4db6667923 Allow specifying per-parameter optimization parameters 2016-10-04 18:21:50 -07:00
80e16e44aa Check container source on load 2016-10-04 17:41:12 -07:00
58b134b793 Allow exporting optimizer state as a dict 2016-10-04 17:33:49 -07:00
6efefac2df Add parameter_dict and load_parameter_dict methods for modules 2016-10-04 14:47:56 -07:00
0c9670ddf0 Allow remapping storages at load time and serialize data in little endian order 2016-10-04 12:54:55 -07:00
53c65ddc6a Fix memory leak when constructing a tensor from numpy (#98) 2016-10-03 23:27:54 -04:00
33371c5164 ffi tests skip on cuda 2016-10-03 12:15:28 -07:00
64dd1419c5 Fix Variable indexing bugs (#96) 2016-10-03 14:49:21 -04:00
108068a417 python 2.7 fixes 2016-10-03 00:14:06 -07:00
6e8ed95ada fix compilation error: 'for' loop initial declarations are only allowed in C99 mode 2016-10-03 14:11:59 +08:00
39c9f9e9e8 replace long with ptrdiff_t for memory size/offset etc 2016-10-03 12:55:30 +08:00
b555588f5d Make THNN lazy init thread safe 2016-10-02 21:36:05 -07:00
47ef4bb0a0 Fix memory leak in torch.cat 2016-10-02 21:36:05 -07:00
b34654bf97 Merge commit 'ab0e86ae4b0a08b8d0a67f1494ff80e65a6932ad' 2016-10-02 20:58:29 -07:00
6068df3ab2 Merge commit '60a8a9e918e04fd5581d20e4e7527dd115c69cd8' 2016-10-02 20:56:33 -07:00
bb35999f51 Merge commit '25c51c49aa3bb9ac5f64560a46f1f2a905f4e3f7' 2016-10-02 20:55:38 -07:00
25c51c49aa adding stdc++ static linking on TH_BINARY_BUILD=1 always, because caching allocator uses c++ 2016-10-02 20:48:35 -07:00
833bedb46b cudnn relative check in binary builds 2016-10-02 11:45:46 -07:00
3d8eba7b42 updating readme with new info 2016-10-02 10:13:15 -07:00
ab0e86ae4b fix arm neon bug 2016-10-02 08:35:40 -07:00
94b35312d0 Compile fixes for picky compilers / stl versions (#518)
* Compile fixes for picky compilers/stl versions
2016-10-02 00:41:47 -04:00
f4ebc65a12 Add Module.modules() and Module.children() (#90)
modules(): returns an iterator over all modules in the network
 children(): returns an iterator over immediate children

Also fix __getitem__ in Sequential
2016-10-01 21:18:53 -04:00
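Quick illustration (not part of the commit) of the difference between the two iterators:

    import torch.nn as nn

    net = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
    print(len(list(net.children())))  # 2 -- immediate children only
    print(len(list(net.modules())))   # 3 -- the container itself plus both children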
2bc9da4f5e Support "device" keyword argument (#79)
Adds the optional "device" keyword argument to Tensor and Storage
constructors and .new methods.
2016-10-01 19:32:55 -04:00
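Hypothetical usage sketch of the keyword described above, assuming a multi-GPU CUDA build; the exact constructor signature in this revision may differ from later releases:

    import torch

    x = torch.cuda.FloatTensor(4, device=1)   # allocate directly on GPU 1
    y = x.new(4, device=0)                    # .new accepts the same keyword
    print(x.get_device(), y.get_device())     # 1 0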
e034f258e3 Fix ffi utils in Python 2.7 2016-10-01 15:37:05 -07:00
39adf6dbd2 Merge pull request #80 from colesbury/data
Fixes to trainer and data loading
2016-10-01 16:50:42 -04:00
112df5f664 Fixes to trainer and data loading
1. Wrap target in a Variable in trainer
2. Collate numbers into torch.Long/DoubleTensors
2016-10-01 13:21:16 -07:00
3564b77553 a couple of changes for win32 (#779)
* windows timer with milliseconds
2016-10-01 15:27:30 -04:00
c813e93d85 fixing python 3 compat 2016-09-30 16:44:00 -07:00
ff59385034 Add 'torch/lib/nccl/' from commit 'ca330b110ae76ace344182ab83a028911111cc36'
git-subtree-dir: torch/lib/nccl
git-subtree-mainline: ea4f812a123a99d3beda1fdf4a2197035981eccb
git-subtree-split: ca330b110ae76ace344182ab83a028911111cc36
2016-09-30 16:35:16 -07:00
ea4f812a12 Fix Container.parameters() 2016-09-30 16:31:36 -07:00
dbe540e49f Use the custom TH error handler in all threads by default 2016-09-30 14:59:50 -07:00
c1c0969834 Allow changing the default error handler for all threads
THSetErrorHandler still modifies per-thread pointers, but
THSetDefaultErrorHandler allows setting a handler that's
used by all threads that haven't specified any function.
2016-09-30 14:59:50 -07:00
b87f26ce26 windows high resolution timer with a few makefile changes (#776)
windows high resolution timer
2016-09-30 14:59:50 -07:00
67335e638c bug fix for read/writeLong in THMemoryFile 2016-09-30 14:59:50 -07:00
90916f34a7 fix cpuid ecx; change to compile with msvc 2016-09-30 14:59:50 -07:00
11b38a6895 Add more functions to autograd 2016-09-30 16:37:07 -04:00
a1f5fe6a8f Add multiprocess data loader + improvements to torch.utils.data 2016-09-30 16:23:43 -04:00
5cad164dee Merge pull request #73 from colesbury/THC
Update THC and THCUNN
2016-09-30 15:53:11 -04:00
7dd28b885d Allow changing the default error handler for all threads
THSetErrorHandler still modifies per-thread pointers, but
THSetDefaultErrorHandler allows setting a handler that's
used by all threads that haven't specified any function.
2016-09-30 12:37:58 -07:00
c20828478e Update Module.cpp for THC changes 2016-09-30 11:13:14 -07:00
3e1c88e3e0 Merge commit 'da1e3f084d237ba319a22987f95f70abb69d7745' 2016-09-30 11:07:46 -07:00
e98a4ea336 Merge commit '0b0a62420c52b6e4d4c80c36d067db4654d1ed8d' 2016-09-30 11:06:53 -07:00
e8a5f00866 Auto GPU for CUNN (#71) 2016-09-30 14:04:53 -04:00
d92b7da733 fix documentation to not use forward 2016-09-30 09:49:30 -07:00
7ff16baa7d Use breadth-first in ExecutionEngine (#72) 2016-09-29 23:57:37 -04:00
93e60715af Fix error message 2016-09-29 16:27:20 -07:00
14965cfce9 Run cuDNN operations on the correct device 2016-09-29 16:27:07 -07:00
da1e3f084d Fixes for https://github.com/torch/cutorch/pull/519 2016-09-29 16:19:41 -07:00
0b0a62420c Make some basic THC operations thread-safe
Switching the device, setting the stream, and switching BLAS handles is
now thread-safe. Some other operations, like reserveStreams, are still
not thread-safe.
2016-09-29 16:17:43 -07:00
c92c82aa1a Really fix utils tests... 2016-09-29 12:52:12 -07:00
4742c08c7c Improve error messages in autograd 2016-09-29 12:16:19 -07:00
9c6ced1c0a Disable ffi tests if cffi is not available 2016-09-29 12:16:19 -07:00
a33c9bd774 Improve argument matching in invalidArguments 2016-09-29 12:16:19 -07:00
c8a4734b97 Add RReLU to both nn packages 2016-09-29 11:33:34 -07:00
3f7ab95890 Finish implementation of prng related functions 2016-09-29 11:33:25 -07:00
2d8c2972ae Only allow leaf variables as module parameters 2016-09-29 11:31:26 -07:00
941cf4e63d Add ffi utils for user C extensions 2016-09-29 09:35:56 -07:00
57610a7471 Fix documentation for MaxUnpool2d (#68) 2016-09-29 10:02:34 -04:00
f5a6a3b0e9 Fix torch.nn.Module._apply with None types (#66) 2016-09-28 19:31:07 -04:00
bab7f89cdc Fix no_bias constructor for conv2d (#65) 2016-09-28 19:30:43 -04:00
cb5d4e836f Lazy load CUDA and THNN modules (#64) 2016-09-28 19:29:53 -04:00
3a5544f060 Add support for GenerateFloatTypes, for use with cunn. 2016-09-28 09:59:19 -07:00
412019dbe4 fixing CPU builds by making cuda imports optional 2016-09-28 11:56:18 -04:00
f9d9c92560 Fix type conversions in autograd 2016-09-27 15:45:52 -07:00
7f4ff0e615 Fix type conversions in nn 2016-09-27 15:45:49 -07:00
3eac7164f4 Add data parallel functions to nn 2016-09-27 15:45:45 -07:00
f9d25e8e72 Refactor nn (require specifying parameters explicitly) 2016-09-27 15:22:26 -07:00
52ed57352a Free GIL in C functions 2016-09-27 15:22:20 -07:00
1828e7c42f Add async CUDA copy 2016-09-27 15:12:48 -07:00
2c89ae4e8a Rename getDevice to get_device 2016-09-27 15:12:48 -07:00
779a460030 Add cuDNN support for convolutions (#36) 2016-09-27 17:55:04 -04:00
0312f939d6 Only set c++11 compiler flags on THCCachingAllocator.cpp 2016-09-27 13:13:59 -07:00
60a8a9e918 improving error messages in nn 2016-09-27 12:26:03 -04:00
89666fc4fe Fix SpatialLogSoftMax memory leak and code cleanup 2016-09-27 08:16:31 -07:00
44527ab5be fix c++11 flags thing 2016-09-27 09:26:21 -04:00
a0cf6658c5 windows high resolution timer with a few makefile changes (#776)
windows high resolution timer
2016-09-27 08:59:27 -04:00
5107f23126 fix ClassNLLCriterion targets in tests and legacy nn 2016-09-26 18:56:12 -07:00
4a5557203b Merge commit 'c020a8502bd943aa37f897efe79a01fd61249ab4' 2016-09-26 17:54:05 -07:00
c020a8502b making ClassNLLCriterion targets consistent between cpu and cuda 2016-09-26 17:48:17 -07:00
44481354fc Add back support for child=None in Container constructor (#55)
It's often useful to have optional child modules, such as the
downsampling operation in ResNets. Add a test for this case:

  nn.Container(
    child=None,
  )
2016-09-26 17:18:02 -04:00
974fb1b09a Merge pull request #57 from colesbury/THC
Update THC and use CUDA caching allocator
2016-09-26 16:29:02 -04:00
4e9f0a8255 Use CUDA caching allocator 2016-09-26 13:12:39 -07:00
fa1f286cae Merge commit '85bd287b7ba481312fa58d7ffb32cba901c58829' 2016-09-26 13:08:32 -07:00
85bd287b7b Add THC_CACHING_ALLOCATOR=1 to README.md 2016-09-26 13:02:48 -07:00
0eff3897e3 Update SpatialLogSoftMax kernel to use cuda dimensions 2016-09-26 09:39:56 -07:00
e26e35a9ee bug fix for read/writeLong in THMemoryFile 2016-09-26 10:45:10 +08:00
980300b381 Combine autograd.Leaf and autograd.Variable (#52)
Prior to this change, there was a circular reference between Leaf and
Variable. This means that the objects (and referenced Tensors) are not
collected as soon as they go out of scope, which lead to higher memory
usage and out-of-memory errors.
2016-09-25 20:21:14 -04:00
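Generic Python illustration (not from the commit) of why the cycle mattered: objects in a reference cycle are only reclaimed by the cyclic garbage collector, not by refcounting as soon as they go out of scope.

    import gc
    import weakref

    class Node(object):
        pass

    a, b = Node(), Node()
    a.other, b.other = b, a   # reference cycle, analogous to Leaf <-> Variable
    probe = weakref.ref(a)
    del a, b
    print(probe() is None)    # False -- refcounting alone cannot free the cycle
    gc.collect()
    print(probe() is None)    # True -- reclaimed only by the cyclic collector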
1cf87e8a0b OSX + Python 2 build fixes 2016-09-25 19:26:13 -04:00
817d860af5 Add CUDA caching allocator
The allocator can be enabled by setting the environment variable
THC_CACHING_ALLOCATOR=1
2016-09-25 12:57:50 -07:00
0be5031a93 Pretty print type mismatches in error messages 2016-09-25 12:26:00 -07:00
1ed488da4f Make custom precision of CUDA tests work in inplace mode as well 2016-09-25 12:26:00 -07:00
ddf1598ef8 Add a method for catching exceptions thrown in ctypes 2016-09-25 12:25:54 -07:00
4a8a185aa4 Preserve storage view sharing in torch.save and torch.load 2016-09-25 12:24:10 -07:00
4cdeae3283 Return only unique variables from parameters() 2016-09-25 12:23:43 -07:00
5030d76acf Reduce precision of CUDA blas tests 2016-09-23 21:10:28 -07:00
c51e2c8b8c Rename CELoss to CrossEntropyLoss 2016-09-23 18:06:44 -07:00
eec0420eb3 Initialize nn modules' parameters with a default tensor type 2016-09-23 18:06:26 -07:00
e66ea56bb3 Improve THNN tensor type mismatch error messages 2016-09-23 18:06:26 -07:00
eefa0c7b40 Require torch.nn.cuda automatically when calling .cuda() 2016-09-23 18:06:26 -07:00
a489884da4 Reduce precision of addmm CUDA test 2016-09-23 17:52:08 -07:00
7a74d3fc9e Fix dl flag module in python>=3.6 2016-09-23 17:25:10 -07:00
e71204b52f Improve error messages in storage and tensor C functions 2016-09-23 17:17:35 -07:00
ca330b110a Add scan tests 2016-09-22 11:58:33 -07:00
6c77476cc1 Make tests check for deltas and report bandwidth 2016-09-22 11:58:28 -07:00
cabd6848e4 Heavy code refactoring to remove a lot of code in collectives (~1000 lines).
Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.
2016-09-22 11:57:56 -07:00
e3dbc6110e Add profiling API 2016-09-22 11:56:51 -07:00
1d6715fe20 Fix MPI test path 2016-09-22 11:56:20 -07:00
06ab3f962f Refactor _C extension to export some utilities 2016-09-21 08:36:54 -07:00
df77a8a81a Update LogSoftMax to work in spatial domain 2016-09-21 08:11:59 -07:00
94b7c32eb3 compiling double atomicAdd only if CUDA_ARCH < 6000, because it's now included in CUDA 2016-09-20 20:42:23 -04:00
8fdec15a55 Codemod to remove camel case method naming 2016-09-20 08:40:28 -07:00
e8b1217b28 Use bitwise operations for atomicAdd rather than byte_perm or pointer dereferences.
Also properly check that half is enabled.
2016-09-19 14:00:52 -07:00
f56f06d88d fix cpuid ecx; change to compile with msvc 2016-09-19 14:41:48 +08:00
0f7a1e27d0 updating auto-generated docs 2016-09-19 00:39:46 -04:00
5114d94ad9 docstrings for conv, dropout, linear, pooling and sparse functions 2016-09-19 00:31:22 -04:00
f74c42bf00 Slightly improve THNN error messages 2016-09-18 15:02:25 -04:00
a8e816f450 Fix maskedSelect test 2016-09-18 12:54:12 -04:00
a90c259eda Add myself to LICENSE file 2016-09-18 12:53:57 -04:00
e223564a55 Fix multiprocessing on OS X 2016-09-16 18:27:07 -04:00
7847d77405 Add more functions to autograd 2016-09-16 15:26:24 -07:00
089d223922 Add support for CUDA indexAdd
Adds indexAdd via atomicAdd for unsigned char, char, short, long,
half, double.  Integer types are templatized based on sizeof.
Floating point types are implemented via intrinsics.
2016-09-16 12:50:57 -07:00
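Illustrative CPU-side semantics that the new CUDA kernels mirror (shown via the Python binding): duplicate indices accumulate, which is why atomicAdd is needed on the GPU.

    import torch

    x = torch.zeros(5)
    idx = torch.LongTensor([0, 2, 2])
    src = torch.FloatTensor([1, 1, 1])
    x.index_add_(0, idx, src)
    print(x)   # [1, 0, 2, 0, 0] -- the duplicated index 2 accumulates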
930085ec9c fixing doc2md for code blocks 2016-09-16 13:34:12 -04:00
b5f7720ab9 docstrings for container and batchnorm 2016-09-16 05:31:36 -04:00
a0a2d9885a adding docstrings for activation functions 2016-09-16 03:40:24 -04:00
0557143801 adding markdown docs and docutils 2016-09-16 03:34:20 -04:00
64a15928d7 Fix tests on python 3.3 2016-09-15 20:29:08 -07:00
d1fda539b7 Fix nn serialization errors 2016-09-15 19:28:34 -07:00
95d545e75b Fix a multiprocessing GIL issue 2016-09-15 18:49:20 -07:00
491fbfdc8c Improve error messages of tensor methods 2016-09-15 18:49:20 -07:00
5d24432322 Fix errors when printing tensors with inf and nan values 2016-09-15 18:49:20 -07:00
da5bb373e6 Type conversions now use auto gpu 2016-09-15 18:48:27 -07:00
6584b35db2 Add getDevice for CUDA storages 2016-09-15 18:48:27 -07:00
e5874ea40d Add getDevice for CUDA storages 2016-09-15 13:54:39 -07:00
4bad029fd4 Add more functions to autograd 2016-09-15 13:01:24 -07:00
7920b9229b Update TH 2016-09-15 12:47:53 -07:00
9ee6189bf9 Merge pull request #41 from jia-kai/master
Some minor fixes for compile/usage
2016-09-15 09:45:52 -07:00
9b7d47935a Fix mode 2016-09-15 07:37:05 -07:00
19ec206bad reducing tolerance in cumprod unit test 2016-09-14 15:53:14 -07:00
fb39971464 Add more modules to nn 2016-09-14 11:05:56 -07:00
3ea1da3b2c Minor fix in CUDA module 2016-09-14 11:09:03 -04:00
a0fb1ab86e Reduce precision for addmm and rsqrt CUDA tests 2016-09-14 11:08:53 -04:00
60f4d285af Update THCUNN 2016-09-14 11:08:37 -04:00
aa092f03d8 Update THNN 2016-09-14 11:08:25 -04:00
f2cb3e0b7b Mark BCECriterion weights as optional in THCUNN.h 2016-09-14 10:36:59 -04:00
5206c98c1b Mark BCECriterion weights as optional in THNN.h 2016-09-14 10:32:37 -04:00
420c2ae745 adding storageOffset in VolumetricConvolutionMM 2016-09-13 19:13:21 -04:00
874179a871 Accept 5D weights in VolumetricConvolutionMM 2016-09-13 15:06:57 -07:00
1d9b10d312 libshm needs static libstdc++ on binary build 2016-09-13 13:13:58 -07:00
ccf7a3043f fixing MaxPooling for changed THNN interface 2016-09-13 11:44:17 -07:00
26a614b4d1 Merge commit '788ff68c1f273185092710abb269fe550f0fe196' 2016-09-13 11:21:04 -07:00
297fa957f7 Merge commit 'a442b5f5cc0e250653688bab8b4be93bfd3934ed' 2016-09-13 11:17:04 -07:00
05fb544f23 Merge commit '73d15cf64320b4b77e7393efa1bf1e913404cfd6' 2016-09-13 11:16:09 -07:00
96cd92a0a9 fix OSX build 2016-09-13 10:34:13 -07:00
31c45e0a08 adding verbose run_test.sh 2016-09-13 10:34:13 -07:00
65d4055366 adding static linking on binary builds 2016-09-13 10:34:13 -07:00
1f2695e875 adding cuda driver check functions for runtime checking 2016-09-13 10:34:13 -07:00
59556d0943 modifying wrappers for new cuda math functions 2016-09-13 10:34:13 -07:00
c65e795435 Merge commit 'eb6419f02bea8bca3a8ff1791d0a9f2a2e733035' 2016-09-13 10:33:43 -07:00
9842be4b15 setting default dampening value to 0 2016-09-13 10:28:33 -07:00
eb6419f02b adding binary build options 2016-09-13 10:04:31 -07:00
a6d1f6aee4 making sure cusparse is linked when MAGMA is 2016-09-12 21:36:26 -07:00
73d15cf643 moving arch detection into THCUNN 2016-09-13 00:09:25 -04:00
a4dd9d0b86 spit out CUDA_NVCC_FLAGS 2016-09-12 22:11:36 -04:00
49cd6f99d8 only enable CUDA half instructions above 8.0 2016-09-10 16:52:20 -07:00
159cba815b fixing bug in sign for ByteTensor 2016-09-10 17:15:21 -04:00
cbb76eae04 add int and long tests for abs and fix recursive bug in abs 2016-09-10 16:40:27 -04:00
21a189fee6 fixing bug in frac. fixing pointwise unit tests to not just return range of 0, 1 2016-09-10 15:39:39 -04:00
4cff149c46 Merge branch 'from_buffer' of https://github.com/colesbury/pytorch 2016-09-10 13:24:16 -04:00
788ff68c1f prevent Unknown CMake command "check_function_exists". (#761)
* prevent Unknown CMake command "check_function_exists".
2016-09-10 13:02:53 -04:00
009667c26c adding generated files 2016-09-09 16:18:21 -07:00
d9f8f39a9a making more files to split between types 2016-09-09 16:00:13 -07:00
1fe27380f0 refactoring THCTensorMathCompareT to split types compilation 2016-09-09 14:20:56 -07:00
621d6a4475 splitting sort compilation into individual types 2016-09-09 14:20:56 -07:00
30700ede39 removing a lot of template instantiation in sort 2016-09-09 14:20:56 -07:00
8a76dc8b59 Add missing PyBuffer_Release calls 2016-09-09 11:23:18 -07:00
f646391f26 Bug fixes and test improvements
Fixed:
* tensor and storage printing
* legacy.nn module printing
* SpatialCrossMapLRN tests

Also, all fixed bugs have regression tests now.
2016-09-08 19:07:05 -07:00
1703f2abed Add utils tests 2016-09-08 19:07:00 -07:00
ee85fe1a9c Initial utils implementation 2016-09-08 18:49:48 -07:00
0703e0e897 BCECriterion THCUNN + Weights (#331)
BCE Criterion CUDA implementation
2016-09-08 16:33:53 -04:00
a442b5f5cc Merge pull request #626 from szagoruyko/bcethnn
BCECriterion THNN implementation
2016-09-08 16:33:32 -04:00
3d6b805652 Make travis use run_test.sh 2016-09-08 11:23:42 -07:00
58f507f9e3 Add file descriptor sharing mode to multiprocessing 2016-09-08 11:23:33 -07:00
24fe4b8af3 Initialize THVector dispatch tables 2016-09-08 10:04:43 -07:00
46fa7d987b Update TH 2016-09-08 09:54:21 -07:00
90a7a79a19 Don't check memory permissions with write on OS X
write returns errno 45 (Operation not supported) when used with
a file descriptor obtained via shm_open on OS X.
2016-09-08 08:59:16 -07:00
64938f75b3 Add incref/decref for THRefcountedMapAllocator 2016-09-08 08:53:25 -07:00
043be6f55c Fix shm_open and shm_unlink cmake test 2016-09-08 08:52:58 -07:00
cef9bf7f29 Fix segfault in THMapAllocator 2016-09-08 08:52:23 -07:00
dac5a0a07e adding generic/THVector.h to cmake 2016-09-08 10:55:30 -04:00
76ac35cdaa Added dynamic dispatch for x86 (#755)
Added dynamic dispatch for x86
2016-09-07 23:31:04 -04:00
c1f0e10a59 Merge pull request #15 from colesbury/conv2d
Use chainer-style constructor for Conv2d
2016-09-07 19:59:10 -04:00
1bd76a717d Merge pull request #9 from colesbury/develop
Add Storage.from_buffer
2016-09-07 19:03:42 -04:00
cd0929aa5e Use chainer-style constructor for Conv2d
* Conv2d, MaxPool2d, and AvgPool2d have one argument for each of ksize,
   stride, and pad. This argument can be either a single number or a
   tuple of (h, w)
2016-09-07 15:51:44 -07:00
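Sketch of the constructor style described above; keyword names follow later PyTorch releases and may not match this revision exactly:

    import torch.nn as nn

    conv_a = nn.Conv2d(3, 16, 3)                                      # square 3x3 kernel
    conv_b = nn.Conv2d(3, 16, (3, 5), stride=(1, 2), padding=(1, 2))  # per-dimension (h, w)
    pool = nn.MaxPool2d(2, stride=2)                                  # single number or tuple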
1486d880b0 Add Storage.from_buffer
The from_buffer is similar to numpy's frombuffer. It decodes a Python
buffer object into a Storage object. For byte and char storages, it
simply copies the bytes.
2016-09-07 15:32:33 -07:00
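Illustrative use of the byte-copying behaviour described above:

    import torch

    raw = bytearray(b"\x01\x02\x03\x04")
    s = torch.ByteStorage.from_buffer(raw)   # copies the four bytes
    print(list(s))                           # [1, 2, 3, 4]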
b738b09606 Clean up Module forward and __call__ (#14)
* _forward is renamed forward since users should override it

 * some __call__ overrides are changed to forward

 * function which return a single variable are changed to return that
   variable instead of a one-element tuple
2016-09-07 15:41:39 -04:00
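Minimal sketch (not from the commit) of the resulting convention: users override forward, and calling the module dispatches through __call__:

    import torch
    import torch.nn as nn

    class Doubler(nn.Module):
        def forward(self, x):
            return x * 2

    m = Doubler()
    print(m(torch.ones(3)))   # __call__ dispatches to forward(); returns a single tensor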
7188f0851d add link to conda binaries 2016-09-07 12:03:08 -04:00
47700a4c20 fixing run_test.sh to switch to its own directory before running commands 2016-09-06 23:55:05 -04:00
3aeb0f5408 adding test file 2016-09-06 22:54:14 -04:00
07d1acd798 add torch license 2016-09-06 22:47:12 -04:00
4cffa2219a build fixes for OSX 2016-09-06 22:06:06 -04:00
11852e5c22 Add new flags for THMapAllocator
* TH_ALLOCATOR_MAPPED_FROMFD uses an existing file descriptor for
  mapping (and steals it)
* TH_ALLOCATOR_MAPPED_KEEPFD doesn't close the file descriptor
  until the mapping is freed
* TH_ALLOCATOR_MAPPED_UNLINK unlinks the file immediately after
  mapping it to memory

Also, now it's using fstat to check the file size (instead of lseek,
which alters the fd state).
2016-09-06 10:35:34 -07:00
eb22208a4e changing badge links 2016-09-06 02:23:23 -04:00
bf8b4779cc updating README 2016-09-06 01:47:14 -04:00
0e7ab61a83 Remove unnecessary files 2016-09-03 19:11:31 -07:00
3e4ddf25d0 Fix storage printing 2016-09-03 19:05:15 -07:00
1a3b585209 Merge https://github.com/apaszke/pytorch 2016-09-03 00:48:20 -04:00
a5504231b5 Making a build matrix 2016-09-03 00:42:35 -04:00
c63d042c57 change language in build readme 2016-09-03 00:27:58 -04:00
70c8d43831 fix link with http 2016-09-03 00:24:10 -04:00
5837d3480c adding CUDA build badge 2016-09-03 00:13:46 -04:00
ab1382b652 Merge pull request #10 from colesbury/develop
Make bias optional in Conv2d
2016-09-01 19:00:09 -04:00
9553e46ed7 Make bias optional in Conv2d 2016-09-01 12:38:34 -07:00
34b673844f bumping to alpha-2 2016-09-01 00:58:22 -04:00
a2a840df93 updated readme with multiprocessing changes 2016-09-01 00:55:12 -04:00
5b23b67092 Fix build status link 2016-08-31 21:03:42 -07:00
f9d186d33a Add initial version of multiprocessing module 2016-08-31 19:46:08 -07:00
93f959b03b Merge pull request #329 from gchanan/hardtanh
inplace is reversed in HardTanh:backward.
2016-08-31 22:36:26 -04:00
8f3f90a986 inplace is reversed in HardTanh:backward.
Fixes torch7 issue #734, "Inconsistent behavior of nn.Clamp in CPU and GPU modes".
Adds a simple test that gradOutput equals gradInput after backward when inplace is set.
It is possible to construct a test with inplace HardTanh where forward+backward yields
different results for nn vs cunn, but this appears to be due to the inclusive vs
exclusive bounds used for inplace vs non-inplace, respectively, so a more direct test
is preferred.
2016-08-31 19:31:41 -07:00
c7a66ddf74 Add new shared memory allocator to TH 2016-08-31 19:28:43 -07:00
b60d7e1476 fix addmm and addmv documentation when they are methods 2016-08-31 19:28:43 -07:00
4a1e099974 Add new shared memory allocator to TH 2016-08-31 14:36:42 -07:00
4bbe5a095d HardTanh does not use inclusive bounds in inplace mode
HardTanh does not use inclusive bounds in inplace mode, which makes it
inconsistent with the cuda version, as well as Threshold, ReLU, etc.
2016-08-31 12:26:07 -07:00
7280ff8f3d Merge pull request #9 from colesbury/develop
Simplify nn.Container and nn.Sequential
2016-08-31 14:40:10 -04:00
f45213a276 Simplify nn.Container and nn.Sequential
- nn.Container.modules is just a python list and used by nn.Sequential

 - Every module in nn.Sequential has a name. This fixes Module.type()

 - nn.Sequential constructor accepts either a list or an OrderedDict. With a
   list, the modules are named "0", "1", "2", ...
2016-08-31 11:15:31 -07:00
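Illustration of the named-module behaviour described above, using the OrderedDict form (the plain-list form is specific to this revision):

    from collections import OrderedDict

    import torch.nn as nn

    seq = nn.Sequential(OrderedDict([
        ("conv", nn.Conv2d(3, 8, 3)),
        ("relu", nn.ReLU()),
    ]))
    print(seq)   # the repr lists the children under their given names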
346194bd8e Merge pull request #8 from colesbury/develop
Make BatchNorm2d inherit from BatchNorm
2016-08-30 16:45:41 -04:00
959304bc0d Make BatchNorm2d inherit from BatchNorm 2016-08-30 13:25:01 -07:00
fffdd5170a Merge pull request #7 from colesbury/develop
Add torch.nn.AvgPool2d
2016-08-30 15:48:08 -04:00
ec22828169 Add torch.nn.AvgPool2d 2016-08-30 12:16:40 -07:00
19c0474ae4 Add Volumetric Dilated Max Pooling
new file:   VolumetricDilatedMaxPooling.lua
modified:   init.lua
modified:   lib/THNN/generic/THNN.h
copied:     lib/THNN/generic/VolumetricMaxPooling.c -> lib/THNN/generic/VolumetricDilatedMaxPooling.c
modified:   lib/THNN/generic/VolumetricMaxPooling.c
modified:   lib/THNN/init.c
modified:   test.lua

Update docs
modified:   doc/convolution.md

Resolved bug in ceil mode
modified:   lib/THNN/generic/VolumetricDilatedMaxPooling.c
2016-08-28 19:38:52 +05:30
ddc0c61692 VolumetricDilatedMaxPooling
modified:   lib/THCUNN/THCUNN.h
copied:     lib/THCUNN/VolumetricMaxPooling.cu -> lib/THCUNN/VolumetricDilatedMaxPooling.cu
modified:   lib/THCUNN/VolumetricMaxPooling.cu
modified:   test.lua
2016-08-28 19:01:12 +05:30
1fbcb44ae6 make abs and sign available for a few other types 2016-08-26 21:30:33 -07:00
7eaaf57d2b adding multiple types for log, log1p, exp, cos, cosh, acos, sin, sinh, asin, tan, atan, tanh, sqrt, rsqrt, sigmoid, cinv, ceil, floor, neg, abs, sign, round, trunc, frac 2016-08-26 21:30:33 -07:00
fc643f2407 append TORCH_NVCC_FLAGS optionally 2016-08-26 16:30:25 -07:00
939b0a4297 Merge pull request #45 from NVIDIA/cw-update-copyright-year
Update LICENSE.txt
2016-08-26 15:44:00 -07:00
234c8c9ef3 Update LICENSE.txt 2016-08-26 15:39:21 -07:00
75bad643bd Updated LICENSE.txt 2016-08-26 15:08:20 -07:00
229e3ec184 Consistent Max Pool API
renamed:    lib/THCUNN/SpatialMaxPooling.cu -> lib/THCUNN/SpatialDilatedMaxPooling.cu
modified:   lib/THCUNN/SpatialMaxPooling.cu
modified:   lib/THCUNN/THCUNN.h
2016-08-27 01:39:14 +05:30
6ba435f822 Make SpatialMaxPooling API consistent
modified:   SpatialDilatedMaxPooling.lua
modified:   SpatialMaxPooling.lua
renamed:    lib/THNN/generic/SpatialMaxPooling.c -> lib/THNN/generic/SpatialDilatedMaxPooling.c
modified:   lib/THNN/generic/SpatialMaxPooling.c
modified:   lib/THNN/generic/THNN.h
modified:   lib/THNN/init.c
2016-08-25 23:38:19 +05:30
774a6f1093 Add in-place operations to autograd and nn 2016-08-25 09:34:54 -07:00
814728a6aa Improve hook system 2016-08-25 09:23:39 -07:00
7654698a0e Fix ExecutionEngine bug and improve autograd tests 2016-08-25 09:23:39 -07:00
ff785e5f17 Make optimizers accept a closure 2016-08-25 09:23:39 -07:00
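The closure pattern referenced here is still how `torch.optim` works today; a minimal sketch (the model, data, and learning rate are placeholders):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

x, target = torch.randn(32, 10), torch.randn(32, 1)

def closure():
    # The optimizer may re-evaluate this closure (e.g. LBFGS does during line search)
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    return loss

optimizer.step(closure)
```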
cc645de37b MaxPooling2d -> MaxPool2d 2016-08-24 10:11:00 -07:00
cc62ee229e Fix torch tests 2016-08-24 10:10:52 -07:00
24476090df Add volatile variables 2016-08-24 08:43:11 -07:00
5fef59beb4 fix critical bug in SpatialConvolution 2016-08-22 14:41:46 -04:00
1d0fe075eb fixing critical bug in SpatialConvolution 2016-08-22 14:36:16 -04:00
47b0797fe1 pass devlist as const int* rather than int* in ncclCommInitAll 2016-08-19 19:00:14 +08:00
ed401cc29b link library with -lrt; otherwise there is undefined reference to shm_open 2016-08-19 18:58:56 +08:00
34ea3fd17e Fixing bug in regex not accepting 2.1(2.0) notation 2016-08-18 12:55:36 -07:00
68b400fbf3 fix addmm and addmv documentation when they are methods 2016-08-13 11:33:57 -04:00
b3a9e1333d Remove unneeded deb build script 2016-07-27 17:58:00 -07:00
428ec5b2a3 Merge remote-tracking branch 'github/master' into public 2016-07-25 10:53:01 -07:00
55c42ad681 Fixed redundant contexts in multi-process apps
Change-Id: If787014450fd281304f0c7baf01d25963e40905d
2016-07-25 10:10:30 -07:00
f09c8ab1e8 C-impl BCECriterion 2016-07-17 14:20:28 +02:00
7a1aa6b563 Improved Deb generation 2016-07-07 16:31:57 +02:00
9ae84f5d6b Fix version number 2016-06-16 17:07:42 -07:00
e51e922924 Add a debug level to NCCL and CUDA versions at init 2016-06-16 17:04:41 -07:00
9fcc523485 Increased version to 1.2.3 2016-06-15 19:18:13 -07:00
67d1ab9106 Packaging : Generate shlibs.local 2016-06-15 19:03:08 -07:00
da6d2009e0 Move deb to build directory 2016-06-15 18:20:10 -07:00
155132d336 Fix make install to use BUILDDIR 2016-06-15 18:20:02 -07:00
08ddfe03d2 Rework debian packaging 2016-06-15 18:18:44 -07:00
5d4716a8a3 Include link to blog post in README.md 2016-06-15 10:54:19 -07:00
aa8f669a3d Updating for .deb rebuild 2016-06-13 02:01:49 -07:00
d5e507fc7f Only call the CUDA runtime. That may fix #27. 2016-06-07 16:27:51 -07:00
620491a649 Merge remote-tracking branch 'github/master' into HEAD 2016-06-06 14:35:57 -07:00
7edfc57228 Make NCCL collectives work on communicators with only one rank 2016-06-06 14:35:00 -07:00
bd3cf73e6e Changed CURAND generator to work on a wider set of platforms. 2016-06-06 14:34:03 -07:00
177505b757 Gencodes changed to NV recommended 2016-06-06 00:06:18 -07:00
9d9d8cd59f Bump to 1.2.2 2016-06-03 17:21:53 -07:00
1657af1567 Better name for GENCODE 2016-06-03 10:25:37 -07:00
acb93d1aed Removing unneeded includes 2016-06-02 17:33:43 -07:00
889ad3d4e6 Makefile improvements
- Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests
2016-06-02 15:01:03 -07:00
93538def65 Merge pull request #22 from borisfom/master
Fixed version in ChangeLog
2016-04-21 18:58:44 -07:00
e5067b6611 Fixed version in ChangeLog 2016-04-21 16:28:13 -07:00
0629fb62d7 Merge pull request #21 from borisfom/master
Fixed install location, new .deb version
2016-04-21 14:46:41 -07:00
0177cf3ea4 Fixed install location, new .deb version 2016-04-21 14:10:31 -07:00
658aca1469 Merge pull request #17 from Hopobcn/master
Enable compilation with specific g++
2016-04-21 13:25:18 -07:00
03df4c7759 Moved no-as-needed flag to link rule.
Avoids link errors for tests linked with nvcc.
2016-04-19 14:51:03 -07:00
0d4f8f4e95 Merge pull request #18 from apaszke/master
Add --no-as-needed to make sure that cudart library gets linked
2016-04-19 11:11:39 -07:00
ddd3f2084d Fix readme to reflect the new test paths 2016-04-19 11:09:25 -07:00
dba3ec9428 Fix random deadlock during ncclCommInitRank. 2016-04-19 10:47:27 -07:00
9de361a1b9 Fix MPI test usage
Only display usage from rank 0 and exit instead of continuing (and seg fault).
2016-04-19 10:43:38 -07:00
c0c959b1be Add --no-as-needed to make sure that cudart library gets linked 2016-04-13 10:04:38 -04:00
e30bf95989 Enable compilation with old g++ when the default g++ is not supported (+5.0) 2016-04-12 12:49:13 +02:00
b16cc5d197 Merge pull request #16 from borisfom/master
Removed Tegra, fixed version format.
2016-03-17 17:35:04 -07:00
e6f4a83da6 Removing Tegra 2016-03-17 17:25:27 -07:00
1a8bae5b2f fixed version format 2016-03-17 17:13:45 -07:00
e8eb285a59 Merge pull request #15 from borisfom/master
Fixing version number and compile param for 5.3
2016-03-17 16:03:05 -07:00
b508d28123 Version with . 7.5 2016-03-17 15:48:48 -07:00
62b551798f Use arch=5.3 as well 2016-03-16 23:09:36 -07:00
dfbebe395c Delete libnccl1_1.1.1+cuda75_amd64.deb 2016-03-16 21:44:13 -07:00
85280b5bf4 Delete libnccl-dev_1.1.1+cuda75_amd64.deb 2016-03-16 21:44:04 -07:00
fb53cfd9b0 Added files via upload 2016-03-16 21:42:47 -07:00
92d2123d8d Added compute 5.3 2016-03-16 19:24:48 -07:00
ec3de28ae5 Preparing for pbuild 2016-03-16 19:23:49 -07:00
86dc136fa9 Moved to pbuilder 2016-03-16 18:41:54 -07:00
172f316ac2 Moved release files to proper area
Bumping a version; building for 7.5
2016-03-16 18:30:53 -07:00
941d9da08c Updated package version, added manpage 2016-02-29 12:10:34 -08:00
5554a4c9f0 Fixed useRemoteRecv consistency issue.
Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <jwoolley@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-02-18 13:45:42 -08:00
9442285526 Fixed buffer overflow in ReduceOrCopy
Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-02-12 15:13:56 -08:00
caa40b8dd3 Libwrap checks for LIB.so.1 if LIB.so not found
Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-29 12:36:42 -08:00
2758353380 Added NCCL error checking to tests.
Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-29 11:09:05 -08:00
fe1a956715 Enabled support for char type to be unsigned.
GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-28 13:38:18 -08:00
c05312f151 Moved tests to separate dir and improved MPI test
test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-28 12:56:36 -08:00
5966316771 Added support for more than 8 GPUs.
Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <jwoolley@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-21 13:00:21 -08:00
130ee246e2 Fixed deadlock in back-to-back reduce_scatters.
Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <ptredak@nvidia.com>
Tested-by: Przemek Tredak <ptredak@nvidia.com>
2016-01-21 10:36:03 -08:00
90af7c73ef Merge pull request #6 from lukeyeager/deb
Deb packaging
2016-01-07 13:06:28 -08:00
3251681207 Merge branch 'yangky11-patch-1' 2016-01-06 16:48:29 -08:00
d332c41e71 fix a typo in README.md 2015-12-24 00:01:02 +08:00
c9da89254b Update deb packaging scripts 2015-12-18 14:23:34 -08:00
eb2d869f71 Merge pull request #5 from lukeyeager/tests-nvml
Don't link tests with NVML
2015-12-18 13:36:20 -08:00
f1e92fe2a3 Added Debian packaging files 2015-12-18 13:36:10 -08:00
b5400c54df Don't link tests with NVML 2015-12-18 13:27:55 -08:00
a4de6016f8 Merge pull request #4 from lukeyeager/build-sm50
Build SM 5.0 code
2015-12-18 13:23:48 -08:00
4807909e3f Merge pull request #3 from lukeyeager/semver
Use semantic versioning
2015-12-18 13:22:19 -08:00
dd0884b707 Build SM 5.0 code 2015-12-18 13:19:50 -08:00
e1634ca6cb Use semantic versioning 2015-12-18 12:02:17 -08:00
651a6edc5c Fixed bug in MPI initialization. 2015-12-10 17:54:41 -08:00
ada5edce88 Merge pull request #1 from slayton58/int64_uint64
Add int64 and uint64 types for all algorithms and tests
2015-12-10 17:22:50 -08:00
41ce4ca9fc Add int64 and uint64 types for all algorithms and tests 2015-12-04 13:28:36 -05:00
27d32ac5d9 Fixed a race condition in reduce and broadcast. 2015-11-19 11:11:52 -08:00
0673d5f44f Initial release. 2015-11-17 11:30:40 -08:00
1523 changed files with 196129 additions and 38954 deletions

1
.dockerignore Symbolic link

@ -0,0 +1 @@
.gitignore

36
.gitignore vendored

@ -2,16 +2,52 @@ build/
dist/
torch.egg-info/
*/**/__pycache__
torch/version.py
torch/csrc/generic/TensorMethods.cpp
torch/lib/*.so*
torch/lib/*.a*
torch/lib/*.dylib*
torch/lib/*.h
torch/lib/build
torch/lib/tmp_install
torch/lib/include
torch/lib/torch_shm_manager
torch/csrc/jit/generated/*
torch/csrc/autograd/generated/*
torch/csrc/cudnn/cuDNN.cpp
torch/csrc/nn/THNN.cwrap
torch/csrc/nn/THNN.cpp
torch/csrc/nn/THCUNN.cwrap
torch/csrc/nn/THCUNN.cpp
torch/csrc/nn/THNN_generic.cwrap
torch/csrc/nn/THNN_generic.cpp
torch/csrc/nn/THNN_generic.h
torch/csrc/generated
docs/src/**/*
test/data/legacy_modules.t7
test/data/gpu_tensors.pt
test/htmlcov
test/.coverage
*/*.pyc
*/**/*.pyc
*/**/**/*.pyc
*/**/**/**/*.pyc
*/**/**/**/**/*.pyc
*/*.so*
*/**/*.so*
*/**/*.dylib*
test/data/legacy_serialized.pt
test/data/linear.pt
# IPython notebook checkpoints
.ipynb_checkpoints
# Editor temporaries
*.swn
*.swo
*.swp
*~
# OSX dir files
.DS_Store

9
.gitmodules vendored Normal file

@ -0,0 +1,9 @@
[submodule "torch/lib/gloo"]
path = torch/lib/gloo
url = https://github.com/facebookincubator/gloo
[submodule "torch/lib/pybind11"]
path = torch/lib/pybind11
url = https://github.com/pybind/pybind11
[submodule "torch/lib/nanopb"]
path = torch/lib/nanopb
url = https://github.com/nanopb/nanopb.git

.travis.yml

@ -1,32 +1,40 @@
# https://travis-ci.org/pytorch/pytorch
language: python
dist: trusty
git:
submodules: false
python:
- 2.7.8
- 2.7.9
- 2.7
- 3.3
- 3.4
- 3.5
- 3.6
- nightly
install:
- export CC="gcc-4.8"
- export CXX="g++-4.8"
- travis_retry pip install -r requirements.txt
- travis_retry pip install .
cache:
- ccache
- directories:
- $HOME/.ccache
script:
- python test/test_torch.py
- python test/test_legacy_nn.py
- python test/test_nn.py
- python test/test_autograd.py
install:
- unset CCACHE_DISABLE
- export CCACHE_DIR=$HOME/.ccache
- export CC="ccache gcc-5"
- export CXX="ccache g++-5"
- ccache --show-stats
- travis_retry pip install --upgrade pip setuptools wheel
- travis_retry pip install -r requirements.txt --only-binary=scipy
- git submodule update --init --recursive
- MAX_JOBS=8 python setup.py install
addons:
apt:
sources:
- ubuntu-toolchain-r-test
packages:
- gcc-4.8
- g++-4.8
- g++-5
script:
- OMP_NUM_THREADS=2 ./test/run_test.sh
# This reportedly works around an issue downloading packages from pypi on
# travis. Consider removing this after the underlying issue is fixed.
@ -35,3 +43,9 @@ sudo: false
matrix:
fast_finish: true
include:
env: LINT_CHECK
python: "2.7"
addons: true
install: pip install flake8
script: flake8

195
CONTRIBUTING.md Normal file

@ -0,0 +1,195 @@
## Contributing to PyTorch
If you are interested in contributing to PyTorch, your contributions will fall
into two categories:
1. You want to propose a new Feature and implement it
- post about your intended feature, and we shall discuss the design and
implementation. Once we agree that the plan looks good, go ahead and implement it.
2. You want to implement a feature or bug-fix for an outstanding issue
- Look at the outstanding issues here: https://github.com/pytorch/pytorch/issues
- Especially look at the Low Priority and Medium Priority issues
- Pick an issue and comment on it to say that you want to work on this feature
- If you need more context on a particular issue, please ask and we shall provide.
Once you finish implementing a feature or bugfix, please send a Pull Request to
https://github.com/pytorch/pytorch
If you are not familiar with creating a Pull Request, here are some guides:
- http://stackoverflow.com/questions/14680711/how-to-do-a-github-pull-request
- https://help.github.com/articles/creating-a-pull-request/
## Developing locally with PyTorch
To locally develop with PyTorch, here are some tips:
1. Uninstall all existing pytorch installs
```
conda uninstall pytorch
pip uninstall torch
pip uninstall torch # run this command twice
```
2. Locally clone a copy of PyTorch from source:
```
git clone https://github.com/pytorch/pytorch
cd pytorch
```
3. Install PyTorch in `build develop` mode:
A full set of instructions on installing PyTorch from Source are here:
https://github.com/pytorch/pytorch#from-source
The change you have to make is to replace
```
python setup.py install
```
with
```
python setup.py build develop
```
This is especially useful if you are only changing Python files.
This mode will symlink the python files from the current local source tree into the
python install.
Hence, if you modify a python file, you do not need to reinstall pytorch again and again.
For example:
- Install local pytorch in `build develop` mode
- modify your python file `torch/__init__.py` (for example)
- test functionality
- modify your python file `torch/__init__.py`
- test functionality
- modify your python file `torch/__init__.py`
- test functionality
You do not need to repeatedly install after modifying python files.
## Writing documentation
PyTorch uses [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html)
for formatting docstrings. Line length inside docstring blocks must be limited to 80 characters to
fit into Jupyter documentation popups.
## Managing multiple build trees
One downside to using `python setup.py develop` is that your development
version of pytorch will be installed globally on your account (e.g., if
you run `import torch` anywhere else, the development version will be
used).
If you want to manage multiple builds of PyTorch, you can make use of
[conda environments](https://conda.io/docs/using/envs.html) to maintain
separate Python package environments, each of which can be tied to a
specific build of PyTorch. To set one up:
```
conda create -n pytorch-myfeature
source activate pytorch-myfeature
# if you run python now, torch will NOT be installed
python setup.py build develop
```
## C++ Development tips
If you are working on the C++ code, there are a few important things that you
will want to keep in mind:
1. How to rebuild only the code you are working on, and
2. How to make rebuilds in the absence of changes go faster.
### Build only what you need.
`python setup.py build` will build everything, but since our build system is
not very optimized for incremental rebuilds, this will actually be very slow.
Far better is to only request rebuilds of the parts of the project you are
working on:
- Working on `torch/csrc`? Run `python setup.py develop` to rebuild
(NB: no `build` here!)
- Working on `torch/lib/TH`, did not make any cmake changes, and just want to
see if it compiles? Run `(cd torch/lib/build/TH && make install -j$(getconf _NPROCESSORS_ONLN))`. This
applies for any other subdirectory of `torch/lib`. **Warning: Changes you
make here will not be visible from Python.** See below.
- Working on `torch/lib` and want to run your changes / rerun cmake? Run
`python setup.py build_deps`. Note that this will rerun cmake for
every subdirectory in TH; if you are only working on one project,
consider editing `torch/lib/build_all.sh` and commenting out the
`build` lines of libraries you are not working on.
On the initial build, you can also speed things up with the environment
variables `DEBUG` and `NO_CUDA`.
- `DEBUG=1` will enable debug builds (-g -O0)
- `NO_CUDA=1` will disable compiling CUDA (in case you are developing on something not CUDA related), to save compile time.
For example:
```
NO_CUDA=1 DEBUG=1 python setup.py build develop
```
Make sure you continue to pass these flags on subsequent builds.
### Make no-op build fast.
Python `setuptools` is pretty dumb, and always rebuilds every C file in a
project. Using ccache in a situation like this is a real time-saver. However, by
default, ccache does not properly support CUDA stuff, so here are the
instructions for installing a custom `ccache` fork that has CUDA support:
```
# install and export ccache
if ! ls ~/ccache/bin/ccache
then
sudo apt-get update
sudo apt-get install -y automake autoconf
sudo apt-get install -y asciidoc
mkdir -p ~/ccache
pushd /tmp
rm -rf ccache
git clone https://github.com/colesbury/ccache -b ccbin
pushd ccache
./autogen.sh
./configure
make install prefix=~/ccache
popd
popd
mkdir -p ~/ccache/lib
mkdir -p ~/ccache/cuda
ln -s ~/ccache/bin/ccache ~/ccache/lib/cc
ln -s ~/ccache/bin/ccache ~/ccache/lib/c++
ln -s ~/ccache/bin/ccache ~/ccache/lib/gcc
ln -s ~/ccache/bin/ccache ~/ccache/lib/g++
ln -s ~/ccache/bin/ccache ~/ccache/cuda/nvcc
~/ccache/bin/ccache -M 25Gi
fi
export PATH=~/ccache/lib:$PATH
export CUDA_NVCC_EXECUTABLE=~/ccache/cuda/nvcc
```
## CUDA Development tips
If you are working on the CUDA code, here are some useful CUDA debugging tips:
1. `CUDA_DEBUG=1` will enable CUDA debugging symbols (-g -G). This is particularly
helpful in debugging device code. However, it will slow down the build process,
so use wisely.
2. `cuda-gdb` and `cuda-memcheck` are your best CUDA debugging friends. Unlike `gdb`,
`cuda-gdb` can display actual values in a CUDA tensor (rather than all zeros).
Hope this helps, and thanks for considering contributing.

41
Dockerfile Normal file

@ -0,0 +1,41 @@
FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
RUN echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
git \
curl \
vim \
ca-certificates \
libnccl2=2.0.5-2+cuda8.0 \
libnccl-dev=2.0.5-2+cuda8.0 \
libjpeg-dev \
libpng-dev &&\
rm -rf /var/lib/apt/lists/*
ENV PYTHON_VERSION=3.6
RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
# /opt/conda/bin/conda install conda-build && \
/opt/conda/bin/conda create -y --name pytorch-py$PYTHON_VERSION python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl&& \
/opt/conda/bin/conda clean -ya
ENV PATH /opt/conda/envs/pytorch-py$PYTHON_VERSION/bin:$PATH
RUN conda install --name pytorch-py$PYTHON_VERSION -c soumith magma-cuda80
# This must be done before pip so that requirements.txt is available
WORKDIR /opt/pytorch
COPY . .
RUN git submodule update --init
RUN TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1+PTX" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
pip install -v .
RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v .
WORKDIR /workspace
RUN chmod -R a+w /workspace

38
LICENSE Normal file

@ -0,0 +1,38 @@
Copyright (c) 2016- Facebook, Inc (Adam Paszke)
Copyright (c) 2014- Facebook, Inc (Soumith Chintala)
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. Neither the names of Facebook, Deepmind Technologies, NYU, NEC Laboratories America
and IDIAP Research Institute nor the names of its contributors may be
used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

Makefile

@ -1,22 +0,0 @@
# Add main target here - setup.py doesn't understand the need to recompile
# after generic files change
.PHONY: all clean torch
all: install
torch:
python3 setup.py build
install:
python3 setup.py install
clean:
@rm -rf build
@rm -rf dist
@rm -rf torch.egg-info
@rm -rf tools/__pycache__
@rm -rf torch/csrc/generic/TensorMethods.cpp
@rm -rf torch/lib/tmp_install
@rm -rf torch/lib/build
@rm -rf torch/lib/*.so*
@rm -rf torch/lib/*.h

484
README.md

@ -1,271 +1,249 @@
# pytorch [alpha-1] ![Build Status](https://travis-ci.com/apaszke/pytorch.svg?token=x5muYzmNgtGJxk6DvWMN&branch=master)
<p align="center"><img width="40%" src="docs/source/_static/img/pytorch-logo-dark.png" /></p>
The project is still under active development and is likely to drastically change in short periods of time.
We will be announcing API changes and important developments via a newsletter, github issues and post a link to the issues on slack.
Please remember that at this stage, this is an invite-only closed alpha, and please don't distribute code further.
This is done so that we can control development tightly and rapidly during the initial phases with feedback from you.
--------------------------------------------------------------------------------
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system
You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed.
We are in an early-release beta. Expect some adventures and rough edges.
- [More about PyTorch](#more-about-pytorch)
- [Installation](#installation)
- [Binaries](#binaries)
- [From Source](#from-source)
- [Docker Image](#docker-image)
- [Getting Started](#getting-started)
- [Communication](#communication)
- [Releases and Contributing](#releases-and-contributing)
- [The Team](#the-team)
| System | 2.7 | 3.5 |
| --- | --- | --- |
| Linux CPU | [![Build Status](https://travis-ci.org/pytorch/pytorch.svg?branch=master)](https://travis-ci.org/pytorch/pytorch) | [![Build Status](https://travis-ci.org/pytorch/pytorch.svg?branch=master)](https://travis-ci.org/pytorch/pytorch) |
| Linux GPU | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py2-linux)](https://build.pytorch.org/job/pytorch-master-py2-linux) | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py3-linux)](https://build.pytorch.org/job/pytorch-master-py3-linux) |
| macOS CPU | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py2-osx-cpu)](https://build.pytorch.org/job/pytorch-master-py2-osx-cpu) | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py3-osx-cpu)](https://build.pytorch.org/job/pytorch-master-py3-osx-cpu) |
## More about PyTorch
At a granular level, PyTorch is a library that consists of the following components:
<table>
<tr>
<td><b> torch </b></td>
<td> a Tensor library like NumPy, with strong GPU support </td>
</tr>
<tr>
<td><b> torch.autograd </b></td>
<td> a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch </td>
</tr>
<tr>
<td><b> torch.nn </b></td>
<td> a neural networks library deeply integrated with autograd designed for maximum flexibility </td>
</tr>
<tr>
<td><b> torch.multiprocessing </b></td>
<td> Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training. </td>
</tr>
<tr>
<td><b> torch.utils </b></td>
<td> DataLoader, Trainer and other utility functions for convenience </td>
</tr>
<tr>
<td><b> torch.legacy(.nn/.optim) </b></td>
<td> legacy code that has been ported over from torch for backward compatibility reasons </td>
</tr>
</table>
Usually one uses PyTorch either as:
- a replacement for NumPy to use the power of GPUs.
- a deep learning research platform that provides maximum flexibility and speed
Elaborating further:
### A GPU-Ready Tensor Library
If you use NumPy, then you have used Tensors (a.k.a ndarray).
<p align=center><img width="30%" src="docs/source/_static/img/tensor_illustration.png" /></p>
PyTorch provides Tensors that can live either on the CPU or the GPU, and accelerate
compute by a huge amount.
We provide a wide variety of tensor routines to accelerate and fit your scientific computation needs
such as slicing, indexing, math operations, linear algebra, reductions.
And they are fast!
### Dynamic Neural Networks: Tape-Based Autograd
PyTorch has a unique way of building neural networks: using and replaying a tape recorder.
Most frameworks such as TensorFlow, Theano, Caffe and CNTK have a static view of the world.
One has to build a neural network, and reuse the same structure again and again.
Changing the way the network behaves means that one has to start from scratch.
With PyTorch, we use a technique called reverse-mode auto-differentiation, which allows you to
change the way your network behaves arbitrarily with zero lag or overhead. Our inspiration comes
from several research papers on this topic, as well as current and past work such as
[torch-autograd](https://github.com/twitter/torch-autograd),
[autograd](https://github.com/HIPS/autograd),
[Chainer](http://chainer.org), etc.
While this technique is not unique to PyTorch, it's one of the fastest implementations of it to date.
You get the best of speed and flexibility for your crazy research.
<p align=center><img width="80%" src="docs/source/_static/img/dynamic_graph.gif" /></p>
### Python First
PyTorch is not a Python binding into a monolithic C++ framework.
It is built to be deeply integrated into Python.
You can use it naturally like you would use NumPy / SciPy / scikit-learn etc.
You can write your new neural network layers in Python itself, using your favorite libraries
and use packages such as Cython and Numba.
Our goal is to not reinvent the wheel where appropriate.
### Imperative Experiences
PyTorch is designed to be intuitive, linear in thought and easy to use.
When you execute a line of code, it gets executed. There isn't an asynchronous view of the world.
When you drop into a debugger, or receive error messages and stack traces, understanding them is straightforward.
The stack trace points to exactly where your code was defined.
We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines.
### Fast and Lean
PyTorch has minimal framework overhead. We integrate acceleration libraries
such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed.
At the core, its CPU and GPU Tensor and neural network backends
(TH, THC, THNN, THCUNN) are written as independent libraries with a C99 API.
They are mature and have been tested for years.
Hence, PyTorch is quite fast whether you run small or large neural networks.
The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives.
We've written custom memory allocators for the GPU to make sure that
your deep learning models are maximally memory efficient.
This enables you to train bigger deep learning models than before.
### Extensions without Pain
Writing new neural network modules, or interfacing with PyTorch's Tensor API was designed to be straightforward
and with minimal abstractions.
You can write new neural network layers in Python using the torch API
[or your favorite NumPy-based libraries such as SciPy](http://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html).
If you want to write your layers in C/C++, we provide an extension API based on
[cffi](http://cffi.readthedocs.io/en/latest/) that is efficient and with minimal boilerplate.
There is no wrapper code that needs to be written. You can see [a tutorial here](http://pytorch.org/tutorials/advanced/c_extension.html) and [an example here](https://github.com/pytorch/extension-ffi).
## Installation
### Binaries
Commands to install from binaries via Conda or pip wheels are on our website:
[http://pytorch.org](http://pytorch.org)
### From Source
If you are installing from source, we highly recommend installing an [Anaconda](https://www.continuum.io/downloads) environment.
You will get a high-quality BLAS library (MKL) and you get a controlled compiler version regardless of your Linux distro.
Once you have [Anaconda](https://www.continuum.io/downloads) installed, here are the instructions.
If you want to compile with CUDA support, install
- [NVIDIA CUDA](https://developer.nvidia.com/cuda-downloads) 7.5 or above
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v6.x or above
If you want to disable CUDA support, export environment variable `NO_CUDA=1`.
#### Install optional dependencies
On Linux
```bash
pip install -r requirements.txt
pip install .
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]
# Install basic dependencies
conda install numpy pyyaml mkl setuptools cmake cffi
# Add LAPACK support for the GPU
conda install -c soumith magma-cuda80 # or magma-cuda75 if CUDA 7.5
```
To install with CUDA support change `WITH_CUDA = False` to `WITH_CUDA = True` in `setup.py`.
On OSX
```bash
export CMAKE_PREFIX_PATH=[anaconda root directory]
conda install numpy pyyaml setuptools cmake cffi
```
#### Get the PyTorch source
```bash
git clone --recursive https://github.com/pytorch/pytorch
```
#### Install PyTorch
On Linux
```bash
python setup.py install
```
On OSX
```bash
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install
```
### Docker image
Dockerfile is supplied to build images with cuda support and cudnn v6. Build as usual
```
docker build -t pytorch .
```
Dockerfile to build with cuda 9 and cudnn v7 (with Volta support) is in tools/docker, the build command is
```
docker build -t pytorch_cuda9 -f tools/docker/Dockerfile9 .
```
Alternatively, if you want to use a runtime image, you can use the pre-built one from Docker Hub and run with nvidia-docker:
```
nvidia-docker run --rm -ti --ipc=host pytorch/pytorch:latest
```
Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g.
for multithreaded data loaders) the default shared memory segment size that the container runs with is not enough, and you
should increase shared memory size either with `--ipc=host` or `--shm-size` command line options to `nvidia-docker run`.
## Getting Started
Three pointers to get you started:
- [Tutorials: get you started with understanding and using PyTorch](http://pytorch.org/tutorials/)
- [Examples: easy to understand pytorch code across all domains](https://github.com/pytorch/examples)
- [The API Reference](http://pytorch.org/docs/)
## Communication
* github issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.
* slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . If you need a slack invite, ping me at soumith@pytorch.org
* forums: discuss implementations, research, etc. http://discuss.pytorch.org
* GitHub issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.
* Slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . Our slack channel is invite-only to promote a healthy balance between power-users and beginners. If you need a slack invite, ping us at soumith@pytorch.org
* newsletter: no-noise, one-way email newsletter with important announcements about pytorch. You can sign-up here: http://eepurl.com/cbG0rv
## Timeline
## Releases and Contributing
We will run the alpha releases weekly for 6 weeks.
After that, we will reevaluate progress, and if we are ready, we will hit beta-0. If not, we will do another two weeks of alpha.
PyTorch has a 90 day release cycle (major releases).
Its current state is Beta; we expect no obvious bugs. Please let us know if you encounter a bug by [filing an issue](https://github.com/pytorch/pytorch/issues).
* alpha-0: Working versions of torch, cutorch, nn, cunn, optim fully unit tested with seamless numpy conversions
* alpha-1: Serialization to/from disk with sharing intact. initial release of the new neuralnets package based on a Chainer-like design
* alpha-2: sharing tensors across processes for hogwild training or data-loading processes. a rewritten optim package for this new nn.
* alpha-3: binary installs (prob will take @alexbw 's help here), contbuilds, etc.
* alpha-4: a ton of examples across vision, nlp, speech, RL -- this phase might make us rethink parts of the APIs, and hence want to do this in alpha than beta
* alpha-5: Putting a simple and efficient story around multi-machine training. Probably simplistic like torch-distlearn. Building the website, release scripts, more documentation, etc.
* alpha-6: [no plan yet]
We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.
The beta phases will be leaning more towards working with all of you, covering your use-cases, and active development on non-core aspects.
If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without discussion might end up resulting in a rejected PR, because we might be taking the core in a different direction than you might be aware of.
## pytorch vs torch: important changes
## The Team
We've decided that it's time to rewrite/update parts of the old torch API, even if it means losing some of backward compatibility (we can hack up a model converter that converts correctly).
This section lists the biggest changes, and suggests how to shift from torch to pytorch.
PyTorch is a community driven project with several skillful engineers and researchers contributing to it.
For now there's no pytorch documentation.
Since all currently implemented modules are very similar to the old ones, it's best to use the torch7 docs for now (keeping in mind the several differences described below).
PyTorch is currently maintained by [Adam Paszke](https://apaszke.github.io/), [Sam Gross](https://github.com/colesbury), [Soumith Chintala](http://soumith.ch) and [Gregory Chanan](https://github.com/gchanan) with major contributions coming from 10s of talented individuals in various forms and means.
A non-exhaustive but growing list needs to mention: Trevor Killeen, Sasank Chilamkurthy, Sergey Zagoruyko, Adam Lerer, Francisco Massa, Alykhan Tejani, Luca Antiga, Alban Desmaison, Andreas Kopf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein, Christian Sarofeen, Martin Raison, Edward Yang, Zachary Devito.
### Library structure
All core modules are merged into a single repository.
Most of them will be rewritten and will be completely new (more on this below), but we're providing a Python version of old packages under torch.legacy namespace.
* torch (torch)
* cutorch (torch.cuda)
* nn (torch.legacy.nn)
* cunn (torch.legacy.cunn)
* optim (torch.legacy.optim)
* nngraph (torch.legacy.nngraph - not implemented yet)
### 0-based indexing
pytorch uses 0-based indexing everywhere.
This includes arguments to `index*` functions and nn criterion weights.
Under the hood, on the C side, we've changed logic on TH / THC / THNN / THCUNN to introduce a TH_INDEX_BASE compile-time definition to switch between 0 and 1 indexing logic.
### New Tensor API
**All methods operating on tensors are now out-of-place by default.**
This means that although `a.add(b)` used to have a side-effect of mutating the elements in a, it will now return a new Tensor, holding the result.
All methods that mutate the Tensor/Storage are now marked with a trailing underscore (including `copy` -> `copy_`, `fill` -> `fill_`, `set` -> `set_`, etc.).
Most math methods have in-place counterparts, so the equivalent of `a.add(b)` in Lua is now `a.add_(b)` (or `torch.add(a, a, b)`, which is not recommended in this case).
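A two-line illustration of the convention described above, using the current torch API:

```python
import torch

a = torch.ones(3)
b = torch.ones(3)

c = a.add(b)   # out-of-place: returns a new tensor, `a` is unchanged
a.add_(b)      # in-place: the trailing underscore mutates `a`
```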
### CUDA module
All tensors have their CUDA counterparts in torch.cuda module.
There is no `torch.cuda.setDevice` anymore. By default, the 0th device is always selected, but code can be placed in a `with` statement to change it:
```python
with torch.cuda.device(1):
a = torch.cuda.FloatTensor(10) # a is allocated on GPU1
```
Calling `.cuda()` on a tensor no longer converts it to a GPU float tensor, but to a CUDA tensor of the same type located on the currently selected device.
So, for example: `a = torch.LongTensor(10).cuda() # a is a CudaLongTensor`
Calling `.cuda(3)` will send it to the third device.
`.cuda()` can also be used to transfer CUDA tensors between devices (calling it on a GPU tensor with a different device selected will copy it onto the current device).
```python
a = torch.LongTensor(10)
b = a.cuda() # b is a torch.cuda.LongTensor placed on GPU0
c = a.cuda(2) # c is a torch.cuda.LongTensor placed on GPU2
with torch.cuda.device(1):
d = b.cuda() # d is a copy of b, but on GPU1
e = d.cuda() # a no-op, d is already on current GPU, e is d == True
```
Also, setting the device now only matters for specifying where new Tensors are allocated. You can perform operations on CUDA Tensors irrespective of the currently selected device (but all arguments have to be on the same device) - the result will also be allocated there. See below for an example:
```python
a = torch.randn(2, 2).cuda()
b = torch.randn(2, 2).cuda()
with torch.cuda.device(1):
c = a + b # c is on GPU0
d = torch.randn(2, 2).cuda() # d is on GPU1
```
In the near future, we also plan to use a CUDA allocator, which alleviates the problem of cudaMalloc/cudaFree being a sync point.
This will help us to not worry about using buffers for every intermediate computation in a module if one wants to do multi-GPU training, for example.
See: https://github.com/torch/cutorch/pull/443
### Numpy integration
Because numpy is a core numerical package in Python, and is used by many other libraries like matplotlib, we've implemented a two-way bridge between pytorch and numpy.
```python
a = torch.randn(2, 2)
b = a.numpy() # b is a numpy array of type corresponding to a
# no memory copy is performed, they share the same storage
c = numpy.zeros((5, 5))
d = torch.DoubleTensor(c) # it's possible to construct Tensors from numpy arrays
# d shares memory with c - there's no copy
```
### New neural network module
After looking at several framework designs, looking at the current design of `nn` and thinking through a few original design ideas, this is what we've converged to:
* Adopt a Chainer-like design
* Makes it extremely natural to express Recurrent Nets and weight sharing
* Each module can operate in-place, but marks used variables as dirty - errors will be raised if they're used again
* RNN example:
```python
class Network(nn.Container):
def __init__(self):
super(Network, self).__init__(
conv1=nn.SpatialConvolution(3, 16, 3, 3, 1, 1),
relu1=nn.ReLU(True),
lstm=nn.LSTM(),
)
def __call__(self, input):
y = self.conv1(input)
y = self.relu1(y)
y = self.lstm(y)
return y
model = Network()
input = nn.Variable(torch.zeros(256, 3, 224, 224))
output = model(input)
loss = 0
for i in range(ITERS):
input, target = ...
# That's all you need for an RNN
for t in range(TIMESTEPS):
loss += loss_fn(model(input), target)
loss.backward()
```
* Here, nn.Variable will have a complete tape-based automatic differentiation implemented
* To access states, have hooks for forward / backward (this also makes multi-GPU easier to implement)
* This has the advantage of not having to worry about in-place / out-of-place operators for accessing .output or .gradInput
* When writing the module, make sure debuggability is straightforward. Dropping into pdb and inspecting things should be natural, especially when going over the backward graph.
* Pulling handles to a module after constructing a chain should be very natural (apart from having a handle at construction)
* It's easy, since modules are assigned as Container properties
* Drop overly verbose names. Example:
* SpatialConvolution → conv2d
* VolumetricConvolution → conv3d
#### Some notes on new nn implementation
As shown above, structure of the networks is fully defined by control-flow embedded in the code. There are no rigid containers known from Lua. You can put an `if` in the middle of your model and freely branch depending on any condition you can come up with. All operations are registered in the computational graph history.
There are two main objects that make this possible - variables and functions. They will be denoted as squares and circles respectively.
![Variable and function symbols](http://students.mimuw.edu.pl/~ap360585/__torch_img/variable_function.png)
Variables are the objects that hold a reference to a tensor (and optionally to gradient w.r.t. that tensor), and to the function in the computational graph that created it. Variables created explicitly by the user (`Variable(tensor)`) have a Leaf function node associated with them.
![Variable and leaf function](http://students.mimuw.edu.pl/~ap360585/__torch_img/variable_leaf.png)
Functions are simple classes that define a function from a tuple of inputs to a tuple of outputs, and a formula for computing the gradient w.r.t. its inputs. Function objects are instantiated to hold references to other functions, and these references allow reconstructing the history of a computation. An example graph for a linear layer (`Wx + b`) is shown below.
![Linear layer](http://students.mimuw.edu.pl/~ap360585/__torch_img/linear.png)
Please note that function objects never hold references to Variable objects, except for when they're necessary in the backward pass. This allows freeing all the unnecessary intermediate values. A good example of this is addition when computing e.g. `y = Wx + My`:
![Freeing intermediate values](http://students.mimuw.edu.pl/~ap360585/__torch_img/intermediate_free.png)
The matrix multiplication operation keeps references to its inputs because it will need them, but addition doesn't need `Wx` and `My` after it computes the result, so as soon as they go out of scope they are freed. To access intermediate values in the forward pass you can either copy them when you still have a reference, or you can use a system of hooks that can be attached to any function. Hooks also allow you to access and inspect gradients inside the graph.
Another nice thing about this is that a single layer doesn't hold any state other than its parameters (all intermediate values are alive as long as the graph references them), so it can be used multiple times before calling backward. This is especially convenient when training RNNs. You can use the same network for all timesteps and the gradients will sum up automatically.
To compute backward pass you can call `.backward()` on a variable if it's a scalar (a 1-element Variable), or you can provide a gradient tensor of matching shape if it's not. This creates an execution engine object that manages the whole backward pass. It's been introduced, so that the code for analyzing the graph and scheduling node processing order is decoupled from other parts, and can be easily replaced. Right now it's simply processing the nodes in topological order, without any prioritization, but in the future we can implement algorithms and heuristics for scheduling independent nodes on different GPU streams, deciding which branches to compute first, etc.
### Serialization
Pickling tensors is supported, but requires making a temporary copy of all data and breaks sharing.
For this reason we're providing `torch.load` and `torch.save`, that are free of these problems.
They have the same interfaces as `pickle.load` (file object) and `pickle.dump` (serialized object, file object) respectively.
For now the only requirement is that the file should have a `fileno` method, which returns a file descriptor number (this is already implemented by objects returned by `open`).
Objects are serialized in a tar archive consisting of four files:
`sys_info` - protocol version, byte order, long size, etc.
`pickle` - pickled object
`tensors` - tensor metadata
`storages` - serialized data
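A minimal sketch of the interface described above (file objects returned by `open` satisfy the `fileno` requirement):

```python
import torch

t = torch.randn(2, 3)

with open("tensor.pt", "wb") as f:
    torch.save(t, f)      # (object, file) - mirrors pickle.dump

with open("tensor.pt", "rb") as f:
    t2 = torch.load(f)    # (file) - mirrors pickle.load
```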
### Multi-GPU
Proposed solutions need to address:
* Kernel launch latency
* without affecting the user's code
* Implementation should be as transparent as possible
* Should we expose DPT as:
* Split
* ParallelApply (scheduling kernels in breadth first order, to address launch latency)
* Join
* In backward phase, send parameters as soon as the module finishes computation
**Rough solution:**
```python
# This is an example of a network that has a data parallel part inside
#
# B is data parallel
# +->A+-->B+-+
# +--+ +->D
# +->C+------+
class Network(nn.Container):
def __init__(self):
super(Network, self).__init__(
A = ...,
B = GPUReplicate(B, [0, 1, 2, 3]), # Copies the module onto a list of GPUs
C = ...,
D = ...
)
def __call__(self, x):
a = self.A(x)
c = self.C(x)
a_split = Split(a) # a_split is a list of Tensors placed on different devices
b = ParallelApply(self.B, a_split) # self.B is a list-like object containing copies of B
d_input = Join(b + [c]) # gathers Tensors on a single GPU
return self.D(d_input)
```
Each module is assigned to a single GPU.
For Kernel Launch Latency:
* Python threading
* Generators
For parameter reductions ASAP:
* In the forward pass, register a hook on every parameter, which is evaluated as soon as the last backward is executed for that parameter. The hook will then “all-reduce” those parameters across GPUs
* Problem with multiple forward calls - how do you know that the parameters won't be used anymore?
* Well, last usage in backward graph = first usage in forward graph, so this should be straightforward
#### Multiprocessing
We plan to make it as straightforward as possible, to use pytorch in a multiprocessing environment.
For this, we plan to implement a .share() method for tensors that will enable them to be shared across processes seamlessly.
One can use [python multiprocessing](https://docs.python.org/2/library/multiprocessing.html) seamlessly.
Note: this project is unrelated to [hughperkins/pytorch](https://github.com/hughperkins/pytorch) with the same name. Hugh is a valuable contributor in the Torch community and has helped with many things Torch and PyTorch.


@ -1,2 +0,0 @@
* `split` and `chunk` no longer accept a list (table in Lua) as optional first argument


@ -685,17 +685,21 @@ endif()
# CUDA_NVCC_EXECUTABLE
cuda_find_host_program(CUDA_NVCC_EXECUTABLE
NAMES nvcc
PATHS "${CUDA_TOOLKIT_ROOT_DIR}"
ENV CUDA_PATH
ENV CUDA_BIN_PATH
PATH_SUFFIXES bin bin64
NO_DEFAULT_PATH
)
# Search default search paths, after we search our own set of paths.
cuda_find_host_program(CUDA_NVCC_EXECUTABLE nvcc)
mark_as_advanced(CUDA_NVCC_EXECUTABLE)
if(DEFINED ENV{CUDA_NVCC_EXECUTABLE})
SET(CUDA_NVCC_EXECUTABLE "$ENV{CUDA_NVCC_EXECUTABLE}")
else(DEFINED ENV{CUDA_NVCC_EXECUTABLE})
cuda_find_host_program(CUDA_NVCC_EXECUTABLE
NAMES nvcc
PATHS "${CUDA_TOOLKIT_ROOT_DIR}"
ENV CUDA_PATH
ENV CUDA_BIN_PATH
PATH_SUFFIXES bin bin64
NO_DEFAULT_PATH
)
# Search default search paths, after we search our own set of paths.
cuda_find_host_program(CUDA_NVCC_EXECUTABLE nvcc)
mark_as_advanced(CUDA_NVCC_EXECUTABLE)
endif(DEFINED ENV{CUDA_NVCC_EXECUTABLE})
if(CUDA_NVCC_EXECUTABLE AND NOT CUDA_VERSION)
# Compute the version.


@ -63,11 +63,16 @@ function(CUDA_DETECT_INSTALLED_GPUS OUT_VARIABLE)
"}\n")
execute_process(COMMAND "${CUDA_NVCC_EXECUTABLE}" "--run" "${cufile}"
"-ccbin" ${CMAKE_CXX_COMPILER}
WORKING_DIRECTORY "${PROJECT_BINARY_DIR}/CMakeFiles/"
RESULT_VARIABLE nvcc_res OUTPUT_VARIABLE nvcc_out
ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)
if(nvcc_res EQUAL 0)
# only keep the last line of nvcc_out
STRING(REGEX REPLACE ";" "\\\\;" nvcc_out "${nvcc_out}")
STRING(REGEX REPLACE "\n" ";" nvcc_out "${nvcc_out}")
list(GET nvcc_out -1 nvcc_out)
string(REPLACE "2.1" "2.1(2.0)" nvcc_out "${nvcc_out}")
set(CUDA_GPU_DETECT_OUTPUT ${nvcc_out} CACHE INTERNAL "Returned GPU architetures from detect_gpus tool" FORCE)
endif()
@ -116,13 +121,13 @@ function(CUDA_SELECT_NVCC_ARCH_FLAGS out_variable)
set(add_ptx TRUE)
set(arch_name ${CMAKE_MATCH_1})
endif()
if(arch_name MATCHES "([0-9]\\.[0-9])$")
if(arch_name MATCHES "(^[0-9]\\.[0-9](\\([0-9]\\.[0-9]\\))?)$")
set(arch_bin ${CMAKE_MATCH_1})
set(arch_ptx ${arch_bin})
else()
# Look for it in our list of known architectures
if(${arch_name} STREQUAL "Fermi")
set(arch_bin 2.0 "2.1(2.0)")
set(arch_bin "2.0 2.1(2.0)")
elseif(${arch_name} STREQUAL "Kepler+Tegra")
set(arch_bin 3.2)
elseif(${arch_name} STREQUAL "Kepler+Tesla")
@ -173,11 +178,11 @@ function(CUDA_SELECT_NVCC_ARCH_FLAGS out_variable)
# Tell NVCC to add binaries for the specified GPUs
foreach(arch ${cuda_arch_bin})
if(arch MATCHES "([0-9]+)\\(([0-9]+)\\)")
# User explicitly specified PTX for the concrete BIN
# User explicitly specified ARCH for the concrete CODE
list(APPEND nvcc_flags -gencode arch=compute_${CMAKE_MATCH_2},code=sm_${CMAKE_MATCH_1})
list(APPEND nvcc_archs_readable sm_${CMAKE_MATCH_1})
else()
# User didn't explicitly specify PTX for the concrete BIN, we assume PTX=BIN
# User didn't explicitly specify ARCH for the concrete CODE, we assume ARCH=CODE
list(APPEND nvcc_flags -gencode arch=compute_${arch},code=sm_${arch})
list(APPEND nvcc_archs_readable sm_${arch})
endif()

27
docs/Makefile Normal file

@ -0,0 +1,27 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = PyTorch
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
docset: html
doc2dash --name $(SPHINXPROJ) --icon $(SOURCEDIR)/_static/img/pytorch-logo-flame.png --enable-js --online-redirect-url http://pytorch.org/docs/ --force $(BUILDDIR)/html/
# Manually fix because Zeal doesn't deal well with `icon.png`-only at 2x resolution.
cp $(SPHINXPROJ).docset/icon.png $(SPHINXPROJ).docset/icon@2x.png
convert $(SPHINXPROJ).docset/icon@2x.png -resize 16x16 $(SPHINXPROJ).docset/icon.png
.PHONY: help Makefile docset
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

36
docs/make.bat Normal file

@ -0,0 +1,36 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
set SPHINXPROJ=PyTorch
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
:end
popd

2
docs/requirements.txt Normal file

@ -0,0 +1,2 @@
sphinx
-e git://github.com/snide/sphinx_rtd_theme.git#egg=sphinx_rtd_theme


@ -0,0 +1,118 @@
body {
font-family: "Lato","proxima-nova","Helvetica Neue",Arial,sans-serif;
}
/* Default header fonts are ugly */
h1, h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend, p.caption {
font-family: "Lato","proxima-nova","Helvetica Neue",Arial,sans-serif;
}
/* Use white for docs background */
.wy-side-nav-search {
background-color: #fff;
}
.wy-nav-content-wrap, .wy-menu li.current > a {
background-color: #fff;
}
@media screen and (min-width: 1400px) {
.wy-nav-content-wrap {
background-color: rgba(0, 0, 0, 0.0470588);
}
.wy-nav-content {
background-color: #fff;
}
}
/* Fixes for mobile */
.wy-nav-top {
background-color: #fff;
background-image: url('../img/pytorch-logo-dark.svg');
background-repeat: no-repeat;
background-position: center;
padding: 0;
margin: 0.4045em 0.809em;
color: #333;
}
.wy-nav-top > a {
display: none;
}
@media screen and (max-width: 768px) {
.wy-side-nav-search>a img.logo {
height: 60px;
}
}
/* This is needed to ensure that logo above search scales properly */
.wy-side-nav-search a {
display: block;
}
/* This ensures that multiple constructors will remain in separate lines. */
.rst-content dl:not(.docutils) dt {
display: table;
}
/* Use our red for literals (it's very similar to the original color) */
.rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
color: #F05732;
}
.rst-content tt.xref, a .rst-content tt, .rst-content tt.xref,
.rst-content code.xref, a .rst-content tt, a .rst-content code {
color: #404040;
}
/* Change link colors (except for the menu) */
a {
color: #F05732;
}
a:hover {
color: #F05732;
}
a:visited {
color: #D44D2C;
}
.wy-menu a {
color: #b3b3b3;
}
.wy-menu a:hover {
color: #b3b3b3;
}
/* Default footer text is quite big */
footer {
font-size: 80%;
}
footer .rst-footer-buttons {
font-size: 125%; /* revert footer settings - 1/80% = 125% */
}
footer p {
font-size: 100%;
}
/* For hidden headers that appear in TOC tree */
/* see http://stackoverflow.com/a/32363545/3343043 */
.rst-content .hidden-section {
display: none;
}
nav .hidden-section {
display: inherit;
}
.wy-side-nav-search>div.version {
color: #000;
}

Binary file not shown.


Binary file not shown.


@ -0,0 +1,24 @@
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 21.0.0, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
viewBox="0 0 199.7 40.2" style="enable-background:new 0 0 199.7 40.2;" xml:space="preserve">
<style type="text/css">
.st0{fill:#F05732;}
.st1{fill:#9E529F;}
.st2{fill:#333333;}
</style>
<path class="st0" d="M102.7,12.2c-1.3-1-1.8,3.9-4.4,3.9c-3,0-4-13-6.3-13c-0.7,0-0.8-0.4-7.9,21.3c-2.9,9,4.4,15.8,11.8,15.8
c4.6,0,12.3-3,12.3-12.6C108.2,20.5,104.7,13.7,102.7,12.2z M95.8,35.3c-3.7,0-6.7-3.1-6.7-7c0-3.9,3-7,6.7-7s6.7,3.1,6.7,7
C102.5,32.1,99.5,35.3,95.8,35.3z"/>
<path class="st1" d="M99.8,0c-0.5,0-1.8,2.5-1.8,3.6c0,1.5,1,2,1.8,2c0.8,0,1.8-0.5,1.8-2C101.5,2.5,100.2,0,99.8,0z"/>
<path class="st2" d="M0,39.5V14.9h11.5c5.3,0,8.3,3.6,8.3,7.9c0,4.3-3,7.9-8.3,7.9H5.2v8.8H0z M14.4,22.8c0-2.1-1.6-3.3-3.7-3.3H5.2
v6.6h5.5C12.8,26.1,14.4,24.8,14.4,22.8z"/>
<path class="st2" d="M35.2,39.5V29.4l-9.4-14.5h6l6.1,9.8l6.1-9.8h5.9l-9.4,14.5v10.1H35.2z"/>
<path class="st2" d="M63.3,39.5v-20h-7.2v-4.6h19.6v4.6h-7.2v20H63.3z"/>
<path class="st2" d="M131.4,39.5l-4.8-8.7h-3.8v8.7h-5.2V14.9H129c5.1,0,8.3,3.4,8.3,7.9c0,4.3-2.8,6.7-5.4,7.3l5.6,9.4H131.4z
M131.9,22.8c0-2-1.6-3.3-3.7-3.3h-5.5v6.6h5.5C130.3,26.1,131.9,24.9,131.9,22.8z"/>
<path class="st2" d="M145.6,27.2c0-7.6,5.7-12.7,13.1-12.7c5.4,0,8.5,2.9,10.3,6l-4.5,2.2c-1-2-3.2-3.6-5.8-3.6
c-4.5,0-7.7,3.4-7.7,8.1c0,4.6,3.2,8.1,7.7,8.1c2.5,0,4.7-1.6,5.8-3.6l4.5,2.2c-1.7,3.1-4.9,6-10.3,6
C151.3,39.9,145.6,34.7,145.6,27.2z"/>
<path class="st2" d="M194.5,39.5V29.1h-11.6v10.4h-5.2V14.9h5.2v9.7h11.6v-9.7h5.3v24.6H194.5z"/>
</svg>


Binary file not shown.


@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
height="40.200001"
width="40.200001"
xml:space="preserve"
viewBox="0 0 40.200002 40.2"
y="0px"
x="0px"
id="Layer_1"
version="1.1"><metadata
id="metadata4717"><rdf:RDF><cc:Work
rdf:about=""><dc:format>image/svg+xml</dc:format><dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" /><dc:title></dc:title></cc:Work></rdf:RDF></metadata><defs
id="defs4715" /><style
id="style4694"
type="text/css">
.st0{fill:#F05732;}
.st1{fill:#9E529F;}
.st2{fill:#333333;}
</style><path
style="fill:#f05732"
id="path4696"
d="m 26.975479,12.199999 c -1.3,-1 -1.8,3.9 -4.4,3.9 -3,0 -4,-12.9999998 -6.3,-12.9999998 -0.7,0 -0.8,-0.4 -7.9000003,21.2999998 -2.9000001,9 4.4000003,15.8 11.8000003,15.8 4.6,0 12.3,-3 12.3,-12.6 0,-7.1 -3.5,-13.9 -5.5,-15.4 z m -6.9,23.1 c -3.7,0 -6.7,-3.1 -6.7,-7 0,-3.9 3,-7 6.7,-7 3.7,0 6.7,3.1 6.7,7 0,3.8 -3,7 -6.7,7 z"
class="st0" /><path
style="fill:#9e529f"
id="path4698"
d="m 24.075479,-7.6293945e-7 c -0.5,0 -1.8,2.49999996293945 -1.8,3.59999996293945 0,1.5 1,2 1.8,2 0.8,0 1.8,-0.5 1.8,-2 -0.1,-1.1 -1.4,-3.59999996293945 -1.8,-3.59999996293945 z"
class="st1" /></svg>


Binary file not shown.


@ -0,0 +1,15 @@
{% extends "!layout.html" %}
{% block footer %}
{{ super() }}
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-90545585-1', 'auto');
ga('send', 'pageview');
</script>
{% endblock %}

71
docs/source/autograd.rst Normal file

@ -0,0 +1,71 @@
.. role:: hidden
:class: hidden-section
Automatic differentiation package - torch.autograd
==================================================
.. automodule:: torch.autograd
.. currentmodule:: torch.autograd
.. autofunction:: backward
.. autofunction:: grad
Variable
--------
API compatibility
^^^^^^^^^^^^^^^^^
The Variable API is nearly the same as the regular Tensor API (with the exception
of a couple of in-place methods that would overwrite inputs required for
gradient computation). In most cases Tensors can be safely replaced with
Variables and the code will continue to work just fine. Because of this,
we're not documenting all the operations on Variables, and you should
refer to the :class:`torch.Tensor` docs for this purpose.
In-place operations on Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Supporting in-place operations in autograd is a hard matter, and we discourage
their use in most cases. Autograd's aggressive buffer freeing and reuse makes
it very efficient and there are very few occasions when in-place operations
actually lower memory usage by any significant amount. Unless you're operating
under heavy memory pressure, you might never need to use them.
In-place correctness checks
^^^^^^^^^^^^^^^^^^^^^^^^^^^
All :class:`Variable` s keep track of in-place operations applied to them, and
if the implementation detects that a variable was saved for backward in one of
the functions, but it was modified in-place afterwards, an error will be raised
once the backward pass is started. This ensures that if you're using in-place
functions and not seeing any errors, you can be sure that the computed
gradients are correct.
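As a rough sketch of what this looks like in practice (using ``tanh``, whose backward formula saves its output)::

    import torch
    from torch.autograd import Variable

    x = Variable(torch.randn(3), requires_grad=True)
    y = x.tanh()        # tanh saves its output for the backward pass
    y.add_(1)           # the in-place change marks the saved output as dirty
    y.sum().backward()  # expected to raise a RuntimeError about an in-place modification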
.. autoclass:: Variable
:members:
:hidden:`Function`
------------------
.. autoclass:: Function
:members:
Profiler
--------
Autograd includes a profiler that lets you inspect the cost of different
operators inside your model - both on the CPU and GPU. There are two modes
implemented at the moment - CPU-only, using :class:`~torch.autograd.profiler.profile`,
and nvprof-based (registering both CPU and GPU activity), using
:class:`~torch.autograd.profiler.emit_nvtx`.
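A minimal sketch of the CPU profiler (the exact output columns and event names may vary between versions)::

    import torch
    from torch.autograd import Variable, profiler

    x = Variable(torch.randn(64, 128))
    with profiler.profile() as prof:   # records operator-level CPU timings
        y = x.mm(x.t()).tanh().sum()
    print(prof)                        # prints a table of the recorded events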
.. autoclass:: torch.autograd.profiler.profile
:members:
.. autoclass:: torch.autograd.profiler.emit_nvtx
:members:
.. autofunction:: torch.autograd.profiler.load_nvprof

249
docs/source/conf.py Normal file

@ -0,0 +1,249 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# PyTorch documentation build configuration file, created by
# sphinx-quickstart on Fri Dec 23 13:31:47 2016.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import torch
try:
import torchvision
except ImportError:
import warnings
warnings.warn('unable to load "torchvision" package')
import sphinx_rtd_theme
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.doctest',
'sphinx.ext.intersphinx',
'sphinx.ext.todo',
'sphinx.ext.coverage',
'sphinx.ext.mathjax',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
]
napoleon_use_ivar = True
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = 'PyTorch'
copyright = '2017, Torch Contributors'
author = 'Torch Contributors'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
# TODO: change to [:2] at v1.0
version = 'master (' + torch.__version__ + ' )'
# The full version, including alpha/beta/rc tags.
# TODO: verify this works as expected
release = 'master'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = []
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = True
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
html_theme_options = {
'collapse_navigation': False,
'display_version': True,
'logo_only': True,
}
html_logo = '_static/img/pytorch-logo-dark.svg'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# html_style_path = 'css/pytorch_theme.css'
html_context = {
'css_files': [
'https://fonts.googleapis.com/css?family=Lato',
'_static/css/pytorch_theme.css'
],
}
# -- Options for HTMLHelp output ------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'PyTorchdoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'pytorch.tex', 'PyTorch Documentation',
'Torch Contributors', 'manual'),
]
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'PyTorch', 'PyTorch Documentation',
[author], 1)
]
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'PyTorch', 'PyTorch Documentation',
author, 'PyTorch', 'One line description of project.',
'Miscellaneous'),
]
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {
'python': ('https://docs.python.org/', None),
'numpy': ('http://docs.scipy.org/doc/numpy/', None),
}
# -- A patch that prevents Sphinx from cross-referencing ivar tags -------
# See http://stackoverflow.com/a/41184353/3343043
from docutils import nodes
from sphinx.util.docfields import TypedField
from sphinx import addnodes
def patched_make_field(self, types, domain, items, **kw):
# `kw` catches `env=None` needed for newer sphinx while maintaining
# backwards compatibility when passed along further down!
# type: (List, unicode, Tuple) -> nodes.field
def handle_item(fieldarg, content):
par = nodes.paragraph()
par += addnodes.literal_strong('', fieldarg) # Patch: this line added
# par.extend(self.make_xrefs(self.rolename, domain, fieldarg,
# addnodes.literal_strong))
if fieldarg in types:
par += nodes.Text(' (')
# NOTE: using .pop() here to prevent a single type node to be
# inserted twice into the doctree, which leads to
# inconsistencies later when references are resolved
fieldtype = types.pop(fieldarg)
if len(fieldtype) == 1 and isinstance(fieldtype[0], nodes.Text):
typename = u''.join(n.astext() for n in fieldtype)
typename = typename.replace('int', 'python:int')
typename = typename.replace('long', 'python:long')
typename = typename.replace('float', 'python:float')
typename = typename.replace('type', 'python:type')
par.extend(self.make_xrefs(self.typerolename, domain, typename,
addnodes.literal_emphasis, **kw))
else:
par += fieldtype
par += nodes.Text(')')
par += nodes.Text(' -- ')
par += content
return par
fieldname = nodes.field_name('', self.label)
if len(items) == 1 and self.can_collapse:
fieldarg, content = items[0]
bodynode = handle_item(fieldarg, content)
else:
bodynode = self.list_type()
for fieldarg, content in items:
bodynode += nodes.list_item('', handle_item(fieldarg, content))
fieldbody = nodes.field_body('', bodynode)
return nodes.field('', fieldname, fieldbody)
TypedField.make_field = patched_make_field

49
docs/source/cuda.rst Normal file

@ -0,0 +1,49 @@
torch.cuda
===================================
.. currentmodule:: torch.cuda
.. automodule:: torch.cuda
:members:
Random Number Generator
-------------------------
.. autofunction:: get_rng_state
.. autofunction:: set_rng_state
.. autofunction:: manual_seed
.. autofunction:: manual_seed_all
.. autofunction:: seed
.. autofunction:: seed_all
.. autofunction:: initial_seed
Communication collectives
-------------------------
.. autofunction:: torch.cuda.comm.broadcast
.. autofunction:: torch.cuda.comm.reduce_add
.. autofunction:: torch.cuda.comm.scatter
.. autofunction:: torch.cuda.comm.gather
Streams and events
------------------
.. autoclass:: Stream
:members:
.. autoclass:: Event
:members:
Memory management
-----------------
.. autofunction:: empty_cache
NVIDIA Tools Extension (NVTX)
-----------------------------
.. autofunction:: torch.cuda.nvtx.mark
.. autofunction:: torch.cuda.nvtx.range_push
.. autofunction:: torch.cuda.nvtx.range_pop

14
docs/source/data.rst Normal file

@ -0,0 +1,14 @@
torch.utils.data
===================================
.. automodule:: torch.utils.data
.. autoclass:: Dataset
.. autoclass:: TensorDataset
.. autoclass:: ConcatDataset
.. autoclass:: DataLoader
.. autoclass:: torch.utils.data.sampler.Sampler
.. autoclass:: torch.utils.data.sampler.SequentialSampler
.. autoclass:: torch.utils.data.sampler.RandomSampler
.. autoclass:: torch.utils.data.sampler.SubsetRandomSampler
.. autoclass:: torch.utils.data.sampler.WeightedRandomSampler
.. autoclass:: torch.utils.data.distributed.DistributedSampler

201
docs/source/distributed.rst Normal file

@ -0,0 +1,201 @@
.. role:: hidden
:class: hidden-section
Distributed communication package - torch.distributed
=====================================================
.. automodule:: torch.distributed
.. currentmodule:: torch.distributed
Currently torch.distributed supports three backends, each with
different capabilities. The table below shows which functions are available
for use with CPU / CUDA tensors.
MPI supports CUDA only if the implementation used to build PyTorch supports it.
+------------+-----------+-----------+-----------+
| Backend | ``tcp`` | ``gloo`` | ``mpi`` |
+------------+-----+-----+-----+-----+-----+-----+
| Device | CPU | GPU | CPU | GPU | CPU | GPU |
+============+=====+=====+=====+=====+=====+=====+
| send | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| recv | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| broadcast | ✓ | ✘ | ✓ | ✓ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| all_reduce | ✓ | ✘ | ✓ | ✓ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| reduce | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| all_gather | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| gather | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| scatter | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| barrier | ✓ | ✘ | ✓ | ✓ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
.. _distributed-basics:
Basics
------
The `torch.distributed` package provides PyTorch support and communication primitives
for multiprocess parallelism across several computation nodes running on one or more
machines. The class :func:`torch.nn.parallel.DistributedDataParallel` builds on this
functionality to provide synchronous distributed training as a wrapper around any
PyTorch model. This differs from the kinds of parallelism provided by
:doc:`multiprocessing` and :func:`torch.nn.DataParallel` in that it supports
multiple network-connected machines and in that the user must explicitly launch a separate
copy of the main training script for each process.
In the single-machine synchronous case, `torch.distributed` or the
:func:`torch.nn.parallel.DistributedDataParallel` wrapper may still have advantages over other
approaches to data-parallelism, including :func:`torch.nn.DataParallel`:
* Each process maintains its own optimizer and performs a complete optimization step with each
iteration. While this may appear redundant, since the gradients have already been gathered
together and averaged across processes and are thus the same for every process, this means
that no parameter broadcast step is needed, reducing time spent transferring tensors between
nodes.
* Each process contains an independent Python interpreter, eliminating the extra interpreter
overhead and "GIL-thrashing" that comes from driving several execution threads, model
replicas, or GPUs from a single Python process. This is especially important for models that
make heavy use of the Python runtime, including models with recurrent layers or many small
components.
Initialization
--------------
The package needs to be initialized using the :func:`torch.distributed.init_process_group`
function before calling any other methods. This blocks until all processes have
joined.
.. autofunction:: init_process_group
.. autofunction:: get_rank
.. autofunction:: get_world_size
--------------------------------------------------------------------------------
Currently three initialization methods are supported:
TCP initialization
^^^^^^^^^^^^^^^^^^
There are two ways to initialize using TCP, both requiring a network address
reachable from all processes and a desired ``world_size``. The first way
requires specifying an address that belongs to the rank 0 process and
requires all processes to have manually specified ranks.
Alternatively, the address has to be a valid IP multicast address, in which case
ranks can be assigned automatically. Multicast initialization also supports
a ``group_name`` argument, which allows you to use the same address for multiple
jobs, as long as they use different group names.
::
import torch.distributed as dist
# Use address of one of the machines
dist.init_process_group(init_method='tcp://10.1.1.20:23456', rank=args.rank, world_size=4)
# or a multicast address - rank will be assigned automatically if unspecified
dist.init_process_group(init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b0a7]:23456',
world_size=4)
Shared file-system initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Another initialization method makes use of a file system that is shared and
visible from all machines in a group, along with a desired ``world_size``. The URL should start
with ``file://`` and contain a path to a non-existent file (in an existing
directory) on a shared file system. This initialization method also supports a
``group_name`` argument, which allows you to use the same shared file path for
multiple jobs, as long as they use different group names.
.. warning::
This method assumes that the file system supports locking using ``fcntl`` - most
local systems and NFS support it.
::
import torch.distributed as dist
# Rank will be assigned automatically if unspecified
dist.init_process_group(init_method='file:///mnt/nfs/sharedfile', world_size=4,
group_name=args.group)
Environment variable initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method will read the configuration from environment variables, allowing
one to fully customize how the information is obtained. The variables to be set
are:
* ``MASTER_PORT`` - required; has to be a free port on machine with rank 0
* ``MASTER_ADDR`` - required (except for rank 0); address of rank 0 node
* ``WORLD_SIZE`` - required; can be set either here, or in a call to init function
* ``RANK`` - required; can be set either here, or in a call to init function
The machine with rank 0 will be used to set up all connections.
This is the default method, meaning that ``init_method`` does not have to be specified (or
can be ``env://``).
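For illustration only, a hypothetical rank 0 process of a 4-process job could set the variables from Python before initializing (the ``tcp`` backend here is just an example choice)::

    import os
    import torch.distributed as dist

    os.environ['MASTER_ADDR'] = '10.1.1.20'   # address of the rank 0 machine
    os.environ['MASTER_PORT'] = '23456'       # a free port on the rank 0 machine
    dist.init_process_group(backend='tcp', init_method='env://',
                            rank=0, world_size=4)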
Groups
------
By default collectives operate on the default group (also called the world) and
require all processes to enter the distributed function call. However, some workloads can benefit
from more fine-grained communication. This is where distributed groups come
into play. :func:`~torch.distributed.new_group` function can be
used to create new groups, with arbitrary subsets of all processes. It returns
an opaque group handle that can be given as a ``group`` argument to all collectives
(collectives are distributed functions to exchange information in certain well-known programming patterns).
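For example, a sketch assuming a 4-process job in which only ranks 0 and 2 join a new group::

    import torch
    import torch.distributed as dist

    group = dist.new_group(ranks=[0, 2])   # every process must call new_group
    t = torch.ones(4)
    if dist.get_rank() in (0, 2):
        # only members of the group take part in this collective
        dist.all_reduce(t, op=dist.reduce_op.SUM, group=group)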
.. autofunction:: new_group
Point-to-point communication
----------------------------
.. autofunction:: send
.. autofunction:: recv
:func:`~torch.distributed.isend` and :func:`~torch.distributed.irecv`
return distributed request objects when used. In general, the type of these objects is
unspecified, as they should never be created manually, but they are guaranteed to support two methods:
* ``is_completed()`` - returns True if the operation has finished
* ``wait()`` - will block the process until the operation is finished.
  ``is_completed()`` is guaranteed to return True once ``wait()`` returns.
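As a sketch, assuming a 2-process job where rank 0 sends a tensor to rank 1::

    import torch
    import torch.distributed as dist

    t = torch.zeros(10)
    if dist.get_rank() == 0:
        t += 1
        req = dist.isend(tensor=t, dst=1)   # returns a distributed request object
    else:
        req = dist.irecv(tensor=t, src=0)
    req.wait()                              # block until the transfer has finished
    assert req.is_completed()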
When using the MPI backend, :func:`~torch.distributed.isend` and :func:`~torch.distributed.irecv`
support non-overtaking semantics, which provides some guarantees about message ordering. For more detail, see
http://mpi-forum.org/docs/mpi-2.2/mpi22-report/node54.htm#Node54
.. autofunction:: isend
.. autofunction:: irecv
Collective functions
--------------------
.. autofunction:: broadcast
.. autofunction:: all_reduce
.. autofunction:: reduce
.. autofunction:: all_gather
.. autofunction:: gather
.. autofunction:: scatter
.. autofunction:: barrier


@ -0,0 +1,32 @@
.. role:: hidden
:class: hidden-section
Probability distributions - torch.distributions
==================================================
.. automodule:: torch.distributions
.. currentmodule:: torch.distributions
:hidden:`Distribution`
~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: Distribution
:members:
:hidden:`Bernoulli`
~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: Bernoulli
:members:
:hidden:`Categorical`
~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: Categorical
:members:
:hidden:`Normal`
~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: Normal
:members:

6
docs/source/ffi.rst Normal file

@ -0,0 +1,6 @@
torch.utils.ffi
===============
.. currentmodule:: torch.utils.ffi
.. autofunction:: create_extension

54
docs/source/index.rst Normal file

@ -0,0 +1,54 @@
.. PyTorch documentation master file, created by
sphinx-quickstart on Fri Dec 23 13:31:47 2016.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
:github_url: https://github.com/pytorch/pytorch
PyTorch documentation
===================================
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
.. toctree::
:glob:
:maxdepth: 1
:caption: Notes
notes/*
.. toctree::
:maxdepth: 1
:caption: Package Reference
torch
tensors
sparse
storage
nn
optim
torch.autograd <autograd>
torch.distributions <distributions>
torch.multiprocessing <multiprocessing>
torch.distributed <distributed>
torch.legacy <legacy>
cuda
ffi
data
model_zoo
onnx
.. toctree::
:glob:
:maxdepth: 2
:caption: torchvision Reference
torchvision/index
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`

4
docs/source/legacy.rst Normal file

@ -0,0 +1,4 @@
Legacy package - torch.legacy
===================================
.. automodule:: torch.legacy


@ -0,0 +1,5 @@
torch.utils.model_zoo
===================================
.. automodule:: torch.utils.model_zoo
.. autofunction:: load_url


@ -0,0 +1,88 @@
Multiprocessing package - torch.multiprocessing
===============================================
.. automodule:: torch.multiprocessing
.. currentmodule:: torch.multiprocessing
.. warning::
If the main process exits abruptly (e.g. because of an incoming signal),
Python's ``multiprocessing`` sometimes fails to clean up its children.
It's a known caveat, so if you're seeing any resource leaks after
interrupting the interpreter, it probably means that this has just happened
to you.
Strategy management
-------------------
.. autofunction:: get_all_sharing_strategies
.. autofunction:: get_sharing_strategy
.. autofunction:: set_sharing_strategy
Sharing CUDA tensors
--------------------
Sharing CUDA tensors between processes is supported only in Python 3, using
the ``spawn`` or ``forkserver`` start methods. :mod:`python:multiprocessing` in
Python 2 can only create subprocesses using ``fork``, and it's not supported
by the CUDA runtime.
.. warning::
The CUDA API requires that allocations exported to other processes remain
valid as long as they are used by them. You should be careful and ensure that
the CUDA tensors you share don't go out of scope for as long as they are needed.
This shouldn't be a problem for sharing model parameters, but passing other
kinds of data should be done with care. Note that this restriction doesn't
apply to shared CPU memory.
Sharing strategies
------------------
This section provides a brief overview of how different sharing strategies
work. Note that it applies only to CPU tensors - CUDA tensors will always use
the CUDA API, as that's the only way they can be shared.
File descriptor - ``file_descriptor``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
This is the default strategy (except for macOS and OS X where it's not
supported).
This strategy will use file descriptors as shared memory handles. Whenever a
storage is moved to shared memory, a file descriptor obtained from ``shm_open``
is cached with the object, and when it's going to be sent to other processes,
the file descriptor will be transferred (e.g. via UNIX sockets) to it. The
receiver will also cache the file descriptor and ``mmap`` it, to obtain a shared
view onto the storage data.
Note that if a lot of tensors are going to be shared, this strategy will keep a
large number of file descriptors open most of the time. If your system has low
limits for the number of open file descriptors, and you can't raise them, you
should use the ``file_system`` strategy.
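Switching strategies can be done near the start of your program, for example::

    import torch.multiprocessing as mp

    print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'} on Linux
    mp.set_sharing_strategy('file_system')
    print(mp.get_sharing_strategy())        # -> 'file_system'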
File system - ``file_system``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This strategy will use file names given to ``shm_open`` to identify the shared
memory regions. This has the benefit of not requiring the implementation to cache
the file descriptors obtained from it, but at the same time is prone to shared
memory leaks. The file can't be deleted right after its creation, because other
processes need to access it to open their views. If the processes fatally
crash, or are killed, and don't call the storage destructors, the files will
remain in the system. This is very serious, because they keep using up the
memory until the system is restarted, or they're freed manually.
To counter the problem of shared memory file leaks, :mod:`torch.multiprocessing`
will spawn a daemon named ``torch_shm_manager`` that will isolate itself from
the current process group, and will keep track of all shared memory allocations.
Once all processes connected to it exit, it will wait a moment to ensure there
will be no new connections, and will iterate over all shared memory files
allocated by the group. If it finds that any of them still exist, they will be
deallocated. We've tested this method and it proved to be robust to various
failures. Still, if your system has high enough limits, and ``file_descriptor``
is a supported strategy, we do not recommend switching to this one.

1120
docs/source/nn.rst Normal file

File diff suppressed because it is too large


@ -0,0 +1,151 @@
Autograd mechanics
==================
This note will present an overview of how autograd works and records the
operations. It's not strictly necessary to understand all this, but we recommend
getting familiar with it, as it will help you write more efficient, cleaner
programs, and can aid you in debugging.
.. _excluding-subgraphs:
Excluding subgraphs from backward
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Every Variable has two flags: :attr:`requires_grad` and :attr:`volatile`.
They both allow for fine grained exclusion of subgraphs from gradient
computation and can increase efficiency.
.. _excluding-requires_grad:
``requires_grad``
~~~~~~~~~~~~~~~~~
If there's a single input to an operation that requires gradient, its output
will also require gradient. Conversely, the output won't require gradient
only if all inputs don't require it. Backward computation is never
performed in subgraphs where no Variable requires gradients.
.. code::
>>> x = Variable(torch.randn(5, 5))
>>> y = Variable(torch.randn(5, 5))
>>> z = Variable(torch.randn(5, 5), requires_grad=True)
>>> a = x + y
>>> a.requires_grad
False
>>> b = a + z
>>> b.requires_grad
True
This is especially useful when you want to freeze part of your model, or you
know in advance that you're not going to use gradients w.r.t. some parameters.
For example, if you want to finetune a pretrained CNN, it's enough to switch the
:attr:`requires_grad` flags in the frozen base, and no intermediate buffers will
be saved, until the computation gets to the last layer, where the affine
transform will use weights that require gradient, and the output of the network
will also require them.
.. code::
model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
param.requires_grad = False
# Replace the last fully-connected layer
# Parameters of newly constructed modules have requires_grad=True by default
model.fc = nn.Linear(512, 100)
# Optimize only the classifier
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
``volatile``
~~~~~~~~~~~~
Volatile is recommended for pure inference mode, when you're sure you won't
even be calling ``.backward()``. It's more efficient than any other autograd
setting - it will use the absolute minimal amount of memory to evaluate the
model. ``volatile`` also implies that ``requires_grad is False``.
Volatile differs from :ref:`excluding-requires_grad` in how the flag propagates.
If there's even a single volatile input to an operation, its output is also
going to be volatile. Volatility spreads across the graph much more easily than
not requiring gradient - you only need a **single** volatile leaf to have a
volatile output, while you need **all** leaves to not require gradient to
have an output that doesn't require gradient. Using the volatile flag, you don't
need to change any settings of your model parameters to use it for
inference. It's enough to create a volatile input, and this will ensure that
no intermediate states are saved.
.. code::
>>> regular_input = Variable(torch.randn(1, 3, 227, 227))
>>> volatile_input = Variable(torch.randn(1, 3, 227, 227), volatile=True)
>>> model = torchvision.models.resnet18(pretrained=True)
>>> model(regular_input).requires_grad
True
>>> model(volatile_input).requires_grad
False
>>> model(volatile_input).volatile
True
>>> model(volatile_input).grad_fn is None
True
How autograd encodes the history
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Autograd is a reverse automatic differentiation system. Conceptually,
autograd records a graph of all the operations that created
the data as you execute them, giving you a directed acyclic graph
whose leaves are the input variables and whose roots are the output variables.
By tracing this graph from roots to leaves, you can automatically
compute the gradients using the chain rule.
Internally, autograd represents this graph as a graph of
:class:`Function` objects (really expressions), which can be
:meth:`~torch.autograd.Function.apply` ed to compute the result of
evaluating the graph. When computing the forwards pass, autograd
simultaneously performs the requested computations and builds up a graph
representing the function that computes the gradient (the ``.grad_fn``
attribute of each :class:`Variable` is an entry point into this graph).
When the forwards pass is completed, we evaluate this graph in the
backwards pass to compute the gradients.
An important thing to note is that the graph is recreated from scratch at every
iteration, and this is exactly what allows for using arbitrary Python control
flow statements that can change the overall shape and size of the graph at
every iteration. You don't have to encode all possible paths before you
launch the training - what you run is what you differentiate.
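For instance, a quick way to peek at this graph is to follow the ``.grad_fn`` attributes (the exact class names printed may differ between versions)::

    import torch
    from torch.autograd import Variable

    x = Variable(torch.ones(2, 2), requires_grad=True)
    y = (x + 2).mean()
    print(y.grad_fn)                 # e.g. <MeanBackward object ...>
    print(y.grad_fn.next_functions)  # edges pointing towards the leaves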
In-place operations on Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Supporting in-place operations in autograd is a hard matter, and we discourage
their use in most cases. Autograd's aggressive buffer freeing and reuse makes
it very efficient and there are very few occasions when in-place operations
actually lower memory usage by any significant amount. Unless you're operating
under heavy memory pressure, you might never need to use them.
There are two main reasons that limit the applicability of in-place operations:
1. Overwriting values required to compute gradients. This is why variables don't
support ``log_``. Its gradient formula requires the original input, and while
it is possible to recreate it by computing the inverse operation, it is
numerically unstable, and requires additional work that often defeats the
purpose of using these functions.
2. Every in-place operation actually requires the implementation to rewrite the
computational graph. Out-of-place versions simply allocate new objects and
keep references to the old graph, while in-place operations require
changing the creator of all inputs to the :class:`Function` representing
this operation. This can be tricky, especially if there are many Variables
that reference the same storage (e.g. created by indexing or transposing),
and in-place functions will actually raise an error if the storage of
modified inputs is referenced by any other :class:`Variable`.
In-place correctness checks
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Every variable keeps a version counter that is incremented every time it's
marked dirty in any operation. When a Function saves any tensors for backward,
the version counter of their containing Variable is saved as well. Once you access
``self.saved_tensors``, the counter is checked, and if it's greater than the saved value,
an error is raised.


@ -0,0 +1,113 @@
.. _broadcasting-semantics:
Broadcasting semantics
======================
Many PyTorch operations support :any:`NumPy Broadcasting Semantics <numpy.doc.broadcasting>`.
In short, if a PyTorch operation supports broadcast, then its Tensor arguments can be
automatically expanded to be of equal sizes (without making copies of the data).
General semantics
-----------------
Two tensors are "broadcastable" if the following rules hold:
- Each tensor has at least one dimension.
- When iterating over the dimension sizes, starting at the trailing dimension,
the dimension sizes must either be equal, one of them must be 1, or one of them
must not exist.
For Example::
>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)
>>> x=torch.FloatTensor()
>>> y=torch.FloatTensor(2,2)
# x and y are not broadcastable, because x does not have at least 1 dimension
# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist
# but:
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3
If two tensors :attr:`x`, :attr:`y` are "broadcastable", the resulting tensor size
is calculated as follows:
- If the number of dimensions of :attr:`x` and :attr:`y` are not equal, prepend 1
to the dimensions of the tensor with fewer dimensions to make them equal length.
- Then, for each dimension size, the resulting dimension size is the max of the sizes of
:attr:`x` and :attr:`y` along that dimension.
For Example::
# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor( 3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])
# but not necessary:
>>> x=torch.FloatTensor(1)
>>> y=torch.FloatTensor(3,1,7)
>>> (x+y).size()
torch.Size([3, 1, 7])
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
In-place semantics
------------------
One complication is that in-place operations do not allow the in-place tensor to change shape
as a result of the broadcast.
For Example::
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor(3,1,1)
>>> (x.add_(y)).size()
torch.Size([5, 3, 4, 1])
# but:
>>> x=torch.FloatTensor(1,3,1)
>>> y=torch.FloatTensor(3,1,7)
>>> (x.add_(y)).size()
RuntimeError: The expanded size of the tensor (1) must match the existing size (7) at non-singleton dimension 2.
Backwards compatibility
-----------------------
Prior versions of PyTorch allowed certain pointwise functions to execute on tensors with different shapes,
as long as the number of elements in each tensor was equal. The pointwise operation would then be carried
out by viewing each tensor as 1-dimensional. PyTorch now supports broadcasting and the "1-dimensional"
pointwise behavior is considered deprecated and will generate a Python warning in cases where tensors are
not broadcastable, but have the same number of elements.
Note that the introduction of broadcasting can cause backwards incompatible changes in the case where
two tensors do not have the same shape, but are broadcastable and have the same number of elements.
For Example::
>>> torch.add(torch.ones(4,1), torch.randn(4))
would previously produce a Tensor with size: torch.Size([4,1]), but now produces a Tensor with size: torch.Size([4,4]).
In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist,
you may set `torch.utils.backcompat.broadcast_warning.enabled` to `True`, which will generate a python warning
in such cases.
For Example::
>>> torch.utils.backcompat.broadcast_warning.enabled=True
>>> torch.add(torch.ones(4,1), torch.ones(4))
__main__:1: UserWarning: self and other do not have the same shape, but are broadcastable, and have the same number of elements.
Changing behavior in a backwards incompatible manner to broadcasting rather than viewing as 1-dimensional.

222
docs/source/notes/cuda.rst Normal file

@ -0,0 +1,222 @@
.. _cuda-semantics:
CUDA semantics
==============
:mod:`torch.cuda` is used to set up and run CUDA operations. It keeps track of
the currently selected GPU, and all CUDA tensors you allocate will by default be
created on that device. The selected device can be changed with a
:any:`torch.cuda.device` context manager.
However, once a tensor is allocated, you can do operations on it irrespective
of the selected device, and the results will always be placed on the same
device as the tensor.
Cross-GPU operations are not allowed by default, with the only exception of
:meth:`~torch.Tensor.copy_`. Unless you enable peer-to-peer memory access, any
attempts to launch ops on tensors spread across different devices will raise an
error.
Below you can find a small example showcasing this::
x = torch.cuda.FloatTensor(1)
# x.get_device() == 0
y = torch.FloatTensor(1).cuda()
# y.get_device() == 0
with torch.cuda.device(1):
# allocates a tensor on GPU 1
a = torch.cuda.FloatTensor(1)
# transfers a tensor from CPU to GPU 1
b = torch.FloatTensor(1).cuda()
# a.get_device() == b.get_device() == 1
c = a + b
# c.get_device() == 1
z = x + y
# z.get_device() == 0
# even within a context, you can give a GPU id to the .cuda call
d = torch.randn(2).cuda(2)
# d.get_device() == 2
Asynchronous execution
----------------------
By default, GPU operations are asynchronous. When you call a function that
uses the GPU, the operations are *enqueued* to the particular device, but not
necessarily executed until later. This allows us to execute more computations
in parallel, including operations on CPU or other GPUs.
In general, the effect of asynchronous computation is invisible to the caller,
because (1) each device executes operations in the order they are queued, and
(2) PyTorch automatically performs necessary synchronization when copying data
between CPU and GPU or between two GPUs. Hence, computation will proceed as if
every operation was executed synchronously.
You can force synchronous computation by setting environment variable
`CUDA_LAUNCH_BLOCKING=1`. This can be handy when an error occurs on the GPU.
(With asynchronous execution, such an error isn't reported until after the
operation is actually executed, so the stack trace does not show where it was
requested.)
As an exception, several functions such as :meth:`~torch.Tensor.copy_` admit
an explicit :attr:`async` argument, which lets the caller bypass synchronization
when it is unnecessary. Another exception is CUDA streams, explained below.
CUDA streams
^^^^^^^^^^^^
A `CUDA stream`_ is a linear sequence of execution that belongs to a specific
device. You normally do not need to create one explicitly: by default, each
device uses its own "default" stream.
Operations inside each stream are serialized in the order they are created,
but operations from different streams can execute concurrently in any
relative order, unless explicit synchronization functions (such as
:meth:`~torch.cuda.synchronize` or :meth:`~torch.cuda.Stream.wait_stream`) are
used. For example, the following code is incorrect::
s = torch.cuda.Stream()  # Create a new stream.
A = torch.cuda.FloatTensor(100, 100).normal_(0.0, 1.0)
with torch.cuda.stream(s):
# sum() may start execution before normal_() finishes!
B = torch.sum(A)
When the "current stream" is the default stream, PyTorch automatically performs
necessary synchronization when data is moved around, as explained above.
However, when using non-default streams, it is the user's responsibility to
ensure proper synchronization.
.. _CUDA stream: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams
Memory management
-----------------
PyTorch uses a caching memory allocator to speed up memory allocations. This
allows fast memory deallocation without device synchronizations. However, the
unused memory managed by the allocator will still show as used in
``nvidia-smi``. Calling :meth:`~torch.cuda.empty_cache` can release all unused
cached memory from PyTorch so that it can be used by other GPU applications.
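For example, a minimal sketch::

    import torch

    x = torch.cuda.FloatTensor(1024, 1024)  # allocation is served by the caching allocator
    del x                                   # the block stays cached and still shows up in nvidia-smi
    torch.cuda.empty_cache()                # hand the unused cached memory back to the driver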
Best practices
--------------
Device-agnostic code
^^^^^^^^^^^^^^^^^^^^
Due to the structure of PyTorch, you may need to explicitly write
device-agnostic (CPU or GPU) code; an example may be creating a new tensor as
the initial hidden state of a recurrent neural network.
The first step is to determine whether the GPU should be used or not. A common
pattern is to use Python's ``argparse`` module to read in user arguments, and
have a flag that can be used to disable CUDA, in combination with
:meth:`~torch.cuda.is_available`. In the following, ``args.cuda`` results in a
flag that can be used to cast tensors and modules to CUDA if desired::
import argparse
import torch
parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
help='Disable CUDA')
args = parser.parse_args()
args.cuda = not args.disable_cuda and torch.cuda.is_available()
If modules or tensors need to be sent to the GPU, ``args.cuda`` can be used as
follows::
x = torch.Tensor(8, 42)
net = Network()
if args.cuda:
x = x.cuda()
net.cuda()
When creating tensors, an alternative to the if statement is to have a default
datatype defined, and cast all tensors using that. An example when using a
dataloader would be as follows::
dtype = torch.cuda.FloatTensor
for i, x in enumerate(train_loader):
x = Variable(x.type(dtype))
When working with multiple GPUs on a system, you can use the
``CUDA_VISIBLE_DEVICES`` environment flag to manage which GPUs are available to
PyTorch. As mentioned above, to manually control which GPU a tensor is created
on, the best practice is to use a :any:`torch.cuda.device` context manager::
print("Outside device is 0") # On device 0 (default in most scenarios)
with torch.cuda.device(1):
print("Inside device is 1") # On device 1
print("Outside device is still 0") # On device 0
If you have a tensor and would like to create a new tensor of the same type on
the same device, then you can use the :meth:`~torch.Tensor.new` method, which
acts the same as a normal tensor constructor. Whilst the previously mentioned
methods depend on the current GPU context, :meth:`~torch.Tensor.new` preserves
the device of the original tensor.
This is the recommended practice when creating modules in which new
tensors/variables need to be created internally during the forward pass::
x_cpu = torch.FloatTensor(1)
x_gpu = torch.cuda.FloatTensor(1)
x_cpu_long = torch.LongTensor(1)
y_cpu = x_cpu.new(8, 10, 10).fill_(0.3)
y_gpu = x_gpu.new(x_gpu.size()).fill_(-5)
y_cpu_long = x_cpu_long.new([[1, 2, 3]])
If you want to create a tensor of the same type and size of another tensor, and
fill it with either ones or zeros, :meth:`~torch.ones_like` or
:meth:`~torch.zeros_like` are provided as convenient helper functions (which
also preserve device)::
x_cpu = torch.FloatTensor(1)
x_gpu = torch.cuda.FloatTensor(1)
y_cpu = torch.ones_like(x_cpu)
y_gpu = torch.zeros_like(x_gpu)
Use pinned memory buffers
^^^^^^^^^^^^^^^^^^^^^^^^^
.. warning::
This is an advanced tip. Overuse of pinned memory can cause serious
problems when you're running low on RAM, and you should be aware that
pinning is often an expensive operation.
Host to GPU copies are much faster when they originate from pinned (page-locked)
memory. CPU tensors and storages expose a :meth:`~torch.Tensor.pin_memory`
method that returns a copy of the object, with its data put in a pinned region.
Also, once you pin a tensor or storage, you can use asynchronous GPU copies.
Just pass an additional ``async=True`` argument to a :meth:`~torch.Tensor.cuda`
call. This can be used to overlap data transfers with computation.
You can make the :class:`~torch.utils.data.DataLoader` return batches placed in
pinned memory by passing ``pin_memory=True`` to its constructor.
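A sketch combining both, assuming an existing ``dataset`` that yields ``(input, target)`` pairs::

    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=32, pin_memory=True)  # batches come back pinned
    for data, target in loader:
        # asynchronous copies that can overlap with computation
        data = data.cuda(async=True)
        target = target.cuda(async=True)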
.. _cuda-nn-dataparallel-instead:
Use nn.DataParallel instead of multiprocessing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Most use cases involving batched inputs and multiple GPUs should default to
using :class:`~torch.nn.DataParallel` to utilize more than one GPU. Even with
the GIL, a single Python process can saturate multiple GPUs.
As of version 0.1.9, large numbers of GPUs (8+) might not be fully utilized.
However, this is a known issue that is under active development. As always,
test your use case.
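A minimal sketch, assuming an existing ``model`` and ``batch`` (a Variable of inputs) and at least two visible GPUs::

    import torch.nn as nn

    model = model.cuda()
    model = nn.DataParallel(model, device_ids=[0, 1])  # replicate across GPUs 0 and 1
    output = model(batch)  # the batch is scattered, outputs are gathered on device 0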
There are significant caveats to using CUDA models with
:mod:`~torch.multiprocessing`; unless care is taken to meet the data handling
requirements exactly, it is likely that your program will have incorrect or
undefined behavior.


@ -0,0 +1,182 @@
Extending PyTorch
=================
In this note we'll cover ways of extending :mod:`torch.nn`,
:mod:`torch.autograd`, and writing custom C extensions utilizing our C
libraries.
Extending :mod:`torch.autograd`
-------------------------------
.. currentmodule:: torch.autograd
Adding operations to :mod:`~torch.autograd` requires implementing a new
:class:`Function` subclass for each operation. Recall that :class:`Function` s
are what :mod:`~torch.autograd` uses to compute the results and gradients, and
encode the operation history. Every new function requires you to implement 2
methods:
- :meth:`~Function.forward` - the code that performs the operation. It can take
as many arguments as you want, with some of them being optional, if you
specify the default values. All kinds of Python objects are accepted here.
:class:`Variable` arguments will be converted to :class:`Tensor` s before the
call, and their use will be registered in the graph. Note that this logic won't
traverse lists/dicts/any other data structures and will only consider Variables
that are direct arguments to the call. You can return either a single
:class:`Tensor` output, or a :class:`tuple` of :class:`Tensor` s if there are
multiple outputs. Also, please refer to the docs of :class:`Function` to find
descriptions of useful methods that can be called only from :meth:`~Function.forward`.
- :meth:`~Function.backward` - gradient formula. It will be given
as many :class:`Variable` arguments as there were outputs, with each of them
representing gradient w.r.t. that output. It should return as many
:class:`Variable` s as there were inputs, with each of them containing the
gradient w.r.t. its corresponding input. If your inputs didn't require
gradient (see :attr:`~Variable.needs_input_grad`), or were non-:class:`Variable`
objects, you can return :class:`python:None`. Also, if you have optional
arguments to :meth:`~Function.forward` you can return more gradients than there
were inputs, as long as they're all :any:`python:None`.
Below you can find code for a ``Linear`` function from :mod:`torch.nn`, with
additional comments::
# Inherit from Function
class LinearFunction(Function):
# Note that both forward and backward are @staticmethods
@staticmethod
# bias is an optional argument
def forward(ctx, input, weight, bias=None):
ctx.save_for_backward(input, weight, bias)
output = input.mm(weight.t())
if bias is not None:
output += bias.unsqueeze(0).expand_as(output)
return output
# This function has only a single output, so it gets only one gradient
@staticmethod
def backward(ctx, grad_output):
# This is a pattern that is very convenient - at the top of backward
# unpack saved_tensors and initialize all gradients w.r.t. inputs to
# None. Thanks to the fact that additional trailing Nones are
# ignored, the return statement is simple even when the function has
# optional inputs.
input, weight, bias = ctx.saved_variables
grad_input = grad_weight = grad_bias = None
# These needs_input_grad checks are optional and are there only to
# improve efficiency. If you want to make your code simpler, you can
# skip them. Returning gradients for inputs that don't require it is
# not an error.
if ctx.needs_input_grad[0]:
grad_input = grad_output.mm(weight)
if ctx.needs_input_grad[1]:
grad_weight = grad_output.t().mm(input)
if bias is not None and ctx.needs_input_grad[2]:
grad_bias = grad_output.sum(0).squeeze(0)
return grad_input, grad_weight, grad_bias
Now, to make it easier to use these custom ops, we recommend aliasing their
``apply`` method::
linear = LinearFunction.apply
Here, we give an additional example of a function that is parametrized by
non-Variable arguments::
class MulConstant(Function):
@staticmethod
def forward(ctx, tensor, constant):
# ctx is a context object that can be used to stash information
# for backward computation
ctx.constant = constant
return tensor * constant
@staticmethod
def backward(ctx, grad_output):
# We return as many input gradients as there were arguments.
# Gradients of non-Tensor arguments to forward must be None.
return grad_output * ctx.constant, None
You probably want to check if the backward method you implemented actually
computes the derivatives of your function. It is possible by comparing with
numerical approximations using small finite differences::
from torch.autograd import gradcheck
# gradcheck takes a tuple of tensors as input, checks if the gradients
# evaluated with these tensors are close enough to numerical
# approximations, and returns True if they all verify this condition.
input = (Variable(torch.randn(20,20).double(), requires_grad=True), Variable(torch.randn(30,20).double(), requires_grad=True),)
test = gradcheck(LinearFunction.apply, input, eps=1e-6, atol=1e-4)
print(test)
Extending :mod:`torch.nn`
-------------------------
.. currentmodule:: torch.nn
:mod:`~torch.nn` exports two kinds of interfaces - modules and their functional
versions. You can extend it in both ways, but we recommend using modules for
all kinds of layers that hold any parameters or buffers, and recommend using
a functional form for parameter-less operations like activation functions, pooling,
etc.
Adding a functional version of an operation is already fully covered in the
section above.
Adding a :class:`Module`
^^^^^^^^^^^^^^^^^^^^^^^^
Since :mod:`~torch.nn` heavily utilizes :mod:`~torch.autograd`, adding a new
:class:`Module` requires implementing a :class:`~torch.autograd.Function`
that performs the operation and can compute the gradient. From now on let's
assume that we want to implement a ``Linear`` module and we have the function
implemented as in the listing above. There's very little code required to
add this. Now, there are two functions that need to be implemented:
- ``__init__`` (*optional*) - takes in arguments such as kernel sizes, numbers
of features, etc. and initializes parameters and buffers.
- :meth:`~Module.forward` - instantiates a :class:`~torch.autograd.Function` and
uses it to perform the operation. It's very similar to a functional wrapper
shown above.
This is how a ``Linear`` module can be implemented::
class Linear(nn.Module):
def __init__(self, input_features, output_features, bias=True):
super(Linear, self).__init__()
self.input_features = input_features
self.output_features = output_features
# nn.Parameter is a special kind of Variable, that will get
# automatically registered as Module's parameter once it's assigned
# as an attribute. Parameters and buffers need to be registered, or
# they won't appear in .parameters() (doesn't apply to buffers), and
# won't be converted when e.g. .cuda() is called. You can use
# .register_buffer() to register buffers.
# nn.Parameters can never be volatile and, unlike Variables,
# they require gradients by default.
self.weight = nn.Parameter(torch.Tensor(output_features, input_features))
if bias:
self.bias = nn.Parameter(torch.Tensor(output_features))
else:
# You should always register all possible parameters, but the
# optional ones can be None if you want.
self.register_parameter('bias', None)
# Not a very smart way to initialize weights
self.weight.data.uniform_(-0.1, 0.1)
if self.bias is not None:
self.bias.data.uniform_(-0.1, 0.1)
def forward(self, input):
# See the autograd section for explanation of what happens here.
return LinearFunction.apply(input, self.weight, self.bias)
Writing custom C extensions
---------------------------
Coming soon. For now you can find an example at
`GitHub <https://github.com/pytorch/extension-ffi>`_.


@ -0,0 +1,124 @@
Multiprocessing best practices
==============================
:mod:`torch.multiprocessing` is a drop-in replacement for Python's
:mod:`python:multiprocessing` module. It supports the exact same operations,
but extends it, so that all tensors sent through a
:class:`python:multiprocessing.Queue` will have their data moved into shared
memory, and only a handle will be sent to the other process.
.. note::
When a :class:`~torch.autograd.Variable` is sent to another process, both
the :attr:`Variable.data` and :attr:`Variable.grad.data` are going to be
shared.
This makes it possible to implement various training methods, like Hogwild, A3C,
or any others that require asynchronous operation.
Sharing CUDA tensors
--------------------
Sharing CUDA tensors between processes is supported only in Python 3, using
the ``spawn`` or ``forkserver`` start methods. :mod:`python:multiprocessing` in
Python 2 can only create subprocesses using ``fork``, which is not supported
by the CUDA runtime.
.. warning::
The CUDA API requires that an allocation exported to other processes remains
valid for as long as they use it. You should take care to ensure that the
CUDA tensors you share don't go out of scope while they are still needed.
This shouldn't be a problem for sharing model parameters, but passing other
kinds of data should be done with care. Note that this restriction doesn't
apply to shared CPU memory.
See also: :ref:`cuda-nn-dataparallel-instead`
Best practices and tips
-----------------------
Avoiding and fighting deadlocks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are a lot of things that can go wrong when a new process is spawned, with
the most common cause of deadlocks being background threads. If there's any
thread that holds a lock or imports a module, and ``fork`` is called, it's very
likely that the subprocess will be in a corrupted state and will deadlock or
fail in a different way. Note that even if your code doesn't spawn any threads,
Python built-in libraries do - no need to look further than
:mod:`python:multiprocessing`.
:class:`python:multiprocessing.Queue` is actually a very complex class that
spawns multiple threads to serialize, send and receive objects, and they
can cause the aforementioned problems too. If you find yourself in such a
situation, try using a :class:`~python:multiprocessing.queues.SimpleQueue`,
which doesn't use any additional threads.
We're trying our best to make it easy for you and ensure these deadlocks don't
happen, but some things are out of our control. If you run into an issue you
can't resolve for a while, try reaching out on the forums, and we'll see if
it's something we can fix.
Reuse buffers passed through a Queue
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Remember that each time you put a :class:`~torch.Tensor` into a
:class:`python:multiprocessing.Queue`, it has to be moved into shared memory.
If it's already shared, that is a no-op; otherwise it will incur an additional
memory copy that can slow down the whole process. Even if you have a pool of
processes sending data to a single one, make it send the buffers back - this
is nearly free and will let you avoid a copy when sending the next batch.
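For example, a minimal sketch of such a buffer pool might look like this (the
shapes, queue layout, and helper names are made up for illustration)::

import torch
import torch.multiprocessing as mp

def consumer(batch_queue, free_queue):
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        # ... do something with the batch ...
        free_queue.put(batch)        # hand the buffer back for reuse

if __name__ == '__main__':
    batch_queue, free_queue = mp.Queue(), mp.Queue()
    # Pre-allocate a small pool of buffers living in shared memory.
    for _ in range(2):
        buf = torch.zeros(32, 128)
        buf.share_memory_()
        free_queue.put(buf)
    p = mp.Process(target=consumer, args=(batch_queue, free_queue))
    p.start()
    for _ in range(10):
        buf = free_queue.get()       # already shared, so no extra copy below
        buf.normal_()                # fill it with new data in-place
        batch_queue.put(buf)
    batch_queue.put(None)            # tell the consumer to stop
    p.join()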
Asynchronous multiprocess training (e.g. Hogwild)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Using :mod:`torch.multiprocessing`, it is possible to train a model
asynchronously, with parameters either shared all the time or periodically
synchronized. In the first case, we recommend sending over the whole model
object, while in the latter we advise sending only the
:meth:`~torch.nn.Module.state_dict`.
We recommend using :class:`python:multiprocessing.Queue` for passing all kinds
of PyTorch objects between processes. It is possible, e.g. when using the
``fork`` start method, to inherit tensors and storages that are already in
shared memory; however, this is very bug prone and should be used with care,
and only by advanced users. Queues, even though they're sometimes a less
elegant solution, will work properly in all cases.
.. warning::
You should be careful about global statements that are not guarded by an
``if __name__ == '__main__'`` check. If a start method other than ``fork``
is used, they will be executed in all subprocesses.
Hogwild
~~~~~~~
A concrete Hogwild implementation can be found in the `examples repository`__,
but to showcase the overall structure of the code, there's also a minimal
example below::
import torch.multiprocessing as mp
from model import MyModel

def train(model):
    # Construct data_loader, optimizer, etc.
    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters

if __name__ == '__main__':
    num_processes = 4
    model = MyModel()
    # NOTE: this is required for the ``fork`` method to work
    model.share_memory()
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
.. __: https://github.com/pytorch/examples/tree/master/mnist_hogwild


@ -0,0 +1,34 @@
Serialization semantics
=======================
Best practices
--------------
.. _recommend-saving-models:
Recommended approach for saving a model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are two main approaches for serializing and restoring a model.
The first (recommended) saves and loads only the model parameters::
torch.save(the_model.state_dict(), PATH)
Then later::
the_model = TheModelClass(*args, **kwargs)
the_model.load_state_dict(torch.load(PATH))
The second saves and loads the entire model::
torch.save(the_model, PATH)
Then later::
the_model = torch.load(PATH)
However, in this case the serialized data is bound to the specific classes
and the exact directory structure used, so it can break in various ways when
used in other projects or after serious refactors.

docs/source/onnx.rst

@ -0,0 +1,182 @@
torch.onnx
============
.. automodule:: torch.onnx
Example: End-to-end AlexNet from PyTorch to Caffe2
--------------------------------------------------
Here is a simple script which exports a pretrained AlexNet as defined in
torchvision into ONNX. It runs a single round of inference and then
saves the resulting traced model to ``alexnet.proto``::
from torch.autograd import Variable
import torch.onnx
import torchvision
dummy_input = Variable(torch.randn(10, 3, 224, 224)).cuda()
model = torchvision.models.alexnet(pretrained=True).cuda()
torch.onnx.export(model, dummy_input, "alexnet.proto", verbose=True)
The resulting ``alexnet.proto`` is a binary protobuf file which contains both
the network structure and parameters of the model you exported
(in this case, AlexNet). The keyword argument ``verbose=True`` causes the
exporter to print out a human-readable representation of the network::
# All parameters are encoded explicitly as inputs. By convention,
# learned parameters (ala nn.Module.state_dict) are first, and the
# actual inputs are last.
graph(%1 : Float(64, 3, 11, 11)
%2 : Float(64)
# The definition sites of all variables are annotated with type
# information, specifying the type and size of tensors.
# For example, %3 is a 192 x 64 x 5 x 5 tensor of floats.
%3 : Float(192, 64, 5, 5)
%4 : Float(192)
# ---- omitted for brevity ----
%15 : Float(1000, 4096)
%16 : Float(1000)
%17 : Float(10, 3, 224, 224)) { # the actual input!
# Every statement consists of some output tensors (and their types),
# the operator to be run (with its attributes, e.g., kernels, strides,
# etc.), its input tensors (%17, %1)
%19 : UNKNOWN_TYPE = Conv[kernels=[11, 11], strides=[4, 4], pads=[2, 2, 2, 2], dilations=[1, 1], group=1](%17, %1), uses = [[%20.i0]];
# UNKNOWN_TYPE: sometimes type information is not known. We hope to eliminate
# all such cases in a later release.
%20 : Float(10, 64, 55, 55) = Add[broadcast=1, axis=1](%19, %2), uses = [%21.i0];
%21 : Float(10, 64, 55, 55) = Relu(%20), uses = [%22.i0];
%22 : Float(10, 64, 27, 27) = MaxPool[kernels=[3, 3], pads=[0, 0, 0, 0], dilations=[1, 1], strides=[2, 2]](%21), uses = [%23.i0];
# ...
# Finally, a network returns some tensors
return (%58);
}
You can also verify the protobuf using the `onnx <https://github.com/onnx/onnx/>`_ library.
You can install ``onnx`` with conda::
conda install -c conda-forge onnx
Then, you can run::
import onnx
# Load the ONNX model
model = onnx.load("alexnet.proto")
# Check that the IR is well formed
onnx.checker.check_model(model)
# Print a human readable representation of the graph
onnx.helper.printable_graph(model.graph)
To run the exported script with `caffe2 <https://caffe2.ai/>`_, you will need two things:
1. You'll need an install of Caffe2. If you don't have one already, please
`follow the install instructions <https://caffe2.ai/docs/getting-started.html>`_.
2. You'll need `onnx-caffe2 <https://github.com/onnx/onnx-caffe2>`_, a
pure-Python library which provides a Caffe2 backend for ONNX. You can install ``onnx-caffe2``
with pip::
pip install onnx-caffe2
Once these are installed, you can use the backend for Caffe2::
# ...continuing from above
import onnx_caffe2.backend as backend
import numpy as np
rep = backend.prepare(model, device="CUDA:0") # or "CPU"
# For the Caffe2 backend:
# rep.predict_net is the Caffe2 protobuf for the network
# rep.workspace is the Caffe2 workspace for the network
# (see the class onnx_caffe2.backend.Workspace)
outputs = rep.run(np.random.randn(10, 3, 224, 224).astype(np.float32))
# To run networks with more than one input, pass a tuple
# rather than a single numpy ndarray.
print(outputs[0])
In the future, there will be backends for other frameworks as well.
Limitations
-----------
* The ONNX exporter is a *trace-based* exporter, which means that it
operates by executing your model once, and exporting the operators which
were actually run during this run. This means that if your model is
dynamic, e.g., changes behavior depending on input data, the export
won't be accurate. Similarly, a trace is likely to be valid only
for a specific input size (which is one reason why we require explicit inputs
on tracing.) We recommend examining the model trace and making sure
the traced operators look reasonable (see the sketch after this list).
* PyTorch and Caffe2 often have implementations of operators with some
numeric differences. Depending on model structure, these differences
may be negligible, but they can also cause major divergences in behavior
(especially on untrained models.) In a future release, we plan to
allow Caffe2 to call directly to Torch implementations of operators, to
help you smooth over these differences when precision is important,
and to also document these differences.
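To make the first limitation concrete, here is a hypothetical module whose
control flow depends on the input *values*; the class name, shapes, and output
file are made up for illustration, and the point is only that the exported
graph records the single branch taken for ``dummy_input``::

import torch
import torch.onnx
from torch.autograd import Variable

class DataDependent(torch.nn.Module):
    def forward(self, x):
        # The branch taken depends on the values of x, so a trace can only
        # record one of the two code paths.
        if x.sum().data[0] > 0:
            return x.tanh()
        return x.sigmoid()

model = DataDependent()
dummy_input = Variable(torch.ones(2, 3))  # takes the tanh() branch
# The exported graph always applies tanh, even for inputs whose sum is
# negative, because only this execution path was seen during tracing.
torch.onnx.export(model, dummy_input, "data_dependent.proto", verbose=True)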
Supported operators
-------------------
The following operators are supported:
* add (nonzero alpha not supported)
* sub (nonzero alpha not supported)
* mul
* div
* cat
* mm
* addmm
* neg
* tanh
* sigmoid
* mean
* t
* expand (only when used before a broadcasting ONNX operator; e.g., add)
* transpose
* view
* split
* squeeze
* prelu (single weight shared among input channels not supported)
* threshold (non-zero threshold/non-zero value not supported)
* leaky_relu
* glu
* softmax
* avg_pool2d (ceil_mode not supported)
* log_softmax
* unfold (experimental support with ATen-Caffe2 integration)
* elu
* Conv
* BatchNorm
* MaxPool1d (ceil_mode not supported)
* MaxPool2d (ceil_mode not supported)
* MaxPool3d (ceil_mode not supported)
* Embedding (no optional arguments supported)
* RNN
* ConstantPadNd
* Dropout
* FeatureDropout (training mode not supported)
* Index (constant integer and tuple indices supported)
* Negate
The operator set above is sufficient to export the following models:
* AlexNet
* DCGAN
* DenseNet
* Inception (warning: this model is highly sensitive to changes in operator
implementation)
* ResNet
* SuperResolution
* VGG
* `word_language_model <https://github.com/pytorch/examples/tree/master/word_language_model>`_
The interface for specifying operator definitions is highly experimental
and undocumented; adventurous users should note that the APIs will probably
change in a future release.
Functions
--------------------------
.. autofunction:: export

docs/source/optim.rst

@ -0,0 +1,147 @@
torch.optim
===================================
.. automodule:: torch.optim
How to use an optimizer
-----------------------
To use :mod:`torch.optim` you have to construct an optimizer object that will
hold the current state and update the parameters based on the computed gradients.
Constructing it
^^^^^^^^^^^^^^^
To construct an :class:`Optimizer` you have to give it an iterable containing the
parameters (all should be :class:`~torch.autograd.Variable` s) to optimize. Then,
you can specify optimizer-specific options such as the learning rate, weight decay, etc.
.. note::
If you need to move a model to GPU via `.cuda()`, please do so before
constructing optimizers for it. Parameters of a model after `.cuda()` will
be different objects from those before the call.
In general, you should make sure that optimized parameters live in
consistent locations when optimizers are constructed and used.
Example::
optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr = 0.0001)
Per-parameter options
^^^^^^^^^^^^^^^^^^^^^
:class:`Optimizer` s also support specifying per-parameter options. To do this, instead
of passing an iterable of :class:`~torch.autograd.Variable` s, pass in an iterable of
:class:`dict` s. Each of them will define a separate parameter group, and should contain
a ``params`` key, containing a list of parameters belonging to it. Other keys
should match the keyword arguments accepted by the optimizers, and will be used
as optimization options for this group.
.. note::
You can still pass options as keyword arguments. They will be used as
defaults for the groups that didn't override them. This is useful when you
only want to vary a single option, while keeping all others consistent
between parameter groups.
For example, this is very useful when one wants to specify per-layer learning rates::
optim.SGD([
{'params': model.base.parameters()},
{'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
This means that ``model.base``'s parameters will use the default learning rate of ``1e-2``,
``model.classifier``'s parameters will use a learning rate of ``1e-3``, and a momentum of
``0.9`` will be used for all parameters.
Taking an optimization step
^^^^^^^^^^^^^^^^^^^^^^^^^^^
All optimizers implement a :func:`~Optimizer.step` method that updates the
parameters. It can be used in two ways:
``optimizer.step()``
~~~~~~~~~~~~~~~~~~~~
This is a simplified version supported by most optimizers. The function can be
called once the gradients are computed using e.g.
:func:`~torch.autograd.Variable.backward`.
Example::
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
``optimizer.step(closure)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some optimization algorithms such as Conjugate Gradient and LBFGS need to
reevaluate the function multiple times, so you have to pass in a closure that
allows them to recompute your model. The closure should clear the gradients,
compute the loss, and return it.
Example::
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
Algorithms
----------
.. autoclass:: Optimizer
:members:
.. autoclass:: Adadelta
:members:
.. autoclass:: Adagrad
:members:
.. autoclass:: Adam
:members:
.. autoclass:: SparseAdam
:members:
.. autoclass:: Adamax
:members:
.. autoclass:: ASGD
:members:
.. autoclass:: LBFGS
:members:
.. autoclass:: RMSprop
:members:
.. autoclass:: Rprop
:members:
.. autoclass:: SGD
:members:
How to adjust Learning Rate
---------------------------
:mod:`torch.optim.lr_scheduler` provides several methods to adjust the learning
rate based on the number of epochs. :class:`torch.optim.lr_scheduler.ReduceLROnPlateau`
allows dynamic learning rate reduction based on some validation measurements.
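For example, a minimal sketch of wiring a scheduler into a training loop might
look like this (``model``, ``train`` and ``validate`` are assumed to be defined
elsewhere)::

import torch.optim as optim
from torch.optim import lr_scheduler

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Hypothetical schedule: multiply the learning rate by 0.1 every 30 epochs.
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(100):
    scheduler.step()        # adjust the learning rate for this epoch
    train(epoch)            # assumed user-defined training pass
    validate(epoch)         # assumed user-defined validation pass

# ReduceLROnPlateau reacts to a validation metric instead:
#     scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
#     scheduler.step(val_loss)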
.. autoclass:: torch.optim.lr_scheduler.LambdaLR
:members:
.. autoclass:: torch.optim.lr_scheduler.StepLR
:members:
.. autoclass:: torch.optim.lr_scheduler.MultiStepLR
:members:
.. autoclass:: torch.optim.lr_scheduler.ExponentialLR
:members:
.. autoclass:: torch.optim.lr_scheduler.CosineAnnealingLR
:members:
.. autoclass:: torch.optim.lr_scheduler.ReduceLROnPlateau
:members:

docs/source/sparse.rst

@ -0,0 +1,128 @@
.. currentmodule:: torch.sparse
torch.sparse
============
.. warning::
This API is currently experimental and may change in the near future.
Torch supports sparse tensors in COO(rdinate) format, which can
efficiently store and process tensors for which the majority of elements
are zeros.
A sparse tensor is represented as a pair of dense tensors: a tensor
of values and a 2D tensor of indices. A sparse tensor can be constructed
by providing these two tensors, as well as the size of the sparse tensor
(which cannot be inferred from these tensors!) Suppose we want to define
a sparse tensor with the entry 3 at location (0, 2), entry 4 at
location (1, 0), and entry 5 at location (1, 2). We would then write:
>>> i = torch.LongTensor([[0, 1, 1],
[2, 0, 2]])
>>> v = torch.FloatTensor([3, 4, 5])
>>> torch.sparse.FloatTensor(i, v, torch.Size([2,3])).to_dense()
0 0 3
4 0 5
[torch.FloatTensor of size 2x3]
Note that the input to LongTensor is NOT a list of index tuples. If you want
to write your indices this way, you should transpose before passing them to
the sparse constructor:
>>> i = torch.LongTensor([[0, 2], [1, 0], [1, 2]])
>>> v = torch.FloatTensor([3, 4, 5 ])
>>> torch.sparse.FloatTensor(i.t(), v, torch.Size([2,3])).to_dense()
0 0 3
4 0 5
[torch.FloatTensor of size 2x3]
You can also construct hybrid sparse tensors, where only the first n
dimensions are sparse, and the rest of the dimensions are dense.
>>> i = torch.LongTensor([[2, 4]])
>>> v = torch.FloatTensor([[1, 3], [5, 7]])
>>> torch.sparse.FloatTensor(i, v).to_dense()
0 0
0 0
1 3
0 0
5 7
[torch.FloatTensor of size 5x2]
An empty sparse tensor can be constructed by specifying its size:
>>> torch.sparse.FloatTensor(2, 3)
SparseFloatTensor of size 2x3 with indices:
[torch.LongTensor with no dimension]
and values:
[torch.FloatTensor with no dimension]
.. note::
Our sparse tensor format permits *uncoalesced* sparse tensors, where
there may be duplicate coordinates in the indices; in this case,
the interpretation is that the value at that index is the sum of all
duplicate value entries. Uncoalesced tensors permit us to implement
certain operators more efficiently.
For the most part, you shouldn't have to care whether or not a
sparse tensor is coalesced, as most operations will work
identically given a coalesced or uncoalesced sparse tensor.
However, there are two cases in which you may need to care.
First, if you repeatedly perform an operation that can produce
duplicate entries (e.g., :func:`torch.sparse.FloatTensor.add`), you
should occasionally coalesce your sparse tensors to prevent
them from growing too large.
Second, some operators will produce different values depending on
whether or not they are coalesced (e.g.,
:func:`torch.sparse.FloatTensor._values` and
:func:`torch.sparse.FloatTensor._indices`, as well as
:func:`torch.Tensor._sparse_mask`). These operators are
prefixed by an underscore to indicate that they reveal internal
implementation details and should be used with care, since code
that works with coalesced sparse tensors may not work with
uncoalesced sparse tensors; generally speaking, it is safest
to explicitly coalesce before working with these operators.
For example, suppose that we wanted to implement an operator
by operating directly on :func:`torch.sparse.FloatTensor._values`.
Multiplication by a scalar can be implemented in the obvious way,
as multiplication distributes over addition; however, square root
cannot be implemented directly, since ``sqrt(a + b) != sqrt(a) +
sqrt(b)`` (which is what would be computed if you were given an
uncoalesced tensor.)
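For instance, constructing a sparse tensor with a duplicate coordinate keeps
both stored entries until it is coalesced; the snippet below is a small
illustration of the behaviour described above (the exact output formatting may
differ):
>>> i = torch.LongTensor([[0, 0], [2, 2]])   # the same coordinate twice
>>> v = torch.FloatTensor([3, 4])
>>> s = torch.sparse.FloatTensor(i, v, torch.Size([2, 3]))
>>> s._nnz()          # two stored entries for the same coordinate
2
>>> s.to_dense()      # duplicates are summed
0 0 7
0 0 0
[torch.FloatTensor of size 2x3]
>>> s.coalesce()._nnz()
1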
.. class:: FloatTensor()
.. method:: add
.. method:: add_
.. method:: clone
.. method:: dim
.. method:: div
.. method:: div_
.. method:: get_device
.. method:: hspmm
.. method:: mm
.. method:: mul
.. method:: mul_
.. method:: resizeAs_
.. method:: size
.. method:: spadd
.. method:: spmm
.. method:: sspaddmm
.. method:: sspmm
.. method:: sub
.. method:: sub_
.. method:: t_
.. method:: toDense
.. method:: transpose
.. method:: transpose_
.. method:: zero_
.. method:: coalesce
.. method:: is_coalesced
.. method:: _indices
.. method:: _values
.. method:: _nnz

docs/source/storage.rst

@ -0,0 +1,12 @@
torch.Storage
===================================
A :class:`torch.Storage` is a contiguous, one-dimensional array of a single
data type.
Every :class:`torch.Tensor` has a corresponding storage of the same data type.
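For example, the elements of a 2x3 tensor can be viewed through its storage,
which lays them out contiguously (output shown for illustration; the exact
formatting may differ):
>>> x = torch.FloatTensor([[1, 2, 3], [4, 5, 6]])
>>> x.storage()
1.0
2.0
3.0
4.0
5.0
6.0
[torch.FloatStorage of size 6]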
.. autoclass:: torch.FloatStorage
:members:
:undoc-members:
:inherited-members:

docs/source/tensors.rst

@ -0,0 +1,324 @@
.. currentmodule:: torch
.. _tensor-doc:
torch.Tensor
===================================
A :class:`torch.Tensor` is a multi-dimensional matrix containing elements of
a single data type.
Torch defines eight CPU tensor types and eight GPU tensor types:
======================== =========================== ================================
Data type CPU tensor GPU tensor
======================== =========================== ================================
32-bit floating point :class:`torch.FloatTensor` :class:`torch.cuda.FloatTensor`
64-bit floating point :class:`torch.DoubleTensor` :class:`torch.cuda.DoubleTensor`
16-bit floating point :class:`torch.HalfTensor` :class:`torch.cuda.HalfTensor`
8-bit integer (unsigned) :class:`torch.ByteTensor` :class:`torch.cuda.ByteTensor`
8-bit integer (signed) :class:`torch.CharTensor` :class:`torch.cuda.CharTensor`
16-bit integer (signed) :class:`torch.ShortTensor` :class:`torch.cuda.ShortTensor`
32-bit integer (signed) :class:`torch.IntTensor` :class:`torch.cuda.IntTensor`
64-bit integer (signed) :class:`torch.LongTensor` :class:`torch.cuda.LongTensor`
======================== =========================== ================================
The :class:`torch.Tensor` constructor is an alias for the default tensor type
(:class:`torch.FloatTensor`).
A tensor can be constructed from a Python :class:`list` or sequence:
::
>>> torch.FloatTensor([[1, 2, 3], [4, 5, 6]])
1 2 3
4 5 6
[torch.FloatTensor of size 2x3]
An empty tensor can be constructed by specifying its size:
::
>>> torch.IntTensor(2, 4).zero_()
0 0 0 0
0 0 0 0
[torch.IntTensor of size 2x4]
The contents of a tensor can be accessed and modified using Python's indexing
and slicing notation:
::
>>> x = torch.FloatTensor([[1, 2, 3], [4, 5, 6]])
>>> print(x[1][2])
6.0
>>> x[0][1] = 8
>>> print(x)
1 8 3
4 5 6
[torch.FloatTensor of size 2x3]
Each tensor has an associated :class:`torch.Storage`, which holds its data.
The tensor class provides a multi-dimensional, `strided <https://en.wikipedia.org/wiki/Stride_of_an_array>`_
view of a storage and defines numeric operations on it.
.. note::
Methods which mutate a tensor are marked with an underscore suffix.
For example, :func:`torch.FloatTensor.abs_` computes the absolute value
in-place and returns the modified tensor, while :func:`torch.FloatTensor.abs`
computes the result in a new tensor.
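For example (output shown for illustration)::
>>> x = torch.FloatTensor([-1, -2, 3])
>>> x.abs()        # computes the result in a new tensor
1
2
3
[torch.FloatTensor of size 3]
>>> x              # x itself is unchanged
-1
-2
3
[torch.FloatTensor of size 3]
>>> x.abs_()       # computes the result in-place and returns the modified x
1
2
3
[torch.FloatTensor of size 3]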
.. class:: Tensor()
Tensor(*sizes)
Tensor(size)
Tensor(sequence)
Tensor(ndarray)
Tensor(tensor)
Tensor(storage)
Creates a new tensor from an optional size or data.
If no arguments are given, an empty zero-dimensional tensor is returned.
If a :class:`numpy.ndarray`, :class:`torch.Tensor`, or :class:`torch.Storage`
is given, a new tensor that shares the same data is returned. If a Python
sequence is given, a new tensor is created from a copy of the sequence.
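For example, constructing a tensor from another tensor shares the underlying
data, as described above (output shown for illustration)::
>>> x = torch.FloatTensor([1, 2, 3])
>>> y = torch.FloatTensor(x)   # shares the same data as x
>>> y[0] = 10
>>> x[0]
10.0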
.. automethod:: abs
.. automethod:: abs_
.. automethod:: acos
.. automethod:: acos_
.. automethod:: add
.. automethod:: add_
.. automethod:: addbmm
.. automethod:: addbmm_
.. automethod:: addcdiv
.. automethod:: addcdiv_
.. automethod:: addcmul
.. automethod:: addcmul_
.. automethod:: addmm
.. automethod:: addmm_
.. automethod:: addmv
.. automethod:: addmv_
.. automethod:: addr
.. automethod:: addr_
.. automethod:: apply_
.. automethod:: asin
.. automethod:: asin_
.. automethod:: atan
.. automethod:: atan2
.. automethod:: atan2_
.. automethod:: atan_
.. automethod:: baddbmm
.. automethod:: baddbmm_
.. automethod:: bernoulli
.. automethod:: bernoulli_
.. automethod:: bmm
.. automethod:: byte
.. automethod:: cauchy_
.. automethod:: ceil
.. automethod:: ceil_
.. automethod:: char
.. automethod:: chunk
.. automethod:: clamp
.. automethod:: clamp_
.. automethod:: clone
.. automethod:: contiguous
.. automethod:: copy_
.. automethod:: cos
.. automethod:: cos_
.. automethod:: cosh
.. automethod:: cosh_
.. automethod:: cpu
.. automethod:: cross
.. automethod:: cuda
.. automethod:: cumprod
.. automethod:: cumsum
.. automethod:: data_ptr
.. automethod:: diag
.. automethod:: dim
.. automethod:: dist
.. automethod:: div
.. automethod:: div_
.. automethod:: dot
.. automethod:: double
.. automethod:: eig
.. automethod:: element_size
.. automethod:: eq
.. automethod:: eq_
.. automethod:: equal
.. automethod:: erf
.. automethod:: erf_
.. automethod:: erfinv
.. automethod:: erfinv_
.. automethod:: exp
.. automethod:: exp_
.. automethod:: expand
.. automethod:: expand_as
.. automethod:: exponential_
.. automethod:: fill_
.. automethod:: float
.. automethod:: floor
.. automethod:: floor_
.. automethod:: fmod
.. automethod:: fmod_
.. automethod:: frac
.. automethod:: frac_
.. automethod:: gather
.. automethod:: ge
.. automethod:: ge_
.. automethod:: gels
.. automethod:: geometric_
.. automethod:: geqrf
.. automethod:: ger
.. automethod:: gesv
.. automethod:: gt
.. automethod:: gt_
.. automethod:: half
.. automethod:: histc
.. automethod:: index
.. automethod:: index_add_
.. automethod:: index_copy_
.. automethod:: index_fill_
.. automethod:: index_select
.. automethod:: int
.. automethod:: inverse
.. automethod:: is_contiguous
.. autoattribute:: is_cuda
:annotation:
.. automethod:: is_pinned
.. automethod:: is_set_to
.. automethod:: is_signed
.. automethod:: kthvalue
.. automethod:: le
.. automethod:: le_
.. automethod:: lerp
.. automethod:: lerp_
.. automethod:: log
.. automethod:: log1p
.. automethod:: log1p_
.. automethod:: log_
.. automethod:: log_normal_
.. automethod:: long
.. automethod:: lt
.. automethod:: lt_
.. automethod:: map_
.. automethod:: masked_scatter_
.. automethod:: masked_fill_
.. automethod:: masked_select
.. automethod:: matmul
.. automethod:: max
.. automethod:: mean
.. automethod:: median
.. automethod:: min
.. automethod:: mm
.. automethod:: mode
.. automethod:: mul
.. automethod:: mul_
.. automethod:: multinomial
.. automethod:: mv
.. automethod:: narrow
.. automethod:: ndimension
.. automethod:: ne
.. automethod:: ne_
.. automethod:: neg
.. automethod:: neg_
.. automethod:: nelement
.. automethod:: new
.. automethod:: nonzero
.. automethod:: norm
.. automethod:: normal_
.. automethod:: numel
.. automethod:: numpy
.. automethod:: orgqr
.. automethod:: ormqr
.. automethod:: permute
.. automethod:: pin_memory
.. automethod:: potrf
.. automethod:: potri
.. automethod:: potrs
.. automethod:: pow
.. automethod:: pow_
.. automethod:: prod
.. automethod:: pstrf
.. automethod:: put_
.. automethod:: qr
.. automethod:: random_
.. automethod:: reciprocal
.. automethod:: reciprocal_
.. automethod:: remainder
.. automethod:: remainder_
.. automethod:: renorm
.. automethod:: renorm_
.. automethod:: repeat
.. automethod:: resize_
.. automethod:: resize_as_
.. automethod:: round
.. automethod:: round_
.. automethod:: rsqrt
.. automethod:: rsqrt_
.. automethod:: scatter_
.. automethod:: select
.. automethod:: set_
.. automethod:: share_memory_
.. automethod:: short
.. automethod:: sigmoid
.. automethod:: sigmoid_
.. automethod:: sign
.. automethod:: sign_
.. automethod:: sin
.. automethod:: sin_
.. automethod:: sinh
.. automethod:: sinh_
.. automethod:: size
.. automethod:: sort
.. automethod:: split
.. automethod:: sqrt
.. automethod:: sqrt_
.. automethod:: squeeze
.. automethod:: squeeze_
.. automethod:: std
.. automethod:: storage
.. automethod:: storage_offset
.. automethod:: storage_type
.. automethod:: stride
.. automethod:: sub
.. automethod:: sub_
.. automethod:: sum
.. automethod:: svd
.. automethod:: symeig
.. automethod:: t
.. automethod:: t_
.. automethod:: take
.. automethod:: tan
.. automethod:: tan_
.. automethod:: tanh
.. automethod:: tanh_
.. automethod:: tolist
.. automethod:: topk
.. automethod:: trace
.. automethod:: transpose
.. automethod:: transpose_
.. automethod:: tril
.. automethod:: tril_
.. automethod:: triu
.. automethod:: triu_
.. automethod:: trtrs
.. automethod:: trunc
.. automethod:: trunc_
.. automethod:: type
.. automethod:: type_as
.. automethod:: unfold
.. automethod:: uniform_
.. automethod:: unsqueeze
.. automethod:: unsqueeze_
.. automethod:: var
.. automethod:: view
.. automethod:: view_as
.. automethod:: zero_
.. class:: ByteTensor()
The following methods are unique to :class:`torch.ByteTensor`.
.. automethod:: all
.. automethod:: any

docs/source/torch.rst

@ -0,0 +1,203 @@
torch
===================================
.. automodule:: torch
Tensors
----------------------------------
.. autofunction:: is_tensor
.. autofunction:: is_storage
.. autofunction:: set_default_tensor_type
.. autofunction:: numel
.. autofunction:: set_printoptions
Creation Ops
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: eye
.. autofunction:: from_numpy
.. autofunction:: linspace
.. autofunction:: logspace
.. autofunction:: ones
.. autofunction:: ones_like
.. autofunction:: arange
.. autofunction:: range
.. autofunction:: zeros
.. autofunction:: zeros_like
Indexing, Slicing, Joining, Mutating Ops
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cat
.. autofunction:: chunk
.. autofunction:: gather
.. autofunction:: index_select
.. autofunction:: masked_select
.. autofunction:: nonzero
.. autofunction:: split
.. autofunction:: squeeze
.. autofunction:: stack
.. autofunction:: t
.. autofunction:: take
.. autofunction:: transpose
.. autofunction:: unbind
.. autofunction:: unsqueeze
Random sampling
----------------------------------
.. autofunction:: manual_seed
.. autofunction:: initial_seed
.. autofunction:: get_rng_state
.. autofunction:: set_rng_state
.. autodata:: default_generator
.. autofunction:: bernoulli
.. autofunction:: multinomial
.. autofunction:: normal
.. autofunction:: rand
.. autofunction:: randn
.. autofunction:: randperm
In-place random sampling
~~~~~~~~~~~~~~~~~~~~~~~~
There are a few more in-place random sampling functions defined on Tensors as well. Click through to refer to their documentation:
- :func:`torch.Tensor.bernoulli_` - in-place version of :func:`torch.bernoulli`
- :func:`torch.Tensor.cauchy_` - numbers drawn from the Cauchy distribution
- :func:`torch.Tensor.exponential_` - numbers drawn from the exponential distribution
- :func:`torch.Tensor.geometric_` - elements drawn from the geometric distribution
- :func:`torch.Tensor.log_normal_` - samples from the log-normal distribution
- :func:`torch.Tensor.normal_` - in-place version of :func:`torch.normal`
- :func:`torch.Tensor.random_` - numbers sampled from the discrete uniform distribution
- :func:`torch.Tensor.uniform_` - numbers sampled from the uniform distribution
Serialization
----------------------------------
.. autofunction:: save
.. autofunction:: load
Parallelism
----------------------------------
.. autofunction:: get_num_threads
.. autofunction:: set_num_threads
Math operations
----------------------------------
Pointwise Ops
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: abs
.. autofunction:: acos
.. autofunction:: add
.. autofunction:: addcdiv
.. autofunction:: addcmul
.. autofunction:: asin
.. autofunction:: atan
.. autofunction:: atan2
.. autofunction:: ceil
.. autofunction:: clamp
.. autofunction:: cos
.. autofunction:: cosh
.. autofunction:: div
.. autofunction:: erf
.. autofunction:: erfinv
.. autofunction:: exp
.. autofunction:: floor
.. autofunction:: fmod
.. autofunction:: frac
.. autofunction:: lerp
.. autofunction:: log
.. autofunction:: log1p
.. autofunction:: mul
.. autofunction:: neg
.. autofunction:: pow
.. autofunction:: reciprocal
.. autofunction:: remainder
.. autofunction:: round
.. autofunction:: rsqrt
.. autofunction:: sigmoid
.. autofunction:: sign
.. autofunction:: sin
.. autofunction:: sinh
.. autofunction:: sqrt
.. autofunction:: tan
.. autofunction:: tanh
.. autofunction:: trunc
Reduction Ops
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cumprod
.. autofunction:: cumsum
.. autofunction:: dist
.. autofunction:: mean
.. autofunction:: median
.. autofunction:: mode
.. autofunction:: norm
.. autofunction:: prod
.. autofunction:: std
.. autofunction:: sum
.. autofunction:: var
Comparison Ops
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: eq
.. autofunction:: equal
.. autofunction:: ge
.. autofunction:: gt
.. autofunction:: kthvalue
.. autofunction:: le
.. autofunction:: lt
.. autofunction:: max
.. autofunction:: min
.. autofunction:: ne
.. autofunction:: sort
.. autofunction:: topk
Other Operations
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cross
.. autofunction:: diag
.. autofunction:: histc
.. autofunction:: renorm
.. autofunction:: trace
.. autofunction:: tril
.. autofunction:: triu
BLAS and LAPACK Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: addbmm
.. autofunction:: addmm
.. autofunction:: addmv
.. autofunction:: addr
.. autofunction:: baddbmm
.. autofunction:: bmm
.. autofunction:: btrifact
.. autofunction:: btrisolve
.. autofunction:: dot
.. autofunction:: eig
.. autofunction:: gels
.. autofunction:: geqrf
.. autofunction:: ger
.. autofunction:: gesv
.. autofunction:: inverse
.. autofunction:: matmul
.. autofunction:: mm
.. autofunction:: mv
.. autofunction:: orgqr
.. autofunction:: ormqr
.. autofunction:: potrf
.. autofunction:: potri
.. autofunction:: potrs
.. autofunction:: pstrf
.. autofunction:: qr
.. autofunction:: svd
.. autofunction:: symeig
.. autofunction:: trtrs

setup.py

@ -1,6 +1,9 @@
from setuptools import setup, Extension, distutils, Command, find_packages
import setuptools.command.build_ext
import setuptools.command.install
import setuptools.command.develop
import setuptools.command.build_py
import distutils.unixccompiler
import distutils.command.build
import distutils.command.clean
import platform
@ -9,35 +12,100 @@ import shutil
import sys
import os
# TODO: make this more robust
WITH_CUDA = os.path.exists('/Developer/NVIDIA/CUDA-7.5/include') or os.path.exists('/usr/local/cuda/include')
DEBUG = False
from tools.setup_helpers.env import check_env_flag
from tools.setup_helpers.cuda import WITH_CUDA, CUDA_HOME, CUDA_VERSION
from tools.setup_helpers.cudnn import WITH_CUDNN, CUDNN_LIB_DIR, CUDNN_INCLUDE_DIR
from tools.setup_helpers.nccl import WITH_NCCL, WITH_SYSTEM_NCCL, NCCL_LIB_DIR, \
NCCL_INCLUDE_DIR, NCCL_ROOT_DIR, NCCL_SYSTEM_LIB
from tools.setup_helpers.nnpack import WITH_NNPACK, NNPACK_LIB_PATHS, \
NNPACK_INCLUDE_DIRS
from tools.setup_helpers.split_types import split_types
DEBUG = check_env_flag('DEBUG')
WITH_DISTRIBUTED = not check_env_flag('NO_DISTRIBUTED')
WITH_DISTRIBUTED_MW = WITH_DISTRIBUTED and check_env_flag('WITH_DISTRIBUTED_MW')
################################################################################
# Workaround setuptools -Wstrict-prototypes warnings
# I lifted this code from https://stackoverflow.com/a/29634231/23845
################################################################################
import distutils.sysconfig
cfg_vars = distutils.sysconfig.get_config_vars()
for key, value in cfg_vars.items():
if type(value) == str:
cfg_vars[key] = value.replace("-Wstrict-prototypes", "")
################################################################################
# Monkey-patch setuptools to compile in parallel
################################################################################
original_link = distutils.unixccompiler.UnixCCompiler.link
def parallelCCompile(self, sources, output_dir=None, macros=None, include_dirs=None, debug=0, extra_preargs=None, extra_postargs=None, depends=None):
def parallelCCompile(self, sources, output_dir=None, macros=None,
include_dirs=None, debug=0, extra_preargs=None,
extra_postargs=None, depends=None):
# those lines are copied from distutils.ccompiler.CCompiler directly
macros, objects, extra_postargs, pp_opts, build = self._setup_compile(output_dir, macros, include_dirs, sources, depends, extra_postargs)
macros, objects, extra_postargs, pp_opts, build = self._setup_compile(
output_dir, macros, include_dirs, sources, depends, extra_postargs)
cc_args = self._get_cc_args(pp_opts, debug, extra_preargs)
# compile using a thread pool
import multiprocessing.pool
def _single_compile(obj):
src, ext = build[obj]
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
num_jobs = multiprocessing.cpu_count()
max_jobs = os.getenv("MAX_JOBS")
if max_jobs is not None:
num_jobs = min(num_jobs, int(max_jobs))
multiprocessing.pool.ThreadPool(num_jobs).map(_single_compile, objects)
return objects
def patched_link(self, *args, **kwargs):
_cxx = self.compiler_cxx
self.compiler_cxx = None
result = original_link(self, *args, **kwargs)
self.compiler_cxx = _cxx
return result
distutils.ccompiler.CCompiler.compile = parallelCCompile
distutils.unixccompiler.UnixCCompiler.link = patched_link
################################################################################
# Custom build commands
################################################################################
dep_libs = [
'TH', 'THS', 'THNN', 'THC', 'THCS', 'THCUNN', 'nccl', 'libshm',
'ATen', 'gloo', 'THD', 'nanopb',
]
def build_libs(libs):
for lib in libs:
assert lib in dep_libs, 'invalid lib: {}'.format(lib)
build_libs_cmd = ['bash', 'torch/lib/build_libs.sh']
my_env = os.environ.copy()
my_env["PYTORCH_PYTHON"] = sys.executable
if WITH_SYSTEM_NCCL:
my_env["NCCL_ROOT_DIR"] = NCCL_ROOT_DIR
if WITH_CUDA:
my_env["CUDA_BIN_PATH"] = CUDA_HOME
build_libs_cmd += ['--with-cuda']
if subprocess.call(build_libs_cmd + libs, env=my_env) != 0:
sys.exit(1)
if 'THNN' in libs or 'THCUNN' in libs:
from tools.nnwrap import generate_wrappers as generate_nn_wrappers
generate_nn_wrappers()
class build_deps(Command):
user_options = []
@ -48,13 +116,30 @@ class build_deps(Command):
pass
def run(self):
from tools.nnwrap import generate_wrappers as generate_nn_wrappers
build_all_cmd = ['bash', 'torch/lib/build_all.sh']
libs = ['TH', 'THS', 'THNN']
if WITH_CUDA:
build_all_cmd += ['--with-cuda']
if subprocess.call(build_all_cmd) != 0:
sys.exit(1)
generate_nn_wrappers()
libs += ['THC', 'THCS', 'THCUNN']
if WITH_NCCL and not WITH_SYSTEM_NCCL:
libs += ['nccl']
libs += ['libshm', 'ATen', 'nanopb']
if WITH_DISTRIBUTED:
if sys.platform.startswith('linux'):
libs += ['gloo']
libs += ['THD']
build_libs(libs)
build_dep_cmds = {}
for lib in dep_libs:
# wrap in function to capture lib
class build_dep(build_deps):
description = 'Build {} external library'.format(lib)
def run(self):
build_libs([self.lib])
build_dep.lib = lib
build_dep_cmds['build_' + lib.lower()] = build_dep
class build_module(Command):
@ -71,22 +156,127 @@ class build_module(Command):
self.run_command('build_ext')
class build_ext(setuptools.command.build_ext.build_ext):
class build_py(setuptools.command.build_py.build_py):
def run(self):
self.create_version_file()
setuptools.command.build_py.build_py.run(self)
@staticmethod
def create_version_file():
global version, cwd
print('-- Building version ' + version)
version_path = os.path.join(cwd, 'torch', 'version.py')
with open(version_path, 'w') as f:
f.write("__version__ = '{}'\n".format(version))
# NB: This is not 100% accurate, because you could have built the
# library code with DEBUG, but csrc without DEBUG (in which case
# this would claim to be a release build when it's not.)
f.write("debug = {}\n".format(repr(DEBUG)))
f.write("cuda = {}\n".format(repr(CUDA_VERSION)))
class develop(setuptools.command.develop.develop):
def run(self):
build_py.create_version_file()
setuptools.command.develop.develop.run(self)
def monkey_patch_THD_link_flags():
'''
THD's dynamic link deps are not determined until after build_deps is run
So, we need to monkey-patch them in later
'''
# read tmp_install_path/THD_deps.txt for THD's dynamic linkage deps
with open(tmp_install_path + '/THD_deps.txt', 'r') as f:
thd_deps_ = f.read()
thd_deps = []
# remove empty lines
for l in thd_deps_.split(';'):
if l != '':
thd_deps.append(l)
C.extra_link_args += thd_deps
class build_ext(setuptools.command.build_ext.build_ext):
def run(self):
# Print build options
if WITH_NUMPY:
print('-- Building with NumPy bindings')
else:
print('-- NumPy not found')
if WITH_CUDNN:
print('-- Detected cuDNN at ' + CUDNN_LIB_DIR + ', ' + CUDNN_INCLUDE_DIR)
else:
print('-- Not using cuDNN')
if WITH_CUDA:
print('-- Detected CUDA at ' + CUDA_HOME)
else:
print('-- Not using CUDA')
if WITH_NCCL and WITH_SYSTEM_NCCL:
print('-- Using system provided NCCL library at ' +
NCCL_SYSTEM_LIB + ', ' + NCCL_INCLUDE_DIR)
elif WITH_NCCL:
print('-- Building NCCL library')
else:
print('-- Not using NCCL')
if WITH_DISTRIBUTED:
print('-- Building with distributed package ')
monkey_patch_THD_link_flags()
else:
print('-- Building without distributed package')
# Do we actually need this here?
if WITH_NNPACK:
nnpack_dir = NNPACK_LIB_PATHS[0]
print('-- Detected NNPACK at ' + nnpack_dir)
else:
print('-- Not using NNPACK')
# cwrap depends on pyyaml, so we can't import it earlier
from tools.cwrap import cwrap
from tools.cwrap.plugins.THPPlugin import THPPlugin
from tools.cwrap.plugins.THPLongArgsPlugin import THPLongArgsPlugin
from tools.cwrap.plugins.ArgcountSortPlugin import ArgcountSortPlugin
from tools.cwrap.plugins.AutoGPU import AutoGPU
from tools.cwrap.plugins.BoolOption import BoolOption
from tools.cwrap.plugins.KwargsPlugin import KwargsPlugin
from tools.cwrap.plugins.NullableArguments import NullableArguments
from tools.cwrap.plugins.CuDNNPlugin import CuDNNPlugin
from tools.cwrap.plugins.WrapDim import WrapDim
from tools.cwrap.plugins.AssertNDim import AssertNDim
from tools.cwrap.plugins.Broadcast import Broadcast
from tools.cwrap.plugins.ProcessorSpecificPlugin import ProcessorSpecificPlugin
from tools.autograd.gen_variable_type import gen_variable_type
from tools.jit.gen_jit_dispatch import gen_jit_dispatch
thp_plugin = THPPlugin()
cwrap('torch/csrc/generic/TensorMethods.cwrap', plugins=[
THPLongArgsPlugin(), THPPlugin(), ArgcountSortPlugin(), AutoGPU()
ProcessorSpecificPlugin(), BoolOption(), thp_plugin,
AutoGPU(condition='IS_CUDA'), ArgcountSortPlugin(), KwargsPlugin(),
AssertNDim(), WrapDim(), Broadcast()
])
cwrap('torch/csrc/cudnn/cuDNN.cwrap', plugins=[
CuDNNPlugin(), NullableArguments()
])
# Build ATen based Variable classes
autograd_gen_dir = 'torch/csrc/autograd/generated'
jit_gen_dir = 'torch/csrc/jit/generated'
for d in (autograd_gen_dir, jit_gen_dir):
if not os.path.exists(d):
os.mkdir(d)
gen_variable_type(
'torch/lib/build/ATen/ATen/Declarations.yaml',
autograd_gen_dir)
gen_jit_dispatch(
'torch/lib/build/ATen/ATen/Declarations.yaml',
jit_gen_dir)
# It's an old-style class in Python 2.7...
setuptools.command.build_ext.build_ext.run(self)
class build(distutils.command.build.build):
sub_commands = [
('build_deps', lambda self: True),
@ -94,6 +284,7 @@ class build(distutils.command.build.build):
class install(setuptools.command.install.install):
def run(self):
if not self.skip_build:
self.run_command('build_deps')
@ -101,81 +292,260 @@ class install(setuptools.command.install.install):
class clean(distutils.command.clean.clean):
def run(self):
import glob
with open('.gitignore', 'r') as f:
ignores = f.read()
for glob in filter(bool, ignores.split('\n')):
shutil.rmtree(glob, ignore_errors=True)
for wildcard in filter(bool, ignores.split('\n')):
for filename in glob.glob(wildcard):
try:
os.remove(filename)
except OSError:
shutil.rmtree(filename, ignore_errors=True)
# It's an old-style class in Python 2.7...
distutils.command.clean.clean.run(self)
################################################################################
# Configure compile flags
################################################################################
include_dirs = []
library_dirs = []
extra_link_args = []
extra_compile_args = ['-std=c++11']
extra_compile_args = ['-std=c++11', '-Wno-write-strings',
# Python 2.6 requires -fno-strict-aliasing, see
# http://legacy.python.org/dev/peps/pep-3123/
'-fno-strict-aliasing',
# Clang has an unfixed bug leading to spurious missing
# braces warnings, see
# https://bugs.llvm.org/show_bug.cgi?id=21629
'-Wno-missing-braces']
cwd = os.path.dirname(os.path.abspath(__file__))
lib_path = os.path.join(cwd, "torch", "lib")
# Check if you remembered to check out submodules
def check_file(f):
if not os.path.exists(f):
print("Could not find {}".format(f))
print("Did you run 'git submodule update --init'?")
sys.exit(1)
check_file(os.path.join(lib_path, "gloo", "CMakeLists.txt"))
check_file(os.path.join(lib_path, "nanopb", "CMakeLists.txt"))
check_file(os.path.join(lib_path, "pybind11", "CMakeLists.txt"))
tmp_install_path = lib_path + "/tmp_install"
include_dirs += [
cwd,
os.path.join(cwd, "torch", "csrc"),
lib_path + "/pybind11/include",
tmp_install_path + "/include",
tmp_install_path + "/include/TH",
tmp_install_path + "/include/THNN",
tmp_install_path + "/include/ATen",
]
extra_link_args.append('-L' + lib_path)
library_dirs.append(lib_path)
main_libraries = ['TH']
# we specify exact lib names to avoid conflict with lua-torch installs
TH_LIB = os.path.join(lib_path, 'libTH.so.1')
THS_LIB = os.path.join(lib_path, 'libTHS.so.1')
THC_LIB = os.path.join(lib_path, 'libTHC.so.1')
THCS_LIB = os.path.join(lib_path, 'libTHCS.so.1')
THNN_LIB = os.path.join(lib_path, 'libTHNN.so.1')
THCUNN_LIB = os.path.join(lib_path, 'libTHCUNN.so.1')
ATEN_LIB = os.path.join(lib_path, 'libATen.so.1')
THD_LIB = os.path.join(lib_path, 'libTHD.a')
NCCL_LIB = os.path.join(lib_path, 'libnccl.so.1')
if platform.system() == 'Darwin':
TH_LIB = os.path.join(lib_path, 'libTH.1.dylib')
THS_LIB = os.path.join(lib_path, 'libTHS.1.dylib')
THC_LIB = os.path.join(lib_path, 'libTHC.1.dylib')
THCS_LIB = os.path.join(lib_path, 'libTHCS.1.dylib')
THNN_LIB = os.path.join(lib_path, 'libTHNN.1.dylib')
THCUNN_LIB = os.path.join(lib_path, 'libTHCUNN.1.dylib')
ATEN_LIB = os.path.join(lib_path, 'libATen.1.dylib')
NCCL_LIB = os.path.join(lib_path, 'libnccl.1.dylib')
# static library only
NANOPB_STATIC_LIB = os.path.join(lib_path, 'libprotobuf-nanopb.a')
main_compile_args = ['-D_THP_CORE']
main_libraries = ['shm']
main_link_args = [TH_LIB, THS_LIB, THNN_LIB, ATEN_LIB, NANOPB_STATIC_LIB]
main_sources = [
"torch/csrc/PtrWrapper.cpp",
"torch/csrc/Module.cpp",
"torch/csrc/Generator.cpp",
"torch/csrc/Tensor.cpp",
"torch/csrc/Size.cpp",
"torch/csrc/Exceptions.cpp",
"torch/csrc/Storage.cpp",
"torch/csrc/DynamicTypes.cpp",
"torch/csrc/assertions.cpp",
"torch/csrc/byte_order.cpp",
"torch/csrc/utils.cpp",
"torch/csrc/expand_utils.cpp",
"torch/csrc/utils/invalid_arguments.cpp",
"torch/csrc/utils/object_ptr.cpp",
"torch/csrc/utils/python_arg_parser.cpp",
"torch/csrc/utils/tuple_parser.cpp",
"torch/csrc/allocators.cpp",
"torch/csrc/serialization.cpp",
"torch/csrc/jit/init.cpp",
"torch/csrc/jit/ir.cpp",
"torch/csrc/jit/python_ir.cpp",
"torch/csrc/jit/test_jit.cpp",
"torch/csrc/jit/tracer.cpp",
"torch/csrc/jit/python_tracer.cpp",
"torch/csrc/jit/interned_strings.cpp",
"torch/csrc/jit/type.cpp",
"torch/csrc/jit/export.cpp",
"torch/csrc/jit/passes/graph_fuser.cpp",
"torch/csrc/jit/passes/onnx.cpp",
"torch/csrc/jit/passes/dead_code_elimination.cpp",
"torch/csrc/jit/passes/common_subexpression_elimination.cpp",
"torch/csrc/jit/passes/peephole.cpp",
"torch/csrc/jit/passes/onnx/peephole.cpp",
"torch/csrc/jit/generated/aten_dispatch.cpp",
"torch/csrc/autograd/init.cpp",
"torch/csrc/autograd/engine.cpp",
"torch/csrc/autograd/function.cpp",
"torch/csrc/autograd/variable.cpp",
"torch/csrc/autograd/saved_variable.cpp",
"torch/csrc/autograd/input_buffer.cpp",
"torch/csrc/autograd/profiler.cpp",
"torch/csrc/autograd/python_function.cpp",
"torch/csrc/autograd/python_cpp_function.cpp",
"torch/csrc/autograd/python_variable.cpp",
"torch/csrc/autograd/python_engine.cpp",
"torch/csrc/autograd/python_hook.cpp",
"torch/csrc/autograd/functions/jit_closure.cpp",
"torch/csrc/autograd/generated/VariableType.cpp",
"torch/csrc/autograd/generated/Functions.cpp",
"torch/csrc/autograd/generated/python_variable_methods.cpp",
"torch/csrc/autograd/generated/python_functions.cpp",
"torch/csrc/autograd/generated/python_nn_functions.cpp",
"torch/csrc/autograd/functions/batch_normalization.cpp",
"torch/csrc/autograd/functions/convolution.cpp",
"torch/csrc/autograd/functions/basic_ops.cpp",
"torch/csrc/autograd/functions/tensor.cpp",
"torch/csrc/autograd/functions/accumulate_grad.cpp",
"torch/csrc/autograd/functions/special.cpp",
"torch/csrc/autograd/functions/utils.cpp",
"torch/csrc/autograd/functions/init.cpp",
"torch/csrc/autograd/functions/onnx/convolution.cpp",
"torch/csrc/autograd/functions/onnx/batch_normalization.cpp",
"torch/csrc/autograd/functions/onnx/basic_ops.cpp",
"torch/csrc/onnx/onnx.pb.cpp",
"torch/csrc/onnx/onnx.cpp",
]
main_sources += split_types("torch/csrc/Tensor.cpp")
try:
import numpy as np
include_dirs += [np.get_include()]
main_sources += ["torch/csrc/numpy.cpp"]
extra_compile_args += ['-DWITH_NUMPY']
WITH_NUMPY = True
except ImportError:
pass
WITH_NUMPY = False
if WITH_DISTRIBUTED:
extra_compile_args += ['-DWITH_DISTRIBUTED']
main_sources += [
"torch/csrc/distributed/Module.cpp",
]
if WITH_DISTRIBUTED_MW:
main_sources += [
"torch/csrc/distributed/Tensor.cpp",
"torch/csrc/distributed/Storage.cpp",
]
extra_compile_args += ['-DWITH_DISTRIBUTED_MW']
include_dirs += [tmp_install_path + "/include/THD"]
main_link_args += [THD_LIB]
if WITH_CUDA:
if platform.system() == 'Darwin':
cuda_path = '/Developer/NVIDIA/CUDA-7.5'
cuda_include_path = cuda_path + '/include'
cuda_lib_path = cuda_path + '/lib'
else:
cuda_path = '/usr/local/cuda'
cuda_include_path = cuda_path + '/include'
cuda_lib_path = cuda_path + '/lib64'
cuda_lib_dirs = ['lib64', 'lib']
cuda_include_path = os.path.join(CUDA_HOME, 'include')
for lib_dir in cuda_lib_dirs:
cuda_lib_path = os.path.join(CUDA_HOME, lib_dir)
if os.path.exists(cuda_lib_path):
break
include_dirs.append(cuda_include_path)
extra_link_args.append('-L' + cuda_lib_path)
include_dirs.append(tmp_install_path + "/include/THCUNN")
library_dirs.append(cuda_lib_path)
extra_link_args.append('-Wl,-rpath,' + cuda_lib_path)
extra_compile_args += ['-DWITH_CUDA']
main_libraries += ['THC']
extra_compile_args += ['-DCUDA_LIB_PATH=' + cuda_lib_path]
main_libraries += ['cudart', 'nvToolsExt']
main_link_args += [THC_LIB, THCS_LIB, THCUNN_LIB]
main_sources += [
"torch/csrc/cuda/Module.cpp",
"torch/csrc/cuda/Storage.cpp",
"torch/csrc/cuda/Tensor.cpp",
"torch/csrc/cuda/Stream.cpp",
"torch/csrc/cuda/AutoGPU.cpp",
"torch/csrc/cuda/utils.cpp",
"torch/csrc/cuda/expand_utils.cpp",
"torch/csrc/cuda/serialization.cpp",
"torch/csrc/jit/fusion_compiler.cpp",
]
main_sources += split_types("torch/csrc/cuda/Tensor.cpp")
if WITH_NCCL:
if WITH_SYSTEM_NCCL:
main_link_args += [NCCL_SYSTEM_LIB]
include_dirs.append(NCCL_INCLUDE_DIR)
else:
main_link_args += [NCCL_LIB]
extra_compile_args += ['-DWITH_NCCL']
main_sources += [
"torch/csrc/cuda/nccl.cpp",
]
if WITH_CUDNN:
main_libraries += ['cudnn']
library_dirs.append(CUDNN_LIB_DIR)
# NOTE: these are at the front, in case there's another cuDNN in CUDA path
include_dirs.insert(0, CUDNN_INCLUDE_DIR)
extra_link_args.insert(0, '-Wl,-rpath,' + CUDNN_LIB_DIR)
main_sources += [
"torch/csrc/cudnn/BatchNorm.cpp",
"torch/csrc/cudnn/Conv.cpp",
"torch/csrc/cudnn/cuDNN.cpp",
"torch/csrc/cudnn/GridSampler.cpp",
"torch/csrc/cudnn/AffineGridGenerator.cpp",
"torch/csrc/cudnn/Types.cpp",
"torch/csrc/cudnn/Handles.cpp",
]
extra_compile_args += ['-DWITH_CUDNN']
if WITH_NNPACK:
include_dirs.extend(NNPACK_INCLUDE_DIRS)
main_link_args.extend(NNPACK_LIB_PATHS)
main_sources += [
"torch/csrc/nnpack/NNPACK.cpp",
]
extra_compile_args += ['-DWITH_NNPACK']
if DEBUG:
extra_compile_args += ['-O0', '-g']
extra_link_args += ['-O0', '-g']
if os.getenv('PYTORCH_BINARY_BUILD') and platform.system() == 'Linux':
print('PYTORCH_BINARY_BUILD found. Static linking libstdc++ on Linux')
# get path of libstdc++ and link manually.
# for reasons unknown, -static-libstdc++ doesn't fully link some symbols
CXXNAME = os.getenv('CXX', 'g++')
STDCPP_LIB = subprocess.check_output([CXXNAME, '-print-file-name=libstdc++.a'])
STDCPP_LIB = STDCPP_LIB[:-1]
if type(STDCPP_LIB) != str: # python 3
STDCPP_LIB = STDCPP_LIB.decode(sys.stdout.encoding)
extra_link_args += [STDCPP_LIB]
version_script = os.path.abspath("tools/pytorch.version")
extra_link_args += ['-Wl,--version-script=' + version_script]
def make_relative_rpath(path):
if platform.system() == 'Darwin':
@ -188,51 +558,106 @@ def make_relative_rpath(path):
################################################################################
extensions = []
packages = find_packages(exclude=('tools.*', 'torch.cuda', 'torch.legacy.cunn'))
packages = find_packages(exclude=('tools', 'tools.*',))
C = Extension("torch._C",
libraries=main_libraries,
sources=main_sources,
language='c++',
extra_compile_args=extra_compile_args,
include_dirs=include_dirs,
extra_link_args=extra_link_args + [make_relative_rpath('lib')]
)
libraries=main_libraries,
sources=main_sources,
language='c++',
extra_compile_args=main_compile_args + extra_compile_args,
include_dirs=include_dirs,
library_dirs=library_dirs,
extra_link_args=extra_link_args + main_link_args + [make_relative_rpath('lib')],
)
extensions.append(C)
DL = Extension("torch._dl",
sources=["torch/csrc/dl.c"],
language='c',
)
extensions.append(DL)
THNN = Extension("torch._thnn._THNN",
libraries=['TH', 'THNN'],
sources=['torch/csrc/nn/THNN.cpp'],
language='c++',
extra_compile_args=extra_compile_args,
include_dirs=include_dirs,
extra_link_args=extra_link_args + [make_relative_rpath('../lib')]
)
sources=['torch/csrc/nn/THNN.cpp'],
language='c++',
extra_compile_args=extra_compile_args,
include_dirs=include_dirs,
extra_link_args=extra_link_args + [
TH_LIB,
THNN_LIB,
make_relative_rpath('../lib'),
]
)
extensions.append(THNN)
if WITH_CUDA:
THCUNN = Extension("torch._thnn._THCUNN",
libraries=['TH', 'THC', 'THCUNN'],
sources=['torch/csrc/nn/THCUNN.cpp'],
language='c++',
extra_compile_args=extra_compile_args,
include_dirs=include_dirs,
extra_link_args=extra_link_args + [make_relative_rpath('../lib')]
)
extensions.append(THCUNN)
packages += ['torch.cuda', 'torch.legacy.cunn']
thnvrtc_link_flags = extra_link_args + [make_relative_rpath('lib')]
if platform.system() == 'Linux':
thnvrtc_link_flags = thnvrtc_link_flags + ['-Wl,--no-as-needed']
# these have to be specified as -lcuda in link_flags because they
# have to come right after the `no-as-needed` option
thnvrtc_link_flags += ['-lcuda', '-lnvrtc']
THNVRTC = Extension("torch._nvrtc",
sources=['torch/csrc/nvrtc.cpp'],
language='c++',
include_dirs=include_dirs,
library_dirs=library_dirs + [cuda_lib_path + '/stubs'],
extra_link_args=thnvrtc_link_flags,
)
extensions.append(THNVRTC)
setup(name="torch", version="0.1",
ext_modules=extensions,
cmdclass = {
'build': build,
'build_ext': build_ext,
'build_deps': build_deps,
'build_module': build_module,
'install': install,
'clean': clean,
},
packages=packages,
package_data={'torch': ['lib/*.so*', 'lib/*.h']},
install_requires=['pyyaml'],
)
THCUNN = Extension("torch._thnn._THCUNN",
sources=['torch/csrc/nn/THCUNN.cpp'],
language='c++',
extra_compile_args=extra_compile_args,
include_dirs=include_dirs,
extra_link_args=extra_link_args + [
TH_LIB,
THC_LIB,
THCUNN_LIB,
make_relative_rpath('../lib'),
]
)
extensions.append(THCUNN)
version = '0.3.1b0'
if os.getenv('PYTORCH_BUILD_VERSION'):
assert os.getenv('PYTORCH_BUILD_NUMBER') is not None
build_number = int(os.getenv('PYTORCH_BUILD_NUMBER'))
version = os.getenv('PYTORCH_BUILD_VERSION')
if build_number > 1:
version += '.post' + str(build_number)
else:
try:
sha = subprocess.check_output(['git', 'rev-parse', 'HEAD'], cwd=cwd).decode('ascii').strip()
version += '+' + sha[:7]
except Exception:
pass
cmdclass = {
'build': build,
'build_py': build_py,
'build_ext': build_ext,
'build_deps': build_deps,
'build_module': build_module,
'develop': develop,
'install': install,
'clean': clean,
}
cmdclass.update(build_dep_cmds)
setup(name="torch", version=version,
description="Tensors and Dynamic neural networks in Python with strong GPU acceleration",
ext_modules=extensions,
cmdclass=cmdclass,
packages=packages,
package_data={'torch': [
'lib/*.so*', 'lib/*.dylib*',
'lib/torch_shm_manager',
'lib/*.h',
'lib/include/TH/*.h', 'lib/include/TH/generic/*.h',
'lib/include/THC/*.h', 'lib/include/THC/generic/*.h',
'lib/include/ATen/*.h',
]},
install_requires=['pyyaml', 'numpy'],
)


@ -1,10 +1,70 @@
import sys
import os
import argparse
import unittest
import warnings
import contextlib
from functools import wraps
from itertools import product
from copy import deepcopy
import __main__
import errno
import torch
import torch.cuda
from torch.autograd import Variable
from torch.autograd.leaf import Leaf
from torch._six import string_classes
torch.set_default_tensor_type('torch.DoubleTensor')
# set seed one time
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('--seed', type=int, default=123)
parser.add_argument('--accept', action='store_true')
args, remaining = parser.parse_known_args()
SEED = args.seed
ACCEPT = args.accept
UNITTEST_ARGS = [sys.argv[0]] + remaining
def run_tests():
unittest.main(argv=UNITTEST_ARGS)
IS_WINDOWS = sys.platform == "win32"
TEST_NUMPY = True
try:
import numpy
except ImportError:
TEST_NUMPY = False
TEST_SCIPY = True
try:
import scipy
except ImportError:
TEST_SCIPY = False
def skipIfNoLapack(fn):
@wraps(fn)
def wrapper(*args, **kwargs):
try:
fn(*args, **kwargs)
except Exception as e:
if 'Lapack library not found' in e.args[0]:
raise unittest.SkipTest('Compiled without Lapack')
raise
return wrapper
def suppress_warnings(fn):
def wrapper(*args, **kwargs):
with warnings.catch_warnings():
warnings.simplefilter("ignore")
fn(*args, **kwargs)
return wrapper
def get_cpu_type(t):
assert t.__module__ == 'torch.cuda'
@ -16,23 +76,38 @@ def get_gpu_type(t):
return getattr(torch.cuda, t.__name__)
def to_gpu(obj, tensor_type=None):
if torch.isTensor(obj):
if tensor_type:
return tensor_type(obj.size()).copy_(obj)
return get_gpu_type(type(obj))(obj.size()).copy_(obj)
elif torch.isStorage(obj):
def to_gpu(obj, type_map={}):
if torch.is_tensor(obj):
t = type_map.get(type(obj), get_gpu_type(type(obj)))
return obj.clone().type(t)
elif torch.is_storage(obj):
return obj.new().resize_(obj.size()).copy_(obj)
elif isinstance(obj, Variable):
assert type(obj.creator) == Leaf
return Variable(obj.data.clone().type(tensor_type))
assert obj.is_leaf
t = type_map.get(type(obj.data), get_gpu_type(type(obj.data)))
return Variable(obj.data.clone().type(t), requires_grad=obj.requires_grad)
elif isinstance(obj, list):
return [to_gpu(o, tensor_type) for o in obj]
return [to_gpu(o, type_map) for o in obj]
elif isinstance(obj, tuple):
return tuple(to_gpu(o, type_map) for o in obj)
else:
return deepcopy(obj)
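# Illustrative sketch, not part of the original diff: the optional type_map
# overrides the default CPU-to-GPU type pairing, e.g.
#   to_gpu(torch.DoubleTensor(3), type_map={torch.DoubleTensor: torch.cuda.FloatTensor})
# returns a torch.cuda.FloatTensor copy, whereas to_gpu(torch.DoubleTensor(3))
# falls back to get_gpu_type and returns a torch.cuda.DoubleTensor.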
@contextlib.contextmanager
def freeze_rng_state():
rng_state = torch.get_rng_state()
if torch.cuda.is_available():
cuda_rng_state = torch.cuda.get_rng_state()
yield
if torch.cuda.is_available():
torch.cuda.set_rng_state(cuda_rng_state)
torch.set_rng_state(rng_state)
def iter_indices(tensor):
if tensor.dim() == 0:
return range(0)
if tensor.dim() == 1:
return range(tensor.size(0))
return product(*(range(s) for s in tensor.size()))
@ -42,96 +117,250 @@ def is_iterable(obj):
try:
iter(obj)
return True
except:
except TypeError:
return False
class TestCase(unittest.TestCase):
precision = 1e-5
maxDiff = None
def setUp(self):
torch.manual_seed(SEED)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(SEED)
def assertTensorsSlowEqual(self, x, y, prec=None, message=''):
max_err = 0
self.assertEqual(x.size(), y.size())
for index in iter_indices(x):
max_err = max(max_err, abs(x[index] - y[index]))
self.assertLessEqual(max_err, prec, message)
def safeCoalesce(self, t):
tc = t.coalesce()
value_map = {}
for idx, val in zip(t._indices().t(), t._values()):
idx_tup = tuple(idx)
if idx_tup in value_map:
value_map[idx_tup] += val
else:
value_map[idx_tup] = val.clone() if torch.is_tensor(val) else val
new_indices = sorted(list(value_map.keys()))
new_values = [value_map[idx] for idx in new_indices]
if t._values().ndimension() < 2:
new_values = t._values().new(new_values)
else:
new_values = torch.stack(new_values)
new_indices = t._indices().new(new_indices).t()
tg = t.new(new_indices, new_values, t.size())
self.assertEqual(tc._indices(), tg._indices())
self.assertEqual(tc._values(), tg._values())
return tg
def unwrapVariables(self, x, y):
if isinstance(x, Variable) and isinstance(y, Variable):
return x.data, y.data
elif isinstance(x, Variable) or isinstance(y, Variable):
raise AssertionError("cannot compare {} and {}".format(type(x), type(y)))
return x, y
def assertEqual(self, x, y, prec=None, message=''):
if isinstance(prec, str) and message == '':
message = prec
prec = None
if prec is None:
prec = self.precision
if isinstance(x, Variable) and isinstance(y, Variable):
x = x.data
y = y.data
x, y = self.unwrapVariables(x, y)
if torch.isTensor(x) and torch.isTensor(y):
max_err = 0
super(TestCase, self).assertEqual(x.size().tolist(), y.size().tolist())
for index in iter_indices(x):
max_err = max(max_err, abs(x[index] - y[index]))
self.assertLessEqual(max_err, prec)
elif type(x) == str and type(y) == str:
if torch.is_tensor(x) and torch.is_tensor(y):
def assertTensorsEqual(a, b):
super(TestCase, self).assertEqual(a.size(), b.size())
if a.numel() > 0:
b = b.type_as(a)
b = b.cuda(device=a.get_device()) if a.is_cuda else b.cpu()
# check that NaNs are in the same locations
nan_mask = a != a
self.assertTrue(torch.equal(nan_mask, b != b))
diff = a - b
diff[nan_mask] = 0
if diff.is_signed():
diff = diff.abs()
max_err = diff.max()
self.assertLessEqual(max_err, prec, message)
self.assertEqual(x.is_sparse, y.is_sparse, message)
if x.is_sparse:
x = self.safeCoalesce(x)
y = self.safeCoalesce(y)
assertTensorsEqual(x._indices(), y._indices())
assertTensorsEqual(x._values(), y._values())
else:
assertTensorsEqual(x, y)
elif isinstance(x, string_classes) and isinstance(y, string_classes):
super(TestCase, self).assertEqual(x, y)
elif type(x) == set and type(y) == set:
super(TestCase, self).assertEqual(x, y)
elif is_iterable(x) and is_iterable(y):
super(TestCase, self).assertEqual(len(x), len(y))
for x_, y_ in zip(x, y):
self.assertEqual(x_, y_, prec, message)
else:
try:
self.assertLessEqual(abs(x - y), prec)
self.assertLessEqual(abs(x - y), prec, message)
return
except:
except (TypeError, AssertionError):
pass
super(TestCase, self).assertEqual(x, y)
super(TestCase, self).assertEqual(x, y, message)
def assertNotEqual(self, x, y, prec=None, message=''):
if prec is None:
prec = self.precision
x, y = self.unwrapVariables(x, y)
if torch.is_tensor(x) and torch.is_tensor(y):
if x.size() != y.size():
super(TestCase, self).assertNotEqual(x.size(), y.size())
self.assertGreater(x.numel(), 0)
y = y.type_as(x)
y = y.cuda(device=x.get_device()) if x.is_cuda else y.cpu()
nan_mask = x != x
if torch.equal(nan_mask, y != y):
diff = x - y
if diff.is_signed():
diff = diff.abs()
diff[nan_mask] = 0
max_err = diff.max()
self.assertGreaterEqual(max_err, prec, message)
elif type(x) == str and type(y) == str:
super(TestCase, self).assertNotEqual(x, y)
elif is_iterable(x) and is_iterable(y):
super(TestCase, self).assertNotEqual(x, y)
else:
try:
self.assertGreaterEqual(abs(x - y), prec, message)
return
except (TypeError, AssertionError):
pass
super(TestCase, self).assertNotEqual(x, y, message)
def assertObjectIn(self, obj, iterable):
for elem in iterable:
if id(obj) == id(elem):
return
raise AssertionError("object not found in iterable")
# TODO: Support context manager interface
# NB: The kwargs forwarding to callable robs the 'subname' parameter.
# If you need it, manually apply your callable in a lambda instead.
def assertExpectedRaises(self, exc_type, callable, *args, **kwargs):
subname = None
if 'subname' in kwargs:
subname = kwargs['subname']
del kwargs['subname']
try:
callable(*args, **kwargs)
except exc_type as e:
self.assertExpected(str(e), subname)
return
# Don't put this in the try block; the AssertionError will catch it
self.fail(msg="Did not raise when expected to")
def assertExpected(self, s, subname=None):
"""
Test that a string matches the recorded contents of a file
derived from the name of this test and subname. This file
is placed in the 'expect' directory in the same directory
as the test script. You can automatically update the recorded test
output using --accept.
If you call this multiple times in a single function, you must
give a unique subname each time.
"""
if not (isinstance(s, str) or (sys.version_info[0] == 2 and isinstance(s, unicode))):
raise TypeError("assertExpected is strings only")
def remove_prefix(text, prefix):
if text.startswith(prefix):
return text[len(prefix):]
return text
munged_id = remove_prefix(self.id(), "__main__.")
# NB: we take __file__ from __main__, so we place the expect directory
# where the test script lives, NOT where test/common.py lives. This
# doesn't matter in PyTorch where all test scripts are in the same
# directory as test/common.py, but it matters in onnx-pytorch
expected_file = os.path.join(os.path.dirname(os.path.realpath(__main__.__file__)),
"expect",
munged_id)
if subname:
expected_file += "-" + subname
expected_file += ".expect"
expected = None
def accept_output(update_type):
print("Accepting {} for {}:\n\n{}".format(update_type, munged_id, s))
with open(expected_file, 'w') as f:
f.write(s)
try:
with open(expected_file) as f:
expected = f.read()
except IOError as e:
if e.errno != errno.ENOENT:
raise
elif ACCEPT:
return accept_output("output")
else:
raise RuntimeError(
("I got this output for {}:\n\n{}\n\n"
"No expect file exists; to accept the current output, run:\n"
"python {} {} --accept").format(munged_id, s, __main__.__file__, munged_id))
if ACCEPT:
if expected != s:
return accept_output("updated output")
else:
if hasattr(self, "assertMultiLineEqual"):
# Python 2.7 only
# NB: Python considers lhs "old" and rhs "new".
self.assertMultiLineEqual(expected, s)
else:
self.assertEqual(s, expected)
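# Illustrative sketch, not part of the original diff: a hypothetical test that
# uses the expect-file machinery above. A normal run compares the string
# against expect/ExampleTest.test_repr.expect next to the test script; running
# the script with --accept (re)writes that file instead.
#   class ExampleTest(TestCase):
#       def test_repr(self):
#           self.assertExpected(repr(torch.nn.Linear(2, 3)))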
if sys.version_info < (3, 2):
# assertRegexpMatches renamed assertRegex in 3.2
assertRegex = unittest.TestCase.assertRegexpMatches
# assertRaisesRegexp renamed assertRaisesRegex in 3.2
assertRaisesRegex = unittest.TestCase.assertRaisesRegexp
def make_jacobian(input, num_out):
if torch.isTensor(input) or isinstance(input, Variable):
return torch.zeros(input.nElement(), num_out)
def download_file(url, binary=True):
if sys.version_info < (3,):
from urlparse import urlsplit
import urllib2
request = urllib2
error = urllib2
else:
return type(input)(make_jacobian(elem, num_out) for elem in input)
from urllib.parse import urlsplit
from urllib import request, error
filename = os.path.basename(urlsplit(url)[2])
data_dir = os.path.join(os.path.dirname(__file__), 'data')
path = os.path.join(data_dir, filename)
def iter_tensors(x):
if torch.isTensor(x):
yield x
elif isinstance(x, Variable):
yield x.data
else:
for elem in x:
for result in iter_tensors(elem):
yield result
def contiguous(input):
if torch.isTensor(input):
return input.contiguous()
elif isinstance(input, Variable):
return input.contiguous_()
else:
return type(input)(contiguous(e) for e in input)
def get_numerical_jacobian(fn, input, target):
perturbation = 1e-6
# To be able to use .view(-1) input must be contiguous
input = contiguous(input)
output_size = fn(input).numel()
jacobian = make_jacobian(target, output_size)
# It's much easier to iterate over flattened lists of tensors.
# These are reference to the same objects in jacobian, so any changes
# will be reflected in it as well.
x_tensors = [t for t in iter_tensors(target)]
j_tensors = [t for t in iter_tensors(jacobian)]
outa = torch.Tensor(output_size)
outb = torch.Tensor(output_size)
# TODO: compare structure
for x_tensor, d_tensor in zip(x_tensors, j_tensors):
flat_tensor = x_tensor.view(-1)
for i in range(flat_tensor.nElement()):
orig = flat_tensor[i]
flat_tensor[i] = orig - perturbation
outa.copy_(fn(input))
flat_tensor[i] = orig + perturbation
outb.copy_(fn(input))
flat_tensor[i] = orig
outb.add_(-1,outa).div_(2*perturbation)
d_tensor[i] = outb
return jacobian
if os.path.exists(path):
return path
try:
data = request.urlopen(url, timeout=15).read()
with open(path, 'wb' if binary else 'w') as f:
f.write(data)
return path
except error.URLError:
msg = "could not download test file '{}'".format(url)
warnings.warn(msg, RuntimeWarning)
raise unittest.SkipTest(msg)


@ -1,42 +1,52 @@
import sys
import tempfile
import unittest
from copy import deepcopy
from itertools import product
import torch
import torch.cuda
from torch.autograd import Variable
from common import TestCase, to_gpu, get_numerical_jacobian, iter_tensors, contiguous
from common import TestCase, to_gpu, freeze_rng_state
from torch.autograd.gradcheck import get_numerical_jacobian, iter_tensors, contiguous
import torch.backends.cudnn
try:
import torch.cuda
import torch.legacy.cunn
TEST_CUDA = True
except ImportError:
TEST_CUDA = False
# tarfile module tries to obtain a file object name in python 3.3
if sys.version_info[:2] == (3, 3):
TemporaryFile = tempfile.NamedTemporaryFile
else:
TemporaryFile = tempfile.TemporaryFile
TEST_CUDA = torch.cuda.is_available()
TEST_MULTIGPU = TEST_CUDA and torch.cuda.device_count() >= 2
TEST_CUDNN = TEST_CUDA and torch.backends.cudnn.is_acceptable(torch.cuda.FloatTensor(1))
TEST_CUDNN_VERSION = TEST_CUDNN and torch.backends.cudnn.version()
PRECISION = 1e-5
def get_size_average(m):
return getattr(m, 'size_average', False) or getattr(m, 'sizeAverage', False)
def get_weight(m):
result = getattr(m, 'weight', None)
if result is not None:
return result
return getattr(m, 'weights', None)
module_tests = [
dict(
module_name='Linear',
constructor_args=(10, 8),
input_size=(4, 10),
reference_fn=lambda i,p: torch.mm(i, p[0].t()) + p[1].view(1, -1).expand(4, 8)
reference_fn=lambda i, p: torch.mm(i, p[0].t()) + p[1].view(1, -1).expand(4, 8)
),
dict(
module_name='Conv2d',
constructor_args=(3, 4, 3, 3),
input_size=(2, 3, 6, 6)
),
dict(
module_name='Conv2d',
constructor_args=(3, 4, 3, 3, 2, 2),
input_size=(2, 3, 6, 6),
desc='strided'
),
dict(
module_name='Conv2d',
constructor_args=(3, 4, 3, 3, 2, 2, 1, 1),
input_size=(2, 3, 6, 6),
desc='padding'
module_name='Linear',
constructor_args=(10, 8, False),
input_size=(4, 10),
desc='no_bias',
reference_fn=lambda i, p: torch.mm(i, p[0].t())
),
dict(
module_name='Threshold',
@ -54,17 +64,29 @@ module_tests = [
dict(
module_name='ReLU',
input_size=(2, 3, 4, 5),
check_inplace=True
check_inplace=True,
),
dict(
module_name='ReLU6',
input_size=(2, 3, 4, 5),
check_inplace=True
check_inplace=True,
),
dict(
module_name='HardTanh',
module_name='RReLU',
input_size=(1, 2, 2),
test_cuda=False,
),
dict(
module_name='RReLU',
constructor_args=(0.1, 0.9),
input_size=(4, 4, 5),
desc='with_up_down',
test_cuda=False,
),
dict(
module_name='Hardtanh',
input_size=(3, 2, 5),
reference_fn=lambda i,_: i.clamp(-1, 1)
reference_fn=lambda i, _: i.clamp(-1, 1),
),
dict(
module_name='Sigmoid',
@ -74,60 +96,440 @@ module_tests = [
module_name='Tanh',
input_size=(2, 3, 4, 5)
),
dict(
module_name='MaxPooling2d',
constructor_args=(3, 3, 2, 2, 1, 1),
input_size=(1, 3, 7, 7)
),
dict(
module_name='Softmax',
constructor_args=(1,),
input_size=(10, 20),
reference_fn=lambda i,_: torch.exp(i).div(torch.exp(i).sum(1).expand(10, 20))
reference_fn=lambda i, _: torch.exp(i).div(torch.exp(i).sum(1, True).expand(10, 20)),
),
dict(
module_name='Softmax2d',
input_size=(1, 3, 10, 20),
reference_fn=lambda i,_: torch.exp(i).div(torch.exp(i).sum(1).expandAs(i))
),
dict(
module_name='BatchNorm',
constructor_args=(10,),
input_size=(4, 10),
desc='affine'
),
dict(
module_name='BatchNorm',
constructor_args=(10, 1e-3, 0.3, False),
input_size=(4, 10),
desc='not_affine'
reference_fn=lambda i, _: torch.exp(i).div(torch.exp(i).sum(1, False)),
),
dict(
module_name='LogSoftmax',
constructor_args=(1,),
input_size=(10, 20),
reference_fn=lambda i,_: torch.exp(i).div_(torch.exp(i).sum(1).expand(10, 20)).log_()
reference_fn=lambda i, _: torch.exp(i).div_(torch.exp(i).sum(1, True).expand(10, 20)).log_(),
),
dict(
module_name='LogSoftmax',
constructor_args=(1,),
input_size=(1, 3, 10, 20),
reference_fn=lambda i, _: torch.exp(i).div_(torch.exp(i).sum(1, False)).log_(),
desc='multiparam',
),
dict(
module_name='ELU',
constructor_args=(2.,),
input_size=(3, 2, 5),
),
# TODO: reference function
dict(
module_name='Hardshrink',
constructor_args=(2.,),
input_size=(4, 3, 2, 4),
),
dict(
module_name='LeakyReLU',
input_size=(3, 2, 5),
check_inplace=True
),
dict(
module_name='LeakyReLU',
constructor_args=(0.5,),
input_size=(3, 2, 5),
check_inplace=True,
desc='with_negval'
),
dict(
module_name='LogSigmoid',
input_size=(2, 3, 4),
reference_fn=lambda i, _: i.sigmoid().log(),
),
dict(
module_name='Softplus',
input_size=(10, 20),
reference_fn=lambda i, _: torch.log(1 + torch.exp(i)),
),
dict(
module_name='Softplus',
constructor_args=(2,),
input_size=(10, 20),
reference_fn=lambda i, _: 1. / 2. * torch.log(1 + torch.exp(2 * i)),
desc='beta',
),
dict(
module_name='Softplus',
constructor_args=(2, -100),
input_size=(10, 20),
reference_fn=(lambda i, _: ((i * 2) > -100).type_as(i) * i +
((i * 2) <= -100).type_as(i) * 1. / 2. * torch.log(1 + torch.exp(2 * i))),
desc='beta_threshold',
),
dict(
module_name='Softshrink',
input_size=(3, 2, 5),
),
dict(
module_name='Softshrink',
constructor_args=(1,),
input_size=(3, 2, 5),
desc='lambda',
),
dict(
module_name='CrossMapLRN2d',
constructor_args=(5, 5e-3, 1e-3, 2),
input_size=(2, 3, 6, 6),
check_gradgrad=False,
),
dict(
module_name='PReLU',
input_size=(2, 3, 4),
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
desc='1d',
),
dict(
module_name='PReLU',
constructor_args=(3,),
input_size=(2, 3, 4),
desc='1d_multiparam',
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
),
dict(
module_name='PReLU',
input_size=(2, 3, 4, 5),
desc='2d',
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
),
dict(
module_name='PReLU',
constructor_args=(3,),
input_size=(2, 3, 4, 5),
desc='2d_multiparam',
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
),
dict(
module_name='PReLU',
input_size=(2, 3, 4, 5, 6),
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
desc='3d',
),
dict(
module_name='PReLU',
constructor_args=(3,),
input_size=(2, 3, 4, 5, 6),
desc='3d_multiparam',
reference_fn=lambda i, p: torch.clamp(i, min=0) + torch.clamp(i, max=0) * p[0][0],
),
dict(
module_name='Softsign',
input_size=(3, 2, 5),
reference_fn=lambda i, _: i.div(1 + torch.abs(i)),
),
dict(
module_name='Softmin',
constructor_args=(1,),
input_size=(10, 20),
),
dict(
module_name='Softmin',
constructor_args=(1,),
input_size=(2, 3, 5, 10),
desc='multidim',
),
dict(
module_name='Tanhshrink',
input_size=(2, 3, 4, 5)
),
]
def kldivloss_reference(input, target, size_average=True, reduce=True):
safe_target = target * (target > 0).type_as(target)
safe_target_log = (safe_target + (target <= 0).type_as(target)).log()
result = safe_target * (safe_target_log - input)
if reduce and size_average:
return result.mean()
elif reduce:
return result.sum()
return result
def nlllossNd_reference(input, target, weight=None, ignore_index=-100,
size_average=True, reduce=True):
assert input.dim() >= 3
N = input.size(0)
C = input.size(1)
out_size = (N,) + input.size()[2:]
output = torch.zeros(out_size).type_as(input)
if isinstance(target, Variable):
target = target.data
if weight is None:
weight = torch.ones(C).type_as(input)
total_weight_data = 0
for tup in product(*[range(size) for size in out_size]):
t_nx = target[tup]
norm = 0. if ignore_index == t_nx else weight[t_nx]
input_index = list(tup)
input_index.insert(1, t_nx)
output[tup] = -input[tuple(input_index)] * norm
total_weight_data += norm
if reduce and size_average:
return output.sum() / total_weight_data
elif reduce:
return output.sum()
return output
def nllloss_reference(input, target, weight=None, ignore_index=-100,
size_average=True, reduce=True):
if isinstance(target, Variable):
target = target.data
def nll_loss_helper(input, target, weight, ignore_index):
if target is ignore_index:
return (0, 0)
norm = 1 if weight is None else weight[target]
result = -input[target] * norm
return (result, norm)
losses_and_weights = [nll_loss_helper(i, t, weight, ignore_index)
for i, t in zip(input, target)]
losses, weights = zip(*losses_and_weights)
losses_tensor = torch.Tensor(losses).type_as(input)
if reduce and size_average:
return sum(losses_tensor) / sum(weights)
elif reduce:
return sum(losses_tensor)
else:
return losses_tensor
def smoothl1loss_reference(input, target, size_average=True, reduce=True):
abs_diff = (input - target).abs()
ge_one_mask = (abs_diff >= 1).type_as(abs_diff)
lt_one_mask = (abs_diff < 1).type_as(abs_diff)
output = ge_one_mask * (abs_diff - 0.5) + lt_one_mask * 0.5 * (abs_diff ** 2)
if reduce and size_average:
return output.mean()
elif reduce:
return output.sum()
return output
loss_reference_fns = {
'KLDivLoss': kldivloss_reference,
'NLLLoss': nllloss_reference,
'NLLLossNd': nlllossNd_reference,
'SmoothL1Loss': smoothl1loss_reference,
}
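# Illustrative sanity check, not part of the original diff, using made-up
# values: for smoothl1loss_reference with a single element and the default
# size_average/reduce, |input - target| = 0.5 gives 0.5 * 0.5 ** 2 = 0.125 and
# |input - target| = 2.0 gives 2.0 - 0.5 = 1.5:
#   smoothl1loss_reference(torch.Tensor([0.5]), torch.Tensor([0.0]))  # -> 0.125
#   smoothl1loss_reference(torch.Tensor([2.5]), torch.Tensor([0.5]))  # -> 1.5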
criterion_tests = [
dict(module_name='AbsCriterion',
dict(
module_name='L1Loss',
input_size=(2, 3, 4),
target=torch.randn(2, 3, 4),
reference_fn=lambda i,t,_: 1./i.numel() * \
sum((a-b).abs().sum() for a,b in zip(i, t))
target_size=(2, 3, 4),
reference_fn=lambda i, t, _: 1. / i.numel() *
sum((a - b).abs().sum() for a, b in zip(i, t)),
),
dict(
module_name='ClassNLLCriterion',
input=torch.rand(15, 10).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
module_name='NLLLoss',
input_fn=lambda: torch.rand(15, 10).log(),
target_fn=lambda: torch.Tensor(15).uniform_().mul(10).floor().long(),
reference_fn=lambda i, t, m:
nllloss_reference(i, t, size_average=get_size_average(m)),
check_no_size_average=True
),
dict(
module_name='ClassNLLCriterion',
constructor_args=(torch.rand(10),),
input=torch.rand(15, 10).add(1e-2).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
module_name='NLLLoss',
constructor_args=(None, True, 2),
input_fn=lambda: torch.rand(15, 10).log(),
target_fn=lambda: torch.Tensor(15).uniform_().mul(10).floor().long(),
reference_fn=lambda i, t, _: nllloss_reference(i, t, ignore_index=2),
desc='ignore_index'
),
dict(
module_name='NLLLoss',
constructor_args_fn=lambda: (torch.rand(10),),
input_fn=lambda: torch.rand(15, 10).add(1e-2).log(),
target_fn=lambda: torch.Tensor(15).uniform_().mul(10).floor().long(),
reference_fn=lambda i, t, m:
nllloss_reference(i, t, weight=get_weight(m)),
desc='weights',
),
dict(
module_name='NLLLoss',
constructor_args_fn=lambda: (torch.rand(10), True, 2),
input_fn=lambda: torch.rand(15, 10).add(1e-2).log(),
target_fn=lambda: torch.Tensor(15).uniform_().mul(10).floor().long(),
reference_fn=lambda i, t, m:
nllloss_reference(i, t, weight=get_weight(m), ignore_index=2),
desc='weights_ignore_index'
),
dict(
module_name='NLLLoss',
constructor_args_fn=lambda: (torch.rand(10), True, -1),
input_fn=lambda: torch.rand(15, 10).add(1e-2).log(),
target_fn=lambda: torch.Tensor(15).uniform_().mul(10 + 1).floor().long() - 1,
reference_fn=lambda i, t, m:
nllloss_reference(i, t, weight=get_weight(m), ignore_index=-1),
desc='weights_ignore_index_neg'
),
dict(
module_name='KLDivLoss',
input_fn=lambda: torch.rand(10, 10).log(),
target_fn=lambda: torch.rand(10, 10),
reference_fn=lambda i, t, m:
kldivloss_reference(i, t, get_size_average(m), reduce=True),
check_no_size_average=True,
),
dict(
module_name='MSELoss',
input_size=(2, 3, 4, 5),
target_size=(2, 3, 4, 5),
reference_fn=lambda i, t, m: (i - t).abs().pow(2).sum() / (i.numel() if get_size_average(m) else 1),
check_no_size_average=True,
),
dict(
module_name='BCELoss',
input_fn=lambda: torch.rand(15, 10).clamp_(1e-2, 1 - 1e-2),
target_fn=lambda: torch.randn(15, 10).gt(0).double(),
check_gradgrad=False,
),
dict(
module_name='BCELoss',
constructor_args_fn=lambda: (torch.rand(10),),
input_fn=lambda: torch.rand(15, 10).clamp_(1e-2, 1 - 1e-2),
target_fn=lambda: torch.randn(15, 10).gt(0).double(),
desc='weights',
check_gradgrad=False,
),
dict(
module_name='CrossEntropyLoss',
input_size=(15, 10),
target_fn=lambda: torch.Tensor(15).uniform_().mul(10).floor().long(),
),
dict(
module_name='CrossEntropyLoss',
constructor_args_fn=lambda: (torch.rand(10),),
input_size=(15, 10),
target_fn=lambda: torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='weights',
),
dict(
module_name='NLLLoss2d',
input_size=(2, 3, 5, 5),
target_fn=lambda: torch.rand(2, 5, 5).mul(3).floor().long(),
reference_fn=lambda i, t, m:
nlllossNd_reference(i, t, size_average=get_size_average(m)),
check_no_size_average=True,
),
dict(
module_name='NLLLoss2d',
constructor_args_fn=lambda: (torch.rand(3),),
input_size=(2, 3, 5, 5),
target=torch.rand(2, 5, 5).mul(3).floor().long(),
reference_fn=lambda i, t, m:
nlllossNd_reference(i, t, weight=get_weight(m)),
desc='weights',
),
dict(
module_name='NLLLoss2d',
constructor_args=(None, True, 1),
input_size=(2, 3, 5, 5),
target_fn=lambda: torch.rand(2, 5, 5).mul(3).floor().long(),
reference_fn=lambda i, t, m:
nlllossNd_reference(i, t, ignore_index=1),
desc='ignore_index',
),
dict(
module_name='HingeEmbeddingLoss',
input_size=(10,),
target_fn=lambda: torch.randn(10).gt(0).double().mul_(2).sub(1),
),
dict(
module_name='HingeEmbeddingLoss',
constructor_args=(0.5,),
input_size=(10,),
target_fn=lambda: torch.randn(10).gt(0).double().mul_(2).sub(1),
desc='margin',
check_no_size_average=True,
),
dict(
module_name='MultiLabelMarginLoss',
input_size=(5, 10),
target_fn=lambda: torch.rand(5, 10).mul(10).floor().long(),
check_no_size_average=True,
check_gradgrad=False,
),
dict(
module_name='MultiLabelSoftMarginLoss',
input_size=(5, 10),
target_fn=lambda: torch.rand(5, 10).mul(2).floor(),
check_gradgrad=False,
),
dict(
module_name='MultiLabelSoftMarginLoss',
constructor_args_fn=lambda: (torch.rand(10),),
input_size=(5, 10),
target_fn=lambda: torch.rand(5, 10).mul(2).floor(),
desc='weights',
check_gradgrad=False,
),
dict(
module_name='MultiMarginLoss',
input_size=(5, 10),
target_fn=lambda: torch.rand(5).mul(8).floor().long(),
check_gradgrad=False,
),
dict(
module_name='SmoothL1Loss',
input_size=(5, 10),
target_size=(5, 10),
check_no_size_average=True,
reference_fn=lambda i, t, m:
smoothl1loss_reference(i, t, size_average=get_size_average(m)),
),
dict(
module_name='SoftMarginLoss',
input_size=(5, 5),
target_fn=lambda: torch.randn(5, 5).sign(),
check_no_size_average=True,
),
dict(
module_name='CosineEmbeddingLoss',
input_fn=lambda: (torch.rand(15, 10), torch.rand(15, 10)),
target_fn=lambda: torch.randn(15).sign(),
check_gradgrad=False,
),
dict(
module_name='CosineEmbeddingLoss',
constructor_args=(0.7,),
input_fn=lambda: (torch.rand(15, 10), torch.rand(15, 10)),
target_fn=lambda: torch.randn(15).sign(),
desc='margin',
check_gradgrad=False,
),
dict(
module_name='MarginRankingLoss',
input_fn=lambda: (torch.randn(50).mul(10), torch.randn(50).mul(10)),
target_fn=lambda: torch.randn(50).sign(),
check_no_size_average=True,
),
dict(
module_name='MarginRankingLoss',
constructor_args=(2,),
input_fn=lambda: (torch.randn(50).mul(10), torch.randn(50).mul(10)),
target_fn=lambda: torch.randn(50).sign(),
desc='margin',
check_no_size_average=True,
),
]
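# Illustrative note, not part of the original diff: each dict above describes
# one criterion test case. Arguments may be given directly, via *_fn callables,
# or via *_size tuples (see TestBase._get_arg below), and when a reference_fn
# is present it is called as reference_fn(unpacked_input, unpacked_target,
# module) and checked against the module's own forward output (see
# CriterionTest.__call__ below).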
@ -139,20 +541,25 @@ class NNTestCase(TestCase):
elif isinstance(input, list):
return [self._jacobian(elem, num_out) for elem in input]
else:
return torch.zeros(input.nElement(), num_out)
return torch.zeros(input.nelement(), num_out)
def _flatten_tensors(self, x):
if torch.isTensor(x):
return x.view(-1)
if torch.is_tensor(x):
if x.is_sparse:
return x.to_dense().view(-1)
else:
return x.view(-1)
elif isinstance(x, Variable):
return x.data.view(-1)
return self._flatten_tensors(x.data)
else:
return tuple(self._flatten_tensors(a) for a in x)
def _zero_grad_input(self, input):
if isinstance(input, Variable):
input.grad.zero_()
elif torch.isTensor(input):
if input.requires_grad and input.grad is not None:
input.grad.data.zero_()
input.grad.detach_()
elif torch.is_tensor(input):
return
else:
for i in input:
@ -165,33 +572,34 @@ class NNTestCase(TestCase):
flat_d_out = d_out.view(-1)
if jacobian_input:
jacobian_input = self._jacobian(input, d_out.nElement())
flat_jacobian_input = list(iter_tensors(jacobian_input))
jacobian_inp = self._jacobian(input, d_out.nelement())
flat_jacobian_input = list(iter_tensors(jacobian_inp))
if jacobian_parameters:
param, d_param = self._get_parameters(module)
num_param = sum(p.numel() for p in param)
jacobian_param = torch.zeros(num_param, d_out.nElement())
jacobian_param = torch.zeros(num_param, d_out.nelement())
for i in range(flat_d_out.nElement()):
for i in range(flat_d_out.nelement()):
d_out.zero_()
flat_d_out[i] = 1
if jacobian_parameters:
self._zero_grad_parameters(module)
# Variables will accumulate gradient from multiple steps
self._zero_grad_input(input)
if jacobian_input:
self._zero_grad_input(input)
d_input = self._backward(module, input, output, d_out)
if jacobian_input:
for jacobian_x, d_x in zip(flat_jacobian_input, iter_tensors(d_input)):
jacobian_x[:,i] = d_x
jacobian_x[:, i] = d_x
if jacobian_parameters:
jacobian_param[:,i] = torch.cat(self._flatten_tensors(d_param), 0)
jacobian_param[:, i] = torch.cat(self._flatten_tensors(d_param), 0)
res = tuple()
if jacobian_input:
res += jacobian_input,
res += jacobian_inp,
if jacobian_parameters:
res += jacobian_param,
@ -199,7 +607,7 @@ class NNTestCase(TestCase):
def _numerical_jacobian(self, module, input, jacobian_input=True, jacobian_parameters=True):
output = self._forward(module, input)
output_size = output.nElement()
output_size = output.nelement()
if jacobian_parameters:
param, d_param = self._get_parameters(module)
@ -211,12 +619,11 @@ class NNTestCase(TestCase):
return out
res = tuple()
# TODO: enable non-contig tests
input = contiguous(input)
if jacobian_input:
res += get_numerical_jacobian(fw, input, input),
res += get_numerical_jacobian(fw, input, input, eps=1e-6),
if jacobian_parameters:
res += torch.cat(list(get_numerical_jacobian(fw, input, p) for p in param), 0),
res += torch.cat(list(get_numerical_jacobian(fw, input, p, eps=1e-6) for p in param), 0),
return res
def check_jacobian(self, module, input, jacobian_input=True):
@ -237,19 +644,18 @@ class NNTestCase(TestCase):
analytical_d_x = self._backward_criterion(criterion, input, target)
numerical_d_x = deepcopy(analytical_d_x)
input_t = iter_tensors(input)
numerical_t = iter_tensors(numerical_d_x)
for x, d_x in zip(input_t, numerical_t):
x = x.view(-1)
d_x = d_x.view(-1)
for i in range(x.nElement()):
for i in range(x.nelement()):
original = x[i]
x[i] = original + eps
fx1 = self._forward_criterion(criterion, input, target)
x[i] = original - eps
fx2 = self._forward_criterion(criterion, input, target)
deriv = (fx1 - fx2) / (2.*eps)
deriv = (fx1 - fx2) / (2. * eps)
d_x[i] = deriv
x[i] = original
@ -263,17 +669,23 @@ class NNTestCase(TestCase):
class TestBase(object):
def __init__(self, constructor, constructor_args=tuple(), input_size=None,
input=None, desc='', reference_fn=None, fullname=None, **kwargs):
if input_size is None and input is None:
raise RuntimeError("Specify either an input tensor, or it's size!")
self.constructor = constructor
self.constructor_args = constructor_args
self.input = input
self.input_size = input_size
_required_arg_names = {'constructor_args', 'input'}
def __init__(self, constructor, desc='', reference_fn=None, fullname=None, **kwargs):
self.desc = desc
self.fullname = fullname
self.constructor = constructor
self.reference_fn = reference_fn
for name in self._required_arg_names:
if name not in kwargs and name + '_fn' not in kwargs and name + '_size' not in kwargs:
if name == 'constructor_args':
kwargs['constructor_args'] = tuple()
else:
raise ValueError("{}: Specify {} by a value, a function to generate it, or it's size!"
.format(self.get_name(), name))
self._extra_kwargs = kwargs
self._arg_cache = {}
def get_name(self):
if self.fullname is not None:
@ -284,38 +696,57 @@ class TestBase(object):
test_name += '_' + self.desc
return test_name
def _unpack_input(self, input):
if isinstance(input, Variable):
return input.data
elif torch.isTensor(input):
return input
def _unpack(self, value):
if isinstance(value, Variable):
return value.data
elif torch.is_tensor(value):
return value
else:
return type(input)(self._unpack_input(i) for i in input)
return type(value)(self._unpack(v) for v in value)
@property
def constructor_args(self):
return self._get_arg('constructor_args')
def _get_arg(self, name):
assert name in self._required_arg_names
if name not in self._arg_cache:
fn_name = name + '_fn'
size_name = name + '_size'
if name in self._extra_kwargs:
self._arg_cache[name] = self._extra_kwargs[name]
elif fn_name in self._extra_kwargs:
self._arg_cache[name] = self._extra_kwargs[fn_name]()
else:
assert size_name in self._extra_kwargs
def map_tensor_sizes(sizes):
if isinstance(sizes, list):
return [map_tensor_sizes(s) for s in sizes]
elif torch.is_tensor(sizes):
return sizes.double()
else:
return torch.randn(*sizes)
self._arg_cache[name] = map_tensor_sizes(self._extra_kwargs[size_name])
return self._arg_cache[name]
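# Illustrative note, not part of the original diff: for a test dict this means
# constructor_args=(10, 8) is used as-is, constructor_args_fn=lambda: (...) is
# called once and cached, and input_size=(4, 10) is expanded by
# map_tensor_sizes into torch.randn(4, 10); nested lists of sizes become lists
# of random tensors, and tensors passed directly are converted to double.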
def _get_input(self):
if self.input is not None:
return self.input
def map_input_sizes(sizes):
if isinstance(sizes, list):
return [map_input_sizes(s) for s in sizes]
elif torch.isTensor(sizes):
return sizes
else:
return torch.randn(*sizes)
assert self.input_size is not None
return map_input_sizes(self.input_size)
return self._get_arg('input')
def __call__(self, test_case):
raise NotImplementedError
class ModuleTest(TestBase):
def __init__(self, *args, **kwargs):
super(ModuleTest, self).__init__(*args, **kwargs)
self.jacobian_input = kwargs.get('jacobian_input', True)
self.should_test_cuda = kwargs.get('test_cuda', True)
self.should_test_pickle = kwargs.get('pickle', True)
def __call__(self, test_case):
module = self.constructor(*self.constructor_args)
@ -325,23 +756,78 @@ class ModuleTest(TestBase):
out = test_case._forward(module, input)
if isinstance(out, Variable):
out = out.data
ref_input = self._unpack_input(deepcopy(input))
ref_input = self._unpack(deepcopy(input))
expected_out = self.reference_fn(ref_input, test_case._get_parameters(module)[0])
test_case.assertEqual(out, expected_out)
self.test_noncontig(test_case, module, input)
if self.should_test_pickle:
# TODO: do this with in-memory files as soon as torch.save will support it
with TemporaryFile() as f:
test_case._forward(module, input)
torch.save(module, f)
f.seek(0)
module_copy = torch.load(f)
test_case.assertEqual(test_case._forward(module, input), test_case._forward(module_copy, input))
self._do_test(test_case, module, input)
def noncontiguize(self, obj):
if isinstance(obj, list):
return [self.noncontiguize(o) for o in obj]
tensor = obj.data if isinstance(obj, Variable) else obj
ndim = tensor.dim()
noncontig = torch.stack([tensor.clone().zero_(), tensor], ndim).select(ndim, 1)
assert noncontig.numel() == 1 or not noncontig.is_contiguous()
if isinstance(obj, Variable):
return Variable(noncontig, requires_grad=obj.requires_grad)
return noncontig
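# Illustrative sketch, not part of the original diff, of the trick above:
# stacking a zeroed clone with the tensor along a new trailing dimension and
# select()-ing index 1 keeps the original values but produces a strided,
# non-contiguous view:
#   t = torch.randn(2, 3)
#   nc = torch.stack([t.clone().zero_(), t], 2).select(2, 1)
#   nc.is_contiguous()  # False; same values as t, stride 2 in the last dim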
def test_noncontig(self, test_case, module, input):
test_case._zero_grad_parameters(module)
test_case._zero_grad_input(input)
with freeze_rng_state():
output = test_case._forward(module, input)
grad_output = output
if isinstance(grad_output, Variable):
grad_output = grad_output.data.clone()
else:
grad_output = grad_output.clone()
output = output.clone()
grad_output.normal_()
d_input = deepcopy(test_case._backward(module, input, output, grad_output))
d_param = deepcopy(test_case._get_parameters(module)[1])
nc_input = self.noncontiguize(input)
nc_grad_output = self.noncontiguize(grad_output)
for contig_i, contig_g in product((True, False), repeat=2):
i = input if contig_i else nc_input
go = grad_output if contig_g else nc_grad_output
test_case._zero_grad_parameters(module)
test_case._zero_grad_input(i)
with freeze_rng_state():
try:
out = test_case._forward(module, i)
except Exception:
# Some modules will fail because of non contiguous inputs and we're ok with that
continue
grad = test_case._backward(module, i, out, go)
test_case.assertEqual(out, output)
test_case.assertEqual(grad, d_input, 1e-4)
test_case.assertEqual(test_case._get_parameters(module)[1], d_param)
def test_cuda(self, test_case):
if not TEST_CUDA or not self.should_test_cuda:
raise unittest.SkipTest('Excluded from CUDA tests')
try:
cpu_input = self._get_input()
gpu_input = to_gpu(cpu_input, tensor_type=torch.cuda.FloatTensor)
type_map = {torch.DoubleTensor: torch.cuda.FloatTensor}
gpu_input = to_gpu(cpu_input, type_map=type_map)
cpu_module = self.constructor(*self.constructor_args)
gpu_module = self.constructor(*self.constructor_args).cuda()
test_case._zero_grad_parameters(cpu_module)
test_case._zero_grad_parameters(gpu_module)
gpu_module = self.constructor(*self.constructor_args).float().cuda()
cpu_param = test_case._get_parameters(cpu_module)
gpu_param = test_case._get_parameters(gpu_module)
for cpu_p, gpu_p in zip(cpu_param[0], gpu_param[0]):
@ -351,6 +837,10 @@ class ModuleTest(TestBase):
gpu_p = gpu_p.data
gpu_p.copy_(cpu_p)
test_case._zero_grad_input(cpu_input)
test_case._zero_grad_input(gpu_input)
test_case._zero_grad_parameters(cpu_module)
test_case._zero_grad_parameters(gpu_module)
cpu_output = test_case._forward(cpu_module, cpu_input)
gpu_output = test_case._forward(gpu_module, gpu_input)
test_case.assertEqual(cpu_output, gpu_output, 2e-4)
@ -364,6 +854,8 @@ class ModuleTest(TestBase):
test_case.assertEqual(cpu_gradInput, gpu_gradInput, 2e-4)
for cpu_d_p, gpu_d_p in zip(cpu_param[1], gpu_param[1]):
test_case.assertEqual(cpu_d_p, gpu_d_p, 2e-4)
self.test_noncontig(test_case, gpu_module, gpu_input)
except NotImplementedError:
pass
# TODO: remove this after CUDA scatter_ is implemented
@ -375,43 +867,60 @@ class ModuleTest(TestBase):
class CriterionTest(TestBase):
_required_arg_names = TestBase._required_arg_names.union({'target'})
def __init__(self, *args, **kwargs):
super(CriterionTest, self).__init__(*args, **kwargs)
self.target = kwargs.get('target', None)
self.should_test_cuda = kwargs.get('test_cuda', True)
def _get_target(self):
return self._get_arg('target')
def __call__(self, test_case):
module = self.constructor(*self.constructor_args)
input = self._get_input()
# Check that these methods don't raise errors
module.__repr__()
str(module)
target = self._get_target()
if self.reference_fn is not None:
out = test_case._forward_criterion(module, input, self.target)
expected_out = self.reference_fn(deepcopy(self._unpack_input(input)),
deepcopy(self.target), module)
out = test_case._forward_criterion(module, input, target)
expected_out = self.reference_fn(deepcopy(self._unpack(input)),
deepcopy(self._unpack(target)), module)
test_case.assertEqual(out, expected_out)
test_case.check_criterion_jacobian(module, input, self.target)
test_case.check_criterion_jacobian(module, input, target)
self._do_extra_tests(test_case, module, input, target)
def test_cuda(self, test_case):
if not TEST_CUDA or not self.should_test_cuda:
raise unittest.SkipTest('Excluded from CUDA tests')
try:
cpu_input = self._get_input()
gpu_input = to_gpu(cpu_input, tensor_type=torch.cuda.FloatTensor)
type_map = {
torch.DoubleTensor: torch.cuda.FloatTensor,
}
gpu_input = to_gpu(cpu_input, type_map=type_map)
cpu_target = self.target
gpu_target = to_gpu(self.target, tensor_type=torch.cuda.FloatTensor)
cpu_target = self._get_target()
gpu_target = to_gpu(cpu_target, type_map=type_map)
cpu_module = self.constructor(*self.constructor_args)
gpu_module = self.constructor(*self.constructor_args).cuda()
gpu_module = self.constructor(*self.constructor_args).float().cuda()
cpu_output = test_case._forward_criterion(cpu_module, cpu_input, cpu_target)
gpu_output = test_case._forward_criterion(gpu_module, gpu_input, gpu_target)
test_case.assertEqual(cpu_output, gpu_output, 2e-4)
test_case.assertEqual(cpu_output, gpu_output, 4e-4)
cpu_gradInput = test_case._backward_criterion(cpu_module, cpu_input, cpu_target)
gpu_gradInput = test_case._backward_criterion(gpu_module, gpu_input, gpu_target)
test_case.assertEqual(cpu_gradInput, gpu_gradInput, 2e-4)
test_case.assertEqual(cpu_gradInput, gpu_gradInput, 4e-4)
except NotImplementedError:
pass
def _do_extra_tests(self, test_case, module, input, target):
pass

test/data/network1.py

@ -0,0 +1,8 @@
import torch.nn as nn
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.linear = nn.Linear(10, 20)

test/data/network2.py

@ -0,0 +1,9 @@
import torch.nn as nn
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.linear = nn.Linear(10, 20)
self.relu = nn.ReLU()


@ -0,0 +1,71 @@
import torch
def check_error(desc, fn, *required_substrings):
try:
fn()
except Exception as e:
error_message = e.args[0]
print('=' * 80)
print(desc)
print('-' * 80)
print(error_message)
print('')
for sub in required_substrings:
assert sub in error_message
return
assert False, "given function ({}) didn't raise an error".format(desc)
check_error(
'Wrong argument types',
lambda: torch.FloatStorage(object()),
'object')
check_error('Unknown keyword argument',
lambda: torch.FloatStorage(content=1234.),
'keyword')
check_error('Invalid types inside a sequence',
lambda: torch.FloatStorage(['a', 'b']),
'list', 'str')
check_error('Invalid size type',
lambda: torch.FloatStorage(1.5),
'float')
check_error('Invalid offset',
lambda: torch.FloatStorage(torch.FloatStorage(2), 4),
'2', '4')
check_error('Negative offset',
lambda: torch.FloatStorage(torch.FloatStorage(2), -1),
'2', '-1')
check_error('Invalid size',
lambda: torch.FloatStorage(torch.FloatStorage(3), 1, 5),
'2', '1', '5')
check_error('Negative size',
lambda: torch.FloatStorage(torch.FloatStorage(3), 1, -5),
'2', '1', '-5')
check_error('Invalid index type',
lambda: torch.FloatStorage(10)['first item'],
'str')
def assign():
torch.FloatStorage(10)[1:-1] = '1'
check_error('Invalid value type',
assign,
'str')
check_error('resize_ with invalid type',
lambda: torch.FloatStorage(10).resize_(1.5),
'float')
check_error('fill_ with invalid type',
lambda: torch.IntStorage(10).fill_('asdf'),
'str')
# TODO: frombuffer


@ -0,0 +1,46 @@
graph(%1 : Double(10, 3, 224, 224)
%2 : Double(64, 3, 11, 11)
%3 : Double(64)
%4 : Double(192, 64, 5, 5)
%5 : Double(192)
%6 : Double(384, 192, 3, 3)
%7 : Double(384)
%8 : Double(256, 384, 3, 3)
%9 : Double(256)
%10 : Double(256, 256, 3, 3)
%11 : Double(256)
%12 : Double(4096, 9216)
%13 : Double(4096)
%14 : Double(4096, 4096)
%15 : Double(4096)
%16 : Double(1000, 4096)
%17 : Double(1000)) {
%19 : Double(10, 64, 55, 55), %20 : Handle = CppOp[ConvForward](%1, %2, %3), uses = [[%21.i0], []];
%21 : Double(10, 64, 55, 55) = threshold[threshold={0}, value={0}, inplace=1](%19), uses = [%22.i0];
%23 : Double(10, 64, 27, 27), %24 : Long(10, 64, 27, 27) = max_pool2d[kernel_size=[3, 3], stride=[2, 2], padding=[0, 0], dilation=[1, 1], ceil_mode=0](%21), uses = [[%25.i0], []];
%26 : Double(10, 192, 27, 27), %27 : Handle = CppOp[ConvForward](%23, %4, %5), uses = [[%28.i0], []];
%28 : Double(10, 192, 27, 27) = threshold[threshold={0}, value={0}, inplace=1](%26), uses = [%29.i0];
%30 : Double(10, 192, 13, 13), %31 : Long(10, 192, 13, 13) = max_pool2d[kernel_size=[3, 3], stride=[2, 2], padding=[0, 0], dilation=[1, 1], ceil_mode=0](%28), uses = [[%32.i0], []];
%33 : Double(10, 384, 13, 13), %34 : Handle = CppOp[ConvForward](%30, %6, %7), uses = [[%35.i0], []];
%35 : Double(10, 384, 13, 13) = threshold[threshold={0}, value={0}, inplace=1](%33), uses = [%36.i0];
%37 : Double(10, 256, 13, 13), %38 : Handle = CppOp[ConvForward](%35, %8, %9), uses = [[%39.i0], []];
%39 : Double(10, 256, 13, 13) = threshold[threshold={0}, value={0}, inplace=1](%37), uses = [%40.i0];
%41 : Double(10, 256, 13, 13), %42 : Handle = CppOp[ConvForward](%39, %10, %11), uses = [[%43.i0], []];
%43 : Double(10, 256, 13, 13) = threshold[threshold={0}, value={0}, inplace=1](%41), uses = [%44.i0];
%45 : Double(10, 256, 6, 6), %46 : Long(10, 256, 6, 6) = max_pool2d[kernel_size=[3, 3], stride=[2, 2], padding=[0, 0], dilation=[1, 1], ceil_mode=0](%43), uses = [[%47.i0], []];
%47 : Double(10, 9216) = view[size=[10, 9216]](%45), uses = [%48.i0];
%49 : Double(10, 9216), %50 : Handle = ^Dropout(0.5, True, False)(%47), uses = [[%53.i1], []];
%51 : Double(9216!, 4096!) = t(%12), uses = [%53.i2];
%52 : Double(10!, 4096) = expand[size=[10, 4096]](%13), uses = [%53.i0];
%53 : Double(10, 4096) = addmm[beta={1}, alpha={1}](%52, %49, %51), uses = [%54.i0];
%54 : Double(10, 4096) = threshold[threshold={0}, value={0}, inplace=1](%53), uses = [%55.i0];
%56 : Double(10, 4096), %57 : Handle = ^Dropout(0.5, True, False)(%54), uses = [[%60.i1], []];
%58 : Double(4096!, 4096!) = t(%14), uses = [%60.i2];
%59 : Double(10!, 4096) = expand[size=[10, 4096]](%15), uses = [%60.i0];
%60 : Double(10, 4096) = addmm[beta={1}, alpha={1}](%59, %56, %58), uses = [%61.i0];
%61 : Double(10, 4096) = threshold[threshold={0}, value={0}, inplace=1](%60), uses = [%64.i1];
%62 : Double(4096!, 1000!) = t(%16), uses = [%64.i2];
%63 : Double(10!, 1000) = expand[size=[10, 1000]](%17), uses = [%64.i0];
%64 : Double(10, 1000) = addmm[beta={1}, alpha={1}](%63, %61, %62), uses = [%0.i0];
return (%64);
}


@ -0,0 +1,8 @@
graph(%1 : Double(10, 10)
-------- stage 1 --------
%4 : Double(10, 10!)) {
%3 : Double(10, 10) = ^MyFn()(%1), uses = [[%0.i0, %5.i0]];
---------------- stage 1 ----------------
%5 : Double(10, 10) = mul(%3, %4), uses = [%0.i1];
return (%3, %5);
}


@ -0,0 +1,23 @@
graph(%1 : Double(2, 2)
%2 : Double(2, 2)
-------- stage 1 --------
%5 : Double(2, 2)
-------- stage 2 --------
%9 : Double(2, 2!)
%10 : Double(2, 2)) {
%3 : Double(2, 2) = mul[other={2}](%2), uses = [%4.i0, %7.i1, %11.i1];
%4 : Double(2, 2) = mul(%3, %1), uses = [%0.i0];
---------------- stage 1 ----------------
%6 : Double(2, 2) = mul(%5, %1), uses = [%8.i0];
%7 : Double(2, 2) = mul(%5, %3), uses = [%0.i1];
%8 : Double(2, 2) = mul[other={2}](%6), uses = [%0.i2];
---------------- stage 2 ----------------
%11 : Double(2, 2) = mul(%9, %3), uses = [%17.i0];
%12 : Double(2, 2) = mul(%9, %5), uses = [%14.i0];
%13 : Double(2, 2) = mul[other={2}](%10), uses = [%15.i0, %16.i0];
%14 : Double(2, 2) = mul[other={2}](%12), uses = [%0.i5];
%15 : Double(2, 2) = mul(%13, %1), uses = [%17.i1];
%16 : Double(2, 2) = mul(%13, %5), uses = [%0.i4];
%18 : Double(2, 2) = CppOp[N5torch8autograd3AddE](%11, %15), uses = [[%0.i3]];
return (%4, %7, %8, %18, %16, %14);
}


@ -0,0 +1,10 @@
graph(%1 : Double(3, 3)
%2 : Double(3, 3)
-------- stage 1 --------
%4 : Double(3, 3)) {
%3 : Double(3, 3) = cross[dim=-1](%1, %2), uses = [%0.i0];
---------------- stage 1 ----------------
%5 : Double(3, 3) = cross[dim=-1](%2, %4), uses = [%0.i1];
%6 : Double(3, 3) = cross[dim=-1](%4, %1), uses = [%0.i2];
return (%3, %5, %6);
}


@ -0,0 +1,8 @@
graph(%1 : Double(2, 2)
%2 : Double(2)
%3 : Double(2)
%4 : Double(2)
%5 : Double(2)) {
%7 : Double(2, 2), %8 : Handle = CppOp[N5torch8autograd16BatchNormForwardE](%1, %2, %3), uses = [[%0.i0], []], scope: BatchNorm2d;
return (%7);
}


@ -0,0 +1,6 @@
graph(%1 : Double(1, 3, 10, 10)
%2 : Double(8, 3, 3, 3)
%3 : Double(8)) {
%5 : Double(1, 8, 8, 8), %6 : Handle = CppOp[ConvForward](%1, %2, %3), uses = [[%0.i0], []];
return (%5);
}


@ -0,0 +1,12 @@
graph(%1 : Double(3, 20)
%2 : Double(3, 20)) {
%7 : Double(6, 20) = fusion_group_0(%1, %2), uses = [[%0.i0]];
return (%7);
}
with fusion_group_0 = graph(%4 : Double(3, 20)
%5 : Double(3, 20)) {
%7 : Double(3, 20) = add[alpha={1}](%4, %5), uses = [%3.i0];
%6 : Double(3, 20) = mul(%4, %5), uses = [%3.i1];
%3 : Double(6, 20) = cat[dim=0](%7, %6), uses = [%0.i0];
return (%3);
}


@ -0,0 +1,6 @@
graph(%1 : Double(20, 16, 50, 40)
%2 : Double(13, 16, 3, 3)) {
%4 : UNKNOWN_TYPE = Undefined(), uses = [%3.i2], scope: Conv2d;
%5 : Double(20, 13, 48, 38), %6 : Handle = CppOp[ConvForward](%1, %2, %4), uses = [[%0.i0], []], scope: Conv2d;
return (%5);
}


@ -0,0 +1,10 @@
graph(%1 : Double(2)
%2 : Double(2)) {
%3 : Double(2) = add[alpha={1}](%1, %2), uses = [%5.i0, %5.i1, %7.i1];
%5 : Double(2) = mul(%3, %3), uses = [%7.i0];
%7 : Double(2) = mul(%5, %3), uses = [%8.i0, %16.i0];
%8 : Double(2) = tanh(%7), uses = [%10.i0, %10.i1];
%10 : Double(2) = add[alpha={1}](%8, %8), uses = [%16.i1];
%16 : Double(2) = add[alpha={1}](%7, %10), uses = [%0.i0];
return (%16);
}


@ -0,0 +1,4 @@
graph(%1 : Double(2, 2)) {
%3 : Double(2, 2), %4 : Handle = ^Dropout(0.6, True, False)(%1), uses = [[%0.i0], []], scope: Dropout;
return (%3);
}


@ -0,0 +1,26 @@
graph(%1 : Double(3, 10)
%2 : Double(3, 20)
%3 : Double(3, 20)
%4 : Double(80, 10)
%5 : Double(80, 20)
%6 : Double(80)
%7 : Double(80)) {
%8 : Double(10!, 80!) = Transpose[perm=[1, 0]](%4), uses = [%9.i0];
%9 : UNKNOWN_TYPE = Transpose(%8), uses = [%10.i1];
%10 : Double(3, 80) = FC(%1, %9, %6), uses = [%14.i0];
%11 : Double(20!, 80!) = Transpose[perm=[1, 0]](%5), uses = [%12.i0];
%12 : UNKNOWN_TYPE = Transpose(%11), uses = [%13.i1];
%13 : Double(3, 80) = FC(%2, %12, %7), uses = [%14.i1];
%14 : Double(3, 80) = Add(%10, %13), uses = [%15.i0];
%16 : Double(3!, 20), %17 : Double(3!, 20), %18 : Double(3!, 20), %19 : Double(3!, 20) = Split[split=[20, 20, 20, 20], axis=1](%14), uses = [[%20.i0], [%21.i0], [%22.i0], [%23.i0]];
%20 : Double(3, 20) = Sigmoid(%16), uses = [%25.i0];
%21 : Double(3, 20) = Sigmoid(%17), uses = [%24.i0];
%22 : Double(3, 20) = Tanh(%18), uses = [%25.i1];
%23 : Double(3, 20) = Sigmoid(%19), uses = [%28.i0];
%24 : Double(3, 20) = Mul(%21, %3), uses = [%26.i0];
%25 : Double(3, 20) = Mul(%20, %22), uses = [%26.i1];
%26 : Double(3, 20) = Add(%24, %25), uses = [%27.i0, %0.i1];
%27 : Double(3, 20) = Tanh(%26), uses = [%28.i1];
%28 : Double(3, 20) = Mul(%23, %27), uses = [%0.i0];
return (%28, %26);
}


@ -0,0 +1,7 @@
graph(%1 : Double(4, 4)
%2 : Double(4, 4)) {
%3 : Double(4, 4) = add[alpha={1}](%1, %2), uses = [%4.i0];
%5 : Double(4!, 2), %6 : Double(4!, 2) = split[split_size=2, dim=1](%3), uses = [[%7.i0], [%7.i1]];
%7 : Double(4, 2) = mul(%5, %6), uses = [%0.i0];
return (%7);
}


@ -0,0 +1,16 @@
graph(%1 : Double(4, 4)
%2 : Double(4, 4)) {
%9 : Double(4!, 2), %10 : Double(4!, 2) = split[split_size=2, dim=1](%1), uses = [[%16.i0], [%16.i2]];
%12 : Double(4!, 2), %13 : Double(4!, 2) = split[split_size=2, dim=1](%2), uses = [[%16.i1], [%16.i3]];
%17 : Double(4, 2) = fusion_group_0(%9, %12, %10, %13), uses = [[%0.i0]];
return (%17);
}
with fusion_group_0 = graph(%4 : Double(4!, 2)
%5 : Double(4!, 2)
%7 : Double(4!, 2)
%8 : Double(4!, 2)) {
%9 : Double(4, 2) = add[alpha={1}](%7, %8), uses = [%3.i1];
%6 : Double(4, 2) = add[alpha={1}](%4, %5), uses = [%3.i0];
%3 : Double(4, 2) = mul(%6, %9), uses = [%0.i0];
return (%3);
}


@ -0,0 +1,6 @@
graph(%1 : Double(1)) {
%2 : Double(1) = clone(%1), uses = [%3.i0];
%3 : Double(1) = add[other={2}, alpha={1}](%2), uses = [%4.i0];
%4 : Double(1) = add[other={3}, alpha={1}](%3), uses = [%0.i0];
return (%4);
}


@ -0,0 +1,42 @@
graph(%1 : Double(3, 10)
%2 : Double(3, 20)
%3 : Double(3, 20)
%4 : Double(80, 10)
%5 : Double(80, 20)
%6 : Double(80)
%7 : Double(80)) {
%8 : Double(10!, 80!) = Transpose[perm=[1, 0]](%4), uses = [%9.i0];
%9 : UNKNOWN_TYPE = Transpose(%8), uses = [%10.i1];
%10 : Double(3, 80) = FC(%1, %9, %6), uses = [%32.i0];
%11 : Double(20!, 80!) = Transpose[perm=[1, 0]](%5), uses = [%12.i0];
%12 : UNKNOWN_TYPE = Transpose(%11), uses = [%13.i1];
%13 : Double(3, 80) = FC(%2, %12, %7), uses = [%33.i0];
%36 : Double(3!, 20), %39 : Double(3!, 20), %42 : Double(3!, 20), %45 : Double(3!, 20) = Split[split=[20, 20, 20, 20], axis=1](%13), uses = [[%29.i8], [%29.i6], [%29.i4], [%29.i2]];
%35 : Double(3!, 20), %38 : Double(3!, 20), %41 : Double(3!, 20), %44 : Double(3!, 20) = Split[split=[20, 20, 20, 20], axis=1](%10), uses = [[%29.i7], [%29.i5], [%29.i3], [%29.i1]];
%30 : Double(3, 20), %31 : Double(3, 20) = fusion_group_0(%3, %44, %45, %41, %42, %38, %39, %35, %36), uses = [[%0.i0], [%0.i1]];
return (%30, %31);
}
with fusion_group_0 = graph(%13 : Double(3, 20)
%23 : Double(3!, 20)
%24 : Double(3!, 20)
%26 : Double(3!, 20)
%27 : Double(3!, 20)
%29 : Double(3!, 20)
%30 : Double(3!, 20)
%32 : Double(3!, 20)
%33 : Double(3!, 20)) {
%34 : Double(3, 20) = Add(%32, %33), uses = [%22.i0];
%31 : Double(3, 20) = Add(%29, %30), uses = [%20.i0];
%28 : Double(3, 20) = Add(%26, %27), uses = [%18.i0];
%25 : Double(3, 20) = Add(%23, %24), uses = [%16.i0];
%22 : Double(3, 20) = Sigmoid(%34), uses = [%11.i0];
%20 : Double(3, 20) = Sigmoid(%31), uses = [%14.i0];
%18 : Double(3, 20) = Tanh(%28), uses = [%11.i1];
%16 : Double(3, 20) = Sigmoid(%25), uses = [%3.i0];
%14 : Double(3, 20) = Mul(%20, %13), uses = [%8.i0];
%11 : Double(3, 20) = Mul(%22, %18), uses = [%8.i1];
%8 : Double(3, 20) = Add(%14, %11), uses = [%5.i0, %0.i1];
%5 : Double(3, 20) = Tanh(%8), uses = [%3.i1];
%3 : Double(3, 20) = Mul(%16, %5), uses = [%0.i0];
return (%3, %8);
}


@ -0,0 +1,42 @@
graph(%1 : Double(3, 10)
%2 : Double(3, 20)
%3 : Double(3, 20)
%4 : Double(80, 10)
%5 : Double(80, 20)
%6 : Double(80)
%7 : Double(80)) {
%8 : Double(10!, 80!) = t(%4), uses = [%10.i2];
%9 : Double(3!, 80) = expand[size=[3, 80]](%6), uses = [%10.i0];
%10 : Double(3, 80) = addmm[beta={1}, alpha={1}](%9, %1, %8), uses = [%32.i0];
%11 : Double(20!, 80!) = t(%5), uses = [%13.i2];
%12 : Double(3!, 80) = expand[size=[3, 80]](%7), uses = [%13.i0];
%13 : Double(3, 80) = addmm[beta={1}, alpha={1}](%12, %2, %11), uses = [%37.i0];
%33 : Double(3!, 20), %34 : Double(3!, 20), %35 : Double(3!, 20), %36 : Double(3!, 20) = split[split_size=20, dim=1](%10), uses = [[%29.i7], [%29.i5], [%29.i3], [%29.i1]];
%38 : Double(3!, 20), %39 : Double(3!, 20), %40 : Double(3!, 20), %41 : Double(3!, 20) = split[split_size=20, dim=1](%13), uses = [[%29.i8], [%29.i6], [%29.i4], [%29.i2]];
%30 : Double(3, 20), %31 : Double(3, 20) = fusion_group_0(%3, %36, %41, %35, %40, %34, %39, %33, %38), uses = [[%0.i0], [%0.i1]];
return (%30, %31);
}
with fusion_group_0 = graph(%13 : Double(3, 20)
%23 : Double(3!, 20)
%24 : Double(3!, 20)
%26 : Double(3!, 20)
%27 : Double(3!, 20)
%29 : Double(3!, 20)
%30 : Double(3!, 20)
%32 : Double(3!, 20)
%33 : Double(3!, 20)) {
%34 : Double(3, 20) = add[alpha={1}](%32, %33), uses = [%22.i0];
%31 : Double(3, 20) = add[alpha={1}](%29, %30), uses = [%20.i0];
%28 : Double(3, 20) = add[alpha={1}](%26, %27), uses = [%18.i0];
%25 : Double(3, 20) = add[alpha={1}](%23, %24), uses = [%16.i0];
%22 : Double(3, 20) = sigmoid(%34), uses = [%11.i0];
%20 : Double(3, 20) = sigmoid(%31), uses = [%14.i0];
%18 : Double(3, 20) = tanh(%28), uses = [%11.i1];
%16 : Double(3, 20) = sigmoid(%25), uses = [%3.i0];
%14 : Double(3, 20) = mul(%20, %13), uses = [%8.i0];
%11 : Double(3, 20) = mul(%22, %18), uses = [%8.i1];
%8 : Double(3, 20) = add[alpha={1}](%14, %11), uses = [%5.i0, %0.i1];
%5 : Double(3, 20) = tanh(%8), uses = [%3.i1];
%3 : Double(3, 20) = mul(%16, %5), uses = [%0.i0];
return (%3, %8);
}


@ -0,0 +1,9 @@
graph(%1 : UNKNOWN_TYPE
%2 : UNKNOWN_TYPE) {
%3 : Double(1) = add[alpha={1}](%1, %2), uses = [%4.i1];
%4 : Double(1) = mul(%1, %3), uses = [%5.i0];
%5 : Double(1) = tanh(%4), uses = [%6.i0];
%6 : Double(1) = sigmoid(%5), uses = [%0.i0];
%7 : UNKNOWN_TYPE = TensorTest[a= 1 1 1 1 [ CPUDoubleTensor{2,2} ]](), uses = [];
return (%6);
}


@ -0,0 +1,8 @@
graph(%1 : Double(1)
%2 : Double(1)) {
%3 : Double(1) = add[alpha={1}](%1, %2), uses = [%4.i1];
%4 : Double(1) = mul(%1, %3), uses = [%5.i0], scope: Foo;
%5 : Double(1) = tanh(%4), uses = [%6.i0], scope: Foo/Bar;
%6 : Double(1) = sigmoid(%5), uses = [%0.i0], scope: Foo;
return (%6);
}


@ -0,0 +1,9 @@
graph(%1 : Double(1, 3, 227, 227)
%2 : Double(64, 3, 11, 11)
%3 : Double(64)) {
%5 : UNKNOWN_TYPE = Conv[kernel_shape=[11, 11], strides=[4, 4], pads=[2, 2, 2, 2], dilations=[1, 1], group=1](%1, %2), uses = [[%6.i0]], scope: Net/Sequential[features]/Conv2d[0];
%6 : Double(1, 64, 56, 56) = Add[broadcast=1, axis=1](%5, %3), uses = [%7.i0], scope: Net/Sequential[features]/Conv2d[0];
%7 : Double(1, 64, 56, 56) = Relu(%6), uses = [%8.i0], scope: Net/Sequential[features]/ReLU[1];
%8 : Double(1, 64, 27, 27) = MaxPool[kernel_shape=[3, 3], pads=[0, 0], strides=[2, 2]](%7), uses = [%0.i0], scope: Net/Sequential[features]/MaxPool2d[2];
return (%8);
}


@ -0,0 +1,5 @@
graph(%1 : Double(2)) {
%2 : Double(2) = Softmax[axis=0](%1), uses = [%3.i0], scope: Net;
%3 : Double(2) = Log(%2), uses = [%0.i0], scope: Net;
return (%3);
}


@ -0,0 +1,8 @@
graph(%1 : Double(1)
%2 : Double(1)) {
%3 : Double(1) = add[alpha={1}](%1, %2), uses = [%4.i1];
%4 : Double(1) = mul(%1, %3), uses = [%5.i0];
%5 : Double(1) = tanh(%4), uses = [%6.i0];
%6 : Double(1) = sigmoid(%5), uses = [%0.i0];
return (%6);
}

test/ffi/src/cpu/lib.h

@ -0,0 +1,6 @@
void good_func(THFloatTensor *tensor, int a, float b);
void bad_func(THFloatTensor *tensor, int a, float b);
THFloatTensor * new_tensor(int a);
float int_to_float(int a);

test/ffi/src/cpu/lib1.c Normal file

@ -0,0 +1,19 @@
#include <TH/TH.h>
void good_func(THFloatTensor *tensor, int a, float b)
{
THFloatTensor_mul(tensor, tensor, a);
THFloatTensor_add(tensor, tensor, b);
}
THFloatTensor * new_tensor(int a)
{
THFloatTensor *t = THFloatTensor_newWithSize2d(a, a);
THFloatTensor_fill(t, a);
return t;
}
float int_to_float(int a)
{
return a;
}

test/ffi/src/cpu/lib2.c Normal file

@ -0,0 +1,8 @@
#include <TH/TH.h>
void bad_func(THFloatTensor *tensor, int a, float b)
{
THFloatTensor_mul(tensor, tensor, a);
THFloatTensor_add(tensor, tensor, b);
THFloatTensor_addbmm(tensor, 1, tensor, 1, tensor, tensor);
}


@ -0,0 +1,12 @@
#include <TH/TH.h>
#include <THC/THC.h>
extern THCState *state;
#include "../cpu/lib1.c"
void cuda_func(THCudaTensor *tensor, int a, float b)
{
THCudaTensor_mul(state, tensor, tensor, a);
THCudaTensor_add(state, tensor, tensor, b);
}


@ -0,0 +1,5 @@
void good_func(THFloatTensor *tensor, int a, float b);
void cuda_func(THCudaTensor *tensor, int a, float b);
THFloatTensor * new_tensor(int a);
float int_to_float(int a);

test/ffi/src/lib.h Normal file

@ -0,0 +1,5 @@
void my_func(THFloatTensor *tensor, int a, float b);
void my_cuda_func(THCudaTensor *tensor, int a, float b);
THFloatTensor * new_t(int a);
float new_int(int a);
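
These headers and sources back the cffi extension tests. A minimal build-and-call sketch, assuming the torch.utils.ffi.create_extension API shipped with this release; the module name _ext.cpu_lib and the paths are illustrative:

import torch
from torch.utils.ffi import create_extension

# Build a Python wrapper for the CPU sources above (hypothetical module name).
ffi = create_extension(
    '_ext.cpu_lib',
    headers=['test/ffi/src/cpu/lib.h'],
    sources=['test/ffi/src/cpu/lib1.c'],
    with_cuda=False,
)
ffi.build()

from _ext import cpu_lib  # module produced by the build step

t = torch.ones(4)
cpu_lib.good_func(t, 2, 1.5)  # in place: t = t * 2 + 1.5
print(t)                      # expect a FloatTensor full of 3.5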


@ -1,5 +1,5 @@
# th test.lua > lua.out
th test.lua > lua.out
python3 test.py > python.out
diff lua.out python.out >/dev/null 2>&1

File diff suppressed because it is too large


@ -1,39 +0,0 @@
assert(arg[1])
funcs = {
'resizeAs', 'add', 'zero', 'mul', 'div', 'abs',
'addcmul', 'addcdiv', 'copy', 'sqrt', 'fill',
{'cmul', 'mul'},
{'cdiv', 'div'},
}
for _, val in pairs(funcs) do
local name, newname
if type(val) == 'table' then
name = val[1]
newname = val[2]
else
name = val
newname = val .. '_'
end
command = "sed -i -r "
.. "'/torch\\." .. name .. "\\(/b; " -- short-circuits
.. "s/([a-zA-Z]*)\\." .. name .. "\\(" -- substitution
.. "/"
.. "\\1\\." .. newname .. "\\(/g' " .. arg[1]
print(command)
os.execute(command)
command = "sed -i 's/math\\." .. newname
.. "/math\\." .. name .. "/' " .. arg[1]
print(command)
os.execute(command)
end
funcs = {
{'torch\.cmul', 'torch\.mul'},
{'torch\.cdiv', 'torch\.div'},
}
for _, val in pairs(funcs) do
command = "sed -i 's/" .. val[1] .. "/" .. val[2] .. "/' " .. arg[1]
print(command)
os.execute(command)
end

test/optim/test.lua Normal file

@ -0,0 +1,33 @@
local cjson = require 'cjson'
require 'optim'
function rosenbrock(t)
x, y = t[1], t[2]
return (1 - x) ^ 2 + 100 * (y - x^2)^2
end
function drosenbrock(t)
x, y = t[1], t[2]
return torch.DoubleTensor({-400 * x * (y - x^2) - 2 * (1 - x), 200 * x * (y - x^2)})
end
local fd = io.open('tests.json', 'r')
local tests = cjson.decode(fd:read('*a'))
fd:close()
for i, test in ipairs(tests) do
print(test.algorithm)
algorithm = optim[test.algorithm]
for i, config in ipairs(test.config) do
print('================================================================================')
params = torch.DoubleTensor({1.5, 1.5})
for i = 1, 100 do
function closure(x)
return rosenbrock(x), drosenbrock(x)
end
algorithm(closure, params, config)
print(string.format('%.8f\t%.8f', params[1], params[2]))
end
end
end


@ -3,13 +3,15 @@ import torch
import torch.legacy.optim as optim
from pprint import pprint
def rosenbrock(tensor):
x, y = tensor
return (1 - x)**2 + 100 * (y - x**2)**2
return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2
def drosenbrock(tensor):
x, y = tensor
return torch.DoubleTensor((-400 * x * (y - x**2) - 2 * (1 - x), 200 * x * (y - x**2)))
return torch.DoubleTensor((-400 * x * (y - x ** 2) - 2 * (1 - x), 200 * x * (y - x ** 2)))
algorithms = {
'adadelta': optim.adadelta,
@ -22,6 +24,7 @@ algorithms = {
'rmsprop': optim.rmsprop,
'rprop': optim.rprop,
'sgd': optim.sgd,
'lbfgs': optim.lbfgs,
}
with open('tests.json', 'r') as f:
@ -35,4 +38,4 @@ for test in tests:
params = torch.DoubleTensor((1.5, 1.5))
for i in range(100):
algorithm(lambda x: (rosenbrock(x), drosenbrock(x)), params, config)
print('{:.12f}\t{:.12f}\t'.format(params[0], params[1]))
print('{:.8f}\t{:.8f}\t'.format(params[0], params[1]))


@ -98,5 +98,12 @@
{"learningRate": 1e-4, "nesterov": true, "momentum": 0.95, "dampening": 0},
{"weightDecay": 0.2}
]
},
{
"algorithm": "lbfgs",
"config": [
{},
{"learningRate": 1e-1}
]
}
]
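
The new lbfgs entry is consumed by the loop in test.py above. A minimal standalone sketch of the same closure/params/config calling convention, run for lbfgs only (step count and config values are illustrative):

import torch
import torch.legacy.optim as optim

def rosenbrock(t):
    x, y = t
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def drosenbrock(t):
    x, y = t
    return torch.DoubleTensor((-400 * x * (y - x ** 2) - 2 * (1 - x), 200 * x * (y - x ** 2)))

# Same closure/params/config pattern as test.py above, for the new lbfgs entry.
params = torch.DoubleTensor((1.5, 1.5))
config = {'learningRate': 1e-1}
for _ in range(100):
    optim.lbfgs(lambda p: (rosenbrock(p), drosenbrock(p)), params, config)
print('{:.8f}\t{:.8f}'.format(params[0], params[1]))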

test/run_test.sh Executable file

@ -0,0 +1,117 @@
#!/usr/bin/env bash
set -e
PYCMD=${PYCMD:="python"}
COVERAGE=0
while [[ "$#" -gt 0 ]]; do
case "$1" in
-p|--python) PYCMD=$2; shift 2 ;;
-c|--coverage) COVERAGE=1; shift 1;;
--) shift; break ;;
*) echo "Invalid argument: $1!" ; exit 1 ;;
esac
done
if [[ $COVERAGE -eq 1 ]]; then
coverage erase
PYCMD="coverage run --parallel-mode --source torch "
echo "coverage flag found. Setting python command to: \"$PYCMD\""
fi
pushd "$(dirname "$0")"
echo "Running JIT tests"
$PYCMD test_jit.py $@
echo "Running torch tests"
$PYCMD test_torch.py $@
echo "Running autograd tests"
$PYCMD test_autograd.py $@
$PYCMD test_potrf.py $@
echo "Running torch.distributions tests"
$PYCMD test_distributions.py $@
echo "Running sparse tests"
$PYCMD test_sparse.py $@
echo "Running nn tests"
$PYCMD test_nn.py $@
echo "Running legacy nn tests"
$PYCMD test_legacy_nn.py $@
echo "Running optim tests"
$PYCMD test_optim.py $@
echo "Running multiprocessing tests"
$PYCMD test_multiprocessing.py $@
MULTIPROCESSING_METHOD=spawn $PYCMD test_multiprocessing.py $@
MULTIPROCESSING_METHOD=forkserver $PYCMD test_multiprocessing.py $@
echo "Running util tests"
$PYCMD test_utils.py $@
echo "Running dataloader tests"
$PYCMD test_dataloader.py $@
echo "Running cuda tests"
$PYCMD test_cuda.py $@
echo "Running NCCL tests"
$PYCMD test_nccl.py $@
distributed_set_up() {
export TEMP_DIR="$(mktemp -d)"
rm -rf "$TEMP_DIR/"*
mkdir "$TEMP_DIR/barrier"
mkdir "$TEMP_DIR/test_dir"
}
distributed_tear_down() {
rm -rf "$TEMP_DIR"
}
trap distributed_tear_down EXIT SIGHUP SIGINT SIGTERM
echo "Running distributed tests for the TCP backend"
distributed_set_up
BACKEND=tcp WORLD_SIZE=3 $PYCMD ./test_distributed.py
distributed_tear_down
echo "Running distributed tests for the TCP backend with file init_method"
distributed_set_up
BACKEND=tcp WORLD_SIZE=3 INIT_METHOD='file://'$TEMP_DIR'/shared_init_file' $PYCMD ./test_distributed.py
distributed_tear_down
echo "Running distributed tests for the Gloo backend"
distributed_set_up
BACKEND=gloo WORLD_SIZE=3 $PYCMD ./test_distributed.py
distributed_tear_down
echo "Running distributed tests for the Gloo backend with file init_method"
distributed_set_up
BACKEND=gloo WORLD_SIZE=3 INIT_METHOD='file://'$TEMP_DIR'/shared_init_file' $PYCMD ./test_distributed.py
distributed_tear_down
if [ -x "$(command -v mpiexec)" ]; then
echo "Running distributed tests for the MPI backend"
distributed_set_up
BACKEND=mpi mpiexec -n 3 $PYCMD ./test_distributed.py
distributed_tear_down
echo "Running distributed tests for the MPI backend with file init_method"
distributed_set_up
BACKEND=mpi INIT_METHOD='file://'$TEMP_DIR'/shared_init_file' mpiexec -n 3 $PYCMD ./test_distributed.py
distributed_tear_down
else
echo "Skipping MPI backend tests (MPI not found)"
fi
if [[ $COVERAGE -eq 1 ]]; then
coverage combine
coverage html
fi
popd

File diff suppressed because it is too large

File diff suppressed because it is too large

test/test_dataloader.py Normal file

@ -0,0 +1,571 @@
import math
import sys
import errno
import os
import ctypes
import signal
import torch
import time
import traceback
import unittest
from torch import multiprocessing
from torch.utils.data import Dataset, TensorDataset, DataLoader, ConcatDataset
from torch.utils.data.dataset import random_split
from torch.utils.data.dataloader import default_collate, ExceptionWrapper
from common import TestCase, run_tests, TEST_NUMPY, IS_WINDOWS
from common_nn import TEST_CUDA
JOIN_TIMEOUT = 17.0 if IS_WINDOWS else 4.5
class TestDatasetRandomSplit(TestCase):
def test_lengths_must_equal_dataset_size(self):
with self.assertRaises(ValueError):
random_split([1, 2, 3, 4], [1, 2])
def test_splits_have_correct_size(self):
splits = random_split([1, 2, 3, 4, 5, 6], [2, 4])
self.assertEqual(len(splits), 2)
self.assertEqual(len(splits[0]), 2)
self.assertEqual(len(splits[1]), 4)
def test_splits_are_mutually_exclusive(self):
data = [5, 2, 3, 4, 1, 6]
splits = random_split(data, [2, 4])
all_values = []
all_values.extend(list(splits[0]))
all_values.extend(list(splits[1]))
data.sort()
all_values.sort()
self.assertListEqual(data, all_values)
class TestTensorDataset(TestCase):
def test_len(self):
source = TensorDataset(torch.randn(15, 10, 2, 3, 4, 5), torch.randperm(15))
self.assertEqual(len(source), 15)
def test_getitem(self):
t = torch.randn(15, 10, 2, 3, 4, 5)
l = torch.randn(15, 10)
source = TensorDataset(t, l)
for i in range(15):
self.assertEqual(t[i], source[i][0])
self.assertEqual(l[i], source[i][1])
def test_getitem_1d(self):
t = torch.randn(15)
l = torch.randn(15)
source = TensorDataset(t, l)
for i in range(15):
self.assertEqual(t[i], source[i][0])
self.assertEqual(l[i], source[i][1])
class TestConcatDataset(TestCase):
def test_concat_two_singletons(self):
result = ConcatDataset([[0], [1]])
self.assertEqual(2, len(result))
self.assertEqual(0, result[0])
self.assertEqual(1, result[1])
def test_concat_two_non_singletons(self):
result = ConcatDataset([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
self.assertEqual(10, len(result))
self.assertEqual(0, result[0])
self.assertEqual(5, result[5])
def test_concat_two_non_singletons_with_empty(self):
# Adding an empty dataset somewhere is correctly handled
result = ConcatDataset([[0, 1, 2, 3, 4],
[],
[5, 6, 7, 8, 9]])
self.assertEqual(10, len(result))
self.assertEqual(0, result[0])
self.assertEqual(5, result[5])
def test_concat_raises_index_error(self):
result = ConcatDataset([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
with self.assertRaises(IndexError):
# this one goes to 11
result[11]
def test_add_dataset(self):
d1 = TensorDataset(torch.rand(7, 3, 28, 28), torch.rand(7))
d2 = TensorDataset(torch.rand(7, 3, 28, 28), torch.rand(7))
d3 = TensorDataset(torch.rand(7, 3, 28, 28), torch.rand(7))
result = d1 + d2 + d3
self.assertEqual(21, len(result))
self.assertEqual(0, (d1[0][0] - result[0][0]).abs().sum())
self.assertEqual(0, (d2[0][0] - result[7][0]).abs().sum())
self.assertEqual(0, (d3[0][0] - result[14][0]).abs().sum())
# Stores the first encountered exception in .exception.
# Inspired by https://stackoverflow.com/a/33599967
class ErrorTrackingProcess(multiprocessing.Process):
def __init__(self, *args, **kwargs):
super(ErrorTrackingProcess, self).__init__(*args, **kwargs)
self._pconn, self._cconn = multiprocessing.Pipe()
self._exception = None
def run(self):
# Disable stderr printing at the OS level so that workers don't print to stderr.
# We can't use sys.stderr.close(); otherwise Python `raise` will error with
# ValueError: I/O operation on closed file.
os.close(sys.stderr.fileno())
try:
super(ErrorTrackingProcess, self).run()
self._cconn.send(None)
except Exception as e:
self._cconn.send(ExceptionWrapper(sys.exc_info()))
raise
@property
def exception(self):
if self._pconn.poll():
self._exception = self._pconn.recv()
if self._exception is None:
return None
else:
return self._exception.exc_type(self._exception.exc_msg)
# ESRCH means that os.kill could not find a live process with that pid
def send_signal(self, signum, ignore_ESRCH=False):
try:
os.kill(self.pid, signum)
except OSError as e:
if not ignore_ESRCH or e.errno != errno.ESRCH:
raise
class ErrorDataset(Dataset):
def __init__(self, size):
self.size = size
def __len__(self):
return self.size
class SegfaultDataset(Dataset):
def __init__(self, size):
self.size = size
def __getitem__(self, idx):
return ctypes.string_at(0)
def __len__(self):
return self.size
class SleepDataset(Dataset):
def __init__(self, size, sleep_sec):
self.size = size
self.sleep_sec = sleep_sec
def __getitem__(self, idx):
time.sleep(self.sleep_sec)
return idx
def __len__(self):
return self.size
class SeedDataset(Dataset):
def __init__(self, size):
self.size = size
def __getitem__(self, idx):
return torch.initial_seed()
def __len__(self):
return self.size
# Inspired by https://stackoverflow.com/a/26703365
# This ensures that each worker processes at least one sample
class SynchronizedSeedDataset(Dataset):
def __init__(self, size, num_workers):
assert size >= num_workers
self.count = multiprocessing.Value('i', 0)
self.barrier = multiprocessing.Semaphore(0)
self.num_workers = num_workers
self.size = size
def __getitem__(self, idx):
self.count.value += 1
if self.count.value == self.num_workers:
self.barrier.release()
self.barrier.acquire()
self.barrier.release()
return torch.initial_seed()
def __len__(self):
return self.size
def _test_timeout():
dataset = SleepDataset(10, 10)
dataloader = DataLoader(dataset, batch_size=2, num_workers=2, timeout=1)
_ = next(iter(dataloader))
def _test_segfault():
dataset = SegfaultDataset(10)
dataloader = DataLoader(dataset, batch_size=2, num_workers=2)
_ = next(iter(dataloader))
# test custom init function
def init_fn(worker_id):
torch.manual_seed(12345)
class TestDataLoader(TestCase):
def setUp(self):
self.data = torch.randn(100, 2, 3, 5)
self.labels = torch.randperm(50).repeat(2)
self.dataset = TensorDataset(self.data, self.labels)
def _test_sequential(self, loader):
batch_size = loader.batch_size
for i, (sample, target) in enumerate(loader):
idx = i * batch_size
self.assertEqual(sample, self.data[idx:idx + batch_size])
self.assertEqual(target, self.labels[idx:idx + batch_size])
self.assertEqual(i, math.floor((len(self.dataset) - 1) / batch_size))
def _test_shuffle(self, loader):
found_data = {i: 0 for i in range(self.data.size(0))}
found_labels = {i: 0 for i in range(self.labels.size(0))}
batch_size = loader.batch_size
for i, (batch_samples, batch_targets) in enumerate(loader):
for sample, target in zip(batch_samples, batch_targets):
for data_point_idx, data_point in enumerate(self.data):
if data_point.eq(sample).all():
self.assertFalse(found_data[data_point_idx])
found_data[data_point_idx] += 1
break
self.assertEqual(target, self.labels[data_point_idx])
found_labels[data_point_idx] += 1
self.assertEqual(sum(found_data.values()), (i + 1) * batch_size)
self.assertEqual(sum(found_labels.values()), (i + 1) * batch_size)
self.assertEqual(i, math.floor((len(self.dataset) - 1) / batch_size))
def _test_error(self, loader):
it = iter(loader)
errors = 0
while True:
try:
next(it)
except NotImplementedError:
errors += 1
except StopIteration:
self.assertEqual(errors,
math.ceil(float(len(loader.dataset)) / loader.batch_size))
return
def test_sequential(self):
self._test_sequential(DataLoader(self.dataset))
def test_sequential_batch(self):
self._test_sequential(DataLoader(self.dataset, batch_size=2))
def test_growing_dataset(self):
dataset = [torch.ones(4) for _ in range(4)]
dataloader_seq = DataLoader(dataset, shuffle=False)
dataloader_shuffle = DataLoader(dataset, shuffle=True)
dataset.append(torch.ones(4))
self.assertEqual(len(dataloader_seq), 5)
self.assertEqual(len(dataloader_shuffle), 5)
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_sequential_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, pin_memory=True)
for input, target in loader:
self.assertTrue(input.is_pinned())
self.assertTrue(target.is_pinned())
def test_multiple_dataloaders(self):
loader1_it = iter(DataLoader(self.dataset, num_workers=1))
loader2_it = iter(DataLoader(self.dataset, num_workers=2))
next(loader1_it)
next(loader1_it)
next(loader2_it)
next(loader2_it)
next(loader1_it)
next(loader2_it)
@unittest.skipIf(True, "flaky test")
def test_segfault(self):
p = ErrorTrackingProcess(target=_test_segfault)
p.start()
p.join(JOIN_TIMEOUT)
try:
self.assertFalse(p.is_alive())
self.assertNotEqual(p.exitcode, 0)
if IS_WINDOWS:
self.assertIsInstance(p.exception, OSError)
self.assertRegex(str(p.exception), r'access violation reading ')
else:
self.assertIsInstance(p.exception, RuntimeError)
self.assertRegex(str(p.exception), r'DataLoader worker \(pid \d+\) is killed by signal: ')
finally:
p.terminate()
def test_timeout(self):
p = ErrorTrackingProcess(target=_test_timeout)
p.start()
p.join(JOIN_TIMEOUT)
try:
self.assertFalse(p.is_alive())
self.assertNotEqual(p.exitcode, 0)
self.assertIsInstance(p.exception, RuntimeError)
self.assertRegex(str(p.exception), r'DataLoader timed out after \d+ seconds')
finally:
p.terminate()
def test_worker_seed(self):
num_workers = 6
dataset = SynchronizedSeedDataset(num_workers, num_workers)
dataloader = DataLoader(dataset, batch_size=1, num_workers=num_workers)
seeds = set()
for batch in dataloader:
seeds.add(batch[0])
self.assertEqual(len(seeds), num_workers)
def test_worker_init_fn(self):
dataset = SeedDataset(4)
dataloader = DataLoader(dataset, batch_size=2, num_workers=2,
worker_init_fn=init_fn)
for batch in dataloader:
self.assertEqual(12345, batch[0])
self.assertEqual(12345, batch[1])
def test_shuffle(self):
self._test_shuffle(DataLoader(self.dataset, shuffle=True))
def test_shuffle_batch(self):
self._test_shuffle(DataLoader(self.dataset, batch_size=2, shuffle=True))
def test_sequential_workers(self):
self._test_sequential(DataLoader(self.dataset, num_workers=4))
def test_sequential_batch_workers(self):
self._test_sequential(DataLoader(self.dataset, batch_size=2, num_workers=4))
def test_shuffle_workers(self):
self._test_shuffle(DataLoader(self.dataset, shuffle=True, num_workers=4))
def test_shuffle_batch_workers(self):
self._test_shuffle(DataLoader(self.dataset, batch_size=2, shuffle=True, num_workers=4))
def _test_batch_sampler(self, **kwargs):
# [(0, 1), (2, 3, 4), (5, 6), (7, 8, 9), ...]
batches = []
for i in range(0, 100, 5):
batches.append(tuple(range(i, i + 2)))
batches.append(tuple(range(i + 2, i + 5)))
dl = DataLoader(self.dataset, batch_sampler=batches, **kwargs)
self.assertEqual(len(dl), 40)
for i, (input, _target) in enumerate(dl):
if i % 2 == 0:
offset = i * 5 // 2
self.assertEqual(len(input), 2)
self.assertEqual(input, self.data[offset:offset + 2])
else:
offset = i * 5 // 2
self.assertEqual(len(input), 3)
self.assertEqual(input, self.data[offset:offset + 3])
def test_batch_sampler(self):
self._test_batch_sampler()
self._test_batch_sampler(num_workers=4)
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_shuffle_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, shuffle=True, num_workers=4, pin_memory=True)
for input, target in loader:
self.assertTrue(input.is_pinned())
self.assertTrue(target.is_pinned())
@unittest.skipIf(not TEST_NUMPY, "numpy unavailable")
def test_numpy(self):
import numpy as np
class TestDataset(torch.utils.data.Dataset):
def __getitem__(self, i):
return np.ones((2, 3, 4)) * i
def __len__(self):
return 1000
loader = DataLoader(TestDataset(), batch_size=12)
batch = next(iter(loader))
self.assertIsInstance(batch, torch.DoubleTensor)
self.assertEqual(batch.size(), torch.Size([12, 2, 3, 4]))
def test_error(self):
self._test_error(DataLoader(ErrorDataset(100), batch_size=2, shuffle=True))
def test_error_workers(self):
self._test_error(DataLoader(ErrorDataset(41), batch_size=2, shuffle=True, num_workers=4))
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_partial_workers(self):
"check that workers exit even if the iterator is not exhausted"
loader = iter(DataLoader(self.dataset, batch_size=2, num_workers=4, pin_memory=True))
workers = loader.workers
worker_manager_thread = loader.worker_manager_thread
for i, sample in enumerate(loader):
if i == 3:
break
del loader
for w in workers:
w.join(JOIN_TIMEOUT)
self.assertFalse(w.is_alive(), 'subprocess not terminated')
self.assertEqual(w.exitcode, 0)
worker_manager_thread.join(JOIN_TIMEOUT)
self.assertFalse(worker_manager_thread.is_alive())
def test_len(self):
def check_len(dl, expected):
self.assertEqual(len(dl), expected)
n = 0
for sample in dl:
n += 1
self.assertEqual(n, expected)
check_len(self.dataset, 100)
check_len(DataLoader(self.dataset, batch_size=2), 50)
check_len(DataLoader(self.dataset, batch_size=3), 34)
@unittest.skipIf(not TEST_NUMPY, "numpy unavailable")
def test_numpy_scalars(self):
import numpy as np
class ScalarDataset(torch.utils.data.Dataset):
def __init__(self, dtype):
self.dtype = dtype
def __getitem__(self, i):
return self.dtype()
def __len__(self):
return 4
dtypes = {
np.float64: torch.DoubleTensor,
np.float32: torch.FloatTensor,
np.float16: torch.HalfTensor,
np.int64: torch.LongTensor,
np.int32: torch.IntTensor,
np.int16: torch.ShortTensor,
np.int8: torch.CharTensor,
np.uint8: torch.ByteTensor,
}
for dt, tt in dtypes.items():
dset = ScalarDataset(dt)
loader = DataLoader(dset, batch_size=2)
batch = next(iter(loader))
self.assertIsInstance(batch, tt)
@unittest.skipIf(not TEST_NUMPY, "numpy unavailable")
def test_default_collate_bad_numpy_types(self):
import numpy as np
# Should be a no-op
arr = np.array(['a', 'b', 'c'])
default_collate(arr)
arr = np.array([[['a', 'b', 'c']]])
self.assertRaises(TypeError, lambda: default_collate(arr))
arr = np.array([object(), object(), object()])
self.assertRaises(TypeError, lambda: default_collate(arr))
arr = np.array([[[object(), object(), object()]]])
self.assertRaises(TypeError, lambda: default_collate(arr))
class StringDataset(Dataset):
def __init__(self):
self.s = '12345'
def __len__(self):
return len(self.s)
def __getitem__(self, ndx):
return (self.s[ndx], ndx)
class TestStringDataLoader(TestCase):
def setUp(self):
self.dataset = StringDataset()
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_shuffle_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, shuffle=True, num_workers=4, pin_memory=True)
for batch_ndx, (s, n) in enumerate(loader):
self.assertIsInstance(s[0], str)
self.assertTrue(n.is_pinned())
class DictDataset(Dataset):
def __len__(self):
return 4
def __getitem__(self, ndx):
return {
'a_tensor': torch.Tensor(4, 2).fill_(ndx),
'another_dict': {
'a_number': ndx,
},
}
class TestDictDataLoader(TestCase):
def setUp(self):
self.dataset = DictDataset()
def test_sequential_batch(self):
loader = DataLoader(self.dataset, batch_size=2, shuffle=False)
batch_size = loader.batch_size
for i, sample in enumerate(loader):
idx = i * batch_size
self.assertEqual(set(sample.keys()), {'a_tensor', 'another_dict'})
self.assertEqual(set(sample['another_dict'].keys()), {'a_number'})
t = sample['a_tensor']
self.assertEqual(t.size(), torch.Size([batch_size, 4, 2]))
self.assertTrue((t[0] == idx).all())
self.assertTrue((t[1] == idx + 1).all())
n = sample['another_dict']['a_number']
self.assertEqual(n.size(), torch.Size([batch_size]))
self.assertEqual(n[0], idx)
self.assertEqual(n[1], idx + 1)
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, pin_memory=True)
for batch_ndx, sample in enumerate(loader):
self.assertTrue(sample['a_tensor'].is_pinned())
self.assertTrue(sample['another_dict']['a_number'].is_pinned())
if __name__ == '__main__':
run_tests()
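
A minimal sketch of the worker_init_fn mechanism that test_worker_init_fn above exercises; the dataset and seed value are illustrative:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Every worker process runs init_fn once before producing batches.
def init_fn(worker_id):
    torch.manual_seed(12345)  # as in the test's init_fn; the value is arbitrary

if __name__ == '__main__':
    dataset = TensorDataset(torch.randn(8, 3), torch.randperm(8))
    loader = DataLoader(dataset, batch_size=2, num_workers=2, worker_init_fn=init_fn)
    for batch, labels in loader:
        pass  # batches come from workers whose RNG was seeded by init_fn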

test/test_distributed.py Normal file

@ -0,0 +1,579 @@
import fcntl
import multiprocessing
import os
import sys
import time
import unittest
from functools import wraps, reduce
from contextlib import contextmanager
import torch
import torch.cuda
import torch.distributed as dist
from common import TestCase
BACKEND = os.environ['BACKEND']
TEMP_DIR = os.environ['TEMP_DIR']
INIT_METHOD = os.getenv('INIT_METHOD', 'env://')
MASTER_PORT = '29500'
MASTER_ADDR = '127.0.0.1'
if not dist.is_available():
print('Distributed not available, skipping tests')
sys.exit(0)
SKIP_IF_NO_CUDA_EXIT_CODE = 75
def skip_if_no_cuda_distributed(func):
func.skip_if_no_cuda_distributed = True
@wraps(func)
def wrapper(*args, **kwargs):
if not torch.cuda.is_available():
sys.exit(SKIP_IF_NO_CUDA_EXIT_CODE)
return func(*args, **kwargs)
return wrapper
@contextmanager
def _lock():
lockfile = os.path.join(TEMP_DIR, 'lockfile')
with open(lockfile, 'w') as lf:
try:
fcntl.flock(lf.fileno(), fcntl.LOCK_EX)
yield
finally:
fcntl.flock(lf.fileno(), fcntl.LOCK_UN)
lf.close()
def _build_tensor(size, value=None):
if value is None:
value = size
return torch.FloatTensor(size, size, size).fill_(value)
class Barrier(object):
barrier_id = 0
@classmethod
def init(cls):
cls.barrier_id = 0
barrier_dir = os.path.join(TEMP_DIR, 'barrier')
for f_name in os.listdir(barrier_dir):
os.unlink(os.path.join(barrier_dir, f_name))
@classmethod
def sync(cls, timeout=5):
cls.barrier_id += 1
barrier_dir = os.path.join(TEMP_DIR, 'barrier')
pid = str(os.getpid())
barrier_file = os.path.join(barrier_dir, pid)
with _lock():
with open(barrier_file, 'w') as f:
f.write(str(cls.barrier_id))
start_time = time.time()
while True:
arrived = 0
with _lock():
for f_name in os.listdir(barrier_dir):
with open(os.path.join(barrier_dir, f_name), 'r') as f:
data = f.read()
if int(data) >= cls.barrier_id:
arrived += 1
if arrived == dist.get_world_size():
break
if time.time() - start_time > timeout:
raise RuntimeError("barrier timeout")
time.sleep(0.1)
class _DistTestBase(object):
def _barrier(self, *args, **kwargs):
Barrier.sync(*args, **kwargs)
def _init_group_test(self):
group = [1, 2]
group_id = dist.new_group(group)
rank = dist.get_rank()
if rank not in group:
return ([], None, rank)
return (group, group_id, rank)
def _init_global_test(self):
group = [i for i in range(0, dist.get_world_size())]
group_id = dist.group.WORLD
rank = dist.get_rank()
return (group, group_id, rank)
# GET RANK
def test_get_rank(self):
test_dir = os.path.join(TEMP_DIR, 'test_dir')
pid = str(os.getpid())
num_processes = dist.get_world_size()
with open(os.path.join(test_dir, pid), 'w') as f:
f.write(str(dist.get_rank()))
self._barrier()
all_ranks = set()
for f_name in os.listdir(test_dir):
with open(os.path.join(test_dir, f_name), 'r') as f:
all_ranks.add(int(f.read()))
self.assertEqual(len(all_ranks), num_processes)
self._barrier()
if dist.get_rank() == 0:
for f_name in os.listdir(test_dir):
os.unlink(os.path.join(test_dir, f_name))
self._barrier()
# SEND RECV
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support send/recv")
def test_send_recv(self):
rank = dist.get_rank()
tensor = _build_tensor(rank + 1)
for dest in range(0, dist.get_world_size()):
if dest == rank:
continue
dist.send(tensor, dest)
for src in range(0, dist.get_world_size()):
if src == rank:
continue
tensor = _build_tensor(src + 1, value=-1)
expected_tensor = _build_tensor(src + 1)
dist.recv(tensor, src)
self.assertEqual(tensor, expected_tensor)
self._barrier()
# SEND RECV ANY SOURCE
@unittest.skipIf(BACKEND == 'gloo',
"Gloo does not support send/recv from any source")
def test_send_recv_any_source(self):
rank = dist.get_rank()
tensor = _build_tensor(10, rank)
for dest in range(0, dist.get_world_size()):
if dest == rank:
continue
dist.send(tensor, dest)
recv_ranks = set()
for src in range(0, dist.get_world_size()):
if src == rank:
continue
tensor = _build_tensor(10, value=-1)
sender = dist.recv(tensor)
self.assertTrue(tensor.eq(sender).all())
recv_ranks.add(sender)
self.assertEqual(len(recv_ranks), dist.get_world_size() - 1)
self._barrier()
# ISEND
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support isend")
def test_isend(self):
rank = dist.get_rank()
world_size = dist.get_world_size()
if rank == 0:
requests = [
dist.isend(_build_tensor(dest, 10), dest) for dest in range(1, world_size)
]
for request in requests:
request.wait()
self.assertTrue(request.is_completed())
else:
tensor = _build_tensor(rank, -1)
dist.recv(tensor, 0)
self.assertEqual(tensor, _build_tensor(rank, 10))
self._barrier()
# IRECV
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support irecv")
def test_irecv(self):
rank = dist.get_rank()
world_size = dist.get_world_size()
if rank == 0:
expected_tensors = [_build_tensor(src, -1) for src in range(1, world_size)]
requests = [
dist.irecv(expected_tensors[src - 1], src) for src in range(1, world_size)
]
for src in range(1, world_size):
requests[src - 1].wait()
self.assertTrue(requests[src - 1].is_completed())
self.assertEqual(expected_tensors[src - 1], _build_tensor(src, 10))
else:
tensor = _build_tensor(rank, 10)
dist.send(tensor, 0)
self._barrier()
# BROADCAST
def _test_broadcast_helper(self, group, group_id, rank, cuda=False):
for src in group:
expected_tensor = _build_tensor(src + 1)
if cuda:
expected_tensor = expected_tensor.cuda()
if rank == src:
dist.broadcast(expected_tensor, src, group_id)
else:
tensor = _build_tensor(src + 1, -1)
if cuda:
tensor = tensor.cuda()
dist.broadcast(tensor, src, group_id)
self.assertEqual(tensor, expected_tensor)
self._barrier()
def test_broadcast(self):
group, group_id, rank = self._init_global_test()
self._test_broadcast_helper(group, group_id, rank)
@unittest.skipIf(BACKEND != 'gloo', "Only Gloo backend supports CUDA allReduce")
@skip_if_no_cuda_distributed
def test_broadcast_cuda(self):
group, group_id, rank = self._init_global_test()
self._test_broadcast_helper(group, group_id, rank, True)
def test_broadcast_group(self):
group, group_id, rank = self._init_group_test()
self._test_broadcast_helper(group, group_id, rank)
# REDUCE
def _test_reduce_helper(self, group, group_id, rank, op, master_value, worker_value, expected_value):
for src in group:
if rank == src:
tensor = _build_tensor(src + 1).fill_(master_value)
dist.reduce(tensor, src, op, group_id)
self.assertEqual(tensor, _build_tensor(src + 1, expected_value))
else:
tensor = _build_tensor(src + 1).fill_(worker_value)
dist.reduce(tensor, src, op, group_id)
self._barrier()
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_sum(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_product(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.PRODUCT,
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_min(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_max(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MAX, -1, 10, 10
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_sum(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_product(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.PRODUCT,
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_min(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_max(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MAX, -1, 10, 10
)
# ALL REDUCE
def _test_all_reduce_helper(self, group, group_id, rank, op, master_value,
worker_value, expected_value, cuda=False):
for src in group:
if rank == src:
tensor = _build_tensor(src + 1).fill_(master_value)
if cuda:
tensor = tensor.cuda()
dist.all_reduce(tensor, op, group_id)
self.assertEqual(tensor, _build_tensor(src + 1, expected_value))
else:
tensor = _build_tensor(src + 1).fill_(worker_value)
if cuda:
tensor = tensor.cuda()
dist.all_reduce(tensor, op, group_id)
self.assertEqual(tensor, _build_tensor(src + 1, expected_value))
self._barrier()
def test_all_reduce_sum(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
@unittest.skipIf(BACKEND != 'gloo', "Only Gloo backend supports CUDA allReduce")
@skip_if_no_cuda_distributed
def test_all_reduce_sum_cuda(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1)), True
)
def test_all_reduce_product(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.PRODUCT,
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
def test_all_reduce_min(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
def test_all_reduce_max(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.MAX, -1, 10, 10
)
def test_all_reduce_group_sum(self):
group, group_id, rank = self._init_group_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
def test_all_reduce_group_product(self):
group, group_id, rank = self._init_group_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.PRODUCT,
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
def test_all_reduce_group_min(self):
group, group_id, rank = self._init_group_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
def test_all_reduce_group_max(self):
group, group_id, rank = self._init_group_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.MAX, -1, 10, 10
)
# SCATTER
def _test_scatter_helper(self, group, group_id, rank):
for dest in group:
tensor = _build_tensor(dest + 1, -1)
expected_tensor = _build_tensor(dest + 1, rank)
tensors = [_build_tensor(dest + 1, i) for i in group] if rank == dest else []
dist.scatter(tensor, src=dest, scatter_list=tensors, group=group_id)
self.assertEqual(tensor, expected_tensor)
self._barrier()
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support scatter")
def test_scatter(self):
group, group_id, rank = self._init_global_test()
self._test_scatter_helper(group, group_id, rank)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support scatter")
def test_scatter_group(self):
group, group_id, rank = self._init_group_test()
self._test_scatter_helper(group, group_id, rank)
# GATHER
def _test_gather_helper(self, group, group_id, rank):
for dest in group:
tensor = _build_tensor(dest + 1, rank)
tensors = [_build_tensor(dest + 1, -1) for i in group] if rank == dest else []
dist.gather(tensor, dst=dest, gather_list=tensors, group=group_id)
if rank == dest:
expected_tensors = [_build_tensor(dest + 1, i) for i in group]
for t1, t2 in zip(tensors, expected_tensors):
self.assertEqual(t1, t2)
self._barrier()
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support gather")
def test_gather(self):
group, group_id, rank = self._init_global_test()
self._test_gather_helper(group, group_id, rank)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support gather")
def test_gather_group(self):
group, group_id, rank = self._init_group_test()
self._test_gather_helper(group, group_id, rank)
# ALL GATHER
def _test_all_gather_helper(self, group, group_id, rank):
for dest in group:
tensor = _build_tensor(dest + 1, rank)
tensors = [_build_tensor(dest + 1, -1) for i in group]
dist.all_gather(tensors, tensor, group_id)
expected_tensors = [_build_tensor(dest + 1, i) for i in group]
for t1, t2 in zip(tensors, expected_tensors):
self.assertEqual(t1, t2)
self._barrier()
def test_all_gather(self):
group, group_id, rank = self._init_global_test()
self._test_all_gather_helper(group, group_id, rank)
def test_all_gather_group(self):
group, group_id, rank = self._init_group_test()
self._test_all_gather_helper(group, group_id, rank)
# BARRIER
def _test_barrier_helper(self, group, group_id, rank):
WAIT_TIME = 0.3 # seconds
for dest in group:
expected_time = torch.DoubleTensor(1).fill_(0.0)
if dest == rank:
expected_time.fill_(time.time() + WAIT_TIME)
dist.broadcast(expected_time, dest, group_id)
time.sleep(WAIT_TIME + 0.1) # sleep a little bit longer
dist.barrier(group_id)
else:
dist.broadcast(expected_time, dest, group_id)
dist.barrier(group_id)
self.assertGreaterEqual(time.time(), expected_time[0])
self._barrier()
def test_barrier(self):
group, group_id, rank = self._init_global_test()
self._test_barrier_helper(group, group_id, rank)
def test_barrier_group(self):
group, group_id, rank = self._init_group_test()
self._test_barrier_helper(group, group_id, rank)
if BACKEND == 'tcp' or BACKEND == 'gloo':
WORLD_SIZE = os.environ['WORLD_SIZE']
class TestTCPOrGloo(TestCase, _DistTestBase):
MANAGER_PROCESS_RANK = -1
JOIN_TIMEOUT = 10
@staticmethod
def manager_join(fn):
@wraps(fn)
def wrapper(self):
if self.rank == self.MANAGER_PROCESS_RANK:
self._join_and_reduce(fn)
else:
fn(self)
return wrapper
@classmethod
def setUpClass(cls):
os.environ['MASTER_ADDR'] = MASTER_ADDR
os.environ['MASTER_PORT'] = MASTER_PORT
os.environ['WORLD_SIZE'] = WORLD_SIZE
for attr in dir(cls):
if attr.startswith('test'):
fn = getattr(cls, attr)
setattr(cls, attr, cls.manager_join(fn))
def setUp(self):
self.processes = []
self.rank = self.MANAGER_PROCESS_RANK
Barrier.init()
for rank in range(int(WORLD_SIZE)):
self.processes.append(self._spawn_process(rank))
def tearDown(self):
for p in self.processes:
p.terminate()
def _spawn_process(self, rank):
os.environ['RANK'] = str(rank)
name = 'process ' + str(rank)
process = multiprocessing.Process(target=self._run, name=name,
args=(rank,))
process.start()
return process
def _run(self, rank):
self.rank = rank
try:
dist.init_process_group(init_method=INIT_METHOD, backend=BACKEND, world_size=int(WORLD_SIZE))
except RuntimeError as e:
if 'recompile' in e.args[0]:
sys.exit(0)
# self.id() == e.g. '__main__.TestDistributed.test_get_rank'
# We're retrieving the corresponding test and executing it.
getattr(self, self.id().split(".")[2])()
sys.exit(0)
def _join_and_reduce(self, fn):
skip_ok = getattr(fn, "skip_if_no_cuda_distributed", False)
for p in self.processes:
p.join(self.JOIN_TIMEOUT)
if not skip_ok:
self.assertEqual(p.exitcode, 0)
if skip_ok:
first_process = self.processes[0]
# do this first so we don't give an error message about mismatched exit codes if the first isn't valid
assert first_process.exitcode == 0 or first_process.exitcode == SKIP_IF_NO_CUDA_EXIT_CODE
for p in self.processes:
self.assertEqual(p.exitcode, first_process.exitcode)
if first_process.exitcode == SKIP_IF_NO_CUDA_EXIT_CODE:
raise unittest.SkipTest("cuda is not available")
elif BACKEND == 'mpi':
dist.init_process_group(init_method=INIT_METHOD, backend='mpi')
class TestMPI(TestCase, _DistTestBase):
pass
if __name__ == '__main__':
unittest.main()
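
The classes above are launched once per rank by run_test.sh, with BACKEND, WORLD_SIZE, INIT_METHOD and RANK in the environment. A minimal single-rank sketch of the same init-and-collective flow (WORLD_SIZE=1 here only so the sketch completes standalone; the real tests use 3 ranks, and all values are illustrative):

import os
import torch
import torch.distributed as dist

# Mirror the env:// setup performed by setUpClass/_spawn_process above.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
os.environ.setdefault('WORLD_SIZE', '1')
os.environ.setdefault('RANK', '0')

dist.init_process_group(init_method='env://', backend='tcp',
                        world_size=int(os.environ['WORLD_SIZE']))

tensor = torch.FloatTensor(2, 2).fill_(dist.get_rank() + 1)
dist.all_reduce(tensor, op=dist.reduce_op.SUM)  # sums the tensor across ranks
print(dist.get_rank(), tensor)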


@ -0,0 +1,94 @@
from common import TestCase, run_tests
import math
import torch
from torch.autograd import Variable, gradcheck
from torch.distributions import Bernoulli, Categorical, Normal
class TestDistributions(TestCase):
def _gradcheck_log_prob(self, dist_ctor, ctor_params):
# performs gradient checks on log_prob
distribution = dist_ctor(*ctor_params)
s = distribution.sample()
self.assertEqual(s.size(), distribution.log_prob(s).size())
def apply_fn(*params):
return dist_ctor(*params).log_prob(s)
gradcheck(apply_fn, ctor_params, raise_exception=True)
def _check_log_prob(self, dist, assert_fn):
# checks that the log_prob matches a reference function
s = dist.sample()
log_probs = dist.log_prob(s)
for i, (val, log_prob) in enumerate(zip(s.data.view(-1), log_probs.data.view(-1))):
assert_fn(i, val, log_prob)
def test_bernoulli(self):
p = Variable(torch.Tensor([0.7, 0.2, 0.4]), requires_grad=True)
r = Variable(torch.Tensor([0.3]), requires_grad=True)
self.assertEqual(Bernoulli(p).sample_n(8).size(), (8, 3))
self.assertEqual(Bernoulli(r).sample_n(8).size(), (8, 1))
self.assertEqual(Bernoulli(r).sample().size(), (1,))
self._gradcheck_log_prob(Bernoulli, (p,))
def ref_log_prob(idx, val, log_prob):
prob = p.data[idx]
self.assertEqual(log_prob, math.log(prob if val else 1 - prob))
self._check_log_prob(Bernoulli(p), ref_log_prob)
def test_bernoulli_3d(self):
p = Variable(torch.Tensor(2, 3, 5).fill_(0.5), requires_grad=True)
self.assertEqual(Bernoulli(p).sample().size(), (2, 3, 5))
self.assertEqual(Bernoulli(p).sample_n(2).size(), (2, 2, 3, 5))
def test_multinomial_1d(self):
p = Variable(torch.Tensor([0.1, 0.2, 0.3]), requires_grad=True)
# TODO: this should return a 0-dim tensor once we have Scalar support
self.assertEqual(Categorical(p).sample().size(), (1,))
self.assertEqual(Categorical(p).sample_n(1).size(), (1, 1))
self._gradcheck_log_prob(Categorical, (p,))
def test_multinomial_2d(self):
probabilities = [[0.1, 0.2, 0.3], [0.5, 0.3, 0.2]]
p = Variable(torch.Tensor(probabilities), requires_grad=True)
self.assertEqual(Categorical(p).sample().size(), (2,))
self.assertEqual(Categorical(p).sample_n(6).size(), (6, 2))
self._gradcheck_log_prob(Categorical, (p,))
def ref_log_prob(idx, val, log_prob):
sample_prob = p.data[idx][val] / p.data[idx].sum()
self.assertEqual(log_prob, math.log(sample_prob))
self._check_log_prob(Categorical(p), ref_log_prob)
def test_normal(self):
mean = Variable(torch.randn(5, 5), requires_grad=True)
std = Variable(torch.randn(5, 5).abs(), requires_grad=True)
mean_1d = Variable(torch.randn(1), requires_grad=True)
std_1d = Variable(torch.randn(1), requires_grad=True)
self.assertEqual(Normal(mean, std).sample().size(), (5, 5))
self.assertEqual(Normal(mean, std).sample_n(7).size(), (7, 5, 5))
self.assertEqual(Normal(mean_1d, std_1d).sample_n(1).size(), (1, 1))
self.assertEqual(Normal(mean_1d, std_1d).sample().size(), (1,))
self.assertEqual(Normal(0.2, .6).sample_n(1).size(), (1, 1))
self.assertEqual(Normal(-0.7, 50.0).sample_n(1).size(), (1, 1))
self._gradcheck_log_prob(Normal, (mean, std))
self._gradcheck_log_prob(Normal, (mean, 1.0))
self._gradcheck_log_prob(Normal, (0.0, std))
def ref_log_prob(idx, x, log_prob):
m = mean.data.view(-1)[idx]
s = std.data.view(-1)[idx]
expected = (math.exp(-(x - m) ** 2 / (2 * s ** 2)) /
math.sqrt(2 * math.pi * s ** 2))
self.assertAlmostEqual(log_prob, math.log(expected), places=3)
self._check_log_prob(Normal(mean, std), ref_log_prob)
if __name__ == '__main__':
run_tests()
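
The ref_log_prob helper above compares Normal(mean, std).log_prob(x) against a Gaussian density computed by hand. A minimal worked version of that comparison, with arbitrary values:

import math
import torch
from torch.autograd import Variable
from torch.distributions import Normal

# Compute log N(x | m, s) by hand and compare with Normal.log_prob.
mean = Variable(torch.Tensor([0.5]))
std = Variable(torch.Tensor([1.5]))
dist = Normal(mean, std)

x = dist.sample()
log_prob = dist.log_prob(x)

m, s, v = mean.data[0], std.data[0], x.data[0]
expected = math.log(math.exp(-(v - m) ** 2 / (2 * s ** 2)) / math.sqrt(2 * math.pi * s ** 2))
print(log_prob.data[0], expected)  # should agree to a few decimal places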

test/test_jit.py Normal file

@ -0,0 +1,777 @@
import torch
import torch.jit
import torch.nn as nn
import torch.nn.functional as F
import unittest
from itertools import product
from torch.autograd import Variable, Function
from torch.autograd.function import traceable
from common import TestCase, run_tests
import io
try:
import torchvision
HAS_TORCHVISION = True
except ImportError:
HAS_TORCHVISION = False
RUN_CUDA = torch.cuda.is_available()
if torch.cuda.is_available():
CUDA_VERSION = torch._C._cuda_getCompiledVersion()
for d in range(torch.cuda.device_count()):
major = torch.cuda.get_device_capability(d)[0]
if (CUDA_VERSION < 8000 and major >= 6) or (CUDA_VERSION < 9000 and major >= 7):
RUN_CUDA = False
skipIfNoTorchVision = unittest.skipIf(not HAS_TORCHVISION, "no torchvision")
def LSTMCell(input, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
hx, cx = hidden
gates = F.linear(input, w_ih, b_ih) + F.linear(hx, w_hh, b_hh)
ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
ingate = F.sigmoid(ingate)
forgetgate = F.sigmoid(forgetgate)
cellgate = F.tanh(cellgate)
outgate = F.sigmoid(outgate)
cy = (forgetgate * cx) + (ingate * cellgate)
hy = outgate * F.tanh(cy)
return hy, cy
def LSTMCellC(*args, **kwargs):
hy, cy = LSTMCell(*args, **kwargs)
return torch.cat((hy, cy))
class TestJit(TestCase):
maxDiff = None
def assertExpectedTrace(self, trace, *args, **kwargs):
torch._C._jit_pass_lint(trace)
torch._C._jit_pass_dce(trace)
torch._C._jit_pass_lint(trace)
self.assertExpected(str(trace), *args, **kwargs)
def test_simple(self):
x = Variable(torch.Tensor([0.4]), requires_grad=True)
y = Variable(torch.Tensor([0.7]), requires_grad=True)
def f(x, y):
return torch.sigmoid(torch.tanh(x * (x + y)))
trace, z = torch.jit.trace(f, (x, y), nderivs=0)
torch._C._jit_pass_lint(trace)
self.assertExpected(str(trace))
def test_scopes(self):
x = Variable(torch.Tensor([0.4]), requires_grad=True)
y = Variable(torch.Tensor([0.7]), requires_grad=True)
def f(x, y):
out = x + y
with torch.jit.scope('Foo', out):
out = x * out
with torch.jit.scope('Bar', out):
out = torch.tanh(out)
out = torch.sigmoid(out)
return out
trace, z = torch.jit.trace(f, (x, y), nderivs=0)
torch._C._jit_pass_lint(trace)
self.assertExpected(str(trace))
def test_scopes_intermediate_node(self):
class Net(nn.Module):
def forward(self, x):
return F.log_softmax(x, dim=0)
net = Net()
t = Variable(torch.ones(2), requires_grad=True)
trace, _ = torch.jit.trace(net, (t, ))
torch.onnx._optimize_trace(trace)
self.assertExpectedTrace(trace)
def test_scopes_identity_node(self):
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
)
def forward(self, x):
x = self.features(x)
return x
model = Net()
t = Variable(torch.ones(1, 3, 227, 227), requires_grad=True)
with torch.onnx.set_training(model, False):
trace, _ = torch.jit.trace(model, (t, ))
torch.onnx._optimize_trace(trace)
self.assertExpectedTrace(trace)
@unittest.skipIf(not RUN_CUDA, "fuser requires CUDA")
def test_lstm_fusion(self):
input = Variable(torch.randn(3, 10).cuda())
hx = Variable(torch.randn(3, 20).cuda())
cx = Variable(torch.randn(3, 20).cuda())
module = nn.LSTMCell(10, 20).cuda() # Just to allocate weights with correct sizes
trace, _ = torch.jit.trace(LSTMCell, (input, (hx, cx)) + tuple(module.parameters()))
torch._C._jit_pass_lint(trace)
torch._C._jit_pass_fuse(trace)
torch._C._jit_pass_lint(trace)
self.assertExpected(str(trace))
@unittest.skipIf(not RUN_CUDA, "fuser requires CUDA")
def test_run_lstm_fusion(self):
input = Variable(torch.randn(3, 10).cuda())
hx = Variable(torch.randn(3, 20).cuda())
cx = Variable(torch.randn(3, 20).cuda())
module = nn.LSTMCell(10, 20).cuda() # Just to allocate weights with correct sizes
CompiledLSTMCell = torch.jit.compile(nderivs=0)(LSTMCell)
z = CompiledLSTMCell(input, (hx, cx), *module.parameters())
z2 = CompiledLSTMCell(input, (hx, cx), *module.parameters(), _assert_compiled=True)
self.assertEqual(z, z2)
@unittest.skipIf(not RUN_CUDA, "fuser requires CUDA")
def test_run_lstm_fusion_concat(self):
input = Variable(torch.randn(3, 10).cuda())
hx = Variable(torch.randn(3, 20).cuda())
cx = Variable(torch.randn(3, 20).cuda())
module = nn.LSTMCell(10, 20).cuda() # Just to allocate weights with correct sizes
CompiledLSTMCell = torch.jit.compile(nderivs=0)(LSTMCellC)
z = CompiledLSTMCell(input, (hx, cx), *module.parameters())
z2 = CompiledLSTMCell(input, (hx, cx), *module.parameters(), _assert_compiled=True)
self.assertEqual(z, z2)
@unittest.skipIf(not RUN_CUDA, "fuser requires CUDA")
def test_concat_fusion(self):
hx = Variable(torch.randn(3, 20).cuda())
cx = Variable(torch.randn(3, 20).cuda())
def Foo(hx, cx):
return torch.cat((hx + cx, hx * cx))
trace, _ = torch.jit.trace(Foo, (hx, cx))
torch._C._jit_pass_lint(trace)
torch._C._jit_pass_fuse(trace)
torch._C._jit_pass_lint(trace)
self.assertExpected(str(trace))
@unittest.skipIf(not RUN_CUDA, "fuser requires CUDA")
def test_fusion_distribute(self):
def f(x, y):
z1, z2 = (x + y).chunk(2, dim=1)
return z1 * z2
x = Variable(torch.randn(4, 4).cuda())
y = Variable(torch.randn(4, 4).cuda())
trace, _ = torch.jit.trace(f, (x, y), nderivs=0)
torch._C._jit_pass_lint(trace)
self.assertExpected(str(trace), 'raw')
torch._C._jit_pass_fuse(trace)
torch._C._jit_pass_lint(trace)
self.assertExpected(str(trace))
def test_cse(self):
x = Variable(torch.Tensor([0.4, 0.3]), requires_grad=True)
y = Variable(torch.Tensor([0.7, 0.5]), requires_grad=True)
trace = torch._C._tracer_enter((x, y), 0)
w = (x + y) * (x + y) * (x + y)
t = torch.tanh(w) + torch.tanh(w)
z = (x + y) * (x + y) * (x + y) + t
torch._C._tracer_exit((z,))
torch._C._jit_pass_lint(trace)
torch._C._jit_pass_cse(trace)
self.assertExpected(str(trace))
def test_compile_run_twice(self):
x = Variable(torch.Tensor([0.4]), requires_grad=True)
y = Variable(torch.Tensor([0.7]), requires_grad=True)
@torch.jit.compile(nderivs=0, optimize=False)
def doit(x, y):
return torch.sigmoid(torch.tanh(x * (x + y)))
z = doit(x, y)
z2 = doit(x, y, _assert_compiled=True)
self.assertEqual(z, torch.sigmoid(torch.tanh(x * (x + y))))
self.assertEqual(z, z2)
@unittest.skipIf(not RUN_CUDA, "fuser requires CUDA")
def test_compile_addc(self):
x = Variable(torch.Tensor([0.4]), requires_grad=True).cuda()
y = Variable(torch.Tensor([0.7]), requires_grad=True).cuda()
@torch.jit.compile(nderivs=0)
def doit(x, y):
return torch.sigmoid(torch.tanh(x * (x + y) + 1))
z = doit(x, y)
z2 = doit(x, y, _assert_compiled=True)
self.assertEqual(z, torch.sigmoid(torch.tanh(x * (x + y) + 1)))
self.assertEqual(z, z2)
def test_traced_function(self):
x = Variable(torch.Tensor([0.4]), requires_grad=True)
y = Variable(torch.Tensor([0.7]), requires_grad=True)
@torch.jit.compile(nderivs=0)
def doit(x, y):
return torch.sigmoid(torch.tanh(x * (x + y)))
z = doit(x, y)
z2 = doit(x, y, _assert_compiled=True)
self.assertEqual(z, torch.sigmoid(torch.tanh(x * (x + y))))
self.assertEqual(z, z2)
def test_disabled_traced_function(self):
x = Variable(torch.Tensor([0.4]), requires_grad=True)
y = Variable(torch.Tensor([0.7]), requires_grad=True)
@torch.jit.compile(enabled=False)
def doit(x, y):
return torch.sigmoid(torch.tanh(x * (x + y)))
z = doit(x, y)
z2 = doit(x, y)
self.assertEqual(z, torch.sigmoid(torch.tanh(x * (x + y))))
self.assertEqual(z, z2)
def test_assign_traces(self):
"""Check that output Variables are assigned traces before they are saved."""
@traceable
class MyFn(Function):
@staticmethod
def forward(ctx, a):
out = a * 2
ctx.save_for_backward(out)
return out
@staticmethod
def backward(ctx, grad_a):
a, = ctx.saved_variables
return a * grad_a
x = Variable(torch.randn(10, 10), requires_grad=True)
trace, out = torch.jit.trace(MyFn.apply, x, nderivs=1)
out.sum().backward()
torch._C._jit_pass_dce(trace)
self.assertExpected(str(trace))
def test_traced_module(self):
input = Variable(torch.randn(3, 10))
hx = Variable(torch.randn(3, 20))
cx = Variable(torch.randn(3, 20))
@torch.jit.compile(nderivs=0)
class MyLSTMCell(nn.LSTMCell):
pass
lstm = MyLSTMCell(10, 20)
out = lstm(input, (hx, cx))
out2 = lstm(input, (hx, cx), _assert_compiled=True)
self.assertEqual(out, out2)
def test_autograd_closure(self):
x = Variable(torch.Tensor([0.4]), requires_grad=True)
y = Variable(torch.Tensor([0.7]), requires_grad=True)
trace = torch._C._tracer_enter((x, y), 1)
z = torch.sigmoid(x * (x + y))
w = torch.abs(x * x * x + y) + Variable(torch.ones(1))
torch._C._tracer_exit((z, w))
torch._C._jit_pass_lint(trace)
(z * w).backward()
torch._C._jit_pass_dce(trace)
torch._C._jit_pass_lint(trace)
x_grad = x.grad.data.clone()
x.grad.data.zero_()
function = torch._C._jit_createAutogradClosure(trace)
torch._C._jit_pass_lint(trace)
z2, w2 = function()(x, y)
(z2 * w2).backward()
self.assertEqual(z, z2)
self.assertEqual(w, w2)
self.assertEqual(x.grad.data, x_grad)
def test_verify(self):
x = Variable(torch.Tensor([0.4]), requires_grad=True)
y = Variable(torch.Tensor([0.7]), requires_grad=True)
@torch.jit.compile
def f(x, y):
z = torch.sigmoid(x * (x + y))
w = torch.abs(x * x * x + y) + Variable(torch.ones(1))
return z, w
torch.jit.verify(f, (x, y), loss_fn=lambda z, w: z * w, devices=[])
def test_constant(self):
x = Variable(torch.randn(2, 2), requires_grad=True)
trace = torch._C._tracer_enter((x,), 0)
y = Variable(torch.diag(torch.Tensor([2, 2])))
z = x.matmul(y)
torch._C._tracer_exit((z,))
function = torch._C._jit_createAutogradClosure(trace)
z2 = function()(x)
self.assertEqual(z, z2)
y.data.fill_(1000) # make sure the data has been cloned
x2 = Variable(torch.ones(2, 2) * 2, requires_grad=True)
z3 = function()(x2)
self.assertEqual(z3.data, torch.ones(2, 2) * 4)
def test_c_function(self):
x = Variable(torch.randn(1, 3, 10, 10))
m = nn.Conv2d(3, 8, 3, 1)
trace = torch._C._tracer_enter((x,) + tuple(m.parameters()), 0)
y = m(x)
torch._C._tracer_exit((y,))
self.assertExpected(str(trace))
def test_legacy_fail(self):
class MyLegacyFn(Function):
def forward(self, x):
return x
def backward(self, grad_output):
return grad_output
x = Variable(torch.Tensor([0]), requires_grad=True)
trace = torch._C._tracer_enter((x,), 0)
self.assertRaisesRegex(RuntimeError, "MyLegacyFn", lambda: MyLegacyFn()(x))
torch._C._tracer_exit((x,))
def test_inplace_transplant(self):
x = Variable(torch.Tensor([0]), requires_grad=True)
trace = torch._C._tracer_enter((x,), 0)
y = x.clone()
y.add_(2)
y.add_(3)
torch._C._tracer_exit((y,))
self.assertExpected(str(trace))
def test_inplace_flags(self):
class InplaceFn(Function):
@staticmethod
def forward(ctx, x):
ctx.mark_dirty(x)
return x.add_(1)
@staticmethod
def backward(ctx, go):
return go
class RegularFn(Function):
@staticmethod
def forward(ctx, x):
return x.add(1)
@staticmethod
def backward(ctx, go):
return go
x = Variable(torch.Tensor([0]), requires_grad=True)
trace = torch._C._tracer_enter((x,), 0)
y = RegularFn.apply(x)
y = InplaceFn.apply(y)
y = InplaceFn.apply(y)
y = RegularFn.apply(y)
torch._C._tracer_exit((y,))
ops = [n for n in trace.graph().nodes() if n.kind() != 'Select']
for op in ops:
self.assertTrue(op.hasAttribute('inplace'))
inplace_flags = [False, True, True, False]
for op, is_inplace in zip(ops, inplace_flags):
self.assertEqual(op.i('inplace'), is_inplace)
def test_inplace_check(self):
class MyInplaceFn(Function):
@staticmethod
def forward(self, x):
x.add_(1)
self.mark_dirty(x)
return x
@staticmethod
def backward(self, grad):
return grad
@torch.jit.compile(nderivs=0)
def fn(x):
return MyInplaceFn.apply(x)
x = Variable(torch.randn(5, 5))
fn(x) # trace
with self.assertRaisesRegex(RuntimeError, 'inplace MyInplaceFn'):
fn(x, _assert_compiled=True) # create closure
def test_backward(self):
a = Variable(torch.randn(2, 2), requires_grad=True)
b = Variable(torch.randn(2, 2), requires_grad=True)
x = a
y = a * b
trace = torch._C._tracer_enter((x, y), 2)
z = y * 2 * x
torch._C._tracer_exit((z,))
torch._C._jit_pass_lint(trace)
# Run first backward
grad, = torch.autograd.grad(z, x, Variable(torch.ones(2, 2), requires_grad=True), create_graph=True)
torch._C._jit_pass_lint(trace)
# Run second backward
grad.sum().backward(create_graph=True)
torch._C._jit_pass_lint(trace)
# Run dead code elimination to remove unused trace nodes
torch._C._jit_pass_dce(trace)
self.assertExpected(str(trace))
def test_backward_opaque(self):
x = Variable(torch.randn(3, 3), requires_grad=True)
y = Variable(torch.randn(3, 3), requires_grad=True)
trace = torch._C._tracer_enter((x, y), 2)
z = x.cross(y)
torch._C._tracer_exit((z,))
torch._C._jit_pass_lint(trace)
# Run first backward
grad, = torch.autograd.grad(z, x, Variable(torch.ones(3, 3), requires_grad=True), create_graph=True)
torch._C._jit_pass_lint(trace)
# Run dead code elimination to remove unused trace nodes
torch._C._jit_pass_dce(trace)
self.assertExpected(str(trace))
def test_backward_closure(self):
"""Check that autograd closures handle multiple stages correctly."""
x = Variable(torch.randn(1), requires_grad=True)
@torch.jit.compile(nderivs=2)
def fn(x):
return x * x
# Generate trace
grad_x, = torch.autograd.grad(fn(x), (x,), create_graph=True)
self.assertFalse(fn.has_trace_for(x))
grad_x.backward()
self.assertTrue(fn.has_trace_for(x))
x_grad = x.grad.data.clone()
x.grad.data.zero_()
# Run the trace
grad_x, = torch.autograd.grad(fn(x, _assert_compiled=True), (x,), create_graph=True)
grad_x.backward()
self.assertEqual(x.grad.data, x_grad)
def test_trace_expire(self):
x = Variable(torch.randn(2, 2), requires_grad=True)
y = Variable(torch.randn(2, 2), requires_grad=True)
def record_trace(num_backwards):
trace = torch._C._tracer_enter((x, y), num_backwards)
z = y * 2 * x
torch._C._tracer_exit((z,))
return z, trace
def check(expired, complete):
self.assertEqual(trace.is_expired, expired)
self.assertEqual(trace.is_complete, complete)
z, trace = record_trace(0)
check(False, True)
del z
check(False, True)
z, trace = record_trace(1)
check(False, False)
del z
check(True, False)
z, trace = record_trace(1)
check(False, False)
z.sum().backward()
check(False, True)
del z
check(False, True)
def test_multiuse_fn(self):
x = Variable(torch.randn(2, 2), requires_grad=True)
w = Variable(torch.randn(2, 2), requires_grad=True)
@torch.jit.compile
def cell(x, w):
return x * w + 2
out = cell(cell(cell(x, w), w), w)
self.assertFalse(cell.has_trace_for(x, w))
out.sum().backward()
self.assertTrue(cell.has_trace_for(x, w))
torch.jit.verify(cell, (x, w), devices=[])
def test_output_unflatten(self):
"""Check that outputs of traced functions retain the original structure and nesting"""
x = Variable(torch.randn(2, 2), requires_grad=True)
def fn(x):
return (x * 2, (x ** 2, x + 4, (x + 2,), ), x * 4)
expected_out = fn(x)
fn = torch.jit.compile(fn)
def recursive_sum(obj):
if isinstance(obj, Variable):
return obj.sum()
else:
return sum(recursive_sum(o) for o in obj)
recursive_sum(fn(x)).backward()
self.assertTrue(fn.has_trace_for(x))
self.assertEqual(fn(x, _assert_compiled=True), expected_out)
def test_input_flatten(self):
"""Check that inputs to traced functions are flattened"""
def make_var():
return Variable(torch.randn(1), requires_grad=True)
x = (make_var(), (make_var(), make_var()))
def fn(x, t):
y, z = t
return x * y * z
expected_out = fn(*x)
fn = torch.jit.compile(fn)
fn(*x).backward()
self.assertTrue(fn.has_trace_for(*x))
self.assertEqual(fn(*x, _assert_compiled=True), expected_out)
def test_flags(self):
x = Variable(torch.randn(2, 2))
y = Variable(torch.randn(2, 2))
@torch.jit.compile
def fn(x, y):
return (x * x + y * y + x * y).sum()
grads = {}
for rx, ry in product((True, False), repeat=2):
x.requires_grad = rx
y.requires_grad = ry
self.assertFalse(fn.has_trace_for(x, y))
out = fn(x, y)
self.assertFalse(fn.has_trace_for(x, y))
for v, name, compute in [(x, 'x', rx), (y, 'y', ry)]:
if not compute:
continue
grad_v, = torch.autograd.grad(out, v, retain_graph=True)
expected_grad = grads.setdefault(name, grad_v)
self.assertEqual(grad_v, expected_grad)
self.assertEqual(fn.has_trace_for(x, y), rx or ry)
def test_volatile_fallback(self):
"""Check that Traceable falls back to num_backwards=0 if given volatile inputs"""
x = Variable(torch.randn(2, 2))
y = Variable(torch.randn(2, 2), requires_grad=True)
@torch.jit.compile
def fn(x, y):
return x * x + x * y
out = fn(x, y)
self.assertFalse(fn.has_trace_for(x, y))
x.volatile = True
self.assertFalse(fn.has_trace_for(x, y))
out = fn(x, y)
self.assertTrue(fn.has_trace_for(x, y))
out2 = fn(x, y, _assert_compiled=True)
self.assertEqual(out, out2)
def test_backward_flag_checks(self):
x = Variable(torch.randn(1), requires_grad=True)
@torch.jit.compile(nderivs=2)
def fn(x):
return x * x
grad_x, = torch.autograd.grad(fn(x), (x,), create_graph=True)
self.assertFalse(fn.has_trace_for(x))
grad_x.backward()
self.assertTrue(fn.has_trace_for(x))
with self.assertRaisesRegex(RuntimeError, 'different flags'):
fn(x).backward(Variable(torch.ones(1), requires_grad=True))
with self.assertRaisesRegex(RuntimeError, 'different flags'):
grad_x, = torch.autograd.grad(fn(x), (x,), create_graph=True)
grad_x.backward(Variable(torch.ones(1), requires_grad=True))
# TODO: Test executing this
def test_python_ir(self):
x = Variable(torch.Tensor([0.4]), requires_grad=True)
y = Variable(torch.Tensor([0.7]), requires_grad=True)
def doit(x, y):
return torch.sigmoid(torch.tanh(x * (x + y)))
traced, _ = torch.jit.trace(doit, (x, y))
g = torch._C._jit_get_graph(traced)
g2 = torch._C.Graph()
g_to_g2 = {}
for node in g.inputs():
g_to_g2[node] = g2.addInput()
for node in g.nodes():
if node.kind() == "PythonOp":
n_ = g2.create(node.pyname(),
[g_to_g2[i] for i in node.inputs()]) \
.setType(node.typeOption()) \
.s_("note", "from_pyop") \
.i_("some_value", len(node.scalar_args()))
assert(n_.i("some_value") == len(node.scalar_args()))
else:
n_ = g2.createClone(node, lambda x: g_to_g2[x])
g_to_g2[node] = g2.appendNode(n_)
for node in g.outputs():
g2.registerOutput(g_to_g2[node])
t_node = g2.create("TensorTest").t_("a", torch.ones([2, 2]))
assert(t_node.attributeNames() == ["a"])
g2.appendNode(t_node)
assert(torch.equal(torch.ones([2, 2]), t_node.t("a")))
self.assertExpected(str(g2))
@unittest.skipIf(not RUN_CUDA, "cpp tests require CUDA")
def test_cpp(self):
torch._C._jit_run_cpp_tests()
def test_batchnorm(self):
x = Variable(torch.randn(2, 2).fill_(1.0), requires_grad=True)
trace, _ = torch.jit.trace(nn.BatchNorm2d(2), x)
self.assertExpected(str(trace))
def test_dropout(self):
x = Variable(torch.randn(2, 2).fill_(1.0), requires_grad=True)
trace, _ = torch.jit.trace(nn.Dropout(0.6), x)
self.assertExpected(str(trace))
@unittest.skip("unrecognized NodeKind: SpatialBN")
def test_batchnorm_run_twice(self):
@torch.jit.compile(nderivs=0)
class MyBatchNorm2d(nn.BatchNorm2d):
pass
bn = MyBatchNorm2d(1)
x = Variable(torch.randn(5, 1))
z = bn(x)
z2 = bn(x, _assert_compiled=True)
self.assertEqual(z, z2)
def test_non_decorator_use_fails(self):
MyLSTM = torch.jit.compile(nn.LSTM)
self.assertRaisesRegex(TypeError, "class decorator", lambda: MyLSTM(2, 2))
def test_conv(self):
x = Variable(torch.randn(20, 16, 50, 40).fill_(1.0), requires_grad=True)
trace, _ = torch.jit.trace(nn.Conv2d(16, 13, 3, bias=False), x)
self.assertExpected(str(trace))
def test_reuse_function(self):
@torch.jit.compile(nderivs=0)
def clinear(*args):
return F.linear(*args)
def cast(x):
return x
input = Variable(cast(torch.randn(1, 1)))
weights = Variable(cast(torch.randn(1, 1)))
bias = Variable(cast(torch.randn(1, 1)))
# linear AKA addmm without bias is of particular interest
# because we allocate a zero-filled new variable when we execute,
# and then *fill* it with the result
r1 = clinear(clinear(input, weights), weights, _assert_compiled=True)
r2 = F.linear(F.linear(input, weights), weights)
self.assertEqual(r1, r2)
def test_mini_wlm(self):
"""Exercise null-edge pruning in the tracer."""
@torch.jit.compile
class MyModel(nn.Module):
def __init__(self):
super(MyModel, self).__init__()
self.encoder = nn.Embedding(2, 2)
def forward(self, input, hidden):
emb = self.encoder(input)
hidden = hidden.clone() # simulate some RNN operation
return emb, hidden
model = MyModel()
x = Variable(torch.LongTensor([[0, 1], [1, 0]]))
y = Variable(torch.FloatTensor([0]))
z, _ = model(x, y)
z.sum().backward()
z, _ = model(x, y, _assert_compiled=True)
z.sum().backward()
@skipIfNoTorchVision
def test_alexnet(self):
x = Variable(torch.randn(10, 3, 224, 224).fill_(1.0), requires_grad=True)
trace, _ = torch.jit.trace(torchvision.models.AlexNet(), x)
self.assertExpected(str(trace))
# NB: Purposely NOT testing protobuf export here
if __name__ == '__main__':
run_tests()
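
The tests above all follow one pattern: decorate a function with torch.jit.compile, run it once (plus the matching backward passes when nderivs > 0) so a trace is recorded, then call it again with _assert_compiled=True to force the recorded trace to be replayed. A minimal standalone sketch of that pattern, assuming the same 0.3-era experimental API (nderivs, has_trace_for, _assert_compiled) that these tests exercise:

import torch
from torch.autograd import Variable

# Sketch only: relies on the experimental torch.jit.compile API used in the
# tests above (nderivs, has_trace_for, _assert_compiled).
@torch.jit.compile(nderivs=0)
def affine(x):
    return x * 2 + 1

x = Variable(torch.randn(2, 2))
y = affine(x)                          # first call records the trace
assert affine.has_trace_for(x)         # with nderivs=0 a single forward suffices
y2 = affine(x, _assert_compiled=True)  # later calls replay the compiled trace
assert (y.data == y2.data).all()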

(File diff suppressed because it is too large.)

@@ -0,0 +1,423 @@
import contextlib
import gc
import os
import sys
import time
import unittest
from sys import platform
import torch
import torch.cuda
import torch.multiprocessing as mp
from torch.autograd import Variable
from torch.nn import Parameter
from common import TestCase, run_tests, IS_WINDOWS
TEST_REPEATS = 30
HAS_SHM_FILES = os.path.isdir('/dev/shm')
TEST_CUDA_IPC = torch.cuda.is_available() and \
sys.version_info[0] == 3 and \
sys.platform != 'darwin' and \
sys.platform != 'win32'
TEST_MULTIGPU = TEST_CUDA_IPC and torch.cuda.device_count() > 1
def simple_fill(queue, event):
data = queue.get()
data[0][:] = 4
event.set()
def simple_pool_fill(tensor):
tensor.fill_(4)
return tensor.add(1)
def send_tensor(queue, event, tp):
t = torch.ones(5, 5).type(tp)
queue.put(t)
queue.put(t)
event.wait()
def sum_tensors(inq, outq):
with torch.cuda.device(1):
tensors = inq.get()
for tensor in tensors:
outq.put((tensor.sum(), tensor.get_device(),
tensor.numel(), tensor.storage().size()))
def queue_get_exception(inqueue, outqueue):
os.close(2) # hide expected error message
try:
torch.zeros(5, 5).cuda()
except Exception as e:
outqueue.put(e)
else:
outqueue.put('no exception')
# Multiply by two in a separate stream
def cuda_multiply_two(queue, ready, done):
ready.set()
with torch.cuda.stream(torch.cuda.Stream()):
cuda_event, tensor = queue.get()
cuda_event.wait()
tensor.mul_(2)
cuda_event.record()
done.set()
del cuda_event
def autograd_sharing(queue, ready, master_modified):
var = queue.get()
ready.set()
master_modified.wait()
expected_var = torch.arange(1, 26).view(5, 5)
expected_var[0, 0] = 1000
is_ok = var.data.equal(expected_var)
var.data[:] = torch.ones(5, 5)
is_ok &= var.grad is None
var._grad = Variable(torch.ones(5, 5), requires_grad=False)
queue.put(is_ok)
@contextlib.contextmanager
def fs_sharing():
prev_strategy = mp.get_sharing_strategy()
mp.set_sharing_strategy('file_system')
try:
yield
finally:
mp.set_sharing_strategy(prev_strategy)
class leak_checker(object):
def __init__(self, test_case):
self.checked_pids = [os.getpid()]
self.test_case = test_case
def __enter__(self):
self.next_fds = self._get_next_fds(10)
return self
def __exit__(self, *args):
if args[0] is None:
# Check that the 10th available file-descriptor at the end of the
# test is no more than 4 higher than the 10th available at the
# start. This attempts to catch file descriptor leaks, but allows
# one-off initialization that may use up a file descriptor
# TODO: Disabled because this check is too flaky
# available_fds = self._get_next_fds(10)
# self.test_case.assertLessEqual(
# available_fds[-1] - self.next_fds[-1], 5)
self.test_case.assertFalse(self.has_shm_files())
return False
def check_pid(self, pid):
self.checked_pids.append(pid)
def _get_next_fds(self, n=1):
# dup uses the lowest-numbered unused descriptor for the new descriptor
fds = [os.dup(0) for i in range(n)]
for fd in fds:
os.close(fd)
return fds
def has_shm_files(self, wait=True):
if not HAS_SHM_FILES:
return False
result = self._has_shm_files()
if result and mp.get_sharing_strategy() == 'file_system' and wait:
time.sleep(0.5)
return self._has_shm_files()
return result
def _has_shm_files(self):
gc.collect()
names = list('torch_' + str(pid) for pid in self.checked_pids)
for filename in os.listdir('/dev/shm'):
for name in names:
if filename.startswith(name):
return True
return False
class TestMultiprocessing(TestCase):
def _test_sharing(self, ctx=mp, type=torch.FloatTensor, repeat=1):
def test_fill():
x = torch.zeros(5, 5).type(type)
q = ctx.Queue()
e = ctx.Event()
data = [x, x[:, 1]]
q.put(data)
p = ctx.Process(target=simple_fill, args=(q, e))
p.daemon = True
lc.check_pid(p.pid)
p.start()
e.wait(10)
self.assertTrue(e.is_set())
self.assertTrue(data[0].eq(4).all())
self.assertTrue(data[1].eq(4).all())
p.join(1)
self.assertFalse(p.is_alive())
def test_receive():
q = ctx.Queue()
e = ctx.Event()
p = ctx.Process(target=send_tensor, args=(q, e, type))
p.daemon = True
lc.check_pid(p.pid)
p.start()
t1 = q.get()
t2 = q.get()
self.assertTrue(t1.eq(1).all())
self.assertTrue(id(t1.storage()) == id(t2.storage()))
e.set()
p.join(1)
self.assertFalse(p.is_alive())
with leak_checker(self) as lc:
for _ in range(repeat):
test_fill()
test_receive()
def _test_preserve_sharing(self, ctx=mp, repeat=1):
def do_test():
x = torch.randn(5, 5)
data = [x.storage(), x.storage()[1:4], x, x[2], x[:, 1]]
q = ctx.Queue()
q.put(data)
new_data = q.get(timeout=1)
self.assertEqual(new_data, data, 0)
storage_cdata = data[0]._cdata
self.assertEqual(new_data[0]._cdata, storage_cdata)
for t in new_data[2:]:
self.assertEqual(t.storage()._cdata, storage_cdata)
# TODO: enable after fixing #46
# new_data[0].fill_(10)
# self.assertEqual(new_data[1], new_data[0][1:4], 0)
with leak_checker(self):
for i in range(repeat):
do_test()
def _test_pool(self, ctx=mp, repeat=1):
def do_test():
p = ctx.Pool(2)
for proc in p._pool:
lc.check_pid(proc.pid)
buffers = [torch.zeros(2, 2) for i in range(4)]
results = p.map(simple_pool_fill, buffers, 1)
self.assertEqual(len(results), len(buffers))
for r in results:
self.assertEqual(r, torch.ones(2, 2) * 5, 0)
for b in buffers:
self.assertEqual(b, torch.ones(2, 2) * 4, 0)
p.close()
p.join()
with leak_checker(self) as lc:
for i in range(repeat):
do_test()
@unittest.skipIf(platform == 'darwin', "file descriptor strategy is not supported on OS X")
def test_fd_sharing(self):
self._test_sharing(repeat=TEST_REPEATS)
@unittest.skipIf(platform == 'darwin', "file descriptor strategy is not supported on OS X")
def test_fd_preserve_sharing(self):
self._test_preserve_sharing(repeat=TEST_REPEATS)
@unittest.skipIf(platform == 'darwin', "file descriptor strategy is not supported on OS X")
def test_fd_pool(self):
self._test_pool(repeat=TEST_REPEATS)
def test_fs_sharing(self):
with fs_sharing():
self._test_sharing(repeat=TEST_REPEATS)
def test_fs_preserve_sharing(self):
with fs_sharing():
self._test_preserve_sharing(repeat=TEST_REPEATS)
def test_fs_pool(self):
with fs_sharing():
self._test_pool(repeat=TEST_REPEATS)
@unittest.skipIf(not HAS_SHM_FILES, "don't know how to check if shm files exist")
def test_fs(self):
def queue_put():
x = torch.DoubleStorage(4)
q = mp.Queue()
self.assertFalse(lc.has_shm_files())
q.put(x)
time.sleep(0.05) # queue serializes asynchronously
self.assertTrue(lc.has_shm_files(wait=False))
q.get()
with fs_sharing(), leak_checker(self) as lc:
for _ in range(TEST_REPEATS):
queue_put()
def test_inherit_tensor(self):
class SubProcess(mp.Process):
def __init__(self, tensor):
super(SubProcess, self).__init__()
self.tensor = tensor
self.daemon = True
def run(self):
self.tensor.add_(3)
t = torch.zeros(5, 5)
p = SubProcess(t.share_memory_())
p.start()
p.join(1)
self.assertEqual(t, torch.ones(5, 5) * 3, 0)
@unittest.skipIf(not TEST_CUDA_IPC, 'CUDA IPC not available')
def test_cuda(self):
torch.cuda.FloatTensor([1]) # initialize CUDA outside of leak checker
self._test_sharing(mp.get_context('spawn'), torch.cuda.FloatTensor)
@unittest.skipIf(not TEST_CUDA_IPC, 'CUDA IPC not available')
@unittest.skipIf(not TEST_MULTIGPU, 'found only 1 GPU')
def test_cuda_small_tensors(self):
# Check multiple small tensors which will likely use the same
# underlying cached allocation
ctx = mp.get_context('spawn')
tensors = []
for i in range(5):
device = i % 2
tensors += [torch.arange(i * 5, (i + 1) * 5).cuda(device)]
inq = ctx.Queue()
outq = ctx.Queue()
inq.put(tensors)
p = ctx.Process(target=sum_tensors, args=(inq, outq))
p.start()
results = []
for i in range(5):
results.append(outq.get())
p.join()
for i, tensor in enumerate(tensors):
v, device, tensor_size, storage_size = results[i]
self.assertEqual(v, torch.arange(i * 5, (i + 1) * 5).sum())
self.assertEqual(device, i % 2)
self.assertEqual(tensor_size, 5)
self.assertEqual(storage_size, 5)
@unittest.skipIf(IS_WINDOWS, 'not applicable to Windows (only fails with fork)')
@unittest.skipIf(not torch.cuda.is_available(), 'CUDA not available')
def test_cuda_bad_call(self):
# Initialize CUDA
t = torch.zeros(5, 5).cuda().cpu()
inq = mp.Queue()
outq = mp.Queue()
p = mp.Process(target=queue_get_exception, args=(inq, outq))
p.start()
inq.put(t)
p.join()
self.assertIsInstance(outq.get(), RuntimeError)
@unittest.skipIf(not TEST_CUDA_IPC, 'CUDA IPC not available')
def test_event(self):
ctx = mp.get_context('spawn')
queue = ctx.Queue()
ready = ctx.Event()
done = ctx.Event()
p = ctx.Process(target=cuda_multiply_two, args=(queue, ready, done))
p.start()
ready.wait()
with torch.cuda.stream(torch.cuda.Stream()):
tensor = torch.cuda.FloatTensor([1, 1, 1, 1])
# Use a sleep kernel to test events. Without the event, the
# multiply happens before the add.
event = torch.cuda.Event(interprocess=True)
torch.cuda._sleep(20000000) # about 30 ms
tensor.add_(1)
event.record()
queue.put((event, tensor))
done.wait() # must wait until subprocess records event
event.synchronize()
self.assertEqual(list(tensor), [4, 4, 4, 4])
p.join()
def _test_autograd_sharing(self, var):
ready = mp.Event()
master_modified = mp.Event()
queue = mp.Queue()
p = mp.Process(target=autograd_sharing, args=(queue, ready, master_modified))
p.daemon = True
p.start()
var._grad = Variable(torch.zeros(5, 5), requires_grad=False)
queue.put(var)
ready.wait()
var.data[0, 0] = 1000
var.grad.data[:] = torch.ones(5, 5) * 4
master_modified.set()
worker_ok = queue.get()
self.assertTrue(worker_ok)
self.assertEqual(var.data, torch.ones(5, 5))
self.assertEqual(var.grad.data, torch.ones(5, 5) * 4)
p.join(1)
self.assertFalse(p.is_alive())
def test_variable_sharing(self):
configs = [
(True, False),
(False, False),
(False, True),
]
for requires_grad, volatile in configs:
var = Variable(torch.arange(1, 26).view(5, 5),
requires_grad=requires_grad,
volatile=volatile)
self._test_autograd_sharing(var)
def test_parameter_sharing(self):
param = Parameter(torch.arange(1, 26).view(5, 5))
self._test_autograd_sharing(param)
def test_empty_shared(self):
t = torch.Tensor()
t.share_memory_()
def _test_is_shared(self):
t = torch.randn(5, 5)
self.assertFalse(t.is_shared())
t.share_memory_()
self.assertTrue(t.is_shared())
@unittest.skipIf(platform == 'darwin', "file descriptor strategy is not supported on OS X")
def test_is_shared(self):
self._test_is_shared()
def test_fs_is_shared(self):
with fs_sharing():
self._test_is_shared()
@unittest.skipIf(not torch.cuda.is_available(), 'CUDA not available')
def test_is_shared_cuda(self):
t = torch.randn(5, 5).cuda()
self.assertTrue(t.is_shared())
if __name__ == '__main__':
run_tests()
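
test_inherit_tensor above captures the core shared-memory contract: once a tensor's storage has been moved to shared memory with share_memory_(), a child process operates on the very same storage, so in-place updates are visible to the parent. A small standalone sketch of that contract (fork-style start method assumed), using the same calls as the test:

import torch
import torch.multiprocessing as mp

def add_three(t):
    t.add_(3)  # in-place update lands in the shared storage

if __name__ == '__main__':
    t = torch.zeros(5, 5).share_memory_()   # move the storage to shared memory
    p = mp.Process(target=add_three, args=(t,))
    p.start()
    p.join()
    print(t)  # the parent sees the child's update: every entry is 3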

test/test_nccl.py (new file, 88 lines)
@@ -0,0 +1,88 @@
import unittest
import torch
import torch.cuda.nccl as nccl
import torch.cuda
from common import TestCase, run_tests
nGPUs = torch.cuda.device_count()
if nGPUs == 0:
print('CUDA not available, skipping tests')
TestCase = object # noqa: F811
class TestNCCL(TestCase):
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_broadcast(self):
expected = torch.FloatTensor(128).uniform_()
tensors = [expected.cuda()]
for device in range(1, torch.cuda.device_count()):
with torch.cuda.device(device):
tensors.append(torch.cuda.FloatTensor(128))
nccl.broadcast(tensors)
for i in range(torch.cuda.device_count()):
self.assertEqual(tensors[i], expected)
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_reduce(self):
tensors = [torch.FloatTensor(128).uniform_() for i in range(nGPUs)]
expected = torch.FloatTensor(128).zero_()
for t in tensors:
expected.add_(t)
tensors = [tensors[i].cuda(i) for i in range(nGPUs)]
nccl.reduce(tensors)
self.assertEqual(tensors[0], expected)
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_all_reduce(self):
tensors = [torch.FloatTensor(128).uniform_() for i in range(nGPUs)]
expected = torch.FloatTensor(128).zero_()
for t in tensors:
expected.add_(t)
tensors = [tensors[i].cuda(i) for i in range(nGPUs)]
nccl.all_reduce(tensors)
for tensor in tensors:
self.assertEqual(tensor, expected)
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_all_gather(self):
inputs = [torch.FloatTensor(128).uniform_() for i in range(nGPUs)]
expected = torch.cat(inputs, 0)
inputs = [inputs[i].cuda(i) for i in range(nGPUs)]
outputs = [torch.cuda.FloatTensor(128 * nGPUs, device=i)
for i in range(nGPUs)]
nccl.all_gather(inputs, outputs)
for tensor in outputs:
self.assertEqual(tensor, expected)
@unittest.skipIf(nGPUs < 2, "only one GPU detected")
def test_reduce_scatter(self):
in_size = 32 * nGPUs
out_size = 32
inputs = [torch.FloatTensor(in_size).uniform_() for i in range(nGPUs)]
expected = torch.FloatTensor(in_size).zero_()
for t in inputs:
expected.add_(t)
expected = expected.view(nGPUs, 32)
inputs = [inputs[i].cuda(i) for i in range(nGPUs)]
outputs = [torch.cuda.FloatTensor(out_size, device=i)
for i in range(nGPUs)]
nccl.reduce_scatter(inputs, outputs)
for i in range(nGPUs):
self.assertEqual(outputs[i], expected[i])
if __name__ == '__main__':
run_tests()
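
For reference, the expected tensors constructed in the NCCL tests above encode the usual collective semantics for n devices with per-device inputs x_j (a summary of what the assertions check, not an API description):

\text{broadcast:} \quad y_j = x_0 \;\;\forall j, \qquad
\text{reduce:} \quad y_0 = \sum_k x_k, \qquad
\text{all\_reduce:} \quad y_j = \sum_k x_k \;\;\forall j,

\text{all\_gather:} \quad y_j = [\,x_0;\, x_1;\, \dots;\, x_{n-1}\,] \;\;\forall j, \qquad
\text{reduce\_scatter:} \quad y_j = \Big(\sum_k x_k\Big)\big[\,jm : (j+1)m\,\big] \text{ for chunk size } m.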

(File diff suppressed because it is too large.)

test/test_optim.py (new file, 582 lines)
@@ -0,0 +1,582 @@
import math
import unittest
import functools
from copy import deepcopy
import torch
import torch.optim as optim
import torch.legacy.optim as old_optim
import torch.nn.functional as F
from torch.optim import SGD
from torch.autograd import Variable
from torch import sparse
from torch.optim.lr_scheduler import LambdaLR, StepLR, MultiStepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau
from common import TestCase, run_tests
def rosenbrock(tensor):
x, y = tensor
return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2
def drosenbrock(tensor):
x, y = tensor
return torch.DoubleTensor((-400 * x * (y - x ** 2) - 2 * (1 - x), 200 * (y - x ** 2)))
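
For reference, drosenbrock above is the analytic gradient of the rosenbrock function defined just before it; applying the chain rule to that expression gives

f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2, \qquad
\frac{\partial f}{\partial x} = -400\,x\,(y - x^2) - 2\,(1 - x), \qquad
\frac{\partial f}{\partial y} = 200\,(y - x^2),

which is exactly the pair of components packed into the DoubleTensor returned by drosenbrock.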
def wrap_old_fn(old_fn, **config):
def wrapper(closure, params, state):
return old_fn(closure, params, config, state)
return wrapper
class TestOptim(TestCase):
def _test_rosenbrock(self, constructor, old_fn):
params_t = torch.Tensor([1.5, 1.5])
state = {}
params = Variable(torch.Tensor([1.5, 1.5]), requires_grad=True)
optimizer = constructor([params])
solution = torch.Tensor([1, 1])
initial_dist = params.data.dist(solution)
def eval():
optimizer.zero_grad()
loss = rosenbrock(params)
loss.backward()
# loss.backward() will give **slightly** different
# gradients than drosenbrock, because of a different ordering
# of floating point operations. In most cases it doesn't matter,
# but some optimizers are so sensitive that they can temporarily
# diverge up to 1e-4, just to converge again. This makes the
# comparison more stable.
params.grad.data.copy_(drosenbrock(params.data))
return loss
for i in range(2000):
optimizer.step(eval)
old_fn(lambda _: (rosenbrock(params_t), drosenbrock(params_t)),
params_t, state)
self.assertEqual(params.data, params_t)
self.assertLessEqual(params.data.dist(solution), initial_dist)
def _test_rosenbrock_sparse(self, constructor, sparse_only=False):
params_t = torch.Tensor([1.5, 1.5])
params = Variable(params_t, requires_grad=True)
optimizer = constructor([params])
if not sparse_only:
params_c = Variable(params_t.clone(), requires_grad=True)
optimizer_c = constructor([params_c])
solution = torch.Tensor([1, 1])
initial_dist = params.data.dist(solution)
def eval(params, sparse_grad, w):
# Depending on w, provide only the x or y gradient
optimizer.zero_grad()
loss = rosenbrock(params)
loss.backward()
grad = drosenbrock(params.data)
# NB: We torture test the optimizer by returning an
# uncoalesced sparse tensor
if w:
i = torch.LongTensor([[0, 0]])
x = grad[0]
v = torch.DoubleTensor([x / 4., x - x / 4.])
else:
i = torch.LongTensor([[1, 1]])
y = grad[1]
v = torch.DoubleTensor([y - y / 4., y / 4.])
x = sparse.DoubleTensor(i, v, torch.Size([2]))
if sparse_grad:
params.grad.data = x
else:
params.grad.data = x.to_dense()
return loss
for i in range(2000):
# Do cyclic coordinate descent
w = i % 2
optimizer.step(functools.partial(eval, params, True, w))
if not sparse_only:
optimizer_c.step(functools.partial(eval, params_c, False, w))
self.assertEqual(params.data, params_c.data)
self.assertLessEqual(params.data.dist(solution), initial_dist)
def _test_basic_cases_template(self, weight, bias, input, constructor):
weight = Variable(weight, requires_grad=True)
bias = Variable(bias, requires_grad=True)
input = Variable(input)
optimizer = constructor(weight, bias)
def fn():
optimizer.zero_grad()
y = weight.mv(input)
if y.is_cuda and bias.is_cuda and y.get_device() != bias.get_device():
y = y.cuda(bias.get_device())
loss = (y + bias).pow(2).sum()
loss.backward()
return loss
initial_value = fn().data[0]
for i in range(200):
optimizer.step(fn)
self.assertLess(fn().data[0], initial_value)
def _test_state_dict(self, weight, bias, input, constructor):
weight = Variable(weight, requires_grad=True)
bias = Variable(bias, requires_grad=True)
input = Variable(input)
def fn_base(optimizer, weight, bias):
optimizer.zero_grad()
loss = (weight.mv(input) + bias).pow(2).sum()
loss.backward()
return loss
optimizer = constructor(weight, bias)
fn = functools.partial(fn_base, optimizer, weight, bias)
# Prime the optimizer
for i in range(20):
optimizer.step(fn)
# Clone the weights and construct new optimizer for them
weight_c = Variable(weight.data.clone(), requires_grad=True)
bias_c = Variable(bias.data.clone(), requires_grad=True)
optimizer_c = constructor(weight_c, bias_c)
fn_c = functools.partial(fn_base, optimizer_c, weight_c, bias_c)
# Load state dict
state_dict = deepcopy(optimizer.state_dict())
state_dict_c = deepcopy(optimizer.state_dict())
optimizer_c.load_state_dict(state_dict_c)
# Run both optimizations in parallel
for i in range(20):
optimizer.step(fn)
optimizer_c.step(fn_c)
self.assertEqual(weight, weight_c)
self.assertEqual(bias, bias_c)
# Make sure state dict wasn't modified
self.assertEqual(state_dict, state_dict_c)
def _test_basic_cases(self, constructor, ignore_multidevice=False):
self._test_state_dict(
torch.randn(10, 5),
torch.randn(10),
torch.randn(5),
constructor
)
self._test_basic_cases_template(
torch.randn(10, 5),
torch.randn(10),
torch.randn(5),
constructor
)
# non-contiguous parameters
self._test_basic_cases_template(
torch.randn(10, 5, 2)[..., 0],
torch.randn(10, 2)[..., 0],
torch.randn(5),
constructor
)
# CUDA
if not torch.cuda.is_available():
return
self._test_basic_cases_template(
torch.randn(10, 5).cuda(),
torch.randn(10).cuda(),
torch.randn(5).cuda(),
constructor
)
# Multi-GPU
if not torch.cuda.device_count() > 1 or ignore_multidevice:
return
self._test_basic_cases_template(
torch.randn(10, 5).cuda(0),
torch.randn(10).cuda(1),
torch.randn(5).cuda(0),
constructor
)
def _build_params_dict(self, weight, bias, **kwargs):
return [dict(params=[weight]), dict(params=[bias], **kwargs)]
def _build_params_dict_single(self, weight, bias, **kwargs):
return [dict(params=bias, **kwargs)]
def test_sgd(self):
self._test_rosenbrock(
lambda params: optim.SGD(params, lr=1e-3),
wrap_old_fn(old_optim.sgd, learningRate=1e-3)
)
self._test_rosenbrock(
lambda params: optim.SGD(params, lr=1e-3, momentum=0.9,
dampening=0, weight_decay=1e-4),
wrap_old_fn(old_optim.sgd, learningRate=1e-3, momentum=0.9,
dampening=0, weightDecay=1e-4)
)
self._test_basic_cases(
lambda weight, bias: optim.SGD([weight, bias], lr=1e-3)
)
self._test_basic_cases(
lambda weight, bias: optim.SGD(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-3)
)
self._test_basic_cases(
lambda weight, bias: optim.SGD(
self._build_params_dict_single(weight, bias, lr=1e-2),
lr=1e-3)
)
def test_sgd_sparse(self):
self._test_rosenbrock_sparse(
lambda params: optim.SGD(params, lr=5e-3)
)
def test_adam(self):
self._test_rosenbrock(
lambda params: optim.Adam(params, lr=1e-2),
wrap_old_fn(old_optim.adam, learningRate=1e-2)
)
self._test_rosenbrock(
lambda params: optim.Adam(params, lr=1e-2, weight_decay=1e-2),
wrap_old_fn(old_optim.adam, learningRate=1e-2, weightDecay=1e-2)
)
self._test_basic_cases(
lambda weight, bias: optim.Adam([weight, bias], lr=1e-3)
)
self._test_basic_cases(
lambda weight, bias: optim.Adam(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-3)
)
def test_sparse_adam(self):
self._test_rosenbrock_sparse(
lambda params: optim.SparseAdam(params, lr=4e-2),
True
)
def test_adadelta(self):
self._test_rosenbrock(
lambda params: optim.Adadelta(params),
wrap_old_fn(old_optim.adadelta)
)
self._test_rosenbrock(
lambda params: optim.Adadelta(params, rho=0.95),
wrap_old_fn(old_optim.adadelta, rho=0.95)
)
self._test_rosenbrock(
lambda params: optim.Adadelta(params, weight_decay=1e-2),
wrap_old_fn(old_optim.adadelta, weightDecay=1e-2)
)
self._test_basic_cases(
lambda weight, bias: optim.Adadelta([weight, bias])
)
self._test_basic_cases(
lambda weight, bias: optim.Adadelta(
self._build_params_dict(weight, bias, rho=0.95))
)
def test_adagrad(self):
self._test_rosenbrock(
lambda params: optim.Adagrad(params, lr=1e-1),
wrap_old_fn(old_optim.adagrad, learningRate=1e-1)
)
self._test_rosenbrock(
lambda params: optim.Adagrad(params, lr=1e-1, lr_decay=1e-3),
wrap_old_fn(old_optim.adagrad, learningRate=1e-1, learningRateDecay=1e-3)
)
self._test_rosenbrock(
lambda params: optim.Adagrad(params, lr=1e-1, weight_decay=1e-2),
wrap_old_fn(old_optim.adagrad, learningRate=1e-1, weightDecay=1e-2)
)
self._test_basic_cases(
lambda weight, bias: optim.Adagrad([weight, bias], lr=1e-1)
)
self._test_basic_cases(
lambda weight, bias: optim.Adagrad(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-1)
)
def test_adagrad_sparse(self):
self._test_rosenbrock_sparse(
lambda params: optim.Adagrad(params, lr=1e-1)
)
def test_adamax(self):
self._test_rosenbrock(
lambda params: optim.Adamax(params, lr=1e-1),
wrap_old_fn(old_optim.adamax, learningRate=1e-1)
)
self._test_rosenbrock(
lambda params: optim.Adamax(params, lr=1e-1, weight_decay=1e-2),
wrap_old_fn(old_optim.adamax, learningRate=1e-1, weightDecay=1e-2)
)
self._test_rosenbrock(
lambda params: optim.Adamax(params, lr=1e-1, betas=(0.95, 0.998)),
wrap_old_fn(old_optim.adamax, learningRate=1e-1, beta1=0.95, beta2=0.998)
)
self._test_basic_cases(
lambda weight, bias: optim.Adamax([weight, bias], lr=1e-1)
)
self._test_basic_cases(
lambda weight, bias: optim.Adamax(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-1)
)
def test_rmsprop(self):
self._test_rosenbrock(
lambda params: optim.RMSprop(params, lr=1e-2),
wrap_old_fn(old_optim.rmsprop, learningRate=1e-2)
)
self._test_rosenbrock(
lambda params: optim.RMSprop(params, lr=1e-2, weight_decay=1e-2),
wrap_old_fn(old_optim.rmsprop, learningRate=1e-2, weightDecay=1e-2)
)
self._test_rosenbrock(
lambda params: optim.RMSprop(params, lr=1e-2, alpha=0.95),
wrap_old_fn(old_optim.rmsprop, learningRate=1e-2, alpha=0.95)
)
self._test_basic_cases(
lambda weight, bias: optim.RMSprop([weight, bias], lr=1e-2)
)
self._test_basic_cases(
lambda weight, bias: optim.RMSprop(
self._build_params_dict(weight, bias, lr=1e-3),
lr=1e-2)
)
def test_asgd(self):
self._test_rosenbrock(
lambda params: optim.ASGD(params, lr=1e-3),
wrap_old_fn(old_optim.asgd, eta0=1e-3)
)
self._test_rosenbrock(
lambda params: optim.ASGD(params, lr=1e-3, alpha=0.8),
wrap_old_fn(old_optim.asgd, eta0=1e-3, alpha=0.8)
)
self._test_rosenbrock(
lambda params: optim.ASGD(params, lr=1e-3, t0=1e3),
wrap_old_fn(old_optim.asgd, eta0=1e-3, t0=1e3)
)
self._test_basic_cases(
lambda weight, bias: optim.ASGD([weight, bias], lr=1e-3, t0=100)
)
self._test_basic_cases(
lambda weight, bias: optim.ASGD(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-3, t0=100)
)
def test_rprop(self):
self._test_rosenbrock(
lambda params: optim.Rprop(params, lr=1e-3),
wrap_old_fn(old_optim.rprop, stepsize=1e-3)
)
self._test_rosenbrock(
lambda params: optim.Rprop(params, lr=1e-3, etas=(0.6, 1.1)),
wrap_old_fn(old_optim.rprop, stepsize=1e-3, etaminus=0.6, etaplus=1.1)
)
self._test_rosenbrock(
lambda params: optim.Rprop(params, lr=1e-3, step_sizes=(1e-4, 3)),
wrap_old_fn(old_optim.rprop, stepsize=1e-3, stepsizemin=1e-4, stepsizemax=3)
)
self._test_basic_cases(
lambda weight, bias: optim.Rprop([weight, bias], lr=1e-3)
)
self._test_basic_cases(
lambda weight, bias: optim.Rprop(
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-3)
)
def test_lbfgs(self):
self._test_rosenbrock(
lambda params: optim.LBFGS(params),
wrap_old_fn(old_optim.lbfgs)
)
self._test_rosenbrock(
lambda params: optim.LBFGS(params, lr=5e-2, max_iter=5),
wrap_old_fn(old_optim.lbfgs, learningRate=5e-2, maxIter=5)
)
self._test_basic_cases(
lambda weight, bias: optim.LBFGS([weight, bias]),
ignore_multidevice=True
)
def test_invalid_param_type(self):
with self.assertRaises(TypeError):
optim.SGD(Variable(torch.randn(5, 5)), lr=3)
class SchedulerTestNet(torch.nn.Module):
def __init__(self):
super(SchedulerTestNet, self).__init__()
self.conv1 = torch.nn.Conv2d(1, 1, 1)
self.conv2 = torch.nn.Conv2d(1, 1, 1)
def forward(self, x):
return self.conv2(F.relu(self.conv1(x)))
class TestLRScheduler(TestCase):
def setUp(self):
self.net = SchedulerTestNet()
self.opt = SGD(
[{'params': self.net.conv1.parameters()}, {'params': self.net.conv2.parameters(), 'lr': 0.5}],
lr=0.05)
def test_step_lr(self):
# lr = 0.05 if epoch < 3
# lr = 0.005 if 3 <= epoch < 6
# lr = 0.0005 if 6 <= epoch < 9
# lr = 0.00005 if epoch >= 9
epochs = 10
single_targets = [0.05] * 3 + [0.005] * 3 + [0.0005] * 3 + [0.00005] * 3
targets = [single_targets, list(map(lambda x: x * epochs, single_targets))]
scheduler = StepLR(self.opt, gamma=0.1, step_size=3)
self._test(scheduler, targets, epochs)
def test_multi_step_lr(self):
# lr = 0.05 if epoch < 2
# lr = 0.005 if 2 <= epoch < 5
# lr = 0.0005 if 5 <= epoch < 9
# lr = 0.00005 if epoch >= 9
epochs = 10
single_targets = [0.05] * 2 + [0.005] * 3 + [0.0005] * 4 + [0.00005] * 3
targets = [single_targets, list(map(lambda x: x * epochs, single_targets))]
scheduler = MultiStepLR(self.opt, gamma=0.1, milestones=[2, 5, 9])
self._test(scheduler, targets, epochs)
def test_exp_lr(self):
epochs = 10
single_targets = [0.05 * (0.9 ** x) for x in range(epochs)]
targets = [single_targets, list(map(lambda x: x * epochs, single_targets))]
scheduler = ExponentialLR(self.opt, gamma=0.9)
self._test(scheduler, targets, epochs)
def test_cos_anneal_lr(self):
epochs = 10
eta_min = 1e-10
single_targets = [eta_min + (0.05 - eta_min) *
(1 + math.cos(math.pi * x / epochs)) / 2
for x in range(epochs)]
targets = [single_targets, list(map(lambda x: x * epochs, single_targets))]
scheduler = CosineAnnealingLR(self.opt, T_max=epochs, eta_min=eta_min)
self._test(scheduler, targets, epochs)
def test_reduce_lr_on_plateau1(self):
epochs = 10
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 20]
metrics = [10 - i * 0.0167 for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, threshold_mode='abs', mode='min',
threshold=0.01, patience=5, cooldown=5)
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau2(self):
epochs = 22
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.05] * 7 + [0.005] * 7 + [0.0005] * 2]
metrics = [10 - i * 0.0165 for i in range(22)]
scheduler = ReduceLROnPlateau(self.opt, patience=5, cooldown=0, threshold_mode='abs',
mode='min', threshold=0.1)
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau3(self):
epochs = 22
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * (2 + 6) + [0.05] * (5 + 6) + [0.005] * 4]
metrics = [-0.8] * 2 + [-0.234] * 20
scheduler = ReduceLROnPlateau(self.opt, mode='max', patience=5, cooldown=5,
threshold_mode='abs')
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau4(self):
epochs = 20
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 20]
metrics = [1.5 * (1.025 ** i) for i in range(20)] # 1.025 > 1.1**0.25
scheduler = ReduceLROnPlateau(self.opt, mode='max', patience=3,
threshold_mode='rel', threshold=0.1)
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau5(self):
epochs = 20
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.05] * (5 + 6) + [0.005] * 4]
metrics = [1.5 * (1.005 ** i) for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, mode='max', threshold_mode='rel',
threshold=0.1, patience=5, cooldown=5)
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau6(self):
epochs = 20
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 20]
metrics = [1.5 * (0.85 ** i) for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, mode='min', threshold_mode='rel',
threshold=0.1)
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau7(self):
epochs = 20
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.05] * (5 + 6) + [0.005] * 4]
metrics = [1] * 7 + [0.6] + [0.5] * 12
scheduler = ReduceLROnPlateau(self.opt, mode='min', threshold_mode='rel',
threshold=0.1, patience=5, cooldown=5)
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau8(self):
epochs = 20
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.4] * 14, [0.5] * 6 + [0.3] * 14]
metrics = [1.5 * (1.005 ** i) for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, mode='max', threshold_mode='rel', min_lr=[0.4, 0.3],
threshold=0.1, patience=5, cooldown=5)
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_lambda_lr(self):
epochs = 10
self.opt.param_groups[0]['lr'] = 0.05
self.opt.param_groups[1]['lr'] = 0.4
targets = [[0.05 * (0.9 ** x) for x in range(epochs)], [0.4 * (0.8 ** x) for x in range(epochs)]]
scheduler = LambdaLR(self.opt,
lr_lambda=[lambda x1: 0.9 ** x1, lambda x2: 0.8 ** x2])
self._test(scheduler, targets, epochs)
def _test(self, scheduler, targets, epochs=10):
for epoch in range(epochs):
scheduler.step(epoch)
for param_group, target in zip(self.opt.param_groups, targets):
self.assertAlmostEqual(target[epoch], param_group['lr'],
msg='LR is wrong in epoch {}: expected {}, got {}'.format(
epoch, target[epoch], param_group['lr']), delta=1e-5)
def _test_reduce_lr_on_plateau(self, scheduler, targets, metrics, epochs=10, verbose=False):
for epoch in range(epochs):
scheduler.step(metrics[epoch])
if verbose:
print('epoch{}:\tlr={}'.format(epoch, self.opt.param_groups[0]['lr']))
for param_group, target in zip(self.opt.param_groups, targets):
self.assertAlmostEqual(target[epoch], param_group['lr'],
msg='LR is wrong in epoch {}: expected {}, got {}'.format(
epoch, target[epoch], param_group['lr']), delta=1e-5)
if __name__ == '__main__':
run_tests()
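
The hand-written target lists in TestLRScheduler above are instances of the following closed forms, with \eta_0 the initial learning rate of a parameter group (0.05 and 0.5 for the two groups configured in setUp) and t the epoch index:

\text{StepLR } (\gamma = 0.1,\ s = 3): \quad \eta_t = \eta_0\,\gamma^{\lfloor t/s \rfloor}, \qquad
\text{ExponentialLR } (\gamma = 0.9): \quad \eta_t = \eta_0\,\gamma^{t},

\text{CosineAnnealingLR:} \quad \eta_t = \eta_{\min} + (\eta_0 - \eta_{\min})\,\frac{1 + \cos(\pi t / T_{\max})}{2},

while MultiStepLR multiplies the rate by \gamma at each listed milestone and ReduceLROnPlateau scales it by its factor only after the monitored metric has stopped improving for patience epochs.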

test/test_potrf.py (new file, 52 lines)
@@ -0,0 +1,52 @@
import torch
import numpy as np
from test_autograd import _make_cov
from torch.autograd import Variable
from common import TestCase, run_tests, skipIfNoLapack
from torch.autograd._functions.linalg import Potrf
class TestPotrf(TestCase):
def _calc_deriv_numeric(self, A, L, upper):
# numerical forward derivative
dA = Variable(_make_cov(5))
eps = 1e-6
outb = Potrf.apply(A + (eps / 2) * dA, upper)
outa = Potrf.apply(A - (eps / 2) * dA, upper)
dL = (outb - outa) / eps
return dA, dL
def _calc_deriv_sym(self, A, L, upper):
# reverse mode
Lbar = Variable(torch.rand(5, 5).tril())
if upper:
Lbar = Lbar.t()
L.backward(Lbar)
Abar = A.grad
return Abar, Lbar
def _check_total_variation(self, A, L, upper):
dA, dL = self._calc_deriv_numeric(A, L, upper)
Abar, Lbar = self._calc_deriv_sym(A, L, upper)
# compare df = Tr(dA^T Abar) = Tr(dL^T Lbar)
df1 = (dL * Lbar).sum()
df2 = (dA * Abar).sum()
atol = 1e-5
rtol = 1e-3
assert (df1 - df2).abs().data[0] <= atol + rtol * df1.abs().data[0]
@skipIfNoLapack
def test_potrf(self):
for upper in [True, False]:
A = Variable(_make_cov(5), requires_grad=True)
L = Potrf.apply(A, upper)
self._check_total_variation(A, L, upper)
if __name__ == '__main__':
run_tests()
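
_check_total_variation above verifies the reverse-mode Cholesky gradient against a central finite difference; the identity being tested is the standard trace pairing of perturbations with adjoints,

\mathrm{d}f = \operatorname{Tr}\!\big(\bar{A}^{\top}\,\mathrm{d}A\big) = \operatorname{Tr}\!\big(\bar{L}^{\top}\,\mathrm{d}L\big), \qquad
\mathrm{d}L \approx \frac{L\big(A + \tfrac{\varepsilon}{2}\,\mathrm{d}A\big) - L\big(A - \tfrac{\varepsilon}{2}\,\mathrm{d}A\big)}{\varepsilon},

where \bar{L} is the random upstream gradient passed to backward() and \bar{A} is the resulting A.grad; the assertion accepts agreement up to atol + rtol * |df1|.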

test/test_sparse.py (new file, 699 lines)
@@ -0,0 +1,699 @@
import torch
from torch import sparse
import itertools
import random
import unittest
from common import TestCase, run_tests
from common_nn import TEST_CUDA
from numbers import Number
def cpu_only(inner):
def outer(self, *args, **kwargs):
if self.is_cuda:
raise unittest.SkipTest("Test is CPU-only")
inner(self, *args, **kwargs)
return outer
def cuda_only(inner):
def outer(self, *args, **kwargs):
if not self.is_cuda:
raise unittest.SkipTest("Test is GPU-only")
inner(self, *args, **kwargs)
return outer
class TestSparse(TestCase):
def setUp(self):
# These parameters control the various ways we can run the test.
# We will subclass and override this method to implement CUDA
# tests
self.is_cuda = False
self.is_uncoalesced = False
self.IndexTensor = torch.LongTensor
self.ValueTensor = torch.DoubleTensor
self.SparseTensor = torch.sparse.DoubleTensor
def _gen_sparse(self, d, nnz, with_size):
# TODO: Consider implementing this in the CUDA case by directly
# performing the operations on the GPU. You won't be able to
# use torch.rand/torch.randn in this case because they are
# CPU-only. If you do this, you can remove the is_cuda branch
# at the end.
#
# If you do this, be sure to update assert_uncoalesced too
if isinstance(with_size, Number):
with_size = [with_size] * d
if self.is_uncoalesced:
# We want to generate a tensor with a lot of uncoalesced
# entries to stress test whether or not we handle this
# (subtle) case correctly
v_size = [nnz * 2] + list(with_size[d:])
v = torch.randn(*v_size)
r = torch.rand(d, nnz)
# Repeat the indexes, so every position shows up twice
i = torch.cat([r, r], dim=1) * \
torch.Tensor(with_size[:d]).repeat(nnz * 2, 1).transpose(0, 1)
i = i.type(torch.LongTensor)
x = torch.sparse.DoubleTensor(i, v, torch.Size(with_size))
self.assert_uncoalesced(x)
else:
# Generate a sparse tensor with d sparse dimensions; the
# rest of the dimensions with_size[d:] are dense.
v_size = [nnz] + list(with_size[d:])
v = torch.randn(*v_size)
i = torch.rand(d, nnz) * \
torch.Tensor(with_size[:d]).repeat(nnz, 1).transpose(0, 1)
i = i.type(torch.LongTensor)
x = torch.sparse.DoubleTensor(i, v, torch.Size(with_size))
if self.is_cuda:
return x.cuda(), i.cuda(), v.cuda()
else:
return x, i.clone(), v.clone()
def assert_uncoalesced(self, x):
"""
Test if a CPU tensor is uncoalesced. This is used to ensure
correctness of the uncoalesced tensor generation algorithm.
"""
assert not x.is_coalesced()
# Strategy: construct a new sparse tensor with the raw value
# field overwritten to a tensor of ones, coalesce it, and then
# check if any value entries are > 1 (which indicates that the
# original was uncoalesced.)
i = x._indices().clone()
v = x._values().clone().fill_(1)
y = torch.sparse.DoubleTensor(i, v, x.size())
z = self.safeCoalesce(y)
assert (z._values() > 1).sum() > 0
def randn(self, *args, **kwargs):
"""
Variant of torch.randn that also works in the TEST_CUDA case.
"""
# TODO: Put this in torch.cuda.randn
return self.ValueTensor(*args, **kwargs).normal_()
def test_basic(self):
x, i, v = self._gen_sparse(3, 10, 100)
self.assertEqual(i, x._indices())
self.assertEqual(v, x._values())
x, i, v = self._gen_sparse(3, 10, [100, 100, 100])
self.assertEqual(i, x._indices())
self.assertEqual(v, x._values())
self.assertEqual(x.ndimension(), 3)
self.assertEqual(self.safeCoalesce(x)._nnz(), 10)
for i in range(3):
self.assertEqual(x.size(i), 100)
# Make sure that coalesce handles duplicate indices correctly
i = self.IndexTensor([[9, 0, 0, 0, 8, 1, 1, 1, 2, 7, 2, 2, 3, 4, 6, 9]])
v = self.ValueTensor([[idx**2, idx] for idx in range(i.size(1))])
x = self.SparseTensor(i, v, torch.Size([10, 2]))
self.assertEqual(self.safeCoalesce(x)._nnz(), 9)
# Make sure we can access empty indices / values
x = self.SparseTensor()
self.assertEqual(x._indices().numel(), 0)
self.assertEqual(x._values().numel(), 0)
def test_to_dense(self):
i = self.IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
[0, 0, 1, 4],
])
v = self.ValueTensor([2, 1, 3, 4])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5]))
res = self.ValueTensor([
[[2, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[0, 3, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 4]],
])
x.to_dense() # Tests double to_dense for memory corruption
x.to_dense()
x.to_dense()
self.assertEqual(res, x.to_dense())
def test_shared(self):
i = self.IndexTensor([[2]])
v = self.ValueTensor([5])
x = self.SparseTensor(i, v, torch.Size([3]))
v[0] = 6
self.assertEqual(self.ValueTensor([0, 0, 6]), x.to_dense())
i[0][0] = 0
self.assertEqual(self.ValueTensor([6, 0, 0]), x.to_dense())
def test_to_dense_hybrid(self):
i = self.IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
])
v = self.ValueTensor([[2, 3], [1, 2], [3, 4], [4, 5]])
x = self.SparseTensor(i, v, torch.Size([3, 4, 2]))
res = self.ValueTensor([
[[2, 3],
[0, 0],
[0, 0],
[0, 0]],
[[1, 2],
[0, 0],
[0, 0],
[0, 0]],
[[3, 4],
[0, 0],
[0, 0],
[4, 5]],
])
x.to_dense() # Tests double to_dense for memory corruption
x.to_dense()
x.to_dense()
self.assertEqual(res, x.to_dense())
def test_contig(self):
i = self.IndexTensor([
[1, 0, 35, 14, 39, 6, 71, 66, 40, 27],
[92, 31, 62, 50, 22, 65, 89, 74, 56, 34],
])
v = self.ValueTensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
x = self.SparseTensor(i, v, torch.Size([100, 100]))
exp_i = self.IndexTensor([
[0, 1, 6, 14, 27, 35, 39, 40, 66, 71],
[31, 92, 65, 50, 34, 62, 22, 56, 74, 89],
])
exp_v = self.ValueTensor([2, 1, 6, 4, 10, 3, 5, 9, 8, 7])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
i = self.IndexTensor([
[2, 0, 2, 1],
[0, 0, 3, 0],
[1, 0, 4, 0],
])
v = self.ValueTensor([3, 2, 4, 1])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5]))
exp_i = self.IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
[0, 0, 1, 4],
])
exp_v = self.ValueTensor([2, 1, 3, 4])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
# Duplicate indices
i = self.IndexTensor([
[0, 0, 2, 0],
[0, 0, 3, 0],
[0, 0, 4, 0],
])
v = self.ValueTensor([3, 2, 4, 1])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5]))
exp_i = self.IndexTensor([
[0, 2],
[0, 3],
[0, 4],
])
exp_v = self.ValueTensor([6, 4])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
def test_contig_hybrid(self):
i = self.IndexTensor([
[1, 0, 35, 14, 39, 6, 71, 66, 40, 27],
[92, 31, 62, 50, 22, 65, 89, 74, 56, 34],
])
v = self.ValueTensor([
[1, 2], [2, 3], [3, 4], [4, 5], [5, 6],
[6, 7], [7, 8], [8, 9], [9, 10], [10, 11],
])
x = self.SparseTensor(i, v, torch.Size([100, 100, 2]))
exp_i = self.IndexTensor([
[0, 1, 6, 14, 27, 35, 39, 40, 66, 71],
[31, 92, 65, 50, 34, 62, 22, 56, 74, 89],
])
exp_v = self.ValueTensor([
[2, 3], [1, 2], [6, 7], [4, 5], [10, 11],
[3, 4], [5, 6], [9, 10], [8, 9], [7, 8],
])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
i = self.IndexTensor([
[2, 0, 2, 1],
[0, 0, 3, 0],
[1, 0, 4, 0],
])
v = self.ValueTensor([[3, 3, 3], [2, 2, 2], [4, 4, 4], [1, 1, 1]])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5, 3]))
exp_i = self.IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
[0, 0, 1, 4],
])
exp_v = self.ValueTensor([[2, 2, 2], [1, 1, 1], [3, 3, 3], [4, 4, 4]])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
# Duplicate indices
i = self.IndexTensor([
[0, 0, 2, 0],
[0, 0, 3, 0],
[0, 0, 4, 0],
])
v = self.ValueTensor([[3, 2, 3], [2, 1, 1], [4, 3, 4], [1, 1, 1]])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5, 3]))
exp_i = self.IndexTensor([
[0, 2],
[0, 3],
[0, 4],
])
exp_v = self.ValueTensor([[6, 4, 5], [4, 3, 4]])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
def test_clone(self):
x, _, _ = self._gen_sparse(4, 20, 5)
if self.is_uncoalesced:
self.assertFalse(x.is_coalesced())
y = x.clone()
self.assertFalse(y.is_coalesced())
x = x.coalesce()
self.assertTrue(x.is_coalesced())
y = x.clone()
self.assertTrue(y.is_coalesced())
def test_transpose(self):
x = self._gen_sparse(4, 20, 5)[0]
y = x.to_dense()
for i, j in itertools.combinations(range(4), 2):
x = x.transpose_(i, j)
y = y.transpose(i, j)
self.assertEqual(x.to_dense(), y)
x = x.transpose(i, j)
y = y.transpose(i, j)
self.assertEqual(x.to_dense(), y)
@cpu_only
def test_mm(self):
def test_shape(di, dj, dk):
x, _, _ = self._gen_sparse(2, 20, [di, dj])
t = torch.randn(di, dk)
y = torch.randn(dj, dk)
alpha = random.random()
beta = random.random()
res = torch.addmm(alpha, t, beta, x, y)
expected = torch.addmm(alpha, t, beta, x.to_dense(), y)
self.assertEqual(res, expected)
res = torch.addmm(t, x, y)
expected = torch.addmm(t, x.to_dense(), y)
self.assertEqual(res, expected)
res = torch.mm(x, y)
expected = torch.mm(x.to_dense(), y)
self.assertEqual(res, expected)
test_shape(10, 100, 100)
test_shape(100, 1000, 200)
test_shape(64, 10000, 300)
@cpu_only
def test_saddmm(self):
def test_shape(di, dj, dk):
x = self._gen_sparse(2, 20, [di, dj])[0]
t = self._gen_sparse(2, 20, [di, dk])[0]
y = torch.randn(dj, dk)
alpha = random.random()
beta = random.random()
res = torch.saddmm(alpha, t, beta, x, y)
expected = torch.addmm(alpha, t.to_dense(), beta, x.to_dense(), y)
self.assertEqual(res.to_dense(), expected)
res = torch.saddmm(t, x, y)
expected = torch.addmm(t.to_dense(), x.to_dense(), y)
self.assertEqual(res.to_dense(), expected)
res = torch.smm(x, y)
expected = torch.mm(x.to_dense(), y)
self.assertEqual(res.to_dense(), expected)
test_shape(7, 5, 3)
test_shape(1000, 100, 100)
test_shape(3000, 64, 300)
def test_dsmm(self):
def test_shape(di, dj, dk):
x = self._gen_sparse(2, 20, [di, dj])[0]
y = self.randn(dj, dk)
res = torch.dsmm(x, y)
expected = torch.mm(x.to_dense(), y)
self.assertEqual(res, expected)
test_shape(7, 5, 3)
test_shape(1000, 100, 100)
test_shape(3000, 64, 300)
def test_hsmm(self):
def test_shape(di, dj, dk):
x = self._gen_sparse(2, 20, [di, dj])[0]
y = self.randn(dj, dk)
res = torch.hsmm(x, y)
expected = torch.mm(x.to_dense(), y)
self.assertEqual(res.to_dense(), expected)
test_shape(7, 5, 3)
test_shape(1000, 100, 100)
test_shape(3000, 64, 300)
def _test_spadd_shape(self, shape_i, shape_v=None):
shape = shape_i + (shape_v or [])
x, _, _ = self._gen_sparse(len(shape_i), 10, shape)
y = self.randn(*shape)
r = random.random()
res = torch.add(y, r, x)
expected = y + r * x.to_dense()
self.assertEqual(res, expected)
# Non contiguous dense tensor
s = list(shape)
s[0] = shape[-1]
s[-1] = shape[0]
y = self.randn(*s)
y.transpose_(0, len(s) - 1)
r = random.random()
res = torch.add(y, r, x)
expected = y + r * x.to_dense()
self.assertEqual(res, expected)
def test_spadd(self):
self._test_spadd_shape([5, 6])
self._test_spadd_shape([10, 10, 10])
self._test_spadd_shape([50, 30, 20])
self._test_spadd_shape([5, 5, 5, 5, 5, 5])
def test_spadd_hybrid(self):
self._test_spadd_shape([5, 6], [2, 3])
self._test_spadd_shape([10, 10, 10], [3])
self._test_spadd_shape([50, 30, 20], [2])
self._test_spadd_shape([5, 5, 5, 5, 5, 5], [2])
def _test_basic_ops_shape(self, shape_i, shape_v=None):
shape = shape_i + (shape_v or [])
x1, _, _ = self._gen_sparse(len(shape_i), 9, shape)
x2, _, _ = self._gen_sparse(len(shape_i), 12, shape)
y1 = x1 + x2
y2 = x1.clone()
y2.add_(x2)
expected = x1.to_dense() + x2.to_dense()
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y1 = x1 - x2
y2 = x1.clone()
y2.sub_(x2)
expected = x1.to_dense() - x2.to_dense()
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y1 = x1 * x2
y2 = x1.clone()
y2.mul_(x2)
expected = x1.to_dense() * x2.to_dense()
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y1 = x1 * 37.5
y2 = x1.clone()
y2.mul_(37.5)
expected = x1.to_dense() * 37.5
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y1 = x1 / 37.5
y2 = x1.clone()
y2.div_(37.5)
expected = x1.to_dense() / 37.5
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
# TODO: add back inplace support
y1 = x1 ** 2
y2 = x1.clone()
y2 = y2.pow(2)
expected = x1.to_dense() ** 2
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
y = x1.clone()
y.zero_()
expected = torch.zeros(x1.size())
self.assertEqual(y.to_dense(), expected)
self.assertFalse(x1.is_coalesced())
y = x1.coalesce()
z = x1.coalesce()
self.assertFalse(x1.is_coalesced())
self.assertTrue(y.is_coalesced())
self.assertEqual(x1, y)
# check that coalesce is out of place
y._values().add_(1)
self.assertEqual(z._values() + 1, y._values())
def test_basic_ops(self):
self._test_basic_ops_shape([5, 6])
self._test_basic_ops_shape([10, 10, 10])
self._test_basic_ops_shape([50, 30, 20])
self._test_basic_ops_shape([5, 5, 5, 5, 5, 5])
def test_basic_ops_hybrid(self):
self._test_basic_ops_shape([5, 6], [2, 3])
self._test_basic_ops_shape([10, 10, 10], [3])
self._test_basic_ops_shape([50, 30, 20], [2])
self._test_basic_ops_shape([5, 5, 5, 5, 5, 5], [2])
def _test_sparse_mask_shape(self, shape_i, shape_v=None):
shape = shape_i + (shape_v or [])
x1, _, _ = self._gen_sparse(len(shape_i), 9, shape)
x2, _, _ = self._gen_sparse(len(shape_i), 12, shape)
y1 = x1 + x2
y2 = x1.clone()
y2.add_(x2)
expected = x1.to_dense() + x2.to_dense()
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
def _test_sparse_mask_fixed(self):
i = self.IndexTensor([
[1, 3, 0, 4],
[2, 1, 2, 3],
])
v = self.ValueTensor([1, 2, 3, 4])
x = self.SparseTensor(i, v, torch.Size([5, 4])).coalesce()
dense = self.ValueTensor([
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20],
])
exp_v = self.ValueTensor([7, 14, 3, 20])
res = dense._sparse_mask(x)
expected = self.SparseTensor(i, exp_v, torch.Size([5, 4]))
self.assertEqual(res, expected)
def test_sparse_mask(self):
self._test_sparse_mask_fixed()
self._test_sparse_mask_shape([5, 6])
self._test_sparse_mask_shape([10, 10, 10])
self._test_sparse_mask_shape([50, 30, 20])
self._test_sparse_mask_shape([5, 5, 5, 5, 5, 5])
def _test_zeros(self, shape, out_shape_i, out_shape_v=None):
out_shape = out_shape_i + (out_shape_v or [])
for nnz in [9, 12]:
out, _, _ = self._gen_sparse(len(out_shape_i), nnz, out_shape)
torch.zeros(*shape, out=out)
self.assertEqual(tuple(out.size()), tuple(shape))
self.assertTrue(out._indices().numel() == out._values().numel() == 0)
self.assertEqual(out._nnz(), 0)
self.assertEqual(out._dimI(), len(shape))
self.assertEqual(out._dimV(), 0)
def test_zeros(self):
i_shapes = [2, 3, 4]
v_shapes = [3, 4, 5, 6]
for i_dim in range(1, len(i_shapes) + 1):
for v_dim in range(len(v_shapes) + 1):
self._test_zeros([2, 3, 4], i_shapes[:i_dim], v_shapes[:v_dim])
def _test_zeros_like(self, template_shape_i, template_shape_v=None):
template_shape_v = template_shape_v or []
template_shape = template_shape_i + template_shape_v
for nnz in [9, 12]:
t, _, _ = self._gen_sparse(len(template_shape_i), nnz, template_shape)
res = torch.zeros_like(t)
self.assertEqual(tuple(res.size()), tuple(template_shape))
self.assertTrue(res._indices().numel() == res._values().numel() == 0)
self.assertEqual(res._nnz(), 0)
self.assertEqual(res._dimI(), len(template_shape_i))
self.assertEqual(res._dimV(), len(template_shape_v))
def test_zeros_like(self):
i_shapes = [2, 3, 4]
v_shapes = [3, 4, 5, 6]
for i_dim in range(1, len(i_shapes) + 1):
for v_dim in range(len(v_shapes) + 1):
self._test_zeros_like(i_shapes[:i_dim], v_shapes[:v_dim])
def _test_sparse_mask_hybrid_fixed(self):
i = self.IndexTensor([
[1, 3, 0, 4],
[2, 1, 2, 3],
])
v = self.ValueTensor([[1, 2], [2, 3], [3, 4], [4, 5]])
# TODO: This is also testing that, if coalesce is a no-op,
# the indices don't get permuted. I don't know if we actually
# want to guarantee this invariant.
x = self.SparseTensor(i, v, torch.Size([5, 4, 2])).coalesce()
dense = self.ValueTensor([
[[1, 3], [2, 2], [3, 3], [4, 2]],
[[5, 7], [6, 7], [7, 9], [8, 9]],
[[9, 2], [10, 4], [11, 1], [12, 3]],
[[13, 5], [14, 1], [15, 1], [16, 6]],
[[17, 7], [18, 2], [19, 7], [20, 1]],
])
res = dense._sparse_mask(x)
exp_v = self.ValueTensor([[7, 9], [14, 1], [3, 3], [20, 1]])
expected = self.SparseTensor(i, exp_v, torch.Size([5, 4, 2]))
self.assertEqual(res, expected)
def test_sparse_mask_hybrid(self):
self._test_sparse_mask_hybrid_fixed()
self._test_sparse_mask_shape([5, 6], [2, 3])
self._test_sparse_mask_shape([10, 10, 10], [3])
self._test_sparse_mask_shape([50, 30, 20], [2])
self._test_sparse_mask_shape([5, 5, 5, 5, 5, 5], [2])
def test_sparse_add_coalesce(self):
i = self.IndexTensor([[1, 2, 1]])
v = self.ValueTensor([3, 4, 5])
x = self.SparseTensor(i, v, torch.Size([3]))
y = self.SparseTensor(i, v, torch.Size([3]))
z = x + y
self.assertFalse(z._indices().numel() != 2 and z.is_coalesced())
@cuda_only
def test_storage_not_null(self):
x = torch.cuda.sparse.FloatTensor(2)
self.assertNotEqual(x.get_device(), -1)
@cuda_only
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_same_gpu(self):
i = self.IndexTensor([[2]]).cuda(1)
v = self.ValueTensor([5]).cuda(1)
x = self.SparseTensor(i, v, torch.Size([3]), device=1)
self.assertEqual(x.get_device(), 1)
self.assertEqual(x._values().get_device(), 1)
self.assertEqual(x._indices().get_device(), 1)
x = self.SparseTensor(3, device=1)
self.assertEqual(x.get_device(), 1)
self.assertEqual(x._values().get_device(), 1)
self.assertEqual(x._indices().get_device(), 1)
v = self.ValueTensor([5]).cuda(0)
self.assertRaises(RuntimeError, lambda: self.SparseTensor(i, v, torch.Size([3])))
def _test_new_device(self, size, device):
with torch.cuda.device(device):
x = torch.cuda.sparse.DoubleTensor(*size)
self.assertEqual(x.get_device(), device)
x1 = x.new()
x2 = x.new(2, 3)
self.assertEqual(x1.get_device(), device)
self.assertEqual(x2.get_device(), device)
@cuda_only
def test_new_device_single_gpu(self):
self._test_new_device((), 0)
self._test_new_device((30, 20), 0)
self._test_new_device((30, 20, 10), 0)
@cuda_only
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_new_device_multi_gpu(self):
self._test_new_device((), 1)
self._test_new_device((30, 20), 1)
self._test_new_device((30, 20, 10), 1)
class TestUncoalescedSparse(TestSparse):
def setUp(self):
super(TestUncoalescedSparse, self).setUp()
self.is_uncoalesced = True
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
class TestCudaSparse(TestSparse):
def setUp(self):
super(TestCudaSparse, self).setUp()
self.is_cuda = True
self.IndexTensor = torch.cuda.LongTensor
self.ValueTensor = torch.cuda.DoubleTensor
self.SparseTensor = torch.cuda.sparse.DoubleTensor
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
class TestCudaUncoalescedSparse(TestCudaSparse):
def setUp(self):
super(TestCudaUncoalescedSparse, self).setUp()
self.is_uncoalesced = True
if __name__ == '__main__':
run_tests()
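
As a compact illustration of the coalescing behaviour that test_basic, test_contig and assert_uncoalesced above revolve around, here is a standalone snippet using the same legacy constructor and accessors as the tests (torch.sparse.DoubleTensor, _indices(), _values()); coalesce() sums duplicate indices and sorts the index order:

import torch

i = torch.LongTensor([[2, 0, 0]])   # index 0 appears twice, so x is uncoalesced
v = torch.DoubleTensor([3.0, 1.0, 2.0])
x = torch.sparse.DoubleTensor(i, v, torch.Size([3]))

y = x.coalesce()                    # out of place: x itself stays uncoalesced
print(y._indices())                 # [[0, 2]]
print(y._values())                  # [3.0, 3.0]  (1.0 + 2.0 merged at index 0)
print(y.to_dense())                 # [3.0, 0.0, 3.0]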

(Some files were not shown because too many files have changed in this diff.)