Compare commits


3 Commits

SHA1        Message                          Date
ccd5f4dbfc  version bump                     2017-05-01 15:55:29 -04:00
3cc21b5a46  fix OSX build                    2017-04-29 09:29:21 -04:00
27fb8750ad  fix NCCL makefile for CUDA 7.5   2017-04-28 20:08:07 -04:00
731 changed files with 12115 additions and 62664 deletions

.gitignore

@ -5,7 +5,6 @@ torch.egg-info/
torch/version.py
torch/csrc/generic/TensorMethods.cpp
torch/lib/*.so*
torch/lib/*.a*
torch/lib/*.dylib*
torch/lib/*.h
torch/lib/build
@ -20,7 +19,6 @@ torch/csrc/nn/THCUNN.cpp
torch/csrc/nn/THNN_generic.cwrap
torch/csrc/nn/THNN_generic.cpp
torch/csrc/nn/THNN_generic.h
torch/csrc/generated
docs/src/**/*
test/data/legacy_modules.t7
test/data/gpu_tensors.pt
@ -35,17 +33,3 @@ test/.coverage
*/**/*.so*
*/**/*.dylib*
test/data/legacy_serialized.pt
test/data/linear.pt
# IPython notebook checkpoints
.ipynb_checkpoints
# Editor temporaries
*.swn
*.swo
*.swp
*~
# OSX dir files
.DS_Store


@ -1,8 +1,7 @@
# https://travis-ci.org/pytorch/pytorch
language: python
dist: trusty
python:
- 2.7.9
- 2.7.8
- 2.7
- 3.5
- 3.6


@ -44,9 +44,7 @@ https://github.com/pytorch/pytorch#from-source
The change you have to make is to replace
```
python setup.py install
```
`python setup.py install`
with
@ -63,73 +61,18 @@ Hence, if you modify a python file, you do not need to reinstall pytorch again a
For example:
- Install local pytorch in `build develop` mode
- modify your python file `torch/__init__.py` (for example)
- modify your python file torch/__init__.py (for example)
- test functionality
- modify your python file `torch/__init__.py`
- modify your python file torch/__init__.py
- test functionality
- modify your python file `torch/__init__.py`
- modify your python file torch/__init__.py
- test functionality
You do not need to repeatedly install after modifying python files.
#### C++ Development tips
## Writing documentation
PyTorch uses [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html)
for formatting docstrings. Lines inside a docstring block must be limited to 80 characters so they
fit into Jupyter documentation popups.
## Managing multiple build trees
One downside to using `python setup.py develop` is that your development
version of pytorch will be installed globally on your account (e.g., if
you run `import torch` anywhere else, the development version will be
used).
If you want to manage multiple builds of PyTorch, you can make use of
[conda environments](https://conda.io/docs/using/envs.html) to maintain
separate Python package environments, each of which can be tied to a
specific build of PyTorch. To set one up:
```
conda create -n pytorch-myfeature
source activate pytorch-myfeature
# if you run python now, torch will NOT be installed
python setup.py build develop
```
## C++ Development tips
If you are working on the C++ code, there are a few important things that you
will want to keep in mind:
1. How to rebuild only the code you are working on, and
2. How to make rebuilds in the absence of changes go faster.
### Build only what you need.
`python setup.py build` will build everything, but since our build system is
not very optimized for incremental rebuilds, this will actually be very slow.
Far better is to only request rebuilds of the parts of the project you are
working on:
- Working on `torch/csrc`? Run `python setup.py develop` to rebuild
(NB: no `build` here!)
- Working on `torch/lib/TH`, did not make any cmake changes, and just want to
see if it compiles? Run `(cd torch/lib/build/TH && make install -j$(getconf _NPROCESSORS_ONLN))`. This
applies for any other subdirectory of `torch/lib`. **Warning: Changes you
make here will not be visible from Python.** See below.
- Working on `torch/lib` and want to run your changes / rerun cmake? Run
`python setup.py build_deps`. Note that this will rerun cmake for
every subdirectory in TH; if you are only working on one project,
consider editing `torch/lib/build_all.sh` and commenting out the
`build` lines of libraries you are not working on.
On the initial build, you can also speed things up with the environment
variables `DEBUG` and `NO_CUDA`.
When you are developing on the C++ side of things, the environment variables `DEBUG` and `NO_CUDA` are helpful.
- `DEBUG=1` will enable debug builds (-g -O0)
- `NO_CUDA=1` will disable compiling CUDA (in case you are developing on something not CUDA related), to save compile time.
@ -139,15 +82,7 @@ For example:
NO_CUDA=1 DEBUG=1 python setup.py build develop
```
Make sure you continue to pass these flags on subsequent builds.
### Make no-op build fast.
Python `setuptools` is pretty dumb, and always rebuilds every C file in a
project. Using ccache in a situation like this is a real time-saver. However, by
default, ccache does not properly support CUDA stuff, so here are the
instructions for installing a custom `ccache` fork that has CUDA support:
Also, if you are developing a lot, using ccache is a real time-saver. By default, ccache does not properly support CUDA stuff, so here are the instructions for installing a custom `ccache` fork that has CUDA support:
```
# install and export ccache
if ! ls ~/ccache/bin/ccache


@ -1,16 +1,18 @@
FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
FROM nvidia/cuda:8.0-devel-ubuntu16.04
RUN echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list
ENV CUDNN_VERSION 6.0.20
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
git \
curl \
vim \
ca-certificates \
libjpeg-dev \
libpng-dev &&\
libpng-dev \
libcudnn6=$CUDNN_VERSION-1+cuda8.0 \
libcudnn6-dev=$CUDNN_VERSION-1+cuda8.0 && \
rm -rf /var/lib/apt/lists/*
RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh && \
@ -28,9 +30,7 @@ COPY . .
RUN TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1+PTX" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
pip install -v .
RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v .
python setup.py install
WORKDIR /workspace
RUN chmod -R a+w /workspace


@ -2,30 +2,29 @@
--------------------------------------------------------------------------------
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system
PyTorch is a python package that provides two high-level features:
- Tensor computation (like numpy) with strong GPU acceleration
- Deep Neural Networks built on a tape-based autograd system
You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed.
You can reuse your favorite python packages such as numpy, scipy and Cython to extend PyTorch when needed.
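For instance, a minimal sketch of the NumPy bridge (nothing beyond `torch.from_numpy` and `Tensor.numpy` is assumed):

```python
import numpy as np
import torch

a = np.ones((2, 3))
t = torch.from_numpy(a)   # the Tensor and the ndarray share the same memory
t.mul_(2)                 # in-place ops on the Tensor are reflected in `a`
b = t.numpy()             # and back again, without copying
```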
We are in an early-release beta. Expect some adventures and rough edges.
We are in an early-release Beta. Expect some adventures and rough edges.
- [More about PyTorch](#more-about-pytorch)
- [More About PyTorch](#more-about-pytorch)
- [Installation](#installation)
- [Binaries](#binaries)
- [From Source](#from-source)
- [Docker Image](#docker-image)
- [From source](#from-source)
- [Docker image](#docker-image)
- [Getting Started](#getting-started)
- [Communication](#communication)
- [Releases and Contributing](#releases-and-contributing)
- [The Team](#the-team)
| System | 2.7 | 3.5 |
| System | Python | Status |
| --- | --- | --- |
| Linux CPU | [![Build Status](https://travis-ci.org/pytorch/pytorch.svg?branch=master)](https://travis-ci.org/pytorch/pytorch) | [![Build Status](https://travis-ci.org/pytorch/pytorch.svg?branch=master)](https://travis-ci.org/pytorch/pytorch) |
| Linux GPU | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py2-linux)](https://build.pytorch.org/job/pytorch-master-py2-linux) | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py3-linux)](https://build.pytorch.org/job/pytorch-master-py3-linux) |
| macOS CPU | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py2-osx-cpu)](https://build.pytorch.org/job/pytorch-master-py2-osx-cpu) | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py3-osx-cpu)](https://build.pytorch.org/job/pytorch-master-py3-osx-cpu) |
| Linux CPU | 2.7.8, 2.7, 3.5, nightly | [![Build Status](https://travis-ci.org/pytorch/pytorch.svg?branch=master)](https://travis-ci.org/pytorch/pytorch) |
| Linux GPU | 2.7 | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py2)](https://build.pytorch.org/job/pytorch-master-py2) |
| Linux GPU | 3.5 | [![Build Status](http://build.pytorch.org:8080/buildStatus/icon?job=pytorch-master-py3)](https://build.pytorch.org/job/pytorch-master-py3) |
## More about PyTorch
@ -38,7 +37,7 @@ At a granular level, PyTorch is a library that consists of the following compone
</tr>
<tr>
<td><b> torch.autograd </b></td>
<td> a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch </td>
<td> a tape based automatic differentiation library that supports all differentiable Tensor operations in torch </td>
</tr>
<tr>
<td><b> torch.nn </b></td>
@ -46,7 +45,7 @@ At a granular level, PyTorch is a library that consists of the following compone
</tr>
<tr>
<td><b> torch.multiprocessing </b></td>
<td> Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training. </td>
<td> python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and hogwild training. </td>
</tr>
<tr>
<td><b> torch.utils </b></td>
@ -60,14 +59,14 @@ At a granular level, PyTorch is a library that consists of the following compone
Usually one uses PyTorch either as:
- a replacement for NumPy to use the power of GPUs.
- A replacement for numpy to use the power of GPUs.
- a deep learning research platform that provides maximum flexibility and speed
Elaborating further:
### A GPU-Ready Tensor Library
### A GPU-ready Tensor library
If you use NumPy, then you have used Tensors (a.k.a ndarray).
If you use numpy, then you have used Tensors (a.k.a ndarray).
<p align=center><img width="30%" src="docs/source/_static/img/tensor_illustration.png" /></p>
@ -78,15 +77,15 @@ We provide a wide variety of tensor routines to accelerate and fit your scientif
such as slicing, indexing, math operations, linear algebra, reductions.
And they are fast!
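A small, illustrative sketch of a few such routines (GPU use is optional and guarded):

```python
import torch

x = torch.rand(5, 3)
y = torch.rand(5, 3)

z = x + y                # element-wise math
col = x[:, 1]            # indexing and slicing
s = x.sum(0)             # reductions
m = torch.mm(x, y.t())   # linear algebra

if torch.cuda.is_available():
    m = torch.mm(x.cuda(), y.cuda().t())   # the same routines run on the GPU
```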
### Dynamic Neural Networks: Tape-Based Autograd
### Dynamic Neural Networks: Tape based Autograd
PyTorch has a unique way of building neural networks: using and replaying a tape recorder.
Most frameworks such as TensorFlow, Theano, Caffe and CNTK have a static view of the world.
Most frameworks such as `TensorFlow`, `Theano`, `Caffe` and `CNTK` have a static view of the world.
One has to build a neural network, and reuse the same structure again and again.
Changing the way the network behaves means that one has to start from scratch.
With PyTorch, we use a technique called reverse-mode auto-differentiation, which allows you to
With PyTorch, we use a technique called Reverse-mode auto-differentiation, which allows you to
change the way your network behaves arbitrarily with zero lag or overhead. Our inspiration comes
from several research papers on this topic, as well as current and past work such as
[autograd](https://github.com/twitter/torch-autograd),
@ -98,45 +97,45 @@ You get the best of speed and flexibility for your crazy research.
<p align=center><img width="80%" src="docs/source/_static/img/dynamic_graph.gif" /></p>
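As a rough sketch of this define-by-run style (using the `Variable` API from `torch.autograd`), ordinary Python control flow shapes the graph on every run:

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(3), requires_grad=True)
y = x * 2
while y.data.norm() < 100:   # the loop itself becomes part of this run's graph
    y = y * 2
y.sum().backward()           # gradients flow through however many iterations ran
print(x.grad)
```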
### Python First
### Python first
PyTorch is not a Python binding into a monolithic C++ framework.
PyTorch is not a Python binding into a monolothic C++ framework.
It is built to be deeply integrated into Python.
You can use it naturally like you would use NumPy / SciPy / scikit-learn etc.
You can use it naturally like you would use numpy / scipy / scikit-learn etc.
You can write your new neural network layers in Python itself, using your favorite libraries
and use packages such as Cython and Numba.
Our goal is to not reinvent the wheel where appropriate.
### Imperative Experiences
### Imperative experiences
PyTorch is designed to be intuitive, linear in thought and easy to use.
When you execute a line of code, it gets executed. There isn't an asynchronous view of the world.
When you drop into a debugger, or receive error messages and stack traces, understanding them is straightforward.
The stack trace points to exactly where your code was defined.
When you drop into a debugger, or receive error messages and stack traces, understanding them is straight-forward.
The stack-trace points to exactly where your code was defined.
We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines.
### Fast and Lean
PyTorch has minimal framework overhead. We integrate acceleration libraries
such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed.
At the core, its CPU and GPU Tensor and neural network backends
PyTorch has minimal framework overhead. We integrate acceleration libraries
such as Intel MKL and NVIDIA (CuDNN, NCCL) to maximize speed.
At the core, its CPU and GPU Tensor and Neural Network backends
(TH, THC, THNN, THCUNN) are written as independent libraries with a C99 API.
They are mature and have been tested for years.
Hence, PyTorch is quite fast whether you run small or large neural networks.
Hence, PyTorch is quite fast -- whether you run small or large neural networks.
The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives.
We've written custom memory allocators for the GPU to make sure that
your deep learning models are maximally memory efficient.
This enables you to train bigger deep learning models than before.
### Extensions without Pain
### Extensions without pain
Writing new neural network modules, or interfacing with PyTorch's Tensor API was designed to be straightforward
Writing new neural network modules, or interfacing with PyTorch's Tensor API was designed to be straight-forward
and with minimal abstractions.
You can write new neural network layers in Python using the torch API
[or your favorite NumPy-based libraries such as SciPy](http://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html).
[or your favorite numpy based libraries such as SciPy](http://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html).
If you want to write your layers in C/C++, we provide an extension API based on
[cffi](http://cffi.readthedocs.io/en/latest/) that is efficient and with minimal boilerplate.
@ -150,16 +149,16 @@ Commands to install from binaries via Conda or pip wheels are on our website:
[http://pytorch.org](http://pytorch.org)
### From Source
### From source
If you are installing from source, we highly recommend installing an [Anaconda](https://www.continuum.io/downloads) environment.
You will get a high-quality BLAS library (MKL) and you get a controlled compiler version regardless of your Linux distro.
Once you have [Anaconda](https://www.continuum.io/downloads) installed, here are the instructions.
Once you have [anaconda](https://www.continuum.io/downloads) installed, here are the instructions.
If you want to compile with CUDA support, install
- [NVIDIA CUDA](https://developer.nvidia.com/cuda-downloads) 7.5 or above
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v5.x or above
- [NVIDIA CuDNN](https://developer.nvidia.com/cudnn) v5.x or above
If you want to disable CUDA support, export environment variable `NO_CUDA=1`.
@ -167,7 +166,7 @@ If you want to disable CUDA support, export environment variable `NO_CUDA=1`.
On Linux
```bash
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]
export CMAKE_PREFIX_PATH=[anaconda root directory]
# Install basic dependencies
conda install numpy pyyaml mkl setuptools cmake gcc cffi
@ -197,21 +196,15 @@ MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install
Dockerfile is supplied to build images with cuda support and cudnn v6. Build as usual
```
docker build -t pytorch .
docker build -t pytorch-cudnnv6 .
```
Alternatively, if you want a runtime image, build with
and run with nvidia-docker:
```
docker build -t pytorch . -f tools/docker/Dockerfile_runtime
nvidia-docker run --rm -ti --ipc=host pytorch-cudnnv6
```
and run with nvidia-docker:
```
nvidia-docker run --rm -ti --ipc=host pytorch
```
Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g.
Please note that pytorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g.
for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you
should increase shared memory size either with `--ipc=host` or `--shm-size` command line options to `nvidia-docker run`.
should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.
## Getting Started
@ -223,13 +216,13 @@ Three pointers to get you started:
## Communication
* forums: discuss implementations, research, etc. http://discuss.pytorch.org
* GitHub issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.
* Slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . If you need a slack invite, ping us at soumith@pytorch.org
* github issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.
* slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . If you need a slack invite, ping us at soumith@pytorch.org
* newsletter: no-noise, one-way email newsletter with important announcements about pytorch. You can sign-up here: http://eepurl.com/cbG0rv
## Releases and Contributing
PyTorch has a 90 day release cycle (major releases).
Its current state is Beta; we expect no obvious bugs. Please let us know if you encounter a bug by [filing an issue](https://github.com/pytorch/pytorch/issues).
We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.


@ -112,7 +112,3 @@ footer p {
nav .hidden-section {
display: inherit;
}
.wy-side-nav-search>div.version {
color: #000;
}


@ -9,8 +9,6 @@ Automatic differentiation package - torch.autograd
.. autofunction:: backward
.. autofunction:: grad
Variable
--------
@ -40,8 +38,8 @@ All :class:`Variable` s keep track of in-place operations applied to them, and
if the implementation detects that a variable was saved for backward in one of
the functions, but it was modified in-place afterwards, an error will be raised
once backward pass is started. This ensures that if you're using in-place
functions and not seeing any errors, you can be sure that the computed
gradients are correct.
functions and not seing any errors, you can be sure that the computed gradients
are correct.
.. autoclass:: Variable


@ -75,10 +75,10 @@ author = 'Torch Contributors'
#
# The short X.Y version.
# TODO: change to [:2] at v1.0
version = 'master (' + torch.__version__ + ' )'
version = '.'.join(torch.__version__.split('+')[0].split('.')[:3])
# The full version, including alpha/beta/rc tags.
# TODO: verify this works as expected
release = 'master'
release = torch.__version__.split('+')[0]
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
@ -113,7 +113,7 @@ html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
#
html_theme_options = {
'collapse_navigation': False,
'display_version': True,
'display_version': False,
'logo_only': True,
}
@ -205,7 +205,7 @@ from sphinx import addnodes
def patched_make_field(self, types, domain, items, **kw):
# `kw` catches `env=None` needed for newer sphinx while maintaining
# `kw` catches `env=None` needed for newer sphinx while maingaining
# backwards compatibility when passed along further down!
# type: (List, unicode, Tuple) -> nodes.field


@ -25,10 +25,3 @@ Streams and events
.. autoclass:: Event
:members:
NVIDIA Tools Extension (NVTX)
-----------------------------
.. autofunction:: torch.cuda.nvtx.mark
.. autofunction:: torch.cuda.nvtx.range_push
.. autofunction:: torch.cuda.nvtx.range_pop


@ -10,4 +10,3 @@ torch.utils.data
.. autoclass:: torch.utils.data.sampler.RandomSampler
.. autoclass:: torch.utils.data.sampler.SubsetRandomSampler
.. autoclass:: torch.utils.data.sampler.WeightedRandomSampler
.. autoclass:: torch.utils.data.distributed.DistributedSampler


@ -1,165 +0,0 @@
.. role:: hidden
:class: hidden-section
Distributed communication package - torch.distributed
=====================================================
.. automodule:: torch.distributed
.. currentmodule:: torch.distributed
Currently torch.distributed supports three backends, each with
different capabilities. The table below shows which functions are available
for use with CPU / CUDA tensors.
MPI supports CUDA only if the implementation used to build PyTorch supports it.
+------------+-----------+-----------+-----------+
| Backend | ``tcp`` | ``gloo`` | ``mpi`` |
+------------+-----+-----+-----+-----+-----+-----+
| Device | CPU | GPU | CPU | GPU | CPU | GPU |
+============+=====+=====+=====+=====+=====+=====+
| send | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| recv | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| broadcast | ✓ | ✘ | ✓ | ✓ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| all_reduce | ✓ | ✘ | ✓ | ✓ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| reduce | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| all_gather | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| gather | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| scatter | ✓ | ✘ | ✘ | ✘ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
| barrier | ✓ | ✘ | ✓ | ✓ | ✓ | ? |
+------------+-----+-----+-----+-----+-----+-----+
Initialization
--------------
The package needs to be initialized using the :func:`torch.distributed.init_process_group`
function before calling any other methods.
.. autofunction:: init_process_group
.. autofunction:: get_rank
.. autofunction:: get_world_size
--------------------------------------------------------------------------------
Currently three initialization methods are supported:
TCP initialization
^^^^^^^^^^^^^^^^^^
Initialization will utilize a network address reachable from all processes.
If the address belongs to one of the machines, initialization requires that all processes
have manually specified ranks.
Alternatively, the address has to be a valid IP multicast address, in which case,
ranks can be assigned automatically. Multicast initialization also supports
a ``group_name`` argument, which allows you to use the same address for multiple jobs,
as long as they use different group names.
::
import torch.distributed as dist
# Use address of one of the machines
dist.init_process_group(init_method='tcp://10.1.1.20:23456', rank=args.rank, world_size=4)
# or a multicast address - rank will be assigned automatically if unspecified
dist.init_process_group(init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b0a7]:23456',
                        world_size=4)
Shared file-system initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Another initialization method makes use of a file system shared and visible from
all machines in a group. The URL should start with ``file://`` and contain a path
to a non-existent file (in an existing directory) on a shared file system.
This initialization method also supports a ``group_name`` argument, which allows you to
use the same shared file path for multiple jobs, as long as they use different
group names.
.. warning::
This method assumes that the file system supports locking using ``fcntl`` - most
local systems and NFS support it.
::
import torch.distributed as dist
# Rank will be assigned automatically if unspecified
dist.init_process_group(init_method='file:///mnt/nfs/sharedfile', world_size=4,
                        group_name=args.group)
Environment variable initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This method will read the configuration from environment variables, allowing
one to fully customize how the information is obtained. The variables to be set
are:
* ``MASTER_PORT`` - required; has to be a free port on machine with rank 0
* ``MASTER_ADDR`` - required (except for rank 0); address of rank 0 node
* ``WORLD_SIZE`` - required; can be set either here, or in a call to init function
* ``RANK`` - required; can be set either here, or in a call to init function
The machine with rank 0 will be used to set up all connections.
This is the default method, meaning that ``init_method`` does not have to be specified (or
can be ``env://``).
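For example, a minimal sketch following the call style of the examples above, assuming all four variables are already exported in the environment::

    import torch.distributed as dist

    # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are read from the environment
    dist.init_process_group(init_method='env://')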
Groups
------
By default collectives operate on the default group (also called the world) and
require all processes to enter the distributed function call. However, some workloads can benefit
from more fine-grained communication. This is where distributed groups come
into play. :func:`~torch.distributed.new_group` function can be
used to create new groups, with arbitrary subsets of all processes. It returns
an opaque group handle that can be given as a ``group`` argument to all collectives
(collectives are distributed functions to exchange information in certain well-known programming patterns).
.. autofunction:: new_group
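For example, a sketch that restricts a collective to ranks 0 and 1 (``tensor`` is a placeholder for a tensor you already hold on every participating process)::

    import torch.distributed as dist

    group = dist.new_group([0, 1])         # an opaque handle covering ranks 0 and 1
    dist.all_reduce(tensor, group=group)   # only ranks 0 and 1 take part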
Point-to-point communication
----------------------------
.. autofunction:: send
.. autofunction:: recv
:func:`~torch.distributed.isend` and :func:`~torch.distributed.irecv`
return distributed request objects when used. In general, the type of this object is unspecified
as they should never be created manually, but they are guaranteed to support two methods:
* ``is_completed()`` - returns True if the operation has finished
* ``wait()`` - will block the process until the operation is finished.
``is_completed()`` is guaranteed to return True once it returns.
.. autofunction:: isend
.. autofunction:: irecv
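A short sketch of a non-blocking exchange between ranks 0 and 1::

    import torch
    import torch.distributed as dist

    t = torch.zeros(1)
    if dist.get_rank() == 0:
        req = dist.isend(tensor=t, dst=1)   # returns a distributed request object
    else:
        req = dist.irecv(tensor=t, src=0)
    req.wait()                              # block until the transfer is finished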
Collective functions
--------------------
.. autofunction:: broadcast
.. autofunction:: all_reduce
.. autofunction:: reduce
.. autofunction:: all_gather
.. autofunction:: gather
.. autofunction:: scatter
.. autofunction:: barrier


@ -30,7 +30,6 @@ PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
optim
torch.autograd <autograd>
torch.multiprocessing <multiprocessing>
torch.distributed <distributed>
torch.legacy <legacy>
cuda
ffi


@ -83,6 +83,6 @@ the current process group, and will keep track of all shared memory allocations.
Once all processes connected to it exit, it will wait a moment to ensure there
will be no new connections, and will iterate over all shared memory files
allocated by the group. If it finds that any of them still exist, they will be
deallocated. We've tested this method and it proved to be robust to various
deallocated. We've tested this method and it prooved to be robust to various
failures. Still, if your system has high enough limits, and ``file_descriptor``
is a supported strategy, we do not recommend switching to this one.
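If you do have to opt in, a minimal sketch of switching strategies::

    import torch.multiprocessing as mp

    print(mp.get_all_sharing_strategies())   # strategies available on this platform
    mp.set_sharing_strategy('file_system')   # only if file_descriptor is unusable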


@ -160,7 +160,7 @@ Pooling Layers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: AdaptiveMaxPool2d
:members:
:members:
:hidden:`AdaptiveAvgPool1d`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -174,41 +174,7 @@ Pooling Layers
.. autoclass:: AdaptiveAvgPool2d
:members:
Padding Layers
--------------
:hidden:`ReflectionPad2d`
~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: ReflectionPad2d
:members:
:hidden:`ReplicationPad2d`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: ReplicationPad2d
:members:
:hidden:`ReplicationPad3d`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: ReplicationPad3d
:members:
:hidden:`ZeroPad2d`
~~~~~~~~~~~~~~~~~~~
.. autoclass:: ZeroPad2d
:members:
:hidden:`ConstantPad2d`
~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: ConstantPad2d
:members:
Non-linear Activations
----------------------------------
@ -230,12 +196,6 @@ Non-linear Activations
.. autoclass:: ELU
:members:
:hidden:`SELU`
~~~~~~~~~~~~~~
.. autoclass:: SELU
:members:
:hidden:`PReLU`
~~~~~~~~~~~~~~~
@ -343,19 +303,19 @@ Normalization layers
:members:
:hidden:`InstanceNorm1d`
~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: InstanceNorm1d
:members:
:hidden:`InstanceNorm2d`
~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: InstanceNorm2d
:members:
:hidden:`InstanceNorm3d`
~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: InstanceNorm3d
:members:
@ -430,12 +390,6 @@ Dropout layers
.. autoclass:: Dropout3d
:members:
:hidden:`AlphaDropout`
~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: AlphaDropout
:members:
Sparse layers
----------------------------------
@ -446,21 +400,9 @@ Sparse layers
.. autoclass:: Embedding
:members:
:hidden:`EmbeddingBag`
~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: EmbeddingBag
:members:
Distance functions
----------------------------------
:hidden:`CosineSimilarity`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: CosineSimilarity
:members:
:hidden:`PairwiseDistance`
~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -495,12 +437,6 @@ Loss functions
.. autoclass:: NLLLoss
:members:
:hidden:`PoissonNLLLoss`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: PoissonNLLLoss
:members:
:hidden:`NLLLoss2d`
~~~~~~~~~~~~~~~~~~~
@ -519,12 +455,6 @@ Loss functions
.. autoclass:: BCELoss
:members:
:hidden:`BCEWithLogitsLoss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: BCEWithLogitsLoss
:members:
:hidden:`MarginRankingLoss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -573,12 +503,6 @@ Loss functions
.. autoclass:: MultiMarginLoss
:members:
:hidden:`TripletMarginLoss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: TripletMarginLoss
:members:
Vision layers
----------------
@ -589,12 +513,6 @@ Vision layers
.. autoclass:: PixelShuffle
:members:
:hidden:`Upsample`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: Upsample
:members:
:hidden:`UpsamplingNearest2d`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -608,21 +526,15 @@ Vision layers
:members:
DataParallel layers (multi-GPU, distributed)
--------------------------------------------
Multi-GPU layers
----------------
:hidden:`DataParallel`
~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: DataParallel
:members:
:hidden:`DistributedDataParallel`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: torch.nn.parallel.DataParallel
:members:
Utilities
---------
@ -632,16 +544,6 @@ Utilities
.. autofunction:: torch.nn.utils.clip_grad_norm
:hidden:`weight_norm`
~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: torch.nn.utils.weight_norm
:hidden:`remove_weight_norm`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: torch.nn.utils.remove_weight_norm
.. currentmodule:: torch.nn.utils.rnn
@ -774,7 +676,7 @@ Pooling functions
.. autofunction:: adaptive_avg_pool2d
Non-linear activation functions
-------------------------------
@ -804,11 +706,6 @@ Non-linear activation functions
.. autofunction:: elu
:hidden:`selu`
~~~~~~~~~~~~~~
.. autofunction:: selu
:hidden:`leaky_relu`
~~~~~~~~~~~~~~~~~~~~
@ -887,11 +784,6 @@ Normalization functions
.. autofunction:: batch_norm
:hidden:`normalize`
~~~~~~~~~~~~~~~~~~~~
.. autofunction:: normalize
Linear functions
----------------
@ -908,21 +800,6 @@ Dropout functions
.. autofunction:: dropout
:hidden:`alpha_dropout`
~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: alpha_dropout
:hidden:`dropout2d`
~~~~~~~~~~~~~~~~~~~
.. autofunction:: dropout2d
:hidden:`dropout3d`
~~~~~~~~~~~~~~~~~~~
.. autofunction:: dropout3d
Distance functions
----------------------------------
@ -931,100 +808,36 @@ Distance functions
.. autofunction:: pairwise_distance
:hidden:`cosine_similarity`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cosine_similarity
Loss functions
--------------
:hidden:`binary_cross_entropy`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: binary_cross_entropy
:hidden:`poisson_nll_loss`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: poisson_nll_loss
:hidden:`cosine_embedding_loss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cosine_embedding_loss
:hidden:`cross_entropy`
~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cross_entropy
:hidden:`hinge_embedding_loss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: hinge_embedding_loss
:hidden:`kl_div`
~~~~~~~~~~~~~~~~
.. autofunction:: kl_div
:hidden:`l1_loss`
~~~~~~~~~~~~~~~~~
.. autofunction:: l1_loss
:hidden:`mse_loss`
~~~~~~~~~~~~~~~~~~
.. autofunction:: mse_loss
:hidden:`margin_ranking_loss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: margin_ranking_loss
:hidden:`multilabel_margin_loss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: multilabel_margin_loss
:hidden:`multilabel_soft_margin_loss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: multilabel_soft_margin_loss
:hidden:`multi_margin_loss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: multi_margin_loss
:hidden:`nll_loss`
~~~~~~~~~~~~~~~~~~
.. autofunction:: nll_loss
:hidden:`binary_cross_entropy_with_logits`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: binary_cross_entropy_with_logits
:hidden:`kl_div`
~~~~~~~~~~~~~~~~
.. autofunction:: kl_div
:hidden:`cross_entropy`
~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: cross_entropy
:hidden:`binary_cross_entropy`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: binary_cross_entropy
:hidden:`smooth_l1_loss`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: smooth_l1_loss
:hidden:`soft_margin_loss`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: soft_margin_loss
:hidden:`triplet_margin_loss`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: triplet_margin_loss
Vision functions
----------------
@ -1038,32 +851,6 @@ Vision functions
.. autofunction:: pad
:hidden:`upsample`
~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: upsample
:hidden:`upsample_nearest`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: upsample_nearest
:hidden:`upsample_bilinear`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: upsample_bilinear
:hidden:`grid_sample`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: grid_sample
:hidden:`affine_grid`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: affine_grid
torch.nn.init
=============


@ -67,18 +67,18 @@ model. ``volatile`` also determines that ``requires_grad is False``.
Volatile differs from :ref:`excluding-requires_grad` in how the flag propagates.
If there's even a single volatile input to an operation, its output is also
going to be volatile. Volatility spreads across the graph much easier than
going to be volatile. Volatility spreads accross the graph much easier than
non-requiring gradient - you only need a **single** volatile leaf to have a
volatile output, while you need **all** leaves to not require gradient to
have an output that doesn't require gradient. Using volatile flag you don't
have an output the doesn't require gradient. Using volatile flag you don't
need to change any settings of your model parameters to use it for
inference. It's enough to create a volatile input, and this will ensure that
no intermediate states are saved.
.. code::
>>> regular_input = Variable(torch.randn(1, 3, 227, 227))
>>> volatile_input = Variable(torch.randn(1, 3, 227, 227), volatile=True)
>>> regular_input = Variable(torch.randn(5, 5))
>>> volatile_input = Variable(torch.randn(5, 5), volatile=True)
>>> model = torchvision.models.resnet18(pretrained=True)
>>> model(regular_input).requires_grad
True
@ -86,28 +86,21 @@ no intermediate states are saved.
False
>>> model(volatile_input).volatile
True
>>> model(volatile_input).grad_fn is None
>>> model(volatile_input).creator is None
True
How autograd encodes the history
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Autograd is a reverse automatic differentiation system. Conceptually,
autograd records a graph recording all of the operations that created
the data as you execute operations, giving you a directed acyclic graph
whose leaves are the input variables and roots are the output variables.
By tracing this graph from roots to leaves, you can automatically
compute the gradients using the chain rule.
Internally, autograd represents this graph as a graph of
:class:`Function` objects (really expressions), which can be
:meth:`~torch.autograd.Function.apply` ed to compute the result of
evaluating the graph. When computing the forwards pass, autograd
simultaneously performs the requested computations and builds up a graph
representing the function that computes the gradient (the ``.grad_fn``
attribute of each :class:`Variable` is an entry point into this graph).
When the forwards pass is completed, we evaluate this graph in the
backwards pass to compute the gradients.
Each Variable has a ``.creator`` attribute, that points to the function, of
which it is an output. This is an entry point to a directed acyclic graph (DAG)
consisting of :class:`Function` objects as nodes, and references between them
being the edges. Every time an operation is performed, a new :class:`Function`
representing it is instantiated, its :meth:`~torch.autograd.Function.forward`
method is called, and its output :class:`Variable` s creators are set to it.
Then, by following the path from any :class:`Variable` to the leaves, it is
possible to reconstruct the sequence of operations that has created the data,
and automatically compute the gradients.
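As a tiny sketch of inspecting this graph (the entry-point attribute is ``.creator`` in this release and ``.grad_fn`` in later sources)::

    import torch
    from torch.autograd import Variable

    x = Variable(torch.ones(2, 2), requires_grad=True)
    y = x + 2
    z = (y * y).sum()
    print(z.creator)   # the Function that produced z (``z.grad_fn`` in later versions)
    z.backward()
    print(x.grad)      # filled in by walking the graph from z back to x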
An important thing to note is that the graph is recreated from scratch at every
iteration, and this is exactly what allows for using arbitrary Python control


@ -1,113 +0,0 @@
.. _broadcasting-semantics:
Broadcasting semantics
======================
Many PyTorch operations support :any:`NumPy Broadcasting Semantics <numpy.doc.broadcasting>`.
In short, if a PyTorch operation supports broadcast, then its Tensor arguments can be
automatically expanded to be of equal sizes (without making copies of the data).
General semantics
-----------------
Two tensors are "broadcastable" if the following rules hold:
- Each tensor has at least one dimension.
- When iterating over the dimension sizes, starting at the trailing dimension,
the dimension sizes must either be equal, one of them is 1, or one of them
does not exist.
For Example::
>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)
>>> x=torch.FloatTensor()
>>> y=torch.FloatTensor(2,2)
# x and y are not broadcastable, because x does not have at least 1 dimension
# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist
# but:
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor( 3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3
If two tensors :attr:`x`, :attr:`y` are "broadcastable", the resulting tensor size
is calculated as follows:
- If the number of dimensions of :attr:`x` and :attr:`y` are not equal, prepend 1
to the dimensions of the tensor with fewer dimensions to make them equal length.
- Then, for each dimension size, the resulting dimension size is the max of the sizes of
:attr:`x` and :attr:`y` along that dimension.
For Example::
# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor( 3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])
# but not necessary:
>>> x=torch.FloatTensor(1)
>>> y=torch.FloatTensor(3,1,7)
>>> (x+y).size()
torch.Size([3, 1, 7])
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
In-place semantics
------------------
One complication is that in-place operations do not allow the in-place tensor to change shape
as a result of the broadcast.
For Example::
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor(3,1,1)
>>> (x.add_(y)).size()
torch.Size([5, 3, 4, 1])
# but:
>>> x=torch.FloatTensor(1,3,1)
>>> y=torch.FloatTensor(3,1,7)
>>> (x.add_(y)).size()
RuntimeError: The expanded size of the tensor (1) must match the existing size (7) at non-singleton dimension 2.
Backwards compatibility
-----------------------
Prior versions of PyTorch allowed certain pointwise functions to execute on tensors with different shapes,
as long as the number of elements in each tensor was equal. The pointwise operation would then be carried
out by viewing each tensor as 1-dimensional. PyTorch now supports broadcasting and the "1-dimensional"
pointwise behavior is considered deprecated and will generate a Python warning in cases where tensors are
not broadcastable, but have the same number of elements.
Note that the introduction of broadcasting can cause backwards incompatible changes in the case where
two tensors do not have the same shape, but are broadcastable and have the same number of elements.
For Example::
>>> torch.add(torch.ones(4,1), torch.randn(4))
would previously produce a Tensor with size: torch.Size([4,1]), but now produces a Tensor with size: torch.Size([4,4]).
In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist,
you may set `torch.utils.backcompat.broadcast_warning.enabled` to `True`, which will generate a python warning
in such cases.
For Example::
>>> torch.utils.backcompat.broadcast_warning.enabled=True
>>> torch.add(torch.ones(4,1), torch.ones(4))
__main__:1: UserWarning: self and other do not have the same shape, but are broadcastable, and have the same number of elements.
Changing behavior in a backwards incompatible manner to broadcasting rather than viewing as 1-dimensional.


@ -12,7 +12,7 @@ of your selected device, and the results will be always placed in on the same
device as the tensor.
Cross-GPU operations are not allowed by default, with the only exception of
:meth:`~torch.Tensor.copy_`. Unless you enable peer-to-peer memory accesses,
:meth:`~torch.Tensor.copy_`. Unless you enable peer-to-peer memory accesses
any attempts to launch ops on tensors spread across different devices will
raise an error.
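For example, a sketch assuming at least two GPUs::

    import torch

    x = torch.cuda.FloatTensor(1)        # allocated on the current (default) device
    with torch.cuda.device(1):
        y = torch.cuda.FloatTensor(1)    # allocated on GPU 1
        z = x + y                        # raises an error: x and y are on different devices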


@ -13,28 +13,31 @@ Extending :mod:`torch.autograd`
Adding operations to :mod:`~torch.autograd` requires implementing a new
:class:`Function` subclass for each operation. Recall that :class:`Function` s
are what :mod:`~torch.autograd` uses to compute the results and gradients, and
encode the operation history. Every new function requires you to implement 2
encode the operation history. Every new function requires you to implement 3
methods:
- ``__init__`` (*optional*) - if your operation is parametrized by/uses
objects different than :class:`Variable` s, you should pass them as arguments
to ``__init__``. For example, ``AddConstant`` function takes a scalar to add,
while ``Transpose`` requires specifying which two dimensions to swap. If your
function doesn't require any additional parameters, you can skip it.
- :meth:`~Function.forward` - the code that performs the operation. It can take
as many arguments as you want, with some of them being optional, if you
specify the default values. All kinds of Python objects are accepted here.
:class:`Variable` arguments will be converted to :class:`Tensor` s before the
call, and their use will be registered in the graph. Note that this logic won't
traverse lists/dicts/any other data structures and will only consider Variables
that are direct arguments to the call. You can return either a single
:class:`Tensor` output, or a :class:`tuple` of :class:`Tensor` s if there are
multiple outputs. Also, please refer to the docs of :class:`Function` to find
descriptions of useful methods that can be called only from :meth:`~Function.forward`.
as many arguments as you want, with some of them being
optional, if you specify the default values. Keep in mind that only
:class:`Variable` s will be passed in here. You can return either a single
:class:`Variable` output, or a :class:`tuple` of :class:`Variable` s if there
are multiple. Also, please refer to the docs of :class:`Function` to find
descriptions of useful methods that can be called only from
:meth:`~Function.forward`.
- :meth:`~Function.backward` - gradient formula. It will be given
as many :class:`Variable` arguments as there were outputs, with each of them
representing gradient w.r.t. that output. It should return as many
:class:`Variable` s as there were inputs, with each of them containing the
gradient w.r.t. its corresponding input. If your inputs didn't require
gradient (see :attr:`~Variable.needs_input_grad`), or were non-:class:`Variable`
objects, you can return :class:`python:None`. Also, if you have optional
arguments to :meth:`~Variable.forward` you can return more gradients than there
were inputs, as long as they're all :any:`python:None`.
as many arguments as there were outputs, with each of them representing
gradient w.r.t. that output. It should return as many :class:`Tensor` s as
there were inputs, with each of them containing the gradient w.r.t.
corresponding input. If your inputs didn't require gradient (see
:attr:`~Variable.needs_input_grad`), or it was non-differentiable, you
can return :class:`None`. Also, if you have optional arguments to
:meth:`~Variable.forward` you can return more gradients than there were
inputs, as long as they're all :any:`python:None`.
Below you can find code for a ``Linear`` function from :mod:`torch.nn`, with
additional comments::
@ -42,25 +45,22 @@ additional comments::
# Inherit from Function
class Linear(Function):
# Note that both forward and backward are @staticmethods
@staticmethod
# bias is an optional argument
def forward(ctx, input, weight, bias=None):
ctx.save_for_backward(input, weight, bias)
def forward(self, input, weight, bias=None):
self.save_for_backward(input, weight, bias)
output = input.mm(weight.t())
if bias is not None:
output += bias.unsqueeze(0).expand_as(output)
return output
# This function has only a single output, so it gets only one gradient
@staticmethod
def backward(ctx, grad_output):
def backward(self, grad_output):
# This is a pattern that is very convenient - at the top of backward
# unpack saved_tensors and initialize all gradients w.r.t. inputs to
# None. Thanks to the fact that additional trailing Nones are
# ignored, the return statement is simple even when the function has
# optional inputs.
input, weight, bias = ctx.saved_variables
input, weight, bias = self.saved_tensors
grad_input = grad_weight = grad_bias = None
# These needs_input_grad checks are optional and there only to
@ -76,39 +76,27 @@ additional comments::
return grad_input, grad_weight, grad_bias
Now, to make it easier to use these custom ops, we recommend aliasing their
``apply`` method::
Now, to make it easier to use these custom ops, we recommend wrapping them in
small helper functions::
linear = Linear.apply
Here, we give an additional example of a function that is parametrized by
non-Variable arguments::
class MulConstant(Function):
    @staticmethod
    def forward(ctx, tensor, constant):
        # ctx is a context object that can be used to stash information
        # for backward computation
        ctx.constant = constant
        return tensor * constant

    @staticmethod
    def backward(ctx, grad_output):
        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None
def linear(input, weight, bias=None):
    # First braces create a Function object. Any arguments given here
    # will be passed to __init__. Second braces will invoke the __call__
    # operator, that will then use forward() to compute the result and
    # return it.
    return Linear()(input, weight, bias)
You probably want to check if the backward method you implemented actually
computes the derivatives of your function. It is possible by comparing with
numerical approximations using small finite differences::
from torch.autograd import gradcheck
# gradcheck takes a tuple of tensors as input, checks if your gradient
# evaluated with these tensors is close enough to numerical
# approximations and returns True if they all verify this condition.
input = (Variable(torch.randn(20,20).double(), requires_grad=True), Variable(torch.randn(30,20).double(), requires_grad=True),)
test = gradcheck(Linear.apply, input, eps=1e-6, atol=1e-4)
input = (Variable(torch.randn(20,20).double(), requires_grad=True),)
test = gradcheck(Linear(), input, eps=1e-6, atol=1e-4)
print(test)
Extending :mod:`torch.nn`


@ -114,21 +114,3 @@ Algorithms
:members:
.. autoclass:: SGD
:members:
How to adjust Learning Rate
---------------------------
:mod:`torch.optim.lr_scheduler` provides several methods to adjust the learning
rate based on the number of epochs. :class:`torch.optim.lr_scheduler.ReduceLROnPlateau`
allows dynamic learning rate reducing based on some validation measurements.
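For example, a brief sketch of plateau-based scheduling (``model`` and ``validate`` are placeholders for your own module and evaluation loop)::

    import torch.optim as optim
    from torch.optim import lr_scheduler

    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=10)

    for epoch in range(100):
        # ... train for one epoch ...
        val_loss = validate()
        scheduler.step(val_loss)   # reduce the lr when the validation loss plateaus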
.. autoclass:: torch.optim.lr_scheduler.LambdaLR
:members:
.. autoclass:: torch.optim.lr_scheduler.StepLR
:members:
.. autoclass:: torch.optim.lr_scheduler.MultiStepLR
:members:
.. autoclass:: torch.optim.lr_scheduler.ExponentialLR
:members:
.. autoclass:: torch.optim.lr_scheduler.ReduceLROnPlateau
:members:


@ -1,7 +1,7 @@
.. currentmodule:: torch.sparse
torch.sparse
============
Sparse tensors
==============
.. warning::
@ -12,13 +12,16 @@ efficiently store and process tensors for which the majority of elements
are zeros.
A sparse tensor is represented as a pair of dense tensors: a tensor
of values and a tensor of indices. A sparse tensor can be constructed
which contains the actual values :class:`torch.sparse.values`, and a
tensor which contains the coordinates of those values
:class:`torch.sparse.indices`. A sparse tensor can be constructed
by providing these two tensors, as well as the size of the sparse tensor
(which cannot be inferred from these tensors!)
>>> i = torch.LongTensor([[0, 1], [2, 0]])
>>> v = torch.FloatTensor([3, 4])
>>> torch.sparse.FloatTensor(i, v, torch.Size([2,3])).to_dense()
0 0 3
4 0 0
[torch.FloatTensor of size 2x3]
@ -29,6 +32,7 @@ dimensions are sparse, and the rest of the dimensions are dense.
>>> i = torch.LongTensor([[2, 4]])
>>> v = torch.FloatTensor([[1, 3], [5, 7]])
>>> torch.sparse.FloatTensor(i, v).to_dense()
0 0
0 0
1 3
@ -44,71 +48,42 @@ An empty sparse tensor can be constructed by specifying its size:
and values:
[torch.FloatTensor with no dimension]
.. note::
Our sparse tensor format permits *uncoalesced* sparse tensors, where
there may be duplicate coordinates in the indices; in this case,
the interpretation is that the value at that index is the sum of all
duplicate value entries. Uncoalesced tensors permit us to implement
certain operators more efficiently.
For the most part, you shouldn't have to care whether or not a
sparse tensor is coalesced or not, as most operations will work
identically given a coalesced or uncoalesced sparse tensor.
However, there are two cases in which you may need to care.
First, if you repeatedly perform an operation that can produce
duplicate entries (e.g., :func:`torch.sparse.FloatTensor.add`), you
should occasionally coalesce your sparse tensors to prevent
them from growing too large.
Second, some operators will produce different values depending on
whether or not they are coalesced or not (e.g.,
:func:`torch.sparse.FloatTensor._values` and
:func:`torch.sparse.FloatTensor._indices`, as well as
:func:`torch.Tensor._sparse_mask`). These operators are
prefixed by an underscore to indicate that they reveal internal
implementation details and should be used with care, since code
that works with coalesced sparse tensors may not work with
uncoalesced sparse tensors; generally speaking, it is safest
to explicitly coalesce before working with these operators.
For example, suppose that we wanted to implement an operator
by operating directly on :func:`torch.sparse.FloatTensor._values`.
Multiplication by a scalar can be implemented in the obvious way,
as multiplication distributes over addition; however, square root
cannot be implemented directly, since ``sqrt(a + b) != sqrt(a) +
sqrt(b)`` (which is what would be computed if you were given an
uncoalesced tensor.)
Sparse tensors can have duplicate entries for an index; such a tensor is
called non-coalesced. Duplicate entries are summed together when
coalescing (or converting to another representation). Some operations
(for example, :func:`torch.FloatTensor.add`) produce duplicate entries;
if you repeatedly perform these operations, you should coalesce your
sparse tensors to prevent them from growing too large.
.. class:: FloatTensor()
.. method:: add
.. method:: add_
.. method:: clone
.. method:: dim
.. method:: div
.. method:: div_
.. method:: get_device
.. method:: hspmm
.. method:: mm
.. method:: mul
.. method:: mul_
.. method:: resizeAs_
.. method:: size
.. method:: spadd
.. method:: spmm
.. method:: sspaddmm
.. method:: sspmm
.. method:: sub
.. method:: sub_
.. method:: t_
.. method:: toDense
.. method:: transpose
.. method:: transpose_
.. method:: zero_
.. method:: coalesce
.. method:: is_coalesced
.. method:: _indices
.. method:: _values
.. method:: _nnz
.. automethod:: add
.. automethod:: add_
.. automethod:: clone
.. automethod:: contiguous
.. automethod:: dim
.. automethod:: div
.. automethod:: div_
.. automethod:: get_device
.. automethod:: hspmm
.. automethod:: indices
.. automethod:: is_contiguous
.. automethod:: mm
.. automethod:: mul
.. automethod:: mul_
.. automethod:: nnz
.. automethod:: resizeAs_
.. automethod:: size
.. automethod:: spadd
.. automethod:: sparse_mask
.. automethod:: spmm
.. automethod:: sspaddmm
.. automethod:: sspmm
.. automethod:: sub
.. automethod:: sub_
.. automethod:: t_
.. automethod:: toDense
.. automethod:: transpose
.. automethod:: transpose_
.. automethod:: values
.. automethod:: zero_


@ -13,7 +13,7 @@ Data type CPU tensor GPU tensor
======================== =========================== ================================
32-bit floating point :class:`torch.FloatTensor` :class:`torch.cuda.FloatTensor`
64-bit floating point :class:`torch.DoubleTensor` :class:`torch.cuda.DoubleTensor`
16-bit floating point :class:`torch.HalfTensor` :class:`torch.cuda.HalfTensor`
16-bit floating point N/A :class:`torch.cuda.HalfTensor`
8-bit integer (unsigned) :class:`torch.ByteTensor` :class:`torch.cuda.ByteTensor`
8-bit integer (signed) :class:`torch.CharTensor` :class:`torch.cuda.CharTensor`
16-bit integer (signed) :class:`torch.ShortTensor` :class:`torch.cuda.ShortTensor`
@ -196,10 +196,9 @@ view of a storage and defines numeric operations on it.
.. automethod:: lt
.. automethod:: lt_
.. automethod:: map_
.. automethod:: masked_scatter_
.. automethod:: masked_copy_
.. automethod:: masked_fill_
.. automethod:: masked_select
.. automethod:: matmul
.. automethod:: max
.. automethod:: mean
.. automethod:: median


@ -170,7 +170,6 @@ BLAS and LAPACK Operations
.. autofunction:: ger
.. autofunction:: gesv
.. autofunction:: inverse
.. autofunction:: matmul
.. autofunction:: mm
.. autofunction:: mv
.. autofunction:: orgqr


@ -1,78 +1,129 @@
torchvision.datasets
====================
All datasets are subclasses of :class:`torch.utils.data.Dataset`
i.e., they have ``__getitem__`` and ``__len__`` methods implemented.
Hence, they can all be passed to a :class:`torch.utils.data.DataLoader`
which can load multiple samples in parallel using ``torch.multiprocessing`` workers.
For example: ::
imagenet_data = torchvision.datasets.ImageFolder('path/to/imagenet_root/')
data_loader = torch.utils.data.DataLoader(imagenet_data,
                                           batch_size=4,
                                           shuffle=True,
                                           num_workers=args.nThreads)
The following dataset loaders are available:
The following datasets are available:
- `MNIST`_
- `COCO (Captioning and Detection)`_
- `LSUN Classification`_
- `ImageFolder`_
- `Imagenet-12`_
- `CIFAR10 and CIFAR100`_
- `STL10`_
.. contents:: Datasets
:local:
Datasets have the API:
All the datasets have almost similar API. They all have two common arguments:
``transform`` and ``target_transform`` to transform the input and target respectively.
- ``__getitem__``
- ``__len__``
They all subclass from ``torch.utils.data.Dataset``
Hence, they can all be multi-threaded (python multiprocessing) using
standard torch.utils.data.DataLoader.
For example:
.. currentmodule:: torchvision.datasets
``torch.utils.data.DataLoader(coco_cap, batch_size=args.batchSize, shuffle=True, num_workers=args.nThreads)``
In the constructor, each dataset has a slightly different API as needed,
but they all take the keyword args:
- ``transform`` - a function that takes in an image and returns a
transformed version
- common stuff like ``ToTensor``, ``RandomCrop``, etc. These can be
composed together with ``transforms.Compose`` (see transforms section
below)
- ``target_transform`` - a function that takes in the target and
transforms it. For example, take in the caption string and return a
tensor of word indices.
MNIST
~~~~~
.. autoclass:: MNIST
``dset.MNIST(root, train=True, transform=None, target_transform=None, download=False)``
- ``root`` : root directory of dataset where ``processed/training.pt`` and ``processed/test.pt`` exist.
- ``train`` : ``True`` = Training set, ``False`` = Test set
- ``download`` : ``True`` = downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, place the processed dataset (the processing function is available in ``mnist.py``) in the ``processed`` folder.
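
A minimal usage sketch; the ``./data`` path and batch size below are placeholders:

.. code:: python

    import torch
    import torchvision.datasets as dset
    import torchvision.transforms as transforms

    # Downloads MNIST into ./data on first use, then reads the processed files.
    mnist = dset.MNIST(root='./data', train=True, download=True,
                       transform=transforms.ToTensor())
    loader = torch.utils.data.DataLoader(mnist, batch_size=64, shuffle=True)

    images, labels = next(iter(loader))
    print(images.size())   # torch.Size([64, 1, 28, 28])
    print(labels.size())   # torch.Size([64])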
COCO
~~~~
.. note ::
These require the `COCO API to be installed`_
This requires the `COCO API to be installed`_
.. _COCO API to be installed: https://github.com/pdollar/coco/tree/master/PythonAPI
Captions
^^^^^^^^
.. autoclass:: CocoCaptions
:members: __getitem__
:special-members:
Detection
Captions:
^^^^^^^^^
.. autoclass:: CocoDetection
:members: __getitem__
:special-members:
``dset.CocoCaptions(root="dir where images are", annFile="json annotation file", [transform, target_transform])``
Example:
.. code:: python
import torchvision.datasets as dset
import torchvision.transforms as transforms
cap = dset.CocoCaptions(root = 'dir where images are',
annFile = 'json annotation file',
transform=transforms.ToTensor())
print('Number of samples: ', len(cap))
img, target = cap[3] # load 4th sample
print("Image Size: ", img.size())
print(target)
Output:
::
Number of samples: 82783
Image Size: (3L, 427L, 640L)
[u'A plane emitting smoke stream flying over a mountain.',
u'A plane darts across a bright blue sky behind a mountain covered in snow',
u'A plane leaves a contrail above the snowy mountain top.',
u'A mountain that has a plane flying overheard in the distance.',
u'A mountain view with a plume of smoke in the background']
Detection:
^^^^^^^^^^
``dset.CocoDetection(root="dir where images are", annFile="json annotation file", [transform, target_transform])``
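
Usage mirrors ``CocoCaptions`` above; here ``target`` is the list of annotation dictionaries for the image. The paths are placeholders, and the COCO API must be installed:

.. code:: python

    import torchvision.datasets as dset
    import torchvision.transforms as transforms

    det = dset.CocoDetection(root='dir where images are',
                             annFile='json annotation file',
                             transform=transforms.ToTensor())

    img, target = det[0]
    print(img.size())    # e.g. (3, H, W)
    print(len(target))   # number of annotated objects in this image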
LSUN
~~~~
.. autoclass:: LSUN
:members: __getitem__
:special-members:
``dset.LSUN(db_path, classes='train', [transform, target_transform])``
- ``db_path`` : root directory for the database files
- ``classes`` = ``train`` (all categories, training set), ``val`` (all categories, validation set), ``test`` (all categories, test set)
- [``bedroom_train``, ``church_train``, …] : a list of categories to load
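
A short sketch, assuming the LSUN LMDB files have already been downloaded into ``./lsun`` (a placeholder path):

.. code:: python

    import torchvision.datasets as dset
    import torchvision.transforms as transforms

    # Load only the bedroom training category; pass classes='train' for all categories.
    lsun = dset.LSUN(db_path='./lsun', classes=['bedroom_train'],
                     transform=transforms.ToTensor())

    img, label = lsun[0]   # label is the index of the category in `classes`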
ImageFolder
~~~~~~~~~~~
.. autoclass:: ImageFolder
:members: __getitem__
:special-members:
A generic data loader where the images are arranged in this way:
::
root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png
root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png
``dset.ImageFolder(root="root folder path", [transform, target_transform])``
It has the members:
- ``self.classes`` - The class names as a list
- ``self.class_to_idx`` - Corresponding class indices
- ``self.imgs`` - The list of (image path, class-index) tuples
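
A sketch of how these members are typically used with the directory layout above (``path/to/root`` is a placeholder):

.. code:: python

    import torchvision.datasets as dset
    import torchvision.transforms as transforms

    folder = dset.ImageFolder(root='path/to/root', transform=transforms.ToTensor())

    print(folder.classes)        # e.g. ['cat', 'dog']
    print(folder.class_to_idx)   # e.g. {'cat': 0, 'dog': 1}
    print(folder.imgs[0])        # e.g. ('path/to/root/cat/123.png', 0)
    img, label = folder[0]       # transformed image tensor and its class index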
Imagenet-12
~~~~~~~~~~~
This should simply be implemented with an ``ImageFolder`` dataset.
This is simply implemented with an ImageFolder dataset.
The data is preprocessed `as described
here <https://github.com/facebook/fb.resnet.torch/blob/master/INSTALL.md#download-the-imagenet-dataset>`__
@ -82,31 +133,30 @@ example <https://github.com/pytorch/examples/blob/27e2a46c1d1505324032b1d94fc6ce
CIFAR
~~~~~
.. autoclass:: CIFAR10
:members: __getitem__
:special-members:
``dset.CIFAR10(root, train=True, transform=None, target_transform=None, download=False)``
``dset.CIFAR100(root, train=True, transform=None, target_transform=None, download=False)``
- ``root`` : root directory of the dataset where the ``cifar-10-batches-py`` folder exists
- ``train`` : ``True`` = Training set, ``False`` = Test set
- ``download`` : ``True`` = downloads the dataset from the internet and
puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
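
A short sketch for CIFAR-10 (``./data`` is a placeholder; CIFAR-100 works the same way with ``dset.CIFAR100``):

.. code:: python

    import torchvision.datasets as dset
    import torchvision.transforms as transforms

    cifar10 = dset.CIFAR10(root='./data', train=True, download=True,
                           transform=transforms.ToTensor())

    print(len(cifar10))       # 50000 training images
    img, label = cifar10[0]   # 3x32x32 tensor and an integer label in [0, 9]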
STL10
~~~~~
``dset.STL10(root, split='train', transform=None, target_transform=None, download=False)``
.. autoclass:: STL10
:members: __getitem__
:special-members:
SVHN
~~~~~
.. autoclass:: SVHN
:members: __getitem__
:special-members:
PhotoTour
~~~~~~~~~
.. autoclass:: PhotoTour
:members: __getitem__
:special-members:
- ``root`` : root directory of the dataset where the ``stl10_binary`` folder exists
- ``split`` : ``'train'`` = Training set, ``'test'`` = Test set, ``'unlabeled'`` = Unlabeled set, ``'train+unlabeled'`` = Training + Unlabeled set (missing labels are marked as ``-1``)
- ``download`` : ``True`` = downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
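
A short sketch (``./data`` is a placeholder path); unlabeled samples carry the label ``-1`` as noted above:

.. code:: python

    import torchvision.datasets as dset
    import torchvision.transforms as transforms

    stl10 = dset.STL10(root='./data', split='train+unlabeled', download=True,
                       transform=transforms.ToTensor())

    img, label = stl10[0]   # 3x96x96 tensor; unlabeled samples have label == -1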
.. _MNIST: #mnist
.. _COCO (Captioning and Detection): #coco
.. _LSUN Classification: #lsun
.. _ImageFolder: #imagefolder
.. _Imagenet-12: #imagenet-12
.. _CIFAR10 and CIFAR100: #cifar
.. _STL10: #stl10
.. _COCO API to be installed: https://github.com/pdollar/coco/tree/master/PythonAPI

View File

@ -1,12 +1,11 @@
torchvision.models
===================
.. currentmodule:: torchvision.models
.. automodule:: torchvision.models
:members: alexnet, resnet18, resnet34, resnet50, resnet101, resnet152,
vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19,
vgg19_bn, inception_v3, squeezenet1_0, squeezenet1_1, densenet121,
densenet169, densenet201, densenet161
vgg19_bn
:undoc-members:

View File

@ -3,8 +3,6 @@ torchvision.transforms
.. currentmodule:: torchvision.transforms
Transforms are common image transforms. They can be chained together using :class:`Compose`
.. autoclass:: Compose
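
A minimal sketch of chaining transforms with ``Compose``; the crop size and normalization statistics below are placeholder values:

.. code:: python

    import torchvision.transforms as transforms

    # Each transform is applied to the output of the previous one.
    preprocess = transforms.Compose([
        transforms.RandomCrop(24),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])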
Transforms on PIL.Image
@ -26,20 +24,14 @@ Transforms on torch.\*Tensor
----------------------------
.. autoclass:: Normalize
:members: __call__
:special-members:
Conversion Transforms
---------------------
.. autoclass:: ToTensor
:members: __call__
:special-members:
.. autoclass:: ToPILImage
:members: __call__
:special-members:
Generic Transforms
------------------

View File

@ -15,24 +15,12 @@ import os
from tools.setup_helpers.env import check_env_flag
from tools.setup_helpers.cuda import WITH_CUDA, CUDA_HOME
from tools.setup_helpers.cudnn import WITH_CUDNN, CUDNN_LIB_DIR, CUDNN_INCLUDE_DIR
from tools.setup_helpers.split_types import split_types
DEBUG = check_env_flag('DEBUG')
WITH_DISTRIBUTED = not check_env_flag('NO_DISTRIBUTED')
WITH_DISTRIBUTED = check_env_flag('WITH_DISTRIBUTED')
WITH_DISTRIBUTED_MW = WITH_DISTRIBUTED and check_env_flag('WITH_DISTRIBUTED_MW')
WITH_NCCL = WITH_CUDA and platform.system() != 'Darwin'
SYSTEM_NCCL = False
################################################################################
# Workaround setuptools -Wstrict-prototypes warnings
# I lifted this code from https://stackoverflow.com/a/29634231/23845
################################################################################
import distutils.sysconfig
cfg_vars = distutils.sysconfig.get_config_vars()
for key, value in cfg_vars.items():
if type(value) == str:
cfg_vars[key] = value.replace("-Wstrict-prototypes", "")
################################################################################
# Monkey-patch setuptools to compile in parallel
################################################################################
@ -156,10 +144,6 @@ class build_ext(setuptools.command.build_ext.build_ext):
print('-- Building NCCL library')
else:
print('-- Not using NCCL')
if WITH_DISTRIBUTED:
print('-- Building with distributed package ')
else:
print('-- Building without distributed package')
# cwrap depends on pyyaml, so we can't import it earlier
from tools.cwrap import cwrap
@ -171,14 +155,10 @@ class build_ext(setuptools.command.build_ext.build_ext):
from tools.cwrap.plugins.NullableArguments import NullableArguments
from tools.cwrap.plugins.CuDNNPlugin import CuDNNPlugin
from tools.cwrap.plugins.WrapDim import WrapDim
from tools.cwrap.plugins.AssertNDim import AssertNDim
from tools.cwrap.plugins.Broadcast import Broadcast
from tools.cwrap.plugins.ProcessorSpecificPlugin import ProcessorSpecificPlugin
thp_plugin = THPPlugin()
cwrap('torch/csrc/generic/TensorMethods.cwrap', plugins=[
ProcessorSpecificPlugin(), BoolOption(), thp_plugin,
AutoGPU(condition='IS_CUDA'), ArgcountSortPlugin(), KwargsPlugin(),
AssertNDim(), WrapDim(), Broadcast()
BoolOption(), thp_plugin, AutoGPU(condition='IS_CUDA'),
ArgcountSortPlugin(), KwargsPlugin(), WrapDim()
])
cwrap('torch/csrc/cudnn/cuDNN.cwrap', plugins=[
CuDNNPlugin(), NullableArguments()
@ -225,10 +205,11 @@ class clean(distutils.command.clean.clean):
include_dirs = []
library_dirs = []
extra_link_args = []
extra_compile_args = ['-std=c++11', '-Wno-write-strings',
# Python 2.6 requires -fno-strict-aliasing, see
# http://legacy.python.org/dev/peps/pep-3123/
'-fno-strict-aliasing']
extra_compile_args = ['-std=c++11', '-Wno-write-strings']
if os.getenv('PYTORCH_BINARY_BUILD') and platform.system() == 'Linux':
print('PYTORCH_BINARY_BUILD found. Static linking libstdc++ on Linux')
extra_compile_args += ['-static-libstdc++']
extra_link_args += ['-static-libstdc++']
cwd = os.path.dirname(os.path.abspath(__file__))
lib_path = os.path.join(cwd, "torch", "lib")
@ -241,7 +222,6 @@ include_dirs += [
tmp_install_path + "/include/TH",
tmp_install_path + "/include/THPP",
tmp_install_path + "/include/THNN",
tmp_install_path + "/include/ATen",
]
library_dirs.append(lib_path)
@ -254,10 +234,7 @@ THCS_LIB = os.path.join(lib_path, 'libTHCS.so.1')
THNN_LIB = os.path.join(lib_path, 'libTHNN.so.1')
THCUNN_LIB = os.path.join(lib_path, 'libTHCUNN.so.1')
THPP_LIB = os.path.join(lib_path, 'libTHPP.so.1')
ATEN_LIB = os.path.join(lib_path, 'libATen.so.1')
GLOO_LIB = os.path.join(lib_path, 'libgloo.a')
GLOO_CUDA_LIB = os.path.join(lib_path, 'libgloo_cuda.a')
THD_LIB = os.path.join(lib_path, 'libTHD.a')
THD_LIB = os.path.join(lib_path, 'libTHD.so.1')
NCCL_LIB = os.path.join(lib_path, 'libnccl.so.1')
if platform.system() == 'Darwin':
TH_LIB = os.path.join(lib_path, 'libTH.1.dylib')
@ -267,26 +244,26 @@ if platform.system() == 'Darwin':
THNN_LIB = os.path.join(lib_path, 'libTHNN.1.dylib')
THCUNN_LIB = os.path.join(lib_path, 'libTHCUNN.1.dylib')
THPP_LIB = os.path.join(lib_path, 'libTHPP.1.dylib')
ATEN_LIB = os.path.join(lib_path, 'libATen.1.dylib')
THD_LIB = os.path.join(lib_path, 'libTHD.1.dylib')
NCCL_LIB = os.path.join(lib_path, 'libnccl.1.dylib')
if WITH_NCCL and subprocess.call('ldconfig -p | grep libnccl >/dev/null', shell=True) == 0:
SYSTEM_NCCL = True
SYSTEM_NCCL = True
main_compile_args = ['-D_THP_CORE']
main_libraries = ['shm']
main_link_args = [TH_LIB, THS_LIB, THPP_LIB, THNN_LIB, ATEN_LIB]
main_link_args = [TH_LIB, THS_LIB, THPP_LIB, THNN_LIB]
main_sources = [
"torch/csrc/PtrWrapper.cpp",
"torch/csrc/Module.cpp",
"torch/csrc/Generator.cpp",
"torch/csrc/Size.cpp",
"torch/csrc/Exceptions.cpp",
"torch/csrc/Tensor.cpp",
"torch/csrc/Storage.cpp",
"torch/csrc/DynamicTypes.cpp",
"torch/csrc/byte_order.cpp",
"torch/csrc/utils.cpp",
"torch/csrc/expand_utils.cpp",
"torch/csrc/utils/object_ptr.cpp",
"torch/csrc/utils/tuple_parser.cpp",
"torch/csrc/allocators.cpp",
@ -295,7 +272,7 @@ main_sources = [
"torch/csrc/autograd/engine.cpp",
"torch/csrc/autograd/function.cpp",
"torch/csrc/autograd/variable.cpp",
"torch/csrc/autograd/input_buffer.cpp",
"torch/csrc/autograd/grad_buffer.cpp",
"torch/csrc/autograd/python_function.cpp",
"torch/csrc/autograd/python_cpp_function.cpp",
"torch/csrc/autograd/python_variable.cpp",
@ -303,14 +280,9 @@ main_sources = [
"torch/csrc/autograd/python_hook.cpp",
"torch/csrc/autograd/functions/batch_normalization.cpp",
"torch/csrc/autograd/functions/convolution.cpp",
"torch/csrc/autograd/functions/basic_ops.cpp",
"torch/csrc/autograd/functions/tensor.cpp",
"torch/csrc/autograd/functions/accumulate_grad.cpp",
"torch/csrc/autograd/functions/utils.cpp",
"torch/csrc/autograd/functions/init.cpp",
"torch/csrc/nn/THNN_generic.cpp",
]
main_sources += split_types("torch/csrc/Tensor.cpp")
try:
import numpy as np
@ -331,11 +303,8 @@ if WITH_DISTRIBUTED:
"torch/csrc/distributed/Tensor.cpp",
"torch/csrc/distributed/Storage.cpp",
]
extra_compile_args += ['-DWITH_DISTRIBUTED_MW']
include_dirs += [tmp_install_path + "/include/THD"]
main_link_args += [THD_LIB]
if platform.system() == 'Linux':
main_link_args += [GLOO_LIB]
if WITH_CUDA:
cuda_lib_dirs = ['lib64', 'lib']
@ -350,20 +319,17 @@ if WITH_CUDA:
extra_link_args.append('-Wl,-rpath,' + cuda_lib_path)
extra_compile_args += ['-DWITH_CUDA']
extra_compile_args += ['-DCUDA_LIB_PATH=' + cuda_lib_path]
main_libraries += ['cudart', 'nvToolsExt']
main_libraries += ['cudart']
main_link_args += [THC_LIB, THCS_LIB, THCUNN_LIB]
if platform.system() == 'Linux':
main_link_args += [GLOO_CUDA_LIB]
main_sources += [
"torch/csrc/cuda/Module.cpp",
"torch/csrc/cuda/Storage.cpp",
"torch/csrc/cuda/Stream.cpp",
"torch/csrc/cuda/Tensor.cpp",
"torch/csrc/cuda/AutoGPU.cpp",
"torch/csrc/cuda/utils.cpp",
"torch/csrc/cuda/expand_utils.cpp",
"torch/csrc/cuda/serialization.cpp",
]
main_sources += split_types("torch/csrc/cuda/Tensor.cpp")
if WITH_NCCL:
if SYSTEM_NCCL:
@ -380,8 +346,6 @@ if WITH_CUDNN:
"torch/csrc/cudnn/BatchNorm.cpp",
"torch/csrc/cudnn/Conv.cpp",
"torch/csrc/cudnn/cuDNN.cpp",
"torch/csrc/cudnn/GridSampler.cpp",
"torch/csrc/cudnn/AffineGridGenerator.cpp",
"torch/csrc/cudnn/Types.cpp",
"torch/csrc/cudnn/Handles.cpp",
]
@ -391,18 +355,6 @@ if DEBUG:
extra_compile_args += ['-O0', '-g']
extra_link_args += ['-O0', '-g']
if os.getenv('PYTORCH_BINARY_BUILD') and platform.system() == 'Linux':
print('PYTORCH_BINARY_BUILD found. Static linking libstdc++ on Linux')
# get path of libstdc++ and link manually.
# for reasons unknown, -static-libstdc++ doesn't fully link some symbols
CXXNAME = os.getenv('CXX', 'g++')
STDCPP_LIB = subprocess.check_output([CXXNAME, '-print-file-name=libstdc++.a'])
STDCPP_LIB = STDCPP_LIB[:-1]
if type(STDCPP_LIB) != str: # python 3
STDCPP_LIB = STDCPP_LIB.decode(sys.stdout.encoding)
main_link_args += [STDCPP_LIB]
version_script = os.path.abspath("tools/pytorch.version")
extra_link_args += ['-Wl,--version-script=' + version_script]
def make_relative_rpath(path):
if platform.system() == 'Darwin':
@ -415,7 +367,7 @@ def make_relative_rpath(path):
################################################################################
extensions = []
packages = find_packages(exclude=('tools', 'tools.*',))
packages = find_packages(exclude=('tools.*',))
C = Extension("torch._C",
libraries=main_libraries,
@ -462,7 +414,7 @@ if WITH_CUDA:
)
extensions.append(THCUNN)
version = '0.2.0'
version = '0.1.12'
if os.getenv('PYTORCH_BUILD_VERSION'):
assert os.getenv('PYTORCH_BUILD_NUMBER') is not None
version = os.getenv('PYTORCH_BUILD_VERSION') \
@ -495,5 +447,5 @@ setup(name="torch", version=version,
'lib/*.h',
'lib/include/TH/*.h', 'lib/include/TH/generic/*.h',
'lib/include/THC/*.h', 'lib/include/THC/generic/*.h']},
install_requires=['pyyaml', 'numpy'],
install_requires=['pyyaml'],
)

View File

@ -15,28 +15,15 @@ from torch.autograd import Variable
torch.set_default_tensor_type('torch.DoubleTensor')
SEED = 0
SEED_SET = 0
def parse_set_seed_once():
global SEED
global SEED_SET
def run_tests():
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('--seed', type=int, default=123)
args, remaining = parser.parse_known_args()
if SEED_SET == 0:
torch.manual_seed(args.seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(args.seed)
SEED = args.seed
SEED_SET = 1
torch.manual_seed(args.seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(args.seed)
remaining = [sys.argv[0]] + remaining
return remaining
def run_tests():
remaining = parse_set_seed_once()
unittest.main(argv=remaining)
@ -90,7 +77,7 @@ def to_gpu(obj, type_map={}):
elif torch.is_storage(obj):
return obj.new().resize_(obj.size()).copy_(obj)
elif isinstance(obj, Variable):
assert obj.is_leaf
assert obj.creator is None
t = type_map.get(type(obj.data), get_gpu_type(type(obj.data)))
return Variable(obj.data.clone().type(t), requires_grad=obj.requires_grad)
elif isinstance(obj, list):
@ -131,11 +118,6 @@ def is_iterable(obj):
class TestCase(unittest.TestCase):
precision = 1e-5
def setUp(self):
torch.manual_seed(SEED)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(SEED)
def assertTensorsSlowEqual(self, x, y, prec=None, message=''):
max_err = 0
self.assertEqual(x.size(), y.size())
@ -147,7 +129,7 @@ class TestCase(unittest.TestCase):
tc = t.coalesce()
value_map = {}
for idx, val in zip(t._indices().t(), t._values()):
for idx, val in zip(t.indices().t(), t.values()):
idx_tup = tuple(idx)
if idx_tup in value_map:
value_map[idx_tup] += val
@ -156,31 +138,26 @@ class TestCase(unittest.TestCase):
new_indices = sorted(list(value_map.keys()))
new_values = [value_map[idx] for idx in new_indices]
if t._values().ndimension() < 2:
new_values = t._values().new(new_values)
if t.values().ndimension() < 2:
new_values = t.values().new(new_values)
else:
new_values = torch.stack(new_values)
new_indices = t._indices().new(new_indices).t()
new_indices = t.indices().new(new_indices).t()
tg = t.new(new_indices, new_values, t.size())
self.assertEqual(tc._indices(), tg._indices())
self.assertEqual(tc._values(), tg._values())
self.assertEqual(tc.indices(), tg.indices())
self.assertEqual(tc.values(), tg.values())
return tg
def unwrapVariables(self, x, y):
if isinstance(x, Variable) and isinstance(y, Variable):
return x.data, y.data
elif isinstance(x, Variable) or isinstance(y, Variable):
raise AssertionError("cannot compare {} and {}".format(type(x), type(y)))
return x, y
def assertEqual(self, x, y, prec=None, message=''):
if prec is None:
prec = self.precision
x, y = self.unwrapVariables(x, y)
if isinstance(x, Variable) and isinstance(y, Variable):
x = x.data
y = y.data
if torch.is_tensor(x) and torch.is_tensor(y):
def assertTensorsEqual(a, b):
@ -201,16 +178,13 @@ class TestCase(unittest.TestCase):
if x.is_sparse:
x = self.safeCoalesce(x)
y = self.safeCoalesce(y)
assertTensorsEqual(x._indices(), y._indices())
assertTensorsEqual(x._values(), y._values())
assertTensorsEqual(x.indices(), y.indices())
assertTensorsEqual(x.values(), y.values())
else:
assertTensorsEqual(x, y)
elif type(x) == str and type(y) == str:
super(TestCase, self).assertEqual(x, y)
elif type(x) == set and type(y) == set:
super(TestCase, self).assertEqual(x, y)
elif is_iterable(x) and is_iterable(y):
super(TestCase, self).assertEqual(len(x), len(y))
for x_, y_ in zip(x, y):
self.assertEqual(x_, y_, prec, message)
else:
@ -225,7 +199,9 @@ class TestCase(unittest.TestCase):
if prec is None:
prec = self.precision
x, y = self.unwrapVariables(x, y)
if isinstance(x, Variable) and isinstance(y, Variable):
x = x.data
y = y.data
if torch.is_tensor(x) and torch.is_tensor(y):
if x.size() != y.size():
@ -259,33 +235,24 @@ class TestCase(unittest.TestCase):
return
raise AssertionError("object not found in iterable")
if sys.version_info < (3, 2):
# assertRaisesRegexp renamed assertRaisesRegex in 3.2
assertRaisesRegex = unittest.TestCase.assertRaisesRegexp
def download_file(url, binary=True):
def download_file(url, path, binary=True):
if sys.version_info < (3,):
from urlparse import urlsplit
import urllib2
request = urllib2
error = urllib2
else:
from urllib.parse import urlsplit
from urllib import request, error
filename = os.path.basename(urlsplit(url)[2])
data_dir = os.path.join(os.path.dirname(__file__), 'data')
path = os.path.join(data_dir, filename)
import urllib.request
import urllib.error
request = urllib.request
error = urllib.error
if os.path.exists(path):
return path
return True
try:
data = request.urlopen(url, timeout=15).read()
with open(path, 'wb' if binary else 'w') as f:
f.write(data)
return path
except error.URLError:
msg = "could not download test file '{}'".format(url)
warnings.warn(msg, RuntimeWarning)
raise unittest.SkipTest(msg)
return True
except error.URLError as e:
return False

View File

@ -53,31 +53,29 @@ module_tests = [
dict(
module_name='ReLU',
input_size=(2, 3, 4, 5),
check_inplace=True,
check_inplace=True
),
dict(
module_name='ReLU6',
input_size=(2, 3, 4, 5),
check_inplace=True,
check_inplace=True
),
dict(
module_name='RReLU',
input_size=(1, 2, 2),
test_cuda=False,
check_gradgrad=False,
test_cuda=False
),
dict(
module_name='RReLU',
constructor_args=(0.1, 0.9),
input_size=(4, 4, 5),
desc='with_up_down',
test_cuda=False,
check_gradgrad=False,
test_cuda=False
),
dict(
module_name='Hardtanh',
input_size=(3, 2, 5),
reference_fn=lambda i, _: i.clamp(-1, 1),
reference_fn=lambda i, _: i.clamp(-1, 1)
),
dict(
module_name='Sigmoid',
@ -90,35 +88,35 @@ module_tests = [
dict(
module_name='Softmax',
input_size=(10, 20),
reference_fn=lambda i, _: torch.exp(i).div(torch.exp(i).sum(1, True).expand(10, 20)),
reference_fn=lambda i, _: torch.exp(i).div(torch.exp(i).sum(1).expand(10, 20))
),
dict(
module_name='Softmax2d',
input_size=(1, 3, 10, 20),
reference_fn=lambda i, _: torch.exp(i).div(torch.exp(i).sum(1, False)),
reference_fn=lambda i, _: torch.exp(i).div(torch.exp(i).sum(1).expand_as(i))
),
dict(
module_name='LogSoftmax',
input_size=(10, 20),
reference_fn=lambda i, _: torch.exp(i).div_(torch.exp(i).sum(1, True).expand(10, 20)).log_(),
reference_fn=lambda i, _: torch.exp(i).div_(torch.exp(i).sum(1).expand(10, 20)).log_()
),
dict(
module_name='LogSoftmax',
input_size=(1, 3, 10, 20),
reference_fn=lambda i, _: torch.exp(i).div_(torch.exp(i).sum(1, False)).log_(),
desc='multiparam',
reference_fn=lambda i, _: torch.exp(i).div_(torch.exp(i).sum(1).expand_as(i)).log_(),
desc='multiparam'
),
dict(
module_name='ELU',
constructor_args=(2.,),
input_size=(3, 2, 5),
check_inplace=True
),
# TODO: reference function
dict(
module_name='Hardshrink',
constructor_args=(2.,),
input_size=(4, 3, 2, 4),
check_gradgrad=False,
input_size=(4, 3, 2, 4)
),
dict(
module_name='LeakyReLU',
@ -135,40 +133,34 @@ module_tests = [
dict(
module_name='LogSigmoid',
input_size=(2, 3, 4),
reference_fn=lambda i, _: i.sigmoid().log(),
check_gradgrad=False,
reference_fn=lambda i, _: i.sigmoid().log()
),
dict(
module_name='Softplus',
input_size=(10, 20),
reference_fn=lambda i, _: torch.log(1 + torch.exp(i)),
check_gradgrad=False,
reference_fn=lambda i, _: torch.log(1 + torch.exp(i))
),
dict(
module_name='Softplus',
constructor_args=(2,),
input_size=(10, 20),
reference_fn=lambda i, _: 1. / 2. * torch.log(1 + torch.exp(2 * i)),
desc='beta',
check_gradgrad=False,
desc='beta'
),
dict(
module_name='Softshrink',
input_size=(3, 2, 5),
check_gradgrad=False,
input_size=(3, 2, 5)
),
dict(
module_name='Softshrink',
constructor_args=(1,),
input_size=(3, 2, 5),
desc='lambda',
check_gradgrad=False,
desc='lambda'
),
dict(
module_name='CrossMapLRN2d',
constructor_args=(5, 5e-3, 1e-3, 2),
input_size=(2, 3, 6, 6),
check_gradgrad=False,
input_size=(2, 3, 6, 6)
),
dict(
module_name='PReLU',
@ -212,12 +204,11 @@ module_tests = [
dict(
module_name='Softsign',
input_size=(3, 2, 5),
reference_fn=lambda i, _: i.div(1 + torch.abs(i)),
reference_fn=lambda i, _: i.div(1 + torch.abs(i))
),
dict(
module_name='Softmin',
input_size=(10, 20),
check_gradgrad=False,
input_size=(10, 20)
),
dict(
module_name='Tanhshrink',
@ -225,32 +216,19 @@ module_tests = [
),
]
criterion_tests = [
dict(module_name='L1Loss',
input_size=(2, 3, 4),
target=torch.randn(2, 3, 4),
reference_fn=lambda i, t, _: 1. / i.numel() *
sum((a - b).abs().sum() for a, b in zip(i, t)),
sum((a - b).abs().sum() for a, b in zip(i, t))
),
dict(
module_name='NLLLoss',
input=torch.rand(15, 10).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
),
dict(
module_name='NLLLoss',
constructor_args=(None, False),
input=torch.rand(15, 10).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='no_size_average'
),
dict(
module_name='NLLLoss',
constructor_args=(None, True, 2),
input=torch.rand(15, 10).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='ignore_index'
),
dict(
module_name='NLLLoss',
constructor_args=(torch.rand(10),),
@ -258,159 +236,120 @@ criterion_tests = [
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='weights',
),
dict(
module_name='NLLLoss',
constructor_args=(torch.rand(10), True, 2),
input=torch.rand(15, 10).add(1e-2).log(),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='weights_ignore_index'
),
dict(
module_name='NLLLoss',
constructor_args=(torch.rand(10), True, -1),
input=torch.rand(15, 10).add(1e-2).log(),
target=torch.Tensor(15).uniform_().mul(10 + 1).floor().long() - 1,
desc='weights_ignore_index_neg'
),
dict(
module_name='KLDivLoss',
input=torch.rand(10, 10).log(),
target=torch.rand(10, 10),
check_gradgrad=False,
target=torch.rand(10, 10)
),
dict(
module_name='MSELoss',
input=torch.randn(2, 3, 4, 5),
target=torch.randn(2, 3, 4, 5),
reference_fn=lambda i, t, _: (i - t).abs().pow(2).sum() / i.numel(),
check_gradgrad=False,
reference_fn=lambda i, t, _: (i - t).abs().pow(2).sum() / i.numel()
),
dict(
module_name='BCELoss',
input=torch.rand(15, 10).clamp_(1e-2, 1 - 1e-2),
target=torch.randn(15, 10).gt(0).double(),
check_gradgrad=False,
target=torch.randn(15, 10).gt(0).double()
),
dict(
module_name='BCELoss',
constructor_args=(torch.rand(10),),
input=torch.rand(15, 10).clamp_(1e-2, 1 - 1e-2),
target=torch.randn(15, 10).gt(0).double(),
desc='weights',
check_gradgrad=False,
desc='weights'
),
dict(
module_name='CrossEntropyLoss',
input=torch.randn(15, 10),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
check_gradgrad=False,
target=torch.Tensor(15).uniform_().mul(10).floor().long()
),
dict(
module_name='CrossEntropyLoss',
constructor_args=(torch.rand(10),),
input=torch.randn(15, 10),
target=torch.Tensor(15).uniform_().mul(10).floor().long(),
desc='weights',
check_gradgrad=False,
desc='weights'
),
dict(
module_name='NLLLoss2d',
input_size=(2, 3, 5, 5),
target=torch.rand(2, 5, 5).mul(3).floor().long(),
target=torch.rand(2, 5, 5).mul(3).floor().long()
),
dict(
module_name='NLLLoss2d',
constructor_args=(torch.rand(3),),
input_size=(2, 3, 5, 5),
target=torch.rand(2, 5, 5).mul(3).floor().long(),
desc='weights',
),
dict(
module_name='NLLLoss2d',
constructor_args=(None, True, 3),
input_size=(2, 3, 5, 5),
target=torch.rand(2, 5, 5).mul(4).floor().long(),
desc='ignore_index',
desc='weights'
),
dict(
module_name='HingeEmbeddingLoss',
input=torch.rand(10),
target=torch.randn(10).gt(0).double().mul_(2).sub(1),
check_gradgrad=False,
target=torch.randn(10).gt(0).double().mul_(2).sub(1)
),
dict(
module_name='HingeEmbeddingLoss',
constructor_args=(0.5,),
input=torch.rand(10),
target=torch.randn(10).gt(0).double().mul_(2).sub(1),
desc='margin',
check_gradgrad=False,
desc='margin'
),
dict(
module_name='MultiLabelMarginLoss',
input_size=(5, 10),
target=torch.rand(5, 10).mul(10).floor().long(),
check_gradgrad=False,
target=torch.rand(5, 10).mul(10).floor().long()
),
dict(
module_name='MultiLabelSoftMarginLoss',
input_size=(5, 10),
target=torch.rand(5, 10).mul(2).floor(),
check_gradgrad=False,
target=torch.rand(5, 10).mul(2).floor()
),
dict(
module_name='MultiLabelSoftMarginLoss',
constructor_args=(torch.rand(10),),
input_size=(5, 10),
target=torch.rand(5, 10).mul(2).floor(),
desc='weights',
check_gradgrad=False,
desc='weights'
),
dict(
module_name='MultiMarginLoss',
input_size=(5, 10),
target=torch.rand(5).mul(8).floor().long(),
check_gradgrad=False,
target=torch.rand(5).mul(8).floor().long()
),
dict(
module_name='SmoothL1Loss',
input_size=(5, 10),
target=torch.randn(5, 10),
check_gradgrad=False,
target=torch.randn(5, 10)
),
dict(
module_name='SoftMarginLoss',
input_size=(5, 5),
target=torch.randn(5, 5).sign(),
check_gradgrad=False,
target=torch.randn(5, 5).sign()
),
dict(
module_name='CosineEmbeddingLoss',
input=(torch.rand(15, 10), torch.rand(15, 10)),
target=torch.randn(15).sign(),
check_gradgrad=False,
target=torch.randn(15).sign()
),
dict(
module_name='CosineEmbeddingLoss',
constructor_args=(0.7,),
input=(torch.rand(15, 10), torch.rand(15, 10)),
target=torch.randn(15).sign(),
desc='margin',
check_gradgrad=False,
desc='margin'
),
dict(
module_name='MarginRankingLoss',
input=(torch.randn(50).mul(10), torch.randn(50).mul(10)),
target=torch.randn(50).sign(),
check_gradgrad=False,
target=torch.randn(50).sign()
),
dict(
module_name='MarginRankingLoss',
constructor_args=(2,),
input=(torch.randn(50).mul(10), torch.randn(50).mul(10)),
target=torch.randn(50).sign(),
desc='margin',
check_gradgrad=False,
desc='margin'
),
]
@ -440,7 +379,6 @@ class NNTestCase(TestCase):
if isinstance(input, Variable):
if input.requires_grad and input.grad is not None:
input.grad.data.zero_()
input.grad.detach_()
elif torch.is_tensor(input):
return
else:
@ -501,6 +439,7 @@ class NNTestCase(TestCase):
return out
res = tuple()
# TODO: enable non-contig tests
input = contiguous(input)
if jacobian_input:
res += get_numerical_jacobian(fw, input, input, eps=1e-6),
@ -752,7 +691,6 @@ class CriterionTest(TestBase):
test_case.assertEqual(out, expected_out)
test_case.check_criterion_jacobian(module, input, self.target)
self._do_extra_tests(test_case, module, input, self.target)
def test_cuda(self, test_case):
if not TEST_CUDA or not self.should_test_cuda:
@ -779,6 +717,3 @@ class CriterionTest(TestBase):
test_case.assertEqual(cpu_gradInput, gpu_gradInput, 4e-4)
except NotImplementedError:
pass
def _do_extra_tests(self, test_case, module, input, target):
pass

View File

@ -55,37 +55,32 @@ $PYCMD test_cuda.py $@
echo "Running NCCL tests"
$PYCMD test_nccl.py $@
distributed_set_up() {
export TEMP_DIR="$(mktemp -d)"
rm -rf "$TEMP_DIR/"*
mkdir "$TEMP_DIR/barrier"
mkdir "$TEMP_DIR/test_dir"
}
################################################################################
if [[ "$TEST_DISTRIBUTED" -eq 1 ]]; then
distributed_set_up() {
export TEMP_DIR="$(mktemp -d)"
rm -rf "$TEMP_DIR/"*
mkdir "$TEMP_DIR/barrier"
mkdir "$TEMP_DIR/test_dir"
}
distributed_tear_down() {
rm -rf "$TEMP_DIR"
}
distributed_tear_down() {
rm -rf "$TEMP_DIR"
}
trap distributed_tear_down EXIT SIGHUP SIGINT SIGTERM
trap distributed_tear_down EXIT SIGHUP SIGINT SIGTERM
echo "Running distributed tests for the TCP backend"
distributed_set_up
BACKEND=tcp WORLD_SIZE=3 $PYCMD ./test_distributed.py
distributed_tear_down
echo "Running distributed tests for the TCP backend"
distributed_set_up
BACKEND=tcp WORLD_SIZE=3 $PYCMD ./test_distributed.py
distributed_tear_down
echo "Running distributed tests for the Gloo backend"
distributed_set_up
BACKEND=gloo WORLD_SIZE=3 $PYCMD ./test_distributed.py
distributed_tear_down
if [ -x "$(command -v mpiexec)" ]; then
echo "Running distributed tests for the MPI backend"
distributed_set_up
BACKEND=mpi mpiexec -n 3 $PYCMD ./test_distributed.py
distributed_tear_down
else
echo "Skipping MPI backend tests (MPI not found)"
echo "Running distributed tests for the MPI backend"
distributed_set_up
BACKEND=mpi mpiexec -n 3 $PYCMD ./test_distributed.py
distributed_tear_down
fi
################################################################################
if [[ $COVERAGE -eq 1 ]]; then
coverage combine

File diff suppressed because it is too large

View File

@ -94,7 +94,7 @@ def small_3d_positive(t):
def small_3d_unique(t):
return t(S, S, S).copy_(torch.arange(1, S * S * S + 1).view(S, S, S))
return t(S, S, S).copy_(torch.arange(1, S * S * S + 1))
def small_1d_lapack(t):
@ -113,10 +113,6 @@ def small_2d_lapack_fat(t):
return t(4, 3).copy_(torch.arange(1, 13).view(4, 3))
def large_2d_lapack(t):
return t(1000, 1000).normal_()
def new_t(*sizes):
def tmp(t):
return t(*sizes).copy_(torch.randn(*sizes))
@ -284,8 +280,6 @@ tests = [
('qr', small_2d_lapack, lambda t: [], 'square', float_types),
('qr', small_2d_lapack_skinny, lambda t: [], 'skinny', float_types),
('qr', small_2d_lapack_fat, lambda t: [], 'fat', float_types),
('qr', large_2d_lapack, lambda t: [], 'big', float_types),
('inverse', new_t(20, 20), lambda t: [], None, float_types),
]
@ -300,7 +294,6 @@ custom_precision = {
'baddbmm': 1e-4,
'rsqrt': 1e-4,
'cumprod': 1e-4,
'qr': 3e-4,
}
simple_pointwise = [
@ -602,17 +595,6 @@ class TestCuda(TestCase):
cuda_type = get_gpu_type(t)
self.assertEqual(cuda_type(seq), reference)
def test_torch_manual_seed_seeds_cuda_devices(self):
with freeze_rng_state():
x = torch.zeros(4, 4).float().cuda()
torch.manual_seed(2)
self.assertEqual(torch.cuda.initial_seed(), 2)
x.uniform_()
torch.manual_seed(2)
y = x.clone().uniform_()
self.assertEqual(x, y)
self.assertEqual(torch.cuda.initial_seed(), 2)
def test_manual_seed(self):
with freeze_rng_state():
x = torch.zeros(4, 4).float().cuda()
@ -841,60 +823,12 @@ class TestCuda(TestCase):
self.assertEqual(gpu_tensor1[0], 1)
self.assertEqual(gpu_tensor0[0], 2)
@staticmethod
def _select_broadcastable_dims(dims_full=None):
return TestTorch._select_broadcastable_dims(dims_full)
def test_broadcast(self):
TestTorch._test_broadcast(self, lambda t: t.cuda())
def test_broadcast_fallback(self):
TestTorch._test_broadcast_fallback(self, lambda t: t.cuda())
def test_broadcast_fused_matmul(self):
TestTorch._test_broadcast_fused_matmul(self, lambda t: t.cuda())
def test_broadcast_batched_matmul(self):
TestTorch._test_broadcast_batched_matmul(self, lambda t: t.cuda())
def test_advancedindex(self):
TestTorch._test_advancedindex(self, lambda t: t.cuda())
def test_advancedindex_big(self):
TestTorch._test_advancedindex_big(self, lambda t: t.cuda())
def test_btrifact(self):
TestTorch._test_btrifact(self, lambda t: t.cuda())
def test_btrisolve(self):
TestTorch._test_btrisolve(self, lambda t: t.cuda())
def test_tensor_gather(self):
TestTorch._test_gather(self, lambda t: t.cuda(), False)
def test_tensor_scatter(self):
TestTorch._test_scatter_base(self, lambda t: t.cuda(), 'scatter_', test_bounds=False)
def test_tensor_scatterAdd(self):
TestTorch._test_scatter_base(self, lambda t: t.cuda(), 'scatter_add_', test_bounds=False)
def test_tensor_scatterFill(self):
TestTorch._test_scatter_base(self, lambda t: t.cuda(), 'scatter_', True, test_bounds=False)
def test_arange(self):
for t in ['IntTensor', 'LongTensor', 'FloatTensor', 'DoubleTensor']:
a = torch.cuda.__dict__[t]()
torch.arange(0, 10, out=a)
b = torch.__dict__[t]()
torch.arange(0, 10, out=b)
self.assertEqual(a, b.cuda())
def test_nvtx(self):
# Just making sure we can see the symbols
torch.cuda.nvtx.range_push("foo")
torch.cuda.nvtx.mark("bar")
torch.cuda.nvtx.range_pop()
if HAS_CUDA:
for decl in tests:

View File

@ -3,7 +3,7 @@ import sys
import torch
import traceback
import unittest
from torch.utils.data import Dataset, TensorDataset, DataLoader, ConcatDataset
from torch.utils.data import Dataset, TensorDataset, DataLoader
from common import TestCase, run_tests, TEST_NUMPY
from common_nn import TEST_CUDA
@ -31,38 +31,6 @@ class TestTensorDataset(TestCase):
self.assertEqual(l[i], source[i][1])
class TestConcatDataset(TestCase):
def test_concat_two_singletons(self):
result = ConcatDataset([[0], [1]])
self.assertEqual(2, len(result))
self.assertEqual(0, result[0])
self.assertEqual(1, result[1])
def test_concat_two_non_singletons(self):
result = ConcatDataset([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
self.assertEqual(10, len(result))
self.assertEqual(0, result[0])
self.assertEqual(5, result[5])
def test_concat_two_non_singletons_with_empty(self):
# Adding an empty dataset somewhere is correctly handled
result = ConcatDataset([[0, 1, 2, 3, 4],
[],
[5, 6, 7, 8, 9]])
self.assertEqual(10, len(result))
self.assertEqual(0, result[0])
self.assertEqual(5, result[5])
def test_concat_raises_index_error(self):
result = ConcatDataset([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
with self.assertRaises(IndexError):
# this one goes to 11
result[11]
class ErrorDataset(Dataset):
def __init__(self, size):
@ -109,7 +77,7 @@ class TestDataLoader(TestCase):
errors = 0
while True:
try:
next(it)
it.next()
except NotImplementedError:
errors += 1
except StopIteration:
@ -123,14 +91,6 @@ class TestDataLoader(TestCase):
def test_sequential_batch(self):
self._test_sequential(DataLoader(self.dataset, batch_size=2))
def test_growing_dataset(self):
dataset = [torch.ones(4) for _ in range(4)]
dataloader_seq = DataLoader(dataset, shuffle=False)
dataloader_shuffle = DataLoader(dataset, shuffle=True)
dataset.append(torch.ones(4))
self.assertEqual(len(dataloader_seq), 5)
self.assertEqual(len(dataloader_shuffle), 5)
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_sequential_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, pin_memory=True)
@ -156,29 +116,6 @@ class TestDataLoader(TestCase):
def test_shuffle_batch_workers(self):
self._test_shuffle(DataLoader(self.dataset, batch_size=2, shuffle=True, num_workers=4))
def _test_batch_sampler(self, **kwargs):
# [(0, 1), (2, 3, 4), (5, 6), (7, 8, 9), ...]
batches = []
for i in range(0, 100, 5):
batches.append(tuple(range(i, i + 2)))
batches.append(tuple(range(i + 2, i + 5)))
dl = DataLoader(self.dataset, batch_sampler=batches, **kwargs)
self.assertEqual(len(dl), 40)
for i, (input, _target) in enumerate(dl):
if i % 2 == 0:
offset = i * 5 // 2
self.assertEqual(len(input), 2)
self.assertEqual(input, self.data[offset:offset + 2])
else:
offset = i * 5 // 2
self.assertEqual(len(input), 3)
self.assertEqual(input, self.data[offset:offset + 3])
def test_batch_sampler(self):
self._test_batch_sampler()
self._test_batch_sampler(num_workers=4)
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
def test_shuffle_pin_memory(self):
loader = DataLoader(self.dataset, batch_size=2, shuffle=True, num_workers=4, pin_memory=True)

View File

@ -14,12 +14,7 @@ from common import TestCase
BACKEND = os.environ['BACKEND']
TEMP_DIR = os.environ['TEMP_DIR']
MASTER_PORT = '29500'
MASTER_ADDR = '127.0.0.1'
if not dist.is_available():
print('Distributed not available, skipping tests')
sys.exit(0)
MASTER_ADDR = '127.0.0.1:' + MASTER_PORT
@contextmanager
@ -69,7 +64,7 @@ class Barrier(object):
data = f.read()
if int(data) >= cls.barrier_id:
arrived += 1
if arrived == dist.get_world_size():
if arrived == dist.get_num_processes():
break
if time.time() - start_time > timeout:
@ -92,7 +87,7 @@ class _DistTestBase(object):
return (group, group_id, rank)
def _init_global_test(self):
group = [i for i in range(0, dist.get_world_size())]
group = [i for i in range(0, dist.get_num_processes())]
group_id = dist.group.WORLD
rank = dist.get_rank()
return (group, group_id, rank)
@ -101,7 +96,7 @@ class _DistTestBase(object):
def test_get_rank(self):
test_dir = os.path.join(TEMP_DIR, 'test_dir')
pid = str(os.getpid())
num_processes = dist.get_world_size()
num_processes = dist.get_num_processes()
with open(os.path.join(test_dir, pid), 'w') as f:
f.write(str(dist.get_rank()))
@ -122,16 +117,15 @@ class _DistTestBase(object):
self._barrier()
# SEND RECV
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support send/recv")
def test_send_recv(self):
rank = dist.get_rank()
tensor = _build_tensor(rank + 1)
for dest in range(0, dist.get_world_size()):
for dest in range(0, dist.get_num_processes()):
if dest == rank:
continue
dist.send(tensor, dest)
for src in range(0, dist.get_world_size()):
for src in range(0, dist.get_num_processes()):
if src == rank:
continue
tensor = _build_tensor(src + 1, value=-1)
@ -142,32 +136,29 @@ class _DistTestBase(object):
self._barrier()
# SEND RECV ANY SOURCE
@unittest.skipIf(BACKEND == 'gloo',
"Gloo does not support send/recv from any source")
def test_send_recv_any_source(self):
rank = dist.get_rank()
tensor = _build_tensor(10, rank)
for dest in range(0, dist.get_world_size()):
for dest in range(0, dist.get_num_processes()):
if dest == rank:
continue
dist.send(tensor, dest)
recv_ranks = set()
for src in range(0, dist.get_world_size()):
for src in range(0, dist.get_num_processes()):
if src == rank:
continue
tensor = _build_tensor(10, value=-1)
dist.recv(tensor)
recv_ranks.add(tensor.resize_(1)[0])
self.assertEqual(len(recv_ranks), dist.get_world_size() - 1)
self.assertEqual(len(recv_ranks), dist.get_num_processes() - 1)
self._barrier()
# ISEND
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support isend")
def test_isend(self):
rank = dist.get_rank()
world_size = dist.get_world_size()
world_size = dist.get_num_processes()
if rank == 0:
requests = [
@ -184,10 +175,9 @@ class _DistTestBase(object):
self._barrier()
# IRECV
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support irecv")
def test_irecv(self):
rank = dist.get_rank()
world_size = dist.get_world_size()
world_size = dist.get_num_processes()
if rank == 0:
expected_tensors = [_build_tensor(src, -1) for src in range(1, world_size)]
@ -206,17 +196,13 @@ class _DistTestBase(object):
self._barrier()
# BROADCAST
def _test_broadcast_helper(self, group, group_id, rank, cuda=False):
def _test_broadcast_helper(self, group, group_id, rank):
for src in group:
expected_tensor = _build_tensor(src + 1)
if cuda:
expected_tensor = expected_tensor.cuda()
if rank == src:
dist.broadcast(expected_tensor, src, group_id)
else:
tensor = _build_tensor(src + 1, -1)
if cuda:
tensor = tensor.cuda()
dist.broadcast(tensor, src, group_id)
self.assertEqual(tensor, expected_tensor)
@ -226,11 +212,6 @@ class _DistTestBase(object):
group, group_id, rank = self._init_global_test()
self._test_broadcast_helper(group, group_id, rank)
@unittest.skipIf(BACKEND != 'gloo', "Only Gloo backend supports CUDA allReduce")
def test_broadcast_cuda(self):
group, group_id, rank = self._init_global_test()
self._test_broadcast_helper(group, group_id, rank, True)
def test_broadcast_group(self):
group, group_id, rank = self._init_group_test()
self._test_broadcast_helper(group, group_id, rank)
@ -248,14 +229,12 @@ class _DistTestBase(object):
self._barrier()
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_sum(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_product(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
@ -263,28 +242,24 @@ class _DistTestBase(object):
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_min(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_max(self):
group, group_id, rank = self._init_global_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MAX, -1, 10, 10
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_sum(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_product(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
@ -292,14 +267,12 @@ class _DistTestBase(object):
2, 10, reduce((lambda x, y: x * y), [10] * (len(group) - 1), 2)
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_min(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
group, group_id, rank, dist.reduce_op.MIN, 1010, 1, 1
)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support reduce")
def test_reduce_group_max(self):
group, group_id, rank = self._init_group_test()
self._test_reduce_helper(
@ -307,19 +280,14 @@ class _DistTestBase(object):
)
# ALL REDUCE
def _test_all_reduce_helper(self, group, group_id, rank, op, master_value,
worker_value, expected_value, cuda=False):
def _test_all_reduce_helper(self, group, group_id, rank, op, master_value, worker_value, expected_value):
for src in group:
if rank == src:
tensor = _build_tensor(src + 1).fill_(master_value)
if cuda:
tensor = tensor.cuda()
dist.all_reduce(tensor, op, group_id)
self.assertEqual(tensor, _build_tensor(src + 1, expected_value))
else:
tensor = _build_tensor(src + 1).fill_(worker_value)
if cuda:
tensor = tensor.cuda()
dist.all_reduce(tensor, op, group_id)
self.assertEqual(tensor, _build_tensor(src + 1, expected_value))
@ -331,13 +299,6 @@ class _DistTestBase(object):
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1))
)
@unittest.skipIf(BACKEND != 'gloo', "Only Gloo backend supports CUDA allReduce")
def test_all_reduce_sum_cuda(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
group, group_id, rank, dist.reduce_op.SUM, 2, 10, 2 + (10 * (len(group) - 1)), True
)
def test_all_reduce_product(self):
group, group_id, rank = self._init_global_test()
self._test_all_reduce_helper(
@ -387,18 +348,20 @@ class _DistTestBase(object):
for dest in group:
tensor = _build_tensor(dest + 1, -1)
expected_tensor = _build_tensor(dest + 1, rank)
tensors = [_build_tensor(dest + 1, i) for i in group] if rank == dest else []
dist.scatter(tensor, src=dest, scatter_list=tensors, group=group_id)
self.assertEqual(tensor, expected_tensor)
if rank == dest:
tensors = [_build_tensor(dest + 1, i) for i in group]
dist.scatter_send(tensors, tensor, group_id)
self.assertEqual(tensor, expected_tensor)
else:
dist.scatter_recv(tensor, dest, group_id)
self.assertEqual(tensor, expected_tensor)
self._barrier()
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support scatter")
def test_scatter(self):
group, group_id, rank = self._init_global_test()
self._test_scatter_helper(group, group_id, rank)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support scatter")
def test_scatter_group(self):
group, group_id, rank = self._init_group_test()
self._test_scatter_helper(group, group_id, rank)
@ -407,21 +370,22 @@ class _DistTestBase(object):
def _test_gather_helper(self, group, group_id, rank):
for dest in group:
tensor = _build_tensor(dest + 1, rank)
tensors = [_build_tensor(dest + 1, -1) for i in group] if rank == dest else []
dist.gather(tensor, dst=dest, gather_list=tensors, group=group_id)
if rank == dest:
tensors = [_build_tensor(dest + 1, -1) for i in group]
dist.gather_recv(tensors, tensor, group_id)
expected_tensors = [_build_tensor(dest + 1, i) for i in group]
for t1, t2 in zip(tensors, expected_tensors):
self.assertEqual(t1, t2)
else:
dist.gather_send(tensor, dest, group_id)
self._barrier()
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support gather")
def test_gather(self):
group, group_id, rank = self._init_global_test()
self._test_gather_helper(group, group_id, rank)
@unittest.skipIf(BACKEND == 'gloo', "Gloo does not support gather")
def test_gather_group(self):
group, group_id, rank = self._init_group_test()
self._test_gather_helper(group, group_id, rank)
@ -473,13 +437,13 @@ class _DistTestBase(object):
group, group_id, rank = self._init_group_test()
self._test_barrier_helper(group, group_id, rank)
if BACKEND == 'tcp' or BACKEND == 'gloo':
if BACKEND == 'tcp':
WORLD_SIZE = os.environ['WORLD_SIZE']
class TestTCPOrGloo(TestCase, _DistTestBase):
class TestTCP(TestCase, _DistTestBase):
MANAGER_PROCESS_RANK = -1
JOIN_TIMEOUT = 10
JOIN_TIMEOUT = 5
@staticmethod
def manager_join(fn):
@ -522,11 +486,7 @@ if BACKEND == 'tcp' or BACKEND == 'gloo':
def _run(self, rank):
self.rank = rank
try:
dist.init_process_group(backend=BACKEND)
except RuntimeError as e:
if 'recompile' in e.args[0]:
sys.exit(0)
dist.init_process_group(backend=BACKEND)
# self.id() == e.g. '__main__.TestDistributed.test_get_rank'
# We're retrieving a corresponding test and executing it.
getattr(self, self.id().split(".")[2])()

View File

@ -184,16 +184,16 @@ tests = [
OldModuleTest(nn.Sum,
(1,),
input_size=(2, 4, 5),
reference_fn=lambda i, _: i.sum(1, keepdim=False)),
reference_fn=lambda i, _: i.sum(1).squeeze(1)),
OldModuleTest(nn.Sum,
(1, True),
input_size=(2, 4, 5),
reference_fn=lambda i, _: i.sum(1, keepdim=False).div(i.size(1)),
reference_fn=lambda i, _: i.sum(1).div(i.size(1)).squeeze(1),
desc='sizeAverage'),
OldModuleTest(nn.Mean,
(1,),
input_size=(2, 4, 5),
reference_fn=lambda i, _: torch.mean(i, 1, keepdim=False)),
reference_fn=lambda i, _: torch.mean(i, 1).squeeze(1)),
OldModuleTest(lambda: nn.Sequential().add(nn.GradientReversal()).add(nn.GradientReversal()),
input_size=(4, 3, 2, 2),
fullname='GradientReversal'),
@ -233,19 +233,19 @@ tests = [
reference_fn=lambda i, _: torch.bmm(i[0], i[1].view(i[1].size(0), i[1].size(1), 1)).squeeze()),
OldModuleTest(nn.Max,
input_size=(4, 5, 3),
reference_fn=lambda i, _: torch.max(i, 0, False)[0]),
reference_fn=lambda i, _: torch.max(i, 0)[0].squeeze()),
OldModuleTest(nn.Max,
(1,),
input_size=(4, 5, 3),
reference_fn=lambda i, _: torch.max(i, 1, False)[0],
reference_fn=lambda i, _: torch.max(i, 1)[0].squeeze(),
desc='with_dimension'),
OldModuleTest(nn.Min,
input_size=(4, 5, 3),
reference_fn=lambda i, _: torch.min(i, 0, False)[0]),
reference_fn=lambda i, _: torch.min(i, 0)[0].squeeze()),
OldModuleTest(nn.Min,
(1,),
input_size=(4, 5, 3),
reference_fn=lambda i, _: torch.min(i, 1, False)[0],
reference_fn=lambda i, _: torch.min(i, 1)[0].squeeze(),
desc='with_dimension'),
OldModuleTest(nn.MixtureTable,
tuple(),
@ -532,7 +532,7 @@ for p in (1, 2, 1.5):
(p,),
input_size=(4, 5),
# Eh, we need to use p as a default, so it's passed by value
reference_fn=lambda i, _, p=p: i.div(i.norm(p, 1, True).expand_as(i)),
reference_fn=lambda i, _, p=p: i.div(i.norm(p, 1).expand_as(i)),
desc=str(p)),
)
for p in range(1, 4 + 1):
@ -807,14 +807,14 @@ class TestNN(NNTestCase):
str(m)
output = m.forward(input)
output2 = input.sum(1, True).expand(4, 5).repeat(num_modules, 1)
output2 = input.sum(1).expand(4, 5).repeat(num_modules, 1)
self.assertEqual(output2, output)
gradInput = m.backward(input, torch.ones(output2.size()))
gradInput2 = torch.ones(4, 2).fill_(num_modules * 5)
self.assertEqual(gradInput, gradInput2)
gradWeight = input.sum(0, keepdim=True).expand(5, 2)
gradWeight = input.sum(0).expand(5, 2)
for l in linears:
self.assertEqual(gradWeight, l.gradWeight)
@ -884,8 +884,8 @@ class TestNN(NNTestCase):
output2 = [input, input, input]
self.assertEqual(output2, output)
gradInput = module.backward(input, gradOutput)
gradInput2 = [_gradOutput[0].sum(0, keepdim=False), _gradOutput[1].sum(
0, keepdim=False), [_gradOutput[2].sum(0, keepdim=False)]]
gradInput2 = [_gradOutput[0].sum(0).squeeze(0), _gradOutput[1].sum(
0).squeeze(0), [_gradOutput[2].sum(0).squeeze(0)]]
self.assertTrue(isinstance(gradInput, list))
self.assertFalse(isinstance(gradInput[0], list))
self.assertFalse(isinstance(gradInput[1], list))

View File

@ -112,10 +112,9 @@ class leak_checker(object):
# test is no more than 4 higher than the 10th available at the
# start. This attempts to catch file descriptor leaks, but allows
# one-off initialization that may use up a file descriptor
# TODO: Disabled because this check is too flaky
# available_fds = self._get_next_fds(10)
# self.test_case.assertLessEqual(
# available_fds[-1] - self.next_fds[-1], 5)
available_fds = self._get_next_fds(10)
self.test_case.assertLessEqual(
available_fds[-1] - self.next_fds[-1], 5)
self.test_case.assertFalse(self.has_shm_files())
return False
@ -297,8 +296,7 @@ class TestMultiprocessing(TestCase):
ctx = mp.get_context('spawn')
tensors = []
for i in range(5):
device = i % 2
tensors += [torch.arange(i * 5, (i + 1) * 5).cuda(device)]
tensors += [torch.arange(i * 5, (i + 1) * 5).cuda()]
inq = ctx.Queue()
outq = ctx.Queue()
@ -314,7 +312,7 @@ class TestMultiprocessing(TestCase):
for i, tensor in enumerate(tensors):
v, device, tensor_size, storage_size = results[i]
self.assertEqual(v, torch.arange(i * 5, (i + 1) * 5).sum())
self.assertEqual(device, i % 2)
self.assertEqual(device, 0)
self.assertEqual(tensor_size, 5)
self.assertEqual(storage_size, 5)
@ -393,10 +391,6 @@ class TestMultiprocessing(TestCase):
param = Parameter(torch.arange(1, 26).view(5, 5))
self._test_autograd_sharing(param)
def test_empty_shared(self):
t = torch.Tensor()
t.share_memory_()
def _test_is_shared(self):
t = torch.randn(5, 5)
self.assertFalse(t.is_shared())

File diff suppressed because it is too large

View File

@ -4,11 +4,8 @@ from copy import deepcopy
import torch
import torch.optim as optim
import torch.legacy.optim as old_optim
import torch.nn.functional as F
from torch.optim import SGD
from torch.autograd import Variable
from torch import sparse
from torch.optim.lr_scheduler import LambdaLR, StepLR, MultiStepLR, ExponentialLR, ReduceLROnPlateau
from common import TestCase, run_tests
@ -61,49 +58,6 @@ class TestOptim(TestCase):
self.assertLessEqual(params.data.dist(solution), initial_dist)
def _test_rosenbrock_sparse(self, constructor):
params_t = torch.Tensor([1.5, 1.5])
params = Variable(torch.Tensor([1.5, 1.5]), requires_grad=True)
params_c = Variable(torch.Tensor([1.5, 1.5]), requires_grad=True)
optimizer = constructor([params])
optimizer_c = constructor([params_c])
solution = torch.Tensor([1, 1])
initial_dist = params.data.dist(solution)
def eval(params, sparse_grad, w):
# Depending on w, provide only the x or y gradient
optimizer.zero_grad()
loss = rosenbrock(params)
loss.backward()
grad = drosenbrock(params.data)
# NB: We torture test the optimizer by returning an
# uncoalesced sparse tensor
if w:
i = torch.LongTensor([[0, 0]])
x = grad[0]
v = torch.DoubleTensor([x / 4., x - x / 4.])
else:
i = torch.LongTensor([[1, 1]])
y = grad[1]
v = torch.DoubleTensor([y - y / 4., y / 4.])
x = sparse.DoubleTensor(i, v, torch.Size([2]))
if sparse_grad:
params.grad.data = x
else:
params.grad.data = x.to_dense()
return loss
for i in range(2000):
# Do cyclic coordinate descent
w = i % 2
optimizer.step(functools.partial(eval, params, True, w))
optimizer_c.step(functools.partial(eval, params_c, False, w))
self.assertEqual(params.data, params_c.data)
self.assertLessEqual(params.data.dist(solution), initial_dist)
def _test_basic_cases_template(self, weight, bias, input, constructor):
weight = Variable(weight, requires_grad=True)
bias = Variable(bias, requires_grad=True)
@ -201,9 +155,6 @@ class TestOptim(TestCase):
def _build_params_dict(self, weight, bias, **kwargs):
return [dict(params=[weight]), dict(params=[bias], **kwargs)]
def _build_params_dict_single(self, weight, bias, **kwargs):
return [dict(params=bias, **kwargs)]
def test_sgd(self):
self._test_rosenbrock(
lambda params: optim.SGD(params, lr=1e-3),
@ -223,11 +174,6 @@ class TestOptim(TestCase):
self._build_params_dict(weight, bias, lr=1e-2),
lr=1e-3)
)
self._test_basic_cases(
lambda weight, bias: optim.SGD(
self._build_params_dict_single(weight, bias, lr=1e-2),
lr=1e-3)
)
def test_adam(self):
self._test_rosenbrock(
@ -290,11 +236,6 @@ class TestOptim(TestCase):
lr=1e-1)
)
def test_adagrad_sparse(self):
self._test_rosenbrock_sparse(
lambda params: optim.Adagrad(params, lr=1e-1)
)
def test_adamax(self):
self._test_rosenbrock(
lambda params: optim.Adamax(params, lr=1e-1),
@ -402,157 +343,5 @@ class TestOptim(TestCase):
optim.SGD(Variable(torch.randn(5, 5)), lr=3)
class SchedulerTestNet(torch.nn.Module):
def __init__(self):
super(SchedulerTestNet, self).__init__()
self.conv1 = torch.nn.Conv2d(1, 1, 1)
self.conv2 = torch.nn.Conv2d(1, 1, 1)
def forward(self, x):
return self.conv2(F.relu(self.conv1(x)))
class TestLRScheduler(TestCase):
def setUp(self):
self.net = SchedulerTestNet()
self.opt = SGD(
[{'params': self.net.conv1.parameters()}, {'params': self.net.conv2.parameters(), 'lr': 0.5}],
lr=0.05)
def test_step_lr(self):
# lr = 0.05     if epoch < 3
# lr = 0.005    if 3 <= epoch < 6
# lr = 0.0005   if 6 <= epoch < 9
# lr = 0.00005  if epoch >= 9
single_targets = [0.05] * 3 + [0.005] * 3 + [0.0005] * 3 + [0.00005] * 3
targets = [single_targets, list(map(lambda x: x * 10, single_targets))]
scheduler = StepLR(self.opt, gamma=0.1, step_size=3)
epochs = 10
self._test(scheduler, targets, epochs)
def test_multi_step_lr(self):
# lr = 0.05 if epoch < 2
# lr = 0.005 if 2 <= epoch < 5
# lr = 0.0005 if 5 <= epoch < 9
# lr = 0.00005 if epoch >= 9
single_targets = [0.05] * 2 + [0.005] * 3 + [0.0005] * 4 + [0.00005] * 3
targets = [single_targets, list(map(lambda x: x * 10, single_targets))]
scheduler = MultiStepLR(self.opt, gamma=0.1, milestones=[2, 5, 9])
epochs = 10
self._test(scheduler, targets, epochs)
def test_exp_lr(self):
single_targets = [0.05 * (0.9 ** x) for x in range(10)]
targets = [single_targets, list(map(lambda x: x * 10, single_targets))]
scheduler = ExponentialLR(self.opt, gamma=0.9)
epochs = 10
self._test(scheduler, targets, epochs)
def test_reduce_lr_on_plateau1(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 20]
metrics = [10 - i * 0.0167 for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, threshold_mode='abs', mode='min',
threshold=0.01, patience=5, cooldown=5)
epochs = 10
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau2(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.05] * 7 + [0.005] * 7 + [0.0005] * 2]
metrics = [10 - i * 0.0165 for i in range(22)]
scheduler = ReduceLROnPlateau(self.opt, patience=5, cooldown=0, threshold_mode='abs',
mode='min', threshold=0.1)
epochs = 22
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau3(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * (2 + 6) + [0.05] * (5 + 6) + [0.005] * 4]
metrics = [-0.8] * 2 + [-0.234] * 20
scheduler = ReduceLROnPlateau(self.opt, mode='max', patience=5, cooldown=5,
threshold_mode='abs')
epochs = 22
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau4(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 20]
metrics = [1.5 * (1.025 ** i) for i in range(20)] # 1.025 > 1.1**0.25
scheduler = ReduceLROnPlateau(self.opt, mode='max', patience=3,
threshold_mode='rel', threshold=0.1)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau5(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.05] * (5 + 6) + [0.005] * 4]
metrics = [1.5 * (1.005 ** i) for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, mode='max', threshold_mode='rel',
threshold=0.1, patience=5, cooldown=5)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau6(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 20]
metrics = [1.5 * (0.85 ** i) for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, mode='min', threshold_mode='rel',
threshold=0.1)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau7(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.05] * (5 + 6) + [0.005] * 4]
metrics = [1] * 7 + [0.6] + [0.5] * 12
scheduler = ReduceLROnPlateau(self.opt, mode='min', threshold_mode='rel',
threshold=0.1, patience=5, cooldown=5)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_reduce_lr_on_plateau8(self):
for param_group in self.opt.param_groups:
param_group['lr'] = 0.5
targets = [[0.5] * 6 + [0.4] * 14, [0.5] * 6 + [0.3] * 14]
metrics = [1.5 * (1.005 ** i) for i in range(20)]
scheduler = ReduceLROnPlateau(self.opt, mode='max', threshold_mode='rel', min_lr=[0.4, 0.3],
threshold=0.1, patience=5, cooldown=5)
epochs = 20
self._test_reduce_lr_on_plateau(scheduler, targets, metrics, epochs)
def test_lambda_lr(self):
self.opt.param_groups[0]['lr'] = 0.05
self.opt.param_groups[1]['lr'] = 0.4
targets = [[0.05 * (0.9 ** x) for x in range(10)], [0.4 * (0.8 ** x) for x in range(10)]]
scheduler = LambdaLR(self.opt,
lr_lambda=[lambda x1: 0.9 ** x1, lambda x2: 0.8 ** x2])
epochs = 10
self._test(scheduler, targets, epochs)
def _test(self, scheduler, targets, epochs=10):
for epoch in range(epochs):
scheduler.step(epoch)
for param_group, target in zip(self.opt.param_groups, targets):
self.assertAlmostEqual(target[epoch], param_group['lr'],
msg='LR is wrong in epoch {}: expected {}, got {}'.format(
epoch, target[epoch], param_group['lr']), delta=1e-5)
def _test_reduce_lr_on_plateau(self, scheduler, targets, metrics, epochs=10, verbose=False):
for epoch in range(epochs):
scheduler.step(metrics[epoch])
if verbose:
print('epoch{}:\tlr={}'.format(epoch, self.opt.param_groups[0]['lr']))
for param_group, target in zip(self.opt.param_groups, targets):
self.assertAlmostEqual(target[epoch], param_group['lr'],
msg='LR is wrong in epoch {}: expected {}, got {}'.format(
epoch, target[epoch], param_group['lr']), delta=1e-5)
if __name__ == '__main__':
run_tests()

View File

@ -8,63 +8,28 @@ from common import TestCase, run_tests
from common_nn import TEST_CUDA
from numbers import Number
# triplet := (index type, value type, sparse type)
cpu_triplet = (
torch.LongTensor,
torch.DoubleTensor,
torch.sparse.DoubleTensor)
def cpu_only(inner):
def outer(self, *args, **kwargs):
if self.is_cuda:
raise unittest.SkipTest("Test is CPU-only")
inner(self, *args, **kwargs)
return outer
def cuda_only(inner):
def outer(self, *args, **kwargs):
if not self.is_cuda:
raise unittest.SkipTest("Test is GPU-only")
inner(self, *args, **kwargs)
return outer
if TEST_CUDA:
cuda_triplet = (
torch.cuda.LongTensor,
torch.cuda.DoubleTensor,
torch.cuda.sparse.DoubleTensor)
class TestSparse(TestCase):
def setUp(self):
# These parameters control the various ways we can run the test.
# We will subclass and override this method to implement CUDA
# tests
self.is_cuda = False
self.is_uncoalesced = False
self.IndexTensor = torch.LongTensor
self.ValueTensor = torch.DoubleTensor
self.SparseTensor = torch.sparse.DoubleTensor
def _gen_sparse(self, d, nnz, with_size):
# TODO: Consider implementing this in the CUDA case by directly
# performing the operations on the GPU. You won't be able to
# use torch.rand/torch.randn in this case because they are
# CPU-only. If you do this, you can remove the is_cuda branch
# at the end.
#
# If you do this, be sure to update assert_uncoalesced too
@staticmethod
def _gen_sparse(d, nnz, with_size, is_cuda=False):
if isinstance(with_size, Number):
with_size = [with_size] * d
if self.is_uncoalesced:
# We want to generate a tensor with a lot of uncoalesced
# entries to stress test whether or not we handle this
# (subtle) case correctly
v_size = [nnz * 2] + list(with_size[d:])
v = torch.randn(*v_size)
r = torch.rand(d, nnz)
# Repeat the indexes, so every position shows up twice
i = torch.cat([r, r], dim=1) * \
torch.Tensor(with_size[:d]).repeat(nnz * 2, 1).transpose(0, 1)
i = i.type(torch.LongTensor)
x = torch.sparse.DoubleTensor(i, v, torch.Size(with_size))
self.assert_uncoalesced(x)
v = torch.randn(nnz)
i = (torch.rand(d, nnz) * with_size).type(torch.LongTensor)
x = torch.sparse.DoubleTensor(i, v)
else:
# Generate a sparse tensor with d sparse dimensions; the
# rest of the dimensions with_size[d:] are dense.
v_size = [nnz] + list(with_size[d:])
v = torch.randn(*v_size)
i = torch.rand(d, nnz) * \
@ -72,62 +37,49 @@ class TestSparse(TestCase):
i = i.type(torch.LongTensor)
x = torch.sparse.DoubleTensor(i, v, torch.Size(with_size))
if self.is_cuda:
if is_cuda:
return x.cuda(), i.cuda(), v.cuda()
else:
return x, i.clone(), v.clone()
def assert_uncoalesced(self, x):
"""
Test if a CPU tensor is uncoalesced. This is used to ensure
correctness of the uncoalesced tensor generation algorithm.
"""
assert not x.is_coalesced()
# Strategy: construct a new sparse tensor with the raw value
# field overwritten to a tensor of ones, coalesce it, and then
# check if any value entries are > 1 (which indicates that the
# original was uncoalesced.)
i = x._indices().clone()
v = x._values().clone().fill_(1)
y = torch.sparse.DoubleTensor(i, v, x.size())
z = self.safeCoalesce(y)
assert (z._values() > 1).sum() > 0
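To make the strategy above concrete, here is a minimal standalone sketch (illustrative only; it assumes the `torch.sparse.DoubleTensor` constructor and the `_values()` / `coalesce()` accessors exercised by these tests):

```python
import torch

# A duplicated index (position 0 appears twice) makes the tensor uncoalesced.
i = torch.LongTensor([[0, 0, 1]])
v = torch.DoubleTensor([1, 1, 1])   # values overwritten with ones
y = torch.sparse.DoubleTensor(i, v, torch.Size([3]))
z = y.coalesce()
# coalesce() sums the values at duplicated positions, so any entry > 1
# proves the original tensor was uncoalesced.
assert (z._values() > 1).sum() > 0
```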
def _test_basic(self, is_cuda):
x, i, v = self._gen_sparse(3, 10, 100, is_cuda)
def randn(self, *args, **kwargs):
"""
Variant of torch.randn that also works in the TEST_CUDA case.
"""
# TODO: Put this in torch.cuda.randn
return self.ValueTensor(*args, **kwargs).normal_()
self.assertEqual(i, x.indices())
self.assertEqual(v, x.values())
def test_basic(self):
x, i, v = self._gen_sparse(3, 10, 100)
self.assertEqual(i, x._indices())
self.assertEqual(v, x._values())
x, i, v = self._gen_sparse(3, 10, [100, 100, 100])
self.assertEqual(i, x._indices())
self.assertEqual(v, x._values())
x, i, v = self._gen_sparse(3, 10, [100, 100, 100], is_cuda)
self.assertEqual(i, x.indices())
self.assertEqual(v, x.values())
self.assertEqual(x.ndimension(), 3)
self.assertEqual(x.coalesce()._nnz(), 10)
self.assertEqual(x.nnz(), 10)
for i in range(3):
self.assertEqual(x.size(i), 100)
SparseTensor = (cuda_triplet if is_cuda else cpu_triplet)[2]
# Make sure we can access empty indices / values
x = self.SparseTensor()
self.assertEqual(x._indices().numel(), 0)
self.assertEqual(x._values().numel(), 0)
x = SparseTensor()
self.assertEqual(x.indices().numel(), 0)
self.assertEqual(x.values().numel(), 0)
def test_to_dense(self):
i = self.IndexTensor([
def test_basic(self):
self._test_basic(False)
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_basic_cuda(self):
self._test_basic(True)
def _test_to_dense(self, is_cuda):
IndexTensor, ValueTensor, SparseTensor = \
cuda_triplet if is_cuda else cpu_triplet
i = IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
[0, 0, 1, 4],
])
v = self.ValueTensor([2, 1, 3, 4])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5]))
res = self.ValueTensor([
v = ValueTensor([2, 1, 3, 4])
x = SparseTensor(i, v, torch.Size([3, 4, 5]))
res = ValueTensor([
[[2, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
@ -147,23 +99,23 @@ class TestSparse(TestCase):
x.to_dense()
self.assertEqual(res, x.to_dense())
def test_shared(self):
i = self.IndexTensor([[2]])
v = self.ValueTensor([5])
x = self.SparseTensor(i, v, torch.Size([3]))
v[0] = 6
self.assertEqual(self.ValueTensor([0, 0, 6]), x.to_dense())
i[0][0] = 0
self.assertEqual(self.ValueTensor([6, 0, 0]), x.to_dense())
def test_to_dense(self):
self._test_to_dense(False)
def test_to_dense_hybrid(self):
i = self.IndexTensor([
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_to_dense_cuda(self):
self._test_to_dense(True)
def _test_to_dense_hybrid(self, is_cuda):
IndexTensor, ValueTensor, SparseTensor = \
cuda_triplet if is_cuda else cpu_triplet
i = IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
])
v = self.ValueTensor([[2, 3], [1, 2], [3, 4], [4, 5]])
x = self.SparseTensor(i, v, torch.Size([3, 4, 2]))
res = self.ValueTensor([
v = ValueTensor([[2, 3], [1, 2], [3, 4], [4, 5]])
x = SparseTensor(i, v, torch.Size([3, 4, 2]))
res = ValueTensor([
[[2, 3],
[0, 0],
[0, 0],
@ -183,131 +135,145 @@ class TestSparse(TestCase):
x.to_dense()
self.assertEqual(res, x.to_dense())
def test_contig(self):
i = self.IndexTensor([
def test_to_dense_hybrid(self):
self._test_to_dense_hybrid(False)
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_to_dense_hybrid_cuda(self):
self._test_to_dense_hybrid(True)
def _test_contig(self, is_cuda):
IndexTensor, ValueTensor, SparseTensor = \
cuda_triplet if is_cuda else cpu_triplet
i = IndexTensor([
[1, 0, 35, 14, 39, 6, 71, 66, 40, 27],
[92, 31, 62, 50, 22, 65, 89, 74, 56, 34],
])
v = self.ValueTensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
x = self.SparseTensor(i, v, torch.Size([100, 100]))
exp_i = self.IndexTensor([
v = ValueTensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
x = SparseTensor(i, v, torch.Size([100, 100]))
exp_i = IndexTensor([
[0, 1, 6, 14, 27, 35, 39, 40, 66, 71],
[31, 92, 65, 50, 34, 62, 22, 56, 74, 89],
])
exp_v = self.ValueTensor([2, 1, 6, 4, 10, 3, 5, 9, 8, 7])
exp_v = ValueTensor([2, 1, 6, 4, 10, 3, 5, 9, 8, 7])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
self.assertEqual(exp_i, x.indices())
self.assertEqual(exp_v, x.values())
i = self.IndexTensor([
i = IndexTensor([
[2, 0, 2, 1],
[0, 0, 3, 0],
[1, 0, 4, 0],
])
v = self.ValueTensor([3, 2, 4, 1])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5]))
exp_i = self.IndexTensor([
v = ValueTensor([3, 2, 4, 1])
x = SparseTensor(i, v, torch.Size([3, 4, 5]))
exp_i = IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
[0, 0, 1, 4],
])
exp_v = self.ValueTensor([2, 1, 3, 4])
exp_v = ValueTensor([2, 1, 3, 4])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
self.assertEqual(exp_i, x.indices())
self.assertEqual(exp_v, x.values())
# Duplicate indices
i = self.IndexTensor([
i = IndexTensor([
[0, 0, 2, 0],
[0, 0, 3, 0],
[0, 0, 4, 0],
])
v = self.ValueTensor([3, 2, 4, 1])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5]))
exp_i = self.IndexTensor([
v = ValueTensor([3, 2, 4, 1])
x = SparseTensor(i, v, torch.Size([3, 4, 5]))
exp_i = IndexTensor([
[0, 2],
[0, 3],
[0, 4],
])
exp_v = self.ValueTensor([6, 4])
exp_v = ValueTensor([6, 4])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
self.assertEqual(exp_i, x.indices())
self.assertEqual(exp_v, x.values())
def test_contig_hybrid(self):
i = self.IndexTensor([
def test_contig(self):
self._test_contig(False)
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_contig_cuda(self):
self._test_contig(True)
def _test_contig_hybrid(self, is_cuda):
IndexTensor, ValueTensor, SparseTensor = \
cuda_triplet if is_cuda else cpu_triplet
i = IndexTensor([
[1, 0, 35, 14, 39, 6, 71, 66, 40, 27],
[92, 31, 62, 50, 22, 65, 89, 74, 56, 34],
])
v = self.ValueTensor([
v = ValueTensor([
[1, 2], [2, 3], [3, 4], [4, 5], [5, 6],
[6, 7], [7, 8], [8, 9], [9, 10], [10, 11],
])
x = self.SparseTensor(i, v, torch.Size([100, 100, 2]))
exp_i = self.IndexTensor([
x = SparseTensor(i, v, torch.Size([100, 100, 2]))
exp_i = IndexTensor([
[0, 1, 6, 14, 27, 35, 39, 40, 66, 71],
[31, 92, 65, 50, 34, 62, 22, 56, 74, 89],
])
exp_v = self.ValueTensor([
exp_v = ValueTensor([
[2, 3], [1, 2], [6, 7], [4, 5], [10, 11],
[3, 4], [5, 6], [9, 10], [8, 9], [7, 8],
])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
self.assertEqual(exp_i, x.indices())
self.assertEqual(exp_v, x.values())
i = self.IndexTensor([
i = IndexTensor([
[2, 0, 2, 1],
[0, 0, 3, 0],
[1, 0, 4, 0],
])
v = self.ValueTensor([[3, 3, 3], [2, 2, 2], [4, 4, 4], [1, 1, 1]])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5, 3]))
exp_i = self.IndexTensor([
v = ValueTensor([[3, 3, 3], [2, 2, 2], [4, 4, 4], [1, 1, 1]])
x = SparseTensor(i, v, torch.Size([3, 4, 5, 3]))
exp_i = IndexTensor([
[0, 1, 2, 2],
[0, 0, 0, 3],
[0, 0, 1, 4],
])
exp_v = self.ValueTensor([[2, 2, 2], [1, 1, 1], [3, 3, 3], [4, 4, 4]])
exp_v = ValueTensor([[2, 2, 2], [1, 1, 1], [3, 3, 3], [4, 4, 4]])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
self.assertEqual(exp_i, x.indices())
self.assertEqual(exp_v, x.values())
# Duplicate indices
i = self.IndexTensor([
i = IndexTensor([
[0, 0, 2, 0],
[0, 0, 3, 0],
[0, 0, 4, 0],
])
v = self.ValueTensor([[3, 2, 3], [2, 1, 1], [4, 3, 4], [1, 1, 1]])
x = self.SparseTensor(i, v, torch.Size([3, 4, 5, 3]))
exp_i = self.IndexTensor([
v = ValueTensor([[3, 2, 3], [2, 1, 1], [4, 3, 4], [1, 1, 1]])
x = SparseTensor(i, v, torch.Size([3, 4, 5, 3]))
exp_i = IndexTensor([
[0, 2],
[0, 3],
[0, 4],
])
exp_v = self.ValueTensor([[6, 4, 5], [4, 3, 4]])
exp_v = ValueTensor([[6, 4, 5], [4, 3, 4]])
x = self.safeCoalesce(x)
self.assertEqual(exp_i, x._indices())
self.assertEqual(exp_v, x._values())
self.assertEqual(exp_i, x.indices())
self.assertEqual(exp_v, x.values())
def test_clone(self):
x, _, _ = self._gen_sparse(4, 20, 5)
if self.is_uncoalesced:
self.assertFalse(x.is_coalesced())
y = x.clone()
self.assertFalse(y.is_coalesced())
x = x.coalesce()
self.assertTrue(x.is_coalesced())
y = x.clone()
self.assertTrue(y.is_coalesced())
def test_contig_hybrid(self):
self._test_contig_hybrid(False)
def test_transpose(self):
x = self._gen_sparse(4, 20, 5)[0]
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_contig_hybrid_cuda(self):
self._test_contig_hybrid(True)
def _test_transpose(self, is_cuda):
x = self._gen_sparse(4, 20, 5, is_cuda=is_cuda)[0]
y = x.to_dense()
for i, j in itertools.combinations(range(4), 2):
@ -319,7 +285,13 @@ class TestSparse(TestCase):
y = y.transpose(i, j)
self.assertEqual(x.to_dense(), y)
@cpu_only
def test_transpose(self):
self._test_transpose(False)
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_transpose_cuda(self):
self._test_transpose(True)
def test_mm(self):
def test_shape(di, dj, dk):
x, _, _ = self._gen_sparse(2, 20, [di, dj])
@ -344,7 +316,6 @@ class TestSparse(TestCase):
test_shape(100, 1000, 200)
test_shape(64, 10000, 300)
@cpu_only
def test_saddmm(self):
def test_shape(di, dj, dk):
x = self._gen_sparse(2, 20, [di, dj])[0]
@ -369,10 +340,12 @@ class TestSparse(TestCase):
test_shape(1000, 100, 100)
test_shape(3000, 64, 300)
def test_dsmm(self):
def _test_dsmm(self, is_cuda):
def test_shape(di, dj, dk):
x = self._gen_sparse(2, 20, [di, dj])[0]
y = self.randn(dj, dk)
x = self._gen_sparse(2, 20, [di, dj], is_cuda)[0]
y = torch.randn(dj, dk)
if is_cuda:
y = y.cuda()
res = torch.dsmm(x, y)
expected = torch.mm(x.to_dense(), y)
@ -382,10 +355,19 @@ class TestSparse(TestCase):
test_shape(1000, 100, 100)
test_shape(3000, 64, 300)
def test_hsmm(self):
def test_dsmm(self):
self._test_dsmm(False)
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_dsmm_cuda(self):
self._test_dsmm(True)
def _test_hsmm(self, is_cuda):
def test_shape(di, dj, dk):
x = self._gen_sparse(2, 20, [di, dj])[0]
y = self.randn(dj, dk)
x = self._gen_sparse(2, 20, [di, dj], is_cuda)[0]
y = torch.randn(dj, dk)
if is_cuda:
y = y.cuda()
res = torch.hsmm(x, y)
expected = torch.mm(x.to_dense(), y)
@ -395,10 +377,19 @@ class TestSparse(TestCase):
test_shape(1000, 100, 100)
test_shape(3000, 64, 300)
def _test_spadd_shape(self, shape_i, shape_v=None):
def test_hsmm(self):
self._test_hsmm(False)
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_hsmm_cuda(self):
self._test_hsmm(True)
def _test_spadd_shape(self, is_cuda, shape_i, shape_v=None):
shape = shape_i + (shape_v or [])
x, _, _ = self._gen_sparse(len(shape_i), 10, shape)
y = self.randn(*shape)
x, _, _ = self._gen_sparse(len(shape_i), 10, shape, is_cuda)
y = torch.randn(*shape)
if is_cuda:
y = y.cuda()
r = random.random()
res = torch.add(y, r, x)
@ -410,7 +401,9 @@ class TestSparse(TestCase):
s = list(shape)
s[0] = shape[-1]
s[-1] = shape[0]
y = self.randn(*s)
y = torch.randn(*s)
if is_cuda:
y = y.cuda()
y.transpose_(0, len(s) - 1)
r = random.random()
@ -419,22 +412,36 @@ class TestSparse(TestCase):
self.assertEqual(res, expected)
def _test_spadd(self, is_cuda):
self._test_spadd_shape(is_cuda, [5, 6])
self._test_spadd_shape(is_cuda, [10, 10, 10])
self._test_spadd_shape(is_cuda, [50, 30, 20])
self._test_spadd_shape(is_cuda, [5, 5, 5, 5, 5, 5])
def test_spadd(self):
self._test_spadd_shape([5, 6])
self._test_spadd_shape([10, 10, 10])
self._test_spadd_shape([50, 30, 20])
self._test_spadd_shape([5, 5, 5, 5, 5, 5])
self._test_spadd(False)
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_spadd_cuda(self):
self._test_spadd(True)
def _test_spadd_hybrid(self, is_cuda):
self._test_spadd_shape(is_cuda, [5, 6], [2, 3])
self._test_spadd_shape(is_cuda, [10, 10, 10], [3])
self._test_spadd_shape(is_cuda, [50, 30, 20], [2])
self._test_spadd_shape(is_cuda, [5, 5, 5, 5, 5, 5], [2])
def test_spadd_hybrid(self):
self._test_spadd_shape([5, 6], [2, 3])
self._test_spadd_shape([10, 10, 10], [3])
self._test_spadd_shape([50, 30, 20], [2])
self._test_spadd_shape([5, 5, 5, 5, 5, 5], [2])
self._test_spadd_hybrid(False)
def _test_basic_ops_shape(self, shape_i, shape_v=None):
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_spadd_hybrid_cuda(self):
self._test_spadd_hybrid(True)
def _test_basic_ops_shape(self, is_cuda, shape_i, shape_v=None):
shape = shape_i + (shape_v or [])
x1, _, _ = self._gen_sparse(len(shape_i), 9, shape)
x2, _, _ = self._gen_sparse(len(shape_i), 12, shape)
x1, _, _ = self._gen_sparse(len(shape_i), 9, shape, is_cuda)
x2, _, _ = self._gen_sparse(len(shape_i), 12, shape, is_cuda)
y1 = x1 + x2
y2 = x1.clone()
@ -491,25 +498,39 @@ class TestSparse(TestCase):
self.assertTrue(y.is_coalesced())
self.assertEqual(x1, y)
# check that coalesce is out of place
y._values().add_(1)
self.assertEqual(z._values() + 1, y._values())
y.values().add_(1)
self.assertEqual(z.values() + 1, y.values())
def _test_basic_ops(self, is_cuda):
self._test_basic_ops_shape(is_cuda, [5, 6])
self._test_basic_ops_shape(is_cuda, [10, 10, 10])
self._test_basic_ops_shape(is_cuda, [50, 30, 20])
self._test_basic_ops_shape(is_cuda, [5, 5, 5, 5, 5, 5])
def test_basic_ops(self):
self._test_basic_ops_shape([5, 6])
self._test_basic_ops_shape([10, 10, 10])
self._test_basic_ops_shape([50, 30, 20])
self._test_basic_ops_shape([5, 5, 5, 5, 5, 5])
self._test_basic_ops(False)
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_basic_ops_cuda(self):
self._test_basic_ops(True)
def _test_basic_ops_hybrid(self, is_cuda):
self._test_basic_ops_shape(is_cuda, [5, 6], [2, 3])
self._test_basic_ops_shape(is_cuda, [10, 10, 10], [3])
self._test_basic_ops_shape(is_cuda, [50, 30, 20], [2])
self._test_basic_ops_shape(is_cuda, [5, 5, 5, 5, 5, 5], [2])
def test_basic_ops_hybrid(self):
self._test_basic_ops_shape([5, 6], [2, 3])
self._test_basic_ops_shape([10, 10, 10], [3])
self._test_basic_ops_shape([50, 30, 20], [2])
self._test_basic_ops_shape([5, 5, 5, 5, 5, 5], [2])
self._test_basic_ops_hybrid(False)
def _test_sparse_mask_shape(self, shape_i, shape_v=None):
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_basic_ops_hybrid_cuda(self):
self._test_basic_ops_hybrid(True)
def _test_sparse_mask_shape(self, is_cuda, shape_i, shape_v=None):
shape = shape_i + (shape_v or [])
x1, _, _ = self._gen_sparse(len(shape_i), 9, shape)
x2, _, _ = self._gen_sparse(len(shape_i), 12, shape)
x1, _, _ = self._gen_sparse(len(shape_i), 9, shape, is_cuda)
x2, _, _ = self._gen_sparse(len(shape_i), 12, shape, is_cuda)
y1 = x1 + x2
y2 = x1.clone()
@ -518,108 +539,78 @@ class TestSparse(TestCase):
self.assertEqual(y1.to_dense(), expected)
self.assertEqual(y2.to_dense(), expected)
def _test_sparse_mask_fixed(self):
i = self.IndexTensor([
[1, 3, 0, 4],
[2, 1, 2, 3],
def _test_sparse_mask_fixed(self, is_cuda):
IndexTensor, ValueTensor, SparseTensor = \
cuda_triplet if is_cuda else cpu_triplet
i = IndexTensor([
[1, 3, 3, 0, 4],
[2, 1, 1, 2, 3],
])
v = self.ValueTensor([1, 2, 3, 4])
x = self.SparseTensor(i, v, torch.Size([5, 4])).coalesce()
dense = self.ValueTensor([
v = ValueTensor([1, 2, 3, 4, 5])
x = SparseTensor(i, v, torch.Size([5, 4]))
dense = ValueTensor([
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20],
])
exp_v = self.ValueTensor([7, 14, 3, 20])
res = dense._sparse_mask(x)
expected = self.SparseTensor(i, exp_v, torch.Size([5, 4]))
exp_v = ValueTensor([7, 14, 14, 3, 20])
res = dense.sparse_mask(x)
expected = SparseTensor(i, exp_v, torch.Size([5, 4]))
self.assertEqual(res, expected)
def _test_sparse_mask(self, is_cuda):
self._test_sparse_mask_fixed(is_cuda)
self._test_sparse_mask_shape(is_cuda, [5, 6])
self._test_sparse_mask_shape(is_cuda, [10, 10, 10])
self._test_sparse_mask_shape(is_cuda, [50, 30, 20])
self._test_sparse_mask_shape(is_cuda, [5, 5, 5, 5, 5, 5])
def test_sparse_mask(self):
self._test_sparse_mask_fixed()
self._test_sparse_mask(False)
self._test_sparse_mask_shape([5, 6])
self._test_sparse_mask_shape([10, 10, 10])
self._test_sparse_mask_shape([50, 30, 20])
self._test_sparse_mask_shape([5, 5, 5, 5, 5, 5])
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_sparse_mask_cuda(self):
self._test_sparse_mask(True)
def _test_sparse_mask_hybrid_fixed(self):
i = self.IndexTensor([
[1, 3, 0, 4],
[2, 1, 2, 3],
def _test_sparse_mask_hybrid_fixed(self, is_cuda):
IndexTensor, ValueTensor, SparseTensor = \
cuda_triplet if is_cuda else cpu_triplet
i = IndexTensor([
[1, 3, 3, 0, 4],
[2, 1, 1, 2, 3],
])
v = self.ValueTensor([[1, 2], [2, 3], [3, 4], [4, 5]])
# TODO: This is also testing that, if coalesce is a no-op,
# the indices don't get permuted. I don't know if we actually
# want to give this invariant.
x = self.SparseTensor(i, v, torch.Size([5, 4, 2])).coalesce()
dense = self.ValueTensor([
v = ValueTensor([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
x = SparseTensor(i, v, torch.Size([5, 4, 2]))
dense = ValueTensor([
[[1, 3], [2, 2], [3, 3], [4, 2]],
[[5, 7], [6, 7], [7, 9], [8, 9]],
[[9, 2], [10, 4], [11, 1], [12, 3]],
[[13, 5], [14, 1], [15, 1], [16, 6]],
[[17, 7], [18, 2], [19, 7], [20, 1]],
])
res = dense._sparse_mask(x)
exp_v = self.ValueTensor([[7, 9], [14, 1], [3, 3], [20, 1]])
expected = self.SparseTensor(i, exp_v, torch.Size([5, 4, 2]))
res = dense.sparse_mask(x)
exp_v = ValueTensor([[7, 9], [14, 1], [14, 1], [3, 3], [20, 1]])
expected = SparseTensor(i, exp_v, torch.Size([5, 4, 2]))
self.assertEqual(res, expected)
def _test_sparse_mask_hybrid(self, is_cuda):
self._test_sparse_mask_hybrid_fixed(is_cuda)
self._test_sparse_mask_shape(is_cuda, [5, 6], [2, 3])
self._test_sparse_mask_shape(is_cuda, [10, 10, 10], [3])
self._test_sparse_mask_shape(is_cuda, [50, 30, 20], [2])
self._test_sparse_mask_shape(is_cuda, [5, 5, 5, 5, 5, 5], [2])
def test_sparse_mask_hybrid(self):
self._test_sparse_mask_hybrid_fixed()
self._test_sparse_mask_hybrid(False)
self._test_sparse_mask_shape([5, 6], [2, 3])
self._test_sparse_mask_shape([10, 10, 10], [3])
self._test_sparse_mask_shape([50, 30, 20], [2])
self._test_sparse_mask_shape([5, 5, 5, 5, 5, 5], [2])
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
def test_sparse_mask_hybrid_cuda(self):
self._test_sparse_mask_hybrid(True)
@cuda_only
def test_storage_not_null(self):
x = torch.cuda.sparse.FloatTensor(2)
self.assertNotEqual(x.get_device(), -1)
@cuda_only
@unittest.skipIf(torch.cuda.device_count() < 2, "only one GPU detected")
def test_same_gpu(self):
i = self.IndexTensor([[2]]).cuda(1)
v = self.ValueTensor([5]).cuda(1)
x = self.SparseTensor(i, v, torch.Size([3]), device=1)
self.assertEqual(x.get_device(), 1)
self.assertEqual(x._values().get_device(), 1)
self.assertEqual(x._indices().get_device(), 1)
x = self.SparseTensor(3, device=1)
self.assertEqual(x.get_device(), 1)
self.assertEqual(x._values().get_device(), 1)
self.assertEqual(x._indices().get_device(), 1)
v = self.ValueTensor([5]).cuda(0)
self.assertRaises(RuntimeError, lambda: self.SparseTensor(i, v, torch.Size([3])))
class TestUncoalescedSparse(TestSparse):
def setUp(self):
super(TestUncoalescedSparse, self).setUp()
self.is_uncoalesced = True
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
class TestCudaSparse(TestSparse):
def setUp(self):
super(TestCudaSparse, self).setUp()
self.is_cuda = True
self.IndexTensor = torch.cuda.LongTensor
self.ValueTensor = torch.cuda.DoubleTensor
self.SparseTensor = torch.cuda.sparse.DoubleTensor
@unittest.skipIf(not TEST_CUDA, 'CUDA not available')
class TestCudaUncoalescedSparse(TestCudaSparse):
def setUp(self):
super(TestCudaUncoalescedSparse, self).setUp()
self.is_uncoalesced = True
if __name__ == '__main__':
run_tests()

File diff suppressed because it is too large

View File

@ -336,11 +336,16 @@ class TestLuaReader(TestCase):
@classmethod
def init(cls):
try:
path = download_file('https://download.pytorch.org/test_data/legacy_modules.t7')
except unittest.SkipTest:
DATA_URL = 'https://download.pytorch.org/test_data/legacy_modules.t7'
data_dir = os.path.join(os.path.dirname(__file__), 'data')
test_file_path = os.path.join(data_dir, 'legacy_modules.t7')
succ = download_file(DATA_URL, test_file_path)
if not succ:
warnings.warn(("Couldn't download the test file for TestLuaReader! "
"Tests will be incomplete!"), RuntimeWarning)
return
tests = load_lua(path)
tests = load_lua(test_file_path)
for name, test in tests['modules'].items():
test_name = 'test_' + name.replace('nn.', '')
setattr(cls, test_name, cls._module_test(name, test))

View File

@ -4,7 +4,6 @@ from string import Template
from copy import deepcopy
from .plugins import ArgcountChecker, OptionalArguments, ArgumentReferences, \
BeforeAfterCall, ConstantArguments, ReturnArguments, GILRelease
from ..shared import cwrap_common
class cwrap(object):
@ -36,11 +35,11 @@ class cwrap(object):
DEFAULT_PLUGIN_CLASSES = [ArgcountChecker, ConstantArguments, OptionalArguments,
ArgumentReferences, BeforeAfterCall, ReturnArguments, GILRelease]
def __init__(self, source, destination=None, plugins=None, default_plugins=True):
def __init__(self, source, destination=None, plugins=[], default_plugins=True):
if destination is None:
destination = source.replace('.cwrap', '.cpp')
self.plugins = [] if plugins is None else plugins
self.plugins = plugins
if default_plugins:
defaults = [cls() for cls in self.DEFAULT_PLUGIN_CLASSES]
self.plugins = defaults + self.plugins
@ -52,10 +51,7 @@ class cwrap(object):
with open(source, 'r') as f:
declarations = f.read()
# wrap all the declarations in the source .cwrap file
wrapper = self.wrap_declarations(declarations)
# let each plugin do any post-processing of the wrapped file
for plugin in self.plugins:
wrapper = plugin.process_full_file(wrapper)
@ -77,7 +73,7 @@ class cwrap(object):
elif line == ']]':
in_declaration = False
declaration = yaml.load('\n'.join(declaration_lines))
cwrap_common.set_declaration_defaults(declaration)
self.set_declaration_defaults(declaration)
# Pass declaration in a list - maybe some plugins want to add
# multiple wrappers
@ -105,6 +101,24 @@ class cwrap(object):
return '\n'.join(output)
def set_declaration_defaults(self, declaration):
declaration.setdefault('arguments', [])
declaration.setdefault('return', 'void')
if 'cname' not in declaration:
declaration['cname'] = declaration['name']
# Simulate multiple dispatch, even if it's not necessary
if 'options' not in declaration:
declaration['options'] = [{'arguments': declaration['arguments']}]
del declaration['arguments']
# Parse arguments (some of them can be strings)
for option in declaration['options']:
option['arguments'] = self.parse_arguments(option['arguments'])
# Propagate defaults from declaration to options
for option in declaration['options']:
for k, v in declaration.items():
if k != 'name' and k != 'options':
option.setdefault(k, v)
def parse_arguments(self, args):
new_args = []
for arg in args:
@ -122,10 +136,6 @@ class cwrap(object):
return new_args
def search_plugins(self, fnname, args, fallback):
"""Search plugins for the given function to call with args.
If not found, call fallback with args.
"""
for plugin in self.plugins:
wrapper = getattr(plugin, fnname)(*args)
if wrapper is not None:

View File

@ -1,6 +1,4 @@
import os
from . import CWrapPlugin
from ...shared import cwrap_common
class ArgcountSortPlugin(CWrapPlugin):
@ -9,7 +7,8 @@ class ArgcountSortPlugin(CWrapPlugin):
self.descending = descending
def process_declarations(self, declarations):
def num_checked_args(option):
return sum(map(lambda a: not a.get('ignore_check', False), option['arguments']))
for declaration in declarations:
cwrap_common.sort_by_number_of_options(declaration,
self.descending)
declaration['options'].sort(key=num_checked_args, reverse=self.descending)
return declarations

View File

@ -1,29 +0,0 @@
from . import CWrapPlugin
from string import Template
class AssertNDim(CWrapPlugin):
PRE_CODE_TEMPLATE = Template(
"""if(THTensor_(nDimension)(LIBRARY_STATE ${arg_op}) != ${dim_value}) {
THError("Expected argument %s to have %d dimension(s), but has %d",
"${op}", ${dim_value}, THTensor_(nDimension)(LIBRARY_STATE ${arg_op}));
}
""")
def process_option_code_template(self, template, option):
new_code_pre = []
for _, arg in enumerate(option['arguments']):
if 'assert_ndim' not in arg:
continue
dim_value = arg.get('assert_ndim')
op = arg.get('assign_name', arg['name'])
arg_op = "arg_" + op
new_code_pre.append(self.PRE_CODE_TEMPLATE.substitute(op=op,
arg_op=arg_op,
dim_value=dim_value))
template = new_code_pre + template
return template

View File

@ -1,12 +1,6 @@
from . import CWrapPlugin
from string import Template
import sys
if sys.version_info[0] == 3:
string_type = str
else:
string_type = basestring
class BoolOption(CWrapPlugin):
@ -21,8 +15,7 @@ class BoolOption(CWrapPlugin):
for arg in option['arguments']:
if self.is_bool_option(arg):
arg['is_bool_option'] = True
if isinstance(arg['if_true'], string_type):
arg['type'] = 'const char*'
arg['type'] = 'const char*'
return declarations
def get_type_check(self, arg, option):

View File

@ -1,318 +0,0 @@
from . import CWrapPlugin
from string import Template
# Arguments to the Broadcast Plugin:
# broadcast: args_to_broadcast_against [inplace] [fallback]
# [args_to_broadcast_against]: either a single argument (e.g. "arg1") or a comma-separated
# list of two arguments (e.g. "tensor1,tensor2") indicating
# arguments to broadcast specified argument (usually "self") against
# [inplace] will generate code for in-place function, which doesn't allow the in-place
# argument to be broadcast
# [fallback] if tensors aren't broadcastable, preserves "element number" pointwise behavior,
# where only the number of elements needs to match, and tensors are viewed as 1-dimensional.
# [dims] specifies that the tensors shouldn't be broadcast to a specific tensor or tensors, but to a combination
# of individual dimension sizes of a set of tensors. For example: addbmm(C,A,B) a.k.a. [C + A @ B]
# broadcasts C to the first dimension of A and the second dimension of B. Each dimension is specified as
# [arg].dim[#] and dimensions are comma-separated. So, to specify that the tensor should be
# broadcast to 3-dimensions with sizes:
# tensor0->size[0] x tensor1->size[1] x tensor2->size[2]
# you would write:
# dims:tensor0.dim0,tensor1.dim1,tensor2.dim2
# [types] if the tensors should be of different types than THTensor, specify as X where
# the actual type to use is THXTensor (i.e. Byte for THByteTensor). If the type
# should be THTensor, use 'Real'
# For out of place:
# Two args: expand the two args together
# Three args (fused kernels): (e.g. addcmul) expand all three args together
# Sketch of proof that this is the same:
# consider addcmul, under expansion we want: a + (b * c) = (a + b * c) [all expanded together]
# Let e(i, j) be the expansion of i with j, e(i, j, k) be the expansion of i with j,k
#
# Then a + (b * c) = e(a, e(b,c) * e(c,b)) + e(e(b,c) * e(c,b), a)
# = e(a, e(b,c)) + e(e(b,c) * e(c,b), a) (only size matters for second param)
# = e(a,b,c) + e(e(b,c) * e(c,b), a) (by associativity of max in expand)
# = e(a,b,c) + e(b,c,a) * e(c,b,a) (see L1)
# which is a + b * c all expanded together
#
# L1: Show e(i * j, a) = e(i,a) * e(j,a) where i,j have same size
# Consider any index _{ s_0, ..., s_n}
# e(i * j, a) = (i*j)_{f(s_0), ...,f(s_n)} where f is the expansion of that dimension with a
# = i_{f(s_0), ..., f(s_n)} * j_{f(s_0), ..., f(s_n)} by definition of pointwise operator
# = e(i,a) * e(j,a)
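The identity L1 can be spot-checked numerically; a small sketch (illustrative only, not part of the plugin):

```python
import torch

# L1: expanding a pointwise product equals the product of the expansions,
# when i and j have the same size.
i = torch.randn(3, 1)
j = torch.randn(3, 1)
lhs = (i * j).expand(3, 4)
rhs = i.expand(3, 4) * j.expand(3, 4)
assert (lhs - rhs).abs().max() < 1e-6
```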
class Broadcast(CWrapPlugin):
# Save and restore passed-in arguments in case later plugins use them.
POST_TEMPLATE = Template(
"""${arg_op_other} = ${arg_op_other}_save;\n""")
def getPreArgStringTemplate(self, type=None):
if type is None:
ret = """THTensor *${arg_op_other}_save = ${arg_op_other};
THTensorPtr ${arg_op_other}_guard(THTensor_(new)(LIBRARY_STATE_NOARGS));\n"""
else:
cpu_t = "TH" + type + "Tensor"
gpu_t = "THCuda" + type + "Tensor"
ret = ("#if !IS_CUDA\n" +
cpu_t + " *${arg_op_other}_save = ${arg_op_other};\n" +
cpu_t + "Ptr ${arg_op_other}_guard(" + cpu_t + "_new(LIBRARY_STATE_NOARGS));\n" +
"#else\n" +
gpu_t + " *${arg_op_other}_save = ${arg_op_other};\n" +
"THPPointer<" + gpu_t + "> ${arg_op_other}_guard(\n" + gpu_t + "_new(LIBRARY_STATE_NOARGS));\n" +
"#endif\n")
return Template(ret)
def getExpandTemplate(self, expand_call, success_code, raise_errors):
if not raise_errors:
return Template(
"bool expand_success = false;\n" +
"try {\n" +
expand_call +
"\nexpand_success = true;\n" +
"}\n"
"catch (std::exception &e) {}\n" +
"if(expand_success) {\n" +
success_code +
"\n}\n")
else:
return Template(
expand_call + "\n" +
success_code + "\n")
def getOutPlacePreExpand2Template(self, raise_errors):
expand_code = """expand_outplace2(LIBRARY_STATE ${arg_op_a}_guard.get(), ${arg_op_other}_guard.get(),
${arg_op_a}, ${arg_op_other},
\"${op_a}\", \"${op_other}\", !${raise_errors});"""
success_code = """${arg_op_a} = ${arg_op_a}_guard.get();
${arg_op_other} = ${arg_op_other}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
def getOutPlacePreExpand3Template(self, raise_errors):
expand_code = """expand_outplace3(LIBRARY_STATE ${arg_op_a}_guard.get(),
${arg_op_other1}_guard.get(), ${arg_op_other2}_guard.get(),
${arg_op_a}, ${arg_op_other1}, ${arg_op_other2},
\"${op_a}\", \"${op_other1}\", \"${op_other2}\", !${raise_errors});"""
success_code = """${arg_op_a} = ${arg_op_a}_guard.get();
${arg_op_other1} = ${arg_op_other1}_guard.get();
${arg_op_other2} = ${arg_op_other2}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
OUT_PLACE_PRE_EXPAND_PRE_DIM_TEMPLATE = Template(
"""if(THTensor_(nDimension)(LIBRARY_STATE ${arg_op_dim}) <= ${arg_op_dim_value}) {
THError("Argument %s requires at least %d dimensions, but only has %d",
"${op_dim}", ${arg_op_dim_value} + 1, THTensor_(nDimension)(LIBRARY_STATE ${arg_op_dim}));
}
long ${arg_op_a}_dim${idx}_size = THTensor_(size)(LIBRARY_STATE ${arg_op_dim}, ${arg_op_dim_value});\n""")
OUT_PLACE_PRE_EXPAND1_DIM_TEMPLATE = Template(
"""THLongStoragePtr ${arg_op_a}_storage(THLongStorage_newWithSize1(${arg_op_a}_dim0_size));\n""")
OUT_PLACE_PRE_EXPAND2_DIM_TEMPLATE = Template(
"""THLongStoragePtr ${arg_op_a}_storage(
THLongStorage_newWithSize2(${arg_op_a}_dim0_size, ${arg_op_a}_dim1_size));\n""")
OUT_PLACE_PRE_EXPAND3_DIM_TEMPLATE = Template(
"""THLongStoragePtr ${arg_op_a}_storage(
THLongStorage_newWithSize3(${arg_op_a}_dim0_size, ${arg_op_a}_dim1_size, ${arg_op_a}_dim2_size));\n""")
def getOutPlacePreExpandPostDimTemplate(self, raise_errors):
expand_code = """expand(LIBRARY_STATE ${arg_op_a}_guard.get(), ${arg_op_a}, ${arg_op_a}_storage);"""
success_code = """${arg_op_a} = ${arg_op_a}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
OUT_PLACE_PRE_TEMPLATE = Template(
"""${code_arg_op_a}${code_arg_op_other1}${code_arg_op_other2}
${expand_code}""")
def getInPlacePreExpand1Template(self, raise_errors):
expand_code = """expand_inplace1(LIBRARY_STATE ${arg_op_other}_guard.get(), ${arg_op_other}, ${arg_op_a},
\"${op_other}\", \"${op_a}\", !${raise_errors});"""
success_code = """${arg_op_other} = ${arg_op_other}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
def getInPlacePreExpand2Template(self, raise_errors):
expand_code = """expand_inplace2(LIBRARY_STATE ${arg_op_other1}_guard.get(), ${arg_op_other2}_guard.get(),
${arg_op_other1}, ${arg_op_other2}, ${arg_op_a},
\"${op_other1}\", \"${op_other2}\", \"${op_a}\", !${raise_errors});"""
success_code = """${arg_op_other1} = ${arg_op_other1}_guard.get();
${arg_op_other2} = ${arg_op_other2}_guard.get();"""
return self.getExpandTemplate(expand_code, success_code, raise_errors)
IN_PLACE_PRE_TEMPLATE = Template(
"""${code_arg_op_other1}${code_arg_op_other2}
${expand_code}""")
def initialize(self, cwrap):
self.cwrap = cwrap
# Arguments:
# [0]: name of tensor to broadcast with (possibly two comma separated)
# [1] inplace (optional). In place operations only broadcast on second tensor argument
# [2] fallback (optional). Will fallback to applying to tensor of equal nElem if broadcast fails
def process_option_code_template(self, template, option):
new_code_pre = []
new_code_post = []
for _, arg in enumerate(option['arguments']):
if 'broadcast' not in arg:
continue
params = arg.get('broadcast').split(" ")
op_a = arg.get('assign_name', arg['name'])
in_place = "inplace" in params
raise_errors = "false" if "fallback" in params else "true"
param_others = params[0].split(",")
if len(param_others) > 2:
raise ValueError('Broadcast only supports up to 2 secondary parameters')
op_b = param_others[0]
op_c = param_others[1] if len(param_others) == 2 else None
arg_op_b = "arg_" + op_b
arg_op_a = "arg_" + op_a
arg_op_c = ("arg_" + op_c) if op_c else None
dims_kvs = []
for p in params:
if p.startswith("dims:"):
assert(raise_errors == "true")
if len(dims_kvs) != 0:
raise ValueError("multiple specifications of dims")
dims = p[len("dims:"):].split(",")
for dim in dims:
batchdim = dim.split(".")
assert len(batchdim) == 2
assert batchdim[1].startswith("dim")
dim_val = batchdim[1][len("dim"):]
dims_kvs.append({"op": batchdim[0], "arg_op": "arg_" + batchdim[0], "val": dim_val})
assert len(dims_kvs) <= 3
for p in params[1:]:
if p != "inplace" and p != "fallback" and not p.startswith("dims:") and not p.startswith("types:"):
raise ValueError("invalid parameter {}".format(p))
type_op_b = None
type_op_c = None
for p in params:
if p.startswith("types:"):
if not in_place and len(dims_kvs) > 0:
raise ValueError("type specification not supported yet for out-of-place functions "
"that specify explicit dimensions")
types = p[len("types:"):].split(",")
assert(len(types) == (2 if op_c else 1))
type_op_b = None if types[0] == "Real" else types[0]
if op_c:
type_op_c = None if types[1] == "Real" else types[1]
op_b_mapping = {
"op_a": op_a,
"op_other": op_b,
"arg_op_a": arg_op_a,
"arg_op_other": arg_op_b,
"raise_errors": raise_errors
}
op_c_mapping = {
"op_a": op_a,
"op_other": op_c,
"arg_op_a": arg_op_a,
"arg_op_other": arg_op_c,
"raise_errors": raise_errors
}
if in_place:
code_arg_op_other1 = self.getPreArgStringTemplate(type=type_op_b).substitute(op_b_mapping)
code_arg_op_other2 = (
self.getPreArgStringTemplate(type=type_op_c).substitute(op_c_mapping) if op_c else "")
if op_c:
expand_code = self.getInPlacePreExpand2Template(raise_errors == "true").substitute(
op_b_mapping,
op_other1=op_b,
op_other2=op_c,
arg_op_other1=arg_op_b,
arg_op_other2=arg_op_c)
else:
expand_code = self.getInPlacePreExpand1Template(raise_errors == "true").substitute(op_b_mapping)
new_code_pre.append(self.IN_PLACE_PRE_TEMPLATE.substitute(
arg_op_a=arg_op_a,
code_arg_op_other1=code_arg_op_other1,
code_arg_op_other2=code_arg_op_other2,
expand_code=expand_code,
raise_errors=raise_errors))
new_code_pre.append("")
post_code = self.POST_TEMPLATE.substitute(op_b_mapping)
if op_c:
post_code += self.POST_TEMPLATE.substitute(op_c_mapping)
new_code_post.append(post_code)
new_code_post.append("")
else:
if len(dims_kvs) != 0:
code_arg_op_a = self.getPreArgStringTemplate().substitute(arg_op_other=arg_op_a)
code_arg_op_other1 = ""
code_arg_op_other2 = ""
expand_code = ""
for idx, kv in enumerate(dims_kvs):
expand_code += self.OUT_PLACE_PRE_EXPAND_PRE_DIM_TEMPLATE.substitute(
arg_op_a=arg_op_a,
op_dim=kv["op"],
arg_op_dim=kv["arg_op"],
arg_op_dim_value=kv["val"],
idx=idx)
if len(dims_kvs) == 1:
expand_code += self.OUT_PLACE_PRE_EXPAND1_DIM_TEMPLATE.substitute(
arg_op_a=arg_op_a,
arg_op_dim0=dims_kvs[0]["arg_op"])
elif len(dims_kvs) == 2:
expand_code += self.OUT_PLACE_PRE_EXPAND2_DIM_TEMPLATE.substitute(
arg_op_a=arg_op_a,
arg_op_dim0=dims_kvs[0]["arg_op"],
arg_op_dim1=dims_kvs[1]["arg_op"])
else:
expand_code += self.OUT_PLACE_PRE_EXPAND3_DIM_TEMPLATE.substitute(
arg_op_a=arg_op_a,
arg_op_dim0=dims_kvs[0]["arg_op"],
arg_op_dim1=dims_kvs[1]["arg_op"],
arg_op_dim2=dims_kvs[2]["arg_op"])
expand_code += self.getOutPlacePreExpandPostDimTemplate(raise_errors == "true").substitute(
arg_op_a=arg_op_a,
raise_errors=raise_errors)
post_code = self.POST_TEMPLATE.substitute(arg_op_other=arg_op_a)
else:
code_arg_op_a = self.getPreArgStringTemplate().substitute(arg_op_other=arg_op_a)
code_arg_op_other1 = self.getPreArgStringTemplate(type=type_op_b).substitute(op_b_mapping)
code_arg_op_other2 = (self.getPreArgStringTemplate(type=type_op_c).substitute(op_c_mapping)
if op_c else "")
if op_c:
expand_code = self.getOutPlacePreExpand3Template(raise_errors == "true").substitute(
op_b_mapping,
op_other1=op_b,
op_other2=op_c,
arg_op_other1=arg_op_b,
arg_op_other2=arg_op_c)
else:
expand_code = self.getOutPlacePreExpand2Template(
raise_errors == "true").substitute(op_b_mapping)
post_code = self.POST_TEMPLATE.substitute(arg_op_other=arg_op_a)
post_code += self.POST_TEMPLATE.substitute(op_b_mapping)
post_code += self.POST_TEMPLATE.substitute(op_c_mapping) if op_c else ""
new_code_pre.append(self.OUT_PLACE_PRE_TEMPLATE.substitute(
code_arg_op_a=code_arg_op_a,
code_arg_op_other1=code_arg_op_other1,
code_arg_op_other2=code_arg_op_other2,
expand_code=expand_code))
new_code_pre.append("")
new_code_post.append(post_code)
new_code_post.append("")
template = new_code_pre + template + new_code_post
return template

View File

@ -135,7 +135,7 @@ static PyObject * $name(PyObject *self, PyObject *args, PyObject *kwargs)
if arg['name'] in ['self', 'state', 'dataType', 'handle']:
arg['ignore_check'] = True
declaration['options'] = self.filter_unique_options(declaration['options'])
return [d for d in declarations if not d.get('only_register', False)]
return declarations
def filter_unique_options(self, options):
def signature(option):

View File

@ -23,8 +23,6 @@ class GILRelease(CWrapPlugin):
]
def process_option_code_template(self, template, option):
if option.get('with_gil', False):
return template
call_idx = template.index('$call')
template.insert(call_idx, self.BEFORE_CALL)
template.insert(call_idx + 2, self.AFTER_CALL)

View File

@ -30,8 +30,10 @@ class KwargsPlugin(CWrapPlugin):
for option in declaration['options']:
offset = 0
for arg in option['arguments']:
if arg.get('kwarg_only'):
arg['no_idx'] = True
if arg.get('kwarg_only') and not arg.get('ignore_check', False):
offset += 1
else:
arg['kwarg_offset'] = offset
return declarations
def get_arg_accessor(self, arg, option):
@ -39,14 +41,14 @@ class KwargsPlugin(CWrapPlugin):
return
if arg.get('kwarg_only'):
return self.KWARG_ONLY_ACCESSOR_TEMPLATE.substitute(name=arg['name'])
return self.ACCESSOR_TEMPLATE.substitute(idx=arg['idx'], name=arg['name'])
return self.ACCESSOR_TEMPLATE.substitute(idx=arg['idx'] - arg['kwarg_offset'], name=arg['name'])
def process_single_check(self, code, arg, arg_accessor):
if arg.get('no_kwargs'):
return code
if arg.get('kwarg_only'):
return self.KWARG_ONLY_CHECK_TEMPLATE.substitute(name=arg['name'], code=code)
return self.CHECK_TEMPLATE.substitute(idx=arg['idx'], name=arg['name'], code=code)
return self.CHECK_TEMPLATE.substitute(idx=arg['idx'] - arg['kwarg_offset'], name=arg['name'], code=code)
def process_wrapper(self, code, declaration):
if declaration.get('no_kwargs'):

View File

@ -1,18 +1,58 @@
import os
from copy import deepcopy
from . import CWrapPlugin
from itertools import product
from ...shared import cwrap_common
class OptionalArguments(CWrapPlugin):
def process_declarations(self, declarations):
new_options = []
for declaration in declarations:
cwrap_common.enumerate_options_due_to_default(
declaration,
allow_kwarg=True,
type_to_signature={},
remove_self=False)
for option in declaration['options']:
optional_args = []
for i, arg in enumerate(option['arguments']):
if 'default' in arg:
optional_args.append(i)
for permutation in product((True, False), repeat=len(optional_args)):
option_copy = deepcopy(option)
for i, bit in zip(optional_args, permutation):
arg = option_copy['arguments'][i]
if not bit:
arg['type'] = 'CONSTANT'
arg['ignore_check'] = True
# PyYAML interprets NULL as None...
arg['name'] = 'NULL' if arg['default'] is None else arg['default']
new_options.append(option_copy)
declaration['options'] = self.filter_unique_options(new_options)
return declarations
def filter_unique_options(self, options):
def signature(option, kwarg_only_count):
if kwarg_only_count == 0:
kwarg_only_count = None
else:
kwarg_only_count = -kwarg_only_count
arg_signature = '#'.join(
arg['type']
for arg in option['arguments'][:kwarg_only_count]
if not arg.get('ignore_check'))
if kwarg_only_count is None:
return arg_signature
kwarg_only_signature = '#'.join(
arg['name'] + '#' + arg['type']
for arg in option['arguments'][kwarg_only_count:]
if not arg.get('ignore_check'))
return arg_signature + "#-#" + kwarg_only_signature
seen_signatures = set()
unique = []
for option in options:
for num_kwarg_only in range(0, len(option['arguments']) + 1):
sig = signature(option, num_kwarg_only)
if sig not in seen_signatures:
if num_kwarg_only > 0:
for arg in option['arguments'][-num_kwarg_only:]:
arg['kwarg_only'] = True
unique.append(option)
seen_signatures.add(sig)
break
return unique

View File

@ -1,90 +0,0 @@
from copy import deepcopy
from . import CWrapPlugin
import yaml
class ProcessorSpecificPlugin(CWrapPlugin):
def process_declarations(self, declarations):
# In order to move Torch's random functions into the same cwrap
# declaration, we need to be able to handle the fact that on the CPU
# these functions take a generator argument, while on the GPU, they
# do not. As such, we would like to split those declarations at cwrap
# runtime into two separate declarations, one for the CPU (unchanged),
# and one for the GPU (with the generator argument removed).
#
# For example, the declaration arguments:
# arguments:
# - THTensor* self
# - arg: THGenerator* generator
# default: THPDefaultGenerator->cdata
# kwarg_only: True
#
# Would have the generator argument removed when generating for the GPU
# backend.
def arg_contains_generator(arg):
return (arg['type'] == 'THGenerator*' or (arg.get('default', None)
is not None and 'THPDefaultGenerator' in
str(arg.get('default', ""))))
def split_candidate(declaration):
# First, check and see if it is a declaration for both CPU/GPU
if all([proc in declaration['backends'] for
proc in ['CPU', 'CUDA']]):
for option in declaration['options']:
for argument in option['arguments']:
if arg_contains_generator(argument):
return True
return False
def can_we_handle_the_split(declaration):
# hook into here if the split cannot happen for some reason
return True
def generator_split(declaration):
# the split must make two changes: 1. remove the generator argument
# for the GPU, and 2. assign the correct backends/types to the
# split declaration
dec_cpu = declaration
dec_gpu = deepcopy(declaration)
# Remove GPU backend and types from dec_cpu
dec_cpu['backends'].remove('CUDA')
if dec_cpu.get('backend_type_pairs', False):
dec_cpu['backend_type_pairs'] = (
[pair for pair in dec_cpu['backend_type_pairs'] if
pair[1] == 'CPU'])
# also need to reach into options
for option in dec_cpu['options']:
option['backends'].remove('CUDA')
# Remove CPU backend and types from dec_gpu
dec_gpu['backends'].remove('CPU')
if dec_gpu.get('backend_type_pairs', False):
dec_gpu['backend_type_pairs'] = (
[pair for pair in dec_gpu['backend_type_pairs'] if
pair[1] == 'CUDA'])
# also need to reach into options
for option in dec_gpu['options']:
option['backends'].remove('CPU')
# Remove generator arguments from dec_gpu options
for option in dec_gpu['options']:
option['arguments'] = (
[arg for arg in option['arguments'] if
not arg_contains_generator(arg)])
return [dec_cpu, dec_gpu]
decs = []
for declaration in declarations:
if split_candidate(declaration):
assert(can_we_handle_the_split(declaration))
newdecs = generator_split(declaration)
decs.extend(newdecs)
else:
decs.append(declaration)
return decs

View File

@ -127,7 +127,7 @@ PyObject * $name(PyObject *self, PyObject *args, PyObject *kwargs)
""")
ALLOCATE_TMPL = Template("""\
THP${type}TensorPtr _${name}_guard((THP${type}Tensor*) THP${type}Tensor_NewEmpty());
THP${type}TensorPtr _${name}_guard = (THP${type}Tensor*) THP${type}Tensor_NewEmpty();
if (!_${name}_guard.get()) return NULL;
THP${type}Tensor* $name = _${name}_guard.get();
""")
@ -334,92 +334,8 @@ ${cpu}
for option in declaration['options']
for arg in option['arguments'])
def backends_types_to_defined_if_string(declaration):
# A declaration has two fields: 'backend', which stores a list of
# backends (currently 'cpu' and 'cuda') the declaration applies
# to, and 'types', which stores a list of real types the
# declaration applies to. In PyTorch, when a function is only
# supported by a subset of types, we wrap it in macro definition
# checks.
#
# Previously, we manually required the cwrap declaration to
# specify for which backend/type combinations a function was
# defined. Now, we explicitly list the types and backends for
# a declaration, if it should only be supported for a specific
# subset of types, backends, or type-backend pairs.
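# For example (illustrative): backends == ['CUDA'] and
# types == ['floating_point'] would produce the guard string
# "CUDA_DOUBLE || CUDA_FLOAT || CUDA_HALF".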
types = declaration.get('types', [])
backends = declaration['backends']
all_backends = ['CPU', 'CUDA']
def get_defined_string(backend, real):
if backend == 'CUDA':
if real == 'all':
return "IS_CUDA"
else:
return 'CUDA_{0}'.format(real.upper())
else:
if real == 'all':
return "!IS_CUDA"
else:
return 'defined(TH_REAL_IS_{0})'.format(real.upper())
def expand_composite_type(p, t):
if t == 'floating_point':
result = ['double', 'float']
if p == 'CUDA':
result.append('half')
elif t == 'integral':
result = ['byte', 'char', 'short', 'int', 'long']
else:
result = [t]
return result
defineds = []
# The logic below does not handle corner cases well. We allow the
# declaration to have a field 'backend_type_pairs' that stores a
# dictionary from type --> backend representing allowed
# combinations. Let's use these first.
for pair in declaration.get('backend_type_pairs', []):
p, t = pair
defineds.extend([get_defined_string(p, et) for et in
expand_composite_type(p, t)])
# In the base case, types is empty and backends contains both
# 'CPU' and 'CUDA' --> this means we support all types, and our
# string should be empty, or simply the list of explicit type
# backend pairs
if (len(types) == 0 and all([proc in backends for proc in
all_backends])):
return " || ".join(defineds)
# Case 2: types is empty, but only one backend type is specified
if len(types) == 0 and len(backends) == 1:
defineds.append('IS_CUDA' if backends[0] == 'CUDA' else
"!IS_CUDA")
return " || ".join(defineds)
# Else, we loop over all of the backend, type pairs and add
# them
for p in backends:
for t in types:
defineds.extend([get_defined_string(p, et) for et in
expand_composite_type(p, t)])
return " || ".join(defineds)
for declaration in declarations:
# Disable all methods for THHalfTensor, unless cpu_half is True
dfstr = backends_types_to_defined_if_string(declaration)
if len(dfstr) > 0:
# for now, need to check for distributed defined if as well
if 'defined_if' in declaration:
declaration['defined_if'] += ' && (' + dfstr + ')'
else:
declaration['defined_if'] = dfstr
if not declaration.get('cpu_half', False):
defined_if = '!defined(TH_REAL_IS_HALF)'
if 'defined_if' in declaration:
@ -439,23 +355,15 @@ ${cpu}
declaration['variables'] += ['PyObject *__out;']
self.generate_out_options(declaration)
if has_long_args(declaration):
for option in declaration['options']:
for arg in option['arguments']:
if arg.get('long_args', False):
arg['no_kwargs'] = True
declaration['no_kwargs'] = True
for option in declaration['options']:
option['cname'] = 'TH{}Tensor_({})'.format(
'S' if option.get('sparse', False) else '', option['cname'])
if option.get('sparse', False):
defined_if = option.get('defined_if', '')
option['defined_if'] = '!IS_DISTRIBUTED' + (' && ' if defined_if else '') + defined_if
variants = declaration.get('variants', ['method'])
if 'function' in variants:
if declaration.get('with_stateless', False) or declaration.get('only_stateless', False):
stateless_declaration = self.make_stateless(declaration)
new_declarations.append(stateless_declaration)
self.stateless_declarations.append(stateless_declaration)
if 'method' not in variants:
if declaration.get('only_stateless', False):
continue
self.declarations.append(declaration)
@ -468,13 +376,9 @@ ${cpu}
register_only = [d for d in declarations if d.get('only_register', False)]
declarations = [d for d in declarations
if (('method' in d.get('variants', ['method'])) and
(not d.get('only_register', False)))]
self.declarations.extend(filter(lambda x: 'method' in x.get('variants',
['method']), register_only))
self.stateless_declarations.extend(filter(lambda x: 'method' not in
x.get('variants', ['method']),
register_only))
if (not d.get('only_stateless', False)) and (not d.get('only_register', False))]
self.declarations.extend(filter(lambda x: not x.get('only_stateless', False), register_only))
self.stateless_declarations.extend(filter(lambda x: x.get('only_stateless', False), register_only))
self.process_docstrings()
@ -516,7 +420,7 @@ ${cpu}
sparse=('' if not sparse else 'S'),
)
if sparse:
generated = '#if !defined(TH_REAL_IS_HALF) && !IS_DISTRIBUTED\n' + generated + '\n#endif\n\n'
generated = '#ifndef TH_REAL_IS_HALF\n' + generated + '\n#endif\n\n'
return generated
def process_full_file(self, code):
@ -557,24 +461,11 @@ ${cpu}
if any(arg.get('long_args', False) for arg in option['arguments']):
code = code.replace('__argcount ==', '__argcount >=')
expected = str(int(option.get('output_provided', False)) +
sum(not arg.get('no_kwargs', False) and not arg.get('ignore_check', False)
for arg in option['arguments']))
expected = str(int(option.get('output_provided', False)))
code = '__dictcount == ' + expected + ' &&\n ' + code
return code
def process_option_code(self, code, option):
if option.get('defined_if', ''):
defined_if = option['defined_if']
placeholder = ''
# This means that it's a first option, so we need a dummy if,
# so the next option can be an else if.
if 'else if' not in code:
placeholder = '\n #else\n if (false) {'
return '#if ' + defined_if + '\n ' + code + placeholder + '\n #endif\n'
return code
def process_pre_arg_assign(self, template, option):
new_args = []
for arg in option['arguments']:

View File

@ -1,422 +1,55 @@
class CWrapPlugin(object):
"""Base class from which all cwrap plugins should inherit.
Override any of the following methods to implement the desired wrapping
behavior.
"""
def initialize(self, cwrap):
"""Initialize the Plugin class prior to calling any other functions.
It is used to give the Plugin access to the cwrap object's helper
functions and state.
Args:
cwrap: the cwrap object performing the wrapping.
"""
pass
def get_type_check(self, arg, option):
"""Used to generate code for runtime checks of object types.
The type can be found in arg['type']. For example, it could be
THTensor*. If this Plugin recognizes the type in arg, it should
return a Template string containing code that checks whether a
Python object is of this type. For example, the return type in
this case would be:
Template('(PyObject*)Py_TYPE($arg) == THPTensorClass')
As a simpler example, if the type == 'bool' then we would return:
Template('PyBool_Check($arg)')
Note that the name of the identifier that will be substituted must be
$arg.
Args:
arg: a Python object with a 'type' field representing the type
to generate a check string for.
option: dictionary containing the information for this specific
option.
Returns:
A Template string as described above, or None if this Plugin does
not have a corresponding type check for the passed type.
"""
pass
def get_type_unpack(self, arg, option):
"""Used to generate code unpacking of Python objects into C types.
Similar to get_type_check, but for unpacking Python objects into their
corresponding C types. The type is once again accessible via
arg['type']. This time we return a Template string that unpacks an
object. For a THTensor*, we know that the corresponding PyTorch type is
a THPTensor*, so we need to get the cdata from the object. So we would
return:
Template('((THPTensor*)$arg)->cdata')
For a simpler type, such as a long, we could do:
Template('PyLong_AsLong($arg)')
though in practice we will use our own custom unpacking code. Once
again, $arg must be used as the identifier.
Args:
arg: a Python object with a 'type' field representing the type
to generate an unpack string for.
option: dictionary containing the information for this specific
option.
Returns:
A Template string as described above, or None if this Plugin does
not have a corresponding type unpack for the passed type.
"""
pass
def get_return_wrapper(self, option):
"""Used to generate code wrapping a function's return value.
Wrapped functions should always return a PyObject *. However,
internally, the code will be working with C objects or primitives.
Therefore, if a function has a return value we need to convert it back
to a PyObject * before the function returns. Plugins can override this
function to generate wrapper code for returning specific C types. The
type is accessible via option['return'].
Continuing on with our THTensor* example, we might do something like:
Template('return THPTensor_(New)($result);')
In general, you want to do return <statement>; In this case, we call
into THP's library routine that takes a THTensor* (the $result
identifier) and returns a PyObject *.
For a bool, we could do Template('return PyBool_FromLong($result);').
Note that in other cases, our logic might be more complicated. For
example, if our return value is also an argument to the function call,
we could need to increase the reference count prior to returning.
Args:
option: dictionary containing the information for this specific
option.
Returns:
A Template string as described above, or None if this Plugin does
not have a corresponding return wrapper for the functions return
type or specifier.
"""
pass
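
As a concrete illustration of the type-related hooks above (get_type_check, get_type_unpack, get_return_wrapper), here is a minimal, hypothetical plugin sketch for a `bool` argument/return type; the class name and C expressions are illustrative and not part of this diff.

```
from string import Template

# Hypothetical sketch only: a plugin handling a 'bool' type, following the
# Template conventions described in the docstrings above.
class BoolPlugin(object):  # would subclass CWrapPlugin in practice
    def get_type_check(self, arg, option):
        if arg['type'] == 'bool':
            # $arg is later substituted with the accessor for this argument
            return Template('PyBool_Check($arg)')

    def get_type_unpack(self, arg, option):
        if arg['type'] == 'bool':
            return Template('($arg == Py_True)')

    def get_return_wrapper(self, option):
        if option.get('return') == 'bool':
            # $result is substituted with the C expression holding the result
            return Template('return PyBool_FromLong($result);')
```
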
def get_wrapper_template(self, declaration):
"""Used to create a code template to wrap the options.
This function returns a Template string that contains the function call
for the overall declaration, including the method definition, opening
and closing brackets, and any additional code within the method body.
Look through the examples to get a sense of what this might look like.
The only requirements are that it contains unsubstituted template
identifiers for anything the cwrap engine expects.
Note that for any declaration only one Plugin can generate the wrapper
template.
Args:
declaration: the declaration for the wrapped method.
Returns:
A template string representing the entire function declaration,
with identifiers as necessary.
"""
pass
def get_assign_args(self, arguments):
"""Used to modify argument metadata prior to assignment.
We have already setup argument checking, and how to unpack arguments.
This function allows you to modify the metadata of an argument prior to
actually performing the assignment. For example, you might want to
check that an argument is of a specific type, but when unpacking it you
might want to treat it as a different type. This function will allow
you to do stuff like that --> e.g. you could set the 'type' field for a
particular argument to be something else.
Args:
arguments: a list of argument metadata dictionaries.
Returns:
The same list of arguments, with any modifications as you see fit.
"""
pass
def get_arg_accessor(self, arg, option):
"""Used to generate a string for accessing the passed arg.
One of the key components of the YAML definition for a method to be
wrapped are the arguments to that method. Override this function to
show how to access that specific arg in the code. For example, you
might do something different if the argument is a keyword argument, or
a constant, or self. The base cwrap plugin has a fallback arg accessor
for loading elements from the args PyObject * tuple passed to the
function.
It's best to look at some of the existing Plugins to get a sense of what
one might do.
Args:
arg: a dictionary specifying attributes of the arg to be accessed
option: dictionary containing the information for this specific
option.
Returns:
A string (note: not a Template string!) of code that can be used
to access the given arg. If the plugin does not know how to access
the arg, return None.
"""
pass
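
A hedged sketch of what an arg accessor override might look like — the fallback described above loads positional items from the args tuple, while a plugin could special-case `self`; the field names used below are illustrative only.

```
# Hypothetical sketch only: an accessor that special-cases 'self' and falls
# back to positional items of the args tuple, as described above.
class SelfArgPlugin(object):  # would subclass CWrapPlugin in practice
    def get_arg_accessor(self, arg, option):
        if arg.get('name') == 'self':
            return '(PyObject*)self'
        if 'idx' in arg:  # hypothetical positional-index field
            return 'PyTuple_GET_ITEM(args, {})'.format(arg['idx'])
        return None  # let other plugins / the default accessor handle it
```
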
def process_full_file(self, code):
"""Used to modify the code for the entire output file.
The last thing any plugin can do. Code contains the results of wrapping
all the declarations. The plugin can do things like adding header
guards, include statements, etc.
Args:
code: a string source code for the wrapped declarations.
Returns:
The same code, modified as the plugin sees fit.
"""
return code
def process_single_check(self, code, arg, arg_accessor):
"""Used to postprocess a type check.
Above we defined a function get_type_check that returns a Template
string that allows for type checking a PyObject * for a specific type.
In this function, the passed "code" is a combination of that type check
along with a specific arg_accessor pasted in. For example:
'(PyObject*)Py_TYPE(PyTuple_GET_ITEM(args, 1)) == THPTensorClass'
This function can be overridden to support modifying this check string.
For example, if an argument can be null, we might want to check and see
if the type is Py_None, as well.
Args:
code: The string code representing a type check for a specific
argument being accessed.
arg: dictionary containing properties of that specific argument
arg_accessor: the arg_accessor string for that specific argument.
Note that this is likely also embedded in code, but if you want to
be able to access this arg and throw away the other code, you can
do so.
Returns:
A string representing the processed check/access string for this
arg. If the plugin does not know how to modify a specific input, it
should return the original code.
"""
return code
def process_all_checks(self, code, option):
"""Used to generate additional checks based on all the individual ones.
After individually processing each argument with get_type_check,
get_arg_accessor, process_single_check, this function allows you to
inspect the combined checks and do any additional checking/modify that
string as you see fit. In particular, given code is a string like:
CHECK_TYPE(GET_ARG(0)) && CHECK_TYPE(GET_ARG(1)) && ..
We can process it as we see fit. For example, we may want to add a
check at the beginning that we have the specified number of arguments.
Args:
code: A string representing each argument check separated by an
'&&'. code can be None if there are no arguments to be checked.
option: dictionary containing the information for this specific
option.
Returns:
The modified code string with any additional checks, or just the
existing code if no modifications are to be made.
"""
return code
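
For the two check hooks above, a hedged sketch: a single check can be relaxed to also accept Py_None for nullable arguments, and the combined check string can be prefixed with an argument-count guard; the `nullable` flag and identifiers below are illustrative.

```
# Hypothetical sketch only, following the descriptions above.
class CheckTweaksPlugin(object):  # would subclass CWrapPlugin in practice
    def process_single_check(self, code, arg, arg_accessor):
        if arg.get('nullable', False):
            return '({} || {} == Py_None)'.format(code, arg_accessor)
        return code

    def process_all_checks(self, code, option):
        if code is None:
            return code
        nargs = len(option['arguments'])
        return '__argcount == {} &&\n          {}'.format(nargs, code)
```
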
def process_single_unpack(self, code, arg, arg_accessor):
"""Used to postprocess a type unpack.
Same as process_single_check above, but for type unpacking. E.g. an
example code could be:
PyLong_FromLong(PyTuple_GET_ITEM(args, 0))
And this code could modify that as it sees fit. For example, if the
result of accessing the argument is None, we would not want to call the
unpacking code.
Args:
code: The string code representing a type unpack for a specific
argument being accessed.
arg: dictionary containing properties of that specific argument
arg_accessor: the arg_accessor string for that specific argument.
Note that this is likely also embedded in code, but if you want to
be able to access this arg and throw away the other code, you can
do so.
Returns:
A string representing the processed unpack/access string for this
arg. If the plugin does not know how to modify a specific input, it
should return the original code.
"""
return code
def process_all_call_arg(self, code, option):
"""Used to modify the arguments to the underlying C function call.
Code is the string of comma-separated arguments that will be passed to
the wrapped C function. You can use this function to modify that string
as you see fit. For example, THP prepends the LIBRARY_STATE definition
so that the generated code will follow the conventions it uses for
writing one function for both TH/THC calls.
Args:
code: A string as described above.
option: dictionary containing the information for this specific
option.
Returns:
The same code, modified as the plugin sees fit.
"""
return code
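
Similarly, a hedged sketch of the unpack/call hooks just described — guarding an unpack against None and prepending LIBRARY_STATE to the call arguments; both behaviors are only illustrative here.

```
# Hypothetical sketch only, following the descriptions above.
class CallTweaksPlugin(object):  # would subclass CWrapPlugin in practice
    def process_single_unpack(self, code, arg, arg_accessor):
        if arg.get('nullable', False):
            return '({} == Py_None ? NULL : {})'.format(arg_accessor, code)
        return code

    def process_all_call_arg(self, code, option):
        return 'LIBRARY_STATE ' + code
```
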
def process_option_code(self, code, option):
"""Used to modify the entire code body for an option.
Code in this case is a string containing the entire generated code for
a specific option. Note that this body includes the checks for each
option, i.e. if (type checks for one permutation) { ... } else if (type
checks for another permutation) { ... } etc.
Args:
code: string representing the generated code for the option
option: dictionary containing the information for this specific
option.
Returns:
The same code, modified as the plugin sees fit.
"""
return code
def process_wrapper(self, code, declaration):
"""Used to modify the entire code body for a declaration.
Code in this case is a string containing the entire generated code for
a specific declaration. This code can be modified as the plugin sees
fit. For example, we might want to wrap the function in preprocessor
guards if it is only enabled for floats.
Args:
code: string representing the generated code for the declaration
declaration: the declaration metadata.
Returns:
The same code, modified as the plugin sees fit.
"""
return code
def process_declarations(self, declarations):
"""Used to process/modify the function's declaration.
Cwrap loads the YAML of a function to be cwrap'd into a dictionary.
This is known as the declaration. The cwrap code sets some defaults as
necessary, and then passes this dictionary to process_declarations.
Overriding this code allows the plugin to modify this declaration as it
sees fit prior to any code generation. The plugin may add, remove or
modify the fields of the declaration dictionary. It can also save state
to the Plugin for use in subsequent function overrides.
It's best to look at some of the existing Plugins to get a sense of what
one might do.
Args:
declarations: a list of declarations, i.e. dictionaries that define
the function(s) being wrapped. Note that this can be plural, so the
function must take care to modify each input declaration.
Returns:
Those same declarations, modified as the Plugin sees fit. Note that
you could insert a declaration, if you wanted to take an input
declaration and e.g. wrap it multiple times.
"""
return declarations
def process_option_code_template(self, template, option):
"""Used to modify the code template for the option.
The "code template" can be thought of the actual body implementing the
wrapped function call --> i.e. it is not the argument check,
assignment, etc. but the actual logic of the function. The template is
a list containing two operations: the $call, and the $return_result.
These represent the "locations" where the function call will happen,
and the function will return.
This function can modify the list to insert arbitrary code around the
$call and $return_result. For example, one might want to wrap the code
in a try/catch, or post-process the result in some way. This allows a
plugin to do that.
Args:
template: a list containing $call and $return_result, in addition
to any arbitrary code inserted by other plugins.
option: dictionary containing the information for this specific
option.
Returns:
The same "code template", possibly modified by this plugin.
"""
return template
def process_pre_arg_assign(self, template, option):
"""Used to include any code before argument assignment.
This function can be used to insert any code that will be part of the
resulting function. The code is inserted after argument checks occur,
but before argument assignment.
Args:
template: String representing the code to be inserted. If other
plugins have included code for pre_arg_assign, it will be included
here.
option: dictionary containing the information for this specific
option.
Returns:
template, with any additional code if needed.
"""
return template
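
Finally, a hedged sketch of the wrapper- and template-level hooks above — guarding a wrapper with a preprocessor condition and inserting code around `$call` in the option template; all names and guards are illustrative.

```
# Hypothetical sketch only, following the descriptions above.
class WrapperTweaksPlugin(object):  # would subclass CWrapPlugin in practice
    def process_wrapper(self, code, declaration):
        guard = declaration.get('defined_if')
        if guard:
            return '#if {}\n{}\n#endif\n'.format(guard, code)
        return code

    def process_option_code_template(self, template, option):
        out = []
        for line in template:
            if '$call' in line:
                # e.g. release the GIL around the underlying call
                out += ['Py_BEGIN_ALLOW_THREADS', line, 'Py_END_ALLOW_THREADS']
            else:
                out.append(line)
        return out
```
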
@ -433,4 +66,3 @@ from .AutoGPU import AutoGPU
from .CuDNNPlugin import CuDNNPlugin
from .GenericNN import GenericNN
from .WrapDim import WrapDim
from .Broadcast import Broadcast

View File

@ -1,27 +0,0 @@
FROM ubuntu:16.04
LABEL com.nvidia.volumes.needed="nvidia_driver"
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
git \
curl \
ca-certificates \
libjpeg-dev \
libpng-dev && \
rm -rf /var/lib/apt/lists/*
RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install conda-build && \
/opt/conda/bin/conda create -y --name pytorch-py35 python=3.5.2 numpy pyyaml scipy ipython mkl&& \
/opt/conda/bin/conda clean -ya
ENV PATH /opt/conda/envs/pytorch-py35/bin:$PATH
RUN conda install --name pytorch-py35 -c soumith magma-cuda80 && /opt/conda/bin/conda clean -ya
RUN conda install --name pytorch-py35 pytorch torchvision cuda80 -c soumith && /opt/conda/bin/conda clean -ya
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
WORKDIR /workspace
RUN chmod -R a+w /workspace

View File

@ -3,13 +3,26 @@ import sys
from string import Template, ascii_lowercase
from ..cwrap import cwrap
from ..cwrap.plugins import StandaloneExtension, GenericNN, NullableArguments, AutoGPU
from ..shared import import_module
BASE_PATH = os.path.realpath(os.path.join(__file__, '..', '..', '..'))
WRAPPER_PATH = os.path.join(BASE_PATH, 'torch', 'csrc', 'nn')
THNN_UTILS_PATH = os.path.join(BASE_PATH, 'torch', '_thnn', 'utils.py')
def import_module(name, path):
if sys.version_info >= (3, 5):
import importlib.util
spec = importlib.util.spec_from_file_location(name, path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
return module
elif sys.version_info >= (3, 0):
from importlib.machinery import SourceFileLoader
return SourceFileLoader(name, path).load_module()
else:
import imp
return imp.load_source(name, path)
thnn_utils = import_module('torch._thnn.utils', THNN_UTILS_PATH)
FUNCTION_TEMPLATE = Template("""\
@ -75,17 +88,14 @@ def wrap_function(name, type, arguments):
cname = 'THNN_' + type + name
declaration = ''
declaration += 'extern "C" void ' + cname + \
'(' + ', '.join(TYPE_TRANSFORMS[type].get(arg.type, arg.type)
for arg in arguments) + ');\n'
'(' + ', '.join(TYPE_TRANSFORMS[type].get(arg.type, arg.type) for arg in arguments) + ');\n'
declaration += FUNCTION_TEMPLATE.substitute(name=type + name, cname=cname)
indent = ' ' * 4
dict_indent = ' ' * 6
prefix = indent + '- '
for arg in arguments:
if not arg.is_optional:
declaration += prefix + \
TYPE_TRANSFORMS[type].get(
arg.type, arg.type) + ' ' + arg.name + '\n'
declaration += prefix + TYPE_TRANSFORMS[type].get(arg.type, arg.type) + ' ' + arg.name + '\n'
else:
t = TYPE_TRANSFORMS[type].get(arg.type, arg.type)
declaration += prefix + 'type: ' + t + '\n' + \
@ -130,7 +140,6 @@ def wrap_cunn():
AutoGPU(has_self=False),
])
GENERIC_FUNCTION_TEMPLATE = Template("""\
[[
name: $name
@ -159,7 +168,7 @@ def wrap_generic():
defs = OrderedDict()
def should_wrap_function(name):
if name.startswith('LookupTable_'):
if name.startswith('LookupTable'):
return False
return (name.endswith('updateOutput') or
name.endswith('updateGradInput') or

View File

@ -1,12 +0,0 @@
{
global:
_TH*;
TH*;
*THP*;
*THCP*;
PyInit*;
init*;
state;
local:
*;
};

View File

@ -1,39 +1,17 @@
import os
import platform
import ctypes.util
from subprocess import Popen, PIPE
import os
from .env import check_env_flag
def find_nvcc():
proc = Popen(['which', 'nvcc'], stdout=PIPE, stderr=PIPE)
out, err = proc.communicate()
out = out.decode().strip()
if len(out) > 0:
return os.path.dirname(out)
else:
return None
if check_env_flag('NO_CUDA'):
WITH_CUDA = False
CUDA_HOME = None
else:
CUDA_HOME = os.getenv('CUDA_HOME', '/usr/local/cuda')
if not os.path.exists(CUDA_HOME):
# We use nvcc path on Linux and cudart path on macOS
osname = platform.system()
if osname == 'Linux':
cuda_path = find_nvcc()
else:
cudart_path = ctypes.util.find_library('cudart')
if cudart_path is not None:
cuda_path = os.path.dirname(cudart_path)
else:
cuda_path = None
if cuda_path is not None:
CUDA_HOME = os.path.dirname(cuda_path)
cudart_path = ctypes.util.find_library('cudart')
if cudart_path is not None:
CUDA_HOME = os.path.dirname(cudart_path)
else:
CUDA_HOME = None
WITH_CUDA = CUDA_HOME is not None

View File

@ -1,5 +1,4 @@
import os
import sys
import glob
from itertools import chain
@ -10,8 +9,6 @@ from .cuda import WITH_CUDA, CUDA_HOME
def gather_paths(env_vars):
return list(chain(*(os.getenv(v, '').split(':') for v in env_vars)))
is_conda = 'conda' in sys.version or 'Continuum' in sys.version
conda_dir = os.path.join(os.path.dirname(sys.executable), '..')
WITH_CUDNN = False
CUDNN_LIB_DIR = None
@ -22,7 +19,6 @@ if WITH_CUDA and not check_env_flag('NO_CUDNN'):
os.path.join(CUDA_HOME, 'lib'),
os.path.join(CUDA_HOME, 'lib64'),
'/usr/lib/x86_64-linux-gnu/',
'/usr/lib/powerpc64le-linux-gnu/',
] + gather_paths([
'LIBRARY_PATH',
])))
@ -35,9 +31,6 @@ if WITH_CUDA and not check_env_flag('NO_CUDNN'):
'C_INCLUDE_PATH',
'CPLUS_INCLUDE_PATH',
])))
if is_conda:
lib_paths.append(os.path.join(conda_dir, 'lib'))
include_paths.append(os.path.join(conda_dir, 'include'))
for path in lib_paths:
if path is None or not os.path.exists(path):
continue

View File

@ -1,58 +0,0 @@
import os
this_file = os.path.dirname(os.path.abspath(__file__))
generated_dir = os.path.abspath(os.path.join(this_file, '..', '..', 'torch', 'csrc', 'generated'))
line_start = '//generic_include '
types = [
'Double',
'Float',
'Half',
'Long',
'Int',
'Short',
'Char',
'Byte'
]
generic_include = '#define {lib}_GENERIC_FILE "{path}"'
generate_include = '#include "{lib}/{lib}Generate{type}Type.h"'
def split_types(file_name):
assert file_name.startswith('torch/csrc/')
if not os.path.exists(generated_dir):
os.makedirs(generated_dir)
with open(file_name, 'r') as f:
lines = f.read().split('\n')
# Find //generic_include
for i, l in enumerate(lines):
if l.startswith(line_start):
args = l[len(line_start):]
lib_prefix, generic_file = filter(bool, args.split())
break
else:
raise RuntimeError("generic include not found")
gen_name_prefix = file_name[len('torch/csrc/'):].replace('/', '_').replace('.cpp', '')
gen_path_prefix = os.path.join(generated_dir, gen_name_prefix)
prefix = '\n'.join(lines[:i])
suffix = '\n'.join(lines[i + 1:])
to_build = []
g_include = generic_include.format(lib=lib_prefix, path=generic_file)
for t in types:
t_include = generate_include.format(lib=lib_prefix, type=t)
gen_path = gen_path_prefix + t + '.cpp'
to_build.append(gen_path)
with open(gen_path, 'w') as f:
f.write(prefix + '\n' +
g_include + '\n' +
t_include + '\n' +
suffix)
return to_build

View File

@ -1,3 +0,0 @@
from .module_loader import import_module
from .cwrap_common import set_declaration_defaults, \
sort_by_number_of_options, enumerate_options_due_to_default

View File

@ -1 +0,0 @@
../../torch/lib/ATen/common_with_cwrap.py

View File

@ -1,16 +0,0 @@
import sys
def import_module(name, path):
if sys.version_info >= (3, 5):
import importlib.util
spec = importlib.util.spec_from_file_location(name, path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
return module
elif sys.version_info >= (3, 0):
from importlib.machinery import SourceFileLoader
return SourceFileLoader(name, path).load_module()
else:
import imp
return imp.load_source(name, path)

View File

@ -5,7 +5,7 @@ Additionally, it provides many utilities for efficient serializing of
Tensors and arbitrary types, and other useful utilities.
It has a CUDA counterpart, that enables you to run your tensor computations
on an NVIDIA GPU with compute capability >= 3.0.
on an NVIDIA GPU with compute capability >= 2.0.
"""
import sys
@ -15,7 +15,7 @@ from .version import __version__
__all__ = [
'typename', 'is_tensor', 'is_storage', 'set_default_tensor_type',
'set_rng_state', 'get_rng_state', 'manual_seed', 'initial_seed',
'save', 'load', 'set_printoptions', 'chunk', 'split', 'stack', 'matmul',
'save', 'load', 'set_printoptions', 'chunk', 'split', 'stack',
'DoubleStorage', 'FloatStorage', 'LongStorage', 'IntStorage',
'ShortStorage', 'CharStorage', 'ByteStorage',
'DoubleTensor', 'FloatTensor', 'LongTensor', 'IntTensor',
@ -129,9 +129,6 @@ def manual_seed(seed):
Args:
seed (int or long): The desired seed.
"""
if torch.cuda.is_available() and not torch.cuda._in_bad_fork:
torch.cuda.manual_seed_all(seed)
return default_generator.manual_seed(seed)
@ -268,12 +265,12 @@ class ByteTensor(_C.ByteTensorBase, _TensorBase):
_storage_classes = {
DoubleStorage, FloatStorage, LongStorage, IntStorage, ShortStorage,
CharStorage, ByteStorage, HalfStorage
CharStorage, ByteStorage,
}
_tensor_classes = {
DoubleTensor, FloatTensor, LongTensor, IntTensor, ShortTensor,
CharTensor, ByteTensor, HalfTensor
CharTensor, ByteTensor,
}
@ -339,9 +336,8 @@ import torch.nn
import torch.optim
import torch.multiprocessing
import torch.sparse
import torch.utils.backcompat
_C._init_names(list(torch._tensor_classes) + list(torch._storage_classes))
# attach docstrings to torch and tensor functions
from . import _torch_docs, _tensor_docs, _storage_docs
del _torch_docs, _tensor_docs, _storage_docs
from . import _torch_docs, _tensor_docs
del _torch_docs, _tensor_docs

View File

@ -1,31 +0,0 @@
# Copyright (c) 2010-2017 Benjamin Peterson
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
def with_metaclass(meta, *bases):
"""Create a base class with a metaclass."""
# This requires a bit of explanation: the basic idea is to make a dummy
# metaclass for one level of class instantiation that replaces itself with
# the actual metaclass.
class metaclass(meta):
def __new__(cls, name, this_bases, d):
return meta(name, bases, d)
return type.__new__(metaclass, 'temporary_class', (), {})
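
A brief usage sketch of the helper above (the metaclass and attribute are made up for illustration): the same class statement picks up the metaclass under both Python 2 and Python 3.

```
class Meta(type):
    def __new__(mcls, name, bases, namespace):
        namespace.setdefault('tag', name.lower())
        return super(Meta, mcls).__new__(mcls, name, bases, namespace)

class Tagged(with_metaclass(Meta, object)):  # uses the helper defined above
    pass

assert type(Tagged) is Meta and Tagged.tag == 'tagged'
```
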

View File

@ -1,43 +0,0 @@
"""Adds docstrings to Storage functions"""
import torch._C
from torch._C import _add_docstr as add_docstr
storage_classes = [
'DoubleStorageBase',
'FloatStorageBase',
'LongStorageBase',
'IntStorageBase',
'ShortStorageBase',
'CharStorageBase',
'ByteStorageBase',
]
def add_docstr_all(method, docstr):
for cls_name in storage_classes:
cls = getattr(torch._C, cls_name)
try:
add_docstr(getattr(cls, method), docstr)
except AttributeError:
pass
add_docstr_all('from_file',
"""
from_file(filename, shared=False, size=0) -> Storage
If shared is True then memory is shared between all processes. All changes are
written to the file. If shared is False then the changes on the storage do not
affect the file.
Size is the number of elements in the storage. If shared is False then the file
must contain at least `size * sizeof(Type)` bytes (`Type` is the type of
storage). If shared is True the file will be created if needed.
Args:
filename (str): file name to map
shared (bool): whether to share memory
size (int): number of elements in the storage
""")

File diff suppressed because it is too large

View File

@ -67,7 +67,7 @@ def set_printoptions(
def _number_format(tensor, min_sz=-1):
min_sz = max(min_sz, 2)
tensor = torch.DoubleTensor(tensor.size()).copy_(tensor).abs_().view(tensor.nelement())
tensor = torch.DoubleTensor(tensor.nelement()).copy_(tensor).abs_()
pos_inf_mask = tensor.eq(float('inf'))
neg_inf_mask = tensor.eq(float('-inf'))

File diff suppressed because it is too large

View File

@ -3,8 +3,7 @@ import importlib
def _type(self, new_type=None, async=False):
"""Returns the type if `new_type` is not provided, else casts this object to
the specified type.
"""Casts this object to the specified type.
If this is already of the correct type, no copy is performed and the
original object is returned.
@ -28,8 +27,8 @@ def _type(self, new_type=None, async=False):
raise RuntimeError("Cannot cast sparse tensor to dense tensor")
new_type_name = new_type.__module__ + '.' + new_type.__name__
new_values_type_name = new_type_name.replace('.sparse', '')
new_values = self._values().type(new_values_type_name, async)
return new_type(self._indices(), new_values, self.size())
new_values = self.values().type(new_values_type_name, async)
return new_type(self.indices(), new_values, self.size())
if new_type.is_sparse:
raise RuntimeError("Cannot cast dense tensor to sparse tensor")
return new_type(self.size()).copy_(self, async)
@ -58,8 +57,8 @@ def _cuda(self, device=None, async=False):
with torch.cuda.device(device):
if self.is_sparse:
new_type = getattr(torch.cuda.sparse, self.__class__.__name__)
indices = self._indices().cuda(device, async)
values = self._values().cuda(device, async)
indices = self.indices().cuda(device, async)
values = self.values().cuda(device, async)
return new_type(indices, values, self.size())
else:
new_type = getattr(torch.cuda, self.__class__.__name__)
@ -99,47 +98,3 @@ def _accumulate(iterable, fn=lambda x, y: x + y):
for element in it:
total = fn(total, element)
yield total
def _flatten_tensors(tensors):
"""Flatten tensors into a single contiguous 1D buffer"""
if len(tensors) == 1:
return tensors[0].contiguous().view(-1)
numels = [tensor.numel() for tensor in tensors]
size = sum(numels)
offset = 0
flat = tensors[0].new(size)
for tensor, numel in zip(tensors, numels):
flat.narrow(0, offset, numel).copy_(tensor, broadcast=False)
offset += numel
return flat
def _unflatten_tensors(flat, tensors):
"""View a flat buffer using the sizes of tensors"""
outputs = []
offset = 0
for tensor in tensors:
numel = tensor.numel()
outputs.append(flat.narrow(0, offset, numel).view_as(tensor))
offset += numel
return tuple(outputs)
def _take_tensors(tensors, size_limit):
"""Groups tensors into lists of up to size_limit bytes"""
buf = []
size = 0
last_type = type(tensors[0]) if len(tensors) > 0 else None
for tensor in tensors:
t = type(tensor)
param_size = tensor.numel() * tensor.element_size()
if t is not last_type or (size + param_size > size_limit and size > 0):
yield buf
last_type = t
size = 0
buf = []
buf.append(tensor)
size += param_size
if len(buf) > 0:
yield buf
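
A hedged round-trip sketch of the two buffer helpers above (they are private utilities, so this only illustrates the intended contract):

```
import torch

tensors = [torch.ones(2, 3), torch.zeros(4)]
flat = _flatten_tensors(tensors)           # single contiguous 1-D buffer
views = _unflatten_tensors(flat, tensors)  # views shaped like the originals
assert flat.numel() == sum(t.numel() for t in tensors)
assert views[0].size() == tensors[0].size()
```
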

View File

@ -5,7 +5,6 @@ changes to the existing code - you only need to wrap all tensors in
:class:`.Variable` objects.
"""
import torch
import warnings
from .variable import Variable
from .function import Function, NestedIOFunction
@ -15,41 +14,13 @@ from .gradcheck import gradcheck
__all__ = ['Variable', 'Function', 'StochasticFunction', 'backward']
def _make_grads(outputs, grads, user_create_graph):
if user_create_graph is not None:
create_graph = user_create_graph
else:
create_graph = any(isinstance(grad, Variable) and not grad.volatile
for grad in grads)
new_grads = []
for out, grad in zip(outputs, grads):
if isinstance(grad, Variable):
new_grads.append(grad)
elif torch.is_tensor(grad):
new_grads.append(Variable(grad, volatile=not create_graph))
elif grad is None:
if out.requires_grad:
if out.numel() != 1:
raise RuntimeError("grad can be implicitly created only for scalar outputs")
data = out.data
new_grads.append(
Variable(data.new().resize_as_(data).fill_(1), volatile=not create_graph))
else:
new_grads.append(None)
else:
raise TypeError("gradients can be either Tensors, Variables or None, but got " +
type(grad).__name__)
return tuple(new_grads), create_graph
def backward(variables, grad_variables=None, retain_graph=None, create_graph=None, retain_variables=None):
def backward(variables, grad_variables, retain_variables=False):
"""Computes the sum of gradients of given variables w.r.t. graph leaves.
The graph is differentiated using the chain rule. If any of ``variables``
are non-scalar (i.e. their data has more than one element) and require
gradient, the function additionaly requires specifying ``grad_variables``.
It should be a sequence of matching length, that contains gradient of
It should be a sequence of matching length, that containins gradient of
the differentiated function w.r.t. corresponding variables (``None`` is an
acceptable value for all variables that don't need gradient tensors).
@ -59,98 +30,15 @@ def backward(variables, grad_variables=None, retain_graph=None, create_graph=None, retain_variables=None):
Arguments:
variables (sequence of Variable): Variables of which the derivative will be
computed.
grad_variables (sequence of (Tensor, Variable or None)): Gradients w.r.t.
each element of corresponding variables. Any tensors will be
automatically converted to Variables that are volatile unless
``create_graph`` is True. None values can be specified for scalar
Variables or ones that don't require grad. If a None value would
be acceptable for all grad_variables, then this argument is optional.
retain_graph (bool, optional): If False, the graph used to compute the grad
will be freed. Note that in nearly all cases setting this option to True
is not needed and often can be worked around in a much more efficient
way. Defaults to the value of ``create_graph``.
create_graph (bool, optional): If true, graph of the derivative will
be constructed, allowing to compute higher order derivative products.
Defaults to False, unless ``grad_variables`` contains at least one
non-volatile Variable.
grad_variables (sequence of Tensor): Gradients w.r.t. each element of
corresponding variables. Required only for non-scalar variables that
require gradient.
retain_variables (bool): If ``True``, buffers necessary for computing
gradients won't be freed after use. It is only necessary to
specify ``True`` if you want to differentiate some subgraph multiple
times.
"""
variables = (variables,) if isinstance(variables, Variable) else tuple(variables)
if grad_variables is None:
grad_variables = [None] * len(variables)
elif isinstance(grad_variables, Variable) or torch.is_tensor(grad_variables):
grad_variables = [grad_variables]
else:
grad_variables = list(grad_variables)
grad_variables, create_graph = _make_grads(variables, grad_variables, create_graph)
if retain_variables is not None:
if retain_graph is not None:
raise ValueError("only one of retain_graph and retain_variables can be specified")
retain_graph = retain_variables
warnings.warn("retain_variables option is deprecated and will be removed in 0.3. "
"Use retain_graph instead.")
elif retain_graph is None:
retain_graph = create_graph
Variable._execution_engine.run_backward(
variables, grad_variables, retain_graph)
tuple(variables), tuple(grad_variables), retain_variables)
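
A hedged usage sketch for the variant shown in this hunk whose grad_variables defaults to None (values are illustrative):

```
import torch
from torch.autograd import Variable, backward

x = Variable(torch.ones(3), requires_grad=True)
y = (x * 2).sum()        # scalar output, so grad_variables can stay None
backward(y)              # same effect as y.backward()
print(x.grad)            # gradient of 2 for every element of x
```
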
def grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=None, only_inputs=True):
"""Computes and returns the sum of gradients of outputs w.r.t. the inputs.
``grad_outputs`` should be a sequence of length matching ``output``
containing the pre-computed gradients w.r.t. each of the outputs. If an
output doesn't require_grad, then the gradient can be ``None``).
Gradients can be given as Tensors when one doesn't need the graph of the
derivative, or as Variables, in which case the graph will be created.
If ``only_inputs`` is True, the function will only return a list of gradients
w.r.t the specified inputs. If it's False, then gradient w.r.t. all remaining
leaves will still be computed, and will be accumulated into their ``.grad``
attribute.
Arguments:
outputs (sequence of Variable): outputs of the differentiated function.
inputs (sequence of Variable): Inputs w.r.t. which the gradient will be
returned (and not accumulated into ``.grad``).
grad_outputs (sequence of Tensor or Variable): Gradients w.r.t. each output.
Any tensors will be automatically converted to Variables that are
volatile unless ``create_graph`` is True. None values can be
specified for scalar Variables or ones that don't require grad.
If a None value would be acceptable for all grad_variables, then
this argument is optional.
retain_graph (bool, optional): If False, the graph used to compute the grad
will be freed. Note that in nearly all cases setting this option to True
is not needed and often can be worked around in a much more efficient
way. Defaults to the value of ``create_graph``.
create_graph (bool, optional): If True, graph of the derivative will
be constructed, allowing to compute higher order derivative products.
Defaults to False, unless ``grad_variables`` contains at least one
non-volatile Variable.
only_inputs (bool, optional): If True, gradient w.r.t. leaves that are
part of the graph, but don't appear in ``inputs`` won't be computed
and accumulated. Defaults to True.
"""
outputs = (outputs,) if isinstance(outputs, Variable) else tuple(outputs)
inputs = (inputs,) if isinstance(inputs, Variable) else tuple(inputs)
if grad_outputs is None:
grad_outputs = [None] * len(outputs)
elif isinstance(grad_outputs, Variable) or torch.is_tensor(grad_outputs):
grad_outputs = [grad_outputs]
else:
grad_outputs = list(grad_outputs)
grad_outputs, create_graph = _make_grads(outputs, grad_outputs, create_graph)
if retain_graph is None:
retain_graph = create_graph
return Variable._execution_engine.run_backward(
outputs, grad_outputs, retain_graph,
inputs, only_inputs)
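
And a hedged sketch of torch.autograd.grad as documented above (the function appears on only one side of this comparison); values are illustrative.

```
import torch
from torch.autograd import Variable, grad

x = Variable(torch.ones(2), requires_grad=True)
y = (x ** 2).sum()
gx, = grad(y, x)         # returned directly, not accumulated into x.grad
print(gx)                # 2 * x, i.e. [2, 2]
```
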
if not torch._C._autograd_init():
raise RuntimeError("autograd initialization failed")
assert torch._C._autograd_init()

View File

@ -1,228 +1,200 @@
import torch
from ..function import Function, InplaceFunction
from .utils import maybe_unexpand, maybe_unexpand_or_view
import math
def maybe_view(tensor, size):
if tensor.size() == size:
return tensor
return tensor.contiguous().view(size)
class Add(InplaceFunction):
@staticmethod
def forward(ctx, a, b, inplace=False):
ctx.a_size = a.size()
ctx.b_size = b.size()
if inplace:
ctx.mark_dirty(a)
def forward(self, a, b):
self.b_size = b.size()
if self.inplace:
self.mark_dirty(a)
return a.add_(b)
else:
return a.add(b)
@staticmethod
def backward(ctx, grad_output):
return maybe_unexpand(grad_output, ctx.a_size), maybe_unexpand_or_view(grad_output, ctx.b_size), None
def backward(self, grad_output):
return grad_output, maybe_view(grad_output, self.b_size)
class Sub(InplaceFunction):
@staticmethod
def forward(ctx, a, b, inplace=False):
ctx.a_size = a.size()
ctx.b_size = b.size()
if inplace:
ctx.mark_dirty(a)
def forward(self, a, b):
self.b_size = b.size()
if self.inplace:
self.mark_dirty(a)
return a.sub_(b)
else:
return a.sub(b)
@staticmethod
def backward(ctx, grad_output):
return maybe_unexpand(grad_output, ctx.a_size), maybe_unexpand_or_view(grad_output.neg(), ctx.b_size), None
def backward(self, grad_output):
return grad_output, maybe_view(grad_output.neg(), self.b_size)
class Mul(Function):
@staticmethod
def forward(ctx, a, b):
ctx.a_size = a.size()
ctx.b_size = b.size()
ctx.save_for_backward(a, b)
def forward(self, a, b):
self.b_size = b.size()
self.save_for_backward(a, b)
return a.mul(b)
@staticmethod
def backward(ctx, grad_output):
a, b = ctx.saved_variables
return maybe_unexpand(grad_output.mul(b), ctx.a_size), maybe_unexpand_or_view(grad_output.mul(a), ctx.b_size)
def backward(self, grad_output):
a, b = self.saved_tensors
return grad_output.mul(b), maybe_view(grad_output.mul(a), self.b_size)
class Div(Function):
@staticmethod
def forward(ctx, a, b):
ctx.a_size = a.size()
ctx.b_size = b.size()
ctx.save_for_backward(a, b)
def forward(self, a, b):
self.b_size = b.size()
self.save_for_backward(a, b)
return a.div(b)
@staticmethod
def backward(ctx, grad_output):
a, b = ctx.saved_variables
b_rec = b.reciprocal()
grad_a = grad_output.mul(b_rec)
grad_b = grad_output.neg().mul(a).mul(b_rec).mul(b_rec)
return maybe_unexpand(grad_a, ctx.a_size), maybe_unexpand_or_view(grad_b, ctx.b_size)
def backward(self, grad_output):
a, b = self.saved_tensors
return grad_output.div(b), maybe_view(grad_output.neg().mul(a).div_(b).div_(b), self.b_size)
class Pow(Function):
@staticmethod
def forward(ctx, a, b):
ctx.a_size = a.size()
ctx.b_size = b.size()
ctx.save_for_backward(a, b)
def forward(self, a, b):
self.b_size = b.size()
self.save_for_backward(a, b)
return a.pow(b)
@staticmethod
def backward(ctx, grad_output):
a, b = ctx.saved_variables
grad_a = grad_output.mul(b).mul(a.pow(b - 1))
grad_b = grad_output.mul(a.pow(b)).mul(a.log())
return maybe_unexpand(grad_a, ctx.a_size), maybe_unexpand_or_view(grad_b, ctx.b_size)
def sort_args(a, b):
return (a, b, True) if torch.is_tensor(a) else (b, a, False)
def backward(self, grad_output):
a, b = self.saved_tensors
return grad_output.mul(b).mul_(a.pow(b - 1)), maybe_view(grad_output.mul(a.pow(b)).mul_(a.log()), self.b_size)
class AddConstant(InplaceFunction):
@staticmethod
def forward(ctx, a, b, inplace=False):
tensor, constant, ctx.tensor_first = sort_args(a, b)
if inplace:
ctx.mark_dirty(tensor)
return tensor.add_(constant)
else:
return tensor.add(constant)
def __init__(self, constant, inplace=False):
super(AddConstant, self).__init__(inplace)
self.constant = constant
@staticmethod
def backward(ctx, grad_output):
if ctx.tensor_first:
return grad_output, None, None
def forward(self, a):
if self.inplace:
self.mark_dirty(a)
return a.add_(self.constant)
else:
return None, grad_output, None
return a.add(self.constant)
def backward(self, grad_output):
return grad_output
class SubConstant(InplaceFunction):
@staticmethod
def forward(ctx, a, b, inplace=False):
tensor, constant, ctx.tensor_first = sort_args(a, b)
if ctx.tensor_first:
if inplace:
ctx.mark_dirty(tensor)
return tensor.sub_(constant)
else:
return tensor.sub(constant)
else:
if inplace:
ctx.mark_dirty(tensor)
return tensor.neg_().add_(constant)
else:
return tensor.neg().add_(constant)
def __init__(self, constant, sub_tensor=False, inplace=False):
super(SubConstant, self).__init__(inplace)
self.constant = constant
self.sub_tensor = sub_tensor
@staticmethod
def backward(ctx, grad_output):
if ctx.tensor_first:
return grad_output, None, None
def forward(self, a):
if self.sub_tensor:
if a.is_signed() and self.inplace:
self.mark_dirty(a)
return a.neg_().add_(self.constant)
else:
assert not self.inplace, "can't perform (constant - tensor) " \
"subtraction in-place on an unsigned type"
return a.new().resize_as_(a).fill_(self.constant).sub_(a)
else:
return None, grad_output.neg(), None
if self.inplace:
self.mark_dirty(a)
return a.sub_(self.constant)
else:
return a.sub(self.constant)
def backward(self, grad_output):
if self.sub_tensor:
return grad_output.neg()
else:
return grad_output
class MulConstant(InplaceFunction):
@staticmethod
def forward(ctx, a, b, inplace=False):
tensor, ctx.constant, ctx.tensor_first = sort_args(a, b)
if inplace:
ctx.mark_dirty(tensor)
return tensor.mul_(ctx.constant)
else:
return tensor.mul(ctx.constant)
def __init__(self, constant, inplace=False):
super(MulConstant, self).__init__(inplace)
self.constant = constant
@staticmethod
def backward(ctx, grad_output):
grad_input = grad_output.mul(ctx.constant)
if ctx.tensor_first:
return grad_input, None, None
def forward(self, a):
if self.inplace:
self.mark_dirty(a)
return a.mul_(self.constant)
else:
return None, grad_input, None
return a.mul(self.constant)
def backward(self, grad_output):
return grad_output.mul(self.constant)
class DivConstant(InplaceFunction):
@staticmethod
def forward(ctx, a, b, inplace=False):
tensor, ctx.constant, ctx.tensor_first = sort_args(a, b)
ctx.inplace = inplace
if ctx.tensor_first:
if inplace:
ctx.mark_dirty(tensor)
return tensor.div_(ctx.constant)
else:
return tensor.div(ctx.constant)
else:
ctx.save_for_backward(tensor)
if inplace:
ctx.mark_dirty(tensor)
return tensor.reciprocal_().mul_(ctx.constant)
else:
return tensor.reciprocal().mul_(ctx.constant)
def __init__(self, constant, div_by_tensor=False, inplace=False):
super(DivConstant, self).__init__(inplace)
self.constant = constant
self.div_by_tensor = div_by_tensor
if self.inplace and self.div_by_tensor:
# TODO: actually, as long as the type is floating point, we can
raise RuntimeError("can't perform (constant / tensor) division in-place")
@staticmethod
def backward(ctx, grad_output):
if ctx.tensor_first:
return grad_output.div(ctx.constant), None, None
def forward(self, a):
if self.div_by_tensor:
self.save_for_backward(a)
return a.new().resize_as_(a).fill_(self.constant).div_(a)
else:
v, = ctx.saved_variables
if ctx.inplace:
return None, grad_output.mul(v).mul(v).div_(-ctx.constant), None
if self.inplace:
return a.div_(self.constant)
else:
v_rep = v.reciprocal()
return None, grad_output.mul(v_rep).mul(v_rep).mul_(-ctx.constant), None
return a.div(self.constant)
def backward(self, grad_output):
if self.div_by_tensor:
a = self.saved_tensors[0]
return grad_output.neg().mul_(self.constant).div_(a).div_(a)
else:
return grad_output.div(self.constant)
class PowConstant(Function):
@staticmethod
def forward(ctx, a, b):
tensor, ctx.constant, ctx.tensor_first = sort_args(a, b)
if ctx.tensor_first:
ctx.save_for_backward(tensor)
return tensor.pow(ctx.constant)
else:
result = torch.pow(ctx.constant, tensor)
ctx.save_for_backward(result)
return result
def __init__(self, constant, tensor_power=False):
super(PowConstant, self).__init__()
self.constant = constant
self.tensor_power = tensor_power
@staticmethod
def backward(ctx, grad_output):
if ctx.tensor_first:
var, = ctx.saved_variables
return grad_output.mul(ctx.constant).mul(var.pow(ctx.constant - 1)), None
def forward(self, a):
if self.tensor_power:
self.fw_result = torch.pow(self.constant, a)
return self.fw_result
else:
var_result, = ctx.saved_variables
return None, grad_output.mul(var_result).mul_(math.log(ctx.constant))
self.save_for_backward(a)
return a.pow(self.constant)
def backward(self, grad_output):
if self.tensor_power:
return grad_output.mul(self.fw_result).mul_(math.log(self.constant))
else:
a = self.saved_tensors[0]
return grad_output.mul(self.constant).mul_(a.pow(self.constant - 1))
class Negate(InplaceFunction):
@staticmethod
def forward(ctx, i, inplace=False):
if inplace:
ctx.mark_dirty(i)
def forward(self, i):
if self.inplace:
return i.neg_()
else:
return i.neg()
@staticmethod
def backward(ctx, grad_output):
return grad_output.neg(), None
def backward(self, grad_output):
return grad_output.neg()
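
The hunks above move these arithmetic functions between the self-based style and the ctx-based static-method style; below is a hedged, self-contained sketch of the ctx style, in which functions are invoked via .apply (the Square function itself is made up for illustration).

```
import torch
from torch.autograd import Function, Variable

class Square(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)     # tensors in, tensors out
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_variables     # Variables during backward
        return grad_output * 2 * x

v = Variable(torch.Tensor([1.0, 2.0, 3.0]), requires_grad=True)
out = Square.apply(v)                # ctx-style functions are called via .apply
out.sum().backward()
```
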

View File

@ -1,224 +1,195 @@
import torch
from ..function import Function, InplaceFunction
from .utils import maybe_unexpand
# TODO: no need to save all args if the grad w.r.t. some of them is not needed
def _get_output(ctx, arg, inplace=False):
if inplace:
ctx.mark_dirty(arg)
return arg
else:
return arg.new().resize_as_(arg)
class _BlasBase(InplaceFunction):
def __init__(self, alpha=1, beta=1, inplace=False):
super(_BlasBase, self).__init__(inplace)
self.alpha = alpha
self.beta = beta
def _get_output(self, arg):
if self.inplace:
self.mark_dirty(arg)
return arg
else:
return arg.new().resize_as_(arg)
class Addmm(InplaceFunction):
class Addmm(_BlasBase):
@staticmethod
def forward(ctx, add_matrix, matrix1, matrix2, alpha=1, beta=1, inplace=False):
ctx.alpha = alpha
ctx.beta = beta
ctx.add_matrix_size = add_matrix.size()
ctx.save_for_backward(matrix1, matrix2)
output = _get_output(ctx, add_matrix, inplace=inplace)
return torch.addmm(alpha, add_matrix, beta,
def forward(self, add_matrix, matrix1, matrix2):
self.save_for_backward(matrix1, matrix2)
output = self._get_output(add_matrix)
return torch.addmm(self.alpha, add_matrix, self.beta,
matrix1, matrix2, out=output)
@staticmethod
def backward(ctx, grad_output):
matrix1, matrix2 = ctx.saved_variables
def backward(self, grad_output):
matrix1, matrix2 = self.saved_tensors
grad_add_matrix = grad_matrix1 = grad_matrix2 = None
if ctx.needs_input_grad[0]:
grad_add_matrix = maybe_unexpand(grad_output, ctx.add_matrix_size)
if ctx.alpha != 1:
grad_add_matrix = grad_add_matrix.mul(ctx.alpha)
if self.needs_input_grad[0]:
grad_add_matrix = grad_output
if self.alpha != 1:
grad_add_matrix = grad_add_matrix.mul(self.alpha)
if ctx.needs_input_grad[1]:
if matrix1.stride() == (1, matrix1.size(0)):
# column major gradient if input is column major
grad_matrix1 = torch.mm(matrix2, grad_output.t()).t()
else:
grad_matrix1 = torch.mm(grad_output, matrix2.t())
if ctx.beta != 1:
grad_matrix1 *= ctx.beta
if self.needs_input_grad[1]:
grad_matrix1 = torch.mm(grad_output, matrix2.t())
if self.beta != 1:
grad_matrix1 *= self.beta
if ctx.needs_input_grad[2]:
if matrix2.stride() == (1, matrix2.size(0)):
# column major gradient if input is column major
grad_matrix2 = torch.mm(grad_output.t(), matrix1).t()
else:
grad_matrix2 = torch.mm(matrix1.t(), grad_output)
if ctx.beta != 1:
grad_matrix2 *= ctx.beta
if self.needs_input_grad[2]:
grad_matrix2 = torch.mm(matrix1.t(), grad_output)
if self.beta != 1:
grad_matrix2 *= self.beta
return grad_add_matrix, grad_matrix1, grad_matrix2, None, None, None
return grad_add_matrix, grad_matrix1, grad_matrix2
class Addbmm(InplaceFunction):
class Addbmm(_BlasBase):
@staticmethod
def forward(ctx, add_matrix, batch1, batch2, alpha=1, beta=1, inplace=False):
ctx.alpha = alpha
ctx.beta = beta
ctx.add_matrix_size = add_matrix.size()
ctx.save_for_backward(batch1, batch2)
output = _get_output(ctx, add_matrix, inplace=inplace)
return torch.addbmm(alpha, add_matrix, beta,
def forward(self, add_matrix, batch1, batch2):
self.save_for_backward(batch1, batch2)
output = self._get_output(add_matrix)
return torch.addbmm(self.alpha, add_matrix, self.beta,
batch1, batch2, out=output)
@staticmethod
def backward(ctx, grad_output):
batch1, batch2 = ctx.saved_variables
def backward(self, grad_output):
batch1, batch2 = self.saved_tensors
grad_add_matrix = grad_batch1 = grad_batch2 = None
if ctx.needs_input_grad[0]:
grad_add_matrix = maybe_unexpand(grad_output, ctx.add_matrix_size)
if ctx.alpha != 1:
grad_add_matrix = grad_add_matrix.mul(ctx.alpha)
if self.needs_input_grad[0]:
grad_add_matrix = grad_output
if self.alpha != 1:
grad_add_matrix = grad_add_matrix.mul(self.alpha)
if any(ctx.needs_input_grad[1:]):
if any(self.needs_input_grad[1:]):
batch_grad_output = (grad_output
.unsqueeze(0)
.expand(batch1.size(0), batch1.size(1), batch2.size(2)))
if ctx.needs_input_grad[1]:
if self.needs_input_grad[1]:
grad_batch1 = torch.bmm(batch_grad_output, batch2.transpose(1, 2))
if ctx.beta != 1:
grad_batch1 *= ctx.beta
if self.beta != 1:
grad_batch1 *= self.beta
if ctx.needs_input_grad[2]:
if self.needs_input_grad[2]:
grad_batch2 = torch.bmm(batch1.transpose(1, 2), batch_grad_output)
if ctx.beta != 1:
grad_batch2 *= ctx.beta
if self.beta != 1:
grad_batch2 *= self.beta
return grad_add_matrix, grad_batch1, grad_batch2, None, None, None
return grad_add_matrix, grad_batch1, grad_batch2
class Baddbmm(InplaceFunction):
class Baddbmm(_BlasBase):
@staticmethod
def forward(ctx, add_batch, batch1, batch2, alpha=1, beta=1, inplace=False):
ctx.alpha = alpha
ctx.beta = beta
ctx.add_batch_size = add_batch.size()
ctx.save_for_backward(batch1, batch2)
output = _get_output(ctx, add_batch, inplace=inplace)
return torch.baddbmm(alpha, add_batch, beta,
def forward(self, add_batch, batch1, batch2):
self.save_for_backward(batch1, batch2)
output = self._get_output(add_batch)
return torch.baddbmm(self.alpha, add_batch, self.beta,
batch1, batch2, out=output)
@staticmethod
def backward(ctx, grad_output):
batch1, batch2 = ctx.saved_variables
def backward(self, grad_output):
batch1, batch2 = self.saved_tensors
grad_add_batch = grad_batch1 = grad_batch2 = None
if ctx.needs_input_grad[0]:
grad_add_batch = maybe_unexpand(grad_output, ctx.add_batch_size)
if ctx.alpha != 1:
grad_add_batch = grad_add_batch.mul(ctx.alpha)
if self.needs_input_grad[0]:
grad_add_batch = grad_output
if self.alpha != 1:
grad_add_batch = grad_add_batch.mul(self.alpha)
if ctx.needs_input_grad[1]:
if self.needs_input_grad[1]:
grad_batch1 = torch.bmm(grad_output, batch2.transpose(1, 2))
if ctx.beta != 1:
grad_batch1 *= ctx.beta
if self.beta != 1:
grad_batch1 *= self.beta
if ctx.needs_input_grad[2]:
if self.needs_input_grad[2]:
grad_batch2 = torch.bmm(batch1.transpose(1, 2), grad_output)
if ctx.beta != 1:
grad_batch2 *= ctx.beta
if self.beta != 1:
grad_batch2 *= self.beta
return grad_add_batch, grad_batch1, grad_batch2, None, None, None
return grad_add_batch, grad_batch1, grad_batch2
class Addmv(InplaceFunction):
class Addmv(_BlasBase):
@staticmethod
def forward(ctx, add_vector, matrix, vector, alpha=1, beta=1, inplace=False):
ctx.alpha = alpha
ctx.beta = beta
ctx.add_vector_size = add_vector.size()
ctx.save_for_backward(matrix, vector)
output = _get_output(ctx, add_vector, inplace=inplace)
return torch.addmv(alpha, add_vector, beta,
def forward(self, add_vector, matrix, vector):
self.save_for_backward(matrix, vector)
output = self._get_output(add_vector)
return torch.addmv(self.alpha, add_vector, self.beta,
matrix, vector, out=output)
@staticmethod
def backward(ctx, grad_output):
matrix, vector = ctx.saved_variables
def backward(self, grad_output):
matrix, vector = self.saved_tensors
grad_add_vector = grad_matrix = grad_vector = None
if ctx.needs_input_grad[0]:
grad_add_vector = maybe_unexpand(grad_output, ctx.add_vector_size)
if ctx.alpha != 1:
grad_add_vector = grad_add_vector.mul(ctx.alpha)
if self.needs_input_grad[0]:
grad_add_vector = grad_output
if self.alpha != 1:
grad_add_vector = grad_add_vector.mul(self.alpha)
if ctx.needs_input_grad[1]:
if self.needs_input_grad[1]:
grad_matrix = torch.ger(grad_output, vector)
if ctx.beta != 1:
grad_matrix *= ctx.beta
if self.beta != 1:
grad_matrix *= self.beta
if ctx.needs_input_grad[2]:
if self.needs_input_grad[2]:
grad_vector = torch.mv(matrix.t(), grad_output)
if ctx.beta != 1:
grad_vector *= ctx.beta
if self.beta != 1:
grad_vector *= self.beta
return grad_add_vector, grad_matrix, grad_vector, None, None, None
return grad_add_vector, grad_matrix, grad_vector
class Addr(InplaceFunction):
class Addr(_BlasBase):
@staticmethod
def forward(ctx, add_matrix, vector1, vector2, alpha=1, beta=1, inplace=False):
ctx.alpha = alpha
ctx.beta = beta
ctx.add_matrix_size = add_matrix.size()
ctx.save_for_backward(vector1, vector2)
output = _get_output(ctx, add_matrix, inplace=inplace)
return torch.addr(alpha, add_matrix, beta,
def forward(self, add_matrix, vector1, vector2):
self.save_for_backward(vector1, vector2)
output = self._get_output(add_matrix)
return torch.addr(self.alpha, add_matrix, self.beta,
vector1, vector2, out=output)
@staticmethod
def backward(ctx, grad_output):
vector1, vector2 = ctx.saved_variables
def backward(self, grad_output):
vector1, vector2 = self.saved_tensors
grad_add_matrix = grad_vector1 = grad_vector2 = None
if ctx.needs_input_grad[0]:
grad_add_matrix = maybe_unexpand(grad_output, ctx.add_matrix_size)
if ctx.alpha != 1:
grad_add_matrix = grad_add_matrix.mul(ctx.alpha)
if self.needs_input_grad[0]:
grad_add_matrix = grad_output
if self.alpha != 1:
grad_add_matrix = grad_add_matrix.mul(self.alpha)
if ctx.needs_input_grad[1]:
if self.needs_input_grad[1]:
grad_vector1 = torch.mv(grad_output, vector2)
if ctx.beta != 1:
grad_vector1 *= ctx.beta
if self.beta != 1:
grad_vector1 *= self.beta
if ctx.needs_input_grad[2]:
if self.needs_input_grad[2]:
# TODO: maybe it's better to do transpose + mv + transpose
grad_vector2 = torch.mm(vector1.unsqueeze(0), grad_output).squeeze(0)
if ctx.beta != 1:
grad_vector2 *= ctx.beta
if self.beta != 1:
grad_vector2 *= self.beta
return grad_add_matrix, grad_vector1, grad_vector2, None, None, None
return grad_add_matrix, grad_vector1, grad_vector2
class Dot(Function):
@staticmethod
def forward(ctx, vector1, vector2):
ctx.save_for_backward(vector1, vector2)
ctx.sizes = (vector1.size(), vector2.size())
def forward(self, vector1, vector2):
self.save_for_backward(vector1, vector2)
self.sizes = (vector1.size(), vector2.size())
return vector1.new((vector1.dot(vector2),))
@staticmethod
def backward(ctx, grad_output):
vector1, vector2 = ctx.saved_variables
def backward(self, grad_output):
vector1, vector2 = self.saved_tensors
grad_vector1 = grad_vector2 = None
if ctx.needs_input_grad[0]:
grad_vector1 = vector2.mul(grad_output.expand(ctx.sizes[1])).view(ctx.sizes[0])
if self.needs_input_grad[0]:
grad_vector1 = vector2.mul(grad_output[0]).view(self.sizes[0])
if ctx.needs_input_grad[1]:
grad_vector2 = vector1.mul(grad_output.expand(ctx.sizes[0])).view(ctx.sizes[1])
if self.needs_input_grad[1]:
grad_vector2 = vector1.mul(grad_output[0]).view(self.sizes[1])
return grad_vector1, grad_vector2
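
As a quick numeric illustration of the matrix-multiply gradients implemented above (the alpha = beta = 1 case): for C = add + A.mm(B), dL/dA = dL/dC.mm(B.t()) and dL/dB = A.t().mm(dL/dC). The values below are illustrative only.

```
import torch

A, B = torch.randn(2, 3), torch.randn(3, 4)
G = torch.ones(2, 4)        # pretend upstream gradient dL/dC
grad_A = G.mm(B.t())        # matches grad_matrix1 above
grad_B = A.t().mm(G)        # matches grad_matrix2 above
```
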

View File

@ -1,28 +1,19 @@
import torch
from ..function import Function
from .utils import maybe_unexpand, maybe_unexpand_or_view
# TODO: once Cpp-style functions are implemented we can detach a and b
# before calling forward.
class _CompareOp(Function):
@classmethod
def forward(cls, ctx, a, b):
ctx.a_size = a.size()
ctx.b_tensor = torch.is_tensor(b)
ctx.b_size = b.size() if ctx.b_tensor else None
ctx.input_type = type(a)
mask = getattr(a, cls.fn_name)(b)
ctx.mark_non_differentiable(mask)
return mask
def __init__(self, scalar=None):
super(_CompareOp, self).__init__()
self.scalar = scalar
@staticmethod
def backward(ctx, grad_output):
grad_input = (grad_output * 0).type(ctx.input_type)
return (maybe_unexpand(grad_input, ctx.a_size),
maybe_unexpand_or_view(grad_input, ctx.b_size) if ctx.b_tensor else None)
def forward(self, tensor1, tensor2=None):
other = tensor2 if tensor2 is not None else self.scalar
mask = getattr(tensor1, self.fn_name)(other)
self.mark_non_differentiable(mask)
return mask
class Eq(_CompareOp):

View File

@ -1,104 +1,71 @@
import torch
from ..function import Function
from ..variable import Variable
class Diag(Function):
@staticmethod
def forward(ctx, input, diagonal_idx=0):
ctx.diagonal_idx = diagonal_idx
return input.diag(ctx.diagonal_idx)
def __init__(self, diagonal_idx=0):
super(Diag, self).__init__()
self.diagonal_idx = diagonal_idx
@staticmethod
def backward(ctx, grad_output):
return grad_output.diag(ctx.diagonal_idx), None
def forward(self, input):
return input.diag(self.diagonal_idx)
def backward(self, grad_output):
return grad_output.diag(self.diagonal_idx)
class Tril(Function):
@staticmethod
def forward(ctx, input, diagonal_idx=0):
ctx.diagonal_idx = diagonal_idx
return input.tril(ctx.diagonal_idx)
def __init__(self, diagonal_idx=0):
super(Tril, self).__init__()
self.diagonal_idx = diagonal_idx
@staticmethod
def backward(ctx, grad_output):
return grad_output.tril(ctx.diagonal_idx), None
def forward(self, input):
return input.tril(self.diagonal_idx)
def backward(self, grad_output):
return grad_output.tril(self.diagonal_idx)
class Triu(Function):
@staticmethod
def forward(ctx, input, diagnoal_idx=0):
ctx.diagonal_idx = diagnoal_idx
return input.triu(ctx.diagonal_idx)
def __init__(self, diagonal_idx=0):
super(Triu, self).__init__()
self.diagonal_idx = diagonal_idx
@staticmethod
def backward(ctx, grad_output):
return grad_output.triu(ctx.diagonal_idx), None
def forward(self, input):
return input.triu(self.diagonal_idx)
def backward(self, grad_output):
return grad_output.triu(self.diagonal_idx)
class Trace(Function):
@staticmethod
def forward(ctx, input):
ctx.isize = input.size()
return input.new((input.trace(), ))
def forward(self, input):
self.isize = input.size()
return input.new((input.trace(),))
@staticmethod
def backward(ctx, grad_output):
isize = ctx.isize
min_size = min(isize)
grad_input = Variable(grad_output.data.new(isize).zero_()).view(-1)
grad_input[::(isize[1] + 1)] = grad_output.expand(min_size)
return grad_input.view(isize)
def backward(self, grad_output):
isize = self.isize
grad_input = grad_output.new(isize).zero_()
grad_input.view(-1)[::(isize[1] + 1)] = grad_output[0]
return grad_input
class Cross(Function):
@staticmethod
def forward(ctx, input, other, dim=-1):
ctx.dim = dim
ctx.save_for_backward(input, other)
return torch.cross(input, other, ctx.dim)
def __init__(self, dim=-1):
self.dim = dim
@staticmethod
def backward(ctx, grad_output):
input, other = ctx.saved_variables
grad_input = other.cross(grad_output, ctx.dim)
grad_other = grad_output.cross(input, ctx.dim)
return grad_input, grad_other, None
def forward(self, input, other):
self.save_for_backward(input, other)
return torch.cross(input, other, self.dim)
class Inverse(Function):
@staticmethod
def forward(ctx, input):
inverse = torch.inverse(input)
ctx.save_for_backward(inverse)
return inverse
@staticmethod
def backward(ctx, grad_output):
inverse, = ctx.saved_variables
return -torch.mm(inverse.t(), torch.mm(grad_output, inverse.t()))
class Gesv(Function):
@staticmethod
def forward(ctx, b, a):
# TODO see if one can backprop through LU
X, LU = torch.gesv(b, a)
ctx.save_for_backward(X, a)
ctx.mark_non_differentiable(LU)
return X, LU
@staticmethod
def backward(ctx, grad_output, grad_LU=None):
X, a = ctx.saved_variables
grad_b, _ = torch.gesv(grad_output, a.t())
grad_a = -torch.mm(grad_b, X.t())
return grad_b, grad_a
def backward(self, grad_output):
input, other = self.saved_tensors
grad_input = torch.cross(other, grad_output, self.dim)
grad_other = torch.cross(grad_output, input, self.dim)
return grad_input, grad_other

View File

@ -1,351 +1,282 @@
from itertools import repeat
from ..._thnn import type2backend
from ..function import Function, InplaceFunction
from ..variable import Variable
from .utils import maybe_unexpand, maybe_unexpand_or_view
class Exp(InplaceFunction):
@staticmethod
def forward(ctx, i, inplace=False):
if inplace:
ctx.mark_dirty(i)
def forward(self, i):
if self.inplace:
self.mark_dirty(i)
result = i.exp_()
else:
result = i.exp()
ctx.save_for_backward(result)
self.save_for_backward(result)
return result
@staticmethod
def backward(ctx, grad_output):
result, = ctx.saved_variables
return grad_output * result, None
def backward(self, grad_output):
return self.saved_tensors[0] * grad_output
class Log(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.log()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
return grad_output.div(i)
def backward(self, grad_output):
return grad_output.div(self.saved_tensors[0])
class Log1p(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.log1p()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
return grad_output.div(i.add(1))
def backward(self, grad_output):
return grad_output.div(self.saved_tensors[0].add(1))
class Tanh(InplaceFunction):
@staticmethod
def forward(ctx, i, inplace=False):
if inplace:
ctx.mark_dirty(i)
def forward(self, i):
if self.inplace:
self.mark_dirty(i)
result = i.tanh_()
else:
result = i.tanh()
ctx.save_for_backward(result)
self.save_for_backward(result)
return result
@staticmethod
def backward(ctx, grad_output):
result, = ctx.saved_variables
if grad_output.volatile:
grad_input = Variable(grad_output.data.new(grad_output.size()), volatile=True)
backend = type2backend[type(result.data)]
backend.Tanh_updateGradInput(backend.library_state, None, grad_output.data,
grad_input.data, result.data)
else:
grad_input = grad_output * (1 - result * result)
return grad_input, None
def backward(self, grad_output):
result, = self.saved_tensors
return grad_output * (1 - result * result)
class Sigmoid(InplaceFunction):
@staticmethod
def forward(ctx, i, inplace=False):
if inplace:
ctx.mark_dirty(i)
def forward(self, i):
if self.inplace:
self.mark_dirty(i)
result = i.sigmoid_()
else:
result = i.sigmoid()
ctx.save_for_backward(result)
self.save_for_backward(result)
return result
@staticmethod
def backward(ctx, grad_output):
result, = ctx.saved_variables
if grad_output.volatile:
grad_input = Variable(grad_output.data.new(grad_output.size()), volatile=True)
backend = type2backend[type(result.data)]
backend.Sigmoid_updateGradInput(backend.library_state, None, grad_output.data,
grad_input.data, result.data)
else:
grad_input = grad_output * ((1 - result) * result)
return grad_input, None
def backward(self, grad_output):
result, = self.saved_tensors
return grad_output * ((1 - result) * result)
class Sinh(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.sinh()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output * i.cosh()
class Cosh(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.cosh()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output * i.sinh()
class Abs(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.abs()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output * i.sign()
class Clamp(Function):
@staticmethod
def forward(ctx, i, min_val, max_val):
ctx._mask = (i.ge(min_val) * i.le(max_val))
return i.clamp(min_val, max_val)
def __init__(self, min_val, max_val):
super(Clamp, self).__init__()
self.min_val = min_val
self.max_val = max_val
@staticmethod
def backward(ctx, grad_output):
mask = Variable(ctx._mask.type_as(grad_output.data))
return grad_output * mask, None, None
def forward(self, i):
self.save_for_backward(i)
return i.clamp(self.min_val, self.max_val)
def backward(self, grad_output):
i, = self.saved_tensors
mask = i.ge(self.min_val) * i.le(self.max_val)
return grad_output * mask.type_as(grad_output)
class Sqrt(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.sqrt()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
return grad_output.mul(i.pow(-0.5)).div_(2)
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output.mul(i.pow(-0.5)).div(2)
class Sin(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.sin()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output * i.cos()
class Cos(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.cos()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output.mul(i.sin()).neg_()
class Tan(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.tan()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output.div(i.cos().pow(2))
class Asin(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.asin()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
return grad_output * (1 - i.mul(i)).sqrt().reciprocal()
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output * (1 - i.mul(i)).sqrt_().reciprocal_()
class Acos(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.acos()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
return grad_output.mul((1 - i.mul(i)).sqrt().reciprocal()).neg_()
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output.mul((1 - i.mul(i)).sqrt_().reciprocal_()).neg_()
class Atan(Function):
@staticmethod
def forward(ctx, i):
ctx.save_for_backward(i)
def forward(self, i):
self.save_for_backward(i)
return i.atan()
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_variables
return grad_output * i.mul(i).add_(1).reciprocal()
def backward(self, grad_output):
i, = self.saved_tensors
return grad_output * i.mul(i).add_(1).reciprocal_()
class Atan2(Function):
@staticmethod
def forward(ctx, y, x):
ctx.save_for_backward(y, x)
return y.atan2(x)
@staticmethod
def backward(ctx, grad_output):
y, x, = ctx.saved_variables
denominator = y.mul(y).add(x.mul(x)).reciprocal()
return grad_output * x.mul(denominator), grad_output * y.neg().mul(denominator)
# TODO: make inplace and update grad formulas
class Reciprocal(Function):
@staticmethod
def forward(ctx, i):
def forward(self, i):
result = i.reciprocal()
ctx.save_for_backward(result)
self.save_for_backward(result)
return result
@staticmethod
def backward(ctx, grad_output):
result, = ctx.saved_variables
def backward(self, grad_output):
result, = self.saved_tensors
return grad_output * result.mul(result).neg_()
class Cmax(Function):
@staticmethod
def forward(ctx, a, b):
ctx._a_size = a.size()
ctx._b_size = b.size()
ctx._mask = a.gt(b)
def forward(self, a, b):
self._max_buffer = a.gt(b).type_as(a)
return a.max(b)
@staticmethod
def backward(ctx, grad_output):
mask = Variable(ctx._mask.type_as(grad_output.data))
def backward(self, grad_output):
return (
maybe_unexpand(grad_output * mask, ctx._a_size),
maybe_unexpand_or_view(grad_output * Variable(ctx._mask.eq(0).type_as(grad_output.data)), ctx._b_size)
grad_output * self._max_buffer,
grad_output * self._max_buffer.eq(0).type_as(grad_output)
)
class CmaxConstant(Function):
@staticmethod
def forward(ctx, i, constant):
ctx._mask = i.gt(constant)
return i.clamp(min=constant)
def __init__(self, constant):
super(CmaxConstant, self).__init__()
self.constant = constant
@staticmethod
def backward(ctx, grad_output):
mask = Variable(ctx._mask.type_as(grad_output.data))
return grad_output * mask, None
def forward(self, i):
self._max_buffer = i.gt(self.constant).type_as(i)
return i.clamp(min=self.constant)
def backward(self, grad_output):
return grad_output * self._max_buffer
class Cmin(Function):
@staticmethod
def forward(ctx, a, b):
ctx._a_size = a.size()
ctx._b_size = b.size()
ctx._mask = a.lt(b).type_as(a)
def forward(self, a, b):
self._min_buffer = a.lt(b).type_as(a)
return a.min(b)
@staticmethod
def backward(ctx, grad_output):
mask = Variable(ctx._mask.type_as(grad_output.data))
def backward(self, grad_output):
return (
maybe_unexpand(grad_output * mask, ctx._a_size),
maybe_unexpand_or_view(grad_output * Variable(ctx._mask.eq(0).type_as(grad_output.data)), ctx._b_size)
grad_output * self._min_buffer,
grad_output * self._min_buffer.eq(0).type_as(grad_output)
)
class CminConstant(Function):
@staticmethod
def forward(ctx, i, constant):
ctx._mask = i.lt(constant)
return i.clamp(max=constant)
def __init__(self, constant):
super(CminConstant, self).__init__()
self.constant = constant
@staticmethod
def backward(ctx, grad_output):
mask = Variable(ctx._mask.type_as(grad_output.data))
return grad_output * mask, None
def forward(self, i):
self._min_buffer = i.lt(self.constant).type_as(i)
return i.clamp(max=self.constant)
def backward(self, grad_output):
return grad_output * self._min_buffer
class _ConstantGrad(Function):
grad_value = 0
@classmethod
def forward(cls, ctx, *args):
ctx._num_args = len(args)
ctx._args0_size = args[0].size()
return getattr(args[0], cls.__name__.lower())(*args[1:])
def __init__(self, *args):
super(_ConstantGrad, self).__init__()
self.args = args
@classmethod
def backward(cls, ctx, grad_output):
return (maybe_unexpand(grad_output.mul(cls.grad_value), ctx._args0_size),) + (ctx._num_args - 1) * (None,)
def forward(self, i):
return getattr(i, type(self).__name__.lower())(*self.args)
def backward(self, grad_output):
grad_input = grad_output.new(*repeat(1, grad_output.dim()))
grad_input = grad_input.fill_(self.grad_value).expand_as(grad_output)
return grad_input.mul(grad_output)
class Floor(_ConstantGrad):
@ -382,96 +313,91 @@ class Remainder(_ConstantGrad):
class Lerp(Function):
@staticmethod
def forward(ctx, a, b, weight):
ctx._a_size = a.size()
ctx._b_size = b.size()
ctx._weight = float(weight)
return a.lerp(b, ctx._weight)
def __init__(self, weight):
super(Lerp, self).__init__()
self.weight = float(weight)
@staticmethod
def backward(ctx, grad_output):
return (maybe_unexpand(grad_output.mul(1 - ctx._weight), ctx._a_size),
maybe_unexpand_or_view(grad_output.mul(ctx._weight), ctx._b_size), None)
def forward(self, a, b):
return a.lerp(b, self.weight)
def backward(self, grad_output):
return grad_output.mul(1 - self.weight), grad_output.mul(self.weight)
class Rsqrt(InplaceFunction):
@staticmethod
def forward(ctx, i, inplace=False):
if inplace:
ctx.mark_dirty(i)
result = i.rsqrt_()
def forward(self, input):
if self.inplace:
self.mark_dirty(input)
result = input.rsqrt_()
else:
result = i.rsqrt()
ctx.save_for_backward(result)
result = input.rsqrt()
self.save_for_backward(result)
return result
@staticmethod
def backward(ctx, grad_output):
result, = ctx.saved_variables
return result.pow(3).div_(-2).mul(grad_output), None
def backward(self, grad_output):
result, = self.saved_tensors
return result.pow(3).div_(-2).mul_(grad_output)
class Addcmul(InplaceFunction):
@staticmethod
def forward(ctx, add_tensor, mul_tensor1, mul_tensor2, scale=1.0, inplace=False):
ctx._scale = scale
ctx._add_tensor_size = add_tensor.size()
ctx.save_for_backward(mul_tensor1, mul_tensor2)
if inplace:
ctx.mark_dirty(add_tensor)
return add_tensor.addcmul_(scale, mul_tensor1, mul_tensor2)
def __init__(self, scale=1, inplace=False):
super(Addcmul, self).__init__(inplace)
self.scale = scale
def forward(self, add_tensor, mul_tensor1, mul_tensor2):
self.save_for_backward(mul_tensor1, mul_tensor2)
if self.inplace:
return add_tensor.addcmul_(self.scale, mul_tensor1, mul_tensor2)
else:
return add_tensor.addcmul(scale, mul_tensor1, mul_tensor2)
return add_tensor.addcmul(self.scale, mul_tensor1, mul_tensor2)
@staticmethod
def backward(ctx, grad_output):
def backward(self, grad_output):
grad_add = grad_mul1 = grad_mul2 = None
mul_tensor1, mul_tensor2 = ctx.saved_variables
mul_tensor1, mul_tensor2 = self.saved_tensors
if ctx.needs_input_grad[0]:
grad_add = maybe_unexpand(grad_output, ctx._add_tensor_size)
if self.needs_input_grad[0]:
grad_add = grad_output
if ctx.needs_input_grad[1]:
grad_mul1 = maybe_unexpand_or_view(grad_output.mul(mul_tensor2).mul_(ctx._scale), mul_tensor1.size())
if self.needs_input_grad[1]:
grad_mul1 = grad_output.mul(mul_tensor2).mul(self.scale)
if ctx.needs_input_grad[2]:
grad_mul2 = maybe_unexpand_or_view(grad_output.mul(mul_tensor1).mul_(ctx._scale), mul_tensor2.size())
if self.needs_input_grad[2]:
grad_mul2 = grad_output.mul(mul_tensor1).mul(self.scale)
return grad_add, grad_mul1, grad_mul2, None, None
return grad_add, grad_mul1, grad_mul2
class Addcdiv(InplaceFunction):
@staticmethod
def forward(ctx, add_tensor, div_tensor1, div_tensor2, scale=1.0, inplace=False):
ctx._scale = scale
ctx._add_tensor_size = add_tensor.size()
ctx.save_for_backward(div_tensor1, div_tensor2)
if inplace:
ctx.mark_dirty(add_tensor)
return add_tensor.addcdiv_(ctx._scale, div_tensor1, div_tensor2)
def __init__(self, scale=1, inplace=False):
super(Addcdiv, self).__init__(inplace)
self.scale = scale
def forward(self, add_tensor, div_tensor1, div_tensor2):
self.save_for_backward(div_tensor1, div_tensor2)
if self.inplace:
return add_tensor.addcdiv_(self.scale, div_tensor1, div_tensor2)
else:
return add_tensor.addcdiv(ctx._scale, div_tensor1, div_tensor2)
return add_tensor.addcdiv(self.scale, div_tensor1, div_tensor2)
@staticmethod
def backward(ctx, grad_output):
def backward(self, grad_output):
grad_add = grad_div1 = grad_div2 = None
div_tensor1, div_tensor2 = ctx.saved_variables
div_tensor1, div_tensor2 = self.saved_tensors
if ctx.needs_input_grad[0]:
grad_add = maybe_unexpand(grad_output, ctx._add_tensor_size)
if self.needs_input_grad[0]:
grad_add = grad_output
if ctx.needs_input_grad[1]:
grad_div1 = maybe_unexpand_or_view(grad_output.div(div_tensor2).mul_(ctx._scale), div_tensor1.size())
if self.needs_input_grad[1]:
grad_div1 = grad_output.div(div_tensor2).mul(self.scale)
if ctx.needs_input_grad[2]:
if self.needs_input_grad[2]:
div_tensor2_sq = div_tensor2.mul(div_tensor2)
grad_div2 = maybe_unexpand_or_view(grad_output.mul(div_tensor1).div(div_tensor2_sq).mul(-ctx._scale),
div_tensor2.size())
grad_div2 = grad_output.mul(div_tensor1).div_(div_tensor2_sq)
grad_div2.neg_().mul_(self.scale)
return grad_add, grad_div1, grad_div2
return grad_add, grad_div1, grad_div2, None, None
# TODO: atan2 + inplace

View File

@ -1,141 +1,110 @@
from functools import reduce
from ..function import Function
from ..variable import Variable
import torch
class Sum(Function):
class _DimReduceFunction(Function):
@staticmethod
def forward(ctx, input, dim=None, keepdim=None):
ctx.dim = dim
ctx.keepdim = False if keepdim is None else keepdim
ctx.input_size = input.size()
if dim is None:
return input.new((input.sum(),))
def __init__(self, dim=None):
super(_DimReduceFunction, self).__init__()
self.dim = dim
def forward(self, input):
self.input_size = input.size()
fn = getattr(input, self.fn_name)
if self.dim is None:
return input.new((fn(),))
else:
if keepdim is not None:
return input.sum(dim, keepdim=keepdim)
else:
return input.sum(dim)
return fn(self.dim)
@staticmethod
def backward(ctx, grad_output):
if ctx.dim is None:
return grad_output.expand(ctx.input_size), None, None
class Sum(_DimReduceFunction):
fn_name = 'sum'
def backward(self, grad_output):
if self.dim is None:
return grad_output.new(self.input_size).fill_(grad_output[0])
else:
if ctx.keepdim is False and len(ctx.input_size) != 1:
grad_output = grad_output.unsqueeze(ctx.dim)
repeats = [1 for _ in ctx.input_size]
repeats[ctx.dim] = ctx.input_size[ctx.dim]
return grad_output.repeat(*repeats), None, None
repeats = [1 for _ in self.input_size]
repeats[self.dim] = self.input_size[self.dim]
return grad_output.repeat(*repeats),
class Prod(Function):
class Prod(_DimReduceFunction):
@staticmethod
def forward(ctx, input, dim=None, keepdim=None):
ctx.dim = dim
ctx.keepdim = False if keepdim is None else keepdim
ctx.input_size = input.size()
if dim is None:
ctx.result = input.prod()
ctx.save_for_backward(input)
return input.new((ctx.result,))
def forward(self, input):
self.input_size = input.size()
if self.dim is None:
self.result = input.prod()
self.save_for_backward(input)
return input.new((self.result,))
else:
if keepdim is not None:
output = input.prod(dim, keepdim=keepdim)
else:
output = input.prod(dim)
ctx.save_for_backward(input, output)
output = input.prod(self.dim)
self.save_for_backward(input, output)
return output
@staticmethod
def backward(ctx, grad_output):
def safe_zeros_backward(inp, dim):
# note that the gradient is equivalent to:
# cumprod(exclusive, normal) * cumprod(exclusive, reverse), e.g.:
# input: [ a, b, c]
# cumprod(exclusive, normal): [1 , a, a * b]
# cumprod(exclusive, reverse): [b * c, c, 1]
# product: [b * c, a * c, a * b]
# and this is safe under input with 0s.
if inp.size(dim) == 1:
return grad_output
def backward(self, grad_output):
if self.dim is None:
input, = self.saved_tensors
zero_idx = (input == 0).nonzero()
if zero_idx.dim() == 0:
return grad_output.mul(self.result).expand_as(input).div(input)
elif zero_idx.size(0) > 1:
return grad_output.new(self.input_size).zero_()
else:
grad_input = grad_output.new(self.input_size).zero_()
zero_idx = tuple(zero_idx[0].cpu())
input_copy = input.clone()
input_copy[zero_idx] = 1.
grad_input[zero_idx] = grad_output[0] * input_copy.prod()
return grad_input
else:
input, output = self.saved_tensors
dim = self.dim if self.dim >= 0 else self.dim + input.dim()
zero_mask = input == 0
slice_zero_count = zero_mask.sum(dim)
total_zeros = slice_zero_count.sum()
grad_input = grad_output.mul(output).expand_as(input).div(input)
if total_zeros == 0:
return grad_input
ones_size = torch.Size((inp.size()[:dim] + (1,) + inp.size()[dim + 1:]))
ones = Variable(grad_output.data.new(ones_size).fill_(1))
exclusive_normal_nocp = torch.cat((ones, inp.narrow(dim, 0, inp.size(dim) - 1)), dim)
exclusive_normal = exclusive_normal_nocp.cumprod(dim)
some_zeros = slice_zero_count.gt(0).expand_as(grad_input)
grad_input[some_zeros] = 0
def reverse_dim(var, dim):
return var.index_select(dim, Variable(torch.arange(var.size(dim) - 1, -1, -1)).long())
single_zero_idx = slice_zero_count.eq(1).nonzero()
narrow_reverse = reverse_dim(inp.narrow(dim, 1, inp.size(dim) - 1), dim)
exclusive_reverse_nocp = torch.cat((ones, narrow_reverse), dim)
exclusive_reverse = reverse_dim(exclusive_reverse_nocp.cumprod(dim), dim)
if len(single_zero_idx) == 0:
return grad_input
for idx in single_zero_idx:
idx_tuple = tuple(idx.cpu())
input_idx_tuple = idx_tuple[:dim] + (slice(0, None),) + idx_tuple[dim + 1:]
# slice_mask and input_copy are 1D
slice_mask = zero_mask[input_idx_tuple]
input_copy = input[input_idx_tuple].clone()
zero_idx = slice_mask.nonzero()[0, 0]
input_copy[zero_idx] = 1.
grad_idx_tuple = idx_tuple[:dim] + (zero_idx,) + idx_tuple[dim + 1:]
grad_input[grad_idx_tuple] = grad_output[idx_tuple] * input_copy.prod()
grad_input = grad_output.expand_as(exclusive_normal).mul(exclusive_normal.mul(exclusive_reverse))
return grad_input
if ctx.dim is None:
input, = ctx.saved_variables
zero_idx = (input.data == 0).nonzero()
if zero_idx.dim() == 0:
return grad_output.mul(ctx.result).expand_as(input).div(input), None, None
elif zero_idx.size(0) > 1:
return (grad_output * 0).expand_as(input), None, None
else:
return safe_zeros_backward(input.contiguous().view(-1), 0).view_as(input), None, None
class Mean(_DimReduceFunction):
fn_name = 'mean'
def backward(self, grad_output):
if self.dim is None:
grad_input_val = grad_output[0]
grad_input_val /= reduce(lambda x, y: x * y, self.input_size, 1)
return grad_output.new(*self.input_size).fill_(grad_input_val)
else:
input, output = ctx.saved_variables
dim = ctx.dim if ctx.dim >= 0 else ctx.dim + input.dim()
if ctx.keepdim is False and len(ctx.input_size) != 1:
grad_output = grad_output.unsqueeze(dim)
output = output.unsqueeze(dim)
zero_mask = input == 0
slice_zero_count = zero_mask.sum(dim, True)
total_zeros = slice_zero_count.data.sum()
if total_zeros == 0:
grad_input = grad_output.mul(output).expand_as(input).div(input)
else:
grad_input = safe_zeros_backward(input, dim)
return grad_input, None, None
class Mean(Function):
@staticmethod
def forward(ctx, input, dim=None, keepdim=None):
ctx.dim = dim
ctx.keepdim = False if keepdim is None else keepdim
ctx.input_size = input.size()
if dim is None:
return input.new((input.mean(),))
else:
if keepdim is not None:
return input.mean(dim, keepdim=keepdim)
else:
return input.mean(dim)
@staticmethod
def backward(ctx, grad_output):
if ctx.dim is None:
grad_input_val = grad_output / reduce(lambda x, y: x * y, ctx.input_size, 1)
return grad_input_val.expand(ctx.input_size), None, None
else:
if ctx.keepdim is False and len(ctx.input_size) != 1:
grad_output = grad_output.unsqueeze(ctx.dim)
repeats = [1 for _ in ctx.input_size]
dim_size = ctx.input_size[ctx.dim]
repeats[ctx.dim] = dim_size
return grad_output.repeat(*repeats).div_(dim_size), None, None
repeats = [1 for _ in self.input_size]
dim_size = self.input_size[self.dim]
repeats[self.dim] = dim_size
return grad_output.repeat(*repeats).div_(dim_size)
class _SelectionFunction(Function):
@ -143,53 +112,44 @@ class _SelectionFunction(Function):
# additional_args is prepended before dim when calling the tensor
# function. It's a no-op for subclasses other than kthvalue.
# kthvalue not only requires us to pass a dim, but also precede it with k.
additional_args = tuple()
@classmethod
def forward(cls, ctx, input, dim=None, keepdim=None, additional_args=tuple()):
fn = getattr(input, cls.__name__.lower())
ctx.dim = dim
ctx.keepdim = False if keepdim is None else keepdim
ctx.additional_args = additional_args
ctx.input_size = input.size()
if ctx.dim is None and cls.has_all_reduce:
value = fn(*additional_args)
ctx.indices_tuple = tuple(input.eq(value).nonzero()[0])
def __init__(self, dim=None):
super(_SelectionFunction, self).__init__()
self.dim = dim
def forward(self, input):
fn = getattr(input, type(self).__name__.lower())
self.input_size = input.size()
if self.dim is None and self.has_all_reduce:
value = fn(*self.additional_args)
self.indices = tuple(input.eq(value).nonzero()[0])
return input.new((value,))
else:
if ctx.dim is None:
if self.dim is None:
dim = input.dim() - 1
else:
dim = ctx.dim
dim = self.dim
args = (dim,)
if additional_args:
args = additional_args + args
if keepdim is not None:
output, indices = fn(*args, keepdim=keepdim)
else:
output, indices = fn(*args)
ctx.save_for_backward(indices)
ctx.mark_non_differentiable(indices)
if self.additional_args:
args = self.additional_args + args
output, indices = fn(*args)
self.save_for_backward(indices)
self.mark_non_differentiable(indices)
return output, indices
@classmethod
def backward(cls, ctx, grad_output, grad_indices=None):
grad_input = Variable(grad_output.data.new(*ctx.input_size).zero_())
if ctx.dim is None and cls.has_all_reduce:
grad_input[ctx.indices_tuple] = grad_output
def backward(self, grad_output, grad_indices=None):
grad_input = grad_output.new(*self.input_size).zero_()
if self.dim is None and self.has_all_reduce:
grad_input[self.indices] = grad_output[0]
else:
if ctx.dim is None:
dim = len(ctx.input_size) - 1
if self.dim is None:
dim = len(self.input_size) - 1
else:
dim = ctx.dim
indices, = ctx.saved_variables
if ctx.keepdim is False and len(ctx.input_size) != 1:
grad_output = grad_output.unsqueeze(dim)
grad_indices = grad_indices.unsqueeze(dim)
indices = indices.unsqueeze(dim)
dim = self.dim
indices, = self.saved_tensors
grad_input.scatter_(dim, indices, grad_output)
return grad_input, None, None, None
return grad_input
class Max(_SelectionFunction):
@ -205,63 +165,53 @@ class Mode(_SelectionFunction):
class Median(_SelectionFunction):
pass
has_all_reduce = False
class Kthvalue(_SelectionFunction):
has_all_reduce = False
@classmethod
def forward(cls, ctx, input, k, dim=None, keepdim=None):
return super(Kthvalue, cls).forward(ctx, input, dim, keepdim, (k,))
def __init__(self, k, dim=None):
super(Kthvalue, self).__init__(dim)
self.additional_args = (k,)
class Norm(Function):
@staticmethod
def forward(ctx, input, p=2, dim=None, keepdim=None):
ctx.p = p
ctx.dim = dim
ctx.keepdim = False if keepdim is None else keepdim
def __init__(self, norm_type=2, dim=None):
super(Norm, self).__init__()
self.norm_type = norm_type
self.dim = dim
if dim is None:
ctx.norm = input.norm(p)
ctx.save_for_backward(input)
return input.new((ctx.norm,))
def forward(self, input):
if self.dim is None:
self.norm = input.norm(self.norm_type)
self.save_for_backward(input)
return input.new((self.norm,))
else:
if keepdim is not None:
output = input.norm(p, dim, keepdim=keepdim)
else:
output = input.norm(p, dim)
ctx.save_for_backward(input, output)
output = input.norm(self.norm_type, self.dim)
self.save_for_backward(input, output)
return output
@staticmethod
def backward(ctx, grad_output):
if ctx.dim is None:
input, = ctx.saved_variables
if ctx.p == 2:
scale_v = (grad_output / ctx.norm).expand_as(input)
return input.mul(scale_v), None, None, None
def backward(self, grad_output):
if self.dim is None:
input, = self.saved_tensors
if self.norm_type == 2:
return input.mul(grad_output[0] / self.norm)
else:
pow = input.abs().pow(ctx.p - 2)
scale_v = (grad_output / ctx.norm ** (ctx.p - 1)).expand_as(input)
return input.mul(pow).mul(scale_v), None, None, None
pow = input.abs().pow(self.norm_type - 2)
scale = grad_output[0] / self.norm ** (self.norm_type - 1)
return input.mul(pow).mul(scale)
else:
input, output = ctx.saved_variables
if ctx.keepdim is False and input.dim() != 1:
grad_output = grad_output.unsqueeze(ctx.dim)
output = output.unsqueeze(ctx.dim)
input, output = self.saved_tensors
big_grad_output = grad_output.expand_as(input)
if ctx.p == 2:
if self.norm_type == 2:
big_output = output.expand_as(input)
return input.mul(big_grad_output).div(big_output), None, None, None
return input.mul(big_grad_output).div(big_output)
else:
pow = input.abs().pow(ctx.p - 2)
big_output = output.pow(ctx.p - 1).expand_as(input)
return input.mul(pow).mul(big_grad_output).div(big_output), None, None, None
pow = input.abs().pow(self.norm_type - 2)
big_output = output.pow(self.norm_type - 1).expand_as(input)
return input.mul(pow).mul(big_grad_output).div(big_output)
# TODO: renorm
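
The `safe_zeros_backward` comment in `Prod.backward` above describes the gradient of a product as the elementwise product of an exclusive forward cumprod and an exclusive reverse cumprod. The following is a minimal, self-contained sketch of that identity on plain tensors; the values are arbitrary and this is an illustration, not the diffed code itself:

```
import torch

# For x = [a, b, c], d prod(x) / dx = [b*c, a*c, a*b]. Computing it as
# cumprod(exclusive, normal) * cumprod(exclusive, reverse) avoids dividing
# prod(x) by x, so it stays well defined when x contains zeros.
x = torch.Tensor([2.0, 0.0, 5.0])
n = x.size(0)

ones = torch.ones(1)
rev_idx = torch.arange(n - 1, -1, -1).long()

exclusive_normal = torch.cat([ones, x[:n - 1]], 0).cumprod(0)        # [1, a, a*b]
reversed_tail = x.index_select(0, rev_idx)[:n - 1]                   # [c, b]
exclusive_reverse = torch.cat([ones, reversed_tail], 0).cumprod(0)   # [1, c, c*b]
exclusive_reverse = exclusive_reverse.index_select(0, rev_idx)       # [b*c, c, 1]

grad = exclusive_normal * exclusive_reverse                          # [b*c, a*c, a*b]
print(grad)  # -> 0, 10, 0 for x = [2, 0, 5]
```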

View File

@ -1,3 +0,0 @@
%s/self/ctx/g
%s/\s\+def forward/ @staticmethod\r def forward/g
%s/\s\+def backward/ @staticmethod\r @once_differentiable\r def backward/g

View File

@ -23,9 +23,8 @@ class Multinomial(StochasticFunction):
if probs.dim() == 1:
probs = probs.unsqueeze(0)
samples = samples.unsqueeze(0)
reward = reward.unsqueeze(0)
# normalize probs (multinomial accepts weights)
probs /= probs.sum(1, True).expand_as(probs)
probs /= probs.sum(1).expand_as(probs)
grad_probs = probs.new().resize_as_(probs).zero_()
output_probs = probs.gather(1, samples)
output_probs.add_(1e-6).reciprocal_()

File diff suppressed because it is too large

View File

@ -1,39 +0,0 @@
import torch
def maybe_view(variable, size):
if variable.size() == size:
return variable
return variable.contiguous().view(size)
def maybe_unexpand(variable, old_size):
num_unsqueezed = variable.dim() - len(old_size)
expanded_dims = [dim for dim, (expanded, original)
in enumerate(zip(variable.size()[num_unsqueezed:], old_size))
if expanded != original]
for _ in range(num_unsqueezed):
variable = variable.sum(0, keepdim=False)
for dim in expanded_dims:
variable = variable.sum(dim, keepdim=True)
return variable
def variable_expandable(variable, old_size):
try:
torch._C._infer_size(variable.size(), old_size)
except RuntimeError:
return False
return True
def maybe_unexpand_or_view(variable, old_size):
var_expanded = True
if maybe_view:
var_expanded = variable_expandable(variable, old_size)
if var_expanded:
return maybe_unexpand(variable, old_size)
else:
return maybe_view(variable, old_size)

85
torch/autograd/engine.py Normal file
View File

@ -0,0 +1,85 @@
from collections import deque, defaultdict
from torch._C import _ImperativeEngine as ImperativeEngine
from .variable import Variable
class BasicEngine(object):
def _compute_dependencies(self, function):
dependencies = defaultdict(int)
seen = {function}
queue = [function]
while len(queue) > 0:
fn = queue.pop()
for prev_fn, output_nr in fn.previous_functions:
if not prev_fn.requires_grad or isinstance(prev_fn, Variable):
continue
dependencies[prev_fn] += 1
if prev_fn not in seen:
queue.append(prev_fn)
seen.add(prev_fn)
return dependencies
def _free_backward_dependency(self, dependencies, prev_fn):
dependencies[prev_fn] -= 1
if dependencies[prev_fn] == 0:
del dependencies[prev_fn]
return True
return False
def _add_grad(self, need_copy, prev_grad, output_nr, d_prev_fn):
copy_id = (id(prev_grad), output_nr)
if not prev_grad[output_nr]:
prev_grad[output_nr] = d_prev_fn
need_copy.add(copy_id)
else:
grad_tensor = prev_grad[output_nr]
if copy_id in need_copy:
need_copy.remove(copy_id)
grad_tensor = grad_tensor.clone()
prev_grad[output_nr] = grad_tensor
grad_tensor.add_(d_prev_fn)
def run_backward(self, variable, grad, retain_variables):
if variable.creator is None:
variable._do_backward((grad,), retain_variables)
return
initial_grad = [None for _ in range(variable.creator.num_outputs)]
initial_grad[variable.output_nr] = grad
ready = deque([(variable.creator, initial_grad)])
not_ready = {}
need_copy = set()
dependencies = self._compute_dependencies(variable.creator)
while len(ready) > 0:
fn, grad = ready.pop()
grad_input = fn._do_backward(tuple(grad), retain_variables)
for (prev_fn, output_nr), d_prev_fn in zip(fn.previous_functions, grad_input):
if not prev_fn.requires_grad:
# TODO: check that d_prev_fn is None and warn otherwise
continue
if isinstance(prev_fn, Variable):
prev_fn._do_backward((d_prev_fn,), retain_variables)
continue
is_ready = self._free_backward_dependency(dependencies, prev_fn)
if is_ready:
if prev_fn in not_ready:
prev_grad = not_ready[prev_fn]
self._add_grad(need_copy, prev_grad, output_nr, d_prev_fn)
else:
if prev_fn.num_outputs != 1:
raise RuntimeError("one of the function outputs "
"wasn't used - this is an error not, but "
"it's going to be fixed soon")
prev_grad = (d_prev_fn,)
ready.appendleft((prev_fn, prev_grad))
else:
if prev_fn in not_ready:
prev_grad = not_ready[prev_fn]
else:
prev_grad = [None for _ in range(prev_fn.num_outputs)]
self._add_grad(need_copy, prev_grad, output_nr, d_prev_fn)
not_ready[prev_fn] = prev_grad

View File

@ -1,12 +1,47 @@
import torch
import torch._C as _C
import torch.utils.hooks as hooks
from torch._six import with_metaclass
import functools
from collections import OrderedDict
class _ContextMethodMixin(object):
class Function(_C._FunctionBase):
"""Records operation history and defines formulas for differentiating ops.
Every operation performed on :class:`Variable` s creates a new function
object, that performs the computation, and records that it happened.
The history is retained in the form of a DAG of functions, with edges
denoting data dependencies (``input <- output``). Then, when backward is
called, the graph is processed in the topological ordering, by calling
:func:`backward` methods of each :class:`Function` object, and passing
returned gradients on to next :class:`Function` s.
Normally, the only way users interact with functions is by creating
subclasses and defining new operations. This is a recommended way of
extending torch.autograd.
Since Function logic is a hotspot in most scripts, almost all of it
was moved to our C backend, to ensure that the framework overhead is
minimal.
Each function is meant to be used only once (in the forward pass).
Attributes:
saved_tensors: Tuple of Tensors that were saved in the call to
:func:`forward`.
needs_input_grad: Tuple of booleans of length :attr:`num_inputs`,
indicating whether a given input requires gradient. This can be
used to optimize buffers saved for backward, and ignoring gradient
computation in :func:`~Function.backward`.
num_inputs: Number of inputs given to :func:`forward`.
num_outputs: Number of tensors returned by :func:`forward`.
requires_grad: Boolean indicating whether the :func:`backward` will
ever need to be called.
previous_functions: Tuple of (int, Function) pairs of length
:attr:`num_inputs`. Each entry contains a reference to a
:class:`Function` that created corresponding input, and an index
of the previous function output that's been used.
"""
__call__ = _C._FunctionBase._do_forward
def save_for_backward(self, *tensors):
"""Saves given tensors for a future call to :func:`~Function.backward`.
@ -15,10 +50,9 @@ class _ContextMethodMixin(object):
:func:`forward` **method.**
Later, saved tensors can be accessed through the :attr:`saved_tensors`
attribute; or, if the corresponding Variable is needed (e.g. for double
backwards), those can be accessed through the :attr:`saved_variables`
attribute. Before returning them to the user, a check is made, to ensure
they weren't used in any in-place operation that modified their content.
attribute. Before returning them to the user, a check is made, to
ensure they weren't used in any in-place operation that modified
their content.
Arguments can also be ``None``.
"""
@ -31,7 +65,7 @@ class _ContextMethodMixin(object):
:func:`forward` **method, and all arguments should be inputs.**
Every tensor that's been modified in-place in a call to :func:`forward`
should be given to this function, to ensure correctness of our checks.
should be given to this function, to ensure correcness of our checks.
It doesn't matter whether the function is called before or after
modification.
"""
@ -72,9 +106,6 @@ class _ContextMethodMixin(object):
"""
self.non_differentiable = args
class _HookMixin(object):
@staticmethod
def _register_hook(backward_hooks, hook):
if backward_hooks is None:
@ -83,84 +114,7 @@ class _HookMixin(object):
backward_hooks[handle.id] = hook
return backward_hooks, handle
class BackwardCFunction(_C._FunctionBase, _ContextMethodMixin, _HookMixin):
_is_legacy = False
def apply(self, *args):
return self._forward_cls.backward(self, *args)
class FunctionMeta(type):
"""Function metaclass.
This metaclass sets up the following properties:
_is_legacy: True if forward is not defined as a static method.
_backward_cls: The Function class corresponding to the differentiated
version of this function (which is generated on the fly by this
metaclass).
"""
def __init__(cls, name, bases, attrs):
for super_cls in cls.mro():
forward = super_cls.__dict__.get('forward')
if forward is not None:
has_static_forward = isinstance(forward, staticmethod) or isinstance(forward, classmethod)
break
setattr(cls, '_is_legacy', not has_static_forward)
# old-style functions
if not has_static_forward:
return super(FunctionMeta, cls).__init__(name, bases, attrs)
backward_fn = type(name + 'Backward', (BackwardCFunction,), {'_forward_cls': cls})
setattr(cls, '_backward_cls', backward_fn)
return super(FunctionMeta, cls).__init__(name, bases, attrs)
class Function(with_metaclass(FunctionMeta, _C._FunctionBase, _ContextMethodMixin, _HookMixin)):
"""Records operation history and defines formulas for differentiating ops.
Every operation performed on :class:`Variable` s creates a new function
object, that performs the computation, and records that it happened.
The history is retained in the form of a DAG of functions, with edges
denoting data dependencies (``input <- output``). Then, when backward is
called, the graph is processed in the topological ordering, by calling
:func:`backward` methods of each :class:`Function` object, and passing
returned gradients on to next :class:`Function` s.
Normally, the only way users interact with functions is by creating
subclasses and defining new operations. This is a recommended way of
extending torch.autograd.
Since Function logic is a hotspot in most scripts, almost all of it
was moved to our C backend, to ensure that the framework overhead is
minimal.
Each function is meant to be used only once (in the forward pass).
Attributes:
saved_tensors: Tuple of Tensors that were saved in the call to
:func:`forward`.
saved_variables: Tuple of Variables that correspond to the tensors
saved in the call to :func:`forward`.
needs_input_grad: Tuple of booleans of length :attr:`num_inputs`,
indicating whether a given input requires gradient. This can be
used to optimize buffers saved for backward, and ignoring gradient
computation in :func:`~Function.backward`.
num_inputs: Number of inputs given to :func:`forward`.
num_outputs: Number of tensors returned by :func:`forward`.
requires_grad: Boolean indicating whether the :func:`backward` will
ever need to be called.
"""
# only for backward compatibility
__call__ = _C._FunctionBase._do_forward
@staticmethod
def forward(*args, **kwargs):
def forward(self, *input):
"""Performs the operation.
This function is to be overridden by all subclasses.
@ -169,8 +123,7 @@ class Function(with_metaclass(FunctionMeta, _C._FunctionBase, _ContextMethodMixi
"""
raise NotImplementedError
@staticmethod
def backward(*grad_outputs):
def backward(self, *grad_output):
"""Defines a formula for differentiating the operation.
This function is to be overridden by all subclasses.
@ -184,41 +137,6 @@ class Function(with_metaclass(FunctionMeta, _C._FunctionBase, _ContextMethodMixi
raise NotImplementedError
def once_differentiable(fn):
from .variable import Variable
@functools.wraps(fn)
def wrapper(ctx, *args):
tensor_args = [arg.data if isinstance(arg, Variable) else arg
for arg in args]
outputs = fn(ctx, *tensor_args)
# XXX: this is only an approximation of these flags - there's no way
# to figure out if fn didn't use ctx.saved_variables and as a result
# some Variables might require grad, even if no args do.
# Unfortunately, this leads to unexpected error messages ("no nodes
# require computing gradients"), but I don't have a better idea.
# These functions would raise an error in backward anyway.
volatile = any(arg.volatile if isinstance(arg, Variable) else False
for arg in args)
requires_grad = any(arg.requires_grad if isinstance(arg, Variable) else False
for arg in args)
if volatile:
def err_fn(*args):
return args
kwargs = {'volatile': True}
else:
err_fn = torch._C._functions.DelayedError(
b"trying to differentiate twice a function that was marked"
b"with @once_differentiable")
kwargs = {'requires_grad': requires_grad}
if not isinstance(outputs, tuple):
var = Variable(outputs, **kwargs) if outputs is not None else None
return err_fn(var)
return err_fn(*[Variable(o, **kwargs) if o is not None else None
for o in outputs])
return wrapper
class InplaceFunction(Function):
def __init__(self, inplace=False):
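
The docstrings and metaclass above describe two ways of writing an autograd Function: a legacy instance style (state on `self`, the instance itself is called) and a static-method style (state on `ctx`, invoked through `.apply`). Below is a minimal sketch of both, using a made-up scaling op that is not part of this diff; depending on which side of the diff you are on, only one of the two styles is actually available.

```
import torch
from torch.autograd import Function, Variable

# Static-method style: forward/backward take a context object; non-tensor
# state is stashed on ctx and tensors go through ctx.save_for_backward.
class Scale(Function):

    @staticmethod
    def forward(ctx, input, factor):
        ctx.factor = factor
        ctx.save_for_backward(input)
        return input.mul(factor)

    @staticmethod
    def backward(ctx, grad_output):
        # one gradient per forward input; None for the non-differentiable factor
        return grad_output * ctx.factor, None

# Legacy instance style: constructor arguments carry the state and the
# instance itself is called on Variables.
class ScaleLegacy(Function):

    def __init__(self, factor):
        super(ScaleLegacy, self).__init__()
        self.factor = factor

    def forward(self, input):
        return input.mul(self.factor)

    def backward(self, grad_output):
        return grad_output.mul(self.factor)

x = Variable(torch.randn(3), requires_grad=True)
y = Scale.apply(x, 2.0)      # static-method style
z = ScaleLegacy(2.0)(x)      # legacy style
```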

View File

@ -1,41 +1,31 @@
import torch
from torch.autograd import Variable
from collections import Iterable
def iter_variables(x):
def iter_gradients(x):
if isinstance(x, Variable):
if x.requires_grad:
yield (x.grad.data, x.data) if x.grad is not None else (None, None)
elif isinstance(x, Iterable):
yield x.grad.data if x.grad is not None else None
else:
for elem in x:
for result in iter_variables(elem):
for result in iter_gradients(elem):
yield result
def zero_gradients(x):
if isinstance(x, Variable):
if x.grad is not None:
x.grad.detach_()
x.grad.data.zero_()
elif isinstance(x, Iterable):
for elem in x:
zero_gradients(elem)
def zero_gradients(i):
for t in iter_gradients(i):
if t is not None:
t.zero_()
def make_jacobian(input, num_out):
if isinstance(input, Variable) and not input.requires_grad:
return None
elif torch.is_tensor(input) or isinstance(input, Variable):
if torch.is_tensor(input) or isinstance(input, Variable):
return torch.zeros(input.nelement(), num_out)
elif isinstance(input, Iterable):
jacobians = list(filter(
lambda x: x is not None, (make_jacobian(elem, num_out) for elem in input)))
if not jacobians:
return None
return type(input)(jacobians)
else:
return None
return type(input)(filter(lambda x: x is not None,
(make_jacobian(elem, num_out) for elem in input)))
def iter_tensors(x, only_requiring_grad=False):
@ -44,7 +34,7 @@ def iter_tensors(x, only_requiring_grad=False):
elif isinstance(x, Variable):
if x.requires_grad or not only_requiring_grad:
yield x.data
elif isinstance(x, Iterable):
else:
for elem in x:
for result in iter_tensors(elem, only_requiring_grad):
yield result
@ -55,9 +45,8 @@ def contiguous(input):
return input.contiguous()
elif isinstance(input, Variable):
return input.contiguous()
elif isinstance(input, Iterable):
else:
return type(input)(contiguous(e) for e in input)
return input
def get_numerical_jacobian(fn, input, target, eps=1e-3):
@ -81,9 +70,9 @@ def get_numerical_jacobian(fn, input, target, eps=1e-3):
for i in range(flat_tensor.nelement()):
orig = flat_tensor[i]
flat_tensor[i] = orig - eps
outa.copy_(fn(input), broadcast=False)
outa.copy_(fn(input))
flat_tensor[i] = orig + eps
outb.copy_(fn(input), broadcast=False)
outb.copy_(fn(input))
flat_tensor[i] = orig
outb.add_(-1, outa).div_(2 * eps)
@ -94,31 +83,21 @@ def get_numerical_jacobian(fn, input, target, eps=1e-3):
def get_analytical_jacobian(input, output):
jacobian = make_jacobian(input, output.numel())
jacobian_reentrant = make_jacobian(input, output.numel())
grad_output = output.data.clone().zero_()
flat_grad_output = grad_output.view(-1)
reentrant = True
correct_grad_sizes = True
for i in range(flat_grad_output.numel()):
flat_grad_output.zero_()
flat_grad_output[i] = 1
for jacobian_c in (jacobian, jacobian_reentrant):
zero_gradients(input)
output.backward(grad_output, create_graph=True)
for jacobian_x, (d_x, x) in zip(jacobian_c, iter_variables(input)):
if d_x is None:
jacobian_x[:, i].zero_()
else:
if d_x.size() != x.size():
correct_grad_sizes = False
jacobian_x[:, i] = d_x.to_dense() if d_x.is_sparse else d_x
zero_gradients(input)
output.backward(grad_output, retain_variables=True)
for jacobian_x, d_x in zip(jacobian, iter_gradients(input)):
if d_x is None:
jacobian_x[:, i].zero_()
else:
jacobian_x[:, i] = d_x.to_dense() if d_x.is_sparse else d_x
for jacobian_x, jacobian_reentrant_x in zip(jacobian, jacobian_reentrant):
if (jacobian_x - jacobian_reentrant_x).abs().max() != 0:
reentrant = False
return jacobian, reentrant, correct_grad_sizes
return jacobian
def _as_tuple(x):
@ -161,65 +140,21 @@ def gradcheck(func, inputs, eps=1e-6, atol=1e-5, rtol=1e-3):
def fn(input):
return _as_tuple(func(*input))[i].data
analytical, reentrant, correct_grad_sizes = get_analytical_jacobian(_as_tuple(inputs), o)
numerical = get_numerical_jacobian(fn, inputs, inputs, eps)
analytical = get_analytical_jacobian(_as_tuple(inputs), o)
for a, n in zip(analytical, numerical):
if not ((a - n).abs() <= (atol + rtol * n.abs())).all():
return False
if not reentrant:
return False
if not correct_grad_sizes:
return False
# check if the backward multiplies by grad_output
zero_gradients(inputs)
output = _as_tuple(func(*inputs))
torch.autograd.backward(output, [o.data.new(o.size()).zero_() for o in output])
var_inputs = list(filter(lambda i: isinstance(i, Variable), inputs))
if not var_inputs:
raise RuntimeError("no Variables found in input")
for i in var_inputs:
for i in inputs:
if i.grad is None:
continue
if not i.grad.data.eq(0).all():
return False
return True
def gradgradcheck(func, inputs, grad_outputs, eps=1e-6, atol=1e-5, rtol=1e-3):
"""Check gradients of gradients computed via small finite differences
against analytical gradients
This function checks that backpropagating through the gradients computed
with respect to the given grad_outputs is correct.
The check between numerical and analytical has the same behaviour as
numpy.allclose https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html
meaning it checks that
absolute(a - n) <= (atol + rtol * absolute(n))
is true for all elements of analytical gradient a and numerical gradient n.
Args:
func: Python function that takes Variable inputs and returns
a tuple of Variables
inputs: tuple of Variables
grad_outputs: tuple of Variables
eps: perturbation for finite differences
atol: absolute tolerance
rtol: relative tolerance
Returns:
True if all differences satisfy allclose condition
"""
def new_func(*input_args):
input_args = input_args[:-len(grad_outputs)]
outputs = func(*input_args)
outputs = _as_tuple(outputs)
input_args = tuple(x for x in input_args if isinstance(x, Variable) and x.requires_grad)
grad_inputs = torch.autograd.grad(outputs, input_args, grad_outputs)
return grad_inputs
return gradcheck(new_func, inputs + grad_outputs, eps, atol, rtol)
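
As a quick usage sketch for the checkers above (the function under test is arbitrary, and double precision keeps the finite differences accurate), `gradcheck` compares the analytical Jacobian against central finite differences using the allclose-style criterion quoted in the docstring; on the side of this diff that has `gradgradcheck`, second derivatives can be checked the same way by also passing explicit grad_outputs.

```
import torch
from torch.autograd import Variable
from torch.autograd.gradcheck import gradcheck

# |a - n| <= atol + rtol * |n| must hold for every element of the analytical
# gradient a and the numerical gradient n.
def fn(x, w):
    return (x.mm(w).tanh(),)

inputs = (Variable(torch.randn(4, 3).double(), requires_grad=True),
          Variable(torch.randn(3, 2).double(), requires_grad=True))

print(gradcheck(fn, inputs, eps=1e-6, atol=1e-4))  # True if gradients match
```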

View File

@ -1,11 +1,10 @@
import sys
import torch
import torch._C as _C
from collections import OrderedDict
import torch.sparse as sparse
import torch.utils.hooks as hooks
import warnings
import weakref
from ._functions import *
class Variable(_C._VariableBase):
@ -14,7 +13,7 @@ class Variable(_C._VariableBase):
Variable is a thin wrapper around a Tensor object, that also holds
the gradient w.r.t. to it, and a reference to a function that created it.
This reference allows retracing the whole chain of operations that
created the data. If the Variable has been created by the user, its grad_fn
created the data. If the Variable has been created by the user, its creator
will be ``None`` and we call such objects *leaf* Variables.
Since autograd only supports scalar valued function differentiation, grad
@ -34,9 +33,8 @@ class Variable(_C._VariableBase):
inference mode, i.e. don't save the history. See
:ref:`excluding-subgraphs` for more details.
Can be changed only on leaf Variables.
is_leaf: Boolean indicating if the Variable is a graph leaf (i.e
if it was created by the user).
grad_fn: Gradient function graph trace.
creator: Function of which the variable was an output. For leaf
(user created) variables it's ``None``. Read-only attribute.
Parameters:
data (any tensor class): Tensor to wrap.
@ -62,30 +60,29 @@ class Variable(_C._VariableBase):
def __getattr__(self, name):
if name in self._fallthrough_methods:
return getattr(self.data, name)
return object.__getattribute__(self, name)
raise AttributeError(name)
def __getitem__(self, key):
if torch.is_tensor(key):
key = Variable(key) # auto-wrap tensors
if isinstance(key, Variable):
if type(key.data).__name__ == 'ByteTensor':
return MaskedSelect.apply(self, key)
elif type(key.data).__name__ == 'LongTensor':
return IndexSelect.apply(self, 0, key)
# else fall through and raise an error in Index
return Index.apply(self, key)
if (isinstance(key, Variable) and
type(key.data).__name__ == 'ByteTensor'):
return MaskedSelect()(self, key)
return Index(key)(self)
def __setitem__(self, key, value):
if isinstance(key, Variable) and type(key.data).__name__ == 'ByteTensor':
if (isinstance(key, Variable) and
type(key.data).__name__ == 'ByteTensor'):
if isinstance(value, Variable):
return MaskedScatter.apply(self, key, value, True)
return MaskedCopy(inplace=True)(self, key, value)
else:
return MaskedFill.apply(self, key, value, True)
return MaskedFill(value, inplace=True)(self, key)
else:
return SetItem.apply(self, key, value)
if isinstance(value, Variable):
return SetItem(key)(self, value)
else:
return SetItem(key, value)(self)
def __deepcopy__(self, memo):
if not self.is_leaf:
if self.creator is not None:
raise RuntimeError("Only Variables created explicitly by the user "
"(graph leaves) support the deepcopy protocol at the moment")
result = type(self)(self.data.clone())
@ -109,22 +106,14 @@ class Variable(_C._VariableBase):
# legacy serialization of Variable
self.data = state[0]
state = (state[3], state[4], state[2])
if not self.is_leaf:
if self.creator is not None:
raise RuntimeError('__setstate__ can be only called on leaf variables')
self.requires_grad, self.volatile, self._backward_hooks = state
def __repr__(self):
return 'Variable containing:' + self.data.__repr__()
def __bool__(self):
if self.data.numel() == 0:
return False
raise RuntimeError("bool value of Variable objects containing non-empty " +
torch.typename(self.data) + " is ambiguous")
__nonzero__ = __bool__
def backward(self, gradient=None, retain_graph=None, create_graph=None, retain_variables=None):
def backward(self, gradient=None, retain_variables=False):
"""Computes the gradient of current variable w.r.t. graph leaves.
The graph is differentiated using the chain rule. If the variable is
@ -133,27 +122,28 @@ class Variable(_C._VariableBase):
It should be a tensor of matching type and location, that contains
the gradient of the differentiated function w.r.t. ``self``.
This function accumulates gradients in the leaves - you might need to
zero them before calling it.
This function accumulates gradients in the leaves - you might need to zero
them before calling it.
Arguments:
grad_variables (Tensor, Variable or None): Gradient w.r.t. the
variable. If it is a tensor, it will be automatically converted
to a Variable that is volatile unless ``create_graph`` is True.
None values can be specified for scalar Variables or ones that
don't require grad. If a None value would be acceptable then
this argument is optional.
retain_graph (bool, optional): If False, the graph used to compute
the grads will be freed. Note that in nearly all cases setting
this option to True is not needed and often can be worked around
in a much more efficient way. Defaults to the value of
``create_graph``.
create_graph (bool, optional): If true, graph of the derivative will
be constructed, allowing to compute higher order derivative
products. Defaults to False, unless ``gradient`` is a volatile
Variable.
gradient (Tensor): Gradient of the differentiated function
w.r.t. the data. Required only if the data has more than one
element. Type and location should match these of ``self.data``.
retain_variables (bool): If ``True``, buffers necessary for computing
gradients won't be freed after use. It is only necessary to
specify ``True`` if you want to differentiate some subgraph multiple
times (in some cases it will be much more efficient to use
`autograd.backward`).
"""
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
if self.volatile:
raise RuntimeError('calling backward on a volatile variable')
if gradient is None and self.requires_grad:
if self.data.numel() != 1:
raise RuntimeError(
'backward should be called only on a scalar (i.e. 1-element tensor) '
'or with gradient w.r.t. the variable')
gradient = self.data.new().resize_as_(self.data).fill_(1)
self._execution_engine.run_backward((self,), (gradient,), retain_variables)
def register_hook(self, hook):
"""Registers a backward hook.
@ -187,8 +177,8 @@ class Variable(_C._VariableBase):
"doesn't require gradient")
if self._backward_hooks is None:
self._backward_hooks = OrderedDict()
if self.grad_fn is not None:
self.grad_fn._register_hook_dict(self)
if self.creator is not None:
self.creator._register_hook_dict(self)
handle = hooks.RemovableHandle(self._backward_hooks)
self._backward_hooks[handle.id] = hook
return handle
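
A quick sketch of the leaf-Variable / backward / hook workflow the docstrings above describe; the values are arbitrary and this is an illustration, not code from the diff:

```
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)   # leaf Variable
y = (x * 3).sum()                                     # 1-element result

grads = []
h = x.register_hook(grads.append)   # hook receives the gradient w.r.t. x

y.backward()        # no explicit gradient needed for a 1-element Variable
print(x.grad)       # accumulated into the leaf; zero it before reusing
h.remove()
```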
@ -204,10 +194,10 @@ class Variable(_C._VariableBase):
reward(Tensor): Tensor with per-element rewards. It has to match
the device location and shape of Variable's data.
"""
if not isinstance(self.grad_fn, StochasticFunction):
if not isinstance(self.creator, StochasticFunction):
raise RuntimeError("reinforce() can be only called on outputs "
"of stochastic functions")
self.grad_fn._reinforce(reward)
self.creator._reinforce(reward)
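
A rough sketch of the `reinforce()` pattern documented above. It assumes the stochastic `multinomial` Variable method (backed by the `Multinomial` StochasticFunction that appears earlier in this diff) and `torch.nn.functional.softmax`; the logits and reward values are placeholders, not part of this diff:

```
import torch
import torch.nn.functional as F
from torch.autograd import Variable

logits = Variable(torch.zeros(1, 2), requires_grad=True)
probs = F.softmax(logits)            # action probabilities
action = probs.multinomial(1)        # stochastic output

reward = torch.Tensor([[1.0]])       # per-element reward, same shape/device as action.data
action.reinforce(reward)

# Stochastic outputs get their "gradient" from the reward, so None is passed.
torch.autograd.backward([action], [None])
print(logits.grad)
```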
def detach(self):
"""Returns a new Variable, detached from the current graph.
@ -222,61 +212,35 @@ class Variable(_C._VariableBase):
errors in correctness checks.
"""
result = NoGrad()(self) # this is needed, because it merges version counters
result._grad_fn = None
result._creator = None
return result
def detach_(self):
"""Detaches the Variable from the graph that created it, making it a
leaf.
"""
self._grad_fn = None
"""Detaches the Variable from the graph that created it, making it a leaf."""
self._creator = None
self.requires_grad = False
def retain_grad(self):
"""Enables .grad attribute for non-leaf Variables."""
if self.grad_fn is None: # no-op for leaves
return
if not self.requires_grad:
raise RuntimeError("can't retain_grad on Variable that has requires_grad=False")
if hasattr(self, 'retains_grad'):
return
weak_self = weakref.ref(self)
def retain_grad_hook(grad):
var = weak_self()
if var is None:
return
if var._grad is None:
var._grad = grad.clone()
else:
var._grad = var._grad + grad
self.register_hook(retain_grad_hook)
self.retains_grad = True
def contiguous(self):
self.data = self.data.contiguous()
return self
def clone(self):
return Clone.apply(self)
return Clone()(self)
def type(self, t):
if t != type(self.data):
return Type.apply(self, t)
return Type(t)(self)
return self
def type_as(self, t):
if isinstance(t, Variable):
t = t.data
return self.type(type(t))
return self.type(type(t.data))
def _get_type(self, name):
module = torch._import_dotted_name(self.data.__module__)
return getattr(module, name)
def cuda(self, device_id=None, async=False):
return CudaTransfer.apply(self, device_id, async)
return CudaTransfer(device_id, async)(self)
def cpu(self):
return self.type(getattr(torch, type(self.data).__name__))
@ -310,10 +274,10 @@ class Variable(_C._VariableBase):
def _add(self, other, inplace):
if isinstance(other, Variable):
return Add.apply(self, other, inplace)
return Add(inplace)(self, other)
else:
assert not torch.is_tensor(other)
return AddConstant.apply(self, other, inplace)
return AddConstant(other, inplace)(self)
def add(self, other):
return self._add(other, False)
@ -323,10 +287,10 @@ class Variable(_C._VariableBase):
def _sub(self, other, inplace):
if isinstance(other, Variable):
return Sub.apply(self, other, inplace)
return Sub(inplace=inplace)(self, other)
else:
assert not torch.is_tensor(other)
return SubConstant.apply(self, other, inplace)
return SubConstant(other, inplace=inplace)(self)
def sub(self, other):
return self._sub(other, False)
@ -336,181 +300,178 @@ class Variable(_C._VariableBase):
def mul(self, other):
if isinstance(other, Variable):
return Mul.apply(self, other)
return Mul()(self, other)
else:
assert not torch.is_tensor(other)
return MulConstant.apply(self, other)
return MulConstant(other)(self)
def mul_(self, other):
if not isinstance(other, Variable) and not torch.is_tensor(other):
return MulConstant.apply(self, other, True)
return MulConstant(other, inplace=True)(self)
raise RuntimeError("mul_ only supports scalar multiplication")
def div(self, other):
if isinstance(other, Variable):
return Div.apply(self, other)
return Div()(self, other)
else:
assert not torch.is_tensor(other)
return DivConstant.apply(self, other)
return DivConstant(other)(self)
def div_(self, other):
if not isinstance(other, Variable) and not torch.is_tensor(other):
return DivConstant.apply(self, other, True)
return DivConstant(other, inplace=True)(self)
raise RuntimeError("div_ only supports scalar multiplication")
def pow(self, other):
if isinstance(other, Variable):
return Pow.apply(self, other)
return Pow()(self, other)
else:
assert not torch.is_tensor(other)
return PowConstant.apply(self, other)
return PowConstant(other)(self)
def exp(self):
return Exp.apply(self)
return Exp()(self)
def exp_(self):
return Exp.apply(self, True)
return Exp(inplace=True)(self)
def log(self):
return Log.apply(self)
return Log()(self)
def log1p(self):
return Log1p.apply(self)
return Log1p()(self)
def neg(self):
return Negate.apply(self)
return Negate()(self)
def neg_(self):
return Negate.apply(self, True)
return Negate(inplace=True)(self)
def tanh(self):
return Tanh.apply(self)
return Tanh()(self)
def tanh_(self):
return Tanh.apply(self, True)
return Tanh(True)(self)
def sigmoid(self):
return Sigmoid.apply(self)
return Sigmoid()(self)
def sigmoid_(self):
return Sigmoid.apply(self, True)
return Sigmoid(True)(self)
def sin(self):
return Sin.apply(self)
return Sin()(self)
def cos(self):
return Cos.apply(self)
return Cos()(self)
def tan(self):
return Tan.apply(self)
return Tan()(self)
def asin(self):
return Asin.apply(self)
return Asin()(self)
def acos(self):
return Acos.apply(self)
return Acos()(self)
def atan(self):
return Atan.apply(self)
def atan2(self, x):
return Atan2.apply(self, x)
return Atan()(self)
def sinh(self):
return Sinh.apply(self)
return Sinh()(self)
def cosh(self):
return Cosh.apply(self)
return Cosh()(self)
def abs(self):
return Abs.apply(self)
return Abs()(self)
def clamp(self, min=None, max=None):
if min is None and max is None:
raise ValueError("clamp requires specifying at least one of "
"min and max arguments")
elif min is None and max is not None:
return CminConstant.apply(self, max)
return CminConstant(max)(self)
elif min is not None and max is None:
return CmaxConstant.apply(self, min)
return CmaxConstant(min)(self)
else:
return Clamp.apply(self, min, max)
return Clamp(min, max)(self)
def reciprocal(self):
return Reciprocal.apply(self)
return Reciprocal()(self)
def floor(self):
return Floor.apply(self)
return Floor()(self)
def ceil(self):
return Ceil.apply(self)
return Ceil()(self)
def frac(self):
return Frac.apply(self)
return Frac()(self)
def sqrt(self):
return Sqrt.apply(self)
return Sqrt()(self)
def round(self):
return Round.apply(self)
return Round()(self)
def sign(self):
return Sign.apply(self)
return Sign()(self)
def trunc(self):
return Trunc.apply(self)
return Trunc()(self)
def fmod(self, value):
return Fmod.apply(self, value)
return Fmod(value)(self)
def remainder(self, value):
return Remainder.apply(self, value)
return Remainder(value)(self)
def lerp(self, tensor, weight):
return Lerp.apply(self, tensor, weight)
return Lerp(weight)(self, tensor)
def rsqrt(self):
return Rsqrt.apply(self)
return Rsqrt()(self)
def sum(self, dim=None, keepdim=None):
return Sum.apply(self, dim, keepdim)
def sum(self, dim=None):
return Sum(dim)(self)
def prod(self, dim=None, keepdim=None):
return Prod.apply(self, dim, keepdim)
def prod(self, dim=None):
return Prod(dim)(self)
def mean(self, dim=None, keepdim=None):
return Mean.apply(self, dim, keepdim)
def mean(self, dim=None):
return Mean(dim)(self)
def max(self, dim=None, keepdim=None):
def max(self, dim=None):
if isinstance(dim, Variable):
return Cmax.apply(self, dim)
return Max.apply(self, dim, keepdim)
return Cmax()(self, dim)
return Max(dim)(self)
def min(self, dim=None, keepdim=None):
def min(self, dim=None):
if isinstance(dim, Variable):
return Cmin.apply(self, dim)
return Min.apply(self, dim, keepdim)
return Cmin()(self, dim)
return Min(dim)(self)
def mode(self, dim=None, keepdim=None):
return Mode.apply(self, dim, keepdim)
def mode(self, dim):
return Mode(dim)(self)
def median(self, dim=None, keepdim=None):
return Median.apply(self, dim, keepdim)
def median(self, dim):
return Median(dim)(self)
def kthvalue(self, k, dim=None, keepdim=None):
return Kthvalue.apply(self, k, dim, keepdim)
def kthvalue(self, dim):
return Kthvalue(dim)(self)
def sort(self, dim=None, descending=False):
return Sort.apply(self, dim, descending, True)
return Sort(dim, descending)(self)
def topk(self, k, dim=None, largest=True, sorted=True):
return Topk.apply(self, k, dim, largest, sorted, True)
return Topk(k, dim, largest, sorted)(self)
def view(self, *sizes):
return View.apply(self, sizes)
return View(*sizes)(self)
def view_as(self, tensor):
return View.apply(self, tensor.size())
return View(*tensor.size())(self)
def split(self, split_size, dim=0):
return torch.split(self, split_size, dim)
@ -520,45 +481,32 @@ class Variable(_C._VariableBase):
repeats = repeats[0]
else:
repeats = torch.Size(repeats)
return Repeat.apply(self, repeats)
return Repeat(repeats)(self)
def cumsum(self, dim):
return Cumsum.apply(self, dim)
return Cumsum(dim)(self)
def cumprod(self, dim):
return Cumprod.apply(self, dim)
def unfold(self, dim, size, step):
return Unfold.apply(self, dim, size, step)
def var(self, dim=None, keepdim=None, unbiased=True):
keepdim_ = False if keepdim is None else keepdim
mean = self.mean(dim, keepdim)
def var(self, dim=None, unbiased=True):
mean = self.mean(dim)
if dim is None:
mean = mean.view(*(1 for s in self.size()))
# we could just set keepdim to True, but this preserves some fidelity
elif keepdim_ is False and self.dim() != 1:
mean = mean.unsqueeze(dim)
mean_expanded = mean.expand_as(self)
zero_centered = self.sub(mean_expanded)
var = zero_centered.mul(zero_centered).sum(dim, keepdim=keepdim_)
var = zero_centered.mul(zero_centered).sum(dim)
numel = self.numel() if dim is None else self.size(dim)
return var.div(numel - int(unbiased))
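A quick numeric check of the variance formula implemented above (divide the sum of squared deviations by N-1 when unbiased, by N otherwise); the sample values are arbitrary:

```python
import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([1.0, 2.0, 3.0, 4.0]))  # mean 2.5, squared deviations sum to 5.0
print(x.var().data[0])                 # 5.0 / (4 - 1) = 1.6667 (unbiased, the default)
print(x.var(unbiased=False).data[0])   # 5.0 / 4       = 1.25
```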
def std(self, dim=None, keepdim=None, unbiased=True):
return self.var(dim, keepdim, unbiased).sqrt()
def std(self, dim=None, unbiased=True):
return self.var(dim, unbiased).sqrt()
def renorm(self, p, dim, maxnorm):
t = self.transpose(dim, 0)
flat = t.contiguous().view(self.size(0), -1)
norms = flat.norm(p, 1, True)
norms = flat.norm(p, 1)
norms = norms.clamp(max=maxnorm).div(norms.add(1e-7))
flat_out = flat.mul(norms.expand_as(flat))
return flat_out.view(t.size()).transpose(dim, 0)
def matmul(self, other):
return torch.matmul(self, other)
@staticmethod
def _static_blas(cls, args, inplace):
num_args = len(args)
@ -569,14 +517,14 @@ class Variable(_C._VariableBase):
alpha, beta = args[1:3]
if num_args == 4:
alpha = args[1]
return cls.apply(*(args[:1] + args[-2:] + (alpha, beta, inplace)))
return cls(alpha, beta, inplace)(*(args[:1] + args[-2:]))
def _blas(self, cls, args, inplace):
return self._static_blas(cls, (self,) + args, inplace)
def mm(self, matrix):
output = Variable(self.data.new(self.data.size(0), matrix.data.size(1)))
return Addmm.apply(output, self, matrix, 0, 1, True)
return self._static_blas(Addmm, (output, 0, 1, self, matrix), False)
def bmm(self, batch):
output = Variable(self.data.new(self.data.size(0), self.data.size(1),
@ -592,10 +540,10 @@ class Variable(_C._VariableBase):
return self._static_blas(Addr, (output, 0, 1, self, vector), False)
def resize(self, *sizes):
return Resize.apply(self, sizes)
return Resize(*sizes)(self)
def resize_as(self, variable):
return Resize.apply(self, variable.size())
return Resize(*variable.size())(self)
def addmm(self, *args):
return self._blas(Addmm, args, False)
@ -628,186 +576,170 @@ class Variable(_C._VariableBase):
return self._blas(Addr, args, True)
def dot(self, other):
return Dot.apply(self, other)
return Dot()(self, other)
def _addcop(self, op, args, inplace):
def _addcop(self, op, args):
if len(args) == 3:
# args == [scale, tensor1, tensor2]
return op.apply(self, args[1], args[2], args[0], inplace)
# scale, tensor1, tensor2
return op(args[0])(self, *args[1:])
else:
# args == [tensor1, tensor2]
return op.apply(self, args[0], args[1], 1.0, inplace)
# tensor1, tensor2
return op()(self, *args)
def addcmul(self, *args):
return self._addcop(Addcmul, args, False)
return self._addcop(Addcmul, args)
def addcdiv(self, *args):
return self._addcop(Addcdiv, args, False)
return self._addcop(Addcdiv, args)
def addcmul_(self, *args):
return self._addcop(Addcmul, args, True)
def addcdiv_(self, *args):
return self._addcop(Addcdiv, args, True)
def norm(self, p=2, dim=None, keepdim=None):
return Norm.apply(self, p, dim, keepdim)
def norm(self, p=2, dim=None):
return Norm(p, dim)(self)
def dist(self, tensor, p=2):
return Norm.apply(self - tensor, p)
return Norm(p)(self - tensor)
def index_add(self, dim, index, tensor):
return IndexAdd.apply(self, dim, index, tensor)
def _advanced_index_add(self, index, tensor):
return AdvancedIndexAdd.apply(self, index, tensor)
return IndexAdd(dim)(self, index, tensor)
def index_add_(self, dim, index, tensor):
return IndexAdd.apply(self, dim, index, tensor, True)
return IndexAdd(dim, True)(self, index, tensor)
def index_copy(self, dim, index, tensor):
return IndexCopy.apply(self, dim, index, tensor)
return IndexCopy(dim)(self, index, tensor)
def index_copy_(self, dim, index, tensor):
return IndexCopy.apply(self, dim, index, tensor, True)
return IndexCopy(dim, True)(self, index, tensor)
def index_fill(self, dim, index, value):
return IndexFill.apply(self, dim, index, value)
return IndexFill(dim, value)(self, index)
def index_fill_(self, dim, index, value):
return IndexFill.apply(self, dim, index, value, True)
return IndexFill(dim, value, True)(self, index)
def index_select(self, dim, index):
return IndexSelect.apply(self, dim, index)
return IndexSelect(dim)(self, index)
def gather(self, dim, index):
return Gather.apply(self, dim, index)
return Gather(dim)(self, index)
def scatter(self, dim, index, source):
return Scatter.apply(self, dim, index, source)
return Scatter(dim)(self, index, source)
def scatter_(self, dim, index, source):
return Scatter.apply(self, dim, index, source, True)
def scatter_add(self, dim, index, source):
return ScatterAdd.apply(self, dim, index, source)
def scatter_add_(self, dim, index, source):
return ScatterAdd.apply(self, dim, index, source, True)
return Scatter(dim, True)(self, index, source)
def masked_copy(self, mask, variable):
warnings.warn("masked_copy is deprecated and renamed to masked_scatter, and will be removed in v0.3")
return MaskedScatter.apply(self, mask, variable)
return MaskedCopy()(self, mask, variable)
def masked_copy_(self, mask, variable):
warnings.warn("masked_copy_ is deprecated and renamed to masked_scatter_, and will be removed in v0.3")
return MaskedScatter.apply(self, mask, variable, True)
def masked_scatter(self, mask, variable):
return MaskedScatter.apply(self, mask, variable)
def masked_scatter_(self, mask, variable):
return MaskedScatter.apply(self, mask, variable, True)
return MaskedCopy(True)(self, mask, variable)
def masked_fill(self, mask, value):
return MaskedFill.apply(self, mask, value)
return MaskedFill(value)(self, mask)
def masked_fill_(self, mask, value):
return MaskedFill.apply(self, mask, value, True)
return MaskedFill(value, True)(self, mask)
def masked_select(self, mask):
return MaskedSelect.apply(self, mask)
return MaskedSelect()(self, mask)
def expand(self, *sizes):
return Expand.apply(self, sizes)
if isinstance(sizes[0], torch.Size):
if len(sizes) > 1:
raise ValueError("expand expects a several ints or a single "
"torch.Size argument")
sizes = sizes[0]
return Expand(sizes)(self)
def expand_as(self, tensor):
return Expand.apply(self, (tensor.size(),))
return Expand(tensor.size())(self)
def t(self):
if self.dim() != 2:
raise RuntimeError("t() expects a 2D Variable, but self is {}D".format(self.dim()))
return Transpose.apply(self, 0, 1)
return Transpose(0, 1)(self)
def transpose(self, dim1, dim2):
return Transpose.apply(self, dim1, dim2)
return Transpose(dim1, dim2)(self)
def select(self, dim, _index):
dim = dim if dim >= 0 else dim + self.dim()
index = tuple(slice(None, None) for _ in range(dim)) + (_index,)
return Index.apply(self, index)
return Index(index)(self)
def narrow(self, dim, start_index, length):
dim = dim if dim >= 0 else dim + self.dim()
index = tuple(slice(None, None) for _ in range(dim)) + \
(slice(start_index, start_index + length),)
return Index.apply(self, index)
return Index(index)(self)
def chunk(self, num_chunks, dim=0):
return Chunk.apply(self, num_chunks, dim)
return Chunk(num_chunks, dim)(self)
def squeeze(self, dim=None):
return Squeeze.apply(self, dim)
def squeeze_(self, dim=None):
return Squeeze.apply(self, dim, True)
return Squeeze(dim)(self)
def unsqueeze(self, dim):
return Unsqueeze.apply(self, dim)
return Unsqueeze(dim)(self)
def permute(self, *permutation):
return Permute.apply(self, permutation)
return Permute(permutation)(self)
def diag(self, diagonal=0):
return Diag.apply(self, diagonal)
def diag(self, diagonal_idx=0):
return Diag(diagonal_idx)(self)
def tril(self, diagonal=0):
return Tril.apply(self, diagonal)
def tril(self, diagonal_idx=0):
return Tril(diagonal_idx)(self)
def triu(self, diagonal=0):
return Triu.apply(self, diagonal)
def triu(self, diagonal_idx=0):
return Triu(diagonal_idx)(self)
def trace(self):
return Trace.apply(self)
return Trace()(self)
def cross(self, other, dim=-1):
return Cross.apply(self, other)
return Cross(dim)(self, other)
def inverse(self):
return Inverse.apply(self)
def gesv(self, a):
return Gesv.apply(self, a)
def multinomial(self, num_samples=1, replacement=False):
return Multinomial(num_samples, replacement)(self)
def multinomial(self, num_samples=1, with_replacement=False):
return Multinomial(num_samples, with_replacement)(self)
def bernoulli(self):
return Bernoulli()(self)
def eq(self, other):
if isinstance(other, Variable):
return Eq()(self, other)
assert not torch.is_tensor(other), "can't compare Variable and tensor"
return Eq.apply(self, other)
return Eq(other)(self)
def ne(self, other):
if isinstance(other, Variable):
return Ne()(self, other)
assert not torch.is_tensor(other), "can't compare Variable and tensor"
return Ne.apply(self, other)
return Ne(other)(self)
def gt(self, other):
if isinstance(other, Variable):
return Gt()(self, other)
assert not torch.is_tensor(other), "can't compare Variable and tensor"
return Gt.apply(self, other)
return Gt(other)(self)
def ge(self, other):
if isinstance(other, Variable):
return Ge()(self, other)
assert not torch.is_tensor(other), "can't compare Variable and tensor"
return Ge.apply(self, other)
return Ge(other)(self)
def lt(self, other):
if isinstance(other, Variable):
return Lt()(self, other)
assert not torch.is_tensor(other), "can't compare Variable and tensor"
return Lt.apply(self, other)
return Lt(other)(self)
def le(self, other):
if isinstance(other, Variable):
return Le()(self, other)
assert not torch.is_tensor(other), "can't compare Variable and tensor"
return Le.apply(self, other)
return Le(other)(self)
def __add__(self, other):
return self.add(other)
@ -823,7 +755,7 @@ class Variable(_C._VariableBase):
return self.sub_(other)
def __rsub__(self, other):
return SubConstant.apply(other, self)
return SubConstant(other, sub_tensor=True)(self)
def __mul__(self, other):
return self.mul(other)
@ -833,16 +765,28 @@ class Variable(_C._VariableBase):
return self.mul_(other)
def __matmul__(self, other):
if not isinstance(other, Variable):
dim_self = self.dim()
try:
dim_other = other.dim()
except AttributeError: # not a Variable
return NotImplemented
return self.matmul(other)
if dim_self == 1 and dim_other == 1:
return self.dot(other)
if dim_self == 2 and dim_other == 1:
return self.mv(other)
if dim_self == 1 and dim_other == 2:
return self.unsqueeze(0).mm(other).squeeze(0)
elif dim_self == 2 and dim_other == 2:
return self.mm(other)
raise ValueError("both arguments to __matmul__ need to be 1D or 2D, "
"but they are {}D and {}D".format(dim_self, dim_other))
def __div__(self, other):
return self.div(other)
__truediv__ = __div__
def __rdiv__(self, other):
return DivConstant.apply(other, self)
return DivConstant(other, div_by_tensor=True)(self)
__rtruediv__ = __rdiv__
def __idiv__(self, other):
@ -855,10 +799,10 @@ class Variable(_C._VariableBase):
raise NotImplementedError("in-place pow not implemented")
def __rpow__(self, other):
return PowConstant.apply(other, self)
return PowConstant(other, tensor_power=True)(self)
def __neg__(self):
return Negate.apply(self)
return Negate()(self)
def __len__(self):
return len(self.data)
@ -894,7 +838,7 @@ class Variable(_C._VariableBase):
@staticmethod
def cat(iterable, dim=0):
return Concat.apply(dim, *iterable)
return Concat(dim)(*iterable)
@staticmethod
def normal(means, std=1):
@ -917,7 +861,7 @@ class Variable(_C._VariableBase):
tensors = args[1:]
else:
tensors = args
return cls.apply(*(tensors + (alpha, beta, inplace)))
return cls(alpha, beta, inplace)(*tensors)
@classmethod
def addmm(cls, *args):
@ -951,6 +895,5 @@ for method in dir(Variable):
setattr(Variable._torch, method, as_static)
from ._functions import *
from torch._C import _ImperativeEngine as ImperativeEngine
from .engine import ImperativeEngine
Variable._execution_engine = ImperativeEngine()

View File

@ -17,12 +17,6 @@ def _libcudnn():
if hasattr(lib, 'cudnnGetErrorString'):
lib.cudnnGetErrorString.restype = ctypes.c_char_p
__cudnn_version = lib.cudnnGetVersion()
compile_version = torch._C._cudnn_version()
# Check that cuDNN major and minor versions match
if (__cudnn_version // 100) != (compile_version // 100):
raise RuntimeError(
'cuDNN version mismatch: PyTorch was compiled against {} '
'but linked against {}'.format(compile_version, __cudnn_version))
else:
lib = None
return lib

View File

@ -163,9 +163,9 @@ def get_parameters(fn, handle, weight_buf):
# might as well merge the CUDNN ones into a single tensor as well
if linear_id == 0 or linear_id == num_linear_layers / 2:
assert filter_dim_a.prod() == filter_dim_a[0]
size = (filter_dim_a[0] * num_linear_layers // 2, filter_dim_a[2])
param = fn.weight_buf.new().set_(
weight_buf.storage(), offset, size)
weight_buf.storage(), offset,
filter_dim_a[0] * num_linear_layers // 2, filter_dim_a[2])
layer_params.append(param)
else:
assert cur_offset == offset
@ -178,13 +178,10 @@ def get_parameters(fn, handle, weight_buf):
def _copyParams(params_from, params_to):
assert len(params_from) == len(params_to)
for layer_params_from, layer_params_to in zip(params_from, params_to):
# NOTE: these lists have all weights before all biases, so if the layer doesn't
# use biases, zip will terminate once layer_params_from ends and ignore them.
for param_from, param_to in zip(layer_params_from, layer_params_to):
assert param_from.type() == param_to.type()
param_to.copy_(param_from, broadcast=False)
param_to.copy_(param_from)
def forward(fn, input, hx, weight, output, hy):
@ -245,21 +242,17 @@ def forward(fn, input, hx, weight, output, hy):
fn.cy_desc = cudnn.descriptor(cx) if cx is not None else None
# create the weight buffer and copy the weights into it
if fn.weight_buf is None:
num_weights = get_num_weights(
handle, fn.rnn_desc, fn.x_descs[0], fn.datatype)
fn.weight_buf = x.new(num_weights)
fn.w_desc = init_weight_descriptor(fn, fn.weight_buf)
w = fn.weight_buf
# this zero might not seem necessary, but it is in the case
# where biases are disabled; then they won't be copied and must be zero'd.
# Alternatively, _copyParams could be written more carefully.
w.zero_()
params = get_parameters(fn, handle, w)
_copyParams(weight, params)
else:
fn.w_desc = init_weight_descriptor(fn, fn.weight_buf)
w = fn.weight_buf
num_weights = get_num_weights(
handle, fn.rnn_desc, fn.x_descs[0], fn.datatype)
fn.weight_buf = x.new(num_weights)
fn.w_desc = init_weight_descriptor(fn, fn.weight_buf)
w = fn.weight_buf
# this zero might not seem necessary, but it is in the case
# where biases are disabled; then they won't be copied and must be zero'd.
# Alternatively, _copyParams could be written more carefully.
w.zero_()
params = get_parameters(fn, handle, w)
_copyParams(weight, params)
if tuple(hx.size()) != hidden_size:
raise RuntimeError('Expected hidden size {}, got {}'.format(
@ -276,9 +269,7 @@ def forward(fn, input, hx, weight, output, hy):
fn.x_descs,
ctypes.byref(workspace_size)
))
fn.workspace_size = workspace_size.value
with torch.cuda.device_of(input):
workspace = torch.cuda.ByteTensor(fn.workspace_size)
fn.workspace = torch.cuda.ByteTensor(workspace_size.value)
if fn.requires_grad:
reserve_size = ctypes.c_long()
check_error(lib.cudnnGetRNNTrainingReserveSize(
@ -301,7 +292,7 @@ def forward(fn, input, hx, weight, output, hy):
fn.y_descs, ctypes.c_void_p(y.data_ptr()),
fn.hy_desc, ctypes.c_void_p(hy.data_ptr()),
fn.cy_desc, ctypes.c_void_p(cy.data_ptr()) if cx is not None else None,
ctypes.c_void_p(workspace.data_ptr()), workspace.size(0),
ctypes.c_void_p(fn.workspace.data_ptr()), fn.workspace.size(0),
ctypes.c_void_p(fn.reserve.data_ptr()), fn.reserve.size(0)
))
else: # inference
@ -316,7 +307,7 @@ def forward(fn, input, hx, weight, output, hy):
fn.y_descs, ctypes.c_void_p(y.data_ptr()),
fn.hy_desc, ctypes.c_void_p(hy.data_ptr()),
fn.cy_desc, ctypes.c_void_p(cy.data_ptr()) if cx is not None else None,
ctypes.c_void_p(workspace.data_ptr()), workspace.size(0)
ctypes.c_void_p(fn.workspace.data_ptr()), fn.workspace.size(0)
))
if fn.batch_first and not is_input_packed:
@ -381,8 +372,6 @@ def backward_grad(fn, input, hx, weight, output, grad_output, grad_hy, grad_inpu
if not dhy.is_cuda or not dy.is_cuda or (dcy is not None and not dcy.is_cuda):
raise RuntimeError('Gradients aren\'t CUDA tensors')
with torch.cuda.device_of(input):
workspace = torch.cuda.ByteTensor(fn.workspace_size)
check_error(cudnn.lib.cudnnRNNBackwardData(
handle,
fn.rnn_desc,
@ -397,7 +386,7 @@ def backward_grad(fn, input, hx, weight, output, grad_output, grad_hy, grad_inpu
fn.x_descs, ctypes.c_void_p(dx.data_ptr()),
fn.hx_desc, ctypes.c_void_p(dhx.data_ptr()),
fn.cx_desc, ctypes.c_void_p(dcx.data_ptr()) if cx is not None else None,
ctypes.c_void_p(workspace.data_ptr()), workspace.size(0),
ctypes.c_void_p(fn.workspace.data_ptr()), fn.workspace.size(0),
ctypes.c_void_p(fn.reserve.data_ptr()), fn.reserve.size(0)
))
@ -450,8 +439,6 @@ def backward_weight(fn, input, hx, output, weight, grad_weight):
y = output
dw = fn.weight_buf.new().resize_as_(fn.weight_buf).zero_()
with torch.cuda.device_of(input):
workspace = torch.cuda.ByteTensor(fn.workspace_size)
check_error(cudnn.lib.cudnnRNNBackwardWeights(
handle,
fn.rnn_desc,
@ -459,7 +446,7 @@ def backward_weight(fn, input, hx, output, weight, grad_weight):
fn.x_descs, ctypes.c_void_p(x.data_ptr()),
fn.hx_desc, ctypes.c_void_p(hx.data_ptr()),
fn.y_descs, ctypes.c_void_p(y.data_ptr()),
ctypes.c_void_p(workspace.data_ptr()), workspace.size(0),
ctypes.c_void_p(fn.workspace.data_ptr()), fn.workspace.size(0),
fn.w_desc, ctypes.c_void_p(dw.data_ptr()),
ctypes.c_void_p(fn.reserve.data_ptr()), fn.reserve.size(0)
))

View File

@ -58,53 +58,18 @@ static std::unordered_map<std::string, Type> type_names = {
{"Int", Type::INT},
{"Long", Type::LONG},
};
static std::unordered_map<std::string, at::ScalarType> attype_names = {
{"Float", at::kFloat},
{"Double", at::kDouble},
{"Half", at::kHalf},
{"Byte", at::kByte},
{"Char", at::kChar},
{"Short", at::kShort},
{"Int", at::kInt},
{"Long", at::kLong},
};
static std::unordered_map<PyTypeObject*, TensorType> pytype_to_tensortype;
static std::unordered_map<TensorType, PyTypeObject*, TensorTypeHasher> tensortype_to_pytype;
static std::unordered_map<PyTypeObject*, at::Type*> pytype_to_attype;
static std::unordered_map<at::Type*, PyTypeObject*> attype_to_pytype;
void registerPyTypeObject(PyTypeObject *pytype, const std::string& name, bool is_cuda, bool is_sparse)
{
TensorType type;
at::Backend device;
if(is_cuda) {
if(is_sparse){
device = at::kSparseCUDA;
} else {
device = at::kCUDA;
}
} else {
if(is_sparse){
device = at::kSparseCPU;
} else {
device = at::kCPU;
}
}
type.data_type = type_names.at(name);
type.is_cuda = is_cuda;
type.is_sparse = is_sparse;
pytype_to_tensortype[pytype] = type;
tensortype_to_pytype[type] = pytype;
if(!(is_sparse && name == "Half")) {
at::Type * attype = &at::getType(device,attype_names.at(name));
pytype_to_attype[pytype] = attype;
attype_to_pytype[attype] = pytype;
}
}
PyTypeObject* getPyTypeObject(const thpp::Tensor& tensor)
@ -116,12 +81,6 @@ PyTypeObject* getPyTypeObject(const thpp::Tensor& tensor)
return tensortype_to_pytype.at(type);
}
PyTypeObject* getPyTypeObject(const at::Tensor& tensor)
{
if(attype_to_pytype.count(&tensor.type()) == 0)
throw std::invalid_argument("unsupported Tensor type.");
return attype_to_pytype.at(&tensor.type());
}
static std::unique_ptr<Tensor> createTensor(void *tensor, Type type, bool is_cuda, bool is_sparse)
{
@ -208,22 +167,6 @@ std::unique_ptr<Tensor> createTensor(PyObject *data)
wrapper->retain();
return wrapper;
}
//rename to createTensor when THPP is removed
at::Tensor createTensorAT(PyObject *data)
{
auto tensor_type = pytype_to_attype.at(Py_TYPE(data));
auto tensor = ((THPVoidTensor *)data)->cdata;
return tensor_type->unsafeTensorFromTH(tensor, true);
}
PyObject* createPyObject(at::Tensor tensor)
{
auto type = getPyTypeObject(tensor);
PyObject *obj = type->tp_alloc(type, 0);
if (obj) {
((THPVoidTensor*)obj)->cdata = (THVoidTensor *)tensor.detach()->unsafeGetTH(true);
}
return obj;
}
PyObject* createPyObject(const thpp::Tensor& tensor)
{

View File

@ -2,10 +2,9 @@
// Provides conversions between Python tensor objects and thpp::Tensors.
#include <Python.h>
#include <memory>
#include <Python.h>
#include <THPP/THPP.h>
#include <ATen/ATen.h>
namespace torch {
@ -23,9 +22,4 @@ std::unique_ptr<thpp::Tensor> createTensor(PyObject *data);
// Creates Python tensor object from a Tensor
PyObject* createPyObject(const thpp::Tensor& tensor);
PyObject* createPyObject(at::Tensor tensor);
PyTypeObject* getPyTypeObject(const at::Tensor& tensor);
//rename to createPyObject when THPP is removed
at::Tensor createTensorAT(PyObject *data);
} // namespace torch

View File

@ -48,7 +48,6 @@ struct python_error : public std::exception {
/** Sets the current Python error from this exception */
inline void restore() {
if (!type) return;
// PyErr_Restore steals references
AutoGIL gil;
Py_XINCREF(type);
@ -64,6 +63,22 @@ struct python_error : public std::exception {
#ifdef _THP_CORE
struct THException: public std::exception {
THException(const char* msg): msg(msg) {};
virtual const char* what() const throw() {
return msg.c_str();
}
std::string msg;
};
struct THArgException: public THException {
THArgException(const char* msg, int argNumber): THException(msg), argNumber(argNumber) {};
const int argNumber;
};
bool THPException_init(PyObject *module);
#endif

View File

@ -33,7 +33,7 @@ static PyObject * THPGenerator_pynew(PyTypeObject *type, PyObject *args, PyObjec
THPUtils_setError("torch.Generator constructor doesn't accept any arguments");
return NULL;
}
THPGeneratorPtr self((THPGenerator *)type->tp_alloc(type, 0));
THPGeneratorPtr self = (THPGenerator *)type->tp_alloc(type, 0);
self->cdata = THGenerator_new();
return (PyObject*)self.release();
@ -44,7 +44,7 @@ static PyObject * THPGenerator_getState(THPGenerator *self)
{
HANDLE_TH_ERRORS
THGenerator *generator = self->cdata;
THPByteTensorPtr res((THPByteTensor *)THPByteTensor_NewEmpty());
THPByteTensorPtr res = (THPByteTensor *)THPByteTensor_NewEmpty();
if (!res) return NULL;
THByteTensor_getRNGState(generator, res->cdata);
return (PyObject *)res.release();

View File

@ -6,7 +6,6 @@
#include <unordered_map>
#include <libshm.h>
#include <TH/TH.h>
#include <ATen/ATen.h>
#include "torch/csrc/utils/python_strings.h"
@ -64,7 +63,7 @@ static PyObject * THPModule_initNames(PyObject *self, PyObject *arg)
{
static std::vector<std::string> names;
THPObjectPtr types(PySequence_Fast(arg, "expected a sequence"));
THPObjectPtr types = PySequence_Fast(arg, "expected a sequence");
if (!types) return NULL;
int num_classes = PySequence_Fast_GET_SIZE(types.get());
@ -74,7 +73,7 @@ static PyObject * THPModule_initNames(PyObject *self, PyObject *arg)
THPUtils_assert(PyType_Check(obj), "expected a PyTypeObject");
PyTypeObject* type = (PyTypeObject*)obj;
THPObjectPtr module_name(PyObject_GetAttrString(obj, "__module__"));
THPObjectPtr module_name = PyObject_GetAttrString(obj, "__module__");
if (!module_name) return NULL;
THPUtils_assert(THPUtils_checkString(module_name.get()),
"expected __module__ to be a string");
@ -214,7 +213,6 @@ dispatch: \
IMPLEMENT_STATELESS(sigmoid)
IMPLEMENT_STATELESS(log)
IMPLEMENT_STATELESS(log1p)
IMPLEMENT_STATELESS(lgamma)
IMPLEMENT_STATELESS(exp)
IMPLEMENT_STATELESS(cos)
IMPLEMENT_STATELESS(acos)
@ -467,64 +465,6 @@ PyObject *THPModule_addDocStr(PyObject *_unused, PyObject *args)
Py_RETURN_NONE;
}
PyObject *THPModule_inferSize(PyObject *_unused, PyObject *args)
{
HANDLE_TH_ERRORS
Py_ssize_t num_args = args ? PyTuple_Size(args) : 0;
THPUtils_assert(num_args == 2, "expected exactly 2 arguments");
PyObject *arg1 = PyTuple_GET_ITEM(args, 0);
THPUtils_assert(THPSize_Check(arg1), "expected a torch.Size as argument 1");
PyObject *arg2 = PyTuple_GET_ITEM(args, 1);
THPUtils_assert(THPSize_Check(arg2), "expected a torch.Size as argument 2");
THLongStoragePtr size1_guard = THPUtils_unpackSize(arg1);
THLongStorage *size1 = size1_guard.get();
THLongStoragePtr size2_guard = THPUtils_unpackSize(arg2);
THLongStorage *size2 = size2_guard.get();
THLongStoragePtr sizes_guard(THLongStorage_new());
THLongStorage *sizes = sizes_guard.get();
char error_buffer[1024];
int ret = THLongStorage_inferSize2(sizes, size1->data, size1->size, size2->data, size2->size, error_buffer, 1024);
THPUtils_assert(ret == 0, error_buffer);
return THPSize_New(sizes->size, sizes->data);
END_HANDLE_TH_ERRORS
}
static PyObject *THPModule_setBackcompatBroadcastWarn(PyObject *module, PyObject *arg) {
THPUtils_assert(PyBool_Check(arg), "set_backcompat_broadcast_warn expects a bool, "
"but got %s", THPUtils_typename(arg));
setBackCompatBroadcastWarn(arg == Py_True);
Py_RETURN_NONE;
}
static PyObject *THPModule_getBackcompatBroadcastWarn(PyObject *module)
{
return getBackCompatBroadcastWarn() ? Py_True : Py_False;
}
static PyObject *THPModule_setBackcompatKeepdimWarn(PyObject *module, PyObject *arg) {
THPUtils_assert(PyBool_Check(arg), "set_backcompat_keepdim_warn expects a bool, "
"but got %s", THPUtils_typename(arg));
setBackCompatKeepdimWarn(arg == Py_True);
Py_RETURN_NONE;
}
static PyObject *THPModule_getBackcompatKeepdimWarn(PyObject *module)
{
return getBackCompatKeepdimWarn() ? Py_True : Py_False;
}
PyObject *THPModule_hasDistributed(PyObject *_unused)
{
#ifdef WITH_DISTRIBUTED
Py_RETURN_TRUE;
#else
Py_RETURN_FALSE;
#endif
}
#ifdef WITH_CUDA
extern PyObject * THCPModule_initExtension(PyObject *self);
extern PyObject * THCPModule_setDevice_wrap(PyObject *self, PyObject *arg);
@ -557,7 +497,6 @@ static PyMethodDef TorchMethods[] = {
{"_add_docstr", (PyCFunction)THPModule_addDocStr, METH_VARARGS, NULL},
{"_sparse_init", (PyCFunction)THSPModule_initExtension, METH_NOARGS, NULL},
{"_init_names", (PyCFunction)THPModule_initNames, METH_O, NULL},
{"_has_distributed",(PyCFunction)THPModule_hasDistributed, METH_NOARGS, NULL},
#ifdef WITH_CUDA
{"_cuda_init", (PyCFunction)THCPModule_initExtension, METH_NOARGS, NULL},
{"_cuda_setDevice", (PyCFunction)THCPModule_setDevice_wrap, METH_O, NULL},
@ -584,11 +523,6 @@ static PyMethodDef TorchMethods[] = {
#endif
{"_safe_call", (PyCFunction)THPModule_safeCall, METH_VARARGS | METH_KEYWORDS, NULL},
{"_set_default_tensor_type", (PyCFunction)THPModule_setDefaultTensorType, METH_O, NULL},
{"_infer_size", (PyCFunction)THPModule_inferSize, METH_VARARGS, NULL},
{"_set_backcompat_broadcast_warn", (PyCFunction)THPModule_setBackcompatBroadcastWarn, METH_O, NULL},
{"_get_backcompat_broadcast_warn", (PyCFunction)THPModule_getBackcompatBroadcastWarn, METH_NOARGS, NULL},
{"_set_backcompat_keepdim_warn", (PyCFunction)THPModule_setBackcompatKeepdimWarn, METH_O, NULL},
{"_get_backcompat_keepdim_warn", (PyCFunction)THPModule_getBackcompatKeepdimWarn, METH_NOARGS, NULL},
{"get_num_threads", (PyCFunction)THPModule_getNumThreads, METH_NOARGS, NULL},
{"set_num_threads", (PyCFunction)THPModule_setNumThreads, METH_O, NULL},
{"from_numpy", (PyCFunction)THPModule_fromNumpy, METH_O, NULL},
@ -596,7 +530,6 @@ static PyMethodDef TorchMethods[] = {
{"sigmoid", (PyCFunction)THPModule_sigmoid, METH_VARARGS | METH_KEYWORDS, NULL},
{"log", (PyCFunction)THPModule_log, METH_VARARGS | METH_KEYWORDS, NULL},
{"log1p", (PyCFunction)THPModule_log1p, METH_VARARGS | METH_KEYWORDS, NULL},
{"lgamma", (PyCFunction)THPModule_lgamma, METH_VARARGS | METH_KEYWORDS, NULL},
{"exp", (PyCFunction)THPModule_exp, METH_VARARGS | METH_KEYWORDS, NULL},
{"cos", (PyCFunction)THPModule_cos, METH_VARARGS | METH_KEYWORDS, NULL},
{"acos", (PyCFunction)THPModule_acos, METH_VARARGS | METH_KEYWORDS, NULL},
@ -719,6 +652,22 @@ static PyMethodDef TorchMethods[] = {
{NULL, NULL, 0, NULL}
};
static void errorHandler(const char *msg, void *data)
{
throw THException(msg);
}
static void errorHandlerArg(int argNumber, const char *msg, void *data)
{
throw THArgException(msg, argNumber);
}
static void updateErrorHandlers()
{
THSetDefaultErrorHandler(errorHandler, NULL);
THSetDefaultArgErrorHandler(errorHandlerArg, NULL);
}
bool THCPDoubleStorage_init(PyObject *module);
bool THCPFloatStorage_init(PyObject *module);
bool THCPHalfStorage_init(PyObject *module);
@ -778,7 +727,6 @@ PyMODINIT_FUNC init_C()
PyMODINIT_FUNC PyInit__C()
#endif
{
THInferNumThreads();
#if PY_MAJOR_VERSION == 2
#define ASSERT_TRUE(cmd) if (!(cmd)) {PyErr_SetString(PyExc_ImportError, "initialization error"); return;}
@ -883,7 +831,8 @@ PyMODINIT_FUNC PyInit__C()
Py_INCREF(has_cudnn);
ASSERT_TRUE(PyModule_AddObject(module, "has_cudnn", has_cudnn) == 0);
#ifdef WITH_DISTRIBUTED_MW
// TODO THD: enable once master-worker mode is implemented
#if 0 && defined(WITH_DISTRIBUTED)
// See comment on CUDA objects
ASSERT_TRUE(THDPDoubleStorage_init(module));
ASSERT_TRUE(THDPFloatStorage_init(module));
@ -908,9 +857,7 @@ PyMODINIT_FUNC PyInit__C()
ASSERT_TRUE(THPDefaultGenerator != nullptr);
ASSERT_TRUE(PyModule_AddObject(module, "default_generator", (PyObject*)THPDefaultGenerator) == 0);
// force ATen to initialize because it handles
// setting up TH Errors so that they throw C++ exceptions
at::init();
updateErrorHandlers();
#ifdef WITH_NUMPY
import_array();

View File

@ -1,100 +0,0 @@
# csrc
The csrc directory contains all of the code concerned with integration
with Python. This is in contrast to lib, which contains the Torch
libraries that are Python agnostic. csrc depends on lib, but not vice
versa.
There are a number of utilities for easing integration with Python which
are worth knowing about, and which we briefly describe here. But first, the
most important gotchas:
* DO NOT forget to take out the GIL with `AutoGIL` before calling the Python
API or bringing a `THPObjectPtr` into scope.
* Make sure you include `Python.h` first in your header files, before
any system headers; otherwise, you will get an `error: "_XOPEN_SOURCE" redefined`
error. If you pay attention to warnings, you will see where you need to
do this.
## Notes
### Note [Storage is not NULL]
Historically, Torch supported NULL storage, as a minor optimization to
avoid having to allocate a storage object when it would be empty.
However, this is actually a confusing special case to deal with, so
by and large, PyTorch assumes that, in fact, storage is never NULL.
One important case where this assumption matters is when tracking
the CUDA device a tensor is stored on: this information lives
solely in the storage, so if a storage is NULL, we lose it.
Although storage is never NULL, the data field of THStorage may be NULL. This
mostly occurs when we want to pre-allocate an output tensor struct, but then
have it be resized and filled with data by some operator: there's no point in
allocating data for it in this case!
## Files
### `Exceptions.h`
Frequently when working with the Python API, you may call a function
which returns an error. In this case, we want to return directly to the
Python interpreter, so that this exception can be propagated
accordingly; however, because the Python API is C-based, what actually
happens is that control returns to whatever C++ code called it.
Similarly, if we raise a C++ exception, prior to returning to the Python
interpreter, we must set the Python error flags so that it surfaces as a
Python exception.
Exceptions defines some useful helpers: `HANDLE_TH_ERRORS`, `END_HANDLE_TH_ERRORS`
and an exception class `python_error`. You call them like this:
```
// Entry point from Python interpreter
PyObject* run() {
HANDLE_TH_ERRORS
...
if (!x) throw python_error();
...
END_HANDLE_TH_ERRORS
}
```
The `HANDLE_TH_ERRORS` macro will catch all exceptions and convert them
into an appropriate Python signal. `python_error` is a special
exception which doesn't contain any info; instead it says, "An error
occurred in the Python API; if you return to the interpreter, Python
will raise that exception; nothing else needs to be done."
### `utils/auto_gil.h`
Whenever you make any calls to the Python API, you must have taken out
the Python GIL, as none of these calls are thread safe. `AutoGIL` is
a RAII struct which handles taking and releasing the GIL. Use it like
this:
```
void iWantToUsePython() {
AutoGIL gil;
...
}
```
In general, the compiler will NOT warn you if you use Python
functionality without taking out the GIL, so DO NOT FORGET this call.
### `utils/object_ptr.h`
`THPPointer` is a smart pointer class analogous to `std::shared_ptr`,
but which is overloaded to handle reference counting scheme of various
objects which are not based on `shared_ptr`. The most important overloads are:
* `PyObject` (so important we've aliased it as `THPObjectPtr`), which
hooks into Python reference counting. (By the way, that means you
MUST take out the GIL before bringing one of these into scope!)
* The various TH tensor and storage types (e.g., `THTensor`), which
hook into TH's reference counting. (TH's reference counting
IS thread safe, no locks necessary.)

View File

@ -25,7 +25,7 @@ PyObject * THPSize_New(int dim, long *sizes)
static PyObject * THPSize_pynew(PyTypeObject *type, PyObject *args, PyObject *kwargs)
{
THPObjectPtr self(PyTuple_Type.tp_new(type, args, kwargs));
THPObjectPtr self = PyTuple_Type.tp_new(type, args, kwargs);
if (self) {
for (Py_ssize_t i = 0; i < PyTuple_Size(self); ++i) {
PyObject *item = PyTuple_GET_ITEM(self.get(), i);
@ -56,12 +56,13 @@ extern PyTypeObject THPSizeType;
template<typename FnType, FnType fn, typename ...Args>
static PyObject* wrap_tuple_fn(Args ... args)
{
THPObjectPtr result((*fn)(std::forward<Args>(args)...));
PyObject *result = (*fn)(std::forward<Args>(args)...);
if (!result) return NULL;
if (PyTuple_Check(result.get())) {
return PyObject_CallFunctionObjArgs((PyObject*)&THPSizeType, result.get(), NULL);
if (PyTuple_Check(result)) {
return PyObject_CallFunctionObjArgs((PyObject*)&THPSizeType, result, NULL);
}
return result.release();
Py_INCREF(result);
return result;
}
static auto sq_concat = PyTuple_Type.tp_as_sequence->sq_concat;

View File

@ -1,14 +1,12 @@
#ifndef THP_H
#define THP_H
#include <Python.h>
#include <stdbool.h>
#include <TH/TH.h>
#include <THS/THS.h>
// Back-compatibility macros, Thanks to http://cx-oracle.sourceforge.net/
// define PyInt_* macros for Python 3.x. NB: We must include Python.h first,
// otherwise we'll incorrectly conclude PyInt_Check isn't defined!
// define PyInt_* macros for Python 3.x
#ifndef PyInt_Check
#define PyInt_Check PyLong_Check
#define PyInt_FromLong PyLong_FromLong
@ -20,7 +18,6 @@
#define LIBRARY_STATE
#define LIBRARY_STATE_NOARGS
#define LIBRARY_STATE_TYPE
#define LIBRARY_STATE_TYPE_NOARGS
#define THP_API extern "C"

View File

@ -9,8 +9,12 @@
#include <tuple>
#include <TH/THMath.h>
#include "torch/csrc/THP.h"
#include "torch/csrc/copy_utils.h"
#include "torch/csrc/DynamicTypes.h"
#include "THP.h"
#include "copy_utils.h"
#include "DynamicTypes.h"
//generic_include TH torch/csrc/generic/Tensor.cpp
#include "generic/Tensor.cpp"
#include <TH/THGenerateAllTypes.h>
#include "generic/Tensor.cpp"
#include <TH/THGenerateHalfType.h>

View File

@ -29,33 +29,11 @@ void StorageWeakRefAllocator::free(void* ptr) {
#ifdef WITH_NUMPY
/**
* Note [Numpy memory management]
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* For efficiency reasons, when a user converts to/from numpy arrays,
* we want to share the underlying storage. This means that if we
* turn a Numpy array into a Torch tensor, the Torch tensor must
* keep the Numpy array alive, and vice versa for conversions in
* the other direction.
*
* A Torch tensor keeps its backing Numpy array alive using the custom allocator
* THNumpyArrayAllocator (backed by NumpyArrayAllocator), which holds a
* THPObjectPtr to the Numpy PyArrayObject, and nulls it out upon free.
* The relevant code is in torch/csrc/generic/Tensor.cpp.
*
* A Numpy array keeps its backing Torch tensor alive using the base object
* <https://docs.scipy.org/doc/numpy-dev/reference/c-api.array.html#c.PyArray_SetBaseObject>
* field of Numpy, which is Numpy's hook for allowing an external user to
* manage memory. The relevant code is in
* torch/csrc/generic/methods/TensorSerialization.cwrap
*/
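A Python-level sketch of the sharing behaviour the note describes (variable names are illustrative):

```python
import numpy as np
import torch

arr = np.ones(3)
t = torch.from_numpy(arr)   # the tensor shares arr's buffer and keeps arr alive
arr[0] = 5.0
print(t[0])                 # 5.0 -- both views see the same memory
back = t.numpy()            # converting back also shares storage with the tensor
```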
// See Note [Numpy memory management]
void* NumpyArrayAllocator::realloc(void* ptr, ptrdiff_t size) {
PyArrayObject *array_ptr = (PyArrayObject*)object.get();
if (array_ptr && ptr == PyArray_DATA(array_ptr)) {
void* newPtr = this->malloc(size);
memcpy(newPtr, ptr, std::min((size_t) size, (size_t) PyArray_NBYTES(array_ptr)));
memcpy(newPtr, ptr, std::min(size, PyArray_NBYTES(array_ptr)));
// Whee! We're done!
object = nullptr;
return newPtr;
@ -63,7 +41,7 @@ void* NumpyArrayAllocator::realloc(void* ptr, ptrdiff_t size) {
return allocator->realloc(allocatorContext, ptr, size);
}
// See Note [Numpy memory management]
void NumpyArrayAllocator::free(void* ptr) {
PyArrayObject *array_ptr = (PyArrayObject*)object.get();
if (!array_ptr || ptr != PyArray_DATA(array_ptr))
@ -101,7 +79,6 @@ THAllocator THStorageWeakRefAllocator = {
};
#ifdef WITH_NUMPY
// See Note [Numpy memory management]
THAllocator THNumpyArrayAllocator = {
malloc_wrapper<NumpyArrayAllocator>,
realloc_wrapper<NumpyArrayAllocator>,

View File

@ -1,33 +0,0 @@
## Autograd
Autograd is a hotspot for PyTorch performance, so most of the heavy lifting is
implemented in C++. This implies that we have to do some shuffling between
Python and C++; and in general, we want data to be in a form that is convenient
to manipulate from C++.
Our general model is that for any key data type that autograd manipulates,
there are two implementations: a C++ type and a Python object type. For
example, consider variables in autograd: we have both `Variable` in `variable.h`
(the C++ type) and `THPVariable` in `python_variable.h` (the Python type.)
(By the way, THP stands for TorcH Python, not to be confused with THPP, TorcH
C++). `Variable` contains the payload of a variable, while `THPVariable` just
contains a `shared_ptr` reference to `Variable`, as well as references to other
Python objects which the Python runtime needs to know about. A lot of
data accessor implementations in `python_variable.cpp` simply reach through
to the underlying `Variable` and return the appropriate value.
The most complicated application of this principle is Function, which also
supports users implementing custom behavior in Python. We have the following
classes:
* `Function` in `function.h`, the C++ type.
* `THPFunction` in `python_function.h`, the Python object type. In
`python_function.cpp`, you can see the boilerplate that tells the Python
interpreter about this object.
* `PyFunction` in `python_function.h`, a subclass of `Function` which forwards
`apply` to a Python `THPFunction`. (NOT a Python object, despite its name!)
Outside of `PyFunction`, the C++ objects largely avoid referencing Python
objects (there are a few exceptions, like `pyobj` in `Variable`, and
`PyFunction`, whose whole point is to let C++ call into Python). Another
exception is `pyobj` in `Function`, which exists to ensure uniqueness of the
associated Python wrapper (if it exists).

View File

@ -1,11 +1,8 @@
#include "torch/csrc/autograd/engine.h"
#include "torch/csrc/autograd/functions/basic_ops.h"
#include "torch/csrc/utils/auto_gpu.h"
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <iostream>
#include <mutex>
#include <set>
@ -25,25 +22,15 @@ using thpp::Tensor;
namespace torch { namespace autograd {
// XXX: Changes to the way multithreading works in execute should be done with
// great care. Right now the implementation guarantees that a single function's
// apply will never be entered concurrently (even if multiple graphs are
// executed at the same time). Adding multiple threads per-device or removing
// engine thread affinity to the device can break this invariant, and we depend
// on it in a few places (e.g. AccumulateGrad function).
struct FunctionTask {
GraphTask* base;
BackwardTask* base;
std::shared_ptr<Function> fn;
// This buffer serves as an implicit "addition" node for all of the
// gradients flowing here. Once all the dependencies are finished, we
// use the contents of this buffer to run the function.
InputBuffer inputs;
GradBuffer grad;
FunctionTask(GraphTask* base, std::shared_ptr<Function> fn, InputBuffer inputs)
FunctionTask(BackwardTask* base, std::shared_ptr<Function> fn, GradBuffer grad)
: base(base)
, fn(fn)
, inputs(std::move(inputs)) {}
, grad(std::move(grad)) {}
};
struct ReadyQueue {
@ -55,32 +42,26 @@ struct ReadyQueue {
FunctionTask pop_back();
};
struct GraphTask {
struct BackwardTask {
std::exception_ptr exception;
// Indicates if an error occurred while executing any task. When this is
// true, it signals all threads to stop executing.
std::atomic_bool has_error;
std::atomic<uint64_t> outstanding_tasks;
bool keep_graph;
bool has_any_work;
bool retain_variables;
bool node_requires_grad;
std::mutex mutex;
// Notified when a task finishes executing. Check outstanding_tasks to see
// if all tasks are done.
std::condition_variable not_done;
const Engine::callback_map& function_callbacks;
std::unordered_map<Function*, InputBuffer> not_ready;
std::unordered_map<Function*, GradBuffer> not_ready;
std::unordered_map<Function*, int> dependencies;
GraphTask(bool keep_graph, const Engine::callback_map& function_callbacks)
BackwardTask(bool retain_variables)
: exception()
, has_error(false)
, outstanding_tasks(0)
, keep_graph(keep_graph)
, has_any_work(false)
, retain_variables(retain_variables)
, node_requires_grad(false)
, mutex()
, not_done()
, function_callbacks(function_callbacks)
, not_ready()
, dependencies() {}
};
@ -107,9 +88,7 @@ Engine::Engine() : ready_queues() {
// This Engine's ReadyQueues and their corresponding threads are leaked here
Engine::~Engine() = default;
auto Engine::thread_main(std::shared_ptr<ReadyQueue> queue, int device) -> void {
THInferNumThreads();
AutoGPU guard(device);
auto Engine::thread_main(std::shared_ptr<ReadyQueue> queue) -> void {
while (1) {
FunctionTask task = queue->pop_back();
if (!task.base->has_error.load()) {
@ -134,73 +113,78 @@ auto Engine::thread_on_exception(FunctionTask& task, std::exception& e) -> void
}
}
static variable_list call_pre_hooks(Function& fn, variable_list inputs) {
static variable_list call_pre_hooks(Function& fn, variable_list grad_output) {
for (auto& hook : fn.pre_hooks) {
inputs = (*hook)(inputs);
grad_output = (*hook)(grad_output);
}
return inputs;
return grad_output;
}
static variable_list call_post_hooks(Function& fn, variable_list outputs, variable_list inputs) {
static variable_list call_post_hooks(Function& fn, variable_list grad_input, variable_list grad_output) {
for (auto& hook : fn.post_hooks) {
outputs = (*hook)(outputs, inputs);
grad_input = (*hook)(grad_input, grad_output);
}
return outputs;
return grad_input;
}
static variable_list call_function(FunctionTask& task) {
auto& fn = *task.fn;
auto inputs = call_pre_hooks(fn, InputBuffer::variables(std::move(task.inputs)));
auto& function_callbacks = task.base->function_callbacks;
auto callback_it = function_callbacks.find(&fn);
if (callback_it != function_callbacks.end()) {
auto& callback = callback_it->second;
if (!callback(&fn, inputs)) return variable_list(fn.next_functions.size());
}
auto fn_outputs = fn.apply(inputs);
return call_post_hooks(fn, std::move(fn_outputs), std::move(inputs));
auto grad_output = call_pre_hooks(*task.fn, GradBuffer::variables(std::move(task.grad)));
auto grad_input = task.fn->apply(grad_output);
return call_post_hooks(*task.fn, std::move(grad_input), std::move(grad_output));
}
auto Engine::evaluate_function(FunctionTask& task) -> void {
auto outputs = call_function(task);
auto grad_inputs = call_function(task);
auto& fn = *task.fn;
if (!task.base->keep_graph) {
if (!task.base->retain_variables) {
fn.releaseVariables();
}
if (outputs.size() != fn.next_functions.size()) {
if (grad_inputs.size() != fn.previous_functions.size()) {
std::stringstream ss;
ss << "Function '" << fn.name() << "' returned an invalid number of outputs - expected ";
ss << fn.next_functions.size() << ", but got " << outputs.size();
ss << "Function '" << fn.name() << "' returned an invalid number of gradients - expected ";
ss << fn.previous_functions.size() << ", but got " << grad_inputs.size();
throw std::runtime_error(ss.str());
}
int num_outputs = outputs.size();
for (int i = 0; i < num_outputs; ++i) {
auto& output = outputs[i];
auto& next_fn = fn.next_functions[i].first;
int input_nr = fn.next_functions[i].second;
int size = grad_inputs.size();
for (int i = 0; i < size; ++i) {
auto& grad_input = grad_inputs[i];
auto& prev_fn = fn.previous_functions[i].first;
int output_nr = fn.previous_functions[i].second;
if (!next_fn) {
// null inputs have no previous_function and we skip them here
if (!prev_fn) {
continue;
}
// Stochastic functions are placed in the ready queue by
// compute_dependencies, so we have to skip them here.
if (next_fn->is_stochastic || !next_fn->is_executable) {
// compute_dependencies, so we can skip them here.
if (prev_fn->is_stochastic || !prev_fn->requires_grad) {
continue;
}
std::lock_guard<std::mutex> lock(task.base->mutex);
// Check if the next function is ready to be computed
if (auto var = dynamic_cast<Variable*>(prev_fn.get())) {
if (!grad_input) {
// NOTE: grad_input can be NULL if the function returns None for a
// non_differentiable input. We may need to track additional information
// at the function level to determine if a NULL grad_input is an error.
std::stringstream ss;
ss << "Function '" << fn.name() << "' missing gradient at " << i;
throw std::runtime_error(ss.str());
}
var->backward(grad_input);
continue;
}
// Check if the function is ready for backward
bool is_ready = false;
auto& dependencies = task.base->dependencies;
auto it = dependencies.find(next_fn.get());
auto it = dependencies.find(prev_fn.get());
if (it == dependencies.end()) {
auto name = next_fn->name();
auto name = prev_fn->name();
throw std::runtime_error(std::string("dependency not found for ") + name);
} else if (--it->second == 0) {
dependencies.erase(it);
@ -208,24 +192,24 @@ auto Engine::evaluate_function(FunctionTask& task) -> void {
}
auto& not_ready = task.base->not_ready;
auto not_ready_it = not_ready.find(next_fn.get());
auto not_ready_it = not_ready.find(prev_fn.get());
if (not_ready_it == not_ready.end()) {
// No buffers have been allocated for the function
InputBuffer input_buffer(next_fn->num_inputs);
input_buffer.add(input_nr, std::move(output));
GradBuffer prev_buffer(prev_fn->num_outputs);
prev_buffer.addGrad(output_nr, std::move(grad_input));
if (is_ready) {
auto& queue = ready_queue(input_buffer.device());
queue.push_front(FunctionTask(task.base, next_fn, std::move(input_buffer)));
auto& queue = ready_queue(prev_buffer.device());
queue.push_front(FunctionTask(task.base, prev_fn, std::move(prev_buffer)));
} else {
not_ready.emplace(next_fn.get(), std::move(input_buffer));
not_ready.emplace(prev_fn.get(), std::move(prev_buffer));
}
} else {
// The function already has a buffer
auto &input_buffer = not_ready_it->second;
input_buffer.add(input_nr, std::move(output));
auto &prev_buffer = not_ready_it->second;
prev_buffer.addGrad(output_nr, std::move(grad_input));
if (is_ready) {
auto& queue = ready_queue(input_buffer.device());
queue.push_front(FunctionTask(task.base, next_fn, std::move(input_buffer)));
auto& queue = ready_queue(prev_buffer.device());
queue.push_front(FunctionTask(task.base, prev_fn, std::move(prev_buffer)));
not_ready.erase(not_ready_it);
}
}
@ -233,30 +217,30 @@ auto Engine::evaluate_function(FunctionTask& task) -> void {
}
/** Finds all stochastic functions and appends them to the queue */
auto Engine::find_stochastic_functions(function_queue& queue, Function* graph_root, GraphTask& task) -> void {
std::unordered_set<Function*> seen {graph_root};
function_queue search_queue {graph_root};
auto Engine::find_stochastic_functions(function_queue& queue, BackwardTask& task) -> void {
std::unordered_set<Function*> seen;
function_queue search_queue(queue);
while (search_queue.size() > 0) {
auto fn = search_queue.back(); search_queue.pop_back();
for (auto& next_fn_pair : fn->next_functions) {
auto& next_fn = next_fn_pair.first;
Function* next_ptr = next_fn.get();
if (!next_ptr) continue;
if (next_ptr->is_stochastic && next_ptr->is_executable && seen.count(next_ptr) == 0) {
ready_queue(-1).push_front(FunctionTask(&task, next_fn, InputBuffer(0)));
queue.push_back(next_ptr);
task.has_any_work = true;
for (auto& prev_fn_pair : fn->previous_functions) {
auto& prev_fn = prev_fn_pair.first;
Function* prev_ptr = prev_fn.get();
if (!prev_ptr) continue;
if (prev_ptr->is_stochastic && prev_ptr->requires_grad && seen.count(prev_ptr) == 0) {
ready_queue(-1).push_front(FunctionTask(&task, prev_fn, GradBuffer(0)));
queue.push_back(prev_ptr);
task.node_requires_grad = true;
}
if (seen.count(next_ptr) == 0) {
seen.insert(next_ptr);
search_queue.push_back(next_ptr);
if (seen.count(prev_ptr) == 0) {
seen.insert(prev_ptr);
search_queue.push_back(prev_ptr);
}
}
}
}
/** Computes the number of dependencies for each function which requires grad */
auto Engine::compute_dependencies(function_queue queue, GraphTask& task) -> void {
auto Engine::compute_dependencies(function_queue queue, BackwardTask& task) -> void {
// Just to make sure that they will never be added to the queue again
std::unordered_set<Function*> seen(queue.begin(), queue.end());
@ -265,97 +249,99 @@ auto Engine::compute_dependencies(function_queue queue, GraphTask& task) -> void
auto& dependencies = task.dependencies;
while (queue.size() > 0) {
auto fn = std::move(queue.back()); queue.pop_back();
for (auto& next_fn_pair : fn->next_functions) {
Function* next_ptr = next_fn_pair.first.get();
if (!next_ptr) continue;
if (!next_ptr->is_executable) continue;
if (next_ptr->is_stochastic) continue; // Stochastic nodes were in the queue already
dependencies[next_ptr] += 1;
if (seen.count(next_ptr) == 0) {
seen.insert(next_ptr);
queue.push_back(next_ptr);
// This is needed only to filter out backward roots that don't require grad
if (!fn->requires_grad) continue;
for (auto& prev_fn_pair : fn->previous_functions) {
Function* prev_ptr = prev_fn_pair.first.get();
if (!prev_ptr) continue;
if (dynamic_cast<Variable*>(prev_ptr)) continue;
if (!prev_ptr->requires_grad) continue;
if (prev_ptr->is_stochastic) continue; // Stochastic nodes were in the queue already
dependencies[prev_ptr] += 1;
if (seen.count(prev_ptr) == 0) {
seen.insert(prev_ptr);
queue.push_back(prev_ptr);
}
}
}
}
struct ClearCallbacks {
ClearCallbacks(std::vector<std::function<void()>>& callbacks,
std::mutex &callbacks_lock)
: callbacks(callbacks)
, callbacks_lock(callbacks_lock) { clear(); }
~ClearCallbacks() { clear(); }
void clear() {
std::lock_guard<std::mutex> lock(callbacks_lock);
callbacks.clear();
}
std::vector<std::function<void()>>& callbacks;
std::mutex& callbacks_lock;
};
auto Engine::execute(const function_list& input_roots,
variable_list& inputs,
bool keep_graph,
const callback_map& callbacks) -> void {
std::call_once(start_threads_flag, &Engine::start_threads, this);
// Callbacks are only valid for the duration of this run and should always be cleared
ClearCallbacks _cb_guard(post_callbacks, post_callbacks_lock);
GraphTask graph_task(keep_graph, callbacks);
std::unique_lock<std::mutex> lock(graph_task.mutex);
auto graph_root = std::make_shared<GraphRoot>(input_roots, inputs);
function_queue roots;
for (auto entry : input_roots) {
if (entry.first->is_executable) {
graph_task.has_any_work = true;
roots.push_back(graph_root.get());
ready_queue(-1).push_front(FunctionTask(&graph_task, graph_root, InputBuffer(0)));
break;
auto Engine::find_creators(const variable_list& variables,
tensor_list& grad_variables,
BackwardTask& task) -> function_queue {
function_queue creators;
std::unordered_map<std::shared_ptr<Function>, std::unique_ptr<GradBuffer>> creator_grad;
int size = variables.size();
for (int i = 0; i < size; ++i) {
auto& var = variables[i];
auto& grad = grad_variables[i];
if (!var->creator) {
// If someone calls .backward() on a leaf, it's simple...
if (var->requires_grad) {
var->backward(std::make_shared<Variable>(std::move(grad), false, true));
task.node_requires_grad = true;
}
} else {
auto& creator = var->creator;
auto& buf = creator_grad[creator];
if (creator->requires_grad) {
if (!buf) buf.reset(new GradBuffer(creator->num_outputs));
buf->addGrad(var->output_nr, Variable::of(std::move(grad)));
}
}
}
// Search the graph and find all stochastic functions. Append them to the queue.
find_stochastic_functions(roots, graph_root.get(), graph_task);
for (auto& entry: creator_grad) {
const auto& creator = entry.first;
creators.push_back(creator.get());
if (creator->requires_grad) {
// NOTE: buf is null if creator doesn't require gradient
auto& buf = entry.second;
auto& queue = ready_queue(buf->device());
queue.push_front(FunctionTask(&task, creator, std::move(*buf)));
task.node_requires_grad = true;
}
}
if (!graph_task.has_any_work) {
return creators;
}
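
The grouping step in find_creators exists because several output variables can share one creator: their incoming gradients are first accumulated into a single per-creator buffer (indexed by output_nr) and the creator is scheduled only once. A simplified sketch of that grouping, using toy Var/Creator types and plain doubles in place of GradBuffer:

```
#include <memory>
#include <unordered_map>
#include <vector>

struct Creator { int num_outputs = 0; };

struct Var {
  std::shared_ptr<Creator> creator;   // null for leaf variables
  int output_nr = 0;                  // which output of the creator this variable is
  double grad = 0.0;                  // toy gradient payload
};

// Collect the gradient of every variable into one buffer per unique creator.
std::unordered_map<std::shared_ptr<Creator>, std::vector<double>>
group_by_creator(const std::vector<Var>& vars) {
  std::unordered_map<std::shared_ptr<Creator>, std::vector<double>> buffers;
  for (const auto& v : vars) {
    if (!v.creator) continue;         // leaves are handled separately
    auto& buf = buffers[v.creator];
    if (buf.empty()) buf.resize(v.creator->num_outputs, 0.0);
    buf[v.output_nr] += v.grad;       // slot the gradient into the right output position
  }
  return buffers;
}
```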
auto Engine::backward(const variable_list& variables,
tensor_list& grad_variables,
bool retain_variables) -> void {
static std::once_flag once_flag;
std::call_once(once_flag, &Engine::start_threads, this);
BackwardTask backward_task(retain_variables);
std::unique_lock<std::mutex> lock(backward_task.mutex);
// Find the unique creators and backprop into variables which don't have creators.
auto creators = find_creators(variables, grad_variables, backward_task);
// Search the graph and find all stochastic functions. Append them to the queue.
find_stochastic_functions(creators, backward_task);
if (!backward_task.node_requires_grad) {
throw std::runtime_error(
"there are no graph nodes that require computing gradients");
}
// Now compute the dependencies for all executable functions
compute_dependencies(std::move(roots), graph_task);
// Now compute the dependencies for each function which requires grad
compute_dependencies(std::move(creators), backward_task);
// Wait for all tasks to complete
graph_task.not_done.wait(lock, [&graph_task]{
return graph_task.outstanding_tasks.load() == 0;
// wait for all tasks to complete
backward_task.not_done.wait(lock, [&backward_task]{
return backward_task.outstanding_tasks.load() == 0;
});
// Check for an exception while running backwards
if (graph_task.has_error.load()) {
std::rethrow_exception(graph_task.exception);
// check for an exception while running backwards
if (backward_task.has_error.load()) {
std::rethrow_exception(backward_task.exception);
}
if (!graph_task.not_ready.empty()) {
if (!backward_task.not_ready.empty()) {
throw std::runtime_error("could not compute gradients for some functions");
}
// Unlocking is necessary because the callback can register
// more callbacks (or they can be registered from other threads
// while it's waiting).
std::unique_lock<std::mutex> cb_lock(post_callbacks_lock);
for (std::size_t i = 0; i < post_callbacks.size(); ++i) {
cb_lock.unlock();
post_callbacks[i]();
cb_lock.lock();
}
}
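
Both execute and backward end with the same synchronization: the calling thread waits on a condition variable until an atomic counter of outstanding tasks reaches zero, and any exception captured on a worker thread is rethrown on the caller. A reduced sketch of that pattern, assuming workers simply call finish_task when they are done:

```
#include <atomic>
#include <condition_variable>
#include <exception>
#include <mutex>

struct TaskState {
  std::mutex mutex;
  std::condition_variable not_done;
  std::atomic<int> outstanding_tasks {0};
  std::atomic<bool> has_error {false};
  std::exception_ptr exception;
};

// Called by a worker thread when one task finishes (successfully or not).
void finish_task(TaskState& s, std::exception_ptr err = nullptr) {
  if (err) {
    std::lock_guard<std::mutex> lock(s.mutex);
    if (!s.has_error.exchange(true)) s.exception = err;   // keep only the first error
  }
  if (--s.outstanding_tasks == 0) {
    std::lock_guard<std::mutex> lock(s.mutex);             // lock so the wakeup can't be missed
    s.not_done.notify_all();
  }
}

// Called by the thread that launched the backward pass.
void wait_for_completion(TaskState& s) {
  std::unique_lock<std::mutex> lock(s.mutex);
  s.not_done.wait(lock, [&s] { return s.outstanding_tasks.load() == 0; });
  if (s.has_error.load()) std::rethrow_exception(s.exception);
}
```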
void Engine::queue_callback(std::function<void()> callback) {
std::lock_guard<std::mutex> lock(post_callbacks_lock);
post_callbacks.emplace_back(std::move(callback));
}
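
queue_callback and the loop at the end of execute cooperate: callbacks are appended under the lock, and the loop iterates by index with the lock dropped around each call, because a running callback may itself queue further callbacks. A sketch of that iteration, with the callback copied out before the lock is released so a reallocation of the vector cannot invalidate it:

```
#include <functional>
#include <mutex>
#include <vector>

// Runs every queued callback; callbacks may enqueue more while running.
void run_post_callbacks(std::vector<std::function<void()>>& callbacks,
                        std::mutex& callbacks_lock) {
  std::unique_lock<std::mutex> lock(callbacks_lock);
  for (std::size_t i = 0; i < callbacks.size(); ++i) {
    auto cb = callbacks[i];   // copy: the vector may grow and reallocate once unlocked
    lock.unlock();
    cb();
    lock.lock();
  }
}
```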
auto Engine::ready_queue(int device) -> ReadyQueue& {
@ -371,12 +357,10 @@ auto Engine::start_threads() -> void {
num_devices = 0;
}
#endif
int num_threads = num_devices + 1;
ready_queues = std::vector<std::shared_ptr<ReadyQueue>>(num_threads);
for (int i = 0; i < num_threads; ++i) {
auto& queue = ready_queues[i];
ready_queues = std::vector<std::shared_ptr<ReadyQueue>>(num_devices + 1);
for (auto& queue : ready_queues) {
queue.reset(new ReadyQueue());
std::thread t(&Engine::thread_main, this, queue, i - 1);
std::thread t(&Engine::thread_main, this, queue);
t.detach();
}
}
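
start_threads builds one ready queue (and one worker thread) per CUDA device plus one extra slot for the CPU; the older variant also hands each thread its device index. The indexing convention used elsewhere in the engine appears to be device + 1, so device -1 (CPU) maps to slot 0 — a toy sketch of that layout, noted as an assumption since the ready_queue body itself is not part of this hunk:

```
#include <cassert>
#include <memory>
#include <vector>

struct ReadyQueue { /* queue of FunctionTasks in the real engine */ };

struct EngineQueues {
  std::vector<std::unique_ptr<ReadyQueue>> queues;

  explicit EngineQueues(int num_devices) {
    queues.resize(num_devices + 1);             // slot 0 is reserved for the CPU
    for (auto& q : queues) q.reset(new ReadyQueue());
  }

  // device == -1 selects the CPU queue; devices 0..N-1 select per-GPU queues.
  ReadyQueue& ready_queue(int device) {
    assert(device + 1 >= 0 && device + 1 < (int)queues.size());
    return *queues[device + 1];
  }
};
```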

View File

@ -3,22 +3,20 @@
// Engine implements backpropagation from output variables and their gradients
// to "root" variables (variables created by the user with requires_grad=True).
#include <Python.h>
#include <deque>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>
#include <functional>
#include "torch/csrc/autograd/function.h"
#include "torch/csrc/autograd/input_buffer.h"
#include "torch/csrc/autograd/grad_buffer.h"
namespace torch { namespace autograd {
struct ReadyQueue;
struct FunctionTask;
struct GraphTask;
struct BackwardTask;
// A single instance of this struct should exist for the whole lifetime of the process.
// The worker thread creation logic and Engine's destructor rely on this.
@ -26,39 +24,31 @@ struct Engine {
Engine();
virtual ~Engine();
using ready_queue_type = std::deque<std::pair<std::shared_ptr<Function>, InputBuffer>>;
using ready_queue_type = std::deque<std::pair<std::shared_ptr<Function>, GradBuffer>>;
using function_queue = std::vector<Function*>;
using dependencies_type = std::unordered_map<Function*, int>;
using callback_type = std::function<bool (Function*, variable_list&)>;
using callback_map = std::unordered_map<Function*, callback_type>;
// Given a list of (Function, input number) pairs, computes the value of the graph
// by following next_function references.
void execute(
const function_list& roots,
variable_list& inputs,
bool keep_graph,
const callback_map& callbacks = callback_map());
void queue_callback(std::function<void()> callback);
// Given a list of output variables and their gradients, computes the
// gradients of "root" variables by backpropagation.
void backward(
const variable_list& variables,
tensor_list& grad_variables,
bool retain_variables);
protected:
function_queue find_roots(
const function_list& roots,
variable_list& inputs,
GraphTask& task);
void find_stochastic_functions(function_queue& queue, Function* graph_root, GraphTask& task);
void compute_dependencies(function_queue queue, GraphTask& task);
function_queue find_creators(
const variable_list& variables,
tensor_list& grad_variables,
BackwardTask& task);
void find_stochastic_functions(function_queue& queue, BackwardTask& task);
void compute_dependencies(function_queue queue, BackwardTask& task);
void evaluate_function(FunctionTask& task);
ReadyQueue& ready_queue(int device);
void start_threads();
virtual void thread_main(std::shared_ptr<ReadyQueue> queue, int device);
virtual void thread_main(std::shared_ptr<ReadyQueue> queue);
virtual void thread_on_exception(FunctionTask& task, std::exception& e);
std::once_flag start_threads_flag;
std::vector<std::shared_ptr<ReadyQueue>> ready_queues;
std::vector<std::function<void()>> post_callbacks;
std::mutex post_callbacks_lock;
};
}} // namespace torch::autograd
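
The callback_map declared in the newer header is keyed by Function* and holds std::function<bool (Function*, variable_list&)> values; the assumed semantics (not shown in this hunk) are that a callback can inspect or rewrite the gradients flowing through a function and return false to stop propagation there. A hypothetical illustration of that shape with stand-in types:

```
#include <functional>
#include <memory>
#include <unordered_map>
#include <vector>

struct Function;                  // stand-in for torch::autograd::Function
struct Variable;                  // stand-in for torch::autograd::Variable
using variable_list = std::vector<std::shared_ptr<Variable>>;
using callback_type = std::function<bool (Function*, variable_list&)>;
using callback_map  = std::unordered_map<Function*, callback_type>;

callback_map make_callbacks(Function* watched) {
  callback_map callbacks;
  callbacks[watched] = [](Function* /*fn*/, variable_list& /*grads*/) {
    // inspect or modify the gradients here before they propagate further
    return true;                  // returning false would stop at this function (assumed)
  };
  return callbacks;
}
```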

View File

@ -10,22 +10,22 @@ namespace torch { namespace autograd {
auto Function::flags(const variable_list& inputs) -> FunctionFlags {
int num_inputs = inputs.size();
FunctionFlags f;
f.is_executable = false;
f.requires_grad = false;
f.is_volatile = false;
f.next_functions.resize(num_inputs);
f.previous_functions.resize(num_inputs);
for (int i = 0; i != num_inputs; ++i) {
auto& var = inputs[i];
if (var) {
f.is_executable |= var->requires_grad;
f.requires_grad |= var->requires_grad;
f.is_volatile |= var->is_volatile;
if (var->grad_fn) {
f.next_functions[i] = std::make_pair<>(var->grad_fn, var->output_nr);
if (var->creator) {
f.previous_functions[i] = std::make_pair<>(var->creator, var->output_nr);
} else {
f.next_functions[i] = std::make_pair<>(var->get_grad_accumulator(), 0);
f.previous_functions[i] = std::make_pair<>(var, 0);
}
}
}
f.is_executable &= !f.is_volatile;
f.requires_grad &= !f.is_volatile;
return f;
}
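
The rule Function::flags implements is a simple fold over the inputs: the result requires grad if any input does, is volatile if any input is, and volatility takes precedence by masking requires_grad (is_executable in the other naming) at the end. A compact standalone restatement of just that rule:

```
#include <vector>

struct Flags {
  bool requires_grad = false;
  bool is_volatile = false;
};

struct Input {
  bool requires_grad = false;
  bool is_volatile = false;
};

Flags merge_flags(const std::vector<Input>& inputs) {
  Flags f;
  for (const auto& in : inputs) {
    f.requires_grad |= in.requires_grad;  // any input needing grad taints the output
    f.is_volatile   |= in.is_volatile;    // any volatile input makes the output volatile
  }
  f.requires_grad &= !f.is_volatile;      // volatility wins and disables grad computation
  return f;
}
```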

View File

@ -1,19 +1,18 @@
#pragma once
// Function is an abstract class that represents a single operation from one or
// more variables to one or more variables.
// more variables to one more or varaibles.
//
// Subclasses may represent "forward" or "backward" operations (i.e. functions
// and their derivatives). Some functions may be used as both.
#include <Python.h>
#include "torch/csrc/autograd/function_hook.h"
#include <THPP/THPP.h>
#include <memory>
#include <THPP/THPP.h>
#include <vector>
#include "torch/csrc/autograd/saved_variable.h"
#include "torch/csrc/autograd/function_hook.h"
namespace torch { namespace autograd {
struct Function;
@ -25,37 +24,30 @@ using function_list = std::vector<std::pair<std::shared_ptr<Function>, int>>;
// State used to create "backward" functions
struct FunctionFlags {
// Roughly speaking, is_executable corresponds to requires_grad.
// See http://pytorch.org/docs/notes/autograd.html for more details:
// both is_executable and is_volatile specify whether or not backwards
// gradient computation will be performed for a function, but they differ in
// their precedence.
bool is_executable = false;
bool requires_grad = false;
bool is_volatile = false;
// What functions take the output of this function as input.
// There is one function per output of this function.
function_list next_functions;
function_list previous_functions;
};
struct Function {
Function()
: num_inputs(0)
, next_functions()
, is_executable(false)
: num_outputs(0)
, previous_functions()
, requires_grad(false)
, is_volatile(false)
, is_stochastic(false)
, pre_hooks()
, post_hooks()
, pyobj(nullptr)
{}
Function(FunctionFlags&& flags)
: num_inputs(0)
, next_functions(std::move(flags.next_functions))
, is_executable(flags.is_executable)
: num_outputs(0)
, previous_functions(std::move(flags.previous_functions))
, requires_grad(flags.requires_grad)
, is_volatile(flags.is_volatile)
, is_stochastic(false)
, pre_hooks()
, post_hooks()
, pyobj(nullptr)
{}
Function(const Function& other) = delete;
@ -65,7 +57,7 @@ struct Function {
// Implements the operation
virtual variable_list apply(const variable_list& inputs) = 0;
// Computes is_executable, is_volatile, and next_functions from a list
// Computes requires_grad, is_volatile, and previous_functions from a list
// of input variables
static FunctionFlags flags(const variable_list& inputs);
@ -75,24 +67,21 @@ struct Function {
// Function name for debugging
virtual std::string name();
inline bool should_compute_output(int i) const {
auto& fn = next_functions[i].first;
return fn && fn->is_executable;
inline bool needs_input_grad(int i) const {
auto& fn = previous_functions[i].first;
return fn && fn->requires_grad;
}
inline void set_flags(FunctionFlags&& flags) {
is_executable = flags.is_executable;
next_functions = std::move(flags.next_functions);
}
int num_inputs;
function_list next_functions;
bool is_executable;
// These variables are usually only meaningful for "backward" functions.
// num_outputs is the number of outputs of the corresponding "forward" function;
// it's actually the number of inputs of this function.
int num_outputs;
function_list previous_functions;
bool requires_grad;
bool is_volatile;
bool is_stochastic;
std::vector<std::shared_ptr<FunctionPreHook>> pre_hooks;
std::vector<std::shared_ptr<FunctionPostHook>> post_hooks;
PyObject *pyobj; // weak reference
};
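
The pre_hooks and post_hooks vectors on Function hold user-registered transforms; the assumed call order (the evaluate_function hunk is not part of this diff) is that pre-hooks rewrite the incoming gradients, the function runs, and post-hooks then see both its outputs and the original inputs. A toy sketch of that ordering with plain std::function hooks instead of the real hook classes:

```
#include <functional>
#include <vector>

using value_list = std::vector<double>;   // toy stand-in for variable_list
using pre_hook   = std::function<value_list(const value_list&)>;
using post_hook  = std::function<value_list(const value_list&, const value_list&)>;

// Assumed order: pre-hooks, then the function itself, then post-hooks.
value_list call_with_hooks(const std::vector<pre_hook>& pre,
                           const std::vector<post_hook>& post,
                           const std::function<value_list(const value_list&)>& fn,
                           value_list inputs) {
  for (const auto& h : pre)  inputs  = h(inputs);
  value_list outputs = fn(inputs);
  for (const auto& h : post) outputs = h(outputs, inputs);
  return outputs;
}
```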

Some files were not shown because too many files have changed in this diff