1. Update "Package Reference" to "Python API"
2. Add torchaudio and torchtext reference links so they show up across all docs, not just the main page
3. Add "Other Languages" section and add in C++ docs
Changelog:
- When number of batches = 1, dispatch to trsm instead of trsm_batched in MAGMA
Test Plan:
- All triangular_solve tests should pass to ensure that the change is valid
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23376
This uses master version of sphinxcontrib-katex as it only
recently got prerender support.
Fixes #20984
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D16582064
Pulled By: ezyang
fbshipit-source-id: 9ef24c5788c19572515ded2db2e8ebfb7a5ed44d
This is temporary, won't be needed with the new serialization format.
But for now, since the main module gets its name from the archive name,
we need this for safety, otherwise something like
`torch.jit.save("torch.pt")` will break things.
ghstack-source-id: f36febe1025ff04e7f79617e548819d4876dc7fa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23630
Now when initializing a ScriptModule during the torch.jit.load()
process, there is already a cpp module backing the thing. That means
that setting training will overwrite whatever the initialized
ScriptModule had.
This PR splits apart the common "set up internal state" part of the
Module __init__ and calls that from ScriptModule.__init__ and
Module.__init__, leaving the "nn.Module-specific" part (setting
`self.training`) for the nn.Module __init__.
ghstack-source-id: 9b2ba8a15c43cf230363e4cd10ba4ad3ac4931f7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23680
* [jit] Support nn.GRU and Make nn.LSTM accept PackedPaddedSequence
* fix
* add link to comments
In max_pool2d, max_pool3d, avg_pool2d, avg_pool3d.
There is only one substantive change: when stride.size() == 1,
we expand it to size 2. However, I also took the opportunity
to give a better error message.
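A minimal Python sketch of the expansion rule described above (illustrative only; the real change lives in the C++ pooling code, and `expand_stride` is a hypothetical helper):
```python
def expand_stride(stride, spatial_dims=2):
    # A single value applies to every spatial dimension
    stride = list(stride)
    if len(stride) == 1:
        return stride * spatial_dims
    if len(stride) != spatial_dims:
        raise ValueError(
            "stride must have 1 or {} elements, got {}".format(spatial_dims, len(stride)))
    return stride
```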
Testing here is bare minimum, because I'm in a hurry. Just make
sure C++ API with all size 1 inputs works.
This is a squash of four commits.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Simplifying https://github.com/pytorch/pytorch/issues/23793: The dependency relationship between
{INSTALL,BUILD}_TEST is already properly handled in CMakeLists.txt. All
we need to do is to pass down INSTALL_TEST.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23806
Differential Revision: D16691833
Pulled By: soumith
fbshipit-source-id: 7607492b2d82db3f79b174373a92e2810a854a61
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/23480.
I only verified that the schedule reaches the restart at the expected step as specified in the issue; it would be good to have someone else verify correctness here.
Script:
```
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(torch.optim.SGD([torch.randn(1, requires_grad=True)], lr=0.5), T_0=1, T_mult=2)
for i in range(9):
    print(i)
    print(scheduler.get_lr())
    scheduler.step()
```
Output:
```
0
[0.5]
1
[0.5]
2
[0.25]
3
[0.5]
4
[0.42677669529663687]
5
[0.25]
6
[0.07322330470336313]
7
[0.5]
8
[0.4809698831278217]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23833
Differential Revision: D16657251
Pulled By: gchanan
fbshipit-source-id: 713973cb7cbfc85dc333641cbe9feaf917718eb9
Summary:
This allows `INSTALL_*` to pass through to cmake.
An additional fix is that if `INSTALL_TEST` is specified, it won't use `BUILD_TEST` as the default value for `INSTALL_TEST`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23793
Differential Revision: D16648668
Pulled By: soumith
fbshipit-source-id: 52c2a0d8033bc556355b87a6731a577940de9859
Summary:
Changelog:
- Add batching for det / logdet / slogdet operations
- Update derivative computation to support batched inputs (and consequently batched outputs)
- Update docs
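A usage sketch for the batched operations listed above; the determinant functions now operate over the trailing two dimensions of the input:
```python
import torch

a = torch.randn(4, 3, 3)            # a batch of 4 square matrices
det = torch.det(a)                  # shape (4,)
sign, logabsdet = torch.slogdet(a)  # each of shape (4,)
```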
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22909
Test Plan:
- Add a `test_det_logdet_slogdet_batched` method in `test_torch.py` to test `torch.det`, `torch.logdet` and `torch.slogdet` on batched inputs. This relies on the correctness of `torch.det` on single matrices (tested by `test_det_logdet_slogdet`). A port of this test is added to `test_cuda.py`
- Add autograd tests for batched inputs
Differential Revision: D16580988
Pulled By: ezyang
fbshipit-source-id: b76c87212fbe621f42a847e3b809b5e60cfcdb7a
* [jit] Recursive compilation error hot fixes
This is a combination of #23454 and #23682 which are needed for the
error reporting on recursively compiled code.
* #23682
Summary:
Changelog:
- Use narrow instead of narrow_copy while returning
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23591
Test Plan:
- All tests should pass to ensure that the change is correct
Fixes https://github.com/pytorch/pytorch/issues/23580
Differential Revision: D16581174
Pulled By: ezyang
fbshipit-source-id: 1b6bf7d338ddd138ea4c6aa6901834dd202ec79c
Summary:
This accidentally calls clone, but what we want is to create an empty tensor and set its storage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23452
ghstack-source-id: 87438096
Differential Revision: D16442756
fbshipit-source-id: 6d5663f82c9bd4e9de8fc846c52992477843af6a
Summary:
Changelog:
- Rename `gels` to `lstsq`
- Fix all callsites
- Rename all tests
- Create a tentative alias for `lstsq` under the name `gels` and add a deprecation warning to not promote usage.
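A usage sketch, assuming the `torch.lstsq` signature introduced here (the deprecated `torch.gels` alias behaves the same; newer releases prefer `torch.linalg.lstsq`):
```python
import torch

A = torch.randn(5, 3)
b = torch.randn(5, 2)
# Solves the least-squares problem min ||A x - b||; calling torch.gels now warns.
solution, qr = torch.lstsq(b, A)
```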
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23460
Test Plan: - All tests should pass to confirm that the patch is correct
Differential Revision: D16547834
Pulled By: colesbury
fbshipit-source-id: b3bdb8f4c5d14c7716c3d9528e40324cc544e496
Summary:
Only check for cmake dependencies we directly depend on (e.g., hipsparse but not rocsparse)
Use cmake targets for ROCm where possible.
While there, update the docker CI build infrastructure to only pull in packages by name we directly depend on (anticipating the demise of, e.g., miopengemm). I do not anticipate a docker rebuild to be necessary at this stage as the changes are somewhat cosmetic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23527
Differential Revision: D16561010
Pulled By: ezyang
fbshipit-source-id: 87cd9d8a15a74caf9baca85a3e840e9d19ad5d9f
Summary:
Syncing worker requirement mismatches to improve remote build time.
Created actions:
LARGE: 66
MEDIUM: 649
XLARGE: 1
Updated actions:
From LARGE to MEDIUM: 18
From LARGE to XLARGE: 2
From MEDIUM to LARGE: 20
From XLARGE to LARGE: 1
Differential Revision: D16559356
fbshipit-source-id: a51ef034265649314661ab0e283089a069a20437
Summary:
When a user tries to change the metadata of a tensor created from `.data` or `.detach()`, we currently show the error message "<function_name> is not allowed on Tensor created from .data or .detach()". However, this error message doesn't suggest what the right fix should look like. This PR improves the error message.
Closes https://github.com/pytorch/pytorch/issues/23393.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23504
Differential Revision: D16547415
Pulled By: yf225
fbshipit-source-id: 37f4a0385442e2b0966386fb14d3d938ecf4230c
Summary:
Previously these were left out which would lead to confusing messages,
now it looks something like:
```
torch.jit.frontend.UnsupportedNodeError: import statements aren't
supported
:
at ../test.py:13:9
def bad_fn(self):
import pdb
~~~~~~ <--- HERE
'__torch__.X' is being compiled since it was called from 'fn'
at ../test.py:16:12
def fn(x):
return X(10)
~~~~ <--- HERE
```
Fixes #23453
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23454
Pulled By: driazati
Differential Revision: D16526027
fbshipit-source-id: 109f2968430dbf51ee91b1b3409badfd557d19a4
Summary:
Use the recursive script API in the existing docs
TODO:
* Migration guide for 1.1 -> 1.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21612
Pulled By: driazati
Differential Revision: D16553734
fbshipit-source-id: fb6be81a950224390bd5d19b9b3de2d97b3dc515
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23521
The non-fbgemm path should have the same arguments as the fbgemm path.
Reviewed By: jianyuh
Differential Revision: D16547637
fbshipit-source-id: bb00d725fb968cbee32defb8facd2799a7e79bb4
Summary:
This resolves two issues in one shot:
- sub shouldn't be available for bool type.
- When sub is applied to an unsupported type, the current error message
shows "add_cpu/add_cuda is not implemented for [type]". It should say
"sub_cpu/sub_cuda" instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23519
Differential Revision: D16548770
Pulled By: izdeby
fbshipit-source-id: fe404a2a97b8d11bd180ec41364bf8e68414fb15
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23417
Test Plan:
cd docs; make html
Imported from OSS
Differential Revision: D16523781
Pulled By: ilia-cher
fbshipit-source-id: d6c09e8a85d39e6185bbdc4b312fea44fcdfff06
Summary:
No real change on the CI since currently the default latest is 0.4.0. houseroad bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23517
Differential Revision: D16550375
Pulled By: bddppq
fbshipit-source-id: a669b8af678c79c4d6909300b28458fe6b7cd30c
Summary:
There is an internal fbcode assert that fails if I do not add these checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23511
Differential Revision: D16545606
Pulled By: eellison
fbshipit-source-id: cd3a799850bae8f052f9d81c1e4a2678fda19317
Summary:
PyTorch test sets a policy() method to assertLeaksNoCudaTensors.
Whenever a test is run, assertLeaksNoCudaTensors is called,
which in turn calls CudaMemoryLeakCheck, which in turn calls
initialize_cuda_context_rng, where it executes torch.randn
on each device, where a kernel is launched on each device.
Since the kernel may not finish on device 1, the assertion
self.assertTrue(s1.query()) fails.
The fix is to insert `torch.cuda.synchronize(d0)` and `torch.cuda.synchronize(d1)` at the beginning of the test so that previously launched kernels finish before the real test begins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23520
Differential Revision: D16547701
Pulled By: soumith
fbshipit-source-id: 42ad369f909d534e15555493d08e9bb99dd64b6a
Summary:
Add a sorting policy to ChunkDataset.
This is considered an advanced parameter for developers who want to apply a 'sorting policy' to the chunk data before it is sampled into minibatches.
Unlike the collate method, this policy is applied at the chunk level instead of the minibatch level. When a chunk of data is loaded (multiple chunks if cross_chunk_shuffle_count_ is greater than 1), this policy applies to the full loaded data. It is useful when developers want to perform some pre-processing (like bucketing) on the chunk data before the example sampler samples it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23053
Differential Revision: D16537692
Pulled By: colesbury
fbshipit-source-id: cd21ed40ab787a18b8c6dd304e5b806a7a45e6ba
Summary:
Thanks adefazio for the feedback, adding a note to the Contribution guide so that folks don't start working on code without checking with the maintainers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23513
Differential Revision: D16546685
Pulled By: soumith
fbshipit-source-id: 1ee8ade963703c88374aedecb8c9e5ed39d7722d
Summary:
This modernizes distributions code by replacing a few uses of `.contiguous().view()` with `.reshape()`, fixing a sample bug in the `Categorical` distribution.
The bug is exercised by the following test:
```py
batch_shape = (1, 2, 1, 3, 1)
sample_shape = (4,)
cardinality = 2
logits = torch.randn(batch_shape + (cardinality,))
dist.Categorical(logits=logits).sample(sample_shape)
# RuntimeError: invalid argument 2: view size is not compatible with
# input tensor's size and stride (at least one dimension spans across
# two contiguous subspaces). Call .contiguous() before .view().
# at ../aten/src/TH/generic/THTensor.cpp:203
```
I have verified this works locally, but I have not added this as a regression test because it is unlikely to regress (the code is now simpler).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23328
Differential Revision: D16510678
Pulled By: colesbury
fbshipit-source-id: c125c1a37d21d185132e8e8b65241c86ad8ad04b
Summary:
Currently there is no way to build MKLDNN more optimized than sse4. This commit lets the MKLDNN build respect USE_NATIVE_ARCH.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23445
Differential Revision: D16542275
Pulled By: ezyang
fbshipit-source-id: 550976531d6a52db9128c0e3d4589a33715feee2
Summary:
- MSVC_Z7_OVERRIDE has already been handled in CMakeLists.txt. No need to process it once more in the Python scripts.
- Option MSVC_Z7_OVERRIDE should be visible to the user only if MSVC is used.
- Move the setting of "/EHa" flag to CMakeLists.txt, where other MSVC-specific flags are processed. This also further prepares the removal of redundant cflags setup in Python build scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23455
Differential Revision: D16542274
Pulled By: ezyang
fbshipit-source-id: 4d3b8b07161478bbba8a21feb6ea24c9024e21ac
Summary:
Closes gh-16955.
Closes https://github.com/pytorch/vision/issues/977
On Linux both `lib64` and `lib` may be present (symlinked). The reports
seem to all be about macOS, but it seems like this is also possibly more
robust on Linux and can't hurt. So not treating platforms differently.
Note that Eigen has a similar check in its CMake:
```
if(CUDA_64_BIT_DEVICE_CODE AND (EXISTS "${CUDA_TOOLKIT_ROOT_DIR}/lib64"))
link_directories("${CUDA_TOOLKIT_ROOT_DIR}/lib64")
else()
link_directories("${CUDA_TOOLKIT_ROOT_DIR}/lib")
endif()
```
There may be other issues for building from source on macOS, can't test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23491
Differential Revision: D16538973
Pulled By: soumith
fbshipit-source-id: cc309347b7d16e718e06878d3824d0a6e40b1019
Summary:
Currently set_rng_state and get_rng_state do not accept strings as their parameters. This commit lets them accept strings.
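A usage sketch, assuming the CUDA variants are the ones affected (the device argument now also accepts a plain string):
```python
import torch

if torch.cuda.is_available():
    # Previously only a torch.device or an int index was accepted here
    state = torch.cuda.get_rng_state("cuda:0")
    torch.cuda.set_rng_state(state, "cuda:0")
```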
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23448
Differential Revision: D16527172
Pulled By: soumith
fbshipit-source-id: 8f9a2129979706e16877cc110f104770fbbe952c
Summary:
Syncing worker requirement mismatches to improve remote build time.
Created actions:
MEDIUM: 981
LARGE: 56
Updated actions:
From MEDIUM to LARGE: 10
From LARGE to MEDIUM: 3
From LARGE to XLARGE: 1
Differential Revision: D16532427
fbshipit-source-id: c58bf59e6c571627b3994f8cdfa79758fb85892b
Summary:
(1) Add `COMMON_MSVC_FLAGS` to the flags in the ninja codepath
(2) Add `/EHsc` to `COMMON_MSVC_FLAG`
(3) Remove `-fPIC` and `-std=c++11` from the flags in the windows codepath
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23472
Differential Revision: D16532993
Pulled By: soumith
fbshipit-source-id: bc2d983f5f8b4eae9c7385bf170f155679e92e87
Summary:
Add `sorted` keyword to JIT for lists and dicts. This desugars to a list copy and a call to `list.sort()`. Since we don't have interfaces yet I implement it in terms of `list.sort()`. When we do we can re-visit implementing this op in a different manner.
The test fails because of a fix to specialized lists which is landing here: https://github.com/pytorch/pytorch/pull/23267
Ignore the first commit because it is formatting, plz use clang_format ppl :'(
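A minimal TorchScript usage sketch of the new `sorted` builtin:
```python
from typing import List

import torch

@torch.jit.script
def sort_copy(xs: List[int]) -> List[int]:
    # Desugars to a copy of the list followed by list.sort()
    return sorted(xs)

print(sort_copy([3, 1, 2]))  # [1, 2, 3]
```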
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23274
Differential Revision: D16527323
Pulled By: eellison
fbshipit-source-id: aed8faef23cb790b9af036cd6c1b9b1d7066345d
Summary:
Scatter is unnecessary if only using one device, and it breaks on some custom data structures like namedtuple, so we would like to avoid it :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22384
Differential Revision: D16428208
Pulled By: soumith
fbshipit-source-id: eaa3876b2b95c1006ccaaacdb62f54c5280e730c
Summary:
This is part of the effort to shrink OSS libtorch mobile build size.
We shouldn't need Module::save function on mobile - it depends on
csrc/jit/export.cpp which then depends on ONNX. By gating these two
methods we can avoid these dependencies for libtorch mobile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23415
ghstack-source-id: 87288228
Reviewed By: dreiss
Differential Revision: D16511143
fbshipit-source-id: fd031f91fcf9b7be54cbe1436506965af94ab537
Summary:
Add early returns to JIT with minimal changes to compiler.cpp and an IR->IR pass that will transform the graph so that there is only one return value.
In compiler.cpp, record when a block will exit so that the following example will work:
```
if cond:
a = torch.zeros([2])
else:
return 2
a += 2
...
```
To match block outputs with values that will not be used, like in the above example with `a`, I add a Bottom Type that subtypes everything else. This allows shape propagation to continue to work, and makes it so that we don't need many extra nodes filling up the graph.
The IR transform currently doesn't work on Loops; I didn't add that to this PR to avoid too much complexity, but will add it in a stacked PR (and it should be very little extra code). The IR transform is commented at the top of the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19179
Differential Revision: D16519819
Pulled By: eellison
fbshipit-source-id: 322a27f69966d1fd074ebe723c3e948b458b0e68
Summary:
Adds qtensor specific fields to the proto file so that they get serialized into the model.json
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23356
ghstack-source-id: 87263428
Differential Revision: D16473237
fbshipit-source-id: bf5b51d0863d036d30a1644a3c3b74516468224b
Summary:
As pointed out by SsnL in https://github.com/pytorch/pytorch/issues/20910, when clone destination is different from the module's device,
`Cloneable` currently calls `clone()` and then `to()` on every parameter and buffer, where the first clone is unnecessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20995
Differential Revision: D15517353
Pulled By: mrshenli
fbshipit-source-id: 6b6dc01560540a63845663f863dea0a948021fa5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23442
Rename the argument from `operator` to `operators`, which can take a list of operators to test.
Reviewed By: hl475
Differential Revision: D16520779
fbshipit-source-id: 94284a87c64471793e319f5bd3143f89b9a192bb
Summary:
When an exception occurs in one of the modules passed to `parallel_apply()`, it is caught and re-raised in the main thread. This preserves the original exception type and message, but has the traceback point at the position where it's re-raised, rather than the original point of failure.
This PR saves the exception information required to generate the traceback, and includes the original traceback in the message of the exception raised in the main thread.
Before:
```
...
File ".../torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File ".../torch/nn/parallel/parallel_apply.py", line 84, in parallel_apply
raise output
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
```
After:
```
...
File ".../torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File ".../torch/nn/parallel/parallel_apply.py", line 88, in parallel_apply
''.join(traceback.format_exception(*exc_info)))
RuntimeError: Caught exception in replica 0. Original traceback and message:
Traceback (most recent call last):
...
File "../models/foo.py", line 319, in bar
baz = asdf / ghij[:, np.newaxis]
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
```
I took care to raise an exception of the original type (in case the main code checks for that), but replaced the message. It helped me find a bug that did not occur outside `data_parallel()`.
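A minimal sketch of the technique (hypothetical helper names, not the actual `parallel_apply` code): the worker thread stores `sys.exc_info()`, and the main thread re-raises an exception of the original type with the formatted original traceback folded into the message.
```python
import sys
import traceback

def run_replica(fn, results, i, *args):
    # Worker side: capture the full exception info instead of the bare exception
    try:
        results[i] = fn(*args)
    except Exception:
        results[i] = sys.exc_info()

def reraise_if_failed(results, i):
    # Main-thread side: fold the original traceback into the new message
    out = results[i]
    if isinstance(out, tuple) and len(out) == 3 and isinstance(out[1], BaseException):
        exc_type, exc_value, exc_tb = out
        msg = "Caught exception in replica {}. Original traceback and message:\n{}".format(
            i, "".join(traceback.format_exception(exc_type, exc_value, exc_tb)))
        raise exc_type(msg)
    return out
```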
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18055
Differential Revision: D16444972
Pulled By: zhangguanheng66
fbshipit-source-id: ec436c9d4677fad18106a8046cfa835a20a101ce
Summary:
Don't automatically unwrap the top-layer DataParallel for users. Instead, we provide useful error information and tell users what action to take.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23365
Reviewed By: zrphercule
Differential Revision: D16514273
Pulled By: houseroad
fbshipit-source-id: f552de5c53fb44807e9d9ad62126c98873ed106e
Summary:
The conda compilers are gcc/c++ 7.3.0, but have custom version strings
for clarity:
x86_64-conda_cos6-linux-gnu-cc
x86_64-conda_cos6-linux-gnu-c++
Using these compilers to build a C++ or CUDA extension now gives this warning (unnecessarily):
```
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (/home/rgommers/anaconda3/envs/pytorch-nightly/bin/x86_64-conda_cos6-linux-gnu-c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux.
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23396
Differential Revision: D16500637
Pulled By: soumith
fbshipit-source-id: 5b2fc3593e22e9a7d07dc2c0456dbb4934ffddb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23104
ghstack-source-id: 87247148
As suggested in https://github.com/pytorch/pytorch/pull/22891, we will add an overload for `torch.fbgemm_linear_int8_weight` (the dynamic quantized version of the linear function) that takes PackedLinearWeight as input and is pretty much the same in signature as regular aten::linear.
Differential Revision: D16381552
fbshipit-source-id: 1ccc4174fd02c546eee328940ac4b0da48fc85e8
Summary:
Adding qconv+relu and qlinear+relu modules in nn/_intrinsic/quantized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23410
Test Plan:
Extended tests to test these new modules as well
buck test mode/dev caffe2/test:quantized -- 'test_linear_api' --print-passing-details
```
Running 1 tests
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/2251799820197379
✓ caffe2/test:quantized - test_linear_api (test_nn_quantized.ModuleAPITest) 4.055 1/1 (passed)
Test output:
> test_linear_api (test_nn_quantized.ModuleAPITest)
> test API functionality for nn.quantized.linear and nn._intrinsic.quantized.linear_relu ... ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 4.056s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/2251799820197379
Summary (total time 10.66s):
PASS: 1
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
buck test mode/dev caffe2/test:quantized -- 'test_conv_api' --print-passing-details
```
Running 2 tests
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/4785074607089664
✓ caffe2/test:quantized - test_conv_api (test_quantized_conv.QuantizedConvTest) 5.195 1/2 (passed)
Test output:
> test_conv_api (test_quantized_conv.QuantizedConvTest)
> Tests the correctness of the conv functional. ... ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 5.195s
>
> OK
✓ caffe2/test:quantized - test_conv_api (test_nn_quantized.ModuleAPITest) 10.616 2/2 (passed)
Test output:
> test_conv_api (test_nn_quantized.ModuleAPITest)
> Tests the correctness of the conv module. ... ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 10.616s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4785074607089664
Summary (total time 17.31s):
PASS: 2
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
Differential Revision: D16505333
Pulled By: dskhudia
fbshipit-source-id: 04f45cd0e76dc55f4694d558b913ab2958b7d727
Summary:
This is still a work in progress.
There are several more items to add to complete this doc, including
- [x] LHS indexing, index assignments.
- [x] Tensor List.
- [x] ~Shape/Type propagation.~
- [x] FAQs
Please review and share your thoughts; feel free to add anything that you think should be included as well. houseroad spandantiwari lara-hdr neginraoof
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23185
Differential Revision: D16459647
Pulled By: houseroad
fbshipit-source-id: b401c005f848d957541ba3b00e00c93ac2f4609b
Summary:
They should be forwarded by their actual type, not their rvalue reference.
This looked like perfect forwarding but actually wasn't.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23412
ghstack-source-id: 87214575
Reviewed By: dzhulgakov
Differential Revision: D16507872
fbshipit-source-id: 2b20a37df83067dd53e917fe87407ad687bb147c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21211
There are cases where the `init` method used to create inputs can exit with error. When this happens, that specific input should be skipped.
Reviewed By: zheng-xq
Differential Revision: D15466410
fbshipit-source-id: 55e86764b2ec56f7730349ff1df6e50efc0239d7
Summary:
Align the Argument's operator<< with the parser. Additional support for:
1) List size
2) real default value
3) Alias information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23203
ghstack-source-id: 87118985
Reviewed By: zrphercule
Differential Revision: D16433188
fbshipit-source-id: aea5711f93feacd94d1732e2f0d61218a31a0c5c
Summary:
The builder pattern doesn't seem to work well with return-value-optimization.
This saves ~100 ns in the construction of TensorIterator::binary_op.
```
import torch
x = torch.rand(1)
y = torch.rand(1)
z = torch.rand(1)
%timeit torch.add(x, y, out=z) # ~1.76 us vs ~1.88 us on my machine
```
cc resistor zheng-xq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23329
Differential Revision: D16495070
Pulled By: VitalyFedyunin
fbshipit-source-id: 8ce116075fa4c7149dabfcdfa25885c1187c8e2f
Summary:
The legacy iOS build script (`build_ios.sh`) is still working, but the output is in caffe2, not PyTorch. To enable the PyTorch iOS build, we can set the value of `BUILD_CAFFE2_MOBILE` to `NO` and turn on another cmake arg, `INTERN_BUILD_MOBILE`, which ljk53 created for Android.
There is a trivial issue in `used_kernel.cpp` that causes a compilation error when running `build_ios.sh`, as it uses a `system` API that has been deprecated since iOS 11. The fix below is to bypass this file since it's not needed by mobile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23293
Test Plan:
The `build_ios.sh` completed successfully, and all the generated static libraries can be compiled and linked successfully on iOS devices.
### Build script
```shell
./scripts/build_ios.sh \
-DBUILD_CAFFE2_MOBILE=OFF \
-DCMAKE_PREFIX_PATH=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())') \
-DPYTHON_EXECUTABLE=$(python -c 'import sys; print(sys.executable)')
```
Differential Revision: D16456100
Pulled By: xta0
fbshipit-source-id: 38c73e1e3a0c219a38ddc28b31acc181690f34e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22175
- Rename AliasAnalysisKind::DEFAULT to AliasAnalysisKind::CONSERVATIVE
- Introduce AliasAnalysisKind::FROM_SCHEMA that means the alias annotations of the schema should be honored
- Introduce AliasAnalysisKind::INTERNAL_SPECIAL_CASE to be able to run assertions that internal special cased ops are treated correctly
- aten:: and prim:: ops are not treated as special cases anymore, but just use AliasAnalysisKind::FROM_SCHEMA
- There's a set of assertions to ensure that aten:: and prim:: ops are all correctly set up to use AliasAnalysisKind::FROM_SCHEMA. Once this PR lands and passes all tests, we will remove those assertions and open up for the possibility of different AliasAnalysisKind settings for aten:: and prim:: ops
Differential Revision: D15929595
fbshipit-source-id: 7c6a9d4d29e13b8c9a856062cd6fb3f8a46a2e0d
Summary:
torch::List recently received some polishing that is now also done for Dict. This should be done before the PyTorch 1.2 release because of backwards compatibility.
- Dict is just a reference type, so "const Dict" should have the same capabilities as "Dict"; constness is not guaranteed in any way.
- DictIterator gets comparison operators <, <=, >, >=
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23344
ghstack-source-id: 87170304
Differential Revision: D16468800
fbshipit-source-id: 2978c3b9cdcfb2cfb3f26516b15bd455d9a48ba9
Summary:
This check is not needed. Even if it were, the assignment is clobbered anyway.
Closes #23300.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23370
ghstack-source-id: 87157671
Differential Revision: D16485329
fbshipit-source-id: 8ccac79e81f5e0d0d20099d550411c161f58c233
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22808
- Use `size_to_dim_`.
- `mod` is not in scope. Should be `module`.
Reviewed By: mingzhe09088
Differential Revision: D16225799
fbshipit-source-id: 9a263227d2d508eefdfddfee15fd0822819de946
Summary:
All cases should be prim ops, but let's support it. It will expect the variadic return schema to be prim::PythonOp(...) -> ...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23199
ghstack-source-id: 87113845
Differential Revision: D16431635
fbshipit-source-id: 798b6957ce5d800f7fcf981c86fdcb009cd77a78
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/22833
grad_sum_to_size does not commute with AutogradAdd after all because it turns the broadcasting AutogradAdd into a broadcasting add.
Chillee actually did most of the work tracking this down to the fusion of grad_sum_to_size and pinged me when he found the cause. Thank you!
About the choice of removing the fusion completely instead of being more precise:
- We do have grad_sum_to_size elimination, which works for cases where broadcasting does not actually happen in the forward, so the cases where fusing grad_sum_to_size is actually beneficial are much rarer than when initially proposed.
- There will be less fusion, in terms of the tests, IOU stops being fully fused. I vaguely think that it is a case we could handle with refined logic.
- Keeping it would add complexity in checking when to merge fusion groups to the complexities that this PR removes.
- The future of fusion probably lies more in more complete solutions including reductions (TVM or KeOps or our own or ...).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23372
Differential Revision: D16489930
Pulled By: soumith
fbshipit-source-id: bc0431b0d3eda264c401b634675872c4ce46f0f4
Summary:
Instead, defer its default value to CMakeLists.txt.
NO_FBGEMM has already been handled in tools/setup_helpers/env.py
(although deprecated)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23314
Differential Revision: D16493580
Pulled By: ezyang
fbshipit-source-id: 7255eb1df5e8a6dd0362507d68da0986a9ed46e2
Summary:
This is a small fix on top of gh-23348, which fixed the libtorch
nightly build timeouts.
For the latest nighly build (25 July), see
https://circleci.com/workflow-run/33d0a24a-b77c-4a8f-9ecd-5646146ce684
The only failures are these uploads, which is because `aws s3 cp`
can only deal with one file at a time. The only way to make it do
multiple files at once is:
```
aws s3 cp . "$s3_dir" --exclude "*" --include "libtorch-*.zip" --recursive --acl public-read
```
which is much more verbose. Executing one `cp` per file should be fine,
and this is also what's done in `binary_macos_upload.sh`
Closes gh-23039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23368
Differential Revision: D16488853
Pulled By: soumith
fbshipit-source-id: 6dc04b4de2f6cd2de5ae9ad57a6e980f56896498
Summary:
With this change you can now list multiple interfaces separated by
comma. ProcessGroupGloo creates a single Gloo context for every device
in the list (a context represents a connection to every other
rank). For every collective that is called, it will select the context
in a round robin fashion. The number of worker threads responsible for
executing the collectives is set to be twice the number of devices.
If you have a single physical interface, and wish to employ increased
parallelism, you can also specify
`GLOO_SOCKET_IFNAME=eth0,eth0,eth0,eth0`. This makes ProcessGroupGloo
use 4 connections per rank, 4 I/O threads, and 8 worker threads
responsible for executing the collectives.
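A minimal sketch of how this is used from Python (single-process example; `eth0` and the master address/port are placeholders):
```python
import os
import torch.distributed as dist

# Repeating one physical interface yields multiple Gloo contexts per rank;
# each collective is dispatched to a context in round-robin fashion.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0,eth0,eth0,eth0"
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)
```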
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22978
ghstack-source-id: 87006270
Differential Revision: D16339962
fbshipit-source-id: 9aa1dc93d8e131c1714db349b0cbe57e9e7266f1
Summary:
An illegal instruction is encountered in the pre-built MKL-DNN package. https://github.com/pytorch/pytorch/issues/23231
To avoid such binary compatibility issues, the HostOpts option in MKL-DNN is disabled in order to build MKL-DNN for a generic arch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23292
Differential Revision: D16488773
Pulled By: soumith
fbshipit-source-id: 9e13c76fb9cb9338103cb767d7463c10891d294a
Summary:
This is step 1 in trying to get rid of constants that are set prior to
executing the test runner. All setup logic should be concentrated in
the setupClass() function of the TestCase subclass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23223
ghstack-source-id: 87005260
Reviewed By: zhaojuanmao
Differential Revision: D16439147
fbshipit-source-id: 7a929ad4b1c8e368e33d1165becbd4d91220882c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23347
This diff replaces uint8 with int8 to match with the underlying kernel implementation. When we do int8 quantization, we are computing with uint8 (input activation) * int8 (weight) -> uint8 (output activation). The weight is quantized into int8.
Reviewed By: jianyuh
Differential Revision: D16469435
fbshipit-source-id: a697655b0e97833fc601e5980970aec4dba53c39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23354
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D16474254
Pulled By: ezyang
fbshipit-source-id: 0dd7ce02e1aa1a42a24d2af066ebd0ac5206c9a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23325
Fixes #19990
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D16473826
Pulled By: ezyang
fbshipit-source-id: 466db2c22fabd7b574f0a08aec67a18318ddb431
Summary:
Proposed PR for
https://github.com/pytorch/pytorch/issues/23342
Disables execution of QNNpack tests if IS_PPC.
Basically this parallels the same skipping of tests for IS_WINDOWS as well, which is already present.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23343
Differential Revision: D16469218
Pulled By: soumith
fbshipit-source-id: 80b651d00e5d413e359cf418f79e20d74cd9c8e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23317
Print out the kind type when export fails.
Reviewed By: zrphercule
Differential Revision: D16462641
fbshipit-source-id: 27157c0bd597362f90ac8cfb33e1808bac0ec48b
Summary:
fix https://github.com/pytorch/pytorch/issues/21044
Bicubic interpolation can cause overshoot.
OpenCV keeps the result dtype aligned with the input dtype:
- If input is uint8, the result is clamped [0, 255]
- If input is float, the result is unclamped.
In the PyTorch case, we only accept float input, so we'll keep the result unclamped and add some notes so that users can explicitly call `torch.clamp()` when necessary.
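A short sketch of the suggested workaround when clamped output is desired:
```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 8, 8)
# Bicubic interpolation can overshoot the input range; clamp explicitly if needed.
y = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
y = y.clamp(0.0, 1.0)
```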
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23321
Differential Revision: D16464796
Pulled By: ailzhang
fbshipit-source-id: 177915e525d1f54c2209e277cf73e40699ed1acd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23257
Overall context: open-source BlackBoxPredictor as the entry
point for inference in Caffe2 (thread safe abstraction for Caffe2
inference). This should be used in ThroughputBenchmark for the purpose
of framework comparison
This specific diff:
There should be no harm in moving transformation code to
OSS. On the advantages side we will be able to compare production
Caffe2 setup with PyTorch in the most fair way via
ThroughputBenchmark. This approach avoids any complicated
transformation registries. Building those properly would be a significant
engineering effort as well as a production risk. In the past we had SEVs
related to transforms being turned off due to various refactors. Given
that we don't plan to make any other significant investments into
transformation logic except existing ones (like TVM and Glow), and
those also relate to open-source technologies, I came to the
conclusion of moving the whole thing to OSS.
Reviewed By: zrphercule
Differential Revision: D16428124
fbshipit-source-id: b35deada5c015cd97b91ae12a7ea4aac53bd14b8
Summary:
Covering fleet-wide profiling, api logging, etc.
It's my first time writing rst, so suggestions are definitely welcomed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23010
Differential Revision: D16456721
Pulled By: dzhulgakov
fbshipit-source-id: 3d3018f41499d04db0dca865bb3a9652d8cdf90a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23291
This diff implements LSTM with FP16 weights based on FBGEMM.
At a high level, here are the steps:
1. Quantize and pack weight in every layer of LSTM
2. Pass weights from step 1 to the ATen `quantized_lstm` function which does matrix multiplication with FP16 weight. The following code shows the dtype of each variable used in MM:
Y = X * W + B
(fp32, fp32, fp16, fp32)
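A sketch of the user-facing path this feeds into, using the dynamic-quantization API as it later stabilized (names here may differ from the internal entry points touched in this diff):
```python
import torch

model = torch.nn.LSTM(16, 32, num_layers=1)
# Weights are converted to fp16 and packed; activations stay fp32 at runtime.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.LSTM}, dtype=torch.float16)
```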
Reviewed By: jianyuh
Differential Revision: D16389595
fbshipit-source-id: c26ae4e153c667a941f4af64e9d07fc251403cee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22733
This refactor changes the conv module to avoid the usage of the functional ops.
Reviewed By: jerryzh168
Differential Revision: D15835572
fbshipit-source-id: f2294cd708fbe8372eb3a15cc60d83777d4f7029
Summary:
It used to be run with comm_size=8, which causes flaky results in a
stress run. The flakiness was caused by too many listening sockets
being created by Gloo context initialization (8 processes times 7
sockets times 20-way concurrency, plus TIME_WAIT).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23221
ghstack-source-id: 86995596
Reviewed By: d4l3k
Differential Revision: D16437834
fbshipit-source-id: 998d0e2b087c0ab15eca64e308059c35e1b51e7b
Summary:
I manually went through all functions in `torch.*` and corrected any mismatch between the arguments mentioned in doc and the ones actually taken by the function. This fixes https://github.com/pytorch/pytorch/issues/8698.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22973
Differential Revision: D16419602
Pulled By: yf225
fbshipit-source-id: 5562c9b0b95a0759abee41f967c45efacf2267c2
Summary:
Currently the build type is decided by the environment variables DEBUG
and REL_WITH_DEB_INFO. This commit also lets CMAKE_BUILD_TYPE be
effective. This makes the interface more consistent with CMake. This
also prepares https://github.com/pytorch/pytorch/issues/22776.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22875
Differential Revision: D16281663
Pulled By: ezyang
fbshipit-source-id: 952f92aad85ff59f1c7abe8256eca8a4a0936026
Summary:
Rehash of https://github.com/pytorch/pytorch/issues/22322 .
Given that python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on python3.5+, we'd like to update the ROCm CI across the board to python3.6.
This PR adds the skip tests and some semantic changes for PyTorch.
Added pattern match skip for anything but the ROCm CI compared to #223222 for the python find step in the PyTorch build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23088
Differential Revision: D16448261
Pulled By: bddppq
fbshipit-source-id: 69ece1a213418d9abf1444c496dce1c190ee07c8
Summary:
There are a lot of formatting changes, which make other diffs to these PRs noisy and hard to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23283
Differential Revision: D16453590
Pulled By: eellison
fbshipit-source-id: 97b4bf1dbbbfb09c44c57402f61ea27287060044
Summary:
In Python, the `register_module` / `register_parameter` / `register_buffer` methods on `nn.Module` are public. This PR makes those APIs public for the C++ `nn::Module` as well. Closes https://github.com/pytorch/pytorch/issues/23140.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23196
Differential Revision: D16440239
Pulled By: yf225
fbshipit-source-id: e0eff6e1db592961fba891ec417dc74fa765e968
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23272
We see significant performance improvements by limiting concurrency
at the caffe2 level on mobile. This diff enables setting the number of caffe2
workspaces used during RNN inference.
Reviewed By: akyrola
Differential Revision: D16448611
fbshipit-source-id: 28abaddb4ea60bacb084ceb28cb7a4d1e67ccc17
Summary:
Support exporting
* Standard tensor indexing like
```
x = torch.ones(4, 5)
ind = torch.tensor([0, 1])
return x[ind]
```
* [Advanced indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing) like
```
x = torch.ones(4,5,6,7,8)
ind1 = torch.tensor([0, 1])
ind2 = torch.tensor([[3], [2]])
ind3 = torch.tensor([[2, 2], [4, 5]])
return x[2:4, ind1, None, ind2, ind3, :]
```
It would be ideal if ONNX can natively support indexing in future opsets, but for opset <= 10 it will always need this kind of workarounds.
There are still various limitations, such as not supporting advanced indexing with negative indices, not supporting mask indices of rank > 1, etc. My feeling is that these are less common cases that requires great effort to support using current opset, and it's better to not make the index export more cumbersome than it already is.
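A minimal export sketch for the standard-indexing case (assuming the default ONNX exporter and opset):
```python
import io
import torch

class Index(torch.nn.Module):
    def forward(self, x, ind):
        # Standard tensor indexing, now exportable to ONNX
        return x[ind]

buf = io.BytesIO()
torch.onnx.export(Index(), (torch.ones(4, 5), torch.tensor([0, 1])), buf)
```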
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21716
Reviewed By: zrphercule
Differential Revision: D15902199
Pulled By: houseroad
fbshipit-source-id: 5f1cc687fc9f97da18732f6a2c9dfe8f6fdb34a6
Summary:
Previously we weren't specializing the list returned from `dict.keys()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23267
Differential Revision: D16448512
Pulled By: eellison
fbshipit-source-id: fcd2a37ac680bdf90219b099a94aa36a80f4067c
Summary:
Overall context: open-source BlackBoxPredictor as the entry
point for inference in Caffe2 (thread safe abstraction for Caffe2
inference). This should be used in ThroughputBenchmark for the purpose
of framework comparison
This specific diff:
There should be no harm in moving transformation code to
OSS. On the advantages side we will be able to compare production
Caffe2 setup with PyTorch in the most fair way via
ThroughputBenchmark. This approach avoids any complicated
transformation registries. Building those properly would be a significant
engineering effort as well as a production risk. In the past we had SEVs
related to transforms being turned off due to various refactors. Given
that we don't plan to make any other significant investments into
transformation logic except existing ones (like TVM and Glow), and
those also relate to open-source technologies, I came to the
conclusion of moving the whole thing to OSS.
Reviewed By: bertmaher
Differential Revision: D16367134
fbshipit-source-id: fc6bacc1be3ff6336beb57cdad58168d3a2b8c28
Summary:
Per https://github.com/pytorch/pytorch/issues/22260, by default the number of OpenMP threads spawned equals the number of cores available. For multi-process data-parallel cases, too many threads may be spawned, which can overload the CPU and cause a performance regression.
So set OMP_NUM_THREADS = number of CPU processors / number of processes by default, to neither overload nor waste CPU threads.
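A minimal sketch of the default described above (`nproc_per_node` is a hypothetical value taken from the launcher arguments):
```python
import multiprocessing
import os

nproc_per_node = 2
# Split the available cores evenly across the launched processes.
omp_threads = max(1, multiprocessing.cpu_count() // nproc_per_node)
os.environ.setdefault("OMP_NUM_THREADS", str(omp_threads))
```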
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22501
Test Plan:
1. Without and with this change, the example code produces the same result
python ~/local/fbsource-fbcode/fbcode/caffe2/torch/distributed/launch.py --nproc_per_node=2 pytorch/examples/yanlizhao/distributed_launch_example.py
Setting OMP_NUM_THREADS environment variable for each process to be: 24, which
is max(1, num_cpus / num_processes), you can further tune the variable for optimal performance in your application if needed.
final loss = tensor(0.5211, device='cuda:0', grad_fn=<MseLossBackward>)
Differential Revision: D16092225
Pulled By: zhaojuanmao
fbshipit-source-id: b792a4c27a7ffae40e4a59e96669209c6a85e27f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23003
torch.quantization.fuse_module and torch.nn._intrinsic convRelu and LinearRelu
Fusion function to combine specific modules: (conv,bn) and (conv,bn,relu).
In all cases, replace modules in place. The first module is replaced with the _intrinsic fused module and the remaining modules are replaced by nn.Identity.
Support both training and eval. For training, the modules are "fused" with a sequential container. This is to allow for further module swaps for quantization aware training.
Also add: torch.nn._intrinsic for convRelu and LinearRelu.
TODO: Add tests for _intrinsic modules.
Conv BN fusion code is based on DsKhudia's implementation
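A post-training-style sketch, assuming the `torch.quantization.fuse_modules` spelling this API settled on (the exact name in this diff may differ):
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
model.eval()
# Conv/BN/ReLU are replaced in place by a fused module; the remaining
# positions become nn.Identity.
fused = torch.quantization.fuse_modules(model, [["0", "1", "2"]])
```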
Differential Revision: D16199720
fbshipit-source-id: 95fb9ffe72b361d280313b2ec57de2acd4f9dda2
Summary:
This adds a replace_module method to the C++ API, which is needed to be able to replace modules.
The primary use case I am aware of is to enable finetuning of models.
Given that finetuning is fairly popular these days, I think it would be good to facilitate this in the C++ api as well.
This has been reported by Jean-Christophe Lombardo on the [forums](https://discuss.pytorch.org/t/finetuning-a-model-on-multiple-gpu-in-c/49195).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22546
Differential Revision: D16440289
Pulled By: yf225
fbshipit-source-id: c136f914b8fc5c0f1975d877ea817fda5c851cda
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23022
will be tested in later diffs.
Added LinearReLU module for QAT, allowing conversion from torch.nn._intrinsic.LinearReLU to torch.nn._intrinsic.qat.LinearReLU
Reviewed By: zafartahirov
Differential Revision: D16286800
fbshipit-source-id: 84cce3551d46e649781b9b6107d4076e10e51018
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23181
We can't run dead code elimination after erasing number types because dce relies on graph invariants that erase_number_types breaks.
Reviewed By: houseroad
Differential Revision: D16427819
fbshipit-source-id: d1b98a74d2558b14d4be692219691149689a93d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23180
This pass needs to be run later because it breaks jit graph invariants and the lower_all_tuples pass still needs a valid jit graph.
Reviewed By: houseroad
Differential Revision: D16427680
fbshipit-source-id: 427c7e74c59a3d7d62f2855ed626cf6258107509
Summary:
Creating an untyped generic list is deprecated; we always want type information to be present.
This fixes test cases and removes one that used lists with ambiguous types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23192
ghstack-source-id: 86972891
Differential Revision: D16431482
fbshipit-source-id: 4ca5cd142118a3f0a4dcb8cd77383127c54abb29
Summary:
---
How does the current code subsume all detections in the deleted `nccl.py`?
- The dependency of `USE_NCCL` on the OS and `USE_CUDA` is handled as dependency options in `CMakeLists.txt`.
- The main NCCL detection happens in [FindNCCL.cmake](8377d4b32c/cmake/Modules/FindNCCL.cmake), which is called by [nccl.cmake](8377d4b32c/cmake/External/nccl.cmake). When `USE_SYSTEM_NCCL` is false, the previous Python code deferred the detection to `find_package(NCCL)`. The change in `nccl.cmake` retains this.
- `USE_STATIC_NCCL` in the previous Python code simply changes the name of the detected library. This is done in `IF (USE_STATIC_NCCL)`.
- Now we only need to look at how the lines below line 20 in `nccl.cmake` are subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done by `find_path` for headers and `find_library` for the library files in `FindNCCL.cmake`.
* The call of [find_path](https://cmake.org/cmake/help/v3.8/command/find_path.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for headers in `<prefix>/include` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. Like the Python code, this commit sets `CMAKE_PREFIX_PATH` to search for `<prefix>` in `NCCL_ROOT_DIR` and home to CUDA. `CMAKE_SYSTEM_PREFIX_PATH` includes the standard directories such as `/usr/local` and `/usr`. `NCCL_INCLUDE_DIR` is also specifically handled.
* Similarly, the call of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for libraries in directories including `<prefix>/lib` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. But it also handles the edge cases intended to be solved in the Python code more properly:
- It only searches for `<prefix>/lib64` (and `<prefix>/lib32`) if it is appropriate on the system.
- It only searches for `<prefix>/lib/<arch>` for the right `<arch>`, unlike the Python code searches for `lib/<arch>` in a generic way (e.g., the Python code searches for `/usr/lib/x86_64-linux-gnu` but in reality systems have `/usr/lib/x86_64-some-customized-name-linux-gnu`, see https://unix.stackexchange.com/a/226180/38242 ).
---
Regarding for relevant issues:
- https://github.com/pytorch/pytorch/issues/12063 and https://github.com/pytorch/pytorch/issues/2877: These are properly handled, as explained in the updated comment.
- https://github.com/pytorch/pytorch/issues/2941 does not changes NCCL detection specifically for Windows (it changed CUDA detection).
- b7e258f81ef61d19b884194cdbcd6c7089636d46: versioned library detection is added, but the order is reversed: the unversioned library becomes preferred. This is because normally unversioned libraries are linked to versioned libraries and preferred by users, and local installations by users are often unversioned. As the documentation of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) suggests:
> When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22930
Differential Revision: D16440275
Pulled By: ezyang
fbshipit-source-id: 11fe80743d4fe89b1ed6f96d5d996496e8ec01aa
Summary:
Some overlap with https://github.com/pytorch/pytorch/pull/21716 regarding caffe2 nonzero. Will rebase the other one accordingly whichever gets merged first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22601
Reviewed By: zrphercule
Differential Revision: D16224660
Pulled By: houseroad
fbshipit-source-id: dbfd1b8776cb626601e0bf83b3fcca291806e653
Summary:
https://github.com/pytorch/pytorch/issues/20153
I believe you need 2 passes for this. Take this example
```python
@torch.jit.script
def f():
    x = torch.ones(10, 9, 8, 7, 6)
    return x[..., None, None].shape
```
which results in `[10, 9, 8, 7, 6, 1, 1]`
vs
```
@torch.jit.script
def f():
    x = torch.ones(10, 9, 8, 7, 6)
    return x[..., None, None, :].shape
```
which results in `[10, 9, 8, 7, 1, 1, 6]`
After only processing `x[..., None, None` we don't know whether we should be creating a new dimension at the end of the dimension list or somewhere in the middle. What we do depends on the elements to the right of it.
Thus, I do 2 passes - one to collect all the dimensions that the index operations operate on, and another that executes the index operations.
This still doesn't work for an ellipse index followed by a tensor index, but it wasn't working previously either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22905
Differential Revision: D16433558
Pulled By: Chillee
fbshipit-source-id: c1b303cb97b1af8b6e405bad33495ef3b4c27c4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23182
This fixes the issue seen in D16390551
Changing the load op to take in shapes vector needs changes in lots of places (almost all usages of load op).
Instead this is a small and safe change where the behavior is unchanged if we are loading multiple blobs and when loading a single blob without shape information.
If you are loading just one blob and the shape information is provided, then this returns the right shape info back.
For all other cases, behavior is unchanged as before we introduced the issue.
This fixes the issue reported by Andrey in D16229465
Reviewed By: boryiingsu
Differential Revision: D16428140
fbshipit-source-id: 8ef6705ab2efb346819489e1f166e23269f7ef8a
Summary:
fbgemm requires AVX512, which requires a more recent compiler, so this also switches all the nightlies from devtoolset3 to devtoolset7. Since CUDA 9.0 doesn't support devtoolset7, we also switch from CUDA 9.0 to CUDA 9.2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22784
Differential Revision: D16428165
Pulled By: pjh5
fbshipit-source-id: c1af3729d8edce88a96fa9069d4c5a1808c25f99
Summary:
We need a way to get a complete list of features that are used in training a model. One way to do this is to make it possible to get the list of features used in each model layer. Then, once the model is complete, we can go through the layers and aggregate the features.
I've introduced a function to expose that information here, get_accessed_features, and implemented it in the FeatureSparseToDense layer to start with.
I've tried to include the minimum amount of information to make this useful, while making it easy to integrate into the variety of model layers. This is, for example, why AccessedFeatures does not contain feature_names which is not always present in a model layer. I debated whether or not to include feature_type, but I think that's useful enough, and easy enough to figure out in a model layer, that it's worth including.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23036
Test Plan:
Added a unit test to verify the behavior of get_accessed_features in FeatureSparseToDense.
aml_dper2-fblearner-flow-integration-tests failed due to a known issue D16355865
aml_dper3-fblearner-flow-integration-tests failed due to a known issue T47197113
I verified no tests in the integration tests failed to issues other than those known ones.
DPER2 canaries: https://fburl.com/fblearner/1217voga
Reviewed By: volkhin
Differential Revision: D16365380
Pulled By: kevinwilfong
fbshipit-source-id: 2dbb4d832628180336533f29f7d917cbad171950
Summary:
I ran into the following error when trying to pass a Python int as an arg to `torch::jit::createStackForSchema`, and I think it is due to the missing support for `NumberType` in [toIValue](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/pybind_utils.h#L448).
> RuntimeError: Missing cases in toIValue for type: Scalar! File a bug report. (toIValue at ../torch/csrc/jit/pybind_utils.h:449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22817
Differential Revision: D16276006
Pulled By: mrshenli
fbshipit-source-id: 7f63519bb37219445e836ec1f51ca4f98bf52c44
Summary:
Bumping up the producer_version in exported ONNX models in view of the next release. Updating tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23120
Reviewed By: zrphercule
Differential Revision: D16420917
Pulled By: houseroad
fbshipit-source-id: 6686b10523c102e924ecaf96fd3231240b4219a9
Summary:
`pickle` supports this and a lot of the quantized use cases for get/set
state follow this pattern
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23119
Pulled By: driazati
Differential Revision: D16391234
fbshipit-source-id: 9f63e0a1679daa61b17aa64b5995e2be23b07b50
Summary:
Previously we looked at the stack frame of the function that called
`script` to resolve variables. This doesn't work if someone calls script
with a function defined somewhere else that references captured
variables. We already have a mechanism to look at the closed over
variables for a function, so this changes the `rcb` to use that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22270
Pulled By: driazati
Differential Revision: D16391346
fbshipit-source-id: ad9b314ae86c249251b106079e76a5d7cf6c04c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23166
Changing the load op to take in a shapes vector requires changes in lots of places (almost all usages of the load op).
Instead this is a small and safe change where the behavior is unchanged if we are loading multiple blobs and when loading a single blob without shape information.
If you are loading just one blob and the shape information is provided, then this returns the right shape info back.
For all other cases, behavior is unchanged as before we introduced the issue.
This fixes the issue reported by Andrey in D16229465
Reviewed By: boryiingsu
Differential Revision: D16390551
fbshipit-source-id: 1055b481a7a9e83021209e59f38a7cc0b49003cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23077
Although the difference between running from Python and this approach is small if the forward method's loop is long enough (like 1000 iterations in this case).
Reviewed By: mingzhe09088
Differential Revision: D16122343
fbshipit-source-id: 5c1d1b98ae82c996baf9d42bcd04995e2ba60c78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23076
Tracing based and non tracing based added
Reviewed By: mingzhe09088
Differential Revision: D16097280
fbshipit-source-id: 3a137092f7ccc3dd2d29d95e10178ec89d3ce892
Summary:
Update the ScatterWeightedSum op for the case where there is only one weighted X to update a slice of Y, which is usually the case when the op is used for gradient updates. The changes remove the copy overhead and yield a significant operator performance improvement:
- 25-50% improvement on CUDA, depending on input configuration
- ~50% improvement on ROCm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23087
Differential Revision: D16385194
Pulled By: bddppq
fbshipit-source-id: 3189e892940fb9c26305269eb0d47479b9b71af0
Summary:
This is a small patch to avoid overwriting unchanged files, to help a bit with incremental building.
It is not as incremental as one might like, given that one has to pass `--out-of-place-only` to avoid running into the patching step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23112
Differential Revision: D16402623
Pulled By: bddppq
fbshipit-source-id: 531ce0078bc716ae31bd92c5248080ef02a065b9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22765
The pooling signature is the same as the non-quantized one; adding it to native_functions.yaml.
Reviewed By: jerryzh168
Differential Revision: D16102608
fbshipit-source-id: 7627ad8f02a231f488b74d1a245b853f89d9c419
Summary:
USE_{C11,MSC,GCC}_ATOMICS are not used in PyTorch or submodules. Now we remove their underlying detection code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23089
Differential Revision: D16402750
Pulled By: ezyang
fbshipit-source-id: fde84b958eb0b5b4d3f0406acefa92ab30ea43be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21749
This is the first version without "requantization"
Reviewed By: jerryzh168
Differential Revision: D15807940
fbshipit-source-id: 19bb0482abed8ed9d1521a3fa1f15bda8e6a6a7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23096
Nets can have state that depends on the rest of the state in the Workspace. Hence, they should be destructed first.
Reviewed By: ajyu
Differential Revision: D16382987
fbshipit-source-id: 3fd030ba206e2d0e897abb9e31c95bdaeb9482b7
Summary:
Add support for quantization aware training in eager mode
Modifications to Post training flow:
## Prepare
* Fusion: e.g. (Conv, Bn) → ConvBn (float)
* Swapping: To insert fake_quant for the weight, we need to swap the float modules that have weights with the corresponding qat modules, e.g. Conv → torch.nn.qat.Conv, ConvBn → torch.nn._intrinsic.qat.ConvBn
* Previously we were thinking about modifying the weight in a forward_pre hook and changing it back in a forward hook:
```
def forward_pre_hook(self, input):
    self.float_weight = self.weight
    self.weight = self.fake_quantize(self.float_weight)

def forward_hook(self, input):
    self.weight = self.float_weight
```
* Assignments to self.weight are needed because we can't change the forward function, and the forward function uses self.weight.
* But we would need to keep two copies of the weight in this case, so it's probably better to just swap the module
* So we want to just swap Conv to torch.nn.qat.Conv and Linear to torch.nn.qat.Linear
* qat modules will have fake_quant for output and weights inserted in forward function
## Convert
* flow should be identical to ptq, but the swapping dictionary is slightly different since modules are changed in prepare step.
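A minimal sketch of the module-swapping idea described above (the recursion and the mapping shown here are illustrative; the real flow maps float modules such as Conv to their fake-quant-aware qat counterparts):
```python
import torch.nn as nn

def swap_modules(module, mapping):
    # Recursively replace each child whose type appears in `mapping` with the
    # module returned by the corresponding factory callable.
    for name, child in module.named_children():
        if type(child) in mapping:
            setattr(module, name, mapping[type(child)](child))
        else:
            swap_modules(child, mapping)
    return module

# Illustrative usage: in the real flow the mapping would send e.g. nn.Conv2d
# to a qat Conv module; a trivial stand-in is used here so the sketch runs.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
swapped = swap_modules(model, {nn.ReLU: lambda m: nn.ReLU(inplace=True)})
```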
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23082
ghstack-source-id: 86824650
Differential Revision: D16379374
fbshipit-source-id: 7d16d1acd87025065a24942ff92abf18e9fc8070
Summary:
Overall context: open-source BlackBoxPredictor as the entry point for inference in Caffe2 (a thread-safe abstraction for Caffe2 inference). This should be used in ThroughputBenchmark for the purpose of framework comparison.
This specific diff:
There should be no harm in moving the transformation code to OSS. On the advantages side, we will be able to compare a production Caffe2 setup with PyTorch in the fairest way via ThroughputBenchmark. This approach avoids any complicated transformation registries. Building those properly would be a significant engineering effort as well as a production risk; in the past we had SEVs related to transforms being turned off due to various refactors. Given that we don't plan any other significant investments into transformation logic beyond the existing ones (like TVM and Glow), and those also relate to open-source technologies, I came to the conclusion that the whole thing should be moved to OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22877
Test Plan:
Did a bunch of unit tests locally and now waitforsandcastle.
AdFinder canary:
https://our.intern.facebook.com/intern/ads/canary/419623727275650390
adindexer:
https://our.intern.facebook.com/intern/ads/canary/419623750891549182
prospector:
https://our.intern.facebook.com/intern/ads/canary/419644899887610977
https://our.intern.facebook.com/intern/ads/canary/419645123742738405
Differential Revision: D16267765
Pulled By: salexspb
fbshipit-source-id: 776a1cd5415e0695eae28254b3f155e7a9bd8c2b
Summary:
1. Fix out-of-range memory access for reduction over all dimensions of a non-packed tensor.
2. Enable the launch config that maps block width to reduction on the fastest-striding dimension. This mapping was previously only active when reducing on the fastest-striding dimension of a packed tensor, a restriction that is not necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22827
Differential Revision: D16271897
Pulled By: zdevito
fbshipit-source-id: 20763f6cf9a58e44ffc0e7ec27724dfec8fe2c5d
Summary:
Fixes https://github.com/pytorch/pytorch/issues/22389
In most cases we only import `PIL` methods when we need them, but we missed a spot.
cc lanpa natalialunova sanekmelnikov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23023
Reviewed By: sanekmelnikov
Differential Revision: D16373492
Pulled By: orionr
fbshipit-source-id: b08bf8a9b5a861390eadf62eda21ac055777180f
Summary:
This PR fixes the invalid None return when calling get_all_math_dtype(device='cuda').
The issue came from `return dtypes.append(...)`: `append` modifies the list in place and returns None, so the function returned None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23028
Differential Revision: D16362732
Pulled By: colesbury
fbshipit-source-id: 0bbc30a0c663749d768159f1bc37b99f7263297b
Summary:
This PR aims at improving BERT performance on CPU by using `mkldnn` inner product for `nn.Linear()`.
The current logic is to use `mkldnn` only when the `input` tensor is of mkldnn layout. This PR loosens this condition: `mkldnn` will also be used for `nn.Linear()` when the `input` tensor is of dense layout. The aten tensor is viewed in place in `mkldnn` without an additional memory copy.
1. when `input.dim() >= 3`, it is viewed as a 2d tensor, e.g. `[T, N, C]` is treated as `[TN, C]`;
2. when `input` is not contiguous, it is copied so as to be contiguous, since the `mkldnn` inner product can't handle non-contiguous memory (see the sketch below).
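A minimal sketch of the 2d view described in item 1 (plain PyTorch for illustration; the shapes and layer sizes are arbitrary):
```python
import torch
import torch.nn.functional as F

T, N, C, out_features = 5, 4, 16, 32
x = torch.randn(T, N, C)              # [T, N, C] input to nn.Linear
weight = torch.randn(out_features, C)
bias = torch.randn(out_features)

# contiguous() copies only when the memory is non-contiguous (item 2);
# otherwise the view is taken in place without a copy.
x2d = x.contiguous().view(T * N, C)   # treated as [TN, C]
y = F.linear(x2d, weight, bias).view(T, N, out_features)
```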
With this PR, BERT on `glue/MRPC` inference (batch size = 1) on Xeon 6148 single socket (20 cores@2.5GHz) improves by `44%`:
1. before (unit: iterations/sec):
```bash
408/408 [00:24<00:00, 16.69it/s]
```
2. after (unit: iterations/sec):
```bash
408/408 [00:16<00:00, 24.06it/s]
```
The latency correspondingly drops from `59.92 ms` to `41.56 ms`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21851
Differential Revision: D16056334
Pulled By: dzhulgakov
fbshipit-source-id: 9b70ed58323b5e2f3f4e3ebacc766a74a8b68a8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22732
Add support for quantization aware training in eager mode
Modifications to Post training flow:
## Prepare
* Fusion: e.g. (Conv, Bn) → ConvBn (float)
* Swapping: To insert fake_quant for the weight, we need to swap the float modules that have weights with the corresponding qat modules, e.g. Conv → torch.nn.qat.Conv, ConvBn → torch.nn._intrinsic.qat.ConvBn
* Previously we were thinking about modifying the weight in a forward_pre hook and changing it back in a forward hook:
```
def forward_pre_hook(self, input):
    self.float_weight = self.weight
    self.weight = self.fake_quantize(self.float_weight)

def forward_hook(self, input):
    self.weight = self.float_weight
```
* Assignments to self.weight are needed because we can't change the forward function, and the forward function uses self.weight.
* But we would need to keep two copies of the weight in this case, so it's probably better to just swap the module
* So we want to just swap Conv to torch.nn.qat.Conv and Linear to torch.nn.qat.Linear
* qat modules will have fake_quant for output and weights inserted in forward function
## Convert
* flow should be identical to ptq, but the swapping dictionary is slightly different since modules are changed in prepare step.
Reviewed By: zafartahirov
Differential Revision: D16199356
fbshipit-source-id: 62aeaf47c12c62a87d9cac208f25f7592e245d6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22714
We need this module for add fake_quant for weight
Reviewed By: zafartahirov
Differential Revision: D16193585
fbshipit-source-id: ed6c04ecf574ca1fe1dcded22c225da05976f7a3
Summary:
When working on https://github.com/pytorch/pytorch/pull/22762, we discovered that we hadn't actually deprecated legacy autograd functions. This PR puts up the deprecation warning for 1.2, with the goal of removing legacy function support completely in the near future.
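For reference, a minimal sketch of the new-style autograd function interface (static `forward`/`backward` with a `ctx` object, invoked via `.apply`) that replaces the deprecated legacy pattern:
```python
import torch

class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = x.exp()
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        y, = ctx.saved_tensors
        return grad_output * y

x = torch.randn(3, requires_grad=True)
Exp.apply(x).sum().backward()   # legacy-style Exp()(x) is what now warns
```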
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22922
Differential Revision: D16363916
Pulled By: yf225
fbshipit-source-id: 4b554010a3d1f87a3fa45cc1aa29d019c8f1033c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22950
Print a quantized tensor by first dequantizing it and then printing. Also print the scale, zero_point, size and type of the tensor.
Reviewed By: jerryzh168
Differential Revision: D16286397
fbshipit-source-id: 2d6fb1796e5b329a77c022b18af0a39f6edde0d7
Summary:
We are planning to put up a deprecation warning for legacy autograd function in 1.2: https://github.com/pytorch/pytorch/pull/22922. This PR removes all usage of legacy function in PyTorch core and test suite, to prepare for the eventual removal of legacy function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22925
Differential Revision: D16344834
Pulled By: yf225
fbshipit-source-id: 8bf4cca740398835a08b7a290f3058c3e46781ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22316
Adding the quantized ReLU to native_functions.yaml, as it has the same signature as the non-quantized relu.
Reviewed By: jerryzh168
Differential Revision: D16038441
fbshipit-source-id: 1cfbb594eb9bca1b7ec49ca486defcf1908b0d26
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22966
We want to implement "trimmed lasso" for feature selection with learnable and regularizable weights. Trimmed lasso is a simple yet powerful improved version from traditional lasso. More reference can be found at https://arxiv.org/abs/1708.04527 and http://proceedings.mlr.press/v97/yun19a.html. For quick and necessary intro, please refer to P1-3 of the paper at https://arxiv.org/abs/1708.04527.
Given n weights, traditional lasso sums up the l1 norms of all weights. The trimmed lasso takes an input integer k (how many weights you want to select from n) and only sums over the smallest n - k weights. With lambda as the regularization constant, the penalty term applies only to the smallest n - k weights, not to the larger weights. If lambda becomes larger than a certain threshold, the smallest n - k weights are shrunk to zero; that means those weights are "dropped". With this property, k is the number of weights left after lasso, which we can easily control.
Meanwhile, we further support all available regularization in a single interface. Currently supported regularizers on weights include no reg, l1, l2, elastic, trimmed l1, elastic with trimmed l1, group l1, and logbarrier.
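A minimal sketch of the trimmed-L1 penalty itself (illustrative only; the actual regularizers live in the Caffe2 layer-model helpers):
```python
import torch

def trimmed_l1(weights, k, lam):
    # Penalize only the n - k smallest-magnitude weights, leaving the k
    # largest weights unpenalized.
    abs_w = weights.abs().flatten()
    n = abs_w.numel()
    smallest, _ = abs_w.topk(n - k, largest=False)
    return lam * smallest.sum()

w = torch.randn(10)
penalty = trimmed_l1(w, k=3, lam=0.01)
```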
Differential Revision: D16326492
fbshipit-source-id: 6e1fd75606005d9bc09d6650435c96a7984ba69c
Summary:
Given that python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on python3.5+, we'd like to update the ROCm CI across the board to python3.6.
This PR adds the skip tests and some semantic changes for PyTorch.
Open tasks/questions:
* RoiAlignTest.CheckCPUGPUEqual fails in the Caffe2 unit tests. Is this something expected / can it be skipped?
* for testing, I've used update-alternatives on CentOS/Ubuntu to select python == python 3.6. Is this the preferred way?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22322
Differential Revision: D16199862
Pulled By: ezyang
fbshipit-source-id: 46ca6029a232f7d23f3fdb5efc33ae39a379fca8
Summary:
Fixes https://github.com/pytorch/pytorch/issues/21935 by using the integer floor division that was introduced for convolution shapes in https://github.com/pytorch/pytorch/issues/9640. Without this fix, the pooling operators can produce a 1-element output in cases they shouldn't.
Disclaimer: I couldn't properly test it locally (it's not picking up the modified version for some reason). I'm marking this WIP until I've checked what the CI tools say...
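For reference, a sketch of the pooling output-size computation with integer floor division (this mirrors the standard formula; the ceil_mode adjustment shown is an assumption about the shared shape logic, not a copy of the patched code):
```python
def pool_output_size(length, kernel_size, stride, padding=0, ceil_mode=False):
    # Integer arithmetic throughout; floating-point division here can round
    # up and yield an output element whose window lies entirely in padding.
    numer = length + 2 * padding - kernel_size
    if ceil_mode:
        numer += stride - 1
    out = numer // stride + 1
    if ceil_mode and (out - 1) * stride >= length + padding:
        out -= 1  # the last window must start inside the input or left padding
    return out

assert pool_output_size(4, kernel_size=2, stride=2) == 2
```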
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22304
Differential Revision: D16181955
Pulled By: ezyang
fbshipit-source-id: a2405372753572548b40616d1206848b527c8121
Summary:
This cleans up the `torch.utils.tensorboard` API to remove all kwargs usage (which isn't clear to the user) and removes the "experimental" warning in prep for our 1.2 release.
We also don't need the additional PyTorch version checks now that we are in the codebase itself.
cc ezyang lanpa natalialunova
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21786
Reviewed By: natalialunova
Differential Revision: D15854892
Pulled By: orionr
fbshipit-source-id: 06b8498826946e578824d4b15c910edb3c2c20c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22958
When we use `extension_loader.DlopenGuard()` to dyndep or import modules, it sets the `RTLD_GLOBAL` flag and restores the original flags after the `yield`. However, if the module is not there, the yield will fail, the flags won't be restored, and all kinds of symbol conflict problems result.
Reviewed By: bddppq
Differential Revision: D16311949
fbshipit-source-id: 7b9ec6d60423ec5e78cae694b66c2f17493840b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22830
Separating the tensor generation and the generation of the quantization parameters
- Introducing hypothesis filter `assume_not_overflowing`, which makes sure that the generated tensor and qparams play well with each other. **Note: This is an expensive filter!**
- `qtensor` -> Renamed to `tensor`
- `qtensor_conv` -> Renamed to `tensor_conv2d`
- The tensors don't return the quantization parameters anymore, use `qparams` for it
- The `dtypes` argument is just a quantized dtype now.
- As before, the zero_point enforcement is predefined: if set to `None`, the zero_point will be sampled. When it is `None`, you can still override the sampling range with `zero_point_min` and `zero_point_max`
- Scale sampling can also be overriden using `scale_min` and `scale_max`
Reviewed By: jerryzh168
Differential Revision: D16234314
fbshipit-source-id: 5b538a5aa9772b7add4f2ce5eff6fd0decd48f8e
Summary:
ONNX uses virtualenv and PyTorch doesn't, so the --user flag is causing problems in the ONNX CI.
Fix it by moving the flag to PyTorch-only scripts; ninja will be installed in the ONNX CI separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22946
Reviewed By: bddppq
Differential Revision: D16297781
Pulled By: houseroad
fbshipit-source-id: 52991abac61beaf3cfbcc99af5bb1cd27b790485
Summary:
…te argument in macro
Changelog:
- Update note about tensors on CPU for the following MAGMA functions
- magma_(d/s)getrf_gpu and magma_getrf_nopiv_gpu require tensors on CPU for pivots
- magma_(d/s)geqrf2_gpu requires tensors on CPU for elementary reflectors
- magma_(d/s)syevd_gpu requires tensors on CPU for eigenvalues
- Remove dummy tensor in ALLOCATE_ARRAY MACRO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22618
Test Plan:
- All existing tests should pass to verify that the patch is correct
This PR has been proposed to eliminate confusion due to the previous comments, as indicated in https://github.com/pytorch/pytorch/issues/22573
Differential Revision: D16286198
Pulled By: zou3519
fbshipit-source-id: a5a6ec829084bdb752ca6006b8795227cbaf63b1
Summary:
This fixes up the test suite (mostly just adding `ignore` decorations to tests that need to call Python functions) so that it all passes with recursive script enabled.
The main user-facing result of this change is that Python functions are compiled without any decorators, so non-TorchScriptable code must be decorated with `torch.jit.ignore` (or `torch.jit.ignore(drop_on_export=True)` to maintain the functionality of the current `ignore`).
Details can be found in #20939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22887
Pulled By: driazati
Differential Revision: D16277608
fbshipit-source-id: 0abd0dc4291cf40651a1719bff813abb2b559640
Summary:
Motivation:
The forward method of MultiheadAttention has a key_padding_mask kwarg. This mask is of shape (N,S), where N is the batch size and S is the sequence length. It is applied prior to the attention softmax, where True values in the mask are set to float('-inf'). This allows you to mask position j from attention for all positions i in the input sequence. It's typically used to mask padded inputs, so for a sample in a batch we can make sure no encoder outputs depend on the padding inputs. Currently the Transformer, TransformerEncoder, and TransformerEncoderLayer do not have this kwarg, and only have options for (S,S), (T,T), and (S,T) masks, which are applied equally across the batch for the source input, target output, and target-source memory respectively. These masks can't be used for padding and are instead used for things like subsequent masking in language modeling, by masking the attention of position i to position j.
This diff exposes the key_padding_mask to Transformer, TransformerEncoder, and TransformerEncoderLayer forward methods which is ultimately passed to MultiheadAttention forward.
Open question: should we also allow a key_padding_mask for the decoder layer? Since padding is usually at the end of each sentence in a batch and sentences are usually decoded from left to right, people usually deal with padding on decoded outputs by just masking those outputs at the loss layer. There might be some scenarios where it's needed, though I don't think it would be common. People can also still just subclass and override the layers. We could also pass the input key_padding_mask to the memory <> decoder attention layer. Not sure if that's necessary though, because the output of position i from each attention encoder layer won't depend on any masked positions in the input (even if position i is a masked position itself), so there's not really any point in masking position i again.
Adds the key_padding_mask kwarg to Transformer, TransformerEncoder, and TransformerEncoderLayer forward methods.
The standard TransformerEncoderLayer uses a MultiheadAttention layer as self_attn. MultiheadAttention forward method has a key_padding_mask kwarg that allows for masking of values such as padding per sequence in a batch, in contrast to the attn_mask kwarg which is usually of shape (S,S) and applied equally across the batch.
MultiheadAttention calls functional.multi_head_attention_forward, which has the same key_padding_mask kwarg of shape (N,S). Masked (True) values are set to float('-inf').
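A minimal usage sketch of the new kwarg (shapes and mask values are illustrative):
```python
import torch
import torch.nn as nn

S, N, E = 7, 2, 16                       # sequence length, batch size, embed dim
src = torch.randn(S, N, E)

# True marks padded key positions that should be ignored by attention.
key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
key_padding_mask[0, 5:] = True           # last two tokens of sample 0 are padding

attn = nn.MultiheadAttention(embed_dim=E, num_heads=4)
out, weights = attn(src, src, src, key_padding_mask=key_padding_mask)

# With this change the same mask can be forwarded through the encoder stack:
layer = nn.TransformerEncoderLayer(d_model=E, nhead=4)
encoded = layer(src, src_key_padding_mask=key_padding_mask)
```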
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22588
Test Plan:
buck test mode/dev caffe2/test:nn -- 'test_transformerencoderlayer \(test_nn\.TestNN\)'
buck test mode/dev caffe2/test:nn -- 'test_Transformer_cell \(test_nn\.TestNN\)'
buck test mode/dev caffe2/test:nn -- 'test_transformer_args_check \(test_nn\.TestNN\)'
Differential Revision: D16112263
Pulled By: lucasgadams
fbshipit-source-id: dc4147dd1f89b55a4c94e8c701f16f0ffdc1d1a2
Summary:
Asterisks start emphasis in reST. We should either escape them or mark them up as interpreted text.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22896
Differential Revision: D16282869
Pulled By: zou3519
fbshipit-source-id: 15ec4286434db55fb8357b1a12e6f70ef54f8c66
Summary:
The sccache wrapping strategy causes problems for at-runtime compilation of MIOpen kernels. We therefore, after the builds of caffe2/pytorch are complete, unwrap sccache again by moving the actual clang-9 binary back into its original place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22743
Differential Revision: D16283329
Pulled By: bddppq
fbshipit-source-id: 4fcdc92be295d5ea9aba75c30e39af1a18a80c13
Summary:
This is achieved by using `cuDevicePrimaryCtxGetState` as a way to check whether a primary context exists on a device. It is not too slow, from this benchmark of a single call to it on CUDA 10.1, Titan Xp, driver 415.27:
```
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState 301 ns 301 ns 2319746
```
Commits:
1. Add `CUDAHooks::getDeviceWithPrimaryContext` which returns a device index with primary context (if exists).
Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check primary context in `pin_memory`.
Fix `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
Add test for this in that file.
Fixes https://github.com/pytorch/pytorch/issues/21081.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22229
Differential Revision: D16170194
Pulled By: zou3519
fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d
Summary:
…te argument in macro
Changelog:
- Update note about tensors on CPU for the following MAGMA functions
- magma_(d/s)getrf_gpu and magma_getrf_nopiv_gpu require tensors on CPU for pivots
- magma_(d/s)geqrf2_gpu requires tensors on CPU for elementary reflectors
- magma_(d/s)syevd_gpu requires tensors on CPU for eigenvalues
- Remove dummy tensor in ALLOCATE_ARRAY MACRO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22618
Test Plan:
- All existing tests should pass to verify that the patch is correct
This PR has been proposed to eliminate confusion due to the previous comments, as indicated in https://github.com/pytorch/pytorch/issues/22573
Differential Revision: D16227440
Pulled By: zou3519
fbshipit-source-id: 97d5537c5da98c0ed3edc4668a09294794fc426b
Summary:
…rides
Changelog:
- Fix behavior of `torch.triu` / `torch.tril` on certain unsqueezed tensors that lead to uninitialized values on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22730
Test Plan:
- Add tests for these cases in test_triu_tril in test_torch
Fixes https://github.com/pytorch/pytorch/issues/22581
Differential Revision: D16222897
Pulled By: zou3519
fbshipit-source-id: b86b060187797e5cd2a7731421dff1ba2b5c9596
Summary:
Align the behavior of `torch.utils.cpp_extension.CUDA_HOME` with that of `tools.setup_helpers.cuda.CUDA_HOME`.
Specifically, I swapped the positions of guess 2 and guess 3 in `torch.utils.cpp_extension.CUDA_HOME`.
Fixing issue https://github.com/pytorch/pytorch/issues/22844
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22845
Differential Revision: D16276241
Pulled By: zou3519
fbshipit-source-id: 3b62b439b2f794a6f3637a5fee58991f430985fe
Summary:
We introduced RTTI in a recent change: https://github.com/pytorch/pytorch/pull/21613
For the internal mobile build we don't enable '-frtti' yet, so this diff replaces RTTI with an alternative approach.
According to dzhulgakov, we can compare two tensors' type_id directly in most cases. This is stricter than comparing the TensorImpl subclass type, since the TensorImpl -> type_id mapping is 1-to-n, but it is more appropriate for this use case.
The only two cases where we relax the direct type comparison (for legacy reasons) are:
1. CPUTensor <-> CUDATensor;
2. SparseCPUTensor <-> SparseCUDATensor;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22773
Differential Revision: D16277696
Pulled By: ljk53
fbshipit-source-id: 043e264fbacc37b7a11af2046983c70ddb62a599
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22892
Think of num_runs as manually running the binary <num_runs> times. Each run executes the operator for many iterations.
Reviewed By: hl475
Differential Revision: D16271597
fbshipit-source-id: b6f509ee0332c70f85bec0d447b84940c5c0cecd
Summary:
Since recursive script creates a ScriptModule from an `nn.Module`, there are no ties to the original module to pull a type name from, so we have to pass it in explicitly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22873
Pulled By: driazati
Differential Revision: D16268547
fbshipit-source-id: 902a30e6e36427c6ba7033ded027a29d9dcbc1ee
Summary:
Changelog:
- Port SVD TH implementation to ATen/native/BatchLinearAlgebra.cpp
- Port SVD THC implementation to ATen/native/cuda/BatchLinearAlgebra.cu
- Allow batches of matrices as arguments to `torch.svd`
- Remove existing implementations in TH and THC
- Update doc string
- Update derivatives to support batching
- Modify nuclear norm implementation to use at::svd instead of _batch_svd
- Remove _batch_svd as it is redundant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21588
Test Plan:
- Add new test suite for SVD in test_torch.py with port to test_cuda.py
- Add tests in common_methods_invocations.py for derivative testing
Differential Revision: D16266115
Pulled By: nairbv
fbshipit-source-id: e89bb0dbd8f2d58bd758b7830d2389c477aa61fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22517
Force anybody creating an untyped Dict to call c10::impl::deprecatedUntypedDict().
This should hopefully make it clear that this is not public API and prevent people from using it.
Reviewed By: dzhulgakov
Differential Revision: D16115214
fbshipit-source-id: 2c8d0e4e375339c699d583995f79c05c59693c3e
Summary:
Introduce Azure Pipelines for the linting checks. This is meant to be equivalent to the existing Travis linting phase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22839
Differential Revision: D16260376
Pulled By: ezyang
fbshipit-source-id: 1e535c3096358be67a0dad4cd920a92082b2d18e
Summary:
As part of the Variable/Tensor merge, `variable.tensor_data()` should be removed in favor of `variable.detach()`. This PR removes `tensor_data()` call sites in Python `Variable()` and `nn.Parameter()` constructor paths.
Note that this PR is BC-breaking in the following way:
- For Python `Variable()` constructor:
Previously, in-place updating a tensor after it's been used to create a Variable does not bump the Variable's version counter, which causes the following problem:
```python
t = torch.ones(2, 3)
v = torch.autograd.Variable(t).requires_grad_()
y = v * v
t.add_(1) # This bumps version counter of `t`
y.sum().backward() # This computes `v`'s gradient incorrectly before this patch, and throws error after this patch
```
After this patch, in-place updating a tensor after it's been used to create a Variable will also bump the Variable's version counter, thus preserving the correctness of the Variable's version counter.
- For Python `nn.Parameter()` constructor:
Previously, in-place updating a tensor after it's been used to create an nn.Parameter does not bump the nn.Parameter's version counter, which causes the following problem:
```python
t = torch.ones(2, 3)
v = torch.nn.Parameter(t)
y = v * v
t.add_(1) # This bumps version counter of `t`
y.sum().backward() # This computes `v`'s gradient incorrectly before this patch, and throws error after this patch
```
After this patch, in-place updating a tensor after it's been used to create an nn.Parameter will also bump the nn.Parameter's version counter, thus preserving the correctness of the nn.Parameter's version counter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22821
Differential Revision: D16258030
Pulled By: yf225
fbshipit-source-id: 9a6d68cea1864893193dbefbb6ef0c1d5ca12d78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22829
Sending out caffe2 load op changes separately since we want pick it to open source.
This change is needed because the shape information of the blobs is determined from the load operator and that shape information is needed in our download_group.
Reviewed By: boryiingsu
Differential Revision: D16229465
fbshipit-source-id: f78b2df9a7f26968d70eca68dde75cd11ab6f7a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22323
This diff adds an interface to use quantized Linear op in JIT.
Reviewed By: jamesr66a
Differential Revision: D16040724
fbshipit-source-id: 90e90aff9973c96ea076ed6a21ae02c349ee2bcf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22023
This diff implements the Linear operation with fp16 weights based on FBGEMM. At a high level, we want to perform the following operation:
Y = X * W + B, with dtypes (Y: fp32, X: fp32, W: fp16, B: fp32).
To do that, three steps are needed:
1. Quantize weights from fp32 to fp16, this is done using `PackedGemmMatrixFP16` in the `fbgemm_pack_gemm_matrix_fp16`
2. Conduct matrix multiplication with quantized weights using `cblas_gemm_compute` in `fbgemm_linear_fp16_weight`
3. Add bias to the result from step2 and return the final Y
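A conceptual sketch of the numerics described in the three steps above (plain PyTorch, not the FBGEMM code path; shapes are arbitrary):
```python
import torch

X = torch.randn(4, 8)             # fp32 activations
W = torch.randn(16, 8)            # fp32 weights
B = torch.randn(16)               # fp32 bias

W_fp16 = W.half()                 # step 1: pack/quantize the weights to fp16
Y = X @ W_fp16.float().t() + B    # steps 2-3: matmul against the fp16 weights
                                  # (accumulated in fp32), then add the bias
```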
Reviewed By: jianyuh
Differential Revision: D15921768
fbshipit-source-id: dc4e5b366f846ce9d58975876940a9b3372b8b8d
Summary:
Add support for breaks and continues in the jit. We do this with a Graph transform pre-SSA.
A graph of the form
```
def test():
while i < 5:
if i == 3:
break
i += 1
print(i)
```
has the body of the loop transformed to
```
if i == 3:
did_break = True
else:
did_break = False
if did_break:
loop_exit = True
else:
i += 1
print(i)
loop_exit = i < 5
```
I am going to add more tests but I think it is ready for review now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21692
Differential Revision: D16215807
Pulled By: eellison
fbshipit-source-id: 365102f42de4861d9323caaeb39a96de7619a667
Summary:
This is an extension to the original PR https://github.com/pytorch/pytorch/pull/21765
1. Increase the coverage of different opsets support, comments, and blacklisting.
2. Adding backend tests for both caffe2 and onnxruntime on opset 7 and opset 8.
3. Reusing onnx model tests in caffe2 for onnxruntime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22421
Reviewed By: zrphercule
Differential Revision: D16225518
Pulled By: houseroad
fbshipit-source-id: 01ae3eed85111a83a0124e9e95512b80109d6aee
Summary:
Using PMCTest (https://www.agner.org/optimize/) to measure
TensorIterator construction, this results in ~600 fewer instructions
retired (~300 fewer cycles) for constructing TensorIterator on a 1D
tensor. (Should be roughly ~100 ns, but it's hard to measure that
precisely end-to-end).
```
Before:
Clock Core cyc Instruct Uops L1D Miss
5082 2768 5690 7644 3
After:
Clock Core cyc Instruct Uops L1D Miss
4518 2437 5109 6992 0
```
Note that Instruct is reliable, Core cyc is a little noisy, and Clock
is a little more noisy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22756
Differential Revision: D16207777
Pulled By: VitalyFedyunin
fbshipit-source-id: bcc453a90472d9951a1c123bcb1b7a243fde70ac
Summary:
Speeds up the common case where Tensor is a torch.Tensor (not a
subclass). This reduces the number of executed instructions for a
torch.add(tensor1, tensor2) by ~328 (should be ~65 ns faster).
Note that most of the PythonArgs accessors are too large to be inlined.
We should move most of them to the cpp file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22782
Differential Revision: D16223592
Pulled By: colesbury
fbshipit-source-id: cc20f8989944389d5a5e3fab033cdd70d581ffb1
Summary:
This PR aims at improving `topk()` performance on CPU. This is useful when computing **beam search** during `Transformer` and `BERT`.
Given a tensor x of size `[N, C]`, and we want to apply `x.topk(K)`, the current logic is **sequentially** loop on the dimension of `N` and do **quick select** on the dimension of `C` so as to find out top K elements.
Performance can be further improved from:
- On the dimension of `N`, it can be paralleled
- Maybe a faster sorting algorithm for `topk`. (After a bunch of experimenting, `std::partial_sort` seems to be the most promising)
So I compared 3 versions:
1. vanilla: sequential + quick select
2. reference PR https://github.com/pytorch/pytorch/issues/19737: parallel + quick select
3. this PR: parallel + partial sort
with the following benchmark, on `Xeon 8180, 2*28 cores@2.5 GHz`:
```python
import torch
from time import time
num_iters = 1000
def bench_topk(N=8, C=168560, k=10):
    a = torch.randn(N, C)
    # warm up
    for i in range(100):
        torch.topk(a, k)
    t = 0
    for i in range(num_iters):
        a = torch.randn(N, C)
        start = time()
        value, indice = torch.topk(a, k)
        t += time() - start
    print("#[%d, %d] times: %f ms" % (N, C, t / num_iters * 1000))

Ns = [10, 20, 30]
Cs = [10000, 20000, 40000, 80000, 160000, 320000]
for n in Ns:
    for c in Cs:
        bench_topk(N=n, C=c)
```
### vanilla: sequential + quick select
```
#[10, 10000] times: 0.746740 ms
#[10, 20000] times: 1.437399 ms
#[10, 40000] times: 2.832455 ms
#[10, 80000] times: 5.649426 ms
#[10, 160000] times: 11.309466 ms
#[10, 320000] times: 22.798765 ms
#[20, 10000] times: 1.511303 ms
#[20, 20000] times: 2.822024 ms
#[20, 40000] times: 5.564770 ms
#[20, 80000] times: 11.443044 ms
#[20, 160000] times: 22.747731 ms
#[20, 320000] times: 46.234449 ms
#[30, 10000] times: 2.214045 ms
#[30, 20000] times: 4.236179 ms
#[30, 40000] times: 8.418577 ms
#[30, 80000] times: 17.067578 ms
#[30, 160000] times: 33.826214 ms
#[30, 320000] times: 68.109420 ms
```
### reference PR: parallel + quick select
```
#[10, 10000] times: 0.271649 ms
#[10, 20000] times: 0.593016 ms
#[10, 40000] times: 1.133518 ms
#[10, 80000] times: 2.082355 ms
#[10, 160000] times: 4.049928 ms
#[10, 320000] times: 7.321285 ms
#[20, 10000] times: 0.315255 ms
#[20, 20000] times: 0.539054 ms
#[20, 40000] times: 1.000675 ms
#[20, 80000] times: 1.914586 ms
#[20, 160000] times: 4.437122 ms
#[20, 320000] times: 8.822445 ms
#[30, 10000] times: 0.347209 ms
#[30, 20000] times: 0.589947 ms
#[30, 40000] times: 1.102814 ms
#[30, 80000] times: 2.112201 ms
#[30, 160000] times: 5.186837 ms
#[30, 320000] times: 10.523023 ms
```
### this PR: parallel + partial sort
```
#[10, 10000] times: 0.150284 ms
#[10, 20000] times: 0.220089 ms
#[10, 40000] times: 0.521875 ms
#[10, 80000] times: 0.965593 ms
#[10, 160000] times: 2.312356 ms
#[10, 320000] times: 4.759422 ms
#[20, 10000] times: 0.167630 ms
#[20, 20000] times: 0.265607 ms
#[20, 40000] times: 0.471477 ms
#[20, 80000] times: 0.974572 ms
#[20, 160000] times: 3.269645 ms
#[20, 320000] times: 6.538608 ms
#[30, 10000] times: 0.204976 ms
#[30, 20000] times: 0.342833 ms
#[30, 40000] times: 0.589381 ms
#[30, 80000] times: 1.398579 ms
#[30, 160000] times: 3.904077 ms
#[30, 320000] times: 9.681224 ms
```
In summary, `2` is **5x** faster than `vanilla` on average and `3` is **8.6x** faster than `vanilla`.
On the `Fairseq Transformer`, the default parameters on the `wmt14` dataset yield a `topk` size of `[8, 168560]`, and this operator gets `3x` faster with this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19736
Differential Revision: D16204820
Pulled By: VitalyFedyunin
fbshipit-source-id: ea70562c9149a0d832cf5872a891042ebd74fc63
Summary:
For three 1-D operands, compute_strides now takes 298 instructions instead
of 480. (Saves ~36 ns). We'll want to make Tensor::sizes(), strides(), and
element_size() trivially inlinable to speed this up more.
(Using PMCTest from https://www.agner.org/optimize/ to measure instructions retired)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22779
Differential Revision: D16223595
Pulled By: colesbury
fbshipit-source-id: e4730755f29a0aea9cbc82c2d376a8e6a0c7bce8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22781
The custom op is required to make the op benchmark work with JIT. Run `python setup.py install` in the pt_extension directory to install it.
Reviewed By: hl475
Differential Revision: D16214430
fbshipit-source-id: c9221c532011f9cf0d5453ac8535a6cde65e8376
Summary:
Currently ONNX constant folding (`do_constant_folding=True` arg in `torch.onnx.export` API) supports only opset 9 of ONNX. For opset 10, it is a no-op. This change enables ONNX constant folding for opset 10. Specifically there are three main changes:
1) Turn on constant folding ONNX pass for opset 10.
2) Update support for opset 10 version of `onnx::Slice` op for backend computation during constant folding.
3) Enable constant folding tests in `test/onnx/test_utility_funs.py` for multiple opsets (9 and 10).
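A minimal export sketch exercising the new path (the model and file name are illustrative):
```python
import torch

model = torch.nn.Linear(4, 2)
dummy = torch.randn(1, 4)
torch.onnx.export(model, dummy, "linear.onnx",
                  opset_version=10,        # constant folding now applies here too
                  do_constant_folding=True)
```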
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22515
Reviewed By: zrphercule
Differential Revision: D16189336
Pulled By: houseroad
fbshipit-source-id: 3e2e748a06e4228b69a18c5458ca71491bd13875
Summary:
1. Restrict block.z <= 64, compliant with the CUDA maximum z-dimension of a block;
2. clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22602
Differential Revision: D16203857
Pulled By: ezyang
fbshipit-source-id: 567719ae175681a48eb0f818ca0aba409dca2550
Summary:
Some other environment variables can be added to speed things up for development.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22736
Differential Revision: D16200904
Pulled By: soumith
fbshipit-source-id: 797ef91a863a244a6c96e0adf64d9f9b4c9a9582
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22706
Moved the models used for quantization test from the test_quantization.py file to common_quantization.py
Reviewed By: jerryzh168
Differential Revision: D16189865
fbshipit-source-id: 409b43454b6b3fe278ac16b1affb9085d6ed6835
Summary:
Previously in tracing when we called a script function we would inline the graph and set the graph inputs equal to the types the graph was invoked with.
This breaks for optional arguments invoked with None since we rely on None being set to Optional[T] in schema matching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22686
Differential Revision: D16186372
Pulled By: eellison
fbshipit-source-id: e25c807c63527bf442eb8b31122d50689c7822f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22694
Move quantization and quantized utility functions for testing to common_quantized.py and common_quantization.py. Additionally, add a quantized test case base class which contains common methods for checking the results of quantization on modules. As a consequence of the move, fixed the imports at the top of test_quantized.py and test_quantization.py to use the new utilities.
Reviewed By: jerryzh168
Differential Revision: D16172012
fbshipit-source-id: 329166af5555fc829f26bf1383d682c25c01a7d9
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22631
Test Plan:
test suite
Imported from OSS
Differential Revision: D16185040
fbshipit-source-id: 9b83749f6c9cd05d13f54a3bb4801e263293252b
Summary:
After converting BN layers to SyncBN layers, the function sets all `requires_grad = True` regardless of the original requires_grad states. I think this is a bug and have fixed it in this PR (see the sketch below).
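A minimal sketch of the behavior this fixes, assuming conversion via `torch.nn.SyncBatchNorm.convert_sync_batchnorm`:
```python
import torch

model = torch.nn.Sequential(torch.nn.BatchNorm2d(8))
model[0].weight.requires_grad_(False)        # frozen affine parameter

sync_model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
# After the fix, the frozen state is preserved instead of being reset to True.
assert sync_model[0].weight.requires_grad is False
```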
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22569
Differential Revision: D16151647
Pulled By: zou3519
fbshipit-source-id: e2ad1886c94d8882485e7fb8be51ad76469ecc67
Summary:
Addressing potential dependency issue by adding forward declaration for OutputArchive/InputArchive.
This change follows the same pattern in base.h in 'torch/csrc/api/include/torch/data/samplers/base.h'
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22562
Differential Revision: D16161524
Pulled By: soumith
fbshipit-source-id: d03f8a2ece5629762f9fa8a27b15b0d037e8f07b
Summary:
Also revert the change to cmake.py in c97829d7011bd59d662f6af9c3a0ec302e7e75fc. The comments are added to prevent similar incidents in the future (which have occurred a couple of times in the past).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22641
Differential Revision: D16171763
Pulled By: ezyang
fbshipit-source-id: 5a65f9fbb3c1c798ebd25521932bfde0ad3d16fc
Summary:
No need to `clone` if the expanded size matches original size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22634
Differential Revision: D16171091
Pulled By: ezyang
fbshipit-source-id: 3d8f116398f02952488e321c0ee0ff2868768a0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21209
This diff introduces a new interface to add a list of operators. Here are the steps to add ops using this interface:
- create op_list:
```
unary_ops_list = op_bench.op_list(
    attr_names=["op_name", "op_function"],
    attrs=[
        ["abs", torch.abs],
        ["abs_", torch.abs_],
    ],
)
```
- create a bench class:
```
class UnaryOpBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, op_function):
        self.input_one = torch.rand(M, N)
        self.op_func = op_function

    def forward(self):
        return self.op_func(self.input_one)
```
- register those ops:
```
op_bench.generate_pt_tests_from_list(unary_ops_list, unary_ops_configs, UnaryOpBenchmark)
```
Reviewed By: zheng-xq
Differential Revision: D15514188
fbshipit-source-id: f09b359cab8175eeb8d51b3ad7bbbcfbc9f6430f
Summary:
The error for `test_error_stack_module`:
```
Traceback (most recent call last):
  File "../test.py", line 35, in <module>
    scripted = torch.jit.script(M())
  File "/home/davidriazati/other/pytorch/torch/jit/__init__.py", line 1119, in script
    return _convert_to_script_module(obj)
  File "/home/davidriazati/other/pytorch/torch/jit/__init__.py", line 1825, in _convert_to_script_module
    raise e
RuntimeError:
d(int x) -> int:
Expected a value of type 'int' for argument 'x' but instead found type 'str'.
:
at ../test.py:11:12
def c(x):
return d("hello") + d(x)
~ <--- HERE
'c' is being compiled since it was called from 'b'
at ../test.py:14:12
def b(x):
return c(x)
~~~ <--- HERE
'b' is being compiled since it was called from 'forward'
at ../test.py:22:16
def forward(self, x):
return b(x)
~~~ <--- HERE
'forward' is being compiled since it was called from 'forward'
at ../test.py:31:20
def forward(self, x):
return x + self.submodule(x)
~~~~~~~~~~~~~~~~ <--- HERE
```
This also unifies our error reporting in the front end with `ErrorReport`
TODO
* Include module names in message, #22207 should make this easy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22280
Pulled By: driazati
Differential Revision: D16060781
fbshipit-source-id: c42968b53aaddb774ac69d5abbf7e60c23df8eed
Summary:
Some of my qpth users have told me that updating to the latest version of PyTorch and replacing the btrifact/btrisolve calls with the LU ones wasn't working and I didn't believe them until I tried it myself :)
These updates have broken unpivoted LU factorizations/solves on CUDA. The LU factorization code used to return the identity permutation when pivoting wasn't used but now returns all zeros as the pivots. This PR reverts it back to return the identity permutation. I've not yet tested this code as I'm having some trouble compiling PyTorch with this and am hitting https://github.com/pytorch/pytorch/issues/21700 and am not sure how to disable that option.
Here's a MWE to reproduce the broken behavior, and my fix.
```python
torch.manual_seed(0)
n = 4
L = torch.randn(n,n)
A = L.mm(L.t()).unsqueeze(0)
b = torch.randn(1, n)
A_lu_cpu = torch.lu(A)
A_lu_cuda_nopivot = torch.lu(A.cuda(), pivot=False)
A_lu_cuda_pivot = torch.lu(A.cuda(), pivot=True)
print('A_lu_cuda_nopivot\n', A_lu_cuda_nopivot)
print('-----\nA_lu_cuda_pivot\n', A_lu_cuda_nopivot)
x_cpu = b.lu_solve(*A_lu_cpu)
x_cuda_nopivot = b.cuda().lu_solve(*A_lu_cuda_nopivot)
x_cuda_nopivot_fixed = b.cuda().lu_solve(
    A_lu_cuda_nopivot[0], torch.arange(1, n+1, device='cuda:0').int())
x_cuda_pivot = b.cuda().lu_solve(*A_lu_cuda_pivot)
print(x_cpu, x_cuda_nopivot, x_cuda_nopivot_fixed, x_cuda_pivot)
```
Output:
```
A_lu_cuda_nopivot
(tensor([[[ 2.8465, -0.7560, 0.8716, -1.7337],
[-0.2656, 5.5724, -1.1316, 0.6678],
[ 0.3062, -0.2031, 1.4206, -0.5438],
[-0.6091, 0.1198, -0.3828, 1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))
-----
A_lu_cuda_pivot
(tensor([[[ 2.8465, -0.7560, 0.8716, -1.7337],
[-0.2656, 5.5724, -1.1316, 0.6678],
[ 0.3062, -0.2031, 1.4206, -0.5438],
[-0.6091, 0.1198, -0.3828, 1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))
(tensor([[-0.3121, -0.1673, -0.4450, -0.2483]]),
tensor([[-0.1661, -0.1875, -0.5694, -0.4772]], device='cuda:0'),
tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'),
tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22242
Differential Revision: D16049334
Pulled By: ezyang
fbshipit-source-id: 7eacae810d87ffbdf8e07159bbbc03866dd9979d
Summary:
This PR activates faster depthwise convolution kernels for Volta and Turing GPUs using cudnn >= 7600.
The script to benchmark the current PyTorch master branch and this PR branch can be found [here](https://gist.github.com/ptrblck/4590cf20721d8f43296c9903abd4a774).
(50 warmup iterations, 1000 iterations for timing)
I've used https://github.com/pytorch/pytorch/issues/3265 to create a similar benchmark and added a few additional setups.
Since the results are quite long, I've uploaded them in a spreadsheet [here](https://docs.google.com/spreadsheets/d/13ByXcqg7LQUr3DVG3XpLwnJ-CXg3GUZJ3puyTMw9n2I/edit?usp=sharing).
Times are given in ms per iteration.
We've benchmarked this PR on a DGX1 using V100 GPUs.
The current workload check in `check_cudnn_depthwise_workload` is quite long and can be moved to another file, if wanted.
CC ngimel (Thanks for the support while benchmarking it ;) )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22302
Differential Revision: D16115057
Pulled By: ezyang
fbshipit-source-id: bad184658518e73b4d6b849d77e408f5a7a757de
Summary:
Having the NVRTC stub in ATen is necessary to call driver APIs in ATen. This is currently blocking https://github.com/pytorch/pytorch/pull/22229.
`DynamicLibrary` is also moved as it is used in the stub code, and seems general enough.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22362
Differential Revision: D16131787
Pulled By: ezyang
fbshipit-source-id: add2ee8a8865229578aa00001a00d5a6671e0e73
Summary:
Syncing worker requirement mismatches to improve remote build time.
Created actions:
MEDIUM: 488
LARGE: 29
XXLARGE: 2
Updated actions:
From MEDIUM to LARGE: 227
From XLARGE to MEDIUM: 1
From XLARGE to LARGE: 1
From XLARGE to XXLARGE: 1
From LARGE to MEDIUM: 2
From LARGE to XLARGE: 2
Differential Revision: D16161669
fbshipit-source-id: 67a4e0d883ca3f1ca3185a8285903c0961537757
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22143
Like the Conv DNNLOWP operator, allow FC to run the slow path to debug numerical issues caused by Intel's int8 instruction that does horizontal addition of two int8 multiplication results in 16 bits.
Reviewed By: hx89
Differential Revision: D15966885
fbshipit-source-id: c6726376a3e39d341fd8aeb0e54e0450d2af8920
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22174
This is a preliminary change outlining the approach we plan to follow to integrate QNNPACK operators into the PyTorch backend. The operators will not be made visible to the user in the Python world, so ultimately we will have a function that calls the QNNPACK backend based on the environment it is run on.
The goal of the project is to integrate QNNPACK library with PyTorch to achieve good performance for quantized mobile models.
Reviewed By: ljk53
Differential Revision: D15806325
fbshipit-source-id: c14e1d864ac94570333a7b14031ea231d095c2ae
Summary:
Some duplicated code is removed. It also becomes clear that there is only one special case `div_kernel_cuda` is handling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22555
Differential Revision: D16152091
Pulled By: zou3519
fbshipit-source-id: bb875370077c1f84efe4b766b3e1acc461e73e6c
Summary:
Fix a grammatical error in the comment on line 233:
change from " Returns an `OrderedDict` of he submodules of this `Module`"
to " Returns an `OrderedDict` of the submodules of this `Module`"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22548
Differential Revision: D16134534
Pulled By: zou3519
fbshipit-source-id: 33b1dd0fbc3a24bef99b6e0192566e2839292842
Summary:
As part of the Variable/Tensor merge, we want to be able to pass Variables into Caffe2 without doing extra shallow copy, to improve performance and also allow for in-place mutations in Caffe2 ops. There are a few approaches outlined in https://github.com/pytorch/pytorch/pull/22418, and this PR is the chosen approach.
Specifically, we can have the assumption that we won't be connecting autograd to C2 gradients at any point (as it's too tricky and not that useful). Therefore, we can pass Variable into Caffe2 ops by requiring that all Variables in Caffe2 don't require grad. For code paths in Caffe2 that might potentially track gradients (e.g. `ScriptModuleOp` and `call_caffe2_op_from_c10`), we use the `torch::NoGradGuard` to make sure gradients are not tracked.
This supersedes https://github.com/pytorch/pytorch/pull/22418.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22473
Differential Revision: D16099042
Pulled By: yf225
fbshipit-source-id: 57efc3c7cfb3048d9abe90e63759acc14ebd2972
Summary:
Forgot to mirror the `nn/__init__.py` semantics in the new `nn` type stub.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22411
Differential Revision: D16149798
Pulled By: ezyang
fbshipit-source-id: 0ffa256fbdc5e5383a7b9c9c3ae61acd11de1dba
Summary:
`addcmul_out` overwrote the samples, which led to constant values being output by `torch.normal`.
Changelog:
- Replace the `addcmul_out` calls with a combination of in-place `mul` and `add`, and add a justification for this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22533
Test Plan:
- Enable tests for test_normal on all devices
Fixes https://github.com/pytorch/pytorch/issues/22529
Differential Revision: D16141337
Pulled By: ezyang
fbshipit-source-id: 567a399042e0adcd154582f362318ce95a244c62
Summary:
Currently, specifying different build options with respect to the "USE_" series is in quite a disarray. There are a lot of build options that accept three variants: USE_OPTION, WITH_OPTION, and NO_OPTION. Some build options only accept the USE_ and NO_ variants, and some accept only USE_.
This inconsistency is quite confusing and hard to maintain.
To resolve this inconsistency, we can either let all these build options support all three variants, or support only the USE_ variant.
This commit takes a step toward the latter choice, i.e., it deprecates and sets a date for removing the NO_ and WITH_ variants, keeping only the USE_ variant. This is likely better than the former solution because:
- NO_ and WITH_ variants are not documented.
- CMakeLists.txt only has the USE_ variants of the relevant build options defined. It would be a surprise if users passed the other variants to CMake during a rebuild and found them ineffective.
- Multiple variants are difficult to maintain.
- The behavior is confusing if more than one variant is passed. For
example, what to be expected if one sets "NO_CUDA=1 USE_CUDA=1"?
The downside is that this will break backward compatibility for existing
build scripts in the future (if they used the undocumented build
options).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22474
Differential Revision: D16149396
Pulled By: ezyang
fbshipit-source-id: 7145b88ad195db2051772b9665dd708dfcf50b7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22477
There is actually no use of an uninitialized variable, but some compilers are not smart enough to reason that the two if branches are always taken together.
Reviewed By: hx89
Differential Revision: D16100211
fbshipit-source-id: 25f01d668063603d7aaa776451afe8a10415d2ea
Summary:
After the Variable/Tensor merge, code paths in ATen need to be able to check whether a tensor requires gradient, and throw errors in places where a `requires_grad=true` tensor is not allowed (such as https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Utils.h#L76-L78 and https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/SparseTensorImpl.cpp#L86). Since the `GradMode` thread-local variable controls whether a tensor should accumulate gradients, we need to be able to check this variable from ATen when we determine whether a tensor requires gradient, hence the PR to move `GradMode` / `AutoGradMode` / `NoGradGuard` to ATen.
Note that we intentionally don't merge `at::GradMode` and `at::NonVariableTypeMode`, with the following reasoning:
Semantically, `at::GradMode` and `at::NonVariableTypeMode` actually mean different things: `at::GradMode` controls whether a tensor should accumulate gradients, and `at::NonVariableTypeMode` controls whether a Variable should be treated as a non-Variable tensor in type dispatches. There are places whether we *don't* want the tensor to accumulate gradients, but *still* want the Variable to be treated as a Variable. Here is one example:
```python
# torch/tensor.py
with torch.no_grad():
...
new_tensor = self.new() # `at::GradMode` is false at this point
...
```
```cpp
// tools/autograd/templates/python_variable_methods.cpp
static PyObject * THPVariable_new(PyObject* self, PyObject* args, PyObject* kwargs)
{
...
// if we merge `at::GradMode` and `at::NonVariableTypeMode`, since `at::GradMode` is false and `self_.type()` checks `at::GradMode` to decide whether to return non-Variable type, it will return a non-Variable type here, which is not what we want (and throws a "Tensor that was converted to Variable was not actually a Variable" error)
return THPVariable_Wrap(torch::utils::legacy_tensor_new(self_.type(), args, kwargs));
...
}
```
For the above reason, we cannot merge `at::GradMode` and `at::NonVariableTypeMode`, as they have different purposes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18573
Differential Revision: D16134413
Pulled By: yf225
fbshipit-source-id: 6140347e78bc54206506499c264818eb693cdb8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22479
In some cases, for example when training on CTR data, we would like to start training from old samples and finish on the most recent samples.
This diff adds an option to disable shuffling in DistributedSampler to accommodate this use case.
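A minimal usage sketch of the new option (the dataset and replica configuration are illustrative):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))
# shuffle=False keeps samples in their original (e.g. chronological) order.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=False)
loader = DataLoader(dataset, sampler=sampler, batch_size=8)
```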
Reviewed By: soumith
Differential Revision: D16100388
fbshipit-source-id: 35566581f5250040b2db5ec408a63037b47a9f5d
Summary:
Replaces https://github.com/pytorch/pytorch/pull/21501 because ghimport had errors I couldn't figure out when I tried to import the stack :'(
This has the two commits that were previously accepted plus the merge commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22561
Differential Revision: D16135743
Pulled By: eellison
fbshipit-source-id: f0a98842ccb334c7ceab04d1437e09dc76be0eb1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22516
Force anybody creating an untyped Dict to call c10::impl::deprecatedUntypedDict().
This should hopefully make it clear that this is not public API and prevent people from using it.
Differential Revision: D16115215
fbshipit-source-id: 2ef4cb443da1cdf4ebf5b99851f69de0be730b97
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22005
When a Dict or List is created with type information, it will remember that.
If at any point later, this list is instantiated to a List<T> with a concrete type, it will assert that T is the correct type.
Differential Revision: D15914462
fbshipit-source-id: a8c3d91cb6d28d0c1ac0b57a4c4c6ac137153ff7
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22551
Test Plan:
ran test locally
Imported from OSS
Differential Revision: D16132182
fbshipit-source-id: 5b9efbf883efa66c4d8b7c400bdb804ac668a631
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22510
Added a new function to implement clone operation on quantized tensors. Also added a test case which can be tested as shown in test plan.
This change is required to be able to call torch.jit.trace on quantized models.
Clone implementation calls copy_ on QTensor internally.
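A hedged sketch of the new capability; the quantization factory call below uses today's `torch.quantize_per_tensor` name and made-up scale/zero_point values:
```python
import torch

x = torch.randn(4)
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
q_copy = q.clone()  # clone() on a quantized tensor, backed by copy_ on QTensor
print(q_copy.q_scale(), q_copy.q_zero_point())
```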
Differential Revision: D16059576
fbshipit-source-id: 226918cd475521b664ed72ee336a3da8212ddcdc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22397
Test Plan:
Added test for reentrant backwards with checkpoint and a test for a recursive backwards function (which should fail if we run all the reentrant tasks recursively in the same thread) and for testing priority of reentrant tasks.
~~Will add a test for priority of reentrant tasks in future pr.~~
Imported from OSS
Differential Revision: D16131955
fbshipit-source-id: 18301d45c1ec9fbeb566b1016dbaf7a84a09c7ac
Summary:
Currently, the **stream** parameter is not set when launching these two kernels: softmax_warp_forward() and softmax_warp_backward(), i.e. the kernels are always put on the default stream, which may fail to respect the stream that was set previously. Add **at::cuda::getCurrentCUDAStream()** as a launch argument to fix this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22470
Differential Revision: D16115051
Pulled By: izdeby
fbshipit-source-id: 38b27e768bb5fcecc1a06143ab5d63b0e68a279e
Summary:
re-apply changes reverted in:
https://github.com/pytorch/pytorch/pull/22412
Also change log_softmax to take positional arguments. Long-term we do want the kwarg-only interface, but it currently seems to be incompatible with jit serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22456
Differential Revision: D16097159
Pulled By: nairbv
fbshipit-source-id: 8cb73e9ca18fc66b35b873cf4a574b167a578b3d
Summary:
* Deletes all weak script decorators / associated data structures / methods
* In order to keep supporting the standard library in script, this enables recursive script on any function defined in `torch.nn`
* Most changes in `torch/nn` are the result of `ag -Q "weak" torch/nn/ -l | xargs sed -i '/weak/d'`, only `rnn.py` needed manual editing to use the `ignore` and `export` to continue supporting the overloaded `forward` methods
* `Sequential`/`ModuleList` no longer need to be added to constants since they are compiled on demand
This should also fix https://github.com/pytorch/pytorch/issues/22212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22212
Differential Revision: D15988346
Pulled By: driazati
fbshipit-source-id: af223e3ad0580be895377312949997a70e988e4f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22309
This diff enables PT operators to run with JIT mode. Users can control eager and JIT mode using the `use_jit` flag.
In this diff, we put operators in a loop and pass it to JIT. One extra step, which wraps the operator with the `_consume` op, is introduced to avoid the dead code elimination optimization in JIT. With that, the reported time includes the real operator execution time plus the `_consume` op (which directly returns the input; nothing else happens inside).
Reviewed By: zheng-xq
Differential Revision: D16033082
fbshipit-source-id: e03be89fd5a505e44e81015dfc63db9cd76fb8a1
Summary:
- Fix typo in ```torch/onnx/utils.py``` when looking up registered custom ops.
- Add a simple test case
1. Register custom op with ```TorchScript``` using ```cpp_extension.load_inline```.
2. Register custom op with ```torch.onnx.symbolic``` using ```register_custom_op_symbolic``` (as sketched below).
3. Export model with custom op, and verify with Caffe2 backend.
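A hedged sketch of step 2; the op name `mynamespace::custom_add` and the symbolic body are made up for illustration, and only `register_custom_op_symbolic` itself is the existing API:
```python
import torch.onnx

def custom_add_symbolic(g, a, b):
    # Map the (hypothetical) custom op onto a plain ONNX Add node.
    return g.op("Add", a, b)

torch.onnx.register_custom_op_symbolic("mynamespace::custom_add", custom_add_symbolic, 9)
```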
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21321
Differential Revision: D16101097
Pulled By: houseroad
fbshipit-source-id: 084f8b55e230e1cb6e9bd7bd52d7946cefda8e33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21432
This diff introduces a new interface to generate tests based on the metadata of operators.
Reviewed By: ajauhri
Differential Revision: D15675542
fbshipit-source-id: ba60e803ea553d8b9eb6cb2bcdc6a0368ef62b1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22499
Another place where onnx export is running dead code elimination after making the jit graph invalid. Fixing it.
Reviewed By: houseroad
Differential Revision: D16111969
fbshipit-source-id: 5ba80340c06d091988858077f142ea4e3da0638c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22348
This is the last step of LRU hash eviction weight re-init. This diff checks if there are evicted values in sparse_lookup; if so, it calls the op created in D15709866 to re-init the values for the indices in evicted_values. Also created a gradient op for the operator. The gradient op just passes the output gradient as the input gradient.
Reviewed By: itomatik
Differential Revision: D16044736
fbshipit-source-id: 9afb85209b0de1038c5153bcb7dfc5f52e0b2abb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22476
Dead code elimination assumes a valid jit graph because it checks if operators have side effects.
The onnx export path destroys the jit graph right before calling dead code elimination, but it actually doesn't care about side effects.
We can just call dead code elimination and disable side effect lookup and things should work.
Reviewed By: houseroad
Differential Revision: D16100172
fbshipit-source-id: 8c790055e0d76c4227394cafa93b07d1310f2cea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22441
This include doesn't seem to be needed. Remove it to simplify mobile build dependency.
Reviewed By: dreiss
Differential Revision: D16088224
fbshipit-source-id: f6aec21655e259726412e26a006d785912436c2a
Summary:
This has been requested in https://github.com/pytorch/pytorch/issues/20323
(It is still not exactly the same as NumPy, which allows you to pass tensors as mean/std and broadcast them with size, but the present PR is extremely simple and does the main thing people are asking for)
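A hedged sketch of the main usage being asked for (scalar mean/std plus an explicit `size`, analogous to `numpy.random.normal(loc, scale, size)`):
```python
import torch

samples = torch.normal(2.0, 0.5, size=(3, 4))  # scalar mean/std, explicit shape
print(samples.shape)                           # torch.Size([3, 4])
```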
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20545
Differential Revision: D15358736
Pulled By: zhangguanheng66
fbshipit-source-id: 762ea5eab5b8667afbac2df0137df017ba6e413c
Summary:
The changes include:
1. Allow key/value to have a different number of features from query. This supports the case where key and value have different feature dimensions (see the sketch after this list).
2. Support three separate proj_weights, in addition to a single in_proj_weight. The proj_weight of key and value may have different dimensions from that of query, so three separate proj_weights are necessary. In case key and value have the same dimension as query, it is preferred to use a single large proj_weight for performance reasons. However, it should be noted that using a single large weight or three separate weights is a size-dependent decision.
3. Give an option to use static k and v in the multihead_attn operator (see saved_k and saved_v). Those static key/value tensors can now be re-used when training the model.
4. Add more test cases to cover the arguments.
Note: current users should not be affected by the changes.
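A hedged sketch of points 1 and 2, assuming the key/value feature sizes are exposed as `kdim`/`vdim` keyword arguments on `nn.MultiheadAttention`:
```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, kdim=8, vdim=12)
query = torch.randn(5, 2, 16)  # (target_len, batch, embed_dim)
key = torch.randn(7, 2, 8)     # key uses a different feature size
value = torch.randn(7, 2, 12)  # value uses yet another feature size
out, attn_weights = attn(query, key, value)
print(out.shape)               # torch.Size([5, 2, 16])
```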
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21288
Differential Revision: D15738808
Pulled By: zhangguanheng66
fbshipit-source-id: 288b995787ad55fba374184b3d15b5c6fe9abb5c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21927
Add `OUTPUT_PROB` output to CTCBeamSearchDecoderOp to return a probability for each sequence.
Add argument to output top-k instead of top-1 decoded sequences.
Reviewed By: SuperIRabbit
Differential Revision: D15797371
fbshipit-source-id: 737ca5cc4f90a0bcc3660ac9f58519a175977b69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22461
We shouldn't call dead code elimination after EraseNumberTypes because dead code elimination assumes a valid jit graph which EraseNumberTypes just broke.
Let's have it clean up after itself instead.
Reviewed By: houseroad
Differential Revision: D16094656
fbshipit-source-id: f2752277d764e78ab276c57d56b2724b872b136f
Summary:
It's always set to equal USE_NCCL; we made Gloo depend on the Caffe2 NCCL
build. See 30da84fbe1614138d6d9968c1475cb7dc459cd4b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22467
Differential Revision: D16098581
Pulled By: ezyang
fbshipit-source-id: f706ec7cebc2e6315bafca013b669f5a72e04815
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22279
This new operator is used for embedding table weight re-init. After we get the evicted indices, they will be the rows that need resetting in the embedding table. Then we can create a 1d tensor with default values, and apply this operator to copy the tensor to all evicted rows in the embedding table.
Will add a gradient op in the next diff.
Reviewed By: itomatik
Differential Revision: D15709866
fbshipit-source-id: 2297b70a7326591524d0be09c73a588da245cc08
Summary:
The sgemm in cuBLAS 9.0 has some issues with sizes above 2M on Maxwell and Pascal architectures. Warn in this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22034
Differential Revision: D15949930
Pulled By: zhangguanheng66
fbshipit-source-id: 0af977ec7900c76328d23898071de9c23778ff8b
Summary:
ROCm is already detected in cmake/public/LoadHIP.cmake. No need to
detect twice. Plus, the Python script reads the environment variable
ROCM_HOME, but what is really used in the CMake scripts is ROCM_PATH -- a
user must get both environment variables right. Since ROCM_HOME is
undocumented, this commit completely eradicates it.
---
ezyang A remake of https://github.com/pytorch/pytorch/issues/22228 because its dependency has been dismissed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22464
Differential Revision: D16096833
Pulled By: bddppq
fbshipit-source-id: fea461e80ee61ec77fa3a7b476f7aec4fc453d5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22425
Currently, in bound_shape_inference.cc: InferBoundShapeAndType, we first infer ops in order and then infer the inputs of concat in reverse order. In the tiny version of ctr_instagram_model, concat is right before FC, so we can infer the inputs for concat. But in the production version, we found there are some ops between concat and FC (or other ops whose shape we know), so the shapes of these ops cannot be inferred.
This diff is a temporary solution for this problem: infer shapes in order and in reverse order repeatedly until there are no more changes.
Reviewed By: yinghai, ipiszy
Differential Revision: D16082521
fbshipit-source-id: d5066509368029c6736dce156030adf5c38653d7
Summary:
MKL-DNN is the main library for computation when we use ideep device. It can use kernels implemented by different algorithms (including JIT, CBLAS, etc.) for computation. We add the "USE_MKLDNN_CBLAS" (default OFF) build option so that users can decide whether to use CBLAS computation methods or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19014
Differential Revision: D16094090
Pulled By: ezyang
fbshipit-source-id: 3f0b1d1a59a327ea0d1456e2752f2edd78d96ccc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22004
In the future, we want all dicts/lists to store information about the types they contain.
This is only possible if the creation API doesn't allow creating lists/dicts without type information.
This diff updates some call sites that don't specify type information so that they do specify it.
Reviewed By: dzhulgakov
Differential Revision: D15906387
fbshipit-source-id: 64766a2534b52c221e8a5501a85eaad13812e7bd
Summary:
Currently the build system accepts USE_NAMEDTENSOR from the environment
variable and turns it into NAMEDTENSOR_ENABLED when passing to CMake.
This discrepancy does not seem necessary and complicates the build
system. The naming of this build option is also semantically incorrect
("BUILD_" vis-a-vis "USE_"). This commit eradicate this issue before it
is made into a stable release.
The support of NO_NAMEDTENSOR is also removed, since PyTorch has been
quite inconsistent about "NO_*" build options.
---
Note: All environment variables with their names starting with `BUILD_` are currently automatically passed to CMake with no need of an additional wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22360
Differential Revision: D16074509
Pulled By: zou3519
fbshipit-source-id: dc316287e26192118f3c99b945454bc50535b2ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21389
As titled. To do weight re-init on evicted rows in the embedding table, we need to pass the info of the evicted hashed values to SparseLookup, which is the layer model responsible for constructing the embedding table and doing pooling.
To pass evicted values, we need to adjust the output record of lru_sparse_hash to include the evicted values, and add an optional input to all processors that need to take in a sparse segment. For SparseLookup to get the evicted values, its input record needs to be adjusted. Now the input record can have type IdList, IdScoreList, or a struct of feature + evicted values.
Reviewed By: itomatik
Differential Revision: D15590307
fbshipit-source-id: e493881909830d5ca5806a743a2a713198c100c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22241
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20387
glibc has a non-standard function, feenableexcept, that triggers the floating-point exception handler. Compared to feclearexcept + fetestexcept, this approach allows us to see precisely where the exception is raised from the stack trace.
Reviewed By: jspark1105
Differential Revision: D15301095
fbshipit-source-id: 94f6e72456b2280f78d7d01c2ee069ae46d609bb
Summary:
empty_like uses the tensor options of `self`, rather than the passed in tensor options. This means it messes up variable/tensor types, and ignores specifications like different dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21978
Differential Revision: D15903948
Pulled By: gchanan
fbshipit-source-id: f29946be01c543f888daef2e99fe928e7b7d9d74
Summary:
# What is this?
This is an implementation of the AdamW optimizer as implemented in [the fastai library](803894051b/fastai/callback.py) and as initially introduced in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101). It decouples the weight decay regularization step from the optimization step during training.
There have already been several abortive attempts to push this into pytorch in some form or fashion: https://github.com/pytorch/pytorch/pull/17468, https://github.com/pytorch/pytorch/pull/10866, https://github.com/pytorch/pytorch/pull/3740, https://github.com/pytorch/pytorch/pull/4429. Hopefully this one goes through.
# Why is this important?
Via a simple reparameterization, it can be shown that L2 regularization has a weight decay effect in the case of SGD optimization. Because of this, L2 regularization became synonymous with the concept of weight decay. However, it can be shown that the equivalence of L2 regularization and weight decay breaks down for more complex adaptive optimization schemes. It was shown in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) that this is the reason why models trained with SGD achieve better generalization than those trained with Adam. Weight decay is a very effective regularizer. L2 regularization, in and of itself, is much less effective. By explicitly decaying the weights, we can achieve state-of-the-art results while also taking advantage of the quick convergence properties that adaptive optimization schemes have.
# How was this tested?
There were test cases added to `test_optim.py` and I also ran a [little experiment](https://gist.github.com/mjacar/0c9809b96513daff84fe3d9938f08638) to validate that this implementation is equivalent to the fastai implementation.
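A minimal usage sketch, assuming the optimizer lands as `torch.optim.AdamW` with the standard optimizer interface:
```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = F.mse_loss(model(x), y)
loss.backward()
optimizer.step()       # weight decay is applied to the weights directly,
optimizer.zero_grad()  # decoupled from the gradient-based update
```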
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21250
Differential Revision: D16060339
Pulled By: vincentqb
fbshipit-source-id: ded7cc9cfd3fde81f655b9ffb3e3d6b3543a4709
Summary:
Address the issue raised in https://github.com/pytorch/pytorch/issues/22377.
The PR https://github.com/pytorch/pytorch/issues/22016 introduces a temporary tensor of weights `grad_weight_per_segment` of the same dtype as the end result, which can be a problem when using `float16`.
In this PR, it now uses a `float32` temporary tensor when the input is `float16`.
ngimel, can I get you to review? I think I have fixed the issues you have pointed out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22401
Differential Revision: D16077319
Pulled By: mrshenli
fbshipit-source-id: 7cfad7f40b4d41a244052baa2982ab51bbbd7309
Summary:
The CMake modifications include removal of some unnecessary paths
(e.g. find_package(CUDA) and friends) that are no longer used since
c10d is always part of the larger torch build. The macro
`C10D_USE_...` was ambiguous and is now removed in favor of only
having top level `USE_...`. The c10d test suite is changed to include
skip annotations for the tests that depend on Gloo as well.
Now, if you compile with `USE_DISTRIBUTED=1` and `USE_GLOO=0` you get
a functioning build for which the tests actually pass.
Closes https://github.com/pytorch/pytorch/issues/18851.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22257
Differential Revision: D16087993
Pulled By: pietern
fbshipit-source-id: 0cea66bd5cbd9736b06fa1d45ee13a18cab88adb
Summary:
The `assert False` lint error has been causing CI to fail:
./torch/utils/throughput_benchmark.py:14:13: B011 Do not call assert False since python -O removes these calls. Instead callers should raise AssertionError().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22424
Differential Revision: D16083464
Pulled By: bddppq
fbshipit-source-id: 6d96e36c8fcbb391d071b75fe79c22d526c1ba3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22429
Android NDK r20 removes the guard `(__ANDROID_API__ <= __ANDROID_API_O_MR1__)`, so we do it here also. There is insufficient reason to keep these decls undefined for earlier API levels. NDK r15 and earlier don't even define `__ANDROID_API_O_MR1__`, so the preprocessor defaults it to 0 and the guard evaluates as TRUE.
Reviewed By: smeenai, hlu1
Differential Revision: D16084105
fbshipit-source-id: f0857b3eb0573fe219f0d6c5e6583f89e2b5518f
Summary:
This change adds advanced support for cross-chunk shuffling.
For training with a static dataset, the default configuration is at the user's disposal. However, in some use cases, new data is added to the dataset over each epoch, so the dataset's size is dynamically changing/increasing. In order to mix the new data and the old data for better random sampling, one approach is to shuffle examples from more than one chunk. This feature is supported with this change. By specifying `cross_chunk_shuffle_count_` on construction, advanced users can specify how many chunks to shuffle examples from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22347
Differential Revision: D16081378
Pulled By: zhangguanheng66
fbshipit-source-id: fd001dfb9e66947839adecfb9893156fbbce80d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22413
_jit_pass_erase_number_types invalidates the jit graph but parts of _jit_pass_onnx rely on having a valid jit graph.
This splits _jit_pass_onnx into _jit_pass_onnx_remove_print and _jit_pass_onnx_preprocess_caffe2 (which rely on the valid jit graph), runs these before _jit_pass_erase_number_types,
and then runs the rest of _jit_pass_onnx after _jit_pass_erase_number_types
Reviewed By: houseroad
Differential Revision: D16079890
fbshipit-source-id: ae68b87dced077f76cbf1335ef3bf89984413224
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22334
Improve the function signatures of save_to_db and load_from_db in predictor_exporter.
Reviewed By: akyrola
Differential Revision: D16047208
fbshipit-source-id: a4e947f86e00ef3b3dd32c57efe58f76a38fcec7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22293
Just wrapping the C class with a nicer Python interface; for now, just print it
directly to get all the data. Later we can add various
visualizations there.
Differential Revision: D16023999
fbshipit-source-id: 8436e37e36965821a690035617784dcdc352dcd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22292
As we do an atomic fetch_add to validate whether a thread should
finish, we should not take the last iteration into account. As a
result, the total number of iterations should be exactly the same as the user
sets via config.num_iters.
Now when running a unit test I see the exact number of iterations reported.
Differential Revision: D16023963
fbshipit-source-id: 3b12ee17276628ecd7b0979f28cd6deb777a1543
Summary:
As part of the Variable/Tensor merge, one invariant for tensor libraries such as ATen / Caffe2 / XLA is that they should only deal with Tensors, not Variables. However, currently in `variable_factories.h` we are potentially passing Variables into those tensor libraries without the `at::AutoNonVariableTypeMode` guard, which will cause those libraries to treat those Variables as Variables (i.e. their `is_variable()` is true), not Tensors.
Consider the following example for `full_like`:
```cpp
inline at::Tensor full_like(const at::Tensor & self, at::Scalar fill_value) {
  ...
  // Both ATen and XLA rely on `at::full_like` to dispatch to library specific implementations.
  //
  // When `self` is a Variable, since we are not using `at::AutoNonVariableTypeMode`,
  // `at::full_like` will also use `self` as a Variable (and it will see that `self.is_variable()` is true),
  // which breaks the invariant that ATen / XLA should never deal with Variables.
  at::Tensor tensor = at::full_like(self, fill_value, self.options().is_variable(false));
  at::Tensor result =
      autograd::make_variable_consuming(std::move(tensor), /*requires_grad=*/false);
  ...
  return result;
}
```
Instead, the invariant-preserving implementation would be:
```cpp
inline at::Tensor full_like(const at::Tensor & self, at::Scalar fill_value) {
  ...
  at::Tensor tensor = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    // Both ATen and XLA rely on `at::full_like` to dispatch to library specific implementations.
    //
    // When `self` is a Variable, since we have `at::AutoNonVariableTypeMode` in the scope,
    // `at::full_like` will use `self` as a Tensor (and it will see that `self.is_variable()` is false),
    // which preserves the invariant that ATen / XLA should only deal with Tensors.
    return at::full_like(self, fill_value, self.options().is_variable(false));
  })();
  at::Tensor result =
      autograd::make_variable_consuming(std::move(tensor), /*requires_grad=*/false);
  ...
  return result;
}
```
This PR makes the suggested change for all variable factory functions.
cc. ailzhang This should allow us to remove all `tensor_data()` calls in the XLA codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22364
Differential Revision: D16074862
Pulled By: yf225
fbshipit-source-id: 3deba94b90bec92a757041ec05d604401a30c353
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22285
Previously forward hooks were expected to return None; this PR adds support for overwriting the input and output in `forward_pre_hook` and `forward_hook`. This is used to implement inserting quant/dequant function calls around forward functions.
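A hedged sketch of what the new behavior enables; the fake quant/dequant hooks below are made up for illustration:
```python
import torch
import torch.nn as nn

def fake_quant_pre_hook(module, inputs):
    # Returning a tuple from a forward_pre_hook replaces the inputs to forward().
    return tuple(torch.round(x * 4) / 4 for x in inputs)

def fake_dequant_hook(module, inputs, output):
    # Returning a value from a forward_hook replaces the output of forward().
    return output * 1.0

m = nn.Linear(4, 4)
m.register_forward_pre_hook(fake_quant_pre_hook)
m.register_forward_hook(fake_dequant_hook)
out = m(torch.randn(2, 4))
```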
Differential Revision: D16022491
fbshipit-source-id: 02340080745f22c8ea8a2f80c2c08e3a88e37253
Summary:
As per attached tasks, these are noops and are being deprecated/removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22113
Reviewed By: philipjameson
Differential Revision: D15901131
fbshipit-source-id: 3acf12208f692548afe4844be13717a49d74af32
Summary:
Saying `I` in an error message is too subjective for a framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22369
Differential Revision: D16067712
Pulled By: soumith
fbshipit-source-id: 2a390646bd5b15674c99f65e3c460a7272f508b6
Summary:
`setup.py` recommends setting `USE_QNNPACK=0` and `USE_NNPACK=0` to disable building QNNPACK and NNPACK respectively. However this wasn't reflected correctly because we were looking for `NO_QNNPACK` and `NO_NNPACK`. This PR fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22367
Differential Revision: D16067393
Pulled By: soumith
fbshipit-source-id: 6491865ade9a6d41b7a79d68fd586a7854051f28
Summary:
Say the user passes reduction=False. Of course, we can't add a bool and a string, so the ValueError itself will error, which is more confusing to the user. Instead, we should use string formatting. I would use `f"{reduction} is not..."` but I'm unsure whether we are ok with using f"" strings.
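A sketch of the suggested fix using `str.format`; the helper name and the set of valid values are placeholders, not the actual loss-function code:
```python
def check_reduction(reduction):
    valid = ("none", "mean", "sum")
    if reduction not in valid:
        # Works even when `reduction` is a bool or any other non-string value.
        raise ValueError("{} is not a valid value for reduction".format(reduction))

check_reduction("mean")   # ok
# check_reduction(False)  # would raise: "False is not a valid value for reduction"
```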
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22160
Differential Revision: D15981826
Pulled By: soumith
fbshipit-source-id: 279f34bb64a72578c36bdbabe2da83d2fa4b93d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22319
The onnx pass replacing ints with Tensors produces an invalid JIT graph. It should only be called right before the onnx pass.
Also, it should only be called if we actually export to onnx.
Reviewed By: houseroad
Differential Revision: D16040374
fbshipit-source-id: e78849ee07850acd897fd9eba60b6401fdc4965b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22317
About to add an observer that is also statically initialized in a different
file, so we need to enforce initialization order.
Reviewed By: ilia-cher
Differential Revision: D16012275
fbshipit-source-id: f26e57149a5e326fd34cb51bde93ee99e65403c4
Summary:
Syncing worker requirement mismatches to improve remote build time.
Created actions:
MEDIUM: 445
LARGE: 354
Updated actions:
From MEDIUM to LARGE: 21
From LARGE to XLARGE: 34
From LARGE to MEDIUM: 9
From XLARGE to MEDIUM: 1
Differential Revision: D16047893
fbshipit-source-id: 7afab2ef879277f114d67fd1da9f5102ec04ed7f
Summary:
This does not occur in CUDA code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22271
Differential Revision: D16024605
Pulled By: bddppq
fbshipit-source-id: bb4f16bacbdc040faa59751fba97958f4c2d33cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22307
MSVC-specific pragma doesn't silence the warning about throwing constructor and therefore `clang-cl` fails to compile this file. This diff fixes the problem by adding additional check for `clang` compiler.
Reviewed By: smessmer
Differential Revision: D16032324
fbshipit-source-id: 6dbce0ebf0a533d3e42b476294720590b43a8448
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21921
Call FBGEMM kernels to implement quantized linear operator. This operator is used only for inference.
Differential Revision: D15375695
fbshipit-source-id: b9ca6c156fd60481fea83e55603b2897f7bfc3eb
Summary:
Reduction of gradients for unused parameters should happen as soon as
possible, because they potentially block reduction of gradients for
used parameters. This used to happen instantly when
`prepare_for_backward` was called and it found parameters that didn't
contribute. This meant that if you have a model with unused
parameters, and you want to discard the model output (i.e. not call
backward on some loss), reduction of the gradients of those unused
parameters would have been kicked off, and you'd see an error the next
time you called `forward`.
In this commit, this original approach is slightly changed to delay
reduction of the gradients of those unused parameters until the first
autograd hook is called. This means that you can now discard the model
output regardless of the model having unused parameters or not.
This is a prerequisite for making the `find_unused_parameters`
argument to DDP default to `True`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22219
Differential Revision: D16028698
Pulled By: pietern
fbshipit-source-id: c6aec2cd39c4a77746495d9cb1c9fb9c5ac61983
Summary:
This adds the rest of the `dict.???` methods that were missing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21979
Pulled By: driazati
Differential Revision: D16023573
fbshipit-source-id: 3ea9bd905090e2a176af654a8ca98c7d965ea679
Summary:
In talks with smessmer, we decided that it'd be better to put the logic in `list`, as optimal behavior requires knowing `.capacity()`
Results on my cpu (for the benchmark here: https://twitter.com/VahidK/status/1138674536679821312) now look like this:
```
Pytorch batch_gather took 0.018311 seconds.
Pytorch batch_gather jit took 0.013921 seconds.
Pytorch vectorized batch_gather took 0.001384 seconds.
```
Previously, `batch_gather jit` took 3x as long as `batch_gather`.
Some logic taken from https://github.com/pytorch/pytorch/pull/21690. Note that these two PR's are somewhat orthogonal. That PR handles this benchmark by looking at the alias analysis, while this PR specializes for `+=`.
Note that we can't jit the vectorized version as we think `torch.arange` returns a float tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21896
Differential Revision: D15998628
Pulled By: Chillee
fbshipit-source-id: b0085960da4613578b94deb98ac62c0a4532a8c3
Summary:
This is yet another step to disentangle Python build scripts and CMake
and improve their integration (Let CMake handle more build environment
detections, and less by our handcrafted Python scripts).
The processor detection logic also changed a bit: Instead of detecting
whether the system processor is PPC or ARM, this PR changes to detect
Intel CPUs, because this is more precise as MKL only supports Intel
CPUs. The build option `USE_MKLDNN` will also not be presented to
users on non-Intel processors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22215
Differential Revision: D16005953
Pulled By: ezyang
fbshipit-source-id: bf3f74d53609b3f835e280f63a872ff3c9352763
Summary:
When dealing with a large-scale dataset, it is handy if we can save the dataset status and resume later. Especially in cases where some unexpected crash happens, users don't need to start the whole dataset over from the beginning. Instead, they can reload it from the last checkpoint.
This change adds support for checkpoint save/load logic in ChunkDataset.
On ChunkDataset construction, the user can specify a file name from which to load the checkpoint. If it is empty, we default to starting fresh; otherwise the ChunkDataset will 'fast forward' the chunk sampler to the corresponding checkpoint.
The user can also call ChunkDataset::save() to serialize the current status to a file, which can be used later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21889
Differential Revision: D16024582
Pulled By: ailzhang
fbshipit-source-id: 1862ab5116f94c9d29da174ce04a91041d06cad5
Summary:
`cmake/public/LoadHIP.cmake` calls `find_package(miopen)`, which uses the CMake module in MIOpen installation (It includes the line `set(miopen_DIR ${MIOPEN_PATH}/lib/cmake/miopen)`). `cmake/Modules/FindMIOpen.cmake` is not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22244
Differential Revision: D16000771
Pulled By: bddppq
fbshipit-source-id: 07bb40fdf033521e8427fc351715d47e6e30ed34
Summary:
The original name `copy_tensor_data` could be confusing because users are not sure whether it deep-copies data in the tensor's storage or just copies the tensor's metadata. The renaming makes it more clear.
cc. ailzhang This might break XLA build, but I think the renaming makes it more clear why we use `copy_tensor_data` in XLATensorImpl's shallow-copy functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22266
Differential Revision: D16014724
Pulled By: yf225
fbshipit-source-id: f6ee966927d4d65d828b68264b3253b2f8fd768d
Summary:
This adds the rest of the `dict.???` methods that were missing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21979
Pulled By: driazati
Differential Revision: D15999938
fbshipit-source-id: 7bc2a55e3f791015a0ff2e3731703075cf0770ee
Summary:
I learned from https://github.com/pytorch/pytorch/pull/22058 that `worker_kill` is just flaky, regardless of `hold_iter_reference`. So let's disable it altogether for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22208
Differential Revision: D15990307
Pulled By: soumith
fbshipit-source-id: d7d3f4fe7eaac4987f240cb8fd032c73a84157d7
Summary:
As part of the Variable/Tensor merge, we want to gradually remove call sites of `tensor_data()` and the API itself, and instead uses `variable_data()`. This PR removes the `tensor_data()` call in the tensor_to_numpy conversion path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22214
Differential Revision: D15997397
Pulled By: yf225
fbshipit-source-id: 6fcab7b14e138824fc2adb5434512bcf868ca375
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22077
ghimport-source-id: 39cf0a2e66e7fa2b6866af72782a22a4bd025e4c
Test Plan:
- Compared the build/aten/src folder before and after this change
locally and verified they are identical (`diff -r`).
- Wait for CI + Also, [namedtensor ci]
Imported from OSS
Differential Revision: D15941967
Pulled By: zou3519
fbshipit-source-id: d8607df78f48325fba37e0d00fce0ecfbb78cb36
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20729
Currently there is no way to specify what scalar types each nn function will support.
This change will allow specifying supported scalar types for each function/backward function and device. By default each function will support Float, Double, Half.
If you want to specify any extra supported scalar types, other than the default, you will need to change nn.yaml:
- name: _some_func(Tensor self)
  cname: SomeFunction
  CPU:
    forward_scalar_types: ['Float', 'Double', 'Long']
    backward_scalar_types: ['Float', 'Double']
Differential Revision: D15423752
fbshipit-source-id: b3c157316d6e629bc39c1b377a3b23c71b1656cf
Summary:
In `torch/csrc/autograd/function.h` we define `torch::autograd::Function`, a (the?) central autograd record-holding class. `Function` is declared public API (`TORCH_API`).
We also define a custom deleter `deleteFunction` which we use throughout PyTorch's own use of `Function`. This trivial PR declares the deleter public API as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22236
Differential Revision: D16001335
Pulled By: yf225
fbshipit-source-id: 6ef0a3630e8f82f277a0e6e26cc64455ef7ee43e
Summary:
We used to not print the device when a tensor is on xla. That is sometimes confusing, as it looks the same as a cpu tensor...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22094
Differential Revision: D15975405
Pulled By: ailzhang
fbshipit-source-id: f19ceb9e26f5f2f6e7d659de12716f0dfe065f42
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22084
For DictPtr/ListPtr, default construction was disallowed because it was ambiguous whether it's supposed to create an empty list or a nullptr.
But since we renamed them to Dict/List, we can now allow default construction without ambiguity.
Differential Revision: D15948098
fbshipit-source-id: 942a9235b51608d1870ee4a2f2f0a5d0d45ec6e6
Summary:
This cleans up the `checkScript` API and some old tests that were hardcoding outputs. It also now runs the Python function when a string is passed in to verify the outputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22002
Differential Revision: D15924485
Pulled By: driazati
fbshipit-source-id: ee870c942d804596913601cb411adc31bd988558
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22157
This header uses `std::swap_ranges` function which is defined in `<algorithm>` header (https://en.cppreference.com/w/cpp/algorithm/swap_ranges). Therefore this file isn't guaranteed to compile on all platforms.
This diff fixes the problem by adding the missing header.
Reviewed By: smessmer
Differential Revision: D15971425
fbshipit-source-id: e3edcec131f72d729161f5644ee152f66489201a
Summary:
Changelog:
- Port `symeig` from TH/THC to ATen
- Enable batching of matrix inputs for `symeig` (a usage sketch follows the list)
- Modify derivative computation based on batching
- Update docs to reflect the change
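A hedged sketch of the newly enabled batched call (shapes are illustrative):
```python
import torch

a = torch.randn(4, 3, 3)
a = a + a.transpose(-1, -2)                # make each matrix in the batch symmetric
e, v = torch.symeig(a, eigenvectors=True)  # batched eigendecomposition
print(e.shape, v.shape)                    # torch.Size([4, 3]) torch.Size([4, 3, 3])
```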
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21858
Test Plan: - Added additional tests in `test_torch.py` (with a port to `test_cuda.py`) and `common_methods_invocations.py` to test if both the port and batching work.
Differential Revision: D15981789
Pulled By: soumith
fbshipit-source-id: ab9af8361f8608db42318aabc8421bd99a1ca7ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21885
If a kernel is defined as a stateful lambda
```cpp
static auto registry = torch::RegisterOperators().op("my::op", [some_closure] (Tensor a) {...});
```
this can have very unexpected behavior when kernels are instantiated. There is no guarantee that the state is kept.
In the options based API, state is already disallowed:
```cpp
// this is a compiler error
static auto registry = torch::RegisterOperators().op("my::op", torch::RegisterOperators::options().kernel([some_closure] (Tensor a) {...}));
```
but we can't disallow it in the non-options-based API for backwards compatibility reasons.
We can, however, show a deprecation warning. This is what this diff introduces.
Differential Revision: D15867089
fbshipit-source-id: 300fa4772fad8e7d177eb7cb910063d360537a4a
Summary:
Re-implementation of the `embedding_dense_backward_cuda()` and the `embedding_bag_backward_cuda_sum_avg()` functions.
#### Performance
Running a [Mortgage Workflow](https://github.com/EvenOldridge/MortgageWorkflowA) with a block size of 100K on a DXG-2 (single GPU), we see a 270% speedup:
```
Original version: 370,168 example/s
Optimized version: 1,034,228 example/s
```
The original version is bounded by the `EmbeddingBag_accGradParametersKernel_sum_avg`, which takes 70% of the CUDA execution time. In the optimized version, the optimized kernel now takes only 17% of the time.
#### Greater Numerical Stability
An added benefit is greater numerical stability. Instead of doing a flat sum where a single variable is used to accumulate the weights, this code uses two steps, where each GPU thread computes a sub-result defined by `NROWS_PER_THREAD` before the final result is accumulated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22016
Differential Revision: D15944339
Pulled By: mrshenli
fbshipit-source-id: 398d5f48826a017fc4b31c24c3f8b56d01830bf0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22130
Optimize InstanceNormOp forward
For InstanceNormOp on CPU with order = NHWC, N = 128, C = 256, H = 56, W = 56: 183ms -> 115ms.
For InstanceNormOp on GPU with N = 256, C = 256, H = 112, W = 112:
NCHW: 1475ms -> 45ms
NHWC: 1597ms -> 79ms
Reviewed By: houseroad
Differential Revision: D15963711
fbshipit-source-id: 3fa03109326456b9f301514fecbefa7809438d3e
Summary:
In order to select the more important features in a dot product among a list of candidate sparse features, we can assign one learnable weight to each feature and reweight each feature by multiplying the weight onto its embedding before the dot product. We finally select features based on the weight magnitude after training.
We can perform L1 and/or L2 regularization on the weights. To summarize, the weights tend to shrink their values (avoiding overfitting) due to L2 regularization, and some weights will vanish to zero under L1. To avoid sparse feature embeddings being ignored due to early collapse of weights, a piecewise lr warm-up policy is used in optimizing the regularization term, such that regularization is weak in the first stage and gets stronger afterwards (a small lr constant in iters below threshold 1, a medium lr constant in stage 2, and a final reasonably large lr constant in all iters after threshold 2). The features with nonzero and relatively large weights (in absolute value) will be selected for the module.
We can also apply softmax on the original weights to make them sum to 1. We can even boost the softmaxed weights by multiplying by the number of softmax components, which essentially makes them sum to the number of softmax components and average to 1. In this scheme, all the weights are positive and sum to a constant. Regularization is not a must since we can count on the competition between the softmax weights themselves to achieve reasonable re-weighting. We expect those weights to be more dense, compared with the sparse ones from L1 regularization, and we can select features based on the top K weights.
Overall, we aim to demonstrate that the selected feature set outperforms the current v0 feature set in experiments. Special acknowledgement goes to Shouyuan Chen, who initiated the work of regularizable weighting.
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22176
The diff will export updates to the Github repository, as summarized below.
Basically, the updates on the files are summarized as below:
- adding logger messages
`caffe2/python/layer_model_helper.py`
- add ElasticNet regularizer, which combines both L1 and L2 regularization
`caffe2/python/regularizer.py`
- implement piecewarmup, specifically warm up with three constant pieces
`caffe2/sgd/learning_rate_functors.h, caffe2/sgd/learning_rate_op.cc, caffe2/sgd/learning_rate_op.h`
Differential Revision: D15923430
fbshipit-source-id: ee18902cb88c23b1b7b367cc727d690a21e4cda9
Summary:
- PyCQA/flake8-bugbear#53 has been fixed (but not yet closed on their side) and a new version of flake8-bugbear has been released on Mar 28, 2019. Switch CI to use the latest stable version.
- Fix the new B011 errors that flake8-bugbear catches in the current codebase.
---
B011: Do not call assert False since python -O removes these calls. Instead callers should raise AssertionError().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21944
Differential Revision: D15974842
Pulled By: soumith
fbshipit-source-id: de5c2c07015f7f1c50cb3904c651914b8c83bf5c
Summary:
Returning the result of an inplace `squeeze_` in `einsum` (which itself is traced) interacts badly with `autograd.Function`.
I must admit that I'm not 100% certain whether it should be necessary to change this, but I consider this a good change overall.
Fixes: https://github.com/pytorch/pytorch/issues/22072
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22111
Differential Revision: D15974990
Pulled By: soumith
fbshipit-source-id: 477e7f23833f02999085f665c175d062e7d32acd
Summary:
The current error message displays as:
`RuntimeError: index koccurs twice in output`
A whitespace is missing between the index and 'occurs'
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21904
Differential Revision: D15878941
Pulled By: colesbury
fbshipit-source-id: 163dda1829bf4956978cd01fd0e751673580722d
Summary:
The bug is that when target_length == 0, there is no preceding BLANK state and the original implementation will lead to an out-of-bounds pointer access.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21910
Differential Revision: D15960239
Pulled By: ezyang
fbshipit-source-id: 7bbbecb7bf91842735c14265612c7e5049c4d9b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22088
This diff is similar to D14163001. We need to handle the edge case when add_axis=1.
Reviewed By: jspark1105
Differential Revision: D15949003
fbshipit-source-id: 328d1e07b78b69bde81eee78c9ff5a8fb81f629b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22037
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out of band signal
is needed whether or not a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient or not. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
Reviewed By: mrshenli
Differential Revision: D15926383
fbshipit-source-id: 39c0d5dbd95bf0534314fdf4d44b2385d5321aaf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22036
Implemented only on ProcessGroupGloo, as an allgather of metadata
(sparse_dim, dense_dim, and nnz), followed by an allgather of indices,
followed by an allgather of values. Once these operations have
finished, all ranks locally compute a reduction over these sparse
tensors. Works for both CPU and CUDA tensors.
This surfaced a problem with the existing assumption of only modifying
tensors that are passed at the call site, because for sparse tensors
we don't know the dimensions of the output tensors before we run the
collective. To deal with this unknown, this commit adds a `result`
function to the `c10d::ProcessGroup::Work` class that returns a vector
of tensors.
It's a bit odd to have to retrieve the result through this function
only for operations on sparse tensors. To make this work irrespective
of tensor layout, we can create a follow-up commit to make all in
place operations make their results accessible through this function
as well. This doesn't break any existing contracts but does have the
potential to add interface ambiguity.
This is a resubmission of #19146.
Reviewed By: mrshenli
Differential Revision: D15926384
fbshipit-source-id: b6ee5d81606bfa8ed63c3d63a9e307613491e0ae
Summary:
This change is backwards incompatible in *C++ only* on mean(), sum(), and prod() interfaces that accepted either of:
```
Tensor sum(IntArrayRef dim, bool keepdim=false) const;
Tensor sum(IntArrayRef dim, ScalarType dtype) const;
```
but now to specify both the dim and dtype will require the keepdim parameter:
```
Tensor sum(IntArrayRef dim, bool keepdim=false, c10::optional<ScalarType> dtype=c10::nullopt) const;
```
[xla ci]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21088
Reviewed By: ailzhang
Differential Revision: D15944971
Pulled By: nairbv
fbshipit-source-id: 53473c370813d9470b190aa82764d0aea767ed74
Summary:
Currently many build options are explicitly passed from Python build scripts to CMake. But this is unnecessary, at least for many of them. This commit removes the build options that have the same name in CMakeLists.txt and environment variables (e.g., `USE_REDIS`). Additionally, many build options that are not explicitly passed to CMake are lost.
For `ONNX_ML`, `ONNX_NAMESPACE`, and `BUILDING_WITH_TORCH_LIBS`, I changed their default values in CMake scripts (as consistent with what the `CMake.defines` call meant), to avoid their default values being redundantly set in the Python build scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21877
Differential Revision: D15964996
Pulled By: ezyang
fbshipit-source-id: 127a46af7e2964885ffddce24e1a62995e0c5007
Summary:
This PR tackles issue https://github.com/pytorch/pytorch/issues/18352 .
Progress:
- [x] conv_dilated2d CPU
- [x] conv_dilated3d CPU
- [x] conv_dilated2d CUDA
- [x] conv_dilated3d CUDA
- [x] RocM port
- [x] Port of CUDA gemm and gemv
- [x] Refactored 2d and 3d functions as well as output and gradient computations into a single C++ template function
- [x] Cleanup
+ [x] eliminate forward functions
+ [x] eliminate buffers `columns` and `ones` from functions API
+ [x] eliminate out functions
+ [x] eliminate using `ones`
Note that col2im, im2col, col2vol, vol2col implementations are exposed in `ATen/native/im2col.h` and `ATen/native/vol2col.h`. The corresponding operators (not ported in this PR) should use these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20983
Differential Revision: D15958088
Pulled By: ezyang
fbshipit-source-id: 1897f6e15abbf5710e9413cd1e443c2e1dc7d705
Summary:
This is useful for measuring inference performance of your
models. This is a very basic benchmark for now. We don't support
batching on the benchmark side; no inter- and intra-op parallelism is
supported yet, just caller-based parallelism.
The main philosophy here is that the user should be able to provide inputs
from Python and just stack them within the benchmark. The API should be
exactly the same as passing inputs to module.forward.
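A hedged usage sketch, assuming the benchmark is exposed as `torch.utils.ThroughputBenchmark` with `add_input`/`benchmark` methods (treat the exact names and keyword arguments as assumptions):
```python
import torch
from torch.utils import ThroughputBenchmark

module = torch.nn.Linear(16, 4)
bench = ThroughputBenchmark(module)
for _ in range(10):
    bench.add_input(torch.randn(1, 16))  # same signature as module.forward

stats = bench.benchmark(
    num_calling_threads=4,  # caller-based parallelism only
    num_warmup_iters=100,
    num_iters=1000,
)
print(stats)
```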
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20766
Test Plan: Added a new unit test
Differential Revision: D15435461
Pulled By: salexspb
fbshipit-source-id: db08829dc3f4398bb1d8aa16cc4a58b6c72f16c6
Summary:
Previously any assert failures would leave the updated setting, making
the test suite semantics dependent on the order in which the tests are run.
The diff is large only due to the indentation change (might be good to review without whitespace changes).
cc yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22115
Differential Revision: D15960875
Pulled By: soumith
fbshipit-source-id: 9313695277fc2d968786f13371719e03fff18519
Summary:
Apply launch bounds annotations for ROCm as the maximum threads per
block (1024) is higher than the ROCm internal default (256).
Reduce the minBlocksPerMultiprocessor for ROCm to 8 from 16 as this
improves performance in some microbenchmarks by (statistically
significant) 4%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22081
Differential Revision: D15947426
Pulled By: bddppq
fbshipit-source-id: b4b7015417f99e14dfdedb62639e4d837c38e4fd
Summary:
We can't really test these until we get Python 3.8 in the CI, but these all work locally and won't be invoked at all for Python 3.7 and lower so this should be pretty safe.
Fixes#21710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22007
Pulled By: driazati
Differential Revision: D15914735
fbshipit-source-id: 83833cebe7e38b162719a4f53cbe52c3fc638edd
Summary:
This was originally introduced between at::Half, which overloaded a number of operators; since this isn't necessary anymore, get rid of it.
Note in many cases, these files still need THCNumerics.cuh (which was included by THCHalfAutoNumerics); I was not careful about isolating these usages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21878
Differential Revision: D15941236
Pulled By: gchanan
fbshipit-source-id: 65f30a20089fcd618e8f3e9646cf03147a15ccba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21753
- it accidentally didn't move non-IValue-based lists before. This is fixed now.
- it only needs to recreate a T() for IValue-based lists
Reviewed By: resistor
Differential Revision: D15809220
fbshipit-source-id: 944badf1920ee05f0969fff0d03284a641dae4a9
Summary:
Benefit from compile-time vectorization and multi-threading.
Before:
```python
In [1]: import torch
In [2]: x = torch.randn(1000000)
In [3]: y = torch.randn(1000000)
In [4]: w = 0.7
In [5]: timeit torch.lerp(x, y, w)
2.29 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After:
```python
In [1]: import torch
In [2]: x = torch.randn(1000000)
In [3]: y = torch.randn(1000000)
In [4]: w = 0.7
In [5]: timeit torch.lerp(x, y, w)
452 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
After with multi-processing:
```python
In [1]: import torch
In [2]: x = torch.randn(1000000)
In [3]: y = torch.randn(1000000)
In [4]: w = 0.7
In [5]: timeit torch.lerp(x, y, w)
167 µs ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22038
Differential Revision: D15941468
Pulled By: VitalyFedyunin
fbshipit-source-id: fa8a5126187df4e6c849452e035b00b22be25739
Summary:
# Motivation
We allow to override JIT module serialization with `__getstate__/__setstate__` in order to cover cases where parameters are not serializable. Use cases include: MKLDNN integration: a388c78350/torch/utils/mkldnn.py (L18-L26)
and also fbgemm prepacked format integration for quantized tensors.
However, many Eager scripts use the `torch.save(module.state_dict())` form of serialization. There are several ways to make it work:
* make packed_weight itself pickleable (e.g. by binding `__getstate__/__setstate__` on the C++ UDT level)
  * change: we’d need to allow module buffers to be of arbitrary, non-Tensor types
  * pro: no change to state_dict behavior
  * cons: might not be directly inspectable by a user calling .state_dict(), especially if packed weights represent several tensors fused together
* make packed_weight a proper Tensor layout
  * pro: no change to state_dict or buffers behavior
  * cons: adding new tensor layouts is pretty costly today
  * cons: doesn’t work if multiple tensors are packed in one interleaved representation
* *[this approach]* allow Modules to override state_dict and return regular tensors (sketched below)
  * pro: most flexible and hackable
  * pro: maintains the semantic meaning of state_dict as all data necessary to represent the module’s state
  * cons: complicates state_dict logic
  * cons: potential code duplication between `__getstate__/__setstate__`
Based on discussions with zdevito and gchanan we decided to pick the latter approach. Rationale: this behavior is fully opt-in and will impact only modules that need it. For those modules the requirement listed above won't be true. But we do preserve the requirement that all elements of state_dict are tensors. (https://fburl.com/qgybrug4 for internal discussion)
In the future we might also implement one of the approaches above but those are more involved.
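A hedged sketch of the chosen direction in eager mode, using the existing `_save_to_state_dict` / `_load_from_state_dict` override points; the `PackedLinear` module and its packing scheme are made up for illustration:
```python
import torch
import torch.nn as nn

class PackedLinear(nn.Module):
    def __init__(self, weight):
        super(PackedLinear, self).__init__()
        self.packed = {"data": weight.clone()}  # stand-in for an opaque packed format

    def _save_to_state_dict(self, destination, prefix, keep_vars):
        super(PackedLinear, self)._save_to_state_dict(destination, prefix, keep_vars)
        # Expose the packed weight as a regular tensor in the state_dict.
        destination[prefix + "weight"] = self.packed["data"]

    def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
        # Re-pack from the plain tensor form before the default loading runs.
        self.packed = {"data": state_dict.pop(prefix + "weight")}
        super(PackedLinear, self)._load_from_state_dict(state_dict, prefix, *args, **kwargs)

m = PackedLinear(torch.randn(4, 4))
sd = m.state_dict()                   # contains only plain tensors
m2 = PackedLinear(torch.zeros(4, 4))
m2.load_state_dict(sd)
```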
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21933
Differential Revision: D15937678
Pulled By: dzhulgakov
fbshipit-source-id: 3cb5d1a8304d04def7aabc0969d0a2e7be182367
Summary:
This pull request adds the necessary Windows DLL code to be able to support JIT fusion for CUDA. CPU JIT Fusion isn't supported. This also adds all the non-CPU JIT tests back in on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21861
Differential Revision: D15940939
Pulled By: soumith
fbshipit-source-id: e11f6af1ac258fcfd3a077e6e2f2e6fa38be4ef1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22015
Previous fusion logic only works for operators that are back-to-back in the linear order of the protobuf file.
This diff generalizes to work for any predecessor-successor operators in the graph without any "interfering" use/def of the related blobs.
Reviewed By: csummersea
Differential Revision: D15916709
fbshipit-source-id: 82fe4911a8250845a8bea3427d1b77ce2442c495
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21709
Change the return type from Scalar to double/int64_t so we don't need to do a conversion when we call other quantize-related aten functions.
Differential Revision: D15793003
fbshipit-source-id: 510936c69fa17a4d67340a31ebb03415647feb04
Summary:
This is a modified version of https://github.com/pytorch/pytorch/pull/14705, since the commit structure for that PR is quite messy.
1. Add `IterableDataset`.
2. So we now have two data loading modes: `Iterable` and `Map`.
   1. `Iterable` if the `dataset` is an instance of `IterableDataset`
   2. `Map` otherwise.
3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful in doing things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset obj copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration (a usage sketch follows this list).
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import torch.utils.data in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.
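A hedged sketch of items 1 and 5: an `IterableDataset` that uses `get_worker_info` to shard work across workers (the `RangeStream` dataset is made up for illustration):
```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeStream(IterableDataset):
    def __init__(self, start, end):
        super(RangeStream, self).__init__()
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:  # single-process data loading
            lo, hi = self.start, self.end
        else:             # split the range across workers to avoid duplicates
            per_worker = (self.end - self.start) // info.num_workers
            lo = self.start + info.id * per_worker
            hi = self.end if info.id == info.num_workers - 1 else lo + per_worker
        return iter(range(lo, hi))

if __name__ == "__main__":
    # batch_size=None enables the new non-batch loading path.
    loader = DataLoader(RangeStream(0, 10), batch_size=None, num_workers=2)
    print(sorted(int(x) for x in loader))  # [0, 1, ..., 9]
```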
Closes https://github.com/pytorch/pytorch/issues/17909, https://github.com/pytorch/pytorch/issues/18096, https://github.com/pytorch/pytorch/issues/19946, and some of https://github.com/pytorch/pytorch/issues/13023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19228
Reviewed By: bddppq
Differential Revision: D15058152
fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20673
Add option to bucket-weighted pooling to hash the bucket so that any cardinality score can be used.
Reviewed By: huginhuangfb
Differential Revision: D15003509
fbshipit-source-id: 575a149de395f18fd7759f3edb485619f8aa5363
Summary:
The first attempt and more discussions are available in https://github.com/pytorch/pytorch/issues/19577
#### Goal
Allow toggling DDP gradient synchronization across iterations. With this feature, users may accumulate grads in module variables, and only kick off expensive grad synchronization every few iterations.
#### Concerns
Our first attempt in https://github.com/pytorch/pytorch/issues/19577 tries to do it using a variable or a function. But apaszke made a good point that this would be error-prone, and favors a context manager instead.
#### Proposed Solution
Instead of providing an `accumulate_grads` variable/function/context, we provide a `DistributedDataParallel.no_sync()` context manager. It does exactly what the name suggests, i.e., disables DDP grad synchronization within the context. Note that `accumulate_grads` means `no_sync` + no optimizer step, where the latter is not controlled by DDP.
It is true that users need to call another `model(input).backward()` after exiting the context, and this is indeed more verbose. But I think it is OK, as one major concern in the previous discussion was preventing users from running into errors without knowing it. This API reaffirms the expected behavior, and does not mess up other use cases where accumulating grads is not required.
The application would then look like:
```python
with ddp.no_sync():
    for input in inputs:
        ddp(input).backward()
ddp(one_more_input).backward()
optimizer.step()
```
chenyangyu1988 myleott
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21736
Differential Revision: D15805215
Pulled By: mrshenli
fbshipit-source-id: 73405797d1e39965c52016af5cf45b15525ce21c
Summary:
There aren't any substantive changes aside from some test renames (e.g. `TestScript.test_dict_membership` -> `TestDict.test_membership`) and the addition of `TestDict.dict()`.
Adding the rest of the dict ops was making the tests a mess and `TestScript` is already > 10000 lines by itself, so breaking them up should make things cleaner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22000
Pulled By: driazati
Differential Revision: D15911383
fbshipit-source-id: 614428e03fbc14252f0e9cde74ab9a707169a860
Summary:
The cppdocs build job (originally run on Chronos as a cron job) was frequently broken because it was not run on every PR. This PR moves it to CircleCI and enables it on every PR, so that we can get the build failure signal much earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19768
Differential Revision: D15922289
Pulled By: yf225
fbshipit-source-id: e36ef59a2e42f78b7d759ee02f2d94dc90f88fff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19443
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out of band signal
is needed whether or not a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient or not. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
Reviewed By: mrshenli
Differential Revision: D15007365
fbshipit-source-id: f298e83fd3ca828fae9e80739e1db89d045c99ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19146
Implemented only on ProcessGroupGloo, as an allgather of metadata
(sparse_dim, dense_dim, and nnz), followed by an allgather of indices,
followed by an allgather of values. Once these operations have
finished, all ranks locally compute a reduction over these sparse
tensors. Works for both CPU and CUDA tensors.
This surfaced a problem with the existing assumption of only modifying
tensors that are passed at the call site, because for sparse tensors
we don't know the dimensions of the output tensors before we run the
collective. To deal with this unknown, this commit adds a `result`
function to the `c10d::ProcessGroup::Work` class that returns a vector
of tensors.
It's a bit odd to have to retrieve the result through this function
only for operations on sparse tensors. To make this work irrespective
of tensor layout, we can create a follow-up commit to make all in
place operations make their results accessible through this function
as well. This doesn't break any existing contracts but does have the
potential to add interface ambiguity.
Reviewed By: mrshenli
Differential Revision: D14889547
fbshipit-source-id: 34f3de4d6a2e09c9eba368df47daad0dc11b333e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21938
After having changed all call sites, we can now remove the old naming scheme.
Reviewed By: zdevito
Differential Revision: D15892402
fbshipit-source-id: 1f5b53a12fa657f6307811e8657c2e14f6285d2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21937
This changes call sites to use the new naming scheme
Reviewed By: zdevito
Differential Revision: D15892404
fbshipit-source-id: 8d32aa90a0ead1066688166478f299fde9c2c133
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21936
This introduces torch::List and torch::Dict as aliases to ListPtr/DictPtr.
After this lands, we can step by step change the call sites to the new naming
and finally remove the old spellings.
Reviewed By: zdevito
Differential Revision: D15892405
fbshipit-source-id: 67b38a6253c42364ff349a0d4049f90f03ca0d44
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21806
Dispatcher::findSchema(op_name) now uses a lookup table instead of iterating through the list of operators to find it.
This speeds up op lookup (as in finding the operator handle from the name, not as in finding a kernel when you already have the operator handle)
and it also speeds up op registration since that needs to check whether an op with the same name already exists.
Differential Revision: D15834256
fbshipit-source-id: c3639d7b567e4ed5e3627c3ebfd01b7d08b55ac1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21809
Many error messages show dispatch keys, for example when the dispatcher didn't find a kernel to dispatch to.
Previously, this was a string like "CPU" or "CUDA" for known backends and just an arbitrary number for other backends.
Now, tensor type id registration also registers a name for the dispatch key and shows that in the error messages.
There is no API change, just the error messages are better now.
Differential Revision: D15835809
fbshipit-source-id: 4f0c9d0925c6708b02d79c653a2fae75b6623bb9
Summary:
https://github.com/pytorch/pytorch/pull/17072 breaks `model.to(xla_device)`, because moving `model` to XLA device involves changing its parameters' TensorImpl type, and the current implementation of `nn.Module.to()` doesn't support changing module parameters' TensorImpl type:
```python
# 6dc445e1a8/torch/nn/modules/module.py (L192-L208)
def _apply(self, fn):
    ...
    for param in self._parameters.values():
        if param is not None:
            # Tensors stored in modules are graph leaves, and we don't
            # want to create copy nodes, so we have to unpack the data.
            param.data = fn(param.data)  # NOTE: this doesn't allow changing `param.data`'s TensorImpl type
            if param._grad is not None:
                param._grad.data = fn(param._grad.data)  # NOTE: this doesn't allow changing `param._grad.data`'s TensorImpl type
    ...
```
yf225 TODO: fix the description here when we finish the implementation
To fix this problem, we introduce a new API `model.to_()` that always assign new tensors to the parameters (thus supporting changing the parameters to any TensorImpl type), and also bump the version counter of the original parameters correctly so that they are invalidated in any autograd graph they participate in.
We also add warning to the current `model.to()` API to inform users about the upcoming behavior change of `model.to()`: in future releases, it would create and return a new model instead of in-place updating the current model.
This unblocks adding XLA to our CI test suite, which also allows XLA to catch up with other changes in our codebase, notably the c10 dispatcher.
[xla ci]
cc. resistor ailzhang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21613
Differential Revision: D15895387
Pulled By: yf225
fbshipit-source-id: b79f230fb06019122a37fdf0711bf2130a016fe6
Summary:
When we pass `fn` to `nn.Module._apply()` and `fn` is an in-place operation, the correct behavior should also include bumping the parameters' and their gradients' version counters. This PR fixes the old incorrect behavior and makes sure the new behavior is right.
Note that this PR is BC-breaking in the following way:
Previously, passing an in-place operation to `nn.Module._apply()` does not bump the module's parameters' and their gradients' version counters. After this PR, the module's parameters' and their gradients' version counters will be correctly bumped by the in-place operation, which will invalidate them in any autograd graph they previously participate in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21865
Differential Revision: D15881952
Pulled By: yf225
fbshipit-source-id: 62f9244a4283a110147e9f20145ff232a5579fbd
Summary:
Added some extra tests for std_mean and var_mean for multiple dims.
Some refactoring of previously created tests based on PR comments: https://github.com/pytorch/pytorch/pull/18731
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20650
Differential Revision: D15396101
Pulled By: ifedan
fbshipit-source-id: d15c3c2c7084a24d6cfea4018173552fcc9c03a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21852
To enable change of q_scale and q_zero_point in `copy_`
Differential Revision: D15793427
fbshipit-source-id: a7040b5b956d161fd6af6176287f4a4aa877c9be
Summary:
The code in `python_sugared_value.cpp` to recursively compile methods
was not being tested, so this adds a test for it and fixes some errors
in it
It was necessary to disable any hooks set since (at least in our tests) they would try to export
a half-finished graph since they were being called on recursively
compiled methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21862
Differential Revision: D15860314
Pulled By: driazati
fbshipit-source-id: e8afe9d4c75c345b6e1471072d67c5e335b61337
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21914
https://github.com/pytorch/pytorch/pull/21591 added a needed feature to clean up grad accumulator post hooks when the DistributedDataParallel model object is cleaned up. There's a minor typo that causes it to loop infinitely over the first element.
Differential Revision: D15878884
fbshipit-source-id: b7fd0bbd51eb187579d639b1709c6f7b62b85e7a
Summary:
This PR adds support for `in` checks like `key in my_dict`
For now it leaves lists as a follow up due to the changes around `IValue` lists and it needing an `IValue` equality op.
For objects it uses the magic method `__contains__(self, key)`
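As a rough illustration of the dict case (a sketch, not a test taken from this PR):
```python
from typing import Dict

import torch

@torch.jit.script
def has_key(d: Dict[str, int], k: str) -> bool:
    # `in` on a dict compiles to a membership check in TorchScript
    return k in d

print(has_key({"a": 1}, "a"))  # True
print(has_key({"a": 1}, "b"))  # False
```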
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21527
Pulled By: driazati
Differential Revision: D15811203
fbshipit-source-id: 95745060394f8a9450efaaf8ab09d9af83bea01e
Summary:
This adds support for inferred attributes (everything except empty lists, dicts, and tuples) as well as using the PEP 526 style annotations on a class, so this eliminates the need for `torch.jit.Attribute`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21379
Differential Revision: D15718537
Pulled By: driazati
fbshipit-source-id: b7481ae3d7ee421613e931b7dc3427ef2a99757f
Summary:
This is a fix for https://github.com/pytorch/pytorch/issues/21469
Currently there is no way to tell whether a backward function has already released its saved variables when those variables were added to a vector. This change sets a flag when a function that has saved variables releases them, so we can raise an error if somebody calls the function again with already released variables.
Functions that do not have saved variables can still be called multiple times, for backward compatibility.
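A minimal illustration of the failure mode this flag guards against (a behavior sketch, not a test from this PR):
```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x * x).sum()
y.backward()      # the backward function releases its saved variables here
try:
    y.backward()  # with the flag set, this raises instead of touching freed buffers
except RuntimeError as e:
    print("caught:", e)
```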
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21533
Differential Revision: D15810481
Pulled By: ifedan
fbshipit-source-id: 5663e0c14f1b65727abc0d078aef348078d6a543
Summary:
This will need a conflict resolution once avg_pool2d() has been merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21732
Differential Revision: D15824923
Pulled By: ezyang
fbshipit-source-id: 83341e0209b660aecf788272079d8135d78b6ff1
Summary:
This was some code I added :^)
Time for me to remove it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21897
Differential Revision: D15873213
Pulled By: Chillee
fbshipit-source-id: 769c3bd71c542be4afddc02dc2f65aa5c751b10d
Summary:
What's the point of having warnings if we never fix them :^)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21898
Differential Revision: D15873280
Pulled By: Chillee
fbshipit-source-id: a8274bab2badd840d36a9d2e1354677a6114ae1d
Summary:
cosine_similarity has two non-tensor parameters and needs some special handling. This diff adds support for its export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21884
Reviewed By: zrphercule
Differential Revision: D15866807
Pulled By: houseroad
fbshipit-source-id: a165fbc00c65c44b276df89ae705ca8960349d48
Summary:
```
This replaces the kernel helpers in Loops.h/cuh with the following:
cpu_kernel
cpu_kernel_vec
gpu_kernel
gpu_kernel_with_scalars
These work with functions with any number of input arguments, with the
exception of 'gpu_kernel_with_scalars' which is limited to binary
operations. Previously, we only supported functions of 0, 1, or 2 input
arguments. Adding support for 3 or 4 input argument functions required
a significant amount of additional code.
This makes a few other changes:
Remove 'ntensors' from the for_each/serial_for_each loop. Most loops
assume a fixed number of tensors, and the value is accessible from
TensorIterator::ntensors()
Only lift CPU scalars to parameters in 'gpu_kernel_with_scalars'.
Previously, we performed this recursively in gpu_unary_kernel and
gpu_binary_kernel, so something like `torch.add(3, 4, out=cuda_tensor)`
would specialize to a "nullary" kernel. Now, only the first
scalar input is lifted to a kernel parameter. Any additional scalar
inputs are copied to CUDA tensors. Note that operations like `x + 5`
and `5 + x` still work efficiently. This avoids generating an exponential
number of specializations in the number of input arguments.
```
**Performance measurements**
Timing numbers are unchanged for basic elementwise operations. Linked below is a script to measure torch.add perf on PR vs. master CPU+GPU (GCC 7.3):
[miniperf.py](https://gist.github.com/colesbury/4a61893a22809cb0931f08cd37127be4)
**Generated assembly**
cpu_kernel and cpu_kernel_vec still generate good vectorized code with
both GCC 7.3 and GCC 4.8.5. Below is the assembly for the "hot" inner loop of
torch.add as well as an auto-vectorized torch.mul implementation using cpu_kernel/
binary_kernel. (The real torch.mul uses cpu_kernel_vec but I wanted to check that
auto vectorization still works well):
[torch.add GCC 7.3](https://gist.github.com/colesbury/927ddbc71dc46899602589e85aef1331)
[torch.add GCC 4.8](https://gist.github.com/colesbury/f00e0aafd3d1c54e874e9718253dae16)
[torch.mul auto vectorized GCC 7.3](https://gist.github.com/colesbury/3077bfc65db9b4be4532c447bc0f8628)
[torch.mul auto vectorized GCC 4.8](https://gist.github.com/colesbury/1b38e158b3f0aaf8aad3a76963fcde86)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21475
Differential Revision: D15745116
Pulled By: colesbury
fbshipit-source-id: 914277d7930dc16e94f15bf87484a4ef82890f91
Summary:
PR https://github.com/pytorch/pytorch/issues/20685 incorrectly only enabled P2P access for non-contiguous copies.
This can make cudaMemcpy slow for inter-gpu copies, especially on ROCm
devices. I didn't notice a difference on CUDA 10, but ngimel says it's
important for CUDA too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21872
Differential Revision: D15863965
Pulled By: colesbury
fbshipit-source-id: 0a858f3c338fa2a5d05949d7f65fc05a70a9dfe1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21080
Add Huber loss as a new option for regression training (refer to TensorFlow implementation: https://fburl.com/9va71wwo)
```
# huber loss
def huber(true, pred, delta):
    error = abs(true - pred)
    loss = 0.5 * min(error, delta) ** 2 + delta * max(error - delta, 0)
    return mean(loss)
```
As a combination of MSE loss (`x < delta`) and MAE loss (`x >= delta`), the advantage of Huber loss is that it reduces the training's dependence on outliers.
One thing worth noting is that the Huber loss is not twice differentiable at `x = delta`. To further address this, one could consider adopting a `log(cosh(x))` loss.
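A minimal PyTorch sketch of the formula above on elementwise tensors (the diff itself adds the option to the Caffe2 regression training path, not this helper):
```python
import torch

def huber_loss(pred, true, delta=1.0):
    # quadratic below delta, linear above, matching the pseudocode above
    error = (true - pred).abs()
    quadratic = torch.clamp(error, max=delta)   # min(error, delta)
    linear = error - quadratic                  # max(error - delta, 0)
    return (0.5 * quadratic ** 2 + delta * linear).mean()

print(huber_loss(torch.tensor([0.0, 3.0]), torch.tensor([0.5, 0.0])))
```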
Reviewed By: chintak
Differential Revision: D15524377
fbshipit-source-id: 73acbe2728ce160c075f9acc65a1c21e3eb64e84
Summary:
After fixing https://github.com/pytorch/pytorch/issues/20774 the TRT build was broken
Because of missing annotations, pybind_state_gpu.so was missing symbols while pybind_state.so was not. This caused a weird combination: trying to import pybind_state_gpu first left the system in a semi-initialized state and led to a SIGSEGV.
Minimal repro:
```
>>> import ctypes
>>> ctypes.CDLL('/var/lib/jenkins/.local/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/ctypes/__init__.py", line 362, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /var/lib/jenkins/.local/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so: undefined symbol: _ZN6caffe219TensorRTTransformer9TransformEPNS_9WorkspaceEPNS_6NetDefERKSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_11TensorShapeESt4hashISB_ESt8equal_toISB_ESaISt4pairIKSB_SC_EEE
>>> ctypes.CDLL('/var/lib/jenkins/.local/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state.so')
Segmentation fault (core dumped)
```
Too lazy to repro locally, let's see if CI passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21775
Differential Revision: D15829605
Pulled By: dzhulgakov
fbshipit-source-id: 1adb2bde56b0cd68f84cfca67bc050adcf787cd9
Summary:
Following up b811b6d5c03596d789a33d7891b606842e01f7d2
* Use property instead of __setattr__ in CMake.
* Add a comment clarifying when build_ext.run is called.
---
cc ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21792
Differential Revision: D15860606
Pulled By: umanwizard
fbshipit-source-id: ba1fa07f58d4eac81ac27fa9dc7115d1cdd3dec0
Summary:
https://github.com/pytorch/pytorch/issues/11866 has corrected this issue in the function `host_softmax` (aten/src/ATen/native/SoftMax.cpp). But I tried the example proposed in https://github.com/pytorch/pytorch/issues/11752, and `log_softmax` is still not working for big logits.
I looked into the source code and found that the example had called `vec_host_softmax_lastdim`, not `host_softmax`.
This code fixes the issue in `_vec_log_softmax_lastdim` and has a test for `log_softmax`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21672
Differential Revision: D15856327
Pulled By: VitalyFedyunin
fbshipit-source-id: 7a1fd3c0a03d366c99eb873e235361e4fcfa7567
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21735
ghimport-source-id: 4a4289693e372880e3d36e579c83d9e8745e70ed
Test Plan:
- I'm not sure how to test this other than making sure it compiles.
- [namedtensor ci]
gh-metadata: pytorch pytorch 21735 gh/zou3519/49/head
Imported from OSS
Differential Revision: D15833456
Pulled By: zou3519
fbshipit-source-id: ea2fa6d5c5f1eb2d7970d47189d6e4fcd947146d
Summary:
kuttas pointed out that the DDP Reducer only needs to remember `(uintptr, Function)` pairs, and hence does not need an unordered map as added by https://github.com/pytorch/pytorch/issues/21591. Using a vector should speed it up a bit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21783
Differential Revision: D15854312
Pulled By: mrshenli
fbshipit-source-id: 153ba035b8d658c7878a613f16a42de977d89c43
Summary:
After https://github.com/pytorch/pytorch/pull/17072, we are allowed to pass Variables into ATen ops, thus there is no need to unwrap input variables in the c10 call path.
Note that since Caffe2 still expects inputs to be pure Tensors, we moved the unwrapping logic to the Caffe2 wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21620
Differential Revision: D15763560
Pulled By: yf225
fbshipit-source-id: 5375f0e51eb320f380ae599ebf98e6b259f0bff8
Summary:
This refactors pybind_utils so we can have all our type-inferring stuff in
1 place (e.g. for #21379)
There is some follow up work to make the error messages better, but I think that's fine to save for another PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21550
Pulled By: driazati
Differential Revision: D15727002
fbshipit-source-id: a6974f2e1e5879f0503a18efc138da31cda7afa2
Summary:
Resolves https://github.com/pytorch/lockdown/issues/18
This implements NamedTuple by taking advantage of the existing `names` field in `TupleType`.
TODO: This currently doesn't retain the NamedTuple-ness through serialization. Discussed with suo offline, we can probably make a way to define an anonymous NamedTuple in script (e.g. `NamedTuple('Foo', [('a', int), ('b', float), ('c', List[float])])` and serialize that
TODO: implement support for calling the constructor with kwargs
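A rough sketch of the intended usage (assuming a Python-defined NamedTuple flows through `TupleType`'s `names` field; the exact surface API may differ):
```python
from typing import NamedTuple

import torch

class Point(NamedTuple):
    x: float
    y: float

@torch.jit.script
def sq_norm(p: Point) -> float:
    # field access by name instead of positional tuple indexing
    return p.x * p.x + p.y * p.y

print(sq_norm(Point(3.0, 4.0)))  # 25.0
```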
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21428
Differential Revision: D15741564
Pulled By: jamesr66a
fbshipit-source-id: c077cbcea1880675ca6deb340a9ec78f824a136c
Summary:
When enabling this flag there were a lot of warnings; this PR focuses on the warnings where the signed/unsigned comparison could be affecting array indices, which are the ones most prone to fail.
The good news is that I didn't find anything obviously concerning.
One degenerate case: when the matrices we work with are too skinny (dim1=1, dim2 needs to hold a big number), we could run into issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18187
Differential Revision: D14527182
Pulled By: hyuen
fbshipit-source-id: b9f46b6f68ab912c55368961758a7a5af1805555
Summary:
We plan on generating python bindings for C++ ChunkDataset API using the current Pytorch Dataloader class, which must call get_batch() instead of get_batch(size)
This change doesn't break the current API; it just adds one more method that will make future extensions easier (WIP).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21797
Differential Revision: D15830522
Pulled By: soumith
fbshipit-source-id: 7208f305b48bf65d2783eaff43ff57a05e62c255
Summary:
Originally, the tests for the tensorboard writer were smoke tests only. This PR lets CI compare the output with expected results at a low level. The randomness of the tensors in the tests is also removed.
ps. I found that how protobuf serializes data differs between python environments. One method to solve this is to write the data and then read it back instantly (i.e., compare the data at a higher level).
For `add_custom_scalars`, the data to be written is a dictionary, and the serialized result might differ (it is not an `OrderedDict`), so only a smoke test is kept for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20987
Reviewed By: NarineK, lanpa
Differential Revision: D15804871
Pulled By: orionr
fbshipit-source-id: 69324c11ff823b19960d50def73adff36eb4a2ac
Summary:
Try to fix a sporadic failure on some CIs.
I've run this test hundreds of times on my machine (GeForce 1060, MAGMA) but I cannot reproduce this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21638
Differential Revision: D15827779
Pulled By: ezyang
fbshipit-source-id: 3586075e48907b3b84a101c560a34cc733514a02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21712
Warn when people use unordered_map or vector with IValues. These APIs are deprecated.
The unordered_map API is slow because it requires copying the whole map.
The vector API is slow for some types (e.g. std::string) because for them it also requires copying the whole vector.
Also, the vector API would get slow for all types if we decide to switch to SmallVector.
Differential Revision: D15792428
fbshipit-source-id: 1b72406b3a8d56521c862858c9f0ed01e56f2757
Summary:
When kwargs are specified in a test defined via common_method_invocations, it doesn't work if there isn't also a positional argument (`{'foo':'foo'}` without a positional arg generates a python call like: `self.method(, foo=foo)`, erroring on the `,`). I wanted to test something in a different PR and noticed I couldn't.
Also fixed some flake8 warnings I was seeing locally.
I replaced `lambda x: x` with `ident` since it seems a bit cleaner to me, but happy to revert that if others don't agree?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21499
Differential Revision: D15826974
Pulled By: nairbv
fbshipit-source-id: a3f37c80ba2303c7d9ae06241df06c7475b64e36
Summary:
So far, we only have py2 CI for onnx. I think py3 support is important, and we plan to add onnxruntime backend tests, which only support py3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21715
Reviewed By: bddppq
Differential Revision: D15796885
Pulled By: houseroad
fbshipit-source-id: 8554dbb75d13c57b67ca054446a13a016983326c
Summary:
Some data loader tests are flaky on py 2 with the following error
```
Jun 12 22:17:31 Traceback (most recent call last):
Jun 12 22:17:31 File "test_dataloader.py", line 798, in test_iterable_dataset
Jun 12 22:17:31 fetched = sorted([d.item() for d in dataloader_iter])
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 697, in __next__
Jun 12 22:17:31 idx, data = self._get_data()
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 664, in _get_data
Jun 12 22:17:31 success, data = self._try_get_data()
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 617, in _try_get_data
Jun 12 22:17:31 data = self.data_queue.get(timeout=timeout)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/queues.py", line 135, in get
Jun 12 22:17:31 res = self._recv()
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
Jun 12 22:17:31 return pickle.loads(buf)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1382, in loads
Jun 12 22:17:31 return Unpickler(file).load()
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 858, in load
Jun 12 22:17:31 dispatch[key](self)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1133, in load_reduce
Jun 12 22:17:31 value = func(*args)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 274, in rebuild_storage_fd
Jun 12 22:17:31 fd = multiprocessing.reduction.rebuild_handle(df)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 157, in rebuild_handle
Jun 12 22:17:31 new_handle = recv_handle(conn)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 83, in recv_handle
Jun 12 22:17:31 return _multiprocessing.recvfd(conn.fileno())
Jun 12 22:17:31 OSError: [Errno 4] Interrupted system call
```
Apparently, Python 2.7's `recvfd` calls `recvmsg` without EINTR retry: https://github.com/python/cpython/blob/2.7/Modules/_multiprocessing/multiprocessing.c#L174
So we should call it with an outer try-catch loop.
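A sketch of the retry-on-EINTR wrapper described above (the helper name is hypothetical; the real fix lives around the `recv` call in torch/multiprocessing):
```python
import errno

def retry_on_eintr(fn, *args, **kwargs):
    # keep retrying the interrupted system call until it completes
    # or fails with a different errno
    while True:
        try:
            return fn(*args, **kwargs)
        except OSError as e:
            if e.errno != errno.EINTR:
                raise
```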
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21723
Differential Revision: D15806247
Pulled By: ezyang
fbshipit-source-id: 16cb661cc0fb418fd37353a1fef7ceeb634f02b7
Summary:
Currently when building extensions, variables such as USE_CUDA, USE_CUDNN are used to determine what libraries should be linked. But we should use what CMake has detected, because:
1. If CMake found them unavailable but the variables say some libraries should be linked, the build would fail.
2. If the first build is made using a set of non-default build options, rebuilds must have these options passed to setup.py again, otherwise the extension build process is inconsistent with CMake. For example,
```bash
# First build
USE_CUDA=0 python setup.py install
# Subsequent builds like this would fail, unless "build/" is deleted
python setup.py install
```
This commit addresses the above issues by using variables from CMakeCache.txt when building the extensions.
---
The changes in `setup.py` may look lengthy, but the biggest changed block is mostly moving them into a function `configure_extension_build` (along with some variable names changed to `cmake_cache_vars['variable name']` and other minor changes), because it must be called after CMake has been called (and thus the options used and system environment detected by CMake become available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21653
Differential Revision: D15824506
Pulled By: ezyang
fbshipit-source-id: 1e1eb7eec7debba30738f65472ccad966ee74028
Summary:
This makes the error thrown in aten_to_numpy_dtype consistent with that in numpy_dtype_to_aten.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21608
Differential Revision: D15816035
Pulled By: gchanan
fbshipit-source-id: 392e8b9ea37003a859e7ed459911a1700fcbd695
Summary:
This PR is intended as a fix for https://github.com/pytorch/pytorch/issues/21644.
It allows the `with emit_nvtx` context manager to take an additional `record_shapes` argument. `record_shapes` is False by default, but if True, the nvtx ranges generated for each autograd op will append additional information about the sizes of Tensors received by that op.
The format of shape information is equivalent to what the CPU-side profiler spits out. For example,
```
M = torch.randn(2, 3)
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 3)
with torch.cuda.profiler.profile():
    with torch.autograd.profiler.emit_nvtx(record_shapes=True):
        torch.addmm(M, mat1, mat2)
```
produces an nvtx range label for addmm that carries the input shape information (screenshot omitted; cf. the "Input Shapes" shown in 864cfbc216 (diff-115b6d48fa8c0ff33fa94b8fce8877b6)).
I also took the opportunity to do some minor docstring cleanup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21691
Differential Revision: D15816226
Pulled By: gchanan
fbshipit-source-id: b2b01ea10fea61a6409a32b41e85b6c8b4851bed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20924
I found a python3 bug when deserializing caffe2 code. The exception thrown is a Unicode-related error instead of just a decode error, and we need to catch that as well.
Reviewed By: ipiszy
Differential Revision: D15293221
fbshipit-source-id: 29820800d1b4cbe5bf3f5a189fe2023e655d0508
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21763
Custom __getattr__ functions can only raise AttributeError. This code threw NotImplementedError, which caused upstream trouble when hasattr() was called.
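A small reproduction of the problem (illustrative classes, not the actual call site):
```python
class Broken:
    def __getattr__(self, name):
        raise NotImplementedError(name)   # escapes hasattr() and propagates

class Fixed:
    def __getattr__(self, name):
        raise AttributeError(name)        # hasattr() swallows this and returns False

print(hasattr(Fixed(), "missing"))   # False
print(hasattr(Broken(), "missing"))  # raises NotImplementedError
```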
Differential Revision: D15815176
fbshipit-source-id: 0982e2382de4578d3fc05c5d2a63f624d6b4765e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21446
This is used for easier tracing of the iteration id when looking at the trace diagram.
Reviewed By: ilia-cher
Differential Revision: D15628950
fbshipit-source-id: ee75b3bdb14a36abc18c7bddc49d8ec9789b724d
Summary:
```
The stride calculation using OffsetCalculator performs poorly with
MAX_DIMS=25. This reduces MAX_DIMS (after coalescing) to 16 on ROCm.
I think it's unlikely that anyone will exceed this limit. If they do,
we can add additional specializations for ROCm with more dimensions.
```
I'm not sure about the underlying cause. With MAX_DIM=25, the add kernel's params
are ~648 bytes vs. ~424 bytes with MAX_DIM=16. The kernel instruction footprint is
bigger too, but most of these instructions are never executed and most kernel parameters
are never loaded because the typical dimensionality is much smaller.
Mini benchmark here:
https://gist.github.com/colesbury/1e917ae6a0ca9d24712121b92fed4c8f
(broadcasting operations are much faster)
cc iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21754
Reviewed By: bddppq
Differential Revision: D15811906
Pulled By: colesbury
fbshipit-source-id: 063f92c083d26e2ef2edc98df7ff0400f9432b9d
Summary:
Currently multihead attention for half type is broken
```
File "/home/ngimel/pytorch/torch/nn/functional.py", line 3279, in multi_head_attention_forward
attn_output = torch.bmm(attn_output_weights, v)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #2 'mat2'
```
because softmax converts half inputs into fp32 inputs. This is unnecessary - all the computations in softmax will be done in fp32 anyway, and the results need to be converted into fp16 for the subsequent batch matrix multiply, so nothing is gained by writing them out in fp32. This PR gets rid of type casting in softmax, so that half works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21658
Differential Revision: D15807487
Pulled By: zhangguanheng66
fbshipit-source-id: 4709ec71a36383d0d35a8f01021e12e22b94992d
Summary:
In this PR, we use `expect` to fill in the token for pytorchbot when doing `git push`, so that we don't need to save the token in the git remote URL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20459
Differential Revision: D15811676
Pulled By: yf225
fbshipit-source-id: cd3b780da05d202305f76878e55c3435590f15a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21742
Add error message to NotImplementedError so we know which function it is about.
Reviewed By: bddppq
Differential Revision: D15806379
fbshipit-source-id: 14eab9d03aa5b44ab95c5caeadc0e01d51f22188
Summary:
When converting pixel_shuffle to reshape + transpose + reshape, the first reshape should
be:
[N, C * r^2, H, W] => [N, C, r, r, H, W]
in order to match pytorch's implementation (see ATen PixelShuffle.cpp).
This previously wasn't caught by the test case, since it uses C = r = 4. Updated test case to
have C = 2, r = 4.
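A quick way to sanity-check the decomposition with C != r (a sketch of the fixed ordering, using PyTorch's pixel_shuffle as the reference):
```python
import torch
import torch.nn.functional as F

N, C, r, H, W = 1, 2, 4, 3, 3
x = torch.randn(N, C * r * r, H, W)

# reshape -> transpose -> reshape, with the first reshape going to [N, C, r, r, H, W]
y = (x.reshape(N, C, r, r, H, W)
      .permute(0, 1, 4, 2, 5, 3)
      .reshape(N, C, H * r, W * r))

print(torch.allclose(y, F.pixel_shuffle(x, r)))  # True
```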
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21486
Reviewed By: houseroad
Differential Revision: D15700945
Pulled By: houseroad
fbshipit-source-id: 47019691fdc20e152e867c7f6fd57da104a12948
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21718
Adding a detection method for whether the package is built for AMD.
Reviewed By: bddppq
Differential Revision: D15795893
fbshipit-source-id: 91a21ee76b2273b1032507bdebe57e016717181d
Summary:
**Closes:** Confusing documentation with distributions.Categorical about logits https://github.com/pytorch/pytorch/issues/16291
**Solution**: Changes the documentation on the Categorical distribution from `log probabilities` to `event log-odds`. This should reduce the confusion raised by this issue, and is consistent with other distributions such as `torch.Binomial`.
More than happy to make any other changes if they fit :).
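For context, a short example of how `logits` are treated (standard torch.distributions behavior, not new code in this PR):
```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([0.0, 1.0, 2.0])   # unnormalized event log-odds, not log-probabilities
d = Categorical(logits=logits)
print(d.probs)                            # softmax over the logits
print(d.logits)                           # logits normalized to log-probabilities internally
```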
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21707
Differential Revision: D15799181
Pulled By: soumith
fbshipit-source-id: f11acca7a5c130102a3ff6674640235ee5aa69bf
Summary:
- [x] Add tests after https://github.com/pytorch/pytorch/pull/20256 is merged
- Support exporting ScriptModule with inputs/outputs of arbitrarily constructed tuples.
- Moved the assigning of output shapes to after graph conversion to ONNX is completed. By then all tuples in the IR have already been lowered by the pass ```_jit_pass_lower_all_tuples```. If assigning output shapes were required to happen before that, we'd need to hand-parse the tuple structures in the graph and repeat the same logic as ```_jit_pass_lower_all_tuples```. Handling inputs is easier because all tuple information is encoded within the input tensor type.
- Swap the order of ```_jit_pass_lower_all_tuples``` and ```_jit_pass_erase_number_types```. Ops like ```prim::TupleIndex``` relies on index being a scalar. ```_jit_pass_erase_number_types``` will convert these kind of scalars to tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20784
Reviewed By: zrphercule
Differential Revision: D15484171
Pulled By: houseroad
fbshipit-source-id: 4767a84038244c929f5662758047af6cb92228d3
Summary:
This renames the CMake `caffe2` target to `torch`, as well as renaming `caffe2_gpu` to `torch_gpu` (and likewise for other gpu target variants). Many intermediate variables that don't manifest as artifacts of the build remain for now with the "caffe2" name; a complete purge of `caffe2` from CMake variable names is beyond the scope of this PR.
The shell `libtorch` library that had been introduced as a stopgap in https://github.com/pytorch/pytorch/issues/17783 is again flattened in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20774
Differential Revision: D15769965
Pulled By: kostmo
fbshipit-source-id: b86e8c410099f90be0468e30176207d3ad40c821
Summary:
Class member annotations can be marked with `Final[T]` instead of adding them to `__constants__`. `Final` comes from the `typing_extensions` module (which will be used if it is present). If not, the polyfill from `_jit_internal` is exposed as `torch.jit.Final` for users that don't want to install `typing_extensions`.
This keeps around `__constants__` since a lot of code is still using it, but in documentation follow ups we should change the examples to all to use `Final`.
TODO: install typing_extensions on CI, move tests to a Python3 only file when #21489 lands
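A minimal sketch of the `Final` usage this enables (assuming the `torch.jit.Final` polyfill mentioned above):
```python
import torch
from torch.jit import Final  # falls back to the polyfill if typing_extensions is absent

class Scaler(torch.nn.Module):
    scale: Final[int]  # marked as a constant via an annotation instead of __constants__

    def __init__(self, scale: int):
        super().__init__()
        self.scale = scale

    def forward(self, x):
        return x * self.scale

scripted = torch.jit.script(Scaler(3))
print(scripted(torch.ones(2)))  # tensor([3., 3.])
```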
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21603
Pulled By: driazati
Differential Revision: D15746274
fbshipit-source-id: d2c9b5643b4abba069b130c26fd42714c906ffac
Summary:
This adds support for PEP 526 style annotations on assignments in place of
`torch.jit.annotate()`, so
```python
a = torch.jit.annotate(List[int], [])
```
turns into
```python
a : List[int] = []
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21390
Differential Revision: D15790937
Pulled By: driazati
fbshipit-source-id: 0cc204f7209a79839d330663cc6ba8320d3a4120
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21177
- Integrate c10::ListPtr into IValue and the c10 dispatcher.
- Streamline conversion to/from IValue. Before, we had IValue::to<> and kernel_functor.h had its own ivalue_to_arg_type and return_type_to_ivalue. They are now unified. Also, this means that nested types like Dicts of Lists of Optional of Dict of ... do work as expected now
Differential Revision: D15476433
fbshipit-source-id: bde9df80df20091aa8e6ae17ba7e90abd149b954
Summary:
Accidentally rebased the old PR and made it too messy. Find it here (https://github.com/pytorch/pytorch/pull/19274).
Creating this PR for comments. The model is still WIP but I want to have some feedback before moving too far. The transformer model depends on several modules, like MultiheadAttention (landed).
Transformer is implemented based on the paper (https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). Users have the flexibility to build a transformer with self-defined and/or built-in components (i.e. encoder, decoder, encoder_layer, decoder_layer). Users could use the Transformer class to build a standard transformer model and modify sub-layers as needed.
Add a few unit tests for the transformer module, as follows:
TestNN.test_Transformer_cell
TestNN.test_transformerencoderlayer
TestNN.test_transformerdecoderlayer
TestNN.test_transformer_args_check
TestScript.test_scriptmodule_transformer_cuda
There is another demonstration example for applying transformer module on the word language problem. https://github.com/pytorch/examples/pull/555
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20170
Differential Revision: D15417983
Pulled By: zhangguanheng66
fbshipit-source-id: 7ce771a7e27715acd9a23d60bf44917a90d1d572
Summary:
Currently we don't have any Linux libtorch binary build in the PR CI, which led to nightly build failure such as https://circleci.com/gh/pytorch/pytorch/1939687. This PR adds Linux libtorch CPU binary build to prevent such breakage from happening in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21671
Differential Revision: D15785003
Pulled By: yf225
fbshipit-source-id: d1f2e4235e48296ddecb3367f8e5a0df16f4ea49
Summary:
Fix https://github.com/pytorch/pytorch/issues/20421
`ProcessGroupGloo` only requires input/output tensors to be contiguous. Contiguous tensors might not start from the beginning of the underlying storage, e.g., `chunk(..., dim=0)[1]`. The current implementation passes `tensor.storage().data()` ptr to gloo buffer. This leads to wrong results if the tensor has a non-zero storage offset.
The proposed solution is to use `tensor.data_ptr()` instead. Let's see if this breaks any tests.
cc qijianan777
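A quick illustration of why `data_ptr()` matters here (a sketch of the offset case described above):
```python
import torch

x = torch.arange(8.)
second = x.chunk(2, dim=0)[1]    # contiguous view into the second half of x's storage
print(second.is_contiguous())    # True
print(second.storage_offset())   # 4: storage().data() points at x[0], data_ptr() at x[4]
```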
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21490
Differential Revision: D15768907
Pulled By: mrshenli
fbshipit-source-id: 9d7d1e9baf0461b31187c7d21a4a53b1fbb07397
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21592
We now support groupwise convolutions for qconv2d
Reviewed By: zafartahirov
Differential Revision: D15739239
fbshipit-source-id: 80b9b4fef5b9ee3d22ebecbaf205b970ab3d4250
Summary:
Closes https://github.com/pytorch/pytorch/issues/21344
DDP assigns the original module to the first module replica instead of creating a new one. Then, it creates a new Reducer to add post hooks to sync gradients. However, because every reconstructed DDP instance wraps the same original module, all their reducers will add hooks to the same set of variables. This PR deletes DDP hooks from variables when destructing Reducer, trying to make DDP failure recoverable.
pietern kuttas and I discussed the following solutions:
#### Solution 1
Keep `add_post_hook` API intact, and do a `dynamic_cast` in `del_post_hook` to check hook type. If the type matches Reducer's hook, delete it. As pietern mentioned, this will not work if we create multiple DDP instances from the same original model.
#### Solution 2
Use a counter to generate a unique key for every hook in `Function`, and keep them in a map. return the key to the caller of `add_post_hook`, and ask the caller to provide key if it needs to delete the hook.
Con: this would add extra overhead to `add_post_hook` and every `Function` object.
#### Solution 3 [Current implementation]
kuttas suggests that, instead of generating a unique key, directly using the address of the pointer would be better. In order to avoid messing up dereferencing, `add_post_hook` returns a `uintptr_t`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21591
Differential Revision: D15745706
Pulled By: mrshenli
fbshipit-source-id: e56d2d48de0c65f6667790ab16337eac7f7d8b76
Summary:
This makes it so we can see the output of prim::Print in environments like iPython notebooks which override sys.stdout
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21625
Differential Revision: D15756793
Pulled By: jamesr66a
fbshipit-source-id: 7d9a14b2e229ed358e784318e9d862677db2c461
Summary:
Emit the loop condition as a separate block in loops, then inline it before conversion to SSA. This is needed for breaks & continues, where we will inline the condition block after the continue pass and before the break pass.
I also considered emitting a prim::For and a prim::While, but I think it's easier to just have one pathway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21611
Differential Revision: D15775820
Pulled By: eellison
fbshipit-source-id: de17c5e65f6e4a0256a660948b1eb630e41b04fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21606
StoreMatrixInMatrixMarketFormat could only dump quantized tensors, but sometimes we want to dump float tensors.
Reviewed By: csummersea
Differential Revision: D15741611
fbshipit-source-id: 95b03c2fdf1bd8407f7d925171d9dc9f25677464
Summary:
Stream is not respected on range/linspace/logspace functions, which contributes to https://github.com/pytorch/pytorch/issues/21589 (this is not a complete solution for that issue).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21619
Differential Revision: D15769666
Pulled By: ezyang
fbshipit-source-id: 7c036f7aecb3119430c4d432775cad98a5028fa8
Summary:
Resolves issue https://github.com/pytorch/pytorch/issues/19003
The author of this issue also asked that `cycle_momentum` default to `False` if the optimizer does not have a momentum parameter, but I'm not sure what the best way to do this would be. Silently changing the value based on the optimizer may confuse the user in some cases (say the user explicitly set `cycle_momentum=True` but doesn't know that the Adam optimizer doesn't use momentum).
Maybe printing a warning when switching this argument's value would suffice?
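For reference, how a user currently works around this with an optimizer that has no `momentum` (existing usage, not a behavior change from this PR):
```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
# Adam's param groups have no 'momentum' key, so momentum cycling must be turned off explicitly
sched = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=1e-3, max_lr=1e-2, cycle_momentum=False)

for _ in range(3):
    opt.step()
    sched.step()
```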
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20401
Differential Revision: D15765463
Pulled By: ezyang
fbshipit-source-id: 88ddabd9e960c46f3471f37ea46013e6b4137eaf
Summary:
This adds support for PEP 526 style annotations on assignments in place of
`torch.jit.annotate()`, so
```python
a = torch.jit.annotate(List[int], [])
```
turns into
```python
a : List[int] = []
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21390
Pulled By: driazati
Differential Revision: D15706021
fbshipit-source-id: 8bf1459f229d5fd0e16e59953b9656e85a2207fb
Summary:
Ops on a Process Group (pg) instance will hit an error when input/output tensors are created on a different process, because pg calls `recordStream` on `CUDACachingAllocator`, which only knows tensors created within the same process.
The proposed solution is to add a `suppressError` arg (suggestions for better names?) to `recordStream`. See comments in code for arguments.
CC pichuang1984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21449
Differential Revision: D15689736
Pulled By: mrshenli
fbshipit-source-id: e7fc81b167868f8666536067eaa7ae2c8584d88e
Summary:
1. reduce the overhead of mkldnn-bridge itself
2. remove redundant code and useless APIs
3. provide new operators, including int8 inner_product, ND permute/transpose, elem_add/mul, etc.
4. improve inner_product to support io format weights without implicit reorder
5. add SoftMax support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20569
Reviewed By: houseroad
Differential Revision: D15558663
Pulled By: bddppq
fbshipit-source-id: 79a63aa139037924e9ffb1069f7e7f1d334efe3a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21207
This diff adds 80 PT pointwise unary ops to the benchmark suite. Most of the ops are added using the generate_pt_tests_from_list interface. The rest are handled separately.
Reviewed By: zheng-xq
Differential Revision: D15471597
fbshipit-source-id: 8ea36e292a38b1dc50f064a48c8cd07dbf78ae56
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21210
This diff introduces a new path to run op with JIT. There are two steps involved here:
1. Users need to script the op. This should happen in the `init` method.
2. The generated graph from step1 is passed to `jit_forward` which will be executed by the benchmark backend
Reviewed By: zheng-xq
Differential Revision: D15460831
fbshipit-source-id: 48441d9cd4be5d0acebab901f45544616e6ed2ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20723
These classes already existed but only as c10::Dict and c10::OperatorKernel.
Since they're now part of torch::RegisterOperators(), they should also live in the torch namespace.
Differential Revision: D15421575
fbshipit-source-id: d64ebd8664fadc264bbbae7eca1faa182529a32b
Summary:
yf225 helped me discover that our CI does not run multi-gpu tests in `test_c10d.py`. There are quite a few multi-gpu c10d tests. This PR tries to enable those tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21598
Differential Revision: D15744256
Pulled By: mrshenli
fbshipit-source-id: 0a1524a862946128321f66fc8b7f331eff10e52a
Summary:
Create an uninitialized ivalue. This will be needed for breaks & continues to match up if-block outputs for values that are guaranteed not to be used but need to escape the block scope. It is not exposed to users.
Was previously part of final returns but I was asked to make a separate PR for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21387
Differential Revision: D15745124
Pulled By: eellison
fbshipit-source-id: ae6a6f766b4a70a71b9033987a630cfbf044e296
Summary:
For consistency, derivatives.yaml now uses the same schema specification as native_functions.yaml.
Note that there are some small downsides, e.g. changing the default values or return parameter names in native_functions.yaml also now requires updating derivatives.yaml as well. But this has a few nice properties:
1) Able to copy-paste definitions from native_functions to derivatives.
2) Makes it impossible to write derivatives for operators without schemas (e.g. old TH operators).
3) Moves us closer to the ideal situation of co-locating forward and backwards declarations.
Note that this doesn't change any generated code; in particular, this has the same behavior of mapping in-place and out-of-place definitions together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20916
Differential Revision: D15497800
Pulled By: gchanan
fbshipit-source-id: baee5caf56b675ce78dda4aaf6ce6a34575a6432
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21599
We prevented this because c10 ops can't have a backwards yet and calling them with requires_grad=True would do the wrong thing
if the c10 op is not purely implemented by calling other autograd-able ops.
However, it is a valid use case to have c10 ops that just call other autograd-aware ops, and these ops should be callable with requires_grad=True.
This should fix https://github.com/pytorch/pytorch/issues/21584.
Differential Revision: D15744692
fbshipit-source-id: ba665365c850ef63fc9c51498fd69afe49e5d7ec
Summary:
An incorrect increment / decrement caused the samples to not be generated from a multinomial distribution
Changelog:
- Remove the incorrect increment / decrement operation
Fixes #21257, fixes #21508
cc: LeviViana neerajprad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21324
Differential Revision: D15717575
Pulled By: ezyang
fbshipit-source-id: b1154e226d426c0d412d360c15f7c64aec95d101
Summary:
Add a test that wasn't on the CI, but is tested internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21594
Differential Revision: D15742157
Pulled By: eellison
fbshipit-source-id: 11fc82d1fc0281ffedd674ed96100e0c783c0599
Summary:
This PR addresses some numerical issues of Sigmoid/StickBreakingTransform, where these transforms give +-inf when the unconstrained values move to +-20 areas.
For example, with
```
t = torch.distributions.SigmoidTransform()
x = torch.tensor(20.)
t.inv(t(x)), t.log_abs_det_jacobian(x, t(x))
```
With the current behaviour, the inverse will return `inf` and the logdet `-inf`, while this PR makes them `15.9424` and `-15.9424`.
And for
```
t = torch.distributions.StickBreakingTransform()
x = torch.tensor([20., 20.])
t.inv(t(x)), t.log_abs_det_jacobian(x, t(x))
```
current value is `(inf, nan)` and `-inf` for logdet, while this PR makes it `[16.6355, 71.3942]` and `-47.8272` for logdet.
Although these finite values are wrong and seems unavoidable, it is better than returning `inf` or `nan` in my opinion. This is useful in HMC where despite that the grad will be zero when the unconstrained parameter moves to unstable area (due to clipping), velocity variable will force the parameter move to another area which by chance can move the parameter out of unstable area. But inf/nan can be useful to stop doing inference early. So the changes in this PR might be inappropriate.
I also fix some small issues of the `_Simplex` and `_RealVector` constraints where the batch shape of the input is not respected when checking validity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20288
Differential Revision: D15742047
Pulled By: ezyang
fbshipit-source-id: b427ed1752c41327abb3957f98d4b289307a7d17
Summary:
This changes our compiler so it first emits Loads & Stores, and then transforms the graph to SSA in a follow up pass. When a variable is set, we emit a prim::Store, and when a variable is referenced, we emit a prim::Load.
```
a = 1
print(a)
```
becomes:
```
%a.1 : int = prim::Constant[value=1]()
prim::Store[name="a"](%a.1)
%a : int = prim::Load[name="a"]()
prim::Print(%a)
```
In the follow up pass, convertToSSA, the values are turned into SSA form with the Loads & Stores removed. This change will enable breaks and continues because you can transform the graph with the variable naming information still intact.
There are still some remaining jitter and edge-case issues that I have to look through, but I think it is still ready for review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21101
Differential Revision: D15723353
Pulled By: eellison
fbshipit-source-id: 3269934d4bc24ddaf3a87fdd20620b0f954d83d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21382
Concat tensor inference function was not handling correctly the case where axis argument points to the last dimension so input tensors don't need to have the same number of dimensions.
Split tensor inference function was not handling correctly the case where split information is provided as the second input tensor rather than as an argument.
Reviewed By: mdschatz
Differential Revision: D15633148
fbshipit-source-id: d566af44dc882457ee9efe83d2461b28408c2c5d
Summary:
Should be self-explanatory. This `int` variable is overflowing.
Reported in #21526
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21530
Differential Revision: D15719275
Pulled By: umanwizard
fbshipit-source-id: 24e917a00a5b78bc3af29ef3b8b72eea7e89d5d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21556
Optimize the batch mm op when broadcasting the second input
Reviewed By: houseroad
Differential Revision: D15728914
fbshipit-source-id: c60441d69d4997dd32a3566780496c7ccda5e67a
Summary:
This was looking at the number of elements in the memo table, not the total capacity, and was thus calling reserve() a lot more than it should have
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21542
Reviewed By: driazati
Differential Revision: D15723132
Pulled By: jamesr66a
fbshipit-source-id: 20e1f9099b6a51a33994ea9dbc3f22eb3bc0c8f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21195
The motivation is that, while we shouldn't break USER code for using
deprecated declarations, we should keep our internal code base
deprecation clean.
Differential Revision: D15576968
fbshipit-source-id: fb73a8986a5b60bf49ee18260653100319bb1030
Summary:
namedtensor build + test should run on PRs only if the commit message
includes [namedtensor ci].
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21520
Differential Revision: D15718404
Pulled By: zou3519
fbshipit-source-id: ce8b5df2682e795e64958a9d49e2e3c091599b33
Summary:
This should further reduce noise by only clang-formatting the lines you actually touched in the precommit hook.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15657
Differential Revision: D15717337
Pulled By: suo
fbshipit-source-id: 57e65a679a8fdee5c3ff28e241c74ced9398eb0c
Summary:
The new implementation of tracing supports more modules, so a lot of error-handling code can be removed by replacing the old one (LegacyTracedModule).
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21339
Reviewed By: natalialunova
Differential Revision: D15695154
Pulled By: orionr
fbshipit-source-id: af7d35754e9f34bd1a0ad7b72a9ebe276ff8ab98
Summary:
Fixes#12259, needs to make sure tests (see #13766) don't break due to numerical precision issues. Not sure what would need to be adjusted here...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13774
Differential Revision: D15715021
Pulled By: ezyang
fbshipit-source-id: 20ce2beee1b39ebe9f023c5f2b25be53acccb5f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21492
If one async operator failed, async_scheduling net currently only marks all scheduled async operators as finished without cancelling the callbacks.
The new behavior is to cancel the callbacks first, then set event status to finished.
Reviewed By: ilia-cher
Differential Revision: D15702475
fbshipit-source-id: 55a1774d768b2e238bab859b83332f1877a001ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21502
In BenchResult, we keep name, avg_fwd, std_fwd, avg_bwd, and std_bwd. There is no per-iteration information. In this diff, I am adding more info to BenchResult to include the number reported from each iteration.
Reviewed By: wanchaol
Differential Revision: D15706306
fbshipit-source-id: 3f14be4ba91f1f6da473995783bd7af1d067938d
Summary:
This moves `JitTestCase` to its own file so that we can have other jit
test files (ex. `test_jit_py3.py`)
There aren't any code changes, just a move and cleaning up the imports
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21491
Pulled By: driazati
Differential Revision: D15703060
fbshipit-source-id: 6082e8b482100bb7b0cd9ae69738f1273e626171
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21230
tsia; this diff adds empty-tensor support for the reshape operator
Reviewed By: jerryzh168
Differential Revision: D15583356
fbshipit-source-id: 6d44c04e95ca3546509bfb12102e29c878f9a7c7
Summary:
Modify the MKLDNN pooling operation to support ceil mode by adjusting the right/bottom padding accordingly. This is done similarly to Caffe (see discussion https://github.com/pytorch/pytorch/pull/19205#discussion_r276903751).
To make this possible, I split the padding into left and right (top / bottom). This naming is confusing but actually follows mkldnn's own naming for pooling::compute(). We increase the right/bottom paddings so that the output matches the ceil-mode expected output size.
Strengthened the test case.
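For reference, a minimal sketch of the ceil-mode semantics being matched, written with the PyTorch functional API rather than the Caffe2/MKLDNN operator this diff actually touches:
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 6, 6)
# floor mode: output side = floor((6 - 3) / 2) + 1 = 2
floor_out = F.max_pool2d(x, kernel_size=3, stride=2)
# ceil mode: output side = ceil((6 - 3) / 2) + 1 = 3, as if extra right/bottom padding were added
ceil_out = F.max_pool2d(x, kernel_size=3, stride=2, ceil_mode=True)
print(floor_out.shape, ceil_out.shape)  # torch.Size([1, 1, 2, 2]) torch.Size([1, 1, 3, 3])
```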
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21310
Reviewed By: bddppq
Differential Revision: D15611664
Pulled By: akyrola
fbshipit-source-id: 46b40015dafef69a8fd5e7b2c261d8dbf448cd20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21393
Result of splitting the base diff. We moved a header from src/* to include/fbgemm/*
Reviewed By: jianyuh
Differential Revision: D15635188
fbshipit-source-id: ad7d0ddba964ff1cb8b2e33f5f98e457a4d2eac9
Summary:
changed `UpsampleBilinearKernel` s.t. the throughput increased 40~50%.
I tested locally with my local test code -- **not pytorch's provided test code** -- because I am having a build problem (which I made an issue about [here](https://github.com/pytorch/pytorch/issues/19184)). I tested with various tensor sizes, and across all of them it showed a significant increase in throughput.
1. added `__restrict__`
2. instead of launching as many threads as there are output elements, I launched only `output_height * output_width` many threads and had each thread iterate through the channel and batch dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19306
Differential Revision: D15701840
Pulled By: ezyang
fbshipit-source-id: 53c54d4f4e4a28b58ecc7d7ae6b864cbfc760e27
Summary:
Currently, when the input of MVN is precision matrix, we take inverse to convert the result to covariance matrix. This, however, will easily make the covariance matrix not positive definite, hence will trigger a cholesky error.
For example,
```
import torch
torch.manual_seed(0)
x = torch.randn(10)
P = torch.exp(-(x - x.unsqueeze(-1)) ** 2)
torch.distributions.MultivariateNormal(loc=torch.ones(10), precision_matrix=P)
```
will trigger `RuntimeError: cholesky_cpu: U(8,8) is zero, singular U.`
This PR uses some math tricks ([ref](https://nbviewer.jupyter.org/gist/fehiepsi/5ef8e09e61604f10607380467eb82006#Precision-to-scale_tril)) to only take inverse of a triangular matrix, hence increase the stability.
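The trick from the linked notebook can be sketched roughly as follows (an illustrative helper, not necessarily the exact code in this PR); only a triangular matrix is inverted:
```python
import torch

def precision_to_scale_tril(P):
    # Illustrative sketch: derive a lower-triangular scale factor from a precision
    # matrix via a Cholesky factor of the flipped matrix, so only a triangular
    # matrix needs to be inverted.
    Lf = torch.cholesky(torch.flip(P, (-2, -1)))
    L_inv = torch.flip(Lf, (-2, -1)).transpose(-2, -1)
    eye = torch.eye(P.shape[-1], dtype=P.dtype)
    return torch.triangular_solve(eye, L_inv, upper=False)[0]
```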
cc fritzo, neerajprad , SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21366
Differential Revision: D15696972
Pulled By: ezyang
fbshipit-source-id: cec13f7dfdbd06dee94b8bed8ff0b3e720c7a188
Summary:
This PR addresses the problem described in the comment: https://github.com/pytorch/pytorch/pull/20203#issuecomment-499231276
and the previously coded bad behaviour:
- a warning was raised every time lr scheduling was initialized
Now the code checks that:
- on the second call of `lr_scheduler.step`, ensure that `optimizer.step` has already been called, otherwise raise a warning (as was done in #20203 )
- if the optimizer's step is overridden -> raise another warning once, to make the user aware of the new pattern:
`opt.step()` -> `lrs.step()`, as we cannot check this (see the sketch after these lists).
Now tests check that
- at initialization (`lrs = StepLR(...)`)there is no warnings
- if we replace `optimizer.step` by something else (similarly to the [code of nvidia/apex](https://github.com/NVIDIA/apex/blob/master/apex/amp/_process_optimizer.py#L287)) there is another warning raised.
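For reference, a minimal sketch of the call order the new warnings steer users towards (model, optimizer and scheduler here are illustrative):
```python
import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

for epoch in range(20):
    # ... run the training loop for this epoch ...
    optimizer.step()    # must come before scheduler.step(), otherwise a warning is raised
    scheduler.step()
```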
cc ezyang
PS. honestly I would say that there is a lot of overhead introduced for simple warnings. I hope all these checks will be removed in future `1.2.0` or other versions...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21460
Differential Revision: D15701776
Pulled By: ezyang
fbshipit-source-id: eac5712b9146d9d3392a30f6339cd33d90c497c7
Summary:
Fixes#21026.
1. Improve build docs for Windows
2. Change `BUILD_SHARED_LIBS=ON` for Caffe2 local builds
3. Change to out-source builds for LibTorch and Caffe2 (transferred to #21452)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21190
Differential Revision: D15695223
Pulled By: ezyang
fbshipit-source-id: 0ad69d7553a40fe627582c8e0dcf655f6f63bfdf
Summary:
Another simple bit of syntax that NumPy supports and we don't.
Support int, float, and bool.
```python
>>> torch.randn((2,3), dtype=float)
tensor([[-0.1752, -0.3240, -0.6148],
        [ 0.1861,  1.6472,  0.1687]], dtype=torch.float64)
```
A bit confusingly, Python's "float" actually means double, but nothing we can do about that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21215
Differential Revision: D15697012
Pulled By: umanwizard
fbshipit-source-id: 9a38d960a610b8e67023486b0c9265edd3c22246
Summary:
Adds support for recursively compiling `nn.Sequential` and
`nn.ModuleList`. When either is used, it is converted to a
`jit._ConstModuleList` or `jit._ConstSequential` as necessary. Due to
this, we don't need to add it to `__constants__` since it's made
constant on demand.
This PR also moves the recursive script tests out to their own class
`TestRecursiveScript` (the added test is called `test_iterable_modules`)
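A hedged sketch of what this enables, written against current recursive-script behaviour (the module and shapes are illustrative, not taken from the tests):
```python
import torch
import torch.nn as nn

class Stack(nn.Module):
    def __init__(self):
        super(Stack, self).__init__()
        # No __constants__ entry needed; the container is made constant on demand.
        self.layers = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

    def forward(self, x):
        return self.layers(x)

scripted = torch.jit.script(Stack())
print(scripted(torch.randn(3, 4)).shape)  # torch.Size([3, 2])
```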
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21306
Pulled By: driazati
Differential Revision: D15611738
fbshipit-source-id: fac52993990bd2dfad71d044c463a58a3759932a
Summary:
Enable bool tensors for these index methods:
- index_select
- index_copy
- put
- take
- index_fill
Tested via unit tests
TODO:
Enable index_add in a separate PR as it requires more "side" changes.
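A small sketch of the newly enabled behaviour (values are illustrative):
```python
import torch

mask = torch.tensor([True, False, True, False], dtype=torch.bool)
idx = torch.tensor([0, 2])

print(mask.index_select(0, idx))        # tensor([True, True])
print(mask.index_fill(0, idx, False))   # tensor([False, False, False, False])
print(mask.take(torch.tensor([1, 3])))  # tensor([False, False])
```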
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21435
Differential Revision: D15684964
Pulled By: izdeby
fbshipit-source-id: 48440e4d44873d70c4577e017dd0d8977e0fa15a
Summary:
`torch.tensor([True, False, True], dtype=torch.bool).sum()` should return **2** instead of **True** as it does now.
Tested via unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21421
Differential Revision: D15674203
Pulled By: izdeby
fbshipit-source-id: b00e3d0ca809c9b92b750adc05632522dad50c74
Summary:
Fixes#19540
CC nmerrill67
C++ data parallel was using Module.clone() to create module replicas on every destination device. However, clone() does not set up gradient edges to point from replicas to the original module. As a result, the gradient will not be aggregated into the original module. This commit fixes the problem by manually setting gradient edges from every parameter X in every replica to the same parameter X in the original module.
## Failed Attempt
Initially I tried implementing what we did in `replicate.py`, which
1. create module replicas
2. use Python `Broadcast` autograd function to broadcast every parameter in the original module to all destination devices.
3. assign the broadcast result params to module replicas' `_parameters` dict.
This works in Python because derived module member field params (e.g., `Linear.weight`) and base module `_parameters` (e.g., `Linear._parameters['weight']`) are referencing the same parameter instance. Assigning one of them will apply to both. However, in C++, even though I can modify Module's `parameters_` values and gradient edges to point to the broadcast source, I cannot touch the weight and bias member fields in Linear, because replicate cannot (and should not) add special-case handlers to every different module. (See `Linear` [.h](https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/nn/modules/linear.h), [.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/src/nn/modules/linear.cpp)) Although they initially point to the same `TensorImpl` instance, after assigning to `Module.parameters_['weight']`, it will be different from `Linear.weight`.
## Solution Options
gchanan and I had several discussions on this issue and figured two solutions to this problem.
### Option One [implemented in this PR]
Replicate the module in two steps:
1. call `Module.clone()` to create a module replica on every destination device.
2. manually setting gradient edges from every parameter in every replica to the same parameter in the original module.
* Pro: Does not need to change any existing module, and relatively easier to implement
* Con: It is a little hackish.
### Options Two
Implement a `Replicatable` class (similar to `Cloneable`), and make it a friend class of `Module`. For more details see `Note [Replicating Modules]` in the code change.
* Pro: Maybe this aligns more with our existing approach implemented in `Cloneable`?
* Con: Require changes to every existing module.
I am inclined to go with option one, because `replicate` will only be used on data parallel. I feel it is too big an overkill if we have to change all existing module implementations due to a data parallel requirement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20910
Differential Revision: D15556426
Pulled By: mrshenli
fbshipit-source-id: aa836290ec657b32742e2bea80bd0ac2404ef3b0
Summary:
Fixed an issue where models can not be loaded in a 32-bit environment like Raspbian.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20900
Differential Revision: D15696709
Pulled By: ezyang
fbshipit-source-id: 37a81f05f235d3b9fc6244e12d3320ced3d1465e
Summary:
Current versions of NVRTC incorrectly map error code 7 to the error string "NVRTC unknown error." This update maps error code 7 to the correct string explicitly in PyTorch. See the documentation at: https://docs.nvidia.com/cuda/nvrtc/index.html#group__error.
This may give us a better idea of the source of NVRTC errors that some community members, like Uber, have reported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21174
Differential Revision: D15696593
Pulled By: ezyang
fbshipit-source-id: f5c7b5876c07b311ab5f2d7c8e375e93273912c6
Summary:
Fixed #21269 by removing the expected `ValueError` when converting a tensor to a NumPy `int8` array in the Numba interoperability test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21458
Differential Revision: D15696363
Pulled By: ezyang
fbshipit-source-id: f4ee9910173aab0b90a757e75c35925b026d1cc4
Summary:
I inserted default weight and reduction params in the binary_cross_entropy_with_logits function. These default params already exist in Python and in the binary_cross_entropy function in C++.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21336
Differential Revision: D15628917
Pulled By: ezyang
fbshipit-source-id: 38e5f53851125238842df1bd71cb6149c8603be1
Summary:
This could serve as an alternative solution to export ```torch.gather``` before something similar goes into the ONNX spec. The exported model is verified to be correct against the onnxruntime backend. We weren't able to test against the Caffe2 backend because it doesn't seem to support OneHot opset9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21235
Differential Revision: D15613039
Pulled By: houseroad
fbshipit-source-id: 7fc097f85235c071474730233ede7d83074c347f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21440
This diff modifies the output format when ai_pep_format is enabled.
Reviewed By: hl475
Differential Revision: D15681042
fbshipit-source-id: df5f2dbb38d1bd866ca7f74ef4e63459d480be6e
Summary:
We have encountered `std::bad_cast` error when running PyTorch binary built with cxx11 abi on CentOS7, stack trace:
```
#0 0x00007fec10160207 in raise () from /lib64/libc.so.6
#1 0x00007fec101618f8 in abort () from /lib64/libc.so.6
#2 0x00007fec015767d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007fec01574746 in ?? () from /lib64/libstdc++.so.6
#4 0x00007fec01574773 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007fec01574993 in __cxa_throw () from /lib64/libstdc++.so.6
#6 0x00007fec015c94d2 in std::__throw_bad_cast() () from /lib64/libstdc++.so.6
#7 0x00007feb2ab3c2d7 in std::__cxx11::numpunct<char> const& std::use_facet<std::__cxx11::numpunct<char> >(std::locale const&) ()
from /root/.local/lib/python2.7/site-packages/torch/lib/libcaffe2.so
#8 0x00007feb28643d62 in torch::jit::script::strtod_c(char const*, char**) () from /root/.local/lib/python2.7/site-packages/torch/lib/libcaffe2.so
```
We suspect this line gets compiled to a gcc-ABI-dependent symbol:
```
char decimal_point = std::use_facet<std::numpunct<char>>(std::locale()).decimal_point();
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21293
Differential Revision: D15609910
Pulled By: bddppq
fbshipit-source-id: e247059729863868e4b36d6fec4fcbc36fbc4bb1
Summary:
Fixing an incorrect implementation of the CELU activation function. The existing implementation works by a chance combination of errors that seem to cancel each other out. This change makes the code more readable, aligns the parameter names correctly, and is consistent with the cuda implementation.
I came across this issue while working on version counters... I attempted to specify a gradient in derivatives.yaml for CELU due to a failed test, but the derivative couldn't be specified correctly without fixing the celu implementation.
https://github.com/pytorch/pytorch/pull/20612
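For reference, a short sketch of the CELU definition the fixed kernel should match (the textbook formula, not the patched C++ code):
```python
import torch

def celu_reference(x, alpha=1.0):
    # CELU(x) = max(0, x) + min(0, alpha * (exp(x / alpha) - 1))
    return x.clamp(min=0) + (alpha * (torch.exp(x / alpha) - 1)).clamp(max=0)

x = torch.randn(5)
print(torch.allclose(celu_reference(x, alpha=0.5), torch.nn.functional.celu(x, alpha=0.5)))
```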
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21213
Differential Revision: D15678823
Pulled By: nairbv
fbshipit-source-id: 29fa76b173a66c2c44ed2e0b7959e77f95d19c43
Summary:
This PR is a continuation of #15310, which itself is a continuation of #14845, #14941, & #15293. It should be synced up with the pytorch/master branch as of yesterday.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19465
Differential Revision: D15632268
Pulled By: ezyang
fbshipit-source-id: 8e337e8dc17ac31439935ccb530a7caf77f960e6
Summary:
We want to be able to call stft from TorchScript, which requires that stft have a type annotation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21302
Differential Revision: D15607973
Pulled By: cpuhrsch
fbshipit-source-id: c4a5c09cdaafe7e81cf487a3ad216d1b03464a21
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21392
as discussed at https://github.com/pytorch/pytorch/pull/21244, we
found some values in log_beta are not properly initialized. This diff will 1)
initialize all log_beta to -inf; 2) fix a tricky compare condition; and 3) zero out all
the gradient elements corresponding to padding.
Offline experiments show that this diff can fix previous seen NaN loss.
Differential Revision: D15637977
fbshipit-source-id: 477008a5e11aae946bd2aa401ab7e0c513421af0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21398
The Module::forward method calls the find_method() function, potentially from multiple threads.
Internally it calls the find_offset() method and reads the dict_ object.
If the corresponding name is not in the dictionary, the thread calls the insert() method and modifies the dict_ object.
While the first thread modifies the dict_ object, another thread can enter the forward()->find_method()->find_offset() path
and read the dict_ object while it is being modified -> crash.
Moved the mutex protection up to protect both the find_offset() and insert() calls.
Consider using a C++17 shared_mutex locking object instead of a recursive_mutex object.
Reviewed By: bddppq
Differential Revision: D15638942
fbshipit-source-id: ca6a453448302a0b3666c87724755fa4e9ce242f
Summary:
Something flaky is going on with `test_inplace_view_saved_output` on Windows.
With my PR #20598 applied, the test fails, even though there is no obvious reason it should be related, so the PR was reverted.
Based on commenting out various parts of my change and re-building, I think the problem is with the name -- renaming everything from `T` to `asdf` seems to make the test stop failing. I can't be sure that this is actually the case though, since I could just be seeing patterns in non-deterministic build output...
I spoke with colesbury offline and we agreed that it is okay to just disable this test on Windows for now and not block landing the main change. He will look into why it is failing.
**Test Plan:** I will wait to make sure the Windows CI suite passes before landing this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21175
Differential Revision: D15566970
Pulled By: umanwizard
fbshipit-source-id: edf223375d41faaab0a3a14dca50841f08030da3
Summary:
Currently tools/build_pytorch_libs.py looks quite convoluted. This commit reorganizes cmake-related functions into a separate file to make the code clearer.
---
This is hopefully helpful for further contribution for better integration with cmake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21367
Differential Revision: D15636991
Pulled By: soumith
fbshipit-source-id: 44d76e4e77aec0ce33cb32962b6a79a7f82785da
Summary:
This default was incorrect and made printing in python not print file:line:col
This wasn't caught because FileCheck internally uses operator<< to print the graph, which has `true` hardcoded as the value. I've added more comprehensive tests to catch this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21370
Differential Revision: D15631135
Pulled By: jamesr66a
fbshipit-source-id: c809e06fff4f0174eefeb89062024384b4944ef7
Summary:
I found this significantly speeds up incremental builds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21334
Differential Revision: D15632994
Pulled By: suo
fbshipit-source-id: bb4af90f4400bffa90d168d82ff30fece5e3835c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21365
This diff adds new operators to benchmark_all_test so all the supported ops can be built as one binary
Reviewed By: hl475
Differential Revision: D15627328
fbshipit-source-id: b7ca550a279f485102a6a6bd47e4032c7beb9940
Summary:
The original PR (#16071) no longer works after `caffe2` and `torch` were unified. What's more, it is making the binary big, since the optimization flag is disabled on a very big project (the `torch` library used to be small, but the flag now applies to the whole `caffe2` and `caffe2_gpu` libraries). We need to get it reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21335
Differential Revision: D15622163
Pulled By: soumith
fbshipit-source-id: 900bd400106d27a1512eed1e9f2288114f5f41bb
Summary:
This adds a regression test for the bug fix in #21236. Operations
involving CUDA tensors and CPU scalars should not copy the CPU scalar to
the device (because that is slow). They should instead "lift" the scalar
to a kernel parameter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21253
Reviewed By: bddppq
Differential Revision: D15604080
Pulled By: colesbury
fbshipit-source-id: c14ded5d584499eaa5ea83337ffc50278205f3d6
Summary:
This solves the situation where, for example, someone instantiates LSTM with `dropout=0`, a Python integer. This works fine in Python, but JIT throws a type error because it expected float but got int
Resolves https://github.com/pytorch/lockdown/issues/65
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21304
Differential Revision: D15613153
Pulled By: jamesr66a
fbshipit-source-id: eabff76e3af3de0612583b37dbc5f7eab7e248a4
Summary:
This PR adds support for torch.rand export in the PyTorch ONNX exporter. There are other generator ops that need to be supported for export and they will added in subsequent PRs. This op is needed with priority for a model on our end.
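A minimal export sketch (the module and file name are illustrative):
```python
import torch

class AddNoise(torch.nn.Module):
    def forward(self, x):
        # torch.rand in the graph should now export instead of raising an unsupported-op error
        return x + torch.rand(2, 3)

torch.onnx.export(AddNoise(), torch.zeros(2, 3), "add_noise.onnx")
```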
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20559
Differential Revision: D15379653
Pulled By: houseroad
fbshipit-source-id: d590db04a4cbb256c966f4010a9361ab8eb3ade3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20915
Clean the unary processor code. Some question are added into the comments to seek suggestions.
Reviewed By: pjh5
Differential Revision: D15448502
fbshipit-source-id: ef0c45718c1a06187e3fe2e4e59b7f20c641d9c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21206
This diff change the default test_name to be a globally unique value across tests. With that, users can list all the tests and choose to run a specific test.
Reviewed By: zheng-xq
Differential Revision: D15543508
fbshipit-source-id: 0814ef6a60d41637fed5245e30c282497cf21bb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21149
The diff modifies the interface for PyTorch operators in the benchmark suite
Reviewed By: zheng-xq
Differential Revision: D15433897
fbshipit-source-id: e858183431eb37d90313356716c2de8709372b58
Summary:
This doesn't affect anything because we run constant pooling, and in the case of Closures and Forks creates unnecessary closures over constants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21229
Differential Revision: D15587764
Pulled By: eellison
fbshipit-source-id: d5609b0a5697071fab5050eb9e03876ab9ebb27a
Summary:
~~This is work in progress due to its dependency on multiple pending PRs.~~
- [x] ONNX: Relax constraint on subgraph input/output type & shape check. https://github.com/onnx/onnx/pull/2009
- [x] PyTorch: Add infra to test_pytorch_onnx_caffe2.py to test ScriptModule models. https://github.com/pytorch/pytorch/pull/20256
This PR should partially resolve https://github.com/pytorch/pytorch/issues/17531. However, ideally we shouldn't need to put cast(and reshape) node to help the conversion for loop condition.
- Added cast node for condition values before entering loop node. The ONNX spec only accepts Bool type, while in PyTorch if the condition value is an output from other node it could potentially have any integral type.
- Tidying up the exported ONNX loop subgraph input type & shape. According to ONNX spec, input "M" is exported as 0-d scalar tensor with type int64. input "Cond" is exported as incomplete tensor of type Bool without shape information. This is because through out the iteration, the rank of condition value is dynamic, either 0-d or 1-d, as long as it holds a single value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20445
Differential Revision: D15534188
Pulled By: houseroad
fbshipit-source-id: d174e778529def05ee666afeee4b8fb27786e320
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21267
Replace AT_ASSERTM with TORCH_CHECK: AT_ASSERTM is deprecated.
Not sure when ```AT_ASSERT``` is dprecated with some new TORCH ASSERT function.
Reviewed By: zafartahirov
Differential Revision: D15599242
fbshipit-source-id: 23f21a9a23dc3c147dc817e6d278066d0832e08d
Summary:
This PR improves performance of advanced indexing backward, partially solving #15245 (performance is still worse than gather, but not by such outrageous margins). Before, using benchmarking harness from #15245, cuda 10/V100:
```
Indexing is faster by at most -270.61607820767887 us on N: 16 D: 256 K: 1
Indexing is slower by at most 11127.466280784833 us on N: 16 D: 4096 K: 4096
```
after:
```
Indexing is faster by at most 23.524456737696028 us on N: 512 D: 4096 K: 4096
Indexing is slower by at most 186.24056029472553 us on N: 16 D: 1024 K: 4096
```
Strategy is to reuse embedding backward kernel, adapting it to handle unindexed dimensions in the beginning by launching additional threadblocks, and also allowing it to handle slices that are bigger than `65K*128`, that is hardly ever a problem for embedding. Still, integer indexing is baked in the kernel, and is important for performance, so for now bigger than 2G element tensors are not supported.
The main savings come from not having to expand index to all unindexed dimensions, and not sorting expanded index with incoming gradient values, but rather only sorting unexpanded index.
There are ways to make sorting overhead smaller (thanks mcarilli for suggestions) but I'll get to it when it becomes a real problem, or rather, when cuda graphs will force us to get rid of thrust::sort calls.
I've also added tests for indexing backward, before tests for index_put_ and indexing backward were non-existent.
This PR also fixes#20457 by casting indices to `self` backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20557
Differential Revision: D15582434
Pulled By: ezyang
fbshipit-source-id: 91e8f2769580588ec7d18823d99a26f1c0da8e2a
Summary:
Stacked on https://github.com/pytorch/pytorch/pull/21217
This adds support for recording file and line information during tracing, by extracting the top Python interpreter frame
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21247
Reviewed By: suo, driazati
Differential Revision: D15594553
Pulled By: jamesr66a
fbshipit-source-id: 72e1b3a46f1dabe3e83a608ec1a7d083bd1720f9
Summary:
Remove Dropout from the opset 10 blacklist.
ONNX Dropout was modified in opset 10, but only the output "mask" was modified, which is not exported in pytorch opset 9. So we can still fallback on the opset 9 op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20710
Differential Revision: D15571248
Pulled By: houseroad
fbshipit-source-id: 15267eb63308a29a435261034b2f07324db1dea6
Summary:
We're not getting much from checking the export strings, and they are noisy and slow development. Didn't realize they existed until now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21238
Differential Revision: D15604256
Pulled By: eellison
fbshipit-source-id: 488e9401231228cffe132dab99d519563fa63afc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21100
Added multifile flag to write scalar data into separate files. This can slow down dashboard loading.
Reviewed By: orionr
Differential Revision: D15548913
fbshipit-source-id: dd39a7f76f93025d28f14babbf933e39860e6910
Summary:
Loops.h contains specializations for cases where all the inputs are
contiguous as well as cases where one input is a scalar and all other
inputs are contiguous.
Previously, there were separate checks for functions that take
zero, one, or two input arguments. This is getting unwieldy, especially
once we add support for functions that take three inputs (#21025).
This requires the use of recursive templates (which have their own
downsides), but this seems better than the alternative.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21106
Differential Revision: D15562430
Pulled By: colesbury
fbshipit-source-id: 5f19ab2212e16e29552887f4585c2b4a70309772
Summary:
Instead of attempting to hardcode calls to "ninja" or "make", we should always let cmake do it. This better integrates build configurations (DEBUG or REL_WITH_DEB_INFO) and better handles the case in which the native build tool is not in PATH (cmake has some capacity to find them and has options for users to specify their locations).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21105
Differential Revision: D15602883
Pulled By: soumith
fbshipit-source-id: 32ac46d438af00e791defde6ae5ac21c437d0bb0
Summary:
Retry #21197
The previous one failed because it used some Python 3-only syntax.
ezyang Do we still have multi-GPU py2 tests? I am curious why the CI tests did not catch this error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21262
Differential Revision: D15598941
Pulled By: mrshenli
fbshipit-source-id: 95f416589448c443685d6d236d205b011998a715
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20883
Add autograd for layer_norm on CPU, after this diff, both PyTorch and jit model can automatically benefit from performance improvement of nn.functional.layer_norm
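A tiny sketch of the path that now benefits (shapes are illustrative):
```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 128, requires_grad=True)
y = F.layer_norm(x, normalized_shape=(128,))
y.sum().backward()      # CPU backward now goes through the autograd formula added here
print(x.grad.shape)     # torch.Size([32, 128])
```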
Reviewed By: zheng-xq
Differential Revision: D15483790
fbshipit-source-id: 94ed3b16ab6d83ca6c254dbcfb224ff7d88837f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20665
Add gelu activation forward on CPU in pytorch
Compare to the current Python-implemented version of gelu in the BERT model:
    def gelu(self, x):
        return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))
The torch.nn.functional.gelu function can reduce the forward time from 333ms to 109ms (with MKL) / 112ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.
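A quick correctness sketch comparing the Python reference above with the new native function (sizes are illustrative):
```python
import torch
import torch.nn.functional as F

x = torch.randn(64, 128, 56, 56)
ref = x * 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))  # the Python/BERT-style reference
out = F.gelu(x)                                  # the new native kernel
print(torch.allclose(ref, out, atol=1e-6))
```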
Reviewed By: zheng-xq
Differential Revision: D15400974
fbshipit-source-id: f606b43d1dd64e3c42a12c4991411d47551a8121
Summary:
cc ezyang this is meant to fix the fuser failures on master
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21252
Differential Revision: D15594283
Pulled By: jamesr66a
fbshipit-source-id: 85f37e78b2de051c92ade3fe4c44c7530b4542e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21233
It is possible that OnnxifiOp is created in a thread where weights have been cleaned from the workspace, which is a legit use case as we can create the backend once and lower all the weights. So we need to extract the weight shape info the first time we create the backend and save it.
Reviewed By: bertmaher, rdzhabarov
Differential Revision: D15587237
fbshipit-source-id: 1f264dc32c0398c42b618e9c41c119eb13e1c9f1
Summary:
Fixes#21108
When grad is disabled, Python autograd function outputs are [wrapped as detached aliases](8cde4c4d22/torch/csrc/autograd/python_function.cpp (L395-L399)), which prevents calling `Tensor.set_()` on them after recent changes in Tensors and Variables. This will hit a problem when users would like to call `rnn.flatten_parameters()` in the forward pass, as the function [calls `set_()`](9d09f5df6c/aten/src/ATen/native/cudnn/RNN.cpp (L669)).
The proposed solution is to avoid using an autograd Broadcast if in no_grad mode.
apsdehal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21197
Differential Revision: D15577342
Pulled By: mrshenli
fbshipit-source-id: 1a024c572171a3f2daca9454fd3ee6450d112f7c
Summary:
I think there was a typo in #20690 here https://github.com/pytorch/pytorch/pull/20690/files#diff-b47a50873394e38a005b4c1acd151957R130.
Original conditional was ` common_backend == Backend::CUDA && op.tensor.type().backend() == Backend::CPU)`, now it is `op.device.is_cuda() && op.tensor.device().is_cpu()`. It seems that `op.device` and `op.tensor.device()` should be the same, so this conditional is never true. This leads to spurious h2d copies for operations between cuda tensors and cpu scalars, because cpu scalars are now sent to gpu, instead of being passed to lambdas directly.
Unfortunately, I don't know how to test this change, because functionally everything was fine after #20690, it was just a performance regression.
cc colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21236
Differential Revision: D15592754
Pulled By: soumith
fbshipit-source-id: 105bfecc61c222cfdb7294a03c9ecae3cc7f5817
Summary:
`Tensor.is_cuda` and `is_leaf` are not predicate functions but `bool` attributes. This patch fixes the type hints in `torch/__init__.pyi` for those attributes.
```diff
- def is_cuda(self) -> bool: ...
+ is_cuda: bool
- def is_leaf(self) -> bool: ...
+ is_leaf: bool
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21192
Differential Revision: D15592766
Pulled By: soumith
fbshipit-source-id: 8c4ecd6939df8b8a8a19e1c9db6d40193bca7e4a
Summary:
This makes file-line reporting also work for things loaded using `torch.jit.load()` as well as the string frontend (via `CompilationUnit`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21217
Differential Revision: D15590838
Pulled By: jamesr66a
fbshipit-source-id: 6b6a12574bf9eca0b83f24f0b50535fda5863243
Summary:
Studied why sparse tensor coalesce was slow: issue #10757.
Using nvprof and writing a simple benchmark, I determined the bulk of the time was spent in ``kernelTransformReduceInnermostDimIndex``, which is called when a sparse tensor is constructed with sparse_coo_tensor and it does a sanity check on the minimum and maximum indices. However, we do not need this sanity check because after coalescing the tensor, these min/maxes won't change.
On my benchmark with 1 million non-zeros, the runtime of coalesce improved roughly 100x, from 0.52 s to 0.005 s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21214
Reviewed By: bddppq
Differential Revision: D15584338
Pulled By: akyrola
fbshipit-source-id: a08378baa018dbd0b45d7aba661fc9aefd3791e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21163
These two backend transformations share some common traits. Therefore we want to reuse the data structs/code as much as possible.
Reviewed By: hlu1
Differential Revision: D15561177
fbshipit-source-id: 35f5d63b2b5b3657f4ba099634fd27c3af545f1b
Summary:
Some of the functions are only used in this file - mark them `static`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21140
Differential Revision: D15578076
Pulled By: Krovatkin
fbshipit-source-id: 71ae67baabebd40c38ecb9292b5b8202ad2b9fc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21152
Migrate existing add benchmark to use the new op front-end
Reviewed By: zheng-xq
Differential Revision: D15325524
fbshipit-source-id: 34e969e1bd289913d881c476711bce9f8ac18a29
Summary:
I will do loops in a follow-up after some other changes I am working on have landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20911
Differential Revision: D15497205
Pulled By: eellison
fbshipit-source-id: 8cac197c6a6045b27b552cbb39e6fc86ca747b18
Summary:
Following on #19747, this implements most of the `torch.jit.script()` changes laid out in #20939.
Still to do:
* Accessing a method from Python does not add it as a `ScriptMethod` (so only `export`ed methods and `forward` are compiled)
* Calling a method other than `forward` on a submodule doesn't work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20708
Pulled By: driazati
Differential Revision: D15560490
fbshipit-source-id: cc7ef3a1c2772eff9beba5f3e66546d2b7d7198a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21085
Now that torch::jit::RegisterOperators() always passes through to torch::RegisterOperators() (see diffs stacked below this), we can remove the old custom op implementation.
Reviewed By: dzhulgakov
Differential Revision: D15542261
fbshipit-source-id: ef437e6c71950e58fdd237d6abd035826753c2e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21084
- Now AliasAnalysisKind can be set using the torch::RegisterOperators() API
- This also allows us to remove the last place in torch::jit::RegisterOperators that didn't use c10 yet.
Reviewed By: dzhulgakov
Differential Revision: D15542097
fbshipit-source-id: ea127ecf051a5c1e567e035692deed44e04faa9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21181
Implement c10::OperatorOptions as a class to store metadata about operators.
This is meant to replace torch::jit::OperatorOptions.
Reviewed By: dzhulgakov
Differential Revision: D15569897
fbshipit-source-id: 95bf0bf917c1ef2bdf32702405844e1a116d9a64
Summary:
This reduces DenseNet load time by about 25% (down to 5.3s on my laptop) and gets AliasAnalysis out of the profile top hits entirely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21203
Differential Revision: D15578155
fbshipit-source-id: ddbb1ad25c9540b5214702830084aa51cc6fd3cb
Summary:
Adds persistent cuda kernels that speed up SoftMax applied over the fast dimension, i.e. torch.nn.Softmax(dim=-1) and torch.nn.LogSoftmax(dim=-1). When the size is <= 1024, this code is 2-10x faster than the current code; the speedup is higher for smaller sizes. This code works for half, float and double tensors with 1024 or fewer elements in the fast dimension. Numerical accuracy is on par with the current code, i.e. relative error is ~1e-8 for float tensors and ~1e-17 for double tensors. Relative error was computed against the CPU code.
The attached image shows kernel time in us for torch.nn.Softmax(dim=-1) applied to a half precision tensor of shape [16384,n], n is plotted along the horizontal axis. Similar uplifts can be seen for the backward pass and for LogSoftmax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/20827
Differential Revision: D15582509
Pulled By: ezyang
fbshipit-source-id: 65805db37487cebbc4ceefb1a1bd486d24745f80
Summary:
This is a follow up on Jame's PR: https://github.com/pytorch/pytorch/pull/19041. The idea is to replace the legacy `sinh` / `cosh` ops that are being dispatched to TH with the operations defined in `Vec256` for better performance.
benchmark(from Jame's script):
```python
import torch, time
ops = ['sinh', 'cosh']
x = torch.rand(1024, 1024)
NITER = 10000
print('op', 'time per iter (ms)', 'gops/s', 'GB/s', sep='\t')
for op in ops:
    s = time.time()
    for i in range(NITER):
        getattr(x, op)()
    elapsed_sec = ((time.time() - s) / NITER)
    print(op, elapsed_sec * 1000, (1024*1024/elapsed_sec)/1e9, (1024*1024*4*2) / elapsed_sec / 1e9, sep='\t')
```
code on master:
```
op time per iter (ms) gops/s GB/s
sinh 3.37614369392395 0.3105839369002935 2.484671495202348
cosh 3.480502033233643 0.3012714803748572 2.4101718429988574
```
after change (on Macbook pro 2018):
```
op time per iter (ms) gops/s GB/s
sinh 0.8956503868103027 1.1707425301677301 9.365940241341841
cosh 0.9392147302627564 1.1164390487217428 8.931512389773943
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21115
Reviewed By: ljk53
Differential Revision: D15574580
Pulled By: xta0
fbshipit-source-id: 392546a0df11ed4f0945f2bc84bf5dea2750b60e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21196
we'll add `quantize(quantizer)` as a tensor method later when we expose `quantizer` in Python frontend
Python
```
torch.quantize_linear(t, ...)
```
C++
```
at::quantize_linear(t, ...)
```
Differential Revision: D15577123
fbshipit-source-id: d0abeea488418fa9ab212f84b0b97ee237124240
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21169
We should minimize dependency from perfkernels (we were including eigen header files only in cc files not compiled with avx or avx2 options but better to be very strict because it's easy to introduce illegal instruction errors in perfkernels)
Reviewed By: salexspb
Differential Revision: D15563839
fbshipit-source-id: d4b1bca22d7f2e6f20f23664d4b99498e5984586
Summary:
Most important fix: Correct "tensor.rst" to "tensors.rst"
Secondary fix: some minor English spelling/grammar fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21029
Differential Revision: D15523230
Pulled By: umanwizard
fbshipit-source-id: 6052d8609c86efa41a4289cd3a099b2f1037c810
Summary:
Dynamically creating a type at runtime was messing up the MRO and has been causing many other problems. I think it's best to delete it, this causes a regression since
```python
self.linear = nn.Linear(10, 10)
isinstance(self.linear, nn.Linear)
```
will now be `False` again, but this will be fixed once recursive script mode is the default (#20939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21107
Pulled By: driazati
Differential Revision: D15560549
fbshipit-source-id: 7bd6b958acb4f353d427d66196bb4ee577ecb1a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21148
The diff modifies the interface for Caffe2 operators in the benchmark suite
Reviewed By: zheng-xq
Differential Revision: D15433888
fbshipit-source-id: c264a95906422d7a26c10b1f9836ba8b35e36b53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21147
This diff introduces a new interface to add PT/C2 operators to the benchmark suite.
The following steps are needed to add a new operator:
1. Specify the input shapes, args to an operator in configs
2. Create a PT/C2 benchmark class which includes ```init``` (create tensors), ```forward``` (specify the operator to be tested.), and ```backward```(gradient of an op.) methods
3. call generate_pt_test/generate_c2_test to create test cases based on configs
Reviewed By: zheng-xq
Differential Revision: D15250380
fbshipit-source-id: 1025a7cf60d2427baa0f3f716455946d3d3e6a27
Summary:
This should pass once https://github.com/pytorch/vision/pull/971 is merged.
To remove torchvision as a baseline, we just compare to the sum of all param.sum() in the pretrained resnet18 model, which means we only need to manually update the number when the pretrained weights are changed, which is generally rare.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21132
Differential Revision: D15563078
Pulled By: ailzhang
fbshipit-source-id: f28c6874149a1e6bd9894402f6847fd18f38b2b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21164
Write a List type to be used in operator kernels. This abstracts away from the concrete list type used (e.g. std::vector vs SmallVector)
and allows us to change these implementation details without breaking the kernel API.
Also, this class allows for handling List<bool>, which would not work with ArrayRef because vector<bool> is a bitset and can't be converted to ArrayRef<bool>.
Reviewed By: ezyang
Differential Revision: D15476434
fbshipit-source-id: 5855ae36b45b70437f996c81580f34a4c91ed18c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21156
we'll add `quantize(quantizer)` as a tensor method later when we expose `quantizer` in Python frontend
Python
```
torch.quantize_linear(t, ...)
```
C++
```
at::quantize_linear(t, ...)
```
Differential Revision: D15558784
fbshipit-source-id: 0b194750c423f51ad1ad5e9387a12b4d58d969a9
Summary:
In the previous implementation of triu / tril, we passed the batch size in the 2nd dimension of a grid. This is limited to 65535, which means that performing triu / tril on a tensor with batch size > 65535 will throw an error. This PR removes the dependence on the 2nd dimension, and corresponding non-contiguity constraints.
Changelog:
- Compute offset, row and col in the kernel
- Use 1st dimension of grid alone
- Remove unnecessary contiguity checks on tensors as a result of this change.
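A minimal sketch of the case this unlocks (assumes a CUDA device; the batch size is chosen above the old 65535 grid limit):
```python
import torch

x = torch.randn(70000, 8, 8, device="cuda")
u = torch.triu(x)  # previously failed: the batch was mapped to the grid's 2nd dimension (max 65535)
l = torch.tril(x)
```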
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21067
Differential Revision: D15572501
Pulled By: ezyang
fbshipit-source-id: 93851cb661918ce794d43eeb12c8a38762e1358c
Summary:
Resolves https://github.com/pytorch/lockdown/issues/51
This adds support for converting simple f-string literals to calls to `string.format()`. It does not support conversion specifiers or format strings.
This also does not support the string parser frontend, since that implementation would be more involved and likely would require modifying our TorchScript AST
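A small sketch of the now-supported form (simple literals only, per the limitations above):
```python
import torch

@torch.jit.script
def greet(name: str) -> str:
    return f"hello, {name}"  # compiled as if written as "hello, {}".format(name)

print(greet("world"))
```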
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21037
Reviewed By: zdevito
Differential Revision: D15541183
Pulled By: jamesr66a
fbshipit-source-id: ae9df85e73f646d7219c1349f5b7683becbcef20
Summary:
# Overall Improvements
1. Switched from using `unordered_set` to sparse bitset.
1. Prevent some excessive memory allocations (thanks to resistor )
1. Take advantage of the sparse bitset operations
1. Switch to `flat_hash_map` instead of `unordered_map` in some places.
# Benchmarks (somewhat approximate, best of a couple runs)
1. InceptionNet (load + one forward pass): 19.8->13.3
1. GoogleNet(load + one forward pass): 10.0 -> 7.24
1. DenseNet (only load): 7.3 -> 5.3
I use the `sparse bitset` taken from https://llvm.org/doxygen/SparseBitVector_8h_source.html. I had to make some modifications to use `__builtin_popcountl` and instructions like that instead of other transitive clang dependencies.
## Some notes on our graph topologies
In general, our graphs are very sparse, and most of the components aren't connected. For GoogleNet, we have 200k nodes, we do 2k `mayAlias` queries, and the sum of magnitudes of sets at each node is 500k (ie: every node, on average, reaches 2.5 leaves).
PS: Holy crap macbooks throttle an insane amount with the default fan settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20899
Differential Revision: D15564612
Pulled By: Chillee
fbshipit-source-id: 2a293a21a9be25f942ca888c8f225cab32bbfcd0
Summary:
Now you can run `python test/run_tests --jit` to run all jit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21161
Differential Revision: D15563912
Pulled By: eellison
fbshipit-source-id: 4bb0285cda4168b72a3dc4bba471485566a59873
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21051
In net transforms, we perform an SSARewrite where we update the 'net_pos' for all the ops in the net. The transform function also takes an unordered set of net positions for blacklisting. It's possible that SSARewrite will change the indexes of the ops, so the blacklist is applied to the wrong ops. We fix this issue by having SSARewrite only assign a new net_pos if the op doesn't already have one.
Reviewed By: yinghai
Differential Revision: D15532795
fbshipit-source-id: e020492a7b5196a91cdc39d0eda761b1ca612cdb
Summary:
These do not work. We'll save time and cpu until someone has the time to fix these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21153
Differential Revision: D15558601
Pulled By: pjh5
fbshipit-source-id: f9bfe580aa7962a88506f9af0032647f553637a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21027
Previously, we were only able to adjust batch size when the output shape has the batch size conditioned at its first dim. Although not common, there are cases where we want to slice back an output whose batch size is conditioned on a non-first dim, or whose output shape doesn't really have the batch size in it but rather is an expression of it. Examples are shapes at the output of `Transpose` or `Tile`. This diff redesigns how we handle the output size. The key is that when we run OnnxifiOp, the input shapes are given, and we can actually do shape inference to derive the real output shapes, no matter how they got transformed. We then compare the real output shape with the max-batch-sized output shape, dim by dim, and use a `Slice` op to cut the max output back to the real output shape.
Notice that general `Slice` op is slow and in most of the cases, we still prefer adjusting batch size by shrinking its first dim, which is just an operation on meta info without data allocation/manipulation. Therefore, we add a flag `fast_path` to detect this situation and operate accordingly.
Reviewed By: tracelogfb
Differential Revision: D15515189
fbshipit-source-id: 9c1fff161f82d0bc20eeac07ca4a2756e964e9fd
Summary:
Resolves https://github.com/pytorch/lockdown/issues/29
Examples:
```
import torch
torch.jit.script
def foobar(x):
    return torch.blargh(xyz)
==
RuntimeError:
object has no attribute blargh:
at compile.py:5:12
torch.jit.script
def foo(x):
return torch.blargh(x)
~~~~~~~~~~~~ <--- HERE
```
It also gets the correct column number in the case where the original source file has common leading whitespace in front of the callable:
```
import torch
with torch.no_grad():
    torch.jit.script
    def foo(x):
        return torch.blargh(x)
==
RuntimeError:
object has no attribute blargh:
at compile_leading.py:6:24
torch.jit.script
def foo(x):
return torch.blargh(x)
~~~~~~~~~~~~ <--- HERE
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20898
Differential Revision: D15552424
Pulled By: jamesr66a
fbshipit-source-id: 78d0f0de03f7ccbf3e7ea193a1b4eced57ea5d69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20874
A criterion for what should be a Tensor method is whether numpy has it; for this one it does not,
so we are removing it as a Tensor method. We can still call it as a function.
Python
```
torch.quantize_linear(t, ...), torch.dequantize(t)
```
C++
```
at::quantize_linear(t, ...), at::dequantize(t)
```
Reviewed By: dzhulgakov
Differential Revision: D15477933
fbshipit-source-id: c8aa81f681e02f038d72e44f0c700632f1af8437
Summary:
Following on #19747, this implements most of the `torch.jit.script()` changes laid out in #20939.
Still to do:
* Accessing a method from Python does not add it as a `ScriptMethod` (so only `export`ed methods and `forward` are compiled)
* Calling a method other than `forward` on a submodule doesn't work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20708
Pulled By: driazati
Differential Revision: D15546045
fbshipit-source-id: c2c8fe179088ffbdad47198e799a456560655b86
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20869
Adding support for the functions listed in the title, by implementing the copy kernel.
Differential Revision: D15474060
fbshipit-source-id: 9264df6e442cca1cc5d952e3e5dcc9f4a426f317
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20876
Tell the compiler that assertions are likely to succeed.
This allows the compiler to generate better code and optimize for the success case.
Differential Revision: D15480066
fbshipit-source-id: 4485154d66b2ee0ef8a401718712dbd61d811aee
Summary:
Thanks Jonas1312 for validating this workaround.
Fixes#20635.
However, I don't know exactly why this one is needed.
The following are my guesses:
1. It is a CUDA bug. Static linking against `cudart` is the default now, so they didn't run enough tests for dynamic ones.
2. It is related to UCRT. But (1) according to MSDN, shared DLLs should share the same CRT. (2) The CUDA-related objects like `CUDevice` passed to `cudart` are stored on the stack, not the heap. (3) If this were the case, it should always fail, not sometimes. https://docs.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries?view=vs-2019
3. It is a bug of our side. However, I was unable to find it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21062
Differential Revision: D15543557
Pulled By: ezyang
fbshipit-source-id: c23af45ebf582fad93ce5f029af6e1f06cf1d49d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20887
Switch AT_xxx assertion macros to the TORCH_ variants and make sure the separation between TORCH_CHECK and TORCH_INTERNAL_ASSERT makes sense.
Differential Revision: D15484658
fbshipit-source-id: 490ae64cc36946756c30971f1b685048bc5f77da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20940
- `torch.nn._intrinsic` will contain normal (unquantized) fused modules like Conv2DRelu, Conv2DBnRelu, FakeQuantize ops etc.
- `torch.nn._intrinsic.quantized` will contain fused and quantized modules like Quantized Conv2DRelu, Quantized LinearRelu etc.
Right now I only added FakeQuantize op in `torch.nn._intrinsic` namespace, we'll have more later
Differential Revision: D15505228
fbshipit-source-id: d380929e38af7a5bcfbea27474d5b80f95d43b03
Summary:
A bunch of modules were missing entries for `__constants__` which was making their `__repr__`s not work. Others had `__constants__` that were not necessary since it was provided by some parent class instead.
Fixes#20978
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21071
Pulled By: driazati
Differential Revision: D15539518
fbshipit-source-id: 24bdd1ef41ef636eefd5d2bad4ab2d79646ed4f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17946
Some of these are probably implementable for exported operators,
but aren't implemented yet and for now it's better to assert than to just return wrong results.
Reviewed By: ezyang
Differential Revision: D14430749
fbshipit-source-id: 2b0037a9ed227a22aa7376a90e6d3d09d3e04707
Summary:
Fixes#18440
I calculate a derived index from `start,stop,step` as `start + step*index`. When `start=0` and `step=1` (the defaults/`range(n)`), this is the same behavior as before.
Unluckily, it seems that we do not optimize out operations like `x*1` or `x+0`. That means that we're doing lots of redundant operations when we don't need to. EDIT: More specifically, it seems like we only do this optimization for (tensor, scalar): https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/passes/peephole.cpp#L128
The most annoying part of this code is calculating the number of iterations, given `start, stop, step`. I ended up going with the formula `(abs(stop-start) + abs(step)-1)//abs(step)`. Other intuitively appealing formulas like `(stop-start + step -1)//step` don't work for negative numbers.
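A small Python sketch of that bookkeeping (helper names are illustrative, not the JIT implementation; step is assumed nonzero):
```python
def trip_count(start, stop, step):
    # Number of iterations of range(start, stop, step); 0 when the range is empty.
    if (stop - start) * step <= 0:
        return 0
    return (abs(stop - start) + abs(step) - 1) // abs(step)

def loop_index(start, step, i):
    # The derived index for iteration i: start + step * i.
    return start + step * i

assert [loop_index(2, 3, i) for i in range(trip_count(2, 11, 3))] == list(range(2, 11, 3))
assert trip_count(10, 0, -3) == len(range(10, 0, -3))
```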
I tried using `SymbolicVariable` for the calculations, but it seems that `symbolicvariable` only outputs ops for `tensors`, not the integers we have.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20795
Differential Revision: D15446869
Pulled By: Chillee
fbshipit-source-id: 6085545ace04e25985c6ac870226f7a651f670d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21035
Fix the dtype error in `dequantize_linear`; it should accept the same dtype argument as `quantize_linear`
Differential Revision: D15521931
fbshipit-source-id: 0114c046a3f1046e42fca49c74c85e487fee8616
Summary:
This PR adds a check that prints a warning if a type annotation prefix isn't what mypy expects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20884
Differential Revision: D15511043
Pulled By: Krovatkin
fbshipit-source-id: 9038e074807832931faaa5f4e69628f94f51fd72
Summary:
I accidentally added a TF dependency in #20413 by using the `from tensorboard.plugins.mesh.summary import _get_json_config` import.
I'm removing it at the cost of code duplication.
orionr, Please review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21066
Reviewed By: natalialunova
Differential Revision: D15538746
Pulled By: orionr
fbshipit-source-id: 8a822719a4a9f5d67f1badb474e3a73cefce507f
Summary:
In a larger system environment, there's usually a need to store some information about how the model was created (e.g. from which process, workflow, by which user, etc). It's almost like the JPEG metadata written by a camera.
This PR adds a low-level C++ hook to allow population of additional files in the zip container based on the environment. The reason to have it as a low-level hook instead of a top-level API wrapper (e.g. `m.save_with_metadata`) is to capture all usages of the saving API transparently for the user.
Let me know if there are concerns.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20863
Differential Revision: D15487941
Pulled By: dzhulgakov
fbshipit-source-id: 120c5a4c9758aa82846bb51a1207f923e3da1333
Summary:
This doesn't have `strace` yet, but it still has `faulthandler` to print stack traces on hangs. Also part of an attempt to isolate changes from #19228.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20166
Differential Revision: D15536504
Pulled By: ezyang
fbshipit-source-id: fe6e6e2e9899f30d8167436d7bc62b42883a3356
Summary:
Previously, this didn't work when 2d target tensors had extra columns at the end. Now we just ignore those.
Also fix the confusion in the doc example regarding the number of classes.
Thank you, ypw-rich, for the report with a reproducing example.
Fixes: #20522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20971
Differential Revision: D15535481
Pulled By: ezyang
fbshipit-source-id: 397e44e20165fc4fa2547bee9390d4c0b688df93
Summary:
https://github.com/pytorch/pytorch/pull/17783 made ninja and makefile builds print out build commands unconditionally, which has made the build log very verbose (e.g. the ROCm CI build log becomes >13 MB). A large build log makes searching for the real error hard.
https://github.com/pytorch/pytorch/pull/20508 has reverted the ninja change, and this one reverts the makefile change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21053
Differential Revision: D15533412
Pulled By: bddppq
fbshipit-source-id: ad89b617d06acc670d75d4cf25111a4081e9c95e
Summary:
I've reported inconsistency between `checkpoint_sequential` and `nn.Sequential` at https://github.com/pytorch/pytorch/issues/19260. Both should provide the same input signature but they don't. I think the consistency is important and I agree with apaszke that `nn.Sequential`'s semantics should be kept instead of `checkpoint_sequential`.
I hope `checkpoint_sequential` will raise a `TypeError` on variadic arguments starting with PyTorch 1.2.0. But for now, it's okay to just warn with a `DeprecationWarning`. I've talked about this approach with soumith.
Please review this pull request; any comments are welcome.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21006
Differential Revision: D15530801
Pulled By: soumith
fbshipit-source-id: 0ceb2cc6a17dcc547d0d00ebaf9df8603be53183
Summary:
gradcheck currently includes a determinism check (although it only tries twice and sees whether the results match).
This can lead to flaky tests, e.g. in #20971, but also #13818.
This adds nondet_tol for both gradcheck and gradgradcheck. It does not change / re-enable any tests yet.
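A minimal sketch of how the new knob is meant to be used (assuming the `nondet_tol` keyword added here):
```python
import torch
from torch.autograd import gradcheck

def f(x):
    return (x * x).sum()

x = torch.randn(4, dtype=torch.double, requires_grad=True)
# Tolerate small differences between the two gradient evaluations
# instead of failing the determinism check outright.
assert gradcheck(f, (x,), nondet_tol=1e-7)
```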
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20980
Differential Revision: D15530129
Pulled By: soumith
fbshipit-source-id: 04d7f85b5b59cd62867820c74b064ba14f4fa7f8
Summary:
Fixes a typo in the CyclicLR docs by adding the lr_scheduler directory and adds the other required arguments.
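For context, a minimal illustration of the documented call with its required arguments (the values here are illustrative only):
```python
import torch

opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
sched = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=0.001, max_lr=0.1)
for _ in range(5):
    opt.step()    # normally preceded by a forward/backward pass
    sched.step()
```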
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21021
Differential Revision: D15530109
Pulled By: soumith
fbshipit-source-id: 98781bdab8d82465257229e50fa3bd0015da1286
Summary:
Just an annoying warning that's been popping up a lot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20964
Differential Revision: D15531064
Pulled By: Chillee
fbshipit-source-id: 9580115676c5e246481054bbfc749a551a3cca5e
Summary:
This PR covers two important points with respect to the QR decomposition:
- batching of input matrices (#7500)
- adding `some` as an option in `torch.qr` akin to NumPy's `mode` option (#10538)
Changelog:
- Enable batching for inputs to `torch.qr`
- Move QR decomposition implementation to ATen (CPU and CUDA)
- Remove existing implementations in TH/THC
- Add a `some` option to `torch.qr` that will enable users to switch between complete and reduced decomposition
- Modify doc strings
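A small sketch of the post-change behavior (shapes chosen for illustration):
```python
import torch

a = torch.randn(3, 5, 4)       # a batch of three 5x4 matrices
q, r = torch.qr(a, some=True)  # reduced decomposition, applied per matrix
print(q.shape, r.shape)        # torch.Size([3, 5, 4]) torch.Size([3, 4, 4])
print(torch.allclose(torch.matmul(q, r), a, atol=1e-5))  # True: Q @ R reconstructs a
```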
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20689
Differential Revision: D15529230
Pulled By: soumith
fbshipit-source-id: 16af82b1d2db8a3a758fa8a5f798d83f5f950efb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20603
When we use intra_op_parallel operators, Caffe2 tracing was generating a trace only for the master task, giving a false impression that a lot of threads are underutilized.
This diff also traces child tasks.
Reviewed By: ilia-cher
Differential Revision: D14820008
fbshipit-source-id: ff4ed203804d86d9231c21c99d869f1ddf1d1ef9
Summary:
Add an option to setup.py to stop the build process once cmake terminates. This gives users a chance to fine-tune build options. Also update the README accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21034
Differential Revision: D15530096
Pulled By: soumith
fbshipit-source-id: 71ac6ff8483c3ee77c38d88f0d059db53a7d3901
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20647
The initial assumption was that `qint8` would be unsigned. After the introduction of `quint8` and `qint8`, some tests break.
Reviewed By: jerryzh168
Differential Revision: D15332106
fbshipit-source-id: 6ed18da428915aea918a363c5f38754a3c75d06b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20493
This helps distinguish if the op was a quantized op or not.
Reviewed By: salexspb
Differential Revision: D15337854
fbshipit-source-id: 43c7aef143085cfaeb4ec2102a7f36cc454e0e94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20173
Enabled op profiling even when the net type is not dag or prof_dag. Also added
engine type info to the summary.
Reviewed By: salexspb, ilia-cher
Differential Revision: D15177813
fbshipit-source-id: 5be0efeaabc9a961cf1d73b0703749c08bb1adbb
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#19587 [jit] Make ScriptModule.training an attribute instead of a parameter**
Remove the hack we had previously where `training` was a buffer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19587
Differential Revision: D15502768
Pulled By: driazati
fbshipit-source-id: 3022f2d57ec6849868f9225d9bc2bfb7828cb318
Summary:
Before we look into supporting `deepcopy`, we can at least improve the error message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20885
Differential Revision: D15511023
Pulled By: Krovatkin
fbshipit-source-id: 93b8730a2cc663eee0147f14d3341d0606748eaf
Summary:
This is #20919 without the changes to aten/src/THC/THCIntegerDivider.cuh
that broke the ROCm build.
cc bddppq
Original summary:
This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.
This also adds a number of TORCH_INTERNAL_ASSERTS in Loops.cuh,
OffsetCalculator, and IntDivider asserting that sizes fit in a signed 32-bit
integer.
More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21019
Differential Revision: D15518477
Pulled By: colesbury
fbshipit-source-id: 4db5626fda76eb58250793e8aa7d4f2832db3a34
Summary:
Fixes#20495 .
Now for
```python
class A(torch.jit.ScriptModule):
    def __init__(self):
        super(A, self).__init__()

    @torch.jit.script_method
    def forward(self, x):
        return x + self.whatisgoingon

class B(A):
    def __init__(self):
        super(B, self).__init__()

    @torch.jit.script_method
    def bar(self, x):
        return x * x

A()
```
it does
```
RuntimeError:
attribute 'whatisgoingon' does not exist:
torch.jit.script_method
def forward(self, x):
return x + self.whatisgoingon
~~~~~~~~~~~~~~~~~~ <--- HERE
```
I added a test in `test_jit.py` as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20779
Differential Revision: D15441138
Pulled By: Chillee
fbshipit-source-id: 88f458c36b5e32a1ffc467b27bbc28a3c5c07321
Summary:
As a part of https://github.com/pytorch/pytorch/pull/20580 I noticed that we had some unusual variable naming in `summary.py`. This cleans it up and also removes some variables that weren't being used.
I'll wait until we have an `add_custom_scalars` test to land this.
cc lanpa natalialunova
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20861
Differential Revision: D15503420
Pulled By: orionr
fbshipit-source-id: 86d105a346198a1ca543d1c5d297804402ab5a0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20880
This clarifies how the momentum parameters should be used.
Reviewed By: soumith
Differential Revision: D15482450
fbshipit-source-id: e3649a38876c5912cb101d8e404abca7c3431766
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20610
Change InferLengthsRangeFill
Add InferGatherRanges
add tests from ClipRangesGatherSigridHash all the way to SparseLengthsWeightedSum
add tests from SigridTransforms all the way to SparseLengthsWeightedSum
e2e test will be added in the following diff
Reviewed By: ipiszy
Differential Revision: D15382730
fbshipit-source-id: a611cd129007a273dfc43955cd99af1c4ed04efd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20938
`dequantize_linear` need not be exposed to front-end users.
It will only be used by the JIT passes for q-dq insertion and op
substitution.
Differential Revision: D15446097
fbshipit-source-id: a5fbcf2bb72115122c9653e5089d014e2a2e891d
Summary:
Remove the internal functions in multi_head_attention_forward. Those internal functions cause a 10-15% performance regression, and there is possibly a JIT issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20653
Differential Revision: D15398888
Pulled By: cpuhrsch
fbshipit-source-id: 0a3f053a4ade5009e73d3974fa6733c2bff9d929
Summary:
Changes:
- protobuf has been moved to protocolbuffers/protobuf a while ago.
- cpuinfo has been moved to pytorch/cpuinfo and updated in FBGEMM recently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20973
Differential Revision: D15511926
Pulled By: soumith
fbshipit-source-id: 2c50373c9b245524f839bd1059870dd2b84e3b81
Summary:
Sometimes users forget to use the "--recursive" option when they update submodules. This added check should help expose the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20937
Differential Revision: D15502846
Pulled By: mrshenli
fbshipit-source-id: 34c28a2c71ee6442d16b8b741ea44a18733b1536
Summary:
This changes the progress bars in `_download_url_to_file` from saying things like `49773343.40it/s` to `47.5MB/s`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20908
Differential Revision: D15511223
Pulled By: soumith
fbshipit-source-id: 2422eb5fb486f9ef4bd69c556c4ed1775b8b2860
Summary:
I believe the `True` and `False` in the doc are reversed :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20961
Differential Revision: D15510806
Pulled By: soumith
fbshipit-source-id: 62566bb595e187506b23dedc24892e48f35b1147
Summary:
Fixes#20630
Haven't tested it yet. Let's see if it passes all CI tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20882
Reviewed By: pietern
Differential Revision: D15483561
Pulled By: mrshenli
fbshipit-source-id: 5f0730a04d92906af077b2fe2170b674ca371e6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20868
When `input_boxes_include_bg_cls` is false (which means `input_scores_fg_cls_starting_id` is 0), it doesn't map the class index of the score correctly when sorting and limiting the detections over all classes after NMS.
Reviewed By: newstzpz
Differential Revision: D15472706
fbshipit-source-id: dc1e808b63ad09fb4bd95acf866771bb3fa92d69
Summary:
This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.
This also adds a number of TORCH_INTERNAL_ASSERTS in Loops.cuh,
OffsetCalculator, and IntDivider asserting that sizes fit in a signed 32-bit
integer.
More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
Fixes #20888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20919
Differential Revision: D15501945
Pulled By: colesbury
fbshipit-source-id: e876e678e866d2efda8ee92c47a1d2d1310671f0
Summary:
Previously, this used `crepr` after the decref of `repr`. This is not
allowed because `repr` owns the cached copy of `crepr`.
Let's see if this fixes the contbuild.
See #20926
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20931
Differential Revision: D15501929
Pulled By: colesbury
fbshipit-source-id: 24141ba62df8758d2a3998cf7c2054be09088b6a
Summary:
Bug reported internally at FB:
```python
>>> t=torch.from_numpy(np.empty((0,4)))
>>> t[:,1::2]*=1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Trying to resize storage that is not resizable at ../aten/src/TH/THStorageFunctions.cpp:76
```
This happens because the storage offset of `t[:, 1::2]` is 1, and it has 0 elements. We can fix this by avoiding resizing the storage for no-element arrays.
(We could *also* have avoided it by not modifying the storage index in this case, but I felt this way was more semantically correct -- in general, we should not be assuming it's okay to do anything to the storage when it has zero elements).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20914
Differential Revision: D15497860
Pulled By: umanwizard
fbshipit-source-id: 6af61d73a05edfc5c07ce8be9e530f15bf72e6a9
Summary:
I started adding support for the new **[mesh/point cloud](https://github.com/tensorflow/graphics/blob/master/tensorflow_graphics/g3doc/tensorboard.md)** data type introduced to TensorBoard recently.
I created the functions to add the data and created the appropriate summaries.
This new data type however requires a **Merged** summary containing the data for the vertices, colors and faces.
I got stuck at this stage. Maybe someone can help. lanpa?
I converted the example code by Google to PyTorch:
```python
import numpy as np
import trimesh
import torch
from torch.utils.tensorboard import SummaryWriter
sample_mesh = 'https://storage.googleapis.com/tensorflow-graphics/tensorboard/test_data/ShortDance07_a175_00001.ply'
log_dir = 'runs/torch'
batch_size = 1
# Camera and scene configuration.
config_dict = {
'camera': {'cls': 'PerspectiveCamera', 'fov': 75},
'lights': [
{
'cls': 'AmbientLight',
'color': '#ffffff',
'intensity': 0.75,
}, {
'cls': 'DirectionalLight',
'color': '#ffffff',
'intensity': 0.75,
'position': [0, -1, 2],
}],
'material': {
'cls': 'MeshStandardMaterial',
'roughness': 1,
'metalness': 0
}
}
# Read all sample PLY files.
mesh = trimesh.load_remote(sample_mesh)
vertices = np.array(mesh.vertices)
# Currently only supports RGB colors.
colors = np.array(mesh.visual.vertex_colors[:, :3])
faces = np.array(mesh.faces)
# Add batch dimension, so our data will be of shape BxNxC.
vertices = np.expand_dims(vertices, 0)
colors = np.expand_dims(colors, 0)
faces = np.expand_dims(faces, 0)
# Create data placeholders of the same shape as data itself.
vertices_tensor = torch.as_tensor(vertices)
faces_tensor = torch.as_tensor(faces)
colors_tensor = torch.as_tensor(colors)
writer = SummaryWriter(log_dir)
writer.add_mesh('mesh_color_tensor', vertices=vertices_tensor, faces=faces_tensor,
colors=colors_tensor, config_dict=config_dict)
writer.close()
```
I tried adding only the vertex summary, since the others are supposed to be optional.
I got the following error from TensorBoard, and it also didn't display the points:
```
Traceback (most recent call last):
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/werkzeug/serving.py", line 302, in run_wsgi
execute(self.server.app)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/werkzeug/serving.py", line 290, in execute
application_iter = app(environ, start_response)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/backend/application.py", line 309, in __call__
return self.data_applications[clean_path](environ, start_response)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/werkzeug/wrappers/base_request.py", line 235, in application
resp = f(*args[:-2] + (request,))
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/plugins/mesh/mesh_plugin.py", line 252, in _serve_mesh_metadata
tensor_events = self._collect_tensor_events(request)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/plugins/mesh/mesh_plugin.py", line 188, in _collect_tensor_events
tensors = self._multiplexer.Tensors(run, instance_tag)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 400, in Tensors
return accumulator.Tensors(tag)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 437, in Tensors
return self.tensors_by_tag[tag].Items(_TENSOR_RESERVOIR_KEY)
KeyError: 'mesh_color_tensor_COLOR'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20413
Differential Revision: D15500737
Pulled By: orionr
fbshipit-source-id: 426e8b966037d08c065bce5198fd485fd80a2b67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20821
Change registration API. Instead of
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel>()
.dispatchKey(CPUTensorId()));
it is now
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel>(CPUTensorId()));
This binds kernel and dispatch key together, allowing them to be separate from other future configuration options like alias analysis or autograd wrappers.
The semantic problem behind this is that the dispatch key is a *kernel config parameter* and not an *operator config parameter* while things like autograd wrappers, alias info, and actually the kernel itself are *operator config parameters*. And while previously, the different kind of config parameters have been mixed, this diff now separates them.
Before this change, it wouldn't have been well defined if you specified a dispatchKey together with an autogradWrapper or aliasInfo for example.
// what is this supposed to do?
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.aliasInfo(DEFAULT)
.dispatchKey(CPUTensorId()));
If we get more kernel config parameters in the future, we could introduce something like this
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel>(torch::RegisterOperators::kernelOptions()
.dispatchKey(CPUTensorId())
.otherConfig());
but that's overkill as long as dispatch keys are the only kernel config parameter, and we can introduce that later without breaking backwards compatibility.
A nice side effect of this is that people can register multiple kernels to the same operator in the same `.op()` call:
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel1>(CPUTensorId())
.kernel<Kernel2>(CUDATensorId()));
Reviewed By: dzhulgakov
Differential Revision: D15455790
fbshipit-source-id: 1c46bfe676dcacf74cf36bd3f5df3d2c32b8fb11
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17818
Some of these are probably implementable for exported operators,
but aren't implemented yet and for now it's better to assert than to just return wrong results.
Reviewed By: ezyang
Differential Revision: D14392459
fbshipit-source-id: bf86e6cb0a7cfefd112a65dc85cc243e57a5ad52
Summary:
This PR also moves Device::validate into the header file, which makes
statements like `Device d = kCPU` effectively free.
Device includes the device's index, so TensorIterator::compute_types
now implicitly checks that all CUDA inputs are on the same GPU.
Previously, this was done ad-hoc in places like TensorIterator::binary_op.
Note that zero-dim Tensors (scalars) are NOT required to be on the
same device as other inputs because they behave almost like Python numbers.
TensorIterator handles copying zero-dim Tensors to the common device.
Prior to this PR, TensorIterator would copy zero-dim Tensors between CPU
and GPU, but not between different GPUs (because Backend didn't encode
the GPU index). This removes that restriction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20690
Differential Revision: D15414826
Pulled By: colesbury
fbshipit-source-id: 1d0ad1f7d663252af36dd4590bcda418c2f7a09f
Summary:
This PR eliminates unneeded grad_sum_to_size calls and in particular speeds up the LSTM backward by allowing better fusion.
It consists of two parts:
- In AutoDiff, record broadcasting sizes only if the broadcast output size is different from the input size, otherwise record None.
- The specialization of Optional arguments (#18407) allows us to then eliminate `_grad_sum_to_size(t, None)` in the peephole optimization step.
Thus, in the LSTM case, no SumToSize ops remain in the crucial fusion group. The trick here is that we can specialize on the runtime information from the forward.
I'm testing that different broadcasting situations lead to different graphs.
I didn't move all symbolic_script `_grad_sum_to_size` uses to the new logic, but it might be better to do this incrementally anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18697
Differential Revision: D15482076
Pulled By: wanchaol
fbshipit-source-id: 7f89367e35b8729910077c95c02bccefc8678afb
Summary:
To say that we don't do refinement on module attributes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20912
Differential Revision: D15496453
Pulled By: eellison
fbshipit-source-id: a1ab9fb0157a30fa1bb71d0793fcc9b1670c4926
Summary:
Earlier, the workspace size query and allocation were placed inside the loop.
However, since we have batches of matrices with the same number of rows and columns, the workspace size query and allocation for every matrix in the batch are redundant.
This PR moves the workspace size query and allocation outside the loop, effectively saving (batch_size - 1) queries and allocations (and consequently the deallocations).
There is a tremendous speedup in inverse computation as a result of this change.
Changelog:
- Move workspace query and allocation outside the batch loop
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20904
Differential Revision: D15495505
Pulled By: ezyang
fbshipit-source-id: 226729734465fcaf896f86e1b1a548a81440e082
Summary:
- Do not install unnecessary packages in the Docker image.
- In the Docker image, use conda to install ninja (saving one layer)
- When workdir is set, use "." to refer to it to reduce redundancy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20881
Differential Revision: D15495769
Pulled By: ezyang
fbshipit-source-id: dab7df71ac107c85fb1447697e25978daffc7e0b
Summary:
Currently PyTorch forces color output due to #20662. But users should be given an option to turn it off, because redirecting the output to a file gets messed up if color output is forced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20771
Differential Revision: D15495677
Pulled By: ezyang
fbshipit-source-id: 9d89bbed40d0b67368554305394763a54c5ff6f5
Summary:
Currently, when the argument to isinf and isfinite is not a tensor, a ValueError is raised. This, however, should be a TypeError, because the error is a type mismatch.
In the error message, "str(tensor)" is replaced by "repr(tensor)" because, when an error occurs, a printable representation of the object is likely more useful than the "informal" string version of the object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20817
Differential Revision: D15495624
Pulled By: ezyang
fbshipit-source-id: 514198dcd723a7031818e50a87e187b22d51af73
Summary:
Attention mask should be of shape `(L, S)` since it is added to `attn_output_weights`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20850
Differential Revision: D15495587
Pulled By: ezyang
fbshipit-source-id: 61d6801da5291df960daab273e874df28aedbf6e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20892
FBGEMM uses 64-bit values. We need to change our implementation to match.
Reviewed By: jerryzh168
Differential Revision: D15487664
fbshipit-source-id: 29cba26093c6f9aeafce14982c1ae12149e63562
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20773
This removes the feature to register fallback kernels that are called when no other kernel matches.
Instead, we introduce the concept of catchall kernels that are always called independent of inputs.
If you only have a fallback/catchall kernel and no kernels with concrete dispatch keys, then both concepts behave in the same way.
The difference is that we now disallow operators to have both, a catchall kernel and kernels with concrete dispatch keys.
This was possible before when they have been fallback kernels.
The reason for this change is that we anticipate needing a method_missing feature in backends, i.e. a backend-wide fallback to call when the backend doesn't specify a kernel for an operator.
We are not clear on precedence between this backend-wide fallback and an operator-level fallback. Disallow fallbacks for now so we are free to choose later without breaking backwards compatibility.
Reviewed By: dzhulgakov
Differential Revision: D15438977
fbshipit-source-id: cb3aa764a1659d909ee21a7bd8ec3d32438aafaa
Summary:
Resubmit #20698 which got messed up.
The idea is that when PyTorch is used in a custom build environment (e.g. Facebook), it's useful to track usage of various APIs centrally. This PR introduces a simple, very lightweight mechanism to do so - only the first invocation of a trigger point is logged. This is significantly more lightweight than #18235, and thus we can afford to put logging in e.g. TensorImpl.
Also adds an initial list of trigger points. Trigger points are added in such a way that no static initialization triggers them, i.e. just linking with libtorch.so will not cause any logging. Further suggestions of what to log are welcome.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20745
Differential Revision: D15429196
Pulled By: dzhulgakov
fbshipit-source-id: a5e41a709a65b7ebccc6b95f93854e583cf20aca
Summary:
As part of the Variable/Tensor merge work: https://github.com/pytorch/pytorch/issues/13638, we make the following changes in this PR:
1. Remove the `Variable::Impl` class and the `DifferentiableViewImpl` class
2. Change all `Variable.data()` call sites to either use `Variable` directly, or use `Variable.tensor_data()`
3. Remove `Variable.data()` API
4. Add `Variable.variable_data()` that matches `tensor.data` in Python API, which creates a new `Variable` that shares the same storage and tensor metadata with the original `Variable`, but with a completely new autograd history.
After this PR, Variable doesn't wrap a Tensor internally anymore, and both Variable and Tensor use the same TensorImpl class as its `impl_`. The only difference is that Variable always has AutogradMeta in its TensorImpl, but Tensor doesn't.
**Note that this PR is BC-breaking in the following use cases:**
**Use Case 1:**
Previously, `x.data = y` worked even if `x` and `y` were of different TensorImpl types (e.g. `x` is a CPU dense tensor whose impl is of type TensorImpl, while `y` is a CPU sparse tensor whose impl is of type SparseTensorImpl). However, after this PR, `x.data = y` doesn't work anymore if `x` and `y` are of different TensorImpl types, because the underlying implementation `variable.set_data(tensor)` no longer works if `variable` and `tensor` have different TensorImpl types.
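A hedged illustration of Use Case 1 (the tensors here are just examples of two different TensorImpl types):
```python
import torch

x = torch.randn(2, 2)               # dense CPU tensor (TensorImpl)
y = torch.randn(2, 2).to_sparse()   # sparse CPU tensor (SparseTensorImpl)
try:
    x.data = y                      # worked before this PR; now raises
except RuntimeError as e:
    print(e)
```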
**Use Case 2:**
If a tensor `x`'s `grad` is sparse, accumulating dense gradients to `x` will change the tensor that `x.grad` is pointing to. This is better illustrated with the following example:
```python
params = torch.tensor([1.5, 1.5]).requires_grad_()
with torch.no_grad():
# Change gradient to a sparse tensor
params.grad = torch.sparse_coo_tensor(torch.tensor([[1, 1]]).long(), torch.tensor([1., 1.]))
grad_saved = params.grad
params.backward(torch.tensor([1.5, 1.5]))
assert id(grad_saved) == id(params.grad) # This will fail after this PR
```
The assertion in the last line will fail after this PR, because adding dense gradients to sparse gradients will change the `params.grad` tensor reference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17072
Differential Revision: D14075257
Pulled By: yf225
fbshipit-source-id: 0e681df641270dea586042dd26db59f2e76b5957
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20669
Before, Dict was a value type, i.e. copying it did a deep copy.
Unfortunately, this doesn't work well with storing and passing Dicts around in IValues because IValues are reference types.
This diff changes Dict to be a reference type.
Reviewed By: dzhulgakov
Differential Revision: D15404911
fbshipit-source-id: dc990d3eb7cae044b74dd0253f8b704dde6a6c86
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20833
Att. The algorithm is still "horrendously inefficient". But since we are sunsetting Nomnigraph, I just did the minimal fix here.
Reviewed By: tracelogfb
Differential Revision: D15463880
fbshipit-source-id: 413a1280a92c1923ba49031177816a2d5f888575
Summary:
This tries to fix the following error on current master:
```
May 23 16:18:47 Traceback (most recent call last):
May 23 16:18:47 File "main.py", line 7, in <module>
May 23 16:18:47 from torchvision import datasets, transforms
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/__init__.py", line 1, in <module>
May 23 16:18:47 from torchvision import models
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/models/__init__.py", line 11, in <module>
May 23 16:18:47 from . import detection
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
May 23 16:18:47 from .faster_rcnn import *
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/models/detection/faster_rcnn.py", line 7, in <module>
May 23 16:18:47 from torchvision.ops import misc as misc_nn_ops
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/ops/__init__.py", line 1, in <module>
May 23 16:18:47 from .boxes import nms, box_iou
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/ops/boxes.py", line 2, in <module>
May 23 16:18:47 from torchvision import _C
May 23 16:18:47 ImportError: /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at19NonVariableTypeMode10is_enabledEv
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20865
Differential Revision: D15481736
Pulled By: yf225
fbshipit-source-id: 67d4fd70652ccc709b44cb15392d6e44a8fe9235
Summary:
This PR changes the CPU implementation of `AdaptiveAveragePool2D` by:
- moving dispatch outside the OpenMP loop
- supporting fp16
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20366
Differential Revision: D15456069
Pulled By: ezyang
fbshipit-source-id: 00fa2916f8b136af9f5c8b5db0eca4619f9f5bac
Summary:
When adding custom scalars like this
```python
from torch.utils.tensorboard import SummaryWriter
with SummaryWriter() as writer:
writer.add_custom_scalars({'Stuff': {
'Losses': ['MultiLine', ['loss/(one|two)']],
'Metrics': ['MultiLine', ['metric/(three|four)']],
}})
```
This error is raised:
```
TypeError: Parameter to MergeFrom() must be instance of same class: expected tensorboard.SummaryMetadata.PluginData got list.
```
Removing the square brackets around `SummaryMetadata.PluginData(plugin_name='custom_scalars')` should be enough to fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20580
Differential Revision: D15469700
Pulled By: orionr
fbshipit-source-id: 7ce58034bc2a74ab149fee6419319db68d8abafe
Summary:
Fix #20781, #20757
Hmm, I don't know an easy way to add a test to make sure it runs against a package installed as a .egg, but I tested it locally with torchvision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20782
Differential Revision: D15443600
Pulled By: ailzhang
fbshipit-source-id: 285eb0d9a44d6edb8e93618fa293f4feb431d2ae
Summary:
XLA needs a way to override CPUTensor.copy_(XLATensor), but we only
dispatch on the "self" argument. This inverts the dispatch order when
"src" is an unhandled type.
Note that things like XLATensor.copy_(CPUTensor) never enter this
implementation.
cc dlibenzi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20783
Differential Revision: D15443187
Pulled By: colesbury
fbshipit-source-id: 4ee93ba598ef0fed2a99c0683aae30cb50a1f99c
Summary:
I was reading the README on GitHub and came across a couple of typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20819
Differential Revision: D15469603
Pulled By: nairbv
fbshipit-source-id: 0ed7868de2d4e6d82557a8c170783966f8a1afd7
Summary:
The duplicated code of `_optimize_trace` in _pytorch_graph.py is used to bypass some optimization steps that cause missing scopes.
It seems that most of the problematic steps have been fixed recently. Standard models implemented in torchvision were visually inspected before the commit. However, the `+=` in 50d54a82d1/torchvision/models/resnet.py (L63) will let f4d9bfaa4d/torch/onnx/utils.py (L159) produce a bad result. It can be fixed by replacing it with `out = out + identity`. This also implies that `+=` has non-intuitive behavior.
cc orionr ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20394
Reviewed By: NarineK
Differential Revision: D15452204
Pulled By: orionr
fbshipit-source-id: eaa4c13f16551c78dc6419f1e22eb2c560af4cc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20772
Copy of D15178352
A conflicting commit that removed registering kernels using IntArrayRef landed at the same time as D15178352; hence, D15178352 was reverted. Using std::vector instead.
Reviewed By: zafartahirov
Differential Revision: D15437237
fbshipit-source-id: cd2f1caebcc720352b48ce25d716cb1ca49a5197
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20816
Previously, the c10 dispatcher expected ops to be called with Variables and unwrapped them to Tensors before calling into the kernel.
The kernel was expected to return Tensors that were re-wrapped into Variables before passing them on into the system.
However, that doesn't work with kernels that call other operators. One recent example was a kernel that returned the result of `torch::ones()` as output.
Now, with this diff, the c10 dispatcher still passes Tensors to the kernel and Variables back into the system, but it accepts ops being called with either Tensors or Variables,
and kernels are also allowed to return either.
After https://github.com/pytorch/pytorch/pull/17072 , we should be able to get rid of the whole wrapping/unwrapping logic.
Reviewed By: hl475
Differential Revision: D15453963
fbshipit-source-id: 7602b7f2bc43e8ceb8a8c0e97aafcc53d4c47b6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20740
Provide a way to assemble quantized Tensor from int8 Tensor, scale and zero point.
Differential Revision: D15232416
fbshipit-source-id: c3a3d9d7214b1dc569214c019440c2779fbd063b
Summary:
This is the first part of the planned changes to change the comparison operations' result tensor dtype from Byte to Bool. You can see the whole list of changes (not cleaned up) [here](https://github.com/pytorch/pytorch/pull/19332). As the PR is too big for a single review, I'm breaking it into pieces.
**Changes in this PR:**
1. Enable these methods for bool tensors:
- maskedSelect
- maskedSelectBool
- bitand
- cbitand
- bitor
- cbitor
- bitxor
- cbitxor
- sign
- equal
- neg
2. Add bool clause for the TH version of sign method.
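At the Python level, these roughly correspond to the following now-working bool-tensor operations (a rough illustration, assuming a build that includes this change):
```python
import torch

a = torch.tensor([True, False, True])
b = torch.tensor([True, True, False])
print(a & b)                       # bitand -> tensor([ True, False, False])
print(a | b)                       # bitor
print(a ^ b)                       # bitxor
print(torch.masked_select(a, b))   # masked select with a bool mask
print(torch.equal(a, a))           # True
```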
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20767
Differential Revision: D15436446
Pulled By: izdeby
fbshipit-source-id: 8d2494b5f4873cd79c7f1a40d2cb045cadfad51a
Summary:
I didn't update the Windows references because I wasn't sure if they apply to CUDA 9. peterjc123 what should the Windows section say?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20718
Differential Revision: D15459276
Pulled By: colesbury
fbshipit-source-id: 917e22f8ac75378d88c962c226b5a42b6799c79a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20802
Need this for sequence model
Reviewed By: dzhulgakov
Differential Revision: D15448529
fbshipit-source-id: cd5abe3b689fc0e02feff10faf8cd61c99369f4f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20786
Add a method to LayerModelHelper to filter metrics_schema. A general model builder may add metric schema that is not needed in some situations. This change adds the ability to skip the unneeded ones.
Reviewed By: alex1o1o7cloud
Differential Revision: D15418140
fbshipit-source-id: 520f5dffd9938cf206cb1352e2953a4d4d2b6ab1
Summary:
When detecting the presence of NumPy using import, move numpy-related variable assignments outside the try block (i.e., to an else block) to improve readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20739
Differential Revision: D15453916
Pulled By: ezyang
fbshipit-source-id: d3c37f2b290846be3c6a1462251cbb3e95d493be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20787
Set requires_grad=False for bias: this will block the jit tracing.
The as_type fix: The input tensor shape and output tensor shape will be different, which will trigger the assertion failure at https://fburl.com/0m8xy7tc.
Reviewed By: jamesr66a
Differential Revision: D15445092
fbshipit-source-id: 22da41a56ecb9ac092585d0cc1ff0658fb9d631b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20045
This pass adds quant-dequant nodes for bias. It requires the
quant-dequant pass for activations and weights to have run first, as those are required
to compute the qparams for bias.
Differential Revision: D15179141
fbshipit-source-id: 3aab9fceefcadc3fa42a4e802d9b1e18addad78a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20770
Add the dict type since it's part of the PyTorch built-in type system, and sparse features and text features will be converted to Dict.
Reviewed By: pritamdamania87
Differential Revision: D15436255
fbshipit-source-id: 239adbd6a8f68be29020fe656d790f6872f1f0e9
Summary:
as title. We were using AT_ASSERT, which is newly deprecated. In this case, we do in fact want an internal assertion since this is used in testing code to describe expected behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20555
Differential Revision: D15362964
Pulled By: suo
fbshipit-source-id: 984bfe71a774571611f3bbd81767d3cdb878a6fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20345
Separate from D15194600
Optimize pytorch layer_norm op part 1:
optimize layer_norm_forward_cpu
use Eigen Maps to improve reduction performance
Reviewed By: zheng-xq
Differential Revision: D15290608
fbshipit-source-id: cf2c208dfd6fbcbc4c69db3ed60278d9bee156b5
Summary:
The previous implementation of magic methods extended from BuiltinOperators, but it should be able to work with other sugared values, such as casts.
I also considered making CastValues and BuiltinOperators extend from a MagicMethod superclass and having them try the superclass's call before their own. However, not all builtin operators have corresponding magic methods, so I did it this way instead (although there are workarounds for that).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20654
Differential Revision: D15434469
Pulled By: eellison
fbshipit-source-id: 813fa00bf8b5b9ada46505075ebf984d8eee6aef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20711
For uint8_t, `std::numeric_limits<uint8_t>::digits` returns 8;
for int8_t, `std::numeric_limits<int8_t>::digits` returns 7.
FBGEMM wants `qparams.precision` to always be 8 for both int8_t and uint8_t.
Reviewed By: jerryzh168
Differential Revision: D15410695
fbshipit-source-id: 17dc3842d7c426947454c201bcb167b87b7301ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20726
Edward says it doesn't actually provide compilers,
but it does provide dependencies, so let's mention that instead.
Reviewed By: ezyang
Differential Revision: D15423316
fbshipit-source-id: 9b384f88e5bf7a3d2c132508620c276b49e1569f
Summary:
This PR implements auto-conversion of GPU arrays that support the `__cuda_array_interface__` protocol (fixes#15601).
If an object exposes the `__cuda_array_interface__` attribute, `torch.as_tensor()` and `torch.tensor()` will use the exposed device memory.
#### Zero-copy
When using `torch.as_tensor(..., device=D)` where `D` is the same device as the one used in `__cuda_array_interface__`.
#### Implicit copy
When using `torch.as_tensor(..., device=D)` where `D` is the CPU or another non-CUDA device.
#### Explicit copy
When using `torch.tensor()`.
#### Exception
When using `torch.as_tensor(..., device=D)` where `D` is a CUDA device not used in `__cuda_array_interface__`.
#### Lifetime
The tensor returned by `torch.as_tensor(obj)` grabs a reference to `obj` so that the lifetime of `obj` exceeds the lifetime of the tensor.
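A hedged zero-copy sketch (assumes CuPy, which exposes `__cuda_array_interface__`, is installed alongside a CUDA build of PyTorch):
```python
import cupy
import torch

x = cupy.arange(6, dtype=cupy.float32)  # lives on the GPU, exposes __cuda_array_interface__
t = torch.as_tensor(x, device="cuda")   # zero-copy: shares the same device memory
t[0] = 42.0
print(x[0])                             # 42.0 -- both views see the write
```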
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20584
Differential Revision: D15435610
Pulled By: ezyang
fbshipit-source-id: c423776ba2f2c073b902e0a0ce272d54e9005286
Summary:
Appending `arch` to the generator name is not supported for VS starting from VS 2019.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20752
Differential Revision: D15436740
Pulled By: ezyang
fbshipit-source-id: 20057aae8f708d82619927bf2cb87dd1bc2df312
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20737
If someone tries to register multiple kernels in the same .op() call, we're now throwing an error.
Differential Revision: D15425660
fbshipit-source-id: 6d2f1444da3e16a6a98863d847965c2aa211e046
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20674
A few targets in caffe2/caffe2/distribute need to be split too, otherwise they won't compile. Also some cleanups, and rename select_gpu_type to gpu_library_selector.
Differential Revision: D15406019
fbshipit-source-id: 6455ab885b248502b48d4c7565597e00fecfd547
Summary:
Let there be color!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20662
Differential Revision: D15434110
Pulled By: suo
fbshipit-source-id: a317ae72ad72e0b8249f55c9c8d31f420c78c040
Summary:
Building with CUDA and GCC 4.8.5-28, we see many warnings like:
[893/1645] Building NVCC (Device) object caffe2/CMakeFiles/caffe2_gpu.dir/__/aten/src/THCUNN/caffe2_gpu_generated_ELU.cu.o
/home/bvaughan/repos/pytorch/c10/util/ArrayRef.h:277:48: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
using IntList C10_DEPRECATED_USING = ArrayRef<int64_t>;
This change prevents those warnings on the older compiler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20587
Differential Revision: D15432749
Pulled By: nairbv
fbshipit-source-id: fd707afcbd6564f96617378d7cd6d62d941a052b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20468
ScalarType node is mandatory for activations and parameters now.
This change inserts a ScalarType node for all the quant-dequant nodes. For the activations, the current default value is at::ScalarType::Undefined. Remove this and explicitly pass the at::ScalarType::QUint8 dtype.
Differential Revision: D15331600
fbshipit-source-id: 5b51e0b42e694bf409026af4783a12da6d7e234b
Summary:
Copy.cu goes from 308 to 190 lines of code. In general it uses the same
copy strategy: cudaMemcpyAsync, a pointwise kernel, or a copy
using temporary buffers. The pointwise kernel has slightly improved
performance when broadcasting due to faster index calculation.
This deletes "`s_copy_`", "`_s_copy_from`", and "`_copy_same_type_`". The only
entry-point now is "`copy_`".
A mini-benchmark is here:
https://gist.github.com/colesbury/706de1d4e8260afe046020988410b992
Before:
https://gist.github.com/colesbury/ab454b6fe3791bff420d7bcf8c041f18
After:
https://gist.github.com/colesbury/9024d242b56ab09a9ec985fa6d1620bc
Results were measured on 2.2 GHz Broadwell; no-turbo; one thread;
compiled with GCC 7.3.0. (Results are slower than typical usage due to
turbo being off.)
The only significant difference is in the CUDA [1024] -> [1024, 1024]
broadcasting copy, which is ~25% faster. I don't expect a noticeable
difference in real programs.
CPU copy overhead is a tiny bit (~200 ns) faster, but I don't expect
anyone to notice that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20685
Differential Revision: D15414819
Pulled By: colesbury
fbshipit-source-id: d3c6e04a5020470e3bef15b1fc09503cae5df440
Summary:
Symbols are given hidden visibility by default on Linux to emulate the behavior on Windows. This helps developers catch visibility issues in their streamlined Linux dev environment before being surprised, late in the process, by Windows errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20461
Reviewed By: kostmo
Differential Revision: D15410410
Pulled By: dzhulgakov
fbshipit-source-id: 1d684b5a9a80b692966a775c3f1c56b7c72ffc95
Summary:
Fixes#20017
This wraps the `torch._C.Function` currently returned from `torch.jit.script` and `torch.jit.trace` in a `ScriptFunction` and `TracedFunction` respectively, both of which are just wrappers to hold the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20386
Pulled By: driazati
Differential Revision: D15403161
fbshipit-source-id: 94fb9f32929e62a00be6cf7512ea144ec9b91e0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20677
With the new changes in the IR, it is possible to insert nodes after param
nodes in the graph. Thus we do not need two methods for inserting q-dq
nodes at the input or output of quantizable nodes.
Differential Revision: D15406354
fbshipit-source-id: 1963762f434fd82877fa76a272e8520c342b6069
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20709
- Remove ArrayRef based API. This is neither the old nor the planned new API.
- De-deprecate kernels based on std::vector and std::unordered_map. We don't have the Dict/List based API figured out entirely yet, so we shouldn't push people towards using them.
std::vector and std::unordered_map will get deprecated again once we figure out List/Dict.
Reviewed By: dzhulgakov
Differential Revision: D15417025
fbshipit-source-id: bfbb33c762e43487bb499bc8cc36d515e678f8fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20667
Compilation errors:
```
xplat/caffe2/caffe2/utils/signal_handler.h:31:10: error: private field 'SIGINT_action_' is not used [-Werror,-Wunused-private-field]
Action SIGINT_action_;
^
xplat/caffe2/caffe2/utils/signal_handler.h:32:10: error: private field 'SIGHUP_action_' is not used [-Werror,-Wunused-private-field]
Action SIGHUP_action_;
^
xplat/caffe2/caffe2/utils/signal_handler.h:33:17: error: private field 'my_sigint_count_' is not used [-Werror,-Wunused-private-field]
unsigned long my_sigint_count_;
^
xplat/caffe2/caffe2/utils/signal_handler.h:34:17: error: private field 'my_sighup_count_' is not used [-Werror,-Wunused-private-field]
unsigned long my_sighup_count_;
^
4 errors generated.
xplat/caffe2/caffe2/share/fb/stylizer/median_blur_ops.cc:593:14: error: private field 'ws_' is not used [-Werror,-Wunused-private-field]
Workspace* ws_;
^
1 error generated.
```
Reviewed By: bwasti
Differential Revision: D15402928
fbshipit-source-id: 5b98499850aa659fd37ab8e7f2e75166787b8129
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20040
Add support for the feature store example in the FBLearner PyTorch predictor, end to end.
Reviewed By: dzhulgakov
Differential Revision: D15177897
fbshipit-source-id: 0f6df8b064eb9844fc9ddae61e978d6574c22916
Summary:
load_state_dict includes a recursive inner function `load` that captures
Tensors through the closed-over variable `state_dict`. Because it's
recursive, it also captures itself leading to a reference cycle.
This breaks the reference cycle so that any Tensors in state_dict can be
collected immediately instead of waiting until the next GC cycle.
Alternatively, we could have passed `state_dict` and `metadata` as
arguments to load to prevent capture of Tensors. (That would still
result in cyclic garbage, but not any cyclic garbage of Tensors).
See:
https://github.com/pytorch/pytorch/issues/20199#issuecomment-491089004
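A standalone sketch (not the actual load_state_dict code) of why breaking the cycle frees the captured objects immediately:
```python
import weakref

class Payload:                          # stand-in for a Tensor held by state_dict
    pass

def apply_recursively(state_dict):
    def load(depth=0):
        _ = state_dict                  # captured through a closure cell
        if depth < 1:
            load(depth + 1)             # `load` also captures itself -> reference cycle
    load()
    load = None                         # break the load -> load cycle (the fix)

payload = Payload()
ref = weakref.ref(payload)
apply_recursively({"weight": payload})
del payload
assert ref() is None                    # freed by refcounting; no gc.collect() needed
```
Without the final rebinding (or an equivalent `del`), the closure cycle keeps `state_dict` and anything it holds alive until the next cyclic GC pass.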
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20397
Differential Revision: D15414834
Pulled By: colesbury
fbshipit-source-id: 4c2275a08b2d8043deb3779db28be03bda15872d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20607
Add a new method SummaryWriter.flush() that iterates through all of the FileWriters and flushes them
Reviewed By: orionr
Differential Revision: D15380124
fbshipit-source-id: 1975f3f61c5ae3754552bfdb23f2cd78f687d19f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20649
I went through every occurrence of AT_ASSERT in this file and
thought about whether or not it should be TORCH_INTERNAL_ASSERT
or TORCH_CHECK. I think I did a good job at it. Some thoughts:
- In order to decide if a check is "internal" or not, we must
think about where the separation between userspace and our internals
is. I think any code that utilizes the PyTorch or Caffe2 C++ frontends
counts as userspace. An important corollary is that the majority of operator
code "counts" as userspace, even though it lives in our repository. This
is in line with TCB (trusted computing base) thinking: you want the TCB to
be as small as possible, and because we have a *lot* of operator
implementations, they should not count as TCB.
- The primary test I applied when considering an AT_ASSERT was whether or
not I could trigger this error by just making method calls on caffe2::Tensor
or at::Tensor. If I could, that made it a TORCH_CHECK. This covers most
of the misapplications of TORCH_INTERNAL_ASSERT. One place I didn't
do this was the "is variable" checks; I think you have to work a bit
harder to trigger this case, and userspace code is not mixing up
Variables and Tensors.
- I updated the docs for device_opt_, explaining when it could be nullopt.
(The nullopt checks here are TORCH_CHECK, because you can trigger them
by taking an undefined tensor and poking the methods.)
Differential Revision: D15395576
fbshipit-source-id: 1c51b396012e7d949fbb4258092cf80e5e6f851b
Summary:
Fixes#20651
Communication collectives in `torch.distributed` call `CUDACachingAllocator::recordStream()` on input and output tensors to prevent their memory blocks from being freed too early. `CUDACachingAllocator` uses the tensor's data pointer to track memory blocks and does not accept null pointers. However, an empty tensor's `storage().data()` might be null. In this case, as there is no associated memory block for the empty tensor, it should be fine to make `recordStream()` a no-op.
Tests only cover `broadcast` with empty tensors for the GLOO backend, because GLOO does not support empty inputs (facebookincubator/gloo/issues/179). That can be addressed in either `ProcessGroupGloo` or GLOO itself; more tests will be added when that gap is filled.
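For reference, a hypothetical sketch of the call path in question (assumes a process group is already initialized; this is not the PR's actual test code):
```python
import torch
import torch.distributed as dist

def broadcast_empty():
    # An empty CUDA tensor's storage().data() may be a null pointer;
    # recordStream() on it is now a no-op instead of an error.
    t = torch.empty(0, device="cuda")
    dist.broadcast(t, src=0)
```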
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20658
Differential Revision: D15399371
Pulled By: mrshenli
fbshipit-source-id: d29ebd1c72fddae49531f32695f81b89e42e5a4d
Summary:
According to https://pytorch.org/docs/stable/notes/broadcasting.html, in-place operations do not allow the in-place tensor to change shape as a result of the broadcast. Therefore our shape analysis can keep the shape information on inputs.
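For illustration, the rule the shape analysis relies on (plain PyTorch broadcasting semantics):
```python
import torch

a = torch.ones(3)
a.add_(torch.tensor([2.0]))    # fine: the other operand broadcasts up to a's shape
try:
    a.add_(torch.ones(2, 3))   # would require `a` itself to change shape to (2, 3)
except RuntimeError as e:
    print(e)                   # in-place ops may not change the in-place tensor's shape
```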
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20661
Differential Revision: D15406477
Pulled By: wanchaol
fbshipit-source-id: 8ab60e783292f2fe26e5fdecfb64bec43bca6826
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20561
We previously planned to deprecate the direct passing of a kernel function or lambda to the op() call, e.g.
static auto registry = RegisterOperators().op("my::op", &func);
and push users towards the options based API:
static auto registry = RegisterOperators().op("my::op", RegisterOperators::options().kernel<decltype(func), &func>());
because that has a slightly lower performance overhead when calling the kernel.
However, that overhead is negligible for all but exotic use cases, so there's no reason to push users towards a more verbose API.
This diff removes the deprecation warning from that API.
However, if you use the API together with deprecated types like std::unordered_map, you will now get a deprecation warning there.
Reviewed By: zdevito
Differential Revision: D15364271
fbshipit-source-id: 56dae0c5870bbab16ad19ba5178f4bea9eafed9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20514
Change API from
static auto registry = c10::RegisterOperators()
.op("my::op",
c10::kernel(...),
c10::dispatchKey(...)
);
to
static auto registry = c10::RegisterOperators()
.op("my::op", c10::RegisterOperators::options()
.kernel(...)
.dispatchKey(...)
);
because this allows better discoverability. People looking for which options are available will easier find it and IDE autocompletion will work better.
Reviewed By: zdevito
Differential Revision: D15346348
fbshipit-source-id: 4b74a33b75c2b9cda4a903639fb7abd2c7cff167
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20672
Current test case looks for q->int_repr->dq pattern and constant nodes also.
The prim::Constant nodes are not guaranteed to be present at same point in graph.
So we modify the test case to only look for the q->int_repr->dq nodes.
Differential Revision: D15405606
fbshipit-source-id: 2086ffb5bbd328d2a9a55f4c2a2de342575194d3
Summary:
Otherwise users see something like (Tensor, Tensor)? and don't know what the ? means.
First commit is formatting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20657
Differential Revision: D15400225
Pulled By: eellison
fbshipit-source-id: cf826790bf2ddafd34f6d5c144526cad9904770b
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#19578 [jit] Try to script all Python functions**
This adds the `torch.jit._enable_recursive_script` context manager, which will try to compile any Python functions it sees. It's hidden behind an internal context manager for now since it's incomplete (doesn't work for script_methods/Python submodules). If it can't compile the Python function it outputs an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19578
Pulled By: driazati
Differential Revision: D15386727
fbshipit-source-id: 4e308f67677b8e9fccfc525a91bb2f4585062048
Summary:
Adds support for `__getstate__` and `__setstate__` on modules that are called as part of export (`torch.save()`) and import (`torch.jit.load`).
* `__getstate__` and `__setstate__` must be TorchScript functions with the signatures `() -> T` and `(T) -> None` respectively
* The results of `__getstate__` are stored using the pickler in `states.pkl` with one for each module in definition order (`__getstate__` returns `None` by default if an implementation is not provided)
* This prevents sharing between `__getstate__` and attributes, but this should be fine since their use is mostly unrelated (attributes are for storing values to be used in script methods, `__getstate__` for running arbitrary computations during import)
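Roughly, the contract looks like this (a hedged sketch against the ScriptModule API of this era; the module, attribute, and type annotations are hypothetical and not taken from the PR):
```python
import torch

class Counter(torch.jit.ScriptModule):
    def __init__(self):
        super(Counter, self).__init__()
        self.count = torch.jit.Attribute(0, int)

    @torch.jit.script_method
    def __getstate__(self):
        # type: () -> int
        return self.count

    @torch.jit.script_method
    def __setstate__(self, state):
        # type: (int) -> None
        self.count = state
```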
Follow up
* Somehow replacing `__getstate__`/`__setstate__` with a `ScriptMethodStub` makes `MyScriptModule().__getstate__()` call `ScriptModule.__getstate__()` when used in Python. This should be fixed so semantics in Python are preserved, but it doesn't affect the typical usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20242
Pulled By: driazati
Differential Revision: D15287161
fbshipit-source-id: b3f5f33ab74a21a89e6d15460af63aff75cab2d8
Summary:
- Earlier, we had to use the legacy implementation of `getri` for single matrix inverse from TH and THC
- Now, this has been moved to ATen
Changelog:
- Move single matrix inverse implementation to ATen
- Remove unused code in TH and THC resulting from the change
- Minor modifications made to single matrix CPU function implementations in ATen to avoid redundancy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20534
Differential Revision: D15393383
Pulled By: ezyang
fbshipit-source-id: 81972111cd9757d15f1d634f294c93fd0f35636c
Summary:
Recent versions of GCC split unaligned load and store intrinsics into
two 128-bit instructions. On old processors (Sandy Bridge) this was a
bit faster for unaligned data, but bit slower for aligned data. On new
processors (Intel Haswell+, recent AMD) splitting loads is slower on
both aligned and unaligned data.
Clang, MSVC, and ICC do not split unaligned load and store intrinsics.
There's a good explanation here:
https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd#tab-top
Splitting load and store intrinsics makes no sense in our AVX2
configuration because the CPUs that support AVX2 instructions are the
same CPUs where splitting is disadvantageous on all data alignments.
Note that this doesn't change the AVX configuration (used by CPUs that
support AVX but not AVX2). It's possible this would be beneficial for
that configuration too (our data is usually 32-byte aligned), but I'd
prefer the conservative change for now.
torch.add generated assembly (hot loop) (GCC 7.3.0)
before:
https://gist.github.com/colesbury/066376537bccd514daf8fe4ab54d8295
after:
https://gist.github.com/colesbury/8b4b948145001d44b225c51d2428bb91
Timing of `torch.add(x, y, out=z)` for size 10240 (1 thread, Broadwell,
no turbo):
before: 7.35 us after: 6.39 us
(Take the torch.add timings with a grain of salt. The difference in timings
is much larger than I would expect.)
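For reference, a rough single-threaded timing harness along the lines of the measurement above (absolute numbers will of course depend on the machine and build flags):
```python
import timeit

import torch

torch.set_num_threads(1)

x = torch.randn(10240)
y = torch.randn(10240)
z = torch.empty(10240)

iters = 100000
t = timeit.timeit(lambda: torch.add(x, y, out=z), number=iters)
print("torch.add, size 10240: {:.2f} us per call".format(t / iters * 1e6))
```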
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20609
Differential Revision: D15385800
Pulled By: colesbury
fbshipit-source-id: 66415b148a3b19360b9de9881af594ab46547b6f
Summary:
Change `Inputs` to `Shape` to unify the format of CTCLoss `class`, and add the type of `Output` in `Shape`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20422
Differential Revision: D15393484
Pulled By: ezyang
fbshipit-source-id: 5b49647f9740de77db49a566fa2de74fcecd9110
Summary:
CUDA 8 is no longer supported and removed from CI, so these checks are irrelevant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20482
Differential Revision: D15393438
Pulled By: ezyang
fbshipit-source-id: ac0979bf660b3314eec502c745e34ce4940bda0e
Summary:
Fixes#20568.
Looks like CMake is passing `/MD` when we call `add_library`. We need to fix these with C source files too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20574
Differential Revision: D15392682
Pulled By: ezyang
fbshipit-source-id: c92034d8725fcec48fd7db6cf5322868e956dc6b
Summary:
Fixes#20523 .
nn.Upsample was unable to accept tuple inputs for the scale_factor argument due to direct casting to float, which was done in #17732.
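A quick sketch of the kind of call this fixes (output sizes depend on the mode; `nearest` shown here):
```python
import torch
import torch.nn as nn

# per-dimension scale factors passed as a tuple
up = nn.Upsample(scale_factor=(2, 3), mode="nearest")
x = torch.randn(1, 1, 4, 4)
print(up(x).shape)  # torch.Size([1, 1, 8, 12])
```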
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20581
Differential Revision: D15392622
Pulled By: ezyang
fbshipit-source-id: b56ba8197a5bbf8891bc7e1bebf5cad63dcab04d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20441
This op is fairly complex and the fact that it isn't formatted
correctly makes things that much harder to reason about. Clean it up.
Reviewed By: dreiss
Differential Revision: D15220006
fbshipit-source-id: 30632d8bdbf15f96e73d8b6c96c5f29c052e6e7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20502
Following D15307410 removing more floating point exceptions in unit tests
Reviewed By: hx89
Differential Revision: D15340930
fbshipit-source-id: 269fc75e0800bc9d39126767a0f3ca15cd8b0cad
Summary:
(Reopens https://github.com/pytorch/pytorch/pull/20330 and fixes test error.)
(Reopens https://github.com/pytorch/pytorch/pull/20330 and fixes test error.) After the Variable/Tensor merge, there is no guarantee that `indices` and `values` passed into the sparse tensor constructor don't contain AutogradMeta. However, we want to maintain the existing invariant that `indices_` and `values_` of a sparse tensor don't contain AutogradMeta, and to achieve this we need to do a shallow copy in the sparse tensor constructor.
Note that this is BC-breaking for code that changes the sizes / strides of the indices or values tensor after it's used to create a sparse tensor. In current master, such changes will be reflected in the sparse tensor and break sparse tensor invariants. After this PR, those changes will not be reflected in the sparse tensor, and thus the sparse tensor invariants are always preserved. Specifically, running in-place size/stride-changing ops such as `resize_` / `resize_as_` / `as_strided_` / `set_` / `transpose_` on the original values tensor will not update the sparse tensor's `values_`. For example:
```python
# Calling resize_ on non-requires-grad value tensor
i2 = torch.zeros([1, 1])
v2 = torch.ones([1, 2, 3])
t2 = torch.sparse_coo_tensor(i2, v2, torch.Size([2, 2, 3]))
v2.resize_(4, 5)
t2.coalesce().values().size()
# On current master, this throws "indices and values must have same nnz, but got nnz from indices: 1, nnz from values: 4", because resizing the original value tensor affects `values_` of the sparse tensor.
# After this PR, this prints "torch.Size([1, 2, 3])", which means resizing the original value tensor doesn't affect `values_` of the sparse tensor.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20614
Differential Revision: D15385811
Pulled By: yf225
fbshipit-source-id: e963fcf5e4097f8c881b56145f408565d97cf5c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19932
In preparation to add int8_t data type for QTensor
Reviewed By: zafartahirov
Differential Revision: D15137838
fbshipit-source-id: 59462c36d6fc5982986d4196bf3f32f49bb294d7
Summary:
After the Variable/Tensor merge, there is no guarantee that `indices` and `values` passed into the sparse tensor constructor don't contain AutogradMeta. However, we want to maintain the existing invariant that `indices_` and `values_` of a sparse tensor don't contain AutogradMeta, and to achieve this we need to do a shallow copy in the sparse tensor constructor.
Note that this is BC-breaking for code that changes the sizes / strides of the indices or values tensor after it's used to create a sparse tensor. In current master, such changes will be reflected in the sparse tensor and break sparse tensor invariants. After this PR, those changes will not be reflected in the sparse tensor, and thus the sparse tensor invariants are always preserved. Specifically, running in-place size/stride-changing ops such as `resize_` / `resize_as_` / `as_strided_` / `set_` / `transpose_` on the original values tensor will not update the sparse tensor's `values_`. For example:
```python
# Calling resize_ on non-requires-grad value tensor
i2 = torch.zeros([1, 1])
v2 = torch.ones([1, 2, 3])
t2 = torch.sparse_coo_tensor(i2, v2, torch.Size([2, 2, 3]))
v2.resize_(4, 5)
t2.coalesce().values().size()
# On current master, this throws "indices and values must have same nnz, but got nnz from indices: 1, nnz from values: 4", because resizing the original value tensor affects `values_` of the sparse tensor.
# After this PR, this prints "torch.Size([1, 2, 3])", which means resizing the original value tensor doesn't affect `values_` of the sparse tensor.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20330
Differential Revision: D15373683
Pulled By: yf225
fbshipit-source-id: 32e7275d7121e17937c7cc258e8a60bb0848ff25
Summary:
Currently `bmm()` has very heavy performance overhead on CPU due to construction/deconstruction of `TensorImpl`. Applying `TensorAccessor` when indexing tensor data can greatly improve the performance.
I tested this on `fairseq` Transformer model. Results on Xeon 6148 (20*2 cores 2.5GHz) indicate this PR improves Transformer training performance by approximately **10%** (seconds per iteration reduced from **3.60** to **3.21**). Considering the fact that `bmm()` takes only **14%** of the total time, 10% overall improvement indicates `bmm()` itself improves by roughly **3x**.
Before:
```
| epoch 001: 0%| | 43/25337 [02:34<25:17:11, 3.60s/it, loss=16.179, nll_loss=16.137, ppl=72045.59, wps=1320, ups=0, wpb=4758.767, bsz=136.558, num_updates=43, lr=6.45e-06, gnorm=6.88
```
After:
```
| epoch 001: 0%| | 23/25337 [01:13<22:32:48, 3.21s/it, loss=17.072, nll_loss=17.068, ppl=137419.42, wps=1478, ups=0, wpb=4746.870, bsz=128.348, num_updates=23, lr=3.45e-06, gnorm=10.
```
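A rough CPU micro-benchmark for isolating `bmm()` itself (shapes here are arbitrary, not the fairseq workload):
```python
import timeit

import torch

a = torch.randn(64, 128, 128)
b = torch.randn(64, 128, 128)

iters = 200
t = timeit.timeit(lambda: torch.bmm(a, b), number=iters)
print("torch.bmm, 64 batches of 128x128: {:.3f} ms per call".format(t / iters * 1e3))
```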
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20266
Differential Revision: D15262201
Pulled By: cpuhrsch
fbshipit-source-id: c2e4e406c06714b04cc7534f3da71e986eddca35
Summary:
In onnx spec, the supported input/output type for `And` and `Or` is `Bool` only.
Thus in exporting, cast to/from `Bool` is inserted for input/output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17894
Reviewed By: zrphercule
Differential Revision: D15103148
Pulled By: houseroad
fbshipit-source-id: 3e1068ea236c743260d42882fb11f0e3a21707e6
Summary:
First time this was merged it broke master and was reverted. This time I do not add ```set -u``` to the .circleci/scripts/setup* scripts. There's still a chance that ```set -u``` breaks the binary builds on master, but at least those can be fixed in parallel and don't completely eliminate signal from all merges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20540
Differential Revision: D15373444
Pulled By: pjh5
fbshipit-source-id: 0203c20865827366ecd8fa07b2db74d255549ed1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20501
Fixing unit tests related to optimizer related operators and tests
Reviewed By: hx89
Differential Revision: D15307410
fbshipit-source-id: e5400c26e08f26191ee542fe6b02e0a69bc4e1ae
Summary:
#19975 was separated by 2 PRs.
This one:
Introduce MemoryFormat argument to the `x.is_contiguous(memory_format=torch.channels_last)` and to the `y = x.contiguous(memory_format=torch.channels_last)` functions.
At this moment both functions just operate on strides and don't store any tensor state.
(Original RFC #19092)
-----
Expands functionality of two tensor functions `.is_contiguous` and `.contiguous` (both python and c++ api).
Note: We had several complaints about `.to(memory_format)` function, and decided not to support it.
1. `.contiguous` now supports an optional keyword-only argument, `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last` (see the sketch after this list).
- Using `torch.contiguous_format` will preserve the existing `.contiguous()` behavior.
- Calling `x.contiguous(memory_format=torch.channels_last)` returns a new tensor which maintains the same semantic layout (NCHW) but has a different memory allocation pattern.
`x.contiguous(memory_format=torch.channels_last)` expects the input tensor to be 3d, 4d or 5d, and fails otherwise.
2. `.is_contiguous` now supports an optional keyword-only argument, `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- `x.is_contiguous(memory_format=torch.contiguous_format)` preserves the same functionality as `x.is_contiguous()` and remains unchanged.
- `x.is_contiguous(memory_format=torch.channels_last)` returns true if A) the input tensor is contiguous in memory AND B) it is allocated in memory in NHWC (or similar for 3d, 5d) format.
Note: By the end of phase one, `x.is_contiguous(memory_format=torch.channels_last)` will calculate the state of the tensor on every call. This functionality is going to be updated later.
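A minimal sketch of the two calls on a 4d (NCHW) tensor:
```python
import torch

x = torch.randn(2, 3, 4, 4)  # NCHW
y = x.contiguous(memory_format=torch.channels_last)

print(x.is_contiguous())                                   # True
print(x.is_contiguous(memory_format=torch.channels_last))  # False
print(y.is_contiguous(memory_format=torch.channels_last))  # True
print(y.size())  # semantic layout unchanged: torch.Size([2, 3, 4, 4])
```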
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20455
Differential Revision: D15341577
Pulled By: VitalyFedyunin
fbshipit-source-id: bbb6b4159a8a49149110ad321109a3742383185d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20513
They've been using an old API, switch them to the new one instead.
Reviewed By: li-roy
Differential Revision: D15346349
fbshipit-source-id: 538eb460897ec6addebeebf88b316eb0d6b1dd6f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20379
The legacy custom op API allowed nesting of std::unordered_map and std::vector. While we haven't figured out yet how to do that with the new API,
we at least have to keep backwards compatibility. This diff adds the feature so we can switch to the new API without breaking third party code.
Reviewed By: li-roy
Differential Revision: D15287693
fbshipit-source-id: bb5b8429fddf6298719cbf567b584ed371f8fc81
Summary:
Previously, the caller of `shallow_copy_and_detach()` is responsible for deciding whether the shallow-copy should share the source TensorImpl's version counter, or have its own new version counter. However, since this decision is crucial for ensuring the correctness of the shallow-copy's version counter, we want to enforce users of `shallow_copy_and_detach()` to pass a version counter to the function call, so that they are required to make the decision at the time of API usage, not as an afterthought.
For similar reasons, we want to enforce users of `shallow_copy_and_detach()` to pass `allow_tensor_metadata_change` to the function call, so that they are required to decide "whether the TensorImpl shallow-copy should allow tensor metadata change" at the time of API usage, not as an afterthought.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20496
Differential Revision: D15363620
Pulled By: yf225
fbshipit-source-id: a65e74738b10452668d6dc644b43aad5b3d8c9e6
Summary:
Remove weak_script. After recently splitting the forward() function in MultiheadAttention module, we notice a memory leak on GPU. Fix the problem by removing those "weak_script" decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20563
Differential Revision: D15368262
Pulled By: zhangguanheng66
fbshipit-source-id: 475db93c9ee0dbaea8fb914c004e7d1e0d419bc2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20543
All of that code for concatenating strings together adds up. Just discard it all for mobile builds.
Reviewed By: ljk53
Differential Revision: D15353447
fbshipit-source-id: a82dd0b884335d662605aabf7dd3d09dfcc1478b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20439
This is the QTensorProto workflow for multi group quantization in C2 side.
No DNNLOWP Tensor related thing is included in this pr, so once we finished glow side, we should be able to test this pr using resnet50.
Reviewed By: yinghai
Differential Revision: D15096919
fbshipit-source-id: 741eecd59eb79d24d9fe2b035f6246d42422d25c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19816
We need this for quantization for bias
add third argument of ScalarType to `quantize_linear`
Differential Revision: D15094174
fbshipit-source-id: f19ec8f4716cf5fe0aa21b38d45af6d27c9ab377
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20512
Fixing typos in the description of schema for one of the inputs for BatchMatMul operator.
Reviewed By: jianyuh, BIT-silence
Differential Revision: D15343879
fbshipit-source-id: 06354e8e6b0d79fea937ed2703bb457b2d04f859
Summary:
Fix a typo in the doc.
Add an AssertError check back to MultiheadAttention module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20492
Differential Revision: D15349008
Pulled By: cpuhrsch
fbshipit-source-id: 2d898345f03787c713e537673613a748ad826b34
Summary:
The current variance kernels compute mean at the same time. Many times we want both statistics together, so it seems reasonable to have a kwarg/function that allows us to get both values without launching an extra kernel.
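Assuming the combined statistic ends up exposed the way current PyTorch ships it (`torch.var_mean` / `torch.std_mean`; whether that exact name is what this PR introduces is not confirmed here), usage would look like:
```python
import torch

x = torch.randn(8, 1024)

# two separate reductions (two kernel launches)
v1, m1 = x.var(dim=1), x.mean(dim=1)

# one fused call returning both statistics
v2, m2 = torch.var_mean(x, dim=1)

assert torch.allclose(v1, v2) and torch.allclose(m1, m2)
```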
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18731
Differential Revision: D14726082
Pulled By: ifedan
fbshipit-source-id: 473cba0227b69eb2240dca5e61a8f4366df0e029
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20374
This test case now also tests that the argument type works correctly in kernels that
- don't return outputs
- return multiple outputs
Reviewed By: li-roy
Differential Revision: D15298233
fbshipit-source-id: 82ab9d81b55b4f9fb34d66a155cc426af8592e25
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20373
- Add support for Dict<Key, Value> arguments and returns to c10 operators
- Add support for std::unordered_map<Key, Value> to the legacy API (but not to c10 kernels)
Reviewed By: li-roy
Differential Revision: D15298235
fbshipit-source-id: 6d9793db1f12bea377f508a9b33a495ebe0bec18
Summary:
Add automatic translations for a few argument names that commonly differ between PyTorch and NumPy.
For now, they are as follows:
* `keepdim` -> `keepdims`
* `dim` -> `axis`
* `input` -> (any of `a`, `x`, `x1`)
* `other` -> `x2`
Basic examples:
```python
>>> t=torch.randn(10,10)
>>> torch.sum(x=t, axis=1)
tensor([ 0.5199, -0.3768, 4.3619, -0.9105, 1.1804, 1.0837, -0.9036, 0.2365,
1.1171, -0.0999])
```
```python
>>> torch.add(x1=5, x2=6)
tensor(11)
```
The additional overhead is zero when using traditional PyTorch argument names, and a few (usually 1) extra PyDict lookups when using NumPy argument names.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20451
Differential Revision: D15337521
Pulled By: umanwizard
fbshipit-source-id: 7a7d389786f4ccf5c86a14ecb2002c61730c51b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20020
Add shape inference for LearningRate op. The output (lr) should have similar shape with input (iteration), but not the same type (float vs int).
Reviewed By: un-disclosed
Differential Revision: D15112300
fbshipit-source-id: 09969aefa15172a6f3c70cd9b2548e3020da5d7a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20372
Implement a Dict type that allows us to abstract away from the concrete implementation used.
The API is similar to std::unordered_map, but behind the scenes we can switch to any map implementation we like. ska::flat_hash_map, google dense map, or any future map implementation with better performance.
Switching such an implementation choice does not have to break backwards compatibility of kernel code using the Dict type.
Reviewed By: zdevito
Differential Revision: D15298234
fbshipit-source-id: b5ad368a9e9516030805cd8f5f1b02e3986933c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20463
Source file changes mostly involve ifdef'ing-out references to JIT code
from files that are part of Caffe2Go. Update Internal build scripts to
remove those files from our globs.
After this, changes to most of the JIT files should not trigger mobile CI.
Reviewed By: dzhulgakov
Differential Revision: D15329407
fbshipit-source-id: 48f614c6b028eef0a03ce5161d083a3e078b0412
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20021
Add shape inference for AtomicIter operator. The operator takes two blobs iteration and iter_mutex as input and outputs iteration, which should have the same type and shape as the input.
Reviewed By: un-disclosed
Differential Revision: D15111643
fbshipit-source-id: 0d06413305cc4c6257c0cfabf62fb874970803bc
Summary:
Moving functions from torch/nn/modules/activation.py to torch/nn/functional.py. For functions not implemented (_get_input_buffer and _set_input_buffer), a TODO is added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20415
Differential Revision: D15318078
Pulled By: jamarshon
fbshipit-source-id: 5ca698e2913821442cf8609cc61ac8190496a3c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20390
duc0 Ngo implemented observing floating point exceptions but there were a couple of places where we have "benign" floating point exceptions leading to false positives. This diff eliminates one source of such false positives, namely using _mm256_cvtph_ps and _mm256_cvtps_ph for partially uninitialized array for the remainder loop.
Reviewed By: hx89
Differential Revision: D15307358
fbshipit-source-id: 38f57dfdd90c70bc693292d2f9c33c7ba558e2c9
Summary:
Tagging along to changes in #20191 which added more support for types in the pickler
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20444
Pulled By: driazati
Differential Revision: D15321463
fbshipit-source-id: 985061bf5070a7d7bad58ea8db11d531f3d13e74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20108
Add cpp runs for c2, hooked up via pybinds. Print output to terminal. This is not hooked up with the pep output yet because I'd like to verify the numbers first.
Note that this isn't quite the same mechanism as the pytorch cpp hookup, which uses cpp_python_extensions. If I can use the same mechanism to pull all the inputs for c2 through cpp and do FeedBlobs in cpp, then I'll switch to that.
Reviewed By: zheng-xq
Differential Revision: D15155976
fbshipit-source-id: 708079dacd3e19aacfe43d70c5e5bc54da2cf9e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20321
First part of https://github.com/pytorch/pytorch/issues/20287
- Rename `AT_ASSERT` to `TORCH_INTERNAL_ASSERT`
- Make `TORCH_INTERNAL_ASSERT` work with variadic inputs
- Deprecated `AT_ASSERT` and `AT_ASSERTM`
- Rename `AT_CHECK` to `TORCH_CHECK`
- Make `TORCH_CHECK` give a better error message when no arguments are
provided
- Deprecate `AT_ERROR` in favor of `TORCH_CHECK(false, ...)`
- Deprecate `AT_INDEX_ERROR` in favor of `TORCH_CHECK_INDEX(false, ...)`
- Rename `AT_WARN` to `TORCH_WARN`
No use sites are changed; I'll work on that in follow up patches
(or disable the deprecation, if necessary.)
Differential Revision: D15278439
fbshipit-source-id: 7e0ed489d4e89e5f56b8ad7eafa72cb9a06065ee
Summary:
In https://github.com/pytorch/pytorch/pull/18223/files#diff-77a6f3462f2233b921d3042412fed6d3R178, we used `auto saved_version_ = data_.unsafeGetTensorImpl()->version_counter().current_version()` and then `new_data_impl_copy->set_version_counter(saved_version_)`, which actually doesn't preserve the original semantics that `var.set_data(tensor)` should keep `var`'s version counter object intact. This PR fixes the bug and adds test to make sure it doesn't happen again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20391
Differential Revision: D15323430
Pulled By: yf225
fbshipit-source-id: e3ba49b51ec8ccecd51c80cb182387f74cfd2b2b
Summary:
As part of the Variable/Tensor merge, we allow passing Tensor with AutogradMeta into ATen ops, but we want to make sure they are not treated as Variables (i.e. their `is_variable()` is false). This PR makes the necessary change to make this work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20392
Differential Revision: D15321899
Pulled By: yf225
fbshipit-source-id: c2ab09db73c63bd71ba2d8391095f4d6b4240a9a
Summary:
"then the output would also has k tensors" -> "then the output would also have k tensors"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20425
Differential Revision: D15320152
Pulled By: zou3519
fbshipit-source-id: b04e2ccd29c6a3e33ad1040d0ea975a01a7bd9b5
Summary:
As a first step for this plan: https://github.com/pytorch/pytorch/issues/19508#issuecomment-485178192, this PR moves `THCTensor_(uniform)` to ATen. Major changes are:
- `uniform_` cuda kernel now utilizes a philox generator.
- the kernel also utilizes TensorIterator
- the kernel uses a grid-stride loop to achieve peak effective bandwidth
- Since the engine has changed from `curandStateMTGP32` to `curandStatePhilox4_32_10`, the randoms generated now will be different.
- Here is the diff showing codegen changes: https://gist.github.com/syed-ahmed/4af9ae0d42b6c7dbaa13b9dd0d1dd1e8 (BC breaking change if any)
- Philox4_32_10 is known to pass the standard TestU01 Big Crush test (https://www.thesalmons.org/john/random123/papers/random123sc11.pdf) and hence the quality of random numbers generated isn't an issue when compared to the previously used `curandStateMTGP32`.
- I have added a test case in `aten/src/ATen/test/cuda_distributions_test.cu` which verifies that philox offset is incremented properly
The benchmark was done on a DGX station with 4 V100s.
I modified the script from jcjohnson 's [multinomial benchmark](https://github.com/jcjohnson/pytorch-multinomial-benchmark) to produce this notebook which shows that there is a general speedup with this PR and a regression hasn't been introduced: https://gist.github.com/syed-ahmed/9d26d4e96308aed274d0f2c7be5218ef
To reproduce the notebook:
- Run https://gist.github.com/syed-ahmed/4208c22c541f1d30ad6a9b1efc1d728f in a container with the current pytorch top of tree with the command: `python uniform_benchmark.py --stats_json before.json`
- Apply this diff to the current pytorch top of tree and run the same script in a container with the command: `python uniform_benchmark.py --stats_json after.json`
- Run the notebook attached above with the `after.json` and `before.json` in the same directory
The effective bandwidth was calculated using the script (thanks to ngimel): https://gist.github.com/syed-ahmed/f8b7384d642f4bce484228b508b4bc68
Following are the numbers before and after.
```
uniform, size, elements 65536 forward 5.168914794921875e-06 bandwidth (GB/s) 50.71548098597786
uniform, size, elements 131072 forward 5.056858062744141e-06 bandwidth (GB/s) 103.67860705101367
uniform, size, elements 262144 forward 7.164478302001953e-06 bandwidth (GB/s) 146.357621001797
uniform, size, elements 524288 forward 1.1217594146728515e-05 bandwidth (GB/s) 186.9520302275877
uniform, size, elements 1048576 forward 1.923084259033203e-05 bandwidth (GB/s) 218.10297600317384
uniform, size, elements 2097152 forward 3.640890121459961e-05 bandwidth (GB/s) 230.39992200138826
uniform, size, elements 4194304 forward 6.778717041015625e-05 bandwidth (GB/s) 247.49839679819922
uniform, size, elements 8388608 forward 0.00012810707092285157 bandwidth (GB/s) 261.92490202361347
uniform, size, elements 16777216 forward 0.00025241613388061524 bandwidth (GB/s) 265.86598474620627
uniform, size, elements 33554432 forward 0.000497891902923584 bandwidth (GB/s) 269.5720239913193
```
```
uniform, size, elements 65536 forward 5.550384521484375e-06 bandwidth (GB/s) 47.22988091821306
uniform, size, elements 131072 forward 5.581378936767578e-06 bandwidth (GB/s) 93.93520954942333
uniform, size, elements 262144 forward 6.165504455566406e-06 bandwidth (GB/s) 170.071404141686
uniform, size, elements 524288 forward 6.3276290893554685e-06 bandwidth (GB/s) 331.4277702414469
uniform, size, elements 1048576 forward 8.509159088134765e-06 bandwidth (GB/s) 492.91639239047356
uniform, size, elements 2097152 forward 1.2989044189453124e-05 bandwidth (GB/s) 645.8218077979443
uniform, size, elements 4194304 forward 2.347707748413086e-05 bandwidth (GB/s) 714.6211452997259
uniform, size, elements 8388608 forward 4.4286251068115234e-05 bandwidth (GB/s) 757.6715389250498
uniform, size, elements 16777216 forward 8.672237396240235e-05 bandwidth (GB/s) 773.8356427961071
uniform, size, elements 33554432 forward 0.00016920566558837892 bandwidth (GB/s) 793.2224227438523
```
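As a rough, self-contained alternative to the gists above, the effective bandwidth of `uniform_` can be estimated like this (requires a CUDA device; the methodology may differ slightly from the original script, so the numbers won't match the tables exactly):
```python
import torch

def uniform_bandwidth_gbs(numel, iters=100):
    x = torch.empty(numel, device="cuda")
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        x.uniform_()
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0 / iters
    return numel * 4 / seconds / 1e9  # each float32 element is written once

for n in [65536, 1048576, 33554432]:
    print("elements", n, "bandwidth (GB/s)", uniform_bandwidth_gbs(n))
```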
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20292
Differential Revision: D15277761
Pulled By: ezyang
fbshipit-source-id: 8bfe31a01eeed77f0ed6e7ec4d2dda4c6472ecaa
Summary:
To fully support the incremental_state function, several additional utils available in fairseq are required. However, we currently lack a proper way to unit test it. Therefore, the incremental_state function will be disabled for now. If it is needed in the future, a feature request could be created. Fixes #20132
Add some unit tests to cover the arguments of the MultiheadAttention module, including bias, add_bias_kv, add_zero_attn, key_padding_mask, need_weights, attn_mask.
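For reference, a small sketch exercising those arguments with today's module API (all-False masks, so nothing is actually masked out):
```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2,
                            add_bias_kv=True, add_zero_attn=True)
q = k = v = torch.randn(5, 3, 8)                        # (seq_len, batch, embed_dim)
key_padding_mask = torch.zeros(3, 5, dtype=torch.bool)  # (batch, src_len)
attn_mask = torch.zeros(5, 5, dtype=torch.bool)         # (tgt_len, src_len)

out, weights = mha(q, k, v,
                   key_padding_mask=key_padding_mask,
                   need_weights=True,
                   attn_mask=attn_mask)
print(out.shape)  # torch.Size([5, 3, 8]), same as the query
```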
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20177
Differential Revision: D15304575
Pulled By: cpuhrsch
fbshipit-source-id: ebd8cc0f11a4da0c0998bf0c7e4e341585e5685a
Summary:
We don't need to overlay vc env when not using ninja. CMake will deal with it automatically. Overlaying is a no-op when the env is the same with the generator specified but will generate the error "Cannot find CMAKE_CXX_COMPILER" when they are different.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20417
Differential Revision: D15317081
Pulled By: ezyang
fbshipit-source-id: 5d9100321ecd593e810c31158f22c67d3e34973b
Summary:
This is an attempt to isolate unrelated changes from #19228 for easier review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20150
Differential Revision: D15314891
Pulled By: ezyang
fbshipit-source-id: 8c429747ba83ad5aca4cdd8f8086bcf65a326921
Summary:
* Constructs a new type at runtime so that `isinstance` checks work for
weak modules assigned to `ScriptModule`s
* Fix some extraneous names in `__constants__`
* Add `in_features` and `out_features` to `nn.Linear` `__constants__`
Fixes#19363
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20190
Pulled By: driazati
Differential Revision: D15302350
fbshipit-source-id: 1d4d21ed44ab9578a4bc2a72396a82e9bbcd387c
Summary:
TensorList, DoubleList, and BoolList were missing from the pickler, so
this adds them.
As a follow-up, a lot of the code for these could be templated and cut down.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20191
Pulled By: driazati
Differential Revision: D15299106
fbshipit-source-id: f10c0c9af9d60a6b7fb8d93cea9f550b1a7e2415
Summary:
Given that tensorboardX and our PyTorch 1.1 release had `log_dir` as the argument for SummaryWriter initialization and member variable (which some users access), we need to preserve this name. However, we might deprecate this in the future and I've added a `get_logdir` method that can be used in the future.
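A small sketch of the preserved argument and the new accessor (directory name is arbitrary):
```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/exp1")
print(writer.log_dir)       # "runs/exp1", kept for tensorboardX / PyTorch 1.1 compatibility
print(writer.get_logdir())  # accessor intended for future use
writer.close()
```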
cc natalialunova, lanpa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20382
Reviewed By: NarineK
Differential Revision: D15300941
Pulled By: orionr
fbshipit-source-id: a29a70fcbc614a32ebfa6c655962fdff081af1af
Summary:
This code is unused and has been superseded by TensorIterators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20207
Differential Revision: D15240832
Pulled By: cpuhrsch
fbshipit-source-id: 4f600bb8645f9b28a137e2cefb099978f5152d05
Summary:
This PR add Poisson NLL loss to aten and substitute the python implementation with a call to the c++.
Fixes#19186.
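A minimal usage sketch of the functional form (the ATen port should be transparent to callers):
```python
import torch
import torch.nn.functional as F

log_rate = torch.randn(4, 3, requires_grad=True)  # log of the predicted rate
target = torch.poisson(torch.rand(4, 3) * 5)

loss = F.poisson_nll_loss(log_rate, target, log_input=True)
loss.backward()
```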
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19316
Differential Revision: D15012957
Pulled By: ezyang
fbshipit-source-id: 0a3f56e8307969c2f9cc321b5357a496c3d1784e
Summary:
This PR is an intermediate step toward the ultimate goal of eliminating "caffe2" in favor of "torch". This PR moves all of the files that had constituted "libtorch.so" into the "libcaffe2.so" library, and wraps "libcaffe2.so" with a shell library named "libtorch.so". This means that, for now, `caffe2/CMakeLists.txt` becomes a lot bigger, and `torch/CMakeLists.txt` becomes smaller.
The torch Python bindings (`torch_python.so`) still remain in `torch/CMakeLists.txt`.
The follow-up to this PR will rename references to `caffe2` to `torch`, and flatten the shell into one library.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17783
Differential Revision: D15284178
Pulled By: kostmo
fbshipit-source-id: a08387d735ae20652527ced4e69fd75b8ff88b05
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19803
There is no reason to set a specific logging level for this module. Removing it to just use the default logging level.
Differential Revision: D15098834
fbshipit-source-id: 1654c04500c19690ddde03343f2e84b04bb0f1ef
Summary:
Fixed#20250
Not sure if there's any specific design reason to use `add_dependencies()` and manually add a few include dirs, instead of linking the target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20319
Differential Revision: D15294584
Pulled By: ezyang
fbshipit-source-id: 97f813a6b1829dad49958e0f880b33eb95747607
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20351
This was broken because of a merge race between #20282 and the stack in #20236.
Cleaned up the test and comments a bit as well.
Differential Revision: D15292786
fbshipit-source-id: a4379ea700cad959d3a6921fc5ddf9384fb8f228
Summary:
The trick here is that creating a mapping from const values to
const values means that downstream clients that want to mutate
the output of the mapping are stuck. However, a mapping from
const values to non-const values is just fine and doesn't put
constraints on downstream clients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20303
Differential Revision: D15284076
fbshipit-source-id: 16206fd910dd5f83218525ca301b1889df0586cb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20236
Use the new version of broadcast_coalesced that deals with both CPU
and CUDA models. Add tests that evaluate correctness of
DistributedDataParallel for CPU models.
Closes#17757.
Reviewed By: mrshenli
Differential Revision: D15245428
fbshipit-source-id: d2fa09f68593b3cd1b72efeb13f5af23ebd5c80a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20235
The tests expected to only run for CUDA models. In a future commit we
need to update this to work for CPU models as well. Therefore, we can
no longer rely on only integers being passed for device identifiers.
With this change we pass both the materialized list of devices to use
(as `torch.Device` objects), as well as an optional list of integers.
The latter is specified to exercise the code in the
DistributedDataParallel constructor that turns a list of integers into
CUDA devices, IFF it is used to wrap a single-device CUDA module.
This commit also groups together the 'str' and non-'str' tests. These
used to test passing the list of devices as integers or as
`torch.Device` instances. These are now executed from the same test.
Reviewed By: mrshenli
Differential Revision: D15245429
fbshipit-source-id: 5797ba9db33d2c26db8e7493c91bb52f694285ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20234
The differences with the existing function _dist_broadcast_coalesced
is that this one works for both CPU and CUDA tensors and that it has a
maximum number of in flight operations.
This should be the final change needed to have only a single version
of DistributedDataParallel that both supports CPU and CUDA models, or
even a mix of both.
See #17757 for more information.
Reviewed By: mrshenli
Differential Revision: D15228099
fbshipit-source-id: a2113ba6b09b68cb5328f49f4c1960031eb43c93
Summary:
The isConvFusion(...) is only for Conv op.
If non-Conv op, the crash takes place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20139
Differential Revision: D15280604
Pulled By: yinghai
fbshipit-source-id: eb45be11990b3bf7c5b45f02ebb6018444ab5357
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20281
This is how it looks like now:
```
<FunctionEventAvg key=mm self_cpu_time=11.404s cpu_time=2.895ms
cuda_time=0.000us input_shapes=[[26, 4096], [4096, 1024]]>
```
Previously I forgot to update the repr for these when I updated it for non-averaged events.
Differential Revision: D15262862
fbshipit-source-id: a9e5b32c347b31118f98b4b5bf2bf46c1cc6d0d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20185
This test case now also tests that the argument type works correctly in kernels that
- don't return outputs
- return multiple outputs
Reviewed By: li-roy
Differential Revision: D15227621
fbshipit-source-id: 83db7536e9065e0f8c5032d6b96e970bbaf718b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20184
- Add support for Dict<Key, Value> arguments and returns to c10 operators
- Add support for std::unordered_map<Key, Value> to the legacy API (but not to c10 kernels)
Reviewed By: li-roy
Differential Revision: D15227620
fbshipit-source-id: c1ea6c12165e07b74272cb48c6021bdb5c2d7922
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19976
Implement a Dict type that allows us to abstract away from the concrete implementation used.
The API is similar to std::unordered_map, but behind the scenes we can switch to any map implementation we like. ska::flat_hash_map, google dense map, or any future map implementation with better performance.
Switching such an implementation choice does not have to break backwards compatibility of kernel code using the Dict type.
Reviewed By: li-roy
Differential Revision: D15156384
fbshipit-source-id: b9313ec4dd9acb3b6a0035345b6ba4f2a437d1e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20282
Add a unit test to ensure no gradient sync happens when calling ddp.module(input), i.e. not invoking prepare_for_backward.
PyText depends on DDP for data parallel distributed training. To support accumulating gradients locally before gradient sync, we are calling orig_model.forward instead of ddp_model.forward. Add a unit test so that future changes don't break this assumption.
Reviewed By: pietern, mrshenli
Differential Revision: D15263155
fbshipit-source-id: 7734e174f507690fb23ea6c52dffff4a93f9b151
Summary:
fixes#20215
The confusing behavior was caused by typos in type annotation :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20306
Differential Revision: D15276216
Pulled By: ailzhang
fbshipit-source-id: 1b0c9635a72a05c9b537f80d85b117b5077fbec7
Summary:
This addresses #18436
The logic replicates the essence of closing file descriptors in numpy:
bf20e30340/numpy/core/include/numpy/npy_3kcompat.h (L278)
This stores the position of the file descriptor before resetting it to the Python handle offset, then resets to the original position before exit. The Python-side handle is then updated to reflect the new position. Also added somewhat more demanding tests to cover this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20270
Differential Revision: D15275902
Pulled By: soumith
fbshipit-source-id: 5ca8a52b61c7718d2e69571f72f80b1350b0acdb
Summary:
See Issue #20301
Specifying dim in docstring example to prevent UserWarning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20310
Differential Revision: D15277734
Pulled By: ezyang
fbshipit-source-id: 2e8b748dbe743675a5a538ccbe97713aad02e8ac
Summary:
Schema matching for sort is a little tricky because you have to check whether the class defines the __lt__ method correctly, and you have to mark whatever effects happen in __lt__ to the sorting op as well.
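A rough sketch of the kind of program this enables (assuming scripted classes with `__lt__` can be sorted via `list.sort()`, which is what this PR is about; class and attribute names are made up):
```python
from typing import List

import torch

@torch.jit.script
class Pair(object):
    def __init__(self, key: int, name: str):
        self.key = key
        self.name = name

    def __lt__(self, other: "Pair") -> bool:
        return self.key < other.key

@torch.jit.script
def sorted_names(pairs: List[Pair]) -> List[str]:
    pairs.sort()  # uses Pair.__lt__ for comparisons
    return [p.name for p in pairs]

print(sorted_names([Pair(3, "c"), Pair(1, "a"), Pair(2, "b")]))  # ['a', 'b', 'c']
```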
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19706
Differential Revision: D15244366
Pulled By: eellison
fbshipit-source-id: 73b3e36462c6cc40f9d8cb235b44499a67d3149e
Summary:
This PR restricts the BatchType template argument of ChunkDataReader to STL
vectors only. Internally, ChunkDataReader was assuming BatchType was a
vector, but the user could pass any type to the template argument,
leading to compilation issues when building C++ extensions.
In addition to the proposed API change, this PR adds missing include headers to chunk.h. Currently the implementation works, but if users try to create C++ extensions that implement new ChunkDataReaders to be used along with the existing ChunkDataset, the build will fail due to the missing headers.
In terms of functionality, nothing has changed. This PR simply makes the
implementation slightly more robust for future extensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19485
Differential Revision: D15261725
Pulled By: soumith
fbshipit-source-id: 38c9465d665392ae6a2d12c5a520a4f501e1a6ca
Summary:
C++ `Scatter` and `Gather` always set autograd history for input data tensors regardless whether they require grad. This hits assertion failure in `set_history(Tensor, shared_ptr<Function> grad_fn)`
where `grad_fn` cannot be nullptr. After this PR, C++ `Scatter` and `Gather` only record `grad_fn` when required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20286
Differential Revision: D15266610
Pulled By: mrshenli
fbshipit-source-id: 641df0ea36e7c922b5820c8dc3f83e2a050412b5
Summary:
Previously we only had a Python wrapper for `torch.quantized_lstm_cell`. We had the op `torch.quantized_lstm`, but it didn't have a wrapper. This makes the wrapper
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20249
Reviewed By: driazati
Differential Revision: D15250023
Pulled By: jamesr66a
fbshipit-source-id: f05ad784d903e0ef3a62633c8bf80bad79de48ae
Summary:
This is a step towards enabling the ONNX constant folding pass by default in the PT->ONNX export. In this change we have enabled test points in `test/onnx/test_pytorch_onnx_caffe2.py` to run with constant folding pass enabled.
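For reference, the exporter flag that controls this pass in the public API looks roughly like the following (model and file name are placeholders):
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 2), nn.ReLU())
dummy = torch.randn(1, 4)

# run the ONNX constant-folding pass during export
torch.onnx.export(model, dummy, "model.onnx", do_constant_folding=True)
```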
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20290
Reviewed By: zrphercule
Differential Revision: D15271674
Pulled By: houseroad
fbshipit-source-id: 9e59ab46ae74b4ad8dea1a2200ecc1f3eb8aad75
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19660
Implementation of aggregated Scale operator.
The operator takes a list of tensors as an input and scales all of them them with the argument float value.
The tensor sizes can be different, therefore bookkeeping of the sizes and pointers to the tensors are
necessary for the GPU version of the kernel.
Reviewed By: BIT-silence
Differential Revision: D14984233
fbshipit-source-id: 37cc97159a4f2c38cd6fff4f5710ab7d3a773611
Summary:
Don't make an alias value for a value that is known to be None. This was preventing constant propagation from running the `out is None` check in nn.functional.normalize, and thus preventing the if statement from being inlined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20112
Differential Revision: D15267328
Pulled By: eellison
fbshipit-source-id: 5b878b0dc50944c2e7a2f583ea483dad9d6bbec3
Summary:
This PR makes the JIT's save path call out to the pickler, which saves tensors in the same format that `torch.save()` does. The file looks like `| pickle archive 1 (includes sizes, strides, requires_grad, etc...) | pickle archive 2 (list of tensor keys) | tensor binary data |` and can be read back in with `torch.load(my_file, pickle_module=torch.jit._pickle)`
Fixes#18003
Unpickling in the JIT for things such as model parallelism will be a follow up PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18154
Pulled By: driazati
Differential Revision: D15015160
fbshipit-source-id: ef76a44b8c243f4794cd7e245ec8305e965bc59f
Summary:
Also, the current mkldnn primitive code recreates the computation every time, which causes tiny convolutions to spend a significant portion of their time on the repeated codegen. ideep has implemented an LRU cache to save the computation, so this change will help improve performance for tiny convolutions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19963
Differential Revision: D15156527
Pulled By: bddppq
fbshipit-source-id: 6a8fbd10a213ec22cdeaff1a2bdb0d09905d1fcd
Summary:
Canonicalize the ordering of outputs of if and loop nodes based on their first usage. Previously we were able to canonicalize output order by sorting on variable name, but this breaks down with outputs added in an early return pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20015
Differential Revision: D15266066
Pulled By: eellison
fbshipit-source-id: ba5340c068a68b1ffc73f056db194b92d3274dc4
Summary:
cc nairbv
All failures I have seen are of this combination, so let's just disable it for all cases. After #20063 I found it failing for py3 once.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20172
Differential Revision: D15266527
Pulled By: nairbv
fbshipit-source-id: afb9389dfc54a0878d52975ffa37a0fd2aa3a735
Summary:
As a part of supporting writing data into TensorBoard readable format, we show more example on how to use the function in addition to the API docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20008
Reviewed By: natalialunova
Differential Revision: D15261502
Pulled By: orionr
fbshipit-source-id: 16611695a27e74bfcdf311e7cad40196e0947038
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19541
For the quantized FC operator, replace the tuple (Tensor, scale, zero_point) with QTensor.
Differential Revision: D14900407
fbshipit-source-id: 164df38f3564e0a68af21b9fedaba98a44ca1453
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19497
Implements a basic quantized FC (uint8 * int8 -> uint8) with FBGEMM APIs.
Related document:
https://fb.quip.com/rP5tAx56ApMM
https://fb.quip.com/MU7aAbzGDesu
Work Item List:
1. [DONE] currently we use prepack routines inside Quantized FC operator. Will separate it as a standalone operator soon.
2. [DONE] rebase to D14817809 and D14994781 (cpp custom types).
3. [DONE] correctness unit test.
4. [To Do] rebase to QTensor. Similar to D14565413, this will be implemented in the next Diff.
Differential Revision: D14761865
fbshipit-source-id: 031a39915fecd947afb4dd2719112b4ddc1082d3
Summary:
Fixes https://github.com/pyro-ppl/pyro/issues/1853
This fixes a memory leak in `torch._dirichlet_grad()`. This function is used for reparametrized gradients for the `Dirichlet` and `Beta` distributions.
- [x] Could a reviewer please confirm that `freeCopyTo()` is being used correctly and doesn't need an additional `decref()`? The author is unfamiliar with PyTorch C++ memory utilities. Help appreciated.
- ran locally and confirmed leak is fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20244
Differential Revision: D15259008
Pulled By: ezyang
fbshipit-source-id: 222ec7d80ddd97bcdd7d54549f3e756575e8402e
Summary:
This just updates the `JIT` comments with the issue number #20215. Hopefully this will stop the proliferation of the workaround. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20222
Differential Revision: D15259007
Pulled By: ezyang
fbshipit-source-id: 5060a351aa618c6dae49d0b7a6ac9b0f57f2490a
Summary:
The earlier fix to extract scripts missed an attach_workspace which was used to make the built binaries available to the nightly build upload jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20265
Differential Revision: D15259080
Pulled By: soumith
fbshipit-source-id: bf835c2cd76976b4563798ee348f7db83c7a79c1
Summary:
The current pytorch config.yml is causing some backend performance
problems on CircleCI, due to the size of the file when all of the YAML
anchors have been expanded. You can view the "processed" config as our
internal system deal with it by running `circleci config process`.
circleci config process .circleci/config.yml | wc -c
Before: 2833769 bytes
After: 558252 bytes (~80% less)
Add a new job, `setup`, that has two functions:
- Assert that config.yml is up to date
- Put the .circleci/scripts directory into a workspace, so that
downstream jobs can easily access it.
The `setup` job becomes the parent of all jobs in the workflow. This
allows us to fail fast if config is invalid. It might be a good place to
add other, quick, lint checks to help fail the build faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19674
Differential Revision: D15252864
Pulled By: pjh5
fbshipit-source-id: 0778c7b8f95e7f3f33ac92fbb8862377fc9fb0ac
Summary:
This ensures that custom operators registered through c10::RegisterOperators are recorded in autograd profile traces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20175
Differential Revision: D15221311
Pulled By: jamesr66a
fbshipit-source-id: 9452b24272c2399c20a49af85b62d34cabe6e27a
Summary:
Do tests with common models from torchvision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20007
Differential Revision: D15251754
Pulled By: orionr
fbshipit-source-id: 9dc09bd407b3ccaaa310d2f4a8d53d5a7d12469d
Summary:
We now can build libtorch for Android.
This patch aims to provide two improvements to the build
- Make the architecture overridable by providing an environment variable `ANDROID_ABI`.
- Use `--target install` when calling cmake to actually get the header files nicely in one place.
I ran the script without options to see if the caffe2 builds are affected (in particular by the install), but they seem to run OK and probably only produce a few files in build_android/install.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20152
Differential Revision: D15249020
Pulled By: pjh5
fbshipit-source-id: bc89f1dcadce36f63dc93f9249cba90a7fc9e93d
Summary:
Support operator overloading for user-defined types, which includes desugaring `a + b` and Python builtin functions such as `len(x)` into calls to the corresponding magic method if it is defined.
See https://rszalski.github.io/magicmethods/ for list of magic methods.
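A small sketch of what this allows in TorchScript (class and method names are made up for illustration):
```python
from typing import List

import torch

@torch.jit.script
class IntBag(object):
    def __init__(self, values: List[int]):
        self.values = values

    def __len__(self) -> int:
        return len(self.values)

    def __add__(self, other: "IntBag") -> List[int]:
        return self.values + other.values

@torch.jit.script
def demo(a: IntBag, b: IntBag) -> int:
    merged = a + b               # desugars to a.__add__(b)
    return len(a) + len(merged)  # len(a) calls IntBag.__len__

print(demo(IntBag([1, 2]), IntBag([3])))  # 2 + 3 = 5
```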
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20033
Reviewed By: driazati
Differential Revision: D15246573
Pulled By: eellison
fbshipit-source-id: 03d45dd524ea2a3b40db36843d6067bede27b30d
Summary:
similar to too few blank lines, I feel like this is not important enough to warrant breaking signal for all linters when it's violated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20225
Differential Revision: D15243480
Pulled By: suo
fbshipit-source-id: 37cdc18daf09e07081e42b69c72d331d81660217
Summary:
This is useful when you would like to understand performance bottlenecks of your model. One can use the shape analysis in order to fit the model to a roofline model of their hardware.
Please note that this feature can potentially skew profiling results. Also, timing for non-nested events will become wrong; one should only use timing for the bottom-most events when shape analysis is used. For the case where people don't need shapes, profiling should not be affected, as in that case we don't collect shapes, which is the default behavior and this diff doesn't change it.
One of the next steps could be, for example, choosing the best candidates for quantization. In the scope of this diff I am just adding optional shape collection into the Event class. After that, there is minor functionality on the Python side for grouping by shapes.
In the output tables shapes are truncated, but for grouping the full shape string is used as a key.
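For context, a short sketch of how the shape collection and shape grouping would be used from Python (assuming the flags are exposed as `record_shapes` and `group_by_input_shape`, as in current PyTorch):
```python
import torch

x = torch.randn(128, 20)
w = torch.randn(20, 30)

with torch.autograd.profiler.profile(record_shapes=True) as prof:
    torch.mm(x, w)

print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by="self_cpu_time_total"))
```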
Here is an example output:
test_profiler_shapes (test_autograd.TestAutograd) ...
```
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
unsigned short 2.30% 305.031us 2.30% 305.031us 305.031us NaN 0.000us 0.000us 1 [[30, 20]]
addmm 69.40% 9.199ms 69.40% 9.199ms 9.199ms NaN 0.000us 0.000us 1 [[30], [128, 20], [20, 30], [], []]
unsigned short 0.98% 129.326us 0.98% 129.326us 129.326us NaN 0.000us 0.000us 1 [[40, 30]]
addmm 27.32% 3.621ms 27.32% 3.621ms 3.621ms NaN 0.000us 0.000us 1 [[40], [128, 30], [30, 40], [], []]
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 13.255ms
CUDA time total: 0.000us
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
unsigned short 2.30% 305.031us 2.30% 305.031us 305.031us NaN 0.000us 0.000us 1 [[30, 20]]
addmm 69.40% 9.199ms 69.40% 9.199ms 9.199ms NaN 0.000us 0.000us 1 [[30], [128, 20], [20, 30], [], []]
unsigned short 0.98% 129.326us 0.98% 129.326us 129.326us NaN 0.000us 0.000us 1 [[40, 30]]
addmm 27.32% 3.621ms 27.32% 3.621ms 3.621ms NaN 0.000us 0.000us 1 [[40], [128, 30], [30, 40], [], []]
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 13.255ms
CUDA time total: 0.000us
```
Also added this for older aggregation test:
```
test_profiler_aggregation_lstm (test_autograd.TestAutograd) ...
======================================================================================================================================================================================================
TEST
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
lstm 0.69% 4.606ms 5.30% 35.507ms 35.507ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.67% 4.521ms 5.27% 35.340ms 35.340ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.66% 4.399ms 5.02% 33.638ms 33.638ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.65% 4.354ms 4.92% 32.958ms 32.958ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.65% 4.351ms 4.96% 33.241ms 33.241ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.65% 4.323ms 5.10% 34.163ms 34.163ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.64% 4.304ms 4.92% 32.938ms 32.938ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.64% 4.300ms 5.10% 34.172ms 34.172ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.64% 4.292ms 5.05% 33.828ms 33.828ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.64% 4.263ms 4.98% 33.357ms 33.357ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
sigmoid 15.32% 102.647ms 15.32% 102.647ms 171.078us NaN 0.000us 0.000us 600 [[3, 20]]
mul 15.20% 101.854ms 15.20% 101.854ms 169.757us NaN 0.000us 0.000us 600 [[3, 20], [3, 20]]
lstm 12.74% 85.355ms 100.00% 670.120ms 33.506ms NaN 0.000us 0.000us 20 [[5, 3, 10]]
addmm 11.16% 74.808ms 11.16% 74.808ms 249.361us NaN 0.000us 0.000us 300 [[80], [3, 20], [20, 80], [], []]
tanh 9.89% 66.247ms 9.89% 66.247ms 165.617us NaN 0.000us 0.000us 400 [[3, 20]]
split 6.42% 43.019ms 6.42% 43.019ms 215.095us NaN 0.000us 0.000us 200 [[3, 80]]
add 5.67% 38.020ms 5.67% 38.020ms 190.101us NaN 0.000us 0.000us 200 [[3, 80], [3, 80], []]
add 4.81% 32.225ms 4.81% 32.225ms 161.124us NaN 0.000us 0.000us 200 [[3, 20], [3, 20], []]
addmm 3.79% 25.380ms 3.79% 25.380ms 253.796us NaN 0.000us 0.000us 100 [[80], [3, 10], [10, 80], [], []]
unsigned short 3.72% 24.925ms 3.72% 24.925ms 83.083us NaN 0.000us 0.000us 300 [[80, 20]]
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us
Total time based on python measurements: 691.366ms
CPU time measurement python side overhead: 3.17%
ok
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20035
Differential Revision: D15174987
Pulled By: salexspb
fbshipit-source-id: 9600c5d1d1a4c2cba08b320fed9da155d8284ab9
Summary:
In line 508, convert_sync_batchnorm is called recursively to convert the BN layers to SyncBatchNorm, so the process_group should also be passed in the recursive call.
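A minimal sketch of the call that should now forward the group into nested modules as well (process_group left as the default here):
```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3),
    nn.BatchNorm2d(8),
    nn.Sequential(nn.Conv2d(8, 8, 3), nn.BatchNorm2d(8)),  # nested BN, converted recursively
)

# every BatchNorm layer, including the nested one, should end up on the same group
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model, process_group=None)
print(sync_model)
```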
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19240
Differential Revision: D15240318
Pulled By: ezyang
fbshipit-source-id: 0fc9e856392824814991e5e9e8f9513d57f311af
Summary: This can be used for problems where the action vector must sum to 1
Reviewed By: kittipatv
Differential Revision: D15206348
fbshipit-source-id: 665fbed893d8c52d451a12d3bb2e73b2638b7963
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20186
We're missing two USE_CUDA macro for GPU-related code in THD's DataChannelGloo.
Also adding GlooCache back to compilation.
Differential Revision: D15227502
fbshipit-source-id: f260e1cb294d662ba0c170931913b64287d62344
Summary:
Currently, the constant folding pass during ONNX conversion removes all onnx::Constant nodes that are parents of nodes that are folded. In situations where the parent onnx::Constant node has other subscribers downstream, this could be a problem. This change updates the removal logic to remove only those onnx::Constant nodes that do not have other subscribers downstream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20109
Reviewed By: zrphercule
Differential Revision: D15220392
Pulled By: houseroad
fbshipit-source-id: 150788654ea1c84262becaffd6de152114bf76c0
Summary:
Add logging import and a failed MLP model that confirms that we don't fail `add_graph` when graph optimization fails.
This addresses part of https://github.com/pytorch/pytorch/issues/18903
cc lanpa ezyang natalialunova
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20115
Reviewed By: natalialunova
Differential Revision: D15206765
Pulled By: orionr
fbshipit-source-id: c40b7e2671ef845a1529a2910ba030159f53f393
Summary:
This patch specializes `Optional[Tensor]` graph inputs to either a `DimensionedTensorType` (if a Tensor is passed) or `NoneType`. Other `Optional[T]` are specialized to `T` or `None`.
- For unwrapping (checked and unchecked) we need to keep the output type, as IR code that follows unwrapping may not work with NoneType (just as it doesn't deal with Optional). While it would not be hit during execution, it will run against the (legitimate) assumptions of the analysis passes.
- Function lookup currently will not match NoneType when it expects Optional (I'm not entirely sure why this doesn't lead to unhappiness currently, but hey), so I amend this at the level of the function matching code (`operator.cpp`), but see Adam's comments. We would run into trouble if we needed to select between functions whose signatures only differ in Optional types with different subtypes, but we would have the same problem when calling them directly, so I would think this is OK.
- It would enable throwing away branches we can't hit. This also reduces the "blockiness" of the graph, so it may be easier to apply optimizations (e.g. fuse things in `if t is None: ...` and outside the `if`); a minimal example of such a function is sketched after this list.
- Arguments passed into `Optional[Tensor]` arguments will get shape information, which is very handy.
- It gets rid of the problem that tensors passed into Optional arguments get requires_grad set erroneously #18270 (though that also affects lists, which aren't fixed here).
- `Optional[List[int]]` is needed for #18697.
- We're changing typing in a more subtle way than the `TensorType`->`DimensionedTensorType`.
- In particular, specializing to NoneType loses the Type information captured in the `OptionalType` element type.
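As a minimal illustration (not taken from the PR itself), a typical scripted function with an `Optional[Tensor]` argument whose `is None` branch this kind of specialization could prune:
```python
import torch
from typing import Optional

@torch.jit.script
def apply_mask(x: torch.Tensor, mask: Optional[torch.Tensor]) -> torch.Tensor:
    # If the input is specialized to NoneType, only the first branch survives;
    # if a Tensor is passed, the None check can be dropped and shapes propagate.
    if mask is None:
        return x
    return x * mask

print(apply_mask(torch.ones(3, 20), None))
print(apply_mask(torch.ones(3, 20), torch.zeros(3, 20)))
```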
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18407
Reviewed By: zdevito
Differential Revision: D15216808
Pulled By: eellison
fbshipit-source-id: 01f1a7643deaf4962c3f55eff2070d54b0e54b69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20062
Previously, the batch counter is incremented even if none of the readers has data. In this diff,
1) Limiter is applied to the last reader so that the batch counter is not incremented unless the first N-1 readers have data
2) The stop blob of the last reader is used as the stop blob of the task, so that it's checked before the counter is incremented
Reviewed By: xianjiec
Differential Revision: D15099761
fbshipit-source-id: 47ed6c728118fe453cf57ac3457085867939485b
Summary:
As a work around for dynamic shape case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20093
Reviewed By: zrphercule
Differential Revision: D15220661
Pulled By: houseroad
fbshipit-source-id: de271fce542be380bd49a3c74032c61f9aed3b67
Summary:
Fix for https://github.com/pytorch/pytorch/issues/16962
This needs fixing because we turn lists into tuples when constantifying a module, so indexing into a Tuple of one type with a non-constant integer is quite common.
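A minimal sketch of the pattern this enables (assuming current TorchScript behavior for uniformly typed tuples):
```python
import torch
from typing import Tuple

@torch.jit.script
def pick(xs: Tuple[int, int, int], i: int) -> int:
    # all elements share one type, so a non-constant index is allowed
    return xs[i]

print(pick((10, 20, 30), 1))  # 20
```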
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20081
Differential Revision: D15205893
Pulled By: eellison
fbshipit-source-id: 61d74ee071ad0aad98e46fe807d6f6cc5f6abd2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19513
Add support for printing a QTensor in python frontend
Differential Revision: D15017168
fbshipit-source-id: 312d1f18e6ca3c9eb4a5b8bb1c64f7cc8bc1dcf5
Summary:
Eigen was updated with the commit needed to get rid of this warning that plagued the CI. This PR bumps third_party/eigen to that commit head.
```
warning: #warning "host_defines.h is an internal header file and must not be used directly. This file will be removed in a future CUDA release. Please use cuda_runtime_api.h or cuda_runtime.h instead." [-Wcpp]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19917
Differential Revision: D15218183
Pulled By: ezyang
fbshipit-source-id: 653c7d61ea401a7d4469c2009612dc43cc70122d
Summary:
pytorch failed to build with the following error, complaining about the first regex match
It may be caused by a bug in python 2.7.5.
The proposed change is a workaround for building pytorch with python 2.7.5.
Since the '*' notation is greedy in python regex, the new expression produces a result identical to the old one.
```
Traceback (most recent call last):
File "/data2/nihuini/pytorch/cmake/../aten/src/ATen/gen.py", line 14, in <module>
import preprocess_declarations
File "/data2/nihuini/pytorch/aten/src/ATen/preprocess_declarations.py", line 3, in <module>
from function_wrapper import TYPE_FORMAL_GENERIC
File "/data2/nihuini/pytorch/aten/src/ATen/function_wrapper.py", line 5, in <module>
from code_template import CodeTemplate
File "/data2/nihuini/pytorch/aten/src/ATen/code_template.py", line 13, in <module>
class CodeTemplate(object):
File "/data2/nihuini/pytorch/aten/src/ATen/code_template.py", line 23, in CodeTemplate
subtitution = re.compile(substitution_str, re.MULTILINE)
File "/usr/lib64/python2.7/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib64/python2.7/re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
--
CMake Error at cmake/Codegen.cmake:162 (message):
Failed to get generated_cpp list
Call Stack (most recent call first):
caffe2/CMakeLists.txt:2 (include)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20137
Differential Revision: D15218122
Pulled By: ezyang
fbshipit-source-id: 10b618ff92a04e9074f5d83e31411fc2341e0cf8
Summary:
Some functions were not decorated with `CAFFE2_API`, which makes them unusable when creating unit tests for custom ops outside the Caffe2 repo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20114
Differential Revision: D15217490
Pulled By: ezyang
fbshipit-source-id: dda3910ad24e566567607deaac705a34ec8e7b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20044
We do not have a gating functor. This diff adds it. I'm leveraging the existing learning rate op because there are other policies I'll need to use together as a union.
* Since there are other policies in LearningRateOp which will be used as a union, I chose to add it as a LearningRateOp.
* constantwarmup cannot produce a step function that is nonzero first and zero later
* There are multiple uses for it,
* e.g. as a gating blob generator that is useful for turning off.
* e.g. as a learning rate switcher at a certain iteration.
* For generalizability, no restriction or constraint is applied to the range of the values
* see figure below for illustration
{F157366621}
Reviewed By: ccheng16
Differential Revision: D15178229
fbshipit-source-id: 1e66e9a4bc1bfb946a57f8aefc97d8170f6be731
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19777
When used from iPython/Jupyter/Bento, the cell doing kernel registration might be executed multiple times.
Also, there might be a kernel library that wants to overwrite one of the existing kernels.
Let's allow this.
Reviewed By: dzhulgakov
Differential Revision: D15090318
fbshipit-source-id: 09f842e8fd36646053c5c2f11325de4d31105b0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19776
The diff stacked on top of this enables overwriting of kernels, but before that, we need to do some refactorings.
This diff:
- hides deregistration logic behind a RAII class so we can keep more information about what exactly to deregister without the API user knowing about it.
- Split KernelRegistration from SchemaRegistration by taking Dispatcher::OperatorDef and moving it to a different file. This is more readable, especially since kernel registration will become more complex in the next diff.
- Move LeftRight synchronization out of DispatchTable to Operator because there will be a mutex added to Operator in the next diff and related synchronization primitives shouldn't live on different abstraction levels.
Reviewed By: dzhulgakov
Differential Revision: D15090322
fbshipit-source-id: 2e51a192075163f0d496956d9e54b9aaf26b2369
Summary:
This PR adds a new trace API `trace_module` that will allow us to trace multiple methods as a part of a single `ScriptModule`
See the example below.
```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(1, 1, 3)

    def forward(self, x):
        return self.conv(x)

    def weighted_kernel_sum(self, weight):
        return weight * self.conv.weight

example_weight = torch.rand(1, 1, 3, 3)
example_forward_input = torch.rand(1, 1, 3, 3)
n = Net()
inputs = {'forward': example_forward_input, 'weighted_kernel_sum': example_weight}
module = torch.jit.trace_module(n, inputs)
```
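After tracing, both methods should be callable on the returned ScriptModule, e.g. `module(example_forward_input)` and `module.weighted_kernel_sum(example_weight)` (assuming the snippet above), with each call dispatching to its own traced graph.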
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19905
Differential Revision: D15200007
Pulled By: Krovatkin
fbshipit-source-id: 0354d973fe40cb6e58b395bd866df14e0fc29d5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19784
For backwards compatibility, we allow vector arguments if a kernel is registered with the deprecated API.
Reviewed By: dzhulgakov
Differential Revision: D15091972
fbshipit-source-id: 4db3e3a262e605504b05c42d40046011408501d2
Summary:
Class attributes should preferably be explicitly initialized within
the __init__() call. Otherwise, overriding step() is
prone to bugs.
This patch partially reverts #7889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20059
Differential Revision: D15195747
Pulled By: soumith
fbshipit-source-id: 3d1a51d8c725d6f14e3e91ee94c7bc7a7d6c1713
Summary:
This takes care of some outstanding review comments for https://github.com/pytorch/pytorch/pull/16196/
Specifically:
1. Add comment about kind
2. Add comment about GraphPy
3. Remove ONNX version comment
4. Remove scalar_dict from SummaryWriter and all history functions
cc lanpa ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20038
Reviewed By: natalialunova
Differential Revision: D15177257
Pulled By: orionr
fbshipit-source-id: 218aa799d8b7dbb58f422a331236bba4959347de
Summary:
Sometimes people need to check out an older version and build PyTorch. In that case, they need to do `git submodule sync` and maybe `git submodule update --init` as mentioned [here](https://github.com/pytorch/pytorch/issues/20074).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20088
Differential Revision: D15195729
Pulled By: soumith
fbshipit-source-id: 73232b801e5524cdba462dd504fb973d95d0498c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20027
Before this change, any pointer was converted to bool, i.e.
IValue("string") == IValue(true)
After this change, it does the right thing and creates a string.
Reviewed By: dzhulgakov
Differential Revision: D15172409
fbshipit-source-id: 8167dd780005f9bceef4fe3c751f752e42ceeb20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19910
This change modifies the quant-dequant node pattern from
qparam->q->dq to qparam->q->int_repr->qparam->dq. The motivation for
this change is to make the qparams required for op substitution available
one level up, at the dequant node, instead of multiple levels up.
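As an eager-mode illustration of the quant / int_repr / dequant relationship (not the JIT pass itself; the current `torch.quantize_per_tensor` name is assumed here):
```python
import torch

x = torch.randn(4)
scale, zero_point = 0.1, 10

q = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)  # quant
ints = q.int_repr()    # raw uint8 values, analogous to the int_repr node
back = q.dequantize()  # dequant using the same qparams (scale, zero_point)
print(ints.dtype, back.dtype)  # torch.uint8 torch.float32
```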
Differential Revision: D15120146
fbshipit-source-id: 74b0fd5cb50a338f562740a9cc727a7791c718c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19680
This was broken for quite some time because of an operator schema
check that went into effect at some point in time.
Reviewed By: manojkris
Differential Revision: D15055082
fbshipit-source-id: 7f730f9b810bdaffd69bab7ac4d02c5b2e40645b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19402
This pass propagate the qparams calculated after calibration to the
quant nodes which will be used later for quantization
Differential Revision: D14995230
fbshipit-source-id: 5709153ea1c039c4ab4470ddb689a303b0bcc6fd
Summary:
From the comment: "don't use TYPE again in case it is an expensive or side-effect op"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19968
Differential Revision: D15151567
Pulled By: gchanan
fbshipit-source-id: 4d42c081ac1472b71f1cea5172cb42a7c83a7043
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19232
Add observer nodes to collect stats for input data nodes excluding params
which are constant at inference and need not be observed. This information
is required to compute quantization params.
Differential Revision: D14885485
fbshipit-source-id: 8762cc2a4e510e1553b3dbd1d1aecd55b4bdb89f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19783
Previously, the IValues were copied into the kernel arguments, which caused a refcount bump if Tensor was taken by value.
Now, a kernel can take Tensor by value without any refcount bump because it is moved in.
Reviewed By: dzhulgakov
Differential Revision: D15091973
fbshipit-source-id: 4c5ff2e3ee86f5934cc84191697f7dbc9c3ee345
Summary:
The second commit will be removed before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19965
Differential Revision: D15153243
Pulled By: pjh5
fbshipit-source-id: 70eae38d0cb07dc732c0cf044d36ec36d0a4472d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19945
this logging was removed in D6888977, but I think it could be useful for debugging upstream data issues to check the violating index
Reviewed By: xianjiec
Differential Revision: D15127887
fbshipit-source-id: 4ad7eceefcd063bf45bc190a4c0d458a089c918a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19681
For the accelerator, we need to lower just the quantized weight data without layout transformation. This diff attempts to provide this option.
Reviewed By: jerryzh168, zrphercule
Differential Revision: D15066568
fbshipit-source-id: 133d749e087c2ad4a899bee5e96f597f70b2443c
Summary:
log_normal_ and geometric_ were disabled for CPU by mistake in [this PR](bc53805f2e); this PR fixes that.
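For a quick check that the re-enabled CPU paths work, something along these lines should run (values are random, of course):
```python
import torch

x = torch.empty(5)
x.log_normal_(mean=0.0, std=1.0)  # in-place sampling from a log-normal distribution
y = torch.empty(5)
y.geometric_(0.5)                 # in-place sampling from a geometric distribution
print(x)
print(y)
```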
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19938
Differential Revision: D15143404
Pulled By: izdeby
fbshipit-source-id: 41c7bd29f046b5a3ac6d601de8c64ab553771d19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19811
This does not immediately take effect for the custom op API but will break backwards compatibility once we switch to the new operator registration.
Reviewed By: dzhulgakov
Differential Revision: D15101924
fbshipit-source-id: 8890a5a3e163d3263dc1837be0b4851984771917
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19779
This macro wasn't set correctly because the target macros weren't included from Apple's header.
Reviewed By: dzhulgakov
Differential Revision: D15090427
fbshipit-source-id: 43ca44f0f409e11718b7f60c3fdcd2aa02d7018e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19773
const members can't be moved, so whenever somebody moved a function schema, it was copied instead.
This diff fixes this.
Reviewed By: dzhulgakov
Differential Revision: D15090323
fbshipit-source-id: 123a1d6b96ac46cb237966c0b072edebcdafe54c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19817
A lot of files were depending on the JIT's typesystem
because operator.h depends on function_schema.h. However,
this isn't fundamental to the design. This diff tries to
remove the direct dependency and only include the c10
wrapper helpers in the files where they are required.
Reviewed By: smessmer
Differential Revision: D15112247
fbshipit-source-id: 2c53d83e542c32d9a398c8b60dbf40ab7a1cb0f6
Summary:
This adds method details and corrects an example on the page that didn't run properly. I've now confirmed that it runs in colab with nightly.
For those with internal access the rendered result can be seen at https://home.fburl.com/~orionr/pytorch-docs/tensorboard.html
cc lanpa, soumith, ezyang, brianjo
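For context, the docs-page examples boil down to the basic SummaryWriter flow; a minimal sketch (assuming the tensorboard package is installed):
```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()  # writes event files under ./runs/ by default
for step in range(10):
    writer.add_scalar("loss", 1.0 / (step + 1), step)
writer.close()
```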
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19915
Differential Revision: D15137430
Pulled By: orionr
fbshipit-source-id: 833368fb90f9d75231b8243b43de594b475b2cb1
Summary:
Trying to get this in before 1.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19918
Reviewed By: driazati
Differential Revision: D15124430
Pulled By: eellison
fbshipit-source-id: 549cdcbaff91218657e94ce08c0f4e69b576d809
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#19638 [jit] Serialize attribute module as torch.jit._pickle**
* use `torch.jit._pickle` as the module for globals in the pickle program. Pickle will try to resolve these to the actual functions in `torch.jit._pickle.py` automatically (I believe this can also be overridden to point to whatever functions you want). This means that `pickle.load("my_model/attributes.pkl")` will work instead of having to use a custom `pickle.Unpickler`
* use `REDUCE` opcodes instead of `BUILD` to make use of the last bullet
* use a union in the unpickler to support globals better (+ any future metadata we might need that can't be stored in an `IValue`), this makes some of the code around `IntList`s clearer and lets us get rid of any lookbehind for opcodes
* pickle things as a tuple instead of a list (an immutable result is more semantically correct)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19638
Pulled By: driazati
Differential Revision: D15111203
fbshipit-source-id: 526c6c2b63a48eb1cba1c658045a7809730070dd
Summary:
Added deprecation warnings for the masked methods and enabled them for a bool tensor.
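A quick illustration of the bool-mask path these methods now take (a sketch; byte masks now emit the deprecation warning mentioned above):
```python
import torch

x = torch.arange(6).reshape(2, 3)
mask = torch.tensor([[True, False, True],
                     [False, True, False]])  # dtype=torch.bool

print(x.masked_select(mask))    # tensor([0, 2, 4])
print(x.masked_fill(mask, -1))  # masked positions replaced by -1
```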
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19140
Differential Revision: D14888021
Pulled By: izdeby
fbshipit-source-id: 0e42daf8f3732ca29f36d10485402bfc502716ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19688
Minor changes to hipify script to take extra folders.
Reviewed By: bddppq
Differential Revision: D15068427
fbshipit-source-id: e2e792c8227cbd0e15fd2564f87d740a62c477da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19901
The existing code used `expect_autograd_hooks_` as a proxy for the
situation where finalization of the previous iteration is needed. This
is not correct, however, since you may decide to completely ignore the
output of a DDP wrapped module. If this is the case, and no gradients
have been passed to the reducer, it is fine to keep going. This commit
adds a new variable `require_finalize_` that tracks whether the
finalization is really needed.
Reviewed By: mrshenli
Differential Revision: D15118871
fbshipit-source-id: 25938eaf1fe13e2940feae1312892b9d3da8a67d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19897
During validation, gradient reduction is not needed, and autograd is
never called. The model output will always be a detached tensor. After
the new reducer was merged, this meant that it would find all model
parameters unused, and kick off reduction for them. Once #19799 lands,
this breaks for an output where no parameters are used, because it tries
to kick off reduction of zeroed gradients. The fix is to test for `torch.is_grad_enabled()` and
`self.training` before calling into the reducer.
Reviewed By: mrshenli
Differential Revision: D15118726
fbshipit-source-id: b0208f632a61cbe8110fa626fa427937b7f05924
Summary:
As DDP in previous releases does not support unused params, we are turning off `find_unused_parameters` by default to de-risk the new reducer.
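With this default, a model whose forward pass legitimately skips some registered parameters needs to opt back in explicitly; a minimal single-process sketch (the gloo file-store setup is only there to make the snippet self-contained):
```python
import os
import tempfile
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process "gloo" group so the example can run standalone
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group("gloo", init_method="file://" + init_file, rank=0, world_size=1)

model = torch.nn.Linear(10, 10)
ddp_model = DDP(model, find_unused_parameters=True)  # opt back in explicitly

out = ddp_model(torch.randn(4, 10))
out.sum().backward()
```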
CC pietern soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19895
Reviewed By: pietern
Differential Revision: D15118563
Pulled By: mrshenli
fbshipit-source-id: 6215c486e1dae3387b36011d8e64a2721ac85f58
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19821
It is possible that not a single parameter is used during an
iteration. If this is the case, the `prepare_for_backward` function
marks all parameters as unused, kicks off reduction of all buckets,
*and* finalizes the reduction.
This is different from the prior implementation where we assumed that
autograd would produce a gradient for at least a single parameter.
We then used the autograd callback mechanism to queue a finalizer
callback. Now, this finalizer may be executed in line.
Reviewed By: mrshenli
Differential Revision: D15113272
fbshipit-source-id: dc91458b569cd8c106ddaeea558464b515683550
2. Persists CircleCI scripts (everything in `.circleci`) into a workspace. Why?
We don't always do a Git checkout on all subjobs, but we usually
still want to be able to call scripts one way or another in a subjob.
Persisting files this way lets us have access to them without doing a
checkout. This workspace is conventionally mounted on `~/workspace`
(this is distinguished from `~/project`, which is the conventional
working directory that CircleCI will default to starting your jobs
in.)
3. Write out the commit message to `.circleci/COMMIT_MSG`. This is so
we can determine in subjobs if we should actually run the jobs or
not, even if there isn't a Git checkout.
CircleCI configuration generator
================================
@ -35,4 +55,422 @@ Future direction
See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747):
In contrast with a full recursive tree traversal of configuration dimensions,
> in the future future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this.
----------------
# How do the binaries / nightlies / releases work?
### What is a binary?
A binary or package (used interchangeably) is a pre-built collection of c++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source.
A **binary configuration** is a collection of
* release or nightly
* releases are stable, nightlies are beta and built every night
* python version
* linux: 2.7m, 2.7mu, 3.5m, 3.6m, 3.7m (mu is wide unicode or something like that. It usually doesn't matter but you should know that it exists)
* macos and windows: 2.7, 3.5, 3.6, 3.7
* cpu version
* cpu, cuda 9.0, cuda 10.0
* The supported cuda versions occasionally change
* operating system
* Linux - these are all built on CentOS. There haven't been any problems in the past building on CentOS and using on Ubuntu
* MacOS
* Windows - these are built on Azure pipelines
* devtoolset version (gcc compiler version)
* This only matters on Linux because only Linux uses gcc. tldr is that gcc made a backwards-incompatible change from gcc 4.8 to gcc 5, because it had to change how it implemented std::vector and std::string
### Where are the binaries?
The binaries are built in CircleCI. There are nightly binaries built every night at 9pm PST (midnight EST) and release binaries corresponding to Pytorch releases, usually every few months.
We have 3 types of binary packages
* pip packages - nightlies are stored on s3 (pip install -f <as3url>). releases are stored in a pip repo (pip install torch) (ask Soumith about this)
* conda packages - nightlies and releases are both stored in a conda repo. Nightly packages have a '_nightly' suffix
* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only
* shared with dependencies
* static with dependencies
* shared without dependencies
* static without dependencies
All binaries are built in CircleCI workflows. There are checked-in workflows (committed into the .circleci/config.yml) to build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwrite the config.yml to build the release)
# CircleCI structure of the binaries
Some quick vocab:
* A **workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/master/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments; environment variables declared in one script DO NOT persist to following steps.*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
## How are the workflows structured?
The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration
3. For each binary configuration, e.g. linux_conda_3.7_cpu, there is a
1. smoke_linux_conda_3.7_cpu
1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
2. Runs the smoke tests
## How are the jobs structured?
The jobs are in https://github.com/pytorch/pytorch/tree/master/.circleci/verbatim-sources . Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/master/.circleci/scripts .
* Linux jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
* binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh
* binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps.
* binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables
* binary_run_in_docker.sh - Takes a bash script file (the actual test code) from a hardcoded location, spins up a docker image, and runs the script inside the docker image
### **Why do the steps all refer to scripts?**
CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems.
### **What is binary_run_in_docker for?**
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor and spin up their own docker. Why this nonsense? It's because we run nvidia-docker for our GPU tests; any code that calls into the CUDA runtime needs to be run on nvidia-docker. To run a nvidia-docker you need to install some nvidia packages on the host machine and then call docker with the `--runtime nvidia` argument. CircleCI doesn't support this, so we have to do it ourselves.
* This is not just a mere inconvenience. **This blocks all of our linux tests from using more than 2 cores.** But there is nothing that we can do about it, but wait for a fix on circleci's side. Right now, we only run some smoke tests (some simple imports) on the binaries, but this also affects non-binary test jobs.
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
binary_run_in_docker.sh is a way to share the docker start-up code between the binary test jobs and the binary smoke test jobs
### **Why does binary_checkout also checkout pytorch? Why shouldn't it?**
We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later circleci changed that to use a single pytorch checkout and persist it through the workspace (they did this because our config file was too big, so they wanted to take a lot of the setup code into scripts, but the scripts needed the code repo to exist to be called, so they added a prereq step called 'setup' to checkout the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke from missing pytorch code no longer existing. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there's two checkouts of pytorch on the binary jobs. This problem still needs to be fixed, but it takes careful tracing of which code is being called where.
# Code structure of the binaries (circleci agnostic)
## Overview
The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder) , which is a repo that defines how all the binaries are built. The relevant code is
```
# All code needed to set-up environments for build code to run in,
# but only code that is specific to the current CI system
pytorch/pytorch
- .circleci/ # Folder that holds all circleci related stuff
- config.yml # GENERATED file that actually controls all circleci behavior
- verbatim-sources # Used to generate job/workflow sections in ^
- scripts/ # Code needed to prepare circleci environments for binary build scripts
- setup.py # Builds pytorch. This is wrapped in pytorch/builder
- cmake files # used in normal building of pytorch
# All code needed to prepare a binary build, given an environment
# with all the right variables/packages/paths.
pytorch/builder
# Given an installed binary and a proper python env, runs some checks
# to make sure the binary was built the proper way. Checks things like
# the library dependencies, symbols present, etc.
- check_binary.sh
# Given an installed binary, runs python tests to make sure everything
# is in order. These should be de-duped. Right now they both run smoke
# tests, but are called from different places. Usually just call some
# import statements, but also has overlap with check_binary.sh above
- run_tests.sh
- smoke_test.sh
# Folders that govern how packages are built. See paragraphs below
- conda/
- build_pytorch.sh # Entrypoint. Delegates to proper conda build folder
- switch_cuda_version.sh # Switches the active CUDA installation in Docker
- pytorch-nightly/ # Build-folder
- manywheel/
- build_cpu.sh # Entrypoint for cpu builds
- build.sh # Entrypoint for CUDA builds
- build_common.sh # Actual build script that ^^ call into
- wheel/
- build_wheel.sh # Entrypoint for wheel builds
```
Every type of package has an entrypoint build script that handles all the important logic.
## Conda
Both Linux and MacOS use the same code flow for the conda builds.
Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html
Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies in what python environment to build the package in, and what dependencies the resulting package should have, and the build script gets called in the env to build the thing.
tldr; on conda-build is
1. Creates a brand new conda environment, based off of deps in the meta.yaml
1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml
2. If the build fails this environment will stick around. You can activate it for much easier debugging. The “General Python” section below explains what exactly a python “environment” is.
2. Calls build.sh in the environment
3. Copies the finished package to a new conda env, also specified by the meta.yaml
4. Runs some simple import tests (if specified in the meta.yaml)
5. Saves the finished package as a tarball
The build.sh we use is essentially a wrapper around ```python setup.py build``` , but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths.
The entrypoint file `builder/conda/build_conda.sh` is complicated because
* It works for both Linux and MacOS
* The mac builds used to create their own environments, since they all used to be on the same machine. There’s now a lot of extra logic to handle conda envs. This extra machinery could be removed
* It used to handle testing too, which adds more logic messing with python environments too. This extra machinery could be removed.
## Manywheels (linux pip and libtorch packages)
Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant.
`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh`
The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because
* This used to handle building for several different python versions at the same time. The loops have been removed, but there are still unnecessary folders and movements here and there.
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
* There is a lot of messing with rpaths. This is necessary, but could be made much much simpler if the above issues were fixed.
## Wheels (MacOS pip and libtorch packages)
The entrypoint file `builder/wheel/build_wheel.sh` is complicated because
* The mac builds used to all run on one machine (we didn’t have autoscaling mac machines till circleci). So this script handled siloing itself by setting-up and tearing-down its build env and siloing itself into its own build directory.
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* Ditto the comment above. This should definitely be separated out.
Note that the MacOS Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## General notes
### Note on run_tests.sh, smoke_test.sh, and check_binary.sh
* These should all be consolidated
* These must run on all OS types: MacOS, Linux, and Windows
* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on master and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn’t mess anything up.
* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package.
### Note on libtorch
Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this
* It’s confusing. Most of those scripts deal with python specifics.
* The extra conditionals everywhere severely complicate the wheel build scripts
* The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script)
### Note on docker images / Dockerfiles
All linux builds occur in docker images. The docker images are
* soumith/conda-cuda
* Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-10.0 to enable different CUDA builds
* Also used for cpu builds
* soumith/manylinux-cuda90
* soumith/manylinux-cuda92
* soumith/manylinux-cuda100
* Also used for cpu builds
The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now.
### General Python
* This is still a good explanation of python installations https://caffe2.ai/docs/faq.html#why-do-i-get-import-errors-in-python-when-i-try-to-use-caffe2
# How to manually rebuild the binaries
tldr; make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to master and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/master/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
## How to test changes to the binaries via .circleci
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using ```.circleci/regenerate.sh``` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
# Update the PR, need to force since the commits are different now
git push origin my_branch --force
```
The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci without having to re-write which binary jobs you want to test on. The downside is that all updates will be force pushes.
## How to build a binary locally
### Linux
You can build Linux binaries locally easily using docker.
```
# Run the docker
# Use the correct docker image, soumith/conda-cuda used here as an example
#
# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the
# machine that you're running the command on) accessible to the docker
# container at path/to/bar. So if you then run `touch path/to/bar/baz`
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you're building a CUDA binary then use `nvidia-docker run` instead, see below.
#
# If you know how, add ccache as a volume too and speed up everything
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.6
export DESIRED_CUDA=cpu
# Call the entrypoint
# `|& tee foo.log` just copies all stdout and stderr output to foo.log
# The builds generate lots of output so you probably need this when
# building locally.
/builder/conda/build_pytorch.sh |& tee build_output.log
```
**Building CUDA binaries on docker**
To build a CUDA binary you need to use `nvidia-docker run` instead of just `docker run` (or you can manually pass `--runtime=nvidia`). This adds some needed libraries and things to build CUDA stuff.
You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it’s gonna take a loong time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.
### MacOS
There’s no easy way to generate reproducible hermetic MacOS environments. If you have a Mac laptop then you can try emulating the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will probably interfere with the build. If you’re trying to repro an error on a Mac build in .circleci and you can’t seem to repro locally, then my best advice is actually to iterate on .circleci :/
But if you want to try, then I’d recommend
```
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do
# Install a new miniconda
# First remove any other python or conda installation from your PATH
# Always install miniconda 3, even if building for Python <3
# All MacOS builds use conda to manage the python env and dependencies
# that are built with, even the pip packages
conda create -yn binary python=2.7
conda activate binary
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.6
export DESIRED_CUDA=cpu
# Call the entrypoint you want
path/to/builder/wheel/build_wheel.sh
```
N.B. installing a brand new miniconda is important. This has to do with how conda installations work. See the “General Python” section above, but tldr; is that
1. You make the ‘conda’ command accessible by prepending `path/to/conda_root/bin` to your PATH.
2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH`
3. Now say you (or some code that you ran) call python executable `foo`
1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected.
2. But if you forgot to install `foo` in `new_env` but happened to previously install it in your root conda env (called ‘base’), then unix/linux will still find `path/to/conda_root/bin/foo`. This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version!
Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe.
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.
author={Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam},
USE_MKLDNN"Use MKLDNN. Only available on x86 and x86_64."ON
"CPU_INTEL"OFF)
set(MKLDNN_ENABLE_CONCURRENT_EXEC${USE_MKLDNN})
cmake_dependent_option(
USE_MKLDNN_CBLAS"Use CBLAS in MKLDNN"OFF
"USE_MKLDNN"OFF)
option(USE_DISTRIBUTED"Use distributed"ON)
cmake_dependent_option(
USE_MPI"Use MPI for Caffe2. Only available if USE_DISTRIBUTED is on."ON
@ -131,22 +157,27 @@ cmake_dependent_option(
cmake_dependent_option(
USE_GLOO_IBVERBS"Use Gloo IB verbs for distributed. Only available if USE_GLOO is on."OFF
"USE_GLOO"OFF)
option(USE_TBB"Use TBB"OFF)
# Used when building Caffe2 through setup.py
option(BUILDING_WITH_TORCH_LIBS"Tell cmake if Caffe2 is being built alongside torch libs"OFF)
option(BUILDING_WITH_TORCH_LIBS"Tell cmake if Caffe2 is being built alongside torch libs"ON)
# /Z7 override option
# When generating debug symbols, CMake default to use the flag /Zi.
# However, it is not compatible with sccache. So we rewrite it off.
# But some users don't use sccache; this override is for them.
option(MSVC_Z7_OVERRIDE"Work around sccache bug by replacing /Zi and /ZI with /Z7 when using MSVC (if you are not using sccache, you can turn this OFF)"ON)
cmake_dependent_option(
MSVC_Z7_OVERRIDE"Work around sccache bug by replacing /Zi and /ZI with /Z7 when using MSVC (if you are not using sccache, you can turn this OFF)"ON
PyTorch provides Tensors that can live either on the CPU or the GPU, and accelerates the
computation by a huge amount.
@ -151,15 +151,15 @@ They requires JetPack 4.2 and above and are maintained by @dusty-nv
### From Source
If you are installing from source, we highly recommend installing an [Anaconda](https://www.anaconda.com/distribution/#download-section) environment.
You will get a high-quality BLAS library (MKL) and you get a controlled compiler version regardless of your Linux distro.
You will get a high-quality BLAS library (MKL) and you get controlled dependency versions regardless of your Linux distro.
Once you have [Anaconda](https://www.anaconda.com/distribution/#download-section) installed, here are the instructions.
If you want to compile with CUDA support, install
- [NVIDIA CUDA](https://developer.nvidia.com/cuda-downloads) 7.5 or above
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v6.x or above
- [NVIDIA CUDA](https://developer.nvidia.com/cuda-downloads) 9 or above
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v7 or above
If you want to disable CUDA support, export environment variable `NO_CUDA=1`.
If you want to disable CUDA support, export environment variable `USE_CUDA=0`.
Other potentially useful environment variables may be found in `setup.py`.
If you are building for NVIDIA's Jetson platforms (Jetson Nano, TX1, TX2, AGX Xavier), instructions [are available here](https://devtalk.nvidia.com/default/topic/1049071/jetson-nano/pytorch-for-jetson-nano/)
@ -169,19 +169,22 @@ If you are building for NVIDIA's Jetson platforms (Jetson Nano, TX1, TX2, AGX Xa
@ -206,31 +209,67 @@ If the version of Visual Studio 2017 is higher than 15.4.5, installing of "VC++
<br/> There is no guarantee of correct building with VC++ 2017 toolsets other than version 15.4 v14.11.
<br/> "VC++ 2017 version 15.4 v14.11 toolset" might be installed onto already installed Visual Studio 2017 by running its installation once again and checking the corresponding checkbox under "Individual components"/"Compilers, build tools, and runtimes".
For building against CUDA 8.0, Visual Studio 2015 Update 3 (version 14.0) and the [patch](https://download.microsoft.com/download/8/1/d/81dbe6bb-ed92-411a-bef5-3a75ff972c6a/vc14-kb4020481.exe) need to be installed too.
The details of the patch can be found [here](https://support.microsoft.com/en-gb/help/4020481/fix-link-exe-crashes-with-a-fatal-lnk1000-error-when-you-use-wholearch).
NVTX is part of the CUDA distribution, where it is called "Nsight Compute". For installing it onto an already installed CUDA, run the CUDA installation once again and check the corresponding checkbox.
Be sure that CUDA with Nsight Compute is installed after Visual Studio 2017.
Currently VS 2017, VS 2019 and Ninja are supported as CMake generators. If `ninja.exe` is detected in `PATH`, then Ninja will be used as the default generator, otherwise it will use VS 2017.
<br/> If Ninja is selected as the generator, the latest MSVC newer than VS 2015 (14.0) will get selected as the underlying toolchain if you have Python > 3.5, otherwise VS 2015 will be selected, so you'll have to activate the environment. If you use CMake <= 3.14.2 and have VS 2019 installed, then even if you specify VS 2017 as the generator, VS 2019 will get selected as the generator.
CUDA and MSVC have strong version dependencies, so even if you use VS 2017 / 2019, you will get build errors like `nvcc fatal : Host compiler targets unsupported OS`. For this kind of problem, please install the corresponding VS toolchain in the table below and then you can either specify the toolset during activation (recommended) or set `CUDAHOSTCXX` to override the cuda host compiler (not recommended if there are big version differences).
"unsupported operation: more than one element of the written-to tensor "
"refers to a single memory location. Please clone() the tensor before"
"performing the operation.");
}
}